You are viewing documentation about an older version (1.7.0).

snowflake.ml.modeling.xgboost.XGBClassifier

class snowflake.ml.modeling.xgboost.XGBClassifier(*, objective='binary:logistic', use_label_encoder=None, input_cols: Optional[Union[str, Iterable[str]]] = None, output_cols: Optional[Union[str, Iterable[str]]] = None, label_cols: Optional[Union[str, Iterable[str]]] = None, passthrough_cols: Optional[Union[str, Iterable[str]]] = None, drop_input_cols: Optional[bool] = False, sample_weight_col: Optional[str] = None, use_external_memory_version: bool = False, batch_size: int = 10000, **kwargs)

Bases: BaseTransformer

Implementation of the scikit-learn API for XGBoost classification. For more details on this class, see xgboost.XGBClassifier.

Parameters:
  • input_cols (Optional[Union[str, List[str]]]) – A string or list of strings representing column names that contain features. If this parameter is not specified, all columns in the input DataFrame except the columns specified by label_cols, sample_weight_col, and passthrough_cols parameters are considered input columns. Input columns can also be set after initialization with the set_input_cols method.

  • label_cols (Optional[Union[str, List[str]]]) – A string or list of strings representing column names that contain labels. Label columns must be specified with this parameter during initialization or with the set_label_cols method before fitting.

  • output_cols (Optional[Union[str, List[str]]]) – A string or list of strings representing column names that will store the output of predict and transform operations. The length of output_cols must match the expected number of output columns from the specific predictor or transformer class used. If you omit this parameter, output column names are derived by adding an OUTPUT_ prefix to the label column names for supervised estimators, or OUTPUT_<IDX> for unsupervised estimators. These inferred output column names work for predictors, but output_cols must be set explicitly for transformers. In general, explicitly specifying output column names is clearer, especially if you don’t specify the input column names. To transform in place, pass the same names for input_cols and output_cols. Output columns can also be set after initialization with the set_output_cols method.

  • sample_weight_col (Optional[str]) – A string representing the column name containing the sample weights. This argument is only required when working with weighted datasets. Sample weight column can also be set after initialization with the set_sample_weight_col method.

  • passthrough_cols (Optional[Union[str, List[str]]]) – A string or a list of strings indicating column names to be excluded from any operations (such as train, transform, or inference). These columns remain untouched throughout the process. This option is helpful when input_cols is inferred automatically but specific columns, such as index columns, must be kept out of training or inference. Passthrough columns can also be set after initialization with the set_passthrough_cols method.

  • drop_input_cols (Optional[bool], default=False) – If set, the output of the predict() and transform() methods will not contain the input columns.

  • use_external_memory_version (bool, default=False) – If set, the external memory version of the XGBoost trainer is used. External memory training is done in a two-step process. First, in the preprocessing step, input data is read and parsed into an internal format, which can be CSR, CSC, or sorted CSC, and stored in in-memory buffers. The in-memory buffers are continuously flushed out to disk when a predefined memory limit is reached. Second, in the tree construction step, the data pages are streamed from disk via a multi-threaded pre-fetcher as needed for tree construction. Note: the tree_method values 'approx' and 'hist' are supported in the external memory version. Note: grow_policy='depthwise' is used for optimal performance in the external memory version.

  • batch_size (int, default=10000) – Number of rows in each batch of input data while using external memory training. It is not recommended to set small batch sizes, like 32 samples per batch, as this can seriously hurt performance in gradient boosting. Set the batch_size as large as possible based on the available memory.

  • n_estimators (int) –

    Number of boosting rounds.

    max_depth: Optional[int]

    Maximum tree depth for base learners.

    max_leaves :

    Maximum number of leaves; 0 indicates no limit.

    max_bin :

    If using a histogram-based algorithm, the maximum number of bins per feature.

    grow_policy :

    Tree growing policy. 0: favor splitting at nodes closest to the root, i.e. grow depth-wise. 1: favor splitting at nodes with highest loss change.

    learning_rate: Optional[float]

    Boosting learning rate (xgb’s “eta”).

    verbosity: Optional[int]

    The degree of verbosity. Valid values are 0 (silent) - 3 (debug).

    objective: typing.Union[str, typing.Callable[[numpy.ndarray, numpy.ndarray], typing.Tuple[numpy.ndarray, numpy.ndarray]], NoneType]

    Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).

    booster: Optional[str]

    Specify which booster to use: gbtree, gblinear or dart.

    tree_method: Optional[str]

    Specify which tree method to use. Defaults to auto. If this parameter is left at its default, XGBoost will choose the most conservative option available. It is recommended to study this option in the tree method section of the parameters document.

    n_jobs: Optional[int]

    Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.

    gamma: Optional[float]

    (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.

    min_child_weight: Optional[float]

    Minimum sum of instance weight(hessian) needed in a child.

    max_delta_step: Optional[float]

    Maximum delta step we allow each tree’s weight estimation to be.

    subsample: Optional[float]

    Subsample ratio of the training instance.

    sampling_method :

    Sampling method. Used only by the gpu_hist tree method.

    • uniform: select random training instances uniformly.

    • gradient_based: select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)

    colsample_bytree: Optional[float]

    Subsample ratio of columns when constructing each tree.

    colsample_bylevel: Optional[float]

    Subsample ratio of columns for each level.

    colsample_bynode: Optional[float]

    Subsample ratio of columns for each split.

    reg_alpha: Optional[float]

    L1 regularization term on weights (xgb’s alpha).

    reg_lambda: Optional[float]

    L2 regularization term on weights (xgb’s lambda).

    scale_pos_weight: Optional[float]

    Balancing of positive and negative weights.

    base_score: Optional[float]

    The initial prediction score of all instances, global bias.

    random_state: Optional[Union[numpy.random.RandomState, int]]

    Random number seed.

    Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.

    missing: float, default np.nan

    Value in the data that is to be treated as a missing value.

    num_parallel_tree: Optional[int]

    Used for boosting random forest.

    monotone_constraints: Optional[Union[Dict[str, int], str]]

    Constraint of variable monotonicity. See tutorial for more information.

    interaction_constraints: Optional[Union[str, List[Tuple[str]]]]

    Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information.

    importance_type: Optional[str]

    The feature importance type for the feature_importances_ property:

    • For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.

    • For linear model, only “weight” is defined and it’s the normalized coefficients without bias.

    gpu_id: Optional[int]

    Device ordinal.

    validate_parameters: Optional[bool]

    Give warnings for unknown parameter.

    predictor: Optional[str]

    Force XGBoost to use specific predictor, available choices are [cpu_predictor, gpu_predictor].

    enable_categorical: bool

    Experimental support for categorical data. When enabled, cudf/pandas.DataFrame should be used to specify categorical data type. Also, JSON/UBJSON serialization format is required.

    feature_types: FeatureTypes

    Used for specifying feature types without constructing a dataframe. See DMatrix for details.

    max_cat_to_onehot: Optional[int]

    A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When the number of categories is less than the threshold, one-hot encoding is chosen; otherwise the categories are partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and cat-param for details.

    max_cat_threshold: Optional[int]

    Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and cat-param for details.

    eval_metric: Optional[Union[str, List[str], Callable]]

    Metric used for monitoring the training result and early stopping. It can be a string or list of strings naming predefined metrics in XGBoost (see doc/parameter.rst), one of the metrics in sklearn.metrics, or any other user-defined metric that follows the sklearn.metrics calling convention.

    If custom objective is also provided, then custom metric should implement the corresponding reverse link function.

    Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.

    For advanced usage on Early stopping like directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.

    See Custom Objective and Evaluation Metric for more.

    This parameter replaces eval_metric in fit() method. The old one receives un-transformed prediction regardless of whether custom objective is being used.

    import xgboost as xgb
    from sklearn.datasets import load_diabetes
    from sklearn.metrics import mean_absolute_error

    # Use a scikit-learn metric as the evaluation metric.
    X, y = load_diabetes(return_X_y=True)
    reg = xgb.XGBRegressor(
        tree_method="hist",
        eval_metric=mean_absolute_error,
    )
    reg.fit(X, y, eval_set=[(X, y)])

    early_stopping_rounds: Optional[int]

    Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().

    The method returns the model from the last iteration (not the best one). If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.

    If early stopping occurs, the model will have three additional fields: best_score, best_iteration and best_ntree_limit.

    This parameter replaces early_stopping_rounds in fit() method.
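The stopping rule described for early_stopping_rounds can be sketched in plain Python. This is an illustration only, not XGBoost's implementation: the helper name `early_stopping_point` is hypothetical, and the sketch assumes a metric where lower values are better.

```python
from typing import List, Tuple


def early_stopping_point(scores: List[float], early_stopping_rounds: int) -> Tuple[int, int]:
    """Return (last_iteration_run, best_iteration) for a minimized metric.

    Hypothetical helper: training stops once the metric has failed to
    improve for `early_stopping_rounds` consecutive rounds.
    """
    best_iteration = 0
    best_score = scores[0]
    for i, score in enumerate(scores):
        if score < best_score:
            best_score, best_iteration = score, i
        elif i - best_iteration >= early_stopping_rounds:
            return i, best_iteration  # stopped early at round i
    return len(scores) - 1, best_iteration  # ran every round


# Validation loss improves until round 2, then stalls for three rounds.
last, best = early_stopping_point(
    [0.9, 0.7, 0.6, 0.65, 0.64, 0.66, 0.5], early_stopping_rounds=3
)
print(last, best)  # 5 2
```

In this run the last iteration (5) differs from the best one (2), which is why the distinction between the returned model and best_iteration matters.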

    callbacks: Optional[List[TrainingCallback]]

    List of callback functions that are applied at the end of each iteration. It is possible to use predefined callbacks by using the Callback API.

    States in callbacks are not preserved during training, which means callback objects cannot be reused for multiple training sessions without reinitialization or deepcopy.

    import xgboost as xgb

    # parameters_grid, custom_rates, and Xy are assumed to be defined by the caller.
    for params in parameters_grid:
        # be sure to (re)initialize the callbacks before each run
        callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
        xgb.train(params, Xy, callbacks=callbacks)

    kwargs: dict, optional

    Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.

    **kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.

    A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:

    y_true: array_like of shape [n_samples]

    The target values

    y_pred: array_like of shape [n_samples]

    The predicted values

    grad: array_like of shape [n_samples]

    The value of the gradient for each sample point.

    hess: array_like of shape [n_samples]

    The value of the second derivative for each sample point
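As an illustration of the objective(y_true, y_pred) -> grad, hess signature, the sketch below implements binary log loss with NumPy. The function name `logistic_objective` is hypothetical, and the sketch assumes that y_pred holds raw margin scores (before the sigmoid) and that y_true holds 0/1 labels.

```python
from typing import Tuple

import numpy as np


def logistic_objective(y_true: np.ndarray, y_pred: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Binary log loss as a custom objective.

    Assumption: y_pred contains raw margins, y_true contains 0/1 labels.
    """
    prob = 1.0 / (1.0 + np.exp(-y_pred))  # sigmoid of the margin
    grad = prob - y_true                  # first derivative of log loss w.r.t. the margin
    hess = prob * (1.0 - prob)            # second derivative of log loss w.r.t. the margin
    return grad, hess


grad, hess = logistic_objective(np.array([1.0, 0.0]), np.array([0.0, 0.0]))
print(grad)  # [-0.5  0.5]
print(hess)  # [0.25 0.25]
```

A function of this shape could then be supplied as the objective parameter in place of a string such as 'binary:logistic'.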

Base class for all transformers.

Methods

fit(dataset: Union[snowflake.snowpark.DataFrame, pandas.DataFrame]) -> BaseEstimator

Runs the logic common to all fit implementations.

fit_transform(dataset: Union[snowflake.snowpark.DataFrame, pandas.DataFrame], output_cols_prefix: str = 'fit_transform_') -> Union[snowflake.snowpark.DataFrame, pandas.DataFrame]

Method not supported for this class.

Raises:

TypeError – Supported dataset types: snowpark.DataFrame, pandas.DataFrame.

Parameters:
  • dataset – Union[snowflake.snowpark.DataFrame, pandas.DataFrame] Snowpark or Pandas DataFrame.

  • output_cols_prefix – Prefix for the response columns

Returns:

Transformed dataset.

get_input_cols() -> List[str]

Input columns getter.

Returns:

Input columns.
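When input_cols is not specified, the input columns are inferred as described for the input_cols parameter: every column except those named by label_cols, sample_weight_col, and passthrough_cols. A minimal sketch of that rule follows; the helper `infer_input_cols` and the column names are hypothetical and only illustrate the rule, not the library's internal code.

```python
from typing import Iterable, List, Optional


def infer_input_cols(
    all_cols: Iterable[str],
    label_cols: Iterable[str] = (),
    sample_weight_col: Optional[str] = None,
    passthrough_cols: Iterable[str] = (),
) -> List[str]:
    """Hypothetical helper: keep every column that is not a label,
    sample weight, or passthrough column, preserving the original order."""
    excluded = set(label_cols) | set(passthrough_cols)
    if sample_weight_col is not None:
        excluded.add(sample_weight_col)
    return [c for c in all_cols if c not in excluded]


cols = ["ID", "AGE", "INCOME", "WEIGHT", "LABEL"]
inferred = infer_input_cols(
    cols, label_cols=["LABEL"], sample_weight_col="WEIGHT", passthrough_cols=["ID"]
)
print(inferred)  # ['AGE', 'INCOME']
```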

get_label_cols() -> List[str]

Label column getter.

Returns:

Label column(s).

get_output_cols() -> List[str]

Output columns getter.

Returns:

Output columns.
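If output_cols was omitted for a supervised estimator, the names are derived from the label columns with an OUTPUT_ prefix, as described for the output_cols parameter. The sketch below illustrates that naming rule only; `default_output_cols` is a hypothetical helper, not part of the library.

```python
from typing import List


def default_output_cols(label_cols: List[str]) -> List[str]:
    """Hypothetical helper: prefix each label column name with OUTPUT_."""
    return [f"OUTPUT_{name}" for name in label_cols]


print(default_output_cols(["LABEL"]))  # ['OUTPUT_LABEL']
```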

get_params(deep: bool = True) -> Dict[str, Any]

Get the snowflake-ml parameters for this transformer.

Parameters:

deep – If True, will return the parameters for this transformer and contained subobjects that are transformers.

Returns:

Parameter names mapped to their values.

get_passthrough_cols() -> List[str]

Passthrough columns getter.

Returns:

Passthrough column(s).

get_sample_weight_col() -> Optional[str]

Sample weight column getter.

Returns:

Sample weight column.

get_sklearn_args(default_sklearn_obj: Optional[object] = None, sklearn_initial_keywords: Optional[Union[str, Iterable[str]]] = None, sklearn_unused_keywords: Optional[Union[str, Iterable[str]]] = None, snowml_only_keywords: Optional[Union[str, Iterable[str]]] = None, sklearn_added_keyword_to_version_dict: Optional[Dict[str, str]] = None, sklearn_added_kwarg_value_to_version_dict: Optional[Dict[str, Dict[str, str]]] = None, sklearn_deprecated_keyword_to_version_dict: Optional[Dict[str, str]] = None, sklearn_removed_keyword_to_version_dict: Optional[Dict[str, str]] = None) -> Dict[str, Any]

Get sklearn keyword arguments.

This method enables modifying object parameters for special cases.

Parameters:
  • default_sklearn_obj – Sklearn object used to get default parameter values. Necessary when sklearn_added_keyword_to_version_dict is provided.

  • sklearn_initial_keywords – Initial keywords in sklearn.

  • sklearn_unused_keywords – Sklearn keywords that are unused in snowml.

  • snowml_only_keywords – snowml only keywords not present in sklearn.

  • sklearn_added_keyword_to_version_dict – Added keywords mapped to the sklearn versions in which they were added.

  • sklearn_added_kwarg_value_to_version_dict – Added keyword argument values mapped to the sklearn versions in which they were added.

  • sklearn_deprecated_keyword_to_version_dict – Deprecated keywords mapped to the sklearn versions in which they were deprecated.

  • sklearn_removed_keyword_to_version_dict – Removed keywords mapped to the sklearn versions in which they were removed.

Returns:

Sklearn parameter names mapped to their values.

predict(dataset: Union[snowflake.snowpark.DataFrame, pandas.DataFrame]) -> Union[snowflake.snowpark.DataFrame, pandas.DataFrame]

Predict with X. For more details on this function, see xgboost.XGBClassifier.predict.

Raises:

TypeError – Supported dataset types: snowpark.DataFrame, pandas.DataFrame.

Parameters:

dataset – Union[snowflake.snowpark.DataFrame, pandas.DataFrame] Snowpark or Pandas DataFrame.

Returns:

Transformed dataset.

predict_proba(dataset: Union[snowflake.snowpark.DataFrame, pandas.DataFrame], output_cols_prefix: str = 'predict_proba_') -> Union[snowflake.snowpark.DataFrame, pandas.DataFrame]

Predict the probability of each X example being of a given class. For more details on this function, see xgboost.XGBClassifier.predict_proba.

Raises:

TypeError – Supported dataset types: snowpark.DataFrame, pandas.DataFrame.

Parameters:
  • dataset – Union[snowflake.snowpark.DataFrame, pandas.DataFrame] Snowpark or Pandas DataFrame.

  • output_cols_prefix – Prefix for the response columns

Returns:

Output dataset with probability of the sample for each class in the model.

score(dataset: Union[snowflake.snowpark.DataFrame, pandas.DataFrame]) -> float

Return the mean accuracy on the given test data and labels. For more details on this function, see xgboost.XGBClassifier.score.

Raises:

TypeError – Supported dataset types: snowpark.DataFrame, pandas.DataFrame.

Parameters:

dataset – Union[snowflake.snowpark.DataFrame, pandas.DataFrame] Snowpark or Pandas DataFrame.

Returns:

Score.
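The score is the mean accuracy, that is, the fraction of rows whose predicted label equals the true label. A minimal sketch of that computation on plain Python sequences follows; `mean_accuracy` is a hypothetical helper for illustration and is not how the library computes the value internally.

```python
from typing import Sequence


def mean_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


print(mean_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```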

score_samples(dataset: Union[snowflake.snowpark.DataFrame, pandas.DataFrame], output_cols_prefix: str = 'score_samples_') -> Union[snowflake.snowpark.DataFrame, pandas.DataFrame]

Method not supported for this class.

Raises:

TypeError – Supported dataset types: snowpark.DataFrame, pandas.DataFrame.

Parameters:
  • dataset – Union[snowflake.snowpark.DataFrame, pandas.DataFrame] Snowpark or Pandas DataFrame.

  • output_cols_prefix – Prefix for the response columns

Returns:

Output dataset with probability of the sample for each class in the model.

set_drop_input_cols(drop_input_cols: Optional[bool] = False) -> None

set_input_cols(input_cols: Optional[Union[str, Iterable[str]]]) -> XGBClassifier

Input columns setter.

Parameters:

input_cols – A single input column or multiple input columns.

Returns:

self

set_label_cols(label_cols: Optional[Union[str, Iterable[str]]]) -> Base

Label column setter.

Parameters:

label_cols – A single label column, or multiple label columns for multi-task learning.

Returns:

self

set_output_cols(output_cols: Optional[Union[str, Iterable[str]]]) -> Base

Output columns setter.

Parameters:

output_cols – A single output column or multiple output columns.

Returns:

self

set_params(**params: Any) -> None

Set the parameters of this transformer.

The method works on simple transformers as well as on sklearn compatible pipelines with nested objects, once the transformer has been fit. Nested objects have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params – Transformer parameter names mapped to their values.

Raises:

SnowflakeMLException – Invalid parameter keys.
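The <component>__<parameter> naming convention can be illustrated with a short sketch that separates top-level parameters from nested ones. The helper `split_nested_params` and the component name `clf` are hypothetical; the sketch shows the naming convention only and does not reproduce the library's set_params logic.

```python
from collections import defaultdict
from typing import Any, Dict, Tuple


def split_nested_params(
    params: Dict[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Dict[str, Any]]]:
    """Hypothetical helper: split keys on the first double underscore."""
    own: Dict[str, Any] = {}
    nested: Dict[str, Dict[str, Any]] = defaultdict(dict)
    for key, value in params.items():
        component, delimiter, sub_key = key.partition("__")
        if delimiter:
            nested[component][sub_key] = value  # e.g. "clf__max_depth"
        else:
            own[key] = value  # plain parameter of the object itself
    return own, dict(nested)


own, nested = split_nested_params(
    {"n_estimators": 100, "clf__max_depth": 4, "clf__learning_rate": 0.1}
)
print(own)     # {'n_estimators': 100}
print(nested)  # {'clf': {'max_depth': 4, 'learning_rate': 0.1}}
```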

set_passthrough_cols(passthrough_cols: Optional[Union[str, Iterable[str]]]) -> Base

Passthrough columns setter.

Parameters:

passthrough_cols – Column(s) that should not be used or modified by the estimator/transformer. The estimator/transformer passes these columns through without any modification.

Returns:

self

set_sample_weight_col(sample_weight_col: Optional[str]) -> Base

Sample weight column setter.

Parameters:

sample_weight_col – A single column that represents sample weight.

Returns:

self

to_xgboost() -> Any

Get xgboost.XGBClassifier object.

Attributes

model_signatures

Returns model signature of current class.

Raises:

SnowflakeMLException – If the estimator is not fitted, the model signature cannot be inferred.

Returns:

Dict with each method and its input and output signature.