XGBoost Distributor

Classes

class snowflake.ml.modeling.distributors.xgboost.xgboost_estimator.XGBEstimator(n_estimators: int | None = 100, objective: str | None = 'reg:squarederror', params: Dict | None = None, scaling_config: XGBScalingConfig | None = None, callbacks: List[TrainingCallback] | None = None)

Bases: BaseEstimator

XGBoost estimator that supports distributed training.

Parameters:

n_estimators (int) – Number of estimators. Default is 100.
objective (str) – The objective function used for training: ‘reg:squarederror’ (default) for regression, ‘binary:logistic’ for binary classification, or ‘multi:softmax’ for multi-class classification.
params (Optional[dict]) –
Additional parameters for the XGBoost Estimator. Some key parameters are:
- booster: Which booster to use: gbtree (default), gblinear, or dart.
- max_depth: Maximum depth of a tree. Default is 6.
- max_leaves: Maximum number of nodes to be added. Default is 0.
- max_bin: Maximum number of bins that continuous feature values are bucketed into. Default is 256.
- eval_metric: Evaluation metrics for validation data.
See https://xgboost.readthedocs.io/en/stable/parameter.html for the full list of supported parameters.
Note: If the params dict contains the keys ‘n_estimators’ or ‘objective’, they override the values provided by the n_estimators and objective arguments.
scaling_config (Optional[XGBScalingConfig]) – Scaling config for XGBoost Estimator. Defaults to None. If None, the estimator will use all available resources.
callbacks (Optional[List[xgboost.callback.TrainingCallback]]) – List of callbacks to be applied during training. For more information, refer to https://xgboost.readthedocs.io/en/stable/python/callbacks.html.
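
The precedence rule in the note on params can be illustrated with a minimal sketch. The helper name `resolve_training_params` is hypothetical and is not part of the library. It only models the documented behavior that keys in params win over the n_estimators and objective arguments. How the estimator merges them internally is an assumption.

```python
from typing import Any, Dict, Optional


def resolve_training_params(
    n_estimators: Optional[int] = 100,
    objective: Optional[str] = "reg:squarederror",
    params: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: merge the explicit arguments with the params dict.

    Keys present in params override the n_estimators and objective arguments.
    """
    resolved: Dict[str, Any] = {"n_estimators": n_estimators, "objective": objective}
    resolved.update(params or {})  # entries in params take precedence
    return resolved


merged = resolve_training_params(
    n_estimators=100,
    objective="reg:squarederror",
    params={"objective": "binary:logistic", "max_depth": 4},
)
print(merged)
# {'n_estimators': 100, 'objective': 'binary:logistic', 'max_depth': 4}
```

To avoid surprises, it may be simplest to set n_estimators and objective in one place only, either as arguments or inside params.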

There are two ways to train an XGBoost model:

Train with Snowflake DataConnector
The DataConnector should contain a single source with both input data and label data. A DataConnector can be created via DataConnector.from_dataframe (for a Snowpark DataFrame) or DataConnector.from_dataset.

Train with in-memory data
The input data should be provided as one of the in-memory data types: pd.DataFrame, pd.Series, csr_matrix, or np.ndarray. When training with in-memory data, y must be provided as well.

Parameters:

dataset (Union[DataConnector, pd.DataFrame, pd.Series, csr_matrix, np.ndarray]) – The dataset to train the model. If DataConnector is provided, input_cols and label_col are required. Otherwise, y is required.
y (Optional[Union[pd.DataFrame, pd.Series, csr_matrix, np.ndarray]]) – Required if dataset is one of the in-memory data types. If dataset is a DataConnector, y should not be provided.
input_cols (Optional[List[str]]) – The input columns to train the model. Required when dataset is DataConnector.
label_col (Optional[str]) – The label column to train the model. Required when DataConnector is provided.
eval_set (Optional[DataConnector]) – The evaluation dataset.
verbose_eval (Optional[Union[bool, int]]) – Whether to print the evaluation result. For details, refer to the xgboost.train documentation.
xgb_model (Optional[xgb.Booster]) – The initial model to start with.
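
The argument rules above (which of y, input_cols, and label_col are required for each kind of dataset) can be summarized in a short sketch. `FakeDataConnector` and `check_fit_arguments` are hypothetical names used only for illustration, and raising ValueError is an assumption. This section does not say how the estimator reports invalid combinations.

```python
from typing import Any, List, Optional


class FakeDataConnector:
    """Hypothetical stand-in for a Snowflake DataConnector, for illustration only."""


def check_fit_arguments(
    dataset: Any,
    y: Optional[Any] = None,
    input_cols: Optional[List[str]] = None,
    label_col: Optional[str] = None,
) -> str:
    """Return the training mode the arguments select, or raise ValueError."""
    if isinstance(dataset, FakeDataConnector):
        # DataConnector path: labels live in the same source as the features.
        if y is not None:
            raise ValueError("y should not be provided with a DataConnector")
        if not input_cols or label_col is None:
            raise ValueError("input_cols and label_col are required with a DataConnector")
        return "data_connector"
    # In-memory path: labels must be passed separately.
    if y is None:
        raise ValueError("y is required for in-memory data")
    return "in_memory"


print(check_fit_arguments(FakeDataConnector(), input_cols=["A", "B"], label_col="LABEL"))
print(check_fit_arguments([[1.0, 2.0]], y=[0]))
```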

Returns:

The trained model.

Return type:

xgb.Booster
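
The two training modes might be used roughly as follows. This is a sketch under stated assumptions, not a verified recipe: the method name `fit` and the import path `snowflake.ml.data.data_connector` are not shown in this section and are assumed here, and the functions only run in an environment where snowflake-ml-python and its distributed runtime are available. The imports are placed inside the functions so that defining them has no side effects.

```python
def train_with_data_connector(snowpark_df, input_cols, label_col):
    """Sketch: train from a Snowpark DataFrame holding both features and label."""
    # Assumed import path for DataConnector; adjust to your installed version.
    from snowflake.ml.data.data_connector import DataConnector
    from snowflake.ml.modeling.distributors.xgboost.xgboost_estimator import (
        XGBEstimator,
        XGBScalingConfig,
    )

    connector = DataConnector.from_dataframe(snowpark_df)
    estimator = XGBEstimator(
        n_estimators=50,
        objective="reg:squarederror",
        params={"max_depth": 4},
        scaling_config=XGBScalingConfig(num_workers=-1, num_cpu_per_worker=-1),
    )
    # `fit` is the assumed method name; y is omitted for DataConnector input.
    return estimator.fit(connector, input_cols=input_cols, label_col=label_col)


def train_in_memory(X, y):
    """Sketch: train from in-memory data such as np.ndarray or pd.DataFrame."""
    from snowflake.ml.modeling.distributors.xgboost.xgboost_estimator import (
        XGBEstimator,
    )

    estimator = XGBEstimator(n_estimators=50)
    # y is required for in-memory data; input_cols and label_col are not used.
    return estimator.fit(X, y=y)
```

Both functions return whatever the training call returns, which the section above documents as an xgb.Booster.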

get_booster() → Booster

Get the trained booster.

Returns: The trained booster.
Return type: xgb.Booster
Raises: ValueError – If the model is not trained yet.

get_params(deep=True) → Dict[str, Any]

Get parameters for this estimator.

This is needed by sklearn when a CR estimator is used in a pipeline, where it is called to read the estimator’s parameters.

Parameters: deep (bool) – If True, return the parameters for this estimator and for contained sub-objects that are estimators. Because this class has no sub-objects, the parameter is not used.

Predict using the trained model. This is an in-memory implementation of predict.

Parameters:

X (Union[DataConnector, pd.DataFrame, pd.Series, csr_matrix, np.ndarray]) – The input data to predict. If a DataConnector is provided, input_cols are required. If in-memory data is provided (pd.DataFrame, pd.Series, csr_matrix, np.ndarray), y should be provided as well.
output_margin (bool) – Whether to output the raw untransformed margin value.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (Optional[Any]) – Global bias for each instance. See the official XGBoost documentation on the intercept for details.
iteration_range (Optional[Tuple[int, int]]) – Specifies which layers of trees are used in prediction.
output_col_name (Optional[str]) – Only required for DataConnector input. Defaults to “PREDICTIONS”. This is necessary because DataConnector row order is not deterministic, so predictions must be co-located with their input features.

Returns:

The prediction result. This is a numpy array if an in-memory dataset was passed. If a DataConnector was passed, the full dataframe is returned with the predictions in the column named by output_col_name (“PREDICTIONS” by default).

Return type:

Union[np.ndarray, pd.DataFrame]

Raises:

ValueError – If the model is not trained yet.
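
The reason predictions for DataConnector input are returned alongside the input features can be shown with a small pure-Python sketch. The helper `attach_predictions` and the toy `predict_row` function are hypothetical. They only model the idea that, when row order is not deterministic, each prediction should be stored next to the row that produced it instead of in a separate array.

```python
import random
from typing import Callable, Dict, List

Row = Dict[str, float]


def attach_predictions(
    rows: List[Row],
    input_cols: List[str],
    predict_row: Callable[[List[float]], float],
    output_col_name: str = "PREDICTIONS",
) -> List[Row]:
    """Return copies of the rows with each prediction stored next to its features."""
    return [
        {**row, output_col_name: predict_row([row[c] for c in input_cols])}
        for row in rows
    ]


rows = [{"A": 1.0, "B": 2.0}, {"A": 3.0, "B": 4.0}, {"A": 5.0, "B": 6.0}]
random.shuffle(rows)  # simulate a non-deterministic row order
scored = attach_predictions(rows, ["A", "B"], predict_row=sum)  # toy model: A + B
for row in scored:
    print(row)
```

Whatever order the rows arrive in, every output row still pairs its features with the matching prediction.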

class snowflake.ml.modeling.distributors.xgboost.xgboost_estimator.XGBScalingConfig(num_workers: int = -1, num_cpu_per_worker: int = -1, use_gpu: bool | None = None)

Bases: BaseScalingConfig

Scaling config for the XGBoost Estimator.

num_workers

Number of workers to use for distributed training. Default is -1, meaning the estimator will use all available workers.

Type: int

num_cpu_per_worker

Number of CPU cores to use per worker. Default is -1, meaning the estimator will use all available CPU cores.

Type: int

use_gpu

Whether to use GPU for training. If None, the estimator will choose to use GPU or not based on the environment.

Type: Optional[bool]

num_cpu_per_worker: int = -1

num_workers: int = -1

use_gpu: bool | None = None
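
The meaning of the sentinel defaults (-1 for "use everything available", None for "decide from the environment") can be sketched as follows. `ScalingConfigSketch` and `resolve_scaling` are hypothetical names that mirror the documented fields. How the library actually discovers workers, CPU cores, and GPUs is not described here, so those values are passed in explicitly as assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScalingConfigSketch:
    """Hypothetical mirror of the XGBScalingConfig fields and their defaults."""

    num_workers: int = -1
    num_cpu_per_worker: int = -1
    use_gpu: Optional[bool] = None


def resolve_scaling(
    config: ScalingConfigSketch,
    available_workers: int,
    cpus_per_worker: int,
    gpu_available: bool,
) -> ScalingConfigSketch:
    """Replace the sentinel values with concrete settings for a given environment."""
    return ScalingConfigSketch(
        # -1 means "use all available workers".
        num_workers=available_workers if config.num_workers == -1 else config.num_workers,
        # -1 means "use all available CPU cores".
        num_cpu_per_worker=(
            cpus_per_worker if config.num_cpu_per_worker == -1 else config.num_cpu_per_worker
        ),
        # None means "choose based on the environment".
        use_gpu=gpu_available if config.use_gpu is None else config.use_gpu,
    )


defaults = resolve_scaling(
    ScalingConfigSketch(), available_workers=4, cpus_per_worker=8, gpu_available=False
)
print(defaults)
# ScalingConfigSketch(num_workers=4, num_cpu_per_worker=8, use_gpu=False)
```

Explicit values are passed through unchanged, so a config such as `ScalingConfigSketch(num_workers=2, use_gpu=True)` would keep two workers and GPU training regardless of what the environment offers.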