XGBoost Distributor

Classes

class snowflake.ml.modeling.distributors.xgboost.xgboost_estimator.XGBEstimator(n_estimators: int | None = 100, objective: str | None = 'reg:squarederror', params: Dict | None = None, scaling_config: XGBScalingConfig | None = None)

Bases: BaseEstimator

XGBoost estimator that supports distributed training.

Parameters:
  • n_estimators (int) – Number of estimators. Default is 100.

  • objective (str) – The objective function used for training: ‘reg:squarederror’ (default) for regression, ‘binary:logistic’ for binary classification, or ‘multi:softmax’ for multi-class classification.

  • params (Optional[dict]) –

    Additional parameters for the XGBoost Estimator. Some key parameters are:

    • booster: Which booster to use: gbtree (default), gblinear, or dart.

    • max_depth: Maximum depth of a tree. Default is 6.

    • max_leaves: Maximum number of nodes to be added. Default is 0.

    • max_bin: Maximum number of bins that continuous feature values will be bucketed into. Default is 256.

    • eval_metric: Evaluation metrics for validation data.

    See https://xgboost.readthedocs.io/en/stable/parameter.html for the full list of supported parameters.

    Note: If the params dict contains the keys ‘n_estimators’ or ‘objective’, they override the values provided by the n_estimators and objective arguments.

  • scaling_config (Optional[XGBScalingConfig]) – Scaling config for XGBoost Estimator. Defaults to None. If None, the estimator will use all available resources.
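The precedence described in the note above can be sketched in plain Python. The helper effective_params is hypothetical and is not part of the library; it only mirrors the documented rule that keys in params win over the n_estimators and objective arguments, and it may differ from how the estimator resolves them internally.

```python
from typing import Any, Dict, Optional


def effective_params(
    n_estimators: int = 100,
    objective: str = "reg:squarederror",
    params: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: mirror the documented precedence of `params`."""
    merged: Dict[str, Any] = {"n_estimators": n_estimators, "objective": objective}
    # Keys in `params` are applied last, so they override the two arguments.
    merged.update(params or {})
    return merged


resolved = effective_params(
    n_estimators=100,
    objective="reg:squarederror",
    params={"max_depth": 4, "objective": "binary:logistic"},
)
print(resolved)
```

Here the objective passed through params replaces the objective argument, while n_estimators keeps the argument value because params does not mention it.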

fit(dataset: DataConnector | DataFrame | Series | csr_matrix | ndarray, y: DataFrame | Series | csr_matrix | ndarray | None = None, input_cols: List[str] | None = None, label_col: str | None = None, eval_set: DataConnector | None = None, verbose_eval: bool | int | None = None, xgb_model: Booster | None = None) Booster

There are two ways to train an XGBoost model:

  1. Train with Snowflake DataConnector

    The DataConnector should contain a single source with both input data and label data. A DataConnector can be created via DataConnector.from_dataframe (for a Snowpark DataFrame) or DataConnector.from_dataset.

  2. Train with in-memory data

    The input data should be provided as one of the in-memory data types: pd.DataFrame, pd.Series, csr_matrix, or np.ndarray. When training with in-memory data, y must be provided as well.
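The two training paths might look like the following sketch. The function names, the hyperparameter values, and the import path for DataConnector (snowflake.ml.data.data_connector) are assumptions made for illustration; the estimator calls follow the signatures documented in this section.

```python
def train_in_memory(X, y):
    """Train on in-memory data (pd.DataFrame, pd.Series, csr_matrix, or np.ndarray)."""
    # Imported lazily so this sketch can be loaded without the library installed.
    from snowflake.ml.modeling.distributors.xgboost.xgboost_estimator import XGBEstimator

    estimator = XGBEstimator(n_estimators=50, objective="binary:logistic")
    # With in-memory data, the labels are passed separately as `y`.
    return estimator.fit(X, y)


def train_from_snowpark(snowpark_df, input_cols, label_col):
    """Train from a Snowpark DataFrame wrapped in a DataConnector."""
    # Assumption: DataConnector is importable from this module path.
    from snowflake.ml.data.data_connector import DataConnector
    from snowflake.ml.modeling.distributors.xgboost.xgboost_estimator import XGBEstimator

    connector = DataConnector.from_dataframe(snowpark_df)
    estimator = XGBEstimator(n_estimators=50, objective="binary:logistic")
    # With a DataConnector, features and label are selected by column name.
    return estimator.fit(connector, input_cols=input_cols, label_col=label_col)
```

Both functions return whatever fit returns, which this section documents as an xgb.Booster.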

Parameters:
  • dataset (Union[DataConnector, pd.DataFrame, pd.Series, csr_matrix, np.ndarray]) – The dataset to train the model on. If a DataConnector is provided, input_cols and label_col are required. Otherwise, y is required.

  • y (Optional[Union[pd.DataFrame, pd.Series, csr_matrix, np.ndarray]]) – The label data. Required if dataset is one of the in-memory data types. If dataset is a DataConnector, y should not be provided.

  • input_cols (Optional[List[str]]) – The input columns to train the model. Required when dataset is DataConnector.

  • label_col (Optional[str]) – The label column to train the model. Required when DataConnector is provided.

  • eval_set (Optional[DataConnector]) – The evaluation dataset.

  • verbose_eval (Optional[Union[bool, int]]) – Whether to print the evaluation result. For detailed documentation, please refer to xgboost.train documentation.

  • xgb_model (Optional[xgb.Booster]) – The initial model to start with.

Returns:

The trained model.

Return type:

xgb.Booster
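The argument rules for fit can be summarized as a small pre-flight check. This is a hypothetical helper written only to restate the rules above; the library may validate differently or raise different errors.

```python
from typing import List, Optional


def check_fit_arguments(
    is_data_connector: bool,
    y_given: bool,
    input_cols: Optional[List[str]] = None,
    label_col: Optional[str] = None,
) -> None:
    """Hypothetical pre-flight check mirroring the documented fit() rules."""
    if is_data_connector:
        # DataConnector input: labels come from label_col, never from y.
        if y_given:
            raise ValueError("y should not be provided with a DataConnector")
        if not input_cols or label_col is None:
            raise ValueError("input_cols and label_col are required with a DataConnector")
    elif not y_given:
        # In-memory input: labels must be passed explicitly.
        raise ValueError("y is required with in-memory data")


# A valid DataConnector call and a valid in-memory call both pass silently.
check_fit_arguments(True, False, input_cols=["F1", "F2"], label_col="LABEL")
check_fit_arguments(False, True)
```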

get_booster() Booster

Get the trained booster.

Returns:

The trained booster.

Return type:

xgb.Booster

Raises:

ValueError – If the model is not trained yet.

get_params(deep=True) Dict[str, Any]

Get parameters for this estimator.

sklearn needs this method when a CR estimator is used in a pipeline.

Parameters:

deep (bool) – If True, will return the parameters for this estimator and contained sub-objects that are estimators. However, since we don’t have any sub-objects in our class, this parameter is not used.

predict(X: DataConnector | DataFrame | Series | csr_matrix | ndarray, output_margin: bool = False, validate_features: bool = False, base_margin: Any | None = None, iteration_range: Tuple[int, int] | None = None, output_col_name: str | None = 'PREDICTIONS') ndarray | DataFrame

Predict using the trained model. This is an in-memory implementation of predict.

Parameters:
  • X (Union[DataConnector, pd.DataFrame, pd.Series, csr_matrix, np.ndarray]) – The input data to predict on, given either as a DataConnector or as in-memory data (pd.DataFrame, pd.Series, csr_matrix, np.ndarray). If a DataConnector is provided, input_cols are required.

  • output_margin (bool) – Whether to output the raw untransformed margin value.

  • validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.

  • base_margin (Optional[Any]) – Global bias for each instance. See the official XGBoost documentation on the intercept for details.

  • iteration_range (Optional[Tuple[int, int]]) – Specifies which layers of trees are used in prediction.

  • output_col_name (Optional[str]) – Only required for DataConnector input. Defaults to “PREDICTIONS”. This is necessary because DataConnector row order is not deterministic, so predictions must be co-located with their input features.

Returns:

The prediction result. This is a numpy array if an in-memory dataset was passed. If a DataConnector was passed, the full dataframe is returned with the predictions in the column named by output_col_name (“PREDICTIONS” by default).

Return type:

Union[np.ndarray, pd.DataFrame]

Raises:

ValueError – If the model is not trained yet.
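Since predict returns a different type depending on the input, calling code may want one entry point that always yields just the predictions. The helper below is a hypothetical sketch built on the documented return types; the is_data_connector flag is an illustrative shortcut for however the caller distinguishes the two input kinds.

```python
def predictions_from(estimator, data, is_data_connector, output_col_name="PREDICTIONS"):
    """Return only the predictions for either documented input kind."""
    if is_data_connector:
        # DataConnector input: a dataframe comes back with the predictions
        # stored alongside the input features in `output_col_name`.
        frame = estimator.predict(data, output_col_name=output_col_name)
        return frame[output_col_name]
    # In-memory input: the result is already an array of predictions.
    return estimator.predict(data)
```

Keeping the predictions in the returned dataframe, rather than extracting the column, preserves the pairing with the input rows, which matters because DataConnector row order is not deterministic.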

class snowflake.ml.modeling.distributors.xgboost.xgboost_estimator.XGBScalingConfig(num_workers: int = -1, num_cpu_per_worker: int = -1, use_gpu: bool | None = None)

Bases: BaseScalingConfig

Scaling config for the XGBoost estimator.

num_workers

Number of workers to use for distributed training. Default is -1, meaning the estimator will use all available workers.

Type:

int

num_cpu_per_worker

Number of CPU cores to use per worker. Default is -1, meaning the estimator will use all available CPU cores.

Type:

int

use_gpu

Whether to use GPU for training. If None, the estimator will choose to use GPU or not based on the environment.

Type:

Optional[bool]

num_cpu_per_worker: int = -1
num_workers: int = -1
use_gpu: bool | None = None
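A minimal sketch of pinning resources explicitly instead of relying on the -1 and None defaults. The specific values and the build_estimator helper are illustrative assumptions; the constructor arguments are the ones documented for XGBScalingConfig and XGBEstimator.

```python
# Explicit resource settings; the values here are illustrative only.
scaling_kwargs = {"num_workers": 4, "num_cpu_per_worker": 2, "use_gpu": False}


def build_estimator(**estimator_kwargs):
    """Build an XGBEstimator that uses the explicit scaling settings above."""
    # Imported lazily so this sketch can be loaded without the library installed.
    from snowflake.ml.modeling.distributors.xgboost.xgboost_estimator import (
        XGBEstimator,
        XGBScalingConfig,
    )

    config = XGBScalingConfig(**scaling_kwargs)
    return XGBEstimator(scaling_config=config, **estimator_kwargs)
```

Calling build_estimator(n_estimators=200) would then request four workers with two CPU cores each and no GPU, whereas omitting scaling_config lets the estimator use all available resources.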