XGBoost Distributor
Classes
- class snowflake.ml.modeling.distributors.xgboost.xgboost_estimator.XGBEstimator(n_estimators: int | None = 100, objective: str | None = 'reg:squarederror', params: Dict | None = None, scaling_config: XGBScalingConfig | None = None)
Bases:
BaseEstimator
XGBoost estimator that supports distributed training.
- Parameters:
n_estimators (int) – Number of estimators. Default is 100.
objective (str) – The objective function used for training: ‘reg:squarederror’ (default) for regression, ‘binary:logistic’ for binary classification, or ‘multi:softmax’ for multi-class classification.
params (Optional[dict]) –
Additional parameters for the XGBoost Estimator. Some key parameters are:
booster: Which booster to use: gbtree (default), gblinear, or dart.
max_depth: Maximum depth of a tree. Default is 6.
max_leaves: Maximum number of nodes to be added. Default is 0.
max_bin: Maximum number of bins into which continuous feature values are bucketed. Default is 256.
eval_metric: Evaluation metrics for validation data.
The full list of supported parameters can be found at https://xgboost.readthedocs.io/en/stable/parameter.html
Note: If the params dict contains the keys ‘n_estimators’ or ‘objective’, they override the values provided by the n_estimators and objective arguments.
scaling_config (Optional[XGBScalingConfig]) – Scaling config for XGBoost Estimator. Defaults to None. If None, the estimator will use all available resources.
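The precedence rule in the note above can be illustrated with a small, library-independent sketch. The helper `effective_params` is hypothetical and not part of snowflake.ml; it only mirrors the documented behaviour that keys in params win over the n_estimators and objective arguments, and it makes no claim about how the estimator implements this internally.

```python
from typing import Any, Dict, Optional


def effective_params(
    n_estimators: Optional[int] = 100,
    objective: Optional[str] = "reg:squarederror",
    params: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper mirroring the documented precedence rule."""
    merged: Dict[str, Any] = {"n_estimators": n_estimators, "objective": objective}
    # Keys present in `params` override the keyword arguments.
    merged.update(params or {})
    return merged


resolved = effective_params(
    n_estimators=200,
    params={"objective": "binary:logistic", "max_depth": 4},
)
print(resolved)
```

In this sketch the objective comes from params, while n_estimators keeps the value passed as an argument because params does not mention it.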
- fit(dataset: DataConnector | DataFrame | Series | csr_matrix | ndarray, y: DataFrame | Series | csr_matrix | ndarray | None = None, input_cols: List[str] | None = None, label_col: str | None = None, eval_set: DataConnector | None = None, verbose_eval: bool | int | None = None, xgb_model: Booster | None = None) → Booster
There are two ways to train an XGBoost model:
- Train with Snowflake DataConnector
The DataConnector should contain a single source with both input data and label data. A DataConnector can be created via DataConnector.from_dataframe for a Snowpark DataFrame, or via DataConnector.from_dataset.
- Train with in-memory data
The input data should be provided as one of the in-memory data types (pd.DataFrame, pd.Series, csr_matrix, np.ndarray). When training with in-memory data, y must be provided as well.
- Parameters:
dataset (Union[DataConnector, pd.DataFrame, pd.Series, csr_matrix, np.ndarray]) – The dataset used to train the model. If a DataConnector is provided, input_cols and label_col are required. Otherwise, y is required.
y (Optional[Union[pd.DataFrame, pd.Series, csr_matrix, np.ndarray]]) – Required if dataset is one of the in-memory data types. If dataset is a DataConnector, y should not be provided.
input_cols (Optional[List[str]]) – The input columns to train the model. Required when dataset is DataConnector.
label_col (Optional[str]) – The label column to train the model. Required when DataConnector is provided.
eval_set (Optional[DataConnector]) – The evaluation dataset.
verbose_eval (Optional[Union[bool, int]]) – Whether to print the evaluation result. For details, see the xgboost.train documentation.
xgb_model (Optional[xgb.Booster]) – The initial model to start with.
- Returns:
The trained model.
- Return type:
xgb.Booster
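The two training modes described above can be sketched as two thin wrappers. This is a minimal sketch that uses only the fit arguments documented here; the function names are hypothetical, and `estimator` is assumed to be an XGBEstimator instance (or any object with the same fit signature).

```python
from typing import Any, List


def train_with_connector(
    estimator: Any, connector: Any, input_cols: List[str], label_col: str
) -> Any:
    """Train from a DataConnector; features and label are selected by column name."""
    # `y` is omitted on purpose: it should not be passed with a DataConnector.
    return estimator.fit(connector, input_cols=input_cols, label_col=label_col)


def train_in_memory(estimator: Any, X: Any, y: Any) -> Any:
    """Train from in-memory data (pandas, NumPy, or csr_matrix); `y` is required."""
    return estimator.fit(X, y=y)
```

In the first mode, `connector` would typically come from DataConnector.from_dataframe applied to a Snowpark DataFrame that holds both the feature columns and the label column. Both wrappers return whatever fit returns, which is documented as the trained xgb.Booster.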
- get_booster() → Booster
Get the trained booster.
- Returns:
The trained booster.
- Return type:
xgb.Booster
- Raises:
ValueError – If the model is not trained yet.
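Because get_booster raises ValueError before training, callers that may run before fit can guard the call. The helper below is a hypothetical convenience wrapper, not part of the library; it relies only on the behaviour documented above.

```python
from typing import Any, Optional


def booster_or_none(estimator: Any) -> Optional[Any]:
    """Return the trained booster, or None if the estimator is not trained yet."""
    try:
        return estimator.get_booster()
    except ValueError:
        # Documented behaviour: get_booster() raises ValueError before training.
        return None
```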
- get_params(deep=True) → Dict[str, Any]
Get parameters for this estimator.
sklearn requires this method when a CR estimator is used in a pipeline.
- Parameters:
deep (bool) – If True, will return the parameters for this estimator and contained sub-objects that are estimators. However, since we don’t have any sub-objects in our class, this parameter is not used.
- predict(X: DataConnector | DataFrame | Series | csr_matrix | ndarray, output_margin: bool = False, validate_features: bool = False, base_margin: Any | None = None, iteration_range: Tuple[int, int] | None = None, output_col_name: str | None = 'PREDICTIONS') → ndarray | DataFrame
Predict using the trained model. This is an in-memory implementation of predict.
- Parameters:
X (Union[DataConnector, pd.DataFrame, pd.Series, csr_matrix, np.ndarray]) – The input data to predict on: either a DataConnector or in-memory data (pd.DataFrame, pd.Series, csr_matrix, np.ndarray). Unlike fit, predict accepts no y or input_cols arguments.
output_margin (bool) – Whether to output the raw untransformed margin value.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
base_margin (Optional[Any]) – Global bias for each instance. See the official XGBoost documentation on intercept for details.
iteration_range (Optional[Tuple[int, int]]) – Specifies which layers of trees are used in prediction.
output_col_name (Optional[str]) – Only required for DataConnector input. Defaults to “PREDICTIONS”. This is necessary because DataConnector row order is not deterministic, so predictions must be co-located with their input features.
- Returns:
The prediction result. This is a NumPy array if an in-memory dataset was passed. If a DataConnector was passed, the full DataFrame is returned with the predictions in the column named by output_col_name (“PREDICTIONS” by default).
- Return type:
Union[np.ndarray, pd.DataFrame]
- Raises:
ValueError – If the model is not trained yet.
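The two return shapes described above can be sketched as follows. The wrapper names are hypothetical, and `estimator` is assumed to be a trained XGBEstimator (or any object with the same predict signature); only documented predict arguments are used.

```python
from typing import Any


def predict_in_memory(estimator: Any, X: Any) -> Any:
    """In-memory input: the documented return value is a NumPy array."""
    return estimator.predict(X)


def predict_with_connector(
    estimator: Any, connector: Any, output_col_name: str = "PREDICTIONS"
) -> Any:
    """DataConnector input: the documented return value is a full DataFrame."""
    # Predictions sit next to their input features in `output_col_name`,
    # because DataConnector row order is not deterministic.
    return estimator.predict(connector, output_col_name=output_col_name)
```

With a DataConnector, the predictions would then be read from the returned frame by column name, for example `result["PREDICTIONS"]` when the default column name is kept.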
- class snowflake.ml.modeling.distributors.xgboost.xgboost_estimator.XGBScalingConfig(num_workers: int = -1, num_cpu_per_worker: int = -1, use_gpu: bool | None = None)
Bases:
BaseScalingConfig
Scaling config for the XGBoost estimator.
- num_workers
Number of workers to use for distributed training. Default is -1, meaning the estimator will use all available workers.
- Type:
int
- num_cpu_per_worker
Number of CPU cores to use per worker. Default is -1, meaning the estimator will use all available CPU cores.
- Type:
int
- use_gpu
Whether to use GPU for training. If None, the estimator decides whether to use GPU based on the environment.
- Type:
Optional[bool]
- num_cpu_per_worker: int = -1
- num_workers: int = -1
- use_gpu: bool | None = None
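As an illustration of how the scaling config plugs into the estimator, here is a minimal sketch that uses only the constructor arguments documented on this page. The factory name `make_estimator` is hypothetical, and the import is deferred into the function body on the assumption that the distributors module is only importable inside a Snowflake runtime that ships it.

```python
from typing import Any, Optional


def make_estimator(num_workers: int = -1, use_gpu: Optional[bool] = None) -> Any:
    """Build an XGBEstimator with an explicit scaling config."""
    # Imported lazily so this sketch can be loaded outside the Snowflake runtime.
    from snowflake.ml.modeling.distributors.xgboost.xgboost_estimator import (
        XGBEstimator,
        XGBScalingConfig,
    )

    scaling_config = XGBScalingConfig(
        num_workers=num_workers,  # -1: use all available workers
        num_cpu_per_worker=-1,    # -1: use all available CPU cores
        use_gpu=use_gpu,          # None: decided based on the environment
    )
    return XGBEstimator(
        n_estimators=100,
        objective="reg:squarederror",
        scaling_config=scaling_config,
    )
```

Calling `make_estimator()` with no arguments matches the defaults listed above. Passing scaling_config=None to XGBEstimator is documented to use all available resources as well.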