LightGBM Distributor

Classes

class snowflake.ml.modeling.distributors.lightgbm.lightgbm_estimator.LightGBMEstimator(n_estimators: int = 100, objective: str = 'regression', params: Dict[str, Any] | None = None, scaling_config: LightGBMScalingConfig | None = None)

Bases: BaseEstimator

LightGBM Estimator for distributed training and inference.

params

Additional parameters for LightGBM. Defaults to None. Key parameters include:

  • boosting: The type of boosting to use. Use "gbdt" (default) for Gradient Boosting Decision Tree.

  • num_leaves: The maximum number of leaves in one tree. Default is 31.

  • max_depth: The maximum depth of the tree. Default is -1, which means no limit.

  • early_stopping_rounds: Activates early stopping. Default is 0, which means no early stopping.

The full list of supported parameters can be found at https://lightgbm.readthedocs.io/en/latest/Parameters.html.

Type:

Optional[Dict[str, Any]], optional
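For illustration, a params dictionary combining the keys listed above might look like the following (the values are arbitrary examples, not tuned recommendations):

```python
# Illustrative LightGBM parameter dictionary; values are examples only.
params = {
    "boosting": "gbdt",           # Gradient Boosting Decision Tree (the default)
    "num_leaves": 63,             # raise the per-tree leaf cap from the default 31
    "max_depth": 8,               # limit tree depth (-1 means no limit)
    "early_stopping_rounds": 20,  # stop when the eval metric stalls for 20 rounds
}
```

The dictionary is then passed to the constructor as LightGBMEstimator(params=params).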

scaling_config

Configuration for scaling. Defaults to None. If None, the estimator will use all available resources.

Type:

Optional[LightGBMScalingConfig], optional

fit(dataset: DataConnector | DataFrame | Series | ndarray | csr_matrix, y: DataFrame | Series | csr_matrix | ndarray | None = None, input_cols: List[str] | None = None, label_col: str | None = None, eval_set: DataConnector | None = None, init_model: Booster | None = None)

Train the LightGBM model.

This method trains the LightGBM model using the provided dataset and labels. It can accept various data formats and supports validation datasets for monitoring training performance. The model can also be initialized from a pre-trained model for warm starts.

Parameters:
  • dataset (Union[DataConnector, pd.DataFrame, pd.Series, csr_matrix, np.ndarray]) – The dataset used to train the model. If a DataConnector is provided, input_cols and label_col are required. If in-memory data is provided (pd.DataFrame, pd.Series, csr_matrix, np.ndarray), y must be provided as well.

  • y (Optional[Union[pd.DataFrame, pd.Series, csr_matrix, np.ndarray]]) – If dataset is one of the in-memory data types, y must be provided. If dataset is a DataConnector, y must not be provided; instead, the dataset should contain both the input_cols and the label_col.

  • input_cols (Optional[List[str]]) – The input columns used to train the model. Required when dataset is a DataConnector.

  • label_col (Optional[str]) – The label column used to train the model. Required when dataset is a DataConnector.

  • eval_set (Optional[DataConnector], optional) – An optional evaluation dataset to monitor performance during training. Defaults to None.

  • init_model (Optional[lgb.Booster], optional) – An optional pre-trained LightGBM model for warm starting the training process. Defaults to None.

Returns:

The trained LightGBM model.

Raises:

ValueError – If input validation fails or if the model cannot be trained.
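A minimal training sketch, assuming a Snowflake ML Container Runtime environment with an active Snowpark session named session, and a hypothetical table TRAINING_DATA with feature columns F1 and F2 and a label column LABEL (all names here are illustrative, not part of the API):

```python
# Sketch only: requires a Snowflake ML Container Runtime environment.
# `session`, the table name, and the column names below are hypothetical.
from snowflake.ml.data.data_connector import DataConnector
from snowflake.ml.modeling.distributors.lightgbm.lightgbm_estimator import (
    LightGBMEstimator,
)

dc = DataConnector.from_dataframe(session.table("TRAINING_DATA"))

estimator = LightGBMEstimator(
    n_estimators=200,
    objective="regression",
    params={"num_leaves": 63, "early_stopping_rounds": 20},
)

# With a DataConnector, input_cols and label_col are required and y is omitted.
estimator.fit(dc, input_cols=["F1", "F2"], label_col="LABEL")

booster = estimator.get_booster()  # raises ValueError if called before fit
```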

get_booster() Booster

Returns the trained LightGBM Booster.

Returns:

The trained booster model.

Return type:

lgb.Booster

Raises:

ValueError – If the model has not been trained yet.

get_params(deep=True) Dict[str, Any]

Get parameters for this estimator.

This method is required by sklearn when using the estimator in a pipeline, for example for cloning and hyperparameter search.

Parameters:

deep (bool) – If True, will return the parameters for this estimator and contained subobjects that are estimators. However, since we don’t have any subobjects in our class, this parameter is not used.

predict(X: DataConnector | DataFrame | Series | csr_matrix | ndarray, raw_score: bool = False, start_iteration: int = 0, num_iteration: int | None = None, pred_leaf: bool = False, pred_contrib: bool = False, validate_features: bool = True, output_col_name: str | None = 'PREDICTIONS') ndarray | DataFrame

Make predictions using the trained LightGBM model.

This function utilizes the trained model to generate predictions on the provided dataset. It supports various options for prediction behavior, such as returning raw scores, predicting leaf indices, and validating features.

Parameters:
  • X (Union[DataConnector, pd.DataFrame, pd.Series, csr_matrix, np.ndarray]) – The input data for which predictions are to be made.

  • raw_score (bool, optional) – If True, returns the raw scores from the model. Defaults to False.

  • start_iteration (int, optional) – The index of the iteration to start predictions from. If <= 0, starts from the first iteration. Defaults to 0.

  • num_iteration (Optional[int], optional) – The total number of iterations to use for predictions. If None and if the best iteration exists while start_iteration <= 0, the best iteration is used; otherwise, all iterations from start_iteration are used. If <= 0, all iterations from start_iteration are used. Defaults to None.

  • pred_leaf (bool, optional) – If True, predicts the leaf index instead of the actual prediction. Defaults to False.

  • pred_contrib (bool, optional) – If True, predicts feature contributions instead of the actual prediction. Defaults to False.

  • validate_features (bool, optional) – If True, validates that the features in X match those used during training. Defaults to True.

  • output_col_name (Optional[str], optional) – The name of the column that holds the predictions. Only used for DataConnector input. Defaults to "PREDICTIONS". This is necessary because DataConnector row order is not deterministic, so predictions must be co-located with their input features.

Returns:

The prediction result. Will be a numpy array if an in-memory dataset was passed. If a DataConnector was passed, the full dataframe is returned with predictions in the output_col_name column (default "PREDICTIONS").

Return type:

(Union[np.ndarray, pd.DataFrame])

Raises:

ValueError – If the model has not been trained yet.
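For in-memory input, fit takes y directly and predict returns a NumPy array. A sketch, again assuming a Container Runtime environment (the synthetic data here is purely illustrative):

```python
# Sketch only: requires a Snowflake ML Container Runtime environment.
import numpy as np

from snowflake.ml.modeling.distributors.lightgbm.lightgbm_estimator import (
    LightGBMEstimator,
)

# Synthetic regression data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=1000)

est = LightGBMEstimator(n_estimators=50, objective="regression")
est.fit(X, y=y)  # in-memory input: y is required, input_cols/label_col are not

preds = est.predict(X)                        # np.ndarray for in-memory input
contribs = est.predict(X, pred_contrib=True)  # per-feature contributions instead
```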

class snowflake.ml.modeling.distributors.lightgbm.lightgbm_estimator.LightGBMScalingConfig(num_workers: int = -1, num_cpu_per_worker: int = -1, use_gpu: bool | None = None)

Bases: BaseScalingConfig

Scaling config for LightGBM Estimator.

num_workers

The number of worker processes to use. Default is -1, which utilizes all available resources.

Type:

int

num_cpu_per_worker

Number of CPUs allocated per worker. Default is -1, which means all available resources.

Type:

int

use_gpu

Whether to use GPU for training. Default is None, allowing the estimator to choose automatically based on the environment.

Type:

Optional[bool]

num_cpu_per_worker: int = -1
num_workers: int = -1
use_gpu: bool | None = None
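A configuration sketch tying the two classes together: two workers with four CPUs each, leaving GPU selection to the estimator (the values are illustrative, not recommendations):

```python
# Sketch only: requires a Snowflake ML Container Runtime environment.
from snowflake.ml.modeling.distributors.lightgbm.lightgbm_estimator import (
    LightGBMEstimator,
    LightGBMScalingConfig,
)

scaling = LightGBMScalingConfig(
    num_workers=2,         # -1 (default) would use all available workers
    num_cpu_per_worker=4,  # -1 (default) would use all available CPUs
    use_gpu=None,          # None lets the estimator decide based on the environment
)

estimator = LightGBMEstimator(scaling_config=scaling)
```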