Snowpark ML Modeling: ML Model Development

Snowpark ML Modeling is a collection of Python APIs for preprocessing data and training models. By using Snowpark ML Modeling to perform these tasks within Snowflake, you can:

  • Transform your data and train your models without moving your data out of Snowflake.

  • Work with APIs similar to those you’re already familiar with, such as scikit-learn.

  • Keep your ML pipeline running within Snowflake’s security and governance frameworks.

  • Benefit from the performance and scalability of Snowflake’s data warehouses.

The Snowpark ML Modeling package described in this topic provides estimators and transformers that are compatible with those in the scikit-learn, xgboost, and lightgbm libraries. You can use these APIs to build and train machine learning models within Snowflake.

For a quick introduction to Snowpark ML Modeling, see our Quickstart.

Note

This topic assumes that the Snowpark ML module is already installed. If it isn’t, see Installing Snowpark ML.

Snowpark ML Modeling Classes

All Snowpark ML modeling and preprocessing classes are in the snowflake.ml.modeling namespace. The Snowpark ML modules have the same name as the corresponding modules from the sklearn namespace. For example, the Snowpark ML module corresponding to sklearn.calibration is snowflake.ml.modeling.calibration.

The xgboost and lightgbm modules correspond to snowflake.ml.modeling.xgboost and snowflake.ml.modeling.lightgbm, respectively.
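
For example, assuming the package is installed, each of the following imports mirrors its counterpart in the original library (the class choices here are arbitrary illustrations drawn from the list below):

from snowflake.ml.modeling.linear_model import LogisticRegression  # mirrors sklearn.linear_model
from snowflake.ml.modeling.xgboost import XGBClassifier            # mirrors xgboost
from snowflake.ml.modeling.lightgbm import LGBMRegressor           # mirrors lightgbm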

Not all of the classes from scikit-learn are supported in Snowpark ML. The following list shows the supported classes, grouped by Snowpark ML module. Classes marked with an asterisk (*) support distributed execution.

snowflake.ml.modeling.calibration

  • CalibratedClassifierCV

snowflake.ml.modeling.cluster

  • AgglomerativeClustering

  • AffinityPropagation

  • Birch

  • DBSCAN

  • FeatureAgglomeration

  • KMeans

  • MeanShift

  • MiniBatchKMeans

  • OPTICS

  • SpectralBiclustering

  • SpectralClustering

  • SpectralCoclustering

snowflake.ml.modeling.compose

  • ColumnTransformer

  • TransformedTargetRegressor

snowflake.ml.modeling.covariance

  • EllipticEnvelope

  • EmpiricalCovariance

  • GraphicalLasso

  • GraphicalLassoCV

  • LedoitWolf

  • MinCovDet

  • OAS

  • ShrunkCovariance

snowflake.ml.modeling.decomposition

  • DictionaryLearning

  • FactorAnalysis

  • FastICA

  • IncrementalPCA

  • KernelPCA

  • MiniBatchDictionaryLearning

  • MiniBatchSparsePCA

  • PCA

  • SparsePCA

  • TruncatedSVD

snowflake.ml.modeling.discriminant_analysis

  • LinearDiscriminantAnalysis

  • QuadraticDiscriminantAnalysis

snowflake.ml.modeling.ensemble

  • AdaBoostClassifier

  • AdaBoostRegressor

  • BaggingClassifier

  • BaggingRegressor

  • ExtraTreesClassifier

  • ExtraTreesRegressor

  • GradientBoostingClassifier

  • GradientBoostingRegressor

  • IsolationForest

  • RandomForestClassifier

  • RandomForestRegressor

  • StackingRegressor

  • VotingClassifier

  • VotingRegressor

snowflake.ml.modeling.feature_selection

  • GenericUnivariateSelect

  • SelectFdr

  • SelectFpr

  • SelectFwe

  • SelectKBest

  • SelectPercentile

  • SequentialFeatureSelector

  • VarianceThreshold

snowflake.ml.modeling.gaussian_process

  • GaussianProcessClassifier

  • GaussianProcessRegressor

snowflake.ml.modeling.impute

  • IterativeImputer

  • KNNImputer

  • MissingIndicator

  • SimpleImputer *

snowflake.ml.modeling.kernel_approximation

  • AdditiveChi2Sampler

  • Nystroem

  • PolynomialCountSketch

  • RBFSampler

  • SkewedChi2Sampler

snowflake.ml.modeling.kernel_ridge

  • KernelRidge

snowflake.ml.modeling.lightgbm

  • LGBMClassifier

  • LGBMRegressor

snowflake.ml.modeling.linear_model

  • ARDRegression

  • BayesianRidge

  • ElasticNet

  • ElasticNetCV

  • GammaRegressor

  • HuberRegressor

  • Lars

  • LarsCV

  • Lasso

  • LassoCV

  • LassoLars

  • LassoLarsCV

  • LassoLarsIC

  • LinearRegression

  • LogisticRegression

  • LogisticRegressionCV

  • MultiTaskElasticNet

  • MultiTaskElasticNetCV

  • MultiTaskLasso

  • MultiTaskLassoCV

  • OrthogonalMatchingPursuit

  • PassiveAggressiveClassifier

  • PassiveAggressiveRegressor

  • Perceptron

  • PoissonRegressor

  • RANSACRegressor

  • Ridge

  • RidgeClassifier

  • RidgeClassifierCV

  • RidgeCV

  • SGDClassifier

  • SGDOneClassSVM

  • SGDRegressor

  • TheilSenRegressor

  • TweedieRegressor

snowflake.ml.modeling.manifold

  • Isomap

  • MDS

  • SpectralEmbedding

  • TSNE

snowflake.ml.modeling.metrics

correlation:

  • correlation *

covariance:

  • covariance *

classification:

  • accuracy_score *

  • confusion_matrix *

  • f1_score

  • fbeta_score

  • log_loss

  • precision_recall_fscore_support

  • precision_score

  • recall_score

ranking:

  • precision_recall_curve

  • roc_auc_score

  • roc_curve

regression:

  • d2_absolute_error_score

  • d2_pinball_score

  • explained_variance_score

  • mean_absolute_error

  • mean_absolute_percentage_error

  • mean_squared_error

  • r2_score *

snowflake.ml.modeling.mixture

  • BayesianGaussianMixture

  • GaussianMixture

snowflake.ml.modeling.model_selection

  • GridSearchCV

  • RandomizedSearchCV

snowflake.ml.modeling.multiclass

  • OneVsOneClassifier

  • OneVsRestClassifier

  • OutputCodeClassifier

snowflake.ml.modeling.naive_bayes

  • BernoulliNB

  • CategoricalNB

  • ComplementNB

  • GaussianNB

  • MultinomialNB

snowflake.ml.modeling.neighbors

  • KernelDensity

  • KNeighborsClassifier

  • KNeighborsRegressor

  • LocalOutlierFactor

  • NearestCentroid

  • NearestNeighbors

  • NeighborhoodComponentsAnalysis

  • RadiusNeighborsClassifier

  • RadiusNeighborsRegressor

snowflake.ml.modeling.neural_network

  • BernoulliRBM

  • MLPClassifier

  • MLPRegressor

snowflake.ml.modeling.preprocessing

  • Binarizer *

  • KBinsDiscretizer *

  • LabelEncoder *

  • MaxAbsScaler *

  • MinMaxScaler *

  • Normalizer *

  • OneHotEncoder *

  • OrdinalEncoder *

  • RobustScaler *

  • StandardScaler *

snowflake.ml.modeling.semi_supervised

  • LabelPropagation

  • LabelSpreading

snowflake.ml.modeling.svm

  • LinearSVC

  • LinearSVR

  • NuSVC

  • NuSVR

  • SVC

  • SVR

snowflake.ml.modeling.tree

  • DecisionTreeClassifier

  • DecisionTreeRegressor

  • ExtraTreeClassifier

  • ExtraTreeRegressor

snowflake.ml.modeling.xgboost

  • XGBClassifier

  • XGBRegressor

  • XGBRFClassifier

  • XGBRFRegressor

General API Differences

Snowpark ML Modeling includes data preprocessing, transformation, and prediction algorithms based on scikit-learn, xgboost, and lightgbm. The Snowpark ML classes are replacements for the corresponding classes from the original packages, with similar signatures. However, these APIs are designed to work with Snowpark DataFrames instead of NumPy arrays.

Although the Snowpark ML Modeling API is similar to scikit-learn, there are some key differences. This section explains how to call the __init__ (constructor), fit, and predict methods for the estimator and transformer classes provided in Snowpark ML.

  • The constructor of every Snowpark ML Python class accepts five parameters beyond those accepted by the equivalent class in scikit-learn, xgboost, or lightgbm: input_cols, output_cols, sample_weight_col, label_cols, and drop_input_cols. The first four are strings or sequences of strings that specify the names of the input columns, output columns, sample weight column, and label columns in a Snowpark or Pandas DataFrame; drop_input_cols is a Boolean that indicates whether the input columns are dropped from the result DataFrame.

  • The fit and predict methods in Snowpark ML accept a single DataFrame instead of separate arrays representing input data, labels, and weights. With Snowpark ML, you specify the names of the columns to be used for these purposes when you instantiate the class; these names are then used to find the required columns in the DataFrame that you pass to fit or predict. See fit and predict.

  • The transform and predict methods in Snowpark ML return a DataFrame containing all of the columns from the DataFrame passed to the method, with the output from the prediction stored in additional columns. (You can transform in place by specifying the same input and output column names or drop the input columns by passing drop_input_cols = True.) The scikit-learn, xgboost, and lightgbm equivalents return arrays containing only the results.

  • Snowpark ML transformers do not have a fit_transform method. However, as with scikit-learn, parameter validation is performed only in the fit method, so you should call fit at some point before transform, even when the transformer does not do any fitting. fit returns the transformer, so the method calls may be chained; for example, Binarizer(threshold=0.5).fit(df).transform(df).

  • Snowpark ML transformers do not have an inverse_transform method. This method is unnecessary with Snowpark ML because the original representation remains available in the input columns of the input DataFrame, which are preserved unless you explicitly perform an in-place transform by specifying the same names for both the input and output columns.
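
To make these differences concrete, here is a minimal sketch, assuming an existing Snowpark session named session and a table MY_TABLE with a numeric column FEATURE (all hypothetical names):

from snowflake.ml.modeling.preprocessing import MinMaxScaler

df = session.table("MY_TABLE")  # Snowpark DataFrame with a numeric FEATURE column

# Column names are set at construction time, not passed to fit or transform.
scaler = MinMaxScaler(input_cols=["FEATURE"], output_cols=["FEATURE_SCALED"])

# fit returns the transformer, so the calls can be chained. The result keeps
# every column of df and appends FEATURE_SCALED.
scaled_df = scaler.fit(df).transform(df)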

Constructing a Model

In addition to the parameters accepted by individual scikit-learn model classes, all Snowpark ML Modeling classes accept the following five additional parameters at instantiation.

These parameters are all technically optional, but you will often want to specify input_cols, output_cols, or both. label_cols and sample_weight_col are required in the specific situations described below, but can be omitted in other cases.

input_cols

A string or list of strings representing column names that contain features.

If you omit this parameter, all columns in the input DataFrame, except the columns specified by the label_cols and sample_weight_col parameters, are considered input columns.

label_cols

A string or list of strings representing column names that contain labels.

You must specify label columns for estimators because inferring these columns is not possible. If you omit this parameter, the model is considered a transformer and is fitted without labels.

output_cols

A string or list of strings representing column names that will store the output of predict and transform operations. The length of output_cols must match the expected number of output columns from the specific predictor or transformer class used.

If you omit this parameter, output column names are derived by adding an OUTPUT_ prefix to the label column names. These inferred output column names work for predictors, but output_cols must be set explicitly for transformers. Explicitly specifying output column names is clearer, especially if you don’t specify the input column names.

To transform in place, pass the same names for input_cols and output_cols.

sample_weight_col

A string representing the column name containing the examples’ weights.

This argument is required for weighted datasets.

drop_input_cols

A Boolean value indicating whether the input columns are removed from the result DataFrame. The default is False.

Example

The DecisionTreeClassifier constructor does not have any required arguments in scikit-learn; all arguments have default values. So in scikit-learn, you might write:

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()

In Snowpark ML, you specify the column names explicitly, or accept the defaults by omitting them. In this example, they are specified explicitly.

You can initialize a Snowpark ML DecisionTreeClassifier by passing the arguments directly to the constructor or by setting them as attributes of the model after instantiation. (The attributes may be changed at any time.)

  • As constructor arguments:

    from snowflake.ml.modeling.tree import DecisionTreeClassifier
    
    model = DecisionTreeClassifier(
        input_cols=feature_column_names,
        label_cols=label_column_names,
        output_cols=output_column_names,
        sample_weight_col=weight_column_name,
    )
    
  • By setting model attributes:

    from snowflake.ml.modeling.tree import DecisionTreeClassifier
    
    model = DecisionTreeClassifier()
    model.set_input_cols(feature_column_names)
    model.set_label_cols(label_column_names)
    model.set_sample_weight_col(weight_column_name)
    model.set_output_cols(output_column_names)
    

fit

The fit method of a Snowpark ML classifier takes a single Snowpark or Pandas DataFrame containing all columns, including features, labels, and weights. This is different from scikit-learn’s fit method, which takes separate inputs for features, labels, and weights.

In scikit-learn, the DecisionTreeClassifier.fit method call looks like this:

model.fit(
    X=df[feature_column_names], y=df[label_column_names], sample_weight=df[weight_column_name]
)

In Snowpark ML, you only need to pass the DataFrame. You have already set the input, label, and weight column names at initialization or by using setter methods, as shown in Constructing a Model.

model.fit(df)

predict

The predict method of a Snowpark ML class also takes a single Snowpark or Pandas DataFrame containing all feature columns. The result is a DataFrame that contains all the columns in the input DataFrame unchanged and the output columns appended. You must extract the output columns from this DataFrame. This is different from the predict method in scikit-learn, which returns only the results.

Example

In scikit-learn, predict returns only the prediction results:

prediction_results = model.predict(X=df[feature_column_names])

To get only the prediction results in Snowpark ML, extract the output columns from the returned DataFrame. Here, output_column_names is a list containing the names of the output columns:

prediction_results = model.predict(df)[output_column_names]

Deploying and Running Your Model

The result of training a model is a Snowpark ML Python model object. You can use the trained model to make predictions by calling the model’s predict method. This creates a temporary user-defined function that runs the model in your Snowflake virtual warehouse. The function is automatically deleted at the end of your Snowpark ML session (for example, when your script ends or when you close your Jupyter notebook).

To keep the user-defined function after your session ends, you can create it manually. See the Quickstart for more information.

The Snowpark ML model registry, an upcoming feature, also supports persistent models and makes finding and deploying them easier. For early access to documentation on this feature, contact your Snowflake representative.

Pipeline for Multiple Transformations

With scikit-learn, it is common to run a series of transformations using a pipeline. scikit-learn pipelines do not work with Snowpark ML classes, so Snowpark ML provides a Snowpark Python version of sklearn.pipeline.Pipeline for running a series of transformations. This class is in the snowflake.ml.modeling.pipeline package, and it works the same as the scikit-learn version.
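
For example, a minimal sketch that chains a scaler and a classifier (the DataFrame df and the column-name lists are hypothetical, and the steps argument is assumed to take (name, object) tuples as in scikit-learn):

from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.ml.modeling.preprocessing import MinMaxScaler
from snowflake.ml.modeling.xgboost import XGBClassifier

pipeline = Pipeline(steps=[
    ("scaler", MinMaxScaler(input_cols=feature_column_names, output_cols=scaled_column_names)),
    ("classifier", XGBClassifier(input_cols=scaled_column_names, label_cols=label_column_names)),
])

pipeline.fit(df)                   # fits every step against the Snowpark DataFrame
predictions = pipeline.predict(df)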

Known Limitations

  • Snowpark ML estimators and transformers do not currently support sparse inputs or sparse responses. If you have sparse data, convert it to a dense format before passing it to Snowpark ML’s estimators or transformers (a minimal conversion sketch follows this list).

  • The Snowpark ML package does not currently support matrix data types. Any operation on estimators and transformers that would produce a matrix as a result fails.

  • The order of rows in result data is not guaranteed to match the order of rows in input data.
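
As a sketch of the sparse-data conversion mentioned in the first limitation (the feature matrix and column names are hypothetical):

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# A hypothetical sparse feature matrix; Snowpark ML cannot consume it directly.
sparse_features = csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))

# Convert to a dense Pandas DataFrame before passing it to a Snowpark ML
# estimator or transformer.
dense_df = pd.DataFrame(sparse_features.toarray(), columns=["FEATURE_A", "FEATURE_B"])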

Troubleshooting

Adding More Detail to Logging

Snowpark ML uses Snowpark Python’s logging. By default, Snowpark ML logs INFO-level messages to standard output. To get more detailed logs, which can help you troubleshoot issues with Snowpark ML, change the level to one of the supported levels.

DEBUG produces the most detailed logs. To set the logging level to DEBUG:

import logging, sys

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

Solutions to Common Issues

The following list provides suggestions for solving common problems with Snowpark ML Modeling. Each entry gives the problem or error message, its possible cause, and a resolution.

  • NameError (such as “name x is not defined”), ImportError, or ModuleNotFoundError
    Possible cause: A typographical error in the module or class name, or Snowpark ML is not installed.
    Resolution: Refer to the Snowpark ML Modeling Classes list for the correct module and class name, and ensure that the Snowpark ML package is installed (see Installing Snowpark ML).

  • KeyError (“not in index” or “none of [Index[..]] are in the [columns]”)
    Possible cause: Incorrect column name.
    Resolution: Check and correct the column name.

  • SnowparkSQLException, “does not exist or not authorized”
    Possible cause: The table does not exist, or you do not have sufficient privileges on it.
    Resolution: Ensure that the table exists and that the user’s role has the required privileges.

  • SnowparkSQLException, “invalid identifier PETALLENGTH”
    Possible cause: An incorrect number of columns (usually a missing column).
    Resolution: Check the number of columns specified when you created the model class, and ensure that you are passing the right number.

  • InvalidParameterError
    Possible cause: An inappropriate type or value was passed as a parameter.
    Resolution: Check the class’s or method’s help using the help function in an interactive Python session, and correct the values.

  • TypeError, “unexpected keyword argument”
    Possible cause: A typographical error in a named argument.
    Resolution: Check the class’s or method’s help using the help function in an interactive Python session, and correct the argument name.

  • ValueError, “array with 0 sample(s)”
    Possible cause: The dataset that was passed in is empty.
    Resolution: Ensure that the dataset is not empty.

  • SnowparkSQLException, “authentication token has expired”
    Possible cause: The session has expired.
    Resolution: If you’re using a Jupyter notebook, restart the kernel to create a new session.

  • ValueError, such as “cannot convert string to float”
    Possible cause: A data type mismatch.
    Resolution: Check the class’s or method’s help using the help function in an interactive Python session, and correct the values.

  • SnowparkSQLException, “cannot create temporary table”
    Possible cause: A model class is being used inside a stored procedure that doesn’t run with the caller’s rights.
    Resolution: Create the stored procedure with the caller’s rights instead of the owner’s rights.

  • SnowparkSQLException, “function available memory exceeded”
    Possible cause: Your dataset is larger than 5 GB in a standard warehouse.
    Resolution: Switch to a Snowpark-optimized warehouse.

  • OSError, “no space left on device”
    Possible cause: Your model is larger than about 500 MB in a standard warehouse.
    Resolution: Switch to a Snowpark-optimized warehouse.

  • Incompatible xgboost version, or an error when importing xgboost
    Possible cause: You installed with pip, which does not handle dependencies well.
    Resolution: Upgrade or downgrade the package as requested by the error message.

  • AttributeError involving to_sklearn, to_xgboost, or to_lightgbm
    Possible cause: One of these methods was called on a model of a different type.
    Resolution: Use to_sklearn with scikit-learn-based models, to_xgboost with xgboost-based models, and to_lightgbm with lightgbm-based models (see the sketch after this list).
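
For example, a hedged sketch of extracting the underlying fitted object from a trained Snowpark ML model (here, model is assumed to be a trained scikit-learn-based model, such as the DecisionTreeClassifier shown earlier):

sk_model = model.to_sklearn()       # scikit-learn-based models only
# xgb_model = model.to_xgboost()    # for snowflake.ml.modeling.xgboost models
# lgbm_model = model.to_lightgbm()  # for snowflake.ml.modeling.lightgbm models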

Further Reading

See the documentation of the original libraries for complete information on their functionality.

Acknowledgement

Some parts of this document are derived from the Scikit-learn documentation, which is licensed under the BSD-3 “New” or “Revised” license and Copyright © 2007-2023 The scikit-learn developers. All rights reserved.

Some parts of this document are derived from the XGBoost documentation, which is licensed under the Apache License 2.0, January 2004, and Copyright © 2019. All rights reserved.

Some parts of this document are derived from the LightGBM documentation, which is MIT-licensed and Copyright © Microsoft Corp. All rights reserved.