Snowpark ML Modeling: ML Model Development¶
Snowpark ML Modeling is a collection of Python APIs for preprocessing data and training models. By using Snowpark ML Modeling to perform these tasks within Snowflake, you can:
- Transform your data and train your models without moving your data out of Snowflake.
- Work with APIs similar to those you’re already familiar with, such as scikit-learn.
- Keep your ML pipeline running within Snowflake’s security and governance frameworks.
- Benefit from the performance and scalability of Snowflake’s data warehouses.
The Snowpark ML Modeling package described in this topic provides estimators and transformers that are compatible with those in the scikit-learn, xgboost, and lightgbm libraries. You can use these APIs to build and train machine learning models within Snowflake.
For a quick introduction to Snowpark ML Modeling, see our Quickstart.
Note
This topic assumes that the Snowpark ML module is already installed. If it isn’t, see Installing Snowpark ML.
Snowpark ML Modeling Classes¶
All Snowpark ML modeling and preprocessing classes are in the snowflake.ml.modeling namespace. The Snowpark ML modules have the same names as the corresponding modules in the sklearn namespace. For example, the Snowpark ML module corresponding to sklearn.calibration is snowflake.ml.modeling.calibration.
The xgboost and lightgbm modules correspond to snowflake.ml.modeling.xgboost and snowflake.ml.modeling.lightgbm, respectively.
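For example, the import paths parallel each other. The following is a minimal sketch, assuming LogisticRegression and XGBClassifier (used here only as illustrations) are among the supported classes in your installed version:
# scikit-learn class and its Snowpark ML counterpart
from sklearn.linear_model import LogisticRegression as SklearnLogisticRegression
from snowflake.ml.modeling.linear_model import LogisticRegression

# xgboost class and its Snowpark ML counterpart
from xgboost import XGBClassifier as XgbXGBClassifier
from snowflake.ml.modeling.xgboost import XGBClassifier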
Not all of the classes from scikit-learn are supported in Snowpark ML. The following table indicates which classes are supported. Classes marked with an asterisk (*) support distributed execution.
| Snowpark ML module name | Classes |
|---|---|
General API Differences¶
Snowpark ML Modeling includes data preprocessing, transformation, and prediction algorithms based on scikit-learn, xgboost, and lightgbm. The Snowpark ML classes are replacements for the corresponding classes from the original packages, with similar signatures. However, these APIs are designed to work with Snowpark DataFrames instead of NumPy arrays.
Although the Snowpark ML Modeling API is similar to scikit-learn, there are some key differences. This section explains how to call the __init__ (constructor), fit, and predict methods for the estimator and transformer classes provided in Snowpark ML.
- The constructor of all Snowpark ML Python classes accepts five additional parameters (input_cols, output_cols, sample_weight_col, label_cols, and drop_input_cols) in addition to the parameters accepted by the equivalent classes in scikit-learn, xgboost, or lightgbm. These are strings or sequences of strings that specify the names of the input columns, output columns, sample weight column, and label columns in a Snowpark or Pandas DataFrame.
- The fit and predict methods in Snowpark ML accept a single DataFrame instead of separate arrays representing input data, labels, and weights. With Snowpark ML, you specify the names of the columns to be used for these purposes when you instantiate the class; these names are then used to find the required columns in the DataFrame that you pass to fit or predict. See fit and predict.
- The transform and predict methods in Snowpark ML return a DataFrame containing all of the columns from the DataFrame passed to the method, with the output from the prediction stored in additional columns. (You can transform in place by specifying the same input and output column names, or drop the input columns by passing drop_input_cols = True.) The scikit-learn, xgboost, and lightgbm equivalents return arrays containing only the results.
- Snowpark ML transformers do not have a fit_transform method. However, as with scikit-learn, parameter validation is performed only in the fit method, so you should call fit at some point before transform, even when the transformer does not do any fitting. fit returns the transformer, so the method calls may be chained; for example, Binarizer(threshold=0.5).fit(df).transform(df). A fuller sketch of this pattern appears after this list.
- Snowpark ML transformers do not have an inverse_transform method. This method is unnecessary with Snowpark ML because the original representation remains available in the input columns of the input DataFrame, which are preserved unless you explicitly perform an in-place transform by specifying the same names for both the input and output columns.
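The following sketch ties these differences together using the Binarizer transformer from snowflake.ml.modeling.preprocessing. The DataFrame df and the column names are hypothetical:
from snowflake.ml.modeling.preprocessing import Binarizer

# df is a Snowpark or Pandas DataFrame with a PETAL_LENGTH column (hypothetical name).
binarizer = Binarizer(
    threshold=0.5,
    input_cols=["PETAL_LENGTH"],
    output_cols=["PETAL_LENGTH_BINARIZED"],
)

# fit returns the transformer, so the calls can be chained. The result
# keeps every input column and appends the PETAL_LENGTH_BINARIZED column.
result_df = binarizer.fit(df).transform(df)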
Constructing a Model¶
In addition to the parameters accepted by individual scikit-learn model classes, all Snowpark ML Modeling classes accept five additional parameters, listed in the following table, at instantiation.
These parameters are all technically optional, but you will often want to specify input_cols, output_cols, or both. label_cols and sample_weight_col are required in specific situations noted in the table, but can be omitted in other cases.
| Parameter | Description |
|---|---|
| input_cols | A string or list of strings representing column names that contain features. If you omit this parameter, all columns in the input DataFrame, except the columns specified by the label_cols and sample_weight_col parameters, are considered input columns. |
| label_cols | A string or list of strings representing column names that contain labels. You must specify label columns for estimators because inferring these columns is not possible. If you omit this parameter, the model is considered a transformer and is fitted without labels. |
| output_cols | A string or list of strings representing column names that will store the output of predict and transform operations. If you omit this parameter, output column names are derived by adding an OUTPUT_ prefix to the label column names. To transform in place, pass the same names for input_cols and output_cols. |
| sample_weight_col | A string representing the column name containing the examples’ weights. This argument is required for weighted datasets. |
| drop_input_cols | A Boolean value indicating whether the input columns are removed from the result DataFrame. The default is False. |
Example¶
The DecisionTreeClassifier constructor does not have any required arguments in scikit-learn; all arguments have default values. So in scikit-learn, you might write:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
In Snowpark ML, you must specify the column names (or accept the defaults by not specifying them). In this example, they are explicitly specified.
You can initialize a Snowpark ML DecisionTreeClassifier by passing the arguments directly to the constructor or by setting them as attributes of the model after instantiation. (The attributes may be changed at any time.)
As constructor arguments:
from snowflake.ml.modeling.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    input_cols=feature_column_names,
    label_cols=label_column_names,
    sample_weight_col=weight_column_name,
    output_cols=expected_output_column_names
)
By setting model attributes:
from snowflake.ml.modeling.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.set_input_cols(feature_column_names)
model.set_label_cols(label_column_names)
model.set_sample_weight_col(weight_column_name)
model.set_output_cols(output_column_names)
fit¶
The fit method of a Snowpark ML classifier takes a single Snowpark or Pandas DataFrame containing all columns, including features, labels, and weights. This is different from scikit-learn’s fit method, which takes separate inputs for features, labels, and weights.
In scikit-learn, the DecisionTreeClassifier.fit method call looks like this:
model.fit(
X=df[feature_column_names], y=df[label_column_names], sample_weight=df[weight_column_name]
)
In Snowpark ML, you only need to pass the DataFrame. You have already set the input, label, and weight column names at initialization or by using setter methods, as shown in Constructing a Model.
model.fit(df)
predict¶
The predict method of a Snowpark ML class also takes a single Snowpark or Pandas DataFrame containing all feature columns. The result is a DataFrame that contains all the columns in the input DataFrame unchanged, with the output columns appended. You must extract the output columns from this DataFrame. This is different from the predict method in scikit-learn, which returns only the results.
Example¶
In scikit-learn, predict returns only the prediction results:
prediction_results = model.predict(X=df[feature_column_names])
To get only the prediction results in Snowpark ML, extract the output columns from the returned DataFrame. Here, output_column_names is a list containing the names of the output columns:
prediction_results = model.predict(df)[output_column_names]
Deploying and Running Your Model¶
The result of training a model is a Python Snowpark ML model object. You can use the trained model to make predictions by calling the model’s predict method. This creates a temporary user-defined function to run the model in your Snowflake virtual warehouse. This function is automatically deleted at the end of your Snowpark ML session (for example, when your script ends or when you close your Jupyter notebook).
To keep the user-defined function after your session ends, you can create it manually. See the Quickstart on the topic for further information.
The Snowpark ML model registry, an upcoming feature, also supports persistent models and makes finding and deploying them easier. For early access to documentation on this feature, contact your Snowflake representative.
Pipeline for Multiple Transformations¶
With scikit-learn, it is common to run a series of transformations using a pipeline. scikit-learn pipelines do not work with Snowpark ML classes, so Snowpark ML provides a Snowpark Python version of sklearn.pipeline.Pipeline for running a series of transformations. This class is in the snowflake.ml.modeling.pipeline package, and it works the same as the scikit-learn version.
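The following is a minimal sketch of a two-step pipeline, a scaling step followed by a classifier. The column-name variables follow the hypothetical names used in the earlier examples, and MinMaxScaler and DecisionTreeClassifier are used only as example steps:
from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.ml.modeling.preprocessing import MinMaxScaler
from snowflake.ml.modeling.tree import DecisionTreeClassifier

pipeline = Pipeline(steps=[
    (
        "scaler",
        MinMaxScaler(
            input_cols=feature_column_names,
            output_cols=scaled_column_names,
        ),
    ),
    (
        "classifier",
        DecisionTreeClassifier(
            input_cols=scaled_column_names,
            label_cols=label_column_names,
            output_cols=output_column_names,
        ),
    ),
])

# As with scikit-learn, fit the whole pipeline, then predict.
pipeline.fit(df)
predictions = pipeline.predict(df)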
Known Limitations¶
- Snowpark ML estimators and transformers do not currently support sparse inputs or sparse responses. If you have sparse data, convert it to a dense format before passing it to Snowpark ML’s estimators or transformers (see the sketch after this list).
- The Snowpark ML package does not currently support matrix data types. Any operation on estimators and transformers that would produce a matrix as a result fails.
- The order of rows in result data is not guaranteed to match the order of rows in input data.
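For the sparse-data limitation, here is a minimal sketch of converting a SciPy sparse matrix to a dense Pandas DataFrame before training; the data is randomly generated and the column names are hypothetical:
import pandas as pd
from scipy.sparse import random as sparse_random

# Hypothetical sparse feature matrix: 100 rows, 3 features, 10% nonzero.
sparse_features = sparse_random(100, 3, density=0.1, format="csr")

# Convert to a dense DataFrame with explicit column names before passing
# it to a Snowpark ML estimator or transformer.
feature_column_names = ["F1", "F2", "F3"]
dense_df = pd.DataFrame(sparse_features.toarray(), columns=feature_column_names)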
Troubleshooting¶
Adding More Detail to Logging¶
Snowpark ML uses Snowpark Python’s logging. By default, Snowpark ML logs INFO-level messages to standard output. To get more detailed logs, which can help you troubleshoot issues with Snowpark ML, change the level to one of the other supported levels.
DEBUG produces logs with the most details. To set the logging level to DEBUG:
import logging, sys
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
Solutions to Common Issues¶
The following table provides some suggestions for solving possible problems with Snowpark ML Modeling.
| Problem or error message | Possible cause | Resolution |
|---|---|---|
| NameError, such as “name x is not defined,” ImportError, or ModuleNotFoundError | Typographical error in module or class name, or Snowpark ML is not installed. | Refer to the Snowpark ML Modeling Classes table for the correct module and class name. Ensure that the Snowpark ML module is installed (see Installing Snowpark ML). |
| KeyError (“not in index” or “none of [Index[..]] are in the [columns]”) | Incorrect column name. | Check and correct the column name. |
| SnowparkSQLException, “does not exist or not authorize” | Table does not exist, or you do not have sufficient privileges on the table. | Ensure that the table exists and that the user’s role has the necessary privileges. |
| SnowparkSQLException, “invalid identifier PETALLENGTH” | Incorrect number of columns (usually a missing column). | Check the number of columns specified when you created the model class, and ensure that you are passing the right number. |
| InvalidParameterError | An inappropriate type or value has been passed as a parameter. | Check the class’s or method’s help using the help function in Python, and correct the value. |
| TypeError, “unexpected keyword argument” | Typographical error in named argument. | Check the class’s or method’s help using the help function in Python, and correct the argument name. |
| ValueError, “array with 0 sample(s)” | The dataset that was passed in is empty. | Ensure that the dataset is not empty. |
| SnowparkSQLException, “authentication token has expired” | The session has expired. | If you’re using a Jupyter notebook, restart the kernel to create a new session. |
| ValueError, such as “cannot convert string to float” | Data type mismatch. | Check the class’s or method’s help using the help function in Python, and ensure that the data types match. |
| SnowparkSQLException, “cannot create temporary table” | A model class is being used inside a stored procedure that doesn’t run with the caller’s rights. | Create the stored procedure with the caller’s rights instead of with the owner’s rights. |
| SnowparkSQLException, “function available memory exceeded” | Your data set is larger than 5 GB in a standard warehouse. | Switch to a Snowpark-optimized warehouse. |
| OSError, “no space left on device” | Your model is larger than about 500 MB in a standard warehouse. | Switch to a Snowpark-optimized warehouse. |
| Incompatible xgboost version or error when importing xgboost | You installed using pip, which can install incompatible package versions. | Upgrade or downgrade the package as requested by the error message. |
| AttributeError involving to_sklearn, to_xgboost, or to_lightgbm | An attempt to use one of these methods on a model of a different type. | Use to_sklearn only with scikit-learn-based models, to_xgboost only with xgboost-based models, and to_lightgbm only with lightgbm-based models. |
Further Reading¶
See the documentation of the original libraries for complete information on their functionality.
Acknowledgement¶
Some parts of this document are derived from the Scikit-learn documentation, which is licensed under the BSD-3 “New” or “Revised” license and Copyright © 2007-2023 The scikit-learn developers. All rights reserved.
Some parts of this document are derived from the XGBoost documentation, which is covered by the Apache License, Version 2.0, January 2004, and Copyright © 2019. All rights reserved.
Some parts of this document are derived from the LightGBM documentation, which is MIT-licensed and Copyright © Microsoft Corp. All rights reserved.