Snowpark ML: End-to-End Machine Learning in Snowflake¶
Note
The Snowpark ML Modeling API is Generally Available as of package version 1.1.1, and the Snowpark Model Registry is available as a preview as of package version 1.2.0. For early access to documentation for upcoming Snowpark ML features, contact your Snowflake representative.
Snowpark ML is the Python library and underlying infrastructure for end-to-end ML workflows in Snowflake, including components for model development and operations. With Snowpark ML, you can use familiar Python frameworks for preprocessing, feature engineering, and training. You can deploy and manage models entirely in Snowflake without any data movement, silos, or governance tradeoffs.
Tip
See Introduction to Machine Learning with Snowpark ML for an example of an end-to-end workflow in Snowpark ML.
Key Components of Snowpark ML¶
Snowpark ML provides APIs to support each stage of an end-to-end machine learning development and management process, including the key functionality described in the following sections. Snowpark ML also provides ways for your models to access data.
Snowpark ML Modeling¶
Snowpark ML Modeling supports data preprocessing, feature engineering, and model training in Snowflake using popular machine learning frameworks, such as scikit-learn, xgboost, and lightgbm. This API also includes a preprocessing module that can use compute resources provided by a Snowpark-optimized warehouse to provide scalable data transformations.
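As an illustrative sketch of the API shape (the session, table, and column names below are placeholders, not part of the library), training a classifier on a Snowpark DataFrame follows the familiar scikit-learn pattern, with feature and label columns specified on the estimator:

```python
from snowflake.ml.modeling.xgboost import XGBClassifier

# Assumes sp_session is an existing Snowpark session; the table and
# column names below are placeholders for your own data.
train_df = sp_session.table("MY_TRAINING_DATA")

clf = XGBClassifier(
    input_cols=["FEATURE_1", "FEATURE_2"],  # feature columns
    label_cols=["LABEL"],                   # target column
    output_cols=["PREDICTION"],             # column added by predict()
)
clf.fit(train_df)             # training runs in your Snowflake warehouse
results = clf.predict(train_df)
```

Because training and prediction run in the warehouse, no data leaves Snowflake.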
Snowpark ML Operations¶
Snowpark ML Operations (MLOps), featuring the Snowpark ML Model Registry, complements the Snowpark ML Development API. The model registry allows secure deployment and management of models in Snowflake, and supports models trained both inside and outside of Snowflake.
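As a hedged sketch (the model registry is in preview and its API is subject to change; the model, names, and DataFrame below are placeholders), logging a trained model and running inference through it might look like:

```python
from snowflake.ml.registry import Registry

# Assumes sp_session is an existing Snowpark session and clf is a
# model trained with Snowpark ML Modeling or a supported framework.
reg = Registry(session=sp_session)
model_version = reg.log_model(
    clf,
    model_name="my_classifier",  # placeholder name
    version_name="v1",
)

# Run inference through the registered model version
predictions = model_version.run(test_df, function_name="predict")
```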
Snowpark ML Data Access¶
Snowpark ML Data Access provides simple and performant ways to feed data into your machine learning models.
The FileSet API provides a Python fsspec-compliant API for materializing data into a Snowflake internal stage from a query or Snowpark DataFrame. It also provides convenient methods for working with the data and feeding it to PyTorch or TensorFlow.
A set of framework connectors provides optimized, secure, and performant data provisioning for the PyTorch and TensorFlow frameworks in their native data loader formats.
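As a sketch of how these pieces fit together (the session, stage path, and table name below are placeholders), you might materialize query results into a FileSet and then consume them through the PyTorch connector:

```python
from snowflake.ml.fileset import fileset

# Assumes sp_session is an existing Snowpark session; the stage
# location and query below are placeholders for your own objects.
fs = fileset.FileSet.make(
    target_stage_loc="@MY_DB.MY_SCHEMA.MY_STAGE/",
    name="training_data",
    snowpark_dataframe=sp_session.sql("SELECT * FROM MY_TRAINING_DATA"),
    shuffle=True,
)

# Feed the materialized files to PyTorch in its native data loader format
pipe = fs.to_torch_datapipe(batch_size=32, shuffle=True, drop_last_batch=True)
```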
Installing Snowpark ML¶
Important
Recent changes to the Snowpark Connector for Python library removed its dependency on PyArrow. Snowpark ML requires PyArrow, but does not have an explicit dependency on it before Snowpark ML 1.1.2. If you are using an earlier version and have upgraded Snowpark Connector for Python recently, you might need to install PyArrow manually. To do this, use one of the following commands depending on whether you are using conda or pip in your project.
conda install pyarrow
python -m pip install pyarrow
All Snowpark ML features are available in a single package, snowflake-ml-python. You can install Snowpark ML from the Snowflake conda channel using the conda command or from the Python Package Index (PyPI) using pip. Conda is preferred.
Installing Snowpark ML from the Snowflake conda Channel¶
Warning
Installing Snowpark ML from conda on an ARM-based Mac (with an M1 or M2 chip) requires specifying the system architecture when creating the conda environment. To do this, set CONDA_SUBDIR=osx-arm64 in the conda create command: CONDA_SUBDIR=osx-arm64 conda create --name snowpark-ml.
Create the conda environment where you will install Snowpark ML. If you prefer to use an existing environment, skip this step.
conda create --name snowpark-ml
Activate the conda environment:
conda activate snowpark-ml
Install Snowpark ML from the Snowflake conda channel:
conda install --override-channels --channel https://repo.anaconda.com/pkgs/snowflake/ snowflake-ml-python
Tip
When working with Snowpark ML, install packages from the Snowflake conda channel whenever possible to ensure that you receive packages that have been validated with Snowpark ML.
Installing Snowpark ML from PyPI¶
You can install the Snowpark ML package from the Python Package Index (PyPI) by using the standard Python package manager, pip.
Warning
Do not use this installation procedure if you are using a conda environment. Use the conda instructions instead.
Change to your project directory and activate your Python virtual environment:
cd ~/projects/ml
source .venv/bin/activate
Install the Snowpark ML package:
python -m pip install snowflake-ml-python
Installing Optional Modeling Dependencies¶
Some Snowpark ML Modeling APIs require dependencies that are not installed as dependencies of Snowpark ML. The scikit-learn and xgboost packages are installed by default when you install Snowpark ML Modeling, but lightgbm is an optional dependency. If you plan to use classes in the snowflake.ml.modeling.lightgbm namespace, install lightgbm yourself.
Use the following commands to activate your conda environment and install lightgbm from the Snowflake conda channel.
conda activate snowpark-ml
conda install --override-channels --channel https://repo.anaconda.com/pkgs/snowflake/ lightgbm
Use the following commands to activate your virtual environment and install lightgbm using pip.
source .venv/bin/activate
python -m pip install 'snowflake-ml-python[lightgbm]'
Snowflake might add additional optional dependencies to Snowpark ML from time to time. To install all optional dependencies using pip:
source .venv/bin/activate
python -m pip install 'snowflake-ml-python[all]'
Setting Up Snowpark Python¶
Snowpark Python is a dependency of Snowpark ML and is installed automatically with Snowpark ML. If Snowpark Python is not set up on your system, you might need to perform additional configuration steps. See Setting Up Your Development Environment for Snowpark Python for Snowpark Python setup instructions.
Connecting to Snowflake¶
Snowpark ML requires that you connect to Snowflake using a Snowpark Session object. Use the SnowflakeLoginOptions function in the snowflake.ml.utils.connection_params module to get the configuration settings to create the session. The function can read the connection settings from a named connection in your SnowSQL configuration file or from environment variables that you set. It returns a dictionary containing these parameters, which you can use to create a session.
The following examples read the connection parameters from the named connection myaccount in the SnowSQL configuration file. To create a Snowpark Python session, create a builder for the Session class, and pass the connection information to the builder’s configs method:
from snowflake.snowpark import Session
from snowflake.ml.utils import connection_params
params = connection_params.SnowflakeLoginOptions("myaccount")
sp_session = Session.builder.configs(params).create()
You can now pass the session to any Snowpark ML function that needs it.
Tip
To create a Snowpark Python session from an existing Snowflake Connector for Python connection, pass the connection object to the session builder. Here, connection is the Snowflake Connector for Python connection.
session = Session.builder.configs({"connection": connection}).create()
Specifying a Warehouse¶
Many parts of Snowpark ML, such as training a model or running inference, run code in a Snowflake warehouse. These operations run in the warehouse specified by the session you use to connect. For example, if you create a session from a named connection in your SnowSQL configuration file, you can specify a warehouse using the warehousename parameter in the named connection.
You can add the warehouse setting when creating the Session object, as shown here, if it does not already exist in the configuration.
from snowflake.snowpark import Session
from snowflake.ml.utils import connection_params
# Get named connection from SnowSQL configuration file
params = connection_params.SnowflakeLoginOptions("myaccount")
# Add warehouse name for model method calls if it's not already present
if "warehouse" not in params:
    params["warehouse"] = "mlwarehouse"
sp_session = Session.builder.configs(params).create()
If no warehouse is specified in the session, or if you want to use a different warehouse, call the session’s use_warehouse method:
sp_session.use_warehouse("mlwarehouse")
Cost Considerations¶
When you train and use models in Snowflake, you run code in a virtual warehouse, which incurs compute costs. These costs vary depending on the type of model and the quantity of data used in training and prediction. See Understanding compute cost for general information about Snowflake compute costs.
Further Reading¶
See the following resources for information about Snowpark ML Modeling and Snowpark ML Ops.
End-to-End ML Workflows
Modeling
Data Access
ML Ops
Contact your Snowflake representative for early access to documentation on other features currently under development.
API Reference¶
The Snowpark ML API reference includes documentation on all publicly released functionality. You can also obtain detailed API documentation for any class by using Python’s help function in an interactive Python session. For example:
from snowflake.ml.modeling.preprocessing import OneHotEncoder
help(OneHotEncoder)