Snowflake ML Model Development

Snowflake supports using open-source packages. You can bring existing Python modeling code or start from scratch with Snowflake Notebooks. With Snowflake Notebooks on Container Runtime, you can leverage Snowflake ML capabilities such as the following to build a comprehensive modeling workflow:

  • Feature Store

  • Distributed ML APIs

  • Model Registry

Tip

See Introduction to Machine Learning for an example of an end-to-end ML workflow, including the modeling API.

Developing models

Container Runtime for ML, available in Snowflake Notebooks on Container Runtime, lets you use popular open-source ML packages with your Snowflake data on one or more CPU or GPU nodes. Because everything runs inside the Snowflake cloud, the entire ML workflow remains secure and governed.

With optimized data loading APIs, you can accelerate data ingestion and speed up your OSS-based workflows. For large-scale modeling and tuning, where you might run into resource constraints with standard OSS libraries, the Container Runtime extends popular ML frameworks (such as XGBoost, LightGBM, and PyTorch) to automatically scale out to available resources.
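As an illustration, here is a minimal sketch of this pattern using the data connector in snowflake.ml.data to feed an ordinary OSS model; the FEATURES table and LABEL column are hypothetical:

```python
# A minimal sketch: load a Snowflake table with the optimized data
# connector, then train an open-source XGBoost model on it.
# Assumes an active Snowpark session and a hypothetical FEATURES table.
from snowflake.snowpark.context import get_active_session
from snowflake.ml.data.data_connector import DataConnector
import xgboost as xgb

session = get_active_session()

# Wrap a Snowpark DataFrame in a DataConnector for accelerated ingestion.
df = session.table("FEATURES")  # hypothetical table name
connector = DataConnector.from_dataframe(df)

# Materialize as pandas and train with ordinary OSS XGBoost.
pdf = connector.to_pandas()
X, y = pdf.drop(columns=["LABEL"]), pdf["LABEL"]  # hypothetical LABEL column
model = xgb.XGBClassifier(n_estimators=100)
model.fit(X, y)
```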

For more information, see Getting Started with Snowflake Notebook Container Runtime, which provides a simple ML workflow that leverages the optimized data loading APIs and the container runtime.

APIs for ML Development

You can use either open-source software (OSS) or Snowflake proprietary APIs for ML development. For most ML workflow development tasks, such as feature engineering, data pre-processing, and model training, you can use OSS packages.

If you’re running into performance issues with the open-source packages, Snowflake provides a set of APIs that take advantage of the Container Runtime’s scalable processing.

Distributed Preprocessing

During feature engineering, you can use distributed processing to overcome the compute limitations of open-source preprocessing libraries. For example, Snowflake’s preprocessing APIs follow the scikit-learn interface while distributing the work across Snowflake’s compute resources.
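The following is a minimal sketch of this pattern; the FEATURES table and the AGE and INCOME columns are hypothetical:

```python
# A minimal sketch of distributed preprocessing, assuming an active
# Snowpark session and hypothetical numeric columns in a FEATURES table.
from snowflake.snowpark.context import get_active_session
from snowflake.ml.modeling.preprocessing import StandardScaler

session = get_active_session()
df = session.table("FEATURES")  # hypothetical table

# The scaler mirrors the scikit-learn interface but fits and transforms
# inside Snowflake, so the data never leaves the platform.
scaler = StandardScaler(
    input_cols=["AGE", "INCOME"],
    output_cols=["AGE_SCALED", "INCOME_SCALED"],
)
scaled_df = scaler.fit(df).transform(df)
```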

Distributed Training

For model training, use OSS packages when possible. For large datasets or compute-intensive tasks where vertical scaling isn’t enough, use the distributed training APIs in the Container Runtime. For more information, see Distributed Modeling APIs.
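For illustration, here is a sketch of distributed XGBoost training; the XGBEstimator and XGBScalingConfig names follow the distributed modeling APIs, but treat the exact parameter names as assumptions, and the table and column names are hypothetical:

```python
# A sketch of scaling out XGBoost training on Container Runtime.
from snowflake.snowpark.context import get_active_session
from snowflake.ml.data.data_connector import DataConnector
from snowflake.ml.modeling.distributors.xgboost import (
    XGBEstimator,
    XGBScalingConfig,
)

session = get_active_session()
connector = DataConnector.from_dataframe(session.table("FEATURES"))

# The estimator keeps familiar XGBoost parameters but distributes
# training across the nodes available to the runtime.
estimator = XGBEstimator(
    n_estimators=500,
    objective="binary:logistic",
    scaling_config=XGBScalingConfig(use_gpu=False),
)
model = estimator.fit(
    connector,
    input_cols=["AGE", "INCOME"],  # hypothetical feature columns
    label_col="LABEL",             # hypothetical label column
)
```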

Distributed Hyperparameter Optimization

For model tuning, you can use any OSS library. To scale out, leverage the Container Runtime’s distributed hyperparameter optimization API, which supports random and Bayesian search. For more information, see Parallel Hyperparameter Optimization (HPO) on Container Runtime for ML.
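The following sketch shows the general shape of the distributed HPO pattern; the tune module names below (Tuner, TunerConfig, get_tuner_context, uniform) follow the HPO documentation, but treat exact signatures as assumptions:

```python
# A sketch of distributed HPO on Container Runtime; exact names and
# signatures are assumptions based on snowflake.ml.modeling.tune.
from snowflake.ml.modeling import tune

def train_func():
    # Each trial reads its sampled hyperparameters and datasets from the
    # tuner context, trains a model, and reports the metric to optimize.
    ctx = tune.get_tuner_context()
    params = ctx.get_hyper_params()
    dataset_map = ctx.get_dataset_map()
    # ... train a model with `params` on dataset_map["train"] ...
    ctx.report(metrics={"rmse": 0.0}, model=None)  # placeholder report

tuner = tune.Tuner(
    train_func=train_func,
    search_space={"learning_rate": tune.uniform(0.01, 0.3)},
    tuner_config=tune.TunerConfig(metric="rmse", mode="min", num_trials=20),
)
results = tuner.run(dataset_map={"train": connector})  # a DataConnector
```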

Snowpark ML Modeling

Snowflake recommends starting with any OSS package and using distributed Container Runtime APIs to scale out.

Prior to the availability of the Container Runtime, Snowflake also supported wrappers over scikit-learn, xgboost, and lightgbm, most of which execute as stored procedures in a single virtual warehouse node. These APIs are in support-mode only.

All of these modeling methods are found in the snowflake.ml.modeling namespace. The namespace also includes a distributed variant of hyperparameter optimization that executes in virtual warehouses, providing distributed implementations of the scikit-learn GridSearchCV and RandomizedSearchCV APIs.
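As a sketch, the warehouse-based GridSearchCV can wrap an OSS estimator as follows; the table and column names are hypothetical:

```python
# A sketch of the warehouse-based wrappers in snowflake.ml.modeling,
# including the distributed GridSearchCV variant.
from snowflake.snowpark.context import get_active_session
from snowflake.ml.modeling.model_selection import GridSearchCV
from xgboost import XGBClassifier

session = get_active_session()
train_df = session.table("FEATURES")  # hypothetical table

grid_search = GridSearchCV(
    estimator=XGBClassifier(),
    param_grid={"n_estimators": [100, 300], "learning_rate": [0.1, 0.3]},
    input_cols=["AGE", "INCOME"],      # hypothetical feature columns
    label_cols=["LABEL"],              # hypothetical label column
    output_cols=["PREDICTED_LABEL"],
)
grid_search.fit(train_df)      # runs inside a virtual warehouse
predictions = grid_search.predict(train_df)
```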

Partitioned Custom Models

The model registry also supports a special type of custom model where fit and inference are executed in parallel for a set of partitions. This can be a performant way to train many models at once from a single dataset and immediately run inference with them. For more information, see Using partitioned models.
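As an illustration, the following is a minimal sketch of such a model; the VALUE column and PerSegmentModel name are hypothetical, and the decorator name follows the model registry documentation, so treat it as an assumption:

```python
# A sketch of a partitioned custom model: predict() is called once per
# partition, so each partition can train or apply its own model.
import pandas as pd
from snowflake.ml.model import custom_model

class PerSegmentModel(custom_model.CustomModel):
    def __init__(self, context: custom_model.ModelContext) -> None:
        super().__init__(context)

    @custom_model.partitioned_inference_api
    def predict(self, input: pd.DataFrame) -> pd.DataFrame:
        # `input` holds every row of one partition; return the results for
        # that partition. Here: a trivial per-partition aggregate.
        return pd.DataFrame({"MEAN_VALUE": [input["VALUE"].mean()]})
```

Once logged to the Model Registry, such a model can be invoked with a partition column so Snowflake executes the partitions in parallel; see Using partitioned models for the full workflow.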