Snowpark ML Model Registry

The Snowpark ML model registry stores Python ML models so they can easily be found and used by others. You can create your own registries in Snowflake, and store and maintain your models, using Snowpark Python and Snowpark ML. Registries are Snowflake databases.

Supported model types include:

  • Snowpark ML

  • scikit-learn

  • XGBoost

  • PyTorch

  • TensorFlow

  • MLFlow

Tip

A version of the Snowpark ML model registry is now available to the public (see Snowflake Model Registry). However, that version does not yet support deploying models to Snowpark Container Services compute pools. If you are using this functionality, please feel free to continue using the private preview version of the model registry described in this topic.

Model APIs

Snowpark ML includes two separate APIs for working with models.

  • A relational API where model operations are methods of the registry object, and you provide a name and version to these methods to specify the model to be operated upon.

  • An object API where operations on a specific model are methods of a ModelReference object obtained from the registry.

Operations performed through these two APIs are equivalent; use the one you find most convenient. In general, the object API is more convenient if you already have a reference to the model. The relational API is more convenient if you are, for example, reading model names and versions from a file and performing some operation on those models, such as updating their metadata.

You can convert calls to the object API into calls to the relational API by calling the method of the same name on the registry and passing the model name and versions. For example, the two calls below are equivalent.

# set a tag on a model using the object API when you have a reference to it
model.set_tag(tag_name="stage", tag_value="production")

# set a tag on a model using the relational API when you have its name and version
registry.set_tag(model_name="my_model", model_version="102", tag_name="stage", tag_value="production")

The code examples in this topic use the object API when working with models. It is likely we will choose one API or the other for the public release of the model registry.

Examples

A Jupyter notebook containing example code for the model registry is available in the examples subfolder of the releases folder that we shared with you.

Installing the Snowpark ML Library

See Installing Snowpark ML for instructions for installing Snowpark ML.

Connecting to Snowflake

The model registry connects to Snowflake using a Snowpark session, which you can create in several ways, including by passing all the configuration settings in your Python code.

A better way to create a Snowpark session is to have Snowpark read connection settings from the SnowSQL configuration file located at ~/.snowsql/config. This approach avoids exposing your connection settings, including your password, in your code. If you have already added one or more connections to this file, you can use them with the model registry, or you can add a new named connection specifically for use with Snowpark ML. For more information on adding connection settings to the SnowSQL configuration file, see Configuring default connection settings.

In your client code, you can log in with the default SnowSQL connection using the SnowflakeLoginOptions utility class as shown here.

from snowflake.snowpark import Session
from snowflake.ml.utils.connection_params import SnowflakeLoginOptions

session = Session.builder.configs(SnowflakeLoginOptions()).create()

To connect to Snowflake using a named SnowSQL connection, specify the connection's name as shown here.

session = Session.builder.configs(
    SnowflakeLoginOptions(connection_name="snowpark-ml")).create()

In either case, you’ll use the resulting session object when creating or opening a registry.

Tip

When using the registry in a stored procedure, the stored procedure session can be used as your Snowpark session.

Required Privileges

Creating a registry requires the following privilege if the database does not already exist:

  • CREATE DATABASE global privilege

If the database does exist, the following privileges are necessary to use it as a model registry.

  • USAGE on the registry database

  • USAGE on the registry’s PUBLIC schema

  • CREATE TABLE on the registry’s PUBLIC schema

  • CREATE VIEW on the registry’s PUBLIC schema

  • SELECT on all tables in the registry’s PUBLIC schema

Using a registry (adding and working with models) requires the privileges below.

  • INSERT on all tables in the registry’s PUBLIC schema

  • SELECT on all views in the registry’s PUBLIC schema

  • CREATE STAGE on the registry’s PUBLIC schema

Creating the Model Registry

To create a model registry, use the create_model_registry function, passing it your Snowpark session.

from snowflake.ml.registry import model_registry

result = model_registry.create_model_registry(session=session, database_name="MODEL_REGISTRY")

The database name is optional. If you do not specify it, MODEL_REGISTRY is the default. By using different database names, you can create multiple registries in your account for access control, lifecycle management, or other purposes.

create_model_registry returns True if the registry was successfully created, or False if it was not. It is not an error to create a registry more than once, although you will receive a warning.

Getting a Reference to the Model Registry

Before you can create or modify models in the registry, you must obtain a reference to the registry.

registry = model_registry.ModelRegistry(session=session, database_name="MODEL_REGISTRY")

As with registry creation, the database name is optional; the default value MODEL_REGISTRY is used if you do not specify it.

You use the registry object to perform registry operations (such as registering models) and, optionally, model operations.

Registering a Model

Add a model to the registry by calling the registry’s log_model method. This method:

  • Serializes the model and uploads it to a Snowflake stage. The model, a Python object, must be serializable (“pickleable”).

  • Creates an entry in the model registry for the model, referencing the staged location.

  • Adds metadata such as description and tags to the model as specified in the log_model call.

In the example below, clf, short for “classifier,” is the model, assumed to already have been created elsewhere. You can add a name and tags at registration time, as shown here. The combination of name and version must be unique for each model in the registry.

model_id = registry.log_model(model=clf,
                model_name="my_model",
                model_version="101",
                conda_dependencies=["scikit-learn"],
                description="My awesome ML model",
                tags={"stage": "testing", "classifier_type": "svm.SVC",
                    "svc_gamma": svc_gamma, "svc_C": svc_C},
                sample_input_data=train_features
                )

The arguments shown here are described below.

Required arguments:

  • model: The Python model object of a supported model type. Must be serializable (“pickleable”).

  • model_name: The model’s name, used with model_version to identify the model in the registry. The name cannot be changed after the model has been added.

  • model_version: String specifying the model’s version, used with model_name to identify the model in the registry. The version cannot be changed after the model has been added.

Optional arguments:

  • code_paths: Path to a directory of code to import when loading or deploying the model.

  • conda_dependencies: List of Conda packages required by your model. This argument specifies package names and optional versions in Conda format, that is, "[channel::]package [operator version]". If you do not specify a channel, the Snowflake channel is assumed.

  • description: Model description.

  • options: Dictionary containing options for model creation. Currently only one option is recognized: embed_local_ml_library, which specifies whether to embed a copy of the local Snowpark ML library into the model. Default: False.

  • pip_requirements: List of package specs for PyPI packages required by your model. Models with pip requirements can be deployed only to Snowpark Container Services (SPCS) compute pools, not to Snowflake warehouses.

  • sample_input_data: A Snowpark DataFrame containing sample input data. The feature names required by the model, and their types, are extracted from this DataFrame. Either this argument or signatures must be provided for all models except Snowpark ML and MLFlow models.

  • signatures: Model signatures as a mapping from target method name to signatures of input and output. Either this argument or sample_input_data must be provided for all models except Snowpark ML and MLFlow models.

  • tags: Dictionary containing metadata used to record a model’s purpose, algorithm, training data set, lifecycle stage, or other information you choose.

Note

The combination of model name and version must be unique in the registry.

log_model returns the new model’s ID, an opaque string identifier assigned by the registry to identify a specific version of a model. The ID is used internally by the registry and is not needed by client code.

Once registered, the model itself is immutable (although you can change its metadata in the registry). To update a model, delete the old version and register the new version.
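Because a registered model is immutable, "updating" one is a two-step operation. The steps above can be wrapped in a small helper; the function name is ours, and the registry methods used (delete_model, log_model) are the ones described in this topic:

```python
def update_model(registry, model_object, model_name, old_version, new_version):
    """Replace a registered model: registered models are immutable, so
    updating means deleting the old version and logging the new model
    object under a fresh version string."""
    registry.delete_model(model_name=model_name, model_version=old_version)
    return registry.log_model(model=model_object,
                              model_name=model_name,
                              model_version=new_version)
```

Keep in mind that tags and metrics set on the old version are not carried over automatically; copy any metadata you want to keep before deleting.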

Listing Models in the Registry

To get a list of the models in the registry:

model_list = registry.list_models()

The model list is a Snowpark DataFrame, so you can easily choose which columns you want as well as filter and sort the list as desired. The most important columns are shown below. (A few additional columns are present but not currently used.)

  • CREATION_ROLE (string): Name of the role used to create the model.

  • CREATION_TIME (timestamp): The date and time at which the model was created.

  • ID (string): The unique ID assigned to this model by the registry.

  • NAME (string): The model’s name.

  • VERSION (string): The model’s version.

  • DESCRIPTION (string): The model’s description.

  • METRICS (variant): Mapping of the model’s metric names to values.

  • TAGS (variant): Mapping of the model’s tag names to values.

For example, to obtain the model list sorted with the most recently created models first:

model_list.select("ID", "NAME", "CREATION_TIME", "TAGS").order_by(
    "CREATION_TIME", ascending=False).show()

Or to filter by model ID:

model_list.filter(model_list["ID"] == model_id).select(
        "NAME", "TAGS", "METRICS").show()

In both examples, only the columns specified in the select method call are retrieved.

Getting a Reference to a Model

After a model has been registered, you can get a reference to it by creating a ModelReference from its name and version.

model = model_registry.ModelReference(registry=registry, model_name="my_model", model_version="101")

You can use this model reference to make changes to the model’s metadata, to deploy the model, and to manage its deployments.

Viewing and Updating a Model’s Metadata

You can view and update a model’s metadata attributes in the registry, including its description, tags, and metrics.

Viewing and Updating a Model’s Description

Use the model’s get_model_description and set_model_description methods to view and update the model’s description.

print(model.get_model_description())

model.set_model_description("A better description than the one I provided originally")

Viewing and Updating a Model’s Tags

Tags are metadata used to record a model’s purpose, algorithm, training data set, lifecycle stage, or other information you choose. You can set tags when the model is registered or at any time afterward. You can also update the values of existing tags or remove tags entirely, as shown here.

# get all tags
print(model.get_tags())

# add tag or set new value of existing tag
model.set_tag("minor_version", "1")

# remove tag
model.remove_tag("minor_version")

One important use of tags is to manage the lifecycle of a model. Here we use a tag called “stage” for this purpose.

# Set model stage
model.set_tag("stage", "prod")

You can then get a list of all models and their stage as follows.

from snowflake.snowpark import functions

# List models and their tags
lm = registry.list_models()
lm.select('NAME', 'VERSION', functions.parse_json(lm["TAGS"]).getField("stage")).show()
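Because tags are free-form strings, nothing prevents a typo such as "prodcution" from slipping into your lifecycle tag. If a tag drives your deployment process, a small guard function can restrict it to known values. A minimal sketch; the helper name and the set of stage names are illustrative:

```python
ALLOWED_STAGES = {"testing", "staging", "prod"}   # illustrative stage names

def set_stage(model, stage):
    """Set the lifecycle tag on a model reference, rejecting unknown stages."""
    if stage not in ALLOWED_STAGES:
        raise ValueError(
            f"unknown stage {stage!r}; expected one of {sorted(ALLOWED_STAGES)}")
    model.set_tag("stage", stage)
```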

Model references support the following methods for working with tags.

  • get_tags(): Retrieves all of the model’s tags and their values as a Python dictionary.

  • get_tag_value(tag_name): Retrieves the value of the specified tag.

  • has_tag(tag_name): Returns True if the model has the specified tag, False if not.

  • remove_tag(tag_name): Removes the specified tag from the model.

  • set_tag(tag_name, tag_value): Sets the specified tag’s value, creating the tag if necessary.
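The has_tag and get_tag_value methods combine naturally into a lookup with a default, which avoids relying on how get_tag_value behaves for a missing tag. A sketch; the helper name is ours:

```python
def get_tag_or_default(model, tag_name, default=None):
    """Return the tag's value if the model has the tag, otherwise a default."""
    if model.has_tag(tag_name):
        return model.get_tag_value(tag_name)
    return default
```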

Viewing and Updating a Model’s Metrics

Metrics are metadata used to track prediction accuracy and other characteristics of a model. The registry supports both scalars and more complex objects. Use the model’s set_metric method to set metrics. The following examples illustrate the use of a scalar, a dictionary, and a two-dimensional NumPy array as metrics.

The test accuracy metric might be generated using sklearn’s accuracy_score:

from sklearn import metrics

test_accuracy = metrics.accuracy_score(test_labels, prediction)

The confusion matrix can be generated similarly using sklearn:

test_confusion_matrix = metrics.confusion_matrix(test_labels, prediction)

Then we can set these values as metrics as follows.

# scalar metric
model.set_metric("test_accuracy", test_accuracy)

# hierarchical (dictionary) metric
model.set_metric("dataset_test", {"accuracy": test_accuracy})

# multivalent (matrix) metric
model.set_metric("confusion_matrix", test_confusion_matrix)

To view a model’s metrics, use get_metrics.

print(model.get_metrics())

Model references support the following methods for working with metrics.

  • get_metrics(): Retrieves all of the model’s metrics and their values as a Python dictionary.

  • get_metric_value(metric_name): Retrieves the value of the specified metric.

  • has_metric(metric_name): Returns True if the model has the specified metric, False if not.

  • remove_metric(metric_name): Removes the specified metric from the model.

  • set_metric(metric_name, metric_value): Sets the specified metric’s value, creating the metric if necessary.
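One pattern these methods support is keeping a "best so far" metric across retraining runs, updating the stored value only when a new run improves on it. A minimal sketch; the helper and metric names are illustrative:

```python
def record_best_accuracy(model, accuracy, metric_name="best_accuracy"):
    """Update the metric only when the new value improves on the stored one."""
    if (not model.has_metric(metric_name)
            or accuracy > model.get_metric_value(metric_name)):
        model.set_metric(metric_name, accuracy)
```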

Deploying and Using Models from the Registry

Models can be deployed to a Snowflake warehouse or to a Snowpark Container Services (SPCS) compute pool.

It is also possible to use a model locally.

Using a Model in a Snowflake Warehouse

To deploy a model to a warehouse, use the model’s deploy method. The registry generates a user-defined function (UDF), using a name you provide, that invokes the model’s predict method.

model.deploy(deployment_name="my_warehouse_predict",
             target_method="predict",    # the name of the model's method, usually predict
             permanent=True)

If you do not specify the permanent argument as shown above, the deployment is temporary and will be removed automatically when your session ends (for example, when you close your Jupyter notebook, or when your Python script exits).
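Temporary deployments suit quick experiments: deploy, predict, and let the session clean up. The two calls can be combined in a small helper; the function and deployment names are illustrative:

```python
def quick_predict(model, input_df, deployment_name="temp_predict"):
    """Deploy the model temporarily (no permanent=True) and run one
    prediction; the deployment disappears when the session ends."""
    model.deploy(deployment_name=deployment_name, target_method="predict")
    return model.predict(deployment_name, input_df)
```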

After the model has been deployed, you can use it by calling the model’s predict method with the deployment name you specified when you deployed it.

result_dataframe = model.predict("my_warehouse_predict", test_dataframe)

Using a Model in an SPCS Compute Pool

Before you can deploy your model to a Snowpark Container Services (SPCS) compute pool, you must have a Docker client installed.

To deploy a model to an SPCS compute pool, use code like the following. Note that you must specify a compute pool name, and the specified compute pool must already exist.

from snowflake.ml.model import deploy_platforms

model.deploy(deployment_name="my_spcs_predict",
              platform=deploy_platforms.TargetPlatform.SNOWPARK_CONTAINER_SERVICES,
              target_method="predict",
              options={
                "compute_pool": "my_pool",
              })

Tip

You can optionally set the min_instances and max_instances keys in the options argument to control how many instances of the SPCS service can run simultaneously. Both options default to 1.

After the model has been deployed, you can use it by calling the model’s predict method with the deployment name you specified when you deployed it.

result_dataframe = model.predict("my_spcs_predict", test_dataframe)

Deployment Arguments and Options

A complete list of available arguments for the deploy method is shown here.

Required arguments:

  • model_name: The model’s name, used with model_version to identify the model in the registry.

  • model_version: String specifying the model’s version, used with model_name to identify the model in the registry.

  • deployment_name: The name of the generated user-defined function.

Optional arguments:

  • options: Dictionary containing options for model deployment, described below.

  • permanent: If True, the deployed model remains in place when the current Python session ends. Default is False. Note that SPCS deployments are always permanent.

Options for all deployments

All of the following options are optional.

  • output_with_input_features: If True, include the input columns in the output. Default is False.

  • keep_order: If True, preserve the row order in the output. Applies only to DataFrames containing fewer than 2^64 rows. Default is True.

Options for warehouse deployments

All of the following options are optional.

  • permanent_udf_stage_location: Stage location where the UDF should be stored when the permanent argument is True.

  • relax_version: If True, allow the version constraints of dependencies to be relaxed slightly. May cause errors if the selected dependencies are not fully compatible. Defaults to False.

  • replace_udf: If True, a newly generated UDF can replace an existing UDF of the same name. Defaults to False.
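For a permanent warehouse deployment, permanent_udf_stage_location and replace_udf are often useful together, so that redeploying under the same name overwrites the previous UDF in a known stage. A sketch; the helper name and stage path are illustrative:

```python
def deploy_permanent(model, deployment_name, stage_location):
    """Permanent warehouse deployment whose UDF lives in a fixed stage and
    can be overwritten by a later redeploy of the same name."""
    model.deploy(deployment_name=deployment_name,
                 target_method="predict",
                 permanent=True,
                 options={"permanent_udf_stage_location": stage_location,
                          "replace_udf": True})
```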

Options for SPCS compute pool deployments

Required:

  • compute_pool: Compute pool name.

Optional:

  • endpoint: The name of the endpoint that the service function will communicate with. This option is useful when the service has multiple endpoints. Defaults to predict.

  • image_repo: SnowService image repository path in the format image_registry/database/schema/repository. By default, inferred from session information.

  • max_instances: Maximum number of service replicas. Default is 1.

  • min_instances: Minimum number of service replicas. Default is 1.

  • num_workers: Number of workers used for model inference. Make sure the number of workers multiplied by the size of the model does not exceed available memory. Default is twice the number of CPU cores plus 1.

  • prebuilt_snowflake_image: Specifies a previously built Docker image to be used as-is; no image is built. This option is for users who often use a single image for multiple use cases. In this situation, you can have the initial deployment build an image (the name of the image is logged to the console), then reuse that image for additional deployments by passing its name here. Default: None.

  • use_gpu: When True, a CUDA-enabled Docker image is used to provide a runtime CUDA environment. Default is False.

Managing Deployments

Model references support the following methods for managing deployments.

  • delete_deployment(deployment_name): Deletes the specified deployment.

  • get_deployment(deployment_name): Gets information about the specified deployment. The result is a Snowpark DataFrame containing attributes of the deployment.

  • list_deployments(): Gets a list of the names of all deployments of the model.

For example, to delete a deployment named my_deployment, use:

model.delete_deployment(deployment_name="my_deployment")
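These methods combine naturally when cleaning up, for example removing every deployment of a model before deleting the model itself. A sketch, assuming list_deployments yields deployment names as described above (the helper name is ours):

```python
def delete_all_deployments(model):
    """Remove every deployment of the given model reference."""
    for name in list(model.list_deployments()):
        model.delete_deployment(deployment_name=name)
```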

Using a Model Locally

Tip

It is convenient to use a model locally when you first develop and test it. As the model grows, however, it will eventually exceed the capacity of your local system. At this point, deploy it to a warehouse or SPCS compute pool and run it there instead.

To use a model locally, you first deserialize (“unpickle”) the model from the registry in your own Python code using the model’s load_model method. You can then call the model’s predict method with your test data.

For the deserialization operation to be successful, the target Python environment must be very similar to the one originally used to add the model to the registry. Ideally, the environment should include not only the same version of Python, but of every dependency used by the model. Other versions of some dependencies (especially later point releases) may also be compatible.
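A cheap way to catch an environment mismatch early is to compare the local Python version against the version used to train the model (which you could, for example, record as a tag at registration time). A minimal sketch; the version format and error message are ours:

```python
import sys

def assert_python_version(expected):
    """Raise before unpickling if the local interpreter differs from the
    one used to train the model (major.minor comparison only)."""
    actual = f"{sys.version_info.major}.{sys.version_info.minor}"
    if actual != expected:
        raise RuntimeError(f"model expects Python {expected}, running {actual}")
```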

Tip

Use a virtual environment and a requirements.txt file to manage your Python dependencies so you can easily recreate the original environment.

The following is an example of deserializing and using a model from the registry in Python.

clf = model.load_model()
results = clf.predict(test_features)    # Snowpark DataFrame

Deleting Models

You can delete models using the registry’s delete_model method. By default, the model itself is deleted from the stage, not just its entry in the registry; pass delete_artifact = False to keep the model.

# delete from registry and also delete the model itself
registry.delete_model(model_name="my_model", model_version="100")

# delete from registry but keep the model
registry.delete_model(model_name="my_model", model_version="100", delete_artifact=False)

Existing references to a model are no longer valid after you delete the model from the registry.

Auditing Model and Registry Changes

The registry maintains a history of changes made to both the registry itself and to the models in it. To see this information, use registry.get_history or model.get_model_history. Both methods return a Snowpark DataFrame, which can be sorted and filtered as desired.

# print complete registry history
registry.get_history().show()

# print history of metadata changes to a model
model.get_model_history().show()