Model Training and Inference

Generating Tables for Training

You can generate a training data set with the feature store’s generate_training_set method, which enriches a Snowpark DataFrame that contains the source data with the derived feature values. To select a subset of features from a feature view, use fv.slice.
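For example, a sliced feature view can be passed anywhere a full feature view is accepted. This sketch assumes the registered feature view exposes features named FEATURE_A and FEATURE_B (hypothetical names):

```python
# Select only a subset of the feature view's features
# (feature names here are illustrative).
sliced_fv = registered_fv.slice(["FEATURE_A", "FEATURE_B"])

training_set = fs.generate_training_set(
    spine_df=MySourceDataFrame,
    features=[sliced_fv],
)
```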

For time-series features, provide the timestamp column name to automate the point-in-time feature value lookup.

training_set = fs.generate_training_set(
    spine_df=MySourceDataFrame,
    features=[registered_fv],
    save_as="data_20240101",                    # optional
    spine_timestamp_col="TS",                   # optional
    spine_label_cols=["LABEL1", "LABEL2"],      # optional
    include_feature_view_timestamp_col=False,   # optional
)

Note

Here, the spine_df (MySourceDataFrame) is a DataFrame containing the entity IDs in the source data, the timestamp, label columns, and any additional columns containing training data. The requested feature values are retrieved for the listed entity IDs, point-in-time correct with respect to the provided timestamp.

Training sets are ephemeral by default, meaning they exist only as Snowpark DataFrames and are not materialized. To materialize the training set to a table, pass the save_as argument a valid table name that does not already exist. The training set is then written to the newly created table.

Materialized tables currently don’t guarantee immutability and have limited metadata support. If you require these features, consider using Snowflake Datasets instead.

Note

The generate_training_set API is only available in snowflake-ml-python version 1.5.4 or later. For earlier versions, use generate_dataset with argument output_type="table".
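On earlier versions, the equivalent call might look like the following sketch (the table name is illustrative):

```python
# snowflake-ml-python < 1.5.4: materialize a training table via
# generate_dataset with output_type="table" instead of
# generate_training_set with save_as.
training_table = fs.generate_dataset(
    name="data_20240101",
    spine_df=MySourceDataFrame,
    features=[registered_fv],
    output_type="table",
)
```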

Generating Datasets for Training

You can generate a Snowflake Dataset with the feature store’s generate_dataset method. The method signature is similar to generate_training_set with the key differences being the required name argument, optional version argument, and additional metadata fields. generate_dataset always materializes the result.

Snowflake Datasets provide an immutable, file-based snapshot of data, which helps to ensure model reproducibility and efficient data ingestion for large datasets and/or distributed training. Datasets also have expanded metadata support for easier discoverability and consumption.

dataset: Dataset = fs.generate_dataset(
    name="MY_DATASET",
    spine_df=MySourceDataFrame,
    features=[registered_fv],
    version="v1",                               # optional
    spine_timestamp_col="TS",                   # optional
    spine_label_cols=["LABEL1", "LABEL2"],      # optional
    include_feature_view_timestamp_col=False,   # optional
    desc="my new dataset",                      # optional
)

Note

For more information on datasets, see Snowflake Datasets.

Model Training

After creating a training data set, you can pass it to your model when training:

# training_set is a Snowpark DataFrame
my_model = train_my_model(training_set)

# Datasets can be easily converted into Snowpark DataFrames
my_model = train_my_model(dataset.read.to_snowpark_dataframe())

The model can then be logged in the Snowflake Model Registry.
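A minimal sketch of logging the trained model, assuming an active Snowpark session (the model and version names are illustrative):

```python
from snowflake.ml.registry import Registry

# Log the trained model in the Snowflake Model Registry
# (session is an existing Snowpark Session).
reg = Registry(session=session)
model_version = reg.log_model(
    my_model,
    model_name="MY_MODEL",
    version_name="v1",
)
```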

Retrieving Features and Making Predictions

If you have trained a model in your Python session, you can retrieve features from the feature store and pass them to the model for prediction, as shown here. You can exclude specific columns using the exclude_columns argument or include the feature view's timestamp column by setting include_feature_view_timestamp_col.

prediction_df: snowpark.DataFrame = fs.retrieve_feature_values(
    spine_df=prediction_source_dataframe,
    features=[registered_fv],
    spine_timestamp_col="TS",
    exclude_columns=[],
)

# predict with your previously-trained model
my_model.predict(prediction_df)