Model Training and Inference¶
Generating Tables for Training¶
You can generate a training data set with the feature store’s generate_training_set method, which enriches a Snowpark DataFrame that contains the source data with the derived feature values. To select a subset of features from a feature view, use fv.slice. For time-series features, provide the timestamp column name to automate the point-in-time feature value lookup.
training_set = fs.generate_training_set(
spine_df=MySourceDataFrame,
features=[registered_fv],
save_as="data_20240101", # optional
spine_timestamp_col="TS", # optional
spine_label_cols=["LABEL1", "LABEL2"], # optional
include_feature_view_timestamp_col=False, # optional
)
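For example, to request only some of the features defined in a feature view, pass a sliced view in the features list. This is a sketch; the feature names FEATURE_A and FEATURE_B are hypothetical:

```python
# Request only a subset of the registered feature view's features
training_set = fs.generate_training_set(
    spine_df=MySourceDataFrame,
    features=[registered_fv.slice(["FEATURE_A", "FEATURE_B"])],
    spine_timestamp_col="TS",
)
```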
Note
Here, the spine_df (MySourceDataFrame) is a DataFrame containing the entity IDs in the source data, the timestamp, label columns, and any additional columns containing training data. The requested features are retrieved for the list of entity IDs, with point-in-time correctness with respect to the provided timestamp.
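The point-in-time lookup can be illustrated outside of Snowflake with a plain pandas merge_asof, which has the same semantics: each spine row is joined with the most recent feature value whose timestamp does not exceed the spine timestamp. This is an illustrative sketch only, not the Feature Store implementation:

```python
import pandas as pd

# Spine: entity IDs, timestamps, and labels
spine = pd.DataFrame({
    "ENTITY_ID": [1, 1],
    "TS": pd.to_datetime(["2024-01-05", "2024-01-10"]),
    "LABEL1": [0, 1],
})

# Time-series feature values for the same entity
features = pd.DataFrame({
    "ENTITY_ID": [1, 1, 1],
    "TS": pd.to_datetime(["2024-01-01", "2024-01-04", "2024-01-09"]),
    "F1": [10.0, 20.0, 30.0],
})

# Point-in-time join: for each spine row, take the latest feature value
# with feature TS <= spine TS
training_set = pd.merge_asof(
    spine.sort_values("TS"),
    features.sort_values("TS"),
    on="TS",
    by="ENTITY_ID",
    direction="backward",
)
print(training_set["F1"].tolist())  # [20.0, 30.0]
```

The backward direction ensures that no feature value from the future leaks into a training row.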
Training sets are ephemeral by default, meaning they only exist as Snowpark DataFrames and are not materialized.
To materialize the training set to a table, pass the save_as argument a valid name for a table that does not yet exist. The training set is then written to the newly created table.
Materialized tables currently don’t guarantee immutability and have limited metadata support. If you require these features, consider using Snowflake Datasets instead.
Note
The generate_training_set API is available only in snowflake-ml-python version 1.5.4 or later. For earlier versions, use generate_dataset with the argument output_type="table".
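If your code must run against multiple snowflake-ml-python versions, you can branch on the installed version before choosing the API. A minimal sketch (the helper name is hypothetical, and it assumes a plain X.Y.Z version string):

```python
def supports_generate_training_set(installed_version: str) -> bool:
    """Return True if generate_training_set is available (requires >= 1.5.4)."""
    # Compare the first three numeric components as a tuple
    parts = tuple(int(p) for p in installed_version.split(".")[:3])
    return parts >= (1, 5, 4)

print(supports_generate_training_set("1.5.3"))  # False
print(supports_generate_training_set("1.6.0"))  # True
```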
Generating Datasets for Training¶
You can generate a Snowflake Dataset with the feature store’s generate_dataset method. Its signature is similar to that of generate_training_set; the key differences are the required name argument, the optional version argument, and additional metadata fields. generate_dataset always materializes the result.
Snowflake Datasets provide an immutable, file-based snapshot of data, which helps to ensure model reproducibility and efficient data ingestion for large datasets and/or distributed training. Datasets also have expanded metadata support for easier discoverability and consumption.
dataset: Dataset = fs.generate_dataset(
name="MY_DATASET",
spine_df=MySourceDataFrame,
features=[registered_fv],
version="v1", # optional
spine_timestamp_col="TS", # optional
spine_label_cols=["LABEL1", "LABEL2"], # optional
include_feature_view_timestamp_col=False, # optional
desc="my new dataset", # optional
)
Note
For more information on datasets, see Snowflake Datasets.
Model Training¶
After creating a training data set, you can pass it to your model when training:
# training_set is a Snowpark DataFrame
my_model = train_my_model(training_set)
# Datasets can be easily converted into Snowpark DataFrames
my_model = train_my_model(dataset.read.to_snowpark_dataframe())
The model can then be logged in the Snowflake Model Registry.
Retrieving Features and Making Predictions¶
If you have created a model in your Python session, you can retrieve the features from the feature store and pass them to your model for prediction, as shown here. You can exclude specific columns using the exclude_columns argument or include the timestamp column from the feature view by setting include_feature_view_timestamp_col.
prediction_df: snowpark.DataFrame = fs.retrieve_feature_values(
spine_df=prediction_source_dataframe,
features=[registered_fv],
spine_timestamp_col="TS",
exclude_columns=[],
)
# predict with your previously-trained model
my_model.predict(prediction_df)