Time-Series Forecasting (Snowflake Cortex ML Functions)¶
Note
Forecasting is part of the Snowflake Cortex ML function suite, which belongs to Snowflake Cortex, Snowflake's intelligent, fully managed AI and ML service.
Forecasting employs a machine learning algorithm to predict future data by using historical time series data. Time series forecasting produces univariate predictions of future data based on historical input data. A common use case is to forecast sales based on seasonality and other factors.
The historical data must include:
A timestamp column, which should have a fixed frequency (for example, hourly, every 5 minutes, and so on).
A target column representing some quantity of interest at each timestamp.
The historical data can also include additional columns that might have influenced the target (exogenous variables). These columns can be either numerical or character data.
The historical data is used to train a machine learning model that produces a forecast of the target value at future timestamps. The model is a schema-level object and can be used for multiple forecasts after it has been trained. (The model cannot be used to produce forecasts in the same range as the historical data.)
Forecasting works with either single-series or multi-series data. Multi-series data represents multiple distinct threads of events. For example, if you have sales data for multiple stores, each store’s sales can be forecast separately by a single model based on the store identifier.
To produce forecasts of time series data, use the Snowflake built-in class FORECAST, and follow these steps:
Create a forecast model object, passing in a reference to the training data. This object will fit (train) a model to the training data that you provide. The model is a schema-level object.
Using this forecast model object, call the forecast method to produce a forecast of the target for future timestamps, passing in either the number of time steps or, if using exogenous variables, the future values of the exogenous data.
Important
Legal notice. This Snowflake Cortex ML function is powered by machine learning technology. Machine learning technology and results provided may be inaccurate, inappropriate, or biased. Decisions based on machine learning outputs, including those built into automatic pipelines, should have human oversight and review processes to ensure model-generated content is accurate. Snowflake Cortex ML function queries will be treated as any other SQL query and may be considered metadata.
Metadata. When you use Snowflake Cortex ML functions, Snowflake logs generic error messages returned by an ML function, in addition to what is mentioned in Metadata Fields. These error logs help us troubleshoot issues that arise and improve these functions to serve you better.
For further information, see Snowflake AI Trust and Safety FAQ.
About the Forecasting Algorithm¶
The forecasting algorithm is powered by a gradient boosting machine (GBM). Like an ARIMA model, it uses a differencing transformation to model data with a non-stationary trend and uses auto-regressive lags of the historical target data as model variables.
Additionally, the algorithm uses rolling averages of historical target data to help predict trends, and automatically produces cyclic calendar variables (such as day of week and week of year) from timestamp data.
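The feature transformations described above (differencing, auto-regressive lags, rolling averages, and cyclic calendar variables) can be sketched in plain Python. This is an illustration of the concepts only; the actual Cortex feature pipeline is not exposed, and the function and column names here are hypothetical:

```python
from datetime import datetime, timedelta

def make_features(timestamps, target, lag=7, window=3):
    """Illustrative feature engineering: first differences, an
    auto-regressive lag, a rolling average, and calendar variables."""
    rows = []
    for i, ts in enumerate(timestamps):
        rows.append({
            # differencing removes a non-stationary trend
            "diff1": target[i] - target[i - 1] if i >= 1 else None,
            # auto-regressive lag of the historical target
            f"lag{lag}": target[i - lag] if i >= lag else None,
            # rolling average of recent history helps capture trends
            "rolling_mean": sum(target[i - window:i]) / window if i >= window else None,
            # cyclic calendar variables derived from the timestamp
            "day_of_week": ts.weekday(),
            "week_of_year": ts.isocalendar()[1],
        })
    return rows

start = datetime(2020, 1, 1)
ts = [start + timedelta(days=d) for d in range(14)]
y = [float(v) for v in range(2, 16)]   # a simple linear target series
feats = make_features(ts, y)
print(feats[7])
```

Each row combines the raw timestamp's calendar features with transformations of the target's own history, which is what lets a tree-based model pick up both seasonality and trend.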
You can fit models with only historical target and timestamp data, or you may include exogenous data (variables) that might have influenced the target value. Exogenous variables can be numerical or categorical and may be NULL (rows containing NULLs for exogenous variables are not dropped).
The algorithm does not rely on one-hot encoding when training on categorical variables, so you can use categorical data with many dimensions (high cardinality).
If your model incorporates exogenous variables, when generating a forecast you must provide values for those variables at every timestamp of the full forecast horizon. Appropriate exogenous variables could include weather data (temperature, rainfall), company-specific information (historic and planned company holidays, advertisement campaigns, event schedules), or any other external factors you believe may help predict your target variable.
The algorithm also generates prediction intervals, in addition to forecasts. A prediction interval is an estimated range of values within an upper bound and a lower bound in which a certain percentage of data is likely to fall. For example, a 0.95 value means that 95% of the data likely appears within the interval. You may specify a prediction interval percentage, or use the default, which is 0.95. Lower and upper bounds of the prediction interval are returned as part of the forecast output.
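One common way to construct such an interval is to widen the point forecast by symmetric quantiles of observed residuals. The following Python sketch illustrates that idea only; it is not necessarily the method Cortex uses internally:

```python
def prediction_interval(residuals, point_forecast, coverage=0.95):
    """Illustrative interval: shift a point forecast by the lower and
    upper quantiles of observed residuals (concept sketch only)."""
    r = sorted(residuals)
    alpha = (1.0 - coverage) / 2.0           # tail mass on each side
    lo_idx = int(alpha * (len(r) - 1))       # lower-quantile index
    hi_idx = int((1.0 - alpha) * (len(r) - 1))  # upper-quantile index
    return point_forecast + r[lo_idx], point_forecast + r[hi_idx]

# residuals centered on zero; 0.95 coverage keeps most of their spread
lower, upper = prediction_interval([-2, -1, 0, 1, 2] * 20, 100.0, coverage=0.95)
print(lower, upper)  # 98.0 102.0
```

Lowering the coverage (for example, to 0.8) produces a narrower interval, which is the effect of passing a smaller `prediction_interval` value to the FORECAST method.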
Important
From time to time, Snowflake may refine the forecasting algorithm and will roll out such improvements through the regular Snowflake release process. You cannot revert to a previous version of the feature, but models you have created with a previous version will continue to use that version for predictions.
Current Limitations¶
The current release has the following limitations:
You cannot choose or adjust the forecasting algorithm.
The minimum number of rows for the main forecasting algorithm is 12 per time series. For time series with between 2 and 11 observations, forecasting produces a “naive” forecast where all forecasted values are equal to the last observed target value.
The forecasting functions do not provide parameters to override trend, seasonality, or seasonal amplitudes; these are inferred from the data.
The minimum acceptable granularity of data is one second. (Timestamps must not be less than one second apart.)
The minimum granularity of seasonal components is one minute. (The function cannot detect cyclic patterns at smaller time deltas.)
The timestamps in your data must represent fixed time intervals. If your input data is irregular, try using DATE_TRUNC or TIME_SLICE on your timestamp column when training the model.
The “season length” of autoregressive features is tied to the input frequency (24 for hourly data, 7 for daily data, and so on).
Forecast models, once trained, are immutable. You cannot update existing models with new data; you must train an entirely new model. Models do not support versioning. Snowflake recommends retraining a model on a regular cadence, perhaps daily, weekly, or monthly, depending on how frequently you receive new data, allowing the model to adjust to changing patterns and trends.
You cannot clone models or share models across roles or accounts. When cloning a schema or database, model objects are skipped.
You cannot replicate an instance of the FORECAST class.
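The naive-forecast fallback described in the minimum-row limitation above can be expressed as a simple rule. This Python sketch is illustrative only (the function name and the error for the main-algorithm branch are hypothetical):

```python
def forecast_values(history, horizon):
    """Illustrative fallback: with 2 to 11 observations, every
    forecasted value equals the last observed target value."""
    if len(history) < 2:
        raise ValueError("at least 2 observations are required")
    if len(history) < 12:
        # "naive" forecast: repeat the last observed value
        return [history[-1]] * horizon
    # with 12 or more rows, the main (GBM-based) algorithm runs instead
    raise NotImplementedError("main algorithm not sketched here")

print(forecast_values([5.0, 7.0, 6.0], 3))  # [6.0, 6.0, 6.0]
```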
Preparing for Forecasting¶
Before you can use forecasting, you must:
Select a virtual warehouse in which to train and run your models.
You might also want to modify your search path to include SNOWFLAKE.ML.
Selecting a Virtual Warehouse¶
A Snowflake virtual warehouse provides the compute resources for training and using the machine learning models for this feature. This section provides general guidance on selecting the best type and size of warehouse for this purpose, focusing on the training step, the most time-consuming and memory-intensive part of the process.
Training on Single-Series Data¶
For models trained on single-series data, choose the warehouse type based on the size of your training data. Standard warehouses are subject to a lower Snowpark memory limit, and are more appropriate for training jobs with fewer rows and exogenous variables.
If your training data does not contain any exogenous variables, you can train on a standard warehouse if the dataset has 5 million rows or fewer. If your training data uses 5 or more exogenous variables, the maximum row count is lower. Otherwise, Snowflake suggests using a Snowpark-optimized warehouse for larger training jobs.
In general, for single-series data, a larger warehouse size does not result in faster training time or higher memory limits. As a rough rule of thumb, training time is proportional to the number of rows in your time series. For example, on an XS standard warehouse, training on a 100,000-row dataset takes about 30 seconds, and training on a 1,000,000-row dataset takes about 140 seconds.
For best performance, Snowflake recommends using a dedicated warehouse without other concurrent workloads to train your model.
Training on Multi-Series Data¶
Like with single-series data, Snowflake recommends choosing the warehouse type based on the number of rows in your largest time series. If your largest time series contains more than 5 million rows, the training job is likely to exceed memory limits on a standard warehouse.
Unlike single-series data, multi-series data trains considerably faster on larger warehouse sizes. The following data points can guide you in your selection.
| Warehouse type | Number of time series | Number of rows per time series | Warehouse size | Training time |
| --- | --- | --- | --- | --- |
| Standard | 1 | 100,000 | XS | 38 seconds |
| Standard | 10 | 100,000 | XS | 112 seconds |
| Standard | 100 | 100,000 | XS | 594 seconds |
| Standard | 10 | 100,000 | XL | 34 seconds |
| Standard | 100 | 100,000 | XL | 114 seconds |
| Standard | 1000 | 100,000 | XL | 572 seconds |
| Snowpark-optimized | 10 | 100,000 | XL | 74 seconds |
| Snowpark-optimized | 100 | 100,000 | XL | 215 seconds |
| Snowpark-optimized | 1000 | 100,000 | XL | 1429 seconds |
Granting Privileges to Create Forecast Objects¶
Training a forecasting model results in a schema-level object. Therefore, the role you use to create models must have the CREATE SNOWFLAKE.ML.FORECAST privilege on the schema where the model is created, allowing the model to be stored there. This privilege is similar to other schema privileges like CREATE TABLE or CREATE VIEW.
Snowflake recommends that you create a role named analyst to be used by people who need to create forecasts.
In the following example, the admin role is the owner of the schema admin_db.admin_schema. The analyst role needs to create models in this schema.
USE ROLE admin;
GRANT USAGE ON DATABASE admin_db TO ROLE analyst;
GRANT USAGE ON SCHEMA admin_schema TO ROLE analyst;
GRANT CREATE SNOWFLAKE.ML.FORECAST ON SCHEMA admin_db.admin_schema TO ROLE analyst;
To use this schema, a user assumes the role analyst:
USE ROLE analyst;
USE SCHEMA admin_db.admin_schema;
If the analyst role has CREATE SCHEMA privileges in database analyst_db, the role can create a new schema analyst_db.analyst_schema and create forecast models in that schema:
USE ROLE analyst;
CREATE SCHEMA analyst_db.analyst_schema;
USE SCHEMA analyst_db.analyst_schema;
To revoke a role’s forecast model creation privilege on the schema, use REVOKE <privileges>:
REVOKE CREATE SNOWFLAKE.ML.FORECAST ON SCHEMA admin_db.admin_schema FROM ROLE analyst;
Setting up the Data for the Examples¶
The example below creates two tables. Views of these tables are included in the examples later in this topic.
The sales_data table contains sales data. Each sale includes a store ID, an item identifier, a timestamp, and the sales amount. Exogenous (external) variables (temperature, humidity, and holiday) are also included.
The future_features table contains future values of the exogenous variables, which are necessary when forecasting with such variables.
CREATE OR REPLACE TABLE sales_data (store_id NUMBER, item VARCHAR, date TIMESTAMP_NTZ,
sales FLOAT, temperature NUMBER, humidity FLOAT, holiday VARCHAR);
INSERT INTO sales_data VALUES
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-01'), 2.0, 50, 0.3, 'new year'),
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-02'), 3.0, 52, 0.3, NULL),
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-03'), 4.0, 54, 0.2, NULL),
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-04'), 5.0, 54, 0.3, NULL),
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-05'), 6.0, 55, 0.2, NULL),
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-06'), 7.0, 55, 0.2, NULL),
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-07'), 8.0, 55, 0.2, NULL),
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-08'), 9.0, 55, 0.2, NULL),
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-09'), 10.0, 55, 0.2, NULL),
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-10'), 11.0, 55, 0.2, NULL),
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-11'), 12.0, 55, 0.2, NULL),
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-12'), 13.0, 55, 0.2, NULL),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-01'), 2.0, 50, 0.3, 'new year'),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-02'), 3.0, 52, 0.3, NULL),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-03'), 4.0, 54, 0.2, NULL),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-04'), 5.0, 54, 0.3, NULL),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-05'), 6.0, 55, 0.2, NULL),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-06'), 7.0, 55, 0.2, NULL),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-07'), 8.0, 55, 0.2, NULL),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-08'), 9.0, 55, 0.2, NULL),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-09'), 10.0, 55, 0.2, NULL),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-10'), 11.0, 55, 0.2, NULL),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-11'), 12.0, 55, 0.2, NULL),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-12'), 13.0, 55, 0.2, NULL);
-- future values for exogenous variables (additional features)
CREATE OR REPLACE TABLE future_features (store_id NUMBER, item VARCHAR,
date TIMESTAMP_NTZ, temperature NUMBER, humidity FLOAT, holiday VARCHAR);
INSERT INTO future_features VALUES
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-13'), 52, 0.3, NULL),
(1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-14'), 53, 0.3, NULL),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-13'), 52, 0.3, NULL),
(2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-14'), 53, 0.3, NULL);
Training, Using, Viewing, Deleting, and Updating Models¶
Use CREATE SNOWFLAKE.ML.FORECAST to create and train a model. The model is trained on the dataset you provide.
CREATE SNOWFLAKE.ML.FORECAST <name>(...);
See FORECAST for complete details about the SNOWFLAKE.ML.FORECAST constructor. For examples of creating a model, see Examples.
Note
SNOWFLAKE.ML.FORECAST runs using limited privileges, so by default, it does not have access to your data. You must therefore pass tables and views as references, which pass along the caller's privileges. You can also provide a query reference instead of a reference to a table or a view.
To generate a forecast, call the model’s <model_name>!FORECAST method:
CALL <name>!FORECAST(...)
To view a list of your models, use the SHOW SNOWFLAKE.ML.FORECAST command:
SHOW SNOWFLAKE.ML.FORECAST;
To remove a model, use the DROP SNOWFLAKE.ML.FORECAST command:
DROP SNOWFLAKE.ML.FORECAST <name>;
To update a model, delete it and train a new one. Models are immutable and cannot be updated in place.
Producing Forecasts¶
After creating your model, you can produce a forecast with <model_name>!FORECAST(...). You can create forecasts from multiple datasets using the same model.
The following examples demonstrate how to create models and produce forecasts from them; consult <model_name>!FORECAST for details on all the parameters.
Tip
To use a model with a dataset that has column names different from those in the dataset that was used to train the model, create a view or a query that renames the columns (using AS) to the names the model expects. Use that view or query as the input to the FORECAST method.
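For example, assuming a hypothetical table named region_sales whose timestamp column is sale_date and whose target column is revenue, a view like the following presents those columns under the names date and sales used in the examples below (all names in this snippet are illustrative):

```sql
CREATE OR REPLACE VIEW region_sales_renamed AS
    SELECT sale_date AS date, revenue AS sales FROM region_sales;
```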
See FORECAST for information about the parameters used in creating and using a model.
Examples¶
The following examples demonstrate using time series forecasting for various use cases.
Forecasting on a Single Series¶
This example uses a single time series (that is, all the rows are part of a single series) that has two columns, a timestamp column and a target value column, without additional features. First, prepare the dataset from sales_data to train the model. The following code creates a view named v1:
CREATE OR REPLACE VIEW v1 AS SELECT date, sales
FROM sales_data WHERE store_id=1 AND item='jacket';
SELECT * FROM v1;
The SELECT statement returns:
+-------------------------+-------+
| DATE | SALES |
+-------------------------+-------+
| 2020-01-01 00:00:00.000 | 2 |
| 2020-01-02 00:00:00.000 | 3 |
| 2020-01-03 00:00:00.000 | 4 |
| 2020-01-04 00:00:00.000 | 5 |
| 2020-01-05 00:00:00.000 | 6 |
+-------------------------+-------+
The following statement trains a forecasting model using the above dataset:
CREATE SNOWFLAKE.ML.FORECAST model1(
INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'v1'),
TIMESTAMP_COLNAME => 'date',
TARGET_COLNAME => 'sales'
);
The following message appears after the model is trained:
Instance MODEL1 successfully created.
The forecasting model is now available as model1. To forecast the next three timestamps:
CALL model1!FORECAST(FORECASTING_PERIODS => 3);
Output
Note that the model has inferred the interval between timestamps from the training data.
+-------------------------+-----------+--------------+--------------+
| TS | FORECAST | LOWER_BOUND | UPPER_BOUND |
+-------------------------+-----------+--------------+--------------+
| 2020-01-13 00:00:00.000 | 14 | 14 | 14 |
| 2020-01-14 00:00:00.000 | 15 | 15 | 15 |
| 2020-01-15 00:00:00.000 | 16 | 16 | 16 |
+-------------------------+-----------+--------------+--------------+
In this example, because the forecast yields a perfectly linear prediction that has zero errors compared to the actual values, the prediction interval (LOWER_BOUND, UPPER_BOUND) is the same as the FORECAST value.
To customize the size of the prediction interval, pass prediction_interval as part of a configuration object:
CALL model1!FORECAST(FORECASTING_PERIODS => 3, CONFIG_OBJECT => {'prediction_interval': 0.8});
To save your results directly to a table, use the following code:
BEGIN
CALL model1!FORECAST(FORECASTING_PERIODS => 3);
LET x := SQLID;
CREATE TABLE my_forecasts AS SELECT * FROM TABLE(RESULT_SCAN(:x));
END;
SELECT * FROM my_forecasts;
Although the above code uses RESULT_SCAN, it is not subject to race conditions even if multiple processes are running at once. The results of the above model call are stored deterministically in the my_forecasts table.
Forecasting on a Single Series with Exogenous Variables¶
If you want additional features (for example, holidays or weather) to influence the forecasting, you must include these features in your training data. Here you create a view containing those fields from the sales_data table:
CREATE OR REPLACE VIEW v2 AS SELECT date, sales, temperature, humidity, holiday
FROM sales_data WHERE store_id=1 AND item='jacket';
SELECT * FROM v2;
Output
These are the first five rows of the result of the SELECT query.
+-------------------------+--------+-------------+----------+----------+
| DATE | SALES | TEMPERATURE | HUMIDITY | HOLIDAY |
+-------------------------+--------+-------------+----------+----------+
| 2020-01-01 00:00:00.000 | 2 | 50 | 0.3 | new year |
| 2020-01-02 00:00:00.000 | 3 | 52 | 0.3 | null |
| 2020-01-03 00:00:00.000 | 4 | 54 | 0.2 | null |
| 2020-01-04 00:00:00.000 | 5 | 54 | 0.3 | null |
| 2020-01-05 00:00:00.000 | 6 | 55 | 0.2 | null |
+-------------------------+--------+-------------+----------+----------+
Now you can use this view to train a model. You are only required to specify the timestamp and target column names; additional columns in the input data are assumed to be exogenous variables.
CREATE SNOWFLAKE.ML.FORECAST model2(
INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'v2'),
TIMESTAMP_COLNAME => 'date',
TARGET_COLNAME => 'sales'
);
The new model is named model2. To perform forecasting for future timestamps, you must provide future values for the exogenous variables in your model: in this case, TEMPERATURE, HUMIDITY, and HOLIDAY. This allows you to answer "what-if" questions, such as "What if temperatures were cooler than normal?" Now create a view from the future_features table containing this data for future timestamps:
CREATE OR REPLACE VIEW v2_forecast AS SELECT date, temperature, humidity, holiday
FROM future_features WHERE store_id=1 AND item='jacket';
SELECT * FROM v2_forecast;
Output
+-------------------------+-------------+----------+---------+
| DATE | TEMPERATURE | HUMIDITY | HOLIDAY |
+-------------------------+-------------+----------+---------+
| 2020-01-13 00:00:00.000 | 52 | 0.3 | null |
| 2020-01-14 00:00:00.000 | 53 | 0.3 | null |
+-------------------------+-------------+----------+---------+
Now you can generate a forecast using this data:
CALL model2!FORECAST(
INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'v2_forecast'),
TIMESTAMP_COLNAME =>'date'
);
In this variation of the FORECAST method, you do not specify the number of timestamps to predict. Instead, the timestamps of the forecast come from the v2_forecast view.
+-------------------------+-----------+--------------+--------------+
| TS | FORECAST | LOWER_BOUND | UPPER_BOUND |
+-------------------------+-----------+--------------+--------------+
| 2020-01-13 00:00:00.000 | 14 | 14 | 14 |
| 2020-01-14 00:00:00.000 | 15 | 15 | 15 |
+-------------------------+-----------+--------------+--------------+
Forecast on Multiple Series¶
The preceding forecasts are on a single series: all rows in the training data represent a single sequence of values. You can also create a forecast model for multiple series at the same time.
The sample data contains store_id and item columns. To forecast sales separately for every store/item combination in the dataset, create a new column that combines these values, and specify that as the series column.
For example, the following query creates a new view combining store_id and item into a new column named store_item:
CREATE OR REPLACE VIEW v3 AS SELECT [store_id, item] AS store_item, date, sales FROM sales_data;
SELECT * FROM v3;
Output
The first five rows of each series in the resulting dataset are:
+-------------------+-------------------------+-------+
| STORE_ITEM | DATE | SALES |
+-------------------+-------------------------+-------+
| [ 1, "jacket" ] | 2020-01-01 00:00:00.000 | 2 |
| [ 1, "jacket" ] | 2020-01-02 00:00:00.000 | 3 |
| [ 1, "jacket" ] | 2020-01-03 00:00:00.000 | 4 |
| [ 1, "jacket" ] | 2020-01-04 00:00:00.000 | 5 |
| [ 1, "jacket" ] | 2020-01-05 00:00:00.000 | 6 |
| [ 2, "umbrella" ] | 2020-01-01 00:00:00.000 | 2 |
| [ 2, "umbrella" ] | 2020-01-02 00:00:00.000 | 3 |
| [ 2, "umbrella" ] | 2020-01-03 00:00:00.000 | 4 |
| [ 2, "umbrella" ] | 2020-01-04 00:00:00.000 | 5 |
| [ 2, "umbrella" ] | 2020-01-05 00:00:00.000 | 6 |
+-------------------+-------------------------+-------+
The following statement trains a forecasting model using the preceding dataset. Be sure to specify the series column name, which is store_item:
CREATE SNOWFLAKE.ML.FORECAST model3(
INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'v3'),
SERIES_COLNAME => 'store_item',
TIMESTAMP_COLNAME => 'date',
TARGET_COLNAME => 'sales'
);
After training, you can forecast two steps forward on all series at the same time (that is, two steps per series) with:
CALL model3!FORECAST(FORECASTING_PERIODS => 2);
Output
+-------------------+-------------------------+----------+-------------+-------------+
| SERIES            | TS                      | FORECAST | LOWER_BOUND | UPPER_BOUND |
+-------------------+-------------------------+----------+-------------+-------------+
| [ 1, "jacket" ]   | 2020-01-13 00:00:00.000 | 14       | 14          | 14          |
| [ 1, "jacket" ]   | 2020-01-14 00:00:00.000 | 15       | 15          | 15          |
| [ 2, "umbrella" ] | 2020-01-13 00:00:00.000 | 14       | 14          | 14          |
| [ 2, "umbrella" ] | 2020-01-14 00:00:00.000 | 15       | 15          | 15          |
+-------------------+-------------------------+----------+-------------+-------------+
You can also forecast only a specific series with:
CALL model3!FORECAST(SERIES_VALUE => [2,'umbrella'], FORECASTING_PERIODS => 2);
Output
The result shows only the next two steps for store 2’s sales of umbrellas.
+-------------------+-------------------------+----------+-------------+-------------+
| SERIES            | TS                      | FORECAST | LOWER_BOUND | UPPER_BOUND |
+-------------------+-------------------------+----------+-------------+-------------+
| [ 2, "umbrella" ] | 2020-01-13 00:00:00.000 | 14       | 14          | 14          |
| [ 2, "umbrella" ] | 2020-01-14 00:00:00.000 | 15       | 15          | 15          |
+-------------------+-------------------------+----------+-------------+-------------+
Tip
Specifying one series with the FORECAST method is more efficient than filtering the results of a multi-series forecast to include only the series you’re interested in, because only one series’ forecast is generated.
Forecast on Multiple Series with Exogenous Variables¶
To train and forecast multiple time series in parallel with exogenous variables, essentially combining the previous two examples, prepare your data in the following manner, incorporating both a series column (store_item in this case) and at least one exogenous column (temperature here):
CREATE OR REPLACE VIEW v4 AS SELECT [store_id, item] AS store_item,
date, sales, temperature FROM sales_data;
SELECT * FROM v4;
Output
The first five rows of each series in the resulting dataset look like this:
+-------------------+-------------------------+-------+-------------+
| STORE_ITEM | DATE | SALES | TEMPERATURE |
+-------------------+-------------------------+-------+-------------+
| [ 1, "jacket" ] | 2020-01-01 00:00:00.000 | 2 | 50 |
| [ 1, "jacket" ] | 2020-01-02 00:00:00.000 | 3 | 52 |
| [ 1, "jacket" ] | 2020-01-03 00:00:00.000 | 4 | 54 |
| [ 1, "jacket" ] | 2020-01-04 00:00:00.000 | 5 | 54 |
| [ 1, "jacket" ] | 2020-01-05 00:00:00.000 | 6 | 55 |
| [ 2, "umbrella" ] | 2020-01-01 00:00:00.000 | 2 | 50 |
| [ 2, "umbrella" ] | 2020-01-02 00:00:00.000 | 3 | 52 |
| [ 2, "umbrella" ] | 2020-01-03 00:00:00.000 | 4 | 54 |
| [ 2, "umbrella" ] | 2020-01-04 00:00:00.000 | 5 | 54 |
| [ 2, "umbrella" ] | 2020-01-05 00:00:00.000 | 6 | 55 |
+-------------------+-------------------------+-------+-------------+
The following statement trains a forecasting model using the preceding dataset. Be sure to specify the series column name, store_item:
CREATE SNOWFLAKE.ML.FORECAST model4(
INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'v4'),
SERIES_COLNAME => 'store_item',
TIMESTAMP_COLNAME => 'date',
TARGET_COLNAME => 'sales'
);
Tip
You can pass in a query reference for input_data instead of creating an intermediate view. The following statement creates a model with the same input data as the previous statement used:
CREATE SNOWFLAKE.ML.FORECAST model4(
INPUT_DATA => SYSTEM$QUERY_REFERENCE('SELECT [store_id, item] AS store_item, date, sales, temperature FROM sales_data'),
SERIES_COLNAME => 'store_item',
TIMESTAMP_COLNAME => 'date',
TARGET_COLNAME => 'sales'
);
As before, when forecasting, you need to pass in future values for exogenous variables, but this time for each series. Prepare the future value of exogenous variables:
CREATE OR REPLACE VIEW v4_forecast AS SELECT [store_id, item] AS store_item,
    date, temperature FROM future_features;
SELECT * FROM v4_forecast;
Output
+-------------------+-------------------------+-------------+
| STORE_ITEM | DATE | TEMPERATURE |
+-------------------+-------------------------+-------------+
| [ 1, "jacket" ] | 2020-01-13 00:00:00.000 | 52 |
| [ 1, "jacket" ] | 2020-01-14 00:00:00.000 | 53 |
| [ 2, "umbrella" ] | 2020-01-13 00:00:00.000 | 52 |
| [ 2, "umbrella" ] | 2020-01-14 00:00:00.000 | 53 |
+-------------------+-------------------------+-------------+
Now, you can forecast on the preceding timestamps:
CALL model4!FORECAST(
INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'v4_forecast'),
SERIES_COLNAME => 'store_item',
TIMESTAMP_COLNAME =>'date'
);
Output
+-------------------+-------------------------+----------+-------------+-------------+
| SERIES | TS | FORECAST | LOWER_BOUND | UPPER_BOUND |
+-------------------+-------------------------+----------+-------------+-------------+
| [ 1, "jacket" ] | 2020-01-13 00:00:00.000 | 14 | 14 | 14 |
| [ 1, "jacket" ] | 2020-01-14 00:00:00.000 | 15 | 15 | 15 |
| [ 2, "umbrella" ] | 2020-01-13 00:00:00.000 | 14 | 14 | 14 |
| [ 2, "umbrella" ] | 2020-01-14 00:00:00.000 | 15 | 15 | 15 |
+-------------------+-------------------------+----------+-------------+-------------+
Visualizing Forecasts¶
To visualize a forecast, use the Snowsight chart feature. After generating a forecast, click Charts above the query’s results table.
You can join your forecasts to your original dataset by using UNION ALL to combine SELECTs on the original data and the forecast results, so you can visualize both together. For example:
CALL model1!FORECAST(FORECASTING_PERIODS => 3);
SELECT date AS ts, sales AS actual, NULL AS forecast, NULL AS lower_bound, NULL AS upper_bound
FROM sales_data
UNION ALL
SELECT ts, NULL AS actual, forecast, lower_bound, upper_bound
FROM TABLE(RESULT_SCAN(-1));
Rename the timestamp and exogenous variable columns if necessary so they match in the original dataset and the forecast. Use NULL to represent the forecast target value in the training data, and conversely to represent the historical target value in the forecast results. In the preceding example, the actual (historical) target column is renamed actual and the forecast target forecast.
After you prepare your data this way, follow these steps in the Data section on the right side of the chart:
Select the ACTUAL column, and then under Aggregation, select None.
Select the TS column, and then under Bucketing, select None.
Add the FORECAST column, choosing Use as Line, and then under Aggregation, select None.
Add the LOWER_BOUND and UPPER_BOUND columns the same way.
The resulting chart shows the historical values and the forecast, along with the lower and upper bounds, as lines on a single chart.
Understanding Feature Importance¶
A forecast model can explain the relative importance of all features used in your model, including any exogenous variables that you choose, automatically generated time features (such as day of week or week of year), and transformations of your target variable (such as rolling averages and auto-regressive lags). This information is useful in understanding what factors are really influencing your data.
The <model_name>!EXPLAIN_FEATURE_IMPORTANCE method counts the number of times the model’s trees used each feature to make a decision. These feature importance scores are then normalized to values between 0 and 1 so that their sum is 1. The resulting scores represent an approximate ranking of the features in your trained model.
Features that are close in score have similar importance. For extremely simple series (for example, when the target column has a constant value), all feature importance scores may be zero.
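The counting-and-normalization step can be sketched in Python. This illustrates the normalization only; how Cortex counts tree decisions internally is not exposed, and the counts below are hypothetical:

```python
def normalize_importance(split_counts):
    """Illustrative normalization: per-feature decision counts are scaled
    to [0, 1] so that the scores sum to 1 (a constant series yields zeros)."""
    total = sum(split_counts.values())
    if total == 0:
        # e.g. a constant target: all feature importance scores are zero
        return {name: 0.0 for name in split_counts}
    return {name: count / total for name, count in split_counts.items()}

# hypothetical counts of how often each feature was used in tree splits
scores = normalize_importance({"exog_a": 22, "epoch_time": 15, "lag60": 7, "hour": 1})
print(round(scores["exog_a"], 2))  # 0.49
```

Because the scores are relative frequencies, they rank features against each other rather than measuring any absolute effect size.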
Using multiple features that are very similar to each other may result in reduced importance scores for those features. For example, if one feature is quantity of items sold and another is quantity of items in inventory, the values may be correlated because you can’t sell more than you have and because you try to manage inventory so you won’t have more in stock than you will sell. If two features are exactly identical, the model may treat them as interchangeable when making decisions, resulting in feature importance scores that are half of what those scores would be if only one of the identical features were included.
Feature importance also reports lag features. During training, the model infers the frequency (hourly, daily, or weekly) of your training data. The feature lagx (for example, lag24) is the value of the target variable x time units ago. For example, if your data is inferred to be hourly, lag24 represents your target variable 24 hours ago.
All other transformations of your target variable (rolling averages and so on) are summarized as aggregated_endogenous_features in the results table.
Limitations¶
You cannot choose the technique used to calculate feature importance.
Feature importance scores can be helpful for gaining intuition about which features are important to your model’s accuracy, but the actual values should be considered estimates.
Example¶
To understand the relative importance of your features to your model, train a model, and then call <model_name>!EXPLAIN_FEATURE_IMPORTANCE. In this example, you first create random data that includes an exogenous variable, exog_a, whose values are random and therefore unlikely to be very important to your model.
Execute the following statements to generate the data, train a model on it, and get the importance of the features:
CREATE OR REPLACE VIEW v_random_data AS SELECT
    DATEADD('minute', ROW_NUMBER() OVER (ORDER BY 1), '2023-12-01')::TIMESTAMP_NTZ ts,
    MOD(SEQ1(), 10) y,
    UNIFORM(1, 100, RANDOM(0)) exog_a
FROM TABLE(GENERATOR(ROWCOUNT => 500));

CREATE OR REPLACE SNOWFLAKE.ML.FORECAST forecast_feature_importance_demo(
    INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'v_random_data'),
    TIMESTAMP_COLNAME => 'ts',
    TARGET_COLNAME => 'y'
);

CALL forecast_feature_importance_demo!EXPLAIN_FEATURE_IMPORTANCE();
Output
+------+--------------------------------------+-------+-------------------------+
| RANK | FEATURE_NAME | SCORE | FEATURE_TYPE |
+------+--------------------------------------+-------+-------------------------+
| 1 | aggregated_endogenous_trend_features | 0.36 | derived_from_endogenous |
| 2 | exog_a | 0.22 | user_provided |
| 3 | epoch_time | 0.15 | derived_from_timestamp |
| 4 | minute | 0.13 | derived_from_timestamp |
| 5 | lag60 | 0.07 | derived_from_endogenous |
| 6 | lag120 | 0.06 | derived_from_endogenous |
| 7 | hour | 0.01 | derived_from_timestamp |
+------+--------------------------------------+-------+-------------------------+
Understanding Evaluation Metrics¶
By default, every forecasting model is cross-validated. In addition to training a model on all provided training data, one or more models are trained on subsets of the training data and then used to “predict” the withheld data. The predicted target values are then compared to the actual target values. If you don’t need evaluation metrics, or don’t want to spend compute resources on them, set evaluate to FALSE when instantiating the model.
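The splitting scheme can be pictured roughly as follows. This is a generic expanding-window sketch using the documented parameter names (n_splits, test_size, gap); it is not Snowflake's exact procedure.

```python
# Generic expanding-window cross-validation sketch: each split trains on
# an initial prefix of the series and withholds the next test_size rows.
def rolling_splits(n_rows, n_splits, test_size, gap=0):
    for i in range(n_splits, 0, -1):
        test_end = n_rows - (i - 1) * test_size
        test_start = test_end - test_size
        train_end = test_start - gap
        yield train_end, test_start, test_end

splits = list(rolling_splits(n_rows=20, n_splits=2, test_size=5))
# two splits: train on rows [0, 10) then test [10, 15);
#             train on rows [0, 15) then test [15, 20)
```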
Limitations¶
The n_splits parameter must be set to at least 2.
Small datasets may not have enough data to perform evaluation. The total number of training rows must be equal to or greater than (n_splits * test_size) + gap. If not enough data is available to train an evaluation model, no evaluation metrics are available even when evaluate is TRUE.
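The row-count requirement above reduces to a simple check, sketched here with the documented parameter names:

```python
# Data-sufficiency rule from the limitations above:
# rows must satisfy rows >= (n_splits * test_size) + gap, with n_splits >= 2.
def can_evaluate(n_rows, n_splits, test_size, gap=0):
    if n_splits < 2:
        raise ValueError("n_splits must be at least 2")
    return n_rows >= (n_splits * test_size) + gap
```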
Example¶
CREATE OR REPLACE VIEW v_random_data AS SELECT
    DATEADD('minute', ROW_NUMBER() OVER (ORDER BY 1), '2023-12-01')::TIMESTAMP_NTZ ts,
    MOD(SEQ1(), 10) y,
    UNIFORM(1, 100, RANDOM(0)) exog_a
FROM TABLE(GENERATOR(ROWCOUNT => 500));

CREATE OR REPLACE SNOWFLAKE.ML.FORECAST model(
    INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'v_random_data'),
    TIMESTAMP_COLNAME => 'ts',
    TARGET_COLNAME => 'y'
);

CALL model!SHOW_EVALUATION_METRICS();
Output
+--------------------------+--------------+--------------------+------+
| ERROR_METRIC | METRIC_VALUE | STANDARD_DEVIATION | LOGS |
+--------------------------+--------------+--------------------+------+
| "MAE" | 7.107 | 1.998 | NULL |
| "MAPE" | 0.475 | 0.237 | NULL |
| "MDA" | 0.920 | 0.025 | NULL |
| "MSE" | 86.020 | 66.798 | NULL |
| "SMAPE" | 0.241 | 0.047 | NULL |
| "COVERAGE_INTERVAL=0.95" | 0.981 | 0.025 | NULL |
| "WINKLER_ALPHA=0.05" | 56.697 | 45.199 | NULL |
+--------------------------+--------------+--------------------+------+
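For intuition, the standard definitions of three of these metrics are sketched below on invented actual/predicted values. Snowflake's exact computation may differ in details; this only illustrates what the metrics measure.

```python
# Standard definitions of three reported error metrics, on invented data.
def mae(actual, pred):   # mean absolute error
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def mape(actual, pred):  # mean absolute percentage error
    return sum(abs((a - p) / a) for a, p in zip(actual, pred)) / len(actual)

def smape(actual, pred): # symmetric mean absolute percentage error
    return sum(2 * abs(p - a) / (abs(a) + abs(p))
               for a, p in zip(actual, pred)) / len(actual)

actual, pred = [10.0, 12.0, 8.0], [11.0, 9.0, 8.0]
```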
Inspecting Training Logs¶
When you train multiple series with CONFIG_OBJECT => {'ON_ERROR': 'SKIP'}, individual time series models can fail to train without the overall training process failing. To understand which time series failed and why, call <model_instance>!SHOW_TRAINING_LOGS.
Example¶
CREATE TABLE t_error(date TIMESTAMP_NTZ, sales FLOAT, series VARCHAR);
INSERT INTO t_error VALUES
(TO_TIMESTAMP_NTZ('2019-12-20'), 1.0, 'A'),
(TO_TIMESTAMP_NTZ('2019-12-21'), 2.0, 'A'),
(TO_TIMESTAMP_NTZ('2019-12-22'), 3.0, 'A'),
(TO_TIMESTAMP_NTZ('2019-12-23'), 2.0, 'A'),
(TO_TIMESTAMP_NTZ('2019-12-24'), 1.0, 'A'),
(TO_TIMESTAMP_NTZ('2019-12-25'), 2.0, 'A'),
(TO_TIMESTAMP_NTZ('2019-12-26'), 3.0, 'A'),
(TO_TIMESTAMP_NTZ('2019-12-27'), 2.0, 'A'),
(TO_TIMESTAMP_NTZ('2019-12-28'), 1.0, 'A'),
(TO_TIMESTAMP_NTZ('2019-12-29'), 2.0, 'A'),
(TO_TIMESTAMP_NTZ('2019-12-30'), 3.0, 'A'),
(TO_TIMESTAMP_NTZ('2019-12-31'), 2.0, 'A'),
(TO_TIMESTAMP_NTZ('2020-01-01'), 2.0, 'A'),
(TO_TIMESTAMP_NTZ('2020-01-02'), 3.0, 'A'),
(TO_TIMESTAMP_NTZ('2020-01-03'), 3.0, 'A'),
(TO_TIMESTAMP_NTZ('2020-01-04'), 7.0, 'A'),
(TO_TIMESTAMP_NTZ('2020-01-05'), 10.0, 'B'),
(TO_TIMESTAMP_NTZ('2020-01-06'), 13.0, 'B'),
(TO_TIMESTAMP_NTZ('2020-01-06'), 12.0, 'B'), -- duplicate timestamp
(TO_TIMESTAMP_NTZ('2020-01-07'), 15.0, 'B'),
(TO_TIMESTAMP_NTZ('2020-01-08'), 14.0, 'B'),
(TO_TIMESTAMP_NTZ('2020-01-09'), 18.0, 'B'),
(TO_TIMESTAMP_NTZ('2020-01-10'), 12.0, 'B');
CREATE SNOWFLAKE.ML.FORECAST model(
    INPUT_DATA => SYSTEM$QUERY_REFERENCE('SELECT date, sales, series FROM t_error'),
    SERIES_COLNAME => 'series',
    TIMESTAMP_COLNAME => 'date',
    TARGET_COLNAME => 'sales',
    CONFIG_OBJECT => {'ON_ERROR': 'SKIP'}
);
CALL model!SHOW_TRAINING_LOGS();
Output
+--------+-------------------------------------------------------------------------------------------------+
| SERIES | LOGS |
+--------+-------------------------------------------------------------------------------------------------+
| "B" | { "Errors": [ "Frequency cannot be inferred when duplicate timestamps are present." ] } |
| "A" | NULL |
+--------+-------------------------------------------------------------------------------------------------+
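Because the LOGS column contains JSON, failed series can be picked out programmatically. The rows below mirror the example output and are hard-coded for illustration; in practice you would fetch them from the call's result set.

```python
import json

# Rows mirroring the SHOW_TRAINING_LOGS output above (hard-coded here).
rows = [
    ("B", '{ "Errors": [ "Frequency cannot be inferred when duplicate timestamps are present." ] }'),
    ("A", None),
]

# Map each failed series to its list of error messages; series with
# NULL logs trained successfully and are skipped.
failures = {series: json.loads(logs)["Errors"] for series, logs in rows if logs}
```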
Cost Considerations¶
Training and using forecasting models incurs storage and compute costs.
As described in Selecting a Virtual Warehouse, training consumes more compute than forecasting from a trained model. However, the costs of forecasting can accumulate with repeated use of the model. See Understanding compute cost for general information on Snowflake compute costs.
The storage costs you incur reflect storage of the ML model instances created during the training step. To view the objects associated with your model instance, navigate to your account usage views (for example, ACCOUNT_USAGE.TABLES and ACCOUNT_USAGE.STAGES). These objects appear with NULL database and schema columns. The instance_id column, however, is populated, indicating that these objects are contained inside a model instance. These objects are fully managed by the model instance; you cannot access or delete them separately. To reduce storage costs associated with your models, delete unused or obsolete models.
Using Forecasting in Snowpark¶
session.call is not yet compatible with forecasting models. To call a forecasting model in Snowpark, use session.sql instead, as shown here.

session.sql('call my_model!forecast(...)').collect()