Time-Series Forecasting (Snowflake Cortex ML Functions)

Note

Forecasting is part of Snowflake Cortex, Snowflake's intelligent, fully managed AI and ML service, and is one of the Snowflake Cortex ML functions.

Forecasting uses a machine learning algorithm to produce univariate predictions of future data from historical time series data. A common use case is forecasting sales based on seasonality and other factors.

The historical data must include:

  • A timestamp column, which should have a fixed frequency (for example, hourly, every 5 minutes, and so on).

  • A target column representing some quantity of interest at each timestamp.

The historical data can also include additional columns that might have influenced the target (exogenous variables). These columns can be either numerical or character data.

The historical data is used to train a machine learning model that produces a forecast of the target value at future timestamps. The model is a schema-level object and can be used for multiple forecasts after it has been trained. (The model cannot be used to produce forecasts in the same range as the historical data.)

Forecasting works with either single-series or multi-series data. Multi-series data represents multiple distinct threads of events. For example, if you have sales data for multiple stores, each store’s sales can be forecast separately by a single model based on the store identifier.

To produce forecasts of time series data, use the Snowflake built-in class FORECAST, and follow these steps:

  1. Create a forecast model object, passing in a reference to the training data. This object will fit (train) a model to the training data that you provide. The model is a schema-level object.

  2. Using this forecast model object, call the forecast method to produce a forecast of the target for future timestamps, passing in either the number of time steps or, if you are using exogenous variables, the future values of those variables.

Important

Legal notice. This Snowflake Cortex ML function is powered by machine learning technology. Machine learning technology and results provided may be inaccurate, inappropriate, or biased. Decisions based on machine learning outputs, including those built into automatic pipelines, should have human oversight and review processes to ensure model-generated content is accurate. Snowflake Cortex ML function queries will be treated as any other SQL query and may be considered metadata.

Metadata. When you use Snowflake Cortex ML functions, Snowflake logs generic error messages returned by an ML function, in addition to what is mentioned in Metadata Fields. These error logs help us troubleshoot issues that arise and improve these functions to serve you better.

For further information, see Snowflake AI Trust and Safety FAQ.

About the Forecasting Algorithm

The forecasting algorithm is powered by a gradient boosting machine (GBM). Like an ARIMA model, it uses a differencing transformation to model data with a non-stationary trend and uses auto-regressive lags of the historical target data as model variables.

Additionally, the algorithm uses rolling averages of historical target data to help predict trends, and automatically produces cyclic calendar variables (such as day of week and week of year) from timestamp data.

You can fit models with only historical target and timestamp data, or you may include exogenous data (variables) that might have influenced the target value. Exogenous variables can be numerical or categorical and may be NULL (rows containing NULLs for exogenous variables are not dropped).

The algorithm does not rely on one-hot encoding when training on categorical variables, so you can use categorical data with many dimensions (high cardinality).

If your model incorporates exogenous variables, when generating a forecast you must provide values for those variables at every timestamp of the full forecast horizon. Appropriate exogenous variables could include weather data (temperature, rainfall), company-specific information (historic and planned company holidays, advertisement campaigns, event schedules), or any other external factors you believe may help predict your target variable.

The algorithm also generates prediction intervals, in addition to forecasts. A prediction interval is an estimated range of values within an upper bound and a lower bound in which a certain percentage of data is likely to fall. For example, a 0.95 value means that 95% of the data likely appears within the interval. You may specify a prediction interval percentage, or use the default, which is 0.95. Lower and upper bounds of the prediction interval are returned as part of the forecast output.

Important

From time to time, Snowflake may refine the forecasting algorithm and will roll out such improvements through the regular Snowflake release process. You cannot revert to a previous version of the feature, but models you have created with a previous version will continue to use that version for predictions.

Current Limitations

The current release has the following limitations:

  • You cannot choose or adjust the forecasting algorithm.

  • The minimum number of rows for the main forecasting algorithm is 12 per time series. For time series with between 2 and 11 observations, forecasting produces a “naive” forecast where all forecasted values are equal to the last observed target value.

  • The forecasting functions do not provide parameters to override trend, seasonality, or seasonal amplitudes; these are inferred from the data.

  • The minimum acceptable granularity of data is one second. (Timestamps must not be less than one second apart.)

  • The minimum granularity of seasonal components is one minute. (The function cannot detect cyclic patterns at smaller time deltas.)

  • The timestamps in your data must represent fixed time intervals. If your input data is irregular, try using DATE_TRUNC or TIME_SLICE on your timestamp column when training the model (see the sketch after this list).

  • The “season length” of autoregressive features is tied to the input frequency (24 for hourly data, 7 for daily data, and so on).

  • Forecast models, once trained, are immutable. You cannot update existing models with new data; you must train an entirely new model. Models do not support versioning. Snowflake recommends retraining a model on a regular cadence, perhaps daily, weekly, or monthly, depending on how frequently you receive new data, allowing the model to adjust to changing patterns and trends.

  • You cannot clone models or share models across roles or accounts. When cloning a schema or database, model objects are skipped.

  • You cannot replicate an instance of the FORECAST class.
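
As noted above, irregular input data can be regularized before training. The following is a minimal sketch, assuming a hypothetical table raw_sales(ts TIMESTAMP_NTZ, sales FLOAT) whose events arrive at arbitrary times:

-- Bucket irregular events into fixed one-hour intervals and aggregate the target
CREATE OR REPLACE VIEW regular_sales AS
  SELECT TIME_SLICE(ts, 1, 'HOUR') AS ts,  -- start of each one-hour bucket
         SUM(sales) AS sales               -- total sales within the bucket
  FROM raw_sales
  GROUP BY 1;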

Preparing for Forecasting

Before you can use forecasting, you must:

  • Select a virtual warehouse to provide the compute resources for training and using models (see Selecting a Virtual Warehouse).

  • Grant the privileges required to create forecast model objects (see Granting Privileges to Create Forecast Objects).

You might also want to modify your search path to include SNOWFLAKE.ML.
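
For example, the following statement (a sketch; merge with your session's existing search path if you have customized it) lets you refer to classes such as FORECAST without fully qualifying them:

ALTER SESSION SET SEARCH_PATH = '$current, $public, SNOWFLAKE.ML';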

Selecting a Virtual Warehouse

A Snowflake virtual warehouse provides the compute resources for training and using the machine learning models for this feature. This section provides general guidance on selecting the best type and size of warehouse for this purpose, focusing on the training step, the most time-consuming and memory-intensive part of the process.

Training on Single-Series Data

For models trained on single-series data, choose the warehouse type based on the size of your training data. Standard warehouses are subject to a lower Snowpark memory limit, and are more appropriate for training jobs with fewer rows and exogenous variables.

If your training data does not contain any exogenous variables, you can train on a standard warehouse if the dataset has 5 million rows or fewer. If your training data includes five or more exogenous variables, the maximum row count is lower. For larger training jobs, Snowflake suggests using a Snowpark-optimized warehouse.

In general, for single-series data, a larger warehouse size does not result in faster training or higher memory limits. As a rough rule of thumb, training time is proportional to the number of rows in your time series. For example, on an XS standard warehouse, training on a 100,000-row dataset takes about 30 seconds, and training on a 1,000,000-row dataset takes about 140 seconds.

For best performance, Snowflake recommends using a dedicated warehouse without other concurrent workloads to train your model.

Training on Multi-Series Data

As with single-series data, Snowflake recommends choosing the warehouse type based on the number of rows in your largest time series. If your largest time series contains more than 5 million rows, the training job is likely to exceed memory limits on a standard warehouse.

Unlike single-series data, multi-series data trains considerably faster on larger warehouse sizes. The following data points can guide you in your selection.

+--------------------+-----------------------+----------------------+----------------+---------------+
| Warehouse type     | Number of time series | Rows per time series | Warehouse size | Training time |
+--------------------+-----------------------+----------------------+----------------+---------------+
| Standard           | 1                     | 100,000              | XS             | 38 seconds    |
| Standard           | 10                    | 100,000              | XS             | 112 seconds   |
| Standard           | 100                   | 100,000              | XS             | 594 seconds   |
| Standard           | 10                    | 100,000              | XL             | 34 seconds    |
| Standard           | 100                   | 100,000              | XL             | 114 seconds   |
| Standard           | 1000                  | 100,000              | XL             | 572 seconds   |
| Snowpark-optimized | 10                    | 100,000              | XL             | 74 seconds    |
| Snowpark-optimized | 100                   | 100,000              | XL             | 215 seconds   |
| Snowpark-optimized | 1000                  | 100,000              | XL             | 1429 seconds  |
+--------------------+-----------------------+----------------------+----------------+---------------+

Granting Privileges to Create Forecast Objects

Training a forecasting model results in a schema-level object. Therefore, the role you use to create models must have the CREATE SNOWFLAKE.ML.FORECAST privilege on the schema where the model is created, allowing the model to be stored there. This privilege is similar to other schema privileges like CREATE TABLE or CREATE VIEW.

Snowflake recommends that you create a role named analyst to be used by people who need to create forecasts.

In the following example, the admin role is the owner of the schema admin_db.admin_schema. The analyst role needs to create models in this schema.

USE ROLE admin;
GRANT USAGE ON DATABASE admin_db TO ROLE analyst;
GRANT USAGE ON SCHEMA admin_schema TO ROLE analyst;
GRANT CREATE SNOWFLAKE.ML.FORECAST ON SCHEMA admin_db.admin_schema TO ROLE analyst;

To use this schema, a user assumes the role analyst:

USE ROLE analyst;
USE SCHEMA admin_db.admin_schema;

If the analyst role has CREATE SCHEMA privileges in database analyst_db, the role can create a new schema analyst_db.analyst_schema and create forecast models in that schema:

USE ROLE analyst;
CREATE SCHEMA analyst_db.analyst_schema;
USE SCHEMA analyst_db.analyst_schema;

To revoke a role’s forecast model creation privilege on the schema, use REVOKE <privileges>:

REVOKE CREATE SNOWFLAKE.ML.FORECAST ON SCHEMA admin_db.admin_schema FROM ROLE analyst;

Setting up the Data for the Examples

The example below creates two tables. Views based on these tables are used in the examples later in this topic.

The sales_data table contains sales data. Each sale includes a store ID, an item identifier, a timestamp, and the sales amount. Exogenous or external variables (temperature, humidity, and holiday) are also included.

The future_features table contains future values of the exogenous variables, which are necessary when forecasting with such variables.

CREATE OR REPLACE TABLE sales_data (store_id NUMBER, item VARCHAR, date TIMESTAMP_NTZ,
  sales FLOAT, temperature NUMBER, humidity FLOAT, holiday VARCHAR);

INSERT INTO sales_data VALUES
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-01'), 2.0, 50, 0.3, 'new year'),
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-02'), 3.0, 52, 0.3, NULL),
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-03'), 4.0, 54, 0.2, NULL),
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-04'), 5.0, 54, 0.3, NULL),
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-05'), 6.0, 55, 0.2, NULL),
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-06'), 7.0, 55, 0.2, NULL),
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-07'), 8.0, 55, 0.2, NULL),
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-08'), 9.0, 55, 0.2, NULL),
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-09'), 10.0, 55, 0.2, NULL),
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-10'), 11.0, 55, 0.2, NULL),
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-11'), 12.0, 55, 0.2, NULL),
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-12'), 13.0, 55, 0.2, NULL),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-01'), 2.0, 50, 0.3, 'new year'),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-02'), 3.0, 52, 0.3, NULL),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-03'), 4.0, 54, 0.2, NULL),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-04'), 5.0, 54, 0.3, NULL),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-05'), 6.0, 55, 0.2, NULL),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-06'), 7.0, 55, 0.2, NULL),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-07'), 8.0, 55, 0.2, NULL),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-08'), 9.0, 55, 0.2, NULL),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-09'), 10.0, 55, 0.2, NULL),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-10'), 11.0, 55, 0.2, NULL),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-11'), 12.0, 55, 0.2, NULL),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-12'), 13.0, 55, 0.2, NULL);

-- future values for exogenous variables (additional features)
CREATE OR REPLACE TABLE future_features (store_id NUMBER, item VARCHAR,
  date TIMESTAMP_NTZ, temperature NUMBER, humidity FLOAT, holiday VARCHAR);

INSERT INTO future_features VALUES
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-13'), 52, 0.3, NULL),
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-14'), 53, 0.3, NULL),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-13'), 52, 0.3, NULL),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-14'), 53, 0.3, NULL);

Training, Using, Viewing, Deleting, and Updating Models

Use CREATE SNOWFLAKE.ML.FORECAST to create and train a model. The model is trained on the dataset you provide.

CREATE SNOWFLAKE.ML.FORECAST <name>(...);

See FORECAST for complete details about the SNOWFLAKE.ML.FORECAST constructor. For examples of creating a model, see Examples.

Note

SNOWFLAKE.ML.FORECAST runs with limited privileges, so by default it does not have access to your data. You must therefore pass tables and views as references, which carry the caller's privileges. You can also provide a query reference instead of a reference to a table or a view.
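
For example, the following sketch passes a query reference instead of a view, using the sales_data table defined later in this topic (my_model is a placeholder name):

CREATE SNOWFLAKE.ML.FORECAST my_model(
  INPUT_DATA => SYSTEM$QUERY_REFERENCE('SELECT date, sales FROM sales_data WHERE store_id = 1 AND item = ''jacket'''),
  TIMESTAMP_COLNAME => 'date',
  TARGET_COLNAME => 'sales'
);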

To generate a forecast, call the model’s <model_name>!FORECAST method:

CALL <name>!FORECAST(...)

To view a list of your models, use the SHOW SNOWFLAKE.ML.FORECAST command:

SHOW SNOWFLAKE.ML.FORECAST;

To remove a model, use the DROP SNOWFLAKE.ML.FORECAST command:

DROP SNOWFLAKE.ML.FORECAST <name>;

To update a model, delete it and train a new one. Models are immutable and cannot be updated in place.
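
A convenient retraining pattern is CREATE OR REPLACE, which drops the old model and trains a replacement in one statement. A sketch, reusing the v1 view from the examples below:

CREATE OR REPLACE SNOWFLAKE.ML.FORECAST model1(
  INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'v1'),  -- retrains on the view's current contents
  TIMESTAMP_COLNAME => 'date',
  TARGET_COLNAME => 'sales'
);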

Producing Forecasts

After creating your model, you can produce a forecast by calling <model_name>!FORECAST(...). You can produce forecasts from multiple datasets using the same model.

The following examples demonstrate how to create models and produce forecasts from them; consult <model_name>!FORECAST for details on all the parameters.

Tip

To use a model with a dataset that has column names different from those in the dataset that was used to train the model, create a view or a query that renames the columns (using AS) to the names the model expects. Use that view or query as the input to the FORECAST method.
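
For example, the following sketch assumes a hypothetical table new_sales whose columns sale_date and amount correspond to the date and sales columns a model was trained on:

CREATE OR REPLACE VIEW renamed_sales AS
  SELECT sale_date AS date,  -- rename to the timestamp column the model expects
         amount AS sales     -- rename to the target column the model expects
  FROM new_sales;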

See FORECAST for information about the parameters used in creating and using a model.

Examples

The following examples demonstrate using time series forecasting for various use cases.

Forecasting on a Single Series

This example uses a single time series (that is, all rows belong to one series) with two columns: a timestamp column and a target value column, without additional features. First, prepare the dataset from sales_data to train the model. The following code creates a view named v1:

CREATE OR REPLACE VIEW v1 AS SELECT date, sales
  FROM sales_data WHERE store_id=1 AND item='jacket';
SELECT * FROM v1;

The SELECT statement returns the following (only the first five rows are shown):

+-------------------------+-------+
| DATE                    | SALES |
+-------------------------+-------+
| 2020-01-01 00:00:00.000 | 2     |
| 2020-01-02 00:00:00.000 | 3     |
| 2020-01-03 00:00:00.000 | 4     |
| 2020-01-04 00:00:00.000 | 5     |
| 2020-01-05 00:00:00.000 | 6     |
+-------------------------+-------+

The following statement trains a forecasting model using the above dataset:

CREATE SNOWFLAKE.ML.FORECAST model1(
  INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'v1'),
  TIMESTAMP_COLNAME => 'date',
  TARGET_COLNAME => 'sales'
);

The following message appears after the model is trained:

Instance MODEL1 successfully created.

The forecasting model is now available as model1. To forecast the next three timestamps:

CALL model1!FORECAST(FORECASTING_PERIODS => 3);

Output

Note that the model has inferred the interval between timestamps from the training data.

+-------------------------+-----------+--------------+--------------+
| TS                      | FORECAST  | LOWER_BOUND  | UPPER_BOUND  |
+-------------------------+-----------+--------------+--------------+
| 2020-01-13 00:00:00.000 | 14        | 14           | 14           |
| 2020-01-14 00:00:00.000 | 15        | 15           | 15           |
| 2020-01-15 00:00:00.000 | 16        | 16           | 16           |
+-------------------------+-----------+--------------+--------------+

In this example, because the forecast yields a perfectly linear prediction that has zero errors compared to the actual values, the prediction interval (LOWER_BOUND, UPPER_BOUND) is the same as the FORECAST value.

To customize the size of the prediction interval, pass prediction_interval as part of a configuration object:

CALL model1!FORECAST(FORECASTING_PERIODS => 3, CONFIG_OBJECT => {'prediction_interval': 0.8});

To save your results directly to a table, use the following code:

BEGIN
  CALL model1!FORECAST(FORECASTING_PERIODS => 3);
  LET x := SQLID;
  CREATE TABLE my_forecasts AS SELECT * FROM TABLE(RESULT_SCAN(:x));
END;

SELECT * FROM my_forecasts;

Although the above code uses RESULT_SCAN, it is not subject to race conditions even if multiple processes are running at once. Results from the above model call are stored deterministically into the my_forecasts table.

Forecasting on a Single Series with Exogenous Variables

If you want additional features (for example, holidays or weather) to influence the forecasting, you must include these features in your training data. Here you create a view containing those fields from the sales_data table:

CREATE OR REPLACE VIEW v2 AS SELECT date, sales, temperature, humidity, holiday
  FROM sales_data WHERE store_id=1 AND item='jacket';
SELECT * FROM v2;

Output

These are the first five rows of the result of the SELECT query:

+-------------------------+--------+-------------+----------+----------+
| DATE                    | SALES  | TEMPERATURE | HUMIDITY | HOLIDAY  |
+-------------------------+--------+-------------+----------+----------+
| 2020-01-01 00:00:00.000 | 2      | 50          | 0.3      | new year |
| 2020-01-02 00:00:00.000 | 3      | 52          | 0.3      | null     |
| 2020-01-03 00:00:00.000 | 4      | 54          | 0.2      | null     |
| 2020-01-04 00:00:00.000 | 5      | 54          | 0.3      | null     |
| 2020-01-05 00:00:00.000 | 6      | 55          | 0.2      | null     |
+-------------------------+--------+-------------+----------+----------+

Now you can use this view to train a model. You are only required to specify the timestamp and target column names; additional columns in the input data are assumed to be exogenous variables.

CREATE SNOWFLAKE.ML.FORECAST model2(
  INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'v2'),
  TIMESTAMP_COLNAME => 'date',
  TARGET_COLNAME => 'sales'
);

The new model is named model2. To produce forecasts for future timestamps, you must provide future values for the exogenous variables in your model: in this case, TEMPERATURE, HUMIDITY, and HOLIDAY. This also allows you to answer “what-if” questions, such as “What if temperatures were cooler than normal?” Now create a view from the future_features table containing this data for future timestamps:

CREATE OR REPLACE VIEW v2_forecast AS SELECT date, temperature, humidity, holiday
  FROM future_features WHERE store_id=1 AND item='jacket';
SELECT * FROM v2_forecast;

Output

+-------------------------+-------------+----------+---------+
| DATE                    | TEMPERATURE | HUMIDITY | HOLIDAY |
+-------------------------+-------------+----------+---------+
| 2020-01-13 00:00:00.000 | 52          | 0.3      | null    |
| 2020-01-14 00:00:00.000 | 53          | 0.3      | null    |
+-------------------------+-------------+----------+---------+

Now you can generate a forecast using this data:

CALL model2!FORECAST(
  INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'v2_forecast'),
  TIMESTAMP_COLNAME => 'date'
);

In this variation of the FORECAST method, you do not specify the number of timestamps to predict. Instead, the timestamps of the forecast come from the v2_forecast view.

+-------------------------+-----------+--------------+--------------+
| TS                      | FORECAST  | LOWER_BOUND  | UPPER_BOUND  |
+-------------------------+-----------+--------------+--------------+
| 2020-01-13 00:00:00.000 | 14        | 14           | 14           |
| 2020-01-14 00:00:00.000 | 15        | 15           | 15           |
+-------------------------+-----------+--------------+--------------+

Forecast on Multiple Series

The preceding forecasts are on a single series: all rows in the training data represent a single sequence of values. You can also create a forecast model for multiple series at the same time.

The sample data contains store_id and item columns. To forecast sales separately for every store/item combination in the dataset, create a new column that combines these values, and specify that as the series column.

For example, the following query creates a new view combining store_id and item into a new column named store_item:

CREATE OR REPLACE VIEW v3 AS SELECT [store_id, item] AS store_item, date, sales FROM sales_data;
SELECT * FROM v3;

Output

The first five rows of each series in the resulting dataset are:

+-------------------+-------------------------+-------+
| STORE_ITEM        | DATE                    | SALES |
+-------------------+-------------------------+-------+
| [ 1, "jacket" ]   | 2020-01-01 00:00:00.000 | 2     |
| [ 1, "jacket" ]   | 2020-01-02 00:00:00.000 | 3     |
| [ 1, "jacket" ]   | 2020-01-03 00:00:00.000 | 4     |
| [ 1, "jacket" ]   | 2020-01-04 00:00:00.000 | 5     |
| [ 1, "jacket" ]   | 2020-01-05 00:00:00.000 | 6     |
| [ 2, "umbrella" ] | 2020-01-01 00:00:00.000 | 2     |
| [ 2, "umbrella" ] | 2020-01-02 00:00:00.000 | 3     |
| [ 2, "umbrella" ] | 2020-01-03 00:00:00.000 | 4     |
| [ 2, "umbrella" ] | 2020-01-04 00:00:00.000 | 5     |
| [ 2, "umbrella" ] | 2020-01-05 00:00:00.000 | 6     |
+-------------------+-------------------------+-------+

The following statement trains a forecasting model using the preceding dataset. Be sure to specify the series column name, which is store_item:

CREATE SNOWFLAKE.ML.FORECAST model3(
  INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'v3'),
  SERIES_COLNAME => 'store_item',
  TIMESTAMP_COLNAME => 'date',
  TARGET_COLNAME => 'sales'
);

After training, you can forecast two steps forward on all series at the same time (that is, two steps per series) with:

CALL model3!FORECAST(FORECASTING_PERIODS => 2);

Output

+-------------------+-------------------------+----------+-------------+-------------+
| SERIES            | TS                      | FORECAST | LOWER_BOUND | UPPER_BOUND |
+-------------------+-------------------------+----------+-------------+-------------+
| [ 1, "jacket" ]   | 2020-01-13 00:00:00.000 | 14       | 14          | 14          |
| [ 1, "jacket" ]   | 2020-01-14 00:00:00.000 | 15       | 15          | 15          |
| [ 2, "umbrella" ] | 2020-01-13 00:00:00.000 | 14       | 14          | 14          |
| [ 2, "umbrella" ] | 2020-01-14 00:00:00.000 | 15       | 15          | 15          |
+-------------------+-------------------------+----------+-------------+-------------+

You can also forecast only a specific series with:

CALL model3!FORECAST(SERIES_VALUE => [2,'umbrella'], FORECASTING_PERIODS => 2);

Output

The result shows only the next two steps for store 2’s sales of umbrellas.

+-------------------+-------------------------+----------+-------------+-------------+
| SERIES            | TS                      | FORECAST | LOWER_BOUND | UPPER_BOUND |
+-------------------+-------------------------+----------+-------------+-------------+
| [ 2, "umbrella" ] | 2020-01-13 00:00:00.000 | 14       | 14          | 14          |
| [ 2, "umbrella" ] | 2020-01-14 00:00:00.000 | 15       | 15          | 15          |
+-------------------+-------------------------+----------+-------------+-------------+

Tip

Specifying one series with the FORECAST method is more efficient than filtering the results of a multi-series forecast to include only the series you’re interested in, because only one series’ forecast is generated.

Forecast on Multiple Series with Exogenous Variables

To train and forecast multiple time series in parallel with exogenous variables, essentially combining the previous two examples, prepare your data in the following manner, incorporating both a series column (store_item in this case) and at least one exogenous column (temperature here):

CREATE OR REPLACE VIEW v4 AS SELECT [store_id, item] AS store_item,
  date, sales, temperature FROM sales_data;
SELECT * FROM v4;

Output

The first five rows of each series in the resulting dataset look like this:

+-------------------+-------------------------+-------+-------------+
| STORE_ITEM        | DATE                    | SALES | TEMPERATURE |
+-------------------+-------------------------+-------+-------------+
| [ 1, "jacket" ]   | 2020-01-01 00:00:00.000 | 2     | 50          |
| [ 1, "jacket" ]   | 2020-01-02 00:00:00.000 | 3     | 52          |
| [ 1, "jacket" ]   | 2020-01-03 00:00:00.000 | 4     | 54          |
| [ 1, "jacket" ]   | 2020-01-04 00:00:00.000 | 5     | 54          |
| [ 1, "jacket" ]   | 2020-01-05 00:00:00.000 | 6     | 55          |
| [ 2, "umbrella" ] | 2020-01-01 00:00:00.000 | 2     | 50          |
| [ 2, "umbrella" ] | 2020-01-02 00:00:00.000 | 3     | 52          |
| [ 2, "umbrella" ] | 2020-01-03 00:00:00.000 | 4     | 54          |
| [ 2, "umbrella" ] | 2020-01-04 00:00:00.000 | 5     | 54          |
| [ 2, "umbrella" ] | 2020-01-05 00:00:00.000 | 6     | 55          |
+-------------------+-------------------------+-------+-------------+

The following statement trains a forecasting model using the preceding dataset. Be sure to specify the series column name, store_item:

CREATE SNOWFLAKE.ML.FORECAST model4(
  INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'v4'),
  SERIES_COLNAME => 'store_item',
  TIMESTAMP_COLNAME => 'date',
  TARGET_COLNAME => 'sales'
);

Tip

You can pass in a query reference for input_data instead of creating an intermediate view. The following statement creates a model using the same input data as the previous statement:

CREATE SNOWFLAKE.ML.FORECAST model4(
  INPUT_DATA => SYSTEM$QUERY_REFERENCE('SELECT [store_id, item] AS store_item, date, sales, temperature FROM sales_data'),
  SERIES_COLNAME => 'store_item',
  TIMESTAMP_COLNAME => 'date',
  TARGET_COLNAME => 'sales'
);

As before, when forecasting you need to pass in future values for the exogenous variables, but this time for each series. Prepare the future values of the exogenous variables:

CREATE OR REPLACE VIEW v4_forecast AS SELECT [store_id, item] AS store_item,
  date, temperature FROM future_features;
SELECT * FROM v4_forecast;

Output

+-------------------+-------------------------+-------------+
| STORE_ITEM        | DATE                    | TEMPERATURE |
+-------------------+-------------------------+-------------+
| [ 1, "jacket" ]   | 2020-01-13 00:00:00.000 | 52          |
| [ 1, "jacket" ]   | 2020-01-14 00:00:00.000 | 53          |
| [ 2, "umbrella" ] | 2020-01-13 00:00:00.000 | 52          |
| [ 2, "umbrella" ] | 2020-01-14 00:00:00.000 | 53          |
+-------------------+-------------------------+-------------+

Now you can generate a forecast for these timestamps:

CALL model4!FORECAST(
  INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'v4_forecast'),
  SERIES_COLNAME => 'store_item',
  TIMESTAMP_COLNAME => 'date'
);

Output

+-------------------+-------------------------+----------+-------------+-------------+
| SERIES            | TS                      | FORECAST | LOWER_BOUND | UPPER_BOUND |
+-------------------+-------------------------+----------+-------------+-------------+
| [ 1, "jacket" ]   | 2020-01-13 00:00:00.000 | 14       | 14          | 14          |
| [ 1, "jacket" ]   | 2020-01-14 00:00:00.000 | 15       | 15          | 15          |
| [ 2, "umbrella" ] | 2020-01-13 00:00:00.000 | 14       | 14          | 14          |
| [ 2, "umbrella" ] | 2020-01-14 00:00:00.000 | 15       | 15          | 15          |
+-------------------+-------------------------+----------+-------------+-------------+

Visualizing Forecasts

To visualize a forecast, use the Snowsight chart feature. After generating a forecast, click Charts above the query’s results table.

You can join your forecasts to your original dataset by using UNION ALL to combine SELECTs on the original data and the forecast results, so you can visualize both together. For example:

CALL model1!FORECAST(FORECASTING_PERIODS => 3);

SELECT date AS ts, sales AS actual, NULL AS forecast, NULL AS lower_bound, NULL AS upper_bound
  FROM v1
UNION ALL
SELECT ts, NULL AS actual, forecast, lower_bound, upper_bound
  FROM TABLE(RESULT_SCAN(-1));

Rename the timestamp and exogenous variable columns if necessary so that they match between the original dataset and the forecast output. Use NULL for the forecast columns in the rows that come from the training data, and conversely for the historical target column in the rows that come from the forecast results. In the preceding example, the historical target column is renamed actual and the forecast target is named forecast.

After you prepare your data this way, follow these steps in the Data section on the right side of the chart:

  1. Select the ACTUAL column, and then under Aggregation, select None.

  2. Select the TS column, and then under Bucketing, select None.

  3. Add the FORECAST column, choosing Use as Line, and then under Aggregation, select None.

  4. Add the LOWER_BOUND and UPPER_BOUND columns the same way.

The resulting chart should look similar to this:

[Image: example of a chart created from a time series forecast.]

Understanding Feature Importance

A forecast model can explain the relative importance of all features used in your model, including any exogenous variables that you choose, automatically generated time features (such as day of week or week of year), and transformations of your target variable (such as rolling averages and auto-regressive lags). This information is useful in understanding what factors are really influencing your data.

The <model_name>!EXPLAIN_FEATURE_IMPORTANCE method counts the number of times the model’s trees used each feature to make a decision. These feature importance scores are then normalized to values between 0 and 1 so that their sum is 1. The resulting scores represent an approximate ranking of the features in your trained model.

Features that are close in score have similar importance. For extremely simple series (for example, when the target column has a constant value), all feature importance scores may be zero.

Using multiple features that are very similar to each other may result in reduced importance scores for those features. For example, if one feature is quantity of items sold and another is quantity of items in inventory, the values may be correlated because you can’t sell more than you have and because you try to manage inventory so you won’t have more in stock than you will sell. If two features are exactly identical, the model may treat them as interchangeable when making decisions, resulting in feature importance scores that are half of what those scores would be if only one of the identical features were included.

Feature importance also reports lag features. During training, the model infers the frequency (hourly, daily, or weekly, for example) of your training data. The feature lagx (for example, lag24) is the value of the target variable x time units ago. For example, if your data is inferred to be hourly, lag24 represents your target variable 24 hours ago.

All other transformations of your target variable (rolling averages, etc.) are summarized as aggregated_endogenous_features in the results table.

Limitations

  • You cannot choose the technique used to calculate feature importance.

  • Feature importance scores can be helpful for gaining intuition about which features are important to your model’s accuracy, but the actual values should be considered estimates.

Example

To understand the relative importance of your features to your model, train a model, and then call <model_name>!EXPLAIN_FEATURE_IMPORTANCE. In this example, you first create random data that includes one exogenous variable, exog_a, whose values are uniformly random; the feature importance output then shows how it ranks against the features derived automatically from the target and the timestamp.

Execute the following statements to generate the data, train a model on it, and get the importance of the features:

CREATE OR REPLACE VIEW v_random_data AS SELECT
  DATEADD('minute', ROW_NUMBER() over (ORDER BY 1), '2023-12-01')::TIMESTAMP_NTZ ts,
  MOD(SEQ1(),10) y,
  UNIFORM(1, 100, RANDOM(0)) exog_a
FROM TABLE(GENERATOR(ROWCOUNT => 500));

CREATE OR REPLACE SNOWFLAKE.ML.FORECAST forecast_feature_importance_demo(
  INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'v_random_data'),
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'y'
);

CALL forecast_feature_importance_demo!EXPLAIN_FEATURE_IMPORTANCE();

Output

+------+--------------------------------------+-------+-------------------------+
| RANK | FEATURE_NAME                         | SCORE | FEATURE_TYPE            |
+------+--------------------------------------+-------+-------------------------+
|    1 | aggregated_endogenous_trend_features |  0.36 | derived_from_endogenous |
|    2 | exog_a                               |  0.22 | user_provided           |
|    3 | epoch_time                           |  0.15 | derived_from_timestamp  |
|    4 | minute                               |  0.13 | derived_from_timestamp  |
|    5 | lag60                                |  0.07 | derived_from_endogenous |
|    6 | lag120                               |  0.06 | derived_from_endogenous |
|    7 | hour                                 |  0.01 | derived_from_timestamp  |
+------+--------------------------------------+-------+-------------------------+

Understanding Evaluation Metrics

By default, every forecasting model is cross-validated. In addition to training a model on all provided training data, one or more models are trained on subsets of the training data, then used to “predict” the withheld data. The predicted target values are then compared to the actual target values. If you don’t need evaluation metrics, or don’t want to spend compute resources on them, set evaluate to FALSE when instantiating the model.
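
For example, the following sketch (assuming the evaluate flag is accepted in CONFIG_OBJECT, as described in the FORECAST reference) trains a model without evaluation, reusing the v1 view from the examples above:

CREATE SNOWFLAKE.ML.FORECAST model_no_eval(
  INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'v1'),
  TIMESTAMP_COLNAME => 'date',
  TARGET_COLNAME => 'sales',
  CONFIG_OBJECT => {'evaluate': FALSE}  -- skip cross-validation; no evaluation metrics
);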

Limitations

  • The n_splits parameter must be set to at least 2.

  • Small datasets may not have enough data to perform evaluation. The total number of training rows must be equal to or greater than (n_splits * test_size) + gap. If not enough data is available to train an evaluation model, no evaluation metrics are available even when evaluate is TRUE.
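
For example, with n_splits = 2, test_size = 30, and gap = 0, you need at least (2 * 30) + 0 = 60 training rows. The following sketch passes these values, assuming they are accepted under an evaluation_config key in CONFIG_OBJECT (see the FORECAST reference), and reuses the v_random_data view created in the example below:

CREATE OR REPLACE SNOWFLAKE.ML.FORECAST model_custom_eval(
  INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'v_random_data'),
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'y',
  CONFIG_OBJECT => {'evaluation_config': {'n_splits': 2, 'test_size': 30, 'gap': 0}}  -- key name assumed
);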

Example

CREATE OR REPLACE VIEW v_random_data AS SELECT
  DATEADD('minute', ROW_NUMBER() over (ORDER BY 1), '2023-12-01')::TIMESTAMP_NTZ ts,
  MOD(SEQ1(),10) y,
  UNIFORM(1, 100, RANDOM(0)) exog_a
FROM TABLE(GENERATOR(ROWCOUNT => 500));

CREATE OR REPLACE SNOWFLAKE.ML.FORECAST model(
  INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'v_random_data'),
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'y'
);

CALL model!SHOW_EVALUATION_METRICS();

Output

+--------------------------+--------------+--------------------+------+
| ERROR_METRIC             | METRIC_VALUE | STANDARD_DEVIATION | LOGS |
+--------------------------+--------------+--------------------+------+
| "MAE"                    |        7.107 |              1.998 | NULL |
| "MAPE"                   |        0.475 |              0.237 | NULL |
| "MDA"                    |        0.920 |              0.025 | NULL |
| "MSE"                    |       86.020 |             66.798 | NULL |
| "SMAPE"                  |        0.241 |              0.047 | NULL |
| "COVERAGE_INTERVAL=0.95" |        0.981 |              0.025 | NULL |
| "WINKLER_ALPHA=0.05"     |       56.697 |             45.199 | NULL |
+--------------------------+--------------+--------------------+------+

Inspecting Training Logs

When you train multiple series with CONFIG_OBJECT => {'ON_ERROR': 'SKIP'}, individual time series models can fail to train without the overall training process failing. To understand which time series failed and why, call <model_instance>!SHOW_TRAINING_LOGS.

Example

CREATE TABLE t_error(date TIMESTAMP_NTZ, sales FLOAT, series VARCHAR);
INSERT INTO t_error VALUES
  (TO_TIMESTAMP_NTZ('2019-12-20'), 1.0, 'A'),
  (TO_TIMESTAMP_NTZ('2019-12-21'), 2.0, 'A'),
  (TO_TIMESTAMP_NTZ('2019-12-22'), 3.0, 'A'),
  (TO_TIMESTAMP_NTZ('2019-12-23'), 2.0, 'A'),
  (TO_TIMESTAMP_NTZ('2019-12-24'), 1.0, 'A'),
  (TO_TIMESTAMP_NTZ('2019-12-25'), 2.0, 'A'),
  (TO_TIMESTAMP_NTZ('2019-12-26'), 3.0, 'A'),
  (TO_TIMESTAMP_NTZ('2019-12-27'), 2.0, 'A'),
  (TO_TIMESTAMP_NTZ('2019-12-28'), 1.0, 'A'),
  (TO_TIMESTAMP_NTZ('2019-12-29'), 2.0, 'A'),
  (TO_TIMESTAMP_NTZ('2019-12-30'), 3.0, 'A'),
  (TO_TIMESTAMP_NTZ('2019-12-31'), 2.0, 'A'),
  (TO_TIMESTAMP_NTZ('2020-01-01'), 2.0, 'A'),
  (TO_TIMESTAMP_NTZ('2020-01-02'), 3.0, 'A'),
  (TO_TIMESTAMP_NTZ('2020-01-03'), 3.0, 'A'),
  (TO_TIMESTAMP_NTZ('2020-01-04'), 7.0, 'A'),
  (TO_TIMESTAMP_NTZ('2020-01-05'), 10.0, 'B'),
  (TO_TIMESTAMP_NTZ('2020-01-06'), 13.0, 'B'),
  (TO_TIMESTAMP_NTZ('2020-01-06'), 12.0, 'B'), -- duplicate timestamp
  (TO_TIMESTAMP_NTZ('2020-01-07'), 15.0, 'B'),
  (TO_TIMESTAMP_NTZ('2020-01-08'), 14.0, 'B'),
  (TO_TIMESTAMP_NTZ('2020-01-09'), 18.0, 'B'),
  (TO_TIMESTAMP_NTZ('2020-01-10'), 12.0, 'B');

CREATE SNOWFLAKE.ML.FORECAST model(
  INPUT_DATA => SYSTEM$QUERY_REFERENCE('SELECT date, sales, series FROM t_error'),
  SERIES_COLNAME => 'series',
  TIMESTAMP_COLNAME => 'date',
  TARGET_COLNAME => 'sales',
  CONFIG_OBJECT => {'ON_ERROR': 'SKIP'}
);

CALL model!SHOW_TRAINING_LOGS();

Output

+--------+-------------------------------------------------------------------------------------------------+
| SERIES | LOGS                                                                                            |
+--------+-------------------------------------------------------------------------------------------------+
| "B"    | {   "Errors": [     "Frequency cannot be inferred when duplicate timestamps are present."   ] } |
| "A"    | NULL                                                                                            |
+--------+-------------------------------------------------------------------------------------------------+

Cost Considerations

Training and using forecasting models incurs storage and compute costs.

As described in Selecting a Virtual Warehouse, training consumes more compute than forecasting from a trained model. However, the costs of forecasting can accumulate with repeated use of the model. See Understanding compute cost for general information on Snowflake compute costs.

The storage costs you incur reflect storage of the ML model instances created during the training step. To view the objects associated with your model instance, navigate to your account usage views (e.g., ACCOUNT_USAGE.TABLES and ACCOUNT_USAGE.STAGES). These objects will appear with NULL database and schema columns. The instance_id column, however, is populated and indicates that these objects are contained inside a model instance. These objects are fully managed by the model instance, and you cannot access or delete them separately. To reduce storage costs associated with your models, delete unused or obsolete models.

Using Forecasting in Snowpark

session.call is not yet compatible with forecasting models. To call a forecasting model in Snowpark, use session.sql instead, as shown here.

session.sql('call my_model!forecast(...)').collect()