Time-Series Forecasting (Snowflake ML Functions)

Forecasting employs a machine learning algorithm to predict future numeric data based on historical time series data. A common use case is to forecast sales by item for the next two weeks.

Quickstart to forecasting

This section gives the quickest way to get started with forecasting.

Prerequisites

To get started, you must do the following:

  • Select a database, schema and virtual warehouse.

  • Confirm that you own your schema or have CREATE SNOWFLAKE.ML.FORECAST privileges in the schema you’ve chosen.

  • Have a table or view with at least two columns: one timestamp column and one numeric column. Be sure your timestamp column has timestamps at a fixed interval and isn’t missing too many timestamps. The following example creates a small dataset with timestamp intervals of one day (the table and column names are illustrative):

    CREATE TABLE my_data (ts TIMESTAMP_NTZ, metric FLOAT);
    INSERT INTO my_data VALUES
      ('2020-01-01 00:00:00.000', 2.0),
      ('2020-01-02 00:00:00.000', 3.0),
      ('2020-01-03 00:00:00.000', 4.0);
    

Create forecasts

Once you have the prerequisites, you can use the AI & ML Studio in Snowsight to guide you through setup or you can use the following SQL commands to train a model and start creating forecasts:

-- Train your model
CREATE SNOWFLAKE.ML.FORECAST my_model(
  INPUT_DATA => TABLE(my_view),
  TIMESTAMP_COLNAME => 'my_timestamps',
  TARGET_COLNAME => 'my_metric'
);

-- Generate forecasts using your model
SELECT * FROM my_model!FORECAST(FORECASTING_PERIODS => 7);

For more details on syntax and available methods, see the FORECAST (SNOWFLAKE.ML) reference.

Dive deeper into forecasting

The forecasting function is built to predict any numeric time series data into the future. In addition to the simple case presented in the Quickstart to forecasting section, you can do the following:

  • Predict for multiple series at once. For example, you can predict the sales of multiple items for the next two weeks.

  • Train and predict using features. Features are additional factors that you believe influence the metric you want to forecast.

  • Assess your model’s accuracy.

  • Understand the relative importance of the features the model was trained on.

  • Debug training errors.

The following sections provide examples of these scenarios and additional details on how forecasting works.

Examples

This section provides examples of how to set up your data for forecasting and how to create a forecasting model based on your time-series data.

Set up example data

The example below creates two tables. Views of these tables are included in the examples later in this topic.

The sales_data table contains sales data. Each sale includes a store ID, an item identifier, a timestamp, and the sales amount. Additional feature columns (temperature, humidity, and holiday) are also included.

The future_features table contains future values of the feature columns, which are required when you generate forecasts using features.

CREATE OR REPLACE TABLE sales_data (store_id NUMBER, item VARCHAR, date TIMESTAMP_NTZ,
  sales FLOAT, temperature NUMBER, humidity FLOAT, holiday VARCHAR);

INSERT INTO sales_data VALUES
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-01'), 2.0, 50, 0.3, 'new year'),
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-02'), 3.0, 52, 0.3, NULL),
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-03'), 4.0, 54, 0.2, NULL),
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-04'), 5.0, 54, 0.3, NULL),
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-05'), 6.0, 55, 0.2, NULL),
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-06'), 7.0, 55, 0.2, NULL),
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-07'), 8.0, 55, 0.2, NULL),
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-08'), 9.0, 55, 0.2, NULL),
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-09'), 10.0, 55, 0.2, NULL),
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-10'), 11.0, 55, 0.2, NULL),
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-11'), 12.0, 55, 0.2, NULL),
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-12'), 13.0, 55, 0.2, NULL),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-01'), 2.0, 50, 0.3, 'new year'),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-02'), 3.0, 52, 0.3, NULL),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-03'), 4.0, 54, 0.2, NULL),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-04'), 5.0, 54, 0.3, NULL),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-05'), 6.0, 55, 0.2, NULL),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-06'), 7.0, 55, 0.2, NULL),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-07'), 8.0, 55, 0.2, NULL),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-08'), 9.0, 55, 0.2, NULL),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-09'), 10.0, 55, 0.2, NULL),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-10'), 11.0, 55, 0.2, NULL),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-11'), 12.0, 55, 0.2, NULL),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-12'), 13.0, 55, 0.2, NULL);

-- Future values for additional columns (features)
CREATE OR REPLACE TABLE future_features (store_id NUMBER, item VARCHAR,
  date TIMESTAMP_NTZ, temperature NUMBER, humidity FLOAT, holiday VARCHAR);

INSERT INTO future_features VALUES
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-13'), 52, 0.3, NULL),
  (1, 'jacket', TO_TIMESTAMP_NTZ('2020-01-14'), 53, 0.3, NULL),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-13'), 52, 0.3, NULL),
  (2, 'umbrella', TO_TIMESTAMP_NTZ('2020-01-14'), 53, 0.3, NULL);

Forecasting on a single series

This example uses a single time series (that is, all the rows are part of a single series) that has two columns, a timestamp column and a target value column, without additional features.

First, prepare the example dataset to train the model:

CREATE OR REPLACE VIEW v1 AS SELECT date, sales
  FROM sales_data WHERE store_id=1 AND item='jacket';
SELECT * FROM v1;

The SELECT statement returns:

+-------------------------+-------+
| DATE                    | SALES |
+-------------------------+-------+
| 2020-01-01 00:00:00.000 | 2     |
| 2020-01-02 00:00:00.000 | 3     |
| 2020-01-03 00:00:00.000 | 4     |
| 2020-01-04 00:00:00.000 | 5     |
| 2020-01-05 00:00:00.000 | 6     |
+-------------------------+-------+

Now, train a forecasting model using this view:

CREATE SNOWFLAKE.ML.FORECAST model1(
  INPUT_DATA => TABLE(v1),
  TIMESTAMP_COLNAME => 'date',
  TARGET_COLNAME => 'sales'
);

The following message appears after the model is trained:

Instance MODEL1 successfully created.

Next, use the forecasting model to forecast the next three timestamps:

CALL model1!FORECAST(FORECASTING_PERIODS => 3);

Output

Note that the model has inferred the interval between timestamps from the training data.

+--------+-------------------------+-----------+--------------+--------------+
| SERIES | TS                      | FORECAST  | LOWER_BOUND  | UPPER_BOUND  |
+--------+-------------------------+-----------+--------------+--------------+
| NULL   | 2020-01-13 00:00:00.000 | 14        | 14           | 14           |
| NULL   | 2020-01-14 00:00:00.000 | 15        | 15           | 15           |
| NULL   | 2020-01-15 00:00:00.000 | 16        | 16           | 16           |
+--------+-------------------------+-----------+--------------+--------------+

In this example, because the forecast yields a perfectly linear prediction that has zero errors compared to the actual values, the prediction interval (LOWER_BOUND, UPPER_BOUND) is the same as the FORECAST value.

To customize the size of the prediction interval, pass prediction_interval as part of a configuration object:

CALL model1!FORECAST(FORECASTING_PERIODS => 3, CONFIG_OBJECT => {'prediction_interval': 0.8});

To save your results directly to a table, use CREATE TABLE … AS SELECT … and call the FORECAST method in the FROM clause:

CREATE TABLE my_forecasts AS
  SELECT * FROM TABLE(model1!FORECAST(FORECASTING_PERIODS => 3));

As shown in the example above, when calling the method, omit the CALL command. Instead, put the call in parentheses, preceded by the TABLE keyword.
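Likewise, to append later forecasts to an existing table, call the method in the FROM clause of an INSERT statement (a sketch reusing the my_forecasts table created above):

INSERT INTO my_forecasts
  SELECT * FROM TABLE(model1!FORECAST(FORECASTING_PERIODS => 3));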

Forecasting on multiple series

To create a forecasting model for multiple series at once, use the series_colname parameter.

In this example, the data contains store_id and item columns. To forecast sales separately for every store/item combination in the dataset, create a new column that combines these values, and specify that as the series column.

The following query creates a new view combining store_id and item into a new column named store_item:

CREATE OR REPLACE VIEW v3 AS SELECT [store_id, item] AS store_item, date, sales FROM sales_data;
SELECT * FROM v3;

Output

The first five rows of each series in the resulting dataset are:

+-------------------+-------------------------+-------+
| STORE_ITEM        | DATE                    | SALES |
+-------------------+-------------------------+-------+
| [ 1, "jacket" ]   | 2020-01-01 00:00:00.000 | 2     |
| [ 1, "jacket" ]   | 2020-01-02 00:00:00.000 | 3     |
| [ 1, "jacket" ]   | 2020-01-03 00:00:00.000 | 4     |
| [ 1, "jacket" ]   | 2020-01-04 00:00:00.000 | 5     |
| [ 1, "jacket" ]   | 2020-01-05 00:00:00.000 | 6     |
| [ 2, "umbrella" ] | 2020-01-01 00:00:00.000 | 2     |
| [ 2, "umbrella" ] | 2020-01-02 00:00:00.000 | 3     |
| [ 2, "umbrella" ] | 2020-01-03 00:00:00.000 | 4     |
| [ 2, "umbrella" ] | 2020-01-04 00:00:00.000 | 5     |
| [ 2, "umbrella" ] | 2020-01-05 00:00:00.000 | 6     |
+-------------------+-------------------------+-------+

Now use the forecasting function to train a model for each series, all in one step. Note that the series_colname parameter is set to store_item:

CREATE SNOWFLAKE.ML.FORECAST model2(
  INPUT_DATA => TABLE(v3),
  SERIES_COLNAME => 'store_item',
  TIMESTAMP_COLNAME => 'date',
  TARGET_COLNAME => 'sales'
);

Next, use that model to forecast the next two timestamps for all series:

CALL model2!FORECAST(FORECASTING_PERIODS => 2);

Output

+-------------------+-------------------------+----------+-------------+-------------+
| SERIES            | TS                      | FORECAST | LOWER_BOUND | UPPER_BOUND |
+-------------------+-------------------------+----------+-------------+-------------+
| [ 1, "jacket" ]   | 2020-01-13 00:00:00.000 | 14       | 14          | 14          |
| [ 1, "jacket" ]   | 2020-01-14 00:00:00.000 | 15       | 15          | 15          |
| [ 2, "umbrella" ] | 2020-01-13 00:00:00.000 | 14       | 14          | 14          |
| [ 2, "umbrella" ] | 2020-01-14 00:00:00.000 | 15       | 15          | 15          |
+-------------------+-------------------------+----------+-------------+-------------+

You can also forecast a specific series with:

CALL model2!FORECAST(SERIES_VALUE => [2,'umbrella'], FORECASTING_PERIODS => 2);

Output

The result shows only the next two steps for store 2’s sales of umbrellas.

+-------------------+-------------------------+-----------+-------------+-------------+
| SERIES            | TS                      | FORECAST  | LOWER_BOUND | UPPER_BOUND |
+-------------------+-------------------------+-----------+-------------+-------------+
| [ 2, "umbrella" ] | 2020-01-13 00:00:00.000 | 14        | 14          | 14          |
| [ 2, "umbrella" ] | 2020-01-14 00:00:00.000 | 15        | 15          | 15          |
+-------------------+-------------------------+-----------+-------------+-------------+

Tip

Specifying one series with the FORECAST method is more efficient than filtering the results of a multi-series forecast to include only the series you’re interested in, because only one series’ forecast is generated.

Forecasting with features

If you want additional features (for example, holidays or weather) to influence your forecasts, you must include these features in your training data. Here you create a view containing those fields from the sales_data table:

CREATE OR REPLACE VIEW v2 AS SELECT date, sales, temperature, humidity, holiday
  FROM sales_data WHERE store_id=1 AND item='jacket';
SELECT * FROM v2;

Output

These are the first five rows returned by the SELECT query.

+-------------------------+--------+-------------+----------+----------+
| DATE                    | SALES  | TEMPERATURE | HUMIDITY | HOLIDAY  |
+-------------------------+--------+-------------+----------+----------+
| 2020-01-01 00:00:00.000 | 2      | 50          | 0.3      | new year |
| 2020-01-02 00:00:00.000 | 3      | 52          | 0.3      | null     |
| 2020-01-03 00:00:00.000 | 4      | 54          | 0.2      | null     |
| 2020-01-04 00:00:00.000 | 5      | 54          | 0.3      | null     |
| 2020-01-05 00:00:00.000 | 6      | 55          | 0.2      | null     |
+-------------------------+--------+-------------+----------+----------+

Now you can use this view to train a model. You are only required to specify the timestamp and target column names; additional columns in the input data are assumed to be features for use in training.

CREATE SNOWFLAKE.ML.FORECAST model3(
  INPUT_DATA => TABLE(v2),
  TIMESTAMP_COLNAME => 'date',
  TARGET_COLNAME => 'sales'
);

To generate forecasts with this model, you must provide the model with future values for the features: in this case, TEMPERATURE, HUMIDITY, and HOLIDAY. This allows the model to adjust its sales forecasts based on temperature, humidity, and holiday forecasts.

Now create a view from the future_features table containing this data for future timestamps:

CREATE OR REPLACE VIEW v2_forecast AS SELECT date, temperature, humidity, holiday
  FROM future_features WHERE store_id=1 AND item='jacket';
SELECT * FROM v2_forecast;

Output

+-------------------------+-------------+----------+---------+
| DATE                    | TEMPERATURE | HUMIDITY | HOLIDAY |
+-------------------------+-------------+----------+---------+
| 2020-01-13 00:00:00.000 | 52          | 0.3      | null    |
| 2020-01-14 00:00:00.000 | 53          | 0.3      | null    |
+-------------------------+-------------+----------+---------+

Now you can generate a forecast using this data:

CALL model3!FORECAST(
  INPUT_DATA => TABLE(v2_forecast),
  TIMESTAMP_COLNAME =>'date'
);

In this variation of the FORECAST method, you do not specify the number of timestamps to predict. Instead, the timestamps of the forecast come from the v2_forecast view.

+--------+-------------------------+-----------+--------------+--------------+
| SERIES | TS                      | FORECAST  | LOWER_BOUND  | UPPER_BOUND  |
+--------+-------------------------+-----------+--------------+--------------+
| NULL   | 2020-01-13 00:00:00.000 | 14        | 14           | 14           |
| NULL   | 2020-01-14 00:00:00.000 | 15        | 15           | 15           |
+--------+-------------------------+-----------+--------------+--------------+

Troubleshooting and model assessment

You can use the following helper functions to assess your model’s performance, understand which features are most impactful to your model, and debug the training process if any errors occurred:

Evaluation metrics

To get the evaluation metrics for your model, call the <model_name>!SHOW_EVALUATION_METRICS method. By default, the forecasting function evaluates all models it trains using a method called cross-validation. This means that under the hood, in addition to training the final model on all of the training data you provide, the function also trains models on subsets of your training data. Those models are then used to predict your target metric on the withheld data, allowing the function to compare those predictions to actual values in your historical data.

If you don’t need these evaluation metrics, you can set evaluate to FALSE. If you want to control the way cross-validation is run, you can use the following parameters:

  • n_splits: Represents the number of splits in your data for cross-validation. The default is 5; the value must be at least 2.

  • max_train_size: Represents the maximum number of rows for a single training set.

  • test_size: Limits the number of rows included in each test set.

  • gap: Represents the gap between the end of each training set and the start of the test set.

For complete details on evaluation parameters, see Evaluation configuration.
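For example, here is a sketch that tunes the cross-validation settings at training time by passing them through CONFIG_OBJECT under an evaluation_config key (key names as described in the Evaluation configuration reference; the model and column names reuse the Quickstart example):

CREATE SNOWFLAKE.ML.FORECAST my_model(
  INPUT_DATA => TABLE(my_view),
  TIMESTAMP_COLNAME => 'my_timestamps',
  TARGET_COLNAME => 'my_metric',
  CONFIG_OBJECT => {'evaluate': TRUE, 'evaluation_config': {'n_splits': 3, 'test_size': 10, 'gap': 0}}
);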

Note

Small datasets may not have enough data to perform evaluation. The total number of training rows must be equal to or greater than (n_splits * test_size) + gap. For example, with n_splits = 5, test_size = 10, and gap = 0, you need at least 50 rows of training data. If not enough data is available to train an evaluation model, no evaluation metrics are available even when evaluate is set to TRUE.

Example

CREATE OR REPLACE VIEW v_random_data AS SELECT
  DATEADD('minute', ROW_NUMBER() over (ORDER BY 1), '2023-12-01')::TIMESTAMP_NTZ ts,
  MOD(SEQ1(),10) y,
  UNIFORM(1, 100, RANDOM(0)) exog_a
FROM TABLE(GENERATOR(ROWCOUNT => 500));

CREATE OR REPLACE SNOWFLAKE.ML.FORECAST model(
  INPUT_DATA => TABLE(v_random_data),
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'y'
);

CALL model!SHOW_EVALUATION_METRICS();

Output

+--------+--------------------------+--------------+--------------------+------+
| SERIES | ERROR_METRIC             | METRIC_VALUE | STANDARD_DEVIATION | LOGS |
+--------+--------------------------+--------------+--------------------+------+
| NULL   | "MAE"                    |        7.107 |              1.998 | NULL |
| NULL   | "MAPE"                   |        0.475 |              0.237 | NULL |
| NULL   | "MDA"                    |        0.920 |              0.025 | NULL |
| NULL   | "MSE"                    |       86.020 |             66.798 | NULL |
| NULL   | "SMAPE"                  |        0.241 |              0.047 | NULL |
| NULL   | "COVERAGE_INTERVAL=0.95" |        0.981 |              0.025 | NULL |
| NULL   | "WINKLER_ALPHA=0.05"     |       56.697 |             45.199 | NULL |
+--------+--------------------------+--------------+--------------------+------+

Feature importance

To understand the relative importance of the features used in your model, use the <model_name>!EXPLAIN_FEATURE_IMPORTANCE method.

When you train a forecasting model, the model learns patterns from all of the provided data: timestamps, your target metric, any additional columns you provide (features), and features that are automatically generated to improve forecast performance. Some of the features it trains on are more important to making accurate predictions than others. This helper function reports the relative importance of those features on a scale of 0 to 1.

Under the hood, this helper function counts the number of times the model used each feature to make a decision. These feature importance scores are then normalized to values between 0 and 1 so that their sum is 1. For example, if the model used three features 60, 30, and 10 times, their normalized scores would be 0.6, 0.3, and 0.1. The resulting scores represent an approximate ranking of the features in your trained model.

Key considerations for this feature

  • Features that are close in score have similar importance.

  • For extremely simple series (for example, when the target column has a constant value), all feature importance scores may be zero.

  • Using multiple features that are very similar to each other may result in reduced importance scores for those features. For example, if two features are exactly identical, the model may treat them as interchangeable when making decisions, resulting in feature importance scores that are half of what those scores would be if only one of the identical features were included.

  • Auto-generated features include lag features (e.g., lag24, which would be a 24-hour lag of your target metric if your data are hourly), rolling averages summarized as aggregated_endogenous_features (e.g., a 7-day rolling window if your data are daily), and calendar features such as day of week or month of year. These are reflected in the feature importance results.

Example

This example uses the data from the evaluation example and calls the feature importance method. You can see that the exog_a variable created there is the second most important feature, behind the rolling averages, which are aggregated under the aggregated_endogenous_trend_features feature name.

Execute the following statements to get the importance of the features:

CALL model!EXPLAIN_FEATURE_IMPORTANCE();

Output

+--------+------+--------------------------------------+-------+-------------------------+
| SERIES | RANK | FEATURE_NAME                         | SCORE | FEATURE_TYPE            |
+--------+------+--------------------------------------+-------+-------------------------+
| NULL   |    1 | aggregated_endogenous_trend_features |  0.36 | derived_from_endogenous |
| NULL   |    2 | exog_a                               |  0.22 | user_provided           |
| NULL   |    3 | epoch_time                           |  0.15 | derived_from_timestamp  |
| NULL   |    4 | minute                               |  0.13 | derived_from_timestamp  |
| NULL   |    5 | lag60                                |  0.07 | derived_from_endogenous |
| NULL   |    6 | lag120                               |  0.06 | derived_from_endogenous |
| NULL   |    7 | hour                                 |  0.01 | derived_from_timestamp  |
+--------+------+--------------------------------------+-------+-------------------------+

Troubleshooting

When you train multiple series with CONFIG_OBJECT => {'ON_ERROR': 'SKIP'}, individual time series models can fail to train without the overall training process failing. To understand which time series failed and why, call the <model_name>!SHOW_TRAINING_LOGS method.

Example

CREATE TABLE t_error(date TIMESTAMP_NTZ, sales FLOAT, series VARCHAR);
INSERT INTO t_error VALUES
  (TO_TIMESTAMP_NTZ('2019-12-30'), 3.0, 'A'),
  (TO_TIMESTAMP_NTZ('2019-12-31'), 2.0, 'A'),
  (TO_TIMESTAMP_NTZ('2020-01-01'), 2.0, 'A'),
  (TO_TIMESTAMP_NTZ('2020-01-02'), 3.0, 'A'),
  (TO_TIMESTAMP_NTZ('2020-01-03'), 3.0, 'A'),
  (TO_TIMESTAMP_NTZ('2020-01-04'), 7.0, 'A'),
  (TO_TIMESTAMP_NTZ('2020-01-05'), 10.0, 'B'),
  (TO_TIMESTAMP_NTZ('2020-01-06'), 13.0, 'B'),
  (TO_TIMESTAMP_NTZ('2020-01-06'), 12.0, 'B'), -- duplicate timestamp
  (TO_TIMESTAMP_NTZ('2020-01-07'), 15.0, 'B'),
  (TO_TIMESTAMP_NTZ('2020-01-08'), 14.0, 'B'),
  (TO_TIMESTAMP_NTZ('2020-01-09'), 18.0, 'B'),
  (TO_TIMESTAMP_NTZ('2020-01-10'), 12.0, 'B');

CREATE SNOWFLAKE.ML.FORECAST error_model(
  INPUT_DATA => TABLE(SELECT date, sales, series FROM t_error),
  SERIES_COLNAME => 'series',
  TIMESTAMP_COLNAME => 'date',
  TARGET_COLNAME => 'sales',
  CONFIG_OBJECT => {'ON_ERROR': 'SKIP'}
);

CALL error_model!SHOW_TRAINING_LOGS();

Output

+--------+-------------------------------------------------------------------------------------------------+
| SERIES | LOGS                                                                                            |
+--------+-------------------------------------------------------------------------------------------------+
| "B"    | {   "Errors": [     "Frequency cannot be inferred when duplicate timestamps are present."   ] } |
| "A"    | NULL                                                                                            |
+--------+-------------------------------------------------------------------------------------------------+

Model management

To view a list of your models, use the SHOW SNOWFLAKE.ML.FORECAST command:

SHOW SNOWFLAKE.ML.FORECAST;

To delete a model, use the DROP SNOWFLAKE.ML.FORECAST command:

DROP SNOWFLAKE.ML.FORECAST my_model;

Models are immutable and cannot be updated in place. Train a new model instead.
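Because retraining means re-creating the model under the same name, you can use CREATE OR REPLACE, as in the evaluation example above. A sketch reusing the Quickstart names:

CREATE OR REPLACE SNOWFLAKE.ML.FORECAST my_model(
  INPUT_DATA => TABLE(my_view),
  TIMESTAMP_COLNAME => 'my_timestamps',
  TARGET_COLNAME => 'my_metric'
);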

Warehouse selection

A Snowflake virtual warehouse provides the compute resources for training and using the machine learning models for this feature. This section provides general guidance on selecting the best type and size of warehouse for this purpose, focusing on the training step, the most time-consuming and memory-intensive part of the process.

There are two key factors to keep in mind when choosing a warehouse:

  1. The number of rows and columns your data contains.

  2. The number of distinct series your data contains.

You can use the following rules of thumb to choose your warehouse:

  1. If you are training on a longer time series (> 5 million rows) or on many columns (many features), consider upgrading to Snowpark-optimized warehouses.

  2. If you are training on many series, size up. The forecasting function distributes model training across all available nodes in your warehouse when you are training for multiple series at once.

The following table provides this same guidance:

+-----------------+-----------------------------+---------------------------------------+----------------------------------------------+
| Series type     | < 5 million rows            | > 5 million and ≤ 100 million rows    | > 100 million rows                           |
+-----------------+-----------------------------+---------------------------------------+----------------------------------------------+
| One series      | Standard warehouse; XS      | Snowpark-optimized warehouse; XS      | Consider aggregating to a less frequent      |
|                 |                             |                                       | timestamp interval (e.g., hourly to daily)   |
+-----------------+-----------------------------+---------------------------------------+----------------------------------------------+
| Multiple series | Standard warehouse; size up | Snowpark-optimized warehouse; size up | Consider batching training by series into    |
|                 |                             |                                       | multiple jobs                                |
+-----------------+-----------------------------+---------------------------------------+----------------------------------------------+

As a rough estimate, training time is proportional to the number of rows in your time series. For example, on an XS standard warehouse, training on a 100,000-row dataset takes about 30 seconds. Training on a 1,000,000-row dataset takes about 140 seconds.
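If you upgrade to a Snowpark-optimized warehouse for larger training jobs, you create it with standard warehouse DDL. A minimal sketch (the warehouse name and size are illustrative):

CREATE WAREHOUSE forecast_wh WITH
  WAREHOUSE_SIZE = 'MEDIUM'
  WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED';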

Algorithm details

The forecasting algorithm is powered by a gradient boosting machine (GBM). Like an ARIMA model, it uses a differencing transformation to model data with a non-stationary trend and uses auto-regressive lags of the historical target data as model variables.

Additionally, the algorithm uses rolling averages of historical target data to help predict trends, and automatically produces cyclic calendar variables (such as day of week and week of year) from timestamp data.

You can fit models with only historical target and timestamp data, or you may include features (extra columns) that might have influenced the target value. Exogenous variables can be numerical or categorical and may be NULL (rows containing NULLs for exogenous variables are not dropped).

The algorithm does not rely on one-hot encoding when training on categorical variables, so you can use categorical data with many dimensions (high cardinality).

If your model incorporates features, when generating a forecast you must provide values for those features at every timestamp of the full forecast horizon. Appropriate features could include weather data (temperature, rainfall), company-specific information (historic and planned company holidays, advertisement campaigns, event schedules), or any other external factors you believe may help predict your target variable.

The algorithm also generates prediction intervals, in addition to forecasts. A prediction interval is an estimated range of values within an upper bound and a lower bound in which a certain percentage of data is likely to fall. For example, a 0.95 value means that 95% of the data likely appears within the interval. You may specify a prediction interval percentage, or use the default, which is 0.95. Lower and upper bounds of the prediction interval are returned as part of the forecast output.

Important

From time to time, Snowflake may refine the forecasting algorithm and will roll out such improvements through the regular Snowflake release process. You cannot revert to a previous version of the feature, but models you have created with a previous version will continue to use that version for predictions until deprecation through the Behavior Change Release process.

Current limitations

The current release has the following limitations:

  • You cannot choose or adjust the forecasting algorithm.

  • The minimum number of rows for the main forecasting algorithm is 12 per time series. For time series with between 2 and 11 observations, forecasting produces a “naive” forecast where all forecasted values are equal to the last observed target value.

  • The forecasting functions do not provide parameters to override trend, seasonality, or seasonal amplitudes; these are inferred from the data.

  • The minimum acceptable granularity of data is one second. (Timestamps must not be less than one second apart.)

  • The minimum granularity of seasonal components is one minute. (The function cannot detect cyclic patterns at smaller time deltas.)

  • The timestamps in your data must represent fixed time intervals. If your input data is irregular, try using DATE_TRUNC or TIME_SLICE on your timestamp column when training the model (see the sketch after this list).

  • The “season length” of autoregressive features is tied to the input frequency (24 for hourly data, 7 for daily data, and so on).

  • Forecast models, once trained, are immutable. You cannot update existing models with new data; you must train an entirely new model.

  • Models do not support versioning. Snowflake recommends retraining a model on a regular cadence, perhaps daily, weekly, or monthly, depending on how frequently you receive new data, allowing the model to adjust to changing patterns and trends.

  • You cannot clone models or share models across roles or accounts. When cloning a schema or database, model objects are skipped.

  • You cannot replicate an instance of the FORECAST class.
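For the fixed-interval requirement above, here is a minimal sketch that regularizes irregular timestamps by truncating them to the day and aggregating, reusing the sales_data table from the examples (the view name is illustrative):

CREATE OR REPLACE VIEW v1_daily AS
  SELECT DATE_TRUNC('DAY', date) AS date, SUM(sales) AS sales
  FROM sales_data
  WHERE store_id = 1 AND item = 'jacket'
  GROUP BY 1;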

Granting privileges to create forecast objects

Training a forecasting model results in a schema-level object. Therefore, the role you use to create models must have the CREATE SNOWFLAKE.ML.FORECAST privilege on the schema where the model is created, allowing the model to be stored there. This privilege is similar to other schema privileges like CREATE TABLE or CREATE VIEW.

Snowflake recommends that you create a role named analyst to be used by people who need to create forecasts.

In the following example, the admin role is the owner of the schema admin_db.admin_schema. The analyst role needs to create models in this schema.

USE ROLE admin;
GRANT USAGE ON DATABASE admin_db TO ROLE analyst;
GRANT USAGE ON SCHEMA admin_db.admin_schema TO ROLE analyst;
GRANT CREATE SNOWFLAKE.ML.FORECAST ON SCHEMA admin_db.admin_schema TO ROLE analyst;

To use this schema, a user assumes the role analyst:

USE ROLE analyst;
USE SCHEMA admin_db.admin_schema;

If the analyst role has CREATE SCHEMA privileges in database analyst_db, the role can create a new schema analyst_db.analyst_schema and create forecast models in that schema:

USE ROLE analyst;
CREATE SCHEMA analyst_db.analyst_schema;
USE SCHEMA analyst_db.analyst_schema;

To revoke a role’s forecast model creation privilege on the schema, use REVOKE <privileges>:

REVOKE CREATE SNOWFLAKE.ML.FORECAST ON SCHEMA admin_db.admin_schema FROM ROLE analyst;

Cost considerations

For details on costs for using ML functions, see Cost Considerations in the ML functions overview.