CREATE SNOWFLAKE.ML.ANOMALY_DETECTION

Creates a new anomaly detection model or replaces an existing one using the training data you provide.

Syntax

CREATE [ OR REPLACE ] SNOWFLAKE.ML.ANOMALY_DETECTION <model_name>(
  INPUT_DATA => <reference_to_training_data>,
  [ SERIES_COLNAME => '<series_column_name>', ]
  TIMESTAMP_COLNAME => '<timestamp_column_name>',
  TARGET_COLNAME => '<target_column_name>',
  LABEL_COLNAME => '<label_column_name>'
  [ , CONFIG_OBJECT => <config_object> ]
)
[ [ WITH ] TAG ( <tag_name> = '<tag_value>' [ , <tag_name> = '<tag_value>' , ... ] ) ]
[ COMMENT = '<string_literal>' ]

Parameters

model_name

Specifies the identifier (model_name) for the anomaly detector object; must be unique for the schema in which the object is created.

In addition, the identifier must start with an alphabetic character and cannot contain spaces or special characters unless the entire identifier string is enclosed in double quotes (for example, "My object"). Identifiers enclosed in double quotes are also case-sensitive. For more details, see Identifier requirements.

Constructor arguments

Required:

INPUT_DATA => reference_to_training_data

Specifies a reference to the table, view, or query that returns the training data for the model.

To create this reference, you can use the TABLE keyword with the table name, view name, or query, or you can call the SYSTEM$REFERENCE or SYSTEM$QUERY_REFERENCE function.

TIMESTAMP_COLNAME => 'timestamp_column_name'

Specifies the name of the column containing the timestamps (TIMESTAMP_NTZ) in the time series data.

TARGET_COLNAME => 'target_column_name'

Specifies the name of the column containing the data (NUMERIC or FLOAT) to analyze.

LABEL_COLNAME => 'label_column_name'

Specifies the name of the column containing the labels for the data. Labels are Boolean (true/false) values indicating whether a given row is a known anomaly. If you do not have labeled data, pass an empty string ('') for this argument.
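As a minimal sketch of the required arguments, the statements below train a detector on unlabeled data, building the INPUT_DATA reference in each of the three ways described above. The table sensor_readings, its columns ts and reading, and the model name reading_detector are hypothetical. The two-argument call SYSTEM$REFERENCE('TABLE', ...) is an assumption about the simplest form of that function; consult its reference page for the remaining options.

```sql
-- Hypothetical training table: sensor_readings(ts TIMESTAMP_NTZ, reading FLOAT).

-- 1. TABLE keyword with a table or view name.
CREATE OR REPLACE SNOWFLAKE.ML.ANOMALY_DETECTION reading_detector(
  INPUT_DATA => TABLE(sensor_readings),
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'reading',
  LABEL_COLNAME => ''   -- empty string: no labeled anomalies are available
);

-- 2. SYSTEM$REFERENCE with the object type and object name.
CREATE OR REPLACE SNOWFLAKE.ML.ANOMALY_DETECTION reading_detector(
  INPUT_DATA => SYSTEM$REFERENCE('TABLE', 'sensor_readings'),
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'reading',
  LABEL_COLNAME => ''
);

-- 3. SYSTEM$QUERY_REFERENCE with the text of a query.
CREATE OR REPLACE SNOWFLAKE.ML.ANOMALY_DETECTION reading_detector(
  INPUT_DATA => SYSTEM$QUERY_REFERENCE('SELECT ts, reading FROM sensor_readings'),
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'reading',
  LABEL_COLNAME => ''
);
```

Each statement replaces the model created by the previous one, so in practice only one of the three forms is needed.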

Optional:

SERIES_COLNAME => 'series_column_name'

Specifies the name of the column containing the series identifier (for multi-series data). This column should be of type VARIANT, because the identifier can be a value of any type or an array that combines values from more than one column.

CONFIG_OBJECT => config_object

An OBJECT containing key-value pairs used to configure the model training job.

The object supports the following keys:

aggregation_categorical

Type: STRING. Default: 'MODE'.

The aggregation method for categorical features. Supported values are:

  • 'MODE': The most frequent value.

  • 'FIRST': The earliest value.

  • 'LAST': The latest value.

aggregation_numeric

Type: STRING. Default: 'MEAN'.

The aggregation method for numeric features. Supported values are:

  • 'MEAN': The average of the values.

  • 'MEDIAN': The middle value.

  • 'MODE': The most frequent value.

  • 'MIN': The smallest value.

  • 'MAX': The largest value.

  • 'SUM': The total of the values.

  • 'FIRST': The earliest value.

  • 'LAST': The latest value.

aggregation_target

Type: STRING. Default: same as aggregation_numeric, or 'MEAN' if not specified.

The aggregation method for the target value. Supported values are:

  • 'MEAN': The average of the values.

  • 'MEDIAN': The middle value.

  • 'MODE': The most frequent value.

  • 'MIN': The smallest value.

  • 'MAX': The largest value.

  • 'SUM': The total of the values.

  • 'FIRST': The earliest value.

  • 'LAST': The latest value.

evaluate

Type: BOOLEAN. Default: TRUE.

Whether evaluation metrics should be generated. If TRUE, additional models are trained for cross-validation using the parameters in the evaluation_config.

evaluation_config

Type: OBJECT. Default: see Evaluation configuration.

An optional config object that specifies how out-of-sample evaluation metrics should be generated. See Evaluation configuration.

frequency

Type: STRING. Default: n/a.

The frequency of the time series. If not specified, the model infers the frequency. The value must be a string representing a time period, such as '1 day'. Supported units include seconds, minutes, hours, days, weeks, months, quarters, and years. You may use the singular (“hour”) or plural (“hours”) form of the unit name, but you may not abbreviate it.

lower_bound

Type: FLOAT or NULL. Default: NULL.

The lower bound for the target value. If specified, the model will not predict values below this threshold.

upper_bound

Type: FLOAT or NULL. Default: NULL.

The upper bound for the target value. If specified, the model will not predict values above this threshold.

on_error

Type: STRING. Default: 'ABORT'.

String (constant) that specifies the error handling method for training. This is most useful when training multiple series. Supported values are:

  • 'abort': Abort training if an error is encountered in any time series.

  • 'skip': Skip any time series where training encounters an error. This allows training to succeed for other time series. To see which series failed during model training, call the model’s <model_name>!SHOW_TRAINING_LOGS method.
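The following sketch shows how SERIES_COLNAME and CONFIG_OBJECT might be combined for multi-series training. The table store_sales, its columns, the view store_sales_series, and the model name are hypothetical, and the key and value spellings follow the descriptions above. The view uses ARRAY_CONSTRUCT and TO_VARIANT to build one series identifier from two columns.

```sql
-- Hypothetical source table:
--   store_sales(store_id, item, sale_date TIMESTAMP_NTZ, units_sold)
-- The view combines store_id and item into a single VARIANT series identifier.
CREATE OR REPLACE VIEW store_sales_series AS
  SELECT
    TO_VARIANT(ARRAY_CONSTRUCT(store_id, item)) AS series_id,
    sale_date,
    units_sold
  FROM store_sales;

CREATE OR REPLACE SNOWFLAKE.ML.ANOMALY_DETECTION store_sales_detector(
  INPUT_DATA => TABLE(store_sales_series),
  SERIES_COLNAME => 'series_id',
  TIMESTAMP_COLNAME => 'sale_date',
  TARGET_COLNAME => 'units_sold',
  LABEL_COLNAME => '',              -- no labeled anomalies
  CONFIG_OBJECT => {
    'frequency': '1 day',           -- state the frequency instead of inferring it
    'aggregation_target': 'SUM',    -- sum target values that share a timestamp
    'lower_bound': 0,               -- unit counts cannot be negative
    'on_error': 'skip'              -- keep training if one series fails
  }
);

-- Inspect the training logs to see which series, if any, were skipped.
CALL store_sales_detector!SHOW_TRAINING_LOGS();
```

With 'on_error' set to 'skip', a failure in one store and item combination does not prevent the remaining series from being trained.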

Evaluation configuration

The evaluation_config object contains key-value pairs that configure cross-validation. These parameters are from the scikit-learn TimeSeriesSplit cross-validator.

The object supports the following keys:

n_splits

Type: INTEGER. Default: 5.

Number of splits.

max_train_size

Type: INTEGER or NULL (no maximum). Default: NULL.

Maximum size for a single training set.

test_size

Type: INTEGER or NULL. Default: NULL.

Used to limit the size of the test set.

gap

Type: INTEGER. Default: 0.

Number of samples to exclude from the end of each training set before the test set.

prediction_interval

Type: FLOAT. Default: 0.95.

The prediction interval used in calculating interval metrics.
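As an illustration, a cross-validation setup could be passed through CONFIG_OBJECT as in the sketch below. The table sensor_readings, its columns, and the model name are hypothetical, and the chosen values are arbitrary examples rather than recommendations.

```sql
-- Hypothetical training table: sensor_readings(ts TIMESTAMP_NTZ, reading FLOAT).
CREATE OR REPLACE SNOWFLAKE.ML.ANOMALY_DETECTION reading_detector(
  INPUT_DATA => TABLE(sensor_readings),
  TIMESTAMP_COLNAME => 'ts',
  TARGET_COLNAME => 'reading',
  LABEL_COLNAME => '',
  CONFIG_OBJECT => {
    'evaluate': TRUE,                -- generate evaluation metrics (the default)
    'evaluation_config': {
      'n_splits': 3,                 -- three splits instead of the default 5
      'gap': 0,                      -- exclude no samples between train and test sets
      'prediction_interval': 0.9     -- interval used for the interval metrics
    }
  }
);
```

Because each split trains an additional model, setting 'evaluate' to FALSE avoids that extra training when the metrics are not needed.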

Usage notes

  • If the column names specified by the TIMESTAMP_COLNAME, TARGET_COLNAME, or LABEL_COLNAME arguments do not exist in the table, view, or query specified by the INPUT_DATA argument, an error occurs.

  • Replication is supported only for instances of the CUSTOM_CLASSIFIER class; instances of the ANOMALY_DETECTION class are not replicated.

Examples

For a representative example, see the anomaly detection example.