Contribution Explorer (Snowflake Cortex ML-Based Functions)

Contribution Explorer is designed to streamline and improve the process of root cause analysis around changes in observed metrics. Taking values of some metric over time, Contribution Explorer analyzes shifts in data with respect to that metric to determine which data segments are driving each shift.

Note

Contribution Explorer is part of Snowflake Cortex, Snowflake’s intelligent, fully-managed AI and ML service. This feature is part of the Snowflake Cortex ML-based function suite.

For example, if you’re tracking sales, Contribution Explorer can help you identify the locations, salespeople, customers, industry verticals, and other factors that are driving a revenue shortfall. Then you can take immediate, targeted corrective action.

Good candidate datasets for analysis with Contribution Explorer have the following characteristics:

  • One or more strictly non-negative metrics. The change in metric from one row to the next may be negative, but the metric itself must never be.

  • One or more timestamps.

  • Columns or dimensions that can be used to segment the data. These are often categorical (location, market segment, etc.) but may be continuous (i.e., quantitative, such as temperature or attendance).

To use Contribution Explorer directly in your queries and pipelines, call the TOP_INSIGHTS table function. This function finds the most important dimensions in a dataset, builds segments from those dimensions, and then detects which of those segments influenced the metric.

Tip

To use Contribution Explorer to analyze a dashboard query, you might need to add dimensional columns that you wouldn’t usually include in your query when aggregating a metric across all dimensions.

TOP_INSIGHTS is well-suited to extracting root causes from datasets with a large number of dimensions. Continuous dimensions are also supported, and the results can indicate dimensions with negative conditions (for example, “region is not North America”).

The results of TOP_INSIGHTS are the dimensions contributing to variation in the data.

Important

Legal notice. This Snowflake Cortex ML-Based function is powered by machine learning technology. Machine learning technology and results provided may be inaccurate, inappropriate, or biased. Decisions based on machine learning outputs, including those built into automatic pipelines, should have human oversight and review processes to ensure model-generated content is accurate. Snowflake Cortex ML-based function queries will be treated as any other SQL query and may be considered metadata.

Metadata. When you use Snowflake Cortex ML-Based functions, Snowflake logs generic error messages returned by an ML function, in addition to what is mentioned in Metadata Fields. These error logs help us troubleshoot issues that arise and improve these functions to serve you better.

Updating Your Search Path

If you add SNOWFLAKE.ML to your search path, you can just write TOP_INSIGHTS instead of SNOWFLAKE.ML.TOP_INSIGHTS in your queries. The statements below show the difference:

SELECT SNOWFLAKE.ML.TOP_INSIGHTS(...);
SELECT TOP_INSIGHTS(...);
Copy

To add SNOWFLAKE.ML to your search path, execute the following:

ALTER SESSION SET SEARCH_PATH='$current, $public, SNOWFLAKE.ML';
Copy

The search path change remains in effect during your Snowflake session. To change your search path persistently, set it at the account level instead:

USE ROLE ACCOUNTADMIN;
ALTER ACCOUNT SET SEARCH_PATH='$current, $public, SNOWFLAKE.ML';
Copy

Example

This example query uses the table input_table. A control group is used to train the model, and the test group is the data that you want insights about.

  1. Create the table:

    CREATE OR REPLACE TABLE input_table(
      ds DATE, metric NUMBER, dim_country VARCHAR, dim_vertical VARCHAR);
    
    Copy
  2. Insert the control and test group records into the table:

    INSERT INTO input_table
      SELECT
        DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds,
        UNIFORM(1, 10, RANDOM()) AS metric,
        'usa' AS dim_country,
        'tech' AS dim_vertical
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    INSERT INTO input_table
      SELECT
        DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds,
        UNIFORM(1, 10, RANDOM()) AS metric,
        'usa' AS dim_country,
        'auto' AS dim_vertical
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    INSERT INTO input_table
      SELECT
        DATEADD(day, seq4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds,
        UNIFORM(1, 10, RANDOM()) AS metric,
        'usa' AS dim_country,
        'fashion' AS dim_vertical
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    INSERT INTO input_table
      SELECT
        DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds,
        UNIFORM(1, 10, RANDOM()) AS metric,
        'usa' AS dim_country,
        'finance' AS dim_vertical
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    INSERT INTO input_table
      SELECT
        DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds,
        UNIFORM(1, 10, RANDOM()) AS metric,
        'canada' AS dim_country,
        'fashion' AS dim_vertical
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    INSERT INTO input_table
      SELECT
        DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds,
        UNIFORM(1, 10, RANDOM()) AS metric,
        'canada' AS dim_country,
        'finance' AS dim_vertical
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    INSERT INTO input_table
      SELECT
        DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds,
        UNIFORM(1, 10, RANDOM()) AS metric,
        'canada' AS dim_country,
        'tech' AS dim_vertical
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    INSERT INTO input_table
      SELECT
        DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds,
        UNIFORM(1, 10, RANDOM()) AS metric,
        'canada' AS dim_country,
        'auto' AS dim_vertical
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    INSERT INTO input_table
      SELECT
        DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds,
        UNIFORM(1, 10, RANDOM()) AS metric,
        'france' AS dim_country,
        'fashion' AS dim_vertical
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    INSERT INTO input_table
      SELECT
        DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds,
        UNIFORM(1, 10, RANDOM()) AS metric,
        'france' AS dim_country,
        'finance' AS dim_vertical
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    INSERT INTO input_table
      SELECT
        DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds,
        UNIFORM(1, 10, RANDOM()) AS metric,
        'france' AS dim_country,
        'tech' AS dim_vertical
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    INSERT INTO input_table
      SELECT
        DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds,
        UNIFORM(1, 10, RANDOM()) AS metric,
        'france' AS dim_country,
        'auto' AS dim_vertical
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    -- Data for the test group
    
    INSERT INTO input_table
      SELECT
        DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 8, 1)) AS ds,
        UNIFORM(300, 320, RANDOM()) AS metric,
        'usa' AS dim_country,
        'auto' AS dim_vertica
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    INSERT INTO input_table
      SELECT
        DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 8, 1))  AS ds,
        UNIFORM(400, 420, RANDOM()) AS metric,
        'usa' AS dim_country,
        'finance' AS dim_vertical
      FROM TABLE(GENERATOR(ROWCOUNT => 365));
    
    Copy

Example Query

The example query below generates top insights from the table input_table defined above.

WITH input AS (
  SELECT
    {
      'country': input_table.dim_country,
      'vertical': input_table.dim_vertical
    }
    AS categorical_dimensions,
    {
         'length_of_vertical': length(input_table.dim_country)
    }
    AS continuous_dimensions,
    input_table.metric,
    IFF(ds BETWEEN '2020-08-01' AND '2020-08-20', TRUE, FALSE) AS label
  FROM input_table
  WHERE
    (ds BETWEEN '2020-05-01' AND '2020-05-20') OR
    (ds BETWEEN '2020-08-01' AND '2020-08-20')
)
SELECT res.* from input, TABLE(
  SNOWFLAKE.ML.TOP_INSIGHTS(
    input.categorical_dimensions,
    input.continuous_dimensions,
    CAST(input.metric AS FLOAT),
    input.label
  )
  OVER (PARTITION BY 0)
) res ORDER BY res.surprise DESC;
Copy

Note

For information about the arguments of the TOP_INSIGHTS function, see TOP_INSIGHTS.

Details of the example query:

  • The columns dim_country and dim_vertical from the input table become the dimensions 'country' and 'vertical'.

  • A continuous dimension is derived from the length of the value in the vertical column in the input table. This column becomes the dimension length_of_vertical.

  • The metric value is in the column 'metric'.

  • The label (which determines whether a given row is test or control data) is a Boolean expression that is TRUE only for dates from August 1, 2020 through August 20, 2020.

  • The WHERE clause restricts the rows considered to the control and test ranges.

  • The OVER (PARTITION BY 0) clause ensures that the tabular input is processed together.

  • The result of the query is given the name res.

  • The results are ranked by their surprise, which represents the amount by which the test metric in the segment exceeds its expected value based on the metric change in the parent segment.

Results of the query:

+--------------------------------+----------------+-------------+------------------+-----------------+--------------+----------------------+------------------------+---------------------+----------------------+-------------+-----------------+
| CONTRIBUTOR                    | METRIC_CONTROL | METRIC_TEST |         SURPRISE | RELATIVE_CHANGE | GROWTH_RATE  | EXPECTED_METRIC_TEST | OVERALL_METRIC_CONTROL | OVERALL_METRIC_TEST | OVERALL_GROWTH_RATE  | NEW_IN_TEST | MISSING_IN_TEST |
|--------------------------------+----------------+-------------+------------------+-----------------+--------------+----------------------+------------------------+---------------------+----------------------+-------------+-----------------|
| [                              |            105 |        8327 |   7022.967741935 |     6.385578231 | 79.304761905 |       1304.032258065 |                   1271 |               15785 |         79.304761905 | False       | False           |
|   "country = usa",             |                |             |                  |                 |              |                      |                        |                     |                      |             |                 |
|   "vertical = finance"         |                |             |                  |                 |              |                      |                        |                     |                      |             |                 |
| ]                              |                |             |                  |                 |              |                      |                        |                     |                      |             |                 |
| [                              |            105 |        8327 |   7022.967741935 |     6.385578231 | 79.304761905 |       1304.032258065 |                   1271 |               15785 |         79.304761905 | False       | False           |
|   "not country = canada",      |                |             |                  |                 |              |                      |                        |                     |                      |             |                 |
|   "length_of_vertical <= 4.5", |                |             |                  |                 |              |                      |                        |                     |                      |             |                 |
|   "vertical = finance"         |                |             |                  |                 |              |                      |                        |                     |                      |             |                 |
| ]                              |                |             |                  |                 |              |                      |                        |                     |                      |             |                 |

... (additional rows of output) ...

+--------------------------------+----------------+-------------+------------------+-----------------+--------------+----------------------+------------------------+---------------------+----------------------+-------------+-----------------+

Cost Considerations

Using the TOP_INSIGHTS function incurs compute costs. Runtime scales with the number of dimensions and the cardinality of those dimensions. (Cardinality is automatically reduced if a dimension has cardinality greater than 5.)

See Understanding compute cost for general information about Snowflake compute costs.