Contribution Explorer (ML-Powered Function)¶
Contribution Explorer is designed to streamline and improve the process of root cause analysis around changes in observed metrics. Given the values of a metric over time, Contribution Explorer analyzes shifts in that metric to determine which data segments are driving each shift.
Note
Contribution Explorer is part of a suite of analysis features powered by machine learning algorithms.
For example, if you’re tracking sales, Contribution Explorer can help you identify the locations, salespeople, customers, industry verticals, and other factors that are driving a revenue shortfall. Then you can take immediate, targeted corrective action.
Good candidate datasets for analysis with Contribution Explorer have the following characteristics:
One or more strictly non-negative metrics. The change in metric from one row to the next may be negative, but the metric itself must never be.
One or more timestamps.
Columns or dimensions that can be used to segment the data. These are often categorical (location, market segment, etc.) but may be continuous (i.e., quantitative, such as temperature or attendance).
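For illustration, a table with these characteristics might look like the following sketch. The table and column names (daily_sales, sale_date, revenue, region, segment, store_temp) are hypothetical and are not part of Contribution Explorer itself.
-- Hypothetical example of a dataset suited to Contribution Explorer:
-- a non-negative metric (revenue), a timestamp (sale_date),
-- categorical dimensions (region, segment), and a continuous dimension (store_temp).
CREATE OR REPLACE TABLE daily_sales (
  sale_date  DATE,       -- timestamp column
  revenue    NUMBER,     -- strictly non-negative metric
  region     VARCHAR,    -- categorical dimension
  segment    VARCHAR,    -- categorical dimension
  store_temp FLOAT       -- continuous (quantitative) dimension
);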
To use Contribution Explorer directly in your queries and pipelines, call the TOP_INSIGHTS table function. This function finds the most important dimensions in a dataset, builds segments from those dimensions, and then detects which of those segments influenced the metric.
TOP_INSIGHTS is well-suited to extracting root causes from datasets with a large number of dimensions. Continuous dimensions are also supported, and the results can indicate dimensions with negative conditions (for example, “region is not North America”).
The results of TOP_INSIGHTS are the dimensions contributing to variation in the data.
Tip
To use Contribution Explorer to analyze a dashboard query, you might need to add dimensional columns that you wouldn’t usually include in your query when aggregating a metric across all dimensions.
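As a hypothetical illustration of this tip, a dashboard query that aggregates the metric across all dimensions would need its dimensional columns restored before segments can be built. The daily_sales table and its columns below are assumptions for the sketch, not columns defined anywhere in this topic.
-- Dashboard-style query: aggregates the metric and discards the dimensions.
SELECT sale_date, SUM(revenue) AS total_revenue
FROM daily_sales
GROUP BY sale_date;

-- Reworked for Contribution Explorer: keep the dimensional columns
-- (region, segment) so that segments can be built from them.
SELECT sale_date, revenue, region, segment
FROM daily_sales;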
Updating Your Search Path¶
If you add SNOWFLAKE.ML to your search path, you can just write TOP_INSIGHTS instead of SNOWFLAKE.ML.TOP_INSIGHTS in your queries. The statements below show the difference:
SELECT SNOWFLAKE.ML.TOP_INSIGHTS(...);
SELECT TOP_INSIGHTS(...);
To add SNOWFLAKE.ML to your search path, execute the following:
ALTER SESSION SET SEARCH_PATH='$current, $public, SNOWFLAKE.ML';
The search path change remains in effect during your Snowflake session. To change your search path persistently, set it at the account level instead:
USE ROLE ACCOUNTADMIN;
ALTER ACCOUNT SET SEARCH_PATH='$current, $public, SNOWFLAKE.ML';
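If you want to confirm which value is currently in effect, one option is to inspect the SEARCH_PATH parameter (shown here for the session; the same pattern works with IN ACCOUNT):
SHOW PARAMETERS LIKE 'SEARCH_PATH' IN SESSION;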
Example¶
This example query uses the table input_table. A control group is used to train the model, and the test group is the data that you want insights about.
Create the table:
CREATE OR REPLACE TABLE input_table(
  ds DATE, metric NUMBER, dim_country VARCHAR, dim_vertical VARCHAR);
Insert the control and test group records into the table:
-- Data for the control group
INSERT INTO input_table
  SELECT DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds, UNIFORM(1, 10, RANDOM()) AS metric, 'usa' AS dim_country, 'tech' AS dim_vertical
  FROM TABLE(GENERATOR(ROWCOUNT => 365));

INSERT INTO input_table
  SELECT DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds, UNIFORM(1, 10, RANDOM()) AS metric, 'usa' AS dim_country, 'auto' AS dim_vertical
  FROM TABLE(GENERATOR(ROWCOUNT => 365));

INSERT INTO input_table
  SELECT DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds, UNIFORM(1, 10, RANDOM()) AS metric, 'usa' AS dim_country, 'fashion' AS dim_vertical
  FROM TABLE(GENERATOR(ROWCOUNT => 365));

INSERT INTO input_table
  SELECT DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds, UNIFORM(1, 10, RANDOM()) AS metric, 'usa' AS dim_country, 'finance' AS dim_vertical
  FROM TABLE(GENERATOR(ROWCOUNT => 365));

INSERT INTO input_table
  SELECT DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds, UNIFORM(1, 10, RANDOM()) AS metric, 'canada' AS dim_country, 'fashion' AS dim_vertical
  FROM TABLE(GENERATOR(ROWCOUNT => 365));

INSERT INTO input_table
  SELECT DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds, UNIFORM(1, 10, RANDOM()) AS metric, 'canada' AS dim_country, 'finance' AS dim_vertical
  FROM TABLE(GENERATOR(ROWCOUNT => 365));

INSERT INTO input_table
  SELECT DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds, UNIFORM(1, 10, RANDOM()) AS metric, 'canada' AS dim_country, 'tech' AS dim_vertical
  FROM TABLE(GENERATOR(ROWCOUNT => 365));

INSERT INTO input_table
  SELECT DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds, UNIFORM(1, 10, RANDOM()) AS metric, 'canada' AS dim_country, 'auto' AS dim_vertical
  FROM TABLE(GENERATOR(ROWCOUNT => 365));

INSERT INTO input_table
  SELECT DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds, UNIFORM(1, 10, RANDOM()) AS metric, 'france' AS dim_country, 'fashion' AS dim_vertical
  FROM TABLE(GENERATOR(ROWCOUNT => 365));

INSERT INTO input_table
  SELECT DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds, UNIFORM(1, 10, RANDOM()) AS metric, 'france' AS dim_country, 'finance' AS dim_vertical
  FROM TABLE(GENERATOR(ROWCOUNT => 365));

INSERT INTO input_table
  SELECT DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds, UNIFORM(1, 10, RANDOM()) AS metric, 'france' AS dim_country, 'tech' AS dim_vertical
  FROM TABLE(GENERATOR(ROWCOUNT => 365));

INSERT INTO input_table
  SELECT DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 4, 1)) AS ds, UNIFORM(1, 10, RANDOM()) AS metric, 'france' AS dim_country, 'auto' AS dim_vertical
  FROM TABLE(GENERATOR(ROWCOUNT => 365));

-- Data for the test group
INSERT INTO input_table
  SELECT DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 8, 1)) AS ds, UNIFORM(300, 320, RANDOM()) AS metric, 'usa' AS dim_country, 'auto' AS dim_vertical
  FROM TABLE(GENERATOR(ROWCOUNT => 365));

INSERT INTO input_table
  SELECT DATEADD(day, SEQ4(), DATE_FROM_PARTS(2020, 8, 1)) AS ds, UNIFORM(400, 420, RANDOM()) AS metric, 'usa' AS dim_country, 'finance' AS dim_vertical
  FROM TABLE(GENERATOR(ROWCOUNT => 365));
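The following optional check is not part of the original example, but it can confirm that each (country, vertical) segment was populated as expected before you run TOP_INSIGHTS:
SELECT dim_country, dim_vertical, COUNT(*) AS n_rows, MIN(ds) AS first_ds, MAX(ds) AS last_ds
FROM input_table
GROUP BY dim_country, dim_vertical
ORDER BY dim_country, dim_vertical;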
Example Query¶
The example query below generates top insights from the table input_table defined above.
WITH input AS (
SELECT
{
'country': input_table.dim_country,
'vertical': input_table.dim_vertical
}
AS categorical_dimensions,
{
'length_of_vertical': length(input_table.dim_country)
}
AS continuous_dimensions,
input_table.metric,
IFF(ds BETWEEN '2020-08-01' AND '2020-08-20', TRUE, FALSE) AS label
FROM input_table
WHERE
(ds BETWEEN '2020-05-01' AND '2020-05-20') OR
(ds BETWEEN '2020-08-01' AND '2020-08-20')
)
SELECT res.* FROM input, TABLE(
SNOWFLAKE.ML.TOP_INSIGHTS(
input.categorical_dimensions,
input.continuous_dimensions,
CAST(input.metric AS FLOAT),
input.label
)
OVER (PARTITION BY 0)
) res ORDER BY res.surprise DESC;
Note
For information about the arguments of the TOP_INSIGHTS function, see TOP_INSIGHTS.
Details of the example query:
The columns dim_country and dim_vertical from the input table become the dimensions 'country' and 'vertical'.
A continuous dimension, length_of_vertical, is derived from the length of the value in the dim_country column of the input table.
The metric value is in the column metric.
The label (which determines whether a given row is test or control data) is a Boolean expression that is TRUE only for dates from August 1, 2020 through August 20, 2020.
The WHERE clause restricts the rows considered to the control and test date ranges.
The OVER (PARTITION BY 0) clause ensures that the tabular input is processed together.
The result of the query is given the name res.
The results are ranked by their surprise, which represents the amount by which the test metric in the segment exceeds its expected value based on the metric change in the parent segment.
Results of the query:
+--------------------------------+----------------+-------------+------------------+-----------------+--------------+----------------------+------------------------+---------------------+----------------------+-------------+-----------------+
| CONTRIBUTOR | METRIC_CONTROL | METRIC_TEST | SURPRISE | RELATIVE_CHANGE | GROWTH_RATE | EXPECTED_METRIC_TEST | OVERALL_METRIC_CONTROL | OVERALL_METRIC_TEST | OVERALL_GROWTH_RATE | NEW_IN_TEST | MISSING_IN_TEST |
|--------------------------------+----------------+-------------+------------------+-----------------+--------------+----------------------+------------------------+---------------------+----------------------+-------------+-----------------|
| [ | 105 | 8327 | 7022.967741935 | 6.385578231 | 79.304761905 | 1304.032258065 | 1271 | 15785 | 79.304761905 | False | False |
| "country = usa", | | | | | | | | | | | |
| "vertical = finance" | | | | | | | | | | | |
| ] | | | | | | | | | | | |
| [ | 105 | 8327 | 7022.967741935 | 6.385578231 | 79.304761905 | 1304.032258065 | 1271 | 15785 | 79.304761905 | False | False |
| "not country = canada", | | | | | | | | | | | |
| "length_of_vertical <= 4.5", | | | | | | | | | | | |
| "vertical = finance" | | | | | | | | | | | |
| ] | | | | | | | | | | | |
... (additional rows of output) ...
+--------------------------------+----------------+-------------+------------------+-----------------+--------------+----------------------+------------------------+---------------------+----------------------+-------------+-----------------+
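As a rough cross-check of the first result row, the surprise value can be reproduced from the other columns shown above, assuming that EXPECTED_METRIC_TEST scales the segment's control metric by the overall test-to-control ratio. This interpretation is inferred from the output values, not a documented formula.
-- Recompute EXPECTED_METRIC_TEST and SURPRISE for the first row above (assumed relationship):
-- expected_metric_test = metric_control * (overall_metric_test / overall_metric_control)
-- surprise             = metric_test - expected_metric_test
SELECT
  105 * (15785 / 1271)          AS expected_metric_test,  -- ~ 1304.03
  8327 - (105 * (15785 / 1271)) AS surprise;              -- ~ 7022.97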
Cost Considerations¶
Using the TOP_INSIGHTS function incurs compute costs. Runtime scales with the number of dimensions and the cardinality of those dimensions. (Cardinality is automatically reduced if a dimension has cardinality greater than 5.)
See Understanding Compute Cost for general information about Snowflake compute costs.