Using Arrays to Compute Distinct Values for Hierarchical Aggregations

If you are counting distinct values for hierarchical aggregations (e.g. multiple grouping sets, rollups, or cubes), you can improve performance by producing ARRAYs that contain the distinct values and computing the number of distinct values from these ARRAYs. Using this approach can be faster than using COUNT(DISTINCT <expr>).

This topic explains how to use ARRAYs to count distinct values.

For other techniques for counting distinct values, see Computing the Number of Distinct Values.

Introduction

When computing the number of distinct values for hierarchical aggregations (e.g. multiple grouping sets, rollups, or cubes), you can speed up the computation by calling functions that produce arrays containing the distinct values. You can then call ARRAY_SIZE to compute the count of those distinct values.

These aggregation functions that produce ARRAYs of distinct values can perform better than COUNT(DISTINCT <expression>) in queries of the following forms:

  • GROUP BY ROLLUP aggregate queries

  • queries containing multiple grouping sets.

Unlike COUNT(DISTINCT <expression>) (which needs to be executed for each group), you can compose and reuse ARRAYs that contain the distinct values. For hierarchical aggregations, you avoid repeatedly computing the distinct counts by producing these ARRAYs once and reusing them in higher aggregation levels.

In addition, to improve performance further, you can produce these ARRAYs ahead of time (e.g. in a materialized view), rather than during the query, and you can use these precomputed ARRAYs in your query.

Creating an ARRAY Containing Distinct Values

To create an ARRAY that contains the distinct values in a column, call the ARRAY_UNIQUE_AGG function in a SELECT statement.

ARRAY_UNIQUE_AGG is an aggregation function. Aggregation in this context means returning only one instance of a value that appears in multiple rows. If multiple rows contain the value 3, ARRAY_UNIQUE_AGG just includes 3 once in the returned ARRAY.

For example, create the following table containing a column of numeric values, and insert some values into that column.

CREATE OR REPLACE TABLE array_unique_agg_test (a INTEGER);
INSERT INTO array_unique_agg_test VALUES (5), (2), (1), (2), (1);
Copy

Run the following command to produce an ARRAY that contains the distinct values in the column:

SELECT ARRAY_UNIQUE_AGG(a) AS distinct_values FROM array_unique_agg_test;
Copy
+-----------------+
| DISTINCT_VALUES |
|-----------------|
| [               |
|   5,            |
|   2,            |
|   1             |
| ]               |
+-----------------+
Copy

Computing the Number of Distinct Values from the ARRAYs

To get the total count of the distinct values from the ARRAY, call ARRAY_SIZE, passing in the ARRAY created by ARRAY_UNIQUE_AGG.

For example:

SELECT ARRAY_SIZE(ARRAY_UNIQUE_AGG(a)) AS number_of_distinct_values FROM array_unique_agg_test;
Copy
+---------------------------+
| NUMBER_OF_DISTINCT_VALUES |
|---------------------------|
|                         3 |
+---------------------------+
Copy

Using Arrays to Improve Query Performance

The following examples demonstrate how to use the aggregation functions that produce ARRAYs of distinct values as an alternative to COUNT(DISTINCT <expression>).

Example 1: Counting the Distinct Values in a Single Table

Suppose that you want to count the number of distinct values in my_column. The following table compares the SQL statements for performing this task with COUNT(DISTINCT expression) and ARRAY_UNIQUE_AGG(expression).

Example With COUNT(DISTINCT <expression>)

Example With ARRAY_UNIQUE_AGG(<expression>)

SELECT
  COUNT(DISTINCT my_column_1),
  COUNT(DISTINCT my_column_2)
FROM my_table;
Copy
SELECT
  ARRAY_SIZE(ARRAY_UNIQUE_AGG(my_column_1)),
  ARRAY_SIZE(ARRAY_UNIQUE_AGG(my_column_2))
FROM my_table;
Copy

Example 2: Using GROUP BY to Compute the Counts by Group

Suppose that you want to count the number of distinct values in my_column by my_key_1 and my_key_2. The following table compares the SQL statements for performing this task with COUNT(DISTINCT expression) and ARRAY_UNIQUE_AGG(expression).

Example With COUNT(DISTINCT <expression>)

Example With ARRAY_UNIQUE_AGG(<expression>)

SELECT
  COUNT(DISTINCT my_column_1),
  COUNT(DISTINCT my_column_2)
FROM my_table
GROUP BY my_key_1, my_key_2;
Copy
SELECT
  ARRAY_SIZE(ARRAY_UNIQUE_AGG(my_column_1)),
  ARRAY_SIZE(ARRAY_UNIQUE_AGG(my_column_2))
FROM my_table
GROUP BY my_key_1, my_key_2;
Copy

Example 3: Using GROUP BY ROLLUP to Roll up Counts by Group

ARRAY_UNIQUE_AGG works even more efficiently for GROUP BY ROLLUP aggregate queries. ARRAYs are composable (in contrast to COUNT(DISTINCT <expression>)), which results in less computation work and lower execution times.

Suppose that you want to roll up the number of distinct values in my_column by my_key_1 and my_key_2. The following table compares the SQL statements for performing this task with COUNT(DISTINCT expression) and ARRAY_UNIQUE_AGG(expression).

Example With COUNT(DISTINCT <expression>)

Example With ARRAY_UNIQUE_AGG(<expression>)

SELECT
  COUNT(DISTINCT my_column)
FROM my_table
GROUP BY ROLLUP(my_key_1, my_key_2);
Copy
SELECT
  ARRAY_SIZE(ARRAY_UNIQUE_AGG(my_column))
FROM my_table
GROUP BY ROLLUP(my_key_1, my_key_2);
Copy

Precomputing the ARRAYs

To improve performance, you can precompute the ARRAYs of distinct values in a table or materialized view.

For example, suppose that your data warehouse contains a fact table with multiple dimensions. You can define a materialized view that constructs the ARRAYs to perform a coarse-grained precomputation or pre-aggregation before computing the final aggregates or cubes that require a COUNT(DISTINCT <expression>).

To collect the distinct values from the ARRAYs in each row, call the ARRAY_UNION_AGG function.

The following example creates a table containing the ARRAYs and uses this table to compute the number of distinct values, aggregated by different dimensions.

The following statement creates a table named precompute that contains the ARRAYs:

CREATE TABLE precompute AS
SELECT
  my_dimension_1,
  my_dimension_2,
  ARRAY_UNIQUE_AGG(my_column) arr
FROM my_table
GROUP BY 1, 2;
Copy

The following statement computes the aggregates for my_dimension_1 and my_dimension_2:

SELECT
  my_dimension_1,
  my_dimension_2,
  ARRAY_SIZE(arr)
FROM precompute
GROUP BY 1, 2;
Copy

The following statement computes the aggregate only for my_dimension_1:

SELECT
  my_dimension_1,
  ARRAY_SIZE(ARRAY_UNION_AGG(arr))
FROM precompute
GROUP BY 1;
Copy

The following statement computes the aggregate only for my_dimension_2:

SELECT
  my_dimension_2,
  ARRAY_SIZE(ARRAY_UNION_AGG(arr))
FROM precompute
GROUP BY 1;
Copy

Limitations

In Snowflake, ARRAY data types are limited to 16 MiB, which means that ARRAY_UNIQUE_AGG or ARRAY_UNION_AGG will generate an error if the physical size of the output ARRAY exceeds this size.

In these cases, consider using a bitmap aggregation instead. As an alternative, you can apply a bucketization technique similar to the one used for bitmap aggregations but with a different bucketization function than BITMAP_BUCKET_NUMBER.