Computing the Number of Distinct Values for Hierarchical Aggregations

To compute the number of rows that have distinct values, you can call the SQL COUNT function and use the DISTINCT keyword. If you just need an approximate count of distinct values, you can use the HyperLogLog functions (e.g. APPROX_COUNT_DISTINCT). For details, see Estimating the Number of Distinct Values.

If you are counting distinct values for hierarchical aggregations (e.g. multiple grouping sets, rollups, or cubes), you can improve performance by producing and using bitmaps (rather than using COUNT(DISTINCT <expr>).

This topic explains how to produce and use bitmaps to compute the number of distinct values.

In this Topic:

Introduction

When computing the number of distinct values for hierarchical aggregations (e.g. multiple grouping sets, rollups, or cubes), you can speed up the computation by producing and querying a bitmap that represents the set of all possible distinct values.

  • In this bitmap, you set the bits that correspond to the distinct values that are present in the data.

  • When computing the number of distinct values, you use the bitmap functions to count the bits that are set in the bitmap (rather than querying the table with COUNT(DISTINCT <expression>)).

The bitmap functions can perform better than COUNT(DISTINCT <expression>) under the following conditions:

  • The query performs a hierarchical aggregation (e.g. for multiple grouping sets, rollups, or cubes) that counts distinct values.

    Unlike COUNT(DISTINCT <expression>) (which needs to be executed for each group), you can compose and reuse bitmaps by calling the bitmap functions. This can reduce the cost of the query plan.

  • The range of values is dense (e.g. the value is generated by a sequence)

    Note that if the value range is sparse, you can use the DENSE_RANK window function to transform the sparse range of values into a dense range of values.

  • The range of values is small. A large range of values might require multiple bitmaps that do not fit into main memory and must be saved to disk.

In addition, to improve performance further, you can compute these bitmaps ahead of time (e.g. in a materialized view), rather than during the query, and you can use these precomputed bitmaps in your query.

Understanding How Bitmaps Identify Distinct Values

A bitmap is a contiguous piece of memory that is stored as a BINARY data type. A bitmap effectively is an array of bits that can be set individually. For example, a 4-byte bitmap consists of 32 bits (4 bytes * 8 bits per byte).

For each possible distinct value, you can use a bit in the bitmap to represent the presence or absence of the distinct value in the data. For example, if the values 3 and 5 are present in the data, you can set the 3rd and 5th bits to 1 in the bitmap. (If the distinct values are not numeric values, you must map the values to numeric values.)

For the bitmap functions in Snowflake, the default size of a bitmap is 32,768 bits (4 KiB). Note that this size does not correspond to the physical size of the BINARY value. Internally, the bitmap functions manage the physical representation of the bitmap, which might not be an actual bitmap. (For example, the functions might use an index vector.) The physical size of a bitmap can vary from 10 bytes to 4108 bytes.

If the number of distinct values is greater than 32,768 bits, multiple bitmaps are needed to represent all of the values. The process of dividing up the bits for distinct values into different bitmaps is called bucketization. For example, the bits for the distinct values ranging from 1 - 65,536 are bucketized into two separate buckets. The bitmap in one bucket represents the values 1 - 32,768, and the bitmap in the other bucket represents the values 32,769 - 65,536. The bitmap in each bucket contains a subset of the bits representing the distinct values.

The following diagram shows the logical representation of a bitmap. (As mentioned earlier, the physical representation of the bitmap in the BINARY value might be different.)

Logical representation of a bitmap

A distinct value is represented by the combination of a bucket containing a bitmap and a bit that is set in that bitmap. To identify the bucket and bit that represents a specific value, use the following functions:

  • Call BITMAP_BUCKET_NUMBER to identify the bucket containing the bitmap that has the bit for the value.

  • Call BITMAP_BIT_POSITION to identify the zero-based position of the bit within the bitmap for the value.

For example, the numeric value 1 is represented by the bit at position 0 in bitmap 1:

select bitmap_bucket_number(1), bitmap_bit_position(1);

+-------------------------+------------------------+
| BITMAP_BUCKET_NUMBER(1) | BITMAP_BIT_POSITION(1) |
|-------------------------+------------------------|
|                       1 |                      0 |
+-------------------------+------------------------+

The numeric value 32,768 is represented by the bit at position 32,767 in bitmap 1:

select bitmap_bucket_number(32768), bitmap_bit_position(32768);

+-----------------------------+----------------------------+
| BITMAP_BUCKET_NUMBER(32768) | BITMAP_BIT_POSITION(32768) |
|-----------------------------+----------------------------|
|                           1 |                      32767 |
+-----------------------------+----------------------------+

As another example, the numeric value 32,769 is represented by the bit at position 0 in bitmap 2:

select bitmap_bucket_number(32769), bitmap_bit_position(32769);

+-----------------------------+----------------------------+
| BITMAP_BUCKET_NUMBER(32769) | BITMAP_BIT_POSITION(32769) |
|-----------------------------+----------------------------|
|                           2 |                          0 |
+-----------------------------+----------------------------+

Creating Bitmaps

To create bitmaps that represent all possible distinct values, call the BITMAP_CONSTRUCT_AGG function in a SELECT statement:

  1. Pass in the value returned by BITMAP_BIT_POSITION for the column to the BITMAP_CONSTRUCT_AGG function.

  2. In the SELECT statement, select BITMAP_BUCKET_NUMBER and use GROUP BY to aggregate the results for a given bitmap (identified by “bucket number”).

BITMAP_CONSTRUCT_AGG is an aggregation function. Aggregation in this context means setting the bit for a distinct value if any row has that distinct value. If multiple rows contain the value 3, BITMAP_CONSTRUCT_AGG just sets the bit for 3 once and does not change the value of the bit for the additional rows that contain 3.

For example, create the following table containing a column of numeric values. Insert two distinct values, one of which is greater than 32768.

select bitmap_bucket_number(32769), bitmap_bit_position(32769);
create or replace table bitmap_test_values (val int);
insert into bitmap_test_values values (1), (32769);

Run the following command to produce bitmaps with bits that represent the distinct values:

-- Display the bitmap in hexadecimal
alter session set binary_output_format='hex';

select bitmap_bucket_number(val) as bitmap_id,
    bitmap_construct_agg(bitmap_bit_position(val)) as bitmap
    from bitmap_test_values
    group by bitmap_id;

+-----------+----------------------+
| BITMAP_ID | BITMAP               |
|-----------+----------------------|
|         1 | 00010000000000000000 |
|         2 | 00010000000000000000 |
+-----------+----------------------+

Note that the binary value shown in the BITMAP column above is not an actual bitmap. The physical representation of the bitmap is managed by the bitmap functions and may change, depending on the number of values represented. You should not expect the binary value of the bitmap to be in a specific format.

Inserting additional rows with the same values does not change the resulting bitmap. The BITMAP_CONSTRUCT_AGG function only sets the bit for a distinct value once.

insert into bitmap_test_values values (32769), (32769), (1);

select bitmap_bucket_number(val) as bitmap_id,
    bitmap_construct_agg(bitmap_bit_position(val)) as bitmap
    from bitmap_test_values
    group by bitmap_id;

+-----------+----------------------+
| BITMAP_ID | BITMAP               |
|-----------+----------------------|
|         1 | 00010000000000000000 |
|         2 | 00010000000000000000 |
+-----------+----------------------+

Inserting other distinct values produces a different bitmap in which the corresponding bits for those values are set.

insert into bitmap_test_values values (2), (3), (4);

select bitmap_bucket_number(val) as bitmap_id,
    bitmap_construct_agg(bitmap_bit_position(val)) as bitmap
    from bitmap_test_values
    group by bitmap_id;

+-----------+----------------------+
| BITMAP_ID | BITMAP               |
|-----------+----------------------|
|         1 | 00040000010002000300 |
|         2 | 00010000000000000000 |
+-----------+----------------------+

Aggregating Bitmaps

If you need to aggregate different bitmaps in the same bucket (identified by the bucket number returned by BITMAP_BUCKET_NUMBER), call BITMAP_OR_AGG.

Computing the Number of Distinct Values from the Bitmaps

To get the total count of the distinct values from the bitmaps, call BITMAP_BIT_COUNT, passing in a bitmap created by BITMAP_CONSTRUCT_AGG or BITMAP_OR_AGG.

For example:

select bitmap_bucket_number(val) as bitmap_id,
    bitmap_count(bitmap_construct_agg(bitmap_bit_position(val))) as distinct_values
    from bitmap_test_values
    group by bitmap_id;

+-----------+-----------------+
| BITMAP_ID | DISTINCT_VALUES |
|-----------+-----------------|
|         1 |               4 |
|         2 |               1 |
+-----------+-----------------+

Using Bitmaps to Improve Query Performance

The following examples demonstrate how to use the bitmap functions as an alternative to COUNT(DISTINCT <expression>).

Example 1: Counting the distinct values in a single table.

Suppose that you want to execute the following command to count the number of distinct values in col:

-- Original query using COUNT(DISTINCT <expression>)
SELECT COUNT(DISTINCT <col>) FROM <table>;

You can change this query to use bitmap functions instead:

-- The same query using bitmap functions
SELECT SUM(cnt) FROM
(
SELECT
  BITMAP_COUNT(BITMAP_CONSTRUCT_AGG(BITMAP_BIT_POSITION(<col>))) cnt
FROM <table>
GROUP BY BITMAP_BUCKET_NUMBER(<col>)
);

Note that if the range of values in col is 0 to 32,768, you can use this simpler statement instead:

-- If the full value range of <col> fits into the bitmap:
-- MIN(<col>) >= 0 AND MAX(<col>) < 32,768
SELECT
  BITMAP_COUNT(BITMAP_CONSTRUCT_AGG(<col>))
FROM <table>;

Example 2: Using GROUP BY to compute the counts by group.

Suppose that you want to execute the following command to count the number of distinct values in col by keys:

-- Original query using COUNT(DISTINCT <expression>)
SELECT COUNT(DISTINCT <col>) FROM <table> GROUP BY <keys>;

You can change this query to use bitmap functions instead:

-- The same query using bitmap functions
SELECT <keys>, SUM(cnt) FROM
(
SELECT
  <keys>,
  BITMAP_COUNT(BITMAP_CONSTRUCT_AGG(BITMAP_BIT_POSITION(<col>))) cnt
FROM <table>
GROUP BY <keys>, BITMAP_BUCKET_NUMBER(<col>)
)
GROUP BY <keys>;

Example 3: Using GROUP BY ROLLUP to roll up counts by group.

Bitmap functions work even more efficiently for GROUP BY ROLLUP aggregate queries. Bitmaps are composable (in contrast to COUNT(DISTINCT <expression>)), which results in less computation work and lower execution times.

Suppose that you want to execute the following command to roll up the number of distinct values in col by keys:

-- Original query using COUNT(DISTINCT <expression>)
SELECT COUNT(DISTINCT <col>) FROM <table> GROUP BY ROLLUP(<keys>);

You can change this query to use bitmap functions instead:

-- The same query using bitmap functions
SELECT <keys>, SUM(cnt) FROM
(
SELECT
  <keys>,
  BITMAP_COUNT(BITMAP_CONSTRUCT_AGG(BITMAP_BIT_POSITION(<col>))) cnt
  FROM <table>
  GROUP BY ROLLUP(<keys>), BITMAP_BUCKET_NUMBER(<col>)
)
GROUP BY <keys>;

Precomputing the Bitmaps

To improve performance, you can precompute the counts of distinct values in a table or materialized view.

For example, suppose that your data warehouse contains a fact table with multiple dimensions. You can define a materialized view that constructs the bitmaps to perform a coarse-grained precomputation or pre-aggregation before computing the final aggregates or cubes that require a COUNT(DISTINCT <expression>).

The following example creates a table containing the bitmaps and uses this table to compute the number of distinct values, aggregated by different dimensions.

-- precomputation table
CREATE TABLE precompute AS
SELECT
  <dim1>,
  <dim2>,
  BITMAP_BUCKET_NUMBER(<col>) bucket,
  BITMAP_CONSTRUCT_AGG(BITMAP_BIT_POSITION(<col>)) bmp
FROM <table>
GROUP BY 1, 2, 3;

-- compute final aggregate for <dim1> and <dim2>
SELECT
  <dim1>,
  <dim2>,
  SUM(BITMAP_COUNT(bmp))
FROM precompute
GROUP BY 1, 2;

-- compute final aggregate for <dim1>-only
SELECT <dim1>, SUM(cnt) FROM
(
SELECT
  <dim1>,
  BITMAP_COUNT(BITMAP_OR_AGG(bmp)) cnt
FROM precompute
GROUP BY 1, bucket
)
GROUP BY 1;

-- compute final aggregate for <dim2>-only
SELECT <dim2>, SUM(cnt) FROM
(
SELECT
  <dim2>,
  BITMAP_COUNT(BITMAP_OR_AGG(bmp)) cnt
FROM precompute
GROUP BY 1, bucket
)
GROUP BY 1;