Using Bitmaps to Compute Distinct Values for Hierarchical Aggregations

If you are counting distinct values for hierarchical aggregations (e.g. multiple grouping sets, rollups, or cubes), you can improve performance by producing bitmaps that represent the distinct values and computing the number of distinct values from these bitmaps. Using this approach can be faster than using COUNT(DISTINCT <expr>).

This topic explains how to use bitmaps to count distinct values.

For other techniques for counting distinct values, see Computing the Number of Distinct Values.

Introduction

When computing the number of distinct values for hierarchical aggregations (e.g. multiple grouping sets, rollups, or cubes), you can speed up the computation by producing and querying a bitmap that represents the set of all possible distinct values.

  • In this bitmap, you set the bits that correspond to the distinct values that are present in the data.

  • When computing the number of distinct values, you use the bitmap functions to count the bits that are set in the bitmap (rather than querying the table with COUNT(DISTINCT <expression>)).

The bitmap functions can perform better than COUNT(DISTINCT <expression>) under the following conditions:

  • The query performs a hierarchical aggregation (e.g. for multiple grouping sets, rollups, or cubes) that counts distinct values.

    Unlike COUNT(DISTINCT <expression>) (which needs to be executed for each group), you can compose and reuse bitmaps by calling the bitmap functions. This can reduce the cost of the query plan.

  • The range of values is dense (e.g. the value is generated by a sequence)

    Note that if the value range is sparse, you can use the DENSE_RANK window function to transform the sparse range of values into a dense range of values.

  • The range of values is small. A large range of values might require multiple bitmaps that do not fit into main memory and must be saved to disk.

In addition, to improve performance further, you can compute these bitmaps ahead of time (e.g. in a materialized view), rather than during the query, and you can use these precomputed bitmaps in your query.

Understanding How Bitmaps Identify Distinct Values

A bitmap is a contiguous piece of memory that is stored as a BINARY data type. A bitmap effectively is an array of bits that can be set individually. For example, a 4-byte bitmap consists of 32 bits (4 bytes * 8 bits per byte).

For each possible distinct value, you can use a bit in the bitmap to represent the presence or absence of the distinct value in the data. For example, if the values 3 and 5 are present in the data, you can set the 3rd and 5th bits to 1 in the bitmap. (If the distinct values are not numeric values, you must map the values to numeric values.)

For the bitmap functions in Snowflake, the default size of a bitmap is 32,768 bits (4 KiB). Note that this size does not correspond to the physical size of the BINARY value. Internally, the bitmap functions manage the physical representation of the bitmap, which might not be an actual bitmap. (For example, the functions might use an index vector.) The physical size of a bitmap can vary from 10 bytes to 4108 bytes.

If the number of distinct values is greater than 32,768 bits, multiple bitmaps are needed to represent all of the values. The process of dividing up the bits for distinct values into different bitmaps is called bucketization. For example, the bits for the distinct values ranging from 1 - 65,536 are bucketized into two separate buckets. The bitmap in one bucket represents the values 1 - 32,768, and the bitmap in the other bucket represents the values 32,769 - 65,536. The bitmap in each bucket contains a subset of the bits representing the distinct values.

The following diagram shows the logical representation of a bitmap. (As mentioned earlier, the physical representation of the bitmap in the BINARY value might be different.)

Logical representation of a bitmap

A distinct value is represented by the combination of a bucket containing a bitmap and a bit that is set in that bitmap. To identify the bucket and bit that represents a specific value, use the following functions:

  • Call BITMAP_BUCKET_NUMBER to identify the bucket containing the bitmap that has the bit for the value.

  • Call BITMAP_BIT_POSITION to identify the zero-based position of the bit within the bitmap for the value.

For example, the numeric value 1 is represented by the bit at position 0 in bitmap 1:

select bitmap_bucket_number(1), bitmap_bit_position(1);

+-------------------------+------------------------+
| BITMAP_BUCKET_NUMBER(1) | BITMAP_BIT_POSITION(1) |
|-------------------------+------------------------|
|                       1 |                      0 |
+-------------------------+------------------------+
Copy

The numeric value 32,768 is represented by the bit at position 32,767 in bitmap 1:

select bitmap_bucket_number(32768), bitmap_bit_position(32768);

+-----------------------------+----------------------------+
| BITMAP_BUCKET_NUMBER(32768) | BITMAP_BIT_POSITION(32768) |
|-----------------------------+----------------------------|
|                           1 |                      32767 |
+-----------------------------+----------------------------+
Copy

As another example, the numeric value 32,769 is represented by the bit at position 0 in bitmap 2:

select bitmap_bucket_number(32769), bitmap_bit_position(32769);

+-----------------------------+----------------------------+
| BITMAP_BUCKET_NUMBER(32769) | BITMAP_BIT_POSITION(32769) |
|-----------------------------+----------------------------|
|                           2 |                          0 |
+-----------------------------+----------------------------+
Copy

Creating Bitmaps

To create bitmaps that represent all possible distinct values, call the BITMAP_CONSTRUCT_AGG function in a SELECT statement:

  1. Pass in the value returned by BITMAP_BIT_POSITION for the column to the BITMAP_CONSTRUCT_AGG function.

  2. In the SELECT statement, select BITMAP_BUCKET_NUMBER and use GROUP BY to aggregate the results for a given bitmap (identified by “bucket number”).

BITMAP_CONSTRUCT_AGG is an aggregation function. Aggregation in this context means setting the bit for a distinct value if any row has that distinct value. If multiple rows contain the value 3, BITMAP_CONSTRUCT_AGG just sets the bit for 3 once and does not change the value of the bit for the additional rows that contain 3.

For example, create the following table containing a column of numeric values. Insert two distinct values, one of which is greater than 32768.

CREATE OR REPLACE TABLE bitmap_test_values (val INT);
insert into bitmap_test_values values (1), (32769);
Copy

Run the following command to produce bitmaps with bits that represent the distinct values:

-- Display the bitmap in hexadecimal
alter session set binary_output_format='hex';

select bitmap_bucket_number(val) as bitmap_id,
    bitmap_construct_agg(bitmap_bit_position(val)) as bitmap
    from bitmap_test_values
    group by bitmap_id;

+-----------+----------------------+
| BITMAP_ID | BITMAP               |
|-----------+----------------------|
|         1 | 00010000000000000000 |
|         2 | 00010000000000000000 |
+-----------+----------------------+
Copy

Note

The BITMAP column contains a physical representation of the bitmap, which is not necessarily the actual bitmap. In this example, the column contains an index vector that represents the bitmap.

An index vector is one way in which the bitmap functions store the physical representation of the bitmap. Depending on the number of values represented by the bitmap, the bitmap functions can use different physical representations for the bitmap.

You should not expect the binary value of the bitmap to be stored in a specific format. To determine which bits are set, use the bitmap functions (rather than examining the binary value yourself).

Inserting additional rows with the same values does not change the resulting bitmap. The BITMAP_CONSTRUCT_AGG function only sets the bit for a distinct value once.

insert into bitmap_test_values values (32769), (32769), (1);

select bitmap_bucket_number(val) as bitmap_id,
    bitmap_construct_agg(bitmap_bit_position(val)) as bitmap
    from bitmap_test_values
    group by bitmap_id;

+-----------+----------------------+
| BITMAP_ID | BITMAP               |
|-----------+----------------------|
|         1 | 00010000000000000000 |
|         2 | 00010000000000000000 |
+-----------+----------------------+
Copy

Inserting other distinct values produces a different bitmap in which the corresponding bits for those values are set.

insert into bitmap_test_values values (2), (3), (4);

select bitmap_bucket_number(val) as bitmap_id,
    bitmap_construct_agg(bitmap_bit_position(val)) as bitmap
    from bitmap_test_values
    group by bitmap_id;

+-----------+----------------------+
| BITMAP_ID | BITMAP               |
|-----------+----------------------|
|         1 | 00040000010002000300 |
|         2 | 00010000000000000000 |
+-----------+----------------------+
Copy

Aggregating Bitmaps

If you need to aggregate different bitmaps in the same bucket (identified by the bucket number returned by BITMAP_BUCKET_NUMBER), call BITMAP_OR_AGG.

Computing the Number of Distinct Values from the Bitmaps

To get the total count of the distinct values from the bitmaps, call BITMAP_COUNT, passing in a bitmap created by BITMAP_CONSTRUCT_AGG or BITMAP_OR_AGG.

For example:

select bitmap_bucket_number(val) as bitmap_id,
    bitmap_count(bitmap_construct_agg(bitmap_bit_position(val))) as distinct_values
    from bitmap_test_values
    group by bitmap_id;

+-----------+-----------------+
| BITMAP_ID | DISTINCT_VALUES |
|-----------+-----------------|
|         1 |               4 |
|         2 |               1 |
+-----------+-----------------+
Copy

Using Bitmaps to Improve Query Performance

The following examples demonstrate how to use the bitmap functions as an alternative to COUNT(DISTINCT <expression>).

Example 1: Counting the Distinct Values in a Single Table

Suppose that you want to count the number of distinct values in my_column. The following table compares the SQL statements for performing this task with COUNT(DISTINCT expression) and the bitmap functions.

Example With COUNT(DISTINCT <expression>)

Example With Bitmap Functions

SELECT
  COUNT(DISTINCT my_column)
FROM my_table;
Copy
SELECT SUM(cnt) FROM (
  SELECT
    BITMAP_COUNT(BITMAP_CONSTRUCT_AGG(BITMAP_BIT_POSITION(my_column))) cnt
  FROM my_table
  GROUP BY BITMAP_BUCKET_NUMBER(my_table)
);
Copy

Note that if the range of values in my_column is 0 to 32,768, you can use this simpler statement instead:

-- If the full value range of my_column fits into the bitmap:
--   MIN(my_column) >= 0 AND MAX(my_column) < 32,768
SELECT
  BITMAP_COUNT(BITMAP_CONSTRUCT_AGG(my_column))
FROM my_table;
Copy

Example 2: Using GROUP BY to Compute the Counts by Group

Suppose that you want to count the number of distinct values in my_column by my_key_1 and my_key_2. The following table compares the SQL statements for performing this task with COUNT(DISTINCT expression) and the bitmap functions.

Example With COUNT(DISTINCT <expression>)

Example With Bitmap Functions

SELECT
  my_key_1,
  my_key_2,
  COUNT(DISTINCT my_column)
FROM my_table
GROUP BY my_key_1, my_key_2;
Copy
SELECT my_key_1, my_key_2, SUM(cnt) FROM (
  SELECT
    my_key_1,
    my_key_2,
    BITMAP_COUNT(BITMAP_CONSTRUCT_AGG(BITMAP_BIT_POSITION(my_column))) cnt
  FROM my_table
  GROUP BY my_key_1, my_key_2, BITMAP_BUCKET_NUMBER(my_column)
)
GROUP BY my_key_1, my_key_2;
Copy

Example 3: Using GROUP BY ROLLUP to Roll up Counts by Group

Bitmap functions work even more efficiently for GROUP BY ROLLUP aggregate queries. Bitmaps are composable (in contrast to COUNT(DISTINCT <expression>)), which results in less computation work and lower execution times.

Suppose that you want to roll up the number of distinct values in my_column by my_key_1 and my_key_2. The following table compares the SQL statements for performing this task with COUNT(DISTINCT expression) and the bitmap functions.

Example With COUNT(DISTINCT <expression>)

Example With Bitmap Functions

SELECT
  my_key_1,
  my_key_2,
  COUNT(DISTINCT my_column)
FROM my_table
GROUP BY ROLLUP(my_key_1, my_key_2);
Copy
SELECT my_key_1, my_key_2, SUM(cnt) FROM (
  SELECT
    my_key_1,
    my_key_2,
    BITMAP_COUNT(BITMAP_CONSTRUCT_AGG(BITMAP_BIT_POSITION(my_column))) cnt
  FROM my_table
  GROUP BY ROLLUP(my_key_1, my_key_2), BITMAP_BUCKET_NUMBER(my_column)
)
GROUP BY my_key_1, my_key_2;
Copy

Precomputing the Bitmaps

To improve performance, you can precompute the counts of distinct values in a table or materialized view.

For example, suppose that your data warehouse contains a fact table with multiple dimensions. You can define a materialized view that constructs the bitmaps to perform a coarse-grained precomputation or pre-aggregation before computing the final aggregates or cubes that require a COUNT(DISTINCT <expression>).

The following example creates a table containing the bitmaps and uses this table to compute the number of distinct values, aggregated by different dimensions.

The following statement creates a table named precompute that contains the bitmaps and bucket information:

CREATE TABLE precompute AS
SELECT
  my_dimension_1,
  my_dimension_2,
  BITMAP_BUCKET_NUMBER(my_column) bucket,
  BITMAP_CONSTRUCT_AGG(BITMAP_BIT_POSITION(my_column)) bmp
FROM my_table
GROUP BY 1, 2, 3;
Copy

The following statement computes the aggregates for my_dimension_1 and my_dimension_2:

SELECT
  my_dimension_1,
  my_dimension_2,
  SUM(BITMAP_COUNT(bmp))
FROM precompute
GROUP BY 1, 2;
Copy

The following statement computes the aggregate only for my_dimension_1:

SELECT my_dimension_1, SUM(cnt) FROM (
  SELECT
    my_dimension_1,
    BITMAP_COUNT(BITMAP_OR_AGG(bmp)) cnt
  FROM precompute
  GROUP BY 1, bucket
)
GROUP BY 1;
Copy

The following statement computes the aggregate only for my_dimension_2:

SELECT my_dimension_2, SUM(cnt) FROM (
  SELECT
    my_dimension_2,
    BITMAP_COUNT(BITMAP_OR_AGG(bmp)) cnt
  FROM precompute
  GROUP BY 1, bucket
)
GROUP BY 1;
Copy