Categories:

Aggregate functions (Cardinality Estimation) , Window functions

DATASKETCHES_HLL¶

Returns an approximation of the distinct cardinality of the input (that is, DATASKETCHES_HLL(col1) returns an approximation of COUNT(DISTINCT col1)).

This function is a version of the HLL HyperLogLog function that can read binary sketches in the format used by Apache DataSketches. For more information, see the Apache DataSketches documentation.

See also:

DATASKETCHES_HLL_ACCUMULATE , DATASKETCHES_HLL_COMBINE , DATASKETCHES_HLL_ESTIMATE

Syntax¶

Aggregate function

DATASKETCHES_HLL( [ DISTINCT ] <expr1> [ , <max_log_k> ] )
Copy

Window function

DATASKETCHES_HLL( [ DISTINCT ] <expr1> [ , <max_log_k> ] )
  OVER ( [ PARTITION BY <expr2> ] )
Copy

Required arguments¶

expr1

The expression for which you want to know the number of distinct values.

Optional arguments¶

max_log_k

The maximum value, in log2, of K to initialize the datasketches HLL object. Specify an INTEGER value between 4 and 21, inclusive. For more information, see the Apache DataSketches documentation.

Default: 12

expr2

The optional expression used to group rows into partitions.

Returns¶

The function returns a value of type DOUBLE.

If the input is empty, the output is 0.0.

Usage notes¶

  • DISTINCT is supported syntactically, but has no effect.

  • The function supports arguments that are values of the following data types:

    • String & binary data types (for example, VARCHAR and BINARY).

      For example, the following function calls are supported:

      SELECT DATASKETCHES_HLL_ACCUMULATE(1::TEXT);
      
      Copy
      SELECT DATASKETCHES_HLL_ACCUMULATE(TO_BINARY(HEX_ENCODE(1), 'HEX'));
      
      Copy
    • Data types for floating-point numbers (for example, FLOAT and DOUBLE)

      The DataSketches library casts these values to DOUBLE values.

    • Data types for fixed-point numbers (for example, INTEGER and NUMERIC).

      The function only supports numeric types with a scale of 0. However, you can cast numeric values with a scale other than 0 to a data types for a floating-point number.

      The DataSketches library casts these values in the range of a 64-bit signed INTEGER to a 64-bit signed INTEGER value.

      The DataSketches library doesn’t directly cast INTEGER values exceeding the 64-bit signed INTEGER range (such as 128-bit integer values). However, Snowflake still supports these values by automatically converting them to DOUBLE values, which DataSketches supports. This behavior is identical to the behavior of the datasketches-python library.

    Values of other data types aren’t supported. For example, VARIANT and ARRAY values aren’t supported.

  • For information about NULL values and aggregate functions, see Aggregate functions and NULL values.

  • When this function is called as a window function, it doesn’t support:

    • An ORDER BY clause within the OVER clause.

    • Explicit window frames.

Examples¶

Create a table and insert values:

CREATE OR REPLACE TABLE datasketches_demo(v INT, g INT);

INSERT INTO datasketches_demo SELECT 1, 1;
INSERT INTO datasketches_demo SELECT 2, 1;
INSERT INTO datasketches_demo SELECT 2, 1;
INSERT INTO datasketches_demo SELECT 2, 1;
INSERT INTO datasketches_demo SELECT 1, 2;
INSERT INTO datasketches_demo SELECT 1, 2;
INSERT INTO datasketches_demo SELECT 4, 2;
INSERT INTO datasketches_demo SELECT 4, 2;
INSERT INTO datasketches_demo SELECT 5, 2;
Copy

The following examples use the data in the table.

Return the estimated cardinality of grouped data in a column¶

Use the DATASKETCHES_HLL function to approximate the distinct cardinality of the data in column v grouped by the values in column g.

SELECT g,
       DATASKETCHES_HLL(v),
       COUNT(DISTINCT v)
  FROM datasketches_demo GROUP BY g;
Copy
+---+---------------------+-------------------+
| G | DATASKETCHES_HLL(V) | COUNT(DISTINCT V) |
|---+---------------------+-------------------|
| 1 |         2.000000005 |                 2 |
| 2 |         3.000000015 |                 3 |
+---+---------------------+-------------------+

The output shows that for value 1 in column g, there are about two distinct values in column v (that is, 1 and 2). For value 2 in column g, there are about three distinct values in column v (that is, 1, 4, and 5). The COUNT(DISTINCT v)) call returns exact number of distinct values instead of an estimate.

If you use the DATASKETCHES_HLL_ACCUMULATE function to create binary sketches from the grouped data, the DATASKETCHES_HLL_ESTIMATE function returns the same results for the accumulated sketches. For an example, see Return the cardinality estimate for accumulated binary sketches.

Return the estimated cardinality of all data in a column¶

Use the DATASKETCHES_HLL function to approximate the distinct cardinality of all of the data in column v.

SELECT DATASKETCHES_HLL(v),
       COUNT(DISTINCT v)
  FROM datasketches_demo;
Copy
+---------------------+-------------------+
| DATASKETCHES_HLL(V) | COUNT(DISTINCT V) |
|---------------------+-------------------|
|          4.00000003 |                 4 |
+---------------------+-------------------+

The output shows that there are about four distinct values in column v (that is, 1, 2, 4, and 5). The COUNT(DISTINCT v)) call returns exact number of distinct values instead of an estimate.

If you use the DATASKETCHES_HLL_ACCUMULATE function to create binary sketches from the grouped data, and then use the DATASKETCHES_HLL_COMBINE function to combine the sketches into one unified sketch, the DATASKETCHES_HLL_ESTIMATE function returns the same results for the unified sketch. For an example, see Return the cardinality estimate for combined binary sketches.