Categories:

Aggregate Functions (Frequency Estimation) , Window Functions

# APPROX_TOP_K¶

Uses Space-Saving to return an approximation of the most frequent values in the input, along with their approximate frequencies.

The output is a JSON array of arrays. In the inner arrays, the first entry is a value in the input, and the second entry corresponds to its estimated frequency in the input. The outer array contains k items, sorted by descending frequency.

## Syntax¶

Aggregate function

APPROX_TOP_K( <expr> [ , <k> [ , <counters> ] ] )


Window function

APPROX_TOP_K( <expr> [ , <k> [ , <counters> ] ] ) OVER ( [ PARTITION BY <expr4> ] )


## Arguments¶

• expr: The expression (e.g. column name) for which you want to find the most common values.

• k: The number of values whose counts you want approximated. For example, if you want to see the top 10 most common values, then set k to 10.

If k is omitted, the default is 1.

The maximum value is 100000 (100,000), and is automatically reduced if items cannot fit in the output.

• counters: This is the maximum number of distinct values that can be tracked at a time during the estimation process. For example, if counters is set to 100000, then the algorithm tracks 100,000 distinct values, attempting to keep the 100,000 most frequent values.

The maximum number of counters is 100000 (100,000).

expr4

This is the optional expression used to group rows into partitions.

## Usage Notes¶

• The approximation is more accurate if the number of counters is large, so in most cases counters should be considerably bigger than k. (Each counter uses only a small amount of memory, so increasing the number of counters is not expensive in terms of memory.)

• When used as a window function:

• This function does not support:

• ORDER BY sub-clause in the OVER() clause.

• Window frames.

## Examples¶

SELECT APPROX_TOP_K(C4) FROM lineitem;

--------------------+
APPROX_TOP_K(C4,3) |
--------------------|
[                  |
[                |
1,             |
124923         |
],               |
[                |
2,             |
107093         |
],               |
[                |
3,             |
89315          |
]                |
]                  |
--------------------+

WITH states AS (
SELECT approx_top_k(C4, 3, 5) AS state
FROM lineitem
)
SELECT value::INT AS value, value::INT AS frequency
FROM states, LATERAL FLATTEN(state);

-------+-----------+
VALUE | FREQUENCY |
-------+-----------|
1 |    124923 |
2 |    107093 |
3 |     89438 |
-------+-----------+