- Categories: Aggregate functions (Frequency Estimation), Window function syntax and usage
APPROX_TOP_K_ACCUMULATE
Returns the Space-Saving summary at the end of aggregation. (For more information about the Space-Saving summary, see Estimating Frequent Values.)
The function APPROX_TOP_K discards its internal, intermediate state when the final frequency estimates are returned. However, in certain advanced use cases, such as incrementally estimating frequent values during bulk loading, you might want to keep the intermediate state, in which case you would use APPROX_TOP_K_ACCUMULATE instead of APPROX_TOP_K.
In contrast to APPROX_TOP_K, APPROX_TOP_K_ACCUMULATE does not return a frequency estimate of items. Instead, it returns the algorithm state itself. The intermediate state can later be:
- Combined (that is, merged) with intermediate states from separate but related batches of data.
- Processed by other functions that operate directly on the intermediate state, for example, APPROX_TOP_K_ESTIMATE. (For an example, see the Examples section below.)
- Exported to external tools.
- See also: APPROX_TOP_K, APPROX_TOP_K_COMBINE, APPROX_TOP_K_ESTIMATE
Syntax
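Based on the two arguments described in the Arguments section, a call has the following general shape. This is a sketch in which angle brackets mark placeholders, not literal syntax:

```sql
-- <expr>:     expression (for example, a column name) whose frequent values are tracked
-- <counters>: maximum number of distinct values tracked at a time
APPROX_TOP_K_ACCUMULATE( <expr> , <counters> )
```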
Arguments
expr
The expression (for example, a column name) for which you want to find the most common values.
counters
The maximum number of distinct values that can be tracked at a time during the estimation process.
For example, if counters is set to 100000, then the algorithm tracks 100,000 distinct values, attempting to keep the 100,000 most frequent values.
The maximum number of counters is 100000 (100,000).
Usage notes
Decimal-float (DECFLOAT) values aren’t supported.
Examples
This example shows how to use the three related functions APPROX_TOP_K_ACCUMULATE, APPROX_TOP_K_ESTIMATE, and APPROX_TOP_K_COMBINE.
Note
This example uses more counters than distinct data values in order to get consistent results. In real-world applications, the number of distinct values is usually larger than the number of counters, so approximations can vary.
This example generates one table with 8 rows that have the values 1 through 8, and a second table with 8 rows that have the values 5 through 12. Thus the most frequent values in the union of the two tables are the values 5 through 8, each of which has a count of 2.
Create a simple table and data:
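A minimal sketch of this step follows. The table name sequence_demo comes from the text; the column name c1 and the use of a plain VALUES list are assumptions made for illustration:

```sql
-- Hypothetical single-column layout; c1 is an assumed column name.
CREATE OR REPLACE TABLE sequence_demo (c1 INTEGER);

-- Insert 8 rows holding the values 1 through 8, one row per value.
INSERT INTO sequence_demo (c1)
  VALUES (1), (2), (3), (4), (5), (6), (7), (8);
```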
Create a table that contains the “state” that represents the current approximate Top K information for the table named sequence_demo:
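One way to store that state is sketched below. The names resultstate1, rs1, and c1 are hypothetical, and the counter value 50 is an assumption chosen only because it exceeds the number of distinct values in the data:

```sql
-- Save the Space-Saving summary for sequence_demo instead of a final estimate.
-- resultstate1, rs1, and c1 are assumed names; 50 is an assumed counter count.
CREATE OR REPLACE TABLE resultstate1 AS
  SELECT APPROX_TOP_K_ACCUMULATE(c1, 50) AS rs1
    FROM sequence_demo;
```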
Now create a second table and add data. (In a more realistic situation, the user could have loaded more data into the first table and divided the data into non-overlapping sets based on the time that the data was loaded.)
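A sketch of the second batch, assuming a hypothetical table named test_table2 with a single column c1, might look like this:

```sql
-- Hypothetical second table; test_table2 and c1 are assumed names.
CREATE OR REPLACE TABLE test_table2 (c1 INTEGER);

-- Insert 8 rows holding the values 5 through 12, so 5 through 8 overlap the first table.
INSERT INTO test_table2 (c1)
  VALUES (5), (6), (7), (8), (9), (10), (11), (12);
```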
Get the “state” information for just the new data.
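Assuming the new rows live in a hypothetical table test_table2 with a column c1, the state for that batch alone could be captured as follows. The name resultstate2 and the counter value 50 are likewise assumptions:

```sql
-- Save the summary for the new batch only, keeping it separate from any earlier state.
-- resultstate2, rs1, test_table2, and c1 are assumed names.
CREATE OR REPLACE TABLE resultstate2 AS
  SELECT APPROX_TOP_K_ACCUMULATE(c1, 50) AS rs1
    FROM test_table2;
```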
Combine the “state” information for the two batches of rows:
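The merge can be sketched with APPROX_TOP_K_COMBINE, assuming the two batch states were saved in hypothetical tables resultstate1 and resultstate2, each with a state column named rs1:

```sql
-- Stack the per-batch states into one column, then merge them into a single state.
-- combined_resultstate, resultstate1, resultstate2, rs1, and c1 are assumed names.
CREATE OR REPLACE TABLE combined_resultstate AS
  SELECT APPROX_TOP_K_COMBINE(rs1) AS c1
    FROM (
        SELECT rs1 FROM resultstate1
        UNION ALL
        SELECT rs1 FROM resultstate2
      ) AS all_states;
```

Merging saved states this way avoids rescanning the rows that earlier batches already summarized.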
Get the approximate Top K value of the combined set of rows:
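A sketch of the final step, assuming the merged state is stored in a hypothetical table combined_resultstate with a state column c1, and asking for the top 4 values:

```sql
-- Turn the merged state into frequency estimates for the 4 most common values.
-- combined_resultstate and c1 are assumed names.
SELECT APPROX_TOP_K_ESTIMATE(c1, 4) AS top_values
  FROM combined_resultstate;
```

With the data described for this example, the result should list the values 5 through 8, each with an estimated count of 2. The order among tied values is not something this sketch relies on.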