Categories:

Aggregate functions (Percentile Estimation) , Window function syntax and usage

APPROX_PERCENTILE_ACCUMULATE

Returns the internal representation of the t-Digest state (as a JSON object) at the end of aggregation. (For more information about t-Digest, see: Estimating Percentile Values.)

The function APPROX_PERCENTILE discards this internal, intermediate state when the final percentile estimate is returned. However, in certain advanced use cases, such as estimating incremental percentile during bulk loading, you may wish to keep the intermediate state, in which case you would use APPROX_PERCENTILE_ACCUMULATE instead of APPROX_PERCENTILE.

APPROX_PERCENTILE_ACCUMULATE does not return a percentile value. Instead, it returns the algorithm state itself. The intermediate state can later be:

  • Combined (i.e. merged) with other intermediate states from separate but related batches of data.

  • Processed by other functions that operate directly on the intermediate state, for example, APPROX_PERCENTILE_ESTIMATE. (For an example, see the Examples section below.)

  • Exported to external tools.

See also:

APPROX_PERCENTILE_COMBINE , APPROX_PERCENTILE_ESTIMATE

Syntax

APPROX_PERCENTILE_ACCUMULATE( <expr> )
Copy

Arguments

expr

A valid expression, such as a column name, that evaluates to a numeric value.

Usage notes

  • Percentile works only on numeric values, so expr should produce values that are numbers or can be cast to numbers.

Example

Store the t-Digest state of the testTable.c1 column in a table and then use the state to compute percentiles:

-- create a table from the accumulated t-Digest state for testtable.c1
create or replace table resultstate as
    select approx_percentile_accumulate(c1) s from testtable;

-- Next, use the t-Digest state to compute percentiles for testtable.

-- returns an approximated value for the 1.5th percentile of testtable.c1
select approx_percentile_estimate(s, 0.015) from resultstate;

-- returns an approximated value for the 20th percentile of testtable.c1
select approx_percentile_estimate(s, 0.2) from resultstate;
Copy

Here is a more extensive example. This one shows the usage of all three related functions: APPROX_PERCENTILE_ACCUMULATE, APPROX_PERCENTILE_ESTIMATE, and APPROX_PERCENTILE_COMBINE:

Create a simple table and data:

-- Create a table and insert some rows for which we'll later estimate the 
-- median value (the value at the 50th percentile).
CREATE OR REPLACE TABLE test_table1 (c1 INTEGER);
-- Insert data.
INSERT INTO test_table1 (c1) VALUES (1), (2), (3), (4);
Copy

Create a table that contains the “state” that represents the current approximate percentile information for the table named test_table1:

CREATE OR REPLACE TABLE resultstate1 AS (
     SELECT approx_percentile_accumulate(c1) AS rs1
        FROM test_table1);
Copy

Use that state information to display the current estimate of the median value (0.5 means that we want the value at the 50th percentile):

SELECT approx_percentile_estimate(rs1, 0.5) 
    FROM resultstate1;
Copy

Output:

SELECT approx_percentile_estimate(rs1, 0.5) 
    FROM resultstate1;
+--------------------------------------+
| APPROX_PERCENTILE_ESTIMATE(RS1, 0.5) |
|--------------------------------------|
|                                  2.5 |
+--------------------------------------+
Copy

Now create a second table and add data. (In a more realistic situation, the user could have loaded more data into the first table and divided the data into non-overlapping sets based on the time that the data was loaded.)

CREATE OR REPLACE TABLE test_table2 (c1 INTEGER);
-- Insert data.
INSERT INTO test_table2 (c1) VALUES (5), (6), (7), (8);
Copy

Get the “state” information for just the new data.

CREATE OR REPLACE TABLE resultstate2 AS 
  (SELECT approx_percentile_accumulate(c1) AS rs1 
     FROM test_table2);
Copy

Combine the “state” information for the two batches of rows:

CREATE OR REPLACE TABLE combined_resultstate (c1) AS 
  SELECT approx_percentile_combine(rs1) AS apc1
    FROM (
        SELECT rs1 FROM resultstate1
        UNION ALL
        SELECT rs1 FROM resultstate2
      )
      ;
Copy

Get the approximate median value of the combined set of rows:

SELECT approx_percentile_estimate(c1, 0.5) FROM combined_resultstate;
Copy

Output:

SELECT approx_percentile_estimate(c1, 0.5) FROM combined_resultstate;
+-------------------------------------+
| APPROX_PERCENTILE_ESTIMATE(C1, 0.5) |
|-------------------------------------|
|                                 4.5 |
+-------------------------------------+
Copy