Aggregate Functions¶

Aggregate functions operate on values across rows to perform mathematical calculations such as sum, average, counting, minimum/maximum values, standard deviation, and estimation, as well as some non-mathematical operations.

An aggregate function takes multiple rows (actually, zero, one, or more rows) as input and produces a single output. In contrast, scalar functions take one row as input and produce one row (one value) as output.

An aggregate function always returns exactly one row, even when the input contains zero rows. Typically, if the input contained zero rows, the output is NULL. However, an aggregate function could return 0, an empty string, or some other value when passed zero rows.

List of Functions (by Sub-category)¶

Function Name

Notes

General Aggregation

Uses different syntax than the other aggregate functions.

Uses different syntax than the other aggregate functions.

STDDEV and STDDEV_SAMP are aliases.

Alias for VAR_POP.

Alias for VAR_SAMP.

Bitwise Aggregation

Boolean Aggregation

Hash

Semi-structured Data Aggregation

Linear Regression

Statistics and Probability

Counting Distinct Values

Cardinality Estimation . (using HyperLogLog)

Alias for HLL.

Not an aggregate function; uses scalar input from HLL_ACCUMULATE or HLL_COMBINE.

Similarity Estimation . (using MinHash)

Alias for APPROXIMATE_SIMILARITY.

Frequency Estimation . (using Space-Saving)

Not an aggregate function; uses scalar input from APPROX_TOP_K_ACCUMULATE or APPROX_TOP_K_COMBINE.

Percentile Estimation . (using t-Digest)

Not an aggregate function; uses scalar input from APPROX_PERCENTILE_ACCUMULATE or APPROX_PERCENTILE_COMBINE.

Aggregation Utilities

Not an aggregate function, but can be used in conjunction with aggregate functions to determine the level of aggregation for a row produced by a GROUP BY query.

Alias for GROUPING.

Introductory Example¶

The following example illustrates the difference between an aggregate function (AVG) and a scalar function (COS). The scalar function returns one output row for each input row, while the aggregate function returns one output row for multiple input rows:

The table contains the following data:

SELECT x, y 
    FROM simple
    ORDER BY x,y;
+----+----+
|  X |  Y |
|----+----|
| 10 | 20 |
| 20 | 44 |
| 30 | 70 |
+----+----+
Copy

The scalar function returns one output row for each input row.

SELECT COS(x)
    FROM simple
    ORDER BY x;
+---------------+
|        COS(X) |
|---------------|
| -0.8390715291 |
|  0.4080820618 |
|  0.1542514499 |
+---------------+
Copy

The aggregate function returns one output row for multiple input rows:

SELECT SUM(x)
    FROM simple;
+--------+
| SUM(X) |
|--------|
|     60 |
+--------+
Copy

Aggregate Functions and NULL Values¶

Some aggregate functions ignore NULL values. For example, AVG calculates the average of values 1, 5, and NULL to be 3, based on the following formula:

(1 + 5) / 2 = 3

In both the numerator and the denominator, only the two non-NULL values are used.

If all of the values passed to the aggregate function are NULL, then the aggregate function returns NULL.

Some aggregate functions can be passed more than one column. For example:

SELECT COUNT(col1, col2) FROM table1;
Copy

In these instances, the aggregate function ignores a row if any individual column is NULL.

For example, in the following query, COUNT returns 1, not 4, because three of the four rows contain at least one NULL value in the selected columns:

Create a table and populate it with values:

CREATE TABLE t (x INT, y INT);
INSERT INTO t (x, y) VALUES
  (1, 2),         -- No NULLs.
  (3, NULL),      -- One but not all columns are NULL.
  (NULL, 6),      -- One but not all columns are NULL.
  (NULL, NULL);   -- All columns are NULL.
Copy

Query the table:

SELECT COUNT(x, y) FROM t;
+-------------+
| COUNT(X, Y) |
|-------------|
|           1 |
+-------------+
Copy

Similarly, if SUM is called with an expression that references two or more columns, and if one or more of those columns is NULL, then the expression evaluates to NULL, and the row is ignored:

SELECT SUM(x + y) FROM t;
+------------+
| SUM(X + Y) |
|------------|
|          3 |
+------------+
Copy

Note that this behavior differs from the behavior of GROUP BY, which does not discard rows when some columns are NULL:

SELECT x AS X_COL, y AS Y_COL FROM t GROUP BY x, y;
+-------+-------+
| X_COL | Y_COL |
|-------+-------|
|     1 |     2 |
|     3 |  NULL |
|  NULL |     6 |
|  NULL |  NULL |
+-------+-------+
Copy