Using window functions¶
Window functions operate on windows, which are groups of rows that are related (e.g. by date or location). This topic describes how to use the different types of window functions supported by Snowflake, including:
General window functions.
Window functions that calculate rank (e.g. highest, second-highest, etc.).
Window functions that support cumulative window frames and sliding window frames.
This document is aimed at readers who are not already fluent with window functions. Readers who are already fluent with these functions might find the reference material sufficient:
Documentation of each specific window function.
Windowing concepts¶
A window is a group of rows. A window can contain 0, 1, or multiple rows. However, for simplicity, we usually just say that a window contains “multiple rows”.
All the rows in a window are related in some way, for example by location (e.g. all from the same city) or by time (e.g. all from the same fiscal year).
A function that uses a window is a window function.
Functions categorized as window functions help answer different types of questions than scalar functions:
A query using a scalar function answers questions about a single row, using only data in that row.
A query using a window function answers questions about a row’s relationship to other rows in the same window.
For example, suppose that you manage one branch of a chain of five stores. To calculate the profit of your store, the calculation needs to look only at information about your specific store, such as the store’s revenue and costs. You’d use a scalar function for that query.
To calculate the profit of your store relative to other stores, the calculation must look at information not only about your store, but also about other stores. You’d use a window function for that query.
You can think of a window function as taking two arguments: the first argument is the column or expression to use in the calculation, for example, revenue or profit. The second argument defines the window (i.e. the group of rows used in the comparison); the window includes both the current row (your store) and the other rows to compare it with (other stores in the same chain).
To calculate your store’s percentage of the entire store chain’s profits, you divide your store’s profit by the total profit of all the stores (branch profit / chain profit).
To help explain window functions, this topic shows how to calculate your branch’s percentage of the company’s profits both with and without a window function.
The following example shows one way to calculate your store’s percentage of the chain’s profit without a window function.
Start by creating the table, loading the data, and calculating the profit of each store.
CREATE TRANSIENT TABLE store_sales ( branch_ID INTEGER, city VARCHAR, gross_sales NUMERIC(9, 2), gross_costs NUMERIC(9, 2), net_profit NUMERIC(9, 2) ); INSERT INTO store_sales (branch_ID, city, gross_sales, gross_costs) VALUES (1, 'Vancouver', 110000, 100000), (2, 'Vancouver', 140000, 125000), (3, 'Montreal', 150000, 140000), (4, 'Montreal', 155000, 146000); UPDATE store_sales SET net_profit = gross_sales - gross_costs;Now show each store’s percentage of the chain’s total profit:
SELECT branch_ID, net_profit AS store_profit, (SELECT SUM(net_profit) FROM store_sales) AS chain_profit, store_profit / chain_profit * 100 AS store_percentage_of_chain_profit FROM store_sales ORDER BY branch_ID; +-----------+--------------+--------------+----------------------------------+ | BRANCH_ID | STORE_PROFIT | CHAIN_PROFIT | STORE_PERCENTAGE_OF_CHAIN_PROFIT | |-----------+--------------+--------------+----------------------------------| | 1 | 10000.00 | 44000.00 | 22.72727300 | | 2 | 15000.00 | 44000.00 | 34.09090900 | | 3 | 10000.00 | 44000.00 | 22.72727300 | | 4 | 9000.00 | 44000.00 | 20.45454500 | +-----------+--------------+--------------+----------------------------------+
If you want a more granular report, showing each store’s profit as a percentage of all stores in that city, use the query below:
SELECT branch_ID, net_profit AS store_profit, (SELECT SUM(net_profit) FROM store_sales AS s2 WHERE s2.city = s1.city) AS city_profit, store_profit / city_profit * 100 AS store_percentage_of_city_profit FROM store_sales AS s1 ORDER BY branch_ID; +-----------+--------------+-------------+---------------------------------+ | BRANCH_ID | STORE_PROFIT | CITY_PROFIT | STORE_PERCENTAGE_OF_CITY_PROFIT | |-----------+--------------+-------------+---------------------------------| | 1 | 10000.00 | 25000.00 | 40.00000000 | | 2 | 15000.00 | 25000.00 | 60.00000000 | | 3 | 10000.00 | 19000.00 | 52.63157900 | | 4 | 9000.00 | 19000.00 | 47.36842100 | +-----------+--------------+-------------+---------------------------------+
It would be nice to have a function that would do approximately the same thing, dividing your store’s profit
by the sum of the profit of all the stores (or by the sum of the
profit of a specified group of stores, for example, all the stores in the same city). Such a function
could take two arguments, one of which was the column to do the calculation on, and the second of which specified
which rows to compare with. The second column could be a little like a WHERE
clause. You
might use that function in a way similar to the following (this is pseudo-code, not valid SQL):
SELECT branch_ID,
PERCENTAGE(net_profit, <where_condition>)
FROM store_sales;
This function would divide the current row’s (current store’s) profit by the sum of the profits of all stores that met the <where_condition>.
For example, to calculate the percentage of profit for each store in each city, the pseudo-code would look similar to:
SELECT branch_ID,
PERCENTAGE(net_profit, 'city')
FROM store_sales;
SQL doesn’t support the syntax shown above, but it does support the concept of a window function, which returns a result based on both the current row and on a defined group of rows.
Snowflake does not have a function named PERCENTAGE
, but it does have a function named RATIO_TO_REPORT
,
which divides the value in the current row by the sum of the values in all of the rows in a window.
Here’s the equivalent of the preceding query.
SELECT branch_ID, city, 100 * RATIO_TO_REPORT(net_profit) OVER (PARTITION BY city) FROM store_sales AS s1 ORDER BY city, branch_ID; +-----------+-----------+------------------------------------------------------------+ | BRANCH_ID | CITY | 100 * RATIO_TO_REPORT(NET_PROFIT) OVER (PARTITION BY CITY) | |-----------+-----------+------------------------------------------------------------| | 3 | Montreal | 52.63157900 | | 4 | Montreal | 47.36842100 | | 1 | Vancouver | 40.00000000 | | 2 | Vancouver | 60.00000000 | +-----------+-----------+------------------------------------------------------------+
The OVER()
clause defines the group of rows used in the calculation. It performs the same role
as the hypothetical second argument (the <where_condition>) in our earlier PERCENTAGE
function.
The PARTITION BY
sub-clause allows us to divide that window into sub-windows, in this case, one per city.
(If the statement does not explicitly use a PARTITION BY clause, then the window function operates on the entire input
as a single window.)
If you want to see the profit percentage relative to the entire chain, rather than just the stores within a specific city, then omit the PARTITION BY clause:
SELECT branch_ID, 100 * RATIO_TO_REPORT(net_profit) OVER () FROM store_sales AS s1 ORDER BY branch_ID; +-----------+-------------------------------------------+ | BRANCH_ID | 100 * RATIO_TO_REPORT(NET_PROFIT) OVER () | |-----------+-------------------------------------------| | 1 | 22.72727300 | | 2 | 34.09090900 | | 3 | 22.72727300 | | 4 | 20.45454500 | +-----------+-------------------------------------------+
Order-sensitive window functions¶
The result of the previous query does not depend upon the order of the rows selected by the PARTITION BY sub-clause. Your store’s percentage of corporate profits does not depend upon the order in which the query reads the other stores’ data.
Some queries, however, are order-sensitive. There are two types of order-sensitive window functions:
Rank-related window functions.
Window frame functions.
Some functions are both rank-related functions and window frame functions.
Order-sensitive functions use the ORDER BY
sub-clause of the OVER
clause, for example:
select order_sensitive_function(column_name) over ( [partition by <col2>] order by <col3> )
...
The ORDER BY
clause can be used with ASC
(ascending) or DESC
(descending). The default is ascending.
(The ORDER BY
sub-clause of the OVER
clause is separate from the ORDER BY
clause that sorts the final
output of a query.)
Window frame functions¶
A window frame is a subset of the rows in a window.
A window frame function uses a window frame to calculate things such as a moving average.
Snowflake supports two types of window frames:
Cumulative.
Sliding.
Cumulative window frames¶
A cumulative window is a variable-width window that starts at a fixed point and grows with each new row. For example, “Sales So Far This Month” is calculated using a cumulative window that starts on the first of the month and grows as each new day’s data is added:
+--------+-------+--------------+ | Day of | Sales | Sales So Far | | Month | Today | This Month | |--------+-------+--------------+ | 1 | 10 | 10 | | 2 | 10 | 20 | | 3 | 10 | 30 | | ... | ... | ... | +--------+-------+--------------+
Often, a cumulative window restarts from 0 at regular intervals. For example, if the table above showed the sales for February, then on March 1st, the starting point of the window frame would become March 1st, and the sales so far for the month would reset to 0 and start counting up from March 1st.
Sliding window frames¶
You can picture a sliding window frame by thinking about what you see as you look out the side window of a car as the car drives parallel to a fence. If the fence posts are spaced evenly, and if the width of the window is an integer multiple of the distance between posts, then the number of fenceposts you see remains constant; however, as you drive along, “old” fenceposts move out of your view, and “new” ones move into your view, so you don’t see the same fenceposts over time, even though you see the same number of fenceposts at all times.
A sliding window frame is a fixed-width frame that “slides along” the rows in the window, showing you a different slice of the window each time. As with the car driving past the fenceposts, the window frame moves along the data, with old rows disappearing from the frame and new rows appearing, so that the width of the frame (the number of rows in the frame) is always the same.
Sliding windows are often used to calculate moving averages. A moving average is an average that is calculated based on a fixed-size interval (e.g. number of days). The average is “moving” because although the size of the interval is constant, the actual values in the interval change over time (or over some other factor) as the window slides along.
For example, stock market analysts often analyze stocks based in part on the 13-week moving average of a stock’s price. The moving average price today is the average of price at the end of today and the price at the end of each day during the most recent 13 weeks. If stocks are traded 5 days a week, and if there were no holidays in the last 13 weeks, then the moving average is the average price on each of the most recent 65 trading days (including today).
The following example shows what happens to a 13-week (91 day) moving average of a stock price on the last day of June and the first few days of July:
On June 30th, the function returns the average price for April 1 to June 30 (inclusive).
On July 1st, the function returns the average price for April 2 to July 1 (inclusive).
On July 2nd, the function returns the average price for April 3 to July 2 (inclusive).
etc.
Note
Although sliding windows are fixed-width, when a window is first applied to a new data source, the data source might not have enough data to fill the window. For example, if record-keeping started on April 1st, then from April 1st to June 29th, the sliding window would include fewer than 91 days of data. The window’s width remains constant only after the window has finished filling.
The example below uses a small (3-day) sliding window over the first 7 days in the month. This demonstrates how the window frame slides across the window, always keeping the 3 most recent values for calculating the total within the window. This illustration takes into account that at the beginning of the period, the window might not be full:
And, as you can see in the corresponding table, the last column contains the sum of the three most recent days’ worth of sales
data. For example, the column value for day 4 is 36
, which is the sum of the sales for days 2, 3, and 4 (11 + 12 + 13
):
+--------+-------+---------------+ | Day of | Sales | Most Recent | | Month | Today | 3 Days' Sales | |--------+-------+---------------+ | 1 | 10 | 10 | | 2 | 11 | 21 | | 3 | 12 | 33 | | 4 | 13 | 36 | | 5 | 14 | 39 | | ... | ... | ... | +--------+-------+---------------+
More information about order-sensitive window functions¶
The PARTITION BY and ORDER BY clauses are independent. You can use the ORDER BY clause without the PARTITION BY clause. To rank your store against all other stores in the chain, not just against other stores in your city, use the query below:
SELECT
branch_ID,
net_profit,
RANK() OVER (ORDER BY net_profit DESC) AS sales_rank
FROM store_sales
Note
An ORDER BY sub-clause inside an OVER clause controls only the order in which the window function processes the rows; the clause does not control the output of the entire query. Controlling the output order of the query still requires an outer ORDER BY clause at the top level of the query. These ORDER BY clauses are independent and are not mutually exclusive. The following query uses the first ORDER BY clause to control processing by the window function and the second ORDER BY clause to control the order of the entire query’s output:
SELECT
branch_ID,
net_profit,
RANK() OVER (ORDER BY net_profit DESC) AS sales_rank
FROM store_sales
ORDER BY branch_ID;
Some window functions are order-sensitive and some are not.
Some window functions require an ORDER BY clause.
Some window functions prohibit an ORDER BY clause.
Some window functions use an ORDER BY clause if one is present, but do not require it.
Some window functions treat an ORDER BY clause as an implicit cumulative window frame clause. (More information about implied window frames is at Usage notes for window frames.)
Caution
In most situations, SQL is an explicit language. The language does what you tell it to do – no more, and no less. There are very few implied clauses, i.e. where one clause implicitly creates another clause that is not visible. For some window functions, an ORDER BY clause implicitly adds an additional clause (a cumulative window frame clause). This changes the behavior. More information about implied window frame clauses is at Usage notes for window frames.
Because behavior that is implied rather than explicit can lead to results that are difficult to understand, Snowflake recommends avoiding implied window frames. Instead, make all window frames explicit.
The following diagram shows the relationship between window functions, window-frame functions, and rank-related functions:
Window Functions
Order-insensitive Functions Order-sensitive Functions
+-----------------------------+--------------------------------------------------------------------+
| | Rank-related Functions |
| +--------------------------------------------------------------------+
| | Functions that can be both Rank-related and Window Frame functions |
| +--------------------------------------------------------------------+
| | Window Frame Functions |
+-----------------------------+--------------------------------------------------------------------+
As shown above:
Some window functions are not order-sensitive.
Some window functions are order-sensitive.
Order-sensitive window functions are divided into two categories:
Rank-related functions.
Window frame functions.
Some order-sensitive functions are both rank-related functions and window frame functions.
Window functions vs aggregate functions¶
Many window functions and aggregate functions have the same name.
For example there is a SUM()
window function and a SUM()
aggregate function.
To distinguish between the usage of the two, remember that:
For an aggregate function, the input is multiple rows, and the output is 1 row.
For a window function, there are two inputs, a window of rows, and a single row inside that window, and the output is 1 row per input row.
An aggregate function tells you something, such as the SUM
, about all of the rows combined, but nothing
about the individual rows.
A window function tells you something about the current row relative to all the other rows in the window.
The following SQL statements show the difference between using the SUM() aggregate function, which returns 1 row for the entire group of input rows, and using the SUM() window function, which returns 1 row for each row in the group of input rows:
SELECT SUM(net_profit) FROM store_sales; +-----------------+ | SUM(NET_PROFIT) | |-----------------| | 44000.00 | +-----------------+ SELECT branch_ID, SUM(net_profit) OVER () FROM store_sales ORDER BY branch_id; +-----------+-------------------------+ | BRANCH_ID | SUM(NET_PROFIT) OVER () | |-----------+-------------------------| | 1 | 44000.00 | | 2 | 44000.00 | | 3 | 44000.00 | | 4 | 44000.00 | +-----------+-------------------------+
Calling a window function¶
This section provides more information about the syntax of calling window functions.
Windows and window frames are specified using an OVER
clause:
... OVER ( [ PARTITION BY <expr1> ]
[ ORDER BY <expr2> [ { cumulativeFrame | slidingFrame } ] ]
) ...
The window frame syntax is covered in more detail in Window functions. This topic focuses on the subset of the syntax required by window functions, specifically:
The
PARTITION BY
sub-clause partitions the data into windows. For example, if you want to calculate running sums of monthly sales for more than one month, you could partition the data by month. This would allow you to calculate one running sum of sales for January, another running sum for February, etc.The
ORDER BY
sub-clause is primarily for rank-related window functions and for sliding and cumulative window frame functions; it determines the order of rows within each window.
Illustrated example¶
This example uses a sales scenario to illustrate many of the concepts described earlier in this topic.
Suppose that you need to generate a financial report that shows values based on sales over the last week:
Daily sales
Ranking within the week (i.e. sales ranked highest to lowest for the week)
This uses a rank-related window function (RANK).
Sales so far this week (i.e. the “running sum” for all days from the beginning of the week up through and including the current day)
This uses a window function (SUM), with a cumulative window frame.
Total sales for the week
This uses SUM as a simple window function.
3-day moving average (i.e. the average of the current day and the two previous days)
This uses (AVG) as a window function with a sliding window frame.
The report might look something like this:
+--------+-------+------+--------------+-------------+--------------+ | Day of | Sales | Rank | Sales So Far | Total Sales | 3-Day Moving | | Week | Today | | This Week | This Week | Average | |--------+-------+------+--------------+-------------|--------------+ | 1 | 10 | 4 | 10 | 84 | 10.0 | | 2 | 14 | 3 | 24 | 84 | 12.0 | | 3 | 6 | 5 | 30 | 84 | 10.0 | | 4 | 6 | 5 | 36 | 84 | 9.0 | | 5 | 14 | 3 | 50 | 84 | 10.0 | | 6 | 16 | 2 | 66 | 84 | 11.0 | | 7 | 18 | 1 | 84 | 84 | 12.0 | +--------+-------+------+--------------+-------------+--------------+
The SQL for this query is somewhat complex. Rather than show it as a single query, this discussion breaks down the SQL for the individual columns.
In a real world scenario, you would have years of data, so to calculate sums and averages for one specific week of data, you would need to use a one-week window, or use a filter similar to:
... WHERE date >= start_of_relevant_week and date <= end_of_relevant_week ...
However, for this example, assume that the table contains only the most recent week’s worth of data.
CREATE TABLE store_sales_2 ( day INTEGER, sales_today INTEGER ); +-------------------------------------------+ | status | |-------------------------------------------| | Table STORE_SALES_2 successfully created. | +-------------------------------------------+ INSERT INTO store_sales_2 (day, sales_today) VALUES (1, 10), (2, 14), (3, 6), (4, 6), (5, 14), (6, 16), (7, 18); +-------------------------+ | number of rows inserted | |-------------------------| | 7 | +-------------------------+
Calculating sales rank¶
The Rank
column is calculated using the RANK function:
SELECT day, sales_today, RANK() OVER (ORDER BY sales_today DESC) AS Rank FROM store_sales_2 ORDER BY day; +-----+-------------+------+ | DAY | SALES_TODAY | RANK | |-----+-------------+------| | 1 | 10 | 5 | | 2 | 14 | 3 | | 3 | 6 | 6 | | 4 | 6 | 6 | | 5 | 14 | 3 | | 6 | 16 | 2 | | 7 | 18 | 1 | +-----+-------------+------+
Note that although there are 7 days in the time period, there are only 5 different ranks (1, 2, 3, 5, 6). There were two ties (for 3rd place and 6th place), so there are no rows with ranks 4 or 7.
Calculating sales so far this week¶
The Sales So Far This Week
column is calculated using SUM as a window function
with a cumulative window frame:
SELECT day, sales_today, SUM(sales_today) OVER (ORDER BY day ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS "SALES SO FAR THIS WEEK" FROM store_sales_2 ORDER BY day; +-----+-------------+------------------------+ | DAY | SALES_TODAY | SALES SO FAR THIS WEEK | |-----+-------------+------------------------| | 1 | 10 | 10 | | 2 | 14 | 24 | | 3 | 6 | 30 | | 4 | 6 | 36 | | 5 | 14 | 50 | | 6 | 16 | 66 | | 7 | 18 | 84 | +-----+-------------+------------------------+
This query orders the rows by date and then, for each date, calculates the sum of sales from the start of the window up to the current date (inclusive).
Calculating total sales this week¶
The Total Sales This Week
column is calculated using SUM as a simple window
function.
SELECT day, sales_today, SUM(sales_today) OVER () AS total_sales FROM store_sales_2 ORDER BY day; +-----+-------------+-------------+ | DAY | SALES_TODAY | TOTAL_SALES | |-----+-------------+-------------| | 1 | 10 | 84 | | 2 | 14 | 84 | | 3 | 6 | 84 | | 4 | 6 | 84 | | 5 | 14 | 84 | | 6 | 16 | 84 | | 7 | 18 | 84 | +-----+-------------+-------------+
Calculating a 3-day moving average¶
The 3-Day Moving Average
column is calculated using AVG as a window function with a
sliding window frame:
SELECT day, sales_today, AVG(sales_today) OVER (ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS "3-DAY MOVING AVERAGE" FROM store_sales_2 ORDER BY day; +-----+-------------+----------------------+ | DAY | SALES_TODAY | 3-DAY MOVING AVERAGE | |-----+-------------+----------------------| | 1 | 10 | 10.000 | | 2 | 14 | 12.000 | | 3 | 6 | 10.000 | | 4 | 6 | 8.666 | | 5 | 14 | 8.666 | | 6 | 16 | 12.000 | | 7 | 18 | 16.000 | +-----+-------------+----------------------+
The difference between this sliding window frame and the cumulative window frame described earlier is simply the starting point:
In a sliding window frame, the starting point slides rightward.
In a cumulative window frame, the starting point is fixed and the frame continues to accumulate with each additional row within the window.
Putting it all together¶
Here’s the final version of our query, showing:
Rank.
Cumulative window frame.
Simple window function.
Sliding window function.
SELECT day, sales_today, RANK() OVER (ORDER BY sales_today DESC) AS Rank, SUM(sales_today) OVER (ORDER BY day ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS "SALES SO FAR THIS WEEK", SUM(sales_today) OVER () AS total_sales, AVG(sales_today) OVER (ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS "3-DAY MOVING AVERAGE" FROM store_sales_2 ORDER BY day; +-----+-------------+------+------------------------+-------------+----------------------+ | DAY | SALES_TODAY | RANK | SALES SO FAR THIS WEEK | TOTAL_SALES | 3-DAY MOVING AVERAGE | |-----+-------------+------+------------------------+-------------+----------------------| | 1 | 10 | 5 | 10 | 84 | 10.000 | | 2 | 14 | 3 | 24 | 84 | 12.000 | | 3 | 6 | 6 | 30 | 84 | 10.000 | | 4 | 6 | 6 | 36 | 84 | 8.666 | | 5 | 14 | 3 | 50 | 84 | 8.666 | | 6 | 16 | 2 | 66 | 84 | 12.000 | | 7 | 18 | 1 | 84 | 84 | 16.000 | +-----+-------------+------+------------------------+-------------+----------------------+
Additional examples¶
This section provides more detailed examples of using the windowing sub-clauses in window functions, and illustrates how these sub-clauses work together.
These examples use the following table and data:
CREATE TABLE sales (sales_date DATE, quantity INTEGER); INSERT INTO sales (sales_date, quantity) VALUES ('2018-01-01', 1), ('2018-01-02', 3), ('2018-01-03', 5), ('2018-02-01', 2) ;
Note
Many of these examples use two ORDER BY
clauses, one for the window clause, and one to put the result set in the most
readable order. For the purpose of this topic, references to the ORDER BY
clause are usually references
to the clause within the window.
Window with PARTITION BY clause¶
The PARTITION BY
sub-clause of the windowing clause divides the data into distinct subsets based on the value of the input
expression; the SELECT statement is applied to each subset, and the output has rows for each subset.
Note that this is similar to, but not identical to, how the GROUP BY
clause works.
The next example shows the quantity sold each month, and uses the PARTITION BY
clause to divide the data into one-month subsets:
SELECT MONTH(sales_date) AS MONTH_NUM, SUM(quantity) OVER (PARTITION BY MONTH(sales_date)) AS QUANTITY FROM sales ORDER BY sales_date; +-----------+----------+ | MONTH_NUM | QUANTITY | |-----------+----------| | 1 | 9 | | 1 | 9 | | 1 | 9 | | 2 | 2 | +-----------+----------+
As you can see, the first three rows are duplicates. There were 3 input rows for month #1, and the window function generates one output row for each input row, so there are 3 output rows for month #1 in the output. The SUM function is not used as a cumulative or sliding function; it is applied to the entire window and returns the same value for the entire window each time, so the function returns duplicate values as shown above.
You can reduce the duplicates by using the DISTINCT
keyword:
SELECT DISTINCT MONTH(sales_date) AS MONTH_NUM, SUM(quantity) OVER (PARTITION BY MONTH(sales_date)) FROM sales ORDER BY MONTH_NUM; +-----------+-----------------------------------------------------+ | MONTH_NUM | SUM(QUANTITY) OVER (PARTITION BY MONTH(SALES_DATE)) | |-----------+-----------------------------------------------------| | 1 | 9 | | 2 | 2 | +-----------+-----------------------------------------------------+
In this particular case, you can use a GROUP BY
clause rather than a windowing clause. For example:
SELECT MONTH(sales_date) AS MONTH_NUM, SUM(quantity) FROM sales GROUP BY MONTH_NUM ORDER BY MONTH_NUM; +-----------+---------------+ | MONTH_NUM | SUM(QUANTITY) | |-----------+---------------| | 1 | 9 | | 2 | 2 | +-----------+---------------+
However, GROUP BY
is not as flexible as windowing. Also, in more complex queries, you can’t always substitute GROUP BY
for a window clause.
Window with ORDER BY clause¶
The ORDER BY
clause controls the order of the data within each window (and each partition if there is more than one partition).
This is useful if you want to show a “running sum” over time as new rows are added.
A running sum can be calculated either from the beginning of the window to the current row (inclusive) or from the current row to the end of the window.
A query can use a “sliding” window, which is a fixed-width window that processes N specified rows relative to the current row (e.g. the 10 most recent rows, including the current row).
Both cumulative and sliding windows are explained below.
Cumulative window frame example¶
In a “cumulative” window frame, values are computed from the beginning of the window to the current row (or from the current row to the end of the window):
SELECT MONTH(sales_date) AS MONTH_NUM, quantity, SUM(quantity) OVER (ORDER BY MONTH(sales_date) ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS CUMULATIVE_SUM_QUANTITY FROM sales ORDER BY sales_date;
The query result includes additional comments showing how the CUMULATIVE_SUM_QUANTITY
column was calculated:
+-----------+----------+-------------------------+ | MONTH_NUM | QUANTITY | CUMULATIVE_SUM_QUANTITY | |-----------+----------+-------------------------| | 1 | 1 | 1 | -- sum = 1 | 1 | 3 | 4 | -- sum = 1 + 3 | 1 | 5 | 9 | -- sum = 1 + 3 + 5 | 2 | 2 | 11 | -- sum = 1 + 3 + 5 + 2 +-----------+----------+-------------------------+
Sliding window frame examples¶
In the financial world, analysts often study “moving averages”.
For example, you might have a graph in which the X axis is time, and the Y axis shows the average price of the stock over the last 13 weeks (i.e. “13-week moving average”). In a graph of a 13-week moving average of a stock price, the price shown for June 30th is not the price of the stock on June 30th, but the average price of the stock for the 13 weeks up to and including June 30th (i.e. April 1st through June 30th). The value on July 1st is the average price for April 2nd through July 1st; the value on July 2nd is the average price for April 3rd through July 2nd, and so on. Each day, the window effectively adds the most recent day’s value to the moving average, and removes the oldest day’s value. This smooths out day-to-day fluctuations and can make trends easier to recognize.
Moving averages can be calculated using a “sliding window”. The window has a specific width in rows. In the stock price example above, 13 weeks is 91 days, so the sliding window would be 91 days. If the measurements are taken once per day (e.g. at the end of the day), then the window would be 91 rows “wide”.
To define a window that is 91 rows wide:
SELECT AVG(price) OVER (ORDER BY timestamp1 ROWS BETWEEN 90 PRECEDING AND CURRENT ROW) FROM sales;
Note
When the window starts out, it might be less than 91 days wide. For example, suppose that you want the 13-week moving average price of a stock. If the stock was first created on April 1st, then on April 3rd only 3 days’ of price information exists, so the window is only 3 rows wide.
The following example shows the result of summing over a sliding window wide enough to hold two samples:
SELECT MONTH(sales_date) AS MONTH_NUM, quantity, SUM(quantity) OVER (ORDER BY sales_date ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS SLIDING_SUM_QUANTITY FROM sales ORDER BY sales_date;
The query result includes additional comments showing how the SLIDING_SUM_QUANTITY
column was calculated:
+-----------+----------+----------------------+ | MONTH_NUM | QUANTITY | SLIDING_SUM_QUANTITY | |-----------+----------+----------------------+ | 1 | 1 | 1 | -- sum = 1 | 1 | 3 | 4 | -- sum = 1 + 3 | 1 | 5 | 8 | -- sum = 3 + 5 (1 is no longer in the window) | 2 | 2 | 7 | -- sum = 5 + 2 (3 is no longer in the window) +-----------+----------+----------------------+
Note that the “sliding window” functionality requires the ORDER BY
clause; the sliding window must know the order
in which rows enter and exit the sliding window.
Window with PARTITION BY and ORDER BY clauses¶
You can combine PARTITION BY
and ORDER BY
clauses to get running sums within partitions. In this example, the partitions
are one month, and because the sums apply only within a partition, the sum is reset to 0
at the beginning of each new month:
SELECT MONTH(sales_date) AS MONTH_NUM, SUM(quantity) OVER (PARTITION BY MONTH(sales_date) ORDER BY sales_date) AS MONTHLY_CUMULATIVE_SUM_QUANTITY FROM sales ORDER BY sales_date;
The query result includes additional comments showing how the MONTHLY_CUMULATIVE_SUM_QUANTITY
column was calculated:
+-----------+---------------------------------+ | MONTH_NUM | MONTHLY_CUMULATIVE_SUM_QUANTITY | |-----------+---------------------------------+ | 1 | 1 | -- sum = 1 | 1 | 4 | -- sum = 1 + 3 | 1 | 9 | -- sum = 1 + 3 + 5 | 2 | 2 | -- sum = 0 + 2 (new month) +-----------+---------------------------------+
You can combine partitions and sliding windows. In the example below, the sliding window is usually two rows wide, but each time a new partition (i.e. new month) is reached, the sliding window starts with only the first row in that partition:
SELECT MONTH(sales_date) AS MONTH_NUM, quantity, SUM(quantity) OVER (PARTITION BY MONTH(sales_date) ORDER BY sales_date ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS MONTHLY_SLIDING_SUM_QUANTITY FROM sales ORDER BY sales_date;
The query result includes additional comments showing how the MONTHLY_SLIDING_SUM_QUANTITY
column was calculated:
+-----------+----------+------------------------------+ | MONTH_NUM | QUANTITY | MONTHLY_SLIDING_SUM_QUANTITY | |-----------+----------+------------------------------+ | 1 | 1 | 1 | -- sum = 1 | 1 | 3 | 4 | -- sum = 1 + 3 | 1 | 5 | 8 | -- sum = 3 + 5 | 2 | 2 | 2 | -- sum = 0 + 2 (new month) +-----------+----------+------------------------------+