Using the Query Hash to Identify Patterns and Trends in Queries¶
To identify, group, and analyze similar queries in the query history, you can use a hash of the query text. For example, you can:
Group queries by the query hash to identify patterns in expensive queries.
Determine the effects of performance improvements (for example, changes to clustering keys) on repeated queries.
query_hash
and query_parameterized_hash
are new output columns in the ACCOUNT_USAGE views and in the output of
INFORMATION_SCHEMA table functions. These columns contain a hash of the query text. You can use this hash to analyze repeated
queries.
When the 2023_06 behavior change bundle is enabled, the query_hash
and
query_parameterized_hash
columns are available in the following views and in the output of the following table functions:
ACCOUNT_USAGE views (1 year retention)
INFORMATION_SCHEMA table functions (7 days retention)
QUERY_HISTORY table function
TASK_HISTORY table function
Prerequisites for Accessing the Query Hash¶
Currently, to access the columns containing the query hash and parameterized query hash, you must enable the 2023_06 bundle for your account.
For example:
SELECT SYSTEM$ENABLE_BEHAVIOR_CHANGE_BUNDLE('2023_06');
If the 2023_06 bundle is not enabled, the columns for the query hash and parameterized query hash do not appear in the ACCOUNT_USAGE views and in the output of the INFORMATION_SCHEMA table functions.
Using the Hash of the Query (query_hash
)¶
The query_hash
column contains a hash value that is computed, based on the canonicalized text of the SQL statement. Repeated
queries that have exactly the same query text have the same query_hash
values.
Repeated queries also have the same query_hash
if their query text differs only in:
Case insensitive identifier, session variable, and stage name
Note that this does not include identifiers specified using IDENTIFIER() with bind variables. Bind variables with different values produce different query hashes.
White space
Comments
If any other part of the query text of two queries differ, those queries have different query_hash
values.
For example, the following queries have the same query_hash
value because they have exactly the same query text.
SELECT * FROM table1 WHERE table1.name = 'TIM'
SELECT * FROM table1 WHERE table1.name = 'TIM'
You can use the query_hash
value to find patterns in query performance that might not be obvious otherwise. For example,
although a query might not be excessively expensive during any single execution, a frequently repeated query could lead to high
costs, based on the number of times it runs. You can use the query_hash
value to identify the queries to focus on optimizing
first.
For example, the following query uses the query_hash
value to identify the query IDs for the 100 longest-running queries:
SELECT
query_hash,
COUNT(*),
SUM(total_elapsed_time),
ANY_VALUE(query_id)
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE warehouse_name = 'MY_WAREHOUSE'
AND DATE_TRUNC('day', start_time) >= CURRENT_DATE() - 7
GROUP BY query_hash
ORDER BY SUM(total_elapsed_time) DESC
LIMIT 100;
Using the Hash of the Parameterized Query (query_parameterized_hash
)¶
query_parameterized_hash
contains a hash value that is computed based on the parameterized query, which means the version of
the query after literals are parameterized. These literals must be used in the query predicate and must be used with one of the
following comparison operators:
=
(equal to)!=
(not equal to)>=
(greater than or equal to)<=
(less than or equal to)
Repeated queries (including those with different parameter values) have the same query_parameterized_hash
value.
Repeated queries also have the same query_parameterized_hash
if their query text differs only in:
Case insensitive identifier, session variable, and stage name
Note that this does not include identifiers specified using IDENTIFIER() with bind variables. Bind variables with different values produce different query hashes.
White space
Comments
Queries that have the same query_parameterized_hash
value also have the same query_hash
value, but not vice versa.
For example, the following queries have the same query_parameterized_hash
value because the literal are the only difference
between the queries:
SELECT * FROM table1 WHERE table1.name = 'TIM'
SELECT * FROM table1 WHERE table1.name = 'AIHUA'
As is the case with the query_hash
value, you can use the query_parameterized_hash
value to find patterns in query
performance that might not be obvious otherwise.
The following statement identifies computes the average total_elapsed_time
each day for all queries with a specific
query_parameterized_hash
value (cbd58379a88c37ed6cc0ecfebb053b03
):
SELECT
DATE_TRUNC('day', start_time),
SUM(total_elapsed_time),
ANY_VALUE(query_id)
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE query_parameterized_hash = 'cbd58379a88c37ed6cc0ecfebb053b03'
AND DATE_TRUNC('day', start_time) >= CURRENT_DATE() - 30
GROUP BY DATE_TRUNC('day', start_time);
Checking the Version That Was Used to Generate the Hash¶
Over time, the logic used by Snowflake to generate the query hash can change. Changes to this logic can result in different hashes produced for the same query. For example, for a given query, the hash generated by version 1 of the logic might differ from the hash generated by version 2 of the logic.
The views and table function output that include the query_hash
and query_parameterized_hash
columns also include the
following columns that specify the version of the logic used to produce the hashes:
query_hash_version
query_parameterized_hash_version
The version number in these columns is a NUMBER (for example, 1
for the first version of the logic, 2
for the second
version of the logic, etc.).
If these columns contain different version numbers for different periods of time, you can use these version columns to identify the different hashes for the same query. For example:
...
WHERE (query_hash = 'hash_from_v1' AND query_hash_version = 1)
OR (query_hash = 'hash_from_v2' AND query_hash_version = 2)