- Categories:
Aggregate functions (Similarity Estimation), Window functions (Similarity Estimation)
MINHASH
Returns a MinHash state containing an array of size k, constructed by applying k different hash functions to the input rows and keeping the minimum value produced by each hash function. This MinHash state can then be passed to the APPROXIMATE_SIMILARITY function to estimate the similarity with one or more other MinHash states.
For more information about MinHash states, see Estimating Similarity of Two or More Sets.
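As a rough sketch of the idea (the notation is illustrative and is an assumption, not taken from this page): if h_1, ..., h_k are the k hash functions and A is a set of input rows, the state stores one minimum per hash function, and the similarity of two sets is commonly estimated as the fraction of positions at which their states agree. In the standard MinHash technique this fraction approximates the Jaccard index of the two sets.

```latex
\[
m_i(A) = \min_{a \in A} h_i(a), \qquad
\hat{J}(A, B) = \frac{1}{k} \sum_{i=1}^{k} \mathbf{1}\!\left[\, m_i(A) = m_i(B) \,\right]
\;\approx\; \frac{|A \cap B|}{|A \cup B|}
\]
```

Under this reading, a larger k averages over more positions, which is consistent with the statement that larger values of k give a better approximation.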
- See also:
APPROXIMATE_SIMILARITY, MINHASH_COMBINE
Syntax
Aggregate function
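A sketch of the aggregate call forms, inferred from the arguments and usage notes on this page. Angle brackets mark placeholders and square brackets mark optional parts; the exact layout is an assumption.

```sql
-- Hash one or more expressions (DISTINCT is accepted but has no effect).
MINHASH( <k> , [ DISTINCT ] <expr> [ , <expr> ... ] )

-- Hash all columns of the input rows.
MINHASH( <k> , * )
```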
Window function
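A sketch of the window call forms, assuming the OVER clause takes only an optional PARTITION BY; `<partition_expr>` is a placeholder name introduced here for illustration.

```sql
-- One MinHash state per partition, built from one or more expressions.
MINHASH( <k> , [ DISTINCT ] <expr> [ , <expr> ... ] ) OVER ( [ PARTITION BY <partition_expr> ] )

-- One MinHash state per partition, built from all columns.
MINHASH( <k> , * ) OVER ( [ PARTITION BY <partition_expr> ] )
```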
For details about the OVER clause, see Window function syntax and usage.
Arguments
k
The number of hash functions to create. The larger the value, the better the approximation; however, this value has a linear impact on the computation time for estimating similarity using APPROXIMATE_SIMILARITY. The suggested value is 100. The maximum value is 1024.
expr
One or more expressions (typically column names) that determine the values to hash.
*
Hash all columns in the input rows.
Usage notes
This function can be used as an aggregate function or a window function.
DISTINCT can be included as an argument, but has no effect.
Examples
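A minimal sketch of a direct call, assuming a hypothetical table named `orders`. The value 5 for k is chosen only to keep the returned state small; it is well below the suggested value of 100.

```sql
-- Build a single MinHash state from all columns of the hypothetical
-- table "orders", using 5 hash functions.
SELECT MINHASH(5, *) FROM orders;
```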
The following example shows the three related functions
MINHASH, MINHASH_COMBINE, and APPROXIMATE_SIMILARITY together. It
creates three tables (ta, tb, and tc): ta and tb are similar,
while ta and tc are completely dissimilar.
Create and populate tables with values:
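A sketch of this step, assuming each table has a single INTEGER column named `i`. The column name and the specific values are illustrative assumptions, chosen so that ta and tb overlap heavily and ta and tc share no values.

```sql
CREATE TABLE ta (i INTEGER);
CREATE TABLE tb (i INTEGER);
CREATE TABLE tc (i INTEGER);

-- Values for ta.
INSERT INTO ta (i) VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
-- Almost the same values as ta: only the last one differs.
INSERT INTO tb (i) VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (11);
-- No values in common with ta, and a different number of rows.
INSERT INTO tc (i) VALUES (-1), (-20), (-300), (-4000);
```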
Calculate the MinHash state for the initial set of data:
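A sketch of this step, assuming ta, tb, and tc each have an INTEGER column named `i`. The table names `minhash_a_1`, `minhash_b`, and `minhash_c` and the column name `mh` are hypothetical; k is set to the suggested value of 100.

```sql
-- Store one MinHash state per source table.
CREATE TABLE minhash_a_1 (mh) AS SELECT MINHASH(100, i) FROM ta;
CREATE TABLE minhash_b (mh) AS SELECT MINHASH(100, i) FROM tb;
CREATE TABLE minhash_c (mh) AS SELECT MINHASH(100, i) FROM tc;
```

Using the same k for every state matters here, since the states are compared with each other later.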
Add more data to one of the tables:
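A sketch of this step, assuming ta has an INTEGER column named `i` and that all previously inserted values are 10 or less; the value 12 is illustrative.

```sql
-- Add one new row to ta after its first MinHash state was computed.
INSERT INTO ta (i) VALUES (12);
```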
Demonstrate the MINHASH_COMBINE function:
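A sketch of this step under several assumptions: a hypothetical table `minhash_a_1` already holds, in a column `mh`, the MinHash state for the original rows of ta; the new rows can be identified with `i > 10`; and both states use k = 100. The idea is to hash only the new rows and then merge the two states, instead of rescanning the whole table.

```sql
-- Record MinHash information about only the new rows.
CREATE TABLE minhash_a_2 (mh) AS SELECT MINHASH(100, i) FROM ta WHERE i > 10;

-- Combine the states for the old and new rows into one state for ta.
CREATE TABLE minhash_a (mh) AS
  SELECT MINHASH_COMBINE(mh) FROM
    (
      (SELECT mh FROM minhash_a_1)
      UNION ALL
      (SELECT mh FROM minhash_a_2)
    );
```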
This query shows the approximate similarity of the two similar tables
(ta and tb):
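A sketch of such a query, assuming hypothetical tables `minhash_a` and `minhash_b` that each hold one MinHash state for ta and tb in a column named `mh`, both built with the same k.

```sql
-- Feed both states to APPROXIMATE_SIMILARITY as rows of a single input.
SELECT APPROXIMATE_SIMILARITY(mh) FROM
  (
    (SELECT mh FROM minhash_a)
    UNION ALL
    (SELECT mh FROM minhash_b)
  );
```

Because the two tables share most of their values, the result should be well above 0 but below 1.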
This query shows the approximate similarity of the two very different tables
(ta and tc):
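A sketch of such a query, assuming hypothetical tables `minhash_a` and `minhash_c` that each hold one MinHash state for ta and tc in a column named `mh`, both built with the same k.

```sql
-- Compare the state for ta with the state for tc.
SELECT APPROXIMATE_SIMILARITY(mh) FROM
  (
    (SELECT mh FROM minhash_a)
    UNION ALL
    (SELECT mh FROM minhash_c)
  );
```

If the two tables have no values in common, the estimate should be at or near 0.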