Categories:

Aggregate Functions (Similarity Estimation) , Window Functions

# MINHASH¶

Returns a MinHash state containing an array of size k constructed by applying k number of different hash functions to the input rows and keeping the minimum of each hash function. This MinHash state can then be input to the APPROXIMATE_SIMILARITY function to estimate the similarity with one or more other MinHash states.

MINHASH_COMBINE

## Syntax¶

MINHASH( <k> , [ DISTINCT ] expr+ )

MINHASH( <k> , * )


## Usage Notes¶

• k specifies the number of hash functions to be created. The larger the value, the better the approximation; however, this value has a linear impact on the computation time for estimating similarity using APPROXIMATE_SIMILARITY. The suggested value is 100.

The maximum value is 1024.

• DISTINCT can be included as an argument, but has no effect.

## Examples¶

USE SCHEMA snowflake_sample_data.tpch_sf1;

SELECT MINHASH(5, *) FROM orders;

+----------------------+
| MINHASH(5, *)        |
|----------------------|
| {                    |
|   "state": [         |
|     78678383574307,  |
|     586952033158539, |
|     525995912623966, |
|     508991839383217, |
|     492677003405678  |
|   ],                 |
|   "type": "minhash", |
|   "version": 1       |
| }                    |
+----------------------+


Here is a more extensive example, showing the three related functions MINHASH, MINHASH_COMBINE and APPROXIMATE_SIMILARITY. This example creates 3 tables (ta, tb, and tc), two of which (ta and tb) are similar, and two of which (ta and tc) are completely dissimilar.

Create and populate tables with values:

CREATE TABLE ta (i INTEGER);
CREATE TABLE tb (i INTEGER);
CREATE TABLE tc (i INTEGER);

-- Insert values into the 3 tables.
INSERT INTO ta (i) VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
-- Almost the same as the preceding values.
INSERT INTO tb (i) VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (11);
-- Different values and different number of values.
INSERT INTO tc (i) VALUES (-1), (-20), (-300), (-4000);


Calculate minhash info for the initial set of data:

CREATE TABLE minhash_a_1 (mh) AS SELECT MINHASH(100, i) FROM ta;
CREATE TABLE minhash_b (mh) AS SELECT MINHASH(100, i) FROM tb;
CREATE TABLE minhash_c (mh) AS SELECT MINHASH(100, i) FROM tc;


Add more data to one of the tables:

INSERT INTO ta (i) VALUES (12);


Demonstrate the MINHASH_COMBINE function:

-- Record minhash information about only the new rows:
CREATE TABLE minhash_a_2 (mh) AS SELECT MINHASH(100, i) FROM ta WHERE i > 10;

-- Now combine all the minhash info for the old and new rows in table ta.
CREATE TABLE minhash_a (mh) AS
SELECT MINHASH_COMBINE(mh) FROM
(
(SELECT mh FROM minhash_a_1)
UNION ALL
(SELECT mh FROM minhash_a_2)
);


This query shows the approximate similarity of the two similar tables (ta and tb):

SELECT APPROXIMATE_SIMILARITY (mh) FROM
(
(SELECT mh FROM minhash_a)
UNION ALL
(SELECT mh FROM minhash_b)
);
+-----------------------------+
| APPROXIMATE_SIMILARITY (MH) |
|-----------------------------|
|                        0.75 |
+-----------------------------+


This query shows the approximate similarity of the two very different tables (ta and tc):

SELECT APPROXIMATE_SIMILARITY (mh) FROM
(
(SELECT mh FROM minhash_a)
UNION ALL
(SELECT mh FROM minhash_c)
);
+-----------------------------+
| APPROXIMATE_SIMILARITY (MH) |
|-----------------------------|
|                           0 |
+-----------------------------+