Categories:

Aggregate Functions (Similarity Estimation) , Window Functions

# MINHASH_COMBINEÂ¶

Combines input MinHash states into a single MinHash output state. This Minhash state can then be input to the APPROXIMATE_SIMILARITY function to estimate the similarity with other MinHash states.

This allows use cases in which MINHASH is run over horizontal rowsets of the same table, producing a MinHash state for each rowset. These states can then be combined using MINHASH_COMBINE, producing the same output state as a single run of MINHASH over the entire table.

MINHASH

## SyntaxÂ¶

```MINHASH_COMBINE( [ DISTINCT ] <state> )
```

## Usage notesÂ¶

• DISTINCT can be included as an argument, but has no effect.

• Input MinHash `state` must have MinHash arrays of equal length.

## ExamplesÂ¶

```USE SCHEMA snowflake_sample_data.tpch_sf1;

SELECT MINHASH_COMBINE(mh) FROM
(
(SELECT MINHASH(5, c2) mh FROM orders WHERE c2 <= 10000)
UNION
(SELECT MINHASH(5, c2) mh FROM orders WHERE c2 > 10000 AND c2 <= 20000)
UNION
(SELECT MINHASH(5, C2) mh FROM orders WHERE c2 > 20000)
);

+-----------------------+
| MINHASH_COMBINE(MH)   |
|-----------------------|
| {                     |
|   "state": [          |
|     628914288006793,  |
|     1071764954434168, |
|     991489123966035,  |
|     2395105834644106, |
|     680224867834949   |
|   ],                  |
|   "type": "minhash",  |
|   "version": 1        |
| }                     |
+-----------------------+
```

Here is a more extensive example, showing the three related functions `MINHASH`, `MINHASH_COMBINE` and `APPROXIMATE_SIMILARITY`. This example creates 3 tables (ta, tb, and tc), two of which (ta and tb) are similar, and two of which (ta and tc) are completely dissimilar.

Create and populate tables with values:

```CREATE TABLE ta (i INTEGER);
CREATE TABLE tb (i INTEGER);
CREATE TABLE tc (i INTEGER);

-- Insert values into the 3 tables.
INSERT INTO ta (i) VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
-- Almost the same as the preceding values.
INSERT INTO tb (i) VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (11);
-- Different values and different number of values.
INSERT INTO tc (i) VALUES (-1), (-20), (-300), (-4000);
```

Calculate minhash info for the initial set of data:

```CREATE TABLE minhash_a_1 (mh) AS SELECT MINHASH(100, i) FROM ta;
CREATE TABLE minhash_b (mh) AS SELECT MINHASH(100, i) FROM tb;
CREATE TABLE minhash_c (mh) AS SELECT MINHASH(100, i) FROM tc;
```

Add more data to one of the tables:

```INSERT INTO ta (i) VALUES (12);
```

Demonstrate the `MINHASH_COMBINE` function:

```-- Record minhash information about only the new rows:
CREATE TABLE minhash_a_2 (mh) AS SELECT MINHASH(100, i) FROM ta WHERE i > 10;

-- Now combine all the minhash info for the old and new rows in table ta.
CREATE TABLE minhash_a (mh) AS
SELECT MINHASH_COMBINE(mh) FROM
(
(SELECT mh FROM minhash_a_1)
UNION ALL
(SELECT mh FROM minhash_a_2)
);
```

This query shows the approximate similarity of the two similar tables (ta and tb):

```SELECT APPROXIMATE_SIMILARITY (mh) FROM
(
(SELECT mh FROM minhash_a)
UNION ALL
(SELECT mh FROM minhash_b)
);
+-----------------------------+
| APPROXIMATE_SIMILARITY (MH) |
|-----------------------------|
|                        0.75 |
+-----------------------------+
```

This query shows the approximate similarity of the two very different tables (ta and tc):

```SELECT APPROXIMATE_SIMILARITY (mh) FROM
(
(SELECT mh FROM minhash_a)
UNION ALL
(SELECT mh FROM minhash_c)
);
+-----------------------------+
| APPROXIMATE_SIMILARITY (MH) |
|-----------------------------|
|                           0 |
+-----------------------------+
```
Language: English