Catégories :: Fonctions d’agrégation (estimation de la similarité), fonctions de fenêtre (estimation de la similarité)

MINHASH¶

Renvoie un état MinHash contenant un tableau de taille k construit en appliquant un nombre k de fonctions de hachage différentes aux lignes d’entrée et en maintenant chaque fonction de hachage au seuil minimum. Cet état MinHash peut ensuite être entré dans la fonction APPROXIMATE_SIMILARITY pour estimer la similarité avec un ou plusieurs autres états MinHash.

Pour plus d’informations sur les états MinHash, voir Estimation de la similarité de deux ensembles ou plus.

Voir aussi :: MINHASH_COMBINE

Syntaxe¶

Fonction d’agrégation

MINHASH( <k> , [ DISTINCT ] expr+ )

MINHASH( <k> , * )

Fonction de fenêtre

MINHASH( <k> , [ DISTINCT ] expr+ ) OVER ( [ PARTITION BY <expr1> ] )

MINHASH( <k> , * ) OVER ( [ PARTITION BY <expr1> ] )

Pour plus d’informations sur la clause OVER, consultez Syntaxe et utilisation des fonctions de fenêtre.

Arguments¶

k: Nombre de fonctions de hachage à créer. Plus la valeur est grande, meilleure est l’approximation ; Cependant, cette valeur a un impact linéaire sur le temps de calcul pour l’estimation de la similarité à l’aide de APPROXIMATE_SIMILARITY. La valeur suggérée est 100. La valeur maximale est 1 024.
expr: Une ou plusieurs expressions (généralement des noms de colonnes) qui déterminent les valeurs à hacher.
*: Hachez toutes les colonnes dans les lignes d’entrée.

Notes sur l’utilisation¶

Cette fonction peut être utilisée comme fonction d’agrégation ou comme fonction de fenêtre.
DISTINCT peut être inclus comme argument, mais n’a aucun effet.

Exemples¶

USE SCHEMA snowflake_sample_data.tpch_sf1;

SELECT MINHASH(5, *) FROM orders;

+----------------------+
| MINHASH(5, *)        |
|----------------------|
| {                    |
|   "state": [         |
|     78678383574307,  |
|     586952033158539, |
|     525995912623966, |
|     508991839383217, |
|     492677003405678  |
|   ],                 |
|   "type": "minhash", |
|   "version": 1       |
| }                    |
+----------------------+

Here is a more extensive example, showing the three related functions MINHASH, MINHASH_COMBINE and APPROXIMATE_SIMILARITY. This example creates 3 tables (ta, tb, and tc), two of which (ta and tb) are similar, and two of which (ta and tc) are completely dissimilar.

Créer et remplir des tables avec des valeurs :

CREATE TABLE ta (i INTEGER);
CREATE TABLE tb (i INTEGER);
CREATE TABLE tc (i INTEGER);

INSERT INTO ta (i) VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
INSERT INTO tb (i) VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (11);
INSERT INTO tc (i) VALUES (-1), (-20), (-300), (-4000);

Calculer les informations minhash pour l’ensemble initial de données :

CREATE TABLE minhash_a_1 (mh) AS SELECT MINHASH(100, i) FROM ta;
CREATE TABLE minhash_b (mh) AS SELECT MINHASH(100, i) FROM tb;
CREATE TABLE minhash_c (mh) AS SELECT MINHASH(100, i) FROM tc;

Ajouter plus de données à l’une des tables :

INSERT INTO ta (i) VALUES (12);

Demonstrate the MINHASH_COMBINE function:

CREATE TABLE minhash_a_2 (mh) AS SELECT MINHASH(100, i) FROM ta WHERE i > 10;

CREATE TABLE minhash_a (mh) AS
  SELECT MINHASH_COMBINE(mh)
    FROM (
      (SELECT mh FROM minhash_a_1)
      UNION ALL
      (SELECT mh FROM minhash_a_2)
    );

This query shows the approximate similarity of the two similar tables (ta and tb):

SELECT APPROXIMATE_SIMILARITY(mh)
  FROM (
    (SELECT mh FROM minhash_a)
    UNION ALL
    (SELECT mh FROM minhash_b)
  );

+-----------------------------+
| APPROXIMATE_SIMILARITY (MH) |
|-----------------------------|
|                        0.75 |
+-----------------------------+

This query shows the approximate similarity of the two very different tables (ta and tc):

SELECT APPROXIMATE_SIMILARITY(mh)
  FROM (
    (SELECT mh FROM minhash_a)
    UNION ALL
    (SELECT mh FROM minhash_c)
  );

+-----------------------------+
| APPROXIMATE_SIMILARITY (MH) |
|-----------------------------|
|                           0 |
+-----------------------------+