- Categories:
Aggregate functions (Similarity Estimation), Window function syntax and usage
APPROXIMATE_SIMILARITY
Returns an estimation of the similarity (Jaccard index) of inputs based on their MinHash states. For more information about MinHash states, see Estimating Similarity of Two or More Sets.
- Aliases:
APPROXIMATE_JACCARD_INDEX
- See also:
MINHASH, MINHASH_COMBINE
Syntax
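Based on the arguments and usage notes described in this section, a call takes roughly the following form. This is a sketch of the general shape: `<expr>` stands for a MinHash state, and the optional DISTINCT keyword is accepted but has no effect.

```sql
APPROXIMATE_SIMILARITY( [ DISTINCT ] <expr> [ , ... ] )
```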
Arguments
expr
The expression(s) should be one or more MinHash states returned by calls to the MINHASH function. In other words, the expressions must be MinHash state information, not the column or expression for which you want the approximate similarity. (The example below helps make this clear.)
For more information about MinHash states, see Estimating Similarity of Two or More Sets.
Returns
A floating point number between 0.0 and 1.0 (inclusive), where 1.0 indicates that the sets are identical, and 0.0 indicates that the sets have no overlap.
Usage notes
- DISTINCT can be included as an argument, but has no effect.
- The input MinHash states must have MinHash arrays of equal length.
- The array length of the input MinHash states is an indicator of the quality of approximation. The larger the value of k used in the MINHASH function, the better the approximation. However, this value has a linear impact on the computation time for estimating similarity.
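As an illustration of the equal-length requirement, the following sketch builds both MinHash states with the same k so that they can be compared. The tables ta and tb and the integer column i are assumptions made for this example, and the value 200 is an arbitrary choice of k.

```sql
-- Both states use k = 200, so their MinHash arrays have equal length.
-- Table names (ta, tb) and column name (i) are assumed for illustration.
SELECT APPROXIMATE_SIMILARITY(mh) FROM
  (
    (SELECT MINHASH(200, i) AS mh FROM ta)
    UNION ALL
    (SELECT MINHASH(200, i) AS mh FROM tb)
  );
```

Raising k in both MINHASH calls should tighten the estimate at the price of proportionally more work; mixing different values of k across the two calls would violate the equal-length requirement.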
Examples
The following example shows the three related functions
MINHASH, MINHASH_COMBINE, and APPROXIMATE_SIMILARITY. It
creates three tables (ta, tb, and tc), two of which (ta and tb) are
similar, and two of which (ta and tc) are completely dissimilar.
Create and populate tables with values:
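A minimal sketch of this step, assuming each table has a single integer column named i. The column name and the specific values are illustrative choices: ta and tb share most of their values, while tc shares none with ta.

```sql
CREATE TABLE ta (i INTEGER);
CREATE TABLE tb (i INTEGER);
CREATE TABLE tc (i INTEGER);

-- ta and tb differ only in their last value.
INSERT INTO ta (i) VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
INSERT INTO tb (i) VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (11);
-- tc has different values and a different number of values.
INSERT INTO tc (i) VALUES (-1), (-20), (-300), (-4000);
```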
Calculate the MinHash state for the initial set of data:
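One way to do this is to store each table's MinHash state in its own table. In the sketch below, the table names minhash_a_1, minhash_b, and minhash_c, the column name mh, the source column i, and k = 100 are all assumptions made for illustration.

```sql
-- Use the same k for every state so the MinHash arrays have equal length.
CREATE TABLE minhash_a_1 (mh) AS SELECT MINHASH(100, i) FROM ta;
CREATE TABLE minhash_b (mh) AS SELECT MINHASH(100, i) FROM tb;
CREATE TABLE minhash_c (mh) AS SELECT MINHASH(100, i) FROM tc;
```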
Add more data to one of the tables:
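For instance, assuming ta has a single integer column named i, one new row could be added as follows (the value 12 is an arbitrary illustrative choice):

```sql
INSERT INTO ta (i) VALUES (12);
```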
Demonstrate the MINHASH_COMBINE function:
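The sketch below computes a MinHash state for the newly added rows only and then merges it with the previously stored state, which avoids rescanning the original data. It assumes the earlier state for ta was saved in a table named minhash_a_1 with a column mh, that the new rows can be selected with i > 10, and that k = 100 was used before; all of these names and values are hypothetical.

```sql
-- State for the new rows only (same k as the stored state).
CREATE TABLE minhash_a_2 (mh) AS SELECT MINHASH(100, i) FROM ta WHERE i > 10;

-- Combine the old and new states into a single state for all of ta.
CREATE TABLE minhash_a (mh) AS
  SELECT MINHASH_COMBINE(mh) FROM
    (
      (SELECT mh FROM minhash_a_1)
      UNION ALL
      (SELECT mh FROM minhash_a_2)
    );
```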
This query shows the approximate similarity of the two similar tables
(ta and tb):
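A query along these lines passes the two stored states to APPROXIMATE_SIMILARITY. It assumes the states for ta and tb are kept in tables named minhash_a and minhash_b, each with a column mh (hypothetical names).

```sql
SELECT APPROXIMATE_SIMILARITY(mh) FROM
  (
    (SELECT mh FROM minhash_a)
    UNION ALL
    (SELECT mh FROM minhash_b)
  );
```

Because the two tables share most of their values, the result should be well above 0.0; the exact figure depends on the data and on k.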
This query shows the approximate similarity of the two very different tables
(ta and tc):
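The same pattern applies here, again assuming the states are stored in hypothetical tables minhash_a and minhash_c with a column mh.

```sql
SELECT APPROXIMATE_SIMILARITY(mh) FROM
  (
    (SELECT mh FROM minhash_a)
    UNION ALL
    (SELECT mh FROM minhash_c)
  );
```

Since the two tables have no values in common, the estimate should be at or near 0.0.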