Categories:
Aggregate Functions (Similarity Estimation) , Window Functions

APPROXIMATE_SIMILARITY

Returns an estimation of the similarity (Jaccard index) of inputs based on their MinHash states. For more information about MinHash states, see Estimating Similarity of Two or More Sets.

Aliases:
APPROXIMATE_JACCARD_INDEX
See also:
MINHASH , MINHASH_COMBINE

Syntax

APPROXIMATE_SIMILARITY( [ DISTINCT ] <expr> [ , ... ] )

APPROXIMATE_SIMILARITY(*)

Arguments

expr

The expression(s) should be one or more MinHash states returned by calls to the MINHASH function. In other words, the expressions must be MinHash state information, not the column or expression for which you want the approximate similarity. (The example below helps make this clear.)

For more information about MinHash states, see Estimating Similarity of Two or More Sets.

Returns

A floating point number between 0.0 and 1.0 (inclusive), where 1.0 indicates that the sets are identical, and 0.0 indicates that the sets have no overlap.

Usage Notes

  • DISTINCT can be included as an argument, but has no effect.

  • The input MinHash states must have MinHash arrays of equal length.

  • The array length of the input MinHash states is an indicator of the quality of approximation.

    The larger the value of k used in function MINHASH, the better the approximation. However, this value has a linear impact on the computation time for estimating similarity.

Examples

USE SCHEMA snowflake_sample_data.tpch_sf1;

SELECT APPROXIMATE_SIMILARITY(mh) FROM
    (
      (SELECT MINHASH(100, C5) mh FROM orders WHERE c2 <= 50000)
         UNION
      (SELECT MINHASH(100, C5) mh FROM orders WHERE C2 > 50000)
    );

+----------------------------+
| APPROXIMATE_SIMILARITY(MH) |
|----------------------------|
|                       0.97 |
+----------------------------+

Here is a more extensive example, showing the three related functions MINHASH, MINHASH_COMBINE and APPROXIMATE_SIMILARITY. This example creates 3 tables (ta, tb, and tc), two of which (ta and tb) are similar, and two of which (ta and tc) are completely dissimilar.

Create and populate tables with values:

CREATE TABLE ta (i INTEGER);
CREATE TABLE tb (i INTEGER);
CREATE TABLE tc (i INTEGER);

-- Insert values into the 3 tables.
INSERT INTO ta (i) VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
-- Almost the same as the preceding values.
INSERT INTO tb (i) VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (11);
-- Different values and different number of values.
INSERT INTO tc (i) VALUES (-1), (-20), (-300), (-4000);

Calculate minhash info for the initial set of data:

CREATE TABLE minhash_a_1 (mh) AS SELECT MINHASH(100, i) FROM ta;
CREATE TABLE minhash_b (mh) AS SELECT MINHASH(100, i) FROM tb;
CREATE TABLE minhash_c (mh) AS SELECT MINHASH(100, i) FROM tc;

Add more data to one of the tables:

INSERT INTO ta (i) VALUES (12);

Demonstrate the MINHASH_COMBINE function:

-- Record minhash information about only the new rows:
CREATE TABLE minhash_a_2 (mh) AS SELECT MINHASH(100, i) FROM ta WHERE i > 10;

-- Now combine all the minhash info for the old and new rows in table ta.
CREATE TABLE minhash_a (mh) AS
  SELECT MINHASH_COMBINE(mh) FROM
    (
      (SELECT mh FROM minhash_a_1)
      UNION ALL
      (SELECT mh FROM minhash_a_2)
    );

This query shows the approximate similarity of the two similar tables (ta and tb):

SELECT APPROXIMATE_SIMILARITY (mh) FROM
  (
    (SELECT mh FROM minhash_a)
    UNION ALL
    (SELECT mh FROM minhash_b)
  );
+-----------------------------+
| APPROXIMATE_SIMILARITY (MH) |
|-----------------------------|
|                        0.75 |
+-----------------------------+

This query shows the approximate similarity of the two very different tables (ta and tc):

SELECT APPROXIMATE_SIMILARITY (mh) FROM
  (
    (SELECT mh FROM minhash_a)
    UNION ALL
    (SELECT mh FROM minhash_c)
  );
+-----------------------------+
| APPROXIMATE_SIMILARITY (MH) |
|-----------------------------|
|                           0 |
+-----------------------------+