# Estimating Approximate Similarity of Two or More Sets Using MinHash¶

MinHash is a scheme for estimating the approximate similarity between two or more sets. The MinHash scheme compares sets without computing the intersection or union of the sets, which enables efficient and effective estimation of similarity.

**In this Topic:**

## Overview¶

Typically, the Jaccard similarity coefficient (or index) is used to compare the similarity between two sets. For two sets, `A`

and `B`

, the Jaccard index is defined to be the ratio of the size of their
intersection and the size of their union:

`J(A,B) = (A ∩ B) / (A ∪ B)`

However, this calculation can consume significant resources and time and, therefore, is not ideal for large datasets. The goal of the MinHash scheme is to estimate `J(A,B)`

quickly, without computing the
intersection or union.

## SQL Functions¶

The following Aggregate Functions are provided for using the MinHash scheme to estimate approximate similarity:

- MINHASH: Returns a MinHash state containing a MinHash array of length
*k*(input argument). - MINHASH_COMBINE: Combines two (or more) input MinHash states into a single output MinHash state.
- APPROXIMATE_SIMILARITY (or APPROXIMATE_JACCARD_INDEX): Returns an estimation of the similarity (Jaccard index) of input sets based on their MinHash states.

## Implementation Notes¶

As detailed in MinHash:

Let

`H`

be a hash function that maps the members of`A`

and`B`

to distinct integer values and, for any set`S`

, define`H_min(S)`

to be the minimal member of`S`

with respect to`H`

, i.e. the member`s`

of`S`

with the minimum value of`H(s)`

, as expressed in the following equation:`H_min(S) = argmin_{s \in S} (H(s))`

If we apply

`H_min`

to both`A`

and`B`

, we will get the same value exactly when the element of the union`A ∪ B`

with minimum hash value lies in the intersection`A ∩ B`

. The probability of this being true is the above ratio, therefore:`Pr[H_min(A) = H_min(B)] = J(A,B)`

Namely, assuming randomly chosen sets

`A`

and`B`

, the probability that`H_min(A) = H_min(B)`

holds is equal to`J(A,B)`

. In other words, if`X`

is the random variable that is 1 when`H_min(A) = H_min(B)`

and 0 otherwise, then`X`

is an unbiased estimator of`J(A,B)`

. Note that`X`

has a too large variance to be a good estimator for the Jaccard index on its own (since it is always 0 or 1).The MinHash scheme reduces this variance by averaging together several variables constructed in the same way using

`k`

number of different hash functions.

In order to achieve this, the MINHASH function initially creates `k`

number of different hash functions and applies them to every element of each input set, retaining the
minimum of each one, to produce a MinHash array (also called a MinHash *state*) for each set. More specifically, for `i = 0 to k-1`

, the entry `i`

of the MinHash array for set `A`

(shown by
`MinHash_A`

) corresponds to the minimum value of hash function `H_i`

applied to every element of set `A`

.

Finally, an approximation for the similarity of the two sets `A`

and `B`

is calculated as:

`J_apprx(A,B) = (# of entries MinHash_A and MinHash_B agree on) / k`

## Examples¶

In the following example, we show how this scheme and the corresponding functions can be used in order to approximate the similarity of two sets of elements.

First, create two sample tables and insert some sample data:

CREATE OR REPLACE TABLE mhtab1(c1 NUMBER,c2 DOUBLE,c3 TEXT,c4 DATE); CREATE OR REPLACE TABLE mhtab2(c1 NUMBER,c2 DOUBLE,c3 TEXT,c4 DATE); INSERT INTO mhtab1 VALUES (1, 1.1, 'item 1', to_date('2016-11-30')), (2, 2.31, 'item 2', to_date('2016-11-30')), (3, 1.1, 'item 3', to_date('2016-11-29')), (4, 44.4, 'item 4', to_date('2016-11-30')); INSERT INTO mhtab2 VALUES (1, 1.1, 'item 1', to_date('2016-11-30')), (2, 2.31, 'item 2', to_date('2016-11-30')), (3, 1.1, 'item 3', to_date('2016-11-29')), (4, 44.4, 'item 4', to_date('2016-11-30')), (6, 34.23, 'item 6', to_date('2016-11-29'));

Then, approximate the similarity of the two sets (tables `mhtab1`

and `mhtab2`

) using their MinHash states:

SELECT APPROXIMATE_SIMILARITY(mh) FROM ((SELECT MINHASH(100, *) AS mh FROM mhtab1) UNION ALL (SELECT MINHASH(100, *) AS mh FROM mhtab2)); +----------------------------+ | APPROXIMATE_SIMILARITY(MH) | |----------------------------| | 0.79 | +----------------------------+

The similarity index of these two tables is approximated as 0.79, as opposed to the exact value 0.8 (i.e., 4/5).