Let's say there are N items (with N in the billions) and we want to find all of those
that are similar to one another, with similarity defined by a distance
function. The goal is to output a similarity matrix -- notice that
this matrix is very sparse and most of the cells are infinite.

One naive way to go about this is to compute the similarity of every possible pair of
items -- hence a huge O(N^2) problem. Can we reduce the order
of complexity?

### Location Sensitive Hashing

**First idea:** Find a hashing function such that similar items (say
distance is less than some predefined threshold) will be hashed to the
same bucket.

Let's say if we pick the hash function such that Probability(H(a) ==
H(b)) is proportional to similarity between a and b. And then we only
perform detail comparison on items that falls into the same bucket.

Here is some R code that plots the relationship between similarity and the chance of performing a detail comparison:

x <- seq(0, 1, 0.01) y <- x plot(x, y, xlab="similarity", ylab="prob of detail compare")

Lets say we are interested in comparing all pairs of items whose
similarity is above 0.3, we have a problem here because we have
probability 0.7 = 1 - 0.3 of missing them (as they are not landing on
the same bucket). We want a mechanism that is highly selective;
probability of performing a detail comparison should be close to one
when similarity is above 0.3 and close to zero when similarity is below
0.3.

**Second idea:** Lets use 100 hash functions and 2 items that has 30
or more matches of such hash functions will be selected for detail
comparison.

Here is some R code that plots the relationship between similarity and the chance of performing a detail comparison.

# Probability of having more than "threshold" matches out # of "no_of_hash" with a range of varying similarities prob_select <- function(threshold, similarity, no_of_hash) { sum <- rep(0, length(similarity)) for (k in 0:floor(no_of_hash * threshold)) { sum <- sum + dbinom(k, no_of_hash, similarity) } return(1 - sum) } x <- seq(0, 1, 0.01) y <- prob_select(0.3, x, 100) plot(x, y, main="black: 100 hashes, Red: 1000 hashes", xlab="similarity", ylab="prob of detail compare") lines(x, y) y <- prob_select(0.3, x, 1000) lines(x, y, col="red")

The graph looks much better this time, the chance of being selected for
detail comparison jumps from zero to one sharply when the similarity
crosses 0.3

But look at the combination of 30 out of 100, it is 100!/(30! * 70!) = 2.93 * 10^25, which is impractically huge. Even the graph is a nice, we cannot use this mechanism in practice.

Third idea: Lets use 100 hash function and break them into b groups of r each (ie: b*r = 100). Further let assume b = 20, and r = 5. In other words, we have 20 groups and Group1 has hash1 to hash5, Group2 has hash6 to hash10 ... etc. Now, we call itemA's group1 matches itemB's group1 if all their hash1 to hash5 are equal to each other. Now, we'll perform a detail comparison of itemA and itemB if any of the groups are equal to each other.

Probability of being selected is 1 - (1-s^r)^b

The idea can be visualized as follows

Notice that in this model, finding r and b based on s is a bit trial and error. Here we try 20 by 5, 33 by 3, 10 by 10.

prob_select2 <- function(similarity, row_per_grp, no_of_grp) { return(1 - (1 - similarity^row_per_grp)^no_of_grp) } x <- seq(0, 1, 0.01) y <- prob_select2(x, 5, 20) plot(x, y, main="black:20 by 5, red:10 by 10, blue:33 by 3", xlab="similarity", ylab="prob of detail compare") lines(x, y) y <- prob_select2(x, 10, 10) lines(x, y, col="red") y <- prob_select2(x, 3, 33) lines(x, y, col="blue")

From the graph, we see the blue curve fits better to select the similarity at 0.3. So lets use 33 by 3.

To perform the detail comparison, we can use a parallel Map/Reduce implementation

### Map Reduce Implementation

Here we have two round of Map/Reduce. In the first round, map function will compute all the groupKeys for each item and emit the groupKey with the item. All the items that has the groupKey matches will land on the same reducer, which creates all the possible pairs of items (these are candidates for pairwise comparison).However, we don't want to perform the detail comparison in the first round as there may be many duplicates for item pairs that matches more than one group. Therefore we want to perform another round of Map/reduce to remove the duplicates.

The first round proceeds as follows ...

After that, the second round proceeds as follows ...

By combining Location Sensitive Hashing and Map/Reduce, we can perform large scale similarity calculation in an effective manner.