part 1: resemblance with the jaccard coefficient
part 2: fastmap projection using jaccard distances
part 3: the simhash algorithm
part 4: a sketching algorithm

why?

shingling gives great results but the O(n^2) runtime is poor

a set of 1e6 records would require 5e11 comparisons and even the cpp impl can "only" do 5e6 /sec

that's 2 months of runtime, 1.999 months too l