datasketch

datasketch gives you probabilistic data structures that can process very large amounts of data very fast, with a small loss of accuracy.

datasketch requires Python 2.7 or above.

Install

To install datasketch using pip:

pip install datasketch -U

MinHash

MinHash lets you estimate the Jaccard similarity between datasets of arbitrary sizes in linear time, using a small and fixed memory space. It can also be used to compute the Jaccard similarity between data streams. MinHash was first introduced by Andrei Z. Broder in this paper

from hashlib import sha1
from datasketch.minhash import MinHash, jaccard

data1 = ['minhash', 'is', 'a', 'probabilistic', 'data', 'structure', 'for',
        'estimating', 'the', 'similarity', 'between', 'datasets']
data2 = ['minhash', 'is', 'a', 'probability', 'data', 'structure', 'for',
        'estimating', 'the', 'similarity', 'between', 'documents']

m1, m2 = MinHash(), MinHash()
for d in data1:
	m1.digest(sha1(d.encode('utf8')))
for d in data2:
	m2.digest(sha1(d.encode('utf8')))
print("Estimated Jaccard for data1 and data2 is", jaccard([m1, m2]))

s1 = set(data1)
s2 = set(data2)
actual_jaccard = float(len(s1.intersection(s2)))/float(len(s1.union(s2)))
print("Actual Jaccard for data1 and data2 is", actual_jaccard)

You can adjust the accuracy by customizing the number of permutation functions used in MinHash.

# This will give better accuracy than the default setting (128).
m = MinHash(num_perm=256)
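
As a rough guide from general MinHash theory (not a datasketch-specific guarantee), the standard error of the Jaccard estimate shrinks on the order of 1/sqrt(num_perm), so quadrupling the number of permutation functions roughly halves the error. A minimal sketch:

import math

# Expected error is on the order of 1/sqrt(num_perm) (general MinHash
# theory, not a datasketch guarantee).
for num_perm in (128, 256, 512):
    print(num_perm, 'permutations -> error on the order of',
          1.0 / math.sqrt(num_perm))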

The trade-off for better accuracy is slower speed and higher memory usage, because using more permutation functions means 1) more CPU instructions for every hash digested and 2) more hash values to be stored. Both the speed and the memory usage of MinHash are linearly proportional to the number of permutation functions used.
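
You can observe the linear scaling with a small timing sketch; this assumes only the MinHash API shown above, and the absolute numbers will vary by machine:

import time
from hashlib import sha1
from datasketch.minhash import MinHash

for num_perm in (128, 256, 512):
    m = MinHash(num_perm=num_perm)
    start = time.time()
    # Digest 10000 hashes and time how long it takes; the elapsed time
    # should grow roughly linearly with num_perm.
    for i in range(10000):
        m.digest(sha1(str(i).encode('utf8')))
    print(num_perm, 'permutations:', time.time() - start, 'seconds')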

MinHash Benchmark

You can union two MinHash objects using the merge function. This makes MinHash useful in parallel MapReduce-style data analysis.

# This makes m1 the union of m2 and the original m1.
m1.merge(m2)
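
For example, here is a sketch of the MapReduce pattern using only the API shown above: build one MinHash per data partition (the "map" step, which could run on separate workers), then merge them into a single MinHash for the whole dataset (the "reduce" step). The partition contents are made up for illustration:

from hashlib import sha1
from datasketch.minhash import MinHash

# Hypothetical data partitions, e.g. one per worker.
partitions = [['a', 'b', 'c'], ['c', 'd'], ['d', 'e', 'f']]

def make_minhash(partition):
    m = MinHash()
    for d in partition:
        m.digest(sha1(d.encode('utf8')))
    return m

# "Map" step: one MinHash per partition.
minhashes = [make_minhash(p) for p in partitions]

# "Reduce" step: merge all partial MinHashes into one.
result = minhashes[0]
for m in minhashes[1:]:
    result.merge(m)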

You can serialize a MinHash object into a byte buffer, and reconstruct a MinHash from a byte buffer. You can also obtain the byte size of the MinHash object using the bytesize function.

m = MinHash()
buf = bytearray(m.bytesize())
m.serialize(buf)
n = MinHash.deserialize(buf)
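
Since buf above is plain bytes, it can, for example, be written to a file and read back later; a sketch that assumes only the serialize, deserialize and bytesize functions shown here:

# Persist the buffer from the previous example and reconstruct the
# MinHash from it later.
with open('minhash.bin', 'wb') as f:
    f.write(buf)
with open('minhash.bin', 'rb') as f:
    n = MinHash.deserialize(bytearray(f.read()))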

HyperLogLog

HyperLogLog is capable of estimating the cardinality (the number of distinct values) of a dataset in a single pass, using a small and fixed memory space. HyperLogLog was first introduced in this paper by Philippe Flajolet, Éric Fusy, Olivier Gandouet and Frédéric Meunier.

from hashlib import sha1
from datasketch.hyperloglog import HyperLogLog

data1 = ['hyperloglog', 'is', 'a', 'probabilistic', 'data', 'structure', 'for',
         'estimating', 'the', 'cardinality', 'of', 'dataset', 'dataset', 'a']

h = HyperLogLog()
for d in data1:
  h.digest(sha1(d.encode('utf8')))
print("Estimated cardinality is", h.count())

s1 = set(data1)
print("Actual cardinality is", len(s1))

As with MinHash, you can also control the accuracy of HyperLogLog by changing the parameter p.

# This will give better accuracy than the default setting (8).
h = HyperLogLog(p=12)

Interestingly, there is no speed penalty for using a higher p value. However, the memory usage grows exponentially with the p value.
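
The exponential growth comes from the fact that a HyperLogLog with parameter p keeps 2^p registers. A quick sketch to observe it, assuming only the bytesize function shown below:

from datasketch.hyperloglog import HyperLogLog

# Each increment of p doubles the number of registers (2^p), and
# therefore roughly doubles the serialized size.
for p in (4, 8, 12, 16):
    print('p =', p, '->', HyperLogLog(p=p).bytesize(), 'bytes')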

HyperLogLog Benchmark

As with MinHash, you can also merge two HyperLogLogs to create a union HyperLogLog.

h1 = HyperLogLog()
h2 = HyperLogLog()
h1.digest(sha1('test'.encode('utf8')))
# This makes h1 the union of h2 and the original h1.
h1.merge(h2)
# This will return the cardinality of the union
h1.count()

You can serialize a HyperLogLog object into a byte buffer, and reconstruct a HyperLogLog from a byte buffer. You can also obtain the byte size of the HyperLogLog object using the bytesize function.

h = HyperLogLog()
buf = bytearray(h.bytesize())
h.serialize(buf)
n = HyperLogLog.deserialize(buf)
