Description
Hi,
I'm working with two colleagues (@FrancescElies & @javiBi4) on seeing if we can add groupby to bcolz, what we did first was to see if we can make Pandas' excellent klib implementation more library independent. See https://github.com/CarstVaartjes/khash
A while back khash (https://github.com/attractivechaos/klib) was updated to 0.2.8:
2013-05-02 (0.2.8):
* Use quadratic probing. When the capacity is power of 2, stepping function
i*(i+1)/2 guarantees to traverse each bucket. It is better than double
hashing on cache performance and is more robust than linear probing.
In theory, double hashing should be more robust than quadratic probing.
However, my implementation is probably not for large hash tables, because
the second hash function is closely tied to the first hash function,
which reduce the effectiveness of double hashing.
Reference: http://research.cs.vt.edu/AVresearch/hashing/quadratic.php
We updated the klib to it (which was something of a headache to be honest), but the quadratic probing seems to be quite faster which might make it interesting for Pandas to retrofit this into the general Pandas master. See the example:
import numpy as np
import random
import pandas as pd
from khash.hashtable import Int64HashTable, StringHashTable
def new_klib_int(input_array):
ht = Int64HashTable(len(input_array))
return ht.get_labels_groupby(input_array)
def new_klib_str(input_array):
ht = StringHashTable(len(input_array))
return ht.factorize(input_array)
def new_klib_float(input_array):
ht = Float64HashTable(len(input_array))
return ht.factorize(input_array)
a = np.random.randint(100, size=10000000)
b = np.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in xrange(10000000)), dtype='S2')
b = pd.Series(b).values
c = np.random.random_sample(10000000) * 10000
%timeit pd.factorize(a)
%timeit new_klib_int(a)
%timeit pd.factorize(b)
%timeit new_klib_str(b)
%timeit pd.factorize(c)
%timeit new_klib_float(c)
In [20]: %timeit pd.factorize(a)
10 loops, best of 3: 122 ms per loop
In [21]: %timeit new_klib_int(a)
10 loops, best of 3: 101 ms per loop
In [22]: %timeit pd.factorize(b)
1 loops, best of 3: 496 ms per loop
In [23]: %timeit new_klib_str(b)
10 loops, best of 3: 165 ms per loop
In [36]: %timeit pd.factorize(c)
1 loops, best of 3: 1.61 s per loop
In [37]: %timeit new_klib_float(c)
1 loops, best of 3: 1.44 s per loop
So a 20%ish improvement for int64 and a 65% increase for strings in this example. Just thought to give you a heads up about this, as you might have missed 0.2.8 (and it's a bit of a pain to adjust everything so this might help to make Pandas even faster)
Edit: added float64 example too (10-15% improvement)