Skip to content

Klib upgrade (factorizing performance increase) #8524

Open
@CarstVaartjes

Description

@CarstVaartjes

Hi,

I'm working with two colleagues (@FrancescElies & @javiBi4) on seeing if we can add groupby to bcolz, what we did first was to see if we can make Pandas' excellent klib implementation more library independent. See https://github.com/CarstVaartjes/khash
A while back khash (https://github.com/attractivechaos/klib) was updated to 0.2.8:

2013-05-02 (0.2.8):

* Use quadratic probing. When the capacity is power of 2, stepping function
  i*(i+1)/2 guarantees to traverse each bucket. It is better than double
  hashing on cache performance and is more robust than linear probing.

  In theory, double hashing should be more robust than quadratic probing.
  However, my implementation is probably not for large hash tables, because
  the second hash function is closely tied to the first hash function,
  which reduce the effectiveness of double hashing.

Reference: http://research.cs.vt.edu/AVresearch/hashing/quadratic.php

We updated the klib to it (which was something of a headache to be honest), but the quadratic probing seems to be quite faster which might make it interesting for Pandas to retrofit this into the general Pandas master. See the example:

import numpy as np
import random
import pandas as pd
from khash.hashtable import Int64HashTable, StringHashTable

def new_klib_int(input_array):
    ht = Int64HashTable(len(input_array))
    return ht.get_labels_groupby(input_array)

def new_klib_str(input_array):
    ht = StringHashTable(len(input_array))
    return ht.factorize(input_array)

def new_klib_float(input_array):
    ht = Float64HashTable(len(input_array))
    return ht.factorize(input_array)

a = np.random.randint(100, size=10000000)
b = np.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in xrange(10000000)), dtype='S2')
b = pd.Series(b).values
c = np.random.random_sample(10000000) * 10000

%timeit pd.factorize(a)
%timeit new_klib_int(a)
%timeit pd.factorize(b)
%timeit new_klib_str(b)
%timeit pd.factorize(c)
%timeit new_klib_float(c)

In [20]: %timeit pd.factorize(a)
10 loops, best of 3: 122 ms per loop

In [21]: %timeit new_klib_int(a)
10 loops, best of 3: 101 ms per loop

In [22]: %timeit pd.factorize(b)
1 loops, best of 3: 496 ms per loop

In [23]: %timeit new_klib_str(b)
10 loops, best of 3: 165 ms per loop

In [36]: %timeit pd.factorize(c)
1 loops, best of 3: 1.61 s per loop

In [37]: %timeit new_klib_float(c)
1 loops, best of 3: 1.44 s per loop

So a 20%ish improvement for int64 and a 65% increase for strings in this example. Just thought to give you a heads up about this, as you might have missed 0.2.8 (and it's a bit of a pain to adjust everything so this might help to make Pandas even faster)

Edit: added float64 example too (10-15% improvement)

Metadata

Metadata

Assignees

Labels

PerformanceMemory or execution speed performance

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions