
Example of High Memory Usage of MultiIndex  #13904

Closed

Description

@PeterKucirek

This is a dupe of #1752, in which @jreback recommended a new issue be opened. I have a (hopefully) reproducible example which shows how using a MultiIndex can explode RAM usage.

At the top I define a function that reports the resident memory of the current Python process. I call it as I create each object, so the deltas between calls show the "true" cost of each one.

import numpy as np
import pandas as pd
import os
import psutil
import numba as nb

def print_mem():
    # Resident set size (RSS) of the current process, in MB
    proc = psutil.Process(os.getpid())
    print proc.memory_info().rss / float(2**20), "MB"
print_mem()
>> 74.20703125 MB

I need some functions to construct a plausible MultiIndex which matches roughly the actual data I'm using.

@nb.jit(nopython=True, nogil=True)
def nbf_number_new_level(repetitions):
    # For each parent row j, emit the run 0, 1, ..., repetitions[j] - 1.
    # Concatenated, these runs become the values of the new innermost level.
    assert len(repetitions.shape) == 1

    n = repetitions.sum()
    ret = np.zeros(shape=n, dtype=np.uint64)

    i = 0
    for j in range(len(repetitions)):
        for k in range(repetitions[j]):
            ret[i] = k
            i += 1
    return ret
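
To make the helper concrete, here is a tiny illustrative call (my own example, not part of the benchmark): two parents with 2 and 3 children yield a 0-based counter per parent.

print nbf_number_new_level(np.array([2, 3], dtype=np.int64))
>>[0 1 0 1 2]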

def expand_index(idx, repetitions, new_name=None):
    assert len(idx) == len(repetitions), "Index and repetitions must be the same length"
    repetitions = np.array(repetitions)
    assert np.all(repetitions > 0), "All repetitions must be > 0"

    names = list(idx.names) if isinstance(idx, pd.MultiIndex) else [idx.name]
    names.append(new_name)

    new_level_value = nbf_number_new_level(repetitions)

    # Repeat each existing level's values per the repetition counts,
    # then append the new innermost level generated above
    if isinstance(idx, pd.MultiIndex):
        value_arrays = [idx.get_level_values(i).repeat(repetitions) for i in range(idx.nlevels)]
    else:
        value_arrays = [idx.values.repeat(repetitions)]
    value_arrays.append(new_level_value)

    ret = pd.MultiIndex.from_arrays(value_arrays)
    ret.names = names
    return ret
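
And a tiny illustrative run of expand_index (again my own example, with made-up values):

small = pd.Index([10, 20], name='hhid')
print list(expand_index(small, [2, 1], new_name='pid'))
>>[(10, 0), (10, 1), (20, 0)]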

Constructing the actual MultiIndex:

randomizer = np.random.RandomState(12345)

nhouseholds = 200000
hhids = pd.Index(np.arange(nhouseholds), name='hhid')
print nhouseholds, "households"
print_mem()
>>200000 households
>>83.95703125 MB

persons_per_hh = randomizer.randint(1, 9, size=nhouseholds)
person_ids = expand_index(hhids, persons_per_hh, new_name='pid')
print len(person_ids), "persons"
print_mem()
>>898254 persons
>>105.6953125 MB

tours_per_person = randomizer.choice([1, 2, 3], size=len(person_ids), replace=True, p=[0.85, 0.1, 0.05])
tour_ids = expand_index(person_ids, tours_per_person, new_name='tour_id')
print len(tour_ids), "tours"
print_mem()
>>1078155 tours
>>121.734375 MB

trips_per_tour = randomizer.choice([2,3,4,5], size=len(tour_ids), replace=True, p=[0.6, 0.25, 0.1, 0.05])
trip_ids = expand_index(tour_ids, trips_per_tour, new_name='trip_id')
print len(trip_ids), "trips"
print_mem()
>>2803731 trips
>>150.17578125 MB

trips = pd.DataFrame(index=trip_ids)
trips['origin'] = randomizer.randint(101, 200, size=len(trips))
trips['destination'] = randomizer.randint(101, 200, size=len(trips))
print_mem()
>>171.70703125 MB

Based on this, my fake trip table takes up about 171.7 - 121.7 = 50 MB of RAM. I haven't done any fancy indexing, just initialized the table. But when I call trips.info() (which appears to instantiate the array of PyTuples), this happens:

print trips.info()
print_mem()
>><class 'pandas.core.frame.DataFrame'>
MultiIndex: 2803731 entries, (0, 0, 0, 0) to (199999, 0, 0, 1)
Data columns (total 2 columns):
origin         int32
destination    int32
dtypes: int32(2)
memory usage: 42.8+ MB
None
723.64453125 MB

My process's memory usage balloons to 723 MB! Doing the math, the cached indexer takes up 723.6 - 171.7 = 551 MB, roughly a tenfold increase over the actual DataFrame!
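
Two sanity checks on that number. First, info() doesn't seem to be special: as far as I can tell, merely touching trips.index.values triggers the same jump, because MultiIndex.values builds one PyTuple per row and caches the resulting array on the index (this is my reading of the internals, so treat it as an assumption). In a fresh session:

print_mem()
_ = trips.index.values  # materializes ~2.8M PyTuples, which appear to stay cached
print_mem()

Second, the size is plausible from first principles. Per row I'd expect one 4-tuple (~80 bytes on 64-bit CPython), an 8-byte slot in the object array, and four boxed integer scalars at roughly 24-32 bytes each, i.e. on the order of 200 bytes per row:

print round(2803731 * 200 / float(2**20), 1), "MB"
>>534.8 MB

which is in the right ballpark of the observed 551 MB.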

For this fake dataset, this is not so much of a problem, but my production data is 20x this size, and I soak up 27 GB of RAM when I so much as look at my trips table.

Any performance tips would be appreciated; the only thing I can think of right now is "don't use a MultiIndex" (see the sketch below). It sounds like a 'proper' fix for this lies deep in the indexing internals, which is far above my pay grade, but I wanted to at least log this issue with the community.
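
Concretely, the workaround I'm considering (a sketch under the assumption that only some operations need the hierarchy, not a fix) is to keep the keys as ordinary integer columns and only build an index transiently where it's required:

# Keep hhid/pid/tour_id/trip_id as plain integer columns
flat = trips.reset_index()

# Grouped operations work on columns directly, never touching MultiIndex internals
trip_counts = flat.groupby(['hhid', 'pid']).size()

# Build a (flat, cheaper) index only transiently when label-based selection
# is needed, and let it go out of scope so any caches can be freed
by_hh = flat.set_index('hhid')
one_household = by_hh.loc[12345]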

Output of pd.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: 1.3.4
pip: 8.1.1
setuptools: 0.6c11
Cython: 0.22
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 3.0.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.1.1
numexpr: 2.5.1
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.4
xlwt: 0.7.5
xlsxwriter: 0.6.7
lxml: 3.4.2
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 0.9.9
pymysql: None
psycopg2: None
jinja2: 2.7.3
boto: 2.36.0
pandas_datareader: None
