Description
This is a duplicate of #1752, in which @jreback recommended that a new issue be opened. I have a (hopefully) reproducible example which shows how using a MultiIndex can explode RAM usage.
At the top I define a function to get the memory of the current Python process. This will be called as I create arrays to get the "true" space of each object.
```python
import numpy as np
import pandas as pd
import os
import psutil
import numba as nb

def print_mem():
    proc = psutil.Process(os.getpid())
    print proc.memory_info()[0] / float(2**20), "MB"

print_mem()
>> 74.20703125 MB
```
I need some functions to construct a plausible MultiIndex which matches roughly the actual data I'm using.
```python
@nb.jit(nopython=True, nogil=True)
def nbf_number_new_level(repetitions):
    assert len(repetitions.shape) == 1
    n = repetitions.sum()
    ret = np.zeros(shape=n, dtype=np.uint64)
    i = 0
    for j in range(len(repetitions)):
        n_repetitions = repetitions[j]
        c = 0
        for k in range(n_repetitions):
            ret[i] = c
            c += 1
            i += 1
    return ret
```
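For readers without numba, a pure-NumPy equivalent of that helper can be sketched in one line (`number_new_level` is my illustrative name, not part of the original code):

```python
import numpy as np

def number_new_level(repetitions):
    # For each group j, emit the within-group counters 0 .. repetitions[j] - 1.
    return np.concatenate([np.arange(r) for r in repetitions])

# e.g. repetitions [2, 3] -> [0, 1, 0, 1, 2]
```

The numba version avoids allocating one small array per group, which matters at the multi-million-row scale used below.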
```python
def expand_index(idx, repetitions, new_name=None):
    assert len(idx) == len(repetitions), "Index and repetitions must be the same length"
    repetitions = np.array(repetitions)
    assert np.all(repetitions > 0), "All repetitions must be > 0"

    names = list(idx.names) if isinstance(idx, pd.MultiIndex) else [idx.name]
    names.append(new_name)

    new_level_value = nbf_number_new_level(repetitions)
    value_arrays = [idx.get_level_values(i).repeat(repetitions) for i in range(idx.nlevels)] \
        if isinstance(idx, pd.MultiIndex) else \
        [idx.values.repeat(repetitions)]
    value_arrays.append(new_level_value)

    ret = pd.MultiIndex.from_arrays(value_arrays)
    ret.names = names
    return ret
```
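A toy illustration of what `expand_index` produces, built directly with `MultiIndex.from_arrays` so it runs without the numba helper (values here are made up for the example):

```python
import numpy as np
import pandas as pd

idx = pd.Index([10, 20], name='hhid')
reps = np.array([2, 3])

# repeat each parent key by its repetition count, and number within each group
new_level = np.concatenate([np.arange(r) for r in reps])
mi = pd.MultiIndex.from_arrays([idx.values.repeat(reps), new_level],
                               names=['hhid', 'pid'])
# mi contains (10, 0), (10, 1), (20, 0), (20, 1), (20, 2)
```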
Constructing the actual MultiIndex:
```python
randomizer = np.random.RandomState(12345)
nhousehholds = 200000
hhids = pd.Index(np.arange(nhousehholds), name='hhid')
print nhousehholds, "households"
print_mem()
>> 200000 households
>> 83.95703125 MB
```
```python
persons_per_hh = randomizer.randint(1, 9, size=nhousehholds)
person_ids = expand_index(hhids, persons_per_hh, new_name='pid')
print len(person_ids), "persons"
print_mem()
>> 898254 persons
>> 105.6953125 MB
```
```python
tours_per_person = randomizer.choice([1, 2, 3], size=len(person_ids), replace=True, p=[0.85, 0.1, 0.05])
tour_ids = expand_index(person_ids, tours_per_person, new_name='tour_id')
print len(tour_ids), "tours"
print_mem()
>> 1078155 tours
>> 121.734375 MB
```
```python
trips_per_tour = randomizer.choice([2, 3, 4, 5], size=len(tour_ids), replace=True, p=[0.6, 0.25, 0.1, 0.05])
trip_ids = expand_index(tour_ids, trips_per_tour, new_name='trip_id')
print len(trip_ids), "trips"
print_mem()
>> 2803731 trips
>> 150.17578125 MB

trips = pd.DataFrame(index=trip_ids)
trips['origin'] = randomizer.randint(101, 200, size=len(trips))
trips['destination'] = randomizer.randint(101, 200, size=len(trips))
print_mem()
>> 171.70703125 MB
```
Based on this, my fake trip table takes up about 171.7 - 121.7 = 50 MB of RAM. I haven't done any fancy indexing, just initialized the table. But when I call trips.info() (which appears to instantiate the array of PyTuples), this happens:
```python
print trips.info()
print_mem()
>> <class 'pandas.core.frame.DataFrame'>
>> MultiIndex: 2803731 entries, (0, 0, 0, 0) to (199999, 0, 0, 1)
>> Data columns (total 2 columns):
>> origin         int32
>> destination    int32
>> dtypes: int32(2)
>> memory usage: 42.8+ MB
>> None
>> 723.64453125 MB
```
My process's memory usage balloons to 723 MB! Doing the math, the cached indexer takes up 723.6 - 171.7 = 551.9 MB, roughly a tenfold increase over the actual DataFrame!
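The tuple materialization can be seen directly on a small index; a minimal sketch (my example, not from the report above):

```python
import pandas as pd

mi = pd.MultiIndex.from_product([range(3), range(2)], names=['a', 'b'])

# internally the MultiIndex stores compact per-level arrays plus integer codes;
# accessing .values materializes (and caches) an object ndarray of PyTuples
vals = mi.values

print(vals.dtype)  # object
print(vals[0])     # (0, 0)
```

Each cached tuple is a full Python object holding boxed elements, which is where the per-row overhead comes from, versus a few bytes per row for the integer codes.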
For this fake dataset this is not so much of a problem, but my production data is 20x the size, and I soak up 27 GB of RAM when I so much as look at my trips table.
Any performance tips would be appreciated; the only thing I can think of right now is "don't use a MultiIndex". It sounds like a 'proper' fix for this lies deep in the indexing internals which is far above my pay grade. But I wanted to at least log this issue with the community.
Output of pd.show_versions():
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
pandas: 0.18.1
nose: 1.3.4
pip: 8.1.1
setuptools: 0.6c11
Cython: 0.22
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 3.0.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.1.1
numexpr: 2.5.1
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.4
xlwt: 0.7.5
xlsxwriter: 0.6.7
lxml: 3.4.2
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 0.9.9
pymysql: None
psycopg2: None
jinja2: 2.7.3
boto: 2.36.0
pandas_datareader: None