Hi Folks,
Related to #43155, I'm running into memory issues when pickling many small pandas DataFrames. The following script creates a DataFrame, splits it into groups, pickles each group, and then unpickles them all again. It then deletes every object, but something is still holding on to memory. Here is the script, followed by the output of memory_profiler:
```python
import numpy as np
import pandas as pd
import pickle


@profile
def test():
    df = pd.DataFrame(np.random.random((20000, 1000)))
    df["partitions"] = (df[0] * 10000).astype(int)
    _, groups = zip(*df.groupby("partitions"))
    del df

    groups = [pickle.dumps(group) for group in groups]
    groups = [pickle.loads(group) for group in groups]

    del groups


if __name__ == "__main__":
    test()
```
```
$ python -m memory_profiler memory_issue.py
Filename: memory_issue.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     7   76.574 MiB   76.574 MiB           1   @profile
     8                                         def test():
     9  229.445 MiB  152.871 MiB           1       df = pd.DataFrame(np.random.random((20000, 1000)))
    10  230.738 MiB    1.293 MiB           1       df["partitions"] = (df[0] * 10000).astype(int)
    11  398.453 MiB  167.715 MiB           1       _, groups = zip(*df.groupby("partitions"))
    12  245.633 MiB -152.820 MiB           1       del df
    13
    14  445.688 MiB   47.273 MiB        8631       groups = [pickle.dumps(group) for group in groups]
    15  712.285 MiB  266.598 MiB        8631       groups = [pickle.loads(group) for group in groups]
    16
    17  557.488 MiB -154.797 MiB           1       del groups
```
As you can see, we start with roughly 76 MiB in memory and end with roughly 557 MiB, despite all relevant objects being released. The leak grows with the number of groups (scale the 10000 factor up or down to move the leak with it). Any help or pointers on how to track this down would be welcome.
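In case it helps narrow things down, here is a minimal diagnostic sketch (the helper names `count_live_dataframes` and `trim_heap` are my own, and `malloc_trim` is glibc/Linux-specific) for checking whether any DataFrame objects actually survive the `del`, or whether the memory is being retained by the allocator rather than by live Python objects:

```python
import ctypes
import gc

import pandas as pd


def count_live_dataframes():
    # Count DataFrame objects the garbage collector still tracks.
    # If this returns 0 after `del groups`, nothing is leaking at the
    # Python level and the held memory lives in the allocator.
    gc.collect()
    return sum(isinstance(obj, pd.DataFrame) for obj in gc.get_objects())


def trim_heap():
    # Ask glibc to hand free heap pages back to the OS (Linux only).
    # If RSS drops sharply after this call, the growth above is heap
    # fragmentation / allocator retention rather than a true leak.
    ctypes.CDLL("libc.so.6").malloc_trim(0)
```

If `count_live_dataframes()` returns 0 right after `del groups` but RSS only falls once `trim_heap()` is called, that would point at allocator retention rather than leaked DataFrames.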