Memory stays around after pickle cycle #43156

Open
@mrocklin

Description

Hi Folks,

Related to #43155, I'm running into memory issues when pickling many small pandas DataFrames. The following script creates a DataFrame, splits it into groups, pickles each split, and loads the splits back again. It then deletes all objects, but something is still holding memory. Here is the script, followed by the output of memory_profiler:

import numpy as np
import pandas as pd
import pickle


@profile
def test():
    df = pd.DataFrame(np.random.random((20000, 1000)))
    df["partitions"] = (df[0] * 10000).astype(int)
    _, groups = zip(*df.groupby("partitions"))
    del df

    groups = [pickle.dumps(group) for group in groups]
    groups = [pickle.loads(group) for group in groups]

    del groups


if __name__ == "__main__":
    test()
python -m memory_profiler memory_issue.py
Filename: memory_issue.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     7   76.574 MiB   76.574 MiB           1   @profile
     8                                         def test():
     9  229.445 MiB  152.871 MiB           1       df = pd.DataFrame(np.random.random((20000, 1000)))
    10  230.738 MiB    1.293 MiB           1       df["partitions"] = (df[0] * 10000).astype(int)
    11  398.453 MiB  167.715 MiB           1       _, groups = zip(*df.groupby("partitions"))
    12  245.633 MiB -152.820 MiB           1       del df
    13                                         
    14  445.688 MiB   47.273 MiB        8631       groups = [pickle.dumps(group) for group in groups]
    15  712.285 MiB  266.598 MiB        8631       groups = [pickle.loads(group) for group in groups]
    16                                         
    17  557.488 MiB -154.797 MiB           1       del groups

As you can see, we start with roughly 77 MiB in memory and end with 557 MiB, despite all relevant objects having been released. The leak grows with the number of groups (scale the 10000 to move it up or down). Any help or pointers on how to track this down would be welcome.
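One way to narrow this down is to check whether the retained memory is held by live Python objects or only by the process heap. Below is a minimal, stdlib-only sketch (an assumption on my part, not part of the report above) that runs a similar pickle round-trip under tracemalloc, using plain lists as a stand-in for the DataFrame groups, so the absolute numbers won't match the profile above:

```python
import gc
import pickle
import tracemalloc


def pickle_cycle(n_groups=1000, group_size=50):
    # Create many small objects, pickle and unpickle each, then drop them all,
    # mimicking the groupby/dumps/loads pattern in the repro script.
    groups = [list(range(i, i + group_size)) for i in range(n_groups)]
    blobs = [pickle.dumps(g) for g in groups]
    groups = [pickle.loads(b) for b in blobs]
    del groups, blobs
    gc.collect()


tracemalloc.start()
baseline, _ = tracemalloc.get_traced_memory()
pickle_cycle()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# tracemalloc only counts allocations made by live Python objects, so if
# `current` falls back near `baseline` while RSS stays high, the retention
# is at the allocator level rather than a Python-object leak.
print(f"baseline={baseline} current={current} peak={peak}")
```

If the traced size drops back to the baseline here while the memory_profiler numbers stay high, the leftover memory is most likely heap fragmentation from the many small allocations rather than pandas objects kept alive; on glibc systems, calling `malloc_trim(0)` through ctypes can sometimes return those free pages to the OS.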

Labels: IO Pickle (read_pickle, to_pickle), Performance (memory or execution speed)