
PERF: Load data (create Series, DataFrame) in a more functional way. #5902

Closed
@tinproject

Description

I've been trying to understand the process of creating DataFrames in order to try to solve #2305 with minimal memory use. I've made some tests and put them in an IPython notebook: http://nbviewer.ipython.org/gist/tinproject/7d0e0de9475b16910fcf

Currently, when reading from a generator, pandas internally materializes the whole generator in memory as a list, using at least twice the memory needed. I suspect it can sometimes be even more when a later type conversion is needed.
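For illustration, here is a minimal sketch of the pattern described above (the general shape, not pandas' exact code path): the generator is exhausted into an intermediate list before the values are copied into the final typed array, so the data transiently exists twice.

```python
import numpy as np

def build_column_via_list(gen, dtype=np.int64):
    # Sketch of the current pattern: exhaust the generator into a list...
    values = list(gen)                       # first full copy, as Python objects
    # ...then convert that list into the final homogeneous array.
    return np.asarray(values, dtype=dtype)   # second copy, now typed

col = build_column_via_list(i * i for i in range(1_000_000))
```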

Also, as can be seen in the notebook, the input data collection is read in full several times before the data is loaded into the final data structure.

To read from an iterator without putting the whole thing in memory, the yielded values need to be processed one by one, or at least in a single pass.

Once data can be read from an iterator in a single pass, it becomes possible to read from any iterable collection, needing only the count of elements to read. That covers collections with implicit length: sequences (list, tuple, str), sets, mappings, etc., and collections of unknown length: generators, iterators, etc., whose size must be given explicitly.
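As a rough sketch of that one-pass idea (assuming NumPy's `fromiter`, which can pre-allocate when a `count` is given; the helper name `series_from_iterable` is hypothetical, and this only covers 1-D numeric data, not every dtype pandas handles):

```python
import numpy as np
import pandas as pd

def series_from_iterable(it, dtype, count=-1):
    # Hypothetical helper: fill a typed array in a single pass over `it`.
    # When `count` is known, fromiter can pre-allocate the output buffer;
    # with count=-1 it grows the buffer as it consumes the iterator.
    values = np.fromiter(it, dtype=dtype, count=count)
    return pd.Series(values)

# Implicit length: len() gives the count for free.
data = (1, 2, 3, 4)
s1 = series_from_iterable(iter(data), dtype=np.int64, count=len(data))

# Unknown length: the size must be supplied explicitly by the caller.
s2 = series_from_iterable((i * i for i in range(10)), dtype=np.int64, count=10)
```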

I think this could lead to a significant performance improvement and would broaden the set of accepted input data types (in my tests I saw that a tuple is not a valid input type for DataFrame!).

Relates to #2305, #2193, #5898

I want to help solve this, but it's a big task, and I'm lost when it comes to the Blocks part. Is it documented anywhere what structure Blocks have and how they work?

Metadata

    Labels

    IO Data (IO issues that don't fit into a more specific label), Internals (Related to non-user accessible pandas implementation)
