
PERF: Load data (create Series, DataFrame) in a more functional way. #5902

Closed
@tinproject

Description

I've been trying to understand the process of creating DataFrames in order to try to solve #2305 with minimal memory use. I've made some tests and put them in an IPython notebook: http://nbviewer.ipython.org/gist/tinproject/7d0e0de9475b16910fcf

Currently, when reading from a generator, pandas internally materializes the whole generator in memory as a list, using at least twice the memory needed. I suspect it can sometimes be even more when a later type conversion is needed.
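For illustration, here is a minimal sketch of the pattern described above (the general shape, not pandas' exact code path): the generator is exhausted into an intermediate list before the values are copied into the final typed array, so the data transiently exists twice.

```python
import numpy as np

def build_column_via_list(gen, dtype=np.int64):
    # Sketch of the current pattern: exhaust the generator into a list...
    values = list(gen)                       # first full copy, as Python objects
    # ...then convert that list into the final homogeneous array.
    return np.asarray(values, dtype=dtype)   # second copy, now typed

col = build_column_via_list(i * i for i in range(1_000_000))
```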

Also, as can be seen in the notebook, the input data collection is read in full several times before the data is loaded into the final data structure.

To read from an iterator without putting the whole thing in memory, the yielded values need to be processed one by one, or at least in a single pass.

Once data can be read from an iterator in a single pass, it becomes possible to read from any iterable collection, needing only the count of elements to read. That covers collections with implicit length: sequences (list, tuple, str), sets, mappings, etc., and collections of unknown length: generators, iterators, etc., whose size must be given explicitly.
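As a rough sketch of that one-pass idea (assuming NumPy's `fromiter`, which can pre-allocate when a `count` is given; the helper name `series_from_iterable` is hypothetical, and this only covers 1-D numeric data, not every dtype pandas handles):

```python
import numpy as np
import pandas as pd

def series_from_iterable(it, dtype, count=-1):
    # Hypothetical helper: fill a typed array in a single pass over `it`.
    # When `count` is known, fromiter can pre-allocate the output buffer;
    # with count=-1 it grows the buffer as it consumes the iterator.
    values = np.fromiter(it, dtype=dtype, count=count)
    return pd.Series(values)

# Implicit length: len() gives the count for free.
data = (1, 2, 3, 4)
s1 = series_from_iterable(iter(data), dtype=np.int64, count=len(data))

# Unknown length: the size must be supplied explicitly by the caller.
s2 = series_from_iterable((i * i for i in range(10)), dtype=np.int64, count=10)
```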

I think this could lead to a significant performance improvement and would broaden the set of accepted input data types (in my tests I saw that a tuple is not a valid input type for DataFrame!).

Relates to #2305, #2193, #5898

I want to help solve this, but it's a big task, and I'm lost when it comes to the Blocks part. Is it documented anywhere what structure Blocks have and how they work?

Metadata

    Labels

    IO Data (IO issues that don't fit into a more specific label), Internals (Related to non-user accessible pandas implementation)
