
PERF: Explore even faster path for df.to_csv #3186

Closed
@ghost


iotop and a simple-minded C program indicate that df.to_csv is nowhere
near IO-bound: it is roughly 10-15x slower than plain fprintf writes.

It might be possible to speed things up considerably with a fast path
for special cases (numerical data only) that don't need the fancy quoting and other
bells and whistles provided by the underlying Python csv module.
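
To make the fast-path idea concrete, here is a minimal sketch in pure Python, not the pandas implementation. It assumes every value is a float, so no quoting or escaping is ever required. The function names `write_numeric_rows_fast` and `write_rows_csv_module` are hypothetical and exist only for this illustration.

```python
import csv
import io


def write_numeric_rows_fast(rows, fh, float_format="%f"):
    """Hypothetical fast path: floats only, so quoting logic is skipped."""
    fmt_cache = {}
    for row in rows:
        n = len(row)
        fmt = fmt_cache.get(n)
        if fmt is None:
            # Build one format string per row width, e.g. "%f,%f,%f\n".
            fmt = ",".join([float_format] * n) + "\n"
            fmt_cache[n] = fmt
        fh.write(fmt % tuple(row))


def write_rows_csv_module(rows, fh, float_format="%f"):
    """Baseline: stringify each value, then go through csv.writer."""
    writer = csv.writer(fh, lineterminator="\n")
    for row in rows:
        writer.writerow([float_format % v for v in row])


rows = [(1.0, 2.0, 3.0, 4.0, 5.0)] * 3
fast, baseline = io.StringIO(), io.StringIO()
write_numeric_rows_fast(rows, fast)
write_rows_csv_module(rows, baseline)

# Both paths produce identical text for purely numeric input.
assert fast.getvalue() == baseline.getvalue()
print(fast.getvalue().splitlines()[0])  # 1.000000,2.000000,3.000000,4.000000,5.000000
```

Whether skipping the csv module actually pays off inside pandas would still have to be measured; the sketch only shows that the output is the same for numeric-only data.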

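#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int i;
    FILE *f;
    char fmt[] = "%f,%f,%f,%f,%f\n";

    /* Loop forever so that iotop can observe the sustained write rate. */
    while (1) {
        f = fopen("out.csv", "wb");
        if (f == NULL) {
            perror("fopen");
            return EXIT_FAILURE;
        }
        /* Write one million rows of five floats, one fprintf call per row. */
        for (i = 0; i < 1000000; i++) {
            fprintf(f, fmt, 1.0, 2.0, 3.0, 4.0, 5.0);
        }
        fclose(f);
    }
}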

The program sustains about 30 MB/s on my machine (without even batching writes),
vs. ~2-3 MB/s for the new (0.11.0) Cython df.to_csv().

We need to check whether it's the stringifying, the quoting logic, the memory layout, or something
else that accounts for the difference.
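
One rough way to start separating these costs is to time each stage on its own. The sketch below is an assumption-laden stand-in, not a profile of pandas itself: it uses plain Python lists and in-memory buffers (so disk IO is excluded), it does not address memory layout, and `time_stages` is a hypothetical helper name.

```python
import csv
import io
import time


def time_stages(n_rows=100_000, n_cols=5):
    """Time stringification and two ways of writing the resulting strings."""
    rows = [[float(j) for j in range(n_cols)] for _ in range(n_rows)]

    # Stage 1: convert every float to its text form.
    t0 = time.perf_counter()
    str_rows = [["%f" % v for v in row] for row in rows]
    t1 = time.perf_counter()

    # Stage 2a: write pre-stringified rows through csv.writer (quoting logic).
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(str_rows)
    t2 = time.perf_counter()

    # Stage 2b: write the same rows with a plain join (no quoting logic).
    buf2 = io.StringIO()
    buf2.write("".join(",".join(r) + "\n" for r in str_rows))
    t3 = time.perf_counter()

    return {"stringify": t1 - t0, "csv_writer": t2 - t1, "plain_join": t3 - t2}


for stage, seconds in time_stages().items():
    print(f"{stage}: {seconds:.3f}s")
```

Comparing `csv_writer` against `plain_join` gives a first estimate of what the quoting machinery costs, while `stringify` shows how much time goes into formatting alone.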

This should also yield insights for any future binary serialization format
we implement.


Labels: IO CSV (read_csv, to_csv), Output-Formatting (__repr__ of pandas objects, to_string), Performance (memory or execution speed performance)
