
csv.reader calls the state machine for every character needlessly #138213

@maurycy

Bug report

Bug description:

The state machine:

parse_process_char(ReaderObj *self, _csvstate *module_state, Py_UCS4 c)

is called for every character processed by csv.reader:

cpython/Modules/_csv.c

Lines 969 to 974 in bbcb75c

    while (linelen--) {
        c = PyUnicode_READ(kind, data, pos);
        if (parse_process_char(self, module_state, c) < 0) {
            Py_DECREF(lineobj);
            goto err;
        }

Even putting aside sophisticated SIMD or branching optimizations, this loop could be more efficient.

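Most of the time is likely to be spent inside a field (the IN_FIELD and IN_QUOTED_FIELD states). It would be more efficient to scan ahead for the next interesting character (such as an escape or a quote) and copy the whole slice in between in one step, instead of calling the state machine once per character.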
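
A minimal sketch of the idea, assuming a simplified dialect with only a delimiter, a quote character, and an optional escape character. The names `sketch_dialect`, `sketch_plain_run`, and `sketch_append_run` are hypothetical and not part of `_csv.c`; a real change would also have to handle quoted fields, `skipinitialspace`, the field size limit, and the different PyUnicode kinds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the dialect settings.
 * This is NOT the structure used by _csv.c. */
typedef struct {
    uint32_t delimiter;
    uint32_t quotechar;
    uint32_t escapechar;   /* 0 means "no escape character" */
} sketch_dialect;

/* Return how many leading code points of data[0..len) are "boring" for
 * an unquoted field: not a delimiter, quote, escape, or line terminator. */
static size_t
sketch_plain_run(const sketch_dialect *d, const uint32_t *data, size_t len)
{
    size_t i = 0;
    while (i < len) {
        uint32_t c = data[i];
        if (c == d->delimiter || c == d->quotechar ||
            (d->escapechar != 0 && c == d->escapechar) ||
            c == '\n' || c == '\r') {
            break;
        }
        i++;
    }
    return i;
}

/* Append a whole run to the field buffer with a single memcpy instead of
 * one state-machine call per character. Returns the new field length. */
static size_t
sketch_append_run(uint32_t *field, size_t field_len, size_t field_cap,
                  const uint32_t *run, size_t run_len)
{
    assert(field_len + run_len <= field_cap);  /* caller sizes the buffer */
    memcpy(field + field_len, run, run_len * sizeof(uint32_t));
    return field_len + run_len;
}
```

While in an unquoted field, the reader could call `sketch_plain_run` once, copy that many code points with `sketch_append_run`, and pass only the next (interesting) character to the state machine.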

CPython versions tested on:

CPython main branch

Operating systems tested on:

Linux

Labels:

extension-modules (C modules in the Modules dir), performance (Performance or resource usage), stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)
