Skip to content

use more than one core/thread when querying CSV #5205

@kmitchener

Description

@kmitchener

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When querying a CSV file on local disk only 1 thread is used, leaving the system mostly idle. For example, I have a 9 GB gzip-compressed CSV file that I'd like to query and convert to Parquet format. However, when running queries against the CSV, it can take 5 minutes or more for simple counts or group bys on single columns. 5 minutes per query isn't too unexpected (when uncompressed, it's a 100 GB file, 45 million records x 242 columns). But it's nowhere near utilizing all the machine's resources. It seems like there should be a way to move more of the work to other threads and querying/conversion to parquet could be substantially faster.

When profiled, it looks like this (green is the only busy thread):
image

and if you zoom into what that thread is doing, it's following a pattern of:

  • read data from disk
  • then for about 10 cycles, it
    • decompresses a chunk
    • reads csv records (csv_core) from decompressed chunk
  • read again from disk and repeat the cycle

image

Describe the solution you'd like

Better utilization of the machine, work done in more than one thread and sustained reads from disk.

Describe alternatives you've considered

Is it possible to offload the decompression and processing of bytes -> csv records to other threads? I had an idea to try to implement some kind of fan-out/fan-in somewhere under CsvExec so that one thread is just reading from disk and sending the raw bytes to other threads to decompress/csv_read/convert to recordbatches (CPU-heavy work is in the CSV reading) -> fan back in to stream the recordbatches from CsvExec as expected. Is there a better way to do it?

Additional context
Add any other context or screenshots about the feature request here.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions