Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
When reading multiple parquet files, DataFusion will sometimes request many file handles from the OS concurrently. This is inefficient (each file handle consumes memory, requires system calls, etc.) and can lead to "too many open files" errors.
Depending on how quickly IO completes and on the details of the Tokio scheduler, DataFusion can end up with far too many files open at once (for example, it might open 100 input parquet files even though only 8 cores are available for processing).
Describe the solution you'd like
As described by @Dandandan in https://github.com/apache/arrow-datafusion/pull/706/files#r667508175 it would be nice to decouple the number of parquet files scanned concurrently from the number of target partitions used by other operators.
So the idea would be to add a new config setting, parquet_partitions or perhaps filesource_partitions, that would control the number of parquet "partitions" created and thus the number of file handles used when running DataFusion plans.
Describe alternatives you've considered
@andygrove has mentioned that the Ballista scheduler is more sophisticated in this area, and hopefully we can move some of those improvements down into the core DataFusion engine.
Additional context
There are reports in arrow-rs of "too many open files" errors, apache/arrow-rs#47 (comment), which may also be helped by this feature, though there is probably more work needed as well.