Description
Describe the bug
If you run a query in DataFusion against parquet files, it will create several unnecessary temporary files.
IOx also hits the same thing (with the same root cause): https://github.com/influxdata/influxdb_iox/issues/3507#issuecomment-1023679575
Several places create a DiskManager instance today in non-obvious ways. The one that affects the parquet use case above is the creation of the pruning predicate, which requires an ExecutionContext: https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_optimizer/pruning.rs#L132
This has two problems:
- it is unneeded overhead (the disk manager is not used),
- the overhead is larger than it needs to be (it creates a tempfile)
I propose a two-pronged solution (I will open two PRs):
- Create temp files on demand in the DiskManager (so we are at least not doing IO unless needed)
- Remove unnecessary creation of ExecutionContext
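To illustrate the first item, here is a minimal sketch of on-demand creation. It is not DataFusion's actual DiskManager: `LazyDiskManager`, `create_tmp_file`, `is_initialized`, and the `spill-N.tmp` naming are all hypothetical names chosen for this example. The point is only that construction touches no disk, and the scratch directory appears the first time a file is requested.

```rust
use std::fs::{self, File};
use std::io;
use std::path::PathBuf;
use std::sync::Mutex;

/// Mutable state guarded by the mutex.
struct State {
    /// Whether the scratch directory has been created yet.
    created: bool,
    /// Counter used to give each temp file a distinct name.
    next_id: u64,
}

/// Hypothetical stand-in for a disk manager (not DataFusion's real type).
pub struct LazyDiskManager {
    /// Directory that will hold temp files once one is requested.
    dir: PathBuf,
    state: Mutex<State>,
}

impl LazyDiskManager {
    /// Records where temp files would go. Performs no IO.
    pub fn new(dir: PathBuf) -> Self {
        Self {
            dir,
            state: Mutex::new(State { created: false, next_id: 0 }),
        }
    }

    /// True once the scratch directory has been created by this manager.
    pub fn is_initialized(&self) -> bool {
        self.state.lock().unwrap().created
    }

    /// Creates the scratch directory on first use, then a new file in it.
    pub fn create_tmp_file(&self) -> io::Result<(PathBuf, File)> {
        let mut state = self.state.lock().unwrap();
        if !state.created {
            fs::create_dir_all(&self.dir)?;
            state.created = true;
        }
        let path = self.dir.join(format!("spill-{}.tmp", state.next_id));
        state.next_id += 1;
        let file = File::create(&path)?;
        Ok((path, file))
    }
}
```

With this shape, code paths that build a manager but never spill (such as the pruning predicate case) would leave no files behind. Cleanup on drop is omitted here for brevity.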
I think the second change (removing ExecutionContext) will be a slightly larger project, as the context gets passed to create_physical_expr.
That said, I think the main sources of the problem are related to create_physical_expr, which only uses the context to look up vars, if necessary.
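One possible direction, sketched with a toy expression type: if variable lookup is the only thing needed from the context, the function could accept a narrow lookup interface instead. Everything below is hypothetical (`VarLookup`, `Expr`, `resolve`) and does not mirror the real create_physical_expr signature; it only shows the shape of the dependency.

```rust
use std::collections::HashMap;

/// Hypothetical narrow interface: the only capability the planner needs.
pub trait VarLookup {
    fn lookup(&self, name: &str) -> Option<i64>;
}

/// A plain map is enough to satisfy the interface; no context required.
impl VarLookup for HashMap<String, i64> {
    fn lookup(&self, name: &str) -> Option<i64> {
        self.get(name).copied()
    }
}

/// Toy logical expression, standing in for the real expression type.
pub enum Expr {
    Literal(i64),
    Var(String),
    Add(Box<Expr>, Box<Expr>),
}

/// Toy analogue of planning an expression: variables are resolved through
/// `vars`, and nothing else from a context is touched.
pub fn resolve(expr: &Expr, vars: &dyn VarLookup) -> Result<i64, String> {
    match expr {
        Expr::Literal(v) => Ok(*v),
        Expr::Var(name) => vars
            .lookup(name)
            .ok_or_else(|| format!("unknown variable: {}", name)),
        Expr::Add(l, r) => Ok(resolve(l, vars)? + resolve(r, vars)?),
    }
}
```

A caller whose expressions contain no variables, as is likely for pruning predicates, could then pass an empty map and never construct a context at all.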