Closed
Labels: enhancement (New feature or request)
Description
To make DataFusion scan tables more flexibly and more efficiently, we can take several further steps. (I have linked the related issues I am aware of.)
Capability:
- Enable listing/reading remote storage systems asynchronously (see: Add support for reading distributed datasets, #616)
- Enable processing at file-block granularity, i.e., row groups or offset ranges instead of the current per-file basis (see: Make it possible to only scan part of a parquet file in a partition, #1990)
- Enable reading partitioned tables, i.e., tables whose partition column values are encoded in the file path (see: Add support for reading partitioned Parquet files, #133)
- Support table schema evolution. In other words, relax the requirement that all files in a table have exactly the same schema
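To make the partitioned-table item concrete, here is a minimal sketch of recovering partition column values from a file path. It assumes a Hive-style `key=value` directory layout, which this issue does not specify, and the function name `parse_partition_values` is hypothetical, not an existing DataFusion API.

```rust
/// Extract `key=value` partition pairs from the directory segments of a file path.
/// Assumption: Hive-style layout, e.g. `table/year=2021/month=10/part-0.parquet`.
/// Returns `None` if the file is not under `table_root` or if a directory
/// segment below the root is not of the form `key=value`.
fn parse_partition_values(table_root: &str, file_path: &str) -> Option<Vec<(String, String)>> {
    // Only consider the part of the path below the table root.
    let relative = file_path.strip_prefix(table_root)?.trim_start_matches('/');
    let mut segments: Vec<&str> = relative.split('/').collect();
    // The last segment is the file name, not a partition directory.
    segments.pop();

    let mut values = Vec::new();
    for segment in segments {
        // `split_once` yields `None` for segments without `=`, which aborts parsing.
        let (key, value) = segment.split_once('=')?;
        values.push((key.to_string(), value.to_string()));
    }
    Some(values)
}
```

With this sketch, a path such as `s3://bucket/table/year=2021/month=10/part-0.parquet` under the root `s3://bucket/table` would yield the pairs `year` = `2021` and `month` = `10`. The values stay as strings here; casting them to the partition column types would be a separate step.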
Performance:
- List table files in parallel (see: Refactor ParquetExec::try_from_files in preparation for making it parallel, #896)
- Scan Parquet metadata lazily (see: DataFusion should scan Parquet statistics once per query, #871)
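As an illustration of the lazy-metadata item, the sketch below defers reading per-file statistics until they are first requested and then caches them, so repeated accesses do not re-read the footer. All names here (`FileStats`, `LazyFile`, `fake_footer_read`) are hypothetical stand-ins and not DataFusion types. The "footer read" is simulated, and `std::cell::OnceCell` requires Rust 1.70 or later.

```rust
use std::cell::{Cell, OnceCell};

/// Hypothetical stand-in for statistics stored in a Parquet footer.
#[derive(Debug, Clone, PartialEq)]
struct FileStats {
    num_rows: u64,
}

/// A file whose statistics are loaded on first access and cached afterwards.
struct LazyFile {
    path: String,
    load: fn(&str) -> FileStats, // how to read the footer (assumed, pluggable)
    stats: OnceCell<FileStats>,  // empty until the first call to `stats()`
    footer_reads: Cell<u32>,     // counts real loads, for demonstration only
}

impl LazyFile {
    fn new(path: &str, load: fn(&str) -> FileStats) -> Self {
        LazyFile {
            path: path.to_string(),
            load,
            stats: OnceCell::new(),
            footer_reads: Cell::new(0),
        }
    }

    /// Return the statistics, invoking the loader at most once.
    fn stats(&self) -> &FileStats {
        self.stats.get_or_init(|| {
            self.footer_reads.set(self.footer_reads.get() + 1);
            (self.load)(&self.path)
        })
    }
}

/// Simulated footer read; a real implementation would perform I/O here.
fn fake_footer_read(_path: &str) -> FileStats {
    FileStats { num_rows: 42 }
}
```

Constructing a `LazyFile` performs no load at all; the loader runs only when `stats()` is first called, and later calls return the cached value. A multi-threaded scan would need `std::sync::OnceLock` (or an async equivalent) instead of `OnceCell`.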