Use Arc<Statistics> rather than Statistics in PartitionedFile #11885

Open
@alamb

Is your feature request related to a problem or challenge?

We are trying to improve the speed of DataFusion when running the ClickBench partitioned test (which has 100 files), which means the per-file overhead is important to reduce.

One structure with non-trivial overhead is the Statistics structure: it holds a ScalarValue for each column of each file, so there are at least 100 * (number of columns) * 2 ScalarValues.

Describe the solution you'd like

It would be great to reduce the overhead of passing around these values.

Describe alternatives you've considered

One way to do so is to avoid copying them when the underlying ParquetExec is copied by using an Option<Arc<Statistics>> here:

https://github.com/apache/datafusion/blob/9503456388544788e1a881a0a80a3c61ac015a86/datafusion/core/src/datasource/listing/mod.rs#L81-L80
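
To illustrate the idea, here is a minimal sketch using only the standard library. The `Scalar`, `ColumnStats`, `Statistics`, and `PartitionedFile` types below are simplified, hypothetical stand-ins and not DataFusion's actual definitions; `make_file` and `shares_statistics` are likewise made up for the example. The point is only that cloning an `Option<Arc<Statistics>>` bumps a reference count instead of deep-copying every per-column value.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for DataFusion's `ScalarValue`.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Int64(i64),
    Utf8(String),
}

/// Hypothetical, simplified per-column statistics (min and max only).
#[derive(Debug, Clone, PartialEq)]
struct ColumnStats {
    min: Scalar,
    max: Scalar,
}

/// Hypothetical, simplified per-file statistics.
#[derive(Debug, Clone, PartialEq)]
struct Statistics {
    num_rows: usize,
    columns: Vec<ColumnStats>,
}

/// Hypothetical, simplified file descriptor. The statistics sit behind an
/// `Arc`, so cloning the descriptor does not clone the statistics.
#[derive(Debug, Clone)]
struct PartitionedFile {
    path: String,
    statistics: Option<Arc<Statistics>>,
}

/// Build a file descriptor with placeholder statistics for `num_columns` columns.
fn make_file(path: &str, num_columns: usize) -> PartitionedFile {
    let columns = (0..num_columns)
        .map(|i| ColumnStats {
            min: Scalar::Int64(0),
            max: Scalar::Utf8(format!("max-{}", i)),
        })
        .collect();
    PartitionedFile {
        path: path.to_string(),
        statistics: Some(Arc::new(Statistics {
            num_rows: 1_000,
            columns,
        })),
    }
}

/// Returns true when both descriptors point at the same `Statistics` allocation.
fn shares_statistics(a: &PartitionedFile, b: &PartitionedFile) -> bool {
    match (&a.statistics, &b.statistics) {
        (Some(x), Some(y)) => Arc::ptr_eq(x, y),
        _ => false,
    }
}
```

With these definitions, `make_file("part-0.parquet", 3).clone()` yields a second descriptor for which `shares_statistics` returns `true`: the two copies share one allocation. The trade-off is that code needing to modify the statistics must go through something like `Arc::make_mut`, which copies only when the value is actually shared.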

Additional context

Interestingly @Rachelint
#11802 (comment)
