
Include file stats when converting a parquet directory to a Delta table #2490

Closed
@gruuya

Description

Currently the ConvertToDeltaBuilder skips fetching and populating the stats when it builds each Add action, so the field is left at its default:

Add {
    path: percent_decode_str(file.location.as_ref())
        .decode_utf8()?
        .to_string(),
    size: i64::try_from(file.size)?,
    partition_values: partition_values
        .into_iter()
        .map(|(k, v)| {
            (
                k,
                if v.is_null() {
                    None
                } else {
                    Some(v.serialize())
                },
            )
        })
        .collect(),
    modification_time: file.last_modified.timestamp_millis(),
    data_change: true,
    ..Default::default()
}

This results in log files missing the min/max/null count statistics.
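
For illustration, here is a rough sketch of the kind of value that ends up missing. The Delta protocol stores per-file statistics as a JSON string with numRecords, minValues, maxValues and nullCount keys. The ColumnStats struct and the stats_json and json_object helpers below are hypothetical names made up for this example, not part of delta-rs, and the sketch assumes integer columns whose names need no JSON escaping.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-column statistics, e.g. as gathered from a Parquet footer.
/// Only integer columns are modelled to keep the sketch short.
struct ColumnStats {
    min: i64,
    max: i64,
    null_count: i64,
}

/// Renders one `{"col":value,...}` JSON object, picking a single number per column.
/// Assumption: column names contain no characters that need JSON escaping.
fn json_object(columns: &BTreeMap<String, ColumnStats>, pick: fn(&ColumnStats) -> i64) -> String {
    let fields: Vec<String> = columns
        .iter()
        .map(|(name, stats)| format!("\"{}\":{}", name, pick(stats)))
        .collect();
    format!("{{{}}}", fields.join(","))
}

/// Builds a stats JSON string in the shape the Delta protocol describes
/// for per-file statistics.
fn stats_json(num_records: i64, columns: &BTreeMap<String, ColumnStats>) -> String {
    format!(
        "{{\"numRecords\":{},\"minValues\":{},\"maxValues\":{},\"nullCount\":{}}}",
        num_records,
        json_object(columns, |s| s.min),
        json_object(columns, |s| s.max),
        json_object(columns, |s| s.null_count),
    )
}
```

A string of this shape is what one would expect to be set as the stats of each Add action instead of leaving it to Default::default(). A real implementation would presumably use a JSON serializer and handle all column types.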

Use Case

These stats are useful because they allow files to be pruned (skipped) during scans based on their min/max values, which improves query performance.

Granted, it may be possible to use the stats from the files themselves, but that is sub-optimal compared to reading them from the log directly.
