Skip to content

Issue 6544 - Query performance improvements on large parquet files#6557

Open
m09526 wants to merge 22 commits intodevelopfrom
6544-query-performance-improvements-on-large-parquet-files
Open

Issue 6544 - Query performance improvements on large parquet files#6557
m09526 wants to merge 22 commits intodevelopfrom
6544-query-performance-improvements-on-large-parquet-files

Conversation

@m09526
Copy link
Member

@m09526 m09526 commented Feb 4, 2026

Make sure you have checked all steps below.

Issue

  • My PR fully resolves the following issues. I've referenced an issue in the PR title, for example "Issue 1234 - My
    Feature". Note that before an issue is finished, you can still make a pull request by raising a separate issue
    for your progress.

Tests

  • My PR adds the following tests based on our test strategy OR does not need testing for this extremely good reason:
    • rust/sleeper_core/src/datafusion/util.rs

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it, or I have linked to a
    separate issue for that below.
  • If I have added new Java code, I have added Javadoc that explains it following our conventions and style.
  • If I have added or removed any dependencies from the project, I have updated the NOTICES file.

@m09526 m09526 linked an issue Feb 4, 2026 that may be closed by this pull request
@m09526 m09526 changed the title 6544 query performance improvements on large parquet files Issue 6544 - Query performance improvements on large parquet files Feb 4, 2026
m09526 and others added 2 commits February 4, 2026 14:07
…les' of github.com:gchq/sleeper into 6544-query-performance-improvements-on-large-parquet-files
@rtjd6554 rtjd6554 self-assigned this Feb 4, 2026
Copy link
Collaborator

@patchwork01 patchwork01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are failing tests and Checkstyle problems.

@patchwork01 patchwork01 assigned m09526 and unassigned rtjd6554 Feb 4, 2026
);
// Disable page indexes since we won't benefit from them as we are reading large contiguous file regions
cfg.options_mut().execution.parquet.enable_page_index = false;
cfg.options_mut().execution.parquet.enable_page_index = self.config.read_page_indexes();
Copy link
Collaborator

@patchwork01 patchwork01 Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue states that disabling reading page indexes in DataFusion isn't possible with the cached reader.
It looks like you've worked around this in rust/sleeper_core/src/datafusion/metadata_cache.rs?

Please can you explain in a comment here how these two things relate to one another, and what was intended?

Is there a way we can bring the code for this together into one place? It makes it a bit difficult to read with the read_page_indexes setting checked in multiple places.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added more comments to code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That helps, but I think the important part is the relationship between this line, the other check of read_page_indexes below, and the code in rust/sleeper_core/src/datafusion/metadata_cache.rs. From reading the comment here you wouldn't know that we have worked around the problem with DataFusion, or that those other pieces of code exist.

…les' of github.com:gchq/sleeper into 6544-query-performance-improvements-on-large-parquet-files
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Query performance improvements on large Parquet files

3 participants