sopel39 (Member) commented Nov 12, 2025

dictionary_page_offset might not be reliable, even when dictionary encoding is present.

Description

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Section
* Fix some things. ({issue}`issuenumber`)

Fixes #27232

@cla-bot cla-bot bot added the cla-signed label Nov 12, 2025
@github-actions github-actions bot added the hive Hive connector label Nov 12, 2025
sopel39 (Member Author) commented Nov 12, 2025

Comment on lines -294 to -295
assertThat(ageChunk.getDictionaryPageOffset()).isGreaterThan(0);
assertThat(idChunk.getDictionaryPageOffset()).isGreaterThan(0);
Member

I'm not sure this will address the problem.
If a column uses dictionary encoding, then it must write a dictionary page, and that page has to be at a non-zero offset from the beginning of the file (because a Parquet file has a compulsory header of a few bytes).
So if this assertion was failing sometimes, then the next check for use of dictionary encoding should also fail, as the dictionary page offset can be zero only if there is no dictionary page at all.
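
The argument above rests on the fixed magic bytes at the start of every Parquet file: 4 bytes, `PAR1` for plain files or `PARE` when the footer is encrypted. A minimal sketch of that reasoning follows. `ParquetMagic` and its methods are hypothetical names made up for illustration; they are not part of Trino or parquet-mr.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not a Trino or parquet-mr class.
final class ParquetMagic {
    // Length of the magic at the start of a Parquet file ("PAR1" or "PARE").
    static final int MAGIC_LENGTH = 4;

    // Returns true if the bytes begin with one of the two known Parquet magics.
    static boolean startsWithMagic(byte[] fileBytes) {
        if (fileBytes.length < MAGIC_LENGTH) {
            return false;
        }
        String magic = new String(fileBytes, 0, MAGIC_LENGTH, StandardCharsets.US_ASCII);
        return magic.equals("PAR1") || magic.equals("PARE");
    }

    // A page that was really written cannot start inside the magic header,
    // so an offset below MAGIC_LENGTH (0 in particular) cannot be a real page position.
    static boolean isPlausiblePageOffset(long offset) {
        return offset >= MAGIC_LENGTH;
    }
}
```

Under this reading, an offset of 0 can only mean "not recorded" or "no dictionary page"; it says nothing by itself about which of the two applies.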

Member Author

I cannot really reproduce the issue locally.

Take a look at https://stackoverflow.com/a/55226688.

> If a column uses dictionary encoding, then it must write a dictionary page and that has to be a non-zero offset from beginning of file (because parquet file has a compulsory header of few bytes).

Assuming that example is correct (where parquet-mr puts 0 even when dictionaries are present), does it mean we do not support dictionary offset correctly?
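
If the linked example is right, one defensive approach is to stop treating the offset as the signal and to look at the chunk's encodings instead, using the offset only when it is positive and lies before the first data page. The sketch below assumes exactly that: a writer may leave `dictionary_page_offset` at 0 even though a dictionary page exists, in which case the chunk is assumed to start at the first data page offset. `ChunkOffsets` and its methods are hypothetical; this is not how Trino or parquet-mr is confirmed to behave in this thread.

```java
import java.util.Set;

// Hypothetical stand-in for column chunk metadata handling; illustration only.
final class ChunkOffsets {
    // Dictionary use is inferred from the declared encodings, not from the offset.
    // PLAIN_DICTIONARY and RLE_DICTIONARY are the dictionary encodings in the Parquet format.
    static boolean hasDictionaryEncoding(Set<String> encodings) {
        return encodings.contains("PLAIN_DICTIONARY") || encodings.contains("RLE_DICTIONARY");
    }

    // Assumption: a dictionary offset is trusted only when it is positive and
    // precedes the first data page; otherwise the chunk starts at the first data page offset.
    static long chunkStart(long dictionaryPageOffset, long firstDataPageOffset) {
        if (dictionaryPageOffset > 0 && dictionaryPageOffset < firstDataPageOffset) {
            return dictionaryPageOffset;
        }
        return firstDataPageOffset;
    }
}
```

With this shape, a test could assert on `hasDictionaryEncoding(...)` and leave the offset unasserted, which is roughly what removing the two `isGreaterThan(0)` checks amounts to.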

Development

Successfully merging this pull request may close these issues.

Flaky TestHiveParquetEncryption.testEncryptedDictionaryPruningTwoColumns

2 participants