feat: Implement dictionary based rowgroup skipping for dictionary encoded data #14907

jaystarshot · 2025-09-19T04:43:59Z

This PR implements dictionary-based rowgroup skipping for Parquet.

Presto has rowgroup skipping for dictionary-encoded pages (link). This is very efficient because it skips an entire rowgroup when none of the dictionary values match the domain and metadata filters.

The first page of a dictionary-encoded column chunk is always a dictionary page, in every rowgroup.

Without this feature, some of our Java traffic migration was blocked by cluster slowness and load.
With this change, in production we see > 8x CPU improvement and > 3x fewer rows read for relevant queries.
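
To make the idea above concrete, here is a minimal sketch of the skipping decision. It is not the code in this PR: `DictionaryPage` and `mayMatch` are hypothetical names, the dictionary is assumed to be already decoded into `int64_t` values, and the filter is modeled as a plain predicate. The check is deliberately conservative: whenever the dictionary cannot prove that no row matches, the rowgroup is kept.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for a decoded dictionary page; not a Velox type.
struct DictionaryPage {
  std::vector<int64_t> values;
};

// Returns true if the rowgroup may contain matching rows and must be read,
// false if it can be skipped.
//
// 'fullyDictionaryEncoded' must be true only when every data page of the
// column chunk is dictionary encoded. Otherwise some values live outside
// the dictionary and the dictionary alone proves nothing.
bool mayMatch(
    const DictionaryPage& dictionary,
    bool fullyDictionaryEncoded,
    const std::function<bool(int64_t)>& filter) {
  if (!fullyDictionaryEncoded || dictionary.values.empty()) {
    // Nothing can be proven, so keep the rowgroup.
    return true;
  }
  for (const auto value : dictionary.values) {
    if (filter(value)) {
      return true;
    }
  }
  // No dictionary value passes the filter, so no row can pass it either.
  return false;
}
```

A caller would skip the rowgroup only when `mayMatch` returns false. Null handling is left out of this sketch.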

netlify · 2025-09-19T04:44:05Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`2706c36`
🔍 Latest deploy log	https://app.netlify.com/projects/meta-velox/deploys/68e573cff512770008bca01c

jaystarshot · 2025-09-24T17:46:10Z

@majetideepak, @Yuhta Can you please take a first pass at this? The PR is ready; I am just figuring out a way to set up testing for open source, since our test files are internal.

jaystarshot · 2025-09-24T18:00:14Z

velox/dwio/parquet/reader/ParquetData.cpp

+  return false;
+}
+
+bool ParquetData::testFilterAgainstDictionary(


This is the main change: it reads the first page of the column chunk and applies the filters to it.

jaystarshot · 2025-10-07T04:37:02Z

@majetideepak @Yuhta @pedroerp I have added a test case which fails without the changes (the number of rows is > 0).
We have also extensively tested this change in production using shadow rowgroup skipping (signal when a rowgroup is skipped, and fail the query if the number of rows processed is > 0).

Yuhta

This is probably mainly because we are missing SelectiveIntegerDictionaryColumnReader and SelectiveStringDictionaryColumnReader in Parquet, which we have in DWRF. Basically, we apply the actual filter to the dictionary values and cache the results. It would be nice if we implemented the actual filter instead of just statistics filters.
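
As an illustration of "apply the actual filter to the dictionary values and cache the results", here is a minimal sketch. It is not the DWRF reader code: `DictionaryFilterCache` is a hypothetical class, the dictionary is assumed to hold `int64_t` values, and indices are assumed to be in range. Each dictionary entry is tested at most once, and rows are then filtered by index lookup.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical sketch of per-dictionary-entry filter caching.
class DictionaryFilterCache {
 public:
  DictionaryFilterCache(
      std::vector<int64_t> dictionary,
      std::function<bool(int64_t)> filter)
      : dictionary_(std::move(dictionary)),
        filter_(std::move(filter)),
        cache_(dictionary_.size(), State::kUnknown) {}

  // Returns whether the value behind dictionary index 'index' passes the
  // filter. The filter runs only the first time an index is seen.
  bool passes(uint32_t index) {
    auto& state = cache_[index];
    if (state == State::kUnknown) {
      ++evaluations_;
      state = filter_(dictionary_[index]) ? State::kPass : State::kFail;
    }
    return state == State::kPass;
  }

  // Number of times the filter was actually evaluated.
  std::size_t evaluations() const {
    return evaluations_;
  }

 private:
  enum class State : uint8_t { kUnknown, kPass, kFail };

  std::vector<int64_t> dictionary_;
  std::function<bool(int64_t)> filter_;
  std::vector<State> cache_;
  std::size_t evaluations_{0};
};
```

With a small dictionary and many rows, the number of filter evaluations is bounded by the dictionary size rather than by the row count.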

Yuhta · 2025-10-07T17:50:33Z

velox/connectors/hive/HiveConnectorUtil.cpp

      subfieldSpecs.clear();
    }
  }
-


Let's just revert these whitespace changes?

jaystarshot · 2025-10-07T18:36:51Z

Thanks @Yuhta for the review. Yeah, I think that should be done for non-skipped rowgroups.

aditi-pandit

These are mostly code writing comments. Adding Ying for a review of the logic.

aditi-pandit · 2025-10-08T14:47:21Z

velox/dwio/parquet/reader/Metadata.cpp

+
+std::optional<int64_t> ColumnChunkMetaDataPtr::bloom_filter_offset() const {
+  if (hasBloomFilter()) {
+    return thriftColumnChunkPtr(ptr_)->meta_data.bloom_filter_offset;


Can the offset be 0? If the offset is 0, would it be preferable to return nullopt instead?
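
One possible shape for this, as a sketch only: `ColumnMetaData` below is a hypothetical stand-in for the Thrift-generated struct (with `isset` mimicking the generated presence flags), and `validBloomFilterOffset` is an invented name. It assumes a non-positive offset should be treated as absent, on the grounds that a Parquet file begins with the 4-byte `PAR1` magic, so no bloom filter can start at offset 0.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the Thrift-generated column metadata.
struct ColumnMetaData {
  int64_t bloomFilterOffset{0};
  struct IsSet {
    bool bloomFilterOffset{false};
  } isset;
};

// Returns the bloom filter offset only when the field is set and positive.
// Assumption: offset 0 cannot be valid, since the file starts with "PAR1".
std::optional<int64_t> validBloomFilterOffset(const ColumnMetaData& metaData) {
  if (!metaData.isset.bloomFilterOffset || metaData.bloomFilterOffset <= 0) {
    return std::nullopt;
  }
  return metaData.bloomFilterOffset;
}
```

Callers then only need to test the returned optional, instead of checking both the presence flag and the value.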

aditi-pandit · 2025-10-08T14:47:50Z

velox/dwio/parquet/reader/Metadata.cpp

+      thriftColumnChunkPtr(ptr_)->meta_data.__isset.bloom_filter_offset;
+}
+
+std::optional<int64_t> ColumnChunkMetaDataPtr::bloom_filter_offset() const {


Nit: the function name should be camel case: bloomFilterOffset().

aditi-pandit · 2025-10-08T14:48:14Z

velox/dwio/parquet/reader/Metadata.cpp

+  return thriftColumnChunkPtr(ptr_)->meta_data.encoding_stats;
+}
+
+const std::vector<thrift::Encoding::type>& ColumnChunkMetaDataPtr::getEncoding()


Nit: the function name should be getEncodings().

aditi-pandit · 2025-10-08T14:50:45Z

velox/dwio/parquet/reader/ParquetData.cpp

+  auto parquetData =
+      std::make_unique<ParquetData>(type, metaData_, pool(), sessionTimezone_);
+  // Set the BufferedInput if available
+  if (bufferedInput_) {


Why is this related to your change? Can this be an independent change?

This is related: we need bufferedInput_ to read the first page.

aditi-pandit · 2025-10-08T14:51:01Z

velox/dwio/parquet/reader/ParquetData.cpp

 #include "velox/dwio/common/BufferedInput.h"
 #include "velox/dwio/parquet/reader/ParquetStatsContext.h"

+#include <thrift/protocol/TCompactProtocol.h>


Move these headers before the velox ones.

aditi-pandit · 2025-10-08T15:03:54Z

velox/dwio/parquet/reader/ParquetData.cpp

+    const ColumnChunkMetaDataPtr& columnChunk) {
+  // Use existing stream if available
+  if (rowGroupId < streams_.size() && streams_[rowGroupId]) {
+    return std::move(streams_[rowGroupId]);


Do you need std::move() here?

Also, if streams_ is a member owned by this class, then it's not a good idea to return this unique_ptr, as that moves ownership from the class to a local variable in the caller of this function.
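
The ownership concern can be seen in a small standalone sketch. `Stream` and `StreamHolder` are hypothetical names, not the Velox classes. Returning a member `std::unique_ptr` by value does require `std::move` (the member is an lvalue of a non-copyable type), and after the move the slot inside the class is null, so a later call finds nothing there. A non-owning accessor avoids that.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for an input stream.
struct Stream {
  std::size_t id;
};

class StreamHolder {
 public:
  explicit StreamHolder(std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
      streams_.push_back(std::make_unique<Stream>(Stream{i}));
    }
  }

  // Transfers ownership to the caller; the slot in streams_ becomes null.
  std::unique_ptr<Stream> takeStream(std::size_t index) {
    return std::move(streams_[index]);
  }

  // Non-owning access; the holder keeps the stream.
  Stream* peekStream(std::size_t index) const {
    return streams_[index].get();
  }

  bool hasStream(std::size_t index) const {
    return index < streams_.size() && streams_[index] != nullptr;
  }

 private:
  std::vector<std::unique_ptr<Stream>> streams_;
};
```

Whether taking the stream is acceptable depends on whether anything else expects to find it in the holder afterwards.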

aditi-pandit · 2025-10-08T15:07:23Z

velox/dwio/parquet/reader/ParquetData.cpp

+
+bool ParquetData::testFilterAgainstDictionary(
+    uint32_t rowGroupId,
+    const common::Filter* filter,


Why is this a pointer and not a reference?

aditi-pandit · 2025-10-08T15:09:47Z

velox/dwio/parquet/reader/ParquetData.cpp

+
+  auto dictionaryPtr = readDictionaryPageForFiltering(rowGroupId, columnChunk);
+  if (!dictionaryPtr->values || dictionaryPtr->numValues == 0) {
+    return true;


Why does this return true and not false?

aditi-pandit · 2025-10-08T15:10:58Z

velox/dwio/parquet/reader/ParquetData.cpp

+}
+
+// Helper methods for EncodingStats analysis (like Java Presto)
+bool ParquetData::hasDictionaryPages(


These methods needn't be class methods. They can be added in an anonymous namespace.

aditi-pandit · 2025-10-08T15:11:16Z

velox/dwio/parquet/reader/ParquetData.cpp

+
+bool ParquetData::hasNonDictionaryEncodedPages(
+    const std::vector<thrift::PageEncodingStats>& stats) {
+  for (const auto& pageStats : stats) {


Same as above. This can be a method in an anonymous namespace.
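
A sketch of what the two helpers could look like as free functions in an anonymous namespace. The enums and `PageEncodingStats` below are simplified, hypothetical stand-ins for the Thrift types, and `isFullyDictionaryEncoded` is an invented wrapper; the logic follows the usual encoding-stats rule that a column chunk is fully dictionary encoded when it has a dictionary page and no data page uses a non-dictionary encoding.

```cpp
#include <cstdint>
#include <vector>

// Simplified, hypothetical stand-ins for the Thrift-generated types.
enum class PageType { kDataPage, kIndexPage, kDictionaryPage, kDataPageV2 };
enum class Encoding { kPlain, kPlainDictionary, kRle, kRleDictionary };

struct PageEncodingStats {
  PageType pageType;
  Encoding encoding;
  int32_t count;
};

namespace {

bool isDataPage(PageType type) {
  return type == PageType::kDataPage || type == PageType::kDataPageV2;
}

bool isDictionaryEncoding(Encoding encoding) {
  return encoding == Encoding::kPlainDictionary ||
      encoding == Encoding::kRleDictionary;
}

// True if the chunk contains at least one dictionary page.
bool hasDictionaryPages(const std::vector<PageEncodingStats>& stats) {
  for (const auto& pageStats : stats) {
    if (pageStats.pageType == PageType::kDictionaryPage &&
        pageStats.count > 0) {
      return true;
    }
  }
  return false;
}

// True if any data page uses an encoding other than a dictionary encoding.
bool hasNonDictionaryEncodedPages(const std::vector<PageEncodingStats>& stats) {
  for (const auto& pageStats : stats) {
    if (isDataPage(pageStats.pageType) &&
        !isDictionaryEncoding(pageStats.encoding) && pageStats.count > 0) {
      return true;
    }
  }
  return false;
}

} // namespace

// The dictionary covers every value only if both conditions hold.
bool isFullyDictionaryEncoded(const std::vector<PageEncodingStats>& stats) {
  return hasDictionaryPages(stats) && !hasNonDictionaryEncodedPages(stats);
}
```

Keeping the helpers file-local means they do not need to appear in the class header at all.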

aditi-pandit · 2025-10-08T15:14:07Z

@yingsu00: Please review. I'm not able to add you as a reviewer.

meta-cla bot added the CLA Signed label Sep 19, 2025

jaystarshot changed the title ~~feat: support parquet dictionary filter based rowgroup skipping~~ feat: support parquet dictionary filter rowgroup skipping Sep 19, 2025

jaystarshot changed the title ~~feat: support parquet dictionary filter rowgroup skipping~~ feat: support parquet dictionary filter rowgroup skipping (Plain_dictionary) encoding Sep 19, 2025

jaystarshot changed the title ~~feat: support parquet dictionary filter rowgroup skipping (Plain_dictionary) encoding~~ feat: support parquet dictionary filter rowgroup skipping Sep 22, 2025

jaystarshot force-pushed the jay-oss-dict-filter branch 2 times, most recently from 81228f9 to bf0a6e0 Compare September 24, 2025 17:42

jaystarshot changed the title ~~feat: support parquet dictionary filter rowgroup skipping~~ feat[parquet]: implement dictionary based rowgroup skipping for dictionary encoded data Sep 24, 2025

jaystarshot changed the title ~~feat[parquet]: implement dictionary based rowgroup skipping for dictionary encoded data~~ feat[parquet]: Implement dictionary based rowgroup skipping for dictionary encoded data Sep 24, 2025

jaystarshot force-pushed the jay-oss-dict-filter branch 2 times, most recently from 762ec1f to 4ca36e8 Compare September 24, 2025 17:45

jaystarshot marked this pull request as ready for review September 24, 2025 17:45

jaystarshot requested a review from majetideepak as a code owner September 24, 2025 17:45

Support rowgroup skipping based on parquet dict encoding

1aa7915

jaystarshot force-pushed the jay-oss-dict-filter branch from 4ca36e8 to 1aa7915 Compare September 24, 2025 17:48

Improve with unique ptr

cd59333

jaystarshot changed the title ~~feat[parquet]: Implement dictionary based rowgroup skipping for dictionary encoded data~~ feat: Implement dictionary based rowgroup skipping for dictionary encoded data Oct 7, 2025

jaystarshot force-pushed the jay-oss-dict-filter branch 2 times, most recently from 6168313 to 3cfd907 Compare October 7, 2025 05:02

add test case

b5b46c6

jaystarshot force-pushed the jay-oss-dict-filter branch from 3cfd907 to b5b46c6 Compare October 7, 2025 05:09

Yuhta reviewed Oct 7, 2025

jaystarshot force-pushed the jay-oss-dict-filter branch 3 times, most recently from 47181f9 to 01cd836 Compare October 7, 2025 20:09

fix spacing

319426a

jaystarshot force-pushed the jay-oss-dict-filter branch from 01cd836 to 319426a Compare October 7, 2025 20:09

jaystarshot added 2 commits October 7, 2025 20:10

fix spacing

f508d78

fix spacing

2706c36

aditi-pandit reviewed Oct 8, 2025
