Do not create dictionary for high-cardinality columns by KKcorps · Pull Request #9527 · apache/pinot

KKcorps · 2022-10-04T11:02:44Z

Disable dictionaries for columns with JSON and TEXT indexes.
Extend the optimizeDictionary config to fixed-width dimension columns as well.

The objective is to reduce segment size and server disk/memory usage.

Release Notes

Added a new config, optimizeDictionary, to reduce the memory used by dictionaries.

Documentation

To enable this feature, add the following to your table config:

"tableIndexConfig" : {
 "optimizeDictionary" : true
}

codecov-commenter · 2022-10-04T11:41:00Z

Codecov Report

Merging #9527 (a5026af) into master (f7c5511) will decrease coverage by 54.16%.
The diff coverage is 0.00%.

@@              Coverage Diff              @@
##             master    #9527       +/-   ##
=============================================
- Coverage     70.00%   15.83%   -54.17%     
+ Complexity     4933      175     -4758     
=============================================
  Files          1946     1912       -34     
  Lines        104280   102715     -1565     
  Branches      15808    15624      -184     
=============================================
- Hits          72998    16265    -56733     
- Misses        26157    85253    +59096     
+ Partials       5125     1197     -3928

Flag	Coverage Δ
integration1	`?`
integration2	`?`
unittests1	`?`
unittests2	`15.83% <0.00%> (+0.14%)`	⬆️

Impacted Files	Coverage Δ
...ment/creator/impl/SegmentColumnarIndexCreator.java	`0.00% <0.00%> (-79.61%)`	⬇️
...ot/segment/spi/creator/SegmentGeneratorConfig.java	`0.00% <0.00%> (-82.00%)`	⬇️
.../apache/pinot/spi/config/table/IndexingConfig.java	`0.00% <0.00%> (-90.70%)`	⬇️
...src/main/java/org/apache/pinot/sql/FilterKind.java	`0.00% <0.00%> (-100.00%)`	⬇️
...ain/java/org/apache/pinot/core/data/table/Key.java	`0.00% <0.00%> (-100.00%)`	⬇️
...in/java/org/apache/pinot/spi/utils/BytesUtils.java	`0.00% <0.00%> (-100.00%)`	⬇️
...n/java/org/apache/pinot/core/data/table/Table.java	`0.00% <0.00%> (-100.00%)`	⬇️
.../java/org/apache/pinot/core/data/table/Record.java	`0.00% <0.00%> (-100.00%)`	⬇️
.../java/org/apache/pinot/core/util/GroupByUtils.java	`0.00% <0.00%> (-100.00%)`	⬇️
...java/org/apache/pinot/spi/trace/BaseRecording.java	`0.00% <0.00%> (-100.00%)`	⬇️
... and 1515 more

somandal · 2022-10-04T16:25:47Z

...in/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentColumnarIndexCreator.java

Now that this config has been changed to work for both metric and dimension fields, shouldn't isOptimizeDictionaryForMetrics() and the associated variable indicate that it applies to both? Maybe this should be a separate flag for dimensions, since it will be hard to modify existing configs across all tables to rename this one.

In the IndexingConfig there is even a comment that'll need to be updated:

/**
 * If `optimizeDictionaryForMetrics` enabled, dictionary is not created for the metric columns
 * for which rawIndexSize / forwardIndexSize is less than the `noDictionarySizeRatioThreshold`.
 */

Yeah, it makes sense. I am planning to introduce a new config altogether even for json/text.

On second thought, it will get pretty confusing for users. I have decided to rename the config to optimizeDictionary and keep the old config as well for backward compatibility. I have added deprecation comments to the old one.
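
The size-ratio rule quoted from the IndexingConfig comment in this thread can be sketched roughly as follows. This is a minimal illustration, not the code in SegmentColumnarIndexCreator: the class name, method names, and the exact size formula (raw size compared against dictionary plus bit-packed forward index) are assumptions made for the example, and 0.85 is only an illustrative threshold value.

```java
// Hypothetical sketch of a "skip the dictionary when it does not save space" check.
// None of these names come from the Pinot code base.
final class DictionaryHeuristicSketch {

    // Bits needed to encode dictionary ids 0..cardinality-1 (at least 1 bit).
    public static int bitsPerValue(int cardinality) {
        if (cardinality <= 1) {
            return 1;
        }
        return 32 - Integer.numberOfLeadingZeros(cardinality - 1);
    }

    // Returns true when the raw (no-dictionary) encoding is sufficiently larger than
    // the dictionary-encoded one, i.e. the dictionary is worth keeping.
    public static boolean shouldCreateDictionary(long numDocs, int cardinality,
                                                 int valueWidthBytes, double threshold) {
        // Assumed cost model for a fixed-width, single-value column.
        long dictionaryBytes = (long) cardinality * valueWidthBytes;
        long forwardIndexBytes = (numDocs * bitsPerValue(cardinality) + 7) / 8;
        double rawBytes = (double) numDocs * valueWidthBytes;

        double ratio = rawBytes / (dictionaryBytes + forwardIndexBytes);
        return ratio > threshold;
    }

    public static void main(String[] args) {
        // All-unique LONG column: the dictionary only adds overhead.
        System.out.println(shouldCreateDictionary(1_000_000L, 1_000_000, 8, 0.85));
        // Low-cardinality LONG column: the dictionary compresses well.
        System.out.println(shouldCreateDictionary(1_000_000L, 100, 8, 0.85));
    }
}
```

Under this assumed cost model, a column whose values are nearly all distinct falls below the threshold and is stored raw, while a low-cardinality column keeps its dictionary.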

somandal · 2022-10-12T04:40:05Z

...in/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentColumnarIndexCreator.java

I see that you've renamed the original config isOptimizeDictionaryForMetrics. Won't this cause backward compatibility issues with tables out there that already use isOptimizeDictionaryForMetrics?

I don't want to keep multiple flags as it creates a lot of confusion for new users. Also afaik, this config is pretty new (introduced in 0.10) and is not used by a lot of folks. I understand your concern though.

@KKcorps we did some validation on our end and no one is using this config yet, so nothing should break on our end with your change. I'm okay with changing the config name.

I asked some users, and it seems they are using the old config. For now, I have kept both but marked the older one as deprecated.

Jackie-Jiang · 2022-10-12T07:22:48Z

pinot-spi/src/main/java/org/apache/pinot/spi/config/table/IndexingConfig.java

Long term, we want to have a flag to automatically choose whether to use a fixed-length dictionary, a var-length dictionary, or a raw index. Not sure if we want to name it optimizeDictionary, but I don't have a good name either.

Why do we need config for that? Isn't it determined by data type of the column?

We want a config to give more control to the user so that they can explicitly choose the index to create if desired. In certain scenarios, raw index might be able to save space, but not good for query performance

I think that is out of scope for this PR. That will anyways not be a boolean config so we can decide on that later. The config name for that would be more like dictionaryType

After a second thought, need more discussion

Jackie-Jiang

I can see that for json and text column, we might not want to create dictionary, but for other dimensions, in most cases we still want to create dictionaries, or a lot of indexes cannot be applied.
With the current change, for existing users who have optimize dictionary set for metrics, this will automatically apply that to dimensions, which can cause serious regression (inverted index cannot be added).
How about adding a config to only apply this to json/text column?

Jackie-Jiang · 2022-10-31T18:23:10Z

...in/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentColumnarIndexCreator.java

+
+

(nit) Remove extra empty lines

KKcorps · 2022-11-02T08:59:50Z

I can see that for json and text column, we might not want to create dictionary, but for other dimensions, in most cases we still want to create dictionaries, or a lot of indexes cannot be applied.
With the current change, for existing users who have optimize dictionary set for metrics, this will automatically apply that to dimensions, which can cause serious regression (inverted index cannot be added).
How about adding a config to only apply this to json/text column?

Actually, the reason for this change was to introduce this config for dimension columns (after complaints from users about space amplification and memory usage).
JSON and text indexes were added to the scope later.
IMO, what we can do then is introduce a separate config, optimizeDictionaryForDimensions, and mention the risk of setting it.

Users do have cases where they keep String columns as dimensions but don't really do any filtering on top of them.

Jackie-Jiang · 2022-11-05T05:28:58Z

In that case, we can keep both optimizeDictionary (applies to both dimensions and metrics) and optimizeDictionaryForMetrics (applies only to metrics) to avoid backward incompatibility. I don't see a case where a user wants to optimize only dimensions but not metrics.

KKcorps · 2022-11-21T19:13:03Z

In that case, we can keep both optimizeDictionary (applies to both dimensions and metrics) and optimizeDictionaryForMetrics (applies only to metrics) to avoid backward incompatibility. I don't see a case where a user wants to optimize only dimensions but not metrics.

Makes sense. Incorporated this now.
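
The agreed behavior (optimizeDictionary covers dimensions and metrics, while the older optimizeDictionaryForMetrics keeps covering metrics only) could be expressed as below. This is a hedged sketch of the flag resolution only: the class, enum, and method names are hypothetical, and the merged code may apply further conditions (for example on column width or index types) that are not modeled here.

```java
// Hypothetical sketch of how the two flags discussed above might be combined.
// These names are illustrative and are not taken from the Pinot code base.
final class OptimizeDictionaryFlagsSketch {

    public enum FieldKind { DIMENSION, METRIC }

    // A column is a candidate for dictionary optimization when the new flag is on
    // (any field kind), or when the old flag is on and the column is a metric.
    public static boolean isOptimizationCandidate(FieldKind kind,
                                                  boolean optimizeDictionary,
                                                  boolean optimizeDictionaryForMetrics) {
        if (optimizeDictionary) {
            return true;
        }
        return optimizeDictionaryForMetrics && kind == FieldKind.METRIC;
    }

    public static void main(String[] args) {
        // Existing table that only sets the old flag: dimensions are left alone.
        System.out.println(isOptimizationCandidate(FieldKind.DIMENSION, false, true));
        // Table that opts in to the new flag: dimensions are considered too.
        System.out.println(isOptimizationCandidate(FieldKind.DIMENSION, true, false));
    }
}
```

Resolving the flags this way means a table that already sets only the old flag sees no change for its dimension columns, which addresses the regression concern raised earlier in the review.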

Jackie-Jiang

LGTM. Please add a release note section and documentation for the newly added config.

KKcorps requested a review from Jackie-Jiang October 4, 2022 11:02

somandal reviewed Oct 4, 2022

somandal reviewed Oct 12, 2022

Jackie-Jiang reviewed Oct 12, 2022

KKcorps force-pushed the dict_size_patch branch from 8d877d3 to dbc5e48 Compare October 26, 2022 12:40

Kartik Khare added 3 commits October 28, 2022 19:30

Do not create dictionary for high-cardinality columns

05fb428

Refactor: rename config

b3d7d79

Add dimension fields in the test as well

be10c86

KKcorps force-pushed the dict_size_patch branch from dbc5e48 to 9702077 Compare October 28, 2022 14:00

Add old config back for backward compatibility

4c84cbb

KKcorps force-pushed the dict_size_patch branch from 9702077 to 4c84cbb Compare October 28, 2022 14:26

KKcorps requested a review from somandal October 28, 2022 15:52

Jackie-Jiang previously approved these changes Oct 31, 2022

Jackie-Jiang reviewed Oct 31, 2022

Use both old and new optimizeDictionary configs

a5026af

KKcorps requested review from Jackie-Jiang and removed request for somandal November 22, 2022 04:46

Jackie-Jiang added the feature, release-notes, and Configuration labels Nov 22, 2022

Jackie-Jiang approved these changes Nov 22, 2022

KKcorps merged commit 2140954 into apache:master Nov 23, 2022

4 participants
