Do not create dictionary for high-cardinality columns#9527
Do not create dictionary for high-cardinality columns#9527KKcorps merged 5 commits intoapache:masterfrom
Conversation
Codecov Report
@@ Coverage Diff @@
## master #9527 +/- ##
=============================================
- Coverage 70.00% 15.83% -54.17%
+ Complexity 4933 175 -4758
=============================================
Files 1946 1912 -34
Lines 104280 102715 -1565
Branches 15808 15624 -184
=============================================
- Hits 72998 16265 -56733
- Misses 26157 85253 +59096
+ Partials 5125 1197 -3928
Flags with carried forward coverage won't be shown. Click here to find out more.
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
now that this config has been changed to work for both metrics and dimension fields, shouldn't isOptimizeDictionaryForMetrics() and the associated variable indicate that it works for both? maybe this should be a separate flag for dimension since it'll be hard to modify existing configs across all tables to change the name of this one.
In the IndexingConfig there is even a comment that'll need to be updated:
/**
* If `optimizeDictionaryForMetrics` enabled, dictionary is not created for the metric columns
* for which rawIndexSize / forwardIndexSize is less than the `noDictionarySizeRatioThreshold`.
*/
There was a problem hiding this comment.
Yeah, it makes sense. I am planning to introduce a new config altogether even for json/text.
There was a problem hiding this comment.
On thinking about it, It will get pretty confusing for users. I have decided to finally change the config to optimizeDictionary and keep the old config as well for backward compatibility. Have added comments for deprecation though.
...in/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentColumnarIndexCreator.java
Outdated
Show resolved
Hide resolved
...in/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentColumnarIndexCreator.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
I see that you've renamed the original config isOptimizeDictionaryForMetrics. Won't this cause backward compatibility issues with tables out there that already use isOptimizeDictionaryForMetrics?
There was a problem hiding this comment.
I don't want to keep multiple flags as it creates a lot of confusion for new users. Also afaik, this config is pretty new (introduced in 0.10) and is not used by a lot of folks. I understand your concern though.
There was a problem hiding this comment.
@KKcorps we did some validation on our end and no one is using this config yet, so nothing should break on our end with your change. I'm okay with changing the config name.
There was a problem hiding this comment.
I asked some users and seems like they are using the old config. For now, I have kept both but marked the older one as deprecated.
There was a problem hiding this comment.
Long term wise, we want to have a flag to automatically choose whether to use fixed-length dictionary, var-length dictionary or raw index. Not sure if we want to name it optimizeDictionary, but I don't have a good name either
There was a problem hiding this comment.
Why do we need config for that? Isn't it determined by data type of the column?
There was a problem hiding this comment.
We want a config to give more control to the user so that they can explicitly choose the index to create if desired. In certain scenarios, raw index might be able to save space, but not good for query performance
There was a problem hiding this comment.
I think that is out of scope for this PR. That will anyways not be a boolean config so we can decide on that later. The config name for that would be more like dictionaryType
8d877d3 to
dbc5e48
Compare
dbc5e48 to
9702077
Compare
9702077 to
4c84cbb
Compare
After a second thought, need more discussion
Jackie-Jiang
left a comment
There was a problem hiding this comment.
I can see that for json and text column, we might not want to create dictionary, but for other dimensions, in most cases we still want to create dictionaries, or a lot of indexes cannot be applied.
With the current change, for existing users who have optimize dictionary set for metrics, this will automatically apply that to dimensions, which can cause serious regression (inverted index cannot be added).
How about adding a config to only apply this to json/text column?
|
|
||
|
|
There was a problem hiding this comment.
(nit) Remove extra empty lines
Actually the reason for this change was to introduce this config for dimension columns (after complaints about space amplification and memory usage from users). users do have cases where they keep String columns as dimensions but don't really do any filtering on top of them. |
|
In that case, we can keep both |
Makes sense. Incorporated this now. |
Jackie-Jiang
left a comment
There was a problem hiding this comment.
LGTM. Please add a release note section and documentation for the new added config
optimizeDictionaryconfig for dimension columns with fixed width as well.Objective is to reduce the segment size and server space/memory usage.
Release Notes
New config added
optimizeDictionaryto reduce memory usage due to dictionariesDocumentation
To enable this feature, add the following to your table config