-
Notifications
You must be signed in to change notification settings - Fork 3.9k
ARROW-5862: [Java] Provide dictionary builder #4813
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #4813 +/- ##
==========================================
+ Coverage 72.34% 89.7% +17.35%
==========================================
Files 831 670 -161
Lines 110383 99383 -11000
Branches 1418 0 -1418
==========================================
+ Hits 79853 89148 +9295
+ Misses 30168 10235 -19933
+ Partials 362 0 -362
Continue to review full report at Codecov.
|
...hm/src/main/java/org/apache/arrow/algorithm/dictionary/SearchTreeBasedDictionaryBuilder.java
Outdated
Show resolved
Hide resolved
java/algorithm/src/main/java/org/apache/arrow/algorithm/dictionary/DictionaryBuilder.java
Outdated
Show resolved
Hide resolved
emkornfield
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to collapse the the class hierarchy and refactor once we have a second example. It seems plausible that you could potentially make use of composition to achieve different algorithmic complexities rather then inheritance.
ed180da to
85fe336
Compare
|
@emkornfield I have opened a separate issue (ARROW-6021) to track the problem of moving copySafe method to ValueVector, because
Let's hold this PR until ARROW-6021 is resolved. Thank you so much. |
64b803c to
6be773b
Compare
|
@emkornfield @pravindra |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets make all of the fields private for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Sounds reasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you might want to clarify whether this class is intended to be used more then one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Revised.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is probably better named populateSortedDictionary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion. Revised accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at some point we might want to explore the possibility of making use of an off-heap datastructure if possible (but not for this PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion. We will provide an ArrowBuf based binary search tree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you might want a guard here to ensure this and getSortedDictionary aren't called until endBuild is called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion. We use a bool flag to mark the end of dictionary building. Please take a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have a code sample in the javadoc here might make it easier for users to understand the the intended call pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion. Sample code is added to the Javadoc.
emkornfield
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see comments.
6be773b to
215606c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify that this dictionary is NOT sorted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Good suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please include details on retrieving/populating the dictionary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised. Thanks.
|
A couple of documentation clarifications and then I think this can be merged. |
215606c to
d6038e4
Compare
|
Thinking about this a little bit more, why wouldn't we encode at the same time we are building the dictionary? |
@emkornfield This is suitable for scenarios when the dictionary is used as is (for some cases, the dictionary may need to be sorted by content or frequency, before encoding). So let's first support this basic scenario, and then support the online version in a separate issue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it important to keep the state of startBuild and endBuild. Is seems like we can set ValueCount whenever getDictionary is called if users want an incremental dictionary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Revised accordingly. Thanks.
d2f58a9 to
37878a4
Compare
...hm/src/main/java/org/apache/arrow/algorithm/dictionary/SearchTreeBasedDictionaryBuilder.java
Outdated
Show resolved
Hide resolved
...rc/test/java/org/apache/arrow/algorithm/dictionary/TestSearchTreeBasedDictionaryBuilder.java
Outdated
Show resolved
Hide resolved
37878a4 to
84ea87f
Compare
...hm/src/main/java/org/apache/arrow/algorithm/dictionary/SearchTreeBasedDictionaryBuilder.java
Outdated
Show resolved
Hide resolved
84ea87f to
2007b87
Compare
|
+1, thank you. |
The dictionary builder servers for the following scenario which is frequently encountered in practice when dictionary encoding is involved: the dictionary values are not known a priori, so they are determined dynamically, as new data arrive continually. In particular, when a new value arrives, it is tested to check if it is already in the dictionary. If so, it is simply neglected, otherwise, it is added to the dictionary. When all values have been evaluated, the dictionary can be considered complete. So encoding can start afterward. The code snippet using a dictionary builder should be like this: DictonaryBuilder<IntVector> dictionaryBuilder = ... dictionaryBuilder.startBuild(); ... dictionaryBuild.addValue(newValue); ... dictionaryBuilder.endBuild(); Closes apache#4813 from liyafan82/fly_0705_build and squashes the following commits: 2007b87 <liyafan82> Provide dictionary builder Authored-by: liyafan82 <fan_li_ya@foxmail.com> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
The dictionary builder servers for the following scenario which is frequently encountered in practice when dictionary encoding is involved: the dictionary values are not known a priori, so they are determined dynamically, as new data arrive continually.
In particular, when a new value arrives, it is tested to check if it is already in the dictionary. If so, it is simply neglected, otherwise, it is added to the dictionary.
When all values have been evaluated, the dictionary can be considered complete. So encoding can start afterward.
The code snippet using a dictionary builder should be like this:
DictonaryBuilder dictionaryBuilder = ...
dictionaryBuilder.startBuild();
...
dictionaryBuild.addValue(newValue);
...
dictionaryBuilder.endBuild();