Skip to content

Conversation

@liyafan82
Copy link
Contributor

The dictionary builder servers for the following scenario which is frequently encountered in practice when dictionary encoding is involved: the dictionary values are not known a priori, so they are determined dynamically, as new data arrive continually.

In particular, when a new value arrives, it is tested to check if it is already in the dictionary. If so, it is simply neglected, otherwise, it is added to the dictionary.

When all values have been evaluated, the dictionary can be considered complete. So encoding can start afterward.

The code snippet using a dictionary builder should be like this:

DictonaryBuilder dictionaryBuilder = ...
dictionaryBuilder.startBuild();
...
dictionaryBuild.addValue(newValue);
...
dictionaryBuilder.endBuild();

@codecov-io
Copy link

codecov-io commented Jul 6, 2019

Codecov Report

Merging #4813 into master will increase coverage by 17.35%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #4813       +/-   ##
==========================================
+ Coverage   72.34%   89.7%   +17.35%     
==========================================
  Files         831     670      -161     
  Lines      110383   99383    -11000     
  Branches     1418       0     -1418     
==========================================
+ Hits        79853   89148     +9295     
+ Misses      30168   10235    -19933     
+ Partials      362       0      -362
Impacted Files Coverage Δ
cpp/src/arrow/testing/gtest_util.h 97.36% <0%> (-2.64%) ⬇️
cpp/src/arrow/compute/kernel.h 61.16% <0%> (-1.53%) ⬇️
cpp/src/arrow/result.h 91.3% <0%> (-0.37%) ⬇️
python/pyarrow/tests/test_parquet.py 96.33% <0%> (-0.07%) ⬇️
cpp/src/arrow/table.h 100% <0%> (ø) ⬆️
cpp/src/arrow/flight/server.h 100% <0%> (ø) ⬆️
cpp/src/arrow/csv/column-builder.h 100% <0%> (ø) ⬆️
cpp/src/arrow/table_builder.h 100% <0%> (ø) ⬆️
cpp/src/plasma/plasma.h 100% <0%> (ø) ⬆️
cpp/src/gandiva/configuration.h 100% <0%> (ø) ⬆️
... and 760 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 167cea0...d6038e4. Read the comment docs.

Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to collapse the the class hierarchy and refactor once we have a second example. It seems plausible that you could potentially make use of composition to achieve different algorithmic complexities rather then inheritance.

@kszucs kszucs force-pushed the master branch 2 times, most recently from ed180da to 85fe336 Compare July 22, 2019 19:29
@liyafan82
Copy link
Contributor Author

@emkornfield I have opened a separate issue (ARROW-6021) to track the problem of moving copySafe method to ValueVector, because

  1. The changes are not minor, so placing them in a separate issue makes it easier to review the changes.
  2. I have realized that the requirement is not only needed by this single feature. So I hope this problem can be resolved sooner.

Let's hold this PR until ARROW-6021 is resolved. Thank you so much.

@liyafan82
Copy link
Contributor Author

@emkornfield @pravindra
As the dependency (ARROW-6021) is resolved, I have revised this PR. Would you please take a look?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets make all of the fields private for now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Sounds reasonable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might want to clarify whether this class is intended to be used more then one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Revised.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is probably better named populateSortedDictionary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. Revised accordingly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at some point we might want to explore the possibility of making use of an off-heap datastructure if possible (but not for this PR).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. We will provide an ArrowBuf based binary search tree.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might want a guard here to ensure this and getSortedDictionary aren't called until endBuild is called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. We use a bool flag to mark the end of dictionary building. Please take a look.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have a code sample in the javadoc here might make it easier for users to understand the the intended call pattern.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. Sample code is added to the Javadoc.

Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see comments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify that this dictionary is NOT sorted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Good suggestion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please include details on retrieving/populating the dictionary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised. Thanks.

@emkornfield
Copy link
Contributor

A couple of documentation clarifications and then I think this can be merged.

@emkornfield
Copy link
Contributor

Thinking about this a little bit more, why wouldn't we encode at the same time we are building the dictionary?

@liyafan82
Copy link
Contributor Author

Thinking about this a little bit more, why wouldn't we encode at the same time we are building the dictionary?

@emkornfield
Good question and good suggestion. This is called "online encoding", for which building dictionary and encoding are carried out simultaneously.

This is suitable for scenarios when the dictionary is used as is (for some cases, the dictionary may need to be sorted by content or frequency, before encoding).

So let's first support this basic scenario, and then support the online version in a separate issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it important to keep the state of startBuild and endBuild. Is seems like we can set ValueCount whenever getDictionary is called if users want an incremental dictionary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Revised accordingly. Thanks.

@liyafan82 liyafan82 force-pushed the fly_0705_build branch 2 times, most recently from d2f58a9 to 37878a4 Compare August 13, 2019 10:52
@emkornfield
Copy link
Contributor

+1, thank you.

pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
The dictionary builder servers for the following scenario which is frequently encountered in practice when dictionary encoding is involved: the dictionary values are not known a priori, so they are determined dynamically, as new data arrive continually.

In particular, when a new value arrives, it is tested to check if it is already in the dictionary. If so, it is simply neglected, otherwise, it is added to the dictionary.

When all values have been evaluated, the dictionary can be considered complete. So encoding can start afterward.

The code snippet using a dictionary builder should be like this:

DictonaryBuilder<IntVector> dictionaryBuilder = ...
dictionaryBuilder.startBuild();
...
dictionaryBuild.addValue(newValue);
...
dictionaryBuilder.endBuild();

Closes apache#4813 from liyafan82/fly_0705_build and squashes the following commits:

2007b87 <liyafan82>  Provide dictionary builder

Authored-by: liyafan82 <fan_li_ya@foxmail.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants