-
Couldn't load subscription status.
- Fork 3.9k
ARROW-6113: [Java] Support vector deduplicate function #4993
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liyafan82 this is useful functionality, but I'm not sure about the interface. At least for me the functionality seems like it is intended for some form of RLE. It seems like it might pay to make a utility class with static methods of something like (I name the methods assuming RLE):
static void populateRunStartIndicators(ValueVector inputVector, BitVector runStarts);
void populateRunLengths(BitVector runStarts, IntVector runLengths );
// This might belong in a separate class, or there might be some existing funtionality in the code base
void selectValues(BitVector indicators, ValueVector inputVector, ValueVector outputVector);
At least to me its not clear from the documentation how this class class is supposed to be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emkornfield
I try to make this API general, and RLE is an important application of it.
In addition, deduplication is also intensively used in SQL runtime.
I implemented it as an class object, instead of using static methods, because the bit set to hold start positions is an internal state, so IMO, it should not be exposed to the outside.
However, I have also provided a utility class with utility methods as you suggested, to support more flexible scenarios.
cc72e80 to
f7d4e74
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this really is run leghts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. They are just run lengths. The method name has been revised.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| public class VectorDeduplicator<V extends ValueVector> implements AutoCloseable { | |
| public class VectorRunDeduplicator<V extends ValueVector> implements AutoCloseable { |
Deduplication is ambiguous, since dictionary building is another form of deduplication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable. Thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this makes a speed-memory tradeoff, it seems like for all the operations defined on the method they would be much more efficient using a representation that stores the actual indices of the run starts.
Another possible option is to use a RoaringBitmap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a general problem. In the design, we have considered 3 different representations:
- start indices represented as a bit set.
- start indices represented as an byte vector.
- an integer vector storing the start indices.
Generally speaking, 1 is cpu intensive, while 2 and 3 are memory intensive.
We finally selected option 1, because in the extreme case (where each element is distinct from its previous one), the memory requirement of 1 is modest (in practice, the input vector can be large), while the start indices for 3 can be larger than the original vector (e.g. the original vector is a TinyIntVector).
Would you please give some suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the use-case is to apply this universally and memory is a concern then as you note this approach is reasonable. As noted above, a Roaring Bitmap implementation would potentially give you even a small set size and also potentially make computation faster, but this is more complicated and can potentially be done as a follow-up.
I'm still not clear on the exact use-case here, but if this were for RLE per-se, you could simply abort the computation if the first N elements were all set to one. It really depends on the use-case, I figured it was worth a discussion in either case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the recommended paper. It seems like something useful and interesting. Let's do it in a follow-up.
I agree with you that the choice depends on the specific use case. For our use case (big data shuffling and external sort), memory is the main concern. So we chose this design.
However, it may not work well for cases where cpu resources is the main concern.
So what do you suggest for now? Do you think we should provide different implementations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll take another look at the implementation, but keeping as is should be OK for now.
On a general note it seems like you are adding a lot of algorithms that have potential design trade-offs. If you know the algorithms you are going to add, I think it would be helpful to write-up a short document describing the algorithms you intend to add and share it on the mailing list, so feedback can be provided ahead of time (it doesn't need to be long, but should probably cover, at a high level any on-heap components, and processing vs memory trade-off for future algorithms).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the good suggestion. I will start more discussions in the mainling list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the good suggestion. I will start more discussions in the mainling list.
Thank you. Probably just one thread with a list of planned algorithms is enough (a google doc for commenting might be helpful).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
javadoc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the kind reminder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
populuteDecuplicatedValues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| public void getValueLength(IntVector lengthVector) { | |
| public void populateRunLengths(IntVector lengthVector) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| public int getDistinctValueCount() { | |
| public int getRunCount() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is redundant with the description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised. Thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you prefer the object API, lets make this package private for the time being.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem.
ae27b3d to
1380957
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * Gets the number of values which are different from their precedence. | |
| * Gets the number of values which are different from their predecessor. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * Gets the vector with deduplicated adjacent valued removed. | |
| * Gets the vector with deduplicated adjacent values removed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved. Thanks for the careful review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
populateDeduplicated
(sorry if I had a typo in my original suggestion)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am sorry. I just copied and pasted, without checking it. This is not a good habit.
1380957 to
26f9bfd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you rebase, I think this API has been eliminated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emkornfield Thank you for the kind reminder. This is blocked by another issue: ARROW-6266. Let's fix that one first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emkornfield Rebased. Please take a look. Thank you.
26f9bfd to
611dcd2
Compare
Codecov Report
@@ Coverage Diff @@
## master #4993 +/- ##
==========================================
+ Coverage 87.62% 89.74% +2.11%
==========================================
Files 1014 682 -332
Lines 145908 101786 -44122
Branches 1437 0 -1437
==========================================
- Hits 127857 91343 -36514
+ Misses 17689 10443 -7246
+ Partials 362 0 -362
Continue to review full report at Codecov.
|
| BitVectorHelper.setValidityBitToOne(runStarts, 0); | ||
|
|
||
| for (int i = 1; i < vector.getValueCount(); i++) { | ||
| RangeEqualsVisitor visitor = new RangeEqualsVisitor(vector, i - 1, i, 1, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/param_name/=false for the last two element.
For use-cases like this it seems like there should be a better API to avoid recreatingt his object each time. Maybe getting rid of the equals was premature? We can potentially revisit this in a follow-up PR.
CC @pravindra
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can modify the RangeEqualVisitor API (or create a wrapper visitor) to not require recreating the object each time.
dupChecker = DuplicateChecker(vector);
if (dupChecker.isDuplicate(i - 1 /*srcIndex*/, i /*dstIndex*/)) {
/* do something */
}
I think the equals API in ValueVector cannot be efficient - will require too many repetitive checks for it to be safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emkornfield @pravindra
Good point. I think RangeEqualsVisitor should be made reusable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liyafan82 please add /parame_name=/ for the last two parameters or make them constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised. Thanks for the good suggestion.
|
I fixed a comment, will merge once green. |
Thanks for the fix. |
|
+1, thank you. |
According to the discussion in #4993 (comment), we often encountered this scenario: we compare values repeatedly. The comparisons differs only in the parameters (vector to compare, start index, etc). According to the current API, we have to create a new RangeEqualVisitor object each time the comparison is performed. This leads to non-trivial performance overhead. To address this problem, we make the RangeEqualVisitor reusable, and allow the client to change parameters of an existing visitor. Closes #5195 from liyafan82/fly_0826_reuse and squashes the following commits: ffe0e6a <liyafan82> Merge pull request #1 from pravindra/pull-5195 073bc78 <Pindikura Ravindra> Test: Move out Range from the visitor params 7482414 <liyafan82> Wrapper visit parameters into a pojo 53c1e0b <liyafan82> Merge branch 'master' into fly_0826_reuse a1f7046 <liyafan82> Make range equal visitor reusable Lead-authored-by: liyafan82 <fan_li_ya@foxmail.com> Co-authored-by: Pindikura Ravindra <ravindra@dremio.com> Co-authored-by: liyafan82 <42827532+liyafan82@users.noreply.github.com> Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>
According to the discussion in apache#4993 (comment), we often encountered this scenario: we compare values repeatedly. The comparisons differs only in the parameters (vector to compare, start index, etc). According to the current API, we have to create a new RangeEqualVisitor object each time the comparison is performed. This leads to non-trivial performance overhead. To address this problem, we make the RangeEqualVisitor reusable, and allow the client to change parameters of an existing visitor. Closes apache#5195 from liyafan82/fly_0826_reuse and squashes the following commits: ffe0e6a <liyafan82> Merge pull request #1 from pravindra/pull-5195 073bc78 <Pindikura Ravindra> Test: Move out Range from the visitor params 7482414 <liyafan82> Wrapper visit parameters into a pojo 53c1e0b <liyafan82> Merge branch 'master' into fly_0826_reuse a1f7046 <liyafan82> Make range equal visitor reusable Lead-authored-by: liyafan82 <fan_li_ya@foxmail.com> Co-authored-by: Pindikura Ravindra <ravindra@dremio.com> Co-authored-by: liyafan82 <42827532+liyafan82@users.noreply.github.com> Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>
According to the discussion in apache/arrow#4993 (comment), we often encountered this scenario: we compare values repeatedly. The comparisons differs only in the parameters (vector to compare, start index, etc). According to the current API, we have to create a new RangeEqualVisitor object each time the comparison is performed. This leads to non-trivial performance overhead. To address this problem, we make the RangeEqualVisitor reusable, and allow the client to change parameters of an existing visitor. Closes #5195 from liyafan82/fly_0826_reuse and squashes the following commits: ffe0e6a4e <liyafan82> Merge pull request #1 from pravindra/pull-5195 073bc78d9 <Pindikura Ravindra> Test: Move out Range from the visitor params 7482414c9 <liyafan82> Wrapper visit parameters into a pojo 53c1e0b63 <liyafan82> Merge branch 'master' into fly_0826_reuse a1f704636 <liyafan82> Make range equal visitor reusable Lead-authored-by: liyafan82 <fan_li_ya@foxmail.com> Co-authored-by: Pindikura Ravindra <ravindra@dremio.com> Co-authored-by: liyafan82 <42827532+liyafan82@users.noreply.github.com> Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>
Remove adjacent deduplicated elements from a vector. This function can be used, for example, in finding distinct values, or in compressing the vector data. Closes apache#4993 from liyafan82/fly_0722_dedup and squashes the following commits: c53cbdb <liyafan82> Add comment for parameters bd1fabd <emkornfield> Update VectorRunDeduplicator.java 611dcd2 <liyafan82> Support vector deduplicate function Lead-authored-by: liyafan82 <fan_li_ya@foxmail.com> Co-authored-by: emkornfield <emkornfield@gmail.com> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
Remove adjacent deduplicated elements from a vector. This function can be used, for example, in finding distinct values, or in compressing the vector data.