Skip to content

Conversation

@liyafan82
Copy link
Contributor

Remove adjacent deduplicated elements from a vector. This function can be used, for example, in finding distinct values, or in compressing the vector data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liyafan82 this is useful functionality, but I'm not sure about the interface. At least for me the functionality seems like it is intended for some form of RLE. It seems like it might pay to make a utility class with static methods of something like (I name the methods assuming RLE):

static void populateRunStartIndicators(ValueVector inputVector, BitVector runStarts);
void populateRunLengths(BitVector runStarts,  IntVector runLengths );
// This might belong in a separate class, or there might be some existing funtionality in the code base
void selectValues(BitVector indicators, ValueVector inputVector, ValueVector outputVector);

At least to me its not clear from the documentation how this class class is supposed to be used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield
I try to make this API general, and RLE is an important application of it.
In addition, deduplication is also intensively used in SQL runtime.

I implemented it as an class object, instead of using static methods, because the bit set to hold start positions is an internal state, so IMO, it should not be exposed to the outside.

However, I have also provided a utility class with utility methods as you suggested, to support more flexible scenarios.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this really is run leghts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. They are just run lengths. The method name has been revised.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public class VectorDeduplicator<V extends ValueVector> implements AutoCloseable {
public class VectorRunDeduplicator<V extends ValueVector> implements AutoCloseable {

Deduplication is ambiguous, since dictionary building is another form of deduplication.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable. Thank you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes a speed-memory tradeoff, it seems like for all the operations defined on the method they would be much more efficient using a representation that stores the actual indices of the run starts.

Another possible option is to use a RoaringBitmap.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a general problem. In the design, we have considered 3 different representations:

  1. start indices represented as a bit set.
  2. start indices represented as an byte vector.
  3. an integer vector storing the start indices.

Generally speaking, 1 is cpu intensive, while 2 and 3 are memory intensive.

We finally selected option 1, because in the extreme case (where each element is distinct from its previous one), the memory requirement of 1 is modest (in practice, the input vector can be large), while the start indices for 3 can be larger than the original vector (e.g. the original vector is a TinyIntVector).

Would you please give some suggestions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the use-case is to apply this universally and memory is a concern then as you note this approach is reasonable. As noted above, a Roaring Bitmap implementation would potentially give you even a small set size and also potentially make computation faster, but this is more complicated and can potentially be done as a follow-up.

I'm still not clear on the exact use-case here, but if this were for RLE per-se, you could simply abort the computation if the first N elements were all set to one. It really depends on the use-case, I figured it was worth a discussion in either case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the recommended paper. It seems like something useful and interesting. Let's do it in a follow-up.

I agree with you that the choice depends on the specific use case. For our use case (big data shuffling and external sort), memory is the main concern. So we chose this design.

However, it may not work well for cases where cpu resources is the main concern.

So what do you suggest for now? Do you think we should provide different implementations?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take another look at the implementation, but keeping as is should be OK for now.

On a general note it seems like you are adding a lot of algorithms that have potential design trade-offs. If you know the algorithms you are going to add, I think it would be helpful to write-up a short document describing the algorithms you intend to add and share it on the mailing list, so feedback can be provided ahead of time (it doesn't need to be long, but should probably cover, at a high level any on-heap components, and processing vs memory trade-off for future algorithms).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the good suggestion. I will start more discussions in the mainling list.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the good suggestion. I will start more discussions in the mainling list.

Thank you. Probably just one thread with a list of planned algorithms is enough (a google doc for commenting might be helpful).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

javadoc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the kind reminder.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

populuteDecuplicatedValues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public void getValueLength(IntVector lengthVector) {
public void populateRunLengths(IntVector lengthVector) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised. Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public int getDistinctValueCount() {
public int getRunCount() {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised. Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is redundant with the description.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised. Thank you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you prefer the object API, lets make this package private for the time being.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem.

@liyafan82 liyafan82 force-pushed the fly_0722_dedup branch 2 times, most recently from ae27b3d to 1380957 Compare August 14, 2019 04:07
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Gets the number of values which are different from their precedence.
* Gets the number of values which are different from their predecessor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised. Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Gets the vector with deduplicated adjacent valued removed.
* Gets the vector with deduplicated adjacent values removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved. Thanks for the careful review.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

populateDeduplicated

(sorry if I had a typo in my original suggestion)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry. I just copied and pasted, without checking it. This is not a good habit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you rebase, I think this API has been eliminated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield Thank you for the kind reminder. This is blocked by another issue: ARROW-6266. Let's fix that one first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield Rebased. Please take a look. Thank you.

@codecov-io
Copy link

codecov-io commented Aug 20, 2019

Codecov Report

Merging #4993 into master will increase coverage by 2.11%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4993      +/-   ##
==========================================
+ Coverage   87.62%   89.74%   +2.11%     
==========================================
  Files        1014      682     -332     
  Lines      145908   101786   -44122     
  Branches     1437        0    -1437     
==========================================
- Hits       127857    91343   -36514     
+ Misses      17689    10443    -7246     
+ Partials      362        0     -362
Impacted Files Coverage Δ
cpp/src/parquet/arrow/reader.h 65% <0%> (-15%) ⬇️
cpp/src/arrow/testing/util.h 94.44% <0%> (-5.56%) ⬇️
cpp/src/parquet/arrow/reader.cc 81.43% <0%> (-3.48%) ⬇️
cpp/src/parquet/properties_test.cc 96.96% <0%> (-3.04%) ⬇️
cpp/src/plasma/thirdparty/ae/ae.c 70.75% <0%> (-0.95%) ⬇️
python/pyarrow/tests/test_array.py 95.31% <0%> (-0.69%) ⬇️
cpp/src/arrow/array.cc 82.23% <0%> (-0.15%) ⬇️
cpp/src/arrow/record_batch.cc 87.59% <0%> (-0.1%) ⬇️
cpp/src/arrow/table.h 100% <0%> (ø) ⬆️
cpp/src/arrow/table_test.cc 100% <0%> (ø) ⬆️
... and 377 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a40d6b6...c53cbdb. Read the comment docs.

BitVectorHelper.setValidityBitToOne(runStarts, 0);

for (int i = 1; i < vector.getValueCount(); i++) {
RangeEqualsVisitor visitor = new RangeEqualsVisitor(vector, i - 1, i, 1, false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/param_name/=false for the last two element.

For use-cases like this it seems like there should be a better API to avoid recreatingt his object each time. Maybe getting rid of the equals was premature? We can potentially revisit this in a follow-up PR.

CC @pravindra

Copy link
Contributor

@pravindra pravindra Aug 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can modify the RangeEqualVisitor API (or create a wrapper visitor) to not require recreating the object each time.

dupChecker = DuplicateChecker(vector);
if (dupChecker.isDuplicate(i - 1 /*srcIndex*/, i /*dstIndex*/)) {
/* do something */
}

I think the equals API in ValueVector cannot be efficient - will require too many repetitive checks for it to be safe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield @pravindra
Good point. I think RangeEqualsVisitor should be made reusable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liyafan82 please add /parame_name=/ for the last two parameters or make them constant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised. Thanks for the good suggestion.

@emkornfield
Copy link
Contributor

I fixed a comment, will merge once green.

@liyafan82
Copy link
Contributor Author

I fixed a comment, will merge once green.

Thanks for the fix.

@emkornfield
Copy link
Contributor

+1, thank you.

pravindra pushed a commit that referenced this pull request Sep 4, 2019
According to the discussion in #4993 (comment), we often encountered this scenario: we compare values repeatedly. The comparisons differs only in the parameters (vector to compare, start index, etc).

According to the current API, we have to create a new RangeEqualVisitor object each time the comparison is performed. This leads to non-trivial performance overhead.

To address this problem, we make the RangeEqualVisitor reusable, and allow the client to change parameters of an existing visitor.

Closes #5195 from liyafan82/fly_0826_reuse and squashes the following commits:

ffe0e6a <liyafan82> Merge pull request #1 from pravindra/pull-5195
073bc78 <Pindikura Ravindra> Test: Move out Range from the visitor params
7482414 <liyafan82>  Wrapper visit parameters into a pojo
53c1e0b <liyafan82> Merge branch 'master' into fly_0826_reuse
a1f7046 <liyafan82>  Make range equal visitor reusable

Lead-authored-by: liyafan82 <fan_li_ya@foxmail.com>
Co-authored-by: Pindikura Ravindra <ravindra@dremio.com>
Co-authored-by: liyafan82 <42827532+liyafan82@users.noreply.github.com>
Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>
projjal pushed a commit to projjal/arrow that referenced this pull request Mar 8, 2020
According to the discussion in apache#4993 (comment), we often encountered this scenario: we compare values repeatedly. The comparisons differs only in the parameters (vector to compare, start index, etc).

According to the current API, we have to create a new RangeEqualVisitor object each time the comparison is performed. This leads to non-trivial performance overhead.

To address this problem, we make the RangeEqualVisitor reusable, and allow the client to change parameters of an existing visitor.

Closes apache#5195 from liyafan82/fly_0826_reuse and squashes the following commits:

ffe0e6a <liyafan82> Merge pull request #1 from pravindra/pull-5195
073bc78 <Pindikura Ravindra> Test: Move out Range from the visitor params
7482414 <liyafan82>  Wrapper visit parameters into a pojo
53c1e0b <liyafan82> Merge branch 'master' into fly_0826_reuse
a1f7046 <liyafan82>  Make range equal visitor reusable

Lead-authored-by: liyafan82 <fan_li_ya@foxmail.com>
Co-authored-by: Pindikura Ravindra <ravindra@dremio.com>
Co-authored-by: liyafan82 <42827532+liyafan82@users.noreply.github.com>
Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>
kou pushed a commit to apache/arrow-java that referenced this pull request Nov 25, 2024
According to the discussion in apache/arrow#4993 (comment), we often encountered this scenario: we compare values repeatedly. The comparisons differs only in the parameters (vector to compare, start index, etc).

According to the current API, we have to create a new RangeEqualVisitor object each time the comparison is performed. This leads to non-trivial performance overhead.

To address this problem, we make the RangeEqualVisitor reusable, and allow the client to change parameters of an existing visitor.

Closes #5195 from liyafan82/fly_0826_reuse and squashes the following commits:

ffe0e6a4e <liyafan82> Merge pull request #1 from pravindra/pull-5195
073bc78d9 <Pindikura Ravindra> Test: Move out Range from the visitor params
7482414c9 <liyafan82>  Wrapper visit parameters into a pojo
53c1e0b63 <liyafan82> Merge branch 'master' into fly_0826_reuse
a1f704636 <liyafan82>  Make range equal visitor reusable

Lead-authored-by: liyafan82 <fan_li_ya@foxmail.com>
Co-authored-by: Pindikura Ravindra <ravindra@dremio.com>
Co-authored-by: liyafan82 <42827532+liyafan82@users.noreply.github.com>
Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
Remove adjacent deduplicated elements from a vector. This function can be used, for example, in finding distinct values, or in compressing the vector data.

Closes apache#4993 from liyafan82/fly_0722_dedup and squashes the following commits:

c53cbdb <liyafan82>  Add comment for parameters
bd1fabd <emkornfield> Update VectorRunDeduplicator.java
611dcd2 <liyafan82>  Support vector deduplicate function

Lead-authored-by: liyafan82 <fan_li_ya@foxmail.com>
Co-authored-by: emkornfield <emkornfield@gmail.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants