Skip to content

Conversation

@tianchen92
Copy link
Contributor

As discussed in #4792

Implement a hash table to only store hash & index, meanwhile add check equal function in ValueVector API.

@tianchen92
Copy link
Contributor Author

@emkornfield Hi Micah, I made a simple implementation for dictionary hash table.
1 Implement hash table to store dictionary vector hash & index, support lookup with given toEncode vector index.
2 add hashCode and equals methods in ValueVector API.
Please take a look and give some comments, thanks!

@tianchen92
Copy link
Contributor Author

@liyafan82 Thanks for your comments, This PR is just a demo as I commented as WIP and this will work for VarCharVector type encoding in TestDictionaryVector to test correctness. default and memory copy already commented as TODO to make it perfect since we have to use memory level hashCode and implement equals in all type vectors.

@tianchen92 tianchen92 changed the title ARROW-5902: [Java][WIP] Implement HashTable for dictionary encoding ARROW-5902: [Java] Implement hash table and equals & hashCode API for dictionary encoding Jul 13, 2019
@codecov-io
Copy link

codecov-io commented Jul 13, 2019

Codecov Report

Merging #4846 into master will increase coverage by 2.16%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4846      +/-   ##
==========================================
+ Coverage   87.42%   89.58%   +2.16%     
==========================================
  Files         994      661     -333     
  Lines      140102    96617   -43485     
  Branches     1418        0    -1418     
==========================================
- Hits       122481    86555   -35926     
+ Misses      17259    10062    -7197     
+ Partials      362        0     -362
Impacted Files Coverage Δ
cpp/src/gandiva/precompiled/decimal_ops.cc 97.17% <0%> (ø) ⬆️
cpp/src/gandiva/precompiled/decimal_ops_test.cc 97.43% <0%> (ø) ⬆️
r/src/recordbatch.cpp
r/R/Table.R
js/src/util/fn.ts
go/arrow/array/bufferbuilder.go
r/src/symbols.cpp
rust/datafusion/src/execution/projection.rs
rust/datafusion/src/execution/filter.rs
rust/arrow/src/csv/writer.rs
... and 326 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c350bba...2db7302. Read the comment docs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a no-op I believe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for comments, IMO, what dose this mean? :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield Could you please explain this a little more? thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Integer.hashCode(value) == value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, you are right, thanks, fixed.

Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good start do you see performance improvements? If so I think the next step would be to add unit tests.

Also, CC @praveenbingo and @pravindra to make sure they don't have concerns with the new interfaces.

@tianchen92
Copy link
Contributor Author

tianchen92 commented Jul 16, 2019

This looks like a good start do you see performance improvements? If so I think the next step would be to add unit tests.

Also, CC @praveenbingo and @pravindra to make sure they don't have concerns with the new interfaces.

Sure, I have already added a DictionaryEncoderBenchmark and list its previous performance data in #4786.
The performance improvement with this PR is as below:
DictionaryEncoderBenchmarks#testEncode:
Before: 5 415703.943 ± 9258.049 ns/op
After: 5 386094.115 ± 19138.808 ns/op

And add what kind of UT? now TestDictionaryVector already support many types like Struct, List, Binary, Varchar etc.

@emkornfield
Copy link
Contributor

Sorry forgot about the existing unit tests. LGTM.

@emkornfield
Copy link
Contributor

+1, than you

kszucs pushed a commit that referenced this pull request Jul 22, 2019
… dictionary encoding

As discussed in #4792

Implement a hash table to only store hash & index, meanwhile add check equal function in ValueVector API.

Author: tianchen <niki.lj@alibaba-inc.com>

Closes #4846 from tianchen92/hasher and squashes the following commits:

2db7302 <tianchen> fix
5facc2a <tianchen> resolve comments
175192a <tianchen> fix test and style
7a87526 <tianchen> implementation of equals and hashCode
c89608b <tianchen> fix
8f2e1a2 <tianchen> hash table prototype
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
… dictionary encoding

As discussed in apache#4792

Implement a hash table to only store hash & index, meanwhile add check equal function in ValueVector API.

Author: tianchen <niki.lj@alibaba-inc.com>

Closes apache#4846 from tianchen92/hasher and squashes the following commits:

2db7302 <tianchen> fix
5facc2a <tianchen> resolve comments
175192a <tianchen> fix test and style
7a87526 <tianchen> implementation of equals and hashCode
c89608b <tianchen> fix
8f2e1a2 <tianchen> hash table prototype
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants