- 
                Notifications
    You must be signed in to change notification settings 
- Fork 3.9k
ARROW-5902: [Java] Implement hash table and equals & hashCode API for dictionary encoding #4846
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| @emkornfield Hi Micah, I made a simple implementation for dictionary hash table. | 
        
          
                java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                java/vector/src/main/java/org/apache/arrow/vector/VarCharVector.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                java/vector/src/main/java/org/apache/arrow/vector/VarCharVector.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
      | @liyafan82 Thanks for your comments, This PR is just a demo as I commented as WIP and this will work for VarCharVector type encoding in TestDictionaryVector to test correctness. default and memory copy already commented as TODO to make it perfect since we have to use memory level hashCode and implement equals in all type vectors. | 
| Codecov Report
 @@            Coverage Diff             @@
##           master    #4846      +/-   ##
==========================================
+ Coverage   87.42%   89.58%   +2.16%     
==========================================
  Files         994      661     -333     
  Lines      140102    96617   -43485     
  Branches     1418        0    -1418     
==========================================
- Hits       122481    86555   -35926     
+ Misses      17259    10062    -7197     
+ Partials      362        0     -362Continue to review full report at Codecov. 
 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a no-op I believe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for comments, IMO, what dose this mean? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emkornfield Could you please explain this a little more? thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Integer.hashCode(value) == value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, you are right, thanks, fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a good start do you see performance improvements? If so I think the next step would be to add unit tests.
Also, CC @praveenbingo and @pravindra to make sure they don't have concerns with the new interfaces.
        
          
                java/vector/src/main/java/org/apache/arrow/vector/util/ByteFunctionHelpers.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                java/vector/src/main/java/org/apache/arrow/vector/complex/FixedSizeListVector.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                java/vector/src/main/java/org/apache/arrow/vector/dictionary/DictionaryHashTable.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestExtensionType.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestExtensionType.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                java/vector/src/main/java/org/apache/arrow/vector/util/ByteFunctionHelpers.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
      | 
 Sure, I have already added a DictionaryEncoderBenchmark and list its previous performance data in #4786. And add what kind of UT? now TestDictionaryVector already support many types like Struct, List, Binary, Varchar etc. | 
| Sorry forgot about the existing unit tests. LGTM. | 
| +1, than you | 
… dictionary encoding As discussed in #4792 Implement a hash table to only store hash & index, meanwhile add check equal function in ValueVector API. Author: tianchen <niki.lj@alibaba-inc.com> Closes #4846 from tianchen92/hasher and squashes the following commits: 2db7302 <tianchen> fix 5facc2a <tianchen> resolve comments 175192a <tianchen> fix test and style 7a87526 <tianchen> implementation of equals and hashCode c89608b <tianchen> fix 8f2e1a2 <tianchen> hash table prototype
… dictionary encoding As discussed in apache#4792 Implement a hash table to only store hash & index, meanwhile add check equal function in ValueVector API. Author: tianchen <niki.lj@alibaba-inc.com> Closes apache#4846 from tianchen92/hasher and squashes the following commits: 2db7302 <tianchen> fix 5facc2a <tianchen> resolve comments 175192a <tianchen> fix test and style 7a87526 <tianchen> implementation of equals and hashCode c89608b <tianchen> fix 8f2e1a2 <tianchen> hash table prototype
As discussed in #4792
Implement a hash table to only store hash & index, meanwhile add check equal function in ValueVector API.