Add dictionary array support for substring function #1665
| /// let error = substring(&array, 0, Some(5)).unwrap_err().to_string(); | ||
| /// assert!(error.contains("invalid utf-8 boundary")); | ||
| /// ``` | ||
| pub fn substring(array: &dyn Array, start: i64, length: Option<u64>) -> Result<ArrayRef> { | 
I moved this func to the beginning of the file, before all other non-public ones, for better readability.
| DataType::Dictionary(kt, _) => { | ||
| substring_dict!( | ||
| kt, | ||
| Int8: Int8Type, | 
We may make this shorter via concat_idents (e.g., concat_idents($t, Type)) but it's only available in nightly.
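To illustrate the explicit `Variant: Type` pairing discussed here, the following is a minimal stable-Rust sketch. It does not reproduce the PR's `substring_dict!` macro. `KeyKind`, the marker structs, the `KeyType` trait, and `dispatch_key!` are all hypothetical stand-ins invented for this example. The macro takes each pair explicitly because stable `macro_rules!` cannot build the identifier `Int8Type` from `Int8`.

```rust
/// Hypothetical stand-in for the dictionary key variants of a `DataType`.
#[derive(Debug, Clone, Copy)]
pub enum KeyKind {
    Int8,
    Int16,
    Int32,
}

// Hypothetical marker types mirroring the `Int8Type`-style names in the diff.
pub struct Int8Type;
pub struct Int16Type;
pub struct Int32Type;

/// Hypothetical trait giving each marker type something to report.
pub trait KeyType {
    const BYTE_WIDTH: usize;
}
impl KeyType for Int8Type {
    const BYTE_WIDTH: usize = 1;
}
impl KeyType for Int16Type {
    const BYTE_WIDTH: usize = 2;
}
impl KeyType for Int32Type {
    const BYTE_WIDTH: usize = 4;
}

/// Generic function the macro dispatches to once the key type is known.
pub fn key_width<K: KeyType>() -> usize {
    K::BYTE_WIDTH
}

/// Expands to a `match` with one arm per `Variant: Type` pair.
/// Each pair is written out because the type name cannot be derived
/// from the variant name on stable Rust.
macro_rules! dispatch_key {
    ($kind:expr, $($variant:ident: $ty:ty),+ $(,)?) => {
        match $kind {
            $( KeyKind::$variant => key_width::<$ty>(), )+
        }
    };
}

/// Runtime value in, statically typed call out.
pub fn key_byte_width(kind: KeyKind) -> usize {
    dispatch_key!(kind, Int8: Int8Type, Int16: Int16Type, Int32: Int32Type)
}
```

Calling `key_byte_width(KeyKind::Int16)` selects the `Int16Type` arm. The cost of the explicit list is some repetition at the call site; in exchange it compiles on stable.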
| } | ||
| #[test] | ||
| fn dictionary() -> Result<()> { | 
| fn dictionary() -> Result<()> { | |
| fn test_substring_dictionary() -> Result<()> { | 
I don't think it's necessary to add a test_ prefix to Rust tests since they are already under the tests module. The substring here also seems redundant since the full test name compute::kernels::substring::tests::dictionary already contains it.
Looks good to me. A few minor comments.
          Codecov Report
 @@            Coverage Diff             @@
##           master    #1665      +/-   ##
==========================================
+ Coverage   83.10%   83.16%   +0.05%     
==========================================
  Files         193      193              
  Lines       55864    56039     +175     
==========================================
+ Hits        46425    46603     +178     
+ Misses       9439     9436       -3     
| /// let error = substring(&array, 0, Some(5)).unwrap_err().to_string(); | ||
| /// assert!(error.contains("invalid utf-8 boundary")); | ||
| /// ``` | ||
| pub fn substring(array: &dyn Array, start: i64, length: Option<u64>) -> Result<ArrayRef> { | 
Just a nit: maybe we could make length an Option<u32>, because the longest length will not exceed (1 << 31) - 1 (for LargeBinaryArray and LargeStringArray).
Hmm, I think this is not really related to this PR. I can open another one for the change.
| /// ``` | ||
| /// | ||
| /// # Error | ||
| /// - The function errors when the passed array is not a \[Large\]String array or \[Large\]Binary array. | 
We may also update this to note that Dictionary arrays with [large]string/[large]binary values are accepted here as well.
Good catch. Updated.
Thank you @sunchao ❤️
Merged, thanks @sunchao @HaoYang670 @alamb
Which issue does this PR close?
Closes #1656.
Rationale for this change
Currently the substring kernel only supports "plain" arrays but not dictionary-encoded ones. With a dictionary array, the computation could be much more efficient since it only needs to be done on the dictionary values.
What changes are included in this PR?
This PR adds dictionary array support to the substring kernel.
Are there any user-facing changes?
No
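The rationale above (compute only on the dictionary values and reuse the keys) can be sketched without arrow-rs. This is a std-only illustration under stated assumptions, not the PR's implementation. `ToyDictionary`, `substring_str`, and `substring_dict` are hypothetical names. The start/length semantics (byte offsets, negative start counted from the end) are assumed for the example.

```rust
/// A toy dictionary-encoded string column: each row stores a key that
/// indexes into `values`. Illustration only, not arrow-rs `DictionaryArray`.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyDictionary {
    pub keys: Vec<Option<usize>>, // `None` represents a null row
    pub values: Vec<String>,      // distinct values, stored once
}

/// Byte-based substring. A negative `start` counts from the end of the
/// string; `length == None` means "to the end". Returns an error if the
/// slice would split a UTF-8 character.
pub fn substring_str(s: &str, start: i64, length: Option<u64>) -> Result<String, String> {
    let len = s.len() as i64;
    // Clamp the start offset into `0..=len`.
    let begin = if start >= 0 { start.min(len) } else { (len + start).max(0) };
    // Clamp the end offset into `begin..=len`, avoiding overflow.
    let end = match length {
        Some(l) => begin
            .saturating_add(l.min(i64::MAX as u64) as i64)
            .min(len),
        None => len,
    };
    let (begin, end) = (begin as usize, end as usize);
    if !s.is_char_boundary(begin) || !s.is_char_boundary(end) {
        return Err(format!("invalid utf-8 boundary at {}..{}", begin, end));
    }
    Ok(s[begin..end].to_string())
}

/// Applies the substring once per distinct value and reuses the keys,
/// so the work scales with the dictionary size rather than the row count.
pub fn substring_dict(
    dict: &ToyDictionary,
    start: i64,
    length: Option<u64>,
) -> Result<ToyDictionary, String> {
    let values = dict
        .values
        .iter()
        .map(|v| substring_str(v, start, length))
        .collect::<Result<Vec<_>, _>>()?;
    Ok(ToyDictionary {
        keys: dict.keys.clone(),
        values,
    })
}
```

For a column with four rows but only two distinct values, `substring_dict` slices two strings instead of four; the keys, including null rows, are copied unchanged. One side effect of this approach is that two distinct values may map to the same substring, so the resulting dictionary is not necessarily deduplicated.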