feat: initial support string_view and binary_view, supports layout and basic construction + tests #5481

Merged Mar 14, 2024 (2 commits)

@ariesdevil ariesdevil commented Mar 7, 2024

Which issue does this PR close?

Closes #5469

Rationale for this change

Adds initial support for StringViewArray and BinaryViewArray: the layout, basic construction, and tests for these two new array types.

Note: This implementation is primarily based on two PRs: #4585 and databendlabs/databend/pull/14662.

What changes are included in this PR?

Add two new types of arrays.

Are there any user-facing changes?

Yes

@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 7, 2024

@tustvold tustvold left a comment

This is looking good, made some suggestions.

I also wonder if this should be ByteView and not BytesView, for consistency with things like GenericByteArray, although I do agree that in hindsight it probably should have been GenericBytesArray.

FWIW I rebased #4585 in https://github.com/tustvold/arrow-rs/tree/array-view and compared the changes this makes on top to speed up my review.

arrow-data/Cargo.toml (outdated, resolved)
arrow-array/src/types.rs (outdated, resolved)
arrow-data/src/data.rs (outdated, resolved)
arrow-data/src/equal/bytes_view.rs (outdated, resolved)
@ariesdevil ariesdevil force-pushed the string_view branch 2 times, most recently from 49dc2a1 to 1fd109b Compare March 12, 2024 09:54

@alamb alamb left a comment

Thank you so much @ariesdevil -- I think this code is looking basically ready to me. I have some API suggestions that might be worth considering but we can do it as a follow on PR.

The only thing missing from this PR is a few more tests; then it will be ready to go.

cc @XiangpengHao

arrow-array/src/array/byte_view_array.rs (outdated, resolved)

/// Returns the views buffer
#[inline]
pub fn views(&self) -> &ScalarBuffer<u128> {

Not for this PR, but I wonder if we should consider implementing a ByteViewBuffer type, similar to the OffsetBuffer used for GenericByteArray: https://docs.rs/arrow/latest/arrow/buffer/struct.OffsetBuffer.html

I thought that the introduction of OffsetBuffer made working with StringArray/BinaryArray much easier.

I can imagine ByteViewBuffer encapsulating the 12-byte inline string calculation, providing a way to build such values, and hosting documentation that explains which kinds of views are present.

If this seems like a reasonable idea, I can write up a ticket / maybe whack up a PR to show what it might look like
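To make the idea concrete, here is a minimal sketch of what the read side of such a type might look like. `ByteViewBuffer`, `MAX_INLINE_LEN`, and every method name below are hypothetical and are not part of arrow-rs; the sketch assumes each view is a little-endian `u128` with the length in bytes 0-3 and, for values of at most 12 bytes, the data inline in bytes 4-15.

```rust
/// Hypothetical wrapper around the raw views (not an arrow-rs type).
pub struct ByteViewBuffer {
    views: Vec<u128>,
}

/// Values of at most this many bytes are assumed to be stored inline.
pub const MAX_INLINE_LEN: u32 = 12;

impl ByteViewBuffer {
    pub fn new(views: Vec<u128>) -> Self {
        Self { views }
    }

    /// Length in bytes of value `i`, read from the low 32 bits of its view.
    pub fn value_len(&self, i: usize) -> u32 {
        self.views[i] as u32
    }

    /// True if value `i` is short enough to live entirely inside its view.
    pub fn is_inline(&self, i: usize) -> bool {
        self.value_len(i) <= MAX_INLINE_LEN
    }

    /// Copies out the inlined bytes of value `i`, or returns `None` when the
    /// value is stored out of line (that case is not handled in this sketch).
    pub fn inline_value(&self, i: usize) -> Option<Vec<u8>> {
        if !self.is_inline(i) {
            return None;
        }
        let len = self.value_len(i) as usize;
        let bytes = self.views[i].to_le_bytes();
        Some(bytes[4..4 + len].to_vec())
    }
}
```

Keeping the `<= 12` check in one place like this is the main point; callers would no longer need to know the byte offsets.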

Contributor Author

That's a good idea, I'll do it in the next PR.

pub fn append_value(&mut self, value: impl AsRef<T::Native>) {
let v: &[u8] = value.as_ref().as_ref();
let length: u32 = v.len().try_into().unwrap();
if length <= 12 {

With a views builder maybe this could look like

self.views_builder.append_inline(v, length)
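A rough sketch of how such a call could be backed, under stated assumptions: `ViewsBuilder`, `append_inline`, and `finish` are hypothetical names for illustration only, and the view is assumed to be a little-endian `u128` with the length in bytes 0-3 and the zero-padded data in bytes 4-15.

```rust
/// Hypothetical builder for the views buffer (not part of arrow-rs).
#[derive(Default)]
pub struct ViewsBuilder {
    views: Vec<u128>,
}

impl ViewsBuilder {
    /// Appends a value of at most 12 bytes as a single inline view.
    ///
    /// Panics if `length` does not match `v.len()` or exceeds 12.
    pub fn append_inline(&mut self, v: &[u8], length: u32) {
        assert_eq!(v.len(), length as usize, "length must match the value");
        assert!(length <= 12, "value is too long to inline");
        // Start from all zeros so the unused tail stays zero padded.
        let mut view = [0u8; 16];
        view[0..4].copy_from_slice(&length.to_le_bytes());
        view[4..4 + v.len()].copy_from_slice(v);
        self.views.push(u128::from_le_bytes(view));
    }

    /// Consumes the builder and returns the accumulated views.
    pub fn finish(self) -> Vec<u128> {
        self.views
    }
}
```

Values longer than 12 bytes would need a second method that also records a buffer index and offset; that path is left out here.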

Contributor Author

ditto

arrow-array/src/array/byte_view_array.rs (resolved)
arrow-array/src/record_batch.rs (resolved)
arrow-data/src/byte_view.rs (resolved)
arrow/tests/array_transform.rs (resolved)
arrow/tests/array_equal.rs (resolved)
@ariesdevil

Hi @alamb, I added more tests per your kind comments, PTAL.


@alamb alamb left a comment

Thank you again @ariesdevil -- I think that with a few more tests and some cleanups this PR will be good to go from my perspective.

arrow-data/src/equal/byte_view.rs (resolved)
arrow-array/src/builder/generic_bytes_view_builder.rs (outdated, resolved)
arrow-array/src/array/byte_view_array.rs (resolved)
arrow-data/src/byte_view.rs (outdated, resolved)
@ariesdevil

Hi @alamb @tustvold, I modified the code as you suggested and added more tests, PTAL.


@alamb alamb left a comment

Thank you @ariesdevil -- I think this PR is ready to go from my perspective.

I think we can address all remaining issues / comments in follow-on PRs.

I plan to merge this tomorrow morning (around 18 hours from now) unless anyone else would like additional time to review.

Thanks again @ariesdevil

kou pushed a commit to apache/arrow that referenced this pull request Mar 13, 2024
…s padded with `0` (#40512)

### Rationale for this change
While implementing `Variable-size Binary View Layout` (thanks @ariesdevil!) in apache/arrow-rs#5481 it was not 100% clear if the inlined string was zero padded.

@bkietz noted that:

> The spec does say "padded with zero" https://github.com/apache/arrow/blob/main/docs/source/format/Columnar.rst?plain=1#L384 but it could be repeated in the surrounding paragraph. In any case, padded with zero is definitely the intent

```
    * Short strings, length <= 12
      | Bytes 0-3  | Bytes 4-15                            |
      |------------|---------------------------------------|
      | length     | data (padded with 0)                  |
```
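One reason the padding value matters, shown as a small sketch: when the unused bytes are always zero, two equal short values produce bit-identical 16-byte views, so they can be compared as whole 128-bit integers. The helper `inline_view_with_padding` is hypothetical and exists only to contrast zero padding with an arbitrary fill byte; it assumes a little-endian `u128` view laid out as in the diagram above.

```rust
/// Hypothetical helper: encodes a value of at most 12 bytes as a 16-byte
/// view, filling the unused trailing bytes with `pad`.
fn inline_view_with_padding(data: &[u8], pad: u8) -> u128 {
    assert!(data.len() <= 12, "only short values can be inlined");
    let mut view = [pad; 16];
    // Bytes 0-3: length. Bytes 4-15: data followed by padding.
    view[0..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    view[4..4 + data.len()].copy_from_slice(data);
    u128::from_le_bytes(view)
}

/// Compares two inline views as whole integers. This is only correct when
/// both views were written with the same (zero) padding.
fn inline_views_equal(a: u128, b: u128) -> bool {
    a == b
}
```

With a non-zero or uninitialized fill, the same string could encode to different integers, and a comparison would have to look at the length and slice out the data first.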
### What changes are included in this PR?

Add a sentence in the surrounding text to make it clear that inlined string values are zero padded.

Note I do not think this is a specification change (and therefore doesn't need a vote on the mailing list) as the spec already specifies the padding is zero (in the diagram). This simply clarifies the text to emphasize this point for ease of understanding.

### Are these changes tested?

### Are there any user-facing changes?

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
@alamb alamb merged commit d39cf28 into apache:master Mar 14, 2024
25 checks passed
@alamb

alamb commented Mar 14, 2024

🚀 I filed a bunch of follow-on tickets:
IPC format support for StringViewArray and BinaryViewArray #5506
#5507
#5508
#5509
#5510
#5511
#5513

@@ -1027,6 +1028,44 @@ fn test_extend_nulls_panic() {
mutable.extend_nulls(2);
}

#[test]
fn test_string_view() {

❤️

Successfully merging this pull request may close these issues.

Add StringViewArray implementation and layout and basic construction + tests
8 participants