PARQUET-1716: [C++] Add BYTE_STREAM_SPLIT encoder and decoder #6005

martinradev · 2019-12-09T23:51:27Z

The patch implements an encoder and decoder for Parquet's
BYTE_STREAM_SPLIT encoding. The patch also adds tests for
the new encoding.

martinradev · 2019-12-09T23:52:13Z

The patch adds an encoding from a future parquet thrift release and it might have to go to a separate branch until the new parquet format is released.

github-actions · 2019-12-10T00:06:15Z

https://issues.apache.org/jira/browse/ARROW-5913

wesm · 2019-12-10T14:42:27Z

I moved the JIRA to the PARQUET project

github-actions · 2019-12-10T14:45:50Z

https://issues.apache.org/jira/browse/PARQUET-1716

martinradev · 2019-12-11T15:46:31Z

Can you review this?

I have one other patch about optimizing the implementation since clang/gcc did not auto-vectorize the encoder/decoder loop in some of my early tests. I have not checked with the current patch.

pitrou

I'm not a Parquet expert, but here is a review.

pitrou · 2019-12-11T17:50:31Z

cpp/src/parquet/encoding.cc

You should certainly use a arrow::TypedBufferBuilder<T> instead. It will make use of the Arrow memory pool for allocations (and will reallocate efficiently).

pitrou · 2019-12-11T17:52:44Z

cpp/src/parquet/encoding.cc

Here you could use TypedBufferBuilder<T>::Append(const T* values, int64_t num_elements).

pitrou · 2019-12-11T17:56:44Z

cpp/src/parquet/types.h

Should simply be spelled UnsignedTypeWithWidth<4>

pitrou · 2019-12-11T18:02:03Z

cpp/src/parquet/encoding.cc

Better to raise an error than crash out, no?

Done. Added a test for invalid types.
There might be also some benefit of using it for integers but I have not tested on any data set, and that can be changed in the future if found useful.

pitrou · 2019-12-11T18:02:33Z

cpp/src/parquet/encoding.cc

Why use an unsigned initializer here?

Also, there already is a num_values_ field in DecoderImpl.

Removed and changed to using the one in DecoderImpl

pitrou · 2019-12-11T18:14:34Z

cpp/src/parquet/encoding_test.cc

Why the unsigned const?

By mistake. Changed to signed literal.

pitrou · 2019-12-11T18:16:01Z

cpp/src/parquet/encoding_test.cc

Instead you could just write:

const uint8_t* value_as_bytes = reinterpret_cast<const uint8_t*>(&value); ASSERT_EQ(value_as_bytes[0], mutable_data[0]); // etc

pitrou · 2019-12-11T18:16:39Z

cpp/src/parquet/encoding_test.cc

What's the point? Just test with a reasonable data size > 0.

I wanted to cover the case when data is large (>1000) elements but that's already covered in another test, so I removed this one.

pitrou · 2019-12-11T18:18:34Z

cpp/src/parquet/encoding_test.cc

Why are you initializing a single element here?

This initializes all elements to .0 but I changed it to make it more reader-friendly per your request.

pitrou · 2019-12-11T18:19:52Z

cpp/src/parquet/encoding_test.cc

Shouldn't this be SetData(2, data, 16)? (2 double values -> 16 bytes len)

This is me trying to test having number of elements greater than what can be supported with the number of bytes provided. This should be in fact a failing test but the implementation doesn't check for this and also the test doesn't check that results are incorrect.
I suppose since the decoder/encoder is only used internally the correct arguments should always be provided, so I shouldn't be testing for that.

martinradev · 2019-12-13T13:25:40Z

The failure seems unrelated.

martinradev · 2019-12-15T13:02:41Z

@pitrou , thanks for the review. Can you check whether the patch is better now?

martinradev · 2019-12-16T10:29:32Z

@wesm Can you have a look?

pitrou · 2019-12-16T14:00:03Z

thanks for the review. Can you check whether the patch is better now?

Thanks for the update. It's indeed better, though a couple comments haven't been answered. Can you take a look at them?

wesm · 2019-12-16T16:42:03Z

cpp/src/parquet/encoding.cc

You might better use SafeLoadAs here https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/ubsan.h#L54

wesm · 2019-12-16T16:44:30Z

cpp/src/parquet/encoding.cc

Use PARQUET_THROW_NOT_OK

wesm · 2019-12-16T16:45:17Z

cpp/src/parquet/encoding.cc

Not having this implemented is rather limiting, considering that much data will be inbound from spaced Arrow formatted

Can I do this in a follow-up patch?

wesm · 2019-12-16T16:48:04Z

cpp/src/parquet/encoding.cc

Indeed looks incorrect

wesm · 2019-12-16T16:48:58Z

cpp/src/parquet/encoding.cc

Is it accurate/correct that this is striding by num_values_ bytes through the encoded memory on each iteration?

Yes, there are sizeof(T) streams and num_values_ entries per stream.
The array of F32 values [0x11223344, 0xAABBCCDD] would be encoded as
[44 DD 33 CC 22 BB 11 AA].
num_values_ is 2 here which is the stride.
So to decode the first value we go over 44, 33, 22, 11.

wesm · 2019-12-16T16:55:07Z

cpp/src/parquet/encoding_test.cc

You test the reset-state behavior in multiple places, I wonder if you would want to move these checks to a dedicated test

I can do this in a follow-up patch for all encoders.

wesm · 2019-12-16T16:55:22Z

cpp/src/parquet/encoding_test.cc

Similar question re: code reuse

Since it's a refactoring for all encoder, I can do this in a follow-up patch.

wesm · 2019-12-16T16:57:37Z

cpp/src/parquet/encoding_test.cc

Is there value in this manual reassembly versus a static or randomly generated array?

If we test the round-trip Decode(Encode(A)) == A then we only test that the decoder does the inverse of the encoder. Since we have a spec we have to follow, it's good to check that everything is precise.
If we get a failing test, we might not know whether it's in the decode or encode path. Testing only the encoder/decoder can help root cause the issue.

wesm · 2019-12-16T16:58:36Z

cpp/src/parquet/encoding_test.cc

Helper function for checking round trips

I will do this in a separate patch.

wesm · 2019-12-16T16:59:50Z

cpp/src/parquet/types.h

Does this need to be public?

No, will move it to encoding.cc in an unnamed namespace.

martinradev · 2019-12-17T09:45:19Z

@pitrou I missed the comments due to my unfamiliarity with github's ui.

pitrou · 2019-12-17T10:09:10Z

@martinradev No problem :-)

martinradev · 2019-12-18T11:29:44Z

I made some changes to the patch to implement the missing encoder interface and to make it so that the code adheres to the other encoders.

martinradev · 2019-12-20T21:47:29Z

@pitrou Does it look better?
I am not sure how to refactor the existing tests in encoding_test.cc since they look to me that they target somewhat different paths. So, I would add code to cover round-trips but I do not know whether I can easily remove existing code.

pitrou

Sorry for the delays. Two more comments below.

pitrou · 2020-01-09T17:32:55Z

cpp/src/parquet/encoding.cc

Hmm... do you think the compiler is able to unroll the inner loop? Otherwise, perhaps you can precompute the stream pointers outside of the loop:

const uint8_t* stream_data[num_streams]; for (size_t b = 0; b < num_streams; ++b) { stream_data[b] = data_ + b * num_values_in_buffer + num_decoded_previously; } for (int i = 0; i < values_to_decode; ++i) { uint8_t gathered_byte_data[num_streams]; for (size_t b = 0; b < num_streams; ++b) { gathered_byte_data[b] = stream_data[b][byte_index]; } // ...

It looks like the loop is unrolled for the decoder and encoder for both float and double types:

Encoder<float> 0x00007ffff736f060 <+624>: movzx r10d,BYTE PTR [rax+r9*4] 0x00007ffff736f065 <+629>: mov BYTE PTR [r8+r9*1],r10b 0x00007ffff736f069 <+633>: movzx r10d,BYTE PTR [rax+r9*4+0x1] 0x00007ffff736f06f <+639>: mov BYTE PTR [rdx+r9*1],r10b 0x00007ffff736f073 <+643>: movzx r10d,BYTE PTR [rax+r9*4+0x2] 0x00007ffff736f079 <+649>: mov BYTE PTR [rdi+r9*1],r10b 0x00007ffff736f07d <+653>: movzx r10d,BYTE PTR [rax+r9*4+0x3] 0x00007ffff736f083 <+659>: mov BYTE PTR [rsi+r9*1],r10b 0x00007ffff736f087 <+663>: add r9,0x1 0x00007ffff736f08b <+667>: cmp rcx,r9 0x00007ffff736f08e <+670>: ja 0x7ffff736f060 <_ZN7parquet22ByteStreamSplitEncoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE11FlushValuesEv+624> Decoder<float> 0x7ffff736eca8 <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+104>: movzx ecx,BYTE PTR [r11+rdx*1] 0x7ffff736ecad <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+109>: mov al,BYTE PTR [rbx+rdx*1] 0x7ffff736ecb0 <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+112>: movzx edi,BYTE PTR [r10+rdx*1] 0x7ffff736ecb5 <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+117>: mov ah,cl 0x7ffff736ecb7 <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+119>: movzx ecx,BYTE PTR [r9+rdx*1] 0x7ffff736ecbc <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+124>: shl edi,0x10 0x7ffff736ecbf <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+127>: movzx eax,ax 0x7ffff736ecc2 <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+130>: or eax,edi 0x7ffff736ecc4 <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+132>: shl ecx,0x18 0x7ffff736ecc7 <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+135>: or eax,ecx 0x7ffff736ecc9 <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+137>: mov DWORD PTR [rsi+rdx*4],eax 0x7ffff736eccc <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+140>: add rdx,0x1 0x7ffff736ecd0 <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+144>: cmp r8,rdx 0x7ffff736ecd3 <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+147>: jne 0x7ffff736eca8 <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+104> Encoder<double> 0x00007ffff736ee70 <+128>: movzx esi,BYTE PTR [rdx] 0x00007ffff736ee73 <+131>: add rax,0x1 0x00007ffff736ee77 <+135>: add rdx,0x8 0x00007ffff736ee7b <+139>: mov BYTE PTR [rax-0x1],sil 0x00007ffff736ee7f <+143>: movzx esi,BYTE PTR [rdx-0x7] 0x00007ffff736ee83 <+147>: mov BYTE PTR [rax+rcx*1-0x1],sil 0x00007ffff736ee88 <+152>: movzx esi,BYTE PTR [rdx-0x6] 0x00007ffff736ee8c <+156>: mov BYTE PTR [rax+r10*1-0x1],sil 0x00007ffff736ee91 <+161>: movzx esi,BYTE PTR [rdx-0x5] 0x00007ffff736ee95 <+165>: mov BYTE PTR [rax+r8*1-0x1],sil 0x00007ffff736ee9a <+170>: movzx esi,BYTE PTR [rdx-0x4] 0x00007ffff736ee9e <+174>: mov BYTE PTR [rax+r9*1-0x1],sil 0x00007ffff736eea3 <+179>: movzx esi,BYTE PTR [rdx-0x3] 0x00007ffff736eea7 <+183>: mov BYTE PTR [rax+rbp*1-0x1],sil 0x00007ffff736eeac <+188>: movzx esi,BYTE PTR [rdx-0x2] 0x00007ffff736eeb0 <+192>: mov BYTE PTR [rax+rbx*1-0x1],sil 0x00007ffff736eeb5 <+197>: movzx esi,BYTE PTR [rdx-0x1] 0x00007ffff736eeb9 <+201>: mov BYTE PTR [rax+rdi*1-0x1],sil 0x00007ffff736eebe <+206>: cmp r11,rax 0x00007ffff736eec1 <+209>: jne 0x7ffff736ee70 <_ZN7parquet22ByteStreamSplitEncoderINS_12PhysicalTypeILNS_4Type4typeE5EEEE11FlushValuesEv+128> Decoder<double>: 0x00007ffff736ecc8 <+184>: movzx r15d,BYTE PTR [r13+rdx*1+0x0] 0x00007ffff736ecce <+190>: mov al,BYTE PTR [r14+rdx*1] 0x00007ffff736ecd2 <+194>: mov rcx,r15 0x00007ffff736ecd5 <+197>: movzx r15d,BYTE PTR [r12+rdx*1] 0x00007ffff736ecda <+202>: mov ah,cl 0x00007ffff736ecdc <+204>: movabs rcx,0xffff00ffffffffff 0x00007ffff736ece6 <+214>: and rax,0xffffffffff00ffff 0x00007ffff736ecec <+220>: shl r15,0x10 0x00007ffff736ecf0 <+224>: or rax,r15 0x00007ffff736ecf3 <+227>: movzx r15d,BYTE PTR [rbp+rdx*1+0x0] 0x00007ffff736ecf9 <+233>: and rax,r8 0x00007ffff736ecfc <+236>: shl r15,0x18 0x00007ffff736ed00 <+240>: or rax,r15 0x00007ffff736ed03 <+243>: movzx r15d,BYTE PTR [rbx+rdx*1] 0x00007ffff736ed08 <+248>: and rax,rdi 0x00007ffff736ed0b <+251>: shl r15,0x20 0x00007ffff736ed0f <+255>: or rax,r15 0x00007ffff736ed12 <+258>: movzx r15d,BYTE PTR [r11+rdx*1] 0x00007ffff736ed17 <+263>: and rax,rcx 0x00007ffff736ed1a <+266>: movabs rcx,0xff00ffffffffffff 0x00007ffff736ed24 <+276>: shl r15,0x28 0x00007ffff736ed28 <+280>: or rax,r15 0x00007ffff736ed2b <+283>: movzx r15d,BYTE PTR [r10+rdx*1] 0x00007ffff736ed30 <+288>: and rax,rcx 0x00007ffff736ed33 <+291>: movabs rcx,0xffffffffffffff 0x00007ffff736ed3d <+301>: shl r15,0x30 0x00007ffff736ed41 <+305>: or rax,r15 0x00007ffff736ed44 <+308>: movzx r15d,BYTE PTR [r9+rdx*1] 0x00007ffff736ed49 <+313>: and rax,rcx 0x00007ffff736ed4c <+316>: shl r15,0x38 0x00007ffff736ed50 <+320>: or rax,r15 0x00007ffff736ed53 <+323>: mov QWORD PTR [rsi+rdx*8],rax 0x00007ffff736ed57 <+327>: add rdx,0x1 0x00007ffff736ed5b <+331>: cmp QWORD PTR [rsp+0x8],rdx 0x00007ffff736ed60 <+336>: jne 0x7ffff736ecc8 <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE5EEEE6DecodeEPdi+184>

But the compiler does not decide to simd-vectorize any of the loops.
In my experiments, it does make a difference to use simd -> https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf
I have an implementation for the encoder and decoder, and I plan to submit a PR once we get this patch in.
Do you think this makes sense?

It probably makes sense, yes.

pitrou · 2020-01-09T17:35:45Z

cpp/src/parquet/encoding_test.cc

It would be nice to at least add the roundtrip helper(s) in this PR, since otherwise there's a lot of additional duplication.

martinradev · 2020-01-19T09:31:52Z

I am a bit overwhelmed with other tasks but I will get back to this in the upcoming weekend.

The patch implements an encoder and decoder for Parquet's BYTE_STREAM_SPLIT encoding. The patch also adds tests for the new encoding.

martinradev · 2020-01-24T08:38:38Z

Hey @pitrou , I did a little bit of refactoring to reduce the bloat in the test file. I also checked that the compiler unrolls the loop when GCC is the compiler. I did not check with clang but I suspect it does as well.

pitrou · 2020-01-27T17:05:56Z

@wesm Do you want to take a look again here?

wesm · 2020-01-29T15:11:21Z

Looking

martinradev · 2020-01-30T13:44:54Z

Is the patch better now?

pitrou · 2020-01-30T13:46:42Z

For me, the only thing remaining is the lack of genericity and infrastructure reuse in unit tests: some tests hardcode a float32 datatype, others a float64 datatype, and for some reason the tests don't reuse the TestEncodingBase class. But that may not necessarily block the PR for integration. I'll let @wesm comment.

martinradev · 2020-01-30T16:54:02Z

@pitrou Ok, then I can try to reiterate on the patch and address the issue.

wesm · 2020-01-31T01:41:02Z

Sorry I've been at a conference this week and was a bit optimistic about my ability to review patches. I'm having a look at this right now

wesm

+1. Sorry for the delay on this. Can you open a JIRA issue for the follow up refactoring (better code reuse in encoding-test.cc)?

The patch implements an encoder and decoder for Parquet's BYTE_STREAM_SPLIT encoding. The patch also adds tests for the new encoding. Closes #6005 from martinradev/byte_stream_split_submit and squashes the following commits: 5a78f8b <Martin Radev> ARROW-5913: Add BYTE_STREAM_SPLIT encoder and decoder Authored-by: Martin Radev <martin.b.radev@gmail.com> Signed-off-by: Wes McKinney <wesm+git@apache.org>

martinradev force-pushed the byte_stream_split_submit branch 3 times, most recently from d7bbcf2 to e4c875a Compare December 10, 2019 11:01

wesm changed the title ~~ARROW-5913: [C++] Add BYTE_STREAM_SPLIT encoder and decoder~~ PARQUET-1716: [C++] Add BYTE_STREAM_SPLIT encoder and decoder Dec 10, 2019

martinradev force-pushed the byte_stream_split_submit branch 3 times, most recently from 20f5061 to 1e48ed6 Compare December 11, 2019 14:00

pitrou requested changes Dec 11, 2019

View reviewed changes

martinradev force-pushed the byte_stream_split_submit branch 3 times, most recently from ba951f4 to 280bc4f Compare December 13, 2019 10:14

martinradev requested a review from pitrou December 13, 2019 13:25

martinradev force-pushed the byte_stream_split_submit branch 2 times, most recently from 10f67ca to b084394 Compare December 14, 2019 12:38

wesm requested changes Dec 16, 2019

View reviewed changes

martinradev force-pushed the byte_stream_split_submit branch from b084394 to 98dd5b4 Compare December 16, 2019 23:23

martinradev force-pushed the byte_stream_split_submit branch 3 times, most recently from f354123 to 33d0dd1 Compare December 17, 2019 23:06

pitrou reviewed Jan 9, 2020

View reviewed changes

martinradev force-pushed the byte_stream_split_submit branch from 33d0dd1 to 42d006d Compare January 23, 2020 23:32

ARROW-5913: [C++] Add BYTE_STREAM_SPLIT encoder and decoder

5a78f8b

The patch implements an encoder and decoder for Parquet's BYTE_STREAM_SPLIT encoding. The patch also adds tests for the new encoding.

martinradev force-pushed the byte_stream_split_submit branch from 42d006d to 5a78f8b Compare January 23, 2020 23:52

wesm approved these changes Feb 5, 2020

View reviewed changes

wesm closed this in 25fd97b Feb 5, 2020

asfimport mentioned this pull request Jun 23, 2024

[C++][Parquet] Add support for BYTE_STREAM_SPLIT encoding #42372

Closed

PARQUET-1716: [C++] Add BYTE_STREAM_SPLIT encoder and decoder #6005

PARQUET-1716: [C++] Add BYTE_STREAM_SPLIT encoder and decoder #6005

Uh oh!

Conversation

martinradev commented Dec 9, 2019

Uh oh!

martinradev commented Dec 9, 2019

Uh oh!

github-actions bot commented Dec 10, 2019

Uh oh!

wesm commented Dec 10, 2019

Uh oh!

github-actions bot commented Dec 10, 2019

Uh oh!

martinradev commented Dec 11, 2019

Uh oh!

pitrou left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

martinradev commented Dec 13, 2019

Uh oh!

martinradev commented Dec 15, 2019

Uh oh!

martinradev commented Dec 16, 2019

Uh oh!

pitrou commented Dec 16, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment