-
Notifications
You must be signed in to change notification settings - Fork 3.9k
PARQUET-1716: [C++] Add BYTE_STREAM_SPLIT encoder and decoder #6005
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
The patch adds an encoding from a future parquet thrift release and it might have to go to a separate branch until the new parquet format is released. |
d7bbcf2 to
e4c875a
Compare
|
I moved the JIRA to the PARQUET project |
20f5061 to
1e48ed6
Compare
|
Can you review this? I have one other patch about optimizing the implementation since clang/gcc did not auto-vectorize the encoder/decoder loop in some of my early tests. I have not checked with the current patch. |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a Parquet expert, but here is a review.
cpp/src/parquet/encoding.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should certainly use a arrow::TypedBufferBuilder<T> instead. It will make use of the Arrow memory pool for allocations (and will reallocate efficiently).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
cpp/src/parquet/encoding.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you could use TypedBufferBuilder<T>::Append(const T* values, int64_t num_elements).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
cpp/src/parquet/types.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should simply be spelled UnsignedTypeWithWidth<4>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
cpp/src/parquet/encoding.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to raise an error than crash out, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Added a test for invalid types.
There might be also some benefit of using it for integers but I have not tested on any data set, and that can be changed in the future if found useful.
cpp/src/parquet/encoding.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use an unsigned initializer here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, there already is a num_values_ field in DecoderImpl.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed and changed to using the one in DecoderImpl
cpp/src/parquet/encoding_test.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the unsigned const?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By mistake. Changed to signed literal.
cpp/src/parquet/encoding_test.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead you could just write:
const uint8_t* value_as_bytes = reinterpret_cast<const uint8_t*>(&value);
ASSERT_EQ(value_as_bytes[0], mutable_data[0]);
// etcThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
cpp/src/parquet/encoding_test.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the point? Just test with a reasonable data size > 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to cover the case when data is large (>1000) elements but that's already covered in another test, so I removed this one.
cpp/src/parquet/encoding_test.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you initializing a single element here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This initializes all elements to .0 but I changed it to make it more reader-friendly per your request.
cpp/src/parquet/encoding_test.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be SetData(2, data, 16)? (2 double values -> 16 bytes len)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is me trying to test having number of elements greater than what can be supported with the number of bytes provided. This should be in fact a failing test but the implementation doesn't check for this and also the test doesn't check that results are incorrect.
I suppose since the decoder/encoder is only used internally the correct arguments should always be provided, so I shouldn't be testing for that.
ba951f4 to
280bc4f
Compare
|
The failure seems unrelated. |
10f67ca to
b084394
Compare
|
@pitrou , thanks for the review. Can you check whether the patch is better now? |
|
@wesm Can you have a look? |
Thanks for the update. It's indeed better, though a couple comments haven't been answered. Can you take a look at them? |
cpp/src/parquet/encoding.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might better use SafeLoadAs here https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/ubsan.h#L54
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
cpp/src/parquet/encoding.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use PARQUET_THROW_NOT_OK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
cpp/src/parquet/encoding.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not having this implemented is rather limiting, considering that much data will be inbound from spaced Arrow formatted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I do this in a follow-up patch?
cpp/src/parquet/encoding.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed looks incorrect
cpp/src/parquet/encoding.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it accurate/correct that this is striding by num_values_ bytes through the encoded memory on each iteration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there are sizeof(T) streams and num_values_ entries per stream.
The array of F32 values [0x11223344, 0xAABBCCDD] would be encoded as
[44 DD 33 CC 22 BB 11 AA].
num_values_ is 2 here which is the stride.
So to decode the first value we go over 44, 33, 22, 11.
cpp/src/parquet/encoding_test.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You test the reset-state behavior in multiple places, I wonder if you would want to move these checks to a dedicated test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can do this in a follow-up patch for all encoders.
cpp/src/parquet/encoding_test.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar question re: code reuse
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it's a refactoring for all encoder, I can do this in a follow-up patch.
cpp/src/parquet/encoding_test.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there value in this manual reassembly versus a static or randomly generated array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we test the round-trip Decode(Encode(A)) == A then we only test that the decoder does the inverse of the encoder. Since we have a spec we have to follow, it's good to check that everything is precise.
If we get a failing test, we might not know whether it's in the decode or encode path. Testing only the encoder/decoder can help root cause the issue.
cpp/src/parquet/encoding_test.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Helper function for checking round trips
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will do this in a separate patch.
cpp/src/parquet/types.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be public?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, will move it to encoding.cc in an unnamed namespace.
b084394 to
98dd5b4
Compare
|
@pitrou I missed the comments due to my unfamiliarity with github's ui. |
|
@martinradev No problem :-) |
f354123 to
33d0dd1
Compare
|
I made some changes to the patch to implement the missing encoder interface and to make it so that the code adheres to the other encoders. |
|
@pitrou Does it look better? |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delays. Two more comments below.
cpp/src/parquet/encoding.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... do you think the compiler is able to unroll the inner loop? Otherwise, perhaps you can precompute the stream pointers outside of the loop:
const uint8_t* stream_data[num_streams];
for (size_t b = 0; b < num_streams; ++b) {
stream_data[b] = data_ + b * num_values_in_buffer + num_decoded_previously;
}
for (int i = 0; i < values_to_decode; ++i) {
uint8_t gathered_byte_data[num_streams];
for (size_t b = 0; b < num_streams; ++b) {
gathered_byte_data[b] = stream_data[b][byte_index];
}
// ...There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the loop is unrolled for the decoder and encoder for both float and double types:
Encoder<float>
0x00007ffff736f060 <+624>: movzx r10d,BYTE PTR [rax+r9*4]
0x00007ffff736f065 <+629>: mov BYTE PTR [r8+r9*1],r10b
0x00007ffff736f069 <+633>: movzx r10d,BYTE PTR [rax+r9*4+0x1]
0x00007ffff736f06f <+639>: mov BYTE PTR [rdx+r9*1],r10b
0x00007ffff736f073 <+643>: movzx r10d,BYTE PTR [rax+r9*4+0x2]
0x00007ffff736f079 <+649>: mov BYTE PTR [rdi+r9*1],r10b
0x00007ffff736f07d <+653>: movzx r10d,BYTE PTR [rax+r9*4+0x3]
0x00007ffff736f083 <+659>: mov BYTE PTR [rsi+r9*1],r10b
0x00007ffff736f087 <+663>: add r9,0x1
0x00007ffff736f08b <+667>: cmp rcx,r9
0x00007ffff736f08e <+670>: ja 0x7ffff736f060 <_ZN7parquet22ByteStreamSplitEncoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE11FlushValuesEv+624>
Decoder<float>
0x7ffff736eca8 <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+104>: movzx ecx,BYTE PTR [r11+rdx*1]
0x7ffff736ecad <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+109>: mov al,BYTE PTR [rbx+rdx*1]
0x7ffff736ecb0 <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+112>: movzx edi,BYTE PTR [r10+rdx*1]
0x7ffff736ecb5 <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+117>: mov ah,cl
0x7ffff736ecb7 <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+119>: movzx ecx,BYTE PTR [r9+rdx*1]
0x7ffff736ecbc <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+124>: shl edi,0x10
0x7ffff736ecbf <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+127>: movzx eax,ax
0x7ffff736ecc2 <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+130>: or eax,edi
0x7ffff736ecc4 <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+132>: shl ecx,0x18
0x7ffff736ecc7 <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+135>: or eax,ecx
0x7ffff736ecc9 <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+137>: mov DWORD PTR [rsi+rdx*4],eax
0x7ffff736eccc <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+140>: add rdx,0x1
0x7ffff736ecd0 <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+144>: cmp r8,rdx
0x7ffff736ecd3 <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+147>: jne 0x7ffff736eca8 <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE4EEEE6DecodeEPfi+104>
Encoder<double>
0x00007ffff736ee70 <+128>: movzx esi,BYTE PTR [rdx]
0x00007ffff736ee73 <+131>: add rax,0x1
0x00007ffff736ee77 <+135>: add rdx,0x8
0x00007ffff736ee7b <+139>: mov BYTE PTR [rax-0x1],sil
0x00007ffff736ee7f <+143>: movzx esi,BYTE PTR [rdx-0x7]
0x00007ffff736ee83 <+147>: mov BYTE PTR [rax+rcx*1-0x1],sil
0x00007ffff736ee88 <+152>: movzx esi,BYTE PTR [rdx-0x6]
0x00007ffff736ee8c <+156>: mov BYTE PTR [rax+r10*1-0x1],sil
0x00007ffff736ee91 <+161>: movzx esi,BYTE PTR [rdx-0x5]
0x00007ffff736ee95 <+165>: mov BYTE PTR [rax+r8*1-0x1],sil
0x00007ffff736ee9a <+170>: movzx esi,BYTE PTR [rdx-0x4]
0x00007ffff736ee9e <+174>: mov BYTE PTR [rax+r9*1-0x1],sil
0x00007ffff736eea3 <+179>: movzx esi,BYTE PTR [rdx-0x3]
0x00007ffff736eea7 <+183>: mov BYTE PTR [rax+rbp*1-0x1],sil
0x00007ffff736eeac <+188>: movzx esi,BYTE PTR [rdx-0x2]
0x00007ffff736eeb0 <+192>: mov BYTE PTR [rax+rbx*1-0x1],sil
0x00007ffff736eeb5 <+197>: movzx esi,BYTE PTR [rdx-0x1]
0x00007ffff736eeb9 <+201>: mov BYTE PTR [rax+rdi*1-0x1],sil
0x00007ffff736eebe <+206>: cmp r11,rax
0x00007ffff736eec1 <+209>: jne 0x7ffff736ee70 <_ZN7parquet22ByteStreamSplitEncoderINS_12PhysicalTypeILNS_4Type4typeE5EEEE11FlushValuesEv+128>
Decoder<double>:
0x00007ffff736ecc8 <+184>: movzx r15d,BYTE PTR [r13+rdx*1+0x0]
0x00007ffff736ecce <+190>: mov al,BYTE PTR [r14+rdx*1]
0x00007ffff736ecd2 <+194>: mov rcx,r15
0x00007ffff736ecd5 <+197>: movzx r15d,BYTE PTR [r12+rdx*1]
0x00007ffff736ecda <+202>: mov ah,cl
0x00007ffff736ecdc <+204>: movabs rcx,0xffff00ffffffffff
0x00007ffff736ece6 <+214>: and rax,0xffffffffff00ffff
0x00007ffff736ecec <+220>: shl r15,0x10
0x00007ffff736ecf0 <+224>: or rax,r15
0x00007ffff736ecf3 <+227>: movzx r15d,BYTE PTR [rbp+rdx*1+0x0]
0x00007ffff736ecf9 <+233>: and rax,r8
0x00007ffff736ecfc <+236>: shl r15,0x18
0x00007ffff736ed00 <+240>: or rax,r15
0x00007ffff736ed03 <+243>: movzx r15d,BYTE PTR [rbx+rdx*1]
0x00007ffff736ed08 <+248>: and rax,rdi
0x00007ffff736ed0b <+251>: shl r15,0x20
0x00007ffff736ed0f <+255>: or rax,r15
0x00007ffff736ed12 <+258>: movzx r15d,BYTE PTR [r11+rdx*1]
0x00007ffff736ed17 <+263>: and rax,rcx
0x00007ffff736ed1a <+266>: movabs rcx,0xff00ffffffffffff
0x00007ffff736ed24 <+276>: shl r15,0x28
0x00007ffff736ed28 <+280>: or rax,r15
0x00007ffff736ed2b <+283>: movzx r15d,BYTE PTR [r10+rdx*1]
0x00007ffff736ed30 <+288>: and rax,rcx
0x00007ffff736ed33 <+291>: movabs rcx,0xffffffffffffff
0x00007ffff736ed3d <+301>: shl r15,0x30
0x00007ffff736ed41 <+305>: or rax,r15
0x00007ffff736ed44 <+308>: movzx r15d,BYTE PTR [r9+rdx*1]
0x00007ffff736ed49 <+313>: and rax,rcx
0x00007ffff736ed4c <+316>: shl r15,0x38
0x00007ffff736ed50 <+320>: or rax,r15
0x00007ffff736ed53 <+323>: mov QWORD PTR [rsi+rdx*8],rax
0x00007ffff736ed57 <+327>: add rdx,0x1
0x00007ffff736ed5b <+331>: cmp QWORD PTR [rsp+0x8],rdx
0x00007ffff736ed60 <+336>: jne 0x7ffff736ecc8 <_ZTv0_n64_N7parquet22ByteStreamSplitDecoderINS_12PhysicalTypeILNS_4Type4typeE5EEEE6DecodeEPdi+184>There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the compiler does not decide to simd-vectorize any of the loops.
In my experiments, it does make a difference to use simd -> https://github.com/martinradev/arrow-fp-compression-bench/blob/master/optimize_byte_stream_split/report_final.pdf
I have an implementation for the encoder and decoder, and I plan to submit a PR once we get this patch in.
Do you think this makes sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It probably makes sense, yes.
cpp/src/parquet/encoding_test.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to at least add the roundtrip helper(s) in this PR, since otherwise there's a lot of additional duplication.
|
I am a bit overwhelmed with other tasks but I will get back to this in the upcoming weekend. |
33d0dd1 to
42d006d
Compare
The patch implements an encoder and decoder for Parquet's BYTE_STREAM_SPLIT encoding. The patch also adds tests for the new encoding.
42d006d to
5a78f8b
Compare
|
Hey @pitrou , I did a little bit of refactoring to reduce the bloat in the test file. I also checked that the compiler unrolls the loop when GCC is the compiler. I did not check with clang but I suspect it does as well. |
|
@wesm Do you want to take a look again here? |
|
Looking |
|
Is the patch better now? |
|
For me, the only thing remaining is the lack of genericity and infrastructure reuse in unit tests: some tests hardcode a float32 datatype, others a float64 datatype, and for some reason the tests don't reuse the |
|
@pitrou Ok, then I can try to reiterate on the patch and address the issue. |
|
Sorry I've been at a conference this week and was a bit optimistic about my ability to review patches. I'm having a look at this right now |
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. Sorry for the delay on this. Can you open a JIRA issue for the follow up refactoring (better code reuse in encoding-test.cc)?
The patch implements an encoder and decoder for Parquet's BYTE_STREAM_SPLIT encoding. The patch also adds tests for the new encoding. Closes #6005 from martinradev/byte_stream_split_submit and squashes the following commits: 5a78f8b <Martin Radev> ARROW-5913: Add BYTE_STREAM_SPLIT encoder and decoder Authored-by: Martin Radev <martin.b.radev@gmail.com> Signed-off-by: Wes McKinney <wesm+git@apache.org>
The patch implements an encoder and decoder for Parquet's
BYTE_STREAM_SPLIT encoding. The patch also adds tests for
the new encoding.