Arrow: Avoid extra dictionary buffer copy #5137

bryanck · 2022-06-27T11:35:24Z

This PR changes the dictionary value accessors in the vectorized Parquet reader to read values directly from the underlying dictionary, rather than first copying them into a new buffer (the dictionary decimal accessor classes already did this). The underlying Parquet dictionary classes already load the values into a buffer, so copying them into another buffer appears redundant in some cases.

This PR also makes a couple of changes to avoid copying binary buffers when building string values, where possible.

In very limited testing, this shows a performance gain of over 20% in vectorized read performance in some scenarios, though more testing would be required to get more accurate metrics.
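As an illustration of the string-building change, here is a minimal sketch of decoding a `ByteBuffer` as UTF-8 without an intermediate copy when the buffer exposes its backing array. The class name `ByteBufferStrings` and the fallback branch for direct or read-only buffers are assumptions made for this example; only the `hasArray()` branch mirrors the code shown in the review below.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the PR.
public class ByteBufferStrings {

  // Decodes the remaining bytes of the buffer as UTF-8 without changing its position.
  public static String ofByteBuffer(ByteBuffer byteBuffer) {
    if (byteBuffer.hasArray()) {
      // Heap buffer with an accessible backing array: decode in place.
      // arrayOffset() accounts for slices; position() for bytes already consumed.
      return new String(
          byteBuffer.array(),
          byteBuffer.arrayOffset() + byteBuffer.position(),
          byteBuffer.remaining(),
          StandardCharsets.UTF_8);
    }
    // Assumed fallback: direct or read-only buffers have no accessible array,
    // so the bytes are copied out through a duplicate to leave the position untouched.
    byte[] bytes = new byte[byteBuffer.remaining()];
    byteBuffer.duplicate().get(bytes);
    return new String(bytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[] raw = "xxhelloyy".getBytes(StandardCharsets.UTF_8);
    ByteBuffer heap = ByteBuffer.wrap(raw, 2, 5);
    System.out.println(ofByteBuffer(heap));

    ByteBuffer direct = ByteBuffer.allocateDirect(5);
    direct.put("world".getBytes(StandardCharsets.UTF_8));
    direct.flip();
    System.out.println(ofByteBuffer(direct));
  }
}
```

The in-place branch still allocates the resulting `String`, but it skips the temporary `byte[]` that a plain `get(bytes)` would need.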

kbendick

Thanks @bryanck for working on this!

Mostly some style nits, plus a question for my own clarification. Thank you!

kbendick · 2022-06-27T17:16:48Z

...spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ArrowVectorAccessorFactory.java

+    public UTF8String ofByteBuffer(ByteBuffer byteBuffer) {
+      if (byteBuffer.hasArray()) {
+        return UTF8String.fromBytes(
+                byteBuffer.array(), byteBuffer.arrayOffset() + byteBuffer.position(), byteBuffer.remaining());


Nit: indentation

kbendick · 2022-06-27T17:17:03Z

...spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ArrowVectorAccessorFactory.java

+    public UTF8String ofByteBuffer(ByteBuffer byteBuffer) {
+      if (byteBuffer.hasArray()) {
+        return UTF8String.fromBytes(
+                byteBuffer.array(), byteBuffer.arrayOffset() + byteBuffer.position(), byteBuffer.remaining());


Nit: over-indented (should be 4 spaces from the start of return on the line above).

Thanks, I fixed these. Checkstyle didn't seem to mind...

Thanks for letting me know. I’ll see if I can add a checkstyle rule for that or update one to catch it!

kbendick · 2022-06-27T17:17:55Z

...spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ArrowVectorAccessorFactory.java

+    public UTF8String ofByteBuffer(ByteBuffer byteBuffer) {
+      if (byteBuffer.hasArray()) {
+        return UTF8String.fromBytes(
+                byteBuffer.array(), byteBuffer.arrayOffset() + byteBuffer.position(), byteBuffer.remaining());


Nit: indentation

kbendick · 2022-06-27T17:18:33Z

...spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ArrowVectorAccessorFactory.java

+    public UTF8String ofByteBuffer(ByteBuffer byteBuffer) {
+      if (byteBuffer.hasArray()) {
+        return UTF8String.fromBytes(
+                byteBuffer.array(), byteBuffer.arrayOffset() + byteBuffer.position(), byteBuffer.remaining());


Nit: indentation

kbendick · 2022-06-27T17:19:56Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactory.java

-          .toArray(genericArray(stringFactory.getGenericClass()));
+      this.dictionary = dictionary;
+      this.stringFactory = stringFactory;
+      this.cache = genericArray(stringFactory.getGenericClass(), dictionary.getMaxId() + 1);


Question: why are you adding 1 here?

Dictionary IDs are 0-based, so to index every ID up to and including getMaxId(), the array needs to be one larger than getMaxId().
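The sizing can be sketched with a small cache that is allocated up front and filled on demand. This is an illustration only: `LazyDictionaryCache` and the `IntFunction` decoder are hypothetical stand-ins for the dictionary and string factory in the diff, and the lazy fill is an assumption about how the cache array is used.

```java
import java.util.function.IntFunction;

// Hypothetical illustration, not the PR's implementation.
public class LazyDictionaryCache {
  private final String[] cache;
  private final IntFunction<String> decoder; // stand-in for decoding a dictionary entry by ID

  public LazyDictionaryCache(int maxId, IntFunction<String> decoder) {
    // Valid IDs are 0..maxId inclusive, so the array needs maxId + 1 slots.
    this.cache = new String[maxId + 1];
    this.decoder = decoder;
  }

  // Returns the decoded value for the ID, decoding it at most once.
  public String get(int id) {
    String value = cache[id];
    if (value == null) {
      value = decoder.apply(id);
      cache[id] = value;
    }
    return value;
  }

  public int capacity() {
    return cache.length;
  }

  public static void main(String[] args) {
    LazyDictionaryCache cache = new LazyDictionaryCache(2, id -> "value-" + id);
    // ID 2 equals maxId; it is only addressable because of the extra slot.
    System.out.println(cache.get(2));
  }
}
```

With an array of size `maxId`, the lookup for the largest ID would throw `ArrayIndexOutOfBoundsException`.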

kbendick · 2022-06-27T17:20:09Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/ArrowVectorAccessors.java

+    public String ofByteBuffer(ByteBuffer byteBuffer) {
+      if (byteBuffer.hasArray()) {
+        return new String(byteBuffer.array(), byteBuffer.arrayOffset() + byteBuffer.position(),
+                byteBuffer.remaining(), StandardCharsets.UTF_8);


Nit: over-indented (should be 4 spaces from the start of return on the line above).

rdblue · 2022-06-27T19:12:43Z

Looks great. Thanks for fixing this, @bryanck!

github-actions bot added the arrow label Jun 27, 2022

Arrow: Avoid extra dictionary buffer copy

5d497cc

bryanck force-pushed the dict-value-decode branch from 156d23d to 5d497cc Compare June 27, 2022 11:53

github-actions bot added the spark label Jun 27, 2022

bryanck force-pushed the dict-value-decode branch from 2c0c7dd to 4d7e9ca Compare June 27, 2022 13:09

Avoid buffer copy when decoding strings where possible

3686ec3

bryanck force-pushed the dict-value-decode branch from 4d7e9ca to 3686ec3 Compare June 27, 2022 13:13

more efficient string decoding for byte buffer

f25e381

kbendick reviewed Jun 27, 2022

fix indentation

e1d83b6

rdblue approved these changes Jun 27, 2022

rdblue merged commit eff6556 into apache:master Jun 27, 2022

kbendick mentioned this pull request Jul 6, 2022

Vectorized Parquet reader spends a lot of time in getVectorAccessor #4164

Closed

namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022

Arrow: Avoid extra dictionary buffer copy (apache#5137)

bb437a1

namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022

Arrow: Avoid extra dictionary buffer copy (apache#5137)

d4de393
