
[SPARK-55683][SQL][FOLLOWUP] Optimize VectorizedPlainValuesReader.readUnsignedLongs to reuse scratch buffer and avoid per-element allocations#54510

Open
LuciferYang wants to merge 3 commits into apache:master from LuciferYang:SPARK-55683-FOLLOWUP

Conversation

@LuciferYang (Contributor) commented Feb 26, 2026

What changes were proposed in this pull request?

This PR follows a suggestion from Copilot (#54479 (review)) and further optimizes VectorizedPlainValuesReader.readUnsignedLongs by introducing a reusable scratch buffer, eliminating the per-element byte[] allocations introduced in the previous refactoring.

The previous implementation allocated a new byte[] for each element's encoded output:

// Previous: new byte[totalLen] per element, plus new byte[]{0} for zero values
byte[] dest = new byte[totalLen];
...
c.putByteArray(rowId, dest, 0, totalLen);

The new implementation allocates a single byte[9] scratch buffer once per batch and reuses it across all elements. Since WritableColumnVector.putByteArray copies the bytes into its internal storage immediately, the scratch buffer can be safely overwritten on the next iteration:

// New: one byte[9] allocated per batch, reused for every element
byte[] scratch = new byte[9];
for (...) {
    putLittleEndianBytesAsBigInteger(c, rowId, src, offset, scratch);
}

The scratch buffer is sized at 9 bytes to accommodate the worst case: one 0x00 sign byte plus 8 value bytes. The zero-value special case is also handled through the scratch buffer, avoiding the previous new byte[]{0} allocation.
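To make the encoding concrete, here is a minimal standalone sketch of the conversion described above. The helper name and exact structure are assumptions for illustration, not Spark's actual putLittleEndianBytesAsBigInteger: it reverses the little-endian source bytes into big-endian order, prepends a 0x00 sign byte when the high bit of the most significant byte is set (so BigInteger reads the value as non-negative), and collapses an all-zero value to a single 0x00 byte.

```java
import java.math.BigInteger;
import java.util.Arrays;

public class UnsignedLongEncoding {
  // Hypothetical sketch: encode 8 little-endian bytes at src[offset..offset+7]
  // into scratch as a minimal big-endian, sign-safe byte sequence.
  // Returns the number of bytes written (1..9).
  static int encode(byte[] src, int offset, byte[] scratch) {
    // Find the most significant non-zero byte (src is little-endian).
    int msb = 7;
    while (msb >= 0 && src[offset + msb] == 0) msb--;
    if (msb < 0) {               // all zero: a single 0x00 byte suffices
      scratch[0] = 0;
      return 1;
    }
    int start = 0;
    // If the top bit is set, prepend a 0x00 sign byte so the value
    // is interpreted as positive rather than negative two's complement.
    if ((src[offset + msb] & 0x80) != 0) {
      scratch[0] = 0;
      start = 1;
    }
    // Copy the significant bytes in reverse (little-endian -> big-endian).
    for (int i = 0; i <= msb; i++) {
      scratch[start + i] = src[offset + msb - i];
    }
    return start + msb + 1;
  }

  public static void main(String[] args) {
    byte[] scratch = new byte[9];
    byte[] src = new byte[8];
    Arrays.fill(src, (byte) 0xFF);  // unsigned value 2^64 - 1
    int len = encode(src, 0, scratch);
    BigInteger v = new BigInteger(Arrays.copyOf(scratch, len));
    // Worst case uses all 9 bytes: sign byte + 8 value bytes.
    System.out.println(len + " " + v);
  }
}
```

For 0xFFFFFFFFFFFFFFFF the high bit is set, so all 9 scratch bytes are used, which is why 9 is the worst-case size.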

Why are the changes needed?

The previous implementation still allocated one byte[] per element for the encoded output. For a typical batch of 4096 values this means 4096 heap allocations per readUnsignedLongs call, creating GC pressure in workloads that read large UINT_64 columns. With the scratch buffer approach, the entire batch produces only 2 allocations regardless of batch size: the byte[9] scratch buffer and a byte[8] used as a fallback read buffer for direct ByteBuffers.
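The reuse is only safe because putByteArray copies the bytes into the vector's own storage before the next iteration overwrites the scratch buffer. The following sketch demonstrates this property with a stand-in class; SimpleVector is a hypothetical simplification of WritableColumnVector, not its real API.

```java
import java.util.ArrayList;
import java.util.List;

public class ScratchReuseDemo {
  // Hypothetical stand-in for WritableColumnVector: copies bytes on put,
  // which is the property that makes scratch-buffer reuse safe.
  static class SimpleVector {
    final List<byte[]> rows = new ArrayList<>();
    void putByteArray(byte[] src, int offset, int len) {
      byte[] copy = new byte[len];              // eager internal copy
      System.arraycopy(src, offset, copy, 0, len);
      rows.add(copy);
    }
  }

  public static void main(String[] args) {
    SimpleVector v = new SimpleVector();
    byte[] scratch = new byte[9];               // one allocation per batch
    for (int i = 0; i < 3; i++) {
      scratch[0] = (byte) (i + 1);              // overwrite scratch each time
      v.putByteArray(scratch, 0, 1);
    }
    // Each row retained its own value despite the shared buffer.
    System.out.println(v.rows.get(0)[0] + " " + v.rows.get(1)[0] + " " + v.rows.get(2)[0]);
  }
}
```

If the vector held a reference to the caller's array instead of copying, every row would end up seeing the last value written, and the optimization would be incorrect.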

Does this PR introduce any user-facing change?

No

How was this patch tested?

Java 17

[info] Benchmark                                                              (numValues)  Mode  Cnt       Score      Error  Units
[info] VectorizedPlainValuesReaderJMHBenchmark.readUnsignedLongs_offHeap_New     10000000  avgt   10  233820.658 ± 1888.523  us/op
[info] VectorizedPlainValuesReaderJMHBenchmark.readUnsignedLongs_offHeap_Old     10000000  avgt   10  255563.248 ± 3500.165  us/op
[info] VectorizedPlainValuesReaderJMHBenchmark.readUnsignedLongs_onHeap_New      10000000  avgt   10  228672.684 ± 2985.496  us/op
[info] VectorizedPlainValuesReaderJMHBenchmark.readUnsignedLongs_onHeap_Old      10000000  avgt   10  275756.804 ± 2065.405  us/op

Java 21

[info] Benchmark                                                              (numValues)  Mode  Cnt       Score       Error  Units
[info] VectorizedPlainValuesReaderJMHBenchmark.readUnsignedLongs_offHeap_New     10000000  avgt   10  241977.924 ± 15125.343  us/op
[info] VectorizedPlainValuesReaderJMHBenchmark.readUnsignedLongs_offHeap_Old     10000000  avgt   10  250343.470 ±  1342.509  us/op
[info] VectorizedPlainValuesReaderJMHBenchmark.readUnsignedLongs_onHeap_New      10000000  avgt   10  212929.948 ±  1387.671  us/op
[info] VectorizedPlainValuesReaderJMHBenchmark.readUnsignedLongs_onHeap_Old      10000000  avgt   10  274561.949 ±  1226.348  us/op

Based on these results, the onHeap path shows roughly a 17-22% improvement and the offHeap path roughly a 3-9% improvement across Java 17 and Java 21.

Was this patch authored or co-authored using generative AI tooling?

Yes. Claude Sonnet 4.6 was used to assist with writing the code.

Comment on lines +282 to +283
// putByteArray copies the bytes into arrayData(), so scratch can be safely reused
c.putByteArray(rowId, scratch, 0, totalLen);
@pan3793 (Member) commented Feb 26, 2026

Could you leave the same comment at line 262?

@LuciferYang (Contributor, Author) replied:
done
