[SPARK-55683][SQL] Optimize VectorizedPlainValuesReader.readUnsignedLongs #54479

LuciferYang wants to merge 1 commit into apache:master

Conversation
Test first; the PR description and performance comparison will be updated later.
Pull request overview
Optimizes Spark SQL’s Parquet vectorized PLAIN decoding for UINT_64 by replacing the per-value BigInteger(String) construction path with direct byte-level conversion, and extends the existing Parquet IO test to cover boundary cases for unsigned 64-bit decoding.
Changes:
- Reworked `VectorizedPlainValuesReader.readUnsignedLongs` to convert little-endian `UINT_64` bytes into `BigInteger`-compatible big-endian two's-complement byte arrays without `String`/`BigInteger` intermediates.
- Added a helper to produce `BigInteger.toByteArray()`-compatible encodings (zero handling + sign-byte handling).
- Extended `ParquetIOSuite`'s `UINT_64` test data with boundary values (0/1/max/min/-1/-2).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java | Implements the new byte-manipulation decoding path + helper for BigInteger-compatible output. |
| sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala | Extends existing UINT_64 Parquet read test with boundary values to validate the new encoding logic. |
Comments suppressed due to low confidence (2)
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java:278
`putLittleEndianBytesAsBigInteger` still allocates a new `byte[]` per value (`new byte[totalLen]`, plus `new byte[]{0}` for zeros). Since `WritableColumnVector.putByteArray(rowId, value, offset, count)` copies the bytes into `arrayData()`, you can reuse a single scratch buffer (e.g., `byte[9]`) across the loop and pass the appropriate offset/length to avoid per-element heap allocations entirely.
// Zero value: must write [0x00] rather than an empty array.
// BigInteger.ZERO.toByteArray() returns [0x00], and new BigInteger(new byte[0])
// throws NumberFormatException("Zero length BigInteger").
if (msbIndex == offset && src[offset] == 0) {
c.putByteArray(rowId, new byte[]{0});
return;
}
// Prepend a 0x00 sign byte if the most significant byte has bit 7 set.
// This matches BigInteger.toByteArray() behavior: positive values whose highest
// magnitude byte has the MSB set are prefixed with 0x00 to distinguish them
// from negative values in two's-complement encoding.
boolean needSignByte = (src[msbIndex] & 0x80) != 0;
int valueLen = msbIndex - offset + 1;
int totalLen = needSignByte ? valueLen + 1 : valueLen;
byte[] dest = new byte[totalLen];
int destOffset = 0;
if (needSignByte) {
dest[destOffset++] = 0x00;
}
// Reverse byte order: little-endian src → big-endian dest
for (int i = msbIndex; i >= offset; i--) {
dest[destOffset++] = src[i];
}
c.putByteArray(rowId, dest, 0, totalLen);
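The scratch-buffer reuse proposed in the review comment above could look like the following hedged sketch. The class name `ScratchEncoder` and the harness are illustrative, and it assumes (as the comment states) that `WritableColumnVector.putByteArray` copies bytes, making a single buffer safe to reuse across rows:

```java
import java.math.BigInteger;

// Illustrative sketch only: encodes an 8-byte little-endian unsigned value into a
// reusable 9-byte scratch buffer, avoiding a fresh byte[] per element. The caller
// would then invoke c.putByteArray(rowId, scratch, 0, len) with the returned length.
public class ScratchEncoder {
  final byte[] scratch = new byte[9]; // worst case: 8 value bytes + 1 sign byte

  int encode(byte[] src, int offset) {
    int msbIndex = offset + 7;
    while (msbIndex > offset && src[msbIndex] == 0) msbIndex--; // skip leading zeros
    if (msbIndex == offset && src[offset] == 0) {
      scratch[0] = 0;                  // zero must encode as [0x00], not empty
      return 1;
    }
    int destOffset = 0;
    if ((src[msbIndex] & 0x80) != 0) {
      scratch[destOffset++] = 0x00;    // sign byte keeps the value positive
    }
    for (int i = msbIndex; i >= offset; i--) {
      scratch[destOffset++] = src[i];  // little-endian -> big-endian
    }
    return destOffset;
  }

  public static void main(String[] args) {
    ScratchEncoder enc = new ScratchEncoder();
    long v = -1L; // interpreted as unsigned 2^64 - 1
    byte[] le = new byte[8];
    for (int i = 0; i < 8; i++) le[i] = (byte) (v >>> (8 * i));
    int len = enc.encode(le, 0);
    byte[] expected = new BigInteger(Long.toUnsignedString(v)).toByteArray();
    if (len != expected.length) throw new AssertionError("length mismatch");
    for (int i = 0; i < len; i++) {
      if (enc.scratch[i] != expected[i]) throw new AssertionError("byte mismatch");
    }
    System.out.println("scratch encoding matches BigInteger.toByteArray()");
  }
}
```

Whether this wins in practice depends on JIT escape analysis already eliminating the short-lived arrays, which is presumably what the follow-up benchmark mentioned later would measure.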
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala:1275
- The boundary values sequence is duplicated here and again when constructing `boundaryExpected` below. Consider defining a single `val boundaryValues = Seq(...)` and reusing it for both writing and expected rows to avoid accidental drift if the list changes in the future.
// Boundary values: zero, one, signed extremes interpreted as unsigned
Seq(0L, 1L, Long.MaxValue, Long.MinValue, -2L, -1L).foreach { v =>
val group = factory.newGroup().append("a", v)
writer.write(group)
}
Merged to master for Apache Spark 4.2.0. Thank you, @LuciferYang and @pan3793.
This suggestion from Copilot sounds good. Let me try it out to see the effect. If there are benefits, I'll submit a FOLLOWUP.
It has shown stable optimization effects for OnHeap, and I've opened a follow-up:
What changes were proposed in this pull request?
This PR optimizes `VectorizedPlainValuesReader.readUnsignedLongs` by replacing the per-element `BigInteger` heap allocation chain with direct byte manipulation. The original implementation allocates multiple objects per element:
The new implementation reads raw little-endian bytes directly from the `ByteBuffer` backing array (when available) and converts them to `BigInteger`-compatible big-endian encoding in a single pass.

The private helper `putLittleEndianBytesAsBigInteger` handles the conversion with output matching `BigInteger.toByteArray()` semantics:

- Zero encodes as `[0x00]` (1 byte) rather than an empty array, since `new BigInteger(new byte[0])` throws `NumberFormatException`.
- A leading `0x00` sign byte is prepended when the most significant byte has bit 7 set, to ensure the value is interpreted as positive by `BigInteger`.

Why are the changes needed?
The original implementation constructs a `BigInteger` via `Long.toUnsignedString` + `new BigInteger(String)`, which involves per-element allocations of a `String`, a `BigInteger`, its internal `int[]` magnitude array, and the final `byte[]`. For a typical batch of 4096 values this means ~16K object allocations, creating significant GC pressure in workloads reading large `UINT_64` columns.

The new implementation reduces this to one `byte[]` allocation per element by operating directly on the raw bytes from the `ByteBuffer`, avoiding all intermediate object creation. Additionally, the direct-buffer fallback path reuses a single `byte[8]` scratch buffer across the entire batch.

Does this PR introduce any user-facing change?
No.
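The `toByteArray()`-compatibility claims described above can be sanity-checked independently. This standalone snippet (illustrative, not the PR's code; the class and method names are invented) encodes each boundary value from the test and compares the result byte-for-byte with what the original `Long.toUnsignedString` + `BigInteger` path produces:

```java
import java.math.BigInteger;
import java.util.Arrays;

// Checks that direct byte-level encoding of an unsigned long matches
// new BigInteger(Long.toUnsignedString(v)).toByteArray() for the boundary values.
public class UnsignedLongEncodingCheck {
  static byte[] toBigIntegerBytes(long v) {
    if (v == 0L) return new byte[]{0};        // zero is [0x00], never empty
    byte[] be = new byte[8];
    long x = v;
    for (int i = 7; i >= 0; i--) {            // write big-endian
      be[i] = (byte) x;
      x >>>= 8;
    }
    int msb = 0;
    while (be[msb] == 0) msb++;               // drop leading zero bytes
    boolean signByte = (be[msb] & 0x80) != 0; // MSB set -> prepend 0x00
    byte[] out = new byte[(8 - msb) + (signByte ? 1 : 0)];
    System.arraycopy(be, msb, out, signByte ? 1 : 0, 8 - msb);
    return out;
  }

  public static void main(String[] args) {
    long[] boundary = {0L, 1L, Long.MAX_VALUE, Long.MIN_VALUE, -2L, -1L};
    for (long v : boundary) {
      byte[] expected = new BigInteger(Long.toUnsignedString(v)).toByteArray();
      if (!Arrays.equals(expected, toBigIntegerBytes(v))) {
        throw new AssertionError("mismatch for " + Long.toUnsignedString(v));
      }
    }
    System.out.println("all boundary values match BigInteger.toByteArray()");
  }
}
```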
How was this patch tested?
- `SPARK-34817: Read UINT_64 as Decimal from parquet` in `ParquetIOSuite` was extended with boundary values covering the critical edge cases of the new byte manipulation logic.
- The old implementation was kept as `OldVectorizedPlainValuesReader`, and the latency of the old and new `readUnsignedLongs` methods was compared using JMH:

Benchmark Code (click to expand)

Run `build/sbt "sql/Test/runMain org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReaderJMHBenchmark"` to conduct the test.

Benchmark results:
Both onHeap and offHeap paths show approximately 8x improvement.
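The actual JMH benchmark is collapsed above and not reproduced here. As a rough, non-JMH illustration of what is being compared (class name, seed constant, and iteration counts are invented for this sketch), the old `String`-based path and a direct byte path can be timed side by side:

```java
import java.math.BigInteger;

// Crude timing sketch (not JMH): compares the old per-element path
// (Long.toUnsignedString + new BigInteger(String) + toByteArray) against
// direct byte-level conversion for a batch of unsigned longs.
public class RoughComparison {
  static byte[] oldPath(long v) {
    return new BigInteger(Long.toUnsignedString(v)).toByteArray();
  }

  static byte[] newPath(long v) {
    if (v == 0L) return new byte[]{0};
    byte[] be = new byte[8];
    long x = v;
    for (int i = 7; i >= 0; i--) { be[i] = (byte) x; x >>>= 8; }
    int msb = 0;
    while (be[msb] == 0) msb++;
    boolean sign = (be[msb] & 0x80) != 0;
    byte[] out = new byte[(8 - msb) + (sign ? 1 : 0)];
    System.arraycopy(be, msb, out, sign ? 1 : 0, 8 - msb);
    return out;
  }

  public static void main(String[] args) {
    long[] batch = new long[4096];                  // one typical column batch
    for (int i = 0; i < batch.length; i++) batch[i] = 0x9E3779B97F4A7C15L * (i + 1);

    long t0 = System.nanoTime();
    for (int r = 0; r < 1000; r++) for (long v : batch) oldPath(v);
    long t1 = System.nanoTime();
    for (int r = 0; r < 1000; r++) for (long v : batch) newPath(v);
    long t2 = System.nanoTime();

    System.out.printf("old: %d ms, new: %d ms%n",
        (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
  }
}
```

Timing from a bare loop like this is only indicative (no warmup control or dead-code blackholing); the ~8x figure above comes from the PR's JMH harness.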
Was this patch authored or co-authored using generative AI tooling?
Yes, Claude Sonnet 4.6 was used to assist in completing the code writing.