[SPARK-54571][CORE][SQL] Use LZ4 safeDecompressor#53454
[SPARK-54571][CORE][SQL] Use LZ4 safeDecompressor#53454pan3793 wants to merge 5 commits intoapache:masterfrom
Conversation
|
the test failure is caused by |
|
cc @dbtsai @huaxingao, I checked all versions available in Maven Central, all of them have the same issue.
Update: contacted the DB2 JDBC driver's author, new release that bundles the latest lz4-java is working in progress Update: DB2 team provides a special JDBC driver 12.1.3.0_special_74723 that bundles lz4-java 1.10.1 which addressed |
JIRA Issue Information=== Improvement SPARK-54571 === This comment was automatically generated by GitHub Actions |
| <postgresql.version>42.7.7</postgresql.version> | ||
| <db2.jcc.version>11.5.9.0</db2.jcc.version> | ||
| <!-- A special version that bundles lz4-java 1.10.1 --> | ||
| <db2.jcc.version>12.1.3.0_special_74723</db2.jcc.version> |
There was a problem hiding this comment.
this dep is only used for testing
| <postgresql.version>42.7.7</postgresql.version> | ||
| <db2.jcc.version>11.5.9.0</db2.jcc.version> | ||
| <!-- A special version that bundles lz4-java 1.10.1 --> | ||
| <db2.jcc.version>12.1.3.0_special_74723</db2.jcc.version> |
There was a problem hiding this comment.
Can we upgrade it first in a separate pr?
There was a problem hiding this comment.
okay, let me upgrade it first
There was a problem hiding this comment.
@LuciferYang FYI, I opened SPARK-55136 (#53920) for it
|
@viirya can you take a look? Thanks. |
|
Hmm, so that's said, in lz4-java 1.10.1, fastDecompressor is paradoxically much slower than safeDecompressor? |
|
@viirya yeah, it was mentioned in https://sites.google.com/sonatype.com/vulnerabilities/cve-2025-12183 |
| LZ4BlockInputStream.newBuilder() | ||
| .withDecompressor(lz4Factory.safeDecompressor()) | ||
| .withChecksum(xxHashFactory.newStreamingHash32(defaultSeed).asChecksum) | ||
| .withStopOnEmptyBlock(disableConcatenationOfByteStream) |
There was a problem hiding this comment.
Is this .withStopOnEmptyBlock(disableConcatenationOfByteStream) a correct mapping to previous behavior?
There was a problem hiding this comment.
yes, exactly, you can check the source code at
public LZ4BlockInputStream(InputStream in, LZ4FastDecompressor fastDecompressor, Checksum checksum, boolean stopOnEmptyBlock) {
this(in, fastDecompressor, null, checksum, stopOnEmptyBlock);
}
| <!-- A special version that bundles lz4-java 1.10.1 --> | ||
| <db2.jcc.version>12.1.3.0_special_74723</db2.jcc.version> |
There was a problem hiding this comment.
Does it mean we need to update this dependency when upgrading lz4-java?
|
@pan3793 thanks for the great effort! I was not aware of this PR and opened the other one due to performance regression. Do we know what are the current blockers from merging this PR? |
|
@Yicong-Huang, you can find all related issues/background in the PR description |
…17, Scala 2.13, split 1 of 1)
…21, Scala 2.13, split 1 of 1)
| Compression: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative | ||
| ------------------------------------------------------------------------------------------------------------------------ | ||
| Compression 4 times 2605 2611 8 0.0 651243236.8 1.0X | ||
| Compression 4 times 2613 2621 12 0.0 653232793.3 1.0X |
There was a problem hiding this comment.
Although safeDecompressor shows slower performance, but it's in the marginal scope. There might be some improvements at lz4-java 1.10.3.
Given the above, could you update the PR description?
There was a problem hiding this comment.
nvm. I forgot that this PR is about Decompression (lz4Factory.safeDecompressor).
|
BTW, I updated the PR description to include the following because the previous benchmark result was generated based on that. |
|
@dongjoon-hyun, the benchmark results in this PR are generated by lz4-java 1.10.1 a few months ago, I will regenerate that once we resolve the Db2 JDBC driver issue. |
|
What I meant (by the following) was that the result in the current
For the new result, yes. We need to update it, of course. |
|
I see |
What changes were proposed in this pull request?
Previously, lz4-java was upgraded to 1.10.1 to address CVEs,
lz4-javato 1.10.0 #53327lz4-javato 1.10.1 #53347while this casues significant performance drop, see the benchmark report at
this PR follows the suggestion to migrate to safeDecompressor.
This PR also upgrades DB2 JDBC driver to a special version 12.1.3.0_special_74723 provided by DB2 team that bundles lz4-java 1.10.1 which addresses
NoSuchMethodErrorissue (it only affects test).Update: I opened SPARK-55136 (#53920) to process DB2 JDBC driver independently, please reivew that first.
Why are the changes needed?
Mitigate performance regression.
Does this PR introduce any user-facing change?
No, except for performance.
How was this patch tested?
GHA for functionality, benchmark for performance.
Was this patch authored or co-authored using generative AI tooling?
No.