Skip to content

[SPARK-54571][CORE][SQL] Use LZ4 safeDecompressor#53454

Open
pan3793 wants to merge 5 commits intoapache:masterfrom
pan3793:SPARK-54571
Open

[SPARK-54571][CORE][SQL] Use LZ4 safeDecompressor#53454
pan3793 wants to merge 5 commits intoapache:masterfrom
pan3793:SPARK-54571

Conversation

@pan3793
Copy link
Member

@pan3793 pan3793 commented Dec 12, 2025

What changes were proposed in this pull request?

Previously, lz4-java was upgraded to 1.10.1 to address CVEs,

while this casues significant performance drop, see the benchmark report at

this PR follows the suggestion to migrate to safeDecompressor.


This PR also upgrades DB2 JDBC driver to a special version 12.1.3.0_special_74723 provided by DB2 team that bundles lz4-java 1.10.1 which addresses NoSuchMethodError issue (it only affects test).

Update: I opened SPARK-55136 (#53920) to process DB2 JDBC driver independently, please reivew that first.

Why are the changes needed?

Mitigate performance regression.

Does this PR introduce any user-facing change?

No, except for performance.

How was this patch tested?

GHA for functionality, benchmark for performance.

TL;DR - my test results show lz4-java 1.10.1 is about 10~15% slower on lz4 compression than 1.8.0, and is about ~5% slower on lz4 decompression even with migrating to suggested safeDecompressor

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the CORE label Dec 12, 2025
@pan3793 pan3793 changed the title WIP [SPARK-54571] Use safeDecompressor WIP [SPARK-54571] Use LZ4 safeDecompressor Dec 12, 2025
@pan3793
Copy link
Member Author

pan3793 commented Dec 14, 2025

the test failure is caused by com.ibm.db2:jcc includes unshaded old lz4 classes

@pan3793
Copy link
Member Author

pan3793 commented Dec 15, 2025

cc @dbtsai @huaxingao, com.ibm.db2:jcc includes unshaded old lz4 classes, which causes sql/hive/docker-it modules test failure after this patch

java.lang.NoSuchMethodError: 'net.jpountz.lz4.LZ4BlockInputStream$Builder net.jpountz.lz4.LZ4BlockInputStream.newBuilder()'
 	at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:156)
 	...

I checked all versions available in Maven Central, all of them have the same issue.
https://mvnrepository.com/artifact/com.ibm.db2/jcc

I don't find the public contact info of the IBM DB2 JDBC driver team, not sure what's the next step, temporarily purge the dependency and disable DB2 tests? Or any better ideas?


Update: contacted the DB2 JDBC driver's author, new release that bundles the latest lz4-java is working in progress


Update: DB2 team provides a special JDBC driver 12.1.3.0_special_74723 that bundles lz4-java 1.10.1 which addressed NoSuchMethodError issue.

@github-actions github-actions bot added the SQL label Dec 17, 2025
@pan3793 pan3793 changed the title WIP [SPARK-54571] Use LZ4 safeDecompressor [WIP][SPARK-54571] Use LZ4 safeDecompressor Jan 22, 2026
@github-actions github-actions bot added the BUILD label Jan 22, 2026
@github-actions
Copy link

JIRA Issue Information

=== Improvement SPARK-54571 ===
Summary: Use LZ4 safeDecompressor
Assignee: None
Status: Open
Affected: ["4.2.0"]


This comment was automatically generated by GitHub Actions

@pan3793 pan3793 changed the title [WIP][SPARK-54571] Use LZ4 safeDecompressor [SPARK-54571] Use LZ4 safeDecompressor Jan 22, 2026
@pan3793 pan3793 changed the title [SPARK-54571] Use LZ4 safeDecompressor [SPARK-54571][CORE][SQL] Use LZ4 safeDecompressor Jan 22, 2026
<postgresql.version>42.7.7</postgresql.version>
<db2.jcc.version>11.5.9.0</db2.jcc.version>
<!-- A special version that bundles lz4-java 1.10.1 -->
<db2.jcc.version>12.1.3.0_special_74723</db2.jcc.version>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this dep is only used for testing

@pan3793 pan3793 marked this pull request as ready for review January 22, 2026 08:00
@pan3793
Copy link
Member Author

pan3793 commented Jan 22, 2026

<postgresql.version>42.7.7</postgresql.version>
<db2.jcc.version>11.5.9.0</db2.jcc.version>
<!-- A special version that bundles lz4-java 1.10.1 -->
<db2.jcc.version>12.1.3.0_special_74723</db2.jcc.version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we upgrade it first in a separate pr?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, let me upgrade it first

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LuciferYang FYI, I opened SPARK-55136 (#53920) for it

@dbtsai
Copy link
Member

dbtsai commented Feb 11, 2026

@viirya can you take a look? Thanks.

@viirya
Copy link
Member

viirya commented Feb 11, 2026

Hmm, so that's said, in lz4-java 1.10.1, fastDecompressor is paradoxically much slower than safeDecompressor?

@pan3793
Copy link
Member Author

pan3793 commented Feb 11, 2026

LZ4BlockInputStream.newBuilder()
.withDecompressor(lz4Factory.safeDecompressor())
.withChecksum(xxHashFactory.newStreamingHash32(defaultSeed).asChecksum)
.withStopOnEmptyBlock(disableConcatenationOfByteStream)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this .withStopOnEmptyBlock(disableConcatenationOfByteStream) a correct mapping to previous behavior?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, exactly, you can check the source code at

https://github.com/yawkat/lz4-java/blob/33d180cb70c4d93c80fb0dc3ab3002f457e93840/src/java/net/jpountz/lz4/LZ4BlockInputStream.java#L69

  public LZ4BlockInputStream(InputStream in, LZ4FastDecompressor fastDecompressor, Checksum checksum, boolean stopOnEmptyBlock) {
    this(in, fastDecompressor, null, checksum, stopOnEmptyBlock);
  }

Comment on lines +343 to +344
<!-- A special version that bundles lz4-java 1.10.1 -->
<db2.jcc.version>12.1.3.0_special_74723</db2.jcc.version>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it mean we need to update this dependency when upgrading lz4-java?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya, ideally, we should, but I'm afraid we can't, we can discuss the Db2 JDBC issue in the dedicated PR #53920

@Yicong-Huang
Copy link
Contributor

@pan3793 thanks for the great effort! I was not aware of this PR and opened the other one due to performance regression. Do we know what are the current blockers from merging this PR?

cc @dbtsai @ueshin @dongjoon-hyun

@pan3793
Copy link
Member Author

pan3793 commented Feb 26, 2026

@Yicong-Huang, you can find all related issues/background in the PR description

Compression: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Compression 4 times 2605 2611 8 0.0 651243236.8 1.0X
Compression 4 times 2613 2621 12 0.0 653232793.3 1.0X
Copy link
Member

@dongjoon-hyun dongjoon-hyun Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although safeDecompressor shows slower performance, but it's in the marginal scope. There might be some improvements at lz4-java 1.10.3.

Given the above, could you update the PR description?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm. I forgot that this PR is about Decompression (lz4Factory.safeDecompressor).

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Feb 26, 2026

BTW, I updated the PR description to include the following because the previous benchmark result was generated based on that.

@pan3793
Copy link
Member Author

pan3793 commented Feb 26, 2026

@dongjoon-hyun, the benchmark results in this PR are generated by lz4-java 1.10.1 a few months ago, I will regenerate that once we resolve the Db2 JDBC driver issue.

@dongjoon-hyun
Copy link
Member

What I meant (by the following) was that the result in the current master branch. The - part.

the previous benchmark result was generated based on that.

- Decompression 4 times                              2219           2220           1          0.0   554762743.5       1.0X

For the new result, yes. We need to update it, of course.

@pan3793
Copy link
Member Author

pan3793 commented Feb 26, 2026

I see

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants