[SPARK-54693][CORE][TESTS] Add LZ4TPCDSDataBenchmark by pan3793 · Pull Request #53453 · apache/spark

pan3793 · 2025-12-12T08:35:41Z

What changes were proposed in this pull request?

Add LZ4TPCDSDataBenchmark, which tests LZ4CompressionCodec against the TPCDS catalog_sales.dat file (SF1), about 283M in size.

Why are the changes needed?

Add a benchmark to measure the performance impact of the lz4 security upgrade.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added benchmark results. Because the change includes a refactor that touches ZStandardTPCDSDataBenchmark, its benchmark results are updated as well.

Was this patch authored or co-authored using generative AI tooling?

No.

pan3793 · 2025-12-12T08:50:00Z

TL;DR: my test results show that lz4-java 1.10.1 is about 10~15% slower than 1.8.0 on lz4 compression, and about 5% slower on lz4 decompression even after migrating to the suggested safeDecompressor (#53454).

Here is my local test result:

lz4-java 1.8.0 with fastDecompressor (state before aa65bda)

[info] Running benchmark: Benchmark LZ4CompressionCodec
[info]   Running case: Compression 4 times
[info]   Stopped after 2 iterations, 4143 ms
[info] OpenJDK 64-Bit Server VM 17.0.15+6-LTS on Linux 6.17.9-76061709-generic
[info] Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz
[info] Benchmark LZ4CompressionCodec:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] Compression 4 times                                2064           2072          11          0.0   515881622.3       1.0X
[info] Running benchmark: Benchmark LZ4CompressionCodec
[info]   Running case: Decompression 4 times
[info]   Stopped after 3 iterations, 2659 ms
[info] OpenJDK 64-Bit Server VM 17.0.15+6-LTS on Linux 6.17.9-76061709-generic
[info] Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz
[info] Benchmark LZ4CompressionCodec:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] Decompression 4 times                               843            886          45          0.0   210840272.0       1.0X
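As a reading aid for the two tables above: multiplying Per Row(ns) by 4 reproduces the Best Time(ms) column, which suggests that each of the 4 passes named in the case title is counted as one "row". That is an inference from the printed numbers, not something stated in this PR. Under that assumption, the rate is only a couple of rows per second, which would explain why Rate(M/s) prints as 0.0. A minimal sketch of the check (BenchmarkTableColumns and its members are hypothetical names):

```scala
object BenchmarkTableColumns {
  // Values copied from the lz4-java 1.8.0 tables above.
  final case class TableRow(name: String, bestTimeMs: Long, perRowNs: Double)

  val rows: Seq[TableRow] = Seq(
    TableRow("Compression 4 times", 2064L, 515881622.3),
    TableRow("Decompression 4 times", 843L, 210840272.0)
  )

  // Assumption: the benchmark counts each of the 4 passes as one "row".
  val passes: Int = 4

  /** Best time implied by Per Row(ns), in milliseconds. */
  def impliedBestTimeMs(r: TableRow): Double = r.perRowNs * passes / 1e6

  /** Rows per second, in millions, implied by Per Row(ns). */
  def impliedRateMillionsPerSec(r: TableRow): Double = 1e9 / r.perRowNs / 1e6

  def main(args: Array[String]): Unit =
    rows.foreach { r =>
      val best = impliedBestTimeMs(r)
      val rate = impliedRateMillionsPerSec(r)
      println(f"${r.name}%-24s implied best $best%.1f ms (reported ${r.bestTimeMs}%d ms), rate $rate%.7f M/s")
    }
}
```

The implied best times land within a millisecond of the reported 2064 ms and 843 ms, so the Best Time and Per Row columns carry the same information here.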

lz4-java 1.10.1 with fastDecompressor (current master state, significant decompression perf drop!)

[info] Running benchmark: Benchmark LZ4CompressionCodec
[info]   Running case: Compression 4 times
[info]   Stopped after 2 iterations, 4740 ms
[info] OpenJDK 64-Bit Server VM 17.0.15+6-LTS on Linux 6.17.9-76061709-generic
[info] Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz
[info] Benchmark LZ4CompressionCodec:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] Compression 4 times                                2351           2370          27          0.0   587850958.3       1.0X
[info] Running benchmark: Benchmark LZ4CompressionCodec
[info]   Running case: Decompression 4 times
[info]   Stopped after 2 iterations, 4147 ms
[info] OpenJDK 64-Bit Server VM 17.0.15+6-LTS on Linux 6.17.9-76061709-generic
[info] Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz
[info] Benchmark LZ4CompressionCodec:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] Decompression 4 times                              2070           2074           5          0.0   517477273.3       1.0X

lz4-java 1.10.1 with safeDecompressor (#53454)

[info] Running benchmark: Benchmark LZ4CompressionCodec
[info]   Running case: Compression 4 times
[info]   Stopped after 2 iterations, 4729 ms
[info] OpenJDK 64-Bit Server VM 17.0.15+6-LTS on Linux 6.17.9-76061709-generic
[info] Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz
[info] Benchmark LZ4CompressionCodec:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] Compression 4 times                                2360           2365           7          0.0   589977281.7       1.0X
[info] Running benchmark: Benchmark LZ4CompressionCodec
[info]   Running case: Decompression 4 times
[info]   Stopped after 3 iterations, 2742 ms
[info] OpenJDK 64-Bit Server VM 17.0.15+6-LTS on Linux 6.17.9-76061709-generic
[info] Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz
[info] Benchmark LZ4CompressionCodec:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] Decompression 4 times                               886            914          43          0.0   221537120.3       1.0X
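The percentages in the TL;DR can be re-derived from the Best Time(ms) columns of the three result sets above. The following is a small sketch of that arithmetic, not part of the PR; Lz4Slowdown, Run, and slowdownPct are hypothetical names, and the numbers are copied from the tables above.

```scala
object Lz4Slowdown {
  // Best Time (ms) values copied from the three result sets above.
  final case class Run(label: String, compressMs: Double, decompressMs: Double)

  val baseline: Run = Run("1.8.0 + fastDecompressor", 2064.0, 843.0)

  val candidates: Seq[Run] = Seq(
    Run("1.10.1 + fastDecompressor", 2351.0, 2070.0),
    Run("1.10.1 + safeDecompressor", 2360.0, 886.0)
  )

  /** Percentage by which `current` is slower than `base` (positive means slower). */
  def slowdownPct(base: Double, current: Double): Double =
    (current / base - 1.0) * 100.0

  def main(args: Array[String]): Unit =
    candidates.foreach { r =>
      val c = slowdownPct(baseline.compressMs, r.compressMs)
      val d = slowdownPct(baseline.decompressMs, r.decompressMs)
      println(f"${r.label}%-28s compression +$c%.1f%%  decompression +$d%.1f%%")
    }
}
```

On these best-time figures, compression comes out roughly 14% slower for both 1.10.1 runs, decompression with safeDecompressor roughly 5% slower, and decompression with fastDecompressor on 1.10.1 takes about 2.5 times as long as the 1.8.0 baseline. Using Avg Time instead of Best Time shifts the numbers slightly.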

core/src/test/scala/org/apache/spark/io/LZ4TPCDSDataBenchmark.scala

pan3793 · 2025-12-15T02:45:16Z

@LuciferYang, I refactored LZ4TPCDSDataBenchmark and ZStandardTPCDSDataBenchmark to share common code, and also added the benchmark results generated by CI. Could you please take another look?

LuciferYang · 2025-12-15T03:08:12Z

core/src/test/scala/org/apache/spark/benchmark/BenchmarkBase.scala

+  /**
+   * Any code before running any benchmark, e.g., data preparation
+   */
+  def beforeAll(): Unit = {}


My suggestion is not to add the beforeAll() method in the current PR. If we find it to be widely applicable, we can add this method in a subsequent PR and also extract beforeAll() for more microbenchmark scenarios.

addressed in f02e559
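For readers unfamiliar with the pattern under discussion, here is a minimal, self-contained sketch of a benchmark base class with a beforeAll() hook, next to an alternative that prepares data without such a hook. MiniBenchmarkBase, WithHook, and WithoutHook are hypothetical names for illustration only; this is not Spark's BenchmarkBase, and it does not claim to show what f02e559 changed.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a benchmark base class; NOT Spark's BenchmarkBase.
abstract class MiniBenchmarkBase {
  // Records what happened, in order, so the flow is easy to inspect.
  val events: ArrayBuffer[String] = ArrayBuffer.empty

  /** Optional hook that runs once before any benchmark, e.g. data preparation. */
  def beforeAll(): Unit = {}

  def runBenchmarkSuite(): Unit

  final def run(): Unit = {
    beforeAll()
    runBenchmarkSuite()
  }
}

// Variant 1: data preparation lives in the shared hook.
class WithHook extends MiniBenchmarkBase {
  private var data: Array[Byte] = Array.emptyByteArray

  override def beforeAll(): Unit = {
    data = Array.fill[Byte](1024)(1)
    events += "prepared in beforeAll"
  }

  override def runBenchmarkSuite(): Unit =
    events += s"benchmarked ${data.length} bytes"
}

// Variant 2: no hook in the base class; a lazy val prepares data on first use.
class WithoutHook extends MiniBenchmarkBase {
  private lazy val data: Array[Byte] = {
    events += "prepared lazily"
    Array.fill[Byte](1024)(1)
  }

  override def runBenchmarkSuite(): Unit =
    events += s"benchmarked ${data.length} bytes"
}

object BeforeAllSketch {
  def main(args: Array[String]): Unit = {
    val benchmarks = Seq(new WithHook, new WithoutHook)
    benchmarks.foreach { b =>
      b.run()
      println(b.events.mkString(" -> "))
    }
  }
}
```

One trade-off worth noting: a lazy val is initialized on first access, so a real benchmark using the second variant would need to touch the data before the timed section starts, whereas a hook keeps preparation outside the measurement by construction.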

LuciferYang · 2025-12-15T08:19:47Z

Merged into master. Thanks @pan3793

dongjoon-hyun · 2025-12-18T04:10:22Z

Although I didn't get a chance to take a look at the details, thank you so much for adding the additional benchmark, @pan3793 and @LuciferYang!

pan3793 added 2 commits December 12, 2025 16:31

LZ4TPCDSDataBenchmark

ba67e02

GHA workflow

7b66018

github-actions bot added CORE INFRA labels Dec 12, 2025

pan3793 mentioned this pull request Dec 12, 2025

[WIP][SPARK-54571][CORE] Use LZ4 safeDecompressor #53290

Closed

LuciferYang reviewed Dec 12, 2025

pan3793 and others added 6 commits December 12, 2025 20:25

refactor

9433bd7

fix import

bf961b0

Benchmark results for org.apache.spark.io.LZ4TPCDSDataBenchmark (JDK 21, Scala 2.13, split 1 of 1)

061ea81

Benchmark results for org.apache.spark.io.LZ4TPCDSDataBenchmark (JDK 17, Scala 2.13, split 1 of 1)

b5c73de

Benchmark results for org.apache.spark.io.ZStandardTPCDSDataBenchmark (JDK 21, Scala 2.13, split 1 of 1)

d752379

Benchmark results for org.apache.spark.io.ZStandardTPCDSDataBenchmark (JDK 17, Scala 2.13, split 1 of 1)

ea7b879

pan3793 marked this pull request as ready for review December 15, 2025 02:43

pan3793 mentioned this pull request Dec 15, 2025

[SPARK-54571][CORE][SQL] Use LZ4 safeDecompressor #53454

Open

LuciferYang reviewed Dec 15, 2025

address comment

f02e559

LuciferYang approved these changes Dec 15, 2025

LuciferYang closed this in 1e14c5c Dec 15, 2025

pan3793 wants to merge 9 commits into apache:master from pan3793:SPARK-54693
