Skip to content

Conversation

jerolba
Copy link
Contributor

@jerolba jerolba commented May 25, 2025

Avoid LongStream reading files and use an ad-hoc Long Iterator

Rationale for this change

Profiling the load of a Parquet file with Java Mission Control, I've noticed that InternalParquetRecordReader LongStream consumes relevant amount of time:

image

This LongStream can be replaced with a simpler Long Iterator that iterates from 0 to pages.getRowCount().

To measure the overhead I've created a test project that overwrites InternalParquetRecordReader class using a Long Iterator (the same change than proposed in the PR)

The execution time is sensitive to the context of the JVM, but running the benchmark multiple times shows that LongStream is slower than LongIterator, between 1% and 4% depending on the run.

What changes are included in this PR?

A new LongIterator that implements PrimitiveIterator.OfLong and replaces a LongStream.range(0, pages.getRowCount()).iterator()

Are these changes tested?

Not directly, but it's covered by existing tests

Are there any user-facing changes?

No

Closes #3226

@pan3793
Copy link
Member

pan3793 commented Jun 3, 2025

I ran your benchmark project locally, but actually got an opposite conclusion

$ java -version
openjdk version "21.0.3" 2024-04-16 LTS
OpenJDK Runtime Environment Zulu21.34+19-CA (build 21.0.3+9-LTS)
OpenJDK 64-Bit Server VM Zulu21.34+19-CA (build 21.0.3+9-LTS, mixed mode, sharing)
CPU: Intel i5-9500 (6) @ 4.400GHz
Ubuntu 24.04 Linux Kernel version 6.12.10
> Task :long-iterator:run
Using Long Iterator
Fetching file
Reading TripNarrow from parquet file...
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
20405666 1733
20405666 1854
20405666 1852
20405666 1838
20405666 1846
20405666 1846
20405666 1847
20405666 1842
20405666 1843
20405666 1836
Total time 18337

> Task :long-stream:run
Using Long Stream
Reading TripNarrow from parquet file...
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
20405666 1687
20405666 1783
20405666 1778
20405666 1790
20405666 1782
20405666 1803
20405666 1810
20405666 1778
20405666 1835
20405666 1816
Total time 17862

@jerolba
Copy link
Contributor Author

jerolba commented Jun 3, 2025

Yes, as I said in he PR, it's sensitive to the context. I was unable to obtain deterministic results. Even running the same test twice consecutively produced different execution times. After multiple executions, the long-iterator version was better more times than the long-stream version.
I decided to create the PR because from a theoretical perspective, Stream code is more computationally expensive than Iterator code, even when accounting for inlining and other JIT optimizations.

@pan3793
Copy link
Member

pan3793 commented Jun 3, 2025

I ran the benchmark using different JDKs, it looks like your assertion is almost true in the lower versions of JDK (I tested 8, 17), but in new versions(I tested 21), long-stream is a little bit faster than long-iterator. So I believe that new JDKs have some optimization for cases you mentioned, if so, we should keep the code as-is for future-proofing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve performance of InternalParquetRecordReader (1%)
2 participants