
GH-39704: [C++][Parquet] Benchmark levels decoding #39705

Merged — 8 commits merged into apache:main from the level-decoding-benchmark branch on Feb 5, 2024

Conversation

@mapleFU (Member) commented Jan 19, 2024

Rationale for this change

This patch adds a level-decoding benchmark. It tests:

  1. Different max levels (for a flat type the maximum level is 1; for nested types it grows)
  2. Different repeat counts (repeated null / non-null data behaves differently from non-repeated data)
  3. Different read-batch sizes (this part of the logic is a bit tricky in the original code; a minimal sketch of the benchmark setup follows this list)
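
For illustration, here is a minimal Google Benchmark sketch of how the parameter grid above could be registered. This is not the code added in this PR: the loop body is a stand-in (a plain batched copy) for the real level-decoder calls, and the level-generation helper is only illustrative.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

#include <benchmark/benchmark.h>

static void ReadLevels(benchmark::State& state) {
  const int16_t max_level = static_cast<int16_t>(state.range(0));
  const int num_levels = static_cast<int>(state.range(1));
  const int batch_size = static_cast<int>(state.range(2));
  const int repeat_count = static_cast<int>(state.range(3));

  // Synthetic level sequence: runs of `repeat_count` identical values cycling
  // through [0, max_level], mimicking "repeated" vs "non-repeated" data.
  std::vector<int16_t> levels(num_levels);
  int16_t value = 0;
  for (int i = 0; i < num_levels; ++i) {
    if (i % repeat_count == 0) {
      value = static_cast<int16_t>((value + 1) % (max_level + 1));
    }
    levels[i] = value;
  }

  std::vector<int16_t> decoded(batch_size);
  for (auto _ : state) {
    // Stand-in for the real decoder: consume the levels `batch_size` at a time.
    int offset = 0;
    while (offset < num_levels) {
      const int to_read = std::min(batch_size, num_levels - offset);
      std::copy(levels.begin() + offset, levels.begin() + offset + to_read,
                decoded.begin());
      benchmark::DoNotOptimize(decoded.data());
      offset += to_read;
    }
  }
  state.SetItemsProcessed(state.iterations() * num_levels);
}

BENCHMARK(ReadLevels)
    ->ArgNames({"MaxLevel", "NumLevels", "BatchSize", "LevelRepeatCount"})
    ->Args({1, 8096, 1024, 1})
    ->Args({1, 8096, 1024, 7})
    ->Args({1, 8096, 1024, 1024})
    ->Args({3, 8096, 1024, 1});

BENCHMARK_MAIN();
```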

What changes are included in this PR?

Adds a level-decoding benchmark (cpp/src/parquet/column_reader_benchmark.cc).

Are these changes tested?

No tests are needed; this change only adds a benchmark.

Are there any user-facing changes?

no

@mapleFU mapleFU requested a review from wgtmac as a code owner January 19, 2024 13:25

⚠️ GitHub issue #39704 has been automatically assigned in GitHub to PR creator.

@mapleFU (Member, Author) commented Jan 19, 2024

@pitrou @emkornfield @wgtmac Would you mind taking a look?

Also cc @Hattonuri

@mapleFU (Member, Author) commented Jan 19, 2024

Benchmark results on my macOS machine with a Release build (-O2):

--------------------------------------------------------------------------------------------------------------------
Benchmark                                                                          Time             CPU   Iterations
--------------------------------------------------------------------------------------------------------------------
ReadLevels/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          2771 ns         2725 ns       244327
ReadLevels/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          9603 ns         9281 ns        74978
ReadLevels/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024        534 ns          508 ns      1391429
ReadLevels/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          2111 ns         2007 ns       348569
ReadLevels/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          2078 ns         1993 ns       352508
ReadLevels/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          1731 ns         1728 ns       404636
ReadLevels/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          8545 ns         8236 ns        84408

@mapleFU (Member, Author) commented Jan 19, 2024

This benchmark shows that, when levels are not highly repeated, RLE without bit-packing is slow 😅

After changing the level encoding from RLE to BIT_PACKED, decoding gets a bit faster when the repeat count is low:

--------------------------------------------------------------------------------------------------------------------
Benchmark                                                                          Time             CPU   Iterations
--------------------------------------------------------------------------------------------------------------------
ReadLevels/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          1072 ns         1069 ns       658198
ReadLevels/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          1056 ns         1051 ns       646001
ReadLevels/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024       1088 ns         1057 ns       662383
ReadLevels/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          1075 ns         1033 ns       683908
ReadLevels/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          1123 ns         1121 ns       627145
ReadLevels/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          1093 ns         1091 ns       637848
ReadLevels/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          1183 ns         1135 ns       616930

We should also mention that our native unpack to int16 is slow =_=, which makes decoding a bit slower.
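
For context, here is a minimal scalar sketch of what "unpack to int16" involves. This is illustrative only, not Arrow's actual implementation; the point is that the int16 level output does not map onto the fastest vectorized unpack kernels, so a scalar loop like this ends up on the hot path.

```cpp
#include <cstdint>

// Illustrative scalar unpack: read `num_values` values of `bit_width` bits
// (LSB-first, as in Parquet's RLE/bit-packing hybrid) from `in` into int16 out.
void UnpackToInt16(const uint8_t* in, int num_values, int bit_width,
                   int16_t* out) {
  uint64_t buffer = 0;      // bit accumulator
  int bits_in_buffer = 0;   // number of valid bits in the accumulator
  const uint64_t mask = (uint64_t{1} << bit_width) - 1;
  for (int i = 0; i < num_values; ++i) {
    while (bits_in_buffer < bit_width) {
      buffer |= uint64_t{*in++} << bits_in_buffer;
      bits_in_buffer += 8;
    }
    out[i] = static_cast<int16_t>(buffer & mask);
    buffer >>= bit_width;
    bits_in_buffer -= bit_width;
  }
}
```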

@alippai (Contributor) commented Jan 20, 2024

Is this use case relevant here? #34510

Reading a non-nullable fixed-size list is missing the fast path; it'd be nice to see it in the benchmark (even if it isn't improved yet). With all the AI work nowadays, I assume tensor storage will become more and more common.

@mapleFU (Member, Author) commented Jan 20, 2024

Reading a non-nullable fixed-size list is missing the fast path

Yeah, I think it's related. I can optimize the unpack path later, but I might need some help optimizing RLE.

@pitrou (Member) left a review comment

Good idea @mapleFU. Please see my comments below.

(Review comment threads on cpp/src/parquet/column_reader_benchmark.cc — resolved.)
@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 22, 2024
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Jan 22, 2024
@mapleFU (Member, Author) commented Jan 26, 2024

@emkornfield @pitrou Updated, sorry for the delay.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jan 26, 2024
@mapleFU (Member, Author) commented Jan 26, 2024

Results on my macOS machine with a Release build (-O2):

ReadLevels/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1                  3122 ns         3123 ns       225916 bytes_per_second=4.8286G/s items_per_second=2.59233G/s
ReadLevels/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7                  8642 ns         8640 ns        81130 bytes_per_second=1.74531G/s items_per_second=937.005M/s
ReadLevels/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024               1124 ns         1124 ns       617709 bytes_per_second=13.4136G/s items_per_second=7.20137G/s
ReadLevels/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1                  2412 ns         2414 ns       290082 bytes_per_second=6.24778G/s items_per_second=3.35425G/s
ReadLevels/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1                  2403 ns         2401 ns       292828 bytes_per_second=6.27942G/s items_per_second=3.37124G/s
ReadLevels/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1                  2201 ns         2202 ns       320742 bytes_per_second=6.84694G/s items_per_second=3.67592G/s
ReadLevels/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7                  7836 ns         7829 ns        90506 bytes_per_second=1.92618G/s items_per_second=1034.11M/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          1724 ns         1726 ns       397519 bytes_per_second=8.73636G/s items_per_second=4.6903G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          1694 ns         1695 ns       411450 bytes_per_second=8.89878G/s items_per_second=4.7775G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024       1694 ns         1695 ns       409966 bytes_per_second=8.89797G/s items_per_second=4.77706G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          1668 ns         1669 ns       414886 bytes_per_second=9.03668G/s items_per_second=4.85153G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          1760 ns         1761 ns       395125 bytes_per_second=8.56206G/s items_per_second=4.59672G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          1733 ns         1734 ns       402188 bytes_per_second=8.69522G/s items_per_second=4.66821G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          1754 ns         1756 ns       396380 bytes_per_second=8.58685G/s items_per_second=4.61003G/s

@mapleFU (Member, Author) commented Jan 30, 2024

Ping @pitrou @emkornfield for help

@pitrou (Member) commented Jan 31, 2024

Strange phenomenon: we get results like bytes_per_second=11.7241G/s items_per_second=6.29431G/s, where bytes_per_second is not equal to 2 * items_per_second.

@pitrou (Member) commented Jan 31, 2024

Oh, it seems Google benchmark has a weird behavior here. Unrelated to this PR though.

@pitrou (Member) commented Jan 31, 2024

Posted google/benchmark#1749 for the Google benchmark oddity.
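
A plausible reading of those numbers, assuming the quirk tracked in google/benchmark#1749 is that bytes_per_second is printed with binary (Gi) multipliers while items_per_second uses decimal (G) ones: 11.7241 × 2^30 B/s ≈ 12.59 × 10^9 B/s ≈ 2 × 6.29431 × 10^9 items/s, so the two counters are actually consistent for 2-byte levels once the prefixes are interpreted differently. (The Feb 1 numbers below do print Gi/s for bytes and G/s for items.)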

@pitrou (Member) commented Feb 1, 2024

FTR, benchmark numbers here:

--------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                  Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------------------------
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1              2487 ns         2488 ns       288334 bytes_per_second=6.06116Gi/s items_per_second=3.25406G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7              8071 ns         8072 ns        86861 bytes_per_second=1.86809Gi/s items_per_second=1.00292G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024            826 ns          828 ns       841216 bytes_per_second=18.2145Gi/s items_per_second=9.77881G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1              2209 ns         2211 ns       314019 bytes_per_second=6.81903Gi/s items_per_second=3.66094G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1              2110 ns         2112 ns       331250 bytes_per_second=7.13955Gi/s items_per_second=3.83302G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1              1904 ns         1906 ns       368359 bytes_per_second=7.91382Gi/s items_per_second=4.2487G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7              7673 ns         7675 ns        90873 bytes_per_second=1.96488Gi/s items_per_second=1.05489G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          1338 ns         1342 ns       522900 bytes_per_second=11.2397Gi/s items_per_second=6.03429G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          1342 ns         1345 ns       521761 bytes_per_second=11.2118Gi/s items_per_second=6.01927G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024       1340 ns         1343 ns       520704 bytes_per_second=11.2263Gi/s items_per_second=6.02705G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          1327 ns         1330 ns       526037 bytes_per_second=11.3356Gi/s items_per_second=6.08578G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          1335 ns         1338 ns       520920 bytes_per_second=11.2687Gi/s items_per_second=6.04985G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          1327 ns         1329 ns       526969 bytes_per_second=11.3443Gi/s items_per_second=6.09043G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          1336 ns         1338 ns       522798 bytes_per_second=11.2696Gi/s items_per_second=6.05033G/s

@mapleFU (Member, Author) commented Feb 5, 2024

@pitrou I think the benchmark results show that "batch_size" should be taken into consideration: for example, as the batch size grows, the BIT_PACKED path doesn't improve, while the RLE path is well optimized.
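
A hedged sketch of why a larger batch helps the RLE path (names are illustrative, not Arrow's API): each decode call can satisfy a long repeated run with a single fill, so fewer, larger calls amortize the per-call overhead, whereas the bit-packed path pays roughly the same cost per value regardless of batch size.

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative only: a single RLE run being consumed batch by batch.
struct RleRun {
  int16_t value;   // the repeated level value
  int remaining;   // copies of `value` left in the run
};

// Decode up to `batch_size` levels from `run` into `out`; returns the count
// produced. A long run is served with one std::fill per call, so doubling the
// batch size roughly halves the number of calls for highly repeated data.
int DecodeBatch(RleRun* run, int batch_size, int16_t* out) {
  const int n = std::min(batch_size, run->remaining);
  std::fill(out, out + n, run->value);
  run->remaining -= n;
  // A real decoder would advance to the next run header once `remaining`
  // reaches zero, and would also handle bit-packed groups.
  return n;
}
```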

@mapleFU (Member, Author) commented Feb 5, 2024

My bad, changed the output vector size to batch_size.

@mapleFU mapleFU requested a review from pitrou February 5, 2024 17:34
@pitrou pitrou merged commit 0c88d13 into apache:main Feb 5, 2024
26 of 30 checks passed
@pitrou pitrou removed the awaiting change review Awaiting change review label Feb 5, 2024
@mapleFU mapleFU deleted the level-decoding-benchmark branch February 5, 2024 18:07

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 0c88d13.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Feb 28, 2024
thisisnic pushed a commit to thisisnic/arrow that referenced this pull request Mar 8, 2024