Add multithreaded parquet reader benchmarks. #15585

nvdbaranec · 2024-04-23T17:08:24Z

Addresses: #12700

Adds multithreaded benchmarks for the parquet reader, with separate benchmarks for the chunked and non-chunked readers. In both cases, the primary cases are 2, 4 and 8 threads running reads at the same time. There is not a ton of variability in the other benchmarking axes.

The primary use of this particular benchmark is to see inter-kernel performance (that is, how well our many different kernel types coexist with each other), whereas normal benchmarks tend to be more for intra-kernel performance checking.
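As a rough illustration of the setup described above (N threads issuing reads at the same time, with throughput reported as bytes per second), here is a minimal Python sketch. It is not the actual benchmark, which is C++ code built on nvbench. `fake_read` is a hypothetical stand-in for a reader call, and splitting `total_data_size` evenly across threads is an assumption made for the sketch.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_read(num_bytes: int) -> int:
    """Hypothetical stand-in for one reader call; returns the number of bytes 'read'."""
    # Allocate a zero-filled buffer so the call does some real work.
    return len(bytes(num_bytes))


def run_concurrent_reads(num_threads: int, total_data_size: int) -> float:
    """Run num_threads reads at the same time and return overall bytes per second.

    Assumption: total_data_size is split evenly across the threads.
    """
    per_thread = total_data_size // num_threads
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        bytes_read = sum(pool.map(fake_read, [per_thread] * num_threads))
    elapsed = time.perf_counter() - start
    return bytes_read / elapsed


if __name__ == "__main__":
    for num_threads in (1, 2, 4, 8):
        rate = run_concurrent_reads(num_threads, 64 * 1024 * 1024)
        print(f"{num_threads} threads: {rate:.3e} bytes/s")
```

The printed numbers only measure host-side buffer allocation, so they say nothing about GPU behavior; the point is the shape of the measurement (one timed region around all concurrent reads).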

NVTX ranges are included to help visually group the bundles of reads together in nsight-sys. I also posted a new issue which would help along these lines: #15575

Update: I've tweaked some of the numbers to demonstrate some mild performance improvements as we go up in thread count, and included 1-thread as a case. Some examples:

## parquet_multithreaded_read_decode_mixed
| cardinality | total_data_size | num_threads | num_cols | bytes_per_second |
|-------------|-----------------|-------------|----------|------------------|
|        1000 |       536870912 |           1 |        4 |      28874731473 |
|        1000 |      1073741824 |           1 |        4 |      30564139526 |
|        1000 |       536870912 |           2 |        4 |      29399214255 |
|        1000 |      1073741824 |           2 |        4 |      31486327920 |
|        1000 |       536870912 |           4 |        4 |      27009769400 |
|        1000 |      1073741824 |           4 |        4 |      32234841632 |
|        1000 |       536870912 |           8 |        4 |      24416650118 |
|        1000 |      1073741824 |           8 |        4 |      30841124677 |

## parquet_multithreaded_read_decode_chunked_string
| cardinality | total_data_size | num_threads | num_cols | bytes_per_second |
|-------------|-----------------|-------------|----------|------------------|
|        1000 |       536870912 |           1 |        4 |      14637004584 |
|        1000 |      1073741824 |           1 |        4 |      16025843421 |
|        1000 |       536870912 |           2 |        4 |      15333491977 |
|        1000 |      1073741824 |           2 |        4 |      17164197747 |
|        1000 |       536870912 |           4 |        4 |      16556300728 |
|        1000 |      1073741824 |           4 |        4 |      17711338934 |
|        1000 |       536870912 |           8 |        4 |      15788371298 |
|        1000 |      1073741824 |           8 |        4 |      17911649578 |
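One way to read these tables is to divide each row's `bytes_per_second` by the 1-thread value at the same `total_data_size`. The sketch below does that for the `parquet_multithreaded_read_decode_mixed` numbers above; the helper name `relative_throughput` is made up for illustration.

```python
# (total_data_size, num_threads) -> bytes_per_second, copied from the
# parquet_multithreaded_read_decode_mixed table.
MIXED = {
    (536870912, 1): 28874731473,
    (1073741824, 1): 30564139526,
    (536870912, 2): 29399214255,
    (1073741824, 2): 31486327920,
    (536870912, 4): 27009769400,
    (1073741824, 4): 32234841632,
    (536870912, 8): 24416650118,
    (1073741824, 8): 30841124677,
}


def relative_throughput(results):
    """Divide each throughput by the 1-thread throughput at the same data size."""
    return {
        (size, threads): rate / results[(size, 1)]
        for (size, threads), rate in results.items()
    }


if __name__ == "__main__":
    for (size, threads), ratio in sorted(relative_throughput(MIXED).items()):
        print(f"size={size:>10} threads={threads} ratio={ratio:.3f}")
```

A ratio above 1.0 means the multi-threaded run moved more bytes per second than the single-threaded run at that size.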

In addition, this benchmark clearly shows multi-thread-only regressions. The example below uses the pageable-error-code regression we've seen in the past.

Example without regression:


## parquet_multithreaded_read_decode_chunked_fixed_width
| total_data_size | num_threads | bytes_per_second |
|-----------------|-------------|------------------|
|       536870912 |           1 |      25681728660 |
|      1073741824 |           1 |      26281335927 |
|       536870912 |           2 |      25597258848 |
|      1073741824 |           2 |      26733626352 |
|       536870912 |           4 |      25190211717 |
|      1073741824 |           4 |      28117411682 |
|       536870912 |           8 |      25805791994 |
|      1073741824 |           8 |      27788485204 |

Example with regression (pageable error-code return values):

## parquet_multithreaded_read_decode_chunked_fixed_width
| total_data_size | num_threads | bytes_per_second |
|-----------------|-------------|------------------|
|       536870912 |           1 |      25660470283 |
|      1073741824 |           1 |      26146862480 |
|       536870912 |           2 |      25040145602 |
|      1073741824 |           2 |      25460591520 |
|       536870912 |           4 |      22917046969 |
|      1073741824 |           4 |      24922624784 |
|       536870912 |           8 |      20529770200 |
|      1073741824 |           8 |      23333751767 |

For both data sizes, we can see that the single-thread case remains essentially the same but there's a regression in the multi-thread cases, particularly with 4 or more threads.
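To make the comparison concrete, the following sketch computes the percent change in `bytes_per_second` between the two `parquet_multithreaded_read_decode_chunked_fixed_width` runs above. The helper names and the -5% threshold are arbitrary choices for illustration, not something the benchmark itself defines.

```python
# (total_data_size, num_threads) -> bytes_per_second, copied from the two
# parquet_multithreaded_read_decode_chunked_fixed_width tables.
BASELINE = {
    (536870912, 1): 25681728660,
    (1073741824, 1): 26281335927,
    (536870912, 2): 25597258848,
    (1073741824, 2): 26733626352,
    (536870912, 4): 25190211717,
    (1073741824, 4): 28117411682,
    (536870912, 8): 25805791994,
    (1073741824, 8): 27788485204,
}
REGRESSED = {
    (536870912, 1): 25660470283,
    (1073741824, 1): 26146862480,
    (536870912, 2): 25040145602,
    (1073741824, 2): 25460591520,
    (536870912, 4): 22917046969,
    (1073741824, 4): 24922624784,
    (536870912, 8): 20529770200,
    (1073741824, 8): 23333751767,
}


def percent_change(baseline, candidate):
    """Percent change in throughput for every (size, threads) key in baseline."""
    return {
        key: 100.0 * (candidate[key] - baseline[key]) / baseline[key]
        for key in baseline
    }


def flag_regressions(changes, threshold=-5.0):
    """Return the keys whose change is below threshold (assumed cutoff: -5%)."""
    return sorted(key for key, change in changes.items() if change < threshold)


if __name__ == "__main__":
    changes = percent_change(BASELINE, REGRESSED)
    for key in sorted(changes):
        print(key, f"{changes[key]:+.2f}%")
    print("flagged:", flag_regressions(changes))
```

With that threshold, only the 4-thread and 8-thread rows are flagged, while the 1-thread rows change by well under 1%.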

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

Signed-off-by: Mike Wilson <knobby@burntsheep.com>

copy-pr-bot · 2024-04-23T17:08:27Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

nvdbaranec · 2024-04-23T17:10:14Z

One thing of note: I'm tempted to go larger with some of the sizes here because I want to make sure we're saturating the GPU as much as we can, but using sizes much larger than this (especially in the chunked case) causes total memory usage to quickly blow past 16 GB. I don't know how much we care about that.

cpp/benchmarks/CMakeLists.txt

hyperbolic2346 · 2024-04-24T19:50:48Z

What kind of runtime is this? How does performance look? Are we able to catch regressions like #14167?

GregoryKimball · 2024-05-16T17:26:44Z

@nvdbaranec BTW @vuule and I were discussing that maybe we need to introduce a pinned host buffer data source to show better scaling with this kind of multi-threaded benchmark.

nvdbaranec · 2024-05-20T18:54:34Z

/ok to test

vuule

Looks good, just one TODO that can be addressed now

cpp/benchmarks/io/parquet/parquet_reader_multithread.cpp

nvdbaranec · 2024-05-21T18:19:54Z

/ok to test

nvdbaranec · 2024-05-21T21:18:06Z

/ok to test

nvdbaranec · 2024-05-21T22:51:34Z

/merge

GregoryKimball · 2024-06-03T22:29:24Z

Thank you @nvdbaranec, these benchmarks are excellent!

Here are the settings that I recently used to study the interleaving of copy and compute on A100 (+ @vuule):

./PARQUET_MULTITHREAD_READER_NVBENCH -d 0 -b 0 --axis num_cols=32 --axis run_length=2 --axis total_data_size=16000000000 --axis num_threads=16

I decided to scale the total_data_size with the num_threads so that I could compare the throughput for 1 thread to read 1 GB with the throughput for 10 threads to read 10 GB.
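A small sketch of that scaling idea: hold the per-thread size fixed and derive `total_data_size` from `num_threads` when building the command line. The binary name, flags, and axis names are copied from the command above; `scaled_command` and the default of 1,000,000,000 bytes per thread are assumptions for illustration.

```python
import shlex


def scaled_command(num_threads: int, bytes_per_thread: int = 1_000_000_000) -> str:
    """Build the benchmark command with total_data_size scaled by num_threads."""
    total_data_size = num_threads * bytes_per_thread
    args = [
        "./PARQUET_MULTITHREAD_READER_NVBENCH",
        "-d", "0",
        "-b", "0",
        "--axis", "num_cols=32",
        "--axis", "run_length=2",
        "--axis", f"total_data_size={total_data_size}",
        "--axis", f"num_threads={num_threads}",
    ]
    # shlex.join quotes arguments only when the shell would need it.
    return shlex.join(args)


if __name__ == "__main__":
    for num_threads in (1, 2, 4, 8, 16):
        print(scaled_command(num_threads))
```

For 16 threads this reproduces the command line quoted above, so each thread count reads the same amount of data per thread.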

Beautiful results!

Addresses: #15973

Adds multithreaded benchmarks for the ORC reader. Based off of the parquet equivalent in #15585

```
# Benchmark Results

## orc_multithreaded_read_decode_mixed

### [0] NVIDIA RTX 5880 Ada Generation

| cardinality | total_data_size | num_threads | num_cols | run_length | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-------------|-----------------|-------------|----------|------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
| 1000 | 536870912 | 1 | 4 | 8 | 338x | 44.348 ms | 1.18% | 44.343 ms | 1.18% | 12107185968 | 939.341 MiB | 39.557 MiB |
| 1000 | 1073741824 | 1 | 4 | 8 | 80x | 77.634 ms | 0.65% | 77.629 ms | 0.65% | 13831742649 | 1.834 GiB | 79.072 MiB |
| 1000 | 536870912 | 2 | 4 | 8 | 341x | 43.921 ms | 1.20% | 43.916 ms | 1.20% | 12224889363 | 825.333 MiB | 39.568 MiB |
| 1000 | 1073741824 | 2 | 4 | 8 | 80x | 75.418 ms | 0.70% | 75.414 ms | 0.70% | 14237999015 | 1.611 GiB | 79.113 MiB |
| 1000 | 536870912 | 4 | 4 | 8 | 80x | 42.682 ms | 1.18% | 42.678 ms | 1.18% | 12579566132 | 883.436 MiB | 39.587 MiB |
| 1000 | 1073741824 | 4 | 4 | 8 | 9x | 74.056 ms | 0.48% | 74.052 ms | 0.48% | 14499873867 | 1.724 GiB | 79.136 MiB |
| 1000 | 536870912 | 8 | 4 | 8 | 25x | 42.198 ms | 0.50% | 42.194 ms | 0.49% | 12723960975 | 940.562 MiB | 39.600 MiB |
| 1000 | 1073741824 | 8 | 4 | 8 | 8x | 73.933 ms | 0.49% | 73.929 ms | 0.49% | 14524042443 | 1.781 GiB | 79.175 MiB |

## orc_multithreaded_read_decode_fixed_width

### [0] NVIDIA RTX 5880 Ada Generation

| cardinality | total_data_size | num_threads | num_cols | run_length | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-------------|-----------------|-------------|----------|------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
| 1000 | 536870912 | 1 | 4 | 8 | 13x | 40.149 ms | 0.04% | 40.144 ms | 0.04% | 13373482726 | 643.390 MiB | 59.821 MiB |
| 1000 | 1073741824 | 1 | 4 | 8 | 211x | 71.216 ms | 0.67% | 71.211 ms | 0.67% | 15078297784 | 1.257 GiB | 119.650 MiB |
| 1000 | 536870912 | 2 | 4 | 8 | 378x | 39.662 ms | 1.31% | 39.658 ms | 1.31% | 13537590893 | 643.392 MiB | 59.833 MiB |
| 1000 | 1073741824 | 2 | 4 | 8 | 209x | 71.693 ms | 0.71% | 71.688 ms | 0.71% | 14978085376 | 1.257 GiB | 119.642 MiB |
| 1000 | 536870912 | 4 | 4 | 8 | 377x | 39.731 ms | 1.30% | 39.726 ms | 1.30% | 13514305239 | 643.394 MiB | 59.856 MiB |
| 1000 | 1073741824 | 4 | 4 | 8 | 8x | 70.766 ms | 0.08% | 70.761 ms | 0.08% | 15174115364 | 1.030 GiB | 119.665 MiB |
| 1000 | 536870912 | 8 | 4 | 8 | 379x | 39.486 ms | 1.27% | 39.482 ms | 1.27% | 13597888468 | 647.399 MiB | 59.928 MiB |
| 1000 | 1073741824 | 8 | 4 | 8 | 207x | 72.686 ms | 2.04% | 72.681 ms | 2.04% | 14773317833 | 1.143 GiB | 119.711 MiB |

## orc_multithreaded_read_decode_string

### [0] NVIDIA RTX 5880 Ada Generation

| cardinality | total_data_size | num_threads | num_cols | run_length | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-------------|-----------------|-------------|----------|------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
| 1000 | 536870912 | 1 | 4 | 8 | 80x | 22.933 ms | 2.13% | 22.928 ms | 2.13% | 23415352877 | 661.948 MiB | 10.879 MiB |
| 1000 | 1073741824 | 1 | 4 | 8 | 160x | 34.167 ms | 1.41% | 34.162 ms | 1.41% | 31430436877 | 1.293 GiB | 21.757 MiB |
| 1000 | 536870912 | 2 | 4 | 8 | 560x | 22.533 ms | 2.18% | 22.528 ms | 2.18% | 23830839172 | 609.407 MiB | 10.941 MiB |
| 1000 | 1073741824 | 2 | 4 | 8 | 80x | 34.311 ms | 1.54% | 34.307 ms | 1.54% | 31298288990 | 1.188 GiB | 21.758 MiB |
| 1000 | 536870912 | 4 | 4 | 8 | 23x | 22.179 ms | 0.11% | 22.175 ms | 0.11% | 24211151047 | 624.177 MiB | 10.947 MiB |
| 1000 | 1073741824 | 4 | 4 | 8 | 15x | 33.793 ms | 0.08% | 33.789 ms | 0.08% | 31777989791 | 1.190 GiB | 21.881 MiB |
| 1000 | 536870912 | 8 | 4 | 8 | 679x | 22.006 ms | 1.74% | 22.002 ms | 1.74% | 24401381631 | 624.524 MiB | 10.951 MiB |
| 1000 | 1073741824 | 8 | 4 | 8 | 160x | 33.320 ms | 1.57% | 33.316 ms | 1.57% | 32229227026 | 1.207 GiB | 21.894 MiB |

## orc_multithreaded_read_decode_list

### [0] NVIDIA RTX 5880 Ada Generation

| cardinality | total_data_size | num_threads | num_cols | run_length | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-------------|-----------------|-------------|----------|------------|---------|------------|--------|------------|--------|------------------|-------------------|-------------------|
| 1000 | 536870912 | 1 | 4 | 8 | 96x | 74.437 ms | 0.68% | 74.433 ms | 0.68% | 7212831148 | 600.751 MiB | 60.245 MiB |
| 1000 | 1073741824 | 1 | 4 | 8 | 7x | 80.994 ms | 0.49% | 80.990 ms | 0.49% | 13257745936 | 1.173 GiB | 120.549 MiB |
| 1000 | 536870912 | 2 | 4 | 8 | 80x | 79.234 ms | 4.57% | 79.229 ms | 4.57% | 6776190522 | 600.950 MiB | 60.250 MiB |
| 1000 | 1073741824 | 2 | 4 | 8 | 166x | 90.437 ms | 17.19% | 90.432 ms | 17.19% | 11873413959 | 1.173 GiB | 120.489 MiB |
| 1000 | 536870912 | 4 | 4 | 8 | 80x | 78.613 ms | 2.98% | 78.608 ms | 2.98% | 6829702014 | 602.764 MiB | 60.323 MiB |
| 1000 | 1073741824 | 4 | 4 | 8 | 127x | 118.629 ms | 22.67% | 118.624 ms | 22.67% | 9051644873 | 1.174 GiB | 120.499 MiB |
| 1000 | 536870912 | 8 | 4 | 8 | 112x | 133.950 ms | 4.45% | 133.945 ms | 4.45% | 4008135293 | 603.471 MiB | 60.353 MiB |
| 1000 | 1073741824 | 8 | 4 | 8 | 90x | 167.850 ms | 15.93% | 167.844 ms | 15.93% | 6397248426 | 1.177 GiB | 120.646 MiB |

## orc_multithreaded_read_decode_chunked_mixed

### [0] NVIDIA RTX 5880 Ada Generation

| cardinality | total_data_size | num_threads | num_cols | run_length | input_limit | output_limit | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-------------|-----------------|-------------|----------|------------|-------------|--------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
| 1000 | 536870912 | 1 | 4 | 8 | 671088640 | 671088640 | 333x | 45.009 ms | 1.10% | 45.005 ms | 1.10% | 11929261073 | 939.341 MiB | 39.557 MiB |
| 1000 | 1073741824 | 1 | 4 | 8 | 671088640 | 671088640 | 96x | 81.524 ms | 0.61% | 81.519 ms | 0.61% | 13171640865 | 1.834 GiB | 79.072 MiB |
| 1000 | 536870912 | 2 | 4 | 8 | 671088640 | 671088640 | 339x | 44.183 ms | 0.96% | 44.179 ms | 0.96% | 12152252271 | 825.333 MiB | 39.568 MiB |
| 1000 | 1073741824 | 2 | 4 | 8 | 671088640 | 671088640 | 7x | 79.051 ms | 0.02% | 79.046 ms | 0.02% | 13583676002 | 1.611 GiB | 79.113 MiB |
| 1000 | 536870912 | 4 | 4 | 8 | 671088640 | 671088640 | 12x | 43.276 ms | 0.09% | 43.272 ms | 0.09% | 12407024794 | 883.436 MiB | 39.587 MiB |
| 1000 | 1073741824 | 4 | 4 | 8 | 671088640 | 671088640 | 19x | 78.019 ms | 0.49% | 78.014 ms | 0.49% | 13763433041 | 1.724 GiB | 79.136 MiB |
| 1000 | 536870912 | 8 | 4 | 8 | 671088640 | 671088640 | 80x | 42.803 ms | 1.22% | 42.799 ms | 1.22% | 12543864010 | 911.993 MiB | 39.600 MiB |
| 1000 | 1073741824 | 8 | 4 | 8 | 671088640 | 671088640 | 193x | 77.856 ms | 0.59% | 77.852 ms | 0.59% | 13792063986 | 1.837 GiB | 79.175 MiB |

## orc_multithreaded_read_decode_chunked_fixed_width

### [0] NVIDIA RTX 5880 Ada Generation

| cardinality | total_data_size | num_threads | num_cols | run_length | input_limit | output_limit | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-------------|-----------------|-------------|----------|------------|-------------|--------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
| 1000 | 536870912 | 1 | 4 | 8 | 671088640 | 671088640 | 112x | 40.497 ms | 1.23% | 40.493 ms | 1.23% | 13258480947 | 643.390 MiB | 59.821 MiB |
| 1000 | 1073741824 | 1 | 4 | 8 | 671088640 | 671088640 | 7x | 75.440 ms | 0.09% | 75.435 ms | 0.09% | 14234033611 | 1.648 GiB | 119.651 MiB |
| 1000 | 536870912 | 2 | 4 | 8 | 671088640 | 671088640 | 80x | 39.793 ms | 1.36% | 39.789 ms | 1.36% | 13493067216 | 643.392 MiB | 59.833 MiB |
| 1000 | 1073741824 | 2 | 4 | 8 | 671088640 | 671088640 | 69x | 74.499 ms | 0.50% | 74.494 ms | 0.50% | 14413864845 | 1.336 GiB | 119.642 MiB |
| 1000 | 536870912 | 4 | 4 | 8 | 671088640 | 671088640 | 381x | 39.273 ms | 1.11% | 39.269 ms | 1.11% | 13671742653 | 643.394 MiB | 59.856 MiB |
| 1000 | 1073741824 | 4 | 4 | 8 | 671088640 | 671088640 | 204x | 73.755 ms | 0.60% | 73.751 ms | 0.60% | 14559012350 | 1.648 GiB | 119.665 MiB |
| 1000 | 536870912 | 8 | 4 | 8 | 671088640 | 671088640 | 80x | 39.490 ms | 1.31% | 39.486 ms | 1.31% | 13596333864 | 631.980 MiB | 59.928 MiB |
| 1000 | 1073741824 | 8 | 4 | 8 | 671088640 | 671088640 | 203x | 73.907 ms | 1.34% | 73.903 ms | 1.34% | 14529071322 | 1.454 GiB | 119.711 MiB |

## orc_multithreaded_read_decode_chunked_string

### [0] NVIDIA RTX 5880 Ada Generation

| cardinality | total_data_size | num_threads | num_cols | run_length | input_limit | output_limit | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-------------|-----------------|-------------|----------|------------|-------------|--------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
| 1000 | 536870912 | 1 | 4 | 8 | 671088640 | 671088640 | 80x | 23.022 ms | 1.96% | 23.017 ms | 1.96% | 23324556592 | 661.948 MiB | 10.879 MiB |
| 1000 | 1073741824 | 1 | 4 | 8 | 671088640 | 671088640 | 80x | 37.687 ms | 1.37% | 37.682 ms | 1.37% | 28494755419 | 1.659 GiB | 21.757 MiB |
| 1000 | 536870912 | 2 | 4 | 8 | 671088640 | 671088640 | 80x | 22.703 ms | 2.30% | 22.699 ms | 2.30% | 23652118769 | 609.407 MiB | 10.941 MiB |
| 1000 | 1073741824 | 2 | 4 | 8 | 671088640 | 671088640 | 80x | 37.581 ms | 1.42% | 37.577 ms | 1.42% | 28574723179 | 1.658 GiB | 21.758 MiB |
| 1000 | 536870912 | 4 | 4 | 8 | 671088640 | 671088640 | 544x | 22.296 ms | 1.56% | 22.293 ms | 1.56% | 24082840350 | 631.319 MiB | 10.947 MiB |
| 1000 | 1073741824 | 4 | 4 | 8 | 671088640 | 671088640 | 14x | 36.990 ms | 0.14% | 36.985 ms | 0.14% | 29031484389 | 1.554 GiB | 21.881 MiB |
| 1000 | 536870912 | 8 | 4 | 8 | 671088640 | 671088640 | 676x | 22.114 ms | 1.22% | 22.110 ms | 1.22% | 24281965280 | 627.616 MiB | 10.951 MiB |
| 1000 | 1073741824 | 8 | 4 | 8 | 671088640 | 671088640 | 80x | 37.409 ms | 1.40% | 37.405 ms | 1.40% | 28706077426 | 1.562 GiB | 21.894 MiB |

## orc_multithreaded_read_decode_chunked_list

### [0] NVIDIA RTX 5880 Ada Generation

| cardinality | total_data_size | num_threads | num_cols | run_length | input_limit | output_limit | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-------------|-----------------|-------------|----------|------------|-------------|--------------|---------|------------|--------|------------|--------|------------------|-------------------|-------------------|
| 1000 | 536870912 | 1 | 4 | 8 | 671088640 | 671088640 | 80x | 74.780 ms | 0.67% | 74.776 ms | 0.67% | 7179747067 | 600.751 MiB | 60.245 MiB |
| 1000 | 1073741824 | 1 | 4 | 8 | 671088640 | 671088640 | 175x | 86.040 ms | 0.56% | 86.035 ms | 0.56% | 12480222210 | 1.576 GiB | 120.549 MiB |
| 1000 | 536870912 | 2 | 4 | 8 | 671088640 | 671088640 | 186x | 80.668 ms | 4.14% | 80.664 ms | 4.14% | 6655685080 | 600.951 MiB | 60.250 MiB |
| 1000 | 1073741824 | 2 | 4 | 8 | 671088640 | 671088640 | 143x | 105.217 ms | 21.56% | 105.212 ms | 21.56% | 10205531345 | 1.576 GiB | 120.489 MiB |
| 1000 | 536870912 | 4 | 4 | 8 | 671088640 | 671088640 | 128x | 80.087 ms | 3.05% | 80.082 ms | 3.05% | 6704042147 | 602.764 MiB | 60.323 MiB |
| 1000 | 1073741824 | 4 | 4 | 8 | 671088640 | 671088640 | 135x | 111.556 ms | 21.88% | 111.551 ms | 21.88% | 9625546746 | 1.489 GiB | 120.499 MiB |
| 1000 | 536870912 | 8 | 4 | 8 | 671088640 | 671088640 | 112x | 134.677 ms | 4.14% | 134.672 ms | 4.14% | 3986513604 | 603.471 MiB | 60.353 MiB |
| 1000 | 1073741824 | 8 | 4 | 8 | 671088640 | 671088640 | 80x | 178.735 ms | 14.17% | 178.730 ms | 14.17% | 6007630497 | 1.520 GiB | 120.646 MiB |
```

Authors:
- Zach Puller (https://github.com/zpuller)
- Vukasin Milovanovic (https://github.com/vuule)
- MithunR (https://github.com/mythrocks)

Approvers:
- Yunsong Wang (https://github.com/PointKernel)
- MithunR (https://github.com/mythrocks)

URL: #16009

hyperbolic2346 and others added 3 commits December 19, 2023 18:22

first pass at multi-threaded parquet benchmarks

1d32613

Signed-off-by: Mike Wilson <knobby@burntsheep.com>

Multithread parquet benchmarks. Broken up into separate benchmarks fo…

00d682f

…r chunked and non-chunked reader.

Merge branch 'branch-24.06' into multithread_parquet_bench

a4803ae

nvdbaranec added the libcudf, cuIO, Performance, Spark, improvement, and non-breaking labels Apr 23, 2024

nvdbaranec requested a review from a team as a code owner April 23, 2024 17:08

nvdbaranec requested review from hyperbolic2346 and vuule April 23, 2024 17:08

nvdbaranec changed the title ~~Add multithreaded parquet reader benchmarks.`~~ Add multithreaded parquet reader benchmarks. Apr 23, 2024

github-actions bot added the CMake label Apr 23, 2024

ttnghia reviewed Apr 24, 2024

cpp/benchmarks/CMakeLists.txt

vuule mentioned this pull request Apr 29, 2024

[FEA] Have a global pinned memory pool by default #15612

Closed

GregoryKimball assigned nvdbaranec May 6, 2024

ttnghia approved these changes May 10, 2024

nvdbaranec added 3 commits May 16, 2024 16:07

Merge branch 'branch-24.06' into multithread_parquet_bench

feb145d

Rework test parameters so that we generally see at least small perfor…

21a6261

…mance increases with more threads. And the benchmarks clearly show multiple-thread-only performance regressions that we have historically seen.

Merge branch 'branch-24.06' into multithread_parquet_bench

6d0c922

Formatting.

ad747cb

vuule reviewed May 20, 2024

cpp/benchmarks/io/parquet/parquet_reader_multithread.cpp

Merge branch 'branch-24.06' into multithread_parquet_bench

6dc06b5

nvdbaranec requested a review from vuule May 21, 2024 18:16

vuule approved these changes May 21, 2024

vuule mentioned this pull request May 21, 2024

Switch cuIO benchmarks to use pinned-pool host allocations by default. #15805

Merged

3 tasks

vuule added the "5 - Ready to Merge" label May 21, 2024

hyperbolic2346 approved these changes May 21, 2024

Merge branch 'branch-24.06' into multithread_parquet_bench

dcd33aa

rapids-bot bot merged commit fea8fd6 into rapidsai:branch-24.06 May 21, 2024
70 checks passed

This was referenced Jun 10, 2024

[FEA] Use SMs to submit small copies to prevent serialization on a busy copy engine #15620

Open

[FEA] Add an ORC reader benchmark that uses multiple CUDA streams #15973

Closed

zpuller mentioned this pull request Jun 12, 2024

orc multithreaded benchmark #16009

Merged

3 tasks
