[FEA] Add a Parquet reader benchmark that uses multiple CUDA streams #12700
Description
Is your feature request related to a problem? Please describe.
Our suite of Parquet reader benchmarks covers a variety of data sources, data types, compression formats, and reader options. However, it does not include a benchmark that uses multiple CUDA streams with multiple host threads to read portions of the same dataset and maximize GPU utilization. The Spark-RAPIDS plugin relies on multi-stream Parquet reads from host buffers (using the per-thread default stream, PTDS) for the data ingest step into libcudf.
Describe the solution you'd like
We should add a libcudf microbenchmark that creates several host threads, each with its own non-default CUDA stream, and then reads a large Parquet dataset from host memory into a libcudf table. The public Parquet reader API does not currently expose a stream parameter, but development of the benchmark can begin with the read_parquet detail API. We could design this benchmark to read either one file per thread or one row group per thread, whichever is more expedient. After the read step, we may want to add a concatenation step to yield a single table (see the sketch below). It would also be useful to reuse the data generated for the other Parquet reader benchmarks, so we have a performance baseline when measuring the benefit of multi-threaded, multi-stream reads.
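A minimal sketch of the multi-threaded read-and-concatenate step is below. It assumes a build with per-thread default streams (PTDS) enabled so each host thread's read runs on its own stream, and for brevity it reads from file paths through the public `cudf::io::read_parquet` API rather than from host buffers through the detail API; the file paths, thread-per-file layout, and the `read_parquet_multithreaded` helper are hypothetical and would be replaced by the nvbench fixture in the real benchmark.

```cpp
#include <cudf/concatenate.hpp>
#include <cudf/io/parquet.hpp>
#include <cudf/table/table.hpp>
#include <cudf/table/table_view.hpp>

#include <memory>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper: read one Parquet file per host thread, then
// concatenate the per-thread results into a single table.
std::unique_ptr<cudf::table> read_parquet_multithreaded(std::vector<std::string> const& paths)
{
  std::vector<cudf::io::table_with_metadata> results(paths.size());
  std::vector<std::thread> threads;

  // One host thread per file; with PTDS enabled, each thread's default
  // stream is distinct, so the reads can overlap on the GPU.
  for (std::size_t i = 0; i < paths.size(); ++i) {
    threads.emplace_back([&, i] {
      auto const opts =
        cudf::io::parquet_reader_options::builder(cudf::io::source_info{paths[i]}).build();
      results[i] = cudf::io::read_parquet(opts);
    });
  }
  for (auto& t : threads) { t.join(); }

  // Optional concatenation step to yield a single libcudf table.
  std::vector<cudf::table_view> views;
  views.reserve(results.size());
  for (auto const& r : results) { views.push_back(r.tbl->view()); }
  return cudf::concatenate(views);
}
```

Once a stream parameter is available (via the detail API or a future public overload), the lambda body could instead pass an explicit `rmm::cuda_stream_view` drawn from an `rmm::cuda_stream_pool`, which would remove the dependence on PTDS.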
Describe alternatives you've considered
The alternative is to continue relying on Spark-RAPIDS NDS runs to track the performance of libcudf's Parquet reader in a multi-threaded, multi-stream use case.