Description
`tensorflow_io` supports a wide variety of APIs for IO operations. However, there are no performance benchmarks for these APIs, which users might find very useful. This issue is aimed at discussing possible design solutions to address this.
Suggestion:
We can use the test data available in our repo (adding additional data if required, or downloading data on the fly) and perform benchmarking using GitHub Actions. A separate workflow named `benchmarks` can be created and dedicated to benchmarking the performance of the APIs using the latest `tensorflow-io` package.
- To start with, we can benchmark the APIs which do not need any external data. For example: using `decode_json` on the elements of a custom `tf.data.Dataset` (which can be prepared from a list of serialized JSONs). A sketch of this case follows the list.
- For the APIs that need data in the form of files, we can leverage the data available in `tests/` and come up with some initial benchmarks. For example: preparing a dataset using `IODataset.from_parquet()` and benchmarking the throughput of data consumption (see the second sketch after this list).
- The tricky part lies in benchmarking the APIs which need external data sources, for example Kafka, Pulsar, MongoDB etc. We can try populating the data and benchmarking the consumption throughput, but this heavily depends on the stability of the GitHub Actions environment.
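As an illustration of the first case, a minimal sketch could look like the following. The helper name and the exact signature of `tfio.experimental.serialization.decode_json` (a JSON string plus a structure of `tf.TensorSpec`) are assumptions for illustration, not a final design.

```python
import json
import time

import tensorflow as tf
import tensorflow_io as tfio

# Hypothetical helper: build a dataset of serialized JSON strings in memory,
# so the benchmark does not depend on any external data source.
def make_json_dataset(num_records=10_000):
    records = [json.dumps({"label": i % 2, "value": float(i)}) for i in range(num_records)]
    return tf.data.Dataset.from_tensor_slices(records)

# Assumed spec structure for decode_json; adjust if the actual API differs.
specs = {
    "label": tf.TensorSpec(tf.TensorShape([]), tf.int64),
    "value": tf.TensorSpec(tf.TensorShape([]), tf.float64),
}

dataset = make_json_dataset().map(
    lambda s: tfio.experimental.serialization.decode_json(s, specs)
)

# Naive throughput measurement: records decoded per second.
start = time.perf_counter()
count = sum(1 for _ in dataset)
elapsed = time.perf_counter() - start
print(f"decode_json: {count / elapsed:.0f} records/sec")
```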
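Similarly, for the file-based case, a rough throughput measurement over one of the Parquet files under `tests/` might look like the sketch below. The file path is a placeholder, and the assumption that `IODataset.from_parquet()` can build a per-row dataset from just a filename may need adjusting to the actual signature.

```python
import time

import tensorflow_io as tfio

# Placeholder path: any Parquet file already checked in under tests/ would do.
filename = "tests/test_parquet/parquet_cpp_example.parquet"

# Assumption: from_parquet builds a dataset directly from a filename,
# yielding one element per row.
dataset = tfio.IODataset.from_parquet(filename)

# Measure how fast rows can be consumed end to end.
start = time.perf_counter()
rows = sum(1 for _ in dataset)
elapsed = time.perf_counter() - start
print(f"from_parquet: {rows} rows in {elapsed:.3f}s ({rows / elapsed:.0f} rows/sec)")
```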
API:
We can subclass `tf.test.Benchmark` and write custom benchmarks for the APIs. The benchmarks can be placed in a directory named `benchmarks/` at the root of the repository.
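As a starting point, a benchmark class in `benchmarks/` could look roughly like this. The class and method names are placeholders; `tf.test.Benchmark` and its `report_benchmark()` method are taken from the TensorFlow API, while the JSON dataset and `decode_json` usage mirror the assumptions in the sketch above.

```python
import json
import time

import tensorflow as tf
import tensorflow_io as tfio

class SerializationBenchmark(tf.test.Benchmark):
    """Hypothetical benchmark for tfio.experimental.serialization.decode_json."""

    def benchmark_decode_json(self):
        specs = {
            "label": tf.TensorSpec(tf.TensorShape([]), tf.int64),
            "value": tf.TensorSpec(tf.TensorShape([]), tf.float64),
        }
        records = [
            json.dumps({"label": i % 2, "value": float(i)}) for i in range(10_000)
        ]
        dataset = tf.data.Dataset.from_tensor_slices(records).map(
            lambda s: tfio.experimental.serialization.decode_json(s, specs)
        )

        start = time.perf_counter()
        count = sum(1 for _ in dataset)
        wall_time = time.perf_counter() - start

        # report_benchmark() feeds the results into TensorFlow's benchmark
        # reporting machinery, which the CI workflow could collect.
        self.report_benchmark(
            name="decode_json",
            iters=count,
            wall_time=wall_time,
            extras={"records_per_sec": count / wall_time},
        )

if __name__ == "__main__":
    tf.test.main()
```

If the usual TensorFlow benchmark-runner conventions apply, such a file could be executed directly with a benchmark filter flag (e.g. `--benchmarks`), though the exact invocation from the `benchmarks` workflow would need to be confirmed.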