
Performance benchmarking of tensorflow_io API's #1265

Closed
@kvignesh1420

Description

tensorflow_io supports a wide variety of APIs for I/O operations. However, there is no performance benchmarking of these APIs, which users would find very useful. This issue is aimed at discussing possible design solutions to address this gap.

Suggestion:
We can use the test data available in our repo (adding data if required, or downloading it on the fly) and perform benchmarking using GitHub Actions. A separate workflow named benchmarks can be created, dedicated to benchmarking the performance of the APIs against the latest tensorflow-io package.

  • To start with, we can benchmark the APIs that do not need any external data. For example: running decode_json on the elements of a custom tf.data.Dataset (which can be prepared from a list of serialized JSON strings).
  • For the APIs that need data in the form of files, we can leverage the data available in tests/ and come up with some initial benchmarks. For example: preparing a dataset using IODataset.from_parquet() and benchmarking the throughput of data consumption.
  • The tricky part lies in benchmarking the APIs that need external data sources, e.g. Kafka, Pulsar, MongoDB. We can try populating the data and benchmarking the consumption throughput, but this depends heavily on the stability of the GitHub Actions environment.
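The throughput measurement mentioned in the bullets above could be sketched as a small framework-agnostic helper. This is only an illustration, not code from the repo; `measure_throughput` and its factory-callable signature are hypothetical names, chosen so each run iterates a fresh dataset (mirroring how a tf.data.Dataset or IODataset would be re-created per run):

```python
import time
from typing import Callable, Iterable

def measure_throughput(make_dataset: Callable[[], Iterable], runs: int = 3) -> dict:
    """Consume the dataset `runs` times and report mean elements/second.

    `make_dataset` is a factory rather than a dataset instance so every
    run starts from a fresh iterator (one-shot datasets would otherwise
    be exhausted after the first run).
    """
    rates = []
    count = 0
    for _ in range(runs):
        dataset = make_dataset()
        start = time.perf_counter()
        # Count elements as they are consumed; the iteration itself is
        # the work being timed.
        count = sum(1 for _ in dataset)
        elapsed = time.perf_counter() - start
        rates.append(count / elapsed if elapsed > 0 else float("inf"))
    return {
        "runs": runs,
        "elements_per_run": count,
        "mean_elements_per_sec": sum(rates) / len(rates),
    }
```

The same helper could then wrap any of the cases above, e.g. `measure_throughput(lambda: tfio.IODataset.from_parquet(path))`, with the per-API setup kept outside the timed region.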

API:
We can subclass tf.test.Benchmark and write custom benchmarks for the APIs. The benchmarks can be placed in a benchmarks/ directory at the root of the repository.
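A minimal sketch of the tf.test.Benchmark subclass pattern, assuming TensorFlow is installed. The class and method names are placeholders, and the dataset here is a plain tf.data range dataset standing in for a tensorflow_io source (a real benchmark would exercise an API such as decode_json or IODataset.from_parquet):

```python
import time

import tensorflow as tf


class IODatasetBenchmark(tf.test.Benchmark):
    """Hypothetical benchmark class; names and dataset are placeholders."""

    def benchmark_dataset_consumption(self):
        # Stand-in for a tensorflow_io dataset under test.
        dataset = tf.data.Dataset.range(100_000).batch(1024)
        start = time.time()
        count = 0
        for batch in dataset:
            count += int(batch.shape[0])
        wall_time = time.time() - start
        # report_benchmark() emits results in the standard format that
        # TensorFlow's benchmark tooling can collect.
        self.report_benchmark(
            name="dataset_consumption",
            iters=count,
            wall_time=wall_time,
            extras={"elements_per_sec": count / wall_time},
        )


if __name__ == "__main__":
    IODatasetBenchmark().benchmark_dataset_consumption()
```

Methods prefixed with `benchmark` are discovered by TensorFlow's benchmark runner, so a benchmarks/ directory of such classes could be invoked directly from the proposed GitHub Actions workflow.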

cc: @yongtang @terrytangyuan @BryanCutler @vlasenkoalexey.
