Produce representative dataframes for benchmarking #15911

Open
@mrocklin

Description

It would be convenient to have a canonical set of dataframes for use in testing and/or benchmarking. Ideally this would be a set of named dataframes that represent common forms of data, such as the following:

  1. Random floating point data
  2. Random integer data
  3. Strings with low entropy
  4. Strings with high entropy
  5. Mostly sorted datetimes
  6. ...
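
As a rough illustration of what such a named set could look like, here is a minimal sketch. The function name `make_frames`, the dictionary keys, the column names, and the sizes are all hypothetical and not an existing pandas API; "mostly sorted" is modelled here by swapping about 1% of the rows of a sorted range, which is only one possible interpretation.

```python
import string

import numpy as np
import pandas as pd


def make_frames(n_rows=1000, seed=0):
    """Return a dict of small named dataframes (hypothetical helper).

    Assumes n_rows >= 2 so that at least one pair of rows can be swapped.
    """
    rng = np.random.default_rng(seed)  # seeded so every run gives the same data

    # Low entropy: a handful of distinct values repeated many times.
    low = rng.choice(["alice", "bob", "charlie"], size=n_rows)

    # High entropy: 12 random lowercase letters per row.
    letters = rng.choice(list(string.ascii_lowercase), size=(n_rows, 12))
    high = ["".join(row) for row in letters]

    # Mostly sorted datetimes: start sorted, then swap a few disjoint pairs.
    ts = pd.date_range("2000-01-01", periods=n_rows, freq="min").to_numpy().copy()
    n_swaps = max(1, n_rows // 100)
    idx = rng.choice(n_rows, size=2 * n_swaps, replace=False)
    a, b = idx[:n_swaps], idx[n_swaps:]
    ts[a], ts[b] = ts[b], ts[a]

    return {
        "floats": pd.DataFrame({"x": rng.standard_normal(n_rows)}),
        "ints": pd.DataFrame({"x": rng.integers(0, 1000, size=n_rows)}),
        "strings_low_entropy": pd.DataFrame({"x": low}),
        "strings_high_entropy": pd.DataFrame({"x": high}),
        "datetimes_mostly_sorted": pd.DataFrame({"x": ts}),
    }
```

A benchmark could then look up a frame by name, for example `make_frames(10_000)["floats"]`, and two libraries using the same seed would see identical data.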

These could then be used either within Pandas or in other libraries for benchmarks. Having a shared set of dataframes would probably make benchmark results easier to compare.

Additionally, if these were also packaged as pytest fixtures, we could imagine setting things up and tearing them down in a way that makes benchmarking more consistent (such as controlling garbage collection), though this may be a separate endeavor. It would be nice to have access to the dataframes outside of the context of pytest as well.
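
One way to picture the garbage-collection control mentioned above is a small context manager that works both inside and outside pytest. This is only a sketch using the standard library; the names `gc_paused` and `time_call` are hypothetical, and taking the minimum over a few repeats is just one common timing convention.

```python
import contextlib
import gc
import time


@contextlib.contextmanager
def gc_paused():
    """Collect once, then keep the garbage collector off inside the block."""
    was_enabled = gc.isenabled()
    gc.collect()  # start from a clean slate
    gc.disable()
    try:
        yield
    finally:
        # Restore the previous state rather than enabling unconditionally.
        if was_enabled:
            gc.enable()


def time_call(func, *args, repeat=5):
    """Return the best wall-clock time of func(*args) over `repeat` runs."""
    timings = []
    with gc_paused():
        for _ in range(repeat):
            start = time.perf_counter()
            func(*args)
            timings.append(time.perf_counter() - start)
    return min(timings)
```

For example, `time_call(df.sort_values, "x")` would time a sort with collection paused. Under pytest, a fixture could enter `gc_paused()` before its `yield` and leave it afterwards, so the same setup and teardown logic is shared by both use cases.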

cc @jreback @wesm @cpcloud

Metadata

Labels: Performance (Memory or execution speed performance)
