
Export benchmark information as line protocol #6107

Closed
@alamb

Description

Is your feature request related to a problem or challenge?

We want to track DataFusion's performance over time (#5504). This is becoming more important as we work on more performance optimizations such as #5904.

Currently, the DataFusion benchmarks in https://github.com/apache/arrow-datafusion/tree/main/benchmarks#datafusion-benchmarks can output the results of a run as a JSON file.

I would like to use existing visualization systems (such as time series databases) to look at this data.

Describe the solution you'd like

I would like to optionally output the benchmark data as InfluxDB line protocol https://docs.influxdata.com/influxdb/cloud-iox/reference/syntax/line-protocol/ so that it can be visualized by Grafana or other systems that understand line protocol.

See https://grafana.com/docs/grafana-cloud/data-configuration/metrics/metrics-influxdb/push-from-telegraf/
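Once line protocol exists, it can be pushed via Telegraf as described in that link, or written directly to any InfluxDB-compatible write endpoint. A minimal sketch of a direct push, assuming an InfluxDB v2 style write API and hypothetical host/org/bucket/token values:

# Minimal sketch: push a line protocol file to an InfluxDB v2 style write API.
# The URL, org, bucket, token, and file name below are hypothetical placeholders.
import requests

with open("sort.lp", "rb") as f:
    resp = requests.post(
        "http://localhost:8086/api/v2/write",
        params={"org": "my-org", "bucket": "datafusion-benchmarks", "precision": "ns"},
        headers={"Authorization": "Token MY_TOKEN"},
        data=f,
    )
resp.raise_for_status()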

Proposed Design

Write a Python script, modeled after compare.py, that takes a benchmark results JSON file and produces line protocol as output (see the sketch after the example input below).

Desired output
measurement: benchmark
tags: details from the run (benchmark name, scale factor, datafusion version, num_cpus, ...)
fields: query, iteration, row_count, elapsed_ms
timestamp: ns since epoch (start_time is in seconds, so multiply by 1,000,000,000)

Example output
One line like this for each iteration of each element in the queries array:

benchmark,name=sort,scale_factor=1.0,datafusion_version=31.0.0,num_cpus=8 query="sort utf8",iteration=1i,row_count=10838832i,elapsed_ms=86441.988369 1694704746000000000

Example input:

{
  "context": {
    "arguments": [
      "sort",
      "--path",
      "/home/alamb/arrow-datafusion/benchmarks/data",
      "--scale-factor",
      "1.0",
      "--iterations",
      "5",
      "-o",
      "/home/alamb/arrow-datafusion/benchmarks/results/main_base/sort.json"
    ],
    "benchmark_version": "31.0.0",
    "datafusion_version": "31.0.0",
    "num_cpus": 8,
    "start_time": 1694704746
  },
  "queries": [
    {
      "iterations": [
        {
          "elapsed": 86441.988369,
          "row_count": 10838832
        },
        {
          "elapsed": 73182.81637,
          "row_count": 10838832
        },
        {
          "elapsed": 69536.53120900001,
          "row_count": 10838832
        },
        {
          "elapsed": 72179.459332,
          "row_count": 10838832
        },
        {
          "elapsed": 71660.65385500001,
          "row_count": 10838832
        }
      ],
      "query": "sort utf8",
      "start_time": 1694704746
    },
    {
      "iterations": [
        {
          "elapsed": 89047.348867,
          "row_count": 10838832
        },
        {
          "elapsed": 89168.79565399999,
          "row_count": 10838832
        },
        {
          "elapsed": 88951.52251499999,
          "row_count": 10838832
        },
        {
          "elapsed": 98504.891076,
          "row_count": 10838832
        },
        {
          "elapsed": 89457.13566700001,
          "row_count": 10838832
        }
      ],
      "query": "sort int",
      "start_time": 1694705119
    },
    {
      "iterations": [
        {
          "elapsed": 71307.72546599999,
          "row_count": 10838832
        },
        {
          "elapsed": 71463.172695,
          "row_count": 10838832
        },
        {
          "elapsed": 77577.714498,
          "row_count": 10838832
        },
        {
          "elapsed": 71730.90387400001,
          "row_count": 10838832
        },
        {
          "elapsed": 72624.773934,
          "row_count": 10838832
        }
      ],
      "query": "sort decimal",
      "start_time": 1694705575
    },
    {
      "iterations": [
        {
          "elapsed": 96741.53251,
          "row_count": 10838832
        },
        {
          "elapsed": 97752.85497999999,
          "row_count": 10838832
        },
        {
          "elapsed": 95654.327294,
          "row_count": 10838832
        },
        {
          "elapsed": 96713.50062400001,
          "row_count": 10838832
        },
        {
          "elapsed": 94291.325883,
          "row_count": 10838832
        }
      ],
      "query": "sort integer tuple",
      "start_time": 1694705940
    },
    {
      "iterations": [
        {
          "elapsed": 72497.7272,
          "row_count": 10838832
        },
        {
          "elapsed": 72443.536695,
          "row_count": 10838832
        },
        {
          "elapsed": 73023.115685,
          "row_count": 10838832
        },
        {
          "elapsed": 73800.62915899999,
          "row_count": 10838832
        },
        {
          "elapsed": 71583.947462,
          "row_count": 10838832
        }
      ],
      "query": "sort utf8 tuple",
      "start_time": 1694706421
    },
    {
      "iterations": [
        {
          "elapsed": 81407.140528,
          "row_count": 10838832
        },
        {
          "elapsed": 85593.791929,
          "row_count": 10838832
        },
        {
          "elapsed": 81712.19639,
          "row_count": 10838832
        },
        {
          "elapsed": 80993.492422,
          "row_count": 10838832
        },
        {
          "elapsed": 83290.99224600001,
          "row_count": 10838832
        }
      ],
      "query": "sort mixed tuple",
      "start_time": 1694706785
    }
  ]
}
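
A minimal sketch of such a conversion script, following the desired output above (the parsing of the context arguments and the exact tag names are assumptions, not a final design):

#!/usr/bin/env python3
# Minimal sketch: convert a benchmark results JSON file (structure as in the
# example input above) into InfluxDB line protocol on stdout.
import json
import sys


def escape_tag(value):
    # line protocol tag values must escape commas, spaces and equals signs
    return str(value).replace(",", r"\,").replace(" ", r"\ ").replace("=", r"\=")


def to_lines(results):
    context = results["context"]
    args = context["arguments"]
    tags = {
        "name": args[0],  # e.g. "sort"
        "datafusion_version": context["datafusion_version"],
        "num_cpus": context["num_cpus"],
    }
    # pull the "--scale-factor 1.0" pair out of the argument list (assumption:
    # flags of interest are always followed by their value)
    for flag, value in zip(args, args[1:]):
        if flag == "--scale-factor":
            tags["scale_factor"] = value
    tag_str = ",".join(f"{k}={escape_tag(v)}" for k, v in tags.items())

    for query in results["queries"]:
        # line protocol timestamps default to nanoseconds; start_time is in seconds
        timestamp_ns = query["start_time"] * 1_000_000_000
        for i, iteration in enumerate(query["iterations"], start=1):
            fields = ",".join(
                [
                    f'query="{query["query"]}"',
                    f"iteration={i}i",
                    f'row_count={iteration["row_count"]}i',
                    f'elapsed_ms={iteration["elapsed"]}',
                ]
            )
            yield f"benchmark,{tag_str} {fields} {timestamp_ns}"


if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        for line in to_lines(json.load(f)):
            print(line)

A run like python benchmark_to_lineprotocol.py sort.json > sort.lp (script name hypothetical) would emit one line per iteration, which could then be pushed to InfluxDB / Grafana as described above.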

Here is a zip file with a bunch of example benchmark json files: results.zip

Describe alternatives you've considered

No response

Additional context

Related to #5504 tracking data over time

Metadata

Labels: enhancement (New feature or request)
