---
title: "Query Auditor (tool)"
linkTitle: "query auditor (tool)"
weight: 1
slug: query-auditor
---

The query auditor is a tool bundled in the Cortex repository but **not** included in Docker images; it must be built from source. It's primarily useful for those _developing_ Cortex, but it can also help operators in certain scenarios (backend migrations come to mind).

## How it works

The `query-audit` tool performs a set of queries against two backends that expose the Prometheus read API -- generally the `query-frontend` component of two Cortex deployments. It then compares the responses to determine the average difference for each query. It does this by:
 - Ensuring the resulting label sets match
 - For each label set, ensuring it contains the same number of samples as its pair from the other backend
 - For each sample, calculating its difference against its pair from the other backend/label set
 - Calculating the average diff per query from the above diffs
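
The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code; the `Series` type and `avgDiff` helper are assumptions for the sketch:

```go
package main

import (
	"fmt"
	"math"
)

// Series is a simplified stand-in for one label set's samples
// from a query_range response (an assumption, not the tool's real type).
type Series struct {
	Labels  string
	Samples []float64
}

// avgDiff returns the average absolute difference between paired series,
// erroring if label sets or sample counts don't line up.
func avgDiff(control, test []Series) (float64, error) {
	if len(control) != len(test) {
		return 0, fmt.Errorf("series count mismatch: %d vs %d", len(control), len(test))
	}
	var sum float64
	var n int
	for i := range control {
		c, t := control[i], test[i]
		// Step 1: the resulting label sets must match.
		if c.Labels != t.Labels {
			return 0, fmt.Errorf("label set mismatch: %q vs %q", c.Labels, t.Labels)
		}
		// Step 2: each pair must contain the same number of samples.
		if len(c.Samples) != len(t.Samples) {
			return 0, fmt.Errorf("sample count mismatch for %q", c.Labels)
		}
		// Step 3: accumulate per-sample differences.
		for j := range c.Samples {
			sum += math.Abs(c.Samples[j] - t.Samples[j])
			n++
		}
	}
	if n == 0 {
		return 0, nil
	}
	// Step 4: average the diffs for the whole query.
	return sum / float64(n), nil
}

func main() {
	control := []Series{{Labels: `{job="a"}`, Samples: []float64{1, 2, 3}}}
	test := []Series{{Labels: `{job="a"}`, Samples: []float64{1, 2, 3}}}
	d, err := avgDiff(control, test)
	fmt.Println(d, err) // identical backends yield an average diff of 0
}
```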

### Limitations

It currently supports only queries with `Matrix` response types, but it should be simple to extend to `Vector`s as well, should the need arise.

### Use cases

- Correctness testing when working on the read path.
- Comparing results from different backends.

### Example Configuration

```yaml
control:
  host: http://localhost:8080/api/prom
  headers:
    "X-Scope-OrgID": 1234

test:
  host: http://localhost:8081/api/prom
  headers:
    "X-Scope-OrgID": 1234

queries:
  - query: 'sum(rate(container_cpu_usage_seconds_total[5m]))'
    start: 2019-11-25T00:00:00Z
    end: 2019-11-28T00:00:00Z
    step_size: 15m
  - query: 'sum(rate(container_cpu_usage_seconds_total[5m])) by (container_name)'
    start: 2019-11-25T00:00:00Z
    end: 2019-11-28T00:00:00Z
    step_size: 15m
  - query: 'sum(rate(container_cpu_usage_seconds_total[5m])) without (container_name)'
    start: 2019-11-25T00:00:00Z
    end: 2019-11-26T00:00:00Z
    step_size: 15m
  - query: 'histogram_quantile(0.9, sum(rate(cortex_cache_value_size_bytes_bucket[5m])) by (le, job))'
    start: 2019-11-25T00:00:00Z
    end: 2019-11-25T06:00:00Z
    step_size: 15m
  # two shardable legs
  - query: 'sum without (instance, job) (rate(cortex_query_frontend_queue_length[5m])) or sum by (job) (rate(cortex_query_frontend_queue_length[5m]))'
    start: 2019-11-25T00:00:00Z
    end: 2019-11-25T06:00:00Z
    step_size: 15m
  # one shardable leg
  - query: 'sum without (instance, job) (rate(cortex_cache_request_duration_seconds_count[5m])) or rate(cortex_cache_request_duration_seconds_count[5m])'
    start: 2019-11-25T00:00:00Z
    end: 2019-11-25T06:00:00Z
    step_size: 15m
```

### Example Output

Under ideal circumstances, you'll see output like the following:

```
$ go install ./tools/query-audit/ && query-audit -f ~/grafana/tmp/equivalence-config.yaml

0.000000% avg diff for:
  query: sum(rate(container_cpu_usage_seconds_total[5m]))
  series: 1
  samples: 289
  start: 2019-11-25 00:00:00 +0000 UTC
  end: 2019-11-28 00:00:00 +0000 UTC
  step: 15m0s

0.000000% avg diff for:
  query: sum(rate(container_cpu_usage_seconds_total[5m])) by (container_name)
  series: 95
  samples: 25877
  start: 2019-11-25 00:00:00 +0000 UTC
  end: 2019-11-28 00:00:00 +0000 UTC
  step: 15m0s

0.000000% avg diff for:
  query: sum(rate(container_cpu_usage_seconds_total[5m])) without (container_name)
  series: 4308
  samples: 374989
  start: 2019-11-25 00:00:00 +0000 UTC
  end: 2019-11-26 00:00:00 +0000 UTC
  step: 15m0s

0.000000% avg diff for:
  query: histogram_quantile(0.9, sum(rate(cortex_cache_value_size_bytes_bucket[5m])) by (le, job))
  series: 13
  samples: 325
  start: 2019-11-25 00:00:00 +0000 UTC
  end: 2019-11-25 06:00:00 +0000 UTC
  step: 15m0s

0.000000% avg diff for:
  query: sum without (instance, job) (rate(cortex_query_frontend_queue_length[5m])) or sum by (job) (rate(cortex_query_frontend_queue_length[5m]))
  series: 21
  samples: 525
  start: 2019-11-25 00:00:00 +0000 UTC
  end: 2019-11-25 06:00:00 +0000 UTC
  step: 15m0s

0.000000% avg diff for:
  query: sum without (instance, job) (rate(cortex_cache_request_duration_seconds_count[5m])) or rate(cortex_cache_request_duration_seconds_count[5m])
  series: 942
  samples: 23550
  start: 2019-11-25 00:00:00 +0000 UTC
  end: 2019-11-25 06:00:00 +0000 UTC
  step: 15m0s

0.000000% avg diff for:
  query: sum by (namespace) (predict_linear(container_cpu_usage_seconds_total[5m], 10))
  series: 16
  samples: 400
  start: 2019-11-25 00:00:00 +0000 UTC
  end: 2019-11-25 06:00:00 +0000 UTC
  step: 15m0s

0.000000% avg diff for:
  query: sum by (namespace) (avg_over_time((rate(container_cpu_usage_seconds_total[5m]))[10m:]) > 1)
  series: 4
  samples: 52
  start: 2019-11-25 00:00:00 +0000 UTC
  end: 2019-11-25 01:00:00 +0000 UTC
  step: 5m0s
```