Commit b95cf6d

Update README.md
jimmy.xj committed
1 parent 04b5d2d commit b95cf6d

2 files changed

+130
-128
lines changed

README.md

Lines changed: 65 additions & 64 deletions
Original file line number | Diff line number | Diff line change
@@ -11,14 +11,15 @@ DevOps-Eval is a comprehensive evaluation suite specifically designed for founda
1111

1212
📚 This repo contains questions and exercises related to DevOps, including the AIOps.
1313

14-
💥️ There are currently **4850** multiple-choice questions spanning 8 diverse general categories, as shown [below](images/data_info.png).
14+
💥️ There are currently **5977** multiple-choice questions spanning 8 diverse general categories, as shown [below](images/data_info.png).
1515

16-
🔥 There are a total of **2200** samples in the AIOps subcategory, covering scenarios such as **log parsing**, **time series anomaly detection**, **time series classification**, **and root cause analysis**.
16+
🔥 There are a total of **2840** samples in the AIOps subcategory, covering scenarios such as **log parsing**, **time series anomaly detection**, **time series classification**, **time series forecasting**, and **root cause analysis**.
1717

1818
<p align="center"> <a href="resources/devops_diagram_zh.jpg"> <img src="images/data_info.png" style="width: 100%;" id="data_info"></a></p>
1919

2020

2121
## 🔔 News
22+
* **[2023.11.27]** Add 487 operation sense samples and 640 time series forecasting samples; Update the Leaderboard;
2223
* **[2023.10.30]** Add the AIOps Leaderboard.
2324
* **[2023.10.25]** Add the AIOps samples, including log parsing, time series anomaly detection, time series classification and root cause analysis.
2425
* **[2023.10.18]** Update the initial Leaderboard...
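
The updated totals in this hunk can be cross-checked against the 2023.11.27 news entry. A minimal arithmetic sketch, assuming the new counts are simply the old counts plus the samples listed in that entry (the variable names are illustrative, and the assumption that only the forecasting samples count toward AIOps is ours):

```python
# Counts taken from the removed (old) README lines in this diff.
old_total_questions = 4850
old_aiops_samples = 2200

# Samples added on 2023.11.27 according to the News entry.
operation_sense_added = 487
forecasting_added = 640

new_total_questions = old_total_questions + operation_sense_added + forecasting_added
# Assumption: only the time series forecasting samples belong to the AIOps subcategory.
new_aiops_samples = old_aiops_samples + forecasting_added

print(new_total_questions)  # 5977
print(new_aiops_samples)    # 2840
```

Both results match the figures on the added README lines.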
@@ -44,77 +45,77 @@ Below are zero-shot and five-shot accuracies from the models that we evaluate in
4445

4546
| **ModelName** | plan | code | build | test | release | deploy | operate | monitor | **AVG** |
4647
|:------------------------:|:-----:|:-----:|:-----:|:------:|:--------:|:------:|:-------:|:--------:|:-----------:|
47-
| **DevOpsPal-14B-Chat** | 60.61 | 78.35 | 84.86 | 84.65 | 87.26 | 82.75 | 81.34 | 79.17 | **80.34** |
48-
| **DevOpsPal-14B-Base** | 54.55 | 77.82 | 83.49 | 85.96 | 86.32 | 81.96 | 85.82 | 82.41 | **80.26** |
49-
| Qwen-14B-Chat | 60.61 | 75.4 | 85.32 | 84.21 | 89.62 | 82.75 | 83.58 | 80.56 | 79.28 |
50-
| Qwen-14B-Base | 57.58 | 73.81 | 84.4 | 85.53 | 86.32 | 81.18 | 82.09 | 80.09 | 77.92 |
51-
| Baichuan2-13B-Base | 60.61 | 69.42 | 79.82 | 79.82 | 82.55 | 81.18 | 85.07 | 83.8 | 75.10 |
52-
| Baichuan2-13B-Chat | 60.61 | 68.43 | 77.98 | 80.7 | 81.6 | 83.53 | 82.09 | 84.72 | 74.60 |
53-
| **DevOpsPal-7B-Chat** | 54.55 | 69.11 | 83.94 | 82.02 | 76.89 | 80 | 79.85 | 77.78 | **74.00** |
54-
| **DevOpsPal-7B-Base** | 54.55 | 68.96 | 82.11 | 78.95 | 80.66 | 76.47 | 79.85 | 78.7 | **73.55** |
55-
| Qwen-7B-Base | 53.03 | 68.13 | 78.9 | 75.44 | 80.19 | 80 | 83.58 | 80.09 | 73.13 |
56-
| Qwen-7B-Chat | 57.58 | 66.01 | 80.28 | 79.82 | 76.89 | 77.65 | 80.6 | 79.17 | 71.96 |
57-
| Baichuan2-7B-Chat | 54.55 | 63.66 | 77.98 | 76.32 | 71.7 | 73.33 | 75.37 | 79.63 | 68.17 |
58-
| Internlm-7B-Chat | 60.61 | 62.15 | 77.06 | 76.32 | 66.98 | 74.51 | 74.63 | 78.24 | 68.08 |
59-
| Baichuan2-7B-Base | 56.06 | 62.45 | 75.69 | 70.61 | 74.06 | 69.8 | 76.12 | 75.93 | 67.51 |
60-
| Internlm-7B-Base | 54.55 | 58.29 | 79.36 | 78.95 | 77.83 | 70.59 | 78.36 | 75.93 | 66.91 |
48+
| DevOpsPal-14B-Chat | 60.61 | 78.35 | 84.86 | 84.65 | 87.26 | 82.75 | 69.89 | 79.17 | 78.23 |
49+
| DevOpsPal-14B-Base | 54.55 | 77.82 | 83.49 | 85.96 | 86.32 | 81.96 | 71.18 | 82.41 | 78.23 |
50+
| Qwen-14B-Chat | 60.61 | 75.4 | 85.32 | 84.21 | 89.62 | 82.75 | 69.57 | 80.56 | 77.18 |
51+
| Qwen-14B-Base | 57.58 | 73.81 | 84.4 | 85.53 | 86.32 | 81.18 | 70.05 | 80.09 | 76.19 |
52+
| Baichuan2-13B-Base | 60.61 | 69.42 | 79.82 | 79.82 | 82.55 | 81.18 | 70.37 | 83.8 | 73.73 |
53+
| Baichuan2-13B-Chat | 60.61 | 68.43 | 77.98 | 80.7 | 81.6 | 83.53 | 67.63 | 84.72 | 72.9 |
54+
| DevOpsPal-7B-Chat | 54.55 | 69.11 | 83.94 | 82.02 | 76.89 | 80 | 64.73 | 77.78 | 71.92 |
55+
| DevOpsPal-7B-Base | 54.55 | 68.96 | 82.11 | 78.95 | 80.66 | 76.47 | 65.54 | 78.7 | 71.69 |
56+
| Qwen-7B-Base | 53.03 | 68.13 | 78.9 | 75.44 | 80.19 | 80 | 65.06 | 80.09 | 71.09 |
57+
| Qwen-7B-Chat | 57.58 | 66.01 | 80.28 | 79.82 | 76.89 | 77.65 | 62.64 | 79.17 | 69.75 |
58+
| Baichuan2-7B-Chat | 54.55 | 63.66 | 77.98 | 76.32 | 71.7 | 73.33 | 59.42 | 79.63 | 66.97 |
59+
| Internlm-7B-Chat | 60.61 | 62.15 | 77.06 | 76.32 | 66.98 | 74.51 | 60.39 | 78.24 | 66.27 |
60+
| Baichuan2-7B-Base | 56.06 | 62.45 | 75.69 | 70.61 | 74.06 | 69.8 | 61.67 | 75.93 | 66.21 |
61+
| Internlm-7B-Base | 54.55 | 58.29 | 79.36 | 78.95 | 77.83 | 70.59 | 65.86 | 75.93 | 65.99 |
6162

6263

6364
#### Five Shot
6465

6566
| **ModelName** | plan | code | build | test | release | deploy | operate | monitor | **AVG** |
6667
|:------------------------:|:-----:|:-----:|:-----:|:------:|:--------:|:------:|:-------:|:--------:|:---------:|
67-
| **DevOpsPal-14B-Chat** |63.64 | 79.49 | 81.65 | 85.96 | 86.79 | 86.67 | 89.55 | 81.48 | **81.77** |
68-
| **DevOpsPal-14B-Base** | 62.12 | 80.55 | 82.57 | 85.53 | 85.85 | 84.71 | 85.07 | 80.09 | **81.70** |
69-
| Qwen-14B-Chat | 65.15 | 76 | 82.57 | 85.53 | 84.91 | 84.31 | 85.82 | 81.48 | 79.55 |
70-
| Qwen-14B-Base | 66.67 | 76.15 | 84.4 | 85.53 | 86.32 | 80.39 | 86.57 | 80.56 | 79.51 |
71-
| Baichuan2-13B-Base | 63.64 | 71.39 | 80.73 | 82.46 | 81.13 | 84.31 | 91.79 | 85.19 | 77.09 |
72-
| Qwen-7B-Base | 75.76 | 72.52 | 78.9 | 81.14 | 83.96 | 81.18 | 85.07 | 81.94 | 77.02 |
73-
| Baichuan2-13B-Chat | 62.12 | 69.95 | 76.61 | 84.21 | 83.49 | 79.61 | 88.06 | 80.56 | 75.32 |
74-
| **DevOpsPal-7B-Chat** | 66.67 | 69.95 | 83.94 | 81.14 | 80.19 | 82.75 | 82.84 | 76.85 | **75.25** |
75-
| **DevOpsPal-7B-Base** | 69.7 | 69.49 | 82.11 | 81.14 | 82.55 | 82.35 | 80.6 | 79.17 | **75.17** |
76-
| Qwen-7B-Chat | 65.15 | 66.54 | 82.57 | 81.58 | 81.6 | 81.18 | 80.6 | 81.02 | 73.62 |
77-
| Baichuan2-7B-Base | 60.61 | 67.22 | 76.61 | 75 | 77.83 | 78.43 | 80.6 | 79.63 | 72.11 |
78-
| Internlm-7B-Chat | 60.61 | 63.06 | 79.82 | 80.26 | 67.92 | 75.69 | 73.88 | 77.31 | 71.09 |
79-
| Baichuan2-7B-Chat | 60.61 | 64.95 | 81.19 | 75.88 | 71.23 | 75.69 | 78.36 | 79.17 | 70.49 |
80-
| Internlm-7B-Base | 62.12 | 65.25 | 77.52 | 80.7 | 74.06 | 78.82 | 79.85 | 75.46 | 69.17 |
68+
| DevOpsPal-14B-Chat | 63.64 | 79.49 | 81.65 | 85.96 | 86.79 | 86.67 | 72.95 | 81.48 | 79.69 |
69+
| DevOpsPal-14B-Base | 62.12 | 80.55 | 82.57 | 85.53 | 85.85 | 84.71 | 71.98 | 80.09 | 79.63 |
70+
| Qwen-14B-Chat | 65.15 | 76 | 82.57 | 85.53 | 84.91 | 84.31 | 70.85 | 81.48 | 77.81 |
71+
| Qwen-14B-Base | 66.67 | 76.15 | 84.4 | 85.53 | 86.32 | 80.39 | 72.46 | 80.56 | 77.56 |
72+
| Baichuan2-13B-Base | 63.64 | 71.39 | 80.73 | 82.46 | 81.13 | 84.31 | 73.75 | 85.19 | 75.8 |
73+
| Qwen-7B-Base | 75.76 | 72.52 | 78.9 | 81.14 | 83.96 | 81.18 | 70.37 | 81.94 | 75.36 |
74+
| Baichuan2-13B-Chat | 62.12 | 69.95 | 76.61 | 84.21 | 83.49 | 79.61 | 71.98 | 80.56 | 74.12 |
75+
| DevOpsPal-7B-Chat | 66.67 | 69.95 | 83.94 | 81.14 | 80.19 | 82.75 | 68.6 | 76.85 | 73.61 |
76+
| DevOpsPal-7B-Base | 69.7 | 69.49 | 82.11 | 81.14 | 82.55 | 82.35 | 67.15 | 79.17 | 73.35 |
77+
| Qwen-7B-Chat | 65.15 | 66.54 | 82.57 | 81.58 | 81.6 | 81.18 | 65.38 | 81.02 | 71.69 |
78+
| Baichuan2-7B-Base | 60.61 | 67.22 | 76.61 | 75 | 77.83 | 78.43 | 67.31 | 79.63 | 70.8 |
79+
| Internlm-7B-Chat | 60.61 | 63.06 | 79.82 | 80.26 | 67.92 | 75.69 | 60.06 | 77.31 | 69.21 |
80+
| Baichuan2-7B-Chat | 60.61 | 64.95 | 81.19 | 75.88 | 71.23 | 75.69 | 64.9 | 79.17 | 69.05 |
81+
| Internlm-7B-Base | 62.12 | 65.25 | 77.52 | 80.7 | 74.06 | 78.82 | 63.45 | 75.46 | 67.17 |
8182

8283
### 🔥 AIOps
8384
#### Zero Shot
84-
| **ModelName** | LogParsing | RootCauseAnalysis | TimeSeriesAnomalyDetection | TimeSeriesClassification | **AVG** |
85-
|:-------------------:|:------------:|:------------------:|:---------------------------:|:-------------------------:|:-------:|
86-
| Qwen-14B-Base | 66.29 | 58.8 | 25.33 | 43.5 | 49.27 |
87-
| DevOpsPal-14B—Base | 63.14 | 53.6 | 23.33 | 43.5 | 46.55 |
88-
| DevOpsPal-14BChat | 60 | 56 | 24 | 43 | 46.18 |
89-
| Qwen-14B-Chat | 64.57 | 51.6 | 22.67 | 36 | 45 |
90-
| Qwen-7B-Base | 50 | 39.2 | 22.67 | 54 | 40.82 |
91-
| Qwen-7B-Chat | 57.43 | 38.8 | 22.33 | 39.5 | 40.36 |
92-
| DevOpsPal-7B—Chat | 56.57 | 30.4 | 25.33 | 45 | 40 |
93-
| Baichuan2-13B-Chat | 64 | 18 | 21.33 | 37.5 | 37.09 |
94-
| Baichuan2-7B-Chat | 60.86 | 10 | 28 | 34.5 | 35.55 |
95-
| Baichuan2-7B-Base | 53.43 | 12.8 | 27.67 | 36.5 | 34.09 |
96-
| Internlm-7BBase | 48.57 | 18.8 | 23.33 | 37.5 | 32.91 |
97-
| Baichuan2-13B-Base | 54 | 12.4 | 23 | 34.5 | 32.55 |
98-
| DevOpsPal-7B—Base | 46.57 | 20.8 | 25 | 34 | 32.55 |
99-
| Internlm-7B—Chat | 58.86 | 8.8 | 22.33 | 28.5 | 32 |
85+
| **ModelName** | LogParsing | RootCauseAnalysis | TimeSeriesAnomalyDetection | TimeSeriesClassification | TimeSeriesForecasting | **AVG** |
86+
|:-------------------:|:------------:|:------------------:|:---------------------------:|:-----------------------------------------:|:---------------------------:|:-------:|
87+
| Qwen-14B-Base | 66.29 | 58.8 | 25.33 | 43.5 | 62.5 | 52.25 |
88+
| DevOpsPal-14B-Base  | 63.14        | 53.6               | 23.33                       | 43.5                                      | 64.06                       | 50.49   |
89+
| Qwen-14B-Chat | 64.57 | 51.6 | 22.67 | 36 | 62.5 | 48.94 |
90+
| DevOpsPal-14B-Chat  | 60           | 56                 | 24                          | 43                                        | 57.81                       | 48.8    |
91+
| Qwen-7B-Base | 50 | 39.2 | 22.67 | 54 | 43.75 | 41.48 |
92+
| DevOpsPal-7B-Chat   | 56.57        | 30.4               | 25.33                       | 45                                        | 44.06                       | 40.92   |
93+
| Baichuan2-13B-Chat | 64 | 18 | 21.33 | 37.5 | 46.88 | 39.3 |
94+
| Qwen-7B-Chat | 57.43 | 38.8 | 22.33 | 39.5 | 25.31 | 36.97 |
95+
| Internlm-7B-Chat    | 58.86        | 8.8                | 22.33                       | 28.5                                      | 51.25                       | 36.34   |
96+
| Baichuan2-7B-Chat | 60.86 | 10 | 28 | 34.5 | 39.06 | 36.34 |
97+
| Baichuan2-7B-Base | 53.43 | 12.8 | 27.67 | 36.5 | 40.31 | 35.49 |
98+
| Baichuan2-13B-Base | 54 | 12.4 | 23 | 34.5 | 42.81 | 34.86 |
99+
| DevOpsPal-7B-Base   | 46.57        | 20.8               | 25                          | 34                                        | 38.75                       | 33.94   |
100+
| Internlm-7B-Base    | 48.57        | 18.8               | 23.33                       | 37.5                                      | 33.75                       | 33.1    |
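
The AVG column in the AIOps tables does not appear to be the plain mean of the per-category scores; the reported values are consistent with a sample-weighted average. The sketch below illustrates this for the zero-shot Qwen-14B-Base row. The per-category test-set sizes are an assumption inferred from the granularity of the reported scores; they are not stated in the README, and `weighted_average` is a hypothetical helper, not part of the repository.

```python
def weighted_average(scores, sizes):
    """Sample-weighted mean of per-category accuracies (in percent)."""
    total = sum(sizes[name] for name in scores)
    return sum(scores[name] * sizes[name] for name in scores) / total


# Hypothetical test-set sizes (assumption, inferred from the scores; not from the README).
assumed_sizes = {
    "LogParsing": 350,
    "RootCauseAnalysis": 250,
    "TimeSeriesAnomalyDetection": 300,
    "TimeSeriesClassification": 200,
    "TimeSeriesForecasting": 320,
}

# Zero-shot scores for Qwen-14B-Base, copied from the table above.
qwen_14b_base = {
    "LogParsing": 66.29,
    "RootCauseAnalysis": 58.8,
    "TimeSeriesAnomalyDetection": 25.33,
    "TimeSeriesClassification": 43.5,
    "TimeSeriesForecasting": 62.5,
}

plain = sum(qwen_14b_base.values()) / len(qwen_14b_base)
weighted = weighted_average(qwen_14b_base, assumed_sizes)

print(round(plain, 2))     # 51.28
print(round(weighted, 2))  # 52.25, the AVG reported in the table
```

Under these assumed sizes, categories with more test samples pull the average toward their score, which is why adding the TimeSeriesForecasting column changes the ranking of some models.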
100101

101102
#### One Shot
102-
| **ModelName** | LogParsing | RootCauseAnalysis | TimeSeriesAnomalyDetection | TimeSeriesClassification | **AVG** |
103-
|:-------------------:|:------------:|:------------------:|:---------------------------:|:-------------------------:|:-------:|
104-
| DevOpsPal-14B—Chat | 66.29 | 80.8 | 23.33 | 44.5 | 53.91 |
105-
| Qwen-14B-Base | 64.29 | 74.4 | 28 | 48.5 | 53.82 |
106-
| DevOpsPal-14BBase | 60 | 74 | 25.33 | 43.5 | 50.73 |
107-
| Qwen-14B-Chat | 49.71 | 65.6 | 28.67 | 48 | 47.27 |
108-
| Qwen-7B-Base | 56 | 60.8 | 27.67 | 44 | 47.18 |
109-
| DevOpsPal-7B—Base | 52.86 | 44.4 | 28 | 44.5 | 42.64 |
110-
| Qwen-7B-Chat | 54.57 | 52 | 29.67 | 26.5 | 42.09 |
111-
| Baichuan2-13B-Base | 56 | 43.2 | 24.33 | 41 | 41.73 |
112-
| Baichuan2-13B-Chat | 57.43 | 44.4 | 25 | 25.5 | 39.82 |
113-
| Baichuan2-7B-Base | 48.29 | 40.4 | 27 | 42 | 39.55 |
114-
| Baichuan2-7B-Chat | 58.57 | 31.6 | 27 | 31.5 | 38.91 |
115-
| DevOpsPal-7B—Chat | 56.57 | 27.2 | 25.33 | 41.5 | 38.64 |
116-
| Internlm-7B—Base | 48 | 33.2 | 29 | 35 | 37.09 |
117-
| Internlm-7B—Chat | 62.57 | 12.8 | 22.33 | 21 | 32.73 |
103+
| **ModelName** | LogParsing | RootCauseAnalysis | TimeSeriesAnomalyDetection | TimeSeriesClassification | TimeSeriesForecasting | **AVG** |
104+
|:-------------------:|:------------:|:------------------:|:---------------------------:|:-----------------------------------------:|:---------------------------:|:-------:|
105+
| DevOpsPal-14B-Chat  | 66.29        | 80.8               | 23.33                       | 44.5                                      | 56.25                       | 54.44   |
106+
| DevOpsPal-14B-Base  | 60           | 74                 | 25.33                       | 43.5                                      | 52.5                        | 51.13   |
107+
| Qwen-14B-Base | 64.29 | 74.4 | 28 | 48.5 | 40.31 | 50.77 |
108+
| Qwen-7B-Base | 56 | 60.8 | 27.67 | 44 | 57.19 | 49.44 |
109+
| Qwen-14B-Chat | 49.71 | 65.6 | 28.67 | 48 | 42.19 | 46.13 |
110+
| Baichuan2-13B-Base | 56 | 43.2 | 24.33 | 41 | 46.88 | 42.89 |
111+
| Baichuan2-7B-Chat | 58.57 | 31.6 | 27 | 31.5 | 51.88 | 41.83 |
112+
| DevOpsPal-7B-Base   | 52.86        | 44.4               | 28                          | 44.5                                      | 36.25                       | 41.2    |
113+
| Baichuan2-7B-Base | 48.29 | 40.4 | 27 | 42 | 40.94 | 39.86 |
114+
| Qwen-7B-Chat | 54.57 | 52 | 29.67 | 26.5 | 27.19 | 38.73 |
115+
| Baichuan2-13B-Chat | 57.43 | 44.4 | 25 | 25.5 | 30.63 | 37.75 |
116+
| DevOpsPal-7B-Chat   | 56.57        | 27.2               | 25.33                       | 41.5                                      | 33.44                       | 37.46   |
117+
| Internlm-7B-Chat    | 62.57        | 12.8               | 22.33                       | 21                                        | 50.31                       | 36.69   |
118+
| Internlm-7B-Base    | 48           | 33.2               | 29                          | 35                                        | 31.56                       | 35.85   |
118119

119120

120121
## ⏬ Data
@@ -140,7 +141,7 @@ Below are zero-shot and five-shot accuracies from the models that we evaluate in
140141
# {"id": 1, "question": "单元测试应该覆盖以下哪些方面?", "A": "正常路径", "B": "异常路径", "C": "边界值条件","D": 所有以上,"answer": "D", "explanation": ""} ```
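
The record shown above can be handled as follows. This is a minimal sketch, assuming each sample is available as a JSON object with the fields shown (`id`, `question`, `A` to `D`, `answer`, `explanation`) and that every value is quoted; `is_correct` is a hypothetical helper, not part of the repository.

```python
import json

# A well-formed version of the sample record (assumption: all values are quoted strings).
line = (
    '{"id": 1, "question": "单元测试应该覆盖以下哪些方面?", '
    '"A": "正常路径", "B": "异常路径", "C": "边界值条件", "D": "所有以上", '
    '"answer": "D", "explanation": ""}'
)


def is_correct(record, predicted):
    """Return True when the predicted option letter matches the record's answer."""
    return predicted.strip().upper() == record["answer"]


record = json.loads(line)

# Collect the four answer options keyed by their letter.
options = {key: record[key] for key in "ABCD"}

print(options[record["answer"]])  # text of the correct option
print(is_correct(record, "d"))    # True
```

Normalizing the predicted letter before comparison keeps the check tolerant of case and surrounding whitespace in model output.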
141142

142143
#### 👀 Notes
143-
To facilitate usage, we have organized the category name handlers and English/Chinese names corresponding to 53 subcategories. Please refer to [category_mapping.json](resources/categroy_mapping.json) for details. The format is:
144+
To facilitate usage, we have organized the category name handlers and English/Chinese names corresponding to 55 subcategories. Please refer to [category_mapping.json](resources/categroy_mapping.json) for details. The format is:
144145

145146
```
146147
{
@@ -285,7 +286,7 @@ python src/run_eval.py \
285286

286287
## 🧭 TODO
287288
- [x] add AIOps samples.
288-
- [ ] add AIOps scenario **time series forecasting**.
289+
- [x] add AIOps scenario **time series forecasting**.
289290
- [ ] increase in sample size.
290291
- [ ] add samples with the difficulty level set to hard.
291292
- [ ] add the English version of the samples.
