Commit cefdc32

chengeharrison and Xu Yuanchen authored

[ColossalEval] Support GSM, Data Leakage Evaluation and Tensor Parallel (#5169)

* Support GSM, Data Leakage Evaluation and Tensor Parallel
* Remove redundant code and update inference.py in examples/gpt_evaluation

Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>
1 parent b07a6f4 commit cefdc32

File tree

19 files changed: +578 -100 lines changed


applications/ColossalEval/README.md

Lines changed: 41 additions & 5 deletions
```diff
@@ -37,7 +37,7 @@
 - [Citations](#citations)

 ## Overview
-[ColossalEval](https://github.com/hpcaitech/ColossalAI/tree/main/applications/ColossalEval) is a project which provides a uniform pipeline to help evaluate language models on different public dataset or your own dataset using both classic metrics and the help from GPTs. More details can be found in the following sections.
+[ColossalEval](https://github.com/hpcaitech/ColossalAI/tree/main/applications/ColossalEval) is a project that provides a uniform pipeline for evaluating language models on different public datasets or your own dataset, using both classic metrics and GPT-assisted evaluation. We currently support AGIEval, CEval, CMMLU, CValues, GAOKAO-Bench, GSM8K, LongBench, MMLU, MtBench and SafetyBench. More details can be found in the following sections.

 ## Leaderboard

```
```diff
@@ -101,7 +101,7 @@ The evaluation process involves 2 steps which are `inference` and `evaluation`.

 ### Inference

-The inference process consists of two parts.
+The inference process consists of two parts. We now support tensor parallel inference for large models using [ShardFormer](colossalai/shardformer) in the [example](applications/ColossalEval/examples/dataset_evaluation/inference.py) script.
 1. Preprocess and convert the original dataset.
 2. Config your tokenizer and model arguments to perform zero-shot or few-shot prompting.

```
```diff
@@ -193,7 +193,7 @@ In this step, you will configure your tokenizer and model arguments to infer on

 A config file consists of two parts.
 1. Model config. In model config, you need to specify model name, model path, model class, tokenizer arguments and model arguments. For model class, currently we support `HuggingFaceModel`, `HuggingFaceCausalLM`, `ChatGLMModel` and `ChatGLMModel2`. `HuggingFaceModel` is for models that can be loaded with `AutoModel` and `HuggingFaceCausalLM` is for models that can be loaded with `AutoModelForCausalLM`. `ChatGLMModel` and `ChatGLMModel2` are for ChatGLM and ChatGLM2 models respectively. You can check all model classes in `colossal_eval/models/__init__.py`. If your model should set `trust_remote_code` as true, specify it in the `tokenizer_kwargs` and `model_kwargs` fields.
-2. Dataset config. In dataset config, you need to specify dataset name, path and dataset class. Currently, we support zero-shot on dataset MMLU, CMMLU, AGIEval, GAOKAO-Bench and LongBench and few-shot on dataset MMLU, CMMLU and AGIEval. If you want to enable few shot, set `few_shot` as true. You can check all model classes in `colossal_eval/dataset/__init__.py`.
+2. Dataset config. In dataset config, you need to specify dataset name, path and dataset class. Currently, we support zero-shot prompting on the MMLU, CMMLU, AGIEval, GAOKAO-Bench, GSM8K and LongBench datasets and few-shot prompting on the MMLU, CMMLU, AGIEval and GSM8K datasets. If you want to enable few-shot prompting, set `few_shot` to true. You can check all dataset classes in `colossal_eval/dataset/__init__.py`.

 Once you have all config ready, the program will run inference on all the given datasets on all the given models.

```
````diff
@@ -236,17 +236,20 @@ An example config using model class `HuggingFaceCausalLM` and dataset class `CMM

 Currently, we support Hugging Face models. The `tokenizer_kwargs` is the arguments used in `AutoTokenizer.from_pretrained()`. The `model_kwargs` is the arguments used in `AutoModel.from_pretrained` or `AutoModelForCausalLM.from_pretrained()`. `few_shot` will be set true if you want to enable few-shot prompting for the dataset. `debug` will be set true if you want to verify whether your prompt is right or wrong.

+> For the GSM8K dataset, you can additionally set `load_train` or `load_reference` to true in the dataset configuration. During inference, the program will then calculate the loss summation over all tokens for each data sample. During evaluation, you can use the metric `loss_over_all_tokens` to calculate the overall loss and use it for data leakage evaluation.
+
 #### How to Use
 An example script can be the following. The `configs/dataset_evaluation/inference.py` is the same in all examples provided.

 ```shell
-torchrun --nproc_per_node=1 inference.py \
+torchrun --nproc_per_node=4 inference.py \
     --config "path to config file" \
     --load_dataset \
+    --tp_size 2 \
     --inference_save_path "path to save inference results"
 ```

-You should specify the path to config file in `config`. You can run the script without specifying `load_dataset` if you already save the converted dataset or otherwise set it to first load the original dataset and save the converted dataset. You should specify the path to save inference results in `inference_save_path`.
+You should specify the path to the config file in `config`. You can run the script without specifying `load_dataset` if you have already saved the converted dataset; otherwise set it to first load the original dataset and save the converted version. You should specify the path to save inference results in `inference_save_path`. If you want to use tensor parallel inference, specify the tensor parallel size with `--tp_size`; the data parallel size is then derived automatically from the number of launched processes.

 ### Evaluation

````
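The note on data leakage evaluation is brief, so here is one possible way to read the resulting numbers, in the spirit of the Skywork-style contamination check cited in the Citations section: run inference twice (once with `load_train` and once with `load_reference` enabled), then compare the two `loss_over_all_tokens` values. The file names, result layout and the 0.2 margin below are purely illustrative assumptions, not part of the ColossalEval code.

```python
import json

# Hypothetical result files: assume each evaluation run saved its metrics as JSON
# with a "loss_over_all_tokens" entry. File names and layout are illustrative only.
with open("gsm8k_train_results.json") as f:
    train_loss = json.load(f)["loss_over_all_tokens"]
with open("gsm8k_reference_results.json") as f:
    reference_loss = json.load(f)["loss_over_all_tokens"]

# Skywork-style reading of the numbers: a model that has memorized the benchmark
# shows a markedly lower loss on the official split than on a freshly written
# reference set of the same style. The 0.2 margin is an arbitrary example value.
gap = reference_loss - train_loss
print(f"train: {train_loss:.4f}  reference: {reference_loss:.4f}  gap: {gap:.4f}")
if gap > 0.2:
    print("Warning: possible data leakage (loss on the official split is suspiciously low).")
```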

```diff
@@ -371,11 +374,13 @@ To make it more easier to set the config, you only need to specify all metrics y
 - `classification_score`: Calculate classification score between prediction and reference. It determines whether the ouput(a class) is equal to the reference. It is used in Longbench.
 - `code_sim_score`: Calculate similarity score between prediction and reference. It is used in Longbench.
 - `count_score`: Calculate count score between prediction and reference. It determines whether the ouput(number of given passages) is equal to the reference. It is used in Longbench.
+- `gsm_accuracy`: Calculate the accuracy of the predicted answer against the reference answer. It is used in GSM8K.
 - `perplexity`: Calculate perplexity. The formula is $ perplexity = \frac{1}{n} \sum_i e^{loss_i} $ where $n$ is the number of samples and $ loss_i $ is the average loss for sample $ i $. It can be used in all dataset.
 - `ppl_score`: Calculate perplexity score. The formula is $ ppl\_score = \frac{1}{n} \sum_i e^{-loss_i} $ where $n$ is the number of samples and $ loss_i $ is the average loss for sample $ i $. It can be used in all dataset.
 - `ppl_score_over_choices`: Calculate perplexity score over choices. The formula is $ ppl\_score\_over\_choices= \frac{1}{n} \sum_i e^{-loss\_over\_choices_i} $ where $n$ is the number of samples and $ loss\_over\_choices_i $ is the loss on the first predicted token for sample $ i $. It can be used in all dataset that contains single-choice questions.
 - `per_byte_perplexity`: Calculate per byte perplexity. The formula is $ \frac{1}{n} \sum_i e^{\frac{loss_i}{byte_i}} $ where $n$ is the number of samples, $ loss_i $ is the total loss for sample $ i $ and $ byte_i $ is the number of bytes sample $ i $ occupies. It can be used in all dataset.
 - `per_byte_ppl_score`: Calculate per byte perplexity score. The formula is $ \frac{1}{n} \sum_i e^{-\frac{loss_i}{byte_i}} $ where $n$ is the number of samples, $ loss_i $ is the total loss for sample $ i $ and $ byte_i $ is the number of bytes sample $ i $ occupies. It can be used in all dataset.
+- `loss_over_all_tokens`: Calculate the average loss over all tokens. The formula is $ loss\_over\_all\_tokens = \frac{1}{n} \sum_i loss_i $ where $n$ is the total number of tokens in the dataset and $ loss_i $ is the loss summation over all tokens for sample $ i $. It can be used in all datasets.

 We use `combined_single_choice_accuracy` and `first_token_logit` in the leaderboard.

```
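As a worked illustration of the loss-based formulas above, the sketch below computes `perplexity`, `ppl_score` and `loss_over_all_tokens` from per-sample losses and token counts. It only mirrors the formulas as written here; it is not the project's evaluator code.

```python
import math
from typing import List


def perplexity(avg_losses: List[float]) -> float:
    # perplexity = (1/n) * sum_i exp(loss_i), with loss_i the average loss of sample i
    return sum(math.exp(l) for l in avg_losses) / len(avg_losses)


def ppl_score(avg_losses: List[float]) -> float:
    # ppl_score = (1/n) * sum_i exp(-loss_i)
    return sum(math.exp(-l) for l in avg_losses) / len(avg_losses)


def loss_over_all_tokens(summed_losses: List[float], token_counts: List[int]) -> float:
    # loss_over_all_tokens = (1/n) * sum_i loss_i, where n is the total number of tokens
    # in the dataset and loss_i is the loss summation of sample i over all its tokens
    return sum(summed_losses) / sum(token_counts)


# Toy example with three samples
avg_losses = [1.2, 0.9, 1.5]          # average per-token loss per sample
summed_losses = [120.0, 45.0, 300.0]  # loss summed over all tokens per sample
token_counts = [100, 50, 200]         # number of tokens per sample
print(perplexity(avg_losses), ppl_score(avg_losses), loss_over_all_tokens(summed_losses, token_counts))
```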

```diff
@@ -520,6 +525,15 @@ year={2023}
 primaryClass={cs.CL}
 }

+@misc{xu2023cvalues,
+title={CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility},
+author={Guohai Xu and Jiayi Liu and Ming Yan and Haotian Xu and Jinghui Si and Zhuoran Zhou and Peng Yi and Xing Gao and Jitao Sang and Rong Zhang and Ji Zhang and Chao Peng and Fei Huang and Jingren Zhou},
+year={2023},
+eprint={2307.09705},
+archivePrefix={arXiv},
+primaryClass={cs.CL}
+}
+
 @inproceedings{Zhang2023EvaluatingTP,
 title={Evaluating the Performance of Large Language Models on GAOKAO Benchmark},
 author={Xiaotian Zhang and Chunyang Li and Yi Zong and Zhengyu Ying and Liang He and Xipeng Qiu},
```
```diff
@@ -542,6 +556,20 @@ year={2023}
 year={2021}
 }

+@article{zhang2023safetybench,
+title={SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions},
+author={Zhexin Zhang and Leqi Lei and Lindong Wu and Rui Sun and Yongkang Huang and Chong Long and Xiao Liu and Xuanyu Lei and Jie Tang and Minlie Huang},
+journal={arXiv preprint arXiv:2309.07045},
+year={2023}
+}
+
+@article{cobbe2021training,
+title={Training verifiers to solve math word problems},
+author={Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and others},
+journal={arXiv preprint arXiv:2110.14168},
+year={2021}
+}
+
 @article{hendrycks2021ethics,
 title={Aligning AI With Shared Human Values},
 author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
```
````diff
@@ -558,4 +586,12 @@ year={2023}
 primaryClass={cs.CL}
 }

+@misc{wei2023skywork,
+title={Skywork: A More Open Bilingual Foundation Model},
+author={Tianwen Wei and Liang Zhao and Lichang Zhang and Bo Zhu and Lijie Wang and Haihua Yang and Biye Li and Cheng Cheng and Weiwei Lü and Rui Hu and Chenxia Li and Liu Yang and Xilin Luo and Xuejie Wu and Lunan Liu and Wenjun Cheng and Peng Cheng and Jianhao Zhang and Xiaoyu Zhang and Lei Lin and Xiaokun Wang and Yutuan Ma and Chuanhai Dong and Yanqi Sun and Yifu Chen and Yongyi Peng and Xiaojuan Liang and Shuicheng Yan and Han Fang and Yahui Zhou},
+year={2023},
+eprint={2310.19341},
+archivePrefix={arXiv},
+primaryClass={cs.CL}
+}
 ```
````

applications/ColossalEval/colossal_eval/dataset/__init__.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -5,6 +5,7 @@
 from .colossalai import ColossalDataset
 from .cvalues import CValuesDataset
 from .gaokaobench import GaoKaoBenchDataset
+from .gsm import GSMDataset
 from .longbench import LongBenchDataset
 from .mmlu import MMLUDataset
 from .mtbench import MTBenchDataset
@@ -24,4 +25,5 @@
     "SafetyBenchENDataset",
     "SafetyBenchZHDataset",
     "CValuesDataset",
+    "GSMDataset",
 ]
```

applications/ColossalEval/colossal_eval/dataset/agieval.py

Lines changed: 14 additions & 3 deletions
```diff
@@ -99,11 +99,20 @@ def get_prompt(line: Dict, dataset_name: str, logger: DistributedLogger) -> Dict

 # process few-shot raw_prompts
 def combine_prompt(prompt_path, dataset_name, load_explanation=True, chat_mode=False):
+    demostrations = []
+    demostration_en = "Here are the answers for the problems in the exam."
+    demostration_zh = "以下是考试中各个问题的答案。"
+
+    if dataset_name in english_qa_datasets or dataset_name in english_cloze_datasets:
+        demostrations.append(demostration_en)
+    elif dataset_name in chinese_qa_datasets or dataset_name in chinese_cloze_datasets:
+        demostrations.append(demostration_zh)
+
     skip_passage = False
     if dataset_name == "sat-en-without-passage":
         skip_passage = True
         dataset_name = "sat-en"
-    demostrations = []
+
     # read the prompts by context and explanation
     context_row = [0, 1, 3, 5, 7, 9]
     explanation_row = [0, 2, 4, 6, 8, 10]
@@ -153,7 +162,7 @@ def combine_prompt(prompt_path, dataset_name, load_explanation=True, chat_mode=F
         if chat_mode:
             demostrations.append((question_input,))
         else:
-            demostrations.append(question_input + "\n")
+            demostrations.append(question_input)

     return demostrations

@@ -178,7 +187,9 @@ class AGIEvalDataset(BaseDataset):
     """

     @staticmethod
-    def load(path: str, logger: DistributedLogger, few_shot: bool) -> List[Dict]:
+    def load(
+        path: str, logger: DistributedLogger, few_shot: bool, forward_only: bool, load_train: bool, load_reference: bool
+    ) -> List[Dict]:
         dataset = {"test": {}}

         files = glob.glob(os.path.join(path, "*.jsonl"))
```

applications/ColossalEval/colossal_eval/dataset/base.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -12,8 +12,8 @@ class BaseDataset:
         logger: Logger for the dataset.
     """

-    def __init__(self, path, logger, few_shot):
-        self.dataset = self.load(path, logger, few_shot)
+    def __init__(self, path, logger, few_shot, forward_only=False, load_train=False, load_reference=False):
+        self.dataset = self.load(path, logger, few_shot, forward_only, load_train, load_reference)

     def save(self, save_path):
         """Save the converted dataset"""
```

applications/ColossalEval/colossal_eval/dataset/ceval.py

Lines changed: 6 additions & 4 deletions
```diff
@@ -71,8 +71,8 @@
 }


-def get_few_shot_data(data: List[Dict]):
-    few_shot_data = []
+def get_few_shot_data(data: List[Dict], subject):
+    few_shot_data = [f"以下是中国关于{subject}考试的单项选择题,请选出其中的正确答案。"]
     for i in data:
         few_shot_data.append(i["input"] + i["target"])
     return few_shot_data
@@ -86,7 +86,9 @@ class CEvalDataset(BaseDataset):
     """

     @staticmethod
-    def load(path: str, logger: DistributedLogger, few_shot: bool) -> List[Dict]:
+    def load(
+        path: str, logger: DistributedLogger, few_shot: bool, forward_only: bool, load_train: bool, load_reference: bool
+    ) -> List[Dict]:
         dataset = {"dev": {}, "test": {}}
         for split in ["dev", "test"]:
             files = os.listdir(os.path.join(path, split))
@@ -105,7 +107,7 @@ def load(path: str, logger: DistributedLogger, few_shot: bool) -> List[Dict]:

             if split == "test" and few_shot:
                 dataset[split][subject]["inference_kwargs"]["few_shot_data"] = get_few_shot_data(
-                    dataset["dev"][subject]["data"]
+                    dataset["dev"][subject]["data"], subject
                 )

             with open(file_dir, encoding="utf-8") as f:
```
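To make the effect of this change concrete, the short sketch below shows what the updated `get_few_shot_data` returns: the few-shot data now starts with a subject-specific instruction (roughly, "The following are single-choice questions from a Chinese exam on {subject}; please select the correct answer."), followed by the concatenated question/answer pairs. The dev entries here are made up; only the function behaviour follows the diff above.

```python
from colossal_eval.dataset.ceval import get_few_shot_data  # function shown in the diff above

# Made-up dev-split entries; real C-Eval items use the same "input"/"target" layout.
dev_data = [
    {"input": "1+1=?\nA. 1\nB. 2\nC. 3\nD. 4\n答案:", "target": "B"},
    {"input": "2+2=?\nA. 2\nB. 3\nC. 4\nD. 5\n答案:", "target": "C"},
]

few_shot_data = get_few_shot_data(dev_data, "高等数学")
# Expected result:
# ["以下是中国关于高等数学考试的单项选择题,请选出其中的正确答案。",
#  "1+1=?\nA. 1\nB. 2\nC. 3\nD. 4\n答案:B",
#  "2+2=?\nA. 2\nB. 3\nC. 4\nD. 5\n答案:C"]
print(few_shot_data)
```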

applications/ColossalEval/colossal_eval/dataset/cmmlu.py

Lines changed: 6 additions & 4 deletions
```diff
@@ -86,8 +86,8 @@
 }


-def get_few_shot_data(data: List[Dict]):
-    few_shot_data = []
+def get_few_shot_data(data: List[Dict], subject):
+    few_shot_data = [f"以下是关于{subject}的单项选择题,请直接给出正确答案的选项。"]
     for i in data:
         few_shot_data.append(i["input"] + i["target"])
     return few_shot_data
@@ -101,7 +101,9 @@ class CMMLUDataset(BaseDataset):
     """

     @staticmethod
-    def load(path: str, logger: DistributedLogger, few_shot: bool) -> List[Dict]:
+    def load(
+        path: str, logger: DistributedLogger, few_shot: bool, forward_only: bool, load_train: bool, load_reference: bool
+    ) -> List[Dict]:
         dataset = {"dev": {}, "test": {}}
         for split in ["dev", "test"]:
             files = os.listdir(os.path.join(path, split))
@@ -120,7 +122,7 @@ def load(path: str, logger: DistributedLogger, few_shot: bool) -> List[Dict]:

             if split == "test" and few_shot:
                 dataset[split][subject]["inference_kwargs"]["few_shot_data"] = get_few_shot_data(
-                    dataset["dev"][subject]["data"]
+                    dataset["dev"][subject]["data"], subject
                 )

             with open(file_dir, encoding="utf-8") as f:
```

applications/ColossalEval/colossal_eval/dataset/gaokaobench.py

Lines changed: 3 additions & 1 deletion
```diff
@@ -69,7 +69,9 @@ class GaoKaoBenchDataset(BaseDataset):
     """

     @staticmethod
-    def load(path: str, logger: DistributedLogger, few_shot: bool) -> List[Dict]:
+    def load(
+        path: str, logger: DistributedLogger, few_shot: bool, forward_only: bool, load_train: bool, load_reference: bool
+    ) -> List[Dict]:
         dataset = {"test": {}}
         for category in ["Fill-in-the-blank_Questions", "Multiple-choice_Questions", "Open-ended_Questions"]:
             files = os.listdir(os.path.join(path, "data", category))
```
