[ColossalEval] Support GSM, Data Leakage Evaluation and Tensor Parallel (#5169)
* Support GSM, Data Leakage Evaluation and Tensor Parallel
* remove redundant code and update inference.py in examples/gpt_evaluation
---------
Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>
Changed file: applications/ColossalEval/README.md (41 additions, 5 deletions)
@@ -37,7 +37,7 @@
- [Citations](#citations)

## Overview

[ColossalEval](https://github.com/hpcaitech/ColossalAI/tree/main/applications/ColossalEval) is a project that provides a unified pipeline for evaluating language models on public datasets or your own dataset, using both classic metrics and GPT-assisted evaluation. We currently support AGIEval, CEval, CMMLU, CValues, GAOKAO-Bench, GSM8K, LongBench, MMLU, MtBench and SafetyBench. More details can be found in the following sections.

## Leaderboard
@@ -101,7 +101,7 @@ The evaluation process involves 2 steps which are `inference` and `evaluation`.
### Inference

The inference process consists of two parts. We now support tensor parallel inference for large models using [ShardFormer](colossalai/shardformer) in the [example](applications/ColossalEval/examples/dataset_evaluation/inference.py) script.

1. Preprocess and convert the original dataset.
2. Configure your tokenizer and model arguments to perform zero-shot or few-shot prompting.

@@ -193,7 +193,7 @@ In this step, you will configure your tokenizer and model arguments to infer on
A config file consists of two parts.
1. Model config. In model config, you need to specify the model name, model path, model class, tokenizer arguments and model arguments. For the model class, we currently support `HuggingFaceModel`, `HuggingFaceCausalLM`, `ChatGLMModel` and `ChatGLMModel2`. `HuggingFaceModel` is for models that can be loaded with `AutoModel`, and `HuggingFaceCausalLM` is for models that can be loaded with `AutoModelForCausalLM`. `ChatGLMModel` and `ChatGLMModel2` are for ChatGLM and ChatGLM2 models respectively. You can check all model classes in `colossal_eval/models/__init__.py`. If your model requires `trust_remote_code` to be set to true, specify it in the `tokenizer_kwargs` and `model_kwargs` fields.
2. Dataset config. In dataset config, you need to specify the dataset name, path and dataset class. Currently, we support zero-shot on the MMLU, CMMLU, AGIEval, GAOKAO-Bench, GSM8K and LongBench datasets, and few-shot on the MMLU, CMMLU, AGIEval and GSM8K datasets. If you want to enable few-shot prompting, set `few_shot` to true. You can check all dataset classes in `colossal_eval/dataset/__init__.py`.

Once all configs are ready, the program will run inference on all the given datasets for all the given models.

@@ -236,17 +236,20 @@ An example config using model class `HuggingFaceCausalLM` and dataset class `CMM
Currently, we support Hugging Face models. `tokenizer_kwargs` contains the arguments passed to `AutoTokenizer.from_pretrained()`, and `model_kwargs` contains the arguments passed to `AutoModel.from_pretrained()` or `AutoModelForCausalLM.from_pretrained()`. Set `few_shot` to true if you want to enable few-shot prompting for the dataset. Set `debug` to true if you want to verify whether your prompt is correct.
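
As a rough illustration of how these two dictionaries are used (a sketch, not the project's actual loading code; the path and kwargs below are placeholders), they are forwarded to the Hugging Face loaders roughly like this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Values that would come from the config file; the path here is a placeholder.
tokenizer_kwargs = {"trust_remote_code": True}
model_kwargs = {"torch_dtype": "auto", "trust_remote_code": True}

# The kwargs dictionaries are passed straight through to the Hugging Face loaders.
tokenizer = AutoTokenizer.from_pretrained("path/to/model", **tokenizer_kwargs)
model = AutoModelForCausalLM.from_pretrained("path/to/model", **model_kwargs)
```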

> For the GSM8K dataset, you can set the additional dataset-configuration flags `load_train` or `load_reference` to true. During inference, the program will then compute the loss summation over all tokens for each data sample. During evaluation, you can use the metric `loss_over_all_tokens` to calculate the overall loss and use it for data leakage evaluation.
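
As a sketch of how the leakage check can be used (illustrative only; the file names and record layout below are assumptions, not part of ColossalEval), compare the metric computed on the public test questions with the value obtained on a reference set the model should not have seen:

```python
import json

def loss_over_all_tokens(samples):
    """Sum of per-sample loss sums divided by the total token count (the metric's formula)."""
    total_loss = sum(s["loss_sum"] for s in samples)
    total_tokens = sum(s["num_tokens"] for s in samples)
    return total_loss / total_tokens

# Hypothetical inference outputs: each record carries the sample's summed loss and token count.
with open("gsm8k_test_losses.json") as f:
    test_samples = json.load(f)
with open("gsm8k_reference_losses.json") as f:
    reference_samples = json.load(f)

# A markedly lower loss on the public test split than on an unseen reference split
# is a hint that the test data may have leaked into the model's training corpus.
print("test:", loss_over_all_tokens(test_samples))
print("reference:", loss_over_all_tokens(reference_samples))
```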

#### How to Use

An example script is shown below. The `configs/dataset_evaluation/inference.py` script is the same in all provided examples.

```shell
torchrun --nproc_per_node=4 inference.py \
    --config "path to config file" \
    --load_dataset \
    --tp_size 2 \
    --inference_save_path "path to save inference results"
```

You should specify the path to the config file in `config`. You can run the script without `load_dataset` if you have already saved the converted dataset; otherwise, set it to first load the original dataset and save the converted one. You should specify the path for saving inference results in `inference_save_path`. If you want to use tensor parallel inference, specify the tensor parallel size in `--tp_size`, and the process will automatically calculate the data parallel size.
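
For example, launching with `--nproc_per_node=4` and `--tp_size 2` as above should yield a data parallel size of 4 / 2 = 2, i.e. two tensor-parallel model replicas of two processes each (assuming the data parallel size is derived as the world size divided by the tensor parallel size).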
### Evaluation
@@ -371,11 +374,13 @@ To make it more easier to set the config, you only need to specify all metrics y
- `classification_score`: Calculate classification score between prediction and reference. It determines whether the output (a class) is equal to the reference. It is used in LongBench.
- `code_sim_score`: Calculate similarity score between prediction and reference. It is used in LongBench.
- `count_score`: Calculate count score between prediction and reference. It determines whether the output (the number of given passages) is equal to the reference. It is used in LongBench.
- `gsm_accuracy`: Calculate the accuracy score between prediction and reference. It is used in GSM8K.
- `perplexity`: Calculate perplexity. The formula is $ perplexity = \frac{1}{n} \sum_i e^{loss_i} $ where $n$ is the number of samples and $ loss_i $ is the average loss for sample $ i $. It can be used on all datasets.
- `ppl_score`: Calculate perplexity score. The formula is $ ppl\_score = \frac{1}{n} \sum_i e^{-loss_i} $ where $n$ is the number of samples and $ loss_i $ is the average loss for sample $ i $. It can be used on all datasets.
- `ppl_score_over_choices`: Calculate perplexity score over choices. The formula is $ ppl\_score\_over\_choices = \frac{1}{n} \sum_i e^{-loss\_over\_choices_i} $ where $n$ is the number of samples and $ loss\_over\_choices_i $ is the loss on the first predicted token for sample $ i $. It can be used on all datasets that contain single-choice questions.
- `per_byte_perplexity`: Calculate per-byte perplexity. The formula is $ \frac{1}{n} \sum_i e^{\frac{loss_i}{byte_i}} $ where $n$ is the number of samples, $ loss_i $ is the total loss for sample $ i $ and $ byte_i $ is the number of bytes sample $ i $ occupies. It can be used on all datasets.
- `per_byte_ppl_score`: Calculate per-byte perplexity score. The formula is $ \frac{1}{n} \sum_i e^{-\frac{loss_i}{byte_i}} $ where $n$ is the number of samples, $ loss_i $ is the total loss for sample $ i $ and $ byte_i $ is the number of bytes sample $ i $ occupies. It can be used on all datasets.
- `loss_over_all_tokens`: Calculate the loss over all tokens. The formula is $ loss\_over\_all\_tokens = \frac{1}{n} \sum_i loss_i $ where $n$ is the total number of tokens in the dataset and $ loss_i $ is the loss summation for sample $ i $ over all tokens, so $ \sum_i loss_i $ is the loss summation over all samples. It can be used on all datasets.
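
As a quick reference, a minimal sketch of how some of the loss-based metrics above follow from their formulas (for illustration only; the variable names and values are placeholders, not the project's actual code):

```python
import math

# Per-sample statistics from inference (placeholder values):
# avg_loss   - average loss per token for the sample
# total_loss - loss summed over the sample's tokens
# num_bytes  - number of bytes the sample occupies
samples = [
    {"avg_loss": 2.1, "total_loss": 42.0, "num_bytes": 80},
    {"avg_loss": 1.8, "total_loss": 27.0, "num_bytes": 60},
]
n = len(samples)

perplexity = sum(math.exp(s["avg_loss"]) for s in samples) / n
ppl_score = sum(math.exp(-s["avg_loss"]) for s in samples) / n
per_byte_ppl_score = sum(math.exp(-s["total_loss"] / s["num_bytes"]) for s in samples) / n

print(perplexity, ppl_score, per_byte_ppl_score)
```
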
We use `combined_single_choice_accuracy` and `first_token_logit` in the leaderboard.
@@ -520,6 +525,15 @@ year={2023}
primaryClass={cs.CL}
}

@misc{xu2023cvalues,
title={CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility},
author={Guohai Xu and Jiayi Liu and Ming Yan and Haotian Xu and Jinghui Si and Zhuoran Zhou and Peng Yi and Xing Gao and Jitao Sang and Rong Zhang and Ji Zhang and Chao Peng and Fei Huang and Jingren Zhou},
year={2023},
eprint={2307.09705},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

@inproceedings{Zhang2023EvaluatingTP,
title={Evaluating the Performance of Large Language Models on GAOKAO Benchmark},
author={Xiaotian Zhang and Chunyang Li and Yi Zong and Zhengyu Ying and Liang He and Xipeng Qiu},
@@ -542,6 +556,20 @@ year={2023}
year={2021}
}

@article{zhang2023safetybench,
title={SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions},
author={Zhexin Zhang and Leqi Lei and Lindong Wu and Rui Sun and Yongkang Huang and Chong Long and Xiao Liu and Xuanyu Lei and Jie Tang and Minlie Huang},
journal={arXiv preprint arXiv:2309.07045},
year={2023}
}

@article{cobbe2021training,
title={Training verifiers to solve math word problems},
author={Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and others},
journal={arXiv preprint arXiv:2110.14168},
year={2021}
}

@article{hendrycks2021ethics,
title={Aligning AI With Shared Human Values},
author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
@@ -558,4 +586,12 @@ year={2023}
primaryClass={cs.CL}
}

@misc{wei2023skywork,
title={Skywork: A More Open Bilingual Foundation Model},
author={Tianwen Wei and Liang Zhao and Lichang Zhang and Bo Zhu and Lijie Wang and Haihua Yang and Biye Li and Cheng Cheng and Weiwei Lü and Rui Hu and Chenxia Li and Liu Yang and Xilin Luo and Xuejie Wu and Lunan Liu and Wenjun Cheng and Peng Cheng and Jianhao Zhang and Xiaoyu Zhang and Lei Lin and Xiaokun Wang and Yutuan Ma and Chuanhai Dong and Yanqi Sun and Yifu Chen and Yongyi Peng and Xiaojuan Liang and Shuicheng Yan and Han Fang and Yahui Zhou},
year={2023}
}