Commit 6a5640a

feat(experiments): reproduce humaneval(+) and mbpp(+) results
1 parent 830ef3b commit 6a5640a

File tree: 3 files changed (+341 -189 lines)

experiments/README.md (318 additions & 10 deletions)
# Reproduce the experiments

> [!WARNING]
> This documentation is still WIP. Raise an [issue](https://github.com/ise-uiuc/magicoder/issues) in case you find any errors.

In this document, we provide the instructions for reproducing the experiments in the paper.

> [!IMPORTANT]
> **General requirements**
>
> Before you start, make sure you have cloned the repository.
> Here are the environment and hardware requirements to 100% reproduce the paper results:
>
> - Two NVIDIA A100 80G GPUs
> - Python 3.10.12
> - [pdm](https://pdm-project.org/latest/) installed and set up for the magicoder repo (e.g., via `pdm install`).
> - Once set up, you should have the same package versions as specified in [pdm.lock](/pdm.lock).
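A minimal setup sketch for the last two requirements (the clone location is up to you; `pdm install` is run from the repo root):

```bash
# Clone the repo and install the locked dependencies with pdm
git clone https://github.com/ise-uiuc/magicoder.git
cd magicoder
pdm install
```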
## Reproduce HumanEval(+) and MBPP(+)

We pack multiple problems into one batch to speed up inference. A different batch size may lead to slightly better or worse results due to floating-point round-off resulting from the underlying [cuBLAS](https://docs.nvidia.com/cuda/cublas/index.html) optimizations. We chose the batch size that maximizes the utilization of 1 or 2 GPUs, depending on the resource availability at the time we ran the evaluation.

Make sure you have set `CUDA_VISIBLE_DEVICES` to the 1 or 2 GPUs you want to use and have `cd`ed to the root directory of the repo. Some larger batch sizes require 2 GPUs.
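For example, a minimal sketch of that environment setup for a 2-GPU run (the GPU indices `0,1` and the repo path are placeholders for your own setup):

```bash
# Use two GPUs and run all commands from the repo root
export CUDA_VISIBLE_DEVICES=0,1
cd /path/to/magicoder
```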
### HumanEval(+)

<details>

<summary>Magicoder-CL-7B</summary>

```bash
MODEL_KEY=codellama/CodeLlama-7b-Python-hf
MODEL=ise-uiuc/Magicoder-CL-7B
DATASET=humaneval
SAVE_PATH=evalplus-$(basename $MODEL)-$DATASET.jsonl

python -m experiments.text2code \
    --model_key $MODEL_KEY \
    --model_name_or_path $MODEL \
    --save_path $SAVE_PATH \
    --dataset $DATASET \
    --temperature 0.0 \
    --top_p 1.0 \
    --max_new_tokens 512 \
    --n_problems_per_batch 16 \
    --n_samples_per_problem 1 \
    --n_batches 1

evalplus.evaluate --dataset $DATASET --samples $SAVE_PATH
# humaneval (base tests)
# pass@1: 0.604
# humaneval+ (base + extra tests)
# pass@1: 0.555
```

</details>

<details>

<summary>Magicoder-S-CL-7B</summary>

```bash
MODEL_KEY=codellama/CodeLlama-7b-Python-hf
MODEL=ise-uiuc/Magicoder-S-CL-7B
DATASET=humaneval
SAVE_PATH=evalplus-$(basename $MODEL)-$DATASET.jsonl

python -m experiments.text2code \
    --model_key $MODEL_KEY \
    --model_name_or_path $MODEL \
    --save_path $SAVE_PATH \
    --dataset $DATASET \
    --temperature 0.0 \
    --top_p 1.0 \
    --max_new_tokens 512 \
    --n_problems_per_batch 16 \
    --n_samples_per_problem 1 \
    --n_batches 1

evalplus.evaluate --dataset $DATASET --samples $SAVE_PATH
# humaneval (base tests)
# pass@1: 0.707
# humaneval+ (base + extra tests)
# pass@1: 0.665
```

</details>

<details>

<summary>Magicoder-DS-6.7B</summary>

```bash
MODEL_KEY=deepseek-ai/deepseek-coder-6.7b-base
MODEL=ise-uiuc/Magicoder-DS-6.7B
DATASET=humaneval
SAVE_PATH=evalplus-$(basename $MODEL)-$DATASET.jsonl

python -m experiments.text2code \
    --model_key $MODEL_KEY \
    --model_name_or_path $MODEL \
    --save_path $SAVE_PATH \
    --dataset $DATASET \
    --temperature 0.0 \
    --top_p 1.0 \
    --max_new_tokens 512 \
    --n_problems_per_batch 28 \
    --n_samples_per_problem 1 \
    --n_batches 1

evalplus.evaluate --dataset $DATASET --samples $SAVE_PATH
# humaneval (base tests)
# pass@1: 0.665
# humaneval+ (base + extra tests)
# pass@1: 0.604
```

</details>

<details>

<summary>Magicoder-S-DS-6.7B</summary>

```bash
MODEL_KEY=deepseek-ai/deepseek-coder-6.7b-base
MODEL=ise-uiuc/Magicoder-S-DS-6.7B
DATASET=humaneval
SAVE_PATH=evalplus-$(basename $MODEL)-$DATASET.jsonl

python -m experiments.text2code \
    --model_key $MODEL_KEY \
    --model_name_or_path $MODEL \
    --save_path $SAVE_PATH \
    --dataset $DATASET \
    --temperature 0.0 \
    --top_p 1.0 \
    --max_new_tokens 512 \
    --n_problems_per_batch 28 \
    --n_samples_per_problem 1 \
    --n_batches 1

evalplus.evaluate --dataset $DATASET --samples $SAVE_PATH
# humaneval (base tests)
# pass@1: 0.768
# humaneval+ (base + extra tests)
# pass@1: 0.707
```

</details>
### MBPP(+)

Make sure you have downloaded the [EvalPlus repo](https://github.com/evalplus/evalplus) and run `export PYTHONPATH=$EVALPLUS_REPO_ROOT`. We will use its `tools.sanitize` module to sanitize the generated samples.
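A minimal sketch of that setup (the clone location and the `EVALPLUS_REPO_ROOT` value are assumptions; point them at wherever you keep EvalPlus):

```bash
# Clone EvalPlus and expose its root on PYTHONPATH so that `python -m tools.sanitize` resolves
git clone https://github.com/evalplus/evalplus.git
export EVALPLUS_REPO_ROOT=$(pwd)/evalplus
export PYTHONPATH=$EVALPLUS_REPO_ROOT
```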
<details>

<summary>Magicoder-CL-7B</summary>

```bash
MODEL_KEY=codellama/CodeLlama-7b-Python-hf
MODEL=ise-uiuc/Magicoder-CL-7B
DATASET=mbpp
SAVE_PATH=evalplus-$(basename $MODEL)-$DATASET.jsonl
SANITIZED_PATH=evalplus-$(basename $MODEL)-$DATASET-sanitized.jsonl

python -m experiments.text2code \
    --model_key $MODEL_KEY \
    --model_name_or_path $MODEL \
    --save_path $SAVE_PATH \
    --dataset $DATASET \
    --temperature 0.0 \
    --top_p 1.0 \
    --max_new_tokens 512 \
    --n_problems_per_batch 24 \
    --n_samples_per_problem 1 \
    --n_batches 1

python -m tools.sanitize --dataset $DATASET --samples $SAVE_PATH
evalplus.evaluate --dataset $DATASET --samples $SANITIZED_PATH
# mbpp (base tests)
# pass@1: 0.642
# mbpp+ (base + extra tests)
# pass@1: 0.526
```

</details>

<details>

<summary>Magicoder-S-CL-7B</summary>

```bash
MODEL_KEY=codellama/CodeLlama-7b-Python-hf
MODEL=ise-uiuc/Magicoder-S-CL-7B
DATASET=mbpp
SAVE_PATH=evalplus-$(basename $MODEL)-$DATASET.jsonl
SANITIZED_PATH=evalplus-$(basename $MODEL)-$DATASET-sanitized.jsonl

python -m experiments.text2code \
    --model_key $MODEL_KEY \
    --model_name_or_path $MODEL \
    --save_path $SAVE_PATH \
    --dataset $DATASET \
    --temperature 0.0 \
    --top_p 1.0 \
    --max_new_tokens 512 \
    --n_problems_per_batch 24 \
    --n_samples_per_problem 1 \
    --n_batches 1

python -m tools.sanitize --dataset $DATASET --samples $SAVE_PATH
evalplus.evaluate --dataset $DATASET --samples $SANITIZED_PATH
# mbpp (base tests)
# pass@1: 0.684
# mbpp+ (base + extra tests)
# pass@1: 0.566
```

</details>

<details>

<summary>Magicoder-DS-6.7B</summary>

```bash
MODEL_KEY=deepseek-ai/deepseek-coder-6.7b-base
MODEL=ise-uiuc/Magicoder-DS-6.7B
DATASET=mbpp
SAVE_PATH=evalplus-$(basename $MODEL)-$DATASET.jsonl
SANITIZED_PATH=evalplus-$(basename $MODEL)-$DATASET-sanitized.jsonl

python -m experiments.text2code \
    --model_key $MODEL_KEY \
    --model_name_or_path $MODEL \
    --save_path $SAVE_PATH \
    --dataset $DATASET \
    --temperature 0.0 \
    --top_p 1.0 \
    --max_new_tokens 512 \
    --n_problems_per_batch 24 \
    --n_samples_per_problem 1 \
    --n_batches 1

python -m tools.sanitize --dataset $DATASET --samples $SAVE_PATH
evalplus.evaluate --dataset $DATASET --samples $SANITIZED_PATH
# mbpp (base tests)
# pass@1: 0.754
# mbpp+ (base + extra tests)
# pass@1: 0.619
```

</details>

<details>

<summary>Magicoder-S-DS-6.7B</summary>

```bash
MODEL_KEY=deepseek-ai/deepseek-coder-6.7b-base
MODEL=ise-uiuc/Magicoder-S-DS-6.7B
DATASET=mbpp
SAVE_PATH=evalplus-$(basename $MODEL)-$DATASET.jsonl
SANITIZED_PATH=evalplus-$(basename $MODEL)-$DATASET-sanitized.jsonl

python -m experiments.text2code \
    --model_key $MODEL_KEY \
    --model_name_or_path $MODEL \
    --save_path $SAVE_PATH \
    --dataset $DATASET \
    --temperature 0.0 \
    --top_p 1.0 \
    --max_new_tokens 512 \
    --n_problems_per_batch 24 \
    --n_samples_per_problem 1 \
    --n_batches 1

python -m tools.sanitize --dataset $DATASET --samples $SAVE_PATH
evalplus.evaluate --dataset $DATASET --samples $SANITIZED_PATH
# mbpp (base tests)
# pass@1: 0.757
# mbpp+ (base + extra tests)
# pass@1: 0.644
```

</details>
## Reproduce MultiPL-E

TBD

## Reproduce DS-1000

TBD
## Reproduce data analysis

Here are some descriptions for the `experiments/data_embedding` directory:

- `length.py`: plots the token length distribution of the problems and solutions in the data file.
- `cosine_similarity.py`: computes the cosine similarity between the TF-IDF embeddings of the data file and HumanEval.
- `instructor_embedding.py`: classifies the data in the data file and calculates its percentage composition based on the instruction you provide.
1. To plot the length distribution of either the problems or the solutions in the data file, run:

   ```bash
   python experiments/data_embedding/length.py
   ```

   The result will be shown in `Length.png`.

2. To measure the similarity between the data file and HumanEval, run:

   ```bash
   python experiments/data_embedding/cosine_similarity.py
   ```

   The result will be shown in `HE_similarity_comparison.png`.

3. To study the categories of the data in the data file, there are two different modes:
   - In the **instruction** mode, the model generates embeddings according to the instruction and the number of clusters you give, and then clusters the data based on these embeddings.

     You can change the clustering criteria by adjusting `--instruction`.

     For example, to cluster the data file according to the programming language used, run:

     ```bash
     python experiments/data_embedding/instructor_embedding.py \
         --data_files data-clean-decontaminated.jsonl \
         --instruction "Represent the programming language used" \
         --n_clusters 2
     ```

     The clustering result will be shown in `Clusters.png`.

   - In the **query** mode, the model generates embeddings according to the instructions and queries you give, then classifies the data by computing the cosine similarity between the embeddings of the data file and the embeddings of the queries.

     You can change the classification criteria by adjusting `--query_instruction` and `--queries`.

     For example, to classify the data file according to the topic of the content, run:

     ```bash
     python experiments/data_embedding/instructor_embedding.py \
         --data_files data-clean-decontaminated.jsonl \
         --query_instruction "Represent the comment for retrieving the corresponding code" \
         --queries "Algorithmic and Data Structure Problems" "Mathematical and Computational Problems" "Database and SQL Problems" "System Design and Architecture Problems" "Security and Cryptography Problems" "Performance Optimization Problems" "Web Problems" "Domain Specific Problems" "User Interface and Application Design Problems" "Data Science and Machine Learning Problems"
     ```

     The classification result will be shown in `Pie_Chart.png`.

   - You can find more information about how to generate data embeddings by using specific instructions and queries [here](https://arxiv.org/pdf/2212.09741.pdf).

## Limitations

- In the evaluation of HumanEval(+) and MBPP(+), we did not consider the influence of randomness caused by the batch size choice. A different batch size can result in slightly better or worse results due to the underlying [cuBLAS](https://docs.nvidia.com/cuda/cublas/index.html) optimizations.
- We primarily presented results from existing studies (e.g., the [EvalPlus Leaderboard](https://evalplus.github.io)) and did not evaluate how varying prompts might impact the performance of Magicoder or other models.

In the near future, we will continue to improve Magicoder and provide more detailed and robust evaluations.
