> This documentation is still WIP. Raise an [issue](https://github.com/ise-uiuc/magicoder/issues) if you find any errors.
In this document, we provide the instructions for reproducing the experiments in the paper.
> [!IMPORTANT]
> **General requirements**
>
> Before you start, make sure you have cloned the repository.
> Here are the environment and hardware requirements to reproduce the paper results exactly:
>
> - Two NVIDIA A100 80G GPUs
> - Python 3.10.12
> - [pdm](https://pdm-project.org/latest/) installed and set up for the magicoder repo (e.g., `pdm install`).
> - You should then have the same package versions as specified in [pdm.lock](/pdm.lock).
## Reproduce HumanEval(+) and MBPP(+)
We pack multiple problems into one batch to speed up inference. A different batch size may lead to slightly worse or better results due to floating-point round-off caused by the underlying [cuBLAS](https://docs.nvidia.com/cuda/cublas/index.html) optimizations. We chose the batch size that maximized the utilization of 1 or 2 GPUs, depending on resource availability at the time we ran the evaluation.
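The reason batch size can shift scores at all is that floating-point addition is not associative, so changing how a reduction is batched or ordered can change the low-order bits of the result. A minimal Python illustration of the underlying effect:

```python
# Floating-point addition is not associative: summation order, which can
# change with batch size and kernel selection, shifts the low-order bits.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c   # 0.6000000000000001
right = a + (b + c)  # 0.6
print(left == right)  # False
```

The same effect at GPU scale, accumulated across large matrix multiplications, is what makes results vary slightly with batch size.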
Make sure you have set `CUDA_VISIBLE_DEVICES` to the 1 or 2 GPUs you want to use and `cd`ed to the root directory of the repo. Some larger batch sizes require 2 GPUs.
Also make sure you have downloaded the [EvalPlus repo](https://github.com/evalplus/evalplus) and run `export PYTHONPATH=$EVALPLUS_REPO_ROOT`. We will use its `tools.sanitize` module to sanitize the generated samples.
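Put together, the setup might look like the following sketch (the paths are placeholders; adjust them to wherever you cloned the two repos):

```shell
# Placeholder paths -- adjust to your own checkouts.
export CUDA_VISIBLE_DEVICES=0,1           # the 1 or 2 GPUs to use
export PYTHONPATH=/path/to/evalplus       # EvalPlus repo root, so tools.sanitize is importable
cd /path/to/magicoder                     # run everything from the repo root
```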
The result will be shown in `HE_similarity_comparison.png`.
3. To study the categories of the data file, there are two different modes:
- In the **instruction** mode, the model generates embeddings according to the instruction and the number of clusters you provide, and then forms clusters based on these embeddings.
You can change the clustering criteria by adjusting the `--instruction`.
For example, if you want to cluster the data file according to the programming languages, you can run the command:
```
--instruction "Represent the programming language used" \
--n_clusters 2
```
The clustering result will be shown in `Clusters.png`.
- In the **query** mode, the model generates embeddings according to the instructions and queries you provide, then classifies the samples by computing the cosine similarity between the embeddings of the data file and the embeddings of the queries.
You can change the classification criteria by adjusting the `--query_instruction` and `--queries`.
For example, if you want to classify the data file according to the topic of the content, you can run the command:
```
--query_instruction "Represent the comment for retrieving the corresponding code" \
--queries "Algorithmic and Data Structure Problems" "Mathematical and Computational Problems" "Database and SQL Problems" "System Design and Architecture Problems" "Security and Cryptography Problems" "Performance Optimization Problems" "Web Problems" "Domain Specific Problems" "User Interface and Application Design Problems" "Data Science and Machine Learning Problems"
```
The classification result will be shown in `Pie_Chart.png`.
- You can find more information about how to generate data embeddings using specific instructions and queries [here](https://arxiv.org/pdf/2212.09741.pdf).
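The two modes above can be sketched as follows. This is a minimal illustration, not the repository's actual script: it assumes the embeddings have already been computed (e.g., by an instruction-following embedder), and the function names are hypothetical.

```python
import numpy as np

def cluster_mode(embeddings: np.ndarray, n_clusters: int, n_iters: int = 20) -> np.ndarray:
    """Instruction mode: a tiny k-means over the embeddings."""
    centroids = embeddings[:n_clusters].copy()  # naive init, enough for a sketch
    for _ in range(n_iters):
        # Distance of every sample to every centroid, then nearest assignment.
        dists = np.linalg.norm(embeddings[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_clusters):
            if (labels == k).any():
                centroids[k] = embeddings[labels == k].mean(axis=0)
    return labels

def query_mode(data_embs: np.ndarray, query_embs: np.ndarray) -> np.ndarray:
    """Query mode: assign each sample to the query with the highest cosine
    similarity between its embedding and each query embedding."""
    # Normalize rows so that dot products equal cosine similarities.
    d = data_embs / np.linalg.norm(data_embs, axis=1, keepdims=True)
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    return (d @ q.T).argmax(axis=1)  # (n_data,) index of the best-matching query
```

Counting how many samples land on each query index is what produces the pie chart; the cluster labels drive the scatter plot.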
## Limitations
- In the evaluation of HumanEval(+) and MBPP(+), we did not consider the influence of randomness caused by the choice of batch size. A different batch size can yield slightly better or worse results due to the underlying [cuBLAS](https://docs.nvidia.com/cuda/cublas/index.html) optimizations.
- We primarily presented results from existing studies (e.g., [EvalPlus Leaderboard](https://evalplus.github.io)) and did not evaluate how varying prompts might impact the performance of Magicoder or other models.
In the near future, we will continue to improve Magicoder and provide more detailed and robust evaluations.