This project repository includes the scripts for replicating our work "Enhancing Code Generation for Low-Resource Languages: No Silver Bullet". In this study, we compare and suggest different approaches for improving the code generation capabilities of state-of-the-art deep learning models. Below is a step-by-step guide for installing the necessary dependencies and reproducing our findings.
This repository contains scripts written and tested in the following languages:
- Python (tested on version 3.10.14)
- R (tested on version 4.4.0)
Evaluating and testing the model generations requires the use of Docker containers. We ran our experiments with the following version:
- Docker (tested on version 24.0.7)
We employed AppleScript files to conduct the Copilot study. Hence, a macOS machine and an active GitHub Copilot subscription are required to reproduce this experiment. You also need the Visual Studio Code editor and the Copilot plugin correctly installed and configured on your machine. We tested this experiment with the following versions:
- Visual Studio Code (tested on version 1.91.1)
- Copilot plugin (tested on version 1.216.0)
Before running the scripts to automate Copilot completions, set up the system permissions as follows:
System Settings > Privacy & Security > Accessibility > Allow Visual Studio Code and Terminal applications
The provided scripts are intended solely for research purposes.
Before running our scripts, we recommend creating a new Python virtual environment and installing the required libraries as follows:
# Create a new virtual environment
python3 -m venv venv
# Activate it
source venv/bin/activate
# Install the required packages
pip3 install -r requirements.txt
We provide an alternative requirements file, namely `deepseek_requirements.txt`, to train the DeepSeek-Coder models. We strongly recommend using a separate environment and installing the provided packages before training these models.
# Create DeepSeek-Coder virtual environment
python3 -m venv deepseek_venv
source deepseek_venv/bin/activate
# Install the required packages (from DeepSeek-Coder official repository)
pip3 install -r deepseek_requirements.txt
We used the MultiPL-E tool as the benchmark for our experiments. Since this project is in continuous evolution, our scripts install the tool from commit 19a2567, which is the version used in our experiments. You can find an extensive guide on how to install and use this tool on the official website. While we relied on Docker for model evaluation, you can also use Podman as a replacement; more details are provided in the official guidelines. We also point out that all the experiments were executed on Linux (inference and fine-tuning) and macOS (Copilot benchmark). Hence, the provided scripts and the containers may not run on different operating systems.
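If you want to install the benchmark manually (our scripts already perform this step for you), you can pin the same commit yourself. The clone URL below assumes the official MultiPL-E GitHub repository; adjust it if the project has moved:
# Clone MultiPL-E and check out the commit used in our experiments
# (URL assumed; our scripts install this version automatically)
git clone https://github.com/nuprl/MultiPL-E.git
cd MultiPL-E
git checkout 19a2567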
We conducted our experiments on a cluster featuring the following GPUs:
- 8 NVIDIA A100 (80 GB)
- 32 NVIDIA A40
- 8 NVIDIA A30
It is possible to replicate our experiments on a different hardware infrastructure by adjusting the inference and training parameters accordingly (see the Training section for more details).
All datasets and results from this study are available in our Zenodo repository.
In particular, the repository is structured as follows:
- `prompt-prefixes`: list of prompts prepended to the code generation instruction for the in-context learning experiments (translation examples, translation rules, and few-shot).
- `predictions`: includes the model evaluation results for each experiment (baseline, in-context evaluation, fine-tuning, and Copilot). Each result is stored in the format compatible with the MultiPL-E tool (Gzip-compressed files). In particular, for each model and experiment we provide `*.json.gz` files, which contain the model generations for a specific HumanEval problem, and `*.results.json.gz` files, which contain the status of the test suite execution (a quick way to inspect these files is shown after this list). Under the `finetuning` directory, we have included the model's predictions on the MultiPL-E benchmark for each fine-tuning epoch. The `best-results` folder, instead, groups the predictions of the best checkpoints, whose performance is described in the paper.
- `results`: provides all the quantitative results from our work. It is divided into the following directories:
  - `accuracies`: includes CSV files with the pass@1 scores discussed in our study, organized into subdirectories by model and experiment. In the `epochs-accuracies` folder, we provide the performance of the 'fine-tuned only' and 'pre-trained and fine-tuned' models for each epoch.
  - `statistical-analyses`: contains the CSV and PDF files resulting from the statistical analyses described in the paper. The PDF with the results of the statistical tests is also available in this repository.
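As a quick sanity check, you can decompress and pretty-print one of the result files directly from the command line. The path below is only a hypothetical example; adjust it to the files you downloaded from Zenodo:
# Inspect one MultiPL-E result file (path is a hypothetical example)
gunzip -c predictions/baseline/HumanEval_0.results.json.gz | python3 -m json.tool | head -n 40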
At the time of writing, the MultiPL-T datasets can only be accessed by accepting the usage terms on the official HuggingFace repository. Due to this policy, we do not include the authors' datasets in the replication package, but we provide the scripts necessary to use them for our use cases (see the Datasets generation section for more information).
This repository is organized into progressively numbered folders, each representing a different experiment. We recommend executing the scripts in the provided order. In addition, each directory contains bash files to help replicate our experiments correctly; again, the filenames indicate the execution order of the scripts. Below, we describe some key experiments contained in this repository.
Folders `1-base-benchmark`, `2-context-benchmark`, and `4-finetuned-benchmark` provide the scripts for evaluating the base models and their fine-tuned versions. Below is an example of how to run the evaluation scripts:
# You must launch the script with the following format:
# bash <experiment_script> <language_id> <model_path>
# Experiment scripts: one of the inference scripts provided in the mentioned folders
# Language id: the index of the language in the LANGUAGES array; it selects the language on which the model is evaluated
# Model path: the model to evaluate. It can either be a model on HuggingFace or a checkpoint locally stored on your machine
# E.g., for baseline evaluation with DeepSeek-Coder 1.3B Instruct on the Julia benchmark:
bash 1_run_benchmark.sh 1 deepseek-ai/deepseek-coder-1.3b-instruct
Before launching the scripts in `4-finetuned-benchmark`, update the `CHECKPOINTS` list with the names of the checkpoints to evaluate from the fine-tuning stage (folder `3-pretrain-finetune`).
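As an illustration only, the update may look like the snippet below; the variable layout and the checkpoint paths are hypothetical and must be adapted to the actual scripts and to the checkpoints produced in `3-pretrain-finetune`:
# Hypothetical example: point the CHECKPOINTS list to your fine-tuned checkpoints
CHECKPOINTS=(
  "../3-pretrain-finetune/checkpoints/codellama-7b-racket/checkpoint-epoch-5"
  "../3-pretrain-finetune/checkpoints/codellama-7b-r/checkpoint-epoch-5"
)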
The evaluation of these models relies on the MultiPL-E benchmark tool, which is automatically installed by our scripts (see the Software Dependencies section). Before evaluating the model generations, verify that the Docker process is running.
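For instance, a quick way to verify that the Docker daemon is reachable:
# Check that the Docker daemon is up before running the evaluation
docker info > /dev/null 2>&1 && echo "Docker is running" || echo "Docker is NOT running"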
The file `1_prepare_datasets.sh` contains the scripts used to generate the pre-training and fine-tuning datasets for this study. As mentioned above, we start from the MultiPL-T datasets and adapt them for our use cases. Below are the steps required to re-create the datasets used in our experiments:
- Go to the MultiPL-T HuggingFace repository and accept their usage terms. This grants access to their datasets.
- Run the `1_prepare_datasets.sh` script from the terminal. Since access to the MultiPL-T datasets is restricted, you must provide a valid HuggingFace token and run the script as follows:
bash 1_prepare_datasets.sh <HF_ACCESS_TOKEN>
As a result, it will generate the `datasets` folder containing the MultiPL-T Python functions and their translations into the target low-resource languages (`translated` directory). In addition, it will generate:
- the `pretraining` directory, which contains the datasets for the pre-training objective experiment. Each JSON element contains the "instruction" and "output" keys, which are provided as input during model training;
- the `finetuning` folder, containing the datasets for fine-tuning models on the low-resource languages (R and Racket). The JSON items follow the same schema seen for the translation pre-training objective.
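For illustration, each training record follows a shape like the one below. The values are hypothetical placeholders; only the "instruction" and "output" keys reflect the actual schema:
// Hypothetical example of a fine-tuning record (values are placeholders)
{
  "instruction": "Write an R function that returns the sum of the even numbers in a vector.",
  "output": "sum_even <- function(xs) {\n  sum(xs[xs %% 2 == 0])\n}"
}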
To train the CodeLlama and DeepSeek-Coder models, we used the scripts in folder `3-pretrain-finetune`. As mentioned in the Package Dependencies section, we recommend running DeepSeek-Coder training in a separate environment, complying with the official model specifications.
To run these scripts, you must specify the model name and the training dataset as shown below:
# You must launch the script with the following format:
# bash <experiment_script> <dataset_id> <model_path>
# Experiment scripts: one of the pre-training / fine-tuning scripts provided in 3-pretrain-finetune folder
# Dataset id: the index of the language in the LANGUAGES array. It automatically fetches the training dataset.
# Model path: the model you want to train. It can either be a model on HuggingFace or a checkpoint locally stored on your machine
# For example:
bash 2_run_only_finetuning_codellama 0 codellama/CodeLlama-7b-Instruct-hf
If you want to run these scripts on different GPUs, you may need to adjust the training batch size or the DeepSpeed configuration to match your hardware infrastructure.
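As a minimal sketch, and assuming a JSON-based DeepSpeed configuration file like the one used by the training scripts, these are the kinds of fields typically tuned for smaller GPUs; the values below are illustrative, not our settings:
// Illustrative excerpt of a DeepSpeed configuration (hypothetical values)
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 16
}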
Folder `5-copilot-study` contains the scripts to reproduce the Copilot experiment. The `1_setup_experiment.sh` script generates 50 separate folders, each containing the docstring and the signature of a HumanEval problem and, in the case of the in-context learning study, a prompt prefix. Next, you can run the `2_run_copilot.sh` script to launch Copilot on the generated files. Before running the scripts, ensure that all of the mentioned dependencies are satisfied.
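The typical execution order, run from the `5-copilot-study` folder, is shown below; check the scripts for any additional arguments before launching them:
# Generate the 50 HumanEval problem folders, then launch Copilot on them
bash 1_setup_experiment.sh
bash 2_run_copilot.sh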