Fine-tune SantaCoder on code and text generation datasets, for example on new programming languages from The Stack dataset or on a code-to-text dataset like GitHub-Jupyter. SantaCoder is a 1B-parameter model pre-trained on Python, Java, and JavaScript. We suggest fine-tuning on programming languages close to these; otherwise, the model might not converge well.
We provide code to fine-tune the pre-trained SantaCoder model on code/text datasets such as The Stack dataset. Check this repository for fine-tuning models on other code tasks such as code classification.
- You can use this Google Colab by @mrm8488 for the fine-tuning.
- To train on a local machine, you can use the train.py script by following the steps below. It allows you to launch training from the command line on multiple GPUs.
- To begin with, clone the repository locally, install all the required packages, and log in to the Hugging Face Hub and Weights & Biases.
First, you can clone this repo with:
git clone https://github.com/bigcode/santacoder-finetuning.git
cd santacoder-finetuning
Second, install the required packages. They are listed in the requirements.txt file and can be installed with:
pip install -r requirements.txt
Third, make sure you are logged in to the Hugging Face Hub and Weights & Biases:
huggingface-cli login
wandb login
- Next, take a look at the train.py script to get an understanding of how it works. In short, the script does the following:
- Load the given dataset
- Load the model with given hyperparameters
- Pre-process the dataset to input into the model
- Run training
- Run evaluation
- The following examples show how you can launch fine-tuning for The Stack dataset. Here we will run the script on the Shell subset of the dataset (data/shell) for demonstration purposes. Note that:
- Gradient checkpointing is enabled by default and the caching mechanism is disabled to save memory. If you want to disable this behavior, pass the no_gradient_checkpointing argument. Mixed precision is disabled with the no_fp16 flag because of some issues we noticed when using it; you can enable it by removing that argument. However, a better choice is bf16 mixed precision if your hardware supports it (e.g. A100): it is enabled with the bf16 flag and can be more stable in training.
- If the model still doesn't fit in memory, use batch_size 1 and reduce seq_length, for example to 1024.
- If you want to use streaming and avoid downloading the entire dataset, add the streaming flag.
- If you want to train your model with Fill-In-The-Middle (FIM), use a tokenizer that includes FIM tokens, like SantaCoder's, and specify the FIM rate arguments fim_rate and fim_spm_rate (by default they are 0; for SantaCoder we use 0.5 for both).
python train.py \
--model_path="bigcode/santacoder" \
--dataset_name="bigcode/the-stack-dedup" \
--subset="data/shell" \
--data_column "content" \
--split="train" \
--seq_length 2048 \
--max_steps 30000 \
--batch_size 2 \
--gradient_accumulation_steps 8 \
--learning_rate 5e-5 \
--num_warmup_steps 500 \
--eval_freq 3000 \
--save_freq 3000 \
--log_freq 1 \
--num_workers="$(nproc)" \
--no_fp16
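To illustrate what the fim_rate and fim_spm_rate arguments from the notes above control, here is a minimal character-level sketch of a Fill-In-The-Middle transformation. It is not the code in train.py: the function name apply_fim is hypothetical, the sentinel strings are assumed to match SantaCoder's tokenizer (check your own tokenizer), and the SPM layout shown is one common variant. The sketch assumes fim_rate is the probability that a sample is transformed, and fim_spm_rate is the share of transformed samples that use suffix-prefix-middle (SPM) order instead of prefix-suffix-middle (PSM).

```python
import random

# Assumed sentinel strings; verify them against the tokenizer you actually use.
FIM_PREFIX, FIM_MIDDLE, FIM_SUFFIX = "<fim-prefix>", "<fim-middle>", "<fim-suffix>"


def apply_fim(text, fim_rate, fim_spm_rate, rng):
    """Return text unchanged, or rearranged into a FIM training sample."""
    # With probability 1 - fim_rate, keep the sample as ordinary left-to-right text.
    if rng.random() >= fim_rate:
        return text
    # Pick two cut points and split the sample into prefix, middle, and suffix.
    lo, hi = sorted(rng.randint(0, len(text)) for _ in range(2))
    prefix, middle, suffix = text[:lo], text[lo:hi], text[hi:]
    if rng.random() < fim_spm_rate:
        # SPM order: the suffix comes first, then prefix and middle together.
        return f"{FIM_PREFIX}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{prefix}{middle}"
    # PSM order: prefix, then suffix, then the middle the model must produce.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


sample = "def add(a, b):\n    return a + b\n"
print(apply_fim(sample, fim_rate=0.5, fim_spm_rate=0.5, rng=random.Random(0)))
```

With both rates at 0 (the default mentioned above), every sample is returned unchanged, so FIM is effectively off.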
To launch the training on multiple GPUs use the following command (we just add python -m torch.distributed.launch --nproc_per_node number_of_gpus):
python -m torch.distributed.launch \
--nproc_per_node number_of_gpus train.py \
--model_path="bigcode/santacoder" \
--dataset_name="bigcode/the-stack-dedup" \
--subset="data/shell" \
--data_column "content" \
--split="train" \
--seq_length 2048 \
--max_steps 30000 \
--batch_size 2 \
--gradient_accumulation_steps 8 \
--learning_rate 5e-5 \
--num_warmup_steps 500 \
--eval_freq 3000 \
--save_freq 3000 \
--log_freq 1 \
--num_workers="$(nproc)" \
--no_fp16
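As a rough sanity check on the settings in the multi-GPU command above, the sketch below computes how many sequences and tokens go into one optimizer step. It assumes that batch_size is per GPU and that every training sequence is packed to exactly seq_length tokens; the helper name and the 4-GPU figure are illustrative and not taken from train.py.

```python
def tokens_per_optimizer_step(batch_size, gradient_accumulation_steps, num_gpus, seq_length):
    """Return (sequences, tokens) consumed by one optimizer step."""
    # Each GPU processes batch_size sequences per forward pass, and gradients
    # are accumulated over several passes before the weights are updated.
    sequences = batch_size * gradient_accumulation_steps * num_gpus
    return sequences, sequences * seq_length


# Values from the command above, with a hypothetical 4 GPUs.
sequences, tokens = tokens_per_optimizer_step(2, 8, 4, 2048)
print(sequences, tokens)  # 64 131072
```

If you lower batch_size or seq_length to save memory, you can raise gradient_accumulation_steps to keep the number of tokens per step roughly constant.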
Note: The checkpoints saved by this training command will have the use_cache argument set to False in config.json. For fast inference, change it to True, as in this commit, or set it each time you load the model.
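One way to make that change without editing the file by hand is sketched below, using only the Python standard library. The function name enable_use_cache and the checkpoint path in the comment are hypothetical; the only assumption about the checkpoint is that it contains a config.json file, as described above.

```python
import json
from pathlib import Path


def enable_use_cache(checkpoint_dir):
    """Set use_cache to True in <checkpoint_dir>/config.json.

    Returns the previous value (None if the key was absent).
    """
    config_path = Path(checkpoint_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    previous = config.get("use_cache")
    config["use_cache"] = True
    # Rewrite the file, leaving all other settings untouched.
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return previous


# Example call (hypothetical path; replace with your own checkpoint directory):
# enable_use_cache("./checkpoints/final_checkpoint")
```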
If you want to fine-tune on other text datasets, you just need to change the data_column argument to the name of the column containing the code/text you want to fine-tune on.
For example, we fine-tuned the model on the GitHub-Jupyter dataset on 4 A100 GPUs using the following command:
python -m torch.distributed.launch \
--nproc_per_node 4 train.py \
--model_path="bigcode/santacoder" \
--dataset_name="codeparrot/github-jupyter-code-to-text" \
--data_column "content" \
--split="train" \
--seq_length 2048 \
--max_steps 1000 \
--batch_size 2 \
--gradient_accumulation_steps 4 \
--learning_rate 5e-5 \
--num_warmup_steps 100 \
--eval_freq 100 \
--save_freq 100 \
--log_freq 1 \
--num_workers="$(nproc)" \
--no_fp16
The resulting model can be found here with an associated space.
Can I use another model? Yes! You can use other CLM models on the Hub such as GPT2, CodeParrot, CodeGen, InCoder... Just make sure to change the seq_length and eos_token_id arguments.
To upload your trained checkpoint, you have to create a new model repository on the 🤗 model hub, from this page: https://huggingface.co/new
You can also follow the more in-depth instructions here if needed.
Having created your model repository on the hub, you should clone it locally:
git lfs install
git clone https://huggingface.co/username/your-model-name
Then add the following files, which fully define a SantaCoder checkpoint, to the repository:
- tokenizer_config.json
- tokenizer.json
- config.json
- pytorch_model.bin
- modeling files (see below)
Note: As previously stated, the checkpoints saved by this training command (with gradient checkpointing and no caching) will have the use_cache argument set to False in config.json. For fast inference, change it to True, as in this commit.
You can get the tokenizer files by cloning the model repo and copying them to your directory. SantaCoder currently has a custom modeling file and config file on the Hub, but they will be included with the saved checkpoints if you used the transformers branch in requirements.txt.
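Before pushing, it can help to check that the files listed above are actually present in your local clone. The following is a small sketch with a hypothetical helper name; it assumes the custom modeling files are Python modules (*.py) sitting at the top level of the repository and does not assume their exact names.

```python
from pathlib import Path

# File names taken from the list above.
REQUIRED_FILES = [
    "tokenizer_config.json",
    "tokenizer.json",
    "config.json",
    "pytorch_model.bin",
]


def missing_checkpoint_files(repo_dir):
    """Return the names of expected files that are not in repo_dir."""
    repo = Path(repo_dir)
    missing = [name for name in REQUIRED_FILES if not (repo / name).is_file()]
    # Assumption: the custom modeling/config code ships as top-level .py files.
    if not any(repo.glob("*.py")):
        missing.append("modeling files (*.py)")
    return missing


# Example call (hypothetical directory name):
# print(missing_checkpoint_files("your-model-name"))
```

An empty list means all expected files were found.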
Having added the above files, you should run the following to push files to your model repository.
git add . && git commit -m "Add model files" && git push
The next important step is to create the model card. For people to use your fine-tuned model, it is important that they understand:
- What kind of model is it?
- What is your model useful for?
- What data was your model trained on?
- How well does your model perform?
All these questions should be answered in a model card, which is the first thing people see when visiting your model on the Hub under https://huggingface.co/{your_username}/{your_modelname}.
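On the Hub, the model card is the README.md file at the root of your model repository. As a starting point, the sketch below writes a skeleton with one section per question above; the helper name and the TODO placeholders are illustrative, and you should replace the placeholders with real answers before pushing.

```python
from pathlib import Path

# The four questions a model card should answer, as listed above.
QUESTIONS = [
    "What kind of model is it?",
    "What is your model useful for?",
    "What data was your model trained on?",
    "How well does your model perform?",
]


def write_model_card_skeleton(repo_dir, model_name):
    """Write a README.md skeleton into repo_dir and return its path."""
    lines = [f"# {model_name}", ""]
    for question in QUESTIONS:
        lines += [f"## {question}", "", "TODO", ""]
    card_path = Path(repo_dir) / "README.md"
    card_path.write_text("\n".join(lines), encoding="utf-8")
    return card_path


# Example call (hypothetical names):
# write_model_card_skeleton("your-model-name", "santacoder-finetuned-shell")
```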
Don't hesitate to also create a Gradio Demo for your model to showcase its capabilities 🚀. You can find more information on how to do that here.
This is inspired by the Wav2Vec2 fine-tuning week by Hugging Face.