
Commit d0e9565

Merge pull request #1 from salesforce/main

add two codet5-large checkpoints

2 parents 5b37c34 + afcc8ef

File tree: 4 files changed, +38 -8 lines


README.md

Lines changed: 25 additions & 3 deletions
@@ -11,10 +11,19 @@ This is the official PyTorch implementation for the following EMNLP 2021 paper f
 
 ## Updates
 
+**July 06, 2022**
+
+We release two large-sized CodeT5 checkpoints at Hugging Face: [Salesforce/codet5-large](https://huggingface.co/Salesforce/codet5-large) and [Salesforce/codet5-large-ntp-py](https://huggingface.co/Salesforce/codet5-large-ntp-py), which are introduced in the paper [CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning](https://arxiv.org/pdf/2207.01780.pdf) by Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C.H. Hoi.
+
+* CodeT5-large was pretrained using the Masked Span Prediction (MSP) objective on CodeSearchNet and achieves new SOTA results on several CodeXGLUE benchmarks. The finetuned checkpoints are released [here](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models). See Appendix A.1 of the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
+
+* CodeT5-large-ntp-py was first pretrained using the Masked Span Prediction (MSP) objective on CodeSearchNet and GCPY (the Python split of the [GitHub Code](https://huggingface.co/datasets/codeparrot/github-code) dataset), followed by another 10 epochs on GCPY with the Next Token Prediction (NTP) objective.
+
+CodeT5-large-ntp-py is especially optimized for Python code generation tasks and is employed as the foundation model for our [CodeRL](https://github.com/salesforce/CodeRL), yielding new SOTA results on the APPS Python competition-level program synthesis benchmark. See the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
+
 **Oct 29, 2021**
 
-We
-release [fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models)
+We release [fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models)
 for all the downstream tasks covered in the paper.
 
 **Oct 25, 2021**
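For context (not part of the diff): the two newly released checkpoints can be loaded with the standard Hugging Face `transformers` API. Below is a minimal sketch; the masked-span prompt is only illustrative and may not match the exact pretraining format.

```python
# Minimal sketch (not from this commit): loading a newly released checkpoint
# with the standard Hugging Face transformers API.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-large")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large")

# Illustrative masked-span-style input; the real pretraining format may differ.
text = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(text, return_tensors="pt").input_ids
generated = model.generate(input_ids, max_length=20)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```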
@@ -114,7 +123,7 @@ CodeT5 as a VS Code plugin to provide three capabilities for Apex developers:
 
 ## Citation
 
-If you find this code to be useful for your research, please consider citing.
+If you find this code to be useful for your research, please consider citing:
 
 ```
 @inproceedings{
@@ -124,6 +133,13 @@ If you find this code to be useful for your research, please consider citing.
 booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021},
 year={2021},
 }
+
+@article{coderl2022,
+title={CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
+author={Le, Hung and Wang, Yue and Gotmare, Akhilesh Deepak and Savarese, Silvio and Hoi, Steven C. H.},
+journal={arXiv preprint arXiv:2207.01780},
+year={2022}
+}
 ```
 
 ## License
@@ -216,6 +232,12 @@ Please refer to the argument flags in [configs.py](https://github.com/salesforce
 available options. The saved training curves in `summary_dir` can be visualized using [tensorboard](https://pypi.org/project/tensorboard/).
 Note that we employ one A100 GPU for all fine-tuning experiments.
 
+### How to reproduce the results using the released finetuned checkpoints?
+
+* Remove `--do_train --do_eval --do_eval_bleu` and keep only `--do_test` [here](https://github.com/salesforce/CodeT5/blob/5b37c34f4bbbfcfd972c24a9dd1f45716568ecb5/sh/exp_with_args.sh#L84).
+* Pass the path of your downloaded finetuned checkpoint to load [here](https://github.com/salesforce/CodeT5/blob/5b37c34f4bbbfcfd972c24a9dd1f45716568ecb5/run_gen.py#L366), e.g., `file = "CodeT5/finetuned_models/summarize_python_codet5_base.bin"`.
+* Run the program: `python run_exp.py --model_tag codet5_base --task summarize --sub_task python`
+
 ### How to fine-tune on your own task and dataset?
 If you want to fine-tune on your dataset, you can add your own task and sub_task in `configs.py` ([here](https://github.com/salesforce/CodeT5/blob/d27512d23ba6130e089e571d8c3e399760db1c31/configs.py#L11)) and add your data path and the function to read in `utils.py` ([here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L103) and [here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L149)). The read function can be implemented in `_utils.py` similar to [this one](https://github.com/salesforce/CodeT5/blob/aaf9c4a920c4986abfd54a74f5456b056b6409e0/_utils.py#L213). If your task to add is a generation task, you can simply reuse or customize `run_gen.py`. For understanding tasks, please refer to `run_defect.py` and `run_clone.py`.
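For orientation (not part of the diff): the checkpoint-loading step in the bullets above amounts to restoring a finetuned state dict on top of the base model before test-only evaluation. A minimal sketch, assuming the released `.bin` files are plain PyTorch state dicts and using `transformers` directly rather than the repository's `run_gen.py` plumbing:

```python
# Sketch only, not the repository's exact code: restore a released finetuned
# checkpoint on top of the base CodeT5 model for evaluation.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

file = "CodeT5/finetuned_models/summarize_python_codet5_base.bin"  # path from the step above

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")
model.load_state_dict(torch.load(file, map_location="cpu"))  # restore finetuned weights
model.eval()  # evaluation only, matching --do_test without the training flags
```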

run_clone.py

Lines changed: 3 additions & 0 deletions
@@ -21,6 +21,8 @@
 
 from __future__ import absolute_import
 import os
+import pdb
+
 from models import CloneModel
 import logging
 import argparse
@@ -136,6 +138,7 @@ def main():
     config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path)
     model = model_class.from_pretrained(args.model_name_or_path)
     tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name)
+    model.resize_token_embeddings(32000)
 
     model = CloneModel(model, config, tokenizer, args)
     logger.info("Finish loading model [%s] from %s", get_model_size(model), args.model_name_or_path)
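Side note (not part of the commit): `resize_token_embeddings` is the standard `transformers` call for making the model's embedding matrix match the tokenizer. The commit hard-codes 32000; a more general idiom, sketched below with the base checkpoint as a stand-in, is to resize to the tokenizer length.

```python
# General transformers idiom (a sketch, not the commit's exact code): resize the
# model's input embeddings to the tokenizer's vocabulary size instead of
# hard-coding a number such as 32000.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")
model.resize_token_embeddings(len(tokenizer))  # len(tokenizer) includes added special tokens
```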

sh/exp_with_args.sh

Lines changed: 7 additions & 4 deletions
@@ -1,4 +1,4 @@
-WORKDIR="path_to_your_dir/CodeT5"
+WORKDIR="your_CodeT5_path/CodeT5"
 export PYTHONPATH=$WORKDIR
 
 TASK=${1}
@@ -64,6 +64,10 @@ elif [[ $MODEL_TAG == codet5_base ]]; then
   MODEL_TYPE=codet5
   TOKENIZER=Salesforce/codet5-base
   MODEL_PATH=Salesforce/codet5-base
+elif [[ $MODEL_TAG == codet5_large ]]; then
+  MODEL_TYPE=codet5
+  TOKENIZER=Salesforce/codet5-large
+  MODEL_PATH=Salesforce/codet5-large
 fi
 
 
@@ -78,10 +82,9 @@ else
   RUN_FN=${WORKDIR}/run_gen.py
 fi
 
-
 CUDA_VISIBLE_DEVICES=${GPU} \
-python ${RUN_FN} \
-  --do_train --do_eval --do_eval_bleu --do_test ${MULTI_TASK_AUG} \
+python ${RUN_FN} ${MULTI_TASK_AUG} \
+  --do_train --do_eval --do_eval_bleu --do_test \
   --task ${TASK} --sub_task ${SUB_TASK} --model_type ${MODEL_TYPE} --data_num ${DATA_NUM} \
   --num_train_epochs ${EPOCH} --warmup_steps ${WARMUP} --learning_rate ${LR}e-5 --patience ${PATIENCE} \
   --tokenizer_name=${TOKENIZER} --model_name_or_path=${MODEL_PATH} --data_dir ${WORKDIR}/data \
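Usage note (not part of the diff): with the new `codet5_large` branch here and the matching `choices` update in `sh/run_exp.py` below, the large checkpoint can presumably be selected the same way as the other model tags, e.g. `python run_exp.py --model_tag codet5_large --task summarize --sub_task python` (a hypothetical invocation mirroring the `codet5_base` command shown in the README above).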

sh/run_exp.py

Lines changed: 3 additions & 1 deletion
@@ -76,6 +76,8 @@ def get_args_by_task_model(task, sub_task, model_tag):
         bs = 64
     elif task == 'clone':
         bs = 25
+    elif 'codet5_large' in model_tag:
+        bs = 8
     else:
         bs = 32
     if task == 'translate':
@@ -142,7 +144,7 @@ def get_sub_tasks(task):
 if __name__ == '__main__':
     parser = argparse.ArgumentParser()
     parser.add_argument("--model_tag", type=str, default='codet5_base',
-                        choices=['roberta', 'codebert', 'bart_base', 'codet5_small', 'codet5_base'])
+                        choices=['roberta', 'codebert', 'bart_base', 'codet5_small', 'codet5_base', 'codet5_large'])
     parser.add_argument("--task", type=str, default='summarize', choices=['summarize', 'concode', 'translate',
                                                                           'refine', 'defect', 'clone', 'multi_task'])
     parser.add_argument("--sub_task", type=str, default='ruby')

0 commit comments
