
Commit d0e9565

Merge pull request #1 from salesforce/main

add two codet5-large checkpoints

2 parents 5b37c34 + afcc8ef

File tree: 4 files changed, +38 -8 lines


README.md

Lines changed: 25 additions & 3 deletions
@@ -11,10 +11,19 @@ This is the official PyTorch implementation for the following EMNLP 2021 paper f
 
 ## Updates
 
+**July 06, 2022**
+
+We release two large-sized CodeT5 checkpoints at Hugging Face: [Salesforce/codet5-large](https://huggingface.co/Salesforce/codet5-large) and [Salesforce/codet5-large-ntp-py](https://huggingface.co/Salesforce/codet5-large-ntp-py), which are introduced in the paper [CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning](https://arxiv.org/pdf/2207.01780.pdf) by Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C.H. Hoi.
+
+* CodeT5-large was pretrained using the Masked Span Prediction (MSP) objective on CodeSearchNet and achieves new SOTA results on several CodeXGLUE benchmarks. The finetuned checkpoints are released [here](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models). See Appendix A.1 of the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
+
+* CodeT5-large-ntp-py was first pretrained using the Masked Span Prediction (MSP) objective on CodeSearchNet and GCPY (the Python split of the [GitHub Code](https://huggingface.co/datasets/codeparrot/github-code) dataset), followed by another 10 epochs on GCPY with the Next Token Prediction (NTP) objective.
+
+CodeT5-large-ntp-py is especially optimized for Python code generation tasks and is employed as the foundation model for our [CodeRL](https://github.com/salesforce/CodeRL), yielding new SOTA results on the APPS Python competition-level program synthesis benchmark. See the [paper](https://arxiv.org/pdf/2207.01780.pdf) for more details.
+
 **Oct 29, 2021**
 
-We
-release [fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models)
+We release [fine-tuned checkpoints](https://console.cloud.google.com/storage/browser/sfr-codet5-data-research/finetuned_models)
 for all the downstream tasks covered in the paper.
 
 **Oct 25, 2021**
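For context (not part of the diff): the two newly released checkpoints can be loaded with the standard Hugging Face `transformers` API. Below is a minimal sketch; the masked-span prompt is only illustrative and may not match the exact pretraining format.

```python
# Minimal sketch (not from this commit): loading a newly released checkpoint
# with the standard Hugging Face transformers API.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-large")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large")

# Illustrative masked-span-style input; the real pretraining format may differ.
text = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(text, return_tensors="pt").input_ids
generated = model.generate(input_ids, max_length=20)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```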
@@ -114,7 +123,7 @@ CodeT5 as a VS Code plugin to provide three capabilities for Apex developers:
 
 ## Citation
 
-If you find this code to be useful for your research, please consider citing.
+If you find this code to be useful for your research, please consider citing:
 
 ```
 @inproceedings{
@@ -124,6 +133,13 @@ If you find this code to be useful for your research, please consider citing.
 booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021},
 year={2021},
 }
+
+@article{coderl2022,
+title={CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
+author={Le, Hung and Wang, Yue and Gotmare, Akhilesh Deepak and Savarese, Silvio and Hoi, Steven C. H.},
+journal={arXiv preprint arXiv:2207.01780},
+year={2022}
+}
 ```
 
 ## License
@@ -216,6 +232,12 @@ Please refer to the argument flags in [configs.py](https://github.com/salesforce
 available options. The saved training curves in `summary_dir` can be visualized using [tensorboard](https://pypi.org/project/tensorboard/).
 Note that we employ one A100 GPU for all fine-tuning experiments.
 
+### How to reproduce the results using the released finetuned checkpoints?
+
+* Remove `--do_train --do_eval --do_eval_bleu` and keep only `--do_test` [here](https://github.com/salesforce/CodeT5/blob/5b37c34f4bbbfcfd972c24a9dd1f45716568ecb5/sh/exp_with_args.sh#L84).
+* Pass the path of your downloaded finetuned checkpoint to load [here](https://github.com/salesforce/CodeT5/blob/5b37c34f4bbbfcfd972c24a9dd1f45716568ecb5/run_gen.py#L366), e.g., `file = "CodeT5/finetuned_models/summarize_python_codet5_base.bin"`.
+* Run the program: `python run_exp.py --model_tag codet5_base --task summarize --sub_task python`
+
 ### How to fine-tune on your own task and dataset?
 If you want to fine-tune on your dataset, you can add your own task and sub_task in `configs.py` ([here](https://github.com/salesforce/CodeT5/blob/d27512d23ba6130e089e571d8c3e399760db1c31/configs.py#L11)) and add your data path and the function to read in `utils.py` ([here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L103) and [here](https://github.com/salesforce/CodeT5/blob/5bb41e21b07fee73f310476a91ded00e385290d7/utils.py#L149)). The read function can be implemented in `_utils.py` similar to [this one](https://github.com/salesforce/CodeT5/blob/aaf9c4a920c4986abfd54a74f5456b056b6409e0/_utils.py#L213). If your task to add is a generation task, you can simply reuse or customize `run_gen.py`. For understanding tasks, please refer to `run_defect.py` and `run_clone.py`.
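For orientation (not part of the diff): the checkpoint-loading step in the bullets above amounts to restoring a finetuned state dict on top of the base model before test-only evaluation. A minimal sketch, assuming the released `.bin` files are plain PyTorch state dicts and using `transformers` directly rather than the repository's `run_gen.py` plumbing:

```python
# Sketch only, not the repository's exact code: restore a released finetuned
# checkpoint on top of the base CodeT5 model for evaluation.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

file = "CodeT5/finetuned_models/summarize_python_codet5_base.bin"  # path from the step above

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")
model.load_state_dict(torch.load(file, map_location="cpu"))  # restore finetuned weights
model.eval()  # evaluation only, matching --do_test without the training flags
```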

run_clone.py

Lines changed: 3 additions & 0 deletions
@@ -21,6 +21,8 @@
 
 from __future__ import absolute_import
 import os
+import pdb
+
 from models import CloneModel
 import logging
 import argparse
@@ -136,6 +138,7 @@ def main():
     config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path)
     model = model_class.from_pretrained(args.model_name_or_path)
     tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name)
+    model.resize_token_embeddings(32000)
 
     model = CloneModel(model, config, tokenizer, args)
     logger.info("Finish loading model [%s] from %s", get_model_size(model), args.model_name_or_path)
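Side note (not part of the commit): `resize_token_embeddings` is the standard `transformers` call for making the model's embedding matrix match the tokenizer. The commit hard-codes 32000; a more general idiom, sketched below with the base checkpoint as a stand-in, is to resize to the tokenizer length.

```python
# General transformers idiom (a sketch, not the commit's exact code): resize the
# model's input embeddings to the tokenizer's vocabulary size instead of
# hard-coding a number such as 32000.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")
model.resize_token_embeddings(len(tokenizer))  # len(tokenizer) includes added special tokens
```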

sh/exp_with_args.sh

Lines changed: 7 additions & 4 deletions
@@ -1,4 +1,4 @@
-WORKDIR="path_to_your_dir/CodeT5"
+WORKDIR="your_CodeT5_path/CodeT5"
 export PYTHONPATH=$WORKDIR
 
 TASK=${1}
@@ -64,6 +64,10 @@ elif [[ $MODEL_TAG == codet5_base ]]; then
   MODEL_TYPE=codet5
   TOKENIZER=Salesforce/codet5-base
   MODEL_PATH=Salesforce/codet5-base
+elif [[ $MODEL_TAG == codet5_large ]]; then
+  MODEL_TYPE=codet5
+  TOKENIZER=Salesforce/codet5-large
+  MODEL_PATH=Salesforce/codet5-large
 fi
 
 
@@ -78,10 +82,9 @@ else
   RUN_FN=${WORKDIR}/run_gen.py
 fi
 
-
 CUDA_VISIBLE_DEVICES=${GPU} \
-python ${RUN_FN} \
-  --do_train --do_eval --do_eval_bleu --do_test ${MULTI_TASK_AUG} \
+python ${RUN_FN} ${MULTI_TASK_AUG} \
+  --do_train --do_eval --do_eval_bleu --do_test \
   --task ${TASK} --sub_task ${SUB_TASK} --model_type ${MODEL_TYPE} --data_num ${DATA_NUM} \
   --num_train_epochs ${EPOCH} --warmup_steps ${WARMUP} --learning_rate ${LR}e-5 --patience ${PATIENCE} \
   --tokenizer_name=${TOKENIZER} --model_name_or_path=${MODEL_PATH} --data_dir ${WORKDIR}/data \
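Usage note (not part of the diff): with the new `codet5_large` branch here and the matching `choices` update in `sh/run_exp.py` below, the large checkpoint can presumably be selected the same way as the other model tags, e.g. `python run_exp.py --model_tag codet5_large --task summarize --sub_task python` (a hypothetical invocation mirroring the `codet5_base` command shown in the README above).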

sh/run_exp.py

Lines changed: 3 additions & 1 deletion
@@ -76,6 +76,8 @@ def get_args_by_task_model(task, sub_task, model_tag):
         bs = 64
     elif task == 'clone':
         bs = 25
+    elif 'codet5_large' in model_tag:
+        bs = 8
     else:
         bs = 32
     if task == 'translate':
@@ -142,7 +144,7 @@ def get_sub_tasks(task):
 if __name__ == '__main__':
     parser = argparse.ArgumentParser()
     parser.add_argument("--model_tag", type=str, default='codet5_base',
-                        choices=['roberta', 'codebert', 'bart_base', 'codet5_small', 'codet5_base'])
+                        choices=['roberta', 'codebert', 'bart_base', 'codet5_small', 'codet5_base', 'codet5_large'])
     parser.add_argument("--task", type=str, default='summarize', choices=['summarize', 'concode', 'translate',
                                                                           'refine', 'defect', 'clone', 'multi_task'])
     parser.add_argument("--sub_task", type=str, default='ruby')

0 commit comments
