Official research release for the CodeGen2.5 models for Program Synthesis.
Title: CodeGen2.5: Small, but mighty
Authors: Erik Nijkamp*, Hiroaki Hayashi*, Yingbo Zhou, Caiming Xiong (* equal contribution)
Model checkpoints are published on the Hugging Face Hub.
- CodeGen2.5-7B-multi (Apache-2.0)
- CodeGen2.5-7B-mono (Apache-2.0)
- CodeGen2.5-7B-instruct (Research purposes only)
The model cards outline how to use the models for causal and infill sampling; please refer to each model card for details.
The models are pre-trained on StarCoderData, a programming-language dataset developed by the BigCode project.
Requirements:
- transformers>=4.29.2
- tiktoken==0.4.0
Program synthesis in the form of auto-regressive sampling can be performed as follows:
from transformers import AutoTokenizer, AutoModelForCausalLM

# CodeGen2.5 ships a custom tiktoken-based tokenizer, so trust_remote_code is required.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen25-7b-mono", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen25-7b-mono")

# Complete the function signature with up to 128 tokens of generated code.
inputs = tokenizer("def hello_world():", return_tensors="pt")
sample = model.generate(**inputs, max_length=128)
print(tokenizer.decode(sample[0]))
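
Infill sampling follows the sentinel-token format described in the model cards. The sketch below is a minimal illustration assuming the <mask_1>, <sep>, and <eom> conventions documented for CodeGen2; consult the CodeGen2.5 model cards for the exact format.

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen25-7b-mono", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen25-7b-mono")

# Mark the span to be filled with a sentinel token, then prompt the model to generate it.
# The sentinel names used here (<mask_1>, <|endoftext|>, <sep>, <eom>) are taken from the
# CodeGen2 model card and are assumed to carry over to CodeGen2.5.
prefix = "def hello_world():\n    "
suffix = "\n    return name"
prompt = prefix + "<mask_1>" + suffix + "<|endoftext|>" + "<sep>" + "<mask_1>"

inputs = tokenizer(prompt, return_tensors="pt")
sample = model.generate(**inputs, max_length=128)

# Strip the echoed prompt and keep only the infilled span (terminated by <eom>).
completion = tokenizer.decode(sample[0])[len(prompt):]
print(completion.split("<eom>")[0])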
Please cite the CodeGen2 paper:
@article{Nijkamp2023codegen2,
  title={CodeGen2: Lessons for Training LLMs on Programming and Natural Languages},
  author={Nijkamp, Erik and Hayashi, Hiroaki and Xiong, Caiming and Savarese, Silvio and Zhou, Yingbo},
  journal={ICLR},
  year={2023}
}