A repository to perform self-instruct with a model on HF Hub
This repository is dedicated to Self-instruct, an iterative approach that generates a dataset of instructions by bootstrapping on a model's predictions. For it to work well, the model used has to be powerful. The original work focuses on OpenAI's text-davinci-003 engine, which is one of their most powerful models. Our aim is to give modest, decoder-based models a chance to be used for data generation.
- May 24, 2023: We've built a space that lets you visualize the data generated by self-instruct when the model used is StarCoder💫, the recent SOTA open-source code LLM by Hugging Face 🤗.
- Our approach requires the availability of a good amount of computational resources.
- We will focus on the dataset generation pipeline and the curation rather than the fine-tuning.
- Keep in mind that the quality of the dataset obtained by this method is strongly dependent on the quality of the model that is used.
Self-instruct is an iterative method that helps LMs improve their ability to follow natural language instructions. The idea is to take a seed set of manually written instructions and use them to prompt the model to generate new instructions and their corresponding input-output instances. The method includes a filtering step to ensure the novelty of the generated tasks.
Our implementation is inspired by the original Self-instruct method and recent updates including Stanford's alpaca and Code alpaca. While the last two are almost identical, with the sole difference being the set of seed tasks used, the original work has a different mindset. The authors of self-instruct use a set of seed tasks and prompt the model with some of them to make it generate instructions. Later on, the outputs for the generated instructions are produced separately. Conversely, alpaca is all-in-one in the sense that the model is prompted to generate the instruction as well as the input-output pair at the same time. It uses the following template
### Instruction:
{instruction}
### Input:
{input}
### Output:
{output}
The advantage of this all-in-one template is that it reduces the inference cost of the method, and the quality of the generated instances has not been shown to be significantly impaired. We believe, intuitively, that this prompting approach generates feasible instructions because each instruction must come with a sound input-output pair.
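As a rough illustration of the all-in-one template, the sketch below assembles a few-shot prompt from seed tasks and leaves a trailing "### Instruction:" header for the model to complete. The function name `build_prompt` and the dictionary keys are hypothetical and are not taken from the repository.

```python
def build_prompt(seed_tasks):
    """Concatenate seed tasks in the all-in-one template and open a new slot."""
    blocks = []
    for task in seed_tasks:
        blocks.append(
            "### Instruction:\n" + task["instruction"] + "\n"
            "### Input:\n" + task["input"] + "\n"
            "### Output:\n" + task["output"] + "\n"
        )
    # The trailing header invites the model to write a new instruction,
    # followed by its input and output, in a single generation.
    return "\n".join(blocks) + "\n### Instruction:\n"


# Toy seed tasks, for illustration only.
seed_tasks = [
    {"instruction": "Add two numbers.", "input": "2, 3", "output": "5"},
    {"instruction": "Reverse the string.", "input": "abc", "output": "cba"},
]
prompt = build_prompt(seed_tasks)
print(prompt)
```

Because the instruction, input and output come out of one generation, a single model call yields a complete instance, which is where the saving in inference cost comes from.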
Our approach is focused on code use cases, therefore our modifications are mostly relevant for that framework.
During our tests, we realized that, at least with "small" code models, the trigger words "Input:" and "Output:" tend to make them generate test cases instead. This is a significant problem because, given an instruction, we want a working implementation rather than a potentially buggy test case. To alleviate this issue, we decided to get rid of the "Input:" trigger word. We adopt an instruction-output format.
Using "Instruction:", "Input:" and "Output:" seems to work well for text-davinci-003, but how well does it work for other models? This choice is definitely relevant for small models, as it can have a huge impact on the quality of their generations. Following this intuition, we included in our code the possibility to change the trigger words used during prompting. This makes it possible to adapt the prompt to each model.
How to select and post-process the instructions that are generated by prompting a model? In the original work, the instructions are generated iteratively, and we keep those with a rouge score strictly less than 0.7 with any previously generated instruction. This allows diversity in the dataset, at least in terms of how the instructions are worded. According to our experiments, it is still possible to generate the same problem multiple times with a different formulation each time. We propose to take the curation further with several ideas.
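The novelty filter can be sketched as follows. To stay dependency-free, this example reimplements a simplified ROUGE-L F-measure with whitespace-style tokenization instead of calling the rouge-score package; the choice of ROUGE-L, the tokenizer and the function names are assumptions, not the repository's code.

```python
import re


def tokenize(text):
    """Lowercase and split on non-word characters (a simplification)."""
    return re.findall(r"\w+", text.lower())


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f(reference, candidate):
    """Simplified ROUGE-L F-measure between two strings."""
    ref, cand = tokenize(reference), tokenize(candidate)
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def filter_novel(instructions, threshold=0.7):
    """Keep an instruction only if it scores strictly below the threshold
    against every instruction kept so far."""
    kept = []
    for inst in instructions:
        if all(rouge_l_f(prev, inst) < threshold for prev in kept):
            kept.append(inst)
    return kept


candidates = [
    "Write a function to reverse a string",
    "Write a function to reverse a string.",
    "Sort a list of integers in ascending order",
]
kept = filter_novel(candidates)
print(kept)
```

The second candidate differs from the first only by punctuation, so it is rejected, while the third shares almost no tokens with the first and is kept. A reworded version of the same problem would pass this check, which is the limitation noted above.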
We came up with a strong instruction filtering technique. The idea is very simple: we want to test whether the model is consistent with what it generates. To verify this, we prompt the model with the output of each generated instruction so that it generates the corresponding input, and we check the consistency between that generation and the ground truth.
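A minimal sketch of this consistency check is given below, reading "input" as the instruction that led to the output. The `regenerate` callable stands in for a call to the language model, and the character-level similarity from `difflib` and the 0.6 threshold are illustrative assumptions; none of these names come from the repository.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for any metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def filter_consistent(pairs, regenerate, threshold=0.6):
    """Keep (instruction, output) pairs whose instruction can be recovered
    from the output alone.

    `regenerate` is a hypothetical callable: output -> predicted instruction.
    In practice it would wrap a call to the language model.
    """
    kept = []
    for instruction, output in pairs:
        predicted = regenerate(output)
        if similarity(instruction, predicted) >= threshold:
            kept.append((instruction, output))
    return kept


# Toy stand-in for the model, for illustration only.
fake_model = {
    "def rev(s): return s[::-1]": "Reverse a string",
    "def fact(n): ...": "Sort a list of numbers",
}
pairs = [
    ("Reverse a string", "def rev(s): return s[::-1]"),
    ("Compute the factorial of n", "def fact(n): ..."),
]
consistent = filter_consistent(pairs, fake_model.get)
print(consistent)
```

Only the first pair survives: the instruction recovered from its output matches the original, whereas the second output leads back to an unrelated instruction.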
Another alternative is to post-process the raw dataset by only keeping a set of unique instructions.
We modified the seed tasks to keep only those that are related to code. For that, we combine the tasks from Code Alpaca (code tasks extracted from the original seed tasks + some new tasks probably created by the repo's author) and some leetcode tasks. We have a total of 41 seed tasks.
StarCoder was trained on GitHub code, so it can be used for code generation. More precisely, the model can complete the implementation of a function or infer the following characters in a line of code. This can be done with the 🤗 transformers library.
Here, we present a step-by-step recipe that anybody can use to apply our self-instruct method to their preferred LLM in a conda environment.
Create a new conda environment and activate it
conda create -n env
conda activate env
Install the pytorch version compatible with your version of cuda; for example, the following command works with cuda 11.6
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia
Install transformers and accelerate
conda install -c huggingface transformers
pip install git+https://github.com/huggingface/accelerate.git
Do not forget to run accelerate config in the terminal to configure your environment; for more details, see the accelerate documentation.
We will also need rouge-score and sentence-transformers
pip install rouge-score
pip install sentence-transformers
Now we are ready to clone the repository and to start working
git clone https://github.com/ArmelRandy/self-instruct
cd self-instruct
This part is related to the directory instruction_io. We prompt the model with the following template
### Instruction :
{instruction}
### Output :
{output}
For instructions that provide an input (code, in the case of a debugging or translation task), we concatenate the instruction and the input under the keyword "Instruction:"; we then have
### Instruction :
{instruction}{input}
### Output :
{output}
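The sketch below renders one task in this instruction-output template, folding the optional input into the instruction. The helper `format_example` and the newline used to join instruction and input are assumptions; the template itself only specifies {instruction}{input}.

```python
INSTRUCTION_TRIGGER = "### Instruction :"
OUTPUT_TRIGGER = "### Output :"


def format_example(instruction, output, input_text="", sep="\n"):
    """Render one task in the instruction-output template.

    `sep` (how the input is joined to the instruction) is an assumption;
    the template only states that the two are concatenated.
    """
    body = instruction + (sep + input_text if input_text else "")
    return f"{INSTRUCTION_TRIGGER}\n{body}\n{OUTPUT_TRIGGER}\n{output}\n"


# A debugging task: the buggy code is the input, the fixed code the output.
example = format_example(
    "Fix the bug in the following function.",
    "def add(a, b):\n    return a + b",
    input_text="def add(a, b):\n    return a - b",
)
print(example)
```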
It is possible to change the trigger words "Instruction:" and "Output:" into other words, for example "Request:" and "Answer:" respectively. However, the change has to be made directly in the code, as the trigger words are used as constants throughout the code.
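To illustrate why trigger words defined as constants affect the whole pipeline, here is a hedged sketch of a parser that splits raw model text into instruction-output pairs using those constants. The constant names, the exact trigger strings and `parse_generation` are hypothetical and do not reflect the repository's actual code.

```python
import re

# Hypothetical constants; changing them changes both prompting and parsing.
INSTRUCTION_TRIGGER = "### Request :"
OUTPUT_TRIGGER = "### Answer :"


def parse_generation(text):
    """Split raw model text into (instruction, output) pairs."""
    pattern = re.compile(
        re.escape(INSTRUCTION_TRIGGER)
        + r"(.*?)"
        + re.escape(OUTPUT_TRIGGER)
        # An output runs until the next instruction trigger or the end of text.
        + r"(.*?)(?="
        + re.escape(INSTRUCTION_TRIGGER)
        + r"|\Z)",
        flags=re.DOTALL,
    )
    return [(i.strip(), o.strip()) for i, o in pattern.findall(text)]


raw = (
    "### Request :\nReverse a string.\n"
    "### Answer :\ndef rev(s):\n    return s[::-1]\n"
    "### Request :\nAdd two numbers.\n"
    "### Answer :\ndef add(a, b):\n    return a + b\n"
)
parsed = parse_generation(raw)
print(parsed)
```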
cd instruction_io
accelerate launch instruction_output.py \
    --batch_dir "data_io/santacoder_generations/" \
    --seed_tasks_path "data_io/code_tasks.jsonl" \
    --num_instructions_to_generate 10 \
    --model_name_or_path "gpt_bigcode-santacoder" \
    --num_prompt_instructions 8 \
    --request_batch_size 5 \
    --n 2 \
    --max_length 2048 \
    --temperature 0.2 \
    --top_p 0.9 \
    --repetition_penalty 1.2 \
    --threshold 0.7
This part is related to the directory instruction_iio. It is the template as designed in Stanford's alpaca. The possibility to change the trigger words is also provided, with the same limitations as those previously mentioned.
cd instruction_iio
accelerate launch instruction_output.py \
    --batch_dir "data_iio/santacoder_generations/" \
    --seed_tasks_path "data_iio/code_tasks.jsonl" \
    --num_instructions_to_generate 10 \
    --model_name_or_path "gpt_bigcode-santacoder" \
    --num_prompt_instructions 8 \
    --request_batch_size 5 \
    --n 2 \
    --max_length 2048 \
    --temperature 0.2 \
    --top_p 0.9 \
    --repetition_penalty 1.2 \
    --threshold 0.7
Here we want to post-process the generated instructions by keeping only those that are not too similar to one another. To do so, we go to the folder self-instruct and launch
python unique_post_processing.py \
    --batch_dir "instruction_io/data_io/santacoder_generations/" \
    --threshold 0.5
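The effect of such a threshold can be sketched with a greedy pass that drops any instruction too similar to one already kept. The similarity measure used by unique_post_processing.py is not described here, so the token-set Jaccard overlap below is only a stand-in, and `keep_dissimilar` is a hypothetical helper.

```python
import re


def jaccard(a, b):
    """Token-set overlap in [0, 1]; a stand-in similarity measure."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def keep_dissimilar(instructions, threshold=0.5, sim=jaccard):
    """Greedily keep instructions whose similarity to every kept
    instruction is strictly below the threshold."""
    kept = []
    for inst in instructions:
        if all(sim(inst, prev) < threshold for prev in kept):
            kept.append(inst)
    return kept


generated = [
    "Write a function that reverses a string",
    "Write a function that reverses a given string",
    "Implement binary search on a sorted array",
]
unique = keep_dissimilar(generated, threshold=0.5)
print(unique)
```

The `sim` argument can be swapped for an embedding-based cosine similarity if a semantic comparison is preferred over token overlap.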