A repository to perform self-instruct with a model on Hugging Face Hub.
This repository is dedicated to Self-instruct, an iterative approach that generates a dataset of instructions by bootstrapping on a model's own predictions. For it to work well, the model used has to be powerful. The original work focuses on OpenAI's `text-davinci-003` engine, one of their most powerful models. Our aim is to give modest, decoder-based models a chance to be used for data generation.
- May 24, 2023: We've built a space that lets you visualize the data generated by self-instruct when the model used is StarCoder💫, the recent SOTA open-source code LLM by Hugging Face 🤗.
- Our approach requires substantial computational resources.
- We will focus on the dataset generation pipeline and the curation rather than the fine-tuning.
- Keep in mind that the quality of the dataset obtained by this method is strongly dependent on the quality of the model that is used.
Self-instruct is an iterative method that helps language models (LMs) improve their ability to follow natural language instructions. The idea is to use a seed set of manually written instructions to prompt the model to generate new instructions and their corresponding input-output instances. The method includes a filtering step to ensure the novelty of the generated tasks.
Our implementation is inspired by the original Self-instruct method and by recent variants, including Stanford's Alpaca and Code Alpaca. While the last two are almost identical, the sole difference being the set of seed tasks used, the original work takes a different approach. Self-instruct's authors prompt the model with a subset of the seed tasks to make it generate instructions; the outputs for the generated instructions are then produced separately. Conversely, Alpaca is all-in-one, in the sense that the model is prompted to generate an instruction together with its input-output pair. It uses the following template:
### Instruction:
{instruction}
### Input:
{input}
### Output:
{output}
The advantage is that this all-in-one template reduces the inference cost of the method, and the quality of the generated instances has not been shown to be significantly impaired. We believe, intuitively, that this prompting approach yields feasible instructions because each one must come with a sound input-output pair.
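To make the all-in-one template concrete, here is a minimal sketch that fills it in and splits a model completion back into its three fields. The names `render_task` and `parse_tasks` and the regular expression are illustrative assumptions, not code taken from this repository or from Alpaca.

```python
import re

# Alpaca-style all-in-one template, copied from the text above.
TEMPLATE = (
    "### Instruction:\n{instruction}\n"
    "### Input:\n{input}\n"
    "### Output:\n{output}\n"
)


def render_task(instruction: str, input_text: str, output: str) -> str:
    """Fill the template with one task (hypothetical helper)."""
    return TEMPLATE.format(instruction=instruction, input=input_text, output=output)


# One task = the three headers in order; the output runs until the next
# "### Instruction:" header or the end of the text.
_TASK = re.compile(
    r"### Instruction:\n(?P<instruction>.*?)\n"
    r"### Input:\n(?P<input>.*?)\n"
    r"### Output:\n(?P<output>.*?)(?=\n### Instruction:|\Z)",
    re.DOTALL,
)


def parse_tasks(text: str) -> list[dict[str, str]]:
    """Split a completion into dicts with instruction/input/output keys."""
    return [
        {name: value.strip() for name, value in match.groupdict().items()}
        for match in _TASK.finditer(text)
    ]


if __name__ == "__main__":
    completion = render_task("Add two numbers.", "a = 1, b = 2", "print(a + b)")
    print(parse_tasks(completion))
```

Because the instruction, input and output arrive in a single completion, one call to the model yields a full task, which is where the reduced inference cost comes from.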
Our approach is focused on code use cases, therefore our modifications are mostly relevant for that framework.
During our tests, we realized that, at least with "small" code models, the trigger words `Input` and `Output` tend to make them generate test cases instead. This is a significant problem because, given an instruction, we want a working implementation rather than a potentially buggy test case. To alleviate this issue, we decided to get rid of the `Input` trigger word and adopt an instruction-output format.
Using `Instruction`, `Input` and `Output` seems to work well for `text-davinci-003`, but how well does it work for other models? The choice of trigger words is definitely relevant for small models, as it can have a huge impact on the quality of their generations. Following this intuition, we included in our code the possibility to change the trigger words used during prompting. This makes it possible to accommodate each model.
How should the instructions generated by prompting a model be selected and post-processed? In the original work, the instructions are generated iteratively, and we keep those with a ROUGE score strictly less than 0.7 with every previously generated instruction. This promotes diversity in the dataset, at least in terms of how the instructions are worded. According to our experiments, it is still possible to generate the same problem multiple times with a different formulation each time. We propose to take the curation further with several ideas.
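The ROUGE-based novelty check can be sketched as follows. This is a simplified stand-in, assuming ROUGE-L F1 computed on lowercased whitespace tokens; the repository itself relies on the `rouge-score` package, whose tokenization differs, so the exact scores will not match. The function names are hypothetical.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 on whitespace tokens (assumption: no stemming)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand) if ref and cand else 0
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def filter_novel(candidates: list[str], pool: list[str], threshold: float = 0.7) -> list[str]:
    """Keep candidates scoring strictly below `threshold` against every
    instruction seen so far (the initial pool plus earlier kept candidates)."""
    seen = list(pool)
    kept = []
    for cand in candidates:
        if all(rouge_l_f1(old, cand) < threshold for old in seen):
            kept.append(cand)
            seen.append(cand)
    return kept


if __name__ == "__main__":
    pool = ["Write a function that reverses a string."]
    new = ["Write a function that reverses a list.",
           "Implement binary search on a sorted array."]
    print(filter_novel(new, pool))
```

The first candidate shares six of seven tokens with the pool entry and is rejected, which illustrates the limitation noted above: the check measures wording overlap, not whether two instructions describe the same problem.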
We came up with a strong instruction filtering technique. The idea is simple: we want to test whether the model is consistent with what it generates. We verify this by prompting the model to generate an instruction from the output. This is a difficult task for an LM, and for a human, because in many cases it turns out to be unsolvable. When the model is able to generate an instruction, we compare its meaning with the ground truth. For that, we use Sentence-BERT, specifically all-MiniLM-L6-v2, with a threshold of our choice (typically 0.5). This filtering technique is not recommended for models with a weak ability to understand natural language text.
Another alternative is to post-process the raw dataset by keeping only instructions that are not similar to each other in meaning. Once again, we make use of Sentence-BERT. An instruction is kept if its similarity score with every previously generated instruction is below a threshold (typically 0.5).
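Both Sentence-BERT filters described above reduce to cosine similarity between sentence embeddings. The sketch below is an illustration under stated assumptions: `embed` is any callable mapping text to a vector (in practice it could be the `encode` method of a `SentenceTransformer("all-MiniLM-L6-v2")` model), `toy_embed` is a bag-of-words stand-in so the example runs without downloads, tasks for which no instruction could be regenerated are assumed to be dropped, and deduplication is assumed to compare against previously kept instructions. None of the function names come from this repository.

```python
import math
from typing import Callable, Optional, Sequence

Embed = Callable[[str], Sequence[float]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def is_consistent(instruction: str, regenerated: Optional[str],
                  embed: Embed, threshold: float = 0.5) -> bool:
    """Consistency filter: keep a task only if the instruction regenerated
    from its output is close in meaning to the original instruction."""
    if not regenerated:
        return False  # assumption: no regenerated instruction means "drop"
    return cosine(embed(instruction), embed(regenerated)) >= threshold


def deduplicate(instructions: Sequence[str], embed: Embed,
                threshold: float = 0.5) -> list[str]:
    """Greedy semantic dedup: keep an instruction if its similarity with
    every previously kept instruction is below `threshold`."""
    kept: list[str] = []
    kept_vecs: list[Sequence[float]] = []
    for text in instructions:
        vec = embed(text)
        if all(cosine(vec, other) < threshold for other in kept_vecs):
            kept.append(text)
            kept_vecs.append(vec)
    return kept


# Toy bag-of-words embedder, for demonstration only (not Sentence-BERT).
VOCAB = ("sort", "list", "reverse", "string", "sum", "numbers")


def toy_embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


if __name__ == "__main__":
    print(is_consistent("sort a list", "sort the list", toy_embed))
    print(deduplicate(["sort a list", "sort the list", "reverse a string"], toy_embed))
```

Note the opposite directions of the two tests: the consistency filter keeps pairs that score at or above the threshold, while deduplication keeps instructions that score below it against everything already kept.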
We modified the seed tasks to keep only those that are related to code. For that, we combine the tasks from Code Alpaca (code tasks extracted from the original seed tasks plus some new tasks probably created by the repo's author) and some LeetCode tasks. We have a total of 41 seed tasks.
StarCoder was trained on GitHub code, so it can be used for code generation. More precisely, the model can complete the implementation of a function or infer the following characters in a line of code. This can be done with the help of the 🤗 transformers library.
Here, we present a step-by-step recipe that anybody can use to apply our self-instruct method to their preferred LLM in a conda environment. Create a new conda environment and activate it:
conda create -n env
conda activate env
Install the `pytorch` version compatible with your version of CUDA. For example, the following command works with CUDA 11.6:
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia
Install `transformers` and `accelerate`:
conda install -c huggingface transformers
pip install git+https://github.com/huggingface/accelerate
Do not forget to run `accelerate config` in the terminal to configure your environment. For more details, see the accelerate documentation.
We will also need rouge-score
pip install rouge-score
Now we are ready to clone the repository and to start working
git clone https://github.com/ArmelRandy/self-instruct
cd self-instruct
This part is related to the directory `instruction_io`. We prompt the model with the following template:
### Instruction :
{instruction}
### Output :
{output}
For instructions that come with an input (code, in the case of a debugging or translation task), we concatenate the instruction and the input under the keyword `Instruction`. We then have
### Instruction :
{instruction}{input}
### Output :
{output}
It is possible to change the trigger words `Instruction` and `Output` to other words, such as `Request` and `Answer` respectively. However, the change has to be made directly in the code, as the trigger words are used as constants throughout the code.
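As an illustration of the instruction-output template with swappable trigger words, here is a minimal sketch. The `TriggerWords` class, the helper names, the `instruction`/`input`/`output` keys of the seed tasks, and the way the few-shot prompt ends with an open instruction header are all assumptions made for this example; the repository stores the trigger words as constants in its scripts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriggerWords:
    """Trigger words used in the prompt (hypothetical container)."""
    instruction: str = "Instruction"
    output: str = "Output"


def format_task(instruction: str, output: str = "", input_text: str = "",
                words: TriggerWords = TriggerWords()) -> str:
    """Render one task in the instruction-output format.

    When a task has an input, it is concatenated directly to the
    instruction, as in the `{instruction}{input}` template above; the
    caller is responsible for any separator.
    """
    return (
        f"### {words.instruction} :\n{instruction}{input_text}\n"
        f"### {words.output} :\n{output}"
    )


def build_prompt(seed_tasks: list[dict], words: TriggerWords = TriggerWords()) -> str:
    """Few-shot prompt: formatted seed tasks followed by an open
    instruction header for the model to continue (assumed layout)."""
    shots = [
        format_task(t["instruction"], t["output"], t.get("input", ""), words)
        for t in seed_tasks
    ]
    return "\n".join(shots) + f"\n### {words.instruction} :\n"


if __name__ == "__main__":
    seeds = [{"instruction": "Fix the bug:\n", "input": "print 1", "output": "print(1)"}]
    print(build_prompt(seeds, TriggerWords("Request", "Answer")))
```

Passing the trigger words as a single value keeps them consistent between prompt construction and any later parsing of the model's completion.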
cd instruction_io
accelerate launch instruction_output.py \
--batch_dir "data_io/santacoder_generations/" \
--seed_tasks_path "data_io/code_tasks.jsonl" \
--num_instructions_to_generate 10 \
--model_name_or_path "bigcode/gpt_bigcode-santacoder" \
--num_prompt_instructions 8 \
--request_batch_size 5 \
--n 2 \
--max_length 2048 \
--temperature 0.2 \
--top_p 0.9 \
--repetition_penalty 1.2 \
--threshold 0.7
This part is related to the directory `instruction_iio`. It is the template as designed in Stanford's Alpaca. The possibility to change the trigger words is also provided, with the same limitations as those previously mentioned.
cd instruction_iio
accelerate launch instruction_input_output.py \
--batch_dir "data_iio/santacoder_generations/" \
--seed_tasks_path "data_iio/code_tasks.jsonl" \
--num_instructions_to_generate 10 \
--model_name_or_path "bigcode/gpt_bigcode-santacoder" \
--num_prompt_instructions 8 \
--request_batch_size 5 \
--n 2 \
--max_length 2048 \
--temperature 0.2 \
--top_p 0.9 \
--repetition_penalty 1.2 \
--threshold 0.7
This part has an additional requirement, sentence-transformers, which is installed as follows:
pip install -U sentence-transformers
Here, we run the file `output_instruction.py` with the help of `accelerate`:
accelerate launch output_instruction.py \
--batch_dir "data_io/santacoder_generations" \
--num_trials 4 \
--seed_tasks_path "data_io/code_tasks.jsonl" \
--model_name_or_path "bigcode/gpt_bigcode-santacoder" \
--num_prompt_instructions 8 \
--n 1 \
--max_length 2048
It will create a file `regen.jsonl` in `batch_dir`.
Here we want to post-process our generated instructions by keeping only those that are not too similar. To do so, we go into the folder `self-instruct` and launch
python unique_post_processing.py \
--batch_dir "instruction_io/data_io/santacoder_generations/" \
--threshold 0.5
It will create a file `machine_generated_instructions_processed.jsonl` in `batch_dir`.
It is possible to visualize the generated instructions in terms of how they are phrased. Specifically, we can show the most common root verbs and their top 4 direct noun objects. This functionality is inherited from the implementation provided by self-instruct's authors. Its usage requires additional libraries: spacy, benepar and plotly.
pip install -U spacy
python -m spacy download en_core_web_md
pip install benepar
pip install plotly
Now, it is possible to run the notebook `instruction_visualize.ipynb`. We also provide `dataset_to_hub.ipynb` in order to push the generated dataset to the hub.
Now that the dataset is available, we can fine-tune our favorite text/code LLM to make it follow instructions. Our natural choice is StarCoder. This repository gives a comprehensive method that can be used to fine-tune StarCoder on any instruction dataset available on the hub.