ArmelRandy/Self-instruct

Self-instruct 🤗

A repository to perform self-instruct with a model on HF Hub

What is this about?

This repository is dedicated to Self-instruct, an iterative approach that generates a dataset of instructions by bootstrapping on a model's own predictions. For it to work well, the model has to be powerful. The original work focuses on OpenAI's text-davinci-003 engine, one of their most powerful models. Our aim is to give modest, decoder-based models a chance to be used for data generation.

News

  • May 24, 2023: We've built a Space to visualize the data generated by self-instruct when the model used is StarCoder💫, the recent SOTA open-source code LLM by Hugging Face 🤗.

Disclaimer

  • Our approach requires the availability of a good amount of computational resources.
  • We will focus on the dataset generation pipeline and the curation rather than the fine-tuning.
  • Keep in mind that the quality of the dataset obtained by this method is strongly dependent on the quality of the model that is used.

Table of Contents

  1. Overview of the method
  2. Related work
  3. Our approach
  4. Quickstart

Overview

Self-instruct is an iterative method that helps language models improve their ability to follow natural language instructions. The idea is to start from a seed set of manually written instructions and use them to prompt the model to generate new instructions together with their corresponding input-output instances. The method includes a filtering step to ensure the novelty of the generated tasks.

Related work

Our implementation is inspired by the original Self-instruct method and by more recent variants, including Stanford's alpaca and Code alpaca. While the last two are almost identical, the sole difference being the set of seed tasks used, the original work takes a different approach. Self-instruct's authors take a set of seed tasks and prompt the model with some of them to make it generate instructions. The outputs for the generated instructions are then produced separately. Conversely, alpaca is all in one, in the sense that the model is prompted to generate the instruction as well as the input-output pair at the same time. It uses the following template

### Instruction:
{instruction}

### Input:
{input}

### Output:
{output}

The advantage is that this all-in-one template reduces the inference cost of the method, and the quality of the generated instances has not been shown to be significantly impaired. Intuitively, we believe that this prompting approach yields feasible instructions because each one has to come with a sound input-output pair.
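To make the all-in-one prompting concrete, here is a minimal sketch of how a few-shot prompt could be assembled from the template above. The function name `build_prompt`, the seed tasks, and the choice to end the prompt with an open `### Instruction:` header are assumptions made for illustration; they are not taken from the alpaca code or from this repository.

```python
# Alpaca-style "all in one" prompt: each seed task is rendered with the
# Instruction/Input/Output template, and the prompt ends with an open
# "### Instruction:" header so that the model continues with a new task.
TEMPLATE = "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Output:\n{output}"


def build_prompt(seed_tasks):
    """Concatenate rendered seed tasks and leave a slot for a new instruction."""
    blocks = [TEMPLATE.format(**task) for task in seed_tasks]
    blocks.append("### Instruction:\n")
    return "\n\n".join(blocks)


# Hypothetical seed tasks, made up for this example.
seed_tasks = [
    {"instruction": "Reverse the given string.", "input": "hello", "output": "olleh"},
    {
        "instruction": "Write a function that adds two numbers.",
        "input": "",
        "output": "def add(a, b):\n    return a + b",
    },
]

prompt = build_prompt(seed_tasks)
print(prompt)
```

A single completion of such a prompt is expected to contain the instruction, the input and the output of a new task, which is why one model call per task is enough.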

Our approach

Our approach is focused on code use cases, therefore our modifications are mostly relevant for that framework.

The prompting format

During our tests, we realized that, at least with "small" code models, the trigger words Input: and Output: tend to make them generate test cases instead. This is a serious problem because, given an instruction, we want a working implementation rather than a potentially buggy test case. To alleviate this issue, we decided to get rid of the Input: trigger word and adopt an instruction-output format.

The trigger words

Using Instruction:, Input: and Output: seems to work well for text-davinci-003, but how well does it work for other models? The choice of trigger words is definitely relevant for small models, as it can have a huge impact on the quality of their generations. Following this intuition, we made it possible in our code to change the trigger words used during prompting, so that they can be adapted to each model.
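As an illustration of why the trigger words matter downstream, the sketch below splits a raw completion into instruction-output pairs using whatever trigger words were chosen. The function `parse_completion` and the sample completion are hypothetical; the repository's own parsing code may differ.

```python
import re


def parse_completion(text, instruction_trigger="Instruction:", output_trigger="Output:"):
    """Split a raw completion into (instruction, output) pairs using the trigger words."""
    pattern = re.compile(
        r"###\s*" + re.escape(instruction_trigger) + r"(.*?)"
        r"###\s*" + re.escape(output_trigger) + r"(.*?)"
        # An output ends at the next instruction header or at the end of the text.
        r"(?=###\s*" + re.escape(instruction_trigger) + r"|\Z)",
        re.DOTALL,
    )
    return [(i.strip(), o.strip()) for i, o in pattern.findall(text)]


# Made-up completion that uses "Request:" / "Answer:" as trigger words.
completion = (
    "### Request:\nSum a list.\n\n"
    "### Answer:\ndef total(xs):\n    return sum(xs)\n\n"
    "### Request:\nReverse a list.\n\n"
    "### Answer:\ndef rev(xs):\n    return xs[::-1]\n"
)

pairs = parse_completion(completion, "Request:", "Answer:")
for instruction, output in pairs:
    print(repr(instruction), "->", repr(output))
```

Because the same trigger words appear both in the prompt and in the parser, changing them in one place only would silently break the extraction.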

The post-processing

How should we select and post-process the instructions generated by prompting a model? In the original work, the instructions are generated iteratively, and a new instruction is kept only if its ROUGE score with every previously generated instruction is strictly less than 0.7. This encourages diversity in the dataset, at least in terms of how the instructions are worded. According to our experiments, it is still possible for the same problem to be generated multiple times with a different formulation each time. We propose to take the curation further with the following ideas.
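The novelty filter can be sketched as follows. This is a simplified, dependency-free approximation: it computes a ROUGE-L style F-measure from the longest common subsequence of lowercase whitespace tokens, whereas the repository relies on the rouge-score package, whose tokenization and scoring details differ. The helper names and the sample instructions are assumptions made for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f(reference, candidate):
    """Simplified ROUGE-L F-measure on lowercase whitespace tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def is_novel(candidate, pool, threshold=0.7):
    """Accept a candidate only if it scores strictly below the threshold against the whole pool."""
    return all(rouge_l_f(existing, candidate) < threshold for existing in pool)


# Made-up instructions: the first candidate is a near-duplicate, the second is new.
pool = ["Write a function that reverses a string."]
for candidate in [
    "Write a function that reverses a list.",
    "Implement binary search on a sorted array.",
]:
    if is_novel(candidate, pool):
        pool.append(candidate)

print(pool)
```

A filter of this kind only compares wording, which is why two differently phrased versions of the same problem can both pass it.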

Self-consistency

We came up with a strong instruction filtering technique. The idea is simple: we want to test whether the model is consistent with what it generates. To verify this, we prompt the model with the output of each generated instruction, let it generate the corresponding input, and check the consistency between that generation and the ground truth.
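A rough sketch of such a consistency check is given below. It assumes that the ground truth is the original instruction, that the model is exposed as a callable mapping an output to regenerated text, and that consistency is measured by token overlap with a 0.5 threshold. All of these are illustrative assumptions; `fake_model` is a stand-in and not a real model call.

```python
def jaccard(a, b):
    """Token-set overlap in [0, 1]; a stand-in for a ROUGE or embedding-based score."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def self_consistent(instruction, output, generate_fn, threshold=0.5):
    """Regenerate text from the output and compare it with the original instruction."""
    regenerated = generate_fn(output)
    return jaccard(instruction, regenerated) >= threshold


def fake_model(output):
    """Hypothetical stand-in for a model prompted with an output."""
    return "Reverse a given string."


# Made-up pairs: the second instruction does not match its output.
pairs = [
    ("Reverse a string.", "def rev(s):\n    return s[::-1]"),
    ("Sort a list of integers.", "def rev(s):\n    return s[::-1]"),
]
kept = [(i, o) for i, o in pairs if self_consistent(i, o, fake_model)]
print(kept)
```

In this toy run the mismatched pair is dropped because the text regenerated from its output has almost nothing in common with its instruction.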

Uniqueness

Another option is to post-process the raw dataset and keep only a set of unique instructions.
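One way to picture this uniqueness pass is a greedy filter that keeps an instruction only if it is not too similar to any instruction already kept. The sketch below uses cosine similarity over bag-of-words counts so that it runs without extra dependencies; using sentence embeddings instead (the installation steps include sentence-transformers) is an assumption about the actual script, and the function names here are hypothetical.

```python
import math
from collections import Counter


def bow_cosine(a, b):
    """Cosine similarity between bag-of-words count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def keep_unique(instructions, threshold=0.5):
    """Greedily keep instructions whose similarity to every kept one is below the threshold."""
    kept = []
    for text in instructions:
        if all(bow_cosine(text, k) < threshold for k in kept):
            kept.append(text)
    return kept


# Made-up raw dataset with one near-duplicate.
raw = [
    "Write a function to reverse a string.",
    "Write a function to reverse a list.",
    "Compute the factorial of n.",
]
unique = keep_unique(raw)
print(unique)
```

The result depends on the order of the input, since an instruction is only compared with those kept before it.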

Further details

We modified the seed tasks to keep only those that are related to code. To do so, we combine the tasks from Code Alpaca (code tasks extracted from the original seed tasks plus some new tasks probably created by the repo's author) with some leetcode tasks. We have a total of 41 seed tasks.

Quickstart

StarCoder was trained on GitHub code, so it can be used to perform code generation. More precisely, the model can complete the implementation of a function or infer the next characters in a line of code. This can be done with the help of the 🤗 transformers library.

Step by step installation with conda

Here, we present a step-by-step recipe that anybody can use to apply our self-instruct method to their preferred LLM in a conda environment. Create a new conda environment and activate it

conda create -n env
conda activate env

Install the PyTorch version compatible with your version of CUDA. For example, the following command works with CUDA 11.6

conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia

Install transformers and accelerate

conda install -c huggingface transformers 
pip install git+https://github.com/huggingface/accelerate.git

Do not forget to run accelerate config in the terminal in order to configure your environment; for more details, see the accelerate documentation. We will also need rouge-score and sentence-transformers

pip install rouge-score
pip install sentence-transformers
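Before moving on, it can be useful to confirm that the packages installed above are visible to the active environment. The snippet below is a small optional check, not part of the repository; the list of import names is an assumption derived from the installation commands (rouge-score is imported as `rouge_score` and sentence-transformers as `sentence_transformers`).

```python
import importlib.util

# Import names of the packages installed in the previous steps.
REQUIRED = ["torch", "transformers", "accelerate", "rouge_score", "sentence_transformers"]


def missing_packages(names=REQUIRED):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
print("Missing packages:", ", ".join(missing) if missing else "none")
```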

Now we are ready to clone the repository and to start working

git clone https://github.com/ArmelRandy/self-instruct
cd self-instruct

Instruction - output

This part is related to the directory instruction_io. We prompt the model with the following template

### Instruction :
{instruction}

### Output :
{output}

For instructions that provide an input (a piece of code in the case of a debugging or translation task), we concatenate the instruction and the input under the keyword "Instruction:", which gives

### Instruction :
{instruction}{input}

### Output :
{output}

It is possible to change the trigger words Instruction: and Output: into other words, for example Request: and Answer: respectively. However, the change has to be made directly in the code, as the trigger words are used as constants throughout the code.
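The instruction-output rendering described above can be sketched as follows. The constant names, the `render_task` helper, the task dictionary keys and the newline used to join the instruction and its input are assumptions made for illustration; the template itself only shows the two fields placed one after the other.

```python
# Trigger words kept as module-level constants, so changing them means editing the code.
INSTRUCTION_TRIGGER = "Instruction :"
OUTPUT_TRIGGER = "Output :"


def render_task(task):
    """Render one task; any input is appended to the instruction text."""
    instruction = task["instruction"]
    if task.get("input"):
        # The separator is an assumption: a newline between instruction and input.
        instruction = instruction + "\n" + task["input"]
    return f"### {INSTRUCTION_TRIGGER}\n{instruction}\n\n### {OUTPUT_TRIGGER}\n{task['output']}"


# Made-up debugging task whose input is the buggy code.
task = {
    "instruction": "Fix the bug in this function.",
    "input": "def add(a, b):\n    return a - b",
    "output": "def add(a, b):\n    return a + b",
}

rendered = render_task(task)
print(rendered)
```

Replacing the two constants with `Request :` and `Answer :` would switch the trigger words for every rendered task at once.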

cd instruction_io
accelerate launch instruction_output.py \
    --batch_dir "data_io/santacoder_generations/" \
    --seed_tasks_path "data_io/code_tasks.jsonl" \
    --num_instructions_to_generate 10 \
    --model_name_or_path "gpt_bigcode-santacoder" \
    --num_prompt_instructions 8 \
    --request_batch_size 5 \
    --n 2 \
    --max_length 2048 \
    --temperature 0.2 \
    --top_p 0.9 \
    --repetition_penalty 1.2 \
    --threshold 0.7

Instruction - input - output

This part is related to the directory instruction_iio. It uses the template as designed in Stanford's alpaca. The possibility to change the trigger words is also provided, with the same limitations as those previously mentioned.

cd instruction_iio
accelerate launch instruction_output.py \
    --batch_dir "data_iio/santacoder_generations/" \
    --seed_tasks_path "data_iio/code_tasks.jsonl" \
    --num_instructions_to_generate 10 \
    --model_name_or_path "gpt_bigcode-santacoder" \
    --num_prompt_instructions 8 \
    --request_batch_size 5 \
    --n 2 \
    --max_length 2048 \
    --temperature 0.2 \
    --top_p 0.9 \
    --repetition_penalty 1.2 \
    --threshold 0.7

Post-processing : unique strategy

Here we want to post-process our generated instructions by keeping only instructions that are not too similar to one another. To do so, we go to the root folder self-instruct and launch

python unique_post_processing.py \
    --batch_dir "instruction_io/data_io/santacoder_generations/" \
    --threshold 0.5
