Commit 81a6339: first commit
yunx-z committed Aug 14, 2024
Showing 21 changed files with 4,283 additions and 0 deletions.

167 changes: 167 additions & 0 deletions .gitignore
@@ -0,0 +1,167 @@
*.sh
predictions/
checkpoint/
data/
*.tmp
*/bak/
logs/*/
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
189 changes: 189 additions & 0 deletions README.md
@@ -0,0 +1,189 @@
# Small Language Models Need Strong Verifiers to Self-Correct Reasoning

## Abstract

> Self-correction has emerged as a promising solution to boost the reasoning performance of large language models (LLMs), where LLMs refine their solutions using self-generated critiques that pinpoint the errors. This work explores whether small (≤ 13B) language models (LMs) have the ability of self-correction on reasoning tasks with minimal inputs from stronger LMs. We propose a novel pipeline that prompts smaller LMs to collect self-correction data that supports the training of self-refinement abilities. First, we leverage correct solutions to guide the model in critiquing their incorrect responses. Second, the generated critiques, after filtering, are used for supervised fine-tuning of the self-correcting reasoner through solution refinement. Our experimental results show improved self-correction abilities of two models on five datasets spanning math and commonsense reasoning, with notable performance gains when paired with a strong GPT-4-based verifier, though limitations are identified when using a weak self-verifier for determining when to correct.

Below we introduce the commands for reproducing our experimental results. We decompose the task of self-correction into two phases: (SELF-)VERIFY and SELF-REFINE.

For Verifiers, we implement the following options:

- Fine-tuning Small LM as self-verifier
- Prompting GPT-4 as verifier
- Prompting Small LM as verifier

For Refiners, we implement the following options:

- SCORE-fine-tuned small LM as refiner
- Prompting GPT-4 as refiner

For the following commands, set the `${TASK}` parameter as follows:

| Dataset Name | ${TASK} |
| ------------- | ----------- |
| GSM8K | gsm8k_nl |
| MATH Subset | math |
| CommonsenseQA | csqa |
| RiddleSense | riddlesense |
| QASC | qasc |

And set the `${MODEL}` parameter to either `meta-llama/Llama-2-13b-chat-hf` or `google/gemma-7b-it`.

## Refiner Options

### Our Refiner Training Data Collection Pipeline (SCORE)

#### Step a. Sample and label N=10 solutions per question

Sample diverse reasoning chains for each question.

`python sample_ans.py --task ${TASK} --model ${MODEL} --split train --num_generations 10`

Score each reasoning chain based on its final answer.

`python pal_code_exec.py --task ${TASK} --model ${MODEL} --split train`
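
For intuition, here is a minimal sketch of the labeling criterion (illustrative only, not the actual logic of `pal_code_exec.py`; it assumes a chain ends with a line like "The answer is ..."):

```python
def label_solution(reasoning_chain: str, gold_answer: str) -> bool:
    """Mark a sampled chain as correct iff its final answer matches the gold answer."""
    # Assumption for this sketch: the chain ends with "... The answer is X".
    final_line = reasoning_chain.strip().splitlines()[-1]
    predicted = final_line.split("The answer is")[-1].strip(" .")
    return predicted == gold_answer.strip()
```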

#### Step b. Generate critique for incorrect solutions using correct solutions as hints

`python sample_feedback.py --task ${TASK} --model ${MODEL} --method diff --num_generations 1 --greedy`

> Apply rule-based filtering to filter out invalid feedback. These criteria include: 1) The number of steps and feedbacks (counted by the appearances of “Step {i}:” and “Feedback:”) should be the same. 2) Each step should be exactly copied from the initial solution. 3) The feedback for the last step should provide the correct answer.

`python prefilter_feedback.py --task ${TASK} --model ${MODEL} --greedy`
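
The sketch below illustrates these three checks (a minimal re-implementation for clarity; `prefilter_feedback.py` may differ in details):

```python
import re

def is_valid_feedback(initial_solution: str, critique: str, gold_answer: str) -> bool:
    """Minimal sketch of the rule-based feedback filter described above."""
    steps = re.findall(r"Step \d+:.*", critique)
    feedbacks = re.findall(r"Feedback:.*", critique)
    # 1) The numbers of steps and feedbacks must match.
    if len(steps) == 0 or len(steps) != len(feedbacks):
        return False
    # 2) Each step must be copied verbatim from the initial solution.
    if any(step not in initial_solution for step in steps):
        return False
    # 3) The feedback on the last step must state the correct answer.
    return gold_answer in feedbacks[-1]
```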

#### Step c. Check whether base LM can recover the correct solution with filtered critique

Generate refinement based on the feedback.

`python apply_feedback.py --task ${TASK} --method diff --model ${MODEL} --num_generations 1 --greedy`

Score the refined solution based on final answer correctness.

`python apply_feedback_results.py --task ${TASK} --model ${MODEL}`
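
This check presumably determines which critique-correction pairs are kept for fine-tuning, namely those whose refinement recovers the gold answer; a minimal sketch of that selection (the record fields are illustrative, not the repository's schema):

```python
def select_training_examples(records: list[dict]) -> list[dict]:
    """Keep a critique for SFT only if its refinement recovers the gold answer."""
    return [r for r in records if r["refined_answer"] == r["gold_answer"]]
```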

#### Step d. Supervised Finetuning on critique-correction data

Prepare the fine-tuning data in LLaMA-Factory format.

`python prepare_sft_data.py --task ${TASK} --model ${MODEL} --stage sft`

Use LLaMA-Factory for SFT.
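
For reference, LLaMA-Factory consumes alpaca-style JSON datasets (lists of `instruction`/`input`/`output` records registered in its `dataset_info.json`). The record below is only a sketch of what a critique-correction example might look like; the exact prompt wording produced by `prepare_sft_data.py` may differ:

```python
import json

# Illustrative alpaca-style record; "..." marks content elided for brevity.
record = {
    "instruction": "Review the solution step by step, give feedback on the first "
                   "incorrect step, and then provide a corrected solution.",
    "input": "Question: ...\nInitial solution:\nStep 1: ...\nStep 2: ...",
    "output": "Step 1: ...\nFeedback: ...\nCorrected solution: ...",
}

with open("score_sft_data.json", "w") as f:
    json.dump([record], f, indent=2)
```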

### Baseline: Prompting Small LM as refiner

```bash
python sample_verification.py \
--task ${TASK} \
--model ${MODEL} \
--verifier ${MODEL} \
--split test \
--num_generations 1 \
--greedy \
--self_refine
```

## Verifier Options

Generate and label one solution for each question to be verified.

```bash
# Generate solutions
python sample_ans.py --task ${TASK} --model ${MODEL} --split dev --num_generations 1 --greedy
python sample_ans.py --task ${TASK} --model ${MODEL} --split test --num_generations 1 --greedy
# Label solutions
python pal_code_exec.py --task ${TASK} --model ${MODEL} --split dev --greedy
python pal_code_exec.py --task ${TASK} --model ${MODEL} --split test --greedy
```

### Option 1: Fine-tuning Small LM as self-verifier

Prepare verifier fine-tuning data.

```bash
python --task ${TASK} --model ${MODEL} --split train
python --task ${TASK} --model ${MODEL} --split dev --greedy
python --task ${TASK} --model ${MODEL} --split test --greedy
```

Fine-tune the verifier with a binary classification objective.

```bash
# Fine-tuning
python llama_sequence_classification.py \
--data_path data/${TASK}/verifier/${MODEL} \
--output_path checkpoint/verifier/${TASK}/${MODEL} \
--model_name ${MODEL} \
--set_pad_id \
--lr 1e-5 \
--train_batch_size 32 \
--eval_batch_size 32 \
--num_epochs 3 \
--lora_rank 32 \
--lora_alpha 64
# Inference on dev and test
python llama_sequence_classification.py \
--data_path data/${TASK}/verifier/${MODEL}/dev.json \
--output_path checkpoint/verifier/${TASK}/${MODEL}/dev.json \
--model_name ${MODEL} \
--set_pad_id \
--predict_only
python llama_sequence_classification.py \
--data_path data/${TASK}/verifier/${MODEL}/test.json \
--output_path checkpoint/verifier/${TASK}/${MODEL}/test.json \
--model_name ${MODEL} \
--set_pad_id \
--predict_only
```
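
For orientation, the setup above corresponds to LoRA fine-tuning of a sequence-classification head with two labels (correct vs. incorrect). A minimal sketch with `transformers` and `peft` (illustrative only, not the repository's `llama_sequence_classification.py`):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

model_name = "meta-llama/Llama-2-13b-chat-hf"  # ${MODEL}
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token (--set_pad_id)

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

lora_config = LoraConfig(task_type=TaskType.SEQ_CLS, r=32, lora_alpha=64)  # --lora_rank / --lora_alpha
model = get_peft_model(model, lora_config)
# Train with the standard cross-entropy objective on (question + solution, label)
# pairs, e.g. via transformers.Trainer.
```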

### Option 2: Prompting GPT-4 as verifier

```bash
python sample_verification.py \
--task ${TASK} \
--model ${MODEL} \
--verifier gpt-4 \
--split test \
--num_generations 1 \
--greedy
```

### Option 3: Prompting Small LM as verifier

```bash
python sample_verification.py \
--task ${TASK} \
--model ${MODEL} \
--verifier ${MODEL} \
--split test \
--num_generations 1 \
--greedy
```

## Inference and Evaluation

Prepare the inference data in LLaMA-Factory format.

```bash
python prepare_sft_data.py --task ${TASK} --model ${MODEL} --stage predict_feedback --split dev --sample 1
python prepare_sft_data.py --task ${TASK} --model ${MODEL} --stage predict_feedback --split test --sample 1
```

Use LLaMA-Factory to generate one critique-correction for each question-solution pair.

Then report the final task accuracies after self-correction.

> During inference, the self-verifier model outputs a probability of the initial solution being incorrect, and the refinement is introduced only when the confidence of the verifier’s predictions exceeds a certain threshold, which is automatically chosen in a way that maximizes the accuracy on the dev set ...

`python ft_results.py --task ${TASK} --stage eval_greedy_refine --input prompted --verifier_type ours_ft_on_prompted_solutions --verifier_model ${MODEL} --refiner_type filtered_feedback4_refinement_diff --refiner_model ${MODEL} --split dev --verbose`

This will print the optimal threshold `${THRESHOLD}`.

> and then fixed during test-time predictions.

`python ft_results.py --task ${TASK} --stage eval_greedy_refine --input prompted --verifier_type ours_ft_on_prompted_solutions --verifier_model ${MODEL} --refiner_type filtered_feedback4_refinement_diff --refiner_model ${MODEL} --split test --best_dev_confidence ${THRESHOLD} --verbose`
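
A minimal sketch of that dev-set threshold search (illustrative only; `ft_results.py` may implement it differently):

```python
def pick_threshold(p_incorrect, initial_correct, refined_correct):
    """Choose the verifier-confidence threshold that maximizes dev accuracy.

    p_incorrect[i]    : verifier probability that initial solution i is wrong
    initial_correct[i]: 1 if the initial answer is correct, else 0
    refined_correct[i]: 1 if the refined answer is correct, else 0
    """
    candidates = [i / 100 for i in range(101)]

    def accuracy(threshold):
        # Accept the refinement only when the verifier is confident enough.
        picks = [r if p >= threshold else c
                 for p, c, r in zip(p_incorrect, initial_correct, refined_correct)]
        return sum(picks) / len(picks)

    return max(candidates, key=accuracy)

# Toy example: the chosen threshold refines only the clearly wrong third solution.
print(pick_threshold([0.1, 0.6, 0.9], [1, 1, 0], [1, 0, 1]))  # 0.61
```

The chosen value is then passed unchanged to the test-split command via `--best_dev_confidence ${THRESHOLD}`.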

For other settings:

- change to `--refiner_type prompted --refiner_model ${MODEL}` for the prompted small LM as self-refiner
- change to `--verifier_type prompted --verifier_model gpt-4` for prompted GPT-4 as verifier
- change to `--verifier_type prompted --verifier_model ${MODEL}` for the prompted small LM as verifier