I did this project to gain more familiarity with Reinforcement Learning from Human Feedback (RLHF) as a whole.
It is based on the paper "Training Language Models to Follow Instructions with Human Feedback" (InstructGPT).
Models pushed to the hub here: https://huggingface.co/ArnavM3434
Supervised fine-tuning (SFT):
- Pretrained model: gpt2
- LoRA configuration: model wrapped with a LoRA adapter (details in notebook)
- Dataset: Alpaca — single-turn instruction–completion pairs I put in the form "Human: ... Assistant: ..."
- Training details:
  - Loss computed on the completion only (prompt and padding tokens masked out; see the sketch at the end of this section)
  - Training loss converged around 2.15 (started around 2.7)
  - Results were consistent across multiple learning rates and LoRA configs
- Outcome:
  - Much better BLEU score and qualitative completions than pretrained GPT-2
  - See training runs and examples in the notebook (note: the notebook loss starts low because this was the 3rd or 4th run, resumed from an existing checkpoint)
- Weights & Biases dashboard:
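
To make the SFT setup concrete, here is a minimal sketch of the pieces described above: wrapping GPT-2 with a LoRA adapter via peft, formatting an Alpaca pair into the Human:/Assistant: template, and masking the labels so the loss is computed on the completion only. The LoRA hyperparameters, helper names, and max length are illustrative assumptions, not the exact values used in the notebook.

```python
# Illustrative sketch only -- hyperparameters and helper names are assumptions.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

base_model = AutoModelForCausalLM.from_pretrained("gpt2")
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # adapter rank (assumed)
    lora_alpha=16,              # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused QKV projection
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are trainable

def build_sft_example(instruction, completion, max_length=512):
    """Tokenize one Alpaca pair and mask everything except the completion."""
    prompt = f"Human: {instruction} Assistant:"
    full_text = f"{prompt} {completion}{tokenizer.eos_token}"

    prompt_len = len(tokenizer(prompt, add_special_tokens=False)["input_ids"])
    enc = tokenizer(
        full_text,
        truncation=True,
        max_length=max_length,
        padding="max_length",
        add_special_tokens=False,
    )

    labels = list(enc["input_ids"])
    for i in range(len(labels)):
        # -100 is ignored by the causal-LM cross-entropy loss, so prompt
        # tokens and padding tokens contribute nothing to the gradient.
        if i < prompt_len or enc["attention_mask"][i] == 0:
            labels[i] = -100
    enc["labels"] = labels
    return enc
```

Built this way, the `labels` field already carries the completion-only masking, so a standard Hugging Face `Trainer` (or a simple custom loop) can be used without a custom loss function.
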
Reward model:
- Dataset: Dahoas/rm-static
- Loss function: Bradley–Terry pairwise ranking loss
- Architecture: added a scalar reward head on top of the SFT model (see the sketch after this section)
- Observations:
  - Training loss converged around 0.65; validation accuracy converged around 61%
  - I modified the data to keep only single-turn completions to better match the distribution the SFT model was trained on, which yielded better results than using the raw Dahoas/rm-static data
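
As a rough illustration of the architecture and loss above, the sketch below puts a scalar reward head on a GPT-2 backbone and scores preference pairs with the Bradley–Terry pairwise ranking loss. The pooling choice (last non-padding token) and class layout are assumptions and may differ from the notebook.

```python
# Sketch of a reward model with a Bradley-Terry pairwise loss; details are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

class RewardModel(nn.Module):
    def __init__(self, sft_checkpoint="gpt2"):
        super().__init__()
        # Reuse the (GPT-2 style) SFT model's transformer stack, drop the LM head
        self.backbone = AutoModelForCausalLM.from_pretrained(sft_checkpoint).transformer
        self.reward_head = nn.Linear(self.backbone.config.n_embd, 1, bias=False)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                          # (batch, seq, hidden)
        last_token = attention_mask.sum(dim=1) - 1   # index of last non-pad token
        pooled = hidden[torch.arange(hidden.size(0)), last_token]
        return self.reward_head(pooled).squeeze(-1)  # one scalar reward per sequence

def bradley_terry_loss(chosen_rewards, rejected_rewards):
    # -log sigmoid(r_chosen - r_rejected): push the chosen completion's
    # reward above the rejected completion's reward for each pair.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

With this loss, validation accuracy corresponds to the fraction of pairs where the chosen completion receives the higher reward, which is presumably how the ~61% figure above is measured.
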
RLHF with PPO:
- Prompts: Alpaca dataset
- Config: see PPO configuration in the notebooks
- Observations:
  - KL divergence remained negative despite various adjustments (see the KL-penalty sketch at the end of this section):
    - Reward normalization
    - Generation tweaks
    - Different KL coefficients
    - Different clip ranges
    - Different reward models
  - Tried another reward model (OpenAssistant/reward-model-deberta-v3-base), with similarly poor results
  - Still working on this part; I will probably write my own training loop instead of using the PPO trainer to make debugging easier
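
Since the negative KL readings are the main open issue, here is a minimal sketch of the approximate per-token KL penalty that PPO-style RLHF loops commonly fold into the reward (roughly the shaping TRL's PPO trainer applies). Function and argument names are illustrative assumptions.

```python
# Illustrative KL-penalized reward shaping; names, shapes, and kl_coef are assumptions.
import torch

def kl_penalized_rewards(policy_logprobs, ref_logprobs, scores, kl_coef=0.2):
    """policy_logprobs, ref_logprobs: (batch, response_len) log-probs of the
    sampled response tokens under the policy and the frozen reference model.
    scores: (batch,) scalar outputs of the reward model for each response."""
    kl = policy_logprobs - ref_logprobs      # per-token KL estimate (can be < 0)
    rewards = -kl_coef * kl                  # penalize drift from the reference
    rewards[:, -1] += scores                 # reward-model score on the final token
    return rewards, kl.sum(dim=-1)           # per-sequence KL, useful for logging
```

The per-token estimator log p_policy − log p_ref can be negative for individual tokens, but its mean over on-policy samples should be non-negative in expectation (and near zero at the start of training, when the policy still equals the reference). A persistently negative mean therefore usually points to a mismatch in how the policy and reference log-probs are gathered (generation settings, padding, or token alignment) rather than to the objective itself, which is worth checking first in a hand-written loop.
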
