Adding SFT training #14

TJ-Solergibert · 2024-07-30T21:55:13Z

Hello,

In this PR, I include everything necessary to perform SFT with the llama model. The main features included are:

Sample packing multiple conversations (Questions & Answers)
Train on completions only. This means that we will only compute the loss with the tokens that belong to the answers generated by the model and not the questions asked by the human or the chat template.
No cross-attention between samples. This way, each token will only attend to the tokens of its own conversation.

As detailed below, in this first beta, we will allow activating and deactivating features 2 & 3. I have designed this to measure the effect of these parameters, although I propose getting rid of them in the final version.

In my first PR in the nanotron repo (huggingface#187), I used as a reference the implementation on axolotl. The problem was that it contained padding tokens to fill the sequence length. I finally opted for a padding-free implementation and used the new implementation from HuggingFace Transformers as a reference [1], [2]. I included the script tools/check_sft.py to compare the generations of both models (HF & nanotron) and ensure they are the same. I emphasize that the generations are the same and not the logits. This is because, although we have the same parameters in both implementations, we do not perform exactly the same operations. In nanotron, we have 1. Fused QKV matrix, 2. Fused MLP matrix, 3. FA LayerNorm, which produces slightly different logits (with torch.testing.assert_close 99% of the logits are equal with atol=rtol=1e-2), but the important thing is that the generations are the same, especially the most probable first token.

Here & here you can observe the wandb runs of the 4 different configs toggling Features 2 & 3. As can be seen, using Feature 3 increases the TFLOPs since flash_attn_varlen_func achieves better performance when attending to shorter sequences.

Details & Functionality

In this first "Beta," I introduce 1. A new Dataset & ChatTokenizer & Collator and 2. A new Llama model for SFT (LlamaForSFT).

We will only need to specify in the config file a QA dataset from the HuggingFace Hub. Unlike Nanosets, no preprocessing step is required. In this case, we have an IterableDataset that will handle tokenization + sample packing on the fly. The obvious benefit of this is that we don't need to tokenize the data beforehand, but it has a major drawback: It is not trivial to recover the state of the DataLoader to resume training once interrupted. The only solution I know is through torchdata's StatefulDataloaders, which I am already working on for the final version. We can also activate and deactivate features 2 and 3 via the configurations train_on_completions_only and remove_cross_attention. Finally, remember that we only support the format of conversation datasets from Open-Orca/SlimOrca & Magpie-Align/Magpie-Pro-300K-Filtered, so if you want to use other QA datasets (like this dataset with "content" and "role" keys), you will need to change the dictionary keys.
```
- data:
    dataset:
      hf_dataset: Magpie-Align/Magpie-Pro-300K-Filtered
      hf_dataset_split: train
      conversation_column_name: conversations
      train_on_completions_only: true
      remove_cross_attention: true
    num_loading_workers: 1
    seed: 42
  name: General purpose training (Single dataset)
  start_training_step: 1
```
Finally, to apply the chat template and tokenize the data, I included ChatTokenizer, very similar to the one included in meta-llama/llama3, with the difference that we will also register THE ROLE of the tokens necessary for feature 2.
LlamaForSFT only supports SFT training. I have removed everything related to the inference of the nanotron checkpoints with the script run_generate.py since we have never tested it nor do we intend to. I included the RoPE embeddings from HF transformers, which, although their performance is not very good compared to FlashAttention's RoPEs written in Triton, are the only ones I have seen that support position ids (necessary for Feature 3). In the future, we could try to write a kernel for this. Also, for Feature 3, it is necessary to use flash_attn_varlen_func instead of flash_attn_func.

Keep in mind that as we are already packing multiple samples, the tokens.micro_batch_size will be always 1. Then, the maximum number of tokens we will have is tokens.micro_batch_size * tokens.sequence_length.

TODOs

Write DOCS
Efficient RoPE embeddings
Ability to recover DataLoader states from an interruption
Delete options to Experiment with Feature 2 & 3
Delete tools/check_sft.py
Delete tools/todi
Delete convert_hf_nanotron.ipynb

… sft

TJ-Solergibert added 10 commits July 27, 2024 11:03

First prototype, let's jump padding free

419b33a

This mess produces sames generations as hf

03f4308

Added SFT generations check script

c57533d

Added masked LOSS check

a66b0c6

Getting ready

06af8cf

RCP Working

a8f979d

Added todi scripts

c026422

Added SFT docs

38f3815

Merge branch 'sft' of https://github.com/tj-solergibert/nanotron into…

2d882db

… sft

Dont predict EOText token

d5228bb

xrsrke mentioned this pull request Aug 2, 2024

Adding support for training chat models huggingface/nanotron#187

Closed

TJ-Solergibert mentioned this pull request Aug 22, 2024

Deleting cross attention between documents during pertaining #16

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Adding SFT training #14

Adding SFT training #14

Uh oh!

TJ-Solergibert commented Jul 30, 2024 •

edited

Loading

Uh oh!

Uh oh!

Adding SFT training #14

Are you sure you want to change the base?

Adding SFT training #14

Uh oh!

Conversation

TJ-Solergibert commented Jul 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Details & Functionality

TODOs

Uh oh!

Uh oh!

TJ-Solergibert commented Jul 30, 2024 •

edited

Loading