Add flexible padding bonus experiment (#438)
* Add flexible padding bonus experiment

* fix links
rasbt authored Nov 14, 2024
1 parent f6281ab commit ccade77
Showing 3 changed files with 83 additions and 35 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/check-links.yml
@@ -29,6 +29,6 @@ jobs:
     - name: Check links
       run: |
-        pytest --check-links ./ --check-links-ignore "https://platform.openai.com/*" --check-links-ignore "https://openai.com/*" --check-links-ignore "https://arena.lmsys.org" --check-links-ignore "https://www.reddit.com/r/*" --check-links-ignore "https://code.visualstudio.com/*" --check-links-ignore https://arxiv.org/* --check-links-ignore "https://ai.stanford.edu/~amaas/data/sentiment/"
+        pytest --check-links ./ --check-links-ignore "https://platform.openai.com/*" --check-links-ignore "https://openai.com/*" --check-links-ignore "https://arena.lmsys.org" --check-links-ignore https://unsloth.ai/blog/gradient --check-links-ignore "https://www.reddit.com/r/*" --check-links-ignore "https://code.visualstudio.com/*" --check-links-ignore https://arxiv.org/* --check-links-ignore "https://ai.stanford.edu/~amaas/data/sentiment/"
         # pytest --check-links ./ --check-links-ignore "https://platform.openai.com/*" --check-links-ignore "https://arena.lmsys.org" --retries 2 --retry-delay 5
22 changes: 12 additions & 10 deletions ch06/02_bonus_additional-experiments/README.md
@@ -26,9 +26,10 @@ For example,
| 13 | gpt2-small (124M) | pretrained | last | last_block | context length (1024) | 83.08% | 87.92% | 78.33% | 2.46 min | A100 |
| 14 | gpt2-small (124M) | pretrained | last | last_block | variable: no padding (batch size 1) | 100.00% | 98.66% | 98.00% | 1.75 min | A100 |
| 15 | gpt2-small (124M) | pretrained | last | last_block | variable: no padding (batch size 8) | 99.33% | 98.66% | 98.33% | 1.70 min | A100 |
-| 16 | gpt2-small (124M) | pretrained | last | last_block | longest train ex. (120); but no causal mask | 99.23% | 98.66% | 95.33% | 0.29 min | A100 |
-| 17 | gpt2-small (124M) | pretrained | last | last_block | longest train ex. (120) and `ignore_index` for padding | 96.63% | 99.33% | 95.00% | 0.28 min | A100 |
-| 18 | gpt2-small (124M) | pretrained | last + pooled embeddings | last_block | longest train ex. (120) | 97.79% | 99.33% | 96.33% | 0.32 min | A100 |
+| 16 | gpt2-small (124M) | pretrained | last | last_block | flexible (last non-padding position) | 99.42% | 98.66% | 98.33% | 0.30 min | A100 |
+| 17 | gpt2-small (124M) | pretrained | last | last_block | longest train ex. (120); but no causal mask | 99.23% | 98.66% | 95.33% | 0.29 min | A100 |
+| 18 | gpt2-small (124M) | pretrained | last | last_block | longest train ex. (120) and `ignore_index` for padding | 96.63% | 99.33% | 95.00% | 0.28 min | A100 |
+| 19 | gpt2-small (124M) | pretrained | last + pooled embeddings | last_block | longest train ex. (120) | 97.79% | 99.33% | 96.33% | 0.32 min | A100 |

 

@@ -51,9 +52,10 @@ You can use the following code to reproduce the experiments:
- Row 13: `python additional_experiments.py --context_length "model_context_length"`
- Row 14: `python additional_experiments.py --no_padding --batch_size 1`
- Row 15: `python additional_experiments.py --no_padding --batch_size 1 --accumulation_steps 8`
-- Row 16: `python additional_experiments.py --disable_causal_mask`
-- Row 17: `python additional_experiments.py --ignore_index 50256`
-- Row 18: `python additional_experiments.py --average embeddings`
+- Row 16: `python additional_experiments.py --trainable_token_pos "flexible"`
+- Row 17: `python additional_experiments.py --disable_causal_mask`
+- Row 18: `python additional_experiments.py --ignore_index 50256`
+- Row 19: `python additional_experiments.py --average_embeddings`

I've kept the LLM and dataset small on purpose, so you can run the training on a regular laptop like a MacBook Air M3 in about 15 minutes (for the default setting) in case you don't have access to a GPU.

@@ -69,7 +71,7 @@ I've kept the LLM and dataset small on purpose, so you can run the training on a
6. **Using a Model with Random Weights vs. Pretrained Weights (Row 1 and 5 vs. 10)**: Utilizing a model with random weights yields results that are only slightly worse (by 3% and 1.3%) compared to using pretrained weights.
7. **Using LoRA (Low-Rank Adaptation) vs Training All Layers (Row 11 vs. 5, and row 12 vs. 9)**: Keeping the model frozen and adding trainable LoRA layers (see [Appendix E](../../appendix-E/01_main-chapter-code/appendix-E.ipynb) for details) is a viable alternative to training all model parameters and even improves the performance by 1 percentage point (row 11 vs. 5). As can be seen from the roughly 1 percentage point smaller gap between the training and validation accuracy when using LoRA, this is likely due to less overfitting. Moreover, using LoRA is also more memory-efficient because fewer parameters have to be updated. When training the larger model (row 12 vs. 9), we can also see that LoRA trains much faster (5.79 min instead of 8.12 min).
8. **Padding Input to Full Context Length vs. Longest Training Example (Row 1 vs. 13)**: Padding the input to the full supported context length results in significantly worse performance.
-9. **Padding vs no padding (Row 1 vs. 14 and 15)**: The `--no_padding` option disables the padding in the dataset, which requires training the model with a batch size of 1 since the inputs have variable lengths. This results in a better test accuracy but takes longer to train. In row 15, we additionally enable gradient accumulation with 8 steps to achieve the same batch size as in the other experiments, which helps reduce overfitting and slightly boost the test set accuracy.
-10. **Disabling the causal attention mask (Row 1 vs. 16)**: Disables the causal attention mask used in the multi-head attention module. This means all tokens can attend all other tokens. The model accuracy is slightly improved compared to the GPT model with causal mask.
-11. **Ignoring the padding indices in the loss and backpropagation (Row 1 vs. 17)**: Setting `--ignore_index 50256` excludes the `|endoftext|` padding tokens in the `cross_entropy` loss function in PyTorch. In this case, it does not have any effect because we replaced the output layers so that the token IDs are either 0 or 1 for the binary classification example. However, this setting is useful when instruction finetuning models in chapter 7.
-13. **Averaging the embeddings over all tokens (Row 1 vs. 18)**: Setting `--average_embeddings` will average the embeddings over all tokens. If this option is not used (the default), only the output embeddings at the chosen token position (specified by `--trainable_token_pos`) are considered; for example, the embeddings of the last token. Enabling `--average_embeddings` will mean-pool the embeddings of all tokens into the position chosen by `--trainable_token_pos` (the last token by default). As we can see, this improves the performance from 95.00% to 96.33% with only a minimal increase in run time (0.28 min to 0.32 min) and might be worthwhile considering in practice.
+9. **Padding vs no padding (Row 1 vs. 14 & 15, and 16)**: The `--no_padding` option disables the padding in the dataset, which requires training the model with a batch size of 1 since the inputs have variable lengths. This results in a better test accuracy but takes longer to train. In row 15, we additionally enable gradient accumulation with 8 steps to achieve the same effective batch size as in the other experiments, which helps reduce overfitting and slightly boosts the test set accuracy. In row 16, we apply padding but select the token position based on the last non-padding token. Row 16 should be mathematically similar to row 15, which uses gradient accumulation. However, due to some challenges with gradient accumulation in cases of unequal token counts, there may be small discrepancies (discussed in [this blog post](https://unsloth.ai/blog/gradient)).
+10. **Disabling the causal attention mask (Row 1 vs. 17)**: This option disables the causal attention mask used in the multi-head attention module, which means all tokens can attend to all other tokens. The model accuracy is slightly improved compared to the GPT model with the causal mask.
+11. **Ignoring the padding indices in the loss and backpropagation (Row 1 vs. 18)**: Setting `--ignore_index 50256` excludes the `<|endoftext|>` padding tokens in the `cross_entropy` loss function in PyTorch. In this case, it does not have any effect because we replaced the output layers so that the target labels are either 0 or 1 for the binary classification example. However, this setting is useful when instruction finetuning models in chapter 7.
+12. **Averaging the embeddings over all tokens (Row 1 vs. 19)**: Setting `--average_embeddings` will average the embeddings over all tokens. If this option is not used (the default), only the output embeddings at the chosen token position (specified by `--trainable_token_pos`) are considered; for example, the embeddings of the last token. Enabling `--average_embeddings` will mean-pool the embeddings of all tokens into the position chosen by `--trainable_token_pos` (the last token by default). As we can see, this improves the performance from 95.00% to 96.33% with only a minimal increase in run time (0.28 min to 0.32 min) and might be worthwhile considering in practice.
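
The "flexible" setting from the padding item above picks, for each padded sequence, the position of the last token that is not padding. The following is a minimal plain-Python sketch of that index computation, mirroring `mask.sum(dim=1) - 1` from `additional_experiments.py` without requiring PyTorch. The helper name `last_non_pad_positions` and the toy token IDs are hypothetical; like the original, the sketch assumes padding appears only on the right and that the padding ID 50256 never occurs inside the real text.

```python
PAD_TOKEN_ID = 50256  # <|endoftext|>, used as the padding token in the experiments


def last_non_pad_positions(batch, pad_token_id=PAD_TOKEN_ID):
    """Return the index of the last non-padding token for each sequence.

    Plain-list analogue of `(input_batch != pad_token_id).sum(dim=1) - 1`.
    Assumes right-padding only; a sequence made entirely of padding yields -1.
    """
    return [sum(token != pad_token_id for token in seq) - 1 for seq in batch]


# Toy batch (hypothetical token IDs), right-padded to length 5
batch = [
    [11, 22, 33, PAD_TOKEN_ID, PAD_TOKEN_ID],
    [44, 55, 66, 77, 88],
    [99, PAD_TOKEN_ID, PAD_TOKEN_ID, PAD_TOKEN_ID, PAD_TOKEN_ID],
]
positions = last_non_pad_positions(batch)
print(positions)  # [2, 4, 0]
```

With a fixed "last" position, index 4 would be used for every row in this toy batch, so the first and third rows would be classified from a padding position rather than from their last real token.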
94 changes: 70 additions & 24 deletions ch06/02_bonus_additional-experiments/additional_experiments.py
@@ -184,16 +184,34 @@ def calc_loss_batch(input_batch, target_batch, model, device,
                     trainable_token_pos=-1, ignore_index=-100, average_embeddings=False):
     input_batch, target_batch = input_batch.to(device), target_batch.to(device)

-    model_output = model(input_batch)
-    if average_embeddings:
-        # Average over the sequence dimension (dim=1)
-        logits = model_output.mean(dim=1)
+    if trainable_token_pos == "flexible":  # Selects the last tokens before the padding tokens
+        # From https://github.com/rasbt/LLMs-from-scratch/discussions/434
+        # Find the last non-padding token for each sequence in the batch
+        pad_token_id = 50256  # <|endoftext|> token used for padding
+        mask = input_batch != pad_token_id
+        last_token_pos = mask.sum(dim=1) - 1  # Get position of last real token
+
+        # Get model outputs
+        logits = model(input_batch)  # shape: [batch_size, seq_len, num_classes]
+
+        # Select the logits corresponding to the last real token of each sequence
+        batch_size = logits.size(0)
+        selected_logits = logits[torch.arange(batch_size), last_token_pos]
+
+        loss = torch.nn.functional.cross_entropy(selected_logits, target_batch)
+        return loss
+
     else:
-        # Select embeddings at the specified token position
-        logits = model_output[:, trainable_token_pos, :]
+        model_output = model(input_batch)
+        if average_embeddings:
+            # Average over the sequence dimension (dim=1)
+            logits = model_output.mean(dim=1)
+        else:
+            # Select embeddings at the specified token position
+            logits = model_output[:, trainable_token_pos, :]

-    loss = torch.nn.functional.cross_entropy(logits, target_batch, ignore_index=ignore_index)
-    return loss
+        loss = torch.nn.functional.cross_entropy(logits, target_batch, ignore_index=ignore_index)
+        return loss


def calc_loss_loader(data_loader, model, device,
@@ -231,24 +249,48 @@ def calc_accuracy_loader(data_loader, model, device, num_batches=None,
         num_batches = len(data_loader)
     else:
         num_batches = min(num_batches, len(data_loader))
-    for i, (input_batch, target_batch) in enumerate(data_loader):
-        if i < num_batches:
-            input_batch, target_batch = input_batch.to(device), target_batch.to(device)

-            model_output = model(input_batch)
-            if average_embeddings:
-                # Average over the sequence dimension (dim=1)
-                logits = model_output.mean(dim=1)
+    if trainable_token_pos == "flexible":
+        for i, (input_batch, target_batch) in enumerate(data_loader):
+            if i < num_batches:
+                input_batch, target_batch = input_batch.to(device), target_batch.to(device)
+
+                # Find the last non-padding token for each sequence in the batch
+                pad_token_id = 50256  # <|endoftext|> token used for padding
+                mask = input_batch != pad_token_id
+                last_token_pos = mask.sum(dim=1) - 1  # Get position of last real token
+
+                with torch.no_grad():
+                    logits = model(input_batch)  # Logits of last output token
+                    # Select the logits corresponding to the last real token of each sequence
+                    batch_size = logits.size(0)
+                    selected_logits = logits[torch.arange(batch_size), last_token_pos]
+                    predicted_labels = torch.argmax(selected_logits, dim=-1)
+
+                num_examples += predicted_labels.shape[0]
+                correct_predictions += (predicted_labels == target_batch).sum().item()
             else:
-                # Select embeddings at the specified token position
-                logits = model_output[:, trainable_token_pos, :]
-
-            predicted_labels = torch.argmax(logits, dim=-1)
+                break

-            num_examples += predicted_labels.shape[0]
-            correct_predictions += (predicted_labels == target_batch).sum().item()
-        else:
-            break
+    else:
+        for i, (input_batch, target_batch) in enumerate(data_loader):
+            if i < num_batches:
+                input_batch, target_batch = input_batch.to(device), target_batch.to(device)
+
+                model_output = model(input_batch)
+                if average_embeddings:
+                    # Average over the sequence dimension (dim=1)
+                    logits = model_output.mean(dim=1)
+                else:
+                    # Select embeddings at the specified token position
+                    logits = model_output[:, trainable_token_pos, :]
+
+                predicted_labels = torch.argmax(logits, dim=-1)
+
+                num_examples += predicted_labels.shape[0]
+                correct_predictions += (predicted_labels == target_batch).sum().item()
+            else:
+                break
     return correct_predictions / num_examples
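
The indexing step `logits[torch.arange(batch_size), last_token_pos]` in the "flexible" branch picks one logit vector per sequence, each at its own position. As a rough illustration only, the same selection and the accuracy count can be written with nested lists. The helpers `select_logits` and `accuracy` and all numbers below are hypothetical and not part of the repository.

```python
def select_logits(logits, positions):
    """Pick logits[b][positions[b]] for every row b of the batch."""
    return [row[pos] for row, pos in zip(logits, positions)]


def accuracy(logits, positions, targets):
    """Fraction of rows whose argmax class at the selected position matches the target."""
    selected = select_logits(logits, positions)
    predicted = [max(range(len(vec)), key=vec.__getitem__) for vec in selected]
    correct = sum(p == t for p, t in zip(predicted, targets))
    return correct / len(targets)


# Toy logits with shape [batch_size=2, seq_len=3, num_classes=2] (made-up values)
logits = [
    [[0.1, 0.9], [2.0, -1.0], [0.0, 0.0]],  # last real token at position 1
    [[0.5, 0.2], [0.3, 0.4], [-1.0, 3.0]],  # last real token at position 2
]
positions = [1, 2]
targets = [0, 1]

print(select_logits(logits, positions))  # [[2.0, -1.0], [-1.0, 3.0]]
print(accuracy(logits, positions, targets))  # 1.0
```

In the tensor version, `torch.arange(batch_size)` supplies the row indices and `last_token_pos` the column indices, so the result has shape `[batch_size, num_classes]`.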


@@ -386,7 +428,7 @@ def replace_linear_with_lora(model, rank, alpha, alternative=False):
         type=str,
         default="last",
         help=(
-            "Which token position to train. Options: 'first', 'last'."
+            "Which token position to train. Options: 'first', 'last', 'flexible'."
         )
     )
     parser.add_argument(
@@ -483,6 +525,10 @@ def replace_linear_with_lora(model, rank, alpha, alternative=False):
         args.trainable_token_pos = 0
     elif args.trainable_token_pos == "last":
         args.trainable_token_pos = -1
+    # The "flexible" setting selects the last tokens before the padding tokens
+    # See https://github.com/rasbt/LLMs-from-scratch/discussions/434
+    elif args.trainable_token_pos == "flexible":
+        args.trainable_token_pos = "flexible"
     else:
         raise ValueError("Invalid --trainable_token_pos argument")
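
The mapping from the `--trainable_token_pos` string to the value consumed by the loss and accuracy functions can be sketched as a standalone snippet. This is a variation under stated assumptions, not the repository's code: the helper `resolve_token_pos`, the `TOKEN_POS_OPTIONS` table, and the use of argparse's `choices` are hypothetical additions for illustration.

```python
import argparse

# Maps the CLI string to what the training code expects:
# an integer index for fixed positions, or the sentinel string "flexible".
TOKEN_POS_OPTIONS = {"first": 0, "last": -1, "flexible": "flexible"}


def resolve_token_pos(name):
    """Translate a --trainable_token_pos value; raise ValueError if unknown."""
    try:
        return TOKEN_POS_OPTIONS[name]
    except KeyError:
        raise ValueError("Invalid --trainable_token_pos argument") from None


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--trainable_token_pos",
        type=str,
        default="last",
        choices=sorted(TOKEN_POS_OPTIONS),  # hypothetical: let argparse reject typos early
        help="Which token position to train. Options: 'first', 'last', 'flexible'.",
    )
    return parser


args = build_parser().parse_args(["--trainable_token_pos", "flexible"])
print(resolve_token_pos(args.trainable_token_pos))  # flexible
print(resolve_token_pos("last"))  # -1
```

Keeping the options in one dictionary means the help text, the validation, and the mapping cannot drift apart when another option is added.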
