Using distributed training (FSDP) for finetuning GPT-2 on WikiText
Instance used - AWS EC2 g4dn.12xlarge (4× NVIDIA T4 GPUs, 16 GB each)
Dataset used - wikitext-2-v1
Model used - GPT-2 (124M)
Processing - a per-GPU batch size of 24, for an effective batch size of 96 across all 4 GPUs with FSDP
Token length (block_size) - 512
- Created a base pipeline on Colab covering dataset loading, model loading, tokenization, and model finetuning (a minimal sketch follows).
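A rough sketch of that base pipeline, assuming the dataset and model names from the setup above; the output directory and hyperparameters are illustrative, not the exact values used:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

block_size = 512  # token length used in the final runs

# Load WikiText-2 and the GPT-2 (124M) checkpoint
raw = load_dataset("wikitext", "wikitext-2-v1")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

def tokenize(batch):
    return tokenizer(batch["text"])

def group_texts(batch):
    # Concatenate all token ids and split them into fixed-length blocks
    ids = sum(batch["input_ids"], [])
    total = (len(ids) // block_size) * block_size
    return {"input_ids": [ids[i:i + block_size] for i in range(0, total, block_size)]}

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
lm_data = tokenized.map(group_texts, batched=True,
                        remove_columns=tokenized["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-wikitext2",  # illustrative
                           per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=lm_data["train"],
    eval_dataset=lm_data["validation"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM: labels copied from input_ids
)
trainer.train()
```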
- Attempted parallelization on Kaggle, since it has multiple GPUs.
- Set up the AWS instance and got the base pipeline working on a single GPU with batch_size = 8.
- Had a choice between writing a custom training class and leveraging Hugging Face's Trainer and accelerate modules. Chose the latter because of time constraints, sacrificing control for quicker experimentation (see the sketch below).
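With the Trainer, enabling FSDP is mostly a matter of arguments rather than a new training loop, which is what made this route faster. A rough sketch; the exact `fsdp`/`fsdp_config` option names vary between transformers/accelerate versions, so treat these as indicative:

```python
from transformers import TrainingArguments

# Same Trainer pipeline as before; only the arguments change to enable FSDP.
training_args = TrainingArguments(
    output_dir="gpt2-wikitext2-fsdp",       # illustrative name
    per_device_train_batch_size=24,         # 24 x 4 GPUs = effective batch size of 96
    num_train_epochs=1,
    fp16=True,                              # mixed precision training (fp16)
    fsdp="full_shard auto_wrap",            # full sharding + automatic module wrapping
    fsdp_config={"min_num_params": 2_000},  # size-based wrapping threshold (assumed key name)
)
# The script is then launched across all 4 GPUs with `accelerate launch` (or torchrun).
```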
- Experimented with different FSDP parameters to push batch_size and token_length as high as possible:
  a. Full sharding + size-based wrapping + no prefetching + mixed precision training (fp16) made block_size = 256 work.
  b. block_size = 512 was still failing. There was a slight gap between memory reserved and memory allocated, so mixed precision training in fp8 looked like it might close it. Spent a lot of time here, but it didn't work, as fp8 is not yet properly supported.
  c. Attempted to gauge throughput and GPU utilization by plotting GPU utilization graphs; there is some room for improvement here, as GPU RAM was not at 100% utilization at all times.
  GPU Resource Utilisation Graphs -
  - Figure: GPU utilization during training for block_size = 512
  - Figure: GPU utilization during training for block_size = 256
  d. Tried to figure out the most efficient resizing strategy for the embeddings, as the current approach looked slightly inefficient.
  e. Attempted gradient checkpointing: block_size = 512 and higher now worked. The tradeoff was that initial accuracy roughly halved / perplexity roughly doubled. (A sketch of the resulting FSDP configuration follows this list.)
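For reference, the combination that finally made block_size = 512 fit corresponds roughly to the following raw PyTorch FSDP setup. In the project this is driven through the Trainer/accelerate configuration rather than written by hand; `model` is assumed from the base pipeline and a distributed process group is assumed to be initialised already:

```python
import functools
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Size-based auto-wrap: any submodule with >= 2K parameters becomes its own FSDP unit.
wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=2_000)

# fp16 mixed precision for parameters, gradient reduction, and buffers.
fp16_policy = MixedPrecision(param_dtype=torch.float16,
                             reduce_dtype=torch.float16,
                             buffer_dtype=torch.float16)

# Gradient checkpointing trades recompute for activation memory (what unlocked 512+).
model.gradient_checkpointing_enable()

fsdp_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, optimizer state
    auto_wrap_policy=wrap_policy,
    mixed_precision=fp16_policy,
    backward_prefetch=None,                         # prefetching disabled for the 256/512 runs
    forward_prefetch=False,
    use_orig_params=False,
    device_id=torch.cuda.current_device(),
)
```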
- Saved the fine-tuned model and pushed it to the Hugging Face Hub as a custom model, making it easier to access for testing/inference.
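Pushing and re-loading the checkpoint looks roughly like this (the repository id below is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# After training: push the finetuned weights and tokenizer to the Hugging Face Hub.
model.push_to_hub("your-username/gpt2-wikitext2-fsdp")      # placeholder repo id
tokenizer.push_to_hub("your-username/gpt2-wikitext2-fsdp")

# Later, pull it back for testing/inference like any other Hub model.
model = AutoModelForCausalLM.from_pretrained("your-username/gpt2-wikitext2-fsdp")
tokenizer = AutoTokenizer.from_pretrained("your-username/gpt2-wikitext2-fsdp")
```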
| Method | Max Block Size ($BS) | Approx. Train Time per Epoch (minutes) (*) | Perplexity | Notes |
|---|---|---|---|---|
| FSDP + full_shard | 64 | 1:52 | 85 | |
| FSDP + full_shard + transformer_based_wrap | 128 | 1:47 | 62 | |
| FSDP + full_shard + min_num_params = 2K + no-prefetch + no-use_original_params + MPT + fp16 | 256 | 1:22 | 45.36 | |
| FSDP + full_shard + min_num_params = 2K + no-prefetch + no-use_original_params + MPT + fp16 + gradient checkpointing | 512 | 1:39 | 35.09 - ~24 | |
Table 1: Finetuning the GPT-2 (124M) model with different strategies and block sizes
(*) Need to re-confirm/recalculate.
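The perplexity numbers are presumably derived from the evaluation cross-entropy loss in the usual way, e.g.:

```python
import math

eval_metrics = trainer.evaluate()                  # returns eval_loss among other metrics
perplexity = math.exp(eval_metrics["eval_loss"])   # PPL = exp(mean cross-entropy loss)
print(f"perplexity: {perplexity:.2f}")
```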
| Max Block Size ($BS) | Full Sharding | Wrapping Strategy | Backward Prefetch | Forward Prefetch | use_original_params | CPU-RAM Offloading + Efficient Loading | Mixed Precision Training |
|---|---|---|---|---|---|---|---|
| 64 | ✅ | transformer_based_wrap | ✅ | ✅ | ✅ | ✅ | ❌ |
| 128 | ✅ | transformer_based_wrap | ✅ | ✅ | ✅ | ✅ | ❌ |
| 256 | ✅ | size based wrap - 2k params | ❌ | ❌ | ❌ | ✅ | ✅ |
| 512 | ✅ | size based wrap - 2k params | ❌ | ❌ | ❌ | ✅ | ✅ |
Table 2: FSDP Sharding Strategies for Different Block Sizes
- Mixed precision (fp16)
- Gradient checkpointing
- CPU Offloading
- Full Sharding
- Making the model fit into the available GPU RAM
- Experimenting with FSDP parameters
- Making fp8 work
- Access to budget/billing usage
- Make the code more modular so that training works with different datasets and models.
- Build a custom training class to have more control over the FSDP processes.
- Incorporate hyperparameter tuning for better performance.
- Generate training and validation plots.
- Generate proper GPU utilization graphs.
- Implement more metrics (accuracy, MAUVE, and holistic metrics such as creativity, coherence, and diversity).
- PEFT: LoRA, QLoRA (see the sketch after this list).
- Automated experiments of FSDP parameters
- Improve GPU utilization
- Optimise using Dynamo
- Integrate experiment tracking
- Better inferencing pipeline
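For the PEFT item above, a minimal LoRA sketch with the peft library; the rank and target modules are illustrative, and QLoRA would additionally load the base model in 4-bit:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                          # illustrative rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],    # GPT-2's fused attention projection
)
peft_model = get_peft_model(base, lora_cfg)
peft_model.print_trainable_parameters()  # only the LoRA adapters are trainable
```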

