Using distributed training (FSDP) for finetuning GPT-2 on WikiText
Instance used - AWS EC2 g4dn.12xlarge (4× NVIDIA T4 GPUs, 16 GB each)
Dataset used - wikitext-2-v1
Model used - GPT-2 (124M)
Processing - a per-GPU batch size of 24, for an effective batch size of 96 across all 4 GPUs with FSDP
Token length (block_size) - 512
- Created a base pipeline on Colab covering dataset loading, model loading, tokenization, and model finetuning (a minimal sketch follows).
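A rough sketch of that base pipeline, assuming the dataset and model names from the setup above; the output directory and hyperparameters are illustrative, not the exact values used:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

block_size = 512  # token length used in the final runs

# Load WikiText-2 and the GPT-2 (124M) checkpoint
raw = load_dataset("wikitext", "wikitext-2-v1")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

def tokenize(batch):
    return tokenizer(batch["text"])

def group_texts(batch):
    # Concatenate all token ids and split them into fixed-length blocks
    ids = sum(batch["input_ids"], [])
    total = (len(ids) // block_size) * block_size
    return {"input_ids": [ids[i:i + block_size] for i in range(0, total, block_size)]}

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
lm_data = tokenized.map(group_texts, batched=True,
                        remove_columns=tokenized["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-wikitext2",  # illustrative
                           per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=lm_data["train"],
    eval_dataset=lm_data["validation"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM: labels copied from input_ids
)
trainer.train()
```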
- Attempted parallelization on Kaggle, since it has multiple GPUs.
- Set up the AWS instance and got the base pipeline working on a single GPU with batch_size = 8.
- Had a choice between writing a custom training class and leveraging Hugging Face's Trainer and accelerate modules. Chose the latter because of time constraints, sacrificing control for quicker experimentation (see the sketch below).
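With the Trainer, enabling FSDP is mostly a matter of arguments rather than a new training loop, which is what made this route faster. A rough sketch; the exact `fsdp`/`fsdp_config` option names vary between transformers/accelerate versions, so treat these as indicative:

```python
from transformers import TrainingArguments

# Same Trainer pipeline as before; only the arguments change to enable FSDP.
training_args = TrainingArguments(
    output_dir="gpt2-wikitext2-fsdp",       # illustrative name
    per_device_train_batch_size=24,         # 24 x 4 GPUs = effective batch size of 96
    num_train_epochs=1,
    fp16=True,                              # mixed precision training (fp16)
    fsdp="full_shard auto_wrap",            # full sharding + automatic module wrapping
    fsdp_config={"min_num_params": 2_000},  # size-based wrapping threshold (assumed key name)
)
# The script is then launched across all 4 GPUs with `accelerate launch` (or torchrun).
```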
- Experimented with different FSDP parameters to push batch_size and token_length as high as possible:
  a. Full sharding + size-based wrapping + no prefetching + mixed precision training (fp16) made block_size = 256 work.
  b. block_size = 512 was still failing. There was a slight gap between memory reserved and memory allocated, so mixed precision training in fp8 looked like it might close it. Spent a lot of time here, but it didn't work, as fp8 is not yet properly supported.
  c. Attempted to gauge throughput and GPU utilization by plotting GPU utilization graphs; there is some room for improvement here, as GPU RAM was not at 100% utilization at all times.
  GPU Resource Utilisation Graphs -
  - Figure: GPU utilization during training for block_size = 512
  - Figure: GPU utilization during training for block_size = 256
  d. Tried to figure out the most efficient resizing strategy for the embeddings, as the current approach looked slightly inefficient.
  e. Attempted gradient checkpointing: block_size = 512 and higher now worked. The tradeoff was that initial accuracy roughly halved / perplexity roughly doubled. (A sketch of the resulting FSDP configuration follows this list.)
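For reference, the combination that finally made block_size = 512 fit corresponds roughly to the following raw PyTorch FSDP setup. In the project this is driven through the Trainer/accelerate configuration rather than written by hand; `model` is assumed from the base pipeline and a distributed process group is assumed to be initialised already:

```python
import functools
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Size-based auto-wrap: any submodule with >= 2K parameters becomes its own FSDP unit.
wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=2_000)

# fp16 mixed precision for parameters, gradient reduction, and buffers.
fp16_policy = MixedPrecision(param_dtype=torch.float16,
                             reduce_dtype=torch.float16,
                             buffer_dtype=torch.float16)

# Gradient checkpointing trades recompute for activation memory (what unlocked 512+).
model.gradient_checkpointing_enable()

fsdp_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, optimizer state
    auto_wrap_policy=wrap_policy,
    mixed_precision=fp16_policy,
    backward_prefetch=None,                         # prefetching disabled for the 256/512 runs
    forward_prefetch=False,
    use_orig_params=False,
    device_id=torch.cuda.current_device(),
)
```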
- Saved the fine-tuned model and pushed it to the Hugging Face Hub as a custom model, making it easier to access for testing/inference.
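Pushing and re-loading the checkpoint looks roughly like this (the repository id below is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# After training: push the finetuned weights and tokenizer to the Hugging Face Hub.
model.push_to_hub("your-username/gpt2-wikitext2-fsdp")      # placeholder repo id
tokenizer.push_to_hub("your-username/gpt2-wikitext2-fsdp")

# Later, pull it back for testing/inference like any other Hub model.
model = AutoModelForCausalLM.from_pretrained("your-username/gpt2-wikitext2-fsdp")
tokenizer = AutoTokenizer.from_pretrained("your-username/gpt2-wikitext2-fsdp")
```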
| Method | Max Block Size ($BS) | Approx. Train Time per Epoch (minutes) (*) | Perplexity | Notes |
|---|---|---|---|---|
| FSDP + full_shard | 64 | 1:52 | 85 | |
| FSDP + full_shard + transformer_based_wrap | 128 | 1:47 | 62 | |
| FSDP + full_shard + min_num_params = 2K + no-prefetch + no-use_original_params + MPT + fp16 | 256 | 1:22 | 45.36 | |
| FSDP + full_shard + min_num_params = 2K + no-prefetch + no-use_original_params + MPT + fp16 + gradient checkpointing | 512 | 1:39 | 35.09 - ~24 | |
Table 1: Finetuning the GPT-2 (124M) model with different strategies and block sizes
(*) Need to re-confirm/recalculate.
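The perplexity numbers are presumably derived from the evaluation cross-entropy loss in the usual way, e.g.:

```python
import math

eval_metrics = trainer.evaluate()                  # returns eval_loss among other metrics
perplexity = math.exp(eval_metrics["eval_loss"])   # PPL = exp(mean cross-entropy loss)
print(f"perplexity: {perplexity:.2f}")
```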
| Max Block Size ($BS) | Full Sharding | Wrapping Strategy | Backward Prefetch | Forward Prefetch | use_original_params | CPU-RAM Offloading + Efficient Loading | Mixed Precision Training |
|---|---|---|---|---|---|---|---|
| 64 | ✅ | transformer_based_wrap | ✅ | ✅ | ✅ | ✅ | ❌ |
| 128 | ✅ | transformer_based_wrap | ✅ | ✅ | ✅ | ✅ | ❌ |
| 256 | ✅ | size based wrap - 2k params | ❌ | ❌ | ❌ | ✅ | ✅ |
| 512 | ✅ | size based wrap - 2k params | ❌ | ❌ | ❌ | ✅ | ✅ |
Table 2: FSDP Sharding Strategies for Different Block Sizes
- Mixed precision (fp16)
- Gradient checkpointing
- CPU Offloading
- Full Sharding
- Making the model fit into the available GPU RAM
- Experimenting with FSDP parameters
- Making fp8 work
- Access to budget/billing usage
- Make the code more modular so that training works with different datasets and models.
- Build a custom training class to have more control over the FSDP processes.
- Incorporate hyperparameter tuning for better performance.
- Generate training and validation plots.
- Generate proper GPU utilization graphs.
- Implement more metrics (accuracy, MAUVE, and holistic metrics such as creativity, coherence, and diversity).
- PEFT: LoRA, QLoRA (see the sketch after this list).
- Automated experiments of FSDP parameters
- Improve GPU utilization
- Optimise using Dynamo
- Integrate experiment tracking
- Better inferencing pipeline
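For the PEFT item above, a minimal LoRA sketch with the peft library; the rank and target modules are illustrative, and QLoRA would additionally load the base model in 4-bit:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                          # illustrative rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],    # GPT-2's fused attention projection
)
peft_model = get_peft_model(base, lora_cfg)
peft_model.print_trainable_parameters()  # only the LoRA adapters are trainable
```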

