build nanoGPT with Tensor Parallelism

This repo adds a version of nanoGPT which implements tensor parallelism following this tutorial.

After countless hours of fun, I was able to make it work, except it doesn't work very well 🤣

| GPU(s) | Parallelism | Throughput |
|---|---|---|
| 1x 3090 | - | ~66.5k tok/sec |
| 2x 3090 | DDP | ~132.0k tok/sec |
| 2x 3090 | TP | ~70.5k tok/sec |

The batch size was always B = 16 and the 3090s are NV-Linked. It is quite possible that I have made a rookie mistake somewhere and that's why it's disappointingly slow 🤷‍♂️.
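To put the numbers above in perspective, here is a small sketch that turns the throughput figures into speedup and scaling efficiency relative to the single-GPU run. The helper name `scaling` is made up for illustration; the inputs are the approximate values from the table.

```python
# Throughput figures copied from the table above (tokens per second, approximate).
BASELINE = 66_500                      # 1x 3090, no parallelism
RUNS = {"DDP": 132_000, "TP": 70_500}  # both measured on 2x 3090
NUM_GPUS = 2


def scaling(throughput, baseline, num_gpus):
    """Return (speedup over baseline, fraction of ideal linear scaling)."""
    speedup = throughput / baseline
    efficiency = speedup / num_gpus
    return speedup, efficiency


for name, tput in RUNS.items():
    speedup, eff = scaling(tput, BASELINE, NUM_GPUS)
    print(f"{name}: {speedup:.2f}x speedup, {eff:.0%} scaling efficiency")
```

With these inputs DDP comes out close to the ideal 2x, while TP is only about 1.06x faster than a single GPU.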

I haven't touched any of the files from the original repo, everything happens in the new train_gpt2_tp.py file.

Launch command:

torchrun --standalone --nproc-per-node=2 train_gpt2_tp.py
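For context, torchrun starts one process per GPU and tells each process who it is through the `RANK`, `LOCAL_RANK` and `WORLD_SIZE` environment variables. The sketch below shows one way a script could read them and fall back to single-process mode when launched with plain `python`; the helper name `read_torchrun_env` and the returned dictionary layout are assumptions for illustration, not code from this repo.

```python
import os


def read_torchrun_env(env=None):
    """Read the variables torchrun sets; fall back to a single-process setup."""
    env = os.environ if env is None else env
    rank = int(env.get("RANK", -1))
    if rank == -1:
        # Not launched via torchrun: behave like a normal single-GPU run.
        return {"distributed": False, "rank": 0, "local_rank": 0, "world_size": 1}
    return {
        "distributed": True,
        "rank": rank,                            # global process index
        "local_rank": int(env["LOCAL_RANK"]),    # index on this machine, e.g. for cuda:{local_rank}
        "world_size": int(env["WORLD_SIZE"]),    # total number of processes
    }


# Hypothetical values, as the second of two processes would see them.
print(read_torchrun_env({"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "2"}))
```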

Some interesting learnings

1. `nullcontext()`

nullcontext() allows you to do this:

from contextlib import nullcontext

with loss_parallel() if tp else nullcontext():
    logits, _ = model(x)
    ...
    loss.backward()

rather than much more messy:

if tp:
    with loss_parallel():
        logits, _ = model(x)
        ...
        loss.backward()
else:
    logits, _ = model(x)
    ...
    loss.backward()
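If you want to try the pattern without PyTorch installed, here is a self-contained sketch. `fake_loss_parallel` is a hypothetical stand-in for `loss_parallel()` that only records when it is entered and exited, so you can see that `nullcontext()` really does nothing.

```python
from contextlib import contextmanager, nullcontext

events = []  # records what happened, for illustration only


@contextmanager
def fake_loss_parallel():
    # Hypothetical stand-in for loss_parallel(); it only logs enter/exit.
    events.append("enter")
    try:
        yield
    finally:
        events.append("exit")


def train_step(tp: bool) -> None:
    # Pick the context manager once, then run the same body either way.
    with fake_loss_parallel() if tp else nullcontext():
        events.append("step")


train_step(tp=True)   # runs inside the stand-in context
train_step(tp=False)  # runs inside nullcontext(), which is a no-op
print(events)
```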

2. Debugging parallel processes in VS Code

... with breakpoints, stepping into libraries and whatnot! All you need to do is add the following configuration to your launch.json file:

{
    "version": "0.2.0",
    "configurations": [
        ...,
        {
            "name": "Distributed: Current File",
            "type": "debugpy",
            "request": "launch",
            "purpose": ["debug-in-terminal"],
            "console": "integratedTerminal",
            "module":"torch.distributed.run",
            "args":["--standalone","--nproc_per_node=2","${file}"],
            "justMyCode": false,
        },
    ]
}

build nanoGPT

This repo holds the from-scratch reproduction of nanoGPT. The git commits were specifically kept step by step and clean so that one can easily walk through the git commit history to see it built slowly. Additionally, there is an accompanying video lecture on YouTube where you can see me introduce each commit and explain the pieces along the way.

We basically start from an empty file and work our way to a reproduction of the GPT-2 (124M) model. If you have more patience or money, the code can also reproduce the GPT-3 models. While the GPT-2 (124M) model probably trained for quite some time back in the day (2019, ~5 years ago), today, reproducing it is a matter of ~1hr and ~$10. You'll need a cloud GPU box if you don't have enough; for that I recommend Lambda.

Note that GPT-2 and GPT-3 are both simple language models, trained on internet documents, and all they do is "dream" internet documents. So this repo/video does not cover Chat finetuning, and you can't talk to it like you can talk to ChatGPT. The finetuning process (while quite simple conceptually - SFT is just about swapping out the dataset and continuing the training) comes after this part and will be covered at a later time. For now this is the kind of stuff that the 124M model says if you prompt it with "Hello, I'm a language model," after 10B tokens of training:

Hello, I'm a language model, and my goal is to make English as easy and fun as possible for everyone, and to find out the different grammar rules
Hello, I'm a language model, so the next time I go, I'll just say, I like this stuff.
Hello, I'm a language model, and the question is, what should I do if I want to be a teacher?
Hello, I'm a language model, and I'm an English person. In languages, "speak" is really speaking. Because for most people, there's

And after 40B tokens of training:

Hello, I'm a language model, a model of computer science, and it's a way (in mathematics) to program computer programs to do things like write
Hello, I'm a language model, not a human. This means that I believe in my language model, as I have no experience with it yet.
Hello, I'm a language model, but I'm talking about data. You've got to create an array of data: you've got to create that.
Hello, I'm a language model, and all of this is about modeling and learning Python. I'm very good in syntax, however I struggle with Python due

Lol. Anyway, once the video comes out, this will also be a place for FAQ, and a place for fixes and errata, of which I am sure there will be a number :)

For discussions and questions, please use the Discussions tab, and for faster communication, have a look at my Zero To Hero Discord, channel #nanoGPT.

Video

Let's reproduce GPT-2 (124M) YouTube lecture

Errata

Minor cleanup: we forgot to delete register_buffer of the bias once we switched to flash attention, fixed with a recent PR.

Earlier versions of PyTorch may have difficulty converting from uint16 to long. Inside load_tokens, we added npt = npt.astype(np.int32) to use numpy to convert uint16 to int32 before converting to a torch tensor and then converting to long.
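A minimal sketch of the NumPy half of that fix, assuming NumPy is installed. The helper name `to_int32_tokens` and the token values are made up for illustration; the real code does this inline inside load_tokens and then builds the torch tensor from the result.

```python
import numpy as np


def to_int32_tokens(tokens: np.ndarray) -> np.ndarray:
    # Widen uint16 token ids to int32 in NumPy first, so PyTorch never has to
    # convert from uint16 itself. The int32 array can then be turned into a
    # torch tensor and cast to long, as described above.
    return tokens.astype(np.int32)


tokens_u16 = np.array([0, 50256, 65535], dtype=np.uint16)  # made-up token ids
tokens_i32 = to_int32_tokens(tokens_u16)
print(tokens_i32.dtype, tokens_i32.tolist())
```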

The torch.autocast function takes an arg device_type, to which I stubbornly tried to just pass device, hoping it would work, but PyTorch really wants just the type and raises errors in some versions of PyTorch. So we want e.g. the device cuda:3 to get stripped to cuda. Currently, device mps (Apple Silicon) would become device_type cpu; I'm not 100% sure this is the intended PyTorch way.
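The stripping described above can be sketched in plain Python. The helper name `device_type_of` is hypothetical; it simply reproduces the stated behaviour, including the mps-to-cpu fallback.

```python
def device_type_of(device: str) -> str:
    # Any "cuda" or "cuda:N" string maps to "cuda"; everything else,
    # including "mps", falls back to "cpu" as described above.
    return "cuda" if device.startswith("cuda") else "cpu"


for dev in ("cuda", "cuda:3", "cpu", "mps"):
    print(dev, "->", device_type_of(dev))
```

The result is what would then be passed as the device_type argument of torch.autocast.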

Confusingly, model.require_backward_grad_sync is actually used by both the forward and backward pass. Moved up the line so that it also gets applied to the forward pass.

Prod

For more production-grade runs that are very similar to nanoGPT, I recommend looking at the following repos:

FAQ

License

MIT
