
Distributing LLaMA on multiple machines within the same network #176

Open
fabawi opened this issue Mar 10, 2023 · 5 comments

Comments

@fabawi

fabawi commented Mar 10, 2023

Using torch.distributed and fairscale, LLaMA can be parallelized across multiple devices or machines, and that already works quite well. However, each GPU is expected to have a large amount of VRAM, since the weights are loaded onto all of them. I've seen quite a few solutions: some offload the model, in part or as a whole, to the CPU, while others reduce the weight precision. Using a meta device to load the weights could also reduce the burden on each GPU by initializing the model only once the weights are set for each layer; then again, this only helps at load time, so that you don't run out of memory during initialization. Most approaches, if not all, as far as I can tell, assume the model weights are loaded on every GPU, at least initially.
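
For reference, here is a minimal sketch of the meta-device idea in plain PyTorch (assuming torch >= 2.0; the layer, shapes, and checkpoint tensors are illustrative, not the repo's code):

```python
import torch
import torch.nn as nn

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Build the module on the meta device: parameters get shapes but no storage,
# so nothing is allocated on CPU or GPU yet.
with torch.device("meta"):
    block = nn.Linear(4096, 4096)

# Materialize empty storage on the target device, then copy real weights in.
block = block.to_empty(device=device)
state_dict = {  # stand-in for one shard of a real checkpoint
    "weight": torch.randn(4096, 4096),
    "bias": torch.zeros(4096),
}
block.load_state_dict({k: v.to(device) for k, v in state_dict.items()})
```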

To solve this issue, I developed a LLaMA version distributed across multiple machines and GPUs using Wrapyfi (https://github.com/fabawi/wrapyfi). The outputs of the transformer blocks are split (similar to fairscale pipelines, but more controllable) and transmitted over ZeroMQ. The performance seems better than variants running on CPU, and more accurate than 8-bit variants (I haven't verified the latter; this is purely based on what the corresponding developers state). I tried the approach on 7B and 13B, and in theory it should work on the larger models. I will try it on the larger variants soon, but until then, I would appreciate feedback on what works and what doesn't.

https://github.com/modular-ml/wrapyfi-examples_llama
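
For intuition, a minimal sketch of the underlying idea of shipping an intermediate activation from one machine to the next over ZeroMQ (this is not Wrapyfi's API; it assumes pyzmq and torch are installed, and the address/port are placeholders):

```python
import torch
import zmq

# --- on the sending machine (runs the first group of transformer blocks) ---
ctx = zmq.Context()
push = ctx.socket(zmq.PUSH)
push.connect("tcp://192.168.1.20:5555")   # placeholder address of the next machine

hidden = torch.randn(1, 8, 4096)          # stand-in for the output of the local blocks
push.send_pyobj(hidden.cpu().numpy())     # pickle-based send; fine for a sketch

# --- on the receiving machine (runs the remaining blocks) ---
# ctx = zmq.Context()
# pull = ctx.socket(zmq.PULL)
# pull.bind("tcp://*:5555")
# hidden = torch.from_numpy(pull.recv_pyobj()).to("cuda:0")
# ... feed `hidden` into the next transformer block ...
```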

@yokie121

It's great!!

We want to know how to run the larger models, 13B or 65B. Thanks!

@fabawi
Author

fabawi commented Mar 13, 2023

It's great!!

We want to know how to run the larger models, 13B or 65B. Thanks!

Thanks @yokie121! Check out the example in the repo's README, under
Running 7B on 4 machines

The same applies to larger models: if the 2-machine/GPU setup worked, and then the 4-machine/GPU setup worked, larger models should work as well, as long as you have sufficient VRAM. To run 13B, do not change nproc_per_node; it is always 1 with our version of LLaMA. What you change instead is the model location (to 13B), --wrapyfi_device_idx, and --wrapyfi_total_devices.
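
As a rough illustration of what splitting by device index means (this is not the repo's actual code, just how a contiguous slice of transformer blocks could be assigned from a device index and the total device count):

```python
def layer_range(n_layers: int, device_idx: int, total_devices: int) -> range:
    """Assign each device a contiguous slice of the transformer blocks."""
    per_device = -(-n_layers // total_devices)  # ceiling division
    start = device_idx * per_device
    return range(start, min(start + per_device, n_layers))

# LLaMA 13B has 40 transformer blocks; with 4 devices each gets 10 of them:
for idx in range(4):
    print(idx, layer_range(40, idx, 4))
```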

@leo-a11

leo-a11 commented Jun 9, 2023

Hello @fabawi,
Can you please tell me how you loaded the model on multiple GPUs and fine-tuned it?

@Negashev

Any updates?

@b4rtaz

b4rtaz commented Jan 20, 2024

In the meantime, the Distributed Llama project was created (by me 😅).
