
Distributing LLaMA on multiple machines within the same network #176

Open
fabawi opened this issue Mar 10, 2023 · 5 comments

Comments

@fabawi

fabawi commented Mar 10, 2023

Using torch.distributed and fairscale, LLaMA can be parallelized across multiple devices or machines, and that already works quite well. However, each GPU is expected to have a large amount of VRAM, since the weights are loaded onto all of them. I've seen quite a few solutions: some offload the model, in part or as a whole, to the CPU, while others reduce the weight precision. Using a meta device to load the weights could also reduce the burden on each GPU by initializing the model only once the weights are set for each layer; then again, this only helps at load time, so that you don't run out of memory during initialization. Most approaches, if not all, as far as I can tell, assume the model weights are loaded on every GPU, at least initially.
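
For reference, here is a minimal sketch of the meta-device idea in plain PyTorch (assuming torch >= 2.0; the layer, shapes, and checkpoint tensors are illustrative, not the repo's code):

```python
import torch
import torch.nn as nn

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Build the module on the meta device: parameters get shapes but no storage,
# so nothing is allocated on CPU or GPU yet.
with torch.device("meta"):
    block = nn.Linear(4096, 4096)

# Materialize empty storage on the target device, then copy real weights in.
block = block.to_empty(device=device)
state_dict = {  # stand-in for one shard of a real checkpoint
    "weight": torch.randn(4096, 4096),
    "bias": torch.zeros(4096),
}
block.load_state_dict({k: v.to(device) for k, v in state_dict.items()})
```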

To solve this issue, I developed a LLaMA version distributed across multiple machines and GPUs using Wrapyfi (https://github.com/fabawi/wrapyfi). The outputs of the transformer blocks are split (similar to fairscale pipelines, but more controllable) and transmitted over ZeroMQ. The performance seems better than variants running on CPU, and more accurate than 8-bit variants (I haven't verified the latter; this is purely based on what the corresponding developers state). I tried the approach on 7B and 13B, and in theory it should work on the larger models. I will try it on the larger variants soon, but until then, I would appreciate feedback on what works and what doesn't.

https://github.com/modular-ml/wrapyfi-examples_llama
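
For intuition, a minimal sketch of the underlying idea of shipping an intermediate activation from one machine to the next over ZeroMQ (this is not Wrapyfi's API; it assumes pyzmq and torch are installed, and the address/port are placeholders):

```python
import torch
import zmq

# --- on the sending machine (runs the first group of transformer blocks) ---
ctx = zmq.Context()
push = ctx.socket(zmq.PUSH)
push.connect("tcp://192.168.1.20:5555")   # placeholder address of the next machine

hidden = torch.randn(1, 8, 4096)          # stand-in for the output of the local blocks
push.send_pyobj(hidden.cpu().numpy())     # pickle-based send; fine for a sketch

# --- on the receiving machine (runs the remaining blocks) ---
# ctx = zmq.Context()
# pull = ctx.socket(zmq.PULL)
# pull.bind("tcp://*:5555")
# hidden = torch.from_numpy(pull.recv_pyobj()).to("cuda:0")
# ... feed `hidden` into the next transformer block ...
```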

@yokie121

It's great!!

We want to know how to run the larger models, 13B or 65B. Thanks!

@fabawi
Author

fabawi commented Mar 13, 2023

It's great!!

We want to know how to run the larger models, 13B or 65B. Thanks!

Thanks @yokie121! Check out the example in the repo's README, under
Running 7B on 4 machines

The same applies to larger models: if the 2-machine/GPU setup worked, and then the 4-machine/GPU setup worked, larger models should work as well, as long as you have sufficient VRAM. To run 13B, do not change nproc_per_node; it is always 1 with our version of LLaMA. What you change instead is the model location (to 13B), --wrapyfi_device_idx, and --wrapyfi_total_devices.
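
As a rough illustration of what splitting by device index means (this is not the repo's actual code, just how a contiguous slice of transformer blocks could be assigned from a device index and the total device count):

```python
def layer_range(n_layers: int, device_idx: int, total_devices: int) -> range:
    """Assign each device a contiguous slice of the transformer blocks."""
    per_device = -(-n_layers // total_devices)  # ceiling division
    start = device_idx * per_device
    return range(start, min(start + per_device, n_layers))

# LLaMA 13B has 40 transformer blocks; with 4 devices each gets 10 of them:
for idx in range(4):
    print(idx, layer_range(40, idx, 4))
```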

@leo-a11

leo-a11 commented Jun 9, 2023

Hello @fabawi,
Can you please tell me how you loaded the model on multiple GPUs and fine-tuned it?

@Negashev

Any updates?

@b4rtaz

b4rtaz commented Jan 20, 2024

In the meantime, the Distributed Llama project was created (by me 😅).
