Using torch.distributed and fairscale, LLaMA can already be parallelized across multiple devices or machines, and this works quite well. However, every GPU is expected to have a large amount of VRAM, since the weights are loaded onto all of them. I've seen quite a few workarounds: some offload the model, in part or as a whole, to the CPU, while others reduce the weight precision. Initializing the model on a meta device and materializing each layer only once its weights are set could also reduce the burden on each GPU; then again, that only helps during weight loading, so you don't run out of memory at initialization. As far as I can tell, most approaches, if not all, assume the model weights are loaded onto every GPU, at least initially.
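For illustration, here is a minimal sketch of the meta-device idea in plain PyTorch (the layer shape and the per-layer checkpoint file `layer0.pt` are placeholders, not part of the actual LLaMA code): build the module with no real storage, then materialize it only when that layer's weights arrive.

```python
import torch
import torch.nn as nn

# Build the module on the meta device: shapes are known, but no memory is allocated.
with torch.device("meta"):
    layer = nn.Linear(4096, 4096)  # stand-in for a LLaMA Transformer block

# Materialize empty storage on the target GPU, then fill it from a checkpoint shard.
layer = layer.to_empty(device="cuda:0")
state = torch.load("layer0.pt", map_location="cuda:0")  # hypothetical per-layer shard
layer.load_state_dict(state)
```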
To address this, I developed a LLaMA variant distributed across multiple machines and GPUs using Wrapyfi (https://github.com/fabawi/wrapyfi). The outputs of the Transformer blocks are split across devices (similar to fairscale pipelines, but more controllable) and transmitted through ZeroMQ. The performance appears better than CPU-bound variants, and the results more accurate than 8-bit variants (I haven't verified the latter; this is purely based on what the corresponding developers state). I tried the approach on 7B and 13B, and in theory it should work on the larger models. I will try the larger variants soon, but until then, I would appreciate feedback on what works and what doesn't.
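To make the idea concrete, here is a minimal sketch of shipping activations between two pipeline stages over ZeroMQ. This is not Wrapyfi's actual API (the address, port, and function names are assumptions); it only illustrates that each machine holds a subset of the Transformer blocks and only the hidden states cross the wire.

```python
import torch
import zmq

NEXT_STAGE = "tcp://192.168.0.2:5555"  # hypothetical address of the next machine

def send_hidden(hidden: torch.Tensor) -> None:
    """Runs on the machine holding the first chunk of Transformer blocks."""
    sock = zmq.Context.instance().socket(zmq.PUSH)
    sock.connect(NEXT_STAGE)
    # Only the activations are transmitted; the weights stay on their own device.
    sock.send_pyobj(hidden.detach().cpu().numpy())

def recv_hidden() -> torch.Tensor:
    """Runs on the machine holding the next chunk of Transformer blocks."""
    sock = zmq.Context.instance().socket(zmq.PULL)
    sock.bind("tcp://*:5555")
    hidden = torch.from_numpy(sock.recv_pyobj())
    return hidden.to("cuda:0")  # continue the forward pass locally
```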
We'd like to know how to run the larger models (13B or 65B). Thanks!
Thanks @yokie121! Check out the example in the repo's README, under
Running 7B on 4 machines
The same applies to larger models: if the 2-machine variant worked, and then the 4-machine/GPU setup worked, it should work on the larger models as long as you have sufficient VRAM. To run 13B, do not change nproc_per_node; it is always 1 with our version of LLaMA. Instead, change the model location to 13B and adjust --wrapyfi_device_idx and --wrapyfi_total_devices.
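For example, a 13B run split across two devices might look like the following (the script name and checkpoint/tokenizer paths here are assumptions; take the exact command from the repo's README and only change the pieces shown):

```bash
# device 0 (first half of the Transformer blocks)
torchrun --nproc_per_node 1 example.py --ckpt_dir ./13B --tokenizer_path ./tokenizer.model \
    --wrapyfi_device_idx 0 --wrapyfi_total_devices 2

# device 1 (second half of the Transformer blocks)
torchrun --nproc_per_node 1 example.py --ckpt_dir ./13B --tokenizer_path ./tokenizer.model \
    --wrapyfi_device_idx 1 --wrapyfi_total_devices 2
```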
https://github.com/modular-ml/wrapyfi-examples_llama