Description
Local LAN setup: 1x 1070 and 2x 4070, configured with the new RPC backend and a patched server to use RPC.
I did a run fully offloading Mixtral Q4_K_M onto the 3 GPUs over RPC, and it all looked good:
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: RPC buffer size = 7043.34 MiB (1070)
llm_load_tensors: RPC buffer size = 9391.12 MiB (4070)
llm_load_tensors: RPC buffer size = 8711.09 MiB (4070)
All layers were offloaded, and the timings I am getting are:
pp (prompt processing): 105.99 tokens per second
tg (text generation): 25.68 tokens per second
This compares to around 5 t/s generation with CPU plus a single 4070, so the better-than-5x speedup is nice, and it seems to be working OK. Some issues I have found so far:
The rpc servers are spamming rpc_get_tensor and rpc_set_tensor messages to the console; this needs to be shut off unless debugging.
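For reference, the usual pattern for fixing this kind of log spam is to gate the traces behind a debug switch instead of printing unconditionally. A minimal sketch in that spirit; GGML_RPC_DEBUG is a name I made up, not the actual flag in ggml-rpc.cpp:

```cpp
#include <cstdio>

// Assumed compile-time switch; the real flag in ggml-rpc.cpp may differ.
#ifdef GGML_RPC_DEBUG
#define RPC_LOG_DEBUG(...) fprintf(stderr, __VA_ARGS__)
#else
#define RPC_LOG_DEBUG(...) ((void)0)
#endif

static void handle_get_tensor() {
    // Per-request trace: compiled out unless built with -DGGML_RPC_DEBUG.
    RPC_LOG_DEBUG("[%s] rpc_get_tensor request\n", __func__);
    // ... actual tensor handling would go here ...
}

int main() {
    handle_get_tensor(); // silent by default; noisy only in debug builds
}
```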
I initially tried a partial offload to two machines (8 GB + 12 GB), but I got an out-of-memory crash on one of the servers, so I am guessing that RPC mode currently does not support mixed CPU and GPU offload, i.e. it is GPU offload only. If the model doesn't fit in GPU memory, is there no possibility to pick up the rest of the layers with the CPU? More of a question. It should be possible to keep the remaining layers that won't fit on the RPC GPUs on the host running the server (my host has 128 GB of RAM).
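To illustrate what I mean, here is a rough sketch of the kind of placement logic I would expect, where layers that don't fit on any RPC GPU fall back to a host CPU buffer. All names and sizes here are made up for illustration; this is not the actual ggml scheduler code:

```cpp
// Illustrative layer-placement sketch, not actual llama.cpp code.
// Assumption: each backend reports its free memory, and layers that do
// not fit on any RPC GPU fall back to a CPU buffer on the host.
#include <cstddef>
#include <cstdio>
#include <vector>

struct Backend {
    const char *name;
    size_t free_bytes; // memory still available on this device
};

int main() {
    std::vector<Backend> gpus = {
        {"RPC[1070]",  8ull << 30},  // 8 GB machine
        {"RPC[4070]", 12ull << 30},  // 12 GB machine
    };
    const int    n_layers    = 33;
    const size_t layer_bytes = 800ull << 20; // ~800 MiB per layer (made up)

    for (int il = 0; il < n_layers; ++il) {
        const char *placed = "CPU (host)"; // fallback when no GPU has room
        for (auto &g : gpus) {
            if (g.free_bytes >= layer_bytes) {
                g.free_bytes -= layer_bytes;
                placed = g.name;
                break;
            }
        }
        printf("layer %2d -> %s\n", il, placed);
    }
}
```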
When the rpc servers crash, they cannot be restarted without a hard restart of the RPC subsystem (restarting rpcbind, etc.). Something is not being cleaned up correctly when the rpc servers crash with a SEGV.
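A guess on my part: if the crashed process leaves its listening port in TIME_WAIT, a new rpc-server instance can fail to bind with EADDRINUSE until the old socket times out. Setting SO_REUSEADDR before bind() is the standard fix for that particular symptom (port 50052 below is just an example):

```cpp
// Sketch: set SO_REUSEADDR so a restarted server can rebind its port
// immediately after a crash, instead of failing with EADDRINUSE.
#include <arpa/inet.h>
#include <cstdio>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    int yes = 1;
    // Without this, bind() can fail while the crashed server's port
    // lingers in TIME_WAIT.
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes));

    sockaddr_in addr{};
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons(50052);       // example port, assumption
    addr.sin_addr.s_addr = htonl(INADDR_ANY);

    if (bind(fd, (sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        close(fd);
        return 1;
    }
    listen(fd, 4);
    puts("listening on :50052");
    close(fd);
    return 0;
}
```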
Also, great work on this feature; it is extremely useful! It would be very good to support mixed CPU and GPU offload with this mode, though, so that the huge models such as DBRX, Command R+, Falcon 180B, and the Llama 3 70B monster could be run if desired.