Tulu3 70B Q4_K_M (bartowski quant) - works, ~7 tok/s with all layers offloaded.
Llama 3.3 70B Q4_K_M (bartowski quant) - works, ~7 tok/s with all layers offloaded.
Qwen2.5 72B Q4_K_M (official and bartowski quants) - runs but emits garbage when all layers are offloaded; works, but very slowly, with about 40 layers offloaded.
Hello!
I have been experimenting with the following machine configuration:
I have been attempting to run/test the models listed above. I had to comment out line 467 of llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp (at commit ebdee94).
Based on the line I commented out, I suspect this is because Qwen2.5-72B has an intermediate_size of 29568, which is not divisible by 512.
If that is the cause, would it be possible to get Qwen2.5 working over RPC by implementing CUDA-like padding to a multiple of 512 in ggml-rpc.cpp?
I think this RPC functionality is extremely cool. For enthusiasts it is much more lightweight and configurable than the options in other engines, which seem geared toward setting up production inference clusters and all appear to rely on a Docker + Ray combination.