Tulu3 70B Q4_K_M (bartowski quant) - works, ~7 tok/s with all layers offloaded.
Llama 3.3 70B Q4_K_M (bartowski quant) - works, ~7 tok/s with all layers offloaded.
Qwen2.5 72B Q4_K_M (official and bartowski quants) - runs but emits garbage when all layers are offloaded; works, but very slowly, with about 40 layers offloaded.
Hello!
I have been experimenting with the following machine configuration:
I have been attempting to run/test the models listed above. I had to comment out line 467 of llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp (at commit ebdee94).
Based on the line I commented out, I suspect this is because Qwen2.5-72B has an intermediate_size of 29568, which is not divisible by 512.
If that is the cause, would it be possible to get Qwen2.5 working over RPC by implementing CUDA-like padding to a multiple of 512 in ggml-rpc.cpp?
I think this RPC functionality is extremely cool. For enthusiasts it is much more lightweight and configurable than the options in other engines, which seem geared toward setting up production inference clusters and all appear to rely on a Docker + Ray combination.