Remove repeat operation and add acceleration support for other architectures #395
Conversation
This is very cool and looks reasonable from my glance over the diff. (Couple of docs tweaks, but I can do those.) What do you need from me to get this across the line?
I just need some more time to get falcon/gptneox working. GPT2 needs some additional CUDA-copy kernels; I'll try to ask the CUDA god himself for advice there. I'll probably merge this and create additional PRs if I get one of the other architectures working. This should also enable Metal support for all of the above-mentioned architectures, but I can't test those. Currently I can only confirm gpt-j as working with CUDA acceleration.
No problem, take your time. Let me know if you want me to test anything on Metal.
OK, I played around a bit and here are my results:
I would like to keep the offloading code and work on these problems in additional PRs. It would be great if you could test these architectures on Metal and report whether they work / what error they throw.
Will test soon-ish!
Tested with Metal.

- Llama-2: Works great. No issues.
- GPT-2: Immediate not-implemented:
- GPT-J: Similar to Bloom:
- GPT-NeoX: Immediate invalid-weight:
- MPT: Similar to GPT-2:
So, not a magic fix - looks like Metal is still undercooked - but it's an improvement over the current state of affairs, so I'm going to merge it.
Removes the `op_repeat` ggml operation and replaces it with broadcasting. This should enable GPU acceleration of: `gpt2`, `gptj`, `gptneox` and `falcon`.

Also fixes: #391
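For context, here is a minimal sketch of the kind of pattern such a change targets, written against the underlying ggml C API rather than this PR's diff. The tensor shapes and variable names are illustrative, and it assumes a ggml build whose `ggml_mul` can broadcast its second operand:

```c
#include "ggml.h"

// Sketch: scale activations by a per-channel weight, first with an
// explicit repeat, then relying on broadcasting in ggml_mul.
// Assumes a ggml version where ggml_mul broadcasts its second operand.
int main(void) {
    struct ggml_init_params params = {
        .mem_size   = 16 * 1024 * 1024,  // small scratch arena for this sketch
        .mem_buffer = NULL,
        .no_alloc   = false,
    };
    struct ggml_context * ctx = ggml_init(params);

    const int n_embd   = 8;
    const int n_tokens = 4;

    // x: [n_embd, n_tokens] activations; w: [n_embd] per-channel weight
    struct ggml_tensor * x = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_tokens);
    struct ggml_tensor * w = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_embd);

    // Old pattern: materialize a tiled copy of w with ggml_repeat, then multiply.
    struct ggml_tensor * y_repeat = ggml_mul(ctx, ggml_repeat(ctx, w, x), x);

    // New pattern: let ggml_mul broadcast w across the token dimension.
    struct ggml_tensor * y_bcast = ggml_mul(ctx, x, w);

    (void) y_repeat;
    (void) y_bcast;
    ggml_free(ctx);
    return 0;
}
```

Presumably the benefit for offloading is that the broadcasted form avoids materializing the tiled intermediate tensor on the backend, which is the step the explicit repeat forced.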