Script to convert Grok-1 weights from raw JAX pickle files. #7058
base: master
Conversation
Does this merge the experts into a single tensor?
It does the opposite -- in the raw data, the 8 experts are part of the same tensor. This splits them, which is also what the chatllm.cpp script does. If there is a way to keep them within one tensor I'm happy to make that change.
The preferred way to export the expert tensors is as a single 3D tensor for all the experts. It is still possible to use one tensor per expert for backwards compatibility, but it forces the model weights to be copied to a buffer while loading, rather than using them directly from the memory mapped file. For large models like grok, I think it is especially important to be able to avoid this copy and use mmap.
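For illustration, a minimal sketch of the difference (Grok-1 shapes; the raw on-disk layout shown here is an assumption):

```python
import numpy as np

n_expert, n_ff, n_embd = 8, 32768, 6144  # Grok-1 MoE dimensions

# As stored in the raw files: all 8 experts live in one tensor (layout assumed).
raw = np.zeros((n_expert * n_ff, n_embd), dtype=np.float16)

# Preferred export: one 3D tensor covering all experts. llama.cpp can then
# use the weights directly from the memory-mapped file.
stacked = raw.reshape(n_expert, n_ff, n_embd)

# Legacy export: one 2D tensor per expert. On load, llama.cpp must copy
# the slices back into a contiguous buffer, which defeats mmap.
per_expert = [stacked[i] for i in range(n_expert)]
```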
Understood. That will actually make the script simpler. Would you happen to know the tensor names I should use in this case? Currently, when splitting, they are:
The tensor names are defined in gguf-py (`llama.cpp/gguf-py/gguf/constants.py`, lines 249–251 at 60325fa). It would be good to use these constants rather than hardcoding the names.
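For example, something along these lines (a sketch; the enum keys are assumed to be the stacked-expert entries at the referenced lines):

```python
import gguf
from gguf.constants import MODEL_TENSOR

# Build the names from the shared templates rather than hardcoded strings.
for bid in range(64):  # Grok-1 has 64 layers
    up_name = gguf.TENSOR_NAMES[MODEL_TENSOR.FFN_UP_EXP].format(bid=bid)
    # e.g. "blk.0.ffn_up_exps" for bid == 0
```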
Thanks! I have updated the branch to no longer split the MoE weights into separate tensors. That simplifies the script, as it's now one weight per file. The original script permuted the order in which these weights are written for some reason; I stopped doing that, so there's now only one list of weight names. I also moved to the values in gguf.TENSOR_NAMES. PTAL.
@heiner, the name of my project is chatllm.cpp.
We can merge after lint fixes
Hm, I tested it. Might need more work.
My apologies. As I said above, I couldn't actually test running the full model on my setup. I will fix @foldl's suggestions. Would you happen to have something like the SHA-1 of each tensor of a checkpoint based on the HF weights? Otherwise I can download those and run that conversion for comparison.
Thanks @foldl for the hints. It's well possible I mixed something else up as well, e.g., swapped two tensors with the same shape and dtype. Would you happen to have a
Thanks. I have removed the multiplication with
It's likely something else is wrong but I'm unsure what it is, and the multiple-hour iteration time makes it infeasible to just try out random things.
Force-pushed from ee5921e to cd38c87.
I added two more fixes. I then compared the output of this PR with Arki05/Grok-1-GGUF/Q8_0 from HF via this script. All tensors are exactly the same now. The changes are described in the commits below.
Unfortunately, I cannot run the Arki05/Grok-1-GGUF/Q8_0 weights on my MacBook as it OOMs. I can run a two-expert version of this PR (very slowly, several minutes per token), but the output is not great:
Could someone with the right hardware run Arki05/Grok-1-GGUF/Q8_0 and see if it's any better? If it is, perhaps I missed some header setting (I didn't see any difference that seemed relevant). Otherwise, I believe this conversion is as good as the quantization allows.
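The comparison script linked above isn't reproduced here; a minimal version of the idea, using gguf-py's reader (file names hypothetical), could look like:

```python
import numpy as np
from gguf import GGUFReader

# Compare every tensor of two GGUF files byte-for-byte.
ours = {t.name: t for t in GGUFReader("grok-1-q8_0.gguf").tensors}
for t in GGUFReader("arki05-grok-1-q8_0.gguf").tensors:
    same = t.name in ours and np.array_equal(ours[t.name].data, t.data)
    print("OK      " if same else "MISMATCH", t.name)
```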
Nice catch 😄
(The Docker image test failure is unrelated to the changes in this PR.) |
Yup, it's bypassed for now. Please rebase on top of current master |
As per ggerganov#7058 (comment). This helps avoid a memcopy when running.
This saves weights in the order in which they appear in the Grok-1 files. Since we operate weight-by-weight now, we no longer need caches and name2key translations. Per reviewer request, I also moved to using keys in gguf.TENSOR_NAMES.
This makes tensors exactly as in https://huggingface.co/Arki05/Grok-1-GGUF/tree/main/Q8_0
This is equivalent to gguf.quantize_q8_0 but doesn't round-trip to NumPy.
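The commit itself isn't shown here, but the idea is presumably along these lines (a sketch of GGUF's Q8_0 layout in torch; ggml's exact rounding mode may differ):

```python
import torch

QK8_0 = 32  # values per Q8_0 block

def quantize_q8_0(x: torch.Tensor) -> torch.Tensor:
    # Each block of 32 values becomes one fp16 scale plus 32 int8 values.
    blocks = x.to(torch.float32).reshape(-1, QK8_0)
    d = blocks.abs().max(dim=-1, keepdim=True).values / 127.0
    scale = torch.where(d == 0, torch.ones_like(d), d)  # avoid 0/0 in all-zero blocks
    qs = (blocks / scale).round().to(torch.int8)
    d16 = d.to(torch.float16).view(torch.int8)  # the 2-byte scale as raw bytes
    return torch.cat([d16, qs], dim=-1).flatten()  # 34 bytes per block
```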
Done.
Does the conversion work correctly now? I can run some tests if you need confirmation.
The actual generations still don't look great, e.g. for Q8_0:
What would be very useful is if you could run both this (in Q8_0) as well as Arki05/Grok-1-GGUF/Q8_0 and let me know if there's any difference in the output, and if so, what plausibly could cause that. I have verified that the actual weights in both cases are the same, so any difference would presumably be kv settings. |
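To narrow that down, one could diff the kv metadata of the two files; a minimal sketch with gguf-py (value decoding omitted, file names hypothetical):

```python
from gguf import GGUFReader

# Print the kv keys and their value types for both files, for manual diffing.
for path in ("grok-1-q8_0.gguf", "arki05-grok-1-q8_0.gguf"):
    print(path)
    for key, field in sorted(GGUFReader(path).fields.items()):
        print(f"  {key}: types={field.types}")
```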
This adds a script to convert the raw weights in the pickle files to GGUF format. This allows using @arki05's work in #6204 directly from the Grok-1 torrent.
Code is based on @foldl's conversion script in chatllm.cpp, which in turn is based on @chu-tianxiang's gist.
Main ideas to avoid excessive memory:
- mmap the raw weight files and process one weight at a time (see the sketch below).

Note that I couldn't run the full model due to RAM constraints, and it's possible I mixed up some tensor names.
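A minimal sketch of that pattern (file names, dtypes, and shapes are hypothetical; the actual script parses the pickle files):

```python
import numpy as np
import gguf

# Hypothetical index: one (gguf tensor name, raw file, shape) entry per weight.
weight_index = [
    ("token_embd.weight", "tensor00000_000", (131072, 6144)),
]

writer = gguf.GGUFWriter("grok-1.gguf", arch="grok")
writer.add_architecture()
for name, path, shape in weight_index:
    # np.memmap leaves the data on disk; the OS pages it in on demand, so
    # peak RAM stays near one tensor rather than the whole model.
    w = np.memmap(path, dtype=np.float16, mode="r", shape=shape)
    writer.add_tensor(name, np.asarray(w))

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```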