convert-hf : save memory with lazy evaluation #7075
Conversation
For example, loading `model-00001-of-00001.safetensors` now works.
* convert-hf : fix stacking MoE expert tensors. `torch.stack` and `torch.cat` don't do the same thing.
* convert-hf : fix Mamba conversion. Tested to work even with a SentencePiece-based tokenizer.
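On the `torch.stack` vs `torch.cat` point above, a minimal illustration with hypothetical expert shapes (not code from the converter):

```python
import torch

# Two hypothetical expert tensors of shape (4, 8).
a = torch.zeros(4, 8)
b = torch.ones(4, 8)

# torch.stack adds a new leading dimension: shape (2, 4, 8).
stacked = torch.stack([a, b], dim=0)

# torch.cat concatenates along an existing dimension: shape (8, 8).
concatenated = torch.cat([a, b], dim=0)

print(stacked.shape)       # torch.Size([2, 4, 8])
print(concatenated.shape)  # torch.Size([8, 8])
```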
`os.listdir` is said to list files in arbitrary order. Sorting the file names should let "model-00009-of-00042.safetensors" be loaded before "model-00010-of-00042.safetensors".
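A small sketch of the idea (the helper name and directory argument are illustrative, not the script's actual code):

```python
import os

def sorted_model_parts(model_dir: str) -> list[str]:
    # os.listdir gives no ordering guarantee, so sort the names explicitly
    # to load model-00009-of-00042.safetensors before model-00010-of-00042.safetensors.
    return sorted(
        name for name in os.listdir(model_dir)
        if name.endswith(".safetensors")
    )
```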
It seems `Protocol` can't be used as a statically type-checked ABC, because its subclasses also can't be instantiated. (Why did it seem to work?) At least there's still a way to throw an error when forgetting to define the `model_arch` property of any registered `Model` subclasses.
…tion. There are no abstract methods used anyway, so using ABC isn't really necessary.
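One way to throw that error, sketched here with a hypothetical registry decorator (illustrative only, not the script's actual mechanism), is to check for the attribute when a subclass is registered:

```python
class Model:
    # Registered subclasses are expected to define this class attribute.
    model_arch: str

    _registry = {}

    @classmethod
    def register(cls, name):
        def wrapper(subclass):
            # Fail loudly at registration time instead of at conversion time.
            if not hasattr(subclass, "model_arch"):
                raise TypeError(f"{subclass.__name__} must define 'model_arch'")
            cls._registry[name] = subclass
            return subclass
        return wrapper


@Model.register("example")
class ExampleModel(Model):
    model_arch = "example"
```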
This looks awesome! A lot of great changes! Any differences in performance if RAM capacity isn't a bottleneck? Or is the speed the same with or without? If unknown, I can also test.
(force-pushed from b361d69 to 98db434)
I don't know, since on my machine RAM capacity is a bottleneck, so lazy conversion is always better there. I think the performance shouldn't be that different if there's enough RAM. It's still reading, calculating, and writing the same things: either everything is first read, computed, and logged, kept in RAM, then all written (with `--no-lazy`), or each tensor is read, computed, and written as it's needed (the lazy default).

I personally think the UX (user experience) is better with lazy conversion, because the (new) progress bar shown when writing the tensors gives a good enough ETA, while in non-lazy conversion (with `--no-lazy`) most of the time is spent before any writing starts.
Yes, this would be appreciated. This way we'll know whether or not there's an overhead with lazy conversion vs non-lazy conversion. Not sure how to get a consistent measurement though, with filesystem caches and stuff like that (though lazy conversion has an advantage there, since it leaves more RAM for the filesystem cache).
Instead of pre-loading them all into a dict, iterate on the tensors in the model parts progressively as needed in `Model.write_tensors`. Conversion for some architectures relies on checking for the presence of specific tensor names, so for multi-part models, the weight map is read from the relevant JSON file to quickly get these names up-front.
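For reference, a minimal sketch of reading those names from the index file of a sharded safetensors model (the helper is illustrative; the file name and `weight_map` layout are the standard sharded-safetensors format):

```python
import json
import os

def tensor_names_from_index(model_dir: str) -> list[str]:
    # Sharded safetensors models ship a model.safetensors.index.json file
    # whose "weight_map" maps each tensor name to the part file containing it.
    index_path = os.path.join(model_dir, "model.safetensors.index.json")
    with open(index_path, "r", encoding="utf-8") as f:
        index = json.load(f)
    return list(index["weight_map"].keys())
```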
It does seem likely that there shouldn't be much if any performance hit, since realistically there's no good reason to have ALL tensors in memory at once. While doing this, any insight into converting bf16 to fp16 by first going to fp32? Feels like it should be a straightforward process, but I haven't dived into the black box yet.
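For what it's worth, that round trip through fp32 can be done with plain NumPy bit manipulation, since a bfloat16 value is just the upper 16 bits of a float32 (a sketch, not the converter's actual code):

```python
import numpy as np

def bf16_to_fp16(bf16_bits: np.ndarray) -> np.ndarray:
    # bf16_bits holds raw bfloat16 bit patterns as uint16.
    # Widening to uint32 and shifting left by 16 yields the float32 bit pattern,
    # because bfloat16 is float32 with the low 16 mantissa bits dropped.
    fp32 = (bf16_bits.astype(np.uint32) << 16).view(np.float32)
    # The float32 -> float16 cast then does the usual rounding/range conversion.
    return fp32.astype(np.float16)

raw = np.array([0x3F80, 0xC000, 0x7F80], dtype=np.uint16)  # 1.0, -2.0, +inf in bfloat16
print(bf16_to_fp16(raw))  # [ 1. -2. inf]
```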
I agree it's a cool showcase, but I'd argue it's not worth it to keep an obsolete file in the master branch.
Afaik all produce slightly different results. Establishing only one right way to generate models would go a long way towards reducing the issues we need to respond to.
I didn't know about this, but it is useful. It should be simple enough to add it to …. I guess we could maybe move it to ….
Correct, the idea was …. I'm personally interested in small convert scripts that don't necessarily have all the cool features of ….

About this PR: shall we proceed to merge? Not sure if anyone else would like to review the changes.
Not a super in-depth review, but nothing stood out to me.
I've found a way to support `bfloat16` output, even if NumPy doesn't have the type:

```python
if data.dtype != np.float32:
    data = data.astype(np.float32)
data.dtype = np.int32  # reinterpret the float32 bits in place
data = (data >> 16).astype(np.int16)
```

This way, only the upper 16 bits of each `float32` value are kept, which is exactly the `bfloat16` bit pattern.
The irony is that HuggingFace heavily relies on torch under the hood. The abstractions make it difficult to see unless you really poke at it. This statement is accurate though. The transformers framework makes me think of the blob, consuming everything around it. Regardless, I've been waiting for this PR to get merged.
@compilade I didn't see that for some reason when I looked here earlier! I must've just missed your post. You can bit-shift the values in Python, but there is a decent enough margin of error that precision is noticeably affected. I had written tests for this, and it's just due to the way Python handles numbers. There's no way around it except to implement an FFI. It's cool that this workaround seems to do the trick though. I like it because it's simple. However, expect precision to be affected.
@teleprint-me Not sure what you've tried before, but I don't think any precision is lost in this process when converting from `bfloat16`. The values are first converted from `bfloat16` to `float32` (which is exact), then only the upper 16 bits of each `float32` value are kept, which gives back the original `bfloat16` bit pattern. From my understanding, this is a lossless conversion. (EDIT: …)
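A quick NumPy check of that round trip (illustrative only; it mirrors the bit-shifting shown above rather than the converter's exact code):

```python
import numpy as np

# Hypothetical bfloat16 values, stored as their raw uint16 bit patterns.
bf16_bits = np.array([0x3F80, 0x4049, 0xBFC0], dtype=np.uint16)  # 1.0, ~3.14, -1.5

# bfloat16 -> float32: place the 16 bits in the high half of a float32 (exact).
fp32 = (bf16_bits.astype(np.uint32) << 16).view(np.float32)

# float32 -> bfloat16: keep the upper 16 bits again.
back = (fp32.view(np.uint32) >> 16).astype(np.uint16)

assert np.array_equal(bf16_bits, back)  # the round trip is bit-exact
```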
I've noticed a problem with metadata: the quantization types of the tensors are not using the same enum values as the file types. This makes …. This requires adding new constants in ….
Unfortunately I'm force-pushing this branch back to before 1eccde6 to match the reviewed changes.
I think we should go with ….

Here is some more information about the file-type enums (it's a bit messy): the C++ enum used to determine the file type is at lines 108 to 114 in 83330d8. This enum is specific to …. So now, …. But anyway, the oversight here that you noticed concerns the F32 and F16 values of ….
(force-pushed from 1eccde6 to bffdaf4)
I've added …. Again, reverting back, because I need to give this …. I will then proceed to merge this PR (after the tests finish again, so in an hour, I guess?) with the previously-approved changes, because I think they fulfill the goal of this PR and because there was no disagreement in the 2 days since the last change. (EDIT: some Docker test is failing because of network errors. I guess this happens, but it doesn't seem related to this PR, so I'm merging anyway.)
(force-pushed from cad22e1 to bffdaf4)
…erganov#7075 (shameless copy from LlamaModel).
@compilade sorry to resurrect this, but looks like lazy conversion is broken for Falcon:
Using `--no-lazy` it converts with no issue (so far). Using the newest Falcon model: https://huggingface.co/tiiuae/falcon-11B
@bartowski1182 try again with the latest ….
@compilade oh thanks, I'll wait for the release and give it another shot!
In #7075, to fix the conversion of (some) models using model-00001-of-00001.safetensors instead of model.safetensors for a single model part, we simply used the same logic as the part count to get the part names. But this doesn't always work correctly, like when unusual additional model files such as consolidated.safetensors in https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3 are present. This commit, by matching both the prefix and the suffix of the model part names, should fix this problem without breaking any previously-supported upstream models. According to a report by @teleprint-me there is still some persistent problem, but this shall do in the meantime.
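Roughly, the part-name selection described above amounts to something like this (a sketch with an illustrative helper, not the exact code):

```python
import os

def model_part_names(model_dir: str, prefix: str, suffix: str) -> list[str]:
    # Match both ends of the file name, so "model-00001-of-00001.safetensors"
    # is picked up while unrelated files like "consolidated.safetensors" are skipped.
    return sorted(
        name for name in os.listdir(model_dir)
        if name.startswith(prefix) and name.endswith(suffix)
    )

# e.g. model_part_names(".", "model", ".safetensors")
```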
Continued from #7031 so that all shape-changing operations are made with PyTorch.
This was motivated by some remarks about `convert-hf-to-gguf.py` using too much RAM and/or `convert.py` being better for Llama models because it uses lazy tensors. This fixes the RAM usage problem by bringing lazy conversion for ALL model architectures supported by `convert-hf-to-gguf.py` (assuming none were broken with #7031).

Summary of changes
* Lazy tensors use the `meta` device as well as `__torch_function__` overrides and `__getattr__` switcheroos, wrapped in `LazyTorchTensor`
  * `LazyTorchTensor`s form a tree of tensors to share the eager evaluation when multiple tensors are split from one
* `modify_tensors` stays usable BOTH in lazy evaluation mode and when eagerly evaluating (as before). This keeps the semantics easier for everyone.
* (`.to("meta")` works in Script to convert Grok-1 weights from raw JAX pickle files. #7058, it wasn't really clear from PyTorch's documentation)
* A `LazyTensor` class in `gguf_writer.py` to abstract away how the tensor is actually calculated, without making the `gguf-py` library depend on PyTorch
  * It needs `astype()` and `byteswap()` and `tofile()`
* `convert-hf-to-gguf.py` has a new flag, `--no-lazy`, to let the user choose to not use lazy evaluation (same behavior as before, accumulating everything in RAM before beginning to write in the output file)
* The `modify_tensors` step now happens very very quickly; most time is spent writing to disk. I've added a progress bar in `GGUFWriter.write_tensors_to_file` (which is only enabled in `convert-hf-to-gguf.py`, otherwise it's disabled by default).
  * It uses `tqdm`, which is already in the user's Python environment because it's a dependency of `transformers`
* The progress bar is also used in `gguf-convert-endian.py`, which was a good suggestion from @cebtenzzre in convert.py: add python logging instead of print() #6511 (comment)
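To give a rough idea of how such lazy tensors can work, here is a much-simplified sketch of the technique (a toy `LazyTensorSketch` class, not the script's actual `LazyTorchTensor`), using the `meta` device and the `__torch_function__` protocol:

```python
import torch

class LazyTensorSketch:
    """Record operations on a meta tensor; run them for real only on demand."""

    def __init__(self, meta, materialize):
        self._meta = meta                # shape/dtype tracking on the "meta" device (no data)
        self._materialize = materialize  # zero-argument callable producing the real tensor

    @classmethod
    def from_eager(cls, tensor):
        return cls(tensor.to("meta"), lambda: tensor)

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        # Compute shapes/dtypes on the meta device without touching real data,
        # and remember how to compute the real result later.
        meta_args = [a._meta if isinstance(a, cls) else a for a in args]
        meta_out = func(*meta_args, **kwargs)

        def materialize():
            real_args = [a._materialize() if isinstance(a, cls) else a for a in args]
            return func(*real_args, **kwargs)

        return cls(meta_out, materialize)

    def numpy(self):
        return self._materialize().numpy()


# Nothing is actually computed until .numpy() is called.
lazy = LazyTensorSketch.from_eager(torch.arange(6, dtype=torch.float32).reshape(2, 3))
doubled = torch.mul(lazy, 2)   # dispatched through __torch_function__, still lazy
print(doubled._meta.shape)     # torch.Size([2, 3])
print(doubled.numpy())         # the real computation happens here
```

The real implementation also has to handle tensor methods, `__getattr__`, and sharing evaluation across tensors split from one source, as described in the list above.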
Testing

Everything should pretty much Just Work (TM), but some models use special PyTorch operators and need more careful testing. Ideally, there should be no fatal failure, and the resulting model should be the same when lazy (by default) and when using `--no-lazy`.

* `LlamaModel` with MoE (`torch.stack`, as well as permuting and reshapes for Q and K)
* `MambaModel` (`__neg__` (the unary minus operator), `torch.equal` (which is used in a condition, so the involved tensors need to be eagerly evaluated!))
* `GemmaModel` (`__add__`, to do `+ 1` somewhere)
  * (noticed a RAM usage spike at `tok_embd.weight` when it got converted from `bfloat16` to `float32`, then RAM usage lowered and stayed low. This was the biggest individual tensor in the models I tested.)
* `InternLM2Model` (`__getitem__` for sliced indexing, `torch.Tensor.T` needs to be wrapped, `einops.rearrange` was replaced with reshapes)
  * Replaced the `einops.rearrange` operations with reshapes (was more trivial than I thought). Tested in lazy conversion mode with https://huggingface.co/internlm/internlm2-chat-1_8b and got some very coherent text, so I assume the implementation is correct.
(__getitem__
for sliced indexing)sha256sum
matches when lazily converting and with--no-lazy
for https://huggingface.co/bigscience/bloom-560m (required bypassing the pre-tokenizer check). Inference seems good too.FalconModel
(__getitem__
for sliced indexing)RefactModel
(__getitem__
for sliced indexing)(@bartowski1182, @fairydreaming, @phymbert, @arki05, @slaren, @ggerganov, this will most likely benefit you all because I'm sure you've dealt with converting bigger models than your RAM before)