-
Notifications
You must be signed in to change notification settings - Fork 10.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bring back the ggml model format and revert breaking mmap change (#613) #711
Conversation
So basically, we were required to change the format to ggjt (though "jt" stands for the initials of jart, it was @slaren who wrote almost all of the code in this #613 – shouldn't his initials be included instead?), which led to versioning issues. All of this for a feature that doesn't function properly? @anzz1 You're doing god's work, if we turn a blind eye to any negative and ego-driven commits that are introduced to this project and fail to address them, we risk leaving the project and seeking a repository or a fork that is managed by someone who is more dedicated to correcting these unsatisfactory mistakes. |
Originally posted by @anzz1 in #693 (comment) Also the wider issue here is that making pull requests pushing new features with a description which can be only best described as a sales pitch, deliberately leaving out any negative consequences and possible incompatibilities, is disingenous at best and outright malicious at worst. Any positives and negatives should be weighed in a honest discussion with feedback from the community. This is an open-source project, not politics. Please leave trying to push your will unto others at the door and collaborate instead. Dismissing everyone who don't agree to your view and implementation isn't the way to go. |
The beautiful thing about open source is anyone has the power to fork the project. The license gives you that freedom. In fact, we've also put a lot of thought into making the codebase as simple and dependency-free as possible, so you not only have the freedom to do it, it'll also be easy! There are several forks of our project, built by people who each have their own goals, and I think it really goes to show how successful we've been. You can do it too. The only thing you'd have to do is choose a name. |
Well let me go get the popcorn ;-) I hope we can resolve this in a friendly manner. My take is that @jart's (and @slaren's) work is really valuable, but it the format change was unfortunate. For example, some people are trying to plot performance vs commits here: #603. A change in the file format makes this more complicated. And for many other users it caused confusion and additional work too. At this point I'm torn on whether a revert is the right thing.
I don't think I'm unterstanding this right: You're saying that memory will not be freed by the OS after the process terminates? |
Let's not jump to conclusions so quickly. Forking can lead to fragmentation, duplication of efforts, increased maintenance burden, and compatibility issues. It should be considered a last resort solution. Collaborating with the original project maintainers and contributing to the existing project can often provide greater benefits for everyone involved, fostering a more united and efficient development process. |
You're understanding it perfectly. The whole raison d'etre for This can have benefits like the ability of using the same loaded model from multiple However, these benefits come with multiple caveats like decreased performance, worse compatibility (this is not a POSIX-only project) and increased complexity. It should be up to an individual whether the benefits outweigh the cost, not something forced upon everyone because a slice of the community thinks they are. In any case, we shouldn't be forcing features which people might want or not want to use to everyone. Especially when it comes with multiple caveats like in this case. There is absolutely no reason to implement this in a forced manner like this, as it's perfectly doable as an option. Sometimes there can be cases where something can't be viably implemented as an option, but this isn't one of those cases. To be perfectly clear: I am not saying to throw |
One of the raisons d'etre of an operating system is to clean up when a process terminates, so I refuse to believe this.
Agree, that's a benefit of mmap. I don't think it needs to be made optional. edit: from
|
@anzz1 You are misunderstanding how mmap and/or memory paging works. The data is not staying in RAM forever. Here are some resources for you:
Once you understand how mmap and memory paging works, you will see there is no reason why not to use mmap, therefore it does not make sense to make it optional. |
The memory mappings are not controlled by the processes but by the kernel, and yes, in typical use-cases the OS will clean up the handles to the memory blocks when process exits, 'freeing' them. However mmap() is specifically designed to separate the handles from the process so they can be shared between processes. Depending on the implementation, this can mean either that it will be cleaned up when the last process holding a handle is existed, or that it needs to be explicitly free'd manually. The best way to think of it is creating a temporary file with explicit "no-flush-to-disk" property attached to it, deleted when all the handles to file do not exist anymore. For Windows, this can be done by calling CreateFile and specifying
The recent issues give a whole different story to this. Platform implementation for mmap() is spotty at best and can also lead to unforeseen problems. And whether you, I, or any single individual think whether they want to use it or not is irrelevant here, even if person X wants to use it, it doesn't mean that person Y should too. Having it as an option also has the added benefit of rooting out the cause of bugs. If it's just baked into a forced feature, it's much harder to say whether a bug X is because of a specific mmap() implementation in platform Y or a bug related to the llama.cpp code. |
Exactly.
I'd consider that a severe kernel bug. If the particular usage of mmap in llama.cpp has issues, we should address them. We might also consider not enabling by default for problematic environments (Windows?). But on proper Unixes, I don't think it needs to be optional. edit: If it is optional, it should be on by default. |
The thing is that this is not a POSIX-only , or a "proper Unix" project no more than it's a "Windows project", or "a project for embedded systems" or a "mobile phone project". This project is all of them. That's kinda the part most people pushing POSIX-features seem to constantly forget (no matter the project), that not everyone lives in that same world. Again, I'm not arguing for not having the feature. I'm arguing for having it as an option. And the unnecessary changing of model format simply has no ground to stand on. What exactly is the problem of going back and rethinking this? It's not like the code would disappear into thin air after reverting, not being able to be used in a better implementation. |
I'm not expert on mmap but in this commit's case my 13 b model loading time went from 55 seconds to 600 seconds or so (10 minutes) Unless there's some huge interference speed boost with mmap, I personally would prefer being able to disable it through parameter like --no-mmap |
I simply cannot understand an argument against individual choice. Whether you think it's a good feature or not is irrelevant. To be honest it's kinda stupid to even be arguing about this. |
I belong to the group of people that don't know how mmap works... I'm neither against nor in favor. I'm just trying to learn more about code and LLM hence following this project. However, if you don't mind, a few short (naive) questions for my understanding. Assume one has more than sufficient ram to load the model and operate the OS. If one is using mmap, does the program then check whether it has sufficient ram anyway and just loads the entire model? Or does is only load parts when they are needed as long as they are needed? Assuming one doesn't have enough ram and in the case of evaluating a LLM, why is this then more performant than swapping: is the goal of the inference not to evaluate the entire network i.e. are we not in essence performing a bunch of multiplications (or dot products) for each layer? I tought we always need to evalute the entire model and hence we can't skip a layer or omit a part of a layer, so we need to load the entire tensors anyway? Note I'm assuming here that the tensors are dense, which i guess is valid since llama cpp doesn't support sparse computations. How does this affect the speed when performing a long chat? Since we can't predict the next question, we don't know which parts of the model are needed in the future. Could it then be the case that the chat is in fact slower due to loading/unloading parts of the model from the disk to the ram? If my reasoning is correct, then I guess in the case of chat it's more interesting to have a bit longer loading time but faster response during the chat. So here an optional mmap flag could be useful? two remarks:
|
@x02Sylvie Please open a separate issue, so we can address it somehow. How are common users supposed to know what is mmap and whether they should enable it? edit: issue has been already opened: #705 |
@KASR I recommended reading the resources I gave in the above comment. Short answer:
The data is loaded lazily into what appears to be a single contiguous virtual memory space. When the data is sequential on disk, it's also sequential in virtual memory, allowing you to have more control over how the data is arranged in virtual memory by controlling its arrangement on disk (this occurred when the file format was changed). RAM only holds parts that have been accessed recently in virtual memory. If a program requests data that isn't there (either because it has never been accessed before or it was accessed and then replaced when the kernel needed to free up some RAM), the data must be lazily loaded (once again). |
To the "thumb down" vote brigade: having a proper understanding of memory management is not a vote for or against this PR. A slowdown of 10x as seen by @x02Sylvie clearly should be investigated, I'm not trying to sweep that under the rug. I agree with @prusnak, though: if people experience slow performance, how are they supposed to know that they should use the My issue with the whole story is solely the format change. If #685 truly makes this unnecessary, as @anzz1 says, I'm in favor of reworking this to the old ggml format, and address the performance regression that people are seeing. |
can't this be resolved by a short info section in the readme? e.g., having a MMAP section saying that the option is there, and the user should experiment with the option to see whether or not it boosts performance for the specific configuration? |
This doubling comes from the os not trusting data in the program, so one in memory on one program, and one in page cache to send data to other programs. |
This makes absolutely no sense, and is not at all how memory management works, and it's fortunate otherwise it would be unreliable to even modify files between programs! For what you describe to happen you would need to use |
You allocate MAP_PRIVATE by mallocing, and you copy data to it using read() instead of mmap. That is why there are two copies. |
With all respect, according to yesterday's Discord chat, @jart considers working with bigger-than-RAM models something not worth supporting. I am not sure if I agree. Multiple people tried that and mention that mmap() converts a model that is impossible to load into the one that works somehow. I tried a 30B model (so, a 20 GB file) on a 16 GB 2-in-1 laptop. It works. But it also spends 20 seconds per token (with a crappy laptoppy NVMe that only boasts 1 GB/s sequential read speed), which makes it not really useful. Maybe somebody with an Optane drive will disagree. Also yesterday in the Discord chat I expressed an opinion that we would need some optimizations on top of mmap if we do want to support larger-than-RAM models. As all weights are processed, approximately in the same order, for every token generated, it is easy to imagine a failure mode when the OS pages in every page of the model file sequentially, and then, when the other half of the file is needed, pages it out as the least-recently used. Which means that, for every token, the system would need to sequentially load all weights from the disk, which is consistent with the "20 seconds per token" figure that I get. If we could mlock() a part of the weights that fits into RAM (instead of the current "all or nothing" situation), then the mlocked part (let's say 14 GB) will always be processed fast, while the other part (6 GB) would still be as slow as the disk. On average, this "mmap + partial mlock" should still be faster than the current mmap solution. BUT - mlock requires privileges, and I think that approximately the same "14 GB statically in RAM, 6 GB loaded and replaced as needed" situation can be achieved without extra privileges, by reading the 14 GB portion of the model into a malloc()-ed buffer, while using mmap() on the rest. Or by managing application buffers in a smart way - e.g. keep the 14 GB buffer with the first part of the model, and a 1 GB buffer into which other parts of the model are loaded dynamically as needed using read(). But this is significantly more complicated than what's currently implemented, and I am not sure that it is a good direction. Maybe there is some combination of madvise() calls that could help this use case without resorting to mlock() or malloc(), but I don't know... EDIT: tested the partial-mlock idea, it works, will file a separate issue for that. |
Even the theoretical optimum is few seconds per token, but mmap + 3 bit
weights would get Llama 30B to 14.7GB, which may be usable on a 16GB
machine if the operating system is able to page everything else out.
…On Mon, Apr 3, 2023 at 9:31 AM CoderRC ***@***.***> wrote:
Solution is on the way there are 2 possible solutions.
#740 <#740>
#734 <#734>
—
Reply to this email directly, view it on GitHub
<#711 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AN5SWAHC4NSAX32I2Y5WTC3W7LUOPANCNFSM6AAAAAAWQJC53U>
.
You are receiving this because you commented.Message ID: <ggerganov/llama.
***@***.***>
|
Does it make inference for 30B more usable on only 16GB of RAM? If so that would be very interesting. |
Looks like memory management thread should be used for low ram system speed increase compatibility. |
Kind-of. Please disregard the previous "20 seconds per token" figure I mentioned before, I forgot to remove "-t 1" from the benchmark. Also please partially disregard the "crappy laptoppy 1 GB/s NVMe" remark - it is valid for one thread only, i.e. if there is no parallelism. With -t 4, and without mlock, I get 13.7 seconds per token, which is too slow. The maximum that I was able to pass to ggml_mlock without the app getting killed is 8.5 GB (perhaps because ggml mlocks also its internal data structures, besides the model itself), and this with -t 4 results in 7.8 seconds per token. So, 1.75x of improvement. For comparison, on the same laptop, the 13B model consumes 419 ms per token with 4 threads, and 433 ms per token with 8 threads. The 7B model consumes 226 ms and 243 ms per token, respectively. |
Before you consider rolling things back to benefit people at the lower end of computing environment, I want to thank the project for allowing me to run the 65B model on a Mac Studio. The model+context is 125GB, it uses the 16 performance cores and the 4 efficiency cores are idle, there is spare 3GB of RAM to run other things like a browser. Inference times for that big model are:
First invocation of ./main takes maybe 30 seconds, next invocations are instantaneous. Thanks! |
Having such a partial mlock solution would be great.
|
Probs the discord is @jart the @patrakov mentioned: |
Is there actually an official or semi official discord ? I think ggml/whisper and llama.cpp would deserve one. Of course the channel/server should be run/owned by gg |
I don't really know what I'm doing. But this little tweak here seems to be doing the trick for me: Would be great if you or anyone else more experienced with this stuff could chime in there and confirm or question whether this solution I got from ChatGPT makes sense universally or not. 🤖 😄
I think GitHub discussions can be quite good for that as well, although the interface isn't that great and it's easy to clutter PR discussions instead (like I'm doing here now: 😝 ) |
FYI, I'm working on an overhaul of the loading code that will
I ran out of time tonight before finishing it, but it should be done tomorrow. |
OK, but I was asking why you were saying that mlock() would prevent a double copy compared to mmap() alone. Or maybe you were in fact saying mmap()+mlock() compared to malloc()+read() ? If so I agree. But that's not what I understood from your comment ;-) |
(sorry, still not finished…) |
I grow tired of reading people arguing against forks in this pr. The primary benefit to open source is that it gives users and consumers the ability to choose control/ownership/maintainership of the programs/sdks/apis they use by voting with their feet. If you don't like how say, for example, microsoft runs windows, you can't vote with your feet to another maintainer of windows nor can you start your own fork (with blackjack! and hookers!). but you can do that here, and I strongly suggest doing so. Everybody in this thread arguing otherwise: shame on you. There is zero reason why @ggerganov or anybody is entitled to be the leader of the llama.cpp project. The act of open sourcing something involves giving away that right. |
You misunderstood the topic as nobody argued against "forks". And of course ggerganov is entitled to be the leader of this project, he's founded it and he's supplied ggml. |
It's not about entitlement it's about division of labour. If every minor disagreement led to a new fork, no open source project would make it past the first few weeks. It divides the community and multiplies effort needlessly. As long as an original maintainer is responsive and collaborative I will choose their repo over forks every time. Also this is an extremely silly thing to fork over considering the project has only existed for a couple weeks. These early breaking architectural changes tend to happen all the time in pre-1.0 software, and |
oh, ya, i agree forking over this would be silly. i just disagree with anybody telling anybody else to not fork. forking and letting people vote with their feet is better then arguments that spawn multiple days and make github threads lag out. the overall trend of: 'don't fork, it will fracture the community' is a non-argument. open source was made to enable that exact thing. edit:
google |
LOL |
holds up imaginary puppet and adorns an appropriately high-pitched voice for the role |
* Add configurable default chat completion format. * Remove chat_template file to avoid circular import * Update llama_types * Add chat format
Revert recent breaking change to model format and regressions introduced by #613
After that the mmap support can be introduced properly in a better thought-out manner
When reintroducing mmap support, it should at least:
#ifdef
'd so the support for it can easily be built (or not built) on supported platforms--mmap
command line option so it can be used optionallyA model format change isn't required here at all, so reverting back to the perfectly functional ggml format makes sense.
If/when future breaking changes to the model format actually has to be done, the proper way to do this is to actually use the versioning introduced in #252 . That is , incrementing the version
LLAMA_FILE_VERSION = 1
and not changing the magicLLAMA_FILE_MAGIC 0x67676d66 // 'ggmf' in hex
. More information about that in this issue: #647