
Bring back the ggml model format and revert breaking mmap change (#613) #711

Closed
wants to merge 9 commits

Conversation

anzz1
Contributor

@anzz1 anzz1 commented Apr 2, 2023

Revert the recent breaking change to the model format and the regressions introduced by #613.
After that, mmap support can be reintroduced properly, in a better-thought-out manner.

When reintroducing mmap support, it should at least:

  • be #ifdef'd so support for it can easily be built (or not built) on supported platforms
  • introduce a --mmap command line option so it can be used optionally (see the sketch after this list)
  • not break the current model format; instead, implement the functionality through the C API recently introduced in "Feature: Added api for getting/setting the kv_cache" #685, loading/storing the kv_cache rather than padding the model files themselves
  • have proper feedback from the community and testing on all supported platforms, to make sure it is functional and does not introduce regressions
  • overall, solve these issues before pushing to master; draft pull requests and feature branches exist precisely for this reason
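
For illustration, here is a minimal sketch of how the first two points above could fit together. Everything in it (LLAMA_USE_MMAP, load_params, parse_arg) is hypothetical and not the project's actual API:

```c
#include <stdbool.h>
#include <string.h>

struct load_params {
    bool use_mmap;          // hypothetical flag, set by --mmap below
};

// Hypothetical argument handling: mmap stays off unless both the build
// and the user ask for it.
static void parse_arg(struct load_params * params, const char * arg) {
#ifdef LLAMA_USE_MMAP       // assumed compile-time switch for supported platforms
    if (strcmp(arg, "--mmap") == 0) {
        params->use_mmap = true;
        return;
    }
#endif
    (void) params;
    (void) arg;             // other options handled elsewhere in this sketch
}
```

The loader would then branch on params->use_mmap between an mmap-based path and the existing read()-based path, so the default behaviour stays exactly as before.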

A model format change isn't required here at all, so reverting back to the perfectly functional ggml format makes sense.

If/when a future breaking change to the model format actually has to be made, the proper way to do it is to use the versioning introduced in #252. That is, increment the version (LLAMA_FILE_VERSION = 1) rather than changing the magic (LLAMA_FILE_MAGIC 0x67676d66 // 'ggmf' in hex). More information about that in this issue: #647
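
As a rough sketch of what that looks like in a loader (the macros mirror the names above; check_header and read_u32 are stand-in helpers, not llama.cpp's actual functions):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define LLAMA_FILE_MAGIC   0x67676d66   // 'ggmf' in hex -- stays the same
#define LLAMA_FILE_VERSION 1            // bump this number on a breaking change

static uint32_t read_u32(FILE * fin) {  // minimal stand-in helper
    uint32_t v = 0;
    fread(&v, sizeof(v), 1, fin);
    return v;
}

static void check_header(FILE * fin) {
    uint32_t magic   = read_u32(fin);
    uint32_t version = read_u32(fin);

    if (magic != LLAMA_FILE_MAGIC) {
        fprintf(stderr, "invalid model file: bad magic\n");
        exit(1);
    }
    if (version != LLAMA_FILE_VERSION) {
        fprintf(stderr, "unsupported model file version (convert the model or update llama.cpp)\n");
        exit(1);
    }
}
```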

@anzz1 anzz1 changed the title Revert mmap change (#613) Revert breaking mmap change and change back the model format (#613) Apr 2, 2023
@anzz1 anzz1 changed the title Revert breaking mmap change and change back the model format (#613) Revert breaking mmap change (#613) Apr 2, 2023
@anzz1 anzz1 changed the title Revert breaking mmap change (#613) Bring back the ggml model format and revert breaking mmap change (#613) Apr 2, 2023
@BadisG

BadisG commented Apr 2, 2023

So basically, we were required to change the format to ggjt (though "jt" stands for the initials of jart, it was @slaren who wrote almost all of the code in this #613 – shouldn't his initials be included instead?), which led to versioning issues. All of this for a feature that doesn't function properly?

@anzz1 You're doing god's work. If we turn a blind eye to any negative and ego-driven commits that are introduced to this project and fail to address them, we risk having to leave the project and seek a repository or a fork that is managed by someone more dedicated to correcting these unsatisfactory mistakes.

@anzz1
Contributor Author

anzz1 commented Apr 2, 2023

...
Also the claimed "loading weights 10-100x faster" comes with huge asterisks like "not on all platforms" and "only on subsequent loads as the first load is actually slower", "memory mapping means that the model will stay behind and eat your memory even after the process is closed, this isn't explained and is currently forced with no way to turn it off" and "hasn't actually been properly tested to even function correctly on all platforms", all of which are deliberately suppressed to push this change through.

The whole process has been very disingenuous from the beginning, to the point that it was pretty much force-pushed through without any proper input from or polling of the community. I'd say in general, regarding anything, to be wary when only the positives are discussed and any negative effects are hushed or even dismissed as invalid. That being said, don't take my word for it; make your own conclusions instead.
...

Originally posted by @anzz1 in #693 (comment)

Also, the wider issue here is that making pull requests pushing new features with a description that can only be described as a sales pitch, deliberately leaving out any negative consequences and possible incompatibilities, is disingenuous at best and outright malicious at worst. Any positives and negatives should be weighed in an honest discussion with feedback from the community.

This is an open-source project, not politics. Please leave trying to push your will onto others at the door and collaborate instead. Dismissing everyone who doesn't agree with your view and implementation isn't the way to go.

@jart
Contributor

jart commented Apr 2, 2023

@anzz1 You're doing god's work. If we turn a blind eye to any negative commits that are introduced to this project and fail to address them, we risk having to leave the project and seek a repository or a fork that is managed by someone more dedicated to correcting these unsatisfactory mistakes.

The beautiful thing about open source is anyone has the power to fork the project. The license gives you that freedom. In fact, we've also put a lot of thought into making the codebase as simple and dependency-free as possible, so you not only have the freedom to do it, it'll also be easy! There are several forks of our project, built by people who each have their own goals, and I think it really goes to show how successful we've been. You can do it too. The only thing you'd have to do is choose a name.

@sw
Contributor

sw commented Apr 2, 2023

Well let me go get the popcorn ;-) I hope we can resolve this in a friendly manner.

My take is that @jart's (and @slaren's) work is really valuable, but the format change was unfortunate. For example, some people are trying to plot performance vs commits here: #603. A change in the file format makes this more complicated. And for many other users it caused confusion and additional work too.

At this point I'm torn on whether a revert is the right thing.

memory mapping means that the model will stay behind and eat your memory even after the process is closed

I don't think I'm understanding this right: You're saying that memory will not be freed by the OS after the process terminates?

@mlvzk

mlvzk commented Apr 2, 2023

The beautiful thing about open source is anyone has the power to fork the project. The license gives you that freedom. In fact, we've also put a lot of thought into making the codebase as simple and dependency-free as possible, so you not only have the freedom to do it, it'll also be easy! There are several forks of our project, built by people who each have their own goals, and I think it really goes to show how successful we've been. You can do it too. The only thing you'd have to do is choose a name.

Let's not jump to conclusions so quickly. Forking can lead to fragmentation, duplication of efforts, increased maintenance burden, and compatibility issues. It should be considered a last resort solution. Collaborating with the original project maintainers and contributing to the existing project can often provide greater benefits for everyone involved, fostering a more united and efficient development process.

@anzz1
Contributor Author

anzz1 commented Apr 2, 2023

memory mapping means that the model will stay behind and eat your memory even after the process is closed

I don't think I'm understanding this right: You're saying that memory will not be freed by the OS after the process terminates?

You're understanding it perfectly. The whole raison d'etre for mmap() is the ability to leave stuff in RAM (or swap, albeit if that happens it's completely detrimental to this use case) unlinked from the process itself. Basically it's just storing a file/memory block in RAM which can be accessed from multiple processes.

This can have benefits like the ability to use the same loaded model from multiple main processes. And I'm not arguing that it shouldn't be implemented at all; I'm arguing that it should be implemented in a wiser manner, including the option of not using it. There is also no reason to change the file format, since you can have a different code path for a --mmap command line option which stores/loads the model and the k/v cache in an mmap'ed file/memory block.

However, these benefits come with multiple caveats like decreased performance, worse compatibility (this is not a POSIX-only project) and increased complexity. It should be up to the individual whether the benefits outweigh the costs, not something forced upon everyone because a slice of the community thinks they do.

In any case, we shouldn't be forcing on everyone features which people may or may not want to use, especially when they come with multiple caveats like in this case. There is absolutely no reason to implement this in a forced manner, as it's perfectly doable as an option. Sometimes there are cases where something can't viably be implemented as an option, but this isn't one of them.

To be perfectly clear: I am not saying to throw mmap() in the trash and never implement it. I'm saying that for now, we should revert this change, go back to the drawing board and collaboratively think of a better approach to implementing this feature in an optional manner, so everyone will be satisfied. A proper implementation means a choice of using or not using it. #ifdef's and command line options are everyone's friends.

@sw
Contributor

sw commented Apr 2, 2023

The whole raison d'etre for mmap() is the ability to leave stuff in RAM (or swap, albeit if that happens it's completely detrimental to this use case) unlinked from the process itself.

One of the raisons d'etre of an operating system is to clean up when a process terminates, so I refuse to believe this.

Basically it's just storing a file/memory block in RAM which can be accessed from multiple processes.

Agree, that's a benefit of mmap.

I don't think it needs to be made optional.

edit: from man munmap:

The region is also automatically unmapped when the process is terminated.
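
For reference, the lifetime being discussed is just the ordinary POSIX pattern below. This is a generic sketch with a stand-in filename, not llama.cpp's loader, and error checking is omitted:

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("model.bin", O_RDONLY);   // stand-in path
    struct stat st;
    fstat(fd, &st);

    void * p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                              // the mapping stays valid after close

    /* ... read weights through p ... */

    // munmap(p, st.st_size) is not strictly required here: as man munmap says,
    // the region is unmapped automatically when the process terminates. The file
    // pages may linger in the kernel page cache afterwards, but that is cache
    // the kernel reclaims on demand, not memory leaked by the process.
    return 0;
}
```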

@prusnak
Collaborator

prusnak commented Apr 2, 2023

memory mapping means that the model will stay behind and eat your memory even after the process is closed

@anzz1 You are misunderstanding how mmap and/or memory paging works. The data is not staying in RAM forever. Here are some resources for you:

Once you understand how mmap and memory paging works, you will see there is no reason why not to use mmap, therefore it does not make sense to make it optional.

@anzz1
Contributor Author

anzz1 commented Apr 2, 2023

The whole raison d'etre for mmap() is the ability to leave stuff in RAM (or swap, albeit if that happens it's completely detrimental to this use case) unlinked from the process itself.

One of the raisons d'etre of an operating system is to clean up when a process terminates, so I refuse to believe this.

The memory mappings are not controlled by the processes but by the kernel, and yes, in typical use-cases the OS will clean up the handles to the memory blocks when the process exits, 'freeing' them. However, mmap() is specifically designed to separate the handles from the process so they can be shared between processes. Depending on the implementation, this can mean either that it will be cleaned up when the last process holding a handle has exited, or that it needs to be explicitly free'd manually.

The best way to think of it is as creating a temporary file with an explicit "no-flush-to-disk" property attached to it, deleted when no handles to the file exist anymore.

For Windows, this can be done by calling CreateFile and specifying FILE_ATTRIBUTE_TEMPORARY | FILE_FLAG_DELETE_ON_CLOSE, with the process described in more detail in this blog post: MSDN: It's only temporary
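
Roughly, the Win32 pattern referred to above looks like this. It is a generic sketch with a stand-in filename, not code from this PR, and error checking is omitted:

```c
#include <windows.h>

// Ask for a cached, delete-on-close temporary file: the data is kept by the
// cache manager and the file disappears once the last handle to it is closed.
static HANDLE open_ram_backed_temp(void) {
    return CreateFileA(
        "weights.tmp",                                        // stand-in name
        GENERIC_READ | GENERIC_WRITE,
        FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
        NULL,
        CREATE_ALWAYS,
        FILE_ATTRIBUTE_TEMPORARY | FILE_FLAG_DELETE_ON_CLOSE,
        NULL);
}
```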

Basically it's just storing a file/memory block in RAM which can be accessed from multiple processes.

Agree, that's a benefit of mmap.

I don't think it needs to be made optional.

The recent issues tell a whole different story. Platform support for mmap() is spotty at best and can also lead to unforeseen problems.

And whether you, I, or any single individual wants to use it or not is irrelevant here; even if person X wants to use it, that doesn't mean person Y should too.

Having it as an option also has the added benefit of rooting out the cause of bugs. If it's just baked in as a forced feature, it's much harder to say whether a bug X is caused by a specific mmap() implementation on platform Y or by the llama.cpp code itself.

@sw
Contributor

sw commented Apr 2, 2023

Depending on the implementation, this can mean either that it will be cleaned up when the last process holding a handle has exited,

Exactly.

or that it needs to be explicitly free'd manually.

I'd consider that a severe kernel bug.

If the particular usage of mmap in llama.cpp has issues, we should address them. We might also consider not enabling by default for problematic environments (Windows?). But on proper Unixes, I don't think it needs to be optional. edit: If it is optional, it should be on by default.

@anzz1
Contributor Author

anzz1 commented Apr 2, 2023

If the particular usage of mmap in llama.cpp has issues, we should address them. We might also consider not enabling by default for problematic environments (Windows?). But on proper Unixes, I don't think it needs to be optional.

The thing is that this is not a POSIX-only or "proper Unix" project any more than it's a "Windows project", "a project for embedded systems" or a "mobile phone project". This project is all of them. That's the part most people pushing POSIX features seem to constantly forget (no matter the project): not everyone lives in that same world.

Again, I'm not arguing for not having the feature. I'm arguing for having it as an option.

And the unnecessary model format change simply has no ground to stand on.

What exactly is the problem with going back and rethinking this? It's not like the code would disappear into thin air after reverting, never to be usable in a better implementation.

@x02Sylvie

Once you understand how mmap and memory paging works, you will see there is no reason why not to use mmap, therefore it does not make sense to make it optional.

I'm no expert on mmap, but in this commit's case my 13B model loading time went from 55 seconds to 600 seconds or so (10 minutes).

Previous commit: [screenshot]

After mmap commit: [screenshot]

Unless there's some huge inference speed boost with mmap, I personally would prefer being able to disable it through a parameter like --no-mmap

@anzz1
Contributor Author

anzz1 commented Apr 2, 2023

I simply cannot understand an argument against individual choice. Whether you think it's a good feature or not is irrelevant. To be honest it's kinda stupid to even be arguing about this.

@KASR
Contributor

KASR commented Apr 2, 2023

Once you understand how mmap and memory paging works, you will see there is no reason why NOT to use mmap, therefore it does not make sense to make it optional.

I belong to the group of people that don't know how mmap works... I'm neither against nor in favor. I'm just trying to learn more about code and LLMs, hence following this project.

However, if you don't mind, a few short (naive) questions for my understanding.

Assume one has more than sufficient RAM to load the model and run the OS. If one is using mmap, does the program then check whether it has sufficient RAM anyway and just load the entire model? Or does it only load parts when they are needed, for as long as they are needed?

Assuming one doesn't have enough RAM, and in the case of evaluating an LLM, why is this then more performant than swapping? Is the goal of the inference not to evaluate the entire network, i.e. are we not in essence performing a bunch of multiplications (or dot products) for each layer? I thought we always need to evaluate the entire model and hence can't skip a layer or omit a part of a layer, so we need to load the entire tensors anyway? Note I'm assuming here that the tensors are dense, which I guess is valid since llama.cpp doesn't support sparse computations.

How does this affect the speed when performing a long chat? Since we can't predict the next question, we don't know which parts of the model are needed in the future. Could it then be the case that the chat is in fact slower due to loading/unloading parts of the model from disk to RAM? If my reasoning is correct, then I guess in the case of chat it's more interesting to have a somewhat longer loading time but faster responses during the chat. So here an optional mmap flag could be useful?

two remarks:

  • I saw the video of ggerganov where he was talking to 4 llama models at the same time; for such a use case mmap is clearly a remarkable solution
  • I'm fortunate to have a massive amount of RAM, hence this might result in me having a biased view

@prusnak
Collaborator

prusnak commented Apr 2, 2023

Unless there's some huge inference speed boost with mmap, I personally would prefer being able to disable it through a parameter like --no-mmap

@x02Sylvie Please open a separate issue, so we can address it somehow. How are common users supposed to know what mmap is and whether they should enable it?

edit: issue has been already opened: #705

@prusnak
Collaborator

prusnak commented Apr 2, 2023

@KASR I recommend reading the resources I gave in the above comment. Short answer:

If one is using mmap, does the program then check whether it has sufficient RAM anyway and just load the entire model? Or does it only load parts when they are needed, for as long as they are needed?

The data is loaded lazily into what appears to be a single contiguous virtual memory space. When the data is sequential on disk, it's also sequential in virtual memory, allowing you to have more control over how the data is arranged in virtual memory by controlling its arrangement on disk (this occurred when the file format was changed). RAM only holds parts that have been accessed recently in virtual memory. If a program requests data that isn't there (either because it has never been accessed before or it was accessed and then replaced when the kernel needed to free up some RAM), the data must be lazily loaded (once again).
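
A small way to observe that laziness directly is the Linux-specific sketch below, which uses mincore to report how many pages of a mapping are resident before and after the file is touched (generic demo code, error checking omitted):

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static size_t resident_pages(void * addr, size_t len, long pg) {
    size_t n = (len + pg - 1) / pg, count = 0;
    unsigned char * vec = malloc(n);
    mincore(addr, len, vec);                     // which pages are in RAM right now?
    for (size_t i = 0; i < n; i++) count += vec[i] & 1;
    free(vec);
    return count;
}

int main(int argc, char ** argv) {
    (void) argc;
    int fd = open(argv[1], O_RDONLY);            // any large file, e.g. a model
    struct stat st;
    fstat(fd, &st);
    unsigned char * p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    long pg = sysconf(_SC_PAGESIZE);

    printf("resident before touching: %zu pages\n", resident_pages(p, st.st_size, pg));

    volatile unsigned char sum = 0;              // touch every page to fault it in
    for (off_t i = 0; i < st.st_size; i += pg) sum += p[i];

    printf("resident after touching:  %zu pages\n", resident_pages(p, st.st_size, pg));
    return 0;
}
```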

@sw
Contributor

sw commented Apr 2, 2023

To the "thumb down" vote brigade: having a proper understanding of memory management is not a vote for or against this PR.

A slowdown of 10x as seen by @x02Sylvie clearly should be investigated, I'm not trying to sweep that under the rug.

I agree with @prusnak, though: if people experience slow performance, how are they supposed to know that they should use the --no-mmap option?

My issue with the whole story is solely the format change. If #685 truly makes this unnecessary, as @anzz1 says, I'm in favor of reworking this to the old ggml format and addressing the performance regression that people are seeing.

@KASR
Contributor

KASR commented Apr 2, 2023

I agree with @prusnak, though: if people experience slow performance, how are they supposed to know that they should use the --no-mmap option?

Can't this be resolved by a short info section in the README? E.g., an mmap section saying that the option is there and that the user should experiment with it to see whether or not it boosts performance for their specific configuration?

@CoderRC

CoderRC commented Apr 3, 2023

This doubling comes from the OS not trusting data in the program: one copy in memory in the program, and one in the page cache to send data to other programs.

@wtarreau
Contributor

wtarreau commented Apr 3, 2023

This doubling comes from the OS not trusting data in the program: one copy in memory in the program, and one in the page cache to send data to other programs.

This makes absolutely no sense and is not at all how memory management works, which is fortunate, since otherwise it would be unreliable to even modify files shared between programs! For what you describe to happen you would need to use MAP_PRIVATE to explicitly demand a private copy of your modified pages (copy-on-write), and such a copy would only happen once you write to that area; neither of these things happens here.

@CoderRC

CoderRC commented Apr 3, 2023

You allocate MAP_PRIVATE by mallocing, and you copy data to it using read() instead of mmap. That is why there are two copies.
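
For readers following along, these are the two load paths being compared, as a generic sketch. Whether the read() path really implies a lasting second copy beyond the kernel's page cache is exactly what is disputed in the following comments:

```c
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

// Path A: private heap buffer. read() copies the data out of the kernel page
// cache into memory owned by this process.
static void * load_with_read(int fd, size_t file_size) {
    void * buf = malloc(file_size);
    read(fd, buf, file_size);              // (short reads ignored in this sketch)
    return buf;
}

// Path B: shared mapping. The process reads the page-cache pages directly,
// so there is no separate per-process copy of the weights.
static void * load_with_mmap(int fd, size_t file_size) {
    return mmap(NULL, file_size, PROT_READ, MAP_SHARED, fd, 0);
}
```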

@patrakov

patrakov commented Apr 3, 2023

The benefit of mmap() here is not the loading speed, it's the ability to actually work with a large model that couldn't otherwise fit into memory. There's no point in having to load the whole model into memory when not all of it is used at any instant; that's just a waste of memory that forces other data to be dropped from the cache. With mmap() you keep only the active portion of the file in memory at a time, and only when memory is scarce are other portions of the file dropped from memory.

That's how memory allocators, and everything else, work. Not using mmap() on a system that was designed after 1990 is an offence to all the people who have worked on operating systems since then. I could work with the 13G model on a 4GB ARM board thanks to this. Before that, even the 7G model wouldn't load. THIS is a big difference for end users.

With all respect, according to yesterday's Discord chat, @jart considers working with bigger-than-RAM models something not worth supporting. I am not sure I agree. Multiple people tried it and mention that mmap() converts a model that is impossible to load into one that somehow works.

I tried a 30B model (so, a 20 GB file) on a 16 GB 2-in-1 laptop. It works. But it also spends 20 seconds per token (with a crappy laptoppy NVMe that only boasts 1 GB/s sequential read speed), which makes it not really useful. Maybe somebody with an Optane drive will disagree.

Also yesterday in the Discord chat I expressed an opinion that we would need some optimizations on top of mmap if we do want to support larger-than-RAM models. As all weights are processed, approximately in the same order, for every token generated, it is easy to imagine a failure mode when the OS pages in every page of the model file sequentially, and then, when the other half of the file is needed, pages it out as the least-recently used. Which means that, for every token, the system would need to sequentially load all weights from the disk, which is consistent with the "20 seconds per token" figure that I get.

If we could mlock() a part of the weights that fits into RAM (instead of the current "all or nothing" situation), then the mlocked part (let's say 14 GB) will always be processed fast, while the other part (6 GB) would still be as slow as the disk. On average, this "mmap + partial mlock" should still be faster than the current mmap solution. BUT - mlock requires privileges, and I think that approximately the same "14 GB statically in RAM, 6 GB loaded and replaced as needed" situation can be achieved without extra privileges, by reading the 14 GB portion of the model into a malloc()-ed buffer, while using mmap() on the rest. Or by managing application buffers in a smart way - e.g. keep the 14 GB buffer with the first part of the model, and a 1 GB buffer into which other parts of the model are loaded dynamically as needed using read(). But this is significantly more complicated than what's currently implemented, and I am not sure that it is a good direction. Maybe there is some combination of madvise() calls that could help this use case without resorting to mlock() or malloc(), but I don't know...

EDIT: tested the partial-mlock idea, it works, will file a separate issue for that.
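
A minimal sketch of the "mmap + partial mlock" idea described above, assuming a POSIX system and a sufficient RLIMIT_MEMLOCK; the function name and parameters are illustrative, not an existing llama.cpp API:

```c
#include <stdio.h>
#include <sys/mman.h>

// Map the whole model read-only, then pin only the first `lock_bytes` so that
// part can never be paged out; the remainder stays ordinary demand-paged mmap.
static void * map_model_partial_lock(int fd, size_t model_size, size_t lock_bytes) {
    void * data = mmap(NULL, model_size, PROT_READ, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) return NULL;

    if (lock_bytes > model_size) lock_bytes = model_size;
    if (mlock(data, lock_bytes) != 0) {
        // Not fatal: if privileges or RLIMIT_MEMLOCK don't allow it,
        // fall back to a plain (unlocked) mapping.
        perror("mlock");
    }
    return data;
}
```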

@CoderRC

CoderRC commented Apr 3, 2023

A solution is on the way; there are 2 possible solutions:
#740
#734

@adamritter

adamritter commented Apr 3, 2023 via email

@rabidcopy
Contributor

EDIT: tested the partial-mlock idea, it works, will file a separate issue for that.

Does it make inference for 30B more usable on only 16GB of RAM? If so that would be very interesting.

@CoderRC

CoderRC commented Apr 3, 2023

Looks like the memory management thread should be used for low-RAM system speed and compatibility improvements.

@patrakov

patrakov commented Apr 3, 2023

EDIT: tested the partial-mlock idea, it works, will file a separate issue for that.

Does it make inference for 30B more usable on only 16GB of RAM? If so that would be very interesting.

Kind-of.

Please disregard the previous "20 seconds per token" figure I mentioned before; I forgot to remove "-t 1" from the benchmark. Also please partially disregard the "crappy laptoppy 1 GB/s NVMe" remark - it is valid for one thread only, i.e. if there is no parallelism.

With -t 4, and without mlock, I get 13.7 seconds per token, which is too slow. The maximum that I was able to pass to ggml_mlock without the app getting killed is 8.5 GB (perhaps because ggml mlocks also its internal data structures, besides the model itself), and this with -t 4 results in 7.8 seconds per token. So, 1.75x of improvement.

For comparison, on the same laptop, the 13B model consumes 419 ms per token with 4 threads, and 433 ms per token with 8 threads. The 7B model consumes 226 ms and 243 ms per token, respectively.

@ingenieroariel

[Screenshot 2023-04-02 at 8:43:45 AM]

Before you consider rolling things back to benefit people at the lower end of computing environments, I want to thank the project for allowing me to run the 65B model on a Mac Studio. The model+context is 125GB; it uses the 16 performance cores while the 4 efficiency cores are idle, and there is a spare 3GB of RAM to run other things like a browser.

Inference times for that big model are:

-t 10  304.16 ms per token
-t 12  257.27 ms per token
-t 14  225.79 ms per token
-t 15  212.37 ms per token
-t 16  200.25 ms per token

First invocation of ./main takes maybe 30 seconds, next invocations are instantaneous. Thanks!

@trollkotze

trollkotze commented Apr 3, 2023

@patrakov

EDIT: tested the partial-mlock idea, it works, will file a separate issue for that.

Having such a partial mlock solution would be great.

Also yesterday in the Discord chat
Btw, what Discord are you talking about? I haven't found a working link.

@CoderRC

CoderRC commented Apr 3, 2023

Probs the Discord @patrakov mentioned is @jart's:
From @jart at #91 (comment)
Side Note: I have time to focus solely on solving this issue over the next few days, and that's what I intend to do. In addition to being here on GitHub, anyone is welcome to join me on the "Redbean" Discord server https://discord.gg/AqSvHf4u (channel "ai" where I'm also jart) if you share my personal fondness for low latency collaboration. In the future though, @ggerganov will decide the most appropriate chat venue for those of us helping with his projects.

@cmp-nct
Contributor

cmp-nct commented Apr 3, 2023

Probs the discord

Is there actually an official or semi-official Discord? I think ggml/whisper and llama.cpp would deserve one.
Not that I am fond of Discord but it's the most accessible chat at the moment.

Of course the channel/server should be run/owned by gg

@trollkotze

@patrakov (#711 (comment))

EDIT: tested the partial-mlock idea, it works, will file a separate issue for that.

I don't really know what I'm doing. But this little tweak here seems to be doing the trick for me:
#753

Would be great if you or anyone else more experienced with this stuff could chime in there and confirm or question whether this solution I got from ChatGPT makes sense universally or not. 🤖 😄

@CoderRC (#711 (comment))

Probs the Discord @patrakov mentioned is @jart's:
Thx, will have a look at that.

@cmp-nct (#711 (comment))

Is there actually an official or semi official discord ? I think ggml/whisper and llama.cpp would deserve one.
Not that I am fond of Discord but it's the most accessible chat at the moment.

Of course the channel/server should be run/owned by gg

I think GitHub discussions can be quite good for that as well, although the interface isn't that great and it's easy to clutter PR discussions instead (like I'm doing here now: 😝 )

@comex
Contributor

comex commented Apr 4, 2023

FYI, I'm working on an overhaul of the loading code that will

  • support both mmap and read;
  • support all three formats (ggml, ggmf, ggjt);
  • support multi-file models like before, but improve on the old version by automatically determining the number of parts rather than requiring --n_parts;
  • improve file validation (overlaps with Fix memory bugs in loading code #651, but with even more validation);
  • stop using the per-file type field (f16) entirely in favor of just relying on the per-tensor type/size fields (which improves flexibility, and will make it easier to support the new GPTQ-for-LLaMa models in the future); and
  • be somewhat more cleanly factored than the old code (to partially mitigate the fact that yes, it undoes the simplicity benefit of Make loading weights 10-100x faster  #613).

I ran out of time tonight before finishing it, but it should be done tomorrow.
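
(Not comex's actual code, but a sketch of what telling the three formats apart by their leading magic might look like; the hex constants are simply the ASCII bytes of 'ggml', 'ggmf' and 'ggjt'.)

```c
#include <stdint.h>
#include <stdio.h>

enum llama_file_format { FORMAT_GGML, FORMAT_GGMF, FORMAT_GGJT, FORMAT_UNKNOWN };

// Illustrative only: peek at the first 4 bytes to decide which loader to use.
static enum llama_file_format detect_format(FILE * f) {
    uint32_t magic = 0;
    if (fread(&magic, sizeof(magic), 1, f) != 1) return FORMAT_UNKNOWN;

    switch (magic) {
        case 0x67676d6c: return FORMAT_GGML;  // 'ggml' -- original, unversioned
        case 0x67676d66: return FORMAT_GGMF;  // 'ggmf' -- magic + version field
        case 0x67676a74: return FORMAT_GGJT;  // 'ggjt' -- the new mmap-friendly layout
        default:         return FORMAT_UNKNOWN;
    }
}
```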

@wtarreau
Contributor

wtarreau commented Apr 4, 2023

You allocate MAP_PRIVATE by mallocing, and you copy data to it using read() instead of mmap. That is why there are two copies.

OK, but I was asking why you were saying that mlock() would prevent a double copy compared to mmap() alone. Or maybe you were in fact saying mmap()+mlock() compared to malloc()+read() ? If so I agree. But that's not what I understood from your comment ;-)

kg7y added a commit to j1oi/github-drama that referenced this pull request Apr 5, 2023
@comex
Contributor

comex commented Apr 5, 2023

(sorry, still not finished…)

@ggerganov ggerganov closed this Apr 5, 2023
@sw sw mentioned this pull request Apr 5, 2023
@MrStonedOne

I grow tired of reading people arguing against forks in this pr.

The primary benefit of open source is that it gives users and consumers the ability to choose the control/ownership/maintainership of the programs/SDKs/APIs they use by voting with their feet.

If you don't like how, say, Microsoft runs Windows, you can't vote with your feet for another maintainer of Windows, nor can you start your own fork (with blackjack! and hookers!).

But you can do that here, and I strongly suggest doing so.

Everybody in this thread arguing otherwise: shame on you. There is zero reason why @ggerganov or anybody is entitled to be the leader of the llama.cpp project. The act of open sourcing something involves giving away that right.

@cmp-nct
Contributor

cmp-nct commented Apr 5, 2023

I grow tired of reading people arguing against forks in this pr.
Everybody in this thread arguing otherwise: shame on you. There is zero reason why @ggerganov or anybody is entitled to be the leader of the llama.cpp project. The act of open sourcing something involves giving away that right.

You misunderstood the topic, as nobody argued against "forks".
People argued against using the word "fork" as a way to shut down discussion of non-final decisions about the future of the project.

And of course ggerganov is entitled to be the leader of this project: he founded it and he supplied ggml.
Both the project's core and ggml are significantly beyond the abilities of most people working on this project (including those who were the focus of this discussion).
That's the beauty of such open projects: people can make useful contributions without understanding all of it.

@green-s

green-s commented Apr 5, 2023

Everybody in this thread arguing otherwise: shame on you. There is zero reason why @ggerganov or anybody is entitled to be the leader of the llama.cpp project. The act of open sourcing something involves giving away that right.

It's not about entitlement, it's about division of labour. If every minor disagreement led to a new fork, no open source project would make it past the first few weeks. It divides the community and multiplies effort needlessly. As long as an original maintainer is responsive and collaborative, I will choose their repo over forks every time.

Also this is an extremely silly thing to fork over considering the project has only existed for a couple weeks. These early breaking architectural changes tend to happen all the time in pre-1.0 software, and mmap is a basic feature of all modern operating systems, used in all major machine learning frameworks. qwopqwop's GPTQ repo breaks backwards compatibility practically every day but nobody's talking about forking that.

@MrStonedOne

MrStonedOne commented Apr 5, 2023

Oh, yeah, I agree forking over this would be silly.

I just disagree with anybody telling anybody else not to fork.

Forking and letting people vote with their feet is better than arguments that span multiple days and make GitHub threads lag out.

The overall trend of "don't fork, it will fracture the community" is a non-argument. Open source was made to enable that exact thing.

edit:

If every minor disagreement led to a new fork, no open source project would make it past the first few weeks.

Google SS13 for an example of how expansive a fork-first mindset can make a community: bigger, faster, stronger, more productive.

@rweichler

This is an open-source project, not politics.
-@anzz1 (#711 (comment))

LOL

@ggerganov ggerganov deleted the revert-mmap branch April 24, 2023 19:20
@biorpg

biorpg commented Aug 24, 2023

holds up imaginary puppet and puts on an appropriately high-pitched voice for the role
#711: Hi everybody! I am a relevant philosophical discussion again!

Deadsg pushed a commit to Deadsg/llama.cpp that referenced this pull request Dec 19, 2023
* Add configurable default chat completion format.

* Remove chat_template file to avoid circular import

* Update llama_types

* Add chat format