
gpt4all-backend: Fix MPT buffer use, deduplicate sampling and tokenizing #589

Merged
merged 4 commits into nomic-ai:main on May 16, 2023

Conversation

apage43
Member

apage43 commented May 16, 2023

  • Removes the non-thread-safe use of a static inference buffer in the mpt code and uses the model struct's buffer instead (see the sketch below)
  • Deduplicates the tokenizing and sampling code in the gptj and mpt models
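For concreteness, a minimal C++ sketch of the buffer change described in the first bullet; the names (`mpt_model`, `mpt_eval_*`) and sizes are illustrative assumptions, not the exact identifiers in gpt4all-backend:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Before (illustrative): a function-local static buffer shared by every
// caller -- two threads evaluating different models race on the same memory.
bool mpt_eval_before(size_t buf_size) {
    static std::vector<uint8_t> buf;   // shared across all models/threads
    buf.resize(buf_size);
    // ... build and compute the ggml graph inside buf ...
    return true;
}

// After (illustrative): the scratch buffer lives in the per-model struct,
// so each loaded model owns its own inference memory.
struct mpt_model {
    // ... hyperparameters and weight tensors ...
    std::vector<uint8_t> buf;          // per-model inference buffer
};

bool mpt_eval_after(mpt_model & model, size_t buf_size) {
    model.buf.resize(buf_size);        // the resize discussed below
    // ... build and compute the ggml graph inside model.buf ...
    return true;
}
```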

@apage43 changed the title from "Fix MPT buffer use, deduplicate backend code" to "gpt4all-backend: Fix MPT buffer use, deduplicate sampling and tokenizing" on May 16, 2023
@kuvaus
Contributor

kuvaus commented May 16, 2023

This looks awesome! I really like the simplification.

One question:

Since you're changing how buf_size works, does this comment take into account the progress in ggerganov/ggml#145 over the past two days? Right now, if you use a long prompt (say 1.5k characters) with MPT and set batch_size to 20 or 56 in GPT4All-chat, it will crash the GUI with a memory error. The fixes work with relatively low batch sizes, but if this PR works at an arbitrary batch_size, that would be great.

You likely already thought of this since you resize with model.buf.resize(buf_size);, so maybe it just works with arbitrarily large n_predict and batch_size?
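To make the concern concrete, here is a hedged sketch of the kind of sizing logic in question; the formula and constants are placeholders (assumptions), not the actual gpt4all-backend estimate. The point is that if buf_size is estimated without accounting for how many tokens are evaluated per ggml call, a large batch_size can outgrow the buffer:

```cpp
#include <cstddef>

// Placeholder estimate (assumption): intermediate tensors scale roughly with
// the number of tokens evaluated at once (n_batch) times the hidden size.
size_t estimate_buf_size(size_t n_ctx, size_t n_embd, size_t n_batch) {
    const size_t base      = 256u * 1024u * 1024u;         // fixed overhead (guess)
    const size_t per_token = 32u * n_embd * sizeof(float);  // per-token scratch (guess)
    // Sizing from n_ctx alone would miss the n_batch term and overflow
    // when a large batch is used with a long prompt.
    return base + n_ctx * 1024u + n_batch * per_token;
}

// Caller side, matching the resize mentioned above:
//   model.buf.resize(estimate_buf_size(n_ctx, n_embd, n_batch));
```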

@apage43
Member Author

apage43 commented May 16, 2023

Haven't gotten crashes and have put enough text through it in the UI to get the "recalculating context" pause.

Though one of the reasons to use MPT is that it should still kind of work when you override n_ctx to a larger size than it was trained on, and I think the estimate is still possibly too low for that case. There's no way to do that in the UI or llmodel yet, and the memory estimates for all the models would probably need to change to support it; or we could just not allow it on llama and gptj, since they're not likely to work that well at longer context sizes.
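As a hypothetical illustration of the "allow it for MPT only" idea mentioned here (nothing like this exists in llmodel yet; every name below is made up): MPT's ALiBi position biases let it degrade gracefully past its trained context length, while gptj and llama use rotary position embeddings and are not expected to extrapolate as well.

```cpp
#include <algorithm>

enum class ModelFamily { MPT, GPTJ, LLAMA };

// Hypothetical gating: only MPT may override n_ctx beyond its trained size.
int effective_n_ctx(ModelFamily family, int trained_n_ctx, int requested_n_ctx) {
    if (family == ModelFamily::MPT) {
        return requested_n_ctx;   // allowed, but memory estimates must account for it
    }
    return std::min(requested_n_ctx, trained_n_ctx);
}
```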

@kuvaus
Contributor

kuvaus commented May 16, 2023

> Haven't gotten crashes and have put enough text through it in the UI to get the "recalculating context" pause.

Great! That's all I wanted to hear. :)

I didn't know about the extrapolation to longer context lengths. Sounds interesting; it looks like it's worth doing at some point in the future. And yeah, it's probably easiest to allow it for MPT only at first.

manyoso merged commit d14936b into nomic-ai:main on May 16, 2023