Adding LlamaInfinite model which implements LM-Infinite on Llama #26645
Conversation
Hello! Thank you for taking the time and care to implement this. I'm doing some benchmarking as I'm writing this now :)
Preliminary benchmarking results: as I was writing this, I ran an experiment with my perplexity benchmarking tool (from …). I've run this experiment for: …

Let's go over the details: …

To further support my thoughts here, I've also plotted the latencies. As you can see, both … To summarize my results: I'm not very confident in the benefit that …
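(For readers following along: below is a minimal sketch of the kind of token-by-token perplexity measurement such benchmarks typically perform. This is a generic illustration, not the actual benchmarking tool referenced above; the checkpoint name and `long_text` are placeholders.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.eval()

long_text = "..."  # placeholder: a very long document

# Feed the text one token at a time, reusing the KV cache, and record
# the negative log-likelihood of each next token under the model.
input_ids = tokenizer(long_text, return_tensors="pt").input_ids[0]
nlls, past = [], None
with torch.no_grad():
    for t in range(input_ids.size(0) - 1):
        out = model(input_ids[t : t + 1].unsqueeze(0),
                    past_key_values=past, use_cache=True)
        past = out.past_key_values
        log_probs = out.logits[0, -1].log_softmax(dim=-1)
        nlls.append(-log_probs[input_ids[t + 1]].item())

# Perplexity is the exponential of the mean negative log-likelihood.
perplexity = torch.tensor(nlls).mean().exp()
print(f"perplexity over {len(nlls)} tokens: {perplexity:.2f}")
```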
Hi Tom Aarsen! Thank you so much for taking the time to do a detailed evaluation! I see the point in implementing a plug-in separately for long-term maintenance. I am happy to help in that direction as well (e.g., your efforts in attention_sinks), especially to combine the advantages of both implementations (this and LM-Infinite). To be more specific: …

Again, whatever the outcome and final decisions, I see this as a great chance for combining and benefiting from both implementations.

Chi Han
As a quick comment regarding the decoding section: my experiments using window attention and …
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
What does this PR do?
In this PR, we implement LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models, proposed in August 2023, on the Llama model. LM-Infinite removes the length limits of large language models and enables them to generate sequences of unbounded length with performance close to that at training time, without any parameter updates. Results show that LM-Infinite can encode sequences as long as 128k tokens on a single A100 GPU and can keep generating indefinitely, thanks to its $O(n)$ time and space complexity for encoding and $O(1)$ complexity for decoding. Interestingly, the later StreamingLLM work also observed similar results with a similar technique.
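For intuition, here is a minimal sketch of the Λ-shaped attention mask that LM-Infinite is built on: every query attends to the first few tokens (the global branch) plus a sliding window of the most recent tokens (the local branch), under the usual causal constraint. This is an illustrative reconstruction, not code from this PR; `n_global` and `n_local` are hypothetical parameter names.

```python
import torch

def lambda_shaped_mask(seq_len: int, n_global: int = 10, n_local: int = 4096) -> torch.Tensor:
    """Illustrative Lambda-shaped attention mask (True = may attend).

    Each query position i attends to:
      * the first `n_global` tokens (the "global branch"), and
      * the most recent `n_local` tokens, i.e. keys j with i - j < n_local
        (the "local branch"),
    subject to causality (j <= i).
    """
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, column vector
    j = torch.arange(seq_len).unsqueeze(0)  # key positions, row vector
    causal = j <= i
    global_branch = j < n_global
    local_branch = (i - j) < n_local
    return causal & (global_branch | local_branch)

# Example: with a short window, distant middle tokens are masked out,
# but the first few tokens stay visible to every query.
print(lambda_shaped_mask(8, n_global=2, n_local=3).int())
```

LM-Infinite additionally bounds the relative distances used for positional encoding at the pretraining length; that detail is omitted from this sketch for brevity.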
This implementation is related to, and in response to, an issue discussing the integration of LM-Infinite into Hugging Face Transformers.
This LlamaInfinite model allows for seamless adaptation from the original Llama models, simply by substituting `LlamaForCausalLM.from_pretrained()` with `LlamaInfiniteForCausalLM.from_pretrained()`; all other usage remains the same (see the sketch below). This implementation is compatible with all previous Llama model checkpoints without any modifications, so no new model checkpoints are needed.

Fixes # (issue)
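A hypothetical end-to-end usage sketch of the substitution described above (the checkpoint name is a placeholder; `LlamaInfiniteForCausalLM` is the class this PR proposes):

```python
from transformers import AutoTokenizer
from transformers import LlamaInfiniteForCausalLM  # class proposed in this PR

checkpoint = "meta-llama/Llama-2-7b-hf"  # placeholder: any existing Llama checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Before: model = LlamaForCausalLM.from_pretrained(checkpoint)
model = LlamaInfiniteForCausalLM.from_pretrained(checkpoint)

# Generation arguments are unchanged relative to the original Llama model.
inputs = tokenizer("A very long prompt ...", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```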
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. It is related to this issue.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.