Bug: Phi-3 4K output broken after ~2000 tokens (Reproducible) #7709
Comments
I can confirm this. I asked it to summarize an article in Italian. Everything is fine until it hits the ~2000-token wall; after that it outputs garbage.
I can reproduce this; it seems there's some issue with the initial implementation in #6852.
It's most likely the missing sliding window, as pointed out earlier.
@jggc This issue is about the Phi-3 model degrading in quality before it runs out of context, so I marked your comments as off-topic. Quality degradation after you run out of context is expected, and from what I understand that is what's happening in your case.
Indeed, my behavior is slightly different, but it is still degradation WITHIN the context length. I posted in this thread instead of opening a new issue because it had enough similarities that I thought it might be related. I'll rephrase to make things clearer:
At this point, no matter what I do, I won't get sensible responses until I restart the server. @Galunid Let me know if I should open a new bug. It is reproducible; I could write a gist.
Interestingly, phi-3-small uses a combination of sliding window and blocksparse attention. So even if we get a hack for the sliding window (as used by Gemma 2), it will still be messy to support phi-3 properly. Link to the paper: https://arxiv.org/pdf/2404.14219
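To make the sliding-window point concrete, here is a rough NumPy sketch of how a sliding-window causal mask differs from a plain causal mask. The window size of 2047 is the `sliding_window` value from phi-3-mini-4k's Hugging Face config; everything else here is illustrative and is not llama.cpp's actual implementation:

```python
import numpy as np

def causal_mask(n):
    # Plain causal mask: token i attends to every token j <= i.
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n, window=2047):
    # Sliding-window causal mask: token i attends only to
    # tokens j with i - window < j <= i.
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

n = 4096
plain = causal_mask(n)
windowed = sliding_window_mask(n)

# The two masks agree for the first `window` rows...
print(np.array_equal(plain[:2047], windowed[:2047]))  # True
# ...and diverge from row 2047 onward, i.e. right where the
# reported degradation starts.
print(np.array_equal(plain[2047:], windowed[2047:]))  # False
```

If the model was trained with the windowed mask but inference applies the plain causal mask, positions past ~2047 attend to keys the model never learned to handle, which would explain degradation starting around that point.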
ONNX Runtime has the same bug. If they fix it, their fix might be a useful reference for us.
Commenting to ask whether there has been an update or solution to this before it gets closed for inactivity. We've been facing this issue for a month now, and using the 128K-context models instead is problematic with the hardware we have available.
What happened?
To reproduce:
Download the official GGUF model released by Microsoft on Hugging Face.
Run server.exe -m Phi3-mini-4k.gguf -c 4096
When the input prompt is shorter than ~2048 tokens: output is fine at first, but starts getting weird right after the total (prompt + generated) token count passes ~2048.
When the input prompt is longer than ~2048 tokens: output is weird from the start.
The weird output looks like what you'd expect when the context exceeds what the model supports, but it happens at ~2048 tokens, well within the requested 4096, which suggests a bug.
Also tested Llama3-8B: it works fine with input prompts < 8192 (with -c 8192) and with input prompts < 4096 (with -c 4096), as expected.
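For a scripted check rather than eyeballing the server output, here is a minimal sketch against the server's /completion endpoint, assuming the server was started with the command above on its default port 8080; article.txt is a placeholder for any sufficiently long text:

```python
import requests

# Placeholder: any text long enough that prompt + generation
# exceeds ~2048 tokens in total.
long_prompt = open("article.txt", encoding="utf-8").read()

resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={"prompt": long_prompt, "n_predict": 512},
)
resp.raise_for_status()
print(resp.json()["content"])
# With this bug, the output turns to garbage once the total token
# count crosses ~2048, despite the server running with -c 4096.
```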
Name and Version
version: 3015 (74b239b)
built with MSVC 19.39.33523.0 for x64
Tried both cuda and avx2 version.
Also tried the latest version, built myself with Intel SYCL:
version: 3075 (3d7ebf6)
built with IntelLLVM 2024.1.0
What operating system are you seeing the problem on?
Win10, Win11
Relevant log output
Before ~2000 tokens and after