
Bug: Phi-3 4K output broken after 2000~ tokens (Reproducible) #7709

Open
Amadeus-AI opened this issue Jun 3, 2024 · 12 comments
Labels
bug (Something isn't working) · medium severity (Used to report medium severity bugs in llama.cpp, e.g. malfunctioning features but still usable) · model (Model specific)

Comments

@Amadeus-AI

Amadeus-AI commented Jun 3, 2024

What happened?

To reproduce:
Download the officially released gguf model from huggingface/microsoft.
Run server.exe -m Phi3-mini-4k.gguf -c 4096

When the input prompt is shorter than ~2048 tokens: output is fine, but it starts getting weird right after the total (prompt plus generation) passes ~2048.
When the input prompt is longer than ~2048 tokens: output is weird from the start.

The weird output looks like what you would expect when the context exceeds what the model supports, but it happens at around 2048 tokens, well within the advertised 4K context, which suggests a bug.

Also tested Llama3-8B, which works fine with input prompts < 8192 as expected (with -c 8192), and also works fine with input prompts < 4096 as expected (with -c 4096).
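For reference, a minimal script to drive the reproduction against the server's /completion endpoint (a sketch only: it assumes the server started with the command above is listening on the default 127.0.0.1:8080, and the filler text is just a stand-in for any prompt long enough to push past ~2048 tokens):

```python
import requests

# Assumes: server.exe -m Phi3-mini-4k.gguf -c 4096, listening on the default port.
URL = "http://127.0.0.1:8080/completion"

# Any prompt long enough to push the total past ~2048 tokens shows the issue.
long_prompt = "Summarize the following article.\n" + ("some filler text " * 600)

resp = requests.post(URL, json={"prompt": long_prompt, "n_predict": 256})
print(resp.json()["content"])  # output degrades once ~2048 tokens are crossed
```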

Name and Version

version: 3015 (74b239b)
built with MSVC 19.39.33523.0 for x64

Tried both the CUDA and AVX2 builds.

Also tried the latest version, built myself with Intel SYCL:
version: 3075 (3d7ebf6)
built with IntelLLVM 2024.1.0

What operating system are you seeing the problem on?

Win10, Win11

Relevant log output

Before ~2000 tokens and after:
[screenshot showing the output before and after ~2000 tokens]

@Amadeus-AI Amadeus-AI added bug-unconfirmed high severity Used to report high severity bugs in llama.cpp (Malfunctioning hinder important workflow) labels Jun 3, 2024
@Amadeus-AI Amadeus-AI changed the title Bug: Phi-3 mini output get weird after 2048 tokens Bug: Phi-3 4K weird output after 2048 tokens Jun 3, 2024
@Amadeus-AI Amadeus-AI changed the title Bug: Phi-3 4K weird output after 2048 tokens Bug: Phi-3 4K weird output after 2000~ tokens Jun 3, 2024
@matteoserva
Contributor

I can confirm this. I asked it to summarize an article in Italian; everything is fine until it hits the 2000-token wall, after which it outputs garbage.
The model uses sliding window attention with a window of 2048 tokens. It might be related.
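To illustrate what that would mean in practice, here is a toy sketch of a sliding-window mask versus a plain causal mask (window size 2048 as quoted above; this is an illustration of the concept, not llama.cpp's actual masking code):

```python
WINDOW = 2048  # sliding-window size quoted for Phi-3 mini 4K

def causal_allowed(q: int, k: int) -> bool:
    # Full causal attention: a token sees every earlier position.
    return k <= q

def sliding_window_allowed(q: int, k: int) -> bool:
    # Sliding-window attention: a token only sees the last WINDOW positions.
    return k <= q and q - k < WINDOW

# Below ~2048 tokens the two masks are identical, which matches the report
# that output is fine up to that point and degrades right after.
for q in (1000, 2047, 2500, 3000):
    same = all(causal_allowed(q, k) == sliding_window_allowed(q, k)
               for k in range(q + 1))
    print(q, "masks identical" if same else "masks diverge")
```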

@Amadeus-AI Amadeus-AI changed the title Bug: Phi-3 4K weird output after 2000~ tokens Bug: Phi-3 4K output broken after 2000~ tokens (Reproducible) Jun 4, 2024
@Galunid
Collaborator

Galunid commented Jun 4, 2024

Can you try 6369bf0 and 201cc11 to see if there's a difference? The first one should work alright; the second should break.

@Amadeus-AI
Author

@Galunid
version: 2960 (6369bf0)
built with IntelLLVM 2024.1.0

Still breaks.

@Galunid Galunid added bug Something isn't working model Model specific medium severity Used to report medium severity bugs in llama.cpp (e.g. Malfunctioning Features but still useable) and removed bug-unconfirmed high severity Used to report high severity bugs in llama.cpp (Malfunctioning hinder important workflow) labels Jun 4, 2024
@Galunid
Collaborator

Galunid commented Jun 4, 2024

I can reproduce it; there seems to be an issue with the initial implementation in #6852.

@ggerganov
Owner

It's most likely the missing sliding window, as pointed out earlier.

@jggc

This comment was marked as off-topic.

@jggc

This comment was marked as off-topic.

@Galunid
Collaborator

Galunid commented Jun 4, 2024

@jggc This issue is about the Phi-3 model degrading in quality before it runs out of context, so I marked your comments as off-topic. Quality degradation after you run out of context is expected, and from what I understood that is the case here.

@jggc

jggc commented Jun 4, 2024

Indeed my behavior is slightly different, but it is still degradation WITHIN the context length. I posted in this thread instead of opening a new issue since it had enough similarities that I thought it might be related.

I'll rephrase to make things clearer :

  1. Start server
  2. Call /completion with a short prompt such as "What is 2+2"
  3. Response is OK
  4. Call /completion with a long prompt exceeding context length
  5. It generates garbage, as expected in this case
  6. Call /completion with the short prompt again, "What is 2+2"
  7. Get garbage output; this is not expected. The server's model state should not stay broken after a single prompt exceeded the context length in the session.

At this point, no matter what I do I won't get sensible responses until I restart the server.

@Galunid Let me know if I should open a new bug. It is reproducible; I could write a gist.
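For reference, the sequence above can be scripted roughly like this (a sketch under the same assumptions as before: default server address, and a long prompt that simply exceeds the -c value; exact request fields may differ across server versions):

```python
import requests

URL = "http://127.0.0.1:8080/completion"  # assumes the default server address

def complete(prompt: str, n_predict: int = 64) -> str:
    r = requests.post(URL, json={"prompt": prompt, "n_predict": n_predict})
    return r.json()["content"]

print(complete("What is 2+2?"))   # step 3: sensible answer
print(complete("word " * 6000))   # step 5: exceeds the context, garbage expected
print(complete("What is 2+2?"))   # step 7: still garbage until the server restarts
```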

@ngxson
Collaborator

ngxson commented Jul 2, 2024

Interestingly, phi-3-small uses a combination of sliding window + blocksparse attention. So even if we get a hack for the sliding window (as used by gemma 2), it will still be messy to support phi-3 properly.

Link to paper: https://arxiv.org/pdf/2404.14219

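As a very rough picture of why that combination is harder to emulate than a plain sliding window, here is a toy mask combining the two (the block size, stride, and keep rule below are made-up placeholders; the actual phi-3-small pattern is the one described in the paper linked above):

```python
WINDOW = 2048   # sliding-window size
BLOCK = 64      # hypothetical block size for the sparse part
KEEP_EVERY = 4  # hypothetical rule: keep every 4th earlier block outside the window

def allowed(q: int, k: int) -> bool:
    """Toy combined mask: local window plus a sparse subset of older blocks."""
    if k > q:              # causal: never attend to future positions
        return False
    if q - k < WINDOW:     # inside the sliding window: always visible
        return True
    # outside the window: only a sparse subset of key blocks remains visible
    return (k // BLOCK) % KEEP_EVERY == 0

# A late query position still sees only part of the old context.
q = 3500
visible = sum(allowed(q, k) for k in range(q + 1))
print(f"query {q} attends to {visible} of {q + 1} key positions")
```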

@njsyw1997

ONNX Runtime has the same bug. If they manage to fix it, their fix might serve as a reference for us.
microsoft/onnxruntime-genai#552

@CASE-R

CASE-R commented Jul 22, 2024

Commenting to see if there has been an update or solution to this before it gets closed for inactivity. We've been facing this issue for a month now, and using the 128K context models is problematic given our available hardware.
