Supporting log probabilities of prompt tokens in both engine and OpenAI API server (aka echo) #959
Conversation
Hello, thanks for this! Have you tested it using the OpenAI-compatible endpoint? When I try using curl, the server receives the request but I never get a response back. The following is with echo off, but it hangs regardless of whether it is on or off. I may just be impatient, but I am excited to try this out with lm-eval-harness!
What LM were you using for the request? Also, how many GPUs?
Turns out I still had tensor_parallel=2 set, so I'm not sure that issue was related to your fork; setting it to 1 fixed it. I'm running Llama-7B-HF on an RTX 3090. This is the new internal server error I get when I send the following args to http://localhost:8000/v1/completions via the OpenAI endpoint:

JSON args:

Error:
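The exact request body and error were not captured above. For illustration only, a request of this shape can be sent with a short Python script; the model name, prompt, and parameter values below are placeholders rather than the reporter's actual arguments.

```python
# Hypothetical echo-style completion request against a locally running
# OpenAI-compatible server; adjust the model name and prompt as needed.
import requests

payload = {
    "model": "huggyllama/llama-7b",  # placeholder model name
    "prompt": "San Francisco is a",
    "max_tokens": 16,
    "temperature": 0.0,
    "logprobs": 1,
    "echo": True,  # ask the server to return prompt tokens and their logprobs too
}

resp = requests.post("http://localhost:8000/v1/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0])
```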
Could you pull the latest commit? I have tested it with about 4k requests (similar to yours).
I tested the latest commit on my own hardware and on an A6000 on RunPod, and I get the same error on both. Can you send the exact command for the OpenAI API server and the curl request / Python code you used to make the request?
Ah, I have it fixed in the latest commit. It was introduced later by the modifications to the …
It's working on my end now! Thank you for the great work! I am making it work with lm-eval-harness so we can have much faster model evaluation.
@zhuohan123 @WoosukKwon - hi guys, I wonder if we could have someone review the PR. If any changes are requested, I would be happy to work on them. Thanks!
Thank you for your contribution! This is a feature that a lot of our community members are asking for. The PR is relatively complex and I think it may require several rounds of review. I have left some of my initial comments; please take a look. Additionally, is it possible for you to add a test for this functionality here? After this round, I will run this PR, check its correctness, and then perform another round of review. Thanks again!
@zhuohan123 I wonder if there is any update on the review. Please let me know if you have found any blockers.
I have addressed the comments from the previous review. Please let me know if there are any other concerns about merging the PR.
Any update on this? This is very useful.
@wanmok Thanks for all of your contributions so far. I have been looking into this over the last week and tried a simpler modification to the vLLM core to implement this function. Please find my changes in this PR: #1328. Is it possible for you to add the OpenAI endpoint support of echo?
Well... it might take some time to read through the PR and re-implement that. Are you thinking about a new PR after that one is merged, or?
PR #1328 modified the vLLM core so that you can just add `prompt_logprobs=N` in `SamplingParams`, and the returned `RequestOutput` will include a `prompt_logprobs` field. After #1328 is merged into the main branch, the only remaining thing is to let the OpenAI API server use this new feature to support echo. Can you submit a new PR for this after #1328 is merged? I believe the code change to `openai/api_server.py` should be very similar to what you have done in this PR.
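For readers following the thread, here is a minimal sketch of the engine-level usage described above, assuming a vLLM build that already includes the #1328 changes; the model name, prompt, and parameter values are placeholders.

```python
# Sketch of requesting prompt log probabilities directly from the engine,
# as described for PR #1328; values are illustrative only.
from vllm import LLM, SamplingParams

llm = LLM(model="huggyllama/llama-7b")  # placeholder model
params = SamplingParams(max_tokens=8, prompt_logprobs=1)  # top-1 logprobs per prompt token

outputs = llm.generate(["The capital of France is"], params)
for output in outputs:
    # prompt_logprobs holds one entry per prompt token (the first entry is None,
    # since the first token has no preceding context to condition on).
    print(output.prompt_logprobs)
    print(output.outputs[0].text)
```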
Sounds good. I can file another PR for OpenAI server later. Also, since I have been working on a similar implementation in this PR, could you list me as co-author for #1328?
-Yunmo
Sure thing! Actually, in that PR I used some of your testing code. Will add it when I merge the PR.
Cool! Is this available now, I mean the logprobs of prompt tokens?
I suppose that the PR is not ready yet. If you would like to get logprobs of prompt tokens, you could try this PR as a temporary solution.
Any updates on this, please?
According to issue #201, the `echo` function is not currently supported in vLLM. This implementation does not require modifications to individual model files; instead, it places the heavy work on the first generation step. The basic logic of the implementation:

1. Added `echo` to `SamplingParams` so that the engine can be informed of the `echo` request.
2. Modified `Sampler` to process the `echo` request.
   a. Flags from the `SamplingParams` of each `SequenceGroup` are used to determine whether to perform `echo`. Log probabilities of prompt tokens are computed at the first generation step.
   b. Modified `SequenceOutputs` to carry prompt log probabilities from the first generation step so that the engine can post-process them, and `SequenceData` to store the generated log probabilities of prompt tokens. Since log probabilities of prompt tokens are not necessarily among the top-k returns, they are stored in two separate fields, `prompt_logprobs` and `prompt_top_logprobs`.
   c. Changed the numerically unstable computation of `log(softmax())` to `log_softmax()` (see the first sketch after this description).
3. Modified `RequestOutput.from_seq_group` so that log probabilities of prompt tokens can be inserted into the final outputs.
4. Modified the OpenAI API server to support `echo`.
   a. Modified the `create_logprobs` function to reflect the engine output changes.
   b. Handled `echo=True` with `max_tokens=0`, which is a valid case in the OpenAI API. In this case, `SamplingParams.max_tokens` is set to `1` and an `echo_self` flag is enabled; the flag is then used in post-processing to ensure the additionally generated token is removed from the final output (see the second sketch after this description).

Please let me know if you have any questions or concerns with respect to merging this PR. Thanks!