[Frontend] Feature: support transcription API with language detection #13465
base: main
Conversation
Signed-off-by: albert <xzr@xzr.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; instead, only a limited check runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 […]
Thanks for your contribution! :)
I think we should take this chance to do some bookkeeping and organize the code around constants a bit better; I've left some comments.
Also, we will need appropriate tests for this feature.
if (
    "v3" in self.model_config.model.lower()
    and self.model_config.hf_config.model_type.lower() == "whisper"
):
    id2token = LANG_ID_TO_LANG_TOKEN["whisper-v3"]
else:
    return default_lang_token
We can't have an approximate check here; I'd rather build the mapping dynamically for all Whisper models sharing the same content. See the comment below for a cleaner alternative.
Agreed. It is better to wait for an officially supported language config and ID-to-token mapping.
"bo": "Tibetan", | ||
} | ||
|
||
LANG_ID_TO_LANG_TOKEN = { |
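For context, the fragment above is cut off mid-definition; the dict under discussion maps a model variant to language-ID/language-token pairs. A minimal sketch of its likely shape (the IDs and tokens below are illustrative placeholders, not the real Whisper-v3 vocabulary entries):

```python
# Illustrative shape only: keys and token IDs are placeholders, not the
# actual Whisper-v3 vocabulary. The outer key selects the model variant,
# the inner dict maps a detected language ID to its special language token.
LANG_ID_TO_LANG_TOKEN = {
    "whisper-v3": {
        0: "<|en|>",
        1: "<|zh|>",
        # ... one entry per language the checkpoint supports
    },
}
```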
I think we need some bookkeeping with this extra dict here @DarkLight1337. How about we use the SupportsTranscription interface to query the model class for a mapping of supported languages?
Otherwise this will quickly become a mess as soon as another model gets added.
If we want to support more ASR models, we need a better way to structure those mappings.
How about we use the SupportsTranscription interface to query the model class for a mapping of supported languages?

Sounds good to me!
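A rough sketch of what querying the model class could look like. The `supported_languages` attribute and `get_lang_token` helper are hypothetical names for illustration; vLLM's actual `SupportsTranscription` interface may differ.

```python
from typing import ClassVar, Mapping, Protocol


class SupportsTranscription(Protocol):
    """Models advertising transcription support expose their own language map."""

    # Hypothetical attribute: language ID -> special language token.
    supported_languages: ClassVar[Mapping[int, str]]


class WhisperForConditionalGeneration(SupportsTranscription):
    # Illustrative values; the real tokens/IDs come from the checkpoint's vocab.
    supported_languages = {0: "<|en|>", 1: "<|zh|>"}


def get_lang_token(model_cls: type, lang_id: int, default_lang_token: str) -> str:
    """Query the model class instead of keeping a global per-model dict."""
    langs = getattr(model_cls, "supported_languages", None)
    if langs is None:
        return default_lang_token
    return langs.get(lang_id, default_lang_token)
```

The advantage over a module-level dict is that adding another ASR model only touches that model's own class.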
    PromptType,
    {
        "encoder_prompt": {
            "prompt": "",
            "multi_modal_data": {
                "audio": audio_data,
            },
        },
        "decoder_prompt": "<|startoftranscript|>",
    },
)
On second thought, it would be nice to avoid running the encoder twice. Ideally one could start the generation with the same decoder prompt as here, have the model predict a lang token, and then continue by forcing the subsequent tokens to be <|lang|><|transcribe|><|notimestamps|>{prompt}.
But I see how this would be invasive to implement.
We could re-use the encoder KV cache which is already populated here, but how do we make sure it's not evicted? Would re-using the same request id here clash with the scheduler policy? @DarkLight1337
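For illustration, a rough sketch of the two-pass workaround being discussed (the model name, prompt shapes, and sampling settings are assumptions, not the exact PR code); the second pass is where the encoder ends up running again:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="openai/whisper-large-v3")  # assumed model; relies on V0 encoder-decoder support

encoder_prompt = {
    "prompt": "",
    "multi_modal_data": {"audio": audio_data},  # audio_data as in the snippet above
}

# Pass 1: let the model emit a single token after <|startoftranscript|>,
# which for Whisper is the language token (e.g. "<|de|>").
detect = llm.generate(
    {"encoder_prompt": encoder_prompt, "decoder_prompt": "<|startoftranscript|>"},
    SamplingParams(max_tokens=1, temperature=0.0),
)
lang_token = detect[0].outputs[0].text

# Pass 2: transcribe with the detected language forced into the decoder prompt.
# The encoder is re-run here, which is the inefficiency noted above.
result = llm.generate(
    {
        "encoder_prompt": encoder_prompt,
        "decoder_prompt": f"<|startoftranscript|>{lang_token}<|transcribe|><|notimestamps|>",
    },
    SamplingParams(max_tokens=440, temperature=0.0),
)
print(result[0].outputs[0].text)
```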
That's why it is a temporary fix. The current method does not take full advantage of vLLM's features.
I suggest not relying on the internals of KV cache for now as it may change significantly in V1.
I assume this would target V0 though; encoder-decoder models are still far from being supported in V1 afaik (#12761).
Oh, I forgot about that. I guess it's fine then.
We could re-use the encoder kv cache which is already populated here, but how do we make sure it's not evicted? Would re-using the same request id here clash with scheduler policy?
@robertgshaw2-redhat @ywang96 might have a better idea regarding this.
Signed-off-by: albert <xzr@xzr.com>
Signed-off-by: albert <xzr@xzr.com>
Add a simple test case
This pull request has merge conflicts that must be resolved before it can be merged.
Hello, I also need this feature so that transcription works on multiple languages :)
Hey!
This is a follow-up to the Whisper transcription API endpoint, contributed by @DarkLight1337 and @NickLucche ([#12909], [#13301]).
The previous code restricted the use of audio with unknown languages, forcing it to be recognized as English. This update, adapted from the transformers library, introduces language detection. This is a temporary fix to support audio with unknown languages; a more robust and general solution will be implemented in the future.
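For readers unfamiliar with how Whisper-style language detection works, the core idea (sketched below with hypothetical helper names; not the exact code from this PR or from transformers) is to decode a single step from <|startoftranscript|> and restrict the logits to the language tokens:

```python
import torch


def detect_language_id(first_step_logits: torch.Tensor,
                       lang_token_ids: list[int]) -> int:
    """Pick the most likely language token at the first decoded position.

    first_step_logits: logits over the vocabulary, shape (vocab_size,).
    lang_token_ids: vocab IDs of the <|xx|> language tokens (model-specific).
    """
    # Mask out everything that is not a language token, then take the argmax.
    masked = torch.full_like(first_step_logits, float("-inf"))
    masked[lang_token_ids] = first_step_logits[lang_token_ids]
    return int(masked.argmax().item())
```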
Thanks to @DarkLight1337, @NickLucche, and others.