
[Frontend] Feature: support transcription API with language detection #13465


Open · wants to merge 4 commits into main

Conversation


@mru4913 mru4913 commented Feb 18, 2025

This is a follow-up to the Whisper transcription API endpoint, contributed by @DarkLight1337 and @NickLucche ([#12909], [#13301]).

The previous code did not allow audio in an unknown language, forcing it to be transcribed as English. This update, adapted from the transformers library, introduces language detection. It is a temporary fix to support audio with unknown languages; a more robust and general solution will be implemented in the future.
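The core idea borrowed from transformers can be sketched as follows. This is a simplified illustration, not vLLM's actual API: the model decodes one step from `<|startoftranscript|>`, and the argmax is taken over language tokens only (the token ids and mapping below are an illustrative subset).

```python
# Whisper-style language detection, simplified: restrict the first decoding
# step's logits to language tokens and pick the most probable one.
# LANG_ID_TO_TOKEN here is a tiny illustrative subset, not the real table.
LANG_ID_TO_TOKEN = {50259: "<|en|>", 50260: "<|zh|>", 50261: "<|de|>"}

def detect_language(logits: dict[int, float]) -> str:
    """Pick the most probable language token, ignoring all other vocab ids."""
    lang_ids = [i for i in LANG_ID_TO_TOKEN if i in logits]
    best = max(lang_ids, key=lambda i: logits[i])
    return LANG_ID_TO_TOKEN[best]

# Mock logits from one decoding step: a non-language token (id 42) scores
# highest overall, but only language tokens are considered.
mock_logits = {50259: 1.2, 50260: 0.4, 50261: 3.1, 42: 9.9}
print(detect_language(mock_logits))  # -> "<|de|>"
```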

Thanks to @DarkLight1337 and @NickLucche.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the frontend label Feb 18, 2025
Contributor

@NickLucche NickLucche left a comment


Thanks for your contribution! :)
I think we should take the chance to do some bookkeeping and organize the code around constants a bit better; I've left some comments.

Also, we will need appropriate tests for this feature.

Comment on lines +312 to +318
if (
"v3" in self.model_config.model.lower()
and self.model_config.hf_config.model_type.lower() == "whisper"
):
id2token = LANG_ID_TO_LANG_TOKEN["whisper-v3"]
else:
return default_lang_token
Contributor

@NickLucche NickLucche Feb 18, 2025

We can't have an approximate check here; I'd rather build the mapping dynamically for all Whisper models that share the same content. See the comment below for a cleaner alternative.

Author

Agreed. It is better to wait for an officially supported language config and ID-to-token mapping.

"bo": "Tibetan",
}

LANG_ID_TO_LANG_TOKEN = {
Contributor

I think we need some bookkeeping with this extra dict here @DarkLight1337. How about we use the SupportsTranscription interface to query the model class for a mapping of supported languages?
Otherwise this will quickly become a mess as soon as another model gets added.

Author

@mru4913 mru4913 Feb 18, 2025

If we want to support more ASR models, we need a better way to structure those mappings.

Member

How about we use the SupportsTranscription interface to query the model class for a mapping of supported languages?

Sounds good to me!
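The interface idea discussed above could look roughly like this. This is an assumed sketch, not vLLM's actual SupportsTranscription definition: each model class owns its language mapping, so the frontend queries the class instead of keeping a global per-model dict.

```python
# Sketch of letting each model class expose its supported languages via an
# interface, so the frontend needs no hardcoded per-model dict.
# All names below (SupportsTranscription, get_supported_languages,
# WhisperModel, resolve_language) are illustrative assumptions.
from typing import Protocol

class SupportsTranscription(Protocol):
    @classmethod
    def get_supported_languages(cls) -> dict[str, str]: ...

class WhisperModel:
    # Tiny subset for illustration; the real table has ~100 entries.
    @classmethod
    def get_supported_languages(cls) -> dict[str, str]:
        return {"en": "english", "zh": "chinese", "bo": "tibetan"}

def resolve_language(model_cls: type[SupportsTranscription], lang: str) -> str:
    """Validate a requested language against the model's own mapping."""
    supported = model_cls.get_supported_languages()
    if lang not in supported:
        raise ValueError(f"unsupported language: {lang}")
    return supported[lang]

print(resolve_language(WhisperModel, "bo"))  # -> "tibetan"
```

Adding another ASR model would then only require implementing `get_supported_languages` on that model class.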

Comment on lines +322 to +332
PromptType,
{
"encoder_prompt": {
"prompt": "",
"multi_modal_data": {
"audio": audio_data,
},
},
"decoder_prompt": "<|startoftranscript|>",
},
)
Contributor

On second thought, it would be nice to avoid running the encoder twice. Ideally one could start the generation with the same decoder prompt as here, have the model predict a language token, and then continue by forcing the subsequent tokens to be <|lang|><|transcribe|><|notimestamps|>{prompt}.
But I see how invasive this would be to implement.

We could reuse the encoder KV cache, which is already populated here, but how do we make sure it's not evicted? Would reusing the same request ID here clash with the scheduler policy? @DarkLight1337
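The two-pass flow described above can be sketched as prompt construction. The helper name and default below are hypothetical, not part of vLLM: pass 1 decodes a single language token, then pass 2 restarts generation with that token forced into the decoder prompt.

```python
# Build the forced Whisper decoder prompt once the language token is known
# (pass 2 of the two-pass approach). build_decoder_prompt is a hypothetical
# helper name used for illustration.
def build_decoder_prompt(lang_token: str, prompt: str = "") -> str:
    """Force <|lang|><|transcribe|><|notimestamps|> after the start token."""
    return (
        f"<|startoftranscript|>{lang_token}"
        f"<|transcribe|><|notimestamps|>{prompt}"
    )

print(build_decoder_prompt("<|fr|>"))
# -> "<|startoftranscript|><|fr|><|transcribe|><|notimestamps|>"
```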

Author

@mru4913 mru4913 Feb 18, 2025

That's why it is a temporary fix: the current method does not take full advantage of vLLM's features.

Member

I suggest not relying on the internals of KV cache for now as it may change significantly in V1.

Contributor

I assume this would target V0 though, encoder-decoder models are still far from being supported in V1 afaik #12761

Member

Oh, I forgot about that. I guess it's fine then.

We could re-use the encoder kv cache which is already populated here, but how do we make sure it's not evicted, would re-using the same request id here clash with scheduler policy?

@robertgshaw2-redhat @ywang96 might have a better idea regarding this.

@mru4913
Author

mru4913 commented Feb 19, 2025

Thanks for your contribution! :) I think we should take the chance to do some bookkeeping and organize the code around constants a bit better, left some comments.

Also, we will need appropriate tests for this feature.


Added a simple test case.


mergify bot commented May 19, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mru4913.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 19, 2025
@francis2tm

francis2tm commented May 19, 2025

Hello,
Is this PR stalled? I'm willing to open a new PR if required.

I also need this feature so that transcription works in multiple languages :)

@esceptico

Hey!
Are there any updates?

5 participants