
[Frontend] Feature: support transcription API with language detection #13465


Open · wants to merge 4 commits into main

Conversation


@mru4913 mru4913 commented Feb 18, 2025

This is a follow-up to the Whisper transcription API endpoint, contributed by @DarkLight1337 and @NickLucche ([#12909], [#13301]).

The previous code did not allow audio in an unknown language, forcing it to be transcribed as English. This update, adapted from the transformers library, introduces language detection. It is a temporary fix to support audio with unknown languages; a more robust and general solution will be implemented in the future.
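The core idea borrowed from transformers can be sketched as follows. This is a simplified illustration, not vLLM's actual API: the model decodes one step from `<|startoftranscript|>`, and the argmax is taken over language tokens only (the token ids and mapping below are an illustrative subset).

```python
# Whisper-style language detection, simplified: restrict the first decoding
# step's logits to language tokens and pick the most probable one.
# LANG_ID_TO_TOKEN here is a tiny illustrative subset, not the real table.
LANG_ID_TO_TOKEN = {50259: "<|en|>", 50260: "<|zh|>", 50261: "<|de|>"}

def detect_language(logits: dict[int, float]) -> str:
    """Pick the most probable language token, ignoring all other vocab ids."""
    lang_ids = [i for i in LANG_ID_TO_TOKEN if i in logits]
    best = max(lang_ids, key=lambda i: logits[i])
    return LANG_ID_TO_TOKEN[best]

# Mock logits from one decoding step: a non-language token (id 42) scores
# highest overall, but only language tokens are considered.
mock_logits = {50259: 1.2, 50260: 0.4, 50261: 3.1, 42: 9.9}
print(detect_language(mock_logits))  # -> "<|de|>"
```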

Thanks to @DarkLight1337 and @NickLucche.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the frontend label Feb 18, 2025
Contributor

@NickLucche NickLucche left a comment


Thanks for your contribution! :)
I think we should take the chance to do some bookkeeping and organize the code around constants a bit better; I've left some comments.

Also, we will need appropriate tests for this feature.

Comment on lines +312 to +318
if (
"v3" in self.model_config.model.lower()
and self.model_config.hf_config.model_type.lower() == "whisper"
):
id2token = LANG_ID_TO_LANG_TOKEN["whisper-v3"]
else:
return default_lang_token
Contributor

@NickLucche NickLucche Feb 18, 2025

We can't have an approximate check here; I'd rather build the mapping dynamically for all Whisper models that share the same content. See the comment below for a cleaner alternative.

Author

Agreed. It is better to wait for an officially supported language config and ID-to-token mapping.

"bo": "Tibetan",
}

LANG_ID_TO_LANG_TOKEN = {
Contributor

I think we need some bookkeeping with this extra dict here @DarkLight1337. How about we use the SupportsTranscription interface to query the model class for a mapping of supported languages?
Otherwise this will quickly become a mess as soon as another model gets added.

Author

@mru4913 mru4913 Feb 18, 2025

If we want to support more ASR models, we need a better way to structure those mappings.

Member

How about we use the SupportsTranscription interface to query the model class for a mapping of supported languages?

Sounds good to me!
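The interface idea discussed above could look roughly like this. This is an assumed sketch, not vLLM's actual SupportsTranscription definition: each model class owns its language mapping, so the frontend queries the class instead of keeping a global per-model dict.

```python
# Sketch of letting each model class expose its supported languages via an
# interface, so the frontend needs no hardcoded per-model dict.
# All names below (SupportsTranscription, get_supported_languages,
# WhisperModel, resolve_language) are illustrative assumptions.
from typing import Protocol

class SupportsTranscription(Protocol):
    @classmethod
    def get_supported_languages(cls) -> dict[str, str]: ...

class WhisperModel:
    # Tiny subset for illustration; the real table has ~100 entries.
    @classmethod
    def get_supported_languages(cls) -> dict[str, str]:
        return {"en": "english", "zh": "chinese", "bo": "tibetan"}

def resolve_language(model_cls: type[SupportsTranscription], lang: str) -> str:
    """Validate a requested language against the model's own mapping."""
    supported = model_cls.get_supported_languages()
    if lang not in supported:
        raise ValueError(f"unsupported language: {lang}")
    return supported[lang]

print(resolve_language(WhisperModel, "bo"))  # -> "tibetan"
```

Adding another ASR model would then only require implementing `get_supported_languages` on that model class.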

Comment on lines +322 to +332
PromptType,
{
"encoder_prompt": {
"prompt": "",
"multi_modal_data": {
"audio": audio_data,
},
},
"decoder_prompt": "<|startoftranscript|>",
},
)
Contributor

On second thought, it would be nice to avoid running the encoder twice. Ideally one could start the generation with the same decoder prompt as here, have the model predict a language token, and then continue by forcing the subsequent tokens to be <|lang|><|transcribe|><|notimestamps|>{prompt}.
But I see how invasive this would be to implement.

We could reuse the encoder KV cache, which is already populated here, but how do we make sure it's not evicted? Would reusing the same request ID here clash with the scheduler policy? @DarkLight1337
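The two-pass flow described above can be sketched as prompt construction. The helper name and default below are hypothetical, not part of vLLM: pass 1 decodes a single language token, then pass 2 restarts generation with that token forced into the decoder prompt.

```python
# Build the forced Whisper decoder prompt once the language token is known
# (pass 2 of the two-pass approach). build_decoder_prompt is a hypothetical
# helper name used for illustration.
def build_decoder_prompt(lang_token: str, prompt: str = "") -> str:
    """Force <|lang|><|transcribe|><|notimestamps|> after the start token."""
    return (
        f"<|startoftranscript|>{lang_token}"
        f"<|transcribe|><|notimestamps|>{prompt}"
    )

print(build_decoder_prompt("<|fr|>"))
# -> "<|startoftranscript|><|fr|><|transcribe|><|notimestamps|>"
```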

Author

@mru4913 mru4913 Feb 18, 2025

That's why it is a temporary fix: the current method does not take full advantage of vLLM's features.

Member

I suggest not relying on the internals of KV cache for now as it may change significantly in V1.

Contributor

I assume this would target V0 though, encoder-decoder models are still far from being supported in V1 afaik #12761

Member

Oh, I forgot about that. I guess it's fine then.

We could re-use the encoder kv cache which is already populated here, but how do we make sure it's not evicted, would re-using the same request id here clash with scheduler policy?

@robertgshaw2-redhat @ywang96 might have a better idea regarding this.

@mru4913
Author

mru4913 commented Feb 19, 2025

Thanks for your contribution! :) I think we should take the chance to do some bookkeeping and organize the code around constants a bit better, left some comments.

Also, we will need appropriate tests for this feature.


Added a simple test case.


mergify bot commented May 19, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mru4913.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 19, 2025
@francis2tm

francis2tm commented May 19, 2025

Hello,
Is this PR stalled? I'm willing to open a new PR if required.

I also need this feature so that transcription works in multiple languages :)

@esceptico

Hey!
Are there any updates?

5 participants