🚀 The feature, motivation and pitch
Dynamic time warping applied to the encoder-decoder cross-attention matrices of Whisper models can be used to find a word-level alignment between audio and transcriptions. openai/whisper provides an implementation of this in `find_alignment`, which returns timestamps (start and end) for each word in the transcription (the `text_tokens` argument).
This has various use cases for us, and it would be great to have this capability exposed via vLLM.
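For context, the core of the technique is a dynamic time warping pass over a token-by-frame cost matrix derived from the cross-attention weights. Below is a minimal, illustrative sketch of that step only; the actual openai/whisper implementation additionally selects specific alignment heads, applies median filtering and scaling, and maps the token-level path to word boundaries via the tokenizer:

```python
# Illustrative DTW step only -- not the openai/whisper implementation.
import numpy as np

def dtw_path(cost: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Monotonic alignment path through a (num_tokens, num_frames) cost matrix."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    trace = np.zeros((n + 1, m + 1), dtype=np.int8)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = (acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            k = int(np.argmin(candidates))
            acc[i, j] = cost[i - 1, j - 1] + candidates[k]
            trace[i, j] = k
    # Backtrack from the bottom-right corner to recover the path.
    i, j, tokens, frames = n, m, [], []
    while i > 0 and j > 0:
        tokens.append(i - 1)
        frames.append(j - 1)
        k = trace[i, j]
        if k == 0:
            i, j = i - 1, j - 1
        elif k == 1:
            i -= 1
        else:
            j -= 1
    return np.array(tokens[::-1]), np.array(frames[::-1])

# cross_attn: attention averaged over the chosen alignment heads,
# shape (num_text_tokens, num_audio_frames); higher = stronger alignment.
cross_attn = np.random.rand(12, 150)          # placeholder data
token_idx, frame_idx = dtw_path(-cross_attn)  # DTW on the negated attention
# Each encoder frame covers roughly 20 ms of audio, so frame_idx * 0.02 gives
# approximate per-token times, which are then merged into word boundaries.
```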
Alternatives
- One alternative is to call the reference implementation of `find_alignment` from Python directly, once for each sample in a batch of audio samples (or to implement a variant of `find_alignment` capable of handling batched inputs); see the per-sample sketch after this list.
- whisper.cpp and the code implemented in this PR are also an option.
Both options are feasible but:
- they require the client/user to run custom Python or native code
- neither alternative is efficient or fast for a large number of (possibly concurrent) audio inputs/requests
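For illustration, this is roughly what the first alternative looks like client-side, assuming a recent openai/whisper version where `transcribe(..., word_timestamps=True)` calls `find_alignment` internally; the file names below are placeholders:

```python
# Per-sample loop over the reference implementation (openai/whisper).
import whisper

model = whisper.load_model("small")

for path in ["sample_0.wav", "sample_1.wav"]:   # one call per audio sample
    result = model.transcribe(path, word_timestamps=True)
    for segment in result["segments"]:
        for word in segment["words"]:
            print(f'{word["start"]:6.2f}s - {word["end"]:6.2f}s  {word["word"]}')
```

Note that this public entry point only aligns the model's own transcription; aligning arbitrary `text_tokens` requires calling `find_alignment` directly, which ties into the last comment below.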
Additional context
This is the PR for initial Whisper support in vLLM, but AFAIK there is no support for alignment yet.
Two more comments, looking at the reference implementation of `find_alignment`:
- batching the encoder inference should be easy, whereas decoder batching is probably more complicated (due to flash attention and the bookkeeping of the cross-attention matrices)
- `text_tokens` can be a transcription produced by the Whisper model itself, but doesn't have to be (it can be any other sequence of tokens, possibly from another model or from human-labeled data). As such, it would be great if vLLM also supported user-provided token sequences for this; a purely hypothetical request shape is sketched below.
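Purely to make the ask concrete, here is one hypothetical shape such an alignment request could take; none of these field or parameter names exist in vLLM today:

```python
# Hypothetical request shape only -- nothing like this exists in vLLM yet.
# The point is that the alignment input should accept both server-side
# transcriptions and user-provided token sequences.
alignment_request = {
    "model": "openai/whisper-large-v3",
    "audio": "<base64-encoded or referenced audio>",
    # Optional: if omitted, the server would transcribe first and then align.
    "text_tokens": [50364, 2425, 11, 1029, 13],  # example token ids
    "response_format": "word_timestamps",         # start/end per word
}
```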