Fixed 30s timestamp resets in Whisper long-form transcription #36612

FaresBadrCA · 2025-03-07T19:02:38Z

What does this PR do?

Fixes #34210 and #31942.
This is an alternative to PR #35750

It resolves the issue of timestamps rolling over every 30 seconds in the Whisper model's long-form transcription. It does this by forcing return_segments to be True when return_timestamps is True.
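To make the rollover concrete: Whisper predicts timestamps relative to the start of the current 30-second window, so long-form output needs each window's start offset added back. The sketch below only illustrates that idea; `to_absolute`, the `windows` structure, and the numbers are hypothetical and are not the Transformers implementation.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds

def to_absolute(windows: List[Tuple[float, List[Segment]]]) -> List[Segment]:
    """Shift window-relative timestamps by the start offset of their window.

    `windows` is a hypothetical structure: one (offset, segments) pair per
    decoding window, where the segment times are relative to that window.
    """
    absolute = []
    for offset, segments in windows:
        for start, end in segments:
            absolute.append((round(offset + start, 2), round(offset + end, 2)))
    return absolute

# Illustrative data: the second window starts at 27.36 s, so its relative
# timestamps restart from 0.0 unless the offset is added back.
windows = [
    (0.0, [(0.0, 13.72), (13.72, 20.4), (20.4, 27.36)]),
    (27.36, [(0.0, 6.24), (6.24, 12.64)]),
]

print(to_absolute(windows))
```

Without the offset, the fourth segment would be reported as starting at 0.0 again, which is the reset described in the linked issues.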

Before submitting

Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@eustlb, @Rocketknight1, @gante, @ylacombe

github-actions · 2025-03-07T19:02:50Z

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).

eustlb · 2025-03-11T10:50:30Z

Hey @FaresBadrCA

Thanks a lot for your PR! 🤗
Could you please provide a reproducer of what's not working with #35750 (it works fine in my tests)? That PR takes into account the internals and specifics of Whisper's tricky heuristics, and I'd rather work from it. If it is indeed not doing what's expected, I'd be glad to review your PR.

FaresBadrCA · 2025-03-12T04:06:08Z

Hi @eustlb, below is a snippet I used for testing, using the LinusTech dataset.

import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device
)

# Stream the dataset and take only the first sample
dataset = load_dataset(
    "Whispering-GPT/linustechtips-transcript-audio", split="train", streaming=True
)
sample = list(dataset.take(1))[0]

# Transcribe with timestamps, conditioning each window on the previous tokens
result = pipe(
    sample["audio"],
    return_timestamps=True,
    generate_kwargs={"language": "english", "condition_on_prev_tokens": True},
)
print(result["chunks"][:5])

Running the code twice, once with this PR (#36612) and once with the other PR (#35750), gives the results below.

PR 36612: Using `return_segments=True`

[{'text': " So guys today I'm going to be doing a quick unboxing of the Antec Cooler H2O 620. So this",
  'timestamp': (0.0, 13.72)},
 {'text': ' is the little brother to the Cooler H2O 920. Now the key difference between this one and the 920,',
  'timestamp': (13.72, 20.400000000000002)},
 {'text': " they're both fairly similar in terms of the design, is the thickness of the radiator. So",
  'timestamp': (20.400000000000002, 27.36)},
 {'text': ' So while the 620 uses a thinner style radiator that offers the advantage of better compatibility',
  'timestamp': (27.36, 33.6)},
 {'text': ' with cases on the market, the 920 is going to offer better performance due to the larger',
  'timestamp': (33.6, 40.0)}]

PR 35750: Using timestamp tokens

[{'timestamp': (0.0, 13.72),
  'text': " So guys today I'm going to be doing a quick unboxing of the Antec Cooler H2O 620. So this"},
 {'timestamp': (13.72, 20.4),
  'text': ' is the little brother to the Cooler H2O 920. Now the key difference between this one and the 920,'},
 {'timestamp': (20.4, 27.36),
  'text': " they're both fairly similar in terms of the design, is the thickness of the radiator. So"},
 {'timestamp': (30.0, 36.54),
  'text': ' while the 620 uses a thinner style radiator that offers the advantage of better compatibility with'},
 {'timestamp': (36.54, 43.5),
  'text': ' cases on the market, the 920 is going to offer better performance due to the larger surface area.'}]

Note the fourth segment: it should run from about 27.3 s to 33.6 s, but instead it runs from 30.0 s to 36.5 s. It is delayed by roughly 3 seconds, and that delay carries over to subsequent segments.
I noticed that this issue only happens when `condition_on_prev_tokens=True`.
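As a quick check of that delay, the start times from the two outputs above can be compared directly. This is only a minimal sketch: `start_drift` is a hypothetical helper, and the fourth and fifth segments are split at slightly different words in the two outputs, so the later values are approximate.

```python
# Segment start times (in seconds) copied from the two outputs above.
starts_36612 = [0.0, 13.72, 20.40, 27.36, 33.6]
starts_35750 = [0.0, 13.72, 20.40, 30.0, 36.54]

def start_drift(reference, candidate):
    """Return the per-segment difference in start time (candidate - reference)."""
    return [round(c - r, 2) for r, c in zip(reference, candidate)]

drift = start_drift(starts_36612, starts_35750)
print(drift)

# Index of the first segment whose start differs by more than half a second.
first_shifted = next(i for i, d in enumerate(drift) if abs(d) > 0.5)
print(first_shifted)
```

The first three segments line up exactly, and the shift of roughly 2.6 to 2.9 seconds begins at the fourth segment (index 3).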

For reference, below are the "correct" segments provided in the dataset.

Provided segments (`sample['segments'][:5]`)

{'start': 0.0, 'end': 13.48, 'text': " So guys, today I'm going to be doing a quick unboxing of the Antec Cooler H2O 620."}
{'start': 13.48, 'end': 17.94, 'text': ' So this is the little brother to the Cooler H2O 920.'}
{'start': 17.94, 'end': 22.76, 'text': " Now the key difference between this one and the 920, they're both fairly similar in terms"}
{'start': 22.76, 'end': 27.3, 'text': ' of the design, is the thickness of the radiator.'}
{'start': 27.3, 'end': 33.56, 'text': ' So while the 620 uses a thinner style radiator that offers the advantage of better compatibility'}

eustlb · 2025-03-12T14:14:49Z

I took a look at it, and what you've spotted is actually an issue, thanks a lot for that 🙏

That is exactly why we want to go with #35750: its output should be equivalent to what you get by looking directly at the segments (which is what you're doing in this PR). That is also why this PR won't get merged: we do not want to bypass decoding directly from the output tokens.

Anyway, thanks a lot again for spotting this issue, I added a fix for it in #35750 and will also add a test for it 😊
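For readers who want to check their own pipeline output for this kind of problem, here is a rough sketch of a discontinuity check over the `chunks` list. It is not the test added in #35750; `find_discontinuities` and its `tolerance` parameter are hypothetical, and a real gap can also be legitimate (for example, silence in the audio), so treat any hit as a hint rather than proof of a bug.

```python
def find_discontinuities(chunks, tolerance=0.01):
    """Flag chunks whose start time does not follow the previous chunk's end.

    Returns (index, kind, gap_seconds) tuples, where kind is "rollover" when a
    chunk starts before the previous one ended and "gap" when it starts later.
    """
    issues = []
    for i in range(1, len(chunks)):
        prev_end = chunks[i - 1]["timestamp"][1]
        start = chunks[i]["timestamp"][0]
        if prev_end is None or start is None:
            continue  # the pipeline may leave a timestamp unset
        gap = start - prev_end
        if gap < -tolerance:
            issues.append((i, "rollover", round(gap, 2)))
        elif gap > tolerance:
            issues.append((i, "gap", round(gap, 2)))
    return issues

# Timestamps from the #35750 output reported earlier in this thread.
chunks_35750 = [
    {"timestamp": (0.0, 13.72)},
    {"timestamp": (13.72, 20.4)},
    {"timestamp": (20.4, 27.36)},
    {"timestamp": (30.0, 36.54)},
    {"timestamp": (36.54, 43.5)},
]
print(find_discontinuities(chunks_35750))

# Made-up example of a 30-second style reset.
chunks_reset = [{"timestamp": (0.0, 29.0)}, {"timestamp": (0.0, 5.0)}]
print(find_discontinuities(chunks_reset))
```

On the timestamps above, the check flags the fourth chunk, which starts 2.64 s after the third one ends.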

eustlb · 2025-06-26T14:07:21Z

Closing this now for the above-mentioned reasons.

Fixed 30s timestamp resets in Whisper long-form transcription by enfo…

f36e7e4

…rcing return_segments.

github-actions bot marked this pull request as draft March 7, 2025 19:02

FaresBadrCA marked this pull request as ready for review March 7, 2025 19:30

github-actions bot requested review from ArthurZucker and Rocketknight1 March 7, 2025 19:30

eustlb mentioned this pull request Mar 12, 2025

[Whisper] Pipeline: handle long form generation #35750

Merged

2 tasks

ArthurZucker removed request for ArthurZucker and Rocketknight1 March 20, 2025 10:21

eustlb closed this Jun 26, 2025
