@FaresBadrCA

What does this PR do?

Fixes #34210 and #31942.
This is an alternative to PR #35750

It resolves the issue of timestamps rolling over every 30 seconds in the Whisper model's long-form transcription. It does this by forcing return_segments to be True when return_timestamps is True.
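As a rough illustration of the rollover (a minimal sketch, not the code in this PR or in transformers): Whisper decodes long audio in windows of up to 30 seconds, and timestamps are expressed relative to the start of the current window, so they appear to reset unless each window's start time is added back. The helper name to_absolute and the window values below are hypothetical.

```python
def to_absolute(window_start, relative_chunks):
    """Shift window-relative (start, end) pairs by the window's start time (seconds)."""
    return [
        (round(window_start + start, 2), round(window_start + end, 2))
        for start, end in relative_chunks
    ]


# Hypothetical input: two decoding windows, each with window-relative timestamps.
# The second window is assumed to start at 27.36 s.
windows = [
    (0.0, [(0.0, 13.72), (13.72, 27.36)]),
    (27.36, [(0.0, 6.24), (6.24, 12.64)]),
]

# Without the offset, the third chunk would restart at 0.0 (the rollover).
absolute = [
    pair
    for window_start, chunks in windows
    for pair in to_absolute(window_start, chunks)
]
print(absolute)
```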

Who can review?

@eustlb, @Rocketknight1, @gante, @ylacombe

@github-actions
Contributor

github-actions bot commented Mar 7, 2025

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).

@github-actions github-actions bot marked this pull request as draft March 7, 2025 19:02
@FaresBadrCA FaresBadrCA marked this pull request as ready for review March 7, 2025 19:30
@eustlb
Contributor

eustlb commented Mar 11, 2025

Hey @FaresBadrCA

Thanks a lot for your PR! 🤗
Can you please provide a reproducer of what's not working with #35750 (it works fine in my tests)? That PR takes into account the internals and specifics of Whisper's tricky heuristics, and I'd rather work from it. If it turns out that it is indeed not doing what's expected, I'd be glad to review your PR.

@FaresBadrCA
Author

FaresBadrCA commented Mar 12, 2025

Hi @eustlb, below is a snippet I used for testing, using the LinusTech dataset.

import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device
)

# Stream the dataset so only the first sample is fetched.
dataset = load_dataset(
    "Whispering-GPT/linustechtips-transcript-audio", split="train", streaming=True
)
sample = list(dataset.take(1))[0]

# Long-form transcription with timestamps, conditioning on previously decoded tokens.
result = pipe(
    sample["audio"],
    return_timestamps=True,
    generate_kwargs={"language": "english", "condition_on_prev_tokens": True},
)
print(result["chunks"][:5])

Running the code twice, once with this PR (#36612) and once with the other PR (#35750), gives the results below.

PR 36612: Using return_segments = True

[{'text': " So guys today I'm going to be doing a quick unboxing of the Antec Cooler H2O 620. So this",
  'timestamp': (0.0, 13.72)},
 {'text': ' is the little brother to the Cooler H2O 920. Now the key difference between this one and the 920,',
  'timestamp': (13.72, 20.400000000000002)},
 {'text': " they're both fairly similar in terms of the design, is the thickness of the radiator. So",
  'timestamp': (20.400000000000002, 27.36)},
 {'text': ' So while the 620 uses a thinner style radiator that offers the advantage of better compatibility',
  'timestamp': (27.36, 33.6)},
 {'text': ' with cases on the market, the 920 is going to offer better performance due to the larger',
  'timestamp': (33.6, 40.0)}]

PR 35750: Using timestamp tokens

[{'timestamp': (0.0, 13.72),
  'text': " So guys today I'm going to be doing a quick unboxing of the Antec Cooler H2O 620. So this"},
 {'timestamp': (13.72, 20.4),
  'text': ' is the little brother to the Cooler H2O 920. Now the key difference between this one and the 920,'},
 {'timestamp': (20.4, 27.36),
  'text': " they're both fairly similar in terms of the design, is the thickness of the radiator. So"},
 {'timestamp': (30.0, 36.54),
  'text': ' while the 620 uses a thinner style radiator that offers the advantage of better compatibility with'},
 {'timestamp': (36.54, 43.5),
  'text': ' cases on the market, the 920 is going to offer better performance due to the larger surface area.'}]

Note the fourth segment: it should go from 27.3 to 33.6. Instead, it goes from 30.0 to 36.5. It is delayed by about 3 seconds, and that delay carries over to subsequent segments.
I noticed this issue only happens when condition_on_prev_tokens = True.
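The discrepancy can also be checked programmatically. Below is a minimal sketch that uses only the timestamps printed above (with 20.400000000000002 written as 20.4); the helper names start_offsets and first_gap are made up for this illustration.

```python
# Chunk timestamps copied from the two outputs above.
segments_pr_36612 = [(0.0, 13.72), (13.72, 20.4), (20.4, 27.36), (27.36, 33.6), (33.6, 40.0)]
tokens_pr_35750 = [(0.0, 13.72), (13.72, 20.4), (20.4, 27.36), (30.0, 36.54), (36.54, 43.5)]


def start_offsets(reference, candidate):
    """Difference between chunk start times, pairing chunks by position."""
    return [round(c[0] - r[0], 2) for r, c in zip(reference, candidate)]


def first_gap(chunks, tol=1e-6):
    """Index of the first chunk that does not start where the previous one ended."""
    for i in range(1, len(chunks)):
        if abs(chunks[i][0] - chunks[i - 1][1]) > tol:
            return i
    return None


offsets = start_offsets(segments_pr_36612, tokens_pr_35750)
print(offsets)                       # [0.0, 0.0, 0.0, 2.64, 2.94]
print(first_gap(segments_pr_36612))  # None
print(first_gap(tokens_pr_35750))    # 3
```

Pairing chunks by position is only a rough comparison, since the two runs split the text at slightly different points after the fourth chunk.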

For reference, below are the "correct" segments provided in the dataset.

Provided segments (sample['segments'][:5])

{'start': 0.0, 'end': 13.48, 'text': " So guys, today I'm going to be doing a quick unboxing of the Antec Cooler H2O 620."}
{'start': 13.48, 'end': 17.94, 'text': ' So this is the little brother to the Cooler H2O 920.'}
{'start': 17.94, 'end': 22.76, 'text': " Now the key difference between this one and the 920, they're both fairly similar in terms"}
{'start': 22.76, 'end': 27.3, 'text': ' of the design, is the thickness of the radiator.'}
{'start': 27.3, 'end': 33.56, 'text': ' So while the 620 uses a thinner style radiator that offers the advantage of better compatibility'}

@eustlb
Contributor

eustlb commented Mar 12, 2025

I took a look at it, and what you've spotted is actually an issue, thanks a lot for that 🙏

That is exactly why we want to go with #35750: the output should be equivalent to what you get by looking directly at the segments (which is what you're doing in this PR). That is also why this PR won't get merged: we do not want to bypass decoding directly from the output tokens.

Anyway, thanks a lot again for spotting this issue, I added a fix for it in #35750 and will also add a test for it 😊

@eustlb
Contributor

eustlb commented Jun 26, 2025

Closing this now for the above-mentioned reasons.

@eustlb eustlb closed this Jun 26, 2025
Development

Successfully merging this pull request may close these issues.

Missing timestamp offset using Whisper with pipeline and sequential decoding
