Whisper pipeline returns empty segment for each processed audio chunk #36602

@as-suvorov

Description

System Info

  • transformers version: 4.46.3
  • Platform: Linux-5.17.15-051715-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.28.1
  • Safetensors version: 0.5.2
  • Accelerate version: 1.4.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.6.0+cpu (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:

Who can help?

@Rocketknight1 @eustlb

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Hello!

Timestamp processing changed between versions 4.46.3 and 4.47.0. With version 4.47.0, when return_timestamps is enabled, the pipeline returns an empty segment for each processed audio chunk.
To reproduce the issue, run reproducer.py with transformers versions 4.46.3 and 4.47.0.

reproducer.py

from transformers import pipeline
import datasets
import typing


def get_sample_from_dataset():
    ds = datasets.load_dataset(
        "distil-whisper/meanwhile",
        split="test",
        streaming=True,
        trust_remote_code=True,
    )

    ds = typing.cast(datasets.IterableDataset, ds)
    ds = ds.cast_column("audio", datasets.Audio(sampling_rate=16000))
    ds = ds.take(1)

    return next(iter(ds))["audio"]


sample = get_sample_from_dataset()

whisper = pipeline("automatic-speech-recognition", "openai/whisper-tiny")

transcription = whisper(
    sample.copy(),
    return_timestamps=True,
)

print(transcription["text"])

for chunk in transcription["chunks"]:
    print(chunk)

# Output with transformers version 4.46.3
# {'timestamp': (0.0, 3.2), 'text': ' Folks, if you watch the show, you know, I spent a lot of time'}
# {'timestamp': (3.2, 4.64), 'text': ' right over there.'}
# {'timestamp': (4.64, 7.04), 'text': ' Patiently and astutely scrutinizing the boxwood and'}
# {'timestamp': (7.04, 9.28), 'text': ' mahogany chest set of the days, big stories,'}
# {'timestamp': (9.28, 11.84), 'text': ' developing the central headline pawns,'}
# {'timestamp': (11.84, 15.08), 'text': ' definitely maneuvering an OSO topical night to F6,'}
# {'timestamp': (15.08, 16.8), 'text': ' faming of classic Sicilian,'}
# {'timestamp': (16.8, 18.96), 'text': ' named or variation on the news,'}
# {'timestamp': (18.96, 21.0), 'text': ' all the while seeing eight moves deep and'}
# {'timestamp': (21.0, 24.0), 'text': ' patiently marshalling the latest press releases into a'}
# {'timestamp': (24.0, 27.52), 'text': ' Fisher shows in lip nitsky attack that culminates in the'}
# {'timestamp': (0.0, 3.24), 'text': ' The elegant lethal slow played all-pass on checkmate'}
# {'timestamp': (3.24, 5.18), 'text': ' that is my nightly monologue, but sometimes sometimes'}
# {'timestamp': (5.18, 6.0), 'text': ' folks I'}
# {'timestamp': (6.0, 9.0), 'text': ' sometimes I'}
# {'timestamp': (9.0, 13.0), 'text': ' start a little wake upside down in the monkey bars'}
# {'timestamp': (13.0, 15.48), 'text': ' of a condemned playground on a super fun site.'}
# {'timestamp': (15.48, 17.52), 'text': ' Get all hepped up on goofballs, rummage that were'}
# {'timestamp': (17.52, 20.32), 'text': ' discarded tag bag of defective toys.'}
# {'timestamp': (20.32, 23.4), 'text': ' Yank out a fistball of disembodied doll limbs,'}
# {'timestamp': (23.4, 24.96), 'text': " toss them on a stained kid's place,"}
# {'timestamp': (24.96, 27.98), 'text': ' mad from a defunct denies, set up a table inside a rusty'}
# {'timestamp': (27.98, 29.72), 'text': ' cargo container down by the warf,'}
# {'timestamp': (0.0, 2.28), 'text': ' and challenged toothless drifters to the godless,'}
# {'timestamp': (2.28, 5.76), 'text': ' bug house blitz of tournament that is my segment.'}
# {'timestamp': (5.76, 9.56), 'text': ' Me and Wild.'}

# Output with transformers version 4.47.0 (note the empty segments with an end timestamp of 0.0)
# {'timestamp': (0.0, 3.2), 'text': ' Folks, if you watch the show, you know, I spent a lot of time'}
# {'timestamp': (3.2, 4.64), 'text': ' right over there.'}
# {'timestamp': (4.64, 7.04), 'text': ' Patiently and astutely scrutinizing the boxwood and'}
# {'timestamp': (7.04, 9.28), 'text': ' mahogany chest set of the days, big stories,'}
# {'timestamp': (9.28, 11.84), 'text': ' developing the central headline pawns,'}
# {'timestamp': (11.84, 15.08), 'text': ' definitely maneuvering an OSO topical night to F6,'}
# {'timestamp': (15.08, 16.8), 'text': ' faming of classic Sicilian,'}
# {'timestamp': (16.8, 18.96), 'text': ' named or variation on the news,'}
# {'timestamp': (18.96, 21.0), 'text': ' all the while seeing eight moves deep and'}
# {'timestamp': (21.0, 24.0), 'text': ' patiently marshalling the latest press releases into a'}
# {'timestamp': (24.0, 27.52), 'text': ' Fisher shows in lip nitsky attack that culminates in the'}
# {'timestamp': (27.52, 0.0), 'text': ''}
# {'timestamp': (3.24, 5.18), 'text': ' The elegant lethal slow played all-pass on checkmate that is my nightly monologue, but sometimes sometimes'}
# {'timestamp': (5.18, 6.0), 'text': ' folks I'}
# {'timestamp': (6.0, 9.0), 'text': ' sometimes I'}
# {'timestamp': (9.0, 13.0), 'text': ' start a little wake upside down in the monkey bars'}
# {'timestamp': (13.0, 15.48), 'text': ' of a condemned playground on a super fun site.'}
# {'timestamp': (15.48, 17.52), 'text': ' Get all hepped up on goofballs, rummage that were'}
# {'timestamp': (17.52, 20.32), 'text': ' discarded tag bag of defective toys.'}
# {'timestamp': (20.32, 23.4), 'text': ' Yank out a fistball of disembodied doll limbs,'}
# {'timestamp': (23.4, 24.96), 'text': " toss them on a stained kid's place,"}
# {'timestamp': (24.96, 27.98), 'text': ' mad from a defunct denies, set up a table inside a rusty'}
# {'timestamp': (27.98, 29.72), 'text': ' cargo container down by the warf,'}
# {'timestamp': (29.72, 0.0), 'text': ''}
# {'timestamp': (2.28, 5.76), 'text': ' and challenged toothless drifters to the godless, bug house blitz of tournament that is my segment.'}
# {'timestamp': (5.76, 9.56), 'text': ' Me and Wild.'}

Expected behavior

It looks like the empty segment is unnecessary and should not be returned.
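
Until the behavior is changed, one possible stopgap is to filter the empty segments out of the returned chunks on the caller's side. The snippet below is only a minimal sketch of that idea: `drop_empty_chunks` is a hypothetical helper, not part of transformers, and the sample chunks are copied from the 4.47.0 output above. It assumes each chunk is a dict with a string `text` key, as in that output. It only removes the empty entries; it does not restore the segmentation or the start timestamps seen with 4.46.3.

```python
from typing import Any


def drop_empty_chunks(chunks: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Hypothetical helper: keep only chunks whose text is non-empty after stripping whitespace."""
    return [chunk for chunk in chunks if chunk["text"].strip()]


# Three consecutive chunks copied from the 4.47.0 output in this report.
chunks = [
    {"timestamp": (27.98, 29.72), "text": " cargo container down by the warf,"},
    {"timestamp": (29.72, 0.0), "text": ""},
    {"timestamp": (5.76, 9.56), "text": " Me and Wild."},
]

filtered = drop_empty_chunks(chunks)

for chunk in filtered:
    print(chunk)
```

With the reproducer, the same helper could be applied as `drop_empty_chunks(transcription["chunks"])`.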
