fix: chunk boundary transcription loss due to missing mel context #264

Merged
Alex-Wengg merged 1 commit into FluidInference:main from starkdmi:main
Jan 23, 2026

@starkdmi
Contributor

Summary

Fixes transcription truncation at chunk boundaries where valid speech was being lost.

Problem

Audio at certain chunk boundaries produced all-blank predictions. The FastConformer encoder's depthwise convolutions require left context from preceding audio to produce stable features for the first frames of each chunk. Without this context, the encoder output for initial frames can be unstable, causing the TDT decoder to predict silence.

Solution

Prepend 80 ms of audio (1280 samples at 16 kHz = 1 encoder frame) as context from the overlap region to every chunk after the first:

  • Reserve space in the chunkSamples calculation so that chunk plus context stays within the CoreML model's 240k-sample input limit
  • Use the existing contextFrameAdjustment parameter to tell the decoder to skip the context frames
  • Context is drawn from the existing 2.0s overlap region (no additional memory)
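
The sample budgeting described above can be sketched as follows. This is an illustrative Python sketch, not the project's code: the 16 kHz sample rate is inferred from "1280 samples = 80 ms", the function name `plan_chunks` is hypothetical, and the PR text does not show exactly where in the overlap region the context is taken from, so the sketch simply borrows the 80 ms immediately before each chunk start.

```python
SAMPLE_RATE = 16_000               # assumed: 1280 samples = 80 ms implies 16 kHz
MAX_MODEL_SAMPLES = 240_000        # model input limit quoted in the PR
MEL_CONTEXT_SAMPLES = 1_280        # 80 ms = 1 encoder frame
OVERLAP_SAMPLES = 2 * SAMPLE_RATE  # the existing 2.0 s overlap

# Reserve room for the context so chunk + context never exceeds the limit.
CHUNK_SAMPLES = MAX_MODEL_SAMPLES - MEL_CONTEXT_SAMPLES
STRIDE = CHUNK_SAMPLES - OVERLAP_SAMPLES


def plan_chunks(total_samples):
    """Return (input_start, input_end, context_frames) for each chunk.

    Hypothetical helper: input_start already includes the prepended
    context, and context_frames is what the decoder would be told to skip.
    """
    plans = []
    chunk_start = 0
    while True:
        chunk_end = min(chunk_start + CHUNK_SAMPLES, total_samples)
        # The first chunk has no preceding audio; later chunks borrow 80 ms.
        context = 0 if chunk_start == 0 else MEL_CONTEXT_SAMPLES
        context_frames = context // MEL_CONTEXT_SAMPLES
        plans.append((chunk_start - context, chunk_end, context_frames))
        if chunk_end >= total_samples:
            break
        chunk_start += STRIDE
    return plans
```

With these numbers every model input stays at or below 240,000 samples, because the 1280 samples of context are subtracted from the chunk size up front rather than added on top of it.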

Testing

  • Verified on long-form audio (>2 minutes) that previously exhibited truncation
  • WER improved significantly on affected files
  • Streaming mode unaffected (already handles left context correctly)

Notes

This aligns batch mode context handling with how streaming already works.

…ility

The FastConformer encoder's depthwise convolutions need left context from
the previous chunk to produce stable features. Without this, the first
mel frames of a chunk can cause all-blank predictions at certain boundaries.

Changes:
- Add melContextSamples (1280 samples = 80ms = 1 encoder frame)
- Reserve space in chunkSamples so chunk + context <= maxModelSamples
- Prepend context from overlap region for chunks after the first
- Pass contextFrameAdjustment to decoder to skip context frames
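
The equivalence "1280 samples = 80ms = 1 encoder frame" can be checked with a few lines of arithmetic. The sketch below is in Python for illustration only; the 16 kHz sample rate, the 10 ms mel hop, and the 8x encoder subsampling are assumptions not stated in this PR, and `drop_context_frames` is a hypothetical stand-in for whatever contextFrameAdjustment does inside the decoder.

```python
SAMPLE_RATE = 16_000      # assumed input sample rate in Hz
MEL_HOP_SAMPLES = 160     # assumed 10 ms mel-spectrogram hop
ENCODER_SUBSAMPLING = 8   # assumed encoder downsampling factor

# One encoder frame spans 8 mel frames of 160 samples each.
SAMPLES_PER_ENCODER_FRAME = MEL_HOP_SAMPLES * ENCODER_SUBSAMPLING


def samples_to_ms(samples):
    """Duration of a sample count in milliseconds."""
    return samples * 1000 / SAMPLE_RATE


def samples_to_encoder_frames(samples):
    """Whole encoder frames covered by a sample count."""
    return samples // SAMPLES_PER_ENCODER_FRAME


def drop_context_frames(encoder_frames, context_frames):
    """Hypothetical: skip the leading frames that only carry context."""
    return encoder_frames[context_frames:]
```

Under these assumptions the prepended context costs exactly one encoder frame, which is why the decoder only has to skip a single frame per non-first chunk.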

@Alex-Wengg merged commit 7459740 into FluidInference:main on Jan 23, 2026
11 of 12 checks passed
Alex-Wengg added a commit that referenced this pull request May 11, 2026
…ion termination (#594)

Batch transcription drifted French to English at every ~15s chunk boundary
on parakeet-tdt-0.6b-v3-coreml. Streaming on the same audio was clean.
Root cause is three interacting issues:

1. ChunkProcessor created a fresh TdtDecoderState per chunk and SOS-primed
   with the blank token. For non-first chunks this starts the LSTM
   mid-utterance, biased toward TDT v3's English prior.

2. Non-first chunks received only ~80ms of mel-context prefix (from #264),
   while streaming uses ~2s of actual leading audio. FastConformer's
   depthwise convs produce language-biased logits with too little audio
   history, even when the decoder state is correct.

3. When the decoder emits a sentence-final token mid-chunk, the LSTM
   enters a state where the joint predicts BLANK for the remaining frames,
   silently dropping audio. Masked by the per-chunk SOS reset; surfaces
   once state is persisted.

Fix:
- ChunkProcessor.process: serialize chunk processing, persist
  TdtDecoderState across chunks (matches SlidingWindowAsrManager).
- ChunkProcessor: extend non-first chunk audio prefix from 80ms mel-context
  to 2.0s of actual audio. Decoder skips prefix encoder frames via
  contextFrameAdjustment; timestamps remain anchored on global frames.
- TdtDecoderV3: after a sentence-final token, if the decoder emits a long
  blank-only streak with audio remaining, clear predictorOutput to
  re-engage emission while preserving LSTM state.
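
The first two fixes can be pictured with a small sketch. It is written in Python for illustration and does not reproduce the project's code: the helper names are hypothetical, the 16 kHz rate and 1280-sample encoder frame are carried over from #264, and `decode_chunk` is a placeholder callable standing in for the real decoder.

```python
SAMPLE_RATE = 16_000               # assumed, as in #264
SAMPLES_PER_ENCODER_FRAME = 1_280  # 80 ms per encoder frame
FRAME_SECONDS = SAMPLES_PER_ENCODER_FRAME / SAMPLE_RATE


def prefix_frames(prefix_seconds):
    """Encoder frames covered by an audio prefix of the given length."""
    return round(prefix_seconds * SAMPLE_RATE) // SAMPLES_PER_ENCODER_FRAME


def to_global_time(local_frame, chunk_start_frame, skipped_frames):
    """Map a frame index within one chunk's encoder output to seconds
    in the full recording, ignoring the skipped prefix frames."""
    global_frame = chunk_start_frame + (local_frame - skipped_frames)
    return global_frame * FRAME_SECONDS


def decode_serially(chunks, decode_chunk, initial_state):
    """Process chunks one after another, carrying decoder state forward.

    decode_chunk is a hypothetical callable: (chunk, state) -> (tokens, state).
    """
    state = initial_state
    tokens = []
    for chunk in chunks:
        new_tokens, state = decode_chunk(chunk, state)
        tokens.extend(new_tokens)
    return tokens, state
```

Under these assumptions a 2.0 s prefix corresponds to 25 encoder frames to skip, compared with 1 frame for the 80 ms prefix. Serial processing is what makes the state hand-off possible: chunk N cannot start decoding until chunk N-1 has produced its final state.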

Verified on reporter's notes_1408_clean.wav: drift gone with --language fr.
English LibriSpeech test-clean smoke (N=5): WER unchanged vs main.
Streaming path unchanged.

Preserves #264's chunk-boundary token-loss fix.

Closes #594
Alex-Wengg added a commit that referenced this pull request May 12, 2026
PR #264 (commit 7459740) added an 80ms (1 encoder frame, 1280 samples)
mel-context prepend on non-first chunks to fix all-blank predictions at
chunk boundaries on long English audio. On `parakeet-tdt-0.6b-v3-coreml`
with non-English audio, that prepend shifts the FastConformer encoder's
first-frame distribution just enough that the SOS-primed TDT decoder
drifts back to its English-biased prior at every chunk seam.

Reproduction (4 fixtures, default vs --no-mel-context):
  - notes_1408 (FR):  drift -> clean
  - wwii (FR):        clean -> clean
  - user_en (EN):     clean -> clean
  - user2 99.9s (FR): clean -> clean

Changes:
  - ASRConfig gains `melChunkContext: Bool = true` (default preserves
    PR #264 behavior; set to false for non-English long-form batch).
  - ChunkProcessor reads the flag and zeroes the prepend when disabled,
    expanding chunkSamples back so chunks aren't 80ms smaller than the
    encoder's max receptive window.
  - `transcribe` and `asr-benchmark` CLIs accept `--no-mel-context`.
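
The effect of the flag on the chunk budget can be sketched as below. This is a Python illustration under stated assumptions, not the Swift implementation: `ChunkBudget` and `chunk_budget` are hypothetical names, and the 240,000-sample limit is taken from PR #264.

```python
from dataclasses import dataclass

MAX_MODEL_SAMPLES = 240_000   # model input limit quoted in PR #264
MEL_CONTEXT_SAMPLES = 1_280   # 80 ms = 1 encoder frame


@dataclass(frozen=True)
class ChunkBudget:
    context_samples: int  # samples prepended to non-first chunks
    chunk_samples: int    # samples left for the chunk itself


def chunk_budget(mel_chunk_context: bool = True) -> ChunkBudget:
    """Hypothetical mirror of the melChunkContext flag.

    When the flag is off, the prepend is zeroed and the chunk grows
    back to the full model window.
    """
    context = MEL_CONTEXT_SAMPLES if mel_chunk_context else 0
    return ChunkBudget(context, MAX_MODEL_SAMPLES - context)
```

In both settings context plus chunk adds up to the full 240,000-sample window; the flag only decides whether 1280 of those samples are spent on context.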

Closes #594