fix: chunk boundary transcription loss due to missing mel context by starkdmi · Pull Request #264 · FluidInference/FluidAudio

starkdmi · 2026-01-23T10:14:18Z

Summary

Fixes transcription truncation at chunk boundaries where valid speech was being lost.

Problem

Audio at certain chunk boundaries produced all-blank predictions. The FastConformer encoder's depthwise convolutions require left context from preceding audio to produce stable features for the first frames of each chunk. Without this context, the encoder output for initial frames can be unstable, causing the TDT decoder to predict silence.
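
The left-context effect can be illustrated with a toy example. This is a minimal sketch, not the FastConformer encoder: `leftLookingConv`, the 3-tap kernel, and the constant input are all hypothetical, and the convolution is assumed to look only at the current and preceding frames. It shows that a chunk processed on its own gets different values on its first frames than the same chunk processed with preceding frames prepended and then dropped.

```swift
/// Toy single-channel convolution that looks at the current frame and the
/// frames before it. Frames before the start of `input` count as zeros.
func leftLookingConv(_ input: [Float], kernel: [Float]) -> [Float] {
    var output = [Float](repeating: 0, count: input.count)
    for t in 0..<input.count {
        for k in 0..<kernel.count {
            let idx = t - k
            if idx >= 0 {
                output[t] += kernel[k] * input[idx]
            }
        }
    }
    return output
}

// Hypothetical data: a constant signal and a kernel whose taps sum to 1.
let signal = [Float](repeating: 1.0, count: 8)
let kernel: [Float] = [0.5, 0.25, 0.25]
let chunkStart = 4
let context = kernel.count - 1  // frames of history the kernel needs

// Chunk processed alone: its first frames see zeros instead of real audio.
let cold = leftLookingConv(Array(signal[chunkStart...]), kernel: kernel)

// Same chunk with real preceding frames prepended, then dropped from the output.
let padded = Array(signal[(chunkStart - context)...])
let warm = Array(leftLookingConv(padded, kernel: kernel).dropFirst(context))

print(cold)  // first two values are below the steady-state value
print(warm)  // every value is at the steady-state value
```

In this toy setup only the first `kernel.count - 1` output frames differ, which is why a short prefix is enough to stabilise them.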

Solution

Prepend 80ms (1280 samples = 1 encoder frame) of context from the overlap region to non-first chunks:

- Reserve space in chunkSamples calculation to stay within CoreML's 240k sample limit
- Use existing contextFrameAdjustment parameter to tell decoder to skip context frames
- Context is drawn from the existing 2.0s overlap region (no additional memory)
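
The sample budget described above can be sketched as follows. This is an illustration under stated assumptions, not the FluidAudio source: the constants reuse names and values quoted in this PR, while `modelInput` and its `contextFrames` return value are hypothetical helpers. `contextFrames` plays the role the PR assigns to contextFrameAdjustment (how many leading encoder frames the decoder should skip).

```swift
// Values quoted in the PR text; the surrounding code is a sketch only.
let maxModelSamples = 240_000    // CoreML input limit
let melContextSamples = 1_280    // 80 ms = 1 encoder frame
let chunkSamples = maxModelSamples - melContextSamples  // space reserved for context

/// Hypothetical helper: builds the model input for the chunk whose own audio
/// starts at `start`, prepending up to `melContextSamples` of preceding audio.
func modelInput(audio: [Float], start: Int) -> (samples: [Float], contextFrames: Int) {
    let s = min(max(start, 0), audio.count)
    // The first chunk (start == 0) has no preceding audio, so no context.
    let context = min(melContextSamples, s)
    let end = min(audio.count, s + chunkSamples)
    let samples = Array(audio[(s - context)..<end])
    // One encoder frame per melContextSamples of prepended audio.
    return (samples, context / melContextSamples)
}

// Usage with hypothetical silent audio and the 2.0 s (32 000 sample) overlap.
let audio = [Float](repeating: 0, count: 500_000)
let firstChunk = modelInput(audio: audio, start: 0)
let secondChunk = modelInput(audio: audio, start: chunkSamples - 32_000)
print(firstChunk.samples.count, firstChunk.contextFrames)
print(secondChunk.samples.count, secondChunk.contextFrames)
```

Because the context is subtracted from the chunk budget up front, a non-first chunk plus its prefix never exceeds the 240k limit.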

Testing

- Verified on long-form audio (>2 minutes) that previously exhibited truncation
- WER improved significantly on affected files
- Streaming mode unaffected (already handles left context correctly)

Notes

This aligns batch mode context handling with how streaming already works.

…ility

The FastConformer encoder's depthwise convolutions need left context from the previous chunk to produce stable features. Without this, the first mel frames of a chunk can cause all-blank predictions at certain boundaries.

Changes:
- Add melContextSamples (1280 samples = 80ms = 1 encoder frame)
- Reserve space in chunkSamples so chunk + context <= maxModelSamples
- Prepend context from overlap region for chunks after the first
- Pass contextFrameAdjustment to decoder to skip context frames

…ion termination (#594)

Batch transcription drifted French to English at every ~15s chunk boundary on parakeet-tdt-0.6b-v3-coreml. Streaming on the same audio was clean.

Root cause is three interacting issues:
1. ChunkProcessor created a fresh TdtDecoderState per chunk and SOS-primed with the blank token. For non-first chunks this starts the LSTM mid-utterance, biased toward TDT v3's English prior.
2. Non-first chunks received only ~80ms of mel-context prefix (from #264), while streaming uses ~2s of actual leading audio. FastConformer's depthwise convs produce language-biased logits with too little audio history, even when the decoder state is correct.
3. When the decoder emits a sentence-final token mid-chunk, the LSTM enters a state where the joint predicts BLANK for the remaining frames, silently dropping audio. Masked by the per-chunk SOS reset; surfaces once state is persisted.

Fix:
- ChunkProcessor.process: serialize chunk processing, persist TdtDecoderState across chunks (matches SlidingWindowAsrManager).
- ChunkProcessor: extend non-first chunk audio prefix from 80ms mel-context to 2.0s of actual audio. Decoder skips prefix encoder frames via contextFrameAdjustment; timestamps remain anchored on global frames.
- TdtDecoderV3: after a sentence-final token, if the decoder emits a long blank-only streak with audio remaining, clear predictorOutput to re-engage emission while preserving LSTM state.

Verified on reporter's notes_1408_clean.wav: drift gone with --language fr. English LibriSpeech test-clean smoke (N=5): WER unchanged vs main. Streaming path unchanged. Preserves #264's chunk-boundary token-loss fix.

Closes #594
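
The prefix-skipping arithmetic in the fix above can be sketched like this. It is a rough illustration, not the ChunkProcessor code: `prefixFrames` and `globalTime` are hypothetical helpers, and the 16 kHz sample rate is inferred from the PR's "1280 samples = 80ms". The sketch shows how many encoder frames a prefix covers and how a frame index inside a chunk maps back to global time once those frames are skipped.

```swift
// Inferred from "1280 samples = 80 ms = 1 encoder frame" in the PR text.
let sampleRate = 16_000
let samplesPerEncoderFrame = 1_280

/// Hypothetical helper: number of leading encoder frames that cover only the
/// prepended prefix and should be skipped by the decoder.
func prefixFrames(prefixSeconds: Double) -> Int {
    let prefixSamples = Int((prefixSeconds * Double(sampleRate)).rounded())
    return prefixSamples / samplesPerEncoderFrame
}

/// Hypothetical helper: maps a frame index in one chunk's encoder output to a
/// global time in seconds. `chunkStartSample` is where the chunk's own
/// (non-prefix) audio begins in the full recording.
func globalTime(localFrame: Int, skippedFrames: Int, chunkStartSample: Int) -> Double {
    let chunkStartFrame = chunkStartSample / samplesPerEncoderFrame
    let globalFrame = chunkStartFrame + (localFrame - skippedFrames)
    return Double(globalFrame * samplesPerEncoderFrame) / Double(sampleRate)
}

let melContextSkip = prefixFrames(prefixSeconds: 0.08)  // 80 ms prefix from #264
let audioPrefixSkip = prefixFrames(prefixSeconds: 2.0)  // 2.0 s prefix from this fix

// First non-prefix frame of a chunk whose own audio starts at sample 192 000.
let firstRealFrameTime = globalTime(
    localFrame: audioPrefixSkip,
    skippedFrames: audioPrefixSkip,
    chunkStartSample: 192_000
)
print(melContextSkip, audioPrefixSkip, firstRealFrameTime)
```

Subtracting the skipped frames before converting to seconds is what keeps timestamps tied to the global timeline regardless of prefix length.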

PR #264 (commit 7459740) added an 80ms (1 encoder frame, 1280 samples) mel-context prepend on non-first chunks to fix all-blank predictions at chunk boundaries on long English audio. On `parakeet-tdt-0.6b-v3-coreml` with non-English audio, that prepend shifts the FastConformer encoder's first-frame distribution just enough that the SOS-primed TDT decoder drifts back to its English-biased prior at every chunk seam.

Reproduction (4 fixtures, default vs --no-mel-context):
- notes_1408 (FR): drift -> clean
- wwii (FR): clean -> clean
- user_en (EN): clean -> clean
- user2 99.9s (FR): clean -> clean

Changes:
- ASRConfig gains `melChunkContext: Bool = true` (default preserves PR #264 behavior; set to false for non-English long-form batch).
- ChunkProcessor reads the flag and zeroes the prepend when disabled, expanding chunkSamples back so chunks aren't 80ms smaller than the encoder's max receptive window.
- `transcribe` and `asr-benchmark` CLIs accept `--no-mel-context`.

Closes #594
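
The effect of the opt-out flag on chunk sizing can be sketched as below. `BatchChunkingSketch` is a hypothetical stand-in, not the real ASRConfig or ChunkProcessor; only the flag name, its default, and the sample counts come from the text above.

```swift
/// Hypothetical stand-in for the configuration described above.
struct BatchChunkingSketch {
    /// Mirrors the described default: keep the PR #264 prepend unless disabled.
    var melChunkContext = true

    let maxModelSamples = 240_000
    let melContextSamples = 1_280

    /// Samples prepended to non-first chunks; zero when the flag is off.
    var contextSamples: Int {
        melChunkContext ? melContextSamples : 0
    }

    /// Chunk size after reserving room for the prepend, so that
    /// chunk + context never exceeds `maxModelSamples`.
    var chunkSamples: Int {
        maxModelSamples - contextSamples
    }
}

let defaultPlan = BatchChunkingSketch()
var optOutPlan = BatchChunkingSketch()
optOutPlan.melChunkContext = false  // what --no-mel-context is described as selecting

print(defaultPlan.chunkSamples, optOutPlan.chunkSamples)
```

With the flag off, the chunk grows back by exactly the 1280 samples that were reserved for the prepend.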

Alex-Wengg merged commit 7459740 into FluidInference:main Jan 23, 2026
11 of 12 checks passed

vdt4534 mentioned this pull request May 10, 2026

bug: French transcription drifts to English at chunk boundary in AsrManager.transcribe (batch) — regression introduced by #264 #594

Open

Alex-Wengg mentioned this pull request May 12, 2026

fix(asr): add melChunkContext opt-out flag for Issue #594 #596

Open

Alex-Wengg merged 1 commit into FluidInference:main from starkdmi:main
