fix: chunk boundary transcription loss due to missing mel context #264
Merged
…ility

The FastConformer encoder's depthwise convolutions need left context from the previous chunk to produce stable features. Without this, the first mel frames of a chunk can cause all-blank predictions at certain boundaries.

Changes:
- Add melContextSamples (1280 samples = 80ms = 1 encoder frame)
- Reserve space in chunkSamples so chunk + context <= maxModelSamples
- Prepend context from overlap region for chunks after the first
- Pass contextFrameAdjustment to decoder to skip context frames
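The changes listed in this commit message can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the project's code: `prepareChunk`, `PreparedChunk`, and `samplesPerEncoderFrame` are hypothetical names, the 16 kHz sample rate is inferred from 1280 samples = 80ms, and the 240,000-sample limit is taken from the PR description.

```swift
// Hypothetical sketch of the context-prepend step. Only melContextSamples,
// chunkSamples, maxModelSamples, and contextFrameAdjustment are names from
// the PR; everything else is an illustrative assumption.
let sampleRate = 16_000             // implied by 1280 samples = 80 ms
let samplesPerEncoderFrame = 1_280  // 1 encoder frame = 80 ms
let maxModelSamples = 240_000       // CoreML input limit from the PR description
let melContextSamples = samplesPerEncoderFrame
let contextMilliseconds = melContextSamples * 1_000 / sampleRate

// Reserve room so that chunk + context never exceeds the model limit.
let chunkSamples = maxModelSamples - melContextSamples

struct PreparedChunk {
    let samples: [Float]
    let contextFrameAdjustment: Int  // encoder frames the decoder should skip
}

/// Builds the model input for the chunk that starts at sample index `start`.
func prepareChunk(audio: [Float], start: Int) -> PreparedChunk {
    precondition(start >= 0 && start < audio.count, "start must index into audio")
    let end = min(start + chunkSamples, audio.count)
    // The first chunk has no preceding audio, so nothing is prepended.
    let context = start >= melContextSamples ? melContextSamples : 0
    let input = Array(audio[(start - context)..<end])
    return PreparedChunk(
        samples: input,
        contextFrameAdjustment: context / samplesPerEncoderFrame
    )
}
```

In this sketch a non-first chunk fills the full 240,000-sample budget (238,720 chunk samples plus 1,280 context samples), and the decoder is told to skip one encoder frame.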
Alex-Wengg added a commit that referenced this pull request May 11, 2026
…ion termination (#594)

Batch transcription drifted French to English at every ~15s chunk boundary on parakeet-tdt-0.6b-v3-coreml. Streaming on the same audio was clean.

Root cause is three interacting issues:
1. ChunkProcessor created a fresh TdtDecoderState per chunk and SOS-primed with the blank token. For non-first chunks this starts the LSTM mid-utterance, biased toward TDT v3's English prior.
2. Non-first chunks received only ~80ms of mel-context prefix (from #264), while streaming uses ~2s of actual leading audio. FastConformer's depthwise convs produce language-biased logits with too little audio history, even when the decoder state is correct.
3. When the decoder emits a sentence-final token mid-chunk, the LSTM enters a state where the joint predicts BLANK for the remaining frames, silently dropping audio. Masked by the per-chunk SOS reset; surfaces once state is persisted.

Fix:
- ChunkProcessor.process: serialize chunk processing, persist TdtDecoderState across chunks (matches SlidingWindowAsrManager).
- ChunkProcessor: extend non-first chunk audio prefix from 80ms mel-context to 2.0s of actual audio. Decoder skips prefix encoder frames via contextFrameAdjustment; timestamps remain anchored on global frames.
- TdtDecoderV3: after a sentence-final token, if the decoder emits a long blank-only streak with audio remaining, clear predictorOutput to re-engage emission while preserving LSTM state.

Verified on reporter's notes_1408_clean.wav: drift gone with --language fr. English LibriSpeech test-clean smoke (N=5): WER unchanged vs main. Streaming path unchanged. Preserves #264's chunk-boundary token-loss fix.

Closes #594
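As a rough illustration of the 2.0s prefix described in this commit message, the sketch below shows how such a prefix could translate into skipped encoder frames while timestamps stay on the global frame axis. The 80ms encoder frame comes from #264; the function names and the exact mapping are assumptions for illustration, not the code in ChunkProcessor.

```swift
// Hypothetical sketch: only contextFrameAdjustment is a name from the
// commit message; the rest is illustrative.
let sampleRate = 16_000             // implied by 1280 samples = 80 ms
let samplesPerEncoderFrame = 1_280  // 1 encoder frame = 80 ms, per #264
let prefixSeconds = 2.0             // audio prefix for non-first chunks

// Encoder frames covered by the prefix: 32_000 samples / 1_280 = 25 frames.
let prefixSamples = Int(prefixSeconds * Double(sampleRate))
let contextFrameAdjustment = prefixSamples / samplesPerEncoderFrame

/// Maps a frame index local to one chunk's encoder output onto the global
/// frame axis. Frames inside the prefix return nil because the decoder
/// skips them.
func globalFrame(localFrame: Int, chunkStartFrame: Int, contextFrameAdjustment: Int) -> Int? {
    guard localFrame >= contextFrameAdjustment else { return nil }
    return chunkStartFrame + (localFrame - contextFrameAdjustment)
}

/// Converts a global encoder frame index to seconds from the start of the audio.
func seconds(forGlobalFrame frame: Int) -> Double {
    Double(frame * samplesPerEncoderFrame) / Double(sampleRate)
}
```

Because the prefix frames are subtracted before the chunk's start frame is added, a token emitted on the first non-prefix frame lands exactly at the chunk's global start time.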
Alex-Wengg added a commit that referenced this pull request May 12, 2026
PR #264 (commit 7459740) added an 80ms (1 encoder frame, 1280 samples) mel-context prepend on non-first chunks to fix all-blank predictions at chunk boundaries on long English audio. On `parakeet-tdt-0.6b-v3-coreml` with non-English audio, that prepend shifts the FastConformer encoder's first-frame distribution just enough that the SOS-primed TDT decoder drifts back to its English-biased prior at every chunk seam.

Reproduction (4 fixtures, default vs --no-mel-context):
- notes_1408 (FR): drift -> clean
- wwii (FR): clean -> clean
- user_en (EN): clean -> clean
- user2 99.9s (FR): clean -> clean

Changes:
- ASRConfig gains `melChunkContext: Bool = true` (default preserves PR #264 behavior; set to false for non-English long-form batch).
- ChunkProcessor reads the flag and zeroes the prepend when disabled, expanding chunkSamples back so chunks aren't 80ms smaller than the encoder's max receptive window.
- `transcribe` and `asr-benchmark` CLIs accept `--no-mel-context`.

Closes #594
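The effect of the flag on chunk sizing might look like the following sketch. `SketchASRConfig` and `chunkLayout` are hypothetical stand-ins rather than the real ASRConfig or ChunkProcessor, and the 240,000-sample limit is taken from the description of PR #264.

```swift
// Simplified stand-in for the real ASRConfig; only the new flag is modeled.
struct SketchASRConfig {
    var melChunkContext: Bool = true  // default preserves PR #264 behavior
}

let maxModelSamples = 240_000  // CoreML input limit (from the #264 description)
let melContextSamples = 1_280  // 80 ms = 1 encoder frame

/// Returns how many samples are prepended as context and how many remain
/// for the chunk itself. Disabling the flag zeroes the prepend and gives
/// the 80 ms back to the chunk.
func chunkLayout(for config: SketchASRConfig) -> (contextSamples: Int, chunkSamples: Int) {
    let context = config.melChunkContext ? melContextSamples : 0
    return (contextSamples: context, chunkSamples: maxModelSamples - context)
}

let withContext = chunkLayout(for: SketchASRConfig())
let withoutContext = chunkLayout(for: SketchASRConfig(melChunkContext: false))
```

Either way the total model input stays within the same sample budget; the flag only decides whether 1,280 of those samples are spent on context.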
Summary
Fixes transcription truncation at chunk boundaries where valid speech was being lost.
Problem
Audio at certain chunk boundaries produced all-blank predictions. The FastConformer encoder's depthwise convolutions require left context from preceding audio to produce stable features for the first frames of each chunk. Without this context, the encoder output for initial frames can be unstable, causing the TDT decoder to predict silence.
Solution
Prepend 80ms (1280 samples = 1 encoder frame) of context from the overlap region to non-first chunks:
- `chunkSamples` calculation to stay within CoreML's 240k sample limit
- `contextFrameAdjustment` parameter to tell decoder to skip context frames
Testing
Notes
This aligns batch mode context handling with how streaming already works.