Summary
AsrManager.transcribe (batch mode) produces a garbled English segment at the first CoreML chunk boundary (~15 s) when the audio is French. The regression is bisected to PR #264 ("fix: chunk boundary transcription loss due to missing mel context", merged 2026-01-23). The same audio on the parent commit bb96003 and in streaming mode (SlidingWindowAsrManager) is clean on all tested SHAs including HEAD.
Reproducer audio: the reporter will attach it in a follow-up comment (45.156 s, 16 kHz mono WAV, real French dictation).
Reproduction
Model: parakeet-tdt-0.6b-v3-coreml (AsrModelVersion.v3)
Audio: notes_1408_clean.wav — 45.156 s, 16 kHz mono WAV, French speech
// Broken path: batch
let batchResult = try await asrManager.transcribe(samples: audioSamples, language: .french)
// Clean path: streaming (initialiser arguments elided)
let streaming = SlidingWindowAsrManager(...)
let streamingResult = try await streaming.transcribe(samples: audioSamples, language: .french)
The failure is deterministic: 5/5 runs produce byte-identical output. Passing language: .french explicitly does not mitigate it.
Bisect
The table below was produced by checking out each SHA, building with the default CoreML backend, and running the same 45 s French audio through AsrManager.transcribe.
Side-by-side outputs
Broken: commit 7459740 (or any post-#264 SHA), batch mode
Oui Pierre, écoute, je suis très content qu'on puisse discuter de ça avec l'équipe parce que
de toute façon moi je suis hyper occupé et donc j'ai pas vraiment beaucoup le temps de m'en
occuper personnellement. Si toi tu peux en fait revoir avec le rest of the key what is that
marrangeray je pense que la prochaine fois on va devoir evident un effort sur le progrès et
sur l'avancée évidemment du sujet. Si l'équipe en fait doit travailler plus vite, tu n'as
qu'à me le dire, il faut augmenter les performances. Les compétiteurs n'ont pas de temps à
perdre et nous non plus il faut réellement qu'on se dépêche et je compte sur toi en fait pour
vraiment m'aider tu vois
The drift segment is "rest of the key what is that marrangeray" at the chunk boundary (~15 s mark).
Clean: commit bb96003 (parent of #264), batch mode
Oui Pierre, écoute, je suis très content qu'on puisse discuter de ça avec l'équipe parce que
de toute façon moi je suis hyper occupé et donc j'ai pas vraiment beaucoup le temps de m'en
occuper personnellement. Si toi tu peux en fait revoir avec le reste de l'équipe ce qu'il en
est, ça m'arrangerait vraiment bien. Je pense que la prochaine fois on va devoir évidemment
faire un effort sur le progrès et sur l'avancée évidemment du sujet. Si l'équipe en fait doit
travailler plus vite, tu n'as qu'à me le dire, il faut augmenter les performances. Les
compétiteurs n'ont pas de temps à perdre et nous non plus il faut réellement qu'on se dépêche
et je compte sur toi en fait pour vraiment m'aider tu vois
Clean: commit 7459740, streaming mode (same audio)
Oui Pierre, écoute, je suis très content qu'on puisse discuter de ça avec l'équipe parce que
de toute façon moi je suis hyper occupé et donc j'ai pas vraiment beaucoup le temps de m'en
occuper personnellement. Si toi tu peux en faire. fait revoir avec le reste de l'équipe ce
qu'il en est, ça m'arrangerait vraiment bien. Je pense que la prochaine fois on va devoir
évidemment faire un effort sur le progrès et sur l'avancée évidemment du sujet. Si l'équipe
en fait doit travailler plus vite, tu n'as qu'à me le dire, il faut augmenter les performances.
Les compétiteurs n'ont pas de temps perdre et nous non plus. Il faut réellement qu'on se
dépêche. Et je compte sur toi en fait pour vous.
(The streaming output has minor disfluencies of its own but produces no English drift at the chunk boundary.)
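To locate the drift segment mechanically rather than by eye, the transcripts above can be compared word by word. The following is a minimal sketch, not part of FluidAudio: `divergentSpan` is a hypothetical helper that trims the longest common prefix and suffix of two transcripts and returns what is left on each side. On long transcripts with several small differences the returned span will be wider than the drift itself, so it is only a first-pass locator.

```swift
import Foundation

/// Hypothetical helper (not a FluidAudio API): returns the words that remain
/// in each transcript after removing their longest common prefix and suffix.
func divergentSpan(_ a: String, _ b: String) -> (a: [String], b: [String]) {
    let wa = a.split(whereSeparator: { $0.isWhitespace }).map(String.init)
    let wb = b.split(whereSeparator: { $0.isWhitespace }).map(String.init)

    // Length of the shared word prefix.
    var prefix = 0
    while prefix < wa.count, prefix < wb.count, wa[prefix] == wb[prefix] {
        prefix += 1
    }

    // Length of the shared word suffix, never overlapping the prefix.
    var suffix = 0
    while suffix < wa.count - prefix, suffix < wb.count - prefix,
          wa[wa.count - 1 - suffix] == wb[wb.count - 1 - suffix] {
        suffix += 1
    }

    return (Array(wa[prefix..<(wa.count - suffix)]),
            Array(wb[prefix..<(wb.count - suffix)]))
}

// Shortened excerpts of the broken and clean batch outputs.
let span = divergentSpan("revoir avec le rest of the key je pense",
                         "revoir avec le reste de l'équipe je pense")
print(span.a)  // ["rest", "of", "the", "key"]
print(span.b)  // ["reste", "de", "l'équipe"]
```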
Affected scope
Conditions that trigger the bug:
- Language: French (confirmed)
- Mode: batch (AsrManager.transcribe)
- Duration: longer than one CoreML chunk (~15 s), i.e. at least one chunk boundary is crossed
- Model: parakeet-tdt-0.6b-v3-coreml (v3)
Conditions where the bug does not appear:
- English long audio at the same broken SHA: clean
- Single-chunk audio (< ~15 s): clean (no boundary is ever crossed)
- Streaming mode (SlidingWindowAsrManager): clean at all tested SHAs
Explicitly passing --language fr / language: .french does not mitigate the bug.
Suspected mechanism
PR #264 introduced 80 ms (1280 samples = 1 encoder frame) of left-context audio prepended to every non-first chunk, drawn from the preceding overlap region. The relevant lines from Sources/FluidAudio/ASR/ChunkProcessor.swift in the PR diff:
// For chunks after the first, prepend context samples from the overlap region.
let contextSamples = chunkIndex > 0 ? melContextSamples : 0
let contextStart = chunkStart - contextSamples
let chunkLengthWithContext = chunkEnd - contextStart
let chunkSamplesArray = try readSamples(offset: contextStart, count: chunkLengthWithContext)
// Context frame adjustment tells decoder to skip the prepended context frames
let contextFrames = contextSamples / ASRConstants.samplesPerEncoderFrame
let (hypothesis, encoderSequenceLength) = try await manager.executeMLInferenceWithTimings(
paddedChunk,
originalLength: samples.count, // Full length including context
actualAudioFrames: actualFrameCount, // Only actual audio frames (excluding context)
decoderState: &decoderState,
contextFrameAdjustment: contextFrames, // Skip context frames in decoder
isLastChunk: isLastChunk,
globalFrameOffset: globalFrameOffset
)
The encoder runs over the full context-padded audio and produces features for all frames including the prepended context frame. The contextFrameAdjustment: contextFrames parameter is then used to tell the TDT decoder to skip that leading frame when consuming encoder output.
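The arithmetic behind the excerpt can be checked in isolation. The sketch below mirrors the names in the PR diff, but it is a standalone illustration rather than the library's code: `chunkWindow` is a hypothetical function, the constants take the values quoted in this report (1280 samples per encoder frame at 16 kHz), and the chunk start and end offsets in the usage lines are made-up numbers chosen only to exercise the formula.

```swift
// Values quoted in this report; assumed, not read from FluidAudio.
let sampleRate = 16_000
let samplesPerEncoderFrame = 1_280
let melContextSamples = 1_280  // one encoder frame of left context

// 1280 samples at 16 kHz is 80 ms.
let contextMilliseconds = melContextSamples * 1_000 / sampleRate

/// Hypothetical helper: computes where a chunk read starts, how many samples
/// it covers, and how many leading encoder frames the decoder should skip.
func chunkWindow(chunkIndex: Int, chunkStart: Int, chunkEnd: Int)
    -> (readStart: Int, readCount: Int, framesToSkip: Int) {
    // Only chunks after the first get left context prepended.
    let contextSamples = chunkIndex > 0 ? melContextSamples : 0
    let readStart = chunkStart - contextSamples
    let readCount = chunkEnd - readStart
    let framesToSkip = contextSamples / samplesPerEncoderFrame
    return (readStart, readCount, framesToSkip)
}

// Illustrative offsets only; the real chunk and overlap sizes may differ.
let first = chunkWindow(chunkIndex: 0, chunkStart: 0, chunkEnd: 240_000)
let second = chunkWindow(chunkIndex: 1, chunkStart: 208_000, chunkEnd: 448_000)
print(contextMilliseconds)  // 80
print(first)                // (readStart: 0, readCount: 240000, framesToSkip: 0)
print(second)               // (readStart: 206720, readCount: 241280, framesToSkip: 1)
```

With these values every non-first chunk is read 1280 samples early and the decoder is expected to skip exactly one encoder frame, which is the step the hypothesis below focuses on.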
Our hypothesis is that this skip is not language-neutral. Parakeet TDT v3 has a strong English prior; when the decoder is asked to skip the very first encoder frame it receives at a chunk boundary, something in the decoder state initialisation or the frame-skip logic produces output that is biased toward English regardless of the audio content. That corruption then bleeds into the following tokens until the decoder re-anchors on the French audio a few words later. Because English audio is already within the model's prior, the same corruption goes unnoticed — the decoder's fallback is already English.
We cannot say with certainty whether the bug is in contextFrameAdjustment being off by one, in how executeMLInferenceWithTimings implements the skip (discarding an encoder frame vs. shifting attention), or in the decoder state not being reset cleanly across the boundary. Streaming mode does not exhibit this, which suggests the streaming path handles left-context differently in a way that does not corrupt decoder language state.
Workarounds we are using
We have migrated our offline (batch) callers to SlidingWindowAsrManager as a short-term workaround — streaming produces clean output on the same audio. We are not calling AsrManager.transcribe directly for any production path at the moment.
If it would help, we are happy to send a PR. Possible directions would be: (a) verifying whether contextFrameAdjustment is 0-indexed vs. 1-indexed causing an off-by-one, or (b) comparing how the streaming path supplies left context to executeMLInferenceWithTimings and aligning batch to match.
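To make direction (a) concrete, here is a toy model of what an off-by-one in the skip would mean. It makes no claim about how executeMLInferenceWithTimings actually applies contextFrameAdjustment: `framesConsumed` is a hypothetical function, and the encoder frames are represented as plain labels rather than feature tensors.

```swift
/// Hypothetical model (not FluidAudio code): the decoder consumes every
/// encoder frame after the first `skip` frames.
func framesConsumed(encoderFrames: [String], skip: Int) -> [String] {
    Array(encoderFrames.dropFirst(max(0, skip)))
}

// One prepended context frame followed by the chunk's own audio frames.
let frames = ["context", "audio0", "audio1", "audio2"]

// Intended behaviour: exactly the context frame is skipped.
let intended = framesConsumed(encoderFrames: frames, skip: 1)
// Skip too small: the context frame is decoded again at the boundary.
let skipTooSmall = framesConsumed(encoderFrames: frames, skip: 0)
// Skip too large: the first real audio frame of the chunk is lost.
let skipTooLarge = framesConsumed(encoderFrames: frames, skip: 2)

print(intended)      // ["audio0", "audio1", "audio2"]
print(skipTooSmall)  // ["context", "audio0", "audio1", "audio2"]
print(skipTooLarge)  // ["audio1", "audio2"]
```

A check of this shape against the real encoder output length for a non-first chunk would show which of the three cases the batch path is in.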
Environment
- Hardware: Apple Silicon (M-series)
- OS: macOS (Sequoia / latest)
- FluidAudio version: HEAD at time of report is ce59fb1 (2026-05-09)
- Model: parakeet-tdt-0.6b-v3-coreml (AsrModelVersion.v3)
- Swift: 6.x
- Integration: Swift Package Manager, CoreML backend, no custom configuration
Thanks for FluidAudio — happy to provide more reproducers, the audio file, or test patches.