bug: French transcription drifts to English at chunk boundary in AsrManager.transcribe (batch) — regression introduced by #264

## Summary

`AsrManager.transcribe` (batch mode) produces a garbled English segment at the first CoreML chunk boundary (~15 s) when the audio is French. The regression bisects to PR #264 ("fix: chunk boundary transcription loss due to missing mel context", merged 2026-01-23). The same audio transcribes cleanly in batch mode on the parent commit `bb96003`, and in streaming mode (`SlidingWindowAsrManager`) on all tested SHAs, including HEAD.

**Reproducer audio:** to be attached in a follow-up comment (`notes_1408_clean.wav`, 45.156 s, 16 kHz mono WAV, real French dictation).

---

## Reproduction

**Model:** `parakeet-tdt-0.6b-v3-coreml` (`AsrModelVersion.v3`)  
**Audio:** `notes_1408_clean.wav` — 45.156 s, 16 kHz mono WAV, French speech

```swift
// Broken path: batch
let batchResult = try await asrManager.transcribe(samples: audioSamples, language: .french)

// Clean path: streaming (initializer arguments elided)
let streaming = SlidingWindowAsrManager(...)
let streamingResult = try await streaming.transcribe(samples: audioSamples, language: .french)
```

The failure is **deterministic**: 5/5 runs produce byte-identical output. Passing `language: .french` explicitly does not mitigate it.

---

## Bisect

The table below was produced by checking out each SHA, building with the default CoreML backend, and running the same 45 s French audio through `AsrManager.transcribe`.

| SHA | Date | Result |
|---|---|---|
| c366ca0 | 2025-11-10 | clean |
| 6c352d8 | 2025-12-13 | clean |
| 1180543 | 2026-01-14 | clean |
| **bb96003** | **2026-01-22** | **clean — parent of #264** |
| **7459740** | **2026-01-23** | **broken — PR #264** |
| 5d9176e | 2026-01-30 | broken |
| 064daac | 2026-02-25 | broken |
| 481f47b | 2026-04-04 | broken |
| ce59fb1 | 2026-05-09 | broken (HEAD at time of report) |

---

## Side-by-side outputs

**Broken: commit `7459740` (or any post-#264 SHA), batch mode**

```
Oui Pierre, écoute, je suis très content qu'on puisse discuter de ça avec l'équipe parce que
de toute façon moi je suis hyper occupé et donc j'ai pas vraiment beaucoup le temps de m'en
occuper personnellement. Si toi tu peux en fait revoir avec le rest of the key what is that
marrangeray je pense que la prochaine fois on va devoir evident un effort sur le progrès et
sur l'avancée évidemment du sujet. Si l'équipe en fait doit travailler plus vite, tu n'as
qu'à me le dire, il faut augmenter les performances. Les compétiteurs n'ont pas de temps à
perdre et nous non plus il faut réellement qu'on se dépêche et je compte sur toi en fait pour
vraiment m'aider tu vois
```

The drift segment is `"rest of the key what is that marrangeray"` at the chunk boundary (~15 s mark).
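
One way to isolate the drift segment mechanically is to strip the words the two transcripts share at the start and at the end. The sketch below uses a hypothetical helper (`divergentSpan` is not part of FluidAudio) on excerpts of the broken and clean batch outputs quoted in this issue. The comparison is case-sensitive, so the reported span runs through the lowercase `je` that follows the garbled words.

```swift
/// Hypothetical helper, not a FluidAudio API: strips the longest common
/// word prefix and suffix and returns what is left of each transcript.
func divergentSpan(reference: String, candidate: String) -> (reference: [String], candidate: [String]) {
    let ref = reference.split(whereSeparator: { $0.isWhitespace }).map(String.init)
    let cand = candidate.split(whereSeparator: { $0.isWhitespace }).map(String.init)

    // Length of the shared leading run of words.
    var prefix = 0
    while prefix < ref.count, prefix < cand.count, ref[prefix] == cand[prefix] {
        prefix += 1
    }

    // Length of the shared trailing run, never overlapping the prefix.
    var suffix = 0
    while suffix < ref.count - prefix, suffix < cand.count - prefix,
        ref[ref.count - 1 - suffix] == cand[cand.count - 1 - suffix]
    {
        suffix += 1
    }

    return (
        reference: Array(ref[prefix..<(ref.count - suffix)]),
        candidate: Array(cand[prefix..<(cand.count - suffix)])
    )
}

// Excerpts copied from the clean (bb96003) and broken (7459740) batch outputs.
let cleanExcerpt = "Si toi tu peux en fait revoir avec le reste de l'équipe ce qu'il en est, ça m'arrangerait vraiment bien. Je pense que la prochaine fois"
let brokenExcerpt = "Si toi tu peux en fait revoir avec le rest of the key what is that marrangeray je pense que la prochaine fois"

let span = divergentSpan(reference: cleanExcerpt, candidate: brokenExcerpt)
print("broken:", span.candidate.joined(separator: " "))
print("clean: ", span.reference.joined(separator: " "))
```

A prefix/suffix trim is enough here because the two transcripts differ in a single contiguous region; a full diff would be needed if drift appeared at several boundaries.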

**Clean: commit `bb96003` (parent of #264), batch mode**

```
Oui Pierre, écoute, je suis très content qu'on puisse discuter de ça avec l'équipe parce que
de toute façon moi je suis hyper occupé et donc j'ai pas vraiment beaucoup le temps de m'en
occuper personnellement. Si toi tu peux en fait revoir avec le reste de l'équipe ce qu'il en
est, ça m'arrangerait vraiment bien. Je pense que la prochaine fois on va devoir évidemment
faire un effort sur le progrès et sur l'avancée évidemment du sujet. Si l'équipe en fait doit
travailler plus vite, tu n'as qu'à me le dire, il faut augmenter les performances. Les
compétiteurs n'ont pas de temps à perdre et nous non plus il faut réellement qu'on se dépêche
et je compte sur toi en fait pour vraiment m'aider tu vois
```

**Clean: commit `7459740`, streaming mode (same audio)**

```
Oui Pierre, écoute, je suis très content qu'on puisse discuter de ça avec l'équipe parce que
de toute façon moi je suis hyper occupé et donc j'ai pas vraiment beaucoup le temps de m'en
occuper personnellement. Si toi tu peux en faire. fait revoir avec le reste de l'équipe ce
qu'il en est, ça m'arrangerait vraiment bien. Je pense que la prochaine fois on va devoir
évidemment faire un effort sur le progrès et sur l'avancée évidemment du sujet. Si l'équipe
en fait doit travailler plus vite, tu n'as qu'à me le dire, il faut augmenter les performances.
Les compétiteurs n'ont pas de temps perdre et nous non plus. Il faut réellement qu'on se
dépêche. Et je compte sur toi en fait pour vous.
```

(The streaming output has minor disfluencies of its own but produces no English drift at the chunk boundary.)

---

## Affected scope

Conditions that **trigger** the bug:
- Language: French (confirmed)
- Mode: batch (`AsrManager.transcribe`)
- Duration: longer than one CoreML chunk (~15 s), i.e. at least one chunk boundary is crossed
- Model: `parakeet-tdt-0.6b-v3-coreml` (v3)

Conditions where the bug does **not** appear:
- English long audio at the same broken SHA — clean
- Single-chunk audio (< ~15 s) — clean (no boundary is ever crossed)
- Streaming mode (`SlidingWindowAsrManager`) — clean at all tested SHAs

Explicitly passing `--language fr` / `language: .french` does **not** mitigate the bug.

---

## Suspected mechanism

PR #264 prepends 80 ms of left-context audio (1280 samples at 16 kHz, i.e. one encoder frame) to every chunk after the first, drawn from the preceding overlap region. The relevant lines from `Sources/FluidAudio/ASR/ChunkProcessor.swift` in the PR diff:

```swift
// For chunks after the first, prepend context samples from the overlap region.
let contextSamples = chunkIndex > 0 ? melContextSamples : 0
let contextStart = chunkStart - contextSamples
let chunkLengthWithContext = chunkEnd - contextStart
let chunkSamplesArray = try readSamples(offset: contextStart, count: chunkLengthWithContext)
```

```swift
// Context frame adjustment tells decoder to skip the prepended context frames
let contextFrames = contextSamples / ASRConstants.samplesPerEncoderFrame

let (hypothesis, encoderSequenceLength) = try await manager.executeMLInferenceWithTimings(
    paddedChunk,
    originalLength: samples.count,  // Full length including context
    actualAudioFrames: actualFrameCount,  // Only actual audio frames (excluding context)
    decoderState: &decoderState,
    contextFrameAdjustment: contextFrames,  // Skip context frames in decoder
    isLastChunk: isLastChunk,
    globalFrameOffset: globalFrameOffset
)
```

The encoder runs over the full context-padded audio and produces features for all frames including the prepended context frame. The `contextFrameAdjustment: contextFrames` parameter is then used to tell the TDT decoder to skip that leading frame when consuming encoder output.
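
To make the arithmetic concrete, the following standalone sketch mirrors the context-window computation from the diff above. The constants are the figures quoted in this issue (16 kHz, 1280 samples per encoder frame, 1280 context samples). The chunk boundaries and the `contextWindow` function are assumptions for illustration only; the real chunk length and overlap in `ChunkProcessor.swift` may differ.

```swift
// Sketch only: constants mirror the numbers quoted in this issue.
let sampleRate = 16_000              // 16 kHz mono, as in the reproducer
let samplesPerEncoderFrame = 1_280   // 80 ms per encoder frame at 16 kHz
let melContextSamples = 1_280        // left context added by PR #264

/// Hypothetical helper: returns the sample range read for a chunk and the
/// number of leading encoder frames the decoder is expected to skip.
func contextWindow(chunkIndex: Int, chunkStart: Int, chunkEnd: Int)
    -> (readStart: Int, readCount: Int, contextFrames: Int)
{
    // Only chunks after the first receive left context.
    let contextSamples = chunkIndex > 0 ? melContextSamples : 0
    let contextStart = chunkStart - contextSamples
    let contextFrames = contextSamples / samplesPerEncoderFrame
    return (contextStart, chunkEnd - contextStart, contextFrames)
}

// Assumed boundaries: a 15 s first chunk, and a second chunk starting at 13 s.
let first = contextWindow(chunkIndex: 0, chunkStart: 0, chunkEnd: 15 * sampleRate)
let second = contextWindow(chunkIndex: 1, chunkStart: 13 * sampleRate, chunkEnd: 28 * sampleRate)

print("frame duration (ms):", samplesPerEncoderFrame * 1_000 / sampleRate)
print("first chunk: ", first)
print("second chunk:", second)
```

Under these assumptions the second chunk is read 1280 samples earlier than its nominal start and reports one context frame, which is the frame the decoder is then asked to skip.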

Our hypothesis is that this skip is not language-neutral. Parakeet TDT v3 appears to have a strong English prior: when the decoder is told to skip the first encoder frame it receives at a chunk boundary, something in the decoder state initialisation or the frame-skip logic biases the output toward English regardless of the audio content. The corruption persists for several tokens until the decoder re-anchors on the French audio. With English audio the same fault would go unnoticed, because the decoder's fallback is already English.

We cannot say with certainty whether the bug is in `contextFrameAdjustment` being off by one, in how `executeMLInferenceWithTimings` implements the skip (discarding an encoder frame vs. shifting attention), or in the decoder state not being reset cleanly across the boundary. Streaming mode does not exhibit this, which suggests the streaming path handles left-context differently in a way that does not corrupt decoder language state.
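
As an illustration of the off-by-one possibility only, here is a toy model in which encoder frames are labelled strings. It is not FluidAudio code and does not reflect how `executeMLInferenceWithTimings` actually implements the skip; `framesSeenByDecoder` is a hypothetical name. It shows the two ways a wrong skip count could misalign the decoder: leaking the context frame, or dropping the first frame of real audio.

```swift
// Toy model only: frames are labelled strings, not real encoder output.
let contextFrames = 1
let encoderFrames = ["ctx"] + (0..<4).map { "audio\($0)" }

/// Hypothetical stand-in for the decoder's view after skipping `skip` frames.
func framesSeenByDecoder(skipping skip: Int) -> [String] {
    Array(encoderFrames.dropFirst(skip))
}

let intended = framesSeenByDecoder(skipping: contextFrames)          // skip exactly the context
let leaksContext = framesSeenByDecoder(skipping: contextFrames - 1)  // skip one frame too few
let dropsAudio = framesSeenByDecoder(skipping: contextFrames + 1)    // skip one frame too many

print("intended:     ", intended)
print("leaks context:", leaksContext)
print("drops audio:  ", dropsAudio)
```

Logging the first frame index the decoder consumes for each chunk, on both sides of #264, would be one way to tell which of these cases (if any) applies.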

---

## Workarounds we are using

We have migrated our offline (batch) callers to `SlidingWindowAsrManager` as a short-term workaround — streaming produces clean output on the same audio. We are not calling `AsrManager.transcribe` directly for any production path at the moment.

If it would help, we are happy to send a PR. Possible directions: (a) checking whether `contextFrameAdjustment` is treated as 0-indexed in one place and 1-indexed in another, which would cause an off-by-one, or (b) comparing how the streaming path supplies left context to `executeMLInferenceWithTimings` and aligning the batch path to match.

---

## Environment

- **Hardware:** Apple Silicon (M-series)
- **OS:** macOS (Sequoia / latest)
- **FluidAudio version:** HEAD at time of report is `ce59fb1` (2026-05-09)
- **Model:** `parakeet-tdt-0.6b-v3-coreml` (`AsrModelVersion.v3`)
- **Swift:** 6.x
- **Integration:** Swift Package Manager, CoreML backend, no custom configuration

---

Thanks for FluidAudio — happy to provide more reproducers, the audio file, or test patches.

