Skip to content

feat: offline pronunciation pipeline with NeMo CTC + GOP scoring#20

Open
fasuizu-br wants to merge 1 commit intosilvioprog:mainfrom
fasuizu-br:feat/ctc-gop-pronunciation-pipeline
Open

feat: offline pronunciation pipeline with NeMo CTC + GOP scoring#20
fasuizu-br wants to merge 1 commit intosilvioprog:mainfrom
fasuizu-br:feat/ctc-gop-pronunciation-pipeline

Conversation

@fasuizu-br
Copy link

Summary

  • Adds fully offline, privacy-first pronunciation evaluation using NeMo Conformer CTC Small (INT4, 17.1MB) via ONNX Runtime Web
  • Per-phoneme GOP scoring with Viterbi forced alignment — identifies exactly which sounds the student mispronounces
  • L1-aware Brazilian Portuguese accent adaptation (21 phonemes across 3 tiers) with decoded transcript verification
  • Replaces the Web Speech API approach from PR feat: add speaking practice with Web Speech API #19 which auto-corrected student speech (useless for pronunciation evaluation)

Technical details

Metric Value
Model size 17.1 MB (INT4, 61% smaller than FP32)
WER (native) 2.9%
WER (BR accent) 17.8% (does NOT auto-correct)
Unit tests 155 passing
Cross-browser Chrome, Firefox, Safari (incl. iOS)
Privacy 100% offline — no audio leaves device

Pipeline

AudioWorklet → WebWorker → ONNX inference → Viterbi alignment → GOP scoring → L1 adaptation

L1 Scoring Tiers (Brazilian Portuguese)

  • Tier 1 (50% boost): TH, DH, R, NG, ZH — absent in Portuguese
  • Tier 2 (40% boost): AE, IH, AH, UH, EY, OW, ER, AY, AW, OY — vowel confusion
  • Tier 3 (25% boost): L, T, D, S, Z — context-dependent differences

Real Audio Validation (Speech Accent Archive, George Mason University)

  • 12 Brazilian speakers + 6 native Americans
  • 50% BR-to-native gap recovered by L1 scoring
  • Statistical significance: p=0.0017, Cohen's d=1.28 (large effect)

Files changed

  • src/lib/speechUtils.ts — CTC processing, Viterbi alignment, GOP scoring, L1 adaptation
  • src/workers/stt-worker.ts — WebWorker for ONNX inference
  • src/hooks/useSpeechRecognition.ts — React hook (audio capture + worker communication)
  • src/components/Study/SpeakingPractice.tsx — UI with per-phoneme colored feedback
  • src/lib/types.ts — TypeScript types for pronunciation results
  • public/models/ — INT4 ONNX model (17.1MB) + token vocabulary

Test plan

  • npm run build passes cleanly
  • npx vitest run — 155 tests passing
  • Manual test: record pronunciation, verify per-phoneme scores appear
  • Verify L1 feedback tooltips show BR-specific phoneme adjustments
  • Test on mobile (Android Chrome, iOS Safari)
  • Verify RAM usage < 80MB in Chrome DevTools

Replace Web Speech API approach with a fully offline, privacy-first
pronunciation evaluation system using NeMo Conformer CTC Small (INT4,
17.1MB) running via ONNX Runtime Web.

Key features:
- Per-phoneme GOP scoring with Viterbi forced alignment
- L1-aware Brazilian Portuguese accent adaptation (21 phonemes, 3 tiers)
- Decoded transcript verification with 1.5x boost for confirmed BR patterns
- 100% offline, 100% private — no audio leaves the device
- Cross-browser: Chrome, Firefox, Safari (including iOS)

Technical details:
- Model: NeMo Conformer CTC Small INT4 (17.1MB, 61% smaller than FP32)
- WER: 2.9% native, 17.8% BR accent (does NOT auto-correct speech)
- Pipeline: AudioWorklet → WebWorker → ONNX inference → Viterbi → GOP
- 155 unit tests passing

Validated with real audio from Speech Accent Archive (George Mason Univ):
- 12 Brazilian speakers + 6 native Americans
- 50% BR-to-native gap recovered by L1 scoring
- Statistical significance: p=0.0017, Cohen's d=1.28
@fasuizu-br fasuizu-br force-pushed the feat/ctc-gop-pronunciation-pipeline branch from 28e8dc4 to f823235 Compare February 16, 2026 22:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant