fix: improve lang code standardization by JarbasAl · Pull Request #86 · TigreGotico/phoonnx

JarbasAl · 2025-11-24T02:32:35Z

Summary by CodeRabbit

Bug Fixes
- Corrected Tagalog and Portuguese language mappings to improve voice selection and variant accuracy.
- Adjusted total reported language count to reflect current set.
Chores
- Unified and standardized language normalization and voice indexing to improve consistency across the system.


coderabbitai · 2025-11-24T02:33:15Z

Caution

Review failed

The pull request is closed.

Walkthrough

Replaces ad-hoc use of langcodes.standardize_tag with a new phoonnx.util.normalize_lang helper and updates voice metadata for Tagalog and Portuguese (MMS.json, VOICES.md). Normalization call sites were updated across config, model_manager, and scripts, and a cache clear was added to the main path of model_manager.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Language utility: `phoonnx/util.py` | Added `normalize_lang(lang: str) -> str` that maps Tagalog shortcodes (`"tgl"`, `"tl"`) to `"tl"` then delegates to `standardize_tag()` with fallback to the original input. |
| Configuration & model manager: `phoonnx/config.py`, `phoonnx/model_manager.py` | Replaced `standardize_tag` imports/calls with `normalize_lang`. `VoiceConfig.__post_init__` and TTS model initialization now use `normalize_lang`. `model_manager.py` also clears cache before merging default voices in main. |
| Indexing scripts & voice index: `scripts/index_voices.py`, `phoonnx/voice_index/MMS.json` | Replaced `standardize_tag()` usages with `normalize_lang()` in voice-list generators; added explicit `por -> pt-BR` handling in MMS generator. MMS.json updated: `facebook/mms-tts-por-Portuguese.lang` → `pt-BR`, `facebook/mms-tts-tgl-Tagalog.lang` → `tl`. |
| Documentation: `VOICES.md` | Updated language count (1207 → 1206); Tagalog entry adjusted to use `tl` and repositioned; Portuguese updated to `pt-BR`. |

Sequence Diagram(s)

sequenceDiagram
  %% Styling: highlight normalize_lang as new/changed
  participant Caller as Caller (config/scripts/manager)
  participant Norm as normalize_lang()
  participant Langcodes as standardize_tag()
  Caller->>Norm: normalize_lang(input_lang)
  alt Tagalog special-case (tgl/tl)
    Norm-->>Caller: "tl"
  else Other
    Norm->>Langcodes: standardize_tag(input_lang)
    alt standardize_tag succeeds
      Langcodes-->>Norm: normalized_tag
      Norm-->>Caller: normalized_tag
    else failure
      Langcodes-->>Norm: error
      Norm-->>Caller: original input_lang
    end
  end
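
The flow in the diagram can be sketched roughly as follows. This is a minimal illustration, not the code from `phoonnx/util.py`: the `standardize` parameter and the `always_fails` helper are hypothetical additions so the fallback path can be exercised without `langcodes` installed, and the broad `except Exception` only mirrors the "fall back to the original input" behavior described in the walkthrough.

```python
from typing import Callable, Optional

# Tagalog shortcodes named in the walkthrough; both resolve to "tl".
TAGALOG_CODES = frozenset({"tgl", "tl"})


def normalize_lang(lang: str,
                   standardize: Optional[Callable[[str], str]] = None) -> str:
    """Return a normalized language tag, or the input unchanged on failure.

    `standardize` is a hypothetical injection point for this sketch. When it
    is omitted, langcodes.standardize_tag is imported lazily and used.
    """
    # Special case: Tagalog never reaches the standardizer.
    if lang in TAGALOG_CODES:
        return "tl"
    try:
        if standardize is None:
            from langcodes import standardize_tag as standardize
        return standardize(lang)
    except Exception:
        # Fallback from the walkthrough: keep the original input.
        # (In this sketch a missing langcodes install also lands here.)
        return lang


def always_fails(tag: str) -> str:
    """Hypothetical standardizer that rejects every tag."""
    raise ValueError(tag)


print(normalize_lang("tgl"))                   # tl
print(normalize_lang("xx-bad", always_fails))  # xx-bad
```

Injecting the standardizer is purely a convenience for the example; the walkthrough describes a single-argument `normalize_lang(lang: str) -> str`.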

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Files/areas requiring extra attention:
- phoonnx/util.py — verify special-case mapping and fallback behavior.
- Consistency across MMS.json, VOICES.md, and scripts/index_voices.py for por → pt-BR.
- model_manager.py cache.clear() call — ensure no unintended side effects at runtime.

Possibly related PRs

- documentation: supported voices and languages #80: touches the same normalization paths and voice metadata updates (config, model_manager, scripts).
- feat: MMS/transformers voices support #78: overlaps on MMS voice indexing and language-tag adjustments in voice index files.
- Refactor/drop raw phonemizer #84: modifies voice-indexing and model-management codepaths that interact with language/phoneme normalization.

Poem

🐰 I hop through tags both short and long,
I map "tgl" to "tl" with a cheerful song.
Portuguese now wears "pt-BR" with pride,
Normalized voices all aligned beside—✨
— a rabbit’s nibble on code, wide-eyed.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix: improve lang code standardization' clearly summarizes the main focus of the changeset, which involves replacing standardize_tag with normalize_lang and updating language code mappings across multiple files. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f18b027 and 6fd79c5.

📒 Files selected for processing (4)

phoonnx/config.py (2 hunks)
phoonnx/model_manager.py (3 hunks)
phoonnx/util.py (2 hunks)
scripts/index_voices.py (4 hunks)


coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (3)

phoonnx/model_manager.py (1)

286-289: Clearing cache before merging default voices is reasonable

Calling manager.cache.clear() before merge_default_voices(store=True) ensures the generated cache only reflects the shipped voice index files, avoiding stale or duplicate entries in long-lived caches.
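
To illustrate why clearing first matters, here is a toy sketch. `ToyVoiceManager`, its dict-based cache, and the sample voice ids are all hypothetical stand-ins; the real manager class and the exact semantics of `merge_default_voices(store=True)` are not shown in this review.

```python
from typing import Dict


class ToyVoiceManager:
    """Hypothetical stand-in; not the phoonnx manager class."""

    def __init__(self) -> None:
        # voice id -> language tag
        self.cache: Dict[str, str] = {}

    def merge_default_voices(self, shipped: Dict[str, str]) -> None:
        # Merging only adds or overwrites entries; it never removes any.
        self.cache.update(shipped)


shipped = {"facebook/mms-tts-tgl-Tagalog": "tl"}

stale = ToyVoiceManager()
stale.cache["removed-voice"] = "xx"   # left over from an earlier run
stale.merge_default_voices(shipped)

fresh = ToyVoiceManager()
fresh.cache["removed-voice"] = "xx"
fresh.cache.clear()                   # drop leftovers before merging
fresh.merge_default_voices(shipped)

print(sorted(stale.cache))  # ['facebook/mms-tts-tgl-Tagalog', 'removed-voice']
print(sorted(fresh.cache))  # ['facebook/mms-tts-tgl-Tagalog']
```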

phoonnx/config.py (1)

153-157: VoiceConfig lang_code normalization aligns with new standard

Normalizing self.lang_code via normalize_lang in __post_init__, with a fallback to "und" on failure, keeps configs consistent with model metadata and the voice index.

You could explicitly skip normalization when self.lang_code is falsy (and go straight to "und") to avoid relying on exceptions for that case, but current behavior is functionally fine.
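
A minimal sketch of that suggestion, assuming a simplified dataclass. `VoiceConfigSketch` is hypothetical and has only one field, and the local `normalize_lang` is a stand-in that raises `ValueError` on blank input rather than the real helper from `phoonnx.util`.

```python
from dataclasses import dataclass
from typing import Optional


def normalize_lang(lang: str) -> str:
    """Simplified stand-in for phoonnx.util.normalize_lang (assumption)."""
    if not lang.strip():
        raise ValueError(f"invalid language code: {lang!r}")
    if lang in ("tgl", "tl"):
        return "tl"
    return lang


@dataclass
class VoiceConfigSketch:
    """Hypothetical one-field version of VoiceConfig."""
    lang_code: Optional[str] = None

    def __post_init__(self) -> None:
        # Falsy values (None, "") skip normalization and go straight to "und".
        if not self.lang_code:
            self.lang_code = "und"
            return
        try:
            self.lang_code = normalize_lang(self.lang_code)
        except ValueError:
            # Normalization failed: fall back to the undetermined tag.
            self.lang_code = "und"


print(VoiceConfigSketch().lang_code)       # und
print(VoiceConfigSketch("tgl").lang_code)  # tl
```

With the explicit check, the exception path is reserved for values that are present but cannot be normalized.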

scripts/index_voices.py (1)

255-262: Fix unused exception variable and document exception handling

The Ruff check confirms both issues exist:

F841 (line 261, position 37): e is assigned but never used

BLE001 (line 261, position 20): Blind exception catch

The suggested refactor is appropriate—add logging to use e and make failures visible:
             else:
                 try:
                     std_lang = normalize_lang(lang)
                 except Exception as e:
+                    LOG.warning("Failed to normalize MMS ISO code '%s': %s", lang, e)
                     std_lang = lang
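
Pulled out of its surrounding loop, the suggested handling might look like the sketch below. The function name `mms_std_lang`, the injected `normalize` argument, and the logger name are hypothetical; only the `por` → `pt-BR` branch and the warning message come from the review. The broad `except Exception` is kept from the suggested diff, so BLE001 would likely still be reported unless the exception type is narrowed.

```python
import logging
from typing import Callable

LOG = logging.getLogger("index_voices_sketch")  # hypothetical logger name


def mms_std_lang(lang: str, normalize: Callable[[str], str]) -> str:
    """Map an MMS ISO code to the tag stored in the voice index."""
    if lang == "por":
        # Explicit mapping described in the review: por -> pt-BR.
        return "pt-BR"
    try:
        return normalize(lang)
    except Exception as e:
        # Logging `e` uses the variable (F841) and makes failures visible.
        LOG.warning("Failed to normalize MMS ISO code '%s': %s", lang, e)
        return lang


def rejecting_normalizer(tag: str) -> str:
    """Hypothetical normalizer that always fails, to show the fallback."""
    raise ValueError(f"cannot normalize {tag!r}")


print(mms_std_lang("por", rejecting_normalizer))  # pt-BR
print(mms_std_lang("zzz", rejecting_normalizer))  # zzz (after a warning)
```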

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4a01d98 and f18b027.

📒 Files selected for processing (6)

VOICES.md (3 hunks)
phoonnx/config.py (2 hunks)
phoonnx/model_manager.py (3 hunks)
phoonnx/util.py (2 hunks)
phoonnx/voice_index/MMS.json (2 hunks)
scripts/index_voices.py (4 hunks)

🧰 Additional context used

🧬 Code graph analysis (2)

phoonnx/config.py (1)

phoonnx/util.py (1)

normalize_lang (21-25)

scripts/index_voices.py (1)

phoonnx/util.py (1)

normalize_lang (21-25)

🪛 Ruff (0.14.5)

scripts/index_voices.py

261-261: Do not catch blind exception: Exception

(BLE001)

261-261: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

🔇 Additional comments (8)

phoonnx/voice_index/MMS.json (1)

10484-10484: Verification confirms language code mappings are correct and consistent.

The script output confirms:

Portuguese mapping verified: scripts/index_voices.py (lines 255-257) explicitly maps "por" → "pt-BR", which is the source of the MMS.json change.

Tagalog mapping verified: normalize_lang() in phoonnx/util.py has explicit special case handling: "tgl" or "tl" → "tl".

Consistency verified: All code paths that consume lang_code (config.py, model_manager.py) call normalize_lang(), which preserves these values (standardize_tag("pt-BR") returns "pt-BR"; special case returns "tl").

Only two entries affected: Search results confirm only the Portuguese (lines 10480-10486) and Tagalog (lines 12209-12216) entries require updates.

Downstream compatibility verified: Phonemizers (pt.py, mul.py) and other components expect and support "pt-BR" and "tl" language codes.

The changes are correct and properly aligned with the normalize_lang() implementation.

phoonnx/util.py (1)

12-25: Centralized language normalization helper looks good

normalize_lang cleanly wraps standardize_tag and centralizes the Tagalog special-case; this will make future normalization tweaks easier and keeps call sites consistent.

phoonnx/model_manager.py (1)

66-69: TTSModelInfo lang normalization is consistent with new helper

Using normalize_lang here keeps self.lang and config.lang_code aligned with the same normalization logic used elsewhere (config and indexing), which should avoid mismatches across data sources.

phoonnx/config.py (1)

5-5: Reusing LOG and normalization from util is a nice consolidation

Importing LOG and normalize_lang from phoonnx.util reduces duplication and keeps logging and language normalization behavior centralized.

scripts/index_voices.py (3)

9-9: Importing LOG and normalize_lang keeps tooling and normalization consistent

Bringing in LOG and normalize_lang from phoonnx.util makes this script use the same logging and language normalization conventions as the core library.

151-159: Piper manifest lang normalization via normalize_lang is appropriate

Deriving the language from v["key"].split("-")[0] and normalizing with normalize_lang keeps Piper voices’ language codes aligned with the rest of the system without changing the existing key parsing behavior.

198-215: Mimic3 voice list now benefits from shared normalization

Using normalize_lang(k.split("/")[0]) to set lang for Mimic3 voices unifies language canonicalization with other sources, and the rest of the logic (speaker_map, URLs, config wiring) remains unchanged.
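
The two key-parsing expressions mentioned for Piper and Mimic3 can be tried in isolation. In this sketch the sample keys are assumed examples of each naming scheme, the helper names are hypothetical, and `toy_normalize` is a stand-in that only swaps `_` for `-`; the real `normalize_lang` delegates to `standardize_tag()`.

```python
def piper_lang(key: str) -> str:
    """Language part of a Piper voice key, e.g. 'en_US-lessac-medium'."""
    return key.split("-")[0]


def mimic3_lang(key: str) -> str:
    """Language part of a Mimic3 voice key, e.g. 'en_US/vctk_low'."""
    return key.split("/")[0]


def toy_normalize(lang: str) -> str:
    """Stand-in for normalize_lang: only replaces '_' with '-'."""
    return lang.replace("_", "-")


print(piper_lang("en_US-lessac-medium"))                  # en_US
print(toy_normalize(piper_lang("en_US-lessac-medium")))   # en-US
print(toy_normalize(mimic3_lang("en_US/vctk_low")))       # en-US
```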

VOICES.md (1)

4-4: VOICES.md changes are consistent with new normalization rules

The updated total language count, facebook/mms-tts-por-Portuguese mapped to pt-BR, and Tagalog mapped to tl all match the new normalize_lang behavior and MMS indexing logic.

Also applies to: 1093-1093, 1300-1300

coderabbitai · 2025-11-24T02:47:06Z

Note

Docstrings generation - SUCCESS
Generated docstrings for this pull request at #87

@JarbasAl

Docstrings generation was requested by @JarbasAl.
* #86 (comment)

The following files were modified:
* `phoonnx/config.py`
* `phoonnx/model_manager.py`
* `phoonnx/util.py`
* `scripts/index_voices.py`

@JarbasAl

* refactor!: tokenizer class + deprecate phoneme_ids.py (#70)
* fix: coqui compatibility refactor!: tokenizer class + deprecate phoneme_ids.py fix: missing cotovia data files feat: add new galician models from proxecto nós
* log
* fix
* fix
* Merge pull request #71 from TigreGotico/coderabbitai/docstrings/cb634ab 📝 Add docstrings to `tokenizer`
* adjust
---------
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
* Increment Version to 1.0.0a1
* Update Changelog
* feat: community piper voices + pygoruut support (#73)
* feat: community piper voices + pygoruut support update model manager voice index Total voices: 284 Total langs: 67
* fix neurlang voice-id
* reorder funcs for readability
* 📝 Add docstrings to `models_galore` (#74) Docstrings generation was requested by @JarbasAl.
* #73 (comment) The following files were modified:
* `phoonnx/model_manager.py`
* `phoonnx/util.py`
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
---------
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
* Increment Version to 1.1.0a1
* Update Changelog
* feat: more piper english community voices (#76) Total voices: 314
* Increment Version to 1.2.0a1
* Update Changelog
* feat: transformers support (#78) feat: MMS voices refactor: move index to static .json files
* Increment Version to 1.3.0a1
* Update Changelog
* documentation: supported voices and languages (#80)
* documentation: supported voices and languages
* documentation: supported voices and languages
* documentation: supported voices and languages
* Increment Version to 1.3.0a2
* Update Changelog
* documentation: supported voices and languages (#82)
* Increment Version to 1.3.0a3
* Update Changelog
* fix: failing MMS models indexing (#84)
* Increment Version to 1.3.0a4
* Update Changelog
* fix: improve lang code standardization (#86)
* fix: improve lang code standardization
* siimplify error handling
* Increment Version to 1.3.1a1
* Update Changelog
* Add renovate.json (#89)
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
* Increment Version to 1.3.1a2
* Update Changelog
* fix: mantoq2ipa + improve lang code normalization (#90)
* fix: improve lang code standardization
* siimplify error handling
* fix: better arabic ipa g2p
* fix tests
* rrm unused arg
* Increment Version to 1.3.2a1
* chore(deps): update actions/checkout action to v6 (#92)
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
* Increment Version to 1.3.2a2
* chore(deps): update actions/setup-python action to v6 (#96)
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
* Increment Version to 1.3.2a3
* Update Changelog
* add more voices (#99)
* Increment Version to 1.3.2a4
* Update Changelog
* 📝 Add docstrings to `patch-2` (#102) Docstrings generation was requested by @JarbasAl.
* #101 (comment) The following files were modified:
* `phoonnx/model_manager.py`
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
* Increment Version to 1.3.2a5
* Update Changelog
* Add files via upload
* fix: dont chunk on commas, update voice index (#104)
* fix: dont chunk on commas, update voice index
* fix: dont chunk on commas, update voice index
* 📝 Add docstrings to `fixes` (#105) Docstrings generation was requested by @JarbasAl.
* #104 (comment) The following files were modified:
* `phoonnx/model_manager.py`
* `phoonnx/opm.py`
* `phoonnx/phonemizers/base.py`
* `phoonnx_train/vits/dataset.py`
* `phoonnx_train/vits/lightning.py`
* `scripts/index_voices.py`
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
* fix: dont chunk on commas, update voice index
---------
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
* Increment Version to 1.3.3a1
* Update Changelog
* fix: lazy load VoiceConfig (#107) delay network requests until needed 📝 Add docstrings to `fixes` (#108) Docstrings generation was requested by @JarbasAl.
* #107 (comment) The following files were modified:
* `phoonnx/model_manager.py`
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
* Increment Version to 1.3.3a2
* Update Changelog
---------
Co-authored-by: JarbasAI <33701864+JarbasAl@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: JarbasAl <JarbasAl@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

fix: improve lang code standardization

f18b027

github-actions Bot added the fix label Nov 24, 2025

github-actions Bot added fix and removed fix labels Nov 24, 2025

coderabbitai Bot reviewed Nov 24, 2025


siimplify error handling

6fd79c5

coderabbitai Bot mentioned this pull request Nov 24, 2025

📝 Add docstrings to fix/lang_code_norm #87

Closed

JarbasAl merged commit 840b7c8 into dev Nov 24, 2025
3 checks passed

github-actions Bot added fix and removed fix labels Nov 24, 2025

coderabbitai Bot mentioned this pull request Dec 27, 2025

fix: mantoq2ipa + improve lang code normalization #90

Merged

coderabbitai Bot mentioned this pull request Jun 7, 2026

chore(index): BCP-47 lang codes + regenerate VOICES.md #154

Merged

JarbasAl merged 2 commits into dev from fix/lang_code_norm
