
fix: improve lang code standardization#86

Merged
JarbasAl merged 2 commits into dev from fix/lang_code_norm
Nov 24, 2025
Conversation

@JarbasAl

@JarbasAl JarbasAl commented Nov 24, 2025

Contributor

Summary by CodeRabbit

  • Bug Fixes

    • Corrected Tagalog and Portuguese language mappings to improve voice selection and variant accuracy.
    • Adjusted total reported language count to reflect current set.
  • Chores

    • Unified and standardized language normalization and voice indexing to improve consistency across the system.


@github-actions github-actions Bot added the fix label Nov 24, 2025
@coderabbitai

coderabbitai Bot commented Nov 24, 2025

Contributor

Caution

Review failed

The pull request is closed.

Walkthrough

Replaces ad-hoc use of langcodes.standardize_tag with a new phoonnx.util.normalize_lang helper and updates voice metadata for Tagalog and Portuguese (MMS.json, VOICES.md). Calls to normalization were updated across config, model_manager, and scripts; model cache clear added in model_manager main path.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Language utility: phoonnx/util.py | Added normalize_lang(lang: str) -> str that maps Tagalog shortcodes ("tgl", "tl") to "tl" then delegates to standardize_tag() with fallback to the original input. |
| Configuration & model manager: phoonnx/config.py, phoonnx/model_manager.py | Replaced standardize_tag imports/calls with normalize_lang. VoiceConfig.__post_init__ and TTS model initialization now use normalize_lang. model_manager.py also clears cache before merging default voices in main. |
| Indexing scripts & voice index: scripts/index_voices.py, phoonnx/voice_index/MMS.json | Replaced standardize_tag() usages with normalize_lang() in voice-list generators; added explicit por -> pt-BR handling in MMS generator. MMS.json updated: facebook/mms-tts-por-Portuguese.lang → pt-BR, facebook/mms-tts-tgl-Tagalog.lang → tl. |
| Documentation: VOICES.md | Updated language count (1207 → 1206); Tagalog entry adjusted to use tl and repositioned; Portuguese updated to pt-BR. |

Sequence Diagram(s)

sequenceDiagram
  %% Styling: highlight normalize_lang as new/changed
  participant Caller as Caller (config/scripts/manager)
  participant Norm as normalize_lang()
  participant Langcodes as standardize_tag()
  Caller->>Norm: normalize_lang(input_lang)
  alt Tagalog special-case (tgl/tl)
    Norm-->>Caller: "tl"
  else Other
    Norm->>Langcodes: standardize_tag(input_lang)
    alt standardize_tag succeeds
      Langcodes-->>Norm: normalized_tag
      Norm-->>Caller: normalized_tag
    else failure
      Langcodes-->>Norm: error
      Norm-->>Caller: original input_lang
    end
  end
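
The flow in the diagram can be sketched roughly as follows. This is an illustration of the described behavior, not the code in phoonnx/util.py: the injectable `standardize` parameter, the lazy import wrapper, and the lower-casing of the input are assumptions of this sketch. `langcodes.standardize_tag` is the real third-party function named in the walkthrough.

```python
from typing import Callable, Optional

# Tagalog codes that the walkthrough says are pinned to "tl".
TAGALOG_CODES = {"tgl", "tl"}


def _langcodes_standardize(tag: str) -> str:
    """Delegate to langcodes.standardize_tag (third-party, imported lazily)."""
    from langcodes import standardize_tag
    return standardize_tag(tag)


def normalize_lang(lang: str,
                   standardize: Optional[Callable[[str], str]] = None) -> str:
    """Return a canonical language tag, or the input unchanged on failure."""
    # Special case first; lower-casing here is an assumption of this sketch.
    if lang.lower() in TAGALOG_CODES:
        return "tl"
    standardize = standardize or _langcodes_standardize
    try:
        return standardize(lang)
    except Exception:
        # Failure branch of the diagram: hand back the original input.
        return lang


if __name__ == "__main__":
    print(normalize_lang("tgl"))  # Tagalog special case, never reaches langcodes
```

Passing the standardizer as a parameter is only there so the fallback branch can be exercised without langcodes installed; the PR itself calls standardize_tag directly according to the summary.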

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Files/areas requiring extra attention:
    • phoonnx/util.py — verify special-case mapping and fallback behavior.
    • Consistency across MMS.json, VOICES.md, and scripts/index_voices.py for pt → pt-BR.
    • model_manager.py cache.clear() call — ensure no unintended side effects at runtime.

Possibly related PRs

Poem

🐰 I hop through tags both short and long,
I map "tgl" to "tl" with a cheerful song.
Portuguese now wears "pt-BR" with pride,
Normalized voices all aligned beside—✨
— a rabbit’s nibble on code, wide-eyed.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix: improve lang code standardization' clearly summarizes the main focus of the changeset, which involves replacing standardize_tag with normalize_lang and updating language code mappings across multiple files. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f18b027 and 6fd79c5.

📒 Files selected for processing (4)
  • phoonnx/config.py (2 hunks)
  • phoonnx/model_manager.py (3 hunks)
  • phoonnx/util.py (2 hunks)
  • scripts/index_voices.py (4 hunks)


@github-actions github-actions Bot added fix and removed fix labels Nov 24, 2025

@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 0

🧹 Nitpick comments (3)
phoonnx/model_manager.py (1)

286-289: Clearing cache before merging default voices is reasonable

Calling manager.cache.clear() before merge_default_voices(store=True) ensures the generated cache only reflects the shipped voice index files, avoiding stale or duplicate entries in long-lived caches.
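
A minimal sketch of why the ordering matters, assuming the cache behaves like a dict keyed by voice id. `VoiceManagerSketch`, `shipped_index`, `persist`, and the stale entry are hypothetical stand-ins; only `cache.clear()` and `merge_default_voices(store=True)` come from the review comment.

```python
from typing import Dict


class VoiceManagerSketch:
    """Hypothetical stand-in for the real manager; only the cache matters here."""

    def __init__(self, shipped_index: Dict[str, dict]) -> None:
        self.shipped_index = shipped_index
        self.cache: Dict[str, dict] = {}

    def merge_default_voices(self, store: bool = False) -> None:
        # Merge shipped entries on top of whatever is already cached.
        self.cache.update(self.shipped_index)
        if store:
            self.persist()

    def persist(self) -> None:
        # Placeholder: the real code presumably writes the cache somewhere.
        pass


manager = VoiceManagerSketch({"facebook/mms-tts-tgl-Tagalog": {"lang": "tl"}})
# A leftover entry from an earlier run (made-up id, old language code).
manager.cache["example/stale-voice"] = {"lang": "tgl"}

manager.cache.clear()                     # drop stale entries first
manager.merge_default_voices(store=True)  # cache now mirrors the shipped index only
```

Without the `clear()` call, the stale entry would survive the merge and be stored alongside the shipped voices.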

phoonnx/config.py (1)

153-157: VoiceConfig lang_code normalization aligns with new standard

Normalizing self.lang_code via normalize_lang in __post_init__, with a fallback to "und" on failure, keeps configs consistent with model metadata and the voice index.

You could explicitly skip normalization when self.lang_code is falsy (and go straight to "und") to avoid relying on exceptions for that case, but current behavior is functionally fine.
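
The suggested early exit could look something like the sketch below. `VoiceConfigSketch` and `_normalize` are hypothetical; the real VoiceConfig has more fields and calls normalize_lang, whose failure modes may differ from this stand-in.

```python
from dataclasses import dataclass
from typing import Optional


def _normalize(tag: str) -> str:
    """Stand-in normalizer for this sketch; not the real normalize_lang."""
    if not tag.replace("-", "").replace("_", "").isalnum():
        raise ValueError(f"invalid language tag: {tag!r}")
    return tag.replace("_", "-")


@dataclass
class VoiceConfigSketch:
    lang_code: Optional[str] = None

    def __post_init__(self) -> None:
        if not self.lang_code:
            # Falsy input (None or ""): skip normalization entirely.
            self.lang_code = "und"
            return
        try:
            self.lang_code = _normalize(self.lang_code)
        except ValueError:
            # Unparseable input still falls back to the undetermined tag.
            self.lang_code = "und"
```

The explicit check keeps the missing-value case out of the exception path, so the `except` branch only handles tags that were present but could not be normalized.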

scripts/index_voices.py (1)

255-262: Fix unused exception variable and document exception handling

The Ruff check confirms both issues exist:

  • F841 (line 261, position 37): e is assigned but never used
  • BLE001 (line 261, position 20): Blind exception catch

The suggested refactor is appropriate—add logging to use e and make failures visible:

             else:
                 try:
                     std_lang = normalize_lang(lang)
                 except Exception as e:
+                    LOG.warning("Failed to normalize MMS ISO code '%s': %s", lang, e)
                     std_lang = lang
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4a01d98 and f18b027.

📒 Files selected for processing (6)
  • VOICES.md (3 hunks)
  • phoonnx/config.py (2 hunks)
  • phoonnx/model_manager.py (3 hunks)
  • phoonnx/util.py (2 hunks)
  • phoonnx/voice_index/MMS.json (2 hunks)
  • scripts/index_voices.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
phoonnx/config.py (1)
phoonnx/util.py (1)
  • normalize_lang (21-25)
scripts/index_voices.py (1)
phoonnx/util.py (1)
  • normalize_lang (21-25)
🪛 Ruff (0.14.5)
scripts/index_voices.py

261-261: Do not catch blind exception: Exception

(BLE001)


261-261: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

🔇 Additional comments (8)
phoonnx/voice_index/MMS.json (1)

10484-10484: Verification confirms language code mappings are correct and consistent.

The script output confirms:

  • Portuguese mapping verified: scripts/index_voices.py (lines 255-257) explicitly maps "por" → "pt-BR", which is the source of the MMS.json change.
  • Tagalog mapping verified: normalize_lang() in phoonnx/util.py has explicit special case handling: "tgl" or "tl" → "tl".
  • Consistency verified: All code paths that consume lang_code (config.py, model_manager.py) call normalize_lang(), which preserves these values (standardize_tag("pt-BR") returns "pt-BR"; special case returns "tl").
  • Only two entries affected: Search results confirm only the Portuguese (line 10480-10486) and Tagalog (line 12209-12216) entries require updates.
  • Downstream compatibility verified: Phonemizers (pt.py, mul.py) and other components expect and support "pt-BR" and "tl" language codes.

The changes are correct and properly aligned with the normalize_lang() implementation.
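
As a rough illustration of the MMS path described above, the sketch below applies the explicit Portuguese override before falling back to normalization. The `MMS_OVERRIDES` table, `mms_lang`, and `toy_normalize` are hypothetical names for this example; the script itself is described as using an if/else branch and normalize_lang.

```python
import logging
from typing import Callable

LOG = logging.getLogger("index_voices_sketch")

# Explicit override noted in the review: MMS "por" voices are indexed as pt-BR.
MMS_OVERRIDES = {"por": "pt-BR"}


def mms_lang(iso_code: str, normalize: Callable[[str], str]) -> str:
    """Pick the language tag for an MMS voice from its ISO code."""
    if iso_code in MMS_OVERRIDES:
        return MMS_OVERRIDES[iso_code]
    try:
        return normalize(iso_code)
    except Exception as e:
        # Make the failure visible instead of silently swallowing it.
        LOG.warning("Failed to normalize MMS ISO code '%s': %s", iso_code, e)
        return iso_code


def toy_normalize(code: str) -> str:
    """Toy normalizer used only to exercise the branches above."""
    if code in ("tgl", "tl"):
        return "tl"
    if len(code) not in (2, 3):
        raise ValueError("unsupported code")
    return code
```

Checking the override first means the Portuguese result does not depend on what the general normalizer would return for "por".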

phoonnx/util.py (1)

12-25: Centralized language normalization helper looks good

normalize_lang cleanly wraps standardize_tag and centralizes the Tagalog special-case; this will make future normalization tweaks easier and keeps call sites consistent.

phoonnx/model_manager.py (1)

66-69: TTSModelInfo lang normalization is consistent with new helper

Using normalize_lang here keeps self.lang and config.lang_code aligned with the same normalization logic used elsewhere (config and indexing), which should avoid mismatches across data sources.

phoonnx/config.py (1)

5-5: Reusing LOG and normalization from util is a nice consolidation

Importing LOG and normalize_lang from phoonnx.util reduces duplication and keeps logging and language normalization behavior centralized.

scripts/index_voices.py (3)

9-9: Importing LOG and normalize_lang keeps tooling and normalization consistent

Bringing in LOG and normalize_lang from phoonnx.util makes this script use the same logging and language normalization conventions as the core library.


151-159: Piper manifest lang normalization via normalize_lang is appropriate

Deriving the language from v["key"].split("-")[0] and normalizing with normalize_lang keeps Piper voices’ language codes aligned with the rest of the system without changing the existing key parsing behavior.


198-215: Mimic3 voice list now benefits from shared normalization

Using normalize_lang(k.split("/")[0]) to set lang for Mimic3 voices unifies language canonicalization with other sources, and the rest of the logic (speaker_map, URLs, config wiring) remains unchanged.
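
The two key-parsing expressions quoted above can be tried in isolation with a sketch like this one. `toy_normalize` is a stand-in that only fixes separators and casing, not the real normalize_lang, and the example keys are illustrative rather than taken from the actual voice lists.

```python
def toy_normalize(tag: str) -> str:
    """Stand-in for normalize_lang: only fixes separator and casing."""
    parts = tag.replace("_", "-").split("-")
    parts[0] = parts[0].lower()
    if len(parts) > 1:
        parts[1] = parts[1].upper()
    return "-".join(parts)


def piper_lang(voice: dict) -> str:
    # Piper manifest keys start with the locale, followed by "-"-separated parts.
    return toy_normalize(voice["key"].split("-")[0])


def mimic3_lang(key: str) -> str:
    # Mimic3 keys put the locale before the first "/".
    return toy_normalize(key.split("/")[0])


if __name__ == "__main__":
    print(piper_lang({"key": "en_US-lessac-medium"}))
    print(mimic3_lang("en_UK/apope_low"))
```

Both helpers extract the raw locale the same way the review describes and then pass it through a single normalizer, which is the consistency the change is after.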

VOICES.md (1)

4-4: VOICES.md changes are consistent with new normalization rules

The updated total language count, facebook/mms-tts-por-Portuguese mapped to pt-BR, and Tagalog mapped to tl all match the new normalize_lang behavior and MMS indexing logic.

Also applies to: 1093-1093, 1300-1300

@coderabbitai

coderabbitai Bot commented Nov 24, 2025

Contributor

Note

Docstrings generation - SUCCESS
Generated docstrings for this pull request at #87

coderabbitai Bot added a commit that referenced this pull request Nov 24, 2025
Docstrings generation was requested by @JarbasAl.

* #86 (comment)

The following files were modified:

* `phoonnx/config.py`
* `phoonnx/model_manager.py`
* `phoonnx/util.py`
* `scripts/index_voices.py`
@JarbasAl JarbasAl merged commit 840b7c8 into dev Nov 24, 2025
3 checks passed
@github-actions github-actions Bot added fix and removed fix labels Nov 24, 2025
JarbasAl added a commit that referenced this pull request Feb 16, 2026
* refactor!: tokenizer class + deprecate phoneme_ids.py (#70)

* fix: coqui compatibility

refactor!: tokenizer class + deprecate phoneme_ids.py

fix: missing cotovia data files

feat: add new galician models from proxecto nós

* log

* fix

* fix

* Merge pull request #71 from TigreGotico/coderabbitai/docstrings/cb634ab

📝 Add docstrings to `tokenizer`

* adjust

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Increment Version to 1.0.0a1

* Update Changelog

* feat: community piper voices + pygoruut support (#73)

* feat: community piper voices + pygoruut support

update model manager voice index

Total voices: 284
Total langs: 67

* fix neurlang voice-id

* reorder funcs for readability

* 📝 Add docstrings to `models_galore` (#74)

Docstrings generation was requested by @JarbasAl.

* #73 (comment)

The following files were modified:

* `phoonnx/model_manager.py`
* `phoonnx/util.py`

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Increment Version to 1.1.0a1

* Update Changelog

* feat: more piper english community voices (#76)

Total voices: 314

* Increment Version to 1.2.0a1

* Update Changelog

* feat: transformers support (#78)

feat: MMS voices

refactor: move index to static .json files

* Increment Version to 1.3.0a1

* Update Changelog

* documentation: supported voices and languages (#80)

* documentation: supported voices and languages

* documentation: supported voices and languages

* documentation: supported voices and languages

* Increment Version to 1.3.0a2

* Update Changelog

* documentation: supported voices and languages (#82)

* Increment Version to 1.3.0a3

* Update Changelog

* fix: failing MMS models indexing (#84)

* Increment Version to 1.3.0a4

* Update Changelog

* fix: improve lang code standardization (#86)

* fix: improve lang code standardization

* simplify error handling

* Increment Version to 1.3.1a1

* Update Changelog

* Add renovate.json (#89)

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

* Increment Version to 1.3.1a2

* Update Changelog

* fix: mantoq2ipa + improve lang code normalization (#90)

* fix: improve lang code standardization

* simplify error handling

* fix: better arabic ipa g2p

* fix tests

* rrm unused arg

* Increment Version to 1.3.2a1

* chore(deps): update actions/checkout action to v6 (#92)

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

* Increment Version to 1.3.2a2

* chore(deps): update actions/setup-python action to v6 (#96)

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

* Increment Version to 1.3.2a3

* Update Changelog

* add more voices (#99)

* Increment Version to 1.3.2a4

* Update Changelog

* 📝 Add docstrings to `patch-2` (#102)

Docstrings generation was requested by @JarbasAl.

* #101 (comment)

The following files were modified:

* `phoonnx/model_manager.py`

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Increment Version to 1.3.2a5

* Update Changelog

* Add files via upload

* fix: dont chunk on commas, update voice index (#104)

* fix: dont chunk on commas, update voice index

* fix: dont chunk on commas, update voice index

* 📝 Add docstrings to `fixes` (#105)

Docstrings generation was requested by @JarbasAl.

* #104 (comment)

The following files were modified:

* `phoonnx/model_manager.py`
* `phoonnx/opm.py`
* `phoonnx/phonemizers/base.py`
* `phoonnx_train/vits/dataset.py`
* `phoonnx_train/vits/lightning.py`
* `scripts/index_voices.py`

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* fix: dont chunk on commas, update voice index

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Increment Version to 1.3.3a1

* Update Changelog

* fix: lazy load VoiceConfig (#107)

delay network requests until needed

📝 Add docstrings to `fixes` (#108)

Docstrings generation was requested by @JarbasAl.

* #107 (comment)

The following files were modified:

* `phoonnx/model_manager.py`

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* Increment Version to 1.3.3a2

* Update Changelog

---------

Co-authored-by: JarbasAI <33701864+JarbasAl@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: JarbasAl <JarbasAl@users.noreply.github.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>