fix: improve lang code standardization#86
Conversation
|
Caution Review failedThe pull request is closed. WalkthroughReplaces ad-hoc use of langcodes.standardize_tag with a new phoonnx.util.normalize_lang helper and updates voice metadata for Tagalog and Portuguese (MMS.json, VOICES.md). Calls to normalization were updated across config, model_manager, and scripts; model cache clear added in model_manager main path. Changes
Sequence Diagram(s)sequenceDiagram
%% Styling: highlight normalize_lang as new/changed
participant Caller as Caller (config/scripts/manager)
participant Norm as normalize_lang()
participant Langcodes as standardize_tag()
Caller->>Norm: normalize_lang(input_lang)
alt Tagalog special-case (tgl/tl)
Norm-->>Caller: "tl"
else Other
Norm->>Langcodes: standardize_tag(input_lang)
alt standardize_tag succeeds
Langcodes-->>Norm: normalized_tag
Norm-->>Caller: normalized_tag
else failure
Langcodes-->>Norm: error
Norm-->>Caller: original input_lang
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (4)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (3)
phoonnx/model_manager.py (1)
286-289: Clearing cache before merging default voices is reasonableCalling
manager.cache.clear()beforemerge_default_voices(store=True)ensures the generated cache only reflects the shipped voice index files, avoiding stale or duplicate entries in long-lived caches.phoonnx/config.py (1)
153-157: VoiceConfig lang_code normalization aligns with new standardNormalizing
self.lang_codevianormalize_langin__post_init__, with a fallback to"und"on failure, keeps configs consistent with model metadata and the voice index.You could explicitly skip normalization when
self.lang_codeis falsy (and go straight to"und") to avoid relying on exceptions for that case, but current behavior is functionally fine.scripts/index_voices.py (1)
255-262: Fix unused exception variable and document exception handlingThe Ruff check confirms both issues exist:
- F841 (line 261, position 37):
eis assigned but never used- BLE001 (line 261, position 20): Blind exception catch
The suggested refactor is appropriate—add logging to use
eand make failures visible:else: try: std_lang = normalize_lang(lang) except Exception as e: + LOG.warning("Failed to normalize MMS ISO code '%s': %s", lang, e) std_lang = lang
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
VOICES.md(3 hunks)phoonnx/config.py(2 hunks)phoonnx/model_manager.py(3 hunks)phoonnx/util.py(2 hunks)phoonnx/voice_index/MMS.json(2 hunks)scripts/index_voices.py(4 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
phoonnx/config.py (1)
phoonnx/util.py (1)
normalize_lang(21-25)
scripts/index_voices.py (1)
phoonnx/util.py (1)
normalize_lang(21-25)
🪛 Ruff (0.14.5)
scripts/index_voices.py
261-261: Do not catch blind exception: Exception
(BLE001)
261-261: Local variable e is assigned to but never used
Remove assignment to unused variable e
(F841)
🔇 Additional comments (8)
phoonnx/voice_index/MMS.json (1)
10484-10484: Verification confirms language code mappings are correct and consistent.The script output confirms:
- Portuguese mapping verified: scripts/index_voices.py (lines 255-257) explicitly maps
"por"→"pt-BR", which is the source of the MMS.json change.- Tagalog mapping verified: normalize_lang() in phoonnx/util.py has explicit special case handling:
"tgl"or"tl"→"tl".- Consistency verified: All code paths that consume lang_code (config.py, model_manager.py) call normalize_lang(), which preserves these values (standardize_tag("pt-BR") returns "pt-BR"; special case returns "tl").
- Only two entries affected: Search results confirm only the Portuguese (line 10480-10486) and Tagalog (line 12209-12216) entries require updates.
- Downstream compatibility verified: Phonemizers (pt.py, mul.py) and other components expect and support "pt-BR" and "tl" language codes.
The changes are correct and properly aligned with the normalize_lang() implementation.
phoonnx/util.py (1)
12-25: Centralized language normalization helper looks good
normalize_langcleanly wrapsstandardize_tagand centralizes the Tagalog special-case; this will make future normalization tweaks easier and keeps call sites consistent.phoonnx/model_manager.py (1)
66-69: TTSModelInfo lang normalization is consistent with new helperUsing
normalize_langhere keepsself.langandconfig.lang_codealigned with the same normalization logic used elsewhere (config and indexing), which should avoid mismatches across data sources.phoonnx/config.py (1)
5-5: Reusing LOG and normalization from util is a nice consolidationImporting
LOGandnormalize_langfromphoonnx.utilreduces duplication and keeps logging and language normalization behavior centralized.scripts/index_voices.py (3)
9-9: Importing LOG and normalize_lang keeps tooling and normalization consistentBringing in
LOGandnormalize_langfromphoonnx.utilmakes this script use the same logging and language normalization conventions as the core library.
151-159: Piper manifest lang normalization via normalize_lang is appropriateDeriving the language from
v["key"].split("-")[0]and normalizing withnormalize_langkeeps Piper voices’ language codes aligned with the rest of the system without changing the existing key parsing behavior.
198-215: Mimic3 voice list now benefits from shared normalizationUsing
normalize_lang(k.split("/")[0])to setlangfor Mimic3 voices unifies language canonicalization with other sources, and the rest of the logic (speaker_map, URLs, config wiring) remains unchanged.VOICES.md (1)
4-4: VOICES.md changes are consistent with new normalization rulesThe updated total language count,
facebook/mms-tts-por-Portuguesemapped topt-BR, and Tagalog mapped totlall match the newnormalize_langbehavior and MMS indexing logic.Also applies to: 1093-1093, 1300-1300
|
Note Docstrings generation - SUCCESS |
Docstrings generation was requested by @JarbasAl. * #86 (comment) The following files were modified: * `phoonnx/config.py` * `phoonnx/model_manager.py` * `phoonnx/util.py` * `scripts/index_voices.py`
* refactor!: tokenizer class + deprecate phoneme_ids.py (#70) * fix: coqui compatibility refactor!: tokenizer class + deprecate phoneme_ids.py fix: missing cotovia data files feat: add new galician models from proxecto nós * log * fix * fix * Merge pull request #71 from TigreGotico/coderabbitai/docstrings/cb634ab 📝 Add docstrings to `tokenizer` * adjust --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * Increment Version to 1.0.0a1 * Update Changelog * feat: community piper voices + pygoruut support (#73) * feat: community piper voices + pygoruut support update model manager voice index Total voices: 284 Total langs: 67 * fix neurlang voice-id * reorder funcs for readability * 📝 Add docstrings to `models_galore` (#74) Docstrings generation was requested by @JarbasAl. * #73 (comment) The following files were modified: * `phoonnx/model_manager.py` * `phoonnx/util.py` Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * Increment Version to 1.1.0a1 * Update Changelog * feat: more piper english community voices (#76) Total voices: 314 * Increment Version to 1.2.0a1 * Update Changelog * feat: transformers support (#78) feat: MMS voices refactor: move index to static .json files * Increment Version to 1.3.0a1 * Update Changelog * documentation: supported voices and languages (#80) * documentation: supported voices and languages * documentation: supported voices and languages * documentation: supported voices and languages * Increment Version to 1.3.0a2 * Update Changelog * documentation: supported voices and languages (#82) * Increment Version to 1.3.0a3 * Update Changelog * fix: failing MMS models indexing (#84) * Increment Version to 1.3.0a4 * Update Changelog * fix: improve lang code standardization (#86) * fix: improve lang code standardization * siimplify error handling * Increment Version to 1.3.1a1 * Update Changelog * Add renovate.json (#89) Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> * Increment Version to 1.3.1a2 * Update Changelog * fix: mantoq2ipa + improve lang code normalization (#90) * fix: improve lang code standardization * siimplify error handling * fix: better arabic ipa g2p * fix tests * rrm unused arg * Increment Version to 1.3.2a1 * chore(deps): update actions/checkout action to v6 (#92) Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> * Increment Version to 1.3.2a2 * chore(deps): update actions/setup-python action to v6 (#96) Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> * Increment Version to 1.3.2a3 * Update Changelog * add more voices (#99) * Increment Version to 1.3.2a4 * Update Changelog * 📝 Add docstrings to `patch-2` (#102) Docstrings generation was requested by @JarbasAl. * #101 (comment) The following files were modified: * `phoonnx/model_manager.py` Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * Increment Version to 1.3.2a5 * Update Changelog * Add files via upload * fix: dont chunk on commas, update voice index (#104) * fix: dont chunk on commas, update voice index * fix: dont chunk on commas, update voice index * 📝 Add docstrings to `fixes` (#105) Docstrings generation was requested by @JarbasAl. * #104 (comment) The following files were modified: * `phoonnx/model_manager.py` * `phoonnx/opm.py` * `phoonnx/phonemizers/base.py` * `phoonnx_train/vits/dataset.py` * `phoonnx_train/vits/lightning.py` * `scripts/index_voices.py` Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * fix: dont chunk on commas, update voice index --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * Increment Version to 1.3.3a1 * Update Changelog * fix: lazy load VoiceConfig (#107) delay network requests until needed 📝 Add docstrings to `fixes` (#108) Docstrings generation was requested by @JarbasAl. * #107 (comment) The following files were modified: * `phoonnx/model_manager.py` Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * Increment Version to 1.3.3a2 * Update Changelog --------- Co-authored-by: JarbasAI <33701864+JarbasAl@users.noreply.github.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: JarbasAl <JarbasAl@users.noreply.github.com> Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Summary by CodeRabbit
Bug Fixes
Chores
✏️ Tip: You can customize this high-level summary in your review settings.