feat(analyzer): add Taiwan-specific recognizers for national id and phone #2073
Draft
matheme-justyn wants to merge 12 commits into
Contributor
Pull request overview
Adds Taiwan (TW) country-specific predefined recognizers to presidio-analyzer for detecting Taiwan National ID numbers and Taiwan phone numbers, including default-registry configuration, unit tests, and public documentation updates.
Changes:
- Introduce `TwNationalIdRecognizer` (checksum-validated) and `TwPhoneNumberRecognizer` (TW-region wrapper over the generic `PhoneRecognizer`).
- Register both recognizers in `default_recognizers.yaml` (disabled by default) and expose them via `presidio_analyzer.predefined_recognizers`.
- Add unit tests and update supported-entities docs and changelog entries.
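As a rough illustration of what "checksum-validated" means here, the sketch below follows the publicly documented Taiwan ID rule: the leading letter expands to a two-digit code, the resulting 11 digits are weighted, and the weighted sum must be divisible by 10. It also applies the shape check from the PR description (one letter, then `1` or `2`, then eight more digits). The function and constant names are hypothetical and are not taken from the PR's `TwNationalIdRecognizer`.

```python
import re

# Letters ordered so that index + 10 gives the two-digit code used by the
# public checksum rule (A=10, B=11, ..., with W=32, Z=33, I=34, O=35).
_LETTER_ORDER = "ABCDEFGHJKLMNPQRSTUVXYWZIO"
# One weight per digit: two for the expanded letter code, nine for the ID digits.
_WEIGHTS = (1, 9, 8, 7, 6, 5, 4, 3, 2, 1, 1)
# Shape check: one uppercase letter, then 1 or 2, then eight more digits.
_CANDIDATE = re.compile(r"[A-Z][12][0-9]{8}")


def is_valid_tw_national_id(candidate: str) -> bool:
    """Return True if the candidate has the expected shape and passes the checksum.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    if not _CANDIDATE.fullmatch(candidate):
        return False
    letter_code = _LETTER_ORDER.index(candidate[0]) + 10
    digits = [letter_code // 10, letter_code % 10]
    digits += [int(ch) for ch in candidate[1:]]
    total = sum(d * w for d, w in zip(digits, _WEIGHTS))
    return total % 10 == 0


if __name__ == "__main__":
    print(is_valid_tw_national_id("A123456789"))
    print(is_valid_tw_national_id("A123456788"))
```

Keeping the regex strict and running the checksum only on shape-valid candidates mirrors the two-stage "pattern match, then validate" flow that pattern-based recognizers typically use.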
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/taiwan/tw_national_id_recognizer.py | New TW national ID recognizer with checksum validation. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/taiwan/tw_phone_number_recognizer.py | New TW phone recognizer wrapper restricting validation to region TW. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/__init__.py | Exposes Taiwan recognizers from the predefined recognizers package. |
| presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml | Adds disabled-by-default YAML entries for the two TW recognizers with country_code: tw. |
| presidio-analyzer/tests/test_tw_national_id_recognizer.py | Adds unit coverage for TW national ID detection and checksum validation. |
| presidio-analyzer/tests/test_tw_phone_number_recognizer.py | Adds unit coverage for TW phone number detection and metadata defaults. |
| presidio-analyzer/tests/test_recognizer_registry.py | Adds registry-level test ensuring YAML entries exist and classes resolve via loader. |
| docs/supported_entities.md | Documents the new TW entity types (needs section placement/content correction per review). |
| CHANGELOG.md | Notes the addition of the two Taiwan recognizers under Unreleased. |
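The region-restricted phone validation can be pictured with the simplified sketch below. It mirrors the flow given in the PR description (normalize the input, accept `+886` or a default region of `TW`, strip the trunk prefix `0`, then check length and a full regex match per number type). It uses only the standard library: the pattern table and length set are loose, hypothetical stand-ins for the real `python-phonenumbers` Taiwan metadata, and none of the names come from the PR.

```python
import re
from typing import Optional

TW_COUNTRY_CODE = "886"
TW_TRUNK_PREFIX = "0"
# Assumption: illustrative values only; the real metadata is more restrictive.
TW_POSSIBLE_LENGTHS = {8, 9}
TW_PATTERNS = {
    "mobile": r"9[0-9]{8}",
    "fixed_line": r"[2-8][0-9]{7,8}",
}


def tw_national_number(text: str, default_region: str = "TW") -> Optional[str]:
    """Normalize text and return the national number, or None if it is not TW."""
    is_international = text.strip().startswith("+")
    digits = re.sub(r"[^0-9]", "", text)
    if is_international:
        # An explicit country code must be Taiwan's.
        if not digits.startswith(TW_COUNTRY_CODE):
            return None
        return digits[len(TW_COUNTRY_CODE):]
    if default_region != "TW":
        return None
    # National format: drop the trunk prefix if present.
    if digits.startswith(TW_TRUNK_PREFIX):
        digits = digits[len(TW_TRUNK_PREFIX):]
    return digits


def classify_tw_number(text: str) -> Optional[str]:
    """Return the matching number type, or None if the number is rejected."""
    national = tw_national_number(text)
    if national is None or len(national) not in TW_POSSIBLE_LENGTHS:
        return None
    for number_type, pattern in TW_PATTERNS.items():
        # fullmatch is equivalent to anchoring the pattern as ^(?:pattern)$.
        if re.fullmatch(pattern, national):
            return number_type
    return None
```

In the PR itself this work is delegated to the existing `PhoneRecognizer` and the library's metadata; the sketch only shows why restricting the region to `TW` rejects numbers carrying another country code.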
Comment on lines +144 to +145
| TW_NATIONAL_ID | Taiwan National Identification Number (國民身分證統一編號 / 身分證字號): 1 leading letter followed by 9 digits, with the second digit indicating holder category and a public checksum rule. | Pattern match, context and checksum |
| TW_PHONE_NUMBER | Taiwan phone number (電話號碼 / 手機號碼): validated by `python-phonenumbers` with the Taiwan (`TW`) region for mobile and landline formats. | Pattern match, context and region-aware phone validation |
Comment on lines +104 to +106
# Taiwan recognizers
from .country_specific.taiwan.tw_national_id_recognizer import TwNationalIdRecognizer
from .country_specific.taiwan.tw_phone_number_recognizer import TwPhoneNumberRecognizer
Comment on lines +102 to +104
:param pattern_text: the text to validated.
Only the part in text that was detected by the regex engine
:return: A bool or None, indicating whether the validation was successful.
Author
@microsoft-github-policy-service agree
Change Description
This PR adds Taiwan-specific predefined recognizers in `presidio-analyzer`:
- `TW_NATIONAL_ID`
- `TW_PHONE_NUMBER`
This iteration intentionally does not include Taiwan Unified Business Number.
Issue reference
Fixes #2065
Public references
Taiwan National ID
Taiwan phone numbers
What is included
- `TW_NATIONAL_ID`: recognizer with checksum validation
- `TW_PHONE_NUMBER`: recognizer using Taiwan (`TW`) regional phone validation
Implementation notes
- `TW_NATIONAL_ID` first matches one leading letter followed by 9 digits, restricts the second digit to `1` or `2`, and then validates the candidate with the Taiwan ID checksum calculation implemented locally in Presidio.
- `TW_PHONE_NUMBER` follows the upstream Taiwan phone-validation flow: the input is normalized, `+886` or region `TW` is used to determine the country code, the national trunk prefix `0` is stripped, and the resulting national number is checked against Taiwan metadata by length and full `^(?:pattern)$` regular-expression match across the supported number types (for example mobile, fixed line, toll-free, and others).
Notes
- `TW_PHONE_NUMBER` is intentionally implemented as a thin Taiwan-specific wrapper around the existing generic `PhoneRecognizer`, restricting validation to the `TW` region instead of duplicating parsing logic.
- Both recognizers are registered in `default_recognizers.yaml` with `enabled: false`, matching the existing convention for many country-specific recognizers.
Tests
Executed locally:
- `../.venv/bin/python -m pytest tests/test_tw_national_id_recognizer.py tests/test_tw_phone_number_recognizer.py`
- `../.venv/bin/python -m pytest tests/test_tw_national_id_recognizer.py tests/test_tw_phone_number_recognizer.py tests/test_recognizer_registry.py -k 'tw or country or default_yaml'`
- `../.venv/bin/python -m ruff check tests/test_tw_national_id_recognizer.py tests/test_tw_phone_number_recognizer.py presidio_analyzer/predefined_recognizers/country_specific/taiwan/tw_national_id_recognizer.py presidio_analyzer/predefined_recognizers/country_specific/taiwan/tw_phone_number_recognizer.py presidio_analyzer/predefined_recognizers/__init__.py`
Result:
- `63 passed, 19 deselected`
- `ruff check` passed for the touched Python files
Checklist