feat(analyzer): add Taiwan-specific recognizers for national id and phone #2073

Draft
matheme-justyn wants to merge 12 commits into microsoft:main from matheme-justyn:codex/issue-2065

Conversation

@matheme-justyn matheme-justyn commented Jun 18, 2026


Change Description

This PR adds Taiwan-specific predefined recognizers in presidio-analyzer:

  • TW_NATIONAL_ID
  • TW_PHONE_NUMBER

This iteration intentionally does not include Taiwan Unified Business Number.

Issue reference

Fixes #2065

Public references

Taiwan National ID

Taiwan phone numbers

What is included

  • tests for TW_NATIONAL_ID
  • TW_NATIONAL_ID recognizer with checksum validation
  • docs for TW_NATIONAL_ID
  • tests for TW_PHONE_NUMBER
  • TW_PHONE_NUMBER recognizer using Taiwan (TW) regional phone validation
  • docs for TW_PHONE_NUMBER
  • default recognizer config entries for both Taiwan recognizers
  • registry-level test coverage for default config visibility and loader resolution

Implementation notes

  • TW_NATIONAL_ID first matches one leading letter followed by 9 digits, restricts the second character (the first of those digits) to 1 or 2, and then validates the candidate with the Taiwan ID checksum calculation implemented locally in Presidio.
  • TW_PHONE_NUMBER follows the upstream Taiwan phone-validation flow. The input is normalized, and either +886 or region TW determines the country code. The national trunk prefix 0 is then stripped, and the resulting national number is checked against Taiwan metadata by length and by a full ^(?:pattern)$ regular-expression match across the supported number types (for example mobile, fixed line, toll-free, and others).
  • In this PR, Presidio does not reimplement a separate Taiwan numbering table; it wraps that existing Taiwan-region matching/parsing behavior and adds Taiwan-specific tests and documentation around it.
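
For reviewers unfamiliar with the checksum, the TW_NATIONAL_ID steps above can be sketched as follows. This is a minimal illustration of the public Taiwan ID checksum rule, not the code in this PR: the function name `is_valid_tw_national_id`, the `LETTER_CODES` table name, and the exact regex are assumptions made for the example.

```python
import re

# Public letter-to-number mapping used by the Taiwan ID checksum rule.
LETTER_CODES = {
    "A": 10, "B": 11, "C": 12, "D": 13, "E": 14, "F": 15, "G": 16, "H": 17,
    "I": 34, "J": 18, "K": 19, "L": 20, "M": 21, "N": 22, "O": 35, "P": 23,
    "Q": 24, "R": 25, "S": 26, "T": 27, "U": 28, "V": 29, "W": 32, "X": 30,
    "Y": 31, "Z": 33,
}

# One leading letter, then 9 digits whose first digit is 1 or 2.
# (Illustrative pattern; the PR's actual regex may differ.)
CANDIDATE = re.compile(r"[A-Z][12]\d{8}")

# Weights for: tens digit of the letter code, units digit of the letter code,
# the first eight ID digits, and finally the check digit.
WEIGHTS = [1, 9, 8, 7, 6, 5, 4, 3, 2, 1, 1]


def is_valid_tw_national_id(text: str) -> bool:
    """Return True if text has the expected shape and a valid checksum."""
    candidate = text.strip().upper()
    if not CANDIDATE.fullmatch(candidate):
        return False
    code = LETTER_CODES[candidate[0]]
    values = [code // 10, code % 10] + [int(ch) for ch in candidate[1:]]
    total = sum(value * weight for value, weight in zip(values, WEIGHTS))
    # The weighted sum must be divisible by 10.
    return total % 10 == 0
```

With the commonly used sample value `A123456789`, the weighted sum is 130, so the check passes; changing the last digit to 8 gives 129 and the candidate is rejected.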

Notes

  • TW_PHONE_NUMBER is intentionally implemented as a thin Taiwan-specific wrapper around the existing generic PhoneRecognizer, restricting validation to the TW region instead of duplicating parsing logic.
  • Both Taiwan recognizers are added to default_recognizers.yaml with enabled: false, matching the existing convention for many country-specific recognizers.
  • Taiwan Unified Business Number is intentionally left out of this PR.
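
The TW-restricted phone validation flow described above can be sketched in plain Python to make the steps concrete. This is only an approximation under stated assumptions: the PR delegates to the existing generic PhoneRecognizer, whereas the sketch below uses the standard library only. The names `ILLUSTRATIVE_TW_PATTERNS`, `to_national_number`, and `classify_tw_number` are hypothetical, and the two patterns are simplified placeholders, not the real Taiwan numbering metadata.

```python
import re
from typing import Optional

# Simplified placeholder patterns for illustration only. The real Taiwan
# metadata used by the upstream library is more detailed than this.
ILLUSTRATIVE_TW_PATTERNS = {
    "mobile": r"9\d{8}",
    "toll_free": r"800\d{6}",
}


def to_national_number(raw: str, default_region: str = "TW") -> Optional[str]:
    """Normalize the input and reduce it to a Taiwan national number."""
    # Keep only digits and plus signs (drops spaces, dashes, parentheses).
    cleaned = re.sub(r"[^\d+]", "", raw.strip())
    if cleaned.startswith("+886"):
        return cleaned[4:]  # explicit Taiwan country code
    if cleaned.startswith("+") or default_region != "TW":
        return None  # another country, or no TW region to fall back on
    # Region TW: strip the national trunk prefix 0 if present.
    return cleaned[1:] if cleaned.startswith("0") else cleaned


def classify_tw_number(raw: str, default_region: str = "TW") -> Optional[str]:
    """Return the first number type whose pattern fully matches, else None."""
    national = to_national_number(raw, default_region)
    if national is None:
        return None
    for number_type, pattern in ILLUSTRATIVE_TW_PATTERNS.items():
        # Full match, mirroring the ^(?:pattern)$ check described above.
        if re.fullmatch(f"^(?:{pattern})$", national):
            return number_type
    return None
```

Under these placeholder patterns, `classify_tw_number("0912-345-678")` and `classify_tw_number("+886 912 345 678")` both reduce to the national number `912345678` and are classified as mobile, while a `+1` number is rejected before any pattern is tried.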

Tests

Executed locally:

  • ../.venv/bin/python -m pytest tests/test_tw_national_id_recognizer.py tests/test_tw_phone_number_recognizer.py
  • ../.venv/bin/python -m pytest tests/test_tw_national_id_recognizer.py tests/test_tw_phone_number_recognizer.py tests/test_recognizer_registry.py -k 'tw or country or default_yaml'
  • ../.venv/bin/python -m ruff check tests/test_tw_national_id_recognizer.py tests/test_tw_phone_number_recognizer.py presidio_analyzer/predefined_recognizers/country_specific/taiwan/tw_national_id_recognizer.py presidio_analyzer/predefined_recognizers/country_specific/taiwan/tw_phone_number_recognizer.py presidio_analyzer/predefined_recognizers/__init__.py

Result:

  • 63 passed, 19 deselected
  • ruff check passed for the touched Python files

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • Relevant local tests pass
  • My PR contains documentation updates / additions if required

Copilot AI review requested due to automatic review settings June 18, 2026 06:02
@matheme-justyn matheme-justyn marked this pull request as draft June 18, 2026 06:03

Copilot AI left a comment

Contributor

Pull request overview

Adds Taiwan (TW) country-specific predefined recognizers to presidio-analyzer for detecting Taiwan National ID numbers and Taiwan phone numbers, including default-registry configuration, unit tests, and public documentation updates.

Changes:

  • Introduce TwNationalIdRecognizer (checksum-validated) and TwPhoneNumberRecognizer (TW-region wrapper over generic PhoneRecognizer).
  • Register both recognizers in default_recognizers.yaml (disabled by default) and expose them via presidio_analyzer.predefined_recognizers.
  • Add unit tests and update supported-entities docs and changelog entries.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/taiwan/tw_national_id_recognizer.py | New TW national ID recognizer with checksum validation. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/taiwan/tw_phone_number_recognizer.py | New TW phone recognizer wrapper restricting validation to region TW. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/__init__.py | Exposes Taiwan recognizers from the predefined recognizers package. |
| presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml | Adds disabled-by-default YAML entries for the two TW recognizers with country_code: tw. |
| presidio-analyzer/tests/test_tw_national_id_recognizer.py | Adds unit coverage for TW national ID detection and checksum validation. |
| presidio-analyzer/tests/test_tw_phone_number_recognizer.py | Adds unit coverage for TW phone number detection and metadata defaults. |
| presidio-analyzer/tests/test_recognizer_registry.py | Adds registry-level test ensuring YAML entries exist and classes resolve via loader. |
| docs/supported_entities.md | Documents the new TW entity types (needs section placement/content correction per review). |
| CHANGELOG.md | Notes the addition of the two Taiwan recognizers under Unreleased. |

Comment on lines +144 to +145
| TW_NATIONAL_ID | Taiwan National Identification Number (國民身分證統一編號 / 身分證字號): 1 leading letter followed by 9 digits, with the second digit indicating holder category and a public checksum rule. | Pattern match, context and checksum |
| TW_PHONE_NUMBER | Taiwan phone number (電話號碼 / 手機號碼): validated by `python-phonenumbers` with the Taiwan (`TW`) region for mobile and landline formats. | Pattern match, context and region-aware phone validation |
Comment on lines +104 to +106
# Taiwan recognizers
from .country_specific.taiwan.tw_national_id_recognizer import TwNationalIdRecognizer
from .country_specific.taiwan.tw_phone_number_recognizer import TwPhoneNumberRecognizer
Comment on lines +102 to +104
:param pattern_text: the text to validated.
Only the part in text that was detected by the regex engine
:return: A bool or None, indicating whether the validation was successful.
@matheme-justyn

Author

@microsoft-github-policy-service agree

Development

Successfully merging this pull request may close these issues.

Add Taiwan (TW) country-specific recognizers

2 participants