Add Taiwan (TW) country-specific recognizers #2065

@matheme-justyn

Feature Request: Add Taiwan-specific predefined recognizers

Is your feature request related to a problem? Please describe.

Presidio does not seem to include Taiwan-specific predefined recognizers yet. This makes it harder to detect common Taiwan identifiers in Traditional Chinese / Taiwan datasets using the built-in recognizer set.

Describe the solution you'd like

I would like to contribute Taiwan-specific predefined recognizers to presidio-analyzer, starting with identifiers that seem to have clear public formats and validation logic.

Suggested first scope:

  • TW_NATIONAL_ID — Taiwan's national identification number (commonly called "身分證字號"); format is 1 leading letter plus 9 digits, with public checksum rules available.
    • Similar country-specific personal ID entities already exist in Presidio, such as US_SSN, PL_PESEL, and IT_IDENTITY_CARD.
    • US_SSN does not seem like a good fit because it is explicitly tied to the United States social security system.
    • IT_IDENTITY_CARD is closer at the document level, but in Taiwan the identifier in common use is the personal ID number itself rather than the card that carries it.
    • My current view is that Taiwan's ID number is better modeled as a country-specific personal identifier, but I would welcome maintainer guidance on the final naming.
  • TW_PHONE_NUMBER — Taiwan phone numbers (commonly called "電話號碼"); Presidio already includes region-aware phone-number recognition, and Taiwan numbering has public structural rules.
    • Related existing support already appears in Presidio's phone recognizer flow, and maintainers have discussed region-based phone support such as US and UK.
    • Taiwan fixed-line numbers seem to have clearer area-code and length rules, so landline support looks like a strong first candidate.
    • Taiwan mobile numbers also appear structurally clear, but I would likely scope the first PR as either landline-only or fixed-line-first with bounded mobile coverage, depending on maintainer preference.
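
To make the TW_NATIONAL_ID checksum concrete, here is a minimal sketch of the commonly published validation rule in plain Python. The function name `is_valid_tw_national_id` and the constants are hypothetical, not existing Presidio code, and the sketch assumes only the classic citizen format (one letter, then 1 or 2, then eight more digits); newer resident-certificate formats are out of scope.

```python
import re

# Letter order commonly published for the checksum: index + 10 gives the
# two-digit code (A=10 ... H=17, J=18 ... V=29, X=30, Y=31, W=32, Z=33,
# I=34, O=35).
_TW_ID_LETTERS = "ABCDEFGHJKLMNPQRSTUVXYWZIO"

# Assumption: classic citizen format only (second character is 1 or 2).
_TW_ID_RE = re.compile(r"[A-Z][12][0-9]{8}")

# Weights for: letter tens digit, letter units digit, eight body digits,
# and the trailing check digit.
_TW_ID_WEIGHTS = (1, 9, 8, 7, 6, 5, 4, 3, 2, 1, 1)


def is_valid_tw_national_id(candidate: str) -> bool:
    """Return True if the candidate passes the format and checksum tests."""
    text = candidate.strip().upper()
    if not _TW_ID_RE.fullmatch(text):
        return False
    code = _TW_ID_LETTERS.index(text[0]) + 10
    digits = [code // 10, code % 10] + [int(ch) for ch in text[1:]]
    total = sum(d * w for d, w in zip(digits, _TW_ID_WEIGHTS))
    return total % 10 == 0


if __name__ == "__main__":
    for sample in ("A123456789", "A123456788", "A323456789"):
        print(sample, is_valid_tw_national_id(sample))
```

In a Presidio recognizer, a check like this would most likely sit behind the `validate_result` hook of a `PatternRecognizer` subclass, with a regex pattern proposing candidates and the checksum confirming or rejecting them. The exact wiring and scores would depend on maintainer guidance.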

I also plan to update the relevant documentation as part of the contribution.

Describe alternatives you've considered

As of Presidio 2.2.359 on June 16, 2026, the published recognizer/docs surface suggests several country-specific suffix patterns:

  • _PASSPORT: used by India and Italy; this suffix is for passport identifiers, and Taiwan seems possible, but public validation logic looks weaker than the candidates above.
  • _IDENTITY_CARD: used by Italy; this suffix is for national identity-card style document numbers, but Taiwan's ID number may still be better modeled as a country-specific personal identifier.
  • _BANK_NUMBER: used by the United States; this suffix is for banking identifiers, and Taiwan does not seem like a good first fit because I have not confirmed a clear low-false-positive validation rule.
  • _DRIVER_LICENSE: used by Italy and the United States; this suffix is for driver's license identifiers, and Taiwan may be possible, but I have not confirmed a stable public validation approach suitable for a first contribution.
  • _MEDICARE, _MBI, _NPI, _NHS: used by Australia, the United States, and the United Kingdom; these suffixes are for healthcare-related identifiers, and while Taiwan has healthcare identifiers, I have not yet confirmed a clearly suitable public validation rule for a safe first contribution.
  • _NINO, _PESEL, _AADHAAR, _PAN, _UEN: used by the United Kingdom, Poland, India, and Singapore; these are country-specific identifier systems without a direct Taiwan counterpart.

Taiwan also has a business registration identifier commonly called "統一編號" or "統編". Its 8-digit validation logic appears to be public and deterministic, but I have not yet found a clearly matching suffix already used across other countries in Presidio, so I think it should remain under maintainer discussion before proposing a final entity name.
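
For discussion purposes, the 8-digit check can be sketched as below. This is my reading of the commonly described rule, not confirmed Presidio code: the name `is_valid_tw_ubn` is hypothetical, and the `divisor` parameter is an assumption that lets the caller choose between the long-standing divisible-by-10 rule and the relaxed divisible-by-5 rule that I understand was introduced later. Both should be verified against the official specification before any PR.

```python
import re

# Positional weights commonly described for the 8-digit number.
_TW_UBN_WEIGHTS = (1, 2, 1, 2, 1, 2, 4, 1)


def is_valid_tw_ubn(candidate: str, divisor: int = 10) -> bool:
    """Check an 8-digit business number against the weighted digit-sum rule.

    divisor is an assumption: 10 for the older rule, 5 for the relaxed one.
    """
    text = candidate.strip()
    if not re.fullmatch(r"[0-9]{8}", text):
        return False
    digits = [int(ch) for ch in text]
    # Multiply each digit by its weight, then add the tens and units
    # digits of every product.
    total = sum(sum(divmod(d * w, 10)) for d, w in zip(digits, _TW_UBN_WEIGHTS))
    if total % divisor == 0:
        return True
    # Special case: when the 7th digit is 7, its product is 28 and the
    # digit sum 10 may count as either 0 or 1, so total + 1 is also allowed.
    return digits[6] == 7 and (total + 1) % divisor == 0


if __name__ == "__main__":
    for sample in ("04595257", "10000078", "12345678"):
        print(sample, is_valid_tw_ubn(sample))
```

Because the check is deterministic, it could reduce false positives on arbitrary 8-digit strings, which is the main risk for an entity with no distinguishing prefix.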

Additional context

References:
