Phonematcher is a Python library for phonetic fuzzy searching and segment-to-segment distance computation. It allows you to find words that "sound like" a query by analyzing International Phonetic Alphabet (IPA) features rather than just comparing raw text.
By leveraging Distinctive Feature Theory, the library can calculate the articulatory distance between sounds (e.g., recognizing that 'p' and 'b' are more similar than 'p' and 'k') and cluster them to improve search recall.
- Distinctive Feature Matrix: Maps IPA phones to 21 articulatory features (nasality, voicing, place of articulation, etc.).
- Weighted Phonetic Distance: Calculates similarity based on linguistic importance (e.g., major class features like "syllabic" carry more weight than "strident").
- UPGMA Clustering: Automatically groups similar sounds into clusters based on a configurable sensitivity threshold.
- Fuzzy Phonetic Index: Uses a "Phonex" algorithm to generate phonetic variants (including deletions) for high-recall indexing.
- Hybrid Scoring: Combines phonetic candidate retrieval with Levenshtein edit-distance ranking.
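In miniature, the hybrid-scoring idea looks like this: retrieve candidates from a phonetic index, then rank them by edit distance on the original spellings. The index contents and `search` helper below are hypothetical, and a plain Levenshtein implementation stands in for the rapidfuzz call the library actually makes:

```python
# Sketch of hybrid scoring: phonetic retrieval + edit-distance ranking.
# The index keys and helper names are illustrative, not the library's API.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Toy phonetic index: cluster-sequence key -> candidate terms.
index = {
    (3, 1, 5): ["hello", "hallo"],
    (3, 1, 5, 2): ["helot"],
}

def search(query: str, query_key: tuple) -> list:
    """Retrieve candidates by phonetic key, then rank by edit distance."""
    candidates = index.get(query_key, [])
    return sorted(candidates, key=lambda term: levenshtein(query, term))

print(search("helo", (3, 1, 5)))  # ['hello', 'hallo']
```

The phonetic key casts a wide net (high recall); the orthographic edit distance then orders the survivors so that the closest spelling wins.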
Ensure you have the required dependencies:

```shell
pip install rapidfuzz
```
You can compare two IPA symbols to see how linguistically similar they are.

```python
from phonematcher.distance import phonetic_distance

# Comparing voiced vs. voiceless bilabial stops (very similar)
print(phonetic_distance('b', 'p'))  # ~0.043

# Comparing bilabial vs. velar stops (less similar)
print(phonetic_distance('p', 'k'))  # ~0.348

# Vowel vs. consonant mismatch (maximal distance)
print(phonetic_distance('a', 'k'))  # 1.0
```

The PhoneticFuzzySearch class indexes terms based on their phonetic clusters.
```python
from phonematcher.clustering import PhoneticFuzzySearch, EN_MAPPING

# Initialize with a grapheme-to-phoneme mapping
ffs = PhoneticFuzzySearch(EN_MAPPING, cluster_sensitivity=0.5)

# Add terms to your index (term, id)
ffs.add_term('hello world', 0)
ffs.add_term('greetings', 1)
ffs.add_term('friendship', 2)

# Search with typos or phonetic variations
results = ffs.search('helo wrld')
for match in results:
    print(f"ID: {match.id}, Term: {match.term}, Score: {match.score}")
```

Every phone is resolved into a vector of boolean or null values. For example, the phone c (voiceless palatal stop) is represented by features like:
- Consonantal: True
- Voice: False
- High: True
- Back: False
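As a rough sketch of how such feature vectors can drive a weighted distance, consider the following. The feature names, weights, and values here are illustrative only, not the library's actual 21-feature matrix:

```python
# Hypothetical feature vectors: True/False for a feature's value.
# These are NOT the library's real feature matrix, just a toy subset.
FEATURES = {
    "p": {"consonantal": True, "voice": False, "nasal": False, "labial": True},
    "b": {"consonantal": True, "voice": True,  "nasal": False, "labial": True},
    "k": {"consonantal": True, "voice": False, "nasal": False, "labial": False},
}

# Illustrative weights: major-class and place features count more than
# voicing here, so p/b come out closer than p/k.
WEIGHTS = {"consonantal": 8, "labial": 5, "nasal": 4, "voice": 2}

def phone_distance(a: str, b: str) -> float:
    """Weighted share of features on which the two phones disagree."""
    fa, fb = FEATURES[a], FEATURES[b]
    total = sum(WEIGHTS.values())
    diff = sum(w for f, w in WEIGHTS.items() if fa[f] != fb[f])
    return diff / total

print(phone_distance("p", "b"))  # small: differ only in voicing
print(phone_distance("p", "k"))  # larger: differ in place of articulation
```

Identical phones score 0, and disagreement on heavily weighted features pushes the score toward 1, mirroring the normalized distances shown earlier.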
- Phoneticization: The library expands a word into all possible IPA sequences based on your mapping.
- Clustering: Phones are grouped into numeric IDs (e.g., s, z, and ʃ might all fall into Cluster 5).
- Variant Generation: To handle misspellings, the indexer generates "deletes"—versions of the phonetic sequence with one or two sounds removed.
- Retrieval: The query is converted to clusters and matched against the index.
- Ranking: The resulting candidates are ranked using the Levenshtein distance of the original orthographic strings.
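The "deletes" step above can be sketched with a generic SymSpell-style delete generator; this is an illustration of the technique, not the library's exact implementation:

```python
from itertools import combinations

def generate_deletes(seq, max_deletes=2):
    """All variants of `seq` with up to `max_deletes` elements removed
    (including the untouched sequence), usable as fuzzy index keys.
    Always keeps at least one element."""
    variants = set()
    for k in range(min(max_deletes, len(seq) - 1) + 1):
        for idxs in combinations(range(len(seq)), k):
            drop = set(idxs)
            variants.add(tuple(v for i, v in enumerate(seq) if i not in drop))
    return variants

# A phonetic sequence is indexed under every delete variant, so a query
# that is missing one sound still lands in the same bucket.
print(sorted(generate_deletes((3, 1, 5), max_deletes=1)))
# [(1, 5), (3, 1), (3, 1, 5), (3, 5)]
```

Because both the indexed terms and the query are expanded this way, a single missing or extra sound on either side still produces a shared key.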
You can tune the cluster_sensitivity (default 0.5):
- Lower values: Create more specific clusters (fewer matches, higher precision).
- Higher values: Create broader clusters (more matches, higher recall).
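The threshold-driven UPGMA (average-linkage agglomerative) clustering can be sketched as follows. The pairwise distances below are made up for illustration; the library derives them from the feature matrix:

```python
def upgma_clusters(phones, dist, threshold):
    """Repeatedly merge the two closest clusters (by average pairwise
    distance) until the closest remaining pair exceeds `threshold`."""
    clusters = [[p] for p in phones]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # UPGMA linkage: mean distance over all cross-cluster pairs.
                d = sum(dist[a][b] for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Illustrative symmetric distances, not the library's real values.
dist = {
    "s": {"s": 0.0, "z": 0.1, "m": 0.9},
    "z": {"s": 0.1, "z": 0.0, "m": 0.8},
    "m": {"s": 0.9, "z": 0.8, "m": 0.0},
}
print(upgma_clusters(["s", "z", "m"], dist, threshold=0.5))
# [['s', 'z'], ['m']]
```

Raising the threshold lets more distant phones merge into one cluster (broader buckets, higher recall); lowering it keeps clusters small and specific.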
Because orthography (spelling) varies wildly across languages, the library uses grapheme-to-IPA mappings to translate "written" words into "spoken" phonetic vectors.
Without a good mapping, the library would treat letters as arbitrary symbols. By mapping them to IPA (International Phonetic Alphabet) symbols, the system can leverage Distinctive Feature Theory:
- Linguistic Intelligence: The engine knows that 'p' and 'b' are both bilabial stops and differ only by voicing.
- Weighted Distance: Mappings allow the algorithm to compute distance based on articulatory features like nasality or place of articulation rather than simple character replacement.
- Cluster Precision: A word is converted into a sequence of cluster IDs. Accurate mappings ensure that words which sound similar (e.g., "frend" and "friend") result in the same or highly similar cluster sequences, significantly increasing search recall.
A mapping is a Python dictionary where keys are graphemes (single letters or digraphs) and values are lists of potential IPA realizations:
```python
# Example: Mapping 'x' to its multiple sounds
"x": ["ʃ", "ks", "z", "s"]
```

The PhoneticFuzzySearch class uses a Breadth-First Search (BFS) substitution traversal to expand a single word into all possible phonetic sequences. For example, a word containing "x" would generate several phonetic variants to ensure that no matter how the user perceives the sound, the index can find a match.
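A BFS expansion of this kind could look like the sketch below. The `expand_word` helper and its toy mapping are assumptions for illustration, not the library's internals; note how it tries both single letters and digraphs at each position:

```python
from collections import deque

def expand_word(word, mapping, max_grapheme_len=2):
    """BFS over (position, phones-so-far) states: at each position, try
    every grapheme of length 1..max_grapheme_len present in the mapping
    and branch on each of its possible IPA realizations."""
    results = []
    queue = deque([(0, ())])
    while queue:
        pos, phones = queue.popleft()
        if pos == len(word):
            results.append(phones)
            continue
        for length in range(1, max_grapheme_len + 1):
            if pos + length > len(word):
                break
            grapheme = word[pos:pos + length]
            for ipa in mapping.get(grapheme, []):
                queue.append((pos + length, phones + (ipa,)))
    return results

# Toy mapping: 'x' is ambiguous, 'sh' is a digraph.
mapping = {"x": ["ks", "z"], "o": ["o"], "s": ["s"], "h": ["h"], "sh": ["ʃ"]}
print(expand_word("xo", mapping))   # [('ks', 'o'), ('z', 'o')]
print(expand_word("sho", mapping))  # both ('ʃ', 'o') and ('s', 'h', 'o')
```

Every resulting sequence is then indexed, so whichever realization the user's spelling implies, a matching key exists.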
To add support for a new language, follow these three steps:
Create a dictionary covering the unique phonology of your language. You can inherit from BASE_LATIN to save time on standard characters.
```python
MY_LANG_MAPPING = {
    **BASE_LATIN,
    "sh": ["ʃ"],   # Explicitly define digraphs
    "aa": ["aː"],  # Long vowels
}
```

If a letter has multiple sounds (like 'c' in English), include all of them in the list. The engine's variant generator will index all possibilities, allowing for a "fuzzy" phonetic match.
Pass your custom mapping into the PhoneticFuzzySearch constructor.
```python
from phonematcher.clustering import PhoneticFuzzySearch

ffs = PhoneticFuzzySearch(mapping=MY_LANG_MAPPING, cluster_sensitivity=0.4)
```

Pro Tip: If your language is highly phonetic (like Spanish or Italian), you can keep cluster_sensitivity low (approx. 0.3). For languages with complex spelling-to-sound rules (like English or French), a higher sensitivity (0.5-0.6) helps group divergent spellings into the same cluster.
This project is adapted from pyphone and fast_fuzzy_search by lingz and is released under the MIT License.