Phonematcher is a Python library for phonetic fuzzy searching and segment-to-segment distance computation. It allows you to find words that "sound like" a query by analyzing International Phonetic Alphabet (IPA) features rather than just comparing raw text.
By leveraging Distinctive Feature Theory, the library can calculate the articulatory distance between sounds (e.g., recognizing that 'p' and 'b' are more similar than 'p' and 'k') and cluster them to improve search recall.
- Distinctive Feature Matrix: Maps IPA phones to 21 articulatory features (nasality, voicing, place of articulation, etc.).
- Weighted Phonetic Distance: Calculates similarity based on linguistic importance (e.g., major class features like "syllabic" carry more weight than "strident").
- UPGMA Clustering: Automatically groups similar sounds into clusters based on a configurable sensitivity threshold.
- Fuzzy Phonetic Index: Uses a "Phonex" algorithm to generate phonetic variants (including deletions) for high-recall indexing.
- Hybrid Scoring: Combines phonetic candidate retrieval with Levenshtein edit-distance ranking.
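In miniature, the hybrid-scoring idea looks like this: retrieve candidates from a phonetic index, then rank them by edit distance on the original spellings. The index contents and `search` helper below are hypothetical, and a plain Levenshtein implementation stands in for the rapidfuzz call the library actually makes:

```python
# Sketch of hybrid scoring: phonetic retrieval + edit-distance ranking.
# The index keys and helper names are illustrative, not the library's API.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Toy phonetic index: cluster-sequence key -> candidate terms.
index = {
    (3, 1, 5): ["hello", "hallo"],
    (3, 1, 5, 2): ["helot"],
}

def search(query: str, query_key: tuple) -> list:
    """Retrieve candidates by phonetic key, then rank by edit distance."""
    candidates = index.get(query_key, [])
    return sorted(candidates, key=lambda term: levenshtein(query, term))

print(search("helo", (3, 1, 5)))  # ['hello', 'hallo']
```

The phonetic key casts a wide net (high recall); the orthographic edit distance then orders the survivors so that the closest spelling wins.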
Ensure you have the required dependencies:

```shell
pip install rapidfuzz
```
You can compare two IPA symbols to see how linguistically similar they are.

```python
from phonematcher.distance import phonetic_distance

# Comparing voiced vs. voiceless bilabial stops (very similar)
print(phonetic_distance('b', 'p'))  # ~0.043

# Comparing bilabial vs. velar stops (less similar)
print(phonetic_distance('p', 'k'))  # ~0.348

# Vowel vs. consonant mismatch (maximal distance)
print(phonetic_distance('a', 'k'))  # 1.0
```

The PhoneticFuzzySearch class indexes terms based on their phonetic clusters.
```python
from phonematcher.clustering import PhoneticFuzzySearch, EN_MAPPING

# Initialize with a grapheme-to-phoneme mapping
ffs = PhoneticFuzzySearch(EN_MAPPING, cluster_sensitivity=0.5)

# Add terms to your index (term, id)
ffs.add_term('hello world', 0)
ffs.add_term('greetings', 1)
ffs.add_term('friendship', 2)

# Search with typos or phonetic variations
results = ffs.search('helo wrld')
for match in results:
    print(f"ID: {match.id}, Term: {match.term}, Score: {match.score}")
```

Every phone is resolved into a vector of boolean or null values. For example, the phone c (voiceless palatal stop) is represented by features like:
- Consonantal: True
- Voice: False
- High: True
- Back: False
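As a rough sketch of how such feature vectors can drive a weighted distance, consider the following. The feature names, weights, and values here are illustrative only, not the library's actual 21-feature matrix:

```python
# Hypothetical feature vectors: True/False for a feature's value.
# These are NOT the library's real feature matrix, just a toy subset.
FEATURES = {
    "p": {"consonantal": True, "voice": False, "nasal": False, "labial": True},
    "b": {"consonantal": True, "voice": True,  "nasal": False, "labial": True},
    "k": {"consonantal": True, "voice": False, "nasal": False, "labial": False},
}

# Illustrative weights: major-class and place features count more than
# voicing here, so p/b come out closer than p/k.
WEIGHTS = {"consonantal": 8, "labial": 5, "nasal": 4, "voice": 2}

def phone_distance(a: str, b: str) -> float:
    """Weighted share of features on which the two phones disagree."""
    fa, fb = FEATURES[a], FEATURES[b]
    total = sum(WEIGHTS.values())
    diff = sum(w for f, w in WEIGHTS.items() if fa[f] != fb[f])
    return diff / total

print(phone_distance("p", "b"))  # small: differ only in voicing
print(phone_distance("p", "k"))  # larger: differ in place of articulation
```

Identical phones score 0, and disagreement on heavily weighted features pushes the score toward 1, mirroring the normalized distances shown earlier.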
- Phoneticization: The library expands a word into all possible IPA sequences based on your mapping.
- Clustering: Phones are grouped into numeric IDs (e.g., s, z, and ʃ might all fall into Cluster 5).
- Variant Generation: To handle misspellings, the indexer generates "deletes"—versions of the phonetic sequence with one or two sounds removed.
- Retrieval: The query is converted to clusters and matched against the index.
- Ranking: The resulting candidates are ranked using the Levenshtein distance of the original orthographic strings.
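The "deletes" step above can be sketched with a generic SymSpell-style delete generator; this is an illustration of the technique, not the library's exact implementation:

```python
from itertools import combinations

def generate_deletes(seq, max_deletes=2):
    """All variants of `seq` with up to `max_deletes` elements removed
    (including the untouched sequence), usable as fuzzy index keys.
    Always keeps at least one element."""
    variants = set()
    for k in range(min(max_deletes, len(seq) - 1) + 1):
        for idxs in combinations(range(len(seq)), k):
            drop = set(idxs)
            variants.add(tuple(v for i, v in enumerate(seq) if i not in drop))
    return variants

# A phonetic sequence is indexed under every delete variant, so a query
# that is missing one sound still lands in the same bucket.
print(sorted(generate_deletes((3, 1, 5), max_deletes=1)))
# [(1, 5), (3, 1), (3, 1, 5), (3, 5)]
```

Because both the indexed terms and the query are expanded this way, a single missing or extra sound on either side still produces a shared key.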
You can tune the cluster_sensitivity (default 0.5):
- Lower values: Create more specific clusters (fewer matches, higher precision).
- Higher values: Create broader clusters (more matches, higher recall).
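The threshold-driven UPGMA (average-linkage agglomerative) clustering can be sketched as follows. The pairwise distances below are made up for illustration; the library derives them from the feature matrix:

```python
def upgma_clusters(phones, dist, threshold):
    """Repeatedly merge the two closest clusters (by average pairwise
    distance) until the closest remaining pair exceeds `threshold`."""
    clusters = [[p] for p in phones]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # UPGMA linkage: mean distance over all cross-cluster pairs.
                d = sum(dist[a][b] for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Illustrative symmetric distances, not the library's real values.
dist = {
    "s": {"s": 0.0, "z": 0.1, "m": 0.9},
    "z": {"s": 0.1, "z": 0.0, "m": 0.8},
    "m": {"s": 0.9, "z": 0.8, "m": 0.0},
}
print(upgma_clusters(["s", "z", "m"], dist, threshold=0.5))
# [['s', 'z'], ['m']]
```

Raising the threshold lets more distant phones merge into one cluster (broader buckets, higher recall); lowering it keeps clusters small and specific.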
Because orthography (spelling) varies wildly across languages, the library uses grapheme-to-IPA mappings to translate "written" words into "spoken" phonetic vectors.
Without a good mapping, the library would treat letters as arbitrary symbols. By mapping them to IPA (International Phonetic Alphabet) symbols, the system can leverage Distinctive Feature Theory:
- Linguistic Intelligence: The engine knows that 'p' and 'b' are both bilabial stops and differ only by voicing.
- Weighted Distance: Mappings allow the algorithm to compute distance based on articulatory features like nasality or place of articulation rather than simple character replacement.
- Cluster Precision: A word is converted into a sequence of cluster IDs. Accurate mappings ensure that words which sound similar (e.g., "frend" and "friend") result in the same or highly similar cluster sequences, significantly increasing search recall.
A mapping is a Python dictionary where keys are graphemes (single letters or digraphs) and values are lists of potential IPA realizations:
```python
# Example: Mapping 'x' to its multiple sounds
"x": ["ʃ", "ks", "z", "s"]
```

The PhoneticFuzzySearch class uses a Breadth-First Search (BFS) substitution traversal to expand a single word into all possible phonetic sequences. For example, a word containing "x" would generate several phonetic variants to ensure that no matter how the user perceives the sound, the index can find a match.
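A BFS expansion of this kind could look like the sketch below. The `expand_word` helper and its toy mapping are assumptions for illustration, not the library's internals; note how it tries both single letters and digraphs at each position:

```python
from collections import deque

def expand_word(word, mapping, max_grapheme_len=2):
    """BFS over (position, phones-so-far) states: at each position, try
    every grapheme of length 1..max_grapheme_len present in the mapping
    and branch on each of its possible IPA realizations."""
    results = []
    queue = deque([(0, ())])
    while queue:
        pos, phones = queue.popleft()
        if pos == len(word):
            results.append(phones)
            continue
        for length in range(1, max_grapheme_len + 1):
            if pos + length > len(word):
                break
            grapheme = word[pos:pos + length]
            for ipa in mapping.get(grapheme, []):
                queue.append((pos + length, phones + (ipa,)))
    return results

# Toy mapping: 'x' is ambiguous, 'sh' is a digraph.
mapping = {"x": ["ks", "z"], "o": ["o"], "s": ["s"], "h": ["h"], "sh": ["ʃ"]}
print(expand_word("xo", mapping))   # [('ks', 'o'), ('z', 'o')]
print(expand_word("sho", mapping))  # both ('ʃ', 'o') and ('s', 'h', 'o')
```

Every resulting sequence is then indexed, so whichever realization the user's spelling implies, a matching key exists.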
To add support for a new language, follow these three steps:
Create a dictionary covering the unique phonology of your language. You can inherit from BASE_LATIN to save time on standard characters.
```python
MY_LANG_MAPPING = {
    **BASE_LATIN,
    "sh": ["ʃ"],   # Explicitly define digraphs
    "aa": ["aː"],  # Long vowels
}
```

If a letter has multiple sounds (like 'c' in English), include all of them in the list. The engine's variant generator will index all possibilities, allowing for a "fuzzy" phonetic match.
Pass your custom mapping into the PhoneticFuzzySearch constructor.
```python
from phonematcher.clustering import PhoneticFuzzySearch

ffs = PhoneticFuzzySearch(mapping=MY_LANG_MAPPING, cluster_sensitivity=0.4)
```

Pro Tip: If your language is highly phonetic (like Spanish or Italian), you can keep cluster_sensitivity low (approx. 0.3). For languages with complex spelling-to-sound rules (like English or French), a higher sensitivity (0.5-0.6) helps group divergent spellings into the same cluster.
This project is adapted from pyphone and fast_fuzzy_search by lingz and is released under the MIT License.