German Language Issues: Umlaut Removal and Compound Word Truncation in c-TF-IDF #2468

@MHZach

Summary

BERTopic's c-TF-IDF calculation has significant issues when processing German text, requiring extensive             
workarounds (~1,000 lines of code) in production environments. The two main problems are:                           
                                                                                                                    
1. **Umlaut Removal**: German umlauts (ä, ö, ü, ß) are removed or corrupted in topic words                          
2. **Compound Word Truncation**: Long German compound words (Komposita) are truncated                               
                                                                                                                    
These issues affect all German-speaking developers using BERTopic for topic modeling.                               
                                                                                                                    
---                                                                                                                 
                                                                                                                    
## Problem 1: Umlaut Removal                                                                                        
                                                                                                                    
### Description                                                                                                     
BERTopic's CountVectorizer and c-TF-IDF calculation strips German umlauts from topic words, making the output       
difficult to interpret.                                                                                             
                                                                                                                    
### Examples                                                                                                        
| Input Keyword | BERTopic Output | Expected |                                                                      
|---------------|-----------------|----------|                                                                      
| gütertransport | gtertransport | gütertransport |                                                                 
| kühlkette | khlkette | kühlkette |                                                                                
| qualität | qualitt | qualität |                                                                                   
| zuverlässigkeit | zuverlssigkeit | zuverlässigkeit |                                                              
| größe | grsse | größe |                                                                                           
| präferenz | prferenz | präferenz |                                                                                
| verfügbar | verfgbar | verfügbar |                                                                                
| öffnungszeiten | ffnungszeiten | öffnungszeiten |                                                                 
                                                                                                                    
### Root Cause                                                                                                      
The c-TF-IDF calculation appears to use tokenization that doesn't properly handle Unicode characters. Even when using:
```python                                                                                                           
UNIVERSAL_TOKEN_PATTERN = r'(?u)\b[^\W\d_]{2,}\b'                                                                   
vectorizer = CountVectorizer(token_pattern=UNIVERSAL_TOKEN_PATTERN)                                                 
```                                                                                                                 
The umlauts are still stripped from the final topic representation.                                                 
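As a data point for isolating the bug, a vectorizer-only check (a diagnostic sketch, assuming scikit-learn is available; this is not BERTopic code) suggests that `CountVectorizer` with this pattern and its default `strip_accents=None` keeps the umlauts in the vocabulary, which would point at a later stage of the pipeline as the place where the characters are lost:

```python
# Diagnostic sketch: check whether CountVectorizer alone drops umlauts.
from sklearn.feature_extraction.text import CountVectorizer

UNIVERSAL_TOKEN_PATTERN = r'(?u)\b[^\W\d_]{2,}\b'
vectorizer = CountVectorizer(
    token_pattern=UNIVERSAL_TOKEN_PATTERN,
    strip_accents=None,  # sklearn default: no accent stripping
)
vectorizer.fit(["gütertransport nach münchen", "kühlkette für lebensmittel"])
print(sorted(vectorizer.vocabulary_))
# 'gütertransport', 'kühlkette', 'münchen' appear intact in the vocabulary
```

Since BERTopic accepts a custom vectorizer via its `vectorizer_model` parameter, the same configured instance can be passed in directly when narrowing down where the characters disappear.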
                                                                                                                    
### Current Workaround (Production Code)                                                                            
We maintain **~750 lines** of workaround code including:                                                            
                                                                                                                    
1. **Pre-processing normalization** (ä→ae before BERTopic):                                                         
```python                                                                                                           
UMLAUT_TO_ASCII = {                                                                                                 
    'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'ß': 'ss',                                                                     
    'é': 'e', 'è': 'e', 'ê': 'e',
    # ... more accented-character mappings
}
                                                                                                                    
def normalize_umlauts(text):                                                                                        
    for umlaut, ascii_rep in UMLAUT_TO_ASCII.items():                                                               
        text = text.replace(umlaut, ascii_rep)                                                                      
    return text                                                                                                     
```                                                                                                                 
                                                                                                                    
2. **Post-processing restoration** (ae→ä after BERTopic):                                                           
```python                                                                                                           
# ASCII_TO_UMLAUT is the inverse mapping of UMLAUT_TO_ASCII above
def restore_umlauts(text):
    # Must handle carefully: "ae" could be legitimate                                                               
    for ascii_rep, umlaut in ASCII_TO_UMLAUT.items():                                                               
        text = text.replace(ascii_rep, umlaut)                                                                      
    return text                                                                                                     
```                                                                                                                 
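As the comment above notes, a blind `ae→ä` replacement is unsafe: legitimate "ae" sequences (e.g. in "aerodynamik") would be corrupted. One way to hedge this, sketched here with a hypothetical `KNOWN_VOCAB` set that is not part of BERTopic, is to apply a restoration only when the restored form is a known word:

```python
# Sketch: only restore e.g. "ae"→"ä" when the result is a known word,
# so legitimate "ae" sequences are left alone.
# KNOWN_VOCAB is a hypothetical domain vocabulary, not BERTopic API.
KNOWN_VOCAB = {"qualität", "güterverkehr", "aerodynamik"}

ASCII_TO_UMLAUT = {"ae": "ä", "oe": "ö", "ue": "ü", "ss": "ß"}

def restore_umlauts_safe(word):
    for ascii_rep, umlaut in ASCII_TO_UMLAUT.items():
        candidate = word.replace(ascii_rep, umlaut)
        if candidate != word and candidate in KNOWN_VOCAB:
            return candidate
    return word  # leave unchanged when no known restoration exists

print(restore_umlauts_safe("qualitaet"))    # restored: "qualität" is known
print(restore_umlauts_safe("aerodynamik"))  # unchanged: "ärodynamik" is not
```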
                                                                                                                    
3. **Pattern-based restoration** (~50 regex patterns):                                                              
```python                                                                                                           
import re

def restore_missing_umlauts(text):
    result = text
    # trailing "tt" → "ät" (qualitt → qualität)
    result = re.sub(r'(\w{3,})tt\b', r'\1tät', result)
    # "fh" → "fäh" (unfhig → unfähig)
    result = re.sub(r'fh', 'fäh', result)
    # "mgl" → "mögl" (mglich → möglich)
    result = re.sub(r'mgl', 'mögl', result)
    # ... 47 more patterns
    return result
```                                                                                                                 
                                                                                                                    
4. **Dictionary-based fixes** (~170 explicit mappings):                                                             
```python                                                                                                           
UMLAUT_FIXES = {                                                                                                    
    'gtertransport': 'gütertransport',                                                                              
    'khlkette': 'kühlkette',                                                                                        
    'qualitt': 'qualität',                                                                                          
    'zuverlssigkeit': 'zuverlässigkeit',                                                                            
    # ... 166 more entries                                                                                          
}                                                                                                                   
```                                                                                                                 
                                                                                                                    
5. **Separate UmlautFixer class** (536 lines) with:                                                                 
   - Similarity-based matching against known vocabulary                                                             
   - Fuzzy search with umlaut variant generation                                                                    
   - Domain-specific static fixes (~240 logistics terms)                                                            
   - Cache system for performance                                                                                   
                                                                                                                    
---                                                                                                                 
                                                                                                                    
## Problem 2: Compound Word Truncation                                                                              
                                                                                                                    
### Description                                                                                                     
German compound words (Komposita) are truncated in topic word output, losing important semantic information.        
                                                                                                                    
### Examples                                                                                                        
| Full Keyword | BERTopic Output | Missing Part |                                                                   
|--------------|-----------------|--------------|                                                                   
| kommissionierung | kommißionie | -rung |                                                                          
| sendungsverfolgung | sendungsverfolg | -ung |                                                                     
| güterverkehr | güterverkeh | -r |                                                                                 
| stückgutverkehr | stückgutver | -kehr |                                                                           
| sammelgutverkehr | sammelgutve | -rkehr |                                                                         
| expresslieferung | expressliefe | -rung |                                                                         
                                                                                                                    
### Root Cause                                                                                                      
Appears to be related to:                                                                                           
1. CountVectorizer's max token length or frequency cutoffs                                                          
2. c-TF-IDF internal processing truncating long tokens                                                              
3. Possibly related to BPE or subword tokenization effects                                                          
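Whatever the underlying cause turns out to be, truncation can be detected mechanically by checking whether a topic word is a strict prefix of a longer word seen in the raw corpus. A stdlib-only sketch (names are illustrative, not BERTopic API):

```python
# Sketch: flag topic words that look truncated because they are a strict
# prefix of a longer word from the raw corpus vocabulary.
CORPUS_VOCAB = {"güterverkehr", "sendungsverfolgung", "lager"}

def looks_truncated(topic_word):
    """Return the full corpus word if topic_word is a strict prefix of it."""
    for full in CORPUS_VOCAB:
        if full != topic_word and full.startswith(topic_word):
            return full
    return None

print(looks_truncated("güterverkeh"))  # truncated form of "güterverkehr"
print(looks_truncated("lager"))        # None: already complete
```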
                                                                                                                    
### Current Workaround                                                                                              
We maintain a **WORD_COMPLETIONS dictionary** (~35 entries):                                                        
```python                                                                                                           
WORD_COMPLETIONS = {                                                                                                
    'kommißionie': 'kommissionierung',                                                                              
    'sendungsverfolg': 'sendungsverfolgung',                                                                        
    'güterverkeh': 'güterverkehr',                                                                                  
    'stückgutver': 'stückgutverkehr',                                                                               
    'sammelgutve': 'sammelgutverkehr',                                                                              
    'expressliefe': 'expresslieferung',                                                                             
    'disponente': 'disponent',                                                                                      
    'fahrzeugdis': 'fahrzeugdisposition',                                                                           
    # ... more entries                                                                                              
}                                                                                                                   
                                                                                                                    
# Applied in get_clean_topic_words()                                                                                
for truncated, full in WORD_COMPLETIONS.items():                                                                    
    if word.startswith(truncated):                                                                                  
        word = full                                                                                                 
```                                                                                                                 
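One note on the application loop above: it applies keys in `dict` insertion order and keeps iterating after a hit, so a shorter key that happens to be a prefix of a word could shadow a longer, more specific one. A slightly safer variant (a sketch; the shorter key below is hypothetical, added only to illustrate the shadowing risk) matches longest keys first and stops at the first match:

```python
# Sketch: apply completions longest-prefix-first and stop at the first hit,
# so a shorter key cannot shadow a longer, more specific one.
WORD_COMPLETIONS = {
    'stückgutver': 'stückgutverkehr',
    'stück': 'stückgut',  # hypothetical shorter key that could shadow the above
}

def complete_word(word):
    for truncated in sorted(WORD_COMPLETIONS, key=len, reverse=True):
        if word.startswith(truncated):
            return WORD_COMPLETIONS[truncated]
    return word

print(complete_word("stückgutver"))  # matches the longer key first
```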
                                                                                                                    
---                                                                                                                 
                                                                                                                    
## Impact Assessment                                                                                                
                                                                                                                    
### Code Overhead                                                                                                   
| Component | Lines of Code | Maintenance Burden |                                                                  
|-----------|---------------|-------------------|                                                                   
| UmlautFixer class | 536 | High |                                                                                  
| restore_missing_umlauts() | 140 | High (regex patterns) |                                                         
| UMLAUT_FIXES dictionary | 175 | Medium (add new terms) |                                                          
| fix_broken_umlauts() | 88 | Medium |                                                                              
| WORD_COMPLETIONS | 35 | Low (rare additions) |                                                                    
| Support functions | 32 | Low |                                                                                    
| **Total** | **~1,000** | **High** |                                                                               
                                                                                                                    
### Affected Users                                                                                                  
- All German-speaking BERTopic users                                                                                
- French users (similar issues with accent handling)
- Any language with diacritical marks (ñ, ç, ø, etc.)                                                               
                                                                                                                    
### Business Impact                                                                                                 
- Topic labels become unreadable without post-processing                                                            
- Manual review required to verify topic quality                                                                    
- Significant development time for workarounds                                                                      
- Each new domain requires additional dictionary entries                                                            
                                                                                                                    
---                                                                                                                 
                                                                                                                    
## Proposed Solutions                                                                                               
                                                                                                                    
### Option 1: Fix at CountVectorizer Level (Recommended)                                                            
Ensure the vectorizer properly preserves Unicode characters:                                                        
```python                                                                                                           
# In BERTopic's _c_tf_idf.py or similar                                                                             
vectorizer = CountVectorizer(                                                                                       
    token_pattern=r'(?u)\b\w+\b',  # Unicode-aware                                                                  
    strip_accents=None,  # Explicitly disable accent stripping                                                      
    lowercase=True,                                                                                                 
    # ... other params                                                                                              
)                                                                                                                   
```                                                                                                                 
                                                                                                                    
### Option 2: Language-Aware Mode                                                                                   
Add a `language` parameter that adjusts processing:                                                                 
```python                                                                                                           
topic_model = BERTopic(                                                                                             
    language="de",  # Enables German-specific handling                                                              
    preserve_diacritics=True,  # New parameter                                                                      
)                                                                                                                   
```                                                                                                                 
                                                                                                                    
### Option 3: Custom Tokenizer Hook                                                                                 
Allow users to provide custom tokenization:                                                                         
```python                                                                                                           
def german_tokenizer(text):                                                                                         
    # Preserve umlauts, handle compounds                                                                            
    return tokens                                                                                                   
                                                                                                                    
topic_model = BERTopic(                                                                                             
    custom_tokenizer=german_tokenizer                                                                               
)                                                                                                                   
```                                                                                                                 
                                                                                                                    
### Option 4: Post-Processing Hook                                                                                  
Provide official hook for topic word cleaning:                                                                      
```python                                                                                                           
def clean_german_topics(words):                                                                                     
    # User-defined cleaning                                                                                         
    return cleaned_words                                                                                            
                                                                                                                    
topic_model = BERTopic(                                                                                             
    topic_word_postprocessor=clean_german_topics                                                                    
)                                                                                                                   
```                                                                                                                 
                                                                                                                    
---                                                                                                                 
                                                                                                                    
## Minimal Reproduction                                                                                             
                                                                                                                    
```python                                                                                                           
from bertopic import BERTopic                                                                                       
from sentence_transformers import SentenceTransformer                                                               
                                                                                                                    
# German logistics keywords                                                                                         
docs = [                                                                                                            
    "gütertransport nach münchen",                                                                                  
    "kühlkette für lebensmittel",                                                                                   
    "qualitätsmanagement in der logistik",                                                                          
    "zuverlässigkeit der lieferung",                                                                                
    "sendungsverfolgung per gps",                                                                                   
    "kommissionierung im lager",                                                                                    
    # ... more German text                                                                                          
]                                                                                                                   
                                                                                                                    
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')                                                
topic_model = BERTopic(embedding_model=model)                                                                       
topics, probs = topic_model.fit_transform(docs)                                                                     
                                                                                                                    
# Check topic words - umlauts will be missing
for topic_id in topic_model.get_topic_info().Topic:
    if topic_id == -1:  # skip the outlier topic
        continue
    words = topic_model.get_topic(topic_id)
    print(f"Topic {topic_id}: {[w for w, _ in words[:5]]}")
    # Expected: ['gütertransport', 'kühlkette', ...]
    # Actual:   ['gtertransport', 'khlkette', ...]
```                                                                                                                 
                                                                                                                    
---                                                                                                                 
                                                                                                                    
## Environment                                                                                                      
- BERTopic version: 0.15.x / 0.16.x                                                                                 
- Python: 3.10+                                                                                                     
- OS: Linux/Windows/macOS                                                                                           
- Language: German (de), also affects French (fr)                                                                   
                                                                                                                    
---                                                                                                                 
                                                                                                                    
## Related Issues                                                                                                   
- (Search for existing German/Unicode issues in BERTopic repo)                                                      
                                                                                                                    
---                                                                                                                 
                                                                                                                    
## Summary                                                                                                          
                                                                                                                    
German BERTopic users currently need **~1,000 lines of workaround code** to get readable topic labels. A native solution would:
                                                                                                                    
1. Reduce maintenance burden for German developers                                                                  
2. Improve out-of-box experience for non-English users                                                              
3. Prevent data quality issues in production systems                                                                
                                                                                                                    
We're happy to contribute to a fix if pointed in the right direction.                                               
                                                                                                                    
---                                                                                                                 
                                                                                                                    
*This issue was documented based on production experience with the Semantic Engine Engine (SEE) project,            
processing 50,000+ German keywords.*                  
