## Summary
BERTopic's c-TF-IDF calculation has significant issues when processing German text, requiring extensive
workarounds (~1,000 lines of code) in production environments. The two main problems are:
1. **Umlaut Removal**: German umlauts (ä, ö, ü, ß) are removed or corrupted in topic words
2. **Compound Word Truncation**: Long German compound words (Komposita) are truncated
These issues affect all German-speaking developers using BERTopic for topic modeling.
---
## Problem 1: Umlaut Removal
### Description
BERTopic's CountVectorizer and c-TF-IDF calculation strips German umlauts from topic words, making the output
difficult to interpret.
### Examples
| Input Keyword | BERTopic Output | Expected |
|---------------|-----------------|----------|
| gütertransport | gtertransport | gütertransport |
| kühlkette | khlkette | kühlkette |
| qualität | qualitt | qualität |
| zuverlässigkeit | zuverlssigkeit | zuverlässigkeit |
| größe | grsse | größe |
| präferenz | prferenz | präferenz |
| verfügbar | verfgbar | verfügbar |
| öffnungszeiten | ffnungszeiten | öffnungszeiten |
### Root Cause
The c-TF-IDF calculation appears to use tokenization that doesn't properly handle Unicode characters. Even when
using:
```python
from sklearn.feature_extraction.text import CountVectorizer

UNIVERSAL_TOKEN_PATTERN = r'(?u)\b[^\W\d_]{2,}\b'
vectorizer = CountVectorizer(token_pattern=UNIVERSAL_TOKEN_PATTERN)
```
The umlauts are still stripped from the final topic representation.
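One way to localize the problem (a diagnostic sketch, not part of our production pipeline) is to check whether the umlauts still survive in the fitted vectorizer's vocabulary; if they do, the loss happens later, in the c-TF-IDF / topic-representation step. `vectorizer_model` and `get_feature_names_out()` are the standard BERTopic and scikit-learn APIs; the rest is illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

vectorizer = CountVectorizer(token_pattern=r"(?u)\b[^\W\d_]{2,}\b")
topic_model = BERTopic(vectorizer_model=vectorizer)
topics, _ = topic_model.fit_transform(docs)  # docs: German documents, as in the reproduction below

# Do umlauts survive the counting step?
vocab = topic_model.vectorizer_model.get_feature_names_out()
print([t for t in vocab if any(c in t for c in "äöüß")][:10])

# If the vocabulary is intact but get_topic() output is not, the stripping
# happens downstream of the CountVectorizer.
print(topic_model.get_topic(0))
```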
### Current Workaround (Production Code)
We maintain **~750 lines** of workaround code including:
1. **Pre-processing normalization** (ä→ae before BERTopic):
```python
UMLAUT_TO_ASCII = {
    'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'ß': 'ss',
    'é': 'e', 'è': 'e', 'ê': 'e',
    # ... more accented characters
}

def normalize_umlauts(text):
    for umlaut, ascii_rep in UMLAUT_TO_ASCII.items():
        text = text.replace(umlaut, ascii_rep)
    return text
```
2. **Post-processing restoration** (ae→ä after BERTopic):
```python
def restore_umlauts(text):
    # Must handle carefully: "ae" could be legitimate
    for ascii_rep, umlaut in ASCII_TO_UMLAUT.items():
        text = text.replace(ascii_rep, umlaut)
    return text
```
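Because a blanket "ae → ä" replacement can corrupt legitimate sequences (the concern noted in the comment above), the restoration can be guarded by a reference vocabulary. A minimal sketch with a hypothetical `known_words` set built from the raw corpus, not our production code:

```python
# Only restore a digraph when the restored candidate is a known word.
# ASCII_TO_UMLAUT (abridged) is the inverse of UMLAUT_TO_ASCII; known_words is
# a hypothetical reference vocabulary collected before normalization.
ASCII_TO_UMLAUT = {'ae': 'ä', 'oe': 'ö', 'ue': 'ü'}

def restore_umlauts_guarded(word, known_words):
    for ascii_rep, umlaut in ASCII_TO_UMLAUT.items():
        candidate = word.replace(ascii_rep, umlaut)
        if candidate != word and candidate in known_words:
            return candidate
    return word
```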
3. **Pattern-based restoration** (~50 regex patterns):
```python
import re

def restore_missing_umlauts(text):
    result = text
    # trailing 'tt' → 'ät'  (qualitt → qualität)
    result = re.sub(r'(\w{3,})tt\b', r'\1tät', result)
    # 'fh' → 'fäh'  (unfhig → unfähig)
    result = re.sub(r'fh', 'fäh', result)
    # 'mgl' → 'mögl'  (mglich → möglich)
    result = re.sub(r'mgl', 'mögl', result)
    # ... 47 more patterns
    return result
```
4. **Dictionary-based fixes** (~170 explicit mappings):
```python
UMLAUT_FIXES = {
    'gtertransport': 'gütertransport',
    'khlkette': 'kühlkette',
    'qualitt': 'qualität',
    'zuverlssigkeit': 'zuverlässigkeit',
    # ... 166 more entries
}
```
5. **Separate UmlautFixer class** (536 lines) with:
- Similarity-based matching against known vocabulary (sketched after this list)
- Fuzzy search with umlaut variant generation
- Domain-specific static fixes (~240 logistics terms)
- Cache system for performance
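
The similarity-based matching inside that class is conceptually simple; a minimal sketch using Python's standard-library difflib (the production class adds caching, umlaut-variant generation, and the domain-specific vocabularies on top of this idea):

```python
import difflib

def fix_by_similarity(word, known_vocabulary, cutoff=0.85):
    # Map a corrupted topic word to the closest known-good term, if any
    matches = difflib.get_close_matches(word, known_vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

# Example: 'zuverlssigkeit' → 'zuverlässigkeit' given a clean reference vocabulary
print(fix_by_similarity('zuverlssigkeit', ['zuverlässigkeit', 'qualität', 'größe']))
```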
---
## Problem 2: Compound Word Truncation
### Description
German compound words (Komposita) are truncated in topic word output, losing important semantic information.
### Examples
| Full Keyword | BERTopic Output | Missing Part |
|--------------|-----------------|--------------|
| kommissionierung | kommißionie | -rung |
| sendungsverfolgung | sendungsverfolg | -ung |
| güterverkehr | güterverkeh | -r |
| stückgutverkehr | stückgutver | -kehr |
| sammelgutverkehr | sammelgutve | -rkehr |
| expresslieferung | expressliefe | -rung |
### Root Cause
The truncation appears to be related to one or more of the following (a diagnostic sketch follows the list):
1. CountVectorizer's max token length or frequency cutoffs
2. c-TF-IDF internal processing truncating long tokens
3. Possibly related to BPE or subword tokenization effects
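
To narrow down which of these is responsible, it helps to check whether the full compounds survive in the fitted vectorizer vocabulary at all. A diagnostic sketch, assuming a fitted `topic_model` as in the reproduction section below and scikit-learn ≥ 1.0 for `get_feature_names_out()`:

```python
# Are the full compounds still present at counting time, or already truncated?
vocab = set(topic_model.vectorizer_model.get_feature_names_out())

for compound in ["sendungsverfolgung", "kommissionierung", "güterverkehr"]:
    prefixes = [t for t in vocab if compound.startswith(t) and t != compound]
    print(f"{compound}: in vocabulary={compound in vocab}, truncated forms={prefixes}")

# A suspiciously short longest token would point at a length cutoff in the
# vectorizer rather than at the c-TF-IDF / representation step.
print("Longest vocabulary token:", max(vocab, key=len))
```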
### Current Workaround
We maintain a **WORD_COMPLETIONS dictionary** (~35 entries):
```python
WORD_COMPLETIONS = {
    'kommißionie': 'kommissionierung',
    'sendungsverfolg': 'sendungsverfolgung',
    'güterverkeh': 'güterverkehr',
    'stückgutver': 'stückgutverkehr',
    'sammelgutve': 'sammelgutverkehr',
    'expressliefe': 'expresslieferung',
    'disponente': 'disponent',
    'fahrzeugdis': 'fahrzeugdisposition',
    # ... more entries
}

# Applied in get_clean_topic_words() to each topic word `word`:
for truncated, full in WORD_COMPLETIONS.items():
    if word.startswith(truncated):
        word = full
```
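A small refinement worth noting (a sketch, not yet part of the dictionary above): when several truncated prefixes could match the same word, checking the longest prefix first prevents a shorter entry from shadowing a longer one.

```python
def complete_word(word, completions=WORD_COMPLETIONS):
    # Longest prefix first, so e.g. 'stückgutver' is tried before any shorter entry
    for truncated in sorted(completions, key=len, reverse=True):
        if word.startswith(truncated):
            return completions[truncated]
    return word
```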
---
## Impact Assessment
### Code Overhead
| Component | Lines of Code | Maintenance Burden |
|-----------|---------------|-------------------|
| UmlautFixer class | 536 | High |
| restore_missing_umlauts() | 140 | High (regex patterns) |
| UMLAUT_FIXES dictionary | 175 | Medium (add new terms) |
| fix_broken_umlauts() | 88 | Medium |
| WORD_COMPLETIONS | 35 | Low (rare additions) |
| Support functions | 32 | Low |
| **Total** | **~1,000** | **High** |
### Affected Users
- All German-speaking BERTopic users
- French users (similar issues with accent handling)
- Users of any language with diacritical marks (ñ, ç, ø, etc.)
### Business Impact
- Topic labels become unreadable without post-processing
- Manual review required to verify topic quality
- Significant development time for workarounds
- Each new domain requires additional dictionary entries
---
## Proposed Solutions
### Option 1: Fix at CountVectorizer Level (Recommended)
Ensure the vectorizer properly preserves Unicode characters:
```python
# In BERTopic's _c_tf_idf.py or similar
vectorizer = CountVectorizer(
    token_pattern=r'(?u)\b\w+\b',  # Unicode-aware
    strip_accents=None,            # Explicitly disable accent stripping
    lowercase=True,
    # ... other params
)
)
```
### Option 2: Language-Aware Mode
Add a `language` parameter that adjusts processing:
```python
topic_model = BERTopic(
    language="de",             # Enables German-specific handling
    preserve_diacritics=True,  # New parameter
)
```
### Option 3: Custom Tokenizer Hook
Allow users to provide custom tokenization:
```python
import re

def german_tokenizer(text):
    # Preserve umlauts and keep long compounds intact; minimal example
    # using a Unicode-aware word pattern
    return re.findall(r'(?u)\b[^\W\d_]{2,}\b', text)

topic_model = BERTopic(
    custom_tokenizer=german_tokenizer  # proposed new parameter
)
```
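As a point of comparison, scikit-learn's CountVectorizer already accepts a callable `tokenizer`, so a variant of this can be wired in today through `vectorizer_model` (sketch below); the proposal above would make it a first-class, documented path in BERTopic itself.

```python
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

# Existing path: CountVectorizer accepts a callable tokenizer, which BERTopic
# uses when the vectorizer is passed via vectorizer_model.
vectorizer = CountVectorizer(tokenizer=german_tokenizer, lowercase=True)
topic_model = BERTopic(vectorizer_model=vectorizer)
```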
### Option 4: Post-Processing Hook
Provide official hook for topic word cleaning:
```python
def clean_german_topics(words):
    # User-defined cleaning, e.g. restore umlauts on each topic word
    return [restore_umlauts(word) for word in words]

topic_model = BERTopic(
    topic_word_postprocessor=clean_german_topics  # proposed new parameter
)
```
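Until such a hook exists, the same effect can be approximated outside the model by cleaning the output of `get_topics()`. A sketch that reuses the workaround helpers shown earlier (`restore_umlauts`, `complete_word`), assuming the usual `{topic_id: [(word, score), ...]}` return format:

```python
# External post-processing sketch: clean the displayed words without touching
# the model internals.
cleaned_topics = {
    topic_id: [(complete_word(restore_umlauts(word)), score) for word, score in words]
    for topic_id, words in topic_model.get_topics().items()
}
```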
---
## Minimal Reproduction
```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
# German logistics keywords
docs = [
    "gütertransport nach münchen",
    "kühlkette für lebensmittel",
    "qualitätsmanagement in der logistik",
    "zuverlässigkeit der lieferung",
    "sendungsverfolgung per gps",
    "kommissionierung im lager",
    # ... more German text
]
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
topic_model = BERTopic(embedding_model=model)
topics, probs = topic_model.fit_transform(docs)
# Check topic words - umlauts will be missing
for topic_id in range(len(topic_model.get_topic_info()) - 1):
    words = topic_model.get_topic(topic_id)
    print(f"Topic {topic_id}: {[w for w, _ in words[:5]]}")
    # Expected: ['gütertransport', 'kühlkette', ...]
    # Actual:   ['gtertransport', 'khlkette', ...]
```
---
## Environment
- BERTopic version: 0.15.x / 0.16.x
- Python: 3.10+
- OS: Linux/Windows/macOS
- Language: German (de), also affects French (fr)
---
## Related Issues
- (Search for existing German/Unicode issues in BERTopic repo)
---
## Summary
German BERTopic users currently need **~1,000 lines of workaround code** to get readable topic labels. A native
solution would:
1. Reduce maintenance burden for German developers
2. Improve out-of-box experience for non-English users
3. Prevent data quality issues in production systems
We're happy to contribute to a fix if pointed in the right direction.
---
*This issue was documented based on production experience with the Semantic Engine Engine (SEE) project,
processing 50,000+ German keywords.*