## Summary
BERTopic's c-TF-IDF calculation has significant issues when processing German text, requiring extensive
workarounds (~1,000 lines of code) in production environments. The two main problems are:
1. **Umlaut Removal**: German umlauts (ä, ö, ü, ß) are removed or corrupted in topic words
2. **Compound Word Truncation**: Long German compound words (Komposita) are truncated
These issues affect all German-speaking developers using BERTopic for topic modeling.
---
## Problem 1: Umlaut Removal
### Description
BERTopic's CountVectorizer and c-TF-IDF calculation strips German umlauts from topic words, making the output
difficult to interpret.
### Examples
| Input Keyword | BERTopic Output | Expected |
|---------------|-----------------|----------|
| gütertransport | gtertransport | gütertransport |
| kühlkette | khlkette | kühlkette |
| qualität | qualitt | qualität |
| zuverlässigkeit | zuverlssigkeit | zuverlässigkeit |
| größe | grsse | größe |
| präferenz | prferenz | präferenz |
| verfügbar | verfgbar | verfügbar |
| öffnungszeiten | ffnungszeiten | öffnungszeiten |
### Root Cause
The c-TF-IDF calculation appears to use tokenization that doesn't properly handle Unicode characters. Even when
using:
```python
from sklearn.feature_extraction.text import CountVectorizer

UNIVERSAL_TOKEN_PATTERN = r'(?u)\b[^\W\d_]{2,}\b'
vectorizer = CountVectorizer(token_pattern=UNIVERSAL_TOKEN_PATTERN)
```
The umlauts are still stripped from the final topic representation.
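One way to localize the problem (a diagnostic sketch, not part of our production pipeline) is to check whether the umlauts still survive in the fitted vectorizer's vocabulary; if they do, the loss happens later, in the c-TF-IDF / topic-representation step. `vectorizer_model` and `get_feature_names_out()` are the standard BERTopic and scikit-learn APIs; the rest is illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

vectorizer = CountVectorizer(token_pattern=r"(?u)\b[^\W\d_]{2,}\b")
topic_model = BERTopic(vectorizer_model=vectorizer)
topics, _ = topic_model.fit_transform(docs)  # docs: German documents, as in the reproduction below

# Do umlauts survive the counting step?
vocab = topic_model.vectorizer_model.get_feature_names_out()
print([t for t in vocab if any(c in t for c in "äöüß")][:10])

# If the vocabulary is intact but get_topic() output is not, the stripping
# happens downstream of the CountVectorizer.
print(topic_model.get_topic(0))
```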
### Current Workaround (Production Code)
We maintain **~750 lines** of workaround code including:
1. **Pre-processing normalization** (ä→ae before BERTopic):
```python
UMLAUT_TO_ASCII = {
    'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'ß': 'ss',
    'é': 'e', 'è': 'e', 'ê': 'e',
    # ... more accented characters
}

def normalize_umlauts(text):
    for umlaut, ascii_rep in UMLAUT_TO_ASCII.items():
        text = text.replace(umlaut, ascii_rep)
    return text
```
2. **Post-processing restoration** (ae→ä after BERTopic):
```python
def restore_umlauts(text):
    # Must handle carefully: "ae" could be legitimate
    for ascii_rep, umlaut in ASCII_TO_UMLAUT.items():
        text = text.replace(ascii_rep, umlaut)
    return text
```
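Because a blanket "ae → ä" replacement can corrupt legitimate sequences (the concern noted in the comment above), the restoration can be guarded by a reference vocabulary. A minimal sketch with a hypothetical `known_words` set built from the raw corpus, not our production code:

```python
# Only restore a digraph when the restored candidate is a known word.
# ASCII_TO_UMLAUT (abridged) is the inverse of UMLAUT_TO_ASCII; known_words is
# a hypothetical reference vocabulary collected before normalization.
ASCII_TO_UMLAUT = {'ae': 'ä', 'oe': 'ö', 'ue': 'ü'}

def restore_umlauts_guarded(word, known_words):
    for ascii_rep, umlaut in ASCII_TO_UMLAUT.items():
        candidate = word.replace(ascii_rep, umlaut)
        if candidate != word and candidate in known_words:
            return candidate
    return word
```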
3. **Pattern-based restoration** (~50 regex patterns):
```python
import re

def restore_missing_umlauts(text):
    result = text
    # trailing 'tt' → 'ät'  (qualitt → qualität)
    result = re.sub(r'(\w{3,})tt\b', r'\1tät', result)
    # 'fh' → 'fäh'  (unfhig → unfähig)
    result = re.sub(r'fh', 'fäh', result)
    # 'mgl' → 'mögl'  (mglich → möglich)
    result = re.sub(r'mgl', 'mögl', result)
    # ... 47 more patterns
    return result
```
4. **Dictionary-based fixes** (~170 explicit mappings):
```python
UMLAUT_FIXES = {
    'gtertransport': 'gütertransport',
    'khlkette': 'kühlkette',
    'qualitt': 'qualität',
    'zuverlssigkeit': 'zuverlässigkeit',
    # ... 166 more entries
}
```
5. **Separate UmlautFixer class** (536 lines) with:
- Similarity-based matching against known vocabulary (sketched after this list)
- Fuzzy search with umlaut variant generation
- Domain-specific static fixes (~240 logistics terms)
- Cache system for performance
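
The similarity-based matching inside that class is conceptually simple; a minimal sketch using Python's standard-library difflib (the production class adds caching, umlaut-variant generation, and the domain-specific vocabularies on top of this idea):

```python
import difflib

def fix_by_similarity(word, known_vocabulary, cutoff=0.85):
    # Map a corrupted topic word to the closest known-good term, if any
    matches = difflib.get_close_matches(word, known_vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

# Example: 'zuverlssigkeit' → 'zuverlässigkeit' given a clean reference vocabulary
print(fix_by_similarity('zuverlssigkeit', ['zuverlässigkeit', 'qualität', 'größe']))
```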
---
## Problem 2: Compound Word Truncation
### Description
German compound words (Komposita) are truncated in topic word output, losing important semantic information.
### Examples
| Full Keyword | BERTopic Output | Missing Part |
|--------------|-----------------|--------------|
| kommissionierung | kommißionie | -rung |
| sendungsverfolgung | sendungsverfolg | -ung |
| güterverkehr | güterverkeh | -r |
| stückgutverkehr | stückgutver | -kehr |
| sammelgutverkehr | sammelgutve | -rkehr |
| expresslieferung | expressliefe | -rung |
### Root Cause
The truncation appears to be related to one or more of the following (a diagnostic sketch follows the list):
1. CountVectorizer's max token length or frequency cutoffs
2. c-TF-IDF internal processing truncating long tokens
3. Possibly related to BPE or subword tokenization effects
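
To narrow down which of these is responsible, it helps to check whether the full compounds survive in the fitted vectorizer vocabulary at all. A diagnostic sketch, assuming a fitted `topic_model` as in the reproduction section below and scikit-learn ≥ 1.0 for `get_feature_names_out()`:

```python
# Are the full compounds still present at counting time, or already truncated?
vocab = set(topic_model.vectorizer_model.get_feature_names_out())

for compound in ["sendungsverfolgung", "kommissionierung", "güterverkehr"]:
    prefixes = [t for t in vocab if compound.startswith(t) and t != compound]
    print(f"{compound}: in vocabulary={compound in vocab}, truncated forms={prefixes}")

# A suspiciously short longest token would point at a length cutoff in the
# vectorizer rather than at the c-TF-IDF / representation step.
print("Longest vocabulary token:", max(vocab, key=len))
```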
### Current Workaround
We maintain a **WORD_COMPLETIONS dictionary** (~35 entries):
```python
WORD_COMPLETIONS = {
    'kommißionie': 'kommissionierung',
    'sendungsverfolg': 'sendungsverfolgung',
    'güterverkeh': 'güterverkehr',
    'stückgutver': 'stückgutverkehr',
    'sammelgutve': 'sammelgutverkehr',
    'expressliefe': 'expresslieferung',
    'disponente': 'disponent',
    'fahrzeugdis': 'fahrzeugdisposition',
    # ... more entries
}

# Applied in get_clean_topic_words() to each topic word `word`:
for truncated, full in WORD_COMPLETIONS.items():
    if word.startswith(truncated):
        word = full
```
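A small refinement worth noting (a sketch, not yet part of the dictionary above): when several truncated prefixes could match the same word, checking the longest prefix first prevents a shorter entry from shadowing a longer one.

```python
def complete_word(word, completions=WORD_COMPLETIONS):
    # Longest prefix first, so e.g. 'stückgutver' is tried before any shorter entry
    for truncated in sorted(completions, key=len, reverse=True):
        if word.startswith(truncated):
            return completions[truncated]
    return word
```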
---
## Impact Assessment
### Code Overhead
| Component | Lines of Code | Maintenance Burden |
|-----------|---------------|-------------------|
| UmlautFixer class | 536 | High |
| restore_missing_umlauts() | 140 | High (regex patterns) |
| UMLAUT_FIXES dictionary | 175 | Medium (add new terms) |
| fix_broken_umlauts() | 88 | Medium |
| WORD_COMPLETIONS | 35 | Low (rare additions) |
| Support functions | 32 | Low |
| **Total** | **~1,000** | **High** |
### Affected Users
- All German-speaking BERTopic users
- French users (similar issues with accent handling)
- Users of any language with diacritical marks (ñ, ç, ø, etc.)
### Business Impact
- Topic labels become unreadable without post-processing
- Manual review required to verify topic quality
- Significant development time for workarounds
- Each new domain requires additional dictionary entries
---
## Proposed Solutions
### Option 1: Fix at CountVectorizer Level (Recommended)
Ensure the vectorizer properly preserves Unicode characters:
```python
# In BERTopic's _c_tf_idf.py or similar
vectorizer = CountVectorizer(
    token_pattern=r'(?u)\b\w+\b',  # Unicode-aware
    strip_accents=None,            # Explicitly disable accent stripping
    lowercase=True,
    # ... other params
)
)
```
### Option 2: Language-Aware Mode
Add a `language` parameter that adjusts processing:
```python
topic_model = BERTopic(
    language="de",             # Enables German-specific handling
    preserve_diacritics=True,  # New parameter
)
```
### Option 3: Custom Tokenizer Hook
Allow users to provide custom tokenization:
```python
import re

def german_tokenizer(text):
    # Preserve umlauts and keep long compounds intact; minimal example
    # using a Unicode-aware word pattern
    return re.findall(r'(?u)\b[^\W\d_]{2,}\b', text)

topic_model = BERTopic(
    custom_tokenizer=german_tokenizer  # proposed new parameter
)
```
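As a point of comparison, scikit-learn's CountVectorizer already accepts a callable `tokenizer`, so a variant of this can be wired in today through `vectorizer_model` (sketch below); the proposal above would make it a first-class, documented path in BERTopic itself.

```python
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

# Existing path: CountVectorizer accepts a callable tokenizer, which BERTopic
# uses when the vectorizer is passed via vectorizer_model.
vectorizer = CountVectorizer(tokenizer=german_tokenizer, lowercase=True)
topic_model = BERTopic(vectorizer_model=vectorizer)
```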
### Option 4: Post-Processing Hook
Provide official hook for topic word cleaning:
```python
def clean_german_topics(words):
    # User-defined cleaning, e.g. restore umlauts on each topic word
    return [restore_umlauts(word) for word in words]

topic_model = BERTopic(
    topic_word_postprocessor=clean_german_topics  # proposed new parameter
)
```
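Until such a hook exists, the same effect can be approximated outside the model by cleaning the output of `get_topics()`. A sketch that reuses the workaround helpers shown earlier (`restore_umlauts`, `complete_word`), assuming the usual `{topic_id: [(word, score), ...]}` return format:

```python
# External post-processing sketch: clean the displayed words without touching
# the model internals.
cleaned_topics = {
    topic_id: [(complete_word(restore_umlauts(word)), score) for word, score in words]
    for topic_id, words in topic_model.get_topics().items()
}
```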
---
## Minimal Reproduction
```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
# German logistics keywords
docs = [
    "gütertransport nach münchen",
    "kühlkette für lebensmittel",
    "qualitätsmanagement in der logistik",
    "zuverlässigkeit der lieferung",
    "sendungsverfolgung per gps",
    "kommissionierung im lager",
    # ... more German text
]
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
topic_model = BERTopic(embedding_model=model)
topics, probs = topic_model.fit_transform(docs)
# Check topic words - umlauts will be missing
for topic_id in range(len(topic_model.get_topic_info()) - 1):
    words = topic_model.get_topic(topic_id)
    print(f"Topic {topic_id}: {[w for w, _ in words[:5]]}")
    # Expected: ['gütertransport', 'kühlkette', ...]
    # Actual:   ['gtertransport', 'khlkette', ...]
```
---
## Environment
- BERTopic version: 0.15.x / 0.16.x
- Python: 3.10+
- OS: Linux/Windows/macOS
- Language: German (de), also affects French (fr)
---
## Related Issues
- (Search for existing German/Unicode issues in BERTopic repo)
---
## Summary
German BERTopic users currently need **~1,000 lines of workaround code** to get readable topic labels. A native
solution would:
1. Reduce maintenance burden for German developers
2. Improve out-of-box experience for non-English users
3. Prevent data quality issues in production systems
We're happy to contribute to a fix if pointed in the right direction.
---
*This issue was documented based on production experience with the Semantic Engine Engine (SEE) project,
processing 50,000+ German keywords.*