BertCC

BertCC is a context-aware Chinese text converter that uses BERT (Bidirectional Encoder Representations from Transformers) to convert Simplified Chinese (zh_CN) to Traditional Chinese (zh_TW). Unlike dictionary-based approaches, BertCC uses contextual understanding to resolve ambiguous characters more accurately.

Features

  • Context-Aware Conversion: Utilizes BERT's contextual understanding to make intelligent conversion choices based on surrounding text

    Input:  他向师生发表演说
    BertCC: 他向師生發表演說  ✓
    OpenCC: 他向師生髮表演說  ✗
    
  • Configurable Processing:

    • Adjustable batch size for optimal performance on different hardware
    • Configurable overlap between batches to maintain context across segments
    • Optional context prefix to guide conversion
  • Unlimited Text Length: Processes text of any length through overlapped batch processing (see the sketch after this list)

  • Hardware Flexibility: Supports both CPU and CUDA processing

  • Detailed Output Option: Can show conversion probabilities and candidate selections when run in verbose mode
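
The overlapped batching behind the Unlimited Text Length feature can be sketched in a few lines. This is illustrative only; the function name is an assumption, not part of the bertcc API, and the real merge logic may differ:

# Sketch: split text into overlapping batches (illustrative, not the bertcc API)
def split_with_overlap(text, batch_size=450, overlap_size=50):
    """Yield (start, chunk) pairs; consecutive chunks share overlap_size characters."""
    step = batch_size - overlap_size
    for start in range(0, max(len(text) - overlap_size, 1), step):
        yield start, text[start:start + batch_size]

With the defaults, a 1000-character text yields chunks starting at 0, 400, and 800, each sharing 50 characters with its neighbor so context is preserved across boundaries.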

Installation

pip install bertcc

Usage

Command Line Interface

# Basic usage
bertcc "要转换的文字" --batch-size 450 --overlap-size 50

# Using context prefix for proper nouns
bertcc "都是上里作的好事" -c "上里一將是人名。"
# Output: 都是上里作的好事  (preserves "上里" as a name instead of converting to "上裡")

# Show detailed conversion process
bertcc "他的头与发皆白" --verbose

Python API

from bertcc.converter import ConversionConfig, ChineseConverter

config = ConversionConfig(
    batch_size=450,
    overlap_size=50,
    context_prefix='上里一將是人名。',  # Context hint for proper nouns
    device="cuda",  # or "cpu"
    model_name="bert-base-chinese"
)

converter = ChineseConverter(config)
result = converter.convert("都是上里作的好事", show_details=True)  # show_details prints candidates and probabilities

Usage Considerations

Input Text Quality

For optimal conversion results, consider these important guidelines:

  1. Pure Chinese Text Preferred:

    Good: 他向师生发表演说
    
    Avoid: <div>他向师生发表演说</div>  // HTML tags may affect context
    
  2. Why This Matters:

    • BERT models are trained primarily on natural Chinese text
    • Non-Chinese elements (HTML, markdown, special characters) can:
      • Disrupt the contextual understanding
      • Lead to incorrect character conversions
      • Break the natural language flow
  3. Best Practices:

    • Clean input text of HTML/XML tags before conversion
    • Remove or minimize non-Chinese characters where possible
    • Keep formatting markup separate from text being converted
    • Use context prefix for proper nouns rather than special markers
  4. Handling Mixed Content: If you must process mixed content (see the sketch after this list):

    • Consider splitting the text into Chinese and non-Chinese segments
    • Process Chinese segments separately
    • Reassemble the text after conversion
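
A minimal sketch of that split-convert-reassemble flow, assuming the Python API shown above. The CJK regex is a simplification (it ignores extension blocks), and converting each run independently sacrifices context across runs:

import re
from bertcc.converter import ConversionConfig, ChineseConverter

# Runs of CJK ideographs; extend the range if your text needs extension blocks.
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")

def convert_mixed(text, converter):
    # Convert each Chinese run in place, leaving tags and markup untouched.
    # Note: each run is converted independently, so cross-run context is lost.
    return CHINESE_RUN.sub(lambda m: converter.convert(m.group(0)), text)

converter = ChineseConverter(ConversionConfig())
print(convert_mixed("<div>他向师生发表演说</div>", converter))
# Expected: <div>他向師生發表演說</div>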

Configuration Options

Parameter         Description                                         Default
----------------  --------------------------------------------------  -------------------
batch_size        Number of characters processed in each batch        450
overlap_size      Number of overlapping characters between batches    50
context_prefix    Optional prefix to provide additional context       ''
device            Computation device ("cuda" or "cpu")                "cuda" if available
model_name        BERT model to use for conversion                    "bert-base-chinese"
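
The device default mirrors CUDA availability; a caller can reproduce that choice explicitly with standard PyTorch (a minimal sketch, assuming all other parameters keep their defaults):

import torch
from bertcc.converter import ConversionConfig

# Mirror the default: use CUDA when available, otherwise fall back to CPU.
config = ConversionConfig(device="cuda" if torch.cuda.is_available() else "cpu")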

How It Works

BertCC converts text in four stages:

  1. Text Processing:

    • Splits input text into manageable batches with overlap to maintain context
    • Identifies ambiguous characters that have multiple possible Traditional Chinese representations
  2. Masking and Prediction:

    • Replaces ambiguous characters with BERT's [MASK] token
    • For example, "发" could be "發" (to express) or "髮" (hair)
    • The surrounding context helps BERT understand which meaning is intended (see the sketch after this list)
  3. Contextual Decision:

    • BERT model predicts the most likely Traditional Chinese character for each mask
    • Predictions are influenced by:
      • Surrounding text context
      • Optional context prefix (useful for proper nouns)
      • Known character mappings and frequencies
  4. Batch Processing:

    • Processes text in overlapping batches to handle long texts
    • Merges results while maintaining consistency at batch boundaries
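
The mask-and-predict core (steps 2 and 3) can be illustrated with the Hugging Face transformers library. This is a minimal sketch of the idea, not BertCC's actual implementation; the one-entry candidate table is hypothetical, and a real converter would also map unambiguous characters with an ordinary dictionary pass:

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
model.eval()

# Hypothetical one-entry table; a real table covers every ambiguous character.
CANDIDATES = {"发": ["發", "髮"]}

def resolve_ambiguous(text):
    chars = list(text)
    masked = ["[MASK]" if c in CANDIDATES else c for c in chars]
    inputs = tokenizer("".join(masked), return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    for i, c in enumerate(chars):
        if c in CANDIDATES:
            # [CLS] sits at position 0, so character i is token i + 1
            # (holds for pure Chinese text: one token per character).
            ids = tokenizer.convert_tokens_to_ids(CANDIDATES[c])
            scores = logits[i + 1, ids]
            chars[i] = CANDIDATES[c][int(scores.argmax())]
    return "".join(chars)

print(resolve_ambiguous("他的头与发皆白"))  # context should select 髮 ("hair")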

Limitations

  1. Phrase-Level Variations: Cannot handle regional phrase differences between Simplified and Traditional Chinese due to its character-by-character processing architecture. For example:

    CN: 互联网        TW: 網際網路
    CN: 数据库        TW: 資料庫
    CN: 软件          TW: 軟體
    

    BertCC will convert these character-by-character (e.g., 互联网 → 互聯網) rather than using the regionally appropriate phrase (網際網路). This is because the model operates on character-level masking and prediction, not phrase-level transformation. For applications requiring region-specific terminology conversion, additional post-processing or a different approach would be needed (see the sketch after this list).

  2. Computational Resources: As a neural network-based solution, BertCC requires more computational resources compared to dictionary-based approaches like OpenCC.

  3. Processing Speed: Due to the contextual analysis, conversion speed is slower than dictionary-based methods, though this is mitigated through batch processing.
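
One possible workaround for the phrase-level limitation is a substitution pass over BertCC's output. The phrase table below is a tiny hypothetical sample, not a shipped resource; its keys are the character-converted forms BertCC would emit:

# Sketch: regional phrase substitution after character-level conversion.
REGIONAL_PHRASES = {
    "互聯網": "網際網路",
    "數據庫": "資料庫",
    "軟件": "軟體",
}

def localize_tw(text):
    # Replace character-converted phrases with region-appropriate terminology.
    for converted, regional in REGIONAL_PHRASES.items():
        text = text.replace(converted, regional)
    return text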

Comparison with Other Tools

Unlike traditional conversion tools like OpenCC that rely on character-to-character mapping, BertCC:

  • Considers the entire context when making conversion decisions
  • Handles ambiguous characters more accurately by understanding their usage
  • Provides confidence scores for conversions in verbose mode
  • Maintains contextual consistency across long texts through overlap processing

Contributing

Contributions are welcome! Please feel free to submit pull requests, report issues, or suggest improvements.

License

Apache License 2.0
