A comprehensive system for translating Tibetan Buddhist texts with support for commentaries, terminology standardization, and multi-language outputs.
This project provides a complete pipeline for translating Tibetan Buddhist texts to various target languages including English, Chinese, Hindi, and others. The translation workflow consists of two phases:
- Initial Translation: Translates Tibetan texts with support for commentaries and a basic glossary
- Post-Translation Processing: Performs terminology standardization and provides word-by-word translation references
- Python 3.8+
- Anthropic API key (Claude)
- Input files in JSON/JSONL format with Tibetan text
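Before starting a run, it can help to confirm that an input file parses. The following is a minimal sketch, not the project's own loader: `load_records` is a hypothetical helper, and it assumes only that a `.jsonl` file holds one JSON value per non-empty line and that a `.json` file holds either a list of records or a single record. The record fields the workflow expects are not described here, so the sketch does not check them.

```python
import json
from pathlib import Path


def load_records(path):
    """Load records from a .json or .jsonl file (hypothetical helper).

    Assumption: .jsonl means one JSON value per non-empty line; .json means
    a list of records or a single record. Field names are not validated.
    """
    path = Path(path)
    if path.suffix.lower() == ".jsonl":
        records = []
        # Iterating the file splits on newlines only, which is what JSONL needs.
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    records.append(json.loads(line))
        return records
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Normalize a single record to a one-element list.
    return data if isinstance(data, list) else [data]
```

Calling `len(load_records("input/choenjuk_bo.json"))` would then report how many records the default input contains, provided that file matches the assumptions above.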
- Clone this repository:
git clone https://github.com/OpenPecha/translation-workflow.git
cd translation-workflow
- Set up your environment variables in a .env file:
ANTHROPIC_API_KEY=your_anthropic_api_key_here
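To check that the key is visible before launching a run, a small script along these lines may be useful. It is a sketch under assumptions: `load_env_file` and `get_api_key` are hypothetical helpers, the parser handles only simple `KEY=VALUE` lines, and how the project itself reads the `.env` file is not stated here.

```python
import os


def load_env_file(path=".env"):
    """Read simple KEY=VALUE lines from a .env file into a dict.

    Assumption: no multi-line values or variable expansion; blank lines and
    lines starting with '#' are skipped.
    """
    values = {}
    if not os.path.exists(path):
        return values
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop surrounding whitespace and optional quotes around the value.
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(env_path=".env"):
    """Return ANTHROPIC_API_KEY from the environment, else from the .env file."""
    key = os.environ.get("ANTHROPIC_API_KEY")
    if not key:
        key = load_env_file(env_path).get("ANTHROPIC_API_KEY")
    if not key:
        raise RuntimeError("ANTHROPIC_API_KEY is not set")
    return key
```

The process environment takes precedence over the file in this sketch, so a key exported in the shell overrides the one in `.env`.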
The run.sh script provides a convenient way to execute the complete translation workflow. It handles both the initial translation and post-translation processing phases.
./run.sh
This runs the workflow with default settings:
- Input file: input/choenjuk_bo.json
- Target language: Chinese
- Batch size: 10
- Output prefix: choenjuk_zh
The script supports several command-line options:
./run.sh [options]
| Option | Description |
|---|---|
| --input FILE | Input JSON or JSONL file with Tibetan text |
| --language LANG | Target language (English, Chinese, Hindi, etc.) |
| --batch-size SIZE | Number of texts to process in parallel |
| --retries NUM | Number of retry attempts for failed batches |
| --delay SECONDS | Delay between retry attempts |
| --output PREFIX | Output file prefix |
| --debug | Enable debug logging |
| --help | Show help message |
Translate to English with a larger batch size:
./run.sh --input input/my_text.json --language English --batch-size 20 --output my_text_en
Translate to Hindi with debug logging:
./run.sh --language Hindi --output choenjuk_hi --debug
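When several texts or languages need to be processed in sequence, the same invocations can be assembled from Python. The sketch below only builds the argument list from the options documented above; `build_run_command` is a hypothetical helper, not part of the repository, and it leaves out any option that is not passed so that the script's defaults still apply.

```python
import shlex


def build_run_command(input_file=None, language=None, batch_size=None,
                      retries=None, delay=None, output=None, debug=False):
    """Build the argument list for ./run.sh from the documented options."""
    cmd = ["./run.sh"]
    options = [
        ("--input", input_file),
        ("--language", language),
        ("--batch-size", batch_size),
        ("--retries", retries),
        ("--delay", delay),
        ("--output", output),
    ]
    for flag, value in options:
        # Omitted options are left out so run.sh falls back to its defaults.
        if value is not None:
            cmd += [flag, str(value)]
    if debug:
        cmd.append("--debug")
    return cmd


cmd = build_run_command(input_file="input/my_text.json", language="English",
                        batch_size=20, output="my_text_en")
print(shlex.join(cmd))  # shell-quoted form of the command, for inspection
```

To actually run the workflow, the list can be passed to `subprocess.run(cmd, check=True)` from the repository root.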
The script generates several output files in the outputs directory:
- [prefix].jsonl: Initial translation results
- [prefix]_final.jsonl: Post-processed translation with standardized terminology
- [prefix]_glossary.csv: Standardized glossary derived from the translations
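The naming scheme above makes it easy to locate a run's results from its prefix. The following is a minimal sketch with hypothetical helpers (`output_paths`, `read_glossary`); it assumes the glossary CSV has a header row, and it makes no assumption about the column names, which are not documented here.

```python
import csv
from pathlib import Path


def output_paths(prefix, directory="outputs"):
    """Return the three documented output paths for a given prefix."""
    base = Path(directory)
    return {
        "initial": base / f"{prefix}.jsonl",
        "final": base / f"{prefix}_final.jsonl",
        "glossary": base / f"{prefix}_glossary.csv",
    }


def read_glossary(path):
    """Read the glossary CSV into a list of dicts keyed by the header row.

    Assumption: the first row is a header; column names are not checked.
    """
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

For the default run, `output_paths("choenjuk_zh")["glossary"]` points at `outputs/choenjuk_zh_glossary.csv`, which can then be passed to `read_glossary`.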
- doc/: Documentation for various components
- examples/: Example scripts for different use cases
- input/: Input Tibetan text files
- outputs/: Generated translations and glossaries
- tibetan_translator/: Core translation modules
  - processors/: Various text processing components
- run.sh: Main workflow script
For more detailed information, refer to the documentation files in the doc/ directory:
- SYSTEM_OVERVIEW.md: Full system architecture
- USAGE_GUIDE.md: Comprehensive usage instructions
- POST_TRANSLATION.md: Details on post-translation processing
- MODULE_DETAILS.md: Information about individual modules