This is a simple Python notebook that extracts sentence pairs from a .tmx (Translation Memory eXchange) file and saves them as a clean .csv.
It’s useful if you work with translation memories and want a quick way to turn them into bilingual data for analysis, training, or anything else.
- Reads a `.tmx` file
- Extracts source and target segments
- Cleans up whitespace
- Saves everything into a CSV file (`bilingual_corpus.csv`)
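The steps above can be sketched roughly like this. This is a minimal stdlib version (the notebook itself uses pandas), and the names `extract_pairs` and `save_csv` are illustrative, not the notebook's actual function names:

```python
import csv
import xml.etree.ElementTree as ET

def extract_pairs(tmx_path, src_lang="en", tgt_lang="es"):
    """Pull (source, target) segment pairs out of a TMX file."""
    tree = ET.parse(tmx_path)
    pairs = []
    for tu in tree.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # TMX stores the language in the xml:lang attribute
            lang = tuv.get("{http://www.w3.org/XML/1998/namespace}lang", "")
            seg = tuv.find("seg")
            if seg is not None and seg.text:
                # Collapse stray whitespace and normalize "EN-US" -> "en"
                segs[lang.lower().split("-")[0]] = " ".join(seg.text.split())
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs

def save_csv(pairs, out_path="bilingual_corpus.csv"):
    """Write the pairs as a two-column CSV with a header row."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["source", "target"])
        writer.writerows(pairs)
```

Swap in your own language codes (e.g. `src_lang="en"`, `tgt_lang="fr"`) to match the `<tuv xml:lang="…">` values in your file.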
- Open the `translation_memory_project.ipynb` notebook
- Replace the filename (`yourfile.tmx`) with your own
- Adjust the language codes if needed (like `'en'` and `'es'`)
- Run all the cells

That’s it! Your clean bilingual file will be ready as `bilingual_corpus.csv`.
- Python 3.x
- pandas

Install with:

```bash
pip install pandas
```
📝 Example output

```csv
source,target
"Hello, world!","¡Hola, mundo!"
"How are you?","¿Cómo estás?"
```
💾 SQL + Python Project Ideas (more advanced but manageable)
1. Translation Memory Database Manager
💡 Idea: Instead of reading .tmx into a CSV, store the data in a SQL database.
✴️ Use sqlite3 to save source–target pairs in a table
✴️ Add fields like language_pair, domain, file_source
✴️ Include sample SQL queries like:

```sql
-- assumes the pairs are stored in a table named `segments`
SELECT source, target
FROM segments
WHERE source LIKE '%hello%' AND language_pair = 'en-es';
```
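A minimal `sqlite3` sketch of that idea; the `segments` table layout and the helper names are one possible design, not a fixed schema:

```python
import sqlite3

def build_tm_db(rows, db_path=":memory:"):
    """Create a translation-memory table and load source-target rows.

    `rows` is an iterable of (source, target, language_pair, domain,
    file_source) tuples, matching the extra fields suggested above.
    """
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS segments (
            id INTEGER PRIMARY KEY,
            source TEXT NOT NULL,
            target TEXT NOT NULL,
            language_pair TEXT,
            domain TEXT,
            file_source TEXT
        )
    """)
    conn.executemany(
        "INSERT INTO segments (source, target, language_pair, domain, file_source) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn

def search(conn, keyword, language_pair):
    """Parameterized version of the sample LIKE query."""
    cur = conn.execute(
        "SELECT source, target FROM segments "
        "WHERE source LIKE ? AND language_pair = ?",
        (f"%{keyword}%", language_pair),
    )
    return cur.fetchall()
```

Using `?` placeholders instead of string formatting keeps the queries safe against quoting problems in the segment text.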
2. Idiom Database with Emotional Valence Tags
💡 Build a small idiom database with columns: `idiom`, `lang`, `valence_score`, `transparency`, `familiarity`
- Use SQL for:
  - AVG valence per language
  - Idioms common to multiple languages
  - Complex filters (e.g., “neutral Turkish idioms that are familiar but low transparency”)
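The "AVG valence per language" query could look like this; the table layout follows the columns listed above, and the helper name is hypothetical:

```python
import sqlite3

def avg_valence_per_language(rows):
    """Load idiom rows and compute mean valence per language in plain SQL.

    `rows`: (idiom, lang, valence_score, transparency, familiarity) tuples.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE idioms (
            idiom TEXT, lang TEXT, valence_score REAL,
            transparency REAL, familiarity REAL
        )
    """)
    conn.executemany("INSERT INTO idioms VALUES (?, ?, ?, ?, ?)", rows)
    # GROUP BY collapses all idioms of one language into a single average
    cur = conn.execute(
        "SELECT lang, AVG(valence_score) FROM idioms "
        "GROUP BY lang ORDER BY 2 DESC"
    )
    return dict(cur.fetchall())
```

The complex filters from the last bullet are just extra `WHERE` conditions on `valence_score`, `familiarity`, and `transparency` against whatever thresholds you define.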
3. Corpus Search Tool
💡 Idea: Build a searchable bilingual corpus using SQL
- Create a corpus table with `id`, `source`, `target`, `domain`, `lang_pair`
- Let users search by keyword, domain, or language pair
- Bonus: Write a Python UI for it with Streamlit
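The search side could be sketched like this, building the `WHERE` clause from whichever filters the user supplies (function and table names are illustrative):

```python
import sqlite3

def make_corpus(rows):
    """Build the corpus table from (id, source, target, domain, lang_pair) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE corpus (
            id INTEGER PRIMARY KEY,
            source TEXT, target TEXT, domain TEXT, lang_pair TEXT
        )
    """)
    conn.executemany("INSERT INTO corpus VALUES (?, ?, ?, ?, ?)", rows)
    return conn

def search_corpus(conn, keyword=None, domain=None, lang_pair=None):
    """Combine only the filters that were actually given into one query."""
    clauses, params = [], []
    if keyword:
        clauses.append("(source LIKE ? OR target LIKE ?)")
        params += [f"%{keyword}%"] * 2
    if domain:
        clauses.append("domain = ?")
        params.append(domain)
    if lang_pair:
        clauses.append("lang_pair = ?")
        params.append(lang_pair)
    sql = "SELECT source, target FROM corpus"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return conn.execute(sql, params).fetchall()
```

For the Streamlit bonus, something like `st.text_input` for the keyword and `st.selectbox` for the language pair could feed straight into `search_corpus`, with the results shown via `st.dataframe`.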
Created by @avocadoyoon,
because bilingual data deserves to be ✨clean, sorted, and a lil assertive✨.