
avocadoyoon/SQL_TM

🗺️ bilingual-batchifier

This is a simple Python notebook that extracts sentence pairs from a .tmx (Translation Memory eXchange) file and saves them as a clean .csv.

It’s useful if you work with translation memories and want a quick way to turn them into bilingual data for analysis, training, or anything else.

📁 What it does

  • Reads a .tmx file
  • Extracts source and target segments
  • Cleans up whitespace
  • Saves everything into a CSV file (bilingual_corpus.csv)
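The steps above could be sketched roughly as follows. This is not the notebook's actual code: the function names `extract_pairs` and `write_csv` are hypothetical, and the sketch uses only the standard library (`xml.etree.ElementTree` and `csv`) rather than pandas so it runs without extra installs. It also assumes that region suffixes such as `es-ES` can be reduced to the base code `es`.

```python
import csv
import xml.etree.ElementTree as ET

# xml:lang lives in the XML namespace, so ElementTree exposes it under this key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_pairs(tmx_source, src_lang="en", tgt_lang="es"):
    """Return a list of (source, target) tuples from a TMX path or file object.

    Hypothetical helper: names and defaults are assumptions, not the notebook's API.
    """
    tree = ET.parse(tmx_source)
    pairs = []
    for tu in tree.getroot().iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # TMX 1.4 uses xml:lang; some older files use a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is None:
                continue
            # itertext() also collects text nested inside inline tags;
            # split/join collapses runs of whitespace and trims the ends.
            text = " ".join("".join(seg.itertext()).split())
            # Assumption: keep only the base language code ("es-ES" -> "es").
            segments[lang.split("-")[0]] = text
        # Keep the unit only if both sides are present and non-empty.
        if segments.get(src_lang) and segments.get(tgt_lang):
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


def write_csv(pairs, path="bilingual_corpus.csv"):
    """Write the pairs to a CSV file with a source,target header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["source", "target"])
        writer.writerows(pairs)
```

Calling `write_csv(extract_pairs("yourfile.tmx", "en", "es"))` would then produce `bilingual_corpus.csv` in the working directory.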

▶️ How to use

  1. Open the translation_memory_project.ipynb notebook
  2. Replace the filename (yourfile.tmx) with your own
  3. Adjust the language codes if needed (like 'en' and 'es')
  4. Run all the cells

That’s it! Your clean bilingual file will be ready as bilingual_corpus.csv.

🧰 Requirements

  • Python 3.x
  • pandas

Install with:

pip install pandas

📝 Example output
source,target
"Hello, world!","¡Hola, mundo!"
"How are you?","¿Cómo estás?"

🔮 Future Upgrades

💾 SQL + Python Project Ideas (more advanced but manageable)

1. Translation Memory Database Manager

💡 Idea: Instead of reading .tmx into a CSV, store the data in a SQL database.

✴️ Use sqlite3 to save source–target pairs in a table

✴️ Add fields like language_pair, domain, file_source

✴️ Include sample SQL queries like:

WHERE source LIKE '%hello%' AND language_pair = 'en-es';
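A minimal sketch of this idea with Python's built-in `sqlite3` module might look like the following. The table name `translations`, the `id` column, and the sample `domain` and `file_source` values are assumptions made for illustration; only the `source`, `target`, `language_pair`, `domain`, and `file_source` fields come from the list above.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained;
# pass a file path instead to persist the data.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: the table name and the id column are assumptions.
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS translations (
        id            INTEGER PRIMARY KEY,
        source        TEXT NOT NULL,
        target        TEXT NOT NULL,
        language_pair TEXT NOT NULL,
        domain        TEXT,
        file_source   TEXT
    )
    """
)

# Illustrative rows only (the sentence pairs are the README's example output).
rows = [
    ("Hello, world!", "¡Hola, mundo!", "en-es", "general", "yourfile.tmx"),
    ("How are you?", "¿Cómo estás?", "en-es", "general", "yourfile.tmx"),
]
conn.executemany(
    "INSERT INTO translations "
    "(source, target, language_pair, domain, file_source) "
    "VALUES (?, ?, ?, ?, ?)",
    rows,
)
conn.commit()

# The WHERE clause from above, completed into a full parameterized query.
# SQLite's LIKE is case-insensitive for ASCII letters by default.
matches = conn.execute(
    "SELECT source, target FROM translations "
    "WHERE source LIKE ? AND language_pair = ?",
    ("%hello%", "en-es"),
).fetchall()
```

Passing the pattern and language pair as parameters, rather than formatting them into the SQL string, avoids quoting problems when a segment contains apostrophes.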

2. Idiom Database with Emotional Valence Tags

💡 Idea: Build a small idiom database with columns: idiom, lang, valence_score, transparency, familiarity

✴️ Use SQL for:

  • Average valence per language (AVG)
  • Idioms common to multiple languages
  • Complex filters (e.g., “neutral Turkish idioms that are familiar but low transparency”)
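As a rough sketch, the average-valence query and the example filter could be written like this. The column names come from the idea above, but everything else is an assumption: the table name `idioms`, a valence scale from -1 to 1, transparency and familiarity scales from 0 to 1, and the thresholds used for "neutral", "familiar", and "low transparency". The rows are placeholders, not real idiom ratings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table; the scales are assumptions (valence in [-1, 1], others in [0, 1]).
conn.execute(
    """
    CREATE TABLE idioms (
        idiom         TEXT NOT NULL,
        lang          TEXT NOT NULL,
        valence_score REAL,
        transparency  REAL,
        familiarity   REAL
    )
    """
)

# Placeholder rows with made-up scores, for illustration only.
conn.executemany(
    "INSERT INTO idioms VALUES (?, ?, ?, ?, ?)",
    [
        ("idiom_en_1", "en", 0.5, 0.8, 0.9),
        ("idiom_en_2", "en", -0.5, 0.2, 0.6),
        ("idiom_tr_1", "tr", 0.0, 0.2, 0.9),
        ("idiom_tr_2", "tr", 0.5, 0.9, 0.8),
    ],
)

# Average valence per language.
avg_valence = conn.execute(
    "SELECT lang, AVG(valence_score) FROM idioms GROUP BY lang ORDER BY lang"
).fetchall()

# "Neutral Turkish idioms that are familiar but low transparency".
# The numeric cut-offs below are assumed thresholds, not established conventions.
neutral_familiar_opaque = conn.execute(
    """
    SELECT idiom FROM idioms
    WHERE lang = 'tr'
      AND valence_score BETWEEN -0.2 AND 0.2
      AND familiarity >= 0.7
      AND transparency <= 0.3
    """
).fetchall()
```

Keeping the thresholds in the query, rather than storing precomputed labels, makes it easy to try different definitions of "neutral" or "familiar" later.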

3. Corpus Search Tool

💡 Idea: Build a searchable bilingual corpus using SQL

✴️ Create a corpus table with id, source, target, domain, lang_pair

✴️ Let users search by keyword, domain, or language pair

✴️ Bonus: Write a Python UI for it with Streamlit
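One possible shape for the search layer is sketched below, again with `sqlite3`. The function names `create_corpus` and `search_corpus` are hypothetical, and matching the keyword against both the source and the target side is an assumption about how the search should behave. The Streamlit UI is left out.

```python
import sqlite3


def create_corpus(conn):
    """Create the corpus table with the columns listed above."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS corpus (
            id        INTEGER PRIMARY KEY,
            source    TEXT NOT NULL,
            target    TEXT NOT NULL,
            domain    TEXT,
            lang_pair TEXT
        )
        """
    )


def search_corpus(conn, keyword=None, domain=None, lang_pair=None):
    """Return rows matching every filter that was supplied (hypothetical helper)."""
    clauses, params = [], []
    if keyword:
        # Assumption: a keyword may match either side of the pair.
        clauses.append("(source LIKE ? OR target LIKE ?)")
        params += [f"%{keyword}%", f"%{keyword}%"]
    if domain:
        clauses.append("domain = ?")
        params.append(domain)
    if lang_pair:
        clauses.append("lang_pair = ?")
        params.append(lang_pair)

    sql = "SELECT id, source, target, domain, lang_pair FROM corpus"
    if clauses:
        # Only fixed clause strings are joined here; user values stay in params.
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY id"
    return conn.execute(sql, params).fetchall()
```

A short usage example, with illustrative rows:

```python
conn = sqlite3.connect(":memory:")
create_corpus(conn)
conn.executemany(
    "INSERT INTO corpus (source, target, domain, lang_pair) VALUES (?, ?, ?, ?)",
    [
        ("Hello, world!", "¡Hola, mundo!", "general", "en-es"),
        ("How are you?", "¿Cómo estás?", "general", "en-es"),
    ],
)
results = search_corpus(conn, keyword="hola", lang_pair="en-es")
```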

👩🏻‍💻 Made with linguistic love

Created by @avocadoyoon,

because bilingual data deserves to be ✨clean, sorted, and a lil assertive✨.
