Dataprep

This repository contains data preparation, translation, and analysis scripts used to evaluate and compare human and machine translations (GPT, NMT, LLaMA) across multiple languages. It supports sentence- and paragraph-level processing, alignment, and evaluation using structured pipelines. Data includes annotated literary translations and WMT test sets, with support for both automatic and manual quality checks. Output includes alignment scores, cross-word ratio (XWR), and structural variation metrics for multilingual translation systems.

Dataprep

Standardize data format. Retrieve source sentences.

Raw data

all_annotations_v1.json contains littrans data from https://github.com/marzenakrp/LiteraryTranslation

wmt23/ contains en-de and de-en datasets from the WMT2023 testsets

json2csv_littrans.py : all_annotations_v1.json -> all_csv/

Parses a json file with annotations and creates csv files for each book (language pair): para (gpt3, human) and sent formatted as para (gpt3, nmt). Additionally, extracts human preferences into the csv file output/littrans_annotators_choices.csv.
Removes new lines within text chunks.
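The newline removal applied by the conversion scripts can be illustrated with a minimal sketch. The function names and the column names "source" and "target" below are assumptions made for illustration; they are not taken from json2csv_littrans.py or txt2csv_wmt23.py.

```python
import csv
import io
import re


def flatten_chunk(text: str) -> str:
    """Collapse line breaks (and the whitespace around them) into single spaces."""
    return re.sub(r"\s*[\r\n]+\s*", " ", text).strip()


def write_rows(rows, handle):
    """Write (source, target) pairs as csv rows, one text chunk per cell."""
    writer = csv.writer(handle)
    writer.writerow(["source", "target"])  # hypothetical column names
    for src, tgt in rows:
        writer.writerow([flatten_chunk(src), flatten_chunk(tgt)])


if __name__ == "__main__":
    buffer = io.StringIO()
    write_rows([("First line.\nSecond line.", "Erste Zeile.\r\n Zweite Zeile.")], buffer)
    print(buffer.getvalue())
```

Flattening each chunk before writing keeps every paragraph or sentence on a single csv row, which makes line-based inspection of the files easier.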

txt2csv_wmt23.py : wmt23/ -> all_csv/

Converts the WMT23 txt files to csv files, merging the source and target languages into one file. Para (human, gpt4), sent (nmt). Removes new lines within text chunks.

run_csv2json4Llama.sh : dataprep/all_csv/{lang}.para.human.csv -> inputs/source_para_json/

csv2json4Llama.py extracts source paragraphs into json files formatted for Llama.

split_source_sents.py : dataprep/all_csv/{lang}.para.human.csv -> inputs/source_sent_json

needs GPU

Preprocesses source paragraphs and standardizes punctuation per language before segmentation. Splits source texts into sentences. Writes json files formatted for Llama. Also writes txt files.

Create translations

translate_gpt.sh : inputs/source_${level}_json/*.json -> dataprep/translated/${level}-level

translate_with_openAI.py uses the OpenAI API to produce translations with GPT-3 and GPT-4. The script needs to be manually adjusted depending on level and model. Read the annotations in the script.

Translating with Llama

Translation needs 4 GPUs; the work is done on a cluster.
run_json2csv4Llama.sh : dataprep/llama_translations/llama_{level}_json -> dataprep/llama_translations/llama_{level}_csv
clean_Llama_with_gpt4.py : dataprep/llama_translations/llama_{level}_csv -> dataprep/llama_translations/llama_{level}_gpt4_cleaned/

Flags missing translations with NO TRANSLATION FOUND. Flagged lines are sent back to the model for re-evaluation, which produces the flags <<WRONG STATEMENT, TRANSLATION FOUND>>, <<INACCURATE TRANSLATION>>, and <<CORRECT STATEMENT, NO TRANSLATION FOUND, because>>. Make sure to indicate the "id" number of the line at which to start processing the file.
Feed flagged src-tgt pairs back to Llama for re-translation.
remove_gpt4_flags.py : dataprep/llama_translations/llama_{level}_llama_fixed -> dataprep/translated/{level}-level
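As a rough sketch of how the flags from the cleaning step could be sorted before re-translation: the mapping from flag to action below is an assumption (the README does not say which flags trigger re-translation), and the row fields "id" and "tgt" as well as the function names are hypothetical.

```python
# Assumed mapping from re-evaluation flag to follow-up action (not from the repository).
FLAG_ACTIONS = {
    "<<WRONG STATEMENT, TRANSLATION FOUND": "keep",
    "<<INACCURATE TRANSLATION": "retranslate",
    "<<CORRECT STATEMENT, NO TRANSLATION FOUND": "retranslate",
}


def classify(target: str) -> str:
    """Return the assumed action for one target line."""
    # Check the re-evaluation flags first: one of them contains the first-pass flag text.
    for prefix, action in FLAG_ACTIONS.items():
        if prefix in target:
            return action
    if "NO TRANSLATION FOUND" in target:
        return "recheck"  # first-pass flag, still waiting for re-evaluation
    return "clean"


def rows_to_retranslate(rows, start_id=0):
    """Select rows at or after start_id whose flag asks for re-translation."""
    return [r for r in rows if r["id"] >= start_id and classify(r["tgt"]) == "retranslate"]


if __name__ == "__main__":
    rows = [
        {"id": 0, "tgt": "Ein sauberer Satz."},
        {"id": 1, "tgt": "NO TRANSLATION FOUND"},
        {"id": 2, "tgt": "<<INACCURATE TRANSLATION>> Ein Satz."},
    ]
    print([r["id"] for r in rows_to_retranslate(rows, start_id=1)])
```

The start_id parameter mirrors the instruction to indicate the "id" at which processing should start, so an interrupted run can resume without redoing earlier lines.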

Merging sentences into paragraphs

run_merge_sents2paras.sh : dataprep/translated/sent-level -> inputs/sents

merge_sents2paras.py merges target sentences into paragraphs by aligning them with the source paragraphs via the source sentences from translated/sent-level. Sentences are already preprocessed, but source paragraphs from inputs/source_para_json/${langs}.para.source.json are not preprocessed.

The script preprocesses all texts equally, removes remaining translation artifacts, and normalizes punctuation and spaces.

Outputs csv files that are ready for the analysis.
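The merging idea can be sketched as follows, assuming the sentences arrive in document order and that each source paragraph equals the concatenation of its source sentences once whitespace and quote style are normalized. This is an illustrative simplification, not the logic of merge_sents2paras.py; all function names are hypothetical.

```python
import re
import unicodedata

# Map common typographic quotes to ASCII so both sides compare equally.
QUOTES = str.maketrans({"“": '"', "”": '"', "„": '"', "‘": "'", "’": "'"})


def normalize(text: str) -> str:
    """Normalize Unicode form, quote characters, and runs of whitespace."""
    text = unicodedata.normalize("NFKC", text).translate(QUOTES)
    return re.sub(r"\s+", " ", text).strip()


def key(text: str) -> str:
    """Whitespace-free comparison key for matching sentences to paragraphs."""
    return re.sub(r"\s+", "", normalize(text))


def merge_sents2paras(source_paras, sent_pairs):
    """Group target sentences into paragraphs by consuming source sentences in order."""
    merged = []
    pairs = iter(sent_pairs)
    for para in source_paras:
        para_key, consumed, targets = key(para), "", []
        while consumed != para_key:
            pair = next(pairs, None)
            if pair is None:
                raise ValueError("ran out of sentences before the paragraph was covered")
            src_sent, tgt_sent = pair
            consumed += key(src_sent)
            if not para_key.startswith(consumed):
                raise ValueError(f"sentence does not continue paragraph: {src_sent!r}")
            targets.append(normalize(tgt_sent))
        merged.append(" ".join(targets))
    return merged


if __name__ == "__main__":
    paras = ["Hello  world. How are you?", "Fine."]
    pairs = [("Hello world.", "Hallo Welt."), ("How are you?", "Wie geht es dir?"), ("Fine.", "Gut.")]
    print(merge_sents2paras(paras, pairs))
```

Applying the same normalization to paragraphs and sentences is what lets the unpreprocessed source paragraphs line up with the already preprocessed sentences.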

Copy all remaining files into inputs

cp dataprep/translated/para-level/* inputs/paras/
cp dataprep/all_csv/*para* inputs/paras
cp dataprep/all_csv/*sent* inputs/sents

Analysis

needs GPU

cd analysis

bash run_analysis.sh :

python3 align_sents.py -l ${level} : inputs/target_sent_json_{level}-level -> results/{level}_n2m_scores.csv

writes csv files with aligned sentences to output/aligned_sentences_{level}
writes results to results/{level}_n2m_scores.csv with ["lang", "system", "total_src_sents", "n2m", "n2mR", "length_var", "merges", "splits", "mergesRatio", "splitsRatio"]

python3 calculate_xwr.py -l ${level} : inputs/{level}s/ -> results/{level}_alignment_scores.csv

Performs word alignment and calculates cross word ratio (XWR)
writes all alignment data to output/alignments_per_file/
writes results to results/{level}_alignment_scores.csv with ["lang", "system", "all_alignments", "cross_alignments", "xwr_mean", "xwr_std"]
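To make the XWR columns concrete, here is a minimal sketch under one possible definition. It is an assumption, not confirmed by calculate_xwr.py: a word-alignment link (i, j) counts as a cross alignment if it crosses at least one other link, XWR for a sentence pair is the share of such links, and the mean and population standard deviation are taken over sentence pairs.

```python
from statistics import mean, pstdev


def count_cross_links(links):
    """Count links (src_idx, tgt_idx) that cross at least one other link."""
    crossing = 0
    for a, (i, j) in enumerate(links):
        # Two links cross when their source order and target order disagree.
        if any((i - k) * (j - l) < 0 for b, (k, l) in enumerate(links) if b != a):
            crossing += 1
    return crossing


def xwr(links):
    """Cross word ratio for one sentence pair (assumed: crossing links / all links)."""
    return count_cross_links(links) / len(links) if links else 0.0


def summarize(per_sentence_links):
    """Aggregate over sentence pairs into the fields named in the results file."""
    ratios = [xwr(links) for links in per_sentence_links]
    return {
        "all_alignments": sum(len(links) for links in per_sentence_links),
        "cross_alignments": sum(count_cross_links(links) for links in per_sentence_links),
        "xwr_mean": mean(ratios) if ratios else 0.0,
        "xwr_std": pstdev(ratios) if ratios else 0.0,
    }


if __name__ == "__main__":
    monotone = [(0, 0), (1, 1), (2, 2)]
    swapped = [(0, 1), (1, 0), (2, 2)]
    print(summarize([monotone, swapped]))
```

Under this definition a fully monotone alignment scores 0, and the score grows as more words are reordered between source and target.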

python3 merge_csv.py -l ${level}

Final dataframe: results/{level}_syntax_scores.csv
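The final merge can be pictured as a join of the two result tables. The sketch below assumes the join keys are "lang" and "system", the two columns both result files share; merge_csv.py may work differently. The rows in the demo are toy values, not results.

```python
import csv
import io


def merge_on_keys(left_rows, right_rows, keys=("lang", "system")):
    """Inner-join two lists of dict rows on the given key columns."""
    index = {tuple(row[k] for k in keys): row for row in right_rows}
    merged = []
    for row in left_rows:
        match = index.get(tuple(row[k] for k in keys))
        if match is not None:
            merged.append({**row, **match})
    return merged


def to_csv(rows):
    """Serialize dict rows to csv text, using the first row's keys as the header."""
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    # Toy values for illustration only.
    n2m = [{"lang": "de-en", "system": "gpt4", "n2m": "3"}]
    xwr = [{"lang": "de-en", "system": "gpt4", "xwr_mean": "0.2"}]
    print(to_csv(merge_on_keys(n2m, xwr)))
```

An inner join drops any language and system combination that is missing from one of the two score files, so a shorter final table can point to an analysis step that did not finish.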

