CodeMixQA


A benchmark with high-quality human annotations, comprising 16 diverse parallel code-switched language-pair variants that span multiple geographic regions and code-switching patterns, and include both original scripts and their transliterated forms.

We use SimpleQA Verified as our source dataset. It is a challenging evaluation set that current models have not yet saturated, and it has desirable properties: verifiable answers (through source reconciliation), de-duplicated data points, and topic balancing. It is also markedly different from the standard tasks that dominate code-switching studies, such as language identification, NER, and machine translation.

To build the dataset, we employ multiple generation strategies, including random switching, selective switching, and grammar-constrained approaches. This enables systematic evaluation of LLM performance across different code-switching patterns and text generation strategies.
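As an illustration of the simplest of these strategies, below is a minimal, hypothetical sketch of token-level random switching over a word-aligned parallel sentence pair. It is not the repository's implementation; the one-to-one alignment and the example sentence pair are assumptions for illustration only.

```python
import random

def random_switch(src_tokens, tgt_tokens, switch_prob=0.3, seed=0):
    """Randomly replace source tokens with their aligned target tokens.

    Assumes a one-to-one word alignment between the two sentences,
    which real parallel data generally does not guarantee.
    """
    assert len(src_tokens) == len(tgt_tokens), "sketch assumes 1:1 alignment"
    rng = random.Random(seed)
    return [
        tgt if rng.random() < switch_prob else src
        for src, tgt in zip(src_tokens, tgt_tokens)
    ]

# Hypothetical English-Spanish pair, aligned 1:1 for illustration only
english = "who wrote the novel".split()
spanish = "quién escribió la novela".split()
print(" ".join(random_switch(english, spanish)))
```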

📜 Paper

This is the source code for the paper "Can Large Language Models Understand, Reason About, and Generate Code-Switched Text?". The code is written in Python. If you use any code or datasets from this toolkit in your research, please cite the associated paper:

@article{winata2026can,
  title={Can Large Language Models Understand, Reason About, and Generate Code-Switched Text?},
  author={Winata, Genta Indra and Anugraha, David and Irawan, Patrick Amadeus and Das, Anirban and Yoo, Haneul and Dashore, Paresh and Kulkarni, Shreyas and Zhang, Ruochen and Sakajo, Haruki and Hudi, Frederikus and others},
  journal={arXiv preprint arXiv:2601.07153},
  year={2026}
}

⚡ Environment Setup

Run the following command to install the libraries required to reproduce the benchmark results.

Via pip

pip install -r requirements.txt

📊 Generate Dataset

Run the following command to generate the code-switched dataset.

python generate_codemix_data.py --openai_key <OPENAI_KEY>

Arguments

| Argument | Description | Example / Default |
| --- | --- | --- |
| `--openai_key` | OpenAI API key | `sk-....` |
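For reference, the GPT-based generation step inside such a script might look like the following minimal sketch, using the official openai Python client. The model name, prompt, and `codemix` helper are assumptions for illustration and do not mirror `generate_codemix_data.py`.

```python
from openai import OpenAI

client = OpenAI(api_key="sk-...")  # pass your own key

def codemix(question, lang_pair="English-Hindi", model="gpt-4o"):
    """Ask the model to code-switch a question (illustrative prompt only)."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Rewrite this question as natural {lang_pair} "
                       f"code-switched text: {question}",
        }],
    )
    return response.choices[0].message.content

print(codemix("Who received the IEEE Frank Rosenblatt Award in 2010?"))
```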

📊 Generate Transliteration

Transliteration is performed only for Indic languages.

python generate_transliterated_data.py --openai_key <OPENAI_KEY> --language <LANGUAGE>

Arguments

| Argument | Description | Example / Default |
| --- | --- | --- |
| `--openai_key` | OpenAI API key | `sk-....` |
| `--language` | Language to transliterate | `Hindi` |
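`generate_transliterated_data.py` takes an OpenAI key, so it uses a model-based approach; for comparison, a rule-based Devanagari-to-Latin transliteration can be sketched with the indic-transliteration package (a separate dependency, not necessarily used by this repository):

```python
# pip install indic-transliteration
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

hindi = "भारत की राजधानी क्या है"  # "What is the capital of India"
# Devanagari -> ITRANS romanization (one of several available schemes)
print(transliterate(hindi, sanscript.DEVANAGARI, sanscript.ITRANS))
```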

🧪 Run Evaluation

python src/inference.py -d <DATASET_NAMES> -o <OUTPUT>

Arguments

| Argument | Description | Example / Default |
| --- | --- | --- |
| `--dataset_names`, `-d` | Dataset names | `all` |
| `--output_folder`, `-o` | Output folder | `output` |
| `--chunk_size` | Chunk size | `1` |
| `--start_offset` | Start offset | `0` |
| `--end_offset` | End offset | `-1` |
| `--seeds_list` | List of seeds to use; provide one or more integers separated by spaces (e.g., `--seeds_list 0 1 2`) | `0 1 2` |
| `--safe-infer` | Filter out inputs longer than `max-model-len` minus the output length (`store_true`) | — |
| `--debug` | Debug with `{DEBUG_COUNT}` samples (`store_true`) | — |
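After inference completes, scoring typically reduces to comparing predictions against gold answers. Below is a minimal, hypothetical exact-match scorer; the `output/results.jsonl` path and the `prediction`/`answer` field names are assumptions, not the repository's actual output format.

```python
import json
import string

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(predictions, references):
    """Fraction of predictions that exactly match references after normalization."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

# Hypothetical output format: one JSON object per line with
# "prediction" and "answer" fields.
with open("output/results.jsonl") as f:
    rows = [json.loads(line) for line in f]

score = exact_match([r["prediction"] for r in rows], [r["answer"] for r in rows])
print(f"Exact match: {score:.3f}")
```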
