CodeMixQA


A benchmark with high-quality human annotations, comprising 16 diverse parallel code-switched language-pair variants that span multiple geographic regions and code-switching patterns, and include both original scripts and their transliterated forms.

We use SimpleQA Verified as our source dataset. It is a challenging evaluation set that current models have not yet saturated, and it has desirable properties: verifiable answers (through source reconciliation), de-duplicated data points, and topic balancing. It is also markedly different from the standard tasks that dominate code-switching studies, such as language identification, NER, and machine translation.

To build the dataset, we employ multiple generation strategies, including random switching, selective switching, and grammar-constrained approaches. This enables systematic evaluation of LLM performance across different code-switching patterns and text generation strategies.
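As an illustration of the simplest of these strategies, below is a minimal, hypothetical sketch of token-level random switching over a word-aligned parallel sentence pair. It is not the repository's implementation; the one-to-one alignment and the example sentence pair are assumptions for illustration only.

```python
import random

def random_switch(src_tokens, tgt_tokens, switch_prob=0.3, seed=0):
    """Randomly replace source tokens with their aligned target tokens.

    Assumes a one-to-one word alignment between the two sentences,
    which real parallel data generally does not guarantee.
    """
    assert len(src_tokens) == len(tgt_tokens), "sketch assumes 1:1 alignment"
    rng = random.Random(seed)
    return [
        tgt if rng.random() < switch_prob else src
        for src, tgt in zip(src_tokens, tgt_tokens)
    ]

# Hypothetical English-Spanish pair, aligned 1:1 for illustration only
english = "who wrote the novel".split()
spanish = "quién escribió la novela".split()
print(" ".join(random_switch(english, spanish)))
```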

📜 Paper

This is the source code for the paper "Can Large Language Models Understand, Reason About, and Generate Code-Switched Text?". The code is written in Python. If you use any code or datasets from this toolkit in your research, please cite the associated paper:

@article{winata2026can,
  title={Can Large Language Models Understand, Reason About, and Generate Code-Switched Text?},
  author={Winata, Genta Indra and Anugraha, David and Irawan, Patrick Amadeus and Das, Anirban and Yoo, Haneul and Dashore, Paresh and Kulkarni, Shreyas and Zhang, Ruochen and Sakajo, Haruki and Hudi, Frederikus and others},
  journal={arXiv preprint arXiv:2601.07153},
  year={2026}
}

⚡ Environment Setup

Run the following command to install the libraries required to reproduce the benchmark results.

Via pip

pip install -r requirements.txt

📊 Generate Dataset

Run the following command to generate the code-switched dataset.

python generate_codemix_data.py --openai_key <OPENAI_KEY>

Arguments

| Argument | Description | Example / Default |
| --- | --- | --- |
| `--openai_key` | OpenAI API key | `sk-....` |
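For reference, the GPT-based generation step inside such a script might look like the following minimal sketch, using the official openai Python client. The model name, prompt, and `codemix` helper are assumptions for illustration and do not mirror `generate_codemix_data.py`.

```python
from openai import OpenAI

client = OpenAI(api_key="sk-...")  # pass your own key

def codemix(question, lang_pair="English-Hindi", model="gpt-4o"):
    """Ask the model to code-switch a question (illustrative prompt only)."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Rewrite this question as natural {lang_pair} "
                       f"code-switched text: {question}",
        }],
    )
    return response.choices[0].message.content

print(codemix("Who received the IEEE Frank Rosenblatt Award in 2010?"))
```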

📊 Generate Transliteration

Transliteration is performed only for Indic languages.

python generate_transliterated_data.py --openai_key <OPENAI_KEY> --language <LANGUAGE>

Arguments

| Argument | Description | Example / Default |
| --- | --- | --- |
| `--openai_key` | OpenAI API key | `sk-....` |
| `--language` | Language to transliterate | `Hindi` |
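`generate_transliterated_data.py` takes an OpenAI key, so it uses a model-based approach; for comparison, a rule-based Devanagari-to-Latin transliteration can be sketched with the indic-transliteration package (a separate dependency, not necessarily used by this repository):

```python
# pip install indic-transliteration
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

hindi = "भारत की राजधानी क्या है"  # "What is the capital of India"
# Devanagari -> ITRANS romanization (one of several available schemes)
print(transliterate(hindi, sanscript.DEVANAGARI, sanscript.ITRANS))
```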

🧪 Run Evaluation

python src/inference.py -d <DATASET_NAMES> -o <OUTPUT>

Arguments

| Argument | Description | Example / Default |
| --- | --- | --- |
| `--dataset_names`, `-d` | Dataset names | `all` |
| `--output_folder`, `-o` | Output folder | `output` |
| `--chunk_size` | Chunk size | `1` |
| `--start_offset` | Start offset | `0` |
| `--end_offset` | End offset | `-1` |
| `--seeds_list` | List of seeds to use; provide one or more integers separated by spaces (e.g., `--seeds_list 0 1 2`) | `0 1 2` |
| `--safe-infer` | Filter out inputs longer than `max-model-len` minus the output length (`store_true`) | — |
| `--debug` | Debug with `{DEBUG_COUNT}` samples (`store_true`) | — |
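After inference completes, scoring typically reduces to comparing predictions against gold answers. Below is a minimal, hypothetical exact-match scorer; the `output/results.jsonl` path and the `prediction`/`answer` field names are assumptions, not the repository's actual output format.

```python
import json
import string

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(predictions, references):
    """Fraction of predictions that exactly match references after normalization."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

# Hypothetical output format: one JSON object per line with
# "prediction" and "answer" fields.
with open("output/results.jsonl") as f:
    rows = [json.loads(line) for line in f]

score = exact_match([r["prediction"] for r in rows], [r["answer"] for r in rows])
print(f"Exact match: {score:.3f}")
```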
