This is the repo for CaLMQA: Exploring culturally specific long-form question answering across 23 languages. CaLMQA is a long-form question answering (LFQA) dataset spanning 23 high- to low-resource languages.
We recommend using the CaLMQA HF dataset to view or use CaLMQA. This repo focuses on reproducibility of our experiments.
If you find CaLMQA useful, please cite:
TBA
Make a new Python 3.10+ environment using virtualenv or conda. Then install the calmqa package locally using:
git clone https://github.com/2015aroras/CaLMQA.git
cd CaLMQA
pip install -e .
To run automatic evaluations, you will also need to install optional extra dependencies using:
pip install -e ".[autoeval]"
Some models are prompted via API calls. These API calls require access credentials. Our code can read the relevant credentials from environment variables or from a .env file (using python-dotenv).
The required credentials depend on the type of model:
- OpenAI models (GPT-4 Turbo, GPT-4o): OPENAI_ORG_ID and OPENAI_API_KEY
- Claude models (Claude Opus): ANTHROPIC_API_KEY
- HF Transformers models (Aya 13B): HF_USER_ACCESS_TOKEN
- Together AI models (Mixtral 8x22B, Llama 3 70B): TOGETHER_API_KEY
- Vertex AI models (Gemini 1.5 Pro): GOOGLE_API_KEY, PROJECT_ID and REGION
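A minimal sketch of how credential lookup behaves, using only the standard library; the helper name get_credential is ours for illustration, not part of the repo (which additionally loads a .env file via python-dotenv):

```python
import os

def get_credential(name: str) -> str:
    """Read an access credential from the environment.

    The repo also supports a .env file via python-dotenv; calling
    dotenv.load_dotenv() beforehand merges .env entries into os.environ.
    """
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(
            f"Missing credential: set {name} in the environment or a .env file"
        )
    return value

# Illustration with a placeholder variable name and value.
os.environ["EXAMPLE_API_KEY"] = "sk-example"
print(get_credential("EXAMPLE_API_KEY"))
```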
More information about API authentication can be found in each API provider's documentation.
For reproducibility purposes, we spread our dataset across multiple 'dataset' files (in the data/datasets directory) and store extra state information (e.g. model version, temperature) in these files along with the QA data. Each dataset file contains entries consisting of:
- A question object, containing the question text, any of its translations, and the state under which those translations were produced.
- A list of answer objects. Each answer records the model that generated it (including human) and the state (prompting_state) under which the answer was generated.
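As an illustration of this layout, an entry might be navigated like the hypothetical one below; the field names here are assumptions, not the exact schema of the files in data/datasets:

```python
# Hypothetical dataset entry mirroring the structure described above.
# Field names are illustrative; inspect a real file in data/datasets
# for the actual schema.
entry = {
    "question": {
        "text": "Warum feiert man in Deutschland den Tag der Deutschen Einheit?",
        "translations": {"English": "Why does Germany celebrate German Unity Day?"},
    },
    "answers": [
        {
            "model": "human",
            "text": "...",
            "prompting_state": {"model_version": None, "temperature": None},
        }
    ],
}

for answer in entry["answers"]:
    print(answer["model"], answer["prompting_state"])
```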
We prompt models to generate answers to questions using scripts/generate.py. For each question in a data file (e.g. data/datasets/dataset-specific-german.json), this script fills in the prompt template data/prompts/generation-prompt.txt with the question and then sends the resulting prompt to a model. The base form of the command is
python scripts/generate.py <model> --dataset_load_path <dataset path> --dataset_save_path <save path> --temperature <temperature>
Supported models and more options can be found by running python scripts/generate.py --help.
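The template-filling step can be sketched as follows; the placeholder name and template text here are invented for illustration, not the actual contents of data/prompts/generation-prompt.txt:

```python
# Sketch of the fill-in step. TEMPLATE stands in for the real file
# data/prompts/generation-prompt.txt, whose wording may differ.
TEMPLATE = (
    "Answer the following question in the language it is asked in.\n\n"
    "Question: {question}\n\nAnswer:"
)

def build_prompt(question: str) -> str:
    """Fill the generation template with a single question."""
    return TEMPLATE.format(question=question)

prompt = build_prompt("Warum gibt es in Bayern so viele Feiertage?")
print(prompt)
```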
For culturally agnostic questions (e.g. those in data/datasets/dataset-agnostic-german.json), the extra argument --q_translation_langs <language> should be passed to tell the script to prompt using the non-English version of the question.
We generated all answers with temperature set to 0.
We translate culturally agnostic questions to other languages using scripts/translate.py. For each question in the English culturally agnostic data file data/datasets/dataset-agnostic-english.json, this script fills in the prompt template data/prompts/question-translation-prompt.txt with the English question, English answer and target language name, and then prompts a model to perform the translation. The base form of the command is
python scripts/translate.py questions <model> --source_langs English --target_langs <target language> --dataset_save_path <save path> --temperature <temperature>
Supported models and more options can be found by running python scripts/translate.py --help.
We ran all translations of culturally agnostic questions with GPT-4 Turbo and temperature set to 0.
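The translation prompt fill can be sketched the same way; the template wording below is an assumption (the real text lives in data/prompts/question-translation-prompt.txt):

```python
# Sketch of the translation prompt fill; the template wording is
# illustrative, not the actual file contents.
TRANSLATION_TEMPLATE = (
    "Translate the question below into {target_lang}. "
    "The answer is provided only as context for disambiguation.\n\n"
    "Question: {question}\nAnswer: {answer}\n\nTranslated question:"
)

prompt = TRANSLATION_TEMPLATE.format(
    target_lang="German",
    question="Why is the sky blue?",
    answer="Because of Rayleigh scattering of sunlight.",
)
print(prompt)
```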
We prompt models to categorize questions using scripts/categorize.py. For each question in a data file (e.g. data/datasets/dataset-specific-german.json), this script fills in the categorization prompt (e.g. data/prompts/categorization-english-prompt.txt) with the question, category names, category descriptions and category examples, and then prompts a model to categorize the question. The base form of the command is
python scripts/categorize.py <model> --all_categories -p <prompt file> --dataset_load_path <dataset path> --dataset_save_path <save path> --temperature <temperature>
Supported models and more options can be found by running python scripts/categorize.py --help. We ran all our categorization with temperature set to 0.
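Assembling the categorization prompt can be sketched as below; the category names, descriptions, examples and template wording are all invented for illustration (the real ones come from the prompt files under data/prompts):

```python
# Sketch of the categorization prompt fill. The categories and wording
# are illustrative, not the repo's actual category set.
CATEGORIES = {
    "History": ("Questions about historical events.",
                "Example: Who unified Germany in 1871?"),
    "Cuisine": ("Questions about food and drink.",
                "Example: What goes into a real Weisswurst?"),
}

def build_categorization_prompt(question: str) -> str:
    """Fill a prompt with category names, descriptions and examples."""
    lines = ["Assign the question to exactly one category.", ""]
    for name, (description, example) in CATEGORIES.items():
        lines.append(f"{name}: {description} {example}")
    lines += ["", f"Question: {question}", "Category:"]
    return "\n".join(lines)

prompt = build_categorization_prompt("What goes into a real Weisswurst?")
print(prompt)
```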
We detect language using scripts/detect_lang.py. For each question in a data file (e.g. data/datasets/dataset-specific-german.json), this script uses the polyglot or langid package to detect the language of a question or answer.
The command to detect the language of a model's answers is:
python scripts/detect_lang.py --dataset_load_path <dataset path> --model_name <model>
Similarly, language detection for questions can be done using
python scripts/detect_lang.py --dataset_load_path <dataset path> --check_questions
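The kind of statistic the script reports can be approximated by a small aggregation like the following; detect() below is a stub standing in for polyglot or langid (both return a language code for a text), and the function name is ours, not the script's:

```python
# Sketch of per-model aggregation over detected languages.
def detect(text: str) -> str:
    # Stub standing in for polyglot/langid; for illustration,
    # pretend every text is detected as German.
    return "de"

def fraction_in_language(answers: list[str], expected: str) -> float:
    """Fraction of answers whose detected language matches `expected`."""
    if not answers:
        return 0.0
    hits = sum(1 for a in answers if detect(a) == expected)
    return hits / len(answers)

print(fraction_in_language(["Guten Tag", "Hallo Welt"], "de"))
```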
For culturally agnostic questions (e.g. those in data/datasets/dataset-agnostic-german.json), the extra argument --q_translation_lang <language> should be passed to tell the script to check the non-English version of the question.
We detect repetitions using scripts/detect_repetitions.py. For each question in a data file (e.g. data/datasets/dataset-specific-german.json), the script tokenizes answers using tiktoken with the o200k_base encoding and then looks for repeated n-grams.
The command to report the percentage of a model's answers containing repetitions is:
python scripts/detect_repetitions.py --dataset_load_path <dataset path> --model_name <model>
More options can be found by running python scripts/detect_repetitions.py --help.
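The n-gram check can be sketched as follows; a whitespace split stands in for tiktoken's o200k_base tokenizer, and the thresholds are illustrative defaults, not the script's actual settings:

```python
# Sketch of n-gram repetition detection. The real script tokenizes
# with tiktoken (o200k_base); text.split() stands in here.
def has_repeated_ngram(text: str, n: int = 5, min_repeats: int = 3) -> bool:
    """True if any n-gram of tokens occurs at least `min_repeats` times.

    `n` and `min_repeats` are illustrative, not the script's thresholds.
    """
    tokens = text.split()
    counts: dict[tuple[str, ...], int] = {}
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i : i + n])
        counts[gram] = counts.get(gram, 0) + 1
        if counts[gram] >= min_repeats:
            return True
    return False

print(has_repeated_ngram("the cat sat " * 10))
```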