Add new Arabic benchmarks (5) and enhance existing tasks #372

Open
wants to merge 8 commits into main

Conversation

@alielfilali01 (Contributor) commented Oct 23, 2024

  • Renamed arabic_mmlu to arabic_mmlu_mt:

    • This change reflects that the previous Arabic MMLU was machine-translated (MT) using a neural machine translation (NMT) engine (most probably the Google Translate API).
  • Introduced three new MMLU-style benchmarks:

    • arabic_mmlu: Native Arabic MMLU benchmark introduced by MBZUAI, based on the official "ArabicMMLU" paper (https://arxiv.org/abs/2402.12840).
    • arabic_mmlu_ht: Human-translated version from MBZUAI, providing a more accurate and high-quality translation of the original work by Hendrycks et al. (2021) on Measuring Massive Multitask Language Understanding (MMLU).
    • arabic_mmmlu: Arabic subset of OpenAI's Multilingual MMLU (MMMLU), which is human-annotated, targeting similar subjects.
  • Added AraTrust benchmark:

    • Integrated AraTrust, a benchmark designed for evaluating trustworthiness in Arabic LLMs (based on the paper "AraTrust: An Evaluation of Trustworthiness for LLMs in Arabic" https://arxiv.org/abs/2403.09017).
  • Added MadinahQA benchmark:

    • Integrated MadinahQA, a benchmark from MBZUAI.
  • Comparative study across different versions of Arabic MMLU:

    • Detailed performance analysis shows a strong correlation between OpenAI’s MMMLU (human annotated) and MBZUAI’s Arabic MMLU HT (human-translated).
    • The arabic_mmlu_mt (machine translated using an NMT engine) shows competitive results compared to human-translated versions, indicating the efficacy of the translation engine.
    • The Okapi version (arabic_mmlu_okapi), which was translated using GPT-3.5 API (ChatGPT), shows lower correlation and performance, reflecting potential flaws and lower translation quality.

The attached table below shows the comparative analysis of model performances across the different Arabic MMLU datasets.

| Model Name | Model Size (B) | Average Score | Arabic MMLU Okapi | Arabic MMLU MT | Arabic MMLU HT | Arabic MMLU OpenAI |
|---|---|---|---|---|---|---|
| Qwen_Qwen2.5-7B-Instruct | 7 | 49.79 | 37.94 | 50.95 | 55.41 | 54.86 |
| Qwen_Qwen2.5-7B | 7 | 47.35 | 36.21 | 49.21 | 51.70 | 52.28 |

cc : @clefourrier , @NathanHB , @hynky1999
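For illustration, below is a minimal sketch of the kind of prompt function these MMLU-style tasks rely on, written to tolerate a variable number of answer options. The dataset column names and the `Doc` fields are assumptions about the lighteval task API at the time, not the actual code in this PR.

```python
# Hypothetical sketch, not the actual PR code: an MMLU-style prompt function
# that tolerates a variable number of answer options. The dataset column names
# (question, option_1..option_5, answer) and the Doc fields used here are
# assumptions about the lighteval task API and may differ from the real
# arabic_evals implementation.
from lighteval.tasks.requests import Doc

LETTERS = ["A", "B", "C", "D", "E"]


def arabic_mmlu_prompt(line: dict, task_name: str = "arabic_mmlu") -> Doc:
    """Build a multiple-choice Doc from a row with up to five options."""
    # Keep only the options that are actually present in this row.
    options = [line[f"option_{i}"] for i in range(1, 6) if line.get(f"option_{i}")]
    query = line["question"] + "\n"
    query += "\n".join(f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options))
    query += "\nالإجابة:"  # Arabic for "Answer:"
    return Doc(
        task_name=task_name,
        query=query,
        choices=LETTERS[: len(options)],
        gold_index=LETTERS.index(line["answer"]),  # assumes the gold label is a letter A-E
    )
```

The same pattern would apply to the other benchmarks in the PR, with only the column names changing.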

Add new Arabic benchmarks and update existing tasks

- Renamed `arabic_mmlu` to `arabic_mmlu_mt` to highlight its machine-translated origin.
- Added new benchmarks: `arabic_mmlu` (ArabicMMLU, https://arxiv.org/abs/2402.12840), `arabic_mmlu_ht` (human-translated), and `MadinahQA` from MBZUAI, as well as `arabic_mmmlu` (OpenAI MMMLU) and `AraTrust`, a trustworthiness benchmark for Arabic LLMs (https://arxiv.org/abs/2403.09017).
- Enhanced prompt functions for better flexibility in answer options.
Rename file to reflect that it is v1 leaderboard tasks
Tasks for v2 of OALL
add new and renamed tasks
@alielfilali01 changed the title from "Add new Arabic MMLU benchmarks and enhance existing tasks" to "Add new Arabic benchmarks and enhance existing tasks" on Oct 23, 2024
@alielfilali01 changed the title from "Add new Arabic benchmarks and enhance existing tasks" to "Add new Arabic benchmarks (5) and enhance existing tasks" on Oct 23, 2024
@hynky1999 (Collaborator) commented Oct 24, 2024

Hi, thanks for adding the benches.
Two things:
Do you think you could use the prompt templates? This would ensure that you can easily switch between formulations (and therefore evaluate models at an early stage of training) and that the task implementations stay consistent.

Secondly, we have already added both arabic_mmmlu and openai_mmlu. I would prefer not to add duplicates, but I am open to discussing the with/without-instruction modifications.
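For context, the prompt templates mentioned here could be wired up roughly as in the sketch below. The import paths, `get_mcq_prompt_function`, the `Language` enum, and `MCFFormulation` are taken from my reading of lighteval's template docs around this time and should be treated as assumptions, not the documented API.

```python
# Rough sketch only: the import paths and helper names below are assumptions
# based on lighteval's prompt-template docs at the time and may have changed.
from lighteval.tasks.templates.multichoice import get_mcq_prompt_function
from lighteval.tasks.templates.utils.formulation import MCFFormulation
from lighteval.utils.language import Language

# The adapter maps a raw dataset row onto the generic MCQ input the template
# expects; switching between formulations (MCF, CF, ...) then needs no change
# to the task itself, which is what makes early-training evaluation easier.
arabic_mmlu_templated_prompt = get_mcq_prompt_function(
    Language.ARABIC,
    lambda line: {
        "question": line["question"],
        "choices": line["choices"],
        "gold_idx": line["answer_index"],
    },
    MCFFormulation(),
)
```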

@alielfilali01 (Contributor, Author)

Hey @hynky1999, thanks for your input!

> Secondly, we have added both arabic_mmmlu and openai_mmlu.

I went ahead and added them just to test how the different implementations might affect the scores (hopefully it doesn’t!). I can run this test on my fork and compare the results with your version, and then we can decide whether to keep them or not. What do you think?

> Do you think you could use the prompt templates?

I haven’t fully wrapped my head around the templates yet—it might take me a few days. If you’re able to help with this integration in the meantime, feel free to contribute! Otherwise, I’ll try to get to it by next week max.

Also, I’m unsure why the format check is failing. I ran `ruff format .` on my local machine before pushing, but it’s still being flagged. Could you help me figure out what might be going wrong?

Thanks!
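For the fork-versus-upstream comparison mentioned in this comment, a small hypothetical helper along these lines could diff the per-task scores from two lighteval results files. The JSON layout assumed here (a top-level "results" mapping of task name to a metric dict) is an assumption about lighteval's output format and may differ across versions.

```python
# Hypothetical helper, not part of the PR: compare per-task scores from two
# lighteval results files to check that two implementations of the same
# benchmark agree. The JSON layout ("results" -> task -> metric dict) is an
# assumption and may differ across lighteval versions.
import json


def load_scores(path: str) -> dict[str, float]:
    with open(path) as f:
        results = json.load(f)["results"]
    # Take the first metric reported for each task (e.g. an accuracy-like value).
    return {task: next(iter(metrics.values())) for task, metrics in results.items()}


def compare(path_a: str, path_b: str, tol: float = 0.01) -> None:
    scores_a, scores_b = load_scores(path_a), load_scores(path_b)
    for task in sorted(scores_a.keys() & scores_b.keys()):
        diff = abs(scores_a[task] - scores_b[task])
        status = "OK" if diff <= tol else "DIFFERS"
        print(f"{task}: {scores_a[task]:.4f} vs {scores_b[task]:.4f} ({status})")
```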

@NathanHB (Member)

Hi! Thanks for the PR. For the formatting issues, you should use the pre-commit hooks:

```bash
pip install -e .[dev]
pre-commit install
pre-commit run --all-files
```

@NathanHB (Member) commented Nov 4, 2024

Thanks for fixing the formatting! You can find the doc on adding a new task using the prompt templates here; don't hesitate to reach out if you need any help :)

@alielfilali01 (Contributor, Author)

Hey @NathanHB, thanks for pointing me to the docs on adding prompt templates. I'm planning to do that in a separate PR in the near future. For now, I believe we can move on with this unless it contradicts the team's plans for future versions of LightEval.

@clefourrier (Member)

Hi! I think we can indeed move forward with this. One last thing first: did you check the difference in results between your implementations and the current ones of arabic mmlu and openai mmlu?
