Add new Arabic benchmarks (5) and enhance existing tasks #372

Open
wants to merge 8 commits into main

Conversation

@alielfilali01 (Contributor) commented Oct 23, 2024

  • Renamed arabic_mmlu to arabic_mmlu_mt:

    • This change reflects that the previous Arabic MMLU was machine-translated (MT) using a neural machine translation (NMT) engine (most probably the Google Translate API).
  • Introduced three new MMLU-style benchmarks:

    • arabic_mmlu: Native Arabic MMLU benchmark introduced by MBZUAI, based on the official "ArabicMMLU" paper (https://arxiv.org/abs/2402.12840).
    • arabic_mmlu_ht: Human-translated version from MBZUAI, providing a more accurate and high-quality translation of the original work by Hendrycks et al. (2021) on Measuring Massive Multitask Language Understanding (MMLU).
    • arabic_mmmlu: Arabic subset of OpenAI's Multilingual MMLU (MMMLU), which is human-annotated, targeting similar subjects.
  • Added AraTrust benchmark:

    • Integrated AraTrust, a benchmark designed for evaluating trustworthiness in Arabic LLMs (based on the paper "AraTrust: An Evaluation of Trustworthiness for LLMs in Arabic" https://arxiv.org/abs/2403.09017).
  • Added MadinahQA benchmark:

    • Integrated MadinahQA, a benchmark from MBZUAI.
  • Comparative study across different versions of Arabic MMLU:

    • Detailed performance analysis shows a strong correlation between OpenAI’s MMMLU (human annotated) and MBZUAI’s Arabic MMLU HT (human-translated).
    • The arabic_mmlu_mt (machine translated using an NMT engine) shows competitive results compared to human-translated versions, indicating the efficacy of the translation engine.
    • The Okapi version (arabic_mmlu_okapi), which was translated using GPT-3.5 API (ChatGPT), shows lower correlation and performance, reflecting potential flaws and lower translation quality.

The attached table below shows the comparative analysis of model performances across the different Arabic MMLU datasets.

| Model Name | Model Size (B) | Average Score | Arabic MMLU Okapi | Arabic MMLU MT | Arabic MMLU HT | Arabic MMLU OpenAI |
|---|---|---|---|---|---|---|
| Qwen_Qwen2.5-7B-Instruct | 7 | 49.79 | 37.94 | 50.95 | 55.41 | 54.86 |
| Qwen_Qwen2.5-7B | 7 | 47.35 | 36.21 | 49.21 | 51.70 | 52.28 |

cc : @clefourrier , @NathanHB , @hynky1999
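For illustration, below is a minimal sketch of the kind of prompt function these MMLU-style tasks rely on, written to tolerate a variable number of answer options. The dataset column names and the `Doc` fields are assumptions about the lighteval task API at the time, not the actual code in this PR.

```python
# Hypothetical sketch, not the actual PR code: an MMLU-style prompt function
# that tolerates a variable number of answer options. The dataset column names
# (question, option_1..option_5, answer) and the Doc fields used here are
# assumptions about the lighteval task API and may differ from the real
# arabic_evals implementation.
from lighteval.tasks.requests import Doc

LETTERS = ["A", "B", "C", "D", "E"]


def arabic_mmlu_prompt(line: dict, task_name: str = "arabic_mmlu") -> Doc:
    """Build a multiple-choice Doc from a row with up to five options."""
    # Keep only the options that are actually present in this row.
    options = [line[f"option_{i}"] for i in range(1, 6) if line.get(f"option_{i}")]
    query = line["question"] + "\n"
    query += "\n".join(f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options))
    query += "\nالإجابة:"  # Arabic for "Answer:"
    return Doc(
        task_name=task_name,
        query=query,
        choices=LETTERS[: len(options)],
        gold_index=LETTERS.index(line["answer"]),  # assumes the gold label is a letter A-E
    )
```

The same pattern would apply to the other benchmarks in the PR, with only the column names changing.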

Add new Arabic benchmarks and update existing tasks

- Renamed `arabic_mmlu` to `arabic_mmlu_mt` to highlight its machine-translated origin.
- Added new benchmarks: `arabic_mmlu` (ArabicMMLU, https://arxiv.org/abs/2402.12840), `arabic_mmlu_ht` (human-translated), and `MadinahQA` from MBZUAI, as well as `arabic_mmmlu` (OpenAI MMMLU) and `AraTrust`, a trustworthiness benchmark for Arabic LLMs (https://arxiv.org/abs/2403.09017).
- Enhanced prompt functions for better flexibility in answer options.
Rename file to reflect that it is v1 leaderboard tasks
Tasks for v2 of OALL
add new and renamed tasks
@alielfilali01 changed the title from "Add new Arabic MMLU benchmarks and enhance existing tasks" to "Add new Arabic benchmarks and enhance existing tasks" on Oct 23, 2024
@alielfilali01 changed the title from "Add new Arabic benchmarks and enhance existing tasks" to "Add new Arabic benchmarks (5) and enhance existing tasks" on Oct 23, 2024
@hynky1999 (Collaborator) commented Oct 24, 2024

Hi, thanks for adding the benches.
Two things:
Do you think you could use the prompt templates? This would ensure that you can easily switch between formulations (and therefore evaluate models at an early stage of training) and that the task implementations stay consistent.

Secondly, we have already added both arabic_mmmlu and openai_mmlu. I would prefer not to add duplicates, but I am open to discussing the with/without-instruction modifications.
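For context, the prompt templates mentioned here could be wired up roughly as in the sketch below. The import paths, `get_mcq_prompt_function`, the `Language` enum, and `MCFFormulation` are taken from my reading of lighteval's template docs around this time and should be treated as assumptions, not the documented API.

```python
# Rough sketch only: the import paths and helper names below are assumptions
# based on lighteval's prompt-template docs at the time and may have changed.
from lighteval.tasks.templates.multichoice import get_mcq_prompt_function
from lighteval.tasks.templates.utils.formulation import MCFFormulation
from lighteval.utils.language import Language

# The adapter maps a raw dataset row onto the generic MCQ input the template
# expects; switching between formulations (MCF, CF, ...) then needs no change
# to the task itself, which is what makes early-training evaluation easier.
arabic_mmlu_templated_prompt = get_mcq_prompt_function(
    Language.ARABIC,
    lambda line: {
        "question": line["question"],
        "choices": line["choices"],
        "gold_idx": line["answer_index"],
    },
    MCFFormulation(),
)
```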

@alielfilali01 (Contributor, Author)

Hey @hynky1999, thanks for your input!

> Secondly, we have added both arabic_mmmlu and openai_mmlu.

I went ahead and added them just to test how the different implementations might affect the scores (hopefully it doesn’t!). I can run this test on my fork and compare the results with your version, and then we can decide whether to keep them or not. What do you think?

> Do you think you could use the prompt templates?

I haven’t fully wrapped my head around the templates yet—it might take me a few days. If you’re able to help with this integration in the meantime, feel free to contribute! Otherwise, I’ll try to get to it by next week max.

Also, I’m unsure why the format check is failing. I ran `ruff format .` on my local machine before pushing, but it’s still being flagged. Could you help me figure out what might be going wrong?

Thanks!
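For the fork-versus-upstream comparison mentioned in this comment, a small hypothetical helper along these lines could diff the per-task scores from two lighteval results files. The JSON layout assumed here (a top-level "results" mapping of task name to a metric dict) is an assumption about lighteval's output format and may differ across versions.

```python
# Hypothetical helper, not part of the PR: compare per-task scores from two
# lighteval results files to check that two implementations of the same
# benchmark agree. The JSON layout ("results" -> task -> metric dict) is an
# assumption and may differ across lighteval versions.
import json


def load_scores(path: str) -> dict[str, float]:
    with open(path) as f:
        results = json.load(f)["results"]
    # Take the first metric reported for each task (e.g. an accuracy-like value).
    return {task: next(iter(metrics.values())) for task, metrics in results.items()}


def compare(path_a: str, path_b: str, tol: float = 0.01) -> None:
    scores_a, scores_b = load_scores(path_a), load_scores(path_b)
    for task in sorted(scores_a.keys() & scores_b.keys()):
        diff = abs(scores_a[task] - scores_b[task])
        status = "OK" if diff <= tol else "DIFFERS"
        print(f"{task}: {scores_a[task]:.4f} vs {scores_b[task]:.4f} ({status})")
```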

@NathanHB (Member)

Hi! Thanks for the PR. For the formatting issues, you should use the pre-commit hooks:

```bash
pip install -e .[dev]
pre-commit install
pre-commit run --all-files
```

@NathanHB (Member) commented Nov 4, 2024

Thanks for fixing the formatting! You can find the doc on adding a new task using the prompt templates here; don't hesitate to reach out if you need any help :)

@alielfilali01 (Contributor, Author)

Hey @NathanHB, thanks for pointing me to the docs on adding prompt templates. I'm planning to do that in a separate PR in the near future. For now, I believe we can move on with this unless it contradicts the team's plans for future versions of LightEval.

@clefourrier (Member)

Hi! I think we can indeed move forward with this. One last thing first: did you check the difference in results between your implementations and the current ones of arabic mmlu and openai mmlu?
