Add new Arabic benchmarks (5) and enhance existing tasks #372
base: main
Conversation
Add new Arabic benchmarks and update existing tasks:
- Renamed `arabic_mmlu` to `arabic_mmlu_mt` to highlight its machine-translated origin.
- Added new benchmarks: `arabic_mmlu` (ArabicMMLU, https://arxiv.org/abs/2402.12840), `arabic_mmlu_ht` (human-translated), and `MadinahQA` from MBZUAI, as well as `arabic_mmmlu` (OpenAI MMMLU) and `AraTrust`, a trustworthiness benchmark for Arabic LLMs (https://arxiv.org/abs/2403.09017).
- Enhanced prompt functions for better flexibility in answer options.
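The "flexibility in answer options" point can be sketched as a prompt function that tolerates a variable number of choices per question. This is only an illustrative sketch: the field names (`question`, `choices`, `answer_key`) and the returned dict are hypothetical, not lighteval's actual `Doc` API.

```python
# Hypothetical sketch of a flexible MCQ prompt function.
# Field names and the returned dict shape are illustrative only;
# they do not reproduce lighteval's real prompt-function signature.
LETTER_INDICES = ["A", "B", "C", "D", "E"]

def arabic_mmlu_prompt(line: dict, task_name: str = "arabic_mmlu") -> dict:
    # Keep only the options actually present, so datasets with a
    # variable number of choices (2-5) are handled uniformly.
    options = [opt for opt in line["choices"] if opt]
    letters = LETTER_INDICES[: len(options)]

    query = line["question"] + "\n"
    for letter, opt in zip(letters, options):
        query += f"{letter}. {opt}\n"
    query += "Answer:"

    return {
        "task_name": task_name,
        "query": query,
        "choices": letters,
        "gold_index": letters.index(line["answer_key"]),
    }
```

With this shape, a 3-option AraTrust item and a 5-option ArabicMMLU item go through the same code path; only the length of `letters` changes.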
Rename file to reflect that it is v1 leaderboard tasks
Tasks for v2 of OALL
add new and renamed tasks
Hi, thanks for adding the benches. Secondly, we have already added both arabic_mmmlu and openai_mmlu. I would prefer not to add duplicates, but I am open to discussing adding them with/without the instruction modifications.
Hey @hynky1999, thanks for your input!
I went ahead and added them just to test how the different implementations might affect the scores (hopefully it doesn't!). I can run this test on my fork and compare the results with your version, and then we can decide whether to keep them or not. What do you think?
I haven't fully wrapped my head around the templates yet; it might take me a few days. If you're able to help with this integration in the meantime, feel free to contribute! Otherwise, I'll try to get to it by next week at the latest. Also, I'm unsure why the format check is failing. Thanks!
Hi! Thanks for the PR. For the formatting issues, you should use the pre-commit hooks.
Fix formatting issues for
Thanks for fixing the formatting! You can find the doc on adding a new task using the prompt templates here; don't hesitate to reach out if you need any help :)
Hey @NathanHB, thanks for pointing out the docs on adding prompt templates. I'm planning to add that in a separate PR in the near future. For now, I believe we can move on with this unless it contradicts the team's plan for future versions of LightEval.
Hi! I think we can indeed move forward with this. One last thing before we do: did you check the difference in results between your implementations of Arabic MMLU and OpenAI MMLU and the current ones?
- Renamed `arabic_mmlu` to `arabic_mmlu_mt`.
- Introduced three new MMLU-style benchmarks:
  - `arabic_mmlu`: Native Arabic MMLU benchmark introduced by MBZUAI, based on the official "ArabicMMLU" paper (https://arxiv.org/abs/2402.12840).
  - `arabic_mmlu_ht`: Human-translated version from MBZUAI, providing a more accurate, high-quality translation of the original work by Hendrycks et al. (2021) on Measuring Massive Multitask Language Understanding (MMLU).
  - `arabic_mmmlu`: Arabic subset of OpenAI's Multilingual MMLU (MMMLU), which is human-annotated and targets similar subjects.
- Added the AraTrust benchmark.
- Added the MadinahQA benchmark.
- Comparative study across different versions of Arabic MMLU:
  - `arabic_mmlu_mt` (machine-translated using an NMT engine) shows competitive results compared to the human-translated versions, indicating the efficacy of the translation engine.
  - The Okapi version (`arabic_mmlu_okapi`), which was translated using the GPT-3.5 API (ChatGPT), shows lower correlation and performance, reflecting potential flaws and lower translation quality.

The attached table below shows the comparative analysis of model performances across the different Arabic MMLU datasets.
cc : @clefourrier , @NathanHB , @hynky1999