
MFTCXplain: A Multilingual Benchmark Dataset for Evaluating the Moral Reasoning of Large Language Models


Ensuring the moral reasoning capabilities of Large Language Models (LLMs) is a growing concern as these systems are used in socially sensitive tasks. Nevertheless, current evaluation benchmarks present two major shortcomings: a lack of annotations that justify moral classifications, which limits transparency and interpretability; and a predominant focus on English, which constrains the assessment of moral reasoning across diverse cultural settings. To address these gaps, we introduce MFTCXplain, a multilingual benchmark dataset for evaluating the moral reasoning of LLMs via multi-hop hate speech explanations grounded in Moral Foundations Theory. Our results show a misalignment between LLM outputs and human annotations in moral reasoning tasks. While LLMs perform well in hate speech detection (F1 up to 0.836), their ability to predict moral sentiments is notably weak (F1 < 0.35). Furthermore, rationale alignment remains limited, particularly in underrepresented languages. Our findings show the limited capacity of current LLMs to internalize and reflect human moral reasoning.
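The F1 scores above refer to standard binary F1 over gold versus predicted labels. As a minimal, illustrative sketch of such a comparison (the label lists below are made-up toy values, not dataset results):

```python
def binary_f1(gold, pred):
    """F1 for the positive class (label 1), computed from scratch."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy labels for illustration only.
gold = [1, 0, 1, 1, 0, 1]
pred = [1, 0, 0, 1, 1, 1]
print(binary_f1(gold, pred))  # 0.75
```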

This repository contains the MFTCXplain dataset, which comprises 3,000 tweets across Portuguese, Italian, Persian, and English, annotated with binary hate speech labels, moral foundation categories, and text span-level rationales. Below is an example annotation for the Moral Foundations Theory (MFT) categories.

(Figure: example annotation of MFT categories)
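To make the annotation layers concrete, here is a hypothetical record in the spirit of the description above. The field names and values are illustrative assumptions, not the dataset's actual schema:

```python
# Hypothetical MFTCXplain-style record; keys and values are illustrative
# assumptions, NOT the dataset's real schema.
record = {
    "text": "that group is awful",
    "language": "EN",
    "hate_speech": 1,                    # binary hate speech label
    "moral_foundations": ["care/harm"],  # MFT categories
    "rationales": [(14, 19)],            # character spans in `text`
}

def rationale_text(rec):
    """Return the annotated rationale spans as text snippets."""
    return [rec["text"][start:end] for start, end in rec["rationales"]]

print(rationale_text(record))  # ['awful']
```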

MFTCXplain Statistics

| Language | Hate Speech | Non-Hate Speech | Total |
|---|---|---|---|
| English (EN) | 310 | 394 | 704 |
| Italian (IT) | 300 | 321 | 621 |
| Persian (PE) | 302 | 306 | 608 |
| Portuguese (PO) | 541 | 526 | 1,067 |
| All Languages | 1,453 | 1,547 | 3,000 |
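As a quick sanity check, the per-language counts can be tabulated programmatically. The numbers below are copied verbatim from the table; nothing else is assumed:

```python
# (hate, non-hate) counts per language, copied from the table above.
counts = {
    "EN": (310, 394),
    "IT": (300, 321),
    "PE": (302, 306),
    "PO": (541, 526),
}

totals = {lang: h + n for lang, (h, n) in counts.items()}
grand_total = sum(totals.values())
hate_share = {lang: h / (h + n) for lang, (h, n) in counts.items()}

print(totals)       # {'EN': 704, 'IT': 621, 'PE': 608, 'PO': 1067}
print(grand_total)  # 3000
```

The per-language totals and the 3,000-tweet grand total match the table, and `hate_share` shows the label distribution is close to balanced in every language.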



If you have any questions, feel free to contact me at: franciellealvargas@gmail.com.

CITING / BIBTEX

Please cite our paper if you use our dataset:

@inproceedings{trager-etal-2025-mftcxplain,
    title = "{MFTCX}plain: A Multilingual Benchmark Dataset for Evaluating the Moral Reasoning of {LLM}s through Multi-hop Hate Speech Explanation",
    author = "Trager, Jackson  and
      Vargas, Francielle  and
      Alves, Diego  and
      Guida, Matteo  and
      Ngueajio, Mikel K.  and
      Agrawal, Ameeta  and
      Daryani, Yalda  and
      Malekabadi, Farzan Karimi  and
      Plaza-del-Arco, Flor Miriam",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.851/",
    doi = "10.18653/v1/2025.findings-emnlp.851",
    pages = "15709--15740",
    ISBN = "979-8-89176-335-7",
    abstract = "Ensuring the moral reasoning capabilities of Large Language Models (LLMs) is a growing concern as these systems are used in socially sensitive tasks. Nevertheless, current evaluation benchmarks present two major shortcomings: a lack of annotations that justify moral classifications, which limits transparency and interpretability; and a predominant focus on English, which constrains the assessment of moral reasoning across diverse cultural settings. In this paper, we introduce MFTCXplain, a multilingual benchmark dataset for evaluating the moral reasoning of LLMs via multi-hop hate speech explanations using the Moral Foundations Theory. MFTCXplain comprises 3,000 tweets across Portuguese, Italian, Persian, and English, annotated with binary hate speech labels, moral categories, and text span-level rationales. Our results show a misalignment between LLM outputs and human annotations in moral reasoning tasks. While LLMs perform well in hate speech detection (F1 up to 0.836), their ability to predict moral sentiments is notably weak (F1 {\ensuremath{<}} 0.35). Furthermore, rationale alignment remains limited mainly in underrepresented languages. Our findings show the limited capacity of current LLMs to internalize and reflect human moral reasoning."
}



FUNDING

(funding agency logos)

