Official repository for the paper "A Benchmark for Neural Readability Assessment of Texts in Spanish" by @lmvasquezr, @pcuenq, @fireblend and @feralvam.
If you have any questions, please don't hesitate to contact us. Feel free to submit an issue or enhancement request on GitHub.
Our datasets combine texts that are freely available with texts covered by a data license agreement. We have published the freely available datasets on HuggingFace to support further readability studies. Please find the links in the table below:
Dataset | Original Readability Level |
---|---|
HablaCultura | CEFR |
kwiziq | CEFR |
coh-metrix-esp | simple, complex |
CAES | CEFR |
Simplext* | simple, complex |
Newsela* | School Grade Levels (2-12) and Readability Levels (0-4) |
OneStopCorpus | basic, intermediate, advanced |
*Please request access to the Newsela and Simplext corpora (for Simplext, contact Horacio Saggion); we will be happy to share our splits upon request.
We have released all of our pretrained models on HuggingFace:
Model | Granularity | # classes |
---|---|---|
BERTIN (ES) | paragraphs | 2 |
BERTIN (ES) | paragraphs | 3 |
mBERT (ES) | paragraphs | 2 |
mBERT (ES) | paragraphs | 3 |
mBERT (EN+ES) | paragraphs | 3 |
BERTIN (ES) | sentences | 2 |
BERTIN (ES) | sentences | 3 |
mBERT (ES) | sentences | 2 |
mBERT (ES) | sentences | 3 |
mBERT (EN+ES) | sentences | 3 |
For the zero-shot setting, we used the original BERTIN and mBERT models with no further training.
You can also find our TF-IDF + Logistic Regression approach in `model_regression.py`, which is based on this implementation.
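As a rough illustration of the TF-IDF + Logistic Regression approach, a minimal sketch with scikit-learn is shown below. The toy texts and labels are invented for the example and do not come from our datasets; see `model_regression.py` for the actual implementation.

```python
# Minimal sketch of a TF-IDF + Logistic Regression readability classifier.
# The texts/labels below are toy examples, not data from the benchmark.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy Spanish texts labeled "simple" vs. "complex" (illustrative only)
texts = [
    "El gato duerme en la casa.",
    "Me gusta comer pan con queso.",
    "La implementación de políticas macroeconómicas requiere un análisis exhaustivo.",
    "Los paradigmas epistemológicos contemporáneos cuestionan la objetividad científica.",
]
labels = ["simple", "simple", "complex", "complex"]

# Word n-gram TF-IDF features feeding a logistic regression classifier
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)

pred = clf.predict(["El perro corre en el parque."])[0]
print(pred)
```

In practice you would fit the pipeline on one of the released training splits and evaluate on the corresponding test split rather than on toy sentences.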
We have published all of our datasets and models on HuggingFace. As a reference, we have also included our training and data processing scripts in the source folder.
If you use our results and scripts in your research, please cite our work: "A Benchmark for Neural Readability Assessment of Texts in Spanish" (to be published)
@inproceedings{vasquez-rodriguez-etal-2022-benchmarking,
title = "A Benchmark for Neural Readability Assessment of Texts in Spanish",
author = "V{\'a}squez-Rodr{\'\i}guez, Laura and
  Cuenca-Jim{\'e}nez, Pedro-Manuel and
Morales-Esquivel, Sergio Esteban and
Alva-Manchego, Fernando",
booktitle = "Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022), EMNLP 2022",
month = dec,
year = "2022",
}
We have downloaded the datasets below from their original websites to make them available to the community on HuggingFace. If you use this data, please credit the original authors as well as our work.
We have extracted the CAES corpus from their website. If you use this corpus, please also cite their work as follows:
@article{Parodi2015,
author = "Giovanni Parodi",
title = "Corpus de aprendices de español (CAES)",
journal = "Journal of Spanish Language Teaching",
volume = "2",
number = "2",
pages = "194-200",
year = "2015",
publisher = "Routledge",
doi = "10.1080/23247797.2015.1084685",
URL = "https://doi.org/10.1080/23247797.2015.1084685",
eprint = "https://doi.org/10.1080/23247797.2015.1084685"
}
We have made the dataset collected from the Coh-Metrix-Esp paper available on HuggingFace. If you use their data, please cite their work as follows:
@inproceedings{quispesaravia-etal-2016-coh,
title = "{C}oh-{M}etrix-{E}sp: A Complexity Analysis Tool for Documents Written in {S}panish",
author = "Quispesaravia, Andre and
Perez, Walter and
Sobrevilla Cabezudo, Marco and
Alva-Manchego, Fernando",
booktitle = "Proceedings of the Tenth International Conference on Language Resources and Evaluation ({LREC}'16)",
month = may,
year = "2016",
address = "Portoro{\v{z}}, Slovenia",
publisher = "European Language Resources Association (ELRA)",
url = "https://aclanthology.org/L16-1745",
pages = "4694--4698",
}
For these datasets, please also give credit to the HablaCultura.com and Kwiziq websites.