🇺🇾 UY22 corpus

This repo contains the notebooks used for scraping the following uruguayan media sites:

El Observador (elobservador.com.uy)
El País (elpais.com.uy)
La Diaria (ladiaria.com.uy)
Montevideo Portal (montevideo.com.uy)

Every article scraped was stored as a .json file with the following structure:

{
    "url":      string,
    "id":       int,
    "date":     string,
    "category": string,
    "title":    string,
    "keywords": []string,
    "cover":    string,
    "body":     string,
}

where

url: URL pointing to original article
id: numeric ID (if exists, else random UID)
date: article's timestamp
category: article's category
title: article's title or header
keywords: article's tags
cover: URL pointing to article's front image (if any)
body: article's body

Every site is assagined a directory, and every articles is stored inside a directory named after its publishing year.

e.g., uy22-raw/ep22/2019/20190101120000-142502-Los_datos_del_Rey_de.json

For every corpus, there are two versions available:

raw: where body contains the raw unprocessed articles' HTML
clean: where body contains just text without HTML tags

Both raw & clean versions are about 6 GiB & 4 GiB respectively (totalling 10.3 GiB) and can be downloaded from here or here.

For every site there's also available an unified+splitted version of every article in a single .txt file. (totalling 2.4 GiB). Slipped means that every line contains a single sentence, and unified means every articles is separated by a blank line. The splitting was made using pln-fing-udelar/ sentence-splitter

542M dic 28 20:47 ep22-unified-splitted.txt
876M dic 27 23:04 eo22-unified-splitted.txt
854M dic 27 18:58 mp22-unified-splitted.txt

The concatenations of these files were used to train a RoBERTa-like LM using the HuggingFace library, and can be found here huggingface.co/datasets/pln-udelar/uy22 or here archive.org.

Cite this work

@inproceedings{rouberta2024,
  title={A Language Model Trained on Uruguayan Spanish News Text},
  author={Filevich, Juan Pablo and Marco, Gonzalo and Castro, Santiago and Chiruzzo, Luis and Ros{\'a}, Aiala},
  booktitle={Proceedings of the Second International Workshop Towards Digital Language Equality (TDLE): Focusing on Sustainability@ LREC-COLING 2024},
  pages={53--60},
  year={2024}
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

🇺🇾 UY22 corpus

Cite this work

Files

README.md

Latest commit

History

README.md

File metadata and controls

🇺🇾 UY22 corpus

Cite this work