AMALGUM is a machine-annotated multilayer corpus following the same design and annotation layers as GUM, but substantially larger (around 4M tokens). The goal of this corpus is to close the gap between high-quality, richly annotated, but small datasets and the larger but shallowly annotated corpora that are often scraped from the web. Read more here: https://corpling.uis.georgetown.edu/gum/amalgum.html
The latest data without Reddit texts is available under `amalgum/`, and some additional data beyond the target size of 4M tokens is available under `amalgum_extra/`. (The `amalgum/` directory contains around 500,000 tokens for each genre, while the extra directory contains some more data beyond the genre-balanced corpus.)
You may download the older version 0.1 of the corpus without Reddit texts as a zip. The complete corpus, with Reddit data, is available upon request: please email lg876@georgetown.edu.
AMALGUM (A Machine-Annotated Lookalike of GUM) is an English web corpus spanning 8 genres with 4,000,000 tokens and several annotation layers.
Source data was scraped from eight different sources containing stylistically distinct text. Each text's source is indicated with a slug in its filename:

- `academic`: MDPI
- `bio`: Wikipedia
- `fiction`: Project Gutenberg
- `interview`: Wikinews, Interview category
- `news`: Wikinews
- `reddit`: Reddit
- `whow`: wikiHow
- `voyage`: wikiVoyage
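As an illustration of the naming convention, here is a small lookup sketch. The `SOURCES` table simply mirrors the list above, and the filename parsing assumes the `AMALGUM_<genre>_<document>` pattern of samples like `AMALGUM_news_khadr`:

```python
# Map an AMALGUM filename to its source website.
# SOURCES mirrors the genre list above; the filename convention
# AMALGUM_<genre>_<document> follows samples like AMALGUM_news_khadr.
SOURCES = {
    "academic": "MDPI",
    "bio": "Wikipedia",
    "fiction": "Project Gutenberg",
    "interview": "Wikinews (Interview category)",
    "news": "Wikinews",
    "reddit": "Reddit",
    "whow": "wikiHow",
    "voyage": "wikiVoyage",
}

def source_of(filename: str) -> str:
    """Return the source website for a document like 'AMALGUM_news_khadr'."""
    genre = filename.split("_")[1]
    return SOURCES[genre]

print(source_of("AMALGUM_news_khadr"))  # -> Wikinews
```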
AMALGUM contains annotations for the following information:
- Tokenization
- UD and extended PTB part-of-speech tags
- Lemmas
- UD dependency parses
- (Non-)named nested entities
- Coreference resolution
- Rhetorical Structure Theory discourse parses (constituent and dependency versions)
- Date/Time annotations in TEI format
These annotations are distributed across four file formats: GUM-style XML, CoNLL-U, WebAnno TSV, and RS3. You can see samples of the data for `AMALGUM_news_khadr` in each format: xml, conllu, tsv, rs3.
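For instance, the CoNLL-U files can be read with the third-party `conllu` package (`pip install conllu`); the path below is a placeholder for wherever the sample lives:

```python
# Load a dependency-parsed AMALGUM document with the `conllu` package.
# The file path is a placeholder.
from conllu import parse

with open("AMALGUM_news_khadr.conllu", encoding="utf-8") as f:
    sentences = parse(f.read())

for token in sentences[0]:
    # Each token carries the UD/PTB tags and dependency info listed above.
    print(token["id"], token["form"], token["lemma"],
          token["upos"], token["xpos"], token["head"], token["deprel"])
```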
Current scores on the GUM corpus test set, per task:

| task | metric | performance |
|---|---|---|
| tokenizer | F1 | 99.92 |
| sentencer | Acc / F1 | 99.85 / 94.35 |
| xpos | Acc | 98.16 |
| dependencies | LAS / UAS* | 92.16 / 94.25 |
| NNER | Micro F1 | 70.8 |
| coreference | CoNLL F1 | 51.4 |
| RST | S / N / R | 77.98 / 61.79 / 44.07 |
* Parsing scores ignore punctuation attachment; punctuation is attached automatically via udapi.
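For reference, here is a sketch of reattaching punctuation with udapi's `ud.FixPunct` block, assuming the udapi Python API (the equivalent CLI is `udapy -s ud.FixPunct`); the file names are placeholders:

```python
# Reattach punctuation in a CoNLL-U file with udapi's ud.FixPunct block.
# Equivalent CLI: udapy -s ud.FixPunct < parsed.conllu > fixed.conllu
# File names here are placeholders.
from udapi.core.document import Document
from udapi.block.ud.fixpunct import FixPunct

doc = Document()
doc.load_conllu("parsed.conllu")
FixPunct().apply_on_document(doc)
doc.store_conllu("fixed.conllu")
```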
Please see our paper:

```bibtex
@inproceedings{gessler-etal-2020-amalgum,
    title = "{AMALGUM} {--} A Free, Balanced, Multilayer {E}nglish Web Corpus",
    author = "Gessler, Luke  and
      Peng, Siyao  and
      Liu, Yang  and
      Zhu, Yilun  and
      Behzad, Shabnam  and
      Zeldes, Amir",
    booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.648",
    pages = "5267--5275",
    abstract = "We present a freely available, genre-balanced English web corpus totaling 4M tokens and featuring a large number of high-quality automatic annotation layers, including dependency trees, non-named entity annotations, coreference resolution, and discourse trees in Rhetorical Structure Theory. By tapping open online data sources the corpus is meant to offer a more sizable alternative to smaller manually created annotated data sets, while avoiding pitfalls such as imbalanced or unknown composition, licensing problems, and low-quality natural language processing. We harness knowledge from multiple annotation layers in order to achieve a {``}better than NLP{''} benchmark and evaluate the accuracy of the resulting resource.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}
```
All annotations under the folders `amalgum/` and `amalgum_extra/` are available under a Creative Commons Attribution (CC BY) license, version 4.0. Note that their texts are sourced from the following websites under their own licenses:

- `academic`: MDPI, CC BY 4.0
- `bio`: Wikipedia, CC BY-SA 3.0
- `fiction`: Project Gutenberg, The Project Gutenberg License
- `interview`: Wikinews, CC BY 2.5
- `news`: Wikinews, CC BY 2.5
- `whow`: wikiHow, CC BY-NC-SA 3.0
- `voyage`: wikiVoyage, CC BY-SA 3.0
See DEVELOPMENT.md.