andreasvc/openboek

The OpenBoek corpus

OpenBoek is a corpus of public domain Dutch literature with several layers of linguistic annotations.

OpenBoek is licensed under a Creative Commons Attribution 4.0 International License.

Annotations:

  • coref: exported files in CoNLL 2012 format. The coreference column is manually corrected; the POS, parse bit, and NER columns are extracted from automatically derived parse trees. Annotations follow the dutchcoref guidelines.

  • parses: hand-corrected parse trees in the Alpino XML format, one XML file per sentence. See the releases in this repository for the automatically parsed texts of the full corpus, with and without spelling normalization.

  • features: tab-separated files with entity features: gender and number. Each entity is identified by the indices (sentence number, begin/end token) of its first mention. Gender has values:

    • f (female)
    • m (male)
    • fm (unknown or mixed gender)
    • n (neuter, non-human)

    Any gender except n implies a human entity.

    Number:

    • sg (singular)
    • pl (plural; an entity consisting of multiple individuals/objects)

    The semantic number is annotated (e.g., "the group" is plural since it could be referred to by "they"), regardless of the syntactic number. In addition to features, there is also a column indicating the entity type/category. See the annotation guidelines for the possible values: https://github.com/nitgi/Thesis/blob/main/annotationguidelines.pdf

  • original: the original texts of the novels, except for some manual spelling changes applied to Multatuli (y -> ij, koffi -> koffie) and Nescio (e.g., datti -> dat -ie). To review the changes that have been made, download the original text and compare it with a character-based diff as follows: git diff --word-diff=color --word-diff-regex=. pg11024.txt original/Multatuli_MaxHavelaar.txt

  • tokenized: one sentence per line of space-separated words, prefixed with a sentence identifier of the form parno-sentno. Used as input for Alpino and the spelling normalization tool.

  • spelling: spelling-normalized versions of the tokenized texts. The input is in a format understood by the Alpino parser. The automatically normalized version (silver standard spelling) is given for all texts, based on the output of oudeboeken (https://github.com/gertjanvannoord/oudeboeken). For a subset, manually corrected versions (gold standard spelling) are provided as well.

  • pos: manually corrected POS tags (CGN coarse tags).

  • quotes: the .xml files contain manually annotated speakers of direct speech spans, annotated using quoteannotator (https://github.com/muzny/quoteannotator/). For annotation guidelines and more information, see the dutchqa repository (https://github.com/frenkvdberg/dutchqa). The .tsv files contain both speaker and addressee annotations.

  • events: .tsv files containing annotations of events (verbs) as well as the agents and patients of those events; see the annotation guidelines (https://github.com/bbjoverbeek/event-prediction/blob/main/label_studio/annotation_guidelines.pdf). The .tsv files have four columns: (1) token id (zero-indexed), (2) token, (3) label, (4) parent id. The label column has one of the following values:

    • REALIS (a verb describing an event that happens or happened in the story)
    • IRREALIS (a hypothetical or future event)
    • B-AGENT, I-AGENT
    • B-PATIENT, I-PATIENT
    • O (this token did not receive a label during annotation)
    • "-" (when all tokens in a sentence have a dash as label, the sentence was not annotated)

    An agent/patient may play a role in multiple events. The parent column is a comma-separated list of token IDs indicating the events to which an agent/patient is linked.
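
The four-column layout of the events files can be read with a few lines of Python. What follows is a minimal sketch and not part of the OpenBoek repository: the function name parse_events and the sample rows are hypothetical. It assumes that the files have no header row, that token ids are unique within the text passed in, and that an I- token continues the span opened by the directly preceding B- token of the same role.

```python
def parse_events(tsv_text):
    """Group agent/patient spans under the events they are linked to.

    Returns {event token id: {"token", "label", "agents", "patients"}}.
    """
    rows = []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines carry no annotation
        # Pad to four fields so rows with an empty parent column still unpack.
        token_id, token, label, parents = (line.split("\t") + [""] * 4)[:4]
        rows.append((int(token_id), token, label, parents))

    # Every REALIS/IRREALIS token is an event that spans can attach to.
    events = {
        tid: {"token": tok, "label": lab, "agents": [], "patients": []}
        for tid, tok, lab, _ in rows
        if lab in ("REALIS", "IRREALIS")
    }

    spans = []        # each span: (role, list of tokens, list of parent ids)
    open_span = None  # span currently being extended by I- tokens
    for _, tok, lab, parents in rows:
        if lab.startswith("B-"):
            ids = [int(p) for p in parents.split(",") if p.strip().isdigit()]
            open_span = (lab[2:], [tok], ids)
            spans.append(open_span)
        elif lab.startswith("I-") and open_span and open_span[0] == lab[2:]:
            open_span[1].append(tok)
        else:
            open_span = None  # O, "-", or an event label ends the span

    # A span may list several parents, so it can be attached to several events.
    for role, words, ids in spans:
        key = "agents" if role == "AGENT" else "patients"
        for event_id in ids:
            if event_id in events:
                events[event_id][key].append(" ".join(words))
    return events


# Made-up rows for illustration only; not taken from the corpus.
SAMPLE = "\n".join([
    "0\tDe\tB-AGENT\t2,6",
    "1\tman\tI-AGENT\t2,6",
    "2\tzag\tREALIS\t",
    "3\teen\tB-PATIENT\t2",
    "4\thond\tI-PATIENT\t2",
    "5\ten\tO\t",
    "6\triep\tREALIS\t",
    "7\t.\tO\t",
])
events = parse_events(SAMPLE)
```

In this made-up sentence the agent "De man" lists two parents (2 and 6), so it is attached to both the event "zag" and the event "riep", while the patient "een hond" is attached to "zag" only. Lines are split on tabs directly rather than with a CSV reader, so that tokens consisting of quotation marks are not interpreted as field quoting.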

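The coreference column of the coref files listed under Annotations can be turned into clusters of mention spans. The sketch below is an illustration under stated assumptions, not code from the OpenBoek repository: the function name read_clusters is hypothetical, the sample lines are made up and abbreviated, and it relies only on the general CoNLL 2012 conventions that the coreference annotation is the last whitespace-separated column, that sentences are separated by blank lines, and that mentions are bracketed as "(3", "3)" or "(3)" with "|" between multiple brackets on one token. Sentence and token indices are zero-based here by choice; they are not guaranteed to match the indexing used in the features files.

```python
import re
from collections import defaultdict

BRACKET = re.compile(r"(\()?(\d+)(\))?")


def read_clusters(conll_text):
    """Return {cluster id: [(sentence, first token, last token), ...]}."""
    clusters = defaultdict(list)
    open_mentions = defaultdict(list)  # cluster id -> stack of start indices
    sentno = tokno = 0
    for line in conll_text.splitlines():
        if line.startswith("#"):
            continue  # "#begin document" / "#end document" markers
        if not line.strip():
            if tokno:  # a blank line closes the current sentence
                sentno += 1
                tokno = 0
            continue
        coref = line.split()[-1]  # coreference is the last column
        for part in coref.split("|"):
            match = BRACKET.fullmatch(part)
            if not match:
                continue  # "-" means no mention starts or ends here
            opens, cluster, closes = match.groups()
            cluster = int(cluster)
            if opens:
                open_mentions[cluster].append(tokno)
            if closes:
                begin = open_mentions[cluster].pop()
                clusters[cluster].append((sentno, begin, tokno))
        tokno += 1
    return dict(clusters)


# Made-up, abbreviated lines for illustration; real files have more columns.
SAMPLE = """#begin document (demo); part 000
demo 0 0 Jan - (0)
demo 0 1 zag - -
demo 0 2 de - (1
demo 0 3 hond - 1)
demo 0 4 . - -

demo 0 0 Hij - (0)
demo 0 1 aaide - -
demo 0 2 hem - (1)
demo 0 3 . - -

#end document
"""
clusters = read_clusters(SAMPLE)
```

For the made-up sample, cluster 0 contains the single-token mentions "Jan" and "Hij", and cluster 1 contains the two-token mention "de hond" and the single-token mention "hem". A stack per cluster is used so that nested mentions of the same entity are closed in the right order.
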
Corpus

All texts in the corpus are public domain texts from Project Gutenberg. The corpus includes classic Dutch texts as well as novels translated into Dutch. Each fragment consists of at least the first 10,000 words of the text; the annotated dataset contains about 103,000 tokens in total.

Gutenberg ID  Date  Author              Title
PG13214       1877  Leo Tolstoy         Anna Karenina
PG37316       1862  Victor Hugo         De ellendigen
PG11318       1872  Jules Verne         De reis om de wereld in tachtig dagen
PG29719       1911  Nescio              De uitvreter
PG29719       1918  Nescio              Dichtertje
PG29719       1915  Nescio              Titaantjes
PG19563       1889  Louis Couperus      Eline Vere
PG11024       1860  Multatuli           Max Havelaar
PG30933       1890  Arthur Conan Doyle  Sherlock Holmes en de Agra-schat

Reference

If you use this dataset for research, please cite the following paper:

@article{vancranenburgh2022openboek,
    author={van Cranenburgh, Andreas  and  van Noord, Gertjan},
    year={2022},
    title={OpenBoek: A Corpus of Literary Coreference and Entities with an Exploration of Historical Spelling Normalization},
    journal={Computational Linguistics in the Netherlands Journal},
    volume={12},
    month={Dec.},
    pages={235–251},
    url={https://clinjournal.org/clinj/article/view/157},
}
