Whow_UCCA is a corpus of English WikiHow instructional guides semantically annotated with Universal Conceptual Cognitive Annotation (UCCA). It comprises 11 documents on various topics, which were previously annotated with an array of linguistic information (POS, syntax, discourse structure, and more) and document-structure information as part of the GUM project (GitHub).
The UCCA annotations were carried out by the students and instructor of the Advanced Semantic Representation course at Georgetown University in the Fall semester 2018.
- ucca-guidelines.pdf: The version of the UCCA annotation guidelines used for the compilation of this corpus.
- unreviewed/xml: Annotated passages before review/adjudication.
- raw/txt: Passages in raw text format (tokenized).
The corpus contains 11 documents with token counts ranging from 656 to 1160. For comparability and to facilitate annotation, we split each document into 2-4 passages ranging between 104 and 355 tokens each. At least 2 passages / 607 tokens of each document have been annotated with UCCA by at least one annotator.
Filenames follow the pattern whow_<DOCUMENT>_<PASSAGE>_<XXXX>.xml. So the file whow_ballet_2_orig.xml, for instance, contains the annotation for the 2nd passage of document "whow_ballet".
In order to compute inter-annotator agreement (IAA), one randomly selected passage per document has been annotated by two additional annotators. In the file naming schema described above, <XXXX> (one out of {orig, iaa1, iaa2}) indicates whether the annotation of this passage was done by the annotator originally assigned to it (primary annotator), or one of the secondary annotators.
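The naming schema can be parsed mechanically, for example to find the passages that have all three annotations available for IAA. The following is a minimal sketch, not part of the corpus tooling: the names parse_filename, group_by_passage and PassageFile are hypothetical, and the regular expression assumes that <PASSAGE> is always a plain integer.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional, Set, Tuple

# Assumed shape: whow_<DOCUMENT>_<PASSAGE>_<XXXX>.xml, with XXXX in {orig, iaa1, iaa2}.
FILENAME_RE = re.compile(
    r"^(?P<document>whow_.+)_(?P<passage>\d+)_(?P<annotator>orig|iaa1|iaa2)\.xml$"
)


class PassageFile(NamedTuple):
    document: str   # e.g. "whow_ballet"
    passage: int    # 1-based passage number within the document
    annotator: str  # "orig" (primary) or "iaa1"/"iaa2" (secondary)


def parse_filename(name: str) -> Optional[PassageFile]:
    """Split a corpus filename into its parts; return None if it does not match."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    return PassageFile(match["document"], int(match["passage"]), match["annotator"])


def group_by_passage(names: Iterable[str]) -> Dict[Tuple[str, int], Set[str]]:
    """Map (document, passage) to the set of annotator tags found for it."""
    groups: Dict[Tuple[str, int], Set[str]] = defaultdict(set)
    for name in names:
        parsed = parse_filename(name)
        if parsed is not None:
            groups[(parsed.document, parsed.passage)].add(parsed.annotator)
    return dict(groups)
```

A passage whose set of tags equals {"orig", "iaa1", "iaa2"} is one of the passages selected for the agreement study; files that do not follow the schema are simply skipped.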
The annotations were carried out through the web-based annotation tool UCCAApp (demo).
The UCCA Python API by Daniel Hershcovich and Amit Beka provides functionality to read, analyze, manipulate, and write the annotations as XML. To get it, you can clone the following GitHub repository:
git clone https://github.com/danielhers/ucca.git
or install the package via pip:
pip install ucca
- Universal Conceptual Cognitive Annotation (UCCA). Omri Abend and Ari Rappoport (2013). ACL 2013.
- UCCAApp: Web-application for Syntactic and Semantic Phrase-based Annotation. Omri Abend, Shai Yerushalmi and Ari Rappoport (2017). ACL 2017.
- The GUM Corpus: Creating Multilayer Resources in the Classroom. Amir Zeldes (2017). Language Resources and Evaluation 51(3), 581–612.