AM CoNLL file format

Matthias Lindemann edited this page Jul 10, 2020 · 5 revisions

The AM parser internally represents derivations of graphs as AM dependency trees. These can be written in a file in the AM-CoNLL file format, which is a variant of the usual CoNLL(-U) format for syntactic dependency trees.

The columns have the following meanings:

| Column | Meaning | Corresponding CoNLL-U column |
|---|---|---|
| 1 | ID (position in sentence), 1-based | 1 ID |
| 2 | FORM (exact word form) | 2 FORM |
| 3 | REPLACEMENT (_name_, _number_, _date_ etc.); _ if no replacement needed | -- |
| 4 | LEMMA, graphbank-specific | 3 LEMMA |
| 5 | POS, graphbank-specific | 4 UPOS |
| 6 | NER: CoreNLP named entity tag (fine-grained), or O if none | -- |
| 7 | delexicalized supertag; an s-graph (graph with sources) in PENMAN notation, where the label of the node containing lexical information is replaced with --LEX-- | -- |
| 8 | lexical label (the lexical label removed from the supertag); can refer to other columns | -- |
| 9 | as-graph type; essentially the source annotations for the s-graph supertag (not included in column 7!) | -- |
| 10 | HEAD (head ID, i.e. where in the sentence the incoming edge comes from; 0 if no head) | 7 HEAD |
| 11 | DEPREL (incoming edge label; IGNORE if no edge) | 8 DEPREL |
| 12 | aligned (true or false, typically true): whether the placement of the edge and graph constant at this position is to be read as an alignment. When false, the edge and graph constant could belong to any word. | -- |
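As a rough illustration of the column layout, the following sketch reads one token line into a record with twelve named fields. It assumes that columns are tab-separated, as in CoNLL-U; the page does not state the separator. The class name `AmConllToken`, the function `parse_token_line`, and the example values are hypothetical and not part of the AM parser.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AmConllToken:
    """One token line; field order follows the column table."""
    id: int
    form: str
    replacement: str
    lemma: str
    pos: str
    ner: str
    delex_supertag: str
    lexical_label: str
    graph_type: str
    head: Optional[int]  # None when the column holds no number (e.g. unparsed input)
    deprel: str
    aligned: bool


def parse_token_line(line: str) -> AmConllToken:
    # Assumption: columns are separated by tabs, as in CoNLL-U.
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, got {len(fields)}")
    head = int(fields[9]) if fields[9].isdigit() else None
    return AmConllToken(
        id=int(fields[0]),
        form=fields[1],
        replacement=fields[2],
        lemma=fields[3],
        pos=fields[4],
        ner=fields[5],
        delex_supertag=fields[6],
        lexical_label=fields[7],
        graph_type=fields[8],
        head=head,
        deprel=fields[10],
        aligned=fields[11].lower() == "true",
    )


# Made-up example line, for illustration only.
example = "\t".join([
    "1", "sees", "_", "see", "VBZ", "O",
    "(s<root> / --LEX--)", "$LEMMA$-02", "()", "0", "ROOT", "true",
])
token = parse_token_line(example)
```

Keeping `head` optional lets the same reader handle unparsed input, where column 10 may not contain a number.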

In the lexical label column, we can also refer to other columns. For example, if the lemma was "see" and the lexical label was "$LEMMA$-02", we could reconstruct the label "see-02" in post-processing. The list of placeholders that can be used to refer to other columns is $FORM$, $REPL$, $POS$, $LEMMA$.
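The placeholder mechanism can be sketched as a simple string substitution. This is only an illustration of the idea described above, not the AM parser's actual post-processing, which may apply further graphbank-specific steps; the function name `resolve_lexical_label` is hypothetical.

```python
import re

# The four placeholders listed above, matched in a single pass so that a
# substituted value is never itself scanned for placeholders.
_PLACEHOLDER = re.compile(r"\$(FORM|REPL|POS|LEMMA)\$")


def resolve_lexical_label(label: str, form: str, replacement: str,
                          pos: str, lemma: str) -> str:
    """Replace $FORM$, $REPL$, $POS$ and $LEMMA$ with the column values."""
    values = {"FORM": form, "REPL": replacement, "POS": pos, "LEMMA": lemma}
    return _PLACEHOLDER.sub(lambda match: values[match.group(1)], label)


# The example from the text: lemma "see", lexical label "$LEMMA$-02".
label = resolve_lexical_label("$LEMMA$-02", form="sees", replacement="_",
                              pos="VBZ", lemma="see")
```

With these inputs, `label` is `"see-02"`.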

Historically, REPLACEMENT comes from the AMR pre-processing. For EDS, we use it to capture additional lemma information, e.g. "four" is represented as "4"; we do this to make the prediction of the lexical label easier. In this example, if the proper lexical label was "4", we could write "$REPL$" in the column for the lexical label and thereby refer to it. That is, REPLACEMENT should be treated as a graphbank-specific column that holds relevant information for post-processing.

Every line that starts with # is a comment. For some purposes, sentences can have additional attributes (such as the raw, untokenized sentence); in that case we add a line of the form #[key]:[value] directly before the sentence. For instance:

#raw:New York is great!
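A minimal sketch of how such attribute lines might be read, assuming the key ends at the first colon and everything after it is the value. The function name `parse_attribute` is hypothetical. Note that a free-form comment that happens to contain a colon would also be read as an attribute under this assumption.

```python
from typing import Optional, Tuple


def parse_attribute(line: str) -> Optional[Tuple[str, str]]:
    """Return (key, value) for a '#key:value' line, or None otherwise."""
    line = line.rstrip("\n")
    if not line.startswith("#"):
        return None  # not a comment line at all
    # Assumption: the key runs up to the first colon; the value may
    # itself contain further colons.
    key, separator, value = line[1:].partition(":")
    if not separator or not key:
        return None  # plain comment without a key
    return key, value


attribute = parse_attribute("#raw:New York is great!")
```

Here `attribute` is `("raw", "New York is great!")`.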

Unparsed AM-CoNLL files (which serve as input to the AM parser) only have meaningful values in columns 1-6, and blank values in the other columns. The AM parser will then output another AM-CoNLL file with the other columns filled in.
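To illustrate what an unparsed sentence might look like, the sketch below fills columns 1-6 and writes a placeholder into columns 7-12. Both the tab separator and the use of `_` as the blank value are assumptions; the page does not say which blank values the AM parser expects, so check an existing input file before relying on this. The function name `unparsed_sentence` is hypothetical.

```python
from typing import List, Optional, Tuple

# (form, replacement, lemma, pos, ner) for each token, i.e. columns 2-6.
TokenInfo = Tuple[str, str, str, str, str]


def unparsed_sentence(tokens: List[TokenInfo], raw: Optional[str] = None,
                      blank: str = "_") -> List[str]:
    """Build the lines of one unparsed sentence (columns 1-6 filled)."""
    lines = []
    if raw is not None:
        lines.append(f"#raw:{raw}")  # attribute line directly before the sentence
    for position, info in enumerate(tokens, start=1):  # IDs are 1-based
        filled = [str(position), *info]
        # Assumption: columns 7-12 hold a placeholder until the parser fills them.
        lines.append("\t".join(filled + [blank] * 6))
    return lines


sentence_lines = unparsed_sentence(
    [("New", "_", "New", "NNP", "O"), ("York", "_", "York", "NNP", "O")],
    raw="New York",
)
```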
