AM CoNLL file format
The AM parser internally represents derivations of graphs as AM dependency trees. These can be written in a file in the AM-CoNLL file format, which is a variant of the usual CoNLL(-U) format for syntactic dependency trees.
The columns have the following meanings:
Column | Meaning | Corresponding CoNLL-U column |
---|---|---|
1 | ID (position in sentence), 1-based | 1 ID |
2 | FORM (exact word form) | 2 FORM |
3 | REPLACEMENT (`_name_`, `_number_`, `_date_` etc.); `_` if no replacement needed | -- |
4 | LEMMA, graphbank-specific | 3 LEMMA |
5 | POS, graphbank-specific | 4 UPOS |
6 | NER: CoreNLP named entity tag (fine-grained), or O if none | -- |
7 | delexicalized supertag; an s-graph (graph with sources) in PENMAN notation, where the label of the node containing lexical information is replaced with --LEX-- | -- |
8 | lexical label (the lexical label removed from the supertag), can refer to other columns | -- |
9 | as-graph type; essentially containing source annotations for the s-graph supertag (not included in column 7!) | -- |
10 | HEAD (head ID, i.e. where in the sentence the incoming edge comes from; 0 if no head) | 7 HEAD |
11 | DEPREL (incoming edge label; IGNORE if no edge) | 8 DEPREL |
12 | aligned (true or false, typically true): whether the placement of the edge and graph constant at this position is to be read as an alignment. When false, the edge and graph constant could belong to any word. | -- |
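As a rough illustration of the column layout, the following sketch maps one fully filled-in line to a typed record. It assumes that columns are separated by tabs, as in standard CoNLL files; the page does not state the separator. The names `AmConllEntry` and `parse_entry` are hypothetical and not part of the AM parser, and the values in the example line are made up for illustration.

```python
from dataclasses import dataclass

COLUMN_COUNT = 12  # number of columns listed in the table above


@dataclass
class AmConllEntry:
    """One token line; field order follows the column table (hypothetical class)."""
    id: int
    form: str
    replacement: str
    lemma: str
    pos: str
    ner: str
    supertag: str
    lexical_label: str
    graph_type: str
    head: int
    deprel: str
    aligned: bool


def parse_entry(line: str) -> AmConllEntry:
    """Parse one fully filled-in line. Assumption: columns are tab-separated."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != COLUMN_COUNT:
        raise ValueError(f"expected {COLUMN_COUNT} columns, got {len(fields)}")
    return AmConllEntry(
        id=int(fields[0]),
        form=fields[1],
        replacement=fields[2],
        lemma=fields[3],
        pos=fields[4],
        ner=fields[5],
        supertag=fields[6],
        lexical_label=fields[7],
        graph_type=fields[8],
        head=int(fields[9]),
        deprel=fields[10],
        aligned=fields[11].lower() == "true",
    )


# Made-up example line, not real parser output.
example = "\t".join(
    ["1", "saw", "_", "see", "VBD", "O",
     "(v / --LEX--)", "$LEMMA$-01", "()", "0", "IGNORE", "true"]
)
entry = parse_entry(example)
print(entry.lemma, entry.head, entry.aligned)
```

Lines from unparsed files would need more lenient handling of columns 7-12, since this sketch converts HEAD to an integer unconditionally.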
In the lexical label column, we can also refer to other columns. For example, if the lemma was "see" and the lexical label was "$LEMMA$-02", we could reconstruct the label "see-02" in post-processing. The list of placeholders that can be used to refer to other columns is $FORM$, $REPL$, $POS$, $LEMMA$.
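The placeholder mechanism can be sketched as a plain string substitution. This is a minimal illustration, not the AM parser's actual post-processing code; the function name `resolve_lexical_label` and its parameters are hypothetical, and any additional normalization the parser may apply is not modeled here.

```python
import re

# Matches exactly the four placeholders listed above.
PLACEHOLDER_PATTERN = re.compile(r"\$(FORM|REPL|POS|LEMMA)\$")


def resolve_lexical_label(label: str, form: str, replacement: str,
                          pos: str, lemma: str) -> str:
    """Substitute column values for placeholders in a lexical label (sketch)."""
    values = {"FORM": form, "REPL": replacement, "POS": pos, "LEMMA": lemma}
    # A single regex pass avoids re-expanding text inserted by an earlier substitution.
    return PLACEHOLDER_PATTERN.sub(lambda m: values[m.group(1)], label)


print(resolve_lexical_label("$LEMMA$-02", form="saw", replacement="_",
                            pos="VBD", lemma="see"))  # see-02
```

Labels without placeholders pass through unchanged.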
Historically, REPLACEMENT comes from the AMR pre-processing. For EDS, we use it to capture additional lemma information, e.g. "four" is represented as "4"; we do this to make the prediction of the lexical label easier. In this example, if the proper lexical label was "4", we could write "$REPL$" in the column for the lexical label and thereby refer to it. That is, REPLACEMENT should be treated as a graphbank-specific column that holds relevant information for post-processing.
Every line that starts with # is a comment. For some purposes, sentences can have additional attributes (such as the raw, untokenized sentence). In that case, we add a line with the format #[key]:[value] directly before the sentence. For instance:
#raw:New York is great!
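A reader for such lines might look like the sketch below. It treats a `#` line that contains a colon as an attribute and any other `#` line as a plain comment; that distinction is an assumption of this example, since the page only gives the `#[key]:[value]` format. The function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def parse_attribute(line: str) -> Optional[Tuple[str, str]]:
    """Return (key, value) for a '#key:value' line, or None otherwise."""
    if not line.startswith("#"):
        return None
    # Split on the first colon only, so values may themselves contain colons.
    key, sep, value = line[1:].rstrip("\n").partition(":")
    if not sep:
        return None  # assumption: a '#' line without a colon is a plain comment
    return key, value


def split_sentence_block(lines: List[str]) -> Tuple[Dict[str, str], List[str]]:
    """Separate the attribute lines of one sentence from its token lines."""
    attributes: Dict[str, str] = {}
    token_lines: List[str] = []
    for line in lines:
        if line.startswith("#"):
            parsed = parse_attribute(line)
            if parsed is not None:
                attributes[parsed[0]] = parsed[1]
        elif line.strip():
            token_lines.append(line)
    return attributes, token_lines


attrs, tokens = split_sentence_block(["#raw:New York is great!", "1\tNew", "2\tYork"])
print(attrs)
```

Note that under this heuristic a free-form comment that happens to contain a colon would be read as an attribute.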
Unparsed AM-CoNLL files (which serve as input to the AM parser) only have meaningful values in columns 1-6, and blank values in the other columns. The AM parser will then output another AM-CoNLL file with the other columns filled in.