Alignments

Alignments are a crucial ingredient in training the AM parser. While we can learn most of the structure of the latent AM trees, the parser currently still relies on alignments at training time. These alignments must be provided in the form of associations between nodes in the semantic graph and words in the sentence.

Alignment objects

Internally, alignments are handled with the Alignment class in am-tools. Essentially, each alignment groups nodes together into a graph constant, and aligns that constant to the sentence. An Alignment has three parts:

  1. a set of nodes, identified by their name,
  2. a span in the sentence that the nodes are aligned to, and
  3. a subset of the nodes called lexical nodes (lexnodes), which are the nodes whose labels directly correspond to words in the span. This relates to the copy function of the AM parser (i.e., copying words or their lemmas directly into the graph as node labels).

The nodes here are always specified via their node name, which corresponds to the notion of node names in the context of the SGraph class, as well as the node names in AMR string notation, such as d and j in (j / jump :ARG0 (d / dog)). The Alignment class also has two optional properties, color and weight; these are used only for visualization and by our AMR aligner (see below), not when training the parser.

There are several constructors for the Alignment class; in practice we recommend the following two:

Alignment(Set<String> nodes, int index, String lexicalNode)
Alignment(String nn, int index)

The first aligns a set of nodes to a single word (given by its 0-based index) and specifies a single lexical node. The second is convenient when the alignment contains only a single node with name nn: it uses the singleton set containing just that node as the node set and automatically makes that node the lexical node. These two constructors are particularly useful because of assumptions 3 and 4 below.
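
For illustration, here is a minimal sketch of how these two constructors might be used for the graph (j / jump :ARG0 (d / dog)) and the sentence "the dog jumps". The import path is the one we believe am-tools uses; verify it against your checkout.

```java
import java.util.Set;
// We believe the Alignment class in am-tools lives in this package;
// check your checkout if the import does not resolve:
import de.saar.coli.amrtagging.Alignment;

public class AlignmentExample {
    public static void main(String[] args) {
        // Sentence: "the dog jumps"; graph: (j / jump :ARG0 (d / dog)).

        // Single-node constructor: node d is aligned to "dog" (index 1)
        // and automatically becomes the lexical node.
        Alignment dogAlignment = new Alignment("d", 1);

        // Set-based constructor: align a node set to "jumps" (index 2),
        // explicitly naming j as the lexical node. With a singleton set,
        // this is equivalent to new Alignment("j", 2).
        Alignment jumpAlignment = new Alignment(Set.of("j"), 2, "j");

        System.out.println(dogAlignment);
        System.out.println(jumpAlignment);
    }
}
```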

You can find more info and examples in the Alignment paragraph here.

String format

Say we want to align nodes with names n1, n2 and n3 to the span starting at index 4 (inclusive, 0-based) and ending at 7 (exclusive, 0-based), with n1 being the lexical node. Our string representation of this alignment is n1!|n2|n3||4-7. That is, nodes and span are separated with a ||, the nodes are separated from each other with a single |, and lexical nodes are marked with an exclamation point.

Optionally, a weight can be given at the end, separated again with ||, for example: n1!|n2|n3||4-7||1.0. Again, weights are ignored when training the parser.
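
The format is straightforward to decode by hand. The following self-contained sketch parses the weighted example above; it is purely illustrative and independent of am-tools, which has its own reading code.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal parser for the alignment string format described above.
public class AlignmentStringFormat {
    public static void main(String[] args) {
        String s = "n1!|n2|n3||4-7||1.0";

        String[] parts = s.split("\\|\\|");   // nodes || span [|| weight]
        String[] nodeTokens = parts[0].split("\\|");
        String[] span = parts[1].split("-");
        double weight = parts.length > 2 ? Double.parseDouble(parts[2]) : 1.0;

        List<String> nodes = new ArrayList<>();
        List<String> lexNodes = new ArrayList<>();
        for (String tok : nodeTokens) {
            if (tok.endsWith("!")) {          // '!' marks a lexical node
                String name = tok.substring(0, tok.length() - 1);
                nodes.add(name);
                lexNodes.add(name);
            } else {
                nodes.add(tok);
            }
        }

        int start = Integer.parseInt(span[0]);   // inclusive
        int end = Integer.parseInt(span[1]);     // exclusive

        System.out.printf("nodes=%s lexical=%s span=[%d,%d) weight=%.1f%n",
                nodes, lexNodes, start, end, weight);
    }
}
```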

Alignment requirements for the parser

The AM parser makes some assumptions about the set of alignments between the graph and the sentence. If any of these are violated, the sentence-graph pair will be skipped during training. The assumptions are listed below (a validation sketch follows the list):

  1. Every node in the graph must be part of exactly one alignment. That is, the alignments partition the nodes of the graph.
  2. Every word in the sentence can be part of at most one alignment. (Not every word has to be part of an alignment.)
  3. Every span must contain exactly one word. This is because the supertagger and dependency parser will make predictions token by token.
  4. Every alignment can have at most one lexical node. This is a requirement of our copy mechanism.
  5. In each alignment, the nodes must form a connected subgraph.
  6. In each alignment, there can be at most one node that has incident edges that are not part of its blob (see e.g. Section 4.1 here for a definition of blobs). This is because all such edges would be "outside" the graph constant defined by the alignment and would need to attach at the root of the constant; and the constant can have only one root.
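
To make the requirements concrete, here is a sketch that checks assumptions 1 through 4 for one sentence-graph pair. The SimpleAlignment record and its accessors are hypothetical stand-ins, not the am-tools API; the connectivity and blob checks (assumptions 5 and 6) are omitted because they require the graph structure.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AlignmentChecker {

    // Hypothetical stand-in for an alignment; adapt to the am-tools API.
    record SimpleAlignment(Set<String> nodes, int spanStart, int spanEnd,
                           Set<String> lexNodes) {}

    static boolean satisfiesAssumptions(Set<String> graphNodes,
                                        int sentenceLength,
                                        List<SimpleAlignment> alignments) {
        Set<String> seenNodes = new HashSet<>();
        Set<Integer> seenWords = new HashSet<>();
        for (SimpleAlignment al : alignments) {
            // Assumption 3: each span covers exactly one word.
            if (al.spanEnd() - al.spanStart() != 1) return false;
            // Assumption 4: at most one lexical node per alignment.
            if (al.lexNodes().size() > 1) return false;
            // Assumption 1 (no overlap): each node in at most one alignment.
            for (String n : al.nodes()) {
                if (!seenNodes.add(n)) return false;
            }
            // Assumption 2: each word in at most one alignment.
            for (int i = al.spanStart(); i < al.spanEnd(); i++) {
                if (i < 0 || i >= sentenceLength || !seenWords.add(i)) {
                    return false;
                }
            }
        }
        // Assumption 1 (coverage): every graph node is aligned.
        return seenNodes.equals(graphNodes);
    }
}
```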

If the alignments in your graphbank do not satisfy these conditions, you may still be able to meet them with some pre- and postprocessing steps, such as contracting a span of tokens into one special token (we do this for AMR; see the sketch below). We also found that the parser performs well even when some graphs in the training corpus are skipped for violating these requirements. For example, some (<5%) of the graphs in the AMR corpora violate assumptions 4, 5 or 6.
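
As an illustration of the token-contraction idea (the tokens, indices, and replacement string are made up for this example):

```java
import java.util.ArrayList;
import java.util.List;

public class TokenContraction {
    // Replace the token span [start, end) with a single special token,
    // e.g. so that assumption 3 (one word per span) can be satisfied.
    static List<String> contract(List<String> tokens, int start, int end,
                                 String replacement) {
        List<String> out = new ArrayList<>(tokens.subList(0, start));
        out.add(replacement);
        out.addAll(tokens.subList(end, tokens.size()));
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("Barack", "Obama", "visited", "Rome");
        // Contract the name span [0, 2) into one token.
        System.out.println(contract(tokens, 0, 2, "Barack_Obama"));
        // -> [Barack_Obama, visited, Rome]
    }
}
```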

AMR aligner

We used our own rule-based aligner in our experiments for AMR. The code is here; it is used in the AMR preprocessing routine.
