
Learning compositional structures


This page documents how to run the system presented in Groschwitz et al. 2021, Learning compositional structures for semantic graph parsing, and how to make it work for new graphbanks.

System requirements and installation

This project is in the branches unsupervised2020 of https://github.com/coli-saar/am-parser and new_decomposition of https://github.com/coli-saar/am-tools .

In addition to the requirements described in the main README, this project also requires PyJnius, which you can install using pip install pyjnius==1.2.1 (this is the version the code is tested with). Make sure you have the JAVA_HOME environment variable set correctly. Some versions of PyJnius have incompatibilities with some versions of Java; see e.g. here, where PyJnius looks for certain files in the wrong location. For me (@jgroschwitz), the following worked:

sudo mkdir -p /usr/lib/jvm/java-11-openjdk-amd64/jre/lib/amd64/
sudo ln -s /usr/lib/jvm/java-11-openjdk-amd64/lib/server /usr/lib/jvm/java-11-openjdk-amd64/jre/lib/amd64/server
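
To verify that PyJnius can find the JVM, a quick sanity check (assuming a Java 11 installation at the path above; adjust JAVA_HOME to your own installation) is:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
python -c "from jnius import autoclass; print(autoclass('java.lang.System').getProperty('java.version'))"

If this prints a Java version without errors, PyJnius is set up correctly.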

Extending the code to new graph formalisms

Code overview

As mentioned above, this code lives in the branches unsupervised2020 of am-parser and new_decomposition of am-tools. To make the parser work for a new graph formalism, you will need to extend both codebases. The code in am-tools is written in Java and acts as the interface between the AM dependency trees and your graphs. The code in am-parser (written in Python) defines the neural parser that predicts the AM dependency trees; it is largely graph-formalism agnostic, and you will only need to modify the JSON configuration files. Overall, you will need to implement the following:

  1. A dataset reader for your graphbank
  2. An edge attachment heuristic (called blob heuristic in the paper) -- this is usually quite straightforward
  3. A dataset writer for your graphbank (to put outputs in the original format again)
  4. A configuration file for your graph formalism
  5. A configuration file for your experiment

Steps 1 and 2 are used for creating the training/dev/test data; steps 3-5 are used during training of the neural parser and during evaluation. Each step is detailed below.

Obligatory implementation steps

1. Dataset reader: Class GraphbankDecompositionToolset

In am-tools, create a new class that extends de.saar.coli.amtools.decomposition.formalisms.toolsets.GraphbankDecompositionToolset, preferably in the same package (that will make things a bit more convenient below). It has two abstract functions that need to be implemented; we implement readCorpus here and getEdgeHeuristics in Step 2. The readCorpus function takes as input a filepath to a corpus and returns a list of MRInstance objects. So in this function you will need to read in your corpus and convert every graph-sentence pair into an MRInstance. How to read in the corpus will of course depend on your dataset, so this documentation only describes how to build an MRInstance object.

An MRInstance object (MR for meaning representation) represents an aligned sentence-graph pair. The constructor of an MRInstance takes three arguments: the sentence as a list of strings (this is just your tokenized sentence), an SGraph graph, and a list of Alignment objects.

  • SGraph: Create a new, empty SGraph with the default constructor

      SGraph g = new SGraph();  
    

    Each node in the graph has a node name and a node label; both of type String. The node name is a unique identifier of the node in the graph and has no actual meaning; it is a purely technical construct. The node label is the actual label in the graph and carries meaning. Think of node name and node label as the c and the cat, respectively, in the AMR (c / cat). You can add a node with the addNode function:

      GraphNode catNode = g.addNode("c", "cat");
    

    The node names must be unique within the graph; if you add a second node with the same name, it overwrites the first.

    You can add an edge with the addEdge function that takes two GraphNode objects (the node where the edge originates from, and the node that it points to) as well as a String label. The following example builds the AMR (s / sleep-01 :ARG0 (c / cat)) (The cat sleeps):

      SGraph g = new SGraph();
      GraphNode catNode = g.addNode("c", "cat");
      GraphNode sleepNode = g.addNode("s", "sleep-01");
      GraphEdge edge = g.addEdge(sleepNode, catNode, "ARG0");
    

    Every graph also needs a node marked as the root. The root should, linguistically speaking, be the head of the sentence. Typically this is the main predicate; this is closely related to the notion of a root in AMR. The code handles roots with the general notion of source names, using the specific source name "root". You add a root with the addSource function, which takes a source name and a node name as arguments and adds that source name to the given node. Concretely, in the above example sleep-01 should be the root; finish the example with the line

      g.addSource("root", "s");
    
  • Alignment: The Alignment class is a flexible class for aligning graph nodes to tokens in the sentence. In the context of this parser, an Alignment object specifies that a set of nodes (identified via their node names) and a token in the sentence (identified via its 0-based index in the sentence) are aligned. It also specifies a lexical node that is lexically related to the token. Take for example the sentence a baker and its corresponding AMR (p / person :ARG0-of (b / bake-01)). Here both nodes should be aligned to baker (which has index 1 in the sentence), and b is the lexical node, since its label bake-01 is lexically related to baker. So for this sentence we would have a single Alignment object in the list, which specifies the set of node names {p, b}, the index 1 and the lexical node name b. Edges are not part of alignments in this paradigm; they are taken care of below.

    To obtain valid training data for the parser, the following conditions must be met:

    • Each node in the graph must be part of exactly one Alignment.
    • The nodes in one Alignment must form a connected subgraph. (There are also complex but typically rare constraints on alignments with multiple nodes based on the edge structure of the graph. If you run into problems with too many invalid alignments please get in touch with us).
    • Every Alignment must have exactly one lexical node.
    • Every token can be part of at most one Alignment (but a token may be unaligned if there is no corresponding node in the graph).

    If one of these conditions is not met for an MRInstance, it will be skipped when creating the training data.

    In practice, we recommend using the following two constructors:

      Alignment(Set<String> nodes, int index, String lexicalNode)
      Alignment(String nn, int index)
    

    The second is convenient when the alignment contains only a single node; it uses a singleton set containing just that node as its node set and automatically uses this same node as the lexical node.

    For example, in the above sentence a baker, one would create the following alignment:

      List<Alignment> allAlignmentsInTheSentence = new ArrayList<>();
      Set<String> nodeNames = new HashSet<>();
      nodeNames.add("p");
      nodeNames.add("b");
      allAlignmentsInTheSentence.add(new Alignment(nodeNames, 1, "b"));
    

    In the sentence The cat sleeps from further above, one would create the following alignments:

      List<Alignment> allAlignmentsInTheSentence = new ArrayList<>();
      allAlignmentsInTheSentence.add(new Alignment("c", 1));
      allAlignmentsInTheSentence.add(new Alignment("s", 2));
    

Overall, your function readCorpus might look something like this:

@Override
public List<MRInstance> readCorpus(String filePath) throws IOException {
    YourCorpus corpus = someFunctionToReadYourCorpus(filePath);
    List<MRInstance> returnedCorpus = new ArrayList<>();
    for (YourInstance instance : corpus) {
        List<String> tokenizedSentence = someFunction(instance);
        SGraph graph = anotherFunction(instance);
        List<Alignment> allAlignmentsInTheSentence = yetAnotherFunction(instance);
        MRInstance mrInstance = new MRInstance(tokenizedSentence, graph, allAlignmentsInTheSentence);
        returnedCorpus.add(mrInstance);
    }
    return returnedCorpus;
}

In practice of course you may not have your own YourCorpus and YourInstance classes, but instead e.g. read the corpus file line by line and build the MRInstance objects along the way.
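
For instance, if your corpus stores one instance per line, a sketch of such a reader could look like this (parseTokens, parseGraph and parseAlignments are hypothetical helpers that you would implement for your specific format):

@Override
public List<MRInstance> readCorpus(String filePath) throws IOException {
    List<MRInstance> returnedCorpus = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) continue; // skip blank separator lines
            // the three parse* functions are hypothetical helpers for your format
            List<String> tokens = parseTokens(line);
            SGraph graph = parseGraph(line);
            List<Alignment> alignments = parseAlignments(line);
            returnedCorpus.add(new MRInstance(tokens, graph, alignments));
        }
    }
    return returnedCorpus;
}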

2. Edge heuristics: Classes EdgeAttachmentHeuristic and GraphbankDecompositionToolset

This part is about deciding which graph constant each edge should go into, i.e. which edges and nodes belong together. For example, in the AMR (s / sleep-01 :ARG0 (c / cat)), the ARG0 edge should be "attached" to the sleep-01 node, yielding one constant consisting of the sleep-01 node and the ARG0 edge, and one constant consisting of just the cat node. See also the discussion of blobs in Section 4.1 of the paper (the normalization of edge directions there, however, is only a presentational device for the paper and is not in the code).

To implement an edge heuristic in the code, create a new class that extends de.saar.coli.amtools.decomposition.formalisms.EdgeAttachmentHeuristic. You need to implement one function, isOutbound, which takes a GraphEdge object as input and returns true if you want the edge to attach to its origin node, and false if you want it to attach to its target node. Via the GraphEdge object you have access to the edge label as well as the edge's origin and target nodes, so you can take these into account when making the attachment decision.

In general, an edge between a head and an argument should attach to the head, and an edge between a head and a modifier should attach to the modifier. This creates graph constants (supertags) that generalize well. For example, in the AMR (s / sleep-01 :ARG0 (c / cat :mod (l / little))) for The little cat sleeps, the ARG0 edge between the head sleep-01 and the argument cat should attach to the head, which is the edge's origin; that is, isOutbound should return true here. The mod edge between the head cat and the modifier little should attach to the modifier, which is the edge's target; that is, isOutbound should return false. For more examples, see the Grouping paragraph in Section 4.1 of our 2019 paper.

While these heuristics take some linguistic expertise and familiarity with the graphbank, the resulting code is often only a few lines long. For example, complete heuristics for AMR could look like this:

public class AMREdgeHeuristics extends EdgeAttachmentHeuristic {
    public static final String[] OUTBOUND_EDGEPREFIXES = new String[]{"ARG", "op", "snt", "poss", "consist", "domain", "UNKOUT"};

    @Override
    public boolean isOutbound(GraphEdge edge) {
        for (String pref : OUTBOUND_EDGEPREFIXES) {
            if (edge.getLabel().matches(pref+"[0-9]*")) {
                return true;
            }
        }
        return false;
    }
}

Once you have an EdgeAttachmentHeuristic class set up, implement getEdgeHeuristics in your extension of GraphbankDecompositionToolset to return an instance of your heuristic.
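
This override is typically a one-liner; a sketch, using the AMREdgeHeuristics class from above:

@Override
public EdgeAttachmentHeuristic getEdgeHeuristics() {
    // return an instance of the heuristic implemented in Step 2
    return new AMREdgeHeuristics();
}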

3. Dataset writer: Class EvaluationToolset

Create a class that extends de.saar.coli.amtools.evaluation.toolsets.EvaluationToolset, preferably in the same package. You will want to override one function, namely writeCorpus. It is essentially the inverse of readCorpus from Step 1, in that it takes a list of MRInstance objects and writes that corpus to a file in your preferred format. The idea is that most graphbanks come with their own evaluation tool; you will need to write the corpus (at evaluation time, this will be the corpus of system predictions) in a format that can serve as input to whichever evaluation tool you are using. There is a default implementation of writeCorpus that writes each graph into a line in AMR penman format (with an empty line between each graph); this mostly serves testing purposes, so that you can look at the output without having to implement this function first.
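
A minimal sketch of such a subclass (the exact writeCorpus signature is an assumption here; check the base class. toOutputFormat is a hypothetical helper that serializes one sentence-graph pair in your graphbank's native format):

public class MyEvaluationToolset extends EvaluationToolset {

    @Override
    public void writeCorpus(String outputFilePath, List<MRInstance> corpus) throws IOException {
        try (PrintWriter writer = new PrintWriter(new FileWriter(outputFilePath))) {
            for (MRInstance instance : corpus) {
                // toOutputFormat is a hypothetical helper: serialize one
                // instance so your evaluation tool can read it
                writer.println(toOutputFormat(instance));
            }
        }
    }
}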

4. Config file for graph formalism

This .libsonnet file contains information about the graph formalism, in particular, how to apply evaluation metrics. This is also where you call the EvaluationToolset subclass you made in Step 3. The idea is that it will generalize across experiments, as long as the graph formalism stays the same. You can find a template with instructions here. We highly recommend looking at some of the example files in the same folder.

5. Config file for experiment

This .jsonnet file contains information about a specific experiment and ties everything together: what neural model to use, what graph formalism, input data etc. You can find a template with instructions here. We highly recommend looking at some of the example files in the same folder.

This experiment config file also links to a model config file. You can use our premade default model config, or one for toy datasets when you just want to test things on a couple of sentences.

Running the code

Training the model has two steps: (I) generating the input files for training, including the tree automata (also called "decomposition") and (II) training the neural parser with the generated files.

I) Decomposition

This in turn has two steps:

  • Run de.saar.coli.amtools.decomposition.SourceAutomataCLI to obtain training and dev automata zip files. The script takes the following arguments: -t and -d are file paths to the input training and dev corpora, respectively. These paths are passed to your dataset reader from above (i.e. to the readCorpus function you built), so they can be paths to files or to directories, depending on your reader. The -o argument specifies the output folder for the generated zip files. The -dt argument specifies the subclass of GraphbankDecompositionToolset you implemented; pass it the fully qualified class name of your subclass (if the class is in de.saar.coli.amtools.decomposition.formalisms.toolsets, passing just the class name is enough). The -s argument specifies how many sources to use during decomposition; we recommend -s 3 or -s 4, depending on your graphbank (see the paper). The optional -f flag skips expensive computations such as named entity tagging when creating the training/dev data, which can speed up the debugging/development cycle. For example, in the base directory of am-tools (after having compiled the jar file with gradle build), try running:
java -cp build/libs/am-tools.jar de.saar.coli.amtools.decomposition.SourceAutomataCLI -t examples/decomposition_input/mini.dm.sdp -d examples/decomposition_input/mini.dm.sdp -o examples/decomposition_input/dm_out/ -dt DMDecompositionToolset -s 2 -f

which is equivalent to (with respect to the -dt option)

java -cp build/libs/am-tools.jar de.saar.coli.amtools.decomposition.SourceAutomataCLI -t examples/decomposition_input/mini.dm.sdp -d examples/decomposition_input/mini.dm.sdp -o examples/decomposition_input/dm_out/ -dt de.saar.coli.amtools.decomposition.formalisms.toolsets.DMDecompositionToolset -s 2 -f

and compare the runtime and output with and without the -f option (look at the amconll file within the zip). (This command uses the same input for the training and dev sets, as it is just for testing.)

  • Run de.saar.coli.amtools.decomposition.CreateEvaluationInput to create dev and test evaluation input. This script takes the same -dt and -f parameters as SourceAutomataCLI. However, it takes only one input corpus file path, via the -c parameter (so you run it separately on the dev and test sets), and its -o output parameter takes a file path rather than a folder path; the file name should end in .amconll. For example, try
java -cp build/libs/am-tools.jar de.saar.coli.amtools.decomposition.CreateEvaluationInput -c examples/decomposition_input/mini.dm.sdp -o examples/decomposition_input/dm_out/evaluation_input.amconll -dt DMDecompositionToolset -f

II) Neural training

Follow the instructions here, using the experiment config file from step 5 above.

More specifically, step by step:

i) Make sure the system requirements at the top of the page are satisfied, and that you are in the main directory of the unsupervised2020 branch of am-parser.

ii) Choose and/or create configuration files for your model, graph formalism and experiment. See steps 4 and 5 above. Examples and templates can be found in the models, formalisms and experiments folders respectively of https://github.com/coli-saar/am-parser/tree/unsupervised2020/jsonnets/unsupervised2020.

iii) Make sure your experiment config file links to the model and formalism config files, as well as the train and dev zip files created by SourceAutomataCLI in Step (I), the dev (and optionally test) input amconll files created by CreateEvaluationInput in Step (I), and the dev (and optionally test) gold files of your dataset.

iv) Train the parser (more options are described here):

python -u train.py <experiment config file> -s <where to save the model>  -f --file-friendly-logging  -o ' {"trainer" : {"cuda_device" :  <your cuda device>  } }'
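
For example, a concrete invocation might look like this (the experiment config path, save directory and device index are placeholders for your own setup):

python -u train.py jsonnets/unsupervised2020/experiments/myExperiment.jsonnet -s models/my_experiment -f --file-friendly-logging -o ' {"trainer" : {"cuda_device" : 0 } }'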