A* Parser

weissenh edited this page Jun 28, 2021 · 6 revisions

Instead of directly parsing sentences into graphs using the am-parser, you can use the am-parser to only compute supertag and edge scores and then compute AM dependency trees using an A* parser.

First, you need to compute a scores file. Follow the instructions to generate a file scores.zip for your test set.

Running the A* parser

The A* parser will compute, for each sentence, the well-typed projective AM dependency tree with the highest total score according to the scores file.

Follow the instructions on am-tools to build am-tools.jar. You can now run the A* parser on your scores file as follows:

java -cp <am-tools.jar> de.saar.coli.amtools.astar.Astar -s <scores.zip> -o <outdir> 

This will write an amconll file and a logfile into the directory <outdir>.

You can configure the outside heuristic used by the A* parser with the command-line argument --outside-estimator <heuristic_name>. The OUTSIDE_ESTIMATORS field of the main A* class maps the heuristic names to classes in the heuristics package. For instance, you could use --outside-estimator ignore_aware.

Further useful command-line options are --threads <N>, which parallelizes parsing over N threads, and --statistics <statistics.csv>, which writes some runtime statistics to the CSV file <statistics.csv>.
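If you want to launch the parser from a script, the options above can be assembled into a single command line. The following is only a minimal sketch: the helper name build_astar_command and all file names are hypothetical, and only the flags are taken from this page.

```python
from typing import List, Optional

ASTAR_MAIN = "de.saar.coli.amtools.astar.Astar"  # main class named on this page


def build_astar_command(
    jar: str,
    scores: str,
    outdir: str,
    outside_estimator: Optional[str] = None,
    threads: Optional[int] = None,
    statistics: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for one A* parser run (hypothetical helper)."""
    cmd = ["java", "-cp", jar, ASTAR_MAIN, "-s", scores, "-o", outdir]
    # Optional flags are only appended when a value was given.
    if outside_estimator is not None:
        cmd += ["--outside-estimator", outside_estimator]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if statistics is not None:
        cmd += ["--statistics", statistics]
    return cmd


# Example with placeholder paths; nothing is executed here.
example_cmd = build_astar_command(
    "am-tools.jar", "scores.zip", "out",
    outside_estimator="ignore_aware", threads=4, statistics="statistics.csv",
)
```

You could pass the resulting list to subprocess.run(example_cmd, check=True) once the jar and the scores file exist.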

On heuristics

For more details on the heuristics, we refer to Lindemann et al. (EMNLP 2020). The heuristics are described in Section 4 (page 4); the evaluation is presented in Section 6.1 (page 8), with the relevant table on the following page. Here is how the paper connects to the code:

  • The TrivialOutsideEstimator (--outside-estimator trivial) is the first heuristic described in the paper (Section 4.1, named "trivial"): it assigns 0 to all items. Although this heuristic is admissible, it is not very useful in practice.
  • The SupertagOnlyOutsideEstimator (--outside-estimator supertagonly) is described at the end of Section 4.1, but not evaluated. It estimates the outside cost as the sum, over all tokens outside the item's span, of the cost of each token's lowest-cost supertag.
  • The StaticOutsideEstimator (--outside-estimator static) is called the "edge-based heuristic" at the beginning of Section 4.2. In addition to the lowest-cost supertags used by the supertagonly heuristic, it adds, for each token outside the item's span, the cost of the lowest-cost incoming dependency edge (including 'root' and 'ignore' edges).
  • The RootAndIgnoreAwareStaticEstimator (--outside-estimator ignore_aware) is called the "ignore-aware outside heuristic" at the end of Section 4.2. It adds the following constraints on top of the "static" heuristic: (1) there can be only one root, and (2) 'ignore' edges and the $\bot$ supertag always appear together.
  • The root_aware heuristic (RootAwareStaticEstimator) is not described in the paper.
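
To make the difference between the first three heuristics concrete, here is a small sketch that follows the descriptions above. It is not the am-tools implementation: the function names, the half-open span convention, the non-negative costs, and the toy supertag and edge labels are all assumptions made for illustration. The ignore_aware and root_aware heuristics are not sketched.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # assumed convention: half-open token span [start, end)


def outside_tokens(n_tokens: int, span: Span) -> List[int]:
    """Positions of all tokens that lie outside the item's span."""
    start, end = span
    return [i for i in range(n_tokens) if i < start or i >= end]


def trivial_estimate(span: Span, supertag_costs: List[Dict[str, float]]) -> float:
    """'trivial': every item gets the estimate 0."""
    return 0.0


def supertag_only_estimate(span: Span, supertag_costs: List[Dict[str, float]]) -> float:
    """'supertagonly': sum of the lowest supertag cost of each outside token."""
    outside = outside_tokens(len(supertag_costs), span)
    return sum(min(supertag_costs[i].values()) for i in outside)


def static_estimate(
    span: Span,
    supertag_costs: List[Dict[str, float]],
    incoming_edge_costs: List[Dict[str, float]],
) -> float:
    """'static': supertagonly plus the lowest incoming-edge cost of each outside token."""
    outside = outside_tokens(len(supertag_costs), span)
    edge_part = sum(min(incoming_edge_costs[i].values()) for i in outside)
    return supertag_only_estimate(span, supertag_costs) + edge_part


# Toy three-token sentence with made-up labels and costs.
SUPERTAG_COSTS = [
    {"(s)": 1.0, "(o)": 2.5},
    {"(s, o)": 0.5, "BOT": 3.0},
    {"()": 2.0, "BOT": 0.25},
]
# Candidate incoming edges per token, including 'root' and 'ignore'.
INCOMING_EDGE_COSTS = [
    {"root": 0.5, "APP_s": 1.5},
    {"APP_o": 0.75, "ignore": 4.0},
    {"ignore": 0.25, "MOD_m": 1.0},
]

item_span = (1, 2)  # item covers only token 1, so tokens 0 and 2 are outside
print(trivial_estimate(item_span, SUPERTAG_COSTS))                      # 0.0
print(supertag_only_estimate(item_span, SUPERTAG_COSTS))                # 1.25
print(static_estimate(item_span, SUPERTAG_COSTS, INCOMING_EDGE_COSTS))  # 2.0
```

With non-negative costs, each heuristic in this sketch is at least as large as the previous one while still never exceeding the true outside cost, which is why the more informed heuristics can guide the search better without giving up admissibility.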

As of June 2021, the default heuristic is the "static" one (see line 460 in Astar.java).

Speeding up the computation

If you are going to parse with the same scores file multiple times, you can drastically reduce the startup time of the parser in the following two ways.

First, you can cache certain preprocessing work on the types by passing the command-line option --typecache <typecache.dat>. If the file <typecache.dat> does not exist, this will perform the type preprocessing and store the results in this file. If the file exists, the results of the preprocessing will be loaded from <typecache.dat>, which can be much faster than performing the preprocessing itself.

Second, you can convert the scores file into a serialized scores file using the following command:

java -cp <am-tools.jar> de.saar.coli.amtools.astar.io.SerializedScoreReader <scores.zip> <serialized-scores.zip>

This will read the scores file <scores.zip> and write it into a file <serialized-scores.zip> in a binary form, which is much faster to read for the A* parser than the scores file itself. In order to use the serialized scores file, pass the command-line option -S <serialized-scores.zip> instead of -s <scores.zip>.
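
If you script repeated runs, the two speed-ups can be planned as follows. This is a sketch under assumptions: the helper name plan_fast_run and all file names are hypothetical, and combining -S with --typecache in one call is an assumption based on the two options being described independently above.

```python
from typing import List

ASTAR_CLASS = "de.saar.coli.amtools.astar.Astar"
SERIALIZER_CLASS = "de.saar.coli.amtools.astar.io.SerializedScoreReader"


def plan_fast_run(
    jar: str,
    scores: str,
    serialized: str,
    typecache: str,
    outdir: str,
    serialized_exists: bool,
) -> List[List[str]]:
    """Return the commands to run, in order (hypothetical helper)."""
    commands: List[List[str]] = []
    if not serialized_exists:
        # One-time step: write the scores file in binary form.
        commands.append(["java", "-cp", jar, SERIALIZER_CLASS, scores, serialized])
    # Parse from the serialized scores (-S) and reuse or create the type cache.
    commands.append([
        "java", "-cp", jar, ASTAR_CLASS,
        "-S", serialized, "-o", outdir, "--typecache", typecache,
    ])
    return commands


first_run = plan_fast_run(
    "am-tools.jar", "scores.zip", "serialized-scores.zip", "typecache.dat", "out",
    serialized_exists=False,
)
later_run = plan_fast_run(
    "am-tools.jar", "scores.zip", "serialized-scores.zip", "typecache.dat", "out",
    serialized_exists=True,
)
```

On the first run the plan contains the serialization step followed by the parse; on later runs only the parse remains, and the type cache is loaded from disk instead of being recomputed.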