GettingStarted

akoller edited this page Jul 26, 2015 · 19 revisions

Getting Started with Alto

The goal of this section of the wiki is to teach you the basics of working with Alto through examples. Once you have worked through the simple problems presented here, you can read more details in the other sections of the wiki, or learn to use more advanced features by reading the tutorials.

We assume that you have installed Alto before you start.

Parsing with a Toy CFG Grammar

A good starting point for working with Alto is the task of parsing with context-free grammars. We will first explain how to do this via the graphical user interface and then show you how to use the tool from your own code. If you have not done so already, get Alto by downloading the current version. While you are downloading, you should also obtain the examples zip archive and unpack it in an examples folder of your choosing.

What a Grammar Looks Like

Before we start parsing, let us have a look at the structure of a grammar file. Specifically, we will look at an Interpreted Regular Tree Grammar (IRTG), which we use for CFG parsing. An IRTG generates trees, which are the parses we are interested in, and then turns them into outputs by applying interpretations to each node in the tree. An interpretation consists of an algebra and a mapping that assigns to each rule label a term built from the operations of the algebra and variables.

The grammar file we will be using is elephant.irtg from the examples. Like every IRTG grammar, it starts by specifying which interpretations are used. In this case there is only one, called "i":

#!
interpretation i: de.up.ling.irtg.algebra.StringAlgebra

After the name of the interpretation we specify the algebra it uses. In this case it is a string algebra, which generates a string output for us, as any CFG would. The specification of the interpretations is then followed by rules of the form:

#!
S! -> r1(NP,VP) 
  [i] *(?1,?2)

NP -> r2
  [i] john

Each of these rules consists of two parts. The first part, e.g. "S! -> r1(NP,VP)", specifies part of the Regular Tree Grammar used. This specific part tells us that we can generate a tree from the symbol "S" by taking a tree generated from "NP" and one generated from "VP" and putting them both under an "r1" node. The "!" indicates that trees generated from "S" are part of the language we are interested in, i.e. "S" is a start symbol in traditional grammar terms. It is sufficient to write the "!" once: if we had another rule "S -> r", then this would be a rule for the same "S" symbol, and "r" would be part of the language. It is also possible to have more than one symbol with a "!" annotation. Note that the symbols "r1", "r2", ... must have fixed arities. This means that if you have "r1" with two symbols in the brackets, then you cannot have another rule where "r1" encloses 0, 1, 3, 4, ... symbols; otherwise Alto will not accept the grammar as valid.
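The fixed-arity requirement can be illustrated with a small check written outside of Alto. This is only a sketch of the idea, not Alto's own validation code: the tuple representation of rules and the function name `arity_conflicts` are assumptions made for this example, and the third rule is a made-up rule that deliberately breaks the requirement.

```python
from collections import defaultdict

# Each rule is (parent symbol, rule label, child symbols).
# The first two rules mirror the excerpt above; the third is a
# hypothetical rule that reuses "r1" with a different arity.
RULES = [
    ("S", "r1", ("NP", "VP")),
    ("NP", "r2", ()),
    ("VP", "r1", ("V",)),  # hypothetical: r1 with one child instead of two
]


def arity_conflicts(rules):
    """Return {label: sorted arities} for labels used with more than one arity."""
    arities = defaultdict(set)
    for _parent, label, children in rules:
        arities[label].add(len(children))
    return {label: sorted(seen) for label, seen in arities.items() if len(seen) > 1}


print(arity_conflicts(RULES))
```

A grammar that satisfies the requirement yields an empty dictionary; here "r1" is reported because it appears with both one and two children.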

The second part of any grammar rule consists of the homomorphic images of the rule label. Here we only have one interpretation, "i", so we only have to list one image. "*(?1,?2)" tells us that we interpret "r1(X,Y)" as "*(interpretation(X),interpretation(Y))". What "*" means is defined by the algebra we gave at the beginning of the file. Note that every label like "r1" should have the same homomorphic images throughout, and it is best practice to simply use a unique label for every rule in your grammar. The images used here are fairly simplistic, but you can see more complex examples in the advanced tutorials.
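To make the role of the homomorphic images concrete, the following sketch evaluates a small tree of rule labels into a string. It is plain Python and does not use Alto. It assumes that "*" joins its two arguments with a space, which is what one would expect from a string algebra; the label "r_vp" and its image "sleeps" are hypothetical and only serve to complete the example.

```python
# Homomorphic images for interpretation "i": rule label -> term.
# A term is a word (str), a variable such as "?1", or a tuple
# (operation, argument, argument, ...).
IMAGES = {
    "r1": ("*", "?1", "?2"),
    "r2": "john",
    "r_vp": "sleeps",  # hypothetical rule, not taken from elephant.irtg
}


def concat(left, right):
    """Assumed meaning of "*": join two strings with a space."""
    return f"{left} {right}"


OPERATIONS = {"*": concat}


def evaluate_term(term, child_values):
    """Evaluate a term, replacing ?k by the value of the k-th child."""
    if isinstance(term, tuple):
        operation, *args = term
        return OPERATIONS[operation](*(evaluate_term(a, child_values) for a in args))
    if term.startswith("?"):
        return child_values[int(term[1:]) - 1]
    return term


def interpret(tree):
    """Interpret a tree given as (label, [subtrees]) bottom-up."""
    label, children = tree
    child_values = [interpret(child) for child in children]
    return evaluate_term(IMAGES[label], child_values)


derivation = ("r1", [("r2", []), ("r_vp", [])])
print(interpret(derivation))  # john sleeps
```

The tree itself only contains rule labels; the words appear only once the images are evaluated, which is why the same tree could be mapped to different outputs by different interpretations.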

This concludes our first overview of what is in a grammar file. A more formal discussion can be found on the codec page.

Parsing from the GUI

The simplest way to get started with parsing is via the GUI, which can be started by calling:

#!shell

java -jar <alto>.jar

This will bring up the GUI which looks something like this:

mainWindow.png

Now we can load the grammar:

load.png

which brings up a file loading dialogue in which you can select elephant.irtg. This brings up a window that shows you the grammar:

elephant.png

For now you can ignore the weights; we will come back to them later. Now we are at the point where we can parse a sentence:

parse.png

elephantPyjamas.png

After clicking OK in the last window, Alto will parse the sentence. In this case parsing means finding a new grammar that generates exactly that subset of the original RTG trees that could have been used to generate the input. The parsing result will look like this:

elephantChart.png

The result is just another IRTG. In order to look at the results, we can use the visualization capabilities of Alto. Simply select the show language option:

showLanguage.png

which brings up the language view:

elephantLanguage.png

Here you can see the RTG trees that could have generated your input. You can scroll through the different possibilities by typing different numbers into the field at the bottom or by using the forward and backward buttons at the bottom. Maybe you also want to see how the given rule tree is connected to your input. An easy way of doing this is adding another view:

addView.png

which adds the term and value of the parse tree you are currently viewing:

sideBySide.png

If you want to store your parsing results for later, simply go back to the window with the results and save them:

saveElephant.png

Simply write the grammar to a file, and you can load it later the same way we loaded the grammar with which we parsed.

If you want to achieve more or less the same by using Alto directly from your code, you should have a look at the parsing page.
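The idea that a parsing result is itself a grammar can be sketched independently of Alto with a CKY-style procedure. The code below is an illustration under several assumptions, not Alto's parsing algorithm: it only handles lexical rules and binary concatenation rules, the rules labelled r3, r4 and r5 are hypothetical, and naming chart symbols by the span of words they cover is a choice made for this sketch. Rules that cannot be reached from the start symbol over the whole input are not pruned here.

```python
# Toy grammar: lexical rules (parent, label, word) and binary rules
# (parent, label, (left child, right child)). Only r1 and r2 follow the
# excerpt from elephant.irtg; r3, r4 and r5 are hypothetical.
LEXICAL = [
    ("NP", "r2", "john"),
    ("V", "r4", "watches"),
    ("NP", "r5", "mary"),
]
BINARY = [
    ("S", "r1", ("NP", "VP")),
    ("VP", "r3", ("V", "NP")),
]


def parse_chart(words):
    """Return chart rules ((symbol, start, end), label, children) for the input."""
    n = len(words)
    derivable = set()  # chart symbols (symbol, start, end) found so far
    chart_rules = []

    # Lexical rules cover single words.
    for i, word in enumerate(words):
        for parent, label, rule_word in LEXICAL:
            if rule_word == word:
                derivable.add((parent, i, i + 1))
                chart_rules.append(((parent, i, i + 1), label, ()))

    # Binary rules combine two adjacent spans, shortest spans first.
    for width in range(2, n + 1):
        for start in range(0, n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for parent, label, (left, right) in BINARY:
                    left_item = (left, start, mid)
                    right_item = (right, mid, end)
                    if left_item in derivable and right_item in derivable:
                        derivable.add((parent, start, end))
                        chart_rules.append(((parent, start, end), label, (left_item, right_item)))
    return chart_rules


chart = parse_chart("john watches mary".split())
for parent, label, children in chart:
    print(parent, "->", label, children)
```

Every chart rule keeps the label of the original rule it came from, so each tree generated by the chart is also a tree of the original grammar, restricted to those that fit the input.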

Using Weights

The current grammar does not express any preferences concerning which rule tree should be picked from among the different options. If we want to encode some simple preferences, we can do so by using weights for rules. They will combine to give the weight of complete trees and we can prefer trees that have a higher weight.

Weights in the Grammar

Among the examples, there is one that extends the "elephant.irtg" grammar we have been using so far with weights. It is appropriately called "elephant-weighted.irtg". If you open this grammar you will see that the rules are written with additional weights assigned to them:

#!

S! -> r1(NP,VP)         [1.0]
  [i] *(?1,?2)

NP -> r2                [0.3]
  [i] john

In this case the weights are intended as probabilities, so there is a 100% probability that a derivation from "S" uses the first rule, but only a 30% chance that a derivation from "NP" uses the second rule. In general you are not limited to probabilities and the numbers in the brackets may be any floating point numbers.
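As a rough illustration of how rule weights could combine into the weight of a complete tree, the sketch below multiplies the weights of all rules used in a tree and picks the tree with the highest result. Multiplication is an assumption that fits the probabilistic reading above. The labels "r_np2" and "r_vp" and their weights are hypothetical, and the code does not use Alto.

```python
import math

# Rule weights by label. r1 and r2 follow the excerpt above;
# r_np2 and r_vp are hypothetical rules with made-up weights.
WEIGHTS = {"r1": 1.0, "r2": 0.3, "r_np2": 0.7, "r_vp": 0.5}


def labels(tree):
    """Yield every rule label in a tree given as (label, [subtrees])."""
    label, children = tree
    yield label
    for child in children:
        yield from labels(child)


def tree_weight(tree):
    """Assumed combination: the product of the weights of all rules used."""
    return math.prod(WEIGHTS[label] for label in labels(tree))


candidates = [
    ("r1", [("r2", []), ("r_vp", [])]),
    ("r1", [("r_np2", []), ("r_vp", [])]),
]
best = max(candidates, key=tree_weight)
print(best, tree_weight(best))
```

With these made-up numbers the second tree is preferred, because its noun phrase rule has the higher weight.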