Tutorial
Jerome Feldman: feldman@icsi.berkeley.edu
Vivek Raghuram: vivek.raghuram@berkeley.edu
Begin by downloading the latest workbench release and then a copy of the ecg_grammars repository. Unzip the workbench release and open the appropriate version of the workbench for your operating system. For now, you will need Java 6 to run the ECG2 workbench.
The WB is used to examine, modify, test, and analyze an ECG2 grammar; several of these are placed in the directory ecg_grammars by the standard start up procedure. We will use a particular grammar, Comp_robots, in this tutorial but only consider a fragment of it.
Figure 1 shows a view of the WB in a common starting state. There are many options, but the most basic functions are editing and testing ECG2 grammars. The large Editor window is used for most of the main functions and, in Figure 1, displays a constructional analysis of the input sentence “robot1 moved a box” shown in the Sentence box just above. The narrow box to the right is used for a wide range of auxiliary displays; here it indexes the file structure of our Comp_robots grammar and the two more basic grammars on which it depends: Core and Research.
The top menu bar has the usual general functions plus two that are special to the WB, Grammar and Window. To open a grammar, on the top menu bar select Grammar | Open Preferences File and then navigate to and choose Comp_robots.prefs from the ecg_grammars repository. The ECG Workbench uses a metafile (the preferences, or prefs, file) describing the grammar files and various other parameters used by the ECG analyzer. The preferences file is a plain text file containing pointers to the various elements making up the grammar: the folder containing the actual grammar files, the file extensions for different types of files, some parameter settings, and example sentences. You will not normally need to modify this file.
With the grammar opened, you can type an example into the Sentence box, followed by CR and have it analyzed by clicking the green circle above this box. Assuming that all goes well, a traditional text version of a parse tree will appear in the large window; this is not shown here, but can be brought up by clicking on the TextOutput box at the bottom left of the display. Just to its right is the selected SemSpec box, which produced Figure 1.
Of course, the details of Figure 1 depend on both the general rules of ECG2 and the specific Constructions of the selected grammar.
Figure 1. Constructional Structure of “robot1 moved a box.”
Many of the screen shots in this tutorial are compressed; clicking on any gray box in the main display will expand/compress it. Figure 2 is part of the same analysis; it would be displayed by clicking on the DiscourseElement box in Figure 1. An essential role of the ECG2 analyzer is explicitly linking elements of the SemSpec that are linguistically unified (bound). Following a common convention in unification grammars like HPSG (http://hpsg.stanford.edu/), small boxes with the same (arbitrary) number serve to link unified entries. In this example, we have clicked on the entry for robot1 [18] and all linked entries in both figures are also highlighted. If you hover over a boxed number, the value associated with that index will be temporarily displayed. There is also a drop-down list of example sentences accessed by clicking on the downward triangle at the right end of the Sentence box.
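The linking by boxed numbers can be pictured as a shared index stored in several slots of a nested structure. The sketch below is only an illustration of that idea, not the analyzer's actual data model: the slot names (`eventProcess`, `protagonist`, `mover`, `movedThing`, `discourseElement`, `topic`) and the index 25 are hypothetical; only the index 18 for robot1 comes from the example above.

```python
# Hypothetical, heavily simplified SemSpec: each leaf slot holds a boxed index.
# Slot names and the index 25 are invented for illustration.
semspec = {
    "eventProcess": {"protagonist": 18, "mover": 18, "movedThing": 25},
    "discourseElement": {"topic": 18},
}

# What hovering over a boxed number would reveal in this toy model.
values = {18: "robot1", 25: "box"}


def paths_with_index(node, index, prefix=()):
    """Yield the path to every slot whose boxed index equals `index`."""
    for key, child in node.items():
        path = prefix + (key,)
        if isinstance(child, dict):
            yield from paths_with_index(child, index, path)
        elif child == index:
            yield path


# Selecting index 18 "highlights" every slot unified with robot1.
linked = list(paths_with_index(semspec, 18))
for path in linked:
    print(".".join(path), "->", values[18])
```

Because unified slots share one index rather than holding copies of a value, selecting any one of them is enough to find all the others.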
Figure 2. SemSpec Structure of “robot1 moved a box.”
The ECG2 treatment of vocabulary is much more general and powerful than in previous releases. One important addition is the inclusion of the Celex morphological pre-processor; Celex data mapping morphological variations of a lemma is stored in a .ecgmorph file, and tables mapping this information onto ECG2 constraints are stored in the file <grammar>.morph. Celex plus a type-token mechanism described below and some type-based caching yield smaller and faster grammars.
Much of the design is based on the clean retargeting of ECG2 full-path products to new domains, which obviously requires new vocabulary. In addition, the user vocabulary for the new product must be mapped to the internal APP terminology of the new product. Other aspects of retargeting are discussed in our retargeting tutorial.
We will start with vocabulary in the Analyzer, which is simpler. Figure 3 is a screen dump derived from Figure 1 by first clicking the Lexicon box above the narrow window. If the Lexicon box is not visible, go to Window > Open View > Other... and then select the Lexicon viewer under the token folder. This opens a tree structure over the vocabulary; we then expanded the entries for “box” as a token of the type ContainerNoun and “robot” as a token of the type SentientNoun. All of the grammar rules for open-class items (nouns, verbs, adjectives) are written in terms of construction-grammatical types, like ContainerNoun. This improves the efficiency of the Analyzer and also facilitates expanding the vocabulary of a grammar.
We then clicked on the “box” entry, causing the large window to display linking information for the lexicon. The entry for “box” specifies its type, ContainerNoun, and also its ontological-category <-- @box. All nouns have an ontological-category, and the forms starting with @ are part of the ECG internal ontology lattice, which we will describe below. Still in the large window, notice that a verb like “run” (just above box) has an actionary rather than an ontological-category. Notice also that other grammatical types, like ColorType and ScalarAdjectiveType, are instead associated with properties and values.
Figure 3. Lexical Type/Token structure
Figure 4 is a screen shot of part of the ECG Language Ontology, which is distinct from any ontology or naming structure in the Apps of a product. We first clicked on the Gram. box above the narrow window and then clicked on core.ont in the resulting display, bringing up the portion of the ontology associated with the Core grammar. All of the ECG2 ontologies are lattices of flat (unstructured) items, which serve several related purposes. Recall that ontology items are marked with @ in Figure 3; they are also used with @ in constructions.
A major distinction is between internal (unshared) and shared ontology items. Shared items provide the direct link between concepts on the language side and related concepts on the App side. The details of how this happens will be discussed below and in the page on the Token Tool. Unshared ontology items are used internally, for example in this entry about halfway down the screen: (type grammaticalValues sub unshared enumeration). A bit further down we find (type genderValues sub grammaticalValues) and a subcase (type male sub genderValues). The third main use is a standard subsumption lattice with entries like: (type physicalEntity sub entity location).
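To make the lattice reading of these entries concrete, here is a minimal sketch that parses the four entries quoted above and answers subsumption queries. It assumes that the first name after `type` is the item and that every name after `sub` is a parent, including `unshared`; the real .ont syntax and its interpretation by the Workbench may differ. The function names are our own.

```python
import re

# The four entries quoted in the text above.
ENTRIES = """
(type grammaticalValues sub unshared enumeration)
(type genderValues sub grammaticalValues)
(type male sub genderValues)
(type physicalEntity sub entity location)
"""


def parse_ontology(text):
    """Return {item: [parents]} for each '(type X sub P1 P2 ...)' entry.

    Assumption: every name after 'sub' is a parent of X.
    """
    parents = {}
    for body in re.findall(r"\(([^()]*)\)", text):
        words = body.split()
        if len(words) >= 2 and words[0] == "type":
            has_sub = len(words) > 2 and words[2] == "sub"
            parents[words[1]] = words[3:] if has_sub else []
    return parents


def subsumes(parents, general, specific):
    """True if `general` is `specific` itself or one of its ancestors."""
    seen, stack = set(), [specific]
    while stack:
        item = stack.pop()
        if item == general:
            return True
        if item not in seen:
            seen.add(item)
            stack.extend(parents.get(item, []))
    return False


lattice = parse_ontology(ENTRIES)
print(subsumes(lattice, "grammaticalValues", "male"))  # True
print(subsumes(lattice, "entity", "male"))             # False
```

An item with two parents, such as physicalEntity, is what makes this a lattice rather than a tree: it is subsumed by both entity and location.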
Figure 4. Core ECG Language Ontology
Because all the open class grammar rules are expressed in terms of Construction-grammatical types, it is easy to add new tokens of the provided types or additional ones that might be added. There is a WB tool for exactly this purpose, the Token Editor, illustrated in Figure 5. The narrow window shows an instance of the Token Editor, invoked by clicking on the Token tab above. Further examples and instructions on the Token Editor can be found here. After adding tokens, you may need to refresh the token and .ont files, here research.ont and testing.tokens. A file can be refreshed by right clicking on it (in the Explorer window) and choosing Refresh.
Figure 5. Adding a new token “lamp”.
The Sentence box contains the novel example “robot1 moved the lamp.”, which would not have been analyzable before we added the new word. Rather than displaying the resulting SemSpec, we show a portion of the grammar illustrating the definition of the general ArtifactNoun type.
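The benefit of writing rules over types rather than words can be sketched in a few lines. This is a toy model, not the Workbench's token format: the `Token` class, the `rule_accepts` function, and the set of accepted types are hypothetical, and we assume that “lamp” is added as an ArtifactNoun with the ontology item @lamp. Only the “box” entry (ContainerNoun, @box) is taken from Figure 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    word: str
    gram_type: str  # construction-grammatical type, e.g. ContainerNoun
    meaning: str    # ontology item, written with a leading @


# Toy lexicon holding the one entry described in the text.
lexicon = {"box": Token("box", "ContainerNoun", "@box")}

# Hypothetical stand-in for a grammar rule: it names types only, never words.
NOUN_TYPES_ACCEPTED = {"ContainerNoun", "ArtifactNoun"}


def rule_accepts(word):
    """True if the word is a known token whose type the rule mentions."""
    token = lexicon.get(word)
    return token is not None and token.gram_type in NOUN_TYPES_ACCEPTED


print(rule_accepts("lamp"))  # False: no token for "lamp" yet

# Adding a token (type and ontology item assumed here) leaves the rule untouched.
lexicon["lamp"] = Token("lamp", "ArtifactNoun", "@lamp")
print(rule_accepts("lamp"))  # True
```

The rule itself never changes; only the lexicon grows, which is the effect the Token Editor achieves for a real grammar.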
The ECG2 framework currently covers a large fraction of the most general English Constructions, but many applications and investigations will not require the full richness at every stage. As you would expect, including the full set of Constructions causes the Analyzer to become significantly larger and slower. Figure 6 is a screen shot after a search for “import” with the file “robot-argumentstructure-general.grm” opened. Notice that several import lines are commented out; removing the comment marker // from a line will have that package included in the next build of the analyzer. We have done considerable testing and believe that any combination of packages will be consistent; please let us know of any problems.
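If you want an overview of which packages a grammar file currently includes without opening it in the Workbench, a small script can list them. The sketch below assumes that an import line consists of the keyword `import` followed by a package name, and that a disabled import is the same line prefixed with `//`. The excerpt and its package names are placeholders, not the contents of the real file.

```python
import re

# Hypothetical excerpt of a .grm file; the package names are placeholders.
GRM_TEXT = """\
import package_a
// import package_b
    //import package_c
construction Example
"""

# Optional leading "//", then "import", then the package name.
IMPORT_LINE = re.compile(r"^\s*(//)?\s*import\s+(\S+)")


def list_imports(text):
    """Return (active, disabled) package names found on import lines."""
    active, disabled = [], []
    for line in text.splitlines():
        match = IMPORT_LINE.match(line)
        if match:
            target = disabled if match.group(1) else active
            target.append(match.group(2))
    return active, disabled


active, disabled = list_imports(GRM_TEXT)
print("included:", active)
print("commented out:", disabled)
```

To run it on a real grammar file, read the file's text and pass it to `list_imports` in place of `GRM_TEXT`.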
Figure 6. Importing Grammar Packages
The Workbench is derived from the powerful Eclipse system and has many of its capabilities, but most of these are not important here. Referring to the top menu bar (in all figures), the File, Edit, Search, and Help tabs are all standard. We used Grammar | Open Preferences File to load our starting grammar. It is best to close one preferences file before opening another one. Then you should delete (right click | Delete) the file from the narrow window, being sure to click the Delete box in the window that appears.
Figure 7
The Grammar | Check tab is useful when making structural changes to a grammar. It performs a number of static tests, but does not try to analyze any examples.
The Window tab has several useful features, but is complex. From the top level, we need only “Open View”. The relevant choices are: Grammar Explorer, Ontology, and Other; we have used the first two in examples. Under Open View | Other, the important entries are in the Token folder, namely the Lexicon Viewer, the Package Viewer, and the Token Adder. Clicking on one of these brings it up in the narrow window and adds its tab to the bar above that window.
The original goal of ECG was to facilitate the design, exploration, and use of Construction Grammars linked to Embodied Meaning based on Cognitive Linguistics and Neural Computation. Over the past three decades, a wide range of findings have evolved and been documented in various ways. Some of this is described in the home page.
The current emphasis in ECG2 goes beyond grammar and considers language in action and applications, as discussed in other sections of the Wiki. Any large system, like the grammar developed here, has complex interactions and is not robust to local modifications. In fact the interactions in Natural Language are much richer than in a well-engineered artifact. It should be possible to add idiom-like domain dependent constructions with moderate effort.
We believe that exploration of alternative grammar designs, other languages, and possibly different deep semantics is also well supported by the ECG2 design. The Workbench is much improved over earlier versions and has several additional tools described in this Wiki. The Core grammar has no tokens, is relatively simple, and should form a good basis for English research and development. We have found it easy to build French and Spanish analogs, using essentially the same semantic schemas.
The Research grammar does have tokens and is being used to try out new constructions and to work on grammar issues that have not been solved. This often results in new packages or modifications of existing ones.