Description
Python tagger for multiword expression lexicon
This is a task related to research on language and gesture with the NewsScape Library of International Television News. NewsScape is hosted by the University of California Los Angeles Library and developed by the Red Hen consortium for research on multimodal communication. Besides UCLA, Red Hen has capture nodes and research teams at Case Western Reserve University, University of Illinois at Urbana-Champaign, University of Southern Denmark, University of Oxford, University of Osnabrück, Texas Tech, National Institute for Advanced Studies in Bangalore, University of Navarra, University of Murcia, and other places (the consortium is constantly expanding). NewsScape contains more than 200,000 hours of television news in English, Spanish and other European languages, indexed by their subtitles/closed captioning (more than 3 billion words). Among other functionalities, NewsScape is the first audiovisual database that allows for synchronized searches of subtitles and images. Its search results take the user to the exact moment of the show when the words in the subtitles/closed captioning were uttered.
Almost all large linguistic corpora to date are written corpora (Corpus of American English, CREA and CORDE from Spain’s Royal Academy, newspaper archives, etc.). NewsScape opens new horizons for the study of oral communication alongside the great variety of elements that accompany verbal expression: gesture and intonation, along with, in the case of television, music, image and sound effects, graphics, etc. NewsScape also facilitates the study of particular news, topics, statements by individuals or institutions, etc. We are developing automatic and manual search and annotation tools for semantic patterns. Besides verbal patterns, we are also developing tools for face recognition, detection of visual patterns, story segmentation, etc. The research groups at Navarra and Murcia are developing the SCHEMOTIME project, which compares language and gesture in the expression of emotions and time, two central concepts for theories of metaphor and cognition. In addition, the collaboration between Navarra and Murcia leads the development of NewsScape in Spanish.
The present task is to write a program that receives an input text in natural language and tags certain phrases. The phrases to be tagged are multiword expressions of time, such as "the years rolled by".
Python is probably the right programming language, given the libraries available (we recommend mwetoolkit).
Part of the job is already done by a preprocessor that tags Parts-of-Speech (prepositions, verbs, nouns, etc.) in the raw text.
For instance, the raw text may be the sentence, "AND SO THE YEARS ROLLED BY."
A tool called MBSP, from the CLiPS research group at the University of Antwerp, tags it like this, using the pipe symbol to separate tokens and the slash to separate the fields within each token:
"and/CC/O/O/and|so/IN/I-ADVP/O/so|the/DT/I-NP/O/the|years/NNS/I-NP/O/year|rolled/VBN/I-VP/O/roll|by/RP/I-PRT/O/by|././O/O/."
You are not expected to understand those annotations yet, just know that they exist and that they are what your program will use.
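As a starting point, the tagged string above can be split into one record per token. The following is a minimal sketch, not a definitive parser: the names `Token` and `parse_tagged` are hypothetical, and the field names (word, part-of-speech tag, chunk, a fourth field labelled `pnp`, and lemma) are an assumed reading of the five slash-separated values in the sample.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One tagged token; the field names are assumptions based on the sample."""
    word: str
    pos: str    # part-of-speech tag, e.g. NNS
    chunk: str  # chunk tag, e.g. I-NP
    pnp: str    # fourth field, "O" throughout the sample
    lemma: str  # lemmatized form, e.g. "year" for "years"


def parse_tagged(tagged: str) -> List[Token]:
    """Split a pipe-separated tagged sentence into Token records."""
    tokens = []
    for item in tagged.strip().strip('"').split("|"):
        fields = item.split("/")
        # Assumption: every token has exactly five fields and no token
        # contains a literal slash; real data may need escaping rules.
        if len(fields) != 5:
            raise ValueError("unexpected token format: %r" % item)
        tokens.append(Token(*fields))
    return tokens


sample = ("and/CC/O/O/and|so/IN/I-ADVP/O/so|the/DT/I-NP/O/the|"
          "years/NNS/I-NP/O/year|rolled/VBN/I-VP/O/roll|by/RP/I-PRT/O/by|././O/O/.")
tokens = parse_tagged(sample)
print([t.lemma for t in tokens])
```

The lemma and part-of-speech fields are the two that the matching steps described in this task rely on.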
The multiword expressions are specified through a combination of lists of words and these prepared Parts of Speech tags. The full set of specifications is called a lexicon.
For instance, an expression may have the structure As + UNIT OF TIME + MOTION VERB + PREPOSITION. Some examples: As centuries float slowly by, As the seconds trickled past, As the holidays slowly snuck up on her. The construction is further specified as follows in the lexicon:
- A list of words indicating units of time, such as afternoon, age, autumn, century, dawn, decade, evening, and November.
- A list of motion verbs, including fly, shuffle, sneak up, come tumbling down, and roll past.
- The PREPOSITION will be available in the parts-of-speech tags.
So the lexicon defines the multiword expression, and the program must locate that expression in the source text. Three steps are needed:
- Identify the lemmatized form of each word (the lemmas are available in the Parts-of-Speech tags)
- Match the word list in the lexicon against the candidate word in the source text
- Match the part-of-speech tag in the lexicon against the part-of-speech tag of the candidate word in the source text
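The three steps can be illustrated for the construction As + UNIT OF TIME + MOTION VERB + PREPOSITION. This is a rough sketch under stated assumptions rather than the intended implementation: the function names are hypothetical, the word lists are small subsets of the lexicon, determiners and adverbs are assumed to be skippable between slots, and both IN and RP are accepted as preposition tags because "by" is tagged RP in the MBSP sample. Multiword verbs such as "sneak up" are not handled here.

```python
from typing import List, Tuple

# Small illustrative subsets of the lexicon word lists (not the full lists).
UNITS_OF_TIME = {"century", "day", "holiday", "second", "year"}
MOTION_VERBS = {"float", "fly", "roll", "shuffle", "sneak"}

# Assumption: prepositions may be tagged IN or RP.
PREP_TAGS = {"IN", "RP"}
# Assumption: determiners and adverbs may occur between the slots.
SKIPPABLE_TAGS = {"DT", "RB"}

Token = Tuple[str, str, str]  # (word, part-of-speech tag, lemma)


def _skip(tokens: List[Token], i: int) -> int:
    """Advance past optional determiners and adverbs."""
    while i < len(tokens) and tokens[i][1] in SKIPPABLE_TAGS:
        i += 1
    return i


def _fits(tokens: List[Token], i: int, word_list: set, tag_prefix: str) -> bool:
    """Check the lemma against a word list and the tag against a tag prefix."""
    if i >= len(tokens):
        return False
    _, pos, lemma = tokens[i]
    return lemma.lower() in word_list and pos.startswith(tag_prefix)


def find_as_time_motion(tokens: List[Token]) -> List[Tuple[int, int]]:
    """Return (start, end) token spans, end exclusive, that fit the pattern."""
    spans = []
    for start, (_, _, lemma) in enumerate(tokens):
        if lemma.lower() != "as":
            continue
        i = _skip(tokens, start + 1)
        if not _fits(tokens, i, UNITS_OF_TIME, "NN"):
            continue
        i = _skip(tokens, i + 1)
        if not _fits(tokens, i, MOTION_VERBS, "VB"):
            continue
        i = _skip(tokens, i + 1)
        if i < len(tokens) and tokens[i][1] in PREP_TAGS:
            spans.append((start, i + 1))
    return spans


sentence = [("As", "IN", "as"), ("the", "DT", "the"), ("years", "NNS", "year"),
            ("rolled", "VBD", "roll"), ("slowly", "RB", "slowly"),
            ("by", "RP", "by"), (".", ".", ".")]
print(find_as_time_motion(sentence))
```

Working on lemmas rather than surface forms means "years" and "rolled" match the lexicon entries "year" and "roll" without listing every inflected form.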
The final product is a utility to which the user submits a sentence and which tags that sentence according to the multiword expression lexicon. The utility should support a socket server mode.
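One possible shape for the socket server mode is sketched below using Python's standard `socketserver` module. Everything specific here is an assumption for illustration: the line-based protocol (one sentence per line in, one tagged line out), the names `tag_sentence`, `TaggerHandler` and `make_server`, and the port number. The `tag_sentence` function is only a stand-in that returns its input unchanged; the real tagger would be plugged in there.

```python
import socketserver


def tag_sentence(sentence: str) -> str:
    """Hypothetical placeholder for the real tagger: returns the input unchanged."""
    return sentence


class TaggerHandler(socketserver.StreamRequestHandler):
    """Reads one sentence per line and writes one reply line per sentence."""

    def handle(self):
        for raw in self.rfile:
            sentence = raw.decode("utf-8").strip()
            if not sentence:
                continue  # ignore blank lines
            reply = tag_sentence(sentence)
            self.wfile.write((reply + "\n").encode("utf-8"))


def make_server(host: str = "127.0.0.1", port: int = 0):
    """Create a threaded TCP server; port 0 lets the OS pick a free port."""
    server = socketserver.ThreadingTCPServer((host, port), TaggerHandler)
    server.daemon_threads = True  # do not block shutdown on open connections
    return server


# To run as a long-lived service (the port number is an arbitrary choice):
#     server = make_server(port=9999)
#     server.serve_forever()
```

A threaded server is used so that one slow client does not hold up others; whether that matters depends on how the utility will actually be deployed.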
The project will be mentored by software developers in the Red Hen Lab, which includes faculty at the University of Navarra in Spain and the University of California, Los Angeles.
Sample Lexicon of English Time Expressions
- UNIT OF TIME + MANNER-OF-MOTION VERB
Example sentences:
-Time flies. -Days shuffle. -Holidays sneak up on.
-Months come tumbling down. - The years rolled slowly past
UNITS OF TIME: afternoon, age, autumn, century, dawn, decade, evening, fall, holiday, holidays, hour, night, midday, midnight, millennium, millisecond, minute, moment, month, morning, morrow, noon, period, second, spring, summer, today, tomorrow, tonight, twilight, week, weekday, weekend, winter, yesterday. Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday. January, February, March, April, May, June, July, August, September, October, November, December. Time.
Also, some nouns and pronouns referring to processes: movie, stay, course, class, lecture, show, concert, exam, party, meeting, match, war, Christmas, summer, season, it (all), race, project, recording, visit. * This list is expandable.
MANNER-OF-MOTION VERBS: fly, shuffle, sneak up, come tumbling down, roll slowly/quickly past, run, walk, bounce, drift, drop, float, glide, move, roll, slide, swing, revolve, rotate, spin, turn, twirl, twist, whirl, wind, amble, bolt, bounce, charge, coast, crawl, creep, dart, dash, dodder, drift, flit, float, fly, frolic, gallop, glide, hasten, hike, hobble, hop, hurry, inch, jump, leap, lurch, march, meander, mince, parade, perambulate, plod, promenade, prowl, race, ramble, roam, roll, run, rush, saunter, scurry, scutter, scuttle, shamble, shuffle, skedaddle, skip, slide, slink, slither, slog, slouch, sneak, speed, stagger, stray, streak, stroll, strut, stumble, swagger, sweep, swim, tear, tiptoe, toddle, totter, traipse, travel, troop, trot, vault, walk, wander, whiz, zigzag, zoom. * This list is expandable.
- As + UNIT OF TIME + go/pass + PREP.
-As seconds go by. -As minutes pass on. -As days go on.
-As centuries go by. -As hours pass by. - As years go by.
This can already be captured, but we want to tag it automatically as a class of multi-word time expressions.
- As + UNIT OF TIME + go/pass + VPG
-As the years go marching on. -As the centuries go passing by. -As the weeks go marching on. -As the days go drifting by.
- As + UNIT OF TIME + MOTION VERB (as + some type-1 expressions above)
- It/that + (all) + take/last/go (on) + TIME EXPRESSION: for ages/a while/a long time/a short time/no time/a day/a month plus time units from type 1.
-It all took ages. -That lasted a while. -It took a long time. -That went on for a while. -It lasted a month. -It took a short time.
- Verbs that indicate beginning/end of process in a neutral way: initiate, start, begin, end, finish, complete, open up/close down. OR verbs with a higher emotional value or metaphorical sense: explode, break loose, collapse, die, be born, break up, fade. * This is an expandable list: we prefer to run a pilot with this reduced number of items first.
-The war started. -The war exploded/arrived/came/burst upon us/erupted/sped up/stopped.
-Swing began in the thirties. -Swing was born in the thirties.
-The application period opened up/closed down.
This sample lexicon will be expanded, but contains the typical construction types the program needs to handle.
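Several verb entries in the sample lexicon span more than one word ("sneak up", "come tumbling down"), so a plain set lookup on single lemmas will miss them. One way to handle this, sketched here with hypothetical function names, is to index entries by their first word and try the longest entry first. The sketch assumes the lexicon entries and the input are in the same form: if the input is lemmatized, an entry like "come tumbling down" would need to be stored in its lemmatized form as well. Entries with alternatives, such as "roll slowly/quickly past", are not covered.

```python
from typing import Dict, List, Tuple


def build_verb_index(entries: List[str]) -> Dict[str, List[Tuple[str, ...]]]:
    """Index lexicon entries by first word, longest entry first."""
    index: Dict[str, List[Tuple[str, ...]]] = {}
    for entry in entries:
        words = tuple(entry.lower().split())
        if words:
            index.setdefault(words[0], []).append(words)
    for candidates in index.values():
        candidates.sort(key=len, reverse=True)
    return index


def match_verb(lemmas: List[str], i: int,
               index: Dict[str, List[Tuple[str, ...]]]) -> int:
    """Return how many tokens the longest entry starting at position i covers.

    Returns 0 when no entry matches.
    """
    for words in index.get(lemmas[i], []):
        if tuple(lemmas[i:i + len(words)]) == words:
            return len(words)
    return 0


index = build_verb_index(["fly", "sneak up", "sneak", "come tumbling down", "roll"])
print(match_verb(["holiday", "sneak", "up", "on"], 1, index))
```

Trying the longest entry first ensures that "sneak up" is preferred over the shorter entry "sneak" when both are present.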
Web-based frontend
The files to be annotated can be assumed to be present in a database, let's say MongoDB, MySQL, or Solr.
Instead of a sentence, the user input here consists of semantic categories that act as components of multiword expressions.
Examples of such semantic categories are included in the backend task description at #1
For instance, they may include the semantic categories "UNIT OF TIME" and "MANNER-OF-MOTION VERB".
Do we use parameter files for the contents of these categories? If so, how do these parameter files interact with the mwetoolkit?
If we can use parameter files, can we have a number that is small enough to fit the options into a user interface?