kaarthic/mallet-eval
20110422 MALLET-EVAL PROJECT

GENERAL

This is a project for evaluating MALLET (MAchine Learning for LanguagE Toolkit). MALLET's binaries and source code are not included; you can check them out from this site: http://mallet.cs.umass.edu/

This distribution contains only sample annotation data and scripts for converting, importing and evaluating. The articles in the two corpora are not included here for copyright reasons; that is why you need their CDs to build the complete data sets. We provide two sample corpora: Penn Treebank Sample (a 5% fragment of the Penn Treebank) and HIT CIR LTP Corpora Sample (a 10% fragment of the whole corpora):

http://web.mit.edu/course/6/6.863/OldFiles/share/data/corpora/treebank/
http://ir.hit.edu.cn/demo/ltp/Sharing_Plan.htm

BUILDING THE TRAIN AND TEST DATA FILES

In order to obtain the data files you need to perform four steps:

1. Get a local copy of the mallet-eval repository with this command:

       hg clone https://mallet-eval.googlecode.com/hg/ mallet-eval

2. Set up the $MALLET_HOME environment variable:

       export MALLET_HOME=/path/to/mallet/

3. Train and test with the provided Chunking, POS Tagging and Named Entity Recognition data (chunking/, pos-tagging/ and ner/).

4. Evaluate the output:

   4a. (Chunking)                 ./conlleval < chunking/conlleval.out
   4b. (POS-Tagging)              cd pos-tagging && ./verify.py
   4c. (Named Entity Recognition) ./chn-conlleval < ner/conlleval.out

The results for chunking are:

processed 47377 tokens with 23852 phrases; found: 23682 phrases; correct: 21441.
accuracy:  93.97%; precision:  90.54%; recall:  89.89%; FB1:  90.21
             ADJP: precision:  72.35%; recall:  63.93%; FB1:  67.88  387
             ADVP: precision:  78.61%; recall:  75.98%; FB1:  77.28  837
            CONJP: precision:  40.00%; recall:  44.44%; FB1:  42.11  10
             INTJ: precision:  50.00%; recall:  50.00%; FB1:  50.00  2
              LST: precision:   0.00%; recall:   0.00%; FB1:   0.00  2
               NP: precision:  90.05%; recall:  89.57%; FB1:  89.81  12355
               PP: precision:  94.97%; recall:  96.88%; FB1:  95.92  4908
              PRT: precision:  71.84%; recall:  69.81%; FB1:  70.81  103
             SBAR: precision:  89.01%; recall:  78.69%; FB1:  83.53  473
               VP: precision:  91.55%; recall:  90.51%; FB1:  91.03  4605

DATA FORMAT

The data files contain one word per line. Empty lines mark sentence boundaries, and a line containing the keyword -DOCSTART- has been added at the beginning of each article to mark article boundaries. Each non-empty line contains the following tokens:

1. the current word
2. the lemma of the word (German only)
3. the part-of-speech (POS) tag generated by a tagger
4. the chunk tag generated by a text chunker
5. the named entity tag given by human annotators

The tagger and chunker for English are roughly similar to the ones used in the memory-based shallow parser demo available at http://ilk.uvt.nl/. German POS and chunk information has been generated by the TreeTagger from the University of Stuttgart: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/

In order to simulate a real natural language processing environment, the POS tags and chunk tags have not been checked. This means that they will contain errors. If you have access to annotation software with superior performance, you may replace these tags with your own.

The chunk tags and the named entity tags use the IOB1 format. This means that in general words inside an entity receive the tag I-TYPE to denote that they are Inside an entity of type TYPE.
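The column layout and IOB1 tags described above can be read with a short script. This is a minimal sketch, not one of the distribution's own tools: the sample lines and the function name `iob1_spans` are illustrative (English files carry four columns, since the lemma column is German-only).

```python
# Sketch: parse four-column English data (word, POS, chunk, NE tag) and
# group the IOB1 named-entity tags into entity spans. Illustrative only;
# the sample sentence is not from the shipped corpora.

def iob1_spans(pairs):
    """Collect (type, [words]) entity spans from (word, ne_tag) pairs in IOB1."""
    spans = []
    prev = "O"
    for word, tag in pairs:
        if tag == "O":
            prev = tag
            continue
        prefix, etype = tag.split("-", 1)
        _, _, prev_type = prev.partition("-")
        # In IOB1 a new entity starts on B-TYPE, on I-TYPE after O,
        # or on I-TYPE after an entity of a different type.
        if prefix == "B" or prev == "O" or prev_type != etype:
            spans.append((etype, [word]))
        else:
            spans[-1][1].append(word)   # continue the current entity
        prev = tag
    return spans

sample = """\
U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC"""

pairs = []
for line in sample.splitlines():
    word, pos, chunk, ne = line.split()   # the four columns described above
    pairs.append((word, ne))

print(iob1_spans(pairs))
# [('ORG', ['U.N.']), ('PER', ['Ekeus']), ('LOC', ['Baghdad'])]
```

In a real file you would also skip empty lines (sentence boundaries) and -DOCSTART- lines (article boundaries) before splitting columns.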
Whenever two entities of the same type immediately follow each other, the first word of the second entity receives the tag B-TYPE rather than I-TYPE, in order to show that a new entity starts at that word.

The raw data has the same format as the training and test material, but the final column has been omitted.

There are word lists for English (extracted from the training data), German (extracted from the training data), and Dutch in the lists directory. You can probably use the Dutch person names (PER) for the English data as well. Feel free to use any other external data sources that you might have access to.

Max Lv <lch@fudan.edu.cn>
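For reference, the aggregate scores printed by conlleval can be recomputed from the raw counts in its summary line. The sketch below uses the chunking counts reported earlier ("found: 23682 phrases; correct: 21441" out of 23852 gold phrases); the function name is my own, not part of the scripts in this repository.

```python
# Recompute conlleval-style phrase scores from raw counts.
# found   = phrases proposed by the tagger
# correct = proposed phrases that exactly match a gold phrase
# in_gold = phrases in the gold annotation

def phrase_scores(found, correct, in_gold):
    """Return (precision, recall, FB1) as percentages."""
    precision = 100.0 * correct / found
    recall = 100.0 * correct / in_gold
    fb1 = 2 * precision * recall / (precision + recall)
    return precision, recall, fb1

p, r, f = phrase_scores(found=23682, correct=21441, in_gold=23852)
print(f"precision: {p:.2f}%; recall: {r:.2f}%; FB1: {f:.2f}")
# precision: 90.54%; recall: 89.89%; FB1: 90.21
```

These match the overall chunking figures in the results listing above; FB1 is the harmonic mean of precision and recall.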
About
Automatically exported from code.google.com/p/mallet-eval