This project aims to establish a universal phrase tagset for multilingual treebanks, especially constituent-structure treebanks.
For more information:
Many syntactic treebanks and parser toolkits have been developed over the past twenty years, including dependency-structure parsers and phrase-structure parsers. Phrase-structure parsers usually use a different phrase tagset for each language, which is inconvenient for multilingual research. This project designs a refined universal phrase tagset that contains 9 commonly used phrase categories. The mapping currently covers 25 constituent treebanks and 21 languages. Experiments show that the universal phrase tagset generally reduces the cost of the parsing models and can even improve parsing accuracy.
The 25 treebanks studied cover 21 languages: Arabic, Catalan, Chinese, Danish, English, Estonian, French, German, Hebrew, Hindi, Hungarian, Icelandic, Italian, Japanese, Korean, Portuguese, Spanish, Swedish, Thai, Urdu, and Vietnamese. Mappings for treebanks in other languages are under development.
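As a rough illustration of what a tagset mapping does, the sketch below relabels the phrase nodes of a bracketed constituent tree with universal tags. Everything in it is an assumption made for illustration: the function names `map_phrase_tag` and `map_tree`, the universal tag names, and the small Penn-Treebank-style table are hypothetical and are not the project's actual mapping tables, which should be taken from the paper and the released files.

```python
import re

# Hypothetical excerpt of a treebank-to-universal mapping (illustrative only;
# not the project's real table). Unlisted tags fall back to a catch-all tag.
PTB_TO_UNIVERSAL = {
    "NP": "NP", "VP": "VP", "ADJP": "AJP", "ADVP": "AVP", "PP": "PP",
    "S": "S", "SBAR": "S", "CONJP": "CONJP", "UCP": "COP",
}
FALLBACK = "X"  # assumed catch-all category for unmapped phrase tags


def map_phrase_tag(tag, mapping=PTB_TO_UNIVERSAL):
    """Map one treebank-specific phrase tag to its universal tag."""
    # Drop function-tag suffixes such as "-SBJ" or "=1" before the lookup.
    base = re.split(r"[-=]", tag, maxsplit=1)[0]
    return mapping.get(base, FALLBACK)


def map_tree(bracketed, mapping=PTB_TO_UNIVERSAL):
    """Relabel phrase nodes in a bracketed tree string.

    A label is treated as a phrase node when its first child is another
    bracket; preterminals such as "(DT the)" are left unchanged.
    """
    phrase_label = re.compile(r"\(([^\s()]+)(?=\s*\()")
    return phrase_label.sub(
        lambda m: "(" + map_phrase_tag(m.group(1), mapping), bracketed
    )


if __name__ == "__main__":
    tree = "(S (NP-SBJ (DT the) (NN cat)) (VP (VBZ sleeps)) (ADVP (RB soundly)))"
    print(map_tree(tree))
```

Because the mapping is passed in as a plain dictionary, supporting another treebank under this sketch would only mean supplying a different table; the tree-rewriting logic stays the same.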
References:
A Universal Phrase Tagset for Multilingual Treebanks. Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Yi Lu, Liangye He, and Liang Tian. In M. Sun et al. (Eds.): CCL and NLP-NABD 2014, LNAI 8801, pp. 247–258, 2014. © Springer International Publishing Switzerland 2014. http://link.springer.com/chapter/10.1007%2F978-3-319-12277-9_22
Phrase Tagset Mapping for French and English Treebanks and Its Application in Machine Translation Evaluation. Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Liangye He, Shuo Li, and Lynn Ling Zhu. In Iryna Gurevych, Chris Biemann, and Torsten Zesch (Eds.): Language Processing and Knowledge in the Web, Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology (GSCL 2013), Darmstadt, Germany, September 25–27, 2013, LNCS Vol. 8105. http://link.springer.com/chapter/10.1007/978-3-642-40722-2_13 Open tool: https://github.com/aaronlifenghan/aaron-project-hppr
@incollection{han2014universal,
  title={A Universal Phrase Tagset for Multilingual Treebanks},
  author={Han, Aaron Li-Feng and Wong, Derek F and Chao, Lidia S and Lu, Yi and He, Liangye and Tian, Liang},
  booktitle={Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data},
  pages={247--258},
  year={2014},
  publisher={Springer International Publishing}
}
Slides:
2, http://www.slideshare.net/AaronHanLiFeng/pptccl-a-universal-phrase-tagset-for-multilingual-treebanks
Code: