This repository hosts code and models for the named entity recognition (NER) work performed at Reykjavik University in 2019-2020.
The models presented here have been trained on the Icelandic MIM-GOLD-NER named entity corpus, annotated as part of this work.
Implemented here are three different NER models and a voting system combining the output of the three models. An evaluation script outputs the F1 score of each of the three models, given a CoNLL file with correct NE labels.
The methods used for training are the following:
- A Conditional Random Fields NER model – implementation based on Passos et al. 2014
- Ixa-pipes-ner, a perceptron model with shallow word features and externally trained word clusters – Agerri & Rigau 2017
- NeuroNER, a Bi-LSTM RNN with pre-trained word embeddings (GloVe) – Dernoncourt et al. 2017
- CombiTagger, an ensemble voting system – Henrich et al. 2009
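To illustrate the idea of combining the three taggers, here is a minimal sketch of per-token majority voting. It is not CombiTagger's actual algorithm: the function name `majority_vote`, the tie-breaking rule (fall back to the first tagger's label) and the example labels are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-token labels from several taggers by simple majority.

    `predictions` is a list of label sequences, one per tagger, all of the
    same length. On a tie, the first tagger's label is used (an illustrative
    choice, not taken from CombiTagger).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(seq) != length for seq in predictions):
        raise ValueError("all taggers must label the same number of tokens")

    combined = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top_label, top_count = counts.most_common(1)[0]
        # Labels sharing the highest count; more than one means a tie.
        tied = [label for label, count in counts.items() if count == top_count]
        combined.append(top_label if len(tied) == 1 else labels[0])
    return combined


# Hypothetical outputs of the three models for a four-token sentence.
crf = ["B-Person", "I-Person", "O", "B-Location"]
ixa = ["B-Person", "O", "O", "B-Location"]
neuro = ["B-Person", "I-Person", "O", "B-Organization"]
print(majority_vote([crf, ixa, neuro]))
```

With three taggers, a tie only occurs when all three disagree, so the order in which the models are passed decides which one acts as the fallback.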
- Clone this repo:
  $ git clone https://github.com/cadia-lvl/NER
  $ cd NER
- Install sklearn-crfsuite with pip.
- Install TensorFlow version 1.14 for Python 3.
- Install the Greynir package, version 2.10.1 for Python 3.
- Install the pandas Python package.
- Install NeuroNER according to the installation guide at https://github.com/Franck-Dernoncourt/NeuroNER
- Install ixa-pipe-nerc (https://github.com/ixa-ehu/ixa-pipe-nerc) anywhere, following its guide. Create a symbolic link called nerc in the ixa-pipe directory:
  $ ln -s /path/to/ixa-pipe-nerc ixa-pipe/nerc
- Install CombiTagger (https://github.com/hrafnl/CombiTagger) anywhere, following its guide. Create a symbolic link to its directory at the root of this repository:
  $ ln -s /path/to/CombiTagger CombiTagger
- Download the trained ixa-pipe and CRF models, along with the gazetteers from here. Extract anywhere, and edit the paths in the config.ini file accordingly.
- Download the word embeddings and the trained model for NeuroNER, extract anywhere, and update token_pretrained_embedding_filepath and pretrained_model_folder in the parameters.ini file accordingly.
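The path edits in the last two steps can be made by hand, or scripted. The following is a minimal sketch using Python's standard `configparser`. The helper name `update_ini_paths`, the section name `demo_section` and the `/path/to/...` values are hypothetical; only the two key names come from the step above. Because the section layout of the real file is not assumed, the helper updates a key in whichever section already defines it.

```python
import configparser
import tempfile
from pathlib import Path


def update_ini_paths(ini_path, updates):
    """Set each key in `updates` in every section that already defines it.

    Returns the set of keys that were not found in any section.
    Note: configparser drops comments when it rewrites the file.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the original case of key names
    with open(ini_path, encoding="utf-8") as handle:
        parser.read_file(handle)

    missing = set(updates)
    for section in parser.sections():
        for key, value in updates.items():
            if parser.has_option(section, key):
                parser.set(section, key, value)
                missing.discard(key)

    with open(ini_path, "w", encoding="utf-8") as handle:
        parser.write(handle)
    return missing


# Demonstration on a throwaway file; the section name is made up.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "parameters.ini"
    demo.write_text(
        "[demo_section]\n"
        "token_pretrained_embedding_filepath = old\n"
        "pretrained_model_folder = old\n",
        encoding="utf-8",
    )
    not_found = update_ini_paths(
        demo,
        {
            "token_pretrained_embedding_filepath": "/path/to/embeddings.txt",
            "pretrained_model_folder": "/path/to/trained_model",
        },
    )
    print(demo.read_text(encoding="utf-8"))
    print("keys not found:", not_found)
```

Since rewriting the file discards its comments, it may be preferable to keep a backup of the original parameters.ini, or to edit the two values by hand.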
The evaluation script run_combined_system.sh evaluates the output of the three models and CombiTagger. It takes a .tsv file in CoNLL format (with gold labels) as its argument.
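For readers unfamiliar with how an F1 score is obtained from CoNLL-style labels, here is a minimal sketch of entity-level F1. It assumes BIO-style labels (`B-`/`I-` prefixes and `O`) and exact-match scoring; the function names and example labels are hypothetical, and the repository's own evaluation may differ in its details.

```python
def extract_spans(labels):
    """Return the set of (start, end, type) entity spans in a BIO sequence.

    `end` is exclusive. An I- tag that does not continue an entity of the
    same type is treated as the start of a new entity.
    """
    spans = set()
    start, kind = None, None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel closes the last span
        continues = label.startswith("I-") and kind == label[2:]
        if continues:
            continue
        if kind is not None:
            spans.add((start, i, kind))
        if label.startswith(("B-", "I-")):
            start, kind = i, label[2:]
        else:
            start, kind = None, None
    return spans


def f1_score(gold_labels, predicted_labels):
    """Entity-level F1: a predicted span counts only if boundaries and type match."""
    gold = extract_spans(gold_labels)
    predicted = extract_spans(predicted_labels)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["B-Person", "I-Person", "O", "B-Location"]
pred = ["B-Person", "O", "O", "B-Location"]
print(f1_score(gold, pred))  # one of two entities matches exactly: 0.5
```

Scoring whole entities rather than single tokens means that a prediction with a wrong boundary, such as the truncated person name above, earns no partial credit.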
This project is licensed under the Apache License 2.0 - see the [LICENSE](https://github.com/cadia-lvl/NER/blob/master/LICENSE) file for details.
Reykjavik University
- Ásmundur Alma Guðjónsson asmundur10@ru.is
- Svanhvít Lilja Ingólfsdóttir svanhviti16@ru.is
- Hrafn Loftsson, Associate Professor hrafn@ru.is
This project was funded by the Icelandic Strategic Research and Development Programme for Language Technology 2019, grant no. 180027-5301.