Models for Entity Recognition

Entity recognition models for the 2019 Datagrand Cup: Text Information Extraction Challenge.

Requirements

Components of Entity Recognition

Word Embedding

  • Static Word Embedding: word2vec, GloVe
  • Contextualized Word Representation: ELMo (_elmo); refer to the section "How to train a pure token-level ELMo from scratch?" below
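To make the "static" distinction concrete, here is a minimal pure-Python sketch of a static lookup table, which assigns one fixed vector per word regardless of context (the vectors below are toy values for illustration, not the repo's trained embeddings):

```python
import math

# Toy static embedding table (made-up values for illustration only);
# in this repo the vectors would come from word2vec or GloVe training.
embeddings = {
    "company": [0.9, 0.1, 0.3],
    "firm":    [0.8, 0.2, 0.4],
    "river":   [0.1, 0.9, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Related words should have higher similarity than unrelated ones.
print(cosine(embeddings["company"], embeddings["firm"]))   # high
print(cosine(embeddings["company"], embeddings["river"]))  # lower
```

A contextualized representation such as ELMo, by contrast, produces a different vector for the same word depending on its sentence.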

Sentence Representation

  • BiLSTM
  • DGCNN (dilated gated CNN)
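As an illustration of the dilation idea behind DGCNN, here is a minimal scalar sketch of a gated dilated convolution block; the real model operates on vector channels with learned kernels, and the kernel values below are made up:

```python
import math

def dilated_conv1d(seq, kernel, dilation):
    """Same-length 1D convolution over a scalar sequence with the given
    dilation, zero-padded at the edges. Kernel width 3 for simplicity."""
    assert len(kernel) == 3
    out = []
    for i in range(len(seq)):
        total = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - 1) * dilation  # taps at offsets -d, 0, +d
            if 0 <= j < len(seq):
                total += w * seq[j]
        out.append(total)
    return out

def gated_block(seq, conv_kernel, gate_kernel, dilation):
    """A DGCNN-style gated block: conv(x) * sigmoid(gate(x))."""
    conv = dilated_conv1d(seq, conv_kernel, dilation)
    gate = dilated_conv1d(seq, gate_kernel, dilation)
    return [c * (1.0 / (1.0 + math.exp(-g))) for c, g in zip(conv, gate)]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
# Stacking blocks with dilations 1, 2, 4 grows the receptive field
# exponentially while keeping the number of layers small.
for d in (1, 2, 4):
    x = gated_block(x, [0.5, 1.0, 0.5], [0.1, 0.1, 0.1], d)
```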

Inference

  • sequence labeling (sequence_labeling.py)
    • CRF
    • softmax
  • predict start/end index of entities (_pointer)
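As a sketch of pointer-style inference, the following toy decoder pairs each predicted start position with the nearest predicted end at or after it. This is a common greedy decoding scheme; the repo's exact decoding rule may differ:

```python
def decode_pointer(start_probs, end_probs, threshold=0.5, max_len=10):
    """Turn per-token start/end boundary probabilities into entity spans
    by greedily matching each start with the nearest following end."""
    starts = [i for i, p in enumerate(start_probs) if p >= threshold]
    ends = [i for i, p in enumerate(end_probs) if p >= threshold]
    spans = []
    for s in starts:
        candidates = [e for e in ends if s <= e < s + max_len]
        if candidates:
            spans.append((s, candidates[0]))
    return spans

# tokens:        0    1    2    3    4
start_probs = [0.9, 0.1, 0.1, 0.8, 0.1]
end_probs   = [0.1, 0.9, 0.1, 0.8, 0.2]
print(decode_pointer(start_probs, end_probs))  # [(0, 1), (3, 3)]
```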

Note

According to the three components described above, there are actually 12 possible models in all (2 embeddings × 2 sentence representations × 3 inference methods). However, this repo implements only the following 6 models:

  • Static Word Embedding × (BiLSTM, DGCNN) × (CRF, softmax): sequence_labeling.py
  • (Static Word Embedding, ELMo) × BiLSTM × pointer: bilstm_pointer.py and bilstm_pointer_elmo.py

The other models can be implemented with only minor code changes.
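The model count above can be verified by enumerating the component combinations; the labels here are descriptive, not identifiers from the repo:

```python
from itertools import product

embeddings = ["static (word2vec/GloVe)", "ELMo"]
encoders = ["BiLSTM", "DGCNN"]
inference = ["CRF", "softmax", "pointer"]

# All 12 combinations of the three components.
combos = list(product(embeddings, encoders, inference))
print(len(combos))  # 12

# The 6 implemented ones: static x (BiLSTM, DGCNN) x (CRF, softmax),
# plus (static, ELMo) x BiLSTM x pointer.
implemented = {
    (emb, enc, inf)
    for emb, enc, inf in combos
    if (emb.startswith("static") and inf in ("CRF", "softmax"))
    or (enc == "BiLSTM" and inf == "pointer")
}
print(len(implemented))  # 6
```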

How to run

  1. Prepare data:
    1. download the official competition data to the data folder
    2. get sequence-tagging train/dev/test data: bin/trans_data.py
    3. prepare vocab and tag files
      • vocab: word vocabulary, one word per line, in the format word word_count
      • tag: BIOES NER tag list, one tag per line (O in the first line)
    4. follow step 2 or 3 below
      • step 2 is for models using static word embeddings
      • step 3 is for the model using ELMo
  2. Run a model with static word embeddings, taking word2vec as an example:
    1. train word2vec: bin/train_w2v.py
    2. modify config.py
    3. run python sequence_labeling.py [bilstm/dgcnn] [softmax/crf] or python bilstm_pointer.py (remember to modify config.model_name before a new run, or the old model will be overwritten)
  3. Or run the model with ELMo embeddings (the contextualized sentence representation of each train/dev/test sentence is dumped to a file first and then loaded during training and evaluation; ELMo is not run on the fly):
    1. follow the instructions described here to get contextualized sentence representations for the train_full/dev/test data from pre-trained ELMo weights
    2. modify config.py
    3. run python bilstm_pointer_elmo.py
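For reference, converting entity spans into BIOES tags (the tagging scheme used in data preparation above) can be sketched as follows; the actual bin/trans_data.py may use a different input format:

```python
def spans_to_bioes(n_tokens, spans):
    """Convert entity spans (start, end inclusive, label) into a BIOES tag
    sequence: S- for single-token entities, B-/I-/E- for multi-token ones,
    and O for tokens outside any entity."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if start == end:
            tags[start] = "S-" + label
        else:
            tags[start] = "B-" + label
            for i in range(start + 1, end):
                tags[i] = "I-" + label
            tags[end] = "E-" + label
    return tags

print(spans_to_bioes(5, [(0, 0, "a"), (2, 4, "b")]))
# ['S-a', 'O', 'B-b', 'I-b', 'E-b']
```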

How to train a pure token-level ELMo from scratch?

  • Just follow the official instructions described here.
  • Some notes:
    • to train a token-level language model, modify bin/train_elmo.py, changing
      vocab = load_vocab(args.vocab_file, 50)
      to vocab = load_vocab(args.vocab_file, None)
      (the second argument is the maximum number of characters per token; None switches to a purely token-level vocabulary)
    • modify n_train_tokens
    • remove char_cnn in options
    • modify lstm.dim/lstm.projection_dim as you wish.
    • example configuration: n_gpus=2, n_train_tokens=94114921, lstm['dim']=2048, projection_dim=256, n_epochs=10. Training took about 17 hours on two GTX 1080 Ti GPUs.
  • After finishing the last step of the instructions, you can refer to the script dump_token_level_bilm_embeddings.py to dump dynamic sentence representations for your own dataset.
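A small helper like the following (hypothetical, not part of the repo) can compute the token count to plug into n_train_tokens; whether the official count also includes sentence-boundary markers is worth double-checking against the bilm documentation:

```python
import os
import tempfile

def count_train_tokens(paths):
    """Count whitespace-separated tokens across the given training files.
    Intended as a quick way to fill in n_train_tokens for bin/train_elmo.py."""
    total = 0
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                total += len(line.split())
    return total

# Quick demonstration on a throwaway file with 5 tokens:
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as f:
    f.write("12 34 56\n78 90\n")
    tmp_path = f.name
print(count_train_tokens([tmp_path]))  # 5
os.unlink(tmp_path)
```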

