# LUKE-NER

This repository provides code for NER training and inference using LUKE.

Features:

- Our implementation relies on the `Trainer` class of huggingface/transformers (whereas the official repository provides examples using AllenNLP).
- This repository improves preprocessing for non-space-delimited languages.
- The code is compatible with fine-tuned LUKE NER models available on the Hugging Face Hub (a loading sketch follows this list).
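For example, a LUKE NER model fine-tuned on CoNLL '03 can be queried directly through the transformers span-classification API, independently of the scripts in this repository. The snippet below is a minimal sketch following the documented `LukeForEntitySpanClassification` usage; the sentence and candidate span are illustrative.

```python
# Minimal sketch: querying a fine-tuned LUKE NER model from the Hugging Face Hub
# via the transformers span-classification API (independent of this repository).
import torch
from transformers import LukeForEntitySpanClassification, LukeTokenizer

model_name = "studio-ousia/luke-large-finetuned-conll-2003"
tokenizer = LukeTokenizer.from_pretrained(model_name)
model = LukeForEntitySpanClassification.from_pretrained(model_name)

text = "She graduated from NAIST."
entity_spans = [(19, 24)]  # candidate mention spans as character offsets ("NAIST")

inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_spans, num_labels)

for (start, end), label_id in zip(entity_spans, logits[0].argmax(dim=-1).tolist()):
    print(text[start:end], model.config.id2label[label_id])
```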

## Usage

### Installation

```sh
$ git clone https://github.com/naist-nlp/luke-ner.git
$ cd luke-ner
$ python -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt
```

### Dataset preparation

Datasets must be in the JSON Lines format, where each line represents a document consisting of examples, as in the following example (pretty-printed here for readability):

```json
{
  "id": "doc-001",
  "examples": [
    {
      "id": "s1",
      "text": "She graduated from NAIST.",
      "entities": [
        {
          "start": 19,
          "end": 24,
          "label": "ORG"
        }
      ],
      "word_positions": [[0, 3], [4, 13], [14, 18], [19, 24], [24, 25]]
    }
  ]
}
```

For each example, the surrounding examples in the same document are used to extend the context. The `word_positions` field is optional and may be `null`; when provided, it is used to enforce word boundaries on the tokenizer.
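As a concrete illustration, here is a minimal sketch (standard library only, with a hypothetical output path `data/sample.jsonl`) that builds one document in this format and derives `word_positions` with a crude regex-based word segmentation; any segmenter, e.g. for non-space-delimited languages, could be substituted:

```python
# Sketch: writing one document in the expected JSON Lines format.
# The regex segmentation below is only for illustration; plug in any
# word segmenter (e.g. for non-space-delimited languages) instead.
import json
import re

text = "She graduated from NAIST."
example = {
    "id": "s1",
    "text": text,
    "entities": [{"start": 19, "end": 24, "label": "ORG"}],
    # Character offsets of words; this field is optional and may be null.
    "word_positions": [[m.start(), m.end()] for m in re.finditer(r"\w+|[^\w\s]", text)],
}

# Sanity check: every entity span must lie inside the text.
for ent in example["entities"]:
    assert 0 <= ent["start"] < ent["end"] <= len(text)

doc = {"id": "doc-001", "examples": [example]}
with open("data/sample.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(doc, ensure_ascii=False) + "\n")  # one document per line
```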

For CoNLL '03 datasets, you can use `data/convert_conll2003_to_jsonl.py`:

```sh
$ python data/convert_conll2003_to_jsonl.py eng.train eng.train.jsonl
$ python data/convert_conll2003_to_jsonl.py eng.testa eng.testa.jsonl
$ python data/convert_conll2003_to_jsonl.py eng.testb eng.testb.jsonl
```
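To make the target format concrete, the sketch below shows the general idea of such a conversion for a single BIO-tagged sentence (tokens joined by single spaces, tags assumed to be in IOB2/BIO form). It is not the shipped script, which handles full CoNLL '03 files, `-DOCSTART-` document boundaries, and the original tagging scheme.

```python
# Illustrative sketch only (not data/convert_conll2003_to_jsonl.py): turn one
# BIO-tagged sentence into the example structure shown above.
def bio_sentence_to_example(tokens, tags, example_id):
    # Join tokens with single spaces and record each word's character offsets.
    text, word_positions = "", []
    for i, token in enumerate(tokens):
        if i > 0:
            text += " "
        word_positions.append([len(text), len(text) + len(token)])
        text += token

    # Collect entities as character spans from the BIO tags
    # (malformed stray I- tags are simply ignored here).
    entities, current = [], None
    for (start, end), tag in zip(word_positions, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = {"start": start, "end": end, "label": tag[2:]}
        elif tag.startswith("I-") and current and tag[2:] == current["label"]:
            current["end"] = end
        else:
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)

    return {"id": example_id, "text": text, "entities": entities,
            "word_positions": word_positions}


# Example: yields one ORG entity covering "NAIST" in the space-joined text.
example = bio_sentence_to_example(
    ["She", "graduated", "from", "NAIST", "."],
    ["O", "O", "O", "B-ORG", "O"],
    "s1",
)
```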

### Fine-tuning

```sh
torchrun --nproc_per_node 4 src/main.py \
    --do_train \
    --do_eval \
    --do_predict \
    --train_file data/eng.train.jsonl \
    --validation_file data/eng.testa.jsonl \
    --test_file data/eng.testb.jsonl \
    --model "studio-ousia/luke-large-lite" \
    --output_dir ./output/ \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 8 \
    --max_entity_length 64 \
    --max_mention_length 16 \
    --save_strategy epoch \
    --pretokenize false  # you can enable this to use word boundaries for tokenization
```

### Evaluation/Prediction

```sh
torchrun --nproc_per_node 4 src/main.py \
    --do_eval \
    --do_predict \
    --validation_file data/eng.testa.jsonl \
    --test_file data/eng.testb.jsonl \
    --model PATH_TO_YOUR_MODEL \
    --output_dir ./output/ \
    --per_device_eval_batch_size 8 \
    --max_entity_length 64 \
    --max_mention_length 16 \
    --pretokenize false
```

## Performance

### CoNLL '03 English (test)

| Model | Precision | Recall | F1 |
|---|---|---|---|
| LUKE (paper) | - | - | 94.3 |
| studio-ousia/luke-large-finetuned-conll-2003 on notebook | 93.86 | 94.53 | 94.20 |
| studio-ousia/luke-large-finetuned-conll-2003 on script | 94.58 | 94.65 | 94.61 |
| studio-ousia/luke-large-finetuned-conll-2003 on our code | 93.98 | 94.67 | 94.33 |
| studio-ousia/luke-large-lite fine-tuned with our code | 93.66 | 94.79 | 94.22 |
| mLUKE (paper) | - | - | 94.0 |
| studio-ousia/mluke-large-lite-finetuned-conll-2003 on notebook* | 94.23 | 94.23 | 94.23 |
| studio-ousia/mluke-large-lite-finetuned-conll-2003 on script* | 94.33 | 93.76 | 94.05 |
| studio-ousia/mluke-large-lite-finetuned-conll-2003 on our code* | 93.76 | 93.92 | 93.84 |
| studio-ousia/mluke-large-lite fine-tuned with our code | 94.10 | 94.49 | 94.29 |

The performance differences stem from differences in the unit of input used for tokenization. Note that the runs marked with * use slightly tweaked code when evaluating `studio-ousia/mluke-large-lite-finetuned-conll-2003`, because that model was fine-tuned with an erroneous `entity_attention_mask` (see issues #166 and #172 for details).