
Commit 908a288

sgugger and thomwolf authored
Add new token classification example (#8340)
* Add new token classification example
* Remove txt file
* Add test
* With actual testing done
* Less warmup is better
* Update examples/token-classification/run_ner_new.py

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

* Address review comments
* Fix test
* Make Lysandre happy
* Last touches and rename
* Rename in tests
* Address review comments
* More run_ner -> run_ner_old

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
1 parent c7cb1aa commit 908a288

21 files changed · +652 −185 lines changed

examples/README.md

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ git checkout tags/v3.4.0
 |---|---|:---:|:---:|:---:|:---:|
 | [**`language-modeling`**](https://github.com/huggingface/transformers/tree/master/examples/language-modeling) | Raw text | ✅ | - | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb)
 | [**`text-classification`**](https://github.com/huggingface/transformers/tree/master/examples/text-classification) | GLUE, XNLI | ✅ | ✅ | ✅ | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb)
-| [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER | ✅ | ✅ | - | -
+| [**`token-classification`**](https://github.com/huggingface/transformers/tree/master/examples/token-classification) | CoNLL NER | ✅ | ✅ | ✅ | -
 | [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/master/examples/multiple-choice) | SWAG, RACE, ARC | ✅ | ✅ | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ViktorAlm/notebooks/blob/master/MPC_GPU_Demo_for_TF_and_PT.ipynb)
 | [**`question-answering`**](https://github.com/huggingface/transformers/tree/master/examples/question-answering) | SQuAD | ✅ | ✅ | - | -
 | [**`text-generation`**](https://github.com/huggingface/transformers/tree/master/examples/text-generation) | - | n/a | n/a | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)

examples/test_examples.py

Lines changed: 38 additions & 1 deletion
@@ -28,7 +28,13 @@

 SRC_DIRS = [
     os.path.join(os.path.dirname(__file__), dirname)
-    for dirname in ["text-generation", "text-classification", "language-modeling", "question-answering"]
+    for dirname in [
+        "text-generation",
+        "text-classification",
+        "token-classification",
+        "language-modeling",
+        "question-answering",
+    ]
 ]
 sys.path.extend(SRC_DIRS)

@@ -38,6 +44,7 @@
 import run_generation
 import run_glue
 import run_mlm
+import run_ner
 import run_pl_glue
 import run_squad

@@ -185,6 +192,36 @@ def test_run_mlm(self):
             result = run_mlm.main()
             self.assertLess(result["perplexity"], 42)

+    def test_run_ner(self):
+        stream_handler = logging.StreamHandler(sys.stdout)
+        logger.addHandler(stream_handler)
+
+        tmp_dir = self.get_auto_remove_tmp_dir()
+        testargs = f"""
+            run_ner.py
+            --model_name_or_path bert-base-uncased
+            --train_file tests/fixtures/tests_samples/conll/sample.json
+            --validation_file tests/fixtures/tests_samples/conll/sample.json
+            --output_dir {tmp_dir}
+            --overwrite_output_dir
+            --do_train
+            --do_eval
+            --warmup_steps=2
+            --learning_rate=2e-4
+            --per_gpu_train_batch_size=2
+            --per_gpu_eval_batch_size=2
+            --num_train_epochs=2
+        """.split()
+
+        if torch_device != "cuda":
+            testargs.append("--no_cuda")
+
+        with patch.object(sys, "argv", testargs):
+            result = run_ner.main()
+            self.assertGreaterEqual(result["eval_accuracy_score"], 0.75)
+            self.assertGreaterEqual(result["eval_precision"], 0.75)
+            self.assertLess(result["eval_loss"], 0.5)
+
     def test_run_squad(self):
         stream_handler = logging.StreamHandler(sys.stdout)
         logger.addHandler(stream_handler)
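
Note: to exercise just this new test locally, something like the following should work (a sketch; it assumes pytest is installed and that you run from the repository root, relying only on pytest's standard `-k` name filter):

```bash
# Run only the new NER example test; -v prints each selected test as it runs.
python -m pytest examples/test_examples.py -k test_run_ner -v
```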

examples/token-classification/README.md

Lines changed: 41 additions & 7 deletions
@@ -1,6 +1,40 @@
-## Named Entity Recognition
+## Token classification

-Based on the scripts [`run_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py) for Pytorch and
+Fine-tuning the library models for token classification tasks such as Named Entity Recognition (NER) or part-of-speech
+tagging (POS). The main script `run_ner.py` leverages the 🤗 Datasets library and the Trainer API. You can easily
+customize it to your needs if you need extra processing on your datasets.
+
+It will either run on a dataset hosted on our [hub](https://huggingface.co/datasets) or with your own text files for
+training and validation.
+
+The following example fine-tunes BERT on CoNLL-2003:
+
+```bash
+python run_ner.py \
+  --model_name_or_path bert-base-uncased \
+  --dataset_name conll2003 \
+  --output_dir /tmp/test-ner \
+  --do_train \
+  --do_eval
+```
+
+or you can just run the bash script `run.sh`.
+
+To run on your own training and validation files, use the following command:
+
+```bash
+python run_ner.py \
+  --model_name_or_path bert-base-uncased \
+  --train_file path_to_train_file \
+  --validation_file path_to_validation_file \
+  --output_dir /tmp/test-ner \
+  --do_train \
+  --do_eval
+```
+
+## Old version of the script
+
+Based on the scripts [`run_ner_old.py`](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner_old.py) for Pytorch and
 [`run_tf_ner.py`](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_tf_ner.py) for Tensorflow 2.

 The following examples are covered in this section:

@@ -69,7 +103,7 @@ export SEED=1
 To start training, just run:

 ```bash
-python3 run_ner.py --data_dir ./ \
+python3 run_ner_old.py --data_dir ./ \
 --labels ./labels.txt \
 --model_name_or_path $BERT_MODEL \
 --output_dir $OUTPUT_DIR \

@@ -87,7 +121,7 @@ If your GPU supports half-precision training, just add the `--fp16` flag. After

 #### JSON-based configuration file

-Instead of passing all parameters via commandline arguments, the `run_ner.py` script also supports reading parameters from a json-based configuration file:
+Instead of passing all parameters via commandline arguments, the `run_ner_old.py` script also supports reading parameters from a json-based configuration file:

 ```json
 {

@@ -106,7 +140,7 @@ Instead of passing all parameters via commandline arguments, the `run_ner.py` sc

 }
 ```

-It must be saved with a `.json` extension and can be used by running `python3 run_ner.py config.json`.
+It must be saved with a `.json` extension and can be used by running `python3 run_ner_old.py config.json`.

 #### Evaluation

@@ -250,7 +284,7 @@ cat data_wnut_17/train.txt data_wnut_17/dev.txt data_wnut_17/test.txt | cut -d "

 #### Run the Pytorch version

-Fine-tuning with the PyTorch version can be started using the `run_ner.py` script. In this example we use a JSON-based configuration file.
+Fine-tuning with the PyTorch version can be started using the `run_ner_old.py` script. In this example we use a JSON-based configuration file.

 This configuration file looks like:

@@ -274,7 +308,7 @@ This configuration file looks like:

 If your GPU supports half-precision training, please set `fp16` to `true`.

-Save this JSON-based configuration under `wnut_17.json`. The fine-tuning can be started with `python3 run_ner.py wnut_17.json`.
+Save this JSON-based configuration under `wnut_17.json`. The fine-tuning can be started with `python3 run_ner_old.py wnut_17.json`.

 #### Evaluation
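
Note: the `--train_file`/`--validation_file` flags above (and the `sample.json` fixture in the test) take a JSON-lines file with one example per line. The column names below are an assumption for illustration, not confirmed by this diff; the script is described as customizable, so adjust the names to your data:

```bash
# Hypothetical JSON-lines training data: one JSON object per line, with a token
# list and a matching tag list (column names "tokens"/"ner_tags" are assumed).
cat > path_to_train_file <<'EOF'
{"tokens": ["John", "lives", "in", "Berlin"], "ner_tags": ["B-PER", "O", "O", "B-LOC"]}
{"tokens": ["He", "works", "at", "Hugging", "Face"], "ner_tags": ["O", "O", "O", "B-ORG", "I-ORG"]}
EOF
```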

examples/token-classification/run.sh

Lines changed: 5 additions & 35 deletions
@@ -1,36 +1,6 @@
-## The relevant files are currently on a shared Google
-## drive at https://drive.google.com/drive/folders/1kC0I2UGl2ltrluI9NqDjaQJGw5iliw_J
-## Monitor for changes and eventually migrate to nlp dataset
-curl -L 'https://drive.google.com/uc?export=download&id=1Jjhbal535VVz2ap4v4r_rN1UEHTdLK5P' \
-| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > train.txt.tmp
-curl -L 'https://drive.google.com/uc?export=download&id=1ZfRcQThdtAR5PPRjIDtrVP7BtXSCUBbm' \
-| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > dev.txt.tmp
-curl -L 'https://drive.google.com/uc?export=download&id=1u9mb7kNJHWQCWyweMDRMuTFoOHOfeBTH' \
-| grep -v "^#" | cut -f 2,3 | tr '\t' ' ' > test.txt.tmp
-
-export MAX_LENGTH=128
-export BERT_MODEL=bert-base-multilingual-cased
-python3 scripts/preprocess.py train.txt.tmp $BERT_MODEL $MAX_LENGTH > train.txt
-python3 scripts/preprocess.py dev.txt.tmp $BERT_MODEL $MAX_LENGTH > dev.txt
-python3 scripts/preprocess.py test.txt.tmp $BERT_MODEL $MAX_LENGTH > test.txt
-cat train.txt dev.txt test.txt | cut -d " " -f 2 | grep -v "^$" | sort | uniq > labels.txt
-export OUTPUT_DIR=germeval-model
-export BATCH_SIZE=32
-export NUM_EPOCHS=3
-export SAVE_STEPS=750
-export SEED=1
-
 python3 run_ner.py \
-  --task_type NER \
-  --data_dir . \
-  --labels ./labels.txt \
-  --model_name_or_path $BERT_MODEL \
-  --output_dir $OUTPUT_DIR \
-  --max_seq_length $MAX_LENGTH \
-  --num_train_epochs $NUM_EPOCHS \
-  --per_gpu_train_batch_size $BATCH_SIZE \
-  --save_steps $SAVE_STEPS \
-  --seed $SEED \
-  --do_train \
-  --do_eval \
-  --do_predict
+  --model_name_or_path bert-base-uncased \
+  --dataset_name conll2003 \
+  --output_dir /tmp/test-ner \
+  --do_train \
+  --do_eval
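
Note: with the download and preprocessing steps gone, the script is self-contained; a minimal way to launch it (assuming the example requirements, e.g. the 🤗 Datasets library, are installed) is:

```bash
# Run from within the examples/token-classification directory of a checkout.
cd examples/token-classification
bash run.sh
```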

examples/token-classification/run_chunk.sh

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ export NUM_EPOCHS=3
 export SAVE_STEPS=750
 export SEED=1

-python3 run_ner.py \
+python3 run_ner_old.py \
   --task_type Chunk \
   --data_dir . \
   --model_name_or_path $BERT_MODEL \
