# How to fine-tune BERT on the German LER dataset
Based on the scripts for token classification on the GermEval 2014 (German NER) dataset. The German LER dataset provides two annotation granularities: 19 fine-grained and 7 coarse-grained classes. All labels are in BIO format. Distribution of the coarse- and fine-grained classes in the dataset:
| # | Coarse-grained class | Count | % | # | Fine-grained class | Count | % |
|---|---|---|---|---|---|---|---|
| 1 | PER (Person) | 3,377 | 6.30 | 1 | PER (Person) | 1,747 | 3.26 |
| | | | | 2 | RR (Judge) | 1,519 | 2.83 |
| | | | | 3 | AN (Lawyer) | 111 | 0.21 |
| 2 | LOC (Location) | 2,468 | 4.60 | 4 | LD (Country) | 1,429 | 2.66 |
| | | | | 5 | ST (City) | 705 | 1.31 |
| | | | | 6 | STR (Street) | 136 | 0.25 |
| | | | | 7 | LDS (Landscape) | 198 | 0.37 |
| 3 | ORG (Organization) | 7,915 | 14.76 | 8 | ORG (Organization) | 1,166 | 2.17 |
| | | | | 9 | UN (Company) | 1,058 | 1.97 |
| | | | | 10 | INN (Institution) | 2,196 | 4.09 |
| | | | | 11 | GRT (Court) | 3,212 | 5.99 |
| | | | | 12 | MRK (Brand) | 283 | 0.53 |
| 4 | NRM (Legal norm) | 20,816 | 38.81 | 13 | GS (Law) | 18,520 | 34.53 |
| | | | | 14 | VO (Ordinance) | 797 | 1.49 |
| | | | | 15 | EUN (European legal norm) | 1,499 | 2.79 |
| 5 | REG (Case-by-case regulation) | 3,470 | 6.47 | 16 | VS (Regulation) | 607 | 1.13 |
| | | | | 17 | VT (Contract) | 2,863 | 5.34 |
| 6 | RS (Court decision) | 12,580 | 23.46 | 18 | RS (Court decision) | 12,580 | 23.46 |
| 7 | LIT (Legal literature) | 3,006 | 5.60 | 19 | LIT (Legal literature) | 3,006 | 5.60 |
| | Total | 53,632 | 100 | | Total | 53,632 | 100 |
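To make the BIO scheme concrete, here is a small sketch that groups BIO-tagged tokens back into entity spans. The example sentence and its labels are made up for illustration; only the tag names (GRT, GS) come from the table above.

```python
def bio_to_spans(tokens, labels):
    """Collect (entity_type, entity_text) spans from BIO-tagged tokens."""
    spans, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):            # B- starts a new entity
            if current:
                spans.append(current)
            current = (lab[2:], [tok])
        elif lab.startswith("I-") and current and lab[2:] == current[0]:
            current[1].append(tok)          # I- continues the current entity
        else:                               # O (or a stray I-) closes it
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]

# Hypothetical sentence with fine-grained LER labels:
tokens = ["Das", "Bundesverfassungsgericht", "verwies", "auf", "§", "3", "BGB", "."]
labels = ["O", "B-GRT", "O", "O", "B-GS", "I-GS", "I-GS", "O"]
print(bio_to_spans(tokens, labels))
# → [('GRT', 'Bundesverfassungsgericht'), ('GS', '§ 3 BGB')]
```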
To fine-tune BERT on the 19 fine-grained labels, see the Jupyter notebook `Train_BERT_on_19_labels.ipynb`.
To fine-tune BERT on the 7 coarse-grained labels, see the Jupyter notebook `Train_BERT_on_7_labels.ipynb`.
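For orientation, the label setup behind such a fine-tuning run can be sketched as follows. The 19 class abbreviations come from the table above; the commented-out model loading shows the usual 🤗 Transformers pattern, not the notebooks' exact code.

```python
# 19 fine-grained LER classes (from the table above).
FINE_CLASSES = ["PER", "RR", "AN", "LD", "ST", "STR", "LDS", "ORG", "UN",
                "INN", "GRT", "MRK", "GS", "VO", "EUN", "VS", "VT", "RS", "LIT"]

# BIO scheme: one B- and one I- tag per class, plus O → 2*19 + 1 = 39 labels.
labels = ["O"] + [f"{prefix}-{cls}" for cls in FINE_CLASSES for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(labels)}
print(len(labels))  # → 39

# Typical model loading (requires transformers and a network connection,
# so it is left commented out here):
# from transformers import AutoModelForTokenClassification
# model = AutoModelForTokenClassification.from_pretrained(
#     "bert-base-german-cased",
#     num_labels=len(labels),
#     id2label=dict(enumerate(labels)),
#     label2id=label2id,
# )
```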
Alternatively, you can run a bash script in the terminal as `./fine_run.sh BERT_MODEL` to train on the German LER dataset with fine-grained labels, e.g. with `bert-base-german-cased`:

```bash
./fine_run.sh bert-base-german-cased
```

Or run `./coarse_run.sh BERT_MODEL` for training with coarse-grained labels:

```bash
./coarse_run.sh bert-base-german-cased
```
Note that this is the PyTorch version. For TensorFlow you have to change line 39 of the script to `python3 run_tf_ner.py ...`. Make sure to run `chmod a+x fine_run.sh` to make the script executable.
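For reference, a hypothetical sketch of what such a wrapper script might contain; the flag names follow the 🤗 Transformers `run_ner.py` example script, and the data paths are placeholders, so the real `fine_run.sh` may differ.

```shell
#!/bin/sh
# Hypothetical sketch of a fine_run.sh-style wrapper.
# First CLI argument: model name, with an illustrative default.
BERT_MODEL=${1:-bert-base-german-cased}

# Build the training command; it is echoed instead of executed so the
# sketch stays self-contained (the real script would run it on line 39).
CMD="python3 run_ner.py \
  --model_name_or_path $BERT_MODEL \
  --train_file data/fine/train.json \
  --validation_file data/fine/dev.json \
  --output_dir output/ler-fine-$BERT_MODEL \
  --do_train --do_eval"

echo "$CMD"
```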
You can choose other models for training from the 🤗 Hugging Face Hub, for example:
- BERT multilingual (`bert-base-multilingual-cased`, `bert-base-multilingual-uncased`)
- BERT German (`bert-base-german-cased`, `dbmdz/bert-base-german-uncased`, ...)
- DistilBERT (`distilbert-base-german-cased`, `distilbert-base-multilingual-cased`)
- XLM-RoBERTa (`xlm-roberta-base`, `xlm-roberta-large`, `facebook/xlm-roberta-xl`, ...)
- ELECTRA (`stefan-it/electra-base-gc4-64k-200000-cased-generator`, ...)
- DeBERTa (`microsoft/mdeberta-v3-base`)
- ...