# How to fine-tune BERT on the German LER Dataset

Based on the token-classification scripts for the GermEval 2014 (German NER) dataset. The German LER dataset ships with two versions of annotations: 19 fine-grained and 7 coarse-grained labels, all in BIO format. The distribution of coarse- and fine-grained classes in the dataset:

| # | Coarse-grained class | # | % | # | Fine-grained class | # | % |
|---|---|---:|---:|---|---|---:|---:|
| 1 | PER Person | 3,377 | 6.30 | 1 | PER Person | 1,747 | 3.26 |
| | | | | 2 | RR Judge | 1,519 | 2.83 |
| | | | | 3 | AN Lawyer | 111 | 0.21 |
| 2 | LOC Location | 2,468 | 4.60 | 4 | LD Country | 1,429 | 2.66 |
| | | | | 5 | ST City | 705 | 1.31 |
| | | | | 6 | STR Street | 136 | 0.25 |
| | | | | 7 | LDS Landscape | 198 | 0.37 |
| 3 | ORG Organization | 7,915 | 14.76 | 8 | ORG Organization | 1,166 | 2.17 |
| | | | | 9 | UN Company | 1,058 | 1.97 |
| | | | | 10 | INN Institution | 2,196 | 4.09 |
| | | | | 11 | GRT Court | 3,212 | 5.99 |
| | | | | 12 | MRK Brand | 283 | 0.53 |
| 4 | NRM Legal norm | 20,816 | 38.81 | 13 | GS Law | 18,520 | 34.53 |
| | | | | 14 | VO Ordinance | 797 | 1.49 |
| | | | | 15 | EUN European legal norm | 1,499 | 2.79 |
| 5 | REG Case-by-case regulation | 3,470 | 6.47 | 16 | VS Regulation | 607 | 1.13 |
| | | | | 17 | VT Contract | 2,863 | 5.34 |
| 6 | RS Court decision | 12,580 | 23.46 | 18 | RS Court decision | 12,580 | 23.46 |
| 7 | LIT Legal literature | 3,006 | 5.60 | 19 | LIT Legal literature | 3,006 | 5.60 |
| | **Total** | **53,632** | **100** | | **Total** | **53,632** | **100** |
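
In BIO format, each entity class contributes a `B-` and an `I-` tag, plus the single `O` tag for non-entity tokens, so the fine-grained scheme yields 2 × 19 + 1 = 39 tags and the coarse-grained scheme 2 × 7 + 1 = 15. A minimal sketch of building the tag sets, using the class abbreviations from the table above:

```python
# Build the BIO tag sets for both annotation schemes of the German LER dataset.
FINE_CLASSES = ["PER", "RR", "AN", "LD", "ST", "STR", "LDS", "ORG", "UN",
                "INN", "GRT", "MRK", "GS", "VO", "EUN", "VS", "VT", "RS", "LIT"]
COARSE_CLASSES = ["PER", "LOC", "ORG", "NRM", "REG", "RS", "LIT"]

def bio_tags(classes):
    """Expand entity classes into BIO tags: B-X and I-X per class, plus O."""
    tags = ["O"]
    for c in classes:
        tags += [f"B-{c}", f"I-{c}"]
    return tags

fine_tags = bio_tags(FINE_CLASSES)      # 39 tags
coarse_tags = bio_tags(COARSE_CLASSES)  # 15 tags
```

This is also the order in which a `labels.txt` file for the token-classification scripts is typically written, one tag per line.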

## Training with 19 fine-grained labels

To fine-tune BERT with the 19 fine-grained labels, see the Jupyter notebook `Train_BERT_on_19_labels.ipynb`.

## Training with 7 coarse-grained labels

To fine-tune BERT with the 7 coarse-grained labels, see the Jupyter notebook `Train_BERT_on_7_labels.ipynb`.
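
The token-classification scripts this repository builds on consume CoNLL-style data: one token and its BIO tag per line, with blank lines separating sentences. A small parsing sketch, assuming whitespace-separated token/tag columns (the exact column layout of the repository's data files is an assumption here):

```python
def read_conll(lines):
    """Parse CoNLL-style token/tag lines into (tokens, tags) sentence pairs."""
    sentences, tokens, tags = [], [], []
    for line in lines:
        line = line.strip()
        if not line:  # a blank line ends the current sentence
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        token, tag = line.split()[:2]
        tokens.append(token)
        tags.append(tag)
    if tokens:  # flush a trailing sentence without a final blank line
        sentences.append((tokens, tags))
    return sentences

sample = """\
Das O
Bundesverfassungsgericht B-GRT
entschied O
. O

§ B-GS
1 I-GS
""".splitlines()
parsed = read_conll(sample)  # two sentences with their BIO tags
```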

## Run in terminal

You can run `./fine_run.sh BERT_MODEL` in a terminal to train on the German LER dataset with fine-grained labels, e.g. with `bert-base-german-cased`:

`./fine_run.sh bert-base-german-cased`

Or run `./coarse_run.sh BERT_MODEL` to train with coarse-grained labels:

`./coarse_run.sh bert-base-german-cased`

Note that this is the PyTorch version. For TensorFlow, change line 39 to `python3 run_tf_ner.py ...`. Make sure to run `chmod a+x fine_run.sh` (and likewise for `coarse_run.sh`) to make the scripts executable.
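
For orientation, a hypothetical sketch of what a launcher like `fine_run.sh` might look like. The flag values (data paths, epoch count, output directory) are illustrative assumptions, not the repository's actual configuration; the sketch only assembles and echoes the command rather than executing it:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a fine_run.sh-style launcher; flag values are
# illustrative assumptions, not the repository's actual settings.
set -euo pipefail

# Default model so the sketch runs standalone; normally passed as $1.
MODEL="${1:-bert-base-german-cased}"

# Assemble the training command. In the real script, this is the line
# where python3 run_ner.py would be swapped for python3 run_tf_ner.py.
CMD="python3 run_ner.py \
  --model_name_or_path ${MODEL} \
  --data_dir ./data/fine \
  --labels ./data/fine/labels.txt \
  --output_dir ./output/fine \
  --num_train_epochs 3 \
  --do_train --do_eval"

# Echoed here for illustration; a real launcher would execute ${CMD}.
echo "${CMD}"
```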

You can choose other models for training from the 🤗 Hugging Face Hub, for example:

- BERT multilingual (`bert-base-multilingual-cased`, `bert-base-multilingual-uncased`)
- BERT German (`bert-base-german-cased`, `dbmdz/bert-base-german-uncased`, ...)
- DistilBERT (`distilbert-base-german-cased`, `distilbert-base-multilingual-cased`)
- XLM-RoBERTa (`xlm-roberta-base`, `xlm-roberta-large`, `facebook/xlm-roberta-xl`, ...)
- ELECTRA (`stefan-it/electra-base-gc4-64k-200000-cased-generator`, ...)
- DeBERTa (`microsoft/mdeberta-v3-base`)
- ...