# How to fine-tune BERT on the German LER dataset
Based on the scripts for token classification on the GermEval 2014 (German NER) dataset. The German LER dataset provides two annotation granularities: 19 fine-grained and 7 coarse-grained classes. All labels are in BIO format. Distribution of the coarse- and fine-grained classes in the dataset:
| # | Coarse-grained class | Count | % | # | Fine-grained class | Count | % |
|---|---|---|---|---|---|---|---|
| 1 | PER (Person) | 3,377 | 6.30 | 1 | PER (Person) | 1,747 | 3.26 |
| | | | | 2 | RR (Judge) | 1,519 | 2.83 |
| | | | | 3 | AN (Lawyer) | 111 | 0.21 |
| 2 | LOC (Location) | 2,468 | 4.60 | 4 | LD (Country) | 1,429 | 2.66 |
| | | | | 5 | ST (City) | 705 | 1.31 |
| | | | | 6 | STR (Street) | 136 | 0.25 |
| | | | | 7 | LDS (Landscape) | 198 | 0.37 |
| 3 | ORG (Organization) | 7,915 | 14.76 | 8 | ORG (Organization) | 1,166 | 2.17 |
| | | | | 9 | UN (Company) | 1,058 | 1.97 |
| | | | | 10 | INN (Institution) | 2,196 | 4.09 |
| | | | | 11 | GRT (Court) | 3,212 | 5.99 |
| | | | | 12 | MRK (Brand) | 283 | 0.53 |
| 4 | NRM (Legal norm) | 20,816 | 38.81 | 13 | GS (Law) | 18,520 | 34.53 |
| | | | | 14 | VO (Ordinance) | 797 | 1.49 |
| | | | | 15 | EUN (European legal norm) | 1,499 | 2.79 |
| 5 | REG (Case-by-case regulation) | 3,470 | 6.47 | 16 | VS (Regulation) | 607 | 1.13 |
| | | | | 17 | VT (Contract) | 2,863 | 5.34 |
| 6 | RS (Court decision) | 12,580 | 23.46 | 18 | RS (Court decision) | 12,580 | 23.46 |
| 7 | LIT (Legal literature) | 3,006 | 5.60 | 19 | LIT (Legal literature) | 3,006 | 5.60 |
| | Total | 53,632 | 100 | | Total | 53,632 | 100 |
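To make the BIO scheme concrete, here is a small sketch that groups BIO-tagged tokens back into entity spans. The example sentence and its labels are made up for illustration; only the tag names (GRT, GS) come from the table above.

```python
def bio_to_spans(tokens, labels):
    """Collect (entity_type, entity_text) spans from BIO-tagged tokens."""
    spans, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):            # B- starts a new entity
            if current:
                spans.append(current)
            current = (lab[2:], [tok])
        elif lab.startswith("I-") and current and lab[2:] == current[0]:
            current[1].append(tok)          # I- continues the current entity
        else:                               # O (or a stray I-) closes it
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]

# Hypothetical sentence with fine-grained LER labels:
tokens = ["Das", "Bundesverfassungsgericht", "verwies", "auf", "§", "3", "BGB", "."]
labels = ["O", "B-GRT", "O", "O", "B-GS", "I-GS", "I-GS", "O"]
print(bio_to_spans(tokens, labels))
# → [('GRT', 'Bundesverfassungsgericht'), ('GS', '§ 3 BGB')]
```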
To fine-tune BERT on the 19 fine-grained labels, see the Jupyter notebook `Train_BERT_on_19_labels.ipynb`.
To fine-tune BERT on the 7 coarse-grained labels, see the Jupyter notebook `Train_BERT_on_7_labels.ipynb`.
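For orientation, the label setup behind such a fine-tuning run can be sketched as follows. The 19 class abbreviations come from the table above; the commented-out model loading shows the usual 🤗 Transformers pattern, not the notebooks' exact code.

```python
# 19 fine-grained LER classes (from the table above).
FINE_CLASSES = ["PER", "RR", "AN", "LD", "ST", "STR", "LDS", "ORG", "UN",
                "INN", "GRT", "MRK", "GS", "VO", "EUN", "VS", "VT", "RS", "LIT"]

# BIO scheme: one B- and one I- tag per class, plus O → 2*19 + 1 = 39 labels.
labels = ["O"] + [f"{prefix}-{cls}" for cls in FINE_CLASSES for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(labels)}
print(len(labels))  # → 39

# Typical model loading (requires transformers and a network connection,
# so it is left commented out here):
# from transformers import AutoModelForTokenClassification
# model = AutoModelForTokenClassification.from_pretrained(
#     "bert-base-german-cased",
#     num_labels=len(labels),
#     id2label=dict(enumerate(labels)),
#     label2id=label2id,
# )
```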
Alternatively, you can run a bash script in the terminal as `./fine_run.sh BERT_MODEL` to train on the German LER dataset with fine-grained labels, e.g. with `bert-base-german-cased`:

```bash
./fine_run.sh bert-base-german-cased
```

Or run `./coarse_run.sh BERT_MODEL` for training with coarse-grained labels:

```bash
./coarse_run.sh bert-base-german-cased
```
Note that this is the PyTorch version. For TensorFlow you have to change line 39 of the script to `python3 run_tf_ner.py ...`. Make sure to run `chmod a+x fine_run.sh` to make the script executable.
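For reference, a hypothetical sketch of what such a wrapper script might contain; the flag names follow the 🤗 Transformers `run_ner.py` example script, and the data paths are placeholders, so the real `fine_run.sh` may differ.

```shell
#!/bin/sh
# Hypothetical sketch of a fine_run.sh-style wrapper.
# First CLI argument: model name, with an illustrative default.
BERT_MODEL=${1:-bert-base-german-cased}

# Build the training command; it is echoed instead of executed so the
# sketch stays self-contained (the real script would run it on line 39).
CMD="python3 run_ner.py \
  --model_name_or_path $BERT_MODEL \
  --train_file data/fine/train.json \
  --validation_file data/fine/dev.json \
  --output_dir output/ler-fine-$BERT_MODEL \
  --do_train --do_eval"

echo "$CMD"
```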
You can choose other models for training from the 🤗 Hugging Face Hub, for example:
- BERT multilingual (`bert-base-multilingual-cased`, `bert-base-multilingual-uncased`)
- BERT German (`bert-base-german-cased`, `dbmdz/bert-base-german-uncased`, ...)
- DistilBERT (`distilbert-base-german-cased`, `distilbert-base-multilingual-cased`)
- XLM-RoBERTa (`xlm-roberta-base`, `xlm-roberta-large`, `facebook/xlm-roberta-xl`, ...)
- ELECTRA (`stefan-it/electra-base-gc4-64k-200000-cased-generator`, ...)
- DeBERTa (`microsoft/mdeberta-v3-base`)
- ...