Anton Drasbæk Schiønning (@drasbaek) and
Mina Almasi (@MinaAlmasi)
Aarhus University, Natural Language Processing Exam (E23)
This repository contains the scripts used to develop SYNEDA (Synthetic Named Entity Danish dataset) for Danish named entity recognition. Concretely, the dataset is created with a reverse-annotation pipeline consisting of the following steps:

- Devising entity databases for 18 entity categories, following the OntoNotes 5.0 framework (see `dbase/entities_lists`).
- Randomly combining entities across databases to create annotation lists (see `dbase/annotations`); a minimal sketch of this step is shown after the list.
- Prompting a ChatGPT-4 instance with the annotation lists to generate text around them (see `data`).
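The combination step might look roughly like the sketch below. The entity lists and the `sample_annotation_list` helper are hypothetical stand-ins, not the project's actual code; the real databases cover 18 OntoNotes 5.0 categories under `dbase/entities_lists`:

```python
import random

# Hypothetical toy databases; the real lists live in dbase/entities_lists.
ENTITY_DBASE = {
    "PERSON": ["Mette Frederiksen", "Søren Kierkegaard"],
    "GPE": ["Aarhus", "København"],
    "ORG": ["Novo Nordisk", "DSB"],
}

def sample_annotation_list(n_entities: int = 2) -> list[tuple[str, str]]:
    """Randomly combine entities across databases into one annotation list."""
    labels = random.sample(list(ENTITY_DBASE), k=n_entities)
    return [(random.choice(ENTITY_DBASE[label]), label) for label in labels]

print(sample_annotation_list())
# e.g. [('Aarhus', 'GPE'), ('DSB', 'ORG')] -> handed to the LLM in a prompt
```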
The SYNEDA dataset can be downloaded from the `data` folder (already split into `train`, `dev`, and `test`).
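To inspect a split in Python, spaCy's `DocBin` can be used. A minimal sketch, assuming the train split is stored at `data/train.spacy` (check the folder for the exact file names):

```python
import spacy
from spacy.tokens import DocBin

# A blank Danish pipeline supplies the vocab needed to deserialize the docs.
nlp = spacy.blank("da")

# Assumed path; see the data/ folder for the actual file names.
doc_bin = DocBin().from_disk("data/train.spacy")

for doc in doc_bin.get_docs(nlp.vocab):
    print([(ent.text, ent.label_) for ent in doc.ents])
```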
The repository is structured as follows:
Folder/File | Description |
---|---|
`data/` | Contains the `.spacy` versions of all three splits of SYNEDA. Placeholder files are inserted for DANSK and DaNE+ (obtained with `src/external_data/fetch_data.py`). |
`dbase/` | Contains the databases for the entity lists. Also functions as a store for all annotation lists and their corresponding generations. |
`plots/` | Contains all plots used in the SYNEDA paper and appendix. |
`results/` | Contains all evaluation results for all three models specified in the paper. |
`src/` | Contains all Python code related to the project. |
`training/` | Contains spaCy config files for training the models as well as their logs and a placeholder folder for the models. |
`annotations.sh` | Executes all scripts related to non-manual annotations. |
`debug.sh`, `train.sh`, `evaluate.sh` | For debugging datasets, training, and evaluating models with spaCy pipelines. |
Please note that the `src` folder has a separate README with a greater overview of the scripts within.
The training pipeline was run on Ubuntu v22.04.3 with Python v3.10.12 (UCloud, Coder Python 1.84.2). Creating the annotations and plotting was done locally on a 13" MacBook Pro (2020, 2 GHz Intel i5, 16 GB RAM).
Python's `venv` needs to be installed for the code to run as intended.
Please also note that training models is computationally intensive and requires a good CPU. The training was run on a 64-CPU machine on UCloud.
Prior to running any code, please run the command below to create a virtual environment (`env`) and install necessary packages within it:

```bash
bash setup
```
To run the training pipelines, you also need to fetch the datasets DANSK and DaNE+ and create a combined `SYNEDA + DANSK` dataset by running:

```bash
bash external_data.sh
```
The spaCy training pipeline can be rerun via the three bash scripts `debug.sh`, `train.sh`, and `evaluate.sh`. For instance:

```bash
bash train.sh
```
Other files can be run as shown below while `env` is activated. For instance, the file that performs evaluation with bootstrapping:

```bash
python src/analysis/bootstrap_eval.py
```
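The script itself is not reproduced here, but the general idea behind bootstrapped evaluation can be sketched as follows: resample per-document scores with replacement and report a percentile confidence interval. This is an illustration only, not the actual implementation in `bootstrap_eval.py`:

```python
import numpy as np

def bootstrap_ci(scores, n_boot=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for a mean score (e.g. per-document F1)."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), (lo, hi)

# Hypothetical per-document F1 scores
print(bootstrap_ci([0.81, 0.77, 0.90, 0.85, 0.79]))
```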
Please note that some analysis scripts cannot be run without re-running `train.sh` to obtain the models, as they are not pushed to the `training/models` folder due to their size.
For any questions regarding the project or its reproducibility, please feel free to contact us:
- drasbaek@post.au.dk (Anton)
- mina.almasi@post.au.dk (Mina)
This work could not have been done without the extensive work by the teams behind spaCy and DaCy as well as the datasets DANSK and DaNE+.