This repository contains implementations of multiple approaches to multilingual entity linking. If you don't know what entity linking is or would like to know more about the methods used, take a look at my thesis: entity linking and its history are described there, and model training is covered in the following chapters.
Right now, it is rather hard to run the repository locally. In the upcoming year, it should become part of LinPipe, so that everything will be controlled through it.
Preliminary results
Language | Recall@1 | Recall@10 | Recall@100 |
---|---|---|---|
Arabic | 91.1 | 94.3 | 95.1 |
German | 90.6 | 94.8 | 95.6 |
English | 85.8 | 91.6 | 93.3 |
Spanish | 88.1 | 93.4 | 94.6 |
Persian | 89.0 | 95.5 | 97.0 |
Japanese | 87.6 | 94.2 | 95.9 |
Serbian | 91.6 | 96.1 | 97.0 |
Tamil | 90.0 | 96.4 | 97.7 |
Turkish | 87.3 | 93.4 | 94.6 |
Language | Recall@1 | Recall@10 | Recall@100 |
---|---|---|---|
Arabic | 90.1 | 93.3 | 94.4 |
German | 89.1 | 93.3 | 94.3 |
English | 80.3 | 88.9 | 91.4 |
Spanish | 84.5 | 91.2 | 92.5 |
Persian | 83.0 | 89.3 | 91.4 |
Japanese | 85.7 | 93.7 | 94.9 |
Serbian | 91.2 | 95.3 | 96.4 |
Tamil | 91.4 | 95.5 | 97.3 |
Turkish | 86.4 | 93.9 | 94.6 |
For an explanation of OLPEAT, click here.
Entity linking is a cute problem in NLP where you are tasked with connecting a mention to an entry in a knowledge base. For example, take the sentence "In order to save Troy, Paris had to be sacrificed" -- the mention Paris should connect to Paris (mythological figure), not the capital of France.
This repository is based on my thesis; however, it implements a number of improvements that were not part of the original code. Most notably:
- supports multilingual entity linking in the style of Entity Linking in 100 Languages,
- the code is significantly faster and more memory efficient.
Other improvements are:
- a completely rewritten tokenization pipeline (the previous version was inefficient and pretty much impossible to extend or change),
- support for different models,
- and gin-config.
For the audacious ones and as a future reference for myself.
Contains scripts for running everything. The most important scripts are described below.
Downloading and extracting DaMuEL.
It's best to use olpeat_all.sh
which runs OLPEAT from start to finish on tokenized DaMuEL and Mewsli-9.
all_langs_no_slurm.sh
is the preferred way to run the code.
all_langs.sh
requires Slurm but does not provide any benefits over it.
Its downside is that every single call to Slurm's sbatch has to be configured, which makes it time-consuming to maintain.
Contains various baselines inherited from the code of my thesis. The most important one is OLPEAT. OLPEAT now has a cap on the number of different entity representations in the index, so it cannot be evaluated with the same code as the trained models.
Some old code kept for legacy reasons; it should either be moved to models or deleted. Implements wrappers over tokenizers and the ScaNN index.
The complete finetuning logic.
Most of the logic has already been moved to RecallCalculator
so this mostly contains wrappers around it that make setting up evaluation easy.
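For reference, recall@k as reported in the tables above just measures how often the gold entity appears among the top-k retrieved candidates. A minimal sketch (the function name and array layout are illustrative, not the RecallCalculator API):

```python
import numpy as np

def recall_at_k(retrieved_ids: np.ndarray, gold_ids: np.ndarray, k: int) -> float:
    """Fraction of mentions whose gold entity appears among the top-k candidates.

    retrieved_ids: (num_mentions, num_candidates) candidate entity ids sorted by score.
    gold_ids:      (num_mentions,) gold entity id for each mention.
    """
    hits = (retrieved_ids[:, :k] == gold_ids[:, None]).any(axis=1)
    return float(hits.mean())
```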
Utilities needed during training, mostly for moving data around.
There are two possible entry points: train.py
and train_ddp.py
.
The former implements single-GPU training; the latter uses distributed data parallel (DDP).
DDP is the strongly preferred way of running things -- it's much faster and its data loading logic is improved.
train.py
exists for legacy reasons or for times when I don't have access to a multi-GPU cluster.
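For orientation, a minimal sketch of what a DDP entry point launched with torchrun typically looks like; the dummy model, data, and hyperparameters below are placeholders, not this repository's actual training code:

```python
# Launch with e.g.: torchrun --nproc_per_node=4 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK for every spawned process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Dummy model and data; the real code would build the encoder and the link dataset here.
    model = DDP(torch.nn.Linear(128, 128).cuda(local_rank), device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(1024, 128))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for epoch in range(3):
        sampler.set_epoch(epoch)  # reshuffle the per-process shards each epoch
        for (batch,) in loader:
            loss = model(batch.cuda(local_rank)).pow(2).mean()  # placeholder loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```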
The entry point is in generate.py
but most of the logic is in datasets.py
-- the code in it constructs batches, which is probably the hardest thing to get right.
For performance reasons and ease of implementation, we load all links into RAM.
If generation runs out of memory, datasets.Batcher
needs to be made lazier.
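As a sketch of what "lazier" could mean, here is a hypothetical generator-based batcher that streams links from JSONL files instead of holding everything in memory (the file layout and field contents are assumptions, not the actual datasets.Batcher):

```python
import json
from pathlib import Path
from typing import Iterator

def iter_link_batches(link_dir: Path, batch_size: int) -> Iterator[list[dict]]:
    """Yield fixed-size batches of links without ever loading the whole corpus into RAM."""
    batch: list[dict] = []
    for path in sorted(link_dir.glob("*.jsonl")):
        with path.open() as f:
            for line in f:
                batch.append(json.loads(line))
                if len(batch) == batch_size:
                    yield batch
                    batch = []
    if batch:
        yield batch
```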
batch_sampler.py
and negative_sampler
implement neighbor sampling for batches and negatives respectively.
recall_calculator.py
is used for calculating recalls during evaluation.
*_wrapper.py
-- different transformers produce the input embeddings differently (pooling, CLS, ...). These can wrap a transformer model and extract the embedding.
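For illustration, a minimal wrapper that extracts one embedding per text from the [CLS] token; the model name is only an example, and the repository's wrappers may pool differently (e.g. mean pooling over last_hidden_state):

```python
import torch
from transformers import AutoModel, AutoTokenizer

class CLSWrapper(torch.nn.Module):
    """Wrap a Hugging Face encoder and return a single embedding per input text."""

    def __init__(self, model_name: str = "bert-base-multilingual-cased"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.encoder = AutoModel.from_pretrained(model_name)

    @torch.no_grad()
    def forward(self, texts: list[str]) -> torch.Tensor:
        inputs = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        outputs = self.encoder(**inputs)
        return outputs.last_hidden_state[:, 0]  # embedding of the [CLS] token
```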
Different searchers that can be used to implement the index for sampling negatives during training.
brute_force_searcher.py
-- contains a GPU-based searcher (single and multi GPU). Very performant and precise.
scann_searcher.py
-- based on ScaNN; performant but less precise. So far, we can fit our index on a GPU, so brute force is preferable. Properly understanding SOAR could give an additional performance boost.
faiss_searcher.py
-- currently not working well (the parameters may be set poorly).
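The brute-force idea boils down to one matrix multiplication followed by top-k; a single-GPU sketch (inner-product scoring is an assumption, and the real searcher also handles multiple GPUs):

```python
import torch

class BruteForceSearcher:
    """Exact nearest-neighbour search via a dense score matrix on the GPU."""

    def __init__(self, index_embeddings: torch.Tensor, device: str = "cuda"):
        self.index = index_embeddings.to(device)  # (num_entities, dim)

    def search(self, queries: torch.Tensor, k: int) -> tuple[torch.Tensor, torch.Tensor]:
        scores = queries.to(self.index.device) @ self.index.T  # (num_queries, num_entities)
        top_scores, top_indices = scores.topk(k, dim=1)
        return top_scores, top_indices
```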
Code for creating a multilingual dataset in the style of Entity Linking in 100 Languages. The scripts here are run once to create the multilingual dataset.
combine_embs.py
is a nice utility that allows us to use more than one language for entity descriptions. This offers modest improvements for low-resource languages.
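As an illustration of the idea, a hypothetical combiner that averages per-language entity embeddings keyed by QID; the actual combination strategy in combine_embs.py may differ:

```python
import numpy as np

def combine_language_embeddings(
    embs_per_language: dict[str, dict[int, np.ndarray]]
) -> dict[int, np.ndarray]:
    """Average the embeddings of each entity (QID) over all languages that describe it."""
    collected: dict[int, list[np.ndarray]] = {}
    for language_embs in embs_per_language.values():
        for qid, emb in language_embs.items():
            collected.setdefault(qid, []).append(emb)
    return {qid: np.mean(vectors, axis=0) for qid, vectors in collected.items()}
```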
Helpful profiling scripts I wrote to make the code efficient.
WIP code for reranking results retrieved by our model.
Scripts for running everything. Already described above.
Simple pipelines that are used for tokenizing inputs.
Adding new pipeline steps can be done easily by extending tokenization.pipeline.base.PipelineStep
.
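A new step might look roughly like the following; the exact PipelineStep interface is an assumption here (steps are assumed to transform a stream of dict-like examples), so check the base class before copying this:

```python
from tokenization.pipeline.base import PipelineStep  # the interface assumed below may differ

class LowercaseMentions(PipelineStep):
    """Hypothetical step that lowercases the mention text of every example passing through."""

    def process(self, examples):
        for example in examples:
            example["mention"] = example["mention"].lower()
            yield example
```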
A bunch of utilities used in different parts of the codebase.
All tests are in the tests directory. Some slower/integration tests are marked with @pytest.mark.slow.
To run all tests, just run
pytest
in the root.
To exclude slow tests, run
pytest -m "not slow"
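Marking a new slow test is just a matter of adding the marker (the test name below is hypothetical):

```python
import pytest

@pytest.mark.slow
def test_full_olpeat_pipeline():
    # Long-running integration test; skipped by `pytest -m "not slow"`.
    ...
```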
Right now, wandb
is used for tracking all experiments, and there is no simple way to turn it off.
We should make it optional in the future.
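One possible way to do that is to gate wandb's built-in disabled mode behind an environment variable (the variable and project names below are hypothetical):

```python
import os
import wandb

# Hypothetical opt-out: set DISABLE_WANDB=1 to turn tracking into a no-op.
mode = "disabled" if os.environ.get("DISABLE_WANDB") == "1" else "online"
run = wandb.init(project="multilingual-entity-linking", mode=mode)
run.log({"recall_at_1": 0.0})  # does nothing when mode is "disabled"
```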