A named entity linking pipeline for retro-digitized documents
CHNOBLi is a pipeline for named entity linking and disambiguation in retro-digitized documents. It processes text through three main stages:
| Component | Purpose |
|---|---|
| Tagging | Extract and tag named entities from OCR text using pre-trained NER models |
| Aggregation | Combine and normalize entity mentions across documents and sources |
| Linking | Link mentions to knowledge bases (Wikipedia, Wikidata, GND) and resolve identity |
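To make the division of labour between the three stages concrete, here is a toy sketch. It is not the CHNOBLi implementation: the function names, the exact-match "tagger", and the one-entry knowledge base are all assumptions made for illustration, and the real pipeline uses pre-trained NER models and external knowledge bases instead.

```python
from collections import defaultdict

# Hypothetical toy knowledge base: normalized name -> Wikidata-style ID.
# The ID is the one that appears in this README's own output example.
TOY_KB = {"kamal kharrazi": "Q435799"}


def tag(text, known_names):
    """Stand-in tagger: find known names by exact string search."""
    mentions = []
    for name in known_names:
        start = text.find(name)
        while start != -1:
            mentions.append({"mention": name, "offset": start, "length": len(name)})
            start = text.find(name, start + 1)
    return mentions


def aggregate(mentions):
    """Group mentions under a normalized key (lowercased, whitespace collapsed)."""
    groups = defaultdict(list)
    for m in mentions:
        key = " ".join(m["mention"].lower().split())
        groups[key].append(m)
    return dict(groups)


def link(groups, kb):
    """Look up each aggregated entity in the toy knowledge base."""
    return {key: kb.get(key) for key in groups}


text = "Foreign Minister Kamal Kharrazi spoke. Later, Kamal Kharrazi left."
groups = aggregate(tag(text, ["Kamal Kharrazi"]))
links = link(groups, TOY_KB)
print(links)  # -> {'kamal kharrazi': 'Q435799'}
```

The point of the sketch is only the data flow: two separate mentions are tagged, merged into one entity during aggregation, and resolved once during linking.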
The easiest way to set up the project is using the interactive setup wizard:
git clone git@github.com:eth-library/CHNOBLi.git
cd CHNOBLi
make setup
The wizard will guide you through:
- Creating your .env configuration file.
- Choosing between
  - Minimal Setup (using remote APIs) or
  - Full Local Setup (cloning and setting up local databases).

Once the databases are running, you can import the required data using
make import-data
If you prefer to set everything up manually, follow these steps:
git clone git@github.com:eth-library/CHNOBLi.git
cd CHNOBLi
Option A: Using Conda
conda create -n env_chnobli python=3.12 ipython
conda activate env_chnobli
Option B: Using venv
python3.12 -m venv .env_chnobli # Windows: py -3.12 -m venv .env_chnobli
source .env_chnobli/bin/activate # Windows: .env_chnobli\Scripts\activate
pip install -r requirements.txt
Download the tagging models from the ETH Research Collection (DOI: 10.3929/ethz-c-000799811) and save them to the models/ directory.
Instead of setting up the environment yourself as described above, you can run:
docker compose --file docker-compose-dev.yml up
This sets up your environment automatically, although you still need to set up your own vector database and Elasticsearch index. To link, you can then run:
docker exec -it linking sh scripts/link_example.sh
A public API endpoint is coming soon. To set up your own:
- Follow the setup guide in CHNOBLi-elasticsearch
- Update CHNOBLi/.env_template with your endpoints and index names
- Rename the file to .env
- Copy the certificate hierarchy: secrets/certs/ca/ca.crt from the CHNOBLi-elasticsearch directory to this one
A public API endpoint is coming soon. To set up your own:
- Set up Milvus following the setup guide in CHNOBLi-vectordb
- Update CHNOBLi/.env_template with your host and port
- Rename the file to .env
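Before starting the pipeline, it can help to confirm that no entry in .env was left blank. The following is a minimal sketch, assuming the file uses plain KEY=VALUE lines with optional # comments; it does not assume any particular key names, and the helper functions are hypothetical rather than part of CHNOBLi.

```python
from pathlib import Path


def parse_env(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def empty_keys(values):
    """Return the keys whose value is still empty, in sorted order."""
    return sorted(key for key, value in values.items() if not value)


if __name__ == "__main__":
    env_file = Path(".env")
    if env_file.exists():
        unset = empty_keys(parse_env(env_file))
        print("Unset keys:", unset if unset else "none")
    else:
        print(".env not found; rename .env_template first")
```

Run it from the repository root after renaming the template. It only reports empty values and cannot tell whether a host, port, or index name is actually reachable.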
The project includes a Makefile to simplify common tasks and ensure consistent environments (especially regarding file permissions).
- make setup: Run the interactive configuration wizard.
- make import-data: Interactive tool to download and import Wikidata, GND, and Milvus data.
- make build: Build the Docker images from source.
- make up: Start the linking pipeline in the background.
- make down: Stop and remove all containers.
- make logs: Tail the logs from all running services.
- make shell: Drop into a bash shell inside the linking container.
- make shell-root: Same as above, but with root privileges.
1. Tag example documents
sh scripts/tag_example.sh # Windows: python main.py --tasks prep,tag --magazine_year_paths ./data/input_example/tjb/1955_030 --config_file configs/configurations_example.json
Output: data/output/tag/
2. Link entities
sh scripts/link_example.sh # Windows: python main.py --tasks finish --magazine_year_paths ./data/output/tag/tjb --config_file ./configs/configurations_example.json
Output: data/output/link/
3. Evaluate results
sh scripts/eval_example.sh # Windows: python3 main.py --tasks eval --config_file ./configs/eval_config_example.json --eval_level ref
Output: data/output/eval_ref_with_fuzzy/tjb/1955_030.jsonl
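To take a first look at the evaluation output, a small reader like the one below may be enough. It is a sketch that assumes the .jsonl file holds one JSON object per line, as the extension suggests; the fields inside each record are not documented here, so the code only counts records and lists whatever top-level keys it finds. Both helper functions are hypothetical.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return one parsed object per non-empty line of a JSON Lines file."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def summarize(records):
    """Count records and collect the top-level keys that occur in them."""
    keys = set()
    for record in records:
        if isinstance(record, dict):
            keys.update(record)
    return {"count": len(records), "keys": sorted(keys)}


if __name__ == "__main__":
    eval_file = Path("data/output/eval_ref_with_fuzzy/tjb/1955_030.jsonl")
    if eval_file.exists():
        print(summarize(read_jsonl(eval_file)))
    else:
        print(f"{eval_file} not found; run the evaluation step first")
```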
The tagging component expects word coordinates (as from ABBYY FineReader). If your OCR comes from another source, we provide transformation utilities:
Transkribus
from utility.utils import transkribus_xml_to_approx_word_coord
E-Rara
from utility.utils import erara_xml_to_word_coord
Tesseract or Plain Text
from utility.utils import txt_file_to_word_coord
Contributing: Have a transformation function for another format? Please submit a pull request!
Once your data is transformed, you can run the pipeline just as you did with the example data.
If you already have entity extractions (e.g., from spaCy), transform them to our format:
Input example:
{
"mention": "Kamal Kharrazi",
"offset": 237,
"length": 14,
"docName": "APW19981109_0464.htm"
}
Transform using:
from utility.utils import offset_len_to_linking_input
This produces output like:
{
"info": {
"lastnames": ["Kharrazi"],
"firstnames": ["Kamal"],
"abbr_firstnames": [],
"address": [],
"titles": [],
"occupations": [],
"others": [],
"type": "PER",
"id": 0,
"gt_wikipedia": "Kamal_Kharazi",
"gt_wikidata": "Q435799",
"gt_gnd": "1222390949"
},
"pageNo": 0,
"pageNames": "APW19981109_0464.htm",
"pid": "APW19981109_0464.htm",
"sentenceNo": 0,
"positions": "237:14",
"articles": "",
"context": "al bodies about the U.S.-funded Radio Free Europe , the Iran Daily reported Monday. \n\n It quoted Foreign Minister Kamal Kharrazi as saying the radio \"was set up to interfere in Iran\'s internal affairs.\'\' \n\n It did not say when the complaints wil"
}
1. Configure your data path
Edit configs/configurations_customtag.json and set CUSTOM_TAGGING_OUTPUT to your data path.
2. Run aggregation (with linking)
python main.py --tasks finish --config_file configs/configurations_customtag.json
Or skip aggregation and only link:
python main.py --tasks link --config_file configs/configurations_customtag.json
Note on context: The pipeline normally reads context from the ABBYY FineReader format, so for custom data you have to include the "context" key with the context string explicitly. If your data doesn't include context, simply omit the "context" key and the disambiguation via vector database will be skipped.
Note on dates: The publication year is used for sanity checks (e.g., not considering people born after that year). For custom data, this defaults to year 3000; adjust as needed in main.py.
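The two notes above can be illustrated with a short sketch. It makes several assumptions that are not confirmed by this README: that the "context" key sits at the top level of each record (as it does in the output example), that a fixed character window around the mention is an acceptable context, and that candidates carry a `birth_year` field. The function names, the window size, and the candidate IDs are hypothetical; the actual checks live in the pipeline itself.

```python
DEFAULT_PUBLICATION_YEAR = 3000  # default for custom data, per the note on dates


def make_record(text, start, end, doc_name, window=120):
    """Build one mention record with a character window of context.

    `window` (characters kept on each side) is an illustrative choice.
    """
    left = max(0, start - window)
    right = min(len(text), end + window)
    return {
        "mention": text[start:end],
        "offset": start,
        "length": end - start,
        "docName": doc_name,
        "context": text[left:right],
    }


def plausible_candidates(candidates, publication_year=DEFAULT_PUBLICATION_YEAR):
    """Drop candidates born after the publication year.

    Candidates without a known birth year are kept, since nothing rules them out.
    """
    kept = []
    for candidate in candidates:
        birth_year = candidate.get("birth_year")  # hypothetical field name
        if birth_year is None or birth_year <= publication_year:
            kept.append(candidate)
    return kept


doc = "It quoted Foreign Minister Kamal Kharrazi as saying the radio was set up."
start = doc.index("Kamal Kharrazi")
record = make_record(doc, start, start + len("Kamal Kharrazi"), "example.htm", window=20)
print(record["offset"], record["length"])  # -> 27 14

candidates = [
    {"id": "cand-1", "birth_year": 1901},
    {"id": "cand-2", "birth_year": 1972},
    {"id": "cand-3", "birth_year": None},
]
print([c["id"] for c in plausible_candidates(candidates, publication_year=1955)])
# -> ['cand-1', 'cand-3']
```

With the default year of 3000 the filter keeps every candidate, which is why setting a realistic publication year matters for custom data.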
Download training and evaluation datasets from Hugging Face:
- Annotated entity mentions
- Linked entities with Wikipedia/Wikidata/GND IDs
- Retro-digitized documents
Citation format coming soon...
Released under MIT by @eth-library.