CHNOBLi

A named entity linking pipeline for retro-digitized documents

Overview

CHNOBLi is a pipeline for named entity linking and disambiguation in retro-digitized documents. It processes text through three main stages:

Component	Purpose
Tagging	Extract and tag named entities from OCR text using pre-trained NER models
Aggregation	Combine and normalize entity mentions across documents and sources
Linking	Link mentions to knowledge bases (Wikipedia, Wikidata, GND) and resolve identity

Installation

The easiest way to set up the project is using the interactive setup wizard:

git clone git@github.com:eth-library/CHNOBLi.git
cd CHNOBLi
make setup

The wizard will guide you through:

Creating your .env configuration file.
Choosing between:
1. Minimal Setup (using remote APIs), or
2. Full Local Setup (cloning and setting up local databases).

Once the databases are running, you can import the required data using make import-data.

Manual Setup (Optional)

If you prefer to set everything up manually, follow these steps:

Step 1: Clone Repository

git clone git@github.com:eth-library/CHNOBLi.git
cd CHNOBLi

Step 2: Create Environment

Option A: Using Conda

conda create -n env_chnobli python=3.12 ipython
conda activate env_chnobli

Option B: Using venv

python3.12 -m venv .env_chnobli # Windows: py -3.12 -m venv .env_chnobli
source .env_chnobli/bin/activate  # Windows: .env_chnobli\Scripts\activate

Step 3: Install Dependencies

pip install -r requirements.txt

Step 4: Download Models

Download the tagging models from the ETH Research Collection (DOI: 10.3929/ethz-c-000799811) and save them to the models/ directory.

Alternative to Steps 2-4: Docker

Instead of setting up the environment yourself as described above, you can run:

docker compose --file docker-compose-dev.yml up

This sets up the environment automatically, but you still need to set up your own vector database and ElasticSearch index. To run the linking step, you can then call:

docker exec -it linking sh scripts/link_example.sh

Step 5: Configure ElasticSearch

A public API endpoint is coming soon. To set up your own:

Follow the setup guide in CHNOBLi-elasticsearch
Update CHNOBLi/.env_template with your endpoints and index names
Rename the file to .env
Copy the certificate, keeping its directory hierarchy (secrets/certs/ca/ca.crt), from the CHNOBLi-elasticsearch directory into this one

Step 6: Configure Milvus

A public API endpoint is coming soon. To set up your own:

Set up Milvus following the setup guide in CHNOBLi-vectordb
Update CHNOBLi/.env_template with your host and port
Rename the file to .env
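
Since both Step 5 and Step 6 fill in values from .env_template, it can help to check that the final .env still defines every variable the template lists. The following is a minimal sketch, not part of CHNOBLi: it assumes both files use plain KEY=VALUE lines and that you kept a copy of .env_template (for example by copying it instead of renaming it). The function names are hypothetical.

```python
from pathlib import Path


def read_env_keys(path):
    """Return the set of variable names defined in a KEY=VALUE style file."""
    keys = set()
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        keys.add(line.split("=", 1)[0].strip())
    return keys


def missing_env_keys(template_path=".env_template", env_path=".env"):
    """List keys that appear in the template but not in the local .env."""
    return sorted(read_env_keys(template_path) - read_env_keys(env_path))
```

Calling missing_env_keys() from the repository root returns an empty list when every template variable is present in .env.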

Management with Makefile

The project includes a Makefile to simplify common tasks and ensure consistent environments (especially regarding file permissions).

Setup & Data

make setup: Run the interactive configuration wizard.
make import-data: Interactive tool to download and import Wikidata, GND, and Milvus data.

Container Management

make build: Build the Docker images from source.
make up: Start the linking pipeline in the background.
make down: Stop and remove all containers.
make logs: Tail the logs from all running services.

Interactive Access

make shell: Drop into a bash shell inside the linking container.
make shell-root: Same as above, but with root privileges.

Quick Start

Try It with Example Data

1. Tag example documents

sh scripts/tag_example.sh # Windows: python main.py --tasks prep,tag --magazine_year_paths ./data/input_example/tjb/1955_030 --config_file configs/configurations_example.json

Output: data/output/tag/

2. Link entities

sh scripts/link_example.sh # Windows: python main.py --tasks finish --magazine_year_paths ./data/output/tag/tjb --config_file ./configs/configurations_example.json

Output: data/output/link/

3. Evaluate results

sh scripts/eval_example.sh # Windows: python main.py --tasks eval --config_file ./configs/eval_config_example.json --eval_level ref

Output: data/output/eval_ref_with_fuzzy/tjb/1955_030.jsonl
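
The evaluation output is a .jsonl file, which suggests one JSON object per line. A minimal sketch for loading such a file for inspection follows; it makes no assumption about the fields inside each record, and load_jsonl is a hypothetical helper rather than a CHNOBLi function.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Parse a JSON Lines file into a list of Python objects, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                records.append(json.loads(line))
    return records
```

For example, load_jsonl("data/output/eval_ref_with_fuzzy/tjb/1955_030.jsonl") returns the evaluation records as a list, so len() of the result gives the number of lines evaluated.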

Using Your Data

Input Format: OCR Data

The tagging component expects word coordinates (as produced by ABBYY FineReader). If your OCR comes from another source, we provide transformation utilities:

Transkribus

from utility.utils import transkribus_xml_to_approx_word_coord

E-Rara

from utility.utils import erara_xml_to_word_coord

Tesseract or Plain Text

from utility.utils import txt_file_to_word_coord

Contributing: Have a transformation function for another format? Please submit a pull request!

Once your data is transformed, you can run the pipeline just as you did with the example data.

Custom Tagging Output

Transformation

If you already have entity extractions (e.g., from spaCy), transform them to our format:

Input example:

{
   "mention": "Kamal Kharrazi",
   "offset": 237,
   "length": 14,
   "docName": "APW19981109_0464.htm"
}

Transform using:

from utility.utils import offset_len_to_linking_input

This produces output like:

{
   "info": {
      "lastnames": ["Kharrazi"],
      "firstnames": ["Kamal"],
      "abbr_firstnames": [],
      "address": [],
      "titles": [],
      "occupations": [],
      "others": [],
      "type": "PER",
      "id": 0,
      "gt_wikipedia": "Kamal_Kharazi",
      "gt_wikidata": "Q435799",
      "gt_gnd": "1222390949"
   },
   "pageNo": 0,
   "pageNames": "APW19981109_0464.htm",
   "pid": "APW19981109_0464.htm",
   "sentenceNo": 0,
   "positions": "237:14",
   "articles": "",
   "context": "al bodies about the U.S.-funded  Radio Free Europe  , the Iran Daily reported Monday. \n\n It quoted  Foreign Minister    Kamal Kharrazi   as saying the radio \"was set up to interfere in Iran\'s internal affairs.\'\' \n\n It did not say when the complaints wil"
}
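
In the example above, "positions" holds the input offset and length joined by a colon ("237:14"), and "context" is a stretch of the document text around the mention. The sketch below illustrates that relationship on a shortened toy sentence. It is not the implementation of offset_len_to_linking_input; the helper names and the margin value are assumptions made for illustration.

```python
def to_positions(offset, length):
    """Format a character offset and length as an "offset:length" string."""
    return f"{offset}:{length}"


def context_window(text, offset, length, margin=120):
    """Return the mention plus up to `margin` characters on either side.

    The margin of 120 is an illustrative assumption, not the pipeline's value.
    """
    start = max(0, offset - margin)
    end = min(len(text), offset + length + margin)
    return text[start:end]


# Toy document, shortened from the example record above.
document = "It quoted Foreign Minister Kamal Kharrazi as saying the radio was set up."
mention = "Kamal Kharrazi"
offset = document.index(mention)

print(to_positions(offset, len(mention)))
print(context_window(document, offset, len(mention), margin=20))
```

Slicing the document with document[offset:offset + length] should return the mention itself, which is a quick way to verify that the offsets in your own data are character-based and zero-indexed.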

Linking

1. Configure your data path

Edit configs/configurations_customtag.json and set CUSTOM_TAGGING_OUTPUT to your data path.

2. Run aggregation (with linking)

python main.py --tasks finish --config_file configs/configurations_customtag.json

Or skip aggregation and only link:

python main.py --tasks link --config_file configs/configurations_customtag.json

Note on context: The pipeline normally reads context from the ABBYY FineReader format, so for custom data you have to provide the context string explicitly under the "context" key. If your data has no context, simply omit the "context" key; disambiguation via the vector database is then skipped.
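
If your extraction tool writes an empty or missing context for some mentions, one way to follow the note above is to drop the "context" key from those records before linking. This is a minimal sketch with a hypothetical helper name; how the pipeline treats an empty context string is not stated here, so the sketch simply removes the key as the note recommends.

```python
def drop_empty_context(record):
    """Return a copy of the record without "context" if it is empty or missing."""
    cleaned = dict(record)
    # Treat None, "", and whitespace-only strings as "no context available".
    if not str(cleaned.get("context") or "").strip():
        cleaned.pop("context", None)
    return cleaned
```

Records that do carry a context string are returned unchanged, so the helper can be applied to every record in a batch.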

Note on dates: The publication year is used for sanity checks (e.g., not considering people born after that year). For custom data, this defaults to the year 3000; adjust it as needed in main.py.
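
The effect of the publication year can be pictured as a filter over candidate entities. The sketch below shows the idea under stated assumptions: the Candidate type, its fields, and the function name are hypothetical and do not reflect the pipeline's internal code. With the default of 3000, no candidate is filtered out, which is why the year should be adjusted for historical material.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_PUBLICATION_YEAR = 3000  # default for custom data, as stated in the note


@dataclass
class Candidate:
    """Hypothetical knowledge-base candidate; not the pipeline's own type."""
    identifier: str
    birth_year: Optional[int] = None


def plausible_candidates(candidates, publication_year=DEFAULT_PUBLICATION_YEAR):
    """Keep candidates who were not born after the publication year.

    Candidates without a known birth year are kept, because there is no
    evidence against them.
    """
    return [
        c for c in candidates
        if c.birth_year is None or c.birth_year <= publication_year
    ]


# Placeholder identifiers, for illustration only.
pool = [
    Candidate("candidate-a", birth_year=1902),
    Candidate("candidate-b", birth_year=1971),
    Candidate("candidate-c"),
]
kept = plausible_candidates(pool, publication_year=1955)
print([c.identifier for c in kept])
```

For a document published in 1955, the candidate born in 1971 is dropped, while the candidate with an unknown birth year stays in the pool.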

Resources

Ground-Truth Data

Download training and evaluation datasets from Hugging Face:

Annotated entity mentions
Linked entities with Wikipedia/Wikidata/GND IDs
Retro-digitized documents

Citation

Citation format coming soon...

License

Released under MIT by @eth-library.
