Commit 0f5c76d

chore(medcat-plugin-embedding-linker): CU-869c36ruk Separate embedding linker to its own project (#327)
* CU-869c36ruk: Add initial README
* CU-869c36ruk: Move embedding linker to separate project / package
* CU-869c36ruk: Add initial pyproject.toml
* CU-869c36ruk: Move embedding linker config to relevant project
* CU-869c36ruk: Add entry point based plugin registration
* CU-869c36ruk: Add component registration for embedding linker plugin
* CU-869c36ruk: Remove embedding linker registration from core lib
* CU-869c36ruk: Centralise plugin / project name
* CU-869c36ruk: Centralise name again
* CU-869c36ruk: Standardise / fix license format in pyproject.toml
* CU-869c36ruk: Add missing dep for embedding linker
* CU-869c36ruk: Move embedding linker tests to new project
* CU-869c36ruk: Fix typo for lazy registration method name
* CU-869c36ruk: Fix import paths for tests
* CU-869c36ruk: Add helper module for tests
* CU-869c36ruk: Use correct (local) imports for embedding linker config
* CU-869c36ruk: Use correct (local) import within tests; add a simple instance test
* CU-869c36ruk: Remove non-existent core lib import of config
* CU-869c36ruk: Make sure the core lib is marked as typed
* CU-869c36ruk: Rename tag (embedding instead of embed)
* CU-869c36ruk: Rename embedding linker folder
* CU-869c36ruk: Add initial workflows for medcat-embedding-linker
* CU-869c36ruk: Fix issue with component registration (NER/linker)
* CU-869c36ruk: Fix linker name in tests
* CU-869c36ruk: Unify component naming
* CU-869c36ruk: Fix issue with test PyPI push
* CU-869c36ruk: Fix workflow typo
* CU-869c36ruk: Bump medcat dependency to 2.5 (for lazy registration)
* CU-869c36ruk: Update plugin catalog with new entry for medcat-embedding-linker
* CU-869c36ruk: Remove license section from README
* CU-869c25ux2: Rename workflow
* CU-869c25ux2: Moved publishing to the joint workflow
* CU-869c25ux2: Fix typo in release workflow job
* CU-869c36ruk: Move workflow to uv
* CU-869c36ruk: Remove unnecessary step
1 parent 6a19a19 commit 0f5c76d

15 files changed: +617 -63 lines changed
Lines changed: 109 additions & 0 deletions
@@ -0,0 +1,109 @@
1+
name: medcat-embedding-linker - CI (test | publish)
2+
3+
on:
4+
push:
5+
branches: [ main ]
6+
tags:
7+
- 'medcat-embedding-linker/v*.*.*'
8+
pull_request:
9+
paths:
10+
- 'medcat-plugins/embedding-linker/**'
11+
- '.github/workflows/medcat-embedding-linker**'
12+
13+
permissions:
14+
id-token: write
15+
16+
defaults:
17+
run:
18+
working-directory: ./medcat-plugins/embedding-linker
19+
20+
jobs:
21+
build:
22+
runs-on: ubuntu-latest
23+
strategy:
24+
matrix:
25+
python-version: [ '3.10', '3.11', '3.12' ]
26+
max-parallel: 4
27+
steps:
28+
- uses: actions/checkout@v6
29+
- name: Install uv for Python ${{ matrix.python-version }}
30+
uses: astral-sh/setup-uv@v7
31+
with:
32+
python-version: ${{ matrix.python-version }}
33+
enable-cache: true
34+
- name: Install the project
35+
run: |
36+
uv sync --all-extras --dev
37+
uv run python -m ensurepip
38+
uv run python -m pip install --upgrade pip
39+
- name: Check types
40+
run: |
41+
uv run python -m mypy --follow-imports=normal src/medcat_embedding_linker
42+
- name: Ruff linting
43+
run: |
44+
uv run ruff check src/medcat_embedding_linker --preview
45+
- name: Test
46+
run: |
47+
uv run python -m unittest discover
48+
49+
publish-to-test-PyPI:
50+
runs-on: ubuntu-latest
51+
needs: build
52+
steps:
53+
- name: Checkout main
54+
uses: actions/checkout@v6
55+
with:
56+
fetch-depth: 0 # fetch all history
57+
fetch-tags: true # fetch tags explicitly
58+
59+
- name: Install uv for Python 3.10
60+
uses: astral-sh/setup-uv@v7
61+
with:
62+
python-version: '3.10'
63+
enable-cache: true
64+
65+
- name: Install dependencies
66+
run: |
67+
uv run python -m ensurepip
68+
69+
- name: Set timestamp-based dev version
70+
run: |
71+
TS=$(date -u +"%Y%m%d%H%M%S")
72+
echo "SETUPTOOLS_SCM_PRETEND_VERSION_FOR_MEDCAT_EMBEDDING_LINKER=0.2.2.dev${TS}" >> $GITHUB_ENV
73+
74+
- name: Build package
75+
run: |
76+
uv build
77+
78+
- name: Publish distribution to TestPyPI
79+
uses: pypa/gh-action-pypi-publish@release/v1
80+
with:
81+
repository_url: https://test.pypi.org/legacy/
82+
packages_dir: medcat-plugins/embedding-linker/dist
83+
84+
publish-to-PyPI:
85+
runs-on: ubuntu-latest
86+
if: startsWith(github.ref, 'refs/tags/')
87+
needs: build
88+
steps:
89+
- name: Checkout main
90+
uses: actions/checkout@v6
91+
92+
- name: Install uv for Python 3.10
93+
uses: astral-sh/setup-uv@v7
94+
with:
95+
python-version: '3.10'
96+
enable-cache: true
97+
98+
- name: Install dependencies
99+
run: |
100+
uv run python -m ensurepip
101+
102+
- name: Build client package
103+
run: |
104+
uv build
105+
106+
- name: Publish production distribution to PyPI
107+
uses: pypa/gh-action-pypi-publish@release/v1
108+
with:
109+
packages_dir: medcat-plugins/embedding-linker/dist
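
The TestPyPI job above builds a unique development version from the current UTC time so that each upload gets a fresh version number. A rough Python equivalent of that shell step might look like the sketch below. `dev_version` and `DEV_VERSION_RE` are hypothetical names used only for illustration, and the regex is a simplified shape check rather than the full PEP 440 grammar.

```python
import re
from datetime import datetime, timezone


def dev_version(base: str, now: datetime) -> str:
    """Mirror `date -u +"%Y%m%d%H%M%S"` and append it as a .devN segment."""
    ts = now.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{base}.dev{ts}"


# Simplified check: dotted release numbers followed by a numeric dev segment.
DEV_VERSION_RE = re.compile(r"^\d+(\.\d+)*\.dev\d+$")

# A fixed timestamp keeps the example reproducible.
example = dev_version("0.2.2", datetime(2025, 1, 2, 3, 4, 5, tzinfo=timezone.utc))
print(example)
```

Because the timestamp has second resolution, two builds started within the same second would produce the same version string.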
Lines changed: 192 additions & 0 deletions
@@ -0,0 +1,192 @@
1+
# MedCAT Embedding Linker
2+
3+
A MedCAT plugin that provides an embedding-based entity linking component using transformer models from HuggingFace.
4+
5+
## Overview
6+
7+
This plugin replaces MedCAT's default linking component with a transformer-based approach that uses semantic similarity between entity contexts and concept embeddings to perform entity disambiguation.
8+
9+
**Key features:**
10+
- Semantic similarity-based linking using transformer embeddings
11+
- Support for any HuggingFace sentence-transformer model
12+
- Efficient batch processing with GPU acceleration
13+
- Configurable similarity thresholds and context windows
14+
- CUI-based filtering (include/exclude lists)
15+
16+
## Requirements
17+
18+
- **MedCAT**: 2.5+, required for lazy component registration ([PyPI](https://pypi.org/project/medcat/) | [GitHub](https://github.com/CogStack/MedCAT))
19+
- Python 3.10+
20+
- PyTorch
21+
- Transformers
22+
23+
## Installation
24+
25+
```bash
26+
pip install medcat-embedding-linker
27+
```
28+
29+
## Quick Start
30+
31+
```python
32+
from medcat.cat import CAT
33+
from medcat.config import Config
34+
from medcat.components.types import CoreComponentType
35+
36+
from medcat_embedding_linker import EmbeddingLinking
37+
38+
# Load your MedCAT model
39+
cat = CAT.load_model_pack("path/to/model_pack")
40+
41+
# Configure the embedding linker
42+
cat.config.components.linking = EmbeddingLinking()
43+
cat.config.components.linking.embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"
44+
45+
# Recreate the pipeline to register the new linker
46+
cat._recreate_pipe()
47+
48+
# Generate embeddings for your concept database
49+
linker = cat.get_component(CoreComponentType.linking)
50+
# Compute and cache the name and CUI embeddings
51+
linker.create_embeddings()
52+
53+
# Use as normal
54+
entities = cat.get_entities("Patient presents with chest pain and dyspnea.")
55+
```
56+
57+
## How It Works
58+
59+
### Component Registration
60+
61+
The embedding linker automatically registers itself as `embedding_linker` when an `EmbeddingLinking` config is detected. It implements MedCAT's `AbstractEntityProvidingComponent` interface and is lazily loaded when the pipeline is created.
62+
63+
### Embedding Generation
64+
65+
The linker operates on two types of embeddings:
66+
67+
**1. Concept Embeddings** (pre-computed)
68+
- Each CUI is represented by its longest name's embedding
69+
- Stored in `cdb.addl_info["cui_embeddings"]`
70+
- Used for final disambiguation between candidate CUIs
71+
72+
**2. Name Embeddings** (pre-computed)
73+
- Each concept name in the CDB gets its own embedding
74+
- Stored in `cdb.addl_info["name_embeddings"]`
75+
- Used for initial candidate retrieval
76+
77+
Both are generated via `linker.create_embeddings()` and cached for inference.
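
The relationship between the two stores can be sketched as follows. This is an illustration only: `toy_embed` is a stand-in for the transformer model, `build_embeddings` is a hypothetical helper, and plain dictionaries are assumed here without any claim about how the plugin actually lays out `cdb.addl_info`.

```python
from typing import Dict, List, Set, Tuple


def toy_embed(text: str, dim: int = 8) -> List[float]:
    """Stand-in for a transformer: bucket character codes into a small vector."""
    vec = [0.0] * dim
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    return vec


def build_embeddings(
    cui2names: Dict[str, Set[str]],
) -> Tuple[Dict[str, List[float]], Dict[str, List[float]]]:
    """Return (name_embeddings, cui_embeddings) as described above."""
    # Every concept name gets its own embedding.
    name_embeddings = {
        name: toy_embed(name)
        for names in cui2names.values()
        for name in names
    }
    # Each CUI is represented by the embedding of its longest name.
    # Sorting first makes ties between equally long names deterministic.
    cui_embeddings = {
        cui: name_embeddings[max(sorted(names), key=len)]
        for cui, names in cui2names.items()
    }
    return name_embeddings, cui_embeddings


# Illustrative mini "CDB"; the name sets are made up for the example.
cui2names = {
    "C0018802": {"chf", "congestive heart failure"},
    "C0011849": {"diabetes", "diabetes mellitus"},
}
name_emb, cui_emb = build_embeddings(cui2names)
print(len(name_emb), len(cui_emb))
```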
78+
79+
### Inference Process
80+
81+
For each detected entity:
82+
83+
1. **Context Vector Calculation**: Extract a text snippet around the entity (size controlled by `context_window_size`) and embed it
84+
2. **Candidate Retrieval**: Compare context embedding against all name embeddings to find top matches above `short_similarity_threshold`
85+
3. **Disambiguation**: If multiple CUIs are associated with the best-matching name, compare against CUI embeddings to select the final concept
86+
4. **Filtering**: Apply CUI include/exclude filters and check against `long_similarity_threshold`
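
The four steps above can be sketched in plain Python. This is a simplified illustration under several assumptions: vectors are short lists rather than tensors, only an include filter is modelled, and `context_window`, `link`, and the `C_...` identifiers are hypothetical names that do not come from the plugin.

```python
import math
from typing import Dict, List, Optional, Set


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity, returning 0.0 for zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def context_window(tokens: List[str], start: int, end: int, size: int) -> List[str]:
    """Step 1: keep `size` tokens on each side of the entity span [start, end)."""
    return tokens[max(0, start - size):end + size]


def link(
    context_vec: List[float],
    name_emb: Dict[str, List[float]],
    cui_emb: Dict[str, List[float]],
    name2cuis: Dict[str, List[str]],
    short_thr: float,
    long_thr: float,
    include: Optional[Set[str]] = None,
) -> Optional[str]:
    # Step 2: find the best-matching name above the retrieval threshold.
    scored = [(cosine(context_vec, vec), name) for name, vec in name_emb.items()]
    scored = [item for item in scored if item[0] >= short_thr]
    if not scored:
        return None
    _, best_name = max(scored)
    # Step 3: disambiguate between the CUIs that share that name.
    candidates = name2cuis[best_name]
    best_cui = max(candidates, key=lambda c: cosine(context_vec, cui_emb[c]))
    # Step 4: apply the include filter and the final threshold.
    if include and best_cui not in include:
        return None
    if cosine(context_vec, cui_emb[best_cui]) < long_thr:
        return None
    return best_cui


tokens = "patient caught a bad cold last week".split()
window = context_window(tokens, start=4, end=5, size=2)

name_emb = {"cold": [1.0, 0.0], "fever": [0.0, 1.0]}
name2cuis = {"cold": ["C_TEMP", "C_ILLNESS"], "fever": ["C_FEVER"]}
cui_emb = {"C_TEMP": [0.5, 0.5], "C_ILLNESS": [1.0, 0.1], "C_FEVER": [0.0, 1.0]}

result = link([0.9, 0.1], name_emb, cui_emb, name2cuis, short_thr=0.3, long_thr=0.5)
print(window, result)
```

In this toy run the name "cold" is retrieved first, and the CUI embeddings then decide between its two candidate concepts.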
87+
88+
## Configuration
89+
90+
### Key Parameters
91+
92+
```python
93+
config.components.linking = EmbeddingLinking(
94+
# Model settings
95+
embedding_model_name="sentence-transformers/all-MiniLM-L6-v2",
96+
max_token_length=128,
97+
98+
# Context settings
99+
context_window_size=10, # tokens on each side of entity
100+
101+
# Similarity thresholds
102+
short_similarity_threshold=0.3, # for candidate retrieval
103+
long_similarity_threshold=0.5, # for final linking
104+
105+
# Batch sizes
106+
embedding_batch_size=4096,
107+
linking_batch_size=512,
108+
109+
# Filtering
110+
filters=Filters(
111+
cuis={"C0018802", "C0011849"}, # include only these
112+
cuis_exclude={"C0000001"} # or exclude these
113+
),
114+
115+
# Advanced options
116+
use_ner_link_candidates=True,
117+
always_calculate_similarity=False,
118+
filter_before_disamb=True,
119+
gpu_device="cuda:0" # or None for auto-detect
120+
)
121+
```
122+
123+
### Embedding Models
124+
125+
Any HuggingFace model compatible with sentence transformers will work. Popular options:
126+
127+
- `sentence-transformers/all-MiniLM-L6-v2` (default, fast and lightweight)
128+
- `sentence-transformers/all-mpnet-base-v2` (higher quality)
129+
- `UFNLP/gatortron-medium` (biomedical domain)
130+
- `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext`
131+
132+
## Advanced Usage
133+
134+
### Re-generating Embeddings
135+
136+
If you modify your CDB or want to try a different model:
137+
138+
```python
139+
linker = cat.get_component("embedding_linker")
140+
linker.create_embeddings(
141+
embedding_model_name="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
142+
max_length=256
143+
)
144+
```
145+
146+
### GPU Configuration
147+
148+
```python
149+
# Use specific GPU
150+
cat.config.components.linking.gpu_device = "cuda:1"
151+
152+
# Force CPU
153+
cat.config.components.linking.gpu_device = "cpu"
154+
```
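
As a rough illustration of what auto-detection could look like when `gpu_device` is `None`, consider the sketch below. `resolve_device` is a hypothetical helper, not the plugin's actual logic, and the PyTorch import is made optional only so the example runs on machines without it.

```python
from typing import Optional


def resolve_device(gpu_device: Optional[str] = None) -> str:
    """Return the configured device, or auto-detect when it is None."""
    if gpu_device is not None:
        return gpu_device
    try:
        import torch  # optional here so the sketch runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(resolve_device("cuda:1"))
print(resolve_device("cpu"))
```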
155+
156+
### Filtering
157+
158+
```python
159+
# Include only specific CUIs
160+
cat.config.components.linking.filters.cuis = {"C0011849", "C0018802"}
161+
162+
# Exclude specific CUIs
163+
cat.config.components.linking.filters.cuis_exclude = {"C0000001"}
164+
165+
# Note: If both are set, only include filters are applied
166+
```
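
The precedence rule in the note above can be expressed as a small predicate. This is a sketch of the described behaviour only; `passes_filters` is a hypothetical name and the plugin's own filtering code may be structured differently.

```python
from typing import Optional, Set


def passes_filters(
    cui: str,
    include: Optional[Set[str]] = None,
    exclude: Optional[Set[str]] = None,
) -> bool:
    """Include list wins when set; otherwise fall back to the exclude list."""
    if include:
        return cui in include
    if exclude:
        return cui not in exclude
    return True


# Both lists set: only the include list is consulted.
print(passes_filters("C0011849", include={"C0011849"}, exclude={"C0011849"}))
```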
167+
168+
## Performance Considerations
169+
170+
- **First-time embedding generation**: Can take several minutes for large CDBs (millions of concepts)
171+
- **GPU recommended**: 10-50x faster inference with CUDA
172+
- **Batch sizes**: Increase if you have GPU memory available
173+
- **Model selection**: Smaller models (e.g., MiniLM) are faster but may be less accurate than larger domain-specific models
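
To make the batch-size trade-off concrete, here is a minimal sketch of how a list of names might be split into chunks before being sent to the model. `batched` is a hypothetical helper written for this example; larger batches mean fewer model calls but more memory per call.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


names = [f"name_{i}" for i in range(10)]
sizes = [len(batch) for batch in batched(names, 4)]
print(sizes)
```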
174+
175+
## Limitations
176+
177+
- Does not support `prefer_frequent_concepts` or `prefer_primary_name` from the default linker (logs warnings if set)
178+
- Training mode is not applicable (logs warning if enabled)
179+
- Requires pre-computed embeddings before inference
180+
181+
## Citation
182+
183+
If you use this plugin, please cite MedCAT:
184+
185+
```bibtex
186+
@article{medcat2021,
187+
title={Medical Concept Annotation Tool (MedCAT)},
188+
author={Kraljevic, Zeljko and others},
189+
journal={arXiv preprint arXiv:2010.01165},
190+
year={2021}
191+
}
192+
```