A comprehensive process for generating a knowledge graph from PubMed articles that integrates lexical data, extracted entities, and patient journey information to support healthcare research and analysis.
This project creates a knowledge graph by combining:
- Lexical Graph: Document structure with chunks, text elements, images, and tables from PubMed articles
- Entity Graph: Extracted healthcare entities including medications, treatment arms, clinical outcomes, study populations, and medical conditions
- Patient Journey Graph: Real-world patient data including demographics, procedures, lab results, and outcomes
The system uses advanced document processing with Unstructured.IO, LLM-based entity extraction with OpenAI GPT-4, and Neo4j to create a comprehensive healthcare knowledge representation.
- Python 3.11 or higher
- Neo4j Database (local or remote instance)
- OpenAI API key for entity extraction
- Poppler library for PDF processing
- Tesseract OCR engine
Clone the repository:

```shell
git clone https://github.com/cupkes/pubmed-knowledge-graph.git
cd pubmed-knowledge-graph
```
Install system dependencies for local PDF processing (see the pdf2image documentation; pdf2image uses Poppler):

```shell
# macOS
brew install poppler tesseract

# Ubuntu/Debian
sudo apt-get install poppler-utils tesseract-ocr

# Windows
# Install poppler and tesseract manually or via conda
```
Install Python dependencies using Poetry (recommended):

```shell
make install
```

Or using pip:

```shell
make install-pip
```
Neo4j Database Configuration:
- Install and start Neo4j locally, or use a remote instance
- Update `pyneoinstance_config.yaml` with your database credentials:

```yaml
db_info:
  uri: bolt://localhost:7687
  database: your-database-name
  user: neo4j
  password: your-password
```
OpenAI API Configuration:
- Set your OpenAI API key as an environment variable:
  - Using `export`: run `export OPENAI_API_KEY=your-api-key-here`
  - Using a `.env` file: create a file named `.env` in the project root and add `OPENAI_API_KEY=your-api-key`
Data Preparation:
- Confirm PubMed PDF articles are in the `articles/pdf/` directory
- Ensure patient journey data is available in `data/protocol/extended_patient_journey.csv`
These notebooks demonstrate the knowledge graph generation process in four primary steps:
- Data Processing
Here we chunk the incoming PDF articles, extract images, tables and entities, and embed the text.
A predefined graph schema is used to extract entities and relationships of interest.
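The title-based chunking step can be sketched in plain Python. This is a simplified stand-in: the real pipeline delegates partitioning and chunking to Unstructured.IO, and the `(kind, text)` element format here is an assumption for illustration.

```python
def chunk_by_title(elements):
    """Group parsed elements into chunks, starting a new chunk at each Title."""
    chunks, current = [], None
    for kind, text in elements:
        if kind == "Title" or current is None:
            # A Title (or the first element) opens a new chunk
            current = {"title": text if kind == "Title" else None, "texts": []}
            chunks.append(current)
        if kind != "Title":
            current["texts"].append(text)
    return chunks

elements = [
    ("Title", "Methods"),
    ("NarrativeText", "Patients were randomized..."),
    ("Table", "Baseline characteristics"),
    ("Title", "Results"),
    ("NarrativeText", "The primary endpoint..."),
]
chunks = chunk_by_title(elements)
# → two chunks, one per section title
```

Each resulting chunk then becomes a Chunk node, with its member elements attached as TextElement, ImageElement, or TableElement nodes.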
- Ingestion
We then load the chunks, entities, relationships and embeddings into Neo4j. Optionally we can load the embeddings into a vector store and synchronize the embeddings with their respective text in Neo4j via IDs.
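The ID-based synchronization between Neo4j and a vector store can be sketched as follows. The two dictionaries are stand-ins for the actual stores, and `chunk_id` is a hypothetical helper; the point is that both stores key records by the same deterministic ID.

```python
import hashlib

graph_store, vector_store = {}, {}   # stand-ins for Neo4j and a vector DB

def chunk_id(text):
    # Deterministic ID shared by both stores so an embedding can always
    # be re-associated with its source chunk
    return hashlib.sha1(text.encode()).hexdigest()[:12]

def ingest(chunks, embed):
    for text in chunks:
        cid = chunk_id(text)
        graph_store[cid] = {"text": text}    # Chunk node keyed by id
        vector_store[cid] = embed(text)      # embedding under the same id

# Toy embedder for illustration; a real pipeline calls an embedding model
ingest(["aspirin reduces pain", "metformin lowers glucose"],
       embed=lambda t: [float(len(t))])
```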
- Post Processing
Here we execute Cypher queries to resolve duplicate entities we've extracted. This process is custom for each entity type we would like to resolve.
We can also link entities to nodes in our patient journey graph. This is likewise achieved via Cypher queries, though many other methods may be used for entity linking.
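A minimal sketch of what medication resolution amounts to, assuming duplicates differ only in casing and whitespace (the real process runs entity-type-specific Cypher merges in Neo4j, and the property names here are illustrative):

```python
def resolve_medications(meds):
    """Merge duplicate Medication entities by normalized name."""
    merged = {}
    for med in meds:
        key = med["name"].strip().lower()
        if key not in merged:
            merged[key] = dict(med)
        else:
            # Keep non-null properties contributed by any duplicate
            for k, v in med.items():
                merged[key].setdefault(k, v)
                if merged[key][k] is None and v is not None:
                    merged[key][k] = v
    return list(merged.values())

meds = [
    {"name": "Metformin", "drug_class": None},
    {"name": "metformin ", "drug_class": "biguanide"},
]
resolved = resolve_medications(meds)
# → a single Medication record with drug_class filled in
```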
- Validation
Finally we can validate that the entity extraction process went as expected. This involves running multiple Cypher queries and aggregating the results in an analysis report. We can check features such as orphan nodes, expected relationships and property presence.
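Orphan detection, one of the checks above, reduces to finding nodes that participate in no relationship. A self-contained sketch over an in-memory edge list (node and relationship names are illustrative):

```python
def find_orphans(nodes, relationships):
    """Return ids of nodes that appear in no (start, type, end) relationship."""
    connected = {n for rel in relationships for n in (rel[0], rel[2])}
    return sorted(set(nodes) - connected)

nodes = ["doc1", "chunk1", "med1", "med2"]
rels = [("doc1", "HAS_CHUNK", "chunk1"), ("chunk1", "HAS_ENTITY", "med1")]
orphans = find_orphans(nodes, rels)
# → ["med2"]
```

In the real pipeline the same check is a Cypher query matching nodes with no incident relationships.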
The knowledge graph generation process is divided into three sequential notebooks that should be executed in order:
Purpose: Establishes the foundational patient journey graph with simulated healthcare data.
What it does:
- Loads structured patient journey data from CSV files
- Creates Member, Demographic, MedicalCondition, Procedure, LabResult, and ClinicalOutcome nodes
- Establishes relationships between patients and their healthcare journey components
- Sets up constraints and indexes for the entire graph database
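The constraint setup can be sketched as generated Cypher DDL. The labels mirror those listed above, but the key property names and constraint names here are assumptions:

```python
# Hypothetical mapping of node label -> unique key property
NODE_KEYS = {"Member": "member_id", "Procedure": "procedure_id"}

def constraint_statements(node_keys):
    """Build Neo4j uniqueness-constraint statements for each node label."""
    return [
        f"CREATE CONSTRAINT {label.lower()}_key IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
        for label, prop in node_keys.items()
    ]

stmts = constraint_statements(NODE_KEYS)
```

Each statement would then be executed once against the database before any data loading.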
Purpose: Processes PubMed articles to create a detailed lexical knowledge graph.
What it does:
- Partitions PDF articles using Unstructured.IO with high-resolution processing
- Extracts text, images, and tables from documents using advanced OCR
- Creates Document, Chunk, TextElement, ImageElement, and TableElement nodes
- Builds relationships between documents and their constituent parts
- Implements chunking strategy based on document titles and sections
Purpose: Extracts healthcare entities from the lexical graph using LLM-based processing and connects them to lexical and patient journey data.
What it does:
- Uses an OpenAI LLM with Instructor library for structured entity extraction
- Extracts Medication, TreatmentArm, ClinicalOutcome, StudyPopulation, and MedicalCondition entities
- Creates relationships between extracted entities
- Links extracted entities to their source text chunks via HAS_ENTITY relationships
- Connects entity graph to patient journey graph
- Performs entity resolution to merge duplicate medications
The system implements three interconnected data models:
- Document: Top-level container for articles with metadata
- Chunk: Semantic sections of documents created by title-based chunking
- UnstructuredElement: Base class for TextElement, ImageElement, and TableElement
- Relationships: Document → Chunk → Elements with sequential chunk linking
- Medication: Drug information with classification, mechanism, and approval status
- TreatmentArm: Study groups receiving specific treatments with unique identifiers
- ClinicalOutcome: Measured results from treatments linked to studies
- StudyPopulation: Patient demographics and study characteristics with inclusion/exclusion criteria
- MedicalCondition: Diseases and conditions with ICD-10 validation
The patient journey graph is centered around Event nodes that track temporal patient interactions and medical activities. Events are sequentially linked to create a timeline of each patient's healthcare journey.
- Patient: Individual patients with unique identifiers
- Event: Temporal nodes representing patient interactions (claims, procedures, diagnoses) with sequential PREVIOUS relationships
- Demographic: Age, sex, and ZIP code information with composite constraints
- Provider: Healthcare providers with specialty information
- Claim: Insurance claims with ICD-9, ICD-10, CPT-4, NDC, and RxNorm codes
- Procedure: Medical procedures with CPT codes and dates
- LabResult: Laboratory test results with LOINC codes and values
- MedicalCondition: Diagnoses linked to events with ICD-9/ICD-10 codes
- Medication: Medications taken by patients, linked to events
- ClinicalOutcome: Clinical outcomes achieved through patient events
- CareGap: Identified gaps in care with status tracking
- RiskScore: Patient risk stratification scores and groups
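The sequential linking of a patient's Event nodes can be sketched as a sort-and-pair pass; the `id`/`date` field names are assumptions, and in the real graph the pairs become PREVIOUS relationships created via Cypher:

```python
def link_events(events):
    """Sort a patient's events by date and link each to its predecessor."""
    ordered = sorted(events, key=lambda e: e["date"])
    # Each event points back to the one before it in time
    return [(curr["id"], "PREVIOUS", prev["id"])
            for prev, curr in zip(ordered, ordered[1:])]

events = [
    {"id": "e2", "date": "2021-03-01"},
    {"id": "e1", "date": "2021-01-15"},
    {"id": "e3", "date": "2021-07-09"},
]
links = link_events(events)
# → [("e2", "PREVIOUS", "e1"), ("e3", "PREVIOUS", "e2")]
```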
The entity extraction process can be validated with `scripts/validate_entity_graph.py`, which performs the following checks:
- Node Count Validation: Ensures expected entity quantities across all graph layers
- Relationship Integrity: Validates all expected connections exist between node types
- Orphan Detection: Identifies isolated nodes without relationships for quality control
- Domain vs. Lexical Relationship Balance: Ensures proper connectivity between research and real-world data
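The relationship-integrity check amounts to comparing observed relationship patterns against the expected schema. A sketch (the `HAS_CHUNK` pattern name is an assumption; `HAS_ENTITY` is the relationship described earlier):

```python
# Relationship patterns the schema expects to exist in the graph
EXPECTED = {
    ("Document", "HAS_CHUNK", "Chunk"),
    ("Chunk", "HAS_ENTITY", "Medication"),
}

def missing_relationship_types(observed):
    """Return expected (start, type, end) patterns absent from the graph."""
    return sorted(EXPECTED - set(observed))

observed = [("Document", "HAS_CHUNK", "Chunk")]
missing = missing_relationship_types(observed)
# → the HAS_ENTITY pattern is flagged as missing
```

The real script gathers the `observed` set by running aggregation queries in Cypher.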
Run validation with:

```shell
make validate-graph-poetry
# or
make validate-graph-pip
```
OpenAI API Errors:
- Ensure the API key is set: `echo $OPENAI_API_KEY`
- Verify sufficient API credits and rate limits
- Check for API endpoint availability
Neo4j Connection:
- Verify the Neo4j service is running: `neo4j status`
- Test connection credentials in Neo4j Browser
- Check network connectivity and firewall settings
PDF Processing Warnings:
- "Cannot set gray non-stroke color" warnings are non-fatal
- Ensure poppler and tesseract are properly installed
- Verify PDF files are not corrupted
The project includes an interactive LangGraph-based agent that provides a conversational interface for querying the knowledge graph and researching medications.
The agent (`agent.py`) is a ReAct-style agent that uses Neo4j Cypher querying capabilities to research and analyze medications, studies, and patients. It acts as a healthcare expert that answers questions by dynamically querying the knowledge graph, providing cited responses grounded in the underlying documents and structured patient data.
The agent uses the Neo4j Cypher MCP server to understand the data model schema and execute read-only Cypher queries.
The agent has access to three primary tools:
- `get_neo4j_schema` - Retrieves the database schema to inform Cypher query construction
- `read_neo4j_cypher` - Executes read-only Cypher queries against the knowledge graph
- `research_medication` - Performs vector similarity search to find relevant document chunks about specific medications
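The vector similarity search behind `research_medication` can be sketched with plain cosine similarity over an in-memory chunk index; the real tool searches embeddings stored alongside the graph, and the index format here is an assumption:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def research_medication(query_vec, chunk_index, k=2):
    """Return ids of the k chunks most similar to the query embedding."""
    ranked = sorted(chunk_index.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [cid for cid, _ in ranked[:k]]

index = {"c1": [1.0, 0.0], "c2": [0.0, 1.0], "c3": [0.7, 0.7]}
top = research_medication([1.0, 0.1], index, k=2)
# → ["c1", "c3"]
```

In practice the query embedding comes from embedding the medication name or question with the same model used at ingestion time.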
To use the agent, run:

```shell
make run-agent-poetry
# or
make run-agent-pip
```

The agent provides an interactive command-line interface where you can:
- Ask questions about medications, diseases, treatments, or patients
- Request specific information about drugs in the knowledge graph
- Get cited responses with references to source documents
Example queries:
- "What are the side effects of metformin?"
- "How can metformin affect my patient who is male and 65 years old?"
- "What medications interact with warfarin?"
Type `exit`, `quit`, or `q` to end the session.
This system provides a robust foundation for healthcare research, clinical decision support, and pharmaceutical research by combining structured patient data with comprehensive literature analysis in a unified, queryable graph representation.



