A comprehensive process for generating a knowledge graph from PubMed articles that integrates lexical data, extracted entities, and patient journey information to support healthcare research and analysis.
This project creates a knowledge graph by combining:
- Lexical Graph: Document structure with chunks, text elements, images, and tables from PubMed articles
- Entity Graph: Extracted healthcare entities including medications, treatment arms, clinical outcomes, study populations, and medical conditions
- Patient Journey Graph: Real-world patient data including demographics, procedures, lab results, and outcomes
The system uses advanced document processing with Unstructured.IO, LLM-based entity extraction with OpenAI GPT-4, and Neo4j to create a comprehensive healthcare knowledge representation.
- Python 3.11 or higher
- Neo4j Database (local or remote instance)
- OpenAI API key for entity extraction
- Poppler library for PDF processing
- Tesseract OCR engine
Clone the repository:

```shell
git clone https://github.com/cupkes/pubmed-knowledge-graph.git
cd pubmed-knowledge-graph
```
Install system dependencies for local PDF processing (see the pdf2image documentation; pdf2image uses Poppler):

```shell
# macOS
brew install poppler tesseract

# Ubuntu/Debian
sudo apt-get install poppler-utils tesseract-ocr

# Windows
# Install poppler and tesseract manually or via conda
```
Install Python dependencies using Poetry (recommended):

```shell
make install
```

Or using pip:

```shell
make install-pip
```
Neo4j Database Configuration:
- Install and start Neo4j locally, or use a remote instance
- Update `pyneoinstance_config.yaml` with your database credentials:

```yaml
db_info:
  uri: bolt://localhost:7687
  database: your-database-name
  user: neo4j
  password: your-password
```
OpenAI API Configuration:
- Set your OpenAI API key as an environment variable:
  - Using `export`: run `export OPENAI_API_KEY=your-api-key-here`
  - Using a `.env` file: create a file named `.env` in the project root and add `OPENAI_API_KEY=your-api-key`
Data Preparation:
- Confirm PubMed PDF articles are in the `articles/pdf/` directory
- Ensure patient journey data is available in `data/protocol/extended_patient_journey.csv`
These notebooks demonstrate the knowledge graph generation process in four primary steps:
- Data Processing
Here we chunk the incoming PDF articles, extract images, tables and entities, and embed the text.
A predefined graph schema is used to extract entities and relationships of interest.
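The title-based chunking step can be sketched in plain Python. This is a simplified stand-in: the real pipeline delegates partitioning and chunking to Unstructured.IO, and the `(kind, text)` element format here is an assumption for illustration.

```python
def chunk_by_title(elements):
    """Group parsed elements into chunks, starting a new chunk at each Title."""
    chunks, current = [], None
    for kind, text in elements:
        if kind == "Title" or current is None:
            # A Title (or the first element) opens a new chunk
            current = {"title": text if kind == "Title" else None, "texts": []}
            chunks.append(current)
        if kind != "Title":
            current["texts"].append(text)
    return chunks

elements = [
    ("Title", "Methods"),
    ("NarrativeText", "Patients were randomized..."),
    ("Table", "Baseline characteristics"),
    ("Title", "Results"),
    ("NarrativeText", "The primary endpoint..."),
]
chunks = chunk_by_title(elements)
# → two chunks, one per section title
```

Each resulting chunk then becomes a Chunk node, with its member elements attached as TextElement, ImageElement, or TableElement nodes.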
- Ingestion
We then load the chunks, entities, relationships and embeddings into Neo4j. Optionally we can load the embeddings into a vector store and synchronize the embeddings with their respective text in Neo4j via IDs.
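The ID-based synchronization between Neo4j and a vector store can be sketched as follows. The two dictionaries are stand-ins for the actual stores, and `chunk_id` is a hypothetical helper; the point is that both stores key records by the same deterministic ID.

```python
import hashlib

graph_store, vector_store = {}, {}   # stand-ins for Neo4j and a vector DB

def chunk_id(text):
    # Deterministic ID shared by both stores so an embedding can always
    # be re-associated with its source chunk
    return hashlib.sha1(text.encode()).hexdigest()[:12]

def ingest(chunks, embed):
    for text in chunks:
        cid = chunk_id(text)
        graph_store[cid] = {"text": text}    # Chunk node keyed by id
        vector_store[cid] = embed(text)      # embedding under the same id

# Toy embedder for illustration; a real pipeline calls an embedding model
ingest(["aspirin reduces pain", "metformin lowers glucose"],
       embed=lambda t: [float(len(t))])
```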
- Post Processing
Here we execute Cypher queries to resolve duplicate entities we've extracted. This process is custom for each entity type we would like to resolve.
We can also link entities to nodes in our patient journey graph. This is likewise achieved via Cypher queries, though many other methods may be used for entity linking.
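A minimal sketch of what medication resolution amounts to, assuming duplicates differ only in casing and whitespace (the real process runs entity-type-specific Cypher merges in Neo4j, and the property names here are illustrative):

```python
def resolve_medications(meds):
    """Merge duplicate Medication entities by normalized name."""
    merged = {}
    for med in meds:
        key = med["name"].strip().lower()
        if key not in merged:
            merged[key] = dict(med)
        else:
            # Keep non-null properties contributed by any duplicate
            for k, v in med.items():
                merged[key].setdefault(k, v)
                if merged[key][k] is None and v is not None:
                    merged[key][k] = v
    return list(merged.values())

meds = [
    {"name": "Metformin", "drug_class": None},
    {"name": "metformin ", "drug_class": "biguanide"},
]
resolved = resolve_medications(meds)
# → a single Medication record with drug_class filled in
```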
- Validation
Finally we can validate that the entity extraction process went as expected. This involves running multiple Cypher queries and aggregating the results in an analysis report. We can check features such as orphan nodes, expected relationships and property presence.
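Orphan detection, one of the checks above, reduces to finding nodes that participate in no relationship. A self-contained sketch over an in-memory edge list (node and relationship names are illustrative):

```python
def find_orphans(nodes, relationships):
    """Return ids of nodes that appear in no (start, type, end) relationship."""
    connected = {n for rel in relationships for n in (rel[0], rel[2])}
    return sorted(set(nodes) - connected)

nodes = ["doc1", "chunk1", "med1", "med2"]
rels = [("doc1", "HAS_CHUNK", "chunk1"), ("chunk1", "HAS_ENTITY", "med1")]
orphans = find_orphans(nodes, rels)
# → ["med2"]
```

In the real pipeline the same check is a Cypher query matching nodes with no incident relationships.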
The knowledge graph generation process is divided into three sequential notebooks that should be executed in order:
Purpose: Establishes the foundational patient journey graph with simulated healthcare data.
What it does:
- Loads structured patient journey data from CSV files
- Creates Member, Demographic, MedicalCondition, Procedure, LabResult, and ClinicalOutcome nodes
- Establishes relationships between patients and their healthcare journey components
- Sets up constraints and indexes for the entire graph database
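The constraint setup can be sketched as generated Cypher DDL. The labels mirror those listed above, but the key property names and constraint names here are assumptions:

```python
# Hypothetical mapping of node label -> unique key property
NODE_KEYS = {"Member": "member_id", "Procedure": "procedure_id"}

def constraint_statements(node_keys):
    """Build Neo4j uniqueness-constraint statements for each node label."""
    return [
        f"CREATE CONSTRAINT {label.lower()}_key IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
        for label, prop in node_keys.items()
    ]

stmts = constraint_statements(NODE_KEYS)
```

Each statement would then be executed once against the database before any data loading.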
Purpose: Processes PubMed articles to create a detailed lexical knowledge graph.
What it does:
- Partitions PDF articles using Unstructured.IO with high-resolution processing
- Extracts text, images, and tables from documents using advanced OCR
- Creates Document, Chunk, TextElement, ImageElement, and TableElement nodes
- Builds relationships between documents and their constituent parts
- Implements chunking strategy based on document titles and sections
Purpose: Extracts healthcare entities from the lexical graph using LLM-based processing and connects them to lexical and patient journey data.
What it does:
- Uses an OpenAI LLM with Instructor library for structured entity extraction
- Extracts Medication, TreatmentArm, ClinicalOutcome, StudyPopulation, and MedicalCondition entities
- Creates relationships between extracted entities
- Links extracted entities to their source text chunks via HAS_ENTITY relationships
- Connects entity graph to patient journey graph
- Performs entity resolution to merge duplicate medications
The system implements three interconnected data models:
- Document: Top-level container for articles with metadata
- Chunk: Semantic sections of documents created by title-based chunking
- UnstructuredElement: Base class for TextElement, ImageElement, and TableElement
- Relationships: Document → Chunk → Elements with sequential chunk linking
- Medication: Drug information with classification, mechanism, and approval status
- TreatmentArm: Study groups receiving specific treatments with unique identifiers
- ClinicalOutcome: Measured results from treatments linked to studies
- StudyPopulation: Patient demographics and study characteristics with inclusion/exclusion criteria
- MedicalCondition: Diseases and conditions with ICD-10 validation
The patient journey graph is centered around Event nodes that track temporal patient interactions and medical activities. Events are sequentially linked to create a timeline of each patient's healthcare journey.
- Patient: Individual patients with unique identifiers
- Event: Temporal nodes representing patient interactions (claims, procedures, diagnoses) with sequential PREVIOUS relationships
- Demographic: Age, sex, and ZIP code information with composite constraints
- Provider: Healthcare providers with specialty information
- Claim: Insurance claims with ICD-9, ICD-10, CPT-4, NDC, and RxNorm codes
- Procedure: Medical procedures with CPT codes and dates
- LabResult: Laboratory test results with LOINC codes and values
- MedicalCondition: Diagnoses linked to events with ICD-9/ICD-10 codes
- Medication: Medications taken by patients, linked to events
- ClinicalOutcome: Clinical outcomes achieved through patient events
- CareGap: Identified gaps in care with status tracking
- RiskScore: Patient risk stratification scores and groups
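The sequential linking of a patient's Event nodes can be sketched as a sort-and-pair pass; the `id`/`date` field names are assumptions, and in the real graph the pairs become PREVIOUS relationships created via Cypher:

```python
def link_events(events):
    """Sort a patient's events by date and link each to its predecessor."""
    ordered = sorted(events, key=lambda e: e["date"])
    # Each event points back to the one before it in time
    return [(curr["id"], "PREVIOUS", prev["id"])
            for prev, curr in zip(ordered, ordered[1:])]

events = [
    {"id": "e2", "date": "2021-03-01"},
    {"id": "e1", "date": "2021-01-15"},
    {"id": "e3", "date": "2021-07-09"},
]
links = link_events(events)
# → [("e2", "PREVIOUS", "e1"), ("e3", "PREVIOUS", "e2")]
```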
The entity extraction process can be validated with `scripts/validate_entity_graph.py`, which performs the following checks:
- Node Count Validation: Ensures expected entity quantities across all graph layers
- Relationship Integrity: Validates all expected connections exist between node types
- Orphan Detection: Identifies isolated nodes without relationships for quality control
- Domain vs. Lexical Relationship Balance: Ensures proper connectivity between research and real-world data
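The relationship-integrity check amounts to comparing observed relationship patterns against the expected schema. A sketch (the `HAS_CHUNK` pattern name is an assumption; `HAS_ENTITY` is the relationship described earlier):

```python
# Relationship patterns the schema expects to exist in the graph
EXPECTED = {
    ("Document", "HAS_CHUNK", "Chunk"),
    ("Chunk", "HAS_ENTITY", "Medication"),
}

def missing_relationship_types(observed):
    """Return expected (start, type, end) patterns absent from the graph."""
    return sorted(EXPECTED - set(observed))

observed = [("Document", "HAS_CHUNK", "Chunk")]
missing = missing_relationship_types(observed)
# → the HAS_ENTITY pattern is flagged as missing
```

The real script gathers the `observed` set by running aggregation queries in Cypher.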
Run validation with:

```shell
make validate-graph-poetry
# or
make validate-graph-pip
```
OpenAI API Errors:
- Ensure the API key is set: `echo $OPENAI_API_KEY`
- Verify sufficient API credits and rate limits
- Check for API endpoint availability
Neo4j Connection:
- Verify the Neo4j service is running: `neo4j status`
- Test connection credentials in Neo4j Browser
- Check network connectivity and firewall settings
PDF Processing Warnings:
- "Cannot set gray non-stroke color" warnings are non-fatal
- Ensure poppler and tesseract are properly installed
- Verify PDF files are not corrupted
The project includes an interactive LangGraph-based agent that provides a conversational interface for querying the knowledge graph and researching medications.
The agent (`agent.py`) is a ReAct-style agent that uses Neo4j Cypher querying capabilities to research and analyze medications, studies, and patients. It acts as a healthcare expert that answers questions by dynamically querying the knowledge graph, providing cited responses grounded in the underlying documents and structured patient data.
The agent uses the Neo4j Cypher MCP server to understand the data model schema and execute read-only Cypher queries.
The agent has access to three primary tools:
- `get_neo4j_schema` - Retrieves the database schema to inform Cypher query construction
- `read_neo4j_cypher` - Executes read-only Cypher queries against the knowledge graph
- `research_medication` - Performs vector similarity search to find relevant document chunks about specific medications
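The vector similarity search behind `research_medication` can be sketched with plain cosine similarity over an in-memory chunk index; the real tool searches embeddings stored alongside the graph, and the index format here is an assumption:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def research_medication(query_vec, chunk_index, k=2):
    """Return ids of the k chunks most similar to the query embedding."""
    ranked = sorted(chunk_index.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [cid for cid, _ in ranked[:k]]

index = {"c1": [1.0, 0.0], "c2": [0.0, 1.0], "c3": [0.7, 0.7]}
top = research_medication([1.0, 0.1], index, k=2)
# → ["c1", "c3"]
```

In practice the query embedding comes from embedding the medication name or question with the same model used at ingestion time.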
To use the agent, run:

```shell
make run-agent-poetry
# or
make run-agent-pip
```

The agent provides an interactive command-line interface where you can:
- Ask questions about medications, diseases, treatments, or patients
- Request specific information about drugs in the knowledge graph
- Get cited responses with references to source documents
Example queries:
- "What are the side effects of metformin?"
- "How can metformin affect my patient who is male and 65 years old?"
- "What medications interact with warfarin?"
Type `exit`, `quit`, or `q` to end the session.
This system provides a robust foundation for healthcare research, clinical decision support, and pharmaceutical research by combining structured patient data with comprehensive literature analysis in a unified, queryable graph representation.



