Skip to content

Oracle AI Database-powered document intelligence with SQL Property Graphs, vectorless RAG, and knowledge graph visualization

License

Notifications You must be signed in to change notification settings

jasperan/OraclePageIndex

Repository files navigation

OraclePageIndex: Graph-Powered Document Intelligence

Oracle SQL Property GraphsΒ  β—¦ Β No Vector DBΒ  β—¦ Β No EmbeddingsΒ  β—¦ Β Graph-Based Reasoning

Python 3.11+ Oracle 26ai Ollama License: MIT PyPI

Quick StartΒ  β€’ Β  How It WorksΒ  β€’ Β  Live DemoΒ  β€’ Β  SQL/PGQΒ  β€’ Β  Graph SchemaΒ 

What Is This?

OraclePageIndex turns documents into knowledge graphs stored natively in Oracle Database using SQL Property Graphs (SQL/PGQ). Unlike traditional RAG that chunks text and matches vectors, OraclePageIndex builds a structured graph of documents, sections, entities, and relationships β€” then traverses it with standard SQL to find answers.

It's an Oracle AI Database-powered fork of PageIndex by VectifyAI, replacing OpenAI + JSON files with Ollama + Oracle Property Graphs. The result: fully local, fully open-source document intelligence with the power of a real database behind it.


πŸ“‘ Why Graphs Instead of Vectors?

Traditional vector-based RAG relies on semantic similarity β€” but similarity β‰  relevance. When a financial analyst searches for "Apple's supply chain risks," cosine similarity might surface paragraphs about fruit supply chains. What we need is reasoning over structure, and that's exactly what knowledge graphs provide.

OraclePageIndex builds a SQL Property Graph from each document and uses Oracle's GRAPH_TABLE with MATCH patterns to traverse relationships β€” the same way a human expert would flip through a table of contents, cross-reference entities, and follow citations.

Vector RAG OraclePageIndex (Graph RAG)
How it works Chunk text β†’ embed β†’ cosine similarity Parse structure β†’ extract entities β†’ graph traversal
Storage Vector embeddings in a vector DB Property graph in Oracle Database
Retrieval Approximate nearest-neighbor search Exact graph traversal via SQL/PGQ
Relationships Lost during chunking First-class citizens (named edges)
Explainability Opaque similarity scores Traceable graph paths with named relationships
Multi-document Separate vector spaces per doc Unified knowledge graph with cross-document links
Query language Proprietary APIs Standard SQL with GRAPH_TABLE
Multi-hop reasoning Requires re-ranking hacks Native recursive path expressions (->+)

🎯 Core Features

  • No Vector DB: Uses document structure and LLM reasoning for retrieval, powered by Oracle SQL Property Graphs.
  • No Chunking: Documents are organized into natural sections, not arbitrary fixed-size chunks.
  • Graph-Based Reasoning: Retrieval follows named relationships (MENTIONS, PARENT_OF, RELATED_TO) β€” not opaque similarity scores.
  • Standard SQL: Every query is pure SQL with GRAPH_TABLE. No proprietary graph query language to learn.
  • Fully Local: Ollama for LLM inference, Oracle Free for storage. No API keys, no cloud dependencies.
  • Entity Extraction: LLM-powered extraction of people, organizations, technologies, locations, concepts, and more.
  • Interactive Visualization: D3.js force-directed graph with color-coded nodes, searchable entities, and multiple layouts.

πŸ“ˆ Live Demo: Apple 10-K 2024

We indexed Apple's 121-page annual report (SEC 10-K filing, fiscal year 2024) end-to-end. Here's what Oracle's property graph captured:

Graph Statistics

Metric Count
Document 1 (Apple 10-K 2024)
Sections 121 page-level sections with summaries
Entities 686 unique entities extracted via LLM
Mention Edges 951 section β†’ entity connections
Entity Relationships 33 cross-entity links
D3.js Graph 808 nodes, 1,105 edges
Entity Types Extracted

The LLM extracted structured entities across categories:

  • Organizations: Apple Inc., SEC, Standard & Poor's, Bank of New York Mellon
  • Products: iPhone 16 Pro, iPhone 16, iPhone SE, Apple Watch Series 10, Mac
  • Technologies: iOS, macOS, Apple Intelligence, XBRL
  • People: Timothy D. Cook, Luca Maestri, Arthur D. Levinson, board members
  • Locations: China mainland, India, Japan, South Korea, Taiwan, Vietnam
  • Concepts: Risk Factors, Cybersecurity, Human Capital, Net Sales, Tariffs
  • Financial: Common Stock, Equity Awards, RSUs, Commercial Paper, Term Debt

Sample SQL/PGQ Query: Apple's Relationships

SELECT e2.name, e2.entity_type, r.relationship
FROM GRAPH_TABLE (doc_knowledge_graph
    MATCH (e1 IS entity WHERE e1.name = 'Apple Inc.')
          -[r IS related_to]-> (e2 IS entity)
    COLUMNS (e2.name, e2.entity_type, r.relationship)
);

Result:

Entity Type Relationship
Form 10-K CONCEPT USED_BY
SEC ORGANIZATION RELATED_TO
Securities Exchange Act of 1934 LAW RELATED_TO
Net Sales METRIC RELATED_TO
Product Introductions CONCEPT RELATED_TO
Human Capital CONCEPT RELATED_TO
China mainland LOCATION OPERATES_IN
India LOCATION OPERATES_IN
Japan LOCATION OPERATES_IN
South Korea LOCATION OPERATES_IN
Taiwan LOCATION OPERATES_IN
Vietnam LOCATION OPERATES_IN

Sample Query & Answer

$ python run.py query "What are the main risk factors for Apple?"

Answer: Based on the knowledge graph, Apple's main risk factors include:

  • Supply Chain Risks: Heavy reliance on manufacturing in China, South Korea, Taiwan, India, and Vietnam
  • Financial Risks: Commercial paper interest rate exposure, term debt obligations, currency exchange fluctuations
  • Product Risks: Dependence on successful product introductions; component shortages
  • Legal & Regulatory: Subject to Securities Exchange Act; SEC reporting requirements

Sources: Pages 3, 9, 11, 27, 45, 76, 81, 98, 99, 108, 110, 121

Related Entities: Apple Inc., China mainland, India, Japan, South Korea, Taiwan, Vietnam, Net Sales, Product Introductions, Tariffs, SEC


🌲 How It Works

OraclePageIndex processes documents in two phases: indexing (PDF β†’ Property Graph) and querying (question β†’ graph traversal β†’ answer).

Indexing Pipeline

  PDF Document
       β”‚
       β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚  1. Parse    β”‚  Extract text per page, detect structure
  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
         β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ 2. Structure β”‚  Ollama builds hierarchical section tree
  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
         β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ 3. Summarize β”‚  Generate concise summary per section
  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
         β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ 4. Extract       β”‚  Ollama identifies PERSON, ORG, TECH, CONCEPT...
  β”‚    Entities      β”‚
  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ 5. Store in      β”‚  INSERT vertices + edges into Property Graph
  β”‚    Oracle        β”‚
  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ 6. Link          β”‚  Ollama discovers cross-entity relationships
  β”‚    Entities      β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

  Result: A rich Oracle Property Graph with SQL/PGQ access

Query Pipeline

  User Question: "What products does Apple sell?"
       β”‚
       β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ 1. Concept         β”‚  Ollama extracts: ["Apple", "products"]
  β”‚    Extraction      β”‚
  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ 2. Graph Traversal β”‚  SQL/PGQ finds matching entities + sections
  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ 3. Context         β”‚  Retrieve text from graph-matched sections
  β”‚    Assembly        β”‚
  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ 4. LLM Reasoning   β”‚  Ollama reasons over graph-retrieved context
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

  β†’ Answer with sources and related entities from the graph

Architecture

                       OraclePageIndex Architecture

+----------+       +---------+       +───────────────────────────+
β”‚          β”‚       β”‚         β”‚       β”‚   Oracle Database 26ai    β”‚
β”‚   PDF    β”œβ”€β”€β”€β”€β”€β”€β–Ίβ”‚ Ollama  β”œβ”€β”€β”€β”€β”€β”€β–Ίβ”‚                           β”‚
β”‚ Document β”‚       β”‚  LLM    β”‚       β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
+----------+       +---------+       β”‚  β”‚ Property Graph      β”‚  β”‚
                    β”‚  parse    β”‚       β”‚  β”‚                     β”‚  β”‚
                    β”‚  extract  β”‚       β”‚  β”‚ [documents]───┐     β”‚  β”‚
                    β”‚  reason   β”‚       β”‚  β”‚ [sections]─────     β”‚  β”‚
                    +--------+  β”‚       β”‚  β”‚ [entities]─────     β”‚  β”‚
                             β”‚  β”‚       β”‚  β”‚ [relationships]     β”‚  β”‚
                             β–Ό  β–Ό       β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
                       +---------+      +────────────┬──────────────+
                       β”‚         β”‚                   β”‚
                       β”‚ FastAPI β”‚β—„β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                       β”‚ Server  β”‚     SQL/PGQ queries
                       +----┬----+
                            β”‚
                            β–Ό
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚    D3.js      β”‚
                    β”‚ Visualization β”‚
                    β”‚ (Force Graph) β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

βš™οΈ Quick Start

Prerequisites

  • Docker (for Oracle Database 26ai Free)
  • Python 3.11+
  • Ollama with a model pulled (e.g., ollama pull gemma3 or ollama pull llama3.1)

1. Start Oracle Database

docker compose up -d
# Wait ~2 minutes for initialization
docker compose logs -f oracle-db  # watch until "DATABASE IS READY TO USE"

2. Create the Database User

docker exec -i <container-name> sqlplus -s "/ as sysdba" <<'EOF'
ALTER SESSION SET CONTAINER = FREEPDB1;
ALTER PROFILE DEFAULT LIMIT PASSWORD_VERIFY_FUNCTION NULL;
CREATE USER pageindex IDENTIFIED BY pageindex;
GRANT CONNECT, RESOURCE TO pageindex;
GRANT UNLIMITED TABLESPACE TO pageindex;
GRANT CREATE PROPERTY GRAPH TO pageindex;
EXIT;
EOF

The key grant is CREATE PROPERTY GRAPH β€” this enables Oracle's SQL/PGQ capabilities.

3. Install & Initialize

pip install oracledb httpx pyyaml tiktoken PyPDF2 PyMuPDF fastapi uvicorn
# Initialize the schema (creates 6 tables + Property Graph)
python run.py init

This creates the Oracle Property Graph doc_knowledge_graph with:

  • 3 vertex tables: documents, sections, entities
  • 3 edge tables: section_hierarchy, section_entities, entity_relationships

4. Index a Document

python run.py index /path/to/document.pdf
Example output
Indexing complete.
  Document:      apple-10k-2024.pdf
  Sections:      121
  Entities:      686
  Relationships: 33

5. Query the Knowledge Graph

python run.py query "What are the key financial risks?"

The query engine:

  1. Extracts concepts from your question via Ollama
  2. Traverses the Oracle Property Graph to find matching sections
  3. Enriches context with related entities
  4. Reasons over the graph-retrieved context for a grounded answer

6. Visualize

python run.py serve
# Open http://localhost:8000

Interactive D3.js force-directed graph with:

  • Color-coded nodes: Documents (blue), Sections (green), Entities (orange)
  • Edge types: Hierarchy (solid), Mentions (dashed), Relationships (dotted)
  • Click any node for details, search for entities, toggle layouts
Optional CLI parameters
# Override model or Oracle DSN
python run.py --model llama3.1 --oracle-dsn myhost:1521/MYPDB index doc.pdf

# Verbose logging
python run.py -v query "What is the revenue?"

# Custom server host/port
python run.py serve --host 0.0.0.0 --port 9000

πŸ”Œ SQL/PGQ β€” The Power of Oracle Property Graphs

The entire knowledge graph is queryable with standard SQL using GRAPH_TABLE. This is what makes Oracle special β€” no proprietary graph query language, just SQL.

Find Sections Mentioning an Entity

SELECT s.title, e.name AS entity, d.doc_name
FROM GRAPH_TABLE (doc_knowledge_graph
    MATCH (e IS entity WHERE e.name = 'iPhone')
          <-[m IS mentions]- (s IS section)
    COLUMNS (s.title, e.name, s.doc_id)
) gt
JOIN documents d ON d.doc_id = gt.doc_id;

Discover Entity Relationships

SELECT e1.name AS source, r.relationship, e2.name AS target, e2.entity_type
FROM GRAPH_TABLE (doc_knowledge_graph
    MATCH (e1 IS entity WHERE e1.name = 'Apple Inc.')
          -[r IS related_to]-> (e2 IS entity)
    COLUMNS (e1.name, r.relationship, e2.name, e2.entity_type)
);

Recursive Section Hierarchy Traversal

-- Find ALL descendants of a section (multi-hop!)
SELECT child.title, child.summary
FROM GRAPH_TABLE (doc_knowledge_graph
    MATCH (parent IS section WHERE parent.title = 'Introduction')
          -[h IS parent_of]->+ (child IS section)
    COLUMNS (child.title, child.summary)
);

The ->+ syntax is SQL/PGQ's recursive path expression β€” it traverses the hierarchy to any depth in a single query. No recursive CTEs needed.

Cross-Document Entity Discovery

-- Find all documents and sections that discuss a concept
SELECT d.doc_name, s.title, m.relevance, e.entity_type
FROM GRAPH_TABLE (doc_knowledge_graph
    MATCH (e IS entity WHERE e.name = 'Cybersecurity')
          <-[m IS mentions]- (s IS section)
    COLUMNS (e.entity_type, m.relevance, s.title, s.doc_id)
) gt
JOIN documents d ON d.doc_id = gt.doc_id
ORDER BY gt.relevance;

πŸ—‚οΈ Oracle Property Graph Schema

CREATE PROPERTY GRAPH doc_knowledge_graph
    VERTEX TABLES (
        documents   KEY (doc_id)     LABEL document  PROPERTIES ALL COLUMNS,
        sections    KEY (section_id) LABEL section   PROPERTIES ALL COLUMNS,
        entities    KEY (entity_id)  LABEL entity    PROPERTIES ALL COLUMNS
    )
    EDGE TABLES (
        section_hierarchy   -- parent_of: section β†’ section
            KEY (edge_id)
            SOURCE KEY (parent_id) REFERENCES sections (section_id)
            DESTINATION KEY (child_id) REFERENCES sections (section_id)
            LABEL parent_of PROPERTIES ALL COLUMNS,
        section_entities    -- mentions: section β†’ entity
            KEY (edge_id)
            SOURCE KEY (section_id) REFERENCES sections (section_id)
            DESTINATION KEY (entity_id) REFERENCES entities (entity_id)
            LABEL mentions PROPERTIES ALL COLUMNS,
        entity_relationships -- related_to: entity β†’ entity
            KEY (edge_id)
            SOURCE KEY (source_entity) REFERENCES entities (entity_id)
            DESTINATION KEY (target_entity) REFERENCES entities (entity_id)
            LABEL related_to PROPERTIES ALL COLUMNS
    );
Relational tables (under the Property Graph)

Oracle SQL Property Graphs are defined on top of standard relational tables. You always retain full relational SQL access to the same data.

-- Vertex tables
CREATE TABLE documents (
    doc_id          NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    doc_name        VARCHAR2(500) NOT NULL,
    doc_description CLOB,
    source_path     VARCHAR2(1000),
    created_at      TIMESTAMP DEFAULT SYSTIMESTAMP
);

CREATE TABLE sections (
    section_id    NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    doc_id        NUMBER NOT NULL REFERENCES documents(doc_id),
    node_id       VARCHAR2(10),
    title         VARCHAR2(1000),
    summary       CLOB,
    text_content  CLOB,
    start_index   NUMBER,
    end_index     NUMBER,
    depth_level   NUMBER DEFAULT 0
);

CREATE TABLE entities (
    entity_id     NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name          VARCHAR2(500) NOT NULL,
    entity_type   VARCHAR2(100),
    description   CLOB,
    CONSTRAINT uq_entity UNIQUE (name, entity_type)
);

-- Edge tables
CREATE TABLE section_hierarchy (
    edge_id   NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    parent_id NUMBER NOT NULL REFERENCES sections(section_id),
    child_id  NUMBER NOT NULL REFERENCES sections(section_id)
);

CREATE TABLE section_entities (
    edge_id    NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    section_id NUMBER NOT NULL REFERENCES sections(section_id),
    entity_id  NUMBER NOT NULL REFERENCES entities(entity_id),
    relevance  VARCHAR2(20) DEFAULT 'MENTIONS'
);

CREATE TABLE entity_relationships (
    edge_id       NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    source_entity NUMBER NOT NULL REFERENCES entities(entity_id),
    target_entity NUMBER NOT NULL REFERENCES entities(entity_id),
    relationship  VARCHAR2(100) DEFAULT 'RELATED_TO'
);

πŸ› οΈ Configuration

Edit oracle_pageindex/config.yaml:

ollama:
  base_url: "http://localhost:11434"
  model: "gemma3"          # Any Ollama model
  temperature: 0

oracle:
  user: "pageindex"
  password: "pageindex"
  dsn: "localhost:1521/FREEPDB1"
  pool_min: 1
  pool_max: 5

πŸ§ͺ Testing

The project includes 30 unit tests covering all core modules. All tests use mocked Oracle and Ollama connections β€” no running services needed.

pytest tests/ -v
tests/test_db.py                3 tests  (connection pool, schema init, close)
tests/test_entity_extractor.py  3 tests  (entity extraction, relationships, edge cases)
tests/test_graph.py            19 tests  (all CRUD operations + SQL/PGQ queries)
tests/test_llm.py               5 tests  (sync/async chat, JSON extraction)

πŸ—ΊοΈ Project Structure

OraclePageIndex/
  oracle_pageindex/
    cli.py              # CLI: init, index, query, serve
    config.yaml         # Ollama + Oracle configuration
    db.py               # Oracle connection pool + schema init
    entity_extractor.py # LLM-powered entity extraction
    graph.py            # Property Graph CRUD + SQL/PGQ queries
    indexer.py          # Full indexing pipeline
    llm.py              # Ollama API client
    parser.py           # PDF parsing + section tree builder
    query.py            # Graph-powered query engine
    utils.py            # Config, tokens, tree manipulation
  api/
    server.py           # FastAPI + D3.js visualization server
  viz/
    index.html          # D3.js single-page app
    graph.js            # Force-directed graph rendering
    style.css           # Dark-theme styling
  tests/                # 30 tests (pytest)
  setup_schema.sql      # Oracle DDL + Property Graph definition
  docker-compose.yml    # Oracle 26ai Free container
  run.py                # CLI entry point

πŸ”„ What Changed From PageIndex

PageIndex OraclePageIndex
Storage JSON files on disk Oracle Property Graph (SQL/PGQ)
LLM OpenAI GPT-4o Ollama (fully local, open-source)
Retrieval LLM-driven tree navigation SQL/PGQ graph traversal + LLM reasoning
Entities None Full named entity extraction + cross-doc linking
Relationships None Entity relationship discovery (RELATED_TO, PART_OF, OPERATES_IN, ...)
Visualization None Interactive D3.js force-directed graph
Multi-document Separate JSON per doc Unified knowledge graph across all documents
API None FastAPI serving graph data + queries
Query interface None Natural-language queries with source citations

πŸ› οΈ Tech Stack

Component Technology
Database Oracle Database 26ai Free (SQL Property Graphs, SQL/PGQ)
Graph Model CREATE PROPERTY GRAPH with ISO SQL:2023 GRAPH_TABLE
LLM Ollama (gemma3, llama3.1, or any supported model)
Backend Python 3.11+, FastAPI, python-oracledb
Visualization D3.js v7 (interactive force-directed graph)
PDF Parsing PyMuPDF, PyPDF2
Tokenization tiktoken
Container Docker Compose

⭐ Credits & Acknowledgments

  • PageIndex by VectifyAI β€” the original vectorless, reasoning-based RAG framework that inspired this project.
  • Oracle Database β€” SQL Property Graphs (SQL/PGQ) provide the graph storage and query foundation.
  • Ollama β€” local open-source LLM inference.
  • D3.js β€” interactive graph visualization.

Leave a star if you find this useful!


GitHubΒ  LinkedInΒ  Oracle

MIT License

About

Oracle AI Database-powered document intelligence with SQL Property Graphs, vectorless RAG, and knowledge graph visualization

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors