hima-d-bot/pe-knowledge-graph

Private Equity Knowledge Graph Prototype

This project is a prototype data pipeline and interface that ingests unstructured public data from Private Equity websites and transforms it into a structured Knowledge Graph using Neo4j.

Architecture

  1. ETL Layer (Python + Playwright + LLM):
    • Scrapes Portfolio and News pages of 20 target PE firms.
    • Uses GPT-4o-mini to extract structured entities (Funds, Portcos, Events, People) and relationships from noisy HTML.
    • Normalizes data into a structured JSON format.
  2. Storage Layer (Neo4j):
    • Models PE firms, portfolio companies, and people, along with the deals and hires that connect them, as a graph.
    • Nodes: PEFirm, Company, Person.
    • Relationships: ACQUIRED, EXITED, HIRED_BY, RAISED.
  3. Interface Layer (Streamlit):
    • Table View: A filterable ledger of all extracted events.
    • Chat Interface: A "Text-to-Graph" interface that uses an LLM to convert natural-language questions into Cypher queries.
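
The Chat Interface step can be sketched as follows. This is a minimal illustration, not the code in ui/app.py: the function names build_prompt and is_read_only, the prompt wording, and the list of rejected clauses are all assumptions. It shows how the schema above could be placed in the LLM prompt and how a generated query could be checked before it is sent to Neo4j.

```python
import re

# Schema summary taken from the Architecture section; prompt wording is illustrative.
SCHEMA = (
    "Nodes: PEFirm, Company, Person\n"
    "Relationships: ACQUIRED, EXITED, HIRED_BY, RAISED"
)

# Clauses that can modify the graph; generated queries containing them are rejected.
WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|LOAD\s+CSV|CALL)\b",
    re.IGNORECASE,
)


def build_prompt(question: str) -> str:
    """Combine the schema and the user's question into one LLM prompt."""
    return (
        "Translate the question into a single read-only Cypher query.\n"
        f"Graph schema:\n{SCHEMA}\n"
        f"Question: {question}\n"
        "Cypher:"
    )


def is_read_only(cypher: str) -> bool:
    """Return True if the query has no write clause and contains RETURN."""
    if WRITE_CLAUSES.search(cypher):
        return False
    return re.search(r"\bRETURN\b", cypher, re.IGNORECASE) is not None
```

The check is deliberately conservative: it also rejects harmless queries whose string literals happen to contain a word such as "set". For a prototype that may be an acceptable trade-off against running an LLM-generated write query.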

Setup Instructions

Prerequisites

  • Python 3.11+
  • Neo4j Database
  • OpenAI API Key (set as OPENAI_API_KEY environment variable)

Installation

  1. Clone the repository.
  2. Install dependencies:
    pip install -r requirements.txt
    playwright install chromium
  3. Ensure Neo4j is running, then update the connection credentials in ui/app.py and scrapers/load_to_neo4j.py.
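
As an alternative to editing credentials in two files, the connection settings could be read from the environment, the same way OPENAI_API_KEY is. The sketch below is an assumption about how that might look, not the repository's current behavior; the variable names NEO4J_URI, NEO4J_USER, and NEO4J_PASSWORD and the function neo4j_settings are hypothetical.

```python
import os


def neo4j_settings(env=os.environ) -> dict:
    """Collect Neo4j connection settings from environment variables.

    NEO4J_URI, NEO4J_USER and NEO4J_PASSWORD are hypothetical names.
    The defaults are the usual local Bolt address and default user name.
    """
    password = env.get("NEO4J_PASSWORD")
    if not password:
        raise RuntimeError("Set NEO4J_PASSWORD before running the loader or UI")
    return {
        "uri": env.get("NEO4J_URI", "bolt://localhost:7687"),
        "user": env.get("NEO4J_USER", "neo4j"),
        "password": password,
    }
```

If the project uses the official neo4j Python driver, the returned values would fit a call such as GraphDatabase.driver(uri, auth=(user, password)).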

Running the Pipeline

  1. Scrape Data:
    python scrapers/etl.py
  2. Load to Neo4j:
    python scrapers/load_to_neo4j.py
  3. Start UI:
    streamlit run ui/app.py

Graph Schema Design Choices

  • Nodes: We use PEFirm as the central entity. Company represents portfolio companies. Person represents key personnel.
  • Relationships: A flat table stores each event in isolation, whereas the graph makes the connections between events explicit. For example, a Person can be linked to multiple Company nodes over time, and a PEFirm can hold several relationships with the same Company (an acquisition followed by an exit).
  • Scalability: The schema is designed to be extensible. New entity types (e.g., LPs, Sectors) can be added as nodes without breaking existing relationships.
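
To make the schema concrete, here is a rough sketch of how one extracted event might be turned into a parameterized Cypher statement. It is not taken from scrapers/load_to_neo4j.py: the event field names (type, firm, target, date), the relationship direction from PEFirm to Company, the date property, and the function event_to_cypher are all assumptions for illustration.

```python
# Relationship types from the README that link a PEFirm to a Company.
# The direction (PEFirm -> Company) is an assumption for illustration.
TARGET_LABEL = {
    "ACQUIRED": "Company",
    "EXITED": "Company",
}


def event_to_cypher(event: dict) -> tuple[str, dict]:
    """Build a parameterized MERGE statement for one extracted event."""
    rel_type = event["type"].upper()
    if rel_type not in TARGET_LABEL:
        raise ValueError(f"Unsupported event type: {rel_type}")
    # Labels and relationship types cannot be Cypher parameters, so they are
    # interpolated only after the allow-list check above.
    query = (
        "MERGE (f:PEFirm {name: $firm}) "
        f"MERGE (t:{TARGET_LABEL[rel_type]} {{name: $target}}) "
        f"MERGE (f)-[r:{rel_type}]->(t) "
        "SET r.date = $date"
    )
    params = {
        "firm": event["firm"].strip(),
        "target": event["target"].strip(),
        "date": event.get("date"),  # hypothetical optional field
    }
    return query, params
```

Because MERGE matches existing nodes by name, an acquisition event and a later exit event for the same firm and company would produce two relationships between the same pair of nodes, which is the case described in the Relationships bullet.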

Deliverables

  • Codebase: Full source code for scrapers, loader, and UI.
  • Data Dump: data/events.json contains the extracted data for the processed firms.
  • Video Walkthrough: Not included; the README and code comments serve as the detailed explanation.
