This project is a prototype data pipeline and interface that ingests unstructured public data from Private Equity websites and transforms it into a structured Knowledge Graph using Neo4j.
## Architecture

- ETL Layer (Python + Playwright + LLM):
  - Scrapes the Portfolio and News pages of 20 target PE firms.
  - Uses GPT-4o-mini to extract structured entities (Funds, Portcos, Events, People) and relationships from noisy HTML.
  - Normalizes the data into a structured JSON format.
- Storage Layer (Neo4j):
  - Models the industry's complexity using a graph schema.
  - Nodes: `PEFirm`, `Company`, `Person`.
  - Relationships: `ACQUIRED`, `EXITED`, `HIRED_BY`, `RAISED`.
- Interface Layer (Streamlit):
  - Table View: a filterable ledger of all extracted events.
  - Chat Interface: a "Text-to-Graph" interface that uses an LLM to convert natural-language questions into Cypher queries.
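The normalization step of the ETL layer can be sketched as follows. This is an illustrative sketch, not the repo's actual code: the function name `parse_events` and the field names (`firm`, `company`, `event_type`, `person`) are assumptions; the real schema lives in `scrapers/etl.py`.

```python
import json

# Fields we expect the LLM to return for each event. These names are
# illustrative assumptions -- the actual schema is defined in scrapers/etl.py.
REQUIRED_FIELDS = {"firm", "company", "event_type"}

def parse_events(raw_llm_output: str) -> list[dict]:
    """Normalize the LLM's JSON response into a list of clean event dicts,
    dropping records that are missing required fields (a common failure
    mode when extracting from noisy HTML)."""
    records = json.loads(raw_llm_output)
    clean = []
    for rec in records:
        if not REQUIRED_FIELDS.issubset(rec):
            continue  # skip malformed extractions
        clean.append({
            "firm": rec["firm"].strip(),
            "company": rec["company"].strip(),
            "event_type": rec["event_type"].strip().upper(),
            "person": rec.get("person"),  # only present for hiring events
        })
    return clean
```

In the real pipeline, a step like this would sit between the Playwright scrape plus LLM call and the JSON dump to `data/events.json`.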
## Prerequisites

- Python 3.11+
- Neo4j database
- OpenAI API key (set as the `OPENAI_API_KEY` environment variable)
## Setup

- Clone the repository.
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  playwright install chromium
  ```

- Ensure Neo4j is running and update the credentials in `ui/app.py` and `scrapers/load_to_neo4j.py`.
## Usage

- Scrape data: `python scrapers/etl.py`
- Load to Neo4j: `python scrapers/load_to_neo4j.py`
- Start the UI: `streamlit run ui/app.py`
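The load step presumably upserts nodes and relationships so the pipeline can be re-run safely. A minimal sketch of that logic, assuming the helper name `event_to_cypher` and the event dict shape (neither is from the repo):

```python
# Sketch of the loading logic; the real implementation is in
# scrapers/load_to_neo4j.py. Relationship types match the graph schema.
VALID_RELS = {"ACQUIRED", "EXITED", "HIRED_BY", "RAISED"}

def event_to_cypher(event: dict) -> tuple[str, dict]:
    """Turn one extracted event into a parameterized Cypher MERGE.
    MERGE keeps the load idempotent: re-running the loader does not
    duplicate firms, companies, or relationships."""
    rel = event["event_type"]
    # Relationship types cannot be Cypher parameters, so the type is
    # interpolated -- the whitelist above guards against injection.
    if rel not in VALID_RELS:
        raise ValueError(f"unknown relationship type: {rel}")
    query = (
        "MERGE (f:PEFirm {name: $firm}) "
        "MERGE (c:Company {name: $company}) "
        f"MERGE (f)-[:{rel}]->(c)"
    )
    return query, {"firm": event["firm"], "company": event["company"]}

# Running it against a live database would look roughly like this
# (assumes the neo4j Python driver and local credentials):
#   from neo4j import GraphDatabase
#   driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
#   with driver.session() as session:
#       query, params = event_to_cypher(event)
#       session.run(query, params)
```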
## Graph Schema Design

- Nodes: `PEFirm` is the central entity. `Company` represents portfolio companies. `Person` represents key personnel.
- Relationships: Unlike a flat table, the graph lets us see connections. For example, a `Person` can be linked to multiple `Company` nodes over time, and a `PEFirm` can have multiple types of relationships with the same `Company` (an acquisition followed by an exit).
- Scalability: The schema is designed to be extensible. New entity types (e.g., `LP`, `Sector`) can be added as nodes without breaking existing relationships.
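The patterns above map naturally onto Cypher. The query strings below are sketches against the schema as described here, not queries from the repo; the `HIRED_BY` direction (`Person` to `Company`) is an assumption.

```python
# Cypher sketches against the schema above. Because they match on labels
# and relationship types, they keep working unchanged if new node types
# (e.g., Sector) are added later -- the extensibility point made above.

# Companies a firm both acquired and later exited.
ACQUIRED_THEN_EXITED = """
MATCH (f:PEFirm)-[:ACQUIRED]->(c:Company),
      (f)-[:EXITED]->(c)
RETURN f.name AS firm, c.name AS company
"""

# People connected to more than one portfolio company over time
# (direction of HIRED_BY is assumed here).
SERIAL_OPERATORS = """
MATCH (p:Person)-[:HIRED_BY]->(c:Company)
WITH p, count(c) AS companies
WHERE companies > 1
RETURN p.name AS person, companies
"""
```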
## Deliverables

- Codebase: full source code for the scrapers, loader, and UI.
- Data Dump: `data/events.json` contains the extracted data for the processed firms.
- Video Walkthrough: not included; this README and the code comments serve as the detailed explanation.