Find VC Partners

An intelligent web scraping and company scoring system that uses Large Language Models (LLMs) to automatically extract, aggregate, enrich, and score portfolio companies from venture capital and private equity firm websites.

Overview

This project automates the complete process of discovering and evaluating portfolio companies from VC/PE firm websites. It addresses the challenge of extracting structured data from diverse, unstructured web sources without manual parser development. The system uses LLMs to understand page structure, generate extraction code, handle navigation patterns, and score companies based on investment criteria.

Features

  • LLM-Driven Parser Generation: Automatically generates JavaScript extraction code by analyzing page structure
  • Self-Correcting Extraction: Validates extracted data and retries with error feedback if validation fails
  • Adaptive Web Navigation: Intelligently determines and executes navigation actions (pagination, infinite scroll, "Load More" buttons)
  • Parser Caching: Saves and reuses successful parsers to reduce API costs and processing time
  • Portfolio Data Aggregation: Combines scraped data from multiple sources with firm attribution
  • Company Scoring: Scores each company against user-supplied investment criteria using an LLM with web search
  • Data Enrichment: Finds Crunchbase URLs and enriches companies with firmographic data
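The self-correcting extraction described above amounts to a retry loop with error feedback. A minimal sketch (`generate_parser` and `validate` are stand-ins for the real LLM and validation steps, which are not shown here):

```python
from typing import Callable, Optional

def extract_with_retries(
    generate_parser: Callable[[Optional[str]], Callable[[], list]],
    validate: Callable[[list], Optional[str]],
    max_attempts: int = 3,
) -> list:
    """Generate a parser, run it, and regenerate with error feedback on failure."""
    feedback = None
    for _ in range(max_attempts):
        parser = generate_parser(feedback)  # LLM call in the real system
        results = parser()
        feedback = validate(results)        # None means the data looks valid
        if feedback is None:
            return results
    raise RuntimeError(f"Extraction failed after {max_attempts} attempts: {feedback}")
```

On each failed attempt, the validation error is passed back to the parser generator so the next prompt can correct the previous mistake.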

Architecture

find_VC_parners/
├── main.py                    # Entry point & pipeline orchestration
├── config.py                  # Configuration settings
├── src/
│   ├── orchestrator.py        # Main scraping workflow
│   ├── browser_manager.py     # Playwright + Bright Data
│   ├── parser_generator.py    # LLM parser generation
│   ├── parser_cache_manager.py # Parser caching
│   ├── navigation_strategy.py # Pagination handling
│   ├── results_collector.py   # Deduplication
│   ├── aggregate_all.py       # Portfolio aggregation
│   ├── scorer.py              # Company scoring
│   └── social_media_enricher/
│       ├── crunchbase_enricher.py
│       └── google_search.py
├── output/                    # Scraped data (JSON per URL)
├── aggregated/                # Combined & scored data
├── agent_output/              # Final CSV + config
└── parser_cache/              # Cached parsers
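The parser_cache/ directory holds one reusable parser per site. A minimal sketch of how such a cache might be keyed and read (the hostname-based `cache_key` scheme and the JSON layout are illustrative assumptions, not the project's actual format):

```python
import json
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse

CACHE_DIR = Path("parser_cache")

def cache_key(url: str) -> str:
    # One parser per host: portfolio pages on the same site share structure.
    return urlparse(url).netloc.replace(".", "_")

def load_cached_parser(url: str) -> Optional[str]:
    """Return cached extraction code for this site, or None on a cache miss."""
    path = CACHE_DIR / (cache_key(url) + ".json")
    if path.exists():
        return json.loads(path.read_text())["parser_js"]
    return None

def save_parser(url: str, parser_js: str) -> None:
    CACHE_DIR.mkdir(exist_ok=True)
    entry = {"url": url, "parser_js": parser_js}
    (CACHE_DIR / (cache_key(url) + ".json")).write_text(json.dumps(entry))
```

A cache hit skips the LLM generation step entirely, which is where the repeated-scrape cost savings come from.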

Prerequisites

  • Python 3.8+
  • Anthropic API key (Claude)
  • OpenAI API key
  • Bright Data proxy credentials (optional, for avoiding IP blocks)

Installation

# Clone the repository
git clone https://github.com/leeroo-coder/find_VC_parners.git
cd find_VC_parners

# Install dependencies
pip install -r requirements.txt

# Install Playwright browsers
playwright install chromium

Configuration

Create a .env file in the project root with the following variables:

ANTHROPIC_API_KEY=your_anthropic_api_key
OPENAI_API_KEY=your_openai_api_key
BRIGHT_DATA_PROXY_USER=your_proxy_user
BRIGHT_DATA_PROXY_PASSWORD=your_proxy_password
BRIGHT_DATA_PROXY_HOST=your_proxy_host
BRIGHT_DATA_PROXY_PORT=your_proxy_port
BRIGHT_DATA_API_KEY=your_bright_data_api_key
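Since the Bright Data proxy is optional, the configuration layer presumably falls back to direct access when the proxy variables are unset. A minimal sketch (the `bright_data_proxy` helper is hypothetical; the real logic lives in config.py):

```python
import os
from typing import Optional

def bright_data_proxy() -> Optional[str]:
    """Build a proxy server URL from the Bright Data env vars, or None if unset."""
    host = os.getenv("BRIGHT_DATA_PROXY_HOST")
    port = os.getenv("BRIGHT_DATA_PROXY_PORT")
    if not (host and port):
        return None  # proxy is optional: scrape directly
    return "http://{}:{}".format(host, port)
```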

Usage

Complete Pipeline (Recommended)

from main import complete_pipeline
import asyncio

urls = [
    "https://capman.com/private-equity/buyout/portfolio/",
    "https://capza.co/companies-listing/"
]

criteria = """
- Mission-critical B2B software
- 10-100M EUR revenue
- European focused business
"""

result = asyncio.run(complete_pipeline(
    urls=urls,
    filtering_criteria=criteria,
    max_companies=10
))

print(f"Scored {result['companies_scored']} companies")

Individual Commands

# Run complete pipeline
python main.py pipeline

# Scrape a single URL
python main.py scrape https://example.com/portfolio

# Aggregate all scraped data
python main.py aggregate_all

# Score first 10 companies
python main.py score 10

# List cached parsers
python main.py list

Performance

  • Parser caching reduces API costs by 80-90% on repeated scraping
  • Parallel scoring processes 5-10 companies simultaneously
  • Average processing time: 5-10 minutes per portfolio site (first run), 1-2 minutes (cached)
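Bounded parallel scoring of the kind described above can be sketched with an asyncio.Semaphore (illustrative only; `score_one` stands in for the real LLM-plus-web-search scoring call):

```python
import asyncio

async def score_all(companies, score_one, limit=5):
    """Score companies concurrently, capping in-flight LLM calls at `limit`."""
    sem = asyncio.Semaphore(limit)

    async def bounded(company):
        async with sem:
            return await score_one(company)  # one scoring call per company

    # gather preserves input order, so results line up with `companies`
    return await asyncio.gather(*(bounded(c) for c in companies))
```

The semaphore keeps 5-10 requests in flight without flooding the API's rate limits.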

Success Criteria

  • Extract ≥90% of the companies visible on each portfolio page
  • Score accuracy validated by investment team
  • Processing time <30 minutes per portfolio site

Known Limitations

  • JavaScript-heavy SPAs may require additional wait time
  • Some sites block automated access despite proxies
  • Crunchbase data availability varies by company size/region

Related Documentation

For detailed workflow documentation, see the web_data_integration_llm workflow on the internal wiki.

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
