An intelligent web scraping and company scoring system that uses Large Language Models (LLMs) to automatically extract, aggregate, enrich, and score portfolio companies from venture capital and private equity firm websites.
This project automates the complete process of discovering and evaluating portfolio companies from VC/PE firm websites. It addresses the challenge of extracting structured data from diverse, unstructured web sources without hand-writing a parser for each site: the system uses LLMs to understand page structure, generate extraction code, handle navigation patterns, and score companies against investment criteria.
- LLM-Driven Parser Generation: Automatically generates JavaScript extraction code by analyzing page structure
- Self-Correcting Extraction: Validates extracted data and retries with error feedback if validation fails
- Adaptive Web Navigation: Intelligently determines and executes navigation actions (pagination, infinite scroll, "Load More" buttons)
- Parser Caching: Saves and reuses successful parsers to reduce API costs and processing time
- Portfolio Data Aggregation: Combines scraped data from multiple sources with firm attribution
- Company Scoring: Scores companies against the parsed investment criteria using an LLM with web search
- Data Enrichment: Finds Crunchbase URLs and enriches companies with firmographic data
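The self-correcting extraction loop above can be sketched as follows. The helper callables (`generate_parser`, `run_parser`, `validate`) are illustrative stand-ins, not the project's actual API:

```python
def extract_with_retries(generate_parser, run_parser, validate,
                         page_html: str, max_attempts: int = 3):
    """Generate a parser, validate its output, and retry with error feedback.

    `generate_parser(html, feedback)` asks the LLM for extraction code,
    `run_parser(js, html)` executes it against the page, and
    `validate(companies)` returns an error string ("" when the data looks
    good). All three are hypothetical stand-ins for the real modules.
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        parser_js = generate_parser(page_html, feedback)  # LLM call
        companies = run_parser(parser_js, page_html)      # execute extraction code
        errors = validate(companies)                      # e.g. empty names, bad URLs
        if not errors:
            return companies, parser_js                   # success: parser can be cached
        feedback = f"attempt {attempt}: {errors}"         # fed back to the LLM
    raise RuntimeError(f"extraction failed after {max_attempts} attempts")
```

On success, the returned `parser_js` is what the caching layer would store for reuse on the next run.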
```
find_VC_parners/
├── main.py                      # Entry point & pipeline orchestration
├── config.py                    # Configuration settings
├── src/
│   ├── orchestrator.py          # Main scraping workflow
│   ├── browser_manager.py       # Playwright + Bright Data
│   ├── parser_generator.py      # LLM parser generation
│   ├── parser_cache_manager.py  # Parser caching
│   ├── navigation_strategy.py   # Pagination handling
│   ├── results_collector.py     # Deduplication
│   ├── aggregate_all.py         # Portfolio aggregation
│   ├── scorer.py                # Company scoring
│   └── social_media_enricher/
│       ├── crunchbase_enricher.py
│       └── google_search.py
├── output/                      # Scraped data (JSON per URL)
├── aggregated/                  # Combined & scored data
├── agent_output/                # Final CSV + config
└── parser_cache/                # Cached parsers
```
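The `parser_cache/` directory can be keyed by a hash of the portfolio URL. A minimal sketch of how lookup and storage might work (the file naming and JSON layout here are assumptions, not the project's actual format):

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("parser_cache")  # matches the repo layout above

def cache_key(url: str) -> str:
    # Stable, filesystem-safe filename derived from the portfolio URL.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]

def load_cached_parser(url: str):
    """Return the cached extraction code for `url`, or None on a cache miss."""
    path = CACHE_DIR / f"{cache_key(url)}.json"
    if path.exists():
        return json.loads(path.read_text())["parser_js"]
    return None

def save_parser(url: str, parser_js: str) -> None:
    """Persist a successful parser so later runs skip the LLM call."""
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{cache_key(url)}.json"
    path.write_text(json.dumps({"url": url, "parser_js": parser_js}))
```

A cache hit skips the generate/validate loop entirely, which is where the cost savings on repeated scraping come from.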
- Python 3.8+
- Anthropic API key (Claude)
- OpenAI API key
- Bright Data proxy credentials (optional, for avoiding IP blocks)
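Before running, it can help to fail fast when a required key is missing. A small illustrative check (the variable names match the `.env` configuration shown in the installation steps; the proxy variables are intentionally excluded because they are optional):

```python
import os

# Required keys; the BRIGHT_DATA_* proxy variables are optional.
REQUIRED_KEYS = ["ANTHROPIC_API_KEY", "OPENAI_API_KEY"]

def missing_keys() -> list:
    """Return the names of required environment variables that are unset or empty."""
    return [name for name in REQUIRED_KEYS if not os.environ.get(name)]
```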
```bash
# Clone the repository
git clone https://github.com/leeroo-coder/find_VC_parners.git
cd find_VC_parners

# Install dependencies
pip install -r requirements.txt

# Install Playwright browsers
playwright install chromium
```

Create a `.env` file in the project root with the following variables:
```
ANTHROPIC_API_KEY=your_anthropic_api_key
OPENAI_API_KEY=your_openai_api_key
BRIGHT_DATA_PROXY_USER=your_proxy_user
BRIGHT_DATA_PROXY_PASSWORD=your_proxy_password
BRIGHT_DATA_PROXY_HOST=your_proxy_host
BRIGHT_DATA_PROXY_PORT=your_proxy_port
BRIGHT_DATA_API_KEY=your_bright_data_api_key
```

To run the complete pipeline from Python:

```python
import asyncio

from main import complete_pipeline

urls = [
    "https://capman.com/private-equity/buyout/portfolio/",
    "https://capza.co/companies-listing/",
]

criteria = """
- Mission-critical B2B software
- 10-100M EUR revenue
- European focused business
"""

result = asyncio.run(complete_pipeline(
    urls=urls,
    filtering_criteria=criteria,
    max_companies=10,
))
print(f"Scored {result['companies_scored']} companies")
```
```bash
# Run the complete pipeline
python main.py pipeline

# Scrape a single URL
python main.py scrape https://example.com/portfolio

# Aggregate all scraped data
python main.py aggregate_all

# Score the first 10 companies
python main.py score 10

# List cached parsers
python main.py list
```

- Parser caching reduces API costs by 80-90% on repeated scraping
- Parallel scoring processes 5-10 companies simultaneously
- Average processing time: 5-10 minutes per portfolio site (first run), 1-2 minutes (cached)
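Bounded parallel scoring can be sketched with a semaphore; `score_one` below is a stand-in for the real per-company LLM call in `src/scorer.py`:

```python
import asyncio

async def score_companies(companies, score_one, max_parallel: int = 5):
    """Score all companies, with at most `max_parallel` LLM calls in flight."""
    sem = asyncio.Semaphore(max_parallel)

    async def bounded(company):
        async with sem:                 # blocks when max_parallel calls are running
            return await score_one(company)

    # gather preserves input order, so scores line up with the company list.
    return await asyncio.gather(*(bounded(c) for c in companies))
```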
- Successfully extract ≥90% of visible companies
- Score accuracy validated by investment team
- Processing time <30 minutes per portfolio site
- JavaScript-heavy SPAs may require additional wait time
- Some sites block automated access despite proxies
- Crunchbase data availability varies by company size/region
For detailed workflow documentation, see the web_data_integration_llm workflow on the internal wiki.
MIT License
Contributions are welcome! Please feel free to submit a Pull Request.