An intelligent web scraping and company scoring system that uses Large Language Models (LLMs) to automatically extract, aggregate, enrich, and score portfolio companies from venture capital and private equity firm websites.
This project automates the complete process of discovering and evaluating portfolio companies from VC/PE firm websites. It addresses the challenge of extracting structured data from diverse, unstructured web sources without hand-writing a parser for each site: the system uses LLMs to understand page structure, generate extraction code, handle navigation patterns, and score companies against investment criteria.
- LLM-Driven Parser Generation: Automatically generates JavaScript extraction code by analyzing page structure
- Self-Correcting Extraction: Validates extracted data and retries with error feedback if validation fails
- Adaptive Web Navigation: Intelligently determines and executes navigation actions (pagination, infinite scroll, "Load More" buttons)
- Parser Caching: Saves and reuses successful parsers to reduce API costs and processing time
- Portfolio Data Aggregation: Combines scraped data from multiple sources with firm attribution
- Company Scoring: Scores companies against the parsed investment criteria using an LLM with web search
- Data Enrichment: Finds Crunchbase URLs and enriches companies with firmographic data
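The self-correcting extraction loop above can be sketched as follows. The helper callables (`generate_parser`, `run_parser`, `validate`) are illustrative stand-ins, not the project's actual API:

```python
def extract_with_retries(generate_parser, run_parser, validate,
                         page_html: str, max_attempts: int = 3):
    """Generate a parser, validate its output, and retry with error feedback.

    `generate_parser(html, feedback)` asks the LLM for extraction code,
    `run_parser(js, html)` executes it against the page, and
    `validate(companies)` returns an error string ("" when the data looks
    good). All three are hypothetical stand-ins for the real modules.
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        parser_js = generate_parser(page_html, feedback)  # LLM call
        companies = run_parser(parser_js, page_html)      # execute extraction code
        errors = validate(companies)                      # e.g. empty names, bad URLs
        if not errors:
            return companies, parser_js                   # success: parser can be cached
        feedback = f"attempt {attempt}: {errors}"         # fed back to the LLM
    raise RuntimeError(f"extraction failed after {max_attempts} attempts")
```

On success, the returned `parser_js` is what the caching layer would store for reuse on the next run.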
```
find_VC_parners/
├── main.py                      # Entry point & pipeline orchestration
├── config.py                    # Configuration settings
├── src/
│   ├── orchestrator.py          # Main scraping workflow
│   ├── browser_manager.py       # Playwright + Bright Data
│   ├── parser_generator.py      # LLM parser generation
│   ├── parser_cache_manager.py  # Parser caching
│   ├── navigation_strategy.py   # Pagination handling
│   ├── results_collector.py     # Deduplication
│   ├── aggregate_all.py         # Portfolio aggregation
│   ├── scorer.py                # Company scoring
│   └── social_media_enricher/
│       ├── crunchbase_enricher.py
│       └── google_search.py
├── output/                      # Scraped data (JSON per URL)
├── aggregated/                  # Combined & scored data
├── agent_output/                # Final CSV + config
└── parser_cache/                # Cached parsers
```
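The `parser_cache/` directory can be keyed by a hash of the portfolio URL. A minimal sketch of how lookup and storage might work (the file naming and JSON layout here are assumptions, not the project's actual format):

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("parser_cache")  # matches the repo layout above

def cache_key(url: str) -> str:
    # Stable, filesystem-safe filename derived from the portfolio URL.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]

def load_cached_parser(url: str):
    """Return the cached extraction code for `url`, or None on a cache miss."""
    path = CACHE_DIR / f"{cache_key(url)}.json"
    if path.exists():
        return json.loads(path.read_text())["parser_js"]
    return None

def save_parser(url: str, parser_js: str) -> None:
    """Persist a successful parser so later runs skip the LLM call."""
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{cache_key(url)}.json"
    path.write_text(json.dumps({"url": url, "parser_js": parser_js}))
```

A cache hit skips the generate/validate loop entirely, which is where the cost savings on repeated scraping come from.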
- Python 3.8+
- Anthropic API key (Claude)
- OpenAI API key
- Bright Data proxy credentials (optional, for avoiding IP blocks)
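Before running, it can help to fail fast when a required key is missing. A small illustrative check (the variable names match the `.env` configuration shown in the installation steps; the proxy variables are intentionally excluded because they are optional):

```python
import os

# Required keys; the BRIGHT_DATA_* proxy variables are optional.
REQUIRED_KEYS = ["ANTHROPIC_API_KEY", "OPENAI_API_KEY"]

def missing_keys() -> list:
    """Return the names of required environment variables that are unset or empty."""
    return [name for name in REQUIRED_KEYS if not os.environ.get(name)]
```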
```bash
# Clone the repository
git clone https://github.com/leeroo-coder/find_VC_parners.git
cd find_VC_parners

# Install dependencies
pip install -r requirements.txt

# Install Playwright browsers
playwright install chromium
```

Create a `.env` file in the project root with the following variables:
```
ANTHROPIC_API_KEY=your_anthropic_api_key
OPENAI_API_KEY=your_openai_api_key
BRIGHT_DATA_PROXY_USER=your_proxy_user
BRIGHT_DATA_PROXY_PASSWORD=your_proxy_password
BRIGHT_DATA_PROXY_HOST=your_proxy_host
BRIGHT_DATA_PROXY_PORT=your_proxy_port
BRIGHT_DATA_API_KEY=your_bright_data_api_key
```

To run the complete pipeline from Python:

```python
import asyncio

from main import complete_pipeline

urls = [
    "https://capman.com/private-equity/buyout/portfolio/",
    "https://capza.co/companies-listing/",
]

criteria = """
- Mission-critical B2B software
- 10-100M EUR revenue
- European focused business
"""

result = asyncio.run(complete_pipeline(
    urls=urls,
    filtering_criteria=criteria,
    max_companies=10,
))
print(f"Scored {result['companies_scored']} companies")
```
```bash
# Run the complete pipeline
python main.py pipeline

# Scrape a single URL
python main.py scrape https://example.com/portfolio

# Aggregate all scraped data
python main.py aggregate_all

# Score the first 10 companies
python main.py score 10

# List cached parsers
python main.py list
```

- Parser caching reduces API costs by 80-90% on repeated scraping
- Parallel scoring processes 5-10 companies simultaneously
- Average processing time: 5-10 minutes per portfolio site (first run), 1-2 minutes (cached)
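Bounded parallel scoring can be sketched with a semaphore; `score_one` below is a stand-in for the real per-company LLM call in `src/scorer.py`:

```python
import asyncio

async def score_companies(companies, score_one, max_parallel: int = 5):
    """Score all companies, with at most `max_parallel` LLM calls in flight."""
    sem = asyncio.Semaphore(max_parallel)

    async def bounded(company):
        async with sem:                 # blocks when max_parallel calls are running
            return await score_one(company)

    # gather preserves input order, so scores line up with the company list.
    return await asyncio.gather(*(bounded(c) for c in companies))
```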
- Successfully extract ≥90% of visible companies
- Score accuracy validated by investment team
- Processing time <30 minutes per portfolio site
- JavaScript-heavy SPAs may require additional wait time
- Some sites block automated access despite proxies
- Crunchbase data availability varies by company size/region
For detailed workflow documentation, see the web_data_integration_llm workflow on the internal wiki.
MIT License
Contributions are welcome! Please feel free to submit a Pull Request.