
Pydantic Scrape

A modern web scraping framework that combines AI-powered content extraction with intelligent workflow orchestration. Built on pydantic-ai for reliable, type-safe scraping operations.

Why Pydantic Scrape?

Web scraping is complex. You need to handle dynamic content, extract meaningful information, and orchestrate multi-step workflows. Most tools force you to choose between simple scrapers and complex frameworks with steep learning curves.

Pydantic Scrape bridges this gap by providing:

  • AI-powered extraction - Let AI understand and extract what you need instead of writing brittle selectors
  • Type-safe workflows - Structured data with validation built-in
  • Academic research focus - First-class support for papers, citations, and research workflows
  • Browser automation - Handle JavaScript, authentication, and complex interactions seamlessly

Installation

1. Install Chawan Terminal Browser (Required)

Pydantic Scrape uses the Chawan terminal browser for advanced web automation and for rendering JavaScript-heavy sites.

Option 1: Homebrew (macOS) - Stable Release

brew install chawan

Option 2: From Source - Latest Development Version

# Install Nim compiler (if not already installed)
brew install nim  # macOS
# OR for Linux: curl https://nim-lang.org/choosenim/init.sh -sSf | sh -s -- -y

# Build latest Chawan from source
git clone https://git.sr.ht/~bptato/chawan
cd chawan && make

# Install locally (recommended for development)
mkdir -p ~/.local/bin ~/.local/libexec
cp target/release/bin/cha ~/.local/bin/
cp -r target/release/libexec/chawan ~/.local/libexec/
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc

Verify installation:

cha --version
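
If cha isn't found after a from-source install, confirm that ~/.local/bin actually made it onto your PATH:

command -v cha || echo "cha not on PATH - check that ~/.local/bin is exported"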

2. Install Pydantic Scrape

Lean Installation (Recommended)

# Core functionality only (~50MB)
pip install pydantic-scrape

Extended Installation with Optional Features

# YouTube video processing
pip install pydantic-scrape[youtube]

# AI services (OpenAI, Google AI)  
pip install pydantic-scrape[ai]

# Academic research (OpenAlex, CrossRef)
pip install pydantic-scrape[academic]

# Document processing (PDF, Word, eBooks)
pip install pydantic-scrape[documents]

# Advanced content extraction
pip install pydantic-scrape[content]

# Everything included (~300MB)
pip install pydantic-scrape[all]

# Development tools (if contributing)
pip install pydantic-scrape[dev]
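
Extras can also be combined in a single install using standard pip syntax:

# Quote the spec so your shell doesn't expand the brackets
pip install "pydantic-scrape[ai,academic]"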

Quick Start

Get a comprehensive research answer in under 10 lines:

import asyncio
from pydantic_scrape.graphs.search_answer import search_answer

async def research():
    result = await search_answer(
        query="latest advances in quantum computing",
        max_search_results=5
    )
    
    print(f"Found {len(result['answer']['sources'])} sources")
    print(result['answer']['answer'])

asyncio.run(research())

This searches academic sources, extracts content, and generates a structured answer with citations - all automatically.

Core Features

🔍 Smart Content Extraction

from pydantic_scrape.dependencies.fetch import fetch_url

# Automatically handles JavaScript, selects best extraction method
content = await fetch_url("https://example.com/article")
print(content.title, content.text, content.metadata)

🤖 AI-Powered Scraping

from pydantic_scrape.agents.bs4_scrape_script_agent import get_bs4_scrape_script_agent

# AI writes the scraping code for you
agent = get_bs4_scrape_script_agent()
result = await agent.run("Extract product prices from this e-commerce page",
                         html_content=page_html)  # page_html: HTML you fetched earlier

📚 Academic Research

from pydantic_scrape.dependencies.openalex import OpenAlexDependency

# Search papers by topic, author, or DOI
openalex = OpenAlexDependency()
papers = await openalex.search_papers("machine learning healthcare")

📄 Document Processing

from pydantic_scrape.dependencies.document import DocumentDependency

# Extract text from PDFs, Word docs, EPUBs
doc = DocumentDependency()
content = await doc.extract_text("research_paper.pdf")

🌐 Advanced Browser Automation

from pydantic_scrape.agents.search_and_browse import SearchAndBrowseAgent

# Intelligent search + browse with memory and geographic targeting
agent = SearchAndBrowseAgent()
result = await agent.run(
    "Find 5 cabinet refacing services in North West England with contact details"
)

# Automatically handles: cookie popups, JavaScript, geographic targeting, 
# content caching, parallel browsing, and form detection

Common Use Cases

  • Literature Reviews - Automatically search, extract, and summarize academic papers
  • Market Research - Monitor competitor content, pricing, and product updates
  • News Monitoring - Track mentions, trends, and breaking news across sources
  • Content Migration - Extract structured data from legacy systems or websites
  • Research Workflows - Build custom pipelines for domain-specific content extraction

Architecture

Pydantic Scrape organizes code into three layers:

  • Dependencies (pydantic_scrape.dependencies.*) - Reusable components for specific tasks
  • Agents (pydantic_scrape.agents.*) - AI-powered workers that make decisions
  • Graphs (pydantic_scrape.graphs.*) - Orchestrate multi-step workflows

This makes it easy to compose complex workflows from simple, tested components.
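As a minimal sketch of how the layers compose, the two dependencies shown above can be chained directly; the .url attribute on the returned paper objects is an assumption for illustration, not a confirmed field name:

import asyncio
from pydantic_scrape.dependencies.fetch import fetch_url
from pydantic_scrape.dependencies.openalex import OpenAlexDependency

async def mini_review(topic: str):
    # Dependency layer: search for papers on a topic
    openalex = OpenAlexDependency()
    papers = await openalex.search_papers(topic)

    # Dependency layer again: fetch full content for the first few hits
    for paper in papers[:3]:
        content = await fetch_url(paper.url)  # .url is a hypothetical attribute
        print(content.title)

asyncio.run(mini_review("machine learning healthcare"))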

Configuration

1. Set API Keys

Create a .env file in your project root:

# AI Providers (choose one or more)
OPENAI_API_KEY=your_openai_key
GOOGLE_GENAI_API_KEY=your_google_key  
ANTHROPIC_API_KEY=your_anthropic_key

# Google Search (for enhanced search capabilities)
GOOGLE_SEARCH_API_KEY=your_google_search_key
GOOGLE_SEARCH_ENGINE_ID=your_search_engine_id
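
If the keys aren't picked up automatically in your environment, a minimal sketch that loads them with python-dotenv (an assumption, not a documented requirement of the framework) is:

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY missing from environment"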

2. Chawan Configuration

The package includes an optimized Chawan configuration in .chawan/config.toml that provides:

  • 7x faster web automation than the default settings
  • Cookie popup handling without JavaScript overhead
  • Content caching for instant subsequent operations
  • Geographic search targeting for accurate local results

No additional Chawan setup required - works out of the box!


Contributing

We welcome contributions! The framework is designed to be extensible - add new content sources, AI agents, or workflow patterns.

See CONTRIBUTING.md for development setup and guidelines.

License

MIT License - see LICENSE for details.


Ready to build intelligent scraping workflows? Start with pip install pydantic-scrape and try the examples above.
