An AI-powered web scraping agent that automates data extraction from websites with intelligent crawling, anti-bot detection, and structured data parsing. It is built with LangGraph, LangSmith, Firecrawl, and Anthropic AI tools.
- Graph-based AI Agent: Uses LangGraph for managing scraping workflows.
- Intelligent Web Crawling: Powered by Firecrawl to extract structured data.
- LLM-Powered Formatting: Uses Anthropic AI for content summarization.
- Adaptive Error Handling: Retries failed requests dynamically.
- Batch Processing: Efficiently processes multiple URLs with batched requests.
- Flexible Output Formats: Supports JSON, Markdown, and more.
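The batching and retry features listed above can be illustrated with a minimal sketch. It is not the project's actual implementation: `chunked`, `with_retries`, `scrape_all`, and the `fetch_batch` callback are hypothetical names, and exponential backoff is an assumed retry policy. The defaults of 10 and 5 mirror the `URL_LIMIT` and `BATCH_LIMIT` values from the sample configuration.

```python
import time
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(urls: Sequence[str], batch_limit: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `batch_limit` URLs."""
    if batch_limit < 1:
        raise ValueError("batch_limit must be at least 1")
    for start in range(0, len(urls), batch_limit):
        yield list(urls[start:start + batch_limit])


def with_retries(
    task: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `task`, retrying on any exception with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return task()
        except Exception:
            if attempt >= max_attempts:
                raise  # Out of attempts: surface the last error.
            # Wait base_delay, then 2x, then 4x, ... before retrying.
            sleep(base_delay * 2 ** (attempt - 1))
            attempt += 1


def scrape_all(
    urls: Sequence[str],
    fetch_batch: Callable[[List[str]], List[str]],
    url_limit: int = 10,
    batch_limit: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Cap the URL list, split it into batches, and fetch each with retries.

    `fetch_batch` is a hypothetical callback standing in for the real
    scraping call; it receives one batch and returns one result per URL.
    """
    results: List[str] = []
    for batch in chunked(urls[:url_limit], batch_limit):
        results.extend(with_retries(lambda b=batch: fetch_batch(b), sleep=sleep))
    return results
```

Passing `sleep` as a parameter keeps the backoff testable without real delays. With 12 input URLs and the defaults, `scrape_all` would issue two batches of five and ignore the last two URLs.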
Follow these steps to set up and run the project on your local machine:
git clone https://github.com/hmshb/scraping-agent-ai
cd scraping-agent-ai
python -m venv venv
source venv/bin/activate # For Linux/Mac
.\venv\Scripts\activate # For Windows
pip install -U "langgraph-cli[inmem]"
pip install -e .
- Visit LangSmith.
- Create an API key for accessing LangSmith logs.
- Copy the generated API key.
- Visit Anthropic.
- Create an API key for accessing Claude.
- Copy the generated API key.
- Copy the .env.example file and rename it to .env:
cp .env.example .env
- Open the .env file and update the API keys and configuration values:
LANGSMITH_PROJECT=scrapping-agent
LANGSMITH_API_KEY=your_api_key_here
ANTHROPIC_API_KEY=your_api_key_here
FIRECRAWL_API_KEY=your_api_key_here
URL_LIMIT=10
BATCH_LIMIT=5
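As a rough illustration of how these values might be consumed, the sketch below reads them from the process environment and validates them. The `Settings` class and `load_settings` function are hypothetical and not taken from the project; the sketch also assumes the variables from .env have already been loaded into the environment, and its fallback values simply reuse the sample above.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Keys that must be present and non-empty (assumed to be mandatory).
REQUIRED_KEYS = ("LANGSMITH_API_KEY", "ANTHROPIC_API_KEY", "FIRECRAWL_API_KEY")


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values defined in .env."""
    langsmith_project: str
    langsmith_api_key: str
    anthropic_api_key: str
    firecrawl_api_key: str
    url_limit: int
    batch_limit: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping, failing fast on gaps."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        langsmith_project=env.get("LANGSMITH_PROJECT", "scrapping-agent"),
        langsmith_api_key=env["LANGSMITH_API_KEY"],
        anthropic_api_key=env["ANTHROPIC_API_KEY"],
        firecrawl_api_key=env["FIRECRAWL_API_KEY"],
        # Numeric limits arrive as strings and are converted here.
        url_limit=int(env.get("URL_LIMIT", "10")),
        batch_limit=int(env.get("BATCH_LIMIT", "5")),
    )
```

Failing at startup when a key is missing tends to be easier to diagnose than an authentication error in the middle of a scraping run.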
langgraph dev
scraping-agent-ai/
├── .env                   # API key configuration file
├── agent/                 # Main AI scraping agent module
│   ├── utils/             # Utility modules for various tasks
│   │   ├── constants.py   # Constants for scraping tasks
│   │   ├── firecrawl.py   # Firecrawl integration
│   │   ├── graph.py       # LangGraph-based workflow
│   │   ├── helpers.py     # Utility functions
│   │   ├── llm.py         # LLM-powered formatting
│   │   ├── nodes.py       # Graph-based nodes
│   │   └── states.py      # Scraping state management
│   └── agent.py           # AI-driven scraping workflow
├── langgraph.json         # LangGraph configuration file
├── pyproject.toml         # Python project metadata
├── README.md              # Documentation file
├── scraped_data.json      # This will have the final data
└── venv/                  # Virtual environment
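The split between states, nodes, and a workflow in the layout above can be pictured with a plain-Python stand-in. This sketch deliberately does not use the LangGraph API, and none of it is taken from the project: `ScrapeState`, `crawl_node`, `format_node`, `run_pipeline`, and `save_records` are hypothetical, and the crawl and formatting steps are replaced by local placeholders.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Sequence, TypedDict


class ScrapeState(TypedDict):
    """Hypothetical shared state; the real fields live in states.py."""
    urls: List[str]
    pages: Dict[str, str]            # url -> raw page content
    records: List[Dict[str, str]]    # structured output


Node = Callable[[ScrapeState], ScrapeState]


def crawl_node(state: ScrapeState) -> ScrapeState:
    # Placeholder for the crawling step: produce dummy text per URL.
    pages = {url: f"content of {url}" for url in state["urls"]}
    return {**state, "pages": pages}


def format_node(state: ScrapeState) -> ScrapeState:
    # Placeholder for the LLM formatting step: one record per page.
    records = [
        {"url": url, "summary": text.upper()}
        for url, text in state["pages"].items()
    ]
    return {**state, "records": records}


def run_pipeline(state: ScrapeState, nodes: Sequence[Node]) -> ScrapeState:
    """Pass the state through each node in order (a linear graph)."""
    for node in nodes:
        state = node(state)
    return state


def save_records(state: ScrapeState, path: Path) -> None:
    """Write the structured records as JSON, e.g. to scraped_data.json."""
    path.write_text(json.dumps(state["records"], indent=2), encoding="utf-8")
```

Each node takes the current state and returns an updated copy, which is the same general pattern graph-based agent frameworks build on; a real graph would add branching and retries between nodes.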
Special thanks to:
- LangGraph for building graph-based AI workflows.
- LangSmith for debugging and monitoring AI agents.
- Firecrawl for powerful web crawling and data extraction.
- Anthropic AI for AI-powered text summarization and formatting.
This project is open-source and licensed under the MIT License.
If you find this repository helpful, please consider:
- Starring the Repository to show your support.
- Forking the Repository to explore further and make your own customizations.
- Sharing Your Feedback by opening issues or discussions.
LangGraph, LangSmith, Claude, and Firecrawl may be in limited or preview release (depending on your region and timing), and integration details may change as these services evolve.
Always refer to official documentation for the most up-to-date guidance.