AI-powered web scraping agent built with LangGraph, LangSmith, Firecrawl, and Anthropic AI. Automates intelligent crawling, structured data extraction, and LLM-powered content formatting. Efficiently handles anti-bot mechanisms, error recovery, and batch processing. 🚀

hmshb/scraping-agent-ai

🚀 Scraping Agent AI

Python Web Scraping Automation

An AI-powered web scraping agent that automates data extraction from websites with intelligent crawling, anti-bot detection, and structured data parsing. Built using LangGraph, LangSmith, Firecrawl, and Anthropic AI tools for seamless AI-driven web scraping and structured data processing.

🔥 Features

  • Graph-based AI Agent: Uses LangGraph for managing scraping workflows.
  • Intelligent Web Crawling: Powered by Firecrawl to extract structured data.
  • LLM-Powered Formatting: Uses Anthropic AI for content summarization.
  • Adaptive Error Handling: Retries failed requests dynamically.
  • Batch Processing: Efficiently processes multiple URLs with batched requests.
  • Flexible Output Formats: Supports JSON, Markdown, and more.
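
The batching and retry behaviour listed above can be illustrated with a minimal, standard-library sketch. This is not the code from `agent/`; the helper names `chunked` and `with_retries`, the exponential backoff policy, and the default of three attempts are all assumptions made for illustration.

```python
import time
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(urls: Sequence[str], batch_limit: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `batch_limit` URLs (hypothetical helper)."""
    if batch_limit < 1:
        raise ValueError("batch_limit must be at least 1")
    for start in range(0, len(urls), batch_limit):
        yield list(urls[start:start + batch_limit])


def with_retries(
    task: Callable[[], T],
    max_attempts: int = 3,          # assumed default, not taken from the project
    base_delay: float = 1.0,        # seconds before the first retry
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `task`, retrying on any exception with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles after each failure: base, 2*base, 4*base, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


def scrape_batch(batch: List[str]) -> List[str]:
    """Stand-in for a real scraping call; it only echoes the URLs."""
    return [f"scraped:{url}" for url in batch]


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(7)]
    for batch in chunked(urls, batch_limit=5):
        print(with_retries(lambda: scrape_batch(batch)))
```

Passing `sleep` as a parameter keeps the retry logic testable without real delays. A real agent would likely retry only on specific, recoverable errors rather than on every exception.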

Demo

demo.png


🛠️ Setup Instructions

Follow these steps to set up and run the project on your local machine:

1. Clone the Repository

git clone https://github.com/hmshb/scraping-agent-ai
cd scraping-agent-ai

2. Create a Virtual Environment

python -m venv venv
source venv/bin/activate  # For Linux/Mac

.\venv\Scripts\activate # For Windows

3. Install LangGraph CLI

pip install -U "langgraph-cli[inmem]"

img.png

4. Install Other Dependencies

pip install -e .

img_1.png


5. Generate LangSmith API Key

  1. Visit LangSmith.
  2. Create an API key for accessing LangSmith logs.
  3. Copy the generated API key.

img_2.png


6. Generate Anthropic Claude API Key

  1. Visit Anthropic.
  2. Create an API key for accessing Claude.
  3. Copy the generated API key.

img_3.png


7. Configure the Environment Variables

  • Copy the .env.example file and rename it to .env:
    cp .env.example .env
  • Open the .env file and update the API keys and configuration values:
    LANGSMITH_PROJECT=scrapping-agent
    LANGSMITH_API_KEY=your_api_key_here
    ANTHROPIC_API_KEY=your_api_key_here
    FIRECRAWL_API_KEY=your_api_key_here
    URL_LIMIT=10
    BATCH_LIMIT=5
    
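
To show how these values might be consumed, here is a hedged sketch that reads them into a typed settings object. It assumes the variables are already present in the process environment; the `Settings` class and `load_settings` function are hypothetical and not part of this repository. The fallback values simply mirror the example `.env` above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values defined in .env."""
    langsmith_project: str
    langsmith_api_key: str
    anthropic_api_key: str
    firecrawl_api_key: str
    url_limit: int
    batch_limit: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Validate and parse the environment variables listed in .env."""
    required = ["LANGSMITH_API_KEY", "ANTHROPIC_API_KEY", "FIRECRAWL_API_KEY"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return Settings(
        # Fallbacks below copy the example values; they are assumptions.
        langsmith_project=env.get("LANGSMITH_PROJECT", "scrapping-agent"),
        langsmith_api_key=env["LANGSMITH_API_KEY"],
        anthropic_api_key=env["ANTHROPIC_API_KEY"],
        firecrawl_api_key=env["FIRECRAWL_API_KEY"],
        url_limit=int(env.get("URL_LIMIT", "10")),
        batch_limit=int(env.get("BATCH_LIMIT", "5")),
    )
```

Failing fast on a missing key gives a clearer error than a rejected API call later in the run.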

8. Run the Project

langgraph dev

img_4.png


9. LangGraph of the AI Agent

img_5.png


📂 Project Structure

scraping-agent-ai/
├── .env                 # API key configuration file
├── agent/               # Main AI scraping agent module
│   ├── utils/           # Utility modules for various tasks
│   │   ├── constants.py # Constants for scraping tasks
│   │   ├── firecrawl.py # Firecrawl integration
│   │   ├── graph.py     # LangGraph-based workflow
│   │   ├── helpers.py   # Utility functions
│   │   ├── llm.py       # LLM-powered formatting
│   │   ├── nodes.py     # Graph-based nodes
│   │   ├── states.py    # Scraping state management
│   ├── agent.py         # AI-driven scraping workflow
├── langgraph.json       # LangGraph configuration file
├── pyproject.toml       # Python project metadata
├── README.md            # Documentation file
├── scraped_data.json    # Final scraped data is written here
├── venv/                # Virtual environment
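
As a rough illustration of how final results could end up in `scraped_data.json`, or be rendered as Markdown, here is a small standard-library sketch. The record fields (`url`, `title`, `summary`) and the function names are assumptions; the actual schema produced by the agent may differ.

```python
import json
from pathlib import Path
from typing import Dict, List

Record = Dict[str, str]  # hypothetical shape: {"url": ..., "title": ..., "summary": ...}


def save_json(records: List[Record], path: Path) -> None:
    """Write records as pretty-printed UTF-8 JSON."""
    path.write_text(json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8")


def to_markdown(records: List[Record]) -> str:
    """Render each record as a Markdown section with its source URL."""
    sections = [
        f"## {r['title']}\n\nSource: {r['url']}\n\n{r['summary']}"
        for r in records
    ]
    return "\n\n".join(sections) + "\n"


if __name__ == "__main__":
    demo = [{"url": "https://example.com", "title": "Example", "summary": "A page."}]
    save_json(demo, Path("scraped_data.json"))
    print(to_markdown(demo))
```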

⭐ Acknowledgments

Special thanks to:

  • LangGraph for building graph-based AI workflows.
  • LangSmith for debugging and monitoring AI agents.
  • Firecrawl for powerful web crawling and data extraction.
  • Anthropic AI for AI-powered text summarization and formatting.

📜 License

This project is open-source and licensed under the MIT License.


📢 Get Involved!

If you find this repository helpful, please consider:

  • ⭐ Starring the Repository to show your support.
  • 📤 Forking the Repository to explore further and make your own customizations.
  • 💬 Sharing Your Feedback by opening issues or discussions.

📝 Notes

LangGraph, LangSmith, Claude, and Firecrawl may be in limited or preview release depending on your region and timing, and integration details may change as these services evolve.

Always refer to official documentation for the most up-to-date guidance.

Let's build smart, scalable AI-powered web scrapers together! 🚀
