Skip to content

Open source web scraping tool that gives LLM-Ready & multiple formats. Provides web scraping functionality with multiple fallback mechanisms for bypassing anti-bot protections (Cloudflare, etc.) and extracting content in various formats.

License

Notifications You must be signed in to change notification settings

EnderGamer6112/open-allscrape

AllScrape Backend API

LLM-ready web scraping & search API built with FastAPI.

Features

  • 🔍 Single URL scraping with multiple output formats (HTML, Markdown, Text)
  • 🌐 Web search with automatic result scraping
  • 🤖 LLM-ready output format
  • 📊 Metadata extraction (title, description, author, date)
  • ⚡ Fast async processing with httpx
  • 🧹 Clean text extraction using trafilatura

Installation

# Create virtual environment
python -m venv venv

# Activate virtual environment
# Windows:
venv\Scripts\activate
# Linux/Mac:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Install headless browser for dynamic sites (once)
python -m playwright install chromium

Usage

# Run the server
python -m uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

The API will be available at http://localhost:8000

API Endpoints

POST /scrape

Scrape a single URL

Request:

{
  "url": "https://example.com",
  "formats": ["markdown", "html", "text"]
}

Response:

{
  "success": true,
  "url": "https://example.com",
  "data": {
    "metadata": {
      "title": "Page Title",
      "description": "Page description",
      "author": "Author name",
      "date": "2025-01-01"
    },
    "content": {
      "text": "Clean text content...",
      "markdown": "# Markdown content...",
      "html": "<div>HTML content...</div>"
    },
    "llm_ready": {
      "title": "Page Title",
      "text": "Clean text for LLM...",
      "word_count": 500,
      "source": "https://example.com"
    }
  }
}

POST /search

Search the web and scrape results

Request:

{
  "query": "python web scraping",
  "max_results": 5
}

Response:

{
  "success": true,
  "query": "python web scraping",
  "results_count": 5,
  "data": [...]
}

Documentation

Interactive API documentation available at:

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc

About

Open source web scraping tool that gives LLM-Ready & multiple formats. Provides web scraping functionality with multiple fallback mechanisms for bypassing anti-bot protections (Cloudflare, etc.) and extracting content in various formats.

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages