Scrapix

Scrapix is a web scraping and crawling tool that extracts content from websites and indexes it into Meilisearch. It supports several crawler engines and content extraction strategies, providing a flexible and efficient way to make site content searchable.

Features

Crawler Types

  • Cheerio: Fast and lightweight HTML parser
    • Best for static websites
    • Low memory usage
    • Cannot execute JavaScript
  • Puppeteer: Full Chrome browser automation
    • Handles JavaScript rendering
    • Well suited to single-page applications (SPAs)
    • Higher resource usage
  • Playwright (Beta): Modern browser automation
    • Cross-browser support (Chromium, Firefox, WebKit)
    • Modern automation APIs
    • Currently in beta
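
The crawler engine is chosen in the crawl configuration. A minimal sketch, assuming a crawler_type field that accepts cheerio, puppeteer, or playwright (the exact key name may differ):

{
  "start_urls": ["https://example.com"],
  "crawler_type": "cheerio"
}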

Content Extraction Strategies

  • Default: General-purpose content extraction
    • Extracts all page text
    • Uses p tags for content blocks
    • Builds sections from h1-h6 tags
  • DocSearch: Documentation-focused extraction
    • Compatible with DocSearch plugins
    • Preserves documentation structure
  • Schema: Structured data extraction
    • Works with Schema.org markup
    • Good for e-commerce sites
    • Supports rich metadata
  • Markdown: HTML to Markdown conversion
    • Ideal for documentation
    • Well suited to retrieval-augmented generation (RAG) systems
    • Preserves code blocks
  • Custom: User-defined extraction
    • Custom CSS selectors
    • Flexible data structures
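
The extraction strategy is selected the same way. A sketch, assuming a strategy field whose values mirror the names above (the exact key and values may differ):

{
  "strategy": "markdown"
}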

URL Control

  • Start URLs: Define crawl entry points
  • URL Exclusion: Skip specific URLs/patterns
  • Index Control: Choose what to index
  • Not Found Detection: Custom 404 detection
  • Sitemap Support:
    • Automatic sitemap discovery
    • Custom sitemap URLs
    • Robots.txt parsing
    • XML sitemap processing
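
Combined, these controls might look like the sketch below; every field name here is illustrative and may not match the actual configuration schema:

{
  "start_urls": ["https://example.com/docs/"],
  "urls_to_exclude": ["https://example.com/blog/*"],
  "sitemap_urls": ["https://example.com/sitemap.xml"],
  "not_found_selectors": [".page-404"]
}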

Performance Settings

  • Concurrency Control: Limit parallel requests
  • Rate Limiting: Requests per minute
  • Batch Processing: Document batch size
  • Resource Management: Memory optimization
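
As a sketch, with illustrative field names and values:

{
  "max_concurrency": 5,
  "max_requests_per_minute": 200,
  "batch_size": 1000
}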

Request Configuration

  • Custom Headers: Add HTTP headers
  • User Agents: Rotate user agents
  • Browser Options: Custom Puppeteer settings
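
An illustrative sketch (all field names are assumptions; headless is a standard Puppeteer launch option):

{
  "additional_request_headers": { "Authorization": "Bearer <token>" },
  "user_agents": ["Mozilla/5.0 (compatible; Scrapix)"],
  "launch_options": { "headless": true }
}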

Meilisearch Integration

  • Index Settings: Configure search behavior
  • Primary Keys: Custom document identifiers
  • Batch Indexing: Efficient document upload
  • Settings Sync: Automatic index configuration
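
A sketch of the Meilisearch-related fields (the key names are assumptions; the nested settings object uses standard Meilisearch index settings):

{
  "meilisearch_url": "http://localhost:7700",
  "meilisearch_api_key": "masterKey",
  "meilisearch_index_uid": "my-website",
  "primary_key": "uid",
  "meilisearch_settings": {
    "searchableAttributes": ["title", "content"]
  }
}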

Webhook Support

  • Status Updates: Real-time crawl progress
  • Custom Payloads: Additional context
  • Event Types: Started, active, completed, failed
  • Error Reporting: Detailed failure info
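
A sketch, assuming webhook_url and webhook_payload fields:

{
  "webhook_url": "https://example.com/hooks/scrapix",
  "webhook_payload": { "project": "docs" }
}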

Getting Started

CLI Usage

Download Scrapix from the project's releases and run it with a configuration file:

scrapix -p config.json
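
A minimal config.json might combine the sketches above (field names remain illustrative):

{
  "meilisearch_url": "http://localhost:7700",
  "meilisearch_api_key": "masterKey",
  "meilisearch_index_uid": "my-website",
  "start_urls": ["https://example.com"],
  "crawler_type": "cheerio",
  "strategy": "default"
}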

API Server

Run Scrapix as an HTTP server to expose the API endpoints below:

scrapix serve

API Endpoints

  1. POST /crawl/async
    • Queues an asynchronous crawl task
    • Body: Crawl configuration
    • Response: Task ID
  2. POST /crawl/sync
    • Runs the crawl synchronously and waits for completion
    • Body: Crawl configuration
    • Response: Crawl results
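
Both endpoints take a crawl configuration as the request body, presumably the same JSON used for config.json; the async variant pairs naturally with a webhook for status updates. An illustrative body (field names as assumed above):

{
  "start_urls": ["https://example.com"],
  "meilisearch_url": "http://localhost:7700",
  "meilisearch_api_key": "masterKey",
  "meilisearch_index_uid": "my-website",
  "webhook_url": "https://example.com/hooks/scrapix"
}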