Scrapix

Scrapix is a web scraping and crawling tool that extracts content from websites and indexes it into Meilisearch. It supports several crawler engines and content extraction strategies, providing a flexible and efficient way to make site content searchable.

Features

Crawler Types

  • Cheerio: Fast and lightweight HTML parser
    • Best for static websites
    • Low memory usage
    • Cannot execute JavaScript
  • Puppeteer: Full Chrome browser automation
    • Handles JavaScript rendering
    • Well suited to single-page applications (SPAs)
    • Higher resource usage
  • Playwright (Beta): Modern browser automation
    • Cross-browser support (Chromium, Firefox, WebKit)
    • Modern automation APIs
    • Currently in beta
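
The crawler engine is chosen in the crawl configuration. A minimal sketch, assuming a crawler_type field that accepts cheerio, puppeteer, or playwright (the exact key name may differ):

{
  "start_urls": ["https://example.com"],
  "crawler_type": "cheerio"
}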

Content Extraction Strategies

  • Default: General-purpose content extraction
    • Extracts all page text
    • Uses p tags for content blocks
    • Builds sections from h1-h6 tags
  • DocSearch: Documentation-focused extraction
    • Compatible with DocSearch plugins
    • Preserves documentation structure
  • Schema: Structured data extraction
    • Works with Schema.org markup
    • Good for e-commerce sites
    • Supports rich metadata
  • Markdown: HTML to Markdown conversion
    • Ideal for documentation
    • Well suited to retrieval-augmented generation (RAG) systems
    • Preserves code blocks
  • Custom: User-defined extraction
    • Custom CSS selectors
    • Flexible data structures
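
The extraction strategy is selected the same way. A sketch, assuming a strategy field whose values mirror the names above (the exact key and values may differ):

{
  "strategy": "markdown"
}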

URL Control

  • Start URLs: Define crawl entry points
  • URL Exclusion: Skip specific URLs/patterns
  • Index Control: Choose what to index
  • Not Found Detection: Custom 404 detection
  • Sitemap Support:
    • Automatic sitemap discovery
    • Custom sitemap URLs
    • Robots.txt parsing
    • XML sitemap processing
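
Combined, these controls might look like the sketch below; every field name here is illustrative and may not match the actual configuration schema:

{
  "start_urls": ["https://example.com/docs/"],
  "urls_to_exclude": ["https://example.com/blog/*"],
  "sitemap_urls": ["https://example.com/sitemap.xml"],
  "not_found_selectors": [".page-404"]
}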

Performance Settings

  • Concurrency Control: Limit parallel requests
  • Rate Limiting: Requests per minute
  • Batch Processing: Document batch size
  • Resource Management: Memory optimization
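
As a sketch, with illustrative field names and values:

{
  "max_concurrency": 5,
  "max_requests_per_minute": 200,
  "batch_size": 1000
}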

Request Configuration

  • Custom Headers: Add HTTP headers
  • User Agents: Rotate user agents
  • Browser Options: Custom Puppeteer settings
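
An illustrative sketch (all field names are assumptions; headless is a standard Puppeteer launch option):

{
  "additional_request_headers": { "Authorization": "Bearer <token>" },
  "user_agents": ["Mozilla/5.0 (compatible; Scrapix)"],
  "launch_options": { "headless": true }
}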

Meilisearch Integration

  • Index Settings: Configure search behavior
  • Primary Keys: Custom document identifiers
  • Batch Indexing: Efficient document upload
  • Settings Sync: Automatic index configuration
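
A sketch of the Meilisearch-related fields (the key names are assumptions; the nested settings object uses standard Meilisearch index settings):

{
  "meilisearch_url": "http://localhost:7700",
  "meilisearch_api_key": "masterKey",
  "meilisearch_index_uid": "my-website",
  "primary_key": "uid",
  "meilisearch_settings": {
    "searchableAttributes": ["title", "content"]
  }
}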

Webhook Support

  • Status Updates: Real-time crawl progress
  • Custom Payloads: Additional context
  • Event Types: Started, active, completed, failed
  • Error Reporting: Detailed failure info
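
A sketch, assuming webhook_url and webhook_payload fields:

{
  "webhook_url": "https://example.com/hooks/scrapix",
  "webhook_payload": { "project": "docs" }
}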

Getting Started

CLI Usage

Download Scrapix from the project's releases and run it with a configuration file:

scrapix -p config.json
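
A minimal config.json might combine the sketches above (field names remain illustrative):

{
  "meilisearch_url": "http://localhost:7700",
  "meilisearch_api_key": "masterKey",
  "meilisearch_index_uid": "my-website",
  "start_urls": ["https://example.com"],
  "crawler_type": "cheerio",
  "strategy": "default"
}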

API Server

Run Scrapix as an HTTP server to expose the API endpoints below:

scrapix serve

API Endpoints

  1. POST /crawl/async
    • Queues an asynchronous crawl task
    • Body: Crawl configuration
    • Response: Task ID
  2. POST /crawl/sync
    • Runs the crawl synchronously and waits for completion
    • Body: Crawl configuration
    • Response: Crawl results
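
Both endpoints take a crawl configuration as the request body, presumably the same JSON used for config.json; the async variant pairs naturally with a webhook for status updates. An illustrative body (field names as assumed above):

{
  "start_urls": ["https://example.com"],
  "meilisearch_url": "http://localhost:7700",
  "meilisearch_api_key": "masterKey",
  "meilisearch_index_uid": "my-website",
  "webhook_url": "https://example.com/hooks/scrapix"
}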