Scrapix is a web scraping and crawling tool designed to simplify extracting data from websites and indexing it into Meilisearch. It provides a flexible, efficient way to crawl sites and make their content searchable.
Scrapix supports three crawler types, chosen per crawl (a configuration sketch follows the list):

- Cheerio: Fast and lightweight HTML parser
  - Best for static websites
  - Low memory usage
  - Cannot execute JavaScript
- Puppeteer: Full Chrome browser automation
  - Handles JavaScript rendering
  - Good for SPAs
  - Higher resource usage
- Playwright (Beta): Modern browser automation
  - Cross-browser support
  - Modern APIs
  - Currently in beta
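As a rough illustration of how a crawler might be selected, the fragment below picks Puppeteer in a crawl configuration. The field names (`start_urls`, `crawler_type`) and accepted values are assumptions for illustration; check the current configuration reference for the authoritative schema.

```json
{
  "start_urls": ["https://docs.example.com"],
  "crawler_type": "puppeteer"
}
```

Cheerio is the lighter option for static HTML; Puppeteer or Playwright are only worth the extra resources when pages need JavaScript rendering.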
Several scraping strategies control how page content is extracted and structured:

- Default: General-purpose content extraction
  - Extracts all page text
  - Uses p tags for content blocks
  - Builds sections from h1-h6 tags
- DocSearch: Documentation-focused extraction
  - Compatible with DocSearch plugins
  - Preserves documentation structure
- Schema: Structured data extraction
  - Works with Schema.org markup
  - Good for e-commerce sites
  - Supports rich metadata
- Markdown: HTML to Markdown conversion
  - Ideal for documentation
  - Perfect for RAG systems
  - Preserves code blocks
- Custom: User-defined extraction
  - Custom CSS selectors
  - Flexible data structures
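To illustrate, a configuration might select a strategy and, for the custom strategy, supply its own CSS selectors. The field names (`strategy`, `selectors`) and the selector shape below are illustrative assumptions, not a documented schema.

```json
{
  "strategy": "custom",
  "selectors": {
    "title": "article h1",
    "content": "article .content p"
  }
}
```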
Crawl scope is controlled through URL options (a configuration sketch follows the list):

- Start URLs: Define crawl entry points
- URL Exclusion: Skip specific URLs/patterns
- Index Control: Choose what to index
- Not Found Detection: Custom 404 detection
- Sitemap Support:
  - Automatic sitemap discovery
  - Custom sitemap URLs
  - Robots.txt parsing
  - XML sitemap processing
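A crawl scope could then be expressed along these lines; the field names `start_urls`, `urls_to_exclude`, and `sitemap_urls` are assumed for illustration.

```json
{
  "start_urls": ["https://docs.example.com"],
  "urls_to_exclude": ["https://docs.example.com/archive/*"],
  "sitemap_urls": ["https://docs.example.com/sitemap.xml"]
}
```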
Performance settings keep crawls within resource limits:

- Concurrency Control: Limit parallel requests
- Rate Limiting: Requests per minute
- Batch Processing: Document batch size
- Resource Management: Memory optimization
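A hedged sketch of what throttling settings might look like; the keys below are assumed names, not a verified schema.

```json
{
  "max_concurrency": 5,
  "max_requests_per_minute": 120,
  "batch_size": 1000
}
```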
Outgoing requests can also be customized:

- Custom Headers: Add HTTP headers
- User Agents: Rotate user agents
- Browser Options: Custom Puppeteer settings
On the indexing side, Scrapix manages the target Meilisearch index:

- Index Settings: Configure search behavior
- Primary Keys: Custom document identifiers
- Batch Indexing: Efficient document upload
- Settings Sync: Automatic index configuration
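For example, a configuration might pin the primary key and push index settings to Meilisearch. `primary_key` and `meilisearch_settings` are assumed field names; the nested object uses standard Meilisearch index settings.

```json
{
  "primary_key": "uid",
  "meilisearch_settings": {
    "searchableAttributes": ["title", "content"],
    "distinctAttribute": "url"
  }
}
```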
Webhooks can report crawl progress to an external service:

- Status Updates: Real-time crawl progress
- Custom Payloads: Additional context
- Event Types: Started, active, completed, failed
- Error Reporting: Detailed failure info
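A webhook might be configured roughly as follows; `webhook_url` and `webhook_payload` are illustrative names, and the exact event payload format sent to the webhook is not shown here.

```json
{
  "webhook_url": "https://hooks.example.com/scrapix",
  "webhook_payload": {
    "project": "docs-crawl"
  }
}
```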
Scrapix can be run from the command line: download it from the project's releases page, then start a crawl against a configuration file with `scrapix -p config.json`, or run it as an HTTP server with `scrapix serve`.
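Putting the pieces together, the `config.json` passed via `-p` could look like the minimal sketch below. The Meilisearch connection fields (`meilisearch_url`, `meilisearch_api_key`, `meilisearch_index_uid`) and the other keys follow the assumed names used in the fragments above; consult the configuration reference for the authoritative list.

```json
{
  "start_urls": ["https://docs.example.com"],
  "meilisearch_url": "http://localhost:7700",
  "meilisearch_api_key": "masterKey",
  "meilisearch_index_uid": "docs",
  "crawler_type": "cheerio",
  "strategy": "default"
}
```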
When running as a server, Scrapix exposes two crawl endpoints:

- POST /crawl/async
  - Queues an asynchronous crawl task
  - Body: Crawl configuration
  - Response: Task ID
- POST /crawl/sync
  - Runs a synchronous crawl task
  - Body: Crawl configuration
  - Response: Crawl results
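Assuming the server started by `scrapix serve` is reachable on localhost (the port below is a placeholder, not a documented default), the endpoints can be exercised with curl:

```sh
# Queue an asynchronous crawl; the response should contain a task ID.
curl -X POST http://localhost:8080/crawl/async \
  -H 'Content-Type: application/json' \
  --data @config.json

# Run a synchronous crawl and wait for the results in the response.
curl -X POST http://localhost:8080/crawl/sync \
  -H 'Content-Type: application/json' \
  --data @config.json
```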