Modern web scraper with LLM-enhanced extraction, extensible pipeline, and pluggable parsers.
Beta Release: v1.0.0 is currently in beta. The API is stable but minor changes may occur before the stable release.
- LLM-Ready Output - Content extracted as Markdown, optimized for AI/LLM consumption
- Provider-Agnostic LLM - Works with OpenAI, Anthropic, Ollama, LM Studio, or any OpenAI-compatible API
- Vector Embeddings - Generate embeddings with OpenAI, Azure, Cohere, HuggingFace, Ollama, or local Transformers.js
- Extensible Pipeline - Pluggable extractors with priority-based execution
- Smart Extraction - Uses Mozilla Readability for content, Cheerio for metadata
- Markdown Parsing - Parse markdown content, awesome lists, and GitHub repos
- RSS/Atom Feeds - Parse RSS 2.0, RSS 1.0 (RDF), and Atom feeds with pagination support
- TypeScript First - Full type safety with comprehensive type exports
- Dual Format - ESM and CommonJS builds
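Since both ESM and CommonJS builds ship, either loading style below should work; this is a minimal sketch, with `scrape` being the named export used throughout this README:

```ts
// ESM build
import { scrape } from 'scrapex';

// CommonJS build (e.g., from a .cjs file) -- equivalent require form:
// const { scrape } = require('scrapex');
```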
npm install scrapex@beta

# For LLM features
npm install openai # OpenAI/Ollama/LM Studio
npm install @anthropic-ai/sdk # Anthropic Claude
# For JavaScript-rendered pages
npm install puppeteer

import { scrape } from 'scrapex';
const result = await scrape('https://example.com/article');
console.log(result.title); // "Article Title"
console.log(result.content); // Markdown content
console.log(result.textContent); // Plain text (lower tokens)
console.log(result.excerpt); // First ~300 chars

Fetch and extract metadata and content from a URL.
import { scrape } from 'scrapex';
const result = await scrape('https://example.com', {
timeout: 10000,
userAgent: 'MyBot/1.0',
extractContent: true,
maxContentLength: 50000,
respectRobots: false,
});

Extract from raw HTML without fetching.
import { scrapeHtml } from 'scrapex';
const html = await fetchSomehow('https://example.com');
const result = await scrapeHtml(html, 'https://example.com');

interface ScrapedData {
// Identity
url: string;
canonicalUrl: string;
domain: string;
// Basic metadata
title: string;
description: string;
image?: string;
favicon?: string;
// Content (LLM-optimized)
content: string; // Markdown format
textContent: string; // Plain text
excerpt: string; // ~300 char preview
wordCount: number;
// Context
author?: string;
publishedAt?: string;
modifiedAt?: string;
siteName?: string;
language?: string;
// Classification
contentType: 'article' | 'repo' | 'docs' | 'package' | 'video' | 'tool' | 'product' | 'unknown';
keywords: string[];
// Structured data
jsonLd?: Record<string, unknown>[];
links?: ExtractedLink[];
// LLM Enhancements (when enabled)
summary?: string;
suggestedTags?: string[];
entities?: ExtractedEntities;
extracted?: Record<string, unknown>;
// Meta
scrapedAt: string;
scrapeTimeMs: number;
error?: string;
}

import { scrape } from 'scrapex';
import { createOpenAI } from 'scrapex/llm';
const llm = createOpenAI({ apiKey: 'sk-...' });
const result = await scrape('https://example.com/article', {
llm,
enhance: ['summarize', 'tags', 'entities', 'classify'],
});
console.log(result.summary); // AI-generated summary
console.log(result.suggestedTags); // ['javascript', 'web', ...]
console.log(result.entities); // { people: [], organizations: [], ... }

Generate vector embeddings from scraped content for semantic search, RAG, and similarity matching:
import { scrape } from 'scrapex';
import { createOpenAIEmbedding } from 'scrapex/embeddings';
const result = await scrape('https://example.com/article', {
embeddings: {
provider: { type: 'custom', provider: createOpenAIEmbedding() },
model: 'text-embedding-3-small',
},
});
if (result.embeddings?.status === 'success') {
console.log(result.embeddings.vector); // [0.023, -0.041, ...]
}

Features include:
- Multiple providers - OpenAI, Azure, Cohere, HuggingFace, Ollama, Transformers.js
- PII redaction - Automatically redact emails, phones, SSNs before sending to APIs
- Smart chunking - Split long content with configurable overlap
- Caching - Content-addressable cache to avoid redundant API calls
- Resilience - Retry, circuit breaker, rate limiting
See the Embeddings Guide for full documentation.
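For the similarity-matching use case mentioned above, the returned vector can be compared with plain cosine similarity. This is a sketch: the `cosineSimilarity` helper below is ordinary TypeScript, not part of the scrapex API, and the option shape mirrors the embeddings example above.

```ts
import { scrape } from 'scrapex';
import { createOpenAIEmbedding } from 'scrapex/embeddings';

// Plain cosine similarity over two embedding vectors (helper, not a scrapex API).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const embeddings = {
  provider: { type: 'custom' as const, provider: createOpenAIEmbedding() },
  model: 'text-embedding-3-small',
};

const [a, b] = await Promise.all([
  scrape('https://example.com/article-1', { embeddings }),
  scrape('https://example.com/article-2', { embeddings }),
]);

if (a.embeddings?.status === 'success' && b.embeddings?.status === 'success') {
  console.log('Similarity:', cosineSimilarity(a.embeddings.vector, b.embeddings.vector));
}
```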
- LLM provider classes (e.g., `AnthropicProvider`) were removed. Use preset factories like `createOpenAI`, `createAnthropic`, `createOllama`, and `createLMStudio` instead.
import { createAnthropic } from 'scrapex/llm';
const llm = createAnthropic({
apiKey: process.env.ANTHROPIC_API_KEY,
model: 'claude-3-5-haiku-20241022', // or 'claude-3-5-sonnet-20241022'
});
const result = await scrape(url, { llm, enhance: ['summarize'] });

import { createOllama } from 'scrapex/llm';
const llm = createOllama({ model: 'llama3.2' });
const result = await scrape(url, { llm, enhance: ['summarize'] });

import { createLMStudio } from 'scrapex/llm';
const llm = createLMStudio({ model: 'local-model' });
const result = await scrape(url, { llm, enhance: ['summarize'] });

Extract specific data using a schema:
const result = await scrape('https://example.com/product', {
llm,
extract: {
productName: 'string',
price: 'number',
features: 'string[]',
inStock: 'boolean',
sku: 'string?', // optional
},
});
console.log(result.extracted);
// { productName: "Widget", price: 29.99, features: [...], inStock: true }

Create custom extractors to add domain-specific extraction logic:
import { scrape, type Extractor, type ExtractionContext } from 'scrapex';
const recipeExtractor: Extractor = {
name: 'recipe',
priority: 60, // Higher = runs earlier
async extract(context: ExtractionContext) {
const { $ } = context;
return {
custom: {
ingredients: $('.ingredients li').map((_, el) => $(el).text()).get(),
cookTime: $('[itemprop="cookTime"]').attr('content'),
servings: $('[itemprop="recipeYield"]').text(),
},
};
},
};
const result = await scrape('https://example.com/recipe', {
extractors: [recipeExtractor],
});
console.log(result.custom?.ingredients);

To replace the default extractors entirely instead of adding to them:

const result = await scrape(url, {
replaceDefaultExtractors: true,
extractors: [myCustomExtractor],
});

Parse markdown content:
import { MarkdownParser, extractListLinks, groupByCategory } from 'scrapex/parsers';
// Parse any markdown
const parser = new MarkdownParser();
const result = parser.parse(markdownContent);
console.log(result.data.title);
console.log(result.data.sections);
console.log(result.data.links);
console.log(result.data.codeBlocks);
// Extract links from markdown lists and group by category
const links = extractListLinks(markdownContent);
const grouped = groupByCategory(links);
grouped.forEach((categoryLinks, category) => {
console.log(`${category}: ${categoryLinks.length} links`);
});

Work with GitHub repository URLs:

import {
isGitHubRepo,
parseGitHubUrl,
toRawUrl,
} from 'scrapex/parsers';
isGitHubRepo('https://github.com/owner/repo');
// true
parseGitHubUrl('https://github.com/facebook/react');
// { owner: 'facebook', repo: 'react' }
toRawUrl('https://github.com/owner/repo');
// 'https://raw.githubusercontent.com/owner/repo/main/README.md'

Parse RSS 2.0, RSS 1.0 (RDF), and Atom 1.0 feeds:
import { RSSParser } from 'scrapex';
const parser = new RSSParser();
const result = parser.parse(feedXml, 'https://example.com/feed.xml');
console.log(result.data.format); // 'rss2' | 'rss1' | 'atom'
console.log(result.data.title); // Feed title
console.log(result.data.items); // Array of feed items

Supported formats:

- `rss2` - RSS 2.0 (most common format)
- `rss1` - RSS 1.0 (RDF-based, older format)
- `atom` - Atom 1.0 (modern format with better semantics)
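A small sketch of branching on the detected format; it only uses the `format`, `items`, and `updatedAt` fields documented in this section (the `feedXml` string is assumed to be fetched separately, as in the example above):

```ts
import { RSSParser } from 'scrapex';

const parser = new RSSParser();
const result = parser.parse(feedXml, 'https://example.com/feed.xml');

switch (result.data.format) {
  case 'atom':
    // Atom items may also carry updatedAt (see the FeedItem interface below).
    console.log(result.data.items.filter((item) => item.updatedAt).length, 'updated items');
    break;
  case 'rss2':
  case 'rss1':
    console.log(result.data.items.length, 'RSS items');
    break;
}
```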
interface FeedItem {
id: string;
title: string;
link: string;
description?: string;
content?: string;
author?: string;
publishedAt?: string; // ISO 8601
rawPublishedAt?: string; // Original date string
updatedAt?: string; // Atom only
categories: string[];
enclosure?: FeedEnclosure; // Podcast/media attachments
customFields?: Record<string, string>;
}

import { fetchFeed, paginateFeed } from 'scrapex';
// Fetch and parse in one call
const result = await fetchFeed('https://example.com/feed.xml');
console.log(result.data.items);
// Paginate through feeds with rel="next" links (Atom)
for await (const page of paginateFeed('https://example.com/atom')) {
console.log(`Page with ${page.items.length} items`);
}

import { discoverFeeds } from 'scrapex';
const html = await fetch('https://example.com').then(r => r.text());
const feedUrls = discoverFeeds(html, 'https://example.com');
// ['https://example.com/feed.xml', 'https://example.com/atom.xml']

import { RSSParser, filterByDate } from 'scrapex';
const parser = new RSSParser();
const result = parser.parse(feedXml);
const recentItems = filterByDate(result.data.items, {
after: new Date('2024-01-01'),
before: new Date('2024-12-31'),
includeUndated: false,
});

import { RSSParser, feedToMarkdown, feedToText } from 'scrapex';
const parser = new RSSParser();
const result = parser.parse(feedXml);
// Convert to markdown (great for LLM consumption)
const markdown = feedToMarkdown(result.data, { maxItems: 10 });
// Convert to plain text
const text = feedToText(result.data);

Extract custom namespace fields like iTunes podcast tags:
const parser = new RSSParser({
customFields: {
duration: 'itunes\\:duration',
explicit: 'itunes\\:explicit',
rating: 'media\\:rating',
},
});
const result = parser.parse(podcastXml);
const item = result.data.items[0];
console.log(item.customFields?.duration); // '10:00'
console.log(item.customFields?.explicit); // 'no'

The RSS parser enforces strict URL security:

- HTTPS-only URLs (RSS parser only): The RSS/Atom parser (`RSSParser`) resolves all links to HTTPS only. Non-HTTPS URLs (`http`, `javascript`, `data`, `file`) are rejected and returned as empty strings. This is specific to feed parsing to prevent malicious links in untrusted feeds.
- XML Mode: Feeds are parsed with Cheerio's `{ xml: true }` mode, which disables HTML entity processing and prevents XSS vectors.

Note: The public URL utilities (`resolveUrl`, `isValidUrl`, etc.) accept both `http:` and `https:` URLs. Protocol-relative URLs (e.g., `//example.com/path`) are resolved against the base URL's protocol by the standard `URL` constructor.
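For example, a quick sketch of that protocol-relative behavior with `resolveUrl` (the CDN hostname is just an illustration):

```ts
import { resolveUrl } from 'scrapex';

// Protocol-relative URLs inherit the base URL's scheme.
resolveUrl('//cdn.example.com/lib.js', 'https://example.com/page');
// 'https://cdn.example.com/lib.js'

resolveUrl('//cdn.example.com/lib.js', 'http://example.com/page');
// 'http://cdn.example.com/lib.js'
```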
import {
isValidUrl,
normalizeUrl,
extractDomain,
resolveUrl,
isExternalUrl,
} from 'scrapex';
isValidUrl('https://example.com');
// true
normalizeUrl('https://example.com/page?utm_source=twitter');
// 'https://example.com/page' (tracking params removed)
extractDomain('https://www.example.com/path');
// 'example.com'
resolveUrl('/path', 'https://example.com/page');
// 'https://example.com/path'
isExternalUrl('https://other.com', 'example.com');
// true

import { scrape, ScrapeError } from 'scrapex';
try {
const result = await scrape('https://example.com');
} catch (error) {
if (error instanceof ScrapeError) {
console.log(error.code); // 'FETCH_FAILED' | 'TIMEOUT' | 'INVALID_URL' | ...
console.log(error.statusCode); // HTTP status if available
console.log(error.isRetryable()); // true for network errors
}
}

Error codes:

- `FETCH_FAILED` - Network request failed
- `TIMEOUT` - Request timed out
- `INVALID_URL` - URL is malformed
- `BLOCKED` - Access denied (403)
- `NOT_FOUND` - Page not found (404)
- `ROBOTS_BLOCKED` - Blocked by robots.txt
- `PARSE_ERROR` - HTML parsing failed
- `LLM_ERROR` - LLM provider error
- `VALIDATION_ERROR` - Schema validation failed
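As an illustration, `isRetryable()` can drive a simple retry wrapper; the backoff policy below is an assumption made for this sketch, not something scrapex provides, and it assumes the `ScrapedData` type shown above is exported:

```ts
import { scrape, ScrapeError, type ScrapedData } from 'scrapex';

// Illustrative wrapper: retry only the errors scrapex marks as retryable.
async function scrapeWithRetry(url: string, maxAttempts = 3): Promise<ScrapedData> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await scrape(url);
    } catch (error) {
      const retryable = error instanceof ScrapeError && error.isRetryable();
      if (!retryable || attempt >= maxAttempts) throw error;
      await new Promise((resolve) => setTimeout(resolve, 500 * attempt)); // simple backoff
    }
  }
}

const result = await scrapeWithRetry('https://example.com/article');
```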
import { scrape, checkRobotsTxt } from 'scrapex';
// Check before scraping
const check = await checkRobotsTxt('https://example.com/path');
if (check.allowed) {
const result = await scrape('https://example.com/path');
}
// Or let scrape() handle it
const result = await scrape('https://example.com/path', {
respectRobots: true, // Throws if blocked
});

| Extractor | Priority | Description |
|---|---|---|
| `MetaExtractor` | 100 | OG, Twitter, meta tags |
| `JsonLdExtractor` | 80 | JSON-LD structured data |
| `ContentExtractor` | 50 | Readability + Turndown |
| `FaviconExtractor` | 70 | Favicon discovery |
| `LinksExtractor` | 30 | Content link extraction |
interface ScrapeOptions {
timeout?: number; // Default: 10000ms
userAgent?: string; // Custom user agent
extractContent?: boolean; // Default: true
maxContentLength?: number; // Default: 50000 chars
fetcher?: Fetcher; // Custom fetcher
extractors?: Extractor[]; // Additional extractors
replaceDefaultExtractors?: boolean;
respectRobots?: boolean; // Check robots.txt
llm?: LLMProvider; // LLM provider
enhance?: EnhancementType[]; // LLM enhancements
extract?: ExtractionSchema; // Structured extraction
}

type EnhancementType =
| 'summarize' // Generate summary
| 'tags' // Extract keywords/tags
| 'entities' // Extract named entities
| 'classify'; // Classify content type

Requirements:

- Node.js 20+
- TypeScript 5.0+ (for type imports)

License: MIT
Rakesh Paul - binaryroute