scrapex

Modern web scraper with LLM-enhanced extraction, extensible pipeline, and pluggable parsers.

Beta Release: v1.0.0 is currently in beta. The API is stable but minor changes may occur before the stable release.

Features

  • LLM-Ready Output - Content extracted as Markdown, optimized for AI/LLM consumption
  • Provider-Agnostic LLM - Works with OpenAI, Anthropic, Ollama, LM Studio, or any OpenAI-compatible API
  • Vector Embeddings - Generate embeddings with OpenAI, Azure, Cohere, HuggingFace, Ollama, or local Transformers.js
  • Extensible Pipeline - Pluggable extractors with priority-based execution
  • Smart Extraction - Uses Mozilla Readability for content, Cheerio for metadata
  • Markdown Parsing - Parse markdown content, awesome lists, and GitHub repos
  • RSS/Atom Feeds - Parse RSS 2.0, RSS 1.0 (RDF), and Atom feeds with pagination support
  • TypeScript First - Full type safety with comprehensive type exports
  • Dual Format - ESM and CommonJS builds

Installation

npm install scrapex@beta

Optional Peer Dependencies

# For LLM features
npm install openai           # OpenAI/Ollama/LM Studio
npm install @anthropic-ai/sdk  # Anthropic Claude

# For JavaScript-rendered pages
npm install puppeteer

Quick Start

import { scrape } from 'scrapex';

const result = await scrape('https://example.com/article');

console.log(result.title);       // "Article Title"
console.log(result.content);     // Markdown content
console.log(result.textContent); // Plain text (fewer tokens)
console.log(result.excerpt);     // First ~300 chars

API Reference

scrape(url, options?)

Fetch and extract metadata and content from a URL.

import { scrape } from 'scrapex';

const result = await scrape('https://example.com', {
  timeout: 10000,
  userAgent: 'MyBot/1.0',
  extractContent: true,
  maxContentLength: 50000,
  respectRobots: false,
});

scrapeHtml(html, url, options?)

Extract from raw HTML without fetching.

import { scrapeHtml } from 'scrapex';

const html = await fetch('https://example.com').then((r) => r.text()); // any fetch method works
const result = await scrapeHtml(html, 'https://example.com');

Result Object (ScrapedData)

interface ScrapedData {
  // Identity
  url: string;
  canonicalUrl: string;
  domain: string;

  // Basic metadata
  title: string;
  description: string;
  image?: string;
  favicon?: string;

  // Content (LLM-optimized)
  content: string;      // Markdown format
  textContent: string;  // Plain text
  excerpt: string;      // ~300 char preview
  wordCount: number;

  // Context
  author?: string;
  publishedAt?: string;
  modifiedAt?: string;
  siteName?: string;
  language?: string;

  // Classification
  contentType: 'article' | 'repo' | 'docs' | 'package' | 'video' | 'tool' | 'product' | 'unknown';
  keywords: string[];

  // Structured data
  jsonLd?: Record<string, unknown>[];
  links?: ExtractedLink[];

  // LLM Enhancements (when enabled)
  summary?: string;
  suggestedTags?: string[];
  entities?: ExtractedEntities;
  extracted?: Record<string, unknown>;

  // Meta
  scrapedAt: string;
  scrapeTimeMs: number;
  error?: string;
}
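
Because content is already Markdown and textContent is plain text, assembling a downstream LLM prompt from a result is straightforward. A minimal sketch (the 4000-character cap is an arbitrary choice for illustration, not a library default):

const prompt = [
  `Title: ${result.title}`,
  `Source: ${result.canonicalUrl}`,
  result.author ? `Author: ${result.author}` : '',
  '',
  result.textContent.slice(0, 4000), // cap input size; adjust per model
]
  .filter(Boolean)
  .join('\n');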

LLM Integration

Using OpenAI

import { scrape } from 'scrapex';
import { createOpenAI } from 'scrapex/llm';

const llm = createOpenAI({ apiKey: 'sk-...' });

const result = await scrape('https://example.com/article', {
  llm,
  enhance: ['summarize', 'tags', 'entities', 'classify'],
});

console.log(result.summary);       // AI-generated summary
console.log(result.suggestedTags); // ['javascript', 'web', ...]
console.log(result.entities);      // { people: [], organizations: [], ... }

Embeddings

Generate vector embeddings from scraped content for semantic search, RAG, and similarity matching:

import { scrape } from 'scrapex';
import { createOpenAIEmbedding } from 'scrapex/embeddings';

const result = await scrape('https://example.com/article', {
  embeddings: {
    provider: { type: 'custom', provider: createOpenAIEmbedding() },
    model: 'text-embedding-3-small',
  },
});

if (result.embeddings?.status === 'success') {
  console.log(result.embeddings.vector); // [0.023, -0.041, ...]
}

Features include:

  • Multiple providers - OpenAI, Azure, Cohere, HuggingFace, Ollama, Transformers.js
  • PII redaction - Automatically redact emails, phone numbers, and SSNs before sending content to external APIs
  • Smart chunking - Split long content with configurable overlap
  • Caching - Content-addressable cache to avoid redundant API calls
  • Resilience - Retry, circuit breaker, rate limiting

See the Embeddings Guide for full documentation.
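
Once you have vectors, similarity scoring is plain math and needs nothing from scrapex. A minimal cosine-similarity helper for comparing two scraped pages:

// Cosine similarity between two equal-length vectors
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// e.g. compare two scraped articles:
// const score = cosineSimilarity(resultA.embeddings.vector, resultB.embeddings.vector);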

Breaking Changes (Beta)

  • LLM provider classes (e.g., AnthropicProvider) were removed. Use preset factories like createOpenAI, createAnthropic, createOllama, and createLMStudio instead.
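
A typical migration looks like this (the old constructor shape below is approximate and shown only for contrast; your v0.x code may differ):

// Before (removed in beta; shape is illustrative)
// const llm = new AnthropicProvider({ apiKey: process.env.ANTHROPIC_API_KEY });

// After
import { createAnthropic } from 'scrapex/llm';
const llm = createAnthropic({ apiKey: process.env.ANTHROPIC_API_KEY });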

Using Anthropic Claude

import { createAnthropic } from 'scrapex/llm';

const llm = createAnthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
  model: 'claude-3-5-haiku-20241022', // or 'claude-3-5-sonnet-20241022'
});

const result = await scrape(url, { llm, enhance: ['summarize'] });

Using Ollama (Local)

import { createOllama } from 'scrapex/llm';

const llm = createOllama({ model: 'llama3.2' });

const result = await scrape(url, { llm, enhance: ['summarize'] });

Using LM Studio (Local)

import { createLMStudio } from 'scrapex/llm';

const llm = createLMStudio({ model: 'local-model' });

const result = await scrape(url, { llm, enhance: ['summarize'] });

Structured Extraction

Extract specific data using a schema:

const result = await scrape('https://example.com/product', {
  llm,
  extract: {
    productName: 'string',
    price: 'number',
    features: 'string[]',
    inStock: 'boolean',
    sku: 'string?', // optional
  },
});

console.log(result.extracted);
// { productName: "Widget", price: 29.99, features: [...], inStock: true }

Custom Extractors

Create custom extractors to add domain-specific extraction logic:

import { scrape, type Extractor, type ExtractionContext } from 'scrapex';

const recipeExtractor: Extractor = {
  name: 'recipe',
  priority: 60, // Higher = runs earlier

  async extract(context: ExtractionContext) {
    const { $ } = context;

    return {
      custom: {
        ingredients: $('.ingredients li').map((_, el) => $(el).text()).get(),
        cookTime: $('[itemprop="cookTime"]').attr('content'),
        servings: $('[itemprop="recipeYield"]').text(),
      },
    };
  },
};

const result = await scrape('https://example.com/recipe', {
  extractors: [recipeExtractor],
});

console.log(result.custom?.ingredients);

Replacing Default Extractors

const result = await scrape(url, {
  replaceDefaultExtractors: true,
  extractors: [myCustomExtractor],
});

Markdown Parsing

Parse markdown content:

import { MarkdownParser, extractListLinks, groupByCategory } from 'scrapex/parsers';

// Parse any markdown
const parser = new MarkdownParser();
const result = parser.parse(markdownContent);

console.log(result.data.title);
console.log(result.data.sections);
console.log(result.data.links);
console.log(result.data.codeBlocks);

// Extract links from markdown lists and group by category
const links = extractListLinks(markdownContent);
const grouped = groupByCategory(links);

grouped.forEach((categoryLinks, category) => {
  console.log(`${category}: ${categoryLinks.length} links`);
});

GitHub Utilities

import {
  isGitHubRepo,
  parseGitHubUrl,
  toRawUrl,
} from 'scrapex/parsers';

isGitHubRepo('https://github.com/owner/repo');
// true

parseGitHubUrl('https://github.com/facebook/react');
// { owner: 'facebook', repo: 'react' }

toRawUrl('https://github.com/owner/repo');
// 'https://raw.githubusercontent.com/owner/repo/main/README.md'

RSS/Atom Feed Parsing

Parse RSS 2.0, RSS 1.0 (RDF), and Atom 1.0 feeds:

import { RSSParser } from 'scrapex';

const parser = new RSSParser();
const result = parser.parse(feedXml, 'https://example.com/feed.xml');

console.log(result.data.format);  // 'rss2' | 'rss1' | 'atom'
console.log(result.data.title);   // Feed title
console.log(result.data.items);   // Array of feed items

Supported formats:

  • rss2 - RSS 2.0 (most common format)
  • rss1 - RSS 1.0 (RDF-based, older format)
  • atom - Atom 1.0 (modern format with better semantics)
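
Format-specific handling can branch on result.data.format. For example, updatedAt is populated for Atom items only (see the item structure below):

if (result.data.format === 'atom') {
  // Atom items may carry a separate updated timestamp
  console.log(result.data.items[0]?.updatedAt);
}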

Feed Item Structure

interface FeedItem {
  id: string;
  title: string;
  link: string;
  description?: string;
  content?: string;
  author?: string;
  publishedAt?: string;      // ISO 8601
  rawPublishedAt?: string;   // Original date string
  updatedAt?: string;        // Atom only
  categories: string[];
  enclosure?: FeedEnclosure; // Podcast/media attachments
  customFields?: Record<string, string>;
}
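
Continuing the RSSParser example above, iterating the parsed items looks like this:

for (const item of result.data.items) {
  const date = item.publishedAt ?? 'undated';
  console.log(`[${date}] ${item.title} -> ${item.link}`);
}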

Fetching and Parsing Feeds

import { fetchFeed, paginateFeed } from 'scrapex';

// Fetch and parse in one call
const result = await fetchFeed('https://example.com/feed.xml');
console.log(result.data.items);

// Paginate through feeds with rel="next" links (Atom)
for await (const page of paginateFeed('https://example.com/atom')) {
  console.log(`Page with ${page.items.length} items`);
}

Discovering Feeds in HTML

import { discoverFeeds } from 'scrapex';

const html = await fetch('https://example.com').then(r => r.text());
const feedUrls = discoverFeeds(html, 'https://example.com');
// ['https://example.com/feed.xml', 'https://example.com/atom.xml']
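
discoverFeeds composes naturally with fetchFeed: discover the feeds a page advertises, then fetch and parse each one. A short sketch using the two APIs above:

import { discoverFeeds, fetchFeed } from 'scrapex';

const html = await fetch('https://example.com').then((r) => r.text());
for (const feedUrl of discoverFeeds(html, 'https://example.com')) {
  const feed = await fetchFeed(feedUrl);
  console.log(`${feed.data.title}: ${feed.data.items.length} items`);
}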

Filtering by Date

import { RSSParser, filterByDate } from 'scrapex';

const parser = new RSSParser();
const result = parser.parse(feedXml);

const recentItems = filterByDate(result.data.items, {
  after: new Date('2024-01-01'),
  before: new Date('2024-12-31'),
  includeUndated: false,
});

Converting to Markdown/Text

import { RSSParser, feedToMarkdown, feedToText } from 'scrapex';

const parser = new RSSParser();
const result = parser.parse(feedXml);

// Convert to markdown (great for LLM consumption)
const markdown = feedToMarkdown(result.data, { maxItems: 10 });

// Convert to plain text
const text = feedToText(result.data);

Custom Fields (Podcast/Media)

Extract custom namespace fields like iTunes podcast tags:

const parser = new RSSParser({
  customFields: {
    duration: 'itunes\\:duration',
    explicit: 'itunes\\:explicit',
    rating: 'media\\:rating',
  },
});

const result = parser.parse(podcastXml);
const item = result.data.items[0];

console.log(item.customFields?.duration);  // '10:00'
console.log(item.customFields?.explicit);  // 'no'

Security

The RSS parser enforces strict URL security:

  • HTTPS-only URLs: The RSS/Atom parser (RSSParser) resolves all links to HTTPS only. Non-HTTPS URLs (http, javascript, data, file) are rejected and returned as empty strings. This restriction is specific to feed parsing and prevents malicious links in untrusted feeds.
  • XML Mode: Feeds are parsed with Cheerio's { xml: true } mode, which disables HTML entity processing and prevents XSS vectors.

Note: The public URL utilities (resolveUrl, isValidUrl, etc.) accept both http: and https: URLs. Protocol-relative URLs (e.g., //example.com/path) are resolved against the base URL's protocol by the standard URL constructor.
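
For instance, a protocol-relative link inherits the base URL's scheme, per standard URL constructor behavior:

import { resolveUrl } from 'scrapex';

resolveUrl('//cdn.example.com/app.js', 'https://example.com/page');
// 'https://cdn.example.com/app.js'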

URL Utilities

import {
  isValidUrl,
  normalizeUrl,
  extractDomain,
  resolveUrl,
  isExternalUrl,
} from 'scrapex';

isValidUrl('https://example.com');
// true

normalizeUrl('https://example.com/page?utm_source=twitter');
// 'https://example.com/page' (tracking params removed)

extractDomain('https://www.example.com/path');
// 'example.com'

resolveUrl('/path', 'https://example.com/page');
// 'https://example.com/path'

isExternalUrl('https://other.com', 'example.com');
// true

Error Handling

import { scrape, ScrapeError } from 'scrapex';

try {
  const result = await scrape('https://example.com');
} catch (error) {
  if (error instanceof ScrapeError) {
    console.log(error.code);       // 'FETCH_FAILED' | 'TIMEOUT' | 'INVALID_URL' | ...
    console.log(error.statusCode); // HTTP status if available
    console.log(error.isRetryable()); // true for network errors
  }
}

Error codes:

  • FETCH_FAILED - Network request failed
  • TIMEOUT - Request timed out
  • INVALID_URL - URL is malformed
  • BLOCKED - Access denied (403)
  • NOT_FOUND - Page not found (404)
  • ROBOTS_BLOCKED - Blocked by robots.txt
  • PARSE_ERROR - HTML parsing failed
  • LLM_ERROR - LLM provider error
  • VALIDATION_ERROR - Schema validation failed
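
isRetryable() makes a retry wrapper easy to build. A minimal sketch with exponential backoff (the backoff policy here is illustrative, not part of scrapex):

import { scrape, ScrapeError } from 'scrapex';

async function scrapeWithRetry(url: string, attempts = 3) {
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await scrape(url);
    } catch (error) {
      const retryable = error instanceof ScrapeError && error.isRetryable();
      if (!retryable || attempt === attempts - 1) throw error;
      await new Promise((r) => setTimeout(r, 500 * 2 ** attempt)); // 500ms, 1s, 2s, ...
    }
  }
}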

Robots.txt

import { scrape, checkRobotsTxt } from 'scrapex';

// Check before scraping
const check = await checkRobotsTxt('https://example.com/path');
if (check.allowed) {
  const result = await scrape('https://example.com/path');
}

// Or let scrape() handle it
const result = await scrape('https://example.com/path', {
  respectRobots: true, // Throws if blocked
});

Built-in Extractors

Extractor          Priority   Description
MetaExtractor      100        OG, Twitter, meta tags
JsonLdExtractor    80         JSON-LD structured data
FaviconExtractor   70         Favicon discovery
ContentExtractor   50         Readability + Turndown
LinksExtractor     30         Content link extraction

Configuration

Options

interface ScrapeOptions {
  timeout?: number;              // Default: 10000ms
  userAgent?: string;            // Custom user agent
  extractContent?: boolean;      // Default: true
  maxContentLength?: number;     // Default: 50000 chars
  fetcher?: Fetcher;             // Custom fetcher
  extractors?: Extractor[];      // Additional extractors
  replaceDefaultExtractors?: boolean;
  respectRobots?: boolean;       // Check robots.txt
  llm?: LLMProvider;             // LLM provider
  enhance?: EnhancementType[];   // LLM enhancements
  extract?: ExtractionSchema;    // Structured extraction
}

Enhancement Types

type EnhancementType =
  | 'summarize'  // Generate summary
  | 'tags'       // Extract keywords/tags
  | 'entities'   // Extract named entities
  | 'classify';  // Classify content type

Requirements

  • Node.js 20+
  • TypeScript 5.0+ (for type imports)

License

MIT

Author

Rakesh Paul - binaryroute
