
Unstoppable - Product Data Scraper

A high-performance, extensible web scraper built with TypeScript and Crawlee for extracting product data from e-commerce websites.

Features

  • Extensible Architecture: Easy to add new website scrapers by extending the base scraper class
  • Parallel Processing: Leverages Crawlee's concurrent crawling capabilities
  • TypeScript: Full type safety and better development experience
  • Test-Driven: Comprehensive test suite with HTML snapshots
  • Data Export: Export scraped data in JSON or CSV formats
  • Proxy Support: Built-in proxy rotation support

Project Structure

unstoppable/
├── src/
│   ├── crawlers/        # Crawler implementations
│   ├── scrapers/        # Website-specific scrapers
│   ├── models/          # Data models
│   ├── utils/           # Utility functions
│   └── tests/           # Test files and fixtures
├── dist/                # Compiled JavaScript output
└── storage/             # Crawlee data storage

Installation

npm install

Usage

Product Scraping

Development

npm start
# or
npm run start:dev

Production

npm run build
npm run start:prod

Category Extraction

Extract categories from e-commerce websites:

# Extract all categories from Thomann
npm run extract-categories

# Extract from specific URL
npm run extract-categories -- --url https://www.thomann.de/de/index.html

# Export as CSV
npm run extract-categories -- --format csv

# Build category tree structure
npm run extract-categories -- --tree

# Verbose logging
npm run extract-categories -- --verbose

Testing

npm test

Linting and Type Checking

npm run lint
npm run typecheck

Adding a New Website Scraper

  1. Create a new scraper class extending BaseScraper:
import { BaseScraper } from './BaseScraper.js';

export class NewWebsiteScraper extends BaseScraper {
  constructor() {
    super({
      name: 'NewWebsite',
      baseUrl: 'https://example.com',
    });
  }

  // Implement required methods
  async scrapeProductList(url: string): Promise<ScraperResult> { /* ... */ }
  async scrapeProductDetail(url: string): Promise<Product> { /* ... */ }
  async getCategoryUrls(): Promise<string[]> { /* ... */ }
  isProductUrl(url: string): boolean { /* ... */ }
  isCategoryUrl(url: string): boolean { /* ... */ }
}
  2. Use the scraper with ProductCrawler:
import { ProductCrawler } from './crawlers/ProductCrawler.js';
import { NewWebsiteScraper } from './scrapers/NewWebsiteScraper.js';

const scraper = new NewWebsiteScraper();
const crawler = new ProductCrawler({ scraper });
await crawler.run();
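The two URL-classifier methods are typically just pattern checks against the site's URL scheme. A minimal sketch, assuming a hypothetical shop whose product pages end in `_<id>.htm` and whose category pages follow a `cat_*.html` pattern (both patterns are invented for illustration; a real scraper needs the target site's actual conventions):

```typescript
// Hypothetical URL patterns for an imaginary shop -- adjust per target site.
export function isProductUrl(url: string): boolean {
  // e.g. https://www.example.com/strat_12345.htm
  return /_\d+\.htm$/.test(url);
}

export function isCategoryUrl(url: string): boolean {
  // e.g. https://www.example.com/cat_guitars.html
  return /\/cat_[a-z0-9_]+\.html$/.test(url);
}
```

Keeping these checks as pure functions makes them easy to unit-test against the HTML snapshot fixtures without running a crawl.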

Data Models

Product Model

Products are scraped with the following structure:

interface Product {
  id: string;
  sku?: string;
  name: string;
  brand?: string;
  category: string[];
  price: {
    currency: string;
    amount: number;
    originalAmount?: number;
    discount?: number;
  };
  availability: {
    inStock: boolean;
    stockLevel?: number;
    deliveryTime?: string;
  };
  images: string[];
  description?: string;
  specifications?: Record<string, string>;
  ratings?: {
    average: number;
    count: number;
  };
  url: string;
  scrapedAt: Date;
  source: string;
}
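For reference, a record conforming to this interface might look like the following. All values are illustrative, and `discount` is shown as the absolute difference between `originalAmount` and `amount`, which is an assumption about its semantics:

```typescript
// The interface is copied from the model above; the record is made up.
interface Product {
  id: string;
  sku?: string;
  name: string;
  brand?: string;
  category: string[];
  price: { currency: string; amount: number; originalAmount?: number; discount?: number };
  availability: { inStock: boolean; stockLevel?: number; deliveryTime?: string };
  images: string[];
  description?: string;
  specifications?: Record<string, string>;
  ratings?: { average: number; count: number };
  url: string;
  scrapedAt: Date;
  source: string;
}

const example: Product = {
  id: 'example-123',
  name: 'Example Stratocaster',
  brand: 'Example',
  category: ['Guitars', 'Electric Guitars'],
  price: { currency: 'EUR', amount: 499, originalAmount: 599, discount: 100 },
  availability: { inStock: true, deliveryTime: '2-3 days' },
  images: ['https://example.com/img/strat.jpg'],
  url: 'https://example.com/strat_123.htm',
  scrapedAt: new Date(),
  source: 'example.com',
};
```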

Category Model

Categories are extracted with the following structure:

interface Category {
  name: string;
  url: string;
  code: string;
  parentCategory?: string;
  level: number;
  productCount?: number;
  subcategories?: Category[];
  scrapedAt: Date;
  source: string;
}
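The `--tree` output implies linking flat category records into the nested `subcategories` form. A minimal sketch of that step, assuming `parentCategory` stores the parent's `code` (the actual linkage field is an assumption):

```typescript
// Subset of the Category model above, enough for tree building.
interface Category {
  name: string;
  url: string;
  code: string;
  parentCategory?: string;
  level: number;
  productCount?: number;
  subcategories?: Category[];
  scrapedAt: Date;
  source: string;
}

// Turn a flat list of categories into a forest of root categories.
function buildCategoryTree(flat: Category[]): Category[] {
  // Index every node by code first, so child-before-parent ordering is fine.
  const byCode = new Map<string, Category>();
  for (const c of flat) byCode.set(c.code, { ...c, subcategories: [] });

  const roots: Category[] = [];
  for (const node of byCode.values()) {
    const parent =
      node.parentCategory !== undefined ? byCode.get(node.parentCategory) : undefined;
    if (parent) parent.subcategories!.push(node);
    else roots.push(node); // no parent (or parent not scraped) -> treat as root
  }
  return roots;
}
```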

Configuration

Configure the crawler behavior through ProductCrawler options:

  • maxRequestsPerCrawl: Limit the number of requests
  • proxyUrls: Array of proxy URLs for rotation
  • startUrls: Override default category URLs
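A configuration object using these option names might look like this. The exact option types and defaults are assumptions, and the proxy URL is a placeholder:

```typescript
// Sketch of the options listed above; types are assumed, not taken from the source.
interface ProductCrawlerOptions {
  maxRequestsPerCrawl?: number;
  proxyUrls?: string[];
  startUrls?: string[];
}

const options: ProductCrawlerOptions = {
  maxRequestsPerCrawl: 500,
  proxyUrls: ['http://proxy1.example.com:8000'],
  startUrls: ['https://www.thomann.de/de/index.html'],
};
```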

License

ISC
