
Unstoppable - Product Data Scraper

A high-performance, extensible web scraper built with TypeScript and Crawlee for extracting product data from e-commerce websites.

Features

  • Extensible Architecture: Easy to add new website scrapers by extending the base scraper class
  • Parallel Processing: Leverages Crawlee's concurrent crawling capabilities
  • TypeScript: Full type safety and better development experience
  • Test-Driven: Comprehensive test suite with HTML snapshots
  • Data Export: Export scraped data in JSON or CSV formats
  • Proxy Support: Built-in proxy rotation support

Project Structure

unstoppable/
├── src/
│   ├── crawlers/        # Crawler implementations
│   ├── scrapers/        # Website-specific scrapers
│   ├── models/          # Data models
│   ├── utils/           # Utility functions
│   └── tests/           # Test files and fixtures
├── dist/                # Compiled JavaScript output
└── storage/             # Crawlee data storage

Installation

npm install

Usage

Product Scraping

Development

npm start
# or
npm run start:dev

Production

npm run build
npm run start:prod

Category Extraction

Extract categories from e-commerce websites:

# Extract all categories from Thomann
npm run extract-categories

# Extract from specific URL
npm run extract-categories -- --url https://www.thomann.de/de/index.html

# Export as CSV
npm run extract-categories -- --format csv

# Build category tree structure
npm run extract-categories -- --tree

# Verbose logging
npm run extract-categories -- --verbose

Testing

npm test

Linting and Type Checking

npm run lint
npm run typecheck

Adding a New Website Scraper

  1. Create a new scraper class extending BaseScraper:
import { BaseScraper } from './BaseScraper.js';

export class NewWebsiteScraper extends BaseScraper {
  constructor() {
    super({
      name: 'NewWebsite',
      baseUrl: 'https://example.com',
    });
  }

  // Implement required methods
  async scrapeProductList(url: string): Promise<ScraperResult> { /* ... */ }
  async scrapeProductDetail(url: string): Promise<Product> { /* ... */ }
  async getCategoryUrls(): Promise<string[]> { /* ... */ }
  isProductUrl(url: string): boolean { /* ... */ }
  isCategoryUrl(url: string): boolean { /* ... */ }
}
  2. Use the scraper with ProductCrawler:
import { ProductCrawler } from './crawlers/ProductCrawler.js';
import { NewWebsiteScraper } from './scrapers/NewWebsiteScraper.js';

const scraper = new NewWebsiteScraper();
const crawler = new ProductCrawler({ scraper });
await crawler.run();
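The two URL-classifier methods are typically just pattern checks against the site's URL scheme. A minimal sketch, assuming a hypothetical shop whose product pages end in `_<id>.htm` and whose category pages follow a `cat_*.html` pattern (both patterns are invented for illustration; a real scraper needs the target site's actual conventions):

```typescript
// Hypothetical URL patterns for an imaginary shop -- adjust per target site.
export function isProductUrl(url: string): boolean {
  // e.g. https://www.example.com/strat_12345.htm
  return /_\d+\.htm$/.test(url);
}

export function isCategoryUrl(url: string): boolean {
  // e.g. https://www.example.com/cat_guitars.html
  return /\/cat_[a-z0-9_]+\.html$/.test(url);
}
```

Keeping these checks as pure functions makes them easy to unit-test against the HTML snapshot fixtures without running a crawl.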

Data Models

Product Model

Products are scraped with the following structure:

interface Product {
  id: string;
  sku?: string;
  name: string;
  brand?: string;
  category: string[];
  price: {
    currency: string;
    amount: number;
    originalAmount?: number;
    discount?: number;
  };
  availability: {
    inStock: boolean;
    stockLevel?: number;
    deliveryTime?: string;
  };
  images: string[];
  description?: string;
  specifications?: Record<string, string>;
  ratings?: {
    average: number;
    count: number;
  };
  url: string;
  scrapedAt: Date;
  source: string;
}
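For reference, a record conforming to this interface might look like the following. All values are illustrative, and `discount` is shown as the absolute difference between `originalAmount` and `amount`, which is an assumption about its semantics:

```typescript
// The interface is copied from the model above; the record is made up.
interface Product {
  id: string;
  sku?: string;
  name: string;
  brand?: string;
  category: string[];
  price: { currency: string; amount: number; originalAmount?: number; discount?: number };
  availability: { inStock: boolean; stockLevel?: number; deliveryTime?: string };
  images: string[];
  description?: string;
  specifications?: Record<string, string>;
  ratings?: { average: number; count: number };
  url: string;
  scrapedAt: Date;
  source: string;
}

const example: Product = {
  id: 'example-123',
  name: 'Example Stratocaster',
  brand: 'Example',
  category: ['Guitars', 'Electric Guitars'],
  price: { currency: 'EUR', amount: 499, originalAmount: 599, discount: 100 },
  availability: { inStock: true, deliveryTime: '2-3 days' },
  images: ['https://example.com/img/strat.jpg'],
  url: 'https://example.com/strat_123.htm',
  scrapedAt: new Date(),
  source: 'example.com',
};
```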

Category Model

Categories are extracted with the following structure:

interface Category {
  name: string;
  url: string;
  code: string;
  parentCategory?: string;
  level: number;
  productCount?: number;
  subcategories?: Category[];
  scrapedAt: Date;
  source: string;
}
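The `--tree` output implies linking flat category records into the nested `subcategories` form. A minimal sketch of that step, assuming `parentCategory` stores the parent's `code` (the actual linkage field is an assumption):

```typescript
// Subset of the Category model above, enough for tree building.
interface Category {
  name: string;
  url: string;
  code: string;
  parentCategory?: string;
  level: number;
  productCount?: number;
  subcategories?: Category[];
  scrapedAt: Date;
  source: string;
}

// Turn a flat list of categories into a forest of root categories.
function buildCategoryTree(flat: Category[]): Category[] {
  // Index every node by code first, so child-before-parent ordering is fine.
  const byCode = new Map<string, Category>();
  for (const c of flat) byCode.set(c.code, { ...c, subcategories: [] });

  const roots: Category[] = [];
  for (const node of byCode.values()) {
    const parent =
      node.parentCategory !== undefined ? byCode.get(node.parentCategory) : undefined;
    if (parent) parent.subcategories!.push(node);
    else roots.push(node); // no parent (or parent not scraped) -> treat as root
  }
  return roots;
}
```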

Configuration

Configure the crawler behavior through ProductCrawler options:

  • maxRequestsPerCrawl: Limit the number of requests
  • proxyUrls: Array of proxy URLs for rotation
  • startUrls: Override default category URLs
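A configuration object using these option names might look like this. The exact option types and defaults are assumptions, and the proxy URL is a placeholder:

```typescript
// Sketch of the options listed above; types are assumed, not taken from the source.
interface ProductCrawlerOptions {
  maxRequestsPerCrawl?: number;
  proxyUrls?: string[];
  startUrls?: string[];
}

const options: ProductCrawlerOptions = {
  maxRequestsPerCrawl: 500,
  proxyUrls: ['http://proxy1.example.com:8000'],
  startUrls: ['https://www.thomann.de/de/index.html'],
};
```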

License

ISC
