A high-performance, extensible web scraper built with TypeScript and Crawlee for extracting product data from e-commerce websites.
## Features

- **Extensible Architecture**: Easy to add new website scrapers by extending the base scraper class
- **Parallel Processing**: Leverages Crawlee's concurrent crawling capabilities
- **TypeScript**: Full type safety and a better development experience
- **Test-Driven**: Comprehensive test suite with HTML snapshots
- **Data Export**: Export scraped data in JSON or CSV formats
- **Proxy Support**: Built-in proxy rotation support
## Project Structure

```
unstoppable/
├── src/
│   ├── crawlers/    # Crawler implementations
│   ├── scrapers/    # Website-specific scrapers
│   ├── models/      # Data models
│   ├── utils/       # Utility functions
│   └── tests/       # Test files and fixtures
├── dist/            # Compiled JavaScript output
└── storage/         # Crawlee data storage
```
## Installation

```bash
npm install
```

## Usage

### Development

```bash
npm start
# or
npm run start:dev
```

### Production

```bash
npm run build
npm run start:prod
```

### Category Extraction

Extract categories from e-commerce websites:
```bash
# Extract all categories from Thomann
npm run extract-categories

# Extract from specific URL
npm run extract-categories -- --url https://www.thomann.de/de/index.html

# Export as CSV
npm run extract-categories -- --format csv

# Build category tree structure
npm run extract-categories -- --tree

# Verbose logging
npm run extract-categories -- --verbose
```

## Testing

```bash
npm test
```

## Linting and Type Checking

```bash
npm run lint
npm run typecheck
```

## Adding a New Scraper

- Create a new scraper class extending `BaseScraper`:
```typescript
import { BaseScraper } from './BaseScraper.js';
// Assumed import path: Product and ScraperResult live in src/models
import type { Product, ScraperResult } from '../models/index.js';

export class NewWebsiteScraper extends BaseScraper {
  constructor() {
    super({
      name: 'NewWebsite',
      baseUrl: 'https://example.com',
    });
  }

  // Implement the required methods
  async scrapeProductList(url: string): Promise<ScraperResult> { /* ... */ }
  async scrapeProductDetail(url: string): Promise<Product> { /* ... */ }
  async getCategoryUrls(): Promise<string[]> { /* ... */ }
  isProductUrl(url: string): boolean { /* ... */ }
  isCategoryUrl(url: string): boolean { /* ... */ }
}
```
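The URL classifiers are often simple pattern checks. A minimal sketch of the two predicates, assuming hypothetical `/product/` and `/category/` path segments (real sites will differ):

```typescript
// Inside the NewWebsiteScraper class body.
// The /product/ and /category/ path segments are hypothetical; adjust
// them to the target site's actual URL scheme.
isProductUrl(url: string): boolean {
  return /\/product\/[\w-]+/.test(url);
}

isCategoryUrl(url: string): boolean {
  return /\/category\/[\w-]+/.test(url);
}
```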
- Use the scraper with `ProductCrawler`:

```typescript
import { ProductCrawler } from './crawlers/ProductCrawler.js';
import { NewWebsiteScraper } from './scrapers/NewWebsiteScraper.js';

const scraper = new NewWebsiteScraper();
const crawler = new ProductCrawler({ scraper });
await crawler.run();
```
## Data Models

### Product

Products are scraped with the following structure:

```typescript
interface Product {
  id: string;
  sku?: string;
  name: string;
  brand?: string;
  category: string[];
  price: {
    currency: string;
    amount: number;
    originalAmount?: number;
    discount?: number;
  };
  availability: {
    inStock: boolean;
    stockLevel?: number;
    deliveryTime?: string;
  };
  images: string[];
  description?: string;
  specifications?: Record<string, string>;
  ratings?: {
    average: number;
    count: number;
  };
  url: string;
  scrapedAt: Date;
  source: string;
}
```
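For reference, a hypothetical product record under this schema (all values are illustrative, not real scraped data):

```typescript
const example: Product = {
  id: 'thomann-123456',          // illustrative ID
  sku: '123456',
  name: 'Example Stratocaster',
  brand: 'Fender',
  category: ['Guitars', 'Electric Guitars'],
  price: { currency: 'EUR', amount: 799, originalAmount: 899, discount: 100 },
  availability: { inStock: true, deliveryTime: '2-3 days' },
  images: ['https://example.com/img/123456.jpg'],
  url: 'https://www.thomann.de/de/example.html',
  scrapedAt: new Date(),
  source: 'Thomann',
};
```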
### Category

Categories are extracted with the following structure:

```typescript
interface Category {
  name: string;
  url: string;
  code: string;
  parentCategory?: string;
  level: number;
  productCount?: number;
  subcategories?: Category[];
  scrapedAt: Date;
  source: string;
}
```
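The `--tree` CLI option produces nested categories; conceptually, the flat list can be folded into a tree via `parentCategory` links. A minimal sketch, assuming `parentCategory` holds the parent's `code` (the actual linkage field is an assumption):

```typescript
function buildCategoryTree(categories: Category[]): Category[] {
  // Index every category by code, giving each node a fresh
  // subcategories array so the input is not mutated.
  const byCode = new Map<string, Category>();
  for (const c of categories) {
    byCode.set(c.code, { ...c, subcategories: [] });
  }

  const roots: Category[] = [];
  for (const node of byCode.values()) {
    // Assumption: parentCategory stores the parent's `code`.
    const parent = node.parentCategory ? byCode.get(node.parentCategory) : undefined;
    if (parent) {
      parent.subcategories!.push(node);
    } else {
      roots.push(node);
    }
  }
  return roots;
}
```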
## Configuration

Configure the crawler behavior through `ProductCrawler` options; a combined example follows the list.

- `maxRequestsPerCrawl`: Limit the number of requests
- `proxyUrls`: Array of proxy URLs for rotation
- `startUrls`: Override default category URLs
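A short sketch combining these options (the proxy and start URLs are placeholders):

```typescript
import { ProductCrawler } from './crawlers/ProductCrawler.js';
import { NewWebsiteScraper } from './scrapers/NewWebsiteScraper.js';

const crawler = new ProductCrawler({
  scraper: new NewWebsiteScraper(),
  maxRequestsPerCrawl: 500,                                  // stop after 500 requests
  proxyUrls: ['http://proxy-1:8000', 'http://proxy-2:8000'], // placeholder proxies
  startUrls: ['https://example.com/categories'],             // placeholder start URL
});
await crawler.run();
```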
## License

ISC