This project is a Single-Domain Web Crawler integrated with Supabase. It crawls a website starting from a base URL and recursively fetches the linked URLs it discovers within that domain.
The crawler performs two main tasks:
- Store HTML snapshots: Saves the raw HTML of each page into Supabase Storage for later parsing by a Parser Worker running in a serverless function
- Discover new URLs: Extracts new links and enqueues them for recursive crawling via `crawlPage()`. Each discovered URL is saved into the Supabase DB, along with a direct reference to its raw HTML snapshot in blob storage (a minimal sketch of this flow follows the list).
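For orientation, here is a minimal sketch of one crawl step under these assumptions: the standard `@supabase/supabase-js` client, `jsdom` for link extraction, a Storage bucket named `html-snapshots`, and `url`/`snapshot_path` columns in `urls_metadata`. None of these names come from the project itself, and Node 18+ is assumed for the built-in `fetch`.

```js
// crawl-sketch.js: illustrative only; bucket and column names are assumptions.
import { createClient } from '@supabase/supabase-js';
import { JSDOM } from 'jsdom';
import { createHash } from 'node:crypto';

const supabase = createClient(
  process.env.SUPABASE_URL,
  process.env.SUPABASE_SERVICE_ROLE_KEY,
);
const BUCKET = 'html-snapshots'; // assumed bucket name

async function crawlPage(baseUrl, currentUrl, visited = new Set()) {
  const pageUrl = new URL(currentUrl, baseUrl).href;
  if (visited.has(pageUrl)) return;
  visited.add(pageUrl);

  // 1. Fetch the page and keep the raw HTML.
  const res = await fetch(pageUrl);
  if (!res.ok || !(res.headers.get('content-type') || '').includes('text/html')) return;
  const html = await res.text();

  // 2. Store the HTML snapshot in Supabase Storage, keyed by a hash of the URL.
  const snapshotPath = `${createHash('sha256').update(pageUrl).digest('hex')}.html`;
  await supabase.storage
    .from(BUCKET)
    .upload(snapshotPath, html, { contentType: 'text/html', upsert: true });

  // 3. Save the URL plus a direct reference to its snapshot in urls_metadata
  //    (column names are assumptions).
  await supabase.from('urls_metadata').insert({
    url: pageUrl,
    snapshot_path: snapshotPath,
    crawled_at: new Date().toISOString(),
  });

  // 4. Extract same-domain links and recurse into each of them.
  const { document } = new JSDOM(html).window;
  for (const anchor of document.querySelectorAll('a[href]')) {
    try {
      const link = new URL(anchor.getAttribute('href'), pageUrl);
      link.hash = ''; // treat #fragment variants as the same page
      if (link.hostname === new URL(baseUrl).hostname) {
        await crawlPage(baseUrl, link.href, visited);
      }
    } catch {
      // skip hrefs that cannot be resolved to a valid URL
    }
  }
}

crawlPage(process.env.BASE_URL, process.env.BASE_URL).catch(console.error);
```

A production crawler would also need politeness delays, a depth limit, and deduplication against URLs already recorded in the DB; those concerns are left out of the sketch.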
Tech stack:

- JavaScript (Node.js) - Runs the Web Crawler (a refactor to TypeScript is planned)
- PostgreSQL (via Supabase DB) - Persists URL metadata in the `urls_metadata` table
- AWS S3 (via Supabase Storage) - Stores the raw HTML snapshots, which are later parsed to plain text by the Parser Worker
- Serverless (via Supabase Edge Functions) - Runs the Parser Worker, which is responsible for text extraction (see the sketch after this list)
- Queues (RabbitMQ) - TO BE IMPLEMENTED
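Since the Parser Worker runs on Supabase Edge Functions (Deno), the sketch below shows one way such a function could pull a snapshot out of Storage and reduce it to plain text. The function path, request shape, bucket name, and the naive tag-stripping approach are all assumptions rather than the project's actual implementation.

```js
// supabase/functions/parser-worker/index.js (hypothetical path): sketch only.
import { createClient } from 'npm:@supabase/supabase-js@2';

Deno.serve(async (req) => {
  // Assumed request body: { "snapshotPath": "<path of the stored HTML file>" }
  const { snapshotPath } = await req.json();

  const supabase = createClient(
    Deno.env.get('SUPABASE_URL'),
    Deno.env.get('SUPABASE_SERVICE_ROLE_KEY'),
  );

  // Download the raw HTML snapshot from Storage (bucket name is an assumption).
  const { data, error } = await supabase.storage
    .from('html-snapshots')
    .download(snapshotPath);
  if (error) {
    return new Response(JSON.stringify({ error: error.message }), { status: 500 });
  }

  // Naive text extraction: drop scripts/styles, strip tags, collapse whitespace.
  const html = await data.text();
  const text = html
    .replace(/<script[\s\S]*?<\/script>/gi, ' ')
    .replace(/<style[\s\S]*?<\/style>/gi, ' ')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();

  return new Response(JSON.stringify({ snapshotPath, text }), {
    headers: { 'Content-Type': 'application/json' },
  });
});
```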
To run the crawler locally:

- Install all dependencies:
  `npm install`
- Put your Supabase variables into a `.env` file in the project root (`./`); see the required variable names in `config.js`. A hypothetical sketch of such a config file is shown after this list.
- Run the Crawler:
  `npm run start`
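The required variable names are defined in `config.js`, which is not reproduced here. Purely as an illustration, a `config.js` along these lines would load the `.env` file and fail fast when a variable is missing; the names `SUPABASE_URL`, `SUPABASE_SERVICE_ROLE_KEY`, and `BASE_URL` are hypothetical placeholders.

```js
// config.js (hypothetical sketch): the real file defines the actual variable names.
import 'dotenv/config'; // loads ./.env into process.env

// Assumed variable names; check the project's config.js for the real list.
const REQUIRED = ['SUPABASE_URL', 'SUPABASE_SERVICE_ROLE_KEY', 'BASE_URL'];

for (const name of REQUIRED) {
  if (!process.env[name]) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
}

export const config = {
  supabaseUrl: process.env.SUPABASE_URL,
  supabaseKey: process.env.SUPABASE_SERVICE_ROLE_KEY,
  baseUrl: process.env.BASE_URL,
};
```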
