This project is a Single-Domain Web Crawler integrated with Supabase. It crawls a website starting from a base URL and recursively fetches the linked URLs it discovers within that domain.
The crawler performs two main tasks:
- Store HTML snapshots: Saves the raw HTML of each page into Supabase Storage for later parsing by a Parser Worker running in a serverless function
- Discover new URLs: Extracts new links and enqueues them for recursive crawling via `crawlPage()`. Each discovered URL is saved into the Supabase DB, along with a direct reference to its raw HTML snapshot in blob storage (a minimal sketch of this flow follows the list).
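For orientation, here is a minimal sketch of one crawl step under these assumptions: the standard `@supabase/supabase-js` client, `jsdom` for link extraction, a Storage bucket named `html-snapshots`, and `url`/`snapshot_path` columns in `urls_metadata`. None of these names come from the project itself, and Node 18+ is assumed for the built-in `fetch`.

```js
// crawl-sketch.js: illustrative only; bucket and column names are assumptions.
import { createClient } from '@supabase/supabase-js';
import { JSDOM } from 'jsdom';
import { createHash } from 'node:crypto';

const supabase = createClient(
  process.env.SUPABASE_URL,
  process.env.SUPABASE_SERVICE_ROLE_KEY,
);
const BUCKET = 'html-snapshots'; // assumed bucket name

async function crawlPage(baseUrl, currentUrl, visited = new Set()) {
  const pageUrl = new URL(currentUrl, baseUrl).href;
  if (visited.has(pageUrl)) return;
  visited.add(pageUrl);

  // 1. Fetch the page and keep the raw HTML.
  const res = await fetch(pageUrl);
  if (!res.ok || !(res.headers.get('content-type') || '').includes('text/html')) return;
  const html = await res.text();

  // 2. Store the HTML snapshot in Supabase Storage, keyed by a hash of the URL.
  const snapshotPath = `${createHash('sha256').update(pageUrl).digest('hex')}.html`;
  await supabase.storage
    .from(BUCKET)
    .upload(snapshotPath, html, { contentType: 'text/html', upsert: true });

  // 3. Save the URL plus a direct reference to its snapshot in urls_metadata
  //    (column names are assumptions).
  await supabase.from('urls_metadata').insert({
    url: pageUrl,
    snapshot_path: snapshotPath,
    crawled_at: new Date().toISOString(),
  });

  // 4. Extract same-domain links and recurse into each of them.
  const { document } = new JSDOM(html).window;
  for (const anchor of document.querySelectorAll('a[href]')) {
    try {
      const link = new URL(anchor.getAttribute('href'), pageUrl);
      link.hash = ''; // treat #fragment variants as the same page
      if (link.hostname === new URL(baseUrl).hostname) {
        await crawlPage(baseUrl, link.href, visited);
      }
    } catch {
      // skip hrefs that cannot be resolved to a valid URL
    }
  }
}

crawlPage(process.env.BASE_URL, process.env.BASE_URL).catch(console.error);
```

A production crawler would also need politeness delays, a depth limit, and deduplication against URLs already recorded in the DB; those concerns are left out of the sketch.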
Tech stack:

- JavaScript (Node.js) - Runs the Web Crawler (a refactor to TypeScript is planned)
- PostgreSQL (via Supabase DB) - Persists URL metadata in the `urls_metadata` table
- AWS S3 (via Supabase Storage) - Stores the raw HTML snapshots, which are later parsed to plain text by the Parser Worker
- Serverless (via Supabase Edge Functions) - Runs the Parser Worker, which is responsible for text extraction (see the sketch after this list)
- Queues (RabbitMQ) - TO BE IMPLEMENTED
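Since the Parser Worker runs on Supabase Edge Functions (Deno), the sketch below shows one way such a function could pull a snapshot out of Storage and reduce it to plain text. The function path, request shape, bucket name, and the naive tag-stripping approach are all assumptions rather than the project's actual implementation.

```js
// supabase/functions/parser-worker/index.js (hypothetical path): sketch only.
import { createClient } from 'npm:@supabase/supabase-js@2';

Deno.serve(async (req) => {
  // Assumed request body: { "snapshotPath": "<path of the stored HTML file>" }
  const { snapshotPath } = await req.json();

  const supabase = createClient(
    Deno.env.get('SUPABASE_URL'),
    Deno.env.get('SUPABASE_SERVICE_ROLE_KEY'),
  );

  // Download the raw HTML snapshot from Storage (bucket name is an assumption).
  const { data, error } = await supabase.storage
    .from('html-snapshots')
    .download(snapshotPath);
  if (error) {
    return new Response(JSON.stringify({ error: error.message }), { status: 500 });
  }

  // Naive text extraction: drop scripts/styles, strip tags, collapse whitespace.
  const html = await data.text();
  const text = html
    .replace(/<script[\s\S]*?<\/script>/gi, ' ')
    .replace(/<style[\s\S]*?<\/style>/gi, ' ')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();

  return new Response(JSON.stringify({ snapshotPath, text }), {
    headers: { 'Content-Type': 'application/json' },
  });
});
```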
To run the crawler locally:

- Install all dependencies:
  `npm install`
- Put your Supabase variables into a `.env` file in the project root (`./`); see the required variable names in `config.js`. A hypothetical sketch of such a config file is shown after this list.
- Run the Crawler:
  `npm run start`
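The required variable names are defined in `config.js`, which is not reproduced here. Purely as an illustration, a `config.js` along these lines would load the `.env` file and fail fast when a variable is missing; the names `SUPABASE_URL`, `SUPABASE_SERVICE_ROLE_KEY`, and `BASE_URL` are hypothetical placeholders.

```js
// config.js (hypothetical sketch): the real file defines the actual variable names.
import 'dotenv/config'; // loads ./.env into process.env

// Assumed variable names; check the project's config.js for the real list.
const REQUIRED = ['SUPABASE_URL', 'SUPABASE_SERVICE_ROLE_KEY', 'BASE_URL'];

for (const name of REQUIRED) {
  if (!process.env[name]) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
}

export const config = {
  supabaseUrl: process.env.SUPABASE_URL,
  supabaseKey: process.env.SUPABASE_SERVICE_ROLE_KEY,
  baseUrl: process.env.BASE_URL,
};
```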
