This MVP project implements a scalable and distributed web crawler system designed for performance, reliability, and ease of management. The architecture leverages modern technologies to handle distributed processing, task queuing, and administration.
The system is built using:
- Celery for distributed task processing
- Redis for queue management
- Requests for fetching pages over HTTP during crawling
- Django as the administration interface
- MySQL for database management
- Docker Compose for environment orchestration
- Flower for monitoring Celery workers and tasks
Key features include:
- Comprehensive logging for debugging and system tracking
- Scalable architecture with distributed worker nodes
- Task queue management for efficient crawling workflows
- Admin interface for managing sites, monitoring queues, and launching crawlers
- Centralized logging for observability and troubleshooting
- Dockerized environment for simplified deployment and execution
- Real-time monitoring of worker queues and task execution via Flower
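To show how these components fit together, here is a minimal sketch of a Celery application wired to Redis as broker and result backend inside the Django project. The module name `crawler_project` and the `redis` hostname are illustrative assumptions, not names taken from this repository.

```python
# celery_app.py -- minimal sketch, assuming a Django project named "crawler_project"
# and a Redis service reachable at the hostname "redis" (e.g. the Compose service name).
import os

from celery import Celery

# Point Celery at the Django settings module so tasks can use the ORM.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "crawler_project.settings")

app = Celery(
    "crawler_project",
    broker="redis://redis:6379/0",   # Redis as the task queue
    backend="redis://redis:6379/1",  # Redis as the result backend
)

# Discover @shared_task functions declared in the installed Django apps.
app.autodiscover_tasks()
```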
The following diagram illustrates the complete system architecture:
The system operates with a distributed model to ensure performance and scalability. Tasks are distributed across worker nodes managed by Celery and monitored through Flower.
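To make the distributed model concrete, the sketch below shows what a crawl task might look like: a Celery task that fetches a page with Requests and writes the raw HTML to the download directory. The task name, output path, and retry policy are assumptions used only to illustrate the flow, not code from this repository.

```python
# tasks.py -- a hypothetical crawl task; names, paths, and retry settings are assumptions.
from pathlib import Path
from urllib.parse import urlparse

import requests
from celery import shared_task

OUTPUT_DIR = Path("/home/user/crawler")  # wherever the download directory is mounted

@shared_task(bind=True, max_retries=3, default_retry_delay=30)
def crawl_page(self, url):
    """Fetch a single page and store the raw HTML on disk."""
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
    except requests.RequestException as exc:
        # Re-queue the task with a delay instead of failing immediately.
        raise self.retry(exc=exc)

    parsed = urlparse(url)
    filename = (parsed.netloc + parsed.path).strip("/").replace("/", "_") or "index"
    target = OUTPUT_DIR / f"{filename}.html"
    target.write_text(response.text, encoding="utf-8")
    return str(target)
```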
Follow the instructions below to set up and run the crawler system on your local machine.
Ensure the following dependencies are installed:
- Docker and Docker Compose
- Python 3.8+
Create a directory to store files downloaded by the crawler:
mkdir /home/${USER}/crawler
Build and run the Docker containers:
docker-compose up --build -d
To access the administration interface, create a Django superuser:
- Identify the Django container ID:
docker ps
- Access the container shell and create the superuser:
docker exec -i -t <django_web_container_id> /bin/bash
python manage.py createsuperuser
Open your browser and navigate to:
http://localhost:8000
After logging into the admin interface:
- Navigate to the "Sites" tab in the sidebar.
- Create a new site entry (e.g., "Arezzo").
- The site name corresponds to the directory containing the extraction logic for that specific site (see the sketch below).
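The README does not prescribe a layout for that extraction code; as a purely hypothetical sketch, a site such as "Arezzo" might map to a module exposing an `extract` function applied to each fetched page. The package path and the `extract` signature below are assumptions, not a documented API of this project.

```python
# sites/arezzo/extractor.py -- hypothetical layout; package path and extract()
# signature are illustrative assumptions, not this repository's documented API.
from html.parser import HTMLParser

class _LinkCollector(HTMLParser):
    """Collect href attributes from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href" and value)

def extract(html: str) -> dict:
    """Pull site-specific fields out of a fetched page."""
    parser = _LinkCollector()
    parser.feed(html)
    return {"links": parser.links}
```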
Once the site is registered and its extraction logic is available, you can launch the crawler from the admin interface.
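Under the hood, launching a crawler presumably enqueues one or more Celery tasks; assuming a task like the hypothetical `crawl_page` sketched above, the call would look roughly like this:

```python
# Hypothetical launch: enqueue crawl tasks for a registered site.
# crawl_page is the illustrative task sketched earlier, not a documented API.
from tasks import crawl_page

for url in ["https://www.example.com/page-1", "https://www.example.com/page-2"]:
    result = crawl_page.delay(url)  # returns an AsyncResult immediately
    print(result.id)                # task id visible in Flower
```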
Monitor the distributed crawler system in real time:
- In the admin interface, click on "Go to Flower".
- Flower provides insights into:
  - Active workers
  - Task queues
  - Execution history
The system provides comprehensive logs for tracking crawler activities, errors, and performance. Logs are accessible within the containers or through configured log directories.
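The logging configuration itself is not shown in this README; the snippet below is a minimal sketch of a Django `LOGGING` setting writing to a shared log directory. The `/var/log/crawler` path, logger names, and levels are assumptions, not the project's actual configuration.

```python
# settings.py (excerpt) -- a minimal sketch of centralized logging; the log path,
# logger names, and levels are assumptions, not the project's actual configuration.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "verbose": {"format": "%(asctime)s %(levelname)s %(name)s %(message)s"},
    },
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "verbose"},
        "file": {
            "class": "logging.FileHandler",
            "filename": "/var/log/crawler/crawler.log",
            "formatter": "verbose",
        },
    },
    "loggers": {
        "crawler": {"handlers": ["console", "file"], "level": "INFO"},
    },
}
```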
To set up a local development environment:
- Clone the repository.
- Install dependencies:
pip install -r requirements.txt
- Start the development server:
python manage.py runserver