This MVP project implements a scalable and distributed web crawler system designed for performance, reliability, and ease of management. The architecture leverages modern technologies to handle distributed processing, task queuing, and administration.
The system is built using:
- Celery for distributed task processing
- Redis for queue management
- Requests for fetching pages over HTTP during crawling
- Django as the administration interface
- MySQL for database management
- Docker Compose for environment orchestration
- Flower for monitoring Celery workers and tasks
Key features include:
- Comprehensive logging for debugging and system tracking
- Scalable architecture with distributed worker nodes
- Task queue management for efficient crawling workflows
- Admin interface for managing sites, monitoring queues, and launching crawlers
- Centralized logging for observability and troubleshooting
- Dockerized environment for simplified deployment and execution
- Real-time monitoring of worker queues and task execution via Flower
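To show how these components fit together, here is a minimal sketch of a Celery application wired to Redis as broker and result backend inside the Django project. The module name `crawler_project` and the `redis` hostname are illustrative assumptions, not names taken from this repository.

```python
# celery_app.py -- minimal sketch, assuming a Django project named "crawler_project"
# and a Redis service reachable at the hostname "redis" (e.g. the Compose service name).
import os

from celery import Celery

# Point Celery at the Django settings module so tasks can use the ORM.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "crawler_project.settings")

app = Celery(
    "crawler_project",
    broker="redis://redis:6379/0",   # Redis as the task queue
    backend="redis://redis:6379/1",  # Redis as the result backend
)

# Discover @shared_task functions declared in the installed Django apps.
app.autodiscover_tasks()
```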
The following diagram illustrates the complete system architecture:
The system operates with a distributed model to ensure performance and scalability. Tasks are distributed across worker nodes managed by Celery and monitored through Flower.
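To make the distributed model concrete, the sketch below shows what a crawl task might look like: a Celery task that fetches a page with Requests and writes the raw HTML to the download directory. The task name, output path, and retry policy are assumptions used only to illustrate the flow, not code from this repository.

```python
# tasks.py -- a hypothetical crawl task; names, paths, and retry settings are assumptions.
from pathlib import Path
from urllib.parse import urlparse

import requests
from celery import shared_task

OUTPUT_DIR = Path("/home/user/crawler")  # wherever the download directory is mounted

@shared_task(bind=True, max_retries=3, default_retry_delay=30)
def crawl_page(self, url):
    """Fetch a single page and store the raw HTML on disk."""
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
    except requests.RequestException as exc:
        # Re-queue the task with a delay instead of failing immediately.
        raise self.retry(exc=exc)

    parsed = urlparse(url)
    filename = (parsed.netloc + parsed.path).strip("/").replace("/", "_") or "index"
    target = OUTPUT_DIR / f"{filename}.html"
    target.write_text(response.text, encoding="utf-8")
    return str(target)
```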
Follow the instructions below to set up and run the crawler system on your local machine.
Ensure the following dependencies are installed:
- Docker and Docker Compose
- Python 3.8+
Create a directory to store files downloaded by the crawler:
mkdir /home/${USER}/crawler
Build and run the Docker containers:
docker-compose up --build -d
To access the administration interface, create a Django superuser:
- Identify the Django container ID:
docker ps
- Access the container shell and create the superuser:
docker exec -i -t <django_web_container_id> /bin/bash
python manage.py createsuperuser
Open your browser and navigate to:
http://localhost:8000
After logging into the admin interface:
- Navigate to the "Sites" tab in the sidebar.
- Create a new site entry (e.g., "Arezzo").
- The site name corresponds to the directory containing the extraction logic for that specific site (see the sketch below).
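The README does not prescribe a layout for that extraction code; as a purely hypothetical sketch, a site such as "Arezzo" might map to a module exposing an `extract` function applied to each fetched page. The package path and the `extract` signature below are assumptions, not a documented API of this project.

```python
# sites/arezzo/extractor.py -- hypothetical layout; package path and extract()
# signature are illustrative assumptions, not this repository's documented API.
from html.parser import HTMLParser

class _LinkCollector(HTMLParser):
    """Collect href attributes from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href" and value)

def extract(html: str) -> dict:
    """Pull site-specific fields out of a fetched page."""
    parser = _LinkCollector()
    parser.feed(html)
    return {"links": parser.links}
```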
Once the site is registered and its extraction logic is available, you can launch the crawler from the admin interface.
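Under the hood, launching a crawler presumably enqueues one or more Celery tasks; assuming a task like the hypothetical `crawl_page` sketched above, the call would look roughly like this:

```python
# Hypothetical launch: enqueue crawl tasks for a registered site.
# crawl_page is the illustrative task sketched earlier, not a documented API.
from tasks import crawl_page

for url in ["https://www.example.com/page-1", "https://www.example.com/page-2"]:
    result = crawl_page.delay(url)  # returns an AsyncResult immediately
    print(result.id)                # task id visible in Flower
```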
Monitor the distributed crawler system in real time:
- In the admin interface, click on "Go to Flower".
- Flower provides insights into:
  - Active workers
  - Task queues
  - Execution history
The system provides comprehensive logs for tracking crawler activities, errors, and performance. Logs are accessible within the containers or through configured log directories.
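The logging configuration itself is not shown in this README; the snippet below is a minimal sketch of a Django `LOGGING` setting writing to a shared log directory. The `/var/log/crawler` path, logger names, and levels are assumptions, not the project's actual configuration.

```python
# settings.py (excerpt) -- a minimal sketch of centralized logging; the log path,
# logger names, and levels are assumptions, not the project's actual configuration.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "verbose": {"format": "%(asctime)s %(levelname)s %(name)s %(message)s"},
    },
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "verbose"},
        "file": {
            "class": "logging.FileHandler",
            "filename": "/var/log/crawler/crawler.log",
            "formatter": "verbose",
        },
    },
    "loggers": {
        "crawler": {"handlers": ["console", "file"], "level": "INFO"},
    },
}
```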
To set up a local development environment:
- Clone the repository.
- Install dependencies:
pip install -r requirements.txt
- Start the development server:
python manage.py runserver