This project has moved to https://github.com/watercrawl/watercrawl, where it continues as part of a monorepo.
This project is now archived.
A self-hosted version of WaterCrawl, a powerful web crawling and data extraction platform.
- Docker Engine 24.0.0+
- Docker Compose v2.0.0+
- At least 2GB of RAM
- 10GB of free disk space
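The version and disk requirements above can be checked before installing. The following is a minimal sketch and not part of the repository: `version_ge` and `check_prereqs` are hypothetical helper names, and the comparison assumes a `sort` that supports `-V` (GNU coreutils or a recent BusyBox). The RAM check is left out because reading total memory differs between operating systems.

```shell
# version_ge A B: succeeds when version A is greater than or equal to version B.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_prereqs: compares the local Docker setup and free disk space
# against the minimums listed in this README.
check_prereqs() {
  local engine compose free_kb
  engine="$(docker version --format '{{.Server.Version}}' 2>/dev/null)" || {
    echo "Docker Engine is not installed or not running" >&2; return 1; }
  compose="$(docker compose version --short 2>/dev/null)" || {
    echo "Docker Compose v2 plugin not found" >&2; return 1; }
  version_ge "$engine" 24.0.0 || {
    echo "Docker Engine $engine is older than 24.0.0" >&2; return 1; }
  version_ge "${compose#v}" 2.0.0 || {
    echo "Docker Compose $compose is older than 2.0.0" >&2; return 1; }
  # Available space on the current filesystem, in 1K blocks (10GB = 10485760).
  free_kb="$(df -Pk . | awk 'NR == 2 { print $4 }')"
  [ "$free_kb" -ge 10485760 ] || {
    echo "less than 10GB free in $(pwd)" >&2; return 1; }
  echo "prerequisites look fine"
}
```

Running `check_prereqs` from the directory where the repository will be cloned reports the first requirement that is not met.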
- Clone the repository:
git clone https://github.com/watercrawl/self-hosted.git
cd self-hosted
- Run the installation script:
chmod +x install.sh
./install.sh
- Start the services:
docker compose up -d
- Create an admin user:
chmod +x create_admin.sh
./create_admin.sh
- Access the application at
http://localhost:8000
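The containers may need a little time after `docker compose up -d` before the application answers. As a rough sketch, a small polling helper can wait for it. `wait_for_url` is a hypothetical name, `curl` is assumed to be installed, and the check assumes the root URL answers with a success or redirect status once the app is ready.

```shell
# wait_for_url URL [TRIES] [DELAY]: poll URL until it answers without an
# HTTP error, giving up after TRIES attempts spaced DELAY seconds apart.
wait_for_url() {
  local url="$1" tries="${2:-30}" delay="${3:-2}" i=1
  while [ "$i" -le "$tries" ]; do
    # -f makes curl fail on HTTP status codes of 400 and above.
    if curl -fsS -o /dev/null --max-time 5 "$url" 2>/dev/null; then
      echo "responding after $i attempt(s): $url"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "no response from $url after $tries attempt(s)" >&2
  return 1
}
```

For example, `docker compose up -d && wait_for_url http://localhost:8000` returns once the application responds, or fails after about a minute with the default settings.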
The application uses several environment files for configuration:
app.env contains the main application configuration, including:
- Django settings
- Database configuration
- JWT settings
- Celery configuration
- MinIO settings
- CORS settings
- Plugin configuration
db.env contains the PostgreSQL database configuration:
- Database credentials
- Database name
minio.env contains the MinIO object storage configuration:
- Access credentials
- Server URLs
- Browser configuration
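Before starting the services it can help to confirm that the environment files are in place. A minimal sketch, assuming the file names shown in this README's project layout (`app.env`, `db.env`, `minio.env`); `missing_env_files` is a hypothetical helper, not something the repository ships.

```shell
# missing_env_files [DIR]: print the name of each expected environment file
# that does not exist in DIR (defaults to the current directory).
missing_env_files() {
  local dir="${1:-.}" name
  for name in app.env db.env minio.env; do
    [ -f "$dir/$name" ] || echo "$name"
  done
}
```

If `missing_env_files .` prints anything, the matching `.example` file shows what the missing file is expected to contain, and rerunning `install.sh` is probably the simplest way to regenerate it.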
The application consists of several services:
- App: Main Django application
- Celery: Background task processing
- PostgreSQL: Primary database
- Redis: Message broker and caching
- MinIO: Object storage for files and media
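To see at a glance whether all of these services are up, the list of running Compose services can be compared with the expected one. This is only a sketch: apart from `postgres`, which appears in the troubleshooting command in this README, the service names below are guesses and should be replaced with the names defined in `docker-compose.yml`. The `--status` filter also needs a reasonably recent Docker Compose v2.

```shell
# report_missing EXPECTED RUNNING: print every word in EXPECTED that does
# not appear as a whole line in RUNNING.
report_missing() {
  local expected="$1" running="$2" svc
  for svc in $expected; do
    printf '%s\n' "$running" | grep -qxF -- "$svc" || echo "$svc"
  done
}

# check_services: list the expected services that are not currently running.
# The names passed to report_missing are assumptions (see the note above).
check_services() {
  local running
  running="$(docker compose ps --services --status running)" || return 1
  report_missing "app celery postgres redis minio" "$running"
}
```

`check_services` prints nothing when every expected service is running. Otherwise `docker compose logs` for a listed service is the next place to look.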
watercrawl_self_hosted/
├── app.env # Application environment variables
├── app.env.example # Example application environment file
├── create_admin.sh # Script to create admin user
├── db.env # Database environment variables
├── db.env.example # Example database environment file
├── docker-compose.yml # Docker Compose configuration
├── install.sh # Installation script
├── minio.env # MinIO environment variables
├── minio.env.example # Example MinIO environment file
└── watercrawl/ # Application source code
    ├── Dockerfile # Application Dockerfile
    └── plugin_requirements.txt # Plugin dependencies
The installation script (install.sh) supports the following options:
- --reinstall: Backs up existing environment files and performs a fresh installation
- Interactive prompts for:
  - Website Domain
  - Storage Domain
  - Website Protocol (http/https)
  - Storage Protocol (http/https)
Environment files are automatically backed up when using the --reinstall option. Backups are stored in:
.config_backups/DELETE_DATA/
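The environment files can also be copied by hand before editing them. The sketch below is a manual equivalent under assumed names, not a description of what `install.sh` does internally: `backup_env_files` is hypothetical, and the destination directory is chosen by the caller.

```shell
# backup_env_files SRC DEST: copy the environment files that exist in SRC
# into DEST (created if needed) and print how many files were copied.
backup_env_files() {
  local src="${1:-.}" dest="$2" name copied=0
  [ -n "$dest" ] || { echo "usage: backup_env_files SRC DEST" >&2; return 1; }
  mkdir -p "$dest" || return 1
  for name in app.env db.env minio.env; do
    if [ -f "$src/$name" ]; then
      # -p keeps the original modification time and permissions.
      cp -p "$src/$name" "$dest/$name" && copied=$((copied + 1))
    fi
  done
  echo "$copied"
}
```

For example, `backup_env_files . "env-backup-$(date +%Y%m%d-%H%M%S)"` keeps a timestamped copy next to the originals.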
- Services not starting: Check logs with
docker compose logs [service_name]
- Database connection issues: Verify the db.env configuration and ensure PostgreSQL is running
docker compose ps postgres
- Storage issues: Check that the MinIO credentials in minio.env and app.env match
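The credential comparison for storage issues can be scripted. This is a sketch under assumptions: the key names used in the example (`MINIO_ROOT_USER`, `MINIO_ACCESS_KEY`) are placeholders and should be replaced with the keys that actually appear in `minio.env` and `app.env`. The helpers only handle plain `KEY=VALUE` lines without quotes.

```shell
# env_value FILE KEY: print the value assigned to KEY in a KEY=VALUE file.
# If KEY is assigned more than once, the last assignment wins.
env_value() {
  sed -n "s/^$2=//p" "$1" | tail -n 1
}

# env_values_match FILE1 KEY1 FILE2 KEY2: succeed only when both keys are
# set to the same non-empty value.
env_values_match() {
  local a b
  a="$(env_value "$1" "$2")"
  b="$(env_value "$3" "$4")"
  [ -n "$a" ] && [ "$a" = "$b" ]
}
```

A check such as `env_values_match minio.env MINIO_ROOT_USER app.env MINIO_ACCESS_KEY || echo "access key mismatch"` then flags a mismatch without printing the secret itself.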
- Change default credentials in production:
  - Database password
  - MinIO access keys
  - Django secret key
  - Admin user password
- Enable HTTPS in production by setting:
MINIO_USE_HTTPS=True
MINIO_EXTERNAL_ENDPOINT_USE_HTTPS=True
- Configure proper SSL certificates
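When replacing the default credentials, random values are safer than hand-typed ones. A minimal sketch for generating them, assuming a system with `/dev/urandom`; `gen_secret` is a hypothetical helper, and the generated value still has to be pasted into the appropriate key in the environment files.

```shell
# gen_secret [BYTES]: print BYTES random bytes (default 32) as lowercase hex,
# so the output is twice as many characters as BYTES.
gen_secret() {
  local bytes="${1:-32}"
  head -c "$bytes" /dev/urandom | od -An -tx1 | tr -d ' \n'
  echo
}
```

For example, `gen_secret 32` yields a 64-character value that could serve as a database password or a Django secret key. Hex output avoids characters that would need quoting in an env file.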
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under a modified MIT License. Key points:
- You can freely use, modify, and distribute this software
- You must include the original copyright notice and license
- Important Restrictions:
  - The software may not be used to run a service similar to watercrawl.dev or any commercial service without explicit permission
- Contributions are welcome through pull requests for community usage and development
See the LICENSE file for the complete terms.