This Python-based program allows you to recursively crawl a given website and download files with specific extensions, preserving the original folder structure. It's designed for users who want to easily retrieve media, documents, or any specific file types from a domain without dealing with manual scraping or downloading.
- Just provide the starting URL, and the program handles the rest.
- Supports both GUI (via `.exe`) and CLI usage (via Python).
- Automatically discovers all linked pages under the provided URL.
- Extracts downloadable files from each visited page.
- Prevents "backward crawling" (optional): You can stop the crawler from visiting upper-level directories.
  - Example: Given `https://example.com/folder/subfolder/`, it won’t crawl `https://example.com/folder/` when backward crawling is disabled (see the scope-check sketch below).
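A minimal sketch of how such a scope check can work, using only the standard library (the `is_within_scope` helper is illustrative, not the project's actual API):

```python
from urllib.parse import urlparse

def is_within_scope(url: str, base_url: str) -> bool:
    """Return True if `url` lives at or below the base URL's path."""
    base, target = urlparse(base_url), urlparse(url)
    if target.netloc != base.netloc:
        return False  # a different host is always out of scope
    return target.path.startswith(base.path)

# With backward crawling disabled, the parent folder is skipped:
print(is_within_scope("https://example.com/folder/",
                      "https://example.com/folder/subfolder/"))   # False
print(is_within_scope("https://example.com/folder/subfolder/page.html",
                      "https://example.com/folder/subfolder/"))   # True
```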
- Choose which types of files to download (a filtering sketch follows this list):
  - Images (`.jpg`, `.jpeg`, `.png`, `.gif`, `.webp`, `.svg`, etc.)
  - Videos (`.mp4`, `.webm`, `.avi`, `.mov`, etc.)
  - Documents (`.pdf`, `.docx`, `.pptx`, `.xlsx`, etc.)
  - Audio files (`.mp3`, `.wav`, `.ogg`, etc.)
  - Or define your own custom extensions.
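One way this extension filtering might be implemented; the preset groups and the `wants_file` helper below are assumptions for illustration, not the program's actual names:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Illustrative preset groups; the real program may define different sets.
EXTENSION_PRESETS = {
    "images": {".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg"},
    "videos": {".mp4", ".webm", ".avi", ".mov"},
    "documents": {".pdf", ".docx", ".pptx", ".xlsx"},
    "audio": {".mp3", ".wav", ".ogg"},
}

def wants_file(url: str, allowed: set[str]) -> bool:
    """Compare the URL path's suffix against the chosen extensions."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return suffix in allowed

allowed = EXTENSION_PRESETS["images"] | {".ico"}  # a preset plus a custom extension
print(wants_file("https://example.com/a/b/photo.JPG", allowed))  # True
```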
- Files are saved in the same relative path as on the server (see the mapping sketch below).
  - Example: `https://example.com/construction-updates/admin/projects/2020/04/xxx-scaled.jpg` → `/example.com/construction-updates/admin/projects/2020/04/xxx-scaled.jpg`
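A small sketch of that URL-to-path mapping (the `local_path_for` helper is hypothetical):

```python
from pathlib import Path
from urllib.parse import urlparse

def local_path_for(url: str, root: Path = Path(".")) -> Path:
    """Mirror the server's directory layout under `root`."""
    parts = urlparse(url)
    return root / parts.netloc / parts.path.lstrip("/")

path = local_path_for("https://example.com/construction-updates/admin/projects/2020/04/xxx-scaled.jpg")
path.parent.mkdir(parents=True, exist_ok=True)  # recreate the folders before saving
print(path)  # example.com/construction-updates/admin/projects/2020/04/xxx-scaled.jpg
```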
- A detailed summary is displayed after the crawl is completed or stopped manually.
- Includes:
  - Total pages discovered
  - Total files downloaded
  - Number of each file type
  - Files that failed to download
  - Files larger than 10 MB
- Reports are stored in a SQLite database for filtering and future reference (a hypothetical schema sketch follows).
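As a rough sketch, such a report database could look like the following; the table name and columns are assumptions, not the program's actual schema:

```python
import sqlite3

conn = sqlite3.connect("crawl_report.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS downloads (
        url        TEXT PRIMARY KEY,
        local_path TEXT,
        file_type  TEXT,
        size_bytes INTEGER,
        status     TEXT          -- e.g. 'ok' or 'failed'
    )
""")
conn.commit()

# Filtering the stored report later, e.g. all files larger than 10 MB:
big_files = conn.execute(
    "SELECT url, size_bytes FROM downloads WHERE size_bytes > ?",
    (10 * 1024 * 1024,),
).fetchall()
conn.close()
```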
- A full graphical interface is now available.
- Select URL, choose file types, and start/stop crawling with buttons.
- See real-time status and progress in the window.
- You can clone the repository and modify the script as needed.
- Clean and well-organized codebase for easy customization.
- Download the `.exe` file from the Releases section.
- Run it (no installation required).
- Provide the URL, select file types, and click Start Download.
- Results will be saved locally with full directory structure.
- View reports in the built-in GUI or from the saved database.
⚠️ Make sure to allow the program through your antivirus/firewall if prompted.
```bash
git clone https://github.com/aiproje/WebHuntDownloader.git
cd WebHuntDownloader
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
python main.py --gui
```

Adjust the filename if your entry point differs.
- The program obeys `robots.txt` by default (if implemented).
- Supports both depth-first and breadth-first crawling (configurable); see the frontier sketch after these notes.
- Handles both relative and absolute URLs.
- Skips already downloaded files using cache and logs.
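The configurable crawl order comes down to how the URL frontier is consumed. The sketch below is not taken from this codebase; it shows both orders with a `collections.deque`, a visited set for skipping repeats, and a standard-library `robots.txt` check:

```python
from collections import deque
from urllib import robotparser

start = "https://example.com/folder/subfolder/"
visited: set[str] = set()   # pages we've already handled
frontier = deque([start])

rp = robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

breadth_first = True  # flip to False for depth-first order

while frontier:
    # popleft() gives FIFO (breadth-first); pop() gives LIFO (depth-first)
    url = frontier.popleft() if breadth_first else frontier.pop()
    if url in visited or not rp.can_fetch("*", url):
        continue
    visited.add(url)
    # ... fetch the page, extract file and page links, then:
    # frontier.extend(discovered_page_links)
```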
Please open an issue in the Issues section.
Include:
- The URL you used
- Any error messages
- What you expected vs what happened
This project is open source and available under the MIT License.
We welcome all contributions! Feel free to fork the repo and submit pull requests for:
- Bug fixes
- New features
- UI/UX improvements
- Performance enhancements
Made with ❤️ by AIPROJE
