A high-performance tool to process image URLs from SQL dump files: it validates URLs, logs broken links, and downloads valid images efficiently.
- URL Extraction: Reads the SQL dump file and uses regex patterns to extract image URLs.
- Validation: Checks each URL by sending HTTP requests to see if the image exists and is valid.
- Download: Downloads valid images to a specified output folder using either async or threaded mode.
- Logging: Logs failed URLs to `failed_downloads.log` and `failed_downloads.csv`.
- Summary: Generates a summary of failed downloads in `failed_downloads_summary.txt`.
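The extraction step can be sketched as a streaming regex scan over the dump file. This is illustrative only — the actual pattern the tool uses may differ; the regex and function name here are assumptions:

```python
import re

# Illustrative pattern: http(s) URLs ending in a common image extension.
# The tool's real regex may be broader or stricter.
IMAGE_URL_RE = re.compile(
    r"https?://[^\s'\"\\,)\]]+\.(?:jpe?g|png|gif|webp)", re.IGNORECASE
)

def extract_image_urls(sql_dump_path):
    """Stream the SQL dump line by line and yield unique image URLs."""
    seen = set()
    with open(sql_dump_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            for url in IMAGE_URL_RE.findall(line):
                if url not in seen:
                    seen.add(url)
                    yield url
```

Streaming line by line keeps memory flat even for multi-gigabyte dumps.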
To get the image URLs from your database, you can run a query like:
```sql
SELECT media FROM chat_messages
WHERE media IS NOT NULL
  AND media <> ''
  AND media <> '[]';
```

Then export the results to a `.sql` file, which can be used as input for this tool.
- Extracts image URLs from large SQL dump files efficiently using threading
- Validates URLs and downloads valid images to a specified folder
- Logs failed URLs to `failed_downloads.log` and `failed_downloads.csv`
- Generates a summary of failed URLs by error type (`failed_downloads_summary.txt`)
- Async mode for maximum speed with many concurrent workers
- Threaded mode for better compatibility with large files or certain environments
- Resume capability for already downloaded images
- Progress tracking with `processed/total [elapsed<remaining, speed]`
- Configurable retry mechanisms and rate limiting
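The validation step boils down to classifying each HTTP response. A pure-function sketch of that logic is below; the category names mirror the summary output, but the size threshold and function name are assumptions, not the tool's exact implementation:

```python
def classify_response(status_code, content_type, content_length):
    """Map an HTTP response to a failure category, or "OK" if the URL
    points at a plausible image. Illustrative; thresholds are assumed."""
    if status_code == 404:
        return "404 Not Found"
    if status_code == 403:
        return "403 Forbidden"
    if status_code != 200:
        return f"HTTP {status_code}"
    if not content_type.startswith("image/"):
        return "Non-image content"
    if content_length is not None and content_length < 100:  # assumed minimum
        return "Suspiciously small file"
    return "OK"
```

Keeping classification separate from the network call makes it easy to unit-test without any HTTP traffic.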
- Python 3.7+
- pip package manager
```shell
pip install -r requirements.txt
# or individually
pip install requests tqdm urllib3 aiohttp aiofiles
```

```shell
# Async mode (recommended - fastest)
python db_url_checker.py database.sql

# Threaded mode
python db_url_checker.py database.sql --threaded

# Custom output folder and workers
python db_url_checker.py database.sql -o my_images -a 100 -t 20

# Threaded mode with custom workers
python db_url_checker.py database.sql --threaded -t 15 -o downloads
```

| Option | Description | Default |
|---|---|---|
| sql_file | Path to SQL dump file (required) | - |
| -o, --output | Output folder for downloaded images | downloaded_images |
| -a, --async-workers | Number of concurrent async workers | 50 |
| -t, --thread-workers | Number of concurrent thread workers | 10 |
| --threaded | Use threaded mode instead of async | False |
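The options table maps onto a straightforward `argparse` definition. This is a sketch reconstructing the interface from the table above, not the script's actual source:

```python
import argparse

def build_parser():
    """CLI mirroring the options table above (illustrative sketch)."""
    p = argparse.ArgumentParser(prog="db_url_checker.py")
    p.add_argument("sql_file", help="Path to SQL dump file")
    p.add_argument("-o", "--output", default="downloaded_images",
                   help="Output folder for downloaded images")
    p.add_argument("-a", "--async-workers", type=int, default=50,
                   help="Number of concurrent async workers")
    p.add_argument("-t", "--thread-workers", type=int, default=10,
                   help="Number of concurrent thread workers")
    p.add_argument("--threaded", action="store_true",
                   help="Use threaded mode instead of async")
    return p
```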
- `my_images/` or custom output folder: Successfully downloaded images
- `image_downloader_production.log`: Main execution log
- `failed_downloads.log`: Text log of failed URLs
- `failed_downloads.csv`: Structured CSV of failures
- `failed_downloads_summary.txt`: Error breakdown summary
`failed_downloads.log`:

```
[2024-01-15T10:30:45.123456] 404 Not Found (HTTP 404): https://example.com/missing.jpg
```
`failed_downloads.csv`:

```
timestamp,url,error_type,status_code,details
2024-01-15T10:30:45.123456,https://example.com/missing.jpg,404 Not Found,404,URL not found
```
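Rows in that shape can be produced with the standard-library `csv` module. A sketch (the helper name is hypothetical; the tool's internals may differ):

```python
import csv
import datetime
import os

def log_failure(csv_path, url, error_type, status_code, details):
    """Append one failure row in the CSV format shown above,
    writing the header on first use."""
    new_file = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        if new_file:
            writer.writerow(["timestamp", "url", "error_type",
                             "status_code", "details"])
        writer.writerow([datetime.datetime.now().isoformat(),
                         url, error_type, status_code, details])
```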
`failed_downloads_summary.txt`:

```
FAILED DOWNLOADS SUMMARY
==============================
404 Not Found: 75 (50.0%)
Non-image content: 50 (33.3%)
Network Error: 20 (13.3%)
403 Forbidden: 5 (3.3%)
```
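A breakdown like that is a one-liner with `collections.Counter`. A sketch of how the summary could be rendered (function name assumed):

```python
from collections import Counter

def summarize_failures(error_types):
    """Render an error breakdown in the summary format shown above,
    most frequent category first."""
    counts = Counter(error_types)
    total = sum(counts.values())
    lines = ["FAILED DOWNLOADS SUMMARY", "=" * 30]
    for error, n in counts.most_common():
        lines.append(f"{error}: {n} ({n / total * 100:.1f}%)")
    return "\n".join(lines)
```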
| Scenario | Recommendation |
|---|---|
| Many small images | High async worker count (-a 100-200) |
| Large images | Threaded mode with moderate workers (--threaded -t 10-20) |
| Mixed content | Default async mode with 50 workers |
| Limited bandwidth | Reduce workers to avoid rate limiting |
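The worker-count knobs in the table come down to bounding in-flight concurrency. In async mode the usual pattern is a semaphore, sketched here in a generic form (not the tool's actual code):

```python
import asyncio

async def bounded_gather(coros, limit):
    """Run coroutines with at most `limit` in flight at once — the idea
    behind the -a/--async-workers setting."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))
```

Lowering `limit` trades throughput for gentler bandwidth use, which is why the table suggests fewer workers on constrained connections.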
- Progress displayed as `processed/total [elapsed<remaining, speed]`, e.g., `37032/40112 [22:25<01:20, 38.07image/s]`
- Completion summary: total URLs, successful/failed, success rate, time, and speed
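That display is the default `tqdm` bar format. A sketch of wrapping the download loop in a bar (the loop body and function names are hypothetical):

```python
from tqdm import tqdm

def download_all(urls, download_one):
    """Iterate over URLs with a tqdm bar, yielding the
    processed/total [elapsed<remaining, speed] display, and
    tally successes vs. failures for the completion summary."""
    ok = failed = 0
    for url in tqdm(urls, unit="image"):
        if download_one(url):
            ok += 1
        else:
            failed += 1
    return ok, failed
```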
- 404 Not Found
- 403 Forbidden
- Network errors (timeouts, DNS failures)
- Non-image content
- Other HTTP errors (4xx/5xx)
- Suspiciously small files
- Check log files for detailed errors
- Adjust worker counts for memory or bandwidth limitations