Skip to content

RahulPalXDA/DB-URL-Checker

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 

Repository files navigation

DB URL Checker

A high-performance tool to process image URLs from SQL dump files: it validates URLs, logs broken links, and downloads valid images efficiently.

How it Works

  1. URL Extraction: Reads the SQL dump file and uses regex patterns to extract image URLs.
  2. Validation: Checks each URL by sending HTTP requests to see if the image exists and is valid.
  3. Download: Downloads valid images to a specified output folder using either async or threaded mode.
  4. Logging: Logs failed URLs to failed_downloads.log and failed_downloads.csv.
  5. Summary: Generates a summary of failed downloads in failed_downloads_summary.txt.

Getting SQL Dump

To get the image URLs from your database, you can run a query like:

SELECT media FROM chat_messages 
WHERE media IS NOT NULL 
AND media <> '' 
AND media <> '[]';

Then export the results to a .sql file, which can be used as input for this tool.

Features

  • Extracts image URLs from large SQL dump files efficiently using threading
  • Validates URLs and downloads valid images to a specified folder
  • Logs failed URLs to failed_downloads.log and failed_downloads.csv
  • Generates a summary of failed URLs by error type (failed_downloads_summary.txt)
  • Async mode for maximum speed with many concurrent workers
  • Threaded mode for better compatibility with large files or certain environments
  • Resume capability for already downloaded images
  • Progress tracking with processed/total [elapsed<remaining, speed]
  • Configurable retry mechanisms and rate limiting

Installation

Prerequisites

  • Python 3.7+
  • pip package manager

Install Dependencies

pip install -r requirements.txt

# or individually

pip install requests tqdm urllib3 aiohttp aiofiles

Usage

Basic Usage

# Async mode (recommended - fastest)
python db_url_checker.py database.sql

# Threaded mode
python db_url_checker.py database.sql --threaded

Advanced Usage

# Custom output folder and workers
python db_url_checker.py database.sql -o my_images -a 100 -t 20

# Threaded mode with custom workers
python db_url_checker.py database.sql --threaded -t 15 -o downloads

Command Line Options

Option Description Default
sql_file Path to SQL dump file (required) -
-o, --output Output folder for downloaded images downloaded_images
-a, --async-workers Number of concurrent async workers 50
-t, --thread-workers Number of concurrent thread workers 10
--threaded Use threaded mode instead of async False

Output Files

  • my_images/ or custom output folder: Successfully downloaded images
  • image_downloader_production.log: Main execution log
  • failed_downloads.log: Text log of failed URLs
  • failed_downloads.csv: Structured CSV of failures
  • failed_downloads_summary.txt: Error breakdown summary

Log Structure

failed_downloads.log:
[2024-01-15T10:30:45.123456] 404 Not Found (HTTP 404): https://example.com/missing.jpg

failed_downloads.csv:
timestamp,url,error_type,status_code,details
2024-01-15T10:30:45.123456,https://example.com/missing.jpg,404 Not Found,404,URL not found

failed_downloads_summary.txt:
FAILED DOWNLOADS SUMMARY
==============================
404 Not Found: 75 (50.0%)
Non-image content: 50 (33.3%)
Network Error: 20 (13.3%)
403 Forbidden: 5 (3.3%)

Performance Tips

Scenario Recommendation
Many small images High async worker count (-a 100-200)
Large images Threaded mode with moderate workers (--threaded -t 10-20)
Mixed content Default async mode with 50 workers
Limited bandwidth Reduce workers to avoid rate limiting

Monitoring

  • Progress displayed as processed/total [elapsed<remaining, speed], e.g., 37032/40112 [22:25<01:20, 38.07image/s]
  • Completion summary: total URLs, successful/failed, success rate, time, and speed

Error Types Tracked

  • 404 Not Found
  • 403 Forbidden
  • Network errors (timeouts, DNS failures)
  • Non-image content
  • Other HTTP errors (4xx/5xx)
  • Suspiciously small files

Support

  • Check log files for detailed errors
  • Adjust worker counts for memory or bandwidth limitations

About

DB URL Checker – Quickly validate image URLs from SQL dumps and download the working ones.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages