A powerful Python tool for scanning websites to identify broken links, dead URLs, and accessibility issues. Perfect for website maintenance, SEO audits, and quality assurance.
- Comprehensive Link Scanning: Crawls websites recursively to find all links
- Multiple Export Formats: Export results to JSON and CSV formats
- Configurable Depth: Control how deep the crawler goes into your site
- Domain Filtering: Option to scan only same-domain links or include external links
- Rate Limiting: Built-in delays to be respectful to target servers
- Detailed Reporting: Categorizes links as working, broken, or error states
- Real-time Progress: Live updates during scanning with emoji indicators
- Flexible Configuration: Customizable via command-line arguments
- Python 3.7+
- Required packages:
  - `requests` - HTTP library for making web requests
  - `beautifulsoup4` - HTML parsing library
  - `lxml` - XML and HTML parser (recommended for BeautifulSoup)
- Clone or download the script:

  wget https://raw.githubusercontent.com/rakeshf/broken_link_checker.py
  # or
  curl -O https://raw.githubusercontent.com/your-repo/broken_link_checker.py

- Install dependencies:

  pip install requests beautifulsoup4 lxml

- Make it executable (optional):

  chmod +x broken_link_checker.py
Usage:

python broken_link_checker.py <website_url>

| Option | Description | Default |
|---|---|---|
| `--max-urls <number>` | Maximum URLs to scan | 100 |
| `--max-depth <number>` | Maximum crawl depth | 2 |
| `--delay <seconds>` | Delay between requests (be respectful!) | 1.0 |
| `--external` | Include external links (default: same domain only) | False |
| `--json <filename>` | Save results to JSON file | None |
| `--csv <filename>` | Save results to CSV file | None |
Basic website scan:
python broken_link_checker.py https://example.com

Comprehensive scan with custom settings:
python broken_link_checker.py https://example.com \
--max-urls 500 \
--max-depth 3 \
--delay 2 \
--external

Export results to files:
# JSON export
python broken_link_checker.py https://example.com --json results.json
# CSV export
python broken_link_checker.py https://example.com --csv results.csv
# Both formats
python broken_link_checker.py https://example.com \
--json results.json \
--csv results.csv

Large website audit:
python broken_link_checker.py https://mybigsite.com \
--max-urls 1000 \
--max-depth 4 \
--delay 1.5 \
--json comprehensive_audit.json \
--csv comprehensive_audit.csv

The tool provides real-time feedback with emoji indicators:
- 🕷️ Crawling pages
- ✅ Working links
- ❌ Broken links
- ⚠️ Error links
- 📊 CSV report saved
- 💾 JSON report saved
{
"scan_info": {
"start_time": "2024-01-15T10:30:00",
"end_time": "2024-01-15T10:35:30",
"duration_seconds": 330.5,
"start_domain": "example.com",
"max_urls": 100,
"max_depth": 2,
"delay": 1.0,
"same_domain_only": true
},
"statistics": {
"total_urls_processed": 87,
"working_links_count": 82,
"broken_links_count": 3,
"error_links_count": 2,
"visited_pages_count": 15
},
"results": {
"working_links": [...],
"broken_links": [...],
"error_links": [...]
}
}

| url | status | status_code | final_url | error | type | timestamp |
|---|---|---|---|---|---|---|
| https://example.com/page1 | working | 200 | https://example.com/page1 | | | 2024-01-15T10:30:15 |
| https://example.com/broken | broken | 404 | https://example.com/broken | | | 2024-01-15T10:30:20 |
| https://timeout.com/page | error | | | Connection timeout | check | 2024-01-15T10:30:25 |
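To post-process a report in Python, a minimal sketch along these lines works against the JSON structure shown above (the file name `results.json` is an example, and the per-link fields are assumed to mirror the CSV columns):

```python
import json

# Load a report produced with --json results.json (file name is an example)
with open("results.json", encoding="utf-8") as fh:
    report = json.load(fh)

stats = report["statistics"]
print(f"{stats['total_urls_processed']} URLs checked, "
      f"{stats['broken_links_count']} broken, {stats['error_links_count']} errors")

# Each entry is assumed to carry the same fields as the CSV columns
for link in report["results"]["broken_links"]:
    print(link.get("status_code"), link.get("url"))
```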
- Regular Health Checks: Schedule weekly/monthly scans to catch broken links early
- Post-Migration Audits: Verify all links work after site migrations or redesigns
- Content Updates: Check links after major content updates
- SEO Audits: Broken links hurt SEO rankings - find and fix them
- User Experience: Ensure visitors don't hit dead ends
- Link Building: Verify outgoing links to maintain site credibility
- Pre-Launch Testing: Scan staging sites before going live
- Continuous Integration: Integrate into CI/CD pipelines (see the sketch after this list)
- Quality Assurance: Regular checks as part of QA processes
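For the CI/CD use case above, a minimal wrapper might run the scanner, read the JSON report, and fail the build when broken links are found; the target URL, limits, and report path below are placeholders, not part of the tool.

```python
import json
import subprocess
import sys

# Run the scanner against a staging site and write a JSON report
# (URL, limits, and file name are placeholders for your pipeline)
subprocess.run(
    [sys.executable, "broken_link_checker.py", "https://staging.example.com",
     "--max-urls", "200", "--json", "ci_report.json"],
    check=True,
)

with open("ci_report.json", encoding="utf-8") as fh:
    stats = json.load(fh)["statistics"]

# A non-zero exit code fails most CI pipelines
if stats["broken_links_count"] > 0:
    print(f"Found {stats['broken_links_count']} broken links")
    sys.exit(1)
print("No broken links found")
```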
For small sites (< 100 pages):
--max-urls 200 --max-depth 3 --delay 0.5

For medium sites (100-1000 pages):

--max-urls 500 --max-depth 2 --delay 1.0

For large sites (1000+ pages):

--max-urls 1000 --max-depth 2 --delay 2.0

- Always use delays (`--delay`) to avoid overwhelming servers
- Start with smaller scans to test site behavior
- Monitor server response and increase delays if needed
- Consider time zones - scan during off-peak hours for target sites
Solution: Increase the delay (`--delay 2`) or reduce concurrent requests
Solution: Increase the delay significantly (`--delay 5`) and reduce max URLs
Solution: Reduce `--max-urls` and run multiple smaller scans
Solution: Use the `--external` flag carefully and consider separate scans for internal and external links
The tool generates detailed reports that can be used for:
- Spreadsheet Analysis: Open CSV in Excel/Google Sheets
- Database Import: Import CSV into databases for further analysis (see the sketch after this list)
- API Integration: Use JSON output in other tools and services
- Reporting: Generate management reports from the data
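As a sketch of the database-import idea above, the standard library is enough to load a CSV report into SQLite; the file and table names are examples, and the columns are taken from the CSV layout shown earlier.

```python
import csv
import sqlite3

# Create a local database with a table matching the CSV columns
conn = sqlite3.connect("link_audit.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS links "
    "(url TEXT, status TEXT, status_code TEXT, final_url TEXT, "
    "error TEXT, type TEXT, timestamp TEXT)"
)

# Load a CSV report produced with --csv results.csv
with open("results.csv", newline="", encoding="utf-8") as fh:
    rows = [
        (r["url"], r["status"], r["status_code"], r["final_url"],
         r["error"], r["type"], r["timestamp"])
        for r in csv.DictReader(fh)
    ]

conn.executemany("INSERT INTO links VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
conn.commit()
conn.close()
```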
Contributions are welcome! Areas for improvement:
- Additional export formats (XML, HTML reports)
- Web interface
- Database storage options
- Advanced filtering options
- Performance optimizations
This project is open source. Feel free to use, modify, and distribute.
# Install all dependencies
pip install -r requirements.txt
# Or install individually
pip install requests beautifulsoup4 lxml

# Make script executable
chmod +x broken_link_checker.py
# Run with Python explicitly
python3 broken_link_checker.py https://example.com

- Use smaller `--max-urls` values
- Reduce `--max-depth`
- Run multiple focused scans instead of one large scan
A REST API is provided for programmatic scanning, status checking, and result retrieval.
uvicorn broken_link_checker_api:app --host 0.0.0.0 --port 8000 --reload

Or:

python broken_link_checker_api.py api

POST /scan
Request JSON:
{
"url": "https://example.com",
"max_urls": 100,
"max_depth": 2,
"delay": 1.0,
"same_domain_only": true
}

- Only `url` is required; other fields are optional.
Response:
{
"message": "Scan completed",
"scan_id": "e1b2c3d4-...",
"result_file": "download/example_com_20250630.json",
"statistics": { ... },
"max_urls": 100
}
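For example, a scan can be started from Python with the `requests` package (assuming the API is running locally on port 8000 as shown above; the scan runs before the response is returned, so allow a generous timeout):

```python
import requests

# Start a scan; only "url" is required, other fields fall back to defaults
resp = requests.post(
    "http://localhost:8000/scan",
    json={"url": "https://example.com", "max_urls": 50, "delay": 1.0},
    timeout=600,
)
resp.raise_for_status()
data = resp.json()

print(data["message"], data["scan_id"])
print(data["statistics"])
```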
GET /status/{scan_id}

Response:
{
"status": "completed",
"total_urls_processed": 100,
"working_links": 90,
"broken_links": 8,
"error_links": 2,
"broken_links_list": [...],
"error_links_list": [...],
"start_domain": "example.com",
"max_urls": 100,
"max_depth": 2,
"delay": 1.0,
"same_domain_only": true
}

`status` can be `in_progress`, `completed`, or `not_started`.
GET /results/{scan_id}
Returns the full scan results in JSON format.
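Putting the two endpoints together, a hedged sketch for polling a scan and fetching its results (the `scan_id` below is a placeholder, and the results are assumed to follow the same structure as the JSON report file):

```python
import time
import requests

BASE = "http://localhost:8000"
scan_id = "e1b2c3d4-..."  # placeholder: use the scan_id returned by POST /scan

# Poll /status until the scan is no longer in progress
while True:
    status = requests.get(f"{BASE}/status/{scan_id}", timeout=30).json()
    if status.get("status") != "in_progress":
        break
    time.sleep(5)

print(status.get("status"), "-", status.get("broken_links"), "broken links")

# Retrieve the full result set
results = requests.get(f"{BASE}/results/{scan_id}", timeout=60).json()
broken = results.get("results", {}).get("broken_links", [])
print(len(broken), "broken link records retrieved")
```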
Result files are stored in the download/ directory with a name based on the URL and date, e.g.:
download/example_com_20250630.json
You can download this file directly from the server if you expose a static file endpoint or via SFTP.
Visit http://localhost:8000/docs for interactive Swagger UI.
Happy Link Checking! 🔗✨
For questions, issues, or feature requests, please open an issue in the repository.