A web archive and historical data retrieval tool for investigators and researchers.
Archive Snooper searches multiple web archive services to retrieve historical snapshots and data about websites. It's designed for OSINT investigators who need to track changes in web content over time, recover deleted information, or research the history of online entities.
- Wayback Machine integration (Internet Archive)
- Archive.today/Archive.is support
- Common Crawl index searching
- Timeline generation across multiple archives
- Year-based filtering for targeted searches
- JSON output for further analysis
- No API keys required for basic functionality
This tool integrates with the following web archive services:
- Wayback Machine (Internet Archive) - Largest web archive
  - CDX API Documentation: https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server
  - Website: https://web.archive.org
  - No API key required
- Archive.today / Archive.is - Snapshot service
  - Website: https://archive.today
  - No API key required
  - Note: May have rate limiting
- Common Crawl - Open repository of web crawl data
  - Index API Documentation: https://index.commoncrawl.org/
  - Website: https://commoncrawl.org
  - No API key required
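Since none of these services require an API key, they can be queried with plain HTTP. As a minimal sketch, a Wayback CDX API query URL might be built like this (the helper name and default limit are illustrative, not part of the tool):

```python
import urllib.parse

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def build_cdx_query(url, year=None, limit=10):
    """Build a Wayback CDX API query URL (no API key needed).

    When a year is given, results are restricted using the API's
    from/to parameters, which accept timestamp prefixes.
    """
    params = {"url": url, "output": "json", "limit": str(limit)}
    if year is not None:
        params["from"] = f"{year}0101"
        params["to"] = f"{year}1231"
    return CDX_ENDPOINT + "?" + urllib.parse.urlencode(params)

# The resulting URL can be fetched with any HTTP client, e.g.:
# urllib.request.urlopen(build_cdx_query("example.com", year=2020))
```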
- Python 3.7 or higher
- pip package manager
- Internet connection
- Clone the repository:

```bash
git clone https://github.com/yourusername/archivesnooper.git
cd archivesnooper
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Run the tool:

```bash
python archivesnooper.py --help
```

Search all archives for a URL:

```bash
python archivesnooper.py https://example.com
```

Search only Wayback Machine:

```bash
python archivesnooper.py https://example.com -a wayback
```

Search multiple specific archives:

```bash
python archivesnooper.py https://example.com -a wayback archive_today
```

Get Wayback snapshots from a specific year:

```bash
python archivesnooper.py https://example.com -a wayback -y 2020
```

Generate a timeline across all archives:

```bash
python archivesnooper.py https://example.com -a timeline
```

Output results to JSON file:

```bash
python archivesnooper.py https://example.com -o results.json
```

```
usage: archivesnooper.py [-h] [-a {wayback,archive_today,commoncrawl,timeline,all} [...]] [-y YEAR] [-o OUTPUT] url
```
```
positional arguments:
  url                  Target URL to search in archives

optional arguments:
  -h, --help           show this help message and exit
  -a, --archives       Archives to search (default: all)
  -y, --year YEAR      Filter Wayback results by year
  -o, --output OUTPUT  Output file for JSON results
```
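The interface above maps naturally onto Python's `argparse`. The following is an illustrative sketch consistent with the usage shown, not the tool's actual source:

```python
import argparse

ARCHIVES = ["wayback", "archive_today", "commoncrawl", "timeline", "all"]

def build_parser():
    """Build an argument parser matching the usage shown above."""
    p = argparse.ArgumentParser(prog="archivesnooper.py")
    p.add_argument("url", help="Target URL to search in archives")
    p.add_argument("-a", "--archives", nargs="+", choices=ARCHIVES,
                   default=["all"], help="Archives to search (default: all)")
    p.add_argument("-y", "--year", help="Filter Wayback results by year")
    p.add_argument("-o", "--output", help="Output file for JSON results")
    return p
```

Using `nargs="+"` is what allows `-a wayback archive_today` to accept several archive names at once.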
```bash
python archivesnooper.py https://targetcompany.com
python archivesnooper.py https://suspicioussite.com -a wayback -o history.json
python archivesnooper.py https://company.com -y 2015
python archivesnooper.py https://example.com -a timeline -o timeline.json
python archivesnooper.py https://target.com -a wayback archive_today commoncrawl
```

Results are provided in JSON format:
```json
{
  "url": "https://example.com",
  "timestamp": "2025-01-11T12:00:00",
  "archives": {
    "wayback_machine": {
      "snapshots_found": 245,
      "first_capture": "20100115000000",
      "last_capture": "20250110120000",
      "snapshots": [
        {
          "timestamp": "20250110120000",
          "url": "https://example.com",
          "archive_url": "https://web.archive.org/web/20250110120000/https://example.com"
        }
      ]
    }
  }
}
```

The Wayback Machine uses a 14-digit timestamp format:
- Format: YYYYMMDDhhmmss
- Example: 20250111153045
- Year: 2025
- Month: 01 (January)
- Day: 11
- Hour: 15 (3 PM)
- Minute: 30
- Second: 45
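Because the format is fixed-width, these timestamps can be converted to proper datetimes in one call. A minimal helper (the function name is illustrative):

```python
from datetime import datetime

def parse_wayback_timestamp(ts: str) -> datetime:
    """Convert a 14-digit Wayback timestamp (YYYYMMDDhhmmss) to a datetime."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")

# parse_wayback_timestamp("20250111153045") -> datetime(2025, 1, 11, 15, 30, 45)
```

This also makes the timestamps sortable and comparable, which is what timeline generation relies on.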
- Corporate Investigation: Track changes to company websites, deleted pages, or modified content
- Fraud Detection: Identify changes to scam websites or phishing pages
- Background Checks: Research historical online presence of individuals or organizations
- Domain History: Track ownership and content changes over time
- Evidence Preservation: Document web content for legal proceedings
- Reputation Analysis: Monitor evolution of online profiles and social media
- Deleted Content Recovery: Retrieve information removed from live websites
Wayback Machine:
- Generally tolerant of automated requests
- Be respectful with request frequency
- Consider delays between requests for large investigations

Archive.today:
- May implement rate limiting
- Consider using delays between requests
- Some requests may be blocked if too frequent

Common Crawl:
- Public dataset with no rate limits
- Indexes are updated monthly
- The latest index may not include very recent crawls
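A simple way to honor these limits is to enforce a minimum interval between successive requests. A sketch (the class name and 2-second default are arbitrary examples, not the tool's behavior):

```python
import time

class PoliteThrottle:
    """Enforce a minimum interval between successive archive requests."""

    def __init__(self, min_interval: float = 2.0):
        self.min_interval = min_interval
        self._last = 0.0  # monotonic time of the previous request

    def wait(self):
        """Sleep just long enough to keep requests min_interval apart."""
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Calling `throttle.wait()` before each request keeps large investigations from hammering a service.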
IMPORTANT: This tool is intended for legitimate research and investigations only. Users must:
- Have proper authorization before investigating targets
- Comply with all applicable laws and regulations
- Respect website robots.txt and terms of service
- Use archived data responsibly
- Not use for harassment or stalking
- Only investigate targets you have legal authority to investigate
The author is not responsible for misuse of this tool.
No snapshots found:
- The URL may never have been archived
- Try different URL variations (www vs non-www, http vs https)
- Check whether the domain existed during the searched time period
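Those URL variations can be generated mechanically before retrying a search. A small sketch (the helper name is illustrative):

```python
from urllib.parse import urlsplit, urlunsplit

def url_variants(url):
    """Generate www/non-www and http/https variations of a URL."""
    parts = urlsplit(url)
    host = parts.netloc
    # Toggle the leading "www." on the hostname
    other = host[4:] if host.startswith("www.") else "www." + host
    return sorted(
        urlunsplit((scheme, h, parts.path, parts.query, ""))
        for scheme in ("http", "https")
        for h in (host, other)
    )
```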
Archive.today errors:
- The service may be experiencing downtime
- Rate limiting may be in effect
- Try again later, or fall back to the Wayback Machine
Common Crawl limitations:
- Only includes URLs crawled by Common Crawl
- Monthly updates mean recent content may not be indexed
- Not all websites are included in crawls
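Common Crawl's index is queried per crawl collection, which is why recent content may be missing until the next monthly index lands. A hedged sketch of building such a query (the collection name below is only an example; current collection names are listed at index.commoncrawl.org):

```python
import urllib.parse

def build_commoncrawl_query(url, collection="CC-MAIN-2024-33"):
    """Build a Common Crawl index query URL for one crawl collection.

    The collection argument is an example value; pick a current
    collection from the index server's listing.
    """
    params = {"url": url, "output": "json"}
    return (f"https://index.commoncrawl.org/{collection}-index?"
            + urllib.parse.urlencode(params))
```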
MIT License - See LICENSE file for details
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
For issues and questions:
- Open an issue on GitHub
- Check existing issues for solutions
- Wayback Machine
- Archive.today
- Common Crawl
- ArchiveBox - Self-hosted archiving
- Memento Protocol - Web archive aggregator
This tool is provided "as-is" without warranty. Use at your own risk. Always ensure you have proper authorization before conducting investigations. Archived content may be subject to copyright and other legal protections.
Created for professional OSINT investigators and researchers.
- Initial release
- Wayback Machine integration
- Archive.today support
- Common Crawl indexing
- Timeline generation
- JSON output support
- Year-based filtering