A web archive and historical data retrieval tool for investigators and researchers.
Archive Snooper searches multiple web archive services to retrieve historical snapshots and data about websites. It's designed for OSINT investigators who need to track changes in web content over time, recover deleted information, or research the history of online entities.
- Wayback Machine integration (Internet Archive)
- Archive.today/Archive.is support
- Common Crawl index searching
- Timeline generation across multiple archives
- Year-based filtering for targeted searches
- JSON output for further analysis
- No API keys required for basic functionality
This tool integrates with the following web archive services:
- Wayback Machine (Internet Archive) - Largest web archive
  - CDX API Documentation: https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server
  - Website: https://web.archive.org
  - No API key required
- Archive.today / Archive.is - Snapshot service
  - Website: https://archive.today
  - No API key required
  - Note: May have rate limiting
- Common Crawl - Open repository of web crawl data
  - Index API Documentation: https://index.commoncrawl.org/
  - Website: https://commoncrawl.org
  - No API key required
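Since none of these services require an API key, they can be queried with plain HTTP. As a minimal sketch, a Wayback CDX API query URL might be built like this (the helper name and default limit are illustrative, not part of the tool):

```python
import urllib.parse

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def build_cdx_query(url, year=None, limit=10):
    """Build a Wayback CDX API query URL (no API key needed).

    When a year is given, results are restricted using the API's
    from/to parameters, which accept timestamp prefixes.
    """
    params = {"url": url, "output": "json", "limit": str(limit)}
    if year is not None:
        params["from"] = f"{year}0101"
        params["to"] = f"{year}1231"
    return CDX_ENDPOINT + "?" + urllib.parse.urlencode(params)

# The resulting URL can be fetched with any HTTP client, e.g.:
# urllib.request.urlopen(build_cdx_query("example.com", year=2020))
```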
- Python 3.7 or higher
- pip package manager
- Internet connection
- Clone the repository:

```bash
git clone https://github.com/yourusername/archivesnooper.git
cd archivesnooper
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Run the tool:

```bash
python archivesnooper.py --help
```

Search all archives for a URL:

```bash
python archivesnooper.py https://example.com
```

Search only Wayback Machine:

```bash
python archivesnooper.py https://example.com -a wayback
```

Search multiple specific archives:

```bash
python archivesnooper.py https://example.com -a wayback archive_today
```

Get Wayback snapshots from a specific year:

```bash
python archivesnooper.py https://example.com -a wayback -y 2020
```

Generate a timeline across all archives:

```bash
python archivesnooper.py https://example.com -a timeline
```

Output results to JSON file:

```bash
python archivesnooper.py https://example.com -o results.json
```

```
usage: archivesnooper.py [-h] [-a {wayback,archive_today,commoncrawl,timeline,all} [...]] [-y YEAR] [-o OUTPUT] url
```
```
positional arguments:
  url                  Target URL to search in archives

optional arguments:
  -h, --help           show this help message and exit
  -a, --archives       Archives to search (default: all)
  -y, --year YEAR      Filter Wayback results by year
  -o, --output OUTPUT  Output file for JSON results
```
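The interface above maps naturally onto Python's `argparse`. The following is an illustrative sketch consistent with the usage shown, not the tool's actual source:

```python
import argparse

ARCHIVES = ["wayback", "archive_today", "commoncrawl", "timeline", "all"]

def build_parser():
    """Build an argument parser matching the usage shown above."""
    p = argparse.ArgumentParser(prog="archivesnooper.py")
    p.add_argument("url", help="Target URL to search in archives")
    p.add_argument("-a", "--archives", nargs="+", choices=ARCHIVES,
                   default=["all"], help="Archives to search (default: all)")
    p.add_argument("-y", "--year", help="Filter Wayback results by year")
    p.add_argument("-o", "--output", help="Output file for JSON results")
    return p
```

Using `nargs="+"` is what allows `-a wayback archive_today` to accept several archive names at once.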
```bash
python archivesnooper.py https://targetcompany.com
python archivesnooper.py https://suspicioussite.com -a wayback -o history.json
python archivesnooper.py https://company.com -y 2015
python archivesnooper.py https://example.com -a timeline -o timeline.json
python archivesnooper.py https://target.com -a wayback archive_today commoncrawl
```

Results are provided in JSON format:
```json
{
  "url": "https://example.com",
  "timestamp": "2025-01-11T12:00:00",
  "archives": {
    "wayback_machine": {
      "snapshots_found": 245,
      "first_capture": "20100115000000",
      "last_capture": "20250110120000",
      "snapshots": [
        {
          "timestamp": "20250110120000",
          "url": "https://example.com",
          "archive_url": "https://web.archive.org/web/20250110120000/https://example.com"
        }
      ]
    }
  }
}
```

The Wayback Machine uses a 14-digit timestamp format:
- Format: YYYYMMDDhhmmss
- Example: 20250111153045
- Year: 2025
- Month: 01 (January)
- Day: 11
- Hour: 15 (3 PM)
- Minute: 30
- Second: 45
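Because the format is fixed-width, these timestamps can be converted to proper datetimes in one call. A minimal helper (the function name is illustrative):

```python
from datetime import datetime

def parse_wayback_timestamp(ts: str) -> datetime:
    """Convert a 14-digit Wayback timestamp (YYYYMMDDhhmmss) to a datetime."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")

# parse_wayback_timestamp("20250111153045") -> datetime(2025, 1, 11, 15, 30, 45)
```

This also makes the timestamps sortable and comparable, which is what timeline generation relies on.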
- Corporate Investigation: Track changes to company websites, deleted pages, or modified content
- Fraud Detection: Identify changes to scam websites or phishing pages
- Background Checks: Research historical online presence of individuals or organizations
- Domain History: Track ownership and content changes over time
- Evidence Preservation: Document web content for legal proceedings
- Reputation Analysis: Monitor evolution of online profiles and social media
- Deleted Content Recovery: Retrieve information removed from live websites
Wayback Machine:
- Generally tolerant of automated requests
- Be respectful with request frequency
- Consider delays between requests for large investigations

Archive.today:
- May implement rate limiting
- Consider using delays between requests
- Some requests may be blocked if too frequent

Common Crawl:
- Public dataset with no rate limits
- Indexes are updated monthly
- The latest index may not include very recent crawls
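A simple way to honor these limits is to enforce a minimum interval between successive requests. A sketch (the class name and 2-second default are arbitrary examples, not the tool's behavior):

```python
import time

class PoliteThrottle:
    """Enforce a minimum interval between successive archive requests."""

    def __init__(self, min_interval: float = 2.0):
        self.min_interval = min_interval
        self._last = 0.0  # monotonic time of the previous request

    def wait(self):
        """Sleep just long enough to keep requests min_interval apart."""
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Calling `throttle.wait()` before each request keeps large investigations from hammering a service.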
IMPORTANT: This tool is intended for legitimate research and investigations only. Users must:
- Have proper authorization before investigating targets
- Comply with all applicable laws and regulations
- Respect website robots.txt and terms of service
- Use archived data responsibly
- Not use for harassment or stalking
- Only investigate targets you have legal authority to investigate
The author is not responsible for misuse of this tool.
No snapshots found:
- The URL may never have been archived
- Try different URL variations (www vs non-www, http vs https)
- Check whether the domain existed during the searched time period
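Those URL variations can be generated mechanically before retrying a search. A small sketch (the helper name is illustrative):

```python
from urllib.parse import urlsplit, urlunsplit

def url_variants(url):
    """Generate www/non-www and http/https variations of a URL."""
    parts = urlsplit(url)
    host = parts.netloc
    # Toggle the leading "www." on the hostname
    other = host[4:] if host.startswith("www.") else "www." + host
    return sorted(
        urlunsplit((scheme, h, parts.path, parts.query, ""))
        for scheme in ("http", "https")
        for h in (host, other)
    )
```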
Archive.today errors:
- The service may be experiencing downtime
- Rate limiting may be in effect
- Try again later, or fall back to the Wayback Machine
Common Crawl limitations:
- Only includes URLs crawled by Common Crawl
- Monthly updates mean recent content may not be indexed
- Not all websites are included in crawls
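Common Crawl's index is queried per crawl collection, which is why recent content may be missing until the next monthly index lands. A hedged sketch of building such a query (the collection name below is only an example; current collection names are listed at index.commoncrawl.org):

```python
import urllib.parse

def build_commoncrawl_query(url, collection="CC-MAIN-2024-33"):
    """Build a Common Crawl index query URL for one crawl collection.

    The collection argument is an example value; pick a current
    collection from the index server's listing.
    """
    params = {"url": url, "output": "json"}
    return (f"https://index.commoncrawl.org/{collection}-index?"
            + urllib.parse.urlencode(params))
```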
MIT License - See LICENSE file for details
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
For issues and questions:
- Open an issue on GitHub
- Check existing issues for solutions
- Wayback Machine
- Archive.today
- Common Crawl
- ArchiveBox - Self-hosted archiving
- Memento Protocol - Web archive aggregator
This tool is provided "as-is" without warranty. Use at your own risk. Always ensure you have proper authorization before conducting investigations. Archived content may be subject to copyright and other legal protections.
Created for professional OSINT investigators and researchers.
- Initial release
- Wayback Machine integration
- Archive.today support
- Common Crawl indexing
- Timeline generation
- JSON output support
- Year-based filtering