OSINTCabal/ArchiveSnooper
Archive Snooper

A web archive and historical data retrieval tool for investigators and researchers.

Description

Archive Snooper searches multiple web archive services to retrieve historical snapshots and data about websites. It's designed for OSINT investigators who need to track changes in web content over time, recover deleted information, or research the history of online entities.

Features

  • Wayback Machine integration (Internet Archive)
  • Archive.today/Archive.is support
  • Common Crawl index searching
  • Timeline generation across multiple archives
  • Year-based filtering for targeted searches
  • JSON output for further analysis
  • No API keys required for basic functionality

Sources/APIs

This tool integrates with the following web archive services:

  1. Wayback Machine (Internet Archive) - Largest web archive

  2. Archive.today / Archive.is - Snapshot service

  3. Common Crawl - Open repository of web crawl data
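The tool's internals aren't shown here, but the Wayback Machine and Common Crawl both expose public index endpoints that a client like this would typically query. The sketch below only builds the query URLs (no network calls); the CDX endpoint and Common Crawl index server are real public services, though the specific Common Crawl index name is just an example and function names are illustrative:

```python
from urllib.parse import urlencode

def wayback_cdx_url(target, year=None):
    """Build a Wayback Machine CDX API query URL for a target site."""
    params = {"url": target, "output": "json"}
    if year is not None:
        # Restrict matches to a single calendar year.
        params["from"] = "{}0101".format(year)
        params["to"] = "{}1231".format(year)
    return "https://web.archive.org/cdx/search/cdx?" + urlencode(params)

def commoncrawl_index_url(target, index="CC-MAIN-2024-51"):
    """Build a Common Crawl index query URL (index name is an example)."""
    params = {"url": target, "output": "json"}
    return "https://index.commoncrawl.org/{}-index?".format(index) + urlencode(params)

print(wayback_cdx_url("example.com", year=2020))
```

Fetching either URL with a standard HTTP client returns one JSON record per capture.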

Installation

Prerequisites

  • Python 3.7 or higher
  • pip package manager
  • Internet connection

Setup

  1. Clone the repository:

     git clone https://github.com/yourusername/archivesnooper.git
     cd archivesnooper

  2. Install dependencies:

     pip install -r requirements.txt

  3. Run the tool:

     python archivesnooper.py --help

Usage

Basic Usage

Search all archives for a URL:

python archivesnooper.py https://example.com

Specific Archives

Search only Wayback Machine:

python archivesnooper.py https://example.com -a wayback

Search multiple specific archives:

python archivesnooper.py https://example.com -a wayback archive_today

Year Filtering

Get Wayback snapshots from a specific year:

python archivesnooper.py https://example.com -a wayback -y 2020

Timeline Generation

Generate a timeline across all archives:

python archivesnooper.py https://example.com -a timeline

Save Results

Output results to JSON file:

python archivesnooper.py https://example.com -o results.json

Command-Line Options

usage: archivesnooper.py [-h] [-a {wayback,archive_today,commoncrawl,timeline,all} [...]] [-y YEAR] [-o OUTPUT] url

positional arguments:
  url                   Target URL to search in archives

optional arguments:
  -h, --help            show this help message and exit
  -a, --archives        Archives to search (default: all)
  -y, --year YEAR       Filter Wayback results by year
  -o, --output OUTPUT   Output file for JSON results

Examples

Basic website investigation

python archivesnooper.py https://targetcompany.com

Track domain history

python archivesnooper.py https://suspicioussite.com -a wayback -o history.json

Find snapshots from specific year

python archivesnooper.py https://company.com -y 2015

Generate comprehensive timeline

python archivesnooper.py https://example.com -a timeline -o timeline.json

Check multiple archives

python archivesnooper.py https://target.com -a wayback archive_today commoncrawl

Output Format

Results are provided in JSON format:

{
  "url": "https://example.com",
  "timestamp": "2025-01-11T12:00:00",
  "archives": {
    "wayback_machine": {
      "snapshots_found": 245,
      "first_capture": "20100115000000",
      "last_capture": "20250110120000",
      "snapshots": [
        {
          "timestamp": "20250110120000",
          "url": "https://example.com",
          "archive_url": "https://web.archive.org/web/20250110120000/https://example.com"
        }
      ]
    }
  }
}
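A results file saved with -o can be post-processed with a few lines of standard-library Python. This sketch assumes the structure shown above (field names taken from the example, not verified against the source):

```python
import json

# Sample matching the output structure documented above.
raw = '''{
  "url": "https://example.com",
  "timestamp": "2025-01-11T12:00:00",
  "archives": {
    "wayback_machine": {
      "snapshots_found": 245,
      "first_capture": "20100115000000",
      "last_capture": "20250110120000",
      "snapshots": [
        {"timestamp": "20250110120000",
         "url": "https://example.com",
         "archive_url": "https://web.archive.org/web/20250110120000/https://example.com"}
      ]
    }
  }
}'''

data = json.loads(raw)  # for a saved file: json.load(open("results.json"))
wb = data["archives"]["wayback_machine"]
print(wb["snapshots_found"], "snapshots between",
      wb["first_capture"], "and", wb["last_capture"])
for snap in wb["snapshots"]:
    print(snap["archive_url"])
```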

Understanding Timestamps

Wayback Machine uses a 14-digit timestamp format:

  • Format: YYYYMMDDhhmmss
  • Example: 20250111153045
    • Year: 2025
    • Month: 01 (January)
    • Day: 11
    • Hour: 15 (3 PM)
    • Minute: 30
    • Second: 45
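The 14-digit format maps directly onto a strptime pattern, so converting it to a proper datetime takes one call:

```python
from datetime import datetime

def parse_wayback_timestamp(ts):
    """Convert a 14-digit Wayback timestamp (YYYYMMDDhhmmss) to a datetime."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")

dt = parse_wayback_timestamp("20250111153045")
print(dt.isoformat())  # 2025-01-11T15:30:45
```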

Use Cases

Investigation Scenarios

  1. Corporate Investigation: Track changes to company websites, deleted pages, or modified content
  2. Fraud Detection: Identify changes to scam websites or phishing pages
  3. Background Checks: Research historical online presence of individuals or organizations
  4. Domain History: Track ownership and content changes over time
  5. Evidence Preservation: Document web content for legal proceedings
  6. Reputation Analysis: Monitor evolution of online profiles and social media
  7. Deleted Content Recovery: Retrieve information removed from live websites

Rate Limiting and Best Practices

Wayback Machine

  • Generally tolerant of automated requests
  • Be respectful with request frequency
  • Consider delays between requests for large investigations

Archive.today

  • May implement rate limiting
  • Consider using delays between requests
  • Some requests may be blocked if too frequent

Common Crawl

  • Public dataset with no rate limits
  • Indexes updated monthly
  • Latest index may not include very recent crawls
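One way to apply the delay advice above in your own scripts is a small throttle helper that enforces a minimum gap between consecutive requests. This is a generic sketch (the class and interval are illustrative, not part of the tool):

```python
import time

class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep only for whatever remains of the interval.
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

throttle = Throttle(min_interval=0.1)
for url in ["https://example.com/a", "https://example.com/b"]:
    throttle.wait()
    print("fetching", url)  # replace with an actual HTTP request
```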

Legal and Ethical Use

IMPORTANT: This tool is intended for legitimate research and investigations only. Users must:

  • Have proper authorization before investigating targets
  • Comply with all applicable laws and regulations
  • Respect website robots.txt and terms of service
  • Use archived data responsibly
  • Not use for harassment or stalking
  • Only investigate targets you have legal authority to investigate

The author is not responsible for misuse of this tool.

Troubleshooting

No Results Found

  • URL may never have been archived
  • Try different URL variations (www vs non-www, http vs https)
  • Check if the domain existed during the searched time period
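Trying URL variations by hand is tedious; a small helper can generate the common scheme and www permutations for you (a standalone sketch, not a feature of the tool):

```python
from urllib.parse import urlsplit, urlunsplit

def url_variants(url):
    """Generate common scheme and www-prefix variations of a URL."""
    parts = urlsplit(url)
    host = parts.netloc
    # Pair each host form (with and without "www.") with both schemes.
    hosts = {host, host[4:] if host.startswith("www.") else "www." + host}
    return sorted(
        urlunsplit((scheme, h, parts.path, parts.query, parts.fragment))
        for scheme in ("http", "https")
        for h in hosts
    )

for variant in url_variants("https://example.com"):
    print(variant)
```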

Archive.today Errors

  • Service may be experiencing downtime
  • Rate limiting may be in effect
  • Try again later or use Wayback Machine

Common Crawl Limitations

  • Only includes URLs crawled by Common Crawl
  • Monthly updates mean recent content may not be indexed
  • Not all websites are included in crawls

License

MIT License; see the LICENSE file for details.

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

Support

For issues and questions:

  • Open an issue on GitHub
  • Check existing issues for solutions

Disclaimer

This tool is provided "as-is" without warranty. Use at your own risk. Always ensure you have proper authorization before conducting investigations. Archived content may be subject to copyright and other legal protections.

Author

Created for professional OSINT investigators and researchers.

Changelog

Version 1.0.0

  • Initial release
  • Wayback Machine integration
  • Archive.today support
  • Common Crawl indexing
  • Timeline generation
  • JSON output support
  • Year-based filtering
