AliHz1337

Problem

The current Wayback Machine URL fetching implementation holds every result in memory before producing any output, which crashes the application when processing domains with large archives:

Issues Identified:

  • Memory exhaustion: All URLs are collected in memory before anything is returned, which leads to OOM kills
  • Process crashes: Large enterprise domains consistently crash the application
  • Poor scalability: Memory usage grows linearly with archive size
  • Blocking behavior: The application hangs for minutes with no output before it eventually crashes

Reproduction Case:

Testing with indeed.com (a domain with extensive web archives):

./waybackurls indeed.com
# Result: Application hangs → Memory usage grows → OOM kill after ~15 minutes
# No URLs output during the hang period

AliHz1337 added 2 commits June 1, 2025 19:10
- Switch to plain output format and streaming processing
- Eliminate OOM crashes for large domains
- Implement true streaming with buffered channels
- Maintain constant memory usage