CLI
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
The successor of GNU Wget. Contributions are preferred at https://gitlab.com/gnuwget/wget2, but accepted here as well 😍
Google Drive Public File Downloader when Curl/Wget Fails
Got: Simple golang package and CLI tool to download large files faster 🏃 than cURL and Wget!
Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON do…
🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Simple script to convert web resources to a single WARC file
A file management automation tool with SQL-like syntax.