Favorites
A Python module to bypass Cloudflare's anti-bot page.
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
The successor of GNU Wget. Contributions preferred at https://gitlab.com/gnuwget/wget2. But accepted here as well 😍
Google Drive Public File Downloader when Curl/Wget Fails
Media Downloader is a Qt/C++ front end to yt-dlp, youtube-dl, gallery-dl, lux, you-get, svtplay-dl, aria2c, wget and safari books.
Got: Simple golang package and CLI tool to download large files faster 🏃 than cURL and Wget!
Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON do…
A curated list of awesome tools for website diffing and change monitoring.
An Awesome List for getting started with web archiving
📚 A compilation of research relevant to Data Together's efforts tackling the general problem of data resilience & interactivity
List of libraries, tools and APIs for web scraping and data processing.
A curated list of awesome Puppeteer resources.
🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)
Simple script to convert web resources to a single WARC file
wallabag is a self-hostable application for saving web pages: Save and classify articles. Read them later. Freely.
A server to collect & archive websites that also supports video downloads
Multi-functional app to find duplicates, empty folders, similar images, etc.
Generate regular expressions from sample texts.
Console app to export OneNote notebooks to Markdown formats