Crawlers
Google Drive Public File Downloader when Curl/Wget Fails
Media Downloader is a Qt/C++ front end to yt-dlp, youtube-dl, gallery-dl, lux, you-get, svtplay-dl, aria2c, wget and Safari Books.
Got: Simple golang package and CLI tool to download large files faster than cURL and Wget!
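The speedup such downloaders get comes from splitting a file's byte range into chunks and fetching them concurrently with HTTP Range requests. A minimal sketch of the chunk math (the network fetch itself is left out so the sketch stays offline; `chunk_ranges` is an illustrative helper, not part of the tool's API):

```python
# Sketch of the core idea behind parallel chunked downloaders:
# divide [0, size) into near-equal inclusive byte ranges, one per
# concurrent worker, each suitable for a "Range: bytes=start-end" header.
def chunk_ranges(size: int, chunks: int):
    """Yield (start, end) inclusive byte ranges covering [0, size)."""
    step = -(-size // chunks)  # ceiling division
    for start in range(0, size, step):
        yield start, min(start + step, size) - 1

print(list(chunk_ranges(10, 3)))  # [(0, 3), (4, 7), (8, 9)]
```

Each worker would then issue its own ranged request and write its chunk at the matching file offset.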
Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON do…
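The XPath-based extraction such a tool performs can be sketched with Python's stdlib `ElementTree`, which supports a limited XPath subset; this is an illustration of the technique, not the tool's own implementation:

```python
# Extract elements from markup with an XPath expression, analogous
# to what CSS/XPath extraction CLIs do against fetched pages.
import xml.etree.ElementTree as ET

html = """
<html>
  <body>
    <ul>
      <li class="item">first</li>
      <li class="item">second</li>
    </ul>
  </body>
</html>
"""

root = ET.fromstring(html)
# XPath: every <li> with class="item", anywhere in the tree
items = [li.text for li in root.findall(".//li[@class='item']")]
print(items)  # ['first', 'second']
```

Real pages need an HTML-tolerant parser; the stdlib parser requires well-formed XML.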
The successor of GNU Wget. Contributions preferred at https://gitlab.com/gnuwget/wget2. But accepted here as well.
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
An introduction to the Internet Archiving ecosystem, tooling, and some of the ethical dilemmas that the community faces.
An Awesome List for getting started with web archiving
A curated list of awesome tools for website diffing and change monitoring.
A compilation of research relevant to Data Together's efforts tackling the general problem of data resilience & interactivity
List of libraries, tools and APIs for web scraping and data processing.
A curated list of awesome puppeteer resources.
Web Archiving Integration Layer: One-Click User Instigated Preservation
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)
Simple script to convert web resources to a single WARC file
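WARC, the format several tools in this list produce, is a plain text-framed container: each record is a header block, a blank line, the payload, and a two-CRLF terminator. A hand-rolled sketch of a single WARC 1.0 "resource" record, using only the stdlib (real tools use a library such as warcio; `warc_resource_record` is an illustrative helper):

```python
# Build one WARC 1.0 "resource" record by hand to show the on-disk
# layout: version line, named headers, blank line, body, record terminator.
import uuid
from datetime import datetime, timezone

def warc_resource_record(url: str, body: bytes,
                         content_type: str = "text/html") -> bytes:
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Target-URI: {url}",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(body)}",
    ]
    # Header block, blank line, payload, then CRLF CRLF ends the record
    return ("\r\n".join(headers) + "\r\n\r\n").encode() + body + b"\r\n\r\n"

record = warc_resource_record("https://example.com/", b"<html>hi</html>")
print(record.decode().splitlines()[0])  # WARC/1.0
```

A "single WARC file" is simply these records concatenated, optionally gzip-compressed per record.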
wallabag is a self-hostable application for saving web pages: Save and classify articles. Read them later. Freely.
A server to collect & archive websites that also supports video downloads