Crawlers
Google Drive Public File Downloader when Curl/Wget Fails
Media Downloader is a Qt/C++ front end to yt-dlp, youtube-dl, gallery-dl, lux, you-get, svtplay-dl, aria2c, wget and Safari Books.
Got: Simple golang package and CLI tool to download large files faster than cURL and Wget!
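The speedup such downloaders get comes from splitting a file's byte range into chunks and fetching them concurrently with HTTP Range requests. A minimal sketch of the chunk math (the network fetch itself is left out so the sketch stays offline; `chunk_ranges` is an illustrative helper, not part of the tool's API):

```python
# Sketch of the core idea behind parallel chunked downloaders:
# divide [0, size) into near-equal inclusive byte ranges, one per
# concurrent worker, each suitable for a "Range: bytes=start-end" header.
def chunk_ranges(size: int, chunks: int):
    """Yield (start, end) inclusive byte ranges covering [0, size)."""
    step = -(-size // chunks)  # ceiling division
    for start in range(0, size, step):
        yield start, min(start + step, size) - 1

print(list(chunk_ranges(10, 3)))  # [(0, 3), (4, 7), (8, 9)]
```

Each worker would then issue its own ranged request and write its chunk at the matching file offset.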
Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON do…
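The XPath-based extraction such a tool performs can be sketched with Python's stdlib `ElementTree`, which supports a limited XPath subset; this is an illustration of the technique, not the tool's own implementation:

```python
# Extract elements from markup with an XPath expression, analogous
# to what CSS/XPath extraction CLIs do against fetched pages.
import xml.etree.ElementTree as ET

html = """
<html>
  <body>
    <ul>
      <li class="item">first</li>
      <li class="item">second</li>
    </ul>
  </body>
</html>
"""

root = ET.fromstring(html)
# XPath: every <li> with class="item", anywhere in the tree
items = [li.text for li in root.findall(".//li[@class='item']")]
print(items)  # ['first', 'second']
```

Real pages need an HTML-tolerant parser; the stdlib parser requires well-formed XML.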
The successor of GNU Wget. Contributions preferred at https://gitlab.com/gnuwget/wget2. But accepted here as well.
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
An introduction to the Internet Archiving ecosystem, tooling, and some of the ethical dilemmas that the community faces.
An Awesome List for getting started with web archiving
A curated list of awesome tools for website diffing and change monitoring.
A compilation of research relevant to Data Together's efforts tackling the general problem of data resilience & interactivity
List of libraries, tools and APIs for web scraping and data processing.
A curated list of awesome puppeteer resources.
Web Archiving Integration Layer: One-Click User Instigated Preservation
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)
Simple script to convert web resources to a single WARC file
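WARC, the format several tools in this list produce, is a plain text-framed container: each record is a header block, a blank line, the payload, and a two-CRLF terminator. A hand-rolled sketch of a single WARC 1.0 "resource" record, using only the stdlib (real tools use a library such as warcio; `warc_resource_record` is an illustrative helper):

```python
# Build one WARC 1.0 "resource" record by hand to show the on-disk
# layout: version line, named headers, blank line, body, record terminator.
import uuid
from datetime import datetime, timezone

def warc_resource_record(url: str, body: bytes,
                         content_type: str = "text/html") -> bytes:
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Target-URI: {url}",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(body)}",
    ]
    # Header block, blank line, payload, then CRLF CRLF ends the record
    return ("\r\n".join(headers) + "\r\n\r\n").encode() + body + b"\r\n\r\n"

record = warc_resource_record("https://example.com/", b"<html>hi</html>")
print(record.decode().splitlines()[0])  # WARC/1.0
```

A "single WARC file" is simply these records concatenated, optionally gzip-compressed per record.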
wallabag is a self-hostable application for saving web pages: Save and classify articles. Read them later. Freely.
A server to collect & archive websites that also supports video downloads