Is there a means by which one can crawl a self-hosted Warc file? #103

deltabravozulu · 2020-09-01T00:05:01Z

My use case: I have a personal website that I recorded at one point, but I no longer have the original files. I'd like to rehost the site, but without the old source code I cannot. I'm trying to find a way to reconstruct the site, with all its links intact, from the WARC file in my backups, but so far without success.

I've found that webrecorder (not the player) puts its WARC files together in such a way that other programs built over the years (e.g. warc to zip, warcat, or warc-extractor) cannot take them apart: each runs into errors when trying to figure out the indexing of the WARC.

As such, I ran `sudo netstat -tulpn | grep -i webrecord`, which gave me a host:port of http://127.0.0.1:35535. I found that instead of going through webrecorder-player, I could open the whole site in Chrome by going to http://127.0.0.1:35535/local/collection/http://deltabravozu.lu. Because I can access it in the browser with all links working as they do in webrecorder-player, I figured I should be able to crawl the site and pull down the intact site structure using, say, wget or httrack. So far, though, I've been able to crawl nothing more than the first page and a few unrelated offsite links encoded in the webrecorder-player server (e.g. https://www.w3.org).

For wget, I used: `wget --force-directories --timestamping --level=inf --no-remove-listing --debug --page-requisites --adjust-extension --convert-links --retry-connrefused --span-hosts --follow-ftp --retry-on-host-error --execute robots=off http://127.0.0.1:35535/local/collection/http://deltabravozu.lu`
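To make the goal concrete, here is a minimal Python sketch of the kind of crawl I'm after, using only the standard library. It is not a known-working solution. It assumes the replay server keeps serving every archived page under the `/local/collection/` prefix shown above, and it follows only links that stay under that prefix, so offsite links such as https://www.w3.org are skipped. The names `REPLAY_PREFIX`, `START_URL`, `extract_links`, `crawl`, and `fetch_http` are made up for illustration. The trailing slash on `START_URL` is my addition so that relative links resolve against the site root.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen

# Assumption: host, port, and path prefix as observed in this post.
REPLAY_PREFIX = "http://127.0.0.1:35535/local/collection/"
START_URL = REPLAY_PREFIX + "http://deltabravozu.lu/"


class LinkParser(HTMLParser):
    """Collect the values of href and src attributes from a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def extract_links(html, base_url):
    """Return absolute, fragment-free URLs for every link in `html`."""
    parser = LinkParser()
    parser.feed(html)
    urls = []
    for link in parser.links:
        url = urldefrag(urljoin(base_url, link)).url
        # urljoin drops empty path segments when resolving relative links,
        # which turns the embedded "http://" into "http:/"; restore it.
        url = re.sub(r"/(https?):/(?!/)", r"/\1://", url, count=1)
        urls.append(url)
    return urls


def crawl(start_url, fetch, prefix=REPLAY_PREFIX, limit=1000):
    """Breadth-first crawl that only follows URLs starting with `prefix`.

    `fetch(url)` must return a (final_url, html) tuple or raise OSError.
    Returns a dict mapping each fetched URL to its HTML text.
    """
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < limit:
        url = queue.popleft()
        try:
            final_url, html = fetch(url)
        except OSError:
            continue  # skip anything the replay server cannot serve
        pages[url] = html
        for link in extract_links(html, final_url):
            if link.startswith(prefix) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def fetch_http(url):
    """Fetch a page over HTTP and decode it as text (lossy for binaries)."""
    with urlopen(url) as resp:
        return resp.geturl(), resp.read().decode("utf-8", errors="replace")
```

With the replay server running, `crawl(START_URL, fetch_http)` should return every page reachable from the start page without leaving the collection. The sketch only collects HTML text. Saving images and other binary assets, writing files to disk, and stripping the replay prefix back out of the links would still need to be added.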

Does anyone have any idea as to how I might more effectively go about my little task?
