This repository has been archived by the owner on Sep 17, 2020. It is now read-only.

Is there a means by which one can crawl a self-hosted Warc file? #103

Open
deltabravozulu opened this issue Sep 1, 2020 · 0 comments

@deltabravozulu

My general use case is that I have a personal website I recorded at one point but no longer have the original files for. I'd like to rehost the site, but without the old source code I cannot. I'm trying to find a way to rebuild all my links and pages from the WARC file in my backups, but so far this has been in vain.

I've found that webrecorder (not the player) puts things together in such a way that other programs built over the years cannot take them apart (e.g. warc to zip, warcat, or warc-extractor): each runs into errors when trying to figure out the indexing of the WARC.

As such, I ran sudo netstat -tulpn | grep -i webrecord, which gave me a host:port of http://127.0.0.1:35535. I found that instead of going through webrecorder-player, I could open the whole site in Chrome by going to http://127.0.0.1:35535/local/collection/http://deltabravozu.lu. Because I can access it in the browser with all links working as they do in webrecorder-player, I figured I should be able to crawl the site and pull down the intact site structure using, say, wget or httrack. So far, though, I've been able to crawl nothing more than the first page and random offsite links encoded in the webrecorder-player server (e.g. https://www.w3.org).

For wget, I used wget --force-directories --timestamping --level=inf --no-remove-listing --debug --page-requisites --adjust-extension --convert-links --retry-connrefused --span-hosts --follow-ftp --retry-on-host-error --execute robots=off http://127.0.0.1:35535/local/collection/http://deltabravozu.lu
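
Two things stand out in that command: there is no --recursive, so wget fetches only the start page and its page requisites, and --span-hosts allows those requisites to come from other hosts. One alternative worth trying is a small crawler that follows only links under the local replay prefix. Below is a minimal sketch using only the Python standard library. The names `in_scope`, `local_path`, `crawl`, and the `mirror` output directory are hypothetical, and it assumes the replay server rewrites links so that they stay under /local/collection/, which I have not verified.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen

# Replay prefix taken from the post; adjust the port to whatever netstat reports.
PREFIX = "http://127.0.0.1:35535/local/collection/"
START = PREFIX + "http://deltabravozu.lu"


class LinkParser(HTMLParser):
    """Collect href and src attribute values from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def in_scope(url, prefix=PREFIX):
    """True only for URLs served from the local replay prefix."""
    return url.startswith(prefix)


def local_path(url, prefix=PREFIX, root="mirror"):
    """Map a replay URL to a file path under root (hypothetical layout)."""
    # Drop the prefix, the archived URL's scheme, and any query string.
    rest = url[len(prefix):].split("://", 1)[-1].split("?", 1)[0]
    parts = [p for p in rest.split("/") if p not in ("", ".", "..")]
    if rest.endswith("/") or len(parts) <= 1:
        parts.append("index.html")
    return os.path.join(root, *parts)


def crawl(start=START, limit=500):
    """Breadth-first fetch of in-scope pages, saving each one to disk."""
    queue, seen, fetched = [start], {start}, 0
    while queue and fetched < limit:
        url = queue.pop(0)
        try:
            with urlopen(url) as resp:
                body = resp.read()
                ctype = resp.headers.get("Content-Type", "")
            path = local_path(url)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as fh:
                fh.write(body)
        except OSError as exc:  # network errors and file/dir name clashes
            print("skip", url, exc)
            continue
        fetched += 1
        if "html" not in ctype:
            continue
        parser = LinkParser()
        parser.feed(body.decode("utf-8", errors="replace"))
        for link in parser.links:
            absolute = urldefrag(urljoin(url, link)).url
            if in_scope(absolute) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
```

Calling `crawl()` while the player is running would write pages under `mirror/deltabravozu.lu/`. The sketch does not rewrite links inside the saved HTML, and an extensionless page such as `/about` would clash with a later `/about/...` path, so it is a starting point rather than a finished mirror tool.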

Does anyone have any idea as to how I might more effectively go about my little task?
