This repository has been archived by the owner on Sep 17, 2020. It is now read-only.
My general use case is this: I have a personal website that I recorded at one point but no longer have the original files for. I'd like to rehost the site, but without the old source code I cannot. I'm trying to find a way to reconstruct the site, with all of its links intact, from the WARC file in my backups, but thus far my attempts have been in vain.
I've found that webrecorder (not the player) puts its WARC files together in such a way that other programs built over the years cannot take them apart (e.g. warc to zip, warcat, or warc-extractor): each runs into errors when trying to work out the indexing of the WARC.
As such, I ran sudo netstat -tulpn | grep -i webrecord, which gave me a host:port of http://127.0.0.1:35535. I found that instead of going through webrecorder-player, I could open the whole site in Chrome by going to http://127.0.0.1:35535/local/collection/http://deltabravozu.lu. Because I can access it in the browser with all links working as they do in webrecorder-player, I figured I should be able to crawl the site and pull down the intact site structure using, say, wget or httrack. Thus far, though, I've been able to crawl nothing more than the first page and some random offsite links encoded in the webrecorder-player server (e.g. https://www.w3.org).
For wget, I used:
wget --force-directories --timestamping --level=inf --no-remove-listing --debug --page-requisites --adjust-extension --convert-links --retry-connrefused --span-hosts --follow-ftp --retry-on-host-error --execute robots=off http://127.0.0.1:35535/local/collection/http://deltabravozu.lu
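To make the goal concrete, here is a rough Python sketch of the mapping I'm after: take a URL served by the local replay server, recover the original URL, and decide where that page would live in a rehosted copy. It is only an illustration under assumptions, not a working extractor. The helper names `to_original` and `local_path` are hypothetical, the prefix is simply the one I saw above (the port will differ between runs), and I'm assuming the replay URLs carry no extra segment such as a timestamp between the collection name and the original URL.

```python
from urllib.parse import urlsplit

# Assumption: the replay prefix observed above; the port changes between runs.
REPLAY_PREFIX = "http://127.0.0.1:35535/local/collection/"
SITE_HOST = "deltabravozu.lu"


def to_original(replay_url, prefix=REPLAY_PREFIX):
    """Strip the replay prefix; return None for URLs outside the replay server."""
    if not replay_url.startswith(prefix):
        return None
    return replay_url[len(prefix):]


def local_path(original_url, host=SITE_HOST):
    """Map an original URL on `host` to a relative file path, else None."""
    parts = urlsplit(original_url)
    if parts.hostname != host:
        return None  # offsite link, e.g. https://www.w3.org
    path = parts.path.lstrip("/")
    if path == "" or path.endswith("/"):
        path += "index.html"  # directory-style URLs get an index file
    return path


if __name__ == "__main__":
    replay = REPLAY_PREFIX + "http://deltabravozu.lu"
    original = to_original(replay)
    print(original, "->", local_path(original))
```

Anything for which either helper returns None (such as the stray https://www.w3.org links) would be skipped, which is the filtering I have not managed to get out of wget or httrack.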
Does anyone have any idea as to how I might more effectively go about my little task?