Releases: netarchivesuite/solrwayback
SolrWayback bundle 4.2.1
The SolrWayback distribution is an out-of-the-box solution for exploring archived webpages in ARC/WARC format.
Runs under Windows/Linux/macOS.
All components now run under Java 11 (and still under Java 8 as well).
Download: https://github.com/netarchivesuite/solrwayback/releases/download/4.2.1/solrwayback_package_4.2.1.zip
log4shell security alert
SolrWayback itself does not use log4j 2+ and is not directly affected by CVE-2021-44228.
The SolrWayback bundle uses Solr 7.7.3, which is affected by log4shell. Please follow the Solr log4shell mitigation guide if the bundled Solr is used. The quickest fix, taken from the guide, is:
- (Linux/MacOS) Edit your solr.in.sh file to include: SOLR_OPTS="$SOLR_OPTS -Dlog4j2.formatMsgNoLookups=true"
- (Windows) Edit your solr.in.cmd file to include: set SOLR_OPTS=%SOLR_OPTS% -Dlog4j2.formatMsgNoLookups=true
If another version of Solr is used, note that Solr >= 7.4 and < 8.11 are vulnerable. See the mitigation guide above for details.
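As a quick aid for the version rule above, here is a minimal Python sketch that checks whether a Solr version string falls inside the range quoted in these notes (>= 7.4 and < 8.11). The function names are hypothetical, and the check only encodes that range; it does not detect whether the `-Dlog4j2.formatMsgNoLookups=true` mitigation has already been applied.

```python
import re


def parse_version(text):
    """Turn a string such as '7.7.3' into (7, 7, 3); missing parts count as 0."""
    parts = [int(p) for p in re.findall(r"\d+", text)][:3]
    return tuple(parts + [0] * (3 - len(parts)))


def is_in_vulnerable_range(solr_version):
    """True for versions >= 7.4 and < 8.11, the range stated in the release notes."""
    return (7, 4, 0) <= parse_version(solr_version) < (8, 11, 0)


if __name__ == "__main__":
    for version in ("7.3.1", "7.7.3", "8.10.1", "8.11.0"):
        print(version, is_in_vulnerable_range(version))
```

For the authoritative list of affected versions and fixes, the Solr mitigation guide remains the reference.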
No more live leaks.
From version 4.2.1, SolrWayback comes with a built-in Serviceworker (JavaScript worker) that redirects or blocks all live leaks. This works in modern browsers.
Playback will still work in legacy browsers using URL rewrites, but it can leak to the live web unless an HTTP proxy or a sandbox is used.
How to upgrade from previous version 4.1.1 (or higher):
To upgrade from a previous version, you need to replace the solrwayback.war in the 'apache-tomcat-8.5.60/webapps' folder.
Also add the following properties to 'solrwayback.properties' in your home folder if they are not present:
#Solr caching. Will be default false if not defined
solr.server.caching=true
solr.server.caching.max.entries=10000
solr.server.caching.age.seconds=86400
Add the following properties to 'solrwaybackweb.properties' in your home folder if they are not present:
#English
wordcloud.stopwords=i,me,my,myself,we, ...
(Take the full list from the property file in the release, which also comes with a Danish stopword list.)
To upgrade from an older version, compare solrwayback.properties and solrwaybackweb.properties with the versions in the release and add the missing properties to your files.
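The comparison of property files can be done by hand or with a diff tool. As an illustration only, the following Python sketch lists keys that are present in the release's property file but missing from a local one. The function names are hypothetical, and the parser is a simplification: it assumes `key=value` lines and treats lines starting with `#` or `!` as comments, which covers the properties shown in these notes but not every Java properties syntax.

```python
def property_keys(text):
    """Collect the keys of 'key=value' lines, skipping blanks and comments."""
    keys = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!":
            continue
        keys.add(line.split("=", 1)[0].strip())
    return keys


def missing_keys(release_text, local_text):
    """Keys defined in the release file but absent from the local file."""
    return sorted(property_keys(release_text) - property_keys(local_text))


if __name__ == "__main__":
    release = (
        "#Solr caching. Will be default false if not defined\n"
        "solr.server.caching=true\n"
        "solr.server.caching.max.entries=10000\n"
        "solr.server.caching.age.seconds=86400\n"
    )
    local = "solr.server.caching=true\n"
    for key in missing_keys(release, local):
        print("missing:", key)
```

In practice the two strings would be read from the release's property file and from the copy in your home folder.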
Changes since release 4.1.0:
4.2.1
Further improvements in serviceworker:
a) The SolrWaybackRoot-servlet application is no longer required if the Serviceworker is loaded. For legacy browsers where the Serviceworker does not work, the root servlet is still required for improved playback.
b) In rare cases the referer is missing, so the crawl time of the origin resource is unknown. By default the current year is then used as the crawl time. This is often not relevant for playback, since such requests are usually to trackers and ads.
Cleaned up logging to the solrwayback.log file. It should be less spammy now.
Upgraded frontend dependencies (security updates).
Fixed a bug in "load more facets" for the domain facet when a filter query was also involved.
4.2.0
All playback live leaks are now blocked or redirected back to SolrWayback by a JavaScript Serviceworker added to playback. No more leaking to the live web! This also improves playback when the live leak can be resolved in SolrWayback. (Thanks to Ilya Kreymer for pointing me in this direction).
The Serviceworker implementation requires the SolrWayback server to run under HTTPS. This can be achieved by placing an Apache or Nginx server in front of the Tomcat.
The Serviceworker feature is supported by most recent browser versions. See: https://caniuse.com/serviceworkers
Playback will still work in legacy browsers using URL rewrites, but it can leak to the live web if not blocked by a proxy server or sandboxed.
Encoding fix in javascript rewrite: Modify < > handling to preserve the original representation (including faulty ones). This closes SOLRWBFB-58
Upgraded frontend dependencies (security updates).
4.1.2
Wordcloud stop words can be configured in solrwaybackweb.properties.
Added new property (wordcloud.stopwords) in solrwaybackweb.properties with default stopwords (English). An empty stopword list is used if the property is not defined.
Word cloud extraction reduced from 10,000 to 5,000 HTML pages, as the difference in results was minimal but performance doubles.
API method to extract word+count for a query+filterquery(optional) : /services/frontend/wordcloud/wordfrequency?q=xxx&fg=yyy
API method to extract wordcloud image for query+filterquery(optional): /services/frontend/wordcloud/query?q=xxx&fg=yyy
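To show how the two wordcloud endpoints above could be called, here is a small Python sketch that only builds the request URLs. The paths and the parameter names `q` and `fg` are copied from these notes as written; the base URL (`http://localhost:8080/solrwayback`), the helper name and the example filter value are assumptions for illustration and need to be adapted to the actual installation.

```python
from urllib.parse import urlencode

# Assumed base URL of a local installation; adjust host, port and context path.
BASE_URL = "http://localhost:8080/solrwayback"


def wordcloud_url(endpoint, query, filter_query=None, base_url=BASE_URL):
    """Build a URL for 'wordfrequency' (word + count) or 'query' (image)."""
    params = {"q": query}
    if filter_query:
        # The filter query is optional, as stated in the release notes.
        params["fg"] = filter_query
    return f"{base_url}/services/frontend/wordcloud/{endpoint}?{urlencode(params)}"


if __name__ == "__main__":
    print(wordcloud_url("wordfrequency", "climate change"))
    # 'crawl_year:2020' is a hypothetical filter used only as an example.
    print(wordcloud_url("query", "climate change", "crawl_year:2020"))
```

`urlencode` takes care of escaping spaces and characters such as `:` so the Solr query syntax survives the trip through the URL.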
Solr query caching for performance boost.
Added new optional properties in solrwayback.properties
#Solr caching. Will be default false if not defined
solr.server.caching=true
solr.server.caching.max.entries=10000
solr.server.caching.age.seconds=86400
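The two limits above (a maximum number of entries and a maximum age in seconds) describe a bounded cache with expiry. The release notes do not say how SolrWayback implements this, so the following Python sketch is only one plausible reading: the class name and the eviction policy (drop the oldest inserted entry first) are assumptions, not SolrWayback's actual code.

```python
import time
from collections import OrderedDict


class BoundedTtlCache:
    """Cache limited by entry count and entry age (illustrative only)."""

    def __init__(self, max_entries=10000, max_age_seconds=86400, clock=time.monotonic):
        self.max_entries = max_entries
        self.max_age_seconds = max_age_seconds
        self._clock = clock
        self._entries = OrderedDict()  # key -> (stored_at, value)

    def get(self, key):
        """Return the cached value, or None if absent or older than the age limit."""
        item = self._entries.get(key)
        if item is None:
            return None
        stored_at, value = item
        if self._clock() - stored_at > self.max_age_seconds:
            del self._entries[key]
            return None
        return value

    def put(self, key, value):
        """Store a value and evict the oldest entries beyond the size limit."""
        self._entries.pop(key, None)
        self._entries[key] = (self._clock(), value)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)


if __name__ == "__main__":
    cache = BoundedTtlCache(max_entries=2, max_age_seconds=60)
    cache.put("q=foo", "result for foo")
    print(cache.get("q=foo"))
```

With the default values from the properties, a cached query result would be reused for up to a day, and at most 10000 results would be held at once.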
When clicking a link and opening playback in a new tab, the browser URL will match the crawl time of the HTML page.
The file location of the two property-files solrwayback.properties and solrwaybackweb.properties can be configured so they do not have
to be in the HOME directory.
To change the location, copy this file: https://github.com/netarchivesuite/solrwayback/blob/master/src/main/webapp/META-INF/context.xml
to the folder '/apache-tomcat-8.5.60/conf/Catalina/localhost' and rename it to solrwayback.xml
Uncomment the environment variables and edit the location of the files. During startup of the Tomcat server, the
values will be logged in solrwayback.log.
Updated the README.md with more information about scaling and using SolrWayback in production.
See full changelog: https://github.com/netarchivesuite/solrwayback/blob/master/CHANGES.md
SolrWayback bundle 4.1.1
The SolrWayback distribution is an out-of-the-box solution for exploring archived webpages in ARC/WARC format.
Runs under Windows/Linux/macOS.
All components now run under Java 11 (and still under Java 8 as well).
Download: https://github.com/netarchivesuite/solrwayback/releases/download/4.1.1/solrwayback_package_4.1.1.zip
Changes since 4.1.0:
Added a better parallel indexing script for Linux/macOS with more options (warc-indexer.sh).
With warc-indexer.sh you can define the number of threads. It keeps track of already indexed WARC files, so you can start it again after adding new WARC files to the folder.
Example: THREADS=20 ./warc-indexer.sh warcs1
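The idea of remembering which WARC files are already indexed, so that a rerun only picks up new files, can be sketched in Python as follows. This is not how warc-indexer.sh is implemented; the tracking-file format (one file name per line), the function names and the file suffixes are assumptions made purely for illustration.

```python
import tempfile
from pathlib import Path

# Assumed archive suffixes; adjust to the files actually being indexed.
ARCHIVE_SUFFIXES = (".warc", ".warc.gz", ".arc", ".arc.gz")


def pending_warcs(warc_dir, tracking_file):
    """Return archive files in warc_dir not yet listed in tracking_file."""
    tracking = Path(tracking_file)
    done = set(tracking.read_text().splitlines()) if tracking.exists() else set()
    candidates = sorted(
        p for p in Path(warc_dir).iterdir() if p.name.endswith(ARCHIVE_SUFFIXES)
    )
    return [p for p in candidates if p.name not in done]


def mark_indexed(path, tracking_file):
    """Append the file name to the tracking file after successful indexing."""
    with open(tracking_file, "a") as handle:
        handle.write(Path(path).name + "\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        (Path(folder) / "a.warc.gz").touch()
        (Path(folder) / "b.warc.gz").touch()
        tracking = Path(folder) / "indexed.txt"
        mark_indexed(Path(folder) / "a.warc.gz", tracking)
        print([p.name for p in pending_warcs(folder, tracking)])
```

A second run after adding new files to the folder would then return only the files that have not been marked yet.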
The file location of the two property-files solrwayback.properties and solrwaybackweb.properties can be configured so they do not have
to be in the HOME directory.
To change the location, copy this file: https://github.com/netarchivesuite/solrwayback/blob/master/src/main/webapp/META-INF/context.xml
to the folder '/apache-tomcat-8.5.60/conf/Catalina/localhost' and rename it to solrwayback.xml
Uncomment the environment variables and edit the location of the files. During startup of the Tomcat server, the
values will be logged in solrwayback.log.
Updated the README.md with more information about scaling and using SolrWayback in production.
See full changelog: https://github.com/netarchivesuite/solrwayback/blob/master/CHANGES.md
SolrWayback bundle 4.1.0
The SolrWayback distribution is an out-of-the-box solution for exploring archived webpages in ARC/WARC format.
Runs under Windows/Linux/macOS.
All components now run under Java 11 (and still under Java 8 as well).
Download: https://github.com/netarchivesuite/solrwayback/releases/download/4.1.0/solrwayback_package_4.1.0.zip
Unzip the folder and read the README.md file and follow the instructions.
Changes since 4.0.6:
Indexing scripts updated
Introduced JavascriptPlayback class. Does nothing but handle brotli, but can later be improved to do url-replacement in javascript files.
Brotli encoding fix for javascript.
Fixed chunked transfer encoding error when HTTP header declared it was chunked, but was not.
New optional properties can be added to solrwaybackweb.properties to limit maximum number of export results for CSV/WARC.
SolrWayback bundle 4.0.6
The SolrWayback distribution is an out-of-the-box solution for exploring archived webpages in ARC/WARC format.
Runs under Windows/Linux/macOS.
All components now run under Java 11 (and still under Java 8 as well).
A Windows-only GUI tool is also included. With this tool you can index files by selecting WARC files with a file chooser, and you can clear an index.
Run AddOn/SolrSetup.exe to open the GUI tool.
Download: https://github.com/netarchivesuite/solrwayback/releases/download/4.0.6/solrwayback_package_4.0.6.zip
Unzip the folder and read the README.md file and follow the instructions.
SolrWayback bundle 3.2.1
The SolrWayback distribution is an out-of-the-box solution for exploring archived webpages in ARC/WARC format.
Runs under Windows/Linux/macOS.
Java 8/9/10 are compatible. (Not Java 11)
Download: https://github.com/netarchivesuite/solrwayback/releases/download/3.2.1/solrwayback_package.zip
Unzip the folder and read the README.md file and follow the instructions.
Upgrade from 3.1
To update from 3.1, add the new properties to solrwaybackweb.properties and solrwayback.properties. Download the release to see the new properties.
Replace the war file in Tomcat with the one from this release, add the new root servlet (ROOT.WAR), and restart Tomcat. Both war files are attached to this release for download and go in the apache-tomcat-8.5.29/webapps/ folder.