This project aims to replace the legacy search system. The legacy, NutchWAX-based search system will be deprecated in favour of Solr as the full-text search backend. To accomplish this, the project provides a new API implementation that decouples the Arquivo.pt API from the old project, making it backend agnostic and able to work with both NutchWAX and Solr.
To compile the project, the Arquivo.pt NutchWAX project libraries must be available in the machine's local Maven repository, so that the Page Search API can use NutchWaxSearchService as the full-text backend (in the future these libraries can be removed).
NOTE: Do not try to compile with any Java version other than Java 8, because of the NutchWAX dependencies.
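Before building, confirm that Java 8 is the active JDK, for example (the JAVA_HOME path below is only an illustration and will differ between machines):

$ java -version                                        # should report version 1.8.x
$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # example path; point it at your Java 8 install
$ mvn -version                                         # Maven should now report Java 1.8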
a) Satisfy the following page-search-api/pom.xml requirements for NutchWaxSearchService (legacy backend):
<dependency>
  <groupId>pt.arquivo</groupId>
  <artifactId>pwalucene</artifactId>
  <version>1.0.0-SNAPSHOT</version>
</dependency>
<dependency>
  <groupId>org.archive.nutchwax</groupId>
  <artifactId>nutchwax-plugins</artifactId>
  <version>0.11.0-SNAPSHOT</version>
</dependency>
You will need the following to satisfy these dependencies:
hadoop-common (0.14)
$ git clone -b branch-0.14 https://github.com/arquivo/hadoop-common.git
$ mvn clean install -f hadoop-common/pom.xml
PwaLucene and PwaArchive-access
$ git clone https://github.com/arquivo/pwa-technologies.git
$ mvn clean install -f pwa-technologies/PwaLucene/pom.xml
$ mvn clean install -f pwa-technologies/PwaArchive-access/pom.xml
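After these builds you can optionally check that the artifacts required by page-search-api/pom.xml were installed in the local Maven repository (the paths below assume the default ~/.m2 location):

$ ls ~/.m2/repository/pt/arquivo/pwalucene/1.0.0-SNAPSHOT/
$ ls ~/.m2/repository/org/archive/nutchwax/nutchwax-plugins/0.11.0-SNAPSHOT/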
b) Clone and compile Page Search
$ git clone https://github.com/arquivo/pagesearch.git
$ mvn clean install -f pagesearch/pom.xml
To compile the project without running the Docker integration tests, run mvn install with the docker-integration-tests profile deactivated:
$ mvn clean install -f pagesearch/pom.xml -P !docker-integration-tests
Note: you need to be logged in to hub.docker.com to be able to publish an image to Arquivo's Docker Hub repository.
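If you are not yet authenticated, log in first (this prompts for your Docker Hub credentials):

$ docker login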
$ cd pagesearch/scripts
$ ./build-solr-test-image.sh
The Swagger UI documenting the Page Search API (preprod) is available at:
https://preprod.arquivo.pt/pagesearch/swagger-ui.html#/
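As a quick smoke test, the service can be queried from the command line. The example below is a sketch that assumes the /textsearch endpoint and the q/maxItems parameters of the Arquivo.pt search API; confirm the exact paths and parameters in the Swagger UI above:

$ curl "https://preprod.arquivo.pt/textsearch?q=arquivo&maxItems=5"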
Available jobs:
- HdfsPageSearchDataDriver
- PageSearchDataDriver
- InvertLinksDriver
- SolrPageDocDriver
Example of expected workflow:
$ yarn jar pagesearch-index-job-0.0.1-jar-with-dependencies.jar HdfsPageSearchDataDriver -D collection="TESTE" input.txt output
$ yarn jar pagesearch-index-job-0.0.1-jar-with-dependencies.jar InvertLinksDriver -D mapred.reduce.tasks=<nr_reduces> output
$ yarn jar pagesearch-index-job-0.0.1-jar-with-dependencies.jar SolrPageDocDriver -D mapred.reduce.tasks=<nr_reduces> output
The indexing workflow above is also available as a helper script:
https://github.com/arquivo/page-search/blob/master/scripts/index-pagesearch.sh
$ ./index-pagesearch.sh <hdfs_warcfiles_folder> <hdfs_output_folder> <collection_name>
Example:
$ ./index-pagesearch.sh /user/dbicho/AWP2 /user/dbicho/output_AWP2 AWP2
Write a reference.conf file with the parsing configuration. The parser's default configuration is:
{
  "warc" : {
    "solr" : {
      "server" : "http://localhost:8983/solr/searchpages"
    },
    "index" : {
      "extract" : {
        # Restrict record types:
        "record_type_include" : [
          response, revisit
        ],
        "record_response_include" : [
          "2"
        ],
        "record_primary_mimetype_include" : [
          text, application
        ],
        "record_mimetype_exclude" : [
          xml, css, javascript, x-javascript, json
        ]
      }
    }
  }
}
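For example, to point the indexer at a different Solr endpoint you only need to change the server entry (the URL below is a placeholder). Whether a partial file is merged with the bundled defaults depends on how the configuration is loaded, so copying the full default block above and editing it is the safer option:

{
  "warc" : {
    "solr" : {
      "server" : "http://my-solr-host:8983/solr/searchpages"
    }
  }
}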