- Java v1.8.0_222
- Spark v2.4.4
- Hadoop v3.2.1
- RocksDB - compiled from source
- `spark.home` = bin directory inside the Spark home where the Spark binaries are located, e.g. /home/user/spark/bin
- `hadoop.doc_data.dir` = HDFS URI of the directory containing the documents, e.g. hdfs://localhost:9000/user/<username>/doc_contents
- `hadoop.url_mapping.dir` = HDFS URI of the directory containing the file with the URL-document mapping, e.g. hdfs://localhost:9000/user/<username>/url_data
- `rocksdb.forwardindex.dir` = RocksDB directory for storing the forward index, e.g. /tmp/forward-index. Please note, /tmp must exist.
- `rocksdb.invertedindex.dir` = RocksDB directory for storing the inverted index, e.g. /tmp/inverted-index. Please note, /tmp must exist.
- `rocksdb.url_mapping.dir` = RocksDB directory for storing the URL-document mappings, e.g. /tmp/url-doc-map. Please note, /tmp must exist.
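
The RocksDB properties only need their parent directory (e.g. /tmp) to exist up front. If you would rather keep the stores somewhere other than /tmp, create that parent directory first; the path below is purely illustrative:

    mkdir -p /data/minigoogle-rocksdb
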
Steps to set up the project locally:
- Clone the repository from source.
- Create an HDFS path for storing the data by running the following commands:

      hdfs dfs -mkdir /user
      hdfs dfs -mkdir /user/<username>

  Note: Please make sure Hadoop is running before executing the above commands.
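
  If you want to confirm the directory was created before moving on, a simple listing should show it (assuming the same <username> placeholder as above):

      hdfs dfs -ls /user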
- Create an HDFS sub-directory and load all of the documents into it by running the following commands:

      hdfs dfs -mkdir /user/<username>/doc_contents
      hdfs dfs -put </path/to/local/*.*> /user/<username>/doc_contents
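
  Listing the sub-directory is a quick way to check that the documents were uploaded; the output will depend on your data set:

      hdfs dfs -ls /user/<username>/doc_contents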
- Similarly, create an HDFS sub-directory and load the `id_url_pairs.txt` file into it by running the following commands:

      hdfs dfs -mkdir /user/<username>/url_data
      hdfs dfs -put </path/to/local/id_url_pairs.txt> /user/<username>/url_data
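
  To sanity-check that the mapping file made it into HDFS, you can print its first few lines (what they contain depends entirely on your id_url_pairs.txt):

      hdfs dfs -cat /user/<username>/url_data/id_url_pairs.txt | head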
- Run the following command from the `minigoogle` directory with appropriate path values to generate the executable. An example command is shown below:

      ./mvnw clean install \
        -Dspark.home=/home/kautilya/spark-2.4.4-bin-hadoop2.7/bin \
        -Dhadoop.doc_data.dir=hdfs://localhost:9000/user/kautilya/data \
        -Dhadoop.url_mapping.dir=hdfs://localhost:9000/user/kautilya/url_data \
        -Drocksdb.forwardindex.dir=/tmp/app-kv \
        -Drocksdb.invertedindex.dir=/tmp/app-iv-kv \
        -Drocksdb.url_mapping.dir=/tmp/url-kv
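
  A successful build should leave the packaged artifacts under target/; the exact jar name depends on the project's pom.xml, so the wildcard below is only indicative:

      ls target/*.jar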
- Run the following command from the `minigoogle` directory with appropriate path values to start the web server. An example command should look something like the one below:

      ./mvnw spring-boot:run -Dspring-boot.run.arguments="--spark.home=/home/kautilya/spark-2.4.4-bin-hadoop2.7/bin,--hadoop.doc_data.dir=hdfs://localhost:9000/user/kautilya/data,--hadoop.url_mapping.dir=hdfs://localhost:9000/user/kautilya/url_data,--rocksdb.forwardindex.dir=/tmp/app-kv,--rocksdb.invertedindex.dir=/tmp/app-iv-kv,--rocksdb.url_mapping.dir=/tmp/url-kv"

  Note: By default the application runs on port 8080.
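
  Once the server has started, any HTTP response on port 8080 confirms that it is listening. For example, the check below prints just the status code; whatever code comes back (even a 404) simply shows the application is up:

      curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/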
APIs exposed by the application:

- `/minigoogle/api/v1/save/index` - Computes the forward and inverted indices and persists both in RocksDB.
- `/minigoogle/api/v1/search?query=<your query>` - Returns the list of URLs containing any of the words in the query.
- `/minigoogle/api/v1/get/forwardIndex` - Returns the entire forward index persisted in RocksDB. NOTE: Results can be huge.
- `/minigoogle/api/v1/get/invertedIndex` - Returns the entire inverted index persisted in RocksDB. NOTE: Results can be huge.
- Hit the API `http://<HOST:PORT>/minigoogle/api/v1/save/index` to generate the indices.
- After generating the indices, search using the API `http://<HOST:PORT>/minigoogle/api/v1/search?query=<your query>` to retrieve the results.
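
For illustration, with the server running locally on the default port 8080 and a made-up two-word query standing in for your own, the two calls could look like this with curl (spaces in the query are URL-encoded as %20):

    # Build the forward and inverted indices and persist them in RocksDB
    curl "http://localhost:8080/minigoogle/api/v1/save/index"

    # Search for documents containing any of the query words
    curl "http://localhost:8080/minigoogle/api/v1/search?query=hello%20world"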