- Java v1.8.0_222
- Spark v2.4.4
- Hadoop v3.2.1
- RocksDB - compiled from source
- `spark.home` = bin directory inside the Spark home where the Spark binaries are located, e.g. /home/user/spark/bin
- `hadoop.doc_data.dir` = HDFS URI of the directory containing the documents, e.g. hdfs://localhost:9000/user/<username>/doc_contents
- `hadoop.url_mapping.dir` = HDFS URI of the directory containing the file with the URL-document mapping, e.g. hdfs://localhost:9000/user/<username>/url_data
- `rocksdb.forwardindex.dir` = RocksDB directory for storing the forward index, e.g. /tmp/forward-index. Please note, /tmp must exist.
- `rocksdb.invertedindex.dir` = RocksDB directory for storing the inverted index, e.g. /tmp/inverted-index. Please note, /tmp must exist.
- `rocksdb.url_mapping.dir` = RocksDB directory for storing the URL-document mappings, e.g. /tmp/url-doc-map. Please note, /tmp must exist.
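
The RocksDB properties only need their parent directory (e.g. /tmp) to exist up front. If you would rather keep the stores somewhere other than /tmp, create that parent directory first; the path below is purely illustrative:

    mkdir -p /data/minigoogle-rocksdb
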
Steps to set up the project locally:
- Clone the repository from source.
- Create an HDFS path for storing the data by running the following commands:

      hdfs dfs -mkdir /user
      hdfs dfs -mkdir /user/<username>

  Note: Please make sure Hadoop is running before executing the above commands.
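
  If you want to confirm the directory was created before moving on, a simple listing should show it (assuming the same <username> placeholder as above):

      hdfs dfs -ls /user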
- Create an HDFS sub-directory and load all of the documents into it by running the following commands:

      hdfs dfs -mkdir /user/<username>/doc_contents
      hdfs dfs -put </path/to/local/*.*> /user/<username>/doc_contents
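
  Listing the sub-directory is a quick way to check that the documents were uploaded; the output will depend on your data set:

      hdfs dfs -ls /user/<username>/doc_contents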
- Similarly, create an HDFS sub-directory and load the `id_url_pairs.txt` file into it by running the following commands:

      hdfs dfs -mkdir /user/<username>/url_data
      hdfs dfs -put </path/to/local/id_url_pairs.txt> /user/<username>/url_data
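
  To sanity-check that the mapping file made it into HDFS, you can print its first few lines (what they contain depends entirely on your id_url_pairs.txt):

      hdfs dfs -cat /user/<username>/url_data/id_url_pairs.txt | head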
- Run the following command from the `minigoogle` directory with appropriate path values to generate the executable. An example command is shown below:

      ./mvnw clean install \
        -Dspark.home=/home/kautilya/spark-2.4.4-bin-hadoop2.7/bin \
        -Dhadoop.doc_data.dir=hdfs://localhost:9000/user/kautilya/data \
        -Dhadoop.url_mapping.dir=hdfs://localhost:9000/user/kautilya/url_data \
        -Drocksdb.forwardindex.dir=/tmp/app-kv \
        -Drocksdb.invertedindex.dir=/tmp/app-iv-kv \
        -Drocksdb.url_mapping.dir=/tmp/url-kv
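
  A successful build should leave the packaged artifacts under target/; the exact jar name depends on the project's pom.xml, so the wildcard below is only indicative:

      ls target/*.jar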
- Run the following command from the `minigoogle` directory with appropriate path values to start the web server. An example command should look something like the one below:

      ./mvnw spring-boot:run -Dspring-boot.run.arguments="--spark.home=/home/kautilya/spark-2.4.4-bin-hadoop2.7/bin,--hadoop.doc_data.dir=hdfs://localhost:9000/user/kautilya/data,--hadoop.url_mapping.dir=hdfs://localhost:9000/user/kautilya/url_data,--rocksdb.forwardindex.dir=/tmp/app-kv,--rocksdb.invertedindex.dir=/tmp/app-iv-kv,--rocksdb.url_mapping.dir=/tmp/url-kv"

  Note: By default the application runs on port 8080.
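
  Once the server has started, any HTTP response on port 8080 confirms that it is listening. For example, the check below prints just the status code; whatever code comes back (even a 404) simply shows the application is up:

      curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/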
APIs exposed by the application:

- `/minigoogle/api/v1/save/index` - Computes the forward and inverted indices and persists both in RocksDB.
- `/minigoogle/api/v1/search?query=<your query>` - Returns the list of URLs containing any of the words in the query.
- `/minigoogle/api/v1/get/forwardIndex` - Returns the entire forward index persisted in RocksDB. NOTE: Results can be huge.
- `/minigoogle/api/v1/get/invertedIndex` - Returns the entire inverted index persisted in RocksDB. NOTE: Results can be huge.
- Hit the API `http://<HOST:PORT>/minigoogle/api/v1/save/index` to generate the indices.
- After generating the indices, search using the API `http://<HOST:PORT>/minigoogle/api/v1/search?query=<your query>` to retrieve the results.
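
For illustration, with the server running locally on the default port 8080 and a made-up two-word query standing in for your own, the two calls could look like this with curl (spaces in the query are URL-encoded as %20):

    # Build the forward and inverted indices and persist them in RocksDB
    curl "http://localhost:8080/minigoogle/api/v1/save/index"

    # Search for documents containing any of the query words
    curl "http://localhost:8080/minigoogle/api/v1/search?query=hello%20world"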