This project is a simplified recreation of the algorithm Google uses to manage all the websites it serves. The project is merely a proof of concept, so many shortcuts were taken to save money and resources. No physical hardware such as hard disks or servers was used; instead, hard disks are simulated in software by a HardDisk class. Additionally, only the necessary features of a browser and search engine were implemented, via their respective interfaces.
To understand this algorithm, I must first explain the vital classes and variables in its implementation:
This is a Java class that simulates a physical hard disk. It uses a TreeMap rather than a hash table so that it could store the hypothetical billions of entries for web pages and words if Google were actually using this. The map keys are Longs and the values are InfoFiles. Each key is the index of the first block of storage on the hard disk at which a web page or word is stored. The smallest block of memory on a hard disk is 512 bytes, so each index position represents an increment of at most 512 bytes. For example, suppose website a takes up 1111 bytes and website b takes up 500 bytes. Website a's InfoFile would be stored at index 0, because it is the first website on the hard disk, and website b would be stored at index 3, because website a uses up 2 full blocks and partially uses the 3rd block of 512 bytes.
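As a rough illustration, a minimal version of this class might look like the sketch below. The write method, its size parameter, and the nextFreeBlock bookkeeping are illustrative assumptions, not the actual API.

```java
import java.util.TreeMap;

// Simulates a physical hard disk whose smallest addressable block is 512 bytes.
// Keys are the index of the first block a file occupies; values are the stored InfoFiles.
public class HardDisk extends TreeMap<Long, InfoFile> {
    private static final int BLOCK_SIZE = 512;
    private long nextFreeBlock = 0;   // first block not yet occupied by any file

    // Stores a file and returns the block index it was written to.
    public long write(InfoFile file, long sizeInBytes) {
        long index = nextFreeBlock;
        put(index, file);
        // Round up to whole blocks: a 1111-byte file consumes 3 blocks, a 500-byte file 1 block.
        nextFreeBlock += (sizeInBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
        return index;
    }
}
```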
This class is used to store all the data associated with each page and each word referenced on all the pages. The InfoFiles are the objects that are stored on the HardDisk objects. This class can be broken down into its attributes:
data - Contains a string which is equal to either the url or the word that this InfoFile corresponds to.
influence - A Long equal to the final influence after the page-rank algorithm is done running
influenceTemp - A long equal to the temporary influence of the file during one iteration of the page-rank algorithm
indices - An array of the index positions at which this InfoFile's data is referenced on other web pages. For example, suppose web pages a and b are stored at index positions 0 and 7 on the simulated hard disk. If b references a, then a's InfoFile.indices would contain a 7.
final influence - Final influence after the designated number of iterations of the ranking algorithm.
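Putting the attributes above together, an InfoFile could be sketched as follows; the constructor and the use of a List in place of the array are illustrative choices, not the actual class definition.

```java
import java.util.ArrayList;
import java.util.List;

// Holds everything the engine stores about one url or one word.
public class InfoFile {
    String data;                 // the url or the word this file corresponds to
    Long influence = 0L;         // final influence once the page-rank iterations finish
    long influenceTemp = 0L;     // influence accumulated during the current iteration
    List<Long> indices = new ArrayList<>();  // disk indices of the pages that reference this data

    InfoFile(String data) {
        this.data = data;
    }
}
```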
These variables are HardDisk objects: one stores every word seen on every website, and the other stores every url seen.
These variables are both maps whose keys are the word or url as a String and whose values are the index position of that url or word on its respective hard disk. There is one key difference between the two: urlToIndex uses a TreeMap so that it can handle the large volume of urls on the internet, while wordToIndex uses a HashMap because there are far fewer words, and keeping the HashMap in memory is not an issue.
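In code, the two lookup maps could be declared as below; the surrounding class is just a placeholder, and only the variable names come from the text.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class IndexMaps {   // placeholder class holding the two fields
    // url -> index of its first block on the page hard disk; a TreeMap for the huge volume of urls
    Map<String, Long> urlToIndex = new TreeMap<>();

    // word -> index of its first block on the word hard disk; small enough to keep in memory
    Map<String, Long> wordToIndex = new HashMap<>();
}
```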
The complete algorithm can be broken down into three distinct sub-algorithms:
The algorithm is provided a list of urls to begin with, and it initializes a Queue with these urls. It then performs a breadth-first search, indexing each page and every word on the page as it goes. The algorithm also has checks in place to ensure that each url is only indexed once and that each word is only indexed once per page. Indexing a word or url simply means adding it to its hard disk and adding the data with its corresponding index into urlToIndex or wordToIndex.
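A hedged sketch of the crawl loop is shown below, assuming jsoup is used to fetch pages; the indexPage and indexWord helpers are hypothetical stand-ins for the real indexing code described above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class Crawler {
    // Breadth-first crawl starting from the seed urls; each url is indexed at most once.
    static void crawl(List<String> seedUrls) {
        Queue<String> frontier = new ArrayDeque<>(seedUrls);
        Set<String> visited = new HashSet<>(seedUrls);

        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            try {
                Document doc = Jsoup.connect(url).get();
                indexPage(url);                                     // add the url to the page disk

                // A set ensures each word is indexed only once per page.
                Set<String> wordsOnPage = new HashSet<>(Arrays.asList(doc.text().split("\\s+")));
                for (String word : wordsOnPage) {
                    indexWord(word, url);
                }

                for (Element link : doc.select("a[href]")) {
                    String next = link.attr("abs:href");
                    if (visited.add(next)) {                        // true only the first time this url is seen
                        frontier.offer(next);
                    }
                }
            } catch (Exception e) {
                // unreachable or malformed pages are simply skipped
            }
        }
    }

    // Hypothetical helpers standing in for the real indexing code.
    static void indexPage(String url) { /* add url to the page disk and urlToIndex */ }
    static void indexWord(String word, String url) { /* add word to the word disk and wordToIndex */ }
}
```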
This is a simplified version of Google's page-rank algorithm. It works by assigning each page an initial influence of 1 and having each page give its entire influence away, split proportionally among the webpages it references. This process is repeated for a number of iterations until the influence of each page converges to its final value.
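A rough sketch of the iteration is shown below, operating on a plain adjacency map rather than the HardDisk representation; doubles are used for readability even though the real code stores influence as a Long, and all names are illustrative.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PageRank {
    // pages maps each url to the list of urls it references.
    // Returns each url mapped to its influence after the given number of iterations.
    static Map<String, Double> rank(Map<String, List<String>> pages, int iterations) {
        Map<String, Double> influence = new HashMap<>();
        for (String url : pages.keySet()) {
            influence.put(url, 1.0);                      // every page starts with an influence of 1
        }

        for (int i = 0; i < iterations; i++) {
            Map<String, Double> next = new HashMap<>();
            for (String url : pages.keySet()) {
                next.put(url, 0.0);
            }
            for (Map.Entry<String, List<String>> entry : pages.entrySet()) {
                List<String> outLinks = entry.getValue();
                if (outLinks.isEmpty()) {
                    continue;                             // a page with no references gives nothing away in this sketch
                }
                double share = influence.get(entry.getKey()) / outLinks.size();
                for (String target : outLinks) {
                    // each page gives away its entire influence, split evenly among its references
                    next.merge(target, share, Double::sum);
                }
            }
            influence = next;
        }
        return influence;
    }
}
```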
This portion of the algorithm actually returns the results of your search query: the 10 pages that match the query and have the highest influences. The 10 best search results are found by storing matches in a priority queue whose root holds the lowest-influence match. By limiting the size of the queue to 10, the algorithm can compare any new match against the root of the PriorityQueue and do nothing if the new match's influence is lower. Otherwise, the algorithm polls the root and offers the new match in its place, and the PriorityQueue comparator places the new match in its correct ranking.
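Below is a small sketch of the top-10 selection; the Result class and its fields are hypothetical stand-ins for the real match type.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopResults {
    // A matching page together with its influence; a stand-in for the real result type.
    static class Result {
        final String url;
        final long influence;
        Result(String url, long influence) { this.url = url; this.influence = influence; }
    }

    // Keeps only the 10 highest-influence matches using a min-heap whose root is the weakest match.
    static List<Result> topTen(Iterable<Result> matches) {
        PriorityQueue<Result> best = new PriorityQueue<>(Comparator.comparingLong((Result r) -> r.influence));
        for (Result match : matches) {
            if (best.size() < 10) {
                best.offer(match);
            } else if (match.influence > best.peek().influence) {
                best.poll();              // drop the current weakest result
                best.offer(match);        // the comparator slots the new match into place
            }
        }
        return new ArrayList<>(best);     // heap order, not ranked; sort by influence if needed
    }
}
```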
To work properly, the jsoup-1.8.3.jar file must be added as a dependency in the configurations.
To see a demo of the algorithm, run the search file. For the purposes of the demo, a word disk and a page disk file have already been provided. These files only contain the numbers 1 through 90 as words, and searching multiple numbers will return only multiples of those numbers. For example, searching 2 and 3 should return multiples of 2 and 3, but only the 10 results with the highest influence are returned. If no matches exist, an empty array is returned.
This project was created for CSC220 at the University of Miami taught by Victor Milenkovic. Dr. Milenkovic created the framework and also provided guidance throughout the project.
This project is licensed under the MIT License - see the LICENSE file for details.