Final project for CS 410 Text Information Systems at the University of Illinois Urbana-Champaign
Historian is a Google Chrome extension that builds a knowledge graph of the webpages you visit, based on how similar they are to one another.
- Builds graph of webpages visited
- Visualizes the similarity of webpages in history
- Clusters similar webpages together under shared topics
- Presents an intuitive way to research
Create a Google Chrome extension that acts as a knowledge graph builder for webpages that the user visits while researching information online.
The extension should represent the user's visited webpages as nodes in a graph where the edges reflect the relative similarity between them such that similar webpages will be clustered together.
This will offer users an intuitive way to visualize their search history while performing online research and remove the need for third-party applications to track this information.
Task | Assigned To |
---|---|
Learn how to make a Chrome extension | Everyone |
Visualizing Graphs | Blake |
Web Scraping | Megha |
Similarity Algorithm | Michael |
Build Frontend | Rohan |
Build Backend | Kaushal |
Note
The sections that follow provide comprehensive instructions for setting up Historian on your local device. After completing Step 1 and Step 4, you can run the demo script to install dependencies, start the local server, and open some sample webpages for a quick demonstration of how Historian works.
The table below gives an overview of the dependencies for this project as well as the versions used. You can install the packages directly or run the `setup.py` script as discussed in the next section.
Item | Version |
---|---|
Python | 3.12.0 |
NumPy | 1.26.1 |
SciPy | 1.11.3 |
SciKit-Learn | 1.3.2 |
NLTK | 3.8.1 |
BeautifulSoup | 4.12.2 |
Plotly | 5.18.0 |
Dash | 2.14.2 |
Flask | 3.0.0 |
Flask-Caching | 2.1.0 |
Flask-Cors | 4.0.0 |
Regex | 2023.10.3 |
Alive-Progress | 3.1.5 |
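If you install packages by hand, a quick check against the table can catch version drift. The sketch below is not part of the repository; the PyPI distribution names in `PINNED` are assumptions about how each package is published, and `check_versions` is a hypothetical helper.

```python
from importlib import metadata

# Versions copied from the dependency table above.
# The distribution names are assumed, not taken from config/requirements.txt.
PINNED = {
    "numpy": "1.26.1",
    "scipy": "1.11.3",
    "scikit-learn": "1.3.2",
    "nltk": "3.8.1",
    "beautifulsoup4": "4.12.2",
    "plotly": "5.18.0",
    "dash": "2.14.2",
    "Flask": "3.0.0",
    "Flask-Caching": "2.1.0",
    "Flask-Cors": "4.0.0",
    "regex": "2023.10.3",
    "alive-progress": "3.1.5",
}


def check_versions(pinned):
    """Return {name: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_versions(PINNED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

An empty result means every listed package is installed at the pinned version.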
Important
Historian is not currently being hosted on a domain, which means that the only way to currently use this extension is by running the server locally on your machine. The instructions below will guide you through the setup process step-by-step.
First you need to clone this repo to your local machine to access the server as well as the extension. The instructions below are adapted from GitHub's documentation on cloning repositories; for more information, please refer to the docs.
- Navigate to the main page of the repository.
- Above the list of files, click <> Code.
- Copy the URL for the repository.
- Open Git Bash.
- Change the current working directory to the location where you want the cloned repository, e.g. `cd path/to/folder`.
- Type `git clone`, and then paste the URL you copied earlier, e.g. `git clone https://github.com/blakepm2/CS410_Final_Project`.
- Press Enter to create your local clone.
After cloning the repo, you can install the dependencies using `setup.py` or manually with `pip`.
Note
Python 3.12 is required in order to run `setup.py`. If you do not have Python 3.12 installed, please download it here before proceeding.
To install with `setup.py`:
- Navigate to the directory where you saved the repository: `cd path/to/repository`
- Run `setup.py` to install all dependencies: `py -3.12 setup.py`

To install manually with `pip`:
- Navigate to the directory where you saved the repository: `cd path/to/repository`
- In the terminal, run the following command: `py -3.12 -m pip install -r config/requirements.txt`
Once you've successfully cloned the repo and installed the necessary dependencies, you can host the local server on your machine to enable the backend functionality of the extension.
- Navigate to the directory where you saved the repository.
- Run `server.py`.
  - This will begin hosting a server on your local machine.
- Verify that the server is running.
  - You can verify that the server is running by visiting http://127.0.0.1:8050/ in a web browser.
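The same check can be scripted instead of done in a browser. This is a minimal sketch using only the standard library; `server_is_up` is a hypothetical helper, and the default URL is the address given above.

```python
import urllib.error
import urllib.request


def server_is_up(url="http://127.0.0.1:8050/", timeout=2.0):
    """Return True if an HTTP server answers at `url`, even with an error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded, just not with a success status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timed out, or similar: nothing is listening.
        return False


if __name__ == "__main__":
    print("server is running" if server_is_up() else "server is not reachable")
```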
Important
Historian works by sending a list of the URLs from your history to the server, which then performs the computations needed to create the graph. Once these are complete, the server asynchronously updates the graph on the frontend for you to see. The server must therefore be running for your results to be visualized.
In order to use the extension in Chrome, you need to load an unpacked version into your extensions. The instructions below are adapted from Google Chrome's documentation, which you can consult for more information.
- On your computer, open Chrome.
- At the top right, click More (three dots) > Extensions > Manage Extensions.
- At the top right, enable Developer mode.
- At the top left, click Load unpacked.
- Navigate to the directory where you stored the repository, and select the `extension` folder.
- Verify that the extension has been loaded.
  - Once the extension has been successfully loaded into Chrome, you should see Historian listed in My extensions.
After the extension has been loaded into Chrome and the local server has been started, you should now be able to use Historian to see the hierarchical graph of your recent search history.
- On your computer, open Chrome.
- Visit some webpages.
- Try to visit different kinds of webpages so that the app can highlight the divisions between them (e.g. "best snack foods", "top 10 careers for computer science majors", "best nba players of all time").
- If you're having trouble coming up with ideas or would rather use some pre-selected samples, please use the demo.
- In the top right, click Extensions (the puzzle-piece icon) and select Historian from the dropdown menu.
- In Historian, click Visualize History.
You should see a graph populate with lines connecting nodes that represent the hierarchical clusters of your browsing history.
If there are no errors on the server side but the extension takes too long to load your dendrogram, the most likely cause is the server's Cross-Origin Resource Sharing (CORS) policy. This happens because the Extension ID Chrome assigns when you load Historian may differ from the one included in the code.
To fix this, go to Chrome > Manage Extensions and copy the Extension ID shown under Historian. Then open `server.py` and replace the Extension ID on line 37 with your own.
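To see why the ID matters: requests from an extension carry an origin of the form `chrome-extension://<Extension ID>`, and a CORS allow-list that names a different ID will reject them. The sketch below illustrates that comparison; `extension_origin` and `is_origin_allowed` are hypothetical helpers, not code from `server.py`.

```python
import re

# Chrome extension IDs are 32 characters drawn from the letters a-p.
EXTENSION_ID_PATTERN = re.compile(r"[a-p]{32}")


def extension_origin(extension_id):
    """Build the origin string for requests made by the given extension."""
    extension_id = extension_id.strip()
    if not EXTENSION_ID_PATTERN.fullmatch(extension_id):
        raise ValueError(f"not a valid Chrome extension ID: {extension_id!r}")
    return f"chrome-extension://{extension_id}"


def is_origin_allowed(request_origin, allowed_extension_id):
    """Return True only if the request comes from the allowed extension."""
    return request_origin == extension_origin(allowed_extension_id)
```

Validating the ID before using it can catch a stray space or a partially copied value, which would otherwise fail silently as a CORS rejection.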
If the server throws an error saying it failed to create the dendrogram due to a mismatch in dimensions, this is likely because either the webpages could not be parsed or the webpages had the same titles. Currently, Historian can only analyze distinct webpages (i.e. webpages with unique titles).
To fix this, either visit some different webpages or run the demo script to get a set of pre-selected webpages to use.
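One way to guard against this case is to filter pages before clustering. The following is a sketch only, assuming pages are available as `(title, text)` pairs; `distinct_by_title` is a hypothetical helper and is not how Historian currently handles duplicates.

```python
def distinct_by_title(pages):
    """Keep the first page seen for each title and drop pages with no text."""
    seen = set()
    kept = []
    for title, text in pages:
        key = title.strip().lower()  # treat "Home" and "home " as the same title
        if not text.strip() or key in seen:
            continue
        seen.add(key)
        kept.append((title, text))
    return kept


pages = [("A", "x"), ("a ", "y"), ("B", ""), ("C", "z")]
print(distinct_by_title(pages))
```

Hierarchical clustering needs at least two pages to produce a merge, so it may also be worth checking the length of the filtered list before building the dendrogram.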
Historian defines several modules to facilitate its functionality. The table below provides a high-level overview of these modules with links to their respective code and documentation.
Module | Purpose | Documentation |
---|---|---|
Document | Represent webpages as documents | Link |
WebScraper | Extract webpage text data | Link |
HierarchicalClustering | Perform agglomerative hierarchical clustering | Link |
Dendrogram | Visualize hierarchical clusters | Link |
Frontend | Enable user functionality | Link |
A class to represent scraped webpages as documents
src.webScraping.document.Document(self, title, text, url)
- self `Document`: The `Document` object
- title `str`: The title of the webpage
- text `str`: The text data of the webpage
- url `str`: The URL of the webpage
None
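A container with these three fields could look like the sketch below. It uses a dataclass for brevity and is an illustration of the documented constructor, not the repository's implementation.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """A scraped webpage: its title, its extracted text, and its URL."""
    title: str
    text: str
    url: str


doc = Document(
    title="Example Domain",
    text="This domain is for use in illustrative examples.",
    url="https://example.com/",
)
print(doc.title)
```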
A class for extracting text data from webpages
src.webScraping.webScraper.WebScraper(self)
- self `WebScraper`: The `WebScraper` object
getWebpageText( self, response )
Extracts and preprocesses the text of a webpage from a given `requests.Response` object
Parameters
- self `WebScraper`: The `WebScraper` object
- response `requests.Response`: A `requests.Response` object for a given URL
Returns str
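The text-extraction step can be sketched as follows. Historian lists BeautifulSoup as a dependency; this example swaps in the standard library's `html.parser` so that it runs without extra packages, and `html_to_text` is a hypothetical name. In practice the input would be the `text` attribute of the `requests.Response` object.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, ignoring script, style, and noscript content."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Return the visible text of an HTML document with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()
```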
scrapeWebpages( self, urls )
Extracts text data from the webpage(s) at the given URLs and saves each page's text as a string in the `WebScraper.corpus` hashmap
Parameters
- self `WebScraper`: The `WebScraper` object
- urls `list`: A list of URLs for the webpages you want to scrape
Returns dict
A class to perform agglomerative hierarchical clustering with average link for a collection of webpages
src.graphing.hierarchicalClustering.HierarchicalClustering(self)
- self `HierarchicalClustering`: The `HierarchicalClustering` object
preprocess( self, text )
Preprocesses text data from a document by performing normalization, tokenization, and lemmatization
Parameters
- self `HierarchicalClustering`: The `HierarchicalClustering` object
- text `str`: The text data from a webpage document
Returns str
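The three stages can be illustrated with a dependency-free sketch. Historian lists NLTK as a dependency; here a crude suffix-stripping rule stands in for real lemmatization, so treat `naive_lemma` as a placeholder assumption rather than the project's method.

```python
import re


def naive_lemma(token):
    """Crude stand-in for lemmatization: strip a few common plural endings."""
    if len(token) > 4 and token.endswith("ies"):
        return token[:-3] + "y"
    if len(token) > 3 and token.endswith("s") and not token.endswith(("ss", "us", "is")):
        return token[:-1]
    return token


def preprocess(text):
    """Normalize, tokenize, and (naively) lemmatize a document's text."""
    normalized = text.lower()                   # normalization
    tokens = re.findall(r"[a-z]+", normalized)  # tokenization: runs of letters
    lemmas = [naive_lemma(token) for token in tokens]
    return " ".join(lemmas)


print(preprocess("The Best NBA Players, and Careers!"))
```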
preprocess_docs( self, docs )
Preprocesses text for a collection of documents by performing normalization, tokenization, and lemmatization
Parameters
- self `HierarchicalClustering`: The `HierarchicalClustering` object
- docs `list`: A list of the documents to preprocess
Returns list
extract_features( self, docs )
Applies Term Frequency (TF) - Inverse Document Frequency (IDF) weighting to a set of (processed) documents
Parameters
- self `HierarchicalClustering`: The `HierarchicalClustering` object
- docs `list`: A list of the processed documents you want to analyze
Returns numpy.ndarray
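To make the weighting concrete, here is a small pure-Python sketch. It uses raw term counts and `log(N / df)` as the IDF, which is one textbook variant; scikit-learn's smoothed and normalized weighting gives different numbers, and the function name `tfidf_matrix` is an assumption.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocabulary, rows): one TF-IDF row per whitespace-tokenized document."""
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] * math.log(n_docs / df[term]) for term in vocabulary])
    return vocabulary, rows


vocabulary, rows = tfidf_matrix(["snack food", "snack career"])
print(vocabulary)
print(rows)
```

A term that appears in every document, such as "snack" here, gets a weight of zero, which is why shared boilerplate words contribute little to the similarity between pages.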
create_hierarchical_cluster( self, tfidf_matrix )
Performs hierarchical/agglomerative clustering for a TF-IDF weighted matrix of text data from a collection of documents using Average-Link
Parameters
- self `HierarchicalClustering`: The `HierarchicalClustering` object
- tfidf_matrix `numpy.ndarray`: A TF-IDF weighted matrix of text data
Returns numpy.ndarray
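Average-link clustering repeatedly merges the two clusters whose members are, on average, closest to each other. The sketch below shows the idea in pure Python. It assumes cosine distance between rows, which the documentation above does not specify, and it reports merges in the same row layout as `scipy.cluster.hierarchy.linkage` (the two merged cluster ids, their distance, and the new cluster's size).

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity; all-zero vectors are treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def average_link(rows):
    """Return merges as (cluster_a, cluster_b, distance, size) tuples."""
    n = len(rows)
    # Pairwise distances between the original points, keyed by (smaller, larger) index.
    dist = {(i, j): cosine_distance(rows[i], rows[j]) for i, j in combinations(range(n), 2)}
    clusters = {i: [i] for i in range(n)}  # cluster id -> member point indices

    def average_distance(a, b):
        pairs = [(min(p, q), max(p, q)) for p in clusters[a] for q in clusters[b]]
        return sum(dist[pair] for pair in pairs) / len(pairs)

    merges = []
    next_id = n  # new clusters are numbered n, n + 1, ...
    while len(clusters) > 1:
        a, b = min(combinations(sorted(clusters), 2), key=lambda pair: average_distance(*pair))
        distance = average_distance(a, b)
        members = clusters.pop(a) + clusters.pop(b)
        clusters[next_id] = members
        merges.append((a, b, distance, len(members)))
        next_id += 1
    return merges


merges = average_link([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]])
print(merges)
```

This version recomputes averages from scratch on every pass, which is simple to read but slow for large histories; a library implementation would be the better choice beyond a few hundred pages.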
create_dendrogram( self, cluster, docs)
Creates a dendrogram to visualize a hierarchical/agglomerative cluster
Parameters
- self `HierarchicalClustering`: The `HierarchicalClustering` object
- cluster `numpy.ndarray`: The hierarchical cluster of the data
- docs `list`: A list of the original documents
Returns Dendrogram
A class to visualize a hierarchical clustering of webpages
src.graphing.dendrogram.Dendrogram(self, cluster, docs)
- self `Dendrogram`: The `Dendrogram` object
- cluster `numpy.ndarray`: The hierarchical cluster of the data
- docs `list`: A list of the original documents
create( self )
Creates a dendrogram figure for a hierarchical/agglomerative cluster
Parameters
- self `Dendrogram`: The `Dendrogram` object
Returns plotly.graph_objs.Figure
create_lines( self )
Creates the lines representing the relationships between nodes in a dendrogram
Parameters
- self `Dendrogram`: The `Dendrogram` object
Returns None
create_nodes( self )
Creates the nodes representing the documents in a dendrogram
Parameters
- self `Dendrogram`: The `Dendrogram` object
Returns None
create_layout( self )
Creates the layout of the dendrogram
Parameters
- self `Dendrogram`: The `Dendrogram` object
Returns None
Enables user functionality by sending data to the server
Uses the Google Chrome history API to fetch the user's recent browsing history
Parameters
None
Returns Response
Checks the server to see if the preprocessing has been done so it can fetch the graph
Parameters
None
Returns boolean
Loads the graph created by the server into the frontend for the user to see
Parameters
None
Returns boolean
Leverages API call to send the user's browsing history over to the server for processing
Parameters
None
Returns data.message
Shows a spinner while the page loads
Parameters
None
Returns None
Hides the spinner after a page has finished loading
Parameters
None
Returns None
"⭐" denotes Team Leader
Name | NetID/Email | Contributions |
---|---|---|
Blake McBride ⭐ | blakepm2@illinois.edu | Created Document class; created WebScraper class; created HierarchicalClustering class; created Dendrogram class; configured agglomerative hierarchical clustering algorithm; designed webscraping logic; wrote visualization logic; configured local server; created all functions for and designed frontend; implemented API calls from frontend to server; created setup and demo scripts; designed logo(s); wrote setup instructions; wrote documentation; designed and wrote README; wrote, edited, and produced video presentation. |
Kaushal Dadi | kdadi2@illinois.edu | Created manifest.json; put iframe in HTML to show graph on webpage; built preliminary frontend. |
Rohan Parekh | rohanjp2@illinois.edu | Helped Kaushal create manifest.json and the Chrome extension that displays the graph on the webpage. |
Megha Chada | megharc2@illinois.edu | Changed colors for graph; added comments to code; added title, timestamp, and description to graph; created architectural diagram. |
Michael Ma | chiuyin2@illinois.edu | Added unfinished topic labels to the graph. |