- Introduction
- Uses
- Compiling & Running
- Quality Assurance
- Future Features
- Suggestions
- Contributing
- License
A while ago, I wondered which Wikipedia page is linked to by the most other Wikipedia pages. I'd already found the page with the most links to other pages (List of Android smartphones), but I couldn't find anywhere that answered my question. I guess I was kinda stupid, because as of 4/3/23 I found out that Wikipedia already had the answers: Most Linked to Pages & Pages Linking to a Page. Since I wasn't able to find the answers on Wikipedia (even though they were there all along 🤦), I decided to make my own program to find them.

At first, I had to download all of Wikipedia and then extract all the titles and links from the pages. To do this, I copied most of my Python section from Will Koehrsen's Downloading and Parsing Wikipedia Articles. There was still a fair amount I had to figure out and change to actually get it working (e.g. inside `In [4]` it tells you to use `soup_dump.find_all('li', {'class': 'file'}, limit = 10)[:4]`, when it actually needs `soup_dump.find_all('a')`), so it still took me quite some time to work out the intention of the code & all the libraries in order to update & fix it. Once I figured out how to download & parse all of Wikipedia into JSON files using Python, I wrote the analysis section of the program in C++. The program loads all of the data parsed by the Python section into a graph (using an adjacency list implementation), which can then be analyzed by the methods below!
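To give a concrete picture of that structure, here's a minimal sketch of an adjacency-list graph keyed by page title. It's purely illustrative: the class and member names here are invented for the example and aren't the program's actual code.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative sketch only, not the project's real class.
// Each page title maps to the list of page titles it links to.
class wikipedia_graph {
public:
    // Record a link from one page to another.
    void add_link(const std::string& from, const std::string& to) {
        adjacency_list[from].push_back(to);
        adjacency_list.try_emplace(to);  // make sure the target page exists as a node
    }

    // Pages that the given page links to (throws if the page is unknown).
    const std::vector<std::string>& links_from(const std::string& page) const {
        return adjacency_list.at(page);
    }

private:
    std::unordered_map<std::string, std::vector<std::string>> adjacency_list;
};
```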
In the current version, after loading the data into the graph, the user can find all the information about a given page, get information about the graph/Wikipedia as a whole, find the most linked-to pages up to a user-specified number, or find all the pages linking to a page. Any page titles you input are case sensitive!!! All data the user decides to save is saved in the `data/user_data/` folder. The program also saves the data it needs to load in from the `data/user_data/` folder, so it doesn't have to download & parse all of Wikipedia every time it's run.
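As a rough idea of what finding the most linked-to pages involves under the hood, it amounts to counting in-degrees over the adjacency list and sorting. The sketch below is my assumption about the approach, not the program's actual function:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Illustrative sketch: count how many pages link to each page, then rank them.
std::vector<std::pair<std::string, std::size_t>> most_linked_to(
    const std::unordered_map<std::string, std::vector<std::string>>& adjacency_list,
    std::size_t top_n) {
    std::unordered_map<std::string, std::size_t> in_degree;
    for (const auto& entry : adjacency_list) {
        for (const auto& target : entry.second) {
            ++in_degree[target];  // one more page links to target
        }
    }
    std::vector<std::pair<std::string, std::size_t>> ranked(in_degree.begin(), in_degree.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    if (ranked.size() > top_n) {
        ranked.resize(top_n);  // keep only the user-specified number of results
    }
    return ranked;
}
```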
If you want to learn about how it does this, or how to call the functions yourself, check out the Program Structure page in the wiki.
Double click `dependencies`, or run `bash dependencies` or `./dependencies` in the root directory, to install the Python dependencies. All the C++ dependencies are already included in `source/c++/`. You must have pip installed to download the new dependencies, and you'll need to install Python yourself if you haven't already.
Double click `compile`, or run `bash compile` or `./compile` in the command line in the root directory. You must have a version of gcc or clang that supports C++20 installed.
YOU HAVE TO COMPILE & INSTALL THE DEPENDENCIES BEFORE TRYING TO RUN THE PROGRAM!!!
Double click `run`, or run `bash run` or `./run` in the command line in the root directory.
Every new release is run with leaks (Apple's equivalent of Valgrind) to ensure there are no memory leaks. The program is compiled with `-Wall` & `-Wextra` to catch as many warnings as possible, and with `-Werror` to make sure every warning is dealt with before the files can compile. All variable, function, class, module, & file names are written in snake_case to keep everything consistent, and all `const` variables are written in ALL-CAPS. The code is also thoroughly commented, so it should be easy enough to understand what's going on.
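For illustration only, those conventions look roughly like this (an invented snippet, not code from this repo):

```cpp
#include <cstddef>

// const values in ALL-CAPS, everything else in snake_case.
const std::size_t MAX_RESULTS = 100;

// Descriptive comments explain what each function does.
std::size_t clamp_result_count(std::size_t requested_results) {
    return requested_results > MAX_RESULTS ? MAX_RESULTS : requested_results;
}
```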
Also, I know the python section could probably be more ✨𝒫𝓎𝓉𝒽ℴ𝓃𝒾𝒸✨, but I just started seriously learning python, so I'm sure there are many things I could improve on!
I've also tested out every function and section of the code multiple times with all the data to make sure everything runs properly!
If there are any other/better ways to check for quality assurance, please let me know in the suggestions!
For any news on future features, or if you want to suggest some of your own, check out FUTURE_FEATURES.md.
If you have any suggestions about anything, please create a new discussion in suggestions. I'm only a second year computer science student, so I'm sure there are many things I could improve on!
Contributions are always welcome! Look at CONTRIBUTING.md for more information.
The project is available under the MIT license.