An internet search engine written mostly in Python. Ranking is currently TF-IDF based.
Note that the file naughty-words.txt
is sourced from Coffee and Fun's google-profanity-words.
Performs the initial tasks for setting up an instance. Use the --user
option to specify a privileged MariaDB/MySQL user.
Purges old data from the database once it reaches a specific age. Should be run periodically as a cron job.
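
An age-based purge can be sketched roughly as follows. This is only an illustration, not the project's actual code: it uses Python's built-in sqlite3 module as a stand-in for MariaDB so that it runs anywhere, and the table name `pages`, the column `last_seen` (a Unix timestamp), and the function `purge_old_rows` are all hypothetical.

```python
import sqlite3
import time


def purge_old_rows(conn, max_age_days, now=None):
    """Delete rows older than max_age_days and return how many were removed.

    Assumes a hypothetical table `pages` with an integer `last_seen`
    column holding a Unix timestamp. sqlite3 stands in for MariaDB here.
    """
    now = time.time() if now is None else now
    cutoff = int(now - max_age_days * 86400)  # 86400 seconds per day
    # Parameterized query: the cutoff is never interpolated into the SQL string.
    cur = conn.execute("DELETE FROM pages WHERE last_seen < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```

Against a real MariaDB instance the same parameterized DELETE would apply, although the placeholder style depends on the database driver in use.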
Web crawler. Crawls the internet for new URLs and writes its findings to a file called 'raw-urls.txt'. The --sites
option can be used to pass a comma-separated list of starting URLs.
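
The two core steps of a crawler like this, splitting a comma-separated --sites value and collecting links from a fetched page, might look like the sketch below. It uses only the Python standard library; the names `parse_sites`, `LinkCollector`, and `extract_links` are hypothetical and are not taken from the project's source.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


def parse_sites(value):
    """Split a comma-separated --sites value into a clean list of URLs."""
    return [site.strip() for site in value.split(",") if site.strip()]


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from the anchor tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page they were found on.
        absolute = urljoin(self.base_url, href)
        if urlparse(absolute).scheme in ("http", "https"):
            self.links.append(absolute)


def extract_links(html, base_url):
    """Return every http(s) link found in the given HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

The links returned by `extract_links` could then be appended, one per line, to 'raw-urls.txt'.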
Metadata extractor / web scraper. Extracts metadata from HTML and stores it in a MariaDB database. Uses the 'raw-urls.txt' file as input. The --key
option must be used to provide the path to the key that decrypts db_creds.gpg
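
As a rough illustration of the extraction step only (no database access or credential decryption), the sketch below pulls the title, description, and keywords out of an HTML string with the standard library. The choice of fields and the names `MetadataParser` and `extract_metadata` are assumptions, not the project's actual code.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Gather the <title> text and selected <meta name=...> values."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attrs = dict(attrs)
            name = (attrs.get("name") or "").lower()
            # Only these two meta fields are kept in this sketch.
            if name in ("description", "keywords") and attrs.get("content"):
                self.metadata[name] = attrs["content"].strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data.strip()


def extract_metadata(html):
    """Return a dict of the metadata found in the given HTML string."""
    parser = MetadataParser()
    parser.feed(html)
    return parser.metadata
```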
Performs a TF-IDF based search against the database. Run search.py --help
for more information.
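
For readers unfamiliar with TF-IDF, here is a minimal in-memory sketch of the idea. It is not the implementation in search.py: it ranks a plain dictionary of texts rather than querying the database, uses whitespace tokenization, and the function names `tokenize` and `tf_idf_search` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tf_idf_search(query, documents):
    """Rank document ids by the summed TF-IDF weight of the query terms.

    `documents` maps a document id to its text. Documents that contain
    none of the query terms, or only terms found in every document,
    score zero and are left out.
    """
    tokenized = {doc_id: tokenize(text) for doc_id, text in documents.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        if not tokens:
            continue
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in counts:
                tf = counts[term] / len(tokens)    # term frequency
                idf = math.log(n_docs / df[term])  # inverse document frequency
                score += tf * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)
```

Terms that appear in fewer documents get a higher IDF, so a match on a rare word counts for more than a match on a common one.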
Directory containing source files for website user interface / front end.
Outputs a results page.