Important
- 🕷 Crawler: 1000 pages in 1m12s with 64 threads
- 📓 Indexer: 1000 pages in 47s with 50 threads
- 🔎 Search: not stable enough yet; it could be improved, mainly in the ranker.
- MeowCrawler: crawl the web and insert the data into the database.
- multi-threading
- multi level host priority queue
- handles robots.txt
- url hashing and content hashing to prevent duplicate content
- url filtering
- url normalization
- seeding with a list of urls
- incremental crawling - can be paused and resumed
- creates a sitemap graph for the ranking algorithm
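
As a rough illustration of the URL normalization and URL hashing items in the crawler list, the sketch below lowercases the scheme and host, drops default ports and fragments, and stores a SHA-256 hash of the result to detect repeats. The class name `UrlDeduplicator` and the exact normalization rules are assumptions made for this example, not the project's actual implementation.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; MeowCrawler's real classes may differ.
class UrlDeduplicator {
    private final Set<String> seenUrlHashes = new HashSet<>();

    // Assumed rules: lowercase scheme/host, drop default ports and fragments,
    // resolve "." and ".." segments, and treat an empty path as "/".
    static String normalize(String rawUrl) {
        URI uri = URI.create(rawUrl.trim()).normalize();
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("absolute URL expected: " + rawUrl);
        }
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        int port = uri.getPort();
        boolean defaultPort = port == -1
                || (scheme.equals("http") && port == 80)
                || (scheme.equals("https") && port == 443);
        String path = uri.getRawPath() == null || uri.getRawPath().isEmpty() ? "/" : uri.getRawPath();
        String query = uri.getRawQuery() == null ? "" : "?" + uri.getRawQuery();
        return scheme + "://" + host + (defaultPort ? "" : ":" + port) + path + query;
    }

    // Hex-encoded SHA-256 of the given text.
    static String sha256Hex(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    // Returns true the first time a URL is seen, false for any equivalent URL afterwards.
    boolean markIfNew(String rawUrl) {
        return seenUrlHashes.add(sha256Hex(normalize(rawUrl)));
    }
}
```

The same `sha256Hex` helper could be applied to the fetched page body to approximate the content hashing item; a real crawler would presumably persist the hashes in the database rather than in memory.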
- MeowIndexer: tokenize and index the crawled data.
- multi-threading
- stores tokens in an inverted index collection
- records the TF and positions of the tokens
- handles stemming (Porter Stemmer); note: exact tokens are required to get higher priority than stemmed ones
- handles stop words
- incremental indexing - could be paused and resumed
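
To make the indexer items above more concrete, here is a minimal in-memory sketch of an inverted index that records token positions (and therefore term frequency) per document while skipping stop words. The class name, the tiny stop-word list, and the tokenization regex are assumptions for this example; stemming and the MongoDB storage are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory index for illustration; the project stores its index in MongoDB.
class InvertedIndexSketch {
    // Assumed, deliberately tiny stop-word list.
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "is", "of", "and");

    // token -> (document id -> positions of the token inside that document)
    private final Map<String, Map<String, List<Integer>>> postings = new HashMap<>();

    void addDocument(String docId, String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
        int position = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            int current = position++; // stop words still advance the position counter
            if (STOP_WORDS.contains(token)) {
                continue;
            }
            postings.computeIfAbsent(token, t -> new HashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(current);
        }
    }

    // Positions of a token in a document, or an empty list if it does not occur.
    List<Integer> positions(String token, String docId) {
        Map<String, List<Integer>> perDoc = postings.getOrDefault(token, Map.of());
        return perDoc.getOrDefault(docId, List.of());
    }

    // Term frequency is simply the number of recorded positions.
    int termFrequency(String token, String docId) {
        return positions(token, docId).size();
    }
}
```

Keeping positions rather than only counts is what would make phrase matching possible later, since consecutive query words can be checked for consecutive positions.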
- MeowRanker: search the indexed data.
- search for the query in the inverted index
- uses Google's PageRank algorithm to score the popularity of pages
- rank the results based on the TF-IDF algorithm
- phrase matching
- higher rank bonus for exact matches than for stems
- higher rank bonus for words in important tags like title, h1, h2, etc.
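
The PageRank step in the ranker list can be sketched as a simple power iteration over a link graph such as the sitemap graph produced by the crawler. The class name, the damping factor passed in, and the handling of pages without outgoing links are assumptions for this example; how MeowRanker actually combines popularity with TF-IDF is not shown here.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of PageRank by power iteration; not the project's actual ranker code.
class PageRankSketch {
    // graph maps each page to its outgoing links.
    // Assumption: every link target also appears as a key of the map.
    static Map<String, Double> compute(Map<String, List<String>> graph, double damping, int iterations) {
        int n = graph.size();
        Map<String, Double> rank = new HashMap<>();
        for (String page : graph.keySet()) {
            rank.put(page, 1.0 / n); // start from a uniform distribution
        }
        for (int i = 0; i < iterations; i++) {
            Map<String, Double> next = new HashMap<>();
            for (String page : graph.keySet()) {
                next.put(page, 0.0);
            }
            double danglingMass = 0.0; // rank held by pages with no outgoing links
            for (Map.Entry<String, List<String>> entry : graph.entrySet()) {
                double pageRank = rank.get(entry.getKey());
                List<String> outLinks = entry.getValue();
                if (outLinks.isEmpty()) {
                    danglingMass += pageRank;
                    continue;
                }
                double share = pageRank / outLinks.size();
                for (String target : outLinks) {
                    next.merge(target, share, Double::sum);
                }
            }
            for (String page : graph.keySet()) {
                // Dangling mass is spread evenly so that the ranks keep summing to 1.
                double incoming = next.get(page) + danglingMass / n;
                next.put(page, (1 - damping) / n + damping * incoming);
            }
            rank = next;
        }
        return rank;
    }
}
```

A damping factor of 0.85 is the commonly quoted value; a page that many others link to ends up with a higher score than a page with no incoming links.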
- MeowEngine: query engine and server.
- RESTful API
- snippet generation
- search suggestions and history
- query parsing
- phrase matching queries
- AND, OR, NOT operators in queries
- stop words and stemming
- pagination
- cache
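
For the query parsing items above, one possible first step is to split the raw query into quoted phrases, boolean operators, and plain words. The sketch below does only that; the class name, the token kinds, and the rule that operators must be written in upper case are assumptions for this example rather than MeowEngine's documented behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical tokenizer for illustration; the real query parser may work differently.
class QueryTokenizer {
    enum Kind { PHRASE, OPERATOR, WORD }

    static final class Token {
        final Kind kind;
        final String text;

        Token(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }

        @Override
        public String toString() {
            return kind + ":" + text;
        }
    }

    // Either a double-quoted phrase or a run of non-whitespace characters.
    private static final Pattern PART = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    static List<Token> tokenize(String query) {
        List<Token> tokens = new ArrayList<>();
        Matcher matcher = PART.matcher(query);
        while (matcher.find()) {
            if (matcher.group(1) != null) {
                // Quoted text is kept together so it can be matched as a phrase.
                tokens.add(new Token(Kind.PHRASE, matcher.group(1).trim()));
                continue;
            }
            String word = matcher.group(2);
            // Assumption: operators are only recognized in upper case.
            boolean isOperator = word.equals("AND") || word.equals("OR") || word.equals("NOT");
            tokens.add(isOperator
                    ? new Token(Kind.OPERATOR, word)
                    : new Token(Kind.WORD, word.toLowerCase(Locale.ROOT)));
        }
        return tokens;
    }
}
```

Stop-word removal and stemming would then apply only to `WORD` tokens, leaving phrases and operators untouched.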
- MeowApp: web application.
- custom theming with 4 available themes (light, dark, rose, and black)
- powerful search bar and suggestion components
- fancy pagination element
- navigation and data loading with react-router 6
- Java
- Gradle
- MongoDB
- Spring Boot - for the server only
- ==FRONTEND==
- React
- TypeScript
- Tailwind CSS
- React Router 6
- Java 11
- Gradle
- MongoDB
- Node
Note
To install Java and Gradle, see the Java setup document. To install MongoDB, see the Mongo setup document.
- Clone the repository
git clone <repo-url>
- Install the dependencies
cd Mister-Meow
cd mistermeow
gradle build
- To run the crawler
sudo systemctl start mongod # only needs to be done once
gradle crawl
- To run the indexer
gradle index
- To run the server
gradle engine
- To install and run the web application
cd app/src/meowapp
npm install
npm run dev
Please check the following documents before contributing: