A web search engine built with Python that crawls, indexes, and ranks web pages using the PageRank algorithm. The front end is built with Reflex; it isn't the focus of this project and is used only to demonstrate the results of the engine.
The project consists of the following main components:
- Web Crawler: Multi-threaded and asynchronous crawling
- Indexer: Parses each page with lxml, removes stop words, and lemmatizes the remaining words before indexing
- PageRank: Ranks pages with the PageRank algorithm, using NetworkX
- Queue Management: Redis-backed queues for pages to crawl and pages to index
- Web Interface: Search UI built with Reflex framework
- Respectful Crawling: Honors robots.txt, including crawl delays and allowed paths
- Text Processing: Tokenization, lemmatization and stop word removal
- Ranking: Implements the PageRank algorithm for ranking pages
- High Performance: Asynchronous processing with multi-threading
- Storage: MongoDB for document storage and Redis for queue management
You will need the following services running locally:

- MongoDB (running on localhost:27017)
- Redis (running on localhost:6379)
To install the project:

- Clone the repository

```bash
git clone https://github.com/thiagobapt/SearchEngine.git
cd SearchEngine
```

- Install dependencies

```bash
pip install -r requirements.txt
```
Navigate to the Bot directory and run:

```bash
cd Bot
python Main.py
```

The main menu offers four options:
- Crawl and index: Start the crawling and indexing process
- Page Rank: Calculate PageRank scores for indexed pages
- Load NLTK: Download the required NLTK datasets (**this must be done first**; see the sketch after this list)
- Exit
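The "Load NLTK" option handles the downloads for you; for reference, the text processing steps described below (tokenization, stop word removal, lemmatization) typically rely on the standard NLTK datasets shown here. The exact list the project downloads is not confirmed from the source.

```python
import nltk

# Standard NLTK datasets for tokenization, stop word removal, and lemmatization.
# The "Load NLTK" menu option is expected to fetch the project's equivalents.
nltk.download("punkt")      # tokenizer models used by nltk.word_tokenize
nltk.download("stopwords")  # stop word lists
nltk.download("wordnet")    # lexical database used by WordNetLemmatizer
```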
Navigate to the FrontEnd directory and run:

```bash
cd FrontEnd
reflex run
```

The web interface will be available at http://localhost:3000.
Modify the parameters in `Bot/Main.py` when selecting option 1:

- `low_priority_crawlers`: Number of threads for discovered domains (default: 100)
- `high_priority_crawlers`: Number of threads for new domains (default: 10)
- `max_indexers`: Number of indexing worker threads (default: 4)
- `max_concurrent_indexer`: Concurrent documents per indexer (default: 100)
- `max_concurrent_crawler`: Concurrent requests per crawler (default: 100)
In my experience, indexing finishes quickly, so resources are better spent on additional crawlers. Keep more low-priority crawlers than high-priority ones, since the high-priority queue tends to empty out fast.
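For illustration only, the knobs above could be exposed through a single entry point like the hypothetical `start_engine` below; the function name and signature are not taken from `Bot/Main.py`, only the parameter names and defaults come from this README.

```python
# Hypothetical sketch: start_engine() is illustrative, not the project's API.
def start_engine(
    low_priority_crawlers: int = 100,   # threads for already-discovered domains
    high_priority_crawlers: int = 10,   # threads for newly found domains
    max_indexers: int = 4,              # indexing worker threads
    max_concurrent_indexer: int = 100,  # concurrent documents per indexer
    max_concurrent_crawler: int = 100,  # concurrent requests per crawler
) -> None:
    """Spin up crawler and indexer workers with the given concurrency limits."""
    ...

# Following the advice above: keep many more low-priority than high-priority crawlers.
start_engine(low_priority_crawlers=150, high_priority_crawlers=10)
```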
Default connection settings:

- MongoDB: `mongodb://localhost:27017/`
  - Database: `searchengine`
  - Collections: `indexes`, `pages`, `outgoing_links`
- Redis: `localhost:6379`
  - Queues: `high_priority_queue`, `low_priority_queue`, `indexing_queue`
All the database configuration is done automatically!
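For reference, here is a minimal sketch of what those defaults mean when connecting with pymongo and redis-py; the engine creates its own clients, and the assumption that the queues are plain Redis lists is not confirmed from the source.

```python
from pymongo import AsyncMongoClient
import redis

# Default endpoints listed above; the engine sets these up automatically.
mongo = AsyncMongoClient("mongodb://localhost:27017/")
db = mongo["searchengine"]  # collections: indexes, pages, outgoing_links

queue = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Example: peek at how many URLs are waiting to be crawled
# (assumes the queues are Redis lists).
print(queue.llen("high_priority_queue"), queue.llen("low_priority_queue"))
```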
Adjust the number of PageRank iterations in `Bot/src/Ranker.py`:

```python
ranker = Ranker(db=AsyncMongoClient("mongodb://localhost:27017/"), iterations=100)
```
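To picture what the ranking step does, here is a sketch of PageRank over the stored link graph with NetworkX; the shape of the `outgoing_links` documents and the helper function are assumptions, not the actual `Ranker` implementation.

```python
import networkx as nx

def compute_pagerank(outgoing_links: list[dict], iterations: int = 100) -> dict[str, float]:
    """Build a directed link graph and score pages with PageRank.

    Assumes documents shaped like {"url": "...", "links": ["...", ...]};
    the real outgoing_links schema may differ.
    """
    graph = nx.DiGraph()
    for doc in outgoing_links:
        for target in doc["links"]:
            graph.add_edge(doc["url"], target)
    # NetworkX caps the power iteration via max_iter.
    return nx.pagerank(graph, max_iter=iterations)

scores = compute_pagerank([
    {"url": "https://a.example", "links": ["https://b.example"]},
    {"url": "https://b.example", "links": ["https://a.example", "https://c.example"]},
])
print(scores)
```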
Before indexing, each page's text goes through the following steps (a code sketch follows the list):

- Normalization: Remove extra whitespace and punctuation
- Tokenization: Split text into individual words using NLTK
- Stop word removal: Remove stop words and overly long tokens
- Lemmatization: Reduce words to root forms
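A minimal sketch of these four steps with NLTK; it mirrors the description above rather than the project's exact implementation, and the regular expressions and length cutoff are assumptions.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Requires the NLTK datasets mentioned earlier (punkt, stopwords, wordnet).
STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()
MAX_TOKEN_LENGTH = 30  # the "overly long" cutoff is an assumption

def preprocess(text: str) -> list[str]:
    # Normalization: lowercase, strip punctuation, collapse whitespace.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    text = re.sub(r"\s+", " ", text).strip()
    # Tokenization: split into individual words.
    tokens = nltk.word_tokenize(text)
    # Stop word removal and dropping overly long tokens.
    tokens = [t for t in tokens if t not in STOP_WORDS and len(t) <= MAX_TOKEN_LENGTH]
    # Lemmatization: reduce words to their root forms.
    return [LEMMATIZER.lemmatize(t) for t in tokens]

print(preprocess("The crawlers were crawling   pages, indexing them quickly!"))
```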
The search uses a MongoDB aggregation pipeline (sketched after this list):
- Match documents containing all query terms
- Count term frequency per document
- Join with page metadata (title, description, rank)
- Sort by PageRank score first and then term frequency
- Return formatted results
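A sketch of such a pipeline with pymongo is shown below; the field names (`word`, `url`, `count`, `title`, `description`, `rank`) are assumptions about the index and page documents, not confirmed from the source.

```python
def build_search_pipeline(terms: list[str]) -> list[dict]:
    """Aggregation pipeline following the steps above (terms assumed de-duplicated)."""
    return [
        {"$match": {"word": {"$in": terms}}},                # postings for any query term
        {"$group": {                                         # one row per page
            "_id": "$url",
            "matched": {"$addToSet": "$word"},
            "term_frequency": {"$sum": "$count"},
        }},
        {"$match": {"matched": {"$size": len(terms)}}},      # page must contain all terms
        {"$lookup": {                                        # join page metadata
            "from": "pages",
            "localField": "_id",
            "foreignField": "url",
            "as": "page",
        }},
        {"$unwind": "$page"},
        {"$sort": {"page.rank": -1, "term_frequency": -1}},  # PageRank first, then frequency
        {"$project": {
            "url": "$_id",
            "title": "$page.title",
            "description": "$page.description",
            "rank": "$page.rank",
            "term_frequency": 1,
        }},
    ]

# With the async client, the pipeline would be awaited, e.g.:
#   cursor = await db["indexes"].aggregate(build_search_pipeline(["search", "engine"]))
```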
- Priority Queue System: New domains get high priority and already-known domains get low priority, so the crawler doesn't get stuck on a single website and keeps discovering new pages (see the sketch after this list)
- Robots.txt Compliant: Respects each site's robots.txt rules for crawlable pages and crawl delays
- Duplicate Detection: Keeps track of the URLs it has already seen to avoid crawling the same page twice
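A simplified sketch of how this priority scheme and duplicate detection can be wired with redis-py; the queue names come from this README, while the `seen_urls` / `seen_domains` keys and the list-based queue layout are assumptions.

```python
from urllib.parse import urlparse
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def enqueue(url: str) -> None:
    """Queue a URL for crawling, skipping duplicates and prioritizing new domains."""
    # Duplicate detection: a Redis set of every URL seen so far.
    if not r.sadd("seen_urls", url):
        return  # already queued or crawled
    domain = urlparse(url).netloc
    # New domains go to the high-priority queue, known domains to low priority.
    if r.sadd("seen_domains", domain):
        r.lpush("high_priority_queue", url)
    else:
        r.lpush("low_priority_queue", url)

def next_url() -> str | None:
    """Take from the high-priority queue first, falling back to low priority."""
    return r.rpop("high_priority_queue") or r.rpop("low_priority_queue")
```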
- `aiohttp`: Async HTTP client for web crawling
- `lxml`: Fast XML/HTML parsing
- `nltk`: Natural language processing toolkit
- `pymongo`: MongoDB driver
- `networkx`: Graph algorithms for PageRank
- `redis`: Queue management and caching
- `reflex`: Web framework for the frontend
- `sentence-transformers`: Semantic search capabilities