Web crawlers, also known as spiders, are most often used by search engines such as Google, Bing, and DuckDuckGo to go through web pages and read the content on them. A web crawler is given an initial address to start from, and from there it finds new addresses to browse through. Web crawlers benefit a lot from concurrency, since most of the time is usually spent loading pages. By using a concurrent approach, such as multiple threads loading and reading pages at the same time, the performance of the crawler can be substantially increased.
- You need the VC++ 2019 Runtime, 32-bit and 64-bit versions.
- You will need .NET 8.
- You need to install the version of the VC++ Runtime that Baby Browser needs.
I implemented the solution by creating a class named Crawler. The class packs inside itself:

- HttpClient to make requests
- ConcurrentQueue to hold the urls to visit next
- HashSet for keeping track of all visited urls and all urls found
- Semaphore for protecting variables from data races
- List of tasks that do all the work
- SiteMap (my own implementation) that holds a recursive data structure to map out the web pages
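As a rough sketch, assuming illustrative field names (the real declarations may differ), the members could be laid out like this:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Rough sketch of the members listed above; field names are illustrative,
// not the project's actual code.
public class Crawler
{
    private readonly HttpClient _client = new HttpClient();   // makes the requests
    private readonly ConcurrentQueue<string> _queue = new();  // urls to visit next
    private readonly HashSet<string> _found = new();          // every url found so far
    private readonly HashSet<string> _visited = new();        // urls already crawled
    private readonly SemaphoreSlim _gate = new(1, 1);         // semaphore protecting the sets
    private readonly List<Task> _workers = new();             // tasks that do all the work
    // plus the SiteMap (my own recursive structure), left out of this sketch
}
```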
The Crawler is started by calling the Run method, which creates the desired number of Tasks. A CancellationToken is also passed in case the user wants to stop the execution. Each task starts by dequeuing a url from the queue. If the queue is empty, the task keeps trying to dequeue an item until it either times out (5 seconds) or dequeues a url successfully. The queue uses the ConcurrentQueue class, so it is thread safe.
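As a sketch of this flow, continuing the illustrative Crawler fields above (the Run signature, the WorkAsync name, and the polling interval are assumptions, not the exact implementation):

```csharp
// Continues the illustrative Crawler class above. Run spins up the desired
// number of worker tasks; each worker polls the queue and stops after the
// queue has stayed empty for 5 seconds.
public async Task Run(int taskCount, CancellationToken token)
{
    for (int i = 0; i < taskCount; i++)
        _workers.Add(Task.Run(() => WorkAsync(token), token));

    await Task.WhenAll(_workers);
}

private async Task WorkAsync(CancellationToken token)
{
    var timeout = TimeSpan.FromSeconds(5);

    while (!token.IsCancellationRequested)
    {
        string? url = null;
        var started = DateTime.UtcNow;

        // Keep trying to dequeue until a url shows up or the timeout passes.
        while (!_queue.TryDequeue(out url))
        {
            if (DateTime.UtcNow - started > timeout)
                return;                              // queue stayed empty, stop this worker
            await Task.Delay(100, token);
        }

        await ProcessUrlAsync(url!, token);          // fetch and parse, sketched further below
    }
}
```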
The url is then used to create an HTTP request. First the status code is checked to be OK, and then the Content-Type header is checked to be text/html. If both checks pass, the page is loaded and passed to a helper function which finds all urls by looking for strings in the html that start with href=" and end with the next ".
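A sketch of that helper, assuming an illustrative name like FindLinks (the real helper may differ), could be:

```csharp
// Sketch of the link extraction: scan the raw html for substrings that start
// with href=" and end at the next ". FindLinks is an illustrative name.
private static HashSet<string> FindLinks(string html)
{
    var links = new HashSet<string>();
    const string marker = "href=\"";

    int index = html.IndexOf(marker, StringComparison.OrdinalIgnoreCase);
    while (index >= 0)
    {
        int start = index + marker.Length;
        int end = html.IndexOf('"', start);            // closing quote of the attribute
        if (end < 0)
            break;

        links.Add(html.Substring(start, end - start));
        index = html.IndexOf(marker, end, StringComparison.OrdinalIgnoreCase);
    }

    return links;
}
```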
These links are then added to a HashSet, which is basically an unordered list; I will call it a list for now to keep things simple. Each element in this list is then added to the list which contains all found urls, if it is not already in it, and it also gets added to the queue if the url has not been visited yet and is not an unwanted file type such as .exe or .jpeg. After parsing the urls from the body of the html, the current url is added to the list which holds all visited urls. After that, a new url is dequeued and the same process is repeated.
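Putting these steps together, the per-url processing could look roughly like the sketch below; ProcessUrlAsync, the skipped extension list, and the field names are assumptions reusing the earlier sketches, not the project's exact code.

```csharp
// Illustrative sketch tying the steps together: fetch the page, check the
// status code and the text/html content type, extract the links, and update
// the shared sets while holding the semaphore.
private static readonly string[] SkippedExtensions = { ".exe", ".jpeg", ".jpg", ".png", ".zip" };

private async Task ProcessUrlAsync(string url, CancellationToken token)
{
    using var response = await _client.GetAsync(url, token);
    if (!response.IsSuccessStatusCode)
        return;                                                           // status code was not ok
    if (response.Content.Headers.ContentType?.MediaType != "text/html")
        return;                                                           // only parse html pages

    string html = await response.Content.ReadAsStringAsync(token);
    var links = FindLinks(html);

    await _gate.WaitAsync(token);                                         // protect the shared collections
    try
    {
        foreach (var link in links)
        {
            if (!_found.Add(link))
                continue;                                                 // url already known
            if (_visited.Contains(link))
                continue;                                                 // already crawled
            if (Array.Exists(SkippedExtensions, ext => link.EndsWith(ext, StringComparison.OrdinalIgnoreCase)))
                continue;                                                 // skip unwanted file types
            _queue.Enqueue(link);
        }

        _visited.Add(url);                                                // mark the current page as visited
    }
    finally
    {
        _gate.Release();
    }
}
```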
- User Guide - How to get Koolaid.
- Compilation Guide - How to compile Koolaid.
- Configuration Guide - Configuration settings used in Koolaid.
- Distribution Guide - Create a new installer package.