Web crawlers, also known as spiders, are most often used by search engines such as Google, Bing, and DuckDuckGo to go through web pages and read the content on them. A web crawler is given an initial address to start from, and from there it finds new addresses to browse through. Web crawlers benefit a lot from concurrency, since most of the time is usually spent loading pages. By using a concurrent approach, such as multiple threads loading and reading pages at the same time, the performance of the crawler can be substantially increased.
- You need the VC++ 2019 Runtime, 32-bit and 64-bit versions.
- You will need .NET 8.
- You need to install the version of the VC++ Runtime that Baby Browser needs.
I implemented the solution by creating a class named Crawler. The class packs inside itself:

- HttpClient to make requests
- ConcurrentQueue to hold the urls to visit next
- HashSet for keeping track of all visited urls and all urls found
- Semaphore for protecting variables from data races
- List of tasks that do all the work
- SiteMap (my own implementation) that holds a recursive data structure to map out the web pages
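As a rough sketch, assuming illustrative field names (the real declarations may differ), the members could be laid out like this:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Rough sketch of the members listed above; field names are illustrative,
// not the project's actual code.
public class Crawler
{
    private readonly HttpClient _client = new HttpClient();   // makes the requests
    private readonly ConcurrentQueue<string> _queue = new();  // urls to visit next
    private readonly HashSet<string> _found = new();          // every url found so far
    private readonly HashSet<string> _visited = new();        // urls already crawled
    private readonly SemaphoreSlim _gate = new(1, 1);         // semaphore protecting the sets
    private readonly List<Task> _workers = new();             // tasks that do all the work
    // plus the SiteMap (my own recursive structure), left out of this sketch
}
```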
The Crawler is started by calling the Run method, which creates the desired number of Tasks. A CancellationToken is also passed in case the user wants to stop the execution. Each task starts by dequeuing a url from the queue. If the queue is empty, the task keeps trying to dequeue an item until it either times out (5 seconds) or dequeues a url successfully. The queue uses the ConcurrentQueue class, so it is thread safe.
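As a sketch of this flow, continuing the illustrative Crawler fields above (the Run signature, the WorkAsync name, and the polling interval are assumptions, not the exact implementation):

```csharp
// Continues the illustrative Crawler class above. Run spins up the desired
// number of worker tasks; each worker polls the queue and stops after the
// queue has stayed empty for 5 seconds.
public async Task Run(int taskCount, CancellationToken token)
{
    for (int i = 0; i < taskCount; i++)
        _workers.Add(Task.Run(() => WorkAsync(token), token));

    await Task.WhenAll(_workers);
}

private async Task WorkAsync(CancellationToken token)
{
    var timeout = TimeSpan.FromSeconds(5);

    while (!token.IsCancellationRequested)
    {
        string? url = null;
        var started = DateTime.UtcNow;

        // Keep trying to dequeue until a url shows up or the timeout passes.
        while (!_queue.TryDequeue(out url))
        {
            if (DateTime.UtcNow - started > timeout)
                return;                              // queue stayed empty, stop this worker
            await Task.Delay(100, token);
        }

        await ProcessUrlAsync(url!, token);          // fetch and parse, sketched further below
    }
}
```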
The url is then used to create an HTTP request. First the status code is checked to be OK, and then the Content-Type header is checked to be text/html. If both checks pass, the page is loaded and passed to a helper function which finds all urls by looking for strings in the html that start with href=" and end with the next ".
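A sketch of that helper, assuming an illustrative name like FindLinks (the real helper may differ), could be:

```csharp
// Sketch of the link extraction: scan the raw html for substrings that start
// with href=" and end at the next ". FindLinks is an illustrative name.
private static HashSet<string> FindLinks(string html)
{
    var links = new HashSet<string>();
    const string marker = "href=\"";

    int index = html.IndexOf(marker, StringComparison.OrdinalIgnoreCase);
    while (index >= 0)
    {
        int start = index + marker.Length;
        int end = html.IndexOf('"', start);            // closing quote of the attribute
        if (end < 0)
            break;

        links.Add(html.Substring(start, end - start));
        index = html.IndexOf(marker, end, StringComparison.OrdinalIgnoreCase);
    }

    return links;
}
```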
These links are then added to a HashSet, which is basically an unordered list; I will call it a list for now to keep things simple. Each element in this list is then added to the list which contains all found urls, if it is not already in it, and it also gets added to the queue if the url has not been visited yet and is not an unwanted file type such as .exe or .jpeg. After parsing the urls from the body of the html, the current url is added to the list which holds all visited urls. After that, a new url is dequeued and the same process is repeated.
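Putting these steps together, the per-url processing could look roughly like the sketch below; ProcessUrlAsync, the skipped extension list, and the field names are assumptions reusing the earlier sketches, not the project's exact code.

```csharp
// Illustrative sketch tying the steps together: fetch the page, check the
// status code and the text/html content type, extract the links, and update
// the shared sets while holding the semaphore.
private static readonly string[] SkippedExtensions = { ".exe", ".jpeg", ".jpg", ".png", ".zip" };

private async Task ProcessUrlAsync(string url, CancellationToken token)
{
    using var response = await _client.GetAsync(url, token);
    if (!response.IsSuccessStatusCode)
        return;                                                           // status code was not ok
    if (response.Content.Headers.ContentType?.MediaType != "text/html")
        return;                                                           // only parse html pages

    string html = await response.Content.ReadAsStringAsync(token);
    var links = FindLinks(html);

    await _gate.WaitAsync(token);                                         // protect the shared collections
    try
    {
        foreach (var link in links)
        {
            if (!_found.Add(link))
                continue;                                                 // url already known
            if (_visited.Contains(link))
                continue;                                                 // already crawled
            if (Array.Exists(SkippedExtensions, ext => link.EndsWith(ext, StringComparison.OrdinalIgnoreCase)))
                continue;                                                 // skip unwanted file types
            _queue.Enqueue(link);
        }

        _visited.Add(url);                                                // mark the current page as visited
    }
    finally
    {
        _gate.Release();
    }
}
```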
- User Guide - How to get Koolaid.
- Compilation Guide - How to compile Koolaid.
- Configuration Guide - Configuration settings used in Koolaid.
- Distribution Guide - Create a new installer package.