# GO-CRAWLER

This project is an attempt to build a concurrent web crawler in Go.

The crawler exposes several options, configurable through the following environment variables:

| Option | Environment Variable | Values Accepted | Default Value | Description | Required |
| --- | --- | --- | --- | --- | --- |
| Concurrency | THREAD_COUNT | Integer | 5 | Controls how many concurrent workers the crawler runs with | False |
| URL to crawl | CRAWL_URL | String | - | The URL you want to crawl; must always be provided | True |
| Root Path | ROOT_PATH | String | - | Root directory in which responses are saved; must be a valid directory path when STORE_ON_DISK is true | False |
| Output Control | DISPLAY_URI | Boolean | false | Whether to print the URIs the crawler visits | False |
| Store On Disk | STORE_ON_DISK | Boolean | false | Whether to save fetched responses to the local disk | False |
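
As a rough sketch of how these variables might be read at startup (the `config` struct and `loadConfig` helper here are illustrative assumptions, not the repository's actual code):

```go
package main

import (
	"log"
	"os"
	"strconv"
)

// config is a hypothetical struct mirroring the options table above;
// the real crawler may organize its settings differently.
type config struct {
	threadCount int
	crawlURL    string
	rootPath    string
	displayURI  bool
	storeOnDisk bool
}

// loadConfig reads the documented environment variables, applying the
// defaults from the table when a variable is unset.
func loadConfig() config {
	c := config{threadCount: 5} // THREAD_COUNT defaults to 5

	if v := os.Getenv("THREAD_COUNT"); v != "" {
		n, err := strconv.Atoi(v)
		if err != nil {
			log.Fatalf("THREAD_COUNT must be an integer, got %q", v)
		}
		c.threadCount = n
	}

	// CRAWL_URL is the only required option (it can also be passed
	// as a command-line argument, per the Usage section below).
	c.crawlURL = os.Getenv("CRAWL_URL")
	if c.crawlURL == "" {
		log.Fatal("CRAWL_URL is required")
	}

	c.rootPath = os.Getenv("ROOT_PATH")
	c.displayURI, _ = strconv.ParseBool(os.Getenv("DISPLAY_URI"))
	c.storeOnDisk, _ = strconv.ParseBool(os.Getenv("STORE_ON_DISK"))

	if c.storeOnDisk && c.rootPath == "" {
		log.Fatal("ROOT_PATH must be a valid directory when STORE_ON_DISK is true")
	}
	return c
}

func main() {
	cfg := loadConfig()
	log.Printf("crawling %s with %d workers", cfg.crawlURL, cfg.threadCount)
}
```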

## Usage

Prerequisites:

- Install Go 1.13.1 before running the script locally.
- Install all dependencies with `go mod download`.

Run the crawler with its defaults by passing the URL as an argument:

```
go run crawler.go <URL>
```

Alternatively, supply the URL through its environment variable:

```
CRAWL_URL=<URL> go run crawler.go
```

All other options can be configured either by exporting them as environment variables or by supplying them inline with the go command:

```
ENV_VAR_1=value go run crawler.go
```

The crawler can also be run on its own as a Docker container:

```
docker run -e CRAWL_URL=<URL> baderiapiyush/web-crawler-go:latest
```

## Examples

To run with a concurrency of 3:

```
THREAD_COUNT=3 go run crawler.go <URL>
docker run -e CRAWL_URL=<URL> -e THREAD_COUNT=3 baderiapiyush/web-crawler-go:latest
```

To store responses on disk:

```
STORE_ON_DISK=true ROOT_PATH=/Users/piyushbaderia/response/ go run crawler.go <URL>
docker run -e CRAWL_URL=<URL> -e STORE_ON_DISK=true -e ROOT_PATH=/Users/piyushbaderia/response/ baderiapiyush/web-crawler-go:latest
```
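For illustration, saving a response under ROOT_PATH could look like the following; the on-disk layout (a host directory mirroring the URL path) is a guess, and `saveResponse` is a hypothetical helper, not the crawler's actual code:

```go
package main

import (
	"io"
	"net/url"
	"os"
	"path/filepath"
	"strings"
)

// saveResponse writes body under rootPath, mirroring the URL's host
// and path on disk. The layout is an assumption; the actual crawler
// may name files differently.
func saveResponse(rootPath string, u *url.URL, body io.Reader) error {
	name := u.Path
	if name == "" || strings.HasSuffix(name, "/") {
		name += "index.html" // give directory-style URLs a file name
	}
	dest := filepath.Join(rootPath, u.Hostname(), filepath.FromSlash(name))
	if err := os.MkdirAll(filepath.Dir(dest), 0755); err != nil {
		return err
	}
	f, err := os.Create(dest)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, body)
	return err
}

func main() {
	u, _ := url.Parse("https://example.com/docs/")
	_ = saveResponse("/tmp/responses", u, strings.NewReader("<html></html>"))
}
```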

To display the URIs being crawled:

```
DISPLAY_URI=true go run crawler.go <URL>
docker run -e CRAWL_URL=<URL> -e DISPLAY_URI=true baderiapiyush/web-crawler-go:latest
```

## Features

The crawler performs the following tasks:

- Crawls a single domain, i.e. it stays on the base domain of the URL it is asked to crawl (see the sketch after this list)
- Optionally displays the URIs being crawled
- Optionally stores the responses on the local disk
- Provides control over the level of concurrency
- Times out requests after 30 seconds
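
The sketch below shows one way the domain restriction, the 30-second timeout, and the worker-pool concurrency could fit together; every name in it (`sameHost`, `crawl`, the shared `client`) is an illustrative assumption rather than the repository's implementation:

```go
package main

import (
	"net/http"
	"net/url"
	"sync"
	"time"
)

// client times out any request after 30 seconds, matching the
// behaviour listed above.
var client = &http.Client{Timeout: 30 * time.Second}

// sameHost reports whether link stays on the base domain being
// crawled, so the crawler never wanders off the starting site.
func sameHost(base *url.URL, link string) bool {
	u, err := base.Parse(link) // also resolves relative links
	if err != nil {
		return false
	}
	return u.Hostname() == base.Hostname()
}

// crawl drains jobs with threadCount concurrent workers, i.e. the
// THREAD_COUNT option above.
func crawl(jobs <-chan string, threadCount int) {
	var wg sync.WaitGroup
	for i := 0; i < threadCount; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				resp, err := client.Get(u)
				if err != nil {
					continue // timed out or unreachable
				}
				resp.Body.Close() // link extraction and storage elided
			}
		}()
	}
	wg.Wait()
}

func main() {
	base, _ := url.Parse("https://example.com")
	jobs := make(chan string, 1)
	if sameHost(base, "/about") { // relative link resolves on-domain
		jobs <- "https://example.com/about"
	}
	close(jobs)
	crawl(jobs, 3) // e.g. THREAD_COUNT=3
}
```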

## Enhancements

The crawler could be improved in the following areas:

- More tests: there are currently no tests for the functions that fetch data over the internet
- Benchmark tests: benchmarks should be added so performance can be measured for every change
- robots.txt support: the crawler currently does not respect robots.txt restrictions
- Transport configuration: more HTTP client options should be exposed, such as TLS settings (see the sketch after this list)
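
As a sketch of the transport enhancement, the HTTP client could be given an explicit `http.Transport` with TLS settings; the values below are illustrative examples, not options the crawler currently exposes:

```go
package main

import (
	"crypto/tls"
	"net/http"
	"time"
)

func main() {
	// A transport with explicit TLS settings; these values are
	// illustrative defaults, not current crawler options.
	transport := &http.Transport{
		TLSClientConfig: &tls.Config{
			MinVersion: tls.VersionTLS12, // refuse outdated TLS versions
		},
		TLSHandshakeTimeout: 10 * time.Second,
		MaxIdleConnsPerHost: 5,
	}

	client := &http.Client{
		Transport: transport,
		Timeout:   30 * time.Second, // keep the existing 30s request timeout
	}
	_ = client // the crawler would route all requests through this client
}
```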
