aadithpm/crawler-swarm

Concurrent web crawler in Python 3. Powered by BeautifulSoup and requests. Crawls a website for all non-relative links and queues those links for further crawling until a user-defined 'level' of crawling is reached (or you get bored of waiting for the crawler to finish).
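
Roughly, the link-extraction step looks like the following (a minimal sketch for illustration; the function name and details below are assumptions, not the actual crawler.py API):

```python
# Minimal sketch: fetch a page and collect its non-relative (absolute) links.
# Names here are illustrative only, not the real crawler.py interface.
import requests
from bs4 import BeautifulSoup

def get_absolute_links(url):
    """Return all non-relative links found on the page at `url`."""
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        # Keep only absolute links, skipping relative paths like "/about".
        if href.startswith(("http://", "https://")):
            links.append(href)
    return links
```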

Setup:

  • Clone the repo locally: git clone REPO_URL

  • Optional: Create a virtual environment: python -m venv env

  • Install requirements: pip install -r requirements.txt

  • Run the unit tests: python -m unittest

  • Run the crawler:

python crawler.py [--help] --url=URL --levels=LEVELS [--indent]

help - displays the help message
url - URL to start crawling from
levels - maximum number of levels to crawl from the starting URL
indent - print with or without indentation; defaults to unindented
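
For example, to crawl two levels deep from example.com with indented output:

python crawler.py --url=https://example.com --levels=2 --indent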

Preview:

[crawler preview image]

Improvements:

  • Visit only unique links

    • For both 'parent' and 'child' links, do not revisit a link that has already been visited. For example, social links appear on every page of a company's website; queueing them again from every page wastes resources, since they are already added to the crawl queue the first time they are seen (a sketch of this appears after this list)
  • Present links on a different interface

    • A console application is the fastest to develop, but a web application might present this information in a more organized and navigable format
  • Tests

    • You can never have enough tests. For example, crawler_swarm's crawler processing logic currently has no tests at all (a hypothetical test sketch appears below).
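
A minimal sketch of the unique-links idea, assuming a hypothetical breadth-first crawl loop (names are illustrative, not the current crawler.py implementation):

```python
# Hypothetical sketch: track visited URLs in a set so repeated links
# (e.g. social links that appear on every page) are crawled only once.
from collections import deque

def crawl(start_url, max_levels, fetch_links):
    """Breadth-first crawl that visits each unique URL at most once."""
    visited = set()
    queue = deque([(start_url, 0)])
    while queue:
        url, level = queue.popleft()
        if url in visited or level > max_levels:
            continue
        visited.add(url)
        for link in fetch_links(url):  # e.g. a get_absolute_links(url) helper
            if link not in visited:
                queue.append((link, level + 1))
    return visited
```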
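
And a hedged example of what a unit test for the link-extraction logic could look like (it assumes crawler.py exposes a get_absolute_links(url) helper; the real names and layout may differ):

```python
import unittest
from unittest import mock

# Assumption: crawler.py exposes a helper like get_absolute_links(url).
from crawler import get_absolute_links

SAMPLE_HTML = '<a href="https://example.com/a">A</a><a href="/relative">B</a>'

class TestLinkExtraction(unittest.TestCase):
    def test_relative_links_are_ignored(self):
        # Stub out the network call so only the parsing logic is exercised.
        fake_response = mock.Mock(text=SAMPLE_HTML)
        with mock.patch("requests.get", return_value=fake_response):
            links = get_absolute_links("https://example.com")
        self.assertEqual(links, ["https://example.com/a"])

if __name__ == "__main__":
    unittest.main()
```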
