webSpider

Project background

Scanning through web page is tedious and repeatitive. The webSpider.py will automatically visit all underlying pages with a given starter URL.

Design

(Links) = URLs (both image tag or a tag for the interest of this project)

The main program loop through each elmement (url) from a visited page and re-compute links for the next crawling acitivities. (Level) stores and tells the deepth of the webspider drilling to.

Once page is visited it will store into visited list to prevent webspider from getting into a endless loop.

HTML # are skipped and will be treated if its corresponding anchor link is already in (visited) list.

Implementation & Dependencies

Python has been chosen over other OOP programming tools. Beautiful soup and requests library has been used to meet this specific requirement.

Execution

python webspider.py start_url

(i.e.) python webspider.py "www.yahoo.com"

Testing

At console, it displays from parent page to all child pages and at the end it shows totally number of pages that the webspider has been visited.

Enhancement if time allow

Visualization of the produced result perhelps in graphical tree structure similiar to site map Data analytics on the result of webspider.py

Automating testing

test_script.sh

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
README.md		README.md
test_script.sh.txt		test_script.sh.txt
webSpider.py		webSpider.py
webspiderFlow.PNG		webspiderFlow.PNG

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

webSpider

Project background

Design

Implementation & Dependencies

Execution

Testing

Enhancement if time allow

Automating testing

About

Uh oh!

Releases

Packages

Languages

SWKCheung/webCrawler_builtIT

Folders and files

Latest commit

History

Repository files navigation

webSpider

Project background

Design

Implementation & Dependencies

Execution

Testing

Enhancement if time allow

Automating testing

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages