RMLaroche/CoursInteg


Purdoogle

A Java Based Web Crawler and Search Engine

Requirements

  • jSoup [1.7.2]
  • Java MySQL Connector (JDBC) [5.1.23]
  • Apache Tomcat server

The Database

URL Table

| URLID | URL | Rank | Title | Description |
|-------|-----|------|-------|-------------|
| 0 | http://www.cs.purdue.edu | 4 | Computer Science Department | Computer Science Department .... |
| 1 | http://www.cs.purdue.edu/homes/cs390lang/java | 1 | Advanced Java | CS390java: Advanced Java... |

This table stores the list of URLs found during crawling.
  • URLID - A number that uniquely identifies this URL in the table.
  • URL - The URL that was found.
  • Rank - The calculated rank of the page.
  • Title - The page title.
  • Description - A fragment of the text content of the page; for example, the first 100 characters.
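As a sketch of how the Description field might be filled in (assuming the page text has already been extracted, e.g. via jSoup; the class and method names here are illustrative, not the repository's actual API):

```java
public class DescriptionUtil {
    // Maximum number of characters stored in the Description column.
    private static final int MAX_DESCRIPTION = 100;

    // Return at most the first MAX_DESCRIPTION characters of the page text.
    public static String makeDescription(String pageText) {
        if (pageText == null) {
            return "";
        }
        String trimmed = pageText.trim();
        return trimmed.length() <= MAX_DESCRIPTION
                ? trimmed
                : trimmed.substring(0, MAX_DESCRIPTION);
    }
}
```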

Images Table

| URLID | URL | Rank |
|-------|-----|------|
| 0 | http://www.cs.purdue.edu/news/images/bxd.jpg | 17 |
| 1 | http://www.cs.purdue.edu/news/lawsons-sculpture.jpg | 2 |

This table stores the list of image URLs found during crawling.
  • URLID - A number that uniquely identifies this image URL in the table.
  • URL - The image URL that was found.
  • Rank - The calculated rank of the image.

Word Table

| Word | URLList |
|------|---------|
| Computer | 1,5,7,3 |
| Science | 3,6,7 |

Image Word Table

| Word | URLList |
|------|---------|
| Computer | 1,5,7,3 |
| Science | 3,6,7 |

These tables store the list of words and the URLs / images that contain them.
  • Word - A keyword found in one or more URLs.
  • URLList - A string holding the list of all the URL IDs / image IDs that contain that word. Since the length of this string is variable, a VARCHAR type can be used.
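Because URLList is stored as a comma-separated VARCHAR, the crawler needs to parse it and append new URL IDs when a word is seen again. A minimal sketch of that bookkeeping (the helper names are hypothetical, not taken from the repository):

```java
import java.util.ArrayList;
import java.util.List;

public class UrlListUtil {
    // Parse a comma-separated URLList string (e.g. "1,5,7,3") into URL IDs.
    public static List<Integer> parse(String urlList) {
        List<Integer> ids = new ArrayList<>();
        if (urlList == null || urlList.isEmpty()) {
            return ids;
        }
        for (String part : urlList.split(",")) {
            ids.add(Integer.parseInt(part.trim()));
        }
        return ids;
    }

    // Append a URL ID to the list string, avoiding duplicates.
    public static String append(String urlList, int urlId) {
        if (parse(urlList).contains(urlId)) {
            return urlList;          // already recorded for this word
        }
        if (urlList == null || urlList.isEmpty()) {
            return Integer.toString(urlId);
        }
        return urlList + "," + urlId;
    }
}
```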

A completed database with all the MySQL tables can be found in the database directory. It contains 7027 URLs, of which 5334 have been fully parsed. Feel free to use these tables for your testing or development.

The Crawler

The webcrawler program uses breadth-first search and has the following syntax:

webcrawl [-u <maxurls>] [-d domain] [-r] url-list

Where maxurls is the maximum number of URLs that will be traversed (1000 by default). domain is the domain used to restrict the links added to the URL table: only URLs in this domain will be added. url-list is the list of starting URLs to traverse. The -r flag tells the crawler to reset the database rather than resume where it last left off.

Alternatively, these settings are stored in database.properties. The crawler is multi-threaded; the number of threads can be changed via the global variable nThread in the crawler class.
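The breadth-first traversal with a URL cap and domain restriction described above can be sketched as follows. This is an illustrative outline, not the repository's actual crawler: fetchLinks stands in for the jSoup-based page fetching, and the simplified domain check is an assumption.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

public class BfsCrawlSketch {
    // Breadth-first traversal over URLs, honoring maxUrls and a domain filter.
    public static List<String> crawl(List<String> startUrls,
                                     int maxUrls,
                                     String domain,
                                     Function<String, List<String>> fetchLinks) {
        Queue<String> frontier = new ArrayDeque<>(startUrls);
        Set<String> visited = new HashSet<>();
        List<String> order = new ArrayList<>();
        while (!frontier.isEmpty() && order.size() < maxUrls) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue;  // already seen
            }
            if (domain != null && !url.contains(domain)) {
                continue;  // outside the restricted domain (simplified check)
            }
            order.add(url);
            // Enqueue out-links; FIFO order makes the traversal breadth-first.
            frontier.addAll(fetchLinks.apply(url));
        }
        return order;
    }
}
```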

The Properties Manager

This GUI-based application allows the user to quickly and easily edit the database.properties file. Just run the PropertiesManager application to start.

The Search Engine

This servlet is a Google-like search engine that includes both web and image search. I recreated the CSS and HTML myself and was able to get the look and feel I wanted. Using JavaScript and jQuery plugins, I added live URL previews and a popup-style image gallery (fancyBox). Screenshots can be found in the screenshots folder.
