RMLaroche/CoursInteg


Purdoogle

A Java Based Web Crawler and Search Engine

Requirements

  • jSoup [1.7.2]
  • Java MySQL Connector (JDBC) [5.1.23]
  • Apache Tomcat server

The Database

URL Table

| URLID | URL | Rank | Title | Description |
|-------|-----|------|-------|-------------|
| 0 | http://www.cs.purdue.edu | 4 | Computer Science Department | Computer Science Department .... |
| 1 | http://www.cs.purdue.edu/homes/cs390lang/java | 1 | Advanced Java | CS390java: Advanced Java... |

This table stores the list of URLs found during crawling.
  • URLID - A number that uniquely identifies this URL in the table.
  • URL - The URL that was found.
  • Rank - The calculated rank of the page.
  • Title - The page title.
  • Description - A fragment of the text content of the page; for example, the first 100 characters.
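As a sketch of how the Description field might be filled in (assuming the page text has already been extracted, e.g. via jSoup; the class and method names here are illustrative, not the repository's actual API):

```java
public class DescriptionUtil {
    // Maximum number of characters stored in the Description column.
    private static final int MAX_DESCRIPTION = 100;

    // Return at most the first MAX_DESCRIPTION characters of the page text.
    public static String makeDescription(String pageText) {
        if (pageText == null) {
            return "";
        }
        String trimmed = pageText.trim();
        return trimmed.length() <= MAX_DESCRIPTION
                ? trimmed
                : trimmed.substring(0, MAX_DESCRIPTION);
    }
}
```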

Images Table

| URLID | URL | Rank |
|-------|-----|------|
| 0 | http://www.cs.purdue.edu/news/images/bxd.jpg | 17 |
| 1 | http://www.cs.purdue.edu/news/lawsons-sculpture.jpg | 2 |

This table stores the list of image URLs found during crawling.
  • URLID - A number that uniquely identifies this image URL in the table.
  • URL - The image URL that was found.
  • Rank - The calculated rank of the image.

Word Table

| Word | URLList |
|------|---------|
| Computer | 1,5,7,3 |
| Science | 3,6,7 |

Image Word Table

| Word | URLList |
|------|---------|
| Computer | 1,5,7,3 |
| Science | 3,6,7 |

These tables store the list of words and the URLs / images that contain them.
  • Word - A keyword found in one or more URLs.
  • URLList - A string holding the list of all the URL IDs / image IDs that contain that word. Since the length of this string is variable, a VARCHAR type can be used.
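Because URLList is stored as a comma-separated VARCHAR, the crawler needs to parse it and append new URL IDs when a word is seen again. A minimal sketch of that bookkeeping (the helper names are hypothetical, not taken from the repository):

```java
import java.util.ArrayList;
import java.util.List;

public class UrlListUtil {
    // Parse a comma-separated URLList string (e.g. "1,5,7,3") into URL IDs.
    public static List<Integer> parse(String urlList) {
        List<Integer> ids = new ArrayList<>();
        if (urlList == null || urlList.isEmpty()) {
            return ids;
        }
        for (String part : urlList.split(",")) {
            ids.add(Integer.parseInt(part.trim()));
        }
        return ids;
    }

    // Append a URL ID to the list string, avoiding duplicates.
    public static String append(String urlList, int urlId) {
        if (parse(urlList).contains(urlId)) {
            return urlList;          // already recorded for this word
        }
        if (urlList == null || urlList.isEmpty()) {
            return Integer.toString(urlId);
        }
        return urlList + "," + urlId;
    }
}
```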

A completed database with all the MySQL tables can be found in the database directory. It contains 7027 URLs, of which 5334 have been fully parsed. Feel free to use these tables for your testing or development.

The Crawler

The webcrawler program uses breadth-first search and has the following syntax:

webcrawl [-u <maxurls>] [-d domain] [-r] url-list

Where maxurls is the maximum number of URLs that will be traversed (1000 by default). domain is the domain used to restrict the links added to the URL table: only URLs in this domain will be added. url-list is the list of starting URLs to traverse. The -r flag tells the crawler to reset the database rather than resume where it last left off.

Alternatively, these settings are stored in database.properties. The crawler is multi-threaded; the number of threads can be changed via the global variable nThread in the crawler class.
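The breadth-first traversal with a URL cap and domain restriction described above can be sketched as follows. This is an illustrative outline, not the repository's actual crawler: fetchLinks stands in for the jSoup-based page fetching, and the simplified domain check is an assumption.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

public class BfsCrawlSketch {
    // Breadth-first traversal over URLs, honoring maxUrls and a domain filter.
    public static List<String> crawl(List<String> startUrls,
                                     int maxUrls,
                                     String domain,
                                     Function<String, List<String>> fetchLinks) {
        Queue<String> frontier = new ArrayDeque<>(startUrls);
        Set<String> visited = new HashSet<>();
        List<String> order = new ArrayList<>();
        while (!frontier.isEmpty() && order.size() < maxUrls) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue;  // already seen
            }
            if (domain != null && !url.contains(domain)) {
                continue;  // outside the restricted domain (simplified check)
            }
            order.add(url);
            // Enqueue out-links; FIFO order makes the traversal breadth-first.
            frontier.addAll(fetchLinks.apply(url));
        }
        return order;
    }
}
```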

The Properties Manager

This GUI-based application allows the user to quickly and easily edit the database.properties file. Just run the PropertiesManager application to start.

The Search Engine

This servlet is a Google-like search engine that includes both web and image search. I recreated the CSS and HTML myself and was able to get the look and feel I wanted. Using JavaScript and jQuery plugins, I added live URL previews and a popup-style image gallery (fancyBox). Screenshots can be found in the screenshots folder.
