A Java Based Web Crawler and Search Engine
- jSoup [1.7.2]
- MySQL Connector/J (JDBC) [5.1.23]
- Apache Tomcat Server
| URLID | URL | Rank | Title | Description |
|-------|-----|------|-------|-------------|
| 0 | http://www.cs.purdue.edu | 4 | Computer Science Department | Computer Science Department .... |
| 1 | http://www.cs.purdue.edu/homes/cs390lang/java | 1 | Advanced Java | CS390java: Advanced Java... |
This table stores the list of URLs found during crawling.
- URLID - The numeric id of this URL in the table; it uniquely identifies the URL.
- URL - The URL that was found.
- Rank - The calculated rank of the page.
- Title - The page title.
- Description - A fragment of the page's text content, for example the first 100 characters.
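As a rough illustration, the URL table can be created through JDBC along the lines of the sketch below. The table name, column types, and connection settings here are assumptions made for the example; the project reads its real settings from database.properties.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateUrlTable {
    public static void main(String[] args) throws Exception {
        // Connection settings are placeholders; in the project they come from database.properties.
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/searchengine", "user", "password");
        try (Statement stmt = con.createStatement()) {
            // Columns mirror the table described above; the exact types are an assumption.
            stmt.executeUpdate(
                "CREATE TABLE IF NOT EXISTS URLS (" +
                "  URLID INT PRIMARY KEY," +
                "  URL VARCHAR(2048)," +
                "  Rank INT," +
                "  Title VARCHAR(256)," +
                "  Description VARCHAR(256))");
        } finally {
            con.close();
        }
    }
}
```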
| URLID | URL | Rank |
|-------|-----|------|
| 0 | http://www.cs.purdue.edu/news/images/bxd.jpg | 17 |
| 1 | http://www.cs.purdue.edu/news/lawsons-sculpture.jpg | 2 |
This table stores the list of image URLs found during crawling.
- URLID - The numeric id of this image URL in the table; it uniquely identifies the URL.
- URL - The image URL that was found.
- Rank - The calculated rank.
| Word | URLList |
|------|---------|
| Computer | 1,5,7,3 |
| Science | 3,6,7 |

| Word | URLList |
|------|---------|
| Computer | 1,5,7,3 |
| Science | 3,6,7 |
These tables store the list of words and the URLs / images that contain them.
- Word - A keyword found in one or more URLs.
- URLList - A string listing the ids of all the URLs / images that contain the word, as sketched below. Since the length of this string is variable, a VARCHAR type can be used.
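For example, a search query can fetch the stored id list for a word and split it on commas. This is only a sketch; the table name WORDS and the helper method are assumptions for illustration, not code from the project.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class WordLookup {
    // Returns the URL ids stored for a keyword, or an empty array if the word is unknown.
    // The table name "WORDS" is an assumption made for this sketch.
    static String[] urlIdsFor(Connection con, String word) throws Exception {
        try (PreparedStatement ps =
                 con.prepareStatement("SELECT URLList FROM WORDS WHERE Word = ?")) {
            ps.setString(1, word);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    return new String[0];
                }
                // URLList is stored as a comma-separated VARCHAR, e.g. "1,5,7,3".
                return rs.getString("URLList").split(",");
            }
        }
    }
}
```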
A completed database with all the MySQL tables can be found in the database directory. It contains 7027 URLs, 5334 of which have been fully parsed. Feel free to use these tables for your testing or development.
The web crawler uses breadth-first search and has the following syntax:
webcrawl [-u <maxurls>] [-d domain] [-r] url-list
Where:
- maxurls - The maximum number of URLs that will be traversed (1000 by default).
- domain - The domain used to restrict the links added to the URL table; only URLs in this domain will be added.
- url-list - The list of starting URLs to traverse.
- -r - Resets the database instead of resuming where the crawler last left off.
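For example, the following invocation (the values are only illustrative) crawls up to 500 pages restricted to the cs.purdue.edu domain after resetting the database:

webcrawl -u 500 -d cs.purdue.edu -r http://www.cs.purdue.edu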
Alternatively, these settings are stored in database.properties. The crawler is multi-threaded; the number of threads can be changed through the nThread variable in the crawler class.
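At its core, the breadth-first traversal with jsoup looks roughly like the sketch below. This is an illustrative outline, not the project's actual crawler class; database inserts, ranking, and threading are omitted.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CrawlSketch {
    // Breadth-first traversal of links; maxUrls and domain mirror the -u and -d options above.
    static void crawl(String seed, String domain, int maxUrls) throws Exception {
        Queue<String> frontier = new ArrayDeque<String>();
        Set<String> seen = new HashSet<String>();
        frontier.add(seed);
        seen.add(seed);
        int visited = 0;
        while (!frontier.isEmpty() && visited < maxUrls) {
            String url = frontier.poll();
            visited++;
            Document doc = Jsoup.connect(url).get();
            // The page title, description, and words would be written to the tables here.
            for (Element link : doc.select("a[href]")) {
                String next = link.attr("abs:href");
                if (next.contains(domain) && seen.add(next)) {
                    frontier.add(next);
                }
            }
        }
    }
}
```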
This GUI-based application allows the user to quickly and easily edit the database.properties file. Just run the PropertiesManager application to start it.
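A JDBC-style database.properties file typically looks something like the following; the key names and values here are only an assumption for illustration, not necessarily the ones this project uses.

```properties
# Illustrative JDBC settings; the actual keys are defined by the project.
jdbc.drivers=com.mysql.jdbc.Driver
jdbc.url=jdbc:mysql://localhost:3306/searchengine
jdbc.username=user
jdbc.password=password
```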
This servlet is a Google-like search engine that includes both web and image search. I recreated the CSS and HTML myself and was able to get the look and feel I wanted. Using JavaScript and jQuery plugins, I included live URL previews and a popup-style image gallery (fancyBox). Screenshots can be found in the screenshots folder.