Unweaving a Tangled Web With HTMLParser and Lucene

Completing the crawler

The crawler algorithm goes something like this (pseudo-code):

  1. Set current address to the chosen start address
  2. Use StringExtractor to get text and links for the current address
  3. Mark the current address as "crawled" (use a HashMap for this)
  4. For each address in the set of links that is not yet marked as "crawled" do:
  5.     Make this the current address
  6.     Go to 2
  7. End For

This will obviously go on forever if we don't include some stop condition. It's normal practice to define a "depth" parameter, indicating how far down in the tree-structure of links we want to crawl. We'll also add a boolean parameter that tells whether we want to crawl external addresses. We define an external address as an address whose prefix doesn't match the prefix of the start address.
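The steps above, together with the depth limit and the external-address switch, can be sketched roughly as follows. This is only an illustration, not the demo program's source: the class name DepthLimitedCrawler, the linkExtractor function (which stands in for the StringExtractor call) and the prefixOf helper are hypothetical, and I'm assuming that "prefix" means the scheme and host part of an address. The "Go to 2" step is written as a recursive call.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch; none of these names come from the article's source code.
class DepthLimitedCrawler {
    // Stands in for the StringExtractor step: maps an address to the links found on it.
    private final Function<String, List<String>> linkExtractor;
    private final String startAddress;
    private final int maxDepth;          // 0 means "crawl the start page only"
    private final boolean crawlExternal; // false means "stay on the start address's prefix"
    // The article suggests a HashMap; a HashSet is enough when we only record addresses.
    private final Set<String> crawled = new HashSet<>();
    private final List<String> visitOrder = new ArrayList<>();

    DepthLimitedCrawler(Function<String, List<String>> linkExtractor,
                        String startAddress, int maxDepth, boolean crawlExternal) {
        this.linkExtractor = linkExtractor;
        this.startAddress = startAddress;
        this.maxDepth = maxDepth;
        this.crawlExternal = crawlExternal;
    }

    // Runs the crawl and returns the addresses in the order they were visited.
    List<String> crawl() {
        crawl(startAddress, 0);
        return visitOrder;
    }

    private void crawl(String address, int depth) {
        // Stop conditions: too deep, already crawled, or external when not allowed.
        if (depth > maxDepth || crawled.contains(address)) {
            return;
        }
        if (!crawlExternal && !prefixOf(address).equals(prefixOf(startAddress))) {
            return;
        }
        crawled.add(address);
        visitOrder.add(address);
        for (String link : linkExtractor.apply(address)) {
            crawl(link, depth + 1); // "Go to 2" for each link
        }
    }

    // Assumption: the prefix is everything before the first '/' that follows "://".
    static String prefixOf(String address) {
        int schemeEnd = address.indexOf("://");
        if (schemeEnd < 0) {
            return address;
        }
        int pathStart = address.indexOf('/', schemeEnd + 3);
        return pathStart < 0 ? address : address.substring(0, pathStart);
    }
}
```

Note that this sketch walks the links depth-first, so a page first reached near the depth limit is not expanded again if it later turns up closer to the start address. A breadth-first queue would avoid that, at the cost of slightly more bookkeeping.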

Valid links

The syntax of the HTML link tag (the A tag) allows for some variants that we are not interested in. The link address can, for example, be a JavaScript expression, so we'll add some code that checks whether an address starts with "http://" or "https://". Only these addresses will be crawled. Bookmarks (the "#name" part at the end of an address, which points to a position within a page) must also be handled to avoid crawling a page more than once. There are probably other things you'd have to consider if your web crawler is going to compete with Google!
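A minimal sketch of such a check might look like the code below. The LinkFilter class and its normalize method are hypothetical names, not part of HTMLParser or of the demo program. I'm also assuming that the scheme should be compared case-insensitively and that a bookmark is simply everything from the first "#" onwards.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper; not part of HTMLParser or the article's crawler.
class LinkFilter {
    // Returns the address without its bookmark part, or an empty Optional
    // if the address is not an http/https link and should be skipped.
    static Optional<String> normalize(String address) {
        if (address == null) {
            return Optional.empty();
        }
        String trimmed = address.trim();
        // Assumption: compare the scheme case-insensitively ("HTTP://" is accepted too).
        String lower = trimmed.toLowerCase(Locale.ROOT);
        if (!lower.startsWith("http://") && !lower.startsWith("https://")) {
            return Optional.empty(); // e.g. "javascript:..." or "mailto:..."
        }
        // Strip the bookmark so "page.html#top" and "page.html" count as one page.
        int hash = trimmed.indexOf('#');
        return Optional.of(hash >= 0 ? trimmed.substring(0, hash) : trimmed);
    }
}
```

Running every extracted link through a filter like this before looking it up in the "crawled" collection means two links that differ only in their bookmark are treated as the same page.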

You'll find a link to the source of a complete crawler in the resources section at the end of the article.

If you run the crawler you'll notice that the number of pages crawled per minute is rather modest. This is not due to bad performance of the HTMLParser, but simply because an HTTP request on the Internet takes some time to complete. To speed up crawling you'd have to manage a set of threads, each handling a web page. This is not part of my demo program, but is safe to do, since most of the HTMLParser classes are thread-safe. The same goes for a Lucene index: you may even update it while it's being queried by other programs.
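Since threading is not part of the demo program, here is only a rough idea of how it could be done with the standard java.util.concurrent classes. The ParallelFetcher class and the pageFetcher function are hypothetical. In a real crawler the function would wrap the HTMLParser call, but here it is left as a parameter so the sketch doesn't depend on the network.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch; the article's demo program is single-threaded.
class ParallelFetcher {
    // Fetches all addresses using a fixed pool of worker threads and returns
    // a map from address to page text. pageFetcher must not return null,
    // because ConcurrentHashMap does not accept null values.
    static Map<String, String> fetchAll(List<String> addresses,
                                        Function<String, String> pageFetcher,
                                        int threads) {
        Map<String, String> pages = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (String address : addresses) {
                // Each task handles one web page, as suggested in the text.
                pending.add(pool.submit(() -> {
                    pages.put(address, pageFetcher.apply(address));
                }));
            }
            // Wait for every task so the returned map is complete.
            for (Future<?> task : pending) {
                try {
                    task.get();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while fetching", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("fetch failed", e.getCause());
                }
            }
        } finally {
            pool.shutdown();
        }
        return pages;
    }
}
```

While one thread waits for an HTTP response the others can keep working, so the slow requests overlap instead of queuing up. A polite crawler would probably also limit how many requests hit the same host at once.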

Until now we've only been web crawling, which is fun, but not very useful, unless you store the information you find on your way. Enter Lucene.
