Completing the crawler
The crawler algorithm goes something like this (pseudo-code):
1. Set the current address to the chosen start address
2. Use StringExtractor to get the text and links for the current address
3. Mark the current address as "crawled" (use a HashMap for this)
4. For each address in the set of links that is not yet marked as "crawled" do:
   - Make this the current address
   - Go to 2
5. End For
This will obviously go on forever if we don't include some stop
condition. It's normal practice to define a "depth"
parameter, indicating how far down in the tree-structure of
links we want to crawl. We'll also add a boolean parameter that
tells whether we want to crawl external addresses. We define an
external address as an address whose prefix doesn't match the
prefix of the start address.
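The steps and the two stop conditions described above can be sketched roughly as follows. This is only an illustration, not the article's demo program: the class name CrawlSketch, the linkSource function (a stand-in for whatever extracts the links of a page, so the sketch runs without network access) and the helper prefixOf are all hypothetical. The sketch also assumes that "prefix" means the scheme and host part of an address, and that a depth of 0 means "crawl the start page only".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class CrawlSketch {

    // Assumption: the "prefix" of an address is everything before the
    // first "/" that follows "://", e.g. "http://a.com".
    static String prefixOf(String address) {
        int schemeEnd = address.indexOf("://");
        int from = schemeEnd < 0 ? 0 : schemeEnd + 3;
        int slash = address.indexOf('/', from);
        return slash < 0 ? address : address.substring(0, slash);
    }

    // Returns the addresses in the order they were crawled.
    // linkSource is a hypothetical stand-in for the link extraction step.
    static List<String> crawl(String start, int maxDepth, boolean crawlExternal,
                              Function<String, List<String>> linkSource) {
        Map<String, Boolean> crawled = new HashMap<>();
        List<String> order = new ArrayList<>();
        visit(start, maxDepth, crawlExternal, prefixOf(start), linkSource, crawled, order);
        return order;
    }

    private static void visit(String address, int depth, boolean crawlExternal,
                              String startPrefix,
                              Function<String, List<String>> linkSource,
                              Map<String, Boolean> crawled, List<String> order) {
        // Stop condition 1: depth exhausted. Also skip pages already crawled.
        if (depth < 0 || crawled.containsKey(address)) {
            return;
        }
        // Stop condition 2: external address and external crawling is off.
        if (!crawlExternal && !prefixOf(address).equals(startPrefix)) {
            return;
        }
        crawled.put(address, Boolean.TRUE);
        order.add(address);
        for (String link : linkSource.apply(address)) {
            visit(link, depth - 1, crawlExternal, startPrefix, linkSource, crawled, order);
        }
    }
}
```

Note that this sketch walks the links depth-first, so a page that is first reached near the depth limit is marked as crawled and its own links are not followed, even if the same page is also reachable by a shorter path. A breadth-first queue would avoid that, at the cost of a little more code.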
Valid links
The syntax of the HTML link-tag (the A-tag) allows for some
variants which we are not interested in. The link address can,
for example, be a JavaScript expression, so we'll add some code
that checks if an address starts with "http://" or
"https://". Only these addresses will be crawled. Bookmarks
(the "#name" part at the end of an address) must also be handled
to avoid crawling a page more than once.
There are probably even other things you'd have to consider if
your web crawler is going to compete with Google!
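The two checks described above might look something like the sketch below. The class and method names (LinkFilter, isCrawlable, stripBookmark) are made up for this illustration and are not taken from the complete crawler. One small assumption goes beyond the text: the prefix is compared without regard to case, since "HTTP://" is also a legal way to write the scheme.

```java
import java.util.Locale;

class LinkFilter {

    // True only for addresses that start with "http://" or "https://".
    // JavaScript expressions, mailto links and the like are rejected.
    static boolean isCrawlable(String address) {
        String lower = address.toLowerCase(Locale.ROOT);
        return lower.startsWith("http://") || lower.startsWith("https://");
    }

    // Removes the bookmark (everything from the first '#'), so that
    // "page.html#top" and "page.html" count as the same page.
    static String stripBookmark(String address) {
        int hash = address.indexOf('#');
        return hash < 0 ? address : address.substring(0, hash);
    }
}
```

A crawler would typically call stripBookmark before looking an address up in the map of crawled pages, so that each page is only recorded once.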
You'll find a link to the source of a complete crawler in the
resources section at the end of the article.
If you run the crawler you'll notice that the number of pages
crawled per minute is rather modest. This is not due to bad
performance of the HTMLParser, but simply because an HTTP
request on the Internet takes some time to complete. To speed up
crawling you'd have to manage a set of threads, each handling a
web page. This is not part of my demo program, but is safe to
do, since most of the HTMLParser classes are thread-safe. You
may even update an index while it's being queried by other
programs.
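As a rough idea of what "a set of threads, each handling a web page" could look like, here is a minimal sketch built on the standard java.util.concurrent thread pool. It is not part of the demo program. The names ParallelFetchSketch and fetchAll are hypothetical, and fetchPage is a placeholder for whatever code requests and parses a single page, so the sketch itself makes no assumptions about the HTMLParser API and needs no network access.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ParallelFetchSketch {

    // Fetches every address on a pool of worker threads and returns
    // address -> page text, in the order the addresses were given.
    // fetchPage is a hypothetical stand-in for the HTTP request and parsing.
    static Map<String, String> fetchAll(List<String> addresses, int threads,
                                        Function<String, String> fetchPage) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per page; the slow requests now overlap.
            Map<String, Future<String>> pending = new LinkedHashMap<>();
            for (String address : addresses) {
                Callable<String> task = () -> fetchPage.apply(address);
                pending.put(address, pool.submit(task));
            }
            // Wait for each result in turn.
            Map<String, String> pages = new LinkedHashMap<>();
            for (Map.Entry<String, Future<String>> entry : pending.entrySet()) {
                pages.put(entry.getKey(), entry.getValue().get());
            }
            return pages;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while fetching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A page could not be fetched", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

In a full crawler the links found on each page would be fed back into the pool, and the map of crawled addresses would have to be a thread-safe one (for example a ConcurrentHashMap) because several threads would update it at once.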
Until now we've only been web crawling, which is fun, but not
very useful, unless you store the information you find on your
way. Enter Lucene.