Unweaving a Tangled Web With HTMLParser and Lucene
by Keld H. Hansen
Introduction
Ever wanted to write a Java program that crawls the web? You
know, a program that reads HTML pages, retrieves the links, fetches
the linked pages (which contain more links), and so on. Maybe you
have also thought about storing the text from the HTML pages for
later use, for example to search the pages for specific information.
These are the characteristics of a search engine
like Google or Yahoo. If you have a web site of your own, you
might be interested in having your own search engine. One
possibility is to buy one or to use an open source search engine,
but you might also find it rewarding to write your own!
In this article I'll show you the basic techniques for building a
search engine using two powerful open source products: HTMLParser
and Lucene.
Crawling the Web
The first step is to find out how to "crawl the web".
That is: request a page over HTTP, receive the page, extract its
text, and harvest its links. Then repeat this process for every
link found. There are several ways to handle this task, including:
- Use the java.net.URLConnection class. This is a rather low-level
approach that appeals to those who want absolute control over what's
going on.
- Use the HttpClient from the Jakarta project. This open source
product handles several situations for you that would otherwise need
non-trivial coding. There's a feature list available if you want the
details.
- Use the HTMLParser found on SourceForge.net. This product not only
lets you send a request and receive a response, it also parses the
HTML for you.
So in our situation the HTMLParser is a natural choice. It's not
the only open source HTML parser available, but it's the best
that I've found.
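To make the crawl cycle concrete, here is a minimal sketch of the
low-level approach using only JDK classes. It does not use HTMLParser:
the link harvesting is a deliberately naive regular expression standing
in for a real parser. The names CrawlSketch, PageFetcher,
fetchWithUrlConnection, extractLinks, and crawl are invented for this
illustration, and the UTF-8 decoding and five-second timeouts are
assumptions rather than recommendations.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CrawlSketch {

    /** Hypothetical abstraction so the crawl loop can run without a network. */
    @FunctionalInterface
    interface PageFetcher {
        String fetch(String url) throws IOException;
    }

    // Naive matcher for <a ... href="..."> tags. A real HTML parser copes with
    // unquoted attributes, comments, scripts, and malformed markup; this does not.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Fetches a page with java.net.URLConnection, assuming UTF-8 content. */
    static String fetchWithUrlConnection(String url) throws IOException {
        URLConnection conn = URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    /** Harvests http/https links from the page, resolved against the page URL. */
    static List<String> extractLinks(String pageUrl, String html) {
        List<String> links = new ArrayList<>();
        URI base = URI.create(pageUrl);
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                URI resolved = base.resolve(m.group(1).trim());
                String scheme = resolved.getScheme();
                if ("http".equals(scheme) || "https".equals(scheme)) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // Skip href values that are not valid URIs.
            }
        }
        return links;
    }

    /** Breadth-first crawl: fetch a page, queue its links, repeat up to maxPages. */
    static Set<String> crawl(String startUrl, int maxPages, PageFetcher fetcher) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(startUrl);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue; // already processed
            }
            try {
                String html = fetcher.fetch(url);
                for (String link : extractLinks(url, html)) {
                    if (!visited.contains(link)) {
                        queue.add(link);
                    }
                }
            } catch (IOException e) {
                // Unreachable page: it stays in visited but contributes no links.
            }
        }
        return visited;
    }
}
```

A live crawl could be started with
CrawlSketch.crawl(startUrl, 20, CrawlSketch::fetchWithUrlConnection).
Passing the fetcher as a parameter keeps the loop independent of how
pages are retrieved, so the same loop would work with a different
HTTP client or with canned pages in a test. The sketch ignores
robots.txt, redirects to other content types, and character-set
detection, all of which a real crawler has to deal with.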