Unweaving a Tangled Web With HTMLParser and Lucene

The HTMLParser

On the HTMLParser home page there's a link for downloading version 1.4, a zip file of about 3 MB. It contains not only the jar file you need but also the utilities and documentation. I suggest unzipping all of it so you have the JavaDoc at hand, then putting htmlparser.jar on your classpath.

A really nice feature is that HTMLParser ships with six sample programs that help you understand how to use the package. Their source is in the src.zip file in the download. For our program we'll use one of them, StringExtractor, as the starting point. As its name suggests, it extracts the text from a web page, and optionally the links in the page as well.

Getting the HTML body text and the links from a page is really simple. Let's get the HTML from the HTMLParser's own home page:

String URL = "http://htmlparser.sourceforge.net";
StringExtractor se = new StringExtractor (URL);
String contents = se.extractStrings(true);
System.out.println(contents);

If we run this code we get a somewhat disappointing result:

HTMLParser Home Page
SORRY! YOU NEED FRAMES ENABLED BROWSER FOR THIS WEBSITE"

The StringExtractor obviously doesn't handle frames, so to handle them we need something more. Be patient and I'll soon give the hints you need to cover frames as well.

For now we'll run StringExtractor on the URL of the left frame (the menu), which is http://htmlparser.sourceforge.net/panel.html. We see the following output:

NAVIGATION PAGE
About HTMLParser
<http://htmlparser.sourceforge.net/main.html> Welcome
<http://sourceforge.net/projects/htmlparser> Project Page
<http://htmlparser.sourceforge.net/contributors.html> 
Contributors
<http://htmlparser.sourceforge.net/joinus.html> 
Join this Project
Downloads
. . . 

The first line is the TITLE of the HTML page, followed by the text and links. A nice thing is that we always get the link addresses back as full URLs. This relieves us of the task of constructing URLs from relative addresses.
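
To see what this saves us, here is a minimal sketch of resolving addresses by hand with the standard java.net.URI class. The class name ResolveLink and the relative href "main.html" are assumptions made for illustration; they don't come from HTMLParser or from the page's actual source.

```java
import java.net.URI;

public class ResolveLink {

    /** Resolves an href, as it might appear in the page source, against the page's own URL. */
    public static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://htmlparser.sourceforge.net/panel.html";

        // A relative href (hypothetical) is resolved against the page's directory
        System.out.println(resolve(page, "main.html"));

        // An href that is already absolute comes back unchanged
        System.out.println(resolve(page, "http://sourceforge.net/projects/htmlparser"));
    }
}
```

With StringExtractor none of this is needed, since the links it returns are already absolute.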

The next thing we need to do is to get the text without the links--and the links without the text. I've used regular expressions for this, because it makes coding very simple. The pattern to use is "look for '<' and '>' with something in between--except for a '>'":

. . .
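// Links appear in the extracted text as <url>: "<", anything except ">", then ">"
Pattern p = Pattern.compile("<[^>]*>");
Matcher m = p.matcher(contents);

// Replace links with a space
String text = m.replaceAll(" ");

// replaceAll() changes the matcher's state, so reset it before searching again
m.reset();

// Get all links (each one still wrapped in "<" and ">")
List links = new ArrayList();
while(m.find()) {
  links.add(m.group());
}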
. . . 
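
The same idea can be tried without fetching a page. The sketch below is self-contained and runs the pattern over a hard-coded sample of the output shown earlier. The class and method names (LinkSplitter, textOnly, linksOnly) are my own, chosen for illustration. It creates a fresh Matcher for each operation, so the two steps don't share any matcher state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkSplitter {

    // "<", then anything except ">", then ">"
    private static final Pattern LINK = Pattern.compile("<[^>]*>");

    /** Returns the input with every <...> link replaced by a single space. */
    public static String textOnly(String contents) {
        return LINK.matcher(contents).replaceAll(" ");
    }

    /** Returns the links in order of appearance, still wrapped in "<" and ">". */
    public static List<String> linksOnly(String contents) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(contents);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample taken from the StringExtractor output shown earlier
        String contents = "NAVIGATION PAGE\n"
            + "About HTMLParser\n"
            + "<http://htmlparser.sourceforge.net/main.html> Welcome\n"
            + "<http://sourceforge.net/projects/htmlparser> Project Page\n";

        System.out.println(textOnly(contents));
        System.out.println(linksOnly(contents));
    }
}
```

Keeping the angle brackets on each link matches the format used in the rest of this article; strip them if you need the bare URL.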

Frames

To locate frames in the HTML, we insert this code before the link-extracting code:

. . .
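// Collect every FRAME tag on the page
Parser parser = new Parser (URL);
Node[] list = parser.extractAllNodesThatAre (FrameTag.class);

// For a frameset page, record the frame locations as links and skip the normal extraction
if (list.length > 0) {
  for (int i = 0; i < list.length; i++) {
    String link = ( (FrameTag)list[i] ).getFrameLocation();
    links.add("<" + link + ">");
  }
  return;
}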
. . .
(extract links from normal HTML Page)  

Used this way, the Parser class can extract any kind of tag from an HTML page: pass the class of the tag you want instead of FrameTag.class.

So now we're ready to crawl the web!
