The HTMLParser
On the HTMLParser
home page there's a link for downloading version 1.4. It's a
zip file about 3 MB in size. It contains not only the JAR file
you need, but also the utilities and documentation. I suggest that
you unzip all of it to have the JavaDoc at hand. Then put
htmlparser.jar in your classpath.
A really nice built-in feature is that HTMLParser comes with 6 sample
programs, which really help you understand how to use the
package. The source is found in the src.zip file in the
download. For our program, we'll use one of the sample programs,
the StringExtractor program, as the starting
point. As its name suggests, it extracts the text from a web
page, and as an option it also extracts the links in the page.
Getting the HTML body text and the links from a page is really
simple. Let's get the HTML from the HTMLParser's own home page:
String URL = "http://htmlparser.sourceforge.net";
StringExtractor se = new StringExtractor(URL);
// Passing true asks for the links to be included along with the text
String contents = se.extractStrings(true);
System.out.println(contents);
If we run this code we get a somewhat disappointing result:
HTMLParser Home Page
SORRY! YOU NEED FRAMES ENABLED BROWSER FOR THIS WEBSITE"
The StringExtractor obviously doesn't handle
frames, so to handle them we need something more. Be patient and
I'll soon give you the hints you need to cover frames as well.
For now we'll use the StringExtractor with the URL of
the left frame (the
menu), whose top part looks like this:
[Image: top part of the navigation menu]
If we run the program with the menu's URL, which is
http://htmlparser.sourceforge.net/panel.html, we see the following
output:
NAVIGATION PAGE
About HTMLParser
<http://htmlparser.sourceforge.net/main.html> Welcome
<http://sourceforge.net/projects/htmlparser> Project Page
<http://htmlparser.sourceforge.net/contributors.html> Contributors
<http://htmlparser.sourceforge.net/joinus.html> Join this Project
Downloads
. . .
The first line is the TITLE from the HTML page, followed by the
text and links. A nice thing is that we always get the link
addresses back as full URLs. This relieves us of the task of
constructing URLs from relative addresses.
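To see what the library is saving us from, here is a minimal sketch of resolving a relative address by hand with the standard java.net.URI class. The class name UrlResolver and the relative address "main.html" are my own assumptions for illustration. This is not how HTMLParser does it internally.

```java
import java.net.URI;

/** Hypothetical helper, not part of HTMLParser. */
final class UrlResolver {

    /** Resolves href against the URL of the page it was found on. */
    static String toAbsolute(String pageUrl, String href) {
        // URI.resolve() applies the standard relative-reference rules;
        // an href that is already absolute is returned unchanged.
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://htmlparser.sourceforge.net/panel.html";
        // "main.html" is an assumed relative link, used only as an example
        System.out.println(toAbsolute(page, "main.html"));
        System.out.println(toAbsolute(page, "http://sourceforge.net/projects/htmlparser"));
    }
}
```

With the StringExtractor we never need a helper like this, because the links already come back in their full form.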
The next thing we need to do is to get the text without the
links--and the links without the text. I've used regular
expressions for this, because it makes coding very simple. The
pattern to use is "look for a '<' and a '>' with
anything in between except another '>'":
. . .
Pattern p = Pattern.compile("<[^>]*>");
Matcher m = p.matcher(contents);
// Replace links with a space
String text = m.replaceAll(" ");
// replaceAll() changes the matcher's state, so reset it before searching again
m.reset();
// Get all links
List<String> links = new ArrayList<String>();
while (m.find()) {
    links.add(m.group());
}
. . .
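If you want to try the pattern without downloading anything, the following is a small self-contained sketch that runs it on a hard-coded sample in the same format as the output above. The class name LinkSplitter and its two methods are hypothetical names of mine. Only the regular expression itself comes from the code above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits StringExtractor-style output into text and links. */
final class LinkSplitter {

    // '<', then any run of characters other than '>', then '>'
    private static final Pattern LINK = Pattern.compile("<[^>]*>");

    /** Returns the input with every <...> token replaced by a single space. */
    static String stripLinks(String contents) {
        return LINK.matcher(contents).replaceAll(" ");
    }

    /** Returns every <...> token, angle brackets included. */
    static List<String> extractLinks(String contents) {
        List<String> links = new ArrayList<>();
        // A fresh matcher is used here, so no reset is needed
        Matcher m = LINK.matcher(contents);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        String contents = "About HTMLParser\n"
                + "<http://htmlparser.sourceforge.net/main.html> Welcome\n"
                + "<http://sourceforge.net/projects/htmlparser> Project Page\n";
        System.out.println(stripLinks(contents));
        System.out.println(extractLinks(contents));
    }
}
```

I keep the angle brackets around each link here so the list holds the same "<...>" form the rest of the program works with.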
Frames
To locate frames in the HTML we insert this code before the link
extracting code:
. . .
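Parser parser = new Parser(URL);
// Collect every frame tag in the page
Node[] list = parser.extractAllNodesThatAre(FrameTag.class);
if (list.length > 0) {
    for (int i = 0; i < list.length; i++) {
        // Store each frame's address in the same "<...>" form as the other links
        String link = ((FrameTag) list[i]).getFrameLocation();
        links.add("<" + link + ">");
    }
    // Frames found: skip the normal link extraction below
    return;
}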
. . .
(extract links from normal HTML Page)
Using the Parser class like this shows you how to
extract any tag(s) from an HTML page.
So now we're ready to crawl the web!
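To give an idea of how the extracted links could drive a crawl, here is a rough sketch of a breadth-first loop. Everything in it is an assumption of mine: the CrawlSketch class, the crawl method, and the linkExtractor function, which stands in for the StringExtractor and frame code above. In main it is replaced by a small in-memory map of made-up page names so the sketch runs without network access.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical breadth-first crawl loop; the link extractor is supplied by the caller. */
final class CrawlSketch {

    /** Visits pages breadth-first from startUrl, stopping after maxPages pages. */
    static List<String> crawl(String startUrl, int maxPages,
                              Function<String, List<String>> linkExtractor) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();   // URLs already queued, to avoid loops
        List<String> visited = new ArrayList<>();
        queue.add(startUrl);
        seen.add(startUrl);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.remove();
            visited.add(url);
            for (String link : linkExtractor.apply(url)) {
                if (seen.add(link)) {         // add() returns false for duplicates
                    queue.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Made-up "web" used in place of real pages
        Map<String, List<String>> web = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("a", "d"));
        System.out.println(crawl("a", 10, u -> web.getOrDefault(u, List.of())));
    }
}
```

The set of seen URLs is what keeps the loop from going around in circles when two pages link to each other, and the page limit keeps a first experiment from wandering off across the whole web.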