Unweaving a Tangled Web With HTMLParser and Lucene

The HTMLParser

The HTMLParser home page has a link for downloading version 1.4, a zip file of about 3 MB. It contains not only the JAR file you need, but also the utilities and the documentation. I suggest that you unzip all of it so that you have the JavaDoc at hand. Then put htmlparser.jar in your classpath.

A really nice feature is that HTMLParser ships with six sample programs, which help you understand how to use the package. Their source is in the src.zip file in the download. For our program, we'll use one of the samples, the StringExtractor program, as the starting point. As its name suggests, it extracts the text from a web page, and as an option it also extracts the links in the page.

Getting the HTML body text and the links from a page is really simple. Let's get the HTML from the HTMLParser's own home page:

String URL = "http://htmlparser.sourceforge.net";
StringExtractor se = new StringExtractor (URL);
String contents = se.extractStrings(true);
System.out.println(contents);

If we run this code we get a somewhat disappointing result:

HTMLParser Home Page
SORRY! YOU NEED FRAMES ENABLED BROWSER FOR THIS WEBSITE"

The StringExtractor obviously doesn't handle frames, so to handle them we need something more. Be patient: the Frames section below gives the hints you need to cover frames as well.

For now we'll use the StringExtractor with the URL of the left frame (the menu). If we run the program with the menu's URL, which is http://htmlparser.sourceforge.net/panel.html, we see the following output:

NAVIGATION PAGE
About HTMLParser
<http://htmlparser.sourceforge.net/main.html> Welcome
<http://sourceforge.net/projects/htmlparser> Project Page
<http://htmlparser.sourceforge.net/contributors.html> 
Contributors
<http://htmlparser.sourceforge.net/joinus.html> 
Join this Project
Downloads
. . . 

The first line is the TITLE from the HTML page, followed by the text and the links. A nice thing is that we always get the link addresses back as full URLs, which relieves us of the task of constructing URLs from relative addresses.
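To see what that task would otherwise involve, here is a minimal sketch that uses only the standard java.net.URI class, not HTMLParser. The class name RelativeLinkDemo and the relative address "main.html" are assumptions made for illustration; the actual markup of the menu page may differ.

```java
import java.net.URI;

class RelativeLinkDemo {

    // Resolve an address found in a page against the URL of that page.
    // An address that is already absolute is returned unchanged.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://htmlparser.sourceforge.net/panel.html";

        // Hypothetical relative address, as it might appear in the menu page
        System.out.println(resolve(page, "main.html"));

        // An absolute address passes through untouched
        System.out.println(resolve(page, "http://sourceforge.net/projects/htmlparser"));
    }
}
```

With the StringExtractor we never have to write this step ourselves, because the links arrive already resolved.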

The next thing we need to do is to get the text without the links, and the links without the text. I've used regular expressions for this, because it makes the coding very simple. The pattern to use is "look for a '<' and a '>' with anything in between except another '>'":

. . .
Pattern p = Pattern.compile("<[^>]*>");
Matcher m = p.matcher(contents);

// Replace links with a space
String text = m.replaceAll(" ");

// replaceAll() changes the matcher's state, so reset it
// before scanning the same input again
m.reset();

// Get all links
List links = new ArrayList();
while(m.find()) {
  links.add(m.group());
}
. . . 
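The fragment above can be tried without HTMLParser by feeding it a piece of the output shown earlier. The following is a sketch only: the class name LinkSplitter and its two helper methods are hypothetical names, and the code assumes Java 7 or later for the generic list syntax. Each helper creates its own Matcher, so no reset is needed between the two passes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkSplitter {

    // '<', then anything except '>', then '>'
    private static final Pattern LINK = Pattern.compile("<[^>]*>");

    // Return the contents with every <...> entry replaced by a space.
    static String textOnly(String contents) {
        return LINK.matcher(contents).replaceAll(" ");
    }

    // Return every <...> entry, angle brackets included.
    static List<String> linksOnly(String contents) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(contents);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        // A few lines copied from the StringExtractor output above
        String contents = "NAVIGATION PAGE\n"
                + "About HTMLParser\n"
                + "<http://htmlparser.sourceforge.net/main.html> Welcome\n"
                + "<http://sourceforge.net/projects/htmlparser> Project Page\n";

        System.out.println(textOnly(contents));
        System.out.println(linksOnly(contents));
    }
}
```

Note that the collected links keep their surrounding '<' and '>', so a crawler would have to strip those two characters before requesting the address.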

Frames

To locate frames in the HTML, we insert this code before the link-extraction code:

. . .
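Parser parser = new Parser (URL);
// Collect every frame tag in the page
Node[] list = parser.extractAllNodesThatAre (FrameTag.class);

if (list.length > 0) {
  // A frameset page: store each frame's address in the same
  // "<address>" form that the StringExtractor produces
  for (int i = 0; i < list.length; i++) {
    String link = ( (FrameTag)list[i] ).getFrameLocation();
    links.add("<" + link + ">");
  }
  // Skip the normal link extraction for a frameset page
  return;
}
. . .
// (extract links from a normal HTML page)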

The same Parser call, given a different tag class, extracts any other kind of tag from an HTML page.

So now we're ready to crawl the web!
