Unweaving a Tangled Web With HTMLParser and Lucene

by Keld H. Hansen

Introduction

Ever wanted to write a Java program that crawls the web? You know, a program that reads HTML pages, retrieves the links, gets the new pages with more links, and so on. Maybe you have also thought about storing the text from the HTML pages for later use, for example to search the pages for specific information. These are the characteristics of a search engine like Google or Yahoo. If you have a web site of your own, you might be interested in having your own search engine. One possibility is to buy one or use an Open Source search engine, but you might also find it rewarding to write your own!

In this article I'll show you the basic techniques for building a search engine using two powerful Open Source products: HTMLParser and Lucene.

Crawling the Web

The first step is to find out how to "crawl the web". That is: request a page over HTTP, receive the page, extract the text in the page, and harvest the links in the page. Then repeat this process for every link found. There are several ways to handle this task. Some of them are:

  1. Use the java.net.URLConnection class. This is a rather low-level approach that appeals to those who want absolute control over what's going on.
  2. Use the HttpClient from the Jakarta project. This open source product handles several situations for you that would otherwise need non-trivial coding. There's a feature list available if you want the details.
  3. Use the HTMLParser found on SourceForge.net. This product not only allows you to send a request and receive a response, but it'll also parse the HTML for you.

Since we need both to fetch pages and to pull the text and links out of them, the HTMLParser is a natural choice in our situation. It's not the only open source HTML parser available, but it's the best that I've found.
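Before turning to HTMLParser, it may help to see the crawl loop itself (request, harvest links, repeat) in its barest form. The following is only a sketch built on the low-level java.net approach from option 1, not HTMLParser's API. The class name CrawlSketch and the methods fetch, extractLinks and crawl are hypothetical names chosen for illustration, and the regular expression is a deliberately naive stand-in for a real HTML parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class CrawlSketch {

    // Naive match for href="..." or href='...'; assumption: good enough for a demo only.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Fetches a page with java.net.URLConnection (assumes UTF-8 content). */
    static String fetch(String url) {
        try {
            URLConnection conn = URI.create(url).toURL().openConnection();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                return in.lines().collect(Collectors.joining("\n"));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the absolute http/https links found in html, resolved against baseUrl. */
    static List<String> extractLinks(String baseUrl, String html) {
        List<String> links = new ArrayList<>();
        URI base = URI.create(baseUrl);
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                URI resolved = base.resolve(m.group(1).trim());
                String scheme = resolved.getScheme();
                if ("http".equals(scheme) || "https".equals(scheme)) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // Skip hrefs that are not valid URIs.
            }
        }
        return links;
    }

    /** Breadth-first crawl; returns the URLs attempted, in the order they were dequeued. */
    static Set<String> crawl(String start, int maxPages, Function<String, String> fetcher) {
        Set<String> visited = new LinkedHashSet<>();
        Queue<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.remove();
            if (!visited.add(url)) {
                continue; // already seen this URL
            }
            try {
                queue.addAll(extractLinks(url, fetcher.apply(url)));
            } catch (RuntimeException e) {
                // A page that cannot be fetched contributes no links.
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // In-memory pages stand in for the network so the demo runs offline.
        Map<String, String> pages = new HashMap<>();
        pages.put("http://example.test/", "<a href=\"a.html\">A</a> <a href='/b.html'>B</a>");
        pages.put("http://example.test/a.html", "<a href=\"/\">home</a>");
        System.out.println(crawl("http://example.test/", 10, u -> pages.getOrDefault(u, "")));
    }
}
```

To crawl real pages, pass CrawlSketch::fetch as the fetcher instead of the in-memory lambda. The page limit and the visited set are what keep the loop from running forever on a web full of circular links. A production crawler would also need politeness delays, robots.txt handling, and proper HTML parsing, which is where a dedicated parser earns its keep.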
