Unweaving a Tangled Web With HTMLParser and Lucene

Completing the crawler

The crawler algorithm goes something like this (pseudo-code):

  1. Set current address to the chosen start address
  2. Use StringExtractor to get text and links for the current address
  3. Mark the current address as "crawled" (use a HashMap for this)
  4. For each address in the set of links that is not yet marked as "crawled" do:
  5.     Make this the current address
  6.     Go to 2 (in effect, a recursive call)
  7. End For

This will obviously go on forever if we don't include some stop condition. It's normal practice to define a "depth" parameter, indicating how far down in the tree-structure of links we want to crawl. We'll also add a boolean parameter that tells whether we want to crawl external addresses. We define an external address as an address whose prefix doesn't match the prefix of the start address.
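The algorithm with both stop conditions might be sketched as follows. This is only an illustration, not the article's actual crawler source: the class name `CrawlerSketch`, the `extractLinks` function (a stand-in for the StringExtractor step), and the choice to treat the whole start address as the "prefix" are all assumptions. A `Set` is used here in place of the HashMap mentioned above, since only the addresses themselves need to be remembered.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Depth-limited crawler sketch; every name in this class is hypothetical. */
class CrawlerSketch {
    // Stand-in for "use StringExtractor to get links": maps an address to its links.
    private final Function<String, List<String>> extractLinks;
    private final int maxDepth;           // 0 = crawl the start address only
    private final boolean crawlExternal;  // follow addresses outside the start prefix?
    private final Set<String> crawled = new LinkedHashSet<>();
    private String prefix;

    CrawlerSketch(Function<String, List<String>> extractLinks,
                  int maxDepth, boolean crawlExternal) {
        this.extractLinks = extractLinks;
        this.maxDepth = maxDepth;
        this.crawlExternal = crawlExternal;
    }

    /** Crawls from startAddress and returns the addresses visited, in visiting order. */
    Set<String> crawl(String startAddress) {
        prefix = startAddress; // assumption: the start address itself is the prefix
        crawled.clear();
        visit(startAddress, 0);
        return crawled;
    }

    private void visit(String address, int depth) {
        if (depth > maxDepth) {
            return; // stop condition: too deep in the link tree
        }
        if (!crawlExternal && !address.startsWith(prefix)) {
            return; // external address, and we were asked to skip those
        }
        if (!crawled.add(address)) {
            return; // already marked as crawled
        }
        for (String link : extractLinks.apply(address)) {
            visit(link, depth + 1); // the "Go to 2" step, as a recursive call
        }
    }
}
```

Calling `new CrawlerSketch(extractLinks, 2, false).crawl(start)` would visit the start page, the pages it links to, and the pages those link to, skipping anything whose address does not begin with the start address.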

Valid links

The syntax of the HTML link tag (the A tag) allows for some variants that we are not interested in. The link address can, for example, be a JavaScript expression, so we'll add some code that checks whether an address starts with "http://" or "https://". Only these addresses will be crawled. Bookmarks (the "#name" part at the end of an address, which points to a position within a page) must also be handled to avoid crawling the same page more than once. There are probably other things you'd have to consider if your web crawler is going to compete with Google!
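These two checks could look roughly like the following. It is a minimal sketch, not code from the article's crawler: the class and method names are hypothetical, and comparing the scheme case-insensitively is an extra assumption that goes slightly beyond the check described above.

```java
import java.util.Locale;

/** Link checks for the crawler; class and method names are hypothetical. */
class LinkFilter {

    /** True only for addresses that start with "http://" or "https://". */
    static boolean isCrawlable(String address) {
        if (address == null) {
            return false;
        }
        // Assumption: accept "HTTP://..." too, since URL schemes are case-insensitive.
        String lower = address.trim().toLowerCase(Locale.ROOT);
        return lower.startsWith("http://") || lower.startsWith("https://");
    }

    /** Removes a trailing "#bookmark" so the same page is not crawled twice. */
    static String stripBookmark(String address) {
        int hash = address.indexOf('#');
        return hash < 0 ? address : address.substring(0, hash);
    }
}
```

With these helpers, an address such as `javascript:void(0)` is rejected, and `http://example.com/page.html#top` is reduced to `http://example.com/page.html` before it is looked up in the set of crawled addresses.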

You'll find a link to the source of a complete crawler in the resources section at the end of the article.

If you run the crawler, you'll notice that the number of pages crawled per minute is rather modest. This is not due to bad performance of the HTMLParser, but simply because an HTTP request on the Internet takes some time to complete. To speed up crawling, you'd have to manage a set of threads, each handling one web page at a time. This is not part of my demo program, but it is safe to do, since most of the HTMLParser classes are thread-safe. You may even update a Lucene index while it's being queried by other programs.
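One possible way to manage such a set of threads is a fixed-size thread pool from `java.util.concurrent`. The sketch below is not part of the demo program, and it makes several assumptions: `ParallelFetchSketch` and `fetchAll` are hypothetical names, and `fetchPage` stands in for whatever code downloads and parses a single page (it must not return null).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Fetches several pages in parallel; every name in this class is hypothetical. */
class ParallelFetchSketch {

    /** Returns a map from each address to the text that fetchPage produced for it. */
    static Map<String, String> fetchAll(List<String> addresses,
                                        Function<String, String> fetchPage,
                                        int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Map<String, String> pages = new ConcurrentHashMap<>(); // safe for concurrent writes
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (String address : addresses) {
                // Each task handles exactly one web page.
                pending.add(pool.submit(() -> {
                    pages.put(address, fetchPage.apply(address));
                }));
            }
            for (Future<?> task : pending) {
                task.get(); // wait for completion; rethrows a failure from the task
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while fetching pages", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A page fetch failed", e.getCause());
        } finally {
            pool.shutdown(); // always release the worker threads
        }
        return pages;
    }
}
```

Because the slow part is waiting for HTTP responses, even a small pool (a handful of threads) should let several requests be in flight at once instead of one after another.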

Until now we've only been web crawling, which is fun, but not very useful, unless you store the information you find on your way. Enter Lucene.
