Unweaving a Tangled Web With HTMLParser and Lucene

Jakarta Lucene 

Lucene, from the Apache Jakarta project, is a high-performance indexing and search engine. "Indexing" means splitting text into individual words and storing them in a kind of directory (the index). This directory can then be used for fast lookup of words or combinations of words. Indexing and searching are by no means simple technologies, yet Lucene is very easy to use. If you stick to the default settings, you need only a few lines of code to build a simple indexer and searcher. If you want to refine your program, Lucene's very open architecture offers plenty of possibilities.
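To make the idea of an index concrete, here is a minimal sketch in plain Java of the kind of word-to-document directory described above. It is a toy model and not Lucene code: the class ToyIndex and its methods are hypothetical names chosen for illustration, and a real index handles far more (ranking, stemming, storage on disk).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: maps each word to the ids of the documents containing it.
// NOT Lucene code; all names here are hypothetical.
class ToyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<String> documents = new ArrayList<>();

    // Splits the text into lowercase words and records which document each word came from.
    int add(String text) {
        int id = documents.size();
        documents.add(text);
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    // Looks a single word up directly in the map instead of scanning every document.
    Set<Integer> search(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), new TreeSet<>());
    }
}
```

Adding two texts and then calling search("lucene") returns the ids of every text that contains the word, without rereading the texts themselves. That direct lookup is what makes searching an index fast.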

My goal will be to produce some simple code for the web crawler we've just completed. I'll therefore go rather quickly through the Lucene classes to be used. The resources section contains links to several articles that dig deeper into the Lucene API.

The most important classes are:

IndexWriter for creating and building the index

IndexSearcher for querying the index

Document objects are what you give to the IndexWriter and what you get back from the IndexSearcher.

Query contains a query (in parsed form)

QueryParser constructs a Query object (from a String)

Analyzers contain the policies for extracting the words, or tokens, from the source documents. Both the IndexWriter and the QueryParser use an Analyzer.

NOTE: The IndexWriter and QueryParser must use the same type of Analyzer, or you may get the wrong results from your query.
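Why the analyzer types must match can be sketched with a toy model in plain Java. This is not Lucene code: the two "analyzers" below are hypothetical stand-ins, one that lowercases tokens and one that preserves case, and the class and method names are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

// Toy model of the analyzer-mismatch problem. NOT Lucene code; names are hypothetical.
class AnalyzerMismatchDemo {
    // Stand-in analyzer A: splits on whitespace and lowercases each token.
    static final Function<String, String[]> LOWERCASING =
        text -> text.toLowerCase(Locale.ROOT).trim().split("\\s+");

    // Stand-in analyzer B: splits on whitespace but keeps the original case.
    static final Function<String, String[]> CASE_PRESERVING =
        text -> text.trim().split("\\s+");

    // Builds the set of indexed terms for one document.
    static Set<String> index(String text, Function<String, String[]> analyzer) {
        Set<String> terms = new HashSet<>();
        for (String token : analyzer.apply(text)) {
            terms.add(token);
        }
        return terms;
    }

    // A query matches only if every analyzed query token is an indexed term.
    static boolean matches(Set<String> terms, String query,
                           Function<String, String[]> analyzer) {
        for (String token : analyzer.apply(query)) {
            if (!terms.contains(token)) {
                return false;
            }
        }
        return true;
    }
}
```

If "Jakarta Lucene" is indexed with LOWERCASING, the stored terms are "jakarta" and "lucene". The query "Lucene" matches when it is analyzed with LOWERCASING, but not with CASE_PRESERVING, because the token "Lucene" was never stored in that form.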

The first item we must create is a Document object to hold the text we've extracted from a web page. We'll also store the URL for the web page, to make it possible to locate the page on the web:

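import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
. . .
// One Document per crawled page
Document document = new Document();
// Page text: tokenized, indexed, and stored
document.add(Field.Text("text", text));
// Page address: indexed as a single untokenized term, and stored
document.add(Field.Keyword("URL", URL));
. . .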

This code shows that a Document object consists of a set of Field objects. Despite the initial capital letter, "Text" and "Keyword" are actually static methods that return a Field object. There are six such Field-returning methods, for various purposes:

Name of method | Purpose
Keyword(String name, Date value) | Constructs a Date-valued Field that is not tokenized, but is indexed and stored in the index, for return with hits.
Keyword(String name, String value) | Constructs a String-valued Field that is not tokenized, but is indexed and stored. Useful for non-text fields, e.g. a URL.
Text(String name, Reader value) | Constructs a Reader-valued Field that is tokenized and indexed, but is not stored in the index verbatim. Useful for longer text fields, like "body".
Text(String name, String value) | Constructs a String-valued Field that is tokenized and indexed, and is stored in the index, for return with hits. Useful for short text fields, like "title" or "subject".
UnIndexed(String name, String value) | Constructs a String-valued Field that is neither tokenized nor indexed, but is stored in the index, for return with hits.
UnStored(String name, String value) | Constructs a String-valued Field that is tokenized and indexed, but is not stored in the index.
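The table above boils down to three yes/no properties per method: tokenized, indexed, and stored. As a quick reference, they can be summarized in a small plain-Java enum. This is only a sketch and not part of the Lucene API; FieldKind and its members are hypothetical names.

```java
// Toy summary of the six Field-returning methods. NOT part of the Lucene API;
// all names here are hypothetical.
enum FieldKind {
    KEYWORD(false, true, true),          // both Keyword(...) variants
    TEXT_FROM_STRING(true, true, true),  // Text(String name, String value)
    TEXT_FROM_READER(true, true, false), // Text(String name, Reader value)
    UNINDEXED(false, false, true),       // UnIndexed(String name, String value)
    UNSTORED(true, true, false);         // UnStored(String name, String value)

    final boolean tokenized;
    final boolean indexed;
    final boolean stored;

    FieldKind(boolean tokenized, boolean indexed, boolean stored) {
        this.tokenized = tokenized;
        this.indexed = indexed;
        this.stored = stored;
    }

    // A field can be searched only if it is indexed.
    boolean searchable() {
        return indexed;
    }

    // A field's original value comes back with the hits only if it is stored.
    boolean returnedWithHits() {
        return stored;
    }
}
```

Read this way, the earlier choice follows naturally: the page URL is a Keyword because it should be searchable as one exact value and returned with each hit, while the page text is tokenized so that individual words can be found.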
