Jakarta Lucene
Lucene, from the Apache Jakarta project, is a high-performance
indexing and search engine. "Indexing" means splitting
sentences into individual words and storing them in some
kind of directory. This directory can then be used for fast
lookup of words or combinations of words. Indexing and searching
are by no means simple technologies, yet Lucene is very easy to
use. If you stick to the default settings, you need only a few lines of
code to complete a simple indexer and searcher. If you
want to refine your program, there are plenty of possibilities,
thanks to Lucene's very open architecture.
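To make the idea of a word "directory" concrete, here is a minimal sketch of an inverted index in plain Java. It is only an illustration of the concept described above: the class name TinyInvertedIndex and its methods are made up for this example, and Lucene's real index format is far more elaborate.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index (hypothetical class, not part of Lucene):
// maps each lower-cased word to the ids of the documents that contain it.
class TinyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<String> documents = new ArrayList<>();

    // Splits the text into words and records which document each word came from.
    int add(String text) {
        int id = documents.size();
        documents.add(text);
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    // Returns the ids of the documents that contain every one of the given words.
    Set<Integer> search(String... words) {
        Set<Integer> result = null;
        for (String word : words) {
            Set<Integer> ids = postings.getOrDefault(word.toLowerCase(Locale.ROOT), new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(ids);
            } else {
                result.retainAll(ids);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("Lucene is a search engine");          // document 0
        index.add("A web crawler fetches pages");        // document 1
        index.add("The crawler feeds the search index"); // document 2
        System.out.println(index.search("search"));            // [0, 2]
        System.out.println(index.search("crawler", "search")); // [2]
    }
}
```

Looking up a word is a single map access, and a combination of words is an intersection of the per-word id sets, which is why this kind of structure makes lookups fast.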
My goal will be to produce some simple code for the web crawler
we've just completed. I'll therefore go rather quickly through
the Lucene classes to be used. The resources section contains
links to several articles that dig deeper into the Lucene API.
The most important classes are:
IndexWriter: creates and builds the index.
IndexSearcher: queries the index.
Document: what you give to the IndexWriter and what you get back from the IndexSearcher.
Query: holds a query in parsed form.
QueryParser: constructs a Query object from a String.
Analyzer: contains the policy for extracting the words or tokens from the source documents. Both the IndexWriter and the QueryParser use an Analyzer.
NOTE: The IndexWriter and QueryParser
must use the same type of Analyzer, or you
may get the wrong results from your query.
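The reason behind this rule can be sketched in plain Java without any Lucene classes. In the example below, the two "analyzers" are hypothetical stand-ins (simple functions that turn text into tokens); the names AnalyzerMismatchDemo, LOWERCASING and CASE_PRESERVING are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

// Hypothetical stand-ins for analyzers; these are not Lucene classes.
class AnalyzerMismatchDemo {
    // "Analyzer" A: splits on whitespace and lower-cases every token.
    static final Function<String, List<String>> LOWERCASING = text -> tokenize(text, true);
    // "Analyzer" B: splits on whitespace but keeps the original case.
    static final Function<String, List<String>> CASE_PRESERVING = text -> tokenize(text, false);

    static List<String> tokenize(String text, boolean lowerCase) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(lowerCase ? token.toLowerCase(Locale.ROOT) : token);
            }
        }
        return tokens;
    }

    // Indexes the text with one analyzer and looks the query up with another.
    static boolean matches(String text, String query,
                           Function<String, List<String>> indexAnalyzer,
                           Function<String, List<String>> queryAnalyzer) {
        Set<String> indexedTerms = new HashSet<>(indexAnalyzer.apply(text));
        return indexedTerms.containsAll(queryAnalyzer.apply(query));
    }

    public static void main(String[] args) {
        String page = "Jakarta Lucene Tutorial";
        // Same analyzer on both sides: the query term "lucene" is found.
        System.out.println(matches(page, "Lucene", LOWERCASING, LOWERCASING));     // true
        // Different analyzers: the index holds "lucene", the query asks for "Lucene".
        System.out.println(matches(page, "Lucene", LOWERCASING, CASE_PRESERVING)); // false
    }
}
```

The second call misses even though the word is on the page, because the terms written to the index and the terms produced from the query no longer agree.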
The first item we must create is a Document object
to hold the text we've extracted from a web page. We'll also
store the URL for the web page, to make it possible to locate
the page on the web:
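. . .
// One Document per crawled web page.
Document document = new Document();
// Page text: tokenized and indexed so it can be searched by word.
document.add(Field.Text("text", text));
// Page URL: indexed as a single, untokenized value and stored with the document.
document.add(Field.Keyword("URL", URL));
. . .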
This code shows that a Document object consists of
a set of Field objects. Despite their initial
capital letters, "Text" and
"Keyword" are actually static methods
that return a Field object. There are six such
Field-returning methods, for various purposes:
| Name of method | Purpose |
| Keyword(String name, Date value) | Constructs a Date-valued Field that is not tokenized and is indexed, and stored in the index, for return with hits. |
| Keyword(String name, String value) | Constructs a String-valued Field that is not tokenized, but is indexed and stored. Useful for non-text fields, e.g. a url. |
| Text(String name, Reader value) | Constructs a Reader-valued Field that is tokenized and indexed, but is not stored in the index verbatim. Useful for longer text fields, like "body". |
| Text(String name, String value) | Constructs a String-valued Field that is tokenized and indexed, and is stored in the index, for return with hits. Useful for short text fields, like "title" or "subject". |
| UnIndexed(String name, String value) | Constructs a String-valued Field that is not tokenized nor indexed, but is stored in the index, for return with hits. |
| UnStored(String name, String value) | Constructs a String-valued Field that is tokenized and indexed, but that is not stored in the index. |
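The table above boils down to three yes/no properties per method: tokenized, indexed, and stored. The following plain-Java sketch restates those flags so that a suitable method can be picked from a set of requirements. FieldFlagsDemo, Kind and matching are hypothetical names invented for this illustration; they are not part of the Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java restatement of the table above (hypothetical types, not Lucene classes).
class FieldFlagsDemo {
    enum Kind {
        KEYWORD(false, true, true),     // Keyword(String, String) and Keyword(String, Date)
        TEXT_STRING(true, true, true),  // Text(String, String)
        TEXT_READER(true, true, false), // Text(String, Reader)
        UNINDEXED(false, false, true),  // UnIndexed(String, String)
        UNSTORED(true, true, false);    // UnStored(String, String)

        final boolean tokenized; // split into words before indexing
        final boolean indexed;   // can be searched
        final boolean stored;    // original value is returned with hits

        Kind(boolean tokenized, boolean indexed, boolean stored) {
            this.tokenized = tokenized;
            this.indexed = indexed;
            this.stored = stored;
        }
    }

    // Lists the kinds whose flags match the given requirements exactly.
    static List<Kind> matching(boolean tokenized, boolean indexed, boolean stored) {
        List<Kind> result = new ArrayList<>();
        for (Kind kind : Kind.values()) {
            if (kind.tokenized == tokenized && kind.indexed == indexed && kind.stored == stored) {
                result.add(kind);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A URL: searchable as one exact value and returned with hits.
        System.out.println(matching(false, true, true)); // [KEYWORD]
        // A long page body: searchable by word, original text not kept in the index.
        System.out.println(matching(true, true, false)); // [TEXT_READER, UNSTORED]
    }
}
```

Read this way, the crawler example uses a Keyword field for the URL because the address must come back unchanged with each hit, while the page text only needs to be searchable by word.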