Ever wanted to write a Java program that crawls the web? You know a program that reads HTML-pages, retrieves the links, gets the new pages--with more links and so on. Maybe you also have thought about storing the text from the HTML pages for later use, to be able to search for specific information in the pages for example. These are the characteristics of a search engine like Google or Yahoo. If you have a web site of your own you might be interested in having your own search engine. One possibility is to buy one, or use an Open Source search engine, but you might also find it rewarding to write your own!
In this article I'll show you the basic technique in building a search engine using two powerful Open Source products: HTMLParser and Lucene.
The first step is to find out how to "crawl the web". That is: request a page using the HTTP protocol, receive the page, extract the text in the page, and harvest the links in the page. Then repeat this process for every link found. There are several ways to handle this task, some of them are:
So in our situation the HTMLParser is a natural choice. It's not the only open source HTML parser available, but it's the best that I've found.
On the HTMLParser home page there's a link for downloading version 1.4. It's a zip file about 3 MB in size. It not only contains the jar-file you need, but also the utilities and documentation. I suggest that you unzip all of it to have the JavaDoc at hand. Then put the htmlparser.jar in your classpath.
A really nice built-in feature is that HTMLParser has 6 sample
programs which really help you understand how to use the
package. The source is found in the src.zip file in the
download. For our program, we'll use one of the sample programs-
-the StringExtractor program as the starting
point. As its name depicts it will extract the text from a web
page, but as an option it will also extract the links in the page.
Getting the HTML body text and the links from a page is really simple. Let's get the HTML from the HTMLParser's own home page:
String URL = "http://htmlparser.sourceforge.net"; StringExtractor se = new StringExtractor (URL); String contents = se.extractStrings(true); System.out.println(contents);
If we run this code we get a somewhat disappointing result:
HTMLParser Home Page SORRY! YOU NEED FRAMES ENABLED BROWSER FOR THIS WEBSITE"
The StringExtractor obviously doesn't handle
frames, so to handle them we need something more. Be patient and
I'll soon give the hints you need to cover frames as well.
For now we'll use the StringExctractor with the url of
the left frame (the
menu), which top part looks like this:
If we run the program with the menu's URL, which is http:// htmlparser.sourceforge.net/panel.html, we see the following output:
NAVIGATION PAGE About HTMLParser <http://htmlparser.sourceforge.net/main.html> Welcome <http://sourceforge.net/projects/htmlparser> Project Page <http://htmlparser.sourceforge.net/contributors.html> Contributors <http://htmlparser.sourceforge.net/joinus.html> Join this Project Downloads . . .
The first line is the TITLE from the HTML page and then follows text and links. A nice thing is that we always get the link addresses back as full URLs. This relieves us from the task of constructing URLs from relative addresses.
The next thing we need to do is to get the text without the links--and the links without the text. I've used regular expressions for this, because it makes coding very simple. The pattern to use is "look for '<' and '>' with something in between--except for a '>'":
. . .
Pattern p = Pattern.compile("<[^>]*>");
Matcher m = p.matcher(contents);
// Replace links with a space
String text = m.replaceAll(" ");
// Get all links
List links = new ArrayList();
while(m.find()) {
links.add(m.group());
}
. . .
To locate frames in the HTML we insert this code before the link extracting code:
. . .
Parser parser = new Parser (URL);
Node[] list = parser.extractAllNodesThatAre (FrameTag.class);
if (list.length > 0) {
for (int i = 0; i < list.length; i++) {
String link = ( (FrameTag)list[i] ).getFrameLocation();
links.add("<" + link + ">");
}
return;
}
. . .
(extract links from normal HTML Page)
Using the Parser class like this shows you how to
extract any tag(s) from an HTML page.
So now we're ready to crawl the web!
The crawler algorithm goes something like this (pseudo-code):
StringExtractor to get text and
links for the current addressThis will obviously go on forever if we don't include some stop condition. It's normal practice to define a "depth" parameter, indicating how far down in the tree-structure of links we want to crawl. We'll also add a boolean parameter that tells whether we want to crawl external addresses. We define an external address as an address whose prefix doesn't match the prefix of the start address.
The syntax of the HTML link-tag (the A-tag) allows for some variants which we are not interested in. The link address can, for example, be a JavaScript expression, so we'll add some code that checks if an address starts with "http://" or https://". Only these addresses will be crawled. Bookmarks must also be handled to avoid crawling a page more than once. There are probably even other things you'd have to consider if your web crawler is going to compete with Google!
You'll find a link to the source of a complete crawler in the resources section at the end of the article.
If you run the crawler you'll notice that the number of pages crawled per minute is rather modest. This is not due to bad performance of the HTMLParser, but simply because an HTTP request on the Internet takes some time to complete. To speed up crawling you'd have to manage a set of threads, each handling a web page. This is not part of my demo program, but is safe to do, since most of the HTMLParser classes are thread-safe. You may even update an index while it's being queried by other programs.
Until now we've only been web crawling, which is fun, but not very useful, unless you store the information you find on your way. Enter Lucene.
Lucene from the Apache Jakarta project is a high-performance indexing and search engine. "Indexing" means to split sentences into the individual words for storing them in some kind of directory. This directory can then be used for fast lookup of words or combination of words. Indexing and searching is by no means simple technology, yet Lucene is very easy to use. If you stick to the default settings you need few lines of code to have a simple indexer and searcher completed. If you want to refine your program there are plenty of possibilities thanks to the very open architecture of Lucene.
My goal will be to produce some simple code for the web crawler we've just completed. I'll therefore go rather quickly through the Lucene classes to be used. The resources section contains links to several articles that dig deeper into the Lucene API.
The most important classes are:
IndexWriter for creating and building the index
IndexSearcher for querying the index
Documents are what you give to the IndexWriter
and what you get from the IndexSearcher.
Query contains a query (in parsed form)
QueryParser constructs a Query object (from
a String)
Analyzers contain the policies for extracting the words
or tokens from the source documents. The IndexWriter and
QueryParser uses the Analyzers.
NOTE: The IndexWriter and QueryParser
must use the same type of Analyzer, or you
may get the wrong results from your query.
The first item we must create is a Document object
to hold the text we've extracted from a web page. We'll also
store the URL for the web page, to make it possible to locate
the page on the web:
. . .
Document document = new Document();
document.add(Field.Text("text", text));
document.add(Field.Keyword("URL", URL));
. . .
This code shows that a Document object consists of
a set of Field objects. Despite the beginning
capital letter, "Text" and
"Keyword" are actually (static) methods
which return a Field object. There are six such
Field returning methods, for various purposes:
| Name of method | Purpose |
Keyword(String name, Date value) |
Constructs a Date-valued Field that is not tokenized and is
indexed, and stored in the index, for return with hits. |
Keyword(String name, String value) |
Constructs a String-valued Field that is not tokenized, but
is indexed and stored. Useful for non-text fields, e.g. a url. |
Text(String name, Reader value) |
Constructs a Reader-valued Field that is tokenized and
indexed, but is not stored in the index verbatim. Useful for longer text
fields, like "body". |
Text(String name, String value) |
Constructs a String-valued Field that is tokenized and
indexed, and is stored in the index, for return with hits. Useful for short
text fields, like "title" or "subject". |
UnIndexed(String name, String value) |
Constructs a String-valued Field that is not tokenized nor
indexed, but is stored in the index, for return with hits |
UnStored(String name, String value) |
Constructs a String-valued Field that is tokenized and
indexed, but that is not stored in the index. |
You should index what you'd like to search for (and the index values will then be stored in the directory), but to save disk space you should only store data that you can not otherwise easily recover again. In the code above, I've chosen to store the text from the web page because I later want to be able to search it easily.
Next step is to index the Document objects. You do
this by giving the Document object(s) to the
IndexWriter. When you're finished you close the
writer:
. . . Analyzer analyzer = new StandardAnalyzer(); String indexDir = "D:\\LuceneIndex"; IndexWriter writer = new IndexWriter(indexDir, analyzer, true); . . . writer.addDocument(document); . . . (add more documents) writer.optimize(); writer.close(); . . .
The first parameter to IndexWriter is where you
want the index stored in your file system. The second parameter
is the analyzer we want to use. I've chosen the
StandardAnalyzer which is a rather sophisticated
analyzer, which of course includes lowercasing of words. The
third parameter must be true if the index must be
created as new. By setting this parameter to false
you may add to an existing index.
That's it - the index is created.
To make a query against the index you must pass a Query object to an
IndexSearcher. Let's search for
the word "XML"
. . .
String indexDir = "D:\\LuceneIndex";
IndexSearcher is = new IndexSearcher(indexDir);
Analyzer analyzer = new StandardAnalyzer();
QueryParser parser = new QueryParser("text", analyzer);
String searchCriteria = "xml";
Query query = parser.parse(searchCriteria);
Hits hits = is.search(query);
. . .
Note that we must use the same Analyzer and
of course also the same directory. The QueryParser
is instantiated with the name of a (default) field:
"text". You may recall that this was the name we used
when we stored the web page text. The hits variable
is a set of Documents, so it's a simple thing to
loop through this set and write out the URL's for the web pages
that contains "xml":
. . .
for (int i = 0; i < hits.length(); i++) {
Document doc = hits.doc(i);
System.out.println("URL: " + doc.getField("URL").stringValue());
}
. . .
Indexing http://javaboutique.internet.com to a depth of one (which covers the main page and all pages linked to from the main page) gives 20 hits:
URL 1: http://javaboutique.internet.com/reviews/xml_javabeans URL 2: http://javaboutique.internet.com/tutorials/FOP URL 3: http://javaboutique.internet.com/demoIDEs URL 4: http://javaboutique.internet.com/tutorials . . . URL 19: http://javaboutique.internet.com/resources/custom.html URL 20: http://javaboutique.internet.com/newsletter
By the way: the Documents returned in
Hits are "ranked", i.e. those with most
hits are found first in the list.
I'm sure you'll admit that this was easy enough. But we did leave a few questions behind. So let's dig a bit deeper.
The process of parsing the text and finding all words is not as simple as one might think. Some of the things that you'll have to consider are:
Lucene offers a set of Analyzers all using
tokenizers and filters that serve various purposes:
SimpleAnalyzer: is an Analyzer that filters
LetterTokenizer (a tokenizer that divides text at non-letters) with
LowerCaseFilterStopAnalyzer: Filters LetterTokenizer with
LowerCaseFilter and StopFilter (removes "stop words" like
"the"and "a" from a token stream).StandardAnalyzer: Filters StandardTokenizer (a
grammar-based tokenizer constructed with
JavaCC) with StandardFilter
, LowerCaseFilter and StopFilter. WhitespaceAnalyzer: An Analyzer that uses
WhitespaceTokenizer (a tokenizer that divides text at whitespace.
Adjacent sequences of non-whitespace characters form tokens).
In the "Lucene sandbox" there are a couple of more
"experimental" Analyzers and Filters, for
example the Snowball
package. You'll also find Analyzers for several
spoken languages.
If you're interested in seeing the differences between the
various Analyzers then it's fairly simple to write
a small program that shows the tokens the Analyzers
produce. There's a program called AnalysisDemo in
this article that may be used for this. In most
situations, however, the StandardAnalyzer is what
you need.
Queries given to the QueryParser may be quite
complex. On the Lucene web site you can read more about the
syntax, but to give you an idea here are a few examples:
To learn more about the possibilities and how Lucene has implemented queries I'd recommend reading the article "QueryParser Rules".
Lucene finds the documents matching your query, but you'll often
also want to see the part of the text that contains the words
you're looking for. This information is not part of the results
returned from the IndexSearcher, and the regular
Lucene API does not really help you. In the Lucene mailing list I saw that the sandbox had a
contribution called Highlighter that could be used for this.
The Highlighter locates the queried words in the text and
highlights them where they occur in the text. If the results
are shown in a browser, highlighting can be done with a
<B> tag.
Here are some code fragments that show how to use the Highlighter:
. . .
String FIELD_NAME = "text";
Highlighter highlighter = new Highlighter(new MyBolder(),
new QueryScorer(query));
highlighter.setTextFragmenter(new SimpleFragmenter(20));
for (int i = 0; i < hits.length(); i++) {
System.out.println("URL " + (i + 1) + ": " +
hits.doc(i).getField("URL").stringValue());
String text = hits.doc(i).get(FIELD_NAME);
int maxNumFragmentsRequired = 2;
String fragmentSeparator = "...";
TokenStream tokenStream =
analyzer.tokenStream(FIELD_NAME, new StringReader(text));
String result =
highlighter.getBestFragments(
tokenStream,
text,
maxNumFragmentsRequired,
fragmentSeparator);
System.out.println("\t" + result);
}
. . .
The Highlighter uses a helper class called MyBolder, with a method that
highlights a phrase:
private static class MyBolder implements Formatter {
public String highlightTerm(
String originalTermText,
String stemmedTerm,
float score,
int startOffset) {
if (score <= 0)
return originalTermText;
return "<b>" + originalTermText + "</b>";
}
}
When this code is wrapped as a complete program and we now search http://javaboutique.internet.com for "xml AND castor" we'll get this printed out:
URL 1: http://javaboutique.internet.com/tutorials found in: with <b>Castor</b> <b>XML</b> If you're... <b>XML</b> to JavaBeans URL 2: http://javaboutique.internet.com/reviews/xml_javabeans found in: it to <b>Castor</b>-<b>XML</b>... <b>XML</b> to JavaBeans URL 3: http://javaboutique.internet.com/articles found in: with <b>Castor</b> <b>XML</b>.... Using <b>CASTOR</b>
As you can see the words searched for are "HTML-
highlighted", but you may of course do the highlighting the
way you want by modifying the highlightTerm method
above. With these building blocks it's simple to build a web
page with the URLs as links on the text like this:
with
Castor XML If you're... XML to
JavaBeans
it to Castor-XML... XML to
JavaBeans
with
Castor XML.... Using CASTOR
It begins to taste like a Google page, yet there is some way to go before we have a professional level solution.
Final note: when I had finished writing this article I read in the newsgroup that a new version of the Highlighter was available. Here's the announcement in the mailing list. This version is not compatible with the version I've just shown you, but it seems to be a more elegant solution. I'll therefore recommend that you get this version if you're into some serious work. Here's a download address and a JavaDoc address.
In this article I've shown how to build a web search engine by
first coding a web crawler using HTMLParser, and then doing the
indexing and queries with Lucene. Before you code your own
search engine you should carefully consider what you or your
customers would like to search for, and how the results should
be presented. With this knowledge you should pick the
appropriate Analyzer (or code your own!), and only
store the data you really need in the index.
The techniques shown can of course be used to index many other
kind of "documents", for example text files on your
hard disk. I'm not sure if Analyzers for MS Office
files like Word are available, but the Jakarta POI project
may have some contributions. Happy coding!
Keld H. Hansen is currently working as a web architect for one of the largest IT companies in Denmark. He battled with the mainframes during the 70's when they were the size of a gymnasium and had the power of your PalmPilot. He also struggled with CASE-tools in the 90s and now explores the cutting edge technology of the Web. While not busy at his computer he likes to vacation on the Greek islands.