Unweaving a Tangled Web With HTMLParser and Lucene

by Keld H. Hansen

Introduction

Ever wanted to write a Java program that crawls the web? You know a program that reads HTML-pages, retrieves the links, gets the new pages--with more links and so on. Maybe you also have thought about storing the text from the HTML pages for later use, to be able to search for specific information in the pages for example. These are the characteristics of a search engine like Google or Yahoo. If you have a web site of your own you might be interested in having your own search engine. One possibility is to buy one, or use an Open Source search engine, but you might also find it rewarding to write your own!

In this article I'll show you the basic technique in building a search engine using two powerful Open Source products: HTMLParser and Lucene.

Crawling the Web

The first step is to find out how to "crawl the web". That is: request a page using the HTTP protocol, receive the page, extract the text in the page, and harvest the links in the page. Then repeat this process for every link found. There are several ways to handle this task, some of them are:

  1. Use the java.net.URLConnection class. This is a rather low-level approach that appeals to those who want absolute control over what's going on.
  2. Use the HttpClient from the Jakarta project. This open source product will handle several situations for you which otherwise would need non-trivial coding. There's a feature list available if you want the details.
  3. Use the HTMLParser found on SourceForge.net. This product not only allows you to send a request and receive a response, but it'll also parse the HTML for you.

So in our situation the HTMLParser is a natural choice. It's not the only open source HTML parser available, but it's the best that I've found.

The HTMLParser

On the HTMLParser home page there's a link for downloading version 1.4. It's a zip file about 3 MB in size. It not only contains the jar-file you need, but also the utilities and documentation. I suggest that you unzip all of it to have the JavaDoc at hand. Then put the htmlparser.jar in your classpath.

A really nice built-in feature is that HTMLParser has 6 sample programs which really help you understand how to use the package. The source is found in the src.zip file in the download. For our program, we'll use one of the sample programs- -the StringExtractor program as the starting point. As its name depicts it will extract the text from a web page, but as an option it will also extract the links in the page.

Getting the HTML body text and the links from a page is really simple. Let's get the HTML from the HTMLParser's own home page:

String URL = "http://htmlparser.sourceforge.net";
StringExtractor se = new StringExtractor (URL);
String contents = se.extractStrings(true);
System.out.println(contents);

If we run this code we get a somewhat disappointing result:

HTMLParser Home Page
SORRY! YOU NEED FRAMES ENABLED BROWSER FOR THIS WEBSITE"

The StringExtractor obviously doesn't handle frames, so to handle them we need something more. Be patient and I'll soon give the hints you need to cover frames as well.

For now we'll use the StringExctractor with the url of the left frame (the menu), which top part looks like this:

If we run the program with the menu's URL, which is http:// htmlparser.sourceforge.net/panel.html, we see the following output:

NAVIGATION PAGE
About HTMLParser
<http://htmlparser.sourceforge.net/main.html> Welcome
<http://sourceforge.net/projects/htmlparser> Project Page
<http://htmlparser.sourceforge.net/contributors.html> 
Contributors
<http://htmlparser.sourceforge.net/joinus.html> 
Join this Project
Downloads
. . . 

The first line is the TITLE from the HTML page and then follows text and links. A nice thing is that we always get the link addresses back as full URLs. This relieves us from the task of constructing URLs from relative addresses.

The next thing we need to do is to get the text without the links--and the links without the text. I've used regular expressions for this, because it makes coding very simple. The pattern to use is "look for '<' and '>' with something in between--except for a '>'":

. . .
Pattern p = Pattern.compile("<[^>]*>");
Matcher m = p.matcher(contents);

// Replace links with a space
String text = m.replaceAll(" ");

// Get all links
List links = new ArrayList();
while(m.find()) {
  links.add(m.group());
}
. . . 

Frames

To locate frames in the HTML we insert this code before the link extracting code:

. . .
Parser parser = new Parser (URL);
Node[] list = parser.extractAllNodesThatAre (FrameTag.class);

if (list.length > 0) {
  for (int i = 0; i < list.length; i++) {
    String link = ( (FrameTag)list[i] ).getFrameLocation(); 
    links.add("<" + link + ">");
  }
  return;
}
. . .
(extract links from normal HTML Page)  

Using the Parser class like this shows you how to extract any tag(s) from an HTML page.

So now we're ready to crawl the web!

Completing the crawler

The crawler algorithm goes something like this (pseudo-code):

  1. Set current address to the chosen start address
  2. Use StringExtractor to get text and links for the current address
  3. Set the current address as "crawled" (use a HashMap for this)
  4. For each address in the set of links do:
  5.     Make this the current address
  6.     Go to 2
  7. End For

This will obviously go on forever if we don't include some stop condition. It's normal practice to define a "depth" parameter, indicating how far down in the tree-structure of links we want to crawl. We'll also add a boolean parameter that tells whether we want to crawl external addresses. We define an external address as an address whose prefix doesn't match the prefix of the start address.

Valid links

The syntax of the HTML link-tag (the A-tag) allows for some variants which we are not interested in. The link address can, for example, be a JavaScript expression, so we'll add some code that checks if an address starts with "http://" or https://". Only these addresses will be crawled. Bookmarks must also be handled to avoid crawling a page more than once. There are probably even other things you'd have to consider if your web crawler is going to compete with Google!

You'll find a link to the source of a complete crawler in the resources section at the end of the article.

If you run the crawler you'll notice that the number of pages crawled per minute is rather modest. This is not due to bad performance of the HTMLParser, but simply because an HTTP request on the Internet takes some time to complete. To speed up crawling you'd have to manage a set of threads, each handling a web page. This is not part of my demo program, but is safe to do, since most of the HTMLParser classes are thread-safe. You may even update an index while it's being queried by other programs. 

Until now we've only been web crawling, which is fun, but not very useful, unless you store the information you find on your way. Enter Lucene.

Jakarta Lucene 

Lucene from the Apache Jakarta project is a high-performance indexing and search engine. "Indexing" means to split sentences into the individual words for storing them in some kind of directory. This directory can then be used for fast lookup of words or combination of words. Indexing and searching is by no means simple technology, yet Lucene is very easy to use. If you stick to the default settings you need few lines of code to have a simple indexer and searcher completed. If you want to refine your program there are plenty of possibilities thanks to the very open architecture of Lucene.

My goal will be to produce some simple code for the web crawler we've just completed. I'll therefore go rather quickly through the Lucene classes to be used. The resources section contains links to several articles that dig deeper into the Lucene API.

The most important classes are:

IndexWriter for creating and building the index

IndexSearcher for querying the index

Documents are what you give to the IndexWriter and what you get from the IndexSearcher.  

Query contains a query (in parsed form)

QueryParser constructs a Query object (from a String)

Analyzers contain the policies for extracting the words or tokens from the source documents. The IndexWriter and QueryParser uses the Analyzers.

NOTE: The IndexWriter and QueryParser must use the same type of Analyzer, or you may get the wrong results from your query.

The first item we must create is a Document object to hold the text we've extracted from a web page. We'll also store the URL for the web page, to make it possible to locate the page on the web:

. . .
Document document = new Document();
document.add(Field.Text("text", text));
document.add(Field.Keyword("URL", URL));
. . .

This code shows that a Document object consists of a set of Field objects. Despite the beginning capital letter, "Text" and "Keyword" are actually (static) methods which return a Field object. There are six such Field returning methods, for various purposes:

Name of method Purpose
Keyword(String name, Date value) Constructs a Date-valued Field that is not tokenized and is indexed, and stored in the index, for return with hits.
Keyword(String name, String value) Constructs a String-valued Field that is not tokenized, but is indexed and stored. Useful for non-text fields, e.g. a url.
Text(String name, Reader value)
 
Constructs a Reader-valued Field that is tokenized and indexed, but is not stored in the index verbatim. Useful for longer text fields, like "body".
Text(String name, String value) Constructs a String-valued Field that is tokenized and indexed, and is stored in the index, for return with hits. Useful for short text fields, like "title" or "subject".
UnIndexed(String name, String value) Constructs a String-valued Field that is not tokenized nor indexed, but is stored in the index, for return with hits
UnStored(String name, String value) Constructs a String-valued Field that is tokenized and indexed, but that is not stored in the index.

You should index what you'd like to search for (and the index values will then be stored in the directory), but to save disk space you should only store data that you can not otherwise easily recover again. In the code above, I've chosen to store the text from the web page because I later want to be able to search it easily.  

Next step is to index the Document objects. You do this by giving the Document object(s) to the IndexWriter. When you're finished you close the writer:

. . .
Analyzer analyzer  = new StandardAnalyzer();
String indexDir    = "D:\\LuceneIndex";
IndexWriter writer = new IndexWriter(indexDir, analyzer, true);
. . .
writer.addDocument(document);
. . . (add more documents)
writer.optimize();
writer.close();
. . .

The first parameter to IndexWriter is where you want the index stored in your file system. The second parameter is the analyzer we want to use. I've chosen the StandardAnalyzer which is a rather sophisticated analyzer, which of course includes lowercasing of words. The third parameter must be true if the index must be created as new. By setting this parameter to false you may add to an existing index.

That's it - the index is created.

Querying the index

To make a query against the index you must pass a Query object to an IndexSearcher. Let's search for the word "XML"

. . .
String indexDir    = "D:\\LuceneIndex";
IndexSearcher is   = new IndexSearcher(indexDir);

Analyzer analyzer  = new StandardAnalyzer();
QueryParser parser = new QueryParser("text", analyzer);

String searchCriteria = "xml";
Query query = parser.parse(searchCriteria);
Hits hits   = is.search(query);
. . .

Note that we must use the same Analyzer and of course also the same directory. The QueryParser is instantiated with the name of a (default) field: "text". You may recall that this was the name we used when we stored the web page text. The hits variable is a set of Documents, so it's a simple thing to loop through this set and write out the URL's for the web pages that contains "xml":

. . .
for (int i = 0; i < hits.length(); i++) {
  Document doc = hits.doc(i);
  System.out.println("URL: " + doc.getField("URL").stringValue());
}
. . .

Indexing http://javaboutique.internet.com to a depth of one (which covers the main page and all pages linked to from the main page) gives 20 hits:

URL 1: http://javaboutique.internet.com/reviews/xml_javabeans
URL 2: http://javaboutique.internet.com/tutorials/FOP
URL 3: http://javaboutique.internet.com/demoIDEs
URL 4: http://javaboutique.internet.com/tutorials
. . .
URL 19: http://javaboutique.internet.com/resources/custom.html
URL 20: http://javaboutique.internet.com/newsletter

By the way: the Documents returned in Hits are "ranked", i.e. those with most hits are found first in the list.

I'm sure you'll admit that this was easy enough. But we did leave a few questions behind. So let's dig a bit deeper.

Analyzers

The process of parsing the text and finding all words is not as simple as one might think. Some of the things that you'll have to consider are:

Lucene offers a set of Analyzers all using tokenizers and filters that serve various purposes:

In the "Lucene sandbox" there are a couple of more "experimental" Analyzers and Filters, for example the Snowball package. You'll also find Analyzers for several spoken languages.

If you're interested in seeing the differences between the various Analyzers then it's fairly simple to write a small program that shows the tokens the Analyzers produce. There's a program called AnalysisDemo in this article that may be used for this. In most situations, however, the StandardAnalyzer is what you need.

Queries

Queries given to the QueryParser may be quite complex. On the Lucene web site you can read more about the syntax, but to give you an idea here are a few examples:

To learn more about the possibilities and how Lucene has implemented queries I'd recommend reading the article "QueryParser Rules".

Displaying the Matching Phrases

Lucene finds the documents matching your query, but you'll often also want to see the part of the text that contains the words you're looking for. This information is not part of the results returned from the IndexSearcher, and the regular Lucene API does not really help you. In the Lucene mailing list I saw that the sandbox had a contribution called Highlighter that could be used for this. The Highlighter locates the queried words in the text and highlights them where they occur in the text. If the results are shown in a browser, highlighting can be done with a <B> tag.

Here are some code fragments that show how to use the Highlighter:

. . .
String FIELD_NAME = "text";
Highlighter highlighter = new Highlighter(new MyBolder(), 
                          new QueryScorer(query));
highlighter.setTextFragmenter(new SimpleFragmenter(20));
for (int i = 0; i < hits.length(); i++) {
  System.out.println("URL " + (i + 1) + ": " + 
    hits.doc(i).getField("URL").stringValue());
  String text = hits.doc(i).get(FIELD_NAME);
  int maxNumFragmentsRequired = 2;
  String fragmentSeparator = "...";
  TokenStream tokenStream = 
    analyzer.tokenStream(FIELD_NAME, new StringReader(text));

  String result =
    highlighter.getBestFragments(
    tokenStream,
    text,
    maxNumFragmentsRequired,
    fragmentSeparator);
  System.out.println("\t" + result);
}
. . .

The Highlighter uses a helper class called MyBolder, with a method that highlights a phrase:

private static class MyBolder implements Formatter {
  public String highlightTerm(
    String originalTermText,
    String stemmedTerm,
    float score,
    int startOffset) {
  
    if (score <= 0) 
      return originalTermText;
  
    return "<b>" + originalTermText + "</b>";
  }
}

When this code is wrapped as a complete program and we now search http://javaboutique.internet.com for "xml AND castor" we'll get this printed out:

URL 1: http://javaboutique.internet.com/tutorials
   found in: with <b>Castor</b> <b>XML</b>
If you're... <b>XML</b> to JavaBeans
URL 2: http://javaboutique.internet.com/reviews/xml_javabeans
   found in: it to <b>Castor</b>-<b>XML</b>... <b>XML</b> to JavaBeans
URL 3: http://javaboutique.internet.com/articles
   found in: with <b>Castor</b> <b>XML</b>....
Using <b>CASTOR</b>

As you can see the words searched for are "HTML- highlighted", but you may of course do the highlighting the way you want by modifying the highlightTerm method above. With these building blocks it's simple to build a web page with the URLs as links on the text like this:

  with Castor XML If you're... XML to JavaBeans 
  it to Castor-XML... XML to JavaBeans 
  with Castor XML.... Using CASTOR 

It begins to taste like a Google page, yet there is some way to go before we have a professional level solution.

Final note: when I had finished writing this article I read in the newsgroup that a new version of the Highlighter was available. Here's the announcement in the mailing list. This version is not compatible with the version I've just shown you, but it seems to be a more elegant solution. I'll therefore recommend that you get this version if you're into some serious work. Here's a download address and a JavaDoc address.  

Conclusion

In this article I've shown how to build a web search engine by first coding a web crawler using HTMLParser, and then doing the indexing and queries with Lucene. Before you code your own search engine you should carefully consider what you or your customers would like to search for, and how the results should be presented. With this knowledge you should pick the appropriate Analyzer (or code your own!), and only store the data you really need in the index.

The techniques shown can of course be used to index many other kind of "documents", for example text files on your hard disk. I'm not sure if Analyzers for MS Office files like Word are available, but the Jakarta POI project may have some contributions. Happy coding!

Keld H. Hansen is currently working as a web architect for one of the largest IT companies in Denmark. He battled with the mainframes during the 70's when they were the size of a gymnasium and had the power of your PalmPilot. He also struggled with CASE-tools in the 90s and now explores the cutting edge technology of the Web. While not busy at his computer he likes to vacation on the Greek islands.

Print Article

Resources