Java and XML: putting SAX to work

Contents
Why use XML?
Reading an XML file
Putting SAX to work
A complete event handler program for SAX
Sorting the data

Reading an XML file

In this article we'll look at how you can read and process an XML file from a Java program. There are several ways of doing this, and one powerful API is SAX, the Simple API for XML. SAX is typically used when you don't need to hold all the data from the XML file in memory. You might be searching for specific elements, or you might want to count the elements of a certain kind. If it's more convenient for you to read and keep all the data from the file as a tree structure in memory, then DOM, the Document Object Model API, could be a better choice.

SAX is an event-based parser, meaning that as it reads the XML file it calls various methods in your program when certain events occur. These methods are called call-back methods, and there are 4 SAX interfaces that declare them. To make it easier to code your program--which acts as an event handler--you may extend a SAX convenience class called DefaultHandler. This class has default implementations of all the call-back methods, so you only need to override the ones you're interested in. The most important events are when a SAX parser detects the

  • beginning of an element. The call-back method is "startElement"
  • end of an element. The call-back method is "endElement"
  • content of an element. The call-back method is "characters"
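As a minimal sketch of overriding only the call-back you care about, the class below counts elements of a certain kind. The class name ElementCounter, its count helper, and the sample XML string are assumptions made for illustration; they are not part of SAX itself.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: counts the elements with a given name.
class ElementCounter extends DefaultHandler {
    private final String wanted;
    private int count;

    ElementCounter(String wanted) {
        this.wanted = wanted;
    }

    // Only startElement is overridden; DefaultHandler supplies the rest.
    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        if (qName.equals(wanted)) {
            count++;
        }
    }

    // Parses the XML text and returns how many matching start-tags were seen.
    static int count(String xml, String elementName) {
        try {
            ElementCounter handler = new ElementCounter(elementName);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.count;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }

    public static void main(String[] args) {
        // Sample data made up for this sketch.
        String xml = "<dvds><dvd><title>The Matrix</title></dvd>"
                   + "<dvd><title>Another Title</title></dvd></dvds>";
        System.out.println(count(xml, "dvd")); // prints 2
    }
}
```

Nothing from the file is kept in memory here except a single counter, which is the kind of job SAX is well suited for.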

The following figure illustrates the "call-back" mechanism when a simple XML structure is read. The 8 events that occur and the methods called are shown:

So it works like this: first you call the SAX parser from your program, and then you wait… this is how event-based programming is: you hand control over to another party. Suddenly the "startElement" method is called--and SAX tells you that it found the "dvd" start-tag. Then you wait again… and now "startElement" is called once more--this time because the "title" start-tag has been found. Patient as we are, we wait again, until "characters" is called. This time we are told that the string "The Matrix" has been read from the XML file. We will of course have to save this text, because we're only informed once. Let's wait one more time… Now "endElement" is called, and SAX informs us that this was the end of the "title" element. We can therefore safely save "The Matrix" as the element value for "title".
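The sequence described above can be reproduced with a small tracing handler. This is only a sketch: the class name EventTracer and the value 136 for "length" are assumptions, and the XML is written on one line so that no whitespace events get in the way. With such a short input a parser will typically report eight events, but it is allowed to deliver the text in more than one "characters" call.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: records every call-back in the order received.
class EventTracer extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        events.add("startElement " + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        events.add("characters \"" + new String(ch, start, length) + "\"");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("endElement " + qName);
    }

    // Hands control to the parser and returns the recorded events.
    static List<String> trace(String xml) {
        try {
            EventTracer handler = new EventTracer();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }

    public static void main(String[] args) {
        // The length value is an assumed sample value.
        String xml = "<dvd><title>The Matrix</title><length>136</length></dvd>";
        for (String event : trace(xml)) {
            System.out.println(event);
        }
    }
}
```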

What I'd like to emphasize is that all the data from the XML file will be given to your program, but only once. It's your responsibility as the programmer to keep track of where you are in the XML hierarchy, and to save the element values that you receive.

The signatures of the 3 call-back methods are:

void startElement(String uri, String localName, String qName,
                  Attributes attributes)
  • uri and localName are used with namespaces, which we'll not cover in this article
  • qName is the name of the element found--i.e. "dvd", "title" or "length" in the example above
  • attributes holds the attributes of the element, if it has any. For example: <dvd category="horror">

void endElement(String uri, String localName, String qName)

The parameters are as above.

void characters(char[] ch, int start, int length)
  • ch holds the characters read
  • start is the starting position in the character array
  • length is the number of characters to use from the array
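To show how the "attributes" parameter of startElement might be used, here is a sketch that picks up the category attribute from the <dvd category="horror"> example. The class name CategoryReader and its helper method are assumptions for illustration; Attributes.getValue returns null when the attribute is absent.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: remembers the "category" attribute of a dvd element.
class CategoryReader extends DefaultHandler {
    private String category;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        if (qName.equals("dvd")) {
            // Look the attribute up by name; the result is null if it is missing.
            category = attributes.getValue("category");
        }
    }

    // Parses the XML text and returns the category found, or null.
    static String categoryOf(String xml) {
        try {
            CategoryReader handler = new CategoryReader();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.category;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<dvd category=\"horror\"><title>The Matrix</title></dvd>";
        System.out.println(categoryOf(xml)); // prints horror
    }
}
```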

Note: The "characters" method may be called more than once for a single element. This means that you'll have to buffer the characters until you meet the end tag for the element. This is easily done using a StringBuffer "b":

b.append(ch, start, length);

As an example, "The Matrix" could be delivered in three parts: "The", " ", and "Matrix".
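The effect of buffering can be tried out without a parser by calling the call-back methods by hand. The sketch below simulates the three-part delivery; the class name TitleBuffer and the helper deliverInParts are made up for this illustration, and a real parser decides for itself where to split the text.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: buffers character chunks until the end tag arrives.
class TitleBuffer extends DefaultHandler {
    private final StringBuffer b = new StringBuffer();
    String title;

    @Override
    public void characters(char[] ch, int start, int length) {
        // Append only the slice of the array that belongs to this call.
        b.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("title")) {
            title = b.toString();
        }
    }

    // Plays the role of the parser: delivers "The", " " and "Matrix" separately.
    static String deliverInParts() {
        TitleBuffer handler = new TitleBuffer();
        char[] data = "The Matrix".toCharArray();
        handler.characters(data, 0, 3); // "The"
        handler.characters(data, 3, 1); // " "
        handler.characters(data, 4, 6); // "Matrix"
        handler.endElement("", "", "title");
        return handler.title;
    }

    public static void main(String[] args) {
        System.out.println(deliverInParts()); // prints The Matrix
    }
}
```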

The way you use the 3 call-back methods will usually be like this:

  • in startElement: empty the StringBuffer so you're ready to receive the characters inside a tag
  • in characters: collect the characters in your StringBuffer
  • in endElement: you now have the name of the tag and the characters inside it, so it's up to you to do whatever is needed with them
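Put together, the three steps above could look like the following sketch. The class name DvdValues, the choice of a map as storage, and the sample XML (including the length value 136) are assumptions for illustration. Only "title" and "length" are saved, because the buffer contents at the end of the surrounding "dvd" element are of no use.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: saves the text of the title and length elements.
class DvdValues extends DefaultHandler {
    private final StringBuffer b = new StringBuffer();
    final Map<String, String> values = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        b.setLength(0); // empty the buffer, ready for the text inside this tag
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        b.append(ch, start, length); // collect the characters
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The tag name and its text are both known now.
        if (qName.equals("title") || qName.equals("length")) {
            values.put(qName, b.toString());
        }
    }

    // Parses the XML text and returns the saved element values.
    static Map<String, String> read(String xml) {
        try {
            DvdValues handler = new DvdValues();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }

    public static void main(String[] args) {
        // Sample data; the length value is an assumption.
        String xml = "<dvd>\n  <title>The Matrix</title>\n  <length>136</length>\n</dvd>";
        System.out.println(read(xml)); // prints {title=The Matrix, length=136}
    }
}
```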

One thing that might come as a surprise is that the newline characters and spaces in front of the tags are also delivered to the "characters" method. As a consequence, you should empty your StringBuffer every time "startElement" is called.

Having said this, I'll have to admit that there are more than the 8 events in the example above. The newline characters and the leading spaces that follow them actually trigger 3 more events belonging to the "dvd" tag. You should simply ignore them.
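If you'd like to see these extra events for yourself, the sketch below collects only the text that sits directly inside "dvd", outside its child elements. The class name WhitespaceSpotter and the sample XML are assumptions for illustration. The number of "characters" calls can vary between parsers, so the sketch only reports it and does not rely on it.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: collects the text found directly inside the root element.
class WhitespaceSpotter extends DefaultHandler {
    private int depth;                       // how many elements are currently open
    final StringBuffer direct = new StringBuffer();
    int calls;                               // "characters" calls seen at depth 1

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        depth++;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        depth--;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth == 1) { // inside dvd, but not inside title or length
            calls++;
            direct.append(ch, start, length);
        }
    }

    // Parses the XML text and returns the handler so its fields can be inspected.
    static WhitespaceSpotter scan(String xml) {
        try {
            WhitespaceSpotter handler = new WhitespaceSpotter();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<dvd>\n  <title>The Matrix</title>\n  <length>136</length>\n</dvd>";
        WhitespaceSpotter result = scan(xml);
        System.out.println("extra characters events: " + result.calls);
        System.out.println("only whitespace: " + result.direct.toString().trim().isEmpty());
    }
}
```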
