Reading an XML file
In this article we'll look at how you can read and process an XML file from a
Java program. There are several ways of doing this, and one powerful API is
SAX, the Simple API for XML. SAX is typically used when you don't need to hold
all the data from the XML file in memory. You might be searching for specific
elements, or you might want to count the number of a certain kind of element.
If it's more convenient for you to read and keep all the data from the file as
a tree structure in memory, then DOM, the Document Object Model API, could be a
better choice.
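To make the "count a certain kind of element" case concrete, here is a minimal sketch. The class name ElementCounter, the "dvds" root element and the second title are made up for illustration, and the XML is parsed from a string so the example is self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: counts the start-tags with a given name.
class ElementCounter extends DefaultHandler {
    private final String wanted; // element name we are looking for
    private int count;           // the only state we keep in memory

    ElementCounter(String wanted) {
        this.wanted = wanted;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        if (qName.equals(wanted)) {
            count++;
        }
    }

    // Parses the XML text and returns how many matching elements were seen.
    static int count(String xml, String elementName) {
        ElementCounter handler = new ElementCounter(elementName);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
        return handler.count;
    }

    public static void main(String[] args) {
        // Made-up sample data.
        String xml = "<dvds><dvd><title>The Matrix</title></dvd>"
                   + "<dvd><title>Alien</title></dvd></dvds>";
        System.out.println(count(xml, "dvd")); // prints 2
    }
}
```

To read a real file instead, SAXParser also has a parse method that takes a java.io.File and a DefaultHandler.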
SAX is an event-based API, meaning that when the parser reads the XML file it'll
call various methods in your program when certain events occur. These
methods are called call-back methods, and there are 4 SAX interfaces that
declare them. To make it easier to code your program (which acts as the
event handler) you may extend a SAX convenience class called DefaultHandler.
This class has default implementations of all the call-back methods, so you only
need to override the ones you're interested in. The most important events are
when a SAX parser detects the
- beginning of an element. The call-back method is "startElement"
- end of an element. The call-back method is "endElement"
- content of an element. The call-back method is "characters"
The following figure illustrates the "call-back" mechanism when a simple XML
structure is read. The 8 events that occur and the methods called are shown:
So it works like this: first you call the SAX parser from your program, and
then you wait… this is how event-based programming is: you leave the control to
another party. Suddenly the "startElement" method is called--and SAX tells you
that it found the "dvd" start-tag. Then you wait again… and now "startElement"
is called once more--this time because the "title" start-tag has been found.
Patient as we are we wait again, until "characters" is called. This time we are
told that the string "The Matrix" has been read from the XML-file. We will of
course have to save this text, because we're only informed once. Let's wait one
more time… Now "endElement" is called, and SAX informs us that this was the end
of the "title" tag. We can therefore safely save "The Matrix" as the
element-value for "title".
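The sequence described above can be observed with a small handler that simply records every call-back it receives. This is only a sketch: the class name EventTrace is made up, the value "136" for the "length" element is an assumption, and the XML is written without line breaks so that no extra whitespace events appear.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: records each call-back in the order received.
class EventTrace extends DefaultHandler {
    private final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        events.add("startElement: " + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("endElement: " + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only the slice [start, start + length) of the array belongs to us.
        events.add("characters: " + new String(ch, start, length));
    }

    // Parses the XML text and returns the recorded events.
    static List<String> trace(String xml) {
        EventTrace handler = new EventTrace();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        // "136" is an assumed value; the article only names the element.
        String xml = "<dvd><title>The Matrix</title><length>136</length></dvd>";
        for (String event : trace(xml)) {
            System.out.println(event);
        }
    }
}
```

The start and end events always arrive in document order. How many "characters" calls you get per element is up to the parser, so the trace may show the text in one piece or in several.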
What I'd like to emphasize is that all the data from the XML-file will be
given to our program, but only once. It's your responsibility as the
programmer to keep track of where we are in the XML-hierarchy, and to save the
element values that you receive.
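One common way of keeping track of where we are in the hierarchy is to keep a stack of the elements that are currently open. The sketch below is one possible approach, not the only one; the class name PathTracker and the "/"-separated path format are choices made for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: remembers the path to every element it meets.
class PathTracker extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<>(); // currently open elements
    private final List<String> paths = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        open.addLast(qName);                 // we are now one level deeper
        paths.add(String.join("/", open));   // e.g. "dvd/title"
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.removeLast();                   // back up one level
    }

    // Parses the XML text and returns the path of each element, in document order.
    static List<String> pathsOf(String xml) {
        PathTracker handler = new PathTracker();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
        return handler.paths;
    }

    public static void main(String[] args) {
        // "136" is an assumed value for the "length" element.
        String xml = "<dvd><title>The Matrix</title><length>136</length></dvd>";
        System.out.println(pathsOf(xml)); // prints [dvd, dvd/title, dvd/length]
    }
}
```

With the stack in place, "characters" and "endElement" can look at the innermost open element (or the whole path) to decide what the received text belongs to.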
The signatures of the 3 call-back methods are:
void startElement(String uri, String localName, String qName,
Attributes attributes)
- uri and localName are used with namespaces, which we'll not cover in this
article
- qName is the name of the element found--i.e. "dvd", "title" or "length" in
the example above
- attributes holds the attributes of the element. It is empty when the element
has none. For example: <dvd category="horror">
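As a sketch of how the attributes parameter might be used, the handler below copies the attributes of the first element with a given name into a map. The class name AttributeReader and the method attributesOf are made up for this example; getLength, getQName and getValue are methods of the org.xml.sax.Attributes interface.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: reads the attributes of one element.
class AttributeReader extends DefaultHandler {
    private final String elementName;
    private final Map<String, String> found = new LinkedHashMap<>();
    private boolean seen; // true once the first matching element was handled

    AttributeReader(String elementName) {
        this.elementName = elementName;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        if (seen || !qName.equals(elementName)) {
            return;
        }
        seen = true;
        // Copy the values now; don't hold on to the Attributes object itself.
        for (int i = 0; i < attributes.getLength(); i++) {
            found.put(attributes.getQName(i), attributes.getValue(i));
        }
    }

    // Returns the attributes of the first element called elementName.
    static Map<String, String> attributesOf(String xml, String elementName) {
        AttributeReader handler = new AttributeReader(elementName);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
        return handler.found;
    }

    public static void main(String[] args) {
        String xml = "<dvd category=\"horror\"/>";
        System.out.println(attributesOf(xml, "dvd")); // prints {category=horror}
    }
}
```

If you only need a single attribute, attributes.getValue("category") returns its value directly, or null when the element has no such attribute.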
void endElement(String uri, String localName, String qName)
The parameters are as above.
void characters(char[] ch, int start, int length)
- ch holds the characters read
- start is the starting position in the character array
- length is the number of characters to use from the array
Note: The "characters" method may be called more than once for a single
element. This means that you'll have to buffer the characters until you reach the
end tag for the element. This is easily done using a StringBuffer "b":
b.append(ch, start, length);
As an example "The Matrix" could be delivered in three parts: "The", " ", and
"Matrix".
The way you use the 3 call-back methods will usually be like this:
- in startElement: empty the StringBuffer so you're ready to receive the
characters inside a tag
- in characters: collect the characters in your StringBuffer
- in endElement: having the name of the tag and the characters inside the
tag it's up to you to do whatever is needed
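Put together, the three steps could look like the sketch below. It makes a few assumptions of its own: the class name DvdValues is made up, "136" is an invented value for "length", StringBuilder is used in place of StringBuffer (it has the same append and setLength methods, without the synchronization), and "whatever is needed" is taken to mean storing each non-empty value in a map. The buffer is also emptied at the end of each element, so that a parent's end-tag doesn't pick up the text of its last child.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: collects the text of each element into a map.
class DvdValues extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final Map<String, String> values = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        text.setLength(0); // empty the buffer: a new element begins
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if (!value.isEmpty()) {
            values.put(qName, value); // element name and its complete text
        }
        text.setLength(0); // don't let a parent's end-tag see this text again
    }

    // Parses the XML text and returns element name -> element text.
    static Map<String, String> valuesOf(String xml) {
        DvdValues handler = new DvdValues();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
        return handler.values;
    }

    public static void main(String[] args) {
        String xml = "<dvd>\n"
                   + "  <title>The Matrix</title>\n"
                   + "  <length>136</length>\n" // assumed value
                   + "</dvd>";
        System.out.println(valuesOf(xml)); // prints {title=The Matrix, length=136}
    }
}
```

Using the element name as the map key only works when each name occurs once, as it does here. For repeated or nested elements you'd combine this with some way of tracking your position in the hierarchy.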
One thing that might come as a surprise is that newline characters and spaces
in front of the tags are received in the "characters" method. As a consequence
you should empty your StringBuffer every time "startElement" is called.
Having said this, I'll have to admit that there are more than the 8 events in
the example above. The newline characters and the leading spaces that follow
them actually trigger 3 more "characters" events belonging to the "dvd" tag. You
should simply ignore them.
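If you'd like to see these extra events for yourself, the following sketch collects only the character data that sits directly inside the outermost element. The class name WhitespaceProbe and the two-space indentation of the sample are assumptions made for the example, as is the value "136".

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: captures the text directly inside the root element.
class WhitespaceProbe extends DefaultHandler {
    private int depth; // number of currently open elements
    private final StringBuilder directText = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        depth++;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        depth--;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth == 1) { // only the root is open, so this text lies between its children
            directText.append(ch, start, length);
        }
    }

    // Returns the character data found directly inside the root element.
    static String textBetweenChildren(String xml) {
        WhitespaceProbe handler = new WhitespaceProbe();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
        return handler.directText.toString();
    }

    public static void main(String[] args) {
        String xml = "<dvd>\n"
                   + "  <title>The Matrix</title>\n"
                   + "  <length>136</length>\n" // assumed value
                   + "</dvd>";
        String between = textBetweenChildren(xml);
        System.out.println(between.length());         // number of whitespace characters received
        System.out.println(between.trim().isEmpty()); // true: nothing but whitespace
    }
}
```

Since this text is nothing but line breaks and indentation, trimming the buffered value (or skipping values that are empty after trimming) is usually enough to ignore it.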