AgaveBlue Musings - XML Parsing

Background

As XML has become more pervasive, I have found myself writing more and more custom parsers. Each one seems to be a rewrite of the previous one, except that the tags and the XML structure have changed. The common thread is that I'm the one defining the XML structure, and I'm the one defining the Java object structure (usually bean objects). Most of the time, the XML is used purely internally by the application I'm developing. Sometimes I'm using XML as a more structured config file, and sometimes as a data file, either because the data I wish to store is relatively small or because I'm trying to keep the application as lightweight as possible and avoid the need for a third-party database.

Problem

There are many third-party APIs out there that attempt to make this process simpler. J2EE includes JAXB, Apache has Commons Digester (from Jakarta) and XMLBeans, and J2SE has java.beans.XMLEncoder and java.beans.XMLDecoder. They all have their individual advantages and disadvantages, but a common theme among them is that they are all somewhat heavyweight, and they try to be all things to everyone.

JAXB

JAXB has the advantage that it is part of the J2EE standard, and it is more universal. It typically enforces schemas, since the primary way to use it is to supply an XML Schema from which it generates its beans. This is typically good when XML is being shared among multiple applications, possibly across the internet. However, for lightweight applications that want simple XML parsing for internal use rather than portable data, this is far too much work.

Commons Digester

Apache Jakarta's Commons Digester seems to be a step in the right direction. My guess is that this API started as an attempt to solve the problem I'm describing here, but it seems to have ballooned into a more heavyweight API that, again, attempts to be all things to everyone. The biggest issue I have with Commons Digester is that it requires too much help. I'll explain in more detail below, where I compare my proposed solution to Commons Digester.

Apache XMLBeans

XMLBeans is not very well documented. Oh, there's plenty of documentation, but it isn't very concise, and the examples are hard to follow. From what I can tell, XMLBeans uses its own schema compiler that generates interfaces and classes to handle the parsing. If I wanted to go through all of this work, I'd just use JAXB.

java.beans.XMLEncoder and java.beans.XMLDecoder

This is interesting, and seems relatively simple. Your beans must follow the JavaBeans conventions (a public no-argument constructor plus getters and setters), and then Java will automatically convert the bean to XML and vice versa. However, the generated XML is very verbose, and it includes attributes that help the API determine how to parse the XML back into beans later. It makes for very clunky XML. But I applaud the attempt.
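
To make the verbosity concrete, here is a minimal round trip through XMLEncoder and XMLDecoder. It encodes a plain ArrayList of Strings rather than a custom bean, purely to keep the sketch self-contained; the class name XmlEncoderDemo and its two helper methods are made up for illustration.

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper (hypothetical name) around the JDK's XMLEncoder/XMLDecoder. */
class XmlEncoderDemo {

    /** Writes the object with XMLEncoder and returns the resulting XML text. */
    static String encode(Object value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (XMLEncoder encoder = new XMLEncoder(out)) {
            encoder.writeObject(value);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Reads the first object back out of XML produced by encode(). */
    static Object decode(String xml) {
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        try (XMLDecoder decoder = new XMLDecoder(new ByteArrayInputStream(bytes))) {
            return decoder.readObject();
        }
    }

    public static void main(String[] args) {
        List<String> drives = new ArrayList<>();
        drives.add("100GB HD");
        drives.add("DVD RW");

        String xml = encode(drives);
        System.out.println(xml);          // inspect the generated markup
        System.out.println(decode(xml));  // the list read back from that markup
    }
}
```

The printed document should wrap the list in java and object elements that name java.util.ArrayList, and add each String through a method-call element rather than a plain tag. That bookkeeping markup is what makes the format clunky to read or write by hand.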

Solution

So why can't we have a simple, lightweight API that can parse XML into beans with very little configuration or programming? Commons Digester tries to do this, but it comes up short. By default, Commons Digester has to be told every little detail about how the XML is to be parsed. It doesn't attempt to discover the structure on its own. That's where SLASH comes in. SLASH is the Simple Lightweight Automatic SAX Handler. SLASH differs from Commons Digester in two ways. First, it is a subclass of the DefaultHandler class provided by the SAX 2.0 API, so it plugs into any SAX 2.0 parser and can itself be extended. Second, SLASH's default behavior is to attempt to discover the XML structure and build up an object map automatically, with little or no help from things like the "rules" that Commons Digester uses.
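
For readers who haven't used SAX directly, this is roughly what "being a DefaultHandler" buys you: any SAX 2.0 parser will drive the handler's callbacks. The sketch below is not SLASH; TagRecorder is a hypothetical handler that only records tag names, shown to illustrate the plumbing.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler (not part of SLASH): records each tag's local name in document order. */
class TagRecorder extends DefaultHandler {
    final List<String> tags = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        tags.add(localName);
    }

    /** Parses the XML text with the JDK's default SAX parser and returns the recorded tags. */
    static List<String> record(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so localName carries the tag name
            TagRecorder handler = new TagRecorder();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.tags;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(record("<Computer><Memory>512MB</Memory></Computer>"));
    }
}
```

Because the handler is an ordinary subclass, it can be extended further by overriding the same callbacks and delegating to super where the default behavior is still wanted.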

SLASH optionally uses "Hints", but they aren't required. Hints are provided to the handler as a java.util.Map in which each key is the local name of an XML tag and the corresponding value is a java.lang.Class object representing the bean that the handler should use when it encounters the tag specified by the key. If the Class object represents any java.util.Collection concrete class or interface, then SLASH will treat each direct child of the associated XML tag as an object to be added to that collection. For instance, if I have the XML:



<Computer>
   <Processor>1500mhz</Processor>
   <Memory>512MB</Memory>
   <Drive>100GB HD</Drive>
   <Drive>DVD RW</Drive>
</Computer>

And I provide a hint of the form ("Computer", java.util.ArrayList.class), then SLASH will construct an ArrayList every time it encounters a "Computer" tag and add all subtag objects to that ArrayList. In this example, it will produce an ArrayList of Strings containing the values of the Processor, Memory, and two Drive tags from the XML. Had I also provided hints for the Processor, Memory, and Drive tags, it would have constructed the corresponding objects and added those to the ArrayList instead.
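
The collection behavior described above can be sketched in a few dozen lines. To be clear, this is not SLASH's source: CollectionHintHandler, its parse method, and its internals are assumptions made for illustration, and the sketch only supports hints that name a concrete Collection class with a no-argument constructor.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler illustrating collection hints; not SLASH's actual code. */
class CollectionHintHandler extends DefaultHandler {
    private final Map<String, Class<?>> hints;
    // Each stack entry is either a hinted Collection or a StringBuilder gathering text.
    private final Deque<Object> stack = new ArrayDeque<>();
    private Object root;

    CollectionHintHandler(Map<String, Class<?>> hints) {
        this.hints = hints;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        Class<?> hint = hints.get(localName);
        if (hint != null && Collection.class.isAssignableFrom(hint)) {
            try {
                // Only concrete classes with a no-argument constructor work in this sketch.
                stack.push(hint.getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException e) {
                throw new SAXException("Cannot instantiate " + hint.getName(), e);
            }
        } else {
            stack.push(new StringBuilder()); // unhinted tag: treat its content as text
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (stack.peek() instanceof StringBuilder) {
            ((StringBuilder) stack.peek()).append(ch, start, length);
        }
    }

    @Override
    @SuppressWarnings("unchecked")
    public void endElement(String uri, String localName, String qName) {
        Object finished = stack.pop();
        Object value = finished instanceof StringBuilder ? finished.toString().trim() : finished;
        Object parent = stack.peek();
        if (parent == null) {
            root = value; // outermost tag: this is the parse result
        } else if (parent instanceof Collection) {
            ((Collection<Object>) parent).add(value);
        }
    }

    /** Parses the XML text using the given hints and returns the object built for the root tag. */
    static Object parse(String xml, Map<String, Class<?>> hints) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // hints are keyed by local name
            CollectionHintHandler handler = new CollectionHintHandler(hints);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.root;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Map<String, Class<?>> hints = new HashMap<>();
        hints.put("Computer", ArrayList.class); // the single hint from the text
        String xml = "<Computer>"
                + "<Processor>1500mhz</Processor>"
                + "<Memory>512MB</Memory>"
                + "<Drive>100GB HD</Drive>"
                + "<Drive>DVD RW</Drive>"
                + "</Computer>";
        System.out.println(parse(xml, hints));
    }
}
```

A stack is used here because SAX reports tags as a stream of start and end events, so the handler has to remember which collection is currently open. Whether SLASH tracks state the same way is not something the description above confirms.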

Now, if the hint provided is a JavaBean class, one with a default (no-parameter) constructor and setter methods corresponding to the subtag names, SLASH becomes much more powerful. SLASH will construct a new bean object each time it encounters that tag, and will assume that a setter method exists for each subtag. For instance, if I have the XML:



<Computer>
   <Processor>1500mhz</Processor>
   <Memory>512MB</Memory>
   <Drives>
      <Drive>100GB HD</Drive>
      <Drive>DVD RW</Drive>
   </Drives>
</Computer>

And I provide a hint of the form ("Computer", your.package.Computer.class), where Computer is a bean class with a default constructor and the methods setProcessor(String), setMemory(String), and setDrives(Collection), SLASH will automatically parse this XML into a Computer object. When SLASH encounters the Processor and Memory tags, it calls the corresponding setter methods. When SLASH encounters the Drives tag, it inspects the Computer bean via reflection and determines that the bean expects a Collection, so it constructs a LinkedList (the default Collection implementation that SLASH prefers) and continues parsing the XML. As each Drive subtag is encountered, it parses the value and adds it to the LinkedList as a String, and when it is done, it calls the setDrives method on the Computer.
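
Here is a rough sketch of how that reflection-driven flow could work. It is an illustration written for this article's example, not SLASH's implementation: BeanHintHandler, findSetter, and parse are hypothetical names, the getters on Computer are added only so the result can be inspected, and the sketch assumes tag names match setter names exactly (Processor maps to setProcessor).

```java
import java.io.StringReader;
import java.lang.reflect.Method;
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.LinkedList;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Example bean with the setters named in the text (getters added for inspection). */
class Computer {
    private String processor, memory;
    private Collection<String> drives;
    public void setProcessor(String processor) { this.processor = processor; }
    public void setMemory(String memory) { this.memory = memory; }
    public void setDrives(Collection<String> drives) { this.drives = drives; }
    public String getProcessor() { return processor; }
    public String getMemory() { return memory; }
    public Collection<String> getDrives() { return drives; }
}

/** Hypothetical handler illustrating bean hints; not SLASH's actual code. */
class BeanHintHandler extends DefaultHandler {
    private final Map<String, Class<?>> hints;
    // Each stack entry is a bean, a Collection, or a StringBuilder gathering text.
    private final Deque<Object> stack = new ArrayDeque<>();
    private Object root;

    BeanHintHandler(Map<String, Class<?>> hints) { this.hints = hints; }

    /** Returns the public one-argument method "set" + tag on a bean, or null if none. */
    private static Method findSetter(Object target, String tag) {
        if (target == null || target instanceof Collection || target instanceof StringBuilder) return null;
        for (Method m : target.getClass().getMethods()) {
            if (m.getName().equals("set" + tag) && m.getParameterCount() == 1) return m;
        }
        return null;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        Class<?> hint = hints.get(localName);
        Method setter = findSetter(stack.peek(), localName);
        if (hint != null) {
            try {
                stack.push(hint.getDeclaredConstructor().newInstance()); // hinted bean
            } catch (ReflectiveOperationException e) {
                throw new SAXException("Cannot instantiate " + hint.getName(), e);
            }
        } else if (setter != null && Collection.class.isAssignableFrom(setter.getParameterTypes()[0])) {
            stack.push(new LinkedList<Object>()); // the bean's setter expects a Collection
        } else {
            stack.push(new StringBuilder()); // plain text value
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (stack.peek() instanceof StringBuilder) {
            ((StringBuilder) stack.peek()).append(ch, start, length);
        }
    }

    @Override
    @SuppressWarnings("unchecked")
    public void endElement(String uri, String localName, String qName) throws SAXException {
        Object finished = stack.pop();
        Object value = finished instanceof StringBuilder ? finished.toString().trim() : finished;
        Object parent = stack.peek();
        Method setter = findSetter(parent, localName);
        try {
            if (parent == null) {
                root = value; // outermost tag: this is the parse result
            } else if (parent instanceof Collection) {
                ((Collection<Object>) parent).add(value);
            } else if (setter != null) {
                setter.invoke(parent, value); // hand the finished value to the bean
            }
        } catch (ReflectiveOperationException e) {
            throw new SAXException("Cannot call set" + localName, e);
        }
    }

    /** Parses the XML text using the given hints and returns the object built for the root tag. */
    static Object parse(String xml, Map<String, Class<?>> hints) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // hints are keyed by local name
            BeanHintHandler handler = new BeanHintHandler(hints);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.root;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

Calling BeanHintHandler.parse with the XML above and a map containing the single entry ("Computer", Computer.class) should return a populated Computer whose drives are held in a LinkedList. The sketch skips type conversion, attributes, and setters that take a Collection subtype LinkedList does not implement, all of which a real handler would need to deal with.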

In the examples above, all of the parsing is accomplished by providing only one hint to SLASH. In contrast, Commons Digester would have required at least one "Rule" per tag.

Summary

There has to be an easier way to parse XML into bean objects. And now there is... it's called SLASH.
