Thursday, June 25, 2009

JAXB, @XmlMixed, and white space anomalies

Whether or not you think "mixed" content in XML is ever a good idea, you may need to handle it using JAXB one day. Recall that for JAXB to parse a mixed content XML element to a class C, you use an @XmlMixed annotation on a field of C of type List< Serializable >, combined with either @XmlAnyElement or @XmlElements. In each case, the resulting list will contain Strings representing the text nodes and objects representing the element nodes, in the same order as they appear in the XML text. Thus

<thing>stuff<nested/>entities<alsoNested/></thing>
maps to an instance of
class Thing {
    @XmlMixed @XmlAnyElement
    List< Serializable > lserComponents;
}
which looks like
{ lserComponents : [ "stuff", { localName : "nested" }, "entities", { localName: "alsoNested" } ] }
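To make the shape of that list concrete, here is a minimal sketch of walking it and telling text apart from elements. It assumes the @XmlAnyElement case, where the element entries arrive as org.w3c.dom.Element objects; the class MixedContentWalk and its methods are made up for illustration, and the sample list is built by hand rather than by JAXB so the snippet runs on a bare JDK.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class MixedContentWalk {
  /** Renders a mixed-content list in the notation used above. */
  static String describe( List< ? > components ) {
    StringBuilder sb = new StringBuilder( "[ " );
    for ( int i = 0; i < components.size(); ++i ) {
      Object o = components.get( i );
      if ( i > 0 ) sb.append( ", " );
      if ( o instanceof String ) {
        // a text node
        sb.append( '"' ).append( o ).append( '"' );
      } else if ( o instanceof Element ) {
        // an element node (assumption: @XmlAnyElement binding to DOM)
        sb.append( "{ localName : \"" )
          .append( ( ( Element )o ).getLocalName() )
          .append( "\" }" );
      } else {
        // anything else, e.g. a bound class when @XmlElements is used
        sb.append( o );
      }
    }
    return sb.append( " ]" ).toString();
  }

  /** Hand-built stand-in for the list JAXB would produce for the sample. */
  static List< Object > sample() {
    try {
      DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
      dbf.setNamespaceAware( true );
      Document doc = dbf.newDocumentBuilder().newDocument();
      List< Object > l = new ArrayList< Object >();
      l.add( "stuff" );
      l.add( doc.createElementNS( null, "nested" ) );
      l.add( "entities" );
      l.add( doc.createElementNS( null, "alsoNested" ) );
      return l;
    } catch ( Exception e ) {
      throw new IllegalStateException( e );
    }
  }

  public static void main( String[] args ) {
    System.out.println( describe( sample() ) );
  }
}
```

The instanceof checks are the whole point: order is preserved, so the only way to know what sits at a given index is to look at its runtime type.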
Unfortunately, if the only content other than nested elements happens to be white space, as in
<thing><nested/>   <alsoNested/></thing>
you get the odd bound object
{ lserComponents : [ { localName : "nested" }, { localName: "alsoNested" }, "" ] }
If you care about white space, and who doesn't these days in the throes of late-stage Reaganomics, you need a trick when you actually go to parse the XML.

First, we create a SAX 2.0 ContentHandler implementation that delegates all events to a JAXB UnmarshallerHandler, but modifies all the whitespace slightly:

import java.util.Arrays;
import javax.xml.bind.UnmarshallerHandler;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;

class WhitespaceAwareUnmarshallerHandler implements ContentHandler {
  private final UnmarshallerHandler uh;
  public WhitespaceAwareUnmarshallerHandler( UnmarshallerHandler uh ) {
    this.uh = uh;
  }
  /**
   * Replace all-whitespace character blocks with the character '\u000B',
   * which satisfies the following properties:
   * 1. "\u000B".matches( "\\s" ) == true
   * 2. when parsing XmlMixed content, JAXB does not suppress the whitespace
   */
  public void characters(
    char[] ch, int start, int length
  ) throws SAXException {
    for ( int i = start + length - 1; i >= start; --i )
      if ( !Character.isWhitespace( ch[ i ] ) ) {
        // the block contains real text: pass it through untouched
        uh.characters( ch, start, length );
        return;
      }
    // all-whitespace block: overwrite it in place, then forward it
    Arrays.fill( ch, start, start + length, '\u000B' );
    uh.characters( ch, start, length );
  }
  /* what follows is just blind delegation monkey code */
  public void ignorableWhitespace( char[] ch, int start, int length ) throws SAXException { uh.characters( ch, start, length ); }
  public void endDocument() throws SAXException { uh.endDocument(); }
  public void endElement( String uri, String localName, String name ) throws SAXException { uh.endElement( uri,  localName, name ); }
  public void endPrefixMapping( String prefix ) throws SAXException { uh.endPrefixMapping( prefix ); }
  public void processingInstruction( String target, String data ) throws SAXException { uh.processingInstruction(  target, data ); }
  public void setDocumentLocator( Locator locator ) { uh.setDocumentLocator( locator ); }
  public void skippedEntity( String name ) throws SAXException { uh.skippedEntity( name ); }
  public void startDocument() throws SAXException { uh.startDocument(); }
  public void startElement( String uri, String localName, String name, Attributes atts ) throws SAXException { uh.startElement( uri, localName, name, atts ); }
  public void startPrefixMapping( String prefix, String uri ) throws SAXException { uh.startPrefixMapping( prefix, uri ); }
}
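If you want to see what the downstream handler actually receives without dragging JAXB in, the following sketch may help. It is not the handler above: it swaps in org.xml.sax.helpers.XMLFilterImpl, which does the delegation for you, and a hypothetical Recorder stands in for the UnmarshallerHandler. It also copies the block rather than overwriting the parser's buffer, which is a cautious choice on my part and not something the trick requires. All class and method names here are invented for the demo.

```java
import java.io.StringReader;
import java.util.Arrays;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class WhitespaceMarkingDemo {
  /** Forwards every event; all-whitespace text blocks become runs of U+000B. */
  static class Marker extends XMLFilterImpl {
    @Override
    public void characters( char[] ch, int start, int length ) throws SAXException {
      boolean allWhitespace = true;
      for ( int i = start; i < start + length; ++i )
        if ( !Character.isWhitespace( ch[ i ] ) ) { allWhitespace = false; break; }
      if ( allWhitespace ) {
        char[] copy = new char[ length ];  // leave the parser's buffer alone
        Arrays.fill( copy, '\u000B' );
        super.characters( copy, 0, length );
      } else {
        super.characters( ch, start, length );
      }
    }
  }

  /** Stand-in for the UnmarshallerHandler: records the text it is sent. */
  static class Recorder extends DefaultHandler {
    final StringBuilder text = new StringBuilder();
    @Override
    public void characters( char[] ch, int start, int length ) {
      text.append( ch, start, length );
    }
  }

  /** Parses xml and returns the character data seen downstream. */
  static String parse( String xml, boolean mark ) {
    try {
      SAXParserFactory f = SAXParserFactory.newInstance();
      f.setNamespaceAware( true );
      XMLReader xr = f.newSAXParser().getXMLReader();
      Recorder rec = new Recorder();
      if ( mark ) {
        Marker m = new Marker();
        m.setContentHandler( rec );  // Marker forwards to rec
        xr.setContentHandler( m );
      } else {
        xr.setContentHandler( rec );
      }
      xr.parse( new InputSource( new StringReader( xml ) ) );
      return rec.text.toString();
    } catch ( Exception e ) {
      throw new IllegalStateException( e );
    }
  }

  public static void main( String[] args ) {
    String marked = parse( "<thing><nested/>   <alsoNested/></thing>", true );
    System.out.println( marked.length() + " chars, all U+000B: "
      + marked.matches( "\\x0B+" ) );
  }
}
```

Note that the original whitespace characters are gone after marking: three spaces become three vertical tabs, so the length survives but the exact characters do not.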
Then at parse time, instead of the usual ctx.createUnmarshaller().unmarshal( new StringReader( strData ) ), we substitute our special handler to do the parsing:
import java.io.StringReader;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.UnmarshallerHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import com.ctc.wstx.sax.WstxSAXParser; // Woodstox

public class JAXBUtil {
  @SuppressWarnings( "unchecked" )
  public static < T > T unmarshal(
    JAXBContext ctx, String strData, boolean flgWhitespaceAware
  ) throws Exception {
    UnmarshallerHandler uh = ctx.createUnmarshaller().getUnmarshallerHandler();
    XMLReader xr = new WstxSAXParser(); // use your favorite SAX 2.0 parser
    xr.setContentHandler( flgWhitespaceAware ? new WhitespaceAwareUnmarshallerHandler( uh ) : uh );
    xr.parse( new InputSource( new StringReader( strData ) ) );
    return ( T )uh.getResult();
  }
}
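If Woodstox is not on your classpath, the parser bundled with the JDK should do as the "favorite SAX 2.0 parser". The sketch below shows one way to get an XMLReader from it; the class JdkXmlReaders and both method names are made up for illustration. The one detail worth getting right is namespace awareness, which the JDK factory leaves off by default and which JAXB relies on to match elements by namespace URI and local name.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class JdkXmlReaders {
  /** Returns a namespace-aware XMLReader backed by the JDK's default parser. */
  static XMLReader newNamespaceAwareReader() {
    try {
      SAXParserFactory f = SAXParserFactory.newInstance();
      f.setNamespaceAware( true );  // off by default
      return f.newSAXParser().getXMLReader();
    } catch ( Exception e ) {
      throw new IllegalStateException( "no SAX parser available", e );
    }
  }

  /** Sanity check: parses xml and reports the root element's local name. */
  static String rootLocalName( String xml ) {
    final String[] found = new String[ 1 ];
    try {
      XMLReader xr = newNamespaceAwareReader();
      xr.setContentHandler( new DefaultHandler() {
        @Override
        public void startElement(
          String uri, String localName, String qName, Attributes atts
        ) {
          if ( found[ 0 ] == null ) found[ 0 ] = localName;
        }
      } );
      xr.parse( new InputSource( new StringReader( xml ) ) );
    } catch ( Exception e ) {
      throw new IllegalStateException( e );
    }
    return found[ 0 ];
  }

  public static void main( String[] args ) {
    System.out.println( rootLocalName( "<p:thing xmlns:p='urn:x'/>" ) );
  }
}
```

In JAXBUtil.unmarshal, the line that constructs the WstxSAXParser would then become XMLReader xr = JdkXmlReaders.newNamespaceAwareReader(); and the rest stays the same.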