Thursday, June 25, 2009

JAXB, @XmlMixed, and white space anomalies

Whether or not you think "mixed" content in XML is ever a good idea, you may need to handle it using JAXB one day. Recall that for JAXB to parse a mixed content XML element to a class C, you use an @XmlMixed annotation on a field of C of type List< Serializable >, combined with either @XmlAnyElement or @XmlElementRef/@XmlElementRefs. In each case, the resulting list will contain Strings representing the text nodes and objects representing the element nodes, in the same order as they appear in the XML text. Thus

<thing>stuff<nested/>entities<alsoNested/></thing>
maps to an instance of
@XmlRootElement
class Thing {
    @XmlMixed @XmlAnyElement
    List< Serializable > lserComponents;
}
which looks like
{ lserComponents : [ "stuff", { localName : "nested" }, "entities", { localName: "alsoNested" } ] }
Unfortunately, if the only content other than nested elements happens to be white space, as in
<thing><nested/>   <alsoNested/></thing>
you get the odd bound object
{ lserComponents : [ { localName : "nested" }, { localName: "alsoNested" }, "" ] }
If you care about white space, and who doesn't these days in the throes of late-stage Reaganomics, you need a trick when you actually go to parse the XML.
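It helps to confirm where the white space goes missing. The sketch below uses only the SAX parser that ships with the JDK (no JAXB involved) and records the events a ContentHandler receives; the class and method names (SaxWhitespaceProbe, events) are made up for illustration. It suggests the parser does report the white-space run as an ordinary characters event, so the place to intervene is between the parser and JAXB.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of JAXB: lists what a SAX ContentHandler sees.
class SaxWhitespaceProbe {
  /** Returns start tags and (coalesced) text runs in document order. */
  static List<String> events(String xml) throws Exception {
    final List<String> out = new ArrayList<>();
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true);
    XMLReader xr = factory.newSAXParser().getXMLReader();
    xr.setContentHandler(new DefaultHandler() {
      // a parser may split one text node over several characters() calls,
      // so buffer the chunks and emit them when the next tag arrives
      private final StringBuilder text = new StringBuilder();
      private void flush() {
        if (text.length() > 0) {
          out.add("text[" + text + "]");
          text.setLength(0);
        }
      }
      @Override public void startElement(String uri, String localName, String qName, Attributes atts) {
        flush();
        out.add("<" + localName + ">");
      }
      @Override public void endElement(String uri, String localName, String qName) {
        flush();
      }
      @Override public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
      }
    });
    xr.parse(new InputSource(new StringReader(xml)));
    return out;
  }
}
```

For the white-space-only document above, events( ... ) returns [<thing>, <nested>, text[   ], <alsoNested>], with the three spaces intact and in the right position.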

First, we create a SAX 2.0 ContentHandler implementation that delegates all events to a JAXB UnmarshallerHandler, but modifies all the whitespace slightly:

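import java.util.Arrays;
import javax.xml.bind.UnmarshallerHandler;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;

class WhitespaceAwareUnmarshallerHandler implements ContentHandler {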
  private final UnmarshallerHandler uh;
  public WhitespaceAwareUnmarshallerHandler( UnmarshallerHandler uh ) {
    this.uh = uh;
  }
  /**
   * Replace all-whitespace character blocks with the character '\u000B',
   * which satisfies the following properties:
   * 
   * 1. "\u000B".matches( "\\s" ) == true
   * 2. when parsing XmlMixed content, JAXB does not suppress the whitespace
   **/
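  public void characters(
    char[] ch, int start, int length
  ) throws SAXException {
    for ( int i = start + length - 1; i >= start; --i )
      if ( !Character.isWhitespace( ch[ i ] ) ) {
        // at least one non-whitespace character: pass the block through untouched
        uh.characters( ch, start, length );
        return;
      }
    // ch is the parser's own buffer, so fill a copy instead of overwriting it
    char[] chMarked = new char[ length ];
    Arrays.fill( chMarked, '\u000B' );
    uh.characters( chMarked, 0, length );
  }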
  /* what follows is just blind delegation monkey code */
  public void ignorableWhitespace( char[] ch, int start, int length ) throws SAXException { uh.ignorableWhitespace( ch, start, length ); }
  public void endDocument() throws SAXException { uh.endDocument(); }
  public void endElement( String uri, String localName, String name ) throws SAXException { uh.endElement( uri,  localName, name ); }
  public void endPrefixMapping( String prefix ) throws SAXException { uh.endPrefixMapping( prefix ); }
  public void processingInstruction( String target, String data ) throws SAXException { uh.processingInstruction(  target, data ); }
  public void setDocumentLocator( Locator locator ) { uh.setDocumentLocator( locator ); }
  public void skippedEntity( String name ) throws SAXException { uh.skippedEntity( name ); }
  public void startDocument() throws SAXException { uh.startDocument(); }
  public void startElement( String uri, String localName, String name, Attributes atts ) throws SAXException { uh.startElement( uri, localName, name, atts ); }
  public void startPrefixMapping( String prefix, String uri ) throws SAXException { uh.startPrefixMapping( prefix, uri ); }
}
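If the delegation boilerplate bothers you, one alternative is to extend org.xml.sax.helpers.XMLFilterImpl, which already forwards every ContentHandler event to whatever handler it is given, and override only characters. The following is a sketch of that idea rather than the handler above: the names WhitespaceMarkingFilter and textSeenDownstream are invented here, and a plain DefaultHandler stands in for the JAXB UnmarshallerHandler so the example runs with nothing but the JDK.

```java
import java.io.StringReader;
import java.util.Arrays;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical variant: XMLFilterImpl supplies the blind delegation.
class WhitespaceMarkingFilter extends XMLFilterImpl {
  static final char MARKER = '\u000B'; // vertical tab, U+000B

  WhitespaceMarkingFilter(ContentHandler downstream) {
    setContentHandler(downstream); // every event is forwarded to this handler
  }

  @Override
  public void characters(char[] ch, int start, int length) throws SAXException {
    for (int i = start; i < start + length; i++) {
      if (!Character.isWhitespace(ch[i])) {
        super.characters(ch, start, length); // real text: forward unchanged
        return;
      }
    }
    // all-whitespace block: forward a same-length run of markers instead,
    // built in a fresh array so the parser's buffer is left alone
    char[] marked = new char[length];
    Arrays.fill(marked, MARKER);
    super.characters(marked, 0, length);
  }

  /** Demo only: returns the concatenated text a downstream handler receives. */
  static String textSeenDownstream(String xml) throws Exception {
    final StringBuilder seen = new StringBuilder();
    ContentHandler recorder = new DefaultHandler() {
      @Override public void characters(char[] ch, int start, int length) {
        seen.append(ch, start, length);
      }
    };
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true);
    XMLReader xr = factory.newSAXParser().getXMLReader();
    xr.setContentHandler(new WhitespaceMarkingFilter(recorder));
    xr.parse(new InputSource(new StringReader(xml)));
    return seen.toString();
  }
}
```

With JAXB on the classpath you would pass the UnmarshallerHandler as the downstream handler. One caveat applies to either version: a parser is allowed to deliver a single text node in several chunks, so a chunk that happens to be all white space inside a longer text run would be marked as well.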
Then at parse time, instead of the usual ctx.createUnmarshaller().unmarshal( new StringReader( strData ) ), we route the parser's events through our special handler:
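import java.io.StringReader;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.UnmarshallerHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import com.ctc.wstx.sax.WstxSAXParser; // Woodstox; only needed if you use this parser

public class JAXBUtil {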
  @SuppressWarnings( "unchecked" )
  public static < T > T unmarshal(
    JAXBContext ctx, String strData, boolean flgWhitespaceAware
  ) throws Exception {
    UnmarshallerHandler uh = ctx.createUnmarshaller().getUnmarshallerHandler();
    XMLReader xr = new WstxSAXParser(); // use your favorite SAX 2.0 parser
    xr.setContentHandler( flgWhitespaceAware ? new WhitespaceAwareUnmarshallerHandler( uh ) : uh );
    xr.parse( new InputSource( new StringReader( strData ) ) );
    return ( T )uh.getResult();
  }
}
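Keep in mind that the bound list now holds runs of '\u000B' where the white space used to be, with the same length as the original run but not the original characters. If downstream code would rather see ordinary spaces, a small post-processing pass can swap them back. This is a sketch under that assumption; MarkerCleanup and restoreSpaces are hypothetical names, not anything JAXB provides.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical post-processing helper for the unmarshalled mixed-content list.
class MarkerCleanup {
  static final char MARKER = '\u000B'; // the stand-in character, U+000B

  /** True if s is non-empty and consists only of marker characters. */
  static boolean isMarkerRun(String s) {
    if (s.isEmpty()) return false;
    for (int i = 0; i < s.length(); i++)
      if (s.charAt(i) != MARKER) return false;
    return true;
  }

  /** Returns a copy of the list with each marker run replaced by spaces of equal length. */
  static List<Serializable> restoreSpaces(List<? extends Serializable> components) {
    List<Serializable> out = new ArrayList<>(components.size());
    for (Serializable c : components) {
      if (c instanceof String && isMarkerRun((String) c)) {
        char[] spaces = new char[((String) c).length()];
        Arrays.fill(spaces, ' ');
        out.add(new String(spaces));
      } else {
        out.add(c); // element objects and real text pass through untouched
      }
    }
    return out;
  }
}
```

Tabs and newlines in the source document come back as spaces here, since the handler records only how long each run was.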