Thursday, May 15, 2008

Identity transformation, my butt

Some lovely trivia I have recently discovered about the default implementations of XSLT transformations in the JDK 1.5:

  1. The so-called "identity transformation" you get from TransformerFactory.newTransformer() is anything but the identity when applied to XHTML, unless you apply some non-default configuration. Specifically, you have to do all this:

    xfmEng.setOutputProperty( OutputKeys.DOCTYPE_SYSTEM, "" );
    xfmEng.setOutputProperty( OutputKeys.DOCTYPE_PUBLIC, "-//W3C//DTD XHTML 1.0 Transitional//EN" );
    xfmEng.setOutputProperty( OutputKeys.METHOD, "html" );
    xfmEng.setOutputProperty( OutputKeys.OMIT_XML_DECLARATION, "yes" );

    or you get tons of <!-- ... --> garbage before the real document. The garbage seems to come from the comments in the XHTML DTD files.

  2. Even with all that, you still end up with the very non-identity transformation of input like <script src=...></script> becoming <script src=.../>. The latter is well-formed XML, but many browsers that parse the page as HTML treat it as an unclosed start tag. Forget newTransformer() and use an XSLT-based transformation like

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="html"/>
      <xsl:template match="/">
        <xsl:copy-of select="."/>
      </xsl:template>
    </xsl:stylesheet>

  3. Speaking of the DTD files, there's still some nasty stuff going on behind the scenes: when any transformer created via TransformerFactory.newTransformer() or Templates.newTransformer() starts processing XHTML, it actually goes and fetches those extremely well-known DTDs off the web from their URIs. Every document, even with the same transformer, triggers a new set of GETs to the W3C. Pretty ridiculous. Here's how to get around that:

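    package MyPackage;

    // Assumption: standalone Xerces-J is on the classpath; the copy bundled
    // with the JDK lives under com.sun.org.apache.xerces.internal instead.
    import org.apache.xerces.impl.Constants;
    import org.apache.xerces.parsers.SAXParser;
    import org.xml.sax.SAXNotRecognizedException;
    import org.xml.sax.SAXNotSupportedException;

    public class MyTransform {

        // ...

        // SAX parser that neither validates nor loads external DTDs.
        public static class MySAXParser extends SAXParser {
            public MySAXParser() {
                try {
                    setFeature( Constants.SAX_FEATURE_PREFIX + Constants.VALIDATION_FEATURE, false );
                    setFeature( Constants.XERCES_FEATURE_PREFIX + Constants.LOAD_EXTERNAL_DTD_FEATURE, false );
                } catch ( SAXNotRecognizedException sne ) {
                    // feature unknown to this parser; its defaults stay in place
                } catch ( SAXNotSupportedException sse ) {
                    // feature cannot be changed; its defaults stay in place
                }
            }
        }
    }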

    // in the code that uses the transformer:
    System.setProperty( "org.xml.sax.driver", "MyPackage.MyTransform$MySAXParser" );
    TransformerFactory.newInstance().newTransformer().transform( stmIn, stmOut );

    // ...
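
For item 2, here is a minimal sketch of wiring that copy-of stylesheet into Java through the standard javax.xml.transform API. The class name CopyOfTransform and the method copy() are made up for illustration, and the stylesheet is inlined as a string only to keep the example self-contained. I have not pinned down exactly how every element gets serialized, so treat this as a starting point rather than a guaranteed fix.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of any library.
final class CopyOfTransform {

    // The copy-everything stylesheet from item 2, using the XSLT 1.0 namespace.
    static final String COPY_XSLT =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"html\"/>"
        + "<xsl:template match=\"/\"><xsl:copy-of select=\".\"/></xsl:template>"
        + "</xsl:stylesheet>";

    // Runs the markup through the stylesheet and returns the serialized result.
    static String copy(String markup) {
        try {
            // Templates is the compiled, reusable form of the stylesheet.
            Templates templates = TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(COPY_XSLT)));
            Transformer transformer = templates.newTransformer();
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(markup)),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrapped only to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }
}
```

In real code you would build the Templates object once and call newTransformer() on it per document, since compiling the stylesheet is the expensive part.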

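For item 3, the same two features can probably be switched off without subclassing the parser or touching the org.xml.sax.driver system property, by handing the transformer a SAXSource with a pre-configured XMLReader. The sketch below assumes the default JAXP parser is Xerces-based (as it is in the JDK); the two feature URIs are the expanded forms of the Constants used above, and a different parser may reject the second one. NoDtdFetch and identity() are hypothetical names.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper, not part of any library.
final class NoDtdFetch {

    // SAX_FEATURE_PREFIX + VALIDATION_FEATURE
    static final String VALIDATION =
        "http://xml.org/sax/features/validation";
    // XERCES_FEATURE_PREFIX + LOAD_EXTERNAL_DTD_FEATURE (Xerces-specific)
    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    // Identity-transforms the document without loading its external DTD.
    static String identity(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setFeature(VALIDATION, false);
            reader.setFeature(LOAD_EXTERNAL_DTD, false);

            // The transformer parses through our reader instead of making its own.
            SAXSource source =
                new SAXSource(reader, new InputSource(new StringReader(xhtml)));
            StringWriter out = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                .transform(source, new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Wrapped only to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }
}
```

One caveat: with the DTD skipped, named entities that XHTML defines there (such as &nbsp;) are no longer declared, so documents relying on them need separate handling.
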
1 comment:

cowtowncoder said...

Yes: it is much better to use Saxon instead. :-)