Thursday, May 15, 2008

Identity transformation, my butt

Some lovely trivia I recently discovered about the default XSLT transformer implementation in JDK 1.5:


  1. The so-called "identity transformation" you get from TransformerFactory.newTransformer() is anything but the identity when applied to XHTML, at least until you apply some non-default configuration. Specifically, you have to do all this:


    xfmEng.setOutputProperty( OutputKeys.DOCTYPE_SYSTEM, "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" );
    xfmEng.setOutputProperty( OutputKeys.DOCTYPE_PUBLIC, "-//W3C//DTD XHTML 1.0 Transitional//EN" );
    xfmEng.setOutputProperty( OutputKeys.METHOD, "html" );
    xfmEng.setOutputProperty( OutputKeys.OMIT_XML_DECLARATION, "yes" );


    or you get tons of <!-- ... --> garbage before the real document. The garbage seems to come from the XHTML DTD files hosted at w3.org.

  2. Even with all that, you still end up with the very non-identity transformation of input like <script src=...></script> becoming <script src=.../>. The latter is well-formed XML, but many browsers treat it as malformed when the page is served as HTML. Forget newTransformer() and use an XSLT-based transformation like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="html"/>
      <xsl:template match="/">
        <xsl:copy-of select="."/>
      </xsl:template>
    </xsl:stylesheet>

  3. Speaking of the w3.org DTD files, there's still some nasty stuff going on behind the scenes: when any transformer created via TransformerFactory.newTransformer() or Templates.newTransformer() starts processing XHTML, it actually goes and grabs those extremely well-known DTDs off the web from their URIs at w3.org. Every document, even with the same transformer, triggers a new set of GETs to w3.org. Pretty ridiculous. Here's how to get around that:


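    package MyPackage;

    import javax.xml.transform.Result;
    import javax.xml.transform.Source;
    import javax.xml.transform.TransformerException;
    import javax.xml.transform.TransformerFactory;
    import org.xml.sax.SAXNotRecognizedException;
    import org.xml.sax.SAXNotSupportedException;
    import com.sun.org.apache.xerces.internal.impl.Constants;
    import com.sun.org.apache.xerces.internal.parsers.SAXParser;

    public class MyTransform {

        // ...

        // A SAX parser that neither validates nor loads external DTDs.
        public static class MySAXParser extends SAXParser {
            public MySAXParser() {
                super();
                try {
                    setFeature( Constants.SAX_FEATURE_PREFIX + Constants.VALIDATION_FEATURE, false );
                    setFeature( Constants.XERCES_FEATURE_PREFIX + Constants.LOAD_EXTERNAL_DTD_FEATURE, false );
                } catch ( SAXNotRecognizedException sne ) {
                    // feature unknown to this parser: silently keep its defaults
                } catch ( SAXNotSupportedException sse ) {
                    // feature cannot be changed on this parser: silently keep its defaults
                }
            }
        }

        // in the code that uses the transformer (the wrapper method is only for illustration):
        public static void transform( Source stmIn, Result stmOut ) throws TransformerException {
            System.setProperty( "org.xml.sax.driver", "MyPackage.MyTransform$MySAXParser" );
            TransformerFactory.newInstance().newTransformer().transform( stmIn, stmOut );
        }

        // ...
    }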
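
If you'd rather not touch the org.xml.sax.driver system property or import com.sun internals, a variation that might work is to hand the transformer a pre-configured XMLReader through a SAXSource. This is a sketch, not the approach above: the class name NoDtdFetch and the method copy() are made up for illustration, and the two feature URIs are just the Constants values from the snippet above written out as strings. It also assumes the JDK's bundled Xerces-derived parser, which is what recognizes the load-external-dtd feature.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper: identity-transform a document without loading its external DTD.
final class NoDtdFetch {
    // SAX_FEATURE_PREFIX + VALIDATION_FEATURE, spelled out
    static final String VALIDATION = "http://xml.org/sax/features/validation";
    // XERCES_FEATURE_PREFIX + LOAD_EXTERNAL_DTD_FEATURE, spelled out
    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    static String copy(String xml) {
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true); // transformers expect namespace-aware SAX events
            XMLReader reader = spf.newSAXParser().getXMLReader();
            reader.setFeature(VALIDATION, false);
            reader.setFeature(LOAD_EXTERNAL_DTD, false);

            StringWriter out = new StringWriter();
            TransformerFactory.newInstance().newTransformer().transform(
                new SAXSource(reader, new InputSource(new StringReader(xml))),
                new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Wrap checked parser/transformer exceptions to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String doc =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><p>hi</p></body></html>";
        System.out.println(copy(doc));
    }
}
```

The upside is that the configuration stays local to the one reader instead of being a JVM-wide setting; the downside is that you have to build the SAXSource yourself everywhere you parse.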

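For completeness, here is roughly what item 1 looks like as a whole program: a minimal sketch that applies the four output properties to the identity transformer and returns the result as a string. The class name XhtmlIdentity, the method copy() and the sample input are invented for illustration; the input deliberately has no DOCTYPE so that running it does not trigger the DTD download described in item 3.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: run the "identity" transformer with the XHTML-friendly settings.
final class XhtmlIdentity {
    static String copy(String xhtml) {
        try {
            Transformer xfmEng = TransformerFactory.newInstance().newTransformer();
            // The four non-default output properties from item 1.
            xfmEng.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM,
                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd");
            xfmEng.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC,
                "-//W3C//DTD XHTML 1.0 Transitional//EN");
            xfmEng.setOutputProperty(OutputKeys.METHOD, "html");
            xfmEng.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            xfmEng.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Wrap checked transformer exceptions to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>t</title></head><body><p>hi</p></body></html>";
        System.out.println(copy(page));
    }
}
```

Printing the result for a few of your own pages is a quick way to see exactly what the serializer adds or changes before you trust it in production.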

1 comment:

cowtowncoder said...

Yes: it is much better to use Saxon instead. :-)