Category Archives: xml
XML Support in APR and Apache
Recently the question of whether to bundle expat with APR and Apache HTTPD (the web server) re-emerged on the dev list. I’ve always been against bundling: it’s a third-party library and should be a dependency. We’ve moved gradually towards that, but current practice still includes bundling it in an optional dependencies package.
APR’s use of expat is pretty limited and straightforward: the core does nothing very demanding with XML. But when applications such as Apache modules need to work with XML, expat is often too limiting, so those modules have to bring in an alternative XML library. The most usual choice is libxml2 as in, for example, mod_proxy_html, mod_transform, and mod_security.
Libxml2 is not just a much bigger and more powerful library than expat; it’s also very nearly a drop-in replacement. In particular, it provides a compatible SAX API. So if we could use it in place of expat in APR, we’d have a win-win for web servers (and other applications) that involve libxml2: replace expat in APR, and load just the one XML library instead of two. At the same time, we don’t want to impose libxml2 as a dependency on APR applications that have no need for it.
So this week I’ve finally got around to rewriting APR’s XML module to decouple the parser and use either expat or libxml2. The choice of XML parser is now available at compile time. While libxml2 support should be considered experimental for the time being, it should become the preferred option for users of applications requiring it, potentially simplifying your configuration and reducing your footprint.
For the time being, anyone interested will need to download APR from trunk.
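For readers unfamiliar with the SAX style of parsing that both libraries offer, here is a minimal sketch of what it looks like. It uses Python's standard-library binding to expat purely for illustration; APR itself uses the C API, and the function name collect_events is my own invention. The point is only that the application registers callbacks and the parser fires them as it streams through the input, which is the pattern libxml2's SAX interface also follows.

```python
import xml.parsers.expat


def collect_events(document):
    """Parse `document` (bytes) with expat and return the events it fires."""
    events = []

    def on_start(name, attrs):
        events.append(("start", name))

    def on_end(name):
        events.append(("end", name))

    def on_text(data):
        # Skip whitespace-only text between elements
        if data.strip():
            events.append(("text", data))

    parser = xml.parsers.expat.ParserCreate()
    # Deliver each run of character data in one callback instead of in pieces
    parser.buffer_text = True
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_text
    parser.Parse(document, True)  # True marks this as the final chunk
    return events


for event in collect_events(b"<greeting><to>world</to></greeting>"):
    print(event)
```

Because the application only ever sees callbacks, swapping the library behind them is mostly a matter of wiring the same handlers to a different parser.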
Transcoding module
One of the new features in mod_proxy_html 3.0 is improved i18n support, adding character sets supported by apr_xlate (normally iconv) to those supported by libxml2.
In generalising this for other filter modules, I’ve decided to split it out into a new transcoding module. It will be tied to libxml2 applications, and will be usable both before and after any libxml2-based content filter. For maximum efficiency, it will only handle charsets that are not supported by libxml2.
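The "only handle what libxml2 can't" rule could be sketched roughly as follows. This is Python for brevity, not the module's C code, and everything named here is an assumption: NATIVE_CHARSETS is an illustrative stand-in (what libxml2 really supports depends on how it was built), and needs_transcoding and to_utf8 are hypothetical helpers.

```python
import codecs

# Assumption: illustrative stand-in for the charsets libxml2 handles itself.
# Names are in the normalised form returned by codecs.lookup().
NATIVE_CHARSETS = {"utf-8", "utf-16", "iso8859-1", "ascii"}


def needs_transcoding(charset):
    """True if the (hypothetical) filter has to convert this charset itself."""
    return codecs.lookup(charset).name not in NATIVE_CHARSETS


def to_utf8(body, charset):
    """Pass natively supported input through untouched; convert the rest.

    Returns the (possibly converted) bytes and the charset they are now in.
    """
    if not needs_transcoding(charset):
        return body, charset
    return body.decode(charset).encode("utf-8"), "utf-8"
```

Input that the parser already understands is never copied or re-encoded, which is where the efficiency comes from.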
The module will also support additional preprocessing fixups that experience has shown to be necessary. That includes adjusting charset declarations that are invalidated by transcoding, and fixing tag-soup problems that screw up libxml2’s htmlParser.
mod_xml2enc won’t do anything useful yet, but I’ve committed it as a work-in-progress to svn at apache.webthing.com. When ready, it’ll borrow from several existing modules, and replace transcoding and preprocessing functions in them.
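To illustrate the charset-declaration fixup mentioned above: once a document has been transcoded, any charset it declares about itself is wrong and has to be rewritten to match. The sketch below is in Python and is not taken from mod_xml2enc. The function name and the regular expressions are assumptions of mine, and a real filter would work on a stream rather than a whole buffer.

```python
import re

# Matches <meta charset="..."> and the charset=... inside an http-equiv meta
META_CHARSET = re.compile(
    rb'(<meta\b[^>]*?\bcharset\s*=\s*["\']?)([\w.:-]+)', re.IGNORECASE
)
# Matches the encoding pseudo-attribute of an XML declaration
XML_ENCODING = re.compile(
    rb'(<\?xml\b[^>]*?\bencoding\s*=\s*["\'])([\w.:-]+)', re.IGNORECASE
)


def fix_charset_declarations(body, new_charset="utf-8"):
    """Rewrite charset declarations in `body` (bytes) to name `new_charset`."""

    def replace(match):
        # Keep everything up to the charset name, then swap in the new name
        return match.group(1) + new_charset.encode("ascii")

    body = XML_ENCODING.sub(replace, body, count=1)
    return META_CHARSET.sub(replace, body)


page = b'<html><head><meta charset="windows-1251"></head></html>'
print(fix_charset_declarations(page))
```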