Category Archives: html
Since HTML5 WebSockets seem to be attracting some interest, I expect some of my readers may like to hear there’s now an implementation for the Apache webserver.
It’s a third-party module, written in Python by developers not associated with the Apache team itself. It’s hosted at Google Code and labelled experimental. But what else could it be, given the experimental state of the candidate standard being implemented?
This is merely to draw attention to it. I’m in no position to vouch for it myself. Caveat Reader.
At ApacheCon, I once again encountered the argument that sanitising markup is difficult, with an explanation of how easy it is to evade pattern-matching filters with tricks like reordering, whitespace, and embedded comments. I protested that this kind of difficulty comes from using the wrong tools, and the problem largely goes away if you use markup-aware tools.
On April 10th I promised a note on this (though that promise came from a separate conversation at ApacheCon, and in a different context to the security issue). Today I’ve delivered on that promise, with a brief technical note. I expect to use it in future when the subject arises.
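To illustrate the point about markup-aware tools, here is a minimal Python sketch. It is not the code from the technical note: the whitelist, the class name `Sanitiser` and the helper `sanitise` are all assumptions made up for this example, and it uses Python’s standard `html.parser` rather than libxml2. The naive regex filter is evaded by nothing more than mixed case and an attribute, while the parser-based filter sees real elements and attributes, so those tricks have nothing to hide behind.

```python
import re
from html import escape
from html.parser import HTMLParser

# Naive pattern-matching filter: only catches one exact spelling of the tag.
NAIVE_SCRIPT = re.compile(r"<script>.*?</script>", re.DOTALL)


def naive_filter(html):
    return NAIVE_SCRIPT.sub("", html)


class Sanitiser(HTMLParser):
    """Re-emit only whitelisted elements and attributes (hypothetical whitelist)."""

    ALLOWED = {"p": set(), "b": set(), "i": set(), "a": {"href"}}
    DROP_CONTENT = {"script", "style"}  # drop these together with their contents

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # The parser has already lower-cased tag and attribute names.
        if tag in self.DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in self.ALLOWED:
            return
        kept = []
        for name, value in attrs:
            value = value or ""
            if name not in self.ALLOWED[tag]:
                continue  # event handlers such as onclick end up here
            if name == "href" and not value.startswith(("http://", "https://", "/")):
                continue  # rejects javascript: and other unexpected schemes
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in self.DROP_CONTENT:
            if self.skip_depth:
                self.skip_depth -= 1
            return
        if self.skip_depth or tag not in self.ALLOWED:
            return
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))

    # Comments are never re-emitted: handle_comment is deliberately not overridden.


def sanitise(html):
    parser = Sanitiser()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


evil = ('<p onclick="evil()">Hi <ScRiPt type="text/javascript">alert(1)</sCrIpT>'
        '<a href="javascript:x()">l</a><a HREF="/ok">k</a></p>')
print(naive_filter(evil))  # unchanged: the regex never matches
print(sanitise(evil))      # <p>Hi <a>l</a><a href="/ok">k</a></p>
```

The design choice that matters is the whitelist: the filter rebuilds the output from what the parser recognised, instead of trying to enumerate everything that might be dangerous.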
In generalising this for other filter modules, I’ve decided to split it out into a new transcoding module. It will be tied to libxml2 applications, and will be usable both before and after any libxml2-based content filter. For maximum efficiency, it will only handle charsets that are not supported by libxml2.
It will also support additional preprocessing fixups that experience has shown necessary. That includes adjusting charset declarations that are invalidated by transcoding, and fixing tag-soup problems that screw up libxml2’s htmlParser.
It won’t do anything useful yet, but I’ve committed mod_xml2enc as a work-in-progress to svn at apache.webthing.com. When ready, it’ll borrow from several existing modules, and replace transcoding and preprocessing functions in them.
I’ve just announced a public dev version of mod_proxy_html, incorporating a range of updates. That means it works nicely for me, and I’d like the outside world to start test-driving it.
First, there’s much better internationalisation support.
- A charset not supported by libxml2 can be aliased to a supported one.
- A charset that is neither supported directly nor aliased will be converted to Unicode using apr_xlate (an iconv wrapper).
- A default input encoding (for totally unlabelled contents) can be configured.
- Output can be filtered through apr_xlate to a server admin’s desired encoding.
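The order of those decisions can be sketched in a few lines of Python. This is only an illustration of the logic described in the list, not the module’s code: the sets `NATIVE` and `ALIASES`, the made-up charset name `x-latin`, and the function names are assumptions, and Python’s own codecs stand in for both libxml2 and apr_xlate.

```python
# Hypothetical stand-ins for "what the parser supports" and "admin-configured aliases".
NATIVE = {"utf-8", "iso-8859-1"}
ALIASES = {"x-latin": "iso-8859-1"}  # "x-latin" is an invented name for the example


def decode_body(body, declared=None, default="iso-8859-1"):
    """Return (route, text), where route says which of the three paths was taken."""
    charset = (declared or default).lower()  # default covers unlabelled content
    if charset in NATIVE:
        return "native", body.decode(charset)
    if charset in ALIASES:
        return "alias", body.decode(ALIASES[charset])
    # Neither supported nor aliased: full conversion (apr_xlate/iconv in the module).
    return "transcode", body.decode(charset)


def encode_output(text, out_charset="utf-8"):
    """Final step: re-encode to whatever the server admin wants to serve."""
    return text.encode(out_charset)


print(decode_body(b"\xe9", "x-latin"))
print(decode_body("\u041f\u0440\u0438\u0432\u0435\u0442".encode("koi8-r"), "KOI8-R"))
print(encode_output("\u00e9", "iso-8859-1"))
```

The point of trying the cheap routes first is that full conversion is only paid for when nothing else will do.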
Second, support for rewriting proprietary HTML variants is now configurable. Indeed, the definitions of all link and event attributes are now delegated to httpd.conf, and an example configuration is supplied, defining the links and events in W3C HTML 4.01 and XHTML 1.0.
When I announced it here I got two requests, one of which was easy to satisfy. You can now override its refusal to run when not in a proxy context, or when the input isn’t HTML. This of course is at your own risk, to help deal with broken backends.
This is one of a number of new fixes available for broken backends. Others include an option to ignore leading junk, and the capability to strip out bogus or deprecated markup and output cleaned up HTML or XHTML.
Finally, Version 3 introduces more flexible configuration. It now supports variable interpolation in ProxyHTMLURLMap rules, and allows an additional clause making application of individual rules conditional on an environment variable. So configuration can now be dynamic – e.g. driven by mod_rewrite – when <Location> / <LocationMatch> sections aren’t sufficiently flexible.
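As a rough model of what interpolation plus a conditional clause buys you, here is a small Python sketch. It does not reproduce the module’s actual directive syntax: the `${NAME}` placeholder form, the `Rule` fields and the function names are assumptions chosen for the example, and the environment is just a dict standing in for per-request environment variables (as might be set by mod_rewrite).

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    pattern: str                # URL prefix to match; may contain ${NAME}
    replacement: str            # what to put in its place; may contain ${NAME}
    cond: Optional[str] = None  # rule applies only if this variable is set


_VAR = re.compile(r"\$\{(\w+)\}")


def interpolate(text, env):
    """Replace each ${NAME} with its value from env (empty string if unset)."""
    return _VAR.sub(lambda m: env.get(m.group(1), ""), text)


def rewrite_url(url, rules, env):
    """Apply the first rule whose condition holds and whose pattern matches."""
    for rule in rules:
        if rule.cond is not None and rule.cond not in env:
            continue  # conditional clause not satisfied for this request
        prefix = interpolate(rule.pattern, env)
        if url.startswith(prefix):
            return interpolate(rule.replacement, env) + url[len(prefix):]
    return url


rules = [Rule("http://${BACKEND}/", "/app/", cond="REWRITE_ON")]
env = {"BACKEND": "internal.example", "REWRITE_ON": "1"}
print(rewrite_url("http://internal.example/x.html", rules, env))  # /app/x.html
print(rewrite_url("http://internal.example/x.html", rules, {"BACKEND": "internal.example"}))
```

The same rule set gives different results per request, which is what static per-location configuration can’t do.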
I’ve just had a good hacking session on mod_proxy_html (version 3.0-dev of course; 2.x isn’t getting major new features).
I had contemplated adding DTD support using the code from mod_publisher. But that’s OTT for a specifically-HTML module. Instead, I’ve added the capability to check HTML conformance to HTML4/XHTML1, using the HTML knowledge built into libxml2. And in doing so, I recollect hacking up that little bit of libxml2 myself back when I was developing AccessValet :-)
So now a server admin can enable checking either to current or legacy (X)HTML standards (the difference being that the legacy – aka transitional – DTD allows deprecated markup). If checking is enabled, then any bogus crap will be dumped. This will be logged at loglevel DEBUG. It’ll also complain if an HTML element is missing a REQUIRED attribute (e.g. ALT on an image), though of course it can’t fix that.
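The behaviour described above can be modelled in a short Python sketch. To be clear about the assumptions: the tiny element tables below are made up for the example (the module takes its knowledge from libxml2), the class and function names are hypothetical, and dropping only the tags while keeping their text content is my guess at a reasonable interpretation of “dumped”.

```python
from html import escape
from html.parser import HTMLParser

# Deliberately tiny, hypothetical tables; the real knowledge lives in libxml2.
KNOWN = {"html", "head", "title", "body", "p", "a", "b", "img", "center", "font"}
DEPRECATED = {"center", "font"}       # allowed only under the legacy DTD
REQUIRED = {"img": {"src", "alt"}}    # attributes marked REQUIRED


class Checker(HTMLParser):
    def __init__(self, legacy=False):
        super().__init__(convert_charrefs=True)
        self.legacy = legacy
        self.out = []
        self.log = []

    def _allowed(self, tag):
        if tag not in KNOWN:
            return False
        return self.legacy or tag not in DEPRECATED

    def handle_starttag(self, tag, attrs):
        if not self._allowed(tag):
            self.log.append("dropping <%s>" % tag)
            return
        present = {name for name, _ in attrs}
        for name in sorted(REQUIRED.get(tag, set()) - present):
            # Can be reported, but not fixed.
            self.log.append("<%s> missing required attribute %s" % (tag, name))
        rendered = "".join(
            ' %s="%s"' % (name, escape(value or "", quote=True))
            for name, value in attrs
        )
        self.out.append("<%s%s>" % (tag, rendered))

    def handle_endtag(self, tag):
        if self._allowed(tag):
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def check(html, legacy=False):
    checker = Checker(legacy=legacy)
    checker.feed(html)
    checker.close()
    return "".join(checker.out), checker.log


doc = '<p><blink>Hi</blink><center>c</center><img src="x.png"></p>'
print(check(doc))               # strict: <blink> and <center> both dropped
print(check(doc, legacy=True))  # legacy: <center> survives
```

Note that this checks each element in isolation, with no state carried between callbacks.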
I’m contemplating also supporting context checking, so it’ll fix up elements that are valid, but appear in a context where they’re not valid. That’s something libxml2 can fix (up to a point) as well as log. But that’s rather more overhead to implement, because it means saving state over the SAX callbacks.