Daily Archives: October 12, 2006

mod_proxy_html revisited

I had a phone call today from a registered user of mod_proxy_html, looking for help running it with a Microsoft SharePoint backend.

His problem appeared at first to be Question 3 in the mod_proxy_html FAQ. But when I finally got a copy of an unfiltered page from him, it turned out to be a variant on it, and one that I could fix for very little effort and negligible runtime overhead.

The page he sent me looked something like this:

<meta http-equiv="content-type" content="text/html; charset=utf-8">

<html xmlns:…..>
[followed by parseable tag soup]

I ran that through xmllint, and as expected, the output looked similar to that of mod_proxy_html. The parser (correctly) inserts the implied <html> and <head> when it encounters the first bogus <meta>. When it encounters <html> it sees a tag that cannot appear within an HTML page, and its attributes become loose text, which in turn implies <body>. Miraculously, despite the parser now being thoroughly confused, the page displays just fine, apart from that loose text.

OK, now mod_proxy_html already does some body sniffing before starting the HTML parser. Specifically it checks for charset in an XML BOM, XML declaration, or HTML META element. So why not add an optional extra check, to strip out any leading junk and start the parser at the first legitimate element (HTML, HEAD, TITLE or BODY)? That offers an extra little bit of error correction for users.
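To make the idea concrete, here is a minimal sketch in Python of that extra check. It is not the code in mod_proxy_html (which is a C module). The function name `strip_leading_junk` and the regular expression are assumptions made up for illustration. It looks for the first HTML, HEAD, TITLE or BODY start tag in the buffered start of the body and discards everything before it. Since the charset has already been sniffed by that point, losing a stray leading META should not cost anything.

```python
import re

# Start tags the post names as legitimate places to begin parsing.
# The lookahead stops "<head" from matching "<header ...>".
_START_TAG = re.compile(rb"<(?:html|head|title|body)(?=[\s/>])", re.IGNORECASE)


def strip_leading_junk(buf: bytes) -> bytes:
    """Return buf starting at the first HTML, HEAD, TITLE or BODY start tag.

    If no such tag is found, buf is returned unchanged, so the parser
    still sees the whole input.
    """
    match = _START_TAG.search(buf)
    return buf[match.start():] if match else buf


# Hypothetical input modelled on the page described above:
# a stray META ahead of the real <html> start tag.
page = (
    b'<meta http-equiv="content-type" content="text/html; charset=utf-8">\n\n'
    b'<html lang="en"><head><title>t</title></head><body>x</body></html>'
)
cleaned = strip_leading_junk(page)  # now begins at b'<html lang="en">'
```

This sketch is deliberately naive: it would also discard a legitimate DOCTYPE, and it could be fooled by a matching tag name inside a leading comment. That is one reason to keep a check like this optional.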

I hacked it up this afternoon, and will probably update the published mod_proxy_html with it sometime soon.

OK, how does this thing work?

Testing, testing. Is this a platform I want to live with?

Well, signup was easy enough, except that it wouldn’t let me have “bah-humbug”. Yes I know it says letters and numbers only, but that might have been intended as idiot-proof shorthand for “must be a valid hostname”. So I tried bah-humbug, and of course it was rejected.

Now, will this be sensibly formatted?

Bah, Humbug.