I had a phone call today, from a registered user of mod_proxy_html looking for help running it with a Microsoft Sharepoint backend.
His problem appeared at first to be Question 3 in the mod_proxy_html FAQ. But when I finally got a copy of an unfiltered page from him, it turned out to be a variant on it, and one that I could fix for very little effort and negligible runtime overhead.
The page he sent me looked something like:
ï»¿<meta http-equiv="content-type" content="text/html; charset=utf-8">
[followed by parseable tag soup]
I ran that through xmllint, and as expected, the output looked similar to that of mod_proxy_html. The parser (correctly) inserts the implied <html> and <head> when it encounters the leading bogus <meta>. When it then reaches the real <html> tag, it sees a tag that cannot appear inside an already-open HTML document, so the tag's attributes become loose text, which in turn implies <body>. Miraculously, despite the parser now being thoroughly confused, the page displays just fine, apart from that loose text.
OK, now mod_proxy_html already does some body sniffing before starting the HTML parser: specifically, it checks for a charset in an XML BOM, an XML declaration, or an HTML META element. So why not add an optional extra check that strips out any leading junk and starts the parser at the first legitimate element (HTML, HEAD, TITLE or BODY)? That offers users an extra little bit of error correction.
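The core of such a check can be sketched in a few lines of C. This is a minimal illustration, not the actual mod_proxy_html code: the function name skip_leading_junk and its buffer-scanning approach are my own assumptions about how one might do it. It scans forward for the first occurrence of one of the four legitimate opening tags (case-insensitively), and falls back to the untouched buffer if none is found, so well-formed input is left alone.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Hypothetical sketch: return a pointer to the first legitimate
 * element start in buf, skipping any leading junk (BOMs, stray
 * <meta> tags, whitespace).  If nothing recognisable is found,
 * return buf unchanged so the parser gets the original input. */
const char *skip_leading_junk(const char *buf, size_t len)
{
    static const char *const starts[] = {
        "<html", "<head", "<title", "<body"
    };
    for (size_t i = 0; i < len; ++i) {
        if (buf[i] != '<')
            continue;
        for (size_t j = 0; j < sizeof(starts) / sizeof(starts[0]); ++j) {
            size_t n = strlen(starts[j]);
            /* case-insensitive match against each candidate tag */
            if (i + n <= len && strncasecmp(buf + i, starts[j], n) == 0)
                return buf + i;
        }
    }
    return buf;   /* nothing recognisable: hand it over untouched */
}
```

In the real module the scan would of course operate on the buffered brigade rather than a flat string, and would be bounded by the existing sniffing window, but the runtime cost is the same: a single linear pass over the leading bytes, which is why the overhead is negligible.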
I hacked it up this afternoon, and will probably update the published mod_proxy_html with it sometime soon.