Category Archives: apache
Folks who know me will know that I’ve been taking an interest for some time in the problems of online identity and trust:
- Passwords (as we know them today) are a sick joke.
- Monolithic certificate authorities (and browser trust lists) are a serious weakness in web trust.
- PGP and the Web of Trust remain the preserve of geekdom.
- People distrust and even fear centralised databases. At issue are both the motivations of those who run them, and security against intruders.
- Complexity and poor practice open doors for phishing and identity theft.
- Establishing identity and trust can be a nightmare, to the extent that a competent fraudster might find it easier than the real person to establish an identity.
I’m not a cryptographer. But as a mathematician, software developer, and old cynic, I have the essential ingredients. I can see that things are wrong and could so easily be a whole lot better at many levels. It’s not even a hard problem: merely a more rational deployment of existing technology! Some time back I thought about setting myself up in the business of making it happen, but was put off by the ghost of what happened last time I tried (and failed) to launch an innovative startup.
Recently – starting this summer – I’ve embarked on another mission towards improving the status quo. Instead of trying to run my own business, I’ve sought out an existing business doing good work in the field, to which I can hope to make a significant contribution. So the project’s fortunes tap into my strengths as techie rather than my weaknesses as a Suit.
I should add that the project does rather more than just improve the deployment of existing technology, as it significantly advances the underlying cryptographic framework. Most importantly it introduces a Distributed Trust Authority model, as an alternative to the flawed monolithic Certificate Authority and its single point of failure. The distributed model also makes it particularly well-suited to “cloud” applications and to securing the “Internet of Things”.
And it turns out, I arrived at an opportune moment. The project has been single-company open source for some time and generated some interest at github. Now it’s expanding beyond that: a second corporate team is joining development and I understand there are further prospects. So it could really use a higher-level development model than github: one that will actively foster the community and offer mutual assurance and protection to all participants. So we’ve put it forward as a candidate for incubation at Apache. The proposal is here.
If all goes well, this could be the core of my work for some time to come. Here’s hoping for a big success and a better, safer online world.
I haven’t blogged much on software of late. Well, I don’t seem to have blogged so much at all, but my techie contents have been woefully sparse even within a meagre whole.
Well, I’ve just added a new stream editor to Apache Trafficserver. It’s been on my to-do list for a long time to produce similar functionality to sed and sed-like modules in Apache HTTPD. Now I’ve hacked it up, and dropped it into the main repo at /plugins/experimental/stream-editor/. I expect it’ll stay in /experimental/ until and unless it gets sufficient real-world usage to prove itself and sufficient demand to be promoted.
The starting point for this was to duplicate the functionality of mod_line_edit or mod_substitute, but with the capability (offered by mod_sed but not by the others) to rewrite incoming as well as outgoing data. Trafficserver gives me that for free, as the same code will filter both input and output. Some of the more advanced features, such as HTTPD’s environment variables, are not supported.
There were two main problems to deal with. Firstly, the configuration needs to be designed and implemented from scratch: that’s currently documented in the source code. It’s a bit idiosyncratic (I’ll append it below): suggestions welcome. Secondly, the trafficserver API lacks a set of utility classes as provided by APR for Apache HTTPD. To deal with the latter, I hacked it in C++ and used STL containers, in a manner that should hopefully annoy purists in either C (if they exist) or C++ (where they certainly do).
In figuring it out I was able to make some further improvements: in particular, it deals much better than mod_line_edit or mod_substitute with the case where different rules produce conflicting edits, allowing different rules to be assigned different precedences in configuration to resolve conflicts. And it applies all rules in a single pass, avoiding the overhead of reconstituting the data or parsing ever-more-fragmented buffers – though it does have to splice buffers to avoid the risk of losing matches that span input chunks. It parses each chunk of data into an ordered (stl) set before actually applying the edits and dispatching the edited data.
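The precedence idea can be sketched in a few lines. This is a minimal illustration in Python, not the plugin's C++ code: the names `Rule`, `Edit` and `apply_rules` are invented for the example, replacements are treated as literal text (no $1 - $9 expansion), and the whole input is assumed to be in one buffer.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    pattern: str      # regexp to search for
    replacement: str  # literal replacement text (no $1 - $9 expansion here)
    prio: int         # lower value prevails when edits overlap


class Edit(NamedTuple):
    start: int
    end: int
    prio: int
    text: str


def apply_rules(data: str, rules: list[Rule]) -> str:
    # Collect every match of every rule before changing anything.
    candidates = [
        Edit(m.start(), m.end(), rule.prio, rule.replacement)
        for rule in rules
        for m in re.finditer(rule.pattern, data)
    ]
    # Visit high-precedence edits first; keep an edit only if it does not
    # overlap one that has already been kept.
    kept: list[Edit] = []
    for edit in sorted(candidates, key=lambda e: (e.prio, e.start)):
        if all(edit.end <= k.start or edit.start >= k.end for k in kept):
            kept.append(edit)
    # Apply the surviving edits in position order, in a single pass.
    out, pos = [], 0
    for edit in sorted(kept):
        out.append(data[pos:edit.start])
        out.append(edit.text)
        pos = edit.end
    out.append(data[pos:])
    return "".join(out)


rules = [Rule("foobar", "X", 1), Rule("bar", "Y", 2)]
print(apply_rules("foobar bar", rules))
```

Here the "bar" inside "foobar" loses to the higher-precedence rule, while the standalone "bar" is still rewritten. Swapping the two prio values makes the "bar" rule win instead.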
/* stream-editor: apply string and/or regexp search-and-replace to
 * HTTP request and response bodies.
 *
 * Load from plugin.config, with one or more filenames as args.
 * These are config files, and all config files are equal.
 *
 * Each line in a config file and conforming to config syntax specifies a
 * rule for rewriting input or output.
 *
 * A line starting with [out] is an output rule.
 * One starting with [in] is an input rule.
 * Any other line is ignored, so blank lines and comments are fine.
 *
 * Each line must have a from: field and a to: field specifying what it
 * rewrites from and to. Other fields are optional. The full list:
 *    from:flags:value
 *    to:value
 *    scope:flags:value
 *    prio:value
 *    len:value
 *
 * Fields are separated by whitespace. from: and to: fields may contain
 * whitespace if they are quoted. Quoting may use any non-alphanumeric
 * matched-pair delimiter, though the delimiter may not then appear
 * (even escaped) within the value string.
 *
 * Flags are:
 *    i - case-independent matching
 *    r - regexp match
 *    u (applies only to scope) - apply scope match to full URI
 *        starting with "http://" (the default is to match the path
 *        only, as in for example a <Location> in HTTPD).
 *
 * A from: value is a string or a regexp, according to flags.
 * A to: string is a replacement, and may reference regexp memory $1 - $9.
 *
 * A scope: value is likewise a string or (memory-less) regexp and
 * determines the scope of URLs over which the rule applies.
 *
 * A prio: value is a single digit, and determines the priority of the
 * rule. That is to say, if two or more rules generate overlapping matches,
 * the priority value will determine which rule prevails. A lower
 * priority value prevails over a higher one.
 *
 * A len: value is an integer, and applies only to a regexp from:
 * It should be an estimate of the largest match size expected from
 * the from: pattern. It is used internally to determine the size of
 * a continuity buffer, that avoids missing a match that spans more
 * than one incoming data chunk arriving at the stream-editor filter.
 * The default is 20.
 *
 * Performance tips:
 *  - A high len: value on any rule can severely impact on performance,
 *    especially if mixed with short matches that match frequently.
 *  - Specify high-precedence rules (low prio: values) first in your
 *    configuration to avoid reshuffling edits while processing data.
 *
 * Example: a trivial ruleset to escape text in HTML:
 *    [out] scope::/html-escape/ from::"&" to:"&amp;"
 *    [out] scope::/html-escape/ from::< to:&lt;
 *    [out] scope::/html-escape/ from::> to:&gt;
 *    [out] scope::/html-escape/ from::/"/ to:/&quot;/
 * Note, the first & has to be quoted, as the two ampersands in the line
 * would otherwise be mis-parsed as a matching pair of delimiters.
 * Quoting the &, and the " line with //, are optional (and quoting
 * is not applicable to the scope: field).
 * The double-colons delimit flags, of which none are used in this example.
 */
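The continuity buffer mentioned under len: is easiest to see with a plain string match, where the largest possible match size is simply the length of the search string. The following Python sketch is an assumption-laden illustration rather than the plugin's actual logic: `ChunkedReplacer` is a made-up name, and it handles a single literal rule only. The idea is to hold back just enough of each chunk's tail that a match straddling the chunk boundary is still found once the next chunk arrives.

```python
class ChunkedReplacer:
    """Replace a literal string in data that arrives in arbitrary chunks."""

    def __init__(self, needle: str, replacement: str):
        self.needle = needle
        self.replacement = replacement
        self.held = ""  # continuity buffer: unsent tail of earlier chunks

    def feed(self, chunk: str) -> str:
        data = self.held + chunk
        out, pos = [], 0
        while True:
            hit = data.find(self.needle, pos)
            if hit < 0:
                break
            out.append(data[pos:hit])
            out.append(self.replacement)
            pos = hit + len(self.needle)
        # Hold back the last len(needle) - 1 characters: they might be the
        # start of a match that the next chunk completes.
        keep = min(len(self.needle) - 1, len(data) - pos)
        split = len(data) - keep
        out.append(data[pos:split])
        self.held = data[split:]
        return "".join(out)

    def finish(self) -> str:
        # End of stream: whatever is still held cannot become a match.
        tail, self.held = self.held, ""
        return tail


editor = ChunkedReplacer("hello", "HI")
print(editor.feed("say hel") + editor.feed("lo world") + editor.finish())
```

For a regexp the match length is not known in advance, which is presumably why the configuration asks for an estimate. A larger held-back tail means more data is rescanned on every chunk, which fits the performance tip above about high len: values.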
I’ve already posted from ApacheCon about my favourable first impression. I’m happy to say my comments about the fantastic city and hotel have survived the week intact: I was as impressed at the end of the week as at the start. Even the weather improved through the week, so in the second half – when the conference schedule was less intense – I could go out without getting wet.
The main conference sessions were Monday to Wednesday, with all-day schedules and social events in the evening. Thursday was all-day BarCamp, though I skipped the morning in favour of a bit of touristing in the best weather of the week. Thursday and Friday were also the related Cloudstack event. I’m not going to give a detailed account of my week. I attended a mix of talks: a couple on familiar subjects to support and heckle speakers, new and unfamiliar material to educate myself on topics of interest, and – not least – inspirational talks from Apache’s gurus such as Bertrand.
Socially it had a very good feel: as ever I’ve renewed acquaintance with old friends, met new friends, and put faces to names hitherto seen only online. The social scene was no doubt helped not just by the three social evenings laid on, but also by the fact that all meals were provided, encouraging us to stay around the hotel, and that the weather discouraged going elsewhere for the first half of the week. The one thing missing was a keysigning party. Note to self: organise it myself for future conferences if no one else gets there first!
I’ve returned home much refreshed and with some ideas relevant to my work, and an intention to revitalise my Apache work – where I need to cut my involvement down to my three core projects and then give those the time and effort they deserve but which have been sadly lacking of late. Also grossly overfed and bloated. Now I just have to sustain that high, against the adversity of the darkest time of year and temperatures that encourage staying in bed. 😮
Huge thanks to DrBacchus and the team for making it all happen!
It’s lunchtime on the first day of ApacheCon. Too soon to assess the event as a whole, but I’ve formed a view on the venue.
Of all the ApacheCon venues I’ve been to, I think this week’s seems the best. The Corinthia Hotel is about as good as any I’ve encountered, and we’re in a nice area of the great historic city of Budapest. Amsterdam is the only past ApacheCon city that can really rival Budapest, but that was let down by a bad conference hotel. And conversely, where I’ve encountered decent hotels, they’ve been in some altogether less pleasant or interesting locations. At worst we’ve had poor hotels in poor locations.
Come to think of it, that’s not just ApacheCon, it’s conferences of any kind, even stretching back to my days in academia.
Of course, my perception may be coloured by individual circumstances too. I’m not doing anything stressful like giving a talk or tutorial this time. And I may have been fortunate to have been allocated an ideal hotel room, overlooking a quiet quadrangle where I can open the window wide for fresh air without being disturbed either by outside traffic or hotel noise.
Just a couple of flies in the ointment. The weather in bleak November isn’t entirely conducive to getting the most from Budapest. And there are not sufficient power outlets to wield the laptop everywhere around the conference. Even if that’s (arguably) a good thing when in a presentation, the shortage of power points applies even to the designated hacker area, which is itself not a strong point of the event.
OK, time to get back to conferring!
I spent two days last week at the trafficserver summit.
Or rather, two evenings. The summit was held in Silicon Valley (hosted by linkedin), while I remained at home in Blighty with a conferencing link, making me one of several remote attendees. With an 8 hour time difference, each day started at 5pm and went on into the wee hours. On the first day (Tuesday) this followed a day of regular work. On the Wednesday I took a more sensible approach and the only work I did before the summit was a bit of gardening. Despite that I felt more tired on the Wednesday.
The conferencing link was a decent enough instance of its kind, with regular video alongside screen sharing and text (though IRC does a better job with text). The video was pointed at the speakers as they presented, and the screen sharing was used to share their presentations. That was good enough to follow the presentations pretty well: indeed, sometimes better than being there, as I could read all the intricate slides and screens that would’ve been just a blur if I’d been present in the room.
Unfortunately most of the presentations involved discussion around the room, and that was much harder, sometimes impossible, to follow. Also, speaking was not a good experience: I heard my voice some time after I’d spoken, and it sounded ghastly and indistinct, so I muted my microphone. That was using just the builtin mike in the macbook. I tried later with a proper headset when I had something to contribute, but alas it seems by then I (and I think all remote attendees, after the initial difficulties) was muted by the system. So I had something approximating to read-only access. And of course missed out on the social aspects of the event away from the presentations.
In terms of the mechanics of running an event like this, I think in retrospect we could make some modest improvements. We had good two-way communication over IRC, and that might be better-harnessed. Maybe rather than ad-hoc intervention, someone present (a session chair?) could act as designated proxy for remote attendees, and keep an eye on IRC for anyone looking to contribute to discussion. Having such a person would probably have prompted me into action on a few occasions when I had a comment, question or suggestion. Or perhaps better, IRC could be projected onto a second screen in the room, alongside the presenter’s materials.
The speakers and contents were well worth the limitations and antisocial hours of attending. I found a high proportion of the material interesting, informative, and well-presented. Alan, who probably knows more than anyone about Trafficserver internals, spoke at length on a range of topics. The duo of Brian and Bryan (no, not a comedy act) talked about debugging and led discussion on test frameworks.
Other speakers addressed applications and APIs, and deployments, ops and tools. A session I found unexpectedly interesting was Susan on the subject of how, in integrating sophisticated SSL capabilities in a module, she’s been working with Alan to extend the API to meet her needs. It’s an approach from which I might just benefit, and I also need to take a look at whether Ironbee adequately captures all potentially-useful information available from SSL.
At the end I also made (via IRC) one suggestion for a session for the next summit: API review. There’s a lot that’s implemented in Trafficserver core and utils that could usefully be made available to plugins via the API, even just by installing existing header files to a public includes directory. Obviously that requires some control over what is intended to be public, and a stability deal over exported APIs. I have some thoughts over how to deal with those, but I think that’s a subject for the wiki rather than a blog post. One little plea for now: let’s not get hung up on what’s in C vs C++. Accept that exported headers might be either, and let application developers deal with it. If anyone then feels compelled to write a ‘clean’ wrapper, welcome their contribution!
I started writing a longer post about the so-called shell shock, with analysis of what makes a web server vulnerable or secure. Or, strictly speaking, not a webserver, but a platform an attacker might access through a web server. But I’m not sure when I’ll find time to do justice to that, so here’s the short announcement:
I’ve updated mod_taint to offer an ultra-simple defence against the risk of shell shock attacks coming through Apache HTTPD, versions 2.2 or later. A new simplified configuration option is provided specifically for this problem:
LoadModule taint_module modules/mod_taint.so
Untaint shellshock
Here’s some detail from what I posted earlier to the Apache mailinglists:
Untaint works in a directory context, so can be selectively enabled for potentially-vulnerable apps such as those involving CGI, SSI, ExtFilter, or (other) scripts.
This goes through all Request headers, any PATH_INFO and QUERY_STRING, and (just to be paranoid) any other subprocess environment variables. It untaints them against a regexp that checks for “()” at the beginning of a variable, and returns an HTTP 400 error (Bad Request) if found.
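To make the check concrete, here is a rough Python sketch of the same idea. It is not mod_taint's code, and the regexp is my own simplification of "() at the beginning of a variable" rather than the module's actual pattern. The function name `check_request` and the dict-of-values input are assumptions for the example.

```python
import re

# Assumed pattern: a value that begins with "()", as a bash function
# definition smuggled into an environment variable would.
SHELLSHOCK = re.compile(r"^\(\)")


def check_request(values: dict[str, str]) -> int:
    """Return HTTP status 400 if any value looks suspicious, else 200.

    `values` stands in for request headers, PATH_INFO, QUERY_STRING and
    any other variables that would reach a subprocess environment.
    """
    for value in values.values():
        if SHELLSHOCK.search(value):
            return 400
    return 200


print(check_request({"User-Agent": "() { :;}; echo vulnerable"}))
print(check_request({"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}))
```

Anchoring at the start of the value is what keeps ordinary headers with parentheses further along, such as a typical User-Agent, from being rejected.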
Feedback welcome, indeed solicited. I believe this is a simple but sensible approach to protecting potentially-vulnerable systems, but I’m open to contrary views. The exact details, including the shellshock regexp itself, could probably use some refinement. And of course, bug reports!
The fallout from heartbleed seems to be manifesting itself in a range of ways. I’ve been required to set new passwords for a small number of online services, and expect I may encounter others as and when I next access them.
The main contrast seems to be between admins who tell you what’s happening, vs services that just stop working. Contrast Apache and Google:
Apache: email arrives from the infrastructure folks: all system passwords will have to be reset. Then a second email: if you haven’t already, you’ll have to set a new password via the “forgot my password” mechanism (which sends you PGP-encrypted email instructions). All very smooth and maximally secure – unless some glitch has yet to manifest itself.
Google: @employer email address, which is hosted on gmail, just stopped working without explanation. But this is the weekend, and similar things have happened before at weekends, so I ignore it. But when it’s still not back on Monday, I try logging in with my web browser. It allows me that, and insists I set a new password, whereupon normal imap access is also restored. Hmmm … In the first place, no explanation or warning. In the second place, if the password had been compromised then anyone who had it could trivially have reset it. Bottom of the class both for insecurity and for the user experience.
There is also secondary fallout: worried users of products that link OpenSSL asking or wondering what they have to upgrade: for example, here. For most, the answer is that you just upgrade your OpenSSL installation and then restart any services that link it (or reboot the whole system if you favour the sledgehammer approach). Exceptions to that will be cases where you have custom builds with statically linked OpenSSL, or multiple OpenSSL installations (as might reasonably be the case on a developer’s machine). If in doubt, restart your services and check for the OpenSSL version appearing in its startup messages: for example, with Apache HTTPD you’ll see it in the error log at startup.
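The same "check the version string" test can be scripted. A minimal sketch in Python, whose standard `ssl` module reports the OpenSSL it is linked against: the helper name `heartbleed_affected` is mine, and it only knows about the released 1.0.1 series (1.0.1 through 1.0.1f were affected, 1.0.1g carries the fix). It ignores the 1.0.2 beta, and distro packages that backported the fix without changing the version letter will be flagged wrongly, so treat a positive as a prompt to check rather than a verdict.

```python
import re
import ssl


def heartbleed_affected(version: str) -> bool:
    """True for upstream OpenSSL 1.0.1 through 1.0.1f."""
    m = re.search(r"OpenSSL (\d+\.\d+\.\d+)([a-z]*)", version)
    if not m:
        return False  # not an OpenSSL version string we recognise
    number, letter = m.groups()
    # "" (plain 1.0.1) and "a".."f" sort before "g", the fixed release.
    return number == "1.0.1" and letter < "g"


# The OpenSSL that this Python interpreter is linked against.
print(ssl.OPENSSL_VERSION, heartbleed_affected(ssl.OPENSSL_VERSION))
```

Note that this reports on Python's own OpenSSL, which on a machine with multiple installations or static builds may not be the one your server uses. That is the same caveat as above.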
mod_form is one of my old Apache modules. It serves to parse a standard form, and make its contents available to application modules in Apache. One fewer wheel for application modules to reinvent.
Like many of my older Apache modules, I wrote it for my own applications, but released it as open source in case it might be of use to anyone. I hadn’t heard of anyone using it, but then I wouldn’t necessarily: I’ve seen my forgotten works pop up in a few different contexts, sometimes as-is, sometimes developed a lot further than I ever took them.
A day or two ago I got email from Peter Pöml, telling me that it is used by MirrorBrain to parse arguments. But this usage requires a patch: mod_form as-was consumes the data so they no longer exist for anything else that needs the unparsed data. A very simple patch: just copy the data before parsing and leave the original untouched.
The patched version has been in use since 2007. But now it seems Fedora packaged it un-patched for MirrorBrain, leaving potential breakage in unexpected places. Whoops!
Peter’s patch is simple and beneficial, and carries no risk of breaking anything. So I’ve just applied it: download it now and you’ll get Peter’s improvement. mod_form is not versioned (I never considered it important enough – maybe I’ll rethink if it’s being packaged in the mainstream) so it won’t be immediately obvious. Blogging here for the benefit of anyone googling the story.
POSTSCRIPT (Jan 10): Peter mailed me again. It seems my information was incomplete, and the Fedora package was patched after all. There’s also another patch (from SUSE) for Apache 2.4 per-module logging, which I’ll look at when I have time.
I meant to blog this upwards of a week ago, but I guess better late than never – at least when the subject isn’t so topical to the moment as to go instantly stale.
Apache Trafficserver 4.0.1 was released on August 30th. It’s basically a production release of what has hitherto been the developer (unstable) series 3.3.x. It’s actually also an incremental upgrade from earlier 3.x releases, in that existing users should be able to upgrade to 4.0 as a drop-in replacement or with very minimal reconfiguration, though of course test before deploying in production! And if you use third-party add-ons, check with their developers or support.
Ironbee, the leading WAF and the add-on with which I’m substantially involved, has always tracked Trafficserver development versions, and is thus ready for Trafficserver 4. Users are encouraged to upgrade as soon as they are ready, subject of course to the general testing you would always apply to a change of platform. If you find any issues, please raise them in the relevant fora for Ironbee and/or Trafficserver.
Please note, although I work on both the Trafficserver and Ironbee projects, I don’t speak on behalf of either of them when I blog. None of the above is in any sense official.
Some people engage in Holy Wars over what source control system to use. For my part I really can’t get too worked up over a choice of tools, but I am concerned about another question. What files do you keep in a source control repository?
I’d like to say source files. Program source files, inputs for your choice of build system, legal stuff like licenses and acknowledgements, matters of record, documentation. The key point is, files that are rightfully under the direct control of project members. Not files that are generated by software, or managed by third-parties.
In practice, this principle is all-too-often lost. One example is Apache HTTPD, whose source repos contain extensive HTML documentation that is not written by developers but generated from XML source. There’s a clue in the headers of each of these files:
<!--
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
This file is generated from xml source: DO NOT EDIT
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
-->
So these files are not source, and should really be generated in the build (or made a configuration option) rather than kept under source control. But apart from raising the overhead of using the repos, they’re harmless.
I’ve recently come upon an altogether more problematic case. It manifested itself after I’d installed all the prerequisites for a configure to succeed, but found my build fell down in compiling something. Scrolling up through reams of error messages, I find at the top:
#error This file was generated by a newer version of protoc which is
#error incompatible with your Protocol Buffer headers. Please update
#error your headers.
OK, that’s simple enough: the version of google protobuf I installed with aptitude is too old. Go to google and download the latest (cursing google for failing to sign it). And hack protobuf.m4 to detect this error from configure rather than fall over in the build.
But hang on! It’s not as simple as that. This isn’t the usual dependency on a minimum version: it’s a requirement for an exact version of protobuf. If I install a version that’s too new I get another error:
#error This file was generated by an older version of protoc which is
#error incompatible with your Protocol Buffer headers. Please
#error regenerate this file with a newer version of protoc.
Altogether more problematic. Nightmare if I have more than one app each requiring different protobuf versions. And this is a library I’m building: it could be linked with somesuch. Ouch!
The clue is at the top of the file that generates the errors:
// Generated by the protocol buffer compiler. DO NOT EDIT!
// source: [filename].proto
This C++ is not source, it’s an intermediate file generated by protoc, which is part of the protobuf package. Its source is the .proto file, which is also there in the repo but not used for the build. It follows that hacking protobuf.m4 to test the version was the wrong solution: instead the build should be updated to generate the intermediate files from the .proto source.
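As a rough sketch of what "generate in the build" means, here is a small Python helper of the kind a build step might call. The function names and directory layout are hypothetical, and a real project would more likely express this as a rule in its existing build system. The protoc options used, `--proto_path` and `--cpp_out`, are the standard ones, and protoc's C++ output for `foo.proto` is `foo.pb.h` and `foo.pb.cc`.

```python
import os
import subprocess


def protoc_command(proto: str, out_dir: str) -> list[str]:
    """Command line that regenerates the C++ files for one .proto file."""
    proto_dir = os.path.dirname(proto) or "."
    return ["protoc", f"--proto_path={proto_dir}", f"--cpp_out={out_dir}", proto]


def generated_paths(proto: str, out_dir: str) -> list[str]:
    """Files protoc writes for this .proto (foo.proto -> foo.pb.h, foo.pb.cc)."""
    stem = os.path.splitext(os.path.basename(proto))[0]
    return [os.path.join(out_dir, stem + ext) for ext in (".pb.h", ".pb.cc")]


def needs_regen(proto: str, out_dir: str) -> bool:
    """True if any generated file is missing or older than its source."""
    src_time = os.path.getmtime(proto)
    return any(
        not os.path.exists(path) or os.path.getmtime(path) < src_time
        for path in generated_paths(proto, out_dir)
    )


def regenerate(proto: str, out_dir: str) -> None:
    """Run the locally installed protoc if the outputs are stale."""
    if needs_regen(proto, out_dir):
        subprocess.run(protoc_command(proto, out_dir), check=True)


print(protoc_command("src/msg.proto", "gen"))
```

Because the generated files are no longer in the repo, a fresh checkout always builds them with whatever protoc is installed, so they match the installed headers by construction.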