Stream Editor for Trafficserver
I haven’t blogged much on software of late. Well, I don’t seem to have blogged so much at all, but my techie contents have been woefully sparse even within a meagre whole.
Well, I’ve just added a new stream editor to Apache Trafficserver. It’s been on my to-do list for a long time to produce functionality similar to sed and the sed-like modules in Apache HTTPD. Now I’ve hacked it up, and dropped it in to the main repo at /plugins/experimental/stream-editor/. I expect it’ll stay in /experimental/ until and unless it gets sufficient real-world usage to prove itself and sufficient demand to be promoted.
The starting point for this was to duplicate the functionality of mod_line_edit or mod_substitute, but with the capability (offered by mod_sed but not by the others) to rewrite incoming as well as outgoing data. Trafficserver gives me that for free, as the same code will filter both input and output. Some of the more advanced features, such as HTTPD’s environment variables, are not supported.
There were two main problems to deal with. Firstly, the configuration needed to be designed and implemented from scratch: that’s currently documented in the source code. It’s a bit idiosyncratic (I’ll append it below): suggestions welcome. Secondly, the Trafficserver API lacks a set of utility classes as provided by APR for Apache HTTPD. To deal with the latter, I hacked it in C++ and used STL containers, in a manner that should hopefully annoy purists in either C (if they exist) or C++ (where they certainly do).
In figuring it out I was able to make some further improvements. In particular, it deals much better than mod_line_edit or mod_substitute with the case where different rules produce conflicting edits: rules can be assigned different precedences in configuration to resolve conflicts. And it applies all rules in a single pass, avoiding the overhead of reconstituting the data or parsing ever-more-fragmented buffers – though it does have to splice buffers to avoid the risk of losing matches that span input chunks. It parses each chunk of data into an ordered (STL) set of edits before actually applying the edits and dispatching the edited data.
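To make the single-pass idea concrete, here is a minimal sketch of how edits might be collected into an ordered set and then applied left to right. It is not the plugin's code: the Rule and Edit types and the add_edit and apply_rules functions are names I've made up for illustration, it handles literal strings only, and it assumes that on equal priority the edit found first wins.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical types for illustration; not the plugin's actual classes.
struct Rule {
    std::string from;  // literal text to search for
    std::string to;    // replacement text
    int prio;          // lower value prevails when matches overlap
};

struct Edit {
    std::size_t start;  // offset of the match in the input
    std::size_t len;    // number of bytes being replaced
    std::string repl;   // replacement text
    int prio;
    // Order by start offset so the set can be walked left to right.
    bool operator<(const Edit& o) const { return start < o.start; }
};

static bool overlaps(const Edit& a, const Edit& b) {
    return a.start < b.start + b.len && b.start < a.start + a.len;
}

// Add an edit unless an overlapping edit of equal or better priority is
// already present; overlapping edits of worse priority are evicted.
static void add_edit(std::set<Edit>& edits, const Edit& e) {
    std::vector<std::set<Edit>::iterator> losers;
    for (auto it = edits.begin(); it != edits.end(); ++it) {
        if (!overlaps(*it, e)) continue;
        if (it->prio <= e.prio) return;  // the existing edit prevails
        losers.push_back(it);
    }
    for (auto it : losers) edits.erase(it);
    edits.insert(e);
}

// Find every match for every rule first, then apply all surviving edits
// in one pass over the original input.
std::string apply_rules(const std::string& in, const std::vector<Rule>& rules) {
    std::set<Edit> edits;
    for (const Rule& r : rules) {
        if (r.from.empty()) continue;
        for (std::size_t pos = in.find(r.from); pos != std::string::npos;
             pos = in.find(r.from, pos + r.from.size())) {
            add_edit(edits, Edit{pos, r.from.size(), r.to, r.prio});
        }
    }
    std::string out;
    std::size_t cursor = 0;
    for (const Edit& e : edits) {
        assert(e.start >= cursor);  // surviving edits never overlap
        out.append(in, cursor, e.start - cursor);
        out += e.repl;
        cursor = e.start + e.len;
    }
    out.append(in, cursor, std::string::npos);
    return out;
}
```

With rules abc to X at prio 2 and bc to Y at prio 1, the input "abcd" comes out as "aYd": the two matches overlap and the lower prio value prevails. Because every match is found against the original data, the & introduced by replacing < with &lt; is not escaped again by a rule for &.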
/* stream-editor: apply string and/or regexp search-and-replace to
 * HTTP request and response bodies.
 *
 * Load from plugin.config, with one or more filenames as args.
 * These are config files, and all config files are equal.
 *
 * Each line in a config file and conforming to config syntax specifies a
 * rule for rewriting input or output.
 *
 * A line starting with [out] is an output rule.
 * One starting with [in] is an input rule.
 * Any other line is ignored, so blank lines and comments are fine.
 *
 * Each line must have a from: field and a to: field specifying what it
 * rewrites from and to. Other fields are optional. The full list:
 *    from:flags:value
 *    to:value
 *    scope:flags:value
 *    prio:value
 *    len:value
 *
 * Fields are separated by whitespace. from: and to: fields may contain
 * whitespace if they are quoted. Quoting may use any non-alphanumeric
 * matched-pair delimiter, though the delimiter may not then appear
 * (even escaped) within the value string.
 *
 * Flags are:
 *    i - case-independent matching
 *    r - regexp match
 *    u (applies only to scope) - apply scope match to full URI
 *        starting with "http://" (the default is to match the path
 *        only, as in for example a <Location> in HTTPD).
 *
 *
 * A from: value is a string or a regexp, according to flags.
 * A to: string is a replacement, and may reference regexp memory $1 - $9.
 *
 * A scope: value is likewise a string or (memory-less) regexp and
 * determines the scope of URLs over which the rule applies.
 *
 * A prio: value is a single digit, and determines the priority of the
 * rule. That is to say, if two or more rules generate overlapping matches,
 * the priority value will determine which rule prevails. A lower
 * priority value prevails over a higher one.
 *
 * A len: value is an integer, and applies only to a regexp from:
 * It should be an estimate of the largest match size expected from
 * the from: pattern. It is used internally to determine the size of
 * a continuity buffer, that avoids missing a match that spans more
 * than one incoming data chunk arriving at the stream-editor filter.
 * The default is 20.
 *
 * Performance tips:
 *  - A high len: value on any rule can severely impact on performance,
 *    especially if mixed with short matches that match frequently.
 *  - Specify high-precedence rules (low prio: values) first in your
 *    configuration to avoid reshuffling edits while processing data.
 *
 * Example: a trivial ruleset to escape text in HTML:
 *    [out] scope::/html-escape/ from::"&" to:"&amp;"
 *    [out] scope::/html-escape/ from::< to:&lt;
 *    [out] scope::/html-escape/ from::> to:&gt;
 *    [out] scope::/html-escape/ from::/"/ to:/&quot;/
 * Note, the first & has to be quoted, as the two ampersands in the line
 * would otherwise be mis-parsed as a matching pair of delimiters.
 * Quoting the &amp;, and the " line with //, are optional (and quoting
 * is not applicable to the scope: field).
 * The double-colons delimit flags, of which none are used in this example.
 */
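The continuity buffer mentioned under len: is easiest to see in a toy version. The sketch below is an assumption-laden illustration, not the plugin's implementation: ChunkedReplacer, feed and finish are hypothetical names, and it handles one literal pattern, so the amount held back is simply the pattern length minus one. For a regexp rule the longest match isn't known in advance, which is presumably where the len: estimate comes in.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper: literal search-and-replace over a stream of chunks,
// holding back a short tail so a match split across chunks is not missed.
class ChunkedReplacer {
public:
    ChunkedReplacer(std::string from, std::string to)
        : from_(std::move(from)), to_(std::move(to)) {
        assert(!from_.empty());
    }

    // Feed one chunk; returns the output that is safe to emit now.
    std::string feed(const std::string& chunk) {
        pending_ += chunk;
        std::string out;
        std::size_t cursor = 0;
        std::size_t pos;
        while ((pos = pending_.find(from_, cursor)) != std::string::npos) {
            out.append(pending_, cursor, pos - cursor);
            out += to_;
            cursor = pos + from_.size();
        }
        // Hold back at most from_.size() - 1 trailing bytes: they may be
        // the start of a match completed by the next chunk.
        std::size_t keep = std::min(from_.size() - 1, pending_.size() - cursor);
        std::size_t emit_end = pending_.size() - keep;
        out.append(pending_, cursor, emit_end - cursor);
        pending_.erase(0, emit_end);
        return out;
    }

    // End of stream: the held-back tail is too short to contain a match.
    std::string finish() {
        std::string out;
        out.swap(pending_);
        return out;
    }

private:
    std::string from_;
    std::string to_;
    std::string pending_;  // the continuity buffer
};
```

Feeding "say hel" and then "lo world" to a replacer for hello to bye, followed by finish, yields "say bye world" overall, even though no single chunk contains the whole match. It also shows why a large hold-back hurts: every chunk boundary delays that many bytes and rescans them.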