Regex Guru

Friday, 4 April 2008

Escape Characters Only When Necessary

Filed under: Regex Philosophy — Jan Goyvaerts @ 12:35

A lot of people seem to have a habit of escaping all non-alphanumeric characters that they want to treat as literals in their regular expressions. E.g. to match #1+1=2 they’ll write \#1\+1\=2 instead of #1\+1=2. Though these regexes are equivalent in all modern regex flavors, the extraneous backslashes don’t exactly make the pattern more readable. And when formatted as a C++ string, “\\#1\\+1\\=2” is definitely a step back from “#1\\+1=2”.
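As a quick check of that equivalence, here is a minimal sketch using Python's re module. Python is an assumption made for illustration; the post itself only mentions C++ strings. Both patterns should accept exactly the same text:

```python
import re

subject = "#1+1=2"

# Over-escaped: every non-alphanumeric character gets a backslash.
over_escaped = re.compile(r"\#1\+1\=2")

# Minimal: only the + is a metacharacter, so only it is escaped.
minimal = re.compile(r"#1\+1=2")

# True only if both patterns match the whole subject.
both_match = (over_escaped.fullmatch(subject) is not None
              and minimal.fullmatch(subject) is not None)
print(both_match)
```

The raw string prefix r"..." sidesteps the doubled backslashes that the C++ string literal needs.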

Beyond readability, needlessly escaping characters can also lead to subtle problems. In most flavors, < and \< both match a literal <. But in some flavors, like the GNU flavors, < is a literal and \< is a word boundary that matches at the start of a word.

Similarly, _ and \_ usually simply match _. But the .NET framework treats \_ as an error, just like most modern flavors treat escaped letters that don’t form a regex token, like \j, as an error. This is done to reserve these letters for future expansion. I recommend that you treat non-alphanumerics the same, and escape only metacharacters.
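How a given flavor reacts to needless escapes is easy to probe. The sketch below assumes Python 3.7 or later, which is just one flavor picked for illustration. There, escaped punctuation such as \_ and \< is accepted as a literal, while an escaped letter that forms no token is rejected when the pattern is compiled:

```python
import re

# Escaped punctuation that is not a token falls back to the literal character.
underscore_ok = re.fullmatch(r"\_", "_") is not None
angle_ok = re.fullmatch(r"\<", "<") is not None

# An escaped ASCII letter that forms no token is rejected at compile time.
try:
    re.compile(r"\j")
    letter_rejected = False
except re.error:
    letter_rejected = True

print(underscore_ok, angle_ok, letter_rejected)
```

Other flavors may answer differently for the same three patterns, which is exactly why escaping only metacharacters is the safer habit.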

Modern regex flavors have 11 metacharacters outside character classes: the opening square bracket [, the backslash \, the caret ^, the dollar sign $, the period or dot ., the vertical bar or pipe symbol |, the question mark ?, the asterisk or star *, the plus sign +, the opening round bracket ( and the closing round bracket ).

The closing square bracket and the curly braces are indeed not in this list. The closing square bracket is an ordinary character outside character classes. Sometimes I do escape it for readability, e.g. when using a regex like \[[0-9a-f]\] to match [a]. The opening curly brace only needs to be escaped if it would otherwise form a quantifier like {3}. An exception to this rule is Java, which always requires { to be escaped.
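The rule above can be sketched as a small helper that escapes only what needs escaping. The function name escape_minimal is hypothetical, and always escaping { is an assumption added here as a conservative choice, since it covers both the quantifier case and Java's stricter rule:

```python
import re

# The 11 metacharacters listed above, plus "{" as a conservative extra
# (an assumption of this sketch, not part of the list of 11).
_METACHARS = set("[\\^$.|?*+()") | {"{"}


def escape_minimal(literal: str) -> str:
    """Backslash-escape only the characters that are special outside a character class."""
    return "".join("\\" + ch if ch in _METACHARS else ch for ch in literal)


pattern = escape_minimal("#1+1=2")
print(pattern)
```

The closing square bracket and the closing curly brace are left alone, since neither is special on its own outside a character class.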

Inside character classes, different metacharacters apply. Namely, the caret ^, the hyphen -, the closing bracket ] and the backslash itself are metacharacters. You can actually avoid escaping these, except for the backslash, by positioning them so that their special meaning cannot apply. You can place ] right after the opening bracket, - right before the closing bracket and ^ anywhere except right after the opening bracket. So []^\\-] matches any of the 4 metacharacters inside character classes.

Again, one flavor has to deviate from normal practice. The JavaScript standard treats [] as an empty character class. This is not very useful, as it can never match anything. No surprise that the Internet Explorer developers got this wrong, and follow the usual practice of treating ] after [ as a literal.

I recommend that you escape the 4 metacharacters inside character classes for maximum compatibility with various flavors, and to make your regex easier to understand by other developers who may be confused by something like []^\\-]. But don’t needlessly add backslashes to a regex like [*/._] which is perfectly fine without.
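To see the positioning trick next to the escaped form, here is a sketch in Python. That flavor is an assumption for illustration; it follows the usual practice of treating ] right after [ as a literal. The two classes should accept and reject the same characters:

```python
import re

# Positioned: ] comes first, - comes last, ^ is not first; only \ is escaped.
positioned = re.compile(r"[]^\\-]")

# Escaped: all four class metacharacters carry a backslash.
escaped = re.compile(r"[\]\^\\\-]")

# Compare the two classes on the four metacharacters and a few other characters.
probe = "]^\\-ab*["
classes_agree = all(
    (positioned.fullmatch(ch) is not None) == (escaped.fullmatch(ch) is not None)
    for ch in probe
)

# None of these characters is special inside a class, so no backslashes are needed.
plain = re.compile(r"[*/._]")
plain_matches_all = all(plain.fullmatch(ch) is not None for ch in "*/._")

print(classes_agree, plain_matches_all)
```

The escaped form is longer, but it does not depend on how a flavor handles ] directly after the opening bracket.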

Thursday, 3 April 2008

wxRegEx class in wxWidgets

Filed under: Regex Code — Jan Goyvaerts @ 16:24

wxWidgets is a popular open source cross-platform windowing toolkit for C++ and other programming languages. Included with this toolkit is the wxRegEx class. This class encapsulates the “Advanced Regular Expressions” engine that was originally developed for Tcl. This means that anything you read about Tcl’s regular expression flavor also applies to wxRegEx. Since wxRegEx is compiled from the actual ARE source code, there are no compatibility issues. In RegexBuddy, simply select the Tcl ARE flavor to create patterns for wxRegEx. The only caveat is that you need to specify the wxRE_ADVANCED flag to wxRegEx::Compile(), or you’ll be stuck with plain old POSIX EREs.

I’ve been putting this class through its paces for a few days. I’ve written some documentation for wxRegEx that’s a bit more detailed than the official docs. The class is fairly bare-bones. You can compile a regex, find the first match in a string, and search-and-replace any number of matches in the string. That’s it. RegexBuddy 3.1.1, released today, includes a new source code template for wxRegEx. It generates source code snippets for the basic wxRegEx tasks I just mentioned. I also put in some more elaborate code snippets to iterate over all matches in a string, and to split a string into a wxArrayString.

You can do anything with wxRegEx that you could do in a programming language with built-in regex support. But it generally takes a bit more C++ code to get the job done. If you’ve already written your own support routines based on wxRegEx, you can easily edit RegexBuddy’s source code templates for wxWidgets to use your own routines. Just click the Edit button on the toolbar under the Use tab.

Wednesday, 26 March 2008

No One-on-One Advice

Filed under: About Regex Guru — Jan Goyvaerts @ 9:32

It happened a little sooner than I expected. I just deleted the first comment that was a request for help with creating a particular regular expression. I really don’t have the time for one-on-one tech support. Not even if you offer to pay me for it, or buy a copy of RegexBuddy. RegexBuddy does come with free technical support by email. But it only covers RegexBuddy itself. It doesn’t cover regular expressions in general. Just like Microsoft tech support won’t give you free counseling on your next C# project just because you bought a copy of Visual Studio.

I much prefer to spend my time writing this blog and web site. This way I can reach far more people than I could with one-on-one support. If there are any particular topics you’d like me to write about, please let me know. Feel free to leave a comment or use the feedback form.

The only place where I assist with individual regex problems is in the RegexBuddy user forum. At least when the other regex experts hanging out there don’t beat me to it. The forum is built into RegexBuddy itself. Just click on the Forum tab, and then on the Login button.

Friday, 21 March 2008

preg_replace_callback

Filed under: Regex Code — Jan Goyvaerts @ 10:48

I just added a paragraph about preg_replace_callback to the PHP reference on regular-expressions.info. This function is just like preg_replace, with one important difference: instead of passing the replacement as a literal string (or array of strings), you pass it the name of a function. This function will be called for each match. In the function, you can do whatever calculations you want to produce the replacement text.

Guess what the following code does:

// Pass the callback's name as a quoted string
$result = preg_replace_callback('/(\d+)\+(\d+)/', 'compute_replacement', $subject);

function compute_replacement($groups) {
  // You can vary the replacement text for each match on-the-fly
  // $groups[0] holds the regex match
  // $groups[n] holds the match for capturing group n
  // The captures are numeric strings, so + adds them as numbers
  return $groups[1] + $groups[2];
}

A few other programming languages have similar functionality. E.g. in .NET, you’d pass a MatchEvaluator instance to the Regex.Replace() method. RegexBuddy can already generate such code snippets for .NET and Java. The PHP version will be added in the next free minor update.
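Python is another language with this kind of callback replacement: re.sub() accepts a function in place of a replacement string. Python is not mentioned in the post, so take the following as an illustrative sketch of the same idea and not as RegexBuddy output:

```python
import re


def compute_replacement(match: re.Match) -> str:
    # match.group(0) holds the regex match
    # match.group(n) holds the match for capturing group n
    # The replacement must be a string, so convert the sum back.
    return str(int(match.group(1)) + int(match.group(2)))


subject = "1+1 and 20+22"
result = re.sub(r"(\d+)\+(\d+)", compute_replacement, subject)
print(result)  # prints "2 and 42"
```

Unlike PHP, Python does not convert between strings and numbers implicitly, hence the explicit int() and str() calls.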
