Regex Guru

Thursday, 3 April 2008

wxRegEx class in wxWidgets

Filed under: Regex Code — Jan Goyvaerts @ 16:24

wxWidgets is a popular open source cross-platform windowing toolkit for C++ and other programming languages. Included with this toolkit is the wxRegEx class. This class encapsulates the “Advanced Regular Expressions” engine that was originally developed for Tcl. This means that anything you read about Tcl’s regular expression flavor also applies to wxRegEx. Since wxRegEx is compiled from the actual ARE source code, there are no compatibility issues. In RegexBuddy, simply select the Tcl ARE flavor to create patterns for wxRegEx. The only caveat is that you need to pass the wxRE_ADVANCED flag to wxRegEx::Compile(), or you’ll be stuck with plain old POSIX EREs.

I’ve been putting this class through its paces for a few days. I’ve written some documentation for wxRegEx that’s a bit more detailed than the official docs. The class is fairly bare-bones. You can compile a regex, find the first match in a string, and search-and-replace any number of matches in the string. That’s it. RegexBuddy 3.1.1, released today, includes a new source code template for wxRegEx. It generates source code snippets for the basic wxRegEx tasks I just mentioned. I also put in some more elaborate code snippets to iterate over all matches in a string, and to split a string into a wxArrayString.

You can do anything with wxRegEx that you could do in a programming language with built-in regex support. But it generally takes a bit more C++ code to get the job done. If you’ve already written your own support routines based on wxRegEx, you can easily edit RegexBuddy’s source code templates for wxWidgets to use your own routines. Just click the Edit button on the toolbar under the Use tab.

Wednesday, 26 March 2008

No One-on-One Advice

Filed under: About Regex Guru — Jan Goyvaerts @ 9:32

It happened a little sooner than I expected. I just deleted the first comment that was a request for help with creating a particular regular expression. I really don’t have the time for one-on-one tech support. Not even if you offer to pay me for it, or buy a copy of RegexBuddy. RegexBuddy does come with free technical support by email. But it only covers RegexBuddy itself. It doesn’t cover regular expressions in general. Just like Microsoft tech support won’t give you free counseling on your next C# project just because you bought a copy of Visual Studio.

I much prefer to spend my time writing this blog and web site. This way I can reach far more people than I could with one-on-one support. If there are any particular topics you’d like me to write about, please let me know. Feel free to leave a comment or use the feedback form.

The only place where I assist with individual regex problems is in the RegexBuddy user forum. At least when the other regex experts hanging out there don’t beat me to it. The forum is built into RegexBuddy itself. Just click on the Forum tab, and then on the Login button.

Friday, 21 March 2008

preg_replace_callback

Filed under: Regex Code — Jan Goyvaerts @ 10:48

I just added a paragraph about preg_replace_callback to the PHP reference on regular-expressions.info. This function is just like preg_replace, with one important difference: instead of passing the replacement as a literal string (or array of strings), you pass it the name of a function. This function will be called for each match. In the function, you can do whatever calculations you want to produce the replacement text.

Guess what the following code does:

// Pass the callback's name as a string; a bare identifier is treated as an undefined constant
$result = preg_replace_callback('/(\d+)\+(\d+)/', 'compute_replacement', $subject);

function compute_replacement($groups) {
  // You can vary the replacement text for each match on-the-fly
  // $groups[0] holds the regex match
  // $groups[n] holds the match for capturing group n
  return $groups[1] + $groups[2];
}

A few other programming languages have similar functionality. E.g. in .NET, you’d pass a MatchEvaluator instance to the Regex.Replace() method. RegexBuddy can already generate such code snippets for .NET and Java. The PHP version will be added in the next free minor update.
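Python is another language with this kind of callback replacement: re.sub() accepts a function in place of a replacement string. The following is only a minimal sketch of the same addition example, not code generated by RegexBuddy. The sample subject string is made up for illustration.

```python
import re

def compute_replacement(match):
    # match.group(0) holds the regex match
    # match.group(n) holds the match for capturing group n
    # Groups are strings, so convert before adding and back afterwards
    return str(int(match.group(1)) + int(match.group(2)))

subject = "2+3 and 10+20"  # hypothetical sample input
result = re.sub(r"(\d+)\+(\d+)", compute_replacement, subject)
print(result)  # prints "5 and 30"
```

Unlike the PHP version, the callback receives a match object rather than an array of groups, and it has to return a string.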

Tuesday, 18 March 2008

If You Do It Differently, Document It Clearly

Filed under: The Guru's Kitchen — Jan Goyvaerts @ 18:19

Earlier today during development, I was writing some code that deals with mode modifiers. Most modern regex flavors use the (?i), (?s), (?m) and (?x) modifiers first used in Perl. Though the s and m modes are misnamed, at least they’re easy enough to remember once you get the hang of it.

Tcl’s ARE engine, however, tried to improve the situation. Instead of a “single line” and a “multi line” option that can both be on or off, yielding 4 states, Tcl uses the terms “non-newline-sensitive”, “partial newline-sensitive”, “inverse partial newline-sensitive” (a.k.a. “weird”) and “newline-sensitive” for each of the 4 combinations, and four letters to go with the 4 names. The defaults are also different.

I can never remember Tcl’s matching modes. I don’t use Tcl other than for testing its regex engine. So I checked my own documentation on the subject. And I found I was contradicting myself. What I wrote in the bullet points contradicted other bullet points, and the comparison table with Perl further down the page. Turns out the (?w) and (?n) bullet points and table items were all wrong, in different ways.

To figure this out I consulted the official Tcl docs once more:

If newline-sensitive matching is specified, . and bracket expressions using ^ will never match the newline character (so that matches will never cross newlines unless the RE explicitly arranges it) and ^ and $ will match the empty string after and before a newline respectively, in addition to matching at beginning and end of string respectively. ARE \A and \Z continue to match beginning or end of string only.

If partial newline-sensitive matching is specified, this affects . and bracket expressions as with newline-sensitive matching, but not ^ and $.

If inverse partial newline-sensitive matching is specified, this affects ^ and $ as with newline-sensitive matching, but not . and bracket expressions. This isn’t very useful but is provided for symmetry.

I don’t know about you, but the above makes little sense to me. After testing Tcl’s engine again, I found that it’s actually technically correct. It’s just hard to understand when explained like this. RegexBuddy does get its explanation right on the Create tab.
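One way to make sense of the four modes is to map each of them onto the two on/off options. The sketch below is written in Python rather than Tcl, so it only approximates the behavior: Python’s re.DOTALL and re.MULTILINE flags stand in for “dot matches newlines” and “^ and $ match at line breaks”. The TCL_MODES table, the probe() helper and the sample subject are names made up for this illustration. Note also that Python’s negated character classes are not affected by either flag, whereas Tcl’s bracket expressions are.

```python
import re

SUBJECT = "one\ntwo\nthree"  # hypothetical sample with embedded line breaks

# Tcl mode letter -> (Tcl's name, Python flags that approximate it)
TCL_MODES = {
    "s": ("non-newline-sensitive", re.DOTALL),
    "p": ("partial newline-sensitive", 0),
    "w": ("inverse partial newline-sensitive", re.DOTALL | re.MULTILINE),
    "n": ("newline-sensitive", re.MULTILINE),
}

def probe(flags):
    """Return (dot matches newlines, ^ and $ match at line breaks)."""
    # The dot has to consume the newline between "one" and "two"
    dot_matches_newlines = re.search(r"one.two", SUBJECT, flags) is not None
    # "two" is only at the start and end of a line, never of the whole subject
    anchors_at_line_breaks = re.search(r"^two$", SUBJECT, flags) is not None
    return dot_matches_newlines, anchors_at_line_breaks

for letter, (name, flags) in TCL_MODES.items():
    dot, anchors = probe(flags)
    print(f"(?{letter}) {name}: dot matches newlines={dot}, "
          f"^ and $ match at line breaks={anchors}")
```

Read this way, (?p) corresponds to Perl’s defaults, (?s) to Perl’s (?s), (?n) to Perl’s (?m), and (?w) to Perl’s (?sm).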

It doesn’t matter if Perl’s or Tcl’s way of specifying what RegexBuddy calls “dot matches newlines” and “^ and $ match at line breaks” is better. Perl’s the established way, and Tcl thinks it can do better. But Tcl then does a poor job of explaining its improvements, which only leads to confusion.

If you’re improving on established standards, make sure to explain yourself clearly. People are used to the old ways, and will resist change, particularly if you make change difficult with poor documentation.

So what’s my opinion on “dot matches newlines” and “^ and $ match at line breaks”? The latter is obsolete. Perl, Tcl and most flavors that follow Perl, have \A and \Z to match the start and end of a subject. So redefining ^ and $ to match at embedded line breaks is fair game. In EditPad Pro and PowerGREP, “^ and $ match at line breaks” is permanently enabled, though you could put (?-m) at the start of your regex if you must. The “dot matches newlines” option is still useful, because doing the same with character classes is cumbersome. What Tcl’s docs call “weird” and “not very useful” is actually quite handy when dealing with data spread over multiple lines in a larger file (i.e. turning on both “dot matches newlines” and “^ and $ match at line breaks”).
