Regex Guru

Wednesday, 23 April 2008

PCRE Library for MySQL

Filed under: Regex Libraries — Jan Goyvaerts @ 11:24

A RegexBuddy user pointed me to LIB_MYSQLUDF_PREG. This is an open source library of MySQL user functions that imports the PCRE library.

MySQL’s built-in regular expression support uses the POSIX ERE flavor. By todays standards, that flavor offers limited regex functionality. PCRE on the other hand offers all the goodies from Perl and other modern regex flavors.

If you want to work with LIB_MYSQLUDF_PREG, you’ll need to set the regex flavor to PCRE. Use the “PHP preg operator” string style when copying and pasting regular expressions. This will format regex as ‘/regex/’ as required by LIB_MYSQLUDF_PREG.

I haven’t tried to use LIB_MYSQLUDF_PREG myself. I don’t have access to a MySQL server where I can install such libraries.

If you want RegexBuddy to generate source code snippets for LIB_MYSQLUDF_PREG, you can edit the provided MySQL template. Change the regex flavor to PCRE and the string style to PHP/preg. Then edit the functions to use the PREG_* calls instead of MySQL’s built-in operators. Save your custom template, and share it on the RegexBuddy user forum. :-)

These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Digg
  • Sphinn
  • del.icio.us
  • Facebook
  • Mixx
  • Google
  • YahooMyWeb
  • Live
  • StumbleUpon
  • Spurl

Tuesday, 15 April 2008

Watch Out for Zero-Length Matches

Filed under: Regex Trouble — Jan Goyvaerts @ 14:51

A zero-width or zero-length match is a regular expression match that does not match any characters. It matches only a position in the string. E.g. the regex \b matches between the 1 and , in 1,2.

Zero-lenght matches are often an unintended result of mistakenly making everything optional in a regular expression. Such a regular expression will in fact find a zero-length match at every position in the string. My floating point example has long shown this.

Apparently, JavaScript developers have it particularly tough. Different browsers handle zero-length matches differently. Steven Levithan argues that IE has a bug because it increments lastIndex. Steven’s observation is correct. When iterating over /\b/g.exec(), regex.lastIndex = match.index + 1 in Internet Explorer, while in other browsers they’re equal. So who’s got it wrong?

The ECMA-262 v3 standard defines the lastIndex property in 15.10.7.5 as:

The value of the lastIndex property is an integer that specifies the string position at which to start the next match.

It’s easy enough to understand this in the context where the developer sets lastIndex prior to calling exec() to make the match attempt start at a certain position. But how should the exec() method set lastIndex after a successful match?

For String.match() the standard says in 15.5.4.10:

If there is a match with an empty string (in other words, if the value of regexp.lastIndex is left unchanged), increment regexp.lastIndex by 1.

For String.replace() the standard says in 15.5.4.11:

Do the search in the same manner as in String.match(), including the update of searchValue.lastIndex.

But for RegExp.exec() the standard says in 15.10.6.2:

Let e be r’s endIndex value [i.e. the end of the match]. If the global property is true, set lastIndex to e.

The standard contradicts itself. 15.10.6.2 is inconsistent with the three other definitions, in that it omits the +1 in case of a zero-width match.

My opinion though is that, Internet Explorer got it right, and that browsers who implement 15.10.6.2 as written while ignoring the definition in 15.10.7.5 got it wrong. The omission of the lastIndex++ for regex.exec() looks to me as an oversight by the standards writers rather than something they did intentionally. The reason is that every regex engine that I know of works the way Internet Explorer. It’s the only way to avoid an infinite loop, like Firefox does.

If a zero-width match is found, the next match attempt begins one character further ahead in the string. After \b matches between the 1 and , in 1,2, the next match attempt will begin at the position between the , and the 2 (and match there), rather than staying stuck forever.

I do understand where the confusion comes from. The property is called lastIndex, but the standard defines it as something that should be called nextAttempt. lastIndex is not the end of the previous match. The ECMA-262 standard does not provide a property for that. To get that you have to add up match.index and match[0].length yourself.

Here’s my solution to the browser compatibility problem:

while (match = regex.exec(subject)) {
  // Prevent browsers like Firefox from getting stuck in an infinite loop
  if (match.index == regex.lastIndex) regex.lastIndex++;
  // Do whatever you want with the match
  start_of_match = match.index;
  length_of_match = match[0].length;
  first_character_after_match = start_of_match + length_of_match;
}

This code is easy to understand, and only uses one extra line (plus a comment) to work around the browser problems.

These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Digg
  • Sphinn
  • del.icio.us
  • Facebook
  • Mixx
  • Google
  • YahooMyWeb
  • Live
  • StumbleUpon
  • Spurl

Tuesday, 8 April 2008

Unintended Backtracking Can Bite You

Filed under: Regex Trouble — Jan Goyvaerts @ 17:01

Backtracking occurs when the regular expression engine encounters a regex token that does not match the next character in the string. The regex engine will then back up part of what it matched so far, to try different alternatives and/or repetitions. Understanding this process will make all the difference between guessing and understanding why a regular expression matches what it does and doesn’t.

But even when you know all about backtracking, it’s still easy to miss unintended consequences when you don’t test the regex on enough examples of things you don’t want to match. In the tutorial topic about backreferences, I showed the regular expression <([A-Z][A-Z0-9]*)[^>]*>.*?</\1> as a way to match a pair of HTML tags, like <tag>content</tag>. It does this just fine. Can you figure out when this regex will go wrong?

The regex won’t go wrong on properly nested HTML tags. The problem occurs when the tags don’t match. E.g. the regex will match <tagxyz>content</tag>, even though these aren’t matching HTML tags.

At first, ([A-Z][A-Z0-9]*) will store tagxyz into the first capturing group. This eventually causes \1 to fail, as tagxyz doesn’t match tag. But the tireless regex engine doesn’t stop here. Along the way, it has noticed that [A-Z0-9]* has repeated 5 times. That’s five steps the regex engine can backtrack. So it goes back to capturing tagxy into \1. The following [^>]* then matches the z.

Now you might say that deleting [^>]* from the regex will solve the problem. It will, if you don’t want to match HTML tags with attributes like <tag attr=”value”>content</tag>. But we want to allow those, so we need the [^>]* to skip over any attributes the tag might have, or even just some extra whitespace between the tag name and the >.

\1 will fail to match until the engine has backtracked enough so that ([A-Z][A-Z0-9]*) matches tag, and the [^>]* skips the xyz. Then \1 matches tag and a valid match is found for the whole regex.

So how to fix this? There are two solutions.

One solution is to be more precise about what we want. The name of the tag must be a whole word. The [^>]* can’t be allowed to match part of the tag. We can easily specify this by adding a word boundary to the mix: <([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>. The regex engine will still backtrack when \1 fails to match. But each time the engine makes [A-Z0-9]* give up one more character, \b will immediately fail to match.

This is the change that I made to the tutorial topic about backreferences. The word boundary is easy to understand, and supported by all modern regex flavors.

The other solution is to tell the regex engine to forget about backtracking alltogether. The two greedy quantifiers in this regular expression never have a reason to backtrack. ([A-Z][A-Z0-9]*) will match the tag’s name straight away. If it backtrack, it cuts of part of the tag name, which we never want. There’s no point in backtracking [^>]* either. It will eat up everything up to the tag’s closing >. There’s no point in backtracking it, since [^>]* and > are mutually exclusive. If [^>]* gives up a character it previously matched, > could never match it. Some regex engines, like the one in the Just Great Software products, even detect this and automatically disable backtracking for mutually exclusive tokens.

There are two ways to disable backtracking on a quantifier. The easiest one is to use possesive quantifiers. Only a few regex flavors support these. <([A-Z][A-Z0-9]*+)[^>]*+>.*?</\1> uses two possessive quantifiers to match the opening tag. Like before, ([A-Z][A-Z0-9]*+) will capture tagxyz and \1 will fail. The .*? will still expand to look for a matching closing tag. But if it can’t find one, that’s the end of the story.

A few more regular expression flavors support atomic grouping. When the regex engine exists an atomic group, it instantly forgets about all backtracking positions it remembered inside the group. <(?>([A-Z][A-Z0-9]*)[^>]*)>.*?</\1> puts the opening tag in an atomic group. This regex will work in exactly the same way, though the mechanics inside the regex engine is slightly different. (Possessive quantifiers never store backtracking positions. Using atomic grouping, the greedy quantifiers store backtracking positions, which are subsequently thrown away.)

There are still things that the improved regex can’t do, like dealing with nested identical tags and incorrectly nested tags. Obviously, you can’t write a full HTML parser with just one regex. Always test your regex on both good and bad data to see if it does what you want.

These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Digg
  • Sphinn
  • del.icio.us
  • Facebook
  • Mixx
  • Google
  • YahooMyWeb
  • Live
  • StumbleUpon
  • Spurl

Friday, 4 April 2008

Escape Characters Only When Necessary

Filed under: Regex Philosophy — Jan Goyvaerts @ 12:35

A lot of people seem to have a habit of escaping all non-alphanumeric characters that they want to treat as literals in their regular expressions. E.g. to match #1+1=2 they’ll write \#1\+1\=2 instead of #1\+1=2. Though these regexes are equivalent in all modern regex flavors, the extraneous backslashes don’t exactly make the pattern more readable. And when formatted as a C++ string, “\\#1\\+1\\=2″ is definitely a step back from “#1\\+1=2″.

Beyond redability, needlessly escaping characters can also lead to subtle problems. In most flavors, < and \< both match a literal <. But in some flavors, like the GNU flavors, < is a literal and \< is a word boundary.

Similarly, _ and \_ usually simply match _. But the .NET framework treats \_ as an error, just like most modern flavors treat escaped letters that don’t form a regex token, like \j, as an error. This is done to reserve these letters for future expansion. I recommend that you treat non-alphanumerics the same, and escape only metacharacters.

Modern regex flavors have 11 metacharacters outside character classes: the opening square bracket [, the backslash \, the caret ^, the dollar sign $, the period or dot ., the vertical bar or pipe symbol |, the question mark ?, the asterisk or star *, the plus sign +, the opening round bracket ( and the closing round bracket ).

The closing square bracket and the curly braces are indeed not in this list. The closing square bracket is an ordinary character outside character classes. Sometimes I do escape it for readability, e.g. when using a regex like \[[0-9a-f]\] to match [a]. The opening curly brace only needs to be escaped if it would otherwise form a quantifier like {3}. An exception to this rule is Java, which always requires { to be escaped.

Inside character classes, different metacharacters apply. Namely, the caret ^, the hyphen -, the closing bracket ] are the backslash itself are metacharacters. You can actually avoid escaping these, except for the backslash, by positioning them so that their special meaning cannot apply. You can place ] right after the opening bracket, - right before the closing bracket and ^ anywhere except right after the opening bracket. So []^\\-] matches any of the 3 metacharacters inside character classes. Again, one flavor has to deviate from normal practice. The JavaScript standard treats [] as an empty character class. This is not very useful, as it can never match anything. No surprise that the Internet Explorer developers got this wrong, and follow the usual practice of treating ] after [ as a literal. I recommend that you escape the 4 metacharacters inside character classes for maximum compatibility with various flavors, and to make your regex easier to understand by other developers who may be confused by something like []^\\-]. But don’t needlessly add backslashes to a regex like [*/._] which is perfectly fine without.

These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Digg
  • Sphinn
  • del.icio.us
  • Facebook
  • Mixx
  • Google
  • YahooMyWeb
  • Live
  • StumbleUpon
  • Spurl
« Previous PageNext Page »