Regex Guru

Thursday, 8 May 2008

Follow Up with Adequate Testing

Filed under: Regex Trouble — Jan Goyvaerts @ 15:05

I always emphasize on the importance of testing your regular expressions on all possible input data. Particularly on any input that you might get that you don’t want the regex to match.

Writing a regular expression that matches something is often quite straightforward. Making it match all the variations of what you want can be tricky. Excluding all the things you don’t want is often the hard part.

The regex in the Do Follow plugin does not match all HTML anchor tag with a rel=”nofollow” attribute. When wordpress encounters something like <a href=”url”> in a comment, it changes it into <a href=”url” rel=”nofollow”>. The plugin regex matches this just fine. So this isn’t really a bug in the plugin.

The problem arises if you were to blindly copy-and-paste this regular expression into your own code. A lot of programmers do that, and it’s a mistake. Always thoroughly test any regex you plan to use on both valid and invalid data. Try either the original regex or my improved regex on this: <a rel=”nofollow” href=”url”>. It doesn’t work. But this anchor tag is valid. The order of attributes is irrelevant in HTML and XHTML.

The problem is that the author of the original regex used \s+ to force whitespace to occur after the a and before the rel parts of the match. This means the regex requires at least two spaces between the a and rel, possibly with other spaces and characters between those two spaces. But one space is actually sufficient.

An easy solution is to specify in the regex what is really meant. We don’t care if there are any spaces between the a and rel. What we require is that they are complete words. This is better done with word boundaries, like this:

'/
  (
    <a\b
    [^<>]*?
    \b
    rel=["\']
    [a-z0-9\s\-_|[\]]*?
  )
  (
    \b
    nofollow
    \b
  )
  (
    [a-z0-9\s\-_|[\]]*
    ["\']
    [^<>]*
    >
  )
/isx’

Even with this improvement, this is still a regular expression dedicated to a particular job. It will still match <a all this ref=”nofollow” nonsense!!!> which is obviously not a valid URL. In the context of stripping ref=”nofollow” from comments, we don’t care. The point is, you should never copy-and-paste a regex without being sure of exactly what it does. The only way to be really sure is to test.

These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Digg
  • Sphinn
  • del.icio.us
  • Facebook
  • Mixx
  • Google
  • YahooMyWeb
  • Live
  • StumbleUpon
  • Spurl

No Follow The Lazy Dot

Filed under: Regex Trouble — Jan Goyvaerts @ 8:31

.*? is what I call the “lazy dot”. It matches any sequence of characters. It matches as few of them as needed to make the whole regex match. The problem is that if there’s no way for the regex to match, the lazy dot will continue all the way until the end of the line or the end of the subject string (if the dot is allowed to match newlines). If you have two lazy dots in a regex, they’ll both try expand to match the whole regex, trying every possible permutation between them. That leads to catastrophic backtracking. The HTML file example at the bottom of that page shows how a bunch of lazy dots will get into a fight.

Yesterday I installed the Do Follow plugin on all my blogs. Looking at the plugin’s source, I saw that one regular expression was used to simply strip the ref=”nofollow” attributes that WordPress adds. Here it is, formatted as a PHP preg string as it appears in the code:

'/
  (
    <a\s+
    .*
    \s+
    rel=["\']
    [a-z0-9\s\-_\|\[\]]*
  )
  (
    \b
    nofollow
    \b
  )
  (
    [a-z0-9\s\-_\|\[\]]*
    ["\']
    .*
    >
  )
/isUx’

Notice the /U modifier at the end. This is a PHP flag that reverses the meaning of the question mark after quantifier. Normally, .* is greedy and .*? is lazy. With /U, .* is lazy and .*? is greedy. You could call /U the Uber Lazy mode because it even saves you typing the extra ?.

The problem with this regex is that when it encounters an HTML anchor that does not have ref=”nofollow”, the first lazy dot will expand all the way to the end of the HTML code it’s trying to strip the “nofollow” from. That’s very inefficient. (Of course, the whole business of stripping off something that shouldn’t be added in the first place is very inefficient. Suffice to say I’m less and less impressed with WordPress each day.)

Here’s my version:

'/
  (
    <a\s+
    [^<>]*?
    \s+
    rel=["\']
    [a-z0-9\s\-_|[\]]*?
  )
  (
    \b
    nofollow
    \b
  )
  (
    [a-z0-9\s\-_|[\]]*
    ["\']
    [^<>]*
    >
  )
/isx’

I removed the /U flag. I replaced the dot with the far more sensible negated character class [^<>]. Angle brackets can’t occur within an HTML anchor. Coding this small piece of information into the regex is all it takes to make it stop at the closing > of any anchor tag that doesn’t use “nofollow” already. I made the first set of quantifiers lazy, and the second set greedy, to minimize the amount of backtracking needed. But that’s a minor issue. The major savings is too make sure the regex doesn’t needlessly scan through everything that follows after an HTML anchor without “nofollow”. Are you following me? :-)

These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Digg
  • Sphinn
  • del.icio.us
  • Facebook
  • Mixx
  • Google
  • YahooMyWeb
  • Live
  • StumbleUpon
  • Spurl

Wednesday, 23 April 2008

PCRE Library for MySQL

Filed under: Regex Libraries — Jan Goyvaerts @ 11:24

A RegexBuddy user pointed me to LIB_MYSQLUDF_PREG. This is an open source library of MySQL user functions that imports the PCRE library.

MySQL’s built-in regular expression support uses the POSIX ERE flavor. By todays standards, that flavor offers limited regex functionality. PCRE on the other hand offers all the goodies from Perl and other modern regex flavors.

If you want to work with LIB_MYSQLUDF_PREG, you’ll need to set the regex flavor to PCRE. Use the “PHP preg operator” string style when copying and pasting regular expressions. This will format regex as ‘/regex/’ as required by LIB_MYSQLUDF_PREG.

I haven’t tried to use LIB_MYSQLUDF_PREG myself. I don’t have access to a MySQL server where I can install such libraries.

If you want RegexBuddy to generate source code snippets for LIB_MYSQLUDF_PREG, you can edit the provided MySQL template. Change the regex flavor to PCRE and the string style to PHP/preg. Then edit the functions to use the PREG_* calls instead of MySQL’s built-in operators. Save your custom template, and share it on the RegexBuddy user forum. :-)

These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Digg
  • Sphinn
  • del.icio.us
  • Facebook
  • Mixx
  • Google
  • YahooMyWeb
  • Live
  • StumbleUpon
  • Spurl

Tuesday, 15 April 2008

Watch Out for Zero-Length Matches

Filed under: Regex Trouble — Jan Goyvaerts @ 14:51

A zero-width or zero-length match is a regular expression match that does not match any characters. It matches only a position in the string. E.g. the regex \b matches between the 1 and , in 1,2.

Zero-lenght matches are often an unintended result of mistakenly making everything optional in a regular expression. Such a regular expression will in fact find a zero-length match at every position in the string. My floating point example has long shown this.

Apparently, JavaScript developers have it particularly tough. Different browsers handle zero-length matches differently. Steven Levithan argues that IE has a bug because it increments lastIndex. Steven’s observation is correct. When iterating over /\b/g.exec(), regex.lastIndex = match.index + 1 in Internet Explorer, while in other browsers they’re equal. So who’s got it wrong?

The ECMA-262 v3 standard defines the lastIndex property in 15.10.7.5 as:

The value of the lastIndex property is an integer that specifies the string position at which to start the next match.

It’s easy enough to understand this in the context where the developer sets lastIndex prior to calling exec() to make the match attempt start at a certain position. But how should the exec() method set lastIndex after a successful match?

For String.match() the standard says in 15.5.4.10:

If there is a match with an empty string (in other words, if the value of regexp.lastIndex is left unchanged), increment regexp.lastIndex by 1.

For String.replace() the standard says in 15.5.4.11:

Do the search in the same manner as in String.match(), including the update of searchValue.lastIndex.

But for RegExp.exec() the standard says in 15.10.6.2:

Let e be r’s endIndex value [i.e. the end of the match]. If the global property is true, set lastIndex to e.

The standard contradicts itself. 15.10.6.2 is inconsistent with the three other definitions, in that it omits the +1 in case of a zero-width match.

My opinion though is that, Internet Explorer got it right, and that browsers who implement 15.10.6.2 as written while ignoring the definition in 15.10.7.5 got it wrong. The omission of the lastIndex++ for regex.exec() looks to me as an oversight by the standards writers rather than something they did intentionally. The reason is that every regex engine that I know of works the way Internet Explorer. It’s the only way to avoid an infinite loop, like Firefox does.

If a zero-width match is found, the next match attempt begins one character further ahead in the string. After \b matches between the 1 and , in 1,2, the next match attempt will begin at the position between the , and the 2 (and match there), rather than staying stuck forever.

I do understand where the confusion comes from. The property is called lastIndex, but the standard defines it as something that should be called nextAttempt. lastIndex is not the end of the previous match. The ECMA-262 standard does not provide a property for that. To get that you have to add up match.index and match[0].length yourself.

Here’s my solution to the browser compatibility problem:

while (match = regex.exec(subject)) {
  // Prevent browsers like Firefox from getting stuck in an infinite loop
  if (match.index == regex.lastIndex) regex.lastIndex++;
  // Do whatever you want with the match
  start_of_match = match.index;
  length_of_match = match[0].length;
  first_character_after_match = start_of_match + length_of_match;
}

This code is easy to understand, and only uses one extra line (plus a comment) to work around the browser problems.

These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Digg
  • Sphinn
  • del.icio.us
  • Facebook
  • Mixx
  • Google
  • YahooMyWeb
  • Live
  • StumbleUpon
  • Spurl
Next Page »