Regex Guru

Tuesday, 27 May 2008

Writing Offline

Filed under: About Regex Guru — Jan Goyvaerts @ 12:56

Late last year I resolved to start writing more. Setting up this blog and two others was a big part of that. Though my blogs have been a little quiet lately, I’ve actually been doing a lot of writing. But offline.

Whenever I set out to do something, opportunities that weren’t there before pop up everywhere. The opportunities aren’t new, but my attention to them is. I set out to do more writing, and suddenly I’m doing all the writing I can.

Two months ago I was asked if I wanted to co-author a book on regular expressions. I was hesitant at first. I’m very busy already, and writing books isn’t a very lucrative way to spend one’s time. But writing a real book sold in real bookstores has been something I’ve wanted to do since I was very young. I took the opportunity, before books go all digital. The book will be published by the same publisher as the best book on regular expressions to date. I’m confident the project is in good hands.

Unfortunately, writing a book, running a software business, and moving 800 km south do not leave much time for blogging. This blog will likely remain quiet until the book is available for pre-order on Amazon.


Thursday, 8 May 2008

Follow Up with Adequate Testing

Filed under: Regex Trouble — Jan Goyvaerts @ 15:05

I always emphasize the importance of testing your regular expressions on all possible input data, particularly on any input you might get that you don’t want the regex to match.

Writing a regular expression that matches something is often quite straightforward. Making it match all the variations of what you want can be tricky. Excluding all the things you don’t want is often the hard part.

The regex in the Do Follow plugin does not match all HTML anchor tags with a rel="nofollow" attribute. When WordPress encounters something like <a href="url"> in a comment, it changes it into <a href="url" rel="nofollow">. The plugin regex matches this just fine. So this isn’t really a bug in the plugin.

The problem arises if you were to blindly copy-and-paste this regular expression into your own code. A lot of programmers do that, and it’s a mistake. Always thoroughly test any regex you plan to use on both valid and invalid data. Try either the original regex or my improved regex on this: <a rel="nofollow" href="url">. It doesn’t work. But this anchor tag is valid. The order of attributes is irrelevant in HTML and XHTML.

The problem is that the author of the original regex used \s+ to force whitespace to occur after the a and before the rel parts of the match. This means the regex requires at least two spaces between the a and rel, possibly with other spaces and characters between those two spaces. But one space is actually sufficient.
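A quick way to see the failure is to run both attribute orders through the pattern. The sketch below is a Python port of the plugin's regex, not the plugin's own PHP code. Python's re module has no /U flag, so the quantifiers that /U makes lazy are written with an explicit ?. The names ORIGINAL, href_first and rel_first are made up for this illustration.

```python
import re

# Python port of the plugin's regex. Python has no /U flag, so every
# quantifier that /U makes lazy is written with an explicit "?" here.
ORIGINAL = re.compile(r"""
    ( <a\s+ .*? \s+ rel=["'] [a-z0-9\s\-_|\[\]]*? )
    ( \b nofollow \b )
    ( [a-z0-9\s\-_|\[\]]*? ["'] .*? > )
""", re.IGNORECASE | re.DOTALL | re.VERBOSE)

href_first = '<a href="url" rel="nofollow">'  # the order WordPress generates
rel_first = '<a rel="nofollow" href="url">'   # equally valid HTML

print(bool(ORIGINAL.search(href_first)))  # True
print(bool(ORIGINAL.search(rel_first)))   # False: only one space before rel
```

The second tag fails because the single space after the a is consumed by the first \s+, leaving nothing for the second \s+ to match in front of rel.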

An easy solution is to specify in the regex what is really meant. We don’t care if there are any spaces between the a and rel. What we require is that they are complete words. This is better done with word boundaries, like this:

'/
  (
    <a\b
    [^<>]*?
    \b
    rel=["\']
    [a-z0-9\s\-_|[\]]*?
  )
  (
    \b
    nofollow
    \b
  )
  (
    [a-z0-9\s\-_|[\]]*
    ["\']
    [^<>]*
    >
  )
/isx'

Even with this improvement, this is still a regular expression dedicated to a particular job. It will still match <a all this rel="nofollow" nonsense!!!> which is obviously not a valid anchor tag. In the context of stripping rel="nofollow" from comments, we don’t care. The point is, you should never copy-and-paste a regex without being sure of exactly what it does. The only way to be really sure is to test.
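Testing along those lines might look like the following sketch. It is a Python port of the word-boundary regex above, run against a handful of hand-picked samples. The sample strings and the names WORD_BOUNDARY, samples and results are assumptions chosen for illustration, and the nonsense tag is written with an attribute spelled rel because that is what the pattern looks for.

```python
import re

# Python port of the word-boundary version of the regex.
WORD_BOUNDARY = re.compile(r"""
    ( <a\b [^<>]*? \b rel=["'] [a-z0-9\s\-_|\[\]]*? )
    ( \b nofollow \b )
    ( [a-z0-9\s\-_|\[\]]* ["'] [^<>]* > )
""", re.IGNORECASE | re.DOTALL | re.VERBOSE)

samples = [
    '<a href="url" rel="nofollow">',            # attribute order WordPress uses
    '<a rel="nofollow" href="url">',            # reversed order, still valid
    '<a all this rel="nofollow" nonsense!!!>',  # not a sensible tag
    '<a href="url">',                           # no nofollow at all
]

results = [bool(WORD_BOUNDARY.search(s)) for s in samples]
print(results)  # [True, True, True, False]
```

A list like this is worth keeping next to any borrowed regex: the third entry documents a false positive that is acceptable for this job but might not be for yours.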


No Follow The Lazy Dot

Filed under: Regex Trouble — Jan Goyvaerts @ 8:31

.*? is what I call the “lazy dot”. It matches any sequence of characters. It matches as few of them as needed to make the whole regex match. The problem is that if there’s no way for the regex to match, the lazy dot will continue all the way until the end of the line or the end of the subject string (if the dot is allowed to match newlines). If you have two lazy dots in a regex, they’ll both try to expand to match the whole regex, trying every possible permutation between them. That leads to catastrophic backtracking. The HTML file example at the bottom of that page shows how a bunch of lazy dots will get into a fight.
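How far the lazy dot is allowed to run can be seen in a small Python sketch. The two-line input and the pattern are made up for illustration. In Python the DOTALL flag plays the role of /s, letting the dot match newlines.

```python
import re

# Hypothetical two-line input: the anchor on line 1 has no rel attribute.
text = '<a href="one.html">one</a>\n<span rel="nofollow">two</span>'

lazy = r'<a\s.*?rel='

# Without DOTALL the lazy dot gives up at the end of line 1: no match.
print(re.search(lazy, text))  # None

# With DOTALL it keeps expanding across the newline until it finds "rel=".
m = re.search(lazy, text, re.DOTALL)
print(repr(m.group()))  # '<a href="one.html">one</a>\n<span rel='
```

In both cases the dot walks forward one character at a time until it either finds what follows it in the pattern or runs out of text. Being lazy only changes the order in which it tries things, not how far it is willing to go.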

Yesterday I installed the Do Follow plugin on all my blogs. Looking at the plugin’s source, I saw that one regular expression was used to simply strip the rel="nofollow" attributes that WordPress adds. Here it is, formatted as a PHP preg string as it appears in the code:

'/
  (
    <a\s+
    .*
    \s+
    rel=["\']
    [a-z0-9\s\-_\|\[\]]*
  )
  (
    \b
    nofollow
    \b
  )
  (
    [a-z0-9\s\-_\|\[\]]*
    ["\']
    .*
    >
  )
/isUx'

Notice the /U modifier at the end. This is a PHP flag that reverses the meaning of the question mark after a quantifier. Normally, .* is greedy and .*? is lazy. With /U, .* is lazy and .*? is greedy. You could call /U the Uber Lazy mode because it even saves you typing the extra ?.
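For readers who want to see the difference between the two behaviors, here is a minimal Python sketch. Python's re module has no counterpart to /U, so greediness is always chosen per quantifier. The sample text is an assumption picked to make the contrast obvious.

```python
import re

text = '<b>one</b> and <b>two</b>'

# Greedy: .* grabs as much as possible, then backs off to the LAST </b>.
greedy = re.search(r'<b>.*</b>', text).group()

# Lazy: .*? grabs as little as possible, stopping at the FIRST </b>.
lazy = re.search(r'<b>.*?</b>', text).group()

print(greedy)  # <b>one</b> and <b>two</b>
print(lazy)    # <b>one</b>
```

Under /U in PHP the two patterns would trade places: the one written .* would return the short match and the one written .*? the long one.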

The problem with this regex is that when it encounters an HTML anchor that does not have rel="nofollow", the first lazy dot will expand all the way to the end of the HTML code it’s trying to strip the “nofollow” from. That’s very inefficient. (Of course, the whole business of stripping off something that shouldn’t be added in the first place is very inefficient. Suffice to say I’m less and less impressed with WordPress each day.)

Here’s my version:

'/
  (
    <a\s+
    [^<>]*?
    \s+
    rel=["\']
    [a-z0-9\s\-_|[\]]*?
  )
  (
    \b
    nofollow
    \b
  )
  (
    [a-z0-9\s\-_|[\]]*
    ["\']
    [^<>]*
    >
  )
/isx'

I removed the /U flag. I replaced the dot with the far more sensible negated character class [^<>]. Angle brackets can’t occur within an HTML anchor. Coding this small piece of information into the regex is all it takes to make it stop at the closing > of any anchor tag that doesn’t use “nofollow” already. I made the first set of quantifiers lazy, and the second set greedy, to minimize the amount of backtracking needed. But that’s a minor issue. The major savings is to make sure the regex doesn’t needlessly scan through everything that follows an HTML anchor without “nofollow”. Are you following me? :-)
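The effect of the negated character class is easy to observe, even without counting regex steps. The sketch below ports both patterns to Python (explicit ? in place of /U) and runs them on a made-up snippet where the first anchor has no “nofollow” and the second one does. The names lazy_dot, negated and html are illustrative only.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL | re.VERBOSE

# Port of the plugin's regex, with /U laziness spelled out as explicit "?".
lazy_dot = re.compile(r"""
    ( <a\s+ .*? \s+ rel=["'] [a-z0-9\s\-_|\[\]]*? )
    ( \b nofollow \b )
    ( [a-z0-9\s\-_|\[\]]*? ["'] .*? > )
""", FLAGS)

# Port of the version that uses [^<>] instead of the dot.
negated = re.compile(r"""
    ( <a\s+ [^<>]*? \s+ rel=["'] [a-z0-9\s\-_|\[\]]*? )
    ( \b nofollow \b )
    ( [a-z0-9\s\-_|\[\]]* ["'] [^<>]* > )
""", FLAGS)

html = '<a href="a.html">one</a> and <a href="b.html" rel="nofollow">two</a>'

for name, rx in (("lazy dot", lazy_dot), ("negated class", negated)):
    print(name, "->", rx.search(html).group())
# lazy dot -> <a href="a.html">one</a> and <a href="b.html" rel="nofollow">
# negated class -> <a href="b.html" rel="nofollow">
```

The lazy dot starts at the first anchor and runs straight through its closing > until it finds a rel attribute somewhere later. The negated class gives up on the first anchor at its > and matches only the tag that actually carries “nofollow”.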
