<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>Regex Guru</title>
	<atom:link href="http://www.regex-guru.info/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.regex-guru.info</link>
	<description>Irregular updates on the wonderful world of regular expressions by Jan Goyvaerts, developer of premier regular expression software and web sites</description>
	<pubDate>Thu, 08 May 2008 08:06:37 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5.1</generator>
	<language>en</language>
			<item>
		<title>Follow Up with Adequate Testing</title>
		<link>http://www.regex-guru.info/2008/05/follow-up-with-adequate-testing/</link>
		<comments>http://www.regex-guru.info/2008/05/follow-up-with-adequate-testing/#comments</comments>
		<pubDate>Thu, 08 May 2008 08:05:47 +0000</pubDate>
		<dc:creator>Jan Goyvaerts</dc:creator>
		
		<category><![CDATA[Regex Trouble]]></category>

		<guid isPermaLink="false">http://www.regex-guru.info/?p=18</guid>
		<description><![CDATA[The regular expression from the Do Follow plugin is dedicated to a single purpose.  Repurposing it for your own code will expose shortcomings that don't matter for the plugin, but may matter for what you're trying to do.  Never copy-and-paste a regex without testing it.]]></description>
			<content:encoded><![CDATA[<p>I always emphasize the <b>importance of testing</b> your regular expressions <b>on all possible input data</b>, particularly on any input you might get that you don&#8217;t want the regex to match.</p>
<p>Writing a regular expression that matches something is often quite straightforward.  Making it match all the variations of what you want can be tricky.  Excluding all the things you don&#8217;t want is often the hard part.</p>
<p>The <a href="http://www.regex-guru.info/2008/05/no-follow-the-lazy-dot/">regex in the Do Follow plugin</a> does not match all HTML anchor tags with a <tt>rel="nofollow"</tt> attribute.  When WordPress encounters something like <tt class="string">&lt;a href="url"></tt> in a comment, it changes it into <tt class="string">&lt;a href="url" rel="nofollow"></tt>.  The plugin regex matches this just fine, so this isn&#8217;t really a bug in the plugin.</p>
<p>The problem arises if you were to <b>blindly copy-and-paste this regular expression</b> into your own code.  A lot of programmers do that, and it&#8217;s <b>a mistake</b>.  Always thoroughly test any regex you plan to use on both valid and invalid data.  Try either <a href="http://www.regex-guru.info/2008/05/no-follow-the-lazy-dot/">the original regex or my improved regex</a> on this: <tt class="string">&lt;a rel="nofollow" href="url"></tt>.  It doesn&#8217;t work.  But this anchor tag is valid.  The order of attributes is irrelevant in HTML and XHTML.</p>
<p>The problem is that the author of the original regex used <tt class="regex">\s+</tt> to force whitespace to occur after the <tt class="match">a</tt> and before the <tt class="match">rel</tt> parts of the match.  This means the regex requires at least two spaces between the <tt class="match">a</tt> and <tt class="match">rel</tt>, possibly with other spaces and characters between those two spaces.  But one space is actually sufficient.</p>
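<p>To see the failure for yourself, here is a minimal sketch in JavaScript.  The pattern is the improved regex from the earlier post collapsed onto one line, because JavaScript has no free-spacing /x flag.  The variable names are made up for illustration.</p>

```javascript
// One-line JavaScript port of the regex that still uses \s+ on both sides.
// (Assumption: collapsing the free-spacing PHP pattern does not change its meaning.)
const spacedRegex = /(<a\s+[^<>]*?\s+rel=["'][a-z0-9\s\-_|[\]]*?)(\bnofollow\b)([a-z0-9\s\-_|[\]]*["'][^<>]*>)/i;

// The form WordPress generates: href first, rel second.
const hrefFirst = '<a href="url" rel="nofollow">';
// An equally valid tag with the attributes swapped.
const relFirst = '<a rel="nofollow" href="url">';

const matchesHrefFirst = spacedRegex.test(hrefFirst); // true
const matchesRelFirst = spacedRegex.test(relFirst);   // false: only one space between a and rel

console.log(matchesHrefFirst, matchesRelFirst);
```

<p>The second tag fails because the single space after the <tt class="match">a</tt> is consumed by the first <tt class="regex">\s+</tt>, leaving nothing for the second one.</p>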
<p>An easy solution is to specify in the regex what is really meant.  We don&#8217;t care if there are any spaces between the <tt class="match">a</tt> and <tt class="match">rel</tt>.  What we require is that they are complete words.  This is better done with <a href="http://www.regular-expressions.info/wordboundaries.html">word boundaries</a>, like this:</p>
<pre>'/
  (
    &lt;a\b
    [^<>]*?
    \b
    rel=["\']
    [a-z0-9\s\-_|[\]]*?
  )
  (
    \b
    nofollow
    \b
  )
  (
    [a-z0-9\s\-_|[\]]*
    ["\']
    [^<>]*
    >
  )
/isx'</pre>
<p>Even with this improvement, this is still <b>a regular expression dedicated to a particular job</b>.  It will still match <tt class="string">&lt;a all this rel="nofollow" nonsense!!!></tt>, which is obviously not a valid anchor tag.  In the context of stripping rel="nofollow" from comments, we don&#8217;t care.  The point is, you should never copy-and-paste a regex without being sure of exactly what it does.  <b>The only way to be really sure is to test.</b></p>
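<p>A test along these lines could look like the sketch below.  It is a JavaScript port of the word boundary regex written on one line (JavaScript has no /x flag), run against a handful of inputs I picked for illustration.  The list of cases is an assumption about what matters, not a complete test suite.</p>

```javascript
// One-line JavaScript port of the word boundary version of the regex.
const boundaryRegex = /(<a\b[^<>]*?\brel=["'][a-z0-9\s\-_|[\]]*?)(\bnofollow\b)([a-z0-9\s\-_|[\]]*["'][^<>]*>)/i;

// Each case pairs an input with whether we expect the regex to match it.
const cases = [
  ['<a href="url" rel="nofollow">', true],
  ['<a rel="nofollow" href="url">', true],            // attribute order no longer matters
  ['<a rel="external nofollow" href="url">', true],   // nofollow among other rel values
  ['<a href="url">', false],                          // no rel attribute at all
  ['<abbr rel="nofollow">', false],                   // \b after the a rejects other tag names
  ['<a all this rel="nofollow" nonsense!!!>', true],  // invalid HTML, but it still matches
];

// Collect the inputs where the actual result differs from the expectation.
const failures = cases.filter(([input, expected]) => boundaryRegex.test(input) !== expected);

console.log(failures.length === 0 ? 'all cases behave as expected' : failures);
```

<p>The last case is the important one.  Whether that match is acceptable depends entirely on what your own code does with it.</p>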
]]></content:encoded>
			<wfw:commentRss>http://www.regex-guru.info/2008/05/follow-up-with-adequate-testing/feed/</wfw:commentRss>
		</item>
		<item>
		<title>No Follow The Lazy Dot</title>
		<link>http://www.regex-guru.info/2008/05/no-follow-the-lazy-dot/</link>
		<comments>http://www.regex-guru.info/2008/05/no-follow-the-lazy-dot/#comments</comments>
		<pubDate>Thu, 08 May 2008 01:31:03 +0000</pubDate>
		<dc:creator>Jan Goyvaerts</dc:creator>
		
		<category><![CDATA[Regex Trouble]]></category>

		<guid isPermaLink="false">http://www.regex-guru.info/?p=17</guid>
		<description><![CDATA[The popular Do Follow WordPress plugin uses a rather inefficient regular expression for its job.  Here's how to improve it.]]></description>
			<content:encoded><![CDATA[<p><tt class="regex">.*?</tt> is what I call the &#8220;lazy dot&#8221;.  It matches any sequence of characters.  It matches as few of them as needed to make the whole regex match.  The problem is that if there&#8217;s no way for the regex to match, the lazy dot will continue all the way until the end of the line or the end of the subject string (if the dot is allowed to match newlines).  If you have two lazy dots in a regex, they&#8217;ll both try to expand to make the whole regex match, trying every possible combination between them.  That leads to <a href="http://www.regular-expressions.info/catastrophic.html">catastrophic backtracking</a>.  The HTML file example at the bottom of that page shows how a bunch of lazy dots will get into a fight.</p>
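<p>If the difference between the lazy and the greedy dot is new to you, this small JavaScript sketch shows it on a string where the regex can match.  The subject string and variable names are only examples.</p>

```javascript
const lazySubject = 'axbxb';

// The lazy dot stops at the first b that lets the whole regex match.
const lazyMatch = lazySubject.match(/a.*?b/)[0];   // "axb"

// The greedy dot runs to the end first and backs off to the last b.
const greedyMatch = lazySubject.match(/a.*b/)[0];  // "axbxb"

console.log(lazyMatch, greedyMatch);
```

<p>When no match is possible at all, both variants end up trying every position up to the end of the subject.  That is where the wasted effort comes from.</p>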
<p>Yesterday <a href="http://www.micro-isv.asia/2008/05/u-comment-i-follow/">I installed the Do Follow plugin</a> on all my blogs.  Looking at the plugin&#8217;s source, I saw that one regular expression was used to simply strip the rel="nofollow" attributes that WordPress adds.  Here it is, formatted as a PHP preg string as it appears in the code:</p>
<pre>'/
  (
    &lt;a\s+
    .*
    \s+
    rel=["\']
    [a-z0-9\s\-_\|\[\]]*
  )
  (
    \b
    nofollow
    \b
  )
  (
    [a-z0-9\s\-_\|\[\]]*
    ["\']
    .*
    >
  )
/isUx'</pre>
<p>Notice the /U modifier at the end.  This is a PHP flag that reverses the meaning of the question mark after a quantifier.  Normally, <tt class="regex">.*</tt> is greedy and <tt class="regex">.*?</tt> is lazy.  With /U, <tt class="regex">.*</tt> is lazy and <tt class="regex">.*?</tt> is greedy.  You could call /U the Uber Lazy mode because it even saves you typing the extra <tt>?</tt>.</p>
<p>The problem with this regex is that when it encounters an HTML anchor that does not have rel="nofollow", the first lazy dot will expand all the way to the end of the HTML code it&#8217;s trying to strip the &#8220;nofollow&#8221; from.  That&#8217;s very inefficient.  (Of course, the whole business of stripping off something that shouldn&#8217;t be added in the first place is very inefficient.  Suffice to say I&#8217;m less and less impressed with WordPress each day.)</p>
<p>Here&#8217;s my version:</p>
<pre>'/
  (
    &lt;a\s+
    [^<>]*?
    \s+
    rel=["\']
    [a-z0-9\s\-_|[\]]*?
  )
  (
    \b
    nofollow
    \b
  )
  (
    [a-z0-9\s\-_|[\]]*
    ["\']
    [^<>]*
    >
  )
/isx'</pre>
<p>I removed the /U flag.  I replaced the dot with the far more sensible negated character class <tt class="regex">[^<>]</tt>.  Angle brackets can&#8217;t occur within an HTML anchor tag.  Coding this small piece of information into the regex is all it takes to make it stop at the closing > of any anchor tag that doesn&#8217;t use &#8220;nofollow&#8221; already.  I made the first set of quantifiers lazy, and the second set greedy, to minimize the amount of backtracking needed.  But that&#8217;s a minor issue.  The major saving is making sure the regex doesn&#8217;t needlessly scan through everything that follows an HTML anchor without &#8220;nofollow&#8221;.  Are you following me? <img src='http://www.regex-guru.info/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /></p>
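<p>Here is a rough JavaScript sketch of what the negated character class buys you.  To keep it short it only uses the first part of each regex, up to the opening quote of the rel attribute, so these are simplified stand-ins rather than the full patterns.  The HTML snippet is invented for the test.</p>

```javascript
// Simplified prefix of the plugin's regex: lazy dot, dot matches newlines.
const dotPrefix = /<a\s+.*?\s+rel=["']/is;

// Simplified prefix of my version: the negated class cannot cross a tag boundary.
const classPrefix = /<a\s+[^<>]*?\s+rel=["']/i;

// An anchor without rel, followed by a different tag that has one.
const html = '<a href="x">link</a> <span rel="nofollow">text</span>';

// The lazy dot runs out of the anchor and into the span.
const dotResult = dotPrefix.exec(html)[0];
// '<a href="x">link</a> <span rel="'

// The negated class gives up at the closing > of the anchor.
const classResult = classPrefix.test(html); // false

console.log(dotResult, classResult);
```

<p>In this sketch the lazy dot doesn&#8217;t just waste time.  It finds a match that starts in one tag and ends in another.</p>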
]]></content:encoded>
			<wfw:commentRss>http://www.regex-guru.info/2008/05/no-follow-the-lazy-dot/feed/</wfw:commentRss>
		</item>
		<item>
		<title>PCRE Library for MySQL</title>
		<link>http://www.regex-guru.info/2008/04/pcre-library-for-mysql/</link>
		<comments>http://www.regex-guru.info/2008/04/pcre-library-for-mysql/#comments</comments>
		<pubDate>Wed, 23 Apr 2008 04:24:44 +0000</pubDate>
		<dc:creator>Jan Goyvaerts</dc:creator>
		
		<category><![CDATA[Regex Libraries]]></category>

		<guid isPermaLink="false">http://www.regex-guru.info/2008/04/pcre-library-for-mysql/</guid>
		<description><![CDATA[A RegexBuddy user pointed me to LIB_MYSQLUDF_PREG.  This is an open source library of MySQL user functions that imports the PCRE library.
MySQL&#8217;s built-in regular expression support uses the POSIX ERE flavor.  By today&#8217;s standards, that flavor offers limited regex functionality.  PCRE on the other hand offers all the goodies from Perl and [...]]]></description>
			<content:encoded><![CDATA[<p>A RegexBuddy user pointed me to <a href="http://mysqludf.com/lib_mysqludf_preg/">LIB_MYSQLUDF_PREG</a>.  This is an open source library of MySQL user functions that imports the <a href="http://www.regular-expressions.info/pcre.html">PCRE library</a>.</p>
<p>MySQL&#8217;s built-in regular expression support uses the <a href="http://www.regular-expressions.info/posix.html#ere">POSIX ERE flavor</a>.  By today&#8217;s standards, that flavor offers limited regex functionality.  PCRE on the other hand offers all the goodies from Perl and other modern regex flavors.</p>
<p>If you want to work with LIB_MYSQLUDF_PREG in RegexBuddy, you&#8217;ll need to set the regex flavor to PCRE.  Use the &#8220;PHP preg operator&#8221; string style when copying and pasting regular expressions.  This will format <tt class=regex>regex</tt> as <tt>&#8216;/regex/&#8217;</tt> as required by LIB_MYSQLUDF_PREG.</p>
<p>I haven&#8217;t tried to use LIB_MYSQLUDF_PREG myself.  I don&#8217;t have access to a MySQL server where I can install such libraries.</p>
<p>If you want RegexBuddy to generate source code snippets for LIB_MYSQLUDF_PREG, you can edit the provided MySQL template.  Change the regex flavor to PCRE and the string style to PHP/preg.  Then edit the functions to use the PREG_* calls instead of MySQL&#8217;s built-in operators.  Save your custom template, and share it on the RegexBuddy user forum. <img src='http://www.regex-guru.info/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.regex-guru.info/2008/04/pcre-library-for-mysql/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Watch Out for Zero-Length Matches</title>
		<link>http://www.regex-guru.info/2008/04/watch-out-for-zero-length-matches/</link>
		<comments>http://www.regex-guru.info/2008/04/watch-out-for-zero-length-matches/#comments</comments>
		<pubDate>Tue, 15 Apr 2008 07:51:34 +0000</pubDate>
		<dc:creator>Jan Goyvaerts</dc:creator>
		
		<category><![CDATA[Regex Trouble]]></category>

		<guid isPermaLink="false">http://www.regex-guru.info/2008/04/watch-out-for-zero-length-matches/</guid>
		<description><![CDATA[Zero-length matches are often an unintended result of mistakenly making everything optional in a regular expression.  Sometimes they can be useful.  In browsers like Firefox, zero-length matches can cause your JavaScript code to loop forever on regex.exec().]]></description>
			<content:encoded><![CDATA[<p>A zero-width or zero-length match is a regular expression match that does not match any characters.  It matches only a position in the string.  E.g. the regex <tt class=regex>\b</tt> matches between the <tt class=string>1</tt> and <tt class=string>,</tt> in <tt class=string>1,2</tt>.</p>
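<p>A quick way to see these positions is to list them.  This sketch uses JavaScript&#8217;s <tt>String.matchAll()</tt>, which steps past zero-length matches on its own.  The variable names are only for illustration.</p>

```javascript
// Every word boundary in "1,2": start, after the 1, before the 2, and end.
const boundaryMatches = Array.from('1,2'.matchAll(/\b/g));

const boundaryPositions = boundaryMatches.map((m) => m.index);    // [0, 1, 2, 3]
const boundaryLengths = boundaryMatches.map((m) => m[0].length);  // [0, 0, 0, 0]

console.log(boundaryPositions, boundaryLengths);
```

<p>Position 1 is the match between the <tt class=string>1</tt> and the <tt class=string>,</tt> mentioned above.  Every match has length zero.</p>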
<p>Zero-length matches are often an unintended result of mistakenly making everything optional in a regular expression.  Such a regular expression will in fact find a zero-length match at every position in the string.  My <a href="http://www.regular-expressions.info/floatingpoint.html">floating point example</a> has long shown this.</p>
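<p>As a sketch of the problem, take a number regex in which every token is optional, in the spirit of the floating point example.  The subject string is made up, and the second regex is just one possible way to tighten the first.</p>

```javascript
// Every token is optional, so the regex can match nothing at all.
const allOptional = /[-+]?[0-9]*\.?[0-9]*/g;

// Requiring at least one digit at the end rules out the empty match.
const digitRequired = /[-+]?[0-9]*\.?[0-9]+/g;

const numberSubject = 'a1.5b';

const optionalMatches = Array.from(numberSubject.matchAll(allOptional), (m) => m[0]);
// ['', '1.5', '', '']: empty matches at the a, at the b, and at the end

const requiredMatches = Array.from(numberSubject.matchAll(digitRequired), (m) => m[0]);
// ['1.5']

console.log(optionalMatches, requiredMatches);
```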
<p>Apparently, <b>JavaScript</b> developers have it particularly tough.  Different browsers handle zero-length matches differently.  Steven Levithan argues that <a href="http://blog.stevenlevithan.com/archives/exec-bugs">IE has a bug because it increments lastIndex</a>.  Steven&#8217;s observation is correct.  When iterating over <tt>/\b/g.exec()</tt>, regex.lastIndex = match.index + 1 in Internet Explorer, while in other browsers they&#8217;re equal.  So who&#8217;s got it wrong?</p>
<p>The ECMA-262 v3 standard defines the <tt><b>lastIndex</b></tt> property in 15.10.7.5 as:</p>
<blockquote><p>The value of the lastIndex property is an integer that specifies the string position at which to start the next match.</p></blockquote>
<p>It&#8217;s easy enough to understand this in the context where the developer sets <tt>lastIndex</tt> prior to calling <tt>exec()</tt> to make the match attempt start at a certain position.  But how should the <tt>exec()</tt> method set <tt>lastIndex</tt> after a successful match?</p>
<p>For <tt>String.match()</tt> the standard says in 15.5.4.10:</p>
<blockquote><p>If there is a match with an empty string (in other words, if the value of regexp.lastIndex is left unchanged), increment regexp.lastIndex by 1.</p></blockquote>
<p>For <tt>String.replace()</tt> the standard says in 15.5.4.11:</p>
<blockquote><p>Do the search in the same manner as in String.match(), including the update of searchValue.lastIndex.</p></blockquote>
<p>But for <tt>RegExp.exec()</tt> the standard says in 15.10.6.2:</p>
<blockquote><p>Let e be r&#8217;s endIndex value [i.e. the end of the match].  If the global property is true, set lastIndex to e.</p></blockquote>
<p>The standard contradicts itself.  15.10.6.2 is inconsistent with the three other definitions, in that it omits the +1 in case of a zero-width match.</p>
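<p>You can observe the <tt>String.match()</tt> and <tt>String.replace()</tt> behavior directly.  In this small sketch both methods walk through all the zero-length matches of a global regex and then stop.  The marker character is arbitrary.</p>

```javascript
// replace() with a global regex visits each word boundary exactly once.
const marked = '1,2'.replace(/\b/g, '|');           // "|1|,|2|"

// match() with a global regex returns one empty string per boundary.
const boundaryCount = '1,2'.match(/\b/g).length;    // 4

console.log(marked, boundaryCount);
```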
<p>My opinion, though, is that Internet Explorer got it right, and that browsers that implement 15.10.6.2 as written while ignoring the definition in 15.10.7.5 got it wrong.  The omission of the <tt>lastIndex++</tt> for <tt>regex.exec()</tt> looks to me like an oversight by the standards writers rather than something they did intentionally.  The reason is that every regex engine that I know of works the way Internet Explorer does.  It&#8217;s the only way to avoid an <b>infinite loop</b> like the one Firefox gets stuck in.</p>
<p>If a zero-width match is found, the next match attempt begins one character further ahead in the string.  After <tt class=regex>\b</tt> matches between the <tt class=string>1</tt> and <tt class=string>,</tt> in <tt class=string>1,2</tt>, the next match attempt will begin at the position between the <tt class=string>,</tt> and the <tt class=string>2</tt> (and match there), rather than staying stuck forever.</p>
<p>I do understand where the confusion comes from.  The property is called <tt>lastIndex</tt>, but the standard defines it as something that should be called <tt>nextAttempt</tt>.  <b><tt>lastIndex</tt> is not the end of the previous match</b>.  The ECMA-262 standard does not provide a property for that.  To get that you have to add up <tt>match.index</tt> and <tt>match[0].length</tt> yourself.</p>
<p>Here&#8217;s my solution to the browser compatibility problem:</p>
<pre lang="javascript">var match;
while ((match = regex.exec(subject)) !== null) {
  // Prevent browsers like Firefox from getting stuck in an infinite loop
  if (match.index == regex.lastIndex) regex.lastIndex++;
  // Do whatever you want with the match
  var start_of_match = match.index;
  var length_of_match = match[0].length;
  var first_character_after_match = start_of_match + length_of_match;
}</pre>
<p>This code is easy to understand, and only uses one extra line (plus a comment) to work around the browser problems.</p>
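<p>If you want something you can paste and run, the same loop can be wrapped in a function.  This is only a sketch: <tt>findAllMatches</tt> is a hypothetical helper name, not part of any library, and it assumes the regex has the g flag, since without it <tt>exec()</tt> never advances.</p>

```javascript
// Hypothetical helper: collects every match of a global regex,
// applying the lastIndex guard from the loop above.
function findAllMatches(regex, subject) {
  if (!regex.global) {
    throw new Error('findAllMatches needs a regex with the g flag');
  }
  const results = [];
  regex.lastIndex = 0; // always start at the beginning of the subject
  let match;
  while ((match = regex.exec(subject)) !== null) {
    // Step past a zero-length match so the next attempt starts one character further
    if (match.index === regex.lastIndex) regex.lastIndex++;
    results.push({
      start: match.index,
      end: match.index + match[0].length, // first character after the match
      text: match[0],
    });
  }
  return results;
}

// Zero-length matches: the loop terminates and reports each boundary once.
const foundBoundaries = findAllMatches(/\b/g, '1,2').map((m) => m.start); // [0, 1, 2, 3]

// Ordinary matches are unaffected by the guard.
const foundNumbers = findAllMatches(/\d+/g, 'a12b345');

console.log(foundBoundaries, foundNumbers);
```

<p>In an engine that already increments <tt>lastIndex</tt> after a zero-length match, the guard simply never fires, so the function should return the same results either way.</p>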
]]></content:encoded>
			<wfw:commentRss>http://www.regex-guru.info/2008/04/watch-out-for-zero-length-matches/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
