with Imagination: by Dustin Diaz

A JavaScript, CSS, XHTML web log focusing on usability and accessibility by Dustin Diaz

Parenthetical back matching in regular expressions

Thursday, November 27th, 2008

Let me repeat: I still don’t like the taste of vegetables, er, regular expressions. But despite their bitterness, my mind has grown stronger having eaten so many lately.

So anyway, I ran into this problem lately with back references within the body of regular expression matches. Trust me, I’ve scoured the penultimate of documentation websites for regular expressions, and still found no clear answer, so I thought I’d share my findings through trial and error.

First, what am I talking about?

Within the body of a regular expression, you can call back a previous match by using “\1” or “\2” (and so on, one number per capturing group). These will reference the “appropriate grouping to the left of the back reference” (so say the websites that document this feature). This is true, but what does that mean?

A quick example before we complicate things

The most common example, found on nearly every website that explains this feature, goes like this. Let’s say we want to remove all duplicated words from a paragraph. We would do so as follows:

replacing duplicate words

var str = "welcome to the the jungle baby baby";
str = str.replace(/(\w+) \1/g, "$1");
// str === "welcome to the jungle baby"

Notice that \1 recalls whatever the (parenthesized) group matched. Ok great, most already know this, good onya.
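One caveat is worth a quick sketch. The pattern above has no word boundaries, so it can also fire inside a longer word, and it removes only one repeat per match. A slightly more defensive variant might look like the following. It assumes words are separated by single spaces, and the `dedupeWords` name is made up for this illustration.

```javascript
// Hypothetical helper: collapse a run of one repeated word into a single copy.
// Assumes words are separated by exactly one space.
function dedupeWords(text) {
  // \b pins the match to whole words; (?: \1\b)+ soaks up every repeat.
  return text.replace(/\b(\w+)(?: \1\b)+/g, "$1");
}

var naive = "the then".replace(/(\w+) \1/g, "$1");
// naive === "then" (the match reached into the next word)

var careful = dedupeWords("the then");
// careful === "the then"

var cleaned = dedupeWords("welcome to the the the jungle baby baby");
// cleaned === "welcome to the jungle baby"
```

The non-capturing group `(?: ...)` keeps the numbering simple: there is still only one capturing group, so \1 and $1 mean the same thing as before.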

Now, let’s put it to a confusing test. We’ll start with the following string:

hello world

var str = 'hello hello hello world hello world world';

Take a good look at the string, then compare it against the following matches:

various back reference matches

/(((hello) (world)) \1)/
/(((hello) (world)) \2)/
/(((hello) (world)) \3)/
/(((hello) (world)) \4)/

Stop effing with me

In the above four scenarios, what in the world does “appropriate grouping to the left of the back reference” mean? Clearly we have parenthesis madness everywhere (four pairs in total), but through trial and error, I found that they simply match the following:

back reference matches

/(((hello) (world)) \x)/
where "x" is one of the following, the full match is:
1 = "hello world " (note the space at the end)
2 = "hello world hello world"
3 = "hello world hello"
4 = "hello world world"
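
If you would rather check that table than take my word for it, a quick sketch along these lines runs all four patterns against the string. The `fullMatch` helper is made up for this illustration; it just builds the pattern with `new RegExp` and returns the whole match (or null if there is none).

```javascript
var str = 'hello hello hello world hello world world';

// Hypothetical helper: build /(((hello) (world)) \n)/ and return the full match.
function fullMatch(n) {
  var re = new RegExp('(((hello) (world)) \\' + n + ')');
  var m = re.exec(str);
  return m ? m[0] : null;
}

for (var n = 1; n <= 4; n++) {
  console.log('\\' + n + ' = ' + JSON.stringify(fullMatch(n)));
}
// \1 = "hello world "
// \2 = "hello world hello world"
// \3 = "hello world hello"
// \4 = "hello world world"
```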

If that bit of text doesn’t help, here’s a graphic I made to help understand it a little better:

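Therefore, what this tells us is that the groups are numbered from left to right (as mentioned), starting from the outside and working inward. Put another way, each group gets its number from the position of its opening parenthesis. Take special note that if we try to reference a group before its set of parentheses exists, the back reference holds nothing (in JavaScript it matches the empty string). So if we tried to reference “\4” below: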

null reference to non-existing match

/(((hello) \4 (world)))/

It doesn’t work. But “\3” would :)

reference to an existing match

/(((hello) \3 (world)))/

The above “\3” would look for “hello”, thus making the entire match “hello hello world”, which is, in fact, found in our string. And that, my friends, concludes today’s oddities in back matching for regular expressions. Happy Thanksgiving! Cheers.
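
To see the counting rule and the last two patterns in action, here is a small sketch (the variable names are mine). It assumes a JavaScript engine, where a back reference to a group that has not captured anything yet matches the empty string; other regex flavors may treat that case differently.

```javascript
var str = 'hello hello hello world hello world world';

// Groups are numbered by the position of their opening parenthesis.
var groups = /(((hello) (world)))/.exec(str);
// groups[1] === "hello world"   (outermost pair)
// groups[2] === "hello world"
// groups[3] === "hello"
// groups[4] === "world"

// \4 appears before group 4 has captured anything, so in JavaScript it
// matches the empty string and the pattern wants "hello", two spaces, "world".
var early = /(((hello) \4 (world)))/;
var earlyOnStr = early.exec(str);            // null: str has no double space
var earlyOnGap = early.exec('hello  world'); // matches "hello  world"

// \3 refers to a group that is already closed, so it repeats "hello".
var late = /(((hello) \3 (world)))/.exec(str);
// late[0] === "hello hello world"
```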

13 Responses to “Parenthetical back matching in regular expressions”

  1. Charles Christolini

    Good to see you posting again Dustin.

    Nice post.

  2. Luke Smith

    Another good resource for regex help is the #regex room on freenode IRC.

  3. Frode Danielsen

    I could not believe this was not mentioned at http://regular-expressions.info/ seeing as this is such basic knowledge regarding backreferences. So I checked, and:

    “To figure out the number of a particular backreference, scan the regular expression from left to right and count the opening round brackets. The first bracket starts backreference number one, the second number two, etc. Non-capturing parentheses are not counted.” - http://www.regular-expressions.info/brackets.html#usebackrefinregex (2nd paragraph).

    You don’t even need to think about “from outside to the inside”, just count sequentially. And I’d be surprised if this isn’t covered in Mastering Regular Expressions, which I also ordered after your last post on Regexes :)

  4. Caeser

    nice article…clears some confusions regarding back-references…good job…Cheers!

  5. Aristotle Pagaltzis

    Frode beat me to it. You can just count opening parentheses left-to-right, that’s it.

  6. links for 2008-11-28 | Amasijo

    [...] Parenthetical back matching in regular expressions (tags: regexp javascript) [...]

  7. Wesley Walser

    So in the ‘1 = “hello world ” (note the space at the end)’ is the \1 a null reference since the parenthesis hadn’t been closed when it was used?

  8. Dustin Diaz

    Wesley, good question. The answer is no (it’s not null). Since it’s encapsulated within the parenthesis, it counts. Thus, if it is within the opening parenthesis (or after), it can be referenced.

  9. Janie Parrish

    Thanks for the heads up Dustin, hope you had a Happy Thanksgiving!

  10. Val

    um, sorry to be a pedant, but the way you used “penultimate” implies you think it means something like “the most ultimate”; it does not. Rather, penultimate means *second*-to-last; the modifier doesn’t make it *more* ultimate, but *less*. If I misread you, apologies!

    Good analysis on the regexes, but I wonder how you got in such a jam to begin with. Can you show us something like the real-world problem you were solving that required such a nasty combo of nested parentheticals and backreferences?

    Cheers,

  11. Andy Tjin

    The matching actually makes a lot of sense, and works as I would expect. It’s just that the explanation “appropriate grouping to the left of the back reference” is a very very unclear one.

  12. thedropzone

    Want to see Dustiniaz Ajax Contact Form on Steriods with Captcha? Download thedropzone

  13. Mike Samuel

    Don’t blame back-reference weirdness on regular expressions. Blame them on Perl:)

    Back-references introduce a lot of problems by allowing “regular” expressions to match languages that are not regular, and actually make matching NP hard ( http://perl.plover.com/NPC/ ) so should be used sparingly.

    As an example, the following non-regular language
    Terminal = a a
    NonTerminal = b
    |
    is matchable by the perl5 style regex /(a+)b\1/.

All content copyright © 2003 - 2007 under the Creative Commons License.
