Parenthetical back matching in regular expressions
Let me repeat: I still don’t like the taste of vegetables, er, regular expressions. But despite their bitterness, my mind has grown stronger having eaten so many lately.
So anyway, I ran into this problem lately with back references within the body of regular expression matches. Trust me, I’ve scoured the penultimate of documentation websites for regular expressions, and still found no clear answer, so I thought I’d share my findings through trial and error.
First, what am I talking about?
Within the body of a regular expression, you can recall a previous match by using “\1” or “\2” (and so on, one number per capturing group; in JavaScript, higher numbers such as “\10” also work when the pattern has that many groups). These will reference the “appropriate grouping to the left of the back reference” (so say the websites that document this feature). This is true, but what does that mean?
A quick example before we complicate things
Here is the most common example, the one nearly every website uses to explain this feature. Let’s say we want to collapse every duplicated word in a paragraph down to a single copy. We would do so as follows:
replacing duplicate words
var str = "welcome to the the jungle baby baby";
str = str.replace(/(\w+) \1/g, "$1");
// str === "welcome to the jungle baby"
Notice that \1 recalls whatever the (parenthesized) group matched. OK, great, most of you already know this, good on ya.
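One caveat about that pattern: nothing anchors it to word boundaries, so it can also fire on a repeat that is only part of a longer word. Here is a minimal sketch of a stricter variant, assuming that a `\b` on each side is the behavior you want. The `dedupe` helper name is mine, purely for illustration.

```javascript
// Hypothetical helper (not from any library): collapse doubled words,
// but only when both copies are whole words.
function dedupe(text) {
  return text.replace(/\b(\w+) \1\b/g, "$1");
}

// The unanchored pattern also matches the "is is" hiding inside "this is"
// and the "the the" hiding inside "the then".
const loose = "this is the then".replace(/(\w+) \1/g, "$1");

// The anchored version leaves that sentence alone...
const strict = dedupe("this is the then");

// ...and still cleans up the real duplicates.
const fixed = dedupe("welcome to the the jungle baby baby");

console.log(loose);  // "this then"
console.log(strict); // "this is the then"
console.log(fixed);  // "welcome to the jungle baby"
```

Whether you want the boundaries depends on your text; for a quick cleanup of obviously doubled words, the shorter pattern is often good enough.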
Now, let’s put it to a confusing test. We’ll start with the following string:
hello world
var str = 'hello hello hello world hello world world';
Take a good look at the string, then compare it against the following matches:
various back reference matches
/(((hello) (world)) \1)/
/(((hello) (world)) \2)/
/(((hello) (world)) \3)/
/(((hello) (world)) \4)/
Stop effing with me
In the above four scenarios, what in the world does “appropriate grouping to the left of the back reference” mean? Clearly we have parenthesis madness everywhere (four pairs in total), but trial and error shows that the patterns simply match the following:
back reference matches
/(((hello) (world)) \x)/
value of "x" = what the whole pattern matches
1 = "hello world " (note the space at the end)
2 = "hello world hello world"
3 = "hello world hello"
4 = "hello world world"
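If you would rather check those four results yourself than take my word for it, here is a small sketch that runs each pattern against the same string. The variable names (`patterns`, `found`, `groups`) are just for this example.

```javascript
const str = 'hello hello hello world hello world world';

// The four patterns from above, keyed by the back reference they use.
const patterns = {
  1: /(((hello) (world)) \1)/,
  2: /(((hello) (world)) \2)/,
  3: /(((hello) (world)) \3)/,
  4: /(((hello) (world)) \4)/,
};

const found = {};
for (const [x, re] of Object.entries(patterns)) {
  const m = str.match(re);
  found[x] = m[0]; // m[0] is the whole match
  console.log(`\\${x} -> ${JSON.stringify(m[0])}`);
}
// \1 -> "hello world "   (\1 sits inside group 1 itself, so it adds nothing after the space)
// \2 -> "hello world hello world"
// \3 -> "hello world hello"
// \4 -> "hello world world"

// The numbered groups of the \2 pattern, in order of their opening parenthesis.
const groups = str.match(patterns[2]).slice(1);
console.log(groups);
// [ 'hello world hello world', 'hello world', 'hello', 'world' ]
```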
If that bit of text doesn’t help, here’s a graphic I made to help understand it a little better:

Therefore, what this tells us is that the numbering works from left to right (as mentioned), starting from outside to the inside. Take special note that if we try to reference a group before its set of parentheses has appeared, it is a null value: there is nothing captured to recall yet. So if we tried to reference “\4” below:
null reference to non-existing match
/(((hello) \4 (world)))/
It doesn’t work: “\4” has nothing to recall yet, so the pattern finds no match in our string. But “\3” would :)
valid reference to an existing match
/(((hello) \3 (world)))/
The above “\3” would look for “hello”, thus making the entire match “hello hello world”, which is, in fact, found in our string. And that, my friends, concludes today’s oddities in back matching for regular expressions. Happy Thanksgiving! Cheers.
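For anyone who wants to poke at those last two patterns, here is a quick sketch. As far as I understand the ECMAScript rules, a back reference to a group that has not captured anything yet does not throw; it matches the empty string. That is why the “\4” pattern only matches when nothing sits between the two spaces. The variable names are just for this example.

```javascript
const str = 'hello hello hello world hello world world';

// \4 appears before group 4 has had a chance to capture anything.
const forward = /(((hello) \4 (world)))/;
// \3 appears after group 3 has already closed.
const backward = /(((hello) \3 (world)))/;

// No match in our string...
const forwardOnPost = forward.test(str);
// ...but it does match "hello", two spaces, "world", because \4 contributes nothing.
const forwardOnGap = forward.test('hello  world');

const backwardMatch = str.match(backward);

console.log(forwardOnPost, forwardOnGap);            // false true
console.log(backwardMatch[0], backwardMatch.index);  // hello hello world 6
```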




November 27th, 2008 at 7:17 pm
Good to see you posting again Dustin.
Nice post.
November 27th, 2008 at 7:59 pm
Another good resource for regex help is the #regex room on freenode IRC.
November 27th, 2008 at 8:18 pm
I could not believe this was not mentioned at http://regular-expressions.info/ seeing as this is such basic knowledge regarding backreferences. So I checked, and:
“To figure out the number of a particular backreference, scan the regular expression from left to right and count the opening round brackets. The first bracket starts backreference number one, the second number two, etc. Non-capturing parentheses are not counted.” - http://www.regular-expressions.info/brackets.html#usebackrefinregex (2nd paragraph).
You don’t even need to think about “from outside to the inside”, just count sequentially. And I’d be surprised if this isn’t covered in Mastering Regular Expressions, which I also ordered after your last post on Regexes :)
November 27th, 2008 at 9:34 pm
nice article…clears some confusions regarding back-references…good job…Cheers!
November 28th, 2008 at 6:07 am
Frode beat me to it. You can just count opening parentheses left-to-right, that’s it.
November 28th, 2008 at 12:42 pm
So in the ‘1 = “hello world ” (note the space at the end)’ is the \1 a null reference since the parenthesis hadn’t been closed when it was used?
November 28th, 2008 at 2:53 pm
Wesley, good question. In effect, yes: at that point group 1 is still open, so “\1” has nothing captured yet and matches an empty string, which is why that match ends at the trailing space. The group itself still counts in the numbering, since its opening parenthesis has already gone by.
November 29th, 2008 at 12:17 am
Thanks for the heads up Dustin, hope you had a Happy Thanksgiving!
November 30th, 2008 at 8:20 pm
um, sorry to be a pedant, but the way you used “penultimate” implies you think it means something like “the most ultimate”; it does not. Rather, penultimate means *second*-to-last; the modifier doesn’t make it *more* ultimate, but *less*. If I misread you, apologies!
Good analysis on the regexes, but I wonder how you got in such a jam to begin with. Can you show us something like the real-world problem you were solving that required such a nasty combo of nested parentheticals and backreferences?
Cheers,
December 1st, 2008 at 12:54 am
The matching actually makes a lot of sense, and works as I would expect. It’s just that the explanation “appropriate grouping to the left of the back reference” is a very very unclear one.
December 8th, 2008 at 2:36 pm
Don’t blame back-reference weirdness on regular expressions. Blame them on Perl:)
Back-references introduce a lot of problems by allowing “regular” expressions to match languages that are not regular, and they actually make matching NP-hard ( http://perl.plover.com/NPC/ ), so they should be used sparingly.
As an example, the following non-regular language
Terminal = a <NonTerminal> a
NonTerminal = b
            | <Terminal>
is matchable by the Perl 5-style regex /(a+)b\1/.