i am dustin diaz


Parenthetical back matching in regular expressions

Let me repeat: I still don't like the taste of vegetables, er, regular expressions. But despite their bitterness, my mind has grown stronger having eaten so many lately. So anyway, I recently ran into a problem with back references within the body of regular expression matches. Trust me, I've scoured the ultimate of documentation websites for regular expressions and still found no clear answer, so I thought I'd share what I found through trial and error.

First, what am I talking about?

Within the body of a regular expression, you can recall a previous match by using "\1" or "\2" (and so on, one number per group in the pattern). These will reference the "appropriate grouping to the left of the back reference" (so say the websites that document this feature). This is true; however, what does that mean?
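Before the usual example, here's about the smallest sketch I can think of for recalling a match inside the pattern itself. The variable names and test strings are made up purely for illustration.

```javascript
// \1 has to repeat whatever the first group captured, right where \1 sits.
var doubled = /(\w)\1/;            // any word character, twice in a row

console.log(doubled.test("boosh"));    // true, thanks to "oo"
console.log(doubled.test("dustin"));   // false, no doubled character

// \2 recalls the second group, \1 the first.
var mirrored = /(\w)(\w)\2\1/;     // two characters, then the same two reversed

console.log(mirrored.exec("xabbay")[0]);   // "abba"
```

Note that the back reference matches the captured text, not the sub-pattern: `(\w)\1` wants the same character again, not just any second word character.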

A quick example before we complicate things

What this basically means is best shown with the most common example, the one on nearly every website that explains this feature. Let's say we want to collapse all duplicated words in a paragraph. We would do so as follows:

replacing duplicate words

var str = "welcome to the the jungle baby baby";

str = str.replace(/(\w+) \1/g, "$1");

// str === "welcome to the jungle baby"
Notice that \1 recalls what it found in the (parenthesized) match. OK great, most of you already know this, good onya. Now, let's put it to a confusing test. We'll start with the following string:

hello world

var str = 'hello hello hello world hello world world';
Take a good look at the string, then compare it against the following matches:

various back reference matches

/(((hello) (world)) \1)/

/(((hello) (world)) \2)/

/(((hello) (world)) \3)/

/(((hello) (world)) \4)/

Stop effing with me

In the above four scenarios, what in the world does "appropriate grouping to the left of the back reference" mean? Clearly we have parentheses madness everywhere (four groups in total), but through trial and error, I found that they simply match the following:

back reference matches

/(((hello) (world)) \x)/

x = 1 matches "hello world " (note the space at the end)

x = 2 matches "hello world hello world"

x = 3 matches "hello world hello"

x = 4 matches "hello world world"
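If you'd rather not take my word for it, here's a rough sketch that runs all four patterns against the string. Building the patterns with `new RegExp` and the names `found` and `groups` are my own shortcuts for illustration; writing out the four literals by hand gives the same results.

```javascript
var str = 'hello hello hello world hello world world';

// Build the four patterns from a string ("\\" here is a single "\" in the regex).
var found = [1, 2, 3, 4].map(function (x) {
  var re = new RegExp('(((hello) (world)) \\' + x + ')');
  var m = re.exec(str);
  return m ? m[0] : null;
});

console.log(found);
// found[0] === "hello world "   (\1 points at the group it sits in, so it matches nothing)
// found[1] === "hello world hello world"
// found[2] === "hello world hello"
// found[3] === "hello world world"

// The same groups without any back reference, to see which number holds what.
var groups = /(((hello) (world)))/.exec(str).slice(1);

console.log(groups);
// groups[0] === "hello world"   (outermost group)
// groups[1] === "hello world"
// groups[2] === "hello"
// groups[3] === "world"
```

The second half is the useful bit: each group gets its number from where its opening parenthesis sits, counting from the left.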

If that bit of text doesn't help, here's a graphic I made to help understand it a little better:

Therefore, what this tells us is that the groups are numbered from left to right (as mentioned) by their opening parenthesis, which means from the outside to the inside. Take special note that if we try to reference a group before its set of parentheses has been matched, the reference is a null value: in JavaScript it matches the empty string. So if we tried to reference "\4" below:

null reference to non-existing match

/(((hello) \4 (world)))/
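It doesn't work: "\4" comes up empty, so the pattern is left looking for "hello", two spaces, then "world", which our string doesn't have. But "\3" would :)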

reference to an existing match

/(((hello) \3 (world)))/
The above "\3" would look for "hello", making the entire match "hello hello world", which is, in fact, a found match.

And that, my friends, concludes today's oddities in back matching for regular expressions. Happy Thanksgiving! Cheers.
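For anyone who wants to poke at those last two patterns, here's a quick sketch. The names `early` and `late` are just labels I picked, and the empty-match behavior shown is what JavaScript does; I wouldn't assume other regex engines treat an early reference the same way.

```javascript
var str = 'hello hello hello world hello world world';

// \4 shows up before group 4 has captured anything, so it matches the empty string.
// The pattern then needs "hello", two spaces, "world", which str never contains.
var early = /(((hello) \4 (world)))/;

console.log(early.test(str));              // false
console.log(early.test('hello  world'));   // true (note the two spaces)

// \3 shows up after group 3 has closed, so it recalls "hello".
var late = /(((hello) \3 (world)))/;

console.log(late.exec(str)[0]);            // "hello hello world"
```

So the early reference isn't an error; it quietly matches nothing, which is arguably more confusing.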

this is who i am

Hi, my name is Dustin Diaz and I'm an Engineer @ObviousCorp. Previously @Twitter, @Google, and @Yahoo, author of Strobist® Info, co-author of JavaScript Design Patterns, co-creator of the Ender JavaScript Framework, a Photographer, and an amateur Mixologist. This is my website. Welcome!
