Let me repeat: I still don't like the taste of vegetables, er, regular expressions. But despite their bitterness, my mind has grown stronger having eaten so many lately.
So anyway, I recently ran into a problem with back references within the body of regular expression matches. Trust me, I've scoured the best of the documentation websites for regular expressions and still found no clear answer, so I thought I'd share what I found through trial and error.
First, what am I talking about?
Within the body of a regular expression, you can recall a previous match by using "\1" or "\2", one number per parenthesized group (most examples stop at "\9", but JavaScript also accepts "\10" and higher when the expression has that many groups). These will reference the "appropriate grouping to the left of the back reference" (so say the websites that document this feature). This is true, but what does that mean?
A quick example before we complicate things
Here's the most common example, found on nearly every website that explains this feature. Let's say we want to collapse all duplicated words in a paragraph into a single copy. We would do so as follows:
replacing duplicate words
var str = "welcome to the the jungle baby baby";
str = str.replace(/(\w+) \1/g, "$1");
// str === "welcome to the jungle baby"
Notice that \1 recalls what it found in the (parentheses) match. OK great, most of you already know this, good onya.
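One caveat worth a quick sketch: as written, the expression isn't tied to whole words, so "\1" can also match a piece of a longer word. The version below adds "\b" word boundaries. That tweak is my own addition rather than part of the usual example, and the variable names are just for illustration.

```javascript
// The unanchored pattern from above: "\1" may match part of a longer word.
var loose = /(\w+) \1/g;
// Assumed fix (not from the usual example): require word boundaries on
// both sides so only whole repeated words are collapsed.
var strict = /\b(\w+) \1\b/g;

var sample = "this is the the jungle";

// The loose pattern sees "is is" inside "this is" and eats a word.
var looseResult = sample.replace(loose, "$1");
// The strict pattern only collapses "the the".
var strictResult = sample.replace(strict, "$1");

console.log(looseResult);  // "this the jungle"
console.log(strictResult); // "this is the jungle"
```

Whether you need the boundaries depends on your text, but for real paragraphs they are probably what you want.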
Now, let's put it to a confusing test. We'll start with the following string:
hello world
var str = 'hello hello hello world hello world world';
Take a good look at the string, then compare it against the following matches:
various back reference matches
/(((hello) (world)) \1)/
/(((hello) (world)) \2)/
/(((hello) (world)) \3)/
/(((hello) (world)) \4)/
Stop effing with me
In the above four scenarios, what in the world does "appropriate grouping to the left of the back reference" mean? Clearly we have parenthesis madness everywhere (four pairs in total), but through trial and error, I found they simply match the following:
back reference matches
/(((hello) (world)) \x)/
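when "x" is each number, the whole expression matches
1 = "hello world " (note the space at the end: group 1 is still open when "\1" is reached, so "\1" matches nothing)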
2 = "hello world hello world"
3 = "hello world hello"
4 = "hello world world"
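If you'd rather check those four results yourself than take my word for it, here's a rough sketch that runs each expression against the string. The names "patterns", "results" and "groups" are mine, purely for illustration.

```javascript
var str = 'hello hello hello world hello world world';

// One expression per back reference, \1 through \4.
var patterns = [
  /(((hello) (world)) \1)/,
  /(((hello) (world)) \2)/,
  /(((hello) (world)) \3)/,
  /(((hello) (world)) \4)/
];

// Collect the full text each expression matches (or null for no match).
var results = patterns.map(function (re) {
  var m = re.exec(str);
  return m ? m[0] : null;
});

// The captured groups of the "\2" match, in group-number order.
var groups = patterns[1].exec(str).slice(1);

console.log(results);
console.log(groups);
```

For the "\2" expression, "groups" comes back as "hello world hello world", "hello world", "hello", "world", which shows the numbers following the order of the opening parentheses.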
If that bit of text doesn't help, here's a graphic I made to help understand it a little better:

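Therefore, what this tells us is that the groups are numbered from left to right (as mentioned), starting from the outside and working in: each group gets its number from the position of its opening parenthesis. Take special note that if we reference a group before its set of parentheses has been matched, it holds no value yet (in JavaScript, the back reference then matches the empty string). So if we tried to reference "\4" below: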
null reference to non-existing match
/(((hello) \4 (world)))/
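It doesn't work: "\4" matches nothing here, so the expression ends up looking for "hello" and "world" separated by two spaces, which our string doesn't contain. But "\3" would :)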
reference to an existing match
/(((hello) \3 (world)))/
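Here's a quick way to try both expressions. It leans on JavaScript's behaviour that a back reference to a group that hasn't captured anything yet matches the empty string; other regex engines may treat this differently. The variable names are mine.

```javascript
var str = 'hello hello hello world hello world world';

// "\4" points at (world), which hasn't been matched yet when "\4" is
// reached, so it matches the empty string and two spaces are required.
var forward = /(((hello) \4 (world)))/;
// "\3" points at (hello), which has already been matched by then.
var backward = /(((hello) \3 (world)))/;

var forwardInStr = forward.test(str);                // str has no double space
var forwardTwoSpaces = forward.test('hello  world'); // two spaces in between
var backwardMatch = backward.exec(str);

console.log(forwardInStr);     // false
console.log(forwardTwoSpaces); // true
console.log(backwardMatch[0]); // "hello hello world"
```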
The above "\3" would look for "hello", thus making the entire match "hello hello world", which is, in fact, found in our string. And that, my friends, concludes today's oddities in back references for regular expressions. Happy Thanksgiving! Cheers.