Regular expressions: lookaround assertions by example • Deep JavaScript

16 Regular expressions: lookaround assertions by example

In this chapter we use examples to explore lookaround assertions in regular expressions. A lookaround assertion is non-capturing and must match (or not match) what comes before (or after) the current location in the input string.

16.1 Cheat sheet: lookaround assertions

Table 4: Overview of available lookaround assertions.
Pattern	Name	Since
`(?=«pattern»)`	Positive lookahead	ES3
`(?!«pattern»)`	Negative lookahead	ES3
`(?<=«pattern»)`	Positive lookbehind	ES2018
`(?<!«pattern»)`	Negative lookbehind	ES2018

There are four lookaround assertions (tbl. 4):

Lookahead assertions (ECMAScript 3):
- Positive lookahead: (?=«pattern») matches if pattern matches what comes after the current location in the input string.
- Negative lookahead: (?!«pattern») matches if pattern does not match what comes after the current location in the input string.
Lookbehind assertions (ECMAScript 2018):
- Positive lookbehind: (?<=«pattern») matches if pattern matches what comes before the current location in the input string.
- Negative lookbehind: (?<!«pattern») matches if pattern does not match what comes before the current location in the input string.
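
To make the four definitions concrete, the following sketch applies each assertion to the same input. The input string and the variable names are made up for illustration; they are not taken from the rest of the chapter.

```javascript
const input = 'a1 b2 c 3';

// Positive lookahead: a letter that is followed by a digit
const beforeDigit = input.match(/[a-z](?=\d)/g);
console.log(beforeDigit); // [ 'a', 'b' ]

// Negative lookahead: a letter that is not followed by a digit
const notBeforeDigit = input.match(/[a-z](?!\d)/g);
console.log(notBeforeDigit); // [ 'c' ]

// Positive lookbehind: a digit that is preceded by a letter
const afterLetter = input.match(/(?<=[a-z])\d/g);
console.log(afterLetter); // [ '1', '2' ]

// Negative lookbehind: a digit that is not preceded by a letter
const notAfterLetter = input.match(/(?<![a-z])\d/g);
console.log(notAfterLetter); // [ '3' ]
```

In all four cases, the character that the assertion checks is not part of the match.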

16.2 Warnings for this chapter

The examples show what can be achieved via lookaround assertions. However, regular expressions aren’t always the best solution. Another technique, such as proper parsing, may be a better choice.
Lookbehind assertions are a relatively new feature that may not be supported by all JavaScript engines you are targeting.
Lookaround assertions may affect performance negatively, especially if their patterns match long strings.
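
If lookbehind support is a concern, one possible way to check for it at runtime is sketched below. The function name supportsLookbehind is hypothetical. The sketch assumes that an engine without lookbehind support throws a SyntaxError when such a pattern is compiled, which is why the pattern is passed to the RegExp constructor as a string instead of being written as a literal.

```javascript
// Hypothetical helper: reports whether this engine can compile
// a regular expression that contains a lookbehind assertion.
function supportsLookbehind() {
  try {
    // Compiling from a string delays the syntax check until runtime.
    // A regular expression literal would make the whole file fail to parse.
    new RegExp('(?<=a)b');
    return true;
  } catch (err) {
    return false;
  }
}

console.log(supportsLookbehind());
```

Code that depends on lookbehind could use such a check to switch to a fallback, for example one that is based on capture groups.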

16.3 Example: Specifying what comes before or after a match (positive lookaround)

In the following interaction, we extract quoted words:

> 'how "are" "you" doing'.match(/(?<=")[a-z]+(?=")/g)
[ 'are', 'you' ]

Two lookaround assertions help us here:

(?<=") “must be preceded by a quote”
(?=") “must be followed by a quote”

Lookaround assertions are especially convenient for .match() in /g mode, which returns whole matches (capture group 0). Whatever the pattern of a lookaround assertion matches is not captured. Without lookaround assertions, the quotes show up in the result:

> 'how "are" "you" doing'.match(/"([a-z]+)"/g)
[ '"are"', '"you"' ]
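
For comparison, here is a sketch of how the same result can be obtained without lookaround assertions: .matchAll() gives us access to capture group 1 of each match. The variable names are made up for illustration.

```javascript
const text = 'how "are" "you" doing';

// Each match object contains the whole match at index 0
// and the text between the quotes at index 1.
const quotedWords = Array.from(
  text.matchAll(/"([a-z]+)"/g),
  (match) => match[1]
);
console.log(quotedWords); // [ 'are', 'you' ]
```

This works on engines that do not support lookbehind assertions, at the cost of an extra mapping step.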

16.4 Example: Specifying what does not come before or after a match (negative lookaround)

How can we achieve the opposite of what we did in the previous section and extract all unquoted words from a string?

Input: 'how "are" "you" doing'
Output: ['how', 'doing']

Our first attempt is to simply convert positive lookaround assertions to negative lookaround assertions. Alas, that fails:

> 'how "are" "you" doing'.match(/(?<!")[a-z]+(?!")/g)
[ 'how', 'r', 'o', 'doing' ]

The problem is that we extract sequences of characters that are not bracketed by quotes. That means that in the string '"are"', the “r” in the middle is considered unquoted, because it is preceded by an “a” and followed by an “e”.
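
To see where the unwanted matches come from, we can list each match together with its index. This is only a diagnostic sketch; the variable names are made up for illustration.

```javascript
const text = 'how "are" "you" doing';

// Collect pairs [index, matched text] for the failed first attempt
const found = Array.from(
  text.matchAll(/(?<!")[a-z]+(?!")/g),
  (match) => [match.index, match[0]]
);
console.log(found);
// [ [ 0, 'how' ], [ 6, 'r' ], [ 12, 'o' ], [ 16, 'doing' ] ]
```

Index 6 is the “r” of "are" and index 12 is the “o” of "you": the regular expression skips the first letter (which is preceded by a quote) and gives up the last letter (which is followed by a quote).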

We can fix this by stating that prefix and suffix must be neither quote nor letter:

> 'how "are" "you" doing'.match(/(?<!["a-z])[a-z]+(?!["a-z])/g)
[ 'how', 'doing' ]

Another solution is to demand via \b that the sequence of characters [a-z]+ start and end at word boundaries:

> 'how "are" "you" doing'.match(/(?<!")\b[a-z]+\b(?!")/g)
[ 'how', 'doing' ]

One thing that is nice about negative lookbehind and negative lookahead is that they also work at the beginning or end, respectively, of a string – as demonstrated in the example.
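
The following sketch isolates that behavior. At the start or end of a string, there is no character for the lookaround pattern to match. Therefore, negative assertions succeed there while positive assertions fail. The one-letter inputs are made up for illustration.

```javascript
// Nothing comes before 'a': negative lookbehind succeeds, positive fails
const negBehindAtStart = /(?<!x)a/.test('a');
const posBehindAtStart = /(?<=x)a/.test('a');
console.log(negBehindAtStart, posBehindAtStart); // true false

// Nothing comes after 'a': negative lookahead succeeds, positive fails
const negAheadAtEnd = /a(?!x)/.test('a');
const posAheadAtEnd = /a(?=x)/.test('a');
console.log(negAheadAtEnd, posAheadAtEnd); // true false
```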

16.4.1 There are no simple alternatives to negative lookaround assertions

Negative lookaround assertions are a powerful tool and usually impossible to emulate via other regular expression means.

If we don’t want to use them, we normally have to take a completely different approach. For example, in this case, we could split the string into (quoted and unquoted) words and then filter those:

const str = 'how "are" "you" doing';

const allWords = str.match(/"?[a-z]+"?/g);
const unquotedWords = allWords.filter(
  w => !w.startsWith('"') || !w.endsWith('"'));
assert.deepEqual(unquotedWords, ['how', 'doing']);

Benefits of this approach:

It works on older engines.
It is easy to understand.

16.5 Interlude: pointing lookaround assertions inward

All of the examples we have seen so far have in common that the lookaround assertions dictate what must come before or after the match but without including those characters in the match.

The regular expressions shown in the remainder of this chapter are different: Their lookaround assertions point inward and restrict what’s inside the match.

16.6 Example: match strings not starting with `'abc'`

Let’s assume we want to match all strings that do not start with 'abc'. Our first attempt could be the regular expression /^(?!abc)/.

That works well for .test():

> /^(?!abc)/.test('xyz')
true

However, .exec() gives us an empty string:

> /^(?!abc)/.exec('xyz')
{ 0: '', index: 0, input: 'xyz', groups: undefined }

The problem is that assertions such as lookaround assertions don’t expand the matched text. That is, they don’t consume input characters; they only make demands about the current location in the input.

Therefore, the solution is to add a pattern that does consume input characters:

> /^(?!abc).*$/.exec('xyz')
{ 0: 'xyz', index: 0, input: 'xyz', groups: undefined }

As desired, this new regular expression rejects strings that are prefixed with 'abc':

> /^(?!abc).*$/.exec('abc')
null
> /^(?!abc).*$/.exec('abcd')
null

And it accepts strings that don’t have the full prefix:

> /^(?!abc).*$/.exec('ab')
{ 0: 'ab', index: 0, input: 'ab', groups: undefined }
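
The following sketch uses the regular expression as a filter and points out one caveat: without the flag /s, the dot does not match line terminators, so .*$ cannot reach the end of a multi-line string. The constant names and the inputs are made up for illustration.

```javascript
const RE_NOT_ABC = /^(?!abc).*$/;

// Keep only the strings that do not start with 'abc'
const candidates = ['abc', 'abcd', 'ab', 'xyz'];
const accepted = candidates.filter((str) => RE_NOT_ABC.test(str));
console.log(accepted); // [ 'ab', 'xyz' ]

// Caveat: a line break stops .* before the end of the input
const withoutDotAll = RE_NOT_ABC.test('x\ny');
console.log(withoutDotAll); // false

// With the flag /s (dotAll), the dot also matches line terminators
const RE_NOT_ABC_DOTALL = /^(?!abc).*$/s;
const withDotAll = RE_NOT_ABC_DOTALL.test('x\ny');
console.log(withDotAll); // true
```

If we only need a boolean result, !str.startsWith('abc') is a simpler alternative that avoids regular expressions entirely.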

16.7 Example: match substrings that do not end with `'.mjs'`

In the following example, we want to find

import ··· from '«module-specifier»';

where module-specifier does not end with '.mjs'.

const code = `
import {transform} from './util';
import {Person} from './person.mjs';
import {zip} from 'lodash';
`.trim();
assert.deepEqual(
  code.match(/^import .*? from '[^']+(?<!\.mjs)';$/umg),
  [
    "import {transform} from './util';",
    "import {zip} from 'lodash';",
  ]);

Here, the lookbehind assertion (?<!\.mjs) acts as a guard: it prevents the regular expression from matching if the module specifier has '.mjs' immediately before the closing quote.
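
If we are interested in the module specifiers and not in the whole import statements, one option is to wrap the specifier (including its guard) in a named capture group. The group name specifier is made up for this sketch.

```javascript
const code = `
import {transform} from './util';
import {Person} from './person.mjs';
import {zip} from 'lodash';
`.trim();

// Same pattern as before, with a named group around the module specifier
const RE_IMPORT = /^import .*? from '(?<specifier>[^']+(?<!\.mjs))';$/umg;

const specifiers = Array.from(
  code.matchAll(RE_IMPORT),
  (match) => match.groups.specifier
);
console.log(specifiers); // [ './util', 'lodash' ]
```

Because the lookbehind assertion does not consume characters, placing it inside the group does not change what the group captures.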

16.8 Example: skipping lines with comments

Scenario: We want to parse lines with settings, while skipping comments. For example:

const RE_SETTING = /^(?!#)([^:]*):(.*)$/;

const lines = [
  'indent: 2', // setting
  '# Trim trailing whitespace:', // comment
  'whitespace: trim', // setting
];
for (const line of lines) {
  const match = RE_SETTING.exec(line);
  if (match) {
    const key = JSON.stringify(match[1]);
    const value = JSON.stringify(match[2]);
    console.log(`KEY: ${key} VALUE: ${value}`);
  }
}

// Output:
// 'KEY: "indent" VALUE: " 2"'
// 'KEY: "whitespace" VALUE: " trim"'

How did we arrive at the regular expression RE_SETTING?

We started with the following regular expression for settings:

/^([^:]*):(.*)$/

Intuitively, it is a sequence of the following parts:

Start of the line
Non-colons (zero or more)
A single colon
Any characters (zero or more)
The end of line

This regular expression does reject some comments:

> /^([^:]*):(.*)$/.test('# Comment')
false

But it accepts others (that have colons in them):

> /^([^:]*):(.*)$/.test('# Comment:')
true

We can fix that by prefixing (?!#) as a guard. Intuitively, it means: “The current location in the input string must not be followed by the character #.”

The new regular expression works as desired:

> /^(?!#)([^:]*):(.*)$/.test('# Comment:')
false
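
Note that the guard (?!#) only inspects the first character of a line, so an indented comment that contains a colon would still be accepted. The following sketch shows one possible variation that also skips optional leading whitespace. This variation, as well as the name RE_SETTING_INDENTED, is an assumption made for illustration and is not part of the example above.

```javascript
const RE_SETTING = /^(?!#)([^:]*):(.*)$/;
// Variation (assumption): whitespace may precede the comment character
const RE_SETTING_INDENTED = /^(?!\s*#)([^:]*):(.*)$/;

// The original guard accepts an indented comment with a colon
const originalAccepts = RE_SETTING.test('  # Comment:');
console.log(originalAccepts); // true

// The variation rejects it but still accepts settings
const variationAccepts = RE_SETTING_INDENTED.test('  # Comment:');
const settingAccepted = RE_SETTING_INDENTED.test('indent: 2');
console.log(variationAccepts, settingAccepted); // false true
```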

16.9 Example: smart quotes

Let’s assume we want to convert pairs of straight double quotes to curly quotes:

Input: `"yes" and "no"`
Output: `“yes” and “no”`

This is our first attempt:

> `The words "must" and "should".`.replace(/"(.*)"/g, '“$1”')
'The words “must" and "should”.'

Only the first quote and the last quote are curly. The problem here is that the * quantifier matches greedily (as much as possible).

If we put a question mark after the *, it matches reluctantly:

> `The words "must" and "should".`.replace(/"(.*?)"/g, '“$1”')
'The words “must” and “should”.'
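
A common alternative to a reluctant quantifier is a negated character class: [^"]* can never run past the next quote, so greediness is not an issue. The following sketch applies that idea to the same input; the variable names are made up for illustration.

```javascript
const sentence = 'The words "must" and "should".';

// [^"]* matches any characters except straight double quotes
const curly = sentence.replace(/"([^"]*)"/g, '“$1”');
console.log(curly); // 'The words “must” and “should”.'
```

One difference from .*? is that [^"]* also matches line terminators, which the dot does not do without the flag /s.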

16.9.1 Supporting escaping via backslashes

What if we want to allow the escaping of quotes via backslashes? We can do that by using the guard (?<!\\) before the quotes:

> const regExp = /(?<!\\)"(.*?)(?<!\\)"/g;
> String.raw`\"straight\" and "curly"`.replace(regExp, '“$1”')
'\\"straight\\" and “curly”'

As a post-processing step, we would still need to do:

.replace(/\\"/g, `"`)

However, this regular expression can fail when there is a backslash-escaped backslash:

> String.raw`Backslash: "\\"`.replace(/(?<!\\)"(.*?)(?<!\\)"/g, '“$1”')
'Backslash: "\\\\"'

The second backslash prevented the quotes from becoming curly.

We can fix that if we make our guard more sophisticated (?: makes the group non-capturing):

(?<=[^\\](?:\\\\)*)

The new guard allows pairs of backslashes before quotes:

> const regExp = /(?<=[^\\](?:\\\\)*)"(.*?)(?<=[^\\](?:\\\\)*)"/g;
> String.raw`Backslash: "\\"`.replace(regExp, '“$1”')
'Backslash: “\\\\”'

One issue remains. This guard prevents the first quote from being matched if it appears at the beginning of a string:

> const regExp = /(?<=[^\\](?:\\\\)*)"(.*?)(?<=[^\\](?:\\\\)*)"/g;
> `"abc"`.replace(regExp, '“$1”')
'"abc"'

We can fix that by changing the first guard to: (?<=[^\\](?:\\\\)*|^)

> const regExp = /(?<=[^\\](?:\\\\)*|^)"(.*?)(?<=[^\\](?:\\\\)*)"/g;
> `"abc"`.replace(regExp, '“$1”')
'“abc”'
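
The pieces of this section can be combined into a single function. The following is a minimal sketch, not a definitive implementation: the function name curlify is hypothetical, and it assumes that only escaped quotes need to be unescaped in the post-processing step (escaped backslashes are left as they are).

```javascript
// Opening quote: preceded by the start of the string or by a non-backslash
// plus zero or more pairs of backslashes. Closing quote: same guard without ^.
const RE_QUOTED = /(?<=[^\\](?:\\\\)*|^)"(.*?)(?<=[^\\](?:\\\\)*)"/g;

function curlify(str) {
  return str
    .replace(RE_QUOTED, '“$1”') // unescaped pairs of quotes become curly
    .replace(/\\"/g, '"'); // post-processing: \" becomes "
}

const result1 = curlify(String.raw`\"straight\" and "curly"`);
console.log(result1); // '"straight" and “curly”'

const result2 = curlify(String.raw`Backslash: "\\"`);
console.log(result2 === String.raw`Backslash: “\\”`); // true

const result3 = curlify('"abc"');
console.log(result3); // '“abc”'
```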

16.10 Acknowledgements

The first regular expression that handles escaped backslashes in front of quotes was proposed by @jonasraoni on Twitter.

16.11 Further reading

Chapter “Regular expressions (RegExp)” in “JavaScript for impatient programmers”