In this chapter we use examples to explore lookaround assertions in regular expressions. A lookaround assertion is non-capturing and must match (or not match) what comes before (or after) the current location in the input string.
Pattern | Name | Introduced in |
---|---|---|
(?=«pattern») | Positive lookahead | ES3 |
(?!«pattern») | Negative lookahead | ES3 |
(?<=«pattern») | Positive lookbehind | ES2018 |
(?<!«pattern») | Negative lookbehind | ES2018 |
There are four lookaround assertions (tbl. 4):
- (?=«pattern») matches if pattern matches what comes after the current location in the input string.
- (?!«pattern») matches if pattern does not match what comes after the current location in the input string.
- (?<=«pattern») matches if pattern matches what comes before the current location in the input string.
- (?<!«pattern») matches if pattern does not match what comes before the current location in the input string.

The examples show what can be achieved via lookaround assertions. However, regular expressions aren’t always the best solution. Another technique, such as proper parsing, may be a better choice.
Lookbehind assertions are a relatively new feature that may not be supported by all JavaScript engines you are targeting.
Lookaround assertions may affect performance negatively, especially if their patterns match long strings.
In the following interaction, we extract quoted words:
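A minimal sketch of such an extraction, assuming the sample string from the next section as input and words that consist of lowercase letters only (the variable names are illustrative):

```javascript
// Assumed sample input
const str = 'how "are" "you" doing';

// (?<=") demands a preceding quote, (?=") demands a following quote
const quotedWords = str.match(/(?<=")[a-z]+(?=")/g);

console.log(quotedWords); // → ['are', 'you']
```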
Two lookaround assertions help us here:
- (?<=") “must be preceded by a quote”
- (?=") “must be followed by a quote”

Lookaround assertions are especially convenient for .match() in /g mode, which returns whole matches (capture group 0). Whatever the pattern of a lookaround assertion matches is not captured. Without lookaround assertions, the quotes show up in the result:
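For comparison, here is a sketch of the same extraction with an ordinary capture group instead of lookaround assertions, again assuming the sample string used above:

```javascript
const str = 'how "are" "you" doing';

// In /g mode, .match() returns whole matches and ignores capture group 1,
// so the surrounding quotes are part of each result
const withQuotes = str.match(/"([a-z]+)"/g);

console.log(withQuotes); // → ['"are"', '"you"']
```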
How can we achieve the opposite of what we did in the previous section and extract all unquoted words from a string?
Input: 'how "are" "you" doing'
Output: ['how', 'doing']
Our first attempt is to simply convert positive lookaround assertions to negative lookaround assertions. Alas, that fails:
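The failed attempt can be reproduced roughly as follows. This is a sketch that assumes the positive assertions of the previous pattern are simply negated:

```javascript
const str = 'how "are" "you" doing';

// Negated versions of (?<=") and (?=")
const firstAttempt = str.match(/(?<!")[a-z]+(?!")/g);

// The inner letters of the quoted words slip through
console.log(firstAttempt); // → ['how', 'r', 'o', 'doing']
```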
The problem is that we extract sequences of characters that are not bracketed by quotes. That means that in the string '"are"', the “r” in the middle is considered unquoted, because it is preceded by an “a” and followed by an “e”.
We can fix this by stating that prefix and suffix must be neither quote nor letter:
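One way to express that restriction, sketched under the assumption that words consist of the letters a–z only:

```javascript
const str = 'how "are" "you" doing';

// Neither a quote nor a letter may come directly before or after the match
const unquotedWords = str.match(/(?<!["a-z])[a-z]+(?!["a-z])/g);

console.log(unquotedWords); // → ['how', 'doing']
```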
Another solution is to demand via \b that the sequence of characters [a-z]+ start and end at word boundaries:
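A possible version with word boundaries, shown as a sketch with the same assumed sample string:

```javascript
const str = 'how "are" "you" doing';

// \b pins the letters to word boundaries, so a match cannot start or end
// in the middle of a word; the negative assertions still exclude the quotes
const unquotedWords = str.match(/(?<!")\b[a-z]+\b(?!")/g);

console.log(unquotedWords); // → ['how', 'doing']
```

Note that 'how' sits at the very start of the string and 'doing' at the very end; both are still matched.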
One thing that is nice about negative lookbehind and negative lookahead is that they also work at the beginning or end, respectively, of a string – as demonstrated in the example.
Negative lookaround assertions are a powerful tool and usually impossible to emulate via other regular expression means.
If we don’t want to use them, we normally have to take a completely different approach. For example, in this case, we could split the string into (quoted and unquoted) words and then filter those:
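import assert from 'node:assert/strict';

const str = 'how "are" "you" doing';
// Match every word, with or without surrounding quotes
const allWords = str.match(/"?[a-z]+"?/g);
// Keep only the words that are not enclosed in quotes
const unquotedWords = allWords.filter(
  w => !w.startsWith('"') || !w.endsWith('"'));
assert.deepEqual(unquotedWords, ['how', 'doing']);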
Benefits of this approach:
All of the examples we have seen so far have in common that the lookaround assertions dictate what must come before or after the match but without including those characters in the match.
The regular expressions shown in the remainder of this chapter are different: Their lookaround assertions point inward and restrict what’s inside the match.
Let’s assume we want to match all strings that do not start with 'abc'. Our first attempt could be the regular expression /^(?!abc)/.
That works well for .test():
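A short sketch of such checks (the constant name and the test strings are illustrative):

```javascript
const RE_NOT_ABC = /^(?!abc)/;

console.log(RE_NOT_ABC.test('xyz'));  // → true
console.log(RE_NOT_ABC.test('abc'));  // → false
console.log(RE_NOT_ABC.test('abcd')); // → false
```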
However, .exec() gives us an empty string:
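For instance, with an assumed input of 'xyz':

```javascript
const match = /^(?!abc)/.exec('xyz');

// The match succeeds, but it covers zero characters at index 0
console.log(JSON.stringify(match[0])); // → ""
console.log(match.index);              // → 0
```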
The problem is that assertions such as lookaround assertions don’t expand the matched text. That is, they don’t capture input characters, they only make demands about the current location in the input.
Therefore, the solution is to add a pattern that does capture input characters:
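One plausible form of that pattern, assuming that the rest of the string should be matched up to its end:

```javascript
// .* consumes the characters that the lookahead only inspected
const RE_NOT_ABC = /^(?!abc).*$/;

console.log(RE_NOT_ABC.exec('xyz')[0]); // → xyz
```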
As desired, this new regular expression rejects strings that are prefixed with 'abc':
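A sketch of the rejection, assuming the pattern /^(?!abc).*$/ and illustrative inputs:

```javascript
const RE_NOT_ABC = /^(?!abc).*$/;

// .exec() returns null when there is no match
console.log(RE_NOT_ABC.exec('abc'));  // → null
console.log(RE_NOT_ABC.exec('abcd')); // → null
```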
And it accepts strings that don’t have the full prefix:
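Under the same assumption about the pattern, partial prefixes pass:

```javascript
const RE_NOT_ABC = /^(?!abc).*$/;

// 'ab' and 'abx' start like 'abc' but never complete it
console.log(RE_NOT_ABC.exec('ab')[0]);  // → ab
console.log(RE_NOT_ABC.exec('abx')[0]); // → abx
```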
In the following example, we want to find
import ··· from '«module-specifier»';
where module-specifier does not end with '.mjs'.
import assert from 'node:assert/strict';

const code = `
import {transform} from './util';
import {Person} from './person.mjs';
import {zip} from 'lodash';
`.trim();
assert.deepEqual(
  code.match(/^import .*? from '[^']+(?<!\.mjs)';$/umg),
  [
    "import {transform} from './util';",
    "import {zip} from 'lodash';",
  ]);
Here, the lookbehind assertion (?<!\.mjs) acts as a guard: it prevents the regular expression from matching strings that contain '.mjs' at this location.
Scenario: We want to parse lines with settings, while skipping comments. For example:
const RE_SETTING = /^(?!#)([^:]*):(.*)$/;
const lines = [
  'indent: 2', // setting
  '# Trim trailing whitespace:', // comment
  'whitespace: trim', // setting
];
for (const line of lines) {
  const match = RE_SETTING.exec(line);
  if (match) {
    const key = JSON.stringify(match[1]);
    const value = JSON.stringify(match[2]);
    console.log(`KEY: ${key} VALUE: ${value}`);
  }
}
// Output:
// 'KEY: "indent" VALUE: " 2"'
// 'KEY: "whitespace" VALUE: " trim"'
How did we arrive at the regular expression RE_SETTING?
We started with the following regular expression for settings:
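Presumably this is RE_SETTING without its (?!#) guard. A sketch under that assumption (the name RE_SETTING_V1 is made up for illustration):

```javascript
// Assumed starting point: RE_SETTING minus the guard
const RE_SETTING_V1 = /^([^:]*):(.*)$/;

const match = RE_SETTING_V1.exec('indent: 2');
console.log(JSON.stringify(match[1])); // → "indent"
console.log(JSON.stringify(match[2])); // → " 2"
```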
Intuitively, it is a sequence of the following parts:
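The parts can be read directly off the pattern. The following sketch assembles the unguarded expression /^([^:]*):(.*)$/ piece by piece; the name RE_SETTING_PARTS and the reading of the groups as key and value are assumptions made for illustration:

```javascript
const RE_SETTING_PARTS = new RegExp(
  '^' +       // start of the input
  '([^:]*)' + // group 1: zero or more non-colon characters (the key)
  ':' +       // a single colon
  '(.*)' +    // group 2: the remaining characters (the value)
  '$'         // end of the input
);

console.log(RE_SETTING_PARTS.source); // → ^([^:]*):(.*)$
```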
This regular expression does reject some comments:
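For example, a comment without a colon cannot match. This sketch assumes the unguarded pattern and a made-up comment line:

```javascript
const RE_SETTING_V1 = /^([^:]*):(.*)$/;

// No colon anywhere, so the mandatory ':' cannot be matched
console.log(RE_SETTING_V1.exec('# Comment')); // → null
```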
But it accepts others (that have colons in them):
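With the comment line from the example above, the unguarded pattern (again assumed) produces a match:

```javascript
const RE_SETTING_V1 = /^([^:]*):(.*)$/;

const match = RE_SETTING_V1.exec('# Trim trailing whitespace:');
// The comment is misread as a setting with an empty value
console.log(JSON.stringify(match[1])); // → "# Trim trailing whitespace"
console.log(JSON.stringify(match[2])); // → ""
```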
We can fix that by prefixing (?!#) as a guard. Intuitively, it means: “The current location in the input string must not be followed by the character #.”
The new regular expression works as desired:
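A quick check with RE_SETTING as defined above, using two of the sample lines:

```javascript
const RE_SETTING = /^(?!#)([^:]*):(.*)$/;

// The comment is rejected even though it contains a colon
console.log(RE_SETTING.exec('# Trim trailing whitespace:')); // → null

// Settings are still accepted
console.log(RE_SETTING.exec('indent: 2')[1]); // → indent
```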
Let’s assume we want to convert pairs of straight double quotes to curly quotes:
Input: `"yes" and "no"`
Output: `“yes” and “no”`
This is our first attempt:
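A sketch of what such an attempt might look like, assuming a greedy .* between the two quotes:

```javascript
const input = `"yes" and "no"`;

// Greedy: .* runs from the first quote all the way to the last one
const firstAttempt = input.replace(/"(.*)"/g, '“$1”');

console.log(firstAttempt); // → “yes" and "no”
```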
Only the first quote and the last quote are curly. The problem here is that the * quantifier matches greedily (as much as possible).
If we put a question mark after the *, it matches reluctantly:
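The reluctant variant, sketched with the same assumed input:

```javascript
const input = `"yes" and "no"`;

// Reluctant: .*? stops at the nearest closing quote
const converted = input.replace(/"(.*?)"/g, '“$1”');

console.log(converted); // → “yes” and “no”
```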
What if we want to allow the escaping of quotes via backslashes? We can do that by using the guard (?<!\\) before the quotes:
> const regExp = /(?<!\\)"(.*?)(?<!\\)"/g;
> String.raw`\"straight\" and "curly"`.replace(regExp, '“$1”')
'\\"straight\\" and “curly”'
As a post-processing step, we would still need to do:
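Presumably that step removes the backslashes in front of the escaped quotes. A sketch under that assumption:

```javascript
const regExp = /(?<!\\)"(.*?)(?<!\\)"/g;
const converted = String.raw`\"straight\" and "curly"`.replace(regExp, '“$1”');

// Replace each backslash-quote pair with a plain quote
const unescaped = converted.replace(/\\"/g, '"');

console.log(unescaped); // → "straight" and “curly”
```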
However, this regular expression can fail when there is a backslash-escaped backslash:
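The failure can be reproduced with an input whose quoted text is a single escaped backslash (an assumed example):

```javascript
const regExp = /(?<!\\)"(.*?)(?<!\\)"/g;
const input = String.raw`Backslash: "\\"`;

const result = input.replace(regExp, '“$1”');

// The closing quote is preceded by a backslash, so nothing is replaced
console.log(result === input); // → true
```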
The second backslash prevented the quotes from becoming curly.
We can fix that if we make our guard more sophisticated (?: makes the group non-capturing):
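To see the guard on its own, the following sketch applies it to a lone quote and counts the quotes it lets through. The helper countUnescapedQuotes and the constant name are hypothetical:

```javascript
// Guard: a non-backslash character, then zero or more pairs of backslashes
const RE_UNESCAPED_QUOTE = /(?<=[^\\](?:\\\\)*)"/g;

// Hypothetical helper: how many quotes pass the guard?
const countUnescapedQuotes =
  (s) => (s.match(RE_UNESCAPED_QUOTE) ?? []).length;

console.log(countUnescapedQuotes(String.raw`a"`));   // → 1 (no backslash)
console.log(countUnescapedQuotes(String.raw`a\"`));  // → 0 (quote is escaped)
console.log(countUnescapedQuotes(String.raw`a\\"`)); // → 1 (backslash is escaped)
```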
The new guard allows pairs of backslashes before quotes:
> const regExp = /(?<=[^\\](?:\\\\)*)"(.*?)(?<=[^\\](?:\\\\)*)"/g;
> String.raw`Backslash: "\\"`.replace(regExp, '“$1”')
'Backslash: “\\\\”'
One issue remains. This guard prevents the first quote from being matched if it appears at the beginning of a string:
> const regExp = /(?<=[^\\](?:\\\\)*)"(.*?)(?<=[^\\](?:\\\\)*)"/g;
> `"abc"`.replace(regExp, '“$1”')
'"abc"'
We can fix that by changing the first guard to: (?<=[^\\](?:\\\\)*|^)
> const regExp = /(?<=[^\\](?:\\\\)*|^)"(.*?)(?<=[^\\](?:\\\\)*)"/g;
> `"abc"`.replace(regExp, '“$1”')
'“abc”'
@jonasraoni on Twitter.
RegExp)” in “JavaScript for impatient programmers”