JavaScript for impatient programmers (beta)

39. Regular expressions (RegExp)



  Availability of features

Unless stated otherwise, all regular expression features are supported by ES5 and later.

39.1. Creating regular expressions

39.1.1. Literal vs. constructor

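The two main ways of creating regular expressions are a literal, such as /abc/ui, and the constructor, as in new RegExp('abc', 'ui').

Both regular expressions have the same two parts: the body abc (the actual pattern) and the flags u and i.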
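
A minimal sketch of the two creation styles, assuming the body abc and the flags u and i purely for illustration. The properties .source and .flags (the latter needs ES6) expose the two parts:

```javascript
// Literal: the body sits between the slashes, the flags follow the closing slash
const viaLiteral = /abc/ui;
// Constructor: body and flags are passed as strings
const viaConstructor = new RegExp('abc', 'ui');

// .flags always lists the flags in a fixed order, regardless of how they were written
console.log(viaLiteral.source, viaLiteral.flags); // abc iu
console.log(viaConstructor.source, viaConstructor.flags); // abc iu
```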

39.1.2. Cloning and non-destructively modifying regular expressions

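There are two variants of the constructor RegExp(): new RegExp(pattern, flags), where pattern is a string, and new RegExp(regExp, flags), where regExp is an existing regular expression (passing flags to this variant requires ES6).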

The second variant is useful for cloning regular expressions, optionally while modifying them. Flags are immutable and this is the only way of changing them. For example:

function copyAndAddFlags(regExp, flags='') {
  // The constructor doesn’t allow duplicate flags,
  // make sure there aren’t any:
  const newFlags = [...new Set(regExp.flags + flags)].join('');
  return new RegExp(regExp, newFlags);
}
assert.equal(/abc/i.flags, 'i');
assert.equal(copyAndAddFlags(/abc/i, 'g').flags, 'gi');

39.2. Syntax

39.2.1. Syntax characters

At the top level of a regular expression, the following syntax characters are special. They are escaped by prefixing a backslash (\).

\ ^ $ . * + ? ( ) [ ] { } |

In regular expression literals, you must also escape the slash (not necessary with new RegExp()):

> /\//.test('/')
true
> new RegExp('/').test('/')
true

39.2.2. Basic atoms

Atoms are the basic building blocks of regular expressions.

39.2.2.1. Unicode property escapes

Unicode property escapes look like this:

  1. \p{prop=value}: matches all characters whose property prop has the value value.
  2. \P{prop=value}: matches all characters that do not have a property prop whose value is value.
  3. \p{bin_prop}: matches all characters whose binary property bin_prop is True.
  4. \P{bin_prop}: matches all characters whose binary property bin_prop is False.

Comments:

Examples:
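
A brief sketch of property escapes in use. The property names Lowercase, White_Space and Script=Greek are standard Unicode properties; the variable names are only illustrative. Property escapes require ES2018 and the flag /u:

```javascript
// Binary property: every character must be lowercase
const allLowercase = /^\p{Lowercase}+$/u;
// Negated binary property: no character may be whitespace
const noWhitespace = /^\P{White_Space}+$/u;
// Property with a value: at least one Greek letter
const hasGreek = /\p{Script=Greek}/u;

console.log(allLowercase.test('abc')); // true
console.log(allLowercase.test('aBc')); // false
console.log(noWhitespace.test('a b')); // false
console.log(hasGreek.test('\u03B1')); // true (Greek small letter alpha)
```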

Further reading:

39.2.3. Character classes

39.2.4. Groups

39.2.5. Quantifiers

By default, all of the following quantifiers are greedy:

To make them reluctant, put question marks (?) after them:

> /".*"/.exec('"abc"def"')[0]  // greedy
'"abc"def"'
> /".*?"/.exec('"abc"def"')[0] // reluctant
'"abc"'

39.2.6. Assertions

39.2.7. Disjunction (|)

Caveat: this operator has low precedence. Use groups if necessary:
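
The following sketch illustrates the caveat with two made-up patterns. Without a group, each anchor only applies to the alternative next to it:

```javascript
// Parsed as (^aa) or (zz$): either alternative alone is enough
const loose = /^aa|zz$/;
// The group makes both anchors apply to each alternative
const strict = /^(aa|zz)$/;

console.log(loose.test('aaXX')); // true
console.log(strict.test('aaXX')); // false
console.log(strict.test('zz')); // true
```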

39.3. Flags

Table 20: These are the regular expression flags supported by JavaScript.

Literal flag  Property name  ES      Description
g             global         ES3     Match multiple times
i             ignoreCase     ES3     Match case-insensitively
m             multiline      ES3     ^ and $ match per line
s             dotAll         ES2018  Dot matches line terminators
u             unicode        ES6     Unicode mode (recommended)
y             sticky         ES6     No characters between matches

The following regular expression flags are available in JavaScript (tbl. 20 provides a compact overview):

39.3.1. Flag: Unicode mode via /u

The flag /u switches on a special Unicode mode for a regular expression. That mode enables several features:

The following subsections explain in more detail how /u makes code points, not code units, the atomic units of matching. They use the following Unicode character to show when the atomic units are code points and when they are code units:

const codePoint = '🙂';
const codeUnits = '\uD83D\uDE42'; // UTF-16

assert.equal(codePoint, codeUnits); // same string!

I’m only switching between 🙂 and \uD83D\uDE42 to illustrate how JavaScript sees things. Both are equivalent and can be used interchangeably in strings and regular expressions.

39.3.1.1. Consequence: you can put code points in character classes

With /u, the two code units of 🙂 are interpreted as a single character:

> /^[🙂]$/u.test('🙂')
true

Without /u, 🙂 is interpreted as two characters:

> /^[\uD83D\uDE42]$/.test('\uD83D\uDE42')
false
> /^[\uD83D\uDE42]$/.test('\uDE42')
true

Note that ^ and $ demand that the input string have a single character. That’s why the first result is false.

39.3.1.2. Consequence: the dot operator (.) matches code points, not code units

With /u, the dot operator matches code points (.match() plus /g returns an Array with all the matches of a regular expression):

> '🙂'.match(/./gu).length
1

Without /u, the dot operator matches single code units:

> '\uD83D\uDE42'.match(/./g).length
2

39.3.1.3. Consequence: quantifiers apply to code points, not code units

With /u, a quantifier applies to the whole preceding code point:

> /^🙂{3}$/u.test('🙂🙂🙂')
true

Without /u, a quantifier only applies to the preceding code unit:

> /^\uD83D\uDE42{3}$/.test('\uD83D\uDE42\uDE42\uDE42')
true

39.4. Properties of regular expression objects

Noteworthy:

39.4.1. Flags as properties

Each regular expression flag exists as a property, with a longer, more descriptive name:

> /a/i.ignoreCase
true
> /a/.ignoreCase
false

This is the complete list of flag properties:

39.4.2. Other properties

Each regular expression also has the following properties:
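
Three such properties are .source (the body as a string), .flags (the flags as a string, ES6) and .lastIndex (the index at which matching starts if /g or /y is set). A short sketch with an arbitrary example pattern:

```javascript
const pattern = /a+b/gi;

console.log(pattern.source); // a+b
console.log(pattern.flags); // gi
console.log(pattern.lastIndex); // 0

// With /g, a successful match moves .lastIndex to the index after the match
pattern.test('xaab');
console.log(pattern.lastIndex); // 4
```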

39.5. Methods for working with regular expressions

39.5.1. regExp.test(str): is there a match?

The regular expression method .test() returns true if regExp matches str:

> /abc/.test('ABC')
false
> /abc/i.test('ABC')
true
> /\.js$/.test('main.js')
true

With .test() you should normally avoid the /g flag. If you use it, you generally don’t get the same result every time you call the method:

> const r = /a/g;
> r.test('aab')
true
> r.test('aab')
true
> r.test('aab')
false

The results are due to /a/ having two matches in the string. After all of those were found, .test() returns false.

39.5.2. str.search(regExp): at what index is the match?

The string method .search() returns the first index of str at which there is a match for regExp:

> '_abc_'.search(/abc/)
1
> 'main.js'.search(/\.js$/)
4
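
Two further details, sketched with made-up inputs: .search() returns -1 if there is no match, and it neither uses nor changes .lastIndex, even if /g is set:

```javascript
// No match: the result is -1
const missing = 'main.js'.search(/\.ts$/);

// .search() always starts at index 0 and leaves .lastIndex as it was
const globalRegExp = /a/g;
globalRegExp.lastIndex = 3;
const first = 'banana'.search(globalRegExp);

console.log(missing, first, globalRegExp.lastIndex); // -1 1 3
```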

39.5.3. regExp.exec(str): capturing groups

39.5.3.1. Getting a match object for the first match

Without the flag /g, .exec() returns all captures of the first match for regExp in str:

assert.deepEqual(
  /(a+)b/.exec('ab aab'),
  {
    0: 'ab',
    1: 'a',
    index: 0,
    input: 'ab aab',
    groups: undefined,
  }
);

The result is a match object with the following properties:

39.5.3.2. Named groups (ES2018)

The previous example contained a single positional group. The following example demonstrates named groups:

const regExp = /^(?<key>[A-Za-z]+): (?<value>.*)$/u;
assert.deepEqual(
  regExp.exec('first: Jane'),
  {
    0: 'first: Jane',
    1: 'first',
    2: 'Jane',
    index: 0,
    input: 'first: Jane',
    groups: { key: 'first', value: 'Jane' },
  }
);

As you can see, the named groups key and value also exist as positional groups.
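
Named captures combine well with destructuring. The helper parseEntry() below is hypothetical and only serves as an illustration of accessing match.groups:

```javascript
// Hypothetical helper: parses a line such as 'first: Jane'
function parseEntry(line) {
  const match = /^(?<key>[A-Za-z]+): (?<value>.*)$/u.exec(line);
  if (match === null) {
    return null; // the line doesn't have the expected shape
  }
  // Access the captures by name instead of by position
  const {key, value} = match.groups;
  return {key, value};
}

console.log(parseEntry('first: Jane')); // { key: 'first', value: 'Jane' }
console.log(parseEntry('no entry here')); // null
```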

39.5.3.3. Looping over multiple matches

If you want to retrieve all matches of a regular expression (not just the first one), you need to switch on the flag /g. Then you can call .exec() multiple times and get another match each time. After the last match, .exec() returns null.

> const regExp = /(a+)b/g;
> regExp.exec('ab aab')
{ 0: 'ab', 1: 'a', index: 0, input: 'ab aab', groups: undefined }
> regExp.exec('ab aab')
{ 0: 'aab', 1: 'aa', index: 3, input: 'ab aab', groups: undefined }
> regExp.exec('ab aab')
null

Therefore, you can loop over all matches as follows:

const regExp = /(a+)b/g;
const str = 'ab aab';

let match;
// Check for null via truthiness
// Alternative: while ((match = regExp.exec(str)) !== null)
while (match = regExp.exec(str)) {
  console.log(match[1]);
}
// Output:
// 'a'
// 'aa'

Sharing regular expressions with /g has a few pitfalls, which are explained later.

  Exercise: Extract quoted text via .exec()

exercises/reg-exp/extract_quoted_test.js

39.5.4. str.match(regExp): return all matching substrings

Without /g, .match() works like .exec() – it returns a single match object.
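
A quick sketch of the mode without /g, reusing the pattern from the .exec() example. The result carries the captures and the index of the first match only:

```javascript
const single = 'ab aab'.match(/(a+)b/); // no /g

// Complete match, capture of group 1, index of the match
console.log(single[0], single[1], single.index); // ab a 0
```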

With /g, .match() returns all substrings of str that match regExp:

> 'ab aab'.match(/(a+)b/g)  // important: /g
[ 'ab', 'aab' ]

If there is no match, .match() returns null:

> 'xyz'.match(/(a+)b/g)
null

You can use the Or operator to protect yourself against null:

const numberOfMatches = (str.match(regExp) || []).length;

39.5.5. str.replace(searchValue, replacementValue)

.replace() has several different modes, depending on what values you provide for its parameters:

The next subsections assume that a regular expression with /g is being used.

39.5.5.1. replacementValue is a string

If the replacement value is a string, the dollar sign has special meaning – it inserts things matched by the regular expression:

Text     Result
$$       single $
$&       complete match
$`       text before match
$'       text after match
$n       capture of positional group n (n > 0)
$<name>  capture of named group name

Example: Inserting the text before, inside, and after the matched substring.

> 'a1 a2'.replace(/a/g, "($`|$&|$')")
'(|a|1 a2)1 (a1 |a|2)2'

Example: Inserting the captures of positional groups.

> const regExp = /^([A-Za-z]+): (.*)$/ug;
> 'first: Jane'.replace(regExp, 'KEY: $1, VALUE: $2')
'KEY: first, VALUE: Jane'

Example: Inserting the captures of named groups.

> const regExp = /^(?<key>[A-Za-z]+): (?<value>.*)$/ug;
> 'first: Jane'.replace(regExp, 'KEY: $<key>, VALUE: $<value>')
'KEY: first, VALUE: Jane'

39.5.5.2. replacementValue is a function

If the replacement value is a function, you can compute each replacement. In the following example, we multiply each non-negative integer that we find by two.

assert.equal(
  '3 cats and 4 dogs'.replace(/[0-9]+/g, (all) => 2 * Number(all)),
  '6 cats and 8 dogs'
);

The replacement function gets the following parameters. Note how similar they are to match objects. The parameters are all positional, but I’ve included how one usually names them:
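
As a sketch, the parameters arrive in this order: the complete match, one capture per positional group, the index of the match, the input string and, if the regular expression has named groups (ES2018), an object with the named captures. The parameter names below are conventions, not requirements:

```javascript
const entryRegExp = /^(?<key>[A-Za-z]+): (?<value>.*)$/u;

const rewritten = 'first: Jane'.replace(
  entryRegExp,
  // all: complete match; g1, g2: positional captures;
  // offset: index of the match; input: the whole string;
  // groups: captures of the named groups
  (all, g1, g2, offset, input, groups) => {
    return `${groups.value}=${g1}@${offset}`;
  }
);

console.log(rewritten); // Jane=first@0
```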

  Exercise: Change quotes via .replace() and a named group

exercises/reg-exp/change_quotes_test.js

39.5.6. Other methods for working with regular expressions

The first parameter of String.prototype.split() is either a string or a regular expression. If it is the latter then substrings captured by groups are added to the result of the method:

> 'a : b : c'.split(/( *):( *)/)
[ 'a', ' ', ' ', 'b', ' ', ' ', 'c' ]

Consult the chapter on strings for more information.

39.6. Flag /g and its pitfalls

Two regular expression methods, .exec() and .test(), do something unusual if /g is switched on: they can be called repeatedly and deliver all matches inside a string. Property .lastIndex of the regular expression is used to track the current position inside the string. For example:

const r = /a/g;
assert.equal(r.lastIndex, 0);

assert.equal(r.test('aa'), true); // 1st match?
assert.equal(r.lastIndex, 1); // after 1st match

assert.equal(r.test('aa'), true); // 2nd match?
assert.equal(r.lastIndex, 2); // after 2nd match

assert.equal(r.test('aa'), false); // 3rd match?
assert.equal(r.lastIndex, 0); // start over

So how is flag /g problematic? We’ll first explore the problems and then solutions.

39.6.1. Problem: You can’t inline a regular expression with flag /g

A regular expression with /g can’t be inlined. For example, in the following while loop, the regular expression is created afresh every time the condition is checked. Therefore, its .lastIndex is always zero and the loop never terminates.

let count = 0;
// Infinite loop
while (/a/g.test('babaa')) {
  count++;
}

39.6.2. Problem: Removing /g can break code

If code expects a regular expression with /g and loops over the results of .exec() or .test(), then a regular expression without /g can cause an infinite loop:

const regExp = /a/; // Missing: flag /g

let count = 0;
// Infinite loop
while (regExp.test('babaa')) {
  count++;
}

Why? Because .test() always returns the first result, true, and never false.

39.6.3. Problem: Adding /g can break code

With .test(), there is another caveat: If you want to check exactly once whether a regular expression matches a string, then the regular expression must not have /g. Otherwise, you generally get a different result every time you call .test():

> const r = /^X/g;
> r.test('Xa')
true
> r.test('Xa')
false

Normally, you won’t add /g if you intend to use .test() in this manner. But it can happen if, e.g., you use the same regular expression for testing and for replacing. Or if you get the regular expression via a parameter.

39.6.4. Problem: Code can break if .lastIndex isn’t zero

When a regular expression is created, .lastIndex is initialized to zero. If code ever receives a regular expression whose .lastIndex is not zero, it can break. For example:

const regExp = /a/g;
regExp.lastIndex = 4;

let count = 0;
while (regExp.test('babaa')) {
  count++;
}
assert.equal(count, 1); // should be 3

.lastIndex not being zero can happen relatively easily if a regular expression is shared and not handled properly.

39.6.5. Dealing with /g and .lastIndex

Consider the following scenario: You want to implement a function countOccurrences(regExp, str) that counts how often regExp has a match inside str. How do you prevent a wrong regExp from breaking your code? Let’s look at three approaches.

First, you can throw exceptions if /g isn’t set or .lastIndex isn’t zero:

function countOccurrences(regExp, str) {
  if (!regExp.global) {
    throw new Error('Flag /g of regExp must be set');
  }
  if (regExp.lastIndex !== 0) {
    throw new Error('regExp.lastIndex must be zero');
  }
  
  let count = 0;
  while (regExp.test(str)) {
    count++;
  }
  return count;
}

Second, you can clone the parameter. That has the added benefit that regExp won’t be changed.

function countOccurrences(regExp, str) {
  const cloneFlags = regExp.flags + (regExp.global ? '' : 'g');
  const clone = new RegExp(regExp, cloneFlags);

  let count = 0;
  while (clone.test(str)) {
    count++;
  }
  return count;
}

Third, you can use .match() to count occurrences. It doesn’t depend on .lastIndex: with /g, it always starts searching at index zero and leaves .lastIndex at zero afterwards.

function countOccurrences(regExp, str) {
  if (!regExp.global) {
    throw new Error('Flag /g of regExp must be set');
  }
  return (str.match(regExp) || []).length;
}

39.7. Techniques for working with regular expressions

39.7.1. Escaping arbitrary text for regular expressions

The following function escapes an arbitrary text so that it is matched verbatim if you put it inside a regular expression:

function escapeForRegExp(str) {
  return str.replace(/[\\^$.*+?()[\]{}|]/g, '\\$&'); // (A)
}
assert.equal(escapeForRegExp('[yes?]'), String.raw`\[yes\?\]`);
assert.equal(escapeForRegExp('_g_'), String.raw`_g_`);

In line A, we escape all syntax characters. Note that /u forbids many escapes: among others, \: and \-.

This is how you can use escapeForRegExp() to replace an arbitrary text multiple times:

> const re = new RegExp(escapeForRegExp(':-)'), 'ug');
> ':-) :-) :-)'.replace(re, '🙂')
'🙂 🙂 🙂'

39.7.2. Matching everything or nothing

Sometimes, you may need a regular expression that matches everything or nothing. For example, as a sentinel value.
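
One possible pair of such regular expressions is sketched below; the variable names are illustrative. The empty non-capturing group matches the empty string, which can be found in every string. The second pattern demands a character followed by the start of the input, which can't happen as long as no flags are set:

```javascript
// Matches everything: the empty group succeeds at index 0 of any string
const matchEverything = /(?:)/;
// Matches nothing: after consuming a character, ^ can no longer match
const matchNothing = /.^/;

console.log(matchEverything.test('abc'), matchEverything.test('')); // true true
console.log(matchNothing.test('abc'), matchNothing.test('')); // false false
```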