This chapter gives an overview of the JavaScript API for regular expressions. It assumes that you are roughly familiar with how they work. If you are not, there are many good tutorials on the Web. Two examples are:
The terms used here closely reflect the grammar in the ECMAScript specification. I sometimes deviate to make things easier to understand.
The syntax for general atoms is as follows:
All of the following characters have special meaning:
\ ^ $ . * + ? ( ) [ ] { } |
You can escape them by prefixing a backslash. For example:
> /^(ab)$/.test('(ab)')
false
> /^\(ab\)$/.test('(ab)')
true
Additional special characters are:
Inside a character class [...]: -
Inside a group that starts with a question mark (?...): : = ! < >
The angle brackets are used only by the XRegExp library (see Chapter 30), to name groups.
. (dot) matches any JavaScript character (UTF-16 code unit) except line terminators (newline, carriage return, etc.). To really match any character, use [\s\S]. For example:
> /./.test('\n')
false
> /[\s\S]/.test('\n')
true
\f (form feed), \n (line feed, newline), \r (carriage return), \t (horizontal tab), and \v (vertical tab).
\0 matches the NUL character (\u0000).
\cA – \cZ (control characters).
\u0000 – \uFFFF (Unicode code units; see Chapter 24).
\x00 – \xFF (hexadecimal character escapes).
\d matches any digit (same as [0-9]); \D matches any nondigit (same as [^0-9]).
\w matches any Latin alphanumeric character plus underscore (same as [A-Za-z0-9_]); \W matches all characters not matched by \w.
\s matches whitespace characters (space, tab, line feed, carriage return, form feed, all Unicode spaces, etc.); \S matches all nonwhitespace characters.
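As a quick check of the escapes listed above, the following sketch tries a few of them with test(). The sample strings and variable names are illustrative choices, not part of the original text. The point is that \d and \w cover only ASCII digits and letters, while \s also covers Unicode spaces such as the no-break space.

```javascript
// Character escapes match single characters
var viaUnicode = /\u0041/.test('A');   // \u0041 is the code unit of 'A'
var viaHex = /\x41/.test('A');         // \x41 is the same character
var viaControl = /\cJ/.test('\n');     // Ctrl-J is line feed

// Character class escapes
var asciiDigit = /^\d$/.test('7');
var accented = /^\w$/.test('\u00E9');      // 'é' is not in [A-Za-z0-9_]
var noBreakSpace = /^\s$/.test('\u00A0');  // a Unicode space

console.log(viaUnicode, viaHex, viaControl);      // true true true
console.log(asciiDigit, accented, noBreakSpace);  // true false true
```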
The syntax for character classes is as follows:
[«charSpecs»] matches any single character that matches at least one of the charSpecs.
[^«charSpecs»] matches any single character that does not match any of the charSpecs.
The following constructs are all character specifications:
Source characters match themselves. Most characters are source characters (even many characters that are special elsewhere). Only three characters are not:
\ ] -
As usual, you escape via a backslash. If you want to match a dash without escaping it, it must be the first character after the opening bracket or the right side of a range, as described shortly.
Class escapes: Any of the character escapes and character class escapes listed previously are allowed. There is one additional escape:
Backspace (\b): Outside a character class, \b matches word boundaries. Inside a character class, it matches the control character backspace.
Ranges: a source character or a class escape, followed by a dash (-), followed by a source character or a class escape.
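The rules for character specifications can be tried out directly. This is a minimal sketch with made-up variable names; it checks the two meanings of \b and the placement rules for the dash.

```javascript
// Inside a character class, \b is the backspace character (U+0008)
var backspaceInClass = /[\b]/.test('\u0008');
// Outside a class, \b is a word boundary and matches no character
var boundaryOutside = /^\b$/.test('\u0008');

// An unescaped dash is literal when it comes first in the class
var dashFirst = /^[-a]$/.test('-');
// Between two characters, a dash forms a range
var inRange = /^[a-c]$/.test('b');
var dashInRange = /^[a-c]$/.test('-');

console.log(backspaceInClass, boundaryOutside);  // true false
console.log(dashFirst, inRange, dashInRange);    // true true false
```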
To demonstrate using character classes, this example parses a date formatted in the ISO 8601 standard:
function parseIsoDate(str) {
    var match = /^([0-9]{4})-([0-9]{2})-([0-9]{2})$/.exec(str);

    // Other ways of writing the regular expression:
    // /^([0-9][0-9][0-9][0-9])-([0-9][0-9])-([0-9][0-9])$/
    // /^(\d\d\d\d)-(\d\d)-(\d\d)$/

    if (!match) {
        throw new Error('Not an ISO date: ' + str);
    }
    console.log('Year: ' + match[1]);
    console.log('Month: ' + match[2]);
    console.log('Day: ' + match[3]);
}
And here is the interaction:
> parseIsoDate('2001-12-24')
Year: 2001
Month: 12
Day: 24
The syntax for groups is as follows:
(«pattern») is a capturing group. Whatever is matched by pattern can be accessed via backreferences or as the result of a match operation.
(?:«pattern») is a noncapturing group. pattern is still matched against the input, but not saved as a capture. Therefore, the group does not have a number you can refer to (e.g., via a backreference).
\1, \2, and so on are known as backreferences; they refer back to a previously matched group. The number after the backslash can be any integer greater than or equal to 1, but the first digit must not be 0.
In this example, a backreference guarantees the same number of a’s before and after the dash:
> /^(a+)-\1$/.test('a-a')
true
> /^(a+)-\1$/.test('aaa-aaa')
true
> /^(a+)-\1$/.test('aa-a')
false
This example uses a backreference to match an HTML tag (obviously, you should normally use a proper parser to process HTML):
> var tagName = /<([^>]+)>[^<]*<\/\1>/;
> tagName.exec('<b>bold</b>')[1]
'b'
> tagName.exec('<strong>text</strong>')[1]
'strong'
> tagName.exec('<strong>text</stron>')
null
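The difference between capturing and noncapturing groups shows up in the numbering of the match result. The following sketch uses an invented year-month string to illustrate it.

```javascript
// Capturing group: the year is saved as group 1, the month as group 2
var capturing = /^(\d{4})-(\d{2})$/.exec('2001-12');

// Noncapturing group: the year is matched but not saved,
// so the month becomes group 1
var noncapturing = /^(?:\d{4})-(\d{2})$/.exec('2001-12');

console.log(capturing[1], capturing[2]);             // 2001 12
console.log(noncapturing[1]);                        // 12
console.log(capturing.length, noncapturing.length);  // 3 2
```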
Any atom (including character classes and groups) can be followed by a quantifier:
? means match zero times or once.
* means match zero or more times.
+ means match one or more times.
{n} means match exactly n times.
{n,} means match n or more times.
{n,m} means match at least n, at most m, times.
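The quantifiers in the list above can be compared side by side. This is a small sketch with illustrative patterns; anchoring with ^ and $ makes the counts exact.

```javascript
var exactlyTwo = /^a{2}$/;
var twoOrMore = /^a{2,}$/;
var twoToThree = /^a{2,3}$/;
var optionalS = /^cats?$/;

console.log(exactlyTwo.test('aa'), exactlyTwo.test('aaa'));    // true false
console.log(twoOrMore.test('a'), twoOrMore.test('aaaa'));      // false true
console.log(twoToThree.test('aaa'), twoToThree.test('aaaa'));  // true false
console.log(optionalS.test('cat'), optionalS.test('cats'));    // true true
```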
By default, quantifiers are greedy; that is, they match as much as possible. You can get reluctant matching (as little as possible) by suffixing any of the preceding quantifiers (including the ranges in curly braces) with a question mark (?). For example:
> '<a> <strong>'.match(/^<(.*)>/)[1] // greedy
'a> <strong'
> '<a> <strong>'.match(/^<(.*?)>/)[1] // reluctant
'a'
Thus, .*? is a useful pattern for matching everything until the next occurrence of the following atom. For example, the following is a more compact version of the regular expression for HTML tags just shown (which used [^<]* instead of .*?):
/<(.+?)>.*?<\/\1>/
Assertions, shown in the following list, are checks about the current position in the input:
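^ | Matches only at the beginning of the input. |
$ | Matches only at the end of the input. |
\b | Matches only at a word boundary. Don’t confuse with [\b], which matches a backspace. |
\B | Matches only if not at a word boundary. |
(?=«pattern») | Positive lookahead: Matches only if pattern matches what comes next. pattern is otherwise ignored. |
(?!«pattern») | Negative lookahead: Matches only if pattern does not match what comes next. pattern is otherwise ignored. |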
This example matches a word boundary via \b:
> /\bell\b/.test('hello')
false
> /\bell\b/.test('ello')
false
> /\bell\b/.test('ell')
true
This example matches the inside of a word via \B:
> /\Bell\B/.test('ell')
false
> /\Bell\B/.test('hell')
false
> /\Bell\B/.test('hello')
true
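Lookahead can be demonstrated in the same way. In this sketch (the strings are illustrative), note that the text checked by the lookahead does not become part of the match.

```javascript
// Positive lookahead: match 'foo' only if 'bar' comes next
var positive = /foo(?=bar)/.exec('foobar');

// Negative lookahead: match 'foo' only if 'bar' does not come next
var negativeHit = /foo(?!bar)/.test('foobaz');
var negativeMiss = /foo(?!bar)/.test('foobar');

console.log(positive[0]);                // foo (not 'foobar')
console.log(negativeHit, negativeMiss);  // true false
```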
Lookbehind is not supported. The section “Manually Implementing Lookbehind” explains how to implement it manually.
A disjunction operator (|) separates two alternatives; at least one of the alternatives must match for the disjunction to match. The alternatives are sequences of atoms (optionally including quantifiers).
The operator binds very weakly, so you have to be careful that the alternatives don’t extend too far.
For example, the following regular expression matches all strings that either start with aa or end with bb:
> /^aa|bb$/.test('aaxx')
true
> /^aa|bb$/.test('xxbb')
true
In other words, the disjunction binds more weakly than even ^ and $ and the two alternatives are ^aa and bb$. If you want to match the two strings 'aa' and 'bb', you need parentheses:
/^(aa|bb)$/
Similarly, if you want to match the strings 'aab' and 'abb':
/^a(a|b)b$/
JavaScript’s regular expressions have only very limited support for Unicode. Especially when it comes to code points in the astral planes, you have to be careful. Chapter 24 explains the details.
You can create a regular expression via either a literal or a constructor and configure how it works via flags.
There are two ways to create a regular expression: you can use a literal or the constructor RegExp:
Literal | /xyz/i | Compiled at load time |
Constructor (second argument is optional) | new RegExp('xyz', 'i') | Compiled at runtime |
A literal and a constructor differ in when they are compiled:
The literal is compiled at load time. The following code will cause an exception when it is evaluated:
function foo() {
    /[/;
}
The constructor compiles the regular expression when it is called. The following code will not cause an exception, but calling foo() will:
function foo() {
    new RegExp('[');
}
Thus, you should normally use literals, but you need the constructor if you want to dynamically assemble a regular expression.
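To make the case for the constructor concrete, here is a minimal sketch of assembling a regular expression at runtime. The helper name wordMatcher is hypothetical, and the words are assumed to contain no special characters (arbitrary text would have to be escaped first). The second half shows that a syntax error in the pattern surfaces only when the constructor runs, so it can be caught.

```javascript
// Hypothetical helper: builds a regex that matches any of the given
// words as a whole word, ignoring case
function wordMatcher(words) {
    return new RegExp('\\b(' + words.join('|') + ')\\b', 'i');
}

var fruit = wordMatcher(['apple', 'pear']);
console.log(fruit.test('An Apple a day'));  // true
console.log(fruit.test('pineapples'));      // false

// An invalid pattern throws at the time the constructor is called
var errorName = null;
try {
    new RegExp('[');
} catch (e) {
    errorName = e.name;
}
console.log(errorName);  // SyntaxError
```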
Flags are a suffix of regular expression literals and a parameter of regular expression constructors; they modify the matching behavior of regular expressions. The following flags exist:
Short name | Long name | Description |
g | global | The given regular expression is matched multiple times. Influences several methods. |
i | ignoreCase | Case is ignored when trying to match the given regular expression. |
m | multiline | In multiline mode, the begin operator ^ and the end operator $ match each line, instead of the complete input string. |
The short name is used for literal suffixes and constructor parameters (see examples in the next section). The long name is used for properties of a regular expression that indicate what flags were set during its creation.
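The effect of the multiline flag and the relationship between short and long names can be sketched as follows. The sample text is invented for illustration.

```javascript
var text = 'first\nsecond';

// Without /m, ^ matches only at the start of the whole input
var withoutM = /^second/.test(text);
// With /m, ^ also matches at the start of each line
var withM = /^second/m.test(text);
console.log(withoutM, withM);  // false true

// Short names in the literal, long names as properties
var re = /^second/gm;
console.log(re.global, re.ignoreCase, re.multiline);  // true false true
```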
Regular expressions have the following instance properties:
Flags: boolean values indicating what flags are set:
    global: Is flag /g set?
    ignoreCase: Is flag /i set?
    multiline: Is flag /m set?
Data for matching multiple times (flag /g is set):
    lastIndex is the index where to continue the search next time.
The following is an example of accessing the instance properties for flags:
> var regex = /abc/i;
> regex.ignoreCase
true
> regex.multiline
false
In this example, we create the same regular expression first with a literal, then with a constructor, and use the test() method to determine whether it matches a string:
> /abc/.test('ABC')
false
> new RegExp('abc').test('ABC')
false
In this example, we create a regular expression that ignores case (flag /i):
> /abc/i.test('ABC')
true
> new RegExp('abc', 'i').test('ABC')
true
The test() method checks whether a regular expression, regex, matches a string, str:
regex.test(str)
test() operates differently depending on whether the flag /g is set or not.
If the flag /g is not set, then the method checks whether there is a match somewhere in str. For example:
> var str = '_x_x';
> /x/.test(str)
true
> /a/.test(str)
false
If the flag /g is set, then the method returns true as many times as there are matches for regex in str. The property regex.lastIndex contains the index after the last match:
> var regex = /x/g;
> regex.lastIndex
0
> regex.test(str)
true
> regex.lastIndex
2
> regex.test(str)
true
> regex.lastIndex
4
> regex.test(str)
false
The search() method looks for a match with regex within str:
str.search(regex)
If there is a match, the index where it was found is returned. Otherwise, the result is -1. The properties global and lastIndex of regex are ignored as the search is performed (and lastIndex is not changed).
For example:
> 'abba'.search(/b/)
1
> 'abba'.search(/x/)
-1
If the argument of search() is not a regular expression, it is converted to one:
> 'aaab'.search('^a+b+$')
0
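That search() ignores global and lastIndex can be checked with a short sketch. The value 3 for lastIndex is an arbitrary choice for illustration.

```javascript
var regex = /b/g;
regex.lastIndex = 3;  // would matter for test() and exec()

var index = 'abba'.search(regex);
console.log(index);            // 1 (the search started at the beginning)
console.log(regex.lastIndex);  // 3 (unchanged)
```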
The following method call captures groups while matching regex against str:
var matchData = regex.exec(str);
If there was no match, matchData is null. Otherwise, matchData is a match result, an array with two additional properties:
    input is the complete input string.
    index is the index where the match was found.
If the flag /g is not set, only the first match is returned:
> var regex = /a(b+)/;
> regex.exec('_abbb_ab_')
[ 'abbb', 'bbb', index: 1, input: '_abbb_ab_' ]
> regex.lastIndex
0
If the flag /g is set, all matches are returned if you invoke exec() repeatedly. The return value null signals that there are no more matches. The property lastIndex indicates where matching will continue next time:
> var regex = /a(b+)/g;
> var str = '_abbb_ab_';
> regex.exec(str)
[ 'abbb', 'bbb', index: 1, input: '_abbb_ab_' ]
> regex.lastIndex
5
> regex.exec(str)
[ 'ab', 'b', index: 6, input: '_abbb_ab_' ]
> regex.lastIndex
8
> regex.exec(str)
null
Here we loop over matches:
var regex = /a(b+)/g;
var str = '_abbb_ab_';
var match;
while (match = regex.exec(str)) {
    console.log(match[1]);
}
and we get the following output:
bbb
b
The following method call matches regex against str:
var matchData = str.match(regex);
If the flag /g of regex is not set, this method works like RegExp.prototype.exec():
> 'abba'.match(/a/)
[ 'a', index: 0, input: 'abba' ]
If the flag is set, then the method returns an array with all matching substrings in str (i.e., group 0 of every match) or null if there is no match:
> 'abba'.match(/a/g)
[ 'a', 'a' ]
> 'abba'.match(/x/g)
null
The replace() method searches a string, str, for matches with search and replaces them with replacement:
str.replace(search, replacement)
There are several ways in which the two parameters can be specified:
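search
Either a string or a regular expression:
    String: Only the first occurrence of the string is replaced. If you want to replace multiple occurrences, you must use a regular expression with the /g flag. This is unexpected and a major pitfall.
    Regular expression: Use the global flag, otherwise only one attempt is made to match the regular expression.
replacement
Either a string or a function: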
If replacement is a string, its content is used verbatim to replace the match. The only exception is the special character dollar sign ($), which starts so-called replacement directives:
Groups: $n inserts group n from the match. n must be at least 1 ($0 has no special meaning).
The matching substring:
    $` (backtick) inserts the text before the match.
    $& inserts the complete match.
    $' (apostrophe) inserts the text after the match.
$$ inserts a single $.
This example refers to the matching substring and its prefix and suffix:
> 'axb cxd'.replace(/x/g, "[$`,$&,$']")
'a[a,x,b cxd]b c[axb c,x,d]d'
This example refers to a group:
> '"foo" and "bar"'.replace(/"(.*?)"/g, '#$1#')
'#foo# and #bar#'
If replacement is a function, it computes the string that is to replace the match. This function has the following signature:
function (completeMatch, group_1, ..., group_n, offset, inputStr)
completeMatch is the same as $& previously, offset indicates where the match was found, and inputStr is what is being matched against. Thus, you can use the special variable arguments to access groups (group 1 via arguments[1], and so on). For example:
> function replaceFunc(match) { return 2 * match }
> '3 apples and 5 oranges'.replace(/[0-9]+/g, replaceFunc)
'6 apples and 10 oranges'
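Groups can also be received as named parameters of the replacement function, following the signature shown above. The function swapPair and its input are hypothetical, chosen only to illustrate a pattern with two groups.

```javascript
// Swap the two parts of each 'key=value' pair.
// key and value are groups 1 and 2; offset and inputStr follow the groups.
function swapPair(completeMatch, key, value, offset, inputStr) {
    return value + '=' + key;
}

var swapped = 'a=1, b=2'.replace(/(\w+)=(\w+)/g, swapPair);
console.log(swapped);  // 1=a, 2=b
```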
Regular expressions whose /g flag is set are problematic if a method invoked on them must be invoked multiple times to return all results. That’s the case for two methods:
    RegExp.prototype.test()
    RegExp.prototype.exec()
Then JavaScript abuses the regular expression as an iterator, as a pointer into the sequence of results. That causes problems:
Problem 1: /g regular expressions can’t be inlined
For example:
// Don’t do that:
var count = 0;
while (/a/g.test('babaa')) count++;
The preceding loop is infinite, because a new regular expression is created for each loop iteration, which restarts the iteration over the results. Therefore, the code must be rewritten:
var count = 0;
var regex = /a/g;
while (regex.test('babaa')) count++;
Here is another example:
// Don’t do that:
function extractQuoted(str) {
    var match;
    var result = [];
    while ((match = /"(.*?)"/g.exec(str)) != null) {
        result.push(match[1]);
    }
    return result;
}
Calling the preceding function with a string that contains a quoted substring will again result in an infinite loop. The correct version is (why lastIndex is set to 0 is explained shortly):
var QUOTE_REGEX = /"(.*?)"/g;
function extractQuoted(str) {
    QUOTE_REGEX.lastIndex = 0;
    var match;
    var result = [];
    while ((match = QUOTE_REGEX.exec(str)) != null) {
        result.push(match[1]);
    }
    return result;
}
Using the function:
> extractQuoted('"hello", "world"')
[ 'hello', 'world' ]
It’s a best practice not to inline anyway (then you can give regular expressions descriptive names). But you have to be aware that you can’t do it, not even in quick hacks.
Problem 2: /g regular expressions as parameters
Code that wants to invoke test() and exec() multiple times must be careful with a regular expression handed to it as a parameter. Its flag /g must be active and, to be safe, its lastIndex should be set to zero (an explanation is offered in the next example).
Problem 3: Shared /g regular expressions (e.g., constants)
Whenever you are referring to a regular expression that has not been freshly created, you should set its lastIndex property to zero before using it as an iterator (an explanation is offered in the next example). As iteration depends on lastIndex, such a regular expression can’t be used in more than one iteration at the same time.
The following example illustrates problem 2. It is a naive implementation of a function that counts how many matches there are for the regular expression regex in the string str:
// Naive implementation
function countOccurrences(regex, str) {
    var count = 0;
    while (regex.test(str)) count++;
    return count;
}
Here’s an example of using this function:
> countOccurrences(/x/g, '_x_x')
2
The first problem is that this function goes into an infinite loop if the regular expression’s /g flag is not set and there is at least one match. For example:
countOccurrences(/x/, '_x_x') // never terminates
The second problem is that the function doesn’t work correctly if regex.lastIndex isn’t 0, because that property indicates where to start the search. For example:
> var regex = /x/g;
> regex.lastIndex = 2;
> countOccurrences(regex, '_x_x')
1
The following implementation fixes the two problems:
function countOccurrences(regex, str) {
    if (!regex.global) {
        throw new Error('Please set flag /g of regex');
    }
    var origLastIndex = regex.lastIndex; // store
    regex.lastIndex = 0;

    var count = 0;
    while (regex.test(str)) count++;

    regex.lastIndex = origLastIndex; // restore
    return count;
}
A simpler alternative is to use match():
function countOccurrences(regex, str) {
    if (!regex.global) {
        throw new Error('Please set flag /g of regex');
    }
    return (str.match(regex) || []).length;
}
There’s one possible pitfall: str.match() returns null if the /g flag is set and there are no matches. We avoid that pitfall in the preceding code by using [] if the result of match() isn’t truthy.
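Another way to sidestep the shared-state problems described in this section is to iterate over a private copy of the regular expression. The following is a sketch under stated assumptions, not a definitive implementation: the helper name copyAsGlobal is hypothetical, it copies only the three flags covered in this chapter, and the counting loop assumes that the pattern never matches the empty string (otherwise test() would not advance).

```javascript
// Hypothetical helper: returns a fresh /g copy of regex, so that
// iterating over it neither depends on nor changes regex.lastIndex
function copyAsGlobal(regex) {
    var flags = 'g';
    if (regex.ignoreCase) flags += 'i';
    if (regex.multiline) flags += 'm';
    return new RegExp(regex.source, flags);
}

function countOccurrences(regex, str) {
    var copy = copyAsGlobal(regex);
    var count = 0;
    while (copy.test(str)) count++;
    return count;
}

var shared = /x/g;
shared.lastIndex = 2;
console.log(countOccurrences(shared, '_x_x'));  // 2
console.log(shared.lastIndex);                  // 2 (untouched)
console.log(countOccurrences(/x/, '_x_x'));     // 2 (no /g needed)
```

The trade-off is that a new regular expression is compiled on every call.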
This section gives a few tips and tricks for working with regular expressions in JavaScript.
Sometimes, when you assemble a regular expression manually, you want to use a given string verbatim. That means that none of the special characters (e.g., *, [) should be interpreted as such; all of them need to be escaped. JavaScript has no built-in means for this kind of quoting, but you can program your own function, quoteText, that would work as follows:
> console.log(quoteText('*All* (most?) aspects.'))
\*All\* \(most\?\) aspects\.
Such a function is especially handy if you need to do a search and replace with multiple occurrences. Then the value to search for must be a regular expression with the global flag set. With quoteText(), you can use arbitrary strings. The function looks like this:
function quoteText(text) {
    return text.replace(/[\\^$.*+?()[\]{}|=!<>:-]/g, '\\$&');
}
All special characters are escaped, because you may want to quote several characters inside parentheses or square brackets.
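The search-and-replace use case mentioned above can be sketched as follows. The helper name replaceAllLiteral is hypothetical; quoteText is repeated so that the example is self-contained. Note that the replacement string is still subject to the $ directives of replace().

```javascript
function quoteText(text) {
    return text.replace(/[\\^$.*+?()[\]{}|=!<>:-]/g, '\\$&');
}

// Hypothetical helper: replaces every occurrence of the literal
// string search in str
function replaceAllLiteral(str, search, replacement) {
    return str.replace(new RegExp(quoteText(search), 'g'), replacement);
}

// A string as search value replaces only the first occurrence
console.log('1+1=2, 1+1=2'.replace('1+1', 'x'));             // x=2, 1+1=2
// With a quoted /g regular expression, all occurrences are replaced
console.log(replaceAllLiteral('1+1=2, 1+1=2', '1+1', 'x'));  // x=2, x=2
```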
If you don’t use assertions such as ^ and $, most regular expression methods find a pattern anywhere. For example:
> /aa/.test('xaay')
true
> /^aa$/.test('xaay')
false
The empty regular expression matches everything. We can create an instance of RegExp based on that regular expression like this:
> new RegExp('').test('dfadsfdsa')
true
> new RegExp('').test('')
true
However, the empty regular expression literal would be //, which is interpreted as a comment by JavaScript. Therefore, the following is the closest you can get via a literal: /(?:)/ (empty noncapturing group). The group matches everything, while not capturing anything, which prevents the group from influencing the result returned by exec(). Even JavaScript itself uses the preceding representation when displaying an empty regular expression:
> new RegExp('')
/(?:)/
The empty regular expression has an inverse: the regular expression that matches nothing.
> var never = /.^/;
> never.test('abc')
false
> never.test('')
false
Lookbehind is an assertion. Similar to lookahead, a pattern is used to check something about the current position in the input, but otherwise ignored. In contrast to lookahead, the match for the pattern has to end at the current position (not start at it).
The following function replaces each occurrence of the string 'NAME' with the value of the parameter name, but only if the occurrence is not preceded by a quote. We handle the quote by “manually” checking the character before the current match:
function insertName(str, name) {
    return str.replace(
        /NAME/g,
        function (completeMatch, offset) {
            if (offset === 0 ||
                (offset > 0 && str[offset-1] !== '"')) {
                return name;
            } else {
                return completeMatch;
            }
        }
    );
}
> insertName('NAME "NAME"', 'Jane')
'Jane "NAME"'
> insertName('"NAME" NAME', 'Jane')
'"NAME" Jane'
An alternative is to include the character that precedes a match in the regular expression itself. Then you have to temporarily add a prefix to the string you are searching in; otherwise, you’d miss matches at the beginning of that string:
function insertName(str, name) {
    var tmpPrefix = ' ';
    str = tmpPrefix + str;
    str = str.replace(
        /([^"])NAME/g,
        function (completeMatch, prefix) {
            return prefix + name;
        }
    );
    return str.slice(tmpPrefix.length); // remove tmpPrefix
}
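For comparison, the prefix-based variant can be run on the same inputs as the first version. The function is repeated here so that the sketch is self-contained; the last call is an additional, illustrative input. It shows one difference worth knowing: because the preceding character is consumed as part of each match, directly adjacent occurrences are not all replaced.

```javascript
function insertName(str, name) {
    var tmpPrefix = ' ';
    str = tmpPrefix + str;
    str = str.replace(/([^"])NAME/g, function (completeMatch, prefix) {
        return prefix + name;
    });
    return str.slice(tmpPrefix.length);  // remove tmpPrefix
}

console.log(insertName('NAME "NAME"', 'Jane'));  // Jane "NAME"
console.log(insertName('"NAME" NAME', 'Jane'));  // "NAME" Jane
// The second NAME has no unconsumed character before it
console.log(insertName('NAMENAME', 'Jane'));     // JaneNAME
```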
Atoms (see Atoms: General):
    . (dot) matches everything except line terminators (e.g., newlines). Use [\s\S] to really match everything.
    Character class escapes:
        \d matches digits ([0-9]); \D matches nondigits ([^0-9]).
        \w matches Latin alphanumeric characters plus underscore ([A-Za-z0-9_]); \W matches all other characters.
        \s matches all whitespace characters (space, tab, line feed, etc.); \S matches all nonwhitespace characters.
    Character class (set of characters): [...] and [^...]
        Source characters: [abc] (all characters except \ ] - match themselves)
        Character class escapes: [\d\w]
        Ranges: [A-Za-z0-9]
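    Groups:
        Capturing group: (...); backreference: \1
        Noncapturing group: (?:...)
Quantifiers (see Quantifiers):
    Greedy:
        ? * +
        {n} {n,} {n,m}
    Reluctant: Put a ? after any of the greedy quantifiers.
Assertions (see Assertions):
    Beginning of input, end of input: ^ $
    At a word boundary, not at a word boundary: \b \B
    Positive lookahead: (?=...) (pattern must come next, but is otherwise ignored)
    Negative lookahead: (?!...) (pattern must not come next, but is otherwise ignored)
Disjunction: |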
Creating a regular expression (see Creating a Regular Expression):
    Literal: /xyz/i (compiled at load time)
    Constructor: new RegExp('xyz', 'i') (compiled at runtime)
Flags (see Flags):
    global: /g (influences several regular expression methods)
    ignoreCase: /i
    multiline: /m (^ and $ match per line, as opposed to the complete input)
Methods:
    regex.test(str): Is there a match (see RegExp.prototype.test: Is There a Match?)?
        /g is not set: Is there a match somewhere?
        /g is set: Return true as many times as there are matches.
    str.search(regex): At what index is there a match (see String.prototype.search: At What Index Is There a Match?)?
    regex.exec(str): Capture groups (see the section RegExp.prototype.exec: Capture Groups)
        /g is not set: Capture groups of first match only (invoked once)
        /g is set: Capture groups of all matches (invoked repeatedly; returns null if there are no more matches)
    str.match(regex): Capture groups or return all matching substrings (see String.prototype.match: Capture Groups or Return All Matching Substrings)
        /g is not set: Capture groups
        /g is set: Return all matching substrings in an array
    str.replace(search, replacement): Search and replace (see String.prototype.replace: Search and Replace)
        search: String or regular expression (use the latter, set /g!)
        replacement: String (with $1, etc.) or function (arguments[1] is group 1, etc.) that returns a string
For tips on using the flag /g, see Problems with the Flag /g.
Mathias Bynens (@mathias) and Juan Ignacio Dopazo (@juandopazo) recommended using match() and test() for counting occurrences, and Šime Vidas (@simevidas) warned me about being careful with match() if there are no matches. The pitfall of the global flag causing infinite loops comes from a talk by Andrea Giammarchi (@webreflection). Claude Pache told me to escape more characters in quoteText().