Strings are primitive values in JavaScript and immutable. That is, string-related operations always produce new strings and never change existing strings.
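A minimal sketch of what immutability means in practice (the variable names are only illustrative): operations such as .toUpperCase() and + return new strings, while the string we started with stays as it was.

```javascript
const original = 'abc';

// Each operation produces a new string...
const upper = original.toUpperCase(); // 'ABC'
const longer = original + 'def'; // 'abcdef'

// ...and the original string is unchanged:
console.log(original); // logs: abc
```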
Literals for strings:
const str1 = 'Don\'t say "goodbye"'; // string literal
const str2 = "Don't say \"goodbye\""; // string literal
assert.equal(
`As easy as ${123}!`, // template literal
'As easy as 123!',
);
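Backslashes are used to escape literal delimiters (as in the first two lines of the previous example) and to represent special characters:
\\ represents a backslash
\n represents a newline
\r represents a carriage return
\t represents a tab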
Inside a String.raw tagged template (line A), backslashes are treated as normal characters:
assert.equal(
String.raw`\ \n\t`, // (A)
'\\ \\n\\t',
);
Converting values to strings:
> String(undefined)
'undefined'
> String(null)
'null'
> String(123.45)
'123.45'
> String(true)
'true'
Copying parts of a string
// There is no type for characters;
// reading characters produces strings:
const str3 = 'abc';
assert.equal(
str3[2], 'c' // no negative indices allowed
);
assert.equal(
str3.at(-1), 'c' // negative indices allowed
);
// Copying more than one character:
assert.equal(
'abc'.slice(0, 2), 'ab'
);
Concatenating strings:
assert.equal(
'I bought ' + 3 + ' apples',
'I bought 3 apples',
);
let str = '';
str += 'I bought ';
str += 3;
str += ' apples';
assert.equal(
str, 'I bought 3 apples',
);
JavaScript characters are 16 bits in size. They are what is indexed in strings and what .length counts.
Code points are the atomic parts of Unicode text. Most of them fit into one JavaScript character, some of them occupy two (especially emojis):
assert.equal(
'A'.length, 1
);
assert.equal(
'🙂'.length, 2
);
Grapheme clusters (user-perceived characters) represent written symbols. Each one comprises one or more code points.
Due to these facts, we shouldn’t split text into JavaScript characters; we should split it into grapheme clusters. For more information on how to handle text, see “Atoms of text: code points, JavaScript characters, grapheme clusters” (§22.7).
This subsection gives a brief overview of the string API. There is a more comprehensive quick reference at the end of this chapter.
Finding substrings:
> 'abca'.includes('a')
true
> 'abca'.startsWith('ab')
true
> 'abca'.endsWith('ca')
true
> 'abca'.indexOf('a')
0
> 'abca'.lastIndexOf('a')
3
Splitting and joining:
assert.deepEqual(
'a, b,c'.split(/, ?/),
['a', 'b', 'c']
);
assert.equal(
['a', 'b', 'c'].join(', '),
'a, b, c'
);
Padding and trimming:
> '7'.padStart(3, '0')
'007'
> 'yes'.padEnd(6, '!')
'yes!!!'
> '\t abc\n '.trim()
'abc'
> '\t abc\n '.trimStart()
'abc\n '
> '\t abc\n '.trimEnd()
'\t abc'
Repeating and changing case:
> '*'.repeat(5)
'*****'
> '= b2b ='.toUpperCase()
'= B2B ='
> 'ΑΒΓ'.toLowerCase()
'αβγ'
Plain string literals are delimited by either single quotes or double quotes:
const str1 = 'abc';
const str2 = "abc";
assert.equal(str1, str2);
Single quotes are used more often because that makes it easier to mention HTML, where double quotes are preferred.
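The next chapter covers template literals, which give us, among other things, string interpolation and raw text (where backslashes are treated as normal characters).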
The backslash lets us create special characters:
'\n'
'\r\n'
'\t'
'\\'
The backslash also lets us use the delimiter of a string literal inside that literal:
assert.equal(
'She said: "Let\'s go!"',
"She said: \"Let's go!\"");
JavaScript has no extra data type for characters – characters are always represented as strings.
const str = 'abc';
// Reading a JavaScript character at a given index
assert.equal(str[1], 'b');
// Counting the JavaScript characters in a string:
assert.equal(str.length, 3);
The characters we see on screen are called grapheme clusters. Most of them are represented by single JavaScript characters. However, there are also grapheme clusters (especially emojis) that are represented by multiple JavaScript characters:
> '🙂'.length
2
How that works is explained in “Atoms of text: code points, JavaScript characters, grapheme clusters” (§22.7).
If at least one operand is a string, the plus operator (+) converts any non-strings to strings and concatenates the result:
assert.equal(3 + ' times ' + 4, '3 times 4');
The assignment operator += is useful if we want to assemble a string, piece by piece:
let str = ''; // must be `let`!
str += 'Say it';
str += ' one more';
str += ' time';
assert.equal(str, 'Say it one more time');
Concatenating via + is efficient
Using + to assemble strings is quite efficient because most JavaScript engines internally optimize it.
Exercise: Concatenating strings
exercises/strings/concat_string_array_test.mjs
Occasionally, taking a detour via an Array (using .push() and .join()) can be useful for concatenating strings, especially if there is to be a separator between them (such as ', ' in line A):
function getPackingList(isAbroad = false, days = 1) {
  const items = [];
  items.push('tooth brush');
  if (isAbroad) {
    items.push('passport');
  }
  if (days > 3) {
    items.push('water bottle');
  }
  return items.join(', '); // (A)
}
assert.equal(
getPackingList(),
'tooth brush'
);
assert.equal(
getPackingList(true, 7),
'tooth brush, passport, water bottle'
);
These are three ways of converting a value x to a string:
String(x)
''+x
x.toString() (does not work for undefined and null)
Recommendation: use the descriptive and safe String().
Examples:
assert.equal(String(undefined), 'undefined');
assert.equal(String(null), 'null');
assert.equal(String(false), 'false');
assert.equal(String(true), 'true');
assert.equal(String(123.45), '123.45');
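The differences between the three approaches show up with undefined, null, and symbols. The following is a small sketch; the helper tryConvert and the three wrapper functions are hypothetical names introduced only for this illustration.

```javascript
// Hypothetical helper: runs a conversion and returns either the
// resulting string or the name of the error that was thrown.
function tryConvert(convert, value) {
  try {
    return convert(value);
  } catch (err) {
    return err.name;
  }
}

const viaString = (x) => String(x);
const viaPlus = (x) => '' + x;
const viaToString = (x) => x.toString();

// undefined has no methods, so .toString() throws:
console.log(tryConvert(viaString, undefined)); // logs: undefined
console.log(tryConvert(viaPlus, undefined)); // logs: undefined
console.log(tryConvert(viaToString, undefined)); // logs: TypeError

// Symbols can't be converted implicitly, so ''+x throws:
console.log(tryConvert(viaString, Symbol('abc'))); // logs: Symbol(abc)
console.log(tryConvert(viaPlus, Symbol('abc'))); // logs: TypeError
console.log(tryConvert(viaToString, Symbol('abc'))); // logs: Symbol(abc)
```

Only String() handles every value here without throwing, which is in line with the recommendation above.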
Pitfall for booleans: If we convert a boolean to a string via String(), we generally can’t convert it back via Boolean():
> String(false)
'false'
> Boolean('false')
true
The only string for which Boolean() returns false is the empty string.
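If a stringified boolean has to be converted back, one option is to compare against the two possible strings explicitly. This is only a sketch; parseBooleanString is a hypothetical helper, not a built-in function.

```javascript
// Hypothetical helper: the inverse of String() for booleans.
// Throws for any string other than 'true' or 'false'.
function parseBooleanString(str) {
  if (str === 'true') return true;
  if (str === 'false') return false;
  throw new TypeError('Not a stringified boolean: ' + JSON.stringify(str));
}

console.log(parseBooleanString(String(false))); // logs: false
console.log(parseBooleanString(String(true))); // logs: true
```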
Plain objects have a default string representation that is not very useful:
> String({a: 1})
'[object Object]'
Arrays have a better string representation, but it still hides much information:
> String(['a', 'b'])
'a,b'
> String(['a', ['b']])
'a,b'
> String([1, 2])
'1,2'
> String(['1', '2'])
'1,2'
> String([true])
'true'
> String(['true'])
'true'
> String(true)
'true'
Stringifying functions returns their source code:
> String(function f() {return 4})
'function f() {return 4}'
We can override the built-in way of stringifying objects by implementing the method toString():
const obj = {
  toString() {
    return 'hello';
  }
};
assert.equal(String(obj), 'hello');
The JSON data format is a text representation of JavaScript values. Therefore, JSON.stringify() can also be used to convert values to strings:
> JSON.stringify({a: 1})
'{"a":1}'
> JSON.stringify(['a', ['b']])
'["a",["b"]]'
The caveat is that JSON only supports null, booleans, numbers, strings, Arrays, and objects (which it always treats as if they were created by object literals).
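The following sketch shows what this caveat looks like in practice for a few unsupported values (the variable names are only illustrative):

```javascript
// Properties whose values JSON doesn't support are omitted:
const omitted = JSON.stringify({a: undefined, b: () => {}, c: 1});
console.log(omitted); // logs: {"c":1}

// Inside Arrays, unsupported values become null:
const inArray = JSON.stringify([undefined, NaN, Infinity]);
console.log(inArray); // logs: [null,null,null]

// A Map is treated like a plain object without properties:
const fromMap = JSON.stringify(new Map([['a', 1]]));
console.log(fromMap); // logs: {}

// At the top level, an unsupported value produces undefined (not a string):
const topLevel = JSON.stringify(undefined);
console.log(topLevel); // logs: undefined
```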
Tip: The third parameter lets us switch on multiline output and specify how much to indent – for example:
console.log(JSON.stringify({first: 'Jane', last: 'Doe'}, null, 2));
This statement produces the following output:
{
  "first": "Jane",
  "last": "Doe"
}
Strings can be compared via the following operators:
< <= > >=
There is one important caveat to consider: These operators compare based on the numeric values of JavaScript characters. That means that the order that JavaScript uses for strings is different from the one used in dictionaries and phone books:
> 'A' < 'B' // ok
true
> 'a' < 'B' // not ok
false
> 'ä' < 'b' // not ok
false
Properly comparing text is beyond the scope of this book. It is supported via the ECMAScript Internationalization API (Intl).
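As a brief pointer, the following sketch contrasts the default sort order with one produced by Intl.Collator. The locale 'en' and the sample words are assumptions chosen for this illustration.

```javascript
const words = ['c', 'ä', 'B', 'a'];

// Default sorting compares the numeric values of JavaScript characters:
const byCodeUnits = [...words].sort();
console.log(byCodeUnits); // logs: [ 'B', 'a', 'c', 'ä' ]

// A collator compares the way a dictionary would:
const collator = new Intl.Collator('en');
const byLocale = [...words].sort(collator.compare);
console.log(byLocale); // logs: [ 'a', 'ä', 'B', 'c' ]
```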
Quick recap of “Unicode – a brief introduction” (§21):
Code points are the atomic parts of Unicode text. Each code point is 21 bits in size.
JavaScript strings implement Unicode via the encoding format UTF-16. It uses one or two 16-bit code units to encode a single code point.
Grapheme clusters (user-perceived characters) represent written symbols, as displayed on screen or paper. One or more code points are needed to encode a single grapheme cluster.
The following code demonstrates that a single code point comprises one or two JavaScript characters. We count the latter via .length:
// 3 code points, 3 JavaScript characters:
assert.equal('abc'.length, 3);
// 1 code point, 2 JavaScript characters:
assert.equal('🙂'.length, 2);
The following table summarizes the concepts we have just explored:
Entity | Size | Encoded via |
---|---|---|
JavaScript character (UTF-16 code unit) | 16 bits | – |
Unicode code point | 21 bits | 1–2 code units |
Unicode grapheme cluster | – | 1+ code points |
Let’s explore JavaScript’s tools for working with code points.
A Unicode code point escape lets us specify a code point hexadecimally (1–6 digits, since the highest code point is 0x10FFFF). It produces one or two JavaScript characters.
> '\u{1F642}'
'🙂'
Unicode escape sequences
In the ECMAScript language specification, Unicode code point escapes and Unicode code unit escapes (which we’ll encounter later) are called Unicode escape sequences.
String.fromCodePoint() converts a single code point to 1–2 JavaScript characters:
> String.fromCodePoint(0x1F642)
'🙂'
.codePointAt() converts 1–2 JavaScript characters to a single code point:
> '🙂'.codePointAt(0).toString(16)
'1f642'
We can iterate over a string, which visits code points (not JavaScript characters). Iteration is described later in this book. One way of iterating is via a for-of loop:
const str = '🙂a';
assert.equal(str.length, 3);
for (const codePointChar of str) {
  console.log(codePointChar);
}
Output:
🙂
a
Array.from() is also based on iteration and visits code points:
> Array.from('🙂a')
[ '🙂', 'a' ]
That makes it a good tool for counting code points:
> Array.from('🙂a').length
2
> '🙂a'.length
3
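Combining iteration with .codePointAt() gives us a way to inspect the code points of a string. The following is a minimal sketch; listCodePoints is a hypothetical helper and the U+XXXX output format is just a common convention.

```javascript
// Hypothetical helper: lists the code points of a string
// in U+XXXX notation (at least four hexadecimal digits).
function listCodePoints(str) {
  return Array.from(str, (codePointChar) => {
    const hex = codePointChar.codePointAt(0).toString(16).toUpperCase();
    return 'U+' + hex.padStart(4, '0');
  });
}

console.log(listCodePoints('🙂a')); // logs: [ 'U+1F642', 'U+0061' ]
```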
Indices and lengths of strings are based on JavaScript characters (as represented by UTF-16 code units).
To specify a code unit hexadecimally, we can use a Unicode code unit escape with exactly four hexadecimal digits:
> '\uD83D\uDE42'
'🙂'
And we can use String.fromCharCode(). Char code is the standard library’s name for code unit:
> String.fromCharCode(0xD83D) + String.fromCharCode(0xDE42)
'🙂'
To get the char code of a character, use .charCodeAt():
> '🙂'.charCodeAt(0).toString(16)
'd83d'
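To see all code units of a string at once, we can loop over its indices. A small sketch, in which listCodeUnits is a hypothetical helper name:

```javascript
// Hypothetical helper: lists the UTF-16 code units of a string
// as four-digit hexadecimal numbers.
function listCodeUnits(str) {
  const result = [];
  for (let i = 0; i < str.length; i++) {
    result.push(str.charCodeAt(i).toString(16).padStart(4, '0'));
  }
  return result;
}

console.log(listCodeUnits('🙂a')); // logs: [ 'd83d', 'de42', '0061' ]
```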
If the code point of a character is below 256, we can refer to it via an ASCII escape with exactly two hexadecimal digits:
> 'He\x6C\x6Co'
'Hello'
(The official name of ASCII escapes is Hexadecimal escape sequences – it was the first escape that used hexadecimal numbers.)
When working with text that may be written in any human language, it’s best to split at the boundaries of grapheme clusters, not at the boundaries of code points.
Intl.Segmenter is part of the ECMAScript Internationalization API and supports Unicode segmentation (along grapheme cluster boundaries, word boundaries, sentence boundaries, etc.).
On platforms where it is not available, we can use one of several libraries (do a web search for “JavaScript grapheme”).
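Where Intl.Segmenter is available (current versions of Node.js and the major browsers ship it), splitting into grapheme clusters can be sketched as follows. The helper splitIntoGraphemes and the default locale 'en' are assumptions made for this example.

```javascript
// Hypothetical helper: splits a string into grapheme clusters.
// Assumes that the engine supports Intl.Segmenter.
function splitIntoGraphemes(str, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, {granularity: 'grapheme'});
  return Array.from(segmenter.segment(str), (s) => s.segment);
}

// 'e' followed by a combining acute accent, then an emoji:
const text = 'e\u0301🙂';

console.log(text.length); // logs: 4 (JavaScript characters)
console.log(Array.from(text).length); // logs: 3 (code points)
console.log(splitIntoGraphemes(text).length); // logs: 2 (grapheme clusters)
```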
Table 22.1 describes how various values are converted to strings.
x | String(x) |
---|---|
undefined | 'undefined' |
null | 'null' |
boolean | false → 'false' , true → 'true' |
number | Example: 123 → '123' |
bigint | Example: 123n → '123' |
string | x (input, unchanged) |
symbol | Example: Symbol('abc') → 'Symbol(abc)' |
object | Configurable via, e.g., toString() |
Table 22.1: Converting values to strings.
String.fromCharCode() [ES1]
.charCodeAt() [ES1]
String.fromCodePoint() [ES6]
.codePointAt() [ES6]
String.prototype.*: finding and matching
String.prototype.startsWith(searchString, startPos=0) [ES6]
Returns true if searchString occurs in the string at index startPos. Returns false otherwise.
> '.gitignore'.startsWith('.')
true
> 'abcde'.startsWith('bc', 1)
true
String.prototype.endsWith(searchString, endPos=this.length) [ES6]
Returns true if the string would end with searchString if its length were endPos. Returns false otherwise.
> 'poem.txt'.endsWith('.txt')
true
> 'abcde'.endsWith('cd', 4)
true
String.prototype.includes(searchString, startPos=0) [ES6]
Returns true if the string contains the searchString and false otherwise. The search starts at startPos.
> 'abc'.includes('b')
true
> 'abc'.includes('b', 2)
false
String.prototype.indexOf(searchString, minIndex=0) [ES1]
Returns the lowest index at which searchString appears within the string, or -1 if it doesn’t appear. Any returned index will be minIndex or higher.
> 'abab'.indexOf('a')
0
> 'abab'.indexOf('a', 1)
2
> 'abab'.indexOf('c')
-1
String.prototype.lastIndexOf(searchString, maxIndex=Infinity) [ES1]
Returns the highest index at which searchString appears within the string, or -1 if it doesn’t appear. Any returned index will be maxIndex or lower.
> 'abab'.lastIndexOf('ab', 2)
2
> 'abab'.lastIndexOf('ab', 1)
0
> 'abab'.lastIndexOf('ab')
2
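The second parameter of .indexOf() makes it possible to continue a search after a previous hit. The following sketch collects all indices of a substring; findAllIndices is a hypothetical helper, and returning an empty Array for an empty search string is a design choice of this example.

```javascript
// Hypothetical helper: returns all indices at which searchString
// occurs in str (overlapping occurrences included).
function findAllIndices(str, searchString) {
  const indices = [];
  if (searchString.length === 0) return indices; // avoid an endless loop
  let index = str.indexOf(searchString);
  while (index !== -1) {
    indices.push(index);
    // Continue searching right after the start of the previous hit:
    index = str.indexOf(searchString, index + 1);
  }
  return indices;
}

console.log(findAllIndices('abab', 'ab')); // logs: [ 0, 2 ]
console.log(findAllIndices('aaa', 'aa')); // logs: [ 0, 1 ]
console.log(findAllIndices('abab', 'c')); // logs: []
```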
String.prototype.match(regExpOrString) [ES3]
(1 of 2) regExpOrString is RegExp without /g or string.
match(
regExpOrString: string | RegExp
): null | RegExpMatchArray
If regExpOrString is a regular expression with flag /g not set, then .match() returns the first match for regExpOrString within the string, or null if there is no match.
If regExpOrString is a string, it is used to create a regular expression (think parameter of new RegExp()) before performing the previously mentioned steps.
The result has the following type:
interface RegExpMatchArray extends Array<string> {
  index: number;
  input: string;
  groups: undefined | {
    [key: string]: string
  };
}
Numbered capture groups become Array indices (which is why this type extends Array). Named capture groups (ES2018) become properties of .groups. In this mode, .match() works like RegExp.prototype.exec().
Examples:
> 'ababb'.match(/a(b+)/)
{ 0: 'ab', 1: 'b', index: 0, input: 'ababb', groups: undefined }
> 'ababb'.match(/a(?<foo>b+)/)
{ 0: 'ab', 1: 'b', index: 0, input: 'ababb', groups: { foo: 'b' } }
> 'abab'.match(/x/)
null
(2 of 2) regExpOrString is RegExp with /g.
match(
regExpOrString: RegExp
): null | Array<string>
If flag /g of regExpOrString is set, .match() returns either an Array with all matches or null if there was no match.
> 'ababb'.match(/a(b+)/g)
[ 'ab', 'abb' ]
> 'ababb'.match(/a(?<foo>b+)/g)
[ 'ab', 'abb' ]
> 'abab'.match(/x/g)
null
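As the examples show, .match() with /g returns only the complete matches; captures are not included. If we need the captures of every match, one option is .matchAll() (ES2020), which is not covered in this section. A minimal sketch with illustrative variable and property names:

```javascript
// .matchAll() returns an iterable over match objects,
// each of which has the same shape as a result of .match() without /g.
const allMatches = Array.from(
  'ababb'.matchAll(/a(?<foo>b+)/g),
  (m) => ({match: m[0], index: m.index, foo: m.groups.foo})
);

console.log(allMatches);
// logs:
// [
//   { match: 'ab', index: 0, foo: 'b' },
//   { match: 'abb', index: 2, foo: 'bb' }
// ]
```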
String.prototype.search(regExpOrString) [ES3]
Returns the index at which regExpOrString first occurs within the string, or -1 if there is no match. If regExpOrString is a string, it is used to create a regular expression (think parameter of new RegExp()).
> 'a2b'.search(/[0-9]/)
1
> 'a2b'.search('[0-9]')
1
String.prototype.*: extracting
String.prototype.slice(start=0, end=this.length) [ES3]
Returns the substring of the string that starts at (including) index start and ends at (excluding) index end. If an index is negative, it is added to .length before it is used (-1 becomes this.length-1, etc.).
> 'abc'.slice(1, 3)
'bc'
> 'abc'.slice(1)
'bc'
> 'abc'.slice(-2)
'bc'
String.prototype.at(index: number) [ES2022]
Returns the JavaScript character at index as a string.
If index is out of bounds, it returns undefined.
If index is negative, it is added to .length before it is used (-1 becomes this.length-1, etc.).
> 'abc'.at(0)
'a'
> 'abc'.at(-1)
'c'
String.prototype.split(separator, limit?) [ES3]
Splits the string into an Array of substrings – the strings that occur between the separators.
The separator can be a string:
> 'a : b : c'.split(':')
[ 'a ', ' b ', ' c' ]
It can also be a regular expression:
> 'a : b : c'.split(/ *: */)
[ 'a', 'b', 'c' ]
> 'a : b : c'.split(/( *):( *)/)
[ 'a', ' ', ' ', 'b', ' ', ' ', 'c' ]
The last invocation demonstrates that captures made by groups in the regular expression become elements of the returned Array.
If we want the separators to be part of the returned string fragments, we can use a regular expression with a lookbehind assertion:
> 'a : b : c'.split(/(?<=:)/)
[ 'a :', ' b :', ' c' ]
Thanks to the lookbehind assertion, the regular expression used for splitting matches but doesn’t capture any characters (which would be taken away from the output fragments).
Warning about .split(''): Using the method this way splits a string into JavaScript characters. That doesn’t work well when dealing with astral code points (which are encoded as two JavaScript characters). For example, emojis are astral:
> '🙂X🙂'.split('')
[ '\uD83D', '\uDE42', 'X', '\uD83D', '\uDE42' ]
Instead, it is better to use Array.from() (or spreading):
> Array.from('🙂X🙂')
[ '🙂', 'X', '🙂' ]
String.prototype.substring(start, end=this.length) [ES1]
Use .slice() instead of this method. .substring() wasn’t implemented consistently in older engines and doesn’t support negative indices.
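The following sketch contrasts the two methods in current engines. It shows two ways in which .substring() can surprise: negative indices are treated as 0, and the two indices are swapped if start is greater than end.

```javascript
const abc = 'abc';

// Negative indices: .slice() counts from the end,
// .substring() treats them as 0.
const sliceNegative = abc.slice(-2); // 'bc'
const substringNegative = abc.substring(-2); // 'abc'

// start > end: .slice() returns the empty string,
// .substring() swaps the two indices.
const sliceSwapped = abc.slice(2, 0); // ''
const substringSwapped = abc.substring(2, 0); // 'ab'

console.log([sliceNegative, substringNegative, sliceSwapped, substringSwapped]);
// logs: [ 'bc', 'abc', '', 'ab' ]
```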
String.prototype.*: combining
String.prototype.concat(...strings) [ES3]
Returns the concatenation of the string and strings. 'a'.concat('b') is equivalent to 'a'+'b'. The latter is much more popular.
> 'ab'.concat('cd', 'ef', 'gh')
'abcdefgh'
String.prototype.padEnd(len, fillString=' ') [ES2017]
Appends (fragments of) fillString to the string until it has the desired length len. If it already has or exceeds len, then it is returned without any changes.
> '#'.padEnd(2)
'# '
> 'abc'.padEnd(2)
'abc'
> '#'.padEnd(5, 'abc')
'#abca'
String.prototype.padStart(len, fillString=' ') [ES2017]
Prepends (fragments of) fillString to the string until it has the desired length len. If it already has or exceeds len, then it is returned without any changes.
> '#'.padStart(2)
' #'
> 'abc'.padStart(2)
'abc'
> '#'.padStart(5, 'abc')
'abca#'
String.prototype.repeat(count=0) [ES6]
Returns the string, concatenated count times.
> '*'.repeat()
''
> '*'.repeat(3)
'***'
String.prototype.*: transforming
String.prototype.replaceAll(searchValue, replaceValue) [ES2021]
What to do if you can’t use .replaceAll()
If .replaceAll() isn’t available on your targeted platform, you can use .replace() instead. How is explained in “str.replace(searchValue, replacementValue) [ES3]” (§45.13.8.1).
(1 of 2) replaceValue is string.
replaceAll(
searchValue: string | RegExp,
replaceValue: string
): string
Replaces all matches of searchValue with replaceValue. If searchValue is a regular expression without flag /g, a TypeError is thrown.
> 'x.x.'.replaceAll('.', '#') // interpreted literally
'x#x#'
> 'x.x.'.replaceAll(/./g, '#')
'####'
> 'x.x.'.replaceAll(/./, '#')
TypeError: String.prototype.replaceAll called with
a non-global RegExp argument
Special characters in replaceValue are:
$$: becomes $
$n: becomes the capture of numbered group n (alas, $0 stands for the string '$0', it does not refer to the complete match)
$&: becomes the complete match
$`: becomes everything before the match
$': becomes everything after the match
Examples:
> 'a 1995-12 b'.replaceAll(/([0-9]{4})-([0-9]{2})/g, '|$2|')
'a |12| b'
> 'a 1995-12 b'.replaceAll(/([0-9]{4})-([0-9]{2})/g, '|$&|')
'a |1995-12| b'
> 'a 1995-12 b'.replaceAll(/([0-9]{4})-([0-9]{2})/g, '|$`|')
'a |a | b'
Named capture groups (ES2018) are supported, too:
$<name> becomes the capture of named group name
Example:
assert.equal(
'a 1995-12 b'.replaceAll(
/(?<year>[0-9]{4})-(?<month>[0-9]{2})/g, '|$<month>|'),
'a |12| b');
(2 of 2) replaceValue is function.
replaceAll(
searchValue: string | RegExp,
replaceValue: (...args: Array<any>) => string
): string
If the second parameter is a function, occurrences are replaced with the strings it returns. Its parameters args are:
matched: string. The complete match
g1: string|undefined. The capture of numbered group 1
g2: string|undefined. The capture of numbered group 2
offset: number. Where was the match found in the input string?
input: string. The whole input string
const regexp = /([0-9]{4})-([0-9]{2})/g;
const replacer = (all, year, month) => '|' + all + '|';
assert.equal(
'a 1995-12 b'.replaceAll(regexp, replacer),
'a |1995-12| b');
Named capture groups (ES2018) are supported, too. If there are any, an argument is added at the end with an object whose properties contain the captures:
const regexp = /(?<year>[0-9]{4})-(?<month>[0-9]{2})/g;
const replacer = (...args) => {
  const groups = args.pop();
  return '|' + groups.month + '|';
};
assert.equal(
'a 1995-12 b'.replaceAll(regexp, replacer),
'a |12| b');
String.prototype.replace(searchValue, replaceValue) [ES3]
For more information on this method, see “str.replace(searchValue, replacementValue) [ES3]” (§45.13.8.1).
(1 of 2) searchValue is string or RegExp without /g.
replace(
searchValue: string | RegExp,
replaceValue: string | ((...args: Array<any>) => string)
): string
Works similarly to .replaceAll(), but only replaces the first occurrence:
> 'x.x.'.replace('.', '#') // interpreted literally
'x#x.'
> 'x.x.'.replace(/./, '#')
'#.x.'
(2 of 2) searchValue is RegExp with /g.
replace(
searchValue: RegExp,
replaceValue: string | ((...args: Array<any>) => string)
): string
Works exactly like .replaceAll().
String.prototype.toUpperCase() [ES1]
Returns a copy of the string in which all lowercase alphabetic characters are converted to uppercase. How well that works for various alphabets depends on the JavaScript engine.
> '-a2b-'.toUpperCase()
'-A2B-'
> 'αβγ'.toUpperCase()
'ΑΒΓ'
String.prototype.toLowerCase() [ES1]
Returns a copy of the string in which all uppercase alphabetic characters are converted to lowercase. How well that works for various alphabets depends on the JavaScript engine.
> '-A2B-'.toLowerCase()
'-a2b-'
> 'ΑΒΓ'.toLowerCase()
'αβγ'
String.prototype.trim() [ES5]
Returns a copy of the string in which all leading and trailing whitespace (spaces, tabs, line terminators, etc.) is gone.
> '\r\n#\t '.trim()
'#'
> ' abc '.trim()
'abc'
String.prototype.trimStart() [ES2019]
Similar to .trim() but only the beginning of the string is trimmed:
> ' abc '.trimStart()
'abc '
String.prototype.trimEnd() [ES2019]
Similar to .trim() but only the end of the string is trimmed:
> ' abc '.trimEnd()
' abc'
String.prototype.normalize(form = 'NFC') [ES6]
Normalizes the string according to one of the Unicode Normalization Forms. Values of form: 'NFC', 'NFD', 'NFKC', 'NFKD'
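Normalization matters when the same text can be encoded by different sequences of code points. The following sketch (with illustrative variable names) compares a precomposed “é” with an “e” that is followed by a combining accent:

```javascript
const composed = '\u00E9'; // 'é' as a single code point
const decomposed = 'e\u0301'; // 'e' followed by a combining acute accent

// Both look the same on screen, but the strings are different:
console.log(composed === decomposed); // logs: false

// After normalization, they can be compared:
console.log(composed.normalize('NFD') === decomposed); // logs: true
console.log(decomposed.normalize('NFC') === composed); // logs: true

// The default form is 'NFC':
console.log(decomposed.normalize().length); // logs: 1
```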
String.prototype.isWellFormed() [ES2024]
Returns true if a string is well-formed, that is, if it contains no lone surrogates (see .toWellFormed() for more information). Otherwise, it returns false.
> '🙂'.split('') // split into code units
[ '\uD83D', '\uDE42' ]
> '\uD83D\uDE42'.isWellFormed()
true
> '\uD83D\uDE42\uD83D'.isWellFormed() // lone surrogate 0xD83D
false
String.prototype.toWellFormed() [ES2024]
Each JavaScript string character is a UTF-16 code unit. One code point is encoded as either one UTF-16 code unit or two UTF-16 code units. In the latter case, the two code units are called leading surrogate and trailing surrogate. A surrogate without its partner is called a lone surrogate. A string with one or more lone surrogates is ill-formed.
.toWellFormed() converts an ill-formed string to a well-formed one by replacing each lone surrogate with code point 0xFFFD (“replacement character”). That character is often displayed as a � (a black rhombus with a white question mark). It is located in the Specials Unicode block of characters, at the very end of the Basic Multilingual Plane. This is what Wikipedia says about the replacement character: “It is used to indicate problems when a system is unable to render a stream of data to correct symbols.”
assert.deepEqual(
'🙂'.split(''), // split into code units
['\uD83D', '\uDE42']
);
assert.deepEqual(
// 0xD83D is a lone surrogate
'\uD83D\uDE42\uD83D'.toWellFormed().split(''),
['\uD83D', '\uDE42', '\uFFFD']
);
Exercise: Using string methods
exercises/strings/remove_extension_test.mjs