This chapter explains the improved support for Unicode that ECMAScript 6 brings. For a general introduction to Unicode, read Chap. “Unicode and JavaScript” in “Speaking JavaScript”.
There are three areas in which ECMAScript 6 has improved support for Unicode:
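- Unicode escapes for code points beyond 16 bits: `\u{···}`
- Strings: iteration honors Unicode code points. `String.prototype.codePointAt()` reads code point values and `String.fromCodePoint()` creates a string from code point values.
- Regular expressions: the new flag `/u` (plus the boolean property `unicode`) improves handling of surrogate pairs.

Additionally, ES6 is based on Unicode version 5.1.0, whereas ES5 is based on Unicode version 3.0.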
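The three areas can be tried out in a few lines. This is a minimal sketch, not taken from the chapter; the variable name `rocket` is only illustrative.

```javascript
// One code point (U+1F680) that needs two UTF-16 code units
const rocket = '\u{1F680}';

console.log(rocket.length);                            // 2 (code units)
console.log([...rocket].length);                       // 1 (iteration sees code points)
console.log(rocket.codePointAt(0).toString(16));       // '1f680'
console.log(String.fromCodePoint(0x1F680) === rocket); // true

// The flag /u is reflected by the boolean property `unicode`
console.log(/\u{1F680}/u.unicode);      // true
console.log(/\u{1F680}/u.test(rocket)); // true
```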
There are three parameterized escape sequences for representing characters in JavaScript:
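- Hex escape (exactly two hexadecimal digits): `\xHH`
- Unicode escape (exactly four hexadecimal digits): `\uHHHH`
- Unicode code point escape (one or more hexadecimal digits): `\u{···}`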
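As a quick illustration (a sketch, with the letter `z` chosen arbitrarily), all three escape sequences can denote the same character:

```javascript
// U+007A is the letter "z"
console.log('\x7A' === 'z');       // true (hex escape)
console.log('\u007A' === 'z');     // true (Unicode escape)
console.log('\u{7A}' === 'z');     // true (Unicode code point escape)
console.log('\u{00007A}' === 'z'); // true (leading zeros are allowed)
```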
Unicode code point escapes are new in ES6. They let you specify code points beyond 16 bits. If you wanted to do that in ECMAScript 5, you had to encode each code point as two UTF-16 code units (a surrogate pair). These code units could be expressed via Unicode escapes. For example, the string literal `'\uD83D\uDE80'` represents a rocket (code point 0x1F680), which most consoles can display.
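A minimal sketch of the ES5-style spelling, assuming the surrogate pair D83D DE80 that the UTF-16 encoding rules yield for code point 0x1F680:

```javascript
// ES5 style: the rocket written as two Unicode escapes (a surrogate pair)
console.log('\uD83D\uDE80');
```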
With a Unicode code point escape, you can specify code points beyond 16 bits directly, for example `\u{1F680}` for the rocket.
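The ES6 spelling of the same rocket might look as follows (a sketch; the comparison with the surrogate pair is added for illustration):

```javascript
// ES6 style: one escape for the whole code point
console.log('\u{1F680}');

// Both spellings produce the same string
console.log('\u{1F680}' === '\uD83D\uDE80'); // true
```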
The escape sequences can be used in the following locations:
| | `\uHHHH` | `\u{···}` | `\xHH` |
|---|---|---|---|
| Identifiers | ✔ | ✔ | |
| String literals | ✔ | ✔ | ✔ |
| Template literals | ✔ | ✔ | ✔ |
| Regular expression literals | ✔ | Only with flag `/u` | ✔ |
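Identifiers:

- A 4-digit Unicode escape `\uHHHH` becomes a single code point.
- A Unicode code point escape `\u{···}` becomes a single code point.

String literals:

- A hex escape `\xHH` contributes a UTF-16 code unit.
- A 4-digit Unicode escape `\uHHHH` contributes a UTF-16 code unit.
- A Unicode code point escape `\u{···}` contributes the UTF-16 encoding of its code point (one or two UTF-16 code units).

Template literals:

- In template literals, escape sequences are handled like in string literals.
- In tagged templates, the tag function decides how escape sequences are interpreted: cooked (as in string literals) or raw (as a sequence of characters).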
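The following sketch exercises escapes in identifiers, string literals and template literals. The identifier names `abc` and `def` are made up for the example, and `String.raw` stands in for a tag function that uses the raw interpretation.

```javascript
// Identifiers: each escape becomes a single code point of the name
const \u0061bc = 1;  // declares `abc`
const \u{64}ef = 2;  // declares `def`
console.log(abc, def); // 1 2

// String literals: escapes contribute UTF-16 code units
console.log('\x7A'.length);      // 1
console.log('\u007A'.length);    // 1
console.log('\u{7A}'.length);    // 1 (fits into one code unit)
console.log('\u{1F680}'.length); // 2 (needs a surrogate pair)

// Template literals: same handling as in string literals
console.log(`\u{1F680}` === '\u{1F680}'); // true

// Tagged template with a raw interpretation: the escape stays as typed
console.log(String.raw`\u{1F680}`.length); // 9
```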
Regular expressions:

- Unicode code point escapes are only allowed if the flag `/u` is set, because otherwise `\u{3}` is interpreted as three times the character `u`.
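A short sketch of the difference the flag makes for `\u{3}` (the test strings are chosen for illustration):

```javascript
// Without /u: `\u` is just `u`, and `{3}` is a quantifier
console.log(/^\u{3}$/.test('uuu'));     // true

// With /u: `\u{3}` is the code point U+0003
console.log(/^\u{3}$/u.test('uuu'));    // false
console.log(/^\u{3}$/u.test('\u0003')); // true

// With /u, code points beyond 16 bits can be escaped directly
console.log(/^\u{1F680}$/u.test('\uD83D\uDE80')); // true
```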
Various information:
The spec distinguishes between BMP patterns (flag `/u` not set) and Unicode patterns (flag `/u` set). Sect. “Pattern Semantics” explains that they are handled differently and how.
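One observable consequence of the two modes, sketched here with a rocket as an arbitrary example: a BMP pattern matches against UTF-16 code units, a Unicode pattern against code points.

```javascript
const rocket = '\u{1F680}'; // two code units, one code point

// The dot matches one code unit in a BMP pattern, one code point in a Unicode pattern
console.log(/^.$/.test(rocket));  // false
console.log(/^.$/u.test(rocket)); // true

// A lone lead surrogate matches half of a pair only in a BMP pattern
console.log(/\uD83D/.test(rocket));  // true
console.log(/\uD83D/u.test(rocket)); // false
```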
As a reminder, here is how grammar rules are parameterized in the spec:

- If a grammar rule `R` has the subscript `[U]`, then there are two versions of it: `R` and `R_U`.
- Parts of the rule can pass on the subscript via `[?U]`.
- If a part of a rule has the prefix `[+U]`, it only exists if the subscript `[U]` is present.
- If a part of a rule has the prefix `[~U]`, it only exists if the subscript `[U]` is not present.

You can see this parameterization in action in Sect. “Patterns”, where the subscript `[U]` creates separate grammars for BMP patterns and Unicode patterns:
- IdentityEscape: In BMP patterns, many characters can be prefixed with a backslash and are then interpreted as themselves (for example, if `\u` is not followed by four hexadecimal digits, it is interpreted as `u`). In Unicode patterns that only works for the following characters (which frees up `\u` for Unicode code point escapes): `^ $ \ . * + ? ( ) [ ] { } |`
- RegExpUnicodeEscapeSequence: `"\u{" HexDigits "}"` is only allowed in Unicode patterns. In those patterns, lead and trail surrogates are also grouped to help with UTF-16 decoding.

Sect. “CharacterEscape” explains how various escape sequences are translated to characters (roughly: either code units or code points).
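Both grammar differences can be observed from JavaScript. The sketch below assumes an engine that accepts escaped letters such as `\a` in BMP patterns, as browsers and Node.js do; the variable names are illustrative. `new RegExp()` is used where a regular expression literal would already fail at parse time.

```javascript
// IdentityEscape in a BMP pattern: `\a` is interpreted as `a`
console.log(new RegExp('^\\a$').test('a')); // true

// In a Unicode pattern, the same escape is a syntax error
let unicodeError = null;
try {
  new RegExp('^\\a$', 'u');
} catch (e) {
  unicodeError = e;
}
console.log(unicodeError instanceof SyntaxError); // true

// Syntax characters can still be escaped in Unicode patterns
console.log(new RegExp('^\\$\\{$', 'u').test('${')); // true

// Lead and trail surrogate escapes are grouped into one code point
const rocket = '\u{1F680}';
console.log(/^[\uD83D\uDE80]$/.test(rocket));  // false (class with two code units)
console.log(/^[\uD83D\uDE80]$/u.test(rocket)); // true (class with one code point)
```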