26. Unicode in ES6

This chapter explains the improved support for Unicode that ECMAScript 6 brings. For a general introduction to Unicode, read Chap. “Unicode and JavaScript” in “Speaking JavaScript”.

26.1. Unicode is better supported in ES6
26.2. Escape sequences in ES6
- 26.2.1. Where can escape sequences be used?
- 26.2.2. Escape sequences in the ES6 spec

26.1 Unicode is better supported in ES6

There are three areas in which ECMAScript 6 has improved support for Unicode:

Unicode escapes for code points beyond 16 bits: \u{···}
- Can be used in identifiers, string literals, template literals and regular expression literals (in the latter only if the flag /u is set). They are explained in the next section.
Strings:
- Iteration honors Unicode code points.
- Read code point values via String.prototype.codePointAt().
- Create a string from code point values via String.fromCodePoint().
Regular expressions:
- New flag /u (plus boolean property unicode) improves handling of surrogate pairs.
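
The following sketch exercises each of the features listed above on a string that contains the rocket (code point 0x1F680). The variable names are chosen for illustration only.

```javascript
// U+1F680 lies outside the BMP, so it occupies two UTF-16 code units.
const rocketStr = 'x\u{1F680}y';

// Iteration (for-of, spread) steps through code points, not code units.
const codePoints = [...rocketStr];
const codeUnitCount = rocketStr.length;

// codePointAt() decodes the surrogate pair that starts at index 1.
const rocketCodePoint = rocketStr.codePointAt(1);

// String.fromCodePoint() goes the other way: code points in, string out.
const rebuilt = String.fromCodePoint(0x78, 0x1F680, 0x79);

// With /u, the dot matches a whole code point; without it, one code unit.
const dotWithU = /^.$/u.test('\u{1F680}');
const dotWithoutU = /^.$/.test('\u{1F680}');

// The boolean property unicode reflects whether /u is set.
const flagIsSet = /x/u.unicode;

console.log(codePoints.length, codeUnitCount); // 3 4
console.log(rocketCodePoint.toString(16));     // 1f680
console.log(rebuilt === rocketStr);            // true
console.log(dotWithU, dotWithoutU, flagIsSet); // true false true
```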

Additionally, ES6 is based on Unicode version 5.1.0, whereas ES5 is based on Unicode version 3.0.

26.2 Escape sequences in ES6

There are three parameterized escape sequences for representing characters in JavaScript:

Hex escape (exactly two hexadecimal digits): \xHH
```
  > '\x7A' === 'z'
  true
```
Unicode escape (exactly four hexadecimal digits): \uHHHH
```
  > '\u007A' === 'z'
  true
```
Unicode code point escape (1 or more hexadecimal digits): \u{···}
```
  > '\u{7A}' === 'z'
  true
```

Unicode code point escapes are new in ES6. They let you specify code points beyond 16 bits. If you wanted to do that in ECMAScript 5, you had to encode each code point as two UTF-16 code units (a surrogate pair). These code units could be expressed via Unicode escapes. For example, the following statement logs a rocket (code point 0x1F680) to most consoles:

console.log('\uD83D\uDE80');

With a Unicode code point escape, you can specify code points that need more than 16 bits directly:

console.log('\u{1F680}');
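
To see how the two notations relate, the sketch below computes the surrogate pair for a code point by hand, using the standard UTF-16 encoding arithmetic. The helper toSurrogatePair is hypothetical and only meant for code points above 0xFFFF.

```javascript
// Hypothetical helper: splits a code point above 0xFFFF into its
// UTF-16 lead and trail surrogates.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;      // 20 bits remain
  const lead = 0xD800 + (offset >> 10);    // upper 10 bits
  const trail = 0xDC00 + (offset & 0x3FF); // lower 10 bits
  return [lead, trail];
}

const [lead, trail] = toSurrogatePair(0x1F680);
console.log(lead.toString(16), trail.toString(16)); // d83d de80

// Both escape notations produce the same string.
const sameString = '\uD83D\uDE80' === '\u{1F680}';
console.log(sameString); // true
```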

26.2.1 Where can escape sequences be used?

The escape sequences can be used in the following locations:

	`\uHHHH`	`\u{···}`	`\xHH`
Identifiers	✔	✔
String literals	✔	✔	✔
Template literals	✔	✔	✔
Regular expression literals	✔	Only with flag `/u`	✔

Identifiers:

A 4-digit Unicode escape \uHHHH becomes a single code point.
A Unicode code point escape \u{···} becomes a single code point.

> const hello = 123;
> hell\u{6F}
123

String literals:

Strings are internally stored as UTF-16 code units.
A hex escape \xHH contributes a UTF-16 code unit.
A 4-digit Unicode escape \uHHHH contributes a UTF-16 code unit.
A Unicode code point escape \u{···} contributes the UTF-16 encoding of its code point (one or two UTF-16 code units).
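
As a quick check of these rules, the following sketch counts the code units that each kind of escape contributes. The variable names are illustrative.

```javascript
const viaHex = '\x7A';             // one code unit
const viaUnicode = '\u007A';       // one code unit
const viaCodePoint = '\u{7A}';     // BMP code point: one code unit
const viaAstral = '\u{1F680}';     // beyond the BMP: two code units

console.log(viaHex.length, viaUnicode.length, viaCodePoint.length); // 1 1 1
console.log(viaAstral.length); // 2

// The two code units of the rocket are its lead and trail surrogates.
const units = [viaAstral.charCodeAt(0), viaAstral.charCodeAt(1)];
console.log(units.map(u => u.toString(16))); // [ 'd83d', 'de80' ]
```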

Template literals:

In template literals, escape sequences are handled like in string literals.
In tagged templates, how escape sequences are interpreted depends on the tag function. It can choose between two interpretations:
- Cooked: escape sequences are handled like in string literals.
- Raw: escape sequences are not interpreted; they remain the sequence of characters that was written in the source code.

> `hell\u{6F}` // cooked
'hello'
> String.raw`hell\u{6F}` // raw
'hell\\u{6F}'
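
A tag function receives both interpretations at once: the cooked strings as array elements and the raw strings via the property raw. The tag function showBoth below is a hypothetical example written for this illustration.

```javascript
// Hypothetical tag function: returns the cooked and the raw version
// of the first template string.
function showBoth(strings) {
  return { cooked: strings[0], raw: strings.raw[0] };
}

const result = showBoth`hell\u{6F}`;
console.log(result.cooked);     // hello
console.log(result.raw);        // hell\u{6F}
console.log(result.raw.length); // 10
```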

Regular expressions:

Unicode code point escapes are only allowed if the flag /u is set. Without that flag, \u{3} is interpreted as three times the character u:
```
  > /^\u{3}$/.test('uuu')
  true
```
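
For comparison, here is a sketch of the same pattern with and without the flag /u. With the flag, \u{3} denotes the code point 0x3. The variable names are illustrative.

```javascript
const withoutU = /^\u{3}$/; // "u" repeated three times
const withU = /^\u{3}$/u;   // the single code point 0x3

console.log(withoutU.test('uuu'), withoutU.test('\u0003')); // true false
console.log(withU.test('uuu'), withU.test('\u0003'));       // false true

// With /u, a code point escape can also match a surrogate pair.
const rocketPattern = /^\u{1F680}$/u;
console.log(rocketPattern.test('\uD83D\uDE80')); // true
```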

26.2.2 Escape sequences in the ES6 spec

Various information:

The spec treats source code as a sequence of Unicode code points: “Source Text”
Unicode escape sequences in identifiers: “Names and Keywords”
Strings are internally stored as sequences of UTF-16 code units: “String Literals”
Strings – how various escape sequences are translated to UTF-16 code units: “Static Semantics: SV”
Template literals – how various escape sequences are translated to UTF-16 code units: “Static Semantics: TV and TRV”

26.2.2.1 Regular expressions

The spec distinguishes between BMP patterns (flag /u not set) and Unicode patterns (flag /u set). Sect. “Pattern Semantics” explains that they are handled differently and how.

As a reminder, here is how grammar rules are parameterized in the spec:

If a grammar rule R has the subscript [U] then that means there are two versions of it: R and R_U.
Parts of the rule can pass on the subscript via [?U].
If a part of a rule has the prefix [+U] it only exists if the subscript [U] is present.
If a part of a rule has the prefix [~U] it only exists if the subscript [U] is not present.

You can see this parameterization in action in Sect. “Patterns”, where the subscript [U] creates separate grammars for BMP patterns and Unicode patterns:

IdentityEscape: In BMP patterns, many characters can be prefixed with a backslash and are interpreted as themselves (for example: if \u is not followed by four hexadecimal digits, it is interpreted as u). In Unicode patterns that only works for the following characters (which frees up \u for Unicode code point escapes): ^ $ \ . * + ? ( ) [ ] { } |
RegExpUnicodeEscapeSequence: "\u{" HexDigits "}" is only allowed in Unicode patterns. In those patterns, lead and trail surrogates are also grouped to help with UTF-16 decoding.
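
Both differences can be observed from JavaScript. The sketch below assumes an engine that also implements the web-compatibility regular expression grammar for BMP patterns (as browsers and Node.js do). The helper compiles is hypothetical; it uses the RegExp constructor so that a syntax error can be caught at runtime.

```javascript
// Hypothetical helper: reports whether a pattern compiles with the given flags.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    return false; // a SyntaxError was thrown for the pattern
  }
}

// IdentityEscape: \a stands for "a" in a BMP pattern,
// but is a syntax error in a Unicode pattern.
console.log(compiles('\\a', ''), compiles('\\a', 'u')); // true false
console.log(compiles('\\$', 'u'));                      // true

// Surrogate grouping: with /u, the pair inside the character class is
// one code point; without /u, the class contains two separate code units.
const classWithU = /^[\uD83D\uDE80]$/u.test('\uD83D\uDE80');
const classWithoutU = /^[\uD83D\uDE80]$/.test('\uD83D\uDE80');
console.log(classWithU, classWithoutU); // true false
```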

Sect. “CharacterEscape” explains how various escape sequences are translated to characters (roughly: either code units or code points).
