Exploring ES2018 and ES2019

8. RegExp Unicode property escapes



This chapter explains the proposal “RegExp Unicode Property Escapes” by Mathias Bynens.

8.1. Overview

JavaScript lets you match characters by mentioning the “names” of sets of characters. For example, \s stands for “whitespace”:

> /^\s+$/u.test('\t \n\r')
true

The proposal lets you additionally match characters by mentioning their Unicode character properties (what those are is explained next) inside the curly braces of \p{}. Two examples:

> /^\p{White_Space}+$/u.test('\t \n\r')
true
> /^\p{Script=Greek}+$/u.test('μετά')
true

As you can see, one of the benefits of property escapes is that they make regular expressions more self-descriptive. Additional benefits will become clear later.

Before we delve into how property escapes work, let’s examine what Unicode character properties are.

8.2. Unicode character properties

In the Unicode standard, each character has properties – metadata describing it. Properties play an important role in defining the nature of a character. Quoting the Unicode Standard, Sect. 3.3, D3:

The semantics of a character are determined by its identity, normative properties, and behavior.

8.2.1. Examples of properties

These are a few examples of properties:

8.2.2. Types of properties

The following types of properties exist:

8.2.3. Matching properties and property values

Properties and property values are matched as follows:

8.3. Unicode property escapes for regular expressions

Unicode property escapes look like this:

  1. \p{prop=value}: Match all characters whose property prop has the value value.
  2. \P{prop=value}: Match all characters that do not have a property prop whose value is value.
  3. \p{bin_prop}: Match all characters whose binary property bin_prop is True.
  4. \P{bin_prop}: Match all characters whose binary property bin_prop is False.
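
The four forms can be sketched side by side as follows. This is only an illustration: the variable names are made up here, and the properties used (General_Category with the value Lowercase_Letter, and the binary property Alphabetic) are picked as examples from the Unicode standard.

```javascript
// 1. \p{prop=value}: the property General_Category has the value Lowercase_Letter
const isLower = /^\p{General_Category=Lowercase_Letter}$/u;
// 2. \P{prop=value}: the negation of form 1
const isNotLower = /^\P{General_Category=Lowercase_Letter}$/u;
// 3. \p{bin_prop}: the binary property Alphabetic is True
const isAlphabetic = /^\p{Alphabetic}$/u;
// 4. \P{bin_prop}: the binary property Alphabetic is False
const isNotAlphabetic = /^\P{Alphabetic}$/u;

console.log(isLower.test('a'), isLower.test('A'));                 // true false
console.log(isNotLower.test('a'), isNotLower.test('A'));           // false true
console.log(isAlphabetic.test('a'), isAlphabetic.test('1'));       // true false
console.log(isNotAlphabetic.test('a'), isNotAlphabetic.test('1')); // false true
```

For any single code point, exactly one of \p{...} and \P{...} matches, which is why each pair of results above is mirrored.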

Comments:

8.3.1. Details

Things to note:

8.4. Examples

Matching whitespace:

> /^\p{White_Space}+$/u.test('\t \n\r')
true

Matching letters:

> /^\p{Letter}+$/u.test('πüé')
true

Matching Greek letters:

> /^\p{Script=Greek}+$/u.test('μετά')
true

Matching Latin letters:

> /^\p{Script=Latin}+$/u.test('Grüße')
true
> /^\p{Script=Latin}+$/u.test('façon')
true
> /^\p{Script=Latin}+$/u.test('mañana')
true
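
Property escapes are not limited to anchored tests. As a rough sketch, the hypothetical helper extractWords() below combines \p{Letter} with the /g flag to pull runs of letters out of mixed-script text and then filters them by script. The function name and the sample strings are assumptions made for this illustration.

```javascript
// Hypothetical helper: return all maximal runs of letters in `text`.
function extractWords(text) {
  // match() returns null if there are no matches, hence the fallback
  return text.match(/\p{Letter}+/gu) || [];
}

console.log(extractWords('Grüße, μετά!'));
// [ 'Grüße', 'μετά' ]

// Keep only the words that consist entirely of Greek characters
const greekWords = extractWords('façon μετά mañana')
  .filter(word => /^\p{Script=Greek}+$/u.test(word));
console.log(greekWords);
// [ 'μετά' ]
```

The examples assume that the accented letters are precomposed (a single code point each). A decomposed letter contains a combining mark, which is not matched by \p{Letter}.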

Matching lone surrogate characters:

> /^\p{Surrogate}+$/u.test('\u{D83D}')
true
> /^\p{Surrogate}+$/u.test('\u{DE00}')
true

Note that Unicode code points in astral planes (such as emojis) are composed of two JavaScript characters (a leading surrogate and a trailing surrogate). Therefore, you’d expect the previous regular expression to match the emoji 🙂, which is all surrogates:

> '🙂'.length
2
> '🙂'.charCodeAt(0).toString(16)
'd83d'
> '🙂'.charCodeAt(1).toString(16)
'de42'

However, with the /u flag, property escapes match code points, not JavaScript characters:

> /^\p{Surrogate}+$/u.test('🙂')
false

In other words, 🙂 is considered to be a single character:

> /^.$/u.test('🙂')
true
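
The two views of the same string (code units versus code points) can be contrasted in a short sketch. The variable names are chosen for this illustration only.

```javascript
const smiley = '\u{1F642}'; // 🙂

// Code units: what .length and charCodeAt() see
const codeUnitCount = smiley.length; // 2

// Code points: what string iteration and the /u flag see
const codePointCount = [...smiley].length; // 1
const codePointHex = smiley.codePointAt(0).toString(16); // '1f642'

// Without /u, the dot matches a single code unit, so the
// two-unit string is not matched as a whole.
const dotWithoutU = /^.$/.test(smiley); // false
const dotWithU = /^.$/u.test(smiley); // true

// Taken on its own, each half is a lone surrogate and matches \p{Surrogate}.
const halvesAreSurrogates =
  /^\p{Surrogate}$/u.test(smiley[0]) &&
  /^\p{Surrogate}$/u.test(smiley[1]); // true

console.log(codeUnitCount, codePointCount, codePointHex);
console.log(dotWithoutU, dotWithU, halvesAreSurrogates);
```

Indexing with smiley[0] works on code units, which is why it can split the pair that /u treats as one unit.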

8.5. Trying it out

V8 5.8+ implements this proposal; it is switched on via the flag --harmony_regexp_property.

8.6. Further reading

JavaScript:

The Unicode standard: