Leanpub: Publish Early, Publish Often

Chapter 5: Unicode Codepoint Escape Syntax

PHP’s lack of native Unicode support can make things difficult when coding for the web. While libraries like iconv and mbstring have simplified working with strings, there were no simple mechanisms available to create Unicode characters or strings without converting them from an HTML or JSON representation:

Workaround examples

$char = html_entity_decode('&#x2603;', 0, 'UTF-8');
$char = mb_convert_encoding('&#x2603;', 'UTF-8', 'HTML-ENTITIES');
$char = json_decode('"\\u2603"');

PHP 7 finally introduces native support for Unicode character escape sequences within strings, just like you’d see in Ruby or ECMAScript 6:

$char = "\u{2603}";

This makes it much easier (and quicker) to embed Unicode characters, especially ones that aren’t easily typed. These can be used in strings alongside other characters too. For example, here’s the U+202E RIGHT-TO-LEFT OVERRIDE character being used to display the string in reverse:

echo "\u{202E}This is backwards"; // displays: sdrawkcab si sihT

You can omit leading 0s if you’d like:

echo "\u{58}"; // "X"
echo "\u{0058}"; // "X"
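As a quick sanity check, the new escape can be compared against the older workarounds. This is only a minimal sketch: it assumes PHP 7 or later with the json extension available, and it relies on the fact that the escape is emitted as the UTF-8 encoding of the codepoint.

```php
<?php
// "\u{2603}" is resolved at compile time into the UTF-8 bytes for U+2603 SNOWMAN.
$snowman = "\u{2603}";

var_dump(bin2hex($snowman)); // string(6) "e29883"
var_dump(strlen($snowman));  // int(3), since strlen() counts bytes, not characters

// The older workarounds produce exactly the same bytes.
var_dump($snowman === html_entity_decode('&#x2603;', 0, 'UTF-8')); // bool(true)
var_dump($snowman === json_decode('"\\u2603"'));                   // bool(true)

// Leading zeros make no difference to the result.
var_dump("\u{58}" === "\u{0058}"); // bool(true)
```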

Why the `{}`s?

Some other languages (C, C++, and Java among them) use a format without the {} characters: \uXXXX. Because that form takes exactly four hex digits, it can only address the Basic Multilingual Plane (U+0000 to U+FFFF). Unicode, however, defines codepoints beyond 16 bits, all the way up to U+10FFFF.

For example, if we wanted to represent the U+1F427 PENGUIN emoji, our escape sequence would look something like this: \u1F427. Most languages would interpret this as U+1F42 GREEK SMALL LETTER OMICRON WITH PSILI AND VARIA followed by a literal 7, which is not what we want. In these languages, you’d have to encode it as a UTF-16 surrogate pair, using two 16-bit escapes like this: \uD83D\uDC27. This isn’t very clear though.

Wrapping with {} characters allows us to easily go beyond that 16-bit limitation without sacrificing clarity: \u{1F427}
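To make the comparison concrete, here is a rough sketch of the surrogate-pair arithmetic that the braced syntax lets us avoid. The surrogatePair() helper is hypothetical, written only for this illustration, and it assumes its input lies above U+FFFF.

```php
<?php
// Hypothetical helper: split a codepoint above U+FFFF into a UTF-16 surrogate pair.
function surrogatePair(int $codepoint): array
{
    $offset = $codepoint - 0x10000;
    return [0xD800 + ($offset >> 10), 0xDC00 + ($offset & 0x3FF)];
}

list($high, $low) = surrogatePair(0x1F427);
printf("\\u%04X\\u%04X\n", $high, $low); // \uD83D\uDC27

// With braces, the codepoint is written directly and becomes four UTF-8 bytes.
$penguin = "\u{1F427}";
var_dump(bin2hex($penguin)); // string(8) "f09f90a7"

// Decoding the surrogate pair through JSON yields the same string.
var_dump($penguin === json_decode('"\\uD83D\\uDC27"')); // bool(true)

// The braces also make the "U+1F42 followed by a 7" reading explicit when we want it.
echo "\u{1F42}7", "\n"; // ὂ7
```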

Limitations

This feature follows the behavior of all other escape sequences in PHP - they can only be used within double-quoted strings and heredocs:

Example usage

$foo = "\u{2109}\u{2134}\u{2134}";
// ℉ℴℴ

$bar = <<<EOT
    \u{212C}\u{212B}\u{211D}
EOT;
// ℬÅℝ

And like other sequences such as \t, they will not be expanded when they occur in single-quoted strings or nowdocs:

These will not work

$foo = '\u{2109}\u{2134}\u{2134}';
// \u{2109}\u{2134}\u{2134}

$bar = <<<'EOT'
    \u{212C}\u{212B}\u{211D}
EOT;
// \u{212C}\u{212B}\u{211D}

Backwards Compatibility

Double-quoted strings and heredocs containing \u{ followed by an invalid sequence will now result in a parse error. A \u that is not followed by { is left untouched, so only the \u{ form is affected. The error can be avoided by escaping the leading backslash with another backslash (\\u{).
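The following sketch illustrates both the failure and the fix. It uses eval() purely so that the error can be caught at run time for demonstration; in a normal script, the invalid sequence would stop the whole file from compiling. The sample text "not hex" is just an arbitrary invalid sequence chosen for this example.

```php
<?php
// Escaping the backslash keeps the text literal, just as a single-quoted string would.
$literal = "\\u{not hex}";
var_dump($literal === '\u{not hex}'); // bool(true)

// A \u that is not followed by { is not treated as an escape at all.
var_dump("\u2603" === '\u2603'); // bool(true)

// The invalid escape lives in a single-quoted string, so this file itself compiles;
// the error only surfaces when eval() compiles the double-quoted version.
$source = 'return "\u{not hex}";';
try {
    eval($source);
    $failed = false;
} catch (ParseError $e) {
    $failed = true;
}
var_dump($failed); // bool(true)
```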
