ASCII is so 1963. Nowadays, computers must support a broad range of different characters beyond the 128 we had in the early days of computing - not just accents and emojis but also completely different writing systems used around the globe. The Unicode standard packs a whopping 143,859 characters into an elegant system used by over 95% of the Internet, but PHP's string functions don't play nicely with Unicode by default, making it difficult for developers to properly handle such a wide array of possible user inputs.
In this talk, we'll explore why Unicode is important, how the various encodings like UTF-8 work under the hood, how to handle them within PHP, and some nifty tricks and shortcuts to preserve performance.
2. Colin O’Dell
● Principal Engineer at Unleashed Technologies
● PHP for ~20 years; 13 years professionally
● Creator & maintainer of league/commonmark library
● PHP League leadership team
● Owner of moderngeekware.com
● @colinodell
3. Agenda
● A History of Encoding Systems
● Unicode Standard
● Unicode Encodings
● Using Unicode in PHP
● Tips & Tricks
● Questions & Answers
11. 1960s: ASCII
● American Standard Code for Information Interchange
● 7-bit binary encoding
○ 0000000 = 0
○ ...
○ 1111111 = 127
12. 0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ DEL
13. 0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ DEL
Character Hex Binary Character Hex Binary
LF (line feed) 0x0A 0001010 E 0x45 1000101
3 0x33 0110011 e 0x65 1100101
14. 0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ DEL
00xxxxx
01xxxxx
10xxxxx
11xxxxx
00xxxxx = 32 control codes
01xxxxx = 32 numbers & symbols
10xxxxx = 32 uppercase letters and some extra symbols
11xxxxx = 32 lowercase letters and some extra symbols
15. A = 0x41 = 1000001
B = 0x42 = 1000010
…
Z = 0x5A = 1011010
0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ DEL
00xxxxx
01xxxxx
10xxxxx
11xxxxx
16. A = 0x41 = 1000001
B = 0x42 = 1000010
…
Z = 0x5A = 1011010
a = 0x61 = 1100001
b = 0x62 = 1100010
…
z = 0x7A = 1111010
0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ DEL
00xxxxx
01xxxxx
10xxxxx
11xxxxx
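The one-bit relationship between upper and lower case can be sketched in PHP. The helper name `toggleAsciiCase` is hypothetical (not from the talk) and it assumes a single ASCII character as input:

```php
<?php
// Hypothetical helper: flips the case of one ASCII letter by toggling bit 0x20,
// the single bit that separates 'A' (0x41, 1000001) from 'a' (0x61, 1100001).
function toggleAsciiCase(string $char): string
{
    $byte = ord($char); // ord() reads the first byte of the string

    $isUpper = $byte >= 0x41 && $byte <= 0x5A; // A-Z
    $isLower = $byte >= 0x61 && $byte <= 0x7A; // a-z

    // Non-letters are returned unchanged; toggling the bit would turn them
    // into unrelated characters.
    return ($isUpper || $isLower) ? chr($byte ^ 0x20) : $char;
}

printf("%s %s %s\n", toggleAsciiCase('A'), toggleAsciiCase('z'), toggleAsciiCase('3'));
// prints "a Z 3"
```

This only works for the 52 ASCII letters; accented and non-Latin letters need proper case mapping.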
17. But computers use 8-bit bytes...
ASCII (7 Bits) ???
Start 00000000 10000000
End 01111111 11111111
Count 128 128
18. 0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ DEL
7-bit
ASCII
19. 0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ DEL
8
???
9
A
B
C
D
E
F
8-bit
“Extended
ASCII”
22. 0 1 2 3 4 5 6 7 8 9 A B C D E F
0
1
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~
8
9
A NBSP Ą ˘ Ł ¤ Ľ Ś § ¨ Š Ş Ť Ź SHY Ž Ż
B ° ą ˛ ł ´ ľ ś ˇ ¸ š ş ť ź ˝ ž ż
C Ŕ Á Â Ă Ä Ĺ Ć Ç Č É Ę Ë Ě Í Î Ď
D Đ Ń Ň Ó Ô Ő Ö × Ř Ů Ú Ű Ü Ý Ţ ß
E ŕ á â ă ä ĺ ć ç č é ę ë ě í î ď
F đ ń ň ó ô ő ö ÷ ř ů ú ű ü ý ţ ˙
ISO
8859-2
23. 0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL ☺ ☻ ♥ ♦ ♣ ♠ • ◘ ○ ◙ ♂ ♀ ♪ ♫ ☼
1 ► ◄ ↕ ‼ ¶ § ▬ ↨ ↑ ↓ → ← ∟ ↔ ▲ ▼
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ ⌂
8 Ç ü é â ä à å ç ê ë è ï î ì Ä Å
9 É æ Æ ô ö ò û ù ÿ Ö Ü ¢ £ ¥ ₧ ƒ
A á í ó ú ñ Ñ ª º ¿ ⌐ ¬ ½ ¼ ¡ « »
B ░ ▒ ▓ │ ┤ ╡ ╢ ╖ ╕ ╣ ║ ╗ ╝ ╜ ╛ ┐
C └ ┴ ┬ ├ ─ ┼ ╞ ╟ ╚ ╔ ╩ ╦ ╠ ═ ╬ ╧
D ╨ ╤ ╥ ╙ ╘ ╒ ╓ ╫ ╪ ┘ ┌ █ ▄ ▌ ▐ ▀
E α ß Γ π Σ σ µ τ Φ Θ Ω δ ∞ φ ε ∩
F ≡ ± ≥ ≤ ⌠ ⌡ ÷ ≈ ° ∙ · √ ⁿ ² ■ NBSP
Code
Page
437
(IBM
PC)
25. 8-bit “Extended ASCII”
● ISO 8859 - 16 variations:
○ ISO 8859-1 (“Latin 1”, Western European)
○ ISO 8859-2 (“Latin 2”, Central European)
○ ISO 8859-3 (“Latin 3”, South European)
○ ISO 8859-4 (“Latin 4”, North European)
○ ISO 8859-5 (Latin/Cyrillic)
○ ISO 8859-6 (Latin/Arabic)
○ ISO 8859-7 (Latin/Greek)
○ ISO 8859-8 (Latin/Hebrew)
○ ISO 8859-9 (“Latin 5”, Turkish)
○ ISO 8859-10 (“Latin 6”, Nordic)
○ ISO 8859-11 (Latin/Thai)
○ ISO 8859-12 (Latin/Devanagari) - abandoned
○ ISO 8859-13 (“Latin 7”, Baltic Rim)
○ ISO 8859-14 (“Latin 8”, Celtic)
○ ISO 8859-15 (“Latin 9”)
■ Revision of 8859-1 that swaps out less-used chars; adds euro currency symbol
○ ISO 8859-16 (“Latin 10”, South-Eastern European)
● Windows-1252
● CP 437 - Original IBM PC
● Mac OS Roman character set
● TRS-80 character set
● Atari’s ATASCII
● Commodore’s PETSCII
● HP Roman-8 and Roman-9
● DEC’s Multinational Character Set
● Lotus International Character Set
● ECMA-94
31. “The Unicode Standard is the universal character
encoding standard for written characters and text. It
defines a consistent way of encoding multilingual text
that enables the exchange of text data internationally and
creates the foundation for global software”
32. Code Points
Problem:
How to accommodate larger character sets without wasting memory?
Solution:
Break the one-to-one correspondence between characters and
bits/encoding! Offer different ways to encode based on
different needs.
33. ASCII vs. Unicode
Character Encoded Bits
H 01001000 (0x48)
P 01010000 (0x50)
Glyph Code Point
P U+0050
LATIN CAPITAL LETTER P
H U+0048
LATIN CAPITAL LETTER H
Encoded Bits
????
????
34. Glyph Code Point Encoded Bits
P U+0050
LATIN CAPITAL LETTER P
????
h U+0068
LATIN SMALL LETTER H
????
Σ U+03A3
GREEK CAPITAL LETTER SIGMA
????
U+0634
ARABIC LETTER SHEEN
????
U+1D2ED
MAYAN NUMERAL
THIRTEEN
????
😸
U+1F638
GRINNING CAT FACE WITH
SMILING EYES
????
H U+0048
LATIN CAPITAL LETTER H
????
45. Recap
● Code Point: a number representing a single character*
○ 143,859 defined as of Unicode 13.0
○ Format: U+hhhhhh
● Codespace: A range of numerical values available for encoding characters
○ Support for 1,114,112 codepoints (0x000000 - 0x10FFFF)
● Code Planes: Contiguous group of 65,536 (2^16) code points
○ 17 planes, numbered 0 - 16, which correspond to the possible values 0x00 to 0x10 of the first
two positions in the six-position hexadecimal format (U+hhhhhh)
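Because each plane holds 65,536 code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch, with hypothetical helper names `unicodePlane` and `formatCodePoint`:

```php
<?php
// Hypothetical helper: returns the plane (0-16) that a code point belongs to.
function unicodePlane(int $codePoint): int
{
    if ($codePoint < 0 || $codePoint > 0x10FFFF) {
        throw new InvalidArgumentException('Outside the Unicode codespace');
    }

    return $codePoint >> 16; // same as intdiv($codePoint, 0x10000)
}

// Hypothetical helper: formats a code point in the U+hhhh notation
// (at least four hex digits, more when needed).
function formatCodePoint(int $codePoint): string
{
    return sprintf('U+%04X', $codePoint);
}

foreach ([0x0050, 0x03A3, 0x1F638, 0x10FFFF] as $cp) {
    printf("%s is in plane %d\n", formatCodePoint($cp), unicodePlane($cp));
}
// U+0050 is in plane 0
// U+03A3 is in plane 0
// U+1F638 is in plane 1
// U+10FFFF is in plane 16
```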
48. Character / Code Point:
a
U+0061
LATIN SMALL LETTER A
a a a a a a a a
Glyphs:
49. Glyphs and Graphemes
Glyph /
Grapheme c a f e
Unicode
Character
c a f e
Code Point
U+0063 U+0061 U+0066 U+0065
LATIN
SMALL
LETTER C
LATIN
SMALL
LETTER A
LATIN
SMALL
LETTER F
LATIN
SMALL
LETTER E
50. Glyphs and Graphemes: Combining Diacritical Marks
Glyph /
Grapheme c a f é
Unicode
Character
c a f e ◌́
Code Point
U+0063 U+0061 U+0066 U+0065 U+0301
LATIN
SMALL
LETTER C
LATIN
SMALL
LETTER A
LATIN
SMALL
LETTER F
LATIN
SMALL
LETTER E
COMBINING
ACUTE ACCENT
51. Glyphs and Graphemes: Combining Diacritical Marks
Glyph /
Grapheme c a f é
Unicode
Character
c a f e ◌́
Code Point
U+0063 U+0061 U+0066 U+0065 U+0301
LATIN
SMALL
LETTER C
LATIN
SMALL
LETTER A
LATIN
SMALL
LETTER F
LATIN
SMALL
LETTER E
COMBINING
ACUTE ACCENT
e + ◌́ = é
e
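The difference between bytes, code points, and graphemes can be checked from PHP. This is a sketch that assumes UTF-8 input and the mbstring and PCRE extensions; `countTextUnits` is a hypothetical helper name:

```php
<?php
// Hypothetical helper: counts the same string three different ways.
function countTextUnits(string $utf8): array
{
    return [
        'bytes'      => strlen($utf8),                  // raw UTF-8 bytes
        'codePoints' => mb_strlen($utf8, 'UTF-8'),      // Unicode code points
        'graphemes'  => preg_match_all('/\X/u', $utf8), // extended grapheme clusters
    ];
}

// "cafe" followed by U+0301 COMBINING ACUTE ACCENT (PHP 7+ \u{} escape).
$counts = countTextUnits("cafe\u{0301}");

printf(
    "%d bytes, %d code points, %d graphemes\n",
    $counts['bytes'],
    $counts['codePoints'],
    $counts['graphemes']
);
// prints "6 bytes, 5 code points, 4 graphemes"
```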
52. Glyphs and Graphemes: Combining Diacritical Marks
Z̷̧̨̰̋Å̸̮͉ ̵͉̣̄̇̀
L̵͉̣̄̇̀G
̸̮͉̊ O
̸̱͒̓ ̷̧̨̰̋Ț͝E̪̘̗̓͝X̪̘̗T
̸̰̺̝̍̈
53. Glyphs and Graphemes: Variation Selectors
Glyph /
Grapheme ✈
Unicode
Character
✈
Code Point
U+2708 U+FE0E
AIRPLANE
VARIATION
SELECTOR 15
(TEXT STYLE)
VS
15
54. Glyphs and Graphemes: Variation Selectors
Glyph /
Grapheme ✈
Unicode
Character
✈
Code Point
U+2708 U+FE0E
AIRPLANE
VARIATION
SELECTOR 15
(TEXT STYLE)
Glyph /
Grapheme
Unicode
Character
✈
Code Point
U+2708 U+FE0F
AIRPLANE
VARIATION
SELECTOR 16
(EMOJI STYLE)
VS
16
VS
15
55. Glyphs and Graphemes: Regional Indicator Symbols
Glyph /
Grapheme 🇺🇸
Unicode
Character
🇺 🇸
Code Point
U+1F1FA U+1F1F8
REGIONAL
INDICATOR
SYMBOL
LETTER U
REGIONAL
INDICATOR
SYMBOL
LETTER S
Glyph /
Grapheme 🇨🇦
Unicode
Character
🇨 🇦
Code Point
U+1F1E8 U+1F1E6
REGIONAL
INDICATOR
SYMBOL
LETTER C
REGIONAL
INDICATOR
SYMBOL
LETTER A
56. Glyphs and Graphemes: Modifiers
Glyph /
Grapheme
Unicode
Character
👋
Code Point
U+1F44B U+1F3FC
WAVING
HAND SIGN
EMOJI
MODIFIER
FITZPATRICK
TYPE-3
Glyph /
Grapheme
Unicode
Character
👋
Code Point
U+1F44B U+1F3FE
WAVING
HAND SIGN
EMOJI
MODIFIER
FITZPATRICK
TYPE-5
57. Glyphs and Graphemes: ZWJ Sequences
Glyph /
Grapheme
👨 👩 👶 👧
Unicode
Character
👨 👩 👶 👧
Code
Point
U+1F468 U+1F469 U+1F476 U+1F467
MAN WOMAN BABY GIRL
58. Glyphs and Graphemes: ZWJ Sequences
Glyph /
Grapheme
Unicode
Character
👨 👩 👶 👧
Code
Point
U+1F468 U+200D U+1F469 U+200D U+1F476 U+200D U+1F467
MAN
ZERO
WIDTH
JOINER
WOMAN
ZERO
WIDTH
JOINER
BABY
ZERO
WIDTH
JOINER
GIRL
ZWJ ZWJ ZWJ
65. Glyph Code Point Encoded Bits
P U+0050
LATIN CAPITAL LETTER P
????
h U+0068
LATIN SMALL LETTER H
????
Σ U+03A3
GREEK CAPITAL LETTER SIGMA
????
U+0634
ARABIC LETTER SHEEN
????
U+1D2ED
MAYAN NUMERAL
THIRTEEN
????
😸
U+1F638
GRINNING CAT FACE WITH
SMILING EYES
????
H U+0048
LATIN CAPITAL LETTER H
????
67. UTF-32
Fixed-byte encoding; 4 bytes per code point
Codepoint range Unicode scalar value (binary) Encoded bytes
U+0000..U+D7FF,
U+E000..U+10FFFF
xxxxxxxxxxxxxxxxxxxxx 00000000 000xxxxx xxxxxxxx xxxxxxxx
68. UTF-32
Fixed-byte encoding; 4 bytes per character
Codepoint range Unicode scalar value (binary) Encoded bytes
U+0000..U+D7FF,
U+E000..U+10FFFF
xxxxxxxxxxxxxxxxxxxxx 00000000 000xxxxx xxxxxxxx xxxxxxxx
Examples:
A
U+0041
LATIN CAPITAL A
0x0041 => 1000001 00000000 00000000 00000000 01000001
😸
U+1F638
GRINNING CAT WITH
SMILING EYES
0x1F638 => 11111011000111000 00000000 00000001 11110110 00111000
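The two examples above can be reproduced in PHP. This sketch assumes big-endian byte order (UTF-32BE), which matches the byte layout shown on the slide; `utf32beEncode` is a hypothetical helper name:

```php
<?php
// Hypothetical helper: encodes one code point as big-endian UTF-32.
function utf32beEncode(int $codePoint): string
{
    // pack('N') writes an unsigned 32-bit integer in big-endian byte order.
    return pack('N', $codePoint);
}

echo bin2hex(utf32beEncode(0x0041)), "\n";  // 00000041
echo bin2hex(utf32beEncode(0x1F638)), "\n"; // 0001f638

// Cross-check against mbstring's own converter.
var_dump(utf32beEncode(0x1F638) === mb_convert_encoding("\u{1F638}", 'UTF-32BE', 'UTF-8'));
// bool(true)
```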
69. UTF-16
Variable-length encoding; 2 or 4 bytes per character
Codepoint range Unicode scalar value (binary) Encoded bytes
U+0000..U+D7FF,
U+E000..U+FFFF
(Basic Multilingual Plane)
00000 xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
70. Example:
A
U+0041
LATIN CAPITAL A
0x0041 => 1000001 00000000 01000001
Variable-length encoding; 2 or 4 bytes per character
UTF-16
Codepoint range Unicode scalar value (binary) Encoded bytes
U+0000..U+D7FF,
U+E000..U+FFFF
(Basic Multilingual Plane)
00000 xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
71. UTF-16
Variable-length encoding; 2 or 4 bytes per character
Codepoint range Unicode scalar value (binary) Encoded bytes
U+0000..U+D7FF,
U+E000..U+FFFF
(Basic Multilingual Plane)
00000 xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
U+010000..U+10FFFF
(Supplementary Planes)
Uuuuu uuuuuuuu uuuuuuuu 110110xx xxxxxxxx 110111yy yyyyyyyy
72. U' = xxxxxxxxxxyyyyyyyyyy // U - 0x10000
W1 = 110110xxxxxxxxxx // 0xD800 + xxxxxxxxxx
W2 = 110111yyyyyyyyyy // 0xDC00 + yyyyyyyyyy
UTF-16
Variable-length encoding; 2 or 4 bytes per character
Codepoint range Unicode scalar value (binary) Encoded bytes
U+0000..U+D7FF,
U+E000..U+FFFF
(Basic Multilingual Plane)
00000 xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
U+010000..U+10FFFF
(Supplementary Planes)
Uuuuu uuuuuuuu uuuuuuuu 110110xx xxxxxxxx 110111yy yyyyyyyy
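The surrogate-pair arithmetic (U', W1, W2) can be sketched as follows. The helper name `utf16CodeUnits` is hypothetical, and the sketch does not validate that the input is a legal scalar value:

```php
<?php
// Hypothetical helper: returns the UTF-16 code units for one code point.
function utf16CodeUnits(int $codePoint): array
{
    if ($codePoint < 0x10000) {
        // Basic Multilingual Plane: a single 16-bit code unit.
        return [$codePoint];
    }

    $u  = $codePoint - 0x10000;   // U': a 20-bit value
    $w1 = 0xD800 + ($u >> 10);    // high surrogate carries the top 10 bits
    $w2 = 0xDC00 + ($u & 0x3FF);  // low surrogate carries the bottom 10 bits

    return [$w1, $w2];
}

// U+1F638 GRINNING CAT FACE WITH SMILING EYES
vprintf("%04X %04X\n", utf16CodeUnits(0x1F638)); // D83D DE38
```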
75. UTF-8
Variable-length encoding; 1-4 bytes per code point
Codepoint range Byte 1 Byte 2 Byte 3 Byte 4 Notes
U+0000..U+007F 0xxxxxxx Covers first 128 codepoints; ASCII
U+0080..U+07FF 110xxxxx 10xxxxxx (Almost) all Latin, Greek, Cyrillic,
Arabic, Hebrew, and more, plus
combining diacritical marks
U+0800..U+FFFF 1110xxxx 10xxxxxx 10xxxxxx Rest of the BMP, including most
Chinese, Japanese, and Korean
characters
U+10000..U+10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx All other planes, including historic
scripts, mathematical symbols,
and emoji.
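The table above translates almost directly into code. This is an illustrative sketch only (in practice `mb_chr()` or a `"\u{...}"` string escape does this for you); `utf8Encode` is a hypothetical name, and the sketch does not reject surrogates or values above U+10FFFF:

```php
<?php
// Hypothetical helper: encodes one code point as UTF-8 by hand.
function utf8Encode(int $cp): string
{
    if ($cp < 0x80) {
        // 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {
        // 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        // 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }

    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(utf8Encode(0x0050)), "\n";  // 50
echo bin2hex(utf8Encode(0x03A3)), "\n";  // cea3
echo bin2hex(utf8Encode(0x1F638)), "\n"; // f09f98b8
```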
76. UTF-8
Trick 1: ASCII === UTF-8
Codepoint range Byte 1 Byte 2 Byte 3 Byte 4 Notes
U+0000..U+007F 0xxxxxxx Covers first 128 codepoints; ASCII
U+0080..U+07FF 110xxxxx 10xxxxxx (Almost) all Latin, Greek, Cyrillic,
Arabic, Hebrew, and more, plus
combining diacritical marks
U+0800..U+FFFF 1110xxxx 10xxxxxx 10xxxxxx Rest of the BMP, including most
Chinese, Japanese, and Korean
characters
U+10000..U+10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx All other planes, including historic
scripts, mathematical symbols,
and emoji.
77. UTF-8
Trick 2: Virtually all languages only need 1, 2, or 3 bytes
Codepoint range Byte 1 Byte 2 Byte 3 Byte 4 Notes
U+0000..U+007F 0xxxxxxx Covers first 128 codepoints; ASCII
U+0080..U+07FF 110xxxxx 10xxxxxx (Almost) all Latin, Greek, Cyrillic,
Arabic, Hebrew, and more, plus
combining diacritical marks
U+0800..U+FFFF 1110xxxx 10xxxxxx 10xxxxxx Rest of the BMP, including most
Chinese, Japanese, and Korean
characters
U+10000..U+10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx All other planes, including historic
scripts, mathematical symbols,
and emoji.
78. UTF-8
Trick 3: First byte tells you the length
Codepoint range Byte 1 Byte 2 Byte 3 Byte 4 Notes
U+0000..U+007F 0xxxxxxx Covers first 128 codepoints; ASCII
U+0080..U+07FF 110xxxxx 10xxxxxx (Almost) all Latin, Greek, Cyrillic,
Arabic, Hebrew, and more, plus
combining diacritical marks
U+0800..U+FFFF 1110xxxx 10xxxxxx 10xxxxxx Rest of the BMP, including most
Chinese, Japanese, and Korean
characters
U+10000..U+10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx All other planes, including historic
scripts, mathematical symbols,
and emoji.
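Trick 3 means a decoder can tell how many bytes to read by inspecting only the lead byte. A minimal sketch, with the hypothetical helper name `utf8SequenceLength`:

```php
<?php
// Hypothetical helper: returns how many bytes the UTF-8 sequence starting
// with this byte occupies, or 0 if the byte cannot start a sequence.
function utf8SequenceLength(string $firstByte): int
{
    $b = ord($firstByte);

    if ($b < 0x80) {
        return 1; // 0xxxxxxx
    }
    if (($b & 0xE0) === 0xC0) {
        return 2; // 110xxxxx
    }
    if (($b & 0xF0) === 0xE0) {
        return 3; // 1110xxxx
    }
    if (($b & 0xF8) === 0xF0) {
        return 4; // 11110xxx
    }

    return 0; // 10xxxxxx continuation byte, or an invalid lead byte
}

echo utf8SequenceLength("\u{1F638}"), "\n"; // 4
```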
79. UTF-8
Trick 4: Self-synchronization
Codepoint range Byte 1 Byte 2 Byte 3 Byte 4 Notes
U+0000..U+007F 0xxxxxxx Covers first 128 codepoints; ASCII
U+0080..U+07FF 110xxxxx 10xxxxxx (Almost) all Latin, Greek, Cyrillic,
Arabic, Hebrew, and more, plus
combining diacritical marks
U+0800..U+FFFF 1110xxxx 10xxxxxx 10xxxxxx Rest of the BMP, including most
Chinese, Japanese, and Korean
characters
U+10000..U+10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx All other planes, including historic
scripts, mathematical symbols,
and emoji.
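Self-synchronization means that from any byte offset you can find the start of the current code point by stepping back over continuation bytes (`10xxxxxx`). A sketch assuming valid UTF-8 and an in-range offset; `utf8CodePointStart` is a hypothetical name:

```php
<?php
// Hypothetical helper: given any byte offset into a valid UTF-8 string,
// returns the offset where the code point containing that byte begins.
function utf8CodePointStart(string $utf8, int $offset): int
{
    // Continuation bytes always match 10xxxxxx; lead bytes never do.
    while ($offset > 0 && (ord($utf8[$offset]) & 0xC0) === 0x80) {
        $offset--;
    }

    return $offset;
}

$text = "a\u{1F638}b"; // bytes: 61 | F0 9F 98 B8 | 62

echo utf8CodePointStart($text, 3), "\n"; // 1 (offset 3 is inside the emoji)
echo utf8CodePointStart($text, 5), "\n"; // 5 (offset 5 is the letter b)
```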
80. UTF-8
Trick 5: No 0x00 bytes, except for NUL
Codepoint range Byte 1 Byte 2 Byte 3 Byte 4 Notes
U+0000..U+007F 0xxxxxxx Covers first 128 codepoints; ASCII
U+0080..U+07FF 110xxxxx 10xxxxxx (Almost) all Latin, Greek, Cyrillic,
Arabic, Hebrew, and more, plus
combining diacritical marks
U+0800..U+FFFF 1110xxxx 10xxxxxx 10xxxxxx Rest of the BMP, including most
Chinese, Japanese, and Korean
characters
U+10000..U+10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx All other planes, including historic
scripts, mathematical symbols,
and emoji.
81. UTF Encoding Summary
UTF-32 UTF-16 UTF-8
Encoding length Fixed Variable Variable
4 bytes per code
point
2 or 4 bytes per
code point
1-4 bytes per code
point
Memory-efficient No Somewhat Yes
CPU-efficient Yes Somewhat Somewhat
Self-synchronizing No Yes Yes
Contains null
(0x00) bytes
Yes Yes No
ASCII-compatible No No Yes
84. Handling Text In Programming Languages
1. Treat text as a sequence of bytes (PHP, C)
$smile = "\xF0\x9F\x98\x80";
echo $smile; // => '😀'
echo strlen($smile); // => 4
2. Treat text as a sequence of Unicode code points (Python 3)
3. Treat text as a sequence of UTF-16 code units (JavaScript, C#)
const smile = '\uD83D\uDE00';
console.log(smile); // => '😀'
console.log(smile.length); // => 2
85. PHP Strings
Be careful!
● Strings are simply byte sequences
● Encoding-agnostic
● Some (not all) string functions assume a fixed-width, single-byte encoding such as ASCII
86. PHP String Functions
Function What It Actually Does
strlen() Counts the length in bytes
str_replace() Replaces bytes
substr() Returns a subset of bytes
strtoupper() Converts alphabetic ASCII bytes to uppercase based on
globally-set locale
Works for ASCII; not entirely safe* for Unicode!
87. ext/mbstring
Provides multibyte-safe string functions
Standard Function mbstring Alternative
strlen() mb_strlen()
str_replace() (none)
substr() mb_substr()
strtoupper() mb_strtoupper()
Tip: All functions accept an
optional parameter to specify
the encoding; if omitted, the internal encoding
(mb_internal_encoding()) is used, not auto-detected.
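A short comparison of the byte-oriented functions and their mbstring counterparts. This sketch assumes UTF-8 input and passes the encoding explicitly; `truncateChars` is a hypothetical helper name:

```php
<?php
// Hypothetical helper: truncates by code points rather than bytes.
function truncateChars(string $utf8, int $maxChars): string
{
    return mb_substr($utf8, 0, $maxChars, 'UTF-8');
}

$text = "gr\u{FC}\u{DF}e"; // "grüße": ü and ß take two bytes each in UTF-8

echo strlen($text), "\n";             // 7 (bytes)
echo mb_strlen($text, 'UTF-8'), "\n"; // 5 (code points)

// substr() cuts in the middle of "ü" and leaves invalid UTF-8 behind.
var_dump(mb_check_encoding(substr($text, 0, 3), 'UTF-8')); // bool(false)

// mb_substr() cuts on code point boundaries.
echo truncateChars($text, 3), "\n";   // grü
```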
88. ext/mbstring
Provides multibyte-safe string functions
mb_convert_case(string $string, int $mode, ?string $encoding = null): string
Input $string $mode Output
Mary had a little lamb
MB_CASE_UPPER MARY HAD A LITTLE LAMB
MB_CASE_LOWER mary had a little lamb
MB_CASE_TITLE Mary Had A Little Lamb
MB_CASE_FOLD mary had a little lamb
89. ext/mbstring
Provides multibyte-safe string functions
mb_convert_case(string $string, int $mode, ?string $encoding = null): string
Input $string $mode Output
Ich grüße den Mann
(I greet the man)
MB_CASE_UPPER ICH GRÜSSE DEN MANN
MB_CASE_LOWER ich grüße den mann
MB_CASE_TITLE Ich Grüße Den Mann
MB_CASE_FOLD ich grüsse den mann
90. ext/pcre
Enable UTF-8 support with u modifier: preg_match('/foo/u')
Match a character with a Unicode property: \p{xx} (37 different codes)
Property Code Matches Example
L Any letter \p{L}
Ll Lower case letter \p{Ll}
Lu Upper case letter \p{Lu}
Lm Modifier letter \p{Lm}
Lt Title case letter \p{Lt}
Lo Other letter \p{Lo}
Property Code Matches Example
S Any symbol \p{S}
Sc Currency symbol \p{Sc}
Sk Modifier symbol \p{Sk}
Sm Mathematical symbol \p{Sm}
So Other symbol \p{So}
91. Enable UTF-8 support with u modifier: preg_match('/foo/u')
Match a character with a Unicode property: \p{xx} (37 different codes)
Match a character with a Unicode script: \p{xxxx} (102 different scripts)
Examples: \p{Greek} or \p{Egyptian_Hieroglyphs}
ext/pcre
92. Enable UTF-8 support with u modifier: preg_match('/foo/u')
Match a character with a Unicode property: \p{xx} (37 different codes)
Match a character with a Unicode script: \p{xxxx} (102 different scripts)
Match a character without a Unicode property: \P{xx}
ext/pcre
93. Enable UTF-8 support with u modifier: preg_match('/foo/u')
Match a character with a Unicode property: \p{xx} (37 different codes)
Match a character with a Unicode script: \p{xxxx} (102 different scripts)
Match a character without a Unicode property: \P{xx}
Match a Unicode extended grapheme cluster: \X
Think of it like a . but for multiple characters
that combine into a single glyph
ext/pcre
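A few of these patterns in use. The helper names are hypothetical and the input is assumed to be valid UTF-8 (with the u modifier, the preg functions fail on invalid input):

```php
<?php
// Hypothetical helper: does the string begin with an uppercase letter
// in any script? Uses the Lu (uppercase letter) property.
function startsWithUppercase(string $utf8): bool
{
    return preg_match('/^\p{Lu}/u', $utf8) === 1;
}

// Hypothetical helper: does the string contain any Greek-script character?
function containsGreek(string $utf8): bool
{
    return preg_match('/\p{Greek}/u', $utf8) === 1;
}

// Hypothetical helper: removes everything that is NOT a letter (\P negates \p).
function stripNonLetters(string $utf8): string
{
    return preg_replace('/\P{L}+/u', '', $utf8) ?? '';
}

var_dump(startsWithUppercase("\u{3A3}igma"));   // bool(true): U+03A3 is an uppercase letter
var_dump(containsGreek('plain ascii'));         // bool(false)
echo stripNonLetters("caf\u{E9} #1!"), "\n";    // café
```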
94. ext/intl - IntlChar class
var_dump(IntlChar::charName('⛄'));
// string(20) "SNOWMAN WITHOUT SNOW"
$name = "RECYCLING SYMBOL FOR TYPE-1 PLASTICS";
var_dump(IntlChar::charFromName($name));
// int(9843)
var_dump(IntlChar::isupper("A"));
// bool(true)
95. ext/intl - Normalizer class
1. U+01FA - “Precomposed” character (LATIN CAPITAL
LETTER A WITH RING ABOVE AND ACUTE)
2. A + U+030A + U+0301 - A base letter A followed by two
combining marks (U+030A COMBINING RING ABOVE
and U+0301 COMBINING ACUTE ACCENT)
3. U+00C5 + U+0301 - An accented letter (U+00C5 LATIN
CAPITAL LETTER A WITH RING ABOVE) followed by a
combining accent (U+0301 COMBINING ACUTE
ACCENT)
4. U+212B + U+0301 - A compatibility character (U+212B
ANGSTROM SIGN) followed by a combining accent
(U+0301 COMBINING ACUTE ACCENT)
Ǻ
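The four sequences listed above should all normalize to the same precomposed character under NFC. A sketch that assumes ext/intl is loaded; `toNfc` is a hypothetical wrapper that returns null when the extension is missing:

```php
<?php
// Hypothetical wrapper around ext/intl's Normalizer (NFC form).
function toNfc(string $utf8): ?string
{
    if (!class_exists('Normalizer')) {
        return null; // ext/intl is not loaded
    }

    $result = Normalizer::normalize($utf8, Normalizer::FORM_C);

    return $result === false ? null : $result;
}

$forms = [
    'U+01FA'                => "\u{01FA}",
    'A + U+030A + U+0301'   => "A\u{030A}\u{0301}",
    'U+00C5 + U+0301'       => "\u{00C5}\u{0301}",
    'U+212B + U+0301'       => "\u{212B}\u{0301}",
];

foreach ($forms as $label => $text) {
    $nfc = toNfc($text);
    printf(
        "%-20s => %s\n",
        $label,
        $nfc === null ? '(ext/intl unavailable)' : strtoupper(bin2hex($nfc))
    );
}
```

Normalizing both sides before comparing or storing strings avoids treating visually identical text as different.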
100. ext/iconv - iconv() function to convert encodings
$text = "This is the Euro symbol '€'."; // UTF-8 string
101. ext/iconv - iconv() function to convert encodings
$text = "This is the Euro symbol '€'."; // UTF-8 string
echo iconv("UTF-8", "ISO-8859-1", $text), PHP_EOL;
// Notice: iconv(): Detected an illegal character in input string
102. ext/iconv - iconv() function to convert encodings
$text = "This is the Euro symbol '€'."; // UTF-8 string
echo iconv("UTF-8", "ISO-8859-1", $text), PHP_EOL;
// Notice: iconv(): Detected an illegal character in input string
echo iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text), PHP_EOL;
// This is the Euro symbol 'EUR'.
echo iconv("UTF-8", "ISO-8859-1//IGNORE", $text), PHP_EOL;
// This is the Euro symbol ''.
103. PHP Extension Summary
ext/iconv: Convert between encodings
ext/mbstring: Work with multi-byte string encodings like UTF-8
ext/pcre: Special UTF-compatible matching when /u modifier enabled
ext/intl: Work with individual codepoints and graphemes
105. Disclaimer
Clever hacks and micro-optimizations are usually unnecessary and can be
detrimental to long-term maintenance!
Don’t use these unless you absolutely need them.
106. Taking Advantage of UTF-Encoded Bytes
PHP string functions can still be used in some cases:
if (str_contains($utf8, '&')) { … }
$trimmed = trim($utf8);
$firstChar = substr($utf32, 0, 4);
Requires solid understanding of UTF encodings and what the functions do
Don’t be clever unless there’s a clear advantage!
107. Splitting Strings Into Codepoints
mb_str_split($str) - returns array of individual codepoints (PHP 7.4+)
UTF-8 polyfill for older versions: preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY)
(Works for codepoints, not graphemes)
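Both approaches can be wrapped in one function. A sketch assuming UTF-8 input; `splitCodePoints` is a hypothetical helper name:

```php
<?php
// Hypothetical helper: splits a UTF-8 string into individual code points.
function splitCodePoints(string $utf8): array
{
    if (function_exists('mb_str_split')) {
        return mb_str_split($utf8, 1, 'UTF-8'); // PHP 7.4+
    }

    // Without PREG_SPLIT_NO_EMPTY the result would begin and end with
    // an empty string.
    return preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(array_map('bin2hex', splitCodePoints("a\u{F1}\u{1F638}")));
// 61, c3b1, f09f98b8

// Code points, not graphemes: "e" + U+0301 still splits into two elements.
echo count(splitCodePoints("e\u{0301}")), "\n"; // 2
```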
108. ASCII-Only UTF-8 Strings
Is a UTF-8 string pure ASCII? If so, no need for (slower) mbstring functions:
$isAscii = mb_detect_encoding($str, 'ASCII', true) !== false;
Micro-optimization (2x faster):
$isAscii = strlen($str) === mb_strlen($str);
Speed is fractions of milliseconds; micro-optimization only
important for parsing-heavy applications
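Both checks as functions, for comparison. The names are hypothetical, and the length-based check assumes the input is valid UTF-8:

```php
<?php
// Hypothetical helper: strict detection against ASCII only.
// mb_detect_encoding() returns the encoding name or false.
function isAsciiByDetection(string $str): bool
{
    return mb_detect_encoding($str, 'ASCII', true) !== false;
}

// Hypothetical helper: in valid UTF-8 only ASCII characters occupy a single
// byte, so byte count equals code point count only for pure-ASCII strings.
function isAsciiByLength(string $str): bool
{
    return strlen($str) === mb_strlen($str, 'UTF-8');
}

var_dump(isAsciiByDetection('hello'), isAsciiByLength('hello'));
// bool(true) bool(true)
var_dump(isAsciiByDetection("h\u{E9}llo"), isAsciiByLength("h\u{E9}llo"));
// bool(false) bool(false)
```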
109. Writing Silly Code
PHP supports Unicode in variable and function names:
class (╯°□°)╯︵┻━┻ extends Exception {}
throw new (╯°□°)╯︵┻━┻;
110. Writing Silly Code
PHP supports Unicode in variable and function names:
class (╯°□°)╯︵┻━┻ extends Exception {}
throw new (╯°□°)╯︵┻━┻;
Uses U+FF08 FULLWIDTH LEFT PARENTHESIS and U+FF09 FULLWIDTH RIGHT
PARENTHESIS since normal parens (U+0028/U+0029) are not allowed here.
111. Writing Silly Code (Don’t Do This)
PHP supports Unicode in variable and function names:
class (╯°□°)╯︵┻━┻ extends Exception {}
throw new (╯°□°)╯︵┻━┻;
$👉😎👉 = "Ann Perkins!"; // Parks and Rec reference
112. Writing Silly Code (Seriously, Don’t Do This)
PHP supports Unicode in variable and function names:
class (╯°□°)╯︵┻━┻ extends Exception {}
throw new (╯°□°)╯︵┻━┻;
$👉😎👉 = "Ann Perkins!"; // Parks and Rec reference
$you can use = 'U+2000 EN QUAD whitespace';
114. Recap & Recommendations
● Unicode supports virtually every known modern and historic writing system
● Codepoints != Glyphs/Graphemes != Encoding
● Use and support UTF-8 everywhere, especially for user input
● PHP strings are just raw bytes
● Use mbstring functions
Simple device
Type a key, sends some numbers, same letter comes out the other side
But there needs to be a standard
Developed in 1960s for teleprinters (“Teletype”) and early computers
7-bit: each letter you type in gets converted into 7 bits
Support for:
Upper and lowercase letters
Numbers
Basic, common symbols
More control codes (CR, LF, BS, HT, BEL)
(next for examples)
(how to encode/decode)
Something really clever going on here
Group by first two bits
4 “pages” or sections, 32 chars each
Letters in alphabetical order, starting at 1 (not random)
Even more clever - converting between upper and lowercase by changing one bit
“Extended ASCII” sounds like a standard, but it’s not
AKA Latin 1 for the Americas, Western Europe, Oceania, and much of Africa
Superset/extension of ISO 8859-1
Adds curly quotation marks
De-facto standard for Windows
Aka Latin 2 for Central or Eastern European Languages
UI graphics, science, and math
Standard EGA VGA encoding on gfx cards
That’s a lot! However,
In practice, most users only used one standard locally. Which was fine...
Standards proliferation
(Problem) You could add more bits, but that wasted computing resources (which were scarce at the time) for users who only needed Latin or ASCII-like characters
ATTN: 4 vs 5 char convention
Support for 1,114,112 codepoints (0x000000 - 0x10FFFF)
Code Planes: Contiguous group of 65,536 (2^16) code points. There are 17 planes, identified by the numbers 0 to 16, which correspond to the possible values 0x00 to 0x10 of the first two positions in the six-position hexadecimal format (U+hhhhhh)
Codespace: entire range of numerical values available for encoding characters
Unicode does not specify how the character / code point should be displayed (or encoded)!
Combining Diacritical Marks
In this example: 5 code points but 4 graphemes
GRAPHEME = smallest unit of a writing system
Think about putting cursor in this text and selecting something or pressing backspace
“Zalgo text” or “glitch text”
Windows supports 52,000 family combinations
If system lacks dedicated image, individual emojis are shown
Pros: Code points always use the same number of bytes; very straightforward
Cons: not very memory efficient, can contain null bytes, not self-synchronizing
BMP = basically everything except emojis and historical scripts
“Surrogate pairs”; values are reserved, no code points with those values
Pros: more memory efficient (most of the time), works well for BMP; is self-synchronizing
Cons: 4-byte encoding logic somewhat messy; can contain null bytes
This symbol can be encoded 4 different ways
Intl normalizer class
In UTF-8: 3 bytes for snowman, 1 for space, 1 for each letter c a f e, and 2 for the combining acute accent mark (U+0301)