Demystifying Unicode
@colinodell
Colin O’Dell
● Principal Engineer at Unleashed Technologies
● PHP for ~20 years; 13 years professionally
● Creator & maintainer of league/commonmark library
● PHP League leadership team
● Owner of moderngeekware.com
● @colinodell
Agenda
● A History of Encoding Systems
● Unicode Standard
● Unicode Encodings
● Using Unicode in PHP
● Tips & Tricks
● Questions & Answers
Assumptions
● Some familiarity with PHP
● Basic understanding of binary and hexadecimal
● Focus on high-level concepts!
Encoding Systems
L → 1001100
A (Brief) History of Encoding Systems
1837: Morse Code (Internationalized in 1844)
“Morse-Vail Telegraph Key” by the National Museum of American History is licenced under CC BY-NC 2.0
1930s: Teleprinters
1960s: Teletypes (TTYs) For Computing
1960s: ASCII
● American Standard Code for Information Interchange
● 7-bit binary encoding
○ 0000000 = 0
○ ...
○ 1111111 = 127
0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ DEL
Character       Hex   Binary
LF (line feed)  0x0A  0001010
3               0x33  0110011
E               0x45  1000101
e               0x65  1100101
00xxxxx = 32 control codes
01xxxxx = 32 numbers & symbols
10xxxxx = 32 uppercase letters and some extra symbols
11xxxxx = 32 lowercase letters and some extra symbols
A = 0x41 = 1000001
B = 0x42 = 1000010
…
Z = 0x5A = 1011010
a = 0x61 = 1100001
b = 0x62 = 1100010
…
z = 0x7A = 1111010
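The bit patterns above show that an uppercase letter and its lowercase counterpart differ in a single bit (0x20). A minimal PHP sketch of that observation, assuming plain ASCII input; the helper name toggleAsciiCase() is made up for illustration and is not part of PHP:

```php
<?php
// Hypothetical helper: flips the case of one ASCII letter by toggling bit 0x20.
function toggleAsciiCase(string $char): string
{
    $code = ord($char);
    $isLetter = ($code >= 0x41 && $code <= 0x5A) || ($code >= 0x61 && $code <= 0x7A);

    // Anything that is not A-Z or a-z is returned unchanged.
    return $isLetter ? chr($code ^ 0x20) : $char;
}

printf("%s = 0x%02X = %07b\n", 'A', ord('A'), ord('A')); // A = 0x41 = 1000001
printf("%s = 0x%02X = %07b\n", 'a', ord('a'), ord('a')); // a = 0x61 = 1100001
echo toggleAsciiCase('A'), toggleAsciiCase('z'), toggleAsciiCase('3'), "\n"; // aZ3
```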
But computers use 8-bit bytes...
ASCII (7 Bits) ???
Start 00000000 10000000
End 01111111 11111111
Count 128 128
7-bit ASCII: rows 0 to 7 (0x00-0x7F) of the table above
8-bit “Extended ASCII”: rows 8 to F (0x80-0xFF) = ???
ISO 8859-1
0 1 2 3 4 5 6 7 8 9 A B C D E F
0
1
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~
8
9
A NBSP ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ SHY ® ¯
B ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
C À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
D Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
E à á â ã ä å æ ç è é ê ë ì í î ï
F ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ DEL
8 € ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ Ž
9 ‘ ’ “ ” • – — ˜ ™ š › œ ž Ÿ
A NBSP ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ SHY ® ¯
B ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
C À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
D Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
E à á â ã ä å æ ç è é ê ë ì í î ï
F ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
Windows-1252
ISO 8859-2
0 1 2 3 4 5 6 7 8 9 A B C D E F
0
1
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~
8
9
A NBSP Ą ˘ Ł ¤ Ľ Ś § ¨ Š Ş Ť Ź SHY Ž Ż
B ° ą ˛ ł ´ ľ ś ˇ ¸ š ş ť ź ˝ ž ż
C Ŕ Á Â Ă Ä Ĺ Ć Ç Č É Ę Ë Ě Í Î Ď
D Đ Ń Ň Ó Ô Ő Ö × Ř Ů Ú Ű Ü Ý Ţ ß
E ŕ á â ă ä ĺ ć ç č é ę ë ě í î ď
F đ ń ň ó ô ő ö ÷ ř ů ú ű ü ý ţ ˙
Code Page 437 (IBM PC)
0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL ☺ ☻ ♥ ♦ ♣ ♠ • ◘ ○ ◙ ♂ ♀ ♪ ♫ ☼
1 ► ◄ ↕ ‼ ¶ § ▬ ↨ ↑ ↓ → ← ∟ ↔ ▲ ▼
2 SPACE ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ ⌂
8 Ç ü é â ä à å ç ê ë è ï î ì Ä Å
9 É æ Æ ô ö ò û ù ÿ Ö Ü ¢ £ ¥ ₧ ƒ
A á í ó ú ñ Ñ ª º ¿ ⌐ ¬ ½ ¼ ¡ « »
B ░ ▒ ▓ │ ┤ ╡ ╢ ╖ ╕ ╣ ║ ╗ ╝ ╜ ╛ ┐
C └ ┴ ┬ ├ ─ ┼ ╞ ╟ ╚ ╔ ╩ ╦ ╠ ═ ╬ ╧
D ╨ ╤ ╥ ╙ ╘ ╒ ╓ ╫ ╪ ┘ ┌ █ ▄ ▌ ▐ ▀
E α ß Γ π Σ σ µ τ Φ Θ Ω δ ∞ φ ε ∩
F ≡ ± ≥ ≤ ⌠ ⌡ ÷ ≈ ° ∙ · √ ⁿ ² ■ NBSP
8-bit “Extended ASCII”
● ISO 8859 - 16 variations:
○ ISO 8859-1 (“Latin 1”, Western European)
○ ISO 8859-2 (“Latin 2”, Central European)
○ ISO 8859-3 (“Latin 3”, South European)
○ ISO 8859-4 (“Latin 4”, North European)
○ ISO 8859-5 (Latin/Cyrillic)
○ ISO 8859-6 (Latin/Arabic)
○ ISO 8859-7 (Latin/Greek)
○ ISO 8859-8 (Latin/Hebrew)
○ ISO 8859-9 (“Latin 5”, Turkish)
○ ISO 8859-10 (“Latin 6”, Nordic)
○ ISO 8859-11 (Latin/Thai)
○ ISO 8859-12 (Latin/Devanagari) - abandoned
○ ISO 8859-13 (“Latin 7”, Baltic Rim)
○ ISO 8859-14 (“Latin 8”, Celtic)
○ ISO 8859-15 (“Latin 9”)
■ Revision of 8859-1 that swaps out less-used characters and adds the euro currency symbol
○ ISO 8859-16 (“Latin 10”, South-Eastern European)
● Windows-1252
● CP 437 - Original IBM PC
● Mac OS Roman character set
● TRS-80 character set
● Atari’s ATASCII
● Commodore’s PETSCII
● HP Roman-8 and Roman-9
● DEC’s Multinational Character Set
● Lotus International Character Set
● ECMA-94
But then along came the Internet...
https://xkcd.com/927/
“The Unicode Standard is the universal character
encoding standard for written characters and text. It
defines a consistent way of encoding multilingual text
that enables the exchange of text data internationally and
creates the foundation for global software”
Code Points
Problem:
How to accommodate larger character sets without wasting memory?
Solution:
Break the one-to-one correspondence between characters and
bits/encoding! Offer different ways to encode based on
different needs.
ASCII vs. Unicode
ASCII:
Character  Encoded Bits
H          01001000 (0x48)
P          01010000 (0x50)
Unicode:
Glyph  Code Point  Name                                  Encoded Bits
P      U+0050      LATIN CAPITAL LETTER P                ????
h      U+0068      LATIN SMALL LETTER H                  ????
Σ      U+03A3      GREEK CAPITAL LETTER SIGMA            ????
ش      U+0634      ARABIC LETTER SHEEN                   ????
𝋭      U+1D2ED     MAYAN NUMERAL THIRTEEN                ????
😸     U+1F638     GRINNING CAT FACE WITH SMILING EYES   ????
H      U+0048      LATIN CAPITAL LETTER H                ????
D      U+0044      LATIN CAPITAL LETTER D
😄     U+1F604     SMILING FACE WITH OPEN MOUTH AND SMILING EYES
Code Planes
Recap
● Code Point: a number representing a single character*
○ 143,859 defined as of Unicode 13.0
○ Format: U+hhhhhh
● Codespace: A range of numerical values available for encoding characters
○ Support for 1,114,112 codepoints (0x000000 - 0x10FFFF)
● Code Planes: Contiguous groups of 65,536 (2^16) code points
○ 17 planes, numbered 0 - 16, corresponding to the possible values 0x00 to 0x10 of the first two positions in the six-position hexadecimal format (U+hhhhhh)
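To make the code point and plane arithmetic concrete, here is a small PHP sketch. It assumes ext/mbstring is loaded and that the input is UTF-8; describeCodePoint() is an illustrative name, not an existing function:

```php
<?php
// Hypothetical helper: reports the code point and plane of the first character.
function describeCodePoint(string $char): string
{
    $codePoint = mb_ord($char, 'UTF-8');
    $plane = $codePoint >> 16; // each plane holds 0x10000 (65,536) code points

    return sprintf('U+%04X is in plane %d', $codePoint, $plane);
}

echo describeCodePoint('P'), "\n";          // U+0050 is in plane 0
echo describeCodePoint("\u{03A3}"), "\n";   // U+03A3 is in plane 0
echo describeCodePoint("\u{1F638}"), "\n";  // U+1F638 is in plane 1
```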
Glyphs and Graphemes
Character / Code Point: a (U+0061 LATIN SMALL LETTER A)
Glyphs: a a a a a a a a
Glyphs and Graphemes
Glyph / Grapheme  Unicode Character  Code Point
c                 c                  U+0063 LATIN SMALL LETTER C
a                 a                  U+0061 LATIN SMALL LETTER A
f                 f                  U+0066 LATIN SMALL LETTER F
e                 e                  U+0065 LATIN SMALL LETTER E
Glyphs and Graphemes: Combining Diacritical Marks
Glyph / Grapheme  Unicode Character  Code Point
c                 c                  U+0063 LATIN SMALL LETTER C
a                 a                  U+0061 LATIN SMALL LETTER A
f                 f                  U+0066 LATIN SMALL LETTER F
é                 e                  U+0065 LATIN SMALL LETTER E
                  ◌́                  U+0301 COMBINING ACUTE ACCENT
e + ◌́ = é
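A short PHP sketch of what the combining mark means for string handling, assuming ext/mbstring. The precomposed U+00E9 is not on the slide; it is added here only for contrast:

```php
<?php
$decomposed  = "cafe\u{0301}"; // e followed by U+0301 COMBINING ACUTE ACCENT
$precomposed = "caf\u{00E9}";  // U+00E9 LATIN SMALL LETTER E WITH ACUTE (assumption, for contrast)

var_dump($decomposed === $precomposed);      // bool(false): the bytes differ
echo mb_strlen($decomposed, 'UTF-8'), "\n";  // 5 code points
echo mb_strlen($precomposed, 'UTF-8'), "\n"; // 4 code points
echo bin2hex($decomposed), "\n";             // 63616665cc81
echo bin2hex($precomposed), "\n";            // 636166c3a9
```

Both strings usually render identically, which is why comparing or counting them byte by byte can surprise you.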
Glyphs and Graphemes: Combining Diacritical Marks
Z̷̧̨̰̋Å̸̮͉ ̵͉̣̄̇̀
L̵͉̣̄̇̀G
̸̮͉̊ O
̸̱͒̓ ̷̧̨̰̋Ț͝E̪̘̗̓͝X̪̘̗T
̸̰̺̝̍̈
Glyphs and Graphemes: Variation Selectors
Glyph / Grapheme   Code Points
✈︎ (text style)     U+2708 AIRPLANE + U+FE0E VARIATION SELECTOR-15 (VS15, text style)
✈️ (emoji style)    U+2708 AIRPLANE + U+FE0F VARIATION SELECTOR-16 (VS16, emoji style)
Glyphs and Graphemes: Regional Indicator Symbols
Glyph / Grapheme  Unicode Characters  Code Points
🇺🇸                🇺 + 🇸               U+1F1FA REGIONAL INDICATOR SYMBOL LETTER U + U+1F1F8 REGIONAL INDICATOR SYMBOL LETTER S
🇨🇦                🇨 + 🇦               U+1F1E8 REGIONAL INDICATOR SYMBOL LETTER C + U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A
Glyphs and Graphemes: Modifiers
Glyph / Grapheme  Code Points
👋🏼                U+1F44B WAVING HAND SIGN + U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3
👋🏾                U+1F44B WAVING HAND SIGN + U+1F3FE EMOJI MODIFIER FITZPATRICK TYPE-5
Glyphs and Graphemes: ZWJ Sequences
Without joiners (four glyphs): 👨 👩 👶 👧
U+1F468 MAN, U+1F469 WOMAN, U+1F476 BABY, U+1F467 GIRL
Joined with U+200D ZERO WIDTH JOINER (ZWJ):
U+1F468 MAN + U+200D ZWJ + U+1F469 WOMAN + U+200D ZWJ + U+1F476 BABY + U+200D ZWJ + U+1F467 GIRL
More ZWJ sequences:
U+1F477 CONSTRUCTION WORKER + U+200D ZWJ + U+2642 MALE SIGN
U+1F477 CONSTRUCTION WORKER + U+200D ZWJ + U+2640 FEMALE SIGN
U+1F477 CONSTRUCTION WORKER + U+1F3FE EMOJI MODIFIER FITZPATRICK TYPE-5 + U+200D ZWJ + U+2640 FEMALE SIGN
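A sketch of how such a sequence looks from PHP, assuming ext/mbstring and PHP 7.4+ (for mb_str_split()). It builds the MAN / WOMAN / BABY / GIRL sequence from the slide; whether it renders as one glyph depends on the platform's fonts:

```php
<?php
// Join the four emoji with U+200D ZERO WIDTH JOINER.
$family = implode("\u{200D}", ["\u{1F468}", "\u{1F469}", "\u{1F476}", "\u{1F467}"]);

// List the individual code points that make up the sequence.
foreach (mb_str_split($family, 1, 'UTF-8') as $char) {
    printf("U+%04X\n", mb_ord($char, 'UTF-8'));
}

echo mb_strlen($family, 'UTF-8'), "\n"; // 7 code points (4 emoji + 3 joiners)
echo strlen($family), "\n";             // 25 bytes in UTF-8 (4 x 4 + 3 x 3)
```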
Enough about code points...
Encoding Schemes
● Most popular:
○ UTF-8
○ UTF-16
○ UTF-32
UTF-32
Fixed-width encoding; 4 bytes per code point
Codepoint range                     Unicode scalar value (binary)  Encoded bytes
U+0000..U+D7FF, U+E000..U+10FFFF    xxxxxxxxxxxxxxxxxxxxx          00000000 000xxxxx xxxxxxxx xxxxxxxx
Examples:
A, U+0041 LATIN CAPITAL LETTER A
0x0041 => 1000001 => 00000000 00000000 00000000 01000001
😸, U+1F638 GRINNING CAT FACE WITH SMILING EYES
0x1F638 => 11111011000111000 => 00000000 00000001 11110110 00111000
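The two examples can be reproduced with ext/mbstring. This sketch assumes big-endian byte order (UTF-32BE), which matches the byte order shown above:

```php
<?php
foreach (['A', "\u{1F638}"] as $char) {
    $utf32 = mb_convert_encoding($char, 'UTF-32BE', 'UTF-8');
    printf("U+%04X => %s (%d bytes)\n", mb_ord($char, 'UTF-8'), bin2hex($utf32), strlen($utf32));
}
// U+0041 => 00000041 (4 bytes)
// U+1F638 => 0001f638 (4 bytes)
```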
UTF-16
Variable-length encoding; 2 or 4 bytes per code point
Codepoint range                                             Unicode scalar value (binary)  Encoded bytes
U+0000..U+D7FF, U+E000..U+FFFF (Basic Multilingual Plane)   00000 xxxxxxxx xxxxxxxx        xxxxxxxx xxxxxxxx
U+010000..U+10FFFF (Supplementary Planes)                   uuuuu uuuuuuuu uuuuuuuu        110110xx xxxxxxxx 110111yy yyyyyyyy
For the supplementary planes:
U' = xxxxxxxxxxyyyyyyyyyy // U - 0x10000
W1 = 110110xxxxxxxxxx // 0xD800 + xxxxxxxxxx (range 0xD800-0xDBFF)
W2 = 110111yyyyyyyyyy // 0xDC00 + yyyyyyyyyy (range 0xDC00-0xDFFF)
Example (BMP):
A, U+0041 LATIN CAPITAL LETTER A
0x0041 => 1000001 => 00000000 01000001
Example (supplementary plane):
😸, U+1F638 GRINNING CAT FACE WITH SMILING EYES
0x1F638 => 1 11110110 00111000 => 11011000 00111101 11011110 00111000
U' = 11110110 00111000 // 0x1F638 - 0x10000 = 0xF638
W1 = 11011000 00111101 // 0xD800 + 0000111101
W2 = 11011110 00111000 // 0xDC00 + 1000111000
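The U', W1, W2 steps translate almost directly into code. A minimal sketch; toSurrogatePair() is a made-up helper name, and the cross-check at the end assumes ext/mbstring:

```php
<?php
// Hypothetical helper: returns [W1, W2] for a supplementary-plane code point.
function toSurrogatePair(int $codePoint): array
{
    if ($codePoint < 0x10000 || $codePoint > 0x10FFFF) {
        throw new InvalidArgumentException('Only U+10000..U+10FFFF use a surrogate pair');
    }

    $u  = $codePoint - 0x10000;   // U': a 20-bit value
    $w1 = 0xD800 + ($u >> 10);    // top 10 bits
    $w2 = 0xDC00 + ($u & 0x3FF);  // bottom 10 bits

    return [$w1, $w2];
}

[$w1, $w2] = toSurrogatePair(0x1F638);
printf("W1 = 0x%04X, W2 = 0x%04X\n", $w1, $w2); // W1 = 0xD83D, W2 = 0xDE38

// Cross-check against mbstring's big-endian UTF-16 output.
echo bin2hex(mb_convert_encoding("\u{1F638}", 'UTF-16BE', 'UTF-8')), "\n"; // d83dde38
```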
UTF-8
Variable-length encoding; 1-4 bytes per code point
Codepoint range     Byte 1    Byte 2    Byte 3    Byte 4    Notes
U+0000..U+007F      0xxxxxxx                                Covers first 128 codepoints; ASCII
U+0080..U+07FF      110xxxxx  10xxxxxx                      (Almost) all Latin, Greek, Cyrillic, Arabic, Hebrew, and more, plus combining diacritical marks
U+0800..U+FFFF      1110xxxx  10xxxxxx  10xxxxxx            Rest of the BMP, including most Chinese, Japanese, and Korean characters
U+10000..U+10FFFF   11110xxx  10xxxxxx  10xxxxxx  10xxxxxx  All other planes, including historic scripts, mathematical symbols, and emoji.
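To see how the bit patterns in the table are filled in, here is a hand-rolled encoder as a sketch. encodeUtf8() is an illustrative name; in real code you would normally rely on mb_chr() or "\u{...}" escapes instead:

```php
<?php
// Hypothetical helper: encodes one Unicode scalar value as UTF-8 by hand.
function encodeUtf8(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a Unicode scalar value');
    }
    if ($cp <= 0x7F) {   // 0xxxxxxx
        return chr($cp);
    }
    if ($cp <= 0x7FF) {  // 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) { // 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

foreach ([0x50, 0x3A3, 0x634, 0x1F638] as $cp) {
    printf("U+%04X => %s\n", $cp, bin2hex(encodeUtf8($cp)));
}
// U+0050 => 50
// U+03A3 => cea3
// U+0634 => d8b4
// U+1F638 => f09f98b8
```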
Trick 1: ASCII === UTF-8
Trick 2: Virtually all languages only need 1, 2, or 3 bytes
Trick 3: First byte tells you the length
Trick 4: Self-synchronization
Trick 5: No 0x00 bytes, except for NUL
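Tricks 3 and 4 can be sketched in a few lines of PHP. Both helper names are made up for illustration, and the code assumes well-formed UTF-8 input:

```php
<?php
// Hypothetical helper: sequence length implied by a lead byte,
// or 0 for a continuation (or otherwise invalid) byte.
function utf8SequenceLength(string $byte): int
{
    $b = ord($byte);
    if ($b < 0x80) {
        return 1; // 0xxxxxxx
    }
    if (($b & 0xE0) === 0xC0) {
        return 2; // 110xxxxx
    }
    if (($b & 0xF0) === 0xE0) {
        return 3; // 1110xxxx
    }
    if (($b & 0xF8) === 0xF0) {
        return 4; // 11110xxx
    }
    return 0;     // 10xxxxxx
}

// Hypothetical helper: from any byte offset, step back over continuation
// bytes (10xxxxxx) to the first byte of the current character.
function syncToCharStart(string $utf8, int $offset): int
{
    while ($offset > 0 && (ord($utf8[$offset]) & 0xC0) === 0x80) {
        $offset--;
    }
    return $offset;
}

$str = "a\u{03A3}\u{1F638}"; // bytes: 61 | ce a3 | f0 9f 98 b8
echo utf8SequenceLength($str[0]), utf8SequenceLength($str[1]), utf8SequenceLength($str[3]), "\n"; // 124
echo syncToCharStart($str, 5), "\n"; // 3
```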
UTF Encoding Summary
                             UTF-32                  UTF-16                       UTF-8
Encoding length              Fixed                   Variable                     Variable
                             4 bytes per code point  2 or 4 bytes per code point  1-4 bytes per code point
Memory-efficient             No                      Somewhat                     Yes
CPU-efficient                Yes                     Somewhat                     Somewhat
Self-synchronizing           No                      Yes                          Yes
Contains null (0x00) bytes   Yes                     Yes                          No
ASCII-compatible             No                      No                           Yes
https://commons.wikimedia.org/wiki/File:Utf8webgrowth.svg
Unicode in PHP
Handling Text In Programming Languages
1. Treat text as a sequence of bytes (PHP, C)
$smile = "\xF0\x9F\x98\x80";
echo $smile; // => '😀'
echo strlen($smile); // => 4
2. Treat text as a sequence of Unicode code points (Python 3)
3. Treat text as a sequence of UTF-16 code units (JavaScript, C#)
const smile = '\uD83D\uDE00';
console.log(smile); // => '😀'
console.log(smile.length); // => 2
PHP Strings
Be careful!
● Strings are simply byte sequences
● Encoding-agnostic
● Some (not all) string functions assume fixed-width, 8-bit ASCII encoding
PHP String Functions
Function What It Actually Does
strlen() Counts the length in bytes
str_replace() Replaces bytes
substr() Returns a subset of bytes
strtoupper() Converts alphabetic ASCII bytes to uppercase based on
globally-set locale
Works for ASCII; not entirely safe* for Unicode!
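A quick illustration of the difference between counting bytes and counting code points, assuming ext/mbstring and a UTF-8 string:

```php
<?php
$str = "Gr\u{00FC}\u{00DF}e"; // "Grüße": ü and ß each take 2 bytes in UTF-8

echo strlen($str), "\n";             // 7 (bytes)
echo mb_strlen($str, 'UTF-8'), "\n"; // 5 (code points)

// substr() counts bytes, so this cuts "ü" in half and leaves invalid UTF-8.
$broken = substr($str, 0, 3);
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)

// mb_substr() counts code points and keeps the character intact.
$intact = mb_substr($str, 0, 3, 'UTF-8');
var_dump($intact === "Gr\u{00FC}"); // bool(true)
```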
ext/mbstring
Provides multibyte-safe string functions
Standard Function mbstring Alternative
strlen() mb_strlen()
str_replace() (none)
substr() mb_substr()
strtoupper() mb_strtoupper()
Tip: Most functions accept an optional parameter to specify the encoding; when it is omitted, the internal encoding (see mb_internal_encoding(), UTF-8 by default) is used.
ext/mbstring
Provides multibyte-safe string functions
mb_convert_case(string $string, int $mode, ?string $encoding = null): string
Input $string: Mary had a little lamb
$mode           Output
MB_CASE_UPPER   MARY HAD A LITTLE LAMB
MB_CASE_LOWER   mary had a little lamb
MB_CASE_TITLE   Mary Had A Little Lamb
MB_CASE_FOLD    mary had a little lamb
Input $string: Ich grüße den Mann (I greet the man)
$mode           Output
MB_CASE_UPPER   ICH GRÜSSE DEN MANN
MB_CASE_LOWER   ich grüße den mann
MB_CASE_TITLE   Ich Grüße Den Mann
MB_CASE_FOLD    ich grüsse den mann
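One place where MB_CASE_FOLD is handy is case-insensitive comparison. A sketch assuming ext/mbstring on PHP 7.3 or later (where MB_CASE_FOLD is available); equalsIgnoringCase() is a made-up helper name:

```php
<?php
// Hypothetical helper: compares two UTF-8 strings after case folding.
function equalsIgnoringCase(string $a, string $b): bool
{
    return mb_convert_case($a, MB_CASE_FOLD, 'UTF-8')
        === mb_convert_case($b, MB_CASE_FOLD, 'UTF-8');
}

$input = "Ich gr\u{00FC}\u{00DF}e den Mann"; // "Ich grüße den Mann"

echo mb_convert_case($input, MB_CASE_LOWER, 'UTF-8'), "\n"; // ich grüße den mann
echo mb_convert_case($input, MB_CASE_FOLD, 'UTF-8'), "\n";  // ich grüsse den mann

// Lowercasing alone keeps "ß", so only the folded forms compare equal.
var_dump(equalsIgnoringCase($input, "ICH GR\u{00DC}SSE DEN MANN")); // bool(true)
```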
ext/pcre
Enable UTF-8 support with u modifier: preg_match('/foo/u')
Match a character with a Unicode property: \p{xx} (37 different codes)
Property Code  Matches              Example
L              Any letter           \p{L}
Ll             Lower case letter    \p{Ll}
Lu             Upper case letter    \p{Lu}
Lm             Modifier letter      \p{Lm}
Lt             Title case letter    \p{Lt}
Lo             Other letter         \p{Lo}
S              Any symbol           \p{S}
Sc             Currency symbol      \p{Sc}
Sk             Modifier symbol      \p{Sk}
Sm             Mathematical symbol  \p{Sm}
So             Other symbol         \p{So}
Match a character with a Unicode script: \p{xxxx} (102 different scripts)
Examples: \p{Greek} or \p{Egyptian_Hieroglyphs}
Match a character without a Unicode property: \P{xx}
Match a Unicode extended grapheme cluster: \X
Think of it like a . but for multiple characters that combine into a single glyph
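A few of these patterns in action, as a sketch. The sample text is invented for illustration:

```php
<?php
$text = "Caf\u{00E9} \u{03A3} \u{20AC}5"; // "Café Σ €5"

preg_match_all('/\p{Lu}/u', $text, $upper);    // upper case letters
preg_match_all('/\p{Sc}/u', $text, $currency); // currency symbols
preg_match_all('/\p{Greek}/u', $text, $greek); // characters in the Greek script

echo implode(' ', $upper[0]), "\n";    // C Σ
echo implode(' ', $currency[0]), "\n"; // €
echo implode(' ', $greek[0]), "\n";    // Σ

// \X matches a whole extended grapheme cluster: e + U+0301 counts once.
preg_match_all('/\X/u', "cafe\u{0301}", $clusters);
echo count($clusters[0]), "\n"; // 4
```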
ext/intl - IntlChar class
var_dump(IntlChar::charName('⛄'));
// string(20) "SNOWMAN WITHOUT SNOW"
$name = "RECYCLING SYMBOL FOR TYPE-1 PLASTICS";
var_dump(IntlChar::charFromName($name));
// int(9843)
var_dump(IntlChar::isupper("A"));
// bool(true)
ext/intl - Normalizer class
1. U+01FA - “Precomposed” character (LATIN CAPITAL
LETTER A WITH RING ABOVE AND ACUTE)
2. A + U+030A + U+0301 - A base letter A followed by two
combining marks (U+030A COMBINING RING ABOVE
and U+0301 COMBINING ACUTE ACCENT)
3. U+00C5 + U+0301 - An accented letter (U+00C5 LATIN
CAPITAL LETTER A WITH RING ABOVE) followed by a
combining accent (U+0301 COMBINING ACUTE
ACCENT)
4. U+212B + U+0301 - A compatibility character (U+212B
ANGSTROM SIGN) followed by a combining accent
(U+0301 COMBINING ACUTE ACCENT)
Ǻ
$variations = [
    "\xC7\xBA",
    "A" . "\xCC\x8A\xCC\x81",
    "\xC3\x85\xCC\x81",
    "\xE2\x84\xAB\xCC\x81",
];
foreach ($variations as $str) {
    echo urlencode(Normalizer::normalize($str));
    echo "\n";
}
// %C7%BA
// %C7%BA
// %C7%BA
// %C7%BA
ext/intl - Grapheme Functions
● grapheme_extract()
● grapheme_stripos()
● grapheme_stristr()
● grapheme_strlen()
● grapheme_strpos()
● grapheme_strripos()
● grapheme_strrpos()
● grapheme_strstr()
● grapheme_substr()
$str = "⛄ Cafe\u{0301}"; // é written as e + U+0301 COMBINING ACUTE ACCENT
echo strlen($str); // 10
echo mb_strlen($str); // 7
echo grapheme_strlen($str); // 6
ext/iconv - iconv() function to convert encodings
$text = "This is the Euro symbol '€'."; // UTF-8 string
echo iconv("UTF-8", "ISO-8859-1", $text), PHP_EOL;
// Notice: iconv(): Detected an illegal character in input string
echo iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text), PHP_EOL;
// This is the Euro symbol 'EUR'.
echo iconv("UTF-8", "ISO-8859-1//IGNORE", $text), PHP_EOL;
// This is the Euro symbol ''.
PHP Extension Summary
ext/iconv: Convert between encodings
ext/mbstring: Work with multi-byte string encodings like UTF-8
ext/pcre: Special UTF-compatible matching when /u modifier enabled
ext/intl: Work with individual codepoints and graphemes
Fun Tricks & Micro-Optimizations
Disclaimer
Clever hacks and micro-optimizations are usually unnecessary and can be
detrimental to long-term maintenance!
Don’t use these unless you absolutely need them.
Taking Advantage of UTF-Encoded Bytes
PHP string functions can still be used in some cases:
if (str_contains($utf8, '&')) { … }
$trimmed = trim($utf8);
$firstChar = substr($utf32, 0, 4);
Requires solid understanding of UTF encodings and what the functions do
Don’t be clever unless there’s a clear advantage!
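Why this is safe for UTF-8: every byte of a multi-byte sequence has its high bit set, so an ASCII byte such as ',' or '&' can never appear inside another character. A sketch with invented sample data, using ext/mbstring only for the validity check:

```php
<?php
$csv = "Z\u{00FC}rich,\u{6771}\u{4EAC},S\u{00E3}o Paulo"; // "Zürich,東京,São Paulo"

// ',' is the single byte 0x2C, so the byte-oriented explode()
// cannot split a multi-byte character in half.
$cities = explode(',', $csv);

foreach ($cities as $city) {
    var_dump(mb_check_encoding($city, 'UTF-8')); // bool(true) for each part
}
echo count($cities), "\n"; // 3
```

The same reasoning does not hold for UTF-16 or UTF-32, where the byte 0x2C can be part of an unrelated code unit.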
Splitting Strings Into Codepoints
mb_str_split($str) - returns array of individual codepoints (PHP 7.4+)
UTF-8 polyfill for older versions: preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY)
(Works for codepoints, not graphemes)
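A sketch comparing the two approaches and showing the codepoint-versus-grapheme caveat. It assumes ext/mbstring and PHP 7.4+; the \X pattern for graphemes comes from the ext/pcre section:

```php
<?php
$str = "cafe\u{0301}"; // "café" with a combining acute accent

$codePoints = mb_str_split($str, 1, 'UTF-8');
$polyfill   = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

var_dump($codePoints === $polyfill); // bool(true)
echo count($codePoints), "\n";       // 5: the accent is its own element

// Splitting into grapheme clusters keeps "e" and its accent together.
preg_match_all('/\X/u', $str, $matches);
echo count($matches[0]), "\n";       // 4
```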
ASCII-Only UTF-8 Strings
Is a UTF-8 string pure ASCII? If so, no need for (slower) mbstring functions:
$isAscii = mb_detect_encoding($str, 'ASCII', true) !== false;
Micro-optimization (2x faster):
$isAscii = strlen($str) === mb_strlen($str);
Speed is fractions of milliseconds; micro-optimization only
important for parsing-heavy applications
Writing Silly Code (Seriously, Don’t Do This)
PHP supports Unicode in variable and function names:
class （╯°□°）╯︵┻━┻ extends Exception {}
throw new （╯°□°）╯︵┻━┻;
Uses U+FF08 FULLWIDTH LEFT PARENTHESIS and U+FF09 FULLWIDTH RIGHT PARENTHESIS since normal parens (U+0028/U+0029) are not allowed here.
$👉😎👉 = "Ann Perkins!"; // Parks and Rec reference
$you can use = 'U+2000 EN QUAD whitespace';
Recap & Recommendations
● Unicode supports virtually every known modern and historic writing system
● Codepoints != Glyphs/Graphemes != Encoding
● Use and support UTF-8 everywhere, especially for user input
● PHP strings are just raw bytes
● Use mbstring functions
Questions?
Thank You!
Slides & feedback: https://joind.in/talk/9bdc2
Questions? @colinodell or colinodell@gmail.com

Contenu connexe

Tendances

Додаток 22
Додаток 22Додаток 22
Додаток 22ymcmb_ua
 
Cassandra introduction at FinishJUG
Cassandra introduction at FinishJUGCassandra introduction at FinishJUG
Cassandra introduction at FinishJUGDuyhai Doan
 
PyLadies Talk: Learn to love the command line!
PyLadies Talk: Learn to love the command line!PyLadies Talk: Learn to love the command line!
PyLadies Talk: Learn to love the command line!Blanca Mancilla
 
UGC Net June 2009 Paper 1 Solved , Paper 1, Research and Teaching Aptitude, ...
UGC Net June 2009 Paper 1 Solved ,  Paper 1, Research and Teaching Aptitude, ...UGC Net June 2009 Paper 1 Solved ,  Paper 1, Research and Teaching Aptitude, ...
UGC Net June 2009 Paper 1 Solved , Paper 1, Research and Teaching Aptitude, ...mcrashidkhan
 
Social Network Analysis With R
Social Network Analysis With RSocial Network Analysis With R
Social Network Analysis With RDavid Chiu
 
Le magazine Paranoia, Automne 2003. Vol 10, No 2, Issue 33
Le magazine Paranoia, Automne 2003. Vol 10, No 2, Issue 33Le magazine Paranoia, Automne 2003. Vol 10, No 2, Issue 33
Le magazine Paranoia, Automne 2003. Vol 10, No 2, Issue 33Guy Boulianne
 
No Flex Zone: Empathy Driven Development
No Flex Zone: Empathy Driven DevelopmentNo Flex Zone: Empathy Driven Development
No Flex Zone: Empathy Driven DevelopmentDuretti H.
 
she'ir-ehmetjan
she'ir-ehmetjanshe'ir-ehmetjan
she'ir-ehmetjantughchi
 
Profiling Web Archives IIPC GA 2015
Profiling Web Archives IIPC GA 2015Profiling Web Archives IIPC GA 2015
Profiling Web Archives IIPC GA 2015Sawood Alam
 
Writing (Meteor) Code With Style
Writing (Meteor) Code With StyleWriting (Meteor) Code With Style
Writing (Meteor) Code With StyleStephan Hochhaus
 
ゲーム理論BASIC 第27回 - 交渉ゲーム : 交渉問題とナッシュ交渉解-
ゲーム理論BASIC 第27回 - 交渉ゲーム : 交渉問題とナッシュ交渉解-ゲーム理論BASIC 第27回 - 交渉ゲーム : 交渉問題とナッシュ交渉解-
ゲーム理論BASIC 第27回 - 交渉ゲーム : 交渉問題とナッシュ交渉解-ssusere0a682
 
Kaggle Google Quest Q&A Labeling 反省会 LT資料 47th place solution
Kaggle Google Quest Q&A Labeling 反省会 LT資料 47th place solutionKaggle Google Quest Q&A Labeling 反省会 LT資料 47th place solution
Kaggle Google Quest Q&A Labeling 反省会 LT資料 47th place solutionKen'ichi Matsui
 
Meteor - not just for rockstars
Meteor - not just for rockstarsMeteor - not just for rockstars
Meteor - not just for rockstarsStephan Hochhaus
 
PostgreSQL Day italy 2016 Unit Test
PostgreSQL Day italy 2016 Unit TestPostgreSQL Day italy 2016 Unit Test
PostgreSQL Day italy 2016 Unit TestAndrea Adami
 

Tendances (19)

Додаток 22
Додаток 22Додаток 22
Додаток 22
 
Cassandra introduction at FinishJUG
Cassandra introduction at FinishJUGCassandra introduction at FinishJUG
Cassandra introduction at FinishJUG
 
wreewrer
wreewrerwreewrer
wreewrer
 
PyLadies Talk: Learn to love the command line!
PyLadies Talk: Learn to love the command line!PyLadies Talk: Learn to love the command line!
PyLadies Talk: Learn to love the command line!
 
UGC NET COMPUTER SCIENCE JUNE 2009 PAPER-II
UGC NET COMPUTER SCIENCE JUNE 2009 PAPER-IIUGC NET COMPUTER SCIENCE JUNE 2009 PAPER-II
UGC NET COMPUTER SCIENCE JUNE 2009 PAPER-II
 
UGC Net June 2009 Paper 1 Solved , Paper 1, Research and Teaching Aptitude, ...
UGC Net June 2009 Paper 1 Solved ,  Paper 1, Research and Teaching Aptitude, ...UGC Net June 2009 Paper 1 Solved ,  Paper 1, Research and Teaching Aptitude, ...
UGC Net June 2009 Paper 1 Solved , Paper 1, Research and Teaching Aptitude, ...
 
Social Network Analysis With R
Social Network Analysis With RSocial Network Analysis With R
Social Network Analysis With R
 
Le magazine Paranoia, Automne 2003. Vol 10, No 2, Issue 33
Le magazine Paranoia, Automne 2003. Vol 10, No 2, Issue 33Le magazine Paranoia, Automne 2003. Vol 10, No 2, Issue 33
Le magazine Paranoia, Automne 2003. Vol 10, No 2, Issue 33
 
UGC NET COMPUTER SCIENCE JUNE 2010 PAPER-II
UGC NET COMPUTER SCIENCE JUNE 2010 PAPER-IIUGC NET COMPUTER SCIENCE JUNE 2010 PAPER-II
UGC NET COMPUTER SCIENCE JUNE 2010 PAPER-II
 
Отчет
ОтчетОтчет
Отчет
 
No Flex Zone: Empathy Driven Development
No Flex Zone: Empathy Driven DevelopmentNo Flex Zone: Empathy Driven Development
No Flex Zone: Empathy Driven Development
 
she'ir-ehmetjan
she'ir-ehmetjanshe'ir-ehmetjan
she'ir-ehmetjan
 
Profiling Web Archives IIPC GA 2015
Profiling Web Archives IIPC GA 2015Profiling Web Archives IIPC GA 2015
Profiling Web Archives IIPC GA 2015
 
Writing (Meteor) Code With Style
Writing (Meteor) Code With StyleWriting (Meteor) Code With Style
Writing (Meteor) Code With Style
 
ゲーム理論BASIC 第27回 - 交渉ゲーム : 交渉問題とナッシュ交渉解-
ゲーム理論BASIC 第27回 - 交渉ゲーム : 交渉問題とナッシュ交渉解-ゲーム理論BASIC 第27回 - 交渉ゲーム : 交渉問題とナッシュ交渉解-
ゲーム理論BASIC 第27回 - 交渉ゲーム : 交渉問題とナッシュ交渉解-
 
G.A's
G.A'sG.A's
G.A's
 
Kaggle Google Quest Q&A Labeling 反省会 LT資料 47th place solution
Kaggle Google Quest Q&A Labeling 反省会 LT資料 47th place solutionKaggle Google Quest Q&A Labeling 反省会 LT資料 47th place solution
Kaggle Google Quest Q&A Labeling 反省会 LT資料 47th place solution
 
Meteor - not just for rockstars
Meteor - not just for rockstarsMeteor - not just for rockstars
Meteor - not just for rockstars
 
PostgreSQL Day italy 2016 Unit Test
PostgreSQL Day italy 2016 Unit TestPostgreSQL Day italy 2016 Unit Test
PostgreSQL Day italy 2016 Unit Test
 

Demystifying Unicode - Longhorn PHP 2021

  • 2. Colin O’Dell
        ● Principal Engineer at Unleashed Technologies
        ● PHP for ~20 years; 13 years professionally
        ● Creator & maintainer of league/commonmark library
        ● PHP League leadership team
        ● Owner of moderngeekware.com
        ● @colinodell
  • 3. Agenda
        ● A History of Encoding Systems
        ● Unicode Standard
        ● Unicode Encodings
        ● Using Unicode in PHP
        ● Tips & Tricks
        ● Questions & Answers
  • 4. Assumptions
        ● Some familiarity with PHP
        ● Basic understanding of binary and hexadecimal
        ● Focus on high-level concepts!
  • 7. A (Brief) History of Encoding Systems
  • 8. 1837: Morse Code (Internationalized in 1844) “Morse-Vail Telegraph Key” by the National Museum of American History is licenced under CC BY-NC 2.0
  • 10. 1960s: Teletypes (TTYs) For Computing
  • 11. 1960s: ASCII
        ● American Standard Code for Information Interchange
        ● 7-bit binary encoding
          ○ 0000000 = 0
          ○ ...
          ○ 1111111 = 127
  • 12. The ASCII table (row = high hex digit, column = low hex digit):
           0     1     2     3     4     5     6     7     8     9     A     B     C     D     E     F
        0  NUL   SOH   STX   ETX   EOT   ENQ   ACK   BEL   BS    HT    LF    VT    FF    CR    SO    SI
        1  DLE   DC1   DC2   DC3   DC4   NAK   SYN   ETB   CAN   EM    SUB   ESC   FS    GS    RS    US
        2  SPACE !     "     #     $     %     &     '     (     )     *     +     ,     -     .     /
        3  0     1     2     3     4     5     6     7     8     9     :     ;     <     =     >     ?
        4  @     A     B     C     D     E     F     G     H     I     J     K     L     M     N     O
        5  P     Q     R     S     T     U     V     W     X     Y     Z     [     \     ]     ^     _
        6  `     a     b     c     d     e     f     g     h     i     j     k     l     m     n     o
        7  p     q     r     s     t     u     v     w     x     y     z     {     |     }     ~     DEL
  • 13. Examples from the ASCII table:
        Character       Hex   Binary
        LF (line feed)  0x0A  0001010
        3               0x33  0110011
        E               0x45  1000101
        e               0x65  1100101
  • 14. The two high bits select a group of 32 characters:
        00xxxxx = 32 control codes
        01xxxxx = 32 numbers & symbols
        10xxxxx = 32 uppercase letters and some extra symbols
        11xxxxx = 32 lowercase letters and some extra symbols
  • 15. Uppercase letters (10xxxxx):
        A = 0x41 = 1000001
        B = 0x42 = 1000010
        …
        Z = 0x5A = 1011010
  • 16. Lowercase letters (11xxxxx):
        a = 0x61 = 1100001
        b = 0x62 = 1100010
        …
        z = 0x7A = 1111010
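The bit patterns on slides 14 to 16 can be checked directly. The following is a minimal Python sketch, not part of the original deck: it assumes only the 7-bit ASCII layout shown above, and the names `CASE_BIT`, `toggle_ascii_case` and `ascii_group` are made up for illustration. It shows that an uppercase letter and its lowercase counterpart differ in a single bit (0x20), and that the two high bits pick the group of 32.

```python
CASE_BIT = 0x20  # 0100000: the only bit that differs between 'A' (1000001) and 'a' (1100001)

# Group names as listed on slide 14, indexed by the two high bits of the 7-bit code.
GROUPS = [
    "control codes",
    "numbers & symbols",
    "uppercase letters and some extra symbols",
    "lowercase letters and some extra symbols",
]


def toggle_ascii_case(ch: str) -> str:
    """Flip the case of one ASCII letter by XOR-ing bit 0x20; anything else is returned unchanged."""
    if len(ch) == 1 and ch.isascii() and ch.isalpha():
        return chr(ord(ch) ^ CASE_BIT)
    return ch


def ascii_group(ch: str) -> str:
    """Return the name of the 32-character group that a 7-bit ASCII character falls into."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not a 7-bit ASCII character")
    return GROUPS[code >> 5]  # drop the low five bits, keep the two high bits


if __name__ == "__main__":
    for ch in "AZaz3":
        print(ch, hex(ord(ch)), format(ord(ch), "07b"), ascii_group(ch))
    print(toggle_ascii_case("A"), toggle_ascii_case("z"))
```

The XOR trick only makes sense for the 52 ASCII letters, which is why the helper checks `isascii()` and `isalpha()` before touching the bit.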
  • 17. But computers use 8-bit bytes...
               ASCII (7 Bits)  ???
        Start  00000000        10000000
        End    01111111        11111111
        Count  128             128
  • 18. 7-bit ASCII: rows 0-7 of the ASCII table (slide 12)
  • 19. 8-bit “Extended ASCII”: rows 0-7 are the ASCII table (slide 12); rows 8-F are ???
  • 20. ISO 8859-1
        Rows 0-1 and 8-9: left blank on the slide
        Rows 2-7: same as ASCII (slide 12), without DEL
        A0-AF: NBSP ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ SHY ® ¯
        B0-BF: ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
        C0-CF: À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
        D0-DF: Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
        E0-EF: à á â ã ä å æ ç è é ê ë ì í î ï
        F0-FF: ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
  • 21. Windows-1252
        Rows 0-7: same as ASCII (slide 12)
        80-8F: € (81 unassigned) ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ (8D unassigned) Ž (8F unassigned)
        90-9F: (90 unassigned) ‘ ’ “ ” • – — ˜ ™ š › œ (9D unassigned) ž Ÿ
        Rows A-F: same as ISO 8859-1 (slide 20)
  • 22. ISO 8859-2
        Rows 0-1 and 8-9: left blank on the slide
        Rows 2-7: same as ASCII (slide 12), without DEL
        A0-AF: NBSP Ą ˘ Ł ¤ Ľ Ś § ¨ Š Ş Ť Ź SHY Ž Ż
        B0-BF: ° ą ˛ ł ´ ľ ś ˇ ¸ š ş ť ź ˝ ž ż
        C0-CF: Ŕ Á Â Ă Ä Ĺ Ć Ç Č É Ę Ë Ě Í Î Ď
        D0-DF: Đ Ń Ň Ó Ô Ő Ö × Ř Ů Ú Ű Ü Ý Ţ ß
        E0-EF: ŕ á â ă ä ĺ ć ç č é ę ë ě í î ď
        F0-FF: đ ń ň ó ô ő ö ÷ ř ů ú ű ü ý ţ ˙
  • 23. Code Page 437 (IBM PC)
        00-0F: NUL ☺ ☻ ♥ ♦ ♣ ♠ • ◘ ○ ◙ ♂ ♀ ♪ ♫ ☼
        10-1F: ► ◄ ↕ ‼ ¶ § ▬ ↨ ↑ ↓ → ← ∟ ↔ ▲ ▼
        Rows 2-7: same as ASCII (slide 12), with ⌂ in place of DEL
        80-8F: Ç ü é â ä à å ç ê ë è ï î ì Ä Å
        90-9F: É æ Æ ô ö ò û ù ÿ Ö Ü ¢ £ ¥ ₧ ƒ
        A0-AF: á í ó ú ñ Ñ ª º ¿ ⌐ ¬ ½ ¼ ¡ « »
        B0-BF: ░ ▒ ▓ │ ┤ ╡ ╢ ╖ ╕ ╣ ║ ╗ ╝ ╜ ╛ ┐
        C0-CF: └ ┴ ┬ ├ ─ ┼ ╞ ╟ ╚ ╔ ╩ ╦ ╠ ═ ╬ ╧
        D0-DF: ╨ ╤ ╥ ╙ ╘ ╒ ╓ ╫ ╪ ┘ ┌ █ ▄ ▌ ▐ ▀
        E0-EF: α ß Γ π Σ σ µ τ Φ Θ Ω δ ∞ φ ε ∩
        F0-FF: ≡ ± ≥ ≤ ⌠ ⌡ ÷ ≈ ° ∙ · √ ⁿ ² ■ NBSP
  • 25. 8-bit “Extended ASCII”
        ● ISO 8859 - 16 variations:
          ○ ISO 8859-1 (“Latin 1”, Western European)
          ○ ISO 8859-2 (“Latin 2”, Central European)
          ○ ISO 8859-3 (“Latin 3”, South European)
          ○ ISO 8859-4 (“Latin 4”, North European)
          ○ ISO 8859-5 (Latin/Cyrillic)
          ○ ISO 8859-6 (Latin/Arabic)
          ○ ISO 8859-7 (Latin/Greek)
          ○ ISO 8859-8 (Latin/Hebrew)
          ○ ISO 8859-9 (“Latin 5”, Turkish)
          ○ ISO 8859-10 (“Latin 6”, Nordic)
          ○ ISO 8859-11 (Latin/Thai)
          ○ ISO 8859-12 (Latin/Devanagari) - abandoned
          ○ ISO 8859-13 (“Latin 7”, Baltic Rim)
          ○ ISO 8859-14 (“Latin 8”, Celtic)
          ○ ISO 8859-15 (“Latin 9”)
            ■ Revision of 8859-1 that swaps out less-used characters and adds the euro currency symbol
          ○ ISO 8859-16 (“Latin 10”, South-Eastern European)
        ● Windows-1252
        ● CP 437 - Original IBM PC
        ● Mac OS Roman character set
        ● TRS-80 character set
        ● Atari’s ATASCII
        ● Commodore’s PETSCII
        ● HP Roman-8 and Roman-9
        ● DEC’s Multinational Character Set
        ● Lotus International Character Set
        ● ECMA-94
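With this many 8-bit character sets, the same byte means different things depending on which table is assumed. The sketch below is an illustration only and not from the original slides: it uses Python's built-in codecs for four of the tables shown above (`latin-1` for ISO 8859-1, `iso8859-2`, `cp1252` for Windows-1252, `cp437`), and the helper name `decode_with_each` is hypothetical.

```python
# Codec names are Python's standard-library spellings for the tables on slides 20 to 23.
CODE_PAGES = ["latin-1", "iso8859-2", "cp1252", "cp437"]


def decode_with_each(byte_value: int) -> dict:
    """Decode one byte under each code page; unassigned positions become U+FFFD."""
    raw = bytes([byte_value])
    return {name: raw.decode(name, errors="replace") for name in CODE_PAGES}


if __name__ == "__main__":
    # 0x41 is in the ASCII half, so every table agrees; 0xA3 and 0x80 are in the upper half.
    for value in (0x41, 0xA3, 0x80):
        print(hex(value), decode_with_each(value))
```

Bytes below 0x80 decode identically everywhere, which is why plain English text survived these mismatches while accented or symbol characters did not.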
  • 28. But then along came the Internet...
  • 31. “The Unicode Standard is the universal character encoding standard for written characters and text. It defines a consistent way of encoding multilingual text that enables the exchange of text data internationally and creates the foundation for global software”
  • 32. Code Points
        Problem: How to accommodate larger character sets without wasting memory?
        Solution: Break the one-to-one correspondence between characters and bits/encoding! Offer different ways to encode based on different needs.
  • 33. ASCII vs. Unicode
        ASCII:
          Character  Encoded Bits
          H          01001000 (0x48)
          P          01010000 (0x50)
        Unicode:
          Glyph  Code Point                     Encoded Bits
          P      U+0050 LATIN CAPITAL LETTER P  ????
          H      U+0048 LATIN CAPITAL LETTER H  ????
  • 34. Glyph  Code Point                                   Encoded Bits
        P      U+0050 LATIN CAPITAL LETTER P                ????
        h      U+0068 LATIN SMALL LETTER H                  ????
        Σ      U+03A3 GREEK CAPITAL LETTER SIGMA            ????
        ش      U+0634 ARABIC LETTER SHEEN                   ????
               U+1D2ED MAYAN NUMERAL THIRTEEN               ????
        😸     U+1F638 GRINNING CAT FACE WITH SMILING EYES  ????
        H      U+0048 LATIN CAPITAL LETTER H                ????
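To make the split between code point and encoded bits concrete, here is a small Python sketch. It is an illustration rather than anything shown in the deck: the `describe` and `hex_bytes` helpers are hypothetical names, and UTF-8, UTF-16 and UTF-32 are used as three possible answers to the "????" column. The point it demonstrates is that one code point can have several different byte representations.

```python
import unicodedata


def hex_bytes(data: bytes) -> str:
    """Render bytes as space-separated uppercase hex, e.g. 'CE A3'."""
    return " ".join(f"{b:02X}" for b in data)


def describe(ch: str) -> dict:
    """Return the code point, official name, and three encodings of a single character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "utf8": hex_bytes(ch.encode("utf-8")),
        "utf16": hex_bytes(ch.encode("utf-16-be")),  # big-endian, no byte order mark
        "utf32": hex_bytes(ch.encode("utf-32-be")),
    }


if __name__ == "__main__":
    # P, Greek capital sigma, and the grinning cat face from slide 34, written as escapes.
    for ch in ("P", "\u03a3", "\U0001F638"):
        print(describe(ch))
```

The code point stays the same in every row of the output; only the encoded bytes change with the chosen encoding.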
  • 38. U+1F604 SMILING FACE WITH OPEN MOUTH AND SMILING EYES
  • 45. Recap
        ● Code Point: a number representing a single character*
          ○ 143,859 defined as of Unicode 13.0
          ○ Format: U+hhhhhh
        ● Codespace: a range of numerical values available for encoding characters
          ○ Support for 1,114,112 code points (0x000000 - 0x10FFFF)
        ● Code Planes: continuous groups of 65,536 (2^16) code points
          ○ 17 planes, numbered 0 - 16, corresponding to the possible values 00–10 (hexadecimal) of the first two positions in the six-position hexadecimal format (U+hhhhhh)
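The numbers in the recap fit together arithmetically: 17 planes of 65,536 code points give the 1,114,112 values from 0x000000 to 0x10FFFF. A short Python sketch of that arithmetic follows; the constant and function names are illustrative assumptions, not taken from the talk.

```python
PLANE_SIZE = 1 << 16        # 65,536 code points per plane (2^16)
PLANE_COUNT = 17            # planes 0 through 16
MAX_CODE_POINT = 0x10FFFF   # last value in the codespace


def plane_of(code_point: int) -> int:
    """Return the plane number (0-16) that a code point belongs to."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("outside the Unicode codespace")
    return code_point >> 16  # the digits above the low four hex positions


def format_code_point(code_point: int) -> str:
    """Format in the usual U+hhhh style (at least four hex digits)."""
    return f"U+{code_point:04X}"


if __name__ == "__main__":
    print(PLANE_SIZE * PLANE_COUNT)  # size of the whole codespace
    for cp in (0x0050, 0x1F638, 0x10FFFF):
        print(format_code_point(cp), "is in plane", plane_of(cp))
```

Shifting right by 16 bits discards the four low hexadecimal digits, leaving exactly the "first two positions" that the recap says identify the plane.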
  • 47. Character / Code Point: a U+0061 LATIN SMALL LETTER A
  • 48. Character / Code Point: a U+0061 LATIN SMALL LETTER A a a a a a a a a Glyphs:
  • 49. Glyphs and Graphemes Glyph / Grapheme c a f e Unicode Character c a f e Code Point U+0063 U+0061 U+0066 U+0065 LATIN SMALL LETTER C LATIN SMALL LETTER A LATIN SMALL LETTER F LATIN SMALL LETTER E
  • 50. Glyphs and Graphemes: Combining Diacritical Marks Glyph / Grapheme c a f é Unicode Character c a f e ◌́ Code Point U+0063 U+0061 U+0066 U+0065 U+0301 LATIN SMALL LETTER C LATIN SMALL LETTER A LATIN SMALL LETTER F LATIN SMALL LETTER E COMBINING ACUTE ACCENT
  • 51. Glyphs and Graphemes: Combining Diacritical Marks Glyph / Grapheme c a f é Unicode Character c a f e ◌́ Code Point U+0063 U+0061 U+0066 U+0065 U+0301 LATIN SMALL LETTER C LATIN SMALL LETTER A LATIN SMALL LETTER F LATIN SMALL LETTER E COMBINING ACUTE ACCENT e + ◌́ = é e
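The "e + ◌́ = é" slide can be reproduced directly. A small sketch, assuming PHP 7.0+ (for the `"\u{...}"` escape, which emits UTF-8 bytes); the variable names are illustrative only.

```php
<?php
// Two ways to spell the same grapheme "é".
$precomposed = "\u{00E9}";  // U+00E9 LATIN SMALL LETTER E WITH ACUTE
$combined    = "e\u{0301}"; // U+0065 followed by U+0301 COMBINING ACUTE ACCENT

var_dump($precomposed === $combined); // bool(false): the byte sequences differ
echo bin2hex($precomposed), PHP_EOL;  // c3a9
echo bin2hex($combined), PHP_EOL;     // 65cc81
```

Both strings usually render identically, which is why a byte-wise comparison of user input can fail unexpectedly.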
  • 52. Glyphs and Graphemes: Combining Diacritical Marks Z̷̧̨̰̋Å̸̮͉ ̵͉̣̄̇̀ L̵͉̣̄̇̀G ̸̮͉̊ O ̸̱͒̓ ̷̧̨̰̋Ț͝E̪̘̗̓͝X̪̘̗T ̸̰̺̝̍̈
  • 53. Glyphs and Graphemes: Variation Selectors Glyph / Grapheme ✈ Unicode Character ✈ Code Point U+2708 U+FE0E AIRPLANE VARIATION SELECTOR 15 (TEXT STYLE) VS 15
  • 54. Glyphs and Graphemes: Variation Selectors Glyph / Grapheme ✈ Unicode Character ✈ Code Point U+2708 U+FE0E AIRPLANE VARIATION SELECTOR 15 (TEXT STYLE) Glyph / Grapheme Unicode Character ✈ Code Point U+2708 U+FE0F AIRPLANE VARIATION SELECTOR 16 (EMOJI STYLE) VS 16 VS 15
  • 55. Glyphs and Graphemes: Regional Indicator Symbols Glyph / Grapheme 🇺🇸 Unicode Character 🇺 🇸 Code Point U+1F1FA U+1F1F8 REGIONAL INDICATOR SYMBOL LETTER U REGIONAL INDICATOR SYMBOL LETTER S Glyph / Grapheme 🇨🇦 Unicode Character 🇨 🇦 Code Point U+1F1E8 U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER C REGIONAL INDICATOR SYMBOL LETTER A
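Regional indicator symbols follow the alphabet, so a flag sequence can be derived from a two-letter country code. The sketch below assumes ext/mbstring (for `mb_chr()`, PHP 7.2+) and an input of two ASCII letters; `regionalIndicators()` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: map a two-letter ASCII country code to the pair of
// REGIONAL INDICATOR SYMBOL LETTER code points. No input validation here.
function regionalIndicators(string $countryCode): string
{
    $out = '';
    foreach (str_split(strtoupper($countryCode)) as $letter) {
        // U+1F1E6 is REGIONAL INDICATOR SYMBOL LETTER A.
        $out .= mb_chr(0x1F1E6 + ord($letter) - ord('A'), 'UTF-8');
    }
    return $out;
}

echo regionalIndicators('US'), ' ', regionalIndicators('CA'), PHP_EOL;
```

Whether the pair is drawn as a flag or as two boxed letters depends on the font and platform.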
  • 56. Glyphs and Graphemes: Modifiers Glyph / Grapheme Unicode Character 👋 Code Point U+1F44B U+1F3FC WAVING HAND SIGN EMOJI MODIFIER FITZPATRICK TYPE-3 Glyph / Grapheme Unicode Character 👋 Code Point U+1F44B U+1F3FE WAVING HAND SIGN EMOJI MODIFIER FITZPATRICK TYPE-5
  • 57. Glyphs and Graphemes: ZWJ Sequences Glyph / Grapheme 👨 👩 👶 👧 Unicode Character 👨 👩 👶 👧 Code Point U+1F468 U+1F469 U+1F476 U+1F467 MAN WOMAN BABY GIRL
  • 58. Glyphs and Graphemes: ZWJ Sequences Glyph / Grapheme Unicode Character 👨 👩 👶 👧 Code Point U+1F468 U+200D U+1F469 U+200D U+1F476 U+200D U+1F467 MAN ZERO WIDTH JOINER WOMAN ZERO WIDTH JOINER BABY ZERO WIDTH JOINER GIRL ZWJ ZWJ ZWJ
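A ZWJ sequence is just ordinary code points glued together with U+200D. A brief sketch, using only core PHP functions, that builds the family sequence above and counts its parts:

```php
<?php
$zwj = "\u{200D}"; // ZERO WIDTH JOINER

// MAN, WOMAN, BABY, GIRL joined by ZWJ
$family = implode($zwj, ["\u{1F468}", "\u{1F469}", "\u{1F476}", "\u{1F467}"]);

echo $family, PHP_EOL;                          // one glyph if the font supports it
echo strlen($family), PHP_EOL;                  // 25 bytes in UTF-8
echo preg_match_all('/./su', $family), PHP_EOL; // 7 code points
```

The 25 bytes are four 4-byte emoji plus three 3-byte joiners.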
  • 59. Glyphs and Graphemes: ZWJ Sequences
  • 60. Glyphs and Graphemes: ZWJ Sequences
  • 61. Glyphs and Graphemes: ZWJ Sequences Glyph / Grapheme Unicode Character Code Point U+1F477 U+200D U+2642 CONSTRUCTION WORKER ZERO WIDTH JOINER MALE SIGN ZWJ Glyph / Grapheme Unicode Character Code Point U+1F477 U+200D U+2640 CONSTRUCTION WORKER ZERO WIDTH JOINER FEMALE SIGN ZWJ
  • 62. Glyphs and Graphemes: ZWJ Sequences Glyph / Grapheme Unicode Character Code Point U+1F477 U+1F3FE U+200D U+2640 CONSTRUCTION WORKER EMOJI MODIFIER FITZPATRICK TYPE-5 ZERO WIDTH JOINER FEMALE SIGN ZWJ
  • 63. Enough about code points...
  • 65. Glyph Code Point Encoded Bits P U+0050 LATIN CAPITAL LETTER P ???? h U+0068 LATIN SMALL LETTER H ???? Σ U+03A3 GREEK CAPITAL LETTER SIGMA ???? ش U+0634 ARABIC LETTER SHEEN ???? 𝋭 U+1D2ED MAYAN NUMERAL THIRTEEN ???? 😸 U+1F638 GRINNING CAT FACE WITH SMILING EYES ???? H U+0048 LATIN CAPITAL LETTER H ????
  • 66. Encoding Schemes ● Most popular: ○ UTF-8 ○ UTF-16 ○ UTF-32
  • 67. UTF-32 Fixed-byte encoding; 4 bytes per code point Codepoint range Unicode scalar value (binary) Encoded bytes U+0000..U+D7FF, U+E000..U+10FFFF xxxxxxxxxxxxxxxxxxxxx 00000000 000xxxxx xxxxxxxx xxxxxxxx
  • 68. UTF-32 Fixed-byte encoding; 4 bytes per character Codepoint range Unicode scalar value (binary) Encoded bytes U+0000..U+D7FF, U+E000..U+10FFFF xxxxxxxxxxxxxxxxxxxxx 00000000 000xxxxx xxxxxxxx xxxxxxxx Examples: A U+0041 LATIN CAPITAL A 0x0041 => 1000001 00000000 00000000 00000000 01000001 😸 U+1F638 GRINNING CAT WITH SMILING EYES 0x1F638 => 11111011000111000 00000000 00000001 11110110 00111000
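Because UTF-32 stores the code point value as-is, encoding is a single `pack()` call. A minimal sketch assuming big-endian byte order; `utf32beEncode()` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: encode one code point as big-endian UTF-32.
function utf32beEncode(int $codePoint): string
{
    $isSurrogate = $codePoint >= 0xD800 && $codePoint <= 0xDFFF;
    if ($codePoint < 0 || $codePoint > 0x10FFFF || $isSurrogate) {
        throw new InvalidArgumentException('Not a Unicode scalar value');
    }
    return pack('N', $codePoint); // "N" = unsigned 32-bit, big-endian
}

echo bin2hex(utf32beEncode(0x41)), PHP_EOL;    // 00000041
echo bin2hex(utf32beEncode(0x1F638)), PHP_EOL; // 0001f638
```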
  • 69. UTF-16 Variable-length encoding; 2 or 4 bytes per character Codepoint range Unicode scalar value (binary) Encoded bytes U+0000..U+D7FF, U+E000..U+FFFF (Basic Multilingual Plane) 00000 xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
  • 70. UTF-16 Variable-length encoding; 2 or 4 bytes per character Codepoint range Unicode scalar value (binary) Encoded bytes U+0000..U+D7FF, U+E000..U+FFFF (Basic Multilingual Plane) 00000 xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx Example: A U+0041 LATIN CAPITAL A 0x0041 => 1000001 00000000 01000001
  • 71. UTF-16 Variable-length encoding; 2 or 4 bytes per character Codepoint range Unicode scalar value (binary) Encoded bytes U+0000..U+D7FF, U+E000..U+FFFF (Basic Multilingual Plane) 00000 xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx U+010000..U+10FFFF (Supplementary Planes) uuuuu uuuuuuuu uuuuuuuu 110110xx xxxxxxxx 110111yy yyyyyyyy
  • 72. UTF-16 Variable-length encoding; 2 or 4 bytes per character Codepoint range Unicode scalar value (binary) Encoded bytes U+0000..U+D7FF, U+E000..U+FFFF (Basic Multilingual Plane) 00000 xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx U+010000..U+10FFFF (Supplementary Planes) uuuuu uuuuuuuu uuuuuuuu 110110xx xxxxxxxx 110111yy yyyyyyyy U' = xxxxxxxxxxyyyyyyyyyy // U - 0x10000 W1 = 110110xxxxxxxxxx // 0xD800 + xxxxxxxxxx W2 = 110111yyyyyyyyyy // 0xDC00 + yyyyyyyyyy
  • 73. UTF-16 Variable-length encoding; 2 or 4 bytes per character Codepoint range Unicode scalar value (binary) Encoded bytes U+0000..U+D7FF, U+E000..U+FFFF (Basic Multilingual Plane) 00000 xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx U+010000..U+10FFFF (Supplementary Planes) uuuuu uuuuuuuu uuuuuuuu 110110xx xxxxxxxx 110111yy yyyyyyyy U' = xxxxxxxxxxyyyyyyyyyy // U - 0x10000 W1 = 110110xxxxxxxxxx // 0xD800 + xxxxxxxxxx W2 = 110111yyyyyyyyyy // 0xDC00 + yyyyyyyyyy W1 range: 0xD800-0xDBFF W2 range: 0xDC00-0xDFFF
  • 74. UTF-16 Variable-length encoding; 2 or 4 bytes per character Codepoint range Unicode scalar value (binary) Encoded bytes U+010000..U+10FFFF (Supplementary Planes) uuuuu uuuuuuuu uuuuuuuu 110110xx xxxxxxxx 110111yy yyyyyyyy Example: 😸 U+1F638 GRINNING CAT WITH SMILING EYES 0x1F638 => 1 11110110 00111000 U' = 0000111101 1000111000 // 0x1F638 - 0x10000 = 0x0F638 (20 bits) W1 = 11011000 00111101 // 0xD800 + 0000111101 W2 = 11011110 00111000 // 0xDC00 + 1000111000 Encoded bytes: 11011000 00111101 11011110 00111000
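The U', W1, W2 steps translate almost line for line into code. A sketch under the assumption of big-endian output; `utf16beEncode()` is a hypothetical name, not a PHP built-in.

```php
<?php
// Hypothetical helper: encode one code point as big-endian UTF-16.
function utf16beEncode(int $codePoint): string
{
    $isSurrogate = $codePoint >= 0xD800 && $codePoint <= 0xDFFF;
    if ($codePoint < 0 || $codePoint > 0x10FFFF || $isSurrogate) {
        throw new InvalidArgumentException('Not a Unicode scalar value');
    }
    if ($codePoint < 0x10000) {
        return pack('n', $codePoint); // BMP: a single 16-bit code unit
    }
    $u  = $codePoint - 0x10000;       // U': a 20-bit value
    $w1 = 0xD800 | ($u >> 10);        // high surrogate carries the top 10 bits
    $w2 = 0xDC00 | ($u & 0x3FF);      // low surrogate carries the bottom 10 bits
    return pack('nn', $w1, $w2);
}

echo bin2hex(utf16beEncode(0x41)), PHP_EOL;    // 0041
echo bin2hex(utf16beEncode(0x1F638)), PHP_EOL; // d83dde38
```

The second output matches the slide: 0xD83D is `11011000 00111101` and 0xDE38 is `11011110 00111000`.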
  • 75. UTF-8 Variable-length encoding; 1-4 bytes per code point Codepoint range Byte 1 Byte 2 Byte 3 Byte 4 Notes U+0000..U+007F 0xxxxxxx Covers first 128 codepoints; ASCII U+0080..U+07FF 110xxxxx 10xxxxxx (Almost) all Latin, Greek, Cyrillic, Arabic, Hebrew, and more, plus combining diacritical marks U+0800..U+FFFF 1110xxxx 10xxxxxx 10xxxxxx Rest of the BMP, including most Chinese, Japanese, and Korean characters U+10000..U+10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx All other planes, including historic scripts, mathematical symbols, and emoji.
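The four rows of the UTF-8 table map onto four branches. The following is a sketch of a hand-rolled encoder for illustration only (in practice `mb_chr()` or `"\u{...}"` does this for you); `utf8Encode()` is a hypothetical name.

```php
<?php
// Hypothetical helper: encode one code point as UTF-8, one branch per table row.
function utf8Encode(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a Unicode scalar value');
    }
    if ($cp < 0x80) {    // 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {   // 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) { // 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(utf8Encode(0x0041)), PHP_EOL;  // 41
echo bin2hex(utf8Encode(0x03A3)), PHP_EOL;  // cea3
echo bin2hex(utf8Encode(0x26C4)), PHP_EOL;  // e29b84
echo bin2hex(utf8Encode(0x1F638)), PHP_EOL; // f09f98b8
```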
  • 76. UTF-8 Trick 1: ASCII === UTF-8
  • 77. UTF-8 Trick 2: Virtually all languages only need 1, 2, or 3 bytes
  • 78. UTF-8 Trick 3: First byte tells you the length
  • 79. UTF-8 Trick 4: Self-synchronization
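Tricks 3 and 4 both come from the bit prefixes: lead bytes announce the sequence length, and continuation bytes always look like `10xxxxxx`. A sketch assuming well-formed UTF-8 input and an in-range offset; both function names are hypothetical.

```php
<?php
// Trick 3: the lead byte alone tells you how long the sequence is.
function utf8SequenceLength(string $leadByte): int
{
    $b = ord($leadByte);
    if ($b < 0x80)            { return 1; } // 0xxxxxxx
    if (($b & 0xE0) === 0xC0) { return 2; } // 110xxxxx
    if (($b & 0xF0) === 0xE0) { return 3; } // 1110xxxx
    if (($b & 0xF8) === 0xF0) { return 4; } // 11110xxx
    throw new InvalidArgumentException('Continuation or invalid byte');
}

// Trick 4: from any byte offset, step back over continuation bytes
// (10xxxxxx) to find where the current character starts.
function utf8CharStart(string $bytes, int $offset): int
{
    while ($offset > 0 && (ord($bytes[$offset]) & 0xC0) === 0x80) {
        $offset--;
    }
    return $offset;
}

$s = "a\u{1F638}b"; // bytes: 61 | f0 9f 98 b8 | 62
echo utf8SequenceLength($s[1]), PHP_EOL; // 4
echo utf8CharStart($s, 3), PHP_EOL;      // 1
```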
  • 80. UTF-8 Trick 5: No 0x00 bytes, except for NUL
  • 81. UTF Encoding Summary UTF-32 UTF-16 UTF-8 Encoding length Fixed Variable Variable 4 bytes per code point 2 or 4 bytes per code point 1-4 bytes per code point Memory-efficient No Somewhat Yes CPU-efficient Yes Somewhat Somewhat Self-synchronizing No Yes Yes Contains null (0x00) bytes Yes Yes No ASCII-compatible No No Yes
  • 84. Handling Text In Programming Languages 1. Treat text as a sequence of bytes (PHP, C) $smile = "\xF0\x9F\x98\x80"; echo $smile; // => '😀' echo strlen($smile); // => 4 2. Treat text as a sequence of Unicode code points (Python 3) 3. Treat text as a sequence of UTF-16 code units (JavaScript, C#) const smile = '\uD83D\uDE00'; console.log(smile); // => '😀' console.log(smile.length); // => 2
  • 85. PHP Strings Be careful! ● Strings are simply byte sequences ● Encoding-agnostic ● Some (not all) string functions assume fixed-width, 8-bit ASCII encoding
  • 86. PHP String Functions Function What It Actually Does strlen() Counts the length in bytes str_replace() Replaces bytes substr() Returns a subset of bytes strtoupper() Converts alphabetic ASCII bytes to uppercase based on globally-set locale Works for ASCII; not entirely safe* for Unicode!
  • 87. ext/mbstring Provides multibyte-safe string functions Standard Function mbstring Alternative strlen() mb_strlen() str_replace() (none) substr() mb_substr() strtoupper() mb_strtoupper() Tip: All functions accept an optional parameter to specify the encoding, if known; the internal encoding (mb_internal_encoding(), UTF-8 by default) is used otherwise.
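A side-by-side run shows what "multibyte-safe" buys you. This sketch assumes ext/mbstring is loaded and passes the encoding explicitly rather than relying on defaults.

```php
<?php
$str = "Gr\u{00FC}\u{00DF}e"; // "Grüße": 5 code points, 7 bytes in UTF-8

echo strlen($str), PHP_EOL;                   // 7 (bytes)
echo mb_strlen($str, 'UTF-8'), PHP_EOL;       // 5 (code points)

// substr() counts bytes, so it cuts the two-byte "ü" in half:
echo bin2hex(substr($str, 0, 3)), PHP_EOL;    // 4772c3 (not valid UTF-8)
echo mb_substr($str, 0, 3, 'UTF-8'), PHP_EOL; // Grü
```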
  • 88. ext/mbstring Provides multibyte-safe string functions mb_convert_case(string $string, int $mode, ?string $encoding = null): string Input $string $mode Output Mary had a little lamb MB_CASE_UPPER MARY HAD A LITTLE LAMB MB_CASE_LOWER mary had a little lamb MB_CASE_TITLE Mary Had A Little Lamb MB_CASE_FOLD mary had a little lamb
  • 89. ext/mbstring Provides multibyte-safe string functions mb_convert_case(string $string, int $mode, ?string $encoding = null): string Input $string $mode Output Ich grüße den Mann (I greet the man) MB_CASE_UPPER ICH GRÜSSE DEN MANN MB_CASE_LOWER ich grüße den mann MB_CASE_TITLE Ich Grüße Den Mann MB_CASE_FOLD ich grüsse den mann
  • 90. ext/pcre Enable UTF-8 support with u modifier: preg_match('/foo/u') Match a character with a Unicode property: \p{xx} (37 different codes) Property Code Matches Example L Any letter \p{L} Ll Lower case letter \p{Ll} Lu Upper case letter \p{Lu} Lm Modifier letter \p{Lm} Lt Title case letter \p{Lt} Lo Other letter \p{Lo} Property Code Matches Example S Any symbol \p{S} Sc Currency symbol \p{Sc} Sk Modifier symbol \p{Sk} Sm Mathematical symbol \p{Sm} So Other symbol \p{So}
  • 91. ext/pcre Enable UTF-8 support with u modifier: preg_match('/foo/u') Match a character with a Unicode property: \p{xx} (37 different codes) Match a character with a Unicode script: \p{xxxx} (102 different scripts) Examples: \p{Greek} or \p{Egyptian_Hieroglyphs}
  • 92. ext/pcre Enable UTF-8 support with u modifier: preg_match('/foo/u') Match a character with a Unicode property: \p{xx} (37 different codes) Match a character with a Unicode script: \p{xxxx} (102 different scripts) Match a character without a Unicode property: \P{xx}
  • 93. ext/pcre Enable UTF-8 support with u modifier: preg_match('/foo/u') Match a character with a Unicode property: \p{xx} (37 different codes) Match a character with a Unicode script: \p{xxxx} (102 different scripts) Match a character without a Unicode property: \P{xx} Match a Unicode extended grapheme cluster: \X Think of it like a . but for multiple characters that combine into a single glyph
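Each of the escapes above can be tried with the built-in preg functions. A short sketch; the sample strings are made up for illustration, and the patterns use single quotes so the backslashes reach PCRE untouched.

```php
<?php
// \p{Lu}: uppercase letters in any script ("H" and Greek capital sigma)
echo preg_match_all('/\p{Lu}/u', "Hello \u{03A3}igma"), PHP_EOL;       // 2

// \p{Greek}: characters belonging to the Greek script
echo preg_match('/^\p{Greek}+$/u', "\u{03A3}\u{03B1}"), PHP_EOL;       // 1

// \P{L}: anything that is NOT a letter
echo preg_replace('/\P{L}+/u', '', "a1-b2"), PHP_EOL;                  // ab

// . matches code points; \X matches extended grapheme clusters
$cafe = "cafe\u{0301}"; // "é" written as e + COMBINING ACUTE ACCENT
echo preg_match_all('/./u', $cafe), PHP_EOL;                           // 5
echo preg_match_all('/\X/u', $cafe), PHP_EOL;                          // 4
```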
  • 94. ext/intl - IntlChar class var_dump(IntlChar::charName('⛄')); // string(20) "SNOWMAN WITHOUT SNOW" $name = "RECYCLING SYMBOL FOR TYPE-1 PLASTICS"; var_dump(IntlChar::charFromName($name)); // int(9843) var_dump(IntlChar::isupper("A")); // bool(true)
  • 95. ext/intl - Normalizer class 1. U+01FA - “Precomposed” character (LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE) 2. A + U+030A + U+0301 - A base letter A followed by two combining marks (U+030A COMBINING RING ABOVE and U+0301 COMBINING ACUTE ACCENT) 3. U+00C5 + U+0301 - An accented letter (U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE) followed by a combining accent (U+0301 COMBINING ACUTE ACCENT) 4. U+212B + U+0301 - A compatibility character (U+212B ANGSTROM SIGN) followed by a combining accent (U+0301 COMBINING ACUTE ACCENT) Ǻ
  • 96. $variations = [ "\xC7\xBA", "A" . "\xCC\x8A\xCC\x81", "\xC3\x85\xCC\x81", "\xE2\x84\xAB\xCC\x81", ]; Ǻ
  • 97. $variations = [ "\xC7\xBA", "A" . "\xCC\x8A\xCC\x81", "\xC3\x85\xCC\x81", "\xE2\x84\xAB\xCC\x81", ]; foreach ($variations as $str) { echo urlencode(Normalizer::normalize($str)); echo "\n"; } Ǻ
  • 98. $variations = [ "\xC7\xBA", "A" . "\xCC\x8A\xCC\x81", "\xC3\x85\xCC\x81", "\xE2\x84\xAB\xCC\x81", ]; foreach ($variations as $str) { echo urlencode(Normalizer::normalize($str)); echo "\n"; } // %C7%BA // %C7%BA // %C7%BA // %C7%BA Ǻ
  • 99. ext/intl - Grapheme Functions grapheme_extract() grapheme_stripos() grapheme_stristr() grapheme_strlen() grapheme_strpos() grapheme_strripos() grapheme_strrpos() grapheme_strstr() grapheme_substr() $str = "⛄ Cafe\u{0301}"; // "é" written as e + U+0301 COMBINING ACUTE ACCENT echo strlen($str); // 10 echo mb_strlen($str); // 7 echo grapheme_strlen($str); // 6
  • 100. ext/iconv - iconv() function to convert encodings $text = "This is the Euro symbol '€'."; // UTF-8 string
  • 101. ext/iconv - iconv() function to convert encodings $text = "This is the Euro symbol '€'."; // UTF-8 string echo iconv("UTF-8", "ISO-8859-1", $text), PHP_EOL; // Notice: iconv(): Detected an illegal character in input string
  • 102. ext/iconv - iconv() function to convert encodings $text = "This is the Euro symbol '€'."; // UTF-8 string echo iconv("UTF-8", "ISO-8859-1", $text), PHP_EOL; // Notice: iconv(): Detected an illegal character in input string echo iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text), PHP_EOL; // This is the Euro symbol 'EUR'. echo iconv("UTF-8", "ISO-8859-1//IGNORE", $text), PHP_EOL; // This is the Euro symbol ''.
  • 103. PHP Extension Summary ext/iconv: Convert between encodings ext/mbstring: Work with multi-byte string encodings like UTF-8 ext/pcre: Special UTF-compatible matching when /u modifier enabled ext/intl: Work with individual codepoints and graphemes
  • 104. Fun Tricks & Micro-Optimizations
  • 105. Disclaimer Clever hacks and micro-optimizations are usually unnecessary and can be detrimental to long-term maintenance! Don’t use these unless you absolutely need them.
  • 106. Taking Advantage of UTF-Encoded Bytes PHP string functions can still be used in some cases: if (str_contains($utf8, '&')) { … } $trimmed = trim($utf8); $firstChar = substr($utf32, 0, 4); Requires solid understanding of UTF encodings and what the functions do Don’t be clever unless there’s a clear advantage!
  • 107. Splitting Strings Into Codepoints mb_str_split($str) - returns array of individual codepoints (PHP 7.4+) UTF-8 polyfill for older versions: preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY) (Works for codepoints, not graphemes)
  • 108. ASCII-Only UTF-8 Strings Is a UTF-8 string pure ASCII? If so, no need for (slower) mbstring functions: $isAscii = mb_detect_encoding($str, 'ASCII', true) !== false; Micro-optimization (2x faster): $isAscii = strlen($str) === mb_strlen($str); Speed difference is fractions of a millisecond; the micro-optimization only matters for parsing-heavy applications
  • 109. Writing Silly Code PHP supports Unicode in variable and function names: class （╯°□°）╯︵┻━┻ extends Exception {} throw new （╯°□°）╯︵┻━┻;
  • 110. Writing Silly Code PHP supports Unicode in variable and function names: class （╯°□°）╯︵┻━┻ extends Exception {} throw new （╯°□°）╯︵┻━┻; Uses U+FF08 FULLWIDTH LEFT PARENTHESIS and U+FF09 FULLWIDTH RIGHT PARENTHESIS since normal parens (U+0028/U+0029) are not allowed here.
  • 111. Writing Silly Code (Don’t Do This) PHP supports Unicode in variable and function names: class （╯°□°）╯︵┻━┻ extends Exception {} throw new （╯°□°）╯︵┻━┻; $👉😎👉 = "Ann Perkins!"; // Parks and Rec reference
  • 112. Writing Silly Code (Seriously, Don’t Do This) PHP supports Unicode in variable and function names: class （╯°□°）╯︵┻━┻ extends Exception {} throw new （╯°□°）╯︵┻━┻; $👉😎👉 = "Ann Perkins!"; // Parks and Rec reference $you can use = 'U+2000 EN QUAD whitespace'; // the gaps in the variable name are U+2000 EN QUAD, not ASCII spaces
  • 113. Recap
  • 114. Recap & Recommendations ● Unicode supports virtually every known modern and historic writing system ● Codepoints != Glyphs/Graphemes != Encoding ● Use and support UTF-8 everywhere, especially for user input ● PHP strings are just raw bytes ● Use mbstring functions
  • 116. Thank You! Slides & feedback: https://joind.in/talk/9bdc2 Questions? @colinodell or colinodell@gmail.com

Editor's Notes

  1. Questions as we go? Raise hand
  2. Converts characters into electrical signals
  3. Standardized in 1865
  4. Simple device Type a key, sends some numbers, same letter comes out the other side
  5. But there needs to be a standard
  6. Developed in 1960s for teleprinters (“Teletype”) and early computers 7-bit: each letter you type in gets converted into 7 bits
  7. Support for: Upper and lowercase letters Numbers Basic, common symbols More control codes (CR, LF, BS, HT, BEL) (next for examples)
  8. (how to encode/decode)
  9. Something really clever going on here Group by first two bits 4 “pages” or sections, 32 chars each
  10. Letters in alphabetical order, starting at 1 (not random)
  11. Even more clever - converting between upper and lowercase by changing one bit
  12. “Extended ASCII” sounds like a standard, but it’s not
  13. AKA Latin 1 for the Americas, Western Europe, Oceania, and much of Africa
  14. Superset/extension of ISO 8859-1 Adds curly quotation marks De-facto standard for Windows
  15. Aka Latin 2 for Central or Eastern European Languages
  16. UI graphics, science, and math Standard EGA VGA encoding on gfx cards
  17. That’s a lot! However,
  18. In practice, most users only used one standard locally. Which was fine...
  19. Standards proliferation
  20. (Problem) You could add more bits, but that wasted computing resources (which were scarce at the time) for users who only needed Latin or ASCII-like characters
  21. ATTN: 4 vs 5 char convention
  22. Support for 1,114,112 codepoints (0x000000 - 0x10FFFF) Code Planes: contiguous groups of 65,536 (2^16) code points. There are 17 planes, identified by the numbers 0 to 16, which correspond to the possible values 0x00–0x10 of the first two positions in the six-position hexadecimal format (U+hhhhhh) Codespace: entire range of numerical values available for encoding characters
  23. Code Planes: contiguous groups of 65,536 (2^16) code points. There are 17 planes, identified by the numbers 0 to 16, which correspond to the possible values 0x00–0x10 of the first two positions in the six-position hexadecimal format (U+hhhhhh) Codespace: entire range of numerical values available for encoding characters Support for 1,114,112 codepoints (0x000000 - 0x10FFFF)
  24. Unicode does not specify how the character / code point should be displayed (or encoded)!
  25. Unicode does not specify how the character / code point should be displayed (or encoded)!
  26. Combining Diacritical Marks
  27. In this example: 5 code points but 4 graphemes GRAPHEME = smallest unit of a writing system Think about putting cursor in this text and selecting something or pressing backspace
  28. “Zalgo text” or “glitch text”
  29. Combining Diacritical Marks
  30. Combining Diacritical Marks
  31. Combining Diacritical Marks
  32. Combining Diacritical Marks
  33. Combining Diacritical Marks
  34. Windows supports 52,000 family combinations
  35. Windows supports 52,000 family combinations
  36. If system lacks dedicated image, individual emojis are shown
  37. Combining Diacritical Marks
  38. Pros: Code points always use the same number of bytes; very straightforward Cons: not very memory efficient, can contain null bytes, not self-synchronizing
  39. BMP = basically everything except emojis and historical scripts
  40. “Surrogate pairs”; values are reserved, no code points with those values
  41. Pros: more memory efficient (most of the time), works well for BMP; is self-synchronizing Cons: 4-byte encoding logic somewhat messy; can contain null bytes
  42. This symbol can be encoded 4 different ways
  43. Intl normalizer class
  44. In UTF-8: 3 bytes for snowman, 1 for space, 1 for each letter c a f e, and 1 for diacritical combining acute accent mark
  45. Now for some fun tricks