PyParis 2017 / Unicode and bytes demystified, by Boris Feld
PyParis 2017
http://pyparis.org

  1. 1. ℙƴ☂ℌøἤ ⒝⒴⒯⒠⒮ DΣMYƧƬIFIΣD Boris FELD - PyParis, Paris - 2017
  2. 2. Boris FELD Python developer Mercurial and Python consultant at Octobus https://lothiraldan.github.io/ @lothiraldan /me
  3. 3. Unicode is ���!
  4. 4. Let's test it!
  5. 5. What is the length of this Unicode string in Python 2? len(u'😎') 1 2 3 4 1. Unicode length
  6. 6. It depends on your Python: DOCKER_IMAGE=quay.io/pypa/manylinux1_x86_64 $> docker run -t -i $DOCKER_IMAGE /opt/python/cp27-cp27mu/bin/python -c "print len(u'\U0001f60e')" 1 But it can also be: DOCKER_IMAGE=quay.io/pypa/manylinux1_x86_64 $> docker run -t -i $DOCKER_IMAGE /opt/python/cp27-cp27m/bin/python -c "print len(u'\U0001f60e')" 2 Unicode length
  7. 7. When could you see this error message? UnicodeEncodeError: 'ascii' codec can't encode character When doing .encode('ascii') When doing .decode('ascii') When doing .decode('utf-8') In all of these situations 2. UnicodeEncodeError
  8. 8. In all of these situations! >>> x = u'é' >>> x.encode('ascii') Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128) >>> x.decode('ascii') Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128) >>> x.decode('utf-8') Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128) UnicodeEncodeError
  9. 9. When should you use chr and unichr? You should always use chr. You should always use unichr. You should use chr for ASCII and unichr for Unicode. 3. Chr vs unichr
  10. 10. Prefer using unichr for everything. Chr vs unichr
  11. 11. Skeptical dog is skeptical
  12. 12. We have to go back!
  13. 13. The 60s
  14. 14. Apollo 11
  15. 15. Woodstock
  16. 16. Something important
  17. 17. Something huge
  18. 18. ASCII was born
  19. 19. In the 1960s, the American Standards Association wanted to answer the question: How to represent text digitally? The important question
  20. 20. Problem: computers only speak bits. How do we transform text into bits? Problem
  21. 21. We know how to convert integers to binary: 0 = 0000000 1 = 0000001 2 = 0000010 3 = 0000011 ............. 127 = 1111111 Let's assign each character an integer from 0 to 127 named "code point". Pretty simple solution
  22. 22. ASCII with Python
  23. 23. Let's take a string: "pyparis" A string is a sequence of characters: assert list("pyparis") == ['p', 'y', 'p', 'a', 'r', 'i', 's'] What is a string?
  24. 24. assert type("pyparis"[0]) == str assert len("pyparis"[0]) == 1 A character (from the Greek χαρακτήρ "engraved or stamped mark" on coins or seals, "branding mark, symbol") is a sign or symbol. — Wikipedia A character is basically anything. It could represent a letter, a digit or even an emoji. What is a character
  25. 25. For retrieving the ASCII code point of a character, we can use ord: assert ord("p") == 112 To reverse the process we can use chr: assert chr(112) == "p" Code point in Python
  26. 26. p y p a r i s Code Point 112 121 112 97 114 105 115 Code points
  27. 27. p y p a r i s Code Point 112 121 112 97 114 105 115 Binary 1110000 1111001 1110000 1100001 1110010 1101001 1110011 code point encode binary code point decode binary ASCII encoding
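As a runnable illustration of the code point to binary mapping above, a small Python sketch (ascii_bits is a made-up helper name, not from the slides):

    # Map each character of an ASCII string to its 7-bit code point in binary.
    def ascii_bits(text):
        return [format(ord(c), '07b') for c in text]

    assert ascii_bits("pyparis")[:2] == ['1110000', '1111001']   # 'p' = 112, 'y' = 121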
  28. 28. encode is meant to transform a string into some bytes: string = 'abc' bytes = string.encode('ascii') assert hex(bytes) == '616263' decode is meant to transform some bytes into a string: bytes = unhex('616263') string = bytes.decode('ascii') assert string == 'abc' Each of these methods accepts an encoding parameter for the name of the conversion algorithm to use. Encode vs Decode
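The hex and unhex helpers on that slide are shorthand; a runnable equivalent (a sketch using the standard binascii module) would be:

    import binascii

    data = 'abc'.encode('ascii')                 # str -> bytes
    assert binascii.hexlify(data) == b'616263'   # the bytes, shown as hex

    raw = binascii.unhexlify('616263')           # hex -> bytes
    assert raw.decode('ascii') == 'abc'          # bytes -> str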
  29. 29. Everything is awesome...
  30. 30. ... right?
  31. 31. Small problem
  32. 32. ASCII solved the problem for the USA but not for everyone else. Not everyone speaks English
  33. 33. ASCII only uses the 7 lower bits of a byte: 01100001. But on most computers a byte is actually 8 bits, so we can support more characters. And so new standards were born... Other standards
  34. 34. Some were based on ASCII and used the 8th bit to add support for accents, for example Latin1, which defines the character É with the code point 201. Some others were not compatible at all, like EBCDIC, used on IBM mainframes, where code point 75 (1001011) represents the punctuation mark "." while in ASCII it represents "K". Of course they were not all cross-compatible... Other standards
  35. 35. It was a mess
  36. 36. Initial text a b ã é Latin1 Code Point 97 98 227 233 Latin1 encoding 01100001 01100010 11100011 11101001 ASCII decoding a b ERROR ERROR Mac OS Roman decoding a b „ È EBCDIC decoding / ERROR T Z Example
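A sketch of that mess in Python 3: the same Latin-1 bytes decoded with a few different codecs (cp500 is one EBCDIC variant; the exact outputs depend on the codec you guess):

    payload = 'abãé'.encode('latin-1')           # the bytes 61 62 e3 e9 from the table
    for codec in ('ascii', 'latin-1', 'mac_roman', 'cp500'):
        try:
            print(codec, '->', payload.decode(codec))
        except UnicodeDecodeError as exc:
            print(codec, '-> ERROR:', exc)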
  37. 37. Here comes our savior!
  38. 38. One Standard to rule them all, One Standard to find them, One Standard to bring them all and in the greater good bind them Unicode the savior
  39. 39. Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. — Wikipedia It all started in 1987-1988 as a coordination between Joe Becker from Xerox and Lee Collins and Mark Davis from Apple. The Unicode code points are, fortunately for us, ASCII-compatible. What is Unicode?
  40. 40. The latest version of Unicode contains a repertoire of 128,237 characters covering 135 modern and historic scripts, as well as multiple symbol sets. — Wikipedia ASCII defined 128 characters, so Unicode defines roughly 1,000 times more. It defines several blocks: Basic Latin: ab...XYZ Greek, Aramaic, Cherokee: Δ‫ע‬Ꮧ Right to left scripts, Cuneiform, hieroglyphs: Mahjong Tiles, Domino Tiles, Playing cards: Emoticons, Musical notations: Unicode size
  41. 41. Remember the ASCII table? Unicode vs ASCII
  42. 42. Unicode with Python
  43. 43. Let's take a Unicode character €. First, declare the encoding of your Python source file as utf-8: # -*- coding: utf-8 -*- Then, you can write it this way: u'€' Or: u'\u20AC' Its code point is 8364: ord(u'€') == 8364 How to write Unicode in Python
  44. 44. Let's convert the code point into binary: € Code Point 8364 Naive conversion 00100000 10101100 Problem
  45. 45. It doesn't fit into 1 byte. The problems when you start using more than 1 byte are multiple and annoying: How to order the bytes (Big and Little Endian problems, anyone)? How to recognize which byte you are reading in a file or stream? How to detect and correct transmission errors where only some bytes went missing? 8364 in binary takes two bytes, and Unicode code points go well beyond 1,000,000 (even if most are not allocated yet), taking up to 3 bytes. Multi-bytes
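The byte-ordering problem is easy to see in Python 3; a small sketch using the explicit little- and big-endian variants of UTF-16:

    # The same code point (U+20AC), serialized with two different byte orders:
    assert '€'.encode('utf-16-be') == b'\x20\xac'
    assert '€'.encode('utf-16-le') == b'\xac\x20'
    # Plain 'utf-16' prepends a BOM (byte order mark) so a reader can tell which order was used.
    print('€'.encode('utf-16'))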
  46. 46. As ASCII was simple, transforming ASCII code points into binary was straightforward. But the presence of high code point characters in Unicode complicates the process. There are multiple ways of doing it, called encodings: UTF-8 UTF-16 UTF-32 Multiple encodings
  47. 47. If you are not sure, use UTF-8: it is compatible with every character, works well most of the time and solves the multi-byte problems elegantly. If you process more Asian characters than Latin ones, use UTF-16 so you use less space and memory. If you need to interact with another program, use that program's default encoding (CSV anyone?). Comparison of Unicode encodings - Wikipedia Choose an encoding
  48. 48. UTF-8 Everywhere Manifesto UTF-8 everywhere
  49. 49. A € Code Point 65 8364 Naive conversion 01000001 00100000 10101100 UTF-8 01000001 11100010 10000010 10101100 UTF-16 00000000 01000001 00100000 10101100 UTF-32 00000000 00000000 00000000 01000001 00000000 00000000 00100000 10101100 What are the differences?
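The same comparison can be reproduced with a short Python 3 sketch (the -be variants are used here to avoid the extra BOM bytes):

    import binascii

    for codec in ('utf-8', 'utf-16-be', 'utf-32-be'):
        encoded = 'A€'.encode(codec)
        print(codec, len(encoded), 'bytes:', binascii.hexlify(encoded))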
  50. 50. Let's clarify something: encode is meant to transform a Unicode string into some bytes: hex(u'é'.encode('utf-8')) == 'c3a9' decode is meant to transform some bytes into a Unicode string: unhex('c3a9').decode('utf-8') == u'é' Encode vs Decode
  51. 51. Python 2
  52. 52. Counting the length of an ASCII string is easy: count the number of bytes! But it's much harder with Unicode strings. Python 2 tries hard to give you a correct answer. Let's go back to our example: 😎. Its code point is 128526. 1. String length
  53. 53. Python 2 comes in several flavors, two of which are related to Unicode: it is either a narrow build or a wide build. It basically changes how Python stores its strings. For code points up to 65535, everything works the same: Python stores each character separately, as a single character. For code points above 65535, it differs. The wide build character size is big enough for all Unicode code points, but the narrow build character size is not, so it stores the higher code points as a pair of characters. The narrow build uses less memory, but that explains why it returns 2 for len(u'😎'): Python 2 actually stores two characters. Multiple flavors of Python 2
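You can check which flavor you are running with sys.maxunicode; a small sketch:

    import sys

    # On a narrow build sys.maxunicode is 65535, on a wide build it is 1114111.
    if sys.maxunicode == 0xFFFF:
        print("narrow build: code points above U+FFFF use a surrogate pair")
    else:
        print("wide build: one storage slot per code point")

    print(len(u'\U0001f60e'))   # 2 on a narrow build, 1 on a wide build (and on Python 3.3+)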
  54. 54. Remember the meaning of encode and decode? Encode transforms a Unicode string into some bytes. Decode transforms some bytes into a Unicode string. 2. Encoding / Decoding in Python 2
  55. 55. Python 2 always had a string type but introduced the Unicode type in Python 2.1. Python 2 str is badly named, as it's basically a bag of bytes. When you display it, Python will try to decode it for you. So for ASCII-only strings, encode and decode return the same thing: x = 'abc' assert x.encode('ascii') == x assert x.decode('ascii') == x Python 2 type system
  56. 56. Python is a strongly typed language, meaning that Python shouldn't coerce types behind your back: '012' + 3 Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: cannot concatenate 'str' and 'int' objects But it doesn't respect this property with strings. Remember that decode converts bytes into a Unicode string? x = u'é' x.decode('utf-8') Here decode is called on a Unicode instance, not on bytes, so Python first tries to make some bytes out of the string and effectively does: x = u'é' x.encode('ascii').decode('utf-8') That's why you can see a UnicodeEncodeError while trying to decode a Unicode string. Python 2 type coercing
  57. 57. You can use chr to get the character of a code point: assert chr(65) == 'A' But it only works with ASCII characters! chr(8364) Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: chr() arg not in range(256) For Unicode you need to use unichr: assert unichr(8364) == u'€' 3. Python 2 chr vs unichr
  58. 58. Python 3 ♥ ♥ ♥ ♥
  59. 59. Python 3 now always stores its strings the same way and len returns the right answer no matter what: x = '😎' assert len(x) == 1 1. Python 3 single flavor
  60. 60. Python 3's biggest change was to the string type system. Byte strings: Python 2 str, Python 3 bytes. Unicode strings: Python 2 unicode, Python 3 str. 2. Python 3 big change
  61. 61. Now that Python 3 has separate types for bytes and strings, we can no longer mess up encode and decode: string = '' string.decode('ascii') Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: 'str' object has no attribute 'decode' Decoding a Unicode string never made sense anyway. bytes = b'' bytes.encode('utf-8') Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: 'bytes' object has no attribute 'encode' So you always know which types you are dealing with. 2. Python 3 coherent type system
  62. 62. Unicode strings are now the norm, so Python 3 dropped the u prefix for Unicode strings and introduced a b prefix for bytes, so you directly write: x = '😎' Python 3.3 reintroduced the u prefix for codebases that need to be compatible with both Python 2 and Python 3, so this also works: x = u'😎' 2. No more u prefix
  63. 63. Python 3 no longer has separate chr and unichr functions: just use chr. assert chr(65) == 'A' assert chr(8364) == '€' 3. Python 3 chr
  64. 64. Pain relief tips
  65. 65. Thanks to the new type system, it is now easier to identify which part of the code needs to encode strings and decode bytes: outside world (bytes) -> decode (library) -> business logic (unicode) -> encode (library) -> outside world (bytes). 1. Unicode sandwich
  66. 66. Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end. — Python doc on unicode Unicode sandwich
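A minimal sketch of the sandwich in Python, assuming a UTF-8 input file and a Latin-1 output file (the file names are made up):

    import io

    # Decode at the input boundary: io.open does it for us.
    with io.open('input.txt', encoding='utf-8') as src:
        text = src.read()          # Unicode inside the program

    text = text.upper()            # business logic only ever sees Unicode

    # Encode at the output boundary, just before the data leaves the program.
    with io.open('output.txt', 'w', encoding='latin-1', errors='replace') as dst:
        dst.write(text)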
  67. 67. You cannot infer the encodings of bytes: Content-Type: text/html; charset=ISO-8859-4 <meta http-equiv="Content-Type" content="text/html;charset=utf-8" /> <?xml version="1.0" encoding="UTF-8" ?> # -*- coding: iso8859-1 -*- If you really really really really need to guess the encoding, you can use chardet, but remember, it's a best effort scenario. 2. Use declared encoding
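If you really do have to guess, a hedged chardet sketch (best effort only, as the slide says; chardet is a third-party package):

    import chardet                       # pip install chardet

    raw = b'caf\xc3\xa9'                 # bytes of unknown origin
    guess = chardet.detect(raw)          # a dict with 'encoding' and 'confidence'
    text = raw.decode(guess['encoding'] or 'utf-8', errors='replace')
    print(guess, text)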
  68. 68. encode and decode accept a second argument for error handling. By default it is set to strict, which means crash: x = u'abcé' x.encode('ascii', errors='strict') Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3... You can also use replace to replace invalid characters with ?: assert x.encode('ascii', errors='replace') == 'abc?' Or you can simply ignore them: assert x.encode('ascii', errors='ignore') == 'abc' Finally you can replace them with their XML character reference: assert x.encode('ascii', errors='xmlcharrefreplace') == 'abc&#233;' 3. Error handling
  69. 69. Use Unicode whenever possible. Use Python 3. Explicitly encode and decode your strings in Python 2; it might solve bugs in your code and ease the Python 3 conversion. Unicode sandwich. Never guess an encoding! Use error handling. Conclusion
  70. 70. for c in range(0x1F410, 0x1F4F0): print (r"\U%08x" % c).decode("unicode-escape"), Python fun
  71. 71. Thank you!
  72. 72. The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) Pragmatic Unicode Unicode In Python, Completely Demystified What every programmer absolutely, positively needs to know about encodings and character sets to work with text Holy batman Reddit on unicode References
