7. UTF-8
● There is no difference between an ASCII-encoded and a UTF-8 encoded
file if no “extended” characters appear in it.
● Except if there's a BOM (byte order mark):
● UTF-8: EF BB BF ( shows up as ï»¿ if read as Latin-1 )
● UTF-16: FE FF big-endian, FF FE little-endian ( the BOM is U+FEFF; the byte-swapped U+FFFE is reserved as a noncharacter for this very purpose )
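A minimal sketch of the two points above, written in Python 3 syntax (an assumption; the sessions in these slides are Python 2). The helper name strip_utf8_bom and the sample text are made up for illustration; codecs.BOM_UTF8 and the 'utf-8-sig' codec come from the standard library.

```python
import codecs

text = "plain ASCII text"

# For pure-ASCII text the two encodings produce identical bytes.
same_bytes = text.encode("ascii") == text.encode("utf-8")


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Simulate a file that an editor saved with a BOM in front.
with_bom = codecs.BOM_UTF8 + text.encode("utf-8")

# The 'utf-8-sig' codec drops a leading BOM while decoding.
decoded = with_bom.decode("utf-8-sig")
```

Decoding with plain 'utf-8' instead would keep the BOM as a leading U+FEFF character in the resulting string.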
NOT HELPFUL:
9. This is why we declare encodings:
RIGHT SINGLE QUOTATION MARK
U+2019
>>> u'\u2019'.encode('utf-8')
'\xe2\x80\x99'
>>> '\xe2\x80\x99'.decode('cp1252')
u'\xe2\u20ac\u2122'
>>> print u'\xe2\u20ac\u2122'
’
All because of a missing <meta charset="utf-8">
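The same mix-up can often be reversed by re-encoding with the wrong codec and decoding with the right one. A rough sketch in Python 3 syntax (an assumption; the slides use Python 2), where repair_cp1252_mojibake is a hypothetical helper name. It only covers the specific case of UTF-8 bytes read as cp1252.

```python
original = "\u2019"                    # RIGHT SINGLE QUOTATION MARK
utf8_bytes = original.encode("utf-8")  # the three bytes E2 80 99
garbled = utf8_bytes.decode("cp1252")  # three unrelated characters


def repair_cp1252_mojibake(text):
    """Undo a UTF-8-read-as-cp1252 mix-up; return text unchanged if it does not apply."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters cp1252 cannot encode, or the
        # resulting bytes are not valid UTF-8: leave it alone.
        return text


repaired = repair_cp1252_mojibake(garbled)
```

This is a repair of last resort; declaring the encoding in the first place avoids the problem.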
10. If you REALLY need ASCII:
>>> print u'r\xe9sum\xe9'
résumé
>>> print u'r\xe9sum\xe9'.encode('ascii', 'ignore')
rsum
>>> print u'r\xe9sum\xe9'.encode('ascii', 'replace')
r?sum?
$ pip install unidecode
>>> from unidecode import unidecode
>>> print unidecode(u'r\xe9sum\xe9')
resume
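If installing unidecode is not an option, accents can also be folded with the standard library alone. A minimal sketch in Python 3 syntax (an assumption; the slides use Python 2), with fold_to_ascii as a hypothetical helper name. It is a different technique from unidecode: it decomposes characters and drops whatever is not ASCII, so it does not transliterate.

```python
import unicodedata


def fold_to_ascii(text):
    """Strip accents by decomposing (NFKD) and dropping non-ASCII code points."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


folded = fold_to_ascii("r\xe9sum\xe9")
```

Characters without a decomposition, such as ß or non-Latin scripts, are silently dropped by this approach, which is where unidecode does better.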
12. Combining marks:
LATIN SMALL LETTER E WITH DIAERESIS: U+00EB
LATIN SMALL LETTER E + COMBINING DIAERESIS: U+0065 U+0308
>>> print u'Zo\xeb'
Zoë
>>> print u'Zoe\u0308'
Zoë
>>> from unicodedata import normalize
>>> normalize('NFC', u'Zoe\u0308')
u'Zo\xeb'
>>> normalize('NFD', u'Zo\xeb')
u'Zoe\u0308'
OS X on HFS+ normalises filenames (to a decomposed form close to NFD); most other filesystems don't
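The practical consequence is that two strings can look identical yet compare unequal. A small sketch in Python 3 syntax (an assumption; the slides use Python 2), with same_text as a hypothetical helper that normalises both sides before comparing:

```python
from unicodedata import normalize

composed = "Zo\xeb"        # ends in U+00EB, a single code point
decomposed = "Zoe\u0308"   # ends in U+0065 followed by U+0308

# Same rendering on screen, different code point sequences.
naive_equal = composed == decomposed
lengths = (len(composed), len(decomposed))


def same_text(a, b):
    """Compare two strings after normalising both to NFC."""
    return normalize("NFC", a) == normalize("NFC", b)


normalised_equal = same_text(composed, decomposed)
```

Choosing NFC here is arbitrary; what matters is that both sides go through the same form before comparison, for example when matching filenames that came from different systems.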
14. PEP-8
Code in the core Python distribution should always use the ASCII or Latin-1
encoding (a.k.a. ISO-8859-1). For Python 3.0 and beyond, UTF-8 is
preferred over Latin-1, see PEP 3120.
Files using ASCII should not have a coding cookie. Latin-1 (or UTF-8)
should only be used when a comment or docstring needs to mention an
author name that requires Latin-1; otherwise, using \x, \u or \U escapes is the
preferred way to include non-ASCII data in string literals.
For Python 3.0 and beyond, the following policy is prescribed for the
standard library (see PEP 3131): All identifiers in the Python standard
library MUST use ASCII-only identifiers, and SHOULD use English words
wherever feasible (in many cases, abbreviations and technical terms are used
which aren't English). In addition, string literals and comments must also be
in ASCII. The only exceptions are (a) test cases testing the non-ASCII
features, and (b) names of authors. Authors whose names are not based on
the latin alphabet MUST provide a latin transliteration of their names.
15. Libraries:
● unidecode
● For when you absolutely need ASCII – folds accents and
transliterates from many languages.
● chardet
● Guesses most likely character encoding of a given bytestring.
Based on Mozilla's code.
● unicode-nazi
● Yells about any implicit unicode/bytestring conversion in your
code. Useful when porting code to Python 3.
16. Links:
● All About Python and Unicode
● A detailed reference on all things pertaining to Python and Unicode.
● Pragmatic Unicode
● PyCon 2012 talk on Unicode in Python, covering v3 as well.
● Love Hotels and Unicode
● A look at the inside politics and other quirky aspects of Unicode.
● Python Unicode – Fixing UTF-8 encoded as Latin-1
● Another poor soul who ran into this problem.
● Why the Obama tweet was garbled
● A quick explanation with comments from the people responsible.
● Unicode Support Shootout
● An advanced treatise on how most languages (including Python) fail at Unicode.