except UnicodeError:
  # A practical guide to fighting Unicode demons

                               Aram Dulyan (@Aramgutang)
                              Sydney Python Users group (SyPy)
                                                05 APR 2012
What is Unicode?
Looking inside:
In Python:




  class unicode(basestring):
    ...
The great escapes:


  >>> 'e' == u'e'
  True

  >>> '\xc9' == u'\xc9'
  False

  >>> u'\xc9' == u'\u00c9' == u'\U000000c9'
  True
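
For readers on Python 3, where every `str` literal is already Unicode and the `u` prefix is optional, the same comparison might look like the sketch below. The variable names are illustrative only and do not come from the talk.

```python
# Python 3 sketch: three escape forms that all name code point U+00C9.
short_form = '\xc9'        # \x takes exactly two hex digits
medium_form = '\u00c9'     # \u takes exactly four hex digits
long_form = '\U000000c9'   # \U takes exactly eight hex digits

same_character = short_form == medium_form == long_form

# A bytes object is a different type and never compares equal to text.
bytes_equal_text = (b'\xc9' == '\xc9')

print(same_character, bytes_equal_text)   # True False
```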
UTF-8
●   There is no difference between an ASCII-encoded and a UTF-8 encoded
    file if no “extended” characters appear in it.
●   Except if there's a BOM (byte order mark):
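    ●   UTF-8: EF BB BF ( displays as ï»¿ when misread as Latin-1 )
    ●   UTF-16: FE FF big-endian, FF FE little-endian ( the byte-swapped
        value U+FFFE is reserved as a noncharacter for this very purpose )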
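
Both points can be checked with a short Python 3 sketch. It assumes the standard `utf-8-sig` codec, which writes the BOM when encoding and strips it when decoding; the sample string is made up for illustration.

```python
import codecs

text = 'plain ASCII text'   # illustrative sample, no "extended" characters

# With only ASCII characters, the two encodings are byte-for-byte identical.
identical = text.encode('ascii') == text.encode('utf-8')

# 'utf-8-sig' prepends the three BOM bytes EF BB BF.
with_bom = text.encode('utf-8-sig')
starts_with_bom = with_bom.startswith(codecs.BOM_UTF8)

# Plain 'utf-8' keeps the BOM as a leading U+FEFF; 'utf-8-sig' removes it.
decoded_plain = with_bom.decode('utf-8')
decoded_sig = with_bom.decode('utf-8-sig')
```

Decoding files of unknown origin with `utf-8-sig` is usually harmless, since the codec also accepts input that has no BOM.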




    NOT HELPFUL:
Encode/decode:


● Encode to bytes
● Decode to unicode




●   or, forget decode completely:
    >>> 'fort\xc3\xa9'.decode('utf-8')
    u'fort\xe9'
    >>> unicode('fort\xc3\xa9', 'utf-8')
    u'fort\xe9'
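
In Python 3 the two types are called `bytes` and `str`, and the constructor shortcut becomes `str(data, encoding)`. The sketch below shows that equivalent; it is not code from the talk.

```python
# Python 3 sketch: bytes -> str is decode, str -> bytes is encode.
encoded = b'fort\xc3\xa9'                  # UTF-8 bytes for "forté"

via_method = encoded.decode('utf-8')       # explicit decode
via_constructor = str(encoded, 'utf-8')    # same result without calling decode

round_trip = via_method.encode('utf-8')    # back to the original bytes

print(repr(via_method))                    # 'forté'
```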
This is why we declare encodings:



                                 RIGHT SINGLE QUOTATION MARK
                                            U+2019




 >>> u'\u2019'.encode('utf-8')
 '\xe2\x80\x99'
 >>> '\xe2\x80\x99'.decode('cp1252')
 u'\xe2\u20ac\u2122'
 >>> print u'\xe2\u20ac\u2122'
 â€™



 All because of a missing <meta charset="utf-8">
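
The same mistake can often be undone by running the two steps in reverse. The Python 3 sketch below reproduces the garbling and then repairs it. Treat it as an illustration only: the round trip fails for text containing any of the few byte values that Python's cp1252 codec leaves undefined.

```python
# Python 3 sketch: create mojibake, then reverse it.
original = '\u2019'                                    # RIGHT SINGLE QUOTATION MARK

# UTF-8 bytes wrongly decoded as cp1252 yield three characters.
garbled = original.encode('utf-8').decode('cp1252')    # 'â€™'

# Re-encode with the wrong codec, then decode with the right one.
repaired = garbled.encode('cp1252').decode('utf-8')

print(garbled, repaired)
```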
If you REALLY need ASCII:


  >>> print u'r\xe9sum\xe9'
  résumé
  >>> print u'r\xe9sum\xe9'.encode(errors='ignore')
  rsum
  >>> print u'r\xe9sum\xe9'.encode(errors='replace')
  r?sum?


  $ pip install unidecode
  >>> from unidecode import unidecode
  >>> print unidecode(u'r\xe9sum\xe9')
  resume
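
Where installing a third-party package is not an option, simple accent folding can be approximated with the standard library alone. The sketch below is written for Python 3, and `fold_to_ascii` is a hypothetical helper name. Unlike unidecode, it only strips combining marks and drops everything else that is not ASCII, so it does not transliterate other scripts.

```python
import unicodedata

def fold_to_ascii(text):
    """Strip accents by decomposing, then discarding non-ASCII code points."""
    decomposed = unicodedata.normalize('NFKD', text)   # é -> e + U+0301
    return decomposed.encode('ascii', 'ignore').decode('ascii')

word = 'r\xe9sum\xe9'

ignored = word.encode('ascii', 'ignore')     # b'rsum'
replaced = word.encode('ascii', 'replace')   # b'r?sum?'
folded = fold_to_ascii(word)                 # 'resume'
```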
The “u” prefix:
  >>> '%s %s' % (u'unicode', 'string')
  u'unicode string'
  >>> 'string ' + u'unicode'
  u'string unicode'


  class Loonie(object):
      def __str__(self):
          return 'Throatwobbler Mangrove'
      def __unicode__(self):
          return u'Richard Luxuryyacht'

  >>> '%s' % Loonie()
  'Throatwobbler Mangrove'
  >>> u'%s' % Loonie()
  u'Richard Luxuryyacht'

  >>> '%s %s' % (Loonie(), u'is silly')
  u'Throatwobbler Mangrove is silly'
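
Python 3 removes this implicit promotion: mixing `bytes` and `str` raises `TypeError` instead of quietly producing Unicode. The sketch below shows that behaviour and a Python 3 counterpart of the class above, where `__str__` takes the role of `__unicode__` and `__bytes__` the role of the old `__str__`. It is an adaptation for illustration, not code from the talk.

```python
# Python 3 sketch: no implicit bytes/str mixing.
text = 'string '
data = b'bytes'

try:
    combined = text + data                   # raises TypeError in Python 3
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
    combined = text + data.decode('ascii')   # decode explicitly instead

class Loonie:
    def __str__(self):                       # text representation
        return 'Richard Luxuryyacht'
    def __bytes__(self):                     # byte-string representation
        return b'Throatwobbler Mangrove'

as_text = '%s' % Loonie()
as_bytes = bytes(Loonie())
```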
Combining marks:


Precomposed:  LATIN SMALL LETTER E WITH DIAERESIS (U+00EB)
Decomposed:   LATIN SMALL LETTER E (U+0065) + COMBINING DIAERESIS (U+0308)


>>> print u'Zo\xeb'
Zoë
>>> print u'Zoe\u0308'
Zoë

>>> from unicodedata import normalize
>>> normalize('NFC', u'Zoe\u0308')
u'Zo\xeb'
>>> normalize('NFD', u'Zo\xeb')
u'Zoe\u0308'


OS X on HFS+ normalises filenames, others don't
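
Since the two spellings render identically but compare unequal, one common precaution is to normalise both sides to the same form before comparing, for example when matching filenames that come from different systems. A Python 3 sketch, with illustrative variable names:

```python
from unicodedata import normalize

composed = 'Zo\xeb'        # 3 code points
decomposed = 'Zoe\u0308'   # 4 code points

equal_raw = composed == decomposed                       # False
lengths = (len(composed), len(decomposed))               # (3, 4)

# Normalising both operands to the same form makes them comparable.
equal_nfc = normalize('NFC', composed) == normalize('NFC', decomposed)
equal_nfd = normalize('NFD', composed) == normalize('NFD', decomposed)
```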
Warning:
PEP-8
Code in the core Python distribution should always use the ASCII or Latin-1
encoding (a.k.a. ISO-8859-1). For Python 3.0 and beyond, UTF-8 is
preferred over Latin-1, see PEP 3120.
Files using ASCII should not have a coding cookie. Latin-1 (or UTF-8)
should only be used when a comment or docstring needs to mention an
author name that requires Latin-1; otherwise, using \x, \u or \U escapes is the
preferred way to include non-ASCII data in string literals.
For Python 3.0 and beyond, the following policy is prescribed for the
standard library (see PEP 3131): All identifiers in the Python standard
library MUST use ASCII-only identifiers, and SHOULD use English words
wherever feasible (in many cases, abbreviations and technical terms are used
which aren't English). In addition, string literals and comments must also be
in ASCII. The only exceptions are (a) test cases testing the non-ASCII
features, and (b) names of authors. Authors whose names are not based on
the latin alphabet MUST provide a latin transliteration of their names.
Libraries:

●   unidecode
    ●   For when you absolutely need ASCII – folds accents and
        transliterates from many languages.
●   chardet
    ●   Guesses most likely character encoding of a given bytestring.
        Based on Mozilla's code.
●   unicode-nazi
    ●   Yells about any implicit unicode/bytestring conversion in your
        code. Useful when porting code to Python 3.
Links:

●   All About Python and Unicode
    ●   A detailed reference on all things pertaining to Python and Unicode.
●   Pragmatic Unicode
    ●   PyCon 2012 talk on Unicode in Python, covering v3 as well.
●   Love Hotels and Unicode
    ●   A look at the inside politics and other quirky aspects of Unicode.
●   Python Unicode – Fixing UTF-8 encoded as Latin-1
    ●   Another poor soul who ran into this problem.
●   Why the Obama tweet was garbled
    ●   A quick explanation with comments from the people responsible.
●   Unicode Support Shootout
    ●   An advanced treatise on how most languages (including Python) fail at Unicode.
