Perl and Unicode
Mike Whitaker, BBC/EnlightenedPerl.org
The problem

• Keeping track of input and output encodings
• Not losing encoding data in the middle
• Understanding the difference between characters and bytes
Characters vs bytes

                      $        €
2 characters         U+0024   U+20AC
4 bytes (UTF-8)      0x24     0xE2 0x82 0xAC
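
The same count can be checked from Perl. This is a minimal sketch, assuming the Encode module that ships with Perl 5.8 and above; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two characters: DOLLAR SIGN (U+0024) and EURO SIGN (U+20AC).
my $chars = "\x{24}\x{20AC}";

# length() on a character string counts characters.
my $char_count = length($chars);

# Encode the characters into UTF-8 bytes; length() now counts bytes.
my $bytes      = encode('UTF-8', $chars);
my $byte_count = length($bytes);

# Hex dump of the encoded bytes.
my $hex = join ' ', map { sprintf '0x%02X', ord($_) } split //, $bytes;

printf "%d characters, %d bytes: %s\n", $char_count, $byte_count, $hex;
# prints "2 characters, 4 bytes: 0x24 0xE2 0x82 0xAC"
```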
Handling Encodings

input:   àbçdé    bytes in some encoding or other
           |
         decode   use Encode; $chars = decode($enc, $bytes);
           |
         àbçdé    character-based internal representation
           |
         encode   use Encode; $bytes = encode($enc, $chars);
           |
output:  àbçdé    bytes in desired encoding
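
A round trip through that pipeline might look like the sketch below. It assumes the input arrives as ISO-8859-1 and the output should be UTF-8; both encoding choices and the variable names are picked for illustration only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "àbçdé" as it would arrive in ISO-8859-1: one byte per character.
my $input_bytes = "\x{E0}b\x{E7}d\x{E9}";

# Decode on the way in: bytes become characters.
my $chars = decode('ISO-8859-1', $input_bytes);

# Encode on the way out: characters become bytes in the desired encoding.
my $output_bytes = encode('UTF-8', $chars);

printf "%d bytes in, %d characters, %d bytes out\n",
    length($input_bytes), length($chars), length($output_bytes);
# prints "5 bytes in, 5 characters, 8 bytes out"
```

The three accented characters take two bytes each in UTF-8, which is where the extra three bytes come from.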
The Holy Grail

•   Can represent all encodings

•   Has multibyte character support

    •   for example, length() should count characters, not bytes
It doesn't work like that

use Encode;
Only works in Perl 5.8 and above
Why the $£%^&*() are you using 5.6 ANYWAY?

There are solutions for 5.6 and even earlier. But they're HORRIBLE.
character-based internal representation: Perl has one!

Magic internal representation.
All string functions know about it.
It's encoding-agnostic.

In fact....
IT'S UTF-8!
(almost)
Handling Encodings

input:   àbçdé    bytes in some encoding or other
           |
         decode   use Encode; $chars = decode($enc, $bytes);
           |
         àbçdé    Perl's magic internal representation
           |
         encode   use Encode; $bytes = encode($enc, $chars);
           |
output:  àbçdé    bytes in desired encoding
I18N? What the £$%^&*('s that?

input:   àbçdé    bytes in machine's 8bit encoding
           |
         àbçdé    bytes in machine's 8bit encoding
           |
output:  àbçdé    bytes in machine's 8bit encoding
People are still writing Perl like it was Perl 4

...and we have to support them.

Even though our string functions expect chars.
????

Perl's magic internal representation

    if
        all characters are representable in local machine's 8 bit charset, use that;
    else
        use UTF-8
àbçdé (UTF-8 characters)   ->  use Encode; $bytes = encode($enc, $chars);  ->  àbçdé (bytes in desired encoding)  ->  output

àbçdé (machine bytes)      ->  use Encode; $bytes = encode($enc, $chars);  ->  àbçdé (bytes in desired encoding)  ->  output
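
To see that encode() does not care which internal form it is handed, here is a small sketch. utf8::upgrade() is used only to force the UTF-8 internal form for the demonstration (not something to sprinkle through real code), and the variable names are invented.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The same five characters (àbçdé), held two different ways internally.
my $as_machine_bytes = "\x{E0}b\x{E7}d\x{E9}";   # native 8-bit storage
my $as_utf8_chars    = $as_machine_bytes;
utf8::upgrade($as_utf8_chars);                   # UTF-8 storage, same characters

# Perl treats them as the same string...
my $equal = $as_machine_bytes eq $as_utf8_chars ? 1 : 0;

# ...and encode() produces identical output bytes from both.
my $from_bytes  = encode('UTF-8', $as_machine_bytes);
my $from_chars  = encode('UTF-8', $as_utf8_chars);
my $same_output = $from_bytes eq $from_chars ? 1 : 0;

print "equal as strings: $equal, same encoded output: $same_output\n";
# prints "equal as strings: 1, same encoded output: 1"
```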
UTF-8 characters  +  àbçdé (machine bytes)  =  ?????

àbçdé (machine bytes)  ->  promote  ->  àbçdé (UTF-8 characters)

UTF-8 characters  +  àbçdé (UTF-8 characters)  =  àbçdé (UTF-8 bytes)
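
The promotion can be watched happening. In the sketch below the first half joins a UTF-8 stored string to a natively stored one, and the second half shows what goes wrong if one side is still *encoded* bytes. The second half is an illustration added here rather than something taken from the slides, and all variable names are invented.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $utf8_chars    = "\x{20AC}";   # EURO SIGN, stored internally as UTF-8
my $machine_bytes = "\x{E0}";     # à, stored as a single native byte

# Concatenation promotes the 8-bit side, so the characters survive intact.
my $joined        = $utf8_chars . $machine_bytes;
my $joined_length = length($joined);          # 2 characters
my $last_char     = ord(substr($joined, -1)); # still 0xE0

# Pitfall: if one side is already-encoded bytes, each byte is promoted
# as if it were a character, and à turns into two junk characters.
my $encoded      = encode('UTF-8', "\x{E0}"); # the two bytes 0xC3 0xA0
my $mixed        = $utf8_chars . $encoded;
my $mixed_length = length($mixed);            # 3, not 2

printf "joined: %d chars (last is U+%04X), mixed: %d chars\n",
    $joined_length, $last_char, $mixed_length;
# prints "joined: 2 chars (last is U+00E0), mixed: 3 chars"
```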
what you output            what you declare                 what the reader sees

àbçdé (machine bytes)      Content-Encoding: UTF-8          ?b?d?
àbçdé (machine bytes)      Content-Encoding: ISO-8859-1     àbçdé
àbçdé (UTF-8 characters)   Content-Encoding: UTF-8          àbçdé
àbçdé (UTF-8 characters)   Content-Encoding: ISO-8859-1     àbçdé

ARRRGH!!!!
It gets worse.
You can't tell what you've actually got

utf8::is_utf8() does not mean what you think it means

encoded bytes            utf8::is_utf8() = false   EVEN IF they're UTF-8
decoded UTF-8 chars      utf8::is_utf8() = true
decoded machine bytes    utf8::is_utf8() = false
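
The three cases can be reproduced with a short sketch. The "machine bytes" case is assumed here to be a string Perl holds in its native 8-bit form (a plain literal in this example); the hash and variable names are invented.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Encoded bytes: valid UTF-8 on the wire, but just bytes to Perl.
my $encoded_bytes = encode('UTF-8', "\x{20AC}");

# Decoded characters that Perl stores internally as UTF-8.
my $decoded_chars = decode('UTF-8', $encoded_bytes);

# Characters that fit in 8 bits and are stored in the native form.
my $native_chars = "\x{E0}b\x{E7}d\x{E9}";

my %flag = (
    encoded_bytes => utf8::is_utf8($encoded_bytes) ? 1 : 0,
    decoded_chars => utf8::is_utf8($decoded_chars) ? 1 : 0,
    native_chars  => utf8::is_utf8($native_chars)  ? 1 : 0,
);

printf "%-14s is_utf8 = %d\n", $_, $flag{$_} for sort keys %flag;
```

The flag describes how Perl happens to be storing the string, not whether the string has been decoded, which is why it cannot be used to tell bytes from characters.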
The science bit
• Encode.pm
  use Encode; $bytes = encode($enc, $chars);
• 3 argument form of open() - PerlIO layers
  open(FILEHANDLE, ">:encoding(UTF-8)", $file);
• binmode(FILEHANDLE,
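
A sketch of the PerlIO route, writing through an encoding layer given to open() and reading back through one added with binmode(). The temporary file from File::Temp and the variable names are assumptions made for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $chars = "\x{24}\x{20AC}";   # "$€" as characters

# Scratch file for the demonstration (hypothetical; any path would do).
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
close $tmp_fh;

# 3 argument open() with an encoding layer: characters in, UTF-8 bytes on disk.
open(my $out, '>:encoding(UTF-8)', $file) or die "open for write: $!";
print {$out} $chars;
close $out or die "close: $!";

my $size_on_disk = -s $file;    # 4 bytes for 2 characters

# Same idea on an already-open handle, using binmode().
open(my $in, '<', $file) or die "open for read: $!";
binmode($in, ':encoding(UTF-8)');
my $read_back = do { local $/; <$in> };
close $in;

printf "%d bytes on disk, %d characters read back\n",
    $size_on_disk, length($read_back);
# prints "4 bytes on disk, 2 characters read back"
```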
'utf8' vs 'UTF-8'
• Encode.pm
 • utf8 = marks it as UTF-8 and hopes...
 • UTF-8 = is actually valid UTF-8
• PerlIO layers:
 • :utf8
 • :encoding(UTF-8)
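
For the Encode.pm half, the difference shows up when the input is not well-formed. The sketch below feeds both names a UTF-16 surrogate written out as if it were UTF-8. The strict result is the one to rely on; how the lax name behaves is printed rather than asserted, since it may vary with the Encode version. Variable names are invented.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding FB_CROAK);

# Encode resolves the two spellings to two different encodings.
my $strict_name = find_encoding('UTF-8')->name;   # "utf-8-strict"
my $lax_name    = find_encoding('utf8')->name;    # "utf8"

# U+D800 (a surrogate) written as three bytes: not valid UTF-8.
my $bad_bytes = "\xED\xA0\x80";

# With a CHECK argument decode() may modify its input, so work on copies.
my $strict_copy     = $bad_bytes;
my $strict_result   = eval { decode('UTF-8', $strict_copy, FB_CROAK) };
my $strict_rejected = defined $strict_result ? 0 : 1;

my $lax_copy     = $bad_bytes;
my $lax_result   = eval { decode('utf8', $lax_copy, FB_CROAK) };
my $lax_rejected = defined $lax_result ? 0 : 1;

print "$strict_name rejected the input: $strict_rejected\n";
print "$lax_name rejected the input: $lax_rejected\n";
```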
use utf8;

• Does NOT do what you might think it does
• All it says is 'my source code is UTF-8'.
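
The effect is easiest to see on a string literal. This sketch assumes the script file itself is saved as UTF-8; the two blocks only differ in whether the pragma is in force.

```perl
use strict;
use warnings;

my ($with_pragma, $without_pragma);

{
    use utf8;   # the literal below is read as characters
    $with_pragma = length("€");
}

{
    no utf8;    # the literal below is read as the raw bytes in the file
    $without_pragma = length("€");
}

print "with use utf8: $with_pragma, without: $without_pragma\n";
# prints "with use utf8: 1, without: 3" when the file is saved as UTF-8
```

Nothing here affects input or output: filehandles still need their own layers or explicit encode() and decode() calls.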
Modules
• It depends on the module:
 • CGI - $CGI::PARAM_UTF8=1;
 • LWP::UserAgent - the response's ->decoded_content() method honours Content-Encoding:
 • DBI - mysql_enable_utf8 in DBI->connect()
 • XML::LibXML - looks at encoding,
In summary
• decode bytes as soon as you get them:
 • decode(), binmode(STDIN), 3 arg open()
• encode characters just before you output:
 • encode(), binmode(STDOUT), 3 arg open()
• keep track of whether your strings are
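
Put together, the pattern is decode at the edge, work on characters in the middle, encode at the other edge. The sketch below uses in-memory filehandles so that it runs without any real files; the ISO-8859-1 input, the UTF-8 output, and the uc() step standing in for "real work" are all assumptions made for the example.

```perl
use strict;
use warnings;

# Pretend input: "àbçdé\n" as ISO-8859-1 bytes.
my $input_bytes = "\x{E0}b\x{E7}d\x{E9}\n";

# Decode as soon as the bytes arrive: the layer does it on read.
open(my $in, '<:encoding(ISO-8859-1)', \$input_bytes) or die "open input: $!";
my $line = <$in>;
close $in;
chomp $line;

# Work on characters in the middle.
my $shouted = uc($line);   # "ÀBÇDÉ"

# Encode just before output: the layer does it on write.
my $output_bytes = '';
open(my $out, '>:encoding(UTF-8)', \$output_bytes) or die "open output: $!";
print {$out} $shouted;
close $out or die "close: $!";

printf "%d characters processed, %d bytes written\n",
    length($shouted), length($output_bytes);
# prints "5 characters processed, 8 bytes written"
```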
NEVER EVER EVER rely on the encoding of Perl's internal representation

and...

...there is NO SUCH THING as "plain text"
The Holy Fail (thanks Joel!)

input:   àbçdé    bytes in some encoding or other
           |
         decode   use Encode; $chars = decode($enc, $bytes);
           |
         àbçdé    Perl's magic internal representation
           |
         encode   use Encode; $bytes = encode($enc, $chars);
           |
output:  àbçdé    bytes in desired encoding
Questions?

Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Perl And Unicode

  • 2. Perl and Unicode Mike Whitaker, BBC/EnlightenedPerl.org
  • 3. The problem • Keeping track of input and output encodings • Not losing encoding data in the middle • Understanding the difference between characters and bytes
  • 8-14. Characters vs bytes: 2 characters, $ (U+0024) and € (U+20AC), are 4 bytes in UTF-8: 0x24 for $ and 0xE2 0x82 0xAC for €.
  • 18-25. Handling Encodings: input (àbçdé, bytes in some encoding or other) -> decode -> àbçdé (character-based internal representation) -> encode -> output (àbçdé, bytes in desired encoding). Decode step: use Encode; $chars = decode($enc, $bytes); Encode step: use Encode; $bytes = encode($enc, $chars);
  • 27-29. The Holy Grail • Can represent all encodings • Has multibyte character support • for example, length() should count characters, not bytes
  • 30. It doesn't work like that
  • 32-34. use Encode; Only works in Perl 5.8 and above. Why the $£%^&*() are you using 5.6 ANYWAY? There are solutions for 5.6 and even earlier. But they're HORRIBLE.
  • 35-43. character-based internal representation: Perl has one! Magic internal representation. All string functions know about it. It's encoding-agnostic. In fact.... IT'S UTF-8! (almost)
  • 44. Handling Encodings: input (àbçdé, bytes in some encoding or other) -> decode -> àbçdé (Perl's magic internal representation) -> encode -> output (àbçdé, bytes in desired encoding). Decode step: use Encode; $chars = decode($enc, $bytes); Encode step: use Encode; $bytes = encode($enc, $chars);
  • 45-46. I18N? What the £$%^&*('s that? input (àbçdé, bytes in machine's 8bit encoding) -> àbçdé (bytes in machine's 8bit encoding) -> output (àbçdé, bytes in machine's 8bit encoding)
  • 47-50. People are still writing Perl like it was Perl 4 ...and we have to support them. Even though our string functions expect chars.
  • 51-55. ???? Perl's magic internal representation: if all characters are representable in local machine's 8 bit charset, use that; else use UTF-8
  • 58-62. àbçdé (UTF-8 characters) -> use Encode; $bytes = encode($enc, $chars); -> àbçdé (bytes in desired output encoding). àbçdé (machine bytes) -> use Encode; $bytes = encode($enc, $chars); -> àbçdé (bytes in desired output encoding).
  • 67-71. UTF-8 characters + àbçdé (machine bytes) = ????? The machine bytes are promoted: àbçdé (machine bytes) -> promote -> àbçdé (UTF-8 characters). Result: àbçdé (UTF-8 bytes).
  • 74-84. àbçdé (machine bytes) -> output: with Content-Encoding: UTF-8 the reader sees ?b?d?; with Content-Encoding: ISO-8859-1 the reader sees àbçdé. àbçdé (UTF-8 characters) -> output: with Content-Encoding: UTF-8 the reader sees àbçdé; with Content-Encoding: ISO-8859-1 the reader sees àbçdé
  • 85. ARRRGH!!!!
  • 88-98. You can't tell what you've actually got. utf8::is_utf8() does not mean what you think it means • encoded bytes: utf8::is_utf8() = false, EVEN IF they're UTF-8 • decoded UTF-8 chars: utf8::is_utf8() = true • decoded machine bytes: utf8::is_utf8() = false
  • 100-102. The science bit • Encode.pm: use Encode; $bytes = encode($enc, $chars); • 3 argument form of open() - PerlIO layers: open(FILEHANDLE, ">:encoding(UTF-8)", $file); • binmode(FILEHANDLE,
  • 104-109. 'utf8' vs 'UTF-8' • Encode.pm • utf8 = marks it as UTF-8 and hopes... • UTF-8 = is actually valid UTF-8 • PerlIO layers: • :utf8 • :encoding(UTF-8)
  • 111-112. use utf8; • Does NOT do what you might think it does • All it says is 'my source code is UTF-8'.
  • 114-118. Modules • It depends on the module: • CGI - $CGI::PARAM_UTF8=1; • LWP::UserAgent - decoded_content() method honours Content-Encoding: • DBI - mysql_enable_utf8 in DBI::connect() • XML::LibXML - looks at encoding,
  • 120-124. In summary • decode bytes as soon as you get them: • decode(), binmode(STDIN), 3 arg open() • encode characters just before you output: • encode(), binmode(STDOUT), 3 arg open() • keep track of whether your strings are
  • 126. NEVER EVER EVER rely on the encoding of Perl's internal representation
  • 128. and...
  • 130. ...there is NO SUCH THING as "plain text"
  • 131. Handling Encodings: input (àbçdé, bytes in some encoding or other) -> decode -> àbçdé (Perl's magic internal representation) -> encode -> output (àbçdé, bytes in desired encoding). Decode step: use Encode; $chars = decode($enc, $bytes); Encode step: use Encode; $bytes = encode($enc, $chars);
  • 132. The Holy Fail (thanks Joel!): input (àbçdé, bytes in some encoding or other) -> decode -> àbçdé (Perl's magic internal representation) -> encode -> output (àbçdé, bytes in desired encoding). Decode step: use Encode; $chars = decode($enc, $bytes); Encode step: use Encode; $bytes = encode($enc, $chars);
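The decode-early, encode-late pipeline and the utf8::is_utf8() caveat from the slides can be tried out with a short script. This is only a minimal sketch: the variable names and the choice of ISO-8859-15 as a second output encoding are assumptions made for illustration, not something taken from the talk.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw input: the four UTF-8 bytes for "$" (0x24) and "€" (0xE2 0x82 0xAC).
my $bytes_in = "\x24\xE2\x82\xAC";

# Decode at the input boundary: bytes -> characters.
my $chars = decode('UTF-8', $bytes_in);

# Encode at the output boundary: characters -> bytes.
my $utf8_out   = encode('UTF-8', $chars);
my $latin9_out = encode('ISO-8859-15', $chars);   # € is the single byte 0xA4 here

printf "input bytes:  %d\n", length $bytes_in;    # 4
printf "characters:   %d\n", length $chars;       # 2
printf "UTF-8 output: %d\n", length $utf8_out;    # 4
printf "Latin-9 out:  %d\n", length $latin9_out;  # 2

# utf8::is_utf8() reports how Perl happens to store the string,
# not whether the data "is UTF-8".
my $flag_bytes = utf8::is_utf8($bytes_in) ? 1 : 0;   # 0, even though the bytes are valid UTF-8
my $flag_chars = utf8::is_utf8($chars)    ? 1 : 0;   # 1

# The same one-character string in two internal representations.
my $native   = "\xE0";          # à, stored as a single native byte
my $upgraded = $native;
utf8::upgrade($upgraded);       # same character, now stored as UTF-8 internally
my $same_string  = ($native eq $upgraded) ? 1 : 0;
my $native_flag  = utf8::is_utf8($native)   ? 1 : 0;
my $upgrade_flag = utf8::is_utf8($upgraded) ? 1 : 0;

print "flags: bytes=$flag_bytes chars=$flag_chars\n";
print "equal=$same_string native=$native_flag upgraded=$upgrade_flag\n";
```

The last three values show why the internal flag is a poor guide: the two strings compare equal and have the same length, yet the flag differs, so program logic should depend on what was decoded and when, not on utf8::is_utf8().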

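The PerlIO layer form of open() and the "promotion" problem from the slides can be sketched together. The example below is an illustration under assumptions: it writes to an in-memory scalar instead of a real file so that it is self-contained, and the sample string and variable names are invented for the demonstration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Seven characters: c, a, f, é, space, €, 5
my $chars = "caf\x{E9} \x{20AC}5";

# Write through an :encoding(UTF-8) layer. An in-memory file (a reference to
# a scalar) stands in for a real file here.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open for write: $!";
print {$out} $chars;
close($out) or die "close: $!";

# $buffer now holds encoded bytes: é takes 2 bytes and € takes 3.
my $byte_count = length $buffer;

# Read back through the same layer to get characters again.
open(my $in, '<:encoding(UTF-8)', \$buffer) or die "open for read: $!";
my $read_back = <$in>;
close($in);

# The trap: concatenating decoded characters with bytes that were never
# decoded. Perl promotes each byte to a character of the same number,
# so the UTF-8 bytes end up double-encoded on output.
my $mixed_wrong = $chars . $buffer;
my $mixed_right = $chars . decode('UTF-8', $buffer);

my $wrong_out = encode('UTF-8', $mixed_wrong);
my $right_out = encode('UTF-8', $mixed_right);

printf "bytes written:     %d\n", $byte_count;           # 10
printf "wrong mix, chars:  %d\n", length $mixed_wrong;   # 17
printf "right mix, chars:  %d\n", length $mixed_right;   # 14
printf "wrong mix, bytes:  %d\n", length $wrong_out;     # 25
printf "right mix, bytes:  %d\n", length $right_out;     # 20
```

The wrong mix produces no error or warning, only extra bytes and mojibake on output, which is why the summary slide insists on decoding at the point of input.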