Unicode (and Python)

                                      Juan Manuel Gimeno Illa
                                        jmgimeno@diei.udl.cat

                                          November 2008




J.M.Gimeno (jmgimeno@diei.udl.cat)             Unicode          November 2008   1 / 21
Outline

 1   Before Unicode

 2   Unicode
       Unicode Concepts
       Encodings

 3   Python’s Unicode Support
       Unicode String Type
       Source Code Encoding

 4   Bibliography



Before Unicode


Before Unicode

        In the beginning, computing was mainly centered in North America
        and done in English. Characters were stored one-per-byte by using
        either
               ASCII (7 bits)
               EBCDIC (8 bits)
        In other parts of the world, different ways of storing their characters
        were invented
               Japan: various flavours of JIS encodings
               Russia: KOI8
               India: ISCII standard
        Also, there were some proprietary encodings defined by operating
        system vendors



ISO-8859-*
        For the huge number of people in America, Europe, and the Middle
        East who use relatively small alphabets, there was ISO-8859
               left ASCII as ASCII (range 0 to 127)
               used the range 128 through 255 for different purposes
                  1-4  Different accented characters (e.g. Latin-1)
                    5  Cyrillic
                    6  Arabic
                    7  Greek
                    8  Hebrew
                    9  Turkish
                   10  Nordic languages

        But only one part could be in use at a time, so one couldn’t easily
        mix Greek and Cyrillic in the same file.
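
The "one at a time" problem can be sketched in a few lines. This is only an illustration, assuming Python 3 and its standard codec names `iso-8859-1`, `iso-8859-5` and `iso-8859-7`; the byte value 0xE4 is an arbitrary choice made for the example.

```python
import unicodedata

# One byte value from the upper range (128 to 255).
raw = b"\xe4"

# The same byte is a different character in each part of ISO-8859.
as_latin1 = raw.decode("iso-8859-1")
as_cyrillic = raw.decode("iso-8859-5")
as_greek = raw.decode("iso-8859-7")

for part, ch in [("iso-8859-1", as_latin1),
                 ("iso-8859-5", as_cyrillic),
                 ("iso-8859-7", as_greek)]:
    # Print names rather than glyphs so the output is plain ASCII.
    print(part, "U+%04X" % ord(ch), unicodedata.name(ch))

# The lower range (0 to 127) is plain ASCII in every part.
same_ascii = b"abc".decode("iso-8859-5") == b"abc".decode("iso-8859-7")
```

Nothing in the byte itself records which part was meant, which is why a single file cannot reliably mix Greek and Cyrillic.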
Houston, Houston, ...



        Clearly this was a very unsatisfactory situation
        ISO-2022 provided a partial solution by allowing encodings to be
        switched in the middle of a string
               it was difficult to use
               so it wasn’t widespread
        What was needed was a universal way to refer to all the different
        characters in all the alphabets
               ISO/IEC 10646
               Unicode




Unicode   Unicode Concepts


Unicode’s Solution



        One encoding for all scripts of the world
        ASCII compatibility (even Latin-1)
        Includes character metadata
               Case mapping information
               Character category information
        Accounts for scripts using different orientations
        Enables sorting and normalization support
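
A small sketch of how this metadata can be inspected, assuming Python 3 and its standard `unicodedata` module. The sample characters are arbitrary choices for the illustration.

```python
import unicodedata

e_acute = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
alef = "\u05d0"      # HEBREW LETTER ALEF

# Case mapping information
upper = e_acute.upper()

# Character category information ("Ll" means lowercase letter)
category = unicodedata.category(e_acute)

# Orientation: bidirectional class, "L" left-to-right, "R" right-to-left
dir_latin = unicodedata.bidirectional("a")
dir_hebrew = unicodedata.bidirectional(alef)

# Normalization: NFD splits the letter into base letter + combining accent
decomposed = unicodedata.normalize("NFD", e_acute)

print(category, dir_latin, dir_hebrew, len(decomposed))
```

Language-aware sorting (collation) is not covered by `unicodedata`, so it is left out of this sketch.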




Unicode’s Terminology



    Grapheme This is what users regard as a character
             - André
  Code points This is a Unicode encoding of the string
              - Andr + U+00E9 (LATIN SMALL LETTER E WITH ACUTE)
              - Andre + U+0301 (COMBINING ACUTE ACCENT)
   Code units This is what the implementation stores (e.g. UTF-8)
              - Andre + 0xCC 0x81 (the UTF-8 bytes of U+0301)
 This can be explored on Linux using the program gucharmap
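
The three levels can also be explored from code. A minimal sketch, assuming Python 3 (where `str` is a sequence of code points) and the standard `unicodedata` module. The variable names are made up for the example.

```python
import unicodedata

# Two code point sequences for the same five graphemes.
composed = "Andr\u00e9"      # ends in U+00E9
decomposed = "Andre\u0301"   # ends in "e" followed by U+0301

# len() counts code points, not graphemes.
n_composed = len(composed)
n_decomposed = len(decomposed)

# Code units as stored in UTF-8.
units_composed = composed.encode("utf-8")
units_decomposed = decomposed.encode("utf-8")

# The strings compare unequal until they are normalized to the same form.
equal_raw = composed == decomposed
equal_nfc = unicodedata.normalize("NFC", decomposed) == composed

print(n_composed, n_decomposed, units_composed, units_decomposed)
```

Both strings render as the same graphemes, yet they differ in code points and in code units, so comparisons should be made after normalization.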




Unicode Organization
        Unicode currently (as of 2008) defines just under 100000 code points
        but it has space for up to 1114112
        They are organized into 17 planes of 2^16 = 65536 code points each,
        numbered 0 to 16
        Plane 0 is called the Basic Multilingual Plane (BMP) and contains
        pretty well everything useful
        The characters in the BMP are laid out more or less West to East
               ASCII characters from 0 to 127
               Latin-1 characters from 128 to 255
               Then moving East in Europe (Greek, Cyrillic)
               Next Middle East (Arabic, Hebrew)
               Then the Indus (scripts of India)
               Next Southeast Asia (Thai, Laotian and so on)
               and ending with China, Japan and Korea
        Planes 1 to 16, sometimes called astral planes, include exotic,
        rare and historically important characters (Old Italic, Byzantine
        musical symbols, etc.)
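
The plane arithmetic is easy to check. A minimal sketch, assuming Python 3; `plane_of` is a hypothetical helper written for this illustration, not a library function.

```python
PLANE_SIZE = 2 ** 16   # 65536 code points per plane
NUM_PLANES = 17        # planes 0 to 16

# Total size of the code space: 17 * 65536 = 1114112
CODE_SPACE = NUM_PLANES * PLANE_SIZE


def plane_of(ch):
    """Return the plane number (0 to 16) of a single character."""
    return ord(ch) // PLANE_SIZE


bmp_example = plane_of("\u03b1")          # GREEK SMALL LETTER ALPHA
astral_example = plane_of("\U00010300")   # OLD ITALIC LETTER A
last_plane = plane_of("\U0010ffff")       # highest code point

print(CODE_SPACE, bmp_example, astral_example, last_plane)
```

Dividing by 65536 is the same as dropping the low 16 bits, so the plane number is simply whatever is left above them.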
Code Points

        Each code point (“character”) gets a number and a name
        The number is usually given in hexadecimal and prefixed by U+
        (note that it is not a 16-bit number, because of the astral planes)
        Unicode includes tables with useful character properties (metadata)
        such as
               this is a number
               this is uppercase
               this is punctuation
        The standard also provides
               a helpful picture of a reasonably typical rendition
               rules for line-breaking
               hyphenation
               sorting
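
Some of these properties can be inspected from Python through the standard unicodedata module. This is a small sketch for Python 3; the helper name describe and the choice of sample characters are ours.

```python
import unicodedata


def describe(ch):
    """Return (number, name, general category) for a one-character string."""
    return ("U+%04X" % ord(ch), unicodedata.name(ch), unicodedata.category(ch))


# Category codes: Lu = uppercase letter, Nd = decimal digit, Po = other punctuation.
for ch in ["\u0416", "7", "!"]:
    print(describe(ch))
```

The category codes are how the "this is uppercase", "this is a number" and "this is punctuation" metadata shows up in practice.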

Encodings

        Along with the code points, Unicode also defines methods for storing
        them in byte sequences in a computer
        There are three approaches named UTF-8, UTF-16 and UTF-32
        UTF stands for Unicode Transformation Format or UCS
        Transformation Format, where UCS stands for Universal Character
        Set
        The characters we will use in the explanations are:
          Number            Name                          Plane
          U+0026 (38)       AMPERSAND                     BMP
          U+0416 (1046)     CYRILLIC CAPITAL LETTER ZHE   BMP
          U+4E2D (20013)    CJK UNIFIED IDEOGRAPH-4E2D    BMP
          U+10346 (66374)   GOTHIC LETTER FAIHU           Astral
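
As a rough preview, the four sample characters can be encoded with Python's built-in codecs to see how many bytes each approach needs. This is a sketch for Python 3; the names SAMPLES and encoded_sizes are ours, and the big-endian codec variants are chosen only so that no byte order mark is added to the count.

```python
SAMPLES = ["\u0026", "\u0416", "\u4E2D", "\U00010346"]


def encoded_sizes(ch):
    """Return the number of bytes a character takes in each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}


for ch in SAMPLES:
    print("U+%04X (%d)" % (ord(ch), ord(ch)), encoded_sizes(ch))
```

UTF-32 always takes 4 bytes, UTF-16 takes 2 bytes in the BMP and 4 outside it, and UTF-8 ranges from 1 to 4 bytes for these samples.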

UTF-32
        The simplest way of storing characters: you use 32 bits (4 bytes) to
        store each character
        So we store 38, 1046, 20013 and 66374 as 32-bit integers
        For Latin-1 characters it wastes a lot of space
        Problems with C strings because most bytes are zero (use wchar_t)
        There are several ways of ordering the 4 bytes of a 32-bit integer
        (remember big-endian and little-endian?)
        So if you send one of these 4-byte integers to another machine
        problems occur if the two machines use different orderings
        Solutions:
               Explicitness: the UTF-32BE and UTF-32LE encodings
               Byte Order Mark (BOM): the character U+FEFF (ZERO WIDTH
               NO-BREAK SPACE) and the guarantee that U+FFFE
               will never be a character
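
Both solutions can be tried out with Python's codecs. A minimal sketch for Python 3, using U+0416 from the sample table; the variable names are ours. The explicit codecs fix the byte order, while the plain "utf-32" decoder reads a leading BOM to decide which order to use.

```python
import codecs

ch = "\u0416"  # CYRILLIC CAPITAL LETTER ZHE, code point 0x0416

# Explicitness: the byte order is part of the encoding name.
be = ch.encode("utf-32-be")
le = ch.encode("utf-32-le")
print(be.hex())  # 00000416
print(le.hex())  # 16040000

# BOM: prefix the data with U+FEFF in the same byte order and let
# the generic "utf-32" decoder work out the order from it.
from_be = (codecs.BOM_UTF32_BE + be).decode("utf-32")
from_le = (codecs.BOM_UTF32_LE + le).decode("utf-32")
print(from_be == ch, from_le == ch)
```

Read in the wrong order, U+FEFF would come out as U+FFFE, which is guaranteed never to be a character, so the decoder can tell the two orders apart.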

UTF-16
        UTF-16 stores Unicode characters in 16-bit chunks
               all the BMP characters appear as themselves
               some trickery is needed for the astral plane ones
        There are two blocks of code points in the BMP called surrogate blocks
               High surrogates from U+D800 to U+DBFF
               Low surrogates from U+DC00 to U+DFFF
        Astral plane characters are split into two surrogates
               first, 0x10000 = 2^16 is subtracted from the code point
               next, the resulting 20 bits are split, using the low surrogate for the low ten bits
               and the high surrogate for the high ten bits
        This gives 20 bits, or 2^20 characters, which exactly fits the 16 = 2^4 astral planes
        with 2^16 characters each
        So U+10346 is represented as the 16-bit integers 0xD800 0xDF46
        It also has byte-ordering problems, hence UTF-16BE, UTF-16LE or the use of
        the BOM
        Nightmare in C: embedded zeros and not necessarily the same size as wchar_t
        Often the most space-efficient way to store Asian characters
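
The surrogate trickery described above can be sketched in a few lines. This is an illustration for Python 3, not how any particular codec is implemented; the function names to_surrogates and from_surrogates are ours.

```python
def to_surrogates(code_point):
    """Split an astral code point into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not an astral plane code point")
    offset = code_point - 0x10000     # a 20-bit value
    high = 0xD800 + (offset >> 10)    # high ten bits
    low = 0xDC00 + (offset & 0x3FF)   # low ten bits
    return high, low


def from_surrogates(high, low):
    """Rebuild the code point from a (high, low) surrogate pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x10346)
print("0x%04X 0x%04X" % (high, low))  # 0xD800 0xDF46

# Cross-check against Python's own big-endian UTF-16 codec.
print("\U00010346".encode("utf-16-be").hex())  # d800df46
```

The result agrees with the 0xD800 0xDF46 pair given for U+10346, and the last code point U+10FFFF maps to the last surrogate of each block.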
J.M.Gimeno (jmgimeno@diei.udl.cat)           Unicode                   November 2008   12 / 21
Unicode   Encodings


UTF-16
        UTF-16 stores Unicode characters in 16 bit chunks
               all the BMP characters appear as themselves
               some trickery is needed for the astral plane ones
        There are two blocks of code points in the BMP called surrogate blocks
        High surrogates from U+D800 to U+DBFF
        Low surrogates from U+DC00 to U+DFFF
        Astral plane characters are splitted into two characters
               first, 0x10000 = 216 is subtracted from the code point
               next, its 20 bits are splitted using the low surrogate for the low ten bits
               and the high for the high ones
        This gives 20 bits or 220 characters that fits the 16 = 24 astral planes
        with 216 characters each
        So U+10346 is represented as the 16-bits integers 0xD800 0xDF46
        It also has ordering problems so the UTF-16BE, UTF-16LE or use of
        the BOM
        Nightmare in C: embedded zeros and not same size as wchar t
        The most efficient way to store asian characters
J.M.Gimeno (jmgimeno@diei.udl.cat)           Unicode                   November 2008   12 / 21
Unicode   Encodings


UTF-16
        UTF-16 stores Unicode characters in 16 bit chunks
               all the BMP characters appear as themselves
               some trickery is needed for the astral plane ones
        There are two blocks of code points in the BMP called surrogate blocks
        High surrogates from U+D800 to U+DBFF
        Low surrogates from U+DC00 to U+DFFF
        Astral plane characters are splitted into two characters
               first, 0x10000 = 216 is subtracted from the code point
               next, its 20 bits are splitted using the low surrogate for the low ten bits
               and the high for the high ones
        This gives 20 bits or 220 characters that fits the 16 = 24 astral planes
        with 216 characters each
        So U+10346 is represented as the 16-bits integers 0xD800 0xDF46
        It also has ordering problems so the UTF-16BE, UTF-16LE or use of
        the BOM
        Nightmare in C: embedded zeros and not same size as wchar t
        The most efficient way to store asian characters
J.M.Gimeno (jmgimeno@diei.udl.cat)           Unicode                   November 2008   12 / 21
Unicode   Encodings


UTF-16
        UTF-16 stores Unicode characters in 16 bit chunks
               all the BMP characters appear as themselves
               some trickery is needed for the astral plane ones
        There are two blocks of code points in the BMP called surrogate blocks
        High surrogates from U+D800 to U+DBFF
        Low surrogates from U+DC00 to U+DFFF
        Astral plane characters are splitted into two characters
               first, 0x10000 = 216 is subtracted from the code point
               next, its 20 bits are splitted using the low surrogate for the low ten bits
               and the high for the high ones
        This gives 20 bits or 220 characters that fits the 16 = 24 astral planes
        with 216 characters each
        So U+10346 is represented as the 16-bits integers 0xD800 0xDF46
        It also has ordering problems so the UTF-16BE, UTF-16LE or use of
        the BOM
        Nightmare in C: embedded zeros and not same size as wchar t
        The most efficient way to store asian characters
J.M.Gimeno (jmgimeno@diei.udl.cat)           Unicode                   November 2008   12 / 21
Unicode   Encodings


UTF-16
        UTF-16 stores Unicode characters in 16 bit chunks
               all the BMP characters appear as themselves
               some trickery is needed for the astral plane ones
        There are two blocks of code points in the BMP called surrogate blocks
        High surrogates from U+D800 to U+DBFF
        Low surrogates from U+DC00 to U+DFFF
        Astral plane characters are splitted into two characters
               first, 0x10000 = 216 is subtracted from the code point
               next, its 20 bits are splitted using the low surrogate for the low ten bits
               and the high for the high ones
        This gives 20 bits or 220 characters that fits the 16 = 24 astral planes
        with 216 characters each
        So U+10346 is represented as the 16-bits integers 0xD800 0xDF46
        It also has ordering problems so the UTF-16BE, UTF-16LE or use of
        the BOM
        Nightmare in C: embedded zeros and not same size as wchar t
        The most efficient way to store asian characters
J.M.Gimeno (jmgimeno@diei.udl.cat)           Unicode                   November 2008   12 / 21
Unicode   Encodings


UTF-16
        UTF-16 stores Unicode characters in 16 bit chunks
               all the BMP characters appear as themselves
               some trickery is needed for the astral plane ones
        There are two blocks of code points in the BMP called surrogate blocks
        High surrogates from U+D800 to U+DBFF
        Low surrogates from U+DC00 to U+DFFF
        Astral plane characters are splitted into two characters
               first, 0x10000 = 216 is subtracted from the code point
               next, its 20 bits are splitted using the low surrogate for the low ten bits
               and the high for the high ones
        This gives 20 bits or 220 characters that fits the 16 = 24 astral planes
        with 216 characters each
        So U+10346 is represented as the 16-bits integers 0xD800 0xDF46
        It also has ordering problems so the UTF-16BE, UTF-16LE or use of
        the BOM
        Nightmare in C: embedded zeros and not same size as wchar t
        The most efficient way to store asian characters
J.M.Gimeno (jmgimeno@diei.udl.cat)           Unicode                   November 2008   12 / 21
Unicode   Encodings


UTF-16
        UTF-16 stores Unicode characters in 16 bit chunks
               all the BMP characters appear as themselves
               some trickery is needed for the astral plane ones
        There are two blocks of code points in the BMP called surrogate blocks
        High surrogates from U+D800 to U+DBFF
        Low surrogates from U+DC00 to U+DFFF
        Astral plane characters are splitted into two characters
               first, 0x10000 = 216 is subtracted from the code point
               next, its 20 bits are splitted using the low surrogate for the low ten bits
               and the high for the high ones
        This gives 20 bits or 220 characters that fits the 16 = 24 astral planes
        with 216 characters each
        So U+10346 is represented as the 16-bits integers 0xD800 0xDF46
        It also has ordering problems so the UTF-16BE, UTF-16LE or use of
        the BOM
        Nightmare in C: embedded zeros and not same size as wchar t
        The most efficient way to store asian characters
J.M.Gimeno (jmgimeno@diei.udl.cat)           Unicode                   November 2008   12 / 21
Unicode   Encodings


UTF-16
        UTF-16 stores Unicode characters in 16 bit chunks
               all the BMP characters appear as themselves
               some trickery is needed for the astral plane ones
        There are two blocks of code points in the BMP called surrogate blocks
        High surrogates from U+D800 to U+DBFF
        Low surrogates from U+DC00 to U+DFFF
        Astral plane characters are splitted into two characters
               first, 0x10000 = 216 is subtracted from the code point
               next, its 20 bits are splitted using the low surrogate for the low ten bits
               and the high for the high ones
        This gives 20 bits or 220 characters that fits the 16 = 24 astral planes
        with 216 characters each
        So U+10346 is represented as the 16-bits integers 0xD800 0xDF46
        It also has ordering problems so the UTF-16BE, UTF-16LE or use of
        the BOM
        Nightmare in C: embedded zeros and not same size as wchar t
        The most efficient way to store asian characters
J.M.Gimeno (jmgimeno@diei.udl.cat)           Unicode                   November 2008   12 / 21
Unicode   Encodings


UTF-16
        UTF-16 stores Unicode characters in 16 bit chunks
               all the BMP characters appear as themselves
               some trickery is needed for the astral plane ones
        There are two blocks of code points in the BMP called surrogate blocks
        High surrogates from U+D800 to U+DBFF
        Low surrogates from U+DC00 to U+DFFF
        Astral plane characters are splitted into two characters
               first, 0x10000 = 216 is subtracted from the code point
               next, its 20 bits are splitted using the low surrogate for the low ten bits
               and the high for the high ones
        This gives 20 bits or 220 characters that fits the 16 = 24 astral planes
        with 216 characters each
        So U+10346 is represented as the 16-bits integers 0xD800 0xDF46
        It also has ordering problems so the UTF-16BE, UTF-16LE or use of
        the BOM
        Nightmare in C: embedded zeros and not same size as wchar t
        The most efficient way to store asian characters

UTF-8


        UTF-8 was invented by Ken Thompson on September 2, 1992, on a
        placemat in a New Jersey diner with Rob Pike.
        It works like this:
               characters whose value is less than 128 (ASCII) are encoded as
               themselves in one byte
               the rest have their bits ripped apart and dealt out into several (from
               two to four) bytes as follows:
                     The first byte has a run of high-order one bits telling how many
                     bytes are used to encode the character, followed by a zero bit
                     The rest of the bytes each begin with a single one bit followed by a
                     zero bit
                     The bits of the character are dealt out in the space left over after these
                     signalling bits
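
The dealing-out of bits described above can be written down by hand. The following is a minimal sketch, assuming Python 3; the name utf8_encode is hypothetical, and as a simplification it does not reject surrogate code points (U+D800 to U+DFFF), which a strict encoder would.

```python
def utf8_encode(code_point):
    """Encode one code point as a list of UTF-8 byte values."""
    if code_point < 0x80:
        return [code_point]                   # ASCII: encoded as itself
    if code_point < 0x800:
        lead, n_cont = 0b11000000, 1          # 110yyyyy
    elif code_point < 0x10000:
        lead, n_cont = 0b11100000, 2          # 1110xxxx
    elif code_point <= 0x10FFFF:
        lead, n_cont = 0b11110000, 3          # 11110www
    else:
        raise ValueError("code point out of Unicode range")
    continuation = []
    for _ in range(n_cont):
        # Each continuation byte is 10zzzzzz: six payload bits, lowest first.
        continuation.append(0b10000000 | (code_point & 0b111111))
        code_point >>= 6
    # Whatever bits remain fit in the space left in the first byte.
    return [lead | code_point] + continuation[::-1]


for cp in (0x0026, 0x0416, 0x4E2D, 0x10346):
    # Compare with the built-in codec to confirm the hand-written rules.
    assert bytes(utf8_encode(cp)) == chr(cp).encode("utf-8")
```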





UTF-8
      The following table summarizes the rules:

   Hex range         Binary                         UTF-8
   000000–00007F     0zzzzzzz                       0zzzzzzz
   000080–0007FF     00000yyy yyzzzzzz              110yyyyy 10zzzzzz
   000800–00FFFF     xxxxyyyy yyzzzzzz              1110xxxx 10yyyyyy 10zzzzzz
   010000–10FFFF     000wwwxx xxxxyyyy yyzzzzzz     11110www 10xxxxxx 10yyyyyy 10zzzzzz


      Our examples result in:

      Character    Binary                         UTF-8
      U+0026       00100110                       00100110
      U+0416       00000100 00010110              11010000 10010110
      U+4E2D       01001110 00101101              11100100 10111000 10101101
      U+10346      00000001 00000011 01000110     11110000 10010000 10001101 10000110


      Using hexadecimal:

                          Character     Hexadecimal
                          U+0026        0x26
                          U+0416        0xD0 0x96
                          U+4E2D        0xE4 0xB8 0xAD
                          U+10346       0xF0 0x90 0x8D 0x86
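
The binary and hexadecimal columns above can be reproduced with Python's built-in UTF-8 codec. This is a small sketch assuming Python 3, where iterating over a bytes object yields integers; the helper names utf8_bits and utf8_hex are hypothetical.

```python
def utf8_bits(code_point):
    """UTF-8 bytes of a code point, each shown as eight binary digits."""
    return " ".join(format(b, "08b") for b in chr(code_point).encode("utf-8"))


def utf8_hex(code_point):
    """UTF-8 bytes of a code point, each shown as 0xNN."""
    return " ".join("0x%02X" % b for b in chr(code_point).encode("utf-8"))


for cp in (0x0026, 0x0416, 0x4E2D, 0x10346):
    print("U+%04X" % cp, utf8_bits(cp), utf8_hex(cp), sep="   ")
```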


UTF-8


        UTF-8 is biased in favour of Western scripts
               anglophones get one byte per character
               most people west of the Indus river get away with two bytes
               India and points east need three bytes per character
        Processing UTF-8 characters sequentially is about as efficient as in
        any other encoding
        But you can’t easily index into a buffer (the same is true of UTF-16);
        to reach the n-th character you must either
               count characters from the start, or
               keep an array of character positions
        UTF-8 has no embedded zero bytes (apart from U+0000 itself), so some C routines work
        No byte-ordering problems
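
The size and indexing points on this slide can be checked with a short experiment. It is a sketch only, assuming Python 3; the sample characters and the helper names utf8_length and char_offsets are illustrative choices, not taken from the slides.

```python
def utf8_length(text):
    """Number of bytes the text occupies in UTF-8."""
    return len(text.encode("utf-8"))


def char_offsets(data):
    """Byte offsets at which each character starts in UTF-8 encoded data.

    Continuation bytes look like 10xxxxxx, so any other byte starts a character.
    """
    return [i for i, b in enumerate(data) if (b & 0b11000000) != 0b10000000]


# One sample character per script: Latin, Cyrillic, Devanagari, CJK.
for name, ch in [("Latin", "A"), ("Cyrillic", "\u0416"),
                 ("Devanagari", "\u0915"), ("CJK", "\u4e2d")]:
    print(name, utf8_length(ch))

data = "A\u0416\u4e2d".encode("utf-8")
offsets = char_offsets(data)     # the "array of positions" needed for indexing
print(offsets)

assert 0 not in data                       # UTF-8: no embedded zero bytes here
assert 0 in "A".encode("utf-16-be")        # UTF-16: even ASCII contains one
```

Building the offsets list requires a full pass over the buffer, which is the cost the slide alludes to when it says the buffer cannot be indexed easily.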



Python’s Unicode Support   Unicode String Type


Python’s Unicode type


        Python has a built-in Unicode type
        Unicode string literals have the same syntax as normal ones, with a
        u or U prefixing the quotes (e.g. u"This is Unicode")
        Unicode literals can include the escape sequence \uXXXX to denote
        code point U+XXXX and \UXXXXXXXX for U+XXXXXXXX (e.g.
        u"\u0026\u0416\u4e2d\U00010346")
        Unicode characters can be named using the escape sequence
        \N{name} (e.g. u"\N{Ampersand}")
        unichr(i) returns a Unicode string containing the character with code point i
        (the inverse is ord)
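
The literals and escapes listed on this slide can be tried out as follows. The slides describe Python 2; this sketch assumes Python 3.3 or later, where the u prefix is still accepted, every string is a Unicode string, and chr plays the role of Python 2's unichr.

```python
# The four running examples, written with \u and \U escapes.
text = u"\u0026\u0416\u4e2d\U00010346"

# The same ampersand, this time referred to by its Unicode name.
amp = u"\N{AMPERSAND}"
assert amp == text[0]

# ord gives the code point of a character; chr (unichr in Python 2) inverts it.
code_points = [ord(ch) for ch in text]
print(["U+%04X" % cp for cp in code_points])
assert chr(0x0416) == text[1]
```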




 
Conceptes bàsics de la Web 2.0
Conceptes bàsics de la Web 2.0Conceptes bàsics de la Web 2.0
Conceptes bàsics de la Web 2.0Juan-Manuel Gimeno
 
Metaclass Programming in Python
Metaclass Programming in PythonMetaclass Programming in Python
Metaclass Programming in PythonJuan-Manuel Gimeno
 
Object-oriented Programming in Python
Object-oriented Programming in PythonObject-oriented Programming in Python
Object-oriented Programming in PythonJuan-Manuel Gimeno
 
Python: the Project, the Language and the Style
Python: the Project, the Language and the StylePython: the Project, the Language and the Style
Python: the Project, the Language and the StyleJuan-Manuel Gimeno
 

Plus de Juan-Manuel Gimeno (8)

Visualización de datos enlazados
Visualización de datos enlazadosVisualización de datos enlazados
Visualización de datos enlazados
 
Functional programming in clojure
Functional programming in clojureFunctional programming in clojure
Functional programming in clojure
 
Sistemas de recomendación
Sistemas de recomendaciónSistemas de recomendación
Sistemas de recomendación
 
Proves de Software (en Java amb JUnit)
Proves de Software (en Java amb JUnit)Proves de Software (en Java amb JUnit)
Proves de Software (en Java amb JUnit)
 
Conceptes bàsics de la Web 2.0
Conceptes bàsics de la Web 2.0Conceptes bàsics de la Web 2.0
Conceptes bàsics de la Web 2.0
 
Metaclass Programming in Python
Metaclass Programming in PythonMetaclass Programming in Python
Metaclass Programming in Python
 
Object-oriented Programming in Python
Object-oriented Programming in PythonObject-oriented Programming in Python
Object-oriented Programming in Python
 
Python: the Project, the Language and the Style
Python: the Project, the Language and the StylePython: the Project, the Language and the Style
Python: the Project, the Language and the Style
 

Dernier

How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxCeline George
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsKarakKing
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfPoh-Sun Goh
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...Poonam Aher Patil
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...Nguyen Thanh Tu Collection
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxAmanpreet Kaur
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Jisc
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structuredhanjurrannsibayan2
 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseAnaAcapella
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.MaryamAhmad92
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.pptRamjanShidvankar
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and ModificationsMJDuyan
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxheathfieldcps1
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17Celine George
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024Elizabeth Walsh
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...pradhanghanshyam7136
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfDr Vijay Vishwakarma
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxDr. Ravikiran H M Gowda
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsMebane Rash
 

Dernier (20)

How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptx
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 

Unicode (and Python)

  • 1. Unicode (and Python) Juan Manuel Gimeno Illa jmgimeno@diei.udl.cat November 2008 J.M.Gimeno (jmgimeno@diei.udl.cat) Unicode November 2008 1 / 21
  • 2. Outline
      1. Before Unicode
      2. Unicode
          • Unicode Concepts
          • Encodings
      3. Python’s Unicode Support
          • Unicode String Type
          • Source Code Encoding
      4. Bibliography
  • 3. Before Unicode
      • In the beginning, computing was mainly centered in North America and done in English. Characters were stored one per byte using either
          • ASCII (7 bits)
          • EBCDIC (8 bits)
      • In other parts of the world, other ways of storing local characters were invented
          • Japan: various flavours of JIS encodings
          • Russia: KOI8
          • India: ISCII standard
      • There were also proprietary encodings defined by operating system vendors
  • 11. Before Unicode: ISO-8859-*
      • For the huge number of people in America, Europe, and the Middle East who use relatively small alphabets, there was ISO-8859, which
          • left ASCII as ASCII (range 0 to 127)
          • used the range 128 through 255 for different purposes, depending on the part
              • 1-4: different accented characters (e.g. Latin-1)
              • 5: Cyrillic
              • 6: Arabic
              • 7: Greek
              • 8: Hebrew
              • 9: Turkish
              • 10: Nordic languages
      • But only one part could be in use at a time, so one couldn’t easily mix Greek and Cyrillic in the same file.
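The "one part at a time" limitation can be seen with a small sketch. It assumes Python 3 (the slides predate it) and uses the standard codec names `iso8859-1`, `iso8859-5` and `iso8859-7`; the variable names are illustrative only.

```python
# Assumes Python 3, where bytes and text (str) are distinct types.
raw = bytes([0xE1])  # one byte from the upper range 128-255

# The same byte decodes to a different character under each ISO-8859 part.
as_latin1 = raw.decode("iso8859-1")    # part 1: Latin-1
as_cyrillic = raw.decode("iso8859-5")  # part 5: Cyrillic
as_greek = raw.decode("iso8859-7")     # part 7: Greek
for ch in (as_latin1, as_cyrillic, as_greek):
    print(f"U+{ord(ch):04X}")

# The ASCII range (0 to 127) is shared by every part.
ascii_same = {b"A".decode(c) for c in ("iso8859-1", "iso8859-5", "iso8859-7")}

# A single part cannot hold Greek and Cyrillic together.
try:
    (as_greek + as_cyrillic).encode("iso8859-7")
    mixed_ok = True
except UnicodeEncodeError:
    mixed_ok = False
```

The byte value alone does not identify a character; the reader must also know which part was used to write the file.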
  • 17. Before Unicode: Houston, Houston, ...
      • Clearly this was a very unsatisfactory situation
      • ISO-2022 provided a partial solution
          • it allowed shifting between encodings in the middle of a string
          • it was difficult to use, so it never became widespread
      • What was needed was a universal way to refer to all the different characters in all the alphabets
          • ISO/IEC 10646
          • Unicode
  • 24. Unicode Concepts: Unicode’s Solution
      • One encoding for all scripts of the world
      • ASCII compatibility (even Latin-1)
      • Includes character metadata
          • Case mapping information
          • Character category information
      • Accounts for scripts using different orientations
      • Enables sorting and normalization support
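Some of this metadata can be inspected from Python's standard `unicodedata` module. The following is a minimal sketch, assuming Python 3; the chosen characters and variable names are illustrative and do not come from the slides.

```python
import unicodedata

e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# Character name and general category ("Ll" means letter, lowercase).
name = unicodedata.name(e_acute)
category = unicodedata.category(e_acute)

# Case mapping information drives str.upper() and str.lower().
upper = e_acute.upper()

# Bidirectional class: "R" marks a right-to-left character (Hebrew alef here).
direction = unicodedata.bidirectional("\u05d0")

# Normalization: NFC composes "e" + COMBINING ACUTE ACCENT into one code point.
composed = unicodedata.normalize("NFC", "e\u0301")

print(name, category, direction)
```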
  • 31. Unicode Concepts: Unicode’s Terminology
      • Grapheme: what users regard as a character
          • André
      • Code points: a Unicode encoding of the string
          • Andr + U+00E9 (LATIN SMALL LETTER E WITH ACUTE)
          • Andre + U+0301 (COMBINING ACUTE ACCENT), which also renders as André
      • Code units: what the implementation stores (e.g. UTF-8)
          • Andre + 0xCC 0x81
      • This can be explored on Linux using the program gucharmap
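The three levels can be told apart in a short sketch, assuming Python 3, where a `str` is a sequence of code points and `bytes` holds the encoded code units. The variable names are illustrative.

```python
import unicodedata

# Two code point sequences for the same graphemes, "André".
precomposed = "Andr\u00e9"   # ends with LATIN SMALL LETTER E WITH ACUTE
decomposed = "Andre\u0301"   # ends with "e" + COMBINING ACUTE ACCENT

# Same text for a user, different numbers of code points.
lengths = (len(precomposed), len(decomposed))

# Code units: the bytes that UTF-8 actually stores for each sequence.
utf8_precomposed = precomposed.encode("utf-8")
utf8_decomposed = decomposed.encode("utf-8")

# The strings compare unequal until both are normalized to the same form.
equal_raw = precomposed == decomposed
equal_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

print(lengths, utf8_precomposed, utf8_decomposed)
```

Here the accent in the decomposed form is stored as the two bytes 0xCC 0x81, matching the slide, while the precomposed é is stored as 0xC3 0xA9.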
  • 36. Unicode Concepts: Unicode Organization
      • At the time of writing (2008), Unicode defines just under 100000 code points, but it has space for up to 1114112
      • They are organized into 17 planes of 2^16 = 65536 characters, numbered 0 to 16
      • Plane 0 is called the Basic Multilingual Plane (BMP) and contains pretty well everything useful
      • The characters in the BMP are laid out more or less West to East
          • ASCII characters from 0 to 127
          • Latin-1 characters from 128 to 255
          • Then moving East in Europe (Greek, Cyrillic)
          • Next the Middle East (Arabic, Hebrew)
          • Then the Indus (scripts of India)
          • Next Southeast Asia (Thai, Laotian and so on)
          • and ending with China, Japan and Korea
      • Planes 1 to 16 are sometimes called astral planes; they include exotic, rare and historically important characters (Old Italic, Byzantine musical symbols, etc.)
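The plane arithmetic can be checked directly. This is a minimal sketch, assuming Python 3.3 or later (where `sys.maxunicode` is always 0x10FFFF); the helper `plane` is a hypothetical name introduced for illustration.

```python
import sys

PLANE_SIZE = 2 ** 16          # 65536 code points per plane
TOTAL = 17 * PLANE_SIZE       # 1114112 code points in all

def plane(code_point):
    """Return the plane number (0 to 16) that holds a code point."""
    return code_point >> 16   # same as code_point // PLANE_SIZE

bmp_examples = [plane(ord(ch)) for ch in "A\u00e9\u03b1\u0416\u05d0"]
old_italic = plane(0x10300)   # OLD ITALIC LETTER A, an astral character
last = plane(sys.maxunicode)  # highest code point, U+10FFFF

print(TOTAL, bmp_examples, old_italic, last)
```

Latin, Greek, Cyrillic and Hebrew letters all land in plane 0, while Old Italic sits in plane 1.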
  • 37. Unicode Unicode Concepts Unicode Organization Unicode currently defines just under 100000 code points but it has space for upto 1114112 They are organized into 17 planes of 216 = 65536 characters, numbered 0 to 16 Plane 0 is called Basic Multilingual Plane (BMP) and contains pretty well everything useful The characters in BMP are laid out more or less West to East ASCII characters from 0 to 127 Latin-1 characters from 128 to 255 Then moving East in Europe (Greek, Cyrillic) Next Middle East (Arabic, Hebrew) Then the Indus (scripts of India) Next Southeast Asia (Thai, Laotian and so on) and ending with China, Japan and Korea Planes 1 to 16 are sometimes called astral planes that include exotic, rare and historically important characters (old italic, byzantine musical symbols, etc.) J.M.Gimeno (jmgimeno@diei.udl.cat) Unicode November 2008 8 / 21
  • 48. Unicode Concepts: Code Points. Each code point (“character”) gets a number and a name. The number is usually given in hexadecimal and prefixed by U+ (note that it is not a 16-bit number, because of the astral planes). Unicode includes tables of useful character properties (metadata), such as “this is a number”, “this is uppercase” or “this is punctuation”. The standard also provides a picture of a reasonably typical rendition of each character, and rules for line-breaking, hyphenation and sorting.
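Python exposes part of this character metadata through the standard `unicodedata` module. The sketch below assumes Python 3; the helper name `describe` is made up for illustration.

```python
import unicodedata

def describe(ch):
    """Return 'U+XXXX NAME [category]' for a one-character string."""
    code_point = ord(ch)
    name = unicodedata.name(ch, "<unnamed>")   # fallback for unnamed code points
    category = unicodedata.category(ch)        # two-letter general category
    return "U+%04X %s [%s]" % (code_point, name, category)

print(describe("&"))           # punctuation
print(describe("\u0416"))      # an uppercase letter
print(describe("7"))           # a decimal digit
```

The two-letter general category carries the properties the slide mentions: `Lu` marks an uppercase letter, `Nd` a decimal digit, and categories beginning with `P` mark punctuation.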
  • 60. Encodings. Along with the code points, Unicode also defines methods for storing them as byte sequences in a computer. There are three approaches, named UTF-8, UTF-16 and UTF-32. UTF stands for Unicode Transformation Format or UCS Transformation Format, where UCS stands for Universal Character Set. The characters used in the explanations are (number, name, plane): U+0026 (38), AMPERSAND, BMP; U+0416 (1046), CYRILLIC CAPITAL LETTER ZHE, BMP; U+4E2D (20013), HAN IDEOGRAPH 4E2D (formally CJK UNIFIED IDEOGRAPH-4E2D), BMP; U+10346 (66374), GOTHIC LETTER FAIHU, astral.
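As a rough illustration of how the three encodings differ, the following Python 3 sketch encodes the four sample characters and counts the resulting bytes. The names `SAMPLES` and `encoded_sizes` are hypothetical; the explicit big-endian codecs are chosen so that no byte order mark is added to the counts.

```python
# The four sample characters from the slide, written as escapes.
SAMPLES = ["\u0026", "\u0416", "\u4e2d", "\U00010346"]

def encoded_sizes(ch):
    """Return byte lengths of ch in (UTF-8, UTF-16, UTF-32)."""
    return tuple(len(ch.encode(enc))
                 for enc in ("utf-8", "utf-16-be", "utf-32-be"))

for ch in SAMPLES:
    print("U+%04X" % ord(ch), encoded_sizes(ch))
```

UTF-32 always uses four bytes, UTF-16 uses two for BMP characters and four for astral ones, and UTF-8 ranges from one to four bytes.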
  • 64. Encodings: UTF-32. The simplest way to store characters: use 32 bits (4 bytes) for each one, so 38, 1046, 20013 and 66374 are stored as 32-bit integers. For Latin-1 characters it wastes a lot of space. It causes problems with C strings because most bytes are zero (use wchar_t). There is more than one way of laying out a 4-byte integer in 4 bytes (remember big-endian and little-endian?), so problems occur if you send one of these integers to a machine that uses a different ordering. Solutions: explicitness, with the UTF-32BE and UTF-32LE encodings; or a Byte Order Mark (BOM), the character U+FEFF (ZERO WIDTH NO-BREAK SPACE) placed at the start of the data, together with the guarantee that U+FFFE will never be a character.
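The byte-order issue and both remedies can be seen directly in Python. This is a sketch assuming Python 3 and its standard `codecs` and `struct` modules; the variable names are illustrative only.

```python
import codecs
import struct

zhe = "\u0416"                        # CYRILLIC CAPITAL LETTER ZHE

# Remedy 1: name the byte order explicitly.
big = zhe.encode("utf-32-be")         # most significant byte first
little = zhe.encode("utf-32-le")      # least significant byte first

# The same bytes come out of packing the code point as a 32-bit integer.
same_as_int = struct.pack(">I", ord(zhe)) == big

# Remedy 2: prefix a byte order mark and let the generic codec detect it.
from_be = (codecs.BOM_UTF32_BE + big).decode("utf-32")
from_le = (codecs.BOM_UTF32_LE + little).decode("utf-32")
```

Both `from_be` and `from_le` decode back to the original character, because the generic `utf-32` decoder reads the BOM to pick the byte order and then discards it.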
  • 73. Encodings: UTF-16. UTF-16 stores Unicode characters in 16-bit chunks: all the BMP characters appear as themselves, and some trickery is needed for the astral-plane ones. There are two blocks of code points in the BMP called surrogate blocks: high surrogates from U+D800 to U+DBFF and low surrogates from U+DC00 to U+DFFF. Astral-plane characters are split into two 16-bit units: first, 0x10000 = 2^16 is subtracted from the code point; next, the resulting 20 bits are split, with the high surrogate carrying the high ten bits and the low surrogate the low ten. This gives 20 bits, or 2^20 characters, which exactly fits the 16 = 2^4 astral planes of 2^16 characters each. So U+10346 is represented as the 16-bit integers 0xD800 0xDF46. UTF-16 also has byte-ordering problems, hence UTF-16BE, UTF-16LE or the use of the BOM. It is a nightmare in C: embedded zeros, and not necessarily the same size as wchar_t. It is usually the most compact way to store Asian characters.
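The surrogate arithmetic described on this slide can be written out as a short Python 3 sketch. The function names `to_surrogates` and `from_surrogates` are hypothetical helpers; in practice the built-in `utf-16` codecs do this work.

```python
def to_surrogates(code_point):
    """Split an astral code point into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only astral-plane code points need surrogates")
    offset = code_point - 0x10000          # now a 20-bit value
    high = 0xD800 + (offset >> 10)         # top ten bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom ten bits
    return high, low

def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogates(0x10346)         # GOTHIC LETTER FAIHU
print(hex(high), hex(low))
```

The result agrees with Python's own encoder: `"\U00010346".encode("utf-16-be")` yields the same two 16-bit units in big-endian byte order.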
  • 74. Unicode Encodings UTF-16 UTF-16 stores Unicode characters in 16 bit chunks all the BMP characters appear as themselves some trickery is needed for the astral plane ones There are two blocks of code points in the BMP called surrogate blocks High surrogates from U+D800 to U+DBFF Low surrogates from U+DC00 to U+DFFF Astral plane characters are splitted into two characters first, 0x10000 = 216 is subtracted from the code point next, its 20 bits are splitted using the low surrogate for the low ten bits and the high for the high ones This gives 20 bits or 220 characters that fits the 16 = 24 astral planes with 216 characters each So U+10346 is represented as the 16-bits integers 0xD800 0xDF46 It also has ordering problems so the UTF-16BE, UTF-16LE or use of the BOM Nightmare in C: embedded zeros and not same size as wchar t The most efficient way to store asian characters J.M.Gimeno (jmgimeno@diei.udl.cat) Unicode November 2008 12 / 21
  • 75. Unicode Encodings UTF-16 UTF-16 stores Unicode characters in 16 bit chunks all the BMP characters appear as themselves some trickery is needed for the astral plane ones There are two blocks of code points in the BMP called surrogate blocks High surrogates from U+D800 to U+DBFF Low surrogates from U+DC00 to U+DFFF Astral plane characters are splitted into two characters first, 0x10000 = 216 is subtracted from the code point next, its 20 bits are splitted using the low surrogate for the low ten bits and the high for the high ones This gives 20 bits or 220 characters that fits the 16 = 24 astral planes with 216 characters each So U+10346 is represented as the 16-bits integers 0xD800 0xDF46 It also has ordering problems so the UTF-16BE, UTF-16LE or use of the BOM Nightmare in C: embedded zeros and not same size as wchar t The most efficient way to store asian characters J.M.Gimeno (jmgimeno@diei.udl.cat) Unicode November 2008 12 / 21
  • 76. Unicode Encodings UTF-16 UTF-16 stores Unicode characters in 16 bit chunks all the BMP characters appear as themselves some trickery is needed for the astral plane ones There are two blocks of code points in the BMP called surrogate blocks High surrogates from U+D800 to U+DBFF Low surrogates from U+DC00 to U+DFFF Astral plane characters are splitted into two characters first, 0x10000 = 216 is subtracted from the code point next, its 20 bits are splitted using the low surrogate for the low ten bits and the high for the high ones This gives 20 bits or 220 characters that fits the 16 = 24 astral planes with 216 characters each So U+10346 is represented as the 16-bits integers 0xD800 0xDF46 It also has ordering problems so the UTF-16BE, UTF-16LE or use of the BOM Nightmare in C: embedded zeros and not same size as wchar t The most efficient way to store asian characters J.M.Gimeno (jmgimeno@diei.udl.cat) Unicode November 2008 12 / 21
  • 77. Unicode Encodings UTF-16 UTF-16 stores Unicode characters in 16 bit chunks all the BMP characters appear as themselves some trickery is needed for the astral plane ones There are two blocks of code points in the BMP called surrogate blocks High surrogates from U+D800 to U+DBFF Low surrogates from U+DC00 to U+DFFF Astral plane characters are splitted into two characters first, 0x10000 = 216 is subtracted from the code point next, its 20 bits are splitted using the low surrogate for the low ten bits and the high for the high ones This gives 20 bits or 220 characters that fits the 16 = 24 astral planes with 216 characters each So U+10346 is represented as the 16-bits integers 0xD800 0xDF46 It also has ordering problems so the UTF-16BE, UTF-16LE or use of the BOM Nightmare in C: embedded zeros and not same size as wchar t The most efficient way to store asian characters J.M.Gimeno (jmgimeno@diei.udl.cat) Unicode November 2008 12 / 21
  • 78. Unicode Encodings UTF-16 UTF-16 stores Unicode characters in 16 bit chunks all the BMP characters appear as themselves some trickery is needed for the astral plane ones There are two blocks of code points in the BMP called surrogate blocks High surrogates from U+D800 to U+DBFF Low surrogates from U+DC00 to U+DFFF Astral plane characters are splitted into two characters first, 0x10000 = 216 is subtracted from the code point next, its 20 bits are splitted using the low surrogate for the low ten bits and the high for the high ones This gives 20 bits or 220 characters that fits the 16 = 24 astral planes with 216 characters each So U+10346 is represented as the 16-bits integers 0xD800 0xDF46 It also has ordering problems so the UTF-16BE, UTF-16LE or use of the BOM Nightmare in C: embedded zeros and not same size as wchar t The most efficient way to store asian characters J.M.Gimeno (jmgimeno@diei.udl.cat) Unicode November 2008 12 / 21
  • 79. Unicode Encodings UTF-16 UTF-16 stores Unicode characters in 16 bit chunks all the BMP characters appear as themselves some trickery is needed for the astral plane ones There are two blocks of code points in the BMP called surrogate blocks High surrogates from U+D800 to U+DBFF Low surrogates from U+DC00 to U+DFFF Astral plane characters are splitted into two characters first, 0x10000 = 216 is subtracted from the code point next, its 20 bits are splitted using the low surrogate for the low ten bits and the high for the high ones This gives 20 bits or 220 characters that fits the 16 = 24 astral planes with 216 characters each So U+10346 is represented as the 16-bits integers 0xD800 0xDF46 It also has ordering problems so the UTF-16BE, UTF-16LE or use of the BOM Nightmare in C: embedded zeros and not same size as wchar t The most efficient way to store asian characters J.M.Gimeno (jmgimeno@diei.udl.cat) Unicode November 2008 12 / 21
  • 80. Unicode Encodings UTF-16 UTF-16 stores Unicode characters in 16 bit chunks all the BMP characters appear as themselves some trickery is needed for the astral plane ones There are two blocks of code points in the BMP called surrogate blocks High surrogates from U+D800 to U+DBFF Low surrogates from U+DC00 to U+DFFF Astral plane characters are splitted into two characters first, 0x10000 = 216 is subtracted from the code point next, its 20 bits are splitted using the low surrogate for the low ten bits and the high for the high ones This gives 20 bits or 220 characters that fits the 16 = 24 astral planes with 216 characters each So U+10346 is represented as the 16-bits integers 0xD800 0xDF46 It also has ordering problems so the UTF-16BE, UTF-16LE or use of the BOM Nightmare in C: embedded zeros and not same size as wchar t The most efficient way to store asian characters J.M.Gimeno (jmgimeno@diei.udl.cat) Unicode November 2008 12 / 21
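The surrogate arithmetic on the UTF-16 slide can be checked with a few lines of Python. This is only a minimal sketch, written for Python 3 rather than the Python 2 of the talk; the helper name to_surrogates is made up for illustration and is not part of any library.

```python
def to_surrogates(code_point):
    """Split an astral code point (U+10000..U+10FFFF) into a surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not an astral plane code point")
    offset = code_point - 0x10000       # subtract 2**16, leaving a 20-bit value
    high = 0xD800 + (offset >> 10)      # high surrogate carries the top ten bits
    low = 0xDC00 + (offset & 0x3FF)     # low surrogate carries the bottom ten bits
    return high, low


high, low = to_surrogates(0x10346)
print(hex(high), hex(low))              # 0xd800 0xdf46

# Cross-check against Python's own codecs, in both byte orders.
ch = chr(0x10346)
print(ch.encode("utf-16-be").hex())     # d800df46
print(ch.encode("utf-16-le").hex())     # 00d846df
```

The two encoded forms contain the same 16-bit integers; only the order of the bytes inside each integer differs, which is the ordering problem that UTF-16BE, UTF-16LE and the BOM address.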
  • 87. Unicode Encodings: UTF-8
        UTF-8 was invented by Ken Thompson on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. It works like this:
               characters whose value is less than 128 (ASCII) are encoded as themselves in one byte
               the rest have their bits ripped apart and dealt out into several (from two to four) bytes as follows:
                      The first byte has a run of high-order one bits telling how many bytes are used to encode the character, followed by a zero bit
                      Each of the remaining bytes begins with a single one bit followed by a zero bit
                      The bits of the character are dealt out in the space left over after these signalling bits
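The signalling bits described above are enough to tell, from the first byte alone, how long a UTF-8 sequence is. The following Python 3 sketch illustrates that idea; sequence_length is a hypothetical helper name, and the function only inspects the lead byte, so it does not validate the bytes that follow.

```python
def sequence_length(first_byte):
    """Return how many bytes the UTF-8 sequence starting with first_byte uses."""
    if (first_byte & 0b10000000) == 0b00000000:   # 0xxxxxxx: plain ASCII
        return 1
    if (first_byte & 0b11100000) == 0b11000000:   # 110xxxxx: two bytes
        return 2
    if (first_byte & 0b11110000) == 0b11100000:   # 1110xxxx: three bytes
        return 3
    if (first_byte & 0b11111000) == 0b11110000:   # 11110xxx: four bytes
        return 4
    # 10xxxxxx is a continuation byte; anything else is not a valid lead byte
    raise ValueError("not a UTF-8 lead byte: 0x%02X" % first_byte)


for text in ("&", "\u0416", "\u4e2d", "\U00010346"):
    data = text.encode("utf-8")
    print(len(data), sequence_length(data[0]))
```

For each of the four sample characters the two printed numbers agree, since the length announced by the first byte matches the number of bytes the codec actually produced.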
  • 94. Unicode Encodings: UTF-8
        The following table summarizes the rules:

               Hex range        Binary                        UTF-8
               000000-00007F    0zzzzzzz                      0zzzzzzz
               000080-0007FF    00000yyy yyzzzzzz             110yyyyy 10zzzzzz
               000800-00FFFF    xxxxyyyy yyzzzzzz             1110xxxx 10yyyyyy 10zzzzzz
               010000-10FFFF    000wwwxx xxxxyyyy yyzzzzzz    11110www 10xxxxxx 10yyyyyy 10zzzzzz

        Our examples result in:

               Character    Binary                        UTF-8
               U+0026       00100110                      00100110
               U+0416       00000100 00010110             11010000 10010110
               U+4E2D       01001110 00101101             11100100 10111000 10101101
               U+10346      00000001 00000011 01000110    11110000 10010000 10001101 10000110

        Using hexadecimal:

               Character    Hexadecimal
               U+0026       0x26
               U+0416       0xD0 0x96
               U+4E2D       0xE4 0xB8 0xAD
               U+10346      0xF0 0x90 0x8D 0x86
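The rules table translates almost line by line into shifts and masks. Below is a rough Python 3 sketch of such an encoder, checked against the four sample characters and against Python's built-in codec. The name utf8_encode is an assumption for illustration, and the sketch does not reject surrogate code points as a real codec would.

```python
def utf8_encode(cp):
    """Encode one code point following the four rows of the rules table."""
    if cp < 0x80:                                   # 0zzzzzzz
        return bytes([cp])
    if cp < 0x800:                                  # 110yyyyy 10zzzzzz
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                                # 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                              # 11110www 10xxxxxx 10yyyyyy 10zzzzzz
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")


for cp in (0x0026, 0x0416, 0x4E2D, 0x10346):
    encoded = utf8_encode(cp)
    assert encoded == chr(cp).encode("utf-8")       # agrees with the built-in codec
    print("U+%04X" % cp, " ".join("0x%02X" % b for b in encoded))
```

Each printed line should reproduce the corresponding row of the hexadecimal table.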
  • 100. Unicode Encodings: UTF-8
        UTF-8 is biased in favour of Western scripts
               anglophones get one byte per character
               most people west of the Indus river get away with two bytes
               India and points east need three bytes per character
        Processing UTF-8 characters sequentially is about as efficient as in any other encoding
        But you can't easily index into a buffer (the same is true of UTF-16); to reach the n-th character you must either
               count characters from the start, or
               keep an array of positions
        UTF-8 has no embedded zero bytes, so some C routines still work
        No byte-ordering problems
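The size and indexing trade-offs listed above are easy to observe. This is a small Python 3 sketch; the sample words are arbitrary choices and count_characters is a made-up helper that counts code points by skipping continuation bytes, on the assumption that its input is valid UTF-8.

```python
samples = {
    "English": "hello",
    "Russian": "\u043f\u0440\u0438\u0432\u0435\u0442",   # "privet", six Cyrillic letters
    "Chinese": "\u4f60\u597d",                           # "ni hao", two ideographs
}

# Columns: characters, UTF-8 bytes, UTF-16 bytes (no BOM)
for name, text in samples.items():
    print(name, len(text), len(text.encode("utf-8")), len(text.encode("utf-16-be")))
# English 5 5 10
# Russian 6 12 12
# Chinese 2 6 4


def count_characters(data):
    """Count code points in UTF-8 bytes: every byte that is not 10xxxxxx starts one."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


utf8 = samples["Russian"].encode("utf-8")
print(len(utf8), count_characters(utf8))    # 12 6

# UTF-8 never produces a zero byte for non-NUL text; UTF-16 does for ASCII.
print(0 in "hello".encode("utf-8"), 0 in "hello".encode("utf-16-be"))   # False True
```

Counting requires a pass over the whole buffer, which is why random access by character position needs either that scan or a precomputed array of byte offsets.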
  • 110. Python's Unicode Support: Unicode String Type
        Python's Unicode type
               Python has a built-in Unicode type
               Unicode string literals have the same syntax as the normal ones, with a u or U prefixing the quotes (e.g. u"This is Unicode")
               Unicode literals can include the escape sequence \uXXXX to denote code point U+XXXX and \UXXXXXXXX for U+XXXXXXXX (e.g. u"\u0026\u0416\u4e2d\U00010346")
               Unicode characters can be named using the escape sequence \N{name} (e.g. u"\N{Ampersand}")
               unichr(i) returns a Unicode string containing the character with code point i (the inverse is ord)
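The literal syntax above can be tried directly in an interpreter. The sketch below assumes Python 3.3 or later, where the u prefix is still accepted but optional and chr plays the role that unichr has in the Python 2 of the slides; the use of the unicodedata module is an addition for illustration.

```python
import unicodedata

# The four sample characters, written with \u and \U escapes.
s = u"\u0026\u0416\u4e2d\U00010346"
print(len(s))                        # 4 (one item per code point on Python 3.3+)

# A character referred to by its Unicode name.
amp = u"\N{AMPERSAND}"
print(amp == "&")                    # True

# chr and ord are inverses (Python 2 spells the first one unichr).
print(ord(amp))                      # 38
print(chr(0x0416) == s[1])           # True

# Looking up the official name of a character.
print(unicodedata.name(s[1]))        # CYRILLIC CAPITAL LETTER ZHE
```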