2. INTRODUCTION
• Computers at their most basic level just
deal with numbers. They store letters,
numerals and other characters by
assigning a number for each one.
• In the pre-Unicode environment, we
had single 8-bit character sets, which
limited us to at most 256 characters. No
single encoding could contain enough
characters to cover all languages.
• So hundreds of different encoding
systems were developed for assigning
numbers to characters.
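The idea that characters are stored as numbers can be seen directly in Python. This is only a minimal sketch: the sample string and the choice of the `ascii` codec are illustrative assumptions, not something the slides prescribe.

```python
# Each character is stored as a number; ASCII is used here as the example mapping.
text = "Hi!"  # sample string chosen for illustration

# encode() turns the characters into the numbers (bytes) that get stored.
numbers = list(text.encode("ascii"))
print(numbers)  # [72, 105, 33]

# Going the other way: the same numbers decoded back into characters.
restored = bytes(numbers).decode("ascii")
print(restored)  # Hi!
```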
3. INTRODUCTION (continued)
• As a result, these coding systems
conflict with each other. That is, two
encodings can use the same number
for two different characters or different
numbers for the same character.
• Any given computer needs to support
many different encodings.
• Yet whenever data is passed
between different encodings or
platforms, that data runs the
risk of corruption.
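The conflict described above can be sketched in Python. The byte value `0xE9` and the three legacy code pages are arbitrary choices for illustration; the second half assumes text that was written as UTF-8 and then read back with the wrong encoding.

```python
import unicodedata

# One stored byte value, interpreted under three different legacy encodings.
raw = bytes([0xE9])

as_western = raw.decode("latin-1")     # ISO-8859-1 (Western European): é
as_cyrillic = raw.decode("cp1251")     # Windows-1251 (Cyrillic): й
as_greek = raw.decode("iso8859_7")     # ISO-8859-7 (Greek): ι

for ch in (as_western, as_cyrillic, as_greek):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Corruption in transit: text written as UTF-8 but read back as Latin-1.
original = "café"
garbled = original.encode("utf-8").decode("latin-1")  # reads as "cafÃ©"
print(garbled == original)  # False
```

The same number yields three different characters, and the round trip through a mismatched encoding silently changes the text instead of raising an error.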
4. Examples of character encoding
systems
• Morse code
• Baudot code
• the American Standard Code for
Information Interchange (ASCII)
• Unicode
5. WHAT IS UNICODE?
Unicode provides a unique number for
every character,
no matter what the platform,
no matter what the program,
no matter what the language.
The Unicode Standard is a character coding
system designed to support the worldwide
interchange, processing, and display of the
written texts of the world's diverse languages.
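A short Python sketch of "a unique number for every character": the sample characters below are assumptions picked to span several scripts, and the standard-library `unicodedata` module is used to look up each character's official name.

```python
import unicodedata

# Sample characters from different scripts (chosen for illustration).
samples = ["A", "é", "я", "中", "😀"]

# ord() returns the Unicode code point, the same on every platform.
code_points = [ord(ch) for ch in samples]

for ch in samples:
    # Print the code point in the usual U+XXXX notation with its Unicode name.
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```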
6. From ASCII to Unicode
• Most character sets and encodings of the
1970s and 1980s were modifications or
extensions of ASCII
• Most of these legacy encodings
use a single byte per character (SBCS)
• They are therefore all limited to 256 characters
• As a result, no single one of them can
cover the letters of all the
European languages, let alone other scripts
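The 256-character limit can be illustrated with a small Python check. The helper `can_encode` is a hypothetical name introduced here, and the two letters and code pages are example choices: "ñ" exists in ISO-8859-1 (Latin-1) but not ISO-8859-2 (Latin-2), while "ł" is the reverse.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Spanish "ñ" and Polish "ł" cannot share either single-byte code page.
for encoding in ("latin-1", "iso8859_2", "utf-8"):
    print(encoding, can_encode("ñ", encoding), can_encode("ł", encoding))
# latin-1 True False
# iso8859_2 False True
# utf-8 True True
```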
7. Where is Unicode Used?
• The Unicode Standard has been
adopted by many software and hardware
vendors.
• Most operating systems support Unicode.
• Unicode is required for international
document and data interchange, the
Internet and the WWW, and therefore by
modern languages and standards such as:
• Java, C#, Perl, Python
• Markup languages such as XML,
HTML, XHTML
• JavaScript, LDAP, CORBA, etc.
8. UTF-8
• UTF-8 is the 8-bit encoding of Unicode
• It’s a variable-width encoding and also
a strict superset of ASCII.
• “Strict superset” means that every
character in ASCII is available in UTF-8
with the same corresponding code point
value
• 1 character = 1 to 4 bytes in the
encoding
• Characters from European scripts:
mostly 1 or 2 bytes
• Most Asian scripts: 3 bytes, or 4 bytes
for supplementary characters
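The variable width of UTF-8 can be checked with a few lines of Python. The sample characters and their labels are illustrative assumptions; the byte counts come from Python's built-in `utf-8` codec.

```python
# Sample characters (chosen for illustration) with a short description each.
samples = {
    "A": "ASCII letter",
    "é": "accented Latin letter",
    "я": "Cyrillic letter",
    "中": "CJK ideograph",
    "😀": "supplementary character (emoji)",
}

# Number of bytes each character occupies once encoded as UTF-8.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: {utf8_lengths[ch]} byte(s)")

# ASCII text produces identical bytes under both codecs.
same_bytes = "Unicode".encode("ascii") == "Unicode".encode("utf-8")
print(same_bytes)  # True
```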
9. UTF-8 (continued)
• UTF-8 is used on UNIX platforms, in HTML
and by most Internet browsers
• Main benefits of UTF-8:
• Compact storage for
European scripts: in general they occupy
less space on disk and in memory
• Ease of migration: since 7-bit ASCII
data remains the same in UTF-8, the data
conversion effort between ASCII-based
character sets and UTF-8 is reduced
significantly.
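The migration point can be sketched as follows. The sample byte strings are made up for illustration, and Latin-1 is assumed as the ASCII-based legacy character set; real data may use a different one.

```python
# Pure 7-bit ASCII data is already valid UTF-8: no conversion is needed.
ascii_data = b"Plain ASCII text"
unchanged = ascii_data.decode("utf-8") == ascii_data.decode("ascii")

# Data in an ASCII-based 8-bit set (Latin-1 assumed here) must be transcoded,
# but only the bytes above 0x7F change.
latin1_data = b"caf\xe9"  # "café" in Latin-1
try:
    latin1_data.decode("utf-8")
    valid_as_utf8 = True
except UnicodeDecodeError:
    valid_as_utf8 = False

utf8_data = latin1_data.decode("latin-1").encode("utf-8")

print(unchanged)      # True
print(valid_as_utf8)  # False
print(utf8_data)      # b'caf\xc3\xa9'
```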
10. UTF-16
• UTF-16 is the 16-bit encoding of
Unicode
• Basically an extension of UCS-2
• One Unicode character can be 2 or 4
bytes in the encoding
• Characters from
European and most Asian scripts are
represented in 2 bytes
• Supplementary characters are
represented in 4 bytes (a surrogate pair)
• UTF-16 has been the main Unicode encoding
on Windows since Windows 2000
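The 2-byte and 4-byte cases can be illustrated in Python. The helper `to_surrogate_pair` is a hypothetical name introduced for this sketch; it applies the standard UTF-16 surrogate arithmetic, and the result is compared with Python's built-in `utf-16-be` codec.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into two 16-bit units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


for ch in ("A", "中", "😀"):
    encoded = ch.encode("utf-16-be")  # big-endian, no byte order mark
    print(f"U+{ord(ch):04X}: {len(encoded)} bytes, {encoded.hex()}")

high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00
```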
11. UTF-16 (continued)
• Main benefits of UTF-16:
• More compact storage for
Asian scripts (2 bytes for commonly used
characters)
• Ideal if European and Asian scripts are
used together
• Asian text occupies less storage on
disk and in memory in UTF-16 than in UTF-8
(which needs 3 bytes per character for the
Asian part)
• Balance of efficient
access to characters and economical
use of storage
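The storage trade-off can be measured with a short Python sketch. The helper `encoded_sizes` and the two sample strings are assumptions made for illustration; actual savings depend on the text being stored.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte count of `text` in UTF-8 and UTF-16 (no byte order mark)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


english = encoded_sizes("Hello")
japanese = encoded_sizes("日本語")

print(english)   # {'utf-8': 5, 'utf-16': 10}
print(japanese)  # {'utf-8': 9, 'utf-16': 6}
```

For mixed text the outcome depends on the proportion of each script, so it is worth measuring representative data before choosing.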
13. Unicode @ the Library
• Display all scripts and characters
• Record data in all languages
• Exchange bibliographic data
• Search in all languages …