3. World of Languages – Asian Countries
Source: Ethnologue: Languages of the World (the exact number of languages may never be determined exactly)
4. World of Languages – Asian region
(Half of the world’s languages are spoken in only eight countries)
5. World of Languages – Asian Countries
Country      Languages  Population     Official or national languages
Indonesia    742        245,452,739    Indonesian
India        427        1,095,351,995  Assamese, Bengali, Bodo, Dogri, English, Gujarati, Hindi, Kannada,
                                       Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Marwari,
                                       Nepali, Oriya, Panjabi, Sanskrit, Sindhi, Tamil, Telugu, Urdu
China        241        1,313,973,713  Chinese, Zhuang, Uighur, Hmong, Hani
Philippines  180        89,468,677     Filipino, English
Malaysia     147        24,385,858     Malay
Nepal        125        28,287,147     Nepali, Gurung, Tamang
Myanmar      109        47,382,633     Burmese
Vietnam      93         84,402,966     Vietnamese
Laos         82         6,368,481      Lao
Thailand     75         64,631,595     Thai
Iran         74         68,688,433     Arabic, Farsi
Pakistan     69         165,803,560    Urdu, Panjabi, Sindhi, English
Afghanistan  45         31,056,997     Dari, Pashto
Bangladesh   38         147,365,352    Bengali
Bhutan       24         2,279,723      Dzongkha
Iraq         23         26,783,383     Arabic, Kurdi
Cambodia     19         13,881,427     Khmer
Brunei       17         379,444        Malay, English
Mongolia     12         2,832,224      Halh Mongolian
Sri Lanka    8          20,222,240     Sinhala, Tamil, English
6. World of Languages – Script Diversity
Three major types of script in South, South East and East Asia:
In East Asia: Chinese ideographic scripts
In South Asia (around the Indian subcontinent) and part of South East Asia: scripts influenced by Brahmi
In part of South East Asia and Australasia: Roman script
Two major types of script in West and Central Asia:
In Central Asia: historically Arabic script, later replaced by Cyrillic
In Western Asia: Arabic script is widely used
One major type of script in Europe and the West:
Roman script
7. World of Languages – Script in Asia
Language                        Speakers      Native name
Chinese (Mandarin)              885,000,000   普通話
English                         322,000,000   English
Arabic (Alarabia)               280,000,000   العربية
Bengali                         196,000,000   বাংলা
Hindi                           182,000,000   हिन्दी
Portuguese (Português)          182,000,000   português
Indonesian                      140,000,000   Bahasa Indonesia
Japanese (Nihongo)              125,000,000   日本語
Korean (Hangugeo)               75,000,000    한국어 [韓國語]
Telugu                          73,000,000    తెలుగు
Vietnamese                      66,897,000    Tiếng Việt
Marathi                         64,783,000    मराठी
Tamil                           62,000,000    தமிழ்
Turkish (Türkçe)                59,000,000    Türkçe
Urdu                            54,000,000    اردو
Gujarati                        44,000,000    ગુજરાતી
Malayalam                       34,014,000    മലയാളം
Kannada                         33,663,000    ಕನ್ನಡ
Punjabi/Panjabi                 25,700,000    ਪੰਜਾਬੀ / پنجابی
Thai                            21,000,000    ภาษาไทย
Sindhi                          19,675,000    سنڌي
Uzbek (Cyrillic)                18,386,000    Ўзбек
Bahasa Melayu (Malay)           17,600,000    Bahasa Melayu
Nepali                          16,200,000    नेपाली
Filipino (Tagalog)              14,850,000    Tagalog
Assamese                        14,604,000    অসমীয়া
Azeri/Azerbaijani (Cyrillic)    13,869,000    Азәрбајҹан дили
Sinhala                         13,218,000    සිංහල
Zhuang                          10,000,000    Saw cuengh
Pashto/Pakhto                   9,585,000     پښتو
Kazakh                          8,000,000     Қазақ / قازاق
Uighur (Uyghur)                 7,464,000     Уйғур / ئۇيغۇر
Khmer                           7,063,200     ភាសាខ្មែរ
Dari                            7,000,000     دری
Tatar                           7,000,000     татарча / تاتارچا
Turkmen                         5,397,500     түркменче
Kashmiri                        4,381,000     कॉशुर / كٲشُر
Lao                             4,000,000     ພາສາລາວ
Balinese                        3,800,000     Bahasa Bali
Kyrgyz                          2,631,420     Кыргыз
Fijian                          650,000       vaka-Viti
Maldivian (Dhivehi)             280,000       ދިވެހި
Sanskrit                        194,433       संस्कृतम्
Tahitian                        150,000       Te Reo Tahiti
Maori                           70,000        Te Reo Māori
Hawaiian                        8,000         ʻŌlelo Hawaiʻi
9. Nature of Text
The most basic medium.
Easiest to generate, store and transfer on a PC.
Still the best for complex explanation,
using structured text/hypertext.
Lightweight
Smallest-sized medium
Static
Language dependent (the biggest problem)
10. Text – Digital Form
Input (creation)
Keyboard
Handwriting: handwriting recognition
Printed documents: optical character recognition (OCR)
Human voice: voice recognition
Digital form
Text data (character code)
ASCII: 7 bit (usually stored in an 8-bit byte)
Unicode: 21-bit code space (16 bit in its original design)
Universal Character Set (ISO/IEC 10646): 32 bit (UCS-4)
Output
Typeface: bitmap font, vector font
Voice: text-to-speech
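The character code sizes above can be checked with a short Python sketch. The sample characters and the helper name encoded_sizes are assumptions chosen for illustration; the codecs are the standard ones shipped with Python.

```python
# Sample characters (an assumption, picked to cover several scripts).
samples = ["A", "é", "ක", "語", "😀"]

def encoded_sizes(ch):
    """Return the number of bytes ch occupies in UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),  # -be avoids the byte order mark
        "utf-32": len(ch.encode("utf-32-be")),
    }

for ch in samples:
    # ASCII characters have code points below 128, so they fit in 7 bits.
    print(f"U+{ord(ch):04X}", "ASCII" if ord(ch) < 128 else "non-ASCII", encoded_sizes(ch))
```

Characters outside the Basic Multilingual Plane, such as the last sample, need two 16-bit units in UTF-16, which is why "Unicode: 16 bit" no longer holds.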
11. Indexing and Hypertext
Large Text Data
Indexing
Rapid random access/search method for Large Text Data.
Essential for reference type applications:
dictionary, encyclopedia, etc.
Hypertext
Non-sequential navigation structure for Large Text Data
Used in Web pages (HTML)
(Figure: a page of sample text with alphabetical index tabs: a, b, c, d, e; ad, am, bi, bot, by; adjust, adorn)
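As a minimal sketch of indexing for rapid search, the following builds a word index over a few short texts. The function name build_index and the sample documents are assumptions for illustration, not part of the original slides.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercase word to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word.strip(".,;:!?")].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items() if word}

# Hypothetical sample data.
documents = {1: "Adjust the index.", 2: "Adorn the page.", 3: "The index is rapid."}
index = build_index(documents)

# A lookup goes straight to the entry instead of scanning all the text.
print(index["index"])
print(index.get("hypertext", []))
```

The index is built once; after that each search is a dictionary lookup, which is what makes random access to large text practical.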
12. Hypertext, Hypermedia and Multimedia
(Figure: Venn diagram in which Hypermedia is the overlap of Multimedia and Hypertext)
A hypermedia system includes the non-
linear information links of hypertext
systems and the continuous and
discrete media of multimedia systems.
13. Typography
Until the end of the 14th century, all writing
was done by hand.
Typography: the design of the
characters that make up text and
display type, and the way they are
configured on the page.
Modern software allows:
Rotating or distorting type, wrapping text around
images.
14. Typography – Evolution of Asian Scripts
(Figure: evolution of Asian scripts. Timeline labels: 3rd century BC, 1st century, 3rd century, 6th century, 8th century, Pallava, 10th century, 12th century, modern)
Modern scripts shown: Kannada, Tamil, Sinhala, Devanagari, Gujarati, Bengali, Oriya, Telugu, Malayalam, Panjabi
18. Typography – 8 Bit English and Sinhala
1989 - SLASCII
Wadan Tharuwa SBIOS
19. The Code Page Problem
Characters in most languages are traditionally
represented by single-byte values
Allows for 256 characters max
Real limit for most encodings is 192 characters
This includes letters, digits, punctuation, symbols
When a system is used for a new language, the
encoding has to be adapted to use that
language’s characters
Encodings proliferate
Each language or group of languages gets its own
encoding
Different vendors or standards committees devise
different encodings, so generally each language has
several, often incompatible, encodings
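The code page problem can be shown in a few lines of Python: the same single byte stands for a different character under each encoding. The byte value and the three code pages are assumptions picked for illustration; the codec names are standard Python ones.

```python
def decode_in_code_pages(raw, codecs):
    """Decode the same bytes under several single-byte code pages."""
    return {codec: raw.decode(codec) for codec in codecs}

raw = bytes([0xE9])  # one byte value above the ASCII range
readings = decode_in_code_pages(raw, ["latin-1", "cp1251", "iso8859_7"])

for codec, text in readings.items():
    print(codec, text)
```

Western European, Cyrillic and Greek code pages each give this byte a different letter, so the bytes alone do not say which text was meant.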
20. Multi-byte encodings
Some languages (Chinese, Japanese, Korean,
etc.) have more than 256 characters
Encoding standards for these languages use
sequences of bytes for many characters
In many standards, not all characters are the same
number of bytes
Can’t tell whether a given byte is a whole character
or part of a character
Corruption of one byte can corrupt the whole data
stream
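A sketch of the ambiguity described above, using Shift-JIS as one example of a multi-byte encoding (the choice of encoding and sample text is an assumption). The second byte of the katakana character has the same value as an ASCII backslash.

```python
def hex_bytes(data):
    """Format bytes as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in data)

text = "ソA"                      # katakana SO followed by Latin A; no backslash
data = text.encode("shift_jis")   # two bytes for SO, one byte for A
print(hex_bytes(data))

# A byte-level search finds 0x5C (ASCII backslash) inside the two-byte character.
false_hit = b"\\" in data and "\\" not in text
print(false_hit)

# Losing the first byte makes the decoder read the trail byte as a character of its own.
damaged = data[1:].decode("shift_jis")
print(len(text), len(damaged), damaged != text)
```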
22. Interoperability problems
Can’t easily mix languages in a document or
system
Data not tagged with encoding, so loss can
occur when transferring between systems
Most encodings are ASCII-based, so problems
often not seen with English-only data
Two possible solutions:
Systematic tagging of textual data with encoding
ID
Universal encoding standard with all languages’
characters
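A small sketch of the first solution, tagging data with its encoding. The record layout and the helper names tag and untag are hypothetical; the point is only that the receiver stops guessing.

```python
def tag(text, encoding):
    """Bundle encoded bytes with the name of the encoding that produced them."""
    return {"encoding": encoding, "data": text.encode(encoding)}

def untag(record):
    """Decode a tagged record with the encoding it declares."""
    return record["data"].decode(record["encoding"])

original = "café"
untagged = original.encode("utf-8")

misread = untagged.decode("latin-1")        # receiver guessed the wrong encoding
roundtrip = untag(tag(original, "utf-8"))   # receiver reads the tag instead

print(misread)
print(roundtrip)
```

The ASCII letters survive the wrong guess, which is why such problems often go unseen with English-only data.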
27. Encoding space
(Figure: a code point value divided into three fields)
Plane number   Row number   Character number
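The three fields can be extracted with shifts and masks. This sketch assumes the usual layout of 8 bits each for the row and character numbers, with the plane number in the bits above them; the function name is hypothetical.

```python
def split_code_point(cp):
    """Split a code point into (plane, row, character) numbers."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode encoding space")
    return cp >> 16, (cp >> 8) & 0xFF, cp & 0xFF

for cp in (0x0041, 0x0D85, 0x1F600):
    plane, row, char = split_code_point(cp)
    print(f"U+{cp:04X}: plane {plane:X}, row {row:02X}, character {char:02X}")
```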
28. Unicode
21-bit encoding space allows for 1,114,112
characters
95,156 code point values assigned to
characters in Unicode 3.2
137,216 code point values set aside for
application use
2,114 code point values set aside for non-
character use
879,626 code point values reserved for future
character assignments
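The figures above can be checked with simple arithmetic. The sketch assumes the standard layout of 17 planes of 65,536 code points each; the four allocation figures are the ones quoted on this slide for Unicode 3.2.

```python
PLANES = 17                      # planes 0x00 to 0x10
CODE_POINTS_PER_PLANE = 0x10000  # 65,536

total = PLANES * CODE_POINTS_PER_PLANE

# Figures quoted on the slide for Unicode 3.2.
allocation = {
    "assigned to characters": 95_156,
    "application (private) use": 137_216,
    "non-character use": 2_114,
    "reserved for future assignment": 879_626,
}

print(f"{total:,}")
print(sum(allocation.values()) == total)
print((0x10FFFF).bit_length())   # bits needed for the highest code point
```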
29. The Unicode Encoding Space
(Figure: the 17 planes of the encoding space, numbered 0 to 10 in hexadecimal; plane 0 is the Basic Multilingual Plane)
30. The Unicode Encoding Space
(Figure: the 17 planes of the encoding space; planes 1 to 10 in hexadecimal are the Supplementary Planes)
31. The Unicode Encoding Space
(Figure: the 17 planes of the encoding space; plane 1 is the Supplementary Multilingual Plane, plane 2 the Supplementary Ideographic Plane and plane E the Supplementary Special-Purpose Plane)
32. The Unicode Encoding Space
(Figure: the 17 planes of the encoding space; planes F and 10 in hexadecimal are the Private Use Planes)
33. The Unicode Encoding Space
(Figure: the 17 planes of the encoding space, with plane 0, the Basic Multilingual Plane, highlighted)
34. The Basic Multilingual Plane
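(Figure: layout of the Basic Multilingual Plane, by leading hexadecimal digit of the code point)
0-1   General Scripts Area
2     Symbols Area, CJK Punct.
3     CJK Punct., Han
4-9   Han
A     Yi, Hangul
B-C   Hangul
D     Hangul, Surrogates Area
E     Private Use Area
F     Private Use Area, Compatibility Area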
35. The General Scripts Area
00/01 Latin
02/03 IPA Diacriticals Greek
04/05 Cyrillic Armenian Hebrew
06/07 Arabic Syriac Thaana
08/09 Devanagari Bengali
0A/0B Gurmukhi Gujarati Oriya Tamil
0C/0D Telugu Kannada Malayalam Sinhala
0E/0F Thai Lao Tibetan
10/11 Myanmar Georgian Hangul
12/13 Ethiopic Cherokee
14/15 Canadian Aboriginal Syllabics
16/17 Ogham Runic Philippine Khmer
18/19 Mongolian
1A/1B
1C/1D
1E/1F Latin Greek
36. Unicode Coverage
European scripts
Latin, Greek, Cyrillic, Armenian, Georgian, IPA
Bidirectional (Middle Eastern) scripts
Hebrew, Arabic, Syriac, Thaana
Indic (Indian and Southeast Asian) scripts
Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil,
Telugu, Kannada, Malayalam, Sinhala, Thai, Lao,
Khmer, Myanmar, Tibetan, Philippine
East Asian scripts
Chinese (Han) characters, Japanese (Hiragana and
Katakana), Korean (Hangul), Yi
Other modern scripts
Mongolian, Ethiopic, Cherokee, Canadian Aboriginal
Historical scripts
Runic, Ogham, Old Italic, Gothic, Deseret
Punctuation and symbols
Numerals, math symbols, scientific symbols, arrows,
blocks, geometric shapes, Braille, musical notation, etc.
37. Characters, Glyphs, and Fonts
In computer terms, a character is a
grouping of bits (binary ones and
zeros) in packages of 8: one or more
bytes
There are two broad classes of
characters: data characters and
control characters
38. Characters, Glyphs, and Fonts
A – Arial
A - Times New Roman
A - Courier new
A – Giddyup Standard
A - Bodoni
A - Papyrus
A - Forte
39. Characters, Glyphs, and Fonts
You can run out of available characters pretty
quick if you allow all those strange foreign,
mathematical, scientific, engineering, currency,
and other symbols
(Informal Roman)
40. Unicode properties
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
Representative glyph: A
Code point: 0041
Name: LATIN CAPITAL LETTER A
Semantic properties:
General category: Uppercase letter (Lu)
Canonical combining class: Standard spacing (0)
Bidirectional category: Left-to-right (L)
Mirrored: no (N)
Lowercase mapping: 0061
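The same properties can be read from Python's bundled copy of the Unicode Character Database. This is a sketch: the helper name describe is hypothetical, the database version depends on the Python release, and the field positions used for the raw record assume the standard UnicodeData.txt layout.

```python
import unicodedata

def describe(ch):
    """Collect the properties listed above from Python's unicodedata module."""
    return {
        "code point": f"{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "general category": unicodedata.category(ch),
        "combining class": unicodedata.combining(ch),
        "bidirectional category": unicodedata.bidirectional(ch),
        "mirrored": bool(unicodedata.mirrored(ch)),
        "lowercase mapping": " ".join(f"{ord(c):04X}" for c in ch.lower()),
    }

# The raw database record from the slide, split on its semicolons.
record = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
fields = record.split(";")

info = describe("A")
for key, value in info.items():
    print(f"{key}: {value}")
```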
43. Combining characters
Actually, either.
Unicode is generative, with accent marks represented
with their own code point values…
é = U+0065 (e) U+0301 (accent)
…but common combinations of letters and accents are
also given their own code points for convenience.
é = U+00E9
44. Combining characters
This can be tough, because the two representations are
to be treated as absolutely identical.
é = é
U+0065 U+0301 = U+00E9
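A sketch of why this is tough in practice: a plain string comparison sees two different code point sequences, so both sides have to be normalized first. The helper name canonically_equal is hypothetical; unicodedata.normalize is the standard Python call.

```python
import unicodedata

decomposed = "\u0065\u0301"   # e followed by combining acute accent
precomposed = "\u00e9"        # e with acute as a single code point

def canonically_equal(a, b):
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(decomposed == precomposed)                    # raw comparison
print(canonically_equal(decomposed, precomposed))   # comparison after normalization
print(len(decomposed), len(precomposed))
```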
45. Combining characters
Things can get really wild for characters with more
than one accent mark:
ộ = 006F (o) 0302 (circumflex) 0323 (dot)
ộ = 006F (o) 0323 (dot) 0302 (circumflex)
ộ = 00F4 (o-circumflex) 0323 (dot)
ộ = 1ECD (o-dot) 0302 (circumflex)
ộ = 1ED9 (o-circumflex-dot)
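The five spellings above can be fed to a normalizer to confirm that they collapse to one form. This sketch uses Python's unicodedata module; the variable names are illustrative.

```python
import unicodedata

forms = [
    "\u006f\u0302\u0323",  # o, circumflex, dot below
    "\u006f\u0323\u0302",  # o, dot below, circumflex
    "\u00f4\u0323",        # o-circumflex, dot below
    "\u1ecd\u0302",        # o-dot-below, circumflex
    "\u1ed9",              # o-circumflex-dot-below
]

def code_points(text):
    """Show a string as a list of hexadecimal code points."""
    return " ".join(f"{ord(c):04X}" for c in text)

# Fully decomposed (NFD) and fully composed (NFC) forms of every spelling.
nfd = {unicodedata.normalize("NFD", s) for s in forms}
nfc = {unicodedata.normalize("NFC", s) for s in forms}

print([code_points(s) for s in nfd])
print([code_points(s) for s in nfc])
```

In the decomposed form the marks are put into one canonical order (dot below before circumflex), which is what lets the first two spellings compare equal.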
49. Smart rendering: Tamil
Ur r y N m k j
Keyboard   Code points   Screen
Ur         0B8A 0BB0     ஊர
rU         0BB0 0BC2     ரூ
yU         0BAF 0BC2     யூ
NU         0BA3 0BC2     ணூ
mU         0BAE 0BC2     மூ
kU         0B95 0BC2     கூ
jU         0B9C 0BC2     ஜூ
51. Canonical equivalence
01FA
LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE
212B 0301
ANGSTROM SIGN
COMBINING ACUTE ACCENT
00C5 0301
LATIN CAPITAL LETTER A WITH RING ABOVE
COMBINING ACUTE ACCENT
0041 030A 0301
LATIN CAPITAL LETTER A
COMBINING RING ABOVE
COMBINING ACUTE ACCENT
52. Case mapping
Case mapping may produce strings of different length
01F0 → 004A 030C
Case mapping may depend on the locale
English: 0069 → 0049
Turkish/Azeri: 0069 → 0130
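Both points can be tried in Python. The built-in str.upper applies the locale-independent mappings, so the Turkish/Azeri rule is sketched here with a hypothetical helper, upper_turkic, that handles only the dotted and dotless i; it is not a full locale-aware implementation.

```python
j_caron = "\u01f0"            # LATIN SMALL LETTER J WITH CARON
upper = j_caron.upper()       # no single uppercase code point exists
print(len(j_caron), len(upper), [f"{ord(c):04X}" for c in upper])

def upper_turkic(text):
    """Hypothetical helper: uppercase using the Turkish/Azeri rule for i."""
    # Dotted i maps to dotted capital I (U+0130); dotless i (U+0131) maps to I.
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()

print("i".upper(), upper_turkic("i"))
print(upper_turkic("istanbul"))
```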
55. Typography - Complex Ligature
ttha in Devanagari ttha in Tamil Tva in Malayalam Tva in Sinhala
56. Typography - Complex Ligature
U+200C UTF8 E2 80 8C U+200D UTF8 E2 80 8D
Tva with ZWNJ in Malayalam Tva with ZWJ in Malayalam
Tva with ZWNJ in Sinhala Tva with ZWJ in Sinhala
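The UTF-8 byte values quoted above can be verified directly. The Sinhala sequences below are an assumption about how "tva" is typed (dental TA, AL-LAKUNA, VA, with the joiner placed after AL-LAKUNA); how each one is drawn depends on the font and rendering engine, so only the code points are shown.

```python
ZWNJ = "\u200c"   # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"    # ZERO WIDTH JOINER

def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))

print("U+200C:", utf8_hex(ZWNJ))
print("U+200D:", utf8_hex(ZWJ))

# Assumed spelling of Sinhala "tva": TA + AL-LAKUNA + VA.
TA, AL_LAKUNA, VA = "\u0dad", "\u0dca", "\u0dc0"
plain = TA + AL_LAKUNA + VA
with_zwnj = TA + AL_LAKUNA + ZWNJ + VA
with_zwj = TA + AL_LAKUNA + ZWJ + VA

for label, s in (("plain", plain), ("ZWNJ", with_zwnj), ("ZWJ", with_zwj)):
    print(label, [f"{ord(c):04X}" for c in s])
```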
58. Typography - Complex Ligature
Preventing Conjunct Forms in Devanagari
Half-Consonants in Devanagari
59. Typography - Complex Ligature
Buddha in Sinhala
60. Typography - Complex Ligature in DB
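<html>
<head>
<meta charset="utf-8">
<title>සිංහල</title></head>
<body>
<?php
// Assumption: connection.php opens a mysqli connection and stores it in $conn.
// mysqli is used because the mysql_* functions no longer exist in PHP 7 and later.
include("connection.php"); // simple connection setting
mysqli_set_charset($conn, "utf8"); // the main trick: talk to the server in UTF-8
$cmd = "SELECT data FROM sinhala";
$result = mysqli_query($conn, $cmd);
while ($myrow = mysqli_fetch_row($result))
{
    echo htmlspecialchars($myrow[0]); // escape before writing into the page
}
?>
</body>
</html>
-- The dump for my database storing Sinhala UTF-8 strings is
CREATE TABLE `sinhala` (
`data` varchar(1000) character set utf8 collate utf8_bin default NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `sinhala` VALUES ('අම්මා');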
61. Typography
Typical typefaces (fonts) and type styles used
in word processors
Typefaces
Serif typefaces: Times New Roman, Courier, Palatino
Sans serif typefaces: Arial, Impact, Arial Narrow
Special typefaces: Symbol, Freehand
Crazy fonts can be distracting!
Type styles: Bold, Italics, Outline
62. Typography
Special effects
Kerning increases or decreases the spacing
between certain pairs of letters to improve
their appearance.
Line spacing or leading
Orientation
Anti-aliasing: smooths out the edges of text. This
makes the edges of the text blend into the
background so that the text is cleaner and
more readable when it is large.
69. Typography
Special effects cont..
Converting text to path :
Text converted to paths retains all
of its visual attributes, but you
can edit it only as paths.
70. Typography
Bitmap font
Vector font
TrueType: fast, standard, for computer screen and printer
Adobe Type 1: precise, professional, used for publishing
(Figure: screen from "Fontographer")
Small font rendering: normal, anti-aliased, optimized for LCD screen (ClearType etc.)
71. Text- Cross-media Technology
Voice Recognition
Converts voice (sound data) to text data
Needs real-time processing
Specific speaker/non-specific speaker
Text-to-Speech (Speech Synthesis)
Computer reads text data aloud
Automatic information services/reading out
new mail.
72. Text- Cross-media Technology cont…
Optical Character Recognition
Converts text bitmap image to real text
data
Used with image scanner
Handwriting Recognition
Similar to OCR, but uses writing
order/direction for better recognition.
Used in PIM (Personal Information
Manager) devices (palmtop computers).
73. Text- Cross-media Technology cont…
Machine Translation
All text based techniques are language
dependent
Needs automatic translation
Vertical Market – Technical document translation
Personal Market – Web browsing
Combination of media technology
Automatically translate international telephone
messages.
Japanese voice → voice recognition → Japanese text data
Japanese text data → machine translation → English text data
English text data → speech synthesis → English voice
74. File Format
.TXT - (unformatted text, e.g. Notepad)
.DOC - (developed by Microsoft, e.g. MS-Word)
.RTF - (Rich Text Format)
.PDF - (Portable Document Format) - Adobe
.PS - (PostScript) - page description language, used mainly for desktop publishing