Multimedia Technology - text

Multimedia Technology
Text

S T Nandasara
ADMTC/UCSC

1

World of Languages

2

World of Languages – Asian Countries

Source: Ethnologue, Languages of the World (the exact number of languages may never be determined)

3

World of Languages – Asian region

(Half of the world’s languages are spoken in only eight countries)

4

World of Languages – Asian Countries
Country       Number of Languages   Population       Official or National Languages
Indonesia     742                   245,452,739      Indonesian
India         427                   1,095,351,995    Assamese, Bengali, Bodo, Dogri, English, Gujarati, Hindi, Kannada,
                                                     Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Marwari,
                                                     Nepali, Oriya, Panjabi, Sanskrit, Sindhi, Tamil, Telugu, Urdu
China         241                   1,313,973,713    Chinese, Zhuang, Uighur, Hmong, Hani
Philippines   180                   89,468,677       Filipino, English
Malaysia      147                   24,385,858       Malay
Nepal         125                   28,287,147       Nepali, Gurung, Tamang
Myanmar       109                   47,382,633       Burmese
Vietnam       93                    84,402,966       Vietnamese
Laos          82                    6,368,481        Lao
Thailand      75                    64,631,595       Thai
Iran          74                    68,688,433       Arabic, Farsi
Pakistan      69                    165,803,560      Urdu, Panjabi, Sindhi, English
Afghanistan   45                    31,056,997       Dari, Pashto
Bangladesh    38                    147,365,352      Bengali
Bhutan        24                    2,279,723        Dzongkha
Iraq          23                    26,783,383       Arabic, Kurdi
Cambodia      19                    13,881,427       Khmer
Brunei        17                    379,444          Malay, English
Mongolia      12                    2,832,224        Halh Mongolian
Sri Lanka     8                     20,222,240       Sinhala, Tamil, English

5

World of Languages – Script Diversity

 Three major types of script in South, South East & East Asia
 In East Asia: Chinese ideographic scripts
 In South Asia, around the Indian subcontinent, and in part of South East Asia: scripts influenced by Brahmi
 In part of South East Asia and Australasia: Roman script
 Two major types of script in West & Central Asia
 In Central Asia: historically Arabic, later replaced by Cyrillic
 In Western Asia: Arabic script is widely used
 One major type of script in Europe and the West
 Roman script

6

World of Languages – Script in Asia
Language                        Speakers       Name in native script
Chinese (Mandarin)              885,000,000    普通話
English                         322,000,000    English
Arabic (Alarabia)               280,000,000    العربية
Bengali                         196,000,000    বাংলা
Hindi                           182,000,000    हिन्दी
Portuguese (Português)          182,000,000    português
Indonesian                      140,000,000    Bahasa Indonesia
Japanese (Nihongo)              125,000,000    日本語
Korean (Hangugeo)               75,000,000     한국어 [韓國語]
Telugu                          73,000,000     తెలుగు
Vietnamese                      66,897,000     Tiếng Việt
Marathi                         64,783,000     मराठी
Tamil                           62,000,000     தமிழ்
Turkish (Türkçe)                59,000,000     Türkçe
Urdu                            54,000,000     اردو
Gujarati                        44,000,000     ગુજરાતી
Malayalam                       34,014,000     മലയാളം
Kannada                         33,663,000     ಕನ್ನಡ
Punjabi/Panjabi                 25,700,000     ਪੰਜਾਬੀ / پنجابی
Thai                            21,000,000     ภาษาไทย
Sindhi                          19,675,000     سنڌي
Uzbek (Cyrillic)                18,386,000     Ўзбек
Bahasa Melayu (Malay)           17,600,000     Bahasa Melayu
Nepali                          16,200,000     नेपाली
Filipino (Tagalog)              14,850,000     Tagalog
Assamese                        14,604,000     অসমীয়া
Azeri/Azerbaijani (Cyrillic)    13,869,000     Азәрбајҹан дили
Sinhala                         13,218,000     සිංහල
Zhuang                          10,000,000     Saw cuengh
Pashto/Pakhto                   9,585,000      پښتو
Kazakh                          8,000,000      Қазақ / قازاق
Uighur (Uyghur)                 7,464,000      Уйғур / ئۇيغۇر
Khmer                           7,063,200      ភាសាខ្មែរ
Dari                            7,000,000      دری
Tatar                           7,000,000      татарча / تاتارچا
Turkmen                         5,397,500      түркменче
Kashmiri                        4,381,000      काऽशुर / كٲشُر
Lao                             4,000,000      ພາສາລາວ
Balinese                        3,800,000      Bahasa Bali
Kyrgyz                          2,631,420      Кыргыз
Fijian                          650,000        vaka-Viti
Maldivian Dhivehi               280,000        ދިވެހި
Sanskrit                        194,433        संस्कृतम्
Tahitian                        150,000        Te Reo Tahiti
Maori                           70,000         Te Reo Māori
Hawaiian                        8,000          ʻŌlelo Hawaiʻi
7

World of Languages – Script in Asia

8

Nature of Text
 The most basic medium.
 Easiest to generate, store and transfer on a PC.
 Still the best for complex explanation.
 Using structured text/hypertext
 Lightweight
 Smallest-sized medium
 Static
 Language dependent (biggest problem)
9

Text – Digital Form
Input
 Creation: Keyboard
 Handwriting: Handwriting Recognition
 Printed Documents: Optical Character Recognition (OCR)
 Human Voice: Voice Recognition
Digital Form
 Text Data (character code)
 ASCII: 8 bit (a 7-bit code stored in one byte)
 Unicode: 16 bit (early versions)
 Universal Character Set: 32 bit
Output
 Typeface: Bitmap font, Vector font
 Voice: Text-to-Speech
10

Indexing and Hypertext
Large Text Data
 Indexing
 Rapid random access/search method for Large Text Data.
 Essential for reference type applications
Dictionary, Encyclopedia, etc.
 Hypertext
 Non-sequential navigation structure for Large Text Data
 Used in Web pages (HTML)
[Figure: a page of running text beside an alphabetical index (tabs a b c d e; entries such as ad, adjust, adorn, am, bi, bot, by)]
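As a minimal sketch of the indexing idea on this slide, the Python snippet below builds an inverted index (word to entry numbers) so that a lookup is a direct dictionary access rather than a scan of the whole text. The sample entries and the names `build_index` and `lookup` are illustrative assumptions, not part of the original slides.

```python
from collections import defaultdict

def build_index(entries):
    """Map each lowercase word to the sorted list of entry numbers containing it."""
    index = defaultdict(set)
    for number, text in enumerate(entries):
        for word in text.lower().split():
            index[word.strip(".,;:!?")].add(number)
    return {word: sorted(numbers) for word, numbers in index.items()}

def lookup(index, word):
    """Return the entry numbers for a word without scanning the entries."""
    return index.get(word.lower(), [])

# Hypothetical dictionary-style entries, used only for illustration.
entries = [
    "adjust: to change something slightly",
    "adorn: to decorate something",
    "bot: a program that performs automated tasks",
]
index = build_index(entries)
print(lookup(index, "something"))
```

The index costs extra storage, which is the usual trade for rapid random access in reference applications such as dictionaries.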
11

Hypertext, Hypermedia and Multimedia

[Figure: Venn diagram of Multimedia and Hypertext, with Hypermedia as their overlap]

A hypermedia system includes the non-linear information links of hypertext
systems and the continuous and discrete media of multimedia systems.

12

Typography
 Until the end of the 14th century, all writing
was done by hand.
 Typography: the design of the
characters that make up text and
display type, and the way they are
configured on the page.
 Modern software allows:
 Rotating or distorting type, wrapping text around
images

13

Typography – Evolution of Asian Scripts
[Figure: chart tracing letter shapes through the 3rd century BC, 1st century, 3rd century, 6th century, 8th century (Pallava), 10th century and 12th century to the modern scripts: Kannada, Tamil, Sinhala, Devanagari, Gujarati, Bengali, Oriya, Telugu, Malayalam and Panjabi]

14

Typography – Complex Scripts

Bengali, Devanagari, Gujarati
Kannada, Malayalam, Telugu
Sinhala, Tamil, Ranjana
Gurmukhi, Oriya, Tibetan
Khmer, Lao, Thai
Jawani, Thaana, Bagini
Sanskrit

15

Typography - Complex Vowels

16

Typography – ASCII & EBCDIC

ASCII EBCDIC

17

Typography – 8 Bit English and Sinhala

1989 - SLASCII

Wadan Tharuwa SBIOS

18

The Code Page Problem
 Characters in most languages are traditionally
represented by single-byte values
 Allows for 256 characters max
 Real limit for most encodings is 192 characters
 This includes letters, digits, punctuation, symbols
 When a system is used for a new language, the
encoding has to be adapted to use that
language’s characters
 Encodings proliferate
 Each language or group of languages gets its own
encoding
 Different vendors or standards committees devise
different encodings, so generally each language has
several, often incompatible, encodings
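To make the code page problem concrete, here is a small Python sketch that decodes one and the same byte under three single-byte encodings. The choice of byte (0xE9) and of the three code pages is an illustrative assumption; the codec names are Python's standard ones.

```python
# One byte, three meanings: the value alone does not say which character it is.
raw = bytes([0xE9])

western = raw.decode("cp1252")     # Windows Western European
cyrillic = raw.decode("cp1251")    # Windows Cyrillic
greek = raw.decode("iso8859_7")    # ISO Greek

for label, char in (("cp1252", western), ("cp1251", cyrillic), ("iso8859_7", greek)):
    # Show the character and the Unicode code point it maps to.
    print(f"{label}: {char} (U+{ord(char):04X})")
```

Without knowing which code page produced the byte, a receiving system can only guess among these readings.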

19

Multi-byte encodings

 Some languages (Chinese, Japanese, Korean,
etc.) have more than 256 characters
 Encoding standards for these languages use
sequences of bytes for many characters
 In many standards, not all characters are the same
number of bytes
 Can’t tell whether a given byte is a whole character
or part of a character
 Corruption of one byte can corrupt the whole data
stream
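The "whole character or part of a character" problem can be seen with Shift-JIS, one of the Japanese multi-byte encodings. The sketch below is illustrative only; the sample character is an assumption chosen because its second byte happens to equal an ASCII code.

```python
text = "表"  # a common kanji, chosen for illustration
encoded = text.encode("shift_jis")
print(encoded.hex(" "))  # two bytes for one character

# The second byte has the same value as the ASCII backslash (0x5C),
# so a byte-level search finds a "backslash" that is not really there.
false_hit = b"\\" in encoded

# Searching the decoded characters instead gives the correct answer.
true_hit = "\\" in encoded.decode("shift_jis")

print(false_hit, true_hit)
```

Software that scans such data byte by byte (for path separators, escape characters and so on) has to track character boundaries, which is exactly the difficulty the slide describes.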

20

Interoperability problems

 Can’t easily mix languages in a document or
system
 Data not tagged with encoding, so loss can
occur when transferring between systems
 Most encodings are ASCII-based, so problems
often not seen with English-only data
 Two possible solutions:
 Systematic tagging of textual data with encoding
ID
 Universal encoding standard with all languages’
characters
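A short Python sketch of both the loss and the first proposed solution: untagged bytes are misread by a receiver that assumes a different encoding, while bytes that travel with an encoding ID are recovered. The `(encoding, payload)` tuple is a made-up illustration, not a standard format.

```python
original = "Türkçe"

# Untagged transfer: the sender writes UTF-8, the receiver assumes Windows-1252.
wire_bytes = original.encode("utf-8")
garbled = wire_bytes.decode("cp1252")
print(garbled)

# Tagged transfer: the encoding ID travels with the bytes (illustrative format).
tagged = ("utf-8", wire_bytes)
encoding_id, payload = tagged
recovered = payload.decode(encoding_id)
print(recovered)

# ASCII-only data survives the wrong guess, which is why the problem
# often goes unnoticed with English-only text.
ascii_ok = "English".encode("utf-8").decode("cp1252")
```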
22

Encoding space

An ASCII character is 7 bits wide

23

Encoding space

Most encodings press the eighth bit into service

24

Encoding space

Early versions of Unicode used 16 bits

25

Encoding space

Unicode now uses 21 bits

26

Encoding space

[Figure: a code point divided into three fields: plane number, row number, character number]
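As a sketch of the plane/row/character split, the function below separates a code point into those three fields with bit shifts. The function name `split_code_point` is an assumption for illustration; the field widths follow from 17 planes of 256 rows of 256 characters.

```python
def split_code_point(code_point):
    """Split a code point into (plane, row, character) numbers."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode encoding space")
    plane = code_point >> 16           # 0x00 .. 0x10
    row = (code_point >> 8) & 0xFF     # 0x00 .. 0xFF within the plane
    character = code_point & 0xFF      # 0x00 .. 0xFF within the row
    return plane, row, character

for cp in (0x0041, 0x0D85, 0x10346):
    plane, row, character = split_code_point(cp)
    print(f"U+{cp:04X}: plane {plane:X}, row {row:02X}, character {character:02X}")
```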

27

Unicode
 The encoding space (17 planes of 65,536 code points each,
addressed with 21 bits) allows for 1,114,112 characters
 95,156 code point values assigned to
characters in Unicode 3.2
 137,216 code point values set aside for
application use
 2,114 code point values set aside for non-
character use
 879,626 code point values reserved for future
character assignments
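The figures on this slide can be checked with a few lines of arithmetic. The sketch below only re-adds the numbers quoted above; the variable names are illustrative.

```python
PLANES = 17             # planes 0x00 through 0x10
PER_PLANE = 0x10000     # 65,536 code points per plane
total = PLANES * PER_PLANE

# Figures quoted on the slide for Unicode 3.2.
assigned = 95_156
private_use = 137_216
non_character = 2_114
reserved = 879_626
breakdown = assigned + private_use + non_character + reserved

print(f"{total:,}")          # size of the encoding space
print(breakdown == total)    # the four categories account for every code point
print(f"{2**21:,}")          # 21 bits could address more values than are used
```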

28

The Unicode Encoding Space

[Figure: the 17 planes, numbered 0 to 10 (hexadecimal); plane 0 is the Basic Multilingual Plane]

29


[Figure: planes 1 to 10 (hexadecimal) are the Supplementary Planes]

30


[Figure: plane 1 is the Supplementary Multilingual Plane, plane 2 the Supplementary Ideographic Plane, plane E the Supplementary Special-Purpose Plane]

31


[Figure: planes F and 10 (hexadecimal) are the Private Use Planes]

32


[Figure: plane 0, the Basic Multilingual Plane]

33

The Basic Multilingual Plane
First hex digit of code point: area
0-1  General Scripts Area
2    Symbols Area
2-3  CJK Punct.
3-9  Han
A    Yi
A-D  Hangul
D    Surrogates Area
E-F  Private Use Area
F    Compatibility Area
34

The General Scripts Area
00/01 Latin
02/03 IPA Diacriticals Greek
04/05 Cyrillic Armenian Hebrew
06/07 Arabic Syriac Thaana
08/09 Devanagari Bengali
0A/0B Gurmukhi Gujarati Oriya Tamil
0C/0D Telugu Kannada Malayalam Sinhala
0E/0F Thai Lao Tibetan
10/11 Myanmar Georgian Hangul
12/13 Ethiopic Cherokee
14/15 Canadian Aboriginal Syllabics
16/17 Ogham Runic Philippine Khmer
18/19 Mongolian
1A/1B
1C/1D
1E/1F Latin Greek
35

Unicode Coverage
 European scripts
 Latin, Greek, Cyrillic, Armenian, Georgian, IPA
 Bidirectional (Middle Eastern) scripts
 Hebrew, Arabic, Syriac, Thaana
 Indic (Indian and Southeast Asian) scripts
 Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil,
Telugu, Kannada, Malayalam, Sinhala, Thai, Lao,
Khmer, Myanmar, Tibetan, Philippine
 East Asian scripts
 Chinese (Han) characters, Japanese (Hiragana and
Katakana), Korean (Hangul), Yi
 Other modern scripts
 Mongolian, Ethiopic, Cherokee, Canadian Aboriginal
 Historical scripts
 Runic, Ogham, Old Italic, Gothic, Deseret
 Punctuation and symbols
 Numerals, math symbols, scientific symbols, arrows,
blocks, geometric shapes, Braille, musical notation, etc.
36

Characters, Glyphs, and Fonts

 In computer terms, a character is a
grouping of bits (binary ones and
zeros) in packages of 8: one or more
bytes
 There are two broad classes of
characters: data characters and
control characters

37


A - Arial
A - Times New Roman
A - Courier New
A - Giddyup Standard
A - Bodoni
A - Papyrus
A - Forte
38


 You can run out of available characters pretty
quickly if you allow all those strange foreign,
mathematical, scientific, engineering, currency,
and other symbols

(Informal Roman)

39

Unicode properties

0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;

Representative glyph: A
Code point: 0041
Name: LATIN CAPITAL LETTER A
Semantic properties:
General category: Uppercase letter (Lu)
Canonical combining class: Standard spacing (0)
Bidirectional category: Left-to-right (L)
Mirrored: no (N)
Lowercase mapping: 0061
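The record shown on this slide is a semicolon-separated line in the style of the Unicode character database. As a rough sketch, the Python below splits that line into fields and cross-checks them against the standard `unicodedata` module. The field positions used (0, 1, 2, 3, 4, 9, 13) are read off the sample line; the dictionary keys are illustrative names.

```python
import unicodedata

record = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
fields = record.split(";")

char = chr(int(fields[0], 16))
parsed = {
    "name": fields[1],
    "general_category": fields[2],
    "combining_class": int(fields[3]),
    "bidi_category": fields[4],
    "mirrored": fields[9] == "Y",
    "lowercase": chr(int(fields[13], 16)),
}

# The same properties as reported by the database bundled with Python.
from_library = {
    "name": unicodedata.name(char),
    "general_category": unicodedata.category(char),
    "combining_class": unicodedata.combining(char),
    "bidi_category": unicodedata.bidirectional(char),
    "mirrored": bool(unicodedata.mirrored(char)),
    "lowercase": char.lower(),
}

print(parsed == from_library)
```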
40

Combining characters

One character…

41


…or two?

42

Actually, either.
Unicode is generative, with accent marks represented
with their own code point values…

é = U+0065 (e) U+0301 (combining acute accent)

…but common combinations of letters and accents are
also given their own code points for convenience.

é = U+00E9

43


This can be tough, because the two representations are
to be treated as absolutely identical.

é: U+0065 U+0301 = U+00E9
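In practice, "treated as identical" usually means normalizing both strings before comparing them. The sketch below uses Python's standard `unicodedata.normalize`; the helper name `canonically_equal` is an assumption for illustration.

```python
import unicodedata

decomposed = "e\u0301"    # U+0065 U+0301
precomposed = "\u00e9"    # U+00E9

# A plain comparison looks at code points, so the two spellings differ.
naive_equal = decomposed == precomposed

def canonically_equal(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(naive_equal, len(decomposed), len(precomposed))
print(canonically_equal(decomposed, precomposed))
```

Normalizing to NFD instead would work equally well for the comparison; what matters is that both sides use the same form.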

44

Things can get really wild for characters with more
than one accent mark:

ộ = 006F (o) 0302 (circumflex) 0323 (dot below)
ộ = 006F (o) 0323 (dot below) 0302 (circumflex)
ộ = 00F4 (o-circumflex) 0323 (dot below)
ộ = 1ECD (o-dot-below) 0302 (circumflex)
ộ = 1ED9 (o-circumflex-dot-below)
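The five spellings listed here collapse to a single form once the marks are put in canonical order, which is determined by each mark's combining class. A small sketch with Python's `unicodedata` module (variable names are illustrative):

```python
import unicodedata

spellings = [
    "\u006f\u0302\u0323",   # o, circumflex, dot below
    "\u006f\u0323\u0302",   # o, dot below, circumflex
    "\u00f4\u0323",         # o-circumflex, dot below
    "\u1ecd\u0302",         # o-dot-below, circumflex
    "\u1ed9",               # fully precomposed
]

# Marks with a lower combining class sort first in the canonical order.
classes = {f"U+{ord(m):04X}": unicodedata.combining(m) for m in "\u0302\u0323"}
print(classes)

# Every spelling yields the same decomposed and the same composed form.
nfd_forms = {unicodedata.normalize("NFD", s) for s in spellings}
nfc_forms = {unicodedata.normalize("NFC", s) for s in spellings}
print(len(nfd_forms), len(nfc_forms))
```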

45

Typography - Complex Vowels Positioning

46

Smart rendering: Arabic
Keyboard (successive keystrokes): ba, bab, babi, babib, babibu b
Code points: 0628 064E 0628 0650 0628 064F 0020 0628
Screen: [rendered Arabic text not preserved]

47

Smart rendering: Burmese

Keyboard (successive keystrokes): kr, kru, krui
Code points: 1000 1039 101B 102F 102D
Screen: [rendered Burmese text not preserved]

48

Smart rendering: Tamil
Ur r y N m k j
Keyboard: Ur rU yU NU mU kU jU
Code points: 0B8A 0BB0, 0BB0 0BC2, 0BAF 0BC2, 0BA3 0BC2, 0BAE 0BC2, 0B95 0BC2, 0B9C 0BC2
Screen: [rendered Tamil text not preserved]
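The three smart-rendering slides share one idea: the stored text is just the sequence of code points in typing order, and the shaping of glyphs on screen is left to the rendering engine. As a hedged sketch, the Python below rebuilds strings from code points listed on the slides (for Tamil, only the last pair) and prints each character's name; how they appear depends on the fonts and shaping support of the display.

```python
import unicodedata

# Code point sequences copied from the slides, in keyboard (logical) order.
sequences = {
    "Arabic": [0x0628, 0x064E, 0x0628, 0x0650, 0x0628, 0x064F, 0x0020, 0x0628],
    "Burmese": [0x1000, 0x1039, 0x101B, 0x102F, 0x102D],
    "Tamil": [0x0B9C, 0x0BC2],
}

texts = {}
for script, code_points in sequences.items():
    texts[script] = "".join(chr(cp) for cp in code_points)
    print(script, texts[script])
    for ch in texts[script]:
        # The stored data stays one code point per keystroke element.
        print(f"  U+{ord(ch):04X} {unicodedata.name(ch)}")
```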

49

Typography - Complex Ligature

50

Canonical equivalence

01FA
LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE

212B 0301
ANGSTROM SIGN + COMBINING ACUTE ACCENT

00C5 0301
LATIN CAPITAL LETTER A WITH RING ABOVE + COMBINING ACUTE ACCENT

0041 030A 0301
LATIN CAPITAL LETTER A + COMBINING RING ABOVE + COMBINING ACUTE ACCENT
51

Case mapping

 Case mapping may produce strings of different length

01F0 → 004A 030C
 Case mapping may depend on the locale

English 0069 → 0049

Turkish/Azeri 0069 → 0130
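Both points can be tried in Python, whose `str.upper()` applies full case mappings but is not locale-aware. The helper `upper_turkic` below is a hypothetical sketch of a Turkish/Azeri rule, not a library function; production code would normally rely on a locale-aware library instead.

```python
# Full case mapping can change the length of a string.
j_caron = "\u01f0"           # LATIN SMALL LETTER J WITH CARON
upper = j_caron.upper()      # J followed by a combining caron
print(len(j_caron), len(upper))

# str.upper() ignores the locale, so "i" always gets the English mapping.
default_i = "i".upper()

def upper_turkic(text):
    """Uppercase with the Turkish/Azeri rule i -> U+0130 applied first (hypothetical helper)."""
    return text.replace("i", "\u0130").upper()

print(default_i, upper_turkic("izmir"))
```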

52


Typography – Unicode Sinhala
1998 – Unicode Ver. 3.0 Sinhala
1987- Unicode Ver. 1.0 Sinhala

54


ttha in Devanagari ttha in Tamil Tva in Malayalam Tva in Sinhala

55


U+200C (ZWNJ): UTF-8 E2 80 8C    U+200D (ZWJ): UTF-8 E2 80 8D

Tva with ZWNJ in Malayalam Tva with ZWJ in Malayalam

Tva with ZWNJ in Sinhala Tva with ZWJ in Sinhala
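A minimal sketch of how the joiner characters sit in the stored text. The Sinhala letters used here (U+0DAD, U+0DCA, U+0DC0) are my assumed reading of the slide's "Tva" example, and the variable names are illustrative; how each sequence is drawn depends on the font and shaping engine.

```python
ZWNJ = "\u200c"   # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"    # ZERO WIDTH JOINER

# Assumed spelling: consonant TA + AL-LAKUNA (vowel killer) + consonant VA.
ta, al_lakuna, va = "\u0dad", "\u0dca", "\u0dc0"

plain = ta + al_lakuna + va
with_zwj = ta + al_lakuna + ZWJ + va      # requests a joined form
with_zwnj = ta + al_lakuna + ZWNJ + va    # requests separate letters

# The joiners are invisible but are real code points with real bytes.
print(ZWNJ.encode("utf-8").hex(" "), ZWJ.encode("utf-8").hex(" "))
print(len(plain), len(with_zwj), len(with_zwnj))
```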

56

Typography - Complex Ligature-UTF 8

U+0000 .. U+007F 1 byte 0xxx xxxx
U+0080 .. U+07FF 2 bytes 110x xxxx 10xx xxxx
U+0800 .. U+FFFF 3 bytes 1110 xxxx 10xx xxxx 10xx xxxx
U+10000 .. U+10FFFF 4 bytes 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx

U+0026 AMPERSAND (decimal 38)
U+0D85 SINHALA LETTER AYANNA (decimal 3,461)
U+4E2D HAN IDEOGRAPH 4E2D (decimal 20,013)
U+10346 GOTHIC LETTER FAIHU (decimal 66,374)
U+0E12 THAI LETTER THO PHUTHAO (decimal 3,602)
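The bit layout in the table can be turned directly into code. The following is a hand-written sketch for illustration (the function name `utf8_encode` is an assumption); in real programs the built-in `str.encode("utf-8")` does the same job.

```python
def utf8_encode(code_point):
    """Encode one code point by hand, following the bit layout in the table."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("outside the Unicode encoding space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not encoded in UTF-8")
    if code_point <= 0x7F:                       # 0xxx xxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                      # 110x xxxx 10xx xxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:                     # 1110 xxxx 10xx xxxx 10xx xxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),     # 1111 0xxx + three trail bytes
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# The five example characters from the slide.
examples = [0x0026, 0x0D85, 0x4E2D, 0x10346, 0x0E12]
for cp in examples:
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex(' ')}")
```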

57


Preventing Conjunct Forms in Devanagari

Half-Consonants in Devanagari
58


Buddha in Sinhala

59

Typography - Complex Ligature in DB
<html>
<head>
<meta charset="utf-8">
<title>සිංහල</title></head>
<body>
<?php
// Simple connection setting; assumed here to create $conn with mysqli_connect()
include("connection.php");
// The main trick: make the client connection use UTF-8
mysqli_set_charset($conn, "utf8");
$cmd = "select * from sinhala";
$result = mysqli_query($conn, $cmd);
while ($myrow = mysqli_fetch_row($result))
{
// Escape the value before writing it into the HTML page
echo htmlspecialchars($myrow[0]);
}
?>
</body>
</html>

-- The dump for my database storing Sinhala UTF-8 strings is
CREATE TABLE `sinhala` (
`data` varchar(1000) character set utf8 collate utf8_bin default NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

INSERT INTO `sinhala` VALUES ('අම්මා');

60

Typography
 Typical typefaces (fonts) and type styles used
in Word Processors
Typefaces
Serif typefaces: Times New Roman, Courier, Palatino
Sans serif typefaces: Arial, Impact, Arial Narrow
Special typefaces: Symbol, freehand
Crazy fonts can be distracting!
Type styles: Bold, Italics, Outline
61

Typography

 Special effects
 Kerning increases or decreases the spacing
between certain pairs of letters to improve
their appearance.
 Line spacing or leading
 Orientation
 Anti-alias: smooths out the edges of text. The
edges blend into the background, so the text
looks cleaner and is more readable when it is
large.

62

Typography

Ascender height
Cap Height

x-height

Baseline

Descender height

63

Typography - Tracking & Kerning

64

Typography - Orientation

65

Typography – Anti-alias

66

Typography
 Special effects cont…
 Applying strokes, fills, effects and styles
to text

stroke fill effect style

67

Typography
 Attaching text to a path

68

Typography
 Converting text to path :
Text converted to paths retains all
of its visual attributes, but you
can edit it only as paths.

69

Typography
 Bitmap Font
 Vector Font
 TrueType
Fast, standard, for computer screen and printer
 Adobe Type 1
Precise, professional, used for publishing
 Anti-aliased small font
 For LCD screen: ClearType etc.
[Figure: screen from "Fontographer"; small-font samples labelled Normal and Optimized]
70

Text- Cross-media Technology
 Voice Recognition
 Converts voice (sound data) to text data
 Needs real-time processing
 Specific speaker/non-specific speaker
 Text-to-Speech (Speech Synthesis)
 Computer reads text data aloud
Automatic information services, reading out
new mail.

71

Text- Cross-media Technology cont…

 Optical Character Recognition
 Converts a bitmap image of text to real text
data
 Used with an image scanner
 Handwriting Recognition
 Similar to OCR, but uses writing
order/direction for better recognition.
 Used in PIM (Personal Information
Manager) devices (palmtop computers)

72

Text- Cross-media Technology cont…
 Machine Translation
 All text-based techniques are language
dependent
 Needs automatic translation
Vertical market: technical document translation
Personal market: Web browsing
 Combination of media technologies
Automatically translate international telephone
messages.
Japanese voice → (Japanese voice recognition) → Japanese text data → (machine translation) → English text data → (English speech synthesis) → English voice
73

File Format

 .TXT - (unformatted text, e.g. Notepad)
 .DOC - (developed by Microsoft, e.g. MS Word)
 .RTF - (Rich Text Format)
 .PDF - (Portable Document Format), Adobe
 .PS - (PostScript), a page description
language used mainly for desktop
publishing
74
