SlideShare une entreprise Scribd logo
1  sur  27
Télécharger pour lire hors ligne
Arabic Script
SHOULD NOT
be so Scary!
What else Unicode still needs to do
for the Perso-Arabic script
Behnam Esfahbod
behnam.es
37th Internationalization and Unicode Conference – October 2013 – Santa Clara, CA
In the past decade...
The FarsiWeb Project
●
●
●
●

National Standard for Persian Unicode text
Standard Persian keyboard layouts on all major OSs
Standard-based Open Source Persian TTF fonts
Persian support in GNOME & Mozilla projects

IRNIC, The .IR ccTLD Registry
●
●

Deploy Persian (second-level) IDNs
Apply for the IDN ccTLD (actually, 2 of them!)

Persian Computing Community
●
●

Over 300 engineers, designers,
SIGs, e.g. Persian Typography

ICANN & IETF
●
●

IDN Variant Issues Project
MESWG Task Force on Arabic-Script IDNs
Previously on Unicode
Perso-Arabic Script
Mainly Arabic and Indo-Iranian languages
●
●

Is the basis for many alphabets
Persian, Urdu, Pashto, …, and Arabic!

No alphabet uses all kinds of shapes
●
●

Four-dots is not used in Arabic or Persian
But 99% of the shapes are recognized by everyone

Many writing styles
●

Naskh, Nasta’liq, etc

Not all alphabets use all the features
●

Arabic language/alphabet doesn’t use ZWNJ, traditionally
Rooted in the language properties and its conjugation forms

●

But recently is being used for non-Arabic words

IBM ⇒ ‫آییبیمام‬
Semantic Encoding
Writing Direction
●
●

Letters:
Digits:

right-to-left
left-to-right

⇒ Unicode Bidirectional Algorithm (Bidi/UBA)

‫س ل ما م‬
One code-point for each letter
⇒ Unicode Arabic Joining Algorithm

‫سلـام‬
‫سلم‬
●

Joining Control Characters
Dual-Joining (yellow) & Right-Joining (blue)
Special Characters
1. U+0640 ARABIC TATWEEL
●
●

a.k.a. Kashida
‫کشیــــــــــــــــــده ⇒ کشیده‬

2. U+200C ZERO WIDTH NON-JOINER
●
●

a.k.a. ZWNJ
‫نامهمای ⇒ نامهای‬

3. U+200D ZERO WIDTH JOINER
●
●

a.k.a. ZWJ
.‫ه.ش. ⇒ ه.ش‬
‍
Status Quo
1. Kashida
●
●

On the keyboard: too similar to Hyphen
Makes the text hard to process (search, …)

2. ZWNJ
●
●
●
●
●
●

Inaccessible in non-standard keyboard layouts
Confuses search engines and applications
IDNA2003
UTR #31 (Unicode Identifier and Pattern Syntax)
IDNA2008
UTR #36 (Unicode Security Considerations)

3. ZWJ
●

Too little use
But when it’s needed, user is already tired of the other issues
Be Stylish
Typographic Styles
Western styles for Latin-like scripts
●
●
●

Bold
Italic & Slanted (including Iranic: left-slanted)
Underline
Question:
When, how, and by whom these three became the de facto standard?

Justification & Elongation
●
●
●

Traditionally used in manuscripts
Used to be available in movable type too
Recent study shows Elongation is still the right way to
emphasize text in Perso-Arabic script
© 2013 Nasser Hadjloo, “Curious Case of Word Stylers”, TEDxTehran

1000 persons × 10 emphasis methods
⇒ Justification & Elongation lead the way!
Typographic Emphasis
Letter Spacing

LATIN SCRIPT
Elongation + Word Spacing

‫خــــط فـــــارســــی‬
●

CSS
“letter-spacing” is useless (‫)خ ‍ط‬
‍
●
“text-justify” does work, only for justification
⇒ No sane way to do elongation
●
Verbal Emphasis
Letter Repetition

Latin script… Noooo!
Elongation

!‫خط فارسی… یبـلـــــی‬
●

Keyboard
●

●

Direct mapping of characters to typographic elements (glyphs)
is bad!
Press a Kashida key a few times to emphasize some letter,
bad again!
Text Input & Storage
Joining Control
ZWNJ is very common in Persian
●
●

Necessary to create compound words
And mandatory for some words

‫خانهمای‬
‫یبییبی‬
●

But preferential in some others

‫خطکش‬
‫میوشود‬

Larry Tesler: “no modes”
●

Mode: a distinct setting in which the same user input will produce
perceived different results than it would in other settings
Modeless Input Methods
Current keyboards techniques
●
●

Based on the typewriter technologies
Based on the needs of Latin script

Kashida insertion should be implicit
●
●

Three key presses give you three letters
Three letters need three key presses

[‫]ب[...............]ل[....]ه‬

‫یبـــــــــــــــــــلــــــه‬
ZWNJ insertion should be implicit
●
●

Instead of “mode”, we need a “modifier”
Shift key would make much sense!
Note: only applies to dual-joining letters, if followed by a joining letter
Obstacles
Input
●
●
●

Need to remove unnecessary ZWNJ dynamically
Hard to implement in keyboard engines
Have to be implemented in an IME

Processing
●

Search
●
●

●

Kashida should be ignored
ZWNJ should be respected, sometimes

Identifiers
●

Standards have to deal with them, individually!

Output
●
●

A few apps still have problems with ZWNJ
Kashida not handled correctly in most fonts
Multilingual Environments
HCI Model & Language

Language in the computer

Language in the user’s mind
Encoding Challenges
Characters with similar shapes (Not in all joining
forms)
Arabic Yeh vs. Persian Yeh
© SIL International

Prefered shapes of letter Heh family based on languages
Internationalized Identifiers
Confused about Kashida
●

Use NFKC to get rid of it

Confused about ZWNJ
●

●

Some protocols mandate some rules
●
And FAIL if the input doesn’t comply
●
UTR #31 (Unicode Identifier and Pattern Syntax) rule
2.3.A1
●
UTR #46 ⇢ IDNA2008 ⇢ UTR #31
No standard approach to make it more user-friendly
●
Although the semantics are obvious to the user

No harmony in same-shape problems
●
●
●

Protocols barely talk about it
Policy-making works only for a few cases
Applications decide how to handle it, if they do
Developing a Solution
Existing Similar Methods
Case-Mapping
●
●

●

Letter cases (different representation of the same concept)
Useful for western scripts (Latin, Cyrillic, Greek, …)
●
Doesn’t work for the other scripts
Language-dependent

Normalization Form Canonical
●

●

Deal with encoding issues
●
Ensure backward-compatibility
Allow more expansion of UCD
●
Better forward-compatibility

Normalization Form Compatibility
●
●

Works good for Arabic script (Kashida)
But too much damage to other scripts
The Solution SHALL be able to…
●

Remove unnecessary (invisible) ZWNJs

●

Remove Kashida characters

●

Place text into an specific language
●

●

Maintaining the expected shape

Stay consistent when language is not specified
Arabic Shape Mapping
●

Language-less Basic Normalization
●

●

Language-less Identifier Normalization
●
●

●

Basic normalization
Remove any Kashida

Language-based Shape Mapping
●

●

Remove any unnecessary (invisible) ZWNJ

Map characters with same joining-form shapes to the right
one for the target language

Language-less Shape Mapping
●

Not straightforward at all!
‫‪T H E‬‬
‫‪E N D‬‬
‫پـــــــــــــــــــــــایان‬

Contenu connexe

Similaire à Arabic Script SHOULD NOT be so Scary!

Programming language design and implemenation
Programming language design and implemenationProgramming language design and implemenation
Programming language design and implemenation
Ashwini Awatare
 
Lect 1. introduction to programming languages
Lect 1. introduction to programming languagesLect 1. introduction to programming languages
Lect 1. introduction to programming languages
Varun Garg
 

Similaire à Arabic Script SHOULD NOT be so Scary! (20)

Programming assignment help by myassignmenthelp
Programming assignment help by myassignmenthelpProgramming assignment help by myassignmenthelp
Programming assignment help by myassignmenthelp
 
Introduction Programming Languages
Introduction Programming LanguagesIntroduction Programming Languages
Introduction Programming Languages
 
Introduction to programming languages
Introduction to programming languagesIntroduction to programming languages
Introduction to programming languages
 
Computer Programming Overview
Computer Programming OverviewComputer Programming Overview
Computer Programming Overview
 
Computer programming programming_langugages
Computer programming programming_langugagesComputer programming programming_langugages
Computer programming programming_langugages
 
Computer programing 111 lecture 1
Computer programing 111 lecture 1 Computer programing 111 lecture 1
Computer programing 111 lecture 1
 
Some of my best friends are localisers
Some of my best friends are localisersSome of my best friends are localisers
Some of my best friends are localisers
 
Presentation-1.pptx
Presentation-1.pptxPresentation-1.pptx
Presentation-1.pptx
 
Single-Sourcing and Localization stc16
Single-Sourcing and Localization stc16Single-Sourcing and Localization stc16
Single-Sourcing and Localization stc16
 
Lecture 8
Lecture 8Lecture 8
Lecture 8
 
Programming Languages Categories / Programming Paradigm By: Prof. Lili Saghafi
Programming Languages Categories / Programming Paradigm By: Prof. Lili Saghafi Programming Languages Categories / Programming Paradigm By: Prof. Lili Saghafi
Programming Languages Categories / Programming Paradigm By: Prof. Lili Saghafi
 
Programming language design and implemenation
Programming language design and implemenationProgramming language design and implemenation
Programming language design and implemenation
 
Build Your Locale Style Guide
Build Your Locale Style GuideBuild Your Locale Style Guide
Build Your Locale Style Guide
 
About programming languages
About programming languagesAbout programming languages
About programming languages
 
Lect 1. introduction to programming languages
Lect 1. introduction to programming languagesLect 1. introduction to programming languages
Lect 1. introduction to programming languages
 
DITA for Localization
DITA for LocalizationDITA for Localization
DITA for Localization
 
Essential Smart Programming Techniques that gets you hired by Tech Giants
Essential Smart Programming Techniques that gets you hired by Tech GiantsEssential Smart Programming Techniques that gets you hired by Tech Giants
Essential Smart Programming Techniques that gets you hired by Tech Giants
 
Single-Sourcing and Localization
Single-Sourcing and LocalizationSingle-Sourcing and Localization
Single-Sourcing and Localization
 
Roots and Routes: Crowdsourced Manuscript Transcription Workshop
Roots and Routes: Crowdsourced Manuscript Transcription WorkshopRoots and Routes: Crowdsourced Manuscript Transcription Workshop
Roots and Routes: Crowdsourced Manuscript Transcription Workshop
 
Ppl 13 july2019
Ppl 13 july2019Ppl 13 july2019
Ppl 13 july2019
 

Dernier

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Dernier (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 

Arabic Script SHOULD NOT be so Scary!

  • 1. Arabic Script SHOULD NOT be so Scary! What else Unicode still needs to do for the Perso-Arabic script Behnam Esfahbod behnam.es 37th Internationalization and Unicode Conference – October 2013 – Santa Clara, CA
  • 2. In the past decade... The FarsiWeb Project ● ● ● ● National Standard for Persian Unicode text Standard Persian keyboard layouts on all major OSs Standard-based Open Source Persian TTF fonts Persian support in GNOME & Mozilla projects IRNIC, The .IR ccTLD Registry ● ● Deploy Persian (second-level) IDNs Apply for the IDN ccTLD (actually, 2 of them!) Persian Computing Community ● ● Over 300 engineers, designers, SIGs, e.g. Persian Typography ICANN & IETF ● ● IDN Variant Issues Project MESWG Task Force on Arabic-Script IDNs
  • 4. Perso-Arabic Script Mainly Arabic and Indo-Iranian languages ● ● Is the basis for many alphabets Persian, Urdu, Pashto, …, and Arabic! No alphabet uses all kinds of shapes ● ● Four-dots is not used in Arabic or Persian But 99% of the shapes are recognized by everyone Many writing styles ● Naskh, Nasta’liq, etc Not all alphabets use all the features ● Arabic language/alphabet doesn’t use ZWNJ, traditionally Rooted in the language properties and its conjugation forms ● But recently is being used for non-Arabic words IBM ⇒ ‫آییبیمام‬
  • 5. Semantic Encoding Writing Direction ● ● Letters: Digits: right-to-left left-to-right ⇒ Unicode Bidirectional Algorithm (Bidi/UBA) ‫س ل ما م‬ One code-point for each letter ⇒ Unicode Arabic Joining Algorithm ‫سلـام‬ ‫سلم‬ ● Joining Control Characters
  • 6. Dual-Joining (yellow) & Right-Joining (blue)
  • 7. Special Characters 1. U+0640 ARABIC TATWEEL ● ● a.k.a. Kashida ‫کشیــــــــــــــــــده ⇒ کشیده‬ 2. U+200C ZERO WIDTH NON-JOINER ● ● a.k.a. ZWNJ ‫نامهمای ⇒ نامهای‬ 3. U+200D ZERO WIDTH JOINER ● ● a.k.a. ZWJ .‫ه.ش. ⇒ ه.ش‬ ‍
  • 8. Status Quo 1. Kashida ● ● On the keyboard: too similar to Hyphen Makes the text hard to process (search, …) 2. ZWNJ ● ● ● ● ● ● Inaccessible in non-standard keyboard layouts Confuses search engines and applications IDNA2003 UTR #31 (Unicode Identifier and Pattern Syntax) IDNA2008 UTR #36 (Unicode Security Considerations) 3. ZWJ ● Too little use But when it’s needed, user is already tired of the other issues
  • 10. Typographic Styles Western styles for Latin-like scripts ● ● ● Bold Italic & Slanted (including Iranic: left-slanted) Underline Question: When, how, and by whom these three became the de facto standard? Justification & Elongation ● ● ● Traditionally used in manuscripts Used to be available in movable type too Recent study shows Elongation is still the right way to emphasize text in Perso-Arabic script
  • 11. © 2013 Nasser Hadjloo, “Curious Case of Word Stylers”, TEDxTehran 1000 persons × 10 emphasis methods ⇒ Justification & Elongation lead the way!
  • 12. Typographic Emphasis Letter Spacing LATIN SCRIPT Elongation + Word Spacing ‫خــــط فـــــارســــی‬ ● CSS “letter-spacing” is useless (‫)خ ‍ط‬ ‍ ● “text-justify” does work, only for justification ⇒ No sane way to do elongation ●
  • 13. Verbal Emphasis Letter Repetition Latin script… Noooo! Elongation !‫خط فارسی… یبـلـــــی‬ ● Keyboard ● ● Direct mapping of characters to typographic elements (glyphs) is bad! Press a Kashida key a few times to emphasize some letter, bad again!
  • 14. Text Input & Storage
  • 15. Joining Control ZWNJ is very common in Persian ● ● Necessary to create compound words And mandatory for some words ‫خانهمای‬ ‫یبییبی‬ ● But preferential in some others ‫خطکش‬ ‫میوشود‬ Larry Tesler: “no modes” ● Mode: a distinct setting in which the same user input will produce perceived different results than it would in other settings
  • 16. Modeless Input Methods Current keyboards techniques ● ● Based on the typewriter technologies Based on the needs of Latin script Kashida insertion should be implicit ● ● Three key presses give you three letters Three letters need three key presses [‫]ب[...............]ل[....]ه‬ ‫یبـــــــــــــــــــلــــــه‬ ZWNJ insertion should be implicit ● ● Instead of “mode”, we need a “modifier” Shift key would make much sense! Note: only applies to dual-joining letters, if followed by a joining letter
  • 17. Obstacles Input ● ● ● Need to remove unnecessary ZWNJ dynamically Hard to implement in keyboard engines Have to be implemented in an IME Processing ● Search ● ● ● Kashida should be ignored ZWNJ should be respected, sometimes Identifiers ● Standards have to deal with them, individually! Output ● ● A few apps still have problems with ZWNJ Kashida not handled correctly in most fonts
  • 19. HCI Model & Language Language in the computer Language in the user’s mind
  • 20. Encoding Challenges Characters with similar shapes (Not in all joining forms) Arabic Yeh vs. Persian Yeh
  • 21. © SIL International Prefered shapes of letter Heh family based on languages
  • 22. Internationalized Identifiers Confused about Kashida ● Use NFKC to get rid of it Confused about ZWNJ ● ● Some protocols mandate some rules ● And FAIL if the input doesn’t comply ● UTR #31 (Unicode Identifier and Pattern Syntax) rule 2.3.A1 ● UTR #46 ⇢ IDNA2008 ⇢ UTR #31 No standard approach to make it more user-friendly ● Although the semantics are obvious to the user No harmony in same-shape problems ● ● ● Protocols barely talk about it Policy-making works only for a few cases Applications decide how to handle it, if they do
  • 24. Existing Similar Methods Case-Mapping ● ● ● Letter cases (different representation of the same concept) Useful for western scripts (Latin, Cyrillic, Greek, …) ● Doesn’t work for the other scripts Language-dependent Normalization Form Canonical ● ● Deal with encoding issues ● Ensure backward-compatibility Allow more expansion of UCD ● Better forward-compatibility Normalization Form Compatibility ● ● Works good for Arabic script (Kashida) But too much damage to other scripts
  • 25. The Solution SHALL be able to… ● Remove unnecessary (invisible) ZWNJs ● Remove Kashida characters ● Place text into an specific language ● ● Maintaining the expected shape Stay consistent when language is not specified
  • 26. Arabic Shape Mapping ● Language-less Basic Normalization ● ● Language-less Identifier Normalization ● ● ● Basic normalization Remove any Kashida Language-based Shape Mapping ● ● Remove any unnecessary (invisible) ZWNJ Map characters with same joining-form shapes to the right one for the target language Language-less Shape Mapping ● Not straightforward at all!
  • 27. ‫‪T H E‬‬ ‫‪E N D‬‬ ‫پـــــــــــــــــــــــایان‬