1. A11Y? I18N? L10N? UTF8?
WTF?
Understanding the
connections between:
accessibility,
internationalization,
localization,
and character sets
Michael
Toppa
@mtoppa
WordCamp
Lancaster
March 1, 2014
7. WCAG Accessibility (A11Y)
Guidelines
1. Perceivable
2. Operable
3. Understandable and Predictable
❖ Success Criterion 3.1.1 Language of Page:
❖ The default human language of each Web page can be
programmatically determined.
4. Robust and Compatible
8. The lang attribute
❖ Declare the language of a WordPress theme in
header.php:
<html <?php language_attributes(); ?>>
For a US English site, this renders as:
<html lang="en-US">
❖ In HTML 5, declare the language of part of a document:
<div lang="fr">Bonjour</div>
9. Uses of the lang attribute
❖ Improves search engine results
❖ Helps support server content negotiation
❖ Supports spelling and grammar checkers
❖ Supports speech synthesizers and automated
translators
❖ Allows user-agents to select language appropriate fonts
16. The Unicode slogan
“Unicode provides a unique number for every
character, no matter what the platform, no
matter what the program, no matter what the
language.”
21. Localization
“Localization refers to the adaptation of a
product, application or document content
to meet the language, cultural and other
requirements of a specific target market
(a locale).”
This often involves more than just translation
22. Internationalization
“Internationalization is the design and
development of a product, application or
document content that enables easy
localization for target audiences that vary
in culture, region, or language.”
24. Step 1: use WordPress’ I18N
functions
❖ Wrap all your text in WordPress’ I18N functions, using a
custom “text domain”. Mine is “shashin”
❖ $greeting = __( 'Howdy', 'shashin' );
❖ <li><?php _e( 'Howdy', 'shashin' ); ?></li>
❖ $string = _x( 'Buffalo', 'an animal', 'shashin' );
❖ $string = _x( 'Buffalo', 'a city in New York', 'shashin' );
❖ And others…
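A minimal sketch of two of the other functions, assuming the same “shashin” text domain: sprintf() with __() for a placeholder, and _n() for plurals. The variables $name and $count are hypothetical, and the snippet is meant to run inside WordPress, where these functions are defined.

```php
<?php
// Standard WordPress guard: do nothing if loaded outside WordPress,
// where __(), _n() and esc_html() are not defined.
if ( ! defined( 'ABSPATH' ) ) {
    exit;
}

$name  = 'Michael'; // hypothetical values for illustration
$count = 3;

// Keep the whole sentence in one translatable string and use a placeholder,
// because word order differs between languages.
/* translators: %s: the visitor's name */
$greeting = sprintf( __( 'Howdy, %s!', 'shashin' ), $name );
echo esc_html( $greeting );

// _n() picks the singular or plural form according to the rules
// of the language being displayed.
$label = sprintf( _n( '%d photo', '%d photos', $count, 'shashin' ), $count );
echo esc_html( $label );
```

Building sentences by concatenating separately translated fragments tends to break in languages with a different word order, which is why the placeholder stays inside the translatable string.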
30. Further reading
❖ W3C
❖ How to meet WCAG 2.0: quick reference
❖ Why use the language attribute?
❖ Localization vs. Internationalization
❖ WordPress
❖ How To Localize WordPress Themes and Plugins
❖ I18n for WordPress Developers
❖ Internationalization: You’re probably doing it wrong
❖ Solving the Unicode Puzzle
Editor's notes
In this talk I’m going to give you a sampling of several related topics, and daisy-chain them together. This talk is by no means comprehensive. My goal is to give you a general sense of the connections between accessibility, internationalization, localization, and character sets, to start you down the road to understanding how to make your web content as accessible as possible to people who speak different languages, and have various levels of reading capabilities.
* I’ve been developing for the web since the days of HTML 1.0 and the Mosaic web browser.
* This is my 6th WordCamp presentation, and I have 7 plugins at wordpress.org, dating back to 2006.
* I was previously the Director of Development for WebDevStudios, and I managed the 17-person web application team at the U Penn School of Medicine.
* I mostly work in Ruby on Rails now, and I also have experience with Java, Python, Perl, and of course, PHP.
In 2005 I wrote an article on configuring Apache, Oracle, and PHP for Unicode, published in PHP Architect. At that time Unicode was just emerging as the new standard for character encoding, and configuring end-to-end support for using it in web applications was a significant undertaking. These days, Unicode support comes out of the box for the most part.
Accessibility applies to:
* older people
* people with low literacy or not fluent in the language
* people with low bandwidth connections or using older technologies
* new and infrequent users
* … and persons with disabilities
The World Wide Web Consortium (W3C) published version 2.0 of its Web Content Accessibility Guidelines (WCAG) in 2008. It is organized around 4 key principles:
* Perceivable - e.g. provide text alternatives for non-textual content
* Operable - e.g. make all functionality available from the keyboard, provide good site navigation
* Understandable - e.g. help users avoid and correct mistakes, such as clearly indicating errors in a form submission
* Robust - e.g. use valid HTML and maximize compatibility with user agents such as screen readers
WCAG 2.0 lists 17 success criteria for making a web page understandable. The first one, 3.1.1, is that it should be possible to programmatically determine the default language of a web page.
WordPress itself has been translated to over 70 languages, but if you are developing a theme or plugin, you still need to make sure you are using the lang attribute appropriately.
The language_attributes() function sets the lang attribute based on the language specified for the site in your wp-config.php file (the WPLANG constant).
Content negotiation lets the browser tell the server what media types and languages it prefers, and the server will do its best to comply. There is a plugin to support this in WordPress.
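To make the language half of content negotiation concrete, here is a minimal sketch in plain PHP of how a server might choose a language from the browser's Accept-Language header. The function name pick_language() and the list of supported languages are hypothetical, and the matching is deliberately simplified: it compares only the primary language subtag, so "fr-CH" would match a site offering "fr-FR".

```php
<?php
/**
 * Choose the supported language with the highest q-value in an
 * Accept-Language header. Hypothetical helper, for illustration only.
 */
function pick_language( string $header, array $supported, string $default ): string {
    $best   = $default;
    $best_q = 0.0;

    foreach ( explode( ',', $header ) as $part ) {
        $pieces = array_map( 'trim', explode( ';', $part ) );
        $tag    = strtolower( $pieces[0] );
        $q      = 1.0; // a language listed without a q-value has weight 1

        foreach ( array_slice( $pieces, 1 ) as $param ) {
            if ( strpos( $param, 'q=' ) === 0 ) {
                $q = (float) substr( $param, 2 );
            }
        }

        // Skip empty entries and anything no better than what we already have.
        if ( $tag === '' || $q <= $best_q ) {
            continue;
        }

        foreach ( $supported as $candidate ) {
            // Simplification: compare primary subtags only ("en" vs "en-US").
            if ( explode( '-', strtolower( $candidate ) )[0] === explode( '-', $tag )[0] ) {
                $best   = $candidate;
                $best_q = $q;
                break;
            }
        }
    }

    return $best;
}

// A browser that prefers French, then German, then English,
// visiting a site that offers only US English and German:
echo pick_language( 'fr-CH, fr;q=0.9, de;q=0.8, en;q=0.7', [ 'en-US', 'de-DE' ], 'en-US' ), "\n";
// de-DE
```

In a real request the header value is available in PHP as $_SERVER['HTTP_ACCEPT_LANGUAGE'].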
These 4 ideographic characters all have the same Unicode value and meanings in Chinese, Japanese, and Korean, but are rendered differently, depending on whether the lang attribute of the page is set to Simplified Chinese, Traditional Chinese, Japanese, or Korean.
Unicode is a single character set designed to include characters from just about every writing system on the planet. This is a small section of the Unicode character map, showing characters used in languages spoken in Myanmar.
It supports languages from off the planet as well. Although Klingon was not granted official incorporation into Unicode, the proposed code points for it still remain conspicuously available (which means if you download the Klingon font, and go to a blog written in Klingon, it will work).
Unicode has been prevalent on the web for about 10 years now. The problem it solves goes back to the 1960s, when unaccented English characters, along with various control characters for carriage returns, form feeds, etc., were each assigned a number from 0 to 127. There was general agreement on these number assignments, and so ASCII (American Standard Code for Information Interchange) was born.
The ASCII characters could fit in 7 bits, and computers used 8-bit bytes, which left an extra bit of space. This led to the proliferation of many different character sets, with each one using this extra space in a different way. Here’s Latin 1, which contains special symbols and accented characters for Western languages.
Here’s the version of Upper ASCII that supports Slavic languages. There are 15 variations on this ISO standard (ISO 8859). This means that text generated on, say, a computer in Russia would turn into gibberish if you tried to read it on a computer in the US. This happened because the number codes representing the Cyrillic characters were assigned to totally different characters on the US computer. This became a bit of a problem when everyone started using the internet.
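The gibberish problem can be reproduced in a few lines. This is a minimal sketch, assuming PHP with the mbstring extension enabled: it takes one byte value and interprets it first as Latin-1 (ISO-8859-1) and then as the Cyrillic set ISO-8859-5.

```php
<?php
// One byte, two legacy character sets: the number is the same,
// but the character it stands for is not.
$byte = "\xE9";

// Convert to UTF-8 so both results can be printed on a modern terminal.
$as_latin1   = mb_convert_encoding( $byte, 'UTF-8', 'ISO-8859-1' ); // Western European
$as_cyrillic = mb_convert_encoding( $byte, 'UTF-8', 'ISO-8859-5' ); // Cyrillic

echo $as_latin1, "\n";   // é
echo $as_cyrillic, "\n"; // щ
```

Nothing in the byte itself says which table to use, so a document that does not declare its character set leaves the reader's software to guess.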
Unicode represents an effort to clean up this mess. Unicode can do this because it allows characters to occupy more than one byte, so it has enough room to store characters from languages around the world—even Asian languages that have thousands of characters. It’s a character set able to support over 1 million characters.
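Unicode is a character set, and there are several ways to encode it; the most common are UTF-8, UTF-16, and UTF-32. UTF-8 is the Unicode encoding standard for the web. It is built from 8-bit bytes, and it encodes the 128 ASCII characters with exactly the same single byte values that ASCII uses, so any plain ASCII document is already valid UTF-8. Characters outside ASCII, including the accented characters of Latin-1, take two to four bytes each, so Latin-1 documents that use them need to be converted. Because most of the text in Western language documents is plain ASCII, this makes UTF-8 largely backwards compatible with previously created Western language documents.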
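The variable width of UTF-8 is easy to see by counting bytes. This is a minimal sketch, assuming PHP 7.0 or later (for the "\u{...}" escape syntax) with the mbstring extension; the four sample characters are arbitrary choices for illustration.

```php
<?php
// Sample characters written as Unicode code points; PHP stores them as UTF-8.
$samples = [
    'U+0041 (A)'          => "\u{0041}",
    'U+00E9 (e acute)'    => "\u{00E9}",
    'U+3042 (hiragana a)' => "\u{3042}",
    'U+1F600 (emoji)'     => "\u{1F600}",
];

foreach ( $samples as $label => $char ) {
    // strlen() counts bytes, not characters.
    printf( "%-20s %d byte(s): %s\n", $label, strlen( $char ), bin2hex( $char ) );
}
// U+0041 (A)           1 byte(s): 41
// U+00E9 (e acute)     2 byte(s): c3a9
// U+3042 (hiragana a)  3 byte(s): e38182
// U+1F600 (emoji)      4 byte(s): f09f9880

// The single Latin-1 byte for e acute is not valid UTF-8 on its own.
var_dump( mb_check_encoding( "\xE9", 'UTF-8' ) ); // bool(false)
```

The ASCII letter keeps its one-byte ASCII value, which is why plain ASCII text needs no conversion.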
UTF-8 is the standard character encoding in WordPress, since version 2.2. Here’s an example from my blog, showing a multi-lingual post in the WordPress HTML editor.
A multi-lingual page like that is fairly uncommon. More commonly, content is created in one language, but we want a standardized way to enable the creation of translations into other languages. This is where localization and internationalization come in.
In addition to translation, this can also involve dealing with variations in numeric, date, currency, and time formats, varying legal requirements, and awareness of things that may be misunderstood or be offensive in other cultures.
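As a small illustration of format differences, here is a minimal sketch in plain PHP that prints the same number and the same date under two sets of conventions. The $conventions table and the format_for_locale() helper are hypothetical, hand-written stand-ins for real locale data.

```php
<?php
// Hypothetical, hand-written conventions for two locales
// (not a real locale database).
$conventions = [
    'en_US' => [ 'decimal' => '.', 'thousands' => ',', 'date' => 'm/d/Y' ],
    'de_DE' => [ 'decimal' => ',', 'thousands' => '.', 'date' => 'd.m.Y' ],
];

$amount    = 1234567.891;
$timestamp = gmmktime( 0, 0, 0, 3, 1, 2014 ); // March 1, 2014 (UTC)

// Format one amount and one date using a locale's separators and date pattern.
function format_for_locale( array $c, float $amount, int $timestamp ): string {
    $number = number_format( $amount, 2, $c['decimal'], $c['thousands'] );
    $date   = gmdate( $c['date'], $timestamp );
    return $number . ' | ' . $date;
}

foreach ( $conventions as $locale => $c ) {
    echo $locale, ': ', format_for_locale( $c, $amount, $timestamp ), "\n";
}
// en_US: 1,234,567.89 | 03/01/2014
// de_DE: 1.234.567,89 | 01.03.2014
```

The two dates show why this matters: a reader using the other convention would read either one as January 3. Inside WordPress, the core functions number_format_i18n() and date_i18n() handle this kind of formatting.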
The POT file serves as a template for translating your theme or plugin into other languages. It is generated by extracting all the text you wrapped in WordPress’ I18N functions into a single file. If you have a plugin in the wordpress.org repository, the repository can generate a POT file for you. There are other tools available for this as well; see the references at the end of this talk for other ways to generate a POT file for themes and plugins.
Put your POT file in a “languages” subdirectory. Providing this file with your plugin allows users willing to create a translation to do so, using an application called Poedit.
This shows all the different language translations available for the popular plugin, Contact Form 7. With Poedit, a translator can take your POT file and create a translation into another language. The translation is saved as a plain-text .po file, which is then compiled into a binary .mo file. If you include a .mo file whose language matches the language configuration of a WordPress site, your plugin will automatically be shown in that language, as long as your plugin loads its text domain.
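For WordPress to find those .mo files, the plugin has to load its text domain. This is a minimal sketch, assuming the “shashin” text domain and a “languages” subdirectory inside the plugin; the function name shashin_load_textdomain is hypothetical, and the code is meant for the plugin's main file.

```php
<?php
// Standard WordPress guard: do nothing if loaded outside WordPress,
// where add_action() and load_plugin_textdomain() are not defined.
if ( ! defined( 'ABSPATH' ) ) {
    exit;
}

// Hypothetical function name; the text domain must match the one
// passed to __(), _e(), _x() and friends.
function shashin_load_textdomain() {
    load_plugin_textdomain(
        'shashin',
        false, // deprecated second argument, kept for compatibility
        dirname( plugin_basename( __FILE__ ) ) . '/languages'
    );
}
add_action( 'plugins_loaded', 'shashin_load_textdomain' );
```

WordPress looks for files named after the text domain and the locale, so a French translation would be shipped as languages/shashin-fr_FR.mo.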
Maintaining translations can be difficult, as you will usually need to get an updated translation for each new release of your plugin or theme. Even just changes in line numbers can throw off the translation.
For web sites to be accessible, they need to be perceivable, operable, robust, and understandable. I’ve focused on the language support aspects of understandability, and hopefully this quick introduction to character sets, internationalization, and localization has given you a good starting point for making your WordPress site, plugins, or themes accessible to users who speak different languages.