In this talk, given to the Saint Louis Lambda Lounge, Michael Schade quickly discusses the background, approach, and technical implementation of the Accentuate.us system, and then demonstrates the new vim plugin, unreleased Apple Mac OS X service, and the just-released 0.9 version of the Firefox add-on.
3. Keyboard Input
• Lack appropriate input methods
• Electronic texts often entered as plain ASCII
o Transliteration Cherokee ᎦᎸᏉᏗᏳ → galvquodiyu
o Omitting diacritics Lingala likɔngá → likonga
o Ad hoc approaches Irish béal → be/al
• Diacritics matter!
• Omission leads to ambiguities, misunderstandings
o leite vs. léite
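The ambiguity described on this slide can be reproduced in a few lines. The following is only a sketch of what a typist without an input method effectively does: the `to_plain_ascii` helper and the `EXTRA_FOLDS` table are hypothetical names introduced for illustration and are not part of Accentuate.us.

```python
import unicodedata

# Assumed extra folding for letters that are distinct characters rather than
# base letter + combining mark (for example the Lingala open o).
EXTRA_FOLDS = {"ɔ": "o", "ɛ": "e"}

def to_plain_ascii(text: str) -> str:
    """Drop combining marks and fold a few special letters to plain ASCII."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(EXTRA_FOLDS.get(ch, ch) for ch in stripped)

print(to_plain_ascii("likɔngá"))                          # likonga
print(to_plain_ascii("léite") == to_plain_ascii("leite"))  # True: two words collapse into one
```

Once both Irish words are typed as "leite", the original spelling can only be recovered from context, which is the problem the rest of the talk addresses.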
4. Statistical Machine Learning
• Classification problem
• Machine learning
• Never-before-seen words
o French: "cera" vs. "cerc," "cabl" vs. "cabo"
o Under-resourced languages
• 114 trained languages!
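To make the classification framing concrete, here is a toy sketch, not the actual Accentuate.us model, whose details the slides do not give. It treats every character as a classification decision and votes using the three characters to its left and right, learned from a tiny, made-up training list. The names `train`, `lift`, and `features` are hypothetical.

```python
import unicodedata
from collections import Counter, defaultdict

def strip_marks(text):
    """Remove combining diacritics, keeping one output character per input character."""
    nfd = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", "".join(c for c in nfd if not unicodedata.combining(c)))

def features(plain, i, width=3):
    """Left and right 3-character contexts around position i (space-padded)."""
    padded = " " * width + plain + " " * width
    j = i + width
    return [("L", padded[j - width:j]), ("R", padded[j + 1:j + 1 + width])]

def train(words):
    """Count which accented character follows from each (base character, context) pair."""
    model = defaultdict(Counter)
    for word in words:
        word = unicodedata.normalize("NFC", word)
        plain = strip_marks(word)
        if len(plain) != len(word):
            continue  # skip words where stripping changes the length
        for i, (base, accented) in enumerate(zip(plain, word)):
            for feat in features(plain, i):
                model[(base, feat)][accented] += 1
    return model

def lift(model, plain):
    """Restore diacritics character by character; unseen contexts leave the character as-is."""
    out = []
    for i, base in enumerate(plain):
        votes = Counter()
        for feat in features(plain, i):
            votes.update(model.get((base, feat), {}))
        out.append(votes.most_common(1)[0][0] if votes else base)
    return "".join(out)

# Tiny illustrative corpus (an assumption; real models are trained on large corpora).
model = train(["père", "mère", "terre", "belle"])
print(lift(model, "frere"))  # frère, although "frère" was never seen in training
print(lift(model, "terre"))  # terre
```

The word "frere" is restored correctly only because its "re " right-hand context was seen in "père" and "mère". That is the sense in which context statistics help with never-before-seen words.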
5. API
• Protocol: JSON
• Calls
o langs
o lift
o feedback
• Sample Call
o { "call": "charlifter.lift"
, "lang": "ht"
, "text": "Bon, la fe sa apre demen pito, le la we mwen andey."
, "locale": "ht"
}
• Full documentation at http://accentuate.us/api
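A client could assemble the sample call roughly as follows. This is a sketch under stated assumptions: the URL (including the root path) is guessed from the Host header shown in the HTTP Communication slides, `build_lift_request` is a hypothetical helper, and the request is only constructed, not sent, because the response format is not shown in this talk.

```python
import json
import urllib.request

# Assumed endpoint: host and port come from the Host header in the proxy slide;
# the "/" path is a guess.
API_URL = "http://ht.api.accentuate.us:8080/"

def build_lift_request(text, lang, locale, client_version="0.9b3"):
    """Build (but do not send) a charlifter.lift request."""
    payload = {"call": "charlifter.lift", "lang": lang, "text": text, "locale": locale}
    # Compact separators reproduce the wire format shown on the slides.
    body = json.dumps(payload, separators=(",", ":"), ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Content-Type": "application/json; charset=utf-8",
            # The speaker notes say the UA must start with "Accentuate.us/version".
            "User-Agent": f"Accentuate.us/{client_version}",
        },
        method="POST",
    )

req = build_lift_request("Bon, la fe sa apre demen pito, le la we mwen andey.", "ht", "ht")
print(req.data.decode("utf-8"))
print(len(req.data))  # 113, the Content-Length shown in the HTTP slides
```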
7. HTTP Communication (Proxy)
Cache-Control: no-cache
Connection: keep-alive
Pragma: no-cache
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Accept-Encoding: gzip,deflate
Accept-Language: en-us,en;q=0.5
Host: ht.api.accentuate.us:8080
User-Agent: Accentuate.us/0.9b3 Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.10) Gecko/20100914 Firefox/3.6.1
Content-Length: 113
Content-Type: application/json; charset=utf-8
Keep-Alive: 115
{"call":"charlifter.lift","lang":"ht","text":"Bon, la fe sa apre demen pito, le la we mwen andey.","locale":"ht"}
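Because the language and client version are both in the headers, a proxy can route without parsing the JSON body. The sketch below shows one way that might look. It assumes round-robin balancing, and the backend addresses are hypothetical; the talk names neither the balancing strategy nor the real servers. `make_router` is an illustrative name.

```python
import itertools
import re

# The UA must start with "Accentuate.us/<version>"; anything after it is ignored.
UA_PATTERN = re.compile(r"^Accentuate\.us/(\S+)")

def make_router(pools):
    """pools maps a language code to a list of backend addresses (hypothetical names)."""
    cycles = {lang: itertools.cycle(servers) for lang, servers in pools.items()}

    def route(headers):
        match = UA_PATTERN.match(headers.get("User-Agent", ""))
        if match is None:
            raise ValueError("User-Agent must start with Accentuate.us/<version>")
        # "ht.api.accentuate.us:8080" -> "ht"
        lang = headers.get("Host", "").split(".", 1)[0]
        if lang not in cycles:
            raise LookupError(f"no API servers registered for {lang!r}")
        # Round-robin across same-language servers (an assumed strategy).
        return lang, match.group(1), next(cycles[lang])

    return route

route = make_router({"ht": ["ht-1.example.org", "ht-2.example.org"]})
request_headers = {
    "Host": "ht.api.accentuate.us:8080",
    "User-Agent": "Accentuate.us/0.9b3 Mozilla/5.0",
}
print(route(request_headers))  # ('ht', '0.9b3', 'ht-1.example.org')
print(route(request_headers))  # ('ht', '0.9b3', 'ht-2.example.org')
```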
8. HTTP Communication (API)
Cache-Control: no-cache
Connection: close
Pragma: no-cache
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Accept-Encoding: gzip,deflate
Accept-Language: en-us,en;q=0.5
Host: ht
User-Agent: Accentuate.us/distribution
Content-Length: 113
Content-Type: application/json; charset=utf-8
{"call":"charlifter.lift","lang":"ht","text":"Bon, la fe sa apre demen pito, le la we mwen andey.","locale":"ht"}
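Comparing the two header dumps, the proxy appears to change exactly four things before forwarding: Host, User-Agent, Connection, and Keep-Alive. A minimal sketch of that rewrite follows; `anonymize_headers` is a hypothetical helper inferred from the dumps, not the project's actual code.

```python
def anonymize_headers(client_headers, lang):
    """Return the headers an API server would see, with client-identifying values masked."""
    forwarded = dict(client_headers)                        # leave the caller's dict untouched
    forwarded["Host"] = lang                                # e.g. "ht" instead of the public hostname
    forwarded["User-Agent"] = "Accentuate.us/distribution"  # hides the real client UA
    forwarded["Connection"] = "close"
    forwarded.pop("Keep-Alive", None)                       # not forwarded to the API server
    return forwarded

incoming = {
    "Host": "ht.api.accentuate.us:8080",
    "User-Agent": "Accentuate.us/0.9b3 Mozilla/5.0",
    "Connection": "keep-alive",
    "Keep-Alive": "115",
    "Content-Type": "application/json; charset=utf-8",
}
for name, value in anonymize_headers(incoming, "ht").items():
    print(f"{name}: {value}")
```

The client's IP address is masked simply because the API server sees the connection coming from the proxy; no header rewrite is needed for that.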
Name
19-year-old entrepreneur, student at Saint Louis University
Co-founded Spearhead with mom, Accentuate.us with Kevin Scannell of SLU

Expanded, 45-minute version online
Going to start with background, architecture, and finally some demos

- 90% loss!
- Irrevocable loss
- Each is a repository of the culture, traditions, and world view
- Akin to extinction of animal or plant species
- They’re looking to the Internet and technology for that.
So, let’s help!

- Even Unicode-encoded languages often lack appropriate input methods
- Identified problem: keyboard input

- Every character that allows a diacritic is a classification problem
- Trained with corpus of texts with diacritics
- Never-before-seen words: statistics of 3-character sequences in a neighborhood of the character in question

Simple: only three calls
- Langs: get languages & localizations
- Lift: accentuate text (legacy)
- Feedback: add to corpora, improve models

Clients send requests to load-balancing proxy "distribution center"
Proxy
    - Load balances across same-language API servers
    - Allows quick management of servers: no DNS propagation time!
    - Increases privacy (masks real UA, IP)
API servers run by language communities!
    - Makes keeping it free doable
    - Helps learn technology
    - Distributed to language hot spots (French servers for French-using zones, etc.).

Firefox API request
Blue text is most important to proxy server!
Information in headers so we don’t unpack body
UA must start with "Accentuate.us/version"
    - Analytics
    - Mismatch resolution
    - Spam prevention

Accentuate the differences: API server receives less information!
Client is not identifiable based on:
- UA
- Host
- IP
Blue parts are what is different from API request

Emacs users: stand your ground!
Version 1.0: early alpha; will
- Grab context words
- Modularize processing