WikiBhasha From Digital Inclusion to Digital Democracy… Dr A Kumaran Microsoft Research Feb 2011
“Egypt will not become liberal democracy overnight; it has a long and difficult journey…Becoming an open and empowered democracy needs a slow process of popular self-education, and it will not come easily or naturally.” Hindustan Times Editorial Feb 14, 2011
Agenda Language Technology Research (15Min) “Digital Inclusion vs. Digital Democracy” (5 Min) WikiBhasha (5 Min)
Language Technology Research: From Classical to Statistical…
Technology & Language...
3rd Cen. BC  ~O(KB-MB)
Scripts/Media for preservation/…
Print Era: Post-Gutenberg
15th Cen. AD  ~O(GB)?
Printing tech./coordination & distribution/…
Electronics & Computing Era
Late 20th Cen.  ~O(TB-PB)
Electronic standards/Multimedia/Storage tech./…
Need of the hour: Language Technology Research! Language Technology is primarily concerned with processing Natural Language data
Information Extraction
Machine Translation
Language Understanding
…Needs “Computational Linguistics” Research!
Linguistics & Computational Linguistics
Computational Linguistics studies Computational Models for languages
Lexicography - Letter-level
Morphology & Phonology - Word-level
Syntax - Sentence-level
Semantics - …
Pragmatics - …
Ex #1: Which Language? (In general, Language Identification) Is a document in English or Finnish or Tamil?  “Length of words” Near-perfect identification!
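The word-length cue can be tried out in a few lines. The sketch below is only an illustration, not the method behind the slide: the sample sentences, the function names and the nearest-mean rule are all assumptions, and a real identifier would be trained on far more text and more languages.

```python
import re

# Toy training text per language (hypothetical samples, far too small for real use).
SAMPLES = {
    "English": "the president will visit the city and open a new school for the children",
    "Finnish": "presidentti vierailee kaupungissa ja avaa uuden koulun lapsille huomenna aamulla",
}


def mean_word_length(text):
    """Average number of characters per word; 0.0 for empty input."""
    words = re.findall(r"\w+", text)
    return sum(len(w) for w in words) / len(words) if words else 0.0


def build_profiles(samples):
    """Map each language to the mean word length of its sample text."""
    return {lang: mean_word_length(text) for lang, text in samples.items()}


def identify(text, profiles):
    """Pick the language whose mean word length is closest to that of the text."""
    observed = mean_word_length(text)
    return min(profiles, key=lambda lang: abs(profiles[lang] - observed))


profiles = build_profiles(SAMPLES)
print(identify("I enjoy my tea with biscuits", profiles))
print(identify("Nautin teestäni keksien kanssa", profiles))
```

A single feature like this separates languages with very different typical word lengths; closely related languages would need richer features such as character n-grams.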
IKONE-2007 Ex #2: Which is Right? (In general, Grammar Checking & Modeling) Are these correct sentences?
S1: “Yesterday evening, I will have tea”
S2: “I enjoy my tea with biscuits”
S3: “I enjoy my tea with motor oil”
P(S1): 0.01  P(S2): 0.75  P(S3): 0.05
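One common way to assign such sentence probabilities is a statistical language model. The slide does not say which model produced its figures, so the following is a minimal sketch under stated assumptions: a bigram model with add-one smoothing, trained on a four-sentence toy corpus invented for this example. It will not reproduce the numbers above; it only shows that the plausible sentence receives the higher score.

```python
import math
from collections import Counter

# Hypothetical toy corpus; a real model would be trained on millions of sentences.
CORPUS = [
    "I enjoy my tea with biscuits",
    "I enjoy my coffee with milk",
    "we enjoy tea with biscuits",
    "the car needs motor oil",
]


def tokenize(sentence):
    """Lower-case, split on whitespace, and add sentence boundary markers."""
    return ["<s>"] + sentence.lower().split() + ["</s>"]


def train_bigram(sentences):
    """Count bigrams and their left contexts; return counts and vocabulary size."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = tokenize(sentence)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
        vocab.update(tokens[1:])
    # +1 reserves probability mass for words never seen in training.
    return contexts, bigrams, len(vocab) + 1


def log_prob(sentence, model):
    """Log-probability of a sentence under the add-one-smoothed bigram model."""
    contexts, bigrams, vocab_size = model
    tokens = tokenize(sentence)
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


model = train_bigram(CORPUS)
for s in ("I enjoy my tea with biscuits", "I enjoy my tea with motor oil"):
    print(f"{s!r}: log P = {log_prob(s, model):.2f}")
```

Note that this model scores S3 lower only because "with motor" never occurs in the corpus; it captures word co-occurrence, not meaning.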
Ex #5: Statistical MT
Parallel corpora → Statistical Models
President visits Chennai: ஜனாதிபதி சென்னை செல்கிறார்
President – ஜனாதிபதி; visits – செல்கிறார்; Chennai – சென்னை
President inaugurates Tamil Conference: ஜனாதிபதி தமிழ் மாநாட்டை துவக்குகிறார்
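How word correspondences can fall out of parallel sentences may be illustrated with IBM Model 1, a classic alignment model trained by expectation-maximization. The slide does not name the model it has in mind, so this is a swapped-in technique, and the function names are hypothetical. The corpus is just the two sentence pairs from the slide, which is enough to link "president" to its Tamil counterpart but not to separate "visits" from "Chennai".

```python
from collections import defaultdict

# The two parallel sentence pairs shown on the slide (English, Tamil).
PAIRS = [
    ("president visits chennai", "ஜனாதிபதி சென்னை செல்கிறார்"),
    ("president inaugurates tamil conference", "ஜனாதிபதி தமிழ் மாநாட்டை துவக்குகிறார்"),
]


def ibm_model1(pairs, iterations=10):
    """Estimate t[(f, e)] = P(target word f | source word e) with EM."""
    corpus = [(e.split(), f.split()) for e, f in pairs]
    target_vocab = {f for _, fs in corpus for f in fs}
    # Start from a uniform translation table.
    t = defaultdict(lambda: 1.0 / len(target_vocab))
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each target word over the source words of its sentence.
        for es, fs in corpus:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    share = t[(f, e)] / norm
                    count[(f, e)] += share
                    total[e] += share
        # M-step: renormalize the expected counts per source word.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


def best_translation(t, source_word):
    """Most probable target word for a source word under the learned table."""
    candidates = {f: p for (f, e), p in t.items() if e == source_word}
    return max(candidates, key=candidates.get)


table = ibm_model1(PAIRS)
print("president ->", best_translation(table, "president"))
```

Because "president" is the only word that appears in both sentence pairs, it is the only one the model can pin down here; every additional parallel sentence resolves more of the remaining ambiguity, which is the slide's point about data.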
For most Language Technologies… Statistical approaches EXIST, and are proven to be very successful! Data are Critical! Theorem: Data drives Research & Technology!
Where are the data? Axiom: Web = Language data. Read “Wikipedia = Language Data”


Wikibhasha by Dr A Kumaran

  • 33. WikiBhasha: Research Project on Crowd-sourcing to explore collaborative data creation for Computational Linguistic research (first focus: parallel data)
  • 34. Content Creation by Infusion… Rough content using Machine Translation; appropriate community correction to create value… Article to Target Wikipedia. WikiBABEL on Wikipedia: Collaborative Translation Cache, Machine Translation System, Linguistic Resources
  • 38. Little traction with Wikipedians Published in WikiSYM 2008 Conference; Adopted for some products in Microsoft
  • 39. WikiBhasha V2.0: Design Objectives. #1: Focus users on their purpose (say, Wikipedia): Content Creation, and not Translation. #2: In-site Solution: WikiBABEL to stay on Wikipedia for the session; submit any/all contributions. #3: Generic components, but specifically purposed: Vendor Neutrality, Componentized Architecture, …
  • 41. WikiBhasha Beta: WikiBABEL released as WikiBhasha… Content creator, and not Translator; ‘On-Wikipedia’; Open sourced. ‘Bhasha’ in Sanskrit means ‘Language’. 2010 Version; Interested in open-sourcing and contributing to Wikipedia
  • 42. WikiBhasha: User View. Designed WikiBhasha as a thin edit layer; stays on Wikipedia; user contribution submitted to Wikipedia. Diagram labels: CTF, Dictionary, Cloud Services, WikiBABEL UX, Wikipedia User Community, WikiBhasha 2.0 APIs
  • 43. WikiBhasha: Developer View. WikiBhasha designed to be modular & extendible; open-sourced, so community can contribute/enhance. Layers: MediaWiki Layer; Cloud Services Layer; WikiBhasha UI/UX/Integration Components Layer. Components: WikiBhasha CORE Components; GUI Components (Wikipedia-specific UI and Workflow); MediaWiki Software; WikiBABEL [Edit]; WikiBABEL-CORE; User-Interface (Generic UI Components, Scratch Pad, …); User-Experience (Linguistically Aware, Wiki-site Aware Workflow Engine); User Management (Authentication, User Credentials Management, User Preferences/Skills, Contributions Tracking, …); Contextual Help (Domain-specific, Context-specific, User-Contribution Aware Help…); CTF; Wikipedia; Communication (Message Boards, Email/Alert Mechanisms, Wikis, …); Linguistic Resources (Mono-/Bi-lingual Dictionaries, Thesauri, …); Content Management (Content Discovery, Versions, Tagging, Notification Lists, …); 3rd Party Linguistic Services; MediaWiki Extensions; Lang. Technology Components (Machine Translation, Transliteration, Summarization, …); Source/Target Wiki System Interface (Wiki APIs for Content Pull/Push, Content & User Management, …)
  • 44. WikiBhasha: A Community Project WikiBhasha is available as a Bookmarklet/ Wikipedia user-script Please contribute to your Wikipedia! WikiBhasha source code available as a MediaWiki Extension http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/WikiBhasha Please enhance it!
  • 45. WikiBhasha Release Released & Open-sourced in 10/’10 Announced jointly by MSR and WMF
  • 46. WikiBhasha Release: Covered in 20+ languages/countries across the world
  • 47. WikiBhasha Release ~500K Visits & ~100K Unique Visitors Visits from 50+ countries Primarily from Europe (and Eastern Europe) Many “casual visitors” who may become “contributors”!
  • 48. Community Program: Being conducted in 5 demographics: Allahabad & Banaras, Cairo, Delhi, … Objectives: Interaction with Wikipedians & Language Enthusiasts; to study community adoption, user experience, data creation, and ultimately, technology development…
  • 49. Back to “Digital Democracy”: Communities to Research…
  • 50. Languages: Communities & Technology Research requires Data Participatory Internet provides the data needed! Digital “haves and have-nots” For many languages of the world Digital Inclusion is a necessary first step Digital Democracy is a process in which the communities may have to take active part in…
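
The "content creation by infusion" loop on slide 34 (rough machine translation first, community correction afterwards, with corrections collected as parallel data) can be pictured with a small sketch. Everything here is an assumption made for illustration: the class, its methods and the stand-in translation function are hypothetical and are not WikiBhasha's or WikiBABEL's actual API.

```python
class TranslationCache:
    """Hypothetical store of community-corrected sentence translations."""

    def __init__(self, machine_translate):
        # machine_translate: any callable mapping a source sentence to a rough draft.
        self._machine_translate = machine_translate
        self._corrections = {}

    def translate(self, sentence):
        """Return (translation, origin), preferring a community correction."""
        if sentence in self._corrections:
            return self._corrections[sentence], "community"
        return self._machine_translate(sentence), "machine"

    def correct(self, sentence, fixed_translation):
        """Record a contributor's corrected translation for later reuse."""
        self._corrections[sentence] = fixed_translation

    def parallel_data(self):
        """Corrected sentence pairs, i.e. the parallel data the project is after."""
        return sorted(self._corrections.items())


def toy_mt(sentence):
    """Stand-in for a real machine translation service."""
    return "[rough] " + sentence


cache = TranslationCache(toy_mt)
print(cache.translate("President visits Chennai"))
cache.correct("President visits Chennai", "ஜனாதிபதி சென்னை செல்கிறார்")
print(cache.translate("President visits Chennai"))
print(cache.parallel_data())
```

The design point is that each correction serves twice: the next reader sees the human-approved sentence, and the accumulated pairs become training data for the statistical models discussed earlier in the talk.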

Editor's notes

  1. Hosted site One article to one article.