Popular Acronym Retrieval through Text Messaging
Praveen Yadav
praveenyadav1993@yahoo.com
Surender Singh
surendrasingh9426@gmail.com
Sukomal Pal
pal.s.cse@ismdhanbad.ac.in
Dept. of CSE
Indian School of Mines, Dhanbad
Dhanbad, 826004, India
Rishabh Kumar
kr.rishabh618@gmail.com
Harsh Singh
harshmehra31@gmail.com
ABSTRACT
We present a prototype system that provides quick information on
common and popular abbreviations through text messaging. The
system receives a text query containing an acronym (possibly
misspelled) in Roman script and returns brief information taken
from the first few lines of the corresponding English Wikipedia page.
The system is designed especially for low-cost mobile phones that
offer text-only messaging, with neither Internet access nor native
language support. The target users are primarily semi-literate
people who may not have sufficient knowledge of English. The
output is translated into the native language of the user query
(Hindi) and delivered as text transliterated into Roman script.
CCS Concepts
• Information System Application: Miscellaneous
Keywords
Web scraping, Machine Translation, Transliteration.
1. INTRODUCTION
Today the world is flooded with information. However, for a vast
section of people, getting the right information in real time is a far
cry because of their distance from the information highway. Many
people find it difficult to obtain information even on common
abbreviations they come across in daily life, owing to a lack of
technological knowledge, educational background and/or
infrastructural support (for example, low Internet penetration). The
majority of these people are not comfortable with English, although
they are well conversant in their native language. Mobile phones
have become a necessity, and almost every individual possesses at
least a basic low-cost mobile phone. Other than making calls, these
low-cost phones offer limited features such as a text message
service that primarily uses the Roman (English) script. Short Message
Service (SMS), a communication medium widely used by cellular phone
users, limits the maximum message size to 160 characters. People
using these phones can therefore communicate through SMS in either
English or their native language written in Roman script.
We present a prototype system that mobile service providers can use
to serve users' information needs from the Internet through SMS on
low-cost phones. Specifically, we provide mobile users with short,
basic information on the acronyms they request by SMS. The query
SMS contains a single acronym (possibly misspelled), written
according to the user's knowledge. The response SMS contains short
information scraped from the English Wikipedia page, translated
into the user's native language and then presented in transliterated
form using Roman script. Although a host of SMS-based services
[1, 2] are described in the patent literature, to our knowledge there
is no such service that provides information access from the Web to
mobile users, specifically where Internet penetration is low, where
mobile phones offer only bare minimal facilities such as calls and
text messaging, and where people do not have sufficient knowledge
of English.
2. SYSTEM OVERVIEW
Our prototype system is built using Java and the Web Harvest API [3].
Figure 1 provides an overview.
We collect the input, which is expected to be written in the user's
native language transliterated into Roman script, in the form of an
SMS through a Java Applet interface. The software works in four
basic steps, described below.
2.1 Input Processing
Input from users is obtained in their native language in Roman
transliterated script. Since there is no universally accepted
transliteration rule, there can be several variations for the same
word in Roman script. More importantly, we are considering input
from people whose language skills are limited. Therefore, the
actual English abbreviation that the user is interested in must be
deciphered from his/her idiosyncratic input, which he/she has
written from his/her own mental mapping. As a result, there may be
some errors in the user-provided abbreviation.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. Copyrights for
components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to
post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from Permissions@acm.org.
FIRE '14, December 05-07, 2014, Bangalore, India
© 2015 ACM. ISBN 978-1-4503-3755-7/15/08…$15.00.
DOI: http://dx.doi.org/10.1145/2824864.2824889
Figure 1. System Overview
To learn what types of errors we are dealing with, we conducted a
survey of people who were not well versed in English. We asked
each of them several queries [4]. We found that the following
types of error are primarily committed by users:
a) Extra Vowel: e.g., SMS may be written as ASAMAS,
ESEMES. SP may be written as ASP, ESPI etc.
b) Vowel Deficiency: e.g., IAS may be written as AS. AIDS
may be written as ADS.
c) Extra Consonant usage: e.g., CNG may be written as
CNGG.
d) Wrong Consonant usage: e.g., B.Tech may be written as
V.Tech. KVPY may be written as KBPY.
e) Other errors: Vowel replacement error e.g., CEO may be
written as CIO. Typing error e.g., IIPQ may be written as
IIPO. Same Phonetic sound error e.g., UPSC may be
written as UPSE, NEWS may be written as NEUS.
Figure 2 shows the percentage of each of these types of error.
We focused mainly on the first and second kinds of error, as they
were committed most frequently. We applied heuristic techniques
based on character n-grams (stored in an inverted index) and edit
distance to match the input against a database of abbreviations
collected in advance. We created this list manually from various
online sources. It contains all kinds of abbreviations, ranging from
governmental organizations to the field of education. We also
created a list of common Hindi terms used in framing ‘Wh’-type
questions (e.g., ‘kya’ (क्या), ‘kee’ (कि), ‘kyon’ (क्यों), ‘kaahaan’
(कहाँ), etc.).
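To make the notion of edit distance concrete, the following Python sketch computes it for a few of the misspellings listed above. It assumes the plain Levenshtein distance (unit cost for each insertion, deletion and substitution); the text does not state which variant the prototype uses, and the prototype itself is written in Java, so the function name `edit_distance` is illustrative only.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (assumed variant)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


# Misspelling / intended abbreviation pairs taken from the error list.
print(edit_distance("ASAMAS", "SMS"))  # 3 (three extra vowels)
print(edit_distance("AS", "IAS"))      # 1 (one missing vowel)
print(edit_distance("CNGG", "CNG"))    # 1 (one extra consonant)
print(edit_distance("KBPY", "KVPY"))   # 1 (one wrong consonant)
```

Extra-vowel errors produce the largest distances in these examples, which is consistent with treating vowels separately before matching.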
To find matches for queries containing abbreviations (possibly
with spelling errors), we distributed the 1,500 words across a large
number of files. We used the inverted index technique to store the
abbreviations and to find the correct one from these files. For
every abbreviation in the database, we generated all character
bigrams. For example, the abbreviation ‘SAARC’ yields four
bigrams: SA, AA, AR, and RC. For each such bigram we created a
file and stored the word ‘SAARC’ in it, so that ‘SAARC’ appears
in four files, namely SA.txt, AA.txt, AR.txt, and RC.txt. The same
is done for each abbreviation. There are at most 26x26 possible
bigrams, and therefore 26x26 files are created. Some files hold a
number of abbreviations, while others remain empty; typically a
single file contains zero to only a few entries [5]. We did the same
for character trigrams, leading to 26x26x26 files.
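The n-gram inverted index described above can be sketched in a few lines of Python. This is a minimal illustration, not the prototype's Java code: it keeps the index in an in-memory dictionary instead of one text file per n-gram, and the three-entry abbreviation list and the names `char_ngrams` and `build_index` are assumptions made for the example.

```python
from collections import defaultdict


def char_ngrams(word, n):
    """All contiguous character n-grams of word, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def build_index(words, n):
    """Map each n-gram to the set of words that contain it.

    Each dictionary key plays the role of one file (e.g. SA.txt).
    """
    index = defaultdict(set)
    for word in words:
        for gram in char_ngrams(word, n):
            index[gram].add(word)
    return index


abbreviations = ["SAARC", "IRCTC", "SMS"]  # hypothetical sample list
bigram_index = build_index(abbreviations, 2)
trigram_index = build_index(abbreviations, 3)

print(char_ngrams("SAARC", 2))     # ['SA', 'AA', 'AR', 'RC']
print(sorted(bigram_index["RC"]))  # ['IRCTC', 'SAARC']
```

Unlike the file-based layout, a dictionary only creates entries for n-grams that actually occur, so the empty files mentioned above have no counterpart here.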
For a Hindi word like ‘KYA’ (क्या) (meaning ‘what’ in English), the
bigrams KY and YA are generated, and there is only one trigram,
namely ‘KYA’. When the application receives a query, it scans all
the words. The user input is in mixed language (we considered
Hindi and English written in Roman script) and potentially contains
many typos and wrong spellings. We therefore first eliminate the
vowels from each word and then search through the files generated
by the n-gram technique. Each input word is mapped to the word in
the lists (both the Hindi and the English list) that occurs most
frequently in the matching bigram and trigram files. In case of
several matches, the word requiring the least modification is
chosen, with ties broken arbitrarily. Let us illustrate the algorithm
with the example query “aaiaarctc kya hai” (आईआरसीटीसी क्या है)
(meaning “what is ‘aaiaarctc’?”).
Table 1. Input Processing Data
Word (o)     Vowels removed (w)   # chars remaining (p)
AAIAARCTC    RCTC                 4
KYA          KY                   2
HAI          H                    1
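The values in Table 1 can be reproduced with a short Python sketch. It assumes that only A, E, I, O and U count as vowels, since the table keeps Y in KYA; the helper name `strip_vowels` is hypothetical.

```python
VOWELS = set("AEIOU")  # assumption: Y is treated as a consonant (KYA -> KY)


def strip_vowels(word):
    """Uppercase the word and drop its vowels, giving w from o."""
    return "".join(ch for ch in word.upper() if ch not in VOWELS)


for o in "aaiaarctc kya hai".split():
    w = strip_vowels(o)
    print(o.upper(), w, len(w))
# AAIAARCTC RCTC 4
# KYA KY 2
# HAI H 1
```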
The query is converted to uppercase and then the following steps are
performed for each word:
a) Vowels are removed, because spelling errors occur mostly in
the use of vowels.
b) The number of remaining letters (p) is checked to make the
following decisions:
i) If (p>=3), check only the tri-gram files for the word
w or its tri-gram subsequences. Here, the files
named RCT and CTC are searched to check
whether their words contain some subsequence
of RCTC. If there is a matching entry in all such
files, that particular abbreviation is chosen. If
more than one entry contains this subsequence,
we choose the word with the highest number of
occurrences among these n-gram files. The least
edit distance from the query word (o) is used to
break ties, and any remaining tie is broken
arbitrarily.
ii) If (p == 2), check only the bi-gram files for the word
w or its bi-gram subsequences, as above. Here, the
file named KY is selected and searched for
entries containing KY. If there is a matching
entry, that particular word is chosen.
iii) If (p<2), check only the bi-gram files for the
original word before vowel removal (o) or its
bi-gram subsequences. Here, the files HA and AI
are searched to check whether they contain
some subsequence of HAI, to get the required
matching word.
c) We check the chosen word. The word may come from
the Hindi file, the English file, or both:
i) The word is from the Hindi file only: ignore the
word, as it is not an abbreviation.
ii) The word is from the English file only: the word
is processed further, as it is assumed to be an
acronym.
iii) Two different words are returned from the
Hindi and English files: choose the word with the
minimum edit distance from the original input and
then treat it as case i) or ii).
d) The chosen abbreviation is looked up in our
collection of abbreviations and its expanded form is
extracted. The steps are summarized in Figure 3.
Figure 2. Types of Error
We assumed that a single query contains at most one acronym,
supplemented with zero or more Hindi words.
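Steps a) to c) can be put together in a rough Python sketch. It is an interpretation under stated assumptions rather than the prototype's Java implementation: the index is held in memory, edit distance is plain Levenshtein, the "arbitrary" tie-break is made alphabetical so that runs are repeatable, and a word that is equally close to a Hindi and an English entry is treated as Hindi. The word lists and all function names are hypothetical.

```python
from collections import Counter, defaultdict

VOWELS = set("AEIOU")  # assumption: Y is kept, as in Table 1


def ngrams(word, n):
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def build_index(words, n):
    index = defaultdict(set)  # n-gram -> words containing it
    for word in words:
        for gram in ngrams(word, n):
            index[gram].add(word)
    return index


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def best_match(original, bigrams, trigrams):
    """Steps a) and b): best entry for one query word, or None."""
    o = original.upper()
    w = "".join(ch for ch in o if ch not in VOWELS)
    if len(w) >= 3:
        grams, index = ngrams(w, 3), trigrams  # case i
    elif len(w) == 2:
        grams, index = ngrams(w, 2), bigrams   # case ii
    else:
        grams, index = ngrams(o, 2), bigrams   # case iii: use original word
    # Count in how many of the searched n-gram "files" each entry occurs.
    votes = Counter(c for g in grams for c in index.get(g, ()))
    if not votes:
        return None
    top = max(votes.values())
    tied = [c for c, v in votes.items() if v == top]
    # Ties: least edit distance to o, then alphabetical (assumed).
    return min(tied, key=lambda c: (edit_distance(o, c), c))


def find_acronym(query, english, hindi):
    """Step c): return the first word judged to be an English acronym."""
    eng2, eng3 = build_index(english, 2), build_index(english, 3)
    hin2, hin3 = build_index(hindi, 2), build_index(hindi, 3)
    for word in query.split():
        o = word.upper()
        e = best_match(word, eng2, eng3)
        h = best_match(word, hin2, hin3)
        if e is None:
            continue  # Hindi-only or unknown word: ignore
        if h is not None and edit_distance(o, h) <= edit_distance(o, e):
            continue  # closer to a Hindi term: ignore
        return e
    return None


ENGLISH = ["IRCTC", "SAARC", "SMS", "IAS"]  # hypothetical sample entries
HINDI = ["KYA", "HAI", "KYON"]
print(find_acronym("aaiaarctc kya hai", ENGLISH, HINDI))  # IRCTC
print(find_acronym("esemes kya hai", ENGLISH, HINDI))     # SMS
```

Returning at the first English match reflects the assumption that a query holds at most one acronym.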
2.2 System Efficiency
We tested our system on the user data we collected, comparing
two options:
a) Removal of vowels: we remove the vowels from the
user query before searching.
b) Considering vowels: we search directly, without removing
the vowels.
Figure 4 shows the results.
2.3 Data Extraction From Wikipedia
Given a chosen acronym, we first look it up in our associative array
of collected acronyms and their expansions. The expansion is used
to generate the URL passed to the Web Harvest API (open source),
which returns the content of the corresponding Wikipedia page in
XML format. Since we need to provide only a brief definition or
introduction of the abbreviation that can fit within an SMS, we are
interested only in the first few lines (1 or 2 sentences), up to 200
characters, of the Wikipedia page. The content so obtained is stored
in a temporary file.
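The trimming step can be illustrated with a small Python sketch that works on already extracted plain text. It does not use Web Harvest or fetch any page; the sentence-splitting rule, the fallback to a hard cut, and the function name `brief_intro` are assumptions made for the example.

```python
import re


def brief_intro(text, limit=200, max_sentences=2):
    """Keep at most max_sentences leading sentences within limit characters."""
    # Naive split after '.', '!' or '?' followed by whitespace (assumed rule).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    out = ""
    for sentence in sentences[:max_sentences]:
        candidate = (out + " " + sentence).strip()
        if len(candidate) > limit:
            break
        out = candidate
    # If even the first sentence is too long, fall back to a hard cut.
    return out or text.strip()[:limit]


sample = "First sentence. Second sentence. Third sentence."
print(brief_intro(sample))            # First sentence. Second sentence.
print(brief_intro(sample, limit=20))  # First sentence.
```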
2.4 Translation of Data
The stored data is sent to Google Translate, and we then use Yahoo
Query Language (YQL) [6] to retrieve the translation. We use the Web
Harvest API once again to extract the translated text. The extracted
content is stored in another temporary file.
2.5 Transliteration of Data
For transliteration, the ICU4J [7] library is used. ICU4J is a set
of Java libraries that provides comprehensive support for
Unicode, software globalization and internationalization. The
translated data is passed to a method that uses this library, and the
resulting transliterated text is presented to the user through
a Java Applet. The output is shown in Figure 5.
3. CONCLUSION
We developed a prototype application that mobile service providers
can use to cater to the information needs of their customers
through SMS. Specifically, we attempt to address the needs of
semi-literate users of low-cost mobile phones on which neither
Internet access nor local language support is available. We used
translation and transliteration through Web applications and
libraries, along with Web scraping, to process the request and
provide the answer in the native language using Roman script. We
believe this software can be immensely useful for information
dissemination and access where Internet penetration is low and
people's knowledge of English is limited.
4. ACKNOWLEDGEMENTS
We wish to thank Divesh Sanjay Kothari and Abhinay Saraswat,
Department of CSE, ISM Dhanbad for their all-round help.
5. ADDITIONAL AUTHORS
Ashok Kumar (ISM Dhanbad, ashokdavas@gmail.com), L.
Gautam (ISM Dhanbad, gtam25@gmail.com), Abhishek Ranjan
(ISM Dhanbad, aksharudarya@gmail.com )
Figure 3. Input Processing Steps
Figure 4. System Efficiency
Figure 5. Input Output Panel
6. REFERENCES
[1] S. Lothia, W. James, and B. Hwang. System and methods for
providing subscriber-initiated information over the short
message service (SMS) or a micro browser, May 6, 2003. US
Patent 6,560,456.
[2] J. Salonen. SMS inquiry and invitation distribution method
and system, Mar. 12, 2013. US Patent RE44,073.
[3] V. Nikic and A. Wajda. Web Harvest, version 2.0, February
2010. As on June 25, 2014.
[4] User Query Data:
https://www.dropbox.com/s/kbwabem29f0mwu9/data.txt?dl=0
[5] S. K. D S Kothari, A Saraswat and S. Pal. FAQ Retrieval
using Noisy Queries. In FIRE 2013 Workshop
Pre-Proceedings, December 2013.
[6] YQL Console: https://developer.yahoo.com/yql/
[7] ICU User Guide. As on June 25, 2014.
[8] Video Demo Link:
https://www.dropbox.com/s/l270iq3gnhgafvy/PARTM.mp4?dl=0

Contenu connexe

Tendances

International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
A new hybrid metric for verifying
A new hybrid metric for verifyingA new hybrid metric for verifying
A new hybrid metric for verifyingcsandit
 
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorDynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorWaqas Tariq
 
Implementation of Marathi Language Speech Databases for Large Dictionary
Implementation of Marathi Language Speech Databases for Large DictionaryImplementation of Marathi Language Speech Databases for Large Dictionary
Implementation of Marathi Language Speech Databases for Large Dictionaryiosrjce
 
Through-Mail Feature: An Enhancement to Contemporary Email Services
Through-Mail Feature: An Enhancement to Contemporary Email ServicesThrough-Mail Feature: An Enhancement to Contemporary Email Services
Through-Mail Feature: An Enhancement to Contemporary Email ServicesIRJESJOURNAL
 
Efficiency lossless data techniques for arabic text compression
Efficiency lossless data techniques for arabic text compressionEfficiency lossless data techniques for arabic text compression
Efficiency lossless data techniques for arabic text compressionijcsit
 
Paper id 24201469
Paper id 24201469Paper id 24201469
Paper id 24201469IJRAT
 
An Improved Approach for Word Ambiguity Removal
An Improved Approach for Word Ambiguity RemovalAn Improved Approach for Word Ambiguity Removal
An Improved Approach for Word Ambiguity RemovalWaqas Tariq
 
Spell checker for Kannada OCR
Spell checker for Kannada OCRSpell checker for Kannada OCR
Spell checker for Kannada OCRdbpublications
 
Brill's Rule-based Part of Speech Tagger for Kadazan
Brill's Rule-based Part of Speech Tagger for KadazanBrill's Rule-based Part of Speech Tagger for Kadazan
Brill's Rule-based Part of Speech Tagger for Kadazanidescitation
 
Robust Text Watermarking Technique for Authorship Protection of Hindi Languag...
Robust Text Watermarking Technique for Authorship Protection of Hindi Languag...Robust Text Watermarking Technique for Authorship Protection of Hindi Languag...
Robust Text Watermarking Technique for Authorship Protection of Hindi Languag...CSCJournals
 
Analysis review on feature-based and word-rule based techniques in text stega...
Analysis review on feature-based and word-rule based techniques in text stega...Analysis review on feature-based and word-rule based techniques in text stega...
Analysis review on feature-based and word-rule based techniques in text stega...journalBEEI
 
Punjabi Text Classification using Naïve Bayes, Centroid and Hybrid Approach
Punjabi Text Classification using Naïve Bayes, Centroid and Hybrid ApproachPunjabi Text Classification using Naïve Bayes, Centroid and Hybrid Approach
Punjabi Text Classification using Naïve Bayes, Centroid and Hybrid Approachcscpconf
 
Scalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large FoldersScalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large Foldersfeiwin
 
Review on feature-based method performance in text steganography
Review on feature-based method performance in text steganographyReview on feature-based method performance in text steganography
Review on feature-based method performance in text steganographyjournalBEEI
 
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...kevig
 
The effect of training set size in authorship attribution: application on sho...
The effect of training set size in authorship attribution: application on sho...The effect of training set size in authorship attribution: application on sho...
The effect of training set size in authorship attribution: application on sho...IJECEIAES
 

Tendances (20)

International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
A new hybrid metric for verifying
A new hybrid metric for verifyingA new hybrid metric for verifying
A new hybrid metric for verifying
 
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorDynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
 
C8 akumaran
C8 akumaranC8 akumaran
C8 akumaran
 
Implementation of Marathi Language Speech Databases for Large Dictionary
Implementation of Marathi Language Speech Databases for Large DictionaryImplementation of Marathi Language Speech Databases for Large Dictionary
Implementation of Marathi Language Speech Databases for Large Dictionary
 
Through-Mail Feature: An Enhancement to Contemporary Email Services
Through-Mail Feature: An Enhancement to Contemporary Email ServicesThrough-Mail Feature: An Enhancement to Contemporary Email Services
Through-Mail Feature: An Enhancement to Contemporary Email Services
 
Efficiency lossless data techniques for arabic text compression
Efficiency lossless data techniques for arabic text compressionEfficiency lossless data techniques for arabic text compression
Efficiency lossless data techniques for arabic text compression
 
An Application for Performing Real Time Speech Translation in Mobile Environment
An Application for Performing Real Time Speech Translation in Mobile EnvironmentAn Application for Performing Real Time Speech Translation in Mobile Environment
An Application for Performing Real Time Speech Translation in Mobile Environment
 
Paper id 24201469
Paper id 24201469Paper id 24201469
Paper id 24201469
 
An Improved Approach for Word Ambiguity Removal
An Improved Approach for Word Ambiguity RemovalAn Improved Approach for Word Ambiguity Removal
An Improved Approach for Word Ambiguity Removal
 
Spell checker for Kannada OCR
Spell checker for Kannada OCRSpell checker for Kannada OCR
Spell checker for Kannada OCR
 
Brill's Rule-based Part of Speech Tagger for Kadazan
Brill's Rule-based Part of Speech Tagger for KadazanBrill's Rule-based Part of Speech Tagger for Kadazan
Brill's Rule-based Part of Speech Tagger for Kadazan
 
Robust Text Watermarking Technique for Authorship Protection of Hindi Languag...
Robust Text Watermarking Technique for Authorship Protection of Hindi Languag...Robust Text Watermarking Technique for Authorship Protection of Hindi Languag...
Robust Text Watermarking Technique for Authorship Protection of Hindi Languag...
 
Analysis review on feature-based and word-rule based techniques in text stega...
Analysis review on feature-based and word-rule based techniques in text stega...Analysis review on feature-based and word-rule based techniques in text stega...
Analysis review on feature-based and word-rule based techniques in text stega...
 
Punjabi Text Classification using Naïve Bayes, Centroid and Hybrid Approach
Punjabi Text Classification using Naïve Bayes, Centroid and Hybrid ApproachPunjabi Text Classification using Naïve Bayes, Centroid and Hybrid Approach
Punjabi Text Classification using Naïve Bayes, Centroid and Hybrid Approach
 
Scalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large FoldersScalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large Folders
 
Jq3616701679
Jq3616701679Jq3616701679
Jq3616701679
 
Review on feature-based method performance in text steganography
Review on feature-based method performance in text steganographyReview on feature-based method performance in text steganography
Review on feature-based method performance in text steganography
 
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
 
The effect of training set size in authorship attribution: application on sho...
The effect of training set size in authorship attribution: application on sho...The effect of training set size in authorship attribution: application on sho...
The effect of training set size in authorship attribution: application on sho...
 

Similaire à Submission_36

Voice based web browser
Voice based web browserVoice based web browser
Voice based web browserSowndaryaP
 
TEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKS
TEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKSTEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKS
TEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKSijdms
 
IRJET- Voice based Billing System
IRJET-  	  Voice based Billing SystemIRJET-  	  Voice based Billing System
IRJET- Voice based Billing SystemIRJET Journal
 
Allin Qillqay A Free On-Line Web Spell Checking Service For Quechua
Allin Qillqay  A Free On-Line Web Spell Checking Service For QuechuaAllin Qillqay  A Free On-Line Web Spell Checking Service For Quechua
Allin Qillqay A Free On-Line Web Spell Checking Service For QuechuaAndrea Porter
 
DEVELOPMENT OF TOOL TO PROMOTE WEB ACCESSIBILITY FOR DEAF
DEVELOPMENT OF TOOL TO PROMOTE WEB ACCESSIBILITY FOR DEAFDEVELOPMENT OF TOOL TO PROMOTE WEB ACCESSIBILITY FOR DEAF
DEVELOPMENT OF TOOL TO PROMOTE WEB ACCESSIBILITY FOR DEAFcsandit
 
Mit302 web technologies
Mit302 web technologiesMit302 web technologies
Mit302 web technologiessmumbahelp
 
IRJET- Vernacular Language Spell Checker & Autocorrection
IRJET- Vernacular Language Spell Checker & AutocorrectionIRJET- Vernacular Language Spell Checker & Autocorrection
IRJET- Vernacular Language Spell Checker & AutocorrectionIRJET Journal
 
English to punjabi machine translation system using hybrid approach of word s
English to punjabi machine translation system using hybrid approach of word sEnglish to punjabi machine translation system using hybrid approach of word s
English to punjabi machine translation system using hybrid approach of word sIAEME Publication
 
Compiler_Lecture1.pdf
Compiler_Lecture1.pdfCompiler_Lecture1.pdf
Compiler_Lecture1.pdfAkarTaher
 
On Developing an Automatic Speech Recognition System for Commonly used Englis...
On Developing an Automatic Speech Recognition System for Commonly used Englis...On Developing an Automatic Speech Recognition System for Commonly used Englis...
On Developing an Automatic Speech Recognition System for Commonly used Englis...rahulmonikasharma
 
Speech to text conversion for visually impaired person using µ law companding
Speech to text conversion for visually impaired person using µ law compandingSpeech to text conversion for visually impaired person using µ law companding
Speech to text conversion for visually impaired person using µ law compandingiosrjce
 
B tech project_report
B tech project_reportB tech project_report
B tech project_reportabhiuaikey
 

Similaire à Submission_36 (20)

I1 geetha3 revathi
I1 geetha3 revathiI1 geetha3 revathi
I1 geetha3 revathi
 
Voice based web browser
Voice based web browserVoice based web browser
Voice based web browser
 
TEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKS
TEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKSTEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKS
TEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKS
 
IRJET- Voice based Billing System
IRJET-  	  Voice based Billing SystemIRJET-  	  Voice based Billing System
IRJET- Voice based Billing System
 
Allin Qillqay A Free On-Line Web Spell Checking Service For Quechua
Allin Qillqay  A Free On-Line Web Spell Checking Service For QuechuaAllin Qillqay  A Free On-Line Web Spell Checking Service For Quechua
Allin Qillqay A Free On-Line Web Spell Checking Service For Quechua
 
Ijetcas14 444
Ijetcas14 444Ijetcas14 444
Ijetcas14 444
 
Arabic MT Project
Arabic MT ProjectArabic MT Project
Arabic MT Project
 
Moses
MosesMoses
Moses
 
DEVELOPMENT OF TOOL TO PROMOTE WEB ACCESSIBILITY FOR DEAF
DEVELOPMENT OF TOOL TO PROMOTE WEB ACCESSIBILITY FOR DEAFDEVELOPMENT OF TOOL TO PROMOTE WEB ACCESSIBILITY FOR DEAF
DEVELOPMENT OF TOOL TO PROMOTE WEB ACCESSIBILITY FOR DEAF
 
Mit302 web technologies
Mit302 web technologiesMit302 web technologies
Mit302 web technologies
 
3.2
3.23.2
3.2
 
IRJET- Vernacular Language Spell Checker & Autocorrection
IRJET- Vernacular Language Spell Checker & AutocorrectionIRJET- Vernacular Language Spell Checker & Autocorrection
IRJET- Vernacular Language Spell Checker & Autocorrection
 
English to punjabi machine translation system using hybrid approach of word s
English to punjabi machine translation system using hybrid approach of word sEnglish to punjabi machine translation system using hybrid approach of word s
English to punjabi machine translation system using hybrid approach of word s
 
Bt0076, tcpip
Bt0076, tcpipBt0076, tcpip
Bt0076, tcpip
 
Compiler_Lecture1.pdf
Compiler_Lecture1.pdfCompiler_Lecture1.pdf
Compiler_Lecture1.pdf
 
On Developing an Automatic Speech Recognition System for Commonly used Englis...
On Developing an Automatic Speech Recognition System for Commonly used Englis...On Developing an Automatic Speech Recognition System for Commonly used Englis...
On Developing an Automatic Speech Recognition System for Commonly used Englis...
 
H010625862
H010625862H010625862
H010625862
 
Speech to text conversion for visually impaired person using µ law companding
Speech to text conversion for visually impaired person using µ law compandingSpeech to text conversion for visually impaired person using µ law companding
Speech to text conversion for visually impaired person using µ law companding
 
B tech project_report
B tech project_reportB tech project_report
B tech project_report
 
Bt0076, tcpip
Bt0076, tcpipBt0076, tcpip
Bt0076, tcpip
 

Submission_36

  • 1. Popular Acronym Retrieval through Text Messaging Praveen Yadav praveenyadav1993@yahoo.com Surender Singh surendrasingh9426@gmail.com Sukomal Pal pal.s.cse@ismdhanbad.ac.in Dept. of CSE Indian School of Mines, Dhanbad Dhanbad, 826004, India Rishabh Kumar kr.rishabh618@gmail.com Harsh Singh harshmehra31@gmail.com ABSTRACT We present a prototype system for providing quick information on common and popular abbreviations through text messaging. The system receives a text input of acronym (possibly wrongly typed) in Roman script. The application returns a very brief information from the first few lines of corresponding English Wikipedia page. The system is designed especially for low-cost mobile phones having text only messaging facility but without Internet and native language support. The target users are primarily semi-literate people who may not have sufficient knowledge of English. The output is translated to native language of user query (Hindi) as transliterated text. CCS Concepts • Information System Application: Miscellaneous Keywords Web-scrapping, Machine Translation, Transliteration. 1. INTRODUCTION Today world is flooded with information. However, for a vast section of people, getting right information in real-time is a far cry because of their distance from information highway. Many people find it difficult to obtain information even on common abbreviations they come across in their daily life due to lack of technological knowledge, educational background and/or infrastructural support like low Internet penetration. Majority of these people are not comfortable with English. However, they are well-conversant in their native language. Today mobile phones have become necessity for humans and hence almost every individual possesses at least basic low-cost mobile phones. These low cost mobile phones, other than for making calls, offer limited features like text message service using primarily English scripts. 
Short Message Service (SMS), a communication medium broadly used by cellular phone users limit maximum message size (<=160 characters). People using these mobile phones can therefore communicate through SMS in either English or native language using Roman script. We present a prototype system that can be used by the mobile service providers to cater to the information need of users from the Internet through SMS’s in low-cost phones. Specifically, we provide mobile users short and basic information on their acronym requests through SMS facility. The query SMS will have a single acronym (possibly wrongly typed) based on user’s knowledge. The response SMS will have short information scrapped from English Wikipedia page which is translated in user’s native language and then presented in transliterated form using Roman script. Although there are a host of SMS based service [1, 2] available in patent literature, to our knowledge, there is no such service towards providing information access from web to mobile users, specifically where Internet penetration is low or mobile phones offer bare minimal facilities like call and text messaging only and people do not have sufficient knowledge in English. 2. SYSTEM OVERVIEW Our prototype system is built using Java and Web Harvest API [3]. Figure 1 provides an overview. We collect the input which is supposed to be written in native language but transliterated using Roman script in the form of SMS through a Java Applet interface. The working of this software is performed in four basic steps as given below. 2.1 Input Processing Input from users is obtained in their native language in Roman transliterated script. Since there is no universally accepted transliteration rule, there can be several variations for the same word in Roman script. More importantly, we are considering input from people whose language skills are compromised. 
Therefore, the English abbreviation that the user is actually interested in must be deciphered from his/her idiosyncratic input, which is written from the user's own mental mapping and may therefore contain errors. To learn what types of errors we are dealing with, we surveyed people who were not well versed in English and asked each of them several queries [4]. We found that users primarily commit the following types of error:
a) Extra vowel: e.g., SMS may be written as ASAMAS or ESEMES; SP may be written as ASP, ESPI, etc.
b) Vowel deficiency: e.g., IAS may be written as AS; AIDS may be written as ADS.
c) Extra consonant: e.g., CNG may be written as CNGG.
d) Wrong consonant: e.g., B.Tech may be written as V.Tech; KVPY may be written as KBPY.
e) Other errors: vowel replacement (e.g., CEO written as CIO), typing errors (e.g., IIPQ written as IIPO), and same-phonetic-sound errors (e.g., UPSC written as UPSE, NEWS written as NEUS).

Figure 2 shows the percentage of each type of error. We focused mainly on the first and second kinds of error, as they were committed most frequently.

Figure 1. System Overview

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
FIRE '14, December 05-07, 2014, Bangalore, India
© 2015 ACM. ISBN 978-1-4503-3755-7/15/08…$15.00.
DOI: http://dx.doi.org/10.1145/2824864.2824889

We applied heuristic techniques based on character N-grams (an inverted index) and edit distance to match queries against a database of abbreviations collected beforehand. We created this list manually from various online sources; it contains all kinds of abbreviations, ranging from governmental organizations to the education field. We also created a list of common Hindi terms that are used in framing 'Wh'-type questions (e.g., 'kya' (क्या), 'kee' (कि), 'kyon' (क्यों), 'kaahaan' (कहाँ), etc.).

To find matches for queries containing abbreviations (possibly with spelling errors), we stored the 1500 words spread across a large number of files, using the inverted index technique to store and look up the correct abbreviations. For the abbreviations database, we generated all possible character bigrams.
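The character n-gram inverted index described here can be sketched in a few lines. This is a minimal illustration in Python rather than the prototype's Java code, and it rests on assumptions that are not in the paper: an in-memory dictionary stands in for the per-n-gram text files, and the function names char_ngrams and build_index, as well as the four-word sample list, are hypothetical.

```python
from collections import defaultdict


def char_ngrams(word, n):
    """Return the contiguous character n-grams of word, in order."""
    word = word.upper()
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def build_index(words, n):
    """Map every n-gram to the set of words that contain it.

    Assumption: a dict replaces the one-file-per-n-gram layout
    (SA.txt, AA.txt, ...) used by the prototype.
    """
    index = defaultdict(set)
    for word in words:
        for gram in char_ngrams(word, n):
            index[gram].add(word.upper())
    return index


abbreviations = ["SAARC", "SMS", "IAS", "AIDS"]  # hypothetical sample list
bigram_index = build_index(abbreviations, 2)

print(char_ngrams("SAARC", 2))  # ['SA', 'AA', 'AR', 'RC']
```

Calling build_index with n = 3 gives the corresponding trigram index; n-grams that occur in no abbreviation simply have no entry, which mirrors the empty files of the file-based layout.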
For example, the abbreviation 'SAARC' yields four bigram entries: SA, AA, AR, and RC. For each such bigram we created a file and stored the word 'SAARC' in it, so 'SAARC' appears in four files, namely SA.txt, AA.txt, AR.txt, and RC.txt. The same exercise is carried out for every abbreviation. There are at most 26x26 possible bigrams, and therefore 26x26 files are created. Some files hold entries for several abbreviations, while others remain empty; typically a single file contains zero to only a few entries [5]. We did the same for character trigrams, leading to 26x26x26 files. For a Hindi word like 'KYA' (क्या, meaning 'what' in English), the bigrams KY and YA are generated, and there is only one trigram file, namely 'KYA'.

When the application receives a query, it scans all the words. Since user input is in mixed language (we considered Hindi and English written in Roman script) and potentially contains many typos and misspellings, we first eliminate vowels from each word and then search through the files generated by the N-gram technique. Each input word is mapped to the word in the lists (both Hindi and English) that occurs most frequently in the matching bigram and trigram files. In case of several matches, the word requiring the least modification is chosen, with ties broken arbitrarily.

Let us illustrate the algorithm with the example query "aaiaarctc kya hai" (आईआरसीटीसी क्या है), which means "what is 'aaiaarctc'?".

Table 1. Input Processing Data
Word (o)     Vowels removed (w)   Characters remaining (p)
AAIAARCTC    RCTC                 4
KYA          KY                   2
HAI          H                    1

The query is converted to uppercase and then the following steps are performed for each word:
a) Vowels are removed, because spelling errors occur mainly in the use of vowels.
b) The number of remaining letters (p) is checked to make the following decisions:
   i) If p >= 3, check only the trigram files for the word w or its trigram subsequences.
      Here, the files named RCT and CTC are searched for words containing some subsequence of RCTC. If an entry matches in all such files, that abbreviation is chosen. If more than one entry contains such a subsequence, we choose the word with the highest number of occurrences among these n-gram files. Ties are broken by the least edit distance from the query word (o), and if several words still remain, one of them is chosen arbitrarily.
   ii) If p == 2, check only the bigram files for the word w or its bigram subsequences, as above. Here, the file named KY is selected and searched for entries containing some subsequence of KY. If there is a matching entry, that word is chosen.
   iii) If p < 2, check only the bigram files for the original word before vowel removal (o) or its bigram subsequences. Hence the files HA and AI are searched for entries containing some subsequence of HAI to get the required matching word.
c) We check the chosen word. The word may come from the Hindi files, the English files, or both:
   i) The word is from the Hindi files only: ignore the word, as it is not an abbreviation.
   ii) The word is from the English files only: process the word further, as it is assumed to be an acronym.
   iii) Two different words are returned from the Hindi and English files: choose the word having the minimum edit distance from the original input and then treat it as case i) or ii).
d) The chosen abbreviation is looked up in our collection of abbreviations and its expanded form is extracted.

The steps are summarized in Figure 3. We assumed that a single query contains at most one acronym, supplemented with zero or more Hindi words.

Figure 2. Types of Error

2.2 System Efficiency
We tested our system on the user data we collected, considering two different options:
a) Removal of vowels: vowels are removed from the user query before searching.
b) Considering vowels: the query is searched directly, without removing vowels.
Figure 4 shows the results.

2.3 Data Extraction From Wikipedia
Given a chosen acronym, we first look it up in our associative array of collected acronyms and their expansions. The expansion is used to generate the URL passed to the (open source) Web Harvest API, which returns the content of the corresponding Wikipedia page in XML format. Since we need only a brief definition or introduction of the abbreviation that can fit within an SMS, we are interested only in the first few lines (1 or 2 sentences), up to 200 characters, of the Wikipedia page. The content so obtained is stored in a temporary file.

2.4 Translation of Data
The stored data is sent to Google Translate, and we use Yahoo Query Language (YQL) [6] to retrieve the translation. We use the Web Harvest API once again to extract the translated text. The extracted content is stored in another temporary file.

2.5 Transliteration of Data
For transliteration, the ICU4J [7] library is used.
ICU4J is a set of Java libraries that provides comprehensive support for Unicode, software globalization, and internationalization. The translated data is passed to a method that uses this library to generate the transliterated text, which is presented to the user through a Java Applet. The output is shown in Figure 5.

3. CONCLUSION
We developed a prototype application that mobile service providers can use to cater to the information needs of their customers through SMS. Specifically, we attempt to address the needs of semi-literate users of low-cost mobile phones on which neither an Internet facility nor local language support is available. We made use of translation and transliteration through Web applications and libraries, along with a Web scraping technique, to process the request and provide the answer in the native language using Roman script. We believe this software can be immensely useful for information dissemination and access where Internet penetration is low and people's knowledge of English is limited.

4. ACKNOWLEDGEMENTS
We wish to thank Divesh Sanjay Kothari and Abhinay Saraswat, Department of CSE, ISM Dhanbad, for their all-round help.

5. ADDITIONAL AUTHORS
Ashok Kumar (ISM Dhanbad, ashokdavas@gmail.com), L. Gautam (ISM Dhanbad, gtam25@gmail.com), Abhishek Ranjan (ISM Dhanbad, aksharudarya@gmail.com)

Figure 3. Input Processing Steps
Figure 4. System Efficiency
Figure 5. Input Output Panel
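The word-matching procedure of Section 2.1 (vowel removal, choice of n-gram size from the number of remaining letters p, voting over the matching n-gram entries, and an edit-distance tie-break) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the prototype's Java implementation: the index is held in memory instead of in files, it is built over the unmodified abbreviations, the five-word lexicon and all function names are hypothetical, and the Hindi/English distinction of step c) is left out.

```python
from collections import Counter, defaultdict

VOWELS = set("AEIOU")


def char_ngrams(word, n):
    """Contiguous character n-grams of word (empty if word is shorter than n)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def build_index(words, n):
    """Inverted index: n-gram -> set of lexicon words containing it."""
    index = defaultdict(set)
    for word in words:
        for gram in char_ngrams(word, n):
            index[gram].add(word)
    return index


def edit_distance(a, b):
    """Levenshtein distance, computed with one rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]


def match(word, bigrams, trigrams):
    """Return the best lexicon match for one typed word, or None."""
    o = word.upper()                                   # original word (o)
    w = "".join(c for c in o if c not in VOWELS)       # vowels removed (w)
    p = len(w)                                         # letters remaining (p)
    if p >= 3:
        grams, index = char_ngrams(w, 3), trigrams
    elif p == 2:
        grams, index = char_ngrams(w, 2), bigrams
    else:
        grams, index = char_ngrams(o, 2), bigrams      # fall back to o
    votes = Counter()
    for gram in grams:
        for candidate in index.get(gram, ()):
            votes[candidate] += 1
    if not votes:
        return None
    # Most n-gram hits first, then least edit distance to the typed word.
    # The final alphabetical key only makes the arbitrary tie-break repeatable.
    return min(votes, key=lambda c: (-votes[c], edit_distance(o, c), c))


# Hypothetical miniature lexicon; the paper's list holds 1500 words.
LEXICON = ["IRCTC", "SMS", "IAS", "AIDS", "CNG"]
BIGRAMS = build_index(LEXICON, 2)
TRIGRAMS = build_index(LEXICON, 3)

for typed in ["aaiaarctc", "esemes", "ads", "as", "cngg"]:
    print(typed, "->", match(typed, BIGRAMS, TRIGRAMS))
```

With this toy lexicon, the example input "aaiaarctc" is reduced to RCTC, its trigrams RCT and CTC both point to IRCTC, and IRCTC is returned. The extra-vowel and vowel-deficiency inputs from the error survey ("esemes", "ads", "as") resolve to SMS, AIDS and IAS respectively.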
6. REFERENCES
[1] S. Lothia, W. James, and B. Hwang. System and methods for providing subscriber-initiated information over the short message service (SMS) or a micro browser, May 6, 2003. US Patent 6,560,456.
[2] J. Salonen. SMS inquiry and invitation distribution method and system, Mar. 12, 2013. US Patent RE44,073.
[3] V. Nikic and A. Wajda. Web Harvest, version 2.0, February 2010. As on June 25, 2014.
[4] User Query Data: https://www.dropbox.com/s/kbwabem29f0mwu9/data.txt?dl=0
[5] S. K. D S Kothari, A Saraswat and S. Pal. FAQ Retrieval using Noisy Queries. In FIRE 2013 Workshop Pre-Proceedings, December 2013.
[6] YQL Console: https://developer.yahoo.com/yql/
[7] ICU User Guide. As on June 25, 2014.
[8] Video Demo Link: https://www.dropbox.com/s/l270iq3gnhgafvy/PARTM.mp4?dl=0