Submission_36

Popular Acronym Retrieval through Text Messaging
Praveen Yadav
praveenyadav1993@yahoo.com
Surender Singh
surendrasingh9426@gmail.com
Sukomal Pal
pal.s.cse@ismdhanbad.ac.in
Dept. of CSE
Indian School of Mines, Dhanbad
Dhanbad, 826004, India
Rishabh Kumar
kr.rishabh618@gmail.com
Harsh Singh
harshmehra31@gmail.com
ABSTRACT
We present a prototype system that provides quick information on
common and popular abbreviations through text messaging. The
system receives an acronym (possibly mistyped) as text input in
Roman script and returns very brief information taken from the
first few lines of the corresponding English Wikipedia page.
The system is designed especially for low-cost mobile phones
that offer text-only messaging, without Internet access or native
language support. The target users are primarily semi-literate
people who may not have sufficient knowledge of English. The
output is translated into the native language of the user query
(Hindi) and presented as transliterated text.
CCS Concepts
• Information System Application: Miscellaneous
Keywords
Web scraping, Machine Translation, Transliteration.
1. INTRODUCTION
Today the world is flooded with information. However, for a vast
section of people, getting the right information in real time is a
far cry because of their distance from the information highway. Many
people find it difficult to obtain information even on common
abbreviations they come across in daily life, owing to a lack of
technological knowledge, educational background and/or
infrastructural support (for example, low Internet penetration). The
majority of these people are not comfortable with English but are
well conversant in their native language. Mobile phones have become
a necessity, and almost every individual possesses at least a basic
low-cost mobile phone. Other than making calls, these phones offer
limited features such as a text message service that primarily uses
Roman script. Short Message Service (SMS), a communication medium
widely used by cellular phone users, limits the maximum message size
to 160 characters. People using these phones can therefore
communicate through SMS in either English or their native language
written in Roman script.
We present a prototype system that mobile service providers can use
to cater to the information needs of users, drawing on the Internet
and delivering through SMS on low-cost phones. Specifically, we
provide mobile users with short, basic information on their acronym
requests through SMS. The query SMS contains a single acronym
(possibly mistyped) based on the user's knowledge. The response SMS
contains short information scraped from the English Wikipedia page,
translated into the user's native language and then presented in
transliterated form using Roman script. Although a host of SMS-based
services [1, 2] are described in the patent literature, to our
knowledge no such service provides information access from the Web
to mobile users where Internet penetration is low, mobile phones
offer only bare-minimum facilities such as calls and text messaging,
and people do not have sufficient knowledge of English.
2. SYSTEM OVERVIEW
Our prototype system is built using Java and the Web Harvest API [3].
Figure 1 provides an overview.
We collect the input, which is expected to be written in the native
language but transliterated into Roman script, in the form of an SMS
through a Java Applet interface. The software works in four basic
steps, described below.
2.1 Input Processing
Input from users is obtained in their native language in Roman
transliterated script. Since there is no universally accepted
transliteration rule, the same word can have several variations in
Roman script. More importantly, we are considering input from people
with limited language skills. Therefore, the actual English
abbreviation that the user is interested in needs to be deciphered
from his/her idiosyncratic input, which he/she has written from
his/her own mental mapping and which may therefore contain errors.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. Copyrights for
components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to
post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from Permissions@acm.org.
FIRE '14, December 05-07, 2014, Bangalore, India
© 2015 ACM. ISBN 978-1-4503-3755-7/15/08…$15.00.
DOI: http://dx.doi.org/10.1145/2824864.2824889

Figure 1. System Overview

To learn what types of errors we are dealing with, we conducted a
survey of people who were not well versed in English. We asked each
of them several queries [4]. We found that users primarily commit
the following types of error:
a) Extra vowel: e.g., SMS may be written as ASAMAS or
ESEMES; SP may be written as ASP or ESPI.
b) Vowel deficiency: e.g., IAS may be written as AS; AIDS
may be written as ADS.
c) Extra consonant: e.g., CNG may be written as CNGG.
d) Wrong consonant: e.g., B.Tech may be written as
V.Tech; KVPY may be written as KBPY.
e) Other errors: vowel replacement, e.g., CEO may be
written as CIO; typing error, e.g., IIPQ may be written as
IIPO; same phonetic sound, e.g., UPSC may be written as
UPSE and NEWS as NEUS.
Figure 2 shows the percentage of each type of error.
We focused mainly on the first and second kinds of error, as they
were committed most frequently. We applied heuristic techniques
based on character n-grams (stored in an inverted index) and edit
distance to match the input against a database of abbreviations,
which we compiled manually beforehand from various online sources.
It contains all kinds of abbreviations, ranging from governmental
organizations to the education field. We also created a list of
common Hindi terms used in framing ‘Wh’-type questions (e.g., ‘kya’
(क्या), ‘kee’ (कि), ‘kyon’ (क्यों), ‘kaahaan’ (कहाँ) etc.).
To find matches for queries containing abbreviations (possibly
with spelling errors), we stored the 1500 words spread across a
large number of files, using an inverted index to store and look up
the correct abbreviations. For every abbreviation in the database,
we generated all character bigrams. For example, the abbreviation
‘SAARC’ yields four bigram entries: SA, AA, AR, and RC. For each
such bigram, we created a file and stored the word ‘SAARC’ in it, so
the word ‘SAARC’ appears in four files, namely SA.txt, AA.txt,
AR.txt, and RC.txt. The same is done for each abbreviation. There
are at most 26x26 possible bigram files, and therefore 26x26 files
are created. Some files hold entries for several abbreviations,
while others remain empty; typically a single file contains zero to
only a few entries [5]. We did the same for character trigrams,
leading to 26x26x26 files.
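The indexing scheme described above can be sketched in a few lines of Java. This is only an illustrative sketch, not the prototype's code: the class name NGramIndex and its methods are hypothetical, an in-memory map stands in for the per-n-gram files (SA.txt, AA.txt, ...), and SAIL is an assumed second entry added purely to show a shared bigram.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for the per-n-gram files.
class NGramIndex {
    // Maps each n-gram to the set of abbreviations that contain it.
    private final Map<String, Set<String>> postings = new TreeMap<>();

    // Returns all contiguous character n-grams of a word, in order.
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // Stores a word under each of its bigrams and trigrams.
    void add(String word) {
        String w = word.toUpperCase(Locale.ROOT);
        for (int n = 2; n <= 3; n++) {
            for (String g : ngrams(w, n)) {
                postings.computeIfAbsent(g, k -> new TreeSet<>()).add(w);
            }
        }
    }

    // Returns the words stored under an n-gram (empty if there are none).
    Set<String> lookup(String gram) {
        return postings.getOrDefault(gram, Set.of());
    }

    public static void main(String[] args) {
        NGramIndex index = new NGramIndex();
        index.add("SAARC");
        index.add("SAIL"); // assumed extra entry, for illustration only
        System.out.println(ngrams("SAARC", 2)); // [SA, AA, AR, RC]
        System.out.println(index.lookup("SA")); // [SAARC, SAIL]
    }
}
```

A map keyed by n-gram avoids creating the empty files mentioned above, since only n-grams that actually occur get an entry.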
For a Hindi word like ‘KYA’ (क्या) (meaning ‘what’ in English), the
bigrams KY and YA are generated, and there is only one trigram file,
namely ‘KYA’. When the application receives a query, it scans all
the words. Since user input is in mixed language (we considered
Hindi and English written in Roman script) and potentially contains
many typos and misspellings, we first eliminate vowels from each
word and then search through the files generated by the n-gram
technique. Each input word is mapped to the word in the lists (both
Hindi and English) that occurs most frequently in the bigram and
trigram files. In case of several matches, the word requiring the
least modification is chosen, with ties broken arbitrarily. Let us
illustrate the algorithm with the example query “aaiaarctc kya hai”
(आईआरसीटीसी क्या है) (meaning “what is ‘aaiaarctc’?”).
Table 1. Input Processing Data
Word (o)     Vowels removed (w)   #chars remaining (p)
AAIAARCTC    RCTC                 4
KYA          KY                   2
HAI          H                    1
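The columns of Table 1 can be reproduced with a short Java sketch. The class and method names are hypothetical, and the sketch assumes the vowel set is A, E, I, O, U, with Y treated as a consonant, as the KYA row suggests.

```java
import java.util.Locale;

// Hypothetical helper that reproduces the columns of Table 1.
class VowelStripper {
    // Upper-cases a query word and removes the vowels A, E, I, O, U.
    static String stripVowels(String word) {
        return word.toUpperCase(Locale.ROOT).replaceAll("[AEIOU]", "");
    }

    public static void main(String[] args) {
        for (String o : "aaiaarctc kya hai".split("\\s+")) {
            String w = stripVowels(o);
            // Prints: word (o), vowel-removed form (w), remaining length (p)
            System.out.println(o.toUpperCase(Locale.ROOT) + " " + w + " " + w.length());
        }
    }
}
```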
The query is converted to uppercase, and then the following steps
are performed for each word:
a) Vowels are removed, because spelling errors occur mostly in
the use of vowels.
b) The number of remaining letters (p) is checked to make the
following decisions:
   i) If p >= 3, check only the trigram files for the word w or
   its trigram subsequences. Here, the files named RCT and CTC
   are searched to check whether the words they contain include
   some subsequence of RCTC. If there is a matching entry in all
   such files, that abbreviation is chosen. If more than one
   entry contains this subsequence, we choose the word with the
   highest number of occurrences among these n-gram files. The
   least edit distance from the query word (o) is used to break
   ties, and any word from the remaining set is then chosen
   arbitrarily.
   ii) If p == 2, check only the bigram files for the word w or
   its bigram subsequences, as above. Here, the file named KY is
   selected and searched to check whether it contains some
   subsequence of KY. If there is a matching entry, that
   abbreviation is chosen.
   iii) If p < 2, check only the bigram files for the original
   word before vowel removal (o) or its bigram subsequences.
   Here, the files HA and AI are searched to check whether they
   contain some subsequence of HAI to get the required matching
   word.
c) We check the chosen word. The word may come from the Hindi
files, the English files, or both:
   i) The word is from the Hindi files only: ignore the word, as
   it is not an abbreviation.
   ii) The word is from the English files only: process the word
   further, as it is assumed to be an acronym.
   iii) Two different words are returned from the Hindi and
   English files: choose the word with the minimum edit distance
   from the original input and then treat it as case i) or ii).
d) The chosen abbreviation is looked up in our collection of
abbreviations, and its expanded form is extracted. The steps are
summarized in Figure 3.

Figure 2. Types of Error

We assume that a single query contains at most one acronym,
supplemented with zero or more Hindi words.
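The per-word matching procedure of this section can be sketched as follows. This is a hedged illustration rather than the prototype's implementation: AcronymMatcher and all its members are hypothetical names, an in-memory map again stands in for the n-gram files, the list entry IRCTC is assumed from the example query, and preferring the English match when the two edit distances are equal is our own assumption, since the text does not specify that case.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: one matcher per word list (English acronyms, Hindi terms).
class AcronymMatcher {
    // n-gram -> words containing it; stands in for the n-gram files.
    private final Map<String, Set<String>> files = new HashMap<>();

    AcronymMatcher(String... words) {
        for (String word : words) {
            String w = word.toUpperCase(Locale.ROOT);
            for (int n = 2; n <= 3; n++) {
                for (String g : ngrams(w, n)) {
                    files.computeIfAbsent(g, k -> new TreeSet<>()).add(w);
                }
            }
        }
    }

    static List<String> ngrams(String s, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    // Standard Levenshtein distance, computed row by row.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(sub, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    // Steps a) and b): returns the best list entry for a query word, or null.
    String match(String query) {
        String o = query.toUpperCase(Locale.ROOT);
        String w = o.replaceAll("[AEIOU]", "");
        int p = w.length();
        List<String> grams = p >= 3 ? ngrams(w, 3) : p == 2 ? ngrams(w, 2) : ngrams(o, 2);
        // Count in how many of the searched n-gram files each candidate appears.
        Map<String, Integer> votes = new TreeMap<>();
        for (String g : grams) {
            for (String cand : files.getOrDefault(g, Set.of())) {
                votes.merge(cand, 1, Integer::sum);
            }
        }
        String best = null;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (best == null) { best = e.getKey(); continue; }
            int cmp = Integer.compare(e.getValue(), votes.get(best));
            // Most votes wins; least edit distance from o breaks ties.
            if (cmp > 0 || (cmp == 0 && editDistance(o, e.getKey()) < editDistance(o, best))) {
                best = e.getKey();
            }
        }
        return best;
    }

    // Step c): returns the acronym, or null if the word is judged to be Hindi.
    static String acronymOrNull(AcronymMatcher english, AcronymMatcher hindi, String query) {
        String o = query.toUpperCase(Locale.ROOT);
        String en = english.match(query);
        String hi = hindi.match(query);
        if (en == null) return null;
        if (hi == null) return en;
        // Assumption: the English match is kept when both distances are equal.
        return editDistance(o, en) <= editDistance(o, hi) ? en : null;
    }
}
```

With an English list containing IRCTC and SAARC and a Hindi list containing KYA and HAI, the example query resolves word by word: "aaiaarctc" yields IRCTC, while "kya" and "hai" are recognized as Hindi and ignored.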
2.2 System Efficiency
We tested our system on the user data we collected, comparing two
options:
a) Removal of vowels: we remove the vowels from the user query
before searching.
b) Considering vowels: we search directly, without removing
vowels.
Figure 4 shows the results.
2.3 Data Extraction From Wikipedia
Given a chosen acronym, we first look it up in our associative array
of collected acronyms and their expanded forms. The expansion is
used to generate the URL passed to the Web Harvest API (open
source), which returns the content of that Wikipedia page in XML
format. Since we need to provide only a brief definition or
introduction of an abbreviation that can fit within an SMS, we are
interested only in the first few lines (1 or 2 sentences), up to 200
characters, of the Wikipedia page. The content so obtained is stored
in a temporary file.
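The trimming of the page text to the first one or two sentences within a character budget can be sketched as below. This covers only the trimming step and assumes the page text has already been fetched; the class SnippetExtractor and its method are hypothetical, and the JDK's java.text.BreakIterator is used here as a stand-in for whatever sentence splitting the prototype performs.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper: keeps the leading sentences of already-fetched page text.
class SnippetExtractor {
    // Returns at most maxSentences leading sentences, never exceeding maxChars;
    // falls back to a hard cut if even the first sentence is too long.
    static String leadingSentences(String text, int maxSentences, int maxChars) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int end = 0;
        int count = 0;
        for (int b = it.next(); b != BreakIterator.DONE && count < maxSentences; b = it.next()) {
            if (text.substring(0, b).trim().length() > maxChars) {
                break; // adding this sentence would exceed the budget
            }
            end = b;
            count++;
        }
        String out = text.substring(0, end).trim();
        if (out.isEmpty()) {
            out = text.substring(0, Math.min(maxChars, text.length())).trim();
        }
        return out;
    }

    public static void main(String[] args) {
        String page = "Alpha beta. Gamma delta. Epsilon zeta.";
        System.out.println(leadingSentences(page, 2, 200)); // Alpha beta. Gamma delta.
    }
}
```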
2.4 Translation of Data
The stored data is sent to Google Translate, and we then use Yahoo
Query Language (YQL) [6] to retrieve the translation. We use the Web
Harvest API once again to extract the translated text. The extracted
content is stored in another temporary file.
2.5 Transliteration of Data
For transliteration, the ICU4J [7] library is used. ICU4J is a set
of Java libraries that provides comprehensive support for Unicode,
software globalization and internationalization. The translated data
is passed to a method that uses this library, and the resulting
transliterated text is presented to the user through a Java Applet.
The output is shown in Figure 5.
3. CONCLUSION
We developed a prototype application that mobile service providers
can use to cater to the information needs of their customers through
SMS. Specifically, we attempt to address the needs of semi-literate
users of low-cost mobile phones on which neither Internet access nor
local language support is available. We combined Web scraping with
translation and transliteration, using Web applications and
libraries, to process the request and provide the answer in the
native language using Roman script. We believe this software can be
immensely useful for information dissemination and access where
Internet penetration is low and people's knowledge of English is
limited.
4. ACKNOWLEDGEMENTS
We wish to thank Divesh Sanjay Kothari and Abhinay Saraswat,
Department of CSE, ISM Dhanbad for their all-round help.
5. ADDITIONAL AUTHORS
Ashok Kumar (ISM Dhanbad, ashokdavas@gmail.com), L.
Gautam (ISM Dhanbad, gtam25@gmail.com), Abhishek Ranjan
(ISM Dhanbad, aksharudarya@gmail.com).
Figure 3. Input Processing Steps
Figure 4. System Efficiency
Figure 5. Input Output Panel

6. REFERENCES
[1] S. Lothia, W. James, and B. Hwang. System and methods for
providing subscriber-initiated information over the short
message service (SMS) or a micro browser, May 6, 2003. US
Patent 6,560,456.
[2] J. Salonen. SMS inquiry and invitation distribution method
and system, Mar. 12, 2013. US Patent RE44,073.
[3] V. Nikic and A. Wajda. Web Harvest, version 2.0, February
2010. As on June 25, 2014.
[4] User Query Data:
https://www.dropbox.com/s/kbwabem29f0mwu9/data.txt?dl=0
[5] S. K. D S Kothari, A Saraswat and S. Pal. FAQ Retrieval
using Noisy Queries. In FIRE 2013 Workshop Pre-Proceedings,
December 2013.
[6] YQL Console: https://developer.yahoo.com/yql/
[7] ICU User Guide. As on June 25, 2014.
[8] Video Demo Link:
https://www.dropbox.com/s/l270iq3gnhgafvy/PARTM.mp4?dl=0
