which he/she has written from his/her own mental mapping. Therefore,
there may be some errors in the user-provided abbreviation. To learn
what types of errors we are dealing with, we conducted a survey of
people who weren’t well versed in English literature. We collected
several queries [4] from each of them. We found that the following
types of error are primarily committed by users:
a) Extra Vowel: e.g., SMS may be written as ASAMAS,
ESEMES. SP may be written as ASP, ESPI etc.
b) Vowel Deficiency: e.g., IAS may be written as AS. AIDS
may be written as ADS.
c) Extra Consonant usage: e.g., CNG may be written as
CNGG.
d) Wrong Consonant usage: e.g., B.Tech may be written as
V.Tech. KVPY may be written as KBPY.
e) Other errors: Vowel replacement error e.g., CEO may be
written as CIO. Typing error e.g., IIPQ may be written as
IIPO. Same Phonetic sound error e.g., UPSC may be
written as UPSE, NEWS may be written as NEUS.
Figure 2 shows the percentage of each of these types of error.
We focused mainly on the first and second kinds of error, as they
were committed most frequently. We applied heuristic techniques
based on character N-grams (stored as an inverted index) and edit
distance to match user input against a database of abbreviations
collected in advance. We compiled this list manually from various
online sources. It contains all kinds of abbreviations, ranging from
governmental organizations to the field of education. We also
created a list of common Hindi terms that are used in framing
‘Wh’-type questions (e.g. ‘kya’ (क्या), ‘kee’ (कि), ‘kyon’ (क्यों),
‘kaahaan’ (कहाँ) etc.).
To find matches for queries containing abbreviations (possibly
with spelling errors), we stored the 1500 words spread across a
large number of files, using the inverted index technique to store
and look up the correct abbreviations. For every abbreviation in the
database, we generated all of its character bigrams. For example,
the abbreviation ‘SAARC’ yields 4 bigrams: SA, AA, AR, and RC.
For each such bigram we created a file and stored the word
‘SAARC’ in it, so that ‘SAARC’ appears in four files, namely
SA.txt, AA.txt, AR.txt, and RC.txt. The same exercise is carried out
for each abbreviation. There are at most 26x26 possible bigrams,
and therefore 26x26 files are created. Many files contain entries
for a number of abbreviations, while some remain empty. Typically
a single file contains zero to only a few entries [5]. We did the
same for character trigrams, leading to 26x26x26 files.
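The index construction described above can be sketched as follows. This is a minimal illustration, not the original implementation: it assumes an in-memory dictionary in place of the one-file-per-n-gram layout, and the function names and the three sample abbreviations are hypothetical.

```python
from collections import defaultdict

def char_ngrams(word, n):
    """Return the consecutive character n-grams of word, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def build_index(words, n):
    """Map each n-gram to the set of words that contain it.

    Each dictionary key plays the role of one n-gram file (e.g. SA.txt).
    """
    index = defaultdict(set)
    for word in words:
        word = word.upper()
        for gram in char_ngrams(word, n):
            index[gram].add(word)
    return index

# Hypothetical sample entries standing in for the abbreviation database.
abbreviations = ["SAARC", "IRCTC", "UPSC"]
bigram_index = build_index(abbreviations, 2)
trigram_index = build_index(abbreviations, 3)

print(char_ngrams("SAARC", 2))     # ['SA', 'AA', 'AR', 'RC']
print(sorted(bigram_index["RC"]))  # ['IRCTC', 'SAARC']
```

Only n-grams that actually occur become keys here, so the empty files of the 26x26 layout have no counterpart in this sketch.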
For Hindi words like ‘KYA’ (क्या) (meaning ‘what’ in English), the
bigrams KY and YA are generated. There is only one trigram file,
namely ‘KYA’. When the application receives a query, it scans
all the words. Since user input is in mixed language (we considered
Hindi and English written in Roman script) and potentially contains
many typos and wrong spellings, we first eliminate vowels from
each word and then search through the files generated by the N-gram
technique. Each input word is mapped to the word in the lists (both
Hindi and English) that occurs most frequently in the matching
bigram and trigram files. In case of several matches, the word
requiring the least modification is chosen, with ties broken
arbitrarily. Let us illustrate the algorithm with the example query
“aaiaarctc kya hai” (आईआरसीटीसी क्या है), which means “what is
‘aaiaarctc’?”.
Table 1. Input Processing Data
Word (o)     Vowel Removed (w)   #char rem. (p)
AAIAARCTC    RCTC                4
KYA          KY                  2
HAI          H                   1
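The values in Table 1 can be reproduced with a short sketch. It assumes that only A, E, I, O, and U count as vowels (Y is kept, as the KY entry in the table implies); the function name is hypothetical.

```python
VOWELS = set("AEIOU")  # assumption: Y is treated as a consonant

def strip_vowels(word):
    """Uppercase the word and drop the vowels A, E, I, O, U."""
    return "".join(ch for ch in word.upper() if ch not in VOWELS)

for o in "aaiaarctc kya hai".split():
    w = strip_vowels(o)
    print(o.upper(), w, len(w))
# AAIAARCTC RCTC 4
# KYA KY 2
# HAI H 1
```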
The query is converted to uppercase and then the following steps
are performed for each word:
a) Vowels are removed, because spelling errors mostly occur in
the use of vowels.
b) The number of remaining letters (p) is checked to take
one of the following decisions:
i) If (p >= 3), check only the tri-gram files for the
word w or its tri-gram subsequences. Here, the
files named RCT and CTC are searched for
words that contain some subsequence of
RCTC. If a single entry matches in all such
files, that abbreviation is chosen. If
more than one entry contains such a
subsequence, we choose the word with the
highest number of occurrences among these
n-gram files. Ties are broken by the least edit
distance from the query word (o); if several
words still remain, any one of them is chosen
arbitrarily.
ii) If (p == 2), check only the bi-gram files for the
word w or its bi-gram subsequences, as above.
Here, the file named KY is selected and
searched. If there is a matching entry, that
abbreviation is chosen.
iii) If (p < 2), check only the bi-gram files for the
original word before vowel removal (o) or its
bi-gram subsequences. Here, the files HA and AI
are searched for words that contain some
subsequence of HAI to get the required
matching word.
c) We check the chosen word. The word may come from
the Hindi files, the English files, or both:
i) Word is from the Hindi files only: ignore the
word, as it is not an abbreviation.
ii) Word is from the English files only: the word
is processed further, as it is assumed to be an
acronym.
iii) Two different words are returned from the
Hindi and English files: choose the word having
the minimum edit distance from the original
input and then treat it as case i) or ii).
d) The chosen abbreviation is looked up in our
collection of abbreviations and its expanded form is
extracted. The steps are summarized in Figure 3.
Figure 2. Types of Error
We assumed that a single query contains at most one acronym,
accompanied by zero or more Hindi words.
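Steps a) to c) can be put together in a small sketch. It is an illustration under stated assumptions rather than the system's actual code: the index is rebuilt in memory on every call for brevity, the word lists are hypothetical samples, all function names are ours, and in case iii) a tie in edit distance is resolved in favour of the Hindi word, which is one arbitrary choice among several.

```python
from collections import Counter, defaultdict

VOWELS = set("AEIOU")  # assumption: Y is treated as a consonant

def ngrams(word, n):
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def build_index(words, n):
    """Map each n-gram to the set of stored words containing it."""
    index = defaultdict(set)
    for word in words:
        for gram in ngrams(word, n):
            index[gram].add(word)
    return index

def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def best_match(o, words):
    """Return the stored word that best matches input word o, or None."""
    o = o.upper()
    w = "".join(ch for ch in o if ch not in VOWELS)  # step a)
    p = len(w)
    # Step b): trigrams of w if p >= 3, bigrams of w if p == 2,
    # otherwise bigrams of the original word o.
    n, key = (3, w) if p >= 3 else (2, w) if p == 2 else (2, o)
    index = build_index(words, n)  # rebuilt per call only to keep the sketch short
    hits = Counter()
    for gram in ngrams(key, n):
        hits.update(index.get(gram, ()))
    if not hits:
        return None
    top = max(hits.values())
    tied = [cand for cand, count in hits.items() if count == top]
    # Least edit distance to o breaks ties; any remaining tie is arbitrary.
    return min(tied, key=lambda cand: edit_distance(o, cand))

def find_acronym(query, english, hindi):
    """Step c): return the first query word judged to be an acronym, or None."""
    for o in query.upper().split():
        e = best_match(o, english)
        h = best_match(o, hindi)
        if e is None:
            continue  # case i) or no match at all: not an acronym
        if h is None or edit_distance(o, e) < edit_distance(o, h):
            return e  # case ii), or case iii) resolved to the English word
    return None

english = ["IRCTC", "SAARC", "UPSC"]  # hypothetical abbreviation list
hindi = ["KYA", "HAI", "KYON"]        # hypothetical Hindi term list
print(find_acronym("aaiaarctc kya hai", english, hindi))  # IRCTC
```

Step d) then reduces to a dictionary lookup from the returned abbreviation to its expanded form.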
2.2 System Efficiency
We tested our system on the user data we collected and obtained
results under two different options:
a) Removal of vowels: We removed the vowels from the
user query before searching.
b) Considering vowels: We searched directly, without
removing the vowels.
Figure 4 shows the results.
2.3 Data Extraction From Wikipedia
Given a chosen acronym, we first look it up in our associative array
of collected acronyms and their expanded forms. The
expansion is used to generate the URL passed to the Web Harvest
API (open source), which returns the content of that Wikipedia page
in XML format. Since we need to provide only a brief definition or
introduction of the abbreviation that can fit within an SMS, we are
interested only in the first few lines (1 or 2 sentences), up to 200
characters, of the Wiki page. The content so obtained is stored in a
temporary file.
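The trimming step can be sketched as below. This assumes the page text has already been extracted as a plain string, and it uses a naive sentence splitter that will mis-split on abbreviations containing periods; the function name and parameters are hypothetical.

```python
import re

def brief_intro(text, max_chars=200, max_sentences=2):
    """Return up to max_sentences leading sentences, capped at max_chars."""
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    out = ""
    for sentence in sentences[:max_sentences]:
        candidate = (out + " " + sentence).strip()
        if len(candidate) > max_chars:
            break
        out = candidate
    # Fall back to a hard cut if even the first sentence is too long.
    return out or text.strip()[:max_chars]

print(brief_intro("A b c. D e f. G h i."))  # A b c. D e f.
```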
2.4 Translation of Data
The stored data is sent to Google Translate, and we use Yahoo
Query Language (YQL) [6] to retrieve the translation. We use the
Web Harvest API once again to extract the translated text. The
extracted content is stored in another temporary file.
2.5 Transliteration of Data
For transliteration, the ICU4J [7] library is used. ICU4J is a set
of Java libraries that provide comprehensive support for
Unicode, software globalization, and internationalization. The
translated data is passed to a method using this library, and the
transliterated text it generates is presented to the user through
a Java Applet. The output is shown in Figure 5.
3. CONCLUSION
We developed a prototype application which can be used by mobile
service providers to cater to the information needs of their
customers through SMS. Specifically, we attempt to address the
needs of semi-literate users with low-cost mobile phones on which
neither Internet access nor local language support is available. We
made use of translation and transliteration through Web
applications and libraries, along with Web scraping, to process the
request and provide the answer in the native language using Roman
script. We believe this software can be immensely useful for
information dissemination and access where Internet penetration is
low and people’s knowledge of English is limited.
4. ACKNOWLEDGEMENTS
We wish to thank Divesh Sanjay Kothari and Abhinay Saraswat,
Department of CSE, ISM Dhanbad for their all-round help.
5. ADDITIONAL AUTHORS
Ashok Kumar (ISM Dhanbad, ashokdavas@gmail.com), L.
Gautam (ISM Dhanbad, gtam25@gmail.com), Abhishek Ranjan
(ISM Dhanbad, aksharudarya@gmail.com )
Figure 3. Input Processing Steps
Figure 4. System Efficiency
Figure 5. Input Output Panel
6. REFERENCES
[1] S. Lothia, W. James, and B. Hwang. System and methods for
providing subscriber-initiated information over the short
message service (SMS) or a micro browser, May 6, 2003. US
Patent 6,560,456.
[2] J. Salonen. SMS inquiry and invitation distribution method
and system, Mar. 12, 2013. US Patent RE44,073.
[3] V. Nikic and A. Wajda. Web Harvest, version 2.0, February
2010. As on June 25, 2014.
[4] User Query Data:
https://www.dropbox.com/s/kbwabem29f0mwu9/data.txt?dl=0
[5] S. K. D S Kothari, A Saraswat and S. Pal. FAQ Retrieval
using Noisy Queries. In FIRE 2013 Workshop
Pre-Proceedings, December 2013.
[6] YQL Console: https://developer.yahoo.com/yql/
[7] ICU User Guide as on June 25, 2014.
[8] Video Demo Link:
https://www.dropbox.com/s/l270iq3gnhgafvy/PARTM.mp4?dl=0