SlideShare a Scribd company logo
1 of 9
Download to read offline
Learning Phoneme Mappings for Transliteration without Parallel Data


                                         Sujith Ravi and Kevin Knight
                                        University of Southern California
                                         Information Sciences Institute
                                        Marina del Rey, California 90292
                                        {sravi,knight}@isi.edu




                        Abstract                                We explore the third dimension, where we see
                                                              several techniques in use:
      We present a method for performing machine
      transliteration without any parallel resources.            • Manually-constructed transliteration models,
      We frame the transliteration task as a deci-                 e.g., (Hermjakob et al., 2008).
      pherment problem and show that it is possi-
      ble to learn cross-language phoneme mapping                • Models constructed from bilingual dictionaries
      tables using only monolingual resources. We                  of terms and names, e.g., (Knight and Graehl,
      compare various methods and evaluate their                   1998; Huang et al., 2004; Haizhou et al., 2004;
      accuracies on a standard name transliteration
                                                                   Zelenko and Aone, 2006; Yoon et al., 2007;
      task.
                                                                   Li et al., 2007; Karimi et al., 2007; Sherif
                                                                   and Kondrak, 2007b; Goldwasser and Roth,
1    Introduction                                                  2008b).
Transliteration refers to the transport of names and             • Extraction of parallel examples from bilin-
terms between languages with different writing sys-                gual corpora, using bootstrap dictionaries e.g.,
tems and phoneme inventories. Recently there has                   (Sherif and Kondrak, 2007a; Goldwasser and
been a large amount of interesting work in this                    Roth, 2008a).
area, and the literature has outgrown being citable
in its entirety. Much of this work focuses on back-              • Extraction of parallel examples from compara-
transliteration, which tries to restore a name or                  ble corpora, using bootstrap dictionaries, and
term that has been transported into a foreign lan-                 temporal and word co-occurrence, e.g., (Sproat
guage. Here, there is often only one correct target                et al., 2006; Klementiev and Roth, 2008).
spelling—for example, given jyon.kairu (the
                                                                 • Extraction of parallel examples from web
name of a U.S. Senator transported to Japanese), we
                                                                   queries, using bootstrap dictionaries, e.g., (Na-
must output “Jon Kyl”, not “John Kyre” or any other
                                                                   gata et al., 2001; Oh and Isahara, 2006; Kuo et
variation.
                                                                   al., 2006; Wu and Chang, 2007).
   There are many techniques for transliteration and
back-transliteration, and they vary along a number               • Comparing terms from different languages in
of dimensions:                                                     phonetic space, e.g., (Tao et al., 2006; Goldberg
                                                                   and Elhadad, 2008).
    • phoneme substitution vs. character substitution

    • heuristic vs. generative vs. discriminative mod-           In this paper, we investigate methods to acquire
      els                                                     transliteration mappings from non-parallel sources.
                                                              We are inspired by previous work in unsupervised
    • manual vs. automatic knowledge acquisition              learning for natural language, e.g. (Yarowsky, 1995;

                                                         37
    Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 37–45,
                   Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics
English word                    English sound                Japanese sound
                    sequence                         sequence                      sequence                  Japanese katakana
     WFSA - A                         WFST - B                     WFST - C                      WFST - D        sequence
                                                  ( S P EH N S ER              ( S U P E N S A A
                ( SPENCER ABRAHAM )                                                                         (スペンサー・エーブラハム)
                                                 EY B R AH HH AE M )          E E B U R A H A M U )



Figure 1: Model used for back-transliteration of Japanese katakana names and terms into English. The model employs
a four-stage cascade of weighted finite-state transducers (Knight and Graehl, 1998).


Goldwater and Griffiths, 2007), and we are also in-                        all 1-to-1 and 1-to-2 phoneme connections between
spired by cryptanalysis—we view a corpus of for-                          English and Japanese. Phoneme bigrams that occur
eign terms as a code for English, and we attempt to                       fewer than 10 times in a Japanese corpus are omit-
break the code.                                                           ted, and we omit 1-to-3 connections. This initial
                                                                          WFST C model has 15320 uniformly weighted pa-
2    Background                                                           rameters. We then train the model on 3343 phoneme
We follow (Knight and Graehl, 1998) in tackling                           string pairs from a bilingual dictionary, using the
back-transliteration of Japanese katakana expres-                         EM algorithm. EM immediately reduces the con-
sions into English. Knight and Graehl (1998) devel-                       nections in the model to those actually observed in
oped a four-stage cascade of finite-state transducers,                     the parallel data, and after 14 iterations, there are
shown in Figure 1.                                                        only 188 connections left with P(j|e) ≥ 0.01. Fig-
                                                                          ure 2 shows the phonemic substitution table learnt
    • WFSA A - produces an English word sequence                          from parallel training.
      w with probability P(w) (based on a unigram                            We use this trained WFST C model and apply it
      word model).                                                        to the U.S. Senator name transliteration task (which
                                                                          we update to the 2008 roster). We obtain 40% er-
    • WFST B - generates an English phoneme se-
                                                                          ror, roughly matching the performance observed in
      quence e corresponding to w with probability
                                                                          (Knight and Graehl, 1998).
      P(e|w).

    • WFST C - transforms the English phoneme se-                         3       Task and Data
      quence into a Japanese phoneme sequence j ac-                       The task of this paper is to learn the mappings in
      cording to a model P(j|e).                                          Figure 2, but without parallel data, and to test those
                                                                          mappings in end-to-end transliteration. We imagine
    • WFST D - writes out the Japanese phoneme
                                                                          our problem as one faced by monolingual English
      sequence into Japanese katakana characters ac-
                                                                          speaker wandering around Japan, reading a multi-
      cording to a model P(k|j).
                                                                          tude of katakana signs, listening to people speak
   Using the cascade in the reverse (noisy-channel)                       Japanese, and eventually deciphering those signs
direction, they are able to translate new katakana                        into English. To mis-quote Warren Weaver:
names and terms into English. They report 36% er-
                                                                                   “When I look at a corpus of Japanese
ror in translating 100 U.S. Senators’ names, and they
                                                                                   katakana, I say to myself, this is really
report exceeding human transliteration performance
                                                                                   written in English, but it has been coded
in the presence of optical scanning noise.
                                                                                   in some strange symbols. I will now pro-
   The only transducer that requires parallel training
                                                                                   ceed to decode.”
data is WFST C. Knight and Graehl (1998) take sev-
eral thousand phoneme string pairs, automatically                           Our larger motivation is to move toward
align them with the EM algorithm (Dempster et                             easily-built transliteration systems for all language
al., 1977), and construct WFST C from the aligned                         pairs, regardless of parallel resources.        While
phoneme pieces.                                                           Japanese/English transliteration has its own partic-
   We re-implement their basic method by instanti-                        ular features, we believe it is a reasonable starting
ating a densely-connected version of WFST C with                          point.

                                                                     38
e       j       P(j|e)     e    j       P(j|e)   e    j      P(j|e)   e    j       P(j|e)   e     j     P(j|e)   e       j           P(j|e)        e    j       P(j|e)   e    j    P(j|e)

  AA      o       0.49       AY   ai      0.84     EH   e      0.94     HH   h       0.95     L     r     0.62     OY      oi          0.89          SH   sh y    0.33     V    b    0.75
          a       0.46            i       0.09          a      0.03          w       0.02           ru    0.37             oe          0.04               sh      0.31          bu   0.17
          oo      0.02            a       0.03                               ha      0.02                                  o           0.04               yu      0.17          w    0.03
          aa      0.02            iy      0.01                                                                             i           0.04               ssh y   0.12          a    0.02
                                  ay      0.01                                                                                                            sh i    0.04
                                                                                                                                                          ssh     0.02
                                                                                                                                                          e       0.01
  AE      a       0.93       B    b       0.82     ER   aa     0.8      IH   i       0.89     M     m     0.68     P       p           0.63          T    t       0.43     W    w    0.73
          a ssh   0.02            bu      0.15          a      0.08          e       0.05           mu    0.22             pu          0.16               to      0.25          u    0.17
          an      0.02                                  ar     0.03          in      0.01           n     0.08             pp u        0.13               tt o    0.17          o    0.04
                                                        ru     0.02          a       0.01                                  pp          0.06               ts      0.04          i    0.02
                                                        or     0.02                                                                                       tt      0.03
                                                        er     0.02                                                                                       u       0.02
                                                                                                                                                          ts u    0.02
                                                                                                                                                          ch      0.02
  AH      a       0.6        CH   tch i   0.27     EY   ee     0.58     IY   ii      0.58     N     n     0.96     PAUSE   pause       1.0           TH   su      0.48     Y    y    0.7
          o       0.13            ch      0.24          e      0.15          i       0.3            nn    0.02                                            s       0.22          i    0.26
          e       0.11            ch i    0.23          ei     0.12          e       0.07                                                                 sh      0.16          e    0.02
          i       0.07            ch y    0.2           a      0.1           ee      0.03                                                                 to      0.04          a    0.02
          u       0.06            tch y   0.02          ai     0.03                                                                                       ch      0.04
                                  tch     0.02                                                                                                            te      0.02
                                  ssh y   0.01                                                                                                            t       0.02
                                  k       0.01                                                                                                            a       0.02
  AO      o       0.6        D    d       0.54     F    h      0.58     JH   jy      0.35     NG    n     0.62     R       r           0.61          UH   u       0.79     Z    z    0.27
          oo      0.27            do      0.27          hu     0.35          j       0.24           gu    0.22             a           0.27               uu      0.09          zu   0.25
          a       0.05            dd o    0.06          hh     0.04          ji      0.21           ng    0.09             o           0.07               ua      0.04          u    0.16
          on      0.03            z       0.02          hh u   0.02          jj i    0.14           i     0.04             ru          0.03               dd      0.03          su   0.07
          au      0.03            j       0.02                               z       0.04           u     0.01             aa          0.01               u ssh   0.02          j    0.06
          u       0.01            u       0.01                                                      o     0.01                                            o       0.02          a    0.06
                                                                                                    a     0.01                                                                  n    0.03
                                                                                                                                                                                i    0.03
                                                                                                                                                                                s    0.02
                                                                                                                                                                                o    0.02
  AW      au      0.69       DH   z       0.87     G    g      0.66     K    k       0.53     OW    o     0.57     S       su          0.43          UW   uu      0.67     ZH   jy   0.43
          aw      0.15            zu      0.08          gu     0.19          ku      0.2            oo    0.39             s           0.37               u       0.29          ji   0.29
          ao      0.06            az      0.04          gg u   0.1           kk u    0.16           ou    0.02             sh          0.08               yu      0.02          j    0.29
          a       0.04                                  gy     0.03          kk      0.05                                  u           0.05
          uu      0.02                                  gg     0.01          ky      0.02                                  ss          0.02
          oo      0.02                                  ga     0.01          ki      0.01                                  ssh         0.01
          o       0.02


Figure 2: Phonemic substitution table learnt from 3343 parallel English/Japanese phoneme string pairs. English
phonemes are in uppercase, Japanese in lowercase. Mappings with P(j|e) > 0.01 are shown.
                  A   A   CH I D O                              CH E N J I                    N E B A D A                          W   A     K   O   B I A
                  A   A   K U P U R A Z A                       CH E S                             :                               W   A     N   K   A PP U
                  A   A   N D O                                      :                        O P U T I K U S U                    W   A     N   T   E N P O
                  A   A   T I S U T O                           D E K O R A       T I B U          :                               W   A     S   E   R I N
                  A   A   T O S E R I N A P I S U T O N         D E T O M O       R U T O     P I I T A A                          Y   U     N   I   O N
                  A   I   A N B I R U                           E P I G U R       A M U       P I KK A A                           Y   U     N   I   TT O SH I S U T E M U
                  A   I   D I I D O                             E R A N D O                   P I N G U U                          Y   U     U
                  A   I   K E N B E R I I                            :                        P I P E R A J I N A M I D O               :
                  A   J   I A K A PP U                          J Y A I A N       TS U        P I S A                                   :
                  A   J   I T O                                 J Y A Z U                     P I U R A                            Z E N E R A R U E A K O N
                  A   K   A SH I A K O O S U                         :                        P O I N T O                          Z E R O
                  A   K   U A                                   M Y U U Z E       U M U            :                               Z O N B I I Z U
                           :                                         :                             :                                    :
                           :                                         :                             :                                    :


   Figure 3: Some Japanese phoneme sequences generated from the monolingual katakana corpus using WFST D.


  Our monolingual resources are:                                                                         with 112,151 entries.

  • 43717 unique Japanese katakana sequences                                                        • The English gigaword corpus. Knight and
    collected from web newspaper data. We split                                                       Graehl (1998) already use frequently-occurring
    multi-word katakana phrases on the center-dot                                                     capitalized words to build the WFSA A compo-
    (“·”) character, and select a final corpus of                                                      nent of their four-stage cascade.
    9350 unique sequences. We add monolingual
    Japanese versions of the 2008 U.S. Senate ros-                                                   We seek to use our English knowledge (derived
    ter.1                                                                                         from 2 and 3) to decipher the Japanese katakana cor-
                                                                                                  pus (1) into English. Figure 3 shows a portion of the
  • The CMU pronunciation dictionary of English,                                                  Japanese corpus, which we transform into Japanese
      1
     We use “open” EM testing, in which unlabeled test data
                                                                                                  phoneme sequences using the monolingual resource
is allowed to be part of unsupervised training. However, no                                       of WFST D. We note that the Japanese phoneme in-
parallel data is allowed.                                                                         ventory contains 39 unique (“ciphertext”) symbols,

                                                                                         39
compared to the 40 English (“plaintext”) phonemes.              We allow EM to run until the P(j) likelihood ra-
   Our goal is to compare and evaluate the WFST C               tio between subsequent training iterations reaches
model learnt under two different scenarios—(a) us-              0.9999, and we terminate early if 200 iterations are
ing parallel data, and (b) using monolingual data.              reached.
For each experiment, we train only the WFST C                      Finally, we decode our test set of U.S. Senator
model and then apply it to the name translitera-                names. Following Knight et al (2006), we stretch
tion task—decoding 100 U.S. Senator names from                  out the P(j|e) model probabilities after decipher-
Japanese to English using the automata shown in                 ment training and prior to decoding our test set, by
Figure 1. For all experiments, we keep the rest of              cubing their values.
the models in the cascade (WFSA A, WFST B, and                     Decipherment under the conditions of translit-
WFST D) unchanged. We evaluate on whole-name                    eration is substantially more difficult than solv-
error-rate (maximum of 100/100) as well as normal-              ing letter-substitution ciphers (Knight et al., 2006;
ized word edit distance, which gives partial credit             Ravi and Knight, 2008; Ravi and Knight, 2009) or
for getting the first or last name correct.                      phoneme-substitution ciphers (Knight and Yamada,
                                                                1999). This is because the target table contains sig-
4       Acquiring Phoneme Mappings from                         nificant non-determinism, and because each symbol
        Non-Parallel Data                                       has multiple possible fertilities, which introduces
                                                                uncertainty about the length of the target string.
Our main data consists of 9350 unique Japanese
phoneme sequences, which we can consider as a sin-
                                                                4.1   Baseline P(e) Model
gle long sequence j. As suggested by Knight et
al (2006), we explain the existence of j as the re-             Clearly, we can design P(e) in a number of ways. We
sult of someone initially producing a long English              might expect that the more the system knows about
phoneme sequence e, according to P(e), then trans-              English, the better it will be able to decipher the
forming it into j, according to P(j|e). The probabil-           Japanese. Our baseline P(e) is a 2-gram phoneme
ity of our observed data P(j) can be written as:                model trained on phoneme sequences from the CMU
                                                                dictionary. The second row (2a) in Figure 4 shows
                                                                results when we decipher with this fixed P(e). This
            P (j) =        P (e) · P (j|e)                      approach performs poorly and gets all the Senator
                       e
                                                                names wrong.
We take P(e) to be some fixed model of mono-
lingual English phoneme production, represented                 4.2   Consonant Parity
as a weighted finite-state acceptor (WFSA). P(j|e)               When training under non-parallel conditions, we
is implemented as the initial, uniformly-weighted               find that we would like to keep our WFST C model
WFST C described in Section 2, with 15320 phone-                small, rather than instantiating a fully-connected
mic connections.                                                model. In the supervised case, parallel training al-
   We next maximize P(j) by manipulating the sub-               lows the trained model to retain only those con-
stitution table P(j|e), aiming to produce a result              nections which were observed from the data, and
such as shown in Figure 2. We accomplish this by                this helps eliminate many bad connections from the
composing the English phoneme model P(e) WFSA                   model. In the unsupervised case, there is no parallel
with the P(j|e) transducer. We then use the EM al-              data available to help us make the right choices.
gorithm to train just the P(j|e) parameters (inside                We therefore use prior knowledge and place a
the composition that predicts j), and guess the val-            consonant-parity constraint on the WFST C model.
ues for the individual phonemic substitutions that              Prior to EM training, we throw out any mapping
maximize the likelihood of the observed data P(j).2             from the P(j|e) substitution model that does not
    2
    In our experiments, we use the Carmel finite-state trans-
                                                                have the same number of English and Japanese con-
ducer package (Graehl, 1997), a toolkit with an algorithm for   sonant phonemes. This is a pattern that we observe
EM training of weighted finite-state transducers.                across a range of transliteration tasks. Here are ex-

                                                          40
Phonemic Substitution Model                                             Name Transliteration Error
                                                                                    whole-name error norm. edit distance
           1    e → j = { 1-to-1, 1-to-2 }                                                 40                25.9
                + EM aligned with parallel data
           2a   e → j = { 1-to-1, 1-to-2 }                                                100               100.0
                + decipherment training with 2-gram English P(e)
           2b   e → j = { 1-to-1, 1-to-2 }                                                98                 89.8
                + decipherment training with 2-gram English P(e)
                + consonant-parity
           2c   e → j = { 1-to-1, 1-to-2 }                                                94                 73.6
                + decipherment training with 3-gram English P(e)
                + consonant-parity
           2d   e → j = { 1-to-1, 1-to-2 }                                                77                 57.2
                + decipherment training with a word-based English model
                + consonant-parity
           2e   e → j = { 1-to-1, 1-to-2 }                                                73                 54.2
                + decipherment training with a word-based English model
                + consonant-parity
                + initialize mappings having consonant matches with higher proba-
                bility weights

Figure 4: Results on name transliteration obtained when using the phonemic substitution model trained under different
scenarios—(1) parallel training data, (2a-e) using only monolingual resources.


amples of mappings where consonant parity is vio-                   ing. The word-based model produces complete En-
lated:                                                              glish phoneme sequences corresponding to 76,152
                                                                    actual English words from the CMU dictionary.
   K => a                   N => e e
                                                                    The English phoneme sequences are represented as
   EH => s a                EY => n
                                                                    paths through a WFSA, and all paths are weighted
Modifying the WFST C in this way leads to bet-                      equally. We represent the word-based model in com-
ter decipherment tables and slightly better results                 pact form, using determinization and minimization
for the U.S. Senator task. Normalized edit distance                 techniques applicable to weighted finite-state au-
drops from 100 to just under 90 (row 2b in Figure 4).               tomata. This allows us to perform efficient EM train-
                                                                    ing on the cascade of P(e) and P(j|e) models. Under
4.3 Better English Models                                           this scheme, English phoneme sequences resulting
Row 2c in Figure 4 shows decipherment results                       from decipherment are always analyzable into actual
when we move to a 3-gram English phoneme model                      words.
for P(e). We notice considerable improvements in                       Row 2d in Figure 4 shows the results we ob-
accuracy. On the U.S. Senator task, normalized edit                 tain when training our WFST C with a word-based
distance drops from 89.8 to 73.6, and whole-name                    English phoneme model. Using the word-based
error decreases from 98 to 94.                                      model produces the best result so far on the phone-
   When we analyze the results from deciphering                     mic substitution task with non-parallel data. On the
with a 3-gram P(e) model, we find that many of the                   U.S. Senator task, word-based decipherment outper-
Japanese phoneme test sequences are decoded into                    forms the other methods by a large margin. It gets
English phoneme sequences (such as “IH K R IH                       23 out of 100 Senator names exactly right, with a
N” and “AE G M AH N”) that are not valid words.                     much lower normalized edit distance (57.2). We
This happens because the models we used for de-                     have managed to achieve this performance using
cipherment so far have no knowledge of what con-                    only monolingual data. This also puts us within
stitutes a globally valid English sequence. To help                 reach of the parallel-trained system’s performance
the phonemic substitution model learn this infor-                   (40% whole-name errors, and 25.9 word edit dis-
mation automatically, we build a word-based P(e)                    tance error) without using a single English/Japanese
from English phoneme sequences in the CMU dic-                      pair for training.
tionary and use this model for decipherment train-                     To summarize, the quality of the English phoneme

                                                             41
e    j    P(j|e)   e    j      P(j|e)   e    j      P(j|e)   e    j      P(j|e)   e     j       P(j|e)    e       j       P(j|e)   e    j      P(j|e)   e    j      P(j|e)

   AA   a    0.37     AY   ai     0.36     EH   e      0.37     HH   h      0.45     L     r       0.3       OY      a       0.27     SH   sh y   0.22     V    b      0.34
        o    0.25          oo     0.13          a      0.24          s      0.12           n       0.19              i       0.16          m      0.11          k      0.14
        i    0.15          e      0.12          o      0.12          k      0.09           ru      0.15              yu      0.1           r      0.1           m      0.13
        u    0.08          i      0.11          i      0.12          b      0.08           ri      0.04              oi      0.1           s      0.06          s      0.07
        e    0.07          a      0.11          u      0.06          m      0.07           t       0.03              ya      0.09          p      0.06          d      0.07
        oo   0.03          uu     0.05          oo     0.04          w      0.03           mu      0.02              yo      0.08          sa     0.05          r      0.04
        ya   0.01          yu     0.02          yu     0.01          p      0.03           m       0.02              e       0.08          h      0.05          t      0.03
        aa   0.01          u      0.02          ai     0.01          g      0.03           wa      0.01              o       0.06          b      0.05          h      0.02
                           o      0.02                               ky     0.02           ta      0.01              oo      0.02          t      0.04          sh     0.01
                           ee     0.02                               d      0.02           ra      0.01              ei      0.02          k      0.04          n      0.01
   AE   a    0.52     B    b      0.41     ER   aa     0.47     IH   i      0.36     M     m       0.3       P       p       0.18     T    t      0.2      W    w      0.23
        i    0.19          p      0.12          a      0.17          e      0.25           n       0.08              pu      0.08          to     0.16          r      0.2
        e    0.11          k      0.09          u      0.08          a      0.15           k       0.08              n       0.05          ta     0.05          m      0.13
        o    0.08          m      0.07          o      0.07          u      0.09           r       0.07              k       0.05          n      0.04          s      0.08
        u    0.03          s      0.04          e      0.04          o      0.09           s       0.06              sh i    0.04          ku     0.03          k      0.07
        uu   0.02          g      0.04          oo     0.03          oo     0.01           h       0.05              ku      0.04          k      0.03          h      0.06
        oo   0.02          t      0.03          ii     0.03                                t       0.04              su      0.03          te     0.02          b      0.06
                           z      0.02          yu     0.02                                g       0.04              pa      0.03          s      0.02          t      0.04
                           d      0.02          uu     0.02                                b       0.04              t       0.02          r      0.02          p      0.04
                           ch y   0.02          i      0.02                                mu      0.03              ma      0.02          gu     0.02          d      0.02
   AH   a    0.31     CH   g      0.12     EY   ee     0.3      IY   i      0.25     N     n       0.56      PAUSE   pause   1.0      TH   k      0.21     Y    s      0.25
        o    0.23          k      0.11          a      0.22          ii     0.21           ru      0.09                                    pu     0.11          k      0.18
        i    0.17          b      0.09          i      0.11          a      0.15           su      0.04                                    ku     0.1           m      0.07
        e    0.12          sh     0.07          u      0.09          aa     0.12           mu      0.02                                    d      0.08          g      0.06
        u    0.1           s      0.07          o      0.06          u      0.07           kk u    0.02                                    hu     0.07          p      0.05
        ee   0.02          r      0.07          e      0.06          o      0.05           ku      0.02                                    su     0.05          b      0.05
        oo   0.01          ch y   0.07          oo     0.05          oo     0.02           hu      0.02                                    bu     0.04          r      0.04
        aa   0.01          p      0.06          ei     0.04          ia     0.02           to      0.01                                    ko     0.03          d      0.04
                           m      0.06          ii     0.02          ee     0.02           pp u    0.01                                    ga     0.03          ur     0.03
                           ch     0.06          uu     0.01          e      0.02           bi      0.01                                    sa     0.02          ny     0.03
   AO   o    0.29     D    d      0.16     F    h      0.18     JH   b      0.13     NG    tt o    0.21      R       r       0.53     UH   a      0.24     Z    to     0.14
        a    0.26          do     0.15          hu     0.14          k      0.1            ru      0.17              n       0.07          o      0.14          zu     0.11
        e    0.14          n      0.05          b      0.09          jy     0.1            n       0.14              ur      0.05          e      0.11          ru     0.11
        oo   0.12          to     0.03          sh i   0.07          s      0.08           kk u    0.1               ri      0.03          yu     0.1           su     0.1
        i    0.08          sh i   0.03          p      0.07          m      0.08           su      0.07              ru      0.02          ai     0.09          gu     0.09
        u    0.05          ku     0.03          m      0.06          t      0.07           mu      0.06              d       0.02          i      0.08          mu     0.07
        yu   0.03          k      0.03          r      0.04          j      0.07           dd o    0.04              t       0.01          uu     0.07          n      0.06
        ee   0.01          gu     0.03          s      0.03          h      0.07           tch i   0.03              s       0.01          oo     0.07          do     0.06
                           b      0.03          ha     0.03          sh     0.06           pp u    0.03              m       0.01          aa     0.03          ji     0.02
                           s      0.02          bu     0.02          d      0.05           jj i    0.03              k       0.01          u      0.02          ch i   0.02
   AW   oo   0.2      DH   h      0.13     G    gu     0.13     K    k      0.17     OW    a       0.3       S       su      0.4      UW   u      0.39     ZH   m      0.17
        au   0.19          r      0.12          g      0.11          n      0.1            o       0.25              n       0.11          a      0.15          p      0.16
        a    0.18          b      0.09          ku     0.08          ku     0.1            oo      0.12              ru      0.05          o      0.13          t      0.15
        ai   0.11          w      0.08          bu     0.06          kk u   0.05           u       0.09              to      0.03          uu     0.12          h      0.13
        aa   0.11          t      0.07          k      0.04          to     0.03           i       0.07              ku      0.03          i      0.04          d      0.1
        e    0.05          p      0.07          b      0.04          su     0.03           ya      0.04              sh i    0.02          yu     0.03          s      0.08
        o    0.04          g      0.06          to     0.03          sh i   0.02           e       0.04              ri      0.02          ii     0.03          b      0.07
        i    0.04          jy     0.05          t      0.03          r      0.02           uu      0.02              mu      0.02          e      0.03          r      0.05
        iy   0.02          d      0.05          ha     0.03          ko     0.02           ai      0.02              hu      0.02          oo     0.02          jy     0.03
        ea   0.01          k      0.03          d      0.03          ka     0.02           ii      0.01              ch i    0.02          ee     0.02          k      0.02


Figure 5: Phonemic substitution table learnt from non-parallel corpora. For each English phoneme, only the top ten
mappings with P(j|e) > 0.01 are shown.


model used in decipherment training has a large ef-                                      where the mapping table derived from parallel data
fect on the learnt P(j|e) phonemic substitution ta-                                      does not. Because parallel data is limited, it may not
ble (i.e., probabilities for the various phoneme map-                                    contain all of the necessary mappings.
pings within the WFST C model), which in turn af-
fects the quality of the back-transliterated English                                     4.4       Size of Japanese Training Data
output produced when decoding Japanese.
                                                                                         Monolingual corpora are more easily available than
   Figure 5 shows the phonemic substitution table                                        parallel corpora, so we can use increasing amounts
learnt using word-based decipherment. The map-                                           of monolingual Japanese training data during de-
pings are reasonable, given the lack of parallel data.                                   cipherment training. The table below shows that
They are not entirely correct—for example, the map-                                      using more Japanese training data produces bet-
ping “S → s u” is there, but “S → s” is missing.                                         ter transliteration results when deciphering with the
   Sample end-to-end transliterations are illustrated                                    word-based English model.
in Figure 6. The figure shows how the transliteration                                      Japanese training data             Error on name transliteration task
results from non-parallel training improve steadily                                       (# of phoneme sequences)           whole-name error normalized word
                                                                                                                                               edit distance
as we use stronger decipherment techniques. We                                                            4,674                    87                 69.7
note that in one case (LAUTENBERG), the deci-                                                             9,350                    77                 57.2
pherment mapping table leads to a correct answer

                                                                              42
!"#$%&'()*'+"%,)-./0%12%,0)2,/$34/&'#$5
                            !"#$%&'()*'+"%,)-./0%12%,0)2,/$34/&'#$
                     !"#$%&'()*'+"%,)-./0%12%,0)2,/$34/&'#$5          !"#$%&'()*'+"%,)-./0%12%,0)2,/$34/&'#
                                                                 !,#G,%33)?%+#,&H).#)!/,/44%4)7/&/)<3%IH
                                                                                               !,#G,%33)?%+#,&H).#)!/,/44%4)7/&/)<3
                                                                             !,#G,%33)?%+#,&H).#)!/,/44%4)7/&/)<3%IH
                                                                                                                 !,#G,%33)?%+#,&H).#)!/,/44%4)7/&/
        !"#$#%&'        ()""*+,-.%/0*"   !"#$#%&' 12)%*,#+ ()""*+,-.%/0*"
                                         1&"&''*'             =*+#>2*"?*%,- 1&"&''*' 12)%*,#+ =*+#>2*"?*%,-
                                                                                  =*+#>2*"?*%,-          =*+#>2*"?*%,      =*+#>2*"?*%,-      =*+#>2*"?
                           !"#$#%&'      ()""*+,-.%/0*"  1&"&''*'!"#$#%&'
                                         3"&#%#%$                 12)%*,#+
                                                              @A*,2)B-9C =*+#>2*"?*%,-      =*+#>2*"?*%,-
                                                                                ()""*+,-.%/0*"                =*+#>2*"?*%,
                                                                                  @A*,2)B-DC 1&"&''*' 12)%*,#+ EC
                                                                                 3"&#%#%$ @A*,2)B-DC@A*,2)B-9C =*+#>2*"?*%,-
                                                                                                         @A*,2)B                   =*+#>2*"?*%,-
                                                                                                                           @A*,2)B-DC              =*+#>
                                                                                                                                              @A*,2)B E
                                                         3"&#%#%$          @A*,2)B-9C             3"&#%#%$    @A*,2)B EC
                                                                                                                    @A*,2)B-9C     @A*,2)B-DC      @A*,
                                         45678-9::;< 45678-9::;<                 45678-9::;<    45678-9::;<
        !"#$%&          F1GH(GI-
                          !"#$%&          F1GH(GI-
                                         F1GH(GI-   F1GH(GI F1GH(GI;!:*6)*<=9=) F1GH(GI F1GH(GI;!:*6)*<=9=)
                                         !"#$%& F1GH(GI-
                                                   F1GH(GI-
                                                      !"#$%&      F1GH(GI-           F1GH(GI-
                                                                 F1GH(GI- ;!:*6)*<=9=)
                                                                                F1GH(GI- F1GH(GI-                F1GH(GI-
                                                                                                       ;!:*6)*<=9=)  F1GH
        '%()*+            '%()*+
                        .JI.K.A-          .JI.K.A   67689:.)
                                         '%()*+ .JI.K.A
                                         .JI.K.A-  .JI.K.A- 67689:.)
                                                      '%()*+      .JI.K.A
                                                                 .JI.K.A-            .JI.K.A-
                                                                                .JI.K.A .JI.K.A-
                                                                                    67689:.) 67689:.)            .JI.K.A
                                                                                                                     .JI.
                          ,-'.&          =.HLGM-.7.7.-
                                         =.HLGM :*:)>: =.HLGM,-'.& >=?6:)6:;2) :*:)>: =.HLGM :*:)>:
                                                             :*:)>:
        ,-'.&           =.HLGM-.7.7.-    ,-'.&            >=?6:)6:;2)    =.HLGM ;*@6?A.B)6:;2) 7:.A6886):C:*=)
                                                         =.HLGM-.7.7.- =.HLGM-.7.7.-
                                                                          ;*@6?A.B)6:;2) >=?6:)6:;2)>=?6:)6:;2)   ;*@6?A.B)6:;2) 7:.A
                                                                                           7:.A6886):C:*=) ;*@6?A.B)6:;2) 7:.A6886
                           /00                               /00
        /00                              /00
                           123#&         N.OHG-.MM.I=-    N.OHG-.MM.I=
                                                               123#& *@=A*6)D=@.)   N.OHG CE?7)   N.OHG-.MM.I=-
                                                                          N.OHG-.MM.I=- N.OHG-.MM.I=  *@=A*6)D=@.)               N.OHG CE?7)      N.O
        123#&           N.OHG-.MM.I=-
                          /)%4           N.OHG-.MM.I= N.OHG-.MM.I=- N.OHG-.MM.I=
                                         123#&          *@=A*6)D=@.)
                                                          /)%4            N.OHG CE?7)       N.OHG-.MM.I=- N.OHG CE?7)
                                                                                         *@=A*6)D=@.)                        N.OHG-.
        /)%4               567!&         /)%4
                                         A.P-J.Q(QF- A.P-J.Q(QF   966;6)D:96;)   A.P C==>;)     A.P F=*<;)
                                                          567!&        A.P-J.Q(QF-    A.P-J.Q(QF     966;6)D:96;) A.P C==>;)     A.P
        567!&             8%0!
                        A.P-J.Q(QF-      567!&
                                         A.P-J.Q(QF             8%0!
                                                              966;6)D:96;)
                                                              A.P-J.Q(QF-        A.P C==>;)
                                                                                A.P-J.Q(QF           A.P F=*<;)
                                                                                                    966;6)D:96;)        A.P C==>;)           A.P F=*
        8%0!               8(&9:6; 8%0!
                                   J!J-JGHHG33-                 8(&9:6; J!J-JGHHG33-
                                                          J!J-JGHHG33 9E);*@6?A.B) D:*>)CA88A.B)   C=2@)C6..A.B)
                                                                                         J!J-JGHHG33    9E);*@6?A.B)             D:*>)CA88A.B)    C=2@
        8(&9:6;         J!J-JGHHG33-
                          <=>?&          8(&9:6;R!FG1K-JL=GH
                                          J!J-JGHHG33
                                         R!FG1K-JL=GH-  9E);*@6?A.B) R!FG1K-JL=GH-
                                                       J!J-JGHHG33-    D:*>)CA88A.B)
                                                         <=>?& @=<;6)8:C=?)                C=2@)C6..A.B) D:*>)CA88A.B)
                                                                      J!J-JGHHG33 R!FG1K-JL=GH
                                                                                          9E);*@6?A.B)
                                                                                D:!:.)CA76.)   R!FG1K-JL=GH-
                                                                                                    @=<;6)8:C=?)             C=2@)C6.
                                                                                                                  D:!:.)CA76.)   R!FG
                           @3A#                                 @3A#
        <=>?&           R!FG1K-JL=GH-
                          <2?&
                                         <=>?&
                                          R!FG1K-JL=GH       @=<;6)8:C=?) RGSS-JLH5.A.H- C6.D:9A.) RGSS C6.D:9A.)D:!:.)CA76.)
                                                           R!FG1K-JL=GH-
                                                               <2?&
                                                                             D:!:.)CA76.)      R!FG1K-JL=GH-
                                                                            R!FG1K-JL=GH RGSS-JLH5.A.H D=@.)!F6AFF6?
                                         RGSS-JLH5.A.H- RGSS-JLH5.A.H D=@.)!F6AFF6?  RGSS
                                                                                              @=<;6)8:C=?)                          R!FG1K-J
                                                                                                                         RGSS C6.D:9A.)  RGSS
        @3A#               B#C5#         @3A#                  B#C5#
        <2?&            RGSS-JLH5.A.H- <2?&
                           ?)#7&        RGSS-JLH5.A.H RGSS-JLH5.A.H- SI.H7- C6.D:9A.) D=@.)!F6AFF6?
                                       SI.H7-            D=@.)!F6AFF6?  RGSS-JLH5.A.H
                                                                         RGSS
                                                     SI.H7 ?)#7& ;<.)F8A.2)           SI.H7
                                                                                 F?:.*6)   RGSS C6.D:9A.) RGSS C6.D:9A.)
                                                                                                      ;<.)F8A.2)
                                                                                                 F?:.*6)          F?:.*6)   RGSS C6.D
                                                                                                                                  F?:.
        B#C5#              D%E#@%7 B#C5#
                                       M.Q3GHJGI5          D%E#@%7 M.Q3GHJGI5
                                                     8:<26.C:*@                       8:<26.C:*@M.Q3GHJGI5-
                                                                                 M.Q3GHJGI5-                      M.Q3GHJGI5-     M.Q3

        ?)#7&           SI.H7-           ?)#7&
                                         SI.H7                SI.H7-
                                                               ;<.)F8A.2)       SI.H7
                                                                                 F?:.*6)            ;<.)F8A.2)
                                                                                                     F?:.*6)            F?:.*6)              F?:.*6)
        D%E#@%7         M.Q3GHJGI5       D%E#@%7
                                         8:<26.C:*@           M.Q3GHJGI5        8:<26.C:*@
                                                                                 M.Q3GHJGI5-         M.Q3GHJGI5-        M.Q3GHJGI5-          M.Q3GHJ


Figure 6: Results for end-to-end name transliteration. This figure shows the correct answer, the answer obtained
by training mappings on parallel data (Knight and Graehl, 1998), and various answers obtained by deciphering non-
parallel data. Method 1 uses a 2-gram P(e), Method 2 uses a 3-gram P(e), and Method 3 uses a word-based P(e).


4.5 P(j|e) Initialization                                        are weighted X (a constant) times higher than
So far, the P(j|e) connections within the WFST C               other mappings such as:
model were initialized with uniform weights prior                 N    => b              N => r
to EM training. It is a known fact that the EM al-                D    => B              EY => a a
gorithm does not necessarily find a global minimum                 in the P(j|e) model. In our experiments, we set
for the given objective function. If the search space          the value X to 100.
is bumpy and non-convex as is the case in our prob-               Initializing the WFST C in this way results in EM
lem, EM can get stuck in any of the local minima               learning better substitution tables and yields slightly
depending on what weights were used to initialize              better results for the Senator task. Normalized edit
the search. Different sets of initialization weights           distance drops from 57.2 to 54.2, and the whole-
can lead to different convergence points during EM             name error is also reduced from 77% to 73% (row
training, or in other words, depending on how the              2e in Figure 4).
P(j|e) probabilities are initialized, the final P(j|e)
substitution table learnt by EM can vary.                      4.6    Size of English Training Data
   We can use some prior knowledge to initialize the           We saw earlier (in Section 4.4) that using more
probability weights in our WFST C model, so as to              monolingual Japanese training data yields improve-
give EM a good starting point to work with. In-                ments in decipherment results. Similarly, we hy-
stead of using uniform weights, in the P(j|e) model            pothesize that using more monolingual English data
we set higher weights for the mappings where En-               can drive the decipherment towards better translit-
glish and Japanese sounds share common consonant               eration results. On the English side, we build dif-
phonemes.                                                      ferent word-based P(e) models, each trained on dif-
   For example, mappings such as:                              ferent amounts of data (English phoneme sequences
   N => n             N => a n                                 from the CMU dictionary). The table below shows
   D => d             D => d o                                 that deciphering with a word-based English model

                                                         43
built from more data produces better transliteration                     quence, the correct back-transliterated phoneme se-
results.                                                                 quence is present somewhere in the English data)
    English training data      Error on name transliteration task        and apply the same decipherment strategy using a
    (# of phoneme sequences)   whole-name error normalized word          word-based English model. The table below com-
                                                 edit distance
             76,152                  73                 54.2             pares the transliteration results for the U.S. Sena-
             97,912                  66                 49.3             tor task, when using comparable versus non-parallel
   This yields the best transliteration results on the                   data for decipherment training. While training on
Senator task with non-parallel data, getting 34 out                      comparable corpora does have benefits and reduces
of 100 Senator names exactly right.                                      the whole-name error to 59% on the Senator task, it
                                                                         is encouraging to see that our best decipherment re-
4.7 Re-ranking Results Using the Web                                     sults using only non-parallel data comes close (66%
                                                                         error).
It is possible to improve our results on the U.S. Sen-
ator task further using external monolingual re-
                                                                          English/Japanese Corpora   Error on name transliteration task
sources. Web counts are frequently used to auto-                          (# of phoneme sequences)   whole-name error normalized word
matically re-rank candidate lists for various NLP                                                                      edit distance
                                                                          Comparable Corpora               59                 41.8
tasks (Al-Onaizan and Knight, 2002). We extract                           (English = 2,608
the top 10 English candidates produced by our word-                       Japanese = 2,455)
                                                                          Non-Parallel Corpora            66                 49.3
based decipherment method for each Japanese test                          (English = 98,000
name. Using a search engine, we query the entire                          Japanese = 9,350)
English name (first and last name) corresponding to
each candidate, and collect search result counts. We
then re-rank the candidates using the collected Web
counts and pick the most frequent candidate as our                       6 Conclusion
choice.
    For example, France Murkowski gets only 1 hit                        We have presented a method for attacking machine
on Google, whereas Frank Murkowski gets 135,000                          transliteration problems without parallel data. We
hits. Re-ranking the results in this manner lowers                       developed phonemic substitution tables trained us-
the whole-name error on the Senator task from 66%                        ing only monolingual resources and demonstrated
to 61%, and also lowers the normalized edit dis-                         their performance in an end-to-end name translitera-
tance from 49.3 to 48.8. However, we do note that                        tion task. We showed that consistent improvements
re-ranking using Web counts produces similar im-                         in transliteration performance are possible with the
provements in the case of parallel training as well                      use of strong decipherment techniques, and our best
and lowers the whole-name error from 40% to 24%.                         system achieves significant improvements over the
    So, the re-ranking idea, which is simple and re-                     baseline system. In future work, we would like to
quires only monolingual resources, seems like a nice                     develop more powerful decipherment models and
strategy to apply at the end of transliteration exper-                   techniques, and we would like to harness the infor-
iments (during decoding), and can result in further                      mation available from a wide variety of monolingual
gains on the final transliteration performance.                           resources, and use it to further narrow the gap be-
                                                                         tween parallel-trained and non-parallel-trained ap-
5      Comparable versus Non-Parallel                                    proaches.
       Corpora
We also present decipherment results when using                          7 Acknowledgements
comparable corpora for training the WFST C model.
We use English and Japanese phoneme sequences                            This research was supported by the Defense Ad-
derived from a parallel corpus containing 2,683                          vanced Research Projects Agency under SRI Inter-
phoneme sequence pairs to construct comparable                           national’s prime Contract Number NBCHD040058.
corpora (such that for each Japanese phoneme se-

                                                                    44
References                                                   J. Oh and H. Isahara. 2006. Mining the web for translit-
                                                                eration lexicons: Joint-validation approach. In Proc.
Y. Al-Onaizan and K. Knight. 2002. Translating named
                                                                of the IEEE/WIC/ACM International Conference on
   entities using monolingual and bilingual resources. In
                                                                Web Intelligence.
   Proc. of ACL.
                                                             S. Ravi and K. Knight. 2008. Attacking decipherment
A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977.
                                                                problems optimally with low-order n-gram models. In
   Maximum likelihood from incomplete data via the EM
                                                                Proc. of EMNLP.
   algorithm. Journal of the Royal Statistical Society Se-
                                                             S. Ravi and K. Knight. 2009. Probabilistic methods for a
   ries, 39(4):1–38.
                                                                Japanese syllable cipher. In Proc. of the International
Y. Goldberg and M. Elhadad. 2008. Identification of
                                                                Conference on the Computer Processing of Oriental
   transliterated foreign words in Hebrew script. In Proc.
                                                                Languages (ICCPOL).
   of CICLing.
                                                             T. Sherif and G. Kondrak. 2007a. Bootstrapping a
D. Goldwasser and D. Roth. 2008a. Active sample se-
                                                                stochastic transducer for arabic-english transliteration
   lection for named entity transliteration. In Proc. of
                                                                extraction. In Proc. of ACL.
   ACL/HLT Short Papers.
                                                             T. Sherif and G. Kondrak. 2007b. Substring-based
D. Goldwasser and D. Roth. 2008b. Transliteration as
                                                                transliteration. In Proc. of ACL.
   constrained optimization. In Proc. of EMNLP.
                                                             R. Sproat, T. Tao, and C. Zhai. 2006. Named entity
S. Goldwater and L. Griffiths, T. 2007. A fully Bayesian
                                                                transliteration with comparable corpora. In Proc. of
   approach to unsupervised part-of-speech tagging. In
                                                                ACL.
   Proc. of ACL.
                                                             T. Tao, S. Yoon, A. Fister, R. Sproat, and C. Zhai. 2006.
J. Graehl.        1997.        Carmel finite-state toolkit.
                                                                Unsupervised named entity transliteration using tem-
   http://www.isi.edu/licensed-sw/carmel.
                                                                poral and phonetic correlation. In Proc. of EMNLP.
L. Haizhou, Z. Min, and S. Jian. 2004. A joint source-
   channel model for machine transliteration. In Proc. of    J. Wu and S. Chang, J. 2007. Learning to find English
   ACL.                                                         to Chinese transliterations on the web. In Proc. of
                                                                EMNLP/CoNLL.
U. Hermjakob, K. Knight, and H. Daume. 2008. Name
   translation in statistical machine translation—learning   D. Yarowsky. 1995. Unsupervised word sense disam-
   when to transliterate. In Proc. of ACL/HLT.                  biguation rivaling supervised methods. In Proc. of
                                                                ACL.
F. Huang, S. Vogel, and A. Waibel. 2004. Improving
   named entity translation combining phonetic and se-       S. Yoon, K. Kim, and R. Sproat. 2007. Multilingual
   mantic similarities. In Proc. of HLT/NAACL.                  transliteration using feature based phonetic method. In
                                                                Proc. of ACL.
S. Karimi, F. Scholer, and A. Turpin. 2007. Col-
   lapsed consonant and vowel models: New ap-                D. Zelenko and C. Aone. 2006. Discriminative methods
   proaches for English-Persian transliteration and back-       for transliteration. In Proc. of EMNLP.
   transliteration. In Proc. of ACL.
A. Klementiev and D. Roth. 2008. Named entity translit-
   eration and discovery in multilingual corpora. In
   Learning Machine Translation. MIT press.
K. Knight and J. Graehl. 1998. Machine transliteration.
   Computational Linguistics, 24(4):599–612.
K. Knight and K. Yamada. 1999. A computational ap-
   proach to deciphering unknown scripts. In Proc. of the
   ACL Workshop on Unsupervised Learning in Natural
   Language Processing.
K. Knight, A. Nair, N. Rathod, and K. Yamada. 2006.
   Unsupervised analysis for decipherment problems. In
   Proc. of COLING/ACL.
J. Kuo, H. Li, and Y. Yang. 2006. Learning translitera-
   tion lexicons from the web. In Proc. of ACL/COLING.
H. Li, C. Sim, K., J. Kuo, and M. Dong. 2007. Semantic
   transliteration of personal names. In Proc. of ACL.
M. Nagata, T. Saito, and K. Suzuki. 2001. Using the web
   as a bilingual dictionary. In Proc. of the ACL Work-
   shop on Data-driven Methods in Machine Translation.

                                                       45

More Related Content

What's hot

Corpus-based part-of-speech disambiguation of Persian
Corpus-based part-of-speech disambiguation of PersianCorpus-based part-of-speech disambiguation of Persian
Corpus-based part-of-speech disambiguation of PersianIDES Editor
 
Cross-Language Masked Translation Priming in High- Proficiency Chinese-Engli...
 Cross-Language Masked Translation Priming in High- Proficiency Chinese-Engli... Cross-Language Masked Translation Priming in High- Proficiency Chinese-Engli...
Cross-Language Masked Translation Priming in High- Proficiency Chinese-Engli...English Literature and Language Review ELLR
 
Feature Structure Unification Syntactic Parser 2.0
Feature Structure Unification Syntactic Parser 2.0Feature Structure Unification Syntactic Parser 2.0
Feature Structure Unification Syntactic Parser 2.0rcaneba
 
ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015
ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015
ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015RIILP
 
GSCL2013.Phrase Tagset Mapping for French and English Treebanks and Its Appli...
GSCL2013.Phrase Tagset Mapping for French and English Treebanks and Its Appli...GSCL2013.Phrase Tagset Mapping for French and English Treebanks and Its Appli...
GSCL2013.Phrase Tagset Mapping for French and English Treebanks and Its Appli...Lifeng (Aaron) Han
 
Capturing Word-level Dependencies in Morpheme-based Language Modeling
Capturing Word-level Dependencies in Morpheme-based Language ModelingCapturing Word-level Dependencies in Morpheme-based Language Modeling
Capturing Word-level Dependencies in Morpheme-based Language ModelingGuy De Pauw
 
A survey on parallel corpora alignment
A survey on parallel corpora alignment A survey on parallel corpora alignment
A survey on parallel corpora alignment andrefsantos
 
Statistical machine translation
Statistical machine translationStatistical machine translation
Statistical machine translationHrishikesh Nair
 
ACL読み会2014@PFI "Less Grammar, More Features"
ACL読み会2014@PFI "Less Grammar, More Features"ACL読み会2014@PFI "Less Grammar, More Features"
ACL読み会2014@PFI "Less Grammar, More Features"nozyh
 
Natural Language Generation from First-Order Expressions
Natural Language Generation from First-Order ExpressionsNatural Language Generation from First-Order Expressions
Natural Language Generation from First-Order ExpressionsThomas Mathew
 
5a use of annotated corpus
5a use of annotated corpus5a use of annotated corpus
5a use of annotated corpusThennarasuSakkan
 
A Vietnamese Language Model Based on Recurrent Neural Network
A Vietnamese Language Model Based on Recurrent Neural NetworkA Vietnamese Language Model Based on Recurrent Neural Network
A Vietnamese Language Model Based on Recurrent Neural NetworkViet-Trung TRAN
 
PARSING ARABIC VERB PHRASES USING PREGROUP GRAMMARS
PARSING ARABIC VERB PHRASES USING PREGROUP GRAMMARSPARSING ARABIC VERB PHRASES USING PREGROUP GRAMMARS
PARSING ARABIC VERB PHRASES USING PREGROUP GRAMMARSijnlc
 

What's hot (20)

Corpus-based part-of-speech disambiguation of Persian
Corpus-based part-of-speech disambiguation of PersianCorpus-based part-of-speech disambiguation of Persian
Corpus-based part-of-speech disambiguation of Persian
 
Cross-Language Masked Translation Priming in High- Proficiency Chinese-Engli...
 Cross-Language Masked Translation Priming in High- Proficiency Chinese-Engli... Cross-Language Masked Translation Priming in High- Proficiency Chinese-Engli...
Cross-Language Masked Translation Priming in High- Proficiency Chinese-Engli...
 
Feature Structure Unification Syntactic Parser 2.0
Feature Structure Unification Syntactic Parser 2.0Feature Structure Unification Syntactic Parser 2.0
Feature Structure Unification Syntactic Parser 2.0
 
ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015
ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015
ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015
 
Phrase Structure
Phrase StructurePhrase Structure
Phrase Structure
 
X bar ppt
X bar pptX bar ppt
X bar ppt
 
GSCL2013.Phrase Tagset Mapping for French and English Treebanks and Its Appli...
GSCL2013.Phrase Tagset Mapping for French and English Treebanks and Its Appli...GSCL2013.Phrase Tagset Mapping for French and English Treebanks and Its Appli...
GSCL2013.Phrase Tagset Mapping for French and English Treebanks and Its Appli...
 
Capturing Word-level Dependencies in Morpheme-based Language Modeling
Capturing Word-level Dependencies in Morpheme-based Language ModelingCapturing Word-level Dependencies in Morpheme-based Language Modeling
Capturing Word-level Dependencies in Morpheme-based Language Modeling
 
Modality-Preserving Phrase-based Statistical Machine Translation
Modality-Preserving Phrase-based Statistical Machine TranslationModality-Preserving Phrase-based Statistical Machine Translation
Modality-Preserving Phrase-based Statistical Machine Translation
 
P-6
P-6P-6
P-6
 
A survey on parallel corpora alignment
A survey on parallel corpora alignment A survey on parallel corpora alignment
A survey on parallel corpora alignment
 
Ijetcas14 575
Ijetcas14 575Ijetcas14 575
Ijetcas14 575
 
Statistical machine translation
Statistical machine translationStatistical machine translation
Statistical machine translation
 
ACL読み会2014@PFI "Less Grammar, More Features"
ACL読み会2014@PFI "Less Grammar, More Features"ACL読み会2014@PFI "Less Grammar, More Features"
ACL読み会2014@PFI "Less Grammar, More Features"
 
Natural Language Generation from First-Order Expressions
Natural Language Generation from First-Order ExpressionsNatural Language Generation from First-Order Expressions
Natural Language Generation from First-Order Expressions
 
5a use of annotated corpus
5a use of annotated corpus5a use of annotated corpus
5a use of annotated corpus
 
A Vietnamese Language Model Based on Recurrent Neural Network
A Vietnamese Language Model Based on Recurrent Neural NetworkA Vietnamese Language Model Based on Recurrent Neural Network
A Vietnamese Language Model Based on Recurrent Neural Network
 
UWB semeval2016-task5
UWB semeval2016-task5UWB semeval2016-task5
UWB semeval2016-task5
 
Formal languages
Formal languagesFormal languages
Formal languages
 
PARSING ARABIC VERB PHRASES USING PREGROUP GRAMMARS
PARSING ARABIC VERB PHRASES USING PREGROUP GRAMMARSPARSING ARABIC VERB PHRASES USING PREGROUP GRAMMARS
PARSING ARABIC VERB PHRASES USING PREGROUP GRAMMARS
 

Viewers also liked

Viewers also liked (9)

Live@edu benefit sso benefit
Live@edu benefit sso benefitLive@edu benefit sso benefit
Live@edu benefit sso benefit
 
Diaz payen
Diaz payenDiaz payen
Diaz payen
 
Mobile app
Mobile appMobile app
Mobile app
 
Designing web navigation
Designing web navigationDesigning web navigation
Designing web navigation
 
Grid theory
Grid theoryGrid theory
Grid theory
 
Innovation china india_sep.07_ramamurthy
Innovation china india_sep.07_ramamurthyInnovation china india_sep.07_ramamurthy
Innovation china india_sep.07_ramamurthy
 
Fast web design
Fast web designFast web design
Fast web design
 
File foldername
File foldernameFile foldername
File foldername
 
ธนาคารธนชาต จำกัด
ธนาคารธนชาต จำกัดธนาคารธนชาต จำกัด
ธนาคารธนชาต จำกัด
 

Similar to Learning phoneme mappings for transliteration without parallel data

Lecture Notes-Are Natural Languages Regular.pdf
Lecture Notes-Are Natural Languages Regular.pdfLecture Notes-Are Natural Languages Regular.pdf
Lecture Notes-Are Natural Languages Regular.pdfDeptii Chaudhari
 
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...IJITE
 
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...ijrap
 
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...gerogepatton
 
Lectura 3.5 word normalizationintwitter finitestate_transducers
Lectura 3.5 word normalizationintwitter finitestate_transducersLectura 3.5 word normalizationintwitter finitestate_transducers
Lectura 3.5 word normalizationintwitter finitestate_transducersMatias Menendez
 
DEVELOPING A SIMPLIFIED MORPHOLOGICAL ANALYZER FOR ARABIC PRONOMINAL SYSTEM
DEVELOPING A SIMPLIFIED MORPHOLOGICAL ANALYZER FOR ARABIC PRONOMINAL SYSTEMDEVELOPING A SIMPLIFIED MORPHOLOGICAL ANALYZER FOR ARABIC PRONOMINAL SYSTEM
DEVELOPING A SIMPLIFIED MORPHOLOGICAL ANALYZER FOR ARABIC PRONOMINAL SYSTEMkevig
 
Usage of regular expressions in nlp
Usage of regular expressions in nlpUsage of regular expressions in nlp
Usage of regular expressions in nlpeSAT Journals
 
Computational linguistics
Computational linguisticsComputational linguistics
Computational linguisticsshrey bhate
 
Condi Rice - American Dialect Society
Condi Rice - American Dialect SocietyCondi Rice - American Dialect Society
Condi Rice - American Dialect SocietyLauren Hall-Lew
 
referát.doc
referát.docreferát.doc
referát.docbutest
 
Latent Semantic Transliteration using Dirichlet Mixture
Latent Semantic Transliteration using Dirichlet MixtureLatent Semantic Transliteration using Dirichlet Mixture
Latent Semantic Transliteration using Dirichlet MixtureRakuten Group, Inc.
 
A Study Of Statistical Models For Query Translation :Finding A Good Unit Of T...
A Study Of Statistical Models For Query Translation :Finding A Good Unit Of T...A Study Of Statistical Models For Query Translation :Finding A Good Unit Of T...
A Study Of Statistical Models For Query Translation :Finding A Good Unit Of T...iyo
 
Lecture Notes-Finite State Automata for NLP.pdf
Lecture Notes-Finite State Automata for NLP.pdfLecture Notes-Finite State Automata for NLP.pdf
Lecture Notes-Finite State Automata for NLP.pdfDeptii Chaudhari
 
Final quantitative analysis of egyptian aphorisms by using r
Final quantitative analysis of egyptian aphorisms by using rFinal quantitative analysis of egyptian aphorisms by using r
Final quantitative analysis of egyptian aphorisms by using rAlexandria University
 
Multi lingual text-processing
Multi lingual text-processingMulti lingual text-processing
Multi lingual text-processingNAVER Engineering
 
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS cscpconf
 
Fast and Accurate Preordering for SMT using Neural Networks
Fast and Accurate Preordering for SMT using Neural NetworksFast and Accurate Preordering for SMT using Neural Networks
Fast and Accurate Preordering for SMT using Neural NetworksSDL
 
Language Comparison (Korean, Japanese and English)
Language Comparison (Korean, Japanese and English)Language Comparison (Korean, Japanese and English)
Language Comparison (Korean, Japanese and English)MIN KYUNG LEE
 

Similar to Learning phoneme mappings for transliteration without parallel data (20)

Lecture Notes-Are Natural Languages Regular.pdf
Lecture Notes-Are Natural Languages Regular.pdfLecture Notes-Are Natural Languages Regular.pdf
Lecture Notes-Are Natural Languages Regular.pdf
 
Ceis 3
Ceis 3Ceis 3
Ceis 3
 
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
 
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
 
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...
 
Lectura 3.5 word normalizationintwitter finitestate_transducers
Lectura 3.5 word normalizationintwitter finitestate_transducersLectura 3.5 word normalizationintwitter finitestate_transducers
Lectura 3.5 word normalizationintwitter finitestate_transducers
 
DEVELOPING A SIMPLIFIED MORPHOLOGICAL ANALYZER FOR ARABIC PRONOMINAL SYSTEM
DEVELOPING A SIMPLIFIED MORPHOLOGICAL ANALYZER FOR ARABIC PRONOMINAL SYSTEMDEVELOPING A SIMPLIFIED MORPHOLOGICAL ANALYZER FOR ARABIC PRONOMINAL SYSTEM
DEVELOPING A SIMPLIFIED MORPHOLOGICAL ANALYZER FOR ARABIC PRONOMINAL SYSTEM
 
Usage of regular expressions in nlp
Usage of regular expressions in nlpUsage of regular expressions in nlp
Usage of regular expressions in nlp
 
Usage of regular expressions in nlp
Usage of regular expressions in nlpUsage of regular expressions in nlp
Usage of regular expressions in nlp
 
Computational linguistics
Computational linguisticsComputational linguistics
Computational linguistics
 
Condi Rice - American Dialect Society
Condi Rice - American Dialect SocietyCondi Rice - American Dialect Society
Condi Rice - American Dialect Society
 
referát.doc
referát.docreferát.doc
referát.doc
 
Latent Semantic Transliteration using Dirichlet Mixture
Latent Semantic Transliteration using Dirichlet MixtureLatent Semantic Transliteration using Dirichlet Mixture
Latent Semantic Transliteration using Dirichlet Mixture
 
A Study Of Statistical Models For Query Translation :Finding A Good Unit Of T...
A Study Of Statistical Models For Query Translation :Finding A Good Unit Of T...A Study Of Statistical Models For Query Translation :Finding A Good Unit Of T...
A Study Of Statistical Models For Query Translation :Finding A Good Unit Of T...
 
Lecture Notes-Finite State Automata for NLP.pdf
Lecture Notes-Finite State Automata for NLP.pdfLecture Notes-Finite State Automata for NLP.pdf
Lecture Notes-Finite State Automata for NLP.pdf
 
Final quantitative analysis of egyptian aphorisms by using r
Final quantitative analysis of egyptian aphorisms by using rFinal quantitative analysis of egyptian aphorisms by using r
Final quantitative analysis of egyptian aphorisms by using r
 
Multi lingual text-processing
Multi lingual text-processingMulti lingual text-processing
Multi lingual text-processing
 
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS
 
Fast and Accurate Preordering for SMT using Neural Networks
Fast and Accurate Preordering for SMT using Neural NetworksFast and Accurate Preordering for SMT using Neural Networks
Fast and Accurate Preordering for SMT using Neural Networks
 
Language Comparison (Korean, Japanese and English)
Language Comparison (Korean, Japanese and English)Language Comparison (Korean, Japanese and English)
Language Comparison (Korean, Japanese and English)
 

More from Attaporn Ninsuwan

More from Attaporn Ninsuwan (20)

J query fundamentals
J query fundamentalsJ query fundamentals
J query fundamentals
 
Jquery enlightenment
Jquery enlightenmentJquery enlightenment
Jquery enlightenment
 
Jquery-Begining
Jquery-BeginingJquery-Begining
Jquery-Begining
 
Br ainfocom94
Br ainfocom94Br ainfocom94
Br ainfocom94
 
Chapter 12 - Computer Forensics
Chapter 12 - Computer ForensicsChapter 12 - Computer Forensics
Chapter 12 - Computer Forensics
 
Techniques for data hiding p
Techniques for data hiding pTechniques for data hiding p
Techniques for data hiding p
 
Stop badware infected_sites_report_062408
Stop badware infected_sites_report_062408Stop badware infected_sites_report_062408
Stop badware infected_sites_report_062408
 
Steganography past-present-future 552
Steganography past-present-future 552Steganography past-present-future 552
Steganography past-present-future 552
 
Ch03-Computer Security
Ch03-Computer SecurityCh03-Computer Security
Ch03-Computer Security
 
Ch02-Computer Security
Ch02-Computer SecurityCh02-Computer Security
Ch02-Computer Security
 
Ch01-Computer Security
Ch01-Computer SecurityCh01-Computer Security
Ch01-Computer Security
 
Ch8-Computer Security
Ch8-Computer SecurityCh8-Computer Security
Ch8-Computer Security
 
Ch7-Computer Security
Ch7-Computer SecurityCh7-Computer Security
Ch7-Computer Security
 
Ch6-Computer Security
Ch6-Computer SecurityCh6-Computer Security
Ch6-Computer Security
 
Ch06b-Computer Security
Ch06b-Computer SecurityCh06b-Computer Security
Ch06b-Computer Security
 
Ch5-Computer Security
Ch5-Computer SecurityCh5-Computer Security
Ch5-Computer Security
 
Ch04-Computer Security
Ch04-Computer SecurityCh04-Computer Security
Ch04-Computer Security
 
Chapter5 - The Discrete-Time Fourier Transform
Chapter5 - The Discrete-Time Fourier TransformChapter5 - The Discrete-Time Fourier Transform
Chapter5 - The Discrete-Time Fourier Transform
 
Chapter4 - The Continuous-Time Fourier Transform
Chapter4 - The Continuous-Time Fourier TransformChapter4 - The Continuous-Time Fourier Transform
Chapter4 - The Continuous-Time Fourier Transform
 
Chapter3 - Fourier Series Representation of Periodic Signals
Chapter3 - Fourier Series Representation of Periodic SignalsChapter3 - Fourier Series Representation of Periodic Signals
Chapter3 - Fourier Series Representation of Periodic Signals
 

Learning phoneme mappings for transliteration without parallel data

  • 1. Learning Phoneme Mappings for Transliteration without Parallel Data Sujith Ravi and Kevin Knight University of Southern California Information Sciences Institute Marina del Rey, California 90292 {sravi,knight}@isi.edu Abstract We explore the third dimension, where we see several techniques in use: We present a method for performing machine transliteration without any parallel resources. • Manually-constructed transliteration models, We frame the transliteration task as a deci- e.g., (Hermjakob et al., 2008). pherment problem and show that it is possi- ble to learn cross-language phoneme mapping • Models constructed from bilingual dictionaries tables using only monolingual resources. We of terms and names, e.g., (Knight and Graehl, compare various methods and evaluate their 1998; Huang et al., 2004; Haizhou et al., 2004; accuracies on a standard name transliteration Zelenko and Aone, 2006; Yoon et al., 2007; task. Li et al., 2007; Karimi et al., 2007; Sherif and Kondrak, 2007b; Goldwasser and Roth, 1 Introduction 2008b). Transliteration refers to the transport of names and • Extraction of parallel examples from bilin- terms between languages with different writing sys- gual corpora, using bootstrap dictionaries e.g., tems and phoneme inventories. Recently there has (Sherif and Kondrak, 2007a; Goldwasser and been a large amount of interesting work in this Roth, 2008a). area, and the literature has outgrown being citable in its entirety. Much of this work focuses on back- • Extraction of parallel examples from compara- transliteration, which tries to restore a name or ble corpora, using bootstrap dictionaries, and term that has been transported into a foreign lan- temporal and word co-occurrence, e.g., (Sproat guage. Here, there is often only one correct target et al., 2006; Klementiev and Roth, 2008). spelling—for example, given jyon.kairu (the • Extraction of parallel examples from web name of a U.S. Senator transported to Japanese), we queries, using bootstrap dictionaries, e.g., (Na- must output “Jon Kyl”, not “John Kyre” or any other gata et al., 2001; Oh and Isahara, 2006; Kuo et variation. al., 2006; Wu and Chang, 2007). There are many techniques for transliteration and back-transliteration, and they vary along a number • Comparing terms from different languages in of dimensions: phonetic space, e.g., (Tao et al., 2006; Goldberg and Elhadad, 2008). • phoneme substitution vs. character substitution • heuristic vs. generative vs. discriminative mod- In this paper, we investigate methods to acquire els transliteration mappings from non-parallel sources. We are inspired by previous work in unsupervised • manual vs. automatic knowledge acquisition learning for natural language, e.g. (Yarowsky, 1995; 37 Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pages 37–45, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics
  • 2. English word English sound Japanese sound sequence sequence sequence Japanese katakana WFSA - A WFST - B WFST - C WFST - D sequence ( S P EH N S ER ( S U P E N S A A ( SPENCER ABRAHAM ) (スペンサー・エーブラハム) EY B R AH HH AE M ) E E B U R A H A M U ) Figure 1: Model used for back-transliteration of Japanese katakana names and terms into English. The model employs a four-stage cascade of weighted finite-state transducers (Knight and Graehl, 1998). Goldwater and Griffiths, 2007), and we are also in- all 1-to-1 and 1-to-2 phoneme connections between spired by cryptanalysis—we view a corpus of for- English and Japanese. Phoneme bigrams that occur eign terms as a code for English, and we attempt to fewer than 10 times in a Japanese corpus are omit- break the code. ted, and we omit 1-to-3 connections. This initial WFST C model has 15320 uniformly weighted pa- 2 Background rameters. We then train the model on 3343 phoneme We follow (Knight and Graehl, 1998) in tackling string pairs from a bilingual dictionary, using the back-transliteration of Japanese katakana expres- EM algorithm. EM immediately reduces the con- sions into English. Knight and Graehl (1998) devel- nections in the model to those actually observed in oped a four-stage cascade of finite-state transducers, the parallel data, and after 14 iterations, there are shown in Figure 1. only 188 connections left with P(j|e) ≥ 0.01. Fig- ure 2 shows the phonemic substitution table learnt • WFSA A - produces an English word sequence from parallel training. w with probability P(w) (based on a unigram We use this trained WFST C model and apply it word model). to the U.S. Senator name transliteration task (which we update to the 2008 roster). We obtain 40% er- • WFST B - generates an English phoneme se- ror, roughly matching the performance observed in quence e corresponding to w with probability (Knight and Graehl, 1998). P(e|w). • WFST C - transforms the English phoneme se- 3 Task and Data quence into a Japanese phoneme sequence j ac- The task of this paper is to learn the mappings in cording to a model P(j|e). Figure 2, but without parallel data, and to test those mappings in end-to-end transliteration. We imagine • WFST D - writes out the Japanese phoneme our problem as one faced by monolingual English sequence into Japanese katakana characters ac- speaker wandering around Japan, reading a multi- cording to a model P(k|j). tude of katakana signs, listening to people speak Using the cascade in the reverse (noisy-channel) Japanese, and eventually deciphering those signs direction, they are able to translate new katakana into English. To mis-quote Warren Weaver: names and terms into English. They report 36% er- “When I look at a corpus of Japanese ror in translating 100 U.S. Senators’ names, and they katakana, I say to myself, this is really report exceeding human transliteration performance written in English, but it has been coded in the presence of optical scanning noise. in some strange symbols. I will now pro- The only transducer that requires parallel training ceed to decode.” data is WFST C. Knight and Graehl (1998) take sev- eral thousand phoneme string pairs, automatically Our larger motivation is to move toward align them with the EM algorithm (Dempster et easily-built transliteration systems for all language al., 1977), and construct WFST C from the aligned pairs, regardless of parallel resources. While phoneme pieces. Japanese/English transliteration has its own partic- We re-implement their basic method by instanti- ular features, we believe it is a reasonable starting ating a densely-connected version of WFST C with point. 38
  • 3. e j P(j|e) e j P(j|e) e j P(j|e) e j P(j|e) e j P(j|e) e j P(j|e) e j P(j|e) e j P(j|e) AA o 0.49 AY ai 0.84 EH e 0.94 HH h 0.95 L r 0.62 OY oi 0.89 SH sh y 0.33 V b 0.75 a 0.46 i 0.09 a 0.03 w 0.02 ru 0.37 oe 0.04 sh 0.31 bu 0.17 oo 0.02 a 0.03 ha 0.02 o 0.04 yu 0.17 w 0.03 aa 0.02 iy 0.01 i 0.04 ssh y 0.12 a 0.02 ay 0.01 sh i 0.04 ssh 0.02 e 0.01 AE a 0.93 B b 0.82 ER aa 0.8 IH i 0.89 M m 0.68 P p 0.63 T t 0.43 W w 0.73 a ssh 0.02 bu 0.15 a 0.08 e 0.05 mu 0.22 pu 0.16 to 0.25 u 0.17 an 0.02 ar 0.03 in 0.01 n 0.08 pp u 0.13 tt o 0.17 o 0.04 ru 0.02 a 0.01 pp 0.06 ts 0.04 i 0.02 or 0.02 tt 0.03 er 0.02 u 0.02 ts u 0.02 ch 0.02 AH a 0.6 CH tch i 0.27 EY ee 0.58 IY ii 0.58 N n 0.96 PAUSE pause 1.0 TH su 0.48 Y y 0.7 o 0.13 ch 0.24 e 0.15 i 0.3 nn 0.02 s 0.22 i 0.26 e 0.11 ch i 0.23 ei 0.12 e 0.07 sh 0.16 e 0.02 i 0.07 ch y 0.2 a 0.1 ee 0.03 to 0.04 a 0.02 u 0.06 tch y 0.02 ai 0.03 ch 0.04 tch 0.02 te 0.02 ssh y 0.01 t 0.02 k 0.01 a 0.02 AO o 0.6 D d 0.54 F h 0.58 JH jy 0.35 NG n 0.62 R r 0.61 UH u 0.79 Z z 0.27 oo 0.27 do 0.27 hu 0.35 j 0.24 gu 0.22 a 0.27 uu 0.09 zu 0.25 a 0.05 dd o 0.06 hh 0.04 ji 0.21 ng 0.09 o 0.07 ua 0.04 u 0.16 on 0.03 z 0.02 hh u 0.02 jj i 0.14 i 0.04 ru 0.03 dd 0.03 su 0.07 au 0.03 j 0.02 z 0.04 u 0.01 aa 0.01 u ssh 0.02 j 0.06 u 0.01 u 0.01 o 0.01 o 0.02 a 0.06 a 0.01 n 0.03 i 0.03 s 0.02 o 0.02 AW au 0.69 DH z 0.87 G g 0.66 K k 0.53 OW o 0.57 S su 0.43 UW uu 0.67 ZH jy 0.43 aw 0.15 zu 0.08 gu 0.19 ku 0.2 oo 0.39 s 0.37 u 0.29 ji 0.29 ao 0.06 az 0.04 gg u 0.1 kk u 0.16 ou 0.02 sh 0.08 yu 0.02 j 0.29 a 0.04 gy 0.03 kk 0.05 u 0.05 uu 0.02 gg 0.01 ky 0.02 ss 0.02 oo 0.02 ga 0.01 ki 0.01 ssh 0.01 o 0.02 Figure 2: Phonemic substitution table learnt from 3343 parallel English/Japanese phoneme string pairs. English phonemes are in uppercase, Japanese in lowercase. Mappings with P(j|e) > 0.01 are shown. A A CH I D O CH E N J I N E B A D A W A K O B I A A A K U P U R A Z A CH E S : W A N K A PP U A A N D O : O P U T I K U S U W A N T E N P O A A T I S U T O D E K O R A T I B U : W A S E R I N A A T O S E R I N A P I S U T O N D E T O M O R U T O P I I T A A Y U N I O N A I A N B I R U E P I G U R A M U P I KK A A Y U N I TT O SH I S U T E M U A I D I I D O E R A N D O P I N G U U Y U U A I K E N B E R I I : P I P E R A J I N A M I D O : A J I A K A PP U J Y A I A N TS U P I S A : A J I T O J Y A Z U P I U R A Z E N E R A R U E A K O N A K A SH I A K O O S U : P O I N T O Z E R O A K U A M Y U U Z E U M U : Z O N B I I Z U : : : : : : : : Figure 3: Some Japanese phoneme sequences generated from the monolingual katakana corpus using WFST D. Our monolingual resources are: with 112,151 entries. • 43717 unique Japanese katakana sequences • The English gigaword corpus. Knight and collected from web newspaper data. We split Graehl (1998) already use frequently-occurring multi-word katakana phrases on the center-dot capitalized words to build the WFSA A compo- (“·”) character, and select a final corpus of nent of their four-stage cascade. 9350 unique sequences. We add monolingual Japanese versions of the 2008 U.S. Senate ros- We seek to use our English knowledge (derived ter.1 from 2 and 3) to decipher the Japanese katakana cor- pus (1) into English. Figure 3 shows a portion of the • The CMU pronunciation dictionary of English, Japanese corpus, which we transform into Japanese 1 We use “open” EM testing, in which unlabeled test data phoneme sequences using the monolingual resource is allowed to be part of unsupervised training. However, no of WFST D. We note that the Japanese phoneme in- parallel data is allowed. ventory contains 39 unique (“ciphertext”) symbols, 39
  • 4. compared to the 40 English (“plaintext”) phonemes. We allow EM to run until the P(j) likelihood ra- Our goal is to compare and evaluate the WFST C tio between subsequent training iterations reaches model learnt under two different scenarios—(a) us- 0.9999, and we terminate early if 200 iterations are ing parallel data, and (b) using monolingual data. reached. For each experiment, we train only the WFST C Finally, we decode our test set of U.S. Senator model and then apply it to the name translitera- names. Following Knight et al (2006), we stretch tion task—decoding 100 U.S. Senator names from out the P(j|e) model probabilities after decipher- Japanese to English using the automata shown in ment training and prior to decoding our test set, by Figure 1. For all experiments, we keep the rest of cubing their values. the models in the cascade (WFSA A, WFST B, and Decipherment under the conditions of translit- WFST D) unchanged. We evaluate on whole-name eration is substantially more difficult than solv- error-rate (maximum of 100/100) as well as normal- ing letter-substitution ciphers (Knight et al., 2006; ized word edit distance, which gives partial credit Ravi and Knight, 2008; Ravi and Knight, 2009) or for getting the first or last name correct. phoneme-substitution ciphers (Knight and Yamada, 1999). This is because the target table contains sig- 4 Acquiring Phoneme Mappings from nificant non-determinism, and because each symbol Non-Parallel Data has multiple possible fertilities, which introduces uncertainty about the length of the target string. Our main data consists of 9350 unique Japanese phoneme sequences, which we can consider as a sin- 4.1 Baseline P(e) Model gle long sequence j. As suggested by Knight et al (2006), we explain the existence of j as the re- Clearly, we can design P(e) in a number of ways. We sult of someone initially producing a long English might expect that the more the system knows about phoneme sequence e, according to P(e), then trans- English, the better it will be able to decipher the forming it into j, according to P(j|e). The probabil- Japanese. Our baseline P(e) is a 2-gram phoneme ity of our observed data P(j) can be written as: model trained on phoneme sequences from the CMU dictionary. The second row (2a) in Figure 4 shows results when we decipher with this fixed P(e). This P (j) = P (e) · P (j|e) approach performs poorly and gets all the Senator e names wrong. We take P(e) to be some fixed model of mono- lingual English phoneme production, represented 4.2 Consonant Parity as a weighted finite-state acceptor (WFSA). P(j|e) When training under non-parallel conditions, we is implemented as the initial, uniformly-weighted find that we would like to keep our WFST C model WFST C described in Section 2, with 15320 phone- small, rather than instantiating a fully-connected mic connections. model. In the supervised case, parallel training al- We next maximize P(j) by manipulating the sub- lows the trained model to retain only those con- stitution table P(j|e), aiming to produce a result nections which were observed from the data, and such as shown in Figure 2. We accomplish this by this helps eliminate many bad connections from the composing the English phoneme model P(e) WFSA model. In the unsupervised case, there is no parallel with the P(j|e) transducer. We then use the EM al- data available to help us make the right choices. gorithm to train just the P(j|e) parameters (inside We therefore use prior knowledge and place a the composition that predicts j), and guess the val- consonant-parity constraint on the WFST C model. ues for the individual phonemic substitutions that Prior to EM training, we throw out any mapping maximize the likelihood of the observed data P(j).2 from the P(j|e) substitution model that does not 2 In our experiments, we use the Carmel finite-state trans- have the same number of English and Japanese con- ducer package (Graehl, 1997), a toolkit with an algorithm for sonant phonemes. This is a pattern that we observe EM training of weighted finite-state transducers. across a range of transliteration tasks. Here are ex- 40
  • 5. Phonemic Substitution Model Name Transliteration Error whole-name error norm. edit distance 1 e → j = { 1-to-1, 1-to-2 } 40 25.9 + EM aligned with parallel data 2a e → j = { 1-to-1, 1-to-2 } 100 100.0 + decipherment training with 2-gram English P(e) 2b e → j = { 1-to-1, 1-to-2 } 98 89.8 + decipherment training with 2-gram English P(e) + consonant-parity 2c e → j = { 1-to-1, 1-to-2 } 94 73.6 + decipherment training with 3-gram English P(e) + consonant-parity 2d e → j = { 1-to-1, 1-to-2 } 77 57.2 + decipherment training with a word-based English model + consonant-parity 2e e → j = { 1-to-1, 1-to-2 } 73 54.2 + decipherment training with a word-based English model + consonant-parity + initialize mappings having consonant matches with higher proba- bility weights Figure 4: Results on name transliteration obtained when using the phonemic substitution model trained under different scenarios—(1) parallel training data, (2a-e) using only monolingual resources. amples of mappings where consonant parity is vio- ing. The word-based model produces complete En- lated: glish phoneme sequences corresponding to 76,152 actual English words from the CMU dictionary. K => a N => e e The English phoneme sequences are represented as EH => s a EY => n paths through a WFSA, and all paths are weighted Modifying the WFST C in this way leads to bet- equally. We represent the word-based model in com- ter decipherment tables and slightly better results pact form, using determinization and minimization for the U.S. Senator task. Normalized edit distance techniques applicable to weighted finite-state au- drops from 100 to just under 90 (row 2b in Figure 4). tomata. This allows us to perform efficient EM train- ing on the cascade of P(e) and P(j|e) models. Under 4.3 Better English Models this scheme, English phoneme sequences resulting Row 2c in Figure 4 shows decipherment results from decipherment are always analyzable into actual when we move to a 3-gram English phoneme model words. for P(e). We notice considerable improvements in Row 2d in Figure 4 shows the results we ob- accuracy. On the U.S. Senator task, normalized edit tain when training our WFST C with a word-based distance drops from 89.8 to 73.6, and whole-name English phoneme model. Using the word-based error decreases from 98 to 94. model produces the best result so far on the phone- When we analyze the results from deciphering mic substitution task with non-parallel data. On the with a 3-gram P(e) model, we find that many of the U.S. Senator task, word-based decipherment outper- Japanese phoneme test sequences are decoded into forms the other methods by a large margin. It gets English phoneme sequences (such as “IH K R IH 23 out of 100 Senator names exactly right, with a N” and “AE G M AH N”) that are not valid words. much lower normalized edit distance (57.2). We This happens because the models we used for de- have managed to achieve this performance using cipherment so far have no knowledge of what con- only monolingual data. This also puts us within stitutes a globally valid English sequence. To help reach of the parallel-trained system’s performance the phonemic substitution model learn this infor- (40% whole-name errors, and 25.9 word edit dis- mation automatically, we build a word-based P(e) tance error) without using a single English/Japanese from English phoneme sequences in the CMU dic- pair for training. tionary and use this model for decipherment train- To summarize, the quality of the English phoneme 41
  • 6. e j P(j|e) e j P(j|e) e j P(j|e) e j P(j|e) e j P(j|e) e j P(j|e) e j P(j|e) e j P(j|e) AA a 0.37 AY ai 0.36 EH e 0.37 HH h 0.45 L r 0.3 OY a 0.27 SH sh y 0.22 V b 0.34 o 0.25 oo 0.13 a 0.24 s 0.12 n 0.19 i 0.16 m 0.11 k 0.14 i 0.15 e 0.12 o 0.12 k 0.09 ru 0.15 yu 0.1 r 0.1 m 0.13 u 0.08 i 0.11 i 0.12 b 0.08 ri 0.04 oi 0.1 s 0.06 s 0.07 e 0.07 a 0.11 u 0.06 m 0.07 t 0.03 ya 0.09 p 0.06 d 0.07 oo 0.03 uu 0.05 oo 0.04 w 0.03 mu 0.02 yo 0.08 sa 0.05 r 0.04 ya 0.01 yu 0.02 yu 0.01 p 0.03 m 0.02 e 0.08 h 0.05 t 0.03 aa 0.01 u 0.02 ai 0.01 g 0.03 wa 0.01 o 0.06 b 0.05 h 0.02 o 0.02 ky 0.02 ta 0.01 oo 0.02 t 0.04 sh 0.01 ee 0.02 d 0.02 ra 0.01 ei 0.02 k 0.04 n 0.01 AE a 0.52 B b 0.41 ER aa 0.47 IH i 0.36 M m 0.3 P p 0.18 T t 0.2 W w 0.23 i 0.19 p 0.12 a 0.17 e 0.25 n 0.08 pu 0.08 to 0.16 r 0.2 e 0.11 k 0.09 u 0.08 a 0.15 k 0.08 n 0.05 ta 0.05 m 0.13 o 0.08 m 0.07 o 0.07 u 0.09 r 0.07 k 0.05 n 0.04 s 0.08 u 0.03 s 0.04 e 0.04 o 0.09 s 0.06 sh i 0.04 ku 0.03 k 0.07 uu 0.02 g 0.04 oo 0.03 oo 0.01 h 0.05 ku 0.04 k 0.03 h 0.06 oo 0.02 t 0.03 ii 0.03 t 0.04 su 0.03 te 0.02 b 0.06 z 0.02 yu 0.02 g 0.04 pa 0.03 s 0.02 t 0.04 d 0.02 uu 0.02 b 0.04 t 0.02 r 0.02 p 0.04 ch y 0.02 i 0.02 mu 0.03 ma 0.02 gu 0.02 d 0.02 AH a 0.31 CH g 0.12 EY ee 0.3 IY i 0.25 N n 0.56 PAUSE pause 1.0 TH k 0.21 Y s 0.25 o 0.23 k 0.11 a 0.22 ii 0.21 ru 0.09 pu 0.11 k 0.18 i 0.17 b 0.09 i 0.11 a 0.15 su 0.04 ku 0.1 m 0.07 e 0.12 sh 0.07 u 0.09 aa 0.12 mu 0.02 d 0.08 g 0.06 u 0.1 s 0.07 o 0.06 u 0.07 kk u 0.02 hu 0.07 p 0.05 ee 0.02 r 0.07 e 0.06 o 0.05 ku 0.02 su 0.05 b 0.05 oo 0.01 ch y 0.07 oo 0.05 oo 0.02 hu 0.02 bu 0.04 r 0.04 aa 0.01 p 0.06 ei 0.04 ia 0.02 to 0.01 ko 0.03 d 0.04 m 0.06 ii 0.02 ee 0.02 pp u 0.01 ga 0.03 ur 0.03 ch 0.06 uu 0.01 e 0.02 bi 0.01 sa 0.02 ny 0.03 AO o 0.29 D d 0.16 F h 0.18 JH b 0.13 NG tt o 0.21 R r 0.53 UH a 0.24 Z to 0.14 a 0.26 do 0.15 hu 0.14 k 0.1 ru 0.17 n 0.07 o 0.14 zu 0.11 e 0.14 n 0.05 b 0.09 jy 0.1 n 0.14 ur 0.05 e 0.11 ru 0.11 oo 0.12 to 0.03 sh i 0.07 s 0.08 kk u 0.1 ri 0.03 yu 0.1 su 0.1 i 0.08 sh i 0.03 p 0.07 m 0.08 su 0.07 ru 0.02 ai 0.09 gu 0.09 u 0.05 ku 0.03 m 0.06 t 0.07 mu 0.06 d 0.02 i 0.08 mu 0.07 yu 0.03 k 0.03 r 0.04 j 0.07 dd o 0.04 t 0.01 uu 0.07 n 0.06 ee 0.01 gu 0.03 s 0.03 h 0.07 tch i 0.03 s 0.01 oo 0.07 do 0.06 b 0.03 ha 0.03 sh 0.06 pp u 0.03 m 0.01 aa 0.03 ji 0.02 s 0.02 bu 0.02 d 0.05 jj i 0.03 k 0.01 u 0.02 ch i 0.02 AW oo 0.2 DH h 0.13 G gu 0.13 K k 0.17 OW a 0.3 S su 0.4 UW u 0.39 ZH m 0.17 au 0.19 r 0.12 g 0.11 n 0.1 o 0.25 n 0.11 a 0.15 p 0.16 a 0.18 b 0.09 ku 0.08 ku 0.1 oo 0.12 ru 0.05 o 0.13 t 0.15 ai 0.11 w 0.08 bu 0.06 kk u 0.05 u 0.09 to 0.03 uu 0.12 h 0.13 aa 0.11 t 0.07 k 0.04 to 0.03 i 0.07 ku 0.03 i 0.04 d 0.1 e 0.05 p 0.07 b 0.04 su 0.03 ya 0.04 sh i 0.02 yu 0.03 s 0.08 o 0.04 g 0.06 to 0.03 sh i 0.02 e 0.04 ri 0.02 ii 0.03 b 0.07 i 0.04 jy 0.05 t 0.03 r 0.02 uu 0.02 mu 0.02 e 0.03 r 0.05 iy 0.02 d 0.05 ha 0.03 ko 0.02 ai 0.02 hu 0.02 oo 0.02 jy 0.03 ea 0.01 k 0.03 d 0.03 ka 0.02 ii 0.01 ch i 0.02 ee 0.02 k 0.02 Figure 5: Phonemic substitution table learnt from non-parallel corpora. For each English phoneme, only the top ten mappings with P(j|e) > 0.01 are shown. model used in decipherment training has a large ef- where the mapping table derived from parallel data fect on the learnt P(j|e) phonemic substitution ta- does not. Because parallel data is limited, it may not ble (i.e., probabilities for the various phoneme map- contain all of the necessary mappings. pings within the WFST C model), which in turn af- fects the quality of the back-transliterated English 4.4 Size of Japanese Training Data output produced when decoding Japanese. Monolingual corpora are more easily available than Figure 5 shows the phonemic substitution table parallel corpora, so we can use increasing amounts learnt using word-based decipherment. The map- of monolingual Japanese training data during de- pings are reasonable, given the lack of parallel data. cipherment training. The table below shows that They are not entirely correct—for example, the map- using more Japanese training data produces bet- ping “S → s u” is there, but “S → s” is missing. ter transliteration results when deciphering with the Sample end-to-end transliterations are illustrated word-based English model. in Figure 6. The figure shows how the transliteration Japanese training data Error on name transliteration task results from non-parallel training improve steadily (# of phoneme sequences) whole-name error normalized word edit distance as we use stronger decipherment techniques. We 4,674 87 69.7 note that in one case (LAUTENBERG), the deci- 9,350 77 57.2 pherment mapping table leads to a correct answer 42
  • 7. !"#$%&'()*'+"%,)-./0%12%,0)2,/$34/&'#$5 !"#$%&'()*'+"%,)-./0%12%,0)2,/$34/&'#$ !"#$%&'()*'+"%,)-./0%12%,0)2,/$34/&'#$5 !"#$%&'()*'+"%,)-./0%12%,0)2,/$34/&'# !,#G,%33)?%+#,&H).#)!/,/44%4)7/&/)<3%IH !,#G,%33)?%+#,&H).#)!/,/44%4)7/&/)<3 !,#G,%33)?%+#,&H).#)!/,/44%4)7/&/)<3%IH !,#G,%33)?%+#,&H).#)!/,/44%4)7/&/ !"#$#%&' ()""*+,-.%/0*" !"#$#%&' 12)%*,#+ ()""*+,-.%/0*" 1&"&''*' =*+#>2*"?*%,- 1&"&''*' 12)%*,#+ =*+#>2*"?*%,- =*+#>2*"?*%,- =*+#>2*"?*%, =*+#>2*"?*%,- =*+#>2*"? !"#$#%&' ()""*+,-.%/0*" 1&"&''*'!"#$#%&' 3"&#%#%$ 12)%*,#+ @A*,2)B-9C =*+#>2*"?*%,- =*+#>2*"?*%,- ()""*+,-.%/0*" =*+#>2*"?*%, @A*,2)B-DC 1&"&''*' 12)%*,#+ EC 3"&#%#%$ @A*,2)B-DC@A*,2)B-9C =*+#>2*"?*%,- @A*,2)B =*+#>2*"?*%,- @A*,2)B-DC =*+#> @A*,2)B E 3"&#%#%$ @A*,2)B-9C 3"&#%#%$ @A*,2)B EC @A*,2)B-9C @A*,2)B-DC @A*, 45678-9::;< 45678-9::;< 45678-9::;< 45678-9::;< !"#$%& F1GH(GI- !"#$%& F1GH(GI- F1GH(GI- F1GH(GI F1GH(GI;!:*6)*<=9=) F1GH(GI F1GH(GI;!:*6)*<=9=) !"#$%& F1GH(GI- F1GH(GI- !"#$%& F1GH(GI- F1GH(GI- F1GH(GI- ;!:*6)*<=9=) F1GH(GI- F1GH(GI- F1GH(GI- ;!:*6)*<=9=) F1GH '%()*+ '%()*+ .JI.K.A- .JI.K.A 67689:.) '%()*+ .JI.K.A .JI.K.A- .JI.K.A- 67689:.) '%()*+ .JI.K.A .JI.K.A- .JI.K.A- .JI.K.A .JI.K.A- 67689:.) 67689:.) .JI.K.A .JI. ,-'.& =.HLGM-.7.7.- =.HLGM :*:)>: =.HLGM,-'.& >=?6:)6:;2) :*:)>: =.HLGM :*:)>: :*:)>: ,-'.& =.HLGM-.7.7.- ,-'.& >=?6:)6:;2) =.HLGM ;*@6?A.B)6:;2) 7:.A6886):C:*=) =.HLGM-.7.7.- =.HLGM-.7.7.- ;*@6?A.B)6:;2) >=?6:)6:;2)>=?6:)6:;2) ;*@6?A.B)6:;2) 7:.A 7:.A6886):C:*=) ;*@6?A.B)6:;2) 7:.A6886 /00 /00 /00 /00 123#& N.OHG-.MM.I=- N.OHG-.MM.I= 123#& *@=A*6)D=@.) N.OHG CE?7) N.OHG-.MM.I=- N.OHG-.MM.I=- N.OHG-.MM.I= *@=A*6)D=@.) N.OHG CE?7) N.O 123#& N.OHG-.MM.I=- /)%4 N.OHG-.MM.I= N.OHG-.MM.I=- N.OHG-.MM.I= 123#& *@=A*6)D=@.) /)%4 N.OHG CE?7) N.OHG-.MM.I=- N.OHG CE?7) *@=A*6)D=@.) N.OHG-. /)%4 567!& /)%4 A.P-J.Q(QF- A.P-J.Q(QF 966;6)D:96;) A.P C==>;) A.P F=*<;) 567!& A.P-J.Q(QF- A.P-J.Q(QF 966;6)D:96;) A.P C==>;) A.P 567!& 8%0! A.P-J.Q(QF- 567!& A.P-J.Q(QF 8%0! 966;6)D:96;) A.P-J.Q(QF- A.P C==>;) A.P-J.Q(QF A.P F=*<;) 966;6)D:96;) A.P C==>;) A.P F=* 8%0! 8(&9:6; 8%0! J!J-JGHHG33- 8(&9:6; J!J-JGHHG33- J!J-JGHHG33 9E);*@6?A.B) D:*>)CA88A.B) C=2@)C6..A.B) J!J-JGHHG33 9E);*@6?A.B) D:*>)CA88A.B) C=2@ 8(&9:6; J!J-JGHHG33- <=>?& 8(&9:6;R!FG1K-JL=GH J!J-JGHHG33 R!FG1K-JL=GH- 9E);*@6?A.B) R!FG1K-JL=GH- J!J-JGHHG33- D:*>)CA88A.B) <=>?& @=<;6)8:C=?) C=2@)C6..A.B) D:*>)CA88A.B) J!J-JGHHG33 R!FG1K-JL=GH 9E);*@6?A.B) D:!:.)CA76.) R!FG1K-JL=GH- @=<;6)8:C=?) C=2@)C6. D:!:.)CA76.) R!FG @3A# @3A# <=>?& R!FG1K-JL=GH- <2?& <=>?& R!FG1K-JL=GH @=<;6)8:C=?) RGSS-JLH5.A.H- C6.D:9A.) RGSS C6.D:9A.)D:!:.)CA76.) R!FG1K-JL=GH- <2?& D:!:.)CA76.) R!FG1K-JL=GH- R!FG1K-JL=GH RGSS-JLH5.A.H D=@.)!F6AFF6? RGSS-JLH5.A.H- RGSS-JLH5.A.H D=@.)!F6AFF6? RGSS @=<;6)8:C=?) R!FG1K-J RGSS C6.D:9A.) RGSS @3A# B#C5# @3A# B#C5# <2?& RGSS-JLH5.A.H- <2?& ?)#7& RGSS-JLH5.A.H RGSS-JLH5.A.H- SI.H7- C6.D:9A.) D=@.)!F6AFF6? SI.H7- D=@.)!F6AFF6? RGSS-JLH5.A.H RGSS SI.H7 ?)#7& ;<.)F8A.2) SI.H7 F?:.*6) RGSS C6.D:9A.) RGSS C6.D:9A.) ;<.)F8A.2) F?:.*6) F?:.*6) RGSS C6.D F?:. B#C5# D%E#@%7 B#C5# M.Q3GHJGI5 D%E#@%7 M.Q3GHJGI5 8:<26.C:*@ 8:<26.C:*@M.Q3GHJGI5- M.Q3GHJGI5- M.Q3GHJGI5- M.Q3 ?)#7& SI.H7- ?)#7& SI.H7 SI.H7- ;<.)F8A.2) SI.H7 F?:.*6) ;<.)F8A.2) F?:.*6) F?:.*6) F?:.*6) D%E#@%7 M.Q3GHJGI5 D%E#@%7 8:<26.C:*@ M.Q3GHJGI5 8:<26.C:*@ M.Q3GHJGI5- M.Q3GHJGI5- M.Q3GHJGI5- M.Q3GHJ Figure 6: Results for end-to-end name transliteration. This figure shows the correct answer, the answer obtained by training mappings on parallel data (Knight and Graehl, 1998), and various answers obtained by deciphering non- parallel data. Method 1 uses a 2-gram P(e), Method 2 uses a 3-gram P(e), and Method 3 uses a word-based P(e). 4.5 P(j|e) Initialization are weighted X (a constant) times higher than So far, the P(j|e) connections within the WFST C other mappings such as: model were initialized with uniform weights prior N => b N => r to EM training. It is a known fact that the EM al- D => B EY => a a gorithm does not necessarily find a global minimum in the P(j|e) model. In our experiments, we set for the given objective function. If the search space the value X to 100. is bumpy and non-convex as is the case in our prob- Initializing the WFST C in this way results in EM lem, EM can get stuck in any of the local minima learning better substitution tables and yields slightly depending on what weights were used to initialize better results for the Senator task. Normalized edit the search. Different sets of initialization weights distance drops from 57.2 to 54.2, and the whole- can lead to different convergence points during EM name error is also reduced from 77% to 73% (row training, or in other words, depending on how the 2e in Figure 4). P(j|e) probabilities are initialized, the final P(j|e) substitution table learnt by EM can vary. 4.6 Size of English Training Data We can use some prior knowledge to initialize the We saw earlier (in Section 4.4) that using more probability weights in our WFST C model, so as to monolingual Japanese training data yields improve- give EM a good starting point to work with. In- ments in decipherment results. Similarly, we hy- stead of using uniform weights, in the P(j|e) model pothesize that using more monolingual English data we set higher weights for the mappings where En- can drive the decipherment towards better translit- glish and Japanese sounds share common consonant eration results. On the English side, we build dif- phonemes. ferent word-based P(e) models, each trained on dif- For example, mappings such as: ferent amounts of data (English phoneme sequences N => n N => a n from the CMU dictionary). The table below shows D => d D => d o that deciphering with a word-based English model 43
  • 8. built from more data produces better transliteration quence, the correct back-transliterated phoneme se- results. quence is present somewhere in the English data) English training data Error on name transliteration task and apply the same decipherment strategy using a (# of phoneme sequences) whole-name error normalized word word-based English model. The table below com- edit distance 76,152 73 54.2 pares the transliteration results for the U.S. Sena- 97,912 66 49.3 tor task, when using comparable versus non-parallel This yields the best transliteration results on the data for decipherment training. While training on Senator task with non-parallel data, getting 34 out comparable corpora does have benefits and reduces of 100 Senator names exactly right. the whole-name error to 59% on the Senator task, it is encouraging to see that our best decipherment re- 4.7 Re-ranking Results Using the Web sults using only non-parallel data comes close (66% error). It is possible to improve our results on the U.S. Sen- ator task further using external monolingual re- English/Japanese Corpora Error on name transliteration task sources. Web counts are frequently used to auto- (# of phoneme sequences) whole-name error normalized word matically re-rank candidate lists for various NLP edit distance Comparable Corpora 59 41.8 tasks (Al-Onaizan and Knight, 2002). We extract (English = 2,608 the top 10 English candidates produced by our word- Japanese = 2,455) Non-Parallel Corpora 66 49.3 based decipherment method for each Japanese test (English = 98,000 name. Using a search engine, we query the entire Japanese = 9,350) English name (first and last name) corresponding to each candidate, and collect search result counts. We then re-rank the candidates using the collected Web counts and pick the most frequent candidate as our 6 Conclusion choice. For example, France Murkowski gets only 1 hit We have presented a method for attacking machine on Google, whereas Frank Murkowski gets 135,000 transliteration problems without parallel data. We hits. Re-ranking the results in this manner lowers developed phonemic substitution tables trained us- the whole-name error on the Senator task from 66% ing only monolingual resources and demonstrated to 61%, and also lowers the normalized edit dis- their performance in an end-to-end name translitera- tance from 49.3 to 48.8. However, we do note that tion task. We showed that consistent improvements re-ranking using Web counts produces similar im- in transliteration performance are possible with the provements in the case of parallel training as well use of strong decipherment techniques, and our best and lowers the whole-name error from 40% to 24%. system achieves significant improvements over the So, the re-ranking idea, which is simple and re- baseline system. In future work, we would like to quires only monolingual resources, seems like a nice develop more powerful decipherment models and strategy to apply at the end of transliteration exper- techniques, and we would like to harness the infor- iments (during decoding), and can result in further mation available from a wide variety of monolingual gains on the final transliteration performance. resources, and use it to further narrow the gap be- tween parallel-trained and non-parallel-trained ap- 5 Comparable versus Non-Parallel proaches. Corpora We also present decipherment results when using 7 Acknowledgements comparable corpora for training the WFST C model. We use English and Japanese phoneme sequences This research was supported by the Defense Ad- derived from a parallel corpus containing 2,683 vanced Research Projects Agency under SRI Inter- phoneme sequence pairs to construct comparable national’s prime Contract Number NBCHD040058. corpora (such that for each Japanese phoneme se- 44
  • 9. References J. Oh and H. Isahara. 2006. Mining the web for translit- eration lexicons: Joint-validation approach. In Proc. Y. Al-Onaizan and K. Knight. 2002. Translating named of the IEEE/WIC/ACM International Conference on entities using monolingual and bilingual resources. In Web Intelligence. Proc. of ACL. S. Ravi and K. Knight. 2008. Attacking decipherment A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. problems optimally with low-order n-gram models. In Maximum likelihood from incomplete data via the EM Proc. of EMNLP. algorithm. Journal of the Royal Statistical Society Se- S. Ravi and K. Knight. 2009. Probabilistic methods for a ries, 39(4):1–38. Japanese syllable cipher. In Proc. of the International Y. Goldberg and M. Elhadad. 2008. Identification of Conference on the Computer Processing of Oriental transliterated foreign words in Hebrew script. In Proc. Languages (ICCPOL). of CICLing. T. Sherif and G. Kondrak. 2007a. Bootstrapping a D. Goldwasser and D. Roth. 2008a. Active sample se- stochastic transducer for arabic-english transliteration lection for named entity transliteration. In Proc. of extraction. In Proc. of ACL. ACL/HLT Short Papers. T. Sherif and G. Kondrak. 2007b. Substring-based D. Goldwasser and D. Roth. 2008b. Transliteration as transliteration. In Proc. of ACL. constrained optimization. In Proc. of EMNLP. R. Sproat, T. Tao, and C. Zhai. 2006. Named entity S. Goldwater and L. Griffiths, T. 2007. A fully Bayesian transliteration with comparable corpora. In Proc. of approach to unsupervised part-of-speech tagging. In ACL. Proc. of ACL. T. Tao, S. Yoon, A. Fister, R. Sproat, and C. Zhai. 2006. J. Graehl. 1997. Carmel finite-state toolkit. Unsupervised named entity transliteration using tem- http://www.isi.edu/licensed-sw/carmel. poral and phonetic correlation. In Proc. of EMNLP. L. Haizhou, Z. Min, and S. Jian. 2004. A joint source- channel model for machine transliteration. In Proc. of J. Wu and S. Chang, J. 2007. Learning to find English ACL. to Chinese transliterations on the web. In Proc. of EMNLP/CoNLL. U. Hermjakob, K. Knight, and H. Daume. 2008. Name translation in statistical machine translation—learning D. Yarowsky. 1995. Unsupervised word sense disam- when to transliterate. In Proc. of ACL/HLT. biguation rivaling supervised methods. In Proc. of ACL. F. Huang, S. Vogel, and A. Waibel. 2004. Improving named entity translation combining phonetic and se- S. Yoon, K. Kim, and R. Sproat. 2007. Multilingual mantic similarities. In Proc. of HLT/NAACL. transliteration using feature based phonetic method. In Proc. of ACL. S. Karimi, F. Scholer, and A. Turpin. 2007. Col- lapsed consonant and vowel models: New ap- D. Zelenko and C. Aone. 2006. Discriminative methods proaches for English-Persian transliteration and back- for transliteration. In Proc. of EMNLP. transliteration. In Proc. of ACL. A. Klementiev and D. Roth. 2008. Named entity translit- eration and discovery in multilingual corpora. In Learning Machine Translation. MIT press. K. Knight and J. Graehl. 1998. Machine transliteration. Computational Linguistics, 24(4):599–612. K. Knight and K. Yamada. 1999. A computational ap- proach to deciphering unknown scripts. In Proc. of the ACL Workshop on Unsupervised Learning in Natural Language Processing. K. Knight, A. Nair, N. Rathod, and K. Yamada. 2006. Unsupervised analysis for decipherment problems. In Proc. of COLING/ACL. J. Kuo, H. Li, and Y. Yang. 2006. Learning translitera- tion lexicons from the web. In Proc. of ACL/COLING. H. Li, C. Sim, K., J. Kuo, and M. Dong. 2007. Semantic transliteration of personal names. In Proc. of ACL. M. Nagata, T. Saito, and K. Suzuki. 2001. Using the web as a bilingual dictionary. In Proc. of the ACL Work- shop on Data-driven Methods in Machine Translation. 45