Asialex 2011 Kyoto, Japan                                          1



       Development of the Thesaurus of Classical
             Japanese Poetic Vocabulary




                                Hilofumi Yamamoto
                            Tokyo Institute of Technology
                                   Makiro Tanaka
         National Institute of Japanese Language and Linguistics

                                  22nd Aug. 2011




       Outline
         1. Purpose of Study
              • Connotation of classical poetic vocabulary
              • Longitudinal study of transition of vocabulary
         2. Development of Thesaurus
         3. Applications




       Waka: Japanese Poetry




                            Tatsuta-Hime..
                            tamukuru KAMI no / arebakoso
                            aki no konoha no / nusa to chirurame

                            because Princess Tatsuta
                            has a god to whom she offers brocades,
                            the leaves of trees
                            in autumn will scatter
                            as an offering.

                                                 Prince Kanemi
                                                 No. 298 in the Kokinshū




       Problem: Orthography
                                in Chinese characters

                  in hiragana




                                → All Tatsuta (place name)




       Problem: Unit size / attribution
       The unit size and meaning of a word depend on its context.
         • unit →           or          (Nakano, 1998)
         • orthography →
           (sad)
         • attributions →         ∈ plant or       ∈ food
            (unohana = a deutzia or bean curd refuse)


       An Item of Thesaurus: God

                BG-01-2030-01-030-A-                                    -
                  ↑       ↑        ↑        ↑        ↑      ↑      ↑         ↑
                 (1)     (2)      (3)      (4)      (5)    (6)    (7)       (8)

          Figure 1: Structure of an item of BG database in the case of kami (god):
                    (1) database ID (BG = short-unit general vocabulary);
                    (2) part of speech ID (01 = noun);
                    (3) group ID (2030 = Shinto deities and Buddhas);
                    (4) field ID;
                    (5) exact ID (030 = god);
                    (6) era-flag (A = contemporary, C = classic);
                    (7) Chinese character reading;
                    (8) Chinese character
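
A minimal sketch of how the first six fields of such an item code could be split apart, following the numbering in the figure. The class name `ThesaurusCode` and the function `parse_code` are hypothetical names introduced for illustration; they are not part of the authors' tools, and fields (7) and (8) are ignored here.

```python
from typing import NamedTuple


class ThesaurusCode(NamedTuple):
    """First six fields of an item code (hypothetical container)."""
    database: str  # (1) database ID, e.g. "BG"
    pos: str       # (2) part of speech ID, e.g. "01" = noun
    group: str     # (3) group ID, e.g. "2030"
    field: str     # (4) field ID
    exact: str     # (5) exact ID, e.g. "030" = god
    era: str       # (6) era flag, "A" = contemporary, "C" = classic


def parse_code(code: str) -> ThesaurusCode:
    """Split a hyphen-separated item code and keep fields (1) to (6)."""
    parts = code.strip().split("-")
    if len(parts) < 6:
        raise ValueError(f"expected at least 6 fields, got {len(parts)}: {code!r}")
    return ThesaurusCode(*parts[:6])


kami = parse_code("BG-01-2030-01-030-A")
print(kami.group, kami.exact, kami.era)
```

Keeping the fields as strings preserves leading zeros such as "030", which matter when codes are compared field by field.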




       Development: Thesaurus, KH, and t2c
         • Thesaurus for classical poetic vocabulary
         • KH (tokenizer)
         • t2c (token to code converter)



        Materials: the Hachidaishū
           • The Hachidaishū: eight anthologies compiled by
             imperial order between ca. 905 and 1205.
           • The database: compiled by the National Institute of
             Japanese Literature, Japan.
           • Texts are based on the Shōhobonban version of the
             Hachidaishū.

        [Figure: timeline of the eight anthologies, axis 900 to 1250]
             Anthology        Year    Years since previous
             Kokinshū          905
             Gosenshū          951    46
             Shūishū          1007    56
             Goshūishū        1086    79
             Kin'yōshū        1124    38
             Shikashū         1144    20
             Senzaishū        1188    44
             Shinkokinshū     1205    17




       Methods: Flowchart of data processing



         A  Corpus development
         B  Tokenisation
         C  Meta-code conversion
         D  Mathematical modelling
         E  Subtraction: CT−OP
         F  Visualisation

                  Table 1: An example of input for KH / Gosenshū No. 664
         input: 000664
         output:000664
                           (       - :   :   :   :              )
                   (            - : : : : )
                   (        :    : )
                       (        -    :   :   :   :              )
                           (       - :   :   :   :              )
                   (        :    : )
                           (       -   :   :   :   :                )
                  (         :    : )
                  (         :    : )
                  ( : :           )
                  (   :          : )
                ---
                        (        -       :     :       )
                  (   : :            )
                ---
                        (                - :       :        :           :   )
                  (   : :            )
                ---
                    ( : :               )
                  (   -              : : )
                    (    -             :    :   :   :   )
                    ( -              :    :   :   :   )




       Development: Thesaurus

                                                     Thesaurus
                              Tokeniser              code tagger



         Poem Texts               kh                      t2c                    Hachidaishu
                                                                                  Thesaurus

                            add unknown entries             add new thesaurus codes

                            Dictionary            General, Place Name
                                                  Personal Name, etc
                                  (A)                     (B)




       (A) Corpus: Poems (OP)

             KW00029800|A|KANEMI NO Ō=kanemi no ō
             KW00029800|B|Tatsutahime[NOUN-PLNAME:TATSUTAHIME]/→
                        tamukuru[KASHIMO2-ATTR:TAMUkuru],kami[NOUN:KAMI]→
                        no[SUB]are[RAHEN-REAL]ba[CAUS]koso[KP]/→
                        aki[NOUN:AKI]no[CON],konoha[NOUN:KOnoHA]no[SUB]/→
                        nusa[NOUN:NUSA]to[P-CRD],chiru[RA4DAN-FF:CHIru]→
                        rame[CJR-REAL]/

          Figure 2: Format of a poem record in the database: → marks continuation
                    onto the next line without a break; the first line, which
                    includes |A|, gives the name of the poet; the second line, which
                    includes |B|, gives the text of the poem and added information.
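
A minimal sketch of how a record in this format might be read, assuming each token has the shape `surface[TAG]` or `surface[TAG:KEY]` and that `/` and `,` act only as separators. The regular expression, the function names, and the interpretation of the bracket contents as a tag plus an optional key are assumptions made for illustration, not a description of the authors' own parser.

```python
import re

# surface text, then [TAG] or [TAG:KEY]; "/" and "," are treated as separators
TOKEN_RE = re.compile(r"([^\[\]/,]+)\[([^\]:]+)(?::([^\]]+))?\]")


def parse_record(line: str):
    """Split 'ID|TYPE|payload' into its three parts."""
    poem_id, line_type, payload = line.split("|", 2)
    return poem_id, line_type, payload


def parse_tokens(payload: str):
    """Return a list of (surface, tag, key) tuples; key is None if absent."""
    return [(m.group(1), m.group(2), m.group(3)) for m in TOKEN_RE.finditer(payload)]


# First part of the |B| line from the figure, with the continuation joined.
line = ("KW00029800|B|Tatsutahime[NOUN-PLNAME:TATSUTAHIME]/"
        "tamukuru[KASHIMO2-ATTR:TAMUkuru],kami[NOUN:KAMI]"
        "no[SUB]are[RAHEN-REAL]ba[CAUS]koso[KP]/")
poem_id, line_type, payload = parse_record(line)
tokens = parse_tokens(payload)
for surface, tag, key in tokens:
    print(surface, tag, key)
```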




       (A) Corpus: Translations (CT)
           $A|000298
           $B|                                                         →

           $C|
           $D|                                                         →

           $I|                                                         →
                                                                        →


                            Figure 3: Format of the database of a CT




       (B) Tokenisation:
            original text


               ↓
            tokenising
                   /        / / /[     ]/ /    / / /         / / / /   /[   ]
               ↓
            converting into predicative form
                   /        / / /[     ]/ /    / / /         / / / /   /[   ]

                             Figure 4: Tokenisation of poem texts




       (C) meta-code conversion
          CH-29-2130-01-010-A                    Tatsutahime   Princess-Tatsuta
          CH-29-0000-14-010-A   --               -- Tatsuta    Tatsuta
          BG-01-2030-01-101-A   --               -- hime       princess
          BG-02-3770-04-080-C                    tamukuru      present(verb)
          BG-01-5730-02-010-A   --               -- te         hand
          BG-02-1700-01-040-A   --               -- mukeru     for
          BG-01-2030-01-030-A                    kami          god
          BG-08-0061-07-010-A                    no            SUB (particle)
          BG-02-1200-01-010-C                    are           be
          BG-08-0064-26-010-A                    ba            because (particle)
          BG-04-1120-05-150-A   --               -- ba         because (reason)
          BG-08-0065-01-010-A                    koso          KP (emphasis)

                        Figure 5: Meta-code conversion in the case of OP



       (C) Structure of meta-code-1
                BG-01-2030-01-030-A-                                    -
                  ↑       ↑        ↑        ↑        ↑      ↑      ↑         ↑
                 (1)     (2)      (3)      (4)      (5)    (6)    (7)       (8)

          Figure 6: Structure of an item of BG database in the case of kami (god):
                    (1) database ID (BG = short-unit general vocabulary);
                    (2) part of speech ID (01 = noun);
                    (3) group ID (2030 = Shinto deities and Buddhas);
                    (4) field ID;
                    (5) exact ID (030 = god);
                    (6) era-flag (A = contemporary, C = classic);
                    (7) Chinese character reading;
                    (8) Chinese character




       (C) Structure of the meta-code-2
             BG-01-2600-01-020-A (1)     =   BG-01-2610-01-040-A (2)
             yononaka (world)                yo (world)


                                         +   BG-08-0010-01-021-A (3)
                                             no (of)


                                         +   BG-01-1770-01-080-A (4)
                                             naka (inside)



          Figure 7: Structure of an item of the semantic table in the case
                    of a compound word, yononaka (world)
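
One way to picture the semantic table for compounds is a mapping from the compound's code to the codes of its parts. The sketch below uses only the codes shown in the figure; the dictionary `COMPOUNDS` and the function `expand` are hypothetical names, and the real table may be stored quite differently.

```python
# Compound code -> component codes, copied from the yononaka example.
COMPOUNDS = {
    "BG-01-2600-01-020-A": [       # yononaka (world)
        "BG-01-2610-01-040-A",     # yo (world)
        "BG-08-0010-01-021-A",     # no (of)
        "BG-01-1770-01-080-A",     # naka (inside)
    ],
}


def expand(code: str) -> list:
    """Return the component codes of a compound, or [code] for a simple word."""
    return COMPOUNDS.get(code, [code])


print(expand("BG-01-2600-01-020-A"))
print(expand("BG-01-2030-01-030-A"))  # kami (god) is not a compound here
```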




       (C) meta-code conversion-3
          CH-29-2130-01-010-A                    Tatsutahime   Princess-Tatsuta
          CH-29-0000-14-010-A   --               -- Tatsuta    Tatsuta
          BG-01-2030-01-101-A   --               -- hime       princess
          BG-02-3770-04-080-C                    tamukuru      present(verb)
          BG-01-5730-02-010-A   --               -- te         hand
          BG-02-1700-01-040-A   --               -- mukeru     for
          BG-01-2030-01-030-A                    kami          god
          BG-08-0061-07-010-A                    no            SUB (particle)
          BG-02-1200-01-010-C                    are           be
          BG-08-0064-26-010-A                    ba            because (particle)
          BG-04-1120-05-150-A   --               -- ba         because (reason)
          BG-08-0065-01-010-A                    koso          KP (emphasis)

                        Figure 8: Meta-code conversion in the case of OP




          10th century, field of experience:            poet --write--> OP
          20th century, field of experience (expert):   expert reader --read--> OP
                                                        expert reader --write--> CT
          20th century, field of experience (novice):   novice reader --read--> CT

          OP <--compare--> CT




                    Figure 9: Schema of relationship between OP and CT

           +-------- # of pair
           | +----- value of matching level, exact=17, field=13, group=10
           | | +-- # of POS
           | | |
           | | | # of element of OP ----+        +- # of element of CT
           | | |         element of OP -+ |      | +--- element of CT
           | | |                        | |      | |
           1 17 11                       00 <-> 12        (Tatsutahime)
           2 17 47                       04 <-> 25         (hand)
           3 17 47                       05 <-> 26        (toward)
           4 17 2                        06 <-> 32         (god)
           5 10 61                       07 <-> 33         (SUB)
           6 17 47                       08 <-> 34        (be)
           7 10 64                       09 <-> 35         (because)
           8 17 65                       11 <-> 36        (EM)
           9 17 2                        12 <-> 38         (autumn)
          10 17 71                       13 <-> 39         (CON)
          11 17 2                        14 <-> 40        (leaf of tree)
          12 17 2                        19 <-> 45         (present)
          13 17 61                       20 <-> 46         (CRD)
          14 17 47                       21 <-> 49        (fall)
          15 13 74                       22 <-> 54         (CJR)

                            Figure 10: Example of the matching process
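
A minimal sketch of how two item codes could be assigned one of the three matching levels and the scores 17, 13 and 10 shown in the figure. It assumes that a level is decided by how many leading code fields two items share (through the exact ID, through the field ID, or through the group ID) and that the era flag is ignored; both are assumptions for illustration, and the function name `match_level` is hypothetical.

```python
# Scores as given in the figure legend.
SCORES = {"exact": 17, "field": 13, "group": 10}


def match_level(op_code: str, ct_code: str):
    """Return 'exact', 'field', 'group', or None for two item codes.

    Assumed field order: database, POS, group, field, exact, era.
    The era flag (sixth field) is deliberately not compared.
    """
    a = op_code.split("-")
    b = ct_code.split("-")
    if len(a) < 5 or len(b) < 5:
        raise ValueError("codes need at least five hyphen-separated fields")
    if a[:5] == b[:5]:
        return "exact"
    if a[:4] == b[:4]:
        return "field"
    if a[:3] == b[:3]:
        return "group"
    return None


level = match_level("BG-01-2030-01-030-A", "BG-01-2030-01-101-A")  # kami vs. hime
print(level, SCORES.get(level))
```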




        Residual

   CT   (                                )         (                )
   OP   — —— — — — — — — —                         — — — — —— —


   CT   (        )                           ( ) (       )    (           )
   OP   — —                                  [ ]       — —    — — — —



            Figure 11: Example of the matching process in the case of kks 298 in
                       Komachiya (1982)




       Components of OP
          Table 2: Result of subtracting the elements of OP(298) from those
                   of CT(298, koma), showing the ratio of each component
                   of OP(298).
          OP    (valid number of elements)                     =   16
          E     (ratio of exact match)              12/16     =   0.750
          F     (ratio of field match)               1/16     =   0.062
          G     (ratio of group match)               2/16     =   0.125
          T     (ratio of total match)              15/16     =   0.938
          U     (ratio of unmatched OP)             1 - T     =   0.062
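
The ratios in Table 2 follow directly from the match counts. The sketch below recomputes them from the counts in the table (16 valid elements; 12 exact, 1 field and 2 group matches); the function name `op_components` is a hypothetical name used only for illustration.

```python
def op_components(n_op: int, exact: int, field: int, group: int) -> dict:
    """Ratios of exact, field, group, total and unmatched elements of OP."""
    total = exact + field + group
    return {
        "E": exact / n_op,
        "F": field / n_op,
        "G": group / n_op,
        "T": total / n_op,
        "U": 1 - total / n_op,
    }


ratios = op_components(n_op=16, exact=12, field=1, group=2)
for name, value in ratios.items():
    print(f"{name} = {value:.4f}")
```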




       Calculation of Residual Rate



                            D = 1 - \frac{P}{T}            (1)
                              = 1 - \frac{16}{41}          (2)
                              \approx 0.61                 (3)
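
As a small worked check of equations (1) to (3), the sketch below evaluates the residual rate for the same figures. It reads P as the number of valid OP elements (16) and T as the number of valid CT elements (41), which is an interpretation based on the counts reported elsewhere in the slides; `residual_rate` is a hypothetical name.

```python
def residual_rate(p: int, t: int) -> float:
    """D = 1 - P/T, with P and T read as OP and CT element counts."""
    if t <= 0:
        raise ValueError("t must be positive")
    return 1 - p / t


d = residual_rate(16, 41)
print(round(d, 2))
```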




       Components of CT
          Table 3: Components of CT in the case of kks 298 by Komachiya (1982):
                   fabs(D-H) is the absolute value of the practical value, D,
                   minus the theoretical value, H.

           CT (valid number of elements)                           = 41
           W  (ratio of original word use)    12/41                = 0.293   E/CT
           A  (ratio of annotation)           1-0.293              = 0.707   1-W
               --- breakdown of the annotation ---
               P1 (ratio of FG paraphrased)   (0.062+0.12)/0.707   = 0.073   (F+G)/A
               P2 (ratio of U paraphrased)    (0.707-0.073)*0.062  = 0.040   (A-P1)*U
               D  (ratio of purely added)     0.707-(0.073+0.040)  = 0.595   A-(P1+P2)
           H  (theoretical value of D)        1-16/41              = 0.610   1-OP/CT
           Gap                                fabs(0.595-0.610)    = 0.015   fabs(D-H)



       Subtraction: CT - OP


          OP(298): 16 elements
            Exact      12 (75.0%)
            Field       1 (6.2%)
            Group       2 (12.5%)
            Unmatched   1 (6.2%)

          CT(298, koma): 41 elements
            W    12 (29.3%)
            P1    3 (7.3%)
            P2    1 (4.0%)
            D    25 (59.5%)

          Figure 12: Pie charts illustrating the components of OP(298) and
                     CT(298, koma)




       (E) Mathematical modelling
                cw(t_1, t_2) = (1 + \log ctf(t_1, t_2)) \sqrt{idf(t_1) \, idf(t_2)}   (4)

                idf(t) = \log \frac{N}{df(t)}                                          (5)
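
A minimal sketch of equations (4) and (5) in code. The slide does not define the symbols, so the sketch assumes that ctf(t1, t2) is the number of times the two terms co-occur, df(t) is the number of poems containing t, N is the total number of poems, and the logarithm is natural; all of these, and the function names, are assumptions for illustration.

```python
import math


def idf(n_docs: int, df: int) -> float:
    """Equation (5): idf(t) = log(N / df(t)), natural log assumed."""
    return math.log(n_docs / df)


def cw(ctf: int, df1: int, df2: int, n_docs: int) -> float:
    """Equation (4): (1 + log ctf) * sqrt(idf(t1) * idf(t2))."""
    if ctf < 1:
        raise ValueError("ctf must be at least 1 for the pair to have a weight")
    return (1 + math.log(ctf)) * math.sqrt(idf(n_docs, df1) * idf(n_docs, df2))


# Hypothetical counts, chosen only to exercise the formula.
weight = cw(ctf=5, df1=20, df2=50, n_docs=1000)
print(weight)
```

With these definitions a pair of terms that each occur in every poem gets weight 0, since both idf values vanish, while rare terms that co-occur often are weighted most heavily.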
       [Figure: network graph of vocabulary items joined by edges labelled with
        numeric values. Legible node labels include cuckoo, mountain cuckoo,
        warbler, plum, green willow, branch, spring, summer, midsummer rain,
        May, voice, singing voice, sing.vi, cry.vi, scatter.1, Tatsuta.PN and
        Otowa.PN.]
                                                                                                                                                                                                flower
                                                                                                                                                                                                 9

                                                                                                                                      10
                                                                                                                                           9
                                                                                                                                                                                   yet.1
                                                        iris.1              reason.1
                                                                   6


                                                                                                                                                                       touch                                    lure
                                                                                                                 stand.vi
                                                                                                                                                                                                                                         4
                                                                                                                                                                                                                                                       send
                                                                                                                             spring haze                                                                                    7

                                                                                                                                                                                                                        5
                                                                                                                                                                                                           4
                                                                                                                                                                         10
                                                                                                                                                                                                                                         fragrance.1


                                                                                                                                                                                                                       attach
                                                                                                                                                                  hand                    guidance.1

                                                                                                      warbler-CT-23-229-3.73-15 cuckoo-CT-40-370-3.27-16



       Conclusion
       The thesaurus annotated with meta-codes allows researchers

         1. to identify different orthographies as the same word;

         2. to attach an alternative semantic ID to a word which has the
            same form but more than one meaning (a polysemous word);

         3. to attach meta-codes not only to tokens recognised as
            single (simple) words but also to longer tokens;

         4. to indicate the similarity between tokens;

         5. to detect shared or differing tokens across two or more texts,
            which reveals the similarities and differences between the texts;

         6. to indicate the relative differences between two words in literary
            works.
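
       Points 1 and 4 can be illustrated with a minimal sketch. It is not the authors' t2c implementation: it only assumes the meta-code layout given earlier (database ID, part of speech ID, group ID, field ID, exact ID, era-flag) and the matching-level values shown in the matching example (exact = 17, field = 13, group = 10). The names `MetaCode`, `parse_meta_code` and `match_level` are hypothetical, and the rule that a level is decided by the longest shared code prefix, ignoring the era-flag, is an assumption.

```python
from typing import NamedTuple, Optional


class MetaCode(NamedTuple):
    database: str  # e.g. "BG" = short-unit general vocabulary
    pos: str       # part of speech ID, e.g. "01" = noun
    group: str     # group ID, e.g. "2030" = Shinto deities and Buddhas
    field: str     # field ID
    exact: str     # exact ID, e.g. "030" = god
    era: str       # era-flag: "A" = contemporary, "C" = classic


def parse_meta_code(code: str) -> MetaCode:
    """Split a code such as 'BG-01-2030-01-030-A' into its first six fields."""
    parts = code.split("-")
    if len(parts) < 6:
        raise ValueError(f"expected at least 6 fields, got {len(parts)}: {code!r}")
    return MetaCode(*parts[:6])


# Matching-level values as they appear in the matching example.
EXACT, FIELD, GROUP = 17, 13, 10


def match_level(a: str, b: str) -> Optional[int]:
    """Return 17, 13 or 10 for an exact, field or group match, else None.

    Assumption: two tokens are compared by their code prefix only,
    and the era-flag is ignored.
    """
    x, y = parse_meta_code(a), parse_meta_code(b)
    if (x.database, x.pos, x.group) != (y.database, y.pos, y.group):
        return None
    if x.field != y.field:
        return GROUP
    if x.exact != y.exact:
        return FIELD
    return EXACT


if __name__ == "__main__":
    kami = "BG-01-2030-01-030-A"  # kami (god)
    hime = "BG-01-2030-01-101-A"  # hime (princess)
    te = "BG-01-5730-02-010-A"    # te (hand)
    print(match_level(kami, kami))  # 17: same code, whatever the orthography
    print(match_level(kami, hime))  # 13: same group and field, different exact ID
    print(match_level(kami, te))    # None: different group
```

       Because two spellings of one word receive the same code, they compare as an exact match, while near-synonyms fall back to the field or group level.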




       Questions
         • Computer Modelling of Classical Japanese Poetic
           Vocabulary
            http://etymology.jp/waka/poem.cgi
         • Inquiry:
            Hilofumi Yamamoto
            yamagen@ryu.titech.ac.jp
         • Thank you.
