Information & Database Systems Lab




                                     Entity Graph Mining and Matching
                                                                          Seung-won Hwang
                                                                         Associate Professor
                                             Department of Computer Science and Engineering
                                                                            POSTECH, Korea
Mining Human Intelligence from the Web: Click Graph
                                      Language-agnostic/data-intensive: e.g., arabic Corpus?
                                                                  Are q1 and q2 similar?




                                                                  Are u3 and u4 similar?
Mining at Finer Granularity: Named Entity (NE) Graph
                                      Person name, Place name, Organization name, Product name
                                        Newspapers, Web sites, TV programs, …
                                      [Figure: example NE graph; labels: Apple, MS, tenure, co-founder, jobs, gates, complicated, Mac]
Case I: Matching names with Twitter accounts [EDBT11]
Case II: Entity Translation [EMNLP10,CIKM11]
                                      What are the features?
                                      How are the features combined?
                                     (using translation as an application scenario)
                                      [Figure: NE graph Ge=(Ve, Ee) built from the English corpus and NE graph Gc=(Vc, Ec) built from the Chinese corpus]
NE Translation
                                      Goal
                                        Translating an NE in the source language into its counterpart NE in the target language
                                        Ex) “Obama” (English) → “奥巴马” (Chinese)
                                      Resources: comparable corpora
                                      [Figure: English NEs (NE_E) and Chinese NEs (NE_C) with their features, extracted from Xinhua News Agency (English) and Xinhua News Agency (Chinese); the task is to find the matching (NE_E, NE_C) pairs]
NE Translation Similarity Features
                                      Entity Name Similarity (E): S. Wan [1], L. Haizhou [2], K. Knight [3]
                                          Pronunciation similarity between named entities
                                          Ex) “Obama” and “奥巴马” (pronounced Aobama)
                                      Entity Context Similarity (EC): M. Diab [4], H. Ji [5], K. Yu [6]
                                          Contextual word similarity between named entities
                                          Ex) The president (总统) Obama (奥巴马)
                                              “As president, Obama signed economic stimulus legislation …”



                                      Relationship Similarity (R): G.-w. You [7]
                                          Co-occurrence similarity between pairs of named entities
                                          Ex) (“Jackie Chan”, “Bill Gates”) vs. (“成龙”, “比尔·盖茨”)
Motivation
                                      Taxonomy Table

                                                                  Entity             Relationship    Combined
                                        Using Entity Names        E [1,2,3]          R               You [7] (E+R)
                                        Using Textual Context     EC [4,5,6]         ?
                                        Combined                  Shao [8] (E+EC)

                                      Research questions:
                                        Why is RC not used?
                                        Can all four categories be combined?
In this paper…
                                      We propose a new NE translation similarity feature
                                         Relationship Context similarity (RC)
                                            Contextual word similarity between pairs of named entities
                                            Ex) pair (“Barack”, “Michelle”) → Spouse

                                      We propose new holistic approaches
                                            Combining all E, EC, R, and RC




                                      We validate our proposed approach using extensive
                                       experiments
Our Framework
                                      We abstract this problem as…
                                      Graph Matching of two NE relationship graphs extracted from
                                       comparable corpora
                                      Populate a decision matrix R, a |Ve|-by-|Vc| matrix
                                      [Figure: NE graph Ge=(Ve, Ee) built from the English corpus and NE graph Gc=(Vc, Ec) built from the Chinese corpus]
Our Framework
                                      Overview – 3 Steps
                                        Initialization
                                            Construct NE relationship graphs
                                            Build an initial pairwise similarity matrix R0
                                            Use Entity (E) and Entity Context (EC) similarities
                                        Iterative reinforcement
                                            Build a final pairwise similarity matrix R∞
                                            Use Relationship (R) and Relationship Context (RC) similarities
                                        Matching
                                            Find a 1:1 matching from R∞
                                            Build a binary hard decision matrix R*
                                        [Figure: example similarity matrices with English NEs (Obama, Jackie Chan) as rows and Chinese NEs (奥巴马, 成龙) as columns; the Obama row reads .99, .1, .2 in both matrices, while the Jackie Chan entry is .1 in the first matrix and .99 in the second]
Initialization
                                      Constructing NE relationship graphs G = (N, E)
                                         Extract NEs using an entity tagger for each document in each corpus
                                         Regard NEs that appear more than δ times as nodes
                                         Connect two nodes when they co-occur more than δ times
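The construction above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the entity tagger has already produced one list of NE strings per document, that "co-occur" means appearing in the same document, and that the same δ is used for both thresholds. The function name `build_ne_graph` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_ne_graph(docs, delta):
    """Build an NE relationship graph from tagged documents.

    docs  : list of documents, each a list of NE strings (tagger output).
    delta : frequency threshold for both nodes and edges.
    Returns (nodes, edges), where edges are sorted 2-tuples of node names.
    """
    node_counts = Counter()
    pair_counts = Counter()
    for entities in docs:
        node_counts.update(entities)                  # count every mention
        unique = sorted(set(entities))                # one co-occurrence per document
        pair_counts.update(combinations(unique, 2))

    # Keep NEs that appear more than delta times.
    nodes = {ne for ne, count in node_counts.items() if count > delta}
    # Connect two kept nodes when they co-occur more than delta times.
    edges = {
        pair for pair, count in pair_counts.items()
        if count > delta and pair[0] in nodes and pair[1] in nodes
    }
    return nodes, edges
```

Running it once per corpus would give the two graphs that are later matched against each other.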
                                      Initializing R0
                                         Computing entity similarity matrix SE
                                             Use Edit-Distance (ED) between ‘ei’ and Pinyin representation of ‘cj’
                                             Ex) ED(“Obama”, “奥巴马”) = ED(“Obama”, “Aobama”)


                                                                    E
                                                                                ED(ei , PYC j )
                                                               S   ij   1
                                                                            Len(ei ) Len( PYC j )
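A minimal sketch of this name similarity, under stated assumptions: the Pinyin string is supplied by the caller (no transliteration library is used here), both strings are lowercased and non-empty, and the two length terms in the denominator are read as a product. That reading, like the function names, is an assumption of this sketch.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def name_similarity(english_name, pinyin_name):
    """1 - ED / (Len * Len); the product normalizer is an assumption."""
    e = english_name.lower()
    p = pinyin_name.lower()
    return 1.0 - edit_distance(e, p) / (len(e) * len(p))


# "Obama" vs. the Pinyin of "奥巴马"
score = name_similarity("Obama", "Aobama")
```

For the slide's example the edit distance is 1 after lowercasing (one inserted "a"), so the pair scores close to 1.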
Initialization
                                      Initializing R0
                                         Computing entity context similarity matrix SEC
                                             Context word
                                               ex) “As president, Obama signed economic stimulus legislation …”




                                             Context window

                                               CW(NE, d) = \{ w_{i-l/2}, w_{i-l/2+1}, \ldots, w_i(NE), \ldots, w_{i+l/2-1}, w_{i+l/2} \}




                                             Correlation between an NE and a context word: log-odds ratio
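The context window can be illustrated with a short sketch. It assumes the document is already tokenized, that `i` is the token position of the NE, and that the window is simply truncated at document boundaries; the helper name `context_window` is hypothetical. The log-odds scoring step is not shown, since its exact form is not given here.

```python
def context_window(tokens, i, l):
    """Return the l/2 tokens on each side of position i, plus the NE itself.

    tokens : list of tokens of document d
    i      : index of the NE token
    l      : window length (l/2 words on each side)
    """
    half = l // 2
    start = max(0, i - half)          # truncate at the start of the document
    return tokens[start:i + half + 1]


tokens = ["As", "president", "Obama", "signed", "economic", "stimulus", "legislation"]
window = context_window(tokens, 2, 4)
```

With `l = 4`, the window around "Obama" covers "As", "president", "signed", and "economic".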
Initialization
                                      Initializing R0
                                         Computing entity context similarity matrix SEC
                                             Projected Context Association Vector
                                             [Figure: the context association vector of “Obama” (e.g., President: 0.9) is projected through a bilingual dictionary (entries such as (President, 总统) and (USA, 美國)) and compared with the context association vector of “奥巴马” (e.g., 总统: 0.85)]
Initialization
                                      Initializing R0
                                         Computing entity context similarity matrix SEC
                                             Context Similarity between ‘ei’ and ‘cj’
                                             Compute cosine similarity between two vectors
                                               S^{EC}_{ij} = \frac{CA_{e_i} \cdot CA_{c_j}}{\lVert CA_{e_i} \rVert \, \lVert CA_{c_j} \rVert}
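A minimal sketch of the projection and cosine steps, assuming context association vectors are stored as sparse dictionaries from context word to score and the bilingual dictionary maps an English word to a list of Chinese translations. Dropping words without a dictionary entry, and summing scores that land on the same translation, are assumptions of this sketch; `project` and `cosine` are hypothetical names.

```python
import math


def project(ca_english, dictionary):
    """Project an English context association vector into Chinese word space."""
    projected = {}
    for word, score in ca_english.items():
        # Words missing from the dictionary are dropped (assumption).
        for translation in dictionary.get(word, ()):
            projected[translation] = projected.get(translation, 0.0) + score
    return projected


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(score * v.get(word, 0.0) for word, score in u.items())
    norm_u = math.sqrt(sum(s * s for s in u.values()))
    norm_v = math.sqrt(sum(s * s for s in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


ca_obama = {"president": 0.9}
ca_aobama = {"总统": 0.85}
dictionary = {"president": ["总统"]}
s_ec = cosine(project(ca_obama, dictionary), ca_aobama)
```

In this toy case both vectors point along the single dimension 总统, so the cosine is 1 up to rounding.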


                                         Merging SE and SEC
                                             Min-max normalization to the range [0, 1]
                                             Merge

                                               R_{ij} = S^{E}_{ij}\, S^{EC}_{ij}
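The merge step can be sketched as below. Two points are assumptions of this sketch rather than something stated above: min-max normalization is applied over the whole matrix (not per row), and the merge is read as an element-wise product of the two normalized matrices. Matrices are plain lists of lists, and the function names are hypothetical.

```python
def min_max_normalize(matrix):
    """Rescale all entries of a matrix to the range [0, 1]."""
    values = [x for row in matrix for x in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant matrix: no spread to rescale, return zeros.
        return [[0.0 for _ in row] for row in matrix]
    return [[(x - lo) / (hi - lo) for x in row] for row in matrix]


def merge(s_e, s_ec):
    """Element-wise product of the normalized S^E and S^EC (assumed reading)."""
    n_e = min_max_normalize(s_e)
    n_ec = min_max_normalize(s_ec)
    return [
        [a * b for a, b in zip(row_e, row_ec)]
        for row_e, row_ec in zip(n_e, n_ec)
    ]


r0 = merge([[1, 2], [3, 5]], [[5, 3], [2, 1]])
```

Normalizing first puts both features on the same scale, so neither dominates the merged score purely because of its raw range.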
Reinforcement
                                      Intuition
                                         Two NEs with a strong relationship
                                            Co-occur frequently → have an edge
                                            Share similar context → have similar relationship context
                                           [Figure: NE pair X in the English NE graph and NE pair Y in the Chinese NE graph, each NE shown with its context]
                                           1. Align neighbors
                                               using relationship (R) and relationship context (RC) similarity
                                           2. Update the similarity score
Reinforcement
                                      Iterative Approach

                                         R^{t+1}_{ij} = \lambda \sum_{(u,v)_k \in B^{t}(i,j,\cdot)} \frac{R^{t}_{uv}\,\left(S^{RC}_{iu,jv}\right)^{\gamma}}{2^{k}} + (1-\lambda)\,R^{0}_{ij}

                                         First term: Relationship-based Similarity (R & RC)
                                         Second term: Entity-based Similarity (E & EC)
                                         B^{t}(i,j,\cdot): ordered set of aligned neighbor pairs of (i, j) at iteration t
                                         R^{t}_{uv}: Relationship (R) similarity of i’s neighbor u and j’s neighbor v
                                         S^{RC}_{iu,jv}: Relationship Context (RC) similarity between relation pairs (i, u) and (j, v)
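One way to read the iterative update is sketched below. Several choices are assumptions of this sketch, not details given above: neighbors are aligned greedily by their current score, the rank k starts at 1 for the best-aligned pair, any threshold inside B is omitted, and a fixed number of iterations stands in for a convergence test. `s_rc` is a hypothetical caller-supplied function returning the RC similarity of relation pairs (i, u) and (j, v).

```python
def aligned_neighbor_scores(i, j, R, nbrs_e, nbrs_c, s_rc, gamma):
    """Greedily align neighbors of i with neighbors of j, best score first."""
    candidates = sorted(
        ((R[u][v] * s_rc(i, u, j, v) ** gamma, u, v)
         for u in nbrs_e.get(i, ()) for v in nbrs_c.get(j, ())),
        key=lambda t: t[0], reverse=True)
    used_u, used_v, scores = set(), set(), []
    for score, u, v in candidates:
        if u not in used_u and v not in used_v:   # keep the alignment 1:1
            scores.append(score)
            used_u.add(u)
            used_v.add(v)
    return scores                                  # ordered by rank k = 1, 2, ...


def reinforce(R0, nbrs_e, nbrs_c, s_rc, lam=0.15, gamma=1.0, iterations=5):
    """Iterate R^{t+1} = lam * sum_k(score_k / 2^k) + (1 - lam) * R^0."""
    R = [row[:] for row in R0]
    for _ in range(iterations):
        new_R = []
        for i, row in enumerate(R0):
            new_row = []
            for j, r0 in enumerate(row):
                scores = aligned_neighbor_scores(i, j, R, nbrs_e, nbrs_c, s_rc, gamma)
                relational = sum(s / 2 ** k for k, s in enumerate(scores, start=1))
                new_row.append(lam * relational + (1 - lam) * r0)
            new_R.append(new_row)
        R = new_R
    return R
```

With `gamma=0` the RC factor becomes 1, so the update uses only the relationship (R) similarity of the aligned neighbors.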
Matching
                                      Finding a 1:1 matching using a greedy algorithm

                                      Steps
                                       1.    Find the translation pair with the highest final similarity score
                                       2.    Select the pair and remove the corresponding row and column from R∞
                                       3.    Repeat steps 1 and 2 until the highest remaining similarity score falls below the threshold
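The three steps above can be sketched as follows. Rather than physically deleting rows and columns, this version sorts all cells once and skips those whose row or column is already taken, which yields the same selection. The matrix is a plain list of lists and `greedy_match` is a hypothetical name.

```python
def greedy_match(R, threshold):
    """Greedy 1:1 matching over a similarity matrix R.

    Returns a list of (row, column) index pairs, best pair first.
    """
    candidates = sorted(
        ((score, i, j) for i, row in enumerate(R) for j, score in enumerate(row)),
        key=lambda t: t[0], reverse=True)
    used_rows, used_cols, matches = set(), set(), []
    for score, i, j in candidates:
        if score < threshold:
            break                      # every remaining score is lower still
        if i not in used_rows and j not in used_cols:
            matches.append((i, j))
            used_rows.add(i)           # equivalent to removing row i
            used_cols.add(j)           # equivalent to removing column j
    return matches


R_final = [[0.99, 0.1, 0.2],
           [0.3, 0.05, 0.99]]
pairs = greedy_match(R_final, 0.5)
```

The binary decision matrix R* then has a 1 at each returned (row, column) pair and 0 elsewhere.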




Experiments
                                      Dataset
                                        English Gigaword Corpus
                                            Xinhua News Agency 2008.01~2008.12
                                            100,746 news documents
                                        Chinese Gigaword Corpus
                                            Xinhua News Agency 2008.01~2008.12
                                            88,029 news documents


                                      Approaches
                                          EC                              : consider Entity Context similarity feature only
                                          E                               : consider Entity Name similarity feature only
                                          Shao (E+EC)                     : combine Entity Name & Entity Context similarities
                                          You (E+R)                       : combine Entity Name & Relationship similarities
                                          Ours
                                            E+EC+R (when γ = 0)
                                            E+EC+R+RC


                                      Measure
                                        Precision, Recall, and F1-score
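For reference, a minimal sketch of how these measures could be computed over translation pairs, assuming predictions and the gold standard are both given as sets of (English NE, Chinese NE) pairs. The function name and the toy data are hypothetical and are not results from the experiments.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and F1 of predicted pairs against gold pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy illustration only.
predicted = {("Obama", "奥巴马"), ("Jackie Chan", "成龙"), ("Bill Gates", "总统")}
gold = {("Obama", "奥巴马"), ("Jackie Chan", "成龙"), ("Bill Gates", "比尔·盖茨")}
p, r, f1 = precision_recall_f1(predicted, gold)
```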
Experiments
                                      Effectiveness of overall framework
                                         500 person named entities
                                         Set λ = 0.15
                                         5-fold cross-validation for threshold parameter learning
                                      Other types of NE (100 location named entities)
Directions
                                      Graph matching
                                      Graph cleansing [VLDB11]
                                      Scalable entity search
                                                                  US Presidents
                                                                  Bill Clinton
                                                                  William J Clinton
                                                                  George W. Bush
                                                                  George H.W. Bush
                                                                  Dubya
Thanks
                                      Questions?

                                     Visit: www.postech.ac.kr/~swhwang for these papers

Seungwon Hwang: Entity Graph Mining and Matching

  • 1. Entity Graph Mining and Matching. Seung-won Hwang, Associate Professor, Department of Computer Science and Engineering, POSTECH, Korea. Information & Database Systems Lab.
  • 2. Mining Human Intelligence from the Web: Click Graph
      • Language-agnostic and data-intensive: e.g., an Arabic corpus?
      • Figure: click graph, with the questions "Are q1 and q2 similar?" and "Are u3 and u4 similar?"
  • 3. Mining at Finer Granularity: Named Entity (NE) Graph
      • Person name, place name, organization name, product name
      • Newspapers, Web sites, TV programs, …
      • Figure: example NE graph (labels: Apple, MS, tenure, Co-founder, jobs, gates, complicated, Mac)
  • 4. Case I: Matching names with Twitter accounts [EDBT11]
  • 5. Case II: Entity Translation [EMNLP10,CIKM11]
      • What are the features?
      • How are the features combined? (using translation as an application scenario)
      • Figure: NE graphs extracted from the English corpus, Ge = (Ve, Ee), and from the Chinese corpus, Gc = (Vc, Ec)
  • 6. NE Translation
      • Goal: find the target-language NE that corresponds to an NE in the source language
      • Ex) “Obama” (English) → “奥巴马” (Chinese)
      • Resources: comparable corpora
      • Figure: English NEs and Chinese NEs, each with features, drawn from Xinhua News Agency (English) and Xinhua News Agency (Chinese)
  • 7. NE Translation Similarity Features
      • Entity Name Similarity (E): S. Wan [1], L. Haizhou [2], K. Knight [3]
          • Pronunciation similarity between named entities
          • Ex) “Obama” and “奥巴马” (pronounced Aobama)
      • Entity Context Similarity (EC): M. Diab [4], H. Ji [5], K. Yu [6]
          • Contextual word similarity between named entities
          • Ex) The president (总统) Obama (奥巴马): “As president, Obama signed economic stimulus legislation …”
      • Relationship Similarity (R): G.-w. You [7]
          • Co-occurrence similarity between pairs of named entities
          • Ex) (“Jackie Chan”, “Bill Gates”) vs. (“成龙”, “比尔·盖茨”)
  • 8. Motivation
      • Taxonomy table:

                                  Entity        Relationship
          Using entity names      E [1,2,3]     R (You [7])
          Using textual context   EC [4,5,6]    ?

      • Shao [8]
      • Research questions:
          • Why is RC not used?
          • Can all four categories be combined?
  • 9. In this paper…
      • We propose a new NE translation similarity feature
          • Relationship Context similarity (RC)
          • Contextual word similarity between pairs of named entities
          • Ex) pair (“Barack”, “Michelle”) → Spouse
      • We propose new holistic approaches
          • Combining all of E, EC, R, and RC
      • We validate the proposed approach with extensive experiments
  • 10. Our Framework
      • We abstract this problem as graph matching of two NE relationship graphs extracted from comparable corpora
      • Populate a decision matrix R, a |Ve|-by-|Vc| matrix
      • Figure: NE graphs for the English corpus, Ge = (Ve, Ee), and the Chinese corpus, Gc = (Vc, Ec)
  • 11. Our Framework
      • Overview: 3 steps
      • Initialization
          • Construct NE relationship graphs
          • Build an initial pairwise similarity matrix R0
          • Use Entity (E) and Entity Context (EC) similarities
      • Iterative reinforcement
          • Build a final pairwise similarity matrix R∞
          • Use Relationship (R) and Relationship Context (RC) similarities
      • Matching
          • Find a 1:1 matching from R∞
          • Build a binary hard decision matrix R*
      • Figure: example similarity matrices with rows Obama, Jackie Chan and columns 奥巴马, 成龙
  • 12. Initialization
      • Constructing NE relationship graphs G = (N, E)
          • Extract NEs from each document in each corpus using an entity tagger
          • Treat NEs that appear more than δ times as nodes
          • Connect two nodes when they co-occur more than δ times
      • Initializing R0
          • Compute the entity similarity matrix $S^{E}$
          • Use the edit distance (ED) between $e_i$ and the Pinyin representation of $c_j$
          • Ex) ED(“Obama”, “奥巴马”) = ED(“Obama”, “Aobama”)
          • $S^{E}_{ij} = 1 - \dfrac{ED(e_i, PY_{c_j})}{Len(e_i) + Len(PY_{c_j})}$
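A minimal Python sketch of the name-similarity score on slide 12. It assumes the Pinyin romanization is already available as a string (the slides do not say how it is produced) and that the denominator is the sum of the two lengths. The function names and the lowercasing step are illustrative choices, not taken from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def name_similarity(english: str, pinyin: str) -> float:
    """1 - ED / (Len(e) + Len(PY(c))); the summed denominator is an assumption."""
    e, p = english.lower(), pinyin.lower()  # case folding is an assumption
    if not e and not p:
        return 1.0
    return 1.0 - edit_distance(e, p) / (len(e) + len(p))


print(edit_distance("obama", "aobama"))
print(round(name_similarity("Obama", "Aobama"), 3))
```

Under these assumptions "Obama" and "Aobama" differ by a single insertion, so the pair scores close to 1.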
  • 13. Initialization
      • Initializing R0: computing the entity context similarity matrix $S^{EC}$
          • Context words, ex) “As president, Obama signed economic stimulus legislation …”
          • Context window: $CW(NE, d) = \{w_{i-l/2}, w_{i-l/2+1}, \ldots, w_i\,(NE), \ldots, w_{i+l/2-1}, w_{i+l/2}\}$
          • Correlation between an NE and a context word: log-odds ratio
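The context window on slide 13 can be sketched in Python as follows. This is an illustration only: it assumes the document is already tokenized, that `width` plays the role of the window length l, and it leaves the NE token itself out of the returned context words (the slide's set notation lists it). The function name is hypothetical.

```python
def context_window(tokens, index, width):
    """Return tokens within width // 2 positions of tokens[index], excluding the NE."""
    half = width // 2
    start = max(0, index - half)
    end = min(len(tokens), index + half + 1)
    return tokens[start:index] + tokens[index + 1:end]


tokens = "As president Obama signed economic stimulus legislation".split()
print(context_window(tokens, tokens.index("Obama"), 4))
```

The window is clipped at document boundaries, so an NE near the start or end of a document simply gets fewer context words.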
  • 14. Initialization
      • Initializing R0: computing the entity context similarity matrix $S^{EC}$
          • Projected context association vector
      • Figure: the context association vector of Obama (e.g., President: 0.9) and that of 奥巴马 (e.g., 总统: 0.85), linked through a bilingual dictionary entry (President, 总统); other labels shown: USA, 美國
  • 15. Initialization
      • Initializing R0: computing the entity context similarity matrix $S^{EC}$
          • Context similarity between $e_i$ and $c_j$: the cosine similarity between the two context association vectors
          • $S^{EC}_{ij} = \dfrac{CA_{e_i} \cdot CA_{c_j}}{\lVert CA_{e_i} \rVert \, \lVert CA_{c_j} \rVert}$
      • Merging $S^{E}$ and $S^{EC}$
          • Min-max normalization to the range [0, 1]
          • Merge: $R^{0}_{ij}$ is computed from $S^{E}_{ij}$ and $S^{EC}_{ij}$
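A hedged Python sketch of the context-similarity and merging steps on slides 14 and 15. Context association vectors are represented as plain dictionaries from context word to score. The merge operator is not recoverable from the slide, so the element-wise sum used in `merge` is an assumption, as are all function names and the toy dictionary.

```python
import math


def project(vec_en, dictionary):
    """Map an English context vector onto Chinese words via a bilingual dictionary."""
    projected = {}
    for word, score in vec_en.items():
        if word in dictionary:
            target = dictionary[word]
            projected[target] = projected.get(target, 0.0) + score
    return projected


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(score * v.get(word, 0.0) for word, score in u.items())
    norm_u = math.sqrt(sum(s * s for s in u.values()))
    norm_v = math.sqrt(sum(s * s for s in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def min_max(matrix):
    """Rescale all entries of a matrix to the range [0, 1]."""
    flat = [x for row in matrix for x in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in matrix]
    return [[(x - lo) / (hi - lo) for x in row] for row in matrix]


def merge(s_e, s_ec):
    """Combine normalized S^E and S^EC; the plain sum is an assumption."""
    a, b = min_max(s_e), min_max(s_ec)
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


dictionary = {"president": "总统", "economy": "经济"}  # toy entries
ca_obama = project({"president": 0.9, "economy": 0.4}, dictionary)
ca_aobama = {"总统": 0.85, "经济": 0.3}
print(cosine(ca_obama, ca_aobama))
```

Normalizing both matrices before merging keeps either feature from dominating purely because of its scale.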
  • 16. Reinforcement
      • Intuition: two NEs with a strong relationship
          • Co-occur frequently → have an edge
          • Share similar context → have similar relationship context
      • Figure: NE pair X in the English NE graph and NE pair Y in the Chinese NE graph, each with its relationship context
      • Steps
          1. Align neighbors using relationship (R) and relationship context (RC) similarity
          2. Update the similarity score
  • 17. Reinforcement
      • Iterative approach:
        $R^{t+1}_{ij} = \lambda \sum_{(u,v)_k \in B^{t}(i,j,\theta)} \dfrac{R^{t}_{uv} + \gamma\, S^{RC}_{(i,u),(j,v)}}{2^{k}} + (1 - \lambda)\, R^{0}_{ij}$
      • The first term is the relationship-based similarity (R & RC); the second term is the entity-based similarity (E & EC)
      • $B^{t}(i,j,\theta)$: ordered set of aligned neighbor pairs of (i, j) at iteration t
      • $R^{t}_{uv}$: relationship (R) similarity of i’s neighbor u and j’s neighbor v
      • $S^{RC}_{(i,u),(j,v)}$: relationship context (RC) similarity between relation pairs (i, u) and (j, v)
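One reinforcement iteration can be sketched in Python as below. Several details are assumptions rather than facts from the slides: neighbors are aligned greedily by descending current similarity with cutoff `theta`, the rank k starts at 1, and the RC term enters additively with weight `gamma` (so `gamma = 0` reduces to the R-only variant). `s_rc` is a hypothetical callback returning the RC similarity of relation pairs (i, u) and (j, v).

```python
def aligned_neighbors(R, nbrs_i, nbrs_j, theta):
    """Greedy 1:1 alignment of neighbors, ordered by descending R[u][v] >= theta."""
    candidates = sorted(((R[u][v], u, v) for u in nbrs_i for v in nbrs_j),
                        reverse=True)
    used_u, used_v, pairs = set(), set(), []
    for score, u, v in candidates:
        if score < theta:
            break  # remaining candidates are all below the cutoff
        if u not in used_u and v not in used_v:
            pairs.append((u, v))
            used_u.add(u)
            used_v.add(v)
    return pairs


def reinforce_once(R, R0, nbrs_e, nbrs_c, s_rc, lam, gamma, theta):
    """Compute R^{t+1} from R^t under the assumptions stated above."""
    new_R = [row[:] for row in R]
    for i in range(len(R)):
        for j in range(len(R[0])):
            pairs = aligned_neighbors(R, nbrs_e[i], nbrs_c[j], theta)
            rel = sum((R[u][v] + gamma * s_rc(i, u, j, v)) / 2 ** k
                      for k, (u, v) in enumerate(pairs, start=1))
            new_R[i][j] = lam * rel + (1 - lam) * R0[i][j]
    return new_R


R0 = [[0.9, 0.1], [0.1, 0.2]]
nbrs_e = {0: [1], 1: [0]}   # English graph: nodes 0 and 1 are connected
nbrs_c = {0: [1], 1: [0]}   # Chinese graph: nodes 0 and 1 are connected
R1 = reinforce_once(R0, R0, nbrs_e, nbrs_c, lambda i, u, j, v: 0.0,
                    lam=0.15, gamma=0.0, theta=0.15)
print(R1)
```

In this toy run the weak pair (1, 1) gains score because its neighbors (0, 0) are already a confident match, which is the intuition behind the reinforcement step.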
  • 18. Matching
      • Finding a 1:1 matching from R∞ using a greedy algorithm
      • Steps
          1. Find the translation pair with the highest final similarity score
          2. Select the pair and remove the corresponding row and column from R∞
          3. Repeat steps 1 and 2 until the similarity score falls below the threshold
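The greedy steps on slide 18 can be sketched in Python as follows. Sorting all cells once and skipping used rows and columns is equivalent to repeatedly taking the maximum and deleting its row and column. The function name and the toy matrix are illustrative.

```python
def greedy_match(R, threshold):
    """Return 1:1 (row, column) pairs chosen greedily by descending score."""
    cells = sorted(((score, i, j)
                    for i, row in enumerate(R)
                    for j, score in enumerate(row)),
                   reverse=True)
    used_rows, used_cols, matches = set(), set(), []
    for score, i, j in cells:
        if score < threshold:
            break  # every remaining cell is below the threshold
        if i not in used_rows and j not in used_cols:
            matches.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return matches


R_final = [[0.99, 0.10],
           [0.20, 0.95]]
print(greedy_match(R_final, threshold=0.5))
```

Greedy selection is simple and fast but does not guarantee the assignment with the highest total score; an optimal assignment method could be substituted if that matters.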
  • 19. Experiments
      • Dataset
          • English Gigaword Corpus: Xinhua News Agency, 2008.01~2008.12, 100,746 news documents
          • Chinese Gigaword Corpus: Xinhua News Agency, 2008.01~2008.12, 88,029 news documents
      • Approaches
          • EC: entity context similarity feature only
          • E: entity name similarity feature only
          • Shao (E+EC): combines entity name and entity context similarities
          • You (E+R): combines entity name and relationship similarities
          • Ours: E+EC+R (when γ = 0) and E+EC+R+RC
      • Measures: precision, recall, and F1-score
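For reference, the three measures can be computed over sets of translation pairs as in this Python sketch. The pair representation and the function name are assumptions for illustration; the slides do not describe the evaluation code.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and F1 of predicted translation pairs against gold pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


predicted = {("Obama", "奥巴马"), ("Jackie Chan", "成龙"), ("Mac", "苹果")}
gold = {("Obama", "奥巴马"), ("Jackie Chan", "成龙"),
        ("Bill Gates", "比尔·盖茨"), ("Apple", "苹果")}
print(precision_recall_f1(predicted, gold))
```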
  • 20. Experiments
      • Effectiveness of the overall framework
          • 500 person named entities
          • Set λ = 0.15
          • 5-fold cross-validation for learning the threshold parameter
      • Other NE type: 100 location named entities
  • 21. Directions
      • Graph matching
      • Graph cleansing [VLDB11]
      • Scalable entity search
      • Figure: “US Presidents” example with the names Bill Clinton, William J Clinton, George W. Bush, George H.W. Bush, Dubya
  • 22. Thanks
      • Questions?
      • Visit www.postech.ac.kr/~swhwang for these papers