Public profile matching

Urzhumtcev Oleg, SkolTech, ITMO¹
Instructor: Raymond Chi-Wing Wong, HKUST

27 November 2012, Hong Kong

¹ http://en.qdinvest.ru
Problem
Generalization
Approach studied
Proposition
Testing
Conclusion



Problem
There are many objects in the world
Some of them are named entities, among them people
They may have different representations (profiles)




Generalization
Part of the summarization problem
More precisely, its first step: accurate data collection
Caused by homonymy




Approaches
• [1] User Identification Across Multiple Social Networks, by
  Jan Vosecky, Dan Hong, Vincent Y. Shen
• Features:
  • Direct matching (nearest neighbor)
  • Vector-Based Comparison Algorithm
  • Fuzzy string field matching
  • Weighted parameters




Approaches
Vector-based comparison:
 profile {
   :id “50b2c847e3b24cf21400000”
   :username “darikcr”
   :type “twitter”
   :source “http://twitter.com/darikcr”
   :name “NetBUG”
   :lang “ru-RU”
   :birthday nil
   :email nil
   :about “Linguist, programmer, also have some XPrience in making startups. Groaning for active shiny people to do business together”
   :status “Две новые станции метро в Петербурге - "Бухарестскую" и "Международную" - откроют 27 декабря. Б... http://vk.cc/15wd4M”
   :tags nil
 }

 profile {
   :id “50b2c843e3b24cf214000005”
   :username “oleg.urzhumtsev”
   :type “facebook”
   :source “http://facebook.com/oleg.urzhumtsev”
   :name “Oleg Urzhumtcev”
   :alias “NetBUG”
   :lang “ru-RU”
   :birthday 1989/10/19
   :email “darikcr@gmail.cm”
   :about “”
   :status “Checked in at HKUST Bus Station”
   :tags nil
   :university [“HKUST” “SkolTech” “ITMO” “SPbSU”]
   :job [“ProMT JSC” “Israeli Embassy”]
   :interests [“Linguistics” “motoschool” “programming” “startups”]
 }
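
A minimal sketch of how two such profiles could be turned into a similarity vector and a weighted score. The field weights, the helper names (`similarity_vector`, `weighted_score`) and the exact-match scoring are assumptions made for illustration; the slides do not give the weights, and the approach in [1] uses fuzzy matching for string fields rather than plain equality.

```ruby
# Hypothetical field weights; the slides do not state actual values.
WEIGHTS = { username: 3.0, name: 2.0, email: 3.0, lang: 0.5 }.freeze

# Build a similarity vector with one score per field:
# 1.0 = equal, 0.0 = different, nil = missing in at least one profile.
def similarity_vector(a, b, fields = WEIGHTS.keys)
  fields.to_h do |field|
    x = a[field]
    y = b[field]
    score = if x.nil? || y.nil?
              nil
            else
              x.to_s.downcase == y.to_s.downcase ? 1.0 : 0.0
            end
    [field, score]
  end
end

# Weighted average over the fields that are present in both profiles.
def weighted_score(vector, weights = WEIGHTS)
  known = vector.reject { |_, s| s.nil? }
  return 0.0 if known.empty?

  total = known.sum { |f, _| weights.fetch(f) }
  known.sum { |f, s| weights.fetch(f) * s } / total
end

# A subset of the fields of the two sample profiles.
twitter  = { username: "darikcr", name: "NetBUG", lang: "ru-RU", email: nil }
facebook = { username: "oleg.urzhumtsev", name: "Oleg Urzhumtcev",
             lang: "ru-RU", email: "darikcr@gmail.cm" }

vector = similarity_vector(twitter, facebook)
score  = weighted_score(vector)
```

With these assumed weights only `lang` matches, so the score stays low even though both profiles belong to the same person.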




Approaches
Fuzzy matching (VMN algorithm*):
    #   String pair                          VMN    SDS    SD
    1   “Jan Vosecky”, “J Vosecky”           0.66   0.82   2.0
    2   “Jan Vosecky”, “Vosecky Jan”         1.0    0.55   5.0
    3   “Jan Vosecky”, “Honza vosecky”       0.5    0.36   7.0
    4   “Jan Vosecky”, “Robert Vosecky”      0.5    0.55   5.0
    5   “Jan Vosecky”, “Jan Smith”           0.5    0.45   6.0
    6   “Jan Vosecky”, “Jack Vondracek”      0.0    0.27   8.0
    Table 1. String Match Functions Comparison

• Partial matching
• Word swapping tolerance
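
To make the two properties above concrete, here is a small token-based sketch. It is not the VMN algorithm of Vosecky, Hong and Shen, and it does not reproduce the values in Table 1; the scoring rule (1.0 for an equal word, 0.5 when one word is a prefix of the other) and the function names are assumptions chosen for illustration.

```ruby
# Score for a single pair of words (illustrative rule, not VMN):
# 1.0 if equal, 0.5 if one is a prefix of the other, else 0.0.
def token_score(a, b)
  return 1.0 if a == b
  return 0.5 if a.start_with?(b) || b.start_with?(a)

  0.0
end

# Compare two names word by word, ignoring word order.
def name_similarity(name_a, name_b)
  a = name_a.downcase.split
  b = name_b.downcase.split
  return 0.0 if a.empty? || b.empty?

  short, long = [a, b].sort_by(&:size)
  remaining = long.dup
  total = short.sum do |tok|
    # Greedily pick the best still-unused word of the longer name.
    best = remaining.max_by { |cand| token_score(tok, cand) }
    score = token_score(tok, best)
    remaining.delete_at(remaining.index(best)) if score > 0
    score
  end
  total / long.size
end

swapped = name_similarity("Jan Vosecky", "Vosecky Jan")
partial = name_similarity("Jan Vosecky", "J Vosecky")
other   = name_similarity("Jan Vosecky", "Jack Vondracek")
```

Because words are matched as a set, swapping first name and surname does not lower the score, and an initial such as "J" still earns partial credit.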




* Vosecky, Hong, Shen 2009
Approaches
Drawbacks:
1. Suitable only for well-intersected profiles (many shared fields)
2. Bad for discovery
3. No cross-parameter search




Approaches
Awareness of missing data:




Approaches
• Identifying Users Across Social Tagging Systems by
  Tereza Iofciu, Peter Fankhauser, Fabian Abel, Kerstin Bischof
• Tagged entities
• ‘Bag-of-words’ document model
• Only basic matching




Proposition
1. A profile is a non-uniform document with
   different features of different types
2. Parameters are split into ‘unique’ and ‘frequent’:
  ‘username’ is unique
  ‘surname’ is unique, although homonymy may occur
  ‘interests’ is frequent (shared by many people)




Proposition
3. Use a combined model:
  1. Initial matching as in [1] (vector-based)
  2. If that fails, continue with weight-based unique-attribute
     matching
  3. If that fails, continue with clustering and all-attribute
     nearest-neighbor prediction
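
The fallback chain can be sketched as a list of stages tried in order. The three lambdas below are hypothetical stand-ins (exact username match, shared name or alias, most shared interests), not the actual matchers from the slides; only the control flow is the point.

```ruby
# Try each stage in order; return the first match together with
# the name of the stage that produced it, or nil if all stages fail.
def combined_match(profile, candidates, stages)
  stages.each do |name, stage|
    match = stage.call(profile, candidates)
    return { stage: name, match: match } if match
  end
  nil
end

# Hypothetical stand-in for stage 1 (vector-based matching).
direct = lambda do |prof, cs|
  cs.find { |c| c[:username] == prof[:username] }
end

# Hypothetical stand-in for stage 2 (unique-attribute matching).
unique = lambda do |prof, cs|
  mine = [prof[:name], prof[:alias]].compact
  cs.find { |c| ([c[:name], c[:alias]].compact & mine).any? }
end

# Hypothetical stand-in for stage 3 (nearest neighbor on all attributes).
nearest = lambda do |prof, cs|
  overlap = ->(c) { Array(c[:interests]) & Array(prof[:interests]) }
  best = cs.max_by { |c| overlap.call(c).size }
  best if best && overlap.call(best).any?
end

stages = { direct: direct, unique: unique, clustering: nearest }

twitter  = { username: "darikcr", name: "NetBUG" }
facebook = { username: "oleg.urzhumtsev", name: "Oleg Urzhumtcev",
             alias: "NetBUG", interests: %w[Linguistics programming] }

result = combined_match(twitter, [facebook], stages)
```

Here the first stage finds nothing, and the second stage succeeds because the Twitter name equals the Facebook alias.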




Proposition
       Weight-based unique attribute matching
similarity =
  (this.unique_attrs.sum { |id, attr|
     other.unique_attrs.value?(attr) ? weight_unique[id] : 0 } +
   this.frequent_attrs.count { |attr| other.frequent_attrs.include?(attr) }) /
  this.frequent_attrs.count { |attr| !other.frequent_attrs.include?(attr) }.to_f
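
One possible reading of the similarity formula above as runnable Ruby, with a worked example. The `Profile` struct, the weights, the sample data and the guard against a zero denominator are assumptions added for illustration; the slide only gives the formula.

```ruby
# Hypothetical container: unique attributes as a hash, frequent ones as a list.
Profile = Struct.new(:unique_attrs, :frequent_attrs)

# Hypothetical weights for unique attributes.
WEIGHT_UNIQUE = { username: 2.0, name: 1.5, alias: 1.5 }.freeze

def similarity(this, other, weight_unique = WEIGHT_UNIQUE)
  # A unique attribute counts if its value appears under ANY unique
  # attribute of the other profile (cross-parameter comparison).
  unique_score = this.unique_attrs.sum(0.0) do |id, attr|
    next 0.0 if attr.nil?

    other.unique_attrs.value?(attr) ? weight_unique.fetch(id, 1.0) : 0.0
  end
  shared   = this.frequent_attrs.count { |attr| other.frequent_attrs.include?(attr) }
  unshared = this.frequent_attrs.size - shared
  # Assumption: treat "no unshared attributes" as a denominator of 1
  # to avoid division by zero.
  (unique_score + shared) / [unshared, 1].max
end

# Illustrative data; the interests of the first profile are made up.
twitter  = Profile.new({ username: "darikcr", name: "NetBUG" },
                       %w[programming startups])
facebook = Profile.new({ username: "oleg.urzhumtsev", name: "Oleg Urzhumtcev",
                         alias: "NetBUG" },
                       %w[Linguistics motoschool programming startups])

forward  = similarity(twitter, facebook)   # 1.5 + 2 shared, 0 unshared
backward = similarity(facebook, twitter)   # 1.5 + 2 shared, 2 unshared
```

Under this reading the measure is asymmetric: the result depends on which profile plays the role of `this`.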




Proposition
                            Clustering
Hierarchical: the distribution seems to be even
• Distance: non-numeric parameter conversion
• Merging:
  • For vector-like attributes: keep features shared by 30% or more
    of the members
    • Slow
    • Reliable
  • Probabilistic for singular features

Curse of dimensionality
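
The 30% rule for merging vector-like attributes can be sketched as follows. The function name, the `threshold:` parameter and the sample members are assumptions for illustration; only the 30% figure comes from the slide.

```ruby
# Features kept for a merged cluster: those shared by at least
# `threshold` (30% on the slide) of the cluster members.
def shared_features(members, attribute, threshold: 0.3)
  counts = Hash.new(0)
  members.each do |member|
    # Count each feature once per member.
    Array(member[attribute]).uniq.each { |feature| counts[feature] += 1 }
  end
  counts.select { |_, n| n.to_f / members.size >= threshold }.keys
end

# Made-up cluster of four profiles.
cluster = [
  { interests: %w[programming startups] },
  { interests: %w[programming] },
  { interests: %w[linguistics] },
  { interests: %w[programming chess] }
]

kept = shared_features(cluster, :interests)
```

In this example only "programming" is shared by at least 30% of the four members, so it alone would describe the merged cluster.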
Technical work
1. Data fetching:
  1. About.me
  2. Facebook
  3. Twitter
2. Tools:
  1. Ruby
  2. Document-oriented NoSQL database: MongoDB
3. Implementation of vector-based weighted
   comparison
4. Implementation of VMN algorithm

Testing
Data                       Direct nearest-     Unique parameter   Combined (direct +     LDA document-based
                           neighbor matching   matching           unique + clustering)   model (experimental)

Completeness (%)           51%                 56%                74%                    46%
Basic set                  53                  58                 78                     95
Accuracy (%)               100%                98%                95%                    51%
  of them false positive   0                   1                  3                      42
False negative             51                  46                 29                     9
Extended set               56                  62                 127                    N/T
Accuracy (%)               98%                 54%                70%                    N/T
  of them false positive   2                   5                                         N/T
False negative             51 (basic set)      37 (basic set)     28 (basic set)         N/T


Future work
1. Attempt to convert all parameters to numeric
   format and apply SVM for clustering
2. Add semantic word similarity via WordNet
   distance
3. Named Entity Recognition in text fields
4. Wrap the developed algorithms into a single
   sleek Rails web application and open it for
   public testing



Conclusion
1. All approaches studied had a strong
   mathematical background but were poorly
   adapted to real applications
2. Intuitive fusion of approaches suitable for
   different situations may improve results
3. Further work is necessary to develop the best
   approach




Thank you!
Questions?


Slides available at http://n3r.ru/c4
Demo and code available at http://n3r.ru/c5


Feel free to contact me: darikcr@gmail.com
                         http://about.me/netbug
...and enlarge your soft skills!


Editor's notes

  1. As shown in the previous slide, Chinese provides broad opportunities for homonymy. However, even in small Russia there is a guy with the same name and surname as me.
  2. Performance has not been tested due to the small testing data set.
  3. However, Jan demonstrated the problem of missing data.