Public profile

Public profile matching

Urzhumtcev Oleg, SkolTech, ITMO1
Instructor: Raymond Chi-Wing Wong, HKUST

27 November 2012, Hong Kong

1
http://en.qdinvest.ru

Problem
Generalization
Approach studied
Proposition
Testing
Conclusion

2

Problem
There are many objects in the world
Some of them are named entities,
including people
They may have different representations
(profiles)

3

Generalization
Part of the summarization problem
More precisely, its first step: accurate data collection
Caused by homonymy

6

Approaches
• User Identification Across Multiple Social Networks by
Jan Vosecky, Dan Hong, Vincent Y. Shen
• Features:
• Direct matching (nearest neighbor)
• Vector-Based Comparison Algorithm
• Fuzzy string field matching
• Weighted parameters

8

Approaches
Vector-based comparison:
profile {
:id = “50b2c847e3b24cf21400000”
:username = “darikcr”
:type “twitter”
:source “http://twitter.com/darikcr”
:name “NetBUG”
:lang “ru-RU”
:birthday nil
:email nil
:about “Linguist, programmer, also have some XPrience in making startups. Groaning for active shiny people to do business together”
:status “Две новые станции метро в Петербурге - "Бухарестскую" и "Международную" - откроют 27 декабря. Б... http://vk.cc/15wd4M ”
:tags nil
}
profile {
:id = “50b2c843e3b24cf214000005”
:username = “oleg.urzhumtsev”
:type “facebook”
:source “http://facebook.com/oleg.urzhumtsev”
:name “Oleg Urzhumtcev”
:alias “NetBUG”
:lang “ru-RU”
:birthday 1989/10/19
:email “darikcr@gmail.cm”
:about “”
:status “Checked in at HKUST Bus Station”
:tags nil
:university [“HKUST” “SkolTech” “ITMO” “SPbSU”]
:job [“ProMT JSC” “Israeli Embassy”]
:interests [“Linguistics” “motoschool” “programming” “startups”]
}
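
A minimal Ruby sketch of how two such profiles could be reduced to a per-field similarity vector and one weighted score. The field list, the weights, and the case-insensitive exact match are illustrative assumptions, not the settings used by Vosecky et al.

```ruby
# Hypothetical field weights; the slides do not list the actual values.
WEIGHTS = { username: 0.3, name: 0.3, email: 0.3, lang: 0.1 }.freeze

# Compare two profile hashes field by field.
# Each entry is 1.0 (match), 0.0 (mismatch) or nil (missing on either side).
def similarity_vector(a, b, fields = WEIGHTS.keys)
  fields.to_h do |f|
    x = a[f].to_s
    y = b[f].to_s
    score = if x.empty? || y.empty?
              nil
            else
              x.downcase == y.downcase ? 1.0 : 0.0
            end
    [f, score]
  end
end

# Weighted average over the fields present in both profiles.
def weighted_score(vector, weights = WEIGHTS)
  known = vector.reject { |_, s| s.nil? }
  total = known.keys.sum { |f| weights.fetch(f, 0.0) }
  return 0.0 if total.zero?
  known.sum { |f, s| weights.fetch(f, 0.0) * s } / total
end

twitter  = { username: "darikcr", name: "NetBUG", lang: "ru-RU", email: nil }
facebook = { username: "oleg.urzhumtsev", name: "Oleg Urzhumtcev",
             lang: "ru-RU", email: "darikcr@gmail.cm" }

vector = similarity_vector(twitter, facebook)
score  = weighted_score(vector)
```

In this sketch only `lang` matches, so the score stays low even though both profiles belong to the same person: the Twitter `name` equals the Facebook `alias`, which a field-by-field comparison never sees.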

9

Approaches
Fuzzy matching (VMN algorithm*):
#  String Pair                        VMN   SDS   SD
1  “Jan Vosecky”, “J Vosecky”         0.66  0.82  2.0
2  “Jan Vosecky”, “Vosecky Jan”       1.0   0.55  5.0
3  “Jan Vosecky”, “Honza vosecky”     0.5   0.36  7.0
4  “Jan Vosecky”, “Robert Vosecky”    0.5   0.55  5.0
5  “Jan Vosecky”, “Jan Smith”         0.5   0.45  6.0
6  “Jan Vosecky”, “Jack Vondracek”    0.0   0.27  8.0
Table 1. String Match Functions Comparison

• Partial matching
• Word swapping tolerance

• *Vosecky, Hong, Shen 2009

10
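
The slides do not spell out the VMN algorithm, so the Ruby sketch below is not VMN. It is a simple token-based matcher, written here only to illustrate the two listed properties: partial matching (an initial scores against a full name) and tolerance to word swapping. The function names and the scoring rule are assumptions.

```ruby
# Score two tokens: 1.0 if equal, a fractional score if one is a prefix of
# the other (e.g. an initial), 0.0 otherwise.
def token_score(a, b)
  return 1.0 if a == b
  short, long = [a, b].sort_by(&:length)
  long.start_with?(short) ? short.length.to_f / long.length : 0.0
end

# Order-insensitive similarity of two names, in the range 0.0..1.0.
# Each token of the first name takes its best score against the second name.
def name_similarity(s1, s2)
  t1 = s1.downcase.split
  t2 = s2.downcase.split
  return 0.0 if t1.empty? || t2.empty?
  best = t1.sum { |a| t2.map { |b| token_score(a, b) }.max }
  best / [t1.length, t2.length].max
end

swapped = name_similarity("Jan Vosecky", "Vosecky Jan")
initial = name_similarity("Jan Vosecky", "J Vosecky")
```

Because tokens are compared as a set rather than in sequence, swapped word order does not lower the score, which plain edit distance would penalize.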

Approaches
Drawbacks:
1. Suitable only for profiles whose attributes overlap well
2. Poor for discovery
3. No cross-parameter search

11

Approaches
Awareness of missing data:

12

Approaches
• Identifying Users Across Social Tagging Systems by
Tereza Iofciu, Peter Fankhauser, Fabian Abel, Kerstin Bischof
• Tagged entities
• ‘Bag-of-words’ document model
• Only basic matching
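
A rough Ruby sketch of a bag-of-words comparison between two users' tags. The slide does not say which comparison Iofciu et al. use; cosine similarity over tag counts is an assumption chosen for illustration, and the tag lists are made up.

```ruby
# Build a term-frequency bag from a list of tags (case-insensitive).
def bag_of_words(tags)
  tags.map(&:downcase).tally
end

# Cosine similarity between two bags (hashes of term => count).
def cosine_similarity(a, b)
  dot = a.sum { |term, n| n * b.fetch(term, 0) }
  norm = ->(bag) { Math.sqrt(bag.values.sum { |n| n * n }) }
  denom = norm.call(a) * norm.call(b)
  denom.zero? ? 0.0 : dot / denom
end

# Hypothetical tag lists for two accounts.
user_a = bag_of_words(%w[ruby nlp ruby startups])
user_b = bag_of_words(%w[Ruby startups music])

sim = cosine_similarity(user_a, user_b)
```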

13

Proposition
1. A profile is a non-uniform document whose
features have different types
2. Parameters are split into ‘unique’ and ‘frequent’
‘username’ is unique
‘surname’ is unique, although homonymy may occur
‘interests’ is frequent (shared by many people)

15

Proposition
3. Use a combined model:
1. Initial matching as in [1] (vector-based)
2. If that fails, continue with weight-based unique-attribute
matching
3. If that fails, continue with clustering and all-attribute
nearest-neighbor prediction
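
The three-step fallback can be sketched in Ruby as a chain of stages, each tried only if the previous one found nothing. The stage bodies below are placeholders written for illustration, not the matching logic from the slides, and all names are hypothetical.

```ruby
# Try each stage in order; return the first match with the stage that found it.
# `stages` maps a stage name to a callable taking (profile, candidates).
def match_profile(profile, candidates, stages)
  stages.each do |name, stage|
    match = stage.call(profile, candidates)
    return { stage: name, match: match } if match
  end
  nil
end

# Placeholder stages (assumptions for the example only).
direct  = ->(prof, cands) { cands.find { |c| c[:username] == prof[:username] } }
unique  = lambda do |prof, cands|
  cands.find { |c| prof[:name] && [c[:name], c[:alias]].include?(prof[:name]) }
end
cluster = ->(_prof, _cands) { nil } # clustering stage not sketched here

stages = { direct: direct, unique: unique, cluster: cluster }

twitter    = { username: "darikcr", name: "NetBUG" }
candidates = [{ username: "oleg.urzhumtsev", name: "Oleg Urzhumtcev", alias: "NetBUG" }]

result = match_profile(twitter, candidates, stages)
```

Here direct matching fails on the differing usernames and the second stage succeeds, because the Twitter name equals the Facebook alias.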

16

Proposition
Weight-based unique attribute matching
Similarity =
(this.unique_attrs.sum{|id,attr|
other.unique_attrs.values.include?(attr) ? weight_unique[id] : 0} +
this.frequent_attrs.count{|attr|
other.frequent_attrs.include?(attr)})
/ this.frequent_attrs.count{|attr|
!other.frequent_attrs.include?(attr)}.to_f
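
A runnable Ruby sketch of one reading of the slide's formula: weighted hits on unique attributes plus the number of shared frequent attributes, divided by the number of frequent attributes that are not shared. The `Profile` struct, the weight values, and the guard against a zero denominator are assumptions added for the example.

```ruby
# unique_attrs: hash of attribute id => value; frequent_attrs: array of values.
Profile = Struct.new(:unique_attrs, :frequent_attrs)

# Hypothetical weights; the slides do not give the values used.
WEIGHT_UNIQUE = { username: 1.0, surname: 0.8 }.freeze

def similarity(this, other, weights = WEIGHT_UNIQUE)
  other_unique = other.unique_attrs.values.compact
  # A unique attribute scores its weight if its value appears in any of the
  # other profile's unique attributes.
  unique_part = this.unique_attrs.sum { |id, attr|
    !attr.nil? && other_unique.include?(attr) ? weights.fetch(id, 0.0) : 0.0
  }
  shared   = this.frequent_attrs.count { |attr| other.frequent_attrs.include?(attr) }
  unshared = this.frequent_attrs.count { |attr| !other.frequent_attrs.include?(attr) }
  # Assumption: clamp the denominator to 1 so fully overlapping profiles
  # do not cause a division by zero.
  (unique_part + shared) / [unshared, 1].max.to_f
end

a = Profile.new({ username: "netbug", surname: "Urzhumtcev" },
                %w[linguistics programming startups])
b = Profile.new({ username: "oleg.urzhumtsev", surname: "Urzhumtcev" },
                %w[programming startups music])

sim = similarity(a, b)
```

The result is not normalized to the range 0..1, so under this reading it would be compared against a threshold rather than read as a probability.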

17

Proposition
Clustering
Hierarchical: the distribution seems to be even
• Distance: non-numeric parameter conversion
• Merging:
• surface features shared by 30% or more of the members,
for vector-like attributes
• Slow
• Reliable
• Probabilistic for singular features
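
A small Ruby sketch of the merging rule for a vector-like attribute: when clusters are merged, keep only the values shared by at least 30% of the members. The function name and the sample data are hypothetical.

```ruby
# members: one array of values per cluster member (e.g. each member's interests).
# Returns the values present in at least `threshold` of the members.
def merged_features(members, threshold = 0.3)
  return [] if members.empty?
  counts = members.flat_map(&:uniq).tally
  counts.select { |_, n| n.to_f / members.length >= threshold }.keys
end

interests = [
  %w[startups programming],
  %w[programming music],
  %w[programming],
  %w[chess]
]

shared = merged_features(interests)
```

With four members, a value held by a single member covers only 25% of the cluster and is dropped, while `programming` (three of four) is kept.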

Curse of dimensionality
18

Technical work
1. Data fetching:
1. About.me
2. Facebook
3. Twitter
2. Tools:
1. Ruby
2. Document-oriented NoSQL database: MongoDB
3. Implementation of vector-based weighted
comparison
4. Implementation of VMN algorithm

19

Testing
Data                      Direct nearest    Unique parameter  Combined (direct  LDA document-
                          neighbor matching matching          + unique +        based model
                                                              clustering)       (experimental)

Completeness (%)          51%               56%               74%               46%
Basic set                 53                58                78                95
Accuracy (%)              100%              98%               95%               51%
  of them false positive  0                 1                 3                 42
False negative            51                46                29                9

Extended set              56                62                127               N/T
Accuracy (%)              98%               54%               70%               N/T
  of them false positive  2, 5 (one value missing in source)                    N/T
False negative            51 (basic set)    37 (basic set)    28 (basic set)    N/T

21

Future work
1. Attempt to convert all parameters to numeric
format and apply SVM for clustering
2. Add semantic word similarity via WordNet
distance
3. Named Entity Recognition in text fields
4. Wrap the algorithms developed into a
single sleek Rails web application and open it
for public testing

22

Conclusion
1. All approaches studied had a strong
mathematical foundation but were poorly
adapted to real applications
2. Intuitive fusion of approaches suitable for
different situations may improve results
3. Further work is necessary to develop the best
approach

23

Thank you!
Questions?

Slides available at http://n3r.ru/c4
Demo and code available at http://n3r.ru/c5

Feel free to contact me: darikcr@gmail.com
http://about.me/netbug
...and enlarge your soft skills!

24
