Public profile
1. Public profile matching
Urzhumtcev Oleg, SkolTech, ITMO1
Instructor: Raymond Chi-Wing Wong, HKUST
27 November 2012, Hong Kong
1
http://en.qdinvest.ru
3. Problem
There are many objects in the world
Some of them are named entities, including people
They may have different representations (profiles)
8. Approaches
• User Identification Across Multiple Social Networks by Jan Vosecky, Dan Hong, Vincent Y. Shen
• Features:
• Direct matching (nearest neighbor)
• Vector-Based Comparison Algorithm
• Fuzzy string field matching
• Weighted parameters
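The fuzzy string field matching listed above can be illustrated with a minimal Ruby sketch. It assumes Levenshtein edit distance as the similarity measure, which is a stand-in and not necessarily the measure used by Vosecky et al.; the method names `levenshtein` and `fuzzy_similarity` are hypothetical.

```ruby
# Assumption: Levenshtein edit distance stands in for the paper's fuzzy matcher.
# Classic dynamic programming, keeping only the previous row of the table.
def levenshtein(a, b)
  a = a.chars
  b = b.chars
  prev = (0..b.length).to_a
  a.each_with_index do |ca, i|
    curr = [i + 1]
    b.each_with_index do |cb, j|
      cost = ca == cb ? 0 : 1
      # deletion, insertion, substitution
      curr << [prev[j + 1] + 1, curr[j] + 1, prev[j] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Similarity in [0, 1]: 1.0 for identical strings (ignoring case),
# 0.0 when every character has to change.
def fuzzy_similarity(a, b)
  a = a.to_s.downcase
  b = b.to_s.downcase
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?
  1.0 - levenshtein(a, b).to_f / longest
end
```

Under this sketch, `fuzzy_similarity("Urzhumtcev", "Urzhumtsev")` gives 0.9, because the two transliterations differ by one character out of ten.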
9. Approaches
Vector-based comparison:
Twitter profile:
profile {
  :id = “50b2c847e3b24cf21400000”
  :username = “darikcr”
  :type “twitter”
  :source “http://twitter.com/darikcr”
  :name “NetBUG”
  :lang “ru-RU”
  :birthday nil
  :email nil
  :about “Linguist, programmer, also have some XPrience in making startups. Groaning for active shiny people to do business together”
  :status “Две новые станции метро в Петербурге - "Бухарестскую" и "Международную" - откроют 27 декабря. Б... http://vk.cc/15wd4M”
  :tags nil
}
Facebook profile:
profile {
  :id = “50b2c843e3b24cf214000005”
  :username = “oleg.urzhumtsev”
  :type “facebook”
  :source “http://facebook.com/oleg.urzhumtsev”
  :name “Oleg Urzhumtcev”
  :alias “NetBUG”
  :lang “ru-RU”
  :birthday 1989/10/19
  :email “darikcr@gmail.com”
  :about “”
  :status “Checked in at HKUST Bus Station”
  :tags nil
  :university [“HKUST” “SkolTech” “ITMO” “SPbSU”]
  :job [“ProMT JSC” “Israeli Embassy”]
  :interests [“Linguistics” “motoschool” “programming” “startups”]
}
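A weighted comparison of the two profiles above might look like the following Ruby sketch. The field pairs, the weights, and the exact-match scoring are illustrative assumptions, not values from the cited paper; `FIELD_PAIRS`, `field_score` and `compare_profiles` are hypothetical names. Fields missing on either side are skipped, so they neither help nor hurt the score.

```ruby
TWITTER = {
  username: "darikcr", name: "NetBUG", lang: "ru-RU",
  birthday: nil, email: nil
}.freeze

FACEBOOK = {
  username: "oleg.urzhumtsev", name: "Oleg Urzhumtcev", alias: "NetBUG",
  lang: "ru-RU", birthday: "1989/10/19", email: "darikcr@gmail.com"
}.freeze

# Hypothetical plan: [field in profile A, field in profile B, weight].
FIELD_PAIRS = [
  [:username, :username, 0.4],
  [:name,     :name,     0.3],
  [:name,     :alias,    0.3],
  [:email,    :email,    0.5],
  [:lang,     :lang,     0.05]
].freeze

# 1.0 for a case-insensitive exact match, 0.0 for a mismatch,
# nil when either value is missing (so the pair can be skipped).
def field_score(x, y)
  return nil if x.nil? || y.nil? || x.to_s.empty? || y.to_s.empty?
  x.to_s.casecmp?(y.to_s) ? 1.0 : 0.0
end

# Weighted average of the per-field scores over the pairs that could be compared.
def compare_profiles(a, b, pairs = FIELD_PAIRS)
  scored = pairs.map { |fa, fb, w| [field_score(a[fa], b[fb]), w] }
                .reject { |score, _| score.nil? }
  total_weight = scored.sum { |_, w| w }
  return 0.0 if total_weight.zero?
  scored.sum { |score, w| score * w } / total_weight
end
```

With these made-up weights, `compare_profiles(TWITTER, FACEBOOK)` is about 0.33: only the name/alias pair and the language match, and the email pair is skipped because the Twitter profile has none.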
13. Approaches
• Identifying Users Across Social Tagging Systems by Tereza Iofciu, Peter Fankhauser, Fabian Abel, Kerstin Bischof
• Tagged entities
• ‘Bag-of-words’ document model
• Only basic matching
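The ‘bag-of-words’ document model mentioned above can be sketched in Ruby as follows. Cosine similarity is assumed here as the comparison measure; the cited paper may use a different one, and the method names are hypothetical.

```ruby
# Turn a list of tags into a term-frequency hash (the "bag").
def bag_of_words(tokens)
  tokens.each_with_object(Hash.new(0)) do |token, bag|
    term = token.to_s.downcase.strip
    bag[term] += 1 unless term.empty?
  end
end

# Cosine similarity between two bags: 1.0 for identical tag distributions,
# 0.0 when the bags share no term or one of them is empty.
def cosine_similarity(bag_a, bag_b)
  dot = bag_a.sum { |term, count| count * bag_b.fetch(term, 0) }
  norm_a = Math.sqrt(bag_a.values.sum { |c| c * c })
  norm_b = Math.sqrt(bag_b.values.sum { |c| c * c })
  return 0.0 if norm_a.zero? || norm_b.zero?
  dot / (norm_a * norm_b)
end

tags_a = bag_of_words(%w[Linguistics programming startups])
tags_b = bag_of_words(%w[linguistics motoschool programming startups])
similarity = cosine_similarity(tags_a, tags_b)
```

Because the model ignores order and context, two users with the same popular tags score highly even if they are different people, which is one reason such matching stays basic.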
15. Proposition
1. A profile is a non-uniform document with features of different types
2. Parameters split into ‘unique’ and ‘frequent’:
• ‘username’ is unique
• ‘surname’ is unique, although homonymy may occur
• ‘interests’ is frequent (shared by many people)
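One possible way to make the ‘unique’ versus ‘frequent’ split concrete is sketched below in Ruby. It assumes that uniqueness can be estimated as the ratio of distinct values to all observed values of an attribute; both that heuristic and the 0.8 threshold are assumptions for illustration, and `uniqueness` and `split_attributes` are hypothetical names.

```ruby
# Ratio of distinct values to all observed values of an attribute:
# close to 1.0 when almost every profile has its own value (e.g. a username),
# lower when many profiles share values (e.g. interests).
def uniqueness(profiles, attribute)
  values = profiles.flat_map { |profile| Array(profile[attribute]) }.compact
  return 0.0 if values.empty?
  values.uniq.length.to_f / values.length
end

# Returns [unique_attributes, frequent_attributes].
# The threshold is an arbitrary illustrative cut-off.
def split_attributes(profiles, attributes, threshold: 0.8)
  attributes.partition { |attribute| uniqueness(profiles, attribute) >= threshold }
end
```

Homonymy shows up in this measure directly: two people sharing a surname lower its ratio slightly without making it ‘frequent’.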
16. Proposition
3. Use a combined model:
1. Initial matching as in [1] (vector-based)
2. If that fails, continue to weight-based matching on unique attributes
3. If that fails, continue to clustering and nearest-neighbor prediction over all attributes
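The three-step fallback above can be sketched as a simple cascade in Ruby. "Fails" is assumed here to mean that a stage scores below a threshold or cannot decide; the stage bodies are deliberately trivial placeholders, and `cascade_match`, `STAGES` and the 0.7 threshold are hypothetical.

```ruby
# Placeholder stages: each takes two profiles and returns a Float score,
# or nil when it cannot decide. Real stages would be far more elaborate.
STAGES = {
  vector_based: lambda do |a, b|
    (a[:name] && a[:name] == b[:name]) ? 1.0 : 0.0
  end,
  unique_attributes: lambda do |a, b|
    left  = [a[:username], a[:name], a[:alias]].compact
    right = [b[:username], b[:name], b[:alias]].compact
    (left & right).empty? ? 0.0 : 1.0
  end,
  clustering_nn: ->(_a, _b) { nil } # not implemented in this sketch
}.freeze

# Try the stages in order and stop at the first one that is confident enough.
def cascade_match(a, b, stages = STAGES, threshold: 0.7)
  stages.each do |name, stage|
    score = stage.call(a, b)
    if score && score >= threshold
      return { matched: true, stage: name, score: score }
    end
  end
  { matched: false, stage: nil, score: nil }
end
```

Ordering the stages from cheapest to most expensive means the slow clustering step only runs for the pairs that the earlier steps could not resolve.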
18. Proposition
Clustering
Hierarchical: the distribution seems to be even
• Distance: non-numeric parameter conversion
• Merging:
• surface features shared by 30% or more of the members, for vector-like attributes
• Slow
• Reliable
• Probabilistic for singular features
Curse of dimensionality
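The merging rule above (keep features shared by 30% of members or more, for vector-like attributes) might be implemented along these lines in Ruby. This is a sketch under the assumption that a cluster is an array of profile hashes; `shared_features` and its `min_share` parameter are hypothetical names.

```ruby
# For a vector-like attribute (e.g. :university or :interests), return the
# values that appear in at least min_share of the cluster's member profiles.
def shared_features(members, attribute, min_share: 0.3)
  return [] if members.empty?
  counts = Hash.new(0)
  members.each do |member|
    # uniq: a value repeated inside one profile still counts once
    Array(member[attribute]).uniq.each { |value| counts[value] += 1 }
  end
  counts.select { |_, n| n.to_f / members.length >= min_share }.keys
end

cluster = [
  { university: ["HKUST", "ITMO"] },
  { university: ["HKUST"] },
  { university: ["SPbSU"] },
  { university: ["ITMO", "HKUST"] }
]
merged = shared_features(cluster, :university)
```

In this example "HKUST" (3 of 4 members) and "ITMO" (2 of 4) survive the merge, while "SPbSU" (1 of 4, or 25%) falls under the 30% cut-off.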
19. Technical work
1. Data fetching:
1. About.me
2. Facebook
3. Twitter
2. Tools:
1. Ruby
2. Document-oriented NoSQL database: MongoDB
3. Implementation of vector-based weighted comparison
4. Implementation of VMN algorithm
22. Future work
1. Attempt to convert all parameters to numeric format and apply SVM for clustering
2. Add semantic word similarity via WordNet distance
3. Named Entity Recognition in text fields
4. Wrap the developed algorithms in a single sleek Rails web application and open it for public testing
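For the first item above, one common way to turn a non-numeric, list-valued attribute into numbers is one-hot encoding. The Ruby sketch below is only an assumption about how the conversion could be done, not the planned implementation; `build_vocabulary` and `encode` are hypothetical names.

```ruby
# Collect every distinct value of an attribute across all profiles, sorted so
# that each value always maps to the same vector position.
def build_vocabulary(profiles, attribute)
  profiles.flat_map { |profile| Array(profile[attribute]) }.uniq.sort
end

# One-hot encode a profile's attribute: 1 where the profile has the value,
# 0 elsewhere. The vector length equals the vocabulary size.
def encode(profile, attribute, vocabulary)
  present = Array(profile[attribute])
  vocabulary.map { |value| present.include?(value) ? 1 : 0 }
end
```

The vector grows by one dimension for every distinct value, so attributes with many rare values run straight into the curse of dimensionality.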
23. Conclusion
1. All approaches studied had a strong mathematical background but were poorly adapted to real applications
2. An intuitive fusion of approaches suited to different situations may improve results
3. Further work is necessary to develop the best approach
24. Thank you!
Questions?
Slides available at http://n3r.ru/c4
Demo and code available at http://n3r.ru/c5
Feel free to contact me: darikcr@gmail.com
http://about.me/netbug
...and enlarge your soft skills!
Editor's notes
As shown in the previous slide, Chinese provides broad opportunities for homonymy. However, even in comparatively small Russia there is a guy with the same name and surname as me.
Performance has not been tested due to small testing data set
However, Jan demonstrated the problem of missing data.