Deep neural networks for matching
online social networking profiles
Vicentiu-Marian Ciorbaru & Traian Rebedea
University Politehnica of Bucharest, Romania
ICCCI 2017
Nicosia, Sep 27th
› Introduction
› Related work
– Personal web pages deduplication
– Social networking profiles matching
› Dataset
› Proposed approach
– Unsupervised vs Supervised
– Extracted features
– Deep neural network for profile matching
› Results
› Conclusions
› Online people search is a significant part of web search
(Artiles et al., 2010)
– 11-17% of queries include a person name
– ~4% contain only a person name
› Name ambiguity makes people search a complex problem to
solve efficiently
– Huge overlap in person names worldwide
– The most popular 90,000 full names (first and last name) worldwide are
shared by 100M+ individuals
› An important aspect of people search is finding most/all online
sources of information (e.g. web pages) related to the same person
– Recent shift from general web pages to specific ones, like social
networking sites and other professional communities
› Our problem: given a set of web pages extracted from online social
networks, determine the profiles which relate to the same individual
– Profile matching (or deduplication)
– Generates a (more) complete online identity for an individual
– Only uses public online information, however adding up all this information
about a person can cause privacy concerns
Related work
› Two main directions
– Personal web pages deduplication
– Matching social networking profiles
› First problem is more generic and complex, as one also
needs to extract personal information (e.g. name,
occupation, etc.) from a wide range of different structured
web pages
› Entity deduplication, in general, is a very complex field of
study in Databases, Natural Language Processing (NLP),
and Information Retrieval
Related work
› Web People Search (WePS) datasets and competitions (Artiles et al., 2009 & 2010)
› Given all web pages returned by a generic search engine for a popular name, group pages
such that each group corresponds to one specific person
› Most solutions employ clustering of the web pages using features extracted from pages
such as Wikipedia concepts, Named Entity Recognition (NER), bag of words (BoW), and
hyperlinks and different similarity measures
› A pairwise approach for solving this problem was also proposed
– Compute the probability that two pages refer to the same person
– Cluster pages by joining pairs that have a high probability to represent the same person
› WePS proposed B-cubed precision and recall for assessing performance
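As a rough illustration of the B-cubed measures mentioned above, the sketch below averages, over all items, the fraction of an item's predicted cluster that shares its true identity (precision) and the fraction of its true identity recovered in its predicted cluster (recall). The function name `bcubed` and the toy labels are assumptions made for this example, not part of the WePS tooling.

```python
from collections import defaultdict


def bcubed(predicted, truth):
    """Return (precision, recall) averaged over all items.

    predicted, truth: dicts mapping each item to a cluster label.
    Every item in `truth` is assumed to also appear in `predicted`.
    """
    pred_members = defaultdict(set)
    true_members = defaultdict(set)
    for item, label in predicted.items():
        pred_members[label].add(item)
    for item, label in truth.items():
        true_members[label].add(item)

    precision_sum = 0.0
    recall_sum = 0.0
    for item in truth:
        same_pred = pred_members[predicted[item]]
        same_true = true_members[truth[item]]
        correct = len(same_pred & same_true)
        precision_sum += correct / len(same_pred)
        recall_sum += correct / len(same_true)
    n = len(truth)
    return precision_sum / n, recall_sum / n


# Toy example (hypothetical): pages a, b, c belong to one person, d to another.
truth = {"a": "p1", "b": "p1", "c": "p1", "d": "p2"}
predicted = {"a": "c1", "b": "c1", "c": "c2", "d": "c2"}
precision, recall = bcubed(predicted, truth)
```

In this toy case the second predicted cluster mixes two people, which lowers precision, while splitting the first person across two clusters lowers recall.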
Related work
› More recent research focused on linking social networking
profiles belonging to the same individual
› Zhang et al. (2015) proposed a binary classifier using a
probabilistic graphical model (factor graph)
› Features computed using BoW and TF-IDF for the text in
each profile, but also its social status (position of node in
network) and connections
› Our solution uses only textual features, since the dataset
does not contain connections (e.g. friends or followers)
– These additional features, or others such as the avatar/profile image,
could only improve the results
› Snapshot of multiple social networking profiles collected from
15 different online social networks and community websites
– Academia, Code-Project, Facebook, Github, Google+, Instagram, Lanyrd, Linkedin, Mashable, Medium,
Moz, Quora, Slideshare, Twitter, and Vimeo
› For each profile, we extracted some/all of the following
information: username, name (full name or distinct first and last
names), gender, bio (short description), interests, publications,
jobs, etc.
› The average number of social profiles per individual is 2.04
and the maximum is 10
› Most profile pages feature a brief description (bio) of the owner
› Profiles do not contain connections, nor posts written by the
› Ground truth obtained from the website
– Complete online information for professionals
– Contains links to several social networking profiles
of the same person, added manually by each user
› Dataset contains information from over 200,000 accounts
› Total number of extracted social networking profiles:
› The corpus was created by Wholi and is one of
the largest corpora used for social profile matching
› While other datasets (Perito et al., 2011; Zhang et al.,
2016) have a larger number of distinct profiles, ground
truth is one order of magnitude larger for our dataset
– 200,000+ compared to ~10,000 items
– This allows training more complex classifiers, including
deep neural networks
› Ground truth data has been manually entered by users
– It might be incorrect in some cases (entry errors, user misbehaviour)
– Resembles crowdsourced datasets, which are very popular lately to train complex models
› Train and test sets respect the following rules:
1. Train and test sets should contain different online identities (i.e. different individuals)
2. The clusters in the training set should have no entries present in the test set in order to
avoid overfitted models
3. Test set has the same distribution for cluster sizes as the train set to provide a relevant
comparison for various sized online identities
› Positive items extracted from accounts, negative ones added
randomly between profiles with similar names, location, etc.
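One way to satisfy the three split rules above is to assign whole identities (clusters of profiles) to either side and to do so separately for each cluster size. The sketch below is only an illustration under that assumption: `split_identities`, `test_fraction`, and the generated profile ids are hypothetical, and negative-pair sampling is not covered.

```python
import random
from collections import defaultdict


def split_identities(identities, test_fraction=0.2, seed=0):
    """Split identities (lists of profile ids) into train and test sets.

    Whole identities go to one side, so no profile is shared, and the split
    is done per cluster size to keep the size distribution similar.
    """
    rng = random.Random(seed)
    by_size = defaultdict(list)
    for identity in identities:
        by_size[len(identity)].append(identity)

    train, test = [], []
    for size in sorted(by_size):
        group = by_size[size][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical identities: each inner list holds the profiles of one person.
identities = [[f"site{j}/user{i}" for j in range(size)]
              for i, size in enumerate([2] * 10 + [3] * 5)]
train, test = split_identities(identities)
```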
Proposed solution
› Main contribution is using a deep neural network (NN) for matching
online social networking profiles
› The NN can make efficient use of both textual features and
domain-specific ones
› Also performed a comparison with other solutions used in previous
studies, employing both unsupervised and supervised methods
› For the unsupervised approach, we first generated the feature vector
for each profile, then applied Hierarchical Agglomerative Clustering
(HAC) using cosine distance
› For binary classification we have a twofold objective
1. Detect whether two profiles refer to the same person and should be matched
(pairwise matching)
2. For the graph of connected profiles discovered in phase 1, compute its
connected components
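The second phase of the supervised approach can be sketched with a small union-find: every candidate pair whose predicted match probability reaches a threshold becomes an edge, and each connected component is reported as one identity. The threshold value, the function name, and the example scores are assumptions for illustration; the slides do not state how pairs are thresholded.

```python
def connected_components(profiles, pair_scores, threshold=0.5):
    """Group profiles linked by pairs scoring at or above the threshold."""
    parent = {p: p for p in profiles}

    def find(p):
        # Walk to the root, compressing the path along the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for (a, b), score in pair_scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for p in profiles:
        groups.setdefault(find(p), []).append(p)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical classifier outputs for candidate pairs.
profiles = ["twitter/ann", "github/ann", "linkedin/ann", "twitter/bob"]
pair_scores = {
    ("twitter/ann", "github/ann"): 0.97,
    ("github/ann", "linkedin/ann"): 0.91,
    ("linkedin/ann", "twitter/bob"): 0.08,
}
clusters = connected_components(profiles, pair_scores)
```

Note that a single false positive edge merges two whole identities, which is one reason to favour precision in the pairwise classifier.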
Extracted features
› Given a pair of profiles (a, b)
› Domain-specific features: distance measures computed on names
(full, first, last) and usernames, matching gender, matching location,
matching company/employer, etc.
› Text-based features
– Computed from all the other textual attributes in a profile (e.g. bio, publications,
– Used precomputed Word2Vec word embeddings with 300 dimensions,
averaged over all words in a profile
– Also computed cosine and Euclidean distance between word embeddings of the
candidate pair (a, b)
› Features normalization
– Compute the z-scores for each feature
– Whitening using Principal Component Analysis (PCA) in a 25-dimensional
vector space to remove noise
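The text-based features and the z-score step can be sketched in plain Python as follows. The tiny 3-dimensional `vectors` table stands in for the 300-dimensional precomputed Word2Vec embeddings, all function names are hypothetical, and the PCA whitening step is not shown.

```python
import math


def average_embedding(text, vectors, dim):
    """Average the vectors of the known words in a text (zeros if none)."""
    words = [w for w in text.lower().split() if w in vectors]
    if not words:
        return [0.0] * dim
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]


def cosine_distance(u, v):
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # convention chosen here for empty profiles (assumption)
    return 1.0 - sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def euclidean_distance(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def z_scores(column):
    """Standardize one feature column to zero mean and unit variance."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / std if std > 0 else 0.0 for x in column]


# Toy 3-dimensional vectors standing in for 300-dimensional Word2Vec ones.
vectors = {"data": [1.0, 0.0, 0.0], "scientist": [0.0, 1.0, 0.0],
           "chef": [0.0, 0.0, 1.0]}
a = average_embedding("Data scientist", vectors, 3)
b = average_embedding("scientist data", vectors, 3)
c = average_embedding("chef", vectors, 3)
```

Because the embedding is an average, word order is ignored: profiles `a` and `b` get identical vectors, while `c` is orthogonal to both.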
Deep neural network for profile matching
› Given the very large dataset and the recent advances of deep learning, we
propose a deep NN model for profile matching
› Deep NNs should be able to model more complex non-linear combinations of the
different features (domain specific, word embeddings)
› Proposed a model which uses 6 fully-connected (FC) layers with different activation functions
› The loss function uses cross-entropy, with an added weight for false positives
which contribute 10 times more to the loss
– Penalizes false connections between profiles and counteracts the imbalanced distribution
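One plausible reading of this weighted loss is a binary cross-entropy in which the term for negative pairs is multiplied by 10, so that scoring a non-matching pair as a match costs ten times more than missing a true match with the same confidence. The sketch below follows that reading; the exact formulation used by the authors is not given in the slides, and `weighted_cross_entropy` and `fp_weight` are names chosen for this example.

```python
import math


def weighted_cross_entropy(labels, probs, fp_weight=10.0, eps=1e-12):
    """Mean binary cross-entropy with the negative-class term up-weighted.

    High scores on true negatives (potential false positives) cost
    fp_weight times more than the matching error on positives.
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + fp_weight * (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Same confidence of error on each side: a missed match vs a false match.
missed_match = weighted_cross_entropy([1], [0.1])
false_match = weighted_cross_entropy([0], [0.9])
```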
Deep neural network for profile matching
› The first layer takes as input the features computed for the candidate profile pair
and goes into a larger feature space (612 → 1024)
› The next two layers iteratively reduce the dimensionality of the representation to a
denser feature space
› The final layers employ ReLU activation for the neurons, as ReLU units are known
to provide better results for binary classification (Nair & Hinton, 2010)
› Dropout is employed to avoid overfitting
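A minimal forward pass for a network of this shape might look like the sketch below. Only the 612 → 1024 first layer and the count of 6 fully-connected layers come from the slides; the remaining layer widths, the use of ReLU on every hidden layer, the sigmoid output, and the dropout rate are placeholders assumed for illustration, and training is not shown.

```python
import math
import random


def build_layers(sizes, seed=0):
    """Create random (weights, biases) for consecutive fully connected layers."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = math.sqrt(2.0 / n_in)
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(x, layers, dropout=0.0, rng=None):
    """Return the match probability for one feature vector.

    ReLU is used on every hidden layer and a sigmoid on the output; dropout
    is applied to hidden activations only when an rng is supplied (training).
    """
    for index, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if index == len(layers) - 1:
            return 1.0 / (1.0 + math.exp(-x[0]))
        x = [max(0.0, v) for v in x]
        if rng is not None and dropout > 0.0:
            # Inverted dropout: zero some units and rescale the rest.
            x = [0.0 if rng.random() < dropout else v / (1.0 - dropout)
                 for v in x]


# Only 612 -> 1024 comes from the slides; the other widths are placeholders.
SIZES = [612, 1024, 512, 256, 64, 16, 1]
layers = build_layers(SIZES)
features = [0.1] * 612
probability = forward(features, layers)
```

Passing an `rng` (for example `random.Random(0)`) together with a non-zero `dropout` switches the same function to its training-time behaviour.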
› Experiments performed using an imbalanced test set with one
positive profile pair for 100 negative ones
– Reflects a real-world scenario, where for each correct match between two
profiles, one compares tens/hundreds of incorrect (but similar) candidates
› Table shows B-cubed precision and recall obtained on the test set
› Using same names or similar names as baselines for comparison
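To see why the 1:100 class ratio matters, the short calculation below derives pairwise precision from recall, false positive rate, and the number of negatives per positive. It concerns pairwise precision rather than the B-cubed measure reported in the table, and the 1% false positive rate is a hypothetical value chosen only for illustration.

```python
def precision_at_ratio(recall, false_positive_rate, negatives_per_positive):
    """Pairwise precision implied by recall, FPR, and class ratio."""
    true_positives = recall  # expected true positives per positive pair
    false_positives = false_positive_rate * negatives_per_positive
    return true_positives / (true_positives + false_positives)


# Illustrative numbers only: the false positive rate is hypothetical.
balanced = precision_at_ratio(0.85, 0.01, 1)
imbalanced = precision_at_ratio(0.85, 0.01, 100)
```

With the same classifier, precision falls from roughly 0.99 on a balanced set to below 0.5 at 100 negatives per positive, so even a small false positive rate becomes costly.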
› Unsupervised methods (HAC) obtain poorer results than baseline
mainly because cosine is not a good measure for cluster/item
similarity for the proposed feature vectors
› The RF classifier performs well only when domain specific features
are added to the word embeddings
– The large training set limited the number of trees (to 12) in the forest
– RF usually performs poorly when using word embeddings for a pair of
documents (as they cannot compute a more complex similarity function)
› Mini-batch training of NNs allows using larger datasets than for RF
› The deep NN model learns a more complex combination between
word embeddings and domain specific features, grouping profiles
with similar embeddings and similar names
› Deep NN is the only model which can achieve both high recall
(R=0.85) and high precision (P=0.95)
Results – examples
› Ground truth
– ['twitter/etniqminerals', 'instagram/etniqminerals', 'googleplus/106318957043871071183',
'facebook/etniqminerals', 'facebook/rockcityelitesalsa', 'facebook/1renaissancewoman',
'facebook/naturalblackgirlguide', 'linkedin/leahpatterson']
› Computed
– ['facebook/1renaissancewoman', 'linkedin/leahpatterson', 'googleplus/106318957043871071183']
– ['twitter/etniqminerals', 'instagram/etniqminerals', 'facebook/etniqminerals']
– ['facebook/naturalblackgirlguide']
› “Leah Patterson” is an individual who has two different companies “Etniq Minerals” and
“Natural Black Girl Guide”
› Ground truth
– 3 different individuals whose first name is “Tim” and all of them work in IT
› Computed
– ['googleplus/113375270405699485276', 'linkedin/timsmith78', 'googleplus/117829094399867770981',
'twitter/bbyxinnocenz', 'facebook/tim.tio.5', 'vimeo/user616297', 'linkedin/timtio', 'twitter/wbcsaint',
› Proposed a large dataset for matching online social networking profiles
› This allowed us to train a deep neural network for profile matching
using both domain-specific features and word embeddings generated
from textual descriptions from social profiles
› Experiments showed that the NN surpassed both unsupervised and
supervised models, achieving a high precision (P = 0.95) with a good
recall rate (R = 0.85)
› As far as we know, this result outperforms existing approaches for
profile matching, but further validation is needed (to adapt it for other
datasets and/or use other methods on current dataset)
› Further advancements can be made by training more complex deep
learning models, using recurrent or convolutional networks, and by
adding features extracted from profile pictures
› A similar architecture has been proposed by Google (Covington et
al., 2016) for recommending YouTube videos
› However, we only found this work recently
Thank you!
Selected references
› Artiles, J., Borthwick, A., Gonzalo, J., Sekine, S., Amigo, E.: Weps-3 evaluation campaign: Overview of the web
people search clustering and attribute extraction tasks. In: CLEF (Notebook Papers/LABs/Workshops) (2010)
› Artiles, J., Gonzalo, J., Sekine, S.: Weps 2 evaluation campaign: overview of the web people search clustering
task. In: 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference. vol. 9. Citeseer (2009)
› Covington, P., Adams, J., & Sargin, E.: Deep neural networks for youtube recommendations. In Proceedings of
the 10th ACM Conference on Recommender Systems (pp. 191-198). ACM (2016)
› Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In: Proceedings of the 27th
International Conference on Machine Learning (ICML-10). pp. 807-814 (2010)
› Perito, D., Castelluccia, C., Kaafar, M.A., Manils, P.: How unique and traceable are usernames? In:
Proceedings of the 11th International Conference on Privacy Enhancing Technologies. pp. 1-17. PETS'11,
Springer-Verlag, Berlin, Heidelberg (2011)
› Zhang, Y., Tang, J., Yang, Z., Pei, J., Yu, P.S.: Cosnet: Connecting heterogeneous social networks with local
and global consistency. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining. pp. 1485-1494. KDD '15, ACM, New York, NY, USA (2015)
21 / 21
ICCCI 2017
Nicosia, Sep 27th
Deep neural networks for matching
online social networking profiles

More Related Content

What's hot

Tutorial on Question Answering Systems
Tutorial on Question Answering Systems Tutorial on Question Answering Systems
Tutorial on Question Answering Systems Saeedeh Shekarpour
What happened to the Semantic Web?
What happened to the Semantic Web?What happened to the Semantic Web?
What happened to the Semantic Web?Peter Mika
Making the Web Searchable - Keynote ICWE 2015
Making the Web Searchable - Keynote ICWE 2015Making the Web Searchable - Keynote ICWE 2015
Making the Web Searchable - Keynote ICWE 2015Peter Mika
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...Jonathan Stray
Complex Networks: Science, Programming, and Databases
Complex Networks: Science, Programming, and DatabasesComplex Networks: Science, Programming, and Databases
Complex Networks: Science, Programming, and DatabasesS.M. Mahdi Seyednezhad, Ph.D.
Frontiers of Computational Journalism week 3 - Information Filter Design
Frontiers of Computational Journalism week 3 - Information Filter DesignFrontiers of Computational Journalism week 3 - Information Filter Design
Frontiers of Computational Journalism week 3 - Information Filter DesignJonathan Stray
Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Peter Mika
Frontiers of Computational Journalism week 8 - Visualization and Network Anal...
Frontiers of Computational Journalism week 8 - Visualization and Network Anal...Frontiers of Computational Journalism week 8 - Visualization and Network Anal...
Frontiers of Computational Journalism week 8 - Visualization and Network Anal...Jonathan Stray
Frontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text AnalysisFrontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text AnalysisJonathan Stray
DBLP-SSE: A DBLP Search Support Engine
DBLP-SSE: A DBLP Search Support EngineDBLP-SSE: A DBLP Search Support Engine
DBLP-SSE: A DBLP Search Support EngineYi Zeng
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflowsSSSW
The Next Generation of AI-powered Search
The Next Generation of AI-powered SearchThe Next Generation of AI-powered Search
The Next Generation of AI-powered SearchTrey Grainger
Reflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital TransformationReflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital TransformationTrey Grainger
Optimizing Search User Interfaces and Interactions within Professional Social...
Optimizing Search User Interfaces and Interactions within Professional Social...Optimizing Search User Interfaces and Interactions within Professional Social...
Optimizing Search User Interfaces and Interactions within Professional Social...Nik Spirin
Looking beyond plain text for document representation in the enterprise
Looking beyond plain text for document representation in the enterpriseLooking beyond plain text for document representation in the enterprise
Looking beyond plain text for document representation in the enterpriseArjen de Vries
AI in linkedin
AI in linkedinAI in linkedin
AI in linkedinBill Liu
Illusions of Grandeur: Trust and Belief in Cultural Heritage Linked Open Data
Illusions of Grandeur: Trust and Belief in Cultural Heritage Linked Open DataIllusions of Grandeur: Trust and Belief in Cultural Heritage Linked Open Data
Illusions of Grandeur: Trust and Belief in Cultural Heritage Linked Open DataRobert Sanderson

What's hot (20)

Tutorial on Question Answering Systems
Tutorial on Question Answering Systems Tutorial on Question Answering Systems
Tutorial on Question Answering Systems
What happened to the Semantic Web?
What happened to the Semantic Web?What happened to the Semantic Web?
What happened to the Semantic Web?
Making the Web Searchable - Keynote ICWE 2015
Making the Web Searchable - Keynote ICWE 2015Making the Web Searchable - Keynote ICWE 2015
Making the Web Searchable - Keynote ICWE 2015
Social network analysis
Social network analysisSocial network analysis
Social network analysis
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
Frontiers of Computational Journalism week 1 - Introduction and High Dimensio...
Complex Networks: Science, Programming, and Databases
Complex Networks: Science, Programming, and DatabasesComplex Networks: Science, Programming, and Databases
Complex Networks: Science, Programming, and Databases
Recommender systems
Recommender systemsRecommender systems
Recommender systems
Frontiers of Computational Journalism week 3 - Information Filter Design
Frontiers of Computational Journalism week 3 - Information Filter DesignFrontiers of Computational Journalism week 3 - Information Filter Design
Frontiers of Computational Journalism week 3 - Information Filter Design
Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012
Frontiers of Computational Journalism week 8 - Visualization and Network Anal...
Frontiers of Computational Journalism week 8 - Visualization and Network Anal...Frontiers of Computational Journalism week 8 - Visualization and Network Anal...
Frontiers of Computational Journalism week 8 - Visualization and Network Anal...
Frontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text AnalysisFrontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text Analysis
DBLP-SSE: A DBLP Search Support Engine
DBLP-SSE: A DBLP Search Support EngineDBLP-SSE: A DBLP Search Support Engine
DBLP-SSE: A DBLP Search Support Engine
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflows
The Next Generation of AI-powered Search
The Next Generation of AI-powered SearchThe Next Generation of AI-powered Search
The Next Generation of AI-powered Search
Reflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital TransformationReflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital Transformation
Optimizing Search User Interfaces and Interactions within Professional Social...
Optimizing Search User Interfaces and Interactions within Professional Social...Optimizing Search User Interfaces and Interactions within Professional Social...
Optimizing Search User Interfaces and Interactions within Professional Social...
Looking beyond plain text for document representation in the enterprise
Looking beyond plain text for document representation in the enterpriseLooking beyond plain text for document representation in the enterprise
Looking beyond plain text for document representation in the enterprise
AI in linkedin
AI in linkedinAI in linkedin
AI in linkedin
Illusions of Grandeur: Trust and Belief in Cultural Heritage Linked Open Data
Illusions of Grandeur: Trust and Belief in Cultural Heritage Linked Open DataIllusions of Grandeur: Trust and Belief in Cultural Heritage Linked Open Data
Illusions of Grandeur: Trust and Belief in Cultural Heritage Linked Open Data

Similar to Deep neural networks for matching online social networking profiles

Zeroshot multimodal named entity disambiguation for noisy social media posts
Zeroshot multimodal named entity disambiguation for noisy social media postsZeroshot multimodal named entity disambiguation for noisy social media posts
Zeroshot multimodal named entity disambiguation for noisy social media postsSyo Kyojin
IEEE 2014 JAVA DATA MINING PROJECTS Multi comm finding community structure in...
IEEE 2014 JAVA DATA MINING PROJECTS Multi comm finding community structure in...IEEE 2014 JAVA DATA MINING PROJECTS Multi comm finding community structure in...
2014 IEEE JAVA DATA MINING PROJECT Multi comm finding community structure in ...
2014 IEEE JAVA DATA MINING PROJECT Multi comm finding community structure in ...2014 IEEE JAVA DATA MINING PROJECT Multi comm finding community structure in ...
2014 IEEE JAVA DATA MINING PROJECT Multi comm finding community structure in ...IEEEMEMTECHSTUDENTSPROJECTS
Big social data analytics - social network analysis
Big social data analytics - social network analysis Big social data analytics - social network analysis
Big social data analytics - social network analysis Jari Jussila
The Revolution Of Cloud Computing
The Revolution Of Cloud ComputingThe Revolution Of Cloud Computing
The Revolution Of Cloud ComputingCarmen Sanborn
An Efficient Modified Common Neighbor Approach for Link Prediction in Social ...
An Efficient Modified Common Neighbor Approach for Link Prediction in Social ...An Efficient Modified Common Neighbor Approach for Link Prediction in Social ...
An Efficient Modified Common Neighbor Approach for Link Prediction in Social ...IOSR Journals
What's all the data about? - Linking and Profiling of Linked Datasets
What's all the data about? - Linking and Profiling of Linked DatasetsWhat's all the data about? - Linking and Profiling of Linked Datasets
What's all the data about? - Linking and Profiling of Linked DatasetsStefan Dietze
Semantic technologies for the Internet of Things
Semantic technologies for the Internet of Things Semantic technologies for the Internet of Things
Semantic technologies for the Internet of Things PayamBarnaghi
From Linked Data to Semantic Applications
From Linked Data to Semantic ApplicationsFrom Linked Data to Semantic Applications
From Linked Data to Semantic ApplicationsAndre Freitas
Relation of Coffee Break and Productivity
Relation of Coffee Break and ProductivityRelation of Coffee Break and Productivity
Relation of Coffee Break and
Projection Multi Scale Hashing Keyword Search in Multidimensional Datasets
Projection Multi Scale Hashing Keyword Search in Multidimensional DatasetsProjection Multi Scale Hashing Keyword Search in Multidimensional Datasets
Projection Multi Scale Hashing Keyword Search in Multidimensional DatasetsIRJET Journal call for paper.on demand quality of web services using r... call for paper.on demand quality of web services using call for paper.on demand quality of web services using r... call for paper.on demand quality of web services using r...Alexander Decker
4.on demand quality of web services using ranking by multi criteria 31-35
4.on demand quality of web services using ranking by multi criteria 31-354.on demand quality of web services using ranking by multi criteria 31-35
4.on demand quality of web services using ranking by multi criteria 31-35Alexander Decker
The Social Semantic Web
The Social Semantic WebThe Social Semantic Web
The Social Semantic WebJohn Breslin
Official resume titash_mandal_
Official resume titash_mandal_Official resume titash_mandal_
Official resume titash_mandal_Titash Mandal
Semantic Technologies in Learning Environments
Semantic Technologies in Learning EnvironmentsSemantic Technologies in Learning Environments
Semantic Technologies in Learning EnvironmentsDragan Gasevic
Infrastructures Supporting Inter-disciplinary Research - Exemplars from the UK

Infrastructures Supporting Inter-disciplinary Research - Exemplars from the UK
Infrastructures Supporting Inter-disciplinary Research - Exemplars from the UK

Infrastructures Supporting Inter-disciplinary Research - Exemplars from the UK
Visualizing Networked Collaboration
Visualizing Networked CollaborationVisualizing Networked Collaboration
Visualizing Networked CollaborationAhmet Soylu

Similar to Deep neural networks for matching online social networking profiles (20)

Zeroshot multimodal named entity disambiguation for noisy social media posts
Zeroshot multimodal named entity disambiguation for noisy social media postsZeroshot multimodal named entity disambiguation for noisy social media posts
Zeroshot multimodal named entity disambiguation for noisy social media posts
IEEE 2014 JAVA DATA MINING PROJECTS Multi comm finding community structure in...
IEEE 2014 JAVA DATA MINING PROJECTS Multi comm finding community structure in...IEEE 2014 JAVA DATA MINING PROJECTS Multi comm finding community structure in...
IEEE 2014 JAVA DATA MINING PROJECTS Multi comm finding community structure in...
2014 IEEE JAVA DATA MINING PROJECT Multi comm finding community structure in ...
2014 IEEE JAVA DATA MINING PROJECT Multi comm finding community structure in ...2014 IEEE JAVA DATA MINING PROJECT Multi comm finding community structure in ...
2014 IEEE JAVA DATA MINING PROJECT Multi comm finding community structure in ...
Big social data analytics - social network analysis
Big social data analytics - social network analysis Big social data analytics - social network analysis
Big social data analytics - social network analysis
The Revolution Of Cloud Computing
The Revolution Of Cloud ComputingThe Revolution Of Cloud Computing
The Revolution Of Cloud Computing
An Efficient Modified Common Neighbor Approach for Link Prediction in Social ...
An Efficient Modified Common Neighbor Approach for Link Prediction in Social ...An Efficient Modified Common Neighbor Approach for Link Prediction in Social ...
An Efficient Modified Common Neighbor Approach for Link Prediction in Social ...
What's all the data about? - Linking and Profiling of Linked Datasets
What's all the data about? - Linking and Profiling of Linked DatasetsWhat's all the data about? - Linking and Profiling of Linked Datasets
What's all the data about? - Linking and Profiling of Linked Datasets
Semantic technologies for the Internet of Things
Semantic technologies for the Internet of Things Semantic technologies for the Internet of Things
Semantic technologies for the Internet of Things
From Linked Data to Semantic Applications
From Linked Data to Semantic ApplicationsFrom Linked Data to Semantic Applications
From Linked Data to Semantic Applications
Relation of Coffee Break and Productivity
Relation of Coffee Break and ProductivityRelation of Coffee Break and Productivity
Relation of Coffee Break and Productivity
Projection Multi Scale Hashing Keyword Search in Multidimensional Datasets
Projection Multi Scale Hashing Keyword Search in Multidimensional DatasetsProjection Multi Scale Hashing Keyword Search in Multidimensional Datasets
Projection Multi Scale Hashing Keyword Search in Multidimensional Datasets call for paper.on demand quality of web services using r... call for paper.on demand quality of web services using call for paper.on demand quality of web services using r... call for paper.on demand quality of web services using r...
4.on demand quality of web services using ranking by multi criteria 31-35
4.on demand quality of web services using ranking by multi criteria 31-354.on demand quality of web services using ranking by multi criteria 31-35
4.on demand quality of web services using ranking by multi criteria 31-35
The Social Semantic Web
The Social Semantic WebThe Social Semantic Web
The Social Semantic Web
Official resume titash_mandal_
Official resume titash_mandal_Official resume titash_mandal_
Official resume titash_mandal_
Semantic Technologies in Learning Environments
Semantic Technologies in Learning EnvironmentsSemantic Technologies in Learning Environments
Semantic Technologies in Learning Environments
Infrastructures Supporting Inter-disciplinary Research - Exemplars from the UK

Infrastructures Supporting Inter-disciplinary Research - Exemplars from the UK
Infrastructures Supporting Inter-disciplinary Research - Exemplars from the UK

Infrastructures Supporting Inter-disciplinary Research - Exemplars from the UK

Visualizing Networked Collaboration
Visualizing Networked CollaborationVisualizing Networked Collaboration
Visualizing Networked Collaboration

Deep neural networks for matching online social networking profiles

  • 1. Deep neural networks for matching online social networking profiles Vicentiu-Marian Ciorbaru & Traian Rebedea University Politehnica of Bucharest, Romania ICCCI 2017 Nicosia, Sep 27th
  • 2. Outline
      › Introduction
      › Related work
          – Personal web pages deduplication
          – Social networking profiles matching
      › Dataset
      › Proposed approach
          – Unsupervised vs Supervised
          – Extracted features
          – Deep neural network for profile matching
      › Results
      › Conclusions
  • 3. Introduction
      › Online people search is a significant part of web search (Artiles et al., 2010)
          – 11-17% of queries include a person name
          – ~4% contain only a person name
      › Name ambiguity makes people search a complex problem to solve efficiently
          – Huge overlap in person names worldwide
          – The most popular 90,000 full names (first and last name) worldwide are shared by 100M+ individuals
      › An important aspect in people search is to find most/all online sources of information (e.g. web pages) related to the same person
          – Recent shift from general web pages to specific ones, like social networking sites and other professional communities
  • 4. Introduction
      › Our problem: given a set of web pages extracted from online social networks, determine the profiles which relate to the same individual
          – Profile matching (or deduplication)
          – Generates a (more) complete online identity for an individual
          – Uses only public online information; however, aggregating all this information about a person can raise privacy concerns
  • 5. Related work
      › Two main directions
          – Personal web pages deduplication
          – Matching social networking profiles
      › The first problem is more generic and complex, as one also needs to extract personal information (e.g. name, occupation, etc.) from a wide range of differently structured web pages
      › Entity deduplication, in general, is a very complex field of study in Databases, Natural Language Processing (NLP), and Information Retrieval
  • 6. Related work
      › Web People Search (WePS) datasets and competitions (Artiles et al., 2009 & 2010)
      › Given all web pages returned by a generic search engine for a popular name, group pages such that each group corresponds to one specific person
      › Most solutions cluster the web pages using features extracted from them, such as Wikipedia concepts, Named Entity Recognition (NER), bag of words (BoW), and hyperlinks, combined with different similarity measures
      › A pairwise approach for solving this problem was also proposed
          – Compute the probability that two pages refer to the same person
          – Cluster pages by joining pairs that have a high probability to represent the same person
      › WePS proposed B-cubed precision and recall for assessing performance
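The B-cubed measures mentioned on slide 6 can be sketched as follows. This is a minimal illustration of the standard definition for non-overlapping clusters, not the WePS evaluation script; the function name `bcubed` and the toy item names are assumptions made for the example.

```python
def bcubed(predicted, truth):
    """Return (precision, recall) for two clusterings given as lists of sets.

    Assumes clusters do not overlap and that every item appears
    in exactly one predicted cluster and one ground-truth cluster.
    """
    pred_of = {item: c for c in map(frozenset, predicted) for item in c}
    true_of = {item: c for c in map(frozenset, truth) for item in c}
    items = list(true_of)

    # Per-item precision: share of the item's predicted cluster that
    # truly belongs with it. Per-item recall: share of its true cluster
    # that was recovered. Both are then averaged over all items.
    precision = sum(len(pred_of[i] & true_of[i]) / len(pred_of[i]) for i in items)
    recall = sum(len(pred_of[i] & true_of[i]) / len(true_of[i]) for i in items)
    return precision / len(items), recall / len(items)


# Hypothetical toy clusterings: item "c" is wrongly grouped with "d".
p, r = bcubed(predicted=[{"a", "b"}, {"c", "d"}], truth=[{"a", "b", "c"}, {"d"}])
print(round(p, 3), round(r, 3))
```

Because the scores are averaged per item rather than per pair, one large wrong merge lowers precision for every item in the merged cluster.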
  • 7. Related work
      › More recent research focused on linking social networking profiles belonging to the same individual
      › Zhang et al. (2015) proposed a binary classifier using a probabilistic graphical model (factor graph)
      › Features computed using BoW and TF-IDF for the text in each profile, but also its social status (position of node in network) and connections
      › Our solution uses only textual features, since the dataset does not contain connections (e.g. friends or followers)
          – These additional features, or others like the avatar/profile image, would only improve the results
  • 8. Dataset
      › Snapshot of multiple social networking profiles collected from 15 different online social networks and community websites
          – Academia, Code-Project, Facebook, Github, Google+, Instagram, Lanyrd, Linkedin, Mashable, Medium, Moz, Quora, Slideshare, Twitter, and Vimeo
      › For each profile, we extracted some/all of the following information: username, name (full name or distinct first and last names), gender, bio (short description), interests, publications, jobs, etc.
      › The average number of social profiles per individual is 2.04 and the maximum is 10
      › Most profile pages feature a brief description (bio) of the owner
      › Profiles do not contain connections, nor posts written by the owner
  • 9. Dataset
      › Ground truth obtained from the website
          – Complete online information for professionals
          – Contains links to several social networking profiles of the same person, added manually by each user
      › Dataset contains information from over 200,000 accounts
      › Total number of extracted social networking profiles: 500,000+
      › The corpus was created by Wholi and is one of the largest corpora used for social profile matching
      › While other datasets (Perito et al., 2011; Zhang et al., 2016) have a larger number of distinct profiles, ground truth is one order of magnitude larger for our dataset
          – 200,000+ compared to ~10,000 items
          – This allows training more complex classifiers, including deep neural networks
  • 10. Dataset
      › Ground truth data has been manually entered by users
          – It might be incorrect in some cases (entry errors, user misbehaviour)
          – Resembles crowdsourced datasets, which have recently become very popular for training complex models
      › Train and test sets respect the following rules:
          1. Train and test sets should contain different online identities (i.e. different individuals)
          2. The clusters in the training set should have no entries present in the test set, in order to avoid overfitted models
          3. The test set has the same distribution of cluster sizes as the train set, to provide a relevant comparison for online identities of various sizes
      › Positive items extracted from accounts, negative ones added randomly between profiles with similar names, location, etc.
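The three split rules on slide 10 can be met by splitting whole identities rather than individual profiles, stratified by cluster size. The sketch below is one possible way to do this; the function name `split_identities`, the 20% test fraction, and the toy identifiers are assumptions, since the slides do not state how the split was implemented.

```python
import random
from collections import defaultdict


def split_identities(clusters, test_fraction=0.2, seed=0):
    """Split identities (not profiles) into train and test lists.

    clusters: dict mapping an identity id to the list of its profile ids.
    test_fraction is a hypothetical value; the slides do not give one.
    """
    # Group identities by cluster size so each size is split separately,
    # which keeps the cluster-size distribution similar in both sets.
    by_size = defaultdict(list)
    for identity, profiles in clusters.items():
        by_size[len(profiles)].append(identity)

    rng = random.Random(seed)
    train, test = [], []
    for size in sorted(by_size):
        ids = sorted(by_size[size])
        rng.shuffle(ids)
        cut = round(len(ids) * test_fraction)
        test.extend(ids[:cut])
        train.extend(ids[cut:])
    return train, test


# Toy data: 10 identities with 2 profiles each, 5 identities with 3 profiles each.
toy = {f"id{i}": [f"p{i}_{k}" for k in range(2 if i < 10 else 3)] for i in range(15)}
train_ids, test_ids = split_identities(toy)
print(len(train_ids), len(test_ids))
```

Since an identity lands entirely in one set, no profile of a test individual can leak into training, which covers rules 1 and 2.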
  • 11. Proposed solution
      › Main contribution is using a deep neural network (NN) for matching online social networking profiles
      › The NN can make efficient use of both textual features and domain-specific ones
      › Also performed a comparison with other solutions used in previous studies, employing both unsupervised and supervised methods
      › For the unsupervised approach, we first generated the feature vector for each profile, then applied Hierarchical Agglomerative Clustering (HAC) using cosine distance
      › For binary classification we have a twofold objective
          1. Detect whether two profiles refer to the same person and should be matched (pairwise matching)
          2. For the graph of connected profiles discovered in phase 1, compute its connected components
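Phase 2 of the supervised approach on slide 11 (connected components over the matched pairs) can be sketched with a union-find structure. This is only an illustration: the function name `connected_components` is hypothetical, and the matched pairs below are invented for the example, not classifier output.

```python
def connected_components(profiles, matched_pairs):
    """Group profiles into identities given pairwise match decisions."""
    parent = {p: p for p in profiles}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for p in profiles:
        groups.setdefault(find(p), set()).add(p)
    return list(groups.values())


profiles = [
    "twitter/etniqminerals",
    "instagram/etniqminerals",
    "facebook/etniqminerals",
    "facebook/naturalblackgirlguide",
]
# Hypothetical pairwise decisions from phase 1.
pairs = [
    ("twitter/etniqminerals", "instagram/etniqminerals"),
    ("instagram/etniqminerals", "facebook/etniqminerals"),
]
for component in connected_components(profiles, pairs):
    print(sorted(component))
```

Note that a single false positive pair is enough to merge two identities into one component, which is one reason a high pairwise precision matters for this design.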
  • 12. Extracted features
      › Given a pair of profiles (a, b)
      › Domain-specific features: distance-based measures over names (full, first, last) and usernames, matching gender, matching location, matching company/employer, etc.
      › Text-based features
          – Computed from all the other textual attributes in a profile (e.g. bio, publications, interests)
          – Used precomputed Word2Vec word embeddings with 300 dimensions, averaged over all words in a profile
          – Also computed cosine and Euclidean distance between word embeddings of the candidate pair (a, b)
      › Feature normalization
          – Compute the z-scores for each feature
          – Whitening using Principal Component Analysis (PCA) in a 25-dimensional vector space to remove noise
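The text-based features and the z-score step on slide 12 can be illustrated with a few small functions. This is a sketch under stated assumptions: toy 2-dimensional vectors stand in for the 300-dimensional Word2Vec embeddings, all function names are hypothetical, and the PCA whitening step is left out.

```python
import math


def average_embedding(words, embeddings, dim):
    """Average the vectors of all known words; zero vector if none is known."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def cosine_distance(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # convention chosen here for empty profiles
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def z_scores(values):
    """Standardize one feature column to zero mean and unit variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))
    return [(x - mean) / std if std else 0.0 for x in values]


# Toy 2-dimensional embeddings (assumption; the slides use 300 dimensions).
toy_embeddings = {"data": [1.0, 0.0], "science": [0.0, 1.0], "music": [-1.0, 0.0]}
a = average_embedding("data science".split(), toy_embeddings, dim=2)
b = average_embedding("music science".split(), toy_embeddings, dim=2)
print(cosine_distance(a, b), euclidean_distance(a, b))
```

In a pairwise setup, these two distances would be appended to the domain-specific features of the candidate pair before normalization.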
  • 13. Deep neural network for profile matching
      › Given the very large dataset and the recent advances of deep learning, we propose a deep NN model for profile matching
      › Deep NNs should be able to model more complex non-linear combinations of the different features (domain specific, word embeddings)
      › Proposed a model which uses 6 fully-connected (FC) layers with different activation functions
      › The loss function uses cross-entropy, with an added weight for false positives which contribute 10 times more to the loss
          – Penalizes false connections between profiles and counteracts the imbalanced distribution
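One way to read the weighted loss on slide 13 is a binary cross-entropy in which the term for negative pairs is multiplied by 10, so that assigning a high match probability to a non-matching pair costs ten times more. The sketch below follows that interpretation, which is an assumption; the slides do not give the exact formula, and the name `weighted_cross_entropy` is hypothetical.

```python
import math


def weighted_cross_entropy(labels, probs, fp_weight=10.0, eps=1e-12):
    """Mean binary cross-entropy with a heavier penalty on negative pairs.

    labels: 1 for a true match, 0 for a non-match.
    probs: predicted probability of a match for each pair.
    fp_weight: multiplier on the loss of negative pairs (10 per the slides).
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        if y == 1:
            total += -math.log(p)
        else:
            total += -fp_weight * math.log(1.0 - p)
    return total / len(labels)


# The same uncertain prediction (0.5) costs ten times more on a negative pair.
print(weighted_cross_entropy([1], [0.5]), weighted_cross_entropy([0], [0.5]))
```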
  • 14. Deep neural network for profile matching
      › The first layer takes as input the features computed for the candidate profile pair and maps them into a larger feature space (612 → 1024)
      › The next two layers iteratively reduce the dimensionality of the representation to a denser feature space
      › The final layers employ ReLU activation for the neurons, as ReLU units are known to provide better results for binary classification (Nair & Hinton, 2010)
      › Dropout is employed to avoid overfitting
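The building blocks named on slide 14 (fully-connected layer, ReLU, dropout) can be sketched in plain Python. Only the 612 → 1024 size of the first layer comes from the slides; the function names, the tiny example dimensions, and the dropout rate are assumptions, and this is not the authors' implementation.

```python
import random


def dense(weights, bias, inputs):
    """Fully-connected layer: one weight row and one bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


def dropout(values, rate, rng):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    if rate <= 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def dense_parameter_count(n_in, n_out):
    return n_in * n_out + n_out  # weights plus biases


# Tiny 2 -> 2 layer as a stand-in for the real, much larger layers.
hidden = relu(dense([[1.0, 2.0], [3.0, -4.0]], [0.5, -0.5], [1.0, 1.0]))
print(hidden, dropout(hidden, rate=0.5, rng=random.Random(0)))

# Size of the first layer described on the slide (612 inputs, 1024 units).
print(dense_parameter_count(612, 1024))
```

Dropout is applied only during training; at prediction time the layer outputs are used unchanged, which is why the kept units are rescaled here.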
  • 15. Results
      › Experiments performed using an imbalanced test set with one positive profile pair for 100 negative ones
          – Reflects a real-world scenario, where for each correct match between two profiles, one compares tens/hundreds of incorrect (but similar) candidates
      › Table shows B-cubed precision and recall obtained on the test set
      › Matching on same names or similar names serves as the baselines for comparison
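A short calculation shows why the 1:100 imbalance on slide 15 makes false positives so costly. The rates below are hypothetical and chosen only for illustration; they are not results from the talk, and the measure shown is plain pairwise precision rather than B-cubed precision.

```python
def pairwise_precision(true_positive_rate, false_positive_rate, negatives_per_positive):
    """Expected pairwise precision for a given class imbalance."""
    tp = true_positive_rate  # expected true positives per positive pair
    fp = false_positive_rate * negatives_per_positive
    if tp + fp == 0.0:
        return 0.0
    return tp / (tp + fp)


# Hypothetical classifier: finds 90% of the matches, wrongly accepts 1% of non-matches.
print(round(pairwise_precision(0.9, 0.01, 1), 3))    # balanced pairs
print(round(pairwise_precision(0.9, 0.01, 100), 3))  # 1 positive per 100 negatives
```

With balanced pairs such a classifier looks almost perfect, while at 1:100 roughly half of its accepted pairs are wrong.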
  • 16. Results
      › Unsupervised methods (HAC) obtain poorer results than the baselines, mainly because cosine is not a good measure of cluster/item similarity for the proposed feature vectors
      › The RF classifier performs well only when domain-specific features are added to the word embeddings
          – The large training set limited the number of trees (to 12) in the forest
          – RF usually performs poorly when using word embeddings for a pair of documents (as it cannot compute a more complex similarity function)
      › Mini-batch training of NNs allows using larger datasets than for RF
      › The deep NN model learns a more complex combination between word embeddings and domain-specific features, grouping profiles with similar embeddings and similar names
      › Deep NN is the only model which can achieve both high recall (R=0.85) and high precision (P=0.95)
  • 17. Results – examples
      › Ground truth
          – ['twitter/etniqminerals', 'instagram/etniqminerals', 'googleplus/106318957043871071183', 'facebook/etniqminerals', 'facebook/rockcityelitesalsa', 'facebook/1renaissancewoman', 'facebook/naturalblackgirlguide', 'linkedin/leahpatterson']
      › Computed
          – ['facebook/1renaissancewoman', 'linkedin/leahpatterson', 'googleplus/106318957043871071183']
          – ['twitter/etniqminerals', 'instagram/etniqminerals', 'facebook/etniqminerals']
          – ['facebook/naturalblackgirlguide']
      › "Leah Patterson" is an individual who has two different companies, "Etniq Minerals" and "Natural Black Girl Guide"
      › Ground truth
          – 3 different individuals whose first name is "Tim" and all of them work in IT
      › Computed
          – ['googleplus/113375270405699485276', 'linkedin/timsmith78', 'googleplus/117829094399867770981', 'twitter/bbyxinnocenz', 'facebook/tim.tio.5', 'vimeo/user616297', 'linkedin/timtio', 'twitter/wbcsaint', 'twitter/turnitontim']
  • 18. Conclusions
      › Proposed a large dataset for matching online social networking profiles
      › This allowed us to train a deep neural network for profile matching using both domain-specific features and word embeddings generated from textual descriptions from social profiles
      › Experiments showed that the NN surpassed both unsupervised and supervised models, achieving a high precision (P = 0.95) with a good recall rate (R = 0.85)
      › As far as we know, this result outperforms existing approaches for profile matching, but further validation is needed (to adapt it for other datasets and/or use other methods on the current dataset)
      › Further advancements can be made by training more complex deep learning models, using recurrent or convolutional networks, and by adding features extracted from profile pictures
  • 19. Conclusions
      › A similar architecture has been proposed by Google (Covington et al., 2016) for recommending YouTube videos
      › However, we only found this work recently
  • 20. Thank you! Questions and feedback
  • 21. Selected references
      › Artiles, J., Borthwick, A., Gonzalo, J., Sekine, S., Amigo, E.: WePS-3 evaluation campaign: Overview of the web people search clustering and attribute extraction tasks. In: CLEF (Notebook Papers/LABs/Workshops) (2010)
      › Artiles, J., Gonzalo, J., Sekine, S.: WePS 2 evaluation campaign: Overview of the web people search clustering task. In: 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference. vol. 9. Citeseer (2009)
      › Covington, P., Adams, J., Sargin, E.: Deep neural networks for YouTube recommendations. In: Proceedings of the 10th ACM Conference on Recommender Systems. pp. 191-198. ACM (2016)
      › Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). pp. 807-814 (2010)
      › Perito, D., Castelluccia, C., Kaafar, M.A., Manils, P.: How unique and traceable are usernames? In: Proceedings of the 11th International Conference on Privacy Enhancing Technologies. pp. 1-17. PETS'11, Springer-Verlag, Berlin, Heidelberg (2011)
      › Zhang, Y., Tang, J., Yang, Z., Pei, J., Yu, P.S.: COSNET: Connecting heterogeneous social networks with local and global consistency. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 1485-1494. KDD '15, ACM, New York, NY, USA (2015)