Lise Getoor, Professor, Computer Science, UC Santa Cruz at MLconf SF

Big Graph Data Science
Lise Getoor
University of California, Santa Cruz
SF MLConf
November 14, 2014

BIG Data is not flat
©2004-2013 lonnitaylor

Data is multi-modal, multi-relational,
spatio-temporal, multi-media

NEED: Data Science for Graphs
New V: Vinculate

☐
Collec2ve
Classifica2on
☐
Link
Predic2on
☐
En2ty
Resolu2on

Collec&ve
Classifica&on:
inferring
the
labels
of
nodes
in
a
graph

Collective Classification
 spouse  friend

 

spouse
spouse
colleague
colleague
friend
friend
friend
Question: or ?


 

spouse
spouse
colleague
colleague
friend
friend
friend


 

spouse
spouse
colleague
colleague
friend
friend
friend
?
?
?

☐
Collec2ve
Classifica2on
☐
Link
Predic2on
☐
En2ty
Resolu2on
✔

Link
Predic&on:
inferring
the
existence
of
edges
in
a
graph

Link Prediction
¢ Entities
l People, Emails
¢ Observed relationships
l communications, co-location
¢ Predict relationships
l Supervisor, subordinate,
colleague
 #





  


☐
Collec2ve
Classifica2on
☐
Link
Predic2on
☐
En2ty
Resolu2on
✔
✔

En&ty
Resolu&on:
determining
which
nodes
refer
to
same
underlying
en2ty

Ironically, Entity Resolution has many duplicate names
Identity uncertainty
Doubles
Record linkage Duplicate detection
Deduplication
Reference reconciliation
Object identification
Object consolidation
Coreference resolution
Entity clustering
Household matching
Householding Reference matching
Fuzzy match
Approximate match
Merge/purge
Hardening soft databases

Entity Resolution & Network Analysis
Tutorial on Entity Resolution in Big Data w/ Ashwin
Machanavajjhala @ KDD13 (slides available)
before
after

☐
Collec2ve
Classifica2on
☐
Link
Predic2on
☐
En2ty
Resolu2on
✔
✔
✔

Graph
Iden2fica2on
• Goal:
– Given
an
input
graph
infer
an
output
graph
• Three
major
components:
– En&ty
Resolu&on
(ER):
Infer
the
set
of
nodes
– Link
Predic&on
(LP):
Infer
the
set
of
edges
– Collec&ve
Classifica&on
(CC):
Infer
the
node
labels
• Challenge:
The
components
are
intra
and
inter-‐dependent
Namata,
Kok,
Getoor,
KDD
2011

☐
Feature
Construc2on
☐
Collec2ve
Reasoning
☐
Blocking

Key Idea #1: Relational
Feature Construction

Key Idea: Feature Construction
¢ Feature informativeness is key to the success of a
relational classifier
¢ Relational feature construction
l Node-specific measures
• Aggregates: summarize attributes of relational neighbors
• Structural properties: capture characteristics of the relational
structure
l Node-pair measures: summarize properties of
(potential) edges
• Attribute-based measures
• Edge-based measures
• Neighborhood similarity measures

Relational Classifiers
¢ Pros:
l Efficient: Can handle large amounts of data
• Features can often be pre-computed ahead of time
l Flexible: Can take advantage of well-understood classification/
regression algorithms
l One of the most commonly-used ways of incorporating
relational information
¢ Cons:
l Features are based on observed values/evidence, cannot be
based on attributes or relations that are being predicted
l Makes incorrect independence assumptions
l Hard to impose global constraints on joint assignments
• For example, when inferring a hierarchy of individuals, we may want
to enforce constraint that it is a tree

☐
Feature
Construc2on
☐
Collec2ve
Reasoning
☐
Blocking
✔

?


Donates(A, ) => Votes(A, ): 5.0
?
$ $

Tweet
Status
update
Mentions(A, “Affordable Health”) => Votes(A, ): 0.3

vote(A,P) ∧ friend(B,A) à vote(B,P) : 0.3
  
spouse
spouse friend
friend
 
colleague

spouse
colleague
vote(A,P) ∧ spouse(B,A) à vote(B,P) : 0.8
friend
friend

Link Prediction
§ People, emails, words,
communication, relations
§ Use model to express
dependencies
- “If email content suggests type X, it
is of type X”
- “If A sends deadline emails to B,
then A is the supervisor of B”
- “If A is the supervisor of B, and A is
the supervisor of C, then B and C are
colleagues”
 #





  


Link Prediction
HasWord(E, “due”) => Type(E, deadline) : 0.6
dependencies
is of type X”
colleagues”
 #
complete by

due




  


Link Prediction
dependencies
is of type X”
colleagues”
 #





  

Sends(A,B,E) ^ Type(E,deadline) => Supervisor(A,B) : 0.8

Link Prediction
dependencies
is of type X”
colleagues”
 #





  

Supervisor(A,B) ^ Supervisor(A,C) => Colleague(B,C) : ∞

Entity Resolution
§ Entities
- People References
§ Attributes
- Name
§ Relationships
- Friendship
§ Goal: Identify
references that denote
the same person
John Smith J. Smith
name name
A B
C
E
friend friend
D F G
H
=
=

Entity Resolution
§ References, names,
friendships
dependencies
- ‘’If two people have similar names,
they are probably the same’’
- ‘’If two people have similar friends,
- ‘’If A=B and B=C, then A and C must
also denote the same person’’
John Smith J. Smith
name name
A B
C
E
friend friend
D F G
H
=
=

Entity Resolution
friendships
dependencies
A.name ≈{str_sim} B.name => A≈B : 0.8
John Smith J. Smith
name name
A B
C
E
friend friend
D F G
H
=
=

Entity Resolution
friendships
dependencies
John Smith J. Smith
name name
A B
C
E
friend friend
D F G
H
=
=
{A.friends} ≈{} {B.friends} => A≈B : 0.6

Entity Resolution
friendships
dependencies
John Smith J. Smith
name name
A B
C
E
friend friend
D F G
H
=
=
A≈B ^ B≈C => A≈C : ∞

Challenges
¢ Collective Classification: labeling nodes in graph
l irregular structure, not a chain, not a grid
l Challenge: One large partially labeled graph
¢ Link prediction: predicting edges in graph
l Dependencies among edges
l Don’t want to reason about all possible edges
l Challenge: scaling & extremely skewed probabilities
¢ Entity resolution: determine nodes that refer to
same entities in a graph
l Dependencies between clusters
l Challenge: enforcing constraints, e.g. transitive closure

☐
Feature
Construc2on
☐
Collec2ve
Reasoning
☐
Blocking
✔
✔

Blocking: Motivation
¢ Naïve pairwise: |N|2 pairwise comparisons
l 1000 business listings each from 1,000 different cities
across the world
l 1 trillion comparisons
l 11.6 days (if each comparison is 1 μs)
¢ Mentions from different cities are unlikely to be
matches
l Blocking Criterion: City
l 1 billion comparisons
l 16 minutes (if each comparison is 1 μs)

¢ Mentions from different cities are unlikely to be
matches
l May miss potential matches
47

Matching
Pairs
of
Nodes
Set
of
all
Pairs
of
Nodes
Pairs
of
Nodes
sa&sfying
Blocking
criterion

Blocking Algorithms 1
¢ Hash based blocking
l Each block Ci is associated with a hash key hi.
l Mention x is hashed to Ci if hash(x) = hi.
l Within a block, all pairs are compared.
l Each hash function results in disjoint blocks.
¢ What hash function?
l Deterministic function of attribute values
l Boolean Functions over attribute values
[Bilenko et al ICDM’06, Michelson et al AAAI’06,
Das Sarma et al CIKM ‘12]
l minHash (min-wise independent permutations)
[Broder et al STOC’98]; locality sensitive hashing

Blocking Algorithms 2
¢ Pairwise Similarity/Neighborhood based blocking
l Nearby nodes according to a similarity metric are
clustered together
l Results in non-disjoint canopies.
¢ Techniques
l Sorted Neighborhood Approach [Hernandez et al
SIGMOD’95]
l Canopy Clustering [McCallum et al KDD’00]

☐
Feature
Construc2on
☐
Collec2ve
Reasoning
☐
Blocking
✔
✔
✔

http://psl.umiacs.umd.edu
Matthias Stephen Bach Broecheler Alex Memory Lily Mihalkova Stanley Kok
Bert Huang
Angelika Kimmig
Ben London Arti Ramesh Jay Pujara Shobeir Fakhraei Hui Miao

Probabilistic Soft Logic (PSL)
Declarative language based on logics to express
collective probabilistic inference problems
- Predicate = relationship or property
- Atom = (continuous) random variable
- Rule = capture dependency or constraint
- Set = define aggregates
PSL Program = Rules + Input DB

PSL Foundations
§ PSL makes large-scale reasoning scalable by
mapping logical rules to convex functions
§ Three principles justify this mapping:
- LP programs for MAX SAT with approximation
guarantees [Goemans and Williamson, ’94]
- Pseudomarginal LP relaxations of Boolean Markov
random fields [Wainwright, et al., ’02]
- Łukasiewicz logic, a logic for reasoning about
continuous values [Klir and Yuan, ‘95]

Hinge-loss Markov Random Fields
P(Y|X) =
1
Z
exp
2
4
Xm
j=1
wj max{`j(Y,X), 0}pj
3
5
§ Continuous variables in [0,1]
§ Potentials are hinge-loss functions
§ Subject to arbitrary linear constraints
§ Log-concave!

PSL in a Slide
§ MAP Inference in PSL translates into convex optimization
problem - inference is really fast!
§ Inference further enhanced with state-of-the-art
optimization and distributed processing paradigms such as
ADMM GraphLab - inference even faster!
§ Outperforms discrete MRFs in terms of speed, and (very)
often accuracy
§ PSL is flexible: Applied to image segmentation, activity
recognition, stance-detection, sentiment analysis,
document classification, drug target prediction, latent
social groups and trust, engagement modeling, ontology
alignment, and looking for more!

Pa#erns
Key
Ideas
Tools
NEED: Data Science for Graphs

Closing
Comments
• Need
new
sta2s2cal
and
ML
theory
for
very
large
graphs
– Be#er
understand
bias,
iden2fiability,
and
mechanism
design
in
graph
domains
• Important
topics
not
touched
on
– Genera2ve
mechanisms,
causal
reasoning,
dynamic
modeling
– Privacy,
Users,

Visualiza2on
• Many
exci2ng
opportuni2es
to
develop
new
theory
and
algorithms
and
apply
them
to
compelling
business,
intelligence,
social,
and
societal
problems!

Thank You!
psl.umiacs.umd.edu
Contact information:
getoor@soe.ucsc.edu

Lise Getoor, Professor, Computer Science, UC Santa Cruz at MLconf SF

Lise Getoor, Professor, Computer Science, UC Santa Cruz at MLconf SF

Recommandé

Recommandé

Contenu connexe

Similaire à Lise Getoor, Professor, Computer Science, UC Santa Cruz at MLconf SF

Similaire à Lise Getoor, Professor, Computer Science, UC Santa Cruz at MLconf SF (20)

Plus de MLconf

Plus de MLconf (20)

Dernier

Dernier (20)

Lise Getoor, Professor, Computer Science, UC Santa Cruz at MLconf SF