Building Identity Graphs over Heterogeneous Data
Sudha Viswanathan
Saigopal Thota
Primary Contributors
Agenda
▪ Identities at Scale
▪ Why Graph
▪ How we Built and Scaled it
▪ Why an In-house solution
▪ Challenges and Considerations
▪ A peek into Real time Graph
Identities at Scale
[Figure: identity tokens (Account Ids, Cookies, Online Ids, Device Ids) collected from Online 1, Online 2, ..., App 1, App 2, App 3 and Partner Apps]
Identity Resolution
Aims to provide a coherent view of a
customer and / or a household by unifying
all customer identities across channels and
subsidiaries
Provides a single notion of a customer
Why Graph
Identities, Linkages & Metadata – An Example
[Figure: identities (Login id, App id, Device id, Cookie id) joined by linkages, each annotated with metadata such as Last login: 4/28/2020, App: YouTube, Country: Canada. Legend: Identity, Linkage, Metadata]
Graph – An Example
[Figure: the same identities (Login id, App id, Device id, Cookie id) drawn as one graph, with metadata Last login: 4/28/2020, App: YouTube, Country: Canada]
Connect all Linkages to create a single connected component per user/household
Graph Traversal
▪ Graph is an efficient data
structure relative to table joins
▪ Why don't table joins work?
▪ Linkages are on the order of millions of rows spanning hundreds of tables
▪ Table joins are index-based and computationally very expensive
▪ Table joins result in lower coverage
Scalable and offers better coverage
Build once – Query multiple times
▪ Graph enables dynamic traversal logic. One Graph offers infinite
traversal possibilities
▪ Get all tokens linked to an entity
▪ Get all tokens linked to the entity's household
▪ Get all tokens linked to an entity that were created after Jan 2020
▪ Get all tokens linked to the entity's household that interacted using App 1
▪ ...
Graph comes with flexibility in traversal
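The "build once, query multiple times" idea can be illustrated with a small sketch. It is not the pipeline's actual traversal code: the token names, the metadata keys (`app`, `created`) and the helper `linked_tokens` are all hypothetical, chosen only to mirror the example queries listed above.

```python
from collections import defaultdict, deque
from datetime import date

# Hypothetical linkages: (token_a, token_b, edge metadata).
LINKAGES = [
    ("login1", "app1", {"app": "App 1", "created": date(2020, 3, 1)}),
    ("login1", "cookie1", {"app": "Online 1", "created": date(2019, 11, 15)}),
    ("app1", "device1", {"app": "App 2", "created": date(2020, 2, 10)}),
]


def build_graph(linkages):
    """Build an undirected adjacency list that keeps the metadata on each edge."""
    graph = defaultdict(list)
    for a, b, meta in linkages:
        graph[a].append((b, meta))
        graph[b].append((a, meta))
    return graph


def linked_tokens(graph, start, edge_ok=lambda meta: True):
    """Return every token reachable from start using only edges that pass edge_ok."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor, meta in graph[node]:
            if neighbor not in seen and edge_ok(meta):
                seen.add(neighbor)
                queue.append(neighbor)
    seen.discard(start)
    return seen


graph = build_graph(LINKAGES)  # built once
all_tokens = linked_tokens(graph, "login1")
recent = linked_tokens(graph, "login1", lambda m: m["created"] >= date(2020, 1, 1))
via_app1 = linked_tokens(graph, "login1", lambda m: m["app"] == "App 1")
```

Each query reuses the same graph and changes only the edge predicate, which is the flexibility the slide refers to.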
How we Built and Scaled
Scale and Performance Objectives
▪ More than 25 billion linkages and identities
▪ New linkages created 24x7
▪ Node and Edge Metadata updated for 60% of existing Linkages
▪ Freshness – Graph updated with linkages and metadata once a day
▪ Could come down to a few hours as a future goal
▪ Ability to run on general purpose Hadoop Infrastructure
Components of Identity Graph
▪ Data Analysis: Understand your data and check for anomalies
▪ Handling Heterogeneous Data Sources: Extract only new and modified linkages in a format needed by the next stage
▪ Core Processing, Stage I – Dedup & Eliminate outliers: Add edge metadata, filter outliers and populate tables needed by the next stage
▪ Core Processing, Stage II – Create Connected Components: Merge Linkages to form an Identity graph for each customer
▪ Core Processing, Stage III – Prepare for Traversal: Demystifies linkages within a cluster and appends metadata information to enable graph traversal
▪ Traversal: Traverse across the cluster as per defined rules to pick only the qualified nodes
Data Analysis
▪ Understanding the data that feeds into Graph pipeline is paramount to
building a usable Graph framework.
▪ Feeding poor-quality linkages results in connected components spanning millions of nodes, which takes a toll on computing resources and business value
▪ Some questions to analyze:
▪ Does the linkage relationship make business sense?
▪ What is an acceptable threshold for poor-quality linkages?
▪ Do we need to apply any filter?
▪ Nature of data – Snapshot vs Incremental
Handling Heterogeneous Data Sources
▪ Data sources grow rapidly in volume and variety
▪ From a handful of manageable data streams to an intimidatingly magnificent Niagara Falls!
▪ Dedicated framework to ingest data in parallel from heterogeneous sources
▪ Serves only new and modified linkages. This is important for incremental processing
▪ Pulls only the desired attributes for further processing: linkages and their metadata in a standard schema
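A minimal sketch of the two ideas on this slide, a standard schema and incremental extraction, is shown below. The source names, field names and the `last_run` watermark are assumptions made for illustration; only the `tid1`/`tid2` column names come from the pipeline slides.

```python
# Hypothetical per-source field mappings into one standard linkage schema.
SOURCE_MAPPINGS = {
    "app_logins": {"src": "login_id", "dst": "app_id", "updated_at": "last_login"},
    "web_sessions": {"src": "cookie_id", "dst": "account_id", "updated_at": "seen_at"},
}


def to_standard(source, record):
    """Project one source record onto the standard (tid1, tid2, metadata) shape."""
    mapping = SOURCE_MAPPINGS[source]
    return {
        "tid1": record[mapping["src"]],
        "tid2": record[mapping["dst"]],
        "updated_at": record[mapping["updated_at"]],
        "source": source,
    }


def extract_incremental(source, records, last_run):
    """Keep only linkages created or modified after the previous run.

    Timestamps are assumed to be ISO-8601 strings, which compare correctly
    as plain strings.
    """
    time_field = SOURCE_MAPPINGS[source]["updated_at"]
    return [to_standard(source, r) for r in records if r[time_field] > last_run]
```

Calling `extract_incremental` once per source (in parallel, if desired) yields rows that all share the same columns, so later stages need no source-specific logic.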
Core Processing – Stage I
▪ Feeds good quality linkages to further
processing.
▪ It handles:
▪ Deduplication
▪ If a linkage is repeated, we consume only the latest record
▪ Outlier elimination
▪ Filters anomalous linkages based on a chosen threshold derived from data analysis
▪ Edge Metadata population
▪ Attributes of the linkage, which help traversal pick out the desired linkages.
Dedup & Eliminate outliers
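The Stage I steps can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the production job: the record layout (`tid1`, `tid2`, `updated_at`) and the `max_cardinality` parameter are hypothetical, and (a, b) and (b, a) are treated as different linkages for simplicity.

```python
from collections import Counter


def stage_one(linkages, max_cardinality):
    """Deduplicate linkages, then drop those touching an over-connected token."""
    # Deduplication: if a linkage is repeated, keep only the latest record.
    latest = {}
    for link in linkages:
        key = (link["tid1"], link["tid2"])
        if key not in latest or link["updated_at"] > latest[key]["updated_at"]:
            latest[key] = link

    # Cardinality: number of distinct linkages each token takes part in.
    degree = Counter()
    for tid1, tid2 in latest:
        degree[tid1] += 1
        degree[tid2] += 1

    # Outlier elimination: filter linkages whose tokens exceed the threshold.
    return [
        link
        for (tid1, tid2), link in latest.items()
        if degree[tid1] <= max_cardinality and degree[tid2] <= max_cardinality
    ]
```

For example, three accounts sharing one cookie with `max_cardinality=2` would all be dropped, while a repeated account-cookie pair collapses to its most recent record.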
Core Processing – Stage II
▪ Merges all related linkages of a customer to create a Connected
Component
Create Connected Components
Core Processing – Stage III
▪ This stage enriches the connected component with linkages between nodes
and edge metadata to enable graph traversal.
Prepare for Traversal
[Figure: the example graph (Login id, App id, Device id, Cookie id with metadata Last login: 4/28/2020, App: YouTube, Country: Canada), followed by a worked example across the three stages]
Stage I – Dedup & Outlier Elimination: linkages A–B, B–C, B–D, D–E and P–N, N–M
Stage II – Create Connected Components: G1 = {A, B, C, D, E}; G2 = {P, N, M}
Stage III – Prepare for Traversal: G1 edges carry metadata m1, m2, m3, m4; G2 edges carry metadata m1, m2
Union Find Shuffle (UFS): Building
Connected Components at Scale
Weighted Union Find with Path Compression
[Figure: worked example of Weighted Union Find with Path Compression on nodes 1, 2, 5, 7, 8 and 9, annotating the top level parent and the size of the cluster at each step. Panel labels: "Height 2 (not Weighted Union)" versus "Height 1 (Weighted Union)", followed by a final "Path Compression" panel]
• Find() – finds the top level parent. If a is the child of b and b is the child of c, then find() determines that c is the top level parent:
a -> b; b -> c => c is top parent
• Path compression() – reduces the height of connected components by linking all children directly to the top level parent.
a -> b; b -> c => a -> b -> c => a -> c; b -> c
• Weighted Union() – unifies top level parents. The parent with fewer children is made the child of the parent with more children. This also helps to reduce the height of connected components.
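The three operations described above map onto a standard single-machine implementation. The sketch below is the textbook weighted (union-by-size) variant with path compression; the class and method names are illustrative, and the example edges are borrowed from the A–E / P–N–M example used earlier in the deck.

```python
class UnionFind:
    """Weighted Union Find with path compression over hashable node ids."""

    def __init__(self):
        self.parent = {}  # node -> parent node (a root is its own parent)
        self.size = {}    # root -> number of nodes in its cluster

    def find(self, x):
        """Return the top level parent of x, compressing the path on the way."""
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: link every node on the path directly to the root.
        while self.parent[x] != root:
            next_node = self.parent[x]
            self.parent[x] = root
            x = next_node
        return root

    def union(self, a, b):
        """Merge the clusters of a and b; the smaller cluster joins the larger."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return root_a
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        self.size[root_a] += self.size[root_b]
        return root_a


uf = UnionFind()
for a, b in [("A", "B"), ("B", "C"), ("B", "D"), ("D", "E"), ("P", "N"), ("N", "M")]:
    uf.union(a, b)
```

After the loop, A through E share one top level parent and P, N, M share another, matching the two connected components G1 and G2.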
Distributed UFS with Path Compression
Divide: Divide the data into partitions
Run UF: Run Weighted Union Find with Path Compression on each partition
Shuffle: Merge locally processed partitions with a global shuffle, iteratively, until all connected components are resolved
Path Compress: Iteratively perform Path Compression for connected components until all connected components are path compressed
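The divide / run / shuffle loop can be simulated on one machine to see why it converges. The sketch below is a deliberate simplification and not the authors' Spark implementation: pairwise merging of partitions stands in for the global shuffle, whose exact keying and termination condition the slides do not spell out. The function names and `num_partitions` are hypothetical.

```python
def local_components(edges):
    """Run weighted union find with path compression on one partition.

    Returns {node: top level parent} for every node seen in the partition.
    """
    parent, size = {}, {}

    def find(x):
        parent.setdefault(x, x)
        size.setdefault(x, 1)
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:  # path compression
            next_node = parent[x]
            parent[x] = root
            x = next_node
        return root

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            if size[root_a] < size[root_b]:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a
            size[root_a] += size[root_b]
    return {node: find(node) for node in list(parent)}


def union_find_shuffle(edges, num_partitions=4):
    """Divide, run union find locally, then merge partitions until one remains."""
    # Divide: split the edge list into partitions.
    partitions = [edges[i::num_partitions] for i in range(num_partitions)]
    while True:
        # Run UF: resolve each partition independently.
        resolved = [local_components(part) for part in partitions]
        if len(resolved) == 1:
            return resolved[0]
        # Shuffle (simplified): re-emit (node, parent) pairs as edges and
        # merge neighboring partitions, halving the partition count.
        star_edges = [list(mapping.items()) for mapping in resolved]
        partitions = [
            star_edges[i] + (star_edges[i + 1] if i + 1 < len(star_edges) else [])
            for i in range(0, len(star_edges), 2)
        ]


roots = union_find_shuffle(
    [("A", "B"), ("B", "C"), ("B", "D"), ("D", "E"), ("P", "N"), ("N", "M")]
)
```

Because each partition re-emits (node, parent) pairs, connectivity is preserved across rounds, and the final mapping is already path compressed: every node points directly at its top level parent.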
Shuffle in UFS
[Figure: example node pairs being shuffled across partitions over successive iterations; each group is marked either "Reached termination condition" or "Proceeds to next iteration"]
Union Find Shuffle using Spark
▪ Sheer scale of data at hand (25+ billion vertices and 30+ billion edges)
▪ Iterative processing with caching and intermittent checkpointing
▪ Limitations with other alternatives
How do we scale?
▪ The input to Union Find Shuffle is bucketed to create 1000 part files of similar size
▪ 10 instances of Union Find execute on 1000 part files with ~30 billion nodes. Each instance of UF is applied to 100 part files.
▪ At any given time, 5 instances of UF run in parallel
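The arithmetic on this slide (1000 part files, 100 per instance, 5 instances at a time) can be written down as a small scheduling sketch. Only those three numbers come from the slide; the hash-based `bucket_for` function, the thread pool and the placeholder `run_union_find` are assumptions for illustration.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_PART_FILES = 1000      # from the slide
FILES_PER_INSTANCE = 100   # from the slide
MAX_PARALLEL = 5           # from the slide


def bucket_for(node_id):
    """Assign a node to a part file with a stable hash (hypothetical scheme)."""
    return zlib.crc32(node_id.encode("utf-8")) % NUM_PART_FILES


def plan_batches(num_files, files_per_instance):
    """Group part-file indices into one batch per Union Find instance."""
    return [
        range(start, min(start + files_per_instance, num_files))
        for start in range(0, num_files, files_per_instance)
    ]


def run_union_find(batch):
    """Placeholder for one Union Find instance; returns how many files it saw."""
    return len(batch)


batches = plan_batches(NUM_PART_FILES, FILES_PER_INSTANCE)
with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
    processed = list(pool.map(run_union_find, batches))
```

With these numbers the plan contains 10 batches, and the pool runs them in two waves of 5.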
Data Quality:
Challenges and Considerations
Noisy Data
[Figure: two noisy patterns. Left: one account (Acc 1) linked to cookie tokens Coo 1, Coo 2, Coo 3, Coo 4, ..., Coo 100. Right: one cookie token (Coo 1) linked to accounts Acc 1, Acc 2, Acc 3, Acc 4, ..., Acc 100]
Graph exposes Noise, opportunities, and fragmentation in data
An Example of anomalous linkage data
▪ For some linkages, we have millions of
entities mapping to the same token (id)
▪ In the data distribution, we see a
majority of tokens mapped to 1-5 entities
▪ We also see a few tokens (potential
outliers) mapped to millions of entities!
Data Distribution
Removal of Anomalous Linkages
▪ Extensive analysis to identify anomalous linkage patterns
▪ A Gaussian Anomaly detection model (Statistical Analysis)
▪ Identify thresholds of linkage cardinality to filter linkages
▪ A lenient threshold will improve coverage at the cost of precision.
Strike a balance between Coverage and Precision
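One way such a cardinality threshold could be derived is sketched below. The slides name a Gaussian anomaly detection model but give no details, so everything here is an assumption: fitting the Gaussian to log10 cardinalities (because the distribution is heavy-tailed), the `num_sigmas` cutoff, and the function names.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def cardinality_threshold(linkages, num_sigmas=3.0):
    """Fit a Gaussian to log10(entities per token) and return the cutoff.

    linkages is an iterable of (entity, token) pairs.
    """
    entities_per_token = Counter(token for _, token in set(linkages))
    logs = [math.log10(count) for count in entities_per_token.values()]
    mu, sigma = mean(logs), pstdev(logs)
    return 10 ** (mu + num_sigmas * sigma)


def filter_linkages(linkages, threshold):
    """Drop every linkage whose token maps to more entities than the threshold."""
    unique = set(linkages)
    entities_per_token = Counter(token for _, token in unique)
    return [(e, t) for e, t in unique if entities_per_token[t] <= threshold]
```

Raising `num_sigmas` makes the threshold more lenient, which keeps more linkages (coverage) at the risk of admitting noisy ones (precision).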
Metrics compared per threshold: # Big Clusters, % match of Facets, 1 / # of distinct Entities
Threshold 10 – High Precision, Low Coverage (1 big cluster)
• Majority of connected components are not big clusters, so the % of distinct entities outside the big cluster(s) will be high
• More linkages are filtered out as dirty linkages, so the % match of facets will suffer
Threshold 1000 – Low Precision, High Coverage (4 big clusters)
• More connected components form big clusters, so the % of distinct entities outside the big clusters will be lower
• Only a few linkages are filtered out as dirty linkages, so the % match of facets will be high
Large Connected Components (LCC)
▪ Size ranging from 10K to 100M+
▪ Combination of Hubs and Long Chains
▪ Token collisions, noise in data, bot traffic
▪ Legitimate users also belong to LCCs
▪ Large number of shuffles in UFS
A result of lenient threshold
Traversing LCC
▪ Business demands both Precision and Coverage, hence LCC needs
traversal
▪ Iterative Spark BFS implementation is used to traverse LCC
▪ Traversal is supported up to a certain pre-defined depth
▪ Going beyond a certain depth not only strains the system but also adds no business value
▪ Traversal is optimized to run using Spark in 20-30 minutes over all
connected components
Solution to get both Precision and Coverage
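The depth-bounded, iterative style of BFS can be sketched without Spark. The version below expands one frontier per iteration, which is the shape an iterative Spark job would take (one join per level), but it is a plain-Python illustration; the function name and `max_depth` parameter are hypothetical.

```python
def bfs_within_depth(edges, start, max_depth):
    """Level-by-level BFS that stops after max_depth hops.

    Returns {node: depth} for every node reached within the depth limit.
    """
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    depth_of = {start: 0}
    frontier = {start}
    for depth in range(1, max_depth + 1):
        # Expand the whole frontier at once, skipping nodes already visited.
        frontier = {
            neighbor
            for node in frontier
            for neighbor in adjacency.get(node, ())
            if neighbor not in depth_of
        }
        if not frontier:
            break
        for node in frontier:
            depth_of[node] = depth
    return depth_of


chain = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
reached = bfs_within_depth(chain, "A", max_depth=2)
```

On the long chain above, a depth limit of 2 reaches only B and C from A; D and E are never expanded, which is how a depth cap keeps long chains inside an LCC from straining the traversal.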
Data Volume & Runtime
[Figure: Graph Pipeline data flow with per-stage runtimes]
Input: 15 upstream tables; 25B+ Raw Linkages & 30B+ Nodes
Stages: Handling Heterogeneous Linkages -> Stage I -> Stage II -> Stage III - LCC / Stage III - SCC -> Traversal
Labeled runtimes: Table extraction & transformation 30 mins; UnionFindShuffle 8-10 hrs; Subgraph creation 4-5 hrs
Other runtimes shown on the figure: 1-2 hrs, 1-2 hrs, 20-30 mins, 2.5 hrs, 30 mins, 20-30 mins

Upstream rows such as (a, b, tid1, tid2, c) and (p, q, tid6, tid9, r) are reduced to:
tid1 | tid2 | Linkage metadata

Token ids are mapped to long ids and then to graph ids:
tid1 -> tid1_long; tid6 -> tid6_long
tid1_long | tid2_long | Linkage metadata
tid1_long -> tgid1; tid6_long -> tgid120
tid1 -> tgid1; tid6 -> tgid120

tgid Linkages with Metadata:
tgid1 {tgid: 1, tid: [aid,bid], edges:[srcid, destid,metadata],[] }

tgid | tid | Linkages (adj_list)
3 | A1 | [C1:m1, B1:m2, B2:m3]
2 | C1 | [A1:m1]
2 | A2 | [B2:m2]
3 | B1 | [A1:m2]
3 | B2 | [A1:m3]

tid | Linkages
A1 | [C1:m1, B1:m2]
A2 | [B2:m2]
C1 | [A1:m1]
B1 | [A1:m2]

Traversal request: Give all aid-bid linkages which go via cid
Traversal request: Give all A–B linkages where criteria = m1,m2 (startnode=A, endnode=B, criteria=m1,m2)
Traversal request on LCC: Filter tids on m1,m2 -> Select count(*) by tgid -> MR on filtered tgid partitions (1 map per tgid traversal) -> Dump LCC table (> 5k CC)
For each tid do a bfs (unidirected/bidirected)

Result:
tid | Linkage
A1 | B1
A2 | B2
A peek into Real time Graph
▪ Linkages within streaming datasets
▪ New linkages require updating the graph in real time.
▪ Concurrency – Concurrent updates to graphs need to be handled to avoid deadlocks, starvation, etc.
▪ Scale
▪ High-volume - e.g., Clickstream data - As users browse the webpage/app, new events get generated
▪ Replication and Consistency – Making sure that the data is properly replicated for fault-tolerance, and is consistent for queries
▪ Real-time Querying and Traversals
▪ High throughput traversing and querying capability on tokens belonging to the same customer
Real time Graph: Challenges
Questions?

 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Building Identity Graphs over Heterogeneous Data

  • 2. Building Identity Graphs over Heterogeneous Data Sudha Viswanathan Saigopal Thota
  • 4. Agenda ▪ Identities at Scale ▪ Why Graph ▪ How we Built and Scaled it ▪ Why an In-house solution ▪ Challenges and Considerations ▪ A peek into Real time Graph
  • 6. Identity Tokens ▪ Account Ids ▪ Cookies ▪ Online Ids ▪ Device Ids [diagram sources: Online 1, Online 2, ..., App 1, App 2, App 3, Partner Apps]
  • 7. Identity Resolution ▪ Aims to provide a coherent view of a customer and/or a household by unifying all customer identities across channels and subsidiaries ▪ Provides a single notion of a customer
  • 9. Identities, Linkages & Metadata – An Example [diagram: identities (Device id, App id, Login id, Cookie id) joined by linkages, with metadata such as Last login: 4/28/2020, App: YouTube, Country: Canada]
  • 10. Graph – An Example [diagram: Login id, App id, Device id and Cookie id nodes with metadata Last login: 4/28/2020, App: YouTube, Country: Canada] Connect all linkages to create a single connected component per user/household
  • 11. Graph Traversal: scalable and offers better coverage ▪ A graph is an efficient data structure relative to table joins ▪ Why don't table joins work? ▪ Linkages are on the order of millions of rows spanning hundreds of tables ▪ Table joins are index-based and computationally very expensive ▪ Table joins result in lower coverage
  • 12. Build once – Query multiple times: the graph comes with flexibility in traversal ▪ A graph enables dynamic traversal logic. One graph offers infinite traversal possibilities ▪ Get all tokens linked to an entity ▪ Get all tokens linked to the entity's household ▪ Get all tokens linked to an entity that were created after Jan 2020 ▪ Get all tokens linked to the entity's household that interacted using App 1 ▪ ...
  • 13. How we Built and Scaled
  • 14. Scale and Performance Objectives ▪ More than 25 billion linkages and identities ▪ New linkages created 24x7 ▪ Node and edge metadata updated for 60% of existing linkages ▪ Freshness – graph updated with linkages and metadata once a day ▪ Could be every few hours as a future goal ▪ Ability to run on general-purpose Hadoop infrastructure
  • 15. Components of Identity Graph ▪ Data Analysis: Understand your data and check for anomalies ▪ Handling Heterogeneous Data Sources: Extract only new and modified linkages in the format needed by the next stage ▪ Core Processing, Stage I – Dedup & Eliminate outliers: Add edge metadata, filter outliers and populate the tables needed by the next stage ▪ Core Processing, Stage II – Create Connected Components: Merge linkages to form an identity graph for each customer ▪ Core Processing, Stage III – Prepare for Traversal: Make the linkages within a cluster explicit and append metadata to enable graph traversal ▪ Traversal: Traverse the cluster according to defined rules to pick only the qualified nodes
  • 16. Data Analysis ▪ Understanding the data that feeds into the graph pipeline is paramount to building a usable graph framework. ▪ Feeding in poor-quality linkages results in connected components spanning millions of nodes, taking a toll on computing resources and business value ▪ Some questions to analyze: ▪ Does the linkage relationship make business sense? ▪ What is an acceptable threshold for poor-quality linkages? ▪ Do we need to apply any filter? ▪ Nature of data – Snapshot vs Incremental
  • 17. Handling Heterogeneous Data Sources ▪ Data sources grow rapidly in volume and variety ▪ From a handful of manageable data streams to an intimidatingly magnificent Niagara Falls! ▪ Dedicated framework to ingest data in parallel from heterogeneous sources ▪ Serves only new and modified linkages. This is important for incremental processing ▪ Pulls only the desired attributes for further processing – linkages and their metadata in a standard schema
  • 18. Core Processing – Stage I: Dedup & Eliminate outliers ▪ Feeds good-quality linkages to further processing. ▪ It handles: ▪ Deduplication ▪ If a linkage is repeated, we consume only the latest record ▪ Outlier elimination ▪ Filters anomalous linkages based on a chosen threshold derived from data analysis ▪ Edge metadata population ▪ Edge metadata holds the attributes of the linkage. It helps in traversing the graph to get the desired linkages.
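
The Stage I steps above can be sketched in plain Python. This is a minimal illustration, not the pipeline's actual code: the tuple layout `(src, dst, ts, meta)`, the function names `dedup_latest` and `drop_outliers`, and the `max_degree` parameter are all assumptions made for the example.

```python
from collections import defaultdict


def dedup_latest(linkages):
    """Keep only the most recent record for each (src, dst) linkage.

    Each linkage is a hypothetical tuple (src, dst, ts, meta), where ts is
    any comparable timestamp and meta is the edge metadata.
    """
    latest = {}
    for src, dst, ts, meta in linkages:
        key = (src, dst)
        if key not in latest or ts > latest[key][2]:
            latest[key] = (src, dst, ts, meta)
    return list(latest.values())


def drop_outliers(linkages, max_degree):
    """Drop linkages that touch a token linked to more than max_degree tokens.

    max_degree stands in for the threshold derived from data analysis.
    """
    neighbors = defaultdict(set)
    for src, dst, _, _ in linkages:
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    return [
        link for link in linkages
        if len(neighbors[link[0]]) <= max_degree
        and len(neighbors[link[1]]) <= max_degree
    ]
```

Running `dedup_latest` first and `drop_outliers` second mirrors the order on the slide; the edge metadata simply rides along in the last tuple field.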
  • 19. Core Processing – Stage II: Create Connected Components ▪ Merges all related linkages of a customer to create a connected component
  • 20. Core Processing – Stage III: Prepare for Traversal ▪ This stage enriches the connected component with the linkages between nodes and their edge metadata to enable graph traversal. [diagram: Login id, App id, Device id and Cookie id nodes, drawn once as bare nodes and once with metadata Last login: 4/28/2020, App: YouTube, Country: Canada]
  • 21. [diagram: example linkages flowing through Stage I – Dedup & Outlier Elimination, Stage II – Create Connected Components (components G1 and G2), and Stage III – Prepare for Traversal (edges annotated with metadata m1–m4)]
  • 22. Union Find Shuffle (UFS): Building Connected Components at Scale
  • 23. Weighted Union Find with Path Compression [diagram: worked example on nodes 1, 2, 5, 7, 8, 9 tracking the top-level parent and cluster size after each union; weighted union keeps the tree at height 1 where an unweighted union would give height 2, and path compression links the nodes directly to the top-level parent 2]
  • 24. ▪ Find() – finds the top-level parent. If a is the child of b and b is the child of c, then find() determines that c is the top-level parent: a -> b; b -> c => c is the top parent ▪ Path compression() – reduces the height of connected components by linking all children directly to the top-level parent: a -> b; b -> c (i.e. a -> b -> c) becomes a -> c; b -> c ▪ Weighted Union() – unifies top-level parents. The parent with fewer children is made the child of the parent with more children. This also helps to reduce the height of connected components.
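
The three operations above correspond to the textbook weighted union-find with path compression. The sketch below is a generic single-machine version, not the speakers' code; the class name `UnionFind` and the dictionary-based storage are choices made for this example.

```python
class UnionFind:
    """Weighted union-find with path compression over hashable node ids."""

    def __init__(self):
        self.parent = {}  # node -> parent node (a root points to itself)
        self.size = {}    # root -> number of nodes in its cluster

    def find(self, x):
        """Return the top-level parent of x, compressing the path on the way."""
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        """Merge the clusters of a and b; the smaller cluster joins the larger."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return ra


uf = UnionFind()
for a, b in [(2, 5), (2, 9), (7, 8), (5, 7), (1, 8)]:
    uf.union(a, b)
```

After these five unions every node resolves to the same top-level parent, and `uf.size` for that parent counts all six nodes.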
  • 25. Distributed UFS with Path Compression ▪ Divide: Divide the data into partitions ▪ Run UF: Run Weighted Union Find with Path Compression on each partition ▪ Shuffle: Merge locally processed partitions with a global shuffle, iteratively, until all connected components are resolved ▪ Path Compress: Iteratively perform path compression for connected components until all connected components are path compressed
  • 26. Shuffle in UFS [diagram: node pairs shuffled across partitions; pairs that have reached the termination condition stop, the rest proceed to the next iteration]
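
The divide / run UF / shuffle loop can be imitated on one machine to see why it converges. The sketch below is a simplification under stated assumptions, not the Spark implementation from the talk: partitions are plain Python lists, the "shuffle" merges partitions pairwise until one is left, and the local union step uses the smallest node id as the root instead of weighting by size. The names `local_components` and `union_find_shuffle` are hypothetical.

```python
def local_components(edges):
    """Run union-find on one partition; return star edges (node, root)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Assumption for the sketch: the smaller id becomes the root.
            parent[max(ra, rb)] = min(ra, rb)
    return [(node, find(node)) for node in list(parent)]


def union_find_shuffle(edges, num_partitions=4):
    """Resolve connected components by repeated local UF plus merging."""
    # Divide: spread the edges over the partitions.
    parts = [edges[i::num_partitions] for i in range(num_partitions)]
    while True:
        # Run UF: each partition is reduced to (node, local root) pairs.
        parts = [local_components(p) for p in parts]
        if len(parts) == 1:
            break
        # Shuffle (simplified): merge neighbouring partitions pairwise.
        parts = [sum(parts[i:i + 2], []) for i in range(0, len(parts), 2)]
    return dict(parts[0])


components = union_find_shuffle(
    [(1, 2), (3, 4), (2, 3), (5, 6), (7, 5), (8, 9)]
)
```

Each round replaces a partition's edges with star edges to the local root, which preserves connectivity while shrinking the data that has to move in the next round.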
  • 27. Union Find Shuffle using Spark ▪ Sheer scale of data at hand (25+ billion vertices & 30+ billion edges) ▪ Iterative processing with caching and intermittent checkpointing ▪ Limitations with other alternatives
  • 28. How do we scale? ▪ The input to Union Find Shuffle is bucketed to create 1000 part files of similar size ▪ 10 instances of Union Find execute on the 1000 part files with ~30 billion nodes. Each instance of UF is applied to 100 part files. ▪ At any given time, we have 5 instances of UF running in parallel
  • 30. Noisy Data [diagram: one account (Acc 1) linked to cookie tokens Coo 1 ... Coo 100, and one cookie token (Coo 1) linked to accounts Acc 1 ... Acc 100] The graph exposes noise, opportunities, and fragmentation in data
  • 31. An Example of anomalous linkage data ▪ For some linkages, we have millions of entities mapping to the same token (id) ▪ In the data distribution, we see a majority of tokens mapped to 1-5 entities ▪ We also see a few tokens (potential outliers) mapped to millions of entities! [chart: Data Distribution]
  • 33. Removal of Anomalous Linkages: strike a balance between Coverage and Precision ▪ Extensive analysis to identify anomalous linkage patterns ▪ A Gaussian anomaly detection model (statistical analysis) ▪ Identify thresholds of linkage cardinality to filter linkages ▪ A lenient threshold will improve coverage at the cost of precision.
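
One way to turn a Gaussian model into a cardinality threshold is sketched below. The slides do not say how the model was fitted, so everything here is an assumption for illustration: the fit on log-cardinalities, the cut-off at `k` standard deviations above the mean, and the function name `cardinality_threshold`.

```python
import math
import statistics


def cardinality_threshold(cardinalities, k=3.0):
    """Return a linkage-cardinality cut-off from a Gaussian fit.

    cardinalities: number of entities mapped to each token (all >= 1).
    The fit is done on log-cardinalities (an assumption), and the cut-off
    sits k standard deviations above the mean.
    """
    logs = [math.log(c) for c in cardinalities]
    mu = statistics.fmean(logs)
    sigma = statistics.pstdev(logs)
    return math.exp(mu + k * sigma)


# Mostly tokens mapped to 1-5 entities, plus one token mapped to a million.
sample = [1, 2, 3, 2, 1, 4, 5, 2, 3, 1] * 100 + [1_000_000]
threshold = cardinality_threshold(sample)
kept = [c for c in sample if c <= threshold]
```

Raising `k` makes the threshold more lenient, which trades precision for coverage in the way the slide describes.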
  • 34. [table columns: Threshold, # Big Clusters, % match of Facets, 1 / # of distinct Entities] ▪ Threshold 10 – High Precision, Low Coverage (1 big cluster) ▪ Most connected components are not big clusters, so the % of distinct entities outside the big cluster(s) will be high ▪ More linkages are filtered out as dirty linkages, so the % match of facets will suffer ▪ Threshold 1000 – Low Precision, High Coverage (4 big clusters) ▪ More connected components form big clusters, so the % of distinct entities outside the big clusters will be lower ▪ Only a few linkages are filtered out as dirty linkages, so the % match of facets will be high
  • 35. Large Connected Components (LCC): a result of a lenient threshold ▪ Size ranging from 10k to 100M+ ▪ Combination of hubs and long chains ▪ Token collisions, noise in data, bot traffic ▪ Legitimate users belong to LCCs ▪ Large number of shuffles in UFS
  • 36. Traversing LCC: a solution to get both Precision and Coverage ▪ Business demands both precision and coverage, hence LCCs need traversal ▪ An iterative Spark BFS implementation is used to traverse LCCs ▪ Traversal is supported up to a certain pre-defined depth ▪ Going beyond a certain depth not only strains the system but also adds no business value ▪ Traversal is optimized to run using Spark in 20-30 minutes over all connected components
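
A depth-limited BFS with an edge-metadata filter is sketched below on a single machine. It only illustrates the traversal idea; the talk's version is an iterative Spark job. The adjacency-list layout, the function name `bfs_limited`, and the `edge_ok` predicate are assumptions for this example.

```python
from collections import deque


def bfs_limited(adj, start, max_depth, edge_ok=lambda meta: True):
    """Return {node: depth} for nodes reachable from start within max_depth.

    adj maps a node to a list of (neighbor, edge_metadata) pairs.
    edge_ok decides whether an edge qualifies for traversal.
    """
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        depth = seen[node]
        if depth == max_depth:
            continue  # do not expand beyond the pre-defined depth
        for neighbor, meta in adj.get(node, []):
            if neighbor not in seen and edge_ok(meta):
                seen[neighbor] = depth + 1
                queue.append(neighbor)
    return seen


adj = {
    "A1": [("C1", "m1"), ("B1", "m2"), ("B2", "m3")],
    "C1": [("A1", "m1")],
    "B1": [("A1", "m2")],
    "B2": [("A1", "m3"), ("A2", "m2")],
    "A2": [("B2", "m2")],
}
one_hop = bfs_limited(adj, "A1", max_depth=1)
filtered = bfs_limited(adj, "A1", 2, edge_ok=lambda m: m in {"m1", "m2"})
```

Capping `max_depth` bounds the work per start node, which is what keeps traversal of very large components tractable.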
  • 37. Data Volume & Runtime
  • 38. Graph Pipeline [pipeline diagram; the "Powered by" logo was not captured] ▪ Input: 15 upstream tables, 25B+ raw linkages & 30B+ nodes, as rows of (tid1, tid2, linkage metadata) ▪ Steps shown: Table extraction & transformation, Handling Heterogeneous Linkages, Stage I, Stage II, Stage III - LCC, Stage III - SCC ▪ Labelled runtimes: Table extraction & transformation 30 mins; UnionFindShuffle 8-10 hrs; Subgraph creation 4-5 hrs ▪ Other runtimes shown without a recoverable label: 1-2 hrs, 1-2 hrs, 20-30 mins, 2.5 hrs, 30 mins, 20-30 mins ▪ Intermediate tables: tid to long id (tid1 -> tid1_long, tid6 -> tid6_long), tid to graph id (tid1 -> tgid1, tid6 -> tgid120), linkages with metadata per tgid, e.g. {tgid: 1, tid: [aid,bid], edges:[srcid, destid,metadata],[] } ▪ Example adjacency-list rows (tgid, tid, linkages): (3, A1, [C1:m1, B1:m2, B2:m3]); (2, C1, [A1:m1]); (2, A2, [B2:m2]); (3, B1, [A1:m2]); (3, B2, [A1:m3]) ▪ Traversal request: give all aid-bid linkages which go via cid ▪ Traversal request on LCC: give all A–B linkages where criteria = m1, m2 (startnode=A, endnode=B, criteria=m1,m2) ▪ Steps shown: filter tids on m1, m2; select count(*) by tgid; dump LCC table (> 5k CC); MR on filtered tgid partitions, 1 map per tgid traversal; for each tid do a BFS (unidirected/bidirected)
  • 39. A peek into Real time Graph
  • 40. Real time Graph: Challenges ▪ Linkages within streaming datasets ▪ New linkages require updating the graph in real time ▪ Concurrency – concurrent updates to the graph need to be handled to avoid deadlocks, starvation, etc. ▪ Scale ▪ High volume, e.g., clickstream data: as users browse the webpage/app, new events are generated ▪ Replication and Consistency – making sure that the data is properly replicated for fault tolerance and is consistent for queries ▪ Real-time Querying and Traversals ▪ High-throughput traversal and querying capability on tokens belonging to the same customer