A Primer on Entity Resolution
Workshop Objectives
● Introduce entity resolution theory and tasks
● Similarity scores and similarity vectors
● Pairwise matching with the Fellegi Sunter algorithm
● Clustering and Blocking for deduplication
● Final notes on entity resolution
Entity Resolution Theory
Entity Resolution refers to techniques that
identify, group, and link digital mentions or
manifestations of some object in the real world.
In the Data Science Pipeline, ER is generally a wrangling technique.
[Figure: the data science pipeline. Labels: Interaction, Data, Storage, Computation; Ingestion, Wrangling, Normalization, Computational Data Store, Feature Analysis, Model Builds, Cross Validation, Model Selection & Monitoring, API, Feedback.]
- Creation of high quality data sets
- Reduction in the number of instances in
machine learning models
- Reduction in the amount of covariance and
therefore collinearity of predictor variables.
- Simplification of relationships
Information Quality
Graph Analysis Simplification and Connection
[Figure: a graph connecting the email addresses ben@ddl.com, ben@gmail.com, tony@ddl.com, tony@gmail.com, selma@gmail.com, allen@acme.com, and rebecca@acme.com to the entities Ben, Tony, Selma, Allen, and Rebecca.]
- Heterogeneous data: unstructured records
- Larger and more varied datasets
- Multi-domain and multi-relational data
- Varied applications (web and mobile)
Parallel, Probabilistic Methods Required*
* Although this is often debated in various related domains.
Machine Learning and ER
Entity Resolution Tasks
Deduplication
Record Linkage
Canonicalization
Referencing
Deduplication
- Primary consideration in ER
- Cluster records that correspond to the same real world entity, normalizing a schema
- Reduces number of records in the dataset
- Variant: compute cluster representative
Record Linkage
- Match records from one deduplicated data
store to another (bipartite)
- K-partite linkage links records in multiple
data stores and their various associations
- Generally proposed in relational data stores,
but more frequently applied to unstructured
records from various sources.
Referencing
- Known as entity disambiguation
- Match noisy records to a clean, deduplicated
reference table that is already canonicalized
- Generally used to atomize multiple records
to some primary key and donate extra
information to the record
Canonicalization
- Compute representative
- Generally the “most complete” record
- Imputation of missing attributes via merging
- Attribute selection based on the most likely
candidate for downstream matching
Notation
- R: set of records
- M: set of matches
- N: set of non-matches
- E: set of entities
- L: set of links
Compare $(M_t, N_t, E_t, L_t) \Leftrightarrow (M_p, N_p, E_p, L_p)$
- t = true, p = predicted
Key Assumptions
- Every entity refers to a real world object (i.e. there are no “fake” instances)
- References or sources (for record linkage) include no duplicates (integrity constraints)
- If two records are identical, they are true matches: $(x, y) \in M_t$
Tools for Entity Resolution
- NLTK: natural language toolkit
- Dedupe*: structured deduplication
- Distance: C-implemented distance metrics
- Scikit-Learn: machine learning models
- Fuzzywuzzy: fuzzy string matching
- PyBloom: probabilistic set matching
Similarity
At the heart of any entity resolution task is
the computation of similarity or distance.
For two records, x and y, compute a similarity
vector for each component attribute:
[match_score(attr_x, attr_y) for attr_x, attr_y in zip(x, y)]
Where match_score is a per-attribute function that
computes either a boolean (match, not match) or real
valued distance score.
match_score ∈ [0,1]*
x = {
'id': 'b0000c7fpt',
'title': 'reel deal casino shuffle master edition',
'description': 'reel deal casino shuffle master edition is ...',
'manufacturer': 'phantom efx',
'price': 19.99,
'source': 'amazon',
}
y = {
'id': '17175991674191849246',
'name': 'phantom efx reel deal casino shuffle master edition',
'description': 'reel deal casino shuffle master ed. is ...',
'manufacturer': None,
'price': 17.24,
'source': 'google',
}
# similarity vector is a match score of:
# [name_score, description_score, manufacturer_score, price_score]
# Boolean Match
similarity(x,y) == [0, 1, 0, 0]
# Real Valued Match
similarity(x,y) == [0.83, 1.0, 0, 2.75]
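A minimal sketch of how such a vector might be computed, assuming the two schemas have already been aligned (title → name). The helpers `string_score`, `price_score`, and `similarity` are hypothetical names, and `difflib.SequenceMatcher` from the standard library stands in for whichever string metric is actually used, so the string scores will not reproduce the values quoted above.

```python
from difflib import SequenceMatcher

def string_score(a, b):
    """Similarity ratio in [0, 1]; 0.0 when either value is missing."""
    if a is None or b is None:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()

def price_score(a, b):
    """Absolute price difference: a distance, so not bounded by 1."""
    if a is None or b is None:
        return None  # assumption: leave missing prices unscored
    return round(abs(a - b), 2)

def similarity(x, y):
    """[name_score, description_score, manufacturer_score, price_score]"""
    return [
        string_score(x.get("title"), y.get("name")),
        string_score(x.get("description"), y.get("description")),
        string_score(x.get("manufacturer"), y.get("manufacturer")),
        price_score(x.get("price"), y.get("price")),
    ]

x = {
    "title": "reel deal casino shuffle master edition",
    "description": "reel deal casino shuffle master edition is ...",
    "manufacturer": "phantom efx",
    "price": 19.99,
}
y = {
    "name": "phantom efx reel deal casino shuffle master edition",
    "description": "reel deal casino shuffle master ed. is ...",
    "manufacturer": None,
    "price": 17.24,
}
vector = similarity(x, y)
```

The missing manufacturer scores 0.0 here, which is only one of several possible conventions for null attributes.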
Match Scores Reference
String Matching
- Edit Distance: Levenshtein, Smith-Waterman, Affine
- Alignment: Jaro-Winkler, Soft-TFIDF, Monge-Elkan
- Phonetic: Soundex, Translation
Distance Metrics
- Euclidean, Manhattan, Minkowski
- Text Analytics: Jaccard, TFIDF, Cosine similarity
Relational Matching
- Set Based: Dice, Tanimoto (Jaccard), Common Neighbors, Adar Weighted
- Aggregates: Average values, Max/Min values, Medians, Frequency (Mode)
Other Matching
- Numeric distance
- Boolean equality
- Fuzzy matching
- Domain specific
- Gazetteers: Lexical matching, Named Entities (NER)
Fellegi Sunter
Pairwise Matching:
Given a vector of attribute match scores for a pair of records (x, y), compute P_match(x, y).
Weighted Sum + Threshold
P_match = sum(weight * score for weight, score in zip(weights, vector))
- weights should sum to one
- determine weight for each attribute match score
- higher weights for more predictive features
- e.g. email more predictive than username
- attribute value also contributes to predictability
- If weighted score > threshold then match.
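The weighted sum and threshold can be sketched as below. The function name `weighted_match`, the example scores, the weights, and the 0.7 threshold are all illustrative assumptions, not values from the workshop.

```python
def weighted_match(vector, weights, threshold):
    """Return (p_match, is_match) for one similarity vector."""
    if len(vector) != len(weights):
        raise ValueError("one weight is required per match score")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights should sum to one")
    p_match = sum(w * s for w, s in zip(weights, vector))
    return p_match, p_match > threshold

# Hypothetical scores for (name, description, manufacturer).
scores = [0.87, 0.95, 0.0]
# Assumed weights: name is treated as the most predictive attribute.
weights = [0.5, 0.3, 0.2]

p_match, is_match = weighted_match(scores, weights, threshold=0.7)
```

Every score is assumed to lie in [0, 1], so an unbounded distance such as a raw price difference would need rescaling before it is weighted.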
Rule Based Approach
- Formulate rules about the construction of a
match for attribute collections.
if score_name > 0.75 && score_price > 0.6
- Although formulating rules is hard, domain
specific rules can be applied, making this a
typical approach for many applications.
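As a rough illustration, the rule above could be expressed as one entry in a list of predicates over per-attribute scores. The `RULES` list and `rule_match` function are hypothetical names, and the scores are assumed to be similarities in [0, 1].

```python
# Each rule takes a dict of attribute name -> match score and returns a bool.
RULES = [
    lambda s: s["name"] > 0.75 and s["price"] > 0.6,
]

def rule_match(scores, rules=RULES):
    """A pair is a match when any rule fires."""
    return any(rule(scores) for rule in rules)

likely_match = rule_match({"name": 0.8, "price": 0.7})
likely_miss = rule_match({"name": 0.8, "price": 0.5})
```

Keeping the rules in a list makes it easy to add further domain specific rules without touching the matching function.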
Modern record linkage theory was formalized in 1969
by Ivan Fellegi and Alan Sunter who proved that the
probabilistic decision rule they described was optimal
when the comparison attributes were conditionally
independent.
Their pioneering work “A Theory For Record Linkage”
remains the mathematical foundation for many
record linkage applications even today.
Fellegi, Ivan P., and Alan B. Sunter. "A theory for record linkage." Journal
of the American Statistical Association 64.328 (1969): 1183-1210.
Record Linkage Model
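For two record sets, $A$ and $B$, consider the pairs $A \times B = \{(a, b) : a \in A, b \in B\}$ and a record pair $r = (a, b)$.
$\gamma(r) = (\gamma_1(r), \ldots, \gamma_K(r))$ is the similarity vector, where each $\gamma_i$ is some match score function for the record set.
$M$ is the match set and $U$ the non-match set.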
Record Linkage Model
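Probabilistic linkage is based on the likelihood ratio:
$R(r) = \frac{m(\gamma)}{u(\gamma)} = \frac{P(\gamma \mid r \in M)}{P(\gamma \mid r \in U)}$
Linkage Rule $L(t_l, t_u)$, with lower and upper thresholds $t_l < t_u$:
- Match if $R(r) \geq t_u$
- Uncertain if $t_l < R(r) < t_u$
- Non-Match if $R(r) \leq t_l$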
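The three-way decision of a linkage rule can be sketched as a small function. The name `classify` and the example threshold values are assumptions for illustration; whether a ratio exactly on a threshold counts as decided is also a choice made here.

```python
def classify(ratio, t_lower, t_upper):
    """Map a likelihood ratio R(r) to match, uncertain, or non-match."""
    if t_lower > t_upper:
        raise ValueError("the lower threshold must not exceed the upper one")
    if ratio >= t_upper:
        return "match"
    if ratio <= t_lower:
        return "non-match"
    return "uncertain"

decisions = [classify(r, t_lower=0.5, t_upper=5.0) for r in (10.0, 1.0, 0.1)]
```

Pairs that land in the uncertain band are the natural candidates for manual review.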
Linkage Rule Error
- Type I Error: a non-match is called a match.
- Type II Error: match is called a non-match
Optimizing a Linkage Rule
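$L^*(t_l^*, t_u^*)$ is optimal in $\Gamma$ (similarity vector space) with error bounds $\mu$ and $\lambda$ if:
- $L^*$ bounds type I and II errors: $P(\text{match} \mid r \in U) \leq \mu$ and $P(\text{non-match} \mid r \in M) \leq \lambda$
- $L^*$ has the least conditional probability of not making a decision, i.e. it minimizes the uncertainty range in $R(r)$.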
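L* Discovery
Given $N$ records in $\Gamma$ (e.g. $N$ similarity vectors):
Sort the records in decreasing order of $R(r) = m(\gamma) / u(\gamma)$.
Select $n$ and $n'$ such that:
$\mu = \sum_{i=1}^{n} u(\gamma_i), \quad \lambda = \sum_{i=n'}^{N} m(\gamma_i)$
Records $1, \ldots, n$ are matches, records $n+1, \ldots, n'-1$ are uncertain, and records $n', \ldots, N$ are non-matches.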
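A sketch of the discovery procedure, under the assumption (following the Fellegi and Sunter paper) that the match region grows from the top of the ranking while the accumulated u probability stays within the type I bound, and the non-match region grows from the bottom while the accumulated m probability stays within the type II bound. The names `discover_thresholds`, `mu`, and `lam`, and the toy probabilities, are hypothetical.

```python
def discover_thresholds(patterns, mu, lam):
    """patterns: list of (m, u) probabilities, one per comparison pattern.

    Returns (ranked, n, n_prime) where ranked[:n] are matches,
    ranked[n:n_prime] are uncertain, and ranked[n_prime:] are non-matches.
    """
    # Sort by the likelihood ratio m/u, largest first (u must be non-zero).
    ranked = sorted(patterns, key=lambda p: p[0] / p[1], reverse=True)

    # Grow the match region while the type I error stays within mu.
    n, type_one = 0, 0.0
    for m, u in ranked:
        if type_one + u > mu:
            break
        type_one += u
        n += 1

    # Grow the non-match region while the type II error stays within lam.
    n_prime, type_two = len(ranked), 0.0
    for m, u in reversed(ranked[n:]):
        if type_two + m > lam:
            break
        type_two += m
        n_prime -= 1

    return ranked, n, n_prime

# Toy (m, u) values for five comparison patterns.
patterns = [(0.6, 0.01), (0.25, 0.04), (0.1, 0.15), (0.04, 0.3), (0.01, 0.5)]
ranked, n, n_prime = discover_thresholds(patterns, mu=0.06, lam=0.06)
```

The returned indices are 0-based slice bounds, so the first non-match in 1-based notation is `n_prime + 1`.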
Practical Application of FS
$\Gamma$ is high dimensional: $m(\gamma)$ and $u(\gamma)$ computations are inefficient.
Typically a naive Bayes assumption is made about the conditional independence of features in $\gamma$ given a match or a non-match.
Computing $P(\gamma \mid r \in M)$ requires knowledge of matches.
- Supervised machine learning with a training set.
- Expectation Maximization (EM) to train parameters.
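For the supervised case, a minimal sketch of the naive Bayes approach might look like the following. It assumes boolean (agree or disagree) match scores and a small labeled training set; the function names, the Laplace smoothing, and the toy data are illustrative assumptions. An EM variant would estimate the same parameters without labels.

```python
import math

def estimate_probabilities(vectors, labels, smoothing=1.0):
    """Estimate m_i = P(agree_i | match) and u_i = P(agree_i | non-match)."""
    k = len(vectors[0])
    agree = {True: [0.0] * k, False: [0.0] * k}
    total = {True: 0, False: 0}
    for vector, is_match in zip(vectors, labels):
        total[is_match] += 1
        for i, value in enumerate(vector):
            agree[is_match][i] += value
    # Laplace smoothing keeps every probability strictly between 0 and 1.
    m = [(agree[True][i] + smoothing) / (total[True] + 2 * smoothing)
         for i in range(k)]
    u = [(agree[False][i] + smoothing) / (total[False] + 2 * smoothing)
         for i in range(k)]
    return m, u

def log_likelihood_ratio(vector, m, u):
    """log R(r) under conditional independence: a sum of attribute weights."""
    total = 0.0
    for agrees, m_i, u_i in zip(vector, m, u):
        if agrees:
            total += math.log(m_i / u_i)
        else:
            total += math.log((1 - m_i) / (1 - u_i))
    return total

# Toy training set: two boolean attributes per pair, two labeled matches.
vectors = [[1, 1], [1, 0], [0, 0], [0, 1], [0, 0], [1, 0]]
labels = [True, True, False, False, False, False]

m, u = estimate_probabilities(vectors, labels)
agree_score = log_likelihood_ratio([1, 1], m, u)
disagree_score = log_likelihood_ratio([0, 0], m, u)
```

Because the features are treated as independent, only one m and one u value per attribute has to be estimated rather than one per point in the similarity vector space.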
Machine Learning Parameters
Supervised Methods
- Decision Trees
- Cochinwala, Munir, et al. "Efficient data reconciliation." Information Sciences 137.1 (2001): 1-15.
- Support Vector Machines
- Bilenko, Mikhail, and Raymond J. Mooney. "Adaptive duplicate detection using learnable string similarity
measures." Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data
mining. ACM, 2003.
- Christen, Peter. "Automatic record linkage using seeded nearest neighbour and support vector machine
classification." Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data
mining. ACM, 2008.
- Ensembles of Classifiers
- Chen, Zhaoqi, Dmitri V. Kalashnikov, and Sharad Mehrotra. "Exploiting context analysis for combining multiple
entity resolution systems." Proceedings of the 2009 ACM SIGMOD International Conference on Management of
data. ACM, 2009.
- Conditional Random Fields
- Gupta, Rahul, and Sunita Sarawagi. "Answering table augmentation queries from unstructured lists on the web."
Proceedings of the VLDB Endowment 2.1 (2009): 289-300.
Machine Learning Parameters
Unsupervised Methods
- Expectation Maximization
- Winkler, William E. "Overview of record linkage and current research directions." Bureau of the Census. 2006.
- Herzog, Thomas N., Fritz J. Scheuren, and William E. Winkler. Data quality and record linkage techniques. Springer
Science & Business Media, 2007.
- Hierarchical Clustering
- Ravikumar, Pradeep, and William W. Cohen. "A hierarchical graphical model for record linkage."Proceedings of the
20th conference on Uncertainty in artificial intelligence. AUAI Press, 2004.
Active Learning Methods
- Committee of Classifiers
- Sarawagi, Sunita, and Anuradha Bhamidipaty. "Interactive deduplication using active learning."Proceedings of the
eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2002.
- Tejada, Sheila, Craig A. Knoblock, and Steven Minton. "Learning object identification rules for information
integration." Information Systems 26.8 (2001): 607-633.
Luckily, most of these models are available in Scikit-Learn.
Considerations:
- Building training sets is hard:
- Most records are easy non-matches
- Record pairs can be ambiguous
- Class imbalance: more negatives than positives
Machine Learning & Fellegi Sunter is the state of the art.
Implementing Papers
Clustering & Blocking
To obtain a supervised training set, start by
using clustering and then add active learning
techniques to propose items to knowledge
engineers for labeling.
Advantages to Clusters
- Resolution decisions are not made simply on
pairwise comparisons, but search a larger space.
- Can use a variety of algorithms such that:
- Number of clusters is not known in advance
- There are numerous small, singleton clusters
- Input is a pairwise similarity graph
Requirement: Blocking
- Naive approach is $|R|^2$ comparisons.
- Consider 100,000 products from each of 10 online stores: 1,000,000 records, or 1,000,000,000,000 comparisons.
- At 1 μs per comparison ≈ 11.6 days
- Most are not going to be matches
- Can we block on product category?
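A short sketch of the arithmetic and of blocking itself. It assumes one microsecond per comparison, which is the rate that yields the 11.6 day figure, and it uses a hypothetical `category` field as the blocking key; `blocked_pairs` is an illustrative name.

```python
from collections import defaultdict
from itertools import combinations

def blocked_pairs(records, key):
    """Yield only the candidate pairs that share a blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key(record)].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)

# Cost of the naive approach at an assumed 1 microsecond per comparison.
n_records = 100_000 * 10
naive_comparisons = n_records ** 2
naive_days = naive_comparisons * 1e-6 / 86_400

# Toy records with a hypothetical "category" attribute.
records = [
    {"id": 1, "category": "games"},
    {"id": 2, "category": "games"},
    {"id": 3, "category": "music"},
    {"id": 4, "category": "music"},
]
pairs = list(blocked_pairs(records, key=lambda r: r["category"]))
```

Here blocking reduces six possible pairs to two, at the cost of never comparing records whose category was recorded differently.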
Canopy Clustering
- Often used as a pre-clustering optimization for approaches that must do pairwise comparisons, e.g. K-Means or Hierarchical Clustering
- Can be run in parallel, and is often used in Big Data
systems (implementations exist in MapReduce on
Hadoop)
- Use distance metric on similarity vectors for
computation.
Canopy Clustering
The algorithm begins with two thresholds, $T_1$ and $T_2$, the loose and tight distances respectively, where $T_1 > T_2$.
1. Remove a point from the set and start a new “canopy”.
2. For each point in the set, assign it to the new canopy if the distance is less than the loose distance $T_1$.
3. If the distance is less than $T_2$, remove it from the original set completely.
4. Repeat until there are no more data points to cluster.
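The four steps can be sketched as follows. This is a single-machine illustration, not a parallel implementation; the function name `canopy_clusters` is hypothetical, the canopy center is taken from the front of the list rather than at random so the output is reproducible, and the one-dimensional points stand in for similarity vectors.

```python
def canopy_clusters(points, distance, t1, t2):
    """Group points into possibly overlapping canopies.

    t1 is the loose distance and t2 the tight distance, with t1 > t2.
    """
    if t1 <= t2:
        raise ValueError("T1 (loose) must be greater than T2 (tight)")
    remaining = list(points)
    canopies = []
    while remaining:
        # Step 1: remove a point and start a new canopy around it.
        center = remaining.pop(0)
        canopy = [center]
        still_remaining = []
        for point in remaining:
            d = distance(center, point)
            # Step 2: within the loose distance, so it joins the canopy.
            if d < t1:
                canopy.append(point)
            # Step 3: only points beyond the tight distance stay in the set.
            if d >= t2:
                still_remaining.append(point)
        remaining = still_remaining
        canopies.append(canopy)
    # Step 4: the loop ends when no points are left to cluster.
    return canopies

points = [1.0, 1.5, 3.0, 8.0, 8.2]
canopies = canopy_clusters(points, lambda a, b: abs(a - b), t1=3.0, t2=1.0)
```

The point 3.0 appears in the first canopy and also seeds its own, which shows how canopies can overlap when a point falls between the two thresholds.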
Canopy Clustering
Setting the threshold values relatively permissively lets canopies capture more data.
In practice, most canopies will contain only a single
point, and can be ignored.
Pairwise comparisons are made between the
similarity vectors inside of each canopy.
Final Notes
Data Preparation
Good Data Preparation can go a long way in
getting good results, and is most of the work.
- Data Normalization
- Schema Normalization
- Imputation
Data Normalization
- convert to all lower case, remove whitespace
- run spell checker to remove known
typographical errors
- expand abbreviations, replace nicknames
- perform lookups in lexicons
- tokenize, stem, or lemmatize words
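A minimal sketch of a few of these steps using only the standard library. The `REPLACEMENTS` lookup and the `normalize` function are hypothetical, and spell checking, stemming, and lemmatization are left out because they would need an external lexicon or library.

```python
# Hypothetical lookup table covering abbreviations and nicknames.
REPLACEMENTS = {
    "ed.": "edition",
    "apt": "apartment",
    "joe": "joseph",
}

def normalize(text):
    """Lower-case, collapse whitespace, and expand known tokens."""
    if text is None:
        return None
    tokens = text.lower().split()  # split() also drops extra whitespace
    tokens = [REPLACEMENTS.get(token, token) for token in tokens]
    return " ".join(tokens)

title = normalize("  Reel Deal  Casino Shuffle Master Ed. ")
name = normalize("Joe Halifax")
```

Applying the same function to both record sets before scoring keeps superficial differences from lowering the match scores.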
Schema Normalization
- match attribute names (title → name)
- compound attributes (full name → first, last)
- nested attributes, particularly boolean attributes
- deal with set and list valued attributes
- segment records from raw text
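Two of these steps, matching attribute names and splitting a compound attribute, might be sketched like this. The `FIELD_ALIASES` mapping, the `full_name` field, and the split on the first space are all simplifying assumptions for illustration.

```python
# Hypothetical mapping from source attribute names to the target schema.
FIELD_ALIASES = {"title": "name"}

def normalize_schema(record):
    """Rename aliased fields and split a compound full_name attribute."""
    out = {}
    for field, value in record.items():
        out[FIELD_ALIASES.get(field, field)] = value
    full_name = out.pop("full_name", None)
    if full_name:
        # Naive split: everything after the first space is the last name.
        first, _, last = full_name.partition(" ")
        out["first_name"] = first
        out["last_name"] = last
    return out

product = normalize_schema({"title": "reel deal", "price": 19.99})
person = normalize_schema({"full_name": "Joseph Halifax"})
```

Real names need more careful handling than a single split, for example for suffixes and multi-part surnames.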
Imputation
- How do you deal with missing values?
- Set all to nan or None, remove empty string.
- How do you compare missing values? Omit
from similarity vector?
- Fill in missing values with aggregate (mean) or
with some default value.
Canonicalization
Merge information from duplicates to a representative
entity that contains maximal information - consider
downstream resolution.
Name | Email | Phone | Address
Joe Halifax | joe.halifax@gmail.com | null | New York, NY
Joseph Halifax Jr. | null | (212) 123-4444 | 130 5th Ave Apt 12, New York, NY
Merged representative:
Joseph Halifax | joe.halifax@gmail.com | (212) 123-4444 | 130 5th Ave Apt 12, New York, NY
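One simple merging strategy is to keep, for each attribute, the longest non-null value among the duplicates. The sketch below assumes string-valued attributes and uses the hypothetical name `canonicalize`. It is a heuristic, not the rule used for the example above: it keeps "Joseph Halifax Jr." as the name rather than "Joseph Halifax".

```python
def canonicalize(records):
    """Merge duplicates, keeping the longest non-null value per attribute."""
    fields = []
    for record in records:
        for field in record:
            if field not in fields:
                fields.append(field)
    canonical = {}
    for field in fields:
        values = [r[field] for r in records if r.get(field) is not None]
        canonical[field] = max(values, key=len) if values else None
    return canonical

duplicates = [
    {"name": "Joe Halifax", "email": "joe.halifax@gmail.com",
     "phone": None, "address": "New York, NY"},
    {"name": "Joseph Halifax Jr.", "email": None,
     "phone": "(212) 123-4444", "address": "130 5th Ave Apt 12, New York, NY"},
]
canonical = canonicalize(duplicates)
```

Which value counts as most complete depends on the downstream matching task, so the selection rule is worth choosing per attribute.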
Evaluation
- # of predicted matching pairs, cluster level metrics
- Precision/Recall → F1 score

                Match          Miss           Total
Actual Match    True Match     False Miss     |A|
Actual Miss     False Match    True Miss      |B|
Total           |P(A)|         |P(B)|         total
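Pairwise precision, recall, and F1 can be computed directly from the sets of true and predicted matching pairs. The function name `pairwise_scores` and the toy pairs are assumptions; pairs are treated as unordered.

```python
def pairwise_scores(true_pairs, predicted_pairs):
    """Return (precision, recall, f1) over unordered record pairs."""
    true_set = {frozenset(pair) for pair in true_pairs}
    predicted_set = {frozenset(pair) for pair in predicted_pairs}
    true_matches = len(true_set & predicted_set)
    precision = true_matches / len(predicted_set) if predicted_set else 0.0
    recall = true_matches / len(true_set) if true_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

true_pairs = [("a", "b"), ("c", "d"), ("e", "f")]
predicted_pairs = [("b", "a"), ("c", "d"), ("a", "c"), ("x", "y")]
precision, recall, f1 = pairwise_scores(true_pairs, predicted_pairs)
```

In this toy case two of four predictions are correct and two of three true matches are found, so precision is 0.5 and recall is 2/3.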
Conclusion

Contenu connexe

Tendances

Graph Databases for Master Data Management
Graph Databases for Master Data ManagementGraph Databases for Master Data Management
Graph Databases for Master Data ManagementNeo4j
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Knowledge Graph Introduction
Knowledge Graph IntroductionKnowledge Graph Introduction
Knowledge Graph IntroductionSören Auer
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshJeffrey T. Pollock
 
Intro to Graphs and Neo4j
Intro to Graphs and Neo4jIntro to Graphs and Neo4j
Intro to Graphs and Neo4jjexp
 
Introduction to Knowledge Graphs
Introduction to Knowledge GraphsIntroduction to Knowledge Graphs
Introduction to Knowledge Graphsmukuljoshi
 
Advanced Analytics Governance - Effective Model Management and Stewardship
Advanced Analytics Governance - Effective Model Management and StewardshipAdvanced Analytics Governance - Effective Model Management and Stewardship
Advanced Analytics Governance - Effective Model Management and StewardshipDATAVERSITY
 
Data Architecture Brief Overview
Data Architecture Brief OverviewData Architecture Brief Overview
Data Architecture Brief OverviewHal Kalechofsky
 
How a Semantic Layer Makes Data Mesh Work at Scale
How a Semantic Layer Makes  Data Mesh Work at ScaleHow a Semantic Layer Makes  Data Mesh Work at Scale
How a Semantic Layer Makes Data Mesh Work at ScaleDATAVERSITY
 
Enabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationEnabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationDenodo
 
Intro to Neo4j and Graph Databases
Intro to Neo4j and Graph DatabasesIntro to Neo4j and Graph Databases
Intro to Neo4j and Graph DatabasesNeo4j
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsKhalid Salama
 
Data Architecture - The Foundation for Enterprise Architecture and Governance
Data Architecture - The Foundation for Enterprise Architecture and GovernanceData Architecture - The Foundation for Enterprise Architecture and Governance
Data Architecture - The Foundation for Enterprise Architecture and GovernanceDATAVERSITY
 
Knowledge Graphs - The Power of Graph-Based Search
Knowledge Graphs - The Power of Graph-Based SearchKnowledge Graphs - The Power of Graph-Based Search
Knowledge Graphs - The Power of Graph-Based SearchNeo4j
 
Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...
Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...
Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...Neo4j
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureDatabricks
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Hortonworks
 

Tendances (20)

Graph Databases for Master Data Management
Graph Databases for Master Data ManagementGraph Databases for Master Data Management
Graph Databases for Master Data Management
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Knowledge Graph Introduction
Knowledge Graph IntroductionKnowledge Graph Introduction
Knowledge Graph Introduction
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
 
Intro to Graphs and Neo4j
Intro to Graphs and Neo4jIntro to Graphs and Neo4j
Intro to Graphs and Neo4j
 
Introduction to Knowledge Graphs
Introduction to Knowledge GraphsIntroduction to Knowledge Graphs
Introduction to Knowledge Graphs
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
Advanced Analytics Governance - Effective Model Management and Stewardship
Advanced Analytics Governance - Effective Model Management and StewardshipAdvanced Analytics Governance - Effective Model Management and Stewardship
Advanced Analytics Governance - Effective Model Management and Stewardship
 
Data Architecture Brief Overview
Data Architecture Brief OverviewData Architecture Brief Overview
Data Architecture Brief Overview
 
How a Semantic Layer Makes Data Mesh Work at Scale
How a Semantic Layer Makes  Data Mesh Work at ScaleHow a Semantic Layer Makes  Data Mesh Work at Scale
How a Semantic Layer Makes Data Mesh Work at Scale
 
Enabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationEnabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data Virtualization
 
Intro to Neo4j and Graph Databases
Intro to Neo4j and Graph DatabasesIntro to Neo4j and Graph Databases
Intro to Neo4j and Graph Databases
 
Data modeling for the business
Data modeling for the businessData modeling for the business
Data modeling for the business
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
 
Data Architecture - The Foundation for Enterprise Architecture and Governance
Data Architecture - The Foundation for Enterprise Architecture and GovernanceData Architecture - The Foundation for Enterprise Architecture and Governance
Data Architecture - The Foundation for Enterprise Architecture and Governance
 
Knowledge Graphs - The Power of Graph-Based Search
Knowledge Graphs - The Power of Graph-Based SearchKnowledge Graphs - The Power of Graph-Based Search
Knowledge Graphs - The Power of Graph-Based Search
 
Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...
Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...
Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
 

En vedette

Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)
Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)
Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)Spark Summit
 
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using SparkDataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using SparkHakka Labs
 
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...Benjamin Bengfort
 
Reifier fuzzy record matching samples
Reifier fuzzy record matching samplesReifier fuzzy record matching samples
Reifier fuzzy record matching samplesSonal Goyal
 
DataEngConf SF16 - Running simulations at scale
DataEngConf SF16 - Running simulations at scaleDataEngConf SF16 - Running simulations at scale
DataEngConf SF16 - Running simulations at scaleHakka Labs
 
An Interactive Visual Analytics Dashboard for the Employment Situation Report
An Interactive Visual Analytics Dashboard for the Employment Situation ReportAn Interactive Visual Analytics Dashboard for the Employment Situation Report
An Interactive Visual Analytics Dashboard for the Employment Situation ReportBenjamin Bengfort
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Visualizing the Model Selection Process
Visualizing the Model Selection ProcessVisualizing the Model Selection Process
Visualizing the Model Selection ProcessBenjamin Bengfort
 
Dynamics in graph analysis (PyData Carolinas 2016)
Dynamics in graph analysis (PyData Carolinas 2016)Dynamics in graph analysis (PyData Carolinas 2016)
Dynamics in graph analysis (PyData Carolinas 2016)Benjamin Bengfort
 
Is Bevan's NHS under threat?
Is Bevan's NHS under threat?Is Bevan's NHS under threat?
Is Bevan's NHS under threat?Geraint Day
 
Api days 2014 from theatrophone to ap is_the 2020 telco challenge_
Api days 2014  from theatrophone to ap is_the 2020 telco challenge_Api days 2014  from theatrophone to ap is_the 2020 telco challenge_
Api days 2014 from theatrophone to ap is_the 2020 telco challenge_Luis Borges Quina
 
Clara Cleymans koos meisjesnaam als benaming voor haar firma
Clara Cleymans koos meisjesnaam als benaming voor haar firmaClara Cleymans koos meisjesnaam als benaming voor haar firma
Clara Cleymans koos meisjesnaam als benaming voor haar firmaThierry Debels
 
Marina gascon la prueba judicial
Marina gascon   la prueba judicialMarina gascon   la prueba judicial
Marina gascon la prueba judicialMirta Hnriquez
 
Getting elephants to dance - Wie etablierte Unternehmen erfolgreiche Accelera...
Getting elephants to dance - Wie etablierte Unternehmen erfolgreiche Accelera...Getting elephants to dance - Wie etablierte Unternehmen erfolgreiche Accelera...
Getting elephants to dance - Wie etablierte Unternehmen erfolgreiche Accelera...Corporate Startup Summit
 
Inclusion - reaching the unreached
Inclusion - reaching the unreachedInclusion - reaching the unreached
Inclusion - reaching the unreachedicdeslides
 
Dockerfiles & Best Practices
Dockerfiles & Best PracticesDockerfiles & Best Practices
Dockerfiles & Best PracticesAvash Mulmi
 
混合モデルとEMアルゴリズム(PRML第9章)
混合モデルとEMアルゴリズム(PRML第9章)混合モデルとEMアルゴリズム(PRML第9章)
混合モデルとEMアルゴリズム(PRML第9章)Takao Yamanaka
 

En vedette (19)

Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)
Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)
Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)
 
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using SparkDataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
 
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
 
Reifier fuzzy record matching samples
Reifier fuzzy record matching samplesReifier fuzzy record matching samples
Reifier fuzzy record matching samples
 
DataEngConf SF16 - Running simulations at scale
DataEngConf SF16 - Running simulations at scaleDataEngConf SF16 - Running simulations at scale
DataEngConf SF16 - Running simulations at scale
 
Data Product Architectures
Data Product ArchitecturesData Product Architectures
Data Product Architectures
 
An Interactive Visual Analytics Dashboard for the Employment Situation Report
An Interactive Visual Analytics Dashboard for the Employment Situation ReportAn Interactive Visual Analytics Dashboard for the Employment Situation Report
An Interactive Visual Analytics Dashboard for the Employment Situation Report
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Visualizing the Model Selection Process
Visualizing the Model Selection ProcessVisualizing the Model Selection Process
Visualizing the Model Selection Process
 
Dynamics in graph analysis (PyData Carolinas 2016)
Dynamics in graph analysis (PyData Carolinas 2016)Dynamics in graph analysis (PyData Carolinas 2016)
Dynamics in graph analysis (PyData Carolinas 2016)
 
Is Bevan's NHS under threat?
Is Bevan's NHS under threat?Is Bevan's NHS under threat?
Is Bevan's NHS under threat?
 
Api days 2014 from theatrophone to ap is_the 2020 telco challenge_
Api days 2014  from theatrophone to ap is_the 2020 telco challenge_Api days 2014  from theatrophone to ap is_the 2020 telco challenge_
Api days 2014 from theatrophone to ap is_the 2020 telco challenge_
 
Clara Cleymans koos meisjesnaam als benaming voor haar firma
Clara Cleymans koos meisjesnaam als benaming voor haar firmaClara Cleymans koos meisjesnaam als benaming voor haar firma
Clara Cleymans koos meisjesnaam als benaming voor haar firma
 
Marina gascon la prueba judicial
Marina gascon   la prueba judicialMarina gascon   la prueba judicial
Marina gascon la prueba judicial
 
50 outils pour la demarche portfolio 2017
50 outils pour la demarche portfolio 201750 outils pour la demarche portfolio 2017
50 outils pour la demarche portfolio 2017
 
Getting elephants to dance - Wie etablierte Unternehmen erfolgreiche Accelera...
Getting elephants to dance - Wie etablierte Unternehmen erfolgreiche Accelera...Getting elephants to dance - Wie etablierte Unternehmen erfolgreiche Accelera...
Getting elephants to dance - Wie etablierte Unternehmen erfolgreiche Accelera...
 
Inclusion - reaching the unreached
Inclusion - reaching the unreachedInclusion - reaching the unreached
Inclusion - reaching the unreached
 
Dockerfiles & Best Practices
Dockerfiles & Best PracticesDockerfiles & Best Practices
Dockerfiles & Best Practices
 
混合モデルとEMアルゴリズム(PRML第9章)
混合モデルとEMアルゴリズム(PRML第9章)混合モデルとEMアルゴリズム(PRML第9章)
混合モデルとEMアルゴリズム(PRML第9章)
 

Similaire à A Primer on Entity Resolution

Data Structures unit I Introduction - data types
Data Structures unit I Introduction - data typesData Structures unit I Introduction - data types
Data Structures unit I Introduction - data typesAmirthaVarshini80
 
Download
DownloadDownload
Downloadbutest
 
Download
DownloadDownload
Downloadbutest
 
Topic Models Based Personalized Spam Filter
Topic Models Based Personalized Spam FilterTopic Models Based Personalized Spam Filter
Topic Models Based Personalized Spam FilterSudarsun Santhiappan
 
Probablistic information retrieval
Probablistic information retrievalProbablistic information retrieval
Probablistic information retrievalNisha Arankandath
 
Intelligent Methods in Models of Text Information Retrieval: Implications for...
Intelligent Methods in Models of Text Information Retrieval: Implications for...Intelligent Methods in Models of Text Information Retrieval: Implications for...
Intelligent Methods in Models of Text Information Retrieval: Implications for...inscit2006
 
Ranking nodes in growing networks: when PageRank fails
Ranking nodes in growing networks: when PageRank failsRanking nodes in growing networks: when PageRank fails
Ranking nodes in growing networks: when PageRank failsPietro De Nicolao
 
Nimrita koul Machine Learning
Nimrita koul  Machine LearningNimrita koul  Machine Learning
Nimrita koul Machine LearningNimrita Koul
 
Python for data science
Python for data sciencePython for data science
Python for data sciencebotsplash.com
 
Data Science as a Career and Intro to R
Data Science as a Career and Intro to RData Science as a Career and Intro to R
Data Science as a Career and Intro to RAnshik Bansal
 
Automated Correlation Discovery for Semi-Structured Business Processes
A Primer on Entity Resolution

  • 1. A Primer on Entity Resolution
  • 2. Workshop Objectives ● Introduce entity resolution theory and tasks ● Similarity scores and similarity vectors ● Pairwise matching with the Fellegi Sunter algorithm ● Clustering and Blocking for deduplication ● Final notes on entity resolution
  • 4. Entity Resolution refers to techniques that identify, group, and link digital mentions or manifestations of some object in the real world.
  • 5. In the Data Science Pipeline, ER is generally a wrangling technique. Pipeline diagram stages: Ingestion, Wrangling, Normalization, Computational Data Store, Feature Analysis, Model Builds, Model Selection & Monitoring, Cross Validation, API, Feedback. Diagram layers: Interaction, Data, Storage, Computation.
  • 6. - Creation of high quality data sets - Reduction in the number of instances in machine learning models - Reduction in the amount of covariance and therefore collinearity of predictor variables. - Simplification of relationships Information Quality
  • 7. Graph Analysis Simplification and Connection. Email mentions: ben@ddl.com, selma@gmail.com, tony@ddl.com, allen@acme.com, rebecca@acme.com, ben@gmail.com, tony@gmail.com. Resolved entities: Ben, Rebecca, Allen, Tony, Selma
  • 8. - Heterogeneous data: unstructured records - Larger and more varied datasets - Multi-domain and multi-relational data - Varied applications (web and mobile) Parallel, Probabilistic Methods Required* * Although this is often debated in various related domains. Machine Learning and ER
  • 9. Entity Resolution Tasks Deduplication Record Linkage Canonicalization Referencing
  • 10. - Primary consideration in ER - Cluster records that correspond to the same real world entity, normalizing a schema - Reduces number of records in the dataset - Variant: compute cluster representative Deduplication
  • 11. Record Linkage - Match records from one deduplicated data store to another (bipartite) - K-partite linkage links records in multiple data stores and their various associations - Generally proposed in relational data stores, but more frequently applied to unstructured records from various sources.
  • 12. Referencing - Known as entity disambiguation - Match noisy records to a clean, deduplicated reference table that is already canonicalized - Generally used to atomize multiple records to some primary key and donate extra information to the record
  • 13. Canonicalization - Compute representative - Generally the “most complete” record - Imputation of missing attributes via merging - Attribute selection based on the most likely candidate for downstream matching
  • 14. Notation - R: set of records - M: set of matches - N: set of non-matches - E: set of entities - L: set of links. Compare $(M_t, N_t, E_t, L_t) \Leftrightarrow (M_p, N_p, E_p, L_p)$, where t = true and p = predicted
  • 15. Key Assumptions - Every entity refers to a real world object (e.g. there are no “fake” instances) - References or sources (for record linkage) include no duplicates (integrity constraints) - If two records are identical, they are true matches, i.e. the pair belongs to $M_t$
  • 16. - NLTK: natural language toolkit - Dedupe*: structured deduplication - Distance: C implemented distance metrics - Scikit-Learn: machine learning models - Fuzzywuzzy: fuzzy string matching - PyBloom: probabilistic set matching Tools for Entity Resolution
  • 18. At the heart of any entity resolution task is the computation of similarity or distance.
  • 19. For two records, x and y, compute a similarity vector for each component attribute: [match_score(attr_x, attr_y) for attr_x, attr_y in zip(x, y)] Where match_score is a per-attribute function that computes either a boolean (match, not match) or real valued distance score. match_score ∈ [0,1]*
  • 20. x = { 'id': 'b0000c7fpt', 'title': 'reel deal casino shuffle master edition', 'description': 'reel deal casino shuffle master edition is ...', 'manufacturer': 'phantom efx', 'price': 19.99, 'source': 'amazon', } y = { 'id': '17175991674191849246', 'name': 'phantom efx reel deal casino shuffle master edition', 'description': 'reel deal casino shuffle master ed. is ...', 'manufacturer': None, 'price': 17.24, 'source': 'google', }
  • 21. # similarity vector is a match score of: # [name_score, description_score, manufacturer_score, price_score] # Boolean Match similarity(x,y) == [0, 1, 0, 0] # Real Valued Match similarity(x,y) == [0.83, 1.0, 0, 2.75]
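A minimal sketch of how such a similarity vector might be computed for the two product records above. The choice of `difflib.SequenceMatcher` as the string score, the pairing of `title` with `name`, and the helper names `string_score`, `price_distance`, and `similarity` are assumptions made for illustration, so the numbers it produces will not necessarily equal the ones on the slide.

```python
from difflib import SequenceMatcher

x = {
    'id': 'b0000c7fpt',
    'title': 'reel deal casino shuffle master edition',
    'description': 'reel deal casino shuffle master edition is ...',
    'manufacturer': 'phantom efx',
    'price': 19.99,
    'source': 'amazon',
}
y = {
    'id': '17175991674191849246',
    'name': 'phantom efx reel deal casino shuffle master edition',
    'description': 'reel deal casino shuffle master ed. is ...',
    'manufacturer': None,
    'price': 17.24,
    'source': 'google',
}


def string_score(a, b):
    """Real-valued similarity in [0, 1]; 0.0 when either value is missing."""
    if a is None or b is None:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def price_distance(a, b):
    """Absolute price difference; None when either value is missing."""
    if a is None or b is None:
        return None
    return round(abs(a - b), 2)


def similarity(x, y):
    """[name_score, description_score, manufacturer_score, price_score]"""
    return [
        string_score(x.get('title'), y.get('name')),  # assumed schema mapping
        string_score(x.get('description'), y.get('description')),
        string_score(x.get('manufacturer'), y.get('manufacturer')),
        price_distance(x.get('price'), y.get('price')),
    ]


vector = similarity(x, y)
```

Note that the price component here is a distance rather than a similarity, as in the real valued example on the slide; a downstream matcher would usually rescale it into [0, 1] first.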
  • 22. Match Scores Reference. String Matching: Edit Distance (Levenshtein, Smith-Waterman, Affine); Alignment (Jaro-Winkler, Soft-TFIDF, Monge-Elkan); Phonetic (Soundex, Translation). Distance Metrics: Euclidean, Manhattan, Minkowski; Text Analytics (Jaccard, TFIDF, Cosine similarity). Relational Matching: Set Based (Dice, Tanimoto (Jaccard), Common Neighbors, Adar Weighted); Aggregates (Average values, Max/Min values, Medians, Frequency (Mode)). Other Matching: Numeric distance, Boolean equality, Fuzzy matching, Domain specific; Gazettes (Lexical matching, Named Entities (NER)).
  • 24. Pairwise Matching: Given a vector of attribute match scores for a pair of records (x,y) compute $P_{match}(x, y)$.
  • 25. Weighted Sum + Threshold: P_match = sum(weight * score for weight, score in zip(weights, vector)) - weights should sum to one - determine weight for each attribute match score - higher weights for more predictive features - e.g. email more predictive than username - attribute value also contributes to predictability - If weighted score > threshold then match.
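To make two of the listed scores concrete, here is a small pure-Python sketch of Levenshtein edit distance and token-level Jaccard similarity. The function names and the whitespace tokenization are assumptions for illustration; in practice a library implementation would usually be preferred.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution (or exact match)
            ))
        prev = curr
    return prev[-1]


def jaccard(a, b):
    """Jaccard similarity of the whitespace-separated token sets of a and b."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


edit_distance = levenshtein("kitten", "sitting")
token_overlap = jaccard("reel deal casino", "deal casino edition")
```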
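The weighted sum rule can be sketched as follows. The function name `weighted_match`, the example weights, and the threshold value are hypothetical and chosen only to illustrate the calculation.

```python
def weighted_match(vector, weights, threshold):
    """Return (score, is_match) for a similarity vector under a weighted sum."""
    if len(vector) != len(weights):
        raise ValueError("vector and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights should sum to one")
    score = sum(w * s for w, s in zip(weights, vector))
    return score, score > threshold


# Hypothetical scores for [email, name, username]; email carries the most weight.
score, is_match = weighted_match([0.9, 1.0, 0.0], [0.5, 0.3, 0.2], threshold=0.7)
```

Keeping the weights normalized means the score stays in [0, 1] whenever the individual match scores do, so one threshold can be reused across record pairs.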
  • 26. Rule Based Approach - Formulate rules about the construction of a match for attribute collections. if score_name > 0.75 && score_price > 0.6 - Although formulating rules is hard, domain specific rules can be applied, making this a typical approach for many applications.
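The example rule translates directly into code. This sketch assumes the per-attribute scores arrive as a dictionary keyed by attribute name, which is an illustrative choice rather than something the slides specify.

```python
def rule_match(scores):
    """Apply the example rule: both the name and the price score must clear their cutoffs."""
    return scores["name"] > 0.75 and scores["price"] > 0.6


likely_match = rule_match({"name": 0.83, "price": 0.9})
likely_non_match = rule_match({"name": 0.83, "price": 0.4})
```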
  • 27. Modern record linkage theory was formalized in 1969 by Ivan Fellegi and Alan Sunter who proved that the probabilistic decision rule they described was optimal when the comparison attributes were conditionally independent. Their pioneering work “A Theory For Record Linkage” remains the mathematical foundation for many record linkage applications even today. Fellegi, Ivan P., and Alan B. Sunter. "A theory for record linkage." Journal of the American Statistical Association 64.328 (1969): 1183-1210.
  • 28. Record Linkage Model For two record sets, A and B, and a record pair $r = (a, b)$ with $a \in A$ and $b \in B$, $\gamma$ is the similarity vector of $r$, where each component $\gamma_i$ is some match score function for the record set. M is the match set and U the non-match set
  • 29. Record Linkage Model Probabilistic linkage based on the ratio $R(r) = m(\gamma) / u(\gamma)$ for the similarity vector $\gamma$ of pair $r$, where $m(\gamma) = P(\gamma \mid r \in M)$ and $u(\gamma) = P(\gamma \mid r \in U)$. Linkage Rule: $L(t_l, t_u)$ - upper & lower thresholds: Match if $R(r) \geq t_u$; Uncertain if $t_l < R(r) < t_u$; Non-Match if $R(r) \leq t_l$
  • 30. Linkage Rule Error - Type I Error: a non-match is called a match. - Type II Error: a match is called a non-match.
  • 31. Optimizing a Linkage Rule $L^*(t_l^*, t_u^*)$ is optimal over the similarity vector space, for given bounds on the two error rates, if: - $L^*$ keeps the type I and type II error rates within those bounds - $L^*$ has the least conditional probability of not making a decision - e.g. minimizes the uncertainty range in R(r).
  • 32. L* Discovery Given N records (e.g. N similarity vectors): Sort the records in decreasing order of $R(r) = m(\gamma) / u(\gamma)$, where $\gamma$ is the similarity vector of $r$. Select n and n′ so that the sorted records split into three ranges along R(r): 1, …, n (match); n+1, …, n′-1 (uncertain); n′, …, N (non-match).
  • 33. Practical Application of FS The similarity vector space is high dimensional: $m(\gamma)$ and $u(\gamma)$ computations are inefficient. Typically a naive Bayes assumption is made about the conditional independence of features in the similarity vector $\gamma$ given a match or a non-match. Computing $P(\gamma \mid r \in M)$ requires knowledge of matches. - Supervised machine learning with a training set. - Expectation Maximization (EM) to train parameters.
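The three-way decision made by a linkage rule with a lower and an upper threshold can be sketched as below. The function name, the argument names, and the example threshold values are assumptions for illustration.

```python
def linkage_decision(ratio, t_lower, t_upper):
    """Classify a record pair from its likelihood ratio R(r) and two thresholds."""
    if t_lower > t_upper:
        raise ValueError("t_lower must not exceed t_upper")
    if ratio >= t_upper:
        return "match"
    if ratio <= t_lower:
        return "non-match"
    return "uncertain"  # left for clerical review or a later stage


decisions = [linkage_decision(r, t_lower=0.5, t_upper=20.0) for r in (36.0, 3.0, 0.03)]
```

Pairs that land in the uncertain range are the ones a human reviewer or a downstream model would need to look at.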
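Under the naive Bayes assumption the likelihood ratio factorizes over attributes, so its logarithm is a sum of per-attribute weights. The sketch below assumes boolean agreement scores and hand-picked per-attribute probabilities (`m_probs` for matches, `u_probs` for non-matches); in a real system these would be estimated from labeled pairs or with EM.

```python
import math


def log_likelihood_ratio(agreements, m_probs, u_probs):
    """Sum of per-attribute log weights, assuming conditional independence.

    agreements: booleans, True where the attribute agrees for the pair
    m_probs:    P(attribute agrees | pair is a match)
    u_probs:    P(attribute agrees | pair is a non-match)
    """
    total = 0.0
    for agree, m, u in zip(agreements, m_probs, u_probs):
        if agree:
            total += math.log(m / u)
        else:
            total += math.log((1 - m) / (1 - u))
    return total


# Illustrative (made up) probabilities for two attributes.
m_probs = [0.9, 0.8]
u_probs = [0.1, 0.2]
both_agree = log_likelihood_ratio([True, True], m_probs, u_probs)
both_disagree = log_likelihood_ratio([False, False], m_probs, u_probs)
```

Working in log space avoids numerical underflow when many attributes are multiplied together, and the thresholds of the linkage rule can be moved to log space in the same way.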
  • 34. Machine Learning Parameters Supervised Methods - Decision Trees - Cochinwala, Munir, et al. "Efficient data reconciliation." Information Sciences 137.1 (2001): 1-15. - Support Vector Machines - Bilenko, Mikhail, and Raymond J. Mooney. "Adaptive duplicate detection using learnable string similarity measures." Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2003. - Christen, Peter. "Automatic record linkage using seeded nearest neighbour and support vector machine classification." Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2008. - Ensembles of Classifiers - Chen, Zhaoqi, Dmitri V. Kalashnikov, and Sharad Mehrotra. "Exploiting context analysis for combining multiple entity resolution systems." Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, 2009. - Conditional Random Fields - Gupta, Rahul, and Sunita Sarawagi. "Answering table augmentation queries from unstructured lists on the web." Proceedings of the VLDB Endowment 2.1 (2009): 289-300.
  • 35. Machine Learning Parameters Unsupervised Methods - Expectation Maximization - Winkler, William E. "Overview of record linkage and current research directions." Bureau of the Census. 2006. - Herzog, Thomas N., Fritz J. Scheuren, and William E. Winkler. Data quality and record linkage techniques. Springer Science & Business Media, 2007. - Hierarchical Clustering - Ravikumar, Pradeep, and William W. Cohen. "A hierarchical graphical model for record linkage." Proceedings of the 20th conference on Uncertainty in artificial intelligence. AUAI Press, 2004. Active Learning Methods - Committee of Classifiers - Sarawagi, Sunita, and Anuradha Bhamidipaty. "Interactive deduplication using active learning." Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2002. - Tejada, Sheila, Craig A. Knoblock, and Steven Minton. "Learning object identification rules for information integration." Information Systems 26.8 (2001): 607-633.
  • 36. Implementing Papers Luckily, most of these models are in Scikit-Learn. Considerations: - Building training sets is hard: - Most records are easy non-matches - Record pairs can be ambiguous - Class imbalance: more negatives than positives. Machine Learning & Fellegi Sunter is the state of the art.
  • 38. To obtain a supervised training set, start by using clustering and then add active learning techniques to propose items to knowledge engineers for labeling.
  • 39. Advantages to Clusters - Resolution decisions are not made simply on pairwise comparisons, but search a larger space. - Can use a variety of algorithms such that: - Number of clusters is not known in advance - There are numerous small, singleton clusters - Input is a pairwise similarity graph
  • 40. Requirement: Blocking - Naive approach is $|R|^2$ comparisons. - Consider 100,000 products from each of 10 online stores: 1,000,000 records is 1,000,000,000,000 comparisons. - At 1 μs per comparison = 11.6 days - Most are not going to be matches - Can we block on product category?
  • 41. Canopy Clustering - Often used as a pre-clustering optimization for approaches that must do pairwise comparisons, e.g. K-Means or Hierarchical Clustering - Can be run in parallel, and is often used in Big Data systems (implementations exist in MapReduce on Hadoop) - Use distance metric on similarity vectors for computation.
  • 42. Canopy Clustering The algorithm begins with two thresholds $T_1$ and $T_2$, the loose and tight distances respectively, where $T_1 > T_2$. 1. Remove a point from the set and start a new “canopy”. 2. For each point in the set, assign it to the new canopy if the distance is less than the loose distance $T_1$. 3. If the distance is less than $T_2$, remove it from the original set completely. 4. Repeat until there are no more data points to cluster.
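The arithmetic behind the blocking requirement, plus a minimal blocking sketch, can be written as follows. It assumes roughly one microsecond per comparison, which is the rate that yields the 11.6 day figure; the `blocked_pairs` helper and the toy product tuples are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

records = 100_000 * 10                    # 100,000 products from each of 10 stores
naive_comparisons = records ** 2          # |R|^2 pairwise comparisons
seconds = naive_comparisons * 1e-6        # assuming ~1 microsecond per comparison
days = seconds / 86_400


def blocked_pairs(items, key):
    """Yield candidate pairs only between items that share a blocking key."""
    blocks = defaultdict(list)
    for item in items:
        blocks[key(item)].append(item)
    for block in blocks.values():
        yield from combinations(block, 2)


# Toy example: block on product category (second tuple field).
products = [("a", "toys"), ("b", "toys"), ("c", "games"), ("d", "toys")]
candidates = list(blocked_pairs(products, key=lambda p: p[1]))
```

Blocking trades recall for speed: true matches that fall into different blocks, for instance because a category was mislabeled, are never compared.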
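The four steps above can be sketched in a few lines. This is an illustrative single-machine version, not the parallel MapReduce variant; the function signature, the seeded random choice of the canopy center, and the one-dimensional toy data are assumptions.

```python
import random


def canopy_clustering(points, distance, t1, t2, seed=0):
    """Group points into possibly overlapping canopies.

    t1 is the loose distance and t2 the tight distance, with t1 > t2.
    """
    if t1 <= t2:
        raise ValueError("t1 (loose) must be greater than t2 (tight)")
    rng = random.Random(seed)
    remaining = list(points)
    canopies = []
    while remaining:
        # Step 1: remove a point from the set and start a new canopy.
        center = remaining.pop(rng.randrange(len(remaining)))
        canopy = [center]
        still_remaining = []
        for point in remaining:
            d = distance(center, point)
            if d < t1:   # Step 2: within the loose distance, join the canopy.
                canopy.append(point)
            if d >= t2:  # Step 3: only points outside the tight distance stay in the set.
                still_remaining.append(point)
        remaining = still_remaining
        canopies.append(canopy)  # Step 4: repeat until no points are left.
    return canopies


points = [0.0, 0.5, 1.0, 10.0, 10.4]
canopies = canopy_clustering(points, lambda a, b: abs(a - b), t1=2.0, t2=1.0)
```

Because a point between the tight and loose distances stays in the set, it can appear in more than one canopy, which is what lets the later pairwise comparison step recover matches near canopy borders.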
  • 50. Canopy Clustering By setting threshold values relatively permissively - canopies will capture more data. In practice, most canopies will contain only a single point, and can be ignored. Pairwise comparisons are made between the similarity vectors inside of each canopy.
  • 52. Data Preparation Good Data Preparation can go a long way in getting good results, and is most of the work. - Data Normalization - Schema Normalization - Imputation
  • 53. Data Normalization - convert to all lower case, remove whitespace - run spell checker to remove known typographical errors - expand abbreviations, replace nicknames - perform lookups in lexicons - tokenize, stem, or lemmatize words
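A minimal normalization sketch covering lower casing, whitespace cleanup, abbreviation expansion, and nickname replacement. The two lookup tables are tiny hypothetical stand-ins for real lexicons, and spell checking, stemming, and lemmatization are left out.

```python
import re

# Hypothetical lookup tables; real systems would load these from lexicons.
ABBREVIATIONS = {"ed.": "edition", "ave": "avenue"}
NICKNAMES = {"joe": "joseph"}


def normalize(text):
    """Lower case, collapse whitespace, expand abbreviations and nicknames, strip punctuation."""
    tokens = text.lower().split()
    tokens = [ABBREVIATIONS.get(token, token) for token in tokens]
    tokens = [NICKNAMES.get(token, token) for token in tokens]
    tokens = [re.sub(r"[^\w]", "", token) for token in tokens]
    return " ".join(token for token in tokens if token)


title = normalize("  Reel Deal   Casino Shuffle Master Ed. ")
name = normalize("Joe Halifax")
```

Abbreviations are expanded before punctuation is stripped so that keys such as `ed.` still match.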
  • 54. Schema Normalization - match attribute names (title → name) - compound attributes (full name → first, last) - nested attributes, particularly boolean attributes - deal with set and list valued attributes - segment records from raw text
  • 55. Imputation - How do you deal with missing values? - Set all to nan or None, remove empty string. - How do you compare missing values? Omit from similarity vector? - Fill in missing values with aggregate (mean) or with some default value.
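The options listed above can be sketched with three small helpers: one that maps empty strings to `None`, one that fills a numeric attribute with the mean of the known values, and one that reports which attributes are present in both records so the rest can be omitted from the similarity vector. All names here are hypothetical.

```python
from statistics import mean


def clean_missing(record):
    """Map empty or whitespace-only strings to None so missing values look alike."""
    return {
        key: None if isinstance(value, str) and not value.strip() else value
        for key, value in record.items()
    }


def fill_with_mean(records, attr):
    """Return copies of the records with a missing attr replaced by the mean of known values."""
    known = [r[attr] for r in records if r.get(attr) is not None]
    default = mean(known) if known else None
    return [{**r, attr: default if r.get(attr) is None else r[attr]} for r in records]


def comparable_attributes(x, y, attrs):
    """Attributes present in both records; the others are omitted from comparison."""
    return [a for a in attrs if x.get(a) is not None and y.get(a) is not None]


cleaned = clean_missing({"manufacturer": "  ", "price": 17.24})
filled = fill_with_mean([{"price": 10.0}, {"price": None}, {"price": 20.0}], "price")
shared = comparable_attributes(
    {"manufacturer": "phantom efx", "price": 19.99}, cleaned, ["manufacturer", "price"]
)
```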
  • 56. Canonicalization Merge information from duplicates to a representative entity that contains maximal information - consider downstream resolution. Fields: Name; Email; Phone; Address. Duplicate 1: Joe Halifax; joe.halifax@gmail.com; null; New York, NY. Duplicate 2: Joseph Halifax Jr.; null; (212) 123-4444; 130 5th Ave Apt 12, New York, NY. Representative: Joseph Halifax; joe.halifax@gmail.com; (212) 123-4444; 130 5th Ave Apt 12, New York, NY.
  • 57. Evaluation - # of predicted matching pairs, cluster level metrics - Precision/Recall → F1 score. Confusion matrix (rows are actual, columns are predicted Match and Miss): Actual Match: True Match, False Miss, row total |A|. Actual Miss: False Match, True Miss, row total |B|. Column totals: |P(A)|, |P(B)|, total.
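One simple way to build a representative is to take, for each attribute, the longest non-missing value across the duplicates, on the assumption that longer strings carry more information. This rule and the `canonicalize` helper are illustrative choices, not the method prescribed by the slides.

```python
def canonicalize(records):
    """Merge duplicates by keeping the longest non-missing string per attribute."""
    attributes = []
    for record in records:
        for attr in record:
            if attr not in attributes:
                attributes.append(attr)
    merged = {}
    for attr in attributes:
        values = [r[attr] for r in records if r.get(attr) is not None]
        merged[attr] = max(values, key=len) if values else None
    return merged


duplicates = [
    {"name": "Joe Halifax", "email": "joe.halifax@gmail.com",
     "phone": None, "address": "New York, NY"},
    {"name": "Joseph Halifax Jr.", "email": None,
     "phone": "(212) 123-4444", "address": "130 5th Ave Apt 12, New York, NY"},
]
representative = canonicalize(duplicates)
```

With this naive rule the merged name keeps the suffix "Jr.", whereas the example above lists "Joseph Halifax"; choosing the best name for downstream matching would need an attribute-specific rule.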
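Pairwise precision, recall, and F1 can be computed directly from the sets of true and predicted matching pairs. The sketch below treats pairs as unordered and uses made-up record identifiers; the function name is hypothetical.

```python
def pairwise_scores(true_pairs, predicted_pairs):
    """Precision, recall, and F1 over unordered record pairs."""
    true_set = {frozenset(pair) for pair in true_pairs}
    predicted_set = {frozenset(pair) for pair in predicted_pairs}
    true_matches = len(true_set & predicted_set)
    precision = true_matches / len(predicted_set) if predicted_set else 0.0
    recall = true_matches / len(true_set) if true_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


true_pairs = [(1, 2), (3, 4), (5, 6)]
predicted_pairs = [(2, 1), (3, 4), (7, 8), (9, 10)]
precision, recall, f1 = pairwise_scores(true_pairs, predicted_pairs)
```

Because non-matching pairs vastly outnumber matching ones, accuracy over all pairs is rarely informative, which is why precision, recall, and F1 are the usual choices.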