SlideShare une entreprise Scribd logo
1  sur  25
January 2017
Data Linkage: An Overview
Natalie Shlomo – University of Manchester
1
2
Introduction to Data Linkage
Data (record) linkage brings together information from two different
records that are believed to belong to the same person based on
matching variables
• If two records agree on all matching variables, it is unlikely that
they would have agreed by chance, the level of assurance that
the link is correct will be high (the pair belongs to the same
person)
• If all of the matching variables disagree, the pair will not be
linked and it is unlikely that it belongs to the same person
• Intermediate situations where some matching variables agree
and some matching variables disagree, need to predict
whether the pair is a true match or a non-match
Often need clerical intervention to determine matching status
Data Linkage is difficult in the presence of errors in collecting data and
3
Introduction to Data Linkage
Challenges of Data Linkage:
• Errors, variations and missing data on the information used to link
records
• Differences in data captured and maintained in different databases,
eg. different versions of date of birth compared to age
• Data dynamics and database changes over time, eg. name changes
due to marriage and divorce, address changes
Typical problems in strings:
Misspelling, transpositions, fused or split words, missing or extra
letters, extraneous information, missing punctuation
Typical problems with numerical variables:
Transposed numbers, insertions, deletions
4
Introduction to Data Linkage
Data Linkage typically involves three stages:
- Pre-linkage: Editing and data cleaning, parsing, standardizing
matching variables
- Linkage: Bringing pairs together for comparison and determining
correct matches, i.e. belong to the same person. All pairs are
produced within blocks determined by blocking variables
- Post-linkage: Checking residuals, determining error rates, carry
out analysis accounting for linkage errors
Properties needed for matching variables:
- Unique; Available; Known; Accurate Stable over time
5
Introduction to Data Linkage
Context of data linkage to carry out statistical research and inform
policy
Focus on two main methods of data linkage and their combination:
Deterministic (exact) matching
Probabilistic matching
Deterministic (exact) matching method based on an exact one-to-one
character match of matching variables
Probabilistic matching method used if partial identifiers are available,
i.e. names and addresses
A score is computed for each potential pair based on individual
probabilities of agreement for each matching variable
6
Deterministic Linkage
Deterministic (exact matching) method
Records in two datasets must agree exactly on the matching
variables in
order to conclude that they correspond to the same individual
It can be used when a high quality identifier such as an ID number is
available
All matching variables have the same weight associated to them so
matching on gender carries the same weight as matching on last
name
Incorporating some errors:
In fuzzy matching, exact matching is carried out with a wildcard
substituted for characters, eg. *a*a*a can be banana, pajama,
etc.
Use transformed data, such as ‘Soundex’ for names or truncated
7
Probabilistic Data Linkage
Does not require that all identifying fields match exactly in order to
conclude that the records belong to the same individual
Frequency analysis of data values necessary in order to calculate for
each matching variable a weight that indicates for any pair of records
how likely it is that they refer to the same entity
Uncommon value agreement stronger evidence for linkage
Large weights assigned to fields that match and small weights are
assigned to fields that don’t match
Sum the scores over all matching variables and compare the sum to
threshold values in order to determine if the pair should be declared
a match, a non-match or undetermined for clerical review
8
Probabilistic Data Linkage
Method relies on calculating scores based on probabilities
Determines agreements between matching variables between a
pair of records as well as disagreements
Either from previous experience of record linkage on a similar
application or based on a preliminary linkage exercise, how likely
is it that the variables which agree between a pair would have
done so by chance if the pair were not correctly matched?
Compare this measure to how likely the agreement would be in
correctly matched record pairs
Can also use latent class modelling and EM algorithm to estimate the
matching probabilities without the need for test data
Probabilistic record linkage more computational demanding and more
difficult to program but it reduces the number of overlooked
matches by modelling the inconsistencies in the data and taking
them into account
9
Probabilistic Data Linkage
Criterion for good matching variables: agreement between variables
which are more typical of correctly matched pairs, rather than
those which might have occurred by chance in unrelated records
Example, variables that might agree by chance in unmatched record
pairs are those which don’t divide the population unto many
subclasses, for example gender
Key technical issues in the development of data linkage procedures
1. Good quality identifiers available to discriminate between the
person to whom the record refers and all other persons
2. Deciding whether discrepancies in identifiers are due to mistakes
in reporting for a single individual
3. Processing a large volume of data within a reasonable amount of
computing processing time
10
Data Linkage Parameters
Three key parameters for a successful probabilistic data linkage:
• Quality of the data
• The chance that values of a matching variable will randomly agree
• Ultimate number of true matches that exist in the database
Not all fields for matching give you the same amount of information
and uncommon value agreement stronger evidence for linkage
To incorporate the discriminating power of matching fields, the
weights are
computed as a ratio of 2 frequencies:
• number of agreements of a field in record pairs that represent
the
same individual
• number of agreements in a field in record pairs that do not
represent the same individual
11
Probabilistic Data Linkage
Need to define the agreement pattern:
For example, 3 matching variables with binary comparison
tests whether
- pair agrees on last name
- pair agrees on first name
- pair agrees on street name
Simple agreement pattern
and in fact, there would be 8 such patterns
Complex agreement pattern
and can be based on string
comparators

1
2
3
)1,0,1(
)80.0,0,66.0(
12
Probabilistic Data Linkage
Data quality is the first parameter of probabilistic linkage – the degree
to which the information contained for a matching variable is accurate
and stable across time
Data entry errors, missing data, or false dates diminish accuracy and
produce low quality
Higher quality data, more likely to make a correct match
Data quality is reflected in one of the probabilities needed for the
process – the m-probability
Conditional probability that a record pair has an agreement pattern
given that it is a match (the same person)
This is approximately 1-error rate and is referred as Reliability

)M|(Pm 
13
Probabilistic Data Linkage
Another parameter depends on the number of random agreements
denoted the u-probability
Conditional Probability that a record pair has an agreement pattern
given that it is not a match
The third parameter is: the prior probability of a correct
match
Then according to Bayes theorem:
Agreement (or likelihood) Ratio assuming conditional independence:
Order the comparison vectors by the agreement ratio and
choose upper and lower cut off values for to determine
correct matches and correct non-matches

)U|(Pu 
)(MP
)(P
)M(P)M|(P
)|M(P


 
)|(...)|()|(
)|(...)|()|(
)|(
)|(
)(
21
21
UPUPUP
MPMPMP
UP
MP
R
k
k








)(R
)(R
Probabilistic Data Linkage
Now take the logarithm and we obtain the sum of matching weights for
each separate matching variable:
Example:
P(agree on characteristic x|M)=
0.9 if x=first name, last name, age
0.8 if x=housenumber, streetname
P(agree on characteristic x|U)=
0.1 if x=first name, last name, age
0.2 if x=housenumber, streetname



















)|(
)|(
log...
)|(
)|(
log
)|(
)|(
log))(log(
2
2
1
1
UP
MP
UP
MP
UP
MP
R
k
k







Probabilistic Data Linkage
15
P(agree on characteristic Z|M)= 0.9 if Z=first name, last name, age
0.8 if Z=housenumber, streetname, sex
P(agree on characteristic Z|U)=0.1 if Z=first name, last name, age
0.2 if Z=housenumber, streetname,sex
Name Address Birth year Gender
Samantha Smith 435 Main St 1954 M
Sam Smith 435 Main St 1955 F
= (disagree first name, agree last name, agree hsnm, agree stnm,
disagree
birth year, disagree sex) = (0,1,1,1,0,0)
We can use dictionaries and string comparators which would give partial
agreement weight to ‘Sam’ and ‘Samantha’ or a deviation in only 1 year of

81.0))2.01/()8.01ln(()1.01/()9.01ln(()2.0/8.0ln(
)2.0/8.0ln()1.0/9.0ln())1.01/()9.01ln(())(ln(

R
16
Probabilistic Data Linkage
Data quality quantified by the m-probability with respect to accuracy
and stability of the matching variable
For any given field, the same value for m-probability applies to all
records
Distinguishing power is quantified by the u-probability
This can be obtained by the probability that 2 records will randomly
agree and is approximately 1/(number of values)
If it is high then the field has low distinguishing power, eg. gender
In contrast to the m-probability, a matching variable may have
multiple values of u-probabilities each corresponding to a specific
value in the matching variable
u-probability typically estimated as the proportion of records with a
specific value based on the frequencies seen in the primary data
source
17
Blocking
Number of possible comparisons increases with the product of the file
sizes
For large files, it is impractical to link every possible pair in the two files,
for example 2 files of size 10,000 will result in 100,000,000 comparisons
We restrict the comparisons to blocks of data where one or more
variables need to match exactly and are likely to refer to the same
person, thereby reducing the time spent searching the file
Utilizes a deterministic approach to assist the probabilistic method of
record linkage
Can block sequentially in an iterative process using different variables
Start with the most restrictive deterministic matching then proceed to
less
restrictive models
Example: block on post code and surname, perform clerical review on
the set of designated potential matches, and then match on residual files
of records not matched using another blocking criteria, such as year of
18
Evaluation and Thresholds
Thresholds for determining match status are based on minimizing
errors:
• error of linking unmatched records - Type I error
• error of not linking matched records – Type II error
As in classical decision theory, the thresholds are determined by how
much you are willing to be wrong based on the two error types which
are pre-determined
High values of overall scores suggest a correct match
Low values of overall scores suggest an incorrect match
19
Evaluation and Thresholds
What constitutes high and low?
Calculate a frequency distribution to determine critical values of
high or low values based on levels of significance (how much
you are willing to be wrong)
If we had training data where we knew the correct matching
status, we could derive two distributions, for true matches
and true non-matches:
Lower distribution for true non-matches typically contain lower
values of weights (typically this group is very large as there
are many more possible incorrect comparisons than there are
correct comparisons)
Upper distribution for true matches contain higher values of
weights
20
Evaluation and Thresholds
Must chose threshold: value of Score W- below which
automatically classify as incorrect matches
and value of Score W+ above which automatically classify as
correct matches
In between W- and W+ we would need to carry out a clerical
21
Evaluation and Thresholds
The decision rule is based on 3 types of pairs:
• Believed to be correct matches
• Unknown and might be correct matches
• Unlinked pairs
Empirical distributions can be
used to determine thresholds
based on training data and
Pre-set Type 1 and Type2
errors
Type
1
error
Type 2
error
22
• After record linkage, need to carry out logical checks in the data
for evaluation, i.e. might obtain situations such as hospital
discharges after a death
• Errors result from poor quality data
Recommended:
On those pairs declared a match - carry out a small random
sample and check for accuracy in the matching status,
particularly for those near the threshold cut-off values
On those pairs declared a non-match – carry out a small
random sample and check for accuracy in the matching status,
particularly for those near the threshold cut-off values
Use the errors to compensate for linkage errors when
analysing linked data
Evaluation and thresholds
23
Evaluation and Thresholds
True Status
Non-
Matches
(null
hypothesis)
Matches (alternative
hypothesis)
Decisio
n
Not
Linked
pairs
(fail to
reject
null)
Not Linked
non-
matches
(True
Negative)
Not linked matches
Type II error
(False Negative)
Linked
pairs
(reject
null)
Linked non
matches
Type I error
(False
Positive)
Linked Matches
(True Positive)
False Negative Rate=
fn/(tp+fn)
Sensitivity or recall
(ability to recognize
true matches)=
tp/(tp+fn)
False Positive Rate =
fp/(tn+fp)
Specificity (ability to
recognize incorrect
matches) = tn/(tn+fp)
Precision=tp/(tp+fp)
24
Overall Process
Select blocking and
matching variables
Editing, parsing,
phonetic code and
standardizing
Block and sort both
files
Deterministic method Probabilistic method
Probabilities and
Agreement Ratio
Determine thresholds
Match the two files
Clerical review
Link the data file records and check for errors
Not matched
Matche
d
Good matches Good matches
25
Thank you for your attention!!

Contenu connexe

Tendances

History of clinical trials
History of clinical trialsHistory of clinical trials
History of clinical trials
Urmila Aswar
 
Power Analysis and Sample Size Determination
Power Analysis and Sample Size DeterminationPower Analysis and Sample Size Determination
Power Analysis and Sample Size Determination
Ajay Dhamija
 
Prescription event monitoring and record linkage system
Prescription event monitoring and record linkage systemPrescription event monitoring and record linkage system
Prescription event monitoring and record linkage system
Vineetha Menon
 
Source Documents in Clinical Trials_part1
Source Documents in Clinical Trials_part1Source Documents in Clinical Trials_part1
Source Documents in Clinical Trials_part1
Valentyna Korniyenko
 

Tendances (20)

CRA/ Monitor Roles and Responsibilities
CRA/ Monitor Roles and ResponsibilitiesCRA/ Monitor Roles and Responsibilities
CRA/ Monitor Roles and Responsibilities
 
History of clinical trials
History of clinical trialsHistory of clinical trials
History of clinical trials
 
Components of a clinical study protocol
Components of a clinical study protocolComponents of a clinical study protocol
Components of a clinical study protocol
 
ICH GCP
ICH GCPICH GCP
ICH GCP
 
Pharmacometrics
PharmacometricsPharmacometrics
Pharmacometrics
 
Careers in Medical Writing
Careers in Medical Writing Careers in Medical Writing
Careers in Medical Writing
 
Clinical trials flow process
Clinical trials flow processClinical trials flow process
Clinical trials flow process
 
Power Analysis and Sample Size Determination
Power Analysis and Sample Size DeterminationPower Analysis and Sample Size Determination
Power Analysis and Sample Size Determination
 
Decentralized Clinical Trials
Decentralized Clinical TrialsDecentralized Clinical Trials
Decentralized Clinical Trials
 
Missing data handling
Missing data handlingMissing data handling
Missing data handling
 
What is Regulatory Medical Writing?
What is Regulatory Medical Writing?What is Regulatory Medical Writing?
What is Regulatory Medical Writing?
 
Non-inferiority and Equivalence Study design considerations and sample size
Non-inferiority and Equivalence Study design considerations and sample sizeNon-inferiority and Equivalence Study design considerations and sample size
Non-inferiority and Equivalence Study design considerations and sample size
 
PKPD seminar
PKPD seminarPKPD seminar
PKPD seminar
 
Umbrella, Basket and Platform trials
Umbrella, Basket and Platform trialsUmbrella, Basket and Platform trials
Umbrella, Basket and Platform trials
 
Auditor roles & responsibilities in CT as per ICHGCP
Auditor roles & responsibilities in CT as per ICHGCPAuditor roles & responsibilities in CT as per ICHGCP
Auditor roles & responsibilities in CT as per ICHGCP
 
Safety_Data_Reconciliation_Katalyst HLS
Safety_Data_Reconciliation_Katalyst HLSSafety_Data_Reconciliation_Katalyst HLS
Safety_Data_Reconciliation_Katalyst HLS
 
Correlation
CorrelationCorrelation
Correlation
 
Prescription event monitoring and record linkage system
Prescription event monitoring and record linkage systemPrescription event monitoring and record linkage system
Prescription event monitoring and record linkage system
 
Source Documents in Clinical Trials_part1
Source Documents in Clinical Trials_part1Source Documents in Clinical Trials_part1
Source Documents in Clinical Trials_part1
 
TDM of drugs used in seizure disorders
TDM of drugs used in seizure disordersTDM of drugs used in seizure disorders
TDM of drugs used in seizure disorders
 

En vedette

A seminar report on data aggregation in wireless sensor networks
A seminar report on data aggregation in wireless sensor networksA seminar report on data aggregation in wireless sensor networks
A seminar report on data aggregation in wireless sensor networks
praveen369
 

En vedette (13)

Dive Into Azure Data Lake - PASS 2017
Dive Into Azure Data Lake - PASS 2017Dive Into Azure Data Lake - PASS 2017
Dive Into Azure Data Lake - PASS 2017
 
14 Habits of Great SQL Developers
14 Habits of Great SQL Developers14 Habits of Great SQL Developers
14 Habits of Great SQL Developers
 
Mdm
MdmMdm
Mdm
 
Sustainability Information in Mining: Technologies and Processes for Data Agg...
Sustainability Information in Mining: Technologies and Processes for Data Agg...Sustainability Information in Mining: Technologies and Processes for Data Agg...
Sustainability Information in Mining: Technologies and Processes for Data Agg...
 
Biosocial research missing data
Biosocial research missing dataBiosocial research missing data
Biosocial research missing data
 
Tamr | MDM and the Data Unification Imperative
Tamr | MDM and the Data Unification ImperativeTamr | MDM and the Data Unification Imperative
Tamr | MDM and the Data Unification Imperative
 
Biosocial research framework
Biosocial research frameworkBiosocial research framework
Biosocial research framework
 
A seminar report on data aggregation in wireless sensor networks
A seminar report on data aggregation in wireless sensor networksA seminar report on data aggregation in wireless sensor networks
A seminar report on data aggregation in wireless sensor networks
 
Data Aggregation and Dissemination in Vehicular Ad-Hoc Networks
Data Aggregation and Dissemination in Vehicular Ad-Hoc NetworksData Aggregation and Dissemination in Vehicular Ad-Hoc Networks
Data Aggregation and Dissemination in Vehicular Ad-Hoc Networks
 
Tamr | cdo-summit
Tamr | cdo-summitTamr | cdo-summit
Tamr | cdo-summit
 
Biosocial research: Biological data quality issues
Biosocial research: Biological data quality issuesBiosocial research: Biological data quality issues
Biosocial research: Biological data quality issues
 
An introduction to Jupyter Notebooks for Social Science research
An introduction to Jupyter Notebooks for Social Science researchAn introduction to Jupyter Notebooks for Social Science research
An introduction to Jupyter Notebooks for Social Science research
 
Building a Data Lake on AWS
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWS
 

Similaire à Introduction to Data Linkage

Matching Criteria
Matching CriteriaMatching Criteria
Matching Criteria
Hugh Knight
 
Correlational research
Correlational researchCorrelational research
Correlational research
Jijo G John
 
SOC2002 Lecture 11
SOC2002 Lecture 11SOC2002 Lecture 11
SOC2002 Lecture 11
Bonnie Green
 
Classification via Logistic Regression
Classification via Logistic RegressionClassification via Logistic Regression
Classification via Logistic Regression
Taweh Beysolow II
 
A Two-Step Self-Evaluation Algorithm On Imputation Approaches For Missing Cat...
A Two-Step Self-Evaluation Algorithm On Imputation Approaches For Missing Cat...A Two-Step Self-Evaluation Algorithm On Imputation Approaches For Missing Cat...
A Two-Step Self-Evaluation Algorithm On Imputation Approaches For Missing Cat...
CSCJournals
 
Higher Order Learning
Higher Order LearningHigher Order Learning
Higher Order Learning
butest
 

Similaire à Introduction to Data Linkage (20)

String Identifier in Multiple Medical Databases
String Identifier in Multiple Medical DatabasesString Identifier in Multiple Medical Databases
String Identifier in Multiple Medical Databases
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
EVALUATION OF SEMANTIC ANSWER SIMILARITY METRICS
EVALUATION OF SEMANTIC ANSWER SIMILARITY METRICSEVALUATION OF SEMANTIC ANSWER SIMILARITY METRICS
EVALUATION OF SEMANTIC ANSWER SIMILARITY METRICS
 
EVALUATION OF SEMANTIC ANSWER SIMILARITY METRICS
EVALUATION OF SEMANTIC ANSWER SIMILARITY METRICSEVALUATION OF SEMANTIC ANSWER SIMILARITY METRICS
EVALUATION OF SEMANTIC ANSWER SIMILARITY METRICS
 
Approximating Source Accuracy Using Dublicate Records in Da-ta Integration
Approximating Source Accuracy Using Dublicate Records in Da-ta IntegrationApproximating Source Accuracy Using Dublicate Records in Da-ta Integration
Approximating Source Accuracy Using Dublicate Records in Da-ta Integration
 
Matching Criteria
Matching CriteriaMatching Criteria
Matching Criteria
 
Correlational research
Correlational researchCorrelational research
Correlational research
 
SOC2002 Lecture 11
SOC2002 Lecture 11SOC2002 Lecture 11
SOC2002 Lecture 11
 
analysing_data_using_spss.pdf
analysing_data_using_spss.pdfanalysing_data_using_spss.pdf
analysing_data_using_spss.pdf
 
analysing_data_using_spss.pdf
analysing_data_using_spss.pdfanalysing_data_using_spss.pdf
analysing_data_using_spss.pdf
 
Analysis Of Data Using SPSS
Analysis Of Data Using SPSSAnalysis Of Data Using SPSS
Analysis Of Data Using SPSS
 
A Statistical Analysis Of Summarization Evaluation Metrics Using Resampling M...
A Statistical Analysis Of Summarization Evaluation Metrics Using Resampling M...A Statistical Analysis Of Summarization Evaluation Metrics Using Resampling M...
A Statistical Analysis Of Summarization Evaluation Metrics Using Resampling M...
 
Morse et al 2012
Morse et al 2012Morse et al 2012
Morse et al 2012
 
Classification via Logistic Regression
Classification via Logistic RegressionClassification via Logistic Regression
Classification via Logistic Regression
 
A Two-Step Self-Evaluation Algorithm On Imputation Approaches For Missing Cat...
A Two-Step Self-Evaluation Algorithm On Imputation Approaches For Missing Cat...A Two-Step Self-Evaluation Algorithm On Imputation Approaches For Missing Cat...
A Two-Step Self-Evaluation Algorithm On Imputation Approaches For Missing Cat...
 
Slides sem on pls-complete
Slides sem on pls-completeSlides sem on pls-complete
Slides sem on pls-complete
 
Top 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdfTop 100+ Google Data Science Interview Questions.pdf
Top 100+ Google Data Science Interview Questions.pdf
 
Assigning Scores For Ordered Categorical Responses
Assigning Scores For Ordered Categorical ResponsesAssigning Scores For Ordered Categorical Responses
Assigning Scores For Ordered Categorical Responses
 
Recommender system
Recommender systemRecommender system
Recommender system
 
Higher Order Learning
Higher Order LearningHigher Order Learning
Higher Order Learning
 

Plus de University of Southampton

Plus de University of Southampton (20)

Introduction to Survey Data Quality
Introduction to Survey Data Quality  Introduction to Survey Data Quality
Introduction to Survey Data Quality
 
Generating SPSS training materials in StatJR
Generating SPSS training materials in StatJRGenerating SPSS training materials in StatJR
Generating SPSS training materials in StatJR
 
Introduction to the Stat-JR software package
Introduction to the Stat-JR software packageIntroduction to the Stat-JR software package
Introduction to the Stat-JR software package
 
Multi level modelling- random coefficient models | Ian Brunton-Smith
Multi level modelling- random coefficient models | Ian Brunton-SmithMulti level modelling- random coefficient models | Ian Brunton-Smith
Multi level modelling- random coefficient models | Ian Brunton-Smith
 
Multi level modelling - random intercept models | Ian Brunton Smith
Multi level modelling - random intercept models | Ian Brunton SmithMulti level modelling - random intercept models | Ian Brunton Smith
Multi level modelling - random intercept models | Ian Brunton Smith
 
Introduction to multilevel modelling | Ian Brunton-Smith
Introduction to multilevel modelling | Ian Brunton-SmithIntroduction to multilevel modelling | Ian Brunton-Smith
Introduction to multilevel modelling | Ian Brunton-Smith
 
Biosocial research:How to use biological data in social science research?
Biosocial research:How to use biological data in social science research?Biosocial research:How to use biological data in social science research?
Biosocial research:How to use biological data in social science research?
 
Integrating biological and social research data - Michaela Benzeval
Integrating biological and social research data - Michaela BenzevalIntegrating biological and social research data - Michaela Benzeval
Integrating biological and social research data - Michaela Benzeval
 
Teaching research methods: pedagogy hooks
Teaching research methods: pedagogy hooksTeaching research methods: pedagogy hooks
Teaching research methods: pedagogy hooks
 
Teaching research methods: pedagogy of methods learning
Teaching research methods: pedagogy of methods learningTeaching research methods: pedagogy of methods learning
Teaching research methods: pedagogy of methods learning
 
Data quality: total survey error
Data quality: total survey errorData quality: total survey error
Data quality: total survey error
 
better off living with parents
better off living with parentsbetter off living with parents
better off living with parents
 
Multilevel models:random coefficient models
Multilevel models:random coefficient modelsMultilevel models:random coefficient models
Multilevel models:random coefficient models
 
Multilevel models:random intercept models
Multilevel models:random intercept modelsMultilevel models:random intercept models
Multilevel models:random intercept models
 
Introduction to multilevel modelling
Introduction to multilevel modellingIntroduction to multilevel modelling
Introduction to multilevel modelling
 
How to write about research methods
How to write about research methodsHow to write about research methods
How to write about research methods
 
How to write about research methods
How to write about research methodsHow to write about research methods
How to write about research methods
 
Introduction to spatial interaction modelling
Introduction to spatial interaction modellingIntroduction to spatial interaction modelling
Introduction to spatial interaction modelling
 
Cognitive interviewing
Cognitive interviewingCognitive interviewing
Cognitive interviewing
 
Survey questions and measurement error
Survey questions and measurement errorSurvey questions and measurement error
Survey questions and measurement error
 

Dernier

Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 

Dernier (20)

Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptx
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
Plant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptxPlant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptx
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptx
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 

Introduction to Data Linkage

  • 1. January 2017 Data Linkage: An Overview Natalie Shlomo – University of Manchester 1
  • 2. 2 Introduction to Data Linkage Data (record) linkage brings together information from two different records that are believed to belong to the same person based on matching variables • If two records agree on all matching variables, it is unlikely that they would have agreed by chance, the level of assurance that the link is correct will be high (the pair belongs to the same person) • If all of the matching variables disagree, the pair will not be linked and it is unlikely that it belongs to the same person • Intermediate situations where some matching variables agree and some matching variables disagree, need to predict whether the pair is a true match or a non-match Often need clerical intervention to determine matching status Data Linkage is difficult in the presence of errors in collecting data and
  • 3. 3 Introduction to Data Linkage Challenges of Data Linkage: • Errors, variations and missing data on the information used to link records • Differences in data captured and maintained in different databases, eg. different versions of date of birth compared to age • Data dynamics and database changes over time, eg. name changes due to marriage and divorce, address changes Typical problems in strings: Misspelling, transpositions, fused or split words, missing or extra letters, extraneous information, missing punctuation Typical problems with numerical variables: Transposed numbers, insertions, deletions
  • 4. 4 Introduction to Data Linkage Data Linkage typically involves three stages: - Pre-linkage: Editing and data cleaning, parsing, standardizing matching variables - Linkage: Bringing pairs together for comparison and determining correct matches, i.e. belong to the same person. All pairs are produced within blocks determined by blocking variables - Post-linkage: Checking residuals, determining error rates, carry out analysis accounting for linkage errors Properties needed for matching variables: - Unique; Available; Known; Accurate Stable over time
  • 5. 5 Introduction to Data Linkage Context of data linkage to carry out statistical research and inform policy Focus on two main methods of data linkage and their combination: Deterministic (exact) matching Probabilistic matching Deterministic (exact) matching method based on an exact one-to-one character match of matching variables Probabilistic matching method used if partial identifiers are available, i.e. names and addresses A score is computed for each potential pair based on individual probabilities of agreement for each matching variable
  • 6. 6 Deterministic Linkage Deterministic (exact matching) method Records in two datasets must agree exactly on the matching variables in order to conclude that they correspond to the same individual It can be used when a high quality identifier such as an ID number is available All matching variables have the same weight associated to them so matching on gender carries the same weight as matching on last name Incorporating some errors: In fuzzy matching, exact matching is carried out with a wildcard substituted for characters, eg. *a*a*a can be banana, pajama, etc. Use transformed data, such as ‘Soundex’ for names or truncated
  • 7. 7 Probabilistic Data Linkage Does not require that all identifying fields match exactly in order to conclude that the records belong to the same individual Frequency analysis of data values necessary in order to calculate for each matching variable a weight that indicates for any pair of records how likely it is that they refer to the same entity Uncommon value agreement stronger evidence for linkage Large weights assigned to fields that match and small weights are assigned to fields that don’t match Sum the scores over all matching variables and compare the sum to threshold values in order to determine if the pair should be declared a match, a non-match or undetermined for clerical review
  • 8. 8 Probabilistic Data Linkage Method relies on calculating scores based on probabilities Determines agreements between matching variables between a pair of records as well as disagreements Either from previous experience of record linkage on a similar application or based on a preliminary linkage exercise, how likely is it that the variables which agree between a pair would have done so by chance if the pair were not correctly matched? Compare this measure to how likely the agreement would be in correctly matched record pairs Can also use latent class modelling and EM algorithm to estimate the matching probabilities without the need for test data Probabilistic record linkage more computational demanding and more difficult to program but it reduces the number of overlooked matches by modelling the inconsistencies in the data and taking them into account
  • 9. 9 Probabilistic Data Linkage Criterion for good matching variables: agreement between variables which are more typical of correctly matched pairs, rather than those which might have occurred by chance in unrelated records Example, variables that might agree by chance in unmatched record pairs are those which don’t divide the population unto many subclasses, for example gender Key technical issues in the development of data linkage procedures 1. Good quality identifiers available to discriminate between the person to whom the record refers and all other persons 2. Deciding whether discrepancies in identifiers are due to mistakes in reporting for a single individual 3. Processing a large volume of data within a reasonable amount of computing processing time
  • 10. 10 Data Linkage Parameters Three key parameters for a successful probabilistic data linkage: • Quality of the data • The chance that values of a matching variable will randomly agree • Ultimate number of true matches that exist in the database Not all fields for matching give you the same amount of information and uncommon value agreement stronger evidence for linkage To incorporate the discriminating power of matching fields, the weights are computed as a ratio of 2 frequencies: • number of agreements of a field in record pairs that represent the same individual • number of agreements in a field in record pairs that do not represent the same individual
  • 11. 11 Probabilistic Data Linkage Need to define the agreement pattern: For example, 3 matching variables with binary comparison tests whether - pair agrees on last name - pair agrees on first name - pair agrees on street name Simple agreement pattern and in fact, there would be 8 such patterns Complex agreement pattern and can be based on string comparators  1 2 3 )1,0,1( )80.0,0,66.0(
  • 12. 12 Probabilistic Data Linkage Data quality is the first parameter of probabilistic linkage – the degree to which the information contained for a matching variable is accurate and stable across time Data entry errors, missing data, or false dates diminish accuracy and produce low quality Higher quality data, more likely to make a correct match Data quality is reflected in one of the probabilities needed for the process – the m-probability Conditional probability that a record pair has an agreement pattern given that it is a match (the same person) This is approximately 1-error rate and is referred as Reliability  )M|(Pm 
  • 13. 13 Probabilistic Data Linkage Another parameter depends on the number of random agreements denoted the u-probability Conditional Probability that a record pair has an agreement pattern given that it is not a match The third parameter is: the prior probability of a correct match Then according to Bayes theorem: Agreement (or likelihood) Ratio assuming conditional independence: Order the comparison vectors by the agreement ratio and choose upper and lower cut off values for to determine correct matches and correct non-matches  )U|(Pu  )(MP )(P )M(P)M|(P )|M(P     )|(...)|()|( )|(...)|()|( )|( )|( )( 21 21 UPUPUP MPMPMP UP MP R k k         )(R )(R
  • 14. Probabilistic Data Linkage Now take the logarithm and we obtain the sum of matching weights for each separate matching variable: Example: P(agree on characteristic x|M)= 0.9 if x=first name, last name, age 0.8 if x=housenumber, streetname P(agree on characteristic x|U)= 0.1 if x=first name, last name, age 0.2 if x=housenumber, streetname                    )|( )|( log... )|( )|( log )|( )|( log))(log( 2 2 1 1 UP MP UP MP UP MP R k k       
  • 15. Probabilistic Data Linkage 15 P(agree on characteristic Z|M)= 0.9 if Z=first name, last name, age 0.8 if Z=housenumber, streetname, sex P(agree on characteristic Z|U)=0.1 if Z=first name, last name, age 0.2 if Z=housenumber, streetname,sex Name Address Birth year Gender Samantha Smith 435 Main St 1954 M Sam Smith 435 Main St 1955 F = (disagree first name, agree last name, agree hsnm, agree stnm, disagree birth year, disagree sex) = (0,1,1,1,0,0) We can use dictionaries and string comparators which would give partial agreement weight to ‘Sam’ and ‘Samantha’ or a deviation in only 1 year of  81.0))2.01/()8.01ln(()1.01/()9.01ln(()2.0/8.0ln( )2.0/8.0ln()1.0/9.0ln())1.01/()9.01ln(())(ln(  R
  • 16. 16 Probabilistic Data Linkage Data quality quantified by the m-probability with respect to accuracy and stability of the matching variable For any given field, the same value for m-probability applies to all records Distinguishing power is quantified by the u-probability This can be obtained by the probability that 2 records will randomly agree and is approximately 1/(number of values) If it is high then the field has low distinguishing power, eg. gender In contrast to the m-probability, a matching variable may have multiple values of u-probabilities each corresponding to a specific value in the matching variable u-probability typically estimated as the proportion of records with a specific value based on the frequencies seen in the primary data source
  • 17. 17 Blocking Number of possible comparisons increases with the product of the file sizes For large files, it is impractical to link every possible pair in the two files, for example 2 files of size 10,000 will result in 100,000,000 comparisons We restrict the comparisons to blocks of data where one or more variables need to match exactly and are likely to refer to the same person, thereby reducing the time spent searching the file Utilizes a deterministic approach to assist the probabilistic method of record linkage Can block sequentially in an iterative process using different variables Start with the most restrictive deterministic matching then proceed to less restrictive models Example: block on post code and surname, perform clerical review on the set of designated potential matches, and then match on residual files of records not matched using another blocking criteria, such as year of
  • 18. 18 Evaluation and Thresholds Thresholds for determining match status are based on minimizing errors: • error of linking unmatched records - Type I error • error of not linking matched records – Type II error As in classical decision theory, the thresholds are determined by how much you are willing to be wrong based on the two error types which are pre-determined High values of overall scores suggest a correct match Low values of overall scores suggest an incorrect match
  • 19. 19 Evaluation and Thresholds What constitutes high and low? Calculate a frequency distribution to determine critical values of high or low values based on levels of significance (how much you are willing to be wrong) If we had training data where we knew the correct matching status, we could derive two distributions, for true matches and true non-matches: Lower distribution for true non-matches typically contain lower values of weights (typically this group is very large as there are many more possible incorrect comparisons than there are correct comparisons) Upper distribution for true matches contain higher values of weights
  • 20. 20 Evaluation and Thresholds Must chose threshold: value of Score W- below which automatically classify as incorrect matches and value of Score W+ above which automatically classify as correct matches In between W- and W+ we would need to carry out a clerical
  • 21. 21 Evaluation and Thresholds The decision rule is based on 3 types of pairs: • Believed to be correct matches • Unknown and might be correct matches • Unlinked pairs Empirical distributions can be used to determine thresholds based on training data and Pre-set Type 1 and Type2 errors Type 1 error Type 2 error
  • 22. 22 • After record linkage, need to carry out logical checks in the data for evaluation, i.e. might obtain situations such as hospital discharges after a death • Errors result from poor quality data Recommended: On those pairs declared a match - carry out a small random sample and check for accuracy in the matching status, particularly for those near the threshold cut-off values On those pairs declared a non-match – carry out a small random sample and check for accuracy in the matching status, particularly for those near the threshold cut-off values Use the errors to compensate for linkage errors when analysing linked data Evaluation and thresholds
  • 23. 23 Evaluation and Thresholds True Status Non- Matches (null hypothesis) Matches (alternative hypothesis) Decisio n Not Linked pairs (fail to reject null) Not Linked non- matches (True Negative) Not linked matches Type II error (False Negative) Linked pairs (reject null) Linked non matches Type I error (False Positive) Linked Matches (True Positive) False Negative Rate= fn/(tp+fn) Sensitivity or recall (ability to recognize true matches)= tp/(tp+fn) False Positive Rate = fp/(tn+fp) Specificity (ability to recognize incorrect matches) = tn/(tn+fp) Precision=tp/(tp+fp)
  • 24. 24 Overall Process Select blocking and matching variables Editing, parsing, phonetic code and standardizing Block and sort both files Deterministic method Probabilistic method Probabilities and Agreement Ratio Determine thresholds Match the two files Clerical review Link the data file records and check for errors Not matched Matche d Good matches Good matches
  • 25. 25 Thank you for your attention!!