Data re-use in the CALIBER programme

Anoop Shah (a.shah@ucl.ac.uk)
Clinical Epidemiology Group, University College London

14th November 2013

1 The CALIBER programme

2 Why make research data re-usable?

3 The CALIBER approach

4 Summary

The CALIBER programme
UCL & LSHTM collaboration

CALIBER linked research database, linking:
• General practice
• MINAP registry
• Hospital Episode Statistics
• Death registrations

Funded by NIHR and Wellcome Trust

Defining continuous variables
Clinical (e.g. blood pressure) or laboratory (e.g. white cell count)

• Recorded in CPRD (primary care)
• Identified by 'entity code' and medcode (more granular)
• Lab data now electronically transferred
• Problems:
  • Missing units
  • Erroneous values
  • Inconsistent recording
  • Missing data

Medcodes associated with a test result
Example: neutrophil counts (a type of white blood cell), which may be recorded as an absolute count or a percentage

Medcode   Percent   Term
18        89.6      Neutrophil count
17622     9.9       Percentage neutrophils
23114     0.3       Granulocyte count
23115     0.1       Percentage granulocytes
13777     0.1       Neutrophil count NOS

Distribution of values for different units

Analysis issues
• Extraction algorithm
• Remove biologically implausible extreme values
  • In a huge dataset with no restriction on possible values, there will be some errors
• Standardise units
• Decide how to analyse
  • Timing, e.g. relative to index date
  • Repeat measures
  • Transformation, splines, categories etc.
  • Missing data (e.g. multiple imputation)
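The first steps above (standardising units and removing implausible values) might look roughly like the following. This is a minimal sketch, not CALIBER's extraction algorithm: the unit strings, the plausibility limits and the names `LabResult`, `UNIT_ALIASES`, `PLAUSIBLE_RANGE` and `standardise` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical plausibility limits, for illustration only (not clinical guidance).
PLAUSIBLE_RANGE = {
    "absolute": (0.0, 100.0),  # assumed unit: 10^9 cells per litre
    "percent": (0.0, 100.0),   # percentage of white cells
}

# Hypothetical mapping from recorded unit strings to a standard unit label.
UNIT_ALIASES = {
    "10^9/l": "absolute",
    "10*9/l": "absolute",
    "%": "percent",
}


@dataclass
class LabResult:
    patient_id: int
    value: float
    unit: Optional[str]  # None when the unit is missing


def standardise(result: LabResult) -> Optional[Tuple[str, float]]:
    """Return (standard_unit, value), or None if the record cannot be used."""
    if result.unit is None:
        return None  # missing unit: needs a separate, explicit decision
    unit = UNIT_ALIASES.get(result.unit.strip().lower())
    if unit is None:
        return None  # unrecognised unit string
    low, high = PLAUSIBLE_RANGE[unit]
    if not (low <= result.value <= high):
        return None  # implausible under the assumed limits
    return unit, result.value


records: List[LabResult] = [
    LabResult(1, 4.2, "10^9/L"),
    LabResult(2, 61.0, "%"),
    LabResult(3, 4200.0, "10^9/L"),  # outside the assumed range
    LabResult(4, 3.9, None),         # missing unit
]
cleaned = [standardise(r) for r in records]
```

Returning `None` instead of silently dropping a record keeps the number of excluded values visible, which matters when deciding later how to handle missing data.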

Observation time in GP practice
• Observation time: when registered at GP practice
• Practice 'up to standard date': date after which we expect that data are recorded
• If nothing recorded while registered at GP:
  • Patient may be abroad
  • Patient may be genuinely healthy
• Excluding observation time with no records risks bias
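One way to read the points above is that observation time should be derived from registration and practice dates alone, never from whether anything was recorded. The sketch below assumes that follow-up starts at the later of the registration date and the practice's up to standard date. That rule and the function names are assumptions for illustration, not taken from the slides.

```python
from datetime import date


def observation_start(registration: date, up_to_standard: date) -> date:
    """Assumed rule: follow-up starts at the later of the two dates."""
    return max(registration, up_to_standard)


def observed_days(registration: date, up_to_standard: date, end: date) -> int:
    """Days of observation time; 0 if the patient left before the start."""
    start = observation_start(registration, up_to_standard)
    return max((end - start).days, 0)


# Registered in 1995, but the practice is only 'up to standard' from 1998.
days = observed_days(date(1995, 3, 1), date(1998, 1, 1), date(2000, 1, 1))
```

Note that no clinical records enter the calculation, so a patient with nothing recorded still contributes their full registered time.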

Defining a diagnosis, e.g. atrial fibrillation

Defining a diagnosis

• Cross-map against different datasets
• Individual data sources may miss cases, so consider using linked datasets
  • Important for accurate measures of incidence
  • May be less important for associations between disease and risk factor, as long as the risk factor does not influence recording
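To illustrate cross-mapping, the sketch below compares which patients each source identifies as a case and reports the share of all known cases that each source captures. The source names, patient identifiers and the function `ascertainment` are invented for this example. The denominator is cases found in at least one source, so cases missed by every source remain invisible.

```python
from typing import Dict, Set


def ascertainment(cases_by_source: Dict[str, Set[int]]) -> Dict[str, float]:
    """Proportion of all cases (found in any source) that each source records."""
    all_cases: Set[int] = set().union(*cases_by_source.values())
    if not all_cases:
        return {source: 0.0 for source in cases_by_source}
    return {
        source: len(ids) / len(all_cases)
        for source, ids in cases_by_source.items()
    }


# Made-up patient identifiers for three linked sources.
cases = {
    "primary_care": {1, 2, 3, 5},
    "hospital": {2, 3, 4},
    "registry": {3, 4, 5, 6},
}
coverage = ascertainment(cases)
```

In this toy data no single source captures every case, which is why an incidence estimate based on one source alone would be too low.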

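Non-fatal myocardial infarction: all sources miss cases

[Figure: overlap of cases recorded in the MINAP disease registry, primary care (CPRD) and Hospital Episode Statistics. Percentages shown in the figure: 8%, 6%, 18%, 7%, 20%, 10%; which region each belongs to cannot be recovered from the extracted text.]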

Motivations for re-using data

• Time taken to prepare data and define variables
  • Cost
• Different definitions used by different groups
  • Lack of transparency and reproducibility

Possible approaches

• Ad hoc sharing of codelists and algorithms within a group
• Publish codelists and algorithms with papers
• The CALIBER approach
  • Repository of codelists and algorithms
  • Web portal for researcher access

CALIBER ‘LEGO’ data access model
1001, 2000-01-01, 23,1,NULL,I48
1001, 1994-08-11,1234,1,3,7L1H300
1001, 1993-01-01, 253,1,1,793Mz00
1231, 2012-03-03, 23,1,123,K65
1121, 2013-05-04, 7,1,3,5,14AN.00
1121, 2011-05-21, 81,1,9, G573100
1511, 1993-01-11, 91,1,6,9hF1.00
1511, 199-03-11, 91,1,6, G573100
9913, 2012-05-21, 81,1,9, G573100
67222, 1994-11-01,1234,1,3,7L1H300
67222, 1995-12-21,1234,1,3,7L1H300
67222, 1991-03-03,1234,1,3,7L1H310
682444, 1993-01-01, 253,1,1,793Mz00

1001, 2000-01-01, af_gprd=1
1231, 2012-03-03, af_hes=3
1121, 2013-05-04, af_procs_gprd=1
1511, 1993-01-11, heart_valve_gprd=2
9913, 2012-05-21, af_hes=1
67222, 1994-08-11, af_hes=1
682444, 1993-01-01, heart_valve_hes=2

af=1, af_diag_date=2001-12-01
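The three blocks above appear to show raw coded records, then per-source flags, then a single research variable. The sketch below illustrates only the last step, combining flags into one variable per patient. The rule used here (earliest date of any flag in `AF_FLAGS`) is a hypothetical stand-in: the slides do not state CALIBER's actual rule, and this one is not meant to reproduce the `af_diag_date` shown above.

```python
from datetime import date
from typing import Dict, List, Optional, Tuple

# (patient_id, event_date, flag_name, category), as in the second block above.
Flag = Tuple[int, date, str, int]

# Hypothetical rule: these flag names count as evidence of atrial fibrillation.
AF_FLAGS = {"af_gprd", "af_hes", "af_procs_gprd"}


def derive_af(flags: List[Flag]) -> Dict[int, Optional[date]]:
    """Map each patient to the earliest date of any AF flag (None if none)."""
    result: Dict[int, Optional[date]] = {}
    for patient_id, event_date, name, _category in flags:
        result.setdefault(patient_id, None)
        if name in AF_FLAGS:
            current = result[patient_id]
            if current is None or event_date < current:
                result[patient_id] = event_date
    return result


flags: List[Flag] = [
    (1001, date(2000, 1, 1), "af_gprd", 1),
    (1001, date(2001, 12, 1), "af_hes", 1),  # made-up second record
    (1511, date(1993, 1, 11), "heart_valve_gprd", 2),
]
af_dates = derive_af(flags)
```

Keeping the intermediate flags separate from the final rule means a different study can reuse the same building blocks with its own definition.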

CALIBER phenotypes (research variables)
• Consistent definitions for multiple studies (over 300 variables curated)
• Read, ICD-9, ICD-10, OPCS codelists
• Web portal to view variable definitions, and registered users can view codelists (https://www.caliberresearch.org/portal)
• Future: able to download scripts (e.g. Stata, R, SQL)
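A codelist can be thought of as a lookup from a code in a given coding system to a category. The sketch below is one possible representation, not the CALIBER format: the two codes are taken from the example records in these slides, but grouping them into one codelist and the category numbers are invented for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical codelist: (coding system, code) -> category number.
EXAMPLE_CODELIST: Dict[Tuple[str, str], int] = {
    ("ICD-10", "I48"): 1,
    ("Read", "G573100"): 1,
}


def lookup(system: str, code: str) -> Optional[int]:
    """Return the category for a code, or None if it is not in the codelist."""
    # Strip whitespace, since raw extracts may pad codes with spaces.
    return EXAMPLE_CODELIST.get((system, code.strip()))


category = lookup("Read", " G573100")
```

Keying on the coding system as well as the code avoids accidental matches between systems that happen to share a code string.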

CALIBER data portal

• Encourage researchers to define variables in a way that will be of use to others
• Final validated versions of codelists and variables
• Review by clinician and researcher

CALIBER analysis software

• R packages for managing codelists and data preparation (http://caliberanalysis.r-forge.r-project.org/)
• Lookup tables and data dictionaries
• Functions to simplify / automate common steps in data preparation

CALIBER expects researchers to contribute to the resource

[Figure: pathway for CALIBER users]
Users: investigators and non-investigators (experienced and non-experienced), research coordinator, industry
Stages: Website form, Approvals, Data, Analysis, Publication, Impacts
Supporting elements: website content; project feasibility and prioritization; unified data access form; LEGO data access model; contribute phenotyping algorithms, linkages; contribute to knowledge base; open access
Impacts: advancement of knowledge; translation; legislation, policy, guidelines; economic benefit, industry

Difficulties encountered

• Setting up the data portal takes time, needs dedicated staff
• Researchers need to think outside their own project
• Variables are updated / corrected; need to store different versions

Summary

• When analysing routine data, think about how the data were collected, and cross-check different sources of information
• Data sharing and re-use can bring benefits but need time and resources to manage
