An overview of work being performed to make research data easier to manage, analyse and use in the CALIBER programme. Presentation given by Anoop Shah of UCL at the Data Management in Practice workshop, which took place on 14 November 2013 at the London School of Hygiene and Tropical Medicine.
1. Data re-use in the CALIBER programme
Anoop Shah (a.shah@ucl.ac.uk)
Clinical Epidemiology Group, University College London
14th November 2013
2. Outline
1 The CALIBER programme
2 Why make research data re-usable?
3 The CALIBER approach
4 Summary
3. The CALIBER programme
UCL & LSHTM collaboration
CALIBER linked research database, drawing on:
• General practice
• MINAP registry
• Death registrations
• Hospital Episode Statistics
Funded by NIHR and Wellcome Trust
5. Defining continuous variables
Clinical (e.g. blood pressure) and laboratory (e.g. white cell count)
• Recorded in CPRD (primary care)
• Identified by ‘entity code’ and medcode (more granular)
• Lab data now electronically transferred
• Problems:
    • Missing units
    • Erroneous values
    • Inconsistent recording
    • Missing data
6. Medcodes associated with a test result
Example: neutrophil counts (a type of white blood cell) – may be absolute or percentage

Medcode  Percent  Term
18       89.6     Neutrophil count
17622    9.9      Percentage neutrophils
23114    0.3      Granulocyte count
23115    0.1      Percentage granulocytes
13777    0.1      Neutrophil count NOS
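A minimal sketch of how the medcode could be used to tell absolute counts from percentages before analysis. It is written in Python for illustration only and is not part of the CALIBER software; the dictionary `MEDCODE_KIND` and the function `measurement_kind` are hypothetical names, the sample records are made up, and only the three medcodes whose terms are unambiguous on the slide are included.

```python
# Hypothetical lookup: medcode -> kind of measurement.
# Terms in the comments are taken from the slide; the classification
# into "absolute" / "percentage" is inferred from the term wording.
MEDCODE_KIND = {
    18: "absolute",       # Neutrophil count
    17622: "percentage",  # Percentage neutrophils
    23114: "absolute",    # Granulocyte count
}


def measurement_kind(medcode):
    """Return 'absolute', 'percentage', or 'unknown' for a medcode."""
    return MEDCODE_KIND.get(medcode, "unknown")


# Made-up example records, not real patient data.
records = [
    {"medcode": 18, "value": 4.2},
    {"medcode": 17622, "value": 61.0},
    {"medcode": 99999, "value": 3.0},  # a code absent from the lookup
]

kinds = [measurement_kind(r["medcode"]) for r in records]
print(kinds)
```

Records flagged as "unknown" can then be reviewed by hand rather than silently mixed with the absolute counts.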
9. Analysis issues
• Extraction algorithm
    • Remove biologically implausible extreme values
        • In a huge dataset with no restriction on possible values, there will be some errors
    • Standardise units
• Decide how to analyse
    • Timing e.g. relative to index date
    • Repeat measures
    • Transformation, splines, categories etc.
    • Missing data (e.g. multiple imputation)
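The extraction steps above (standardise units, then remove implausible values) can be sketched as follows. This is an illustrative Python sketch, not the CALIBER algorithm: the unit table, the plausibility limits, the function name `clean_value` and the sample values are all assumptions chosen for the example, and real limits would need clinical review.

```python
# Assumed divisors that convert a recorded value to 10^9 cells/L.
UNIT_DIVISORS = {
    "10^9/L": 1,
    "10^6/L": 1000,  # 1000 x 10^6/L equals 1 x 10^9/L
}

# Assumed plausibility limits in 10^9/L (illustrative, not validated).
PLAUSIBLE_RANGE = (0.0, 100.0)


def clean_value(value, unit):
    """Return the value in 10^9/L, or None if it cannot be used.

    A value is rejected when it is missing, when its unit is missing
    or unrecognised, or when the standardised value is implausible.
    """
    divisor = UNIT_DIVISORS.get(unit)
    if value is None or divisor is None:
        return None
    standardised = value / divisor
    low, high = PLAUSIBLE_RANGE
    if not (low <= standardised <= high):
        return None
    return standardised


# Made-up raw measurements: (value, unit).
raw = [
    (4.2, "10^9/L"),     # already in the target unit
    (4200.0, "10^6/L"),  # needs conversion
    (4200.0, "10^9/L"),  # implausible after standardisation
    (3.9, None),         # missing unit
]

cleaned = [clean_value(v, u) for v, u in raw]
print(cleaned)
```

Returning `None` rather than dropping the record keeps rejected values visible, so they can be handled later as missing data.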
10. Observation time in GP practice
• Observation time – when registered at GP practice
• Practice ‘up to standard date’ – date after which we expect that data are recorded
• If nothing recorded while registered at GP:
    • Patient may be abroad
    • Patient may be genuinely healthy
    • Excluding observation time with no records risks bias
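One way to combine the two dates above is sketched below in Python. It assumes that usable follow-up starts at the later of the patient's registration date and the practice's up to standard date; the slide does not state this rule, so treat it as an assumption. The function name `observation_period` and the dates are hypothetical.

```python
from datetime import date


def observation_period(registration_start, registration_end, up_to_standard):
    """Return (start, end) of usable follow-up, or None if there is none.

    Assumption: follow-up begins at the later of the registration start
    and the practice's up to standard date, and ends when registration ends.
    """
    start = max(registration_start, up_to_standard)
    if start > registration_end:
        return None  # patient left before the practice was up to standard
    return start, registration_end


# Made-up example: registered in 2001, practice up to standard in 2003.
period = observation_period(
    registration_start=date(2001, 5, 1),
    registration_end=date(2010, 3, 31),
    up_to_standard=date(2003, 1, 1),
)
print(period)
```

The whole period is kept even if it contains no records, in line with the warning above that excluding record-free time risks bias.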
12. Defining a diagnosis
• Cross-map against different datasets
• Individual data sources may miss cases, so consider using linked datasets
    • Important for accurate measures of incidence
    • May be less important for associations between disease and risk factor, as long as the risk factor does not influence recording
13. Non-fatal myocardial infarction – all sources miss cases
[Figure: Venn diagram of cases recorded in the MINAP disease registry, primary care (CPRD) and Hospital Episode Statistics; percentages shown in the diagram: 8%, 6%, 18%, 7%, 20%, 10%]
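A short Python sketch of how overlap between linked sources can be tabulated. The patient identifiers are made up and do not reproduce the percentages in the figure; the names `cases`, `sources_for` and `missed_by` are hypothetical.

```python
from collections import Counter

# Made-up patient identifiers with a recorded event in each source.
cases = {
    "CPRD": {1, 2, 3, 4},
    "HES": {3, 4, 5, 6},
    "MINAP": {4, 6, 7},
}

# Every patient identified by at least one source.
all_cases = set().union(*cases.values())


def sources_for(patient_id):
    """Return the sorted tuple of sources that recorded this patient."""
    return tuple(sorted(name for name, ids in cases.items() if patient_id in ids))


# Number of patients in each region of the Venn diagram.
overlap = Counter(sources_for(pid) for pid in all_cases)

# Proportion of all identified cases that each single source misses.
missed_by = {
    name: len(all_cases - ids) / len(all_cases) for name, ids in cases.items()
}

for region, n in sorted(overlap.items()):
    print(" + ".join(region), n)
print(missed_by)
```

In this toy example every source misses some of the cases found by the others, which is the pattern the slide describes.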
14. Motivations for re-using data
• Time taken to prepare data and define variables
• Cost
• Different definitions used by different groups
• Lack of transparency and reproducibility
15. Possible approaches
• Ad hoc sharing of codelists and algorithms within a group
• Publish codelists and algorithms with papers
• The CALIBER approach
    • Repository of codelists and algorithms
    • Web portal for researcher access
20. CALIBER data portal
• Encourage researchers to define variables in a way that will be of use to others
• Final validated versions of codelists and variables
• Review by clinician and researcher
21. CALIBER analysis software
• R packages for managing codelists and data preparation (http://caliberanalysis.r-forge.r-project.org/)
• Lookup tables and data dictionaries
• Functions to simplify / automate common steps in data preparation
22. CALIBER expects researchers to contribute to the resource
[Figure: researcher workflow diagram]
• Users: investigators, non-investigators, non-experienced, experienced, research coordinator, industry
• Stages: website form, approvals, data, analysis, publication, impacts
• Labels on the diagram: website content; project feasibility and prioritization; unified data access form; LEGO data access model; contribute phenotyping algorithms, linkages; contribute to knowledge base; open access; advancement of knowledge; translation; legislation, policy, guidelines; economic benefit, industry
23. Difficulties encountered
• Setting up the data portal takes time, needs dedicated staff
• Researchers need to think outside their own project
• Variables are updated / corrected; need to store different versions
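As a rough illustration of the last point, the Python sketch below keeps every version of a variable definition so that an earlier analysis can be reproduced after a correction. It is not the CALIBER portal's design: the class `VersionedStore`, its methods, and the example codelist contents are all hypothetical.

```python
class VersionedStore:
    """Keep every version of a named variable definition (hypothetical design)."""

    def __init__(self):
        self._versions = {}  # name -> list of definitions, oldest first

    def save(self, name, definition):
        """Store a new version and return its 1-based version number."""
        self._versions.setdefault(name, []).append(definition)
        return len(self._versions[name])

    def get(self, name, version=None):
        """Return a specific version, or the latest if version is None."""
        history = self._versions[name]
        if version is None:
            return history[-1]
        if not 1 <= version <= len(history):
            raise ValueError(f"no version {version} of {name!r}")
        return history[version - 1]


store = VersionedStore()

# Illustrative codelist contents only.
v1 = store.save("neutrophil_count", frozenset({18}))
v2 = store.save("neutrophil_count", frozenset({18, 23114}))  # corrected list

print(v1, v2)
print(sorted(store.get("neutrophil_count")))             # latest version
print(sorted(store.get("neutrophil_count", version=1)))  # original version
```

Storing immutable definitions and never overwriting them means a published analysis can cite the exact version it used.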
24. Summary
• When analysing routine data think about how the data were collected, and cross-check different sources of information
• Data sharing and re-use can bring benefits but needs time and resources to manage