Researcher KnowHow session on Anonymisation 101, based on slides and training materials by Dr Sarah Nevitt, Research Associate at the University of Liverpool with a section on Research Data Management and Anonymisation by Judith Carr, Research Data Manager and co-ordinated by Gary Jeffers, Research Data Officer at University of Liverpool Library.
Anonymisation 101
1. Research Data Management and Anonymisation
Aim:- To introduce concepts of research
data management to aid research practice
and the anonymisation process
Judith Carr, Research Data Manager
Gary Jeffers, Research Data Officer
2. Session overview
• Let's talk about managing your research data
• Can't avoid data protection and GDPR
• Anonymisation – a practical bit
• First of all, a quick poll
3. Research data are
Any recorded information necessary to support or validate a research
project’s observations, findings or outputs, regardless of format
Research data management is
“an explicit process covering the creation and stewardship of
research materials to enable their use for as long as they retain value”
5. Plan, plan and plan some more!
Plan because:
• Avoid drowning in irrelevant information
• Avoid duplication
• Underpins integrity and efficiency of research
• Enables you to access data easily
• Aids collaboration
• Underpins data security and preservation – back up!!
• Plan to share
www.Liverpool.ac.uk/rdm
6. Photo by Derick McKinney on Unsplash
Not the most exciting part of research, but a
building block that supports best research practice
• Simple filing and labelling
• Learning metadata vocabulary
• Conversations about what, how and where data is collected
• Complicated collaborations with data from different jurisdictions – DMPonline is a good collaboration tool, used worldwide
• Plan and prepare how to manage and share – avoid delays and misunderstandings down the line
7. Example:- filing of email, letters, attachments
• Name of correspondent (sender or receiver as appropriate)
• Subject description (where it is not given in the folder title)
• Date of letter/email/memo
• If incoming correspondence, include ‘rcvd’
• If an attachment, the same info as above, with additional: 'attch' - to indicate
the document is an attachment and [2 digit number] of [2 digit number] - to
indicate the number of attachments received with the same covering email
Correct file name
/…/Complaints/
BloggsJ20031205attch01of02.pdf
BloggsJ20031205attch02of02.pdf
BloggsJ20031205rcvd.txt
BloggsJ20040105.rtf
BloggsJ20040220.rtf
ThomasH20030610rcvd.txt
ThomasH20030710.rtf
(Ordered alphanumerically as the files would be in the directory
list)
Incorrect file name
/…/Complaints/
AttachmentFromHThomas10Jun03.rtf
Attachment1FromJBloggs.pdf
Attachment2FromJBloggs.pdf
EmailFromHelenThomas10Jun03.txt
EmailToJoeBloggs5Dec03.txt
LetterFromJoeBloggs5Jan04.rtf
LetterToHelenThomas10Jul03.rtf
LetterToJoeBloggs20Feb04.rtf
(Ordered alphanumerically as the files would be in the directory list)
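The naming convention above can be sketched in code. This is a minimal illustration only: the function name `correspondence_filename` and its parameters are hypothetical, and it assumes (as the example list suggests) that an attachment carries the 'attch' suffix without 'rcvd'.

```python
from datetime import date
from typing import Optional, Tuple


def correspondence_filename(
    surname: str,
    initial: str,
    sent_on: date,
    extension: str,
    received: bool = False,
    attachment: Optional[Tuple[int, int]] = None,
) -> str:
    """Build a name such as BloggsJ20031205attch01of02.pdf.

    attachment is (position, total), e.g. (1, 2) for the first of two.
    """
    # Correspondent first, then the date as YYYYMMDD so names sort by date.
    name = f"{surname}{initial}{sent_on:%Y%m%d}"
    if attachment is not None:
        position, total = attachment
        name += f"attch{position:02d}of{total:02d}"
    elif received:
        name += "rcvd"
    return f"{name}.{extension}"


names = [
    correspondence_filename("Thomas", "H", date(2003, 7, 10), "rtf"),
    correspondence_filename("Bloggs", "J", date(2003, 12, 5), "txt", received=True),
    correspondence_filename("Bloggs", "J", date(2003, 12, 5), "pdf", attachment=(2, 2)),
    correspondence_filename("Bloggs", "J", date(2003, 12, 5), "pdf", attachment=(1, 2)),
]

# Sorting the names groups them by correspondent, then by date.
for name in sorted(names):
    print(name)
```

Putting the surname first and the date in year-month-day order is what makes a plain alphanumeric sort fall into a useful order.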
8. Store and back up
• Where? Important when dealing with personal/special category data,
also when using secondary data.
• Security? Passwords, encryption, limit access, anonymise early
• Never keep just one copy on a tablet, phone, USB stick or unprotected laptop
• University of Liverpool: C drive, M drive (Dept), Active Data Store
9. The ingredients for sharing
Photo by Kelly Sikkema on Unsplash
University of Liverpool Research Data Catalogue
• Discipline or funder research data repository
• Liverpool Research Data Catalogue
• Persistent identifier – DOI – you can be cited!
• Creative commons licence (CC-BY if possible)
• Anonymise to share
• If the data cannot be made open – controlled sharing under a
data sharing agreement
• Metadata – to find and understand your data
• Formats – generic and accessible
10. Research Data and Data Protection
Photo by Elena Mozhvilo on Unsplash
11. Data Protection
• Data Protection (GDPR) only applies if the data you are collecting, using or analysing
are identifiable personal or special category data, but this includes pseudonymised data
• When collecting and using identifiable personal or special category data, the
information sheet and consent form are really important. They should detail what you
will be doing with the data (including how you are going to share it)
• In the UK, researchers are collecting, archiving and processing such data
because processing is necessary for archiving purposes in the public interest, for
scientific or historical research purposes or statistical purposes in accordance with
Article 89(1).
GDPR does not apply to anonymised data – but it is always best practice to get
consent to share anonymous research data
12. GDPR Principles
• Lawfulness, fairness and transparency – process lawfully, fairly and in a
transparent manner in relation to the data subject
• Purpose limitation – collected for specified, explicit and legitimate purposes
• Data minimisation – adequate, relevant and limited to what is necessary in
relation to the purposes for which they are processed
• Accuracy – accurate and kept up to date where necessary
• Storage limitation – kept in a form that permits identification for no longer
than is necessary
• Integrity and confidentiality – processed securely, using appropriate technical
and organisational measures to protect against unauthorised or unlawful
processing, accidental loss, destruction or damage
Withdrawal of information
• Individuals have the right to withdraw their information
• Where data are collected under Article 89, this right can be time limited
• Only if you have clearly advised participants
via the information sheet and consent form
• Only if you have given a reasonable amount of
time before you withdraw the right
• Only if it would be difficult to extract the data
If you cannot make your research data open, you may still be able to share
in a controlled manner. You will need consent
13. Anonymisation 101
Based on slides and Training
materials by Dr Sarah Nevitt
sjn16@liverpool.ac.uk @sjn_16
14. Anonymisation: why?
• Data Sharing: HOT TOPIC
• Huge societal and scientific / research benefit
• Utility of data already collected for research
• Design new studies efficiently
• Consider new hypotheses with existing data
• Save resources
• Avoid research waste
15. Data sharing: a history
• Old news: researchers have shared data for years
o Friends and colleagues share data within their field
o Informal, few questions asked
o Little thought given to whether data was anonymised or if research
participants consented to their data being shared
• More recently, the process has formalised
o More research groups globally, more questions asked
o Requests for research proposals
o Introduction of data sharing or data use agreements
o Protection of participant privacy via anonymisation
o Choice to share at the discretion of the investigators
16. Data sharing: the present (and the future?)
• Data transparency and data sharing policies
o UK Research and Innovation Common Principles on Data Policy
o Wellcome Trust data and software sharing policy
o Many journals (all fields) now have data sharing policies
• Great interest in repositories to facilitate storing and sharing data across
many fields
• Data sharing platforms are now commonly used by the pharmaceutical
industry
17. Data sharing: the present (and the future?)
CSDR: www.clinicalstudydatarequest.com
The YODA project: www.yoda.yale.edu
Vivli: www.vivli.org
Data sharing platforms
• Research proposal
• Data Sharing Agreement
• Anonymised data shared within
a remote analysis portal
19.
Patient ID  DoB        Age  Gender  Race   Country        Partner Age
1           12APR1963  56   Male    White  Canada         48
2           28MAY1974  44   Male    Asian  France         43
3           06MAY1961  58   Male    White  United States  41
4           28MAY1954  65   Female  Black  Spain          65
5           14JUL1969  49   Male    Black  Brazil         41
6           13AUG1964  55   Female  White  Argentina      45
7           18MAR1961  58   Male    White  United States  48
8           22JAN1961  58   Male    White  United States  37
9           27SEP1924  95   Male    White  Canada         73
10          07FEB1966  53   Male    White  Canada         62
Slide acknowledgement to Jean-Marc Ferran
20.
Patient ID  Age Category  Age  Gender  Race   Country        Partner Age
1           <89           56   Male    White  Canada
2           <89           44   Male    Asian  France
3           <89           58   Male    White  United States
4           <89           65   Female  Black  Spain
5           <89           49   Male    Black  Brazil
6           <89           55   Female  White  Argentina
7           <89           58   Male    White  United States
8           <89           58   Male    White  United States
9           ≥89           .    Male    White  Canada
10          <89           53   Male    White  Canada
Slide acknowledgement to Jean-Marc Ferran
21.
Patient ID  Age Category 2  Age  Gender  Race   Continent      Partner Age
1           50-59                Male    White  North America
2           40-49                Male    Asian  Europe
3           50-59                Male    White  North America
4           60-69                Female  Black  Europe
5           40-49                Male    Black  South America
6           50-59                Female  White  South America
7           50-59                Male    White  North America
8           50-59                Male    White  North America
9           ≥89                  Male    White  North America
10          50-59                Male    White  North America
Slide acknowledgement to Jean-Marc Ferran
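The recoding shown across the tables above (top-coding age at 89, grouping ages into ten-year bands, replacing country with continent, dropping date of birth and partner age) could be sketched as follows. The field names, the `CONTINENT` lookup and the function names are hypothetical, and the lookup covers only the countries in the example.

```python
# Hypothetical lookup covering only the countries in the example table.
CONTINENT = {
    "Canada": "North America",
    "United States": "North America",
    "France": "Europe",
    "Spain": "Europe",
    "Brazil": "South America",
    "Argentina": "South America",
}

TOP_CODE_AGE = 89  # ages at or above this are reported only as "≥89"


def age_band(age: int, width: int = 10) -> str:
    """Return a band such as '50-59', or '≥89' for top-coded ages."""
    if age >= TOP_CODE_AGE:
        return f"≥{TOP_CODE_AGE}"
    lower = (age // width) * width
    # Cap the last band so it does not overlap the top-coded group.
    upper = min(lower + width - 1, TOP_CODE_AGE - 1)
    return f"{lower}-{upper}"


def generalise(record: dict) -> dict:
    """Keep only recoded quasi identifiers; DoB and partner age are dropped."""
    return {
        "patient_id": record["patient_id"],
        "age_band": age_band(record["age"]),
        "gender": record["gender"],
        "race": record["race"],
        "continent": CONTINENT[record["country"]],
    }


patient_9 = {
    "patient_id": 9, "dob": "27SEP1924", "age": 95, "gender": "Male",
    "race": "White", "country": "Canada", "partner_age": 73,
}
print(generalise(patient_9))
```

Wider bands and coarser geography lower the re-identification risk but also reduce what the data can be used for, so the band width is a judgement call rather than a fixed rule.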
22.
Patient ID  DoB  Age  Gender  Race  Country  Partner Age
1
2
3
4
5
6
7
8
9
10
Slide acknowledgement to Jean-Marc Ferran
24. Anonymisation: Direct and Quasi identifiers
• Direct Identifier: Information that can uniquely identify an individual.
– NHS number, Passport number, NI number, Postcode
– This information is usually removed completely (redacted, little data
utility)
• Quasi identifier: Information that can identify an individual when used in
connection with other information
– Level 1: Information which doesn’t change over time
• Sex, Country, Ethnicity etc. (recoded)
• Date of birth (removed or replaced with age)
– Level 2: Information which changes over time
• Event dates (offset or replaced with study day)
• Measurements etc. (recoded or left unaltered – risk vs utility)
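As a rough sketch of the first two steps above, direct identifiers can be dropped outright and date of birth replaced with age. The field names in `DIRECT_IDENTIFIERS`, the record layout and the reference date are assumptions made for illustration.

```python
from datetime import date

# Hypothetical field names for the direct identifiers listed above.
DIRECT_IDENTIFIERS = {"nhs_number", "passport_number", "ni_number", "postcode"}


def age_on(dob: date, reference: date) -> int:
    """Age in whole years on the reference date."""
    had_birthday = (reference.month, reference.day) >= (dob.month, dob.day)
    return reference.year - dob.year - (0 if had_birthday else 1)


def strip_identifiers(record: dict, reference: date) -> dict:
    """Remove direct identifiers and replace date of birth with age."""
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key != "date_of_birth"
    }
    cleaned["age"] = age_on(record["date_of_birth"], reference)
    return cleaned


person = {
    "nhs_number": "0000000000",  # made-up placeholder value
    "postcode": "L00 0XX",       # made-up placeholder value
    "date_of_birth": date(1963, 4, 12),
    "sex": "Male",
}
print(strip_identifiers(person, reference=date(2019, 6, 1)))
```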
25. Example: Anonymisation of clinical datasets
• Study ID variables could be a direct identifier
• Link uniquely back to a single person
• Recoded with attempt to preserve the sequence of patients joining the study
• Complete removal of any other personally identifiable information or
sensitive data
• Record contact details (phone numbers, addresses etc.) outside of the main
database
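One possible way to recode study IDs while preserving the order in which participants joined is sketched below. The original ID format, the `P001` style of the new IDs and the function name are all hypothetical.

```python
from datetime import date
from typing import Dict


def recode_study_ids(enrolment: Dict[str, date]) -> Dict[str, str]:
    """Map each original study ID to a new sequential ID.

    New IDs follow enrolment order; ties are broken by the original ID
    so the result is deterministic.
    """
    ordered = sorted(enrolment, key=lambda sid: (enrolment[sid], sid))
    width = max(3, len(str(len(ordered))))
    return {
        sid: f"P{rank:0{width}d}"
        for rank, sid in enumerate(ordered, start=1)
    }


enrolment_dates = {
    "LIV-0042": date(2019, 3, 1),
    "LIV-0007": date(2019, 1, 15),
    "MAN-0013": date(2019, 2, 2),
}
id_map = recode_study_ids(enrolment_dates)
print(id_map)
```

The mapping itself links back to the original IDs, so it would presumably be held separately from the shared dataset, in the same way as the contact details.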
26. Example: Anonymisation of clinical datasets
• Dates recorded during the study
• Use study day: the day the participant enters the study is day 0 and they attend
appointments every week (day 7, day 14, day 21 etc.)
• Easier and should be adequate for most scenarios?
• Offset: add or subtract a random number to all dates recorded for a person
• Removes the identifying nature of the actual dates, preserves the sequence of
dates – data utility?
• Generate a random number for each participant (within a small range), apply
to all dates
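Both date options above can be sketched briefly. The function names, the 14-day default range and the data layout are assumptions; the fixed seed is only there to make the example repeatable and would not be shared in practice.

```python
import random
from datetime import date, timedelta
from typing import Dict, List, Optional


def to_study_day(event: date, entry: date) -> int:
    """Days since the participant entered the study (entry day is day 0)."""
    return (event - entry).days


def offset_dates(
    dates_by_participant: Dict[str, List[date]],
    max_shift_days: int = 14,
    seed: Optional[int] = None,
) -> Dict[str, List[date]]:
    """Shift all of a participant's dates by one random, non-zero offset."""
    if max_shift_days < 1:
        raise ValueError("max_shift_days must be at least 1")
    rng = random.Random(seed)
    shifted = {}
    for participant, dates in dates_by_participant.items():
        shift = 0
        while shift == 0:  # a zero shift would leave the real dates in place
            shift = rng.randint(-max_shift_days, max_shift_days)
        delta = timedelta(days=shift)
        shifted[participant] = [d + delta for d in dates]
    return shifted


visits = {"P001": [date(2019, 1, 1), date(2019, 1, 8), date(2019, 1, 15)]}
print([to_study_day(d, visits["P001"][0]) for d in visits["P001"]])
shifted = offset_dates(visits, seed=1)
print(shifted["P001"])
```

Because the same offset is applied to every date for a person, the gaps between that person's visits are unchanged, which is where the data utility is preserved.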
27. Example: Anonymisation of clinical datasets
• Free text
• Free text is a nightmare in clinical datasets
• Impossible to validate and tricky to analyse
• In some circumstances, free text is necessary (e.g. to describe ‘Other’ where a
pre-defined category cannot be chosen)
• Often contains identifiable or personal information
• Complete removal would be easier – but consider the utility of this
data within the whole dataset
• Review and remove only the identifiable part.
• E.g. replace ‘Dr Jones says progress is good’ with ‘Dr X says progress is good’
• Also consider identifiable information which is not obvious
• Linking to dates – e.g. seasons, birthday etc.
• E.g. ‘The patient’s next appointment was missed due to Christmas’
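A first pass over free text could be partly automated along these lines. The patterns below are illustrative assumptions and will miss cases (names without a title, misspellings), so they would support a manual review rather than replace it.

```python
import re
from typing import List

# Illustrative pattern: a title followed by a capitalised surname.
TITLE_NAME = re.compile(
    r"\b(Dr|Mr|Mrs|Ms|Prof)\.?\s+[A-Z][a-z]+(?:[-'][A-Z][a-z]+)?"
)

# Illustrative list of words that hint at a date or time of year.
DATE_HINTS = re.compile(
    r"\b(Christmas|Easter|New Year|birthday|spring|summer|autumn|winter)\b",
    re.IGNORECASE,
)


def redact_names(text: str) -> str:
    """Replace 'Dr Jones' style names with the title followed by 'X'."""
    return TITLE_NAME.sub(lambda match: f"{match.group(1)} X", text)


def date_hints(text: str) -> List[str]:
    """Return words that may reveal when something happened."""
    return [match.group(0) for match in DATE_HINTS.finditer(text)]


print(redact_names("Dr Jones says progress is good"))
print(date_hints("The next appointment was missed due to Christmas"))
```

Flagging the date hints rather than deleting them leaves the decision to the reviewer, who can weigh the risk against the usefulness of the comment.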
28. Summary
• Data sharing has huge benefits and the push towards sharing data seems to be
getting stronger
• Protection of personal information via anonymisation is of utmost importance
(and required by GDPR for sharing data)
• Anonymised dataset – identifiable information is removed from a dataset so
that re-identification of an individual is of an acceptably low risk.
• What is deemed to be ‘low risk’ varies by context!
• Thought should also be given to the utility of the anonymised data
29. Practical
You are provided with a dataset of fifty people
who have taken part in a ‘Couch to 5K’ study.
Following completion of the 9 week programme,
all participants were provided with a running
watch and invited to record their time to run 5K.
30. Practical
Consider whether this dataset is sufficiently anonymised or
whether any changes should be made to the data. Consider:
• Personally identifiable information
• Quasi Identifiers
• Dates
• Low frequencies or extreme values
• Sensitive information
• Free text
Remember to consider the utility of the anonymised data!
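For the low-frequency check, one simple approach is to count how many people share each combination of quasi identifiers and flag the rare ones. The threshold of 3, the field names and the sample rows (taken from the recoded example table earlier in the session) are assumptions for illustration; what counts as too few depends on context.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def small_cells(
    records: Iterable[dict],
    quasi_identifiers: Sequence[str],
    threshold: int = 3,
) -> Dict[Tuple, int]:
    """Return quasi identifier combinations shared by fewer than `threshold` people."""
    counts = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < threshold}


QUASI = ("age_band", "gender", "race", "continent")
rows = [
    {"age_band": "50-59", "gender": "Male", "race": "White", "continent": "North America"},
    {"age_band": "40-49", "gender": "Male", "race": "Asian", "continent": "Europe"},
    {"age_band": "50-59", "gender": "Male", "race": "White", "continent": "North America"},
    {"age_band": "50-59", "gender": "Male", "race": "White", "continent": "North America"},
    {"age_band": "≥89", "gender": "Male", "race": "White", "continent": "North America"},
    {"age_band": "60-69", "gender": "Female", "race": "Black", "continent": "Europe"},
]

flagged = small_cells(rows, QUASI)
for combo, n in sorted(flagged.items()):
    print(n, combo)
```

Combinations that are flagged are candidates for further recoding (wider bands, merged categories) or suppression, balanced against the utility of the data.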
31. Thank you and any questions?
Webpages and contacts
www.Liverpool.ac.uk/rdm – Liverpool Research Data
https://www.liverpool.ac.uk/legal/data_protection/ - Data Protection
https://www.liverpool.ac.uk/csd/security/information-security/ - Information security – CSD
https://www.ukdataservice.ac.uk/manage-data/legal-ethical/anonymisation/step-by-step.aspx - UK Data Service