Stories from the Field: Data are Messy and that’s
(kind of) ok
Jude Towers, Lecturer in Sociology and Quantitative Methods
David Ellis, Lecturer in Computational Social Science
Introductions: who we are and why
we care about (even messy) data
Jude Towers
• Doctor of Applied Social Statistics, Lecturer in Sociology and
Quantitative Methods, Associate Director of the Violence &
Society UNESCO Centre and lead for the N8 Policing Research
Partnership, Training and Learning strand
• Current research is focused on the measurement of violence
• Work with data which is highly confidential and very, very
‘messy’ (e.g. individualised police records, NGO datasets)
• Teach Making Research Count: Engaging with Quantitative
Data – Faculty of Arts & Social Sciences ‘prequel to technical
methods courses’ - thinking critically about data
• JISC-sponsored Data Champion
Introductions: why we care about
(even messy) data
David Ellis
• Doctor of Psychology, Lecturer in Computational Social Science
at Lancaster, Core Researcher as part of CREST Research Centre,
Honorary Research Fellow at Lincoln
• Current research considers the measurement of digital traces
• Data collected is often messy and cloud-based
• JISC-sponsored Data Champion
Data: what counts?
• Inclusive understanding of ‘data’ - the
collection, use and management of a
myriad of forms of data
– ‘field’ data
• Policing
• Health
• Replication crisis within
Why bother with (messy) data?
Data, and the analysis of data, can entrench or contest our
understanding of the world – we can neither accept them at
face value nor dismiss them as positivistic and of no use for
progressive social change…
• Need to better support academics, students, policy-makers,
practitioners and the general public to better understand the
implications of the construction and analysis of data, the
presentation of data, especially statistical findings, and the use
and interpretation of ‘evidence’
-> key tool is robust management of data
Contribution to a progressive society, the common good,
a public academia
Messy Data:
• All data are ‘messy’ to some degree: data from ‘the field’ can be
especially messy
• Concepts and definitions can be wildly different
• Getting data is hard
– Sources; collection methods; confidentiality and anonymity;
access; sampling frames -> consequences of explicit and implicit
inclusions and exclusions
• ‘Cleaning’ data is time-consuming and can be highly political
– E.g. Outliers: important anomalies or data ‘mistakes’?
• Units of measurement
Data are messy
– but that’s (kind of) OK
GOAL: Distinguish between the signal and
the noise
• SIGNAL: real variation we want to explain
• NOISE: random variation probably caused by the
process of collecting and using data e.g.
measurement, sampling and human error (caveat:
tomorrow, with new knowledge or new techniques /
technology, we might return to this seemingly random ‘noise’
and impose a new meaning)
Nate Silver (2012) The Signal and The Noise: The Art and Science of Prediction. London, Penguin.
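The signal/noise distinction can be made concrete with a toy simulation. This is a minimal sketch, assuming the noise is independent and averages to zero (an assumption: as the caveat above says, apparent noise may later turn out to carry meaning). The values `TRUE_RATE` and `NOISE_SD` are hypothetical and chosen only for illustration.

```python
import random
import statistics


def simulate_measurements(signal, noise_sd, n, seed=0):
    """Return n noisy observations of one fixed underlying value (the signal)."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return [signal + rng.gauss(0.0, noise_sd) for _ in range(n)]


# Hypothetical values, not taken from any real dataset.
TRUE_RATE = 10.0  # the 'signal': the real quantity we want to explain
NOISE_SD = 5.0    # spread of the random 'noise' added by measurement

observations = simulate_measurements(TRUE_RATE, NOISE_SD, n=10_000)
estimate = statistics.mean(observations)
spread = statistics.stdev(observations)

print(f"spread of single observations: {spread:.2f}")
print(f"mean of all observations: {estimate:.2f}")
```

Individual observations scatter widely, while their average sits close to the underlying value. That only holds when the noise really is random; systematic error would not average out.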
Learning
GOAL: to expand the current knowledge base to improve
understanding of a particular issue/topic: learning is more than
collecting or producing (new) data -> data needs to be integrated
into, and to change, the existing knowledge base
Example 1. NHS Administrative
Data
Ellis, McQueenie, McConnachie et al., (2017). The Lancet Public
Health
Input files
• appointments.csv (N=892,216)
• patients.csv (N=73,012)
• clinical.csv (N=704,828)
Appointments
• remove administrative/secretary appointments: N=891,921 remaining after (<.01%) removed
• code appointments: attended = 830,039; DNA = 56,441
• remove non-appointments based on time rules: N=825,784 remaining after (7.4%) removed
• compute number of appointments attended/missed for each patient
appointmenthistory dataframe
• patient ID; DNA; attended; total; percentage missed; annual DNA rate
• categorise each patient: zero, low, medium, high
Clinical
• reclassify based on codes of interest
Patients
• remove duplicate patients: N=2,356
• appointment history merged with patients file (using patient ID as link)
• N = 491 patients (<1%) with no appointment data removed
patientappointments dataset (N=70,165)
• ID; sex; age; distance; Rur8; PracticeRur8; SIMD; PracticeSIMD; Ethnic; attended; DNA; total; percentage missed; category; annual rate (attended)
• Zero N = 44,685 (63.7%); Low N = 19,281 (27.5%); Medium N = 5,097 (7.3%); High N = 1,102 (1.6%)
• patients classified as frequent/non-frequent attenders (10th centile (annual attendance rate>=8.66)): Yes = 7,283; No = 62,882
Subset
• remove patients with missing data: N=2,460 (3.5%)
• remove ethnicity data
• add age categories
Ready for analysis and visualization (N=67,705)
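The per-patient steps in this flow (count attended and missed appointments, derive the percentage missed and an annual DNA rate, categorise, then link to the patients file by patient ID) could be sketched roughly as below. This is an illustrative sketch on invented toy records, not the authors' code: the category cut-offs in `categorise`, the `YEARS_OBSERVED` window and all function names are assumptions, since the slide does not state them.

```python
from collections import Counter

# Toy records invented for illustration: (patient ID, appointment status).
appointments = [
    ("p1", "attended"), ("p1", "attended"), ("p1", "DNA"),
    ("p2", "attended"), ("p2", "attended"),
    ("p3", "DNA"), ("p3", "DNA"), ("p3", "DNA"), ("p3", "attended"),
]

# Toy patients file keyed by patient ID; "p4" has no appointment data.
patients = {
    "p1": {"sex": "F", "age": 34},
    "p2": {"sex": "M", "age": 61},
    "p3": {"sex": "F", "age": 47},
    "p4": {"sex": "M", "age": 29},
}

YEARS_OBSERVED = 3  # assumption: length of the observation window in years


def categorise(annual_dna_rate):
    """Map an annual DNA rate to a category (cut-offs are hypothetical)."""
    if annual_dna_rate == 0:
        return "zero"
    if annual_dna_rate < 1:
        return "low"
    if annual_dna_rate < 2:
        return "medium"
    return "high"


def appointment_history(records, years):
    """Count attended/missed (DNA) appointments per patient and derive rates."""
    attended, dna = Counter(), Counter()
    for patient_id, status in records:
        if status == "DNA":
            dna[patient_id] += 1
        elif status == "attended":
            attended[patient_id] += 1
    history = {}
    for patient_id in set(attended) | set(dna):
        total = attended[patient_id] + dna[patient_id]
        annual_dna_rate = dna[patient_id] / years
        history[patient_id] = {
            "attended": attended[patient_id],
            "DNA": dna[patient_id],
            "total": total,
            "percentage_missed": 100 * dna[patient_id] / total,
            "annual_dna_rate": annual_dna_rate,
            "category": categorise(annual_dna_rate),
        }
    return history


def merge_with_patients(history, patient_file):
    """Join on patient ID, keeping only patients present in both inputs."""
    return {
        patient_id: {**patient_file[patient_id], **row}
        for patient_id, row in history.items()
        if patient_id in patient_file
    }


history = appointment_history(appointments, YEARS_OBSERVED)
patient_appointments = merge_with_patients(history, patients)
for patient_id in sorted(patient_appointments):
    print(patient_id, patient_appointments[patient_id])
```

Joining on patient ID drops patients with no appointment data (here `p4`), which mirrors the exclusion step in the flow. Every such rule is a cleaning decision that should be recorded alongside the counts it produces.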
Example 2. Problems within Social
Science
Shaw, Ellis, Kendrick et al., (2016). Cyberpsychology, Behavior and
Social Networking
Thank you!
j.towers1@Lancaster.ac.uk (@towersjude)
d.a.ellis@Lancaster.ac.uk (@davidaellis)
rdm@lancaster.ac.uk

Editor's Notes

  1. My idea for this slide would be for David to give an example / exemplar from Health and I will give one from Crime to illustrate, as per below. We could talk to one slide, or separate each bullet point into its own slide and add examples, whichever you think is best. [I’ve just added some slides pointing to 2 examples, but the points you raise here apply to everything]
     • Messy data: e.g. 80% of respondents reporting domestic violence to the Crime Survey for England and Wales (CSEW) have not reported to the police
     • Concepts and definitions: ‘what is violence?’ is the most controversial question in the field. Is it narrow and specific, e.g. a physical act which causes injury, fear or distress, or is it wide, e.g. Zizek’s ‘violence of capitalism’ or Galtung’s ‘any unnecessary civilian death’? Either choice has implications for data
     • Getting data: often clashes with the new ‘open data’ agenda, e.g. CSEW Intimate Violence data: users need to be certified; access is via a secure server from a PC with a static IP address, in a locked room with no public access; all outputs have to be checked and signed off before removal from the server; only those certified can see data in ‘raw’ form or during the analysis process
     • Sampling frame: the CSEW excludes groups most likely to be victims of crime: the homeless, anyone in an institutional setting (e.g. prison, hospital, refuge), and anyone staying temporarily with friends or family (insecurely housed)
     • Outliers: remove serial killers from homicide trends
     • Unit of measurement: violent crime in the UK is going up if crimes are counted, going down if victims are counted
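The unit-of-measurement point can be illustrated with a small sketch: the same records give opposite trends depending on whether crimes or victims are counted. The records below are invented toy data, not CSEW or police figures, and `counts_by_year` is a hypothetical helper name.

```python
from collections import defaultdict

# Toy incident records invented for illustration: (year, victim ID).
# In the second year fewer people are victimised, but some repeatedly.
incidents = [
    (2015, "a"), (2015, "b"), (2015, "c"), (2015, "d"),
    (2016, "a"), (2016, "a"), (2016, "a"),
    (2016, "b"), (2016, "b"), (2016, "c"),
]


def counts_by_year(records):
    """Return {year: (number of crimes, number of distinct victims)}."""
    crimes = defaultdict(int)
    victims = defaultdict(set)
    for year, victim_id in records:
        crimes[year] += 1            # every incident counts as one crime
        victims[year].add(victim_id)  # each victim counts once per year
    return {year: (crimes[year], len(victims[year])) for year in sorted(crimes)}


for year, (n_crimes, n_victims) in counts_by_year(incidents).items():
    print(year, "crimes:", n_crimes, "victims:", n_victims)
```

In this toy case crimes rise from 4 to 6 while victims fall from 4 to 3, because repeat victimisation is visible only when incidents are the unit.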
  2. Success stories…
  3. The truth is often far messier than what is presented within a journal:
     https://psychology.shinyapps.io/example3/
     https://psychology.shinyapps.io/smartphonepersonality/
     https://t.co/DurJDuJHQM