Visual Cleaning of Genotype Data 
Jessie Kennedy, Martin Graham 
Edinburgh Napier University 
Trevor Paterson, Andy Law 
The Roslin Institute, University of Edinburgh
Background 
• VIPER is a visualisation for spotting areas of error 
(impossible inheritance) in pedigree genotype 
datasets 
[Figure: pedigree structure annotated with genotypes for one marker (e.g. G | G, T | A, G | T, G | A, T | C); many more markers follow, with similar data per marker]
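As an illustration of what "impossible inheritance" means for a single marker, here is a minimal sketch of a trio check. The function name and the tuple representation of genotypes are assumptions made for this example, not VIPER's own code, and a missing genotype is simply treated as unable to prove an error.

```python
def is_consistent(child, father, mother):
    """Return True if the child's genotype could be inherited from the parents.

    Genotypes are 2-tuples of allele letters; None means missing or masked.
    (Hypothetical representation chosen for this sketch.)
    """
    if child is None or father is None or mother is None:
        return True  # missing data cannot be shown to be impossible here
    a, b = child
    # One allele must come from each parent, in either order.
    return (a in father and b in mother) or (b in father and a in mother)


# A G|T father and a G|A mother can produce G|A but not T|C.
print(is_consistent(("G", "A"), ("G", "T"), ("G", "A")))
print(is_consistent(("T", "C"), ("G", "T"), ("G", "A")))
```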
Background 
• The visualisation aggregated errors across markers 
and displayed them as offspring groups 
– Along with ancillary tables and bar charts 
• For it to be a useful biological tool, it needed to be 
extended to become a data cleaning application
Background 
• Data Wrangling 
– Fixing unreliable or useless data 
– General Purpose vs Specific Task 
• General Purpose Tools 
– Wrangler / Google Refine 
– Tabular data 
• Ours is a Specific Task 
– Remove the errors as they break further analyses 
– Fixing errors often creates new ones as our data is an 
inheritance graph of related data rather than a table
Background 
• Error Visualisation Topics (in order of volume of work) 
– Uncertainty visualisation – show bounds of reliability 
– Missing data visualisation – is data present 
• Usually the bane of visualisation rather than the aim 
– Correctness visualisation – is data right
Data Cleaning 
• We cover missing data and correctness. For us... 
– Incorrect data – bad. 
– Missing (incomplete) data – manageable. 
• Cleaning ≠ Correcting 
– Correction is preferable, but often impossible 
• We clean by deleting erroneous data points and 
inferring data from ancestor individuals 
– We swap wrong data for missing data
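A small sketch of "swapping wrong data for missing data" and then inferring what the missing value could be. It assumes genotypes are stored as (paternal allele, maternal allele) pairs in a nested dict and that None stands for missing; the helper names are hypothetical. Inference here only narrows the value to a set of candidates rather than recovering a single genotype.

```python
BASES = "ACGT"


def mask_point(genotypes, individual, marker):
    """Replace a (possibly wrong) data point with missing data."""
    genotypes[individual][marker] = None


def possible_child_genotypes(father, mother):
    """Ordered (paternal, maternal) allele pairs a child could carry.

    A missing parent (None) contributes any of the four bases.
    """
    father_alleles = father if father is not None else BASES
    mother_alleles = mother if mother is not None else BASES
    return {(p, m) for p in father_alleles for m in mother_alleles}


genotypes = {"child": {"m1": ("T", "C")}}
mask_point(genotypes, "child", "m1")           # wrong data becomes missing data
candidates = possible_child_genotypes(("A", "G"), ("C", "C"))
print(sorted(candidates))                       # the value is narrowed, not fixed
```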
Data Cleaning - Operations 
• Four basic masking operations 
1. Mask markers 
2. Mask individuals 
3. Mask single data points 
4. Break relationships
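One way to picture the four operations is as a set of masks kept apart from the genotype data, so that the underlying values are never overwritten. This is a sketch under that assumption; the class and method names are invented for illustration and are not taken from VIPER.

```python
from dataclasses import dataclass, field


@dataclass
class MaskState:
    """Hypothetical container for the four masking operations."""
    markers: set = field(default_factory=set)        # masked marker ids
    individuals: set = field(default_factory=set)    # masked individual ids
    points: set = field(default_factory=set)         # masked (individual, marker) pairs
    broken_links: set = field(default_factory=set)   # broken (child, parent) pairs

    def mask_marker(self, marker):
        self.markers.add(marker)

    def mask_individual(self, individual):
        self.individuals.add(individual)

    def mask_point(self, individual, marker):
        self.points.add((individual, marker))

    def break_relationship(self, child, parent):
        self.broken_links.add((child, parent))

    def is_masked(self, individual, marker):
        """A data point is hidden if any of the three data masks covers it."""
        return (marker in self.markers
                or individual in self.individuals
                or (individual, marker) in self.points)

    def has_link(self, child, parent):
        """A broken relationship applies across all markers."""
        return (child, parent) not in self.broken_links


state = MaskState()
state.mask_marker("m1")
state.mask_point("i7", "m2")
state.break_relationship("i9", "i3")
```

Keeping masks separate from the data also makes each operation easy to reverse, which matters once undo is required.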
Data Cleaning - Markers 
• Markers are independent of each other. 
– Masking one marker doesn’t change the errors in any 
other markers 
• Thus markers with lots of errors can be quickly 
removed with no side-effect 
– Early version in VIPER hid errors (but didn’t do anything to 
the underlying data)
Data Cleaning - Individuals 
• Wanted to adopt the same approach... 
– But something odd happened. 
– Removing individuals changes the error counts of other 
individuals 
• Because individuals inherit from each other 
• So, e.g., removing every individual with >5 errors 
produced new individuals with >5 errors.
Data Cleaning - Individuals 
• Some errors turned out to simply drop from one 
generation to the next 
– Literal “chase to the bottom”, lots of lost data 
• In these situations it is often necessary to break a 
child/parent relationship across all markers in the 
pedigree 
– Which is where the fourth masking operation originates
Masking - 1 
[Figure: example pedigree with one genotype per individual, e.g. A/G, G/T, C/C, G/C]
www.napier.ac.uk/iidi 
Mask all errors 
Recheck for errors 
Repeat 
Lose 50% of data
Masking - 2 
[Figure: example pedigree with one genotype per individual, e.g. A/G, G/T, C/C, G/C]
Mask errors top down 
Recheck for errors 
Repeat 
Lose 25% of data
Masking - 3 
[Figure: example pedigree with one genotype per individual, e.g. A/G, G/T, C/C, G/C]
Mask errors top down + cut links 
Recheck for errors 
Repeat 
Lose <20% of data
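The "mask, recheck, repeat" loop on the masking slides can be sketched on a toy single-marker pedigree. Everything here is an assumption made for illustration: the trio check, the rule that every member of an impossible trio is flagged, and the example data. It is not the checking algorithm used by VIPER or Genotype Checker, and the counts it prints are unrelated to the percentages on the slides.

```python
BASES = "ACGT"


def consistent(child, father, mother):
    """True if the child genotype is possible; a missing parent allows any allele."""
    if child is None:
        return True
    f = father if father is not None else BASES
    m = mother if mother is not None else BASES
    a, b = child
    return (a in f and b in m) or (b in f and a in m)


def flagged(parents, genotypes):
    """Individuals with data that take part in at least one impossible trio."""
    out = set()
    for child, (father, mother) in parents.items():
        if not consistent(genotypes[child], genotypes[father], genotypes[mother]):
            out.update(i for i in (child, father, mother) if genotypes[i] is not None)
    return out


def clean(parents, genotypes, generation, top_down):
    """Mask, recheck, repeat. Returns (cleaned genotypes, number of masked points)."""
    genotypes = dict(genotypes)      # work on a copy; masking sets a value to None
    masked = 0
    while True:
        errors = flagged(parents, genotypes)
        if not errors:
            return genotypes, masked
        if top_down:
            # Only mask the flagged individuals in the earliest generation.
            top = min(generation[i] for i in errors)
            errors = {i for i in errors if generation[i] == top}
        for i in errors:
            genotypes[i] = None
            masked += 1


parents = {"C1": ("F1", "M1"), "C2": ("F1", "M1"), "G1": ("C1", "S")}
genotypes = {"F1": ("A", "A"), "M1": ("A", "A"), "C1": ("G", "G"),
             "C2": ("A", "A"), "S": ("A", "G"), "G1": ("G", "A")}
generation = {"F1": 0, "M1": 0, "C1": 1, "C2": 1, "S": 1, "G1": 2}

all_clean, all_masked = clean(parents, genotypes, generation, top_down=False)
td_clean, td_masked = clean(parents, genotypes, generation, top_down=True)
print(all_masked, td_masked)
```

In this toy case, masking everything flagged removes three of the six data points, while masking top down removes two, because the child's value stops being impossible once its parents are missing.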
Showing Missing 
• Masked and missing data are shown in a different 
colour to error data
Representations 
• Being careful not to use any other colours in the 
interface, we can see how cleaning is going (red vs 
blue) 
• New masking interactions available through 
standard context menus (and through tables)
Visual History 
• With such a hypothetical / experimental method of 
cleaning errors, undo is a must 
– Part of Shneiderman’s mantra 
– Beyond single-step, branching history
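A branching history can be sketched as a tree of states in which undo moves to the parent and a new action after an undo starts a new branch instead of discarding the old one. The class names and the use of frozensets for states are assumptions for this example only.

```python
class HistoryNode:
    """One cleaning state plus the action that produced it."""

    def __init__(self, state, parent=None, label="start"):
        self.state = state
        self.parent = parent
        self.label = label
        self.children = []


class BranchingHistory:
    """Undo tree: earlier branches are kept, never overwritten."""

    def __init__(self, initial_state):
        self.root = HistoryNode(initial_state)
        self.current = self.root

    def apply(self, label, new_state):
        node = HistoryNode(new_state, parent=self.current, label=label)
        self.current.children.append(node)
        self.current = node
        return node

    def undo(self):
        if self.current.parent is not None:
            self.current = self.current.parent
        return self.current.state

    def jump_to(self, node):
        """Return to any earlier state, on any branch."""
        self.current = node
        return node.state


history = BranchingHistory(frozenset())
first = history.apply("mask marker m1", frozenset({"m1"}))
history.undo()
second = history.apply("mask individual i7", frozenset({"i7"}))
restored = history.jump_to(first)   # the first attempt is still reachable
```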
Final Interface
Experiment 
• Genotype Checker vs VIPER+ interfaces 
• Both run using the same underlying data checking 
algorithm 
• Same dataset 
• 11 Biologists/Geneticists/Bioinformaticians at The 
Roslin Institute 
• Asked them to attempt a pair of representative 
tasks with both interfaces (split into 12 questions)
Experiment - Objective 
• Over the whole question set there was no objective 
difference, but one did emerge when we considered 
questions that involved pedigree exploration 
[Chart: GenotypeChecker vs VIPER; horizontal axis 1 to 11, vertical axis 0 to 12]
Experiment - Objective 
[Chart: Genotype Checker vs VIPER; horizontal axis 1 to 11, vertical axis 0 to 8]
Experiment - Subjective 
Question                                                        1   2   3   4   5   Median
Finding structural information on a pedigree                    7   1   2   1   0   1
Finding descendants of an individual                            8   2   0   1   0   1
Finding ancestors of an individual                              7   3   1   0   0   1
Finding error information on a single individual                4   1   1   4   1   3
Finding error information on a single marker                    3   3   2   3   0   2
Distinguishing between different types of error                 7   2   2   0   0   1
Tracing errors to a shared parent                               8   0   2   1   0   1
Finding error information on a single family                    7   1   2   1   0   1
Comparing errors between related families (one shared parent)   8   1   1   1   0   1
Masking errors                                                  1   2   4   3   1   3
Overall understanding of errors                                 5   1   4   1   0   2
Overall ease of use                                             5   2   3   0   1   2
Key: 1 = Strongly prefer VIPER (VP), 3 = No preference, 5 = Strongly prefer Genotype Checker (GC); Median = median of the 11 responses
Experiment - Observations 
• A lot of incorrect/skipped answers in both scenarios 
– GC 61/132 = 46% 
– VP 45/132 = 34% 
• These users were occasional users of cleaning 
software but it does show that Pedigree Cleaning is 
hard 
• Excelitis – Biologists love Excel. The first move of 
many was to investigate the tables of error info 
rather than the main pedigree visualisation
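The two percentages can be checked with a few lines of arithmetic, assuming the denominator of 132 is the 11 participants multiplied by the 12 questions. The function name is illustrative.

```python
participants = 11
questions = 12
total_answers = participants * questions   # 132 answers per interface


def error_rate(wrong_or_skipped, total):
    """Percentage of incorrect or skipped answers, rounded to a whole number."""
    return round(100 * wrong_or_skipped / total)


gc_rate = error_rate(61, total_answers)    # Genotype Checker
vp_rate = error_rate(45, total_answers)    # VIPER
print(total_answers, gc_rate, vp_rate)
```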
End 
• Thanks for listening 
• Sponsored by BBSRC 
• http://www.bioinformatics.roslin.ed.ac.uk/viper/


Editor's notes

  1. Animation is blank, then show 1 marker and errors, then show lots of markers (2 clicks)
  2. Needed to extend it from “here is a problem” to “here’s how to solve the problem”
  3. Difference between errors that render visualisation useless, and errors that visualisation exposes.
  4. When our data is wrong or missing we’re completely certain it’s wrong or missing
  5. Even in something as constrained as SNP markers with only 16 different combinations, we can most of the time only restrict to a set of possible values.
  6. Need diagram
  7. Need diagram
  8. Note, we do examples in this style as it’s better for small pedigrees but no good for large pedigrees
  9. Gradually see change from red to blue
  10. Need diagram
  11. Most hadn’t used visualisation software for pedigree cleaning, two had used GC before