BioVis 2013 Presentation of VIPER paper
J. Kennedy, M. Graham, T. Paterson, and A. Law, "Visual Cleaning of Genotype Data," Proc. 3rd IEEE Symposium on Biological Data Visualization, pp. 105-112, 2013, doi:10.1109/BioVis.2013.6664353.
The videos are missing, and the animations on the error inheritance slides are all messed up after slideshare conversion... but everything else is ok.
1. Visual Cleaning of Genotype Data
Jessie Kennedy, Martin Graham
Edinburgh Napier University
Trevor Paterson, Andy Law
The Roslin Institute, University of Edinburgh
2. Background
• VIPER is a visualisation for spotting areas of error
(impossible inheritance) in pedigree genotype
datasets
[Figure: pedigree structure with a genotype (an allele pair such as G | G or T | A) shown for each individual at one marker; many more markers, with similar data per marker]
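As a rough illustration of what "impossible inheritance" means, the sketch below checks a single child against its two parents at one marker. The function name and the tuple representation of genotypes are assumptions made for this example; the checking algorithm behind VIPER is not described in these slides and is likely more involved.

```python
def is_consistent(child, sire, dam):
    """Return True if the child's genotype could be inherited from its parents.

    Each genotype is a pair of alleles such as ("G", "T"), or None if the
    genotype is missing. Missing data places no constraint (an assumption
    of this sketch).
    """
    if child is None:
        return True

    def can_give(parent, allele):
        # A missing parent genotype could have supplied any allele.
        return parent is None or allele in parent

    a, b = child
    # One allele must come from the sire and the other from the dam.
    return (can_give(sire, a) and can_give(dam, b)) or \
           (can_give(sire, b) and can_give(dam, a))


ok = is_consistent(("G", "T"), ("G", "G"), ("T", "A"))        # True
bad = is_consistent(("T", "C"), ("G", "G"), ("G", "A"))       # False
unknown_sire = is_consistent(("C", "C"), None, ("C", "A"))    # True
```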
3. Background
• The visualisation aggregated errors across markers
and displayed them as offspring groups
– Along with ancillary tables and bar charts
• For it to be a useful biological tool, it needed to be
extended to become a data cleaning application
4. Background
• Data Wrangling
– Fixing unreliable or useless data
– General Purpose vs Specific Task
• General Purpose Tools
– Wrangler / Google Refine
– Tabular data
• Ours is a Specific Task
– Remove the errors as they break further analyses
– Fixing errors often creates new ones as our data is an
inheritance graph of related data rather than a table
5. Background
• Error Visualisation Topics (in order of volume of work)
– Uncertainty visualisation – show bounds of reliability
– Missing data visualisation – is data present
• Usually the bane of visualisation rather than the aim
– Correctness visualisation – is data right
6. Data Cleaning
• We cover missing data and correctness. For us...
– Incorrect data – bad.
– Missing (incomplete) data – manageable.
• Cleaning ≠ Correcting
– Correction is preferable, but often impossible
• We clean by deleting erroneous data points and
inferring data from ancestor individuals
– We swap wrong data for missing data
7. Data Cleaning - Operations
• Four basic masking operations
1. Mask markers
2. Mask individuals
3. Mask single data points
4. Break relationships
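The four operations can be sketched against a simple in-memory model. The dictionary layout and function names below are hypothetical, chosen only for illustration; they are not VIPER's internal data structures. Masking replaces a genotype with `None` (missing) rather than deleting the record.

```python
# Hypothetical data model (not VIPER's internals):
#   genotypes: {(individual, marker): (allele, allele) or None}
#   parents:   {child: (sire, dam)}, with None for an unknown parent

def mask_marker(genotypes, marker):
    """Mask every data point recorded for one marker."""
    for ind, m in list(genotypes):
        if m == marker:
            genotypes[(ind, m)] = None

def mask_individual(genotypes, individual):
    """Mask every data point recorded for one individual."""
    for ind, m in list(genotypes):
        if ind == individual:
            genotypes[(ind, m)] = None

def mask_point(genotypes, individual, marker):
    """Mask a single individual/marker data point."""
    genotypes[(individual, marker)] = None

def break_relationship(parents, child, parent):
    """Cut one child/parent link; this applies across all markers."""
    sire, dam = parents[child]
    parents[child] = (None if sire == parent else sire,
                      None if dam == parent else dam)


genotypes = {
    ("c1", "m1"): ("G", "T"), ("c1", "m2"): ("A", "A"),
    ("p1", "m1"): ("G", "G"), ("p1", "m2"): ("A", "C"),
}
parents = {"c1": ("p1", "p2")}

mask_point(genotypes, "c1", "m1")
mask_marker(genotypes, "m2")
break_relationship(parents, "c1", "p1")
```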
8. Data Cleaning - Markers
• Markers are independent of each other.
– Masking one marker doesn’t change the errors in any
other markers
• Thus markers with lots of errors can be quickly
removed with no side-effect
– Early version in VIPER hid errors (but didn’t do anything to
the underlying data)
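Because markers are independent, removing a noisy marker leaves the error counts of every other marker untouched. A minimal sketch, assuming errors are available as a list of (individual, marker) pairs; that representation and the threshold parameter are assumptions of this example:

```python
from collections import Counter

def noisy_markers(errors, threshold):
    """Return the markers that have more than `threshold` errors."""
    counts = Counter(marker for _, marker in errors)
    return {marker for marker, n in counts.items() if n > threshold}

def drop_markers(errors, markers):
    """Errors that remain once the given markers are masked."""
    return [(ind, m) for ind, m in errors if m not in markers]


errors = [("i1", "m1"), ("i2", "m1"), ("i3", "m1"), ("i1", "m2")]
to_mask = noisy_markers(errors, threshold=2)
remaining = drop_markers(errors, to_mask)
```

Here `m1` is masked and the single error on `m2` is unaffected, which is the "no side-effect" property stated above.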
9. Data Cleaning - Individuals
• Wanted to adopt the same approach...
– But something odd happened.
– Removing individuals changes the error counts of other
individuals
• Because individuals inherit from each other
• So e.g. removing every individual with > 5 errors
produced new individuals with > 5 errors.
10. Data Cleaning - Individuals
• Some errors turned out to simply drop from one
generation to the next
– Literal “chase to the bottom”, lots of lost data
• In these situations it is often necessary to break a
child/parent relationship across all markers in the
pedigree
– Which is where the fourth masking operation originates
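The way an error can drop a generation can be reproduced on a toy pedigree. The sketch below uses a deliberately simplified inference rule (a masked individual may carry any allele its parents could carry); this rule, the names, and the single-marker data are assumptions for illustration and not the algorithm used by VIPER.

```python
def possible_alleles(ind, genotypes, parents):
    """Alleles an individual could carry; None means unconstrained."""
    observed = genotypes.get(ind)
    if observed is not None:
        return set(observed)
    if ind not in parents:
        return None  # founder, unknown parent, or broken link
    sire, dam = parents[ind]
    s = possible_alleles(sire, genotypes, parents)
    d = possible_alleles(dam, genotypes, parents)
    if s is None or d is None:
        return None
    return s | d

def has_error(ind, genotypes, parents):
    """True if an observed genotype cannot come from the two parents."""
    observed = genotypes.get(ind)
    if observed is None or ind not in parents:
        return False
    sire, dam = parents[ind]
    s = possible_alleles(sire, genotypes, parents)
    d = possible_alleles(dam, genotypes, parents)

    def ok(allele, pool):
        return pool is None or allele in pool

    a, b = observed
    return not ((ok(a, s) and ok(b, d)) or (ok(b, s) and ok(a, d)))

def find_errors(genotypes, parents):
    return {ind for ind in genotypes if has_error(ind, genotypes, parents)}


# Grandparents GS and GD, parent P, P's mate M, child C (one marker).
parents = {"P": ("GS", "GD"), "C": ("P", "M")}
genotypes = {"GS": ("G", "G"), "GD": ("G", "G"),
             "P": ("T", "T"), "M": ("T", "T"), "C": ("T", "T")}

before = find_errors(genotypes, parents)      # the error sits on P
genotypes["P"] = None                         # mask P
after_mask = find_errors(genotypes, parents)  # the error has dropped to C
parents["C"] = (None, "M")                    # break the C/P relationship
after_cut = find_errors(genotypes, parents)   # no errors remain
```

Masking P only moves the error down to C, because P's genotype is now inferred from its own parents. Breaking the link removes the error at the cost of one relationship instead of another masked individual.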
11. Masking - 1
[Figure: example pedigree with a genotype shown for each individual, e.g. A/G, G/T, C/C]
www.napier.ac.uk/iidi
• Mask all errors
• Recheck for errors
• Repeat
• Lose 50% of data
12. Masking - 2
[Figure: example pedigree with genotypes, as on the previous slide]
• Mask errors top down
• Recheck for errors
• Repeat
• Lose 25% of data
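The two strategies so far differ only in which flagged individuals are masked on each pass. A minimal sketch of the mask / recheck / repeat loop follows; the `find_errors` callable, the `generation` mapping, and the toy checker are all hypothetical stand-ins, and the toy data is not the pedigree from the slides.

```python
def clean(genotypes, find_errors, generation=None):
    """Mask flagged individuals, recheck, and repeat until no errors remain.

    find_errors(genotypes) must return the set of individuals currently in
    error and must never flag a masked (None) individual.
    If `generation` (individual -> generation number, founders lowest) is
    given, only the topmost flagged generation is masked on each pass.
    Returns the masked individuals in the order they were masked.
    """
    masked = []
    while True:
        errors = find_errors(genotypes)
        if not errors:
            return masked
        if generation is not None:
            top = min(generation[ind] for ind in errors)
            errors = {ind for ind in errors if generation[ind] == top}
        for ind in sorted(errors):
            genotypes[ind] = None
            masked.append(ind)


# Toy stand-in for a real checker: P is wrong, and C only appears
# wrong while P's bad genotype is still present.
def toy_errors(g):
    errors = set()
    if g["P"] is not None:
        errors.add("P")
        if g["C"] is not None:
            errors.add("C")
    return errors


data = {"P": ("T", "T"), "C": ("G", "A")}
mask_all = clean(dict(data), toy_errors)
top_down = clean(dict(data), toy_errors, generation={"P": 1, "C": 2})
```

In this toy case masking everything loses both data points, while the top-down pass masks only P and C's error disappears on the recheck.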
13. Masking - 3
[Figure: example pedigree with genotypes, as on the previous two slides]
• Mask errors top down + cut links
• Recheck for errors
• Repeat
• Lose <20% of data
14. Showing Missing
• Masked and missing data are shown in a different
colour to error data
15. Representations
• Being careful not to use any other colours in the
interface, we can see how cleaning is going (red vs
blue)
• New masking interactions available through
standard context menus (and through tables)
16. Visual History
• With such a hypothetical / experimental method of
cleaning errors, undo is a must
– Part of Shneiderman’s mantra
– Beyond single-step, branching history
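A branching history can be modelled as a tree of states in which undo moves to the parent node and a new action after an undo starts a sibling branch instead of overwriting the old one. The class and method names below are invented for this sketch and do not describe VIPER's implementation.

```python
class HistoryNode:
    def __init__(self, label, state, parent=None):
        self.label = label      # description of the action taken
        self.state = state      # snapshot of what is masked at this point
        self.parent = parent
        self.children = []


class BranchingHistory:
    def __init__(self, initial_state):
        self.root = HistoryNode("start", initial_state)
        self.current = self.root

    def apply(self, label, new_state):
        """Record a new action as a child of the current node."""
        node = HistoryNode(label, new_state, parent=self.current)
        self.current.children.append(node)
        self.current = node
        return node

    def undo(self):
        """Step back one action; earlier branches are kept."""
        if self.current.parent is not None:
            self.current = self.current.parent
        return self.current.state

    def jump(self, node):
        """Return to any previously visited state."""
        self.current = node
        return node.state


history = BranchingHistory(frozenset())
first = history.apply("mask marker m1", frozenset({"m1"}))
history.undo()
history.apply("mask individual i7", frozenset({"i7"}))
```

After these calls the root has two branches, so the abandoned "mask marker m1" attempt can still be revisited with `history.jump(first)`.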
18. Experiment
• Genotype Checker vs VIPER+ interfaces
• Both run using the same underlying data checking
algorithm
• Same dataset
• 11 Biologists/Geneticists/Bioinformaticians at The
Roslin Institute
• Asked them to attempt a pair of representative
tasks with both interfaces (split into 12 Q’s)
19. Experiment - Objective
• Over the whole question set there was no objective
difference, but one did emerge when we considered
questions that involved pedigree exploration
[Chart: results per participant (x-axis: participants 1-11; y-axis scale 0-12), GenotypeChecker vs VIPER]
20. Experiment - Objective
• Over the whole question set there was no objective
difference, but one did emerge when we considered
questions that involved pedigree exploration
[Chart: results per participant (x-axis: participants 1-11; y-axis scale 0-8), GenotypeChecker vs VIPER]
21. Experiment - Subjective
Question                                                        1   2   3   4   5   Median
Finding structural information on a pedigree                    7   1   2   1   0   1
Finding descendants of an individual                            8   2   0   1   0   1
Finding ancestors of an individual                              7   3   1   0   0   1
Finding error information on a single individual                4   1   1   4   1   3
Finding error information on a single marker                    3   3   2   3   0   2
Distinguishing between different types of error                 7   2   2   0   0   1
Tracing errors to a shared parent                               8   0   2   1   0   1
Finding error information on a single family                    7   1   2   1   0   1
Comparing errors between related families (one shared parent)   8   1   1   1   0   1
Masking errors                                                  1   2   4   3   1   3
Overall understanding of errors                                 5   1   4   1   0   2
Overall ease of use                                             5   2   3   0   1   2
Key: 1 = Strongly prefer VIPER (VP), 3 = No preference, 5 = Strongly prefer GC. Cells give the number of participants choosing each score; Median = median score.
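The median preference for each question can be recomputed from the response counts. A small sketch, assuming each row lists the number of participants who chose scores 1 to 5 in order and that the total is odd (11 here), so the median is a single score:

```python
def likert_median(counts):
    """Median score given counts[i] = number of responses with score i + 1."""
    total = sum(counts)
    target = (total + 1) // 2  # position of the middle response
    running = 0
    for score, n in enumerate(counts, start=1):
        running += n
        if running >= target:
            return score


masking = likert_median([1, 2, 4, 3, 1])      # "Masking errors" row
tracing = likert_median([8, 0, 2, 1, 0])      # "Tracing errors to a shared parent" row
single_ind = likert_median([4, 1, 1, 4, 1])   # "...on a single individual" row
```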
22. Experiment - Observations
• A lot of incorrect/skipped answers in both scenarios
(132 answers per interface: 11 participants × 12 questions)
– GC 61/132 = 46%
– VP 45/132 = 34%
• These users were occasional users of cleaning
software but it does show that Pedigree Cleaning is
hard
• Excelitis – Biologists love Excel. The first move of
many was to investigate the tables of error info
rather than the main pedigree visualisation
23. End
• Thanks for listening
• Sponsored by BBSRC
• http://www.bioinformatics.roslin.ed.ac.uk/viper/
Editor's notes
Animation is blank, then shows 1 marker and its errors, then shows lots of markers (2 clicks)
Needed to extend it from “here is a problem” to “here’s how to solve the problem”
Difference between errors that render a visualisation useless and errors that a visualisation exposes.
When our data is wrong or missing we’re completely certain it’s wrong or missing
Even in something as constrained as SNP markers, with only 16 different combinations, most of the time we can only restrict a value to a set of possible values.
Need diagram
Note, we do examples in this style as it’s better for small pedigrees but no good for large pedigrees
Gradually see change from red to blue
Need diagram
Most hadn’t used visualisation software for pedigree cleaning, two had used GC before