Final VIPER presentation at BioVis 2013

Visual Cleaning of Genotype Data
Jessie Kennedy, Martin Graham
Edinburgh Napier University
Trevor Paterson, Andy Law
The Roslin Institute, University of Edinburgh

Background
• VIPER is a visualisation for spotting areas of error
(impossible inheritance) in pedigree genotype
datasets
[Figure: pedigree structure with a genotype (e.g. G | G, T | A) shown for each individual at one marker; many more markers, with similar data per marker]
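
The "impossible inheritance" idea can be illustrated with a minimal Python sketch. This is not VIPER's actual checking algorithm: the function name `is_consistent` and the tuple representation of a genotype are assumptions made for this example, and it simplifies by never flagging a trio in which any genotype is missing.

```python
from itertools import product

def is_consistent(child, sire, dam):
    """Return True if the child's unordered genotype can be built from
    one sire allele plus one dam allele (hypothetical helper).

    Genotypes are 2-tuples of allele letters; None means missing data,
    which this simplified check never reports as an error.
    """
    if child is None or sire is None or dam is None:
        return True
    # Every unordered allele pair the two parents could pass on.
    possible = {tuple(sorted(pair)) for pair in product(sire, dam)}
    return tuple(sorted(child)) in possible

# A G/T child is possible from G/G x T/A parents.
print(is_consistent(("G", "T"), ("G", "G"), ("T", "A")))
# A T/C child is impossible from G/G x G/A parents.
print(is_consistent(("T", "C"), ("G", "G"), ("G", "A")))
```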

Background
• The visualisation aggregated errors across markers
and displayed them as offspring groups
– Along with ancillary tables and bar charts
• For it to be a useful biological tool, it needed to be
extended into a data cleaning application

Background
• Data Wrangling
– Fixing unreliable or useless data
– General Purpose vs Specific Task
• General Purpose Tools
– Wrangler / Google Refine
– Tabular data
• Ours is a Specific Task
– Remove the errors as they break further analyses
– Fixing errors often creates new ones as our data is an
inheritance graph of related data rather than a table

Background
• Error Visualisation Topics (in order of volume of work)
– Uncertainty visualisation – show bounds of reliability
– Missing data visualisation – is the data present?
• Usually the bane of visualisation rather than the aim
– Correctness visualisation – is the data right?

Data Cleaning
• We cover missing data and correctness. For us...
– Incorrect data – bad.
– Missing (incomplete) data – manageable.
• Cleaning ≠ Correcting
– Correction is preferable, but often impossible
• We clean by deleting erroneous data points and
inferring data from ancestor individuals
– We swap wrong data for missing data

Data Cleaning - Operations
• Four basic masking operations
1. Mask markers
2. Mask individuals
3. Mask single data points
4. Break relationships
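
One way to picture the four operations is as four sets of masks layered over unchanged source data. The sketch below is an assumption for illustration only: the `Masks` class, its field names and the two lookup helpers are hypothetical and are not taken from VIPER.

```python
from dataclasses import dataclass, field

@dataclass
class Masks:
    """Hypothetical container for the four masking operations."""
    markers: set = field(default_factory=set)       # 1. masked marker ids
    individuals: set = field(default_factory=set)   # 2. masked individual ids
    points: set = field(default_factory=set)        # 3. masked (individual, marker) pairs
    broken_links: set = field(default_factory=set)  # 4. cut (child, parent) links

def visible_genotype(genotypes, masks, individual, marker):
    """Return the stored genotype, or None if any mask hides it."""
    if (marker in masks.markers
            or individual in masks.individuals
            or (individual, marker) in masks.points):
        return None
    return genotypes.get((individual, marker))

def visible_parent(parents, masks, child, role):
    """Return the child's sire or dam unless that link has been cut."""
    parent = parents.get(child, {}).get(role)
    if parent is None or (child, parent) in masks.broken_links:
        return None
    return parent

genotypes = {("ind1", "m1"): ("A", "G"), ("ind1", "m2"): ("C", "C")}
parents = {"ind1": {"sire": "s1", "dam": "d1"}}
masks = Masks()
masks.points.add(("ind1", "m1"))        # mask a single data point
masks.broken_links.add(("ind1", "s1"))  # break one relationship
print(visible_genotype(genotypes, masks, "ind1", "m1"))
print(visible_parent(parents, masks, "ind1", "dam"))
```

Keeping masks separate from the data means the original genotypes are never overwritten, so any mask can be lifted again.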

Data Cleaning - Markers
• Markers are independent of each other.
– Masking one marker doesn’t change the errors in any
other markers
• Thus markers with lots of errors can be quickly
removed with no side-effect
– An early version of VIPER hid these errors but did not change
the underlying data

Data Cleaning - Individuals
• Wanted to adopt the same approach...
– But something odd happened.
– Removing individuals changes the error counts of other
individuals
• Because individuals inherit from each other
• For example, removing every individual with more than 5 errors
produced new individuals with more than 5 errors

Data Cleaning - Individuals
• Some errors turned out to simply drop from one
generation to the next
– Literal “chase to the bottom”, lots of lost data
• In these situations it is often necessary to break a
child/parent relationship across all markers in the
pedigree
– Which is where the fourth masking operation originates
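
The way an error can drop a generation can be reproduced on a toy single-marker pedigree. The sketch below rests on a loud assumption: a masked individual's genotype is inferred from its parents only when they allow exactly one outcome. That rule, and every name in the code, is hypothetical and is not VIPER's actual inference method.

```python
from itertools import product

def possible_children(sire, dam):
    """All unordered genotypes a sire/dam pair can produce."""
    return {tuple(sorted(pair)) for pair in product(sire, dam)}

def effective(ind, genotypes, parents, masked):
    """Observed genotype, or for a masked individual the genotype
    implied by its parents if only one outcome is possible."""
    if ind not in masked:
        return genotypes.get(ind)
    if ind not in parents:
        return None
    sire, dam = (effective(p, genotypes, parents, masked) for p in parents[ind])
    if sire is None or dam is None:
        return None
    options = possible_children(sire, dam)
    return next(iter(options)) if len(options) == 1 else None

def errors(genotypes, parents, masked):
    """Unmasked individuals whose genotype contradicts their parents."""
    bad = set()
    for child, (sire, dam) in parents.items():
        if child in masked:
            continue
        g = genotypes.get(child)
        s = effective(sire, genotypes, parents, masked)
        d = effective(dam, genotypes, parents, masked)
        if g is None or s is None or d is None:
            continue
        if tuple(sorted(g)) not in possible_children(s, d):
            bad.add(child)
    return bad

genotypes = {"gsire": ("G", "G"), "gdam": ("G", "G"), "mate": ("T", "T"),
             "parent": ("T", "T"), "child": ("T", "T")}
parents = {"parent": ("gsire", "gdam"), "child": ("parent", "mate")}

print(errors(genotypes, parents, masked=set()))       # error sits on "parent"
print(errors(genotypes, parents, masked={"parent"}))  # error has moved to "child"
```

Masking "parent" does not remove the inconsistency; it moves it down to "child". In this toy pedigree the error only disappears once the child is masked as well, which is the loss of data that breaking the child/parent relationship is meant to avoid.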

Masking - 1
[Figure: example pedigree with a genotype (e.g. A/G, G/T, C/C) shown for each individual]
• Mask all errors
• Recheck for errors
• Repeat
• Lose 50% of data

Masking - 2
[Figure: example pedigree with a genotype (e.g. A/G, G/T, C/C) shown for each individual]
• Mask errors top down
• Recheck for errors
• Repeat
• Lose 25% of data

Masking - 3
[Figure: example pedigree with a genotype (e.g. A/G, G/T, C/C) shown for each individual]
• Mask errors top down + cut links
• Recheck for errors
• Repeat
• Lose <20% of data
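
The mask / recheck / repeat loop and the difference between masking everything and masking top down can be sketched as follows. The `clean` function, the `find_errors` callback and the two-individual toy checker are all assumptions for illustration; the toy checker simply pretends that a bad parent genotype also makes its child look wrong, and the data-loss figures on the slides do not come from this code.

```python
def clean(find_errors, generation, top_down, max_rounds=100):
    """Mask, recheck, repeat until no errors remain.

    find_errors(masked) -> set of individuals currently in error
                           (hypothetical callback standing in for a checker)
    generation          -> dict of individual to generation number, 0 = founders
    top_down            -> if True, mask only the oldest generation in error each round
    """
    masked = set()
    for _ in range(max_rounds):
        errs = find_errors(masked)
        if not errs:
            break
        if top_down:
            oldest = min(generation[i] for i in errs)
            errs = {i for i in errs if generation[i] == oldest}
        masked |= errs
    return masked

generation = {"parent": 1, "child": 2}

def toy_checker(masked):
    # Stand-in checker: while the bad parent is unmasked, both it and
    # its child are reported as errors.
    if "parent" in masked:
        return set()
    return {"parent", "child"} - masked

lost_all = clean(toy_checker, generation, top_down=False)
lost_top_down = clean(toy_checker, generation, top_down=True)
print(sorted(lost_all), sorted(lost_top_down))
```

In this toy case masking every reported error discards both individuals, while masking top down discards only the parent, because the child's error was a consequence of the parent's.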

Showing Missing
• Masked and missing data are shown in a different
colour to error data

Representations
• Being careful not to use any other colours in the
interface, we can see how cleaning is going (red vs
blue)
• New masking interactions available through
standard context menus (and through tables)

Visual History
• With such a hypothetical / experimental method of
cleaning errors, undo is a must
– Part of Shneiderman’s mantra
– Beyond single-step, branching history
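
A branching history can be held as a tree in which undo moves up to the parent state and a new action after an undo starts a sibling branch instead of discarding the old one. The sketch below is a minimal illustration under that assumption; the class and method names are hypothetical and do not describe VIPER's implementation.

```python
class HistoryNode:
    """One cleaning action and the states that followed it."""
    def __init__(self, action, parent=None):
        self.action = action
        self.parent = parent
        self.children = []

class History:
    """Branching undo history: earlier branches are kept, never overwritten."""
    def __init__(self):
        self.root = HistoryNode(None)   # state before any cleaning
        self.current = self.root

    def do(self, action):
        node = HistoryNode(action, self.current)
        self.current.children.append(node)
        self.current = node
        return node

    def undo(self):
        if self.current.parent is not None:
            self.current = self.current.parent

    def jump(self, node):
        """Return to any earlier state, on any branch."""
        self.current = node

    def active_actions(self):
        """Actions in force at the current state, oldest first."""
        path, node = [], self.current
        while node.parent is not None:
            path.append(node.action)
            node = node.parent
        return path[::-1]

h = History()
first = h.do("mask marker M1")
branch_a = h.do("mask individual I7")
h.undo()
branch_b = h.do("break link I7-I3")
print(h.active_actions())
h.jump(branch_a)
print(h.active_actions())
```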

Experiment
• Genotype Checker vs VIPER+ interfaces
• Both run using the same underlying data checking
algorithm
• Same dataset
• 11 Biologists/Geneticists/Bioinformaticians at The
Roslin Institute
• Asked them to attempt a pair of representative
tasks with both interfaces (split into 12 Q’s)

Experiment - Objective
• Over the whole question set there was no objective
difference, but one did emerge when we considered
questions that involved pedigree exploration
[Figure: bar chart comparing GenotypeChecker and VIPER for each of the 11 participants; vertical axis 0 to 12]

Experiment - Objective
• Over the whole question set there was no objective
difference, but one did emerge when we considered
questions that involved pedigree exploration
[Figure: bar chart comparing Genotype Checker and VIPER for each of the 11 participants; vertical axis 0 to 8]

Experiment - Subjective
Question                                                        1   2   3   4   5   Median
Finding structural information on a pedigree                    7   1   2   1   0   1
Finding descendants of an individual                            8   2   0   1   0   1
Finding ancestors of an individual                              7   3   1   0   0   1
Finding error information on a single individual                4   1   1   4   1   3
Finding error information on a single marker                    3   3   2   3   0   2
Distinguishing between different types of error                 7   2   2   0   0   1
Tracing errors to a shared parent                               8   0   2   1   0   1
Finding error information on a single family                    7   1   2   1   0   1
Comparing errors between related families (one shared parent)   8   1   1   1   0   1
Masking errors                                                  1   2   4   3   1   3
Overall understanding of errors                                 5   1   4   1   0   2
Overall ease of use                                             5   2   3   0   1   2
Key: 1 = Strongly prefer VIPER (VP), 3 = No preference, 5 = Strongly prefer Genotype Checker (GC); each cell counts how many of the 11 participants gave that rating
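
The median rating for a row can be recomputed from its five counts. The short sketch below shows the arithmetic; the function name `median_rating` is an assumption for this example, and the counts are copied from rows of the table above.

```python
def median_rating(counts):
    """Median rating (1-5) from the number of participants per rating.

    counts[0] is the count for rating 1, counts[4] for rating 5.
    With an odd number of participants the median is the middle response;
    for an even number this returns the lower of the two middle responses.
    """
    middle = (sum(counts) + 1) // 2   # position of the middle response
    running = 0
    for rating, count in enumerate(counts, start=1):
        running += count
        if running >= middle:
            return rating
    raise ValueError("counts must contain at least one response")

print(median_rating([7, 1, 2, 1, 0]))  # structural information on a pedigree
print(median_rating([4, 1, 1, 4, 1]))  # error information on a single individual
print(median_rating([1, 2, 4, 3, 1]))  # masking errors
```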

Experiment - Observations
• Many incorrect or skipped answers with both interfaces
(132 answers each = 11 participants × 12 questions)
– GC 61/132 = 46%
– VP 45/132 = 34%
• The participants were only occasional users of cleaning
software, but the result still shows that pedigree cleaning is
hard
• Excelitis – Biologists love Excel. The first move of
many was to investigate the tables of error info
rather than the main pedigree visualisation

End
• Thanks for listening
• Sponsored by BBSRC
• http://www.bioinformatics.roslin.ed.ac.uk/viper/
