16.
A language all scientists should know
How R helped me look at billions of genotypes and how it can help you too
Mitchell Bekritsky
WSBS Graduate Student
17.
What is R?
• Language for statistical analysis, data manipulation and graphics
• Open source
• Flexible language
• Powerful built-in functions
• Strong user community
• Publication quality graphs
• Free!
Graphic from http://blenditbayes.blogspot.com/2013/06/visualising-crime-hotspots-in-england_25.html
18.
Who uses R?
Source: http://www.revolutionanalytics.com/what-is-open-source-r/companies-using-r.php
19.
What is R used for?
• Movie recommendations
• Clinical drug development
• Credit risk analysis
• News graphics
• Tailoring online advertising
• Modeling oil spills
• Predicting economic activity
• Predicting election outcomes
Graphic from http://www.nytimes.com/interactive/2009/06/25/arts/0625-jackson-graphic.html
21.
How R helped me see my data
• First time looking at microsatellite genotypes
• How many microsatellites differ from reference genome?
• By how much?
Problems:
– Lots of data (4.7 million genotypes)
– Complex information
– Too big for Excel
– No good graphics in Excel either
22.
One of my first graphs in R
Lessons learned about my data
• Lots of microsatellites differ from the reference by a little bit
• Thousands differ by ±20 bp
• 8.27% of all microsatellites differ from the reference (~400k)
Lessons learned about my graph
• This is a terrible graph
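The two questions behind this graph (how many microsatellites differ from the reference, and by how much) can be sketched in a few lines of R. This is only an illustration on simulated data: the column names `ref_len` and `obs_len`, the table size, and the probabilities are assumptions made up for the example, not the data or code from the talk.

```r
# Toy data, not the talk's data: simulated allele lengths (in bp)
# for 10,000 hypothetical microsatellite genotypes.
set.seed(1)
n <- 10000
ref_len <- sample(8:40, n, replace = TRUE)           # reference allele length
shift <- sample(c(0, -2, -1, 1, 2), n, replace = TRUE,
                prob = c(0.92, 0.02, 0.02, 0.02, 0.02))
obs_len <- ref_len + shift                           # observed allele length

genotypes <- data.frame(ref_len, obs_len)
genotypes$diff <- genotypes$obs_len - genotypes$ref_len

# How many genotypes differ from the reference (as a percentage)?
pct_diff <- 100 * mean(genotypes$diff != 0)

# By how much? Count each non-zero difference.
diff_counts <- table(genotypes$diff[genotypes$diff != 0])

print(pct_diff)
print(diff_counts)

# A first, rough plot of the same information.
barplot(diff_counts,
        xlab = "Difference from reference (bp)",
        ylab = "Number of genotypes")
```

With a real genotype table, only the first block (the simulation) would need to be swapped for a call that reads the data in, such as `read.table()`.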
23.
A bad R graph is better than no R graph
Bad graphs helped me
• Understand my data better
• Improve my analyses
• Improve how I communicate my data
• R has incredible flexibility for graphing: if you can dream it, you can probably build it
My best R graphs make one point clearly without clutter
26.
How R saved my thesis
• Processing lots of sequencing data from hundreds of people
• Too many people and processes to monitor every step of the pipeline by eye while the data were being processed
Sanity check
• After processing, did the data look bi-allelic?
No!!
28.
Troubleshooting using R
• People don’t actually have massive deletions and amplifications
• A bug in my pipeline was deleting files, which removed large chunks of chromosomes
• Thanks to R, I found the people this had happened to, tracked down the bug, and didn’t report massive CNVs in autistic children
Side note
• If it looks too good to be true, it probably is
29.
R helped me build a better genotyper
• Some non-reference alleles aren’t covered well
• Leads to incorrect genotype calls
Problem
• How do I develop a smarter genotyper and know that it works?
30.
R helped me build a better genotyper
[Figure: chr19:54772760, A repeat, reference length 8. Axes: 8 bp allele coverage and 10 bp allele coverage, each from 0 to 100. Legend "Genotypes": 10|-1, 10|10, 8|-1, 8|10, 8|8.]
31.
Modeling genotypes in R
• Built a model for biased genotypes in R
• Model helped me build a more accurate genotyper
• When applied to real data, clear improvements
32.
R finds de novo mutations for me
• >300 million genotypes
• How do I find de novo mutations in all that data?
R to the rescue!
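As a rough illustration of the idea, a de novo candidate is a child allele that appears in neither parent. The sketch below applies that check to a made-up trio table; the column names, loci and allele lengths are all hypothetical, and the filter is deliberately simplistic (it ignores coverage, genotype confidence and which parent each allele came from), so it should not be read as the method used in the talk.

```r
# Toy trio table: one row per locus, allele lengths in bp (made-up values).
trios <- data.frame(
  locus  = c("locA", "locB", "locC", "locD"),
  mom_a1 = c(8, 12, 10, 15), mom_a2 = c(8, 14, 10, 15),
  dad_a1 = c(8, 12, 11, 15), dad_a2 = c(10, 12, 11, 17),
  kid_a1 = c(8, 12, 10, 15), kid_a2 = c(10, 16, 11, 17),
  stringsAsFactors = FALSE
)

# All four parental alleles per locus, as a numeric matrix.
parent_alleles <- as.matrix(trios[, c("mom_a1", "mom_a2", "dad_a1", "dad_a2")])

# TRUE where the given child allele matches none of the parental alleles.
# The comparison recycles the child vector down each column, so every
# row is compared against that row's own child allele.
novel <- function(kid) rowSums(parent_alleles == kid) == 0

trios$de_novo <- novel(trios$kid_a1) | novel(trios$kid_a2)

# Loci with at least one child allele not seen in either parent.
candidates <- trios$locus[trios$de_novo]
print(candidates)
```

Because the check is vectorized over rows, the same few lines apply unchanged to a table with millions of loci, which is the appeal of doing this kind of filtering in R.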
33.
What R has done for me
Data mining
• Finding de novo mutations
• Quality control for my data
Data manipulation
• Converting raw read counts to genotypes
Data simulation and modeling
• Finding ways to improve my genotyper
Data visualization
34.
R has extensive support for biologists
Bioconductor is an incredible resource for biological analyses in R
• Microarrays
• Differential expression (DESeq, edgeR, cummeRbund)
• Gene models
• Flow cytometry (flowCore, flowStats, flowViz)
• Interacting with Ensembl, COSMIC, Gramene, etc. (biomaRt)
35.
Installing R
• R can be downloaded from r-project.org
• R runs on PCs, Macs and Linux computers
• The R project website has an R manual to get you started
36.
Working in R
Native R interface can be hard to work with
• Lots of windows
• Difficult to keep things organized
37.
RStudio interface
• All your variables, help pages, script windows and consoles in one place
• Highlights R code for easier programming
• Tabbed windows for multiple scripts
• History saves all previous commands, plot history saves all previous plots
• Find it at rstudio.com
38.
Learning R
Many online tutorials
• R has its own introduction
• Statistics Using R with Biological Examples
Take interesting data, use it to explore R
• Plot, graph, use statistical tests
Ask someone who knows R
• Getting started is pretty easy
• Learn what you need when you need it
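One way to follow the "take interesting data, use it to explore R" advice without needing any data of your own is to start from a dataset that ships with R. The sketch below uses the built-in `iris` dataset; the choice of dataset, plot and test is just an example, not a recommendation from the talk.

```r
# Built-in dataset: sepal and petal measurements for three iris species.
data(iris)
str(iris)

# Plot: petal length by species.
boxplot(Petal.Length ~ Species, data = iris,
        ylab = "Petal length (cm)")

# Statistical test: do setosa and versicolor differ in petal length?
# Keep two species and drop the unused third factor level.
two_species <- droplevels(subset(iris, Species %in% c("setosa", "versicolor")))
result <- t.test(Petal.Length ~ Species, data = two_species)
print(result$p.value)
```

Typing `?boxplot` or `?t.test` at the console opens the help page for each function, which is a quick way to learn what you need when you need it.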
40.
The Bioscience Enterprise Club is dedicated to helping CSHL’s science research
professionals and alumni cultivate and leverage their cross-disciplinary skill sets and
expertise to transition into diverse careers.
41.
Current Exchange is CSHL’s very own student-run magazine. We feature articles about
science aimed at a general audience. Check out our inaugural issue at issuu.com/currentexchange
Send your articles to raboukha@cshl.edu by November 5, 2013