SlideShare une entreprise Scribd logo
1  sur  42
Télécharger pour lire hors ligne
11 September 2017
Reproducible Research
To infinity and beyond
!1
Outline - getting
to reproducibility
✤ What does reproducibility
mean?
✤ Should I design my data
analysis as code?
✤ What is data & what is
metadata?
✤ Should I keep a digital
notebook?
2https://github.com/MorrellLAB/sequence_handling
Inspirations for talk
✤ Yosef Cohen - Statistics and Data with R
✤ Vince Buffalo - Bioinformatics Data Skills
✤ Karl Broman - Initial steps toward reproducible research
✤ Greg Wilson, Titus Brown, et al. - Best practices for
scientific computing
✤ Every dissertation, manuscript, or paper I read!
!3
NSF Statement
(today)
4
What does
reproducible mean?
✤ Person “skilled in the art” can
reproduce results
✤ Or collect new data and get to
similar results
5
Why should research
be reproducible?
✤ Public investment requires
reliable results
✤ Research training is NOT
intended to teach a boutique
set of skills only an artisan can
replicate
✤ Work is replicated and
updated; current human
genome is version 38
6
Why this does
[not] apply to me?
✤ “I work on a highly specialized
system”
✤ “Croak length in
subterranean Iberian frogs”
✤ Whenever possible approaches
and methods should be
“generalizable”
7
Should I design my
analysis as code?
✤ Easier said than done
✤ Goal should be self-contained
code that executes analyses
✤ Hand-editing any file or
intermediate step is a no-no
✤ Primarily because you will
have to do it repeatedly, and
could make mistakes
8
“You all grew up in Latin America, you just don’t know it
yet.”
–Joseph Whitecotton - anthropologist - to U. of Oklahoma
undergrads
!9
“You are all computational biologists, you just don’t know it
yet.”
–Peter Morrell - biologist - to seminar audience at U. of Minnesota
!10
Transition to
executable analysis
✤ Often includes code in Perl,
Python, R, BASH, C++ or a
combination of languages
✤ Elegance of design and
simplicity of interpretation
increase with experience
11
Reproducibility starts with good
tools that are well vetted
✤ “I wrote my own sequence read mapper” - bad idea
✤ Unless you are a computer science grad student
✤ Even then, bad idea because there are dozens
already
✤ Testing of these tools is crowd sourced
!12
“Use your [power] tools whenever possible.”
–Leonard (Top) Hartin - my grandfather
!13
Power tools
✤ Primary tools for most analyses
are developed by professional
programmers and statisticians
✤ Use them whenever possible!
✤ BWA - sequence read mapping
✤ R/QTL - biparental QTL
mapping
✤ Bedtools - for genome slicing
and dicing
14
Good code is
well commented
✤ Comments should explain the
“why” of a step taken in code
✤ Should be sufficient for anyone
“skilled in the art”
✤ Avoid “!@##$!!@#*%” in
comments; code eventually
becomes public!
15
If you only learn one
language - BASH
✤ Unix shell (BASH) scripting
provides access to pro tools
✤ Necessary to use a
supercomputer
✤ Bioinformatics Data Skills -
Chapters 3 & 7
16
Example shell
script
✤ Avoid hard coding file paths in
the middle of a script
✤ Hard coding nearly
eliminates reproducibility
✤ Set paths as variable names at
the top of the script
✤ WORKSHOP=/panfs/roc/
groups/9/morrellp/
pmorrell/Workshop
17
What is a concurrent versioning
system?
✤ Track files and changes over versions of the files
✤ Why is it relevant for my system?
✤ Remember, “I work on croak length of subterranean
frogs”
✤ Rerunning analysis during manuscript revision, on
a similar project years later, or by another group
analyzing rooster crows
!18
Git & Github are preferred options
✤ Git repositories can be local or shared on the internet
✤ Flaky WiFi and poor internet connectivity are not
friendly for remote work
✤ Github stores repositories, can be accessed through
a web browser or cloned to another computer (or
server account)
✤ Alternatives include Bitbucket and UMN Github
!19
What types of files
belong in a repository?
✤ Plain text (with Markdown) is
preferred format
✤ Code, readme files, workflow
diagrams all belong in a
repository
✤ Data generally belong
elsewhere
20
“Code base”
repository
✤ A Git/Github repository used
across multiple projects
✤ Should be generalizable and
extensible
✤ Changes relatively slowly
✤ Are often public during
development
21
“Code base”
repository
✤ Multiple users contribute
✤ Active contributors change
over time
22
Project
repository
✤ Git/Github repository with
files for a specific project
✤ Can be specific challenges in a
data set
✤ Changes often and quickly as a
project develops
✤ Are often private until project
is published
23
What is data,
what is metadata?
✤ Actual information; e.g., the
contents of a photo
✤ Metadata includes description
of the file containing the photo
24
File Metadata
✤ All files have a physical
location, size, dates
✤ Photos have extended
attributes - EXIF data
✤ Geolocation - spatial
metadata
✤ Doesn’t include -
accelerometer or inertial
navigation info
25
!26
!27
Metadata for
RNASeq
✤ Tissue, growth stage, genotype,
time of day, light intensity
28
Hypothetical
Study
✤ Barley 8 parent MAGIC
population
✤ Reporting on population
creation, sequence and genetic
mapping
✤ Recombination rate variation
as a phenotype
✤ Genetics or PLoS Genetics
manuscript
29Macdonald & Long 2007 - Genetics
What data & metadata should we
collect?
!30
What data & metadata should we
collect?
✤ Nanopore / Illumina sequencing of each parent
✤ Skim sequencing of progeny
✤ Recombination events per chromosomal segment for
progeny
!31
Primary
manuscript
✤ Figures & tables
✤ Names of parental lines
✤ Sequence depth
✤ Crossover frequency
✤ Location, number, & effect of
QTL discovered
32Macdonald & Long 2007 - Genetics
Supplementary
material
✤ Pedigree of parental lines
✤ Number of variants from
barley Morex reference for each
parent
✤ Proportion of variants that are
coding, noncoding, deleterious,
etc.
33Kono et al. 2016 - Mol Bio Evol
Public archive
✤ Sequence Read Archive (SRA)
✤ Stores nucleotide sequence &
quality values (FASTQ files)
✤ FASTQ for skim sequencing of
progeny
34
DRUM
✤ Data needed to reproduce
research
✤ Variant Call Format (VCF) -
millions of SNPs from 8
barley parents
✤ Detailed variant annotation
for each parent
✤ Get DOI with permanent URL
to data
35
Digital notebook
✤ Store goals for analysis, output,
code snippets
✤ Entries should contain as much
metadata as possible
36
!37
!38
Broader Impacts
!39
!40
Acknowledgements
✤ Funding
✤ NSF Plant Genome
✤ USDA Biotech Risk Assessment
✤ MN Ag Experiment Station
✤ USDA National Needs
✤ MnDrive
✤ UMN Doctororal Dissertation Fellowships
!41
Acknowledgements
!42

Contenu connexe

Similaire à Reproducible research - to infinity

Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?Timothy Danford
 
Databases, Web Services and Tools For Systems Immunology
Databases, Web Services and Tools For Systems ImmunologyDatabases, Web Services and Tools For Systems Immunology
Databases, Web Services and Tools For Systems ImmunologyYannick Pouliot
 
From the Benchtop to the Datacenter: HPC Requirements in Life Science Research
From the Benchtop to the Datacenter: HPC Requirements in Life Science ResearchFrom the Benchtop to the Datacenter: HPC Requirements in Life Science Research
From the Benchtop to the Datacenter: HPC Requirements in Life Science ResearchAri Berman
 
2013 py con awesome big data algorithms
2013 py con awesome big data algorithms2013 py con awesome big data algorithms
2013 py con awesome big data algorithmsc.titus.brown
 
High-Performance Networking Use Cases in Life Sciences
High-Performance Networking Use Cases in Life SciencesHigh-Performance Networking Use Cases in Life Sciences
High-Performance Networking Use Cases in Life SciencesAri Berman
 
Spark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit
 
2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible researchYannick Wurm
 
From Zero to Nextflow 2017
From Zero to Nextflow 2017From Zero to Nextflow 2017
From Zero to Nextflow 2017Luca Cozzuto
 
How to be a bioinformatician
How to be a bioinformaticianHow to be a bioinformatician
How to be a bioinformaticianChristian Frech
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudJan Aerts
 
Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012c.titus.brown
 
tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...
tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...
tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...David Peyruc
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleAndy Petrella
 
2013 bms-retreat-talk
2013 bms-retreat-talk2013 bms-retreat-talk
2013 bms-retreat-talkc.titus.brown
 
Data Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksData Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksBICA Labs
 

Similaire à Reproducible research - to infinity (20)

Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?
 
Databases, Web Services and Tools For Systems Immunology
Databases, Web Services and Tools For Systems ImmunologyDatabases, Web Services and Tools For Systems Immunology
Databases, Web Services and Tools For Systems Immunology
 
From the Benchtop to the Datacenter: HPC Requirements in Life Science Research
From the Benchtop to the Datacenter: HPC Requirements in Life Science ResearchFrom the Benchtop to the Datacenter: HPC Requirements in Life Science Research
From the Benchtop to the Datacenter: HPC Requirements in Life Science Research
 
2013 py con awesome big data algorithms
2013 py con awesome big data algorithms2013 py con awesome big data algorithms
2013 py con awesome big data algorithms
 
High-Performance Networking Use Cases in Life Sciences
High-Performance Networking Use Cases in Life SciencesHigh-Performance Networking Use Cases in Life Sciences
High-Performance Networking Use Cases in Life Sciences
 
Spark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van Ham
 
2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
 
2016 bergen-sars
2016 bergen-sars2016 bergen-sars
2016 bergen-sars
 
2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research
 
From Zero to Nextflow 2017
From Zero to Nextflow 2017From Zero to Nextflow 2017
From Zero to Nextflow 2017
 
How to be a bioinformatician
How to be a bioinformaticianHow to be a bioinformatician
How to be a bioinformatician
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloud
 
Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012
 
Cloud bioinformatics 2
Cloud bioinformatics 2Cloud bioinformatics 2
Cloud bioinformatics 2
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...
tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...
tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Ob...
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scale
 
2013 bms-retreat-talk
2013 bms-retreat-talk2013 bms-retreat-talk
2013 bms-retreat-talk
 
Open data genomics_palermo_2017_ver03
Open data genomics_palermo_2017_ver03Open data genomics_palermo_2017_ver03
Open data genomics_palermo_2017_ver03
 
Data Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksData Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural Networks
 

Dernier

Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSSLeenakshiTyagi
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sérgio Sacani
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINsankalpkumarsahoo174
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfSumit Kumar yadav
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PPRINCE C P
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptxRajatChauhan518211
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.Nitya salvi
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxAArockiyaNisha
 

Dernier (20)

Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 

Reproducible research - to infinity

  • 1. 11 September 2017 Reproducible Research To infinity and beyond !1
  • 2. Outline - getting to reproducibility ✤ What does reproducibility mean? ✤ Should I design my data analysis as code? ✤ What is data & what is metadata? ✤ Should I keep a digital notebook? 2https://github.com/MorrellLAB/sequence_handling
  • 3. Inspirations for talk ✤ Yosef Cohen - Statistics and Data with R ✤ Vince Buffalo - Bioinformatics Data Skills ✤ Karl Broman - Initial steps toward reproducible research ✤ Greg Wilson, Titus Brown, et al. - Best practices for scientific computing ✤ Every dissertation, manuscript, or paper I read! !3
  • 5. What does reproducible mean? ✤ Person “skilled in the art” can reproduce results ✤ Or collect new data and get to similar results 5
  • 6. Why should research be reproducible? ✤ Public investment requires reliable results ✤ Research training is NOT intended to teach a boutique set of skills only an artisan can replicate ✤ Work is replicated and updated; current human genome is version 38 6
  • 7. Why this does [not] apply to me? ✤ “I work on a highly specialized system” ✤ “Croak length in subterranean Iberian frogs” ✤ Whenever possible approaches and methods should be “generalizable” 7
  • 8. Should I design my analysis as code? ✤ Easier said than done ✤ Goal should be self-contained code that executes analyses ✤ Hand-editing any file or intermediate step is a no-no ✤ Primarily because you will have to do it repeatedly, and could make mistakes 8
  • 9. “You all grew up in Latin America, you just don’t know it yet.” –Joseph Whitecotton - anthropologist - to U. of Oklahoma undergrads !9
  • 10. “You are all computational biologists, you just don’t know it yet.” –Peter Morrell - biologist - to seminar audience at U. of Minnesota !10
  • 11. Transition to executable analysis ✤ Often includes code in Perl, Python, R, BASH, C++ or a combination of languages ✤ Elegance of design and simplicity of interpretation increase with experience 11
  • 12. Reproducibility starts with good tools that are well vetted ✤ “I wrote my own sequence read mapper” - bad idea ✤ Unless you are a computer science grad student ✤ Even then, bad idea because there are dozens already ✤ Testing of these tools is crowd sourced !12
  • 13. “Use your [power] tools whenever possible.” –Leonard (Top) Hartin - my grandfather !13
  • 14. Power tools ✤ Primary tools for most analyses are developed by professional programmers and statisticians ✤ Use them whenever possible! ✤ BWA - sequence read mapping ✤ R/QTL - biparental QTL mapping ✤ Bedtools - for genome slicing and dicing 14
  • 15. Good code is well commented ✤ Comments should explain the “why” of a step taken in code ✤ Should be sufficient for anyone “skilled in the art” ✤ Avoid “!@##$!!@#*%” in comments; code eventually becomes public! 15
  • 16. If you only learn one language - BASH ✤ Unix shell (BASH) scripting provides access to pro tools ✤ Necessary to use a supercomputer ✤ Bioinformatics Data Skills - Chapters 3 & 7 16
  • 17. Example shell script ✤ Avoid hard coding file paths in the middle of a script ✤ Hard coding nearly eliminates reproducibility ✤ Set paths as variable names at the top of the script ✤ WORKSHOP=/panfs/roc/ groups/9/morrellp/ pmorrell/Workshop 17
  • 18. What is a concurrent versioning system? ✤ Track files and changes over versions of the files ✤ Why is it relevant for my system? ✤ Remember, “I work on croak length of subterranean frogs” ✤ Rerunning analysis during manuscript revision, on a similar project years later, or by another group analyzing rooster crows !18
  • 19. Git & Github are preferred options ✤ Git repositories can be local or shared on the internet ✤ Flaky WiFi and poor internet connectivity are not friendly for remote work ✤ Github stores repositories, can be accessed through a web browser or cloned to another computer (or server account) ✤ Alternatives include Bitbucket and UMN Github !19
  • 20. What types of files belong in a repository? ✤ Plain text (with Markdown) is preferred format ✤ Code, readme files, workflow diagrams all belong in a repository ✤ Data generally belong elsewhere 20
  • 21. “Code base” repository ✤ A Git/Github repository used across multiple projects ✤ Should be generalizable and extensible ✤ Changes relatively slowly ✤ Are often public during development 21
  • 22. “Code base” repository ✤ Multiple users contribute ✤ Active contributors change over time 22
  • 23. Project repository ✤ Git/Github repository with files for a specific project ✤ Can be specific challenges in a data set ✤ Changes often and quickly as a project develops ✤ Are often private until project is published 23
  • 24. What is data, what is metadata? ✤ Actual information; e.g., the contents of a photo ✤ Metadata includes description of the file containing the photo 24
  • 25. File Metadata ✤ All files have a physical location, size, dates ✤ Photos have extended attributes - EXIF data ✤ Geolocation - spatial metadata ✤ Doesn’t include - accelerometer or inertial navigation info 25
  • 26. !26
  • 27. !27
  • 28. Metadata for RNASeq ✤ Tissue, growth stage, genotype, time of day, light intensity 28
  • 29. Hypothetical Study ✤ Barley 8 parent MAGIC population ✤ Reporting on population creation, sequence and genetic mapping ✤ Recombination rate variation as a phenotype ✤ Genetics or PLoS Genetics manuscript 29Macdonald & Long 2007 - Genetics
  • 30. What data & metadata should we collect? !30
  • 31. What data & metadata should we collect? ✤ Nanopore / Illumina sequencing of each parent ✤ Skim sequencing of progeny ✤ Recombination events per chromosomal segment for progeny !31
  • 32. Primary manuscript ✤ Figures & tables ✤ Names of parental lines ✤ Sequence depth ✤ Crossover frequency ✤ Location, number, & effect of QTL discovered 32Macdonald & Long 2007 - Genetics
  • 33. Supplementary material ✤ Pedigree of parental lines ✤ Number of variants from barley Morex reference for each parent ✤ Proportion of variants that are coding, noncoding, deleterious, etc. 33Kono et al. 2016 - Mol Bio Evol
  • 34. Public archive ✤ Sequence Read Archive (SRA) ✤ Stores nucleotide sequence & quality values (FASTQ files) ✤ FASTQ for skim sequencing of progeny 34
  • 35. DRUM ✤ Data needed to reproduce research ✤ Variant Call Format (VCF) - millions of SNPs from 8 barley parents ✤ Detailed variant annotation for each parent ✤ Get DOI with permanent URL to data 35
  • 36. Digital notebook ✤ Store goals for analysis, output, code snippets ✤ Entries should contain as much metadata as possible 36
  • 37. !37
  • 38. !38
  • 40. !40
  • 41. Acknowledgements ✤ Funding ✤ NSF Plant Genome ✤ USDA Biotech Risk Assessment ✤ MN Ag Experiment Station ✤ USDA National Needs ✤ MnDrive ✤ UMN Doctororal Dissertation Fellowships !41