SlideShare une entreprise Scribd logo
1  sur  23
Télécharger pour lire hors ligne
Accessing	
  Public	
  Data	
  	
  
(from	
  NCBI)	
  Using	
  Bioconductor	
  
Na#onal	
  Ins#tutes	
  of	
  Health	
  
Objec#ves	
  
•  Navigate	
  NCBI	
  GEO	
  website	
  
•  Understand	
  NCBI	
  GEO	
  data	
  en<<es	
  and	
  rela#onships	
  
   between	
  them	
  
•  Use	
  GEOquery	
  package	
  to	
  import	
  data	
  from	
  NCBI	
  GEO	
  into	
  
   R	
  
•  Convert	
  GEOquery	
  data	
  structures	
  into	
  R	
  data	
  structures	
  
•  Use	
  GEOmetadb	
  to	
  find	
  data	
  in	
  NCBI	
  GEO	
  
•  Know	
  rela#onship	
  between	
  NCBI	
  GEO	
  and	
  NCBI	
  SRA	
  
•  Understand	
  how	
  to	
  use	
  the	
  SRAdb	
  package	
  to	
  query	
  SRA	
  
   metadata	
  
•  Use	
  SRAdb	
  package	
  to	
  control	
  the	
  Integrated	
  Genome	
  
   Viewer	
  (IGV)	
  from	
  R	
  
MIAME-­‐compliant	
  Data	
  
•  The	
  raw	
  data	
  for	
  each	
  hybridiza#on	
  (e.g.,	
  CEL	
  or	
  GPR	
  files)	
  
•  The	
  final	
  processed	
  (normalized)	
  data	
  for	
  the	
  set	
  of	
  hybridiza#ons	
  in	
  the	
  
   experiment	
  (study)	
  (e.g.,	
  the	
  gene	
  expression	
  data	
  matrix	
  used	
  to	
  draw	
  the	
  
   conclusions	
  from	
  the	
  study)	
  
•  The	
  essen#al	
  sample	
  annota#on	
  including	
  experimental	
  factors	
  and	
  their	
  
   values	
  (e.g.,	
  compound	
  and	
  dose	
  in	
  a	
  dose	
  response	
  experiment)	
  
•  The	
  experimental	
  design	
  including	
  sample	
  data	
  rela#onships	
  (e.g.,	
  which	
  
   raw	
  data	
  file	
  relates	
  to	
  which	
  sample,	
  which	
  hybridiza#ons	
  are	
  technical,	
  
   which	
  are	
  biological	
  replicates)	
  
•  Sufficient	
  annota#on	
  of	
  the	
  array	
  (e.g.,	
  gene	
  iden#fiers,	
  genomic	
  
   coordinates,	
  probe	
  oligonucleo#de	
  sequences	
  or	
  reference	
  commercial	
  
   array	
  catalog	
  number)	
  
•  The	
  essen#al	
  laboratory	
  and	
  data	
  processing	
  protocols	
  (e.g.,	
  what	
  
   normaliza#on	
  method	
  has	
  been	
  used	
  to	
  obtain	
  the	
  final	
  processed	
  data)	
  
What’s	
  in	
  GEO	
  
•  Gene	
  expression	
  profiling	
  by	
  microarray	
  or	
  next-­‐genera#on	
  
   sequencing	
  
•  Non-­‐coding	
  RNA	
  profiling	
  by	
  microarray	
  or	
  next-­‐genera#on	
  
   sequencing	
  
•  Chroma#n	
  immunoprecipita#on	
  (ChIP)	
  profiling	
  by	
  
   microarray	
  or	
  next-­‐genera#on	
  sequencing	
  
•  Genome	
  methyla#on	
  profiling	
  by	
  microarray	
  or	
  next-­‐
   genera#on	
  sequencing	
  
•  Genome	
  varia#on	
  profiling	
  by	
  array	
  (arrayCGH)	
  
•  SNP	
  arrays	
  
•  Serial	
  Analysis	
  of	
  Gene	
  Expression	
  (SAGE)	
  
•  Protein	
  arrays	
  
GEO	
  data	
  en##es	
  
•  GEO	
  Samples	
  (GSM)	
  
•  GEO	
  PlaYorm	
  (GPL)	
  
•  GEO	
  Series	
  (GSE)	
  
    –  Collec#ons	
  of	
  related	
  GSM	
  and	
  GPL	
  along	
  with	
  free-­‐text	
  
       study	
  metadata	
  
    –  Two	
  data	
  “flavors”	
  
         •  GSE	
  
         •  GSEMatrix	
  
•  GEO	
  Dataset	
  (GDS)	
  	
  
    –  Only	
  data	
  en#ty	
  that	
  is	
  curated	
  by	
  NCBI	
  GEO	
  staff	
  
    –  Typically,	
  samples	
  are	
  divided	
  into	
  sta#s#cally	
  and	
  
       biologically	
  relevant	
  groups	
  upon	
  which	
  we	
  can	
  compute	
  
NCBI	
  GEO	
  website	
  
•  h^p://www.ncbi.nlm.nih.gov/geo/	
  
GEO	
  SOFT	
  Format	
  
GEO	
  SOFT	
  Format	
  
GEOquery	
  
•  Singular	
  goal:	
  Get	
  data	
  from	
  NCBI	
  GEO	
  and	
  
   parse	
  into	
  R	
  objects	
  
•  Secondary	
  goals:	
  
    –  Get	
  supplemental	
  files	
  from	
  NCBI	
  GEO	
  
    –  Allow	
  large-­‐scale	
  data	
  mining	
  by	
  doing	
  all-­‐of-­‐the-­‐
       above	
  in	
  a	
  lossless	
  manner	
  
GEOquery	
  Data	
  Structures	
  	
  
   (GSM,	
  GPL,	
  GDS)	
  
GEOquery	
  Data	
  Structures	
  
          (GSE)	
  
GEOquery	
  Walkthrough	
  
•  h^p://watson.nci.nih.gov/~sdavis/	
  
    –  Click	
  on	
  Tutorials	
  and	
  then	
  on	
  “Accessing	
  public	
  
       data….”	
  
•  Download	
  the	
  R	
  script	
  for	
  cut-­‐and-­‐paste	
  to	
  
   follow	
  along	
  
GEOmetadb	
  
•  Finding	
  data	
  in	
  NCBI	
  GEO	
  can	
  be	
  challenging	
  
•  Some	
  inves#gators	
  need	
  large-­‐scale,	
  
   computable	
  access	
  to	
  GEO	
  metadata	
  
•  Given	
  one	
  GEO	
  en#ty	
  type,	
  it	
  may	
  be	
  useful	
  to	
  
   find	
  all	
  rela#onships	
  with	
  other	
  en#ty	
  types	
  
   (eg.,	
  find	
  all	
  hgu133a	
  arrays	
  in	
  GEO)	
  
GEOmetadb	
  
•  What	
  is	
  GEOmetadb?	
  
   –  	
  A	
  Bioconductor	
  package	
  that	
  offers	
  an	
  alterna#ve	
  
      to	
  eU#ls	
  data	
  mining	
  of	
  GEO	
  metadata	
  
   –  A	
  SQLite	
  database	
  which	
  stores	
  all	
  the	
  GEO	
  
      metadata	
  in	
  a	
  rela#onal	
  format	
  for	
  easy	
  querying,	
  
      par#cularly	
  in	
  bulk	
  
•  What	
  is	
  GEOmetadb	
  NOT?	
  
   –  We	
  have	
  not	
  a^empted	
  to	
  alter	
  or	
  standardize	
  in	
  
      any	
  way	
  the	
  data	
  from	
  GEO	
  
GEOmetadb	
  Schema	
  
GEOmetadb	
  Walkthrough	
  
SRAdb	
  
•  Similar	
  goals	
  to	
  GEOmetadb,	
  but	
  with	
  SRA	
  
   data	
  
•  Data	
  for	
  SRA	
  (ENA,	
  DRA)	
  are	
  mirrored,	
  but	
  
   exact	
  policies	
  are	
  a	
  bit	
  unclear	
  (to	
  me)	
  
•  Accessing	
  to	
  the	
  data	
  also	
  provided	
  by	
  SRAdb	
  
   package,	
  but	
  further	
  processing	
  (SRA	
  SDK)	
  
   now	
  needed	
  to	
  get	
  FASTQ	
  files	
  
SRAdb	
  and	
  NCBI	
  SRA	
  
Recently,	
  NCBI	
  announced	
  that	
  due	
  to	
  budget	
  constraints,	
  it	
  would	
  be	
  discon#nuing	
  its	
  
Sequence	
  Read	
  Archive	
  (SRA)	
  and	
  Trace	
  Archive	
  repositories	
  for	
  high-­‐throughput	
  
sequence	
  data.	
  However,	
  NIH	
  has	
  since	
  commi^ed	
  interim	
  funding	
  for	
  SRA	
  in	
  its	
  current	
  
form	
  un#l	
  October	
  1,	
  2011.	
  In	
  addi#on,	
  NCBI	
  has	
  been	
  working	
  with	
  staff	
  from	
  other	
  NIH	
  
Ins#tutes	
  and	
  NIH	
  grantees	
  to	
  develop	
  an	
  approach	
  to	
  con#nue	
  archiving	
  a	
  widely	
  used	
  
subset	
  of	
  next	
  genera#on	
  sequencing	
  data	
  ajer	
  October	
  1,	
  2011.	
  

We	
  now	
  plan	
  to	
  con#nue	
  handling	
  sequencing	
  data	
  associated	
  with:	
  

• RNA-­‐Seq,	
  ChIP-­‐Seq,	
  and	
  epigenomic	
  data	
  that	
  are	
  submi^ed	
  to	
  GEO	
  
• Genomic	
  and	
  Transcriptomic	
  assemblies	
  that	
  are	
  submi^ed	
  to	
  GenBank	
  
• 16S	
  ribosomal	
  RNA	
  data	
  associated	
  with	
  metagenomics	
  that	
  are	
  submi^ed	
  to	
  GenBank	
  

In	
  addi#on,	
  NCBI	
  will	
  con#nue	
  to	
  provide	
  access	
  to	
  exis#ng	
  SRA	
  and	
  Trace	
  Archive	
  data	
  
for	
  the	
  foreseeable	
  future.	
  NCBI	
  is	
  also	
  con#nuing	
  to	
  discuss	
  with	
  NIH	
  Ins#tutes	
  
approaches	
  for	
  handling	
  other	
  next-­‐genera#on	
  sequencing	
  data	
  associated	
  with	
  specific	
  
large-­‐scale	
  studies.	
  
SRAdb	
  and	
  IGV	
  Walkthrough	
  
•  Download	
  and	
  start	
  IGV:	
  
    –  h^p://www.broadins#tute.org/sojware/igv/
       download	
  
Public datatutorialoverview

Contenu connexe

Similaire à Public datatutorialoverview

08 wp7 progresses&results-20130221
08 wp7 progresses&results-2013022108 wp7 progresses&results-20130221
08 wp7 progresses&results-20130221fruitbreedomics
 
Life Science Database Cross Search and Metadata
Life Science Database Cross Search and MetadataLife Science Database Cross Search and Metadata
Life Science Database Cross Search and MetadataMaori Ito
 
Making powerful science: an introduction to NGS data analysis
Making powerful science: an introduction to NGS data analysisMaking powerful science: an introduction to NGS data analysis
Making powerful science: an introduction to NGS data analysisAdamCribbs1
 
2012 sept 18_thug_biotech
2012 sept 18_thug_biotech2012 sept 18_thug_biotech
2012 sept 18_thug_biotechAdam Muise
 
150219 agbt giab_poster_marc
150219 agbt giab_poster_marc150219 agbt giab_poster_marc
150219 agbt giab_poster_marcGenomeInABottle
 
Geonetwork for Spatial Data
Geonetwork for Spatial DataGeonetwork for Spatial Data
Geonetwork for Spatial DataNizam GIS
 
Challenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsChallenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsYasin Memari
 
Data retreival system
Data retreival systemData retreival system
Data retreival systemShikha Thakur
 
Data cycle microbes
Data cycle microbesData cycle microbes
Data cycle microbesjyotikhadake
 
Fedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data ProcessingFedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data ProcessingPeter Haase
 
Publication of raw and curated NMR spectroscopic data for organic molecules
Publication of raw and curated NMR spectroscopic data for organic moleculesPublication of raw and curated NMR spectroscopic data for organic molecules
Publication of raw and curated NMR spectroscopic data for organic moleculesChristoph Steinbeck
 
The Progress on Sagace and Data Integration
The Progress on Sagace and Data IntegrationThe Progress on Sagace and Data Integration
The Progress on Sagace and Data IntegrationMaori Ito
 
High Performance and Scalable Geospatial Analytics on Cloud with Open Source
High Performance and Scalable Geospatial Analytics on Cloud with Open SourceHigh Performance and Scalable Geospatial Analytics on Cloud with Open Source
High Performance and Scalable Geospatial Analytics on Cloud with Open SourceDataWorks Summit
 
Building genomic data cyberinfrastructure with the online database software T...
Building genomic data cyberinfrastructure with the online database software T...Building genomic data cyberinfrastructure with the online database software T...
Building genomic data cyberinfrastructure with the online database software T...mestato
 
Big data solution for ngs data analysis
Big data solution for ngs data analysisBig data solution for ngs data analysis
Big data solution for ngs data analysisYun Lung Li
 

Similaire à Public datatutorialoverview (20)

ChipSeq Data Analysis
ChipSeq Data AnalysisChipSeq Data Analysis
ChipSeq Data Analysis
 
08 wp7 progresses&results-20130221
08 wp7 progresses&results-2013022108 wp7 progresses&results-20130221
08 wp7 progresses&results-20130221
 
Life Science Database Cross Search and Metadata
Life Science Database Cross Search and MetadataLife Science Database Cross Search and Metadata
Life Science Database Cross Search and Metadata
 
Making powerful science: an introduction to NGS data analysis
Making powerful science: an introduction to NGS data analysisMaking powerful science: an introduction to NGS data analysis
Making powerful science: an introduction to NGS data analysis
 
2012 sept 18_thug_biotech
2012 sept 18_thug_biotech2012 sept 18_thug_biotech
2012 sept 18_thug_biotech
 
150219 agbt giab_poster_marc
150219 agbt giab_poster_marc150219 agbt giab_poster_marc
150219 agbt giab_poster_marc
 
Geonetwork for Spatial Data
Geonetwork for Spatial DataGeonetwork for Spatial Data
Geonetwork for Spatial Data
 
Challenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsChallenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data Genomics
 
Data retreival system
Data retreival systemData retreival system
Data retreival system
 
Data cycle microbes
Data cycle microbesData cycle microbes
Data cycle microbes
 
Fedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data ProcessingFedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data Processing
 
Publication of raw and curated NMR spectroscopic data for organic molecules
Publication of raw and curated NMR spectroscopic data for organic moleculesPublication of raw and curated NMR spectroscopic data for organic molecules
Publication of raw and curated NMR spectroscopic data for organic molecules
 
Genomic Databases-.pptx
Genomic Databases-.pptxGenomic Databases-.pptx
Genomic Databases-.pptx
 
The Progress on Sagace and Data Integration
The Progress on Sagace and Data IntegrationThe Progress on Sagace and Data Integration
The Progress on Sagace and Data Integration
 
High Performance and Scalable Geospatial Analytics on Cloud with Open Source
High Performance and Scalable Geospatial Analytics on Cloud with Open SourceHigh Performance and Scalable Geospatial Analytics on Cloud with Open Source
High Performance and Scalable Geospatial Analytics on Cloud with Open Source
 
Building genomic data cyberinfrastructure with the online database software T...
Building genomic data cyberinfrastructure with the online database software T...Building genomic data cyberinfrastructure with the online database software T...
Building genomic data cyberinfrastructure with the online database software T...
 
Major databases in bioinformatics
Major databases in bioinformaticsMajor databases in bioinformatics
Major databases in bioinformatics
 
NGS File formats
NGS File formatsNGS File formats
NGS File formats
 
Data Science At Zillow
Data Science At ZillowData Science At Zillow
Data Science At Zillow
 
Big data solution for ngs data analysis
Big data solution for ngs data analysisBig data solution for ngs data analysis
Big data solution for ngs data analysis
 

Plus de Sean Davis

Lightweight data engineering, tools, and software to facilitate data reuse an...
Lightweight data engineering, tools, and software to facilitate data reuse an...Lightweight data engineering, tools, and software to facilitate data reuse an...
Lightweight data engineering, tools, and software to facilitate data reuse an...Sean Davis
 
2016 07 12_purdue_bigdatainomics_seandavis
2016 07 12_purdue_bigdatainomics_seandavis2016 07 12_purdue_bigdatainomics_seandavis
2016 07 12_purdue_bigdatainomics_seandavisSean Davis
 
RNA-seq: A High-resolution View of the Transcriptome
RNA-seq: A High-resolution View of the TranscriptomeRNA-seq: A High-resolution View of the Transcriptome
RNA-seq: A High-resolution View of the TranscriptomeSean Davis
 
ShinySRAdb: an R package using shiny to wrap the SRAdb Bioconductor package
ShinySRAdb: an R package using shiny to wrap the SRAdb Bioconductor packageShinySRAdb: an R package using shiny to wrap the SRAdb Bioconductor package
ShinySRAdb: an R package using shiny to wrap the SRAdb Bioconductor packageSean Davis
 
RNA-seq Data Analysis Overview
RNA-seq Data Analysis OverviewRNA-seq Data Analysis Overview
RNA-seq Data Analysis OverviewSean Davis
 
Introduction to R
Introduction to RIntroduction to R
Introduction to RSean Davis
 
Forsharing cshl2011 sequencing
Forsharing cshl2011 sequencingForsharing cshl2011 sequencing
Forsharing cshl2011 sequencingSean Davis
 
Sssc retreat.bioinfo resources.20110411
Sssc retreat.bioinfo resources.20110411Sssc retreat.bioinfo resources.20110411
Sssc retreat.bioinfo resources.20110411Sean Davis
 
OKC Grand Rounds 2009
OKC Grand Rounds 2009OKC Grand Rounds 2009
OKC Grand Rounds 2009Sean Davis
 
Genetics Branch Journal club
Genetics Branch Journal clubGenetics Branch Journal club
Genetics Branch Journal clubSean Davis
 
Genomics Technologies
Genomics TechnologiesGenomics Technologies
Genomics TechnologiesSean Davis
 
Bioc strucvariant seattle_11_09
Bioc strucvariant seattle_11_09Bioc strucvariant seattle_11_09
Bioc strucvariant seattle_11_09Sean Davis
 

Plus de Sean Davis (13)

Lightweight data engineering, tools, and software to facilitate data reuse an...
Lightweight data engineering, tools, and software to facilitate data reuse an...Lightweight data engineering, tools, and software to facilitate data reuse an...
Lightweight data engineering, tools, and software to facilitate data reuse an...
 
2016 07 12_purdue_bigdatainomics_seandavis
2016 07 12_purdue_bigdatainomics_seandavis2016 07 12_purdue_bigdatainomics_seandavis
2016 07 12_purdue_bigdatainomics_seandavis
 
RNA-seq: A High-resolution View of the Transcriptome
RNA-seq: A High-resolution View of the TranscriptomeRNA-seq: A High-resolution View of the Transcriptome
RNA-seq: A High-resolution View of the Transcriptome
 
ShinySRAdb: an R package using shiny to wrap the SRAdb Bioconductor package
ShinySRAdb: an R package using shiny to wrap the SRAdb Bioconductor packageShinySRAdb: an R package using shiny to wrap the SRAdb Bioconductor package
ShinySRAdb: an R package using shiny to wrap the SRAdb Bioconductor package
 
RNA-seq Data Analysis Overview
RNA-seq Data Analysis OverviewRNA-seq Data Analysis Overview
RNA-seq Data Analysis Overview
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
Rna seq
Rna seqRna seq
Rna seq
 
Forsharing cshl2011 sequencing
Forsharing cshl2011 sequencingForsharing cshl2011 sequencing
Forsharing cshl2011 sequencing
 
Sssc retreat.bioinfo resources.20110411
Sssc retreat.bioinfo resources.20110411Sssc retreat.bioinfo resources.20110411
Sssc retreat.bioinfo resources.20110411
 
OKC Grand Rounds 2009
OKC Grand Rounds 2009OKC Grand Rounds 2009
OKC Grand Rounds 2009
 
Genetics Branch Journal club
Genetics Branch Journal clubGenetics Branch Journal club
Genetics Branch Journal club
 
Genomics Technologies
Genomics TechnologiesGenomics Technologies
Genomics Technologies
 
Bioc strucvariant seattle_11_09
Bioc strucvariant seattle_11_09Bioc strucvariant seattle_11_09
Bioc strucvariant seattle_11_09
 

Dernier

Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 

Dernier (20)

Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 

Public datatutorialoverview

  • 1. Accessing  Public  Data     (from  NCBI)  Using  Bioconductor  
  • 3. Objec#ves   •  Navigate  NCBI  GEO  website   •  Understand  NCBI  GEO  data  en<<es  and  rela#onships   between  them   •  Use  GEOquery  package  to  import  data  from  NCBI  GEO  into   R   •  Convert  GEOquery  data  structures  into  R  data  structures   •  Use  GEOmetadb  to  find  data  in  NCBI  GEO   •  Know  rela#onship  between  NCBI  GEO  and  NCBI  SRA   •  Understand  how  to  use  the  SRAdb  package  to  query  SRA   metadata   •  Use  SRAdb  package  to  control  the  Integrated  Genome   Viewer  (IGV)  from  R  
  • 4.
  • 5. MIAME-­‐compliant  Data   •  The  raw  data  for  each  hybridiza#on  (e.g.,  CEL  or  GPR  files)   •  The  final  processed  (normalized)  data  for  the  set  of  hybridiza#ons  in  the   experiment  (study)  (e.g.,  the  gene  expression  data  matrix  used  to  draw  the   conclusions  from  the  study)   •  The  essen#al  sample  annota#on  including  experimental  factors  and  their   values  (e.g.,  compound  and  dose  in  a  dose  response  experiment)   •  The  experimental  design  including  sample  data  rela#onships  (e.g.,  which   raw  data  file  relates  to  which  sample,  which  hybridiza#ons  are  technical,   which  are  biological  replicates)   •  Sufficient  annota#on  of  the  array  (e.g.,  gene  iden#fiers,  genomic   coordinates,  probe  oligonucleo#de  sequences  or  reference  commercial   array  catalog  number)   •  The  essen#al  laboratory  and  data  processing  protocols  (e.g.,  what   normaliza#on  method  has  been  used  to  obtain  the  final  processed  data)  
  • 6. What’s  in  GEO   •  Gene  expression  profiling  by  microarray  or  next-­‐genera#on   sequencing   •  Non-­‐coding  RNA  profiling  by  microarray  or  next-­‐genera#on   sequencing   •  Chroma#n  immunoprecipita#on  (ChIP)  profiling  by   microarray  or  next-­‐genera#on  sequencing   •  Genome  methyla#on  profiling  by  microarray  or  next-­‐ genera#on  sequencing   •  Genome  varia#on  profiling  by  array  (arrayCGH)   •  SNP  arrays   •  Serial  Analysis  of  Gene  Expression  (SAGE)   •  Protein  arrays  
  • 7. GEO  data  en##es   •  GEO  Samples  (GSM)   •  GEO  PlaYorm  (GPL)   •  GEO  Series  (GSE)   –  Collec#ons  of  related  GSM  and  GPL  along  with  free-­‐text   study  metadata   –  Two  data  “flavors”   •  GSE   •  GSEMatrix   •  GEO  Dataset  (GDS)     –  Only  data  en#ty  that  is  curated  by  NCBI  GEO  staff   –  Typically,  samples  are  divided  into  sta#s#cally  and   biologically  relevant  groups  upon  which  we  can  compute  
  • 8. NCBI  GEO  website   •  h^p://www.ncbi.nlm.nih.gov/geo/  
  • 11. GEOquery   •  Singular  goal:  Get  data  from  NCBI  GEO  and   parse  into  R  objects   •  Secondary  goals:   –  Get  supplemental  files  from  NCBI  GEO   –  Allow  large-­‐scale  data  mining  by  doing  all-­‐of-­‐the-­‐ above  in  a  lossless  manner  
  • 12. GEOquery  Data  Structures     (GSM,  GPL,  GDS)  
  • 14. GEOquery  Walkthrough   •  h^p://watson.nci.nih.gov/~sdavis/   –  Click  on  Tutorials  and  then  on  “Accessing  public   data….”   •  Download  the  R  script  for  cut-­‐and-­‐paste  to   follow  along  
  • 15. GEOmetadb   •  Finding  data  in  NCBI  GEO  can  be  challenging   •  Some  inves#gators  need  large-­‐scale,   computable  access  to  GEO  metadata   •  Given  one  GEO  en#ty  type,  it  may  be  useful  to   find  all  rela#onships  with  other  en#ty  types   (eg.,  find  all  hgu133a  arrays  in  GEO)  
  • 16. GEOmetadb   •  What  is  GEOmetadb?   –   A  Bioconductor  package  that  offers  an  alterna#ve   to  eU#ls  data  mining  of  GEO  metadata   –  A  SQLite  database  which  stores  all  the  GEO   metadata  in  a  rela#onal  format  for  easy  querying,   par#cularly  in  bulk   •  What  is  GEOmetadb  NOT?   –  We  have  not  a^empted  to  alter  or  standardize  in   any  way  the  data  from  GEO  
  • 18.
  • 20. SRAdb   •  Similar  goals  to  GEOmetadb,  but  with  SRA   data   •  Data  for  SRA  (ENA,  DRA)  are  mirrored,  but   exact  policies  are  a  bit  unclear  (to  me)   •  Accessing  to  the  data  also  provided  by  SRAdb   package,  but  further  processing  (SRA  SDK)   now  needed  to  get  FASTQ  files  
  • 21. SRAdb  and  NCBI  SRA   Recently,  NCBI  announced  that  due  to  budget  constraints,  it  would  be  discon#nuing  its   Sequence  Read  Archive  (SRA)  and  Trace  Archive  repositories  for  high-­‐throughput   sequence  data.  However,  NIH  has  since  commi^ed  interim  funding  for  SRA  in  its  current   form  un#l  October  1,  2011.  In  addi#on,  NCBI  has  been  working  with  staff  from  other  NIH   Ins#tutes  and  NIH  grantees  to  develop  an  approach  to  con#nue  archiving  a  widely  used   subset  of  next  genera#on  sequencing  data  ajer  October  1,  2011.   We  now  plan  to  con#nue  handling  sequencing  data  associated  with:   • RNA-­‐Seq,  ChIP-­‐Seq,  and  epigenomic  data  that  are  submi^ed  to  GEO   • Genomic  and  Transcriptomic  assemblies  that  are  submi^ed  to  GenBank   • 16S  ribosomal  RNA  data  associated  with  metagenomics  that  are  submi^ed  to  GenBank   In  addi#on,  NCBI  will  con#nue  to  provide  access  to  exis#ng  SRA  and  Trace  Archive  data   for  the  foreseeable  future.  NCBI  is  also  con#nuing  to  discuss  with  NIH  Ins#tutes   approaches  for  handling  other  next-­‐genera#on  sequencing  data  associated  with  specific   large-­‐scale  studies.  
  • 22. SRAdb  and  IGV  Walkthrough   •  Download  and  start  IGV:   –  h^p://www.broadins#tute.org/sojware/igv/ download