DNA Sequence Data in Big Data Perspective
Opportunities and Constraints




Palaniappan SP
connectsp2012@gmail.com
Request Note


  I prepared this presentation entirely from internet research, with the intent to
  share it as a way of giving back to society. Please send your comments and suggestions to the
  mail ID above. They would help to improve the value and benefit of this work.




18-Nov-12                                                                                      2
Core areas where DNA sequencing is employed


 • Academic research
            •   understanding gene expression/regulation
            •   phylogeny, demography and evolution research
 • Oncology
      • understanding DNA’s role in cancer cells

            •   finding ways to tune gene expression for cancer abatement or prevention
 • Gene therapy
            •   Using recombinant DNA to suppress / modify / induce gene expression to
                address genetic disorder based diseases / malfunction




More areas where DNA sequencing is employed

 • Developing Genetically Modified Organisms through recombinant
    research
            •   salt tolerant/drought tolerant/disease resistant cultivable crops
            •   microbes producing more of therapeutic compounds, proteins
            •   microbes for environment cleaning
            •   pro-biotic lactobacilli

 • Clinical diagnosis - diagnosing gene-sequence-correlated diseases /
    infections e.g. HIV
 • Forensic analysis - DNA fingerprint profiling to identify crime suspects

 • Pedigree analysis - to establish parental lineage in legal disputes



Agencies engaged in DNA sequencing


 • Non-Profit Research Laboratories in Universities, Institutes

 • Clinical labs of government and private hospitals

 • Commercial organizations engaged in DNA sequencing for payment




Databases maintaining DNA sequence data
  In the early period, sequence data were generated and analyzed by only a few research
  institutes, such as the members of the Human Genome Project. As the data grew in
  volume and geographic spread, many more databases were created and made
  available to the research community.
  Some examples:
      • NCBI - National Center for Biotechnology Information (GenBank)
         http://www.ncbi.nlm.nih.gov/guide/data-software/#databases_
      • EBI - European Bioinformatics Institute (EMBL)
         http://www.ebi.ac.uk/Databases/
      • EMNEW - Index of New EMBL Nucleotides (EBI)
         http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+databanks
      • DDBJ - DNA Data Bank of Japan
         http://www.ddbj.nig.ac.jp/
      • According to the Nucleic Acids Research database issue of January 2012, there are as many
         as 1380 databases.
         http://www.oxfordjournals.org/nar/database/a/
DNA Sequencing capability has grown exponentially

  [Figure: DNA sequences in GenBank over time (doubling time = 18 months); sequencing cost vs. data-analysis cost]




  Source: Bioinformatics Challenges of High-Throughput DNA Sequencing by Stuart M. Brown, Ph.D,
  New York University www.med.nyu.edu/rcr/rcr/course/NexGen-2010.ppt
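
  The 18-month doubling time quoted for GenBank can be turned into a growth factor with a one-line formula. The sketch below assumes simple exponential growth, which is an idealization; the function name and parameters are illustrative and do not come from the cited source.

```python
def growth_factor(months: float, doubling_months: float = 18.0) -> float:
    """Factor by which a quantity grows after `months`, assuming it doubles
    every `doubling_months` (pure exponential model, an assumption)."""
    return 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    for years in (1.5, 3, 10):
        print(f"{years:>4} years -> x{growth_factor(years * 12):.1f}")
```

  Under this model, three years corresponds to a fourfold increase and ten years to roughly a hundredfold increase.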
Big Data Of DNA Sequence Is Different From Other Big Data

 • Most databases have built-in search engines with predefined filters to narrow down
     the search. These web-based tools also restrict the input/output file formats, so integrating
     customized search tools with the databases still needs to be worked out.
 • Too much tool customization limits interoperability across software platforms.
 • Unlike conventional RDBMS setups, there is no Dev region, Test region or sandbox
     environment where one can freely test tools, scripts and queries.
 • Data mining is not confined to one or a few data sources. There are more than 1300
     databases; some are primary (comparable to tables) and some are derived (comparable to
     table views). The same data is likely to be replicated in many databases.
 • Analytics does not look for an exact match to a search string, or to a cluster of strings defined by a
     complex query with multiple joins and unions. Analysis is mostly based on the percentage match
     to a given sequence. Computational approaches such as dynamic programming,
     heuristic algorithms and probabilistic methods are used for sequence alignment.
 • There are too many tools, search engines, software packages and scripting languages, so it is
     difficult to find or validate a software component framework or technology toolbox.
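
  To illustrate how "percentage of matching" differs from exact string search, here is a minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch approach). The scoring values (match +1, mismatch -1, gap -2) and the function names are assumptions chosen for illustration; production tools such as BLAST rely on heuristics and tuned scoring matrices instead.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Needleman-Wunsch global alignment; returns (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i: int, j: int) -> int:
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic-programming table.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Share of alignment columns where both sequences carry the same base."""
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)


if __name__ == "__main__":
    al_a, al_b, s = global_align("GATTACA", "GATACA")
    print(al_a, al_b, s, round(percent_identity(al_a, al_b), 1))
```

  The table has one cell per pair of prefix lengths, so time and memory grow with the product of the two sequence lengths. That cost is one reason database-scale searches fall back on heuristic methods.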

Using Big Data of Gene Sequence – Examples


 • Identifying gene sequences relevant to a specific biochemical/metabolic
     pathway using transcriptional "fingerprints"
 • Understanding gene regulation exerted by promoters and suppressors
 • Whole-genome sequence analysis for disease control and human healthcare
 • Gene profiling for
     ―   Structural genes (coding for mRNA, rRNA, tRNA)
     ―   Functional sequences (promoter, operator, terminator)
     ―   Regulatory genes (coding for repressor proteins that bind to the operator)
     ―   Putative genes that are not yet clearly associated with any protein product or
         function
     ―   Sequences of interest with reference to SNPs, SVs, indels, ChIP


Big Data of Gene Sequence – More Examples

 • Extrinsic gene-finding systems for gene annotation
 • Understanding the genetic basis of multi-drug resistance in superbugs so as
     to develop alternative control measures
 • Targeted drug delivery against pathogens
 • Localized gene therapy for infectious diseases or inherited disorders
 • String mining / sequence mining, itemset mining, association rule mining:
     data mining can help in two ways: 1) understanding the genetic mechanisms
     of regulation and expression of phenotypes, and 2) retrieving genes or
     genetic information that could be converted into a process technology,
     a diagnostic tool or a therapeutic technique.
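
  As a small illustration of string mining on sequence data, the sketch below counts overlapping k-mers (substrings of length k), a common first step before itemset or association-rule mining. The function name and the toy sequence are hypothetical.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping substring of length k in the sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


if __name__ == "__main__":
    counts = count_kmers("ATATATGC", 2)
    for kmer, n in counts.most_common(3):
        print(kmer, n)
```

  Frequent k-mers found this way can serve as candidate motifs, which would then need statistical validation against a background model.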



Big Data Management – A Generic Approach




Analyzing DNA databases – Some Practical Constraints


 • Reading frame alignment. Each strand of a sequence can be read in three different
     reading frames (six in total for double-stranded DNA), and each frame translates
     into a different derived amino acid sequence.
 • Presence of exons and introns. RNA splicing, and the possibility of alternative splicing,
     makes the analysis more complex.
 • Silent mutations, codon redundancy and SNPs. It is difficult to distinguish a
     silent mutation from a sequencing error. Sequencing errors are possible
     because of the complexity of sample preparation, sequencing, assembly and
     analysis of sequence data. Such cases can be resolved only by
     repeat runs.
 • Since DNA is prepared from a population of cells, the sequence obtained is
     effectively an average of the DNA sequences of all the sampled cells.
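
  The reading-frame and codon-redundancy constraints above can be made concrete with a short sketch. It builds the standard genetic code table, translates a DNA string in all six frames (three per strand), and checks whether a codon change is silent. The function names and the example sequence are illustrative assumptions; ambiguous bases and alternative genetic codes are not handled.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate complete codons from position 0; trailing bases are ignored."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def six_frames(seq: str) -> dict:
    """Translate all three frames on each strand, keyed '+1'..'+3' and '-1'..'-3'."""
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


def is_silent(codon_a: str, codon_b: str) -> bool:
    """True if two different codons encode the same amino acid."""
    return codon_a != codon_b and CODON_TABLE[codon_a] == CODON_TABLE[codon_b]


if __name__ == "__main__":
    for name, protein in six_frames("ATGGCCTAA").items():
        print(name, protein)
    print(is_silent("GCC", "GCT"))  # both codons encode alanine
```

  A single base change such as GCC to GCT leaves the protein unchanged, which is why a sequence comparison alone cannot tell a silent mutation from a read error.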


Analyzing DNA databases – Practical Constraints - contd.
 • The significance of non-coding DNA is yet to be fully understood. In many
     eukaryotes, up to 99% of an organism's total genome is non-coding
     DNA. More than 98% of the human genome does not encode protein
     sequences. A fraction of the non-coding sequence is reported to regulate
     gene expression.
 • Sequence matching is based on statistical analysis and not on exact data
     matching.
 • The reference human genome data may not represent the global population.
     However, as more sequence information from different
     geographic regions is added, the reference will become more global.
     The Genome Reference Consortium is an international body that takes
     care of genome curation.
 • In a short span of time, the cost of sequencing has come down drastically.
     Still, the sequencing fee (around $1000 per individual) is
     expensive for countries like India.
Software Tools Used For Sequence Analysis

 • A large number of tools are available on the internet
 • The tools are used for
       ―     sequence comparison/alignment
       ―     searching databases and retrieving catalogued reference sequences
       ―     assembling short sequence reads to get the complete sequence of a gene
       ―     retrieving sequence information for constructing primers / oligo probes
       ―     translating a nucleic acid sequence into a protein sequence
       ―     back-translating a protein sequence into a nucleic acid sequence
       ―     multiple sequence alignment
 • Scripting languages: Perl, Python, Ruby. BioPerl, Biopython and BioRuby are
     libraries that can be readily used for data mining.
 • SWIG: a tool that generates scripting-language interfaces to C/C++ code, which improves
     interoperability of scripts
 • SourceForge, GitHub: commonly used code-hosting platforms for maintaining
     software and tool versions
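
  The "assembling short sequence reads" task listed above can be sketched, in toy form, as merging two reads on their longest suffix-prefix overlap. This is only an illustration under simplifying assumptions (error-free reads, same strand, hypothetical function names); real assemblers work on overlap or de Bruijn graphs across millions of reads.

```python
def overlap_length(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge_reads(left: str, right: str, min_len: int = 3) -> str:
    """Join two reads, collapsing their overlap; plain concatenation if none."""
    k = overlap_length(left, right, min_len)
    return left + right[k:]


if __name__ == "__main__":
    print(merge_reads("ATGGCGT", "GCGTACC"))
```

  The `min_len` threshold guards against merging reads on a short overlap that occurs purely by chance.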

Some Commonly Used Tools
 •   Ensembl Genome Browser
 •   UCSC Genome Browser
 •   Entrez - Integrated, text-based search and retrieval tool used at NCBI
 •   RSAT - Regulatory Sequence Analysis Tools - tools dedicated to the detection of regulatory signals in
     non-coding sequences
 •   BLAST, FASTA - Tools used for sequence alignment - to compare a query sequence with those available in a
     database
 •   ClustalW - Multiple sequence alignment program
 •   GeneMarker® - A commercial tool for forensic profiling
 •   Seq Anal - A collection of tools to search, align and analyze DNA sequences
 •   Galaxy Tools - Can also be used to search, align and analyze DNA data
 •   Codon Suite - Codon-based sequence analysis
 •   Transeq, Backtranseq - Tools to translate or back-translate between nucleotide and peptide sequences
 •   ReadSeq - Molecular sequence format converter
 •   FASTLINK - Used to map genes and find the approximate location of disease genes
 •   DnaSP - A software package for the analysis of nucleotide polymorphism from aligned DNA sequence data
 •   MATCH™ - A tool for searching transcription factor binding sites in DNA sequences
 •   PLINK - A free, open-source whole-genome association analysis toolset, designed to perform a range of
     basic, large-scale analyses
 •   GeneMark™ - Free gene prediction software
 •   Genscan - A widely used ab initio gene predictor
 •   A longer list of gene prediction software:
     http://en.wikipedia.org/wiki/List_of_gene_prediction_software
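
  Retrieval tools such as Entrez can also be driven from scripts. The sketch below only builds a request URL for NCBI's E-utilities `efetch` endpoint and does not contact the server. The accession value is a placeholder and not a real record, and the parameter choices (`nuccore`, FASTA as text) are assumptions for illustration; consult NCBI's usage guidelines before sending real requests.

```python
from urllib.parse import urlencode

# Base address of the E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str, db: str = "nuccore") -> str:
    """Build (but do not send) a request URL for one record in FASTA format."""
    params = {
        "db": db,            # database to query, here the nucleotide collection
        "id": accession,     # record identifier supplied by the caller
        "rettype": "fasta",  # ask for FASTA-formatted output
        "retmode": "text",   # plain text rather than XML
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # "EXAMPLE_ACCESSION" is a hypothetical placeholder, not a real record.
    print(build_efetch_url("EXAMPLE_ACCESSION"))
```

  Keeping URL construction separate from the network call makes the query easy to log and test before any data is fetched.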

 

DNA Sequence Data in Big Data Perspective

  • 1. Opportunities and Constraints Palaniappan SP connectsp2012@gmail.com
  • 2. Request Note
      I prepared this presentation entirely with input from internet research, with the intent of sharing it as a way of giving back to society. Please share your comments and suggestions through the mail ID; they would help improve the value and benefit of this work. 18-Nov-12
  • 3. Core areas where DNA sequencing is employed
      • Academic research
          • understanding gene expression/regulation
          • phylogeny, demography and evolution research
      • Oncology
          • understanding DNA’s role in cancer cells
          • finding ways to tune gene expression for cancer abatement or prevention
      • Gene therapy
          • using recombinant DNA to suppress / modify / induce gene expression to address diseases / malfunctions caused by genetic disorders
  • 4. More areas where DNA sequencing is employed
      • Developing Genetically Modified Organisms through recombinant research
          • salt tolerant / drought tolerant / disease resistant cultivable crops
          • microbes producing more therapeutic compounds and proteins
          • microbes for environment cleaning
          • pro-biotic lactobacilli
      • Clinical diagnosis - diagnosing gene-sequence-correlated diseases / infections e.g. HIV
      • Forensic analysis - DNA fingerprint profiling to identify crime suspects
      • Pedigree analysis - to establish parental lineage in legal disputes
  • 5. Agencies engaged in DNA sequencing
      • Non-Profit Research Laboratories in Universities, Institutes
      • Clinical labs of government and private hospitals
      • Commercial organizations engaged in DNA sequencing for payment
  • 6. Databases maintaining DNA sequence data
      In the early period, sequence data were generated and analyzed by only a few research institutes, such as the members of the Human Genome Project. Later, as the data grew in volume and geographic spread, many more databases were created and made available to research communities. Some examples:
      • NCBI - National Center for Biotechnology Information (GenBank) http://www.ncbi.nlm.nih.gov/guide/data-software/#databases_
      • EBI - European Bioinformatics Institute (EMBL) http://www.ebi.ac.uk/Databases/
      • EMNEW - Index of New EMBL Nucleotides (EBI) http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+databanks
      • DDBJ - DNA Data Bank of Japan http://www.ddbj.nig.ac.jp/
      • As per the Nucleic Acids Research database issue of Jan 2012, there are as many as 1380 databases. http://www.oxfordjournals.org/nar/database/a/
  • 7. DNA Sequencing capability has grown exponentially
      [Figure: DNA sequences in GenBank, doubling time = 18 months; curves for Sequencing Cost and Data Analyzing Cost]
      Source: Bioinformatics Challenges of High-Throughput DNA Sequencing by Stuart M. Brown, Ph.D, New York University www.med.nyu.edu/rcr/rcr/course/NexGen-2010.ppt
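To get a feel for what an 18-month doubling time implies, the short sketch below computes the growth factor over a given period. It assumes idealized, constant exponential growth; the function name `growth_factor` is purely illustrative and is not taken from the slide or its source.

```python
def growth_factor(months, doubling_time_months=18.0):
    """Factor by which a quantity grows after `months`,
    assuming it doubles every `doubling_time_months` (idealized model)."""
    return 2.0 ** (months / doubling_time_months)


if __name__ == "__main__":
    for years in (1.5, 3, 5, 10):
        factor = growth_factor(years * 12)
        print(f"{years:>4} years -> about {factor:.1f}x the starting volume")
```

Under this assumption the data volume grows roughly a hundredfold in ten years, which is why storage and analysis cost, rather than sequencing itself, can become the limiting factor.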
  • 8. Big Data Of DNA Sequence Is Different From Other Big Data
      • Most of the databases have built-in search engines with predefined filters to narrow down the search. These web-based tools also restrict the input / output file formats. Ways of integrating customized search tools with the databases still need to be worked out.
      • Too much tool customization limits interoperability across software platforms.
      • Unlike conventional RDBMS setups, there is no Dev region, Test region or sandbox environment where one can freely test tools, scripts and queries.
      • Data mining is not confined to one or a few data sources. There are more than 1300 databases; some are primary (comparable to tables) and some are derived (comparable to table views). The same data is likely to be replicated in many databases.
      • Analytics does not look for an exact match of a search string, or of a cluster of strings defined by a complex query with multiple joins and unions. Analysis is mostly based on the percentage match of a given sequence. A variety of computational approaches, such as dynamic programming, heuristic algorithms and probabilistic methods, are used for sequence alignment.
      • There are too many tools, search engines, software packages and scripting languages, so it is difficult to find or validate a software component framework / technology tool box.
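The point about percentage matching and dynamic programming can be illustrated with a minimal sketch of global alignment in the style of Needleman-Wunsch. The scoring values (match +1, mismatch -1, gap -2) and the function names are assumptions chosen for illustration; production tools such as BLAST use heuristics and far more refined scoring.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align sequences a and b with simple linear gap scoring.
    Returns (aligned_a, aligned_b, score). Scoring values are illustrative."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic programming table
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]


def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns where both sequences carry the same base."""
    matches = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    al_a, al_b, s = needleman_wunsch("ACGT", "ACT")
    print(al_a, al_b, s, percent_identity(al_a, al_b))
```

The result is a similarity percentage rather than a yes/no match, which is the key difference from an exact-string SQL-style query.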
  • 9. Using Big Data of Gene Sequence - Examples
      • Identifying gene sequences relevant to a specific biochemical / metabolic pathway using transcriptional "fingerprints"
      • Understanding gene regulation exerted by promoters and suppressors
      • Whole genome sequence analysis for disease control and human healthcare
      • Gene profiling for
          • Structural genes (coding for mRNA, rRNA, tRNA)
          • Functional genes (coding for promoter, operator, terminator)
          • Regulatory genes (coding for repressor protein that binds to operator)
          • Putative genes which are not evidently associated with any protein produced or function performed
          • Sequences of interest with reference to SNPs, SVs, indels, ChIP
  • 10. Big Data of Gene Sequence - More Examples
      • Extrinsic gene finding systems for gene annotation
      • Understanding the genetic basis of multi-drug resistance in superbugs so as to evolve alternative control measures
      • Targeted drug delivery against pathogens
      • Localized gene therapy for infectious diseases or inherited disorders
      • String mining / sequence mining, itemset mining, association rule mining. Data mining can help us in two ways: 1) understanding the genetic mechanisms of regulation and expression of phenotypes, and 2) retrieving genes or genetic information that could be converted into a process technology, a diagnostic tool or a therapeutic technique.
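As a rough illustration of string / sequence mining, the sketch below counts k-mers (substrings of length k) and reports those that occur in at least a minimum number of sequences, borrowing the "support" notion from itemset mining. The function names, the toy sequences and the choice of k are assumptions for illustration only.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def frequent_kmers(seqs, k, min_support):
    """Return {kmer: support} for k-mers present in at least `min_support`
    of the input sequences (each sequence counts at most once per k-mer)."""
    support = Counter()
    for seq in seqs:
        support.update(set(kmer_counts(seq, k)))
    return {kmer: n for kmer, n in support.items() if n >= min_support}


if __name__ == "__main__":
    toy = ["TATAAGC", "GGTATAA", "CCCGGG"]  # hypothetical sequences
    print(frequent_kmers(toy, k=4, min_support=2))
```

Shared short motifs found this way are only candidates; whether they correspond to a regulatory signal has to be checked against annotation or experiment.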
  • 11. Big Data Management – A Generic Approach
  • 12. Analyzing DNA databases - Some Practical Constraints
      • Reading frame alignment. Every sequence can be read in three different reading frames per strand (six if the complementary strand is included), each of which can be converted into a different derived amino acid sequence.
      • Presence of exons and introns. RNA splicing, and the possibility of alternative splicing, make the analysis more complex.
      • Silent mutations and SNPs. Because the genetic code is redundant, a base change may leave the encoded amino acid unchanged, and it is difficult to distinguish a silent mutation from a sequencing error. Sequencing errors are possible because of the complexity of sample preparation, sequencing, assembly and analysis of sequence data. Such cases can be resolved only by repeat runs.
      • Since DNA is prepared from a host of cells, the sequence we get is effectively an average of the DNA sequences of all sampled cells.
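The reading frame and silent mutation constraints can be made concrete with a small sketch that translates one strand in its three forward frames using the standard genetic code. The helper names and the toy sequence are illustrative assumptions; the reverse strand and alternative genetic codes are deliberately left out.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq, frame=0):
    """Translate seq starting at offset `frame` (0, 1 or 2).
    A trailing incomplete codon is ignored; '*' marks a stop codon."""
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def three_frames(seq):
    """Return the translations of the three forward reading frames."""
    return [translate(seq, f) for f in range(3)]


if __name__ == "__main__":
    original = "ATGGCCTAA"
    mutated = "ATGGCTTAA"  # third base of codon GCC changed to T
    print(three_frames(original))
    # Both codons GCC and GCT encode alanine, so the protein is unchanged
    print(translate(original) == translate(mutated))
```

The same nine bases give three unrelated peptides depending on the frame, and the single-base change is invisible at the protein level, which is why it cannot be told apart from a sequencing error without repeat runs.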
  • 13. Analyzing DNA databases - Practical Constraints - contd.
      • The significance of non-coding DNA is yet to be understood. In many eukaryotes, up to 99% of an organism's total genome is non-coding DNA. More than 98% of the human genome does not encode protein sequences. A fraction of the non-coding sequence is reported to regulate gene expression.
      • Sequence matching is based on statistical analysis and not on exact data matching.
      • Reference human genome data may not represent the global population. However, as more sequence information from different geographic regions is added, the reference will become more global. The Genome Reference Consortium is an international body that takes care of genome curation.
      • In a short span of time, the cost of sequencing has come down drastically. Still, the sequencing fee (around $1000 per individual) is expensive for countries like India.
  • 14. Software Tools Used For Sequence Analysis
      • A large number of tools are available on the internet
      • The tools are used for
          • sequence comparison / alignment
          • searching databases and retrieving catalogued reference sequences
          • assembling short sequence strings to get the complete sequence of a gene
          • retrieving sequence information for constructing primers / oligo probes
          • converting nucleic acid sequence to protein (amino acid) sequence
          • converting protein (amino acid) sequence to nucleic acid sequence
          • multiple sequence alignment
      • Scripting languages - Perl, Python, Ruby. BioPerl, Biopython and BioRuby are toolkits that can be readily used for data mining.
      • SWIG - a tool to generate scripting language interfaces; it improves the interoperability of scripts
      • SourceForge, GitHub - commonly used hosting platforms for maintaining version-controlled software and tools
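The "assembling short sequence strings" task can be sketched with a naive greedy overlap merge: repeatedly join the two reads with the longest suffix / prefix overlap. This is a toy under stated assumptions (error-free reads, a single strand, no repeats); the function names and the minimum overlap of 3 bases are illustrative, and real assemblers use much more robust graph-based methods.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b,
    or 0 if no overlap of at least min_len bases exists."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Greedily merge the pair of reads with the largest overlap until
    no overlapping pair remains. Returns the list of resulting contigs."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to merge
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCC", "GCCTTA", "TTAGGA"]))
```

With sequencing errors or repeated regions the greedy choice can join the wrong reads, which is one reason assembly is listed among the error-prone steps in the constraints slides.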
  • 15. Some Commonly Used Tools
      • Ensembl Genome Browser
      • UCSC Genome Browser
      • Entrez - integrated, text-based search and retrieval tool used at NCBI
      • RSAT - Regulatory Sequence Analysis Tools - tools dedicated to the detection of regulatory signals in non-coding sequences
      • BLAST, FASTA - tools used for sequence alignment, to compare a query sequence with those available in a database
      • ClustalW - multiple sequence alignment program
      • GeneMarker® - a commercial tool for forensic profiling
      • Seq Anal - a collection of tools to search, align and analyze DNA sequences
      • Galaxy - its tools can also be used to search, align and analyze DNA data
      • Codon Suite - codon-based sequence analysis
      • Transeq, Backtranseq - tools to translate or back-translate between nucleotide and peptide sequences
      • ReadSeq - molecular sequence format converter
      • FASTLINK - used to map genes and find the approximate location of disease genes
      • DnaSP - a software package for the analysis of nucleotide polymorphism from aligned DNA sequence data
      • MATCH™ - a tool for searching transcription factor binding sites in DNA sequences
      • PLINK - a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses
      • GeneMark™ - free gene prediction software
      • Genscan - a widely used ab initio gene predictor
      • A longer list of gene prediction software is available at http://en.wikipedia.org/wiki/List_of_gene_prediction_software