Sai Teja Vissamsetti (700645566)
Sarika Batte (700647682)
Chandana Sripathi (700641627)
Krishna Chaitanya Koti (700648083)
Krishna Chaitanya Gollavilli (700638821)
Sree Navya Kovvuri (700645739)
Sai Priyanka Reddy Addaboina (700648561)
BIG DATA: ANALYSING GENOMICS AND THE BDG PROJECT
- Dr. Bo Li
Next-generation DNA sequencing is rapidly transforming the life
sciences into a data-driven field.
• Traditional computational methods are difficult to use on such data
• Newer, more digitalised tools are being developed
INTRODUCTION
• We show the experienced bioinformatician how to perform typical genomics tasks in
the context of Spark.
• The ADAM project comprises a set of genomics-specific Avro schemas, Spark-based APIs,
and command-line tools for large-scale genomics analysis.
• We introduce the general Spark user to a new set of Hadoop-friendly serialization and
file formats.
OVERVIEW of the Project
• Free, Java-based programming framework
• Runs on clusters of thousands of nodes, handling thousands of terabytes
• Rapid data transfer
• Continues operating uninterrupted in case of node failure
• Used by:
  Google
  Yahoo
  IBM
• Scalable, cost-effective, flexible, fast, resilient to failure
HADOOP
• A software framework for writing applications that process vast amounts of
data on large clusters reliably
• Basic concept:
  • Divide: the input dataset is split into chunks, which are processed by map
    tasks in parallel.
  • Sort: the outputs of the map tasks are sorted.
  • Conquer: the sorted outputs are merged and given as input to the reduce tasks.
• Handles:
  • Scheduling
  • Data distribution
  • Synchronization
  • Errors and faults
MapReduce
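The divide, sort, and conquer steps above can be sketched in plain Python. This is only an illustration on a single machine, not Hadoop's actual API: the names `map_task`, `reduce_task`, and `run_job` are hypothetical, the "reads" are made-up strings, and the map tasks run one after another here rather than in parallel on a cluster.

```python
from itertools import groupby
from operator import itemgetter


def map_task(chunk):
    """Map: emit a (base, 1) pair for every base in one chunk of reads."""
    return [(base, 1) for read in chunk for base in read]


def reduce_task(key, values):
    """Reduce: sum all the counts emitted for one key."""
    return key, sum(values)


def run_job(reads, chunk_size=2):
    # Divide: split the input into chunks, one per (hypothetical) map task.
    chunks = [reads[i:i + chunk_size] for i in range(0, len(reads), chunk_size)]

    # Map: sequential here; a real framework schedules these tasks in parallel.
    mapped = [pair for chunk in chunks for pair in map_task(chunk)]

    # Sort: bring all pairs with the same key next to each other.
    mapped.sort(key=itemgetter(0))

    # Conquer: merge each group of values and hand it to a reduce task.
    result = {}
    for key, group in groupby(mapped, key=itemgetter(0)):
        reduced_key, total = reduce_task(key, [count for _, count in group])
        result[reduced_key] = total
    return result


if __name__ == "__main__":
    toy_reads = ["ACGT", "AACC", "GGTA"]  # made-up example data
    print(run_job(toy_reads))  # {'A': 4, 'C': 3, 'G': 3, 'T': 2}
```

The result does not depend on `chunk_size`, which is what lets a framework choose the number of map tasks freely.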
• Also called a sequence-specific DNA-binding factor
• Controls the rate of transcription of genetic information from DNA to messenger RNA
• Larger genomes tend to have more transcription factors
TRANSCRIPTION FACTOR
GM12878 - Genetic variation studies
K562 - Erythropoiesis
HepG2 - Metabolism disorders
HEK293 - Embryonic kidney
H54 - Glioblastoma
BJ - Skin fibroblast
Data Types
• Bioinformaticians have their own specific file formats
Example:
  • .fasta
  • .sam
  • .gtf
  • .narrowPeak
  • .vcf etc.
• Accessing similar data stored in different file formats is difficult
• They are ASCII encoded
• ASCII is inefficient
DECOUPLING STORAGE
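To make the point about ASCII concrete, here is a minimal sketch comparing a DNA sequence stored as ASCII text (one byte per base) with the same sequence packed at two bits per base. The helpers `pack_bases` and `unpack_bases` are hypothetical and exist only for this illustration. This is not the encoding that ADAM, Avro, or Parquet use, and it assumes the sequence contains only A, C, G, and T (no N or other ambiguity codes).

```python
# Illustration only: a toy 2-bit encoding, not any real genomics file format.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack_bases(seq):
    """Pack an A/C/G/T string into 2 bits per base (4 bases per byte)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # Left-align a short final chunk so unpacking can read from the top bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_bases(packed, length):
    """Recover the original string; `length` trims the padding in the last byte."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])


if __name__ == "__main__":
    seq = "ACGTACGTAC"  # made-up example sequence
    packed = pack_bases(seq)
    print(len(seq.encode("ascii")), "bytes as ASCII")  # 10 bytes as ASCII
    print(len(packed), "bytes packed")                 # 3 bytes packed
    assert unpack_bases(packed, len(seq)) == seq
```

The sequence length has to be stored alongside the packed bytes, which is one small example of why a binary format needs an explicit schema.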
• An open-source, high-performance, distributed platform for genomic
analysis
• ADAM defines:
  • A data schema and layout on disk
  • A Scala API
  • A command-line interface
What is ADAM?
• VMware version 5.5 – Cloudera
• Java version 1.8
• Tool: ADAM
• Apache Avro
• Spark
SOFTWARE USED
• An in-memory, data-parallel computing framework
• Optimized for iterative jobs, unlike Hadoop
• Data is maintained in memory unless inter-node movement is
needed
• Presents a functional programming API, along with support for
iterative programming
• Used at scale on clusters with >2k nodes, 4TB datasets
• Current leading map-reduce framework:
  • First in-memory map-reduce platform
  • Used at scale in industry, supported in major distros:
    • Cloudera
    • HortonWorks
    • MapR
• The API:
  • Fully functional API
  • Main API in Scala, also supports Java, Python, R
  • Manages failures
WHY SPARK?
SPARK                                        MAP REDUCE
• Open source                                • Open source
• In memory, on disk                         • On disk
• Written in Scala                           • Written in Java
• API: Scala, Java, Python                   • API: Java, Python, Scala
• Easy to program                            • Difficult to program
• Doesn't need abstractions                  • Needs abstractions
• Fewer security features than MapReduce     • More security features
MAP REDUCE vs SPARK
Ingesting the full 1000 Genomes genotype data set:
• Download the raw data directly into HDFS
• Unzip it in-flight
• Run an ADAM job to convert the data to Parquet
Querying Genotypes from the 1000 Genomes Project
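"Unzipping in-flight" means decompressing the data as it streams past, so the compressed file never has to be stored first. The sketch below shows the idea with the Python standard library only. The function name `gunzip_in_flight` is hypothetical, and both ends are in-memory stand-ins: a real ingest would read from the download stream and write to an HDFS output stream, neither of which is shown here. The VCF header lines are made-up sample data.

```python
import gzip
import io
import shutil


def gunzip_in_flight(source, sink, chunk_size=64 * 1024):
    """Decompress a gzip byte stream into `sink` one chunk at a time."""
    with gzip.GzipFile(fileobj=source, mode="rb") as decompressed:
        # copyfileobj reads and writes in chunk_size pieces, so memory use
        # stays bounded regardless of how large the stream is.
        shutil.copyfileobj(decompressed, sink, chunk_size)


if __name__ == "__main__":
    # Made-up sample content standing in for a gzipped genotype file.
    vcf_text = b"##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\n"
    remote = io.BytesIO(gzip.compress(vcf_text))  # stand-in for the download
    sink = io.BytesIO()                           # stand-in for the HDFS file
    gunzip_in_flight(remote, sink)
    assert sink.getvalue() == vcf_text
```

The conversion to Parquet would then run as a separate job over the uncompressed data, as in the last step of the list above.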
Building ADAM
Building Spark
Big data   analysing genomics and the bdg project

Contenu connexe

Tendances

Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataWes McKinney
 
Spark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir VolkSpark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir VolkSpark Summit
 
Intro to Python for C# Developers
Intro to Python for C# DevelopersIntro to Python for C# Developers
Intro to Python for C# DevelopersSarah Dutkiewicz
 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkResource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkDatabricks
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
Latest Developments in H2O
Latest Developments in H2OLatest Developments in H2O
Latest Developments in H2OSri Ambati
 
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16BigMine
 
Pycon India 2017 - Big Data Engineering using Spark with Python (pyspark) - W...
Pycon India 2017 - Big Data Engineering using Spark with Python (pyspark) - W...Pycon India 2017 - Big Data Engineering using Spark with Python (pyspark) - W...
Pycon India 2017 - Big Data Engineering using Spark with Python (pyspark) - W...Durga Gadiraju
 
Is there a SQL for NoSQL?
Is there a SQL for NoSQL?Is there a SQL for NoSQL?
Is there a SQL for NoSQL?Arthur Keen
 
Scala ecosystem - Dublin Scala Meetup, Oct 2018
Scala ecosystem - Dublin Scala Meetup, Oct 2018Scala ecosystem - Dublin Scala Meetup, Oct 2018
Scala ecosystem - Dublin Scala Meetup, Oct 2018Mikhail Girkin
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Scaling Security Threat Detection with Apache Spark and Databricks
Scaling Security Threat Detection with Apache Spark and DatabricksScaling Security Threat Detection with Apache Spark and Databricks
Scaling Security Threat Detection with Apache Spark and DatabricksDatabricks
 
Big Data Certifications Workshop - 201711 - Introduction and Linux Essentials
Big Data Certifications Workshop - 201711 - Introduction and Linux EssentialsBig Data Certifications Workshop - 201711 - Introduction and Linux Essentials
Big Data Certifications Workshop - 201711 - Introduction and Linux EssentialsDurga Gadiraju
 
Spark Summit EU talk by Tim Hunter
Spark Summit EU talk by Tim HunterSpark Summit EU talk by Tim Hunter
Spark Summit EU talk by Tim HunterSpark Summit
 
Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Spark Summit
 
Stacked Ensembles in H2O
Stacked Ensembles in H2OStacked Ensembles in H2O
Stacked Ensembles in H2OSri Ambati
 

Tendances (20)

Apache Spark in Industry
Apache Spark in IndustryApache Spark in Industry
Apache Spark in Industry
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Spark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir VolkSpark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir Volk
 
Intro to Python for C# Developers
Intro to Python for C# DevelopersIntro to Python for C# Developers
Intro to Python for C# Developers
 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkResource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache Spark
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
Spark Core
Spark CoreSpark Core
Spark Core
 
Latest Developments in H2O
Latest Developments in H2OLatest Developments in H2O
Latest Developments in H2O
 
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
 
Pycon India 2017 - Big Data Engineering using Spark with Python (pyspark) - W...
Pycon India 2017 - Big Data Engineering using Spark with Python (pyspark) - W...Pycon India 2017 - Big Data Engineering using Spark with Python (pyspark) - W...
Pycon India 2017 - Big Data Engineering using Spark with Python (pyspark) - W...
 
Is there a SQL for NoSQL?
Is there a SQL for NoSQL?Is there a SQL for NoSQL?
Is there a SQL for NoSQL?
 
Scala ecosystem - Dublin Scala Meetup, Oct 2018
Scala ecosystem - Dublin Scala Meetup, Oct 2018Scala ecosystem - Dublin Scala Meetup, Oct 2018
Scala ecosystem - Dublin Scala Meetup, Oct 2018
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Scaling Security Threat Detection with Apache Spark and Databricks
Scaling Security Threat Detection with Apache Spark and DatabricksScaling Security Threat Detection with Apache Spark and Databricks
Scaling Security Threat Detection with Apache Spark and Databricks
 
Big Data Certifications Workshop - 201711 - Introduction and Linux Essentials
Big Data Certifications Workshop - 201711 - Introduction and Linux EssentialsBig Data Certifications Workshop - 201711 - Introduction and Linux Essentials
Big Data Certifications Workshop - 201711 - Introduction and Linux Essentials
 
Spark Summit EU talk by Tim Hunter
Spark Summit EU talk by Tim HunterSpark Summit EU talk by Tim Hunter
Spark Summit EU talk by Tim Hunter
 
Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0
 
Stacked Ensembles in H2O
Stacked Ensembles in H2OStacked Ensembles in H2O
Stacked Ensembles in H2O
 

En vedette

drill management system
drill management system  drill management system
drill management system sree navya
 
HUG Italy meet-up with Fabian Wilckens, MapR EMEA Solutions Architect
HUG Italy meet-up with Fabian Wilckens, MapR EMEA Solutions ArchitectHUG Italy meet-up with Fabian Wilckens, MapR EMEA Solutions Architect
HUG Italy meet-up with Fabian Wilckens, MapR EMEA Solutions ArchitectSpagoWorld
 
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...MapR Technologies
 
An Amzing Sermon
An Amzing SermonAn Amzing Sermon
An Amzing SermonManoj Jacob
 
bw23-nyfinalpresentation-verizon-130426104853-phpapp02
bw23-nyfinalpresentation-verizon-130426104853-phpapp02bw23-nyfinalpresentation-verizon-130426104853-phpapp02
bw23-nyfinalpresentation-verizon-130426104853-phpapp02Laurie Shook, MBA
 
History of internet
History of internetHistory of internet
History of internetUsman Sajid
 
Emerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomicsEmerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomicsmikaelhuss
 
Android Seminar || history || versions||application developement
Android Seminar || history || versions||application developement Android Seminar || history || versions||application developement
Android Seminar || history || versions||application developement Shubham Pahune
 
7 Steps to Rocking Your Brand on Social Media
7 Steps to Rocking Your Brand on Social Media7 Steps to Rocking Your Brand on Social Media
7 Steps to Rocking Your Brand on Social MediaKatia Millar
 
Mubasher, M Phil synoses seminar
Mubasher, M Phil synoses seminarMubasher, M Phil synoses seminar
Mubasher, M Phil synoses seminarMubasher Solangi
 
Jenis turbin dan nozzle beserta komponennya
Jenis turbin dan nozzle beserta komponennyaJenis turbin dan nozzle beserta komponennya
Jenis turbin dan nozzle beserta komponennyaNur Ilham
 
Data analytics challenges in genomics
Data analytics challenges in genomicsData analytics challenges in genomics
Data analytics challenges in genomicsmikaelhuss
 
Classifications of Triangles by Ricardo C. Lacsa
Classifications of Triangles by Ricardo C. LacsaClassifications of Triangles by Ricardo C. Lacsa
Classifications of Triangles by Ricardo C. LacsaRic Lacsa
 
2 6 rational function graphs
2 6 rational function graphs2 6 rational function graphs
2 6 rational function graphsLomasPreCalc
 
Diretrizes para elaboração de projetos ambientais
Diretrizes para elaboração de projetos ambientaisDiretrizes para elaboração de projetos ambientais
Diretrizes para elaboração de projetos ambientaisCBH Rio das Velhas
 

En vedette (20)

drill management system
drill management system  drill management system
drill management system
 
HUG Italy meet-up with Fabian Wilckens, MapR EMEA Solutions Architect
HUG Italy meet-up with Fabian Wilckens, MapR EMEA Solutions ArchitectHUG Italy meet-up with Fabian Wilckens, MapR EMEA Solutions Architect
HUG Italy meet-up with Fabian Wilckens, MapR EMEA Solutions Architect
 
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...
 
An Amzing Sermon
An Amzing SermonAn Amzing Sermon
An Amzing Sermon
 
Carreras de Caballos
Carreras de CaballosCarreras de Caballos
Carreras de Caballos
 
Question 7
Question 7Question 7
Question 7
 
Lectura 1 Los números Irracionales
Lectura 1 Los números Irracionales Lectura 1 Los números Irracionales
Lectura 1 Los números Irracionales
 
bw23-nyfinalpresentation-verizon-130426104853-phpapp02
bw23-nyfinalpresentation-verizon-130426104853-phpapp02bw23-nyfinalpresentation-verizon-130426104853-phpapp02
bw23-nyfinalpresentation-verizon-130426104853-phpapp02
 
History of internet
History of internetHistory of internet
History of internet
 
Emerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomicsEmerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomics
 
Android Seminar || history || versions||application developement
Android Seminar || history || versions||application developement Android Seminar || history || versions||application developement
Android Seminar || history || versions||application developement
 
7 Steps to Rocking Your Brand on Social Media
7 Steps to Rocking Your Brand on Social Media7 Steps to Rocking Your Brand on Social Media
7 Steps to Rocking Your Brand on Social Media
 
Mubasher, M Phil synoses seminar
Mubasher, M Phil synoses seminarMubasher, M Phil synoses seminar
Mubasher, M Phil synoses seminar
 
La emoción y el conocimiento van juntos
La emoción y el conocimiento van juntosLa emoción y el conocimiento van juntos
La emoción y el conocimiento van juntos
 
Jenis turbin dan nozzle beserta komponennya
Jenis turbin dan nozzle beserta komponennyaJenis turbin dan nozzle beserta komponennya
Jenis turbin dan nozzle beserta komponennya
 
Execuçao CBH Rio das Velhas
Execuçao CBH Rio das VelhasExecuçao CBH Rio das Velhas
Execuçao CBH Rio das Velhas
 
Data analytics challenges in genomics
Data analytics challenges in genomicsData analytics challenges in genomics
Data analytics challenges in genomics
 
Classifications of Triangles by Ricardo C. Lacsa
Classifications of Triangles by Ricardo C. LacsaClassifications of Triangles by Ricardo C. Lacsa
Classifications of Triangles by Ricardo C. Lacsa
 
2 6 rational function graphs
2 6 rational function graphs2 6 rational function graphs
2 6 rational function graphs
 
Diretrizes para elaboração de projetos ambientais
Diretrizes para elaboração de projetos ambientaisDiretrizes para elaboração de projetos ambientais
Diretrizes para elaboração de projetos ambientais
 

Similaire à Big data analysing genomics and the bdg project

Scala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big DataScala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big DataJohn Nestor
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 
Big Data tools in practice
Big Data tools in practiceBig Data tools in practice
Big Data tools in practiceDarko Marjanovic
 
Apache Cassandra training. Overview and Basics
Apache Cassandra training. Overview and BasicsApache Cassandra training. Overview and Basics
Apache Cassandra training. Overview and BasicsOleg Magazov
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...Simon Ambridge
 
New Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 EditionNew Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 EditionSri Ambati
 
IBM Strategy for Spark
IBM Strategy for SparkIBM Strategy for Spark
IBM Strategy for SparkMark Kerzner
 
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleScala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleDomino Data Lab
 
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...DataWorks Summit/Hadoop Summit
 
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014NoSQLmatters
 
Machine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLMachine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLArnab Biswas
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h basehdhappy001
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkDataWorks Summit/Hadoop Summit
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Hadoop enhancements using next gen IA technologies
Hadoop enhancements using next gen IA technologiesHadoop enhancements using next gen IA technologies
Hadoop enhancements using next gen IA technologiesBigdata Meetup Kochi
 

Similaire à Big data analysing genomics and the bdg project (20)

Scala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big DataScala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big Data
 
Spark Workshop
Spark WorkshopSpark Workshop
Spark Workshop
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
Big Data tools in practice
Big Data tools in practiceBig Data tools in practice
Big Data tools in practice
 
Apache Cassandra training. Overview and Basics
Apache Cassandra training. Overview and BasicsApache Cassandra training. Overview and Basics
Apache Cassandra training. Overview and Basics
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
 
New Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 EditionNew Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 Edition
 
DataOps with Project Amaterasu
DataOps with Project AmaterasuDataOps with Project Amaterasu
DataOps with Project Amaterasu
 
IBM Strategy for Spark
IBM Strategy for SparkIBM Strategy for Spark
IBM Strategy for Spark
 
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleScala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
 
Hadoop Introduction
Hadoop IntroductionHadoop Introduction
Hadoop Introduction
 
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
 
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
 
Machine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLMachine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkML
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h base
 
Spark
SparkSpark
Spark
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Hadoop enhancements using next gen IA technologies
Hadoop enhancements using next gen IA technologiesHadoop enhancements using next gen IA technologies
Hadoop enhancements using next gen IA technologies
 

Dernier

Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Onlineanilsa9823
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 

Dernier (20)

Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Determinants of health, dimensions of health, positive health and spectrum of...

Big Data: Analysing Genomics and the BDG Project

  • 1. BIG DATA: ANALYSING GENOMICS AND THE BDG PROJECT
       Sai Teja Vissamsetti (700645566), Sarika Batte (700647682), Chandana Sripathi (700641627), Krishna Chaitanya Koti (700648083), Krishna Chaitanya Gollavilli (700638821), Sree Navya Kovvuri (700645739), Sai Priyanka Reddy Addaboina (700648561)
       - Dr. Bo Li
  • 2. INTRODUCTION
       Next-generation DNA sequencing is rapidly transforming the life sciences into a data-driven field.
       • Traditional computational methods are difficult to use
       • More digitalised versions are being developed
  • 3. OVERVIEW of the Project
       • We show the experienced bioinformatician how to perform typical genomics tasks in the context of Spark.
       • The project comprises a set of genomics-specific Avro schemas, Spark-based APIs, and command-line tools for large-scale genomics analysis.
       • We introduce the general Spark user to a new set of Hadoop-friendly serialization and file formats.
  • 4. HADOOP
       • Free, Java-based programming framework
       • Runs on clusters of thousands of nodes handling thousands of terabytes
       • Rapid data transfer
       • Continues operating uninterrupted in case of node failure
       • Used by Google, Yahoo, IBM
       • Scalable, cost-effective, flexible, fast, resilient to failure
  • 5. Map Reduce
       • A software framework for writing applications that reliably process vast amounts of data on large clusters
       • Basic concept:
           • Divide: the input dataset is split into chunks, which are processed by map tasks in parallel
           • Sort: the outputs of the map tasks are sorted
           • Conquer: the sorted outputs are merged and given as input to the reduce tasks
       • Handles:
           • Scheduling
           • Data distribution
           • Synchronization
           • Errors and faults
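The divide, sort and conquer steps on this slide can be sketched in plain Python. This is a minimal single-process illustration, not Hadoop's API: the function names (`map_task`, `shuffle_and_sort`, `reduce_task`, `map_reduce`) and the three-chunk DNA input are assumptions made for the example, and a real cluster would run the map tasks in parallel on different nodes.

```python
from collections import defaultdict
from itertools import chain


def map_task(chunk):
    """Divide/map: emit a (base, 1) pair for every base in one input chunk."""
    return [(base, 1) for base in chunk]


def shuffle_and_sort(pairs):
    """Sort: group the emitted values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_task(key, values):
    """Conquer/reduce: merge all values for one key into a single result."""
    return key, sum(values)


def map_reduce(chunks):
    # Here the map tasks run one after another; a cluster would run them in parallel.
    mapped = chain.from_iterable(map_task(chunk) for chunk in chunks)
    return dict(reduce_task(key, values) for key, values in shuffle_and_sort(mapped))


# Hypothetical input: a short DNA sequence already split into three chunks.
chunks = ["ACGT", "AACC", "GTTA"]
base_counts = map_reduce(chunks)
print(base_counts)
```

Scheduling, data distribution, synchronization and fault handling are exactly the parts this sketch leaves out; those are what the framework supplies.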
  • 6. TRANSCRIPTION FACTOR
       • Also called a sequence-specific DNA-binding factor
       • Controls the rate at which genetic information is transcribed from DNA to messenger RNA
       • Larger genomes tend to have more transcription factors
  • 7. Data Types
       • GM12878 - Genetic variation studies
       • K562 - Erythropoiesis
       • HepG2 - Metabolism disorders
       • HEK293 - Embryonic kidney
       • H54 - Glioblastoma
       • BJ - Skin fibroblast
  • 8. DECOUPLING STORAGE
       • Bioinformaticians have their own specific file formats, for example: .fasta, .sam, .gtf, .narrowpeak, .vcf, etc.
       • Accessing similar data across these file formats is difficult
       • The formats are ASCII encoded
       • ASCII encoding is inefficient
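To make the point about ASCII inefficiency concrete, here is a small Python sketch that packs a DNA string at 2 bits per base instead of the 8 bits per base that ASCII text uses. It is only an illustration of the size difference: the `pack` and `unpack` helpers and the example sequence are assumptions, and this is not the Avro or Parquet encoding that the project itself relies on.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack an A/C/G/T string at 2 bits per base (4 bases per byte)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        group = seq[i:i + 4]
        byte = 0
        for base in group:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(group))  # left-align a short final group
        out.append(byte)
    return bytes(out)


def unpack(data, length):
    """Recover the first `length` bases from 2-bit packed data."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


seq = "ACGTACGTAC"                          # hypothetical 10-base sequence
ascii_size = len(seq.encode("ascii"))       # one byte per base
packed = pack(seq)                          # four bases per byte
print(ascii_size, len(packed), unpack(packed, len(seq)) == seq)
```

The sketch handles only the four unambiguous bases; real sequence files also carry ambiguity codes, quality scores and metadata, which is one reason a schema-based binary format is a better fit than a hand-rolled packing like this.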
  • 9. What is ADAM?
       • An open source, high-performance, distributed platform for genomic analysis
       • ADAM defines:
           • A data schema and layout on disk
           • A Scala API
           • A command-line interface
  • 10. SOFTWARE USED
       • VMware version 5.5 (Cloudera)
       • Java version 1.8
       • Tool: ADAM
       • Apache Avro
       • Spark
  • 11. • An in-memory, data-parallel computing framework
       • Optimized for iterative jobs, unlike Hadoop
       • Data is kept in memory unless it has to move between nodes
       • Presents a functional programming API, along with support for iterative programming
       • Used at scale on clusters with >2k nodes and 4 TB datasets
  • 12. WHY SPARK?
       • Current leading map-reduce framework:
           • First in-memory map-reduce platform
           • Used at scale in industry, supported in major distros: Cloudera, Hortonworks, MapR
       • The API:
           • Fully functional API
           • Main API in Scala, also supports Java, Python, R
           • Manages failures
  • 13. MAP REDUCE vs SPARK
       SPARK                        | MAP REDUCE
       Open source                  | Open source
       In memory, on disk           | On disk
       Can be written in Scala      | Can be written in Java
       API: Scala, Java, Python     | API: Java, Python, Scala
       Easy to program              | Difficult to program
       Doesn't need abstractions    | Needs abstractions
       Fewer security features      | More security features
  • 14. Querying Genotypes from the 1000 Genomes Project
       Ingesting the full 1000 Genomes genotype data set:
       • Download the raw data directly into HDFS
       • Unzip it in flight
       • Run an ADAM job to convert the data to Parquet
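The "unzip in flight" step can be illustrated with a short Python sketch that decompresses a stream as it is read, without first writing the compressed file to disk. Everything here is an assumption for illustration: the data is assumed to be gzip-compressed VCF text, the three-line sample is made up, and an in-memory buffer stands in for the download. Writing to HDFS and the ADAM conversion to Parquet are not shown.

```python
import gzip
import io


def stream_lines(compressed_stream):
    """Yield text lines while decompressing, without unzipping to disk first."""
    with gzip.open(compressed_stream, mode="rt", encoding="ascii") as text:
        for line in text:
            yield line.rstrip("\n")


# Stand-in for a remote download: a tiny, made-up VCF-like file, gzip-compressed in memory.
raw = "##fileformat=VCFv4.1\n#CHROM\tPOS\tID\n20\t14370\trs6054257\n"
download = io.BytesIO(gzip.compress(raw.encode("ascii")))

# Keep only data rows; header lines in VCF start with '#'.
records = [line for line in stream_lines(download) if not line.startswith("#")]
print(records)
```

Because `stream_lines` is a generator, lines become available as soon as enough of the stream has been decompressed, so the next stage can start consuming them before the download has finished.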