Puneet Suri, Thermo Fisher Scientific 
Shakila Pothini, Thermo Fisher Scientific 
Sami Zuhuruddin, Amazon Web Services 
November 12, 2014 | Las Vegas, NV 
HLS402 
Getting into Your Genes: The definitive guide to using Amazon EMR, Amazon ElastiCache, and Amazon S3 to Deliver High-Performance Scientific Applications
About me 
Puneet Suri 
Senior Director, Software Engineering 
Life Sciences Group, Thermo Fisher Scientific 
follow at: @psuriconnect at: puneet.suri@thermofisher.com 
Envisionedanddeveloped the life sciences cloud platform for Thermo Fisher Scientific
This is why we are here…
Having an impact… 
A person was set free after 35 years in prison because of a DNA test 
Freeing the innocent 
Surviving Cancer 
A person survived pancreatic cancer thanks to a genetic approach that allowed an oncologist to focus on a specific cancer cell 
Ebola
H1N1: Pandemic declared in April 2009
Need to enable this at larger scale & impact more lives
Customer needs… 
store & manage large scientific data sets
A few years back
Our offerings 
desktop applications 
challenges with upgrade cycle, versions etc. 
limited storage and compute capacity 
to analyze complex & large data sets 
no sharing & collaboration 
no backup, archive & security
A better way… is to provide 
STORAGE 
COMPUTE 
SCALABILITY 
MEMORY
Our vision
A deep dive into our story
A day with the scientist 
Get Insights 
a project 
Insights… 
•what is causing cancer 
•what drugs will work 
•is therapy working
Customer pain points 
•existing solutions cannot address the complexities 
•Excel is used, painfully, to analyze data manually 
•multiple tools are needed to reach the final insight 
•it takes days to analyze the data 
•some analysis workflows are not possible at all
Dimensions of complexity… 
millions of records 
thousands of users, projects 
real time analysis of large datasets 
2-3 seconds response time 
project 
storage 
compute 
performance 
scalability
Our journey enabling complex customer workflows
Our iterative journey & challenges 
0. start with reference architecture 
1. identify scalable storage solution 
2. identify scalable storage solution for large data items 
3. identify solutions for real time response & queries 
4. identify solutions for real time analysis of data
Reference web architecture 
[Architecture diagram: user client, Internet, Route 53 (DNS routing), CloudFront (CDN), load balancers, auto-scaling web servers and auto-scaling app servers in two groups labeled A and B, Amazon RDS master with synchronous replication to an Amazon RDS standby]
Why relational DB was not considered 
•with projected data and user growth over the years (hundreds of TBs), the required real-time query performance would be very hard to achieve 
•needed managed scalability without sharding/re-sharding overhead and disruptions 
•needed a loose schema to seamlessly enable new and cross-domain workflows
Our iterative journey & challenges 
0. start with reference architecture 
1. identify scalable storage solution 
2. identify scalable storage solution for large data items 
3. identify solutions for real time response & queries 
4. identify solutions for real time analysis of data
NoSQL was the way to go 
•managed scalability 
•near zero administration overhead 
•query performance not impacted by table size; can add billions of rows 
•simple and flexible schema: new domains can be supported 
•extremely fast read/write performance
Architecture with DynamoDB 
[Architecture diagram: user client, Internet, Route 53 (DNS routing), CloudFront (CDN), load balancers, auto-scaling web servers and auto-scaling app servers in two groups labeled A and B, Amazon DynamoDB]
What worked well with DynamoDB 
Managed Service with flexible schema 
Managed Scalability 
Extremely fast access in order of milliseconds 
READ/WRITE
Iteration 1 
Data entities: Instrument Run (1000s), Patient Samples (1000s), Genes (1000s), Raw Signals (millions), Analysis Results (millions) 
Data sizes shown: MBs to GBs 
             Storage  Query 
Performance  ✔        ✔ 
Cost         ✔        ✔ 
What were the gaps 
our items (e.g. Instrument Run) were larger than 400 KB 
(DynamoDB item size limit: 64 KB, since raised to 400 KB) 
hot hash key & batch size limitations 
•adding thousands of related records (e.g. Raw Signals) with a common hash key (e.g. Instrument Run) can be slow (tens of seconds) 
•a large project can have ~1 million records (e.g. Raw Signals) that need to be read & written 
for a large project, high read/write capacity (1000s of units) was needed 
(increased cost due to high READ/WRITE capacity needs)
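To make the size and batch gaps concrete, here is a small back-of-the-envelope sketch. The 400 KB item limit appears on the slide; the 25-items-per-batch figure is the documented `BatchWriteItem` limit and is an assumption brought in from outside the talk, as is the 2 MB example size.

```python
import math

# Assumed limits: 400 KB comes from the slide; 25 items per batch is the
# documented BatchWriteItem limit, not something stated in the talk.
MAX_ITEM_BYTES = 400 * 1024
MAX_BATCH_ITEMS = 25


def fits_in_item(size_bytes: int) -> bool:
    """Return True if a payload of this size fits in a single item."""
    return size_bytes <= MAX_ITEM_BYTES


def batch_calls_needed(record_count: int) -> int:
    """Number of batch-write calls needed to write record_count items."""
    return math.ceil(record_count / MAX_BATCH_ITEMS)


if __name__ == "__main__":
    # A hypothetical 2 MB instrument run does not fit in one item.
    print(fits_in_item(2 * 1024 * 1024))
    # A large project with ~1 million raw-signal records.
    print(batch_calls_needed(1_000_000))
```

Under these assumptions a one-million-record project needs tens of thousands of batch calls, which is consistent with the cost and latency concerns listed above.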
What we needed 
A solution that 
•can store a huge number of related objects 
•is cost effective for reading/writing large data sets 
•has no limitations on batch size or item size 
•can query into the large number of records
Our iterative journey & challenges 
0. start with reference architecture 
1. identify scalable storage solution: DynamoDB 
2. identify scalable storage solution for large data items 
3. identify solutions for real time response & queries 
4. identify solutions for real time analysis of data
Architecture with DynamoDB & S3 
[Architecture diagram: user client, Internet, Route 53 (DNS routing), CloudFront (CDN), load balancers, auto-scaling web servers and auto-scaling app servers in two groups labeled A and B, DynamoDB, Amazon S3]
Iteration 2 
Data entities: Instrument Run (1000s), Patient Samples (1000s), Genes (1000s), Raw Signals (millions), Analysis Results (millions) 
Data sizes shown: MBs to GBs 
             Storage  Query 
Performance  ✔        ✔ 
Cost         ✔        ✔ 
Architecture with DynamoDB & S3 
•DynamoDB was used to store small unrelated objects (KBs) 
•these will grow to a large number (e.g. Data Files) 
•Amazon S3 was used to store related larger objects (e.g. Raw Signals & Analysis Results, GBs) 
•each set is stored as a single Amazon S3 object, serialized using Google Protocol Buffers (protobuf) 
•Amazon S3 was cost effective for storing huge objects
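The split described above can be sketched as a small tiered store. This is only an illustration under stated assumptions: plain dictionaries stand in for the DynamoDB table and the S3 bucket, `json` stands in for protobuf serialization, and the 64 KB routing threshold and the pointer entry kept in the table are hypothetical choices, not details from the talk.

```python
import json

SMALL_OBJECT_LIMIT = 64 * 1024  # hypothetical routing threshold, in bytes


class TieredStore:
    """Small objects go to a table; large ones go to a bucket as one blob."""

    def __init__(self):
        self.table = {}   # stand-in for a DynamoDB table
        self.bucket = {}  # stand-in for an S3 bucket

    def put(self, key, obj):
        # json is a stand-in for the protobuf serialization the talk mentions.
        blob = json.dumps(obj).encode("utf-8")
        if len(blob) <= SMALL_OBJECT_LIMIT:
            self.table[key] = {"inline": blob}
        else:
            s3_key = f"objects/{key}"
            self.bucket[s3_key] = blob
            # Assumption: the table keeps a pointer to the large object.
            self.table[key] = {"s3_key": s3_key}

    def get(self, key):
        entry = self.table[key]
        if "inline" in entry:
            blob = entry["inline"]
        else:
            blob = self.bucket[entry["s3_key"]]
        return json.loads(blob.decode("utf-8"))


if __name__ == "__main__":
    store = TieredStore()
    store.put("data-file-1", {"name": "plate_01", "wells": 384})
    store.put("raw-signals-1", {"signals": list(range(50_000))})
    print(sorted(store.bucket))  # only the large object lands in the bucket
```

Keeping all related records in one serialized object trades per-record queryability for cheap bulk reads and writes, which is the gap the next iteration addresses.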
Real time queries for complex visualizations
What we needed 
•complex visualizations require gigabytes of data to be queried in 2-3 seconds and presented to the user 
•visualizations are highly interactive and require constant updates of data, so reads & writes must be quick 
•support concurrent access without any degradation in query performance
Our iterative journey & challenges 
0. start with reference architecture 
1. identify scalable storage solution: DynamoDB 
2. identify scalable storage solution for large data items: DynamoDB + Amazon S3 
3. identify solutions for fast real time response & queries 
4. identify solutions for real time analysis of data
Distributed in-memory storage was the way to go 
reads/writes have to be quick to enable fast response times; reading & writing directly from Amazon S3 was not ideal. 
•ElastiCache was used as IN-MEMORY storage on top of DynamoDB & Amazon S3 
•all related serialized objects in Amazon S3 that customers access are maintained in ElastiCache as individual records 
•indexes are created in DynamoDB based on the query pattern, so that data can be easily retrieved from ElastiCache
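One way to picture the layering above is a cache-aside reader: on first access a project's serialized object is loaded once, exploded into individual cached records, and indexed by the query key. This is a minimal sketch, not the authors' implementation; dictionaries stand in for S3, ElastiCache, and the DynamoDB index, `json` stands in for protobuf, and the `(project, gene)` query pattern is a hypothetical example.

```python
import json


class CachedProjectReader:
    def __init__(self, bucket):
        self.bucket = bucket  # stand-in for S3: project -> serialized records
        self.cache = {}       # stand-in for ElastiCache: record key -> record
        self.index = {}       # stand-in for a DynamoDB index: (project, gene) -> record keys
        self.warmed = set()   # projects already loaded into the cache
        self.s3_reads = 0     # counts how often the slow path is taken

    def _warm(self, project):
        """Load the project's blob once and store each record individually."""
        self.s3_reads += 1
        records = json.loads(self.bucket[project])
        for i, record in enumerate(records):
            record_key = f"{project}:{i}"
            self.cache[record_key] = record
            self.index.setdefault((project, record["gene"]), []).append(record_key)
        self.warmed.add(project)

    def query(self, project, gene):
        """Return all cached records for a gene, warming the cache on a miss."""
        if project not in self.warmed:
            self._warm(project)
        keys = self.index.get((project, gene), [])
        return [self.cache[k] for k in keys]


if __name__ == "__main__":
    blob = json.dumps([
        {"gene": "TP53", "signal": 1.5},
        {"gene": "KRAS", "signal": 0.7},
        {"gene": "TP53", "signal": 2.5},
    ])
    reader = CachedProjectReader({"project-1": blob})
    print(reader.query("project-1", "TP53"))
    print(reader.query("project-1", "KRAS"))
    print(reader.s3_reads)  # the blob is read only once
```

After the first query, every lookup is served from memory, which is what makes a 2-3 second interactive response plausible for large projects.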
Architecture with DynamoDB, Amazon S3 & ElastiCache 
[Architecture diagram: user client, Internet, Route 53 (DNS routing), CloudFront (CDN), load balancers, auto-scaling web servers and auto-scaling app servers in two groups labeled A and B, DynamoDB, Amazon S3, ElastiCache]
Iteration 3 
Data entities: Instrument Run (1000s), Patient Samples (1000s), Genes (1000s), Raw Signals (millions), Analysis Results (millions), plus indexes 
Data sizes shown: MBs to GBs 
             Storage  Query 
Performance  ✔        ✔ 
Cost         ✔        ✔ 
Need for real time data analysis 
•analyze huge projects containing thousands of patient samples in minutes instead of days 
•a scalable solution is required to support analysis requests from thousands of users 
•existing desktop algorithms used for this analysis not optimized for extracting parallelism in data
Analysis solutions in desktop 
[Chart: desktop analysis time in minutes vs. # of records; annotation: "desktop crashes"] 
# of records   minutes (desktop) 
90000          8 
180000         20 
270000         40 
360000         80 
450000         120 
675000         200 
900000         320 
Excel nightmare
Our iterative journey & challenges 
0. start with reference architecture 
1. identify scalable storage solution: DynamoDB 
2. identify scalable storage solution for large data items: DynamoDB + Amazon S3 
3. identify solutions for fast real time response & queries: DynamoDB + Amazon S3 + ElastiCache 
4. identify solutions for real time analysis of data
Amazon EMR was the way to go 
1. EMR was used to perform real time analysis of huge data sets: results in minutes instead of days 
2. all small jobs are analyzed in-memory, while big ones are sent to Amazon EMR 
3. existing algorithms were overhauled to derive massive parallelism using the Hadoop map-reduce framework 
4. as the large datasets were already in Amazon S3, Amazon S3 was used for input and output instead of HDFS; only intermediate map-reduce data lives in HDFS 
5. the Amazon EMR cluster is created on demand and shut down when done
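The steps above can be sketched in plain Python: a size-based router that decides where a job runs, and a tiny in-process map-reduce that shows the shape of a parallelizable analysis. Everything specific here is an assumption for illustration: the 100,000-record threshold, the record fields, and the per-gene mean, which is a stand-in for the real (unpublished) algorithms.

```python
from collections import defaultdict

IN_MEMORY_LIMIT = 100_000  # hypothetical record-count threshold


def choose_backend(record_count):
    """Small jobs run in-memory; big ones would be submitted to a cluster."""
    return "in-memory" if record_count <= IN_MEMORY_LIMIT else "emr"


def mapper(record):
    """Emit (gene, signal) pairs; each record is processed independently."""
    yield record["gene"], record["signal"]


def reducer(gene, signals):
    """Combine all signals for one gene; here, a simple mean."""
    return gene, sum(signals) / len(signals)


def run_map_reduce(records):
    """Single-process imitation of the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for record in records:                # map phase
        for key, value in mapper(record):
            groups[key].append(value)     # shuffle: group values by key
    return dict(reducer(k, v) for k, v in groups.items())  # reduce phase


if __name__ == "__main__":
    records = [
        {"gene": "TP53", "signal": 1.0},
        {"gene": "TP53", "signal": 3.0},
        {"gene": "KRAS", "signal": 2.0},
    ]
    print(choose_backend(len(records)))
    print(run_map_reduce(records))
```

Because the mapper touches one record at a time and the reducer one key at a time, the same two functions could in principle be distributed across many nodes, which is the property the overhauled algorithms needed.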
Architecture with EMR for real time analysis 
[Architecture diagram: user client, Internet, Route 53 (DNS routing), CloudFront (CDN), load balancers, auto-scaling web servers and auto-scaling app servers in two groups labeled A and B, DynamoDB, Amazon S3, ElastiCache, EMR]
Iteration 4 
Data entities: Instrument Run (1000s), Patient Samples (1000s), Genes (1000s), Raw Signals (millions), Analysis Results (millions) 
Data sizes shown: MBs to GBs 
             Storage  Query  Analysis 
Performance  ✔        ✔      ✔ 
Cost         ✔        ✔      ✔ 
Performance for a project 
[Chart: analysis time in minutes vs. # of records, cloud vs. desktop; annotations: ">10x", "crashes"] 
# of records   minutes (cloud) 
90000          2 
180000         4 
270000         7 
360000         11 
450000         13 
675000         20 
900000         30
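The ">10x" annotation can be checked with simple arithmetic. The sketch below pairs the data labels from the desktop chart (8 to 320 minutes) and the cloud chart (2 to 30 minutes) with the record counts on the shared x-axis; that pairing is a reading of the two charts, not a table given in the talk.

```python
# Data labels as read from the two charts (the pairing is an assumption).
RECORDS = [90_000, 180_000, 270_000, 360_000, 450_000, 675_000, 900_000]
DESKTOP_MINUTES = [8, 20, 40, 80, 120, 200, 320]
CLOUD_MINUTES = [2, 4, 7, 11, 13, 20, 30]


def speedups(desktop, cloud):
    """Ratio of desktop time to cloud time at each project size."""
    return [d / c for d, c in zip(desktop, cloud)]


if __name__ == "__main__":
    for n, s in zip(RECORDS, speedups(DESKTOP_MINUTES, CLOUD_MINUTES)):
        print(f"{n:>7} records: {s:.1f}x faster in the cloud")
```

On this reading the advantage grows with project size and passes 10x at the largest project.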
Journey 
0. start with reference architecture ✓ 
1. identify scalable storage solution: DynamoDB ✓ 
2. identify scalable storage solution for large data items: DynamoDB + Amazon S3 ✓ 
3. identify solutions for fast real time response & queries: DynamoDB + Amazon S3 + ElastiCache ✓ 
4. identify solutions for real time analysis of data: Amazon EMR ✓
Learnings 
About me : Shakila Pothini 
Senior Manager, Cloud Apps 
Life Sciences Group, 
Thermo Fisher Scientific 
Hiking is my ONLY stress buster 
Entertain to Educate. 
Cofounder of performing arts group (swaram.org) 
Mostly left brained with 
occasional sense of creativity 
How to get into your gene? 
sequence the entire human transcriptome (30,000 genes) 
identify significant genes 
(100+ genes) 
validate & reconfirm the 
(20+ genes) 
do it on more samples & different population 
find the way the genes interplay in the pathway 
understand cancer diversity. 
types of therapy. 
drug-able genes.
Demo
Demo summary 
non cancerous sample 
cancerous sample 
difference in expression of genes
Customer feedback 
“My initial SymphoniSuite evaluation experience was good, GUI/ controls are intuitive and data upload/ analysis was fast and user friendly” 
UPENN 
“I enjoy processing hundreds of open array plates with ease.”, “I appreciate the rapid access of the large number of amplification curves ” 
Sanofi 
“I wanted to let you know that Symphoni has been working well for me. I have done analysis using as high as 500 files.” 
ASU 
“This I see value in... utilizing these features. I appreciate the speed of data processing and visuals.” 
LUMC
Yearly checkup today 
165 / 105 
120 
50 / 90 
104
Is this really going to detect early stages of cancer?
A few years from now : every person 
ATGCATGCTATCAATTGCCC 
Sequence 
melanoma 
health risks 
drug response 
powered by AWS 
lifecloud
Yearly check-ups a few years from now 
ATGCATGC ATTGCCC 
ATGCATGC ATTGCCC 
TATCA 
GCATG 
lifecloud 
ATGCATGCTATCAATTGCCC 
Sequence
Yearly check-ups a few years from now (cont’d) 
cancer 
any clinical 
trial? 
health risks 
drug response 
ATGCATGCTATCAATTGCCC 
Sequence 
lifecloud 
prescribe the 
right drug
Puneet Suri 
Senior Director, Software engineering 
puneet.suri@thermofisher.com 
T: 650.266.5857 @psuri 
Shakila Pothini 
Senior Manager, Cloud Applications 
shakila.pothini@thermofisher.com 
Salil Kumar 
Cloud Architect 
T: 650.740.1646 @salilkum
[Diagram: from Data to Answers, with stages Collect / Ingest, Store, Process / Analyze, and Visualize / Report; services shown: Kinesis, EMR, EC2, Redshift, Data Pipeline, Glacier, S3, DynamoDB, RDS]
Experiment 1 
Data Access 
Compute Time 
Experiment 2 
Experiment 3 
Data Access 
Compute Time 
Data Access 
Compute Time 
✔ 
✔ 
✔
EMR 
Cluster 
EC2 
Instance 
Data Temperature
Please give us your feedback on this session. 
Complete session evaluations and earn re:Invent swag. 
http://bit.ly/awsevals 
Puneet Suri 
Senior Director, Software engineering 
puneet.suri@thermofisher.com 
T: 650.266.5857 @psuri 
Shakila Pothini 
Senior Manager, Cloud Applications 
shakila.pothini@thermofisher.com 
T: 650.554.2190 
Salil Kumar 
Cloud Architect 
T: 650.740.1646 @salilkum

(HLS402) Getting into Your Genes: The Definitive Guide to Using Amazon EMR, Amazon ElastiCache, and Amazon S3 to Deliver High-Performance, Scientific Applications | AWS re:Invent 2014

  • 1. Puneet Suri, Thermo Fisher Scientific Shakila Pothini, Thermo Fisher Scientific Sami Zuhuruddin, Amazon Web Services November 12, 2014 | Las Vegas, NV HLS402 Getting into Your Genes: The definitive guide to using Amazon EMR, Amazon ElastiCache, and Amazon S3 to Deliver High-Performance Scientific Applications
  • 2. About me Puneet Suri Senior Director, Software Engineering Life Sciences Group, Thermo Fisher Scientific follow at: @psuri connect at: puneet.suri@thermofisher.com Envisioned and developed the life sciences cloud platform for Thermo Fisher Scientific
  • 3. This is why we are here…
  • 4. Having an impact… A person was set free after 35 years in prison because of a DNA test Freeing the innocent Surviving Cancer A person survived pancreatic cancer thanks to a genetic approach that allowed an oncologist to focus on a specific cancer cell Ebola
  • 5. H1N1: Pandemic declared in April 2009
  • 6. Need to enable this at larger scale & impact more lives
  • 7. Customer needs… store & manage large scientific data sets
  • 8. A few years back
  • 9. Our offerings desktop applications challenges with upgrade cycle, versions etc. limited storage and compute capacity to analyze complex & large data sets no sharing & collaboration no backup, archive & security
  • 10. A better way… is to provide STORAGE COMPUTE SCALABILITY MEMORY
  • 12. A deep dive into our story
  • 13. A day with the scientist Get Insights a project
  • 14. Insights… •what is causing cancer •what drugs will work •is therapy working
  • 15. Customer pain points • existing solutions cannot address the complexities • Excel is used painfully to manually analyze data • multiple tools are used to get the final insight • it takes days to analyze the data • some analysis workflows are not possible
  • 16. Dimensions of complexity… millions of records thousands of users, projects real time analysis of large datasets 2-3 seconds response time project storage compute performance scalability
  • 17. Our journey enabling complex customer workflows
  • 18. Our iterative journey & challenges 0 start with reference architecture 1 identify scalable storage solution 2 identify scalable storage solution for large data items 3 identify solutions for real time response & queries 4 identify solutions for real time analysis of data
  • 19. Reference web architecture A B User Client Internet DNS Routing : Route 53 AUTO SCALING WEB SERVERS AUTO SCALING APP SERVERS Amazon RDS MASTER Amazon RDS STANDBY Synchronous Replication Load Balancers Load Balancers WEB SERVERS CDN: CloudFront APP SERVERS
  • 20. Why relational DB was not considered • based on projected data and user growth over the years (hundreds of TBs), the required real-time query performance would be very hard to achieve • needed managed scalability without sharding/re-sharding overhead and disruptions • needed a loose schema to seamlessly enable new and cross-domain workflows
  • 21. Our iterative journey & challenges 0 start with reference architecture 1 identify scalable storage solution 2 identify scalable storage solution for large data items 3 identify solutions for real time response & queries 4 identify solutions for real time analysis of data
  • 22. NoSQL was the way to go • managed scalability • near zero administration overhead • query performance not impacted by table size; can add billions of rows • simple and flexible schema – new domains can be supported • extremely fast read/write performance
  • 23. Architecture with DynamoDB A B User Client Internet DNS Routing : Route 53 AUTO SCALING WEB SERVERS AUTO SCALING APP SERVERS Load Balancers Load Balancers WEB SERVERS CDN: CloudFront APP SERVERS Auto Scaling Amazon DynamoDB
  • 24. What worked well with DynamoDB Managed Service with flexible schema Managed Scalability Extremely fast access in order of milliseconds READ/WRITE
  • 25. Iteration 1 GBs GBs MBs MBs Instrument Run (1000s) Patient Samples (1000s) Genes (1000s) Raw Signals (millions) Analysis Results (millions) Storage Query Performance ✔ ✔ Cost ✔ ✔ Get Insights     project
  • 26. What were the gaps • our item attribute (e.g. Instrument Run) sizes ranged above 400 KB (item size limit of 64 KB, later raised to 400 KB) • hot hash key & batch size limitations • adding thousands of related records (e.g. Raw Signals) with a common hash key (e.g. Instrument Run) can be slow (tens of seconds) • a large project can have ~1 million records (e.g. Raw Signals) that need to be read & written • for a large project, high read/write capacity (1000s) was needed (increased cost due to high READ/WRITE capacity needs)
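A rough worked example of the capacity gap on slide 26. It assumes standard DynamoDB provisioned-throughput accounting, where one write capacity unit (WCU) covers one write per second for an item of up to 1 KB. The 2 KB record size and the capacity levels are illustrative assumptions, not figures from the talk.

```python
import math


def write_units_per_item(item_size_bytes: int) -> int:
    """Write capacity units consumed by one item: 1 WCU per 1 KB, rounded up."""
    return max(1, math.ceil(item_size_bytes / 1024))


def seconds_to_write(num_items: int, item_size_bytes: int, provisioned_wcu: int) -> float:
    """Best-case time to write all items at a given provisioned write capacity."""
    total_units = num_items * write_units_per_item(item_size_bytes)
    return total_units / provisioned_wcu


# Hypothetical project: 1 million raw-signal records of 2 KB each.
for wcu in (100, 1000, 5000):
    secs = seconds_to_write(1_000_000, 2048, wcu)
    print(f"{wcu:>5} WCU -> {secs / 60:.1f} minutes")
```

Under these assumptions, keeping a bulk write in the range of minutes needs capacity in the thousands of units, which is consistent with the cost concern the slide raises.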
  • 27. What we needed A solution that • can store a huge number of related objects • is cost effective to read/write large data sets • has no limitations on batch size or item size • can query into the large number of records
  • 28. Our iterative journey & challenges 0 start with reference architecture 1 identify scalable storage solution : DynamoDB 2 identify scalable storage solution for large data items 3 identify solutions for real time response & queries 4 identify solutions for real time analysis of data
  • 29. Architecture with DynamoDB & S3 A B User Client Internet DNS Routing : Route 53 AUTO SCALING WEB SERVERS AUTO SCALING APP SERVERS Load Balancers Load Balancers WEB SERVERS CDN: CloudFront APP SERVERS Auto Scaling DynamoDB Amazon S3
  • 30. MBs MBs GBs Iteration 2 GBs Instrument Run (1000s) Patient Samples (1000s) Genes (1000s) Raw Signals (millions) Analysis Results (millions) Storage Query Performance ✔ ✔ Cost ✔ ✔ Get Insights    
  • 31. Architecture with DynamoDB & S3 • DynamoDB was used to store small unrelated objects (KB) (these will grow to a large number, e.g. Data Files) • Amazon S3 was used to store related larger objects (e.g. Raw Signals & Analysis Results (GB)), stored as a single Amazon S3 object serialized using Google protobuf • Amazon S3 was cost effective for storing huge objects
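The split on slide 31 can be sketched as follows. This is a minimal illustration, not the team's implementation: plain dicts stand in for DynamoDB and Amazon S3, JSON stands in for protobuf serialization, and the `Storage` class, its method names, and the pointer record kept in the table are all hypothetical.

```python
import json

ITEM_LIMIT_BYTES = 400 * 1024  # the 400 KB item limit mentioned on slide 26


class Storage:
    """In-memory stand-ins: `table` plays DynamoDB, `bucket` plays Amazon S3."""

    def __init__(self):
        self.table = {}
        self.bucket = {}

    def put_small(self, key, item):
        """Store a small, standalone item directly in the table."""
        encoded = json.dumps(item).encode("utf-8")
        if len(encoded) > ITEM_LIMIT_BYTES:
            raise ValueError("item too large for the table; bundle it instead")
        self.table[key] = item

    def put_bundle(self, run_id, kind, records):
        """Serialize all related records into ONE object; keep a pointer in the table."""
        object_key = f"{run_id}/{kind}.bin"
        self.bucket[object_key] = json.dumps(records).encode("utf-8")
        self.table[(run_id, kind)] = {"object_key": object_key, "count": len(records)}
        return object_key

    def get_bundle(self, run_id, kind):
        """Follow the pointer and deserialize the whole bundle."""
        pointer = self.table[(run_id, kind)]
        return json.loads(self.bucket[pointer["object_key"]].decode("utf-8"))


store = Storage()
store.put_small("run-1", {"instrument": "demo", "samples": 96})
signals = [{"well": i, "signal": i * 0.5} for i in range(10_000)]
store.put_bundle("run-1", "raw_signals", signals)
print(len(store.get_bundle("run-1", "raw_signals")))
```

Writing ten thousand related records as one object costs a single object write instead of ten thousand item writes, which is the cost argument the slide makes.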
  • 32. Real time queries for complex visualizations
  • 33. What we needed • complex visualizations require gigabytes of data to be queried in 2-3 secs and presented to the user • visualizations are highly interactive and require constant data updates; need quick reads & writes • support concurrent access without any degradation in query performance
  • 34. Our iterative journey & challenges 0 start with reference architecture 1 identify scalable storage solution : DynamoDB 2 identify scalable storage solution for large data items : DynamoDB + Amazon S3 3 identify solutions for fast real time response & queries 4 identify solutions for real time analysis of data
  • 35. Distributed in-memory storage was the way to go • reads/writes have to be quick to enable fast response times; reading & writing from Amazon S3 was not ideal • ElastiCache was used as IN-MEMORY storage on top of DynamoDB & Amazon S3 • all related serialized objects in Amazon S3 accessed by customers are maintained in ElastiCache as individual records • indexes are created in DynamoDB based on the query pattern so that data can be easily retrieved from ElastiCache
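The layering on slide 35 resembles a cache-aside read path, sketched below under stated assumptions: dicts stand in for Amazon S3, ElastiCache and a DynamoDB index, JSON stands in for the serialized format, and the `CachedProject` class, the record fields and the gene names are purely illustrative.

```python
import json


class CachedProject:
    """Cache-aside reads; dicts stand in for Amazon S3, ElastiCache and DynamoDB."""

    def __init__(self, bucket):
        self.bucket = bucket   # stand-in for Amazon S3: object key -> serialized bytes
        self.cache = {}        # stand-in for ElastiCache: record key -> record
        self.index = {}        # stand-in for a DynamoDB index: (object key, gene) -> record keys
        self.warmed = set()    # object keys already fanned out into the cache
        self.object_loads = 0  # counts how often the slow object store is read

    def _warm(self, object_key):
        """Read one serialized object and fan it out into individual cached records."""
        self.object_loads += 1
        for record in json.loads(self.bucket[object_key].decode("utf-8")):
            record_key = f"{object_key}#{record['id']}"
            self.cache[record_key] = record
            self.index.setdefault((object_key, record["gene"]), []).append(record_key)
        self.warmed.add(object_key)

    def query_by_gene(self, object_key, gene):
        """Serve from the cache, loading the object first only if it is not cached yet."""
        if object_key not in self.warmed:
            self._warm(object_key)
        return [self.cache[k] for k in self.index.get((object_key, gene), [])]


records = [{"id": i, "gene": "TP53" if i % 2 else "BRCA1", "ct": 20 + i} for i in range(6)]
bucket = {"project-1/results.bin": json.dumps(records).encode("utf-8")}
project = CachedProject(bucket)
hits = project.query_by_gene("project-1/results.bin", "TP53")
project.query_by_gene("project-1/results.bin", "BRCA1")
print(len(hits), project.object_loads)
```

The second query is answered entirely from the in-memory records, so the large object is deserialized once per project rather than once per interaction.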
  • 36. Architecture with DynamoDB, Amazon S3 & ElastiCache A B User Client Internet DNS Routing : Route 53 AUTO SCALING WEB SERVERS AUTO SCALING APP SERVERS Load Balancers Load Balancers WEB SERVERS CDN: CloudFront APP SERVERS Auto Scaling DynamoDB Amazon S3 ElastiCache
  • 37. Iteration 3 MBs MBs GBs GBs Instrument Run (1000s) Patient Samples (1000s) Genes (1000s) Raw Signals (millions) Analysis Results (millions) Storage Query Performance ✔ ✔ Cost ✔ ✔ indexes Get Insights    
  • 38. Need for real time data analysis •analyze huge projects containing thousands of patient samples in minutes instead of days •a scalable solution is required to support analysis requests from thousands of users •existing desktop algorithms used for this analysis not optimized for extracting parallelism in data
  • 39. Analysis solutions in desktop [chart: analysis time in minutes (axis 0 to 350) vs. # of records (90,000 to 900,000); desktop data labels: 8, 20, 40, 80, 120, 200, 320; annotation: desktop crashes] Get Insights
  • 41. Our iterative journey & challenges 0 start with reference architecture 1 identify scalable storage solution : DynamoDB 2 identify scalable storage solution for large data items : DynamoDB + Amazon S3 3 identify solutions for fast real time response & queries : DynamoDB + Amazon S3 + ElastiCache 4 identify solutions for real time analysis of data
  • 42. Amazon EMR was the way to go 1. EMR was used to perform real time analysis of huge data sets – results in minutes instead of days 2. all small jobs are analyzed in-memory while big ones are sent to Amazon EMR 3. existing algorithms were overhauled to derive massive parallelism using the Hadoop map-reduce framework 4. as large datasets were already in Amazon S3, Amazon S3 was used for input and output instead of HDFS – only intermediate map-reduce data is in HDFS 5. the Amazon EMR cluster is created on demand and shut down when done
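Points 2 and 3 on slide 42 can be illustrated with a small sketch. The talk does not describe the actual algorithms or the size cutoff, so the threshold, the record fields and the mean-signal-per-gene computation below are stand-in assumptions; the code only shows the map/reduce shape in plain Python and does not call Amazon EMR or Hadoop.

```python
from collections import defaultdict

IN_MEMORY_LIMIT = 100_000  # hypothetical cutoff; the talk does not state one


def choose_backend(num_records: int) -> str:
    """Route small jobs to in-memory analysis and large ones to a cluster."""
    return "in-memory" if num_records <= IN_MEMORY_LIMIT else "emr"


def map_record(record):
    """Map step: emit (gene, (signal, 1)) so partial sums can be combined later."""
    return record["gene"], (record["signal"], 1)


def reduce_by_gene(pairs):
    """Reduce step: sum signals and counts per gene, then compute the mean."""
    totals = defaultdict(lambda: [0.0, 0])
    for gene, (signal, count) in pairs:
        totals[gene][0] += signal
        totals[gene][1] += count
    return {gene: total / n for gene, (total, n) in totals.items()}


records = [
    {"gene": "GENE_A", "signal": 2.0},
    {"gene": "GENE_A", "signal": 4.0},
    {"gene": "GENE_B", "signal": 10.0},
]
print(choose_backend(len(records)))
print(reduce_by_gene(map(map_record, records)))
```

Because each mapped pair carries its own count, partial results from separate workers can be merged in any order, which is the property that lets this kind of computation spread across a cluster.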
  • 43. Architecture with EMR for real time analysis A B User Client Internet DNS Routing : Route 53 AUTO SCALING WEB SERVERS AUTO SCALING APP SERVERS Load Balancers Load Balancers WEB SERVERS CDN: CloudFront APP SERVERS Auto Scaling DynamoDB Amazon S3 ElastiCache EMR
  • 44. Iteration 4 MBs MBs GBs GBs Instrument Run (1000s) Patient Samples (1000s) Genes (1000s) Raw Signals (millions) Analysis Results (millions) Storage Query Analysis Performance ✔ ✔ ✔ Cost ✔ ✔ ✔ Get Insights    
  • 45. Performance for a project [chart: analysis time in minutes (axis 0 to 350) vs. # of records (90,000 to 900,000); series: cloud, desktop; data labels: 2, 4, 7, 11, 13, 20, 30; annotations: >10x, crashes]
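One way to read the ">10x" annotation is to pair the data labels from slides 39 and 45. This pairing is an interpretation of the chart residue in the transcript (the first series taken as desktop minutes, the second as cloud minutes, both over the same record counts), not something the slides state explicitly.

```python
# Values as they appear in the transcript for slides 39 and 45.
records = [90_000, 180_000, 270_000, 360_000, 450_000, 675_000, 900_000]
desktop_min = [8, 20, 40, 80, 120, 200, 320]  # assumed: desktop series, slide 39
cloud_min = [2, 4, 7, 11, 13, 20, 30]         # assumed: cloud series, slide 45

speedups = [d / c for d, c in zip(desktop_min, cloud_min)]
for n, s in zip(records, speedups):
    print(f"{n:>7} records: {s:.1f}x faster")
```

Under that reading, the advantage grows with project size and passes 10x only for the largest record counts.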
  • 46. Journey 0 start with reference architecture 1 identify scalable storage solution : DynamoDB 2 identify scalable storage solution for large data items : DynamoDB + Amazon S3 3 identify solutions for fast real time response & queries : DynamoDB + Amazon S3 + ElastiCache 4 identify solutions for real time analysis of data : Amazon EMR ✓ ✓ ✓ ✓ ✓
  • 47. Learnings
  • 48. About me : Shakila Pothini Senior Manager, Cloud Apps Life Sciences Group, Thermo Fisher Scientific Hiking is my ONLY stress buster Entertain to Educate. Cofounder of performing arts group (swaram.org) Mostly left brained with occasional sense of creativity
  • 49. How to get into your gene? sequence the entire human transcriptome (30,000 genes) identify significant genes (100+ genes) validate & reconfirm the (20+ genes) do it on more samples & different population find the way the genes interplay in the pathway understand cancer diversity. types of therapy. drug-able genes.
  • 50. Demo
  • 51. Demo summary non cancerous sample cancerous sample difference in expression of genes
  • 52. Customer feedback “My initial SymphoniSuite evaluation experience was good, GUI/controls are intuitive and data upload/analysis was fast and user friendly” UPENN “I enjoy processing hundreds of open array plates with ease.”, “I appreciate the rapid access of the large number of amplification curves” Sanofi “I wanted to let you know that Symphoni has been working well for me. I have done analysis using as high as 500 files.” ASU “This I see value in... utilizing these features. I appreciate the speed of data processing and visuals.” LUMC
  • 53. Yearly checkup today 165 / 105 120 50 / 90 104
  • 54. Is this really going to detect early stages of cancer?
  • 55. A few years from now : every person ATGCATGCTATCAATTGCCC Sequence melanoma health risks drug response powered by AWS lifecloud
  • 56. Yearly check-ups a few years from now ATGCATGC ATTGCCC ATGCATGC ATTGCCC TATCA GCATG lifecloud ATGCATGCTATCAATTGCCC Sequence
  • 57. Yearly check-ups a few years from now (cont’d) cancer any clinical trial? health risks drug response ATGCATGCTATCAATTGCCC Sequence lifecloud prescribe the right drug
  • 58. Puneet Suri Senior Director, Software engineering puneet.suri@thermofisher.com T: 650.266.5857 @psuri Shakila Pothini Senior Manager, Cloud Applications shakila.pothini@thermofisher.com Salil Kumar Cloud Architect T: 650.740.1646 @salilkum
  • 61. Collect / Ingest Kinesis Process / Analyze EMR EC2 Redshift Data Pipeline Visualize / Report Glacier S3 DynamoDB Store RDS Data Answers
  • 62. Experiment 1 Data Access Compute Time Experiment 2 Experiment 3 Data Access Compute Time Data Access Compute Time ✔ ✔ ✔
  • 63. EMR Cluster EC2 Instance Data Temperature
  • 64. Please give us your feedback on this session. Complete session evaluations and earn re:Invent swag. http://bit.ly/awsevals Puneet Suri Senior Director, Software engineering puneet.suri@thermofisher.com T: 650.266.5857 @psuri Shakila Pothini Senior Manager, Cloud Applications shakila.pothini@thermofisher.com T: 650.554.2190 Salil Kumar Cloud Architect T: 650.740.1646 @salilkum