R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
© 2014 MapR Technologies
Talk Overview
• Agile Real-time Stats
• R + Storm
github.com/allenday/R-Storm
• DEMO
• How to do it?
• Q & A @allenday
Agile Methods
Advanced Statistics
Continuous Real-time Delivery
github.com/allenday/hadoop-summit-r-storm-demo-public
Architecting R into the Storm Application Development Process
Allen (me) and Sungwook @ MapR
• Allen Day, Principal Data Scientist [ @allenday ]
7yr Hadoop dev, 12yr R dev/author
PhD, Human Genetics, UCLA Medicine
• Sungwook Yoon, Data Scientist
Spark & Security Expert
PhD, Computer Engineering, Purdue
• MapR [ @mapr ]
Distributes open source components for Hadoop
Adds major technology for performance, HA, industry standard APIs
What’s Storm? What’s R?
• What’s Storm?
– Processes a data stream. Akin to UNIX pipe + tee & merge commands
– Runs on a cluster. Fault-tolerant and designed to scale out
– Used for: real-time analytics & machine learning
• What’s R?
– Programming language with advanced statistics libraries
– Does not scale out. Can scale up
– Used for: prototyping, data modeling, visualization
How to combine these?
R outside, Storm inside: not practical. Why?
• Model-building and QA are done on data snapshots
• However, R => Hadoop is realistic. Key difference: referenced data can be static
– Use MapR snapshots for dev and QA
– See also: RHIPE (Purdue) and RHadoop (Revolution Analytics)
[Diagram: R, Storm, User]
Storm outside, R inside: a good fit
• Enables separation of concerns
– Independently manage modeling, ops timelines, and version control
– Integrate as needed
• Enables role specialization
– R built-ins allow faster iteration and more concise stats-type code
– Do DevOps with dedicated software engineering tech, e.g. Java
[Diagram: Storm, R, User]
Q: Who really likes statistics?
A: Baseball fans
A: Team Managers = Portfolio Managers
Fresh Local Data Tonight!
Famous Vintage Data
Oakland Athletics, 2002 Season
20 consecutive wins – the American League record at the time
Obligatory movie ref… I’m from LA
LET’S GO DODGERS!
Goal: Detect “Moneyball” 2002 Winning Streak
Methods: Change Point Detection
Find natural breakpoints in a time-series set of data points
R packages implement this:
changepoint: more sensitive, but not streaming
bcp: streaming, but less sensitive
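To make the idea of a "natural breakpoint" concrete, here is a minimal Java sketch that finds a single shift in the mean of a series by trying every split point and keeping the one with the smallest total squared error. This is an illustrative stand-in only; it is not the algorithm used by the changepoint or bcp packages. The class name, method name, and sample values are hypothetical.

```java
/** Hypothetical sketch: locate a single shift in the mean of a series. */
final class MeanShiftSketch {

    /**
     * Returns the index k (1 <= k < n) that best splits xs into xs[0..k) and
     * xs[k..n), minimizing the summed squared deviation of each segment from
     * its own mean. Returns -1 if the series has fewer than two points.
     */
    static int bestSplit(double[] xs) {
        int n = xs.length;
        if (n < 2) {
            return -1;
        }
        double total = 0.0;
        double totalSq = 0.0;
        for (double x : xs) {
            total += x;
            totalSq += x * x;
        }
        double leftSum = 0.0;
        double leftSq = 0.0;
        int best = -1;
        double bestCost = Double.POSITIVE_INFINITY;
        for (int k = 1; k < n; k++) {
            leftSum += xs[k - 1];
            leftSq += xs[k - 1] * xs[k - 1];
            double rightSum = total - leftSum;
            double rightSq = totalSq - leftSq;
            // Squared error of a segment = sum(x^2) - (sum x)^2 / count
            double cost = (leftSq - leftSum * leftSum / k)
                        + (rightSq - rightSum * rightSum / (n - k));
            if (cost < bestCost) {
                bestCost = cost;
                best = k;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up score differences: a losing stretch followed by a winning one
        double[] scoreDiffs = {-1, 0, -2, 1, -1, 4, 5, 3, 6, 4};
        System.out.println("best split index: " + bestSplit(scoreDiffs));
    }
}
```

A real detector would also need a penalty or significance test to decide whether the best split is a genuine change at all, which this sketch deliberately leaves out.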
Methods: R+Storm Demo Architecture
[Diagram: Oakland A’s Data (accelerated) -> Storm Bolt (R online change point detector) -> Storm Bolt (write to Jetty) -> Jetty Webserver -> Browser (D3.js) -> Us; GIFs to MapR Filesystem]
github.com/allenday/hadoop-summit-r-storm-demo-public
Demo
• Raw data (score difference between A’s and opponent)
• 50-game sliding window/buffer to detect change points
• Cumulative history with detected break points
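The 50-game sliding window can be sketched as a fixed-capacity buffer that evicts the oldest game whenever a new one arrives. The Java below is a minimal sketch under that assumption, not the demo's actual code; the class name, its methods, and the sample values are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of a fixed-size sliding window over a stream of values. */
final class SlidingWindowSketch {
    private final int capacity;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum = 0.0;

    SlidingWindowSketch(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Adds a value, evicting the oldest one once the window is full. */
    void add(double value) {
        if (window.size() == capacity) {
            sum -= window.removeFirst();
        }
        window.addLast(value);
        sum += value;
    }

    boolean isFull() {
        return window.size() == capacity;
    }

    int size() {
        return window.size();
    }

    /** Mean of the buffered values, or 0.0 for an empty window. */
    double mean() {
        return window.isEmpty() ? 0.0 : sum / window.size();
    }

    /** Copies the window, oldest value first, e.g. to hand to a detector. */
    double[] snapshot() {
        double[] out = new double[window.size()];
        int i = 0;
        for (double v : window) {
            out[i++] = v;
        }
        return out;
    }

    public static void main(String[] args) {
        SlidingWindowSketch games = new SlidingWindowSketch(50); // 50 games, as in the demo
        for (int game = 1; game <= 60; game++) {
            games.add(game % 7 - 3); // made-up score differences
        }
        System.out.println("games buffered: " + games.size() + ", mean diff: " + games.mean());
    }
}
```

Each time a new game arrives, the snapshot could be passed to whatever change point detector is in use, so the detector only ever sees the most recent 50 games.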
Methods Details: How it’s done
• Uses R-Storm binding github.com/allenday/R-Storm
– Storm package on CRAN cran.r-project.org/web/packages/Storm
[Diagram: Storm (dev team) -> R (stats team) -> Storm (dev team, pure Java); labels: Producer, Consumer]
Methods Details: Easy integration
R: lambda function
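library(Storm)
storm = Storm$new();
# Handler called for each incoming tuple; s is the Storm object
storm$lambda = function(s) {
  t = s$tuple;
  t$output = vector(length=1); t$output[1] = "tada!"
  s$emit(t)
}
# Enter the tuple-processing loop
storm$run();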
Storm: extend ShellBolt
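public static class MyRBolt extends ShellBolt implements IRichBolt
{
  public MyRBolt() {
    // Run my.R as a subprocess; ShellBolt handles the multilang messaging
    super("Rscript", "my.R");
  }
  // IRichBolt also requires declareOutputFields() and getComponentConfiguration()
}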
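For context on what ShellBolt does behind the scenes: Storm's multilang protocol exchanges JSON messages with the subprocess over stdin/stdout, each followed by a line containing only "end". The Java below is a minimal sketch of that framing for illustration. It is not Storm's implementation; the class and method names are hypothetical, and the hand-rolled JSON handles only string fields and escapes only quotes and backslashes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of multilang-style message framing (JSON + "end" line). */
final class MultilangFramingSketch {

    /** Frames one message: the JSON payload followed by a line containing "end". */
    static String frame(String json) {
        return json + "\nend\n";
    }

    /** Builds a minimal "emit" command for a tuple of string fields. */
    static String emitCommand(List<String> fields) {
        StringBuilder sb = new StringBuilder("{\"command\":\"emit\",\"tuple\":[");
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append('"').append(escape(fields.get(i))).append('"');
        }
        return sb.append("]}").toString();
    }

    /** Escapes backslashes and double quotes only (assumption: no control characters). */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Reads lines up to the "end" marker; returns the payload, or null at end of stream. */
    static String readMessage(BufferedReader in) {
        StringBuilder payload = new StringBuilder();
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.equals("end")) {
                    return payload.toString();
                }
                if (payload.length() > 0) {
                    payload.append('\n');
                }
                payload.append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return null;
    }

    public static void main(String[] args) {
        String framed = frame(emitCommand(Arrays.asList("tada!")));
        System.out.print(framed);
        BufferedReader in = new BufferedReader(new StringReader(framed));
        System.out.println("read back: " + readMessage(in));
    }
}
```

In practice neither team writes this by hand: ShellBolt covers the Java side and the R Storm package covers the R side, which is what keeps the integration small.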
Results
• Change points are identified, but none at the winning streak
– At least not when using score difference as the signal
• Time to integrate with the modeling team!
– Send @kunpognr or @allenday a pull request on GitHub
• Applicable to many other use cases, e.g.
– Security (fraud detection, intrusion detection)
– Marketing (intent to purchase / social media streams)
– Customer Support (help desk voice calls)
Discussion
Q&A
@allenday maprtech
allenday@mapr.com
Engage with us!
MapR
maprtech
linkedin.com/in/allenday

NATIONAL SPORTS DAY WRITTEN QUIZ by QUI9
 

R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose

  • 1. © 2014 MapR Technologies. Talk Overview • Agile Real-time Stats • R + Storm (github.com/allenday/R-Storm) • DEMO • How to do it? • Q & A (@allenday). Themes: Agile Methods, Advanced Statistics, Continuous Real-time Delivery. Demo code: github.com/allenday/hadoop-summit-r-storm-demo-public
  • 2. Architecting R into the Storm Application Development Process
  • 3. Allen (me) and Sungwook @ MapR • Allen Day, Principal Data Scientist [ @allenday ]: 7yr Hadoop dev, 12yr R dev/author; PhD, Human Genetics, UCLA Medicine • Sungwook Yoon, Data Scientist: Spark & Security Expert; PhD, Computer Engineering, Purdue • MapR [ @mapr ]: Distributes open source components for Hadoop; adds major technology for performance, HA, industry standard APIs
  • 4. What’s Storm? What’s R? • What’s Storm? – Processes a data stream. Akin to UNIX pipe + tee & merge commands – Runs on a cluster. Fault-tolerant and designed to scale out – Used for: real-time analytics & machine learning • What’s R? – Programming language with advanced statistics libraries – Does not scale out. Can scale up – Used for: prototyping, data modeling, visualization. How to combine these?
  • 5. R outside, Storm inside: not practical. Why? • Model-building and QA are done on data snapshots • However, R => Hadoop is realistic. Key difference: referenced data can be static – Use MapR snapshots for dev and QA – See also: RHIPE (Purdue) and RHadoop (RevolutionAnalytics) [Diagram labels: R, Storm, User]
  • 6. Storm outside, R inside: a good fit • Enables separation of concerns – Independently manage modeling, ops timelines, and version control – Integrate as needed • Enables role specialization – R built-ins allow faster iteration and more concise stats-type code – Do DevOps with specific SW engineering tech, e.g. Java [Diagram labels: Storm, R, User]
  • 7. Q: Who really likes statistics? A: Baseball fans. A: Team Managers = Portfolio Managers
  • 8. [No text on this slide]
  • 9. Fresh Local Data Tonight!
  • 10. Famous Vintage Data: Oakland Athletics, 2002 Season. 20 consecutive wins – the American League record at the time. Obligatory movie ref… I’m from LA. LET’S GO DODGERS!
  • 11. Goal: Detect “Moneyball” 2002 Winning Streak
  • 12. Methods: Change Point Detection. Find natural breakpoints in a time-series set of data points. R packages implement this: • changepoint: more sensitive, but not streaming • bcp: streaming, but less sensitive
  • 13. Methods: R+Storm Demo Architecture. [Diagram components: Oakland A’s Data (accelerated); Storm Bolt with R online change point detector; Storm Bolt (write to Jetty); Jetty Webserver; Browser (D3.js); Us; GIFs to MapR Filesystem] github.com/allenday/hadoop-summit-r-storm-demo-public
  • 14. Demo. [Plot panels: Raw data (score difference between A’s and opponent); 50-game sliding window/buffer to detect change points; Cumulative history with detected break points]
  • 15. Methods Details: How it’s done • Uses R-Storm binding: github.com/allenday/R-Storm – Storm package on CRAN: cran.r-project.org/web/packages/Storm [Diagram labels: Storm (dev team); R (stats team); Storm (dev team, pure Java); Producer; Consumer]
  • 16. Methods Details: Easy integration. R: lambda function: storm = Storm$new(); storm$lambda = function(s) { t = s$tuple; t$output = vector(length=1); t$output[1] = "tada!"; s$emit(t) } Storm: extend ShellBolt: public static class MyRBolt extends ShellBolt implements IRichBolt { public MyRBolt() { super("Rscript", "my.R"); } }
  • 17. Results • Change points are identified, but none for winning streak – Not using score difference, anyway • Time to integrate with the modeling team! – Send @kunpognr or @allenday a pull request on GitHub • Applicable to many other use cases, e.g. – Security (fraud detection, intrusion detection) – Marketing (intent to purchase / social media streams) – Customer Support (help desk voice calls) Discussion
  • 18. Q&A @allenday maprtech allenday@mapr.com Engage with us! MapR maprtech linkedin.com/in/allenday
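
Slides 12 and 14 describe detecting change points over a 50-game sliding window of score differences. The idea can be sketched in plain Java as below. This is only an illustration under stated assumptions: it is not the bcp or changepoint algorithm from the talk, but a simpler stand-in that scores every split of the window by a scaled difference of means. The class name `SlidingChangePoint`, the method names, and the threshold value are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Illustrative sliding-window change point detector.
 * Assumption: a simple mean-shift score stands in for the Bayesian
 * method (bcp) used in the talk; names and threshold are hypothetical.
 */
class SlidingChangePoint {
    private final int capacity;        // window size, e.g. 50 games
    private final double threshold;    // minimum score to report a change point
    private final Deque<Double> window = new ArrayDeque<>();

    SlidingChangePoint(int capacity, double threshold) {
        this.capacity = capacity;
        this.threshold = threshold;
    }

    /**
     * Adds one observation (e.g. a score difference) and drops the oldest
     * once the window is full. Returns the offset of the strongest split
     * inside the window, or -1 if the window is not yet full or no split
     * scores above the threshold.
     */
    int offer(double x) {
        window.addLast(x);
        if (window.size() > capacity) {
            window.removeFirst();
        }
        if (window.size() < capacity) {
            return -1;
        }
        double[] values = window.stream().mapToDouble(Double::doubleValue).toArray();
        return bestSplit(values, threshold);
    }

    /** Scores each split k by |mean(left) - mean(right)| scaled by the window's spread. */
    static int bestSplit(double[] v, double threshold) {
        int n = v.length;
        double total = 0.0;
        double totalSq = 0.0;
        for (double d : v) {
            total += d;
            totalSq += d * d;
        }
        double mean = total / n;
        // Floor the variance so a constant window does not divide by zero.
        double sd = Math.sqrt(Math.max(totalSq / n - mean * mean, 1e-12));

        int best = -1;
        double bestScore = threshold;
        double left = 0.0;
        for (int k = 1; k < n; k++) {
            left += v[k - 1];
            double meanLeft = left / k;
            double meanRight = (total - left) / (n - k);
            double score = Math.abs(meanLeft - meanRight)
                    / (sd * Math.sqrt(1.0 / k + 1.0 / (n - k)));
            if (score > bestScore) {
                bestScore = score;
                best = k;
            }
        }
        return best;
    }
}
```

In a topology like the one in the demo, a bolt would call `offer` once per incoming game and emit a tuple whenever the result is not -1. Keeping the detector free of Storm types means the stats logic can be tested without a cluster, which matches the separation of concerns argued for on slide 6.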

Editor's notes

  1. FILL IN RED WITH CORRECT DETAILS
  2. FILL IN RED WITH CORRECT DETAILS