© 2014 MapR Technologies 1
© 2014 MapR Technologies 2
© 2014 MapR Technologies 3
Contact info
• Slides online later, relax, enjoy, ask questions, participate
• allenday@mapr.com
• @allenday
• slideshare.net/allenday
• github.com/allenday
• …etc
© 2014 MapR Technologies 4
Allen’s Scorecard – Data Science Roles
• Domain Expertise – Genetics, geospatial, advertising
• Data Science – Biostatistics, recommendation systems,
persuasion
• App Development – R (13yr), Hadoop (7yr),
Before: web apps
• Operations – Horizontal scaling (e.g. web apps),
automation
© 2014 MapR Technologies 5
Message
How do emerging technologies…
…change our roles and
…change the way we design systems?
© 2014 MapR Technologies 6
Example: Sensor Data from Drilling Rigs
Real-time + long-time data use case
© 2014 MapR Technologies 7
Powerful Combination: RT Sensor Data + Histories
• Internet of Things is resulting in huge quantities of sensor data
• New opportunities for fine-grained view: save years instead of
months of data
• Also analyze in real-time for short term reporting, dashboards,
anomaly detection and predictive modeling
© 2014 MapR Technologies 8
Which part?
When was maintenance performed?
Why repaired?
If malfunction – what details?
Maintenance Data Base
© 2014 MapR Technologies 9
What is current status of part?
What are the current conditions?
Where is it located?
How much stress is it under?
Real-Time Sensor Data
© 2014 MapR Technologies 10
Real-Time Sensor Data + Maintenance Data Base
+ Machine Learning => Data Models
Analyze maintenance records
Predict maintenance needs
Schedule repairs to reduce costs
Reduce damage from unexpected failures
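The combination on this slide can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: the record fields (`part_id`, `last_service_day`, `stress`), the thresholds, and the scoring rule are all hypothetical, and the hand-written rule only stands in for a trained model.

```python
from dataclasses import dataclass


@dataclass
class MaintenanceRecord:
    part_id: str
    last_service_day: int   # day number of the most recent maintenance (hypothetical field)


@dataclass
class SensorReading:
    part_id: str
    day: int                # day number the reading was taken
    stress: float           # normalized load on the part, 0.0 to 1.0 (hypothetical field)


def needs_service(record, reading, max_days=90, max_stress=0.8):
    """Toy rule standing in for a learned model: flag a part that is
    overdue for service or currently under high stress."""
    days_since_service = reading.day - record.last_service_day
    return days_since_service > max_days or reading.stress > max_stress


def parts_to_schedule(records, readings):
    """Join the maintenance history with the latest sensor reading per part."""
    history = {r.part_id: r for r in records}
    latest = {}
    for reading in readings:
        current = latest.get(reading.part_id)
        if current is None or reading.day > current.day:
            latest[reading.part_id] = reading
    return sorted(
        part_id
        for part_id, reading in latest.items()
        if part_id in history and needs_service(history[part_id], reading)
    )
```

The join is the point of the sketch: neither the maintenance history nor the live reading is enough on its own to decide which parts to schedule.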
© 2014 MapR Technologies 11
How can an application be built to do this?
© 2014 MapR Technologies 12
Application: Data Access
Real Time Processing
Long Term Persistence
New Data
Query
Hadoop
Spark Streaming
Storm
© 2014 MapR Technologies 13
The Challenge: Hadoop is Not Very Real-time
[Timeline diagram, time axis t ending at "now". Regions: fully processed data; latest full period; unprocessed data. Caption: a Hadoop job takes this long for this data.]
© 2014 MapR Technologies 14
Real-time and Long-time together
[Timeline diagram, time axis t ending at "now". Labels: Hadoop works great back here; Spark Streaming or Storm work here; Blended View.]
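The timeline on this slide amounts to splitting the data at the point the last batch job reached. A minimal sketch, assuming the query is a simple per-key event count; the function and parameter names are hypothetical and no Hadoop, Spark Streaming, or Storm API is involved:

```python
def blended_count(batch_counts, recent_events, batch_cutoff):
    """Combine a precomputed batch total with events the batch job has not seen.

    batch_counts  : dict mapping key -> count, covering events with
                    timestamp < batch_cutoff
    recent_events : iterable of (timestamp, key) pairs from the streaming side
    batch_cutoff  : timestamp up to which the last batch job processed data
    """
    blended = dict(batch_counts)
    for timestamp, key in recent_events:
        # Only count events at or after the cutoff, so nothing is counted twice.
        if timestamp >= batch_cutoff:
            blended[key] = blended.get(key, 0) + 1
    return blended
```

For example, `blended_count({"rig-a": 100}, [(12, "rig-a")], batch_cutoff=10)` adds the one recent event to the batch total for `rig-a`.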
© 2014 MapR Technologies 16
Lambda Architecture
New Data
SPEED LAYER
BATCH LAYER
Query
SERVING LAYER
© 2014 MapR Technologies 17
Query Process
Real Time Processing
Long Term Persistence &
Batch Processing
New Data
Merge Query Results
SPEED LAYER
SERVING LAYER
BATCH LAYER
Query
Results
Hadoop
Spark Streaming
Storm
Drill
Impala
Hive
Partial Query Results
Partial Query Results
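The query path on this slide, where new data feeds both layers and a query merges their partial results, can be sketched as follows. This is a toy in-memory illustration, assuming a per-key count as the query; the class and function names are hypothetical, and none of Hadoop, Storm, Spark Streaming, Drill, Impala, or Hive is actually used.

```python
from collections import Counter


class BatchLayer:
    """Keeps the full history and recomputes a complete view on demand."""

    def __init__(self):
        self.master = []        # append-only record of every event key
        self.view = Counter()   # result of the last batch run

    def append(self, key):
        self.master.append(key)

    def recompute(self):
        # Rebuild the view from scratch, as a batch job would.
        self.view = Counter(self.master)


class SpeedLayer:
    """Incrementally counts only the events the batch view does not cover yet."""

    def __init__(self):
        self.view = Counter()

    def update(self, key):
        self.view[key] += 1

    def expire(self):
        self.view.clear()


def ingest(batch, speed, key):
    """New data is sent to both layers."""
    batch.append(key)
    speed.update(key)


def run_batch(batch, speed):
    """After a batch run the batch view has caught up, so the speed view is dropped."""
    batch.recompute()
    speed.expire()


def query(batch, speed, key):
    """Serving step: merge the partial results from both layers."""
    return batch.view[key] + speed.view[key]
```

In this sketch the answer to a query stays the same before and after a batch run; only the layer that supplies each part of it changes.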
© 2014 MapR Technologies 18
New designs benefit from overlapping roles:
Dev + Ops
© 2014 MapR Technologies 19
Production involves real time & long time processing
© 2014 MapR Technologies 20
Ongoing Development
© 2014 MapR Technologies 21
DevOps View
© 2014 MapR Technologies 22
Real-time and Long-time together
[Timeline diagram, time axis t ending at "now". Labels: data snapshot for devops and QA; live data for production systems; step forward.]
© 2014 MapR Technologies 23
Recommendation Systems
© 2014 MapR Technologies 24
Recommendations
– Data used to train model: interactions between people taking action
(users) and items
– Goal is to suggest additional interactions
– Example applications: movie, music or map-based restaurant choices;
suggesting sale items for e-stores or via cash-register receipts
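Training on user-item interactions to suggest additional interactions can be illustrated with plain co-occurrence counting. This is a deliberately simple sketch under stated assumptions: raw pair counts stand in for whatever statistic a production system would use, it is not the method of any particular library, and the function names and input layout are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence(histories):
    """Count how often each ordered pair of items appears in the same user's history.

    histories: dict mapping user -> list of items that user interacted with
    """
    counts = defaultdict(lambda: defaultdict(int))
    for items in histories.values():
        for a, b in permutations(set(items), 2):
            counts[a][b] += 1
    return counts


def recommend(user_items, counts, top_n=3):
    """Score unseen items by how often they co-occur with the user's items."""
    seen = set(user_items)
    scores = defaultdict(int)
    for item in seen:
        for other, n in counts.get(item, {}).items():
            if other not in seen:
                scores[other] += n
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:top_n]
```

The crowd's behavior lives in `counts`; an individual's recommendations come from looking up only the items in that person's own history.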
© 2014 MapR Technologies 25
Recommendation
Behavior of a crowd
helps us understand
what individuals will do
© 2014 MapR Technologies 26
Example: Real-time recommender using MapR data platform
[Architecture diagram. Labels: User History Log Files; Item Meta-Data; Ingest easily via NFS; MapR Cluster; Mahout Analysis; Pig; Python (use Python directly via NFS); Search Technology; Web Tier; Recommendations; New User History; Offline analysis; Real-time recommendations; Real-time Layer; Batch Layer; Serving Layer]
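"Use Python directly via NFS" means that files on the cluster can be read with ordinary file calls once the cluster is mounted. A minimal sketch, assuming a tab-separated log with one `user<TAB>item` event per line; the log format, the function name, and the example mount path in the docstring are all hypothetical.

```python
from collections import defaultdict


def load_user_histories(log_path):
    """Read a tab-separated log of (user, item) events into per-user item lists.

    log_path is any readable file path. On an NFS-mounted cluster it would be a
    path under the mount point (hypothetical example:
    /mapr/my.cluster/logs/plays.tsv).
    """
    histories = defaultdict(list)
    with open(log_path, encoding="utf-8") as log_file:
        for line in log_file:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            user, item = line.split("\t")
            histories[user].append(item)
    return dict(histories)
```

Nothing in the function is specific to a cluster, which is the point: the same code runs against a local file or a mounted one.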
© 2014 MapR Technologies 27
Result: System delivers real-time custom recommendations
based on music listening activity
© 2014 MapR Technologies 28
Practical Machine Learning: Free e-books
• Practical Machine Learning series authored by Ted Dunning and Ellen
Friedman, published by O’Reilly (2014)
• Provide innovations and advice that make machine learning more
accessible and more successful in real world settings
• Two titles available now as free e-book download from MapR website:
Innovations in Recommendation and A New
Look at Anomaly Detection
http://bit.ly/1nI2dyS
© 2014 MapR Technologies 29
Building data science teams
© 2014 MapR Technologies 30
Q:
Can I simply hire one rock star data
scientist to cover all this kind of work?
© 2014 MapR Technologies 31
A: No, interdisciplinary work requires
teams
A: Hire leads who can speak the lingo of
each required discipline
A: Hire individual contributors who
cover 2+ roles, when possible
© 2014 MapR Technologies 32
Good news: you don’t have to do it all
at once
Build in steps and repurpose existing
expertise
© 2014 MapR Technologies 33
Team Process = Needs
• discovery – help people ask the right questions
• modeling – allow automation to place informed bets
• integration – build smarts into product features
• apps – deliver products at scale to customers
• systems – keep infrastructure running, cost-effective
© 2014 MapR Technologies 34
Team Process = Needs
apps
discovery
modeling
systems
integration
These are the primary phases of leveraging BigData
Analysts drive from discovery.
Engineers drive from systems.
Both meet at integration.
Effective management of Data Science lives at
integration and doesn’t delegate it
© 2014 MapR Technologies 35
business process,
stakeholder
data prep,
discovery,
modeling, etc.
software
engineering,
automation
systems
engineering,
availability
Team Composition = Roles
Each role brings different
disciplines, opportunities, and
risks. It’s a powerful technique to
pair people with complementary
skills.
Blurring roles is very effective with
great people, e.g. DevOps.
There is danger in blurring
boundaries: Don’t try to create
rockstars (pushing down /
overloading stresses teams)
© 2014 MapR Technologies 36
Team Matrix = Needs x Roles
business process,
stakeholder
data prep,
discovery,
modeling, etc.
software
engineering,
automation
systems
engineering,
access
© 2014 MapR Technologies 37
Team Matrix
business process,
stakeholder
data prep,
discovery,
modeling, etc.
software
engineering,
automation
systems
engineering,
access
Conceptual tool for building and managing
Data Science teams
Overlay your project requirements (needs)
with your team’s strengths (roles)
That will show very quickly where to focus
Bring in individuals who cover 2-3 needs,
particularly for Team Leads
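Overlaying needs with team strengths can be done on paper, but a few lines of code make the idea concrete. A minimal sketch, assuming the five needs named on the "Team Process = Needs" slides; the function names and any team members passed in are hypothetical.

```python
NEEDS = ["discovery", "modeling", "integration", "apps", "systems"]


def coverage(team):
    """Map each need to the sorted names of the team members who cover it.

    team: dict mapping a person's name -> set of needs that person covers
    """
    return {
        need: sorted(name for name, skills in team.items() if need in skills)
        for need in NEEDS
    }


def gaps(team, minimum=1):
    """Return the needs covered by fewer than `minimum` people, in NEEDS order."""
    return [need for need, people in coverage(team).items() if len(people) < minimum]
```

Calling `gaps(team)` lists needs nobody covers, and `gaps(team, minimum=2)` also lists needs that depend on a single person, which is where a hire covering 2-3 needs helps most.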
© 2014 MapR Technologies 39
Allen’s Overlay
business process,
stakeholder
data prep,
discovery,
modeling, etc.
software
engineering,
automation
systems
engineering,
access
© 2014 MapR Technologies 40
Aggressively Proactive Learning
• Disrupts old learning and
management models
– one size fits all
– Specialists
Hire people who
learn and re-learn
efficiently
Throw Your Life a
Curve
Whitney Johnson
blogs.hbr.org/johnson/2012/09/throw-your-life-a-curve.html
© 2014 MapR Technologies 41
Recap
• Scalable storage allows for huge amounts of data
• Huge data calls for new system designs
Lambda Architecture: conceptual framework to design
systems for combining real-time and long-time data
• New system designs call for new definitions of roles and teams
Building Data Science Teams: conceptual framework for
building teams that can effectively work with huge
amounts of data
© 2014 MapR Technologies 42
Bonus round:
What’s MapR?
Why care?
© 2014 MapR Technologies 43
MapR Data Platform
Supports Complete Data Science Lifecycle
Filesystem
POSIX NFS
HBase
HDFS
MapReduce
SAN Storage
© 2014 MapR Technologies 44
FILESYSTEM
POSIX NFS
HBASE
NOSQL TABLES API
HADOOP
HDFS API
APACHE™HADOOP® HDFS
APACHE HBASE
IMPLEMENTS IMPLEMENTS
IMPLEMENTS IMPLEMENTS
IMPLEMENTS
DEPENDS
DEPENDS
MapR Data Platform
Architecture in a Nutshell
© 2014 MapR Technologies 45
HADOOP
HDFS API
HBASE
NOSQL TABLES API
FILESYSTEM
APACHE™HADOOP® HDFS
APACHE HBASE
IMPLEMENTS IMPLEMENTS
IMPLEMENTS IMPLEMENTS
IMPLEMENTS
DEPENDS
DEPENDS
Vertical Integration = High Performance
POSIX NFS
MapR Data Platform
Architecture in a Nutshell

2014.07.01 - New Technologies, New Roles, New Architectures - Singapore Management University - BigData SG

  • 1. © 2014 MapR Technologies 1© 2014 MapR Technologies
  • 2. © 2014 MapR Technologies 2© 2014 MapR Technologies
  • 3. © 2014 MapR Technologies 3 Contact info • Slides online later, relax, enjoy, ask questions, participate • allenday@mapr.com • @allenday • slideshare.net/allenday • github.com/allenday • …etc
  • 4. © 2014 MapR Technologies 4 Allen’s Scorecard – Data Science Roles • Domain Expertise – Genetics, geospatial, advertising • Data Science – Biostatistics, recommendation systems, persuasion • App Development – R (13yr), Hadoop (7yr), Before: web apps • Operations – Horizontal scaling (e.g. web apps), automation
  • 5. © 2014 MapR Technologies 5 Message How do emerging technologies… …change our roles and …change the way we design systems?
  • 6. © 2014 MapR Technologies 6© 2014 MapR Technologies Example: Sensor Data from Drilling Rigs Real-time + long-time data use case
  • 7. © 2014 MapR Technologies 7 Powerful Combination: RT Sensor Data + Histories • Internet of Things is resulting in huge quantities of sensor data • New opportunities for fine-grained view: save years instead of months of data • Also analyze in real-time for short term reporting, dashboards, anomaly detection and predictive modeling
  • 8. © 2014 MapR Technologies 8 Which part? When was maintenance performed? Why repaired? If malfunction – what details? Maintenance Data Base
  • 9. © 2014 MapR Technologies 9 What is current status of part? What are the current conditions? Where is it located? How much stress is it under? Real-Time Sensor Data
  • 10. © 2014 MapR Technologies 10 Real-Time Sensor DataMaintenance Data Base + Machine Learning => Data Models Analyze maintenance records Predict maintenance needs Schedule repairs to reduce costs Reduce damage from unexpected failures
  • 11. © 2014 MapR Technologies 11© 2014 MapR Technologies How can an application be built to do this?
  • 12. © 2014 MapR Technologies 12 Application: Data Access Real Time Processing Long Term Persistence New Data Query Hadoop Spark Streaming Storm
  • 13. © 2014 MapR Technologies 13 t now The Challenge: Hadoop is Not Very Real-time UnprocessedData Fully processed Latest full period Hadoop job takes this long for this data
  • 14. © 2014 MapR Technologies 14 t now Hadoop works great back here Spark Streaming or Storm work here Real-time and Long-time together Blended viewBlended viewBlended View
  • 15. © 2014 MapR Technologies 15 t now Hadoop works great back here Spark Streaming or Storm work here Real-time and Long-time together Blended viewBlended viewBlended View
  • 16. © 2014 MapR Technologies 16 Lambda Architecture New Data SPEED LAYER BATCH LAYER Query SERVING LAYER
  • 17. © 2014 MapR Technologies 17 Query Process Real Time Processing Long Term Persistence & Batch Processing New Data Merge Query Results SPEED LAYER SERVING LAYER BATCH LAYER Query Results Hadoop Spark Streaming Storm Drill Impala Hive Partial Query Results Partial Query Results
  • 18. © 2014 MapR Technologies 18© 2014 MapR Technologies New designs benefit from overlapping roles: Dev + Ops
  • 19. © 2014 MapR Technologies 19 Production involves real time & long time processing
  • 20. © 2014 MapR Technologies 20 Ongoing Development
  • 21. © 2014 MapR Technologies 21 DevOps View
  • 22. © 2014 MapR Technologies 22 t now Data snapshot for devops and QA Live data for production systems Real-time and Long-time together Step forward
  • 23. © 2014 MapR Technologies 23© 2014 MapR Technologies Recommendation Systems
  • 24. © 2014 MapR Technologies 24 Recommendations – Data used to train model: interactions between people taking action (users) and items – Goal is to suggest additional interactions – Example applications: movie, music or map-based restaurant choices; suggesting sale items for e-stores or via cash-register receipts
  • 25. © 2014 MapR Technologies 25 Recommendation Behavior of a crowd helps us understand what individuals will do
  • 26. © 2014 MapR Technologies 26 User History Log Files Mahout Analysis Search Technology Item Meta-Data Ingest easily via NFS MapR Cluster via NFS Python Use Python directly via NFS Pig Web TierRecommendations New User History Example: Real-time recommender using MapR data platform Offline analysis Real-time recommendations Real-time Layer Batch Layer Serving Layer
  • 27. © 2014 MapR Technologies 27 Result: System delivers real-time custom recommendations based on music listening activity
  • 28. © 2014 MapR Technologies 28 Practical Machine Learning: Free e-books • Practical Machine Learning series authored by Ted Dunning and Ellen Friedman, published by O’Reilly (2014) • Provide innovations and advice that make machine learning more accessible and more successful in real world settings • Two titles available now as free e-book download from MapR website: Innovations in Recommendation and A New Look at Anomaly Detection http://bit.ly/1nI2dyS
  • 29. © 2014 MapR Technologies 29© 2014 MapR Technologies Building data science teams
  • 30. © 2014 MapR Technologies 30 Q: Can I simply hire one rock star data scientist to cover all this kind of work?
  • 31. © 2014 MapR Technologies 31 A: No, interdisciplinary work requires teams A: Hire leads who can speak the lingo of each required discipline A: Hire individual contributors who cover 2+ roles, when possible
  • 32. © 2014 MapR Technologies 32© 2014 MapR Technologies Good news: you don’t have to do it all at once Build in steps and repurpose existing expertise
  • 33. © 2014 MapR Technologies 33 Team Process = Needs apps discovery modeling systems help people ask the right questions allow automation to place informed bets deliver products at scale to customers build smarts into product features keep infrastructure running, cost- effective integration
  • 34. © 2014 MapR Technologies 34 Team Process = Needs apps discovery modeling systems integration These are the primary phases of leveraging BigData Analysts drive from discovery. Engineers drive from systems. Both meet at integration. Effective management of Data Science lives at integration and doesn’t delegate it
  • 35. © 2014 MapR Technologies 35 business process, stakeholder data prep, discovery, modeling, etc. software engineering, automation systems engineering, availability Team Composition = Roles Each role brings different disciplines, opportunities, and risks. It’s a powerful technique to pair people with complementary skills. Blurring roles is very effective with great people, e.g. DevOps. There is danger in blurring boundaries: Don’t try to create rockstars (pushing down / overloading stresses teams)
  • 36. © 2014 MapR Technologies 36 Team Matrix = Needs x Roles business process, stakeholder data prep, discovery, modeling, etc. software engineering, automation systems engineering, access
  • 37. © 2014 MapR Technologies 37 Team Matrix business process, stakeholder data prep, discovery, modeling, etc. software engineering, automation systems engineering, access Conceptual tool for building and managing Data Science teams Overlay your project requirements (needs) with your team’s strengths (roles) That will show very quickly where to focus Bring in individuals who cover 2-3 needs, particularly for Team Leads
  • 38. © 2014 MapR Technologies 38 Team Matrix = Needs x Roles business process, stakeholder data prep, discovery, modeling, etc. software engineering, automation systems engineering, access
  • 39. © 2014 MapR Technologies 39 Allen’s Overlay business process, stakeholder data prep, discovery, modeling, etc. software engineering, automation systems engineering, access
  • 40. Aggressively Proactive Learning. Disrupts old learning and management models: one size fits all; specialists. Hire people who learn and re-learn efficiently. Reference: “Throw Your Life a Curve”, Whitney Johnson, blogs.hbr.org/johnson/2012/09/throw-your-life-a-curve.html
  • 41. Recap. Scalable storage allows for huge amounts of data. Huge data calls for new system designs. Lambda Architecture: a conceptual framework for designing systems that combine real-time and long-time data. New system designs call for new definitions of roles and teams. Building Data Science Teams: a conceptual framework for building teams that can effectively work with huge amounts of data.
  • 42. Bonus round: What’s MapR? Why care?
  • 43. MapR Data Platform Supports Complete Data Science Lifecycle (diagram labels: Filesystem, POSIX NFS, HBase, HDFS, MapReduce, SAN Storage)
  • 44. MapR Data Platform Architecture in a Nutshell (diagram nodes: Filesystem, POSIX NFS, HBase NoSQL Tables API, Hadoop HDFS API, Apache Hadoop HDFS, Apache HBase; edges labeled IMPLEMENTS and DEPENDS)
  • 45. MapR Data Platform Architecture in a Nutshell: Vertical Integration = High Performance (diagram nodes: Hadoop HDFS API, HBase NoSQL Tables API, Filesystem, POSIX NFS, Apache Hadoop HDFS, Apache HBase; edges labeled IMPLEMENTS and DEPENDS)

Editor's notes

  1. Talk track: Traditionally, what has been lacking is enough historical data. Now, with new approaches such as Hadoop, it’s possible to save long-term maintenance histories in a cost-effective way…<CLICK> Consider what this may mean for scheduling repairs for a particular piece of equipment. Rather than just knowing overall repair rates and costs, many details can be stored for a particular part...
  2. Talk track: And from the field, sensors provide real-time measurements about what is happening for that particular part…<CLICK> <PAUSE>
  3. Talk track: When you combine real time sensor data with maintenance histories, you can leverage the value of your data by using machine learning models to inform your actions: <click> analyze the records in order to <click> predict maintenance needs for <click> better scheduling of repairs. This saves you money by <click> avoiding down time and reducing risk of costly failures. <click> Time series data is useful, particularly when saved together with part or equipment specifications. You could, for example, go back <click> and see what happened in the days or months leading up to a part failure and thus better understand how to schedule repairs before problems occur.
  4. Talk track: Here is a familiar view of what is being done with data. New data can be ingested into the persistence layer or used in real-time processing. What a user such as an analyst would like to do is make a single query against all of the data. How does this work? Let’s think about it in terms of lambda architecture to get a conceptual view of how to combine real-time with batch processing…
  5. Talk track: Lambda architecture divides all components in a system into 3 basic layers. The Batch Layer handles persistence and batch-oriented computation. The Speed Layer handles real-time computation and updates to short-term persistence (such as HBase or M7 tables). The Serving Layer combines the partial batch query results with the partial query results from real-time processing. Now we can think about our system components in terms of the lambda architecture…
  6. Talk track: Long-term data persistence and batch processing are done by components such as Apache Hadoop-based technologies. For the speed layer, there are several choices for real-time processing, including Apache Spark Streaming or Apache Storm. The query can (soon) be carried out using Apache Drill, Apache Hive, Apache Spark’s Shark component or Impala. The serving layer combines long-time and real-time partial query results to provide the final results that the user wants.
  7. Talk track: Recommendations are in widespread use, and building a powerful recommendation engine can be easier than you think with certain innovations.
  8. Talk track: The first trick is to choose the right data. Instead of looking at ratings or characteristics of the items to recommend, watch people’s behaviors as they interact with items. You discover patterns, and those patterns tell you what to recommend.
  9. Talk track: We demonstrated this powerful two-stage approach by building a music recommender on the MapR platform. Notice that the intensive part of the computation, the
  10. Catching up 10:23 (equals normal 53 min; should be at 38 min (15 min late) … catching up
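
Notes 5 and 6 describe a serving layer that combines partial query results from the batch layer and the speed layer. The following is a minimal sketch of that merge step, not the talk's implementation. It assumes the query is an additive aggregate (event counts per part) and that the two views cover non-overlapping time ranges. The function name `serve_query` and the part identifiers are hypothetical.

```python
from collections import Counter

def serve_query(batch_view, speed_view):
    """Hypothetical serving-layer merge: add per-key counts from both layers.

    batch_view: counts precomputed by the batch layer (long-term history).
    speed_view: counts for recent events not yet absorbed by the batch layer.
    Assumes the two views do not double-count the same events.
    """
    merged = Counter(batch_view)
    merged.update(speed_view)  # Counter.update adds counts key by key
    return dict(merged)

# Illustrative data only: anomaly counts per part identifier.
batch_view = {"pump-7": 120, "valve-3": 45}
speed_view = {"pump-7": 3, "motor-1": 1}

result = serve_query(batch_view, speed_view)
print(result)
```

The merge is this simple only because counts are additive. Queries such as distinct counts or medians would need a different way to combine the partial results.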