© Hortonworks Inc. 2011 – 2017. All Rights Reserved
Future of Data Boston
Data & Cognitive Developers
Enterprise Data Science at Scale
Agenda
• Networking, Food and drink
• Announcements
• Main Presentation
– Introducing Data Science at Scale
– Building and Deploying Models Collaboratively
– Training Models with all the Data
– Putting Models to Work in a Streaming Application
• Question and Answer
• Networking and Wrap up
Thanks to our Meetup Partners
IBM + Hortonworks
• #1 Pure Open Source Hadoop Distribution
• 1000+ customers and 2100+ ecosystem partners
• Employs the original architects, developers and operators of Hadoop from Yahoo!
• Best-in-class 24x7 customer support
• Leading professional services and training
+
• #1 Data Science Platform (Source: Gartner)
• OpenPOWER performance leadership
• Flexible, software defined storage
• #1 SQL Engine for complex, analytical workloads
• Leader in On-premise and Hybrid Cloud solutions
About Carolyn Duby
• Big Data Solutions Architect
• High performance data intensive systems
• Data science and Cyber Security SME
• ScB ScM Computer Science, Brown University
• LinkedIn: https://www.linkedin.com/in/carolynduby/
• Twitter: @carolynduby, GitHub: carolynduby
• Hortonworks
– Innovation through data
– Enterprise ready, 100% open source, modern data platforms
– Engineering, Technical Support, Professional Services, Training
About Rich Tarro
• Analytics Solutions Architect
• IBM Corporation
• Client insights through data
• MS Electrical Engineering
– Rensselaer Polytechnic Institute
Data Science Lifecycle
Next Generation Data Science Problems
Multiple data sources & clusters
Data Scientists
Where is the data I need to answer the
business questions?
Data Engineers
How do I move that data into a central
repository?
How do I transform and cleanse that data?
Next Generation Data Science Problems
Too many tools and technologies
Data Scientists
How do I learn the latest library or technique?
I don’t (want to) know Hadoop, Hive, etc.
How do I bring my familiar R/Python library
to the new data science platform?
Next Generation Data Science Problems
Socializing insights is challenging
Data Scientists
How do I collaborate and share my work
with others in the organization?
Business Analyst
How do I move that data into a central
repository?
What is the best visualization to tell my
story?
Next Generation Data Science Problems
Going from prototype to production is cumbersome
Data Scientists
I created this awesome machine learning
model. How do I put it into production?
Data Scientists / Data Engineers
How are my machine learning models
performing, and how do I improve them?
More data = better model but desktop is limited
• Analyzing and training with only a portion of the available data
• Analysis or training too slow
• Out of memory
• Data accumulates over time
Data Science Solution
• Data Movement and Acquisition
– Acquire and move data required for problem
• Distributed Compute Platform
– Store, clean, and organize historical data
– Build and train models on historical data
• Notebooks
– Record data processes
• Clean and prepare data
• Build and train models
– Collaborate with others
• Model Deployment
– Package model for use
– Monitor performance
Data Movement and Acquisition
• Constrained
• High-latency
• Localized context
• Hybrid – cloud/on-premises
• Low-latency
• Global context
[Diagram labels: Sources, Regional Infrastructure, Core Infrastructure, Data Lake]
Apache Spark
• Distributed processing efficiently crunches large data sets
– Optimized
– Horizontally scalable with multi-tenancy
– Fault tolerant
• One platform for streaming, cleaning, analyzing
• Elegant APIs – Scala, Python, Java, R
• Many data source connectors – file system, HDFS, Hive, Phoenix, S3, etc.
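The partition-and-combine idea behind the distributed processing bullet can be sketched on a single machine. This is only an analogy using the Python standard library, not Spark itself; the record layout, the `churned` field, and the partition count are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_parts):
    """Split records into n_parts slices (round-robin)."""
    return [records[i::n_parts] for i in range(n_parts)]


def count_churned(part):
    """Per-partition work: count the records flagged as churned."""
    return sum(1 for r in part if r["churned"])


# Hypothetical data set: every 7th customer is marked as churned.
records = [{"id": i, "churned": i % 7 == 0} for i in range(1000)]
parts = partition(records, n_parts=4)

# Each partition is processed independently, then the partial
# results are combined into one answer.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(count_churned, parts))

total = sum(partials)
print(partials, total)
```

Spark applies the same pattern across many machines and additionally handles scheduling, data locality, and recomputing lost partitions, none of which this sketch attempts.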
Spark Libraries
• Same API for all data sources
• SQL - http://spark.apache.org/sql/
– Access structured data and combine with other sources
• MLlib - http://spark.apache.org/mllib/
– Machine learning for training models and predicting
• GraphX - http://spark.apache.org/graphx/
– Connectivity algorithms
• Streaming - http://spark.apache.org/streaming/
– Complex event processing and data ingest
Spark Architecture
[Diagram labels: Notebook, Livy, Spark Driver, Spark Application Master (YARN container), three Spark Executors (each a YARN container running two Tasks)]
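The architecture slide places Livy between the notebook and Spark. Livy exposes a REST interface in which a client first creates a session and then posts code statements to it. The sketch below only builds those two requests and does not send them. The host name is hypothetical (8998 is Livy's default port), and a secured cluster would likely need authentication that is not shown here.

```python
import json
from urllib.request import Request

# Hypothetical Livy endpoint; replace with the address of a real Livy server.
LIVY_URL = "http://livy.example.com:8998"
JSON_HEADERS = {"Content-Type": "application/json"}


def build_session_request(kind="pyspark"):
    """Build (but do not send) the request that asks Livy for a new session."""
    body = json.dumps({"kind": kind}).encode("utf-8")
    return Request(f"{LIVY_URL}/sessions", data=body,
                   headers=JSON_HEADERS, method="POST")


def build_statement_request(session_id, code):
    """Build the request that submits a code snippet to an existing session."""
    body = json.dumps({"code": code}).encode("utf-8")
    return Request(f"{LIVY_URL}/sessions/{session_id}/statements", data=body,
                   headers=JSON_HEADERS, method="POST")


session_req = build_session_request()
statement_req = build_statement_request(0, "spark.range(10).count()")
print(session_req.get_method(), session_req.full_url)
print(statement_req.get_method(), statement_req.full_url)
```

Passing either request to `urllib.request.urlopen` would send it; notebooks that integrate with Livy typically make these calls on the user's behalf.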
Open Source Notebooks - Jupyter
Open Source Notebooks - Zeppelin
Sharing with DSX Projects
Deploying a Model
[Diagram labels: Virtual Model Deployment; three Physical Model / REST Service pairs; Model Algorithm and Parameters]
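One way to read "Model Algorithm and Parameters" is that a deployed model is a named algorithm plus its learned parameters, which each REST service loads and scores against. The sketch below is an assumption-laden illustration of that reading, not the deployment mechanism from the talk. It swaps in a tiny logistic regression for brevity, and the feature names, weights, and response fields are all invented.

```python
import json
import math

# Hypothetical packaged model: the algorithm name plus its learned parameters.
PACKAGED_MODEL = json.dumps({
    "algorithm": "logistic_regression",
    "features": ["support_calls", "tenure_months"],
    "weights": [0.8, -0.1],
    "bias": -1.0,
})


def load_model(packaged):
    """Turn the packaged description into a scoring function."""
    spec = json.loads(packaged)

    def score(record):
        z = spec["bias"] + sum(
            w * record[f] for w, f in zip(spec["weights"], spec["features"])
        )
        return 1.0 / (1.0 + math.exp(-z))

    return score


def handle_request(score, body):
    """What a REST endpoint might do with a JSON request body."""
    record = json.loads(body)
    probability = score(record)
    return json.dumps({
        "churn_probability": round(probability, 3),
        "churn": probability >= 0.5,
    })


score = load_model(PACKAGED_MODEL)
print(handle_request(score, '{"support_calls": 9, "tenure_months": 2}'))
print(handle_request(score, '{"support_calls": 0, "tenure_months": 48}'))
```

Because every service instance loads the same package, several physical copies can sit behind one logical deployment and return the same answers.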
Big Picture
[Diagram labels: Move Data, History, Build Model, Train, Model 1 / Model 2 / Model 3 (along a Time axis), Model Deployment, Predict, Streaming Application, Evaluate Performance]
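The "Evaluate Performance" step can be sketched as a rolling accuracy check over predictions whose true outcome later becomes known. This is a minimal illustration under assumed names; the window size and the retraining threshold are hypothetical, and the talk does not specify how performance was actually measured.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent `window` scored predictions."""

    def __init__(self, window=100, threshold=0.8):
        self.window = window
        self.threshold = threshold  # hypothetical trigger for retraining
        self.outcomes = deque(maxlen=window)

    def record(self, predicted, actual):
        """Store whether one prediction matched the observed outcome."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Fraction of correct predictions in the window, or None if empty."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retrain(self):
        """True once a full window is seen and accuracy is below threshold."""
        return len(self.outcomes) == self.window and self.accuracy() < self.threshold


monitor = AccuracyMonitor(window=5, threshold=0.8)
# (predicted churn, actual churn) pairs, as they might arrive from a stream
for predicted, actual in [(1, 1), (0, 0), (1, 0), (0, 1), (0, 0)]:
    monitor.record(predicted, actual)
print(monitor.accuracy(), monitor.needs_retrain())
```

When `needs_retrain()` turns true, the accumulated history can be used to train the next model version, which matches the Model 1, Model 2, Model 3 progression in the diagram.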
DEMO
Demo Scenario
Customer Churn
• Churn occurs when a customer stops subscribing to or using a service; it affects all industries.
• A company that can predict customer churn has the opportunity to act proactively to retain those customers.
• Historical customer churn data will be used to train the machine learning model, which will then be used to predict whether a customer will stop using the service.
• A Random Forest classifier will be used in this demo; it is well suited to handling variance in the training data set.
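The slides do not include the demo code, so the following is only a small pure-Python sketch of the random forest technique: bootstrap samples, a random feature subset at each split, and a majority vote. The feature names (`support_calls`, `tenure_months`) and the ten training rows are invented; a real project would more likely use a library implementation.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, features):
    """Return the (feature, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(labels)
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, n_features, rng, depth=0, max_depth=3):
    """Grow one tree; leaves are class labels, inner nodes are tuples."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority
    # Random feature subset at each split: the "random" in random forest.
    features = rng.sample(range(len(rows[0])), n_features)
    split = best_split(rows, labels, features)
    if split is None:
        return majority
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (
        f,
        t,
        build_tree([r for r, _ in left], [y for _, y in left],
                   n_features, rng, depth + 1, max_depth),
        build_tree([r for r, _ in right], [y for _, y in right],
                   n_features, rng, depth + 1, max_depth),
    )


def predict_tree(tree, row):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def train_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees trees, each on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    n_features = max(1, int(len(rows[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], n_features, rng))
    return forest


def predict(forest, row):
    """Majority vote across all trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]


# Invented history: (support_calls, tenure_months) -> 1 = churned, 0 = stayed
rows = [(8, 2), (9, 3), (7, 1), (10, 4), (8, 5),
        (1, 30), (0, 24), (2, 36), (1, 40), (3, 28)]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

forest = train_forest(rows, labels)
print(predict(forest, (10, 1)), predict(forest, (0, 48)))
```

Averaging many trees trained on different resamples is what dampens the variance of any single tree, which is the property the slide attributes to the classifier.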
Demo Flow
Insights from Data Science to Production
Data Scientists
Where is the data I
need to answer the
business questions?
Business Users
Where is the insight
& predictions from
the data?
HDP Cluster
Knox
Admins
How do I meet SLA,
Performance, .., Feature
needs?
Demo Scenario
Problems Solved
• Data scientists collaborate and learn new tools and frameworks
• Choice of tools, notebooks and languages
• Run favorite notebook on all data in the HDP Cluster
• Deploy the model to production
• Leverage the production model to deliver insights to business
• Monitor models and retrain models as new data comes in
Thank You

Data Science at Scale with Hortonworks

  • 1. © Hortonworks Inc. 2011 – 2017. All Rights Reserved Future of Data Boston Data & Cognitive Developers Enterprise Data Science at Scale
  • 2. Agenda • Networking, Food and drink • Announcements • Main Presentation – Introducing Data Science at Scale – Building and Deploying Models Collaboratively – Training Models with all the Data – Putting Models to Work in a Streaming Application • Question and Answer • Networking and Wrap up
  • 3. Thanks to our Meetup Partners: IBM + Hortonworks • #1 Pure Open Source Hadoop Distribution • 1000+ customers and 2100+ ecosystem partners • Employs the original architects, developers and operators of Hadoop from Yahoo! • Best-in-class 24x7 customer support • Leading professional services and training • #1 Data Science Platform (Source: Gartner) • OpenPOWER performance leadership • Flexible, software defined storage • #1 SQL Engine for complex, analytical workloads • Leader in On-premise and Hybrid Cloud solutions
  • 4. About Carolyn Duby • Big Data Solutions Architect • High performance data intensive systems • Data science and Cyber Security SME • ScB ScM Computer Science, Brown University • LinkedIn: https://www.linkedin.com/in/carolynduby/ • Twitter: @carolynduby Github: carolynduby • Hortonworks – Innovation through data – Enterprise ready, 100% open source, modern data platforms – Engineering, Technical Support, Professional Services, Training
  • 5. About Rich Tarro • Analytics Solutions Architect • IBM Corporation • Client insights through data • MS Electrical Engineering – Rensselaer Polytechnic Institute
  • 6. Data Science Lifecycle
  • 7. Next Generation Data Science Problems: Multiple data sources & clusters • Data Scientists: Where is the data I need to answer the business questions? • Data Engineers: How do I move that data into a central repository? How do I transform and cleanse that data?
  • 8. Next Generation Data Science Problems: Too many tools and technologies • Data Scientists: How do I learn the latest library/technique? I don’t (want to) know Hadoop/Hive etc. How do I bring my familiar R/Python library to the new data science platform?
  • 9. Next Generation Data Science Problems: Socializing insights is challenging • Data Scientists: How do I collaborate and share my work with others in the organization? • Business Analysts: How do I move that data into a central repository? What is the best visualization to tell my story?
  • 10. Next Generation Data Science Problems: Going from prototype to production is cumbersome • Data Scientists: I created this awesome Machine Learning Model; how do I put it into production? • Data Scientists/Data Engineers: How are my Machine Learning Models performing, and how do I improve them?
  • 11. More data = better model, but the desktop is limited • Analyzing and training with only a portion of the available data • Analysis or training too slow • Out of memory • Data accumulates over time
  • 12. Data Science Solution • Data Movement and Acquisition – Acquire and move data required for problem • Distributed Compute Platform – Store, clean, and organize historical data – Build and train models on historical data • Notebooks – Record data processes • Clean and prepare data • Build and train models – Collaborate with others • Model Deployment – Package model for use – Monitor performance
  • 13. Data Movement and Acquisition [Diagram labels: Sources, Regional Infrastructure, Core Infrastructure, Data Lake] • Constrained • High-latency • Localized context • Hybrid – cloud/on-premises • Low-latency • Global context
  • 14. Apache Spark • Distributed processing efficiently crunches large data sets – Optimized – Horizontally scalable with multi-tenancy – Fault tolerant • One platform for streaming, cleaning, analyzing • Elegant APIs – Scala, Python, Java, R • Many data source connectors – file system, HDFS, Hive, Phoenix, S3, etc.
  • 15. Spark Libraries • Same API for all data sources • SQL - http://spark.apache.org/sql/ – Access structured data and combine with other sources • MLlib - http://spark.apache.org/mllib/ – Machine learning for training models and predicting • GraphX - http://spark.apache.org/graphx/ – Connectivity algorithms • Streaming - http://spark.apache.org/streaming/ – Complex event processing and data ingest
  • 16. Spark Architecture [Diagram labels: Notebook; Livy; Spark Driver; Spark Application Master (YARN container); Spark Executor (YARN container, two Tasks), shown three times]
  • 17. Open Source Notebooks - Jupyter
  • 18. Open Source Notebooks - Zeppelin
  • 19. Sharing with DSX Projects
  • 20. Deploying a Model [Diagram labels: Virtual Model Deployment; Physical Model REST Service, shown three times; Model Algorithm and Parameters]
  • 21. Big Picture [Diagram labels: Build Model; Model 1, Model 2, Model 3; Model Deployment; History; Move Data; Time; Train; Predict; Streaming Application; Evaluate Performance]
  • 22. DEMO
  • 23. Demo Scenario: Customer Churn • Churn occurs when a customer stops subscribing to or using a service, and it affects all industries. • A company’s ability to predict customer churn gives it the opportunity to be proactive in its efforts to retain those customers. • Historical customer churn data will be used to train the Machine Learning model, which will then be used to predict whether a customer will stop using the company’s services. • A Random Forest Classifier will be used in this demo; it is well suited to handling variance in the training data set.
  • 24. Demo Flow: Insights from Data Science to Production • Data Scientists: Where is the data I need to answer the business questions? • Business Users: Where are the insights & predictions from the data? • HDP Cluster • Knox • Admins: How do I meet SLA, Performance, .., Feature needs?
  • 25. Demo Scenario: Problems Solved • Data Scientists collaborate and learn new tools & frameworks • Choice of tools, notebooks and languages • Run favorite notebook on all data in the HDP Cluster • Deploy the model to production • Leverage the production model to deliver insights to business • Monitor models and retrain models as new data comes in
  • 26. Thank You
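
The last demo point, monitoring models and retraining them as new data comes in, can be illustrated with a minimal sketch. This is not the code used in the demo. It assumes that churn predictions from the deployed model can later be paired with the observed outcome for each customer. The names `ChurnMetrics`, `evaluate_churn_predictions`, and `needs_retraining`, the sample data, and the 0.7 recall threshold are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ChurnMetrics:
    accuracy: float
    precision: float
    recall: float


def evaluate_churn_predictions(predicted, actual):
    """Compare predicted churn flags with observed outcomes (True = churned)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one labelled example is required")
    pairs = list(zip(predicted, actual))
    # Confusion-matrix counts for the positive class (churn).
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class never appears.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return ChurnMetrics(accuracy, precision, recall)


def needs_retraining(metrics, min_recall=0.7):
    """Flag the model for retraining when recall falls below a threshold.

    The 0.7 default is an illustrative assumption, not a value from the talk.
    """
    return metrics.recall < min_recall


# Hypothetical batch: model predictions versus what customers actually did.
predicted = [True, True, False, False, True, False]
actual = [True, False, False, True, True, False]

metrics = evaluate_churn_predictions(predicted, actual)
print(metrics, needs_retraining(metrics))
```

Recall is used as the retraining trigger here on the assumption that missing a customer who is about to churn costs more than a false alarm. A real deployment would pick its metric and threshold from the business case.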

Editor's Notes

  1. Hortonworks & IBM – Integration of HDP and DSX. Who Are We: IBM – #1 Data Science solution. Hortonworks – Largest Open Source Hadoop distribution. We believe this partnership optimizes the strengths of our companies and uniquely positions our solution in the Data Science market. What Are We Talking About Today: Integrating HDP and DSX creates a platform for organizations to unlock the potential of their data. Ultimately, it creates a pathway to innovative and valuable Data Science workflows. Presentation Overview: 1. Walk through the Data Science Life Cycle. 2. Discuss challenges in the process. 3. Discuss how DSX & HDP solve these problems. 4. Demonstration of the technology.
  2. Problem Definition: Successful Data Science practices begin with a well-defined business problem. Ideally, the business has specific questions to ask of its data. ETL – Feature Extraction: Once the problem has been defined, the process of data wrangling, transformation, and cleaning must be completed using various ETL processes. Once the data corpus has been curated, statistical analysis techniques are used to determine which features should be extracted. Learning: After the features are selected, supervised or unsupervised Machine Learning models are created for future prediction or classification. Model Deployment & Management: For this process to be valuable, the organization must deploy these models into its production environment. Additionally, it must also monitor the performance and health of these models while they are operating.
  3. Data Science teams consist of Data Scientists, Data Engineers, Business Analysts, and Application Developers. Challenge #1 – Multiple sources of data. Traditional – Structured (DB, EDW, CRM, etc.); Big Data – Unstructured (social media, IoT), Hadoop-based data stores; Legacy – Spreadsheets. Problem 1 – 20% of the time in the Data Science Lifecycle (DSLC) is spent by Data Scientists trying to find where the required data is located. Problem 2 – 60% of the time in the DSLC is spent by Data Engineers centralizing the data and preparing it to ensure data quality. *Key Take Away – Combined, 80% of the time in the DSLC is spent on locating, moving, and preparing the data before machine learning models can be created or deployed.
  4. Challenge #2 – Data Science workflows lack standardization. Problem 1 – The open source community has created too many tools to expect a single person to know them all. Problem 2 – Data Science teams are often limited to the tools, including languages and libraries, that their data scientists and application developers are most familiar with. Problem 3 – There are no systems in place where an organization’s Data Science team can build reliable, standardized, and repeatable pipelines for managing models at Big Data scale. *Key Take Away – Due to the number of open source tools, Data Science practices are not standardized.
  5. Challenge #3 – Collaboration is difficult. Problem 1 – Without a common framework, Data Scientists have difficulty collaborating with team members. They are unable to share code, results, or models with each other. Problem 2 – Due to the lack of collaboration, Business Analysts struggle to find visualization tools that can integrate with the data repository. Key Take Away – Collaboration is difficult for a Data Science team using existing tools.
  6. Challenge #4 – Deploying Models into Production. Problem 1 – Data Science teams struggle to migrate their prototype models to their production environment. Problem 2 – Using current tools, it is also difficult to monitor the health and performance of their models while they are operating in production. Key Take Away – Creating models in an isolated environment is relatively straightforward. The challenge begins when these models need to be deployed and monitored in production.
  7. Components of the end-to-end real-time insights dataflow platform. MiNiFi: Edge Data Collection w/Provenance and centralized C&C. NiFi: End-to-end dataflow management w/Provenance and Interactive C&C. Kafka: High-throughput, durable, replayable messaging. Storm: High-scale Data Processing. Right-sized solutions. All optimized for delivery into HDP (HDFS, Hive, Spark, HBase, etc.)