Hortonworks & IBM – Integration of HDP and DSX
Who Are We
IBM - #1 Data Science solution.
Hortonworks – Largest Open Source Hadoop distribution.
We believe this partnership combines the strengths of our companies and uniquely positions our joint solution in the Data Science market.
What Are We Talking About Today
Integrating HDP and DSX creates a platform for organizations to unlock the potential of their data. Ultimately, it creates a pathway to innovative and valuable Data Science work flows.
Presentation Overview
1. Walk through the Data Science Life Cycle.
2. Discuss challenges in the process.
3. Discuss how DSX & HDP solve these problems.
4. Demonstration of the technology.
Problem Definition
A successful Data Science practice begins with a well-defined business problem. Ideally, the business has specific questions to ask of its data.
ETL – Feature Extraction
Once the problem has been defined, data wrangling, transformation, and cleaning must be completed using various ETL processes.
Once the data corpus has been curated, statistical analysis techniques are utilized to determine which features should be extracted.
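The cleaning and feature-selection steps above can be sketched in a few lines. This is a minimal illustration with hypothetical field names (`age`, `income`, `region_code`) and a simple variance filter standing in for the statistical analysis; real pipelines would use distributed ETL tooling.

```python
# Minimal sketch (pure Python, hypothetical data): drop incomplete records,
# then keep only features whose variance exceeds a threshold.

def clean_rows(rows):
    """Remove records that contain missing (None) values."""
    return [r for r in rows if None not in r.values()]

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def select_features(rows, threshold=0.01):
    """Keep features whose variance across the corpus exceeds the threshold."""
    return [f for f in rows[0]
            if variance([r[f] for r in rows]) > threshold]

raw = [
    {"age": 34, "income": 52.0, "region_code": 1},
    {"age": 29, "income": None, "region_code": 1},  # incomplete -> dropped
    {"age": 41, "income": 77.5, "region_code": 1},
    {"age": 23, "income": 48.0, "region_code": 1},
]

clean = clean_rows(raw)
kept = select_features(clean)  # "region_code" is constant, so it is excluded
```

A constant column like `region_code` carries no signal for the model, so the variance filter removes it before the learning step.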
Learning
After the features are selected, supervised or unsupervised Machine Learning models are created for future prediction or classification.
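As a toy illustration of the supervised case, a 1-nearest-neighbor classifier can be written in a few lines. The feature values and `"churn"`/`"retain"` labels are hypothetical; in practice this step would use a library such as Spark MLlib or scikit-learn.

```python
# Minimal sketch (pure Python, toy data): a 1-nearest-neighbor classifier
# standing in for the supervised "learning" step of the lifecycle.

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def predict(train, point):
    """Classify a point by the label of its nearest training example."""
    nearest = min(train, key=lambda ex: euclidean(ex[0], point))
    return nearest[1]

train = [((1.0, 1.0), "churn"), ((1.2, 0.9), "churn"),
         ((5.0, 5.0), "retain"), ((4.8, 5.2), "retain")]

label_a = predict(train, (1.1, 1.0))  # near the "churn" cluster
label_b = predict(train, (5.1, 4.9))  # near the "retain" cluster
```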
Model Deployment & Management
In order for this process to be valuable, the organization must deploy these models into their production environment.
Additionally, they must also monitor the performance and health of these models while they are operating.
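One common monitoring check is comparing a deployed model's live prediction rate against the rate observed at validation time. This is a minimal sketch with hypothetical numbers and a hypothetical tolerance, not a specific DSX API.

```python
# Minimal sketch (pure Python, hypothetical thresholds): flag a deployed
# model for review when its live positive-prediction rate drifts too far
# from the baseline rate observed during validation.

def positive_rate(predictions):
    """Fraction of predictions that are positive (1)."""
    return sum(predictions) / len(predictions)

def needs_review(baseline_rate, live_predictions, tolerance=0.15):
    """True when the live positive rate drifts beyond the tolerance."""
    return abs(positive_rate(live_predictions) - baseline_rate) > tolerance

baseline = 0.30                            # rate seen during validation
healthy = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]   # 0.30 -> within tolerance
drifted = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]   # 0.80 -> flag for review
```

Drift like this often signals that the data feeding the model has changed, prompting retraining rather than an immediate failure.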
Data Science teams consist of Data Scientists, Data Engineers, Business Analysts, and Application Developers.
Challenge #1 – Multiple sources of data.
Traditional – Structured (DB, EDW, CRM, etc.)
Big Data – Unstructured (Social media, IOT), Hadoop based data stores
Legacy – Spreadsheets
Problem 1 – 20% of the time in the Data Science Lifecycle (DSLC) is spent by Data Scientists trying to locate the required data.
Problem 2 – 60% of the time in the DSLC is spent by Data Engineers centrally locating the data and preparing it to ensure data quality.
*Key Take Away – Combined, 80% of the time in the DSLC is spent on locating, moving, and preparing the data before machine learning models can be created or deployed.
Challenge #2 – Data Science workflows lack standardization.
Problem 1 – The open source community has created too many tools to expect a single person to know them all.
Problem 2 – Data Science teams are often limited to the tools that their data scientists and application developers are most familiar with, including languages and libraries.
Problem 3 – There are no systems in place where an organization’s Data Science team can build reliable, standardized, and repeatable pipelines for managing models at a Big Data scale.
*Key Take Away – Due to the number of open source tools, there are no standardizations in Data Science practices.
Challenge #3 – Collaboration is difficult
Problem 1 – Without a common framework, Data Scientists have difficulty collaborating with team members. They are unable to share code, results, or models with each other.
Problem 2 – Due to this lack of collaboration, Business Analysts struggle to find visualization tools that can integrate with the data repository.
*Key Take Away – Collaboration across a Data Science team is difficult with existing tools.
Challenge #4 – Deploying Models into Production
Problem 1 – Data Science teams struggle migrating their prototype models to deployment in their production environment.
Problem 2 – Using current tools, it is also difficult to monitor the health and performance of their models while they are operating in production.
*Key Take Away – Creating models in an isolated environment is relatively straightforward. The challenge begins when these models need to be deployed and monitored in production.
Components of the end-to-end real-time insights dataflow platform
MiNiFi : Edge Data Collection w/Provenance and centralized C&C
NiFi: End to end dataflow management w/Provenance and Interactive C&C
Kafka: High throughput durable replayable messaging
Storm: High-scale Data Processing
Right-sized solutions
All optimized for delivery into HDP (HDFS, Hive, Spark, HBase, etc.)
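To make Kafka's role concrete, the "durable replayable messaging" idea can be illustrated with an append-only log plus consumer offsets. This is a conceptual sketch in pure Python, not the Kafka client API; the event strings are hypothetical.

```python
# Minimal sketch (pure Python, not the Kafka API): an append-only log with
# consumer offsets, illustrating Kafka's durable, replayable messaging model.

class Log:
    def __init__(self):
        self.messages = []          # durable, append-only record

    def produce(self, msg):
        self.messages.append(msg)

    def consume(self, offset, max_count=10):
        """Read from an offset; replaying is just re-reading an old offset."""
        batch = self.messages[offset:offset + max_count]
        return batch, offset + len(batch)

log = Log()
for event in ["sensor:42", "sensor:43", "sensor:44"]:
    log.produce(event)

batch, next_offset = log.consume(0)   # first read advances the offset
replayed, _ = log.consume(0)          # replay from the beginning at any time
```

Because consumers track their own offsets rather than deleting messages, downstream processors like Storm can reprocess historical data after a failure.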