Machine Learning
With Apache Spark
CodeMash, Sandusky, Ohio, Jan 5-8, 2016
David Taieb
STSM-IBM Cloud Data Services
©2015 IBM Corporation
Introduction
David Taieb
david_taieb@us.ibm.com
Developer Advocate
IBM Cloud Data Services
Our mission:
We are here to help developers realize their most ambitious projects.
https://developer.ibm.com/clouddataservices/connect/
Big data, cloud and the rise of business Analytics
‣ Data being collected by enterprises grows exponentially: ERP, embedded systems (IoT)
‣ Cloud, with high availability and huge capacity, makes more data available for analytics
‣ Big data and cloud create new opportunities:
- Organizations: more effective decision-making process, richer client interactions
- Business users: discover new insights, better decision-making process
- Developers: access to diverse data sources and new tools that increase productivity
Why Business Analytics with big data
“In God we trust.
All others bring data”
W. Edwards Deming
‣ Every day, companies make bet-the-business
decisions about their customers, competitors and
new products
‣ Time available for decision-making is shrinking
(sometimes real-time)
‣ As more and more companies go digital, data
becomes the world’s newest resource for
competitive advantage
‣ Decision making has moved from the elite few to
the empowered many
‣ Few organizations can keep pace with the
appetite for data
Business Analytics Types
Descriptive Analytics: Look at the reason for past success or failure
Predictive Analytics: What is probably going to happen in the future?
Prescriptive Analytics: What are my best actions?
• Use interactive querying and
visualization to explore and
communicate data
• Discover insight and trends
• correlation between 2
seemingly unrelated
variables
• Data mining
• Generate hypothesis and
models
• Predict occurrence of future
events using probability
(confidence)
• Product recommendations
• Classification
• Help make the right decision
based on the data
• Find optimal solution to a
given problem
Taking Analytics a step further with Cognitive Systems
‣ Use natural language processing and machine learning algorithms to unlock knowledge from massive amounts of structured and unstructured data
Decide
• Ingest and analyze domain sources, info models
• Generate evidence based decisions with confidence
• Learn with new outcomes and actions
• e.g. - Next generation Apps → Probabilistic Apps
Ask
• Leverage vast amounts of data
• Ask questions for greater insights
• Natural language inquiries
• e.g. - Next generation Chat
Discover
• Find the rationale for given answers
• Prompt for inputs to yield improved responses
• Inspire considerations of new ideas
• e.g. - Next generation Search → Discovery
IBM Watson
IBM Cloud Data Services
Resources for developers to get, build, and analyze on the IBM Cloud
What is Spark
Spark is an open source in-memory computing framework for distributed data processing and iterative analysis on massive data volumes
Spark Core Libraries
Spark Core: general compute engine, handles distributed task dispatching, scheduling and basic I/O functions
Spark SQL: executes SQL statements
Spark Streaming: performs streaming analytics using micro-batches
MLlib (machine learning): common machine learning and statistical algorithms
GraphX (graph): distributed graph processing framework
Key reasons for interest in Spark
Open Source
Fast
distributed data
processing
Productive
Web Scale
•In-memory storage greatly reduces disk I/O
•Up to 100x faster than Hadoop MapReduce in memory, 10x faster on disk
•Largest project and one of the most active on Apache
•Vibrant growing community of developers continuously improve code
base and extend capabilities
•Fast adoption in the enterprise (IBM, Databricks, etc…)
•Fault tolerant, seamlessly recompute lost data from hardware failure
•Scalable: easily increase number of worker nodes
•Flexible job execution: Batch, Streaming, Interactive
•Easily handle Petabytes of data without special code handling
•Compatible with existing Hadoop ecosystem
•Unified programming model across a range of use cases
•Rich and expressive APIs hide the complexities of parallel computing and worker node management
•Support for Java, Scala, Python and R: less code written
•Includes a set of core libraries that enable various analytic methods: Spark SQL, MLlib, GraphX
IBM is all-in on its commitment to Spark
Contribute to the Core
- Launch Spark Technology Center (STC), 300 engineers
- Open source SystemML
- Partner with Databricks
Foster Community
- Educate 1M+ data scientists and engineers via online courses
- Sponsor AMPLab, creators and evangelists of Spark
Infuse the Portfolio
- Integrate Spark throughout portfolio
- 3,500 employees working on Spark-related topics
- Spark however customers want it: standalone, platform or products
Source: https://www-03.ibm.com/press/us/en/pressrelease/47107.wss
Spark MLlib
‣ Extension to the Spark Core API that provides a library of easy-to-use machine learning algorithms.
‣ Highly scalable: leverages Spark's ability to work with massive amounts of data
‣ Fast: designed for parallel computing
‣ Covers common machine learning algorithms:
- Regression
- Classification
- Clustering
- Recommender Systems
- Text Analytics
What is Machine Learning and where is it used
‣Subfield of computer science that focuses on getting computers to
learn from data:
- Recognize patterns
- Make predictions
‣Example use:
- Spam filters
- Netflix recommendations
- Self-driving cars
- Watson
- …
Typical Machine Learning Flow diagram
Data Acquisition → Data Preparation → Data Annotation (Ground Truth) → Model Training → Model Testing
Data Preparation:
• Cleansing
• Shaping
• Enrichment
Data sets: Training Set, Test Set, Blind Set
Iterative cross-validation: train model, evaluate performance and optimize model
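The training, test and blind sets in this flow can be produced by a simple random split. The following is a minimal sketch in plain Python, assuming a 60/20/20 split and a fixed seed; both values and the function name `split_dataset` are illustrative choices, not taken from the talk. On Spark, `RDD.randomSplit` plays a similar role.

```python
import random


def split_dataset(records, weights=(0.6, 0.2, 0.2), seed=42):
    """Shuffle records and cut them into training, test and blind sets."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    shuffled = list(records)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_end = round(n * weights[0])
    test_end = train_end + round(n * weights[1])
    # Whatever remains after the first two cuts becomes the blind set.
    return shuffled[:train_end], shuffled[train_end:test_end], shuffled[test_end:]


# Hypothetical data: 100 record ids standing in for flight records.
training_set, test_set, blind_set = split_dataset(range(100))
```

The blind set is held back until the end so that the final performance measurement uses data the model never saw during training or tuning.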
MLlib Algorithm Overview
• Predictive analytics
• Recommendations
- Collaborative Filtering
- Matrix Factorization
• Feature extraction and transformation
- TF-IDF
- HashingTF
- Word2Vec
- StandardScaler
- Normalizer
• Model Evaluation/Metrics
- Binary Classification Metrics
- Multi-Class Metrics
- Regression Metrics
Predictive analytics
Supervised Learning (requires ground truth)
Continuous output:
• Regression
- Linear
- Ridge
- Lasso
- Isotonic
• Decision Tree
• RandomForest
• GradientBoostedTree
Discrete output:
• Classification
- Logistic Regression
- SVM
- NaiveBayes
• Decision Tree
• RandomForest
• GradientBoostedTree
• K-NN (available as an add-on Spark package)
Unsupervised Learning (no ground-truth data required)
• Clustering
- KMeans
- Gaussian Mixture
• Dimensionality Reduction
- PCA
- SVD
• FP-Growth
Featured demo: Flight Delay Predictor
‣ Use training data collected from FlightStats and enriched with weather observations from the “Insight for Weather” service on Bluemix
‣ Train a multi-class classifier that, given flight departure weather observations, can predict the flight delay class:
- 0 = Canceled
- 1 = On Time
- 2 = Delay less than 2 hours
- 3 = Delay between 2 and 4 hours
- 4 = Delay more than 4 hours
‣ Provide metrics for each algorithm:
- Accuracy
- Precision
- Recall
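The five delay classes listed above can be expressed as a small labeling function. This is a sketch under stated assumptions: the record fields (`delay_minutes`, `canceled`), the rule that any positive delay counts as late, and the handling of delays of exactly 2 or 4 hours are all hypothetical, since the slides do not specify them.

```python
def delay_class(delay_minutes, canceled=False):
    """Map a flight outcome to one of the five delay classes (0 to 4)."""
    if canceled:
        return 0  # Canceled
    if delay_minutes <= 0:
        return 1  # On time (assumption: early arrivals also count as on time)
    if delay_minutes < 120:
        return 2  # Delay less than 2 hours
    if delay_minutes <= 240:
        return 3  # Delay between 2 and 4 hours (assumption: bounds inclusive)
    return 4      # Delay more than 4 hours


def to_labeled_point(record, feature_names):
    """Turn a raw record into a (label, feature vector) pair for training."""
    label = delay_class(record["delay_minutes"], record.get("canceled", False))
    features = [float(record[name]) for name in feature_names]
    return label, features


# Hypothetical record with two weather features.
example = {"delay_minutes": 45, "temp": 3.5, "wind_speed": 28}
labeled = to_labeled_point(example, ["temp", "wind_speed"])
```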
Architecture
Weather
Simple Data
Pipes
Airports
Flight Schedules
Flight Status
Metadata
Training
Set
Test
Set
Blind
Set
Custom
Connector run
every 24 hours
Notebook
Get
‣ Identify data sources:
- flightstats.com: https://developer.flightstats.com
- Airport metadata: FS Code, geolocation,…
- Flight Schedules
- Flight Status
- Weather Observations
- Insight for Weather on Bluemix
‣ Storage:
- Cloudant
‣ Tool used:
- Simple Data Pipes custom connector to build the Training, Test and Blind data sets
‣ Constraints:
- The weather service provides past observations only as far as 24 hours back
- The FlightStats API key is a 30-day trial version, limited to 20,000 calls
Custom Pipes Connector to build training data set
https://developer.ibm.com/clouddataservices/simple-data-pipe/
Run every 24 hours
Because the weather service doesn’t return observations older than 24 hours, the connector that builds the data set must be run every 24 hours
Build: Explore the data with Notebook
Loading training data set
Build: Visualize and explore data set
Scatter plot of flight delays based on temperature at the departure and arrival airports
Build: Visualize and explore data set
Scatter plot of flight delays based on wind speed at the departure and arrival airports
Constraints
‣ Past weather observations provided by the “Insight for Weather” service have more details than forecast data:
- Limit the features used to train the models to the intersection of the two.
‣ Restrict the training data to the weather forecast at the departure and arrival airports
- Would adding weather data from various points along the route improve model performance?
‣ Difficult to get enough representative data because I was using a trial account on FlightStats
- Ideally, I would use more airports with more representative weather
‣ Didn’t use any categorical features
‣ For simplicity: use IPython Notebook as the user interface
- Makes the experience less compelling for business users
- To avoid writing too much code in the Notebook, encapsulate some of the business logic in a Python library
- The Python API doesn’t cover as much of the Spark API as Scala
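The first constraint, training only on features that exist in both past observations and forecasts, amounts to a set intersection over field names. A minimal sketch follows; the field names and values are hypothetical and do not reflect the actual “Insight for Weather” payload.

```python
def common_features(observation, forecast):
    """Return the sorted feature names present in both weather records."""
    return sorted(set(observation) & set(forecast))


def restrict(record, feature_names):
    """Keep only the selected features, always in the same order."""
    return [record[name] for name in feature_names]


# Hypothetical records: observations carry more detail than forecasts.
observation = {"temp": 12.0, "wind_speed": 20.0, "pressure": 1013.0, "dew_point": 7.0}
forecast = {"temp": 10.0, "wind_speed": 25.0, "precip_chance": 40.0}

features = common_features(observation, forecast)
training_vector = restrict(observation, features)   # used to train the model
prediction_vector = restrict(forecast, features)    # used to query the model
```

Sorting the names fixes the column order, so vectors built at training time and at prediction time line up feature by feature.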
Load labeled data RDD
Build: NaiveBayes Classification
Build: Decision Tree classification
Build: Random Forest classification
Build: Performance measurements
Load blind data
Build: Compare metrics between different models
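Comparing models comes down to computing the same metrics for each one on the same blind data. The sketch below computes accuracy plus per-class precision and recall in plain Python; the labels and predictions are made-up values using the delay classes 0 to 4, and the function name is illustrative. In Spark, `MulticlassMetrics` from `pyspark.mllib.evaluation` offers comparable measurements.

```python
from collections import Counter


def classification_metrics(labels, predictions):
    """Return overall accuracy plus per-class precision and recall."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equal in length")
    correct = Counter()    # correct predictions per class
    predicted = Counter()  # how often each class was predicted
    actual = Counter()     # how often each class really occurred
    for label, prediction in zip(labels, predictions):
        actual[label] += 1
        predicted[prediction] += 1
        if label == prediction:
            correct[label] += 1
    classes = sorted(set(actual) | set(predicted))
    accuracy = sum(correct.values()) / len(labels)
    # Precision: of the flights predicted as class c, how many really were c.
    precision = {c: correct[c] / predicted[c] if predicted[c] else 0.0 for c in classes}
    # Recall: of the flights that really were class c, how many were found.
    recall = {c: correct[c] / actual[c] if actual[c] else 0.0 for c in classes}
    return accuracy, precision, recall


# Hypothetical blind-set labels and one model's predictions.
labels = [1, 1, 1, 2, 2, 3, 0, 1]
predictions = [1, 1, 2, 2, 1, 3, 0, 1]
accuracy, precision, recall = classification_metrics(labels, predictions)
```

Running the same function over each model's predictions gives directly comparable numbers.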
Naïve Bayes vs Decision Tree
Naïve Bayes:
‣ Probabilistic: computes the probability that a data instance belongs to a specific class
‣ Assumes that each feature (variable) is independent of the others
‣ Performance depends on the predictive nature of the features (non-predictive features will affect the accuracy)
‣ Works well with a small amount of training data. Doesn’t need all the possibilities
‣ Doesn’t work with categorical features.
Decision Tree:
‣ Non-probabilistic: partitions the data into subsets that best describe the variable
‣ The deeper the tree, the better the model fits the data
‣ Watch out for overfitting: need to prune the tree
‣ Can handle categorical or continuous features
‣ No need for input to be scaled or standardized: set your features and go!
‣ Requires a lot of data covering all possibilities
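To make the independence assumption concrete, here is a small Naive Bayes sketch in plain Python. It uses a Gaussian likelihood per feature, chosen only because it is short to write for continuous weather values; it is not the variant MLlib implements (MLlib's `NaiveBayes` is multinomial by default). The sample data and function names are hypothetical.

```python
import math
from collections import defaultdict


def train_gaussian_nb(samples):
    """samples: list of (label, [feature values]). Returns per-class statistics."""
    by_class = defaultdict(list)
    for label, features in samples:
        by_class[label].append(features)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small floor keeps every variance strictly positive.
        variances = [max(sum((v - m) ** 2 for v in col) / n, 1e-9)
                     for col, m in zip(columns, means)]
        prior = n / len(samples)
        model[label] = (prior, means, variances)
    return model


def predict(model, features):
    """Pick the class with the highest log posterior."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        # Independence assumption: per-feature log-likelihoods simply add up.
        for x, mean, var in zip(features, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)


# Hypothetical (delay class, [temperature, wind speed]) training samples.
samples = [
    (1, [20.0, 5.0]), (1, [22.0, 8.0]), (1, [18.0, 6.0]),
    (2, [2.0, 30.0]), (2, [0.0, 35.0]), (2, [4.0, 28.0]),
]
model = train_gaussian_nb(samples)
```

Because each class only needs a prior, a mean and a variance per feature, the model can be estimated from few samples, which matches the point about small training sets.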
Analyze: Run model
Code: Run Model
If you want to know more
‣https://developer.ibm.com/clouddataservices/
‣https://github.com/ibm-cds-labs/pipes-connector-flightstats
‣http://spark.apache.org/docs/latest/mllib-guide.html
‣https://console.ng.bluemix.net/data/analytics/

More Related Content

What's hot

Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
Databricks
 
An Architecture for Agile Machine Learning in Real-Time Applications
An Architecture for Agile Machine Learning in Real-Time ApplicationsAn Architecture for Agile Machine Learning in Real-Time Applications
An Architecture for Agile Machine Learning in Real-Time Applications
Johann Schleier-Smith
 

What's hot (20)

Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
Operationalizing Edge Machine Learning with Apache Spark with Nisha Talagala ...
 
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
 
Semantic Image Logging Using Approximate Statistics & MLflow
Semantic Image Logging Using Approximate Statistics & MLflowSemantic Image Logging Using Approximate Statistics & MLflow
Semantic Image Logging Using Approximate Statistics & MLflow
 
Strata San Jose 2016: Scalable Ensemble Learning with H2O
Strata San Jose 2016: Scalable Ensemble Learning with H2OStrata San Jose 2016: Scalable Ensemble Learning with H2O
Strata San Jose 2016: Scalable Ensemble Learning with H2O
 
Saving Energy in Homes with a Unified Approach to Data and AI
Saving Energy in Homes with a Unified Approach to Data and AISaving Energy in Homes with a Unified Approach to Data and AI
Saving Energy in Homes with a Unified Approach to Data and AI
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Feature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systemsFeature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systems
 
Managing the Machine Learning Lifecycle with MLflow
Managing the Machine Learning Lifecycle with MLflowManaging the Machine Learning Lifecycle with MLflow
Managing the Machine Learning Lifecycle with MLflow
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutions
 
Hopsworks - The Platform for Data-Intensive AI
Hopsworks - The Platform for Data-Intensive AIHopsworks - The Platform for Data-Intensive AI
Hopsworks - The Platform for Data-Intensive AI
 
An Architecture for Agile Machine Learning in Real-Time Applications
An Architecture for Agile Machine Learning in Real-Time ApplicationsAn Architecture for Agile Machine Learning in Real-Time Applications
An Architecture for Agile Machine Learning in Real-Time Applications
 
Delivering Data Science to the Business
Delivering Data Science to the BusinessDelivering Data Science to the Business
Delivering Data Science to the Business
 
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
 
Serverless data pipelines gcp
Serverless data pipelines gcpServerless data pipelines gcp
Serverless data pipelines gcp
 
Practical Machine Learning
Practical Machine LearningPractical Machine Learning
Practical Machine Learning
 
MLOps with Kubeflow
MLOps with Kubeflow MLOps with Kubeflow
MLOps with Kubeflow
 
ML-Ops: From Proof-of-Concept to Production Application
ML-Ops: From Proof-of-Concept to Production ApplicationML-Ops: From Proof-of-Concept to Production Application
ML-Ops: From Proof-of-Concept to Production Application
 
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
 
Introduction to Hivemall
Introduction to HivemallIntroduction to Hivemall
Introduction to Hivemall
 

Similar to Machine Learning with Apache Spark

Streaming Sensor Data Slides_Virender
Streaming Sensor Data Slides_VirenderStreaming Sensor Data Slides_Virender
Streaming Sensor Data Slides_Virender
vithakur
 

Similar to Machine Learning with Apache Spark (20)

High Value Business Intelligence for IBM Platform compute environments
High Value Business Intelligence for IBM Platform compute environmentsHigh Value Business Intelligence for IBM Platform compute environments
High Value Business Intelligence for IBM Platform compute environments
 
Software Defined Infrastructure
Software Defined InfrastructureSoftware Defined Infrastructure
Software Defined Infrastructure
 
ICP for Data- Enterprise platform for AI, ML and Data Science
ICP for Data- Enterprise platform for AI, ML and Data ScienceICP for Data- Enterprise platform for AI, ML and Data Science
ICP for Data- Enterprise platform for AI, ML and Data Science
 
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflows
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflowsCloud nativecomputingtechnologysupportinghpc cognitiveworkflows
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflows
 
Solving enterprise challenges through scale out storage & big compute final
Solving enterprise challenges through scale out storage & big compute finalSolving enterprise challenges through scale out storage & big compute final
Solving enterprise challenges through scale out storage & big compute final
 
The sensor data challenge - Innovations (not only) for the Internet of Things
The sensor data challenge - Innovations (not only) for the Internet of ThingsThe sensor data challenge - Innovations (not only) for the Internet of Things
The sensor data challenge - Innovations (not only) for the Internet of Things
 
Streaming Sensor Data Slides_Virender
Streaming Sensor Data Slides_VirenderStreaming Sensor Data Slides_Virender
Streaming Sensor Data Slides_Virender
 
Machine learning in the physical world by Kip Larson from AWS IoT
Machine learning in the physical world by  Kip Larson from AWS IoTMachine learning in the physical world by  Kip Larson from AWS IoT
Machine learning in the physical world by Kip Larson from AWS IoT
 
Preventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive IndustryPreventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive Industry
 
Inawisdom MLOPS
Inawisdom MLOPSInawisdom MLOPS
Inawisdom MLOPS
 
Enabling a hardware accelerated deep learning data science experience for Apa...
Enabling a hardware accelerated deep learning data science experience for Apa...Enabling a hardware accelerated deep learning data science experience for Apa...
Enabling a hardware accelerated deep learning data science experience for Apa...
 
Data & Analytics - Session 1 - Big Data Analytics
Data & Analytics - Session 1 -  Big Data AnalyticsData & Analytics - Session 1 -  Big Data Analytics
Data & Analytics - Session 1 - Big Data Analytics
 
20150617 spark meetup zagreb
20150617 spark meetup zagreb20150617 spark meetup zagreb
20150617 spark meetup zagreb
 
Digital Reinvention by NRB
Digital Reinvention by NRBDigital Reinvention by NRB
Digital Reinvention by NRB
 
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache SparkData-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
 
IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...
IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...
IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...
 
Ml ops on AWS
Ml ops on AWSMl ops on AWS
Ml ops on AWS
 
Big Data: InterConnect 2016 Session on Getting Started with Big Data Analytics
Big Data:  InterConnect 2016 Session on Getting Started with Big Data AnalyticsBig Data:  InterConnect 2016 Session on Getting Started with Big Data Analytics
Big Data: InterConnect 2016 Session on Getting Started with Big Data Analytics
 
Take the Bias out of Big Data Insights With Augmented Analytics
Take the Bias out of Big Data Insights With Augmented AnalyticsTake the Bias out of Big Data Insights With Augmented Analytics
Take the Bias out of Big Data Insights With Augmented Analytics
 
Accelerating Innovation with Hybrid Cloud
Accelerating Innovation with Hybrid CloudAccelerating Innovation with Hybrid Cloud
Accelerating Innovation with Hybrid Cloud
 

More from IBM Cloud Data Services

Analyzing GeoSpatial data with IBM Cloud Data Services & Esri ArcGIS
Analyzing GeoSpatial data with IBM Cloud Data Services & Esri ArcGISAnalyzing GeoSpatial data with IBM Cloud Data Services & Esri ArcGIS
Analyzing GeoSpatial data with IBM Cloud Data Services & Esri ArcGIS
IBM Cloud Data Services
 

More from IBM Cloud Data Services (20)

CouchDB Day NYC 2017: Full Text Search
CouchDB Day NYC 2017: Full Text SearchCouchDB Day NYC 2017: Full Text Search
CouchDB Day NYC 2017: Full Text Search
 
CouchDB Day NYC 2017: Using Geospatial Data in Cloudant & CouchDB
CouchDB Day NYC 2017: Using Geospatial Data in Cloudant & CouchDBCouchDB Day NYC 2017: Using Geospatial Data in Cloudant & CouchDB
CouchDB Day NYC 2017: Using Geospatial Data in Cloudant & CouchDB
 
CouchDB Day NYC 2017: MapReduce Views
CouchDB Day NYC 2017: MapReduce ViewsCouchDB Day NYC 2017: MapReduce Views
CouchDB Day NYC 2017: MapReduce Views
 
CouchDB Day NYC 2017: Replication
CouchDB Day NYC 2017: ReplicationCouchDB Day NYC 2017: Replication
CouchDB Day NYC 2017: Replication
 
CouchDB Day NYC 2017: Mango
CouchDB Day NYC 2017: MangoCouchDB Day NYC 2017: Mango
CouchDB Day NYC 2017: Mango
 
CouchDB Day NYC 2017: JSON Documents
CouchDB Day NYC 2017: JSON DocumentsCouchDB Day NYC 2017: JSON Documents
CouchDB Day NYC 2017: JSON Documents
 
CouchDB Day NYC 2017: Core HTTP API
CouchDB Day NYC 2017: Core HTTP APICouchDB Day NYC 2017: Core HTTP API
CouchDB Day NYC 2017: Core HTTP API
 
CouchDB Day NYC 2017: Introduction to CouchDB 2.0
CouchDB Day NYC 2017: Introduction to CouchDB 2.0CouchDB Day NYC 2017: Introduction to CouchDB 2.0
CouchDB Day NYC 2017: Introduction to CouchDB 2.0
 
Practical Use of a NoSQL
Practical Use of a NoSQLPractical Use of a NoSQL
Practical Use of a NoSQL
 
I See NoSQL Document Stores in Geospatial Applications
I See NoSQL Document Stores in Geospatial ApplicationsI See NoSQL Document Stores in Geospatial Applications
I See NoSQL Document Stores in Geospatial Applications
 
Webinar: The Anatomy of the Cloudant Data Layer
Webinar: The Anatomy of the Cloudant Data LayerWebinar: The Anatomy of the Cloudant Data Layer
Webinar: The Anatomy of the Cloudant Data Layer
 
NoSQL for SQL Users
NoSQL for SQL UsersNoSQL for SQL Users
NoSQL for SQL Users
 
dashDB: the GIS professional’s bridge to mainstream IT systems
dashDB: the GIS professional’s bridge to mainstream IT systemsdashDB: the GIS professional’s bridge to mainstream IT systems
dashDB: the GIS professional’s bridge to mainstream IT systems
 
Practical Use of a NoSQL Database
Practical Use of a NoSQL DatabasePractical Use of a NoSQL Database
Practical Use of a NoSQL Database
 
SQL To NoSQL - Top 6 Questions Before Making The Move
SQL To NoSQL - Top 6 Questions Before Making The MoveSQL To NoSQL - Top 6 Questions Before Making The Move
SQL To NoSQL - Top 6 Questions Before Making The Move
 
Mobile App Development With IBM Cloudant
Mobile App Development With IBM CloudantMobile App Development With IBM Cloudant
Mobile App Development With IBM Cloudant
 
IBM Cognos Business Intelligence using dashDB
IBM Cognos Business Intelligence using dashDBIBM Cognos Business Intelligence using dashDB
IBM Cognos Business Intelligence using dashDB
 
Run Oracle Apps in the Cloud with dashDB
Run Oracle Apps in the Cloud with dashDBRun Oracle Apps in the Cloud with dashDB
Run Oracle Apps in the Cloud with dashDB
 
Analyzing GeoSpatial data with IBM Cloud Data Services & Esri ArcGIS
Analyzing GeoSpatial data with IBM Cloud Data Services & Esri ArcGISAnalyzing GeoSpatial data with IBM Cloud Data Services & Esri ArcGIS
Analyzing GeoSpatial data with IBM Cloud Data Services & Esri ArcGIS
 
Get Started Quickly with IBM's Hadoop as a Service
Get Started Quickly with IBM's Hadoop as a ServiceGet Started Quickly with IBM's Hadoop as a Service
Get Started Quickly with IBM's Hadoop as a Service
 

Recently uploaded

Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
ranjankumarbehera14
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
HyderabadDolls
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
HyderabadDolls
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
gajnagarg
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 

Recently uploaded (20)

5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...

Machine Learning with Apache Spark

  • 1. Machine Learning With Apache Spark CodeMash, Sandusky, Ohio, Jan 5-8, 2016 David Taieb STSM-IBM Cloud Data Services
  • 2. ©2015 IBM Corporation Introduction David Taieb david_taieb@us.ibm.com Developer Advocate IBM Cloud Data Services Our mission: We are here to help developers realize their most ambitious projects. https://developer.ibm.com/clouddataservices/connect/
  • 3. Big data, cloud and the rise of Business Analytics ‣ Data collected by enterprises grows exponentially: ERP, embedded systems (IoT) ‣ Cloud, with high availability and huge capacity, makes more data available for analytics ‣ Big data and cloud create new opportunities: - Organizations: more effective decision-making process, richer client interactions - Business users: discover new insights, better decision-making process - Developers: access to diverse data sources and new tools that increase productivity
  • 4. Why Business Analytics with big data “In God we trust. All others bring data” (W. Edwards Deming) ‣ Every day, companies make bet-the-business decisions about their customers, competitors and new products ‣ Time available for decision-making is shrinking (sometimes real-time) ‣ As more and more companies go digital, data becomes the world’s newest resource for competitive advantage ‣ Decision making has moved from the elite few to the empowered many ‣ Few organizations can keep pace with the appetite for data
  • 5. Business Analytics Types: Descriptive Analytics (Look at the reason for past success or failure), Predictive Analytics (What is probably going to happen in the future?), Prescriptive Analytics (What are my best actions?) • Use interactive querying and visualization to explore and communicate data • Discover insights and trends • Correlation between 2 seemingly unrelated variables • Data mining • Generate hypotheses and models • Predict occurrence of future events using probability (confidence) • Product recommendations • Classification • Help make the right decision based on the data • Find optimal solution to a given problem
  • 6. Taking Analytics a step further with Cognitive Systems ‣ Use natural language processing and machine learning algorithms to unlock knowledge from massive amounts of structured and unstructured data. Decide: • Ingest and analyze domain sources, info models • Generate evidence-based decisions with confidence • Learn with new outcomes and actions • e.g. Next generation Apps → Probabilistic Apps. Ask: • Leverage vast amounts of data • Ask questions for greater insights • Natural language inquiries • e.g. Next generation Chat. Discover: • Find the rationale for given answers • Prompt for inputs to yield improved responses • Inspire considerations of new ideas • e.g. Next generation Search → Discovery. IBM Watson
  • 7. IBM Cloud Data Services Resources for developers to get, build, and analyze on the IBM Cloud
  • 8. What is Spark? Spark is an open source in-memory computing framework for distributed data processing and iterative analysis on massive data volumes
  • 9. Spark Core Libraries. Spark Core: general compute engine, handles distributed task dispatching, scheduling and basic I/O functions. Spark SQL: executes SQL statements. Spark Streaming: performs streaming analytics using micro-batches. MLlib (machine learning): common machine learning and statistical algorithms. GraphX (graph): distributed graph processing framework
  • 10. Key reasons for interest in Spark. Open Source: • Largest project and one of the most active on Apache • Vibrant, growing community of developers continuously improves the code base and extends capabilities • Fast adoption in the enterprise (IBM, Databricks, etc.). Fast: • In-memory storage greatly reduces disk I/O • Up to 100x faster in memory, 10x faster on disk. Distributed data processing: • Fault tolerant, seamlessly recomputes data lost to hardware failure • Scalable: easily increase the number of worker nodes • Flexible job execution: Batch, Streaming, Interactive. Productive: • Unified programming model across a range of use cases • Rich and expressive APIs hide the complexities of parallel computing and worker node management • Support for Java, Scala, Python and R: less code written • Includes a set of core libraries that enable various analytic methods: Spark SQL, MLlib, GraphX. Web Scale: • Easily handles petabytes of data without special code handling • Compatible with the existing Hadoop ecosystem
  • 11. IBM is all-in on its commitment to Spark. Contribute to the Core: Launch Spark Technology Cluster (STC), 300 engineers; Open source SystemML; Partner with Databricks. Foster Community: Educate 1M+ data scientists and engineers via online courses; Sponsor AMPLab, creators and evangelists of Spark. Infuse the Portfolio: Integrate Spark throughout portfolio; 3,500 employees working on Spark-related topics; Spark however customers want it: standalone, platform or products. Source: https://www-03.ibm.com/press/us/en/pressrelease/47107.wss
  • 12. Spark MLlib ‣ Extension to the Spark Core API that provides a library of easy-to-use machine learning algorithms ‣ Highly scalable: leverages Spark's ability to work with massive amounts of data ‣ Fast: designed for parallel computing ‣ Covers common machine learning algorithms: - Regression - Classification - Clustering - Recommender Systems - Text Analytics
  • 13. What is Machine Learning and where is it used ‣ Subfield of computer science that focuses on getting computers to learn from data: - Recognize patterns - Make predictions ‣ Example uses: - Spam filters - Netflix recommendations - Self-driving cars - Watson - …
  • 14. Typical Machine Learning Flow (diagram). Stages: Data Acquisition → Data Preparation (Cleansing, Shaping, Enrichment) → Data Annotation (Ground Truth) → Model Training → Model Testing. Data sets: Training Set, Test Set, Blind Set. Loop labels: Train Model, Iterative Cross-Validation, Evaluate Performance and optimize model
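The three data sets named on slide 14 can be illustrated with a small, hypothetical split helper. This is only a sketch: the deck does not say how its sets were partitioned, so the 60/20/20 fractions, the fixed seed and the function name `split_dataset` are all assumptions.

```python
import random


def split_dataset(records, train_frac=0.6, test_frac=0.2, seed=42):
    """Shuffle records and split them into training, test and blind sets.

    The fractions are illustrative assumptions, not values from the deck.
    Whatever is left after the training and test sets becomes the blind set.
    """
    if train_frac <= 0 or test_frac < 0 or train_frac + test_frac >= 1:
        raise ValueError("fractions must leave room for a blind set")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    training = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    blind = shuffled[n_train + n_test:]
    return training, test, blind


training, test, blind = split_dataset(range(100))
print(len(training), len(test), len(blind))
```

The blind set is held back until the very end, so it measures how the tuned model behaves on data that never influenced training or model selection.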
  • 15. MLlib Algorithm Overview • Predictive analytics • Recommendations: Collaborative Filtering, Matrix Factorization • Feature extraction and transformation: TF-IDF, HashingTF, Word2Vec, StandardScaler, Normalizer • Model Evaluation/Metrics: Binary Classification Metrics, Multi Class Metrics, Regression Metrics
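As a rough illustration of what a feature transformation such as the StandardScaler entry on slide 15 does, the sketch below standardizes each column to zero mean and unit standard deviation in plain Python. It is not the MLlib implementation; the function name `standardize` and the choice to center the data and use the sample standard deviation are assumptions made for the example.

```python
import math


def standardize(rows):
    """Rescale each column to zero mean and unit sample standard deviation.

    Hypothetical helper for illustration only. Constant columns (standard
    deviation of zero) are mapped to 0.0 to avoid dividing by zero.
    """
    if len(rows) < 2:
        raise ValueError("need at least two rows to estimate a deviation")
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
        for col, m in zip(columns, means)
    ]
    return [
        [(v - m) / s if s else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Toy rows: [temperature, wind speed]; the values are made up.
print(standardize([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]))
```

Scaling of this kind matters for algorithms that compare feature magnitudes; tree-based models generally do not need it.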
  • 16. Predictive analytics. Supervised Learning (requires Ground-Truth), Continuous Output: • Regression (Linear, Ridge, Lasso, Isotonic) • Decision Tree • RandomForest • GradientBoostedTree. Supervised Learning, Discrete Output: • Classification (Logistic Regression, SVM, NaiveBayes) • Decision Tree • RandomForest • GradientBoostedTree • K-NN (available as add-on Spark package). Unsupervised Learning (no Ground-Truth data required): • Clustering (KMeans, Gaussian Mixture) • Dimensionality Reduction (PCA, SVD) • FP-Growth
  • 17. Featured demo: Flight Delay Predictor ‣ Use training data collected from flight stats and enriched with weather observations from the “Insight for Weather” service on Bluemix ‣ Train a multi-class classifier that, given a flight and departure weather observations, can predict the flight delay class: - 0 = Canceled - 1 = On Time - 2 = Delay less than 2 hours - 3 = Delay between 2 and 4 hours - 4 = Delay more than 4 hours ‣ Provide metrics for each algorithm: - Accuracy - Precision - Recall
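The five delay classes on slide 17 could be assigned with a function like the one below. This is a sketch under stated assumptions: the deck does not show the labeling code, so the function name, the `on_time_threshold` parameter (how many minutes still count as on time) and the treatment of the exact 2-hour and 4-hour boundaries are hypothetical.

```python
# Class labels as listed on slide 17.
CANCELED, ON_TIME, DELAY_LT_2H, DELAY_2_TO_4H, DELAY_GT_4H = range(5)


def delay_class(delay_minutes, canceled=False, on_time_threshold=0):
    """Map a departure delay in minutes to one of the five delay classes.

    Assumptions (not from the deck): delays at or below on_time_threshold
    count as on time, and delays of exactly 120 or 240 minutes fall in the
    "between 2 and 4 hours" class.
    """
    if canceled:
        return CANCELED
    if delay_minutes <= on_time_threshold:
        return ON_TIME
    if delay_minutes < 120:
        return DELAY_LT_2H
    if delay_minutes <= 240:
        return DELAY_2_TO_4H
    return DELAY_GT_4H


for minutes in (0, 45, 180, 300):
    print(minutes, delay_class(minutes))
```

Turning a continuous delay into a small set of labels is what makes this a multi-class classification problem rather than a regression.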
  • 18. Architecture (diagram). Labels: Weather; Simple Data Pipes; Airports; Flight Schedules; Flight Status; Metadata; Training Set; Test Set; Blind Set; Custom Connector, run every 24 hours; Notebook
  • 19. Get ‣ Identify data sources: - flightstats.com (https://developer.flightstats.com): Airport metadata (FS Code, geolocation, …), Flight Schedules, Flight Status - Weather Observations: Insight for Weather on Bluemix ‣ Storage: - Cloudant ‣ Tool used: - Simple Data Pipes custom connector to build Training, Test and Blind data sets ‣ Constraints: - Weather service provides past observations only as far as 24 hours back - Flightstats API key is a 30-day trial version, limited to 20,000 calls
  • 20. Custom Pipes Connector to build training data set https://developer.ibm.com/clouddataservices/simple-data-pipe/
  • 21. Run every 24 hours: because the Weather service doesn’t return observations older than 24 hours, the data set build must be run every 24 hours
  • 22. Build: Explore the data with Notebook
  • 23. Loading training data set
  • 24. Build: Visualize and explore data set. Scatter plot of flight delays based on temperature at departure and arrival airports
  • 25. Build: Visualize and explore data set. Scatter plot of flight delays based on wind speed at departure and arrival airports
  • 26. Constraints ‣ Past weather observations provided by the “Insight for Weather” service have more details than forecast data: - Limit the features used to train the models to the intersection of the two ‣ Restrict the training data to weather forecasts at the departure and arrival airports - Would adding weather data from various points along the route increase model performance? ‣ Difficult to get enough representative data because I was using a trial account on flightstats - Ideally, I would use more airports with more representative weather ‣ Didn’t use any categorical features ‣ For simplicity: use IPython Notebook as the user interface - Makes the experience less compelling for business users - To avoid writing too much code in the Notebook, encapsulate some of the business logic in a Python library - Python doesn’t cover as much of the Spark API as Scala
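The first constraint on slide 26, training only on features present in both past observations and forecasts, amounts to a set intersection. A minimal sketch follows; the field names (`temp`, `windSpeed`, and so on) and both helper names are hypothetical and are not taken from the Insight for Weather API.

```python
def common_features(observation_fields, forecast_fields):
    """Return the fields present in both sources, in observation order."""
    forecast = set(forecast_fields)
    return [name for name in observation_fields if name in forecast]


def to_feature_vector(record, feature_names):
    """Build a numeric feature vector from one weather record (a dict)."""
    return [float(record[name]) for name in feature_names]


# Hypothetical field lists: observations carry more detail than forecasts.
observation_fields = ["temp", "windSpeed", "pressure", "visibility", "dewPoint"]
forecast_fields = ["windSpeed", "temp", "precipChance"]

features = common_features(observation_fields, forecast_fields)
print(features)
print(to_feature_vector({"temp": 12, "windSpeed": 30, "pressure": 1012}, features))
```

Keeping the training features inside this intersection means the model never depends on a field that will be missing when it is asked to predict from a forecast.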
  • 27. Load labeled data RDD
  • 28. Load labeled data RDD
  • 29. Build: NaiveBayes Classification
  • 30. Build: Decision Tree classification
  • 31. Build: Random Forest classification
  • 32. Build: Performance measurements. Load blind data
  • 33. Build: Compare metrics between different models
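The accuracy, precision and recall figures compared on slides 32 and 33 can be computed from true labels and predictions as sketched below. This plain-Python version is only an illustration of the definitions; the function name is hypothetical and the deck's notebook presumably relied on Spark's own evaluation utilities.

```python
from collections import Counter


def classification_metrics(labels, predictions):
    """Return overall accuracy and per-class (precision, recall).

    precision(c) = correct predictions of c / all predictions of c
    recall(c)    = correct predictions of c / all true instances of c
    A class with no predictions (or no true instances) gets 0.0.
    """
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and aligned")
    predicted = Counter(predictions)
    actual = Counter(labels)
    correct = Counter(y for y, p in zip(labels, predictions) if y == p)
    accuracy = sum(correct.values()) / len(labels)
    per_class = {}
    for cls in sorted(set(labels) | set(predictions)):
        precision = correct[cls] / predicted[cls] if predicted[cls] else 0.0
        recall = correct[cls] / actual[cls] if actual[cls] else 0.0
        per_class[cls] = (precision, recall)
    return accuracy, per_class


# Made-up labels using the delay classes 1 to 3.
accuracy, per_class = classification_metrics([1, 1, 2, 2, 3], [1, 2, 2, 2, 1])
print(accuracy, per_class)
```

Looking at precision and recall per class is useful here because delay classes are usually imbalanced: a model that always predicts "On Time" can score a high accuracy while having zero recall on every delay class.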
  • 34. Naïve Bayes vs Decision Tree. Naïve Bayes: ‣ Probabilistic: computes the probability of a data instance being in a specific class ‣ Assumes that each feature (variable) is independent of the others ‣ Performance depends on the predictive nature of the features (non-predictive features will affect the accuracy) ‣ Works well with a small amount of training data; doesn’t need all the possibilities ‣ Doesn’t work with categorical features. Decision Tree: ‣ Non-probabilistic: partitions the data into subsets that best describe the variable ‣ The deeper the tree, the better the model fits the data ‣ Watch out for overfitting: need to prune the tree ‣ Can handle categorical or continuous features ‣ No need for input to be scaled or standardized: set your features and go! ‣ Requires a lot of data covering all possibilities
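To make the "probabilistic" and "independent features" points on slide 34 concrete, here is a tiny Gaussian Naïve Bayes classifier in plain Python. It is a teaching sketch, not the deck's code: the Gaussian variant is an assumption chosen because the toy features are continuous, and it may differ from the Naïve Bayes variant the notebook used through Spark. All names and data values are made up.

```python
import math
from collections import defaultdict


def train_gaussian_nb(rows, labels, var_floor=1e-9):
    """Estimate a log prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # var_floor keeps the variance strictly positive for constant features.
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(rows)), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Pick the class with the highest log posterior.

    The independence assumption shows up as a plain sum of per-feature
    log likelihoods: no interaction between features is modeled.
    """
    def log_posterior(params):
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
    return max(model, key=lambda label: log_posterior(model[label]))


# Toy data: [temperature, wind speed] with class 1 = on time, 2 = delayed.
rows = [[30, 5], [32, 6], [28, 4], [-5, 40], [-3, 45], [-4, 42]]
labels = [1, 1, 1, 2, 2, 2]
model = train_gaussian_nb(rows, labels)
print(predict_gaussian_nb(model, [29, 5]), predict_gaussian_nb(model, [-4, 41]))
```

Because each class is summarized by a handful of means and variances, the model can be fit from very few rows, which matches the slide's remark that Naïve Bayes works with a small amount of training data.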
  • 37. If you want to know more ‣ https://developer.ibm.com/clouddataservices/ ‣ https://github.com/ibm-cds-labs/pipes-connector-flightstats ‣ http://spark.apache.org/docs/latest/mllib-guide.html ‣ https://console.ng.bluemix.net/data/analytics/