How to Apply Machine Learning with R, H2O, Apache Spark MLlib or PMML to Real Time Streaming Analytics
1. HOW TO APPLY BIG DATA ANALYTICS
AND MACHINE LEARNING TO
REAL TIME PROCESSING
Kai Wähner
kwaehner@tibco.com
@KaiWaehner
www.kai-waehner.de
LinkedIn / Xing Please connect!
3. Apply Big Data Analytics to Real Time Processing
© Copyright 2000-2016 TIBCO Software Inc.
4. Analyse and Act on Critical Business Moments
5. Key Take-Aways
Insights are hidden in Historical Data on Big Data Platforms
Machine Learning and Big Data Analytics find these Insights by building Analytics Models
Event Processing uses these Models (without Rebuilding) to take Action in Real Time
6. Agenda
1) Machine Learning and Big Data Analytics
2) Analysis of Historical Data
3) Real Time Processing
4) Live Demo
8. Machine Learning
Machine learning is a method of data analysis that automates analytical model building.
Using algorithms that iteratively learn from data, machine learning allows computers to
find hidden insights without being explicitly programmed where to look.
http://www.sas.com
9. 10 Examples of Machine Learning
• Spam Detection
• Credit Card Fraud Detection
• Digit Recognition
• Speech Understanding
• Face Detection
• Shape Detection
• Product Recommendation
• Medical Diagnosis
• Stock Trading
• Customer Segmentation
http://machinelearningmastery.com/practical-machine-learning-problems/
10. 10 Examples of Machine Learning
• Spam Detection: Given email in an inbox, identify those email messages that are spam and those that
are not. Having a model of this problem would allow a program to leave non-spam emails in the inbox
and move spam emails to a spam folder. We should all be familiar with this example.
• Credit Card Fraud Detection: Given credit card transactions for a customer in a month, identify those
transactions that were made by the customer and those that were not. A program with a model of this
decision could refund those transactions that were fraudulent.
• Digit Recognition: Given zip codes handwritten on envelopes, identify the digit for each handwritten
character. A model of this problem would allow a computer program to read and understand handwritten
zip codes and sort envelopes by geographic region.
• Speech Understanding: Given an utterance from a user, identify the specific request made by the user.
A model of this problem would allow a program to understand and make an attempt to fulfil that request.
The iPhone with Siri has this capability.
• Face Detection: Given a digital photo album of many hundreds of digital photographs, identify those
photos that include a given person. A model of this decision process would allow a program to organize
photos by person. Some cameras and software like iPhoto have this capability.
http://machinelearningmastery.com/practical-machine-learning-problems/
11. 10 Examples of Machine Learning
• Product Recommendation: Given a purchase history for a customer and a large inventory of products,
identify those products in which that customer will be interested and likely to purchase. A model of this
decision process would allow a program to make recommendations to a customer and motivate product
purchases. Amazon has this capability. Also think of Facebook and GooglePlus, which
recommend users for you to connect with after you sign up.
• Medical Diagnosis: Given the symptoms exhibited in a patient and a database of anonymized patient
records, predict whether the patient is likely to have an illness. A model of this decision problem could be
used by a program to provide decision support to medical professionals.
• Stock Trading: Given the current and past price movements for a stock, determine whether the stock
should be bought, held or sold. A model of this decision problem could provide decision support to
financial analysts.
• Customer Segmentation: Given the pattern of behaviour by a user during a trial period and the past
behaviours of all users, identify those users that will convert to the paid version of the product and those
that will not. A model of this decision problem would allow a program to trigger customer interventions to
persuade the customer to convert early or to better engage in the trial.
• Shape Detection: Given a user hand drawing a shape on a touch screen and a database of known
shapes, determine which shape the user was trying to draw. A model of this decision would allow a
program to show the platonic version of that shape the user drew to make crisp diagrams. The Instaviz
iPhone app does this.
http://machinelearningmastery.com/practical-machine-learning-problems/
12. Types of Machine Learning Problems
• Classification: Data is labelled, meaning it is assigned a class, for example
spam / non-spam or fraud / non-fraud.
• Regression: Data is labelled with a real value (think floating point) rather
than a label. Examples that are easy to understand are time series data like
the price of a stock over time.
• Clustering: Data is not labelled, but can be divided into groups based on
similarity and other measures of natural structure in the data. An example
would be organising pictures by faces without names.
• Rule Extraction: Data is used as the basis for the extraction of
propositional rules (antecedent/consequent, aka if-then). An example is the
discovery of the relationship between the purchase of beer and diapers.
http://machinelearningmastery.com/practical-machine-learning-problems/
(not a complete list!)
13. Closed Loop for Big Data Analytics
ANALYZE: Analyze data via Data Discovery; uncover patterns, trends, correlations
MODEL: Develop model; deploy into Stream Processing flow
ACT: Automatically monitor real-time transactions; automatically trigger action
14. Analytics Maturity Model
A good Big Data Analytics platform can provide value to the organization across the full spectrum of use cases
[Chart: Value to the Organization (Immediate to Long-Term Competitive Advantage) against Analytics Maturity (Measure, Diagnose, Predict, Optimize, Operationalize, Automate); areas shown: Self-service Dashboards, Event Processing, Predictive and Prescriptive Analytics]
17. Agenda
1) Machine Learning and Big Data Analytics
2) Analysis of Historical Data
3) Real Time Processing
4) Live Demo
19. Analytics Maturity Model
[Chart repeated from slide 14: Value to the Organization against Analytics Maturity]
25. Data Munging - Transformations
Raw transactions (one row per purchase):
     cust_id  dept  sku    dollar  gift   date
1    104      C     12003   2.40   FALSE  2016-10-17
2    105      A     12005  62.85   FALSE  2016-10-17
3    102      C     12007  69.23   TRUE   2016-10-17
4    104      B     12004   9.33   FALSE  2016-10-18
5    105      C     12010  14.16   TRUE   2016-10-18
6    101      B     12003  90.43   FALSE  2016-10-19
7    103      C     12005  90.97   FALSE  2016-10-19
n    …        …     …       …      …      …
Aggregated per customer (spend per department A, B, C):
     cust_id  A      B      C      total   # orders  first_date  last_date
1    100      21.76  23.67   0.00   45.43  2         2016-10-19  2016-10-20
2    101       0.01  74.65   0.00   74.66  3         2016-10-19  2016-10-20
3    102       0.00  60.92  50.29  111.21  6         2016-10-17  2016-10-20
4    103       0.00   0.00  52.30   52.30  2         2016-10-19  2016-10-20
5    104      31.34   9.33   2.40   43.06  4         2016-10-17  2016-10-20
6    105      62.85   0.00  56.00  118.85  3         2016-10-17  2016-10-20
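The transformation on this slide, from one row per purchase to one row per customer, can be sketched in plain Python. This is only an illustration under assumptions: the function name `summarize_by_customer` is hypothetical, the talk does not say which tool performed the step, and the input is limited to the seven sample rows visible on the slide, so the results will not match the slide's aggregated table.

```python
from collections import defaultdict

# The seven sample transactions visible on the slide:
# (cust_id, dept, sku, dollar, gift, date)
transactions = [
    (104, "C", 12003, 2.40, False, "2016-10-17"),
    (105, "A", 12005, 62.85, False, "2016-10-17"),
    (102, "C", 12007, 69.23, True, "2016-10-17"),
    (104, "B", 12004, 9.33, False, "2016-10-18"),
    (105, "C", 12010, 14.16, True, "2016-10-18"),
    (101, "B", 12003, 90.43, False, "2016-10-19"),
    (103, "C", 12005, 90.97, False, "2016-10-19"),
]


def summarize_by_customer(rows):
    """Pivot per-purchase rows into one summary record per customer."""
    summary = defaultdict(lambda: {
        "A": 0.0, "B": 0.0, "C": 0.0,
        "orders": 0, "first_date": None, "last_date": None,
    })
    for cust_id, dept, _sku, dollar, _gift, date in rows:
        rec = summary[cust_id]
        rec[dept] += dollar          # spend per department column
        rec["orders"] += 1
        # ISO-formatted dates compare correctly as plain strings
        if rec["first_date"] is None or date < rec["first_date"]:
            rec["first_date"] = date
        if rec["last_date"] is None or date > rec["last_date"]:
            rec["last_date"] = date
    for rec in summary.values():
        rec["total"] = round(rec["A"] + rec["B"] + rec["C"], 2)
    return dict(summary)


if __name__ == "__main__":
    for cust_id, rec in sorted(summarize_by_customer(transactions).items()):
        print(cust_id, rec)
```

Each output record has the same columns as the aggregated table: spend per department, total, number of orders, and first and last purchase date.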
28. Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to
1. maximize insight into a data set
2. uncover underlying structure
3. extract important variables
4. detect outliers and anomalies
5. test underlying assumptions
6. develop parsimonious models
7. determine optimal factor settings
29. Exploratory Data Analysis
“The greatest value of a picture is when it forces us to notice what we never expected to see”
John W. Tukey, 1977
31. Analytics Maturity Model
[Chart repeated from slide 14: Value to the Organization against Analytics Maturity]
34. Which picture represents a model?
A model is a simplification of the truth that helps you with decision making.
35. Model Building
Supervised Models – known, labeled responses
• Regression (for example Linear Regression)
• Categorical (for example Random Forest)
Unsupervised Models – no labeled responses
• Clustering (for example k-means clustering)
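The difference between the two model families above can be sketched with two very small pure-Python routines. This is a toy illustration, not the tooling used in the talk: the function names `fit_linear_regression` and `kmeans_1d` are hypothetical, the data is made up, and a real project would use R, Spark MLlib, H2O or a similar library instead.

```python
from statistics import mean


def fit_linear_regression(xs, ys):
    """Ordinary least squares for one predictor: y ~ intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on one-dimensional data, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign each value to its nearest centroid
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty)
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids


# Supervised: a known, labeled response y exists for every x (made-up data)
intercept, slope = fit_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])

# Unsupervised: no labeled responses, only the values themselves (made-up data)
centers = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.1], centroids=[0.0, 5.0])
```

The regression needs the responses `ys` to fit its parameters, whereas the clustering routine only ever sees the raw values and groups them by similarity.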
37. Model Building
Employees who write longer emails earn higher salaries!
41. Model Validation
How is the IQ of a kid related to the IQ of his / her mum?
43. Data Scientists work with many Tools
• SQL
• Excel
• Python
• R
Source: O’Reilly 2015 Data Science Salary Survey
http://duu86o6n09pv.cloudfront.net/reports/2015-data-science-salary-survey.pdf
44. Alternatives for Data Scientists
[Matrix of tools: Open Source / Closed Source against Tooling / Source Code (not a complete list); R is among the entries]
45. R Language
R is well known as the most popular programming language used by data
scientists for modeling, and its popularity is still growing. It is
developing very rapidly and has a very active community.
46. R with Revolution Analytics (now Microsoft)
Open Source GPL License (including its restrictions)
http://www.revolutionanalytics.com/webinars/introducing-revolution-r-open-enhanced-open-source-r-distribution-revolution-analytics
47. TERR - TIBCO’s Enterprise Runtime for R
• TIBCO has rewritten R as a Commercial Compute Engine
• Latest statistics scripting engine: S → S-PLUS® → R → TERR
• Runs R code including CRAN packages
• Engine internals rebuilt from scratch at low-level
• Redesigned data objects, memory management
• High performance + Big Data
• TERR is licensed from TIBCO
• TERR installs (free) with Spotfire Analyst / Desktop + other TIBCO products
• Spotfire Server can manage all TERR / R scripts, artifacts for reuse
• Standalone Developer Edition
• Supported by TIBCO
• No GPL license issues
48. Which R to use?
http://www.forbes.com/sites/danwoods/2016/01/27/microsofts-revolution-analytics-acquisition-is-the-wrong-way-to-embrace-r/
49. Apache Spark
General Data-processing Framework
However, focus is especially on Analytics (at least these days)
http://fortune.com/2016/09/09/cloudera-spark-mapreduce/
50. Spark MLlib
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy.
It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.
You can even combine the MLlib module with the R language.
52. Apache Spark – Focus on Analytics
http://aptuz.com/blog/is-apache-spark-going-to-replace-hadoop/
http://fortune.com/2016/09/09/cloudera-spark-mapreduce/
http://www.ebaytechblog.com/2016/05/28/using-spark-to-ignite-data-analytics/
http://www.forbes.com/sites/paulmiller/2016/06/15/ibm-backs-apache-spark-for-big-data-analytics/
“[IBM’s initiatives] include:
• deepening the integration between Apache Spark and
existing IBM products like the Watson Health Cloud;
• open sourcing IBM’s existing SystemML machine
learning technology;
53. H2O
An Extensible Open Source Platform for Analytics
• Best of Breed Open Source Technology
• Easy-to-use WebUI and Familiar Interfaces
• Data Agnostic Support for all Common Database
and File Types
• Massively Scalable Big Data Analysis
• Real-time Data Scoring (Nanofast Scoring Engine)
http://www.h2o.ai/
54. TIBCO Spotfire for Visual Data Discovery
Let the business user leverage historical data to find insights!
55. TIBCO Spotfire with R / TERR Integration
Let the business user leverage Analytic Models (created by the Data Scientist)!
Example: Customer Churn with Random Forest Algorithm
• Behind the ‘refresh model’ button lives a ‘random forest algorithm’
• Requires no a priori assumptions at all; it just always works
• The business user doesn’t need to know what random forest is to be empowered by it
Select variables for the model
56. SaaS Machine Learning
• Managed SaaS service for building ML models and generating predictions
• Integrated into the corresponding cloud ecosystem
• Easy to use, but limited feature set and potential latency issues if combined
with external data or applications
http://docs.aws.amazon.com/machine-learning/latest/dg/tutorial.html
57. PMML (Predictive Model Markup Language)
• XML-based de facto standard to represent predictive analytic models
• Developed by the Data Mining Group (DMG)
• Easily share models between PMML compliant applications
(e.g. between model creation and deployment for operations)
http://www.ibm.com/developerworks/library/ba-ind-PMML1/
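To make the idea of sharing a model between creation and deployment concrete, here is a minimal sketch that reads a tiny PMML regression model and scores one record with it. Everything specific is an assumption for illustration: the PMML document is hand-written rather than exported from a real tool, the field names and coefficients are invented, and the helpers `load_regression` and `score` are hypothetical. A production deployment would rely on a PMML-compliant scoring engine, which supports far more than the plain numeric predictors handled here.

```python
import xml.etree.ElementTree as ET

# Hand-written, minimal PMML document for illustration only.
# Field names and coefficients are invented, not exported from a real tool.
PMML_DOC = """
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="Illustrative linear model"/>
  <DataDictionary numberOfFields="3">
    <DataField name="motor_temperature" optype="continuous" dataType="double"/>
    <DataField name="motor_vibration" optype="continuous" dataType="double"/>
    <DataField name="risk_score" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="motor_temperature"/>
      <MiningField name="motor_vibration"/>
      <MiningField name="risk_score" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="0.5">
      <NumericPredictor name="motor_temperature" coefficient="0.01"/>
      <NumericPredictor name="motor_vibration" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

NS = {"pmml": "http://www.dmg.org/PMML-4_2"}


def load_regression(pmml_text):
    """Extract intercept and coefficients from the first RegressionTable."""
    root = ET.fromstring(pmml_text.strip())
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("pmml:NumericPredictor", NS)
    }
    return intercept, coefficients


def score(model, record):
    """Apply the linear model to one record (a dict of field name to value)."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


model = load_regression(PMML_DOC)
prediction = score(model, {"motor_temperature": 80.0, "motor_vibration": 0.25})
```

The scoring side never sees the training code or the training data; it only needs the XML file, which is the point of exchanging models in a standard format.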
58. Agenda
1) Machine Learning and Big Data Analytics
2) Analysis of Historical Data
3) Real Time Processing
4) Live Demo
59. Analytics Maturity Model
[Chart repeated from slide 14: Value to the Organization against Analytics Maturity]
60. Streaming Analytics
[Timeline: event stream of events 1 2 3 4 5 6 7 8 9 arriving over time]
• Continuous Queries
• Sliding Windows
• Filter
• Aggregation
• Correlation
• …
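Three of the operators listed above (filter, sliding window, aggregation) can be combined in a few lines of Python. This is a conceptual sketch only: the generator `sliding_window_average` is a hypothetical name, it runs over an in-memory list rather than a real event stream, and it treats the event numbers 1 to 9 from the timeline as numeric values purely for illustration.

```python
from collections import deque


def sliding_window_average(events, window_size, min_value=None):
    """Yield the mean of the last `window_size` events that pass the filter."""
    window = deque(maxlen=window_size)
    for value in events:
        if min_value is not None and value < min_value:
            continue  # Filter: drop events below the threshold
        window.append(value)  # Sliding window: the oldest event drops out automatically
        yield sum(window) / len(window)  # Aggregation over the current window


# Continuous query over the nine events from the timeline, window of three events
averages = list(sliding_window_average([1, 2, 3, 4, 5, 6, 7, 8, 9], window_size=3))
```

Because the function is a generator, each result is produced as soon as its event arrives, which mirrors how a continuous query emits output without waiting for the stream to end.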
61. Operational Intelligence in Action
Actions by Operations
Human decisions in real time informed by up-to-date information
The Challenge: Empower operations staff to see and seize key business moments
Machine-to-Machine Automation
Automated action based on models of history combined with live context and business rules
The Challenge: Create, understand, and deploy algorithms & rules that automate key business reactions
63. Alternatives for Stream Processing
[Matrix of tools: OPEN SOURCE / CLOSED SOURCE against PRODUCT / FRAMEWORK (not a complete list!); Microsoft Azure Stream Analytics is among the entries]
64. What Streaming Alternative do you need?
Visual IDE (Dev, Test, Debug)
Simulation (Feed Testing, Test Generation)
Live UI (monitoring, proactive interaction)
Maturity (24/7 support, consulting)
Integration (out-of-the-box: ESB, MDM, etc.)
Library (Java, .NET, Python)
Query Language (often similar to SQL)
Scalability (horizontal and vertical, fail over)
Connectivity (technologies, markets, products)
Operators (Filter, Sort, Aggregate)
[Diagram: Time to Market, from Slow to Fast, across Streaming Concepts, Streaming Frameworks and Streaming Products]
65. Comparison of Stream Processing Frameworks and Products
Slide Deck from JavaOne 2016:
http://www.kai-waehner.de/blog/2016/10/25/comparison-of-stream-processing-frameworks-and-products/
66. StreamBase: The Power of Visual Programming
1) Get ideas into market in days or weeks, not months or years
2) Unlock the power of IT and data scientists working together
68. How to apply analytic models to real time processing without rebuilding them?
69. Streaming Analytics to operationalize insights and patterns in real time without rebuilding the models
[Diagram: Stream Processing surrounded by model sources: H2O, Open Source R, TERR, Spark MLlib, MATLAB, SAS, PMML]
Real Time Closed Loop: Understand – Anticipate – Act
74. Agenda
1) Machine Learning and Big Data Analytics
2) Analysis of Historical Data
3) Real Time Processing
4) Live Demo
75. “An outage on one well can cost $10M per hour. We have 20-100 outages per year.”
- Drilling operations VP, major oil company
77. Live Surveillance of Equipment
Data Monitoring
• Motor temperature
• Motor vibration
• Current
• Intake pressure
• Intake temperature
• Flow
Pump Components
• Electrical power cable
• Pump
• Intake
• Protector
• ESP motor
• Pump monitoring unit
79. Predictive Maintenance
[Architecture diagram. Components shown: data sources (sensor data, transactions, message bus, machine data, social data); Streaming Analytics (stream processing, aggregate, rules, analytics, correlate, continuous query processing); Operational Analytics (operations, Live UI, live monitoring, alerts, manual action, escalation); Action; Historical Analysis (data scientists, BI, data discovery, analytics, data sheets, cleansed data, history); data storage (Spark, Big Data); Integration Bus (Enterprise Service Bus, API, Event Server); internal data (ERP, MDM, DB, WMS, SOA)]
Machine Data (Sensors, Weather Data, …)
ERP System (Transaction History, Production Volume)
Find Insights (Sensor Behaviour, Hardware Issues, …)
Take Action (Stop Machine, Send Mechanic, …)
80. Complete Big Data Architecture
[Architecture diagram with the same components as slide 79: data sources, Streaming Analytics, Operational Analytics, Action, Historical Analysis, data storage (Spark, Big Data), Integration Bus and internal data]
83. Real Time Analytics
Trend Analysis
Combination of Rules
CUSUM Analysis
Statistical Analysis
Statistical Process Control
Machine Learning
• Location Change
– Variable moves up or down
• Slope Change
– Variable changes trend
• Variance Change
– Variable becomes more/less volatile
• Process Threshold
– Shewhart control chart
• Failure Model
y (0/1) = f (X, b) + e; f = logistic regression, trees, svm, nnet, ...
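Two of the checks named on this slide, the Shewhart control chart and CUSUM analysis, are simple enough to sketch in plain Python. This is an illustration under assumptions, not the implementation used in the demo: the function names, the tabular form of CUSUM with a slack value and a restart after each alarm, and the sensor readings and parameters are all made up for the example.

```python
def shewhart_alarms(values, mean, sigma, k=3.0):
    """Process threshold: indices where a value leaves the mean +/- k*sigma limits."""
    return [i for i, v in enumerate(values) if abs(v - mean) > k * sigma]


def cusum_alarms(values, target, slack, threshold):
    """Location change: two-sided tabular CUSUM, restarted after each alarm."""
    high = low = 0.0
    alarms = []
    for i, v in enumerate(values):
        # Accumulate deviations beyond the slack, never dropping below zero
        high = max(0.0, high + (v - target - slack))
        low = max(0.0, low + (target - v - slack))
        if high > threshold or low > threshold:
            alarms.append(i)
            high = low = 0.0
    return alarms


# Made-up readings: stable around 50, then a small sustained upward shift
readings = [50.1, 49.8, 50.3, 50.0, 51.2, 51.4, 51.1, 51.5, 51.3, 51.6]

threshold_hits = shewhart_alarms(readings, mean=50.0, sigma=0.6)
shift_hits = cusum_alarms(readings, target=50.0, slack=0.5, threshold=2.0)
```

With these made-up numbers no single reading crosses the 3-sigma limits, yet the CUSUM raises alarms because the small deviations keep accumulating. That is why a location change is usually tracked with a cumulative statistic rather than a per-point threshold.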
84. Put model into Action
Upon event trigger, populate Spotfire RCA template; email responsible engineer
85. StreamBase – from Big Data to Fast Data
1. Rules / models pushed from Spotfire
2. Data streams into StreamBase
3. Data evaluated in real-time
4. Spotfire RCA on trigger
Other notifications available
Live view on streaming data
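The first three steps above (a model is pushed in, data streams in, each event is evaluated) can be sketched as a small scoring loop. This is a hedged illustration in Python, not StreamBase or Spotfire code: the coefficient values, the field names, the `pump_id` key and the functions `failure_probability` and `evaluate_stream` are all hypothetical, and a logistic failure model is assumed as one of the options named on the Real Time Analytics slide.

```python
import math

# Hypothetical coefficients of a logistic failure model, assumed to have been
# fitted offline on historical data and then pushed to the stream processor.
MODEL = {"intercept": -10.0, "motor_temperature": 0.08, "motor_vibration": 3.0}


def failure_probability(model, event):
    """Logistic model: probability of failure for one sensor event."""
    z = model["intercept"] + sum(
        coef * event[name] for name, coef in model.items() if name != "intercept"
    )
    return 1.0 / (1.0 + math.exp(-z))


def evaluate_stream(events, model, threshold=0.5):
    """Score each incoming event and yield an alert for those above the threshold."""
    for event in events:
        p = failure_probability(model, event)
        if p > threshold:
            yield {"pump_id": event["pump_id"], "probability": p}


# Made-up sensor events standing in for the live data stream
events = [
    {"pump_id": "P1", "motor_temperature": 70.0, "motor_vibration": 0.5},
    {"pump_id": "P2", "motor_temperature": 95.0, "motor_vibration": 1.5},
]
alerts = list(evaluate_stream(events, MODEL))
```

The model is only applied here, never refitted. Swapping in new coefficients changes the behaviour of the loop without touching the code, which corresponds to pushing an updated model into the running flow.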
88. Compare Live Data with Historical Data to make Human Decision
Responsible engineer clicks URL to launch Spotfire Root Cause Analysis; diagnose issue
90. Key Take-Aways
Insights are hidden in Historical Data on Big Data Platforms
Machine Learning and Big Data Analytics find these Insights by building Analytics Models
Event Processing uses these Models (without Rebuilding) to take Action in Real Time
91. Questions? Please contact me!
Kai Wähner
kwaehner@tibco.com
@KaiWaehner
www.kai-waehner.de
LinkedIn / Xing Please connect!