Barga Galvanize Sept 2015

Roger S. Barga, Ph.D.
General Manager
Amazon Web Services
Driving Business Value with
Data Science

Fielded Solutions
• Customer Segmentation & Targeting
• Which category item will customer buy next?
• Azure ML Reference Customer
• Predictive Analytics to reduce school
dropout rate
• Predictive model identifies which students
are at risk of dropping out in K-12
• Predictive Model shows when JLL can charge
above or below the market for a specific deal
• Predictive maintenance & Internet of Things
• Built model to predict causes of elevator failure
• Reference Customer for Azure ML and ISS
Reference Customers

Predictive Maintenance at ThyssenKrupp
ThyssenKrupp partnered with Microsoft to build
a new predictive maintenance solution to
improve service margins for its elevator business
• Great Internet of Things example
• Used ISS and Azure Machine Learning
• ML model predicts top causes of failure in an
elevator – 5M elevators in production, $400
cost savings annually.
Key Benefits
• Ease of use across skillsets
• Ease of deployment
• Increased productivity
“Now we have the ability to use live data to define the needed repair
before a breakdown happens, reducing costs for ourselves and our
customers.” - Dr. Rory Smith, ThyssenKrupp

Problem:
To leverage the history of a person’s behavior on
Microsoft.com to identify their interests and
predict future actions
Findings:
• Opportunity to provide upsell after users hit on
Microsoft online products such as Bing,
SkyDrive, Xbox Live, Zune
• Target messaging on Windows Phone extends
the functionality of Microsoft products
Methodology:
• Big Data Platform – HDP for Windows/Azure
HDInsight and Advanced Analytics support
• Develop statistical models to determine the
probability of users buying a Surface Device
Customer Targeting With Machine Learning

Problem: Detect suspicious activity on the network servers early
and eliminate the threat.
Methodology:
• File system to store massive security data.
• Fully automated workflow to drive end-to-end
data receiving and transformation process.
• Analysis and visualizations of Windows Events to
identify pre-defined threat scenarios.
• Move from descriptive analytics to a mature
predictive archetype.
Preventing Network Intrusion with Machine Learning
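
One way to read "pre-defined threat scenario" is as a rule evaluated over the event stream. The sketch below is illustrative only: the scenario (a burst of failed logons for one account), the thresholds, and the event tuple layout are assumptions, not details from the talk. Event ID 4625 is the Windows Security log ID for a failed logon.

```python
from collections import defaultdict, deque

FAILED_LOGON = 4625  # Windows Security event ID for a failed logon


def find_failed_logon_bursts(events, max_failures=5, window_seconds=60):
    """Return accounts with more than max_failures failed logons in a window.

    events: iterable of (timestamp_seconds, event_id, account) tuples,
    ordered by timestamp. This layout is a hypothetical simplification.
    """
    recent = defaultdict(deque)  # account -> timestamps of recent failures
    flagged = set()
    for ts, event_id, account in events:
        if event_id != FAILED_LOGON:
            continue
        window = recent[account]
        window.append(ts)
        # Drop failures that have fallen out of the sliding window.
        while window and ts - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_failures:
            flagged.add(account)
    return flagged
```

A rule like this is descriptive; moving to a predictive approach would mean learning such patterns from labeled incidents rather than hand-coding them.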

• Create a demo to show the potential of predictive analytics
• Develop a demo to answer the question “What factors drive
our client to charge over or below market rates?”
• Create 2 predictive models to predict
• If our client can charge over the market average for Landlords, and
• Whether our client can charge below the market for Tenants
• Develop strategies to explain key factors that drive these outcomes
• Visualize results in Power BI.

Building Predictive Models
[Diagram: five numbered steps leading to Business Insights]
Note:
This is a variant of the Cross-Industry
Standard Process for Data Mining
(CRISP-DM)

Conceptual Solution
• Source Data #1
• Source Data #2
• Data pre-processing on Hadoop (Hive queries)
• Data preparation and predictive models with Machine Learning
• Visualization in Power BI

How to Use the Predictive Model
[Diagram: data on a new deal is fed to the predictive model, which returns a result to the broker]
1 = You can charge above the market average
0 = You can charge below market

Data Preparation
• Source data: 1 internal and 1
external data source
• Internal data source prepared on
Hadoop cluster
• Both datasets joined in our internal
Machine Learning tool
• New column created to determine
when our client charges above or
below market average
Data Source #1 Data Source #2
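
The join and the new label column described above can be sketched in a few lines. This is a minimal illustration, not the tool used in the project: every field name (`market_id`, `deal_rate`, `market_avg_rate`, `above_market`) is a hypothetical placeholder, and the sample rows are made up.

```python
def join_and_label(internal_rows, external_rows, key="market_id"):
    """Inner-join two lists of dicts on `key` and add a binary label.

    All field names here are hypothetical placeholders.
    """
    market_by_key = {row[key]: row for row in external_rows}
    joined = []
    for deal in internal_rows:
        market = market_by_key.get(deal[key])
        if market is None:
            continue  # inner join: skip deals with no market benchmark
        row = {**deal, **market}
        # New column: 1 if the deal was priced above the market average, else 0.
        row["above_market"] = int(row["deal_rate"] > row["market_avg_rate"])
        joined.append(row)
    return joined


# Made-up rows standing in for the internal and external sources.
deals = [
    {"deal_id": 1, "market_id": "A", "deal_rate": 32.0},
    {"deal_id": 2, "market_id": "A", "deal_rate": 27.5},
    {"deal_id": 3, "market_id": "Z", "deal_rate": 40.0},  # no benchmark
]
benchmarks = [{"market_id": "A", "market_avg_rate": 30.0}]
labeled = join_and_label(deals, benchmarks)
```

The resulting `above_market` column is the target that a classifier would be trained to predict.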

Predictive Model
• Tested several algorithms
including Logistic Regression,
Boosted Decision Trees, etc.
• Models were trained with 10-fold
cross validation.
• Boosted Decision Trees was the
best algorithm – see ROC curve
• Area under curve for Boosted
Decision Trees was 92.4%!
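
For readers unfamiliar with the evaluation described above, here is a rough sketch of k-fold cross-validation scored by area under the ROC curve, using only the Python standard library. It does not implement Logistic Regression or Boosted Decision Trees; `fit` is a hypothetical callback that stands in for whichever learner is being compared, and the stratified fold assignment is an assumption (the talk does not say how folds were built).

```python
import random


def auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(y, k=10, seed=0):
    """Split indices into k folds, dealing each class out round-robin."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        members = [i for i, label in enumerate(y) if label == cls]
        rng.shuffle(members)
        for position, i in enumerate(members):
            folds[position % k].append(i)
    return folds


def cross_validated_auc(X, y, fit, k=10, seed=0):
    """Mean held-out AUC; fit(X_train, y_train) returns a scoring function."""
    folds = stratified_folds(y, k, seed)
    fold_aucs = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        score = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        fold_aucs.append(auc([y[j] for j in test_idx],
                             [score(X[j]) for j in test_idx]))
    return sum(fold_aucs) / len(fold_aucs)
```

Running `cross_validated_auc` once per candidate learner gives comparable numbers. Each class needs at least `k` examples so that every fold contains both classes.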

Predictive Model for Landlords - Results
• Boosted Decision Trees - Area under
the curve = 92.4%!
• Logistic Regression - Area under the
curve = 81.2%!

Industry Overview: Financial Services
Data Science applied to the Financial Services sector enables insights into:
“The opportunity for the Financial sectors are to unlock
the potential in their data through analytics and shape
the strategy for business through reliable factual insight
rather than intuition…” - Deloitte, 2013
Fraud & Financial Crimes
• Enterprise fraud and financial crimes
• Fraud Detection
• Credit Risk Management
Analytics
• Actuarial analysis, portfolio management and rate making
• Forecasting and econometrics
• Predictive analytics and data mining
• Mathematical optimization and simulations
Marketing & Customer Experience
• Social media analytics
• Customer Segmentation
• Customer Targeting
Customer Experience Enhancement
• Clickstream analysis
• Customer lifecycle management
• Dynamic profiling and enhanced customer
segmentation
Banks, Insurance,
Real Estate

Industry Overview: Healthcare
Providers,
Payers,
Pharmaceuticals
& Biotechnology
Data Science applied to the Healthcare sector enables insights into:
“Predictive analytics addresses today's pressing challenges in
healthcare effectiveness and economics by improving
operations across the spectrum of healthcare functions…”
- Predictive Analytics World Healthcare, 2014
Quality & Outcomes
• Readmissions Avoidance Analysis
• Health outcomes
• Patient safety
Consumer Analytics
• Customer acquisition
• Health intervention
• Member & Population Health
• Value-based care and
payment models
• Membership portfolio
optimization
Risk & Incentives
• A holistic view of patient episodes
• Value-based care and payment models
Care Delivery
• Health care cost analytics
• Performance management
• Workforce planning
Cost Containment
• Fraud and improper payments
• Eligibility fraud
• Enterprise case management

Industry Overview: Oil & Gas
Oil & Gas Producers,
Oil Equipment,
Services &
Distribution,
Alternative Energy
Data Science applied to the Oil and Gas sector enables insights into:
Oil Field Analytics
• Seismic analyses
• Reservoir characterization
• Drilling optimization
• Unconventional completions
• Production forecasting
Assets & Operations
• Facility integrity
• Demand forecasting
• Integrated operations and
logistics
• Operational risk/environment,
health and safety (EH&S)
Data Management
• Complex Event Processing
• Data Quality
• Master Data Management
“Access to more information from multiple sources and
disciplines and more sophisticated analytics will improve the
oil and gas industry's ability to optimize production…
Analytics will provide a way to bring optimization from
statisticians to the business.” – IDC, 2013

How to be successful?
1. Create value
2. Capture some for yourself

How to create value (as a data scientist)
Extract insights from data for decision support

Productive Use of Time
Have a bias against writing learning algorithms
• Bias in favor of leveraging 3rd party implementations
• Add data: more information beats better algorithms
You will write data manipulation algorithms
• Data is surprising enough; you need certainty in your algorithms
• Defect count is proportional to line count
• Use as high-level a language as possible

Analysis and Diminishing Returns
First few models tend to capture most of the value

Distinguish between:
• Marginal improvements important (e.g., search, WalMart);
• Marginal improvements unimportant (typical).
In the latter case: get the first 80%, then move on to a new problem.

The Importance of Starting Small

When you first encounter a data set, you know nothing.
• Ergo: first piece of data is very informative.
• Think of data set utility as roughly logarithmic in size.
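
A small worked example of the "roughly logarithmic" claim, taking utility to be exactly log10 of the row count. That choice of formula is an assumption made purely for illustration; the slide only says "roughly".

```python
import math


def marginal_utility(n_before, n_after):
    """Utility gained by growing a data set, assuming utility = log10(size)."""
    return math.log10(n_after) - math.log10(n_before)


# Under this assumption every tenfold increase adds the same utility,
# so the first 1,000 rows are worth as much as the next 999,000.
first_thousand = marginal_utility(1, 1_000)        # 3 units
next_million = marginal_utility(1_000, 1_000_000)  # 3 units
```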

Don’t require a large data set before starting analysis.
Always try things out on small portions of data first.
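
One way to get a small portion of a large data set without loading it all is reservoir sampling, sketched below. The talk does not prescribe a sampling method; this is simply one common option, and the function name and parameters are illustrative.

```python
import random


def reservoir_sample(rows, k, seed=0):
    """Return k rows chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)  # fill the reservoir with the first k rows
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample
```

Because it makes a single pass and holds only `k` rows in memory, it can be pointed at an open file handle and will sample lines without reading the whole file into memory.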

Timescales and Failing Fast
1. Immediate zone: less than 60 seconds
• 100s per day
2. Bathroom break zone: less than 5 minutes
• 10s per day
3. Lunch zone: less than an hour
• 5 per day
4. Overnight zone: less than 12 hours
• 1 per day

Timescales and Failing Slow
1. Immediate zone: less than 60 seconds
• 100s per day
2. Bathroom break zone: less than 5 minutes
• 10s per day
3. Lunch zone: less than an hour
• 5 per day
4. Overnight zone: less than 12 hours
• 1 per day

Failing Fast: Summary
1. Move code to data, not the converse!
2. Do feature engineering with a fast learning algorithm
(e.g., linear), then switch to a slower algorithm for
the final product (e.g., GBDT, NN).
3. Subsample your data intelligently.
4. Fewer examples (rows), e.g., imbalanced classification.
5. Fewer features (columns), e.g., random projections.

Productivity demands debugging as fast as possible.
Stay in the immediate zone
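
As an illustration of subsampling rows for imbalanced classification, mentioned in the summary above, the sketch below keeps every positive example and a random fraction of the negatives, attaching a weight to each kept row. It assumes the positive class (label 1) is the rare one; the function name, `keep_rate`, and the weighting scheme are illustrative choices, not taken from the talk.

```python
import random


def downsample_negatives(rows, labels, keep_rate=0.1, seed=0):
    """Keep every positive and a random fraction of negatives.

    Returns (rows, labels, weights). Each kept negative gets weight
    1 / keep_rate so weighted class totals match the original data
    in expectation.
    """
    rng = random.Random(seed)
    out_rows, out_labels, weights = [], [], []
    for row, label in zip(rows, labels):
        if label == 1:
            weight = 1.0
        elif rng.random() < keep_rate:
            weight = 1.0 / keep_rate
        else:
            continue  # drop this negative
        out_rows.append(row)
        out_labels.append(label)
        weights.append(weight)
    return out_rows, out_labels, weights
```

Training on the smaller set keeps iteration times short, and passing the weights to a learner that supports them keeps predicted probabilities approximately calibrated to the original class balance.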

Proxy Metrics
Proxy Metric: Something you can measure and optimize
• Revenue per impression
• Clickthrough rate
• Reciprocal communication rate
• Polling results
• Gene expression levels
• Value at risk

Proxy Metrics Reality
Reality: Something you actually care about
• Revenue per impression → Economic Value Created
• Clickthrough rate → User Experience Quality
• Reciprocal communication rate → Match Quality
• Polling results → Election Outcome
• Gene expression levels → Drug Efficacy in Vivo
• Value at risk → Portfolio Quality

Agree on the OEC (Overall Evaluation Criterion)
A concrete goal begets concrete stopping conditions and
concrete acceptance criteria.
The less specific the goal, the likelier that the project will go
unbounded, because no result will be "good enough."
If you don't know what you want to achieve, you don't know
when to stop trying – or even what to try. When the project
eventually terminates – because either time or resources run
out – no one will be happy with the outcome…

Key Takeaways
Think about your data, not about your software.
Productivity is about not waiting for answers.
Mind the gap (between proxy metrics and reality).
Agree upon the OEC with business stakeholders.
Best Defense: close collaboration with a business expert.

You can make much stronger inferences about a woman named Brittany. That name was very
popular from the mid-1980s through the mid-1990s, but it wasn’t all that common before and
hasn’t been since. If you know a Brittany, she is probably of college age or just a bit older. Half
of living American Brittanys are between the ages of 19 and 25.

Blogs to Follow…
• FastML, covering practical applications of machine learning and data science
• Hilary Mason blog, from Bitly Chief Scientist, covering Data Science and Machine
Learning on Big Data.
• Hunch.net, by John Langford, a leading applied machine learning researcher; his
blog covers the intersection of theory and practice
• Kaggle blog No Free Hunch, covering Kaggle data science and machine learning
competitions
• KDnuggets, news, jobs, software, events, and more in Data Mining and Data Science
research and applications
• Normal Deviate by Larry Wasserman, CMU Prof. of Statistics and Machine Learning
• Statistical Modeling, Causal Inference, and Social Science by Andrew Gelman
• Three-Toed Sloth by Cosma Shalizi
• FiveThirtyEight Blog by Nate Silver, a very popular and non-technical blog covering
analytics applied mainly to politics and sports

Blogs to Follow…
• Data Mining Research blog by Sandro Saitta
• Data Mining: Text Mining, Visualization, and Social Media, by Matthew Hurst, a leading data
scientist at Microsoft
• DecisionStats, by Ajay Ohri, covering business analytics and R, with practical examples, and
interviews of field leaders
• Geeking with Greg, by Greg Linden, inventor of the Amazon recommendation engine and
internet entrepreneur
• IA Ventures blog, one of the leading Big Data venture capitalists Roger Ehrenberg and team
• Occam's Razor, by Avinash Kaushik, brilliant Digital Marketing Evangelist at Google
• R-bloggers, best blogs from the community of R, with code, examples, and visualizations
• Smart Data Collective, an aggregation of blogs from many interesting data science people
• Steve Miller blog, covering data science, statistics, R, and other topics at Information
Management.
• Tom H. C. Anderson blog, focusing on market research with data and text mining.
• What's the Big Data, by Gil Press. Gil covers the Big Data space and also writes a column on
Big Data and Business in Forbes.
