MapR is an ideal scalable platform for data science, and specifically for operationalizing machine learning in the enterprise. This presentation gives specific reasons why.
1.
Will this tire fail in the next 1,000 miles: Yes or no?
Which brings in more customers: a $5 coupon or a 25% discount?
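Questions like these are classification problems: the model predicts a discrete label ("yes"/"no") from numeric features. A minimal sketch, using made-up tire data and a simple nearest-neighbor rule standing in for a real trained model:

```python
# Minimal classification sketch: "Will this tire fail in the next 1,000 miles?"
# Features (tread depth in mm, miles driven) and labels are invented for
# illustration; a 1-nearest-neighbor rule stands in for a real trained model.
import math

train = [
    ((8.0, 5_000), "no"),
    ((6.5, 20_000), "no"),
    ((2.0, 45_000), "yes"),
    ((1.5, 60_000), "yes"),
]

def predict(features):
    """Return the label of the closest training example."""
    def dist(a, b):
        # scale miles down so both features contribute comparably
        return math.hypot(a[0] - b[0], (a[1] - b[1]) / 10_000)
    return min(train, key=lambda ex: dist(ex[0], features))[1]

print(predict((1.8, 50_000)))  # a worn, high-mileage tire -> "yes"
```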
2.
If you have a car with pressure gauges, you might want to know: Is this pressure gauge reading normal?
If you're monitoring the internet, you'd want to know: Is this message from the internet typical?
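These are anomaly-detection questions: is a new observation consistent with what we've seen before? One very simple sketch (made-up gauge readings, a 3-sigma rule standing in for a real anomaly detector):

```python
# Minimal anomaly-detection sketch: is this pressure gauge reading normal?
# Flag readings more than 3 standard deviations from the historical mean.
# The psi readings below are invented for illustration.
import statistics

history = [32.1, 31.8, 32.4, 32.0, 31.9, 32.2, 32.3, 31.7]

mean = statistics.mean(history)
stdev = statistics.stdev(history)

def is_normal(reading, k=3.0):
    return abs(reading - mean) <= k * stdev

print(is_normal(32.0))   # a typical reading
print(is_normal(45.0))   # far outside the normal range
```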
3.
What will the temperature be next Tuesday?
What will my fourth quarter sales be?
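These are regression (forecasting) questions: predict a continuous number rather than a label. A sketch with made-up quarterly sales, fitting a least-squares line and extrapolating one quarter ahead:

```python
# Minimal regression sketch: forecast next quarter's sales by fitting a
# least-squares line to past quarters. All numbers are invented.
quarters = [1, 2, 3]
sales = [100.0, 110.0, 121.0]

n = len(quarters)
mean_x = sum(quarters) / n
mean_y = sum(sales) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(quarters, sales)) \
        / sum((x - mean_x) ** 2 for x in quarters)
intercept = mean_y - slope * mean_x

forecast_q4 = slope * 4 + intercept
print(round(forecast_q4, 1))  # extrapolated Q4 sales
```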
4.
Which viewers like the same types of movies?
Which printer models fail the same way?
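These are clustering questions: group items by similarity without pre-existing labels. Real clustering (e.g. k-means over rating vectors) is more nuanced; this sketch simply buckets viewers who like the exact same set of genres, on invented data:

```python
# Minimal clustering sketch: group viewers who like the same movie genres.
# Bucketing by identical genre sets stands in for a real clustering algorithm.
from collections import defaultdict

likes = {
    "ana":   {"sci-fi", "action"},
    "bob":   {"romance"},
    "carol": {"sci-fi", "action"},
    "dan":   {"romance"},
}

groups = defaultdict(list)
for viewer, genres in likes.items():
    groups[frozenset(genres)].append(viewer)

for genres, viewers in groups.items():
    print(sorted(genres), sorted(viewers))
```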
5.
Reinforcement learning was inspired by how the brains of rats and humans respond to punishment and rewards. These algorithms learn from outcomes, and decide on the next action.
Typically, reinforcement learning is a good fit for automated systems that have to make lots of small decisions without human guidance.
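A tiny sketch of that learn-from-outcomes loop: an epsilon-greedy bandit choosing between two actions, with made-up hidden reward probabilities. It mostly exploits its current best estimate and occasionally explores, updating estimates from observed rewards:

```python
# Minimal reinforcement-learning sketch: an epsilon-greedy bandit learns which
# of two actions pays off more, purely from reward feedback.
# The payoff probabilities are invented and hidden from the learner.
import random

random.seed(0)
true_payoff = {"A": 0.2, "B": 0.8}
estimates = {"A": 0.0, "B": 0.0}
counts = {"A": 0, "B": 0}

def choose(epsilon=0.1):
    if random.random() < epsilon:                 # explore
        return random.choice(list(estimates))
    return max(estimates, key=estimates.get)      # exploit

for _ in range(2000):
    action = choose()
    reward = 1.0 if random.random() < true_payoff[action] else 0.0
    counts[action] += 1
    # incremental average of observed rewards for this action
    estimates[action] += (reward - estimates[action]) / counts[action]

print(max(estimates, key=estimates.get))  # the learner should come to prefer "B"
```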
6.
What did similar people/customers buy/watch/listen to?
If you like product A, what other movies will you like?
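These are recommendation questions. A sketch of the simplest possible "people who liked A also liked ..." approach, counting co-occurrences across made-up purchase baskets (real recommenders use collaborative filtering over far larger data):

```python
# Minimal recommendation sketch: "people who liked A also liked ..."
# via co-occurrence counting. The baskets are invented for illustration.
from collections import Counter

baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B"},
    {"B", "D"},
]

def also_liked(item):
    co = Counter()
    for basket in baskets:
        if item in basket:
            co.update(basket - {item})
    # items most often bought together with `item`, most frequent first
    return [other for other, _ in co.most_common()]

print(also_liked("A"))
```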
Data Ingestion is a non-trivial task for the enterprise
The best systems combine data from multiple sources
Adding more data is a highly specialized task
Data Governance for ML
Dataset versions
Test data versions
Model versions
Test results versions
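One lightweight convention for that kind of governance (an assumption for illustration, not a MapR feature): tag every training run with a content hash of its dataset snapshot plus a model version, so any test result can be traced back to the exact data and model that produced it:

```python
# Sketch of a governance record for one training run. The fingerprint ties
# results to an exact dataset snapshot; names and numbers are hypothetical.
import hashlib
import json

def dataset_fingerprint(rows):
    """Stable content hash of a dataset snapshot (illustrative only)."""
    blob = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

run_record = {
    "dataset_version": dataset_fingerprint([{"x": 1, "y": 0}, {"x": 2, "y": 1}]),
    "model_version": "model-v3",        # hypothetical version tag
    "test_accuracy": 0.91,              # made-up result for illustration
}
print(json.dumps(run_record, indent=2))
```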
Model Deployment is a non-trivial integration task with external enterprise systems
May need to be scalable, highly available (HA), and fault-tolerant
What about after deployment?
A/B Testing
Understanding performance
Dealing with data drift
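Data drift can be detected by comparing the live distribution of a feature against the distribution the model was trained on. A minimal sketch, with invented numbers and an assumed 3-sigma threshold on the mean shift:

```python
# Minimal data-drift sketch: flag a feature whose live values have moved away
# from the training distribution. Data and threshold are invented.
import statistics

train_values = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8]
live_values  = [14.8, 15.2, 15.0, 14.9, 15.3, 15.1]

def has_drifted(train, live, k=3.0):
    """True if the live mean is more than k train-stdevs from the train mean."""
    mu, sigma = statistics.mean(train), statistics.stdev(train)
    return abs(statistics.mean(live) - mu) > k * sigma

print(has_drifted(train_values, live_values))  # the live feature has shifted
```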
Get training data (example: images)
Get labels for the training data (examples: what the image is about, the image labels)
Transform the data into numbers (machine learning algorithms can’t deal with raw data, only vectors of numbers)
Heavily iterative work to find the best set of features
Try many different algorithms, and tune their parameters for best performance
Heavily iterative work to find the best algorithm and parameter values
The best algorithm, trained on your data, with its parameters tuned for best performance, is your predictive model
Get new data
Transform it to match the same format as your training feature vectors
The model will output a predicted label for the new data
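The steps above, end to end, as a tiny sketch: turn raw labeled records into numeric feature vectors, "train" a model (here, per-class feature means standing in for a real algorithm), then apply the same transform to new data and predict a label. All names and data are invented:

```python
# End-to-end sketch of the workflow above, on made-up fruit data.

# 1. Training data with labels
raw_train = [
    ({"weight_g": 150, "smooth": True},  "apple"),
    ({"weight_g": 170, "smooth": True},  "apple"),
    ({"weight_g": 120, "smooth": False}, "orange"),
    ({"weight_g": 140, "smooth": False}, "orange"),
]

# 2. Transform raw records into numeric vectors (algorithms need numbers)
def to_vector(record):
    return [record["weight_g"] / 100.0, 1.0 if record["smooth"] else 0.0]

# 3. "Train": compute the mean feature vector per class
def train(examples):
    sums, counts = {}, {}
    for record, label in examples:
        vec = to_vector(record)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

model = train(raw_train)

# 4. Predict: transform new data the same way, pick the nearest class mean
def predict(record):
    vec = to_vector(record)
    def dist(center):
        return sum((a - b) ** 2 for a, b in zip(vec, center))
    return min(model, key=lambda lab: dist(model[lab]))

print(predict({"weight_g": 160, "smooth": True}))
```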
This is already a lot of work, yet it glosses over a HUGE amount of additional work required to get business value in an enterprise setting
Here is a small sample of the issues faced in putting ML to work in an enterprise
This is for real. Chris Fregly is the Pipeline.io guy and he’s building enterprise ML systems with this set of tools.
This is the set of tools required to be able to provide a true 100% open source end to end story for enterprise ML.
How does MapR simplify this picture? Which tools remain useful if we run on MapR?
It can support OSS tools like R, scikit-learn, Theano, and TensorFlow first, to avoid more expensive licenses.
NoSQL, Kafka, Spark, etc...
Indeed, ML tools are only really good at modeling. They typically provide limited support for feature engineering.
In addition, tools typically only support testing models on a single dataset, with no support for comparing production models to experimental models, comparing models across different versions of a dataset, etc.
Such capabilities need to be custom-built and are typically implemented as low-quality ad hoc code by data scientists.
Most of the work resides in the ETL to get the data in the first place, in the data cleaning and feature engineering not supported by the tools, and in the work required to deploy a model to production.
MapR really shines on that 90% of the work, and supports all ML tools just as well as (indeed, better than) any competing platform.
legacy tools: R, Python, Bash, SPSS, Hive/Pig
State of the art: Apache projects like Drill, Impala, Spark, Zeppelin, Mesos, Flink, …