MapR is an ideal scalable platform for data science, and specifically for operationalizing machine learning in the enterprise. This presentation gives specific reasons why.
1.
Will this tire fail in the next 1,000 miles: Yes or no?
Which brings in more customers: a $5 coupon or a 25% discount?
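Questions like these are classification problems: the model predicts a discrete label ("yes"/"no") from numeric features. A minimal sketch, using made-up tire data and a simple nearest-neighbor rule standing in for a real trained model:

```python
# Minimal classification sketch: "Will this tire fail in the next 1,000 miles?"
# Features (tread depth in mm, miles driven) and labels are invented for
# illustration; a 1-nearest-neighbor rule stands in for a real trained model.
import math

train = [
    ((8.0, 5_000), "no"),
    ((6.5, 20_000), "no"),
    ((2.0, 45_000), "yes"),
    ((1.5, 60_000), "yes"),
]

def predict(features):
    """Return the label of the closest training example."""
    def dist(a, b):
        # scale miles down so both features contribute comparably
        return math.hypot(a[0] - b[0], (a[1] - b[1]) / 10_000)
    return min(train, key=lambda ex: dist(ex[0], features))[1]

print(predict((1.8, 50_000)))  # a worn, high-mileage tire -> "yes"
```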
2.
If you have a car with pressure gauges, you might want to know: Is this pressure gauge reading normal?
If you're monitoring the internet, you'd want to know: Is this message from the internet typical?
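These are anomaly-detection questions: is a new observation consistent with what we've seen before? One very simple sketch (made-up gauge readings, a 3-sigma rule standing in for a real anomaly detector):

```python
# Minimal anomaly-detection sketch: is this pressure gauge reading normal?
# Flag readings more than 3 standard deviations from the historical mean.
# The psi readings below are invented for illustration.
import statistics

history = [32.1, 31.8, 32.4, 32.0, 31.9, 32.2, 32.3, 31.7]

mean = statistics.mean(history)
stdev = statistics.stdev(history)

def is_normal(reading, k=3.0):
    return abs(reading - mean) <= k * stdev

print(is_normal(32.0))   # a typical reading
print(is_normal(45.0))   # far outside the normal range
```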
3.
What will the temperature be next Tuesday?
What will my fourth quarter sales be?
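These are regression (forecasting) questions: predict a continuous number rather than a label. A sketch with made-up quarterly sales, fitting a least-squares line and extrapolating one quarter ahead:

```python
# Minimal regression sketch: forecast next quarter's sales by fitting a
# least-squares line to past quarters. All numbers are invented.
quarters = [1, 2, 3]
sales = [100.0, 110.0, 121.0]

n = len(quarters)
mean_x = sum(quarters) / n
mean_y = sum(sales) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(quarters, sales)) \
        / sum((x - mean_x) ** 2 for x in quarters)
intercept = mean_y - slope * mean_x

forecast_q4 = slope * 4 + intercept
print(round(forecast_q4, 1))  # extrapolated Q4 sales
```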
4.
Which viewers like the same types of movies?
Which printer models fail the same way?
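These are clustering questions: group items by similarity without pre-existing labels. Real clustering (e.g. k-means over rating vectors) is more nuanced; this sketch simply buckets viewers who like the exact same set of genres, on invented data:

```python
# Minimal clustering sketch: group viewers who like the same movie genres.
# Bucketing by identical genre sets stands in for a real clustering algorithm.
from collections import defaultdict

likes = {
    "ana":   {"sci-fi", "action"},
    "bob":   {"romance"},
    "carol": {"sci-fi", "action"},
    "dan":   {"romance"},
}

groups = defaultdict(list)
for viewer, genres in likes.items():
    groups[frozenset(genres)].append(viewer)

for genres, viewers in groups.items():
    print(sorted(genres), sorted(viewers))
```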
5.
Reinforcement learning was inspired by how the brains of rats and humans respond to punishment and rewards. These algorithms learn from outcomes, and decide on the next action.
Typically, reinforcement learning is a good fit for automated systems that have to make lots of small decisions without human guidance.
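A tiny sketch of that learn-from-outcomes loop: an epsilon-greedy bandit choosing between two actions, with made-up hidden reward probabilities. It mostly exploits its current best estimate and occasionally explores, updating estimates from observed rewards:

```python
# Minimal reinforcement-learning sketch: an epsilon-greedy bandit learns which
# of two actions pays off more, purely from reward feedback.
# The payoff probabilities are invented and hidden from the learner.
import random

random.seed(0)
true_payoff = {"A": 0.2, "B": 0.8}
estimates = {"A": 0.0, "B": 0.0}
counts = {"A": 0, "B": 0}

def choose(epsilon=0.1):
    if random.random() < epsilon:                 # explore
        return random.choice(list(estimates))
    return max(estimates, key=estimates.get)      # exploit

for _ in range(2000):
    action = choose()
    reward = 1.0 if random.random() < true_payoff[action] else 0.0
    counts[action] += 1
    # incremental average of observed rewards for this action
    estimates[action] += (reward - estimates[action]) / counts[action]

print(max(estimates, key=estimates.get))  # the learner should come to prefer "B"
```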
6.
What did similar people/customers buy/watch/listen to?
If you like product A, what other movies will you like?
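These are recommendation questions. A sketch of the simplest possible "people who liked A also liked ..." approach, counting co-occurrences across made-up purchase baskets (real recommenders use collaborative filtering over far larger data):

```python
# Minimal recommendation sketch: "people who liked A also liked ..."
# via co-occurrence counting. The baskets are invented for illustration.
from collections import Counter

baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B"},
    {"B", "D"},
]

def also_liked(item):
    co = Counter()
    for basket in baskets:
        if item in basket:
            co.update(basket - {item})
    # items most often bought together with `item`, most frequent first
    return [other for other, _ in co.most_common()]

print(also_liked("A"))
```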
Data Ingestion is a non-trivial task for the enterprise
The best systems combine data from multiple sources
Adding more data is a highly specialized task
Data Governance for ML
Dataset versions
Test data versions
Model versions
Test results versions
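One lightweight convention for that kind of governance (an assumption for illustration, not a MapR feature): tag every training run with a content hash of its dataset snapshot plus a model version, so any test result can be traced back to the exact data and model that produced it:

```python
# Sketch of a governance record for one training run. The fingerprint ties
# results to an exact dataset snapshot; names and numbers are hypothetical.
import hashlib
import json

def dataset_fingerprint(rows):
    """Stable content hash of a dataset snapshot (illustrative only)."""
    blob = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

run_record = {
    "dataset_version": dataset_fingerprint([{"x": 1, "y": 0}, {"x": 2, "y": 1}]),
    "model_version": "model-v3",        # hypothetical version tag
    "test_accuracy": 0.91,              # made-up result for illustration
}
print(json.dumps(run_record, indent=2))
```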
Model Deployment is a non-trivial integration task with external enterprise systems
May need to be scalable, highly available (HA), and fault-tolerant
What about after deployment?
A/B Testing
Understanding performance
Dealing with data drift
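Data drift can be detected by comparing the live distribution of a feature against the distribution the model was trained on. A minimal sketch, with invented numbers and an assumed 3-sigma threshold on the mean shift:

```python
# Minimal data-drift sketch: flag a feature whose live values have moved away
# from the training distribution. Data and threshold are invented.
import statistics

train_values = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8]
live_values  = [14.8, 15.2, 15.0, 14.9, 15.3, 15.1]

def has_drifted(train, live, k=3.0):
    """True if the live mean is more than k train-stdevs from the train mean."""
    mu, sigma = statistics.mean(train), statistics.stdev(train)
    return abs(statistics.mean(live) - mu) > k * sigma

print(has_drifted(train_values, live_values))  # the live feature has shifted
```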
Get training data (example: images)
Get labels for the training data (examples: what the image is about, the image labels)
Transform the data into numbers (machine learning algorithms can’t deal with raw data, only vectors of numbers)
Heavily iterative work to find the best set of features
Try many different algorithms, and tune their parameters for best performance
Heavily iterative work to find the best algorithm and parameter values
The best algorithm, trained on your data, with its parameters tuned for best performance, is your predictive model
Get new data
Transform it to match the same format as your training feature vectors
The model will output a predicted label for the new data
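The steps above, end to end, as a tiny sketch: turn raw labeled records into numeric feature vectors, "train" a model (here, per-class feature means standing in for a real algorithm), then apply the same transform to new data and predict a label. All names and data are invented:

```python
# End-to-end sketch of the workflow above, on made-up fruit data.

# 1. Training data with labels
raw_train = [
    ({"weight_g": 150, "smooth": True},  "apple"),
    ({"weight_g": 170, "smooth": True},  "apple"),
    ({"weight_g": 120, "smooth": False}, "orange"),
    ({"weight_g": 140, "smooth": False}, "orange"),
]

# 2. Transform raw records into numeric vectors (algorithms need numbers)
def to_vector(record):
    return [record["weight_g"] / 100.0, 1.0 if record["smooth"] else 0.0]

# 3. "Train": compute the mean feature vector per class
def train(examples):
    sums, counts = {}, {}
    for record, label in examples:
        vec = to_vector(record)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

model = train(raw_train)

# 4. Predict: transform new data the same way, pick the nearest class mean
def predict(record):
    vec = to_vector(record)
    def dist(center):
        return sum((a - b) ** 2 for a, b in zip(vec, center))
    return min(model, key=lambda lab: dist(model[lab]))

print(predict({"weight_g": 160, "smooth": True}))
```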
This is already a lot of work, yet it glosses over a HUGE amount of additional work required to get business value in an enterprise setting
Here is a small sample of the issues faced in putting ML to work in an enterprise
This is for real. Chris Fregly is the Pipeline.io guy and he’s building enterprise ML systems with this set of tools.
This is the set of tools required to be able to provide a true 100% open source end to end story for enterprise ML.
How does MapR simplify this picture? Which tools remain useful if we run on MapR?
It can support OSS tools like R, scikit-learn, Theano, and TensorFlow first, to avoid more expensive licenses.
NoSQL, Kafka, Spark, etc...
Indeed, ML tools are only really good at modeling. They typically provide limited support for feature engineering.
In addition, tools typically only support testing models on a single dataset, with no support for comparing production models to experimental models, comparing models across different versions of a dataset, etc.
Such capabilities need to be custom-built and are typically implemented as low-quality ad hoc code by data scientists.
Most of the work resides in the ETL to get the data in the first place, in the data cleaning and feature engineering not supported by the tools, and in the work required to deploy a model to production.
MapR really shines on that 90% of the work, and supports all ML tools just as well as (indeed, better than) any competing platform.
legacy tools: R, Python, Bash, SPSS, Hive/Pig
State of the art: Apache projects like Drill, Impala, Spark, Zeppelin, Mesos, Flink, …