The IT discipline of machine learning has become increasingly important in recent years. It promises to solve types of problems for which normal software development is considered unsuitable or too costly.
3. The first Data Scientist?
“Philosophy [nature] is written in that great book
which ever is before our eyes -- I mean the
universe -- but we cannot understand it if we
do not first learn the language and grasp the
symbols in which it is written. The book is
written in mathematical language”
“Measure what can be measured, and make
measurable what cannot be measured.”
Galileo Galilei (1564 – 1642)
6. Why Machine Learning
“Machine learning algorithms can figure out how to perform important
tasks by generalizing from examples. This is often feasible and cost-
effective where manual programming is not.”
“A Few Useful Things to Know about Machine Learning”, P. Domingos
• Learn it when you can’t code it (recognizing speech or images)
• Learn it when you can’t scale (recommendations, spam, fraud detection)
• Learn it when you have to adapt (predictive typing, AI gaming)
• Learn it when you can’t track it (robot control, self-driving cars)
7. Machine Learning Workflow
Experiments look like data-flow configurations of
what you would like to do with your information
and with your models.
In order to do predictive analytics
you just have to:
• Import or connect to some
current or historical data.
• Build and validate a model.
• Publish trained models to make
live predictions.
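The three steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data set is synthetic, the "model" is a simple most-frequent-label lookup per hour, and "publishing" is reduced to exposing a `predict` function. None of these names come from the slides.

```python
from collections import Counter, defaultdict

def load_data(days=10):
    """Step 1 (assumed data): one (day, hour, label) row per hour of each day."""
    return [(day, hour, "day" if 8 <= hour < 20 else "night")
            for day in range(days) for hour in range(24)]

def train(rows):
    """Step 2a: build a model that remembers the most frequent label per hour."""
    counts = defaultdict(Counter)
    for _, hour, label in rows:
        counts[hour][label] += 1
    table = {hour: c.most_common(1)[0][0] for hour, c in counts.items()}
    fallback = Counter(label for _, _, label in rows).most_common(1)[0][0]
    return lambda hour: table.get(hour, fallback)

def validate(model, rows):
    """Step 2b: accuracy on rows the model has not seen during training."""
    hits = sum(1 for _, hour, label in rows if model(hour) == label)
    return hits / len(rows)

rows = load_data()
train_rows = [r for r in rows if r[0] < 8]    # first 8 days for training
test_rows = [r for r in rows if r[0] >= 8]    # last 2 days held out
model = train(train_rows)
accuracy = validate(model, test_rows)

def predict(hour):
    """Step 3: the 'published' entry point a live service would expose."""
    return model(hour)

print(f"held-out accuracy: {accuracy:.2f}")
```

Holding out whole days for validation keeps the evaluation honest: the model is scored only on rows it never saw.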
8. Choosing Models (Algorithms)
Problems:
• Anomaly Detection
• Classification
  • Two-class
  • Multiclass
• Clustering
• Regression
• Text Analytics
Questions:
• How large is your data?
• Do you need to train incrementally?
• Is your data categorical or numerical?
• Do you need to understand the classifier?
• Is the problem complex (non-linear)?
9. Choosing Machine Learning Tools
Two main languages
R:
• Programming Language for Statistical
Computing and Graphics.
• Aimed at Data Scientists who can Develop.
• Freely Available.
Python:
• Multipurpose programming language.
• Aimed at Developers who can do Data Science.
• Freely Available.
All-in-one (cloud) platforms:
• Azure Machine Learning
• Amazon Machine Learning
• Google Prediction API
11. Azure Machine Learning
• A Machine Learning solution completely in the cloud.
• 18 February 2015: announced as generally available by Microsoft.
• No software to install, just a browser is required.
• Also provides a free plan to experiment with.
• Share your work with anyone with internet access.
• Visual composition of data science workflow with
pre-made ML algorithms ready to use.
• Allows rapid prototyping (experiments) to create a
better model.
• Deploy trained models as REST web services.
• Supports R and Python scripts for advanced scenarios or for reusing previous work.
15. Instead of coding, what about making it learn?
• Not having a real elevator, I wrote an elevator simulator.
• People enter at 9:00 AM (more or less…) and leave at 6:00 PM (more or less…).
• One-hour lunch break (more or less…).
• Nobody works on weekends (more or less…).
• 5 floors (0 to 4)
• 2 elevators (A and B)
• 10 people working:
  • 5 on floor 3
  • 3 on floor 4
  • 2 on floor 2
  • 0 on floor 1
• 6 months of simulation (I know, no holidays…)
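A simulator along these lines might look like the sketch below. The head count per floor and the working hours come from the slide; everything else is an assumption made for illustration: the lunch break is placed at 13:00 to 14:00, each person makes four elevator calls per day, "more or less" is modelled as a jitter of at most one hour, and the two elevators A and B are not distinguished.

```python
import random
from datetime import date, timedelta

# Head count per floor, taken from the slide: floor -> people working there.
WORKERS_PER_FLOOR = {2: 2, 3: 5, 4: 3}

def simulate(start=date(2015, 1, 5), days=182, seed=42):
    """Return a list of (day_of_week, hour, call_floor) elevator calls.

    Assumed behaviour (not from the slides): a worker calls the elevator on
    floor 0 when arriving and when coming back from lunch, and on their own
    floor when leaving for lunch or going home.
    """
    rng = random.Random(seed)
    calls = []
    for offset in range(days):
        day_of_week = (start + timedelta(days=offset)).weekday()  # 0 = Monday
        if day_of_week >= 5:          # simplified: nobody works on weekends
            continue
        for floor, people in WORKERS_PER_FLOOR.items():
            for _ in range(people):
                # (nominal hour, floor where the elevator is called)
                schedule = ((9, 0), (13, floor), (14, 0), (18, floor))
                for hour, call_floor in schedule:
                    jitter = rng.choice((-1, 0, 0, 0, 1))   # "more or less"
                    calls.append((day_of_week, hour + jitter, call_floor))
    return calls

calls = simulate()
print(len(calls), "simulated elevator calls")
```

The default of 182 days starting on a Monday covers 26 full weeks, which approximates the six months of simulation mentioned above.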
16. Cleaning Data and Selecting Features
• Exclude all columns except DayOfWeek, Hour and Elevator Floor.
• Treat Elevator Floor as a label (a category, not a number).
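Outside a visual tool, the same two cleaning steps could be sketched as follows. The raw rows and the extra columns (`Timestamp`, `Elevator`) are hypothetical; only DayOfWeek, Hour and Elevator Floor are named in the slides.

```python
# Hypothetical raw log rows; Timestamp and Elevator are made-up extra columns.
raw_log = [
    {"Timestamp": "2015-01-05 09:02", "DayOfWeek": 0, "Hour": 9,
     "Elevator": "A", "Elevator Floor": 0},
    {"Timestamp": "2015-01-05 13:01", "DayOfWeek": 0, "Hour": 13,
     "Elevator": "B", "Elevator Floor": 3},
    {"Timestamp": "2015-01-05 18:10", "DayOfWeek": 0, "Hour": 18,
     "Elevator": "A", "Elevator Floor": None},   # incomplete row
]

FEATURES = ("DayOfWeek", "Hour")
LABEL = "Elevator Floor"

def select_columns(rows):
    """Keep only the chosen features and the label; drop incomplete rows."""
    cleaned = []
    for row in rows:
        if any(row.get(column) is None for column in FEATURES + (LABEL,)):
            continue
        features = tuple(int(row[column]) for column in FEATURES)
        # Store the floor as a string so it is treated as a class name:
        # floor "4" is not "twice" floor "2".
        label = str(row[LABEL])
        cleaned.append((features, label))
    return cleaned

cleaned = select_columns(raw_log)
print(cleaned)
```

Converting the floor to a string is one simple way to make sure later code cannot accidentally do arithmetic on it.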
17. Training the selected model
• Decision Forest is a common model for multiclass prediction.
• Train the model to predict Elevator Floor from DayOfWeek and Hour.
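To give an idea of what such a model does internally, here is a small from-scratch sketch of a bagged ensemble of decision trees, which is the general idea behind a decision forest. It is not the Azure Machine Learning module, and the toy training pattern (ground floor in the morning, floor 3 at lunch time, floor 4 in the evening) is an assumption chosen only to keep the example short.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def build_tree(rows, depth=0, max_depth=5):
    """rows: list of ((DayOfWeek, Hour), label). Returns a label or a split node."""
    labels = [label for _, label in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority
    best = None
    for feature in range(len(rows[0][0])):
        for threshold in sorted({features[feature] for features, _ in rows}):
            left = [r for r in rows if r[0][feature] <= threshold]
            right = [r for r in rows if r[0][feature] > threshold]
            if not left or not right:
                continue
            # Weighted impurity of the two sides; lower is better.
            score = (len(left) * gini([l for _, l in left])
                     + len(right) * gini([l for _, l in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, feature, threshold, left, right)
    if best is None:                  # no threshold separates the rows
        return majority
    _, feature, threshold, left, right = best
    return {"feature": feature, "threshold": threshold,
            "left": build_tree(left, depth + 1, max_depth),
            "right": build_tree(right, depth + 1, max_depth)}

def predict_tree(tree, features):
    while isinstance(tree, dict):
        side = "left" if features[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree

def train_forest(rows, n_trees=15, seed=0):
    """Each tree is trained on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [build_tree([rng.choice(rows) for _ in rows]) for _ in range(n_trees)]

def predict_forest(forest, features):
    """Majority vote over all trees."""
    return Counter(predict_tree(t, features) for t in forest).most_common(1)[0][0]

# Toy training set (assumed pattern), Monday to Friday, 5 copies of each row.
HOUR_TO_FLOOR = {9: "0", 13: "3", 18: "4"}
training_rows = [((dow, hour), floor)
                 for dow in range(5)
                 for hour, floor in HOUR_TO_FLOOR.items()
                 for _ in range(5)]

forest = train_forest(training_rows)
print(predict_forest(forest, (2, 9)))   # Wednesday, 9:00
```

Production libraries add more randomness (for example a random subset of features per split) and many more trees; with only two features, bootstrap sampling alone is enough to show the principle.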
20. A more advanced experiment
• Introducing «Agatha»
• Being software developers, we are experimenting with data generated by our software development process.
• Predicting correlation between components: «Developers who changed this file also changed…»
• Predicting issues / bugs from historical data, using complexity metrics, change frequency and defects found in unit tests.
• We are using Machine Learning on our own process to gain insights and improve our software production process. No «cobbler's children go unshod» syndrome.
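The «Developers who changed this file also changed…» idea can be approximated by counting how often two files appear in the same commit. The sketch below is not Agatha's implementation, which the slides do not describe; the commit history and file names are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical commit history: each commit is the list of files it touched.
commits = [
    ["billing.py", "invoice.py"],
    ["billing.py", "invoice.py", "README.md"],
    ["auth.py"],
    ["billing.py", "tax.py"],
]

def co_change_counts(history):
    """Count, for every pair of files, the commits that changed both."""
    pairs = Counter()
    for files in history:
        for a, b in combinations(sorted(set(files)), 2):
            pairs[(a, b)] += 1
    return pairs

def also_changed(target, history, top=3):
    """Files most often changed together with `target`, most frequent first."""
    related = Counter()
    for (a, b), count in co_change_counts(history).items():
        if a == target:
            related[b] += count
        elif b == target:
            related[a] += count
    return related.most_common(top)

print(also_changed("billing.py", commits))
```

A real system would probably normalise these counts (for instance by how often each file changes at all) so that frequently edited files do not dominate every suggestion.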
22. Classical Software vs Machine Learning
«Classic» Software Development | Machine Learning Approach
Human writing code | Human supplying data
Work is done (mainly) on algorithm selection. | Work is done (mainly) on input/output selection (feature engineering).
Model is (mainly) a white box. | Model is (mainly) a black box.
Input / Output is less important. | Algorithm is less important.
23. Data is indeed crude oil… It needs refinement
Great things happen in machine learning
when human and machine work together,
combining a person’s knowledge of
how to create relevant features from
the data with the machine’s talent for
optimization.
Feature engineering: when you use your knowledge about the data to
create fields that make machine learning algorithms work better.
It is easily the most important factor in determining the success of a
machine learning project
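As a small illustration of feature engineering, a raw timestamp carries little signal for a learning algorithm, while fields derived from it with some knowledge of the domain do. The field names and the timestamp format below are assumptions, not taken from the slides.

```python
from datetime import datetime

def engineer_features(timestamp):
    """Turn a raw 'YYYY-MM-DD HH:MM' string into fields a model can use."""
    moment = datetime.strptime(timestamp, "%Y-%m-%d %H:%M")
    return {
        "DayOfWeek": moment.weekday(),          # 0 = Monday ... 6 = Sunday
        "Hour": moment.hour,
        "IsWeekend": moment.weekday() >= 5,     # domain knowledge: office is empty
    }

monday_morning = engineer_features("2015-01-05 09:02")
saturday_noon = engineer_features("2015-01-10 12:00")
print(monday_morning)
print(saturday_noon)
```

The `IsWeekend` flag encodes something a person knows about offices that an algorithm would otherwise have to rediscover from the data.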
24. More data beats a cleverer algorithm…
More data wins. There’s increasingly good evidence that, in a lot of
problems, very simple machine learning techniques can be levered
into incredibly powerful classifiers with the addition of loads of data.
Computer algorithms trying to learn models have only a relatively
few tricks they can do efficiently, and many of them are not so very
different. Performance differences between algorithms are typically
not large. Thus, if you want better classifiers:
1. Engineer better features
2. Get your hands on more high-quality data
27. Algorithm importance: the Netflix prize
«Classic» approach
• Data was provided.
• Algorithm was
researched.
• Three years later a
winner was found…
• But the algorithm
was never used…