These slides were presented by Imron Zuhri at the seminar & workshop "Pengenalan & Potensi Big Data & Machine Learning" (Introduction to, and the Potential of, Big Data & Machine Learning), organized by KUDO on 14 May 2016.
3. in 1996, Garry Kasparov was not afraid of a computer, and he won.
the next year, he played against a new and improved Deep Blue, and lost.
4. this is the move that was so surprising, so un-machine-like,
that he was sure the IBM team had cheated
(Rd5 vs. Rd1)
5. a random move, a computer bug;
to Kasparov, a sign of superior intelligence
(Rd5 vs. Rd1)
6. big data analytics is the culmination
of the machine way of thinking:
we can now immensely
extend our memory and computational power
to help us do just that
8. some definitions
a (hypnotized) user’s perspective:
a scientific (witchcraft) field that
researches fundamental principles from data (potions) and
develops magical algorithms (spells to cast)
(Pascal Vincent, 2015)
“field of study that gives computers the ability to learn without
being explicitly programmed”
(Arthur Samuel, 1959)
formal definition (Tom Mitchell, 1998):
“a computer program is said to learn from experience E
with respect to some task T and performance measure P,
if its performance at T, as measured by P,
improves with experience E”
for example: T = filtering spam, P = the fraction of emails classified correctly, E = a stream of emails labeled by users
10. three niches for machine learning
data mining: using historical data to improve
decisions
(e.g. medical records → medical knowledge)
software applications that are too difficult to program
by hand:
autonomous driving
image classification
user modeling
automatic recommender systems
source: rong jin, 2013
49. (some) open problems in machine learning
one-shot learning
unsupervised learning
reinforcement learning
artificial general intelligence
“most of human and animal learning is unsupervised learning. If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake. We know how to make the icing and the cherry, but we don't know how to make the cake.”
Yann LeCun
50. challenges in machine learning
data-related:
abundant yet scattered data
unstructured, noisy data
offline-stored data (duh!)
resource-related:
data storage
space constraints
computing power
training time
inve$$$tments:
initial investments
running costs
52. recent breakthroughs in machine learning
deepmind atari q-learner (2014)
plays 5 kinds of atari 2600 games
states: the raw pixels of the atari screen
actions: left/right moves
reward: the game score
algorithm used:
feedforward “q-learning” with a conv-net
that learns to map pixels to expected reward
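the q-learning loop behind the atari agent can be sketched in miniature with a tabular version (the corridor environment, learning rate, and exploration rate below are illustrative assumptions; the real system replaces the table with a conv-net over raw pixels):

```python
import random

# minimal tabular q-learning on a 1-d corridor: states 0..4, reward 1 at state 4.
N_STATES, ACTIONS = 5, (-1, +1)          # move left / move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.3    # learning rate, discount, exploration

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
rng = random.Random(0)

for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy action selection
        if rng.random() < EPSILON:
            a = rng.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        # q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s_next

# after training, "move right" should dominate in every non-terminal state
policy = [max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES - 1)]
```

the learned greedy policy heads toward the reward from every state; scaling this up means swapping the table for a network and the corridor for a game screen.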
53. recent breakthroughs in machine learning
the translator (2015)
real-time translation of speech
from/into 7 different languages
able to run even on
resource-constrained embedded
hardware (e.g. smartphones)
uses the same engine that powers
microsoft cortana (creepy!)
54. Reinforcement Learning: DeepMind AlphaGo
google deepmind alphago (2016)
99.8% winning rate
vs. other go programs
first program to defeat a
human go champion
algorithms used:
deep neural networks
monte carlo tree search
supervised learning from expert games
reinforcement learning against other alphago instances
55. supervised learning: random forest
Fernández-Delgado et al. (2014) evaluated 179 classifiers on 121 UCI data sets;
result:
the top 5 are random forest classifiers
for kaggle competitions, try gbm: xgboost
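the random-forest result rests on one core idea, bagging: train many simple, high-variance learners on bootstrap resamples and let them vote. a minimal sketch with one-feature decision stumps (the data set and stump learner are made up for illustration; a real random forest also subsamples features at each split, and in practice you would reach for scikit-learn's RandomForestClassifier or xgboost):

```python
import random

# toy 1-d data set: the true rule is "label = 1 when x > 5"
rng = random.Random(42)
X = [rng.uniform(0, 10) for _ in range(200)]
y = [1 if x > 5 else 0 for x in X]

def train_stump(xs, ys):
    """pick the threshold that best separates one bootstrap sample."""
    best_t, best_err = 0.0, float("inf")
    for t in xs:  # candidate thresholds = observed values
        err = sum((x > t) != bool(label) for x, label in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err
    return best_t

# bagging: each stump is trained on a different bootstrap resample
stumps = []
for _ in range(25):
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    stumps.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))

def forest_predict(x):
    # majority vote over the ensemble
    votes = sum(1 for t in stumps if x > t)
    return 1 if 2 * votes >= len(stumps) else 0

acc = sum(forest_predict(x) == label for x, label in zip(X, y)) / len(X)
```

each stump alone is crude, but averaging over bootstrap resamples cancels their individual variance, which is exactly why forests top the Delgado ranking.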
56. supervised: deep learning
don’t be fooled: dl research improves
part by part, via a new kind of layer,
a new activation function, a new
non-convex optimization solver, or a
deeper neural net.
from Rodrigo Benenson’s
deep learning accuracy rankings
57. supervised: deep learning
summary:
relu works better than the sigmoid function as an activation.
maxout works better when combined with dropconnect as the
activation function.
dropout layers help fight overfitting.
adagrad and adadelta work well if you don’t want to
tune the optimizer’s hyperparameters.
deeper networks work: highway layers and residual layers.
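the first bullet is usually explained via gradients: the sigmoid's derivative vanishes for large inputs, so deep stacks of sigmoid layers stop learning, while relu passes a gradient of 1 through every active unit. a quick numeric check (pure python, no framework):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # peaks at 0.25, vanishes for large |x|

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # constant 1 on the active side

for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.6f}  relu'={relu_grad(x):.0f}")
```

at x = 10 the sigmoid gradient is about 4.5e-5; ten such layers multiply that down to nothing, which is the vanishing-gradient problem relu sidesteps.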
58. unsupervised: t-sne
t-distributed stochastic neighbor embedding
van der Maaten and Hinton (2008):
mnist data set visualization
works best for data-viz
can be used for clustering too
(if you’d bother to tweak the algo)
59. semi-supervised learning: ladder neural networks
given 100 or 1,000 labeled examples, with the rest (~50,000) unlabeled,
try to predict 10,000 future data points.
● it works! even with that little labeled data.
● now we don’t have to tell some interns or PhD students to label the
data. :)
A. Rasmus, H. Valpola, M. Honkala, M. Berglund, and T. Raiko (2015)
60. collaborative filtering: restricted boltzmann machine
rbm for collaborative filtering (hinton, 2008):
it has been used in the netflix and spotify algorithms.
it works better than svd!
correlation(svd, rbm): −1 < c < 1
• can be ensembled with svd
to improve the prediction.
61. some advice for applied machine learning research
(this competition)
preprocessing: scaling & imputation
cross-validation: choose the best algos
hyperparameter optimization
ensembling n models: dark knowledge
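the first step above, scaling & imputation, can be sketched without any library (the feature column is made up; in practice scikit-learn's SimpleImputer and StandardScaler do this for whole matrices):

```python
import math

# a feature column with missing values (None); the numbers are illustrative
col = [4.0, None, 10.0, 7.0, None, 3.0]

# 1) imputation: replace missing entries with the mean of the observed ones
observed = [v for v in col if v is not None]
mean = sum(observed) / len(observed)
imputed = [v if v is not None else mean for v in col]

# 2) scaling: standardize to zero mean and unit variance
mu = sum(imputed) / len(imputed)
sd = math.sqrt(sum((v - mu) ** 2 for v in imputed) / len(imputed))
scaled = [(v - mu) / sd for v in imputed]
```

fit the imputation mean and the scaling statistics on the training split only, then apply them to the validation split, otherwise information leaks across the cross-validation boundary.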
63. cross-validation: how to choose the best algo?
cross-validation is a must!
(Tibshirani et al., 2014)
don’t overlap your cross-validation data partitions!
(Zhang, DataRobot)
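“don’t overlap your partitions” means the k validation folds must be disjoint and together cover every example exactly once. a minimal sketch (fold count and data size are arbitrary; in practice you would shuffle indices first, as scikit-learn's KFold can):

```python
def kfold_indices(n, k):
    """split indices 0..n-1 into k disjoint validation folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = kfold_indices(10, 3)
for val in folds:
    # training set = everything outside the validation fold: no overlap
    train = [j for f in folds if f is not val for j in f]
    assert not set(train) & set(val)
```

if the same example lands in two validation folds, its score is double-counted and the cross-validation estimate is biased; disjoint folds are what make the estimate honest.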
64. hyperparameter optimization
if you want to find the best hyperparameters:
do random search.
random search is better than grid search
(Bergstra & Bengio, 2012)
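random search in its simplest form: sample settings at random and keep the best scorer. the score function below is a made-up stand-in for a cross-validated model score; the paper's key point is that, for two hyperparameters, n random samples try n distinct values of each one, while an n-point grid only tries √n of each:

```python
import random

rng = random.Random(0)

def score(lr, reg):
    """stand-in for a cross-validated score (higher is better);
    the optimum sits at lr=0.1, reg=0.01 by construction."""
    return -((lr - 0.1) ** 2) - ((reg - 0.01) ** 2)

best_score, best_params = float("-inf"), None
for _ in range(60):
    # sample each hyperparameter log-uniformly over its plausible range
    lr = 10 ** rng.uniform(-4, 0)
    reg = 10 ** rng.uniform(-4, 0)
    s = score(lr, reg)
    if s > best_score:
        best_score, best_params = s, (lr, reg)
```

the log-uniform sampling matters: learning rates and regularization strengths vary over orders of magnitude, so sampling on a log scale spends the budget where it counts.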
65. ensembling n models: dark knowledge
if two models give the same accuracy, but their
prediction outputs have low correlation, then we can
improve prediction accuracy by averaging the
models’ predictions.
(Hinton, 2015)
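the averaging claim can be checked with two hypothetical probabilistic classifiers that are equally accurate but err on different examples (all numbers below are made up for illustration):

```python
def accuracy(probs, labels):
    # threshold the predicted probabilities at 0.5 and count matches
    return sum((p >= 0.5) == bool(t) for p, t in zip(probs, labels)) / len(labels)

y_true  = [1, 1, 1, 1, 0, 0, 0, 0]
model_a = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.6, 0.2]  # errs on examples 2 and 6
model_b = [0.9, 0.3, 0.8, 0.7, 0.2, 0.6, 0.1, 0.2]  # errs on examples 1 and 5

# ensemble by averaging the two models' predicted probabilities
avg = [(a + b) / 2 for a, b in zip(model_a, model_b)]

print(accuracy(model_a, y_true), accuracy(model_b, y_true), accuracy(avg, y_true))
# -> 0.75 0.75 1.0
```

each model alone scores 0.75, the average scores 1.0: because the errors are uncorrelated, each model's confident correct answer outvotes the other's marginal mistake.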
74. do you follow waze’s instructions during your first week of using it?
75. would you buy a self-driving car that couldn’t drive
itself in 99 percent of the country?
or that knew nearly nothing about parking,
couldn’t be taken out in snow or heavy rain,
and would drive straight over a gaping pothole?
if your answer is yes, then check out the google self-driving car, model year
2014
81. the current challenges of big data analytics?
heterogeneous data sources, systems and formats
time-consuming and complex data preparation process
the almost impossible task of integrating various kinds of data
it requires experts to analyze big and complex data
most of the user interactions are not intuitive
“before performing analytics, data scientists must first
format and prepare the raw data for analytics, often spending
more than 80% of the total effort,” according to Intel Corp. research
82. what would it be like
if we could simplify the whole process?
83. hence our vision
we believe humans should not be bogged down by tedious matters.
by reimagining analytics, we envision the creation of intelligent
machines
that will free humans to focus on solving the world’s toughest
problems.
84. intelligent machines that can help us collect massive amounts of data
they automatically read and connect to
any kind of data, including automatic
machine-to-machine connections:
structured data, printed invoices, social media conversations
86. then help us separate the signal from the noise
automatic data quality assessments,
data cleansing and data filtering
(example entities from the demo: regi, mita, gundam, x-men)
88. complete the information and connect it all in a meaningful way
automatic data transformation, entity
extraction, contextual profiling
(example entities from the demo: regi, mita, gundam, batman, tom, mediatrac)
91. and finally help us make sense of the massively connected data
contextual search and
recommendation
intelligent data discovery
(example entities from the demo: gundam, batman, sith)
93. through a highly intuitive and natural user interface
natural language interface
voice and gesture recognition
example query (in Indonesian): “how many restaurants sell soto along Jalan Senopati?”
107. when we have intelligent machines that can
connect everything in a meaningful way…
we can start asking questions about things we never
thought possible to ask before
108. Spotify can map songs across social graphs.
Shazam can give us situational data: where
someone is listening to a song,
when, how and even (to an extent) why.
YouTube can help us track the growth of a song
using search and streams.
Instagram & Vine are becoming hotbeds for music discovery.
what if we could connect all their data together?
109. or, if you ran a radio station: what sort of playlist would appeal to
your target audience if we knew that a sizeable percentage of them
own a hummer?
110. we could even predict the specific combinations of words, notes and
beats that increase the chance of putting a song in the
billboard top 40 this upcoming season.
111. here are some samples of the hidden insights
that we can discover from our own large repository of data,
using our intelligent data integration and data discovery tools
112. when we integrate historical media articles with geodemographic and
point-of-interest databases, we can create a model that predicts a high
probability of fire incidents down to the street level
153. scalability problems - outline
large scale machine learning
mahout: scalable ml on hadoop
jubatus: distributed online real-time ml
vowpal wabbit: fast learning at yahoo/microsoft
trident-ml and storm-pattern: ml on storm, yarn
upcoming: samoa: ml on s4, storm
issues in scalable distributed ml
load balancing
auto scaling
job scheduling
workflow management
data and model parallelism
parameter server framework
peer-to-peer framework
154. scalability problems - outline
distributed deep learning
yahoo lda: scalable parallel framework for latent variable models
distbelief: distributed deep learning on clusters
h2o: distributed deep learning on spark
adam at msr: distributed deep learning
dl4j: open source deep learning on hadoop and spark
petuum: distributed machine learning
singa: distributed deep learning
tensorflow: google’s large-scale distributed dl
mxnet: heterogeneous distributed deep learning
caffe on spark: yahoo
distributed learning and optimization
proximal splitting / auxiliary coordinates
bundle (sub-gradient) methods
shotgun: parallelized cdm (coordinate descent method)
asynchronous sgd
hogwild / dogwild
159. emerging analytics technologies for automatic
analytics on high-dimensional data
online deep learning
topological data analysis
fuzzy-rough set based data exploration systems
granular computing
kernel set and spatiotemporal analysis
applied differential geometry
non-axiomatic reasoning systems
intelligent rule and knowledge extraction/discovery
multi-agent-based modeling
weak signal detection and analysis
bayesian network analysis
genetic programming
self-organizing neural networks
160. and also more humanlike user
interaction and data visualization
technologies
eye tracking
glasses-free autostereoscopy
touch-sensitive holograms
natural language user interfaces
tangible user interfaces
wearable gestural interfaces
brain-computer interfaces
sensor network user interfaces
162. principles for the development of a complete mind:
study the science of art. study the art of science.
develop your senses — especially learn how to see.
realize that everything connects to everything else.
Leonardo da Vinci