. An introduction to machine learning and probabilistic ...

An introduction to machine learning and probabilistic graphical models. Kevin Murphy, MIT AI Lab. Presented at Intel’s workshop on “Machine learning for the life sciences”, Berkeley, CA, 3 November 2003.

Overview. Thanks to Nir Friedman, Stuart Russell, Leslie Kaelbling and various web sources for letting me use many of their slides.

Supervised learning. Learn to approximate a function F(x1, x2, x3) -> t from a training set of (x, t) pairs.

Color  Shape   Size   Output
Blue   Torus   Big    Y
Blue   Square  Small  Y
Blue   Star    Small  Y
Red    Arrow   Small  N

Supervised learning. Training data -> Learner -> Hypothesis; Testing data -> Hypothesis -> Prediction.

Training data:
X1  X2  X3  T
B   T   B   Y
B   S   S   Y
B   S   S   Y
R   A   S   N

Testing data, with predictions:
X1  X2  X3  T
B   A   S   ? (predicted Y)
Y   C   S   ? (predicted N)

Key issue: generalization. Can’t just memorize the training set (overfitting).

Hypothesis spaces

Perceptron (neural net with no hidden layers) Linearly separable data
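
A minimal sketch of the classic perceptron learning rule on linearly separable data. The toy points, the function names and the epoch limit are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def train_perceptron(X, t, epochs=100):
    """Perceptron rule: for every misclassified point, add t * x to the weights.

    Labels t are assumed to be in {-1, +1}.
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # constant 1 acts as the bias input
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x, target in zip(Xb, t):
            if target * (w @ x) <= 0:   # wrong side of (or on) the boundary
                w += target * x
                mistakes += 1
        if mistakes == 0:               # a full pass with no errors: converged
            break
    return w

def predict(w, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.where(Xb @ w > 0, 1, -1)

# Made-up, linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 1.5], [2.5, 3.0],
              [-1.0, -1.5], [-2.0, -0.5], [-1.5, -2.5]])
t = np.array([1, 1, 1, -1, -1, -1])
w = train_perceptron(X, t)
```

The rule is only guaranteed to stop when the data are linearly separable, which is why the loop also carries an epoch limit.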

The linear separator with the largest margin is the best one to pick.

What if the data is not linearly separable?

Kernel trick. The kernel implicitly maps from 2D (x1, x2) to 3D (z1, z2, z3), making the problem linearly separable.
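
One concrete 2D-to-3D map can be sketched as follows. The slide does not say which kernel it uses; the quadratic kernel k(x, y) = (x·y)² and the feature map phi below are an assumed, standard example. The sketch checks that the kernel equals an inner product in 3D, and that two concentric rings become separable by a plane there.

```python
import numpy as np

def phi(x):
    """Explicit feature map (x1, x2) -> (z1, z2, z3) for the quadratic kernel."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def poly_kernel(x, y):
    """Quadratic kernel: computed in 2D, never forming phi explicitly."""
    return float(np.dot(x, y)) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])
explicit = float(phi(a) @ phi(b))   # inner product in 3D
implicit = poly_kernel(a, b)        # same number, computed in 2D

# Two concentric rings: not linearly separable in 2D.
angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
inner = np.stack([0.5 * np.cos(angles), 0.5 * np.sin(angles)], axis=1)
outer = np.stack([2.0 * np.cos(angles), 2.0 * np.sin(angles)], axis=1)

# In 3D, z1 + z3 = x1^2 + x2^2, so the plane z1 + z3 = 1 separates the rings.
normal = np.array([1.0, 0.0, 1.0])
inner_side = np.array([phi(p) @ normal for p in inner])
outer_side = np.array([phi(p) @ normal for p in outer])
```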

Support Vector Machines (SVMs)

Boosting. Simple classifiers (weak learners) can have their performance boosted by taking weighted combinations. Boosting maximizes the margin.
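
As an illustration of "weighted combinations of weak learners", here is a small AdaBoost sketch with decision stumps. AdaBoost is one specific boosting algorithm, chosen here as an assumption; the slide does not name one. The 1D toy data are made up so that no single stump classifies them perfectly.

```python
import numpy as np

def stump_predict(X, j, thr, sign):
    """Decision stump: predict `sign` where feature j <= thr, else -sign."""
    return np.where(X[:, j] <= thr, sign, -sign)

def best_stump(X, t, w):
    """Exhaustively pick the stump with the smallest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                err = w[stump_predict(X, j, thr, sign) != t].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, t, rounds):
    n = len(t)
    w = np.full(n, 1.0 / n)                      # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        err, j, thr, sign = best_stump(X, t, w)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # guard against log(0)
        alpha = 0.5 * np.log((1.0 - err) / err)  # vote of this weak learner
        pred = stump_predict(X, j, thr, sign)
        w = w * np.exp(-alpha * t * pred)        # up-weight the mistakes
        w /= w.sum()
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def ensemble_predict(ensemble, X):
    score = sum(a * stump_predict(X, j, thr, s) for a, j, thr, s in ensemble)
    return np.where(score >= 0, 1, -1)

# Made-up 1D data with labels in {-1, +1}.
X = np.arange(10.0).reshape(-1, 1)
t = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])

single_error = best_stump(X, t, np.full(10, 0.1))[0]  # best any one stump can do
ensemble = adaboost(X, t, rounds=3)
boosted = ensemble_predict(ensemble, X)
```

On this data the best single stump still misclassifies three of the ten points, while the weighted vote of three stumps fits the training set.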

Supervised learning success stories

Unsupervised learning

K-means clustering. Reiterate.
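
A minimal sketch of K-means, assuming the standard Lloyd iteration (assign each point to its nearest center, move each center to the mean of its points, reiterate). The data and the choice of initial centers are illustrative; in practice the result depends on the initialization.

```python
import numpy as np

def kmeans(X, init_centers, iters=100):
    """Lloyd's algorithm, starting from explicitly supplied initial centers."""
    centers = np.array(init_centers, dtype=float)
    k = len(centers)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assignment step: index of the nearest center for every point.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of the assigned points (keep a center that got none).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):   # nothing moved: converged
            break
        centers = new_centers
    return centers, labels

# Two made-up, well-separated blobs.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0]])
centers, labels = kmeans(X, init_centers=X[[0, 4]])
```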

AutoClass (Cheeseman et al, 1986)

Principal Component Analysis (PCA) PCA seeks a projection that best represents the data in a least-squares sense. PCA reduces the dimensionality of feature space by restricting attention to those directions along which the scatter of the cloud is greatest.
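
The projection can be sketched with a singular value decomposition of the centered data. This is one common way to compute PCA, used here as an assumption; the toy data lie close to the direction (1, 2), so that direction should come out as the first principal component.

```python
import numpy as np

def pca(X, n_components):
    """Return the mean, the top principal directions and their variances."""
    mean = X.mean(axis=0)
    Xc = X - mean                                   # center the cloud
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]                  # rows are unit directions
    explained_variance = S[:n_components] ** 2 / (len(X) - 1)
    return mean, components, explained_variance

def project(X, mean, components):
    """Coordinates of X in the reduced space."""
    return (X - mean) @ components.T

# Made-up data: large scatter along d, small scatter along the perpendicular n.
d = np.array([1.0, 2.0]) / np.sqrt(5.0)
n = np.array([-2.0, 1.0]) / np.sqrt(5.0)
t_vals = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
wiggle = 0.1 * np.array([1.0, -1.0, 1.0, -1.0, 1.0])
X = t_vals[:, None] * d + wiggle[:, None] * n

mean, components, variance = pca(X, n_components=1)
Z = project(X, mean, components)                    # 2D points reduced to 1D
```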

Discovering nonlinear manifolds

Combining supervised and unsupervised learning

Discovering rules (data mining). Find the most frequent patterns (association rules), e.g.:
Num in household = 1 ^ num children = 0 => language = English
Language = English ^ Income < $40k ^ Married = false ^ num children = 0 => education ∈ {college, grad school}

Sex  Married  Age  Occup.   Income  Educ.
M    S        22   Student  $10k    MA
F    S        24   Student  $20k    PhD
M    M        30   Doctor   $80k    MD
F    M        60   Retired  $30k    HS

Unsupervised learning: summary

Discovering networks. From data visualization to causal discovery.

Networks in biology (levels listed in order of decreasing detail)

Molecular level: Lysis-Lysogeny circuit in Lambda phage. Arkin et al. (1998), Genetics 149(4):1633-48.

Concentration level: metabolic pathways [Figure: nodes g1 to g5 connected by weighted arcs, e.g. w12, w23, w55]

Qualitative level: Boolean Networks

Probabilistic graphical models. "The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful. Therefore the true logic for this world is the calculus of probabilities." -- James Clerk Maxwell. "Probability theory is nothing but common sense reduced to calculation." -- Pierre Simon Laplace

Graphical models: outline

Simple probabilistic model: linear regression. Y = α + βX + noise, where α + βX is the deterministic (functional) relationship. “Learning” = estimating the parameters α, β, σ from (x,y) pairs: α is the empirical mean, β can be estimated by least squares, and σ² is the residual variance.
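
A minimal sketch of the least-squares estimates, writing the intercept as alpha, the slope as beta and the noise variance as sigma2 (names assumed here). The five data points are made up. When the inputs are centered, the intercept reduces to the empirical mean of Y.

```python
import numpy as np

def fit_linear_regression(x, y):
    """Closed-form least squares for y = alpha + beta * x + noise."""
    x_mean, y_mean = x.mean(), y.mean()
    beta = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
    alpha = y_mean - beta * x_mean          # equals y_mean when x is centered
    residuals = y - (alpha + beta * x)
    sigma2 = (residuals ** 2).mean()        # residual variance (ML estimate)
    return alpha, beta, sigma2

# Made-up data: y = 1 + 2x plus small perturbations.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x + np.array([0.1, -0.1, 0.0, 0.1, -0.1])

alpha, beta, sigma2 = fit_linear_regression(x, y)

# Same fit with centered inputs: the intercept is now just the mean of y.
alpha_centered, beta_centered, _ = fit_linear_regression(x - x.mean(), y)
```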

Piecewise linear regression Latent “switch” variable – hidden process at work

Probabilistic graphical model for piecewise linear regression (nodes: input X, hidden switch Q, output Y). Learning is harder because Q is hidden, so we don’t know which data points to assign to each line; this can be solved with EM (cf. K-means).

Classes of graphical models. Probabilistic models include graphical models, which are either directed (Bayes nets, DBNs) or undirected (MRFs).

Bayesian Networks. Compact representation of probability distributions via conditional independence. Quantitative part: set of conditional probability distributions. Together: define a unique distribution in a factored form. [Figure: network over Earthquake, Radio, Burglary, Alarm, Call; the family of Alarm carries the table P(A | E,B), with rows 0.9/0.1, 0.2/0.8, 0.01/0.99, 0.9/0.1 over the settings of E and B]

Example: “ICU Alarm” network [Figure: network over the nodes PCWP, CO, HRBP, HREKG, HRSAT, ERRCAUTER, HR, HISTORY, CATECHOL, SAO2, EXPCO2, ARTCO2, VENTALV, VENTLUNG, VENITUBE, DISCONNECT, MINVOLSET, VENTMACH, KINKEDTUBE, INTUBATION, PULMEMBOLUS, PAP, SHUNT, ANAPHYLAXIS, MINOVL, PVSAT, FIO2, PRESS, INSUFFANESTH, TPR, LVFAILURE, ERRBLOWOUTPUT, STROEVOLUME, LVEDVOLUME, HYPOVOLEMIA, CVP, BP]

Success stories for graphical models

Probabilistic Inference [Figure: network over Earthquake, Radio, Burglary, Alarm, Call]
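
A brute-force sketch of posterior inference on the five-node alarm network, by summing the factored joint over every assignment consistent with the evidence. The arc structure (Earthquake to Radio and Alarm, Burglary to Alarm, Alarm to Call) is the usual one for this example, and every probability below is a hypothetical number chosen for illustration, not a value from the slides.

```python
from itertools import product

# Hypothetical parameters (illustrative only).
P_E = 0.02                                   # P(Earthquake)
P_B = 0.01                                   # P(Burglary)
P_R_GIVEN_E = {True: 0.9, False: 0.001}      # P(Radio | Earthquake)
P_A_GIVEN_EB = {(True, True): 0.95, (True, False): 0.3,
                (False, True): 0.9, (False, False): 0.01}  # P(Alarm | E, B)
P_C_GIVEN_A = {True: 0.8, False: 0.05}       # P(Call | Alarm)

NAMES = ("e", "b", "r", "a", "c")

def bern(p, value):
    """Probability that a binary variable with P(True) = p takes `value`."""
    return p if value else 1.0 - p

def joint(e, b, r, a, c):
    """Factored joint: one term per node given its parents."""
    return (bern(P_E, e) * bern(P_B, b) * bern(P_R_GIVEN_E[e], r)
            * bern(P_A_GIVEN_EB[(e, b)], a) * bern(P_C_GIVEN_A[a], c))

def query(target, evidence):
    """P(target = True | evidence), by enumerating all 2^5 worlds."""
    num = den = 0.0
    for values in product([True, False], repeat=len(NAMES)):
        world = dict(zip(NAMES, values))
        if any(world[k] != v for k, v in evidence.items()):
            continue                          # inconsistent with the evidence
        p = joint(**world)
        den += p
        if world[target]:
            num += p
    return num / den

p_burglary_given_call = query("b", {"c": True})
p_burglary_given_call_and_radio = query("b", {"c": True, "r": True})
```

With these numbers a call raises the probability of a burglary above its prior, and an earthquake report on the radio lowers it again ("explaining away"). Enumeration is exponential in the number of variables, so it only serves as a reference for small networks.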

Viterbi decoding. Compute the most probable explanation (MPE) of the observed data in a Hidden Markov Model (HMM). [Figure: hidden nodes X1, X2, X3 with observed nodes Y1, Y2, Y3; example “Tomato”]
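
A compact sketch of Viterbi decoding in log space. The two-state HMM and its numbers (a commonly used healthy/fever textbook toy example) are assumptions for illustration and do not come from the slides.

```python
import numpy as np

def viterbi(obs, start, trans, emit):
    """Most probable hidden state sequence for an HMM.

    obs:   list of observation indices (the observed Y's)
    start: start[i] = P(X1 = i)
    trans: trans[i, j] = P(X_{t+1} = j | X_t = i)
    emit:  emit[i, k] = P(Y_t = k | X_t = i)
    """
    T, n_states = len(obs), len(start)
    log_delta = np.log(start) + np.log(emit[:, obs[0]])
    backptr = np.zeros((T, n_states), dtype=int)
    for t in range(1, T):
        # scores[i, j]: best log-probability of a path ending in i, then moving to j
        scores = log_delta[:, None] + np.log(trans)
        backptr[t] = scores.argmax(axis=0)
        log_delta = scores.max(axis=0) + np.log(emit[:, obs[t]])
    # Trace the best path backwards from the best final state.
    path = [int(log_delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], float(log_delta.max())

# Toy HMM: states 0 = healthy, 1 = fever; observations 0 = normal, 1 = cold, 2 = dizzy.
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
emit = np.array([[0.5, 0.4, 0.1],
                 [0.1, 0.3, 0.6]])

path, log_prob = viterbi([0, 1, 2], start, trans, emit)
```

The cost is linear in the sequence length and quadratic in the number of states, which is what makes chains an "easy" case for inference.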

Inference: computational issues. From easy to hard: chains, trees, grids, dense loopy graphs. Many different inference algorithms, both exact and approximate. [Figure: ICU Alarm network]

Bayesian inference. Parameters are tied (shared) across repetitions of the data. [Figure: shared parameters connected to each pair X1, Y1, ..., Xn, Yn]

Bayesian inference

Graphical models: outline

Why Struggle for Accurate Structure? [Figure: three networks over Earthquake, Burglary, Alarm Set, Sound: the truth, one with an added arc, and one with a missing arc]

Score-based learning. Define a scoring function that evaluates how well a structure matches the data, then search for a structure that maximizes the score. [Figure: data over E, B, A, e.g. <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>, and candidate structures over E, B, A]

Learning Trees

Heuristic Search

Local Search Operations. Reverse C → E. Delete C → E. Add C → D, with Δscore = S({C,E} → D) - S({E} → D). [Figure: four networks over S, C, E, D, one per move]

Problems with local search. Easy to get stuck in local optima. [Figure: score landscape S(G|D), with “truth” and “you” marked]

Problems with local search II. Picking a single best model can be misleading. [Figure: posterior P(G|D) over several candidate graphs on E, R, B, A, C]

Bayesian Approach to Structure Learning. Formula labels: feature of G, e.g., X → Y; indicator function for feature f; Bayesian score for G.

Bayesian approach: computational issues. How to compute the sum over a super-exponential number of graphs?

Structure learning: other issues

Discovering latent variables. [Figure: (a) 17 parameters; (b) 59 parameters] There are some techniques for automatically detecting the possible presence of latent variables.

Learning causal models

Learning causal models [Figure: four candidate structures over X, Y, Z]

Learning from interventional data. Observation (smoking → yellow fingers): P(smoker | observe(yellow)) >> prior. Intervention: P(smoker | do(paint yellow)) = prior. Cut arcs coming into nodes which were set by intervention.
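
The difference between observing and intervening can be sketched on the two-node network smoking → yellow fingers. The probabilities are hypothetical; the intervention is modelled, as the slide describes, by cutting the arc into the node that was set.

```python
# Hypothetical parameters (illustrative only).
P_SMOKER = 0.2
P_YELLOW_GIVEN_SMOKER = {True: 0.8, False: 0.05}

def posterior_smoker(p_yellow_given_smoker):
    """P(smoker | yellow = True) by Bayes' rule for a given Yellow mechanism."""
    num = P_SMOKER * p_yellow_given_smoker[True]
    den = num + (1.0 - P_SMOKER) * p_yellow_given_smoker[False]
    return num / den

# Observation: yellow fingers are evidence about smoking.
p_observe = posterior_smoker(P_YELLOW_GIVEN_SMOKER)

# Intervention do(paint yellow): the arc smoking -> yellow is cut, so yellow is
# true with probability 1 whatever the smoking status.
p_do = posterior_smoker({True: 1.0, False: 1.0})
```

Here `p_observe` is well above the prior, while `p_do` equals the prior: painting someone's fingers tells us nothing about whether they smoke.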

Active learning

Learning from relational data. Can we learn concepts from a set of relations between objects, instead of, or in addition to, just their attributes?

Learning from relational data: approaches

ILP for learning protein folding: input. Each positive/negative (yes/no) example is described by 100 conjuncts covering its structure, e.g. TotalLength(D2mhr, 118) ^ NumberHelices(D2mhr, 6) ^ …

ILP for learning protein folding: results

ILP: Pros and Cons

The future of machine learning for bioinformatics? Oracle

The future of machine learning for bioinformatics [Figure labels: Learner, Prior knowledge, Replicated experiments, Biological literature, Hypotheses, Expt. design, Real world]

Decision trees [Figure: tree with tests blue?, big?, oval? and yes/no branches]. + Handles mixed variables; + Handles missing data; + Efficient for large data sets; + Handles irrelevant attributes; + Easy to understand; - Predictive power

Feedforward neural network. Input, hidden layer, output. Weights on each arc; sigmoid function at each node.
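
A minimal sketch of the forward pass through one hidden layer, with a weight on each arc and a sigmoid at each node. The weights are set by hand (an assumption for illustration, not learned) so that the network computes XOR, a function that no perceptron without a hidden layer can represent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Input -> hidden layer -> output, sigmoid at every node."""
    hidden = sigmoid(W1 @ x + b1)
    output = sigmoid(W2 @ hidden + b2)
    return hidden, output

# Hand-picked weights: hidden unit 0 acts like OR, hidden unit 1 like AND,
# and the output fires when OR is on but AND is off.
W1 = np.array([[20.0, 20.0],
               [20.0, 20.0]])
b1 = np.array([-10.0, -30.0])
W2 = np.array([[20.0, -20.0]])
b2 = np.array([-10.0])

inputs = [np.array(p, dtype=float) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]]
outputs = [float(forward(x, W1, b1, W2, b2)[1][0]) for x in inputs]
```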

Feedforward neural network. - Handles mixed variables; - Handles missing data; - Efficient for large data sets; - Handles irrelevant attributes; - Easy to understand; + Predictive power

Nearest Neighbor

Nearest Neighbor. - Handles mixed variables; - Handles missing data; - Efficient for large data sets; - Handles irrelevant attributes; - Easy to understand; + Predictive power

SVM: mathematical details

Replace all inner products with kernels Kernel function

SVMs: summary. - Handles mixed variables; - Handles missing data; - Efficient for large data sets; - Handles irrelevant attributes; - Easy to understand; + Predictive power. General lessons from SVM success:

Boosting: summary. + Handles mixed variables; + Handles missing data; + Efficient for large data sets; + Handles irrelevant attributes; - Easy to understand; + Predictive power

Supervised learning: summary

Inference [Figure: network over Earthquake, Radio, Burglary, Alarm, Call]

Assumption needed to make learning work

Structure learning success stories: gene regulation network (Friedman et al.)

Structure learning success stories II: Phylogenetic Tree Reconstruction (Friedman et al.). Uses structural EM, with max-spanning-tree in the inner loop. [Figure labels: leaf, 10 billion years]

Instances of graphical models. Probabilistic models include graphical models. Directed: Bayes nets (Naïve Bayes classifier, Mixtures of experts) and DBNs (Hidden Markov Model (HMM), Kalman filter model). Undirected: MRFs (Ising model).

ML enabling technologies
