This document provides an introduction to machine learning and its applications in genomics and biology. It discusses how biology and genomics data have become "big data" due to technological advances in sequencing and data generation. Machine learning is well-suited for analyzing these large, multidimensional datasets and addressing complex biological questions. The document outlines different machine learning approaches like supervised and unsupervised learning, and provides examples of real-world applications. R and Python are introduced as popular programming languages for machine learning.
18–21. Cells, Tissues, & Diseases → Functional Annotations
Image from encodeproject.org
Improve disease prevention, diagnosis, prognosis, and treatment efficacy
Multidimensional Data Sets
Big Data
23–24. Case study: The Cancer Genome Atlas
• Multiple data types for 11,000+ patients across 33 tumor types
• 549,625 files with 2,000+ metadata attributes
• >2.5 petabytes of data

1 petabyte of data =
20M four-drawer filing cabinets filled with text, or
13.3 years of HD-TV video, or
~7 billion Facebook photos, or
~2,000 years of continuous MP3 playback
29–32. Cells, Tissues, & Diseases → Functional Annotations
Image from encodeproject.org and xorlogics.com
Improve disease prevention, diagnosis, prognosis, and treatment efficacy
Multidimensional Data Sets
• We have lots of data and complex problems
• We want to make data-driven predictions and need to automate model building

Complex problems + Big Data -> Machine Learning!
• Allows us to better utilize these increasingly large data sets to capture their inherent structure
• Learning algorithms by training them with data
33–36. Machine Learning
• A data analysis method that automates analytical model building
• Makes data-driven predictions or discovers patterns without explicit human intervention
• Useful when you have complex problems and lots of data ("big data")

Traditional Programming: Data + Program -> Computer -> Output
(e.g., the inputs [2,3] and the program "+" produce the output 5)

Machine Learning: Data + Output -> Computer -> Program
(e.g., the inputs [2,3] and the output 5 let the computer learn the program "+")

• Our goal isn't to make perfect guesses, but to make useful guesses: we want to build a model that is useful for the future
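The data/program/output contrast above can be sketched in Python: in the machine-learning direction, a least-squares fit "learns" the program ("+") from example input/output pairs. The example values here are illustrative.

```python
import numpy as np

# Traditional programming: we supply the program ("+") and get the output.
def traditional(a, b):
    return a + b

# Machine learning: we supply inputs and outputs, and the computer learns
# the program. A least-squares fit recovers the weights of "+"
# (output = 1*a + 1*b) from example data.
X = np.array([[2, 3], [1, 1], [4, 5], [10, 7], [0, 6]], dtype=float)
y = X.sum(axis=1)  # observed outputs: 5, 2, 9, 17, 6

weights, *_ = np.linalg.lstsq(X, y, rcond=None)  # the learned "program"

print(traditional(2, 3))   # 5
print(X[0] @ weights)      # ~5.0, using the learned weights
```

The learned weights are (approximately) [1, 1], i.e., the model has recovered addition from data rather than from an explicit program.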
46. The Rise of Machine Learning
• Hardware advances: extreme-performance hardware (e.g., application-specific integrated circuits), smaller and cheaper hardware (Moore's law), cloud computing (e.g., AWS)
• Software advances: new machine learning algorithms, including deep learning and reinforcement learning
• Data advances: high-performance, less expensive sensors and data generation (e.g., wearables, next-gen sequencing, social media)

We often use R, but Python is also a great choice!
• R tends to be favored by statisticians and academics (for research)
• Python tends to be favored by engineers (with production workflows)
47. The R Programming Language
• Open-source implementation of S, which was originally developed at Bell Labs
• Free programming language and software environment for advanced statistical computing and graphics
• Functional programming language written primarily in C and Fortran
• Good at data manipulation, modeling and computing, and data visualization
• Cross-platform compatible
• Vast community (e.g., CRAN, R-bloggers, Bioconductor)
• Over 10,000 packages, including parallel/high-performance computing packages
• Used extensively by statisticians and academics
• Popularity has increased substantially in recent years
• Drawbacks: can have a steep learning curve (better recently), limited GUI (RStudio!), documentation can be sparse, memory allocation can be an issue
53. Iris Dataset in R
Fisher's/Anderson's iris data set: measurements (cm) of sepal length and width, petal length and width, and species (Iris setosa, versicolor, and virginica), i.e., 5 features (variables) for 150 flowers (observations)
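The slide uses R's built-in `iris` data frame; the same data set ships with scikit-learn, so a Python equivalent is a minimal sketch like this:

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame  # 150 observations x (4 measurements + species code)

print(df.shape)            # (150, 5)
print(iris.feature_names)  # the four sepal/petal measurements (cm)
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
```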
92–94. Iris Data: Adding Regularization
• Model building with a large number of features/variables for a moderate number of observations can result in "overfitting": the model is too specific to the training set and not generalizable enough for accurate predictions with new data
• Regularization is a technique for preventing this by introducing tuning parameters that penalize the coefficients of features/variables that are linearly dependent (redundant)
• This results in FEATURE SELECTION
• Example methods of regression with regularization: ridge, elastic net, LASSO
111–115. Iris Data: Adding Regularization (LASSO)
• Model building with a large number of features for a moderate number of samples can result in "overfitting": the model is too specific to the training set and not generalizable enough for accurate predictions with new data
• Regularization is a technique for preventing this by introducing tuning parameters that penalize the coefficients of variables that are linearly dependent (redundant) -> feature selection

Full model:
Species(setosa) ~ A*Petal.Length + B*Sepal.Width + C*Sepal.Length + D*Petal.Width + b

LASSO shrinks the coefficients of redundant features (here C and D) to 0, leaving:
Species(setosa) ~ A*Petal.Length + B*Sepal.Width + b

Fitted model:
Species(setosa) ~ 1.58*Sepal.Width - 2.36*Petal.Length + 5.96
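A minimal Python sketch of the same idea, using scikit-learn's L1-penalized logistic regression (a LASSO-style penalty) to predict setosa vs. not-setosa. The `C` value is an illustrative tuning choice, and which coefficients survive may differ from the slide's fitted values.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
is_setosa = (y == 0).astype(int)  # binary target: setosa vs. the rest

# The L1 penalty drives coefficients of redundant features to exactly zero,
# performing feature selection; C controls the penalty strength.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
model.fit(X, is_setosa)

print(model.coef_)  # some coefficients shrunk to exactly 0
print(model.score(X, is_setosa))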
116–119. Iris Data: Decision Trees
• Decision trees can take different data types (categorical, binary, numeric) as input/output variables, handle missing data and outliers well, and are intuitive
• Decision tree limitations: each decision boundary at each split is a concrete binary decision, and the decision criteria consider only one input feature at a time (not a combination of multiple input features)
• Examples: video games, clinical decision models

Tree learned on the iris training data:
Petal.Length < 2.35 cm -> Setosa (40/0/0)
otherwise, Petal.Width < 1.65 cm -> Versicolor (0/40/12); else Virginica (0/0/28)
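The slide's tree was fit in R; a minimal scikit-learn sketch of the same approach is below. The exact thresholds depend on the implementation and the train/test split, so they may not match the slide's values.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# A shallow tree: each split is a binary decision on a single feature,
# mirroring the petal-length / petal-width splits on the slide.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=load_iris().feature_names))
print(tree.score(X, y))
```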
122–123. Deep Learning (i.e., neural nets)
• Subfield of machine learning describing "human-like AI"
• Algorithms are structured in layers to create artificial neural networks that learn and make decisions without human intervention
• These networks represent the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts
• Deep learning algorithms (compared to other machine learning):
  • need a lot more data to perform well
  • need more/better hardware
  • typically identify and extract features without human intervention
  • usually solve problems end-to-end instead of in parts
  • take a lot longer to train
  • are typically less interpretable
• Ex: deep learning to automate resume scoring
  • Scoring performance may be excellent (i.e., near human performance)
  • Does not reveal why a particular applicant was given a score
  • Mathematically you can find out which nodes of the network were activated, but we don't know what those neurons were supposed to model or what the layers of neurons were doing collectively
  • Interpretation is difficult

"Neuron" diagram: inputs X1, X2 -> output (summation of inputs and activation with a sigmoid function)
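The "neuron" in the diagram (weighted sum of inputs followed by a sigmoid activation) can be written in a few lines of Python. The weights and bias here are hypothetical values; in practice they are learned during training.

```python
import numpy as np

def neuron(x, w, b):
    """Single 'neuron': weighted sum of inputs, then sigmoid activation."""
    z = np.dot(w, x) + b             # summation of weighted inputs
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid squashes z into (0, 1)

x = np.array([2.0, 3.0])   # inputs X1, X2
w = np.array([0.4, -0.2])  # hypothetical weights
b = 0.1                    # hypothetical bias

print(neuron(x, w, b))  # a value between 0 and 1
```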
124–126. Other Machine Learning Methods
• Neural nets
• Ensemble methods (e.g., bagging, boosting): combining multiple models
• Naive Bayes (based on prior probabilities)
• Hidden Markov models (Bayesian network with hidden states)
• K nearest neighbors (instance-based learning: clustering!)
• Support vector machines (discriminator defined by a separating hyperplane)
• And new methods coming out all the time…

Typical workflow: Raw data -> clean/normalize data -> split into training set and test set -> build model -> test -> tune model -> apply to new data (validation cohort or model application)

Algorithm selection is an important step!
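The workflow in the diagram can be sketched with scikit-learn: split the data, build and tune the model on the training set only, then evaluate once on the held-out test set. The classifier, split fraction, and candidate `k` values are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Split the cleaned data into a training set and a held-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Build and tune the model on the training set only (here, choosing k
# by cross-validation), then test once on the untouched test set.
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # estimate of generalization performance
```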
144. Iris Data: Neural Nets
• Neural networks (NNs) emulate how the human brain works with a network of interconnected neurons (essentially logistic regression units) organized in multiple layers, allowing more complex, abstract, and subtle decisions
• Lots of tuning parameters (# of hidden layers, # of neurons in each layer, and multiple ways to tune learning)
• Learning is an iterative feedback mechanism in which training-data error is used to adjust the corresponding input weights, propagated back to previous layers (i.e., back-propagation)
• NNs are good at learning non-linear functions and can handle multiple outputs, but have a long training time, and models are susceptible to local-minimum traps (can be mitigated by doing multiple rounds, which takes more time!)

"Neuron" diagram: inputs X1, X2 -> output (summation of inputs and activation with a sigmoid function)
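A minimal scikit-learn sketch of such a network on the iris data: one hidden layer of sigmoid-style units trained by back-propagation. The hidden-layer size, iteration cap, and random seed are illustrative tuning choices, not the slide's settings.

```python
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# One hidden layer of 8 neurons; weights are adjusted iteratively via
# back-propagation of the training error. Scaling the inputs first
# helps the optimizer converge.
nn = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0),
)
nn.fit(X, y)
print(nn.score(X, y))  # training accuracy
```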