Introduction to Machine Learning
Brittany N. Lasseigne, PhD
Senior Scientist
HudsonAlpha Institute for Biotechnology
8 December 2017
@bnlasse     blasseigne@hudsonalpha.org
• ‘Genomical’ and Biology Big Data
• Introduction to Machine Learning and R
• Machine Learning Algorithms
• Applying Machine Learning to Genomics Data + Problems
Biology Big Data
• Molecular and cellular profiling of large numbers of features in large numbers of samples (‘omics’ data)
• Image processing: cell microscopy, neuroimaging, radiology and histology, crop imagery, etc.
Esteva, et al. Nature, 2017.
Resources:
• Kan, Machine Learning applications in cell image analysis, Immunology and Cell Biology, 2017.
• Angermueller, et al. Deep learning for computational biology, Mol Syst Biol, 2016.
• Ching, et al. Opportunities and Obstacles for Deep Learning in Biology and Medicine, bioRxiv, 2017.
Complex Human Diseases:
usually caused by a combination of genetic, environmental, and lifestyle factors
(most of which have not yet been identified)
Cancer:
• Men have a 1 in 2 lifetime risk of developing cancer and a 1 in 4 lifetime risk of dying from cancer
• Women have a 1 in 3 lifetime risk of developing cancer and a 1 in 5 lifetime risk of dying from cancer
Psychiatric Illness:
• 1 in 4 American adults suffer from a diagnosable mental disorder in any given year
• ~6% suffer serious disabilities as a result
Neurodegenerative Disease:
• ~6.5M Americans suffer (AD, PD, MS, ALS, HD), expected to rise to 12M by 2030
American Cancer Society, 2015 & Harvard NeuroDiscovery Center, 2017.
• Which patients are at high risk of developing cancer?
• What are early biomarkers of cancer?
• Which patients are likely to be short- or long-term cancer survivors?
• Which chemotherapeutic might a cancer patient benefit from?
Improve disease prevention, diagnosis, prognosis, and treatment efficacy
Complex problems
Genomics
• Understanding the function of the genome (total genetic material) and how it relates to human disease (studying all of the genes at once!)
• The sequencing of the human genome paved the way for genomic studies
• Our goal is to identify genetic/genomic variation associated with disease to improve patient care
Sequencing
Multidimensional Data Sets (Big Data)
• Cells, Tissues, & Diseases
• Functional Annotations
Improve disease prevention, diagnosis, prognosis, and treatment efficacy
Image from encodeproject.org
Genomics	Data	is	Big	Data
Stephens, et al. PLOS Biology, 2015.
1	zettabyte	(ZB)	=	1024	EB	
1	exabyte	(EB)			=	1024	PB	
1	petabyte	(PB)	=	1024	TB		
1	terabyte	(TB)		=	1024	GB
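To get a feel for the unit ladder above, here is a minimal R sketch that converts between units using the factor of 1024 from the slide. The helper name `convert_units` is hypothetical and was not part of the talk.

```r
# Unit ladder from the slide: each step up is a factor of 1024.
units <- c("GB", "TB", "PB", "EB", "ZB")

# Hypothetical helper (not from the original talk): convert a value between units.
convert_units <- function(value, from, to) {
  # Positive when moving to a smaller unit (e.g. PB -> TB), negative otherwise.
  steps <- match(from, units) - match(to, units)
  value * 1024^steps
}

one_pb_in_tb <- convert_units(1, from = "PB", to = "TB")
one_zb_in_gb <- convert_units(1, from = "ZB", to = "GB")
print(one_pb_in_tb)
print(one_zb_in_gb)
```

The same helper works in the other direction, for example `convert_units(1024, "GB", "TB")`.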
Case study: The Cancer Genome Atlas
• Multiple data types for 11,000+ patients across 33 tumor types
• 549,625 files with 2000+ metadata attributes
• >2.5 Petabytes of data
1 Petabyte of Data =
20M four-drawer filing cabinets filled with text
or
13.3 years of HD-TV video
or
~7 billion Facebook photos
or
MP3 songs that would take ~2,000 years to play
Astronomical	‘Genomical’	Data:		
the	‘four-headed	beast’	of	the	data	life-cycle	(2025	Projections)
Stephens, et al. PLOS Biology, 2015 and nanalyze.com.
Multidimensional Data Sets
• Cells, Tissues, & Diseases
• Functional Annotations
Improve disease prevention, diagnosis, prognosis, and treatment efficacy
• We have lots of data and complex problems
• We want to make data-driven predictions and need to automate model building
Image from encodeproject.org and xorlogics.com.
Complex problems + Big Data -> Machine Learning!
• Allows us to better utilize these increasingly large data sets to capture their inherent structure
• Algorithms learn by training with data
Image from encodeproject.org and xorlogics.com.
Machine Learning
• A data analysis method that automates analytical model building
• Makes data-driven predictions or discovers patterns without explicit human intervention
• Useful when we have complex problems and lots of data (‘big data’)
Traditional Programming: Data + Program -> Computer -> Output
Example: [2,3] and + -> Computer -> 5
Machine Learning: Data + Output -> Computer -> Program
Example: [2,3] and 5 -> Computer -> +
• Our goal isn’t to make perfect guesses, but to make useful guesses: we want to build a model that is useful for the future
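The contrast between the two diagrams can be sketched in a few lines of R. This is only an illustration: the toy data frame `toy` and the idea of recovering "addition" with `lm` are assumptions made for the example, not something shown in the talk.

```r
# Traditional programming: we supply the data AND the program (here, addition).
inputs <- c(2, 3)
traditional_output <- sum(inputs)
print(traditional_output)

# Machine learning: we supply data and known outputs, and the computer
# estimates the "program". The toy data below is made up for illustration.
set.seed(1)
toy <- data.frame(x1 = runif(50, 0, 10), x2 = runif(50, 0, 10))
toy$y <- toy$x1 + toy$x2            # the hidden rule we hope to recover

learned <- lm(y ~ x1 + x2, data = toy)
print(round(coef(learned), 3))      # coefficients near 1 for x1 and x2 suggest addition

# Apply the learned "program" to the example from the slide.
ml_output <- predict(learned, newdata = data.frame(x1 = 2, x2 = 3))
print(unname(ml_output))
```

Real data is noisy, so a learned model is usually an approximation rather than an exact recovery of the rule.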
Known Data + Known Response?
YES: Supervised Learning
- Prediction
Ex. linear & logistic regression
Known Data + Known Response -> MODEL; NEW DATA -> MODEL -> Predict Response
NO: Unsupervised Learning
- Find patterns
Ex. clustering, principal component analysis
Uncategorized Data -> Clusters of Categorized Data
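A rough sketch of the two branches, using R's built-in iris data. The choice of logistic regression on `Petal.Width`, of k-means with three centers, and the new observation at 2.2 cm are all assumptions made for illustration.

```r
# Supervised: known data + known response -> model -> predict response for new data.
# Logistic regression: is a flower Iris virginica, given its petal width?
iris_binary <- transform(iris, is_virginica = as.integer(Species == "virginica"))
logit_model <- glm(is_virginica ~ Petal.Width, data = iris_binary, family = binomial)

new_flower <- data.frame(Petal.Width = 2.2)   # made-up new observation
prob_virginica <- predict(logit_model, newdata = new_flower, type = "response")
print(prob_virginica)

# Unsupervised: no response is given; look for structure in the measurements alone.
set.seed(123)                                  # k-means starts from random centers
clusters <- kmeans(iris[, 1:4], centers = 3, nstart = 20)
print(table(cluster = clusters$cluster, species = iris$Species))

# Principal component analysis: another way to find patterns without labels.
pca <- prcomp(iris[, 1:4], scale. = TRUE)
print(summary(pca)$importance[, 1:2])
```

The species labels are used only to inspect the clusters afterwards; k-means and PCA never see them.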
Real-World	Machine	Learning	Applications
Recommendation	Engine
Mail	Sorting
Self-Driving	Car
The Rise of Machine Learning
• Hardware Advances
– Extreme performance hardware (ex. application-specific integrated circuits)
– Smaller, cheaper hardware (Moore’s law)
– Cloud computing (ex. AWS)
• Software Advances
– New machine learning algorithms including deep learning and reinforcement learning
• Data Advances
– High-performance, less expensive sensors & data generation
– ex. wearables, next-gen sequencing, social media
The R Programming Language
We often use R, but Python is also a great choice!
• R tends to be favored by statisticians and academics (for research)
• Python tends to be favored by engineers (with production workflows)
• Open source implementation of S, which was originally developed at Bell Labs
• Free programming language and software environment for advanced statistical computing and graphics
• Functional programming language written primarily in C and Fortran
• Good at data manipulation, modeling and computing, and data visualization
• Cross-platform compatible
• Vast community (e.g., CRAN, R-bloggers, Bioconductor)
• Over 10,000 packages, including parallel/high-performance compute packages
• Used extensively by statisticians and academics
• Popularity has increased substantially in recent years
• Drawbacks: can have a steep learning curve (better recently), limited GUI (RStudio!), documentation can be sparse, memory allocation can be an issue
Open RStudio
Under File -> New File -> select R Script
We will be working in the R script panel (top left)
Iris Dataset in R
Fisher’s/Anderson’s iris data set:
measurements (cm) of the sepal length and width, petal length and width, and species (Iris setosa, versicolor, and virginica) (5 features or variables) for 150 flowers (observations)
Data inspection:
• summary function, tab out (tab completion), and ‘Help’ pages
• Built-in iris dataset (check out mtcars too!)
• Execute a line of code with ctrl+return or hit the ‘Run’ button
• Can also inspect data with the str (structure) function
• Examine the first 5 rows [x,] and first 5 columns [,y]
• The plot function
• $ notation for calling columns by name
• The cor.test function
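The inspection steps above were shown as screenshots; a minimal sketch of what they might look like as code follows. Which columns were plotted and correlated in the talk is not recoverable from the text, so the petal columns used here are an assumption.

```r
# iris ships with base R (the datasets package), so no download is needed.
print(summary(iris))        # per-column summary statistics
print(dim(iris))            # 150 observations, 5 variables
str(iris)                   # structure: column types and first few values
# Typing ?iris in the console opens the 'Help' page for the dataset.

print(iris[1:5, 1:5])       # first 5 rows [x,] and first 5 columns [,y]

plot(iris)                  # scatterplot matrix of every pair of variables

# $ notation pulls out a single column by name.
print(head(iris$Sepal.Length))

# Assumed example columns: petal length vs. petal width.
plot(iris$Petal.Length, iris$Petal.Width)
petal_cor <- cor.test(iris$Petal.Length, iris$Petal.Width)
print(petal_cor)
```

When run as a script rather than in RStudio, the plots are written to a file instead of the Plots panel.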
Iris Dataset:
Summarize/Descriptive Statistics (Observational)
Traditional Programming: Data + Program -> Computer -> Output
Example: Sepal.Length and mean(x) -> Computer -> 5.843
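The diagram above corresponds to a one-line computation in R. A minimal sketch, where the variable names `x` and `sepal_mean` are illustrative choices:

```r
# Data: the Sepal.Length column. Program: mean(x). Output: the average length in cm.
x <- iris$Sepal.Length
sepal_mean <- mean(x)
print(round(sepal_mean, 3))
```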
Data modeling:
• The lm function fits a line of the form y ~ mx + b
• Petal.Width ~ 0.4158*Petal.Length - 0.3631
• The abline function adds the regression line to our plot
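A sketch of the modeling steps as code, assuming the fit was made on the full built-in iris data (which reproduces the coefficients on the slide). The axis labels and the object name `fit` are illustrative.

```r
# Fit petal width as a linear function of petal length (y ~ mx + b).
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
print(coef(fit))   # intercept (b) and slope (m)

# Scatterplot of the data, then add the fitted regression line on top.
plot(iris$Petal.Length, iris$Petal.Width,
     xlab = "Petal length (cm)", ylab = "Petal width (cm)")
abline(fit, col = "purple")
```

Passing the `lm` object to `abline` lets R read the slope and intercept directly from the fit.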
Iris Dataset:
Linear Regression is Machine Learning!
• Purple line is a linear regression line fit to the data describing petal width as a function of petal length
• We can now PREDICT petal width given petal length
Petal.Width ~ 0.4158*Petal.Length - 0.3631
(y=mx+b)
Machine Learning: Data + Output -> Computer -> Program
Example: Petal.Length and Petal.Width -> Computer -> Petal.Width ~ 0.4158*Petal.Length - 0.3631
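To make the "PREDICT" step concrete, here is a small sketch that applies the fitted model to new petal lengths and compares the result with the rounded equation from the slide. The three petal lengths are made-up values for illustration.

```r
fit <- lm(Petal.Width ~ Petal.Length, data = iris)

# Hypothetical new flowers for which only petal length is known.
new_flowers <- data.frame(Petal.Length = c(1.5, 4.0, 6.0))

# Prediction from the fitted model object.
predicted <- predict(fit, newdata = new_flowers)

# The same prediction by hand, using the rounded coefficients from the slide.
by_hand <- 0.4158 * new_flowers$Petal.Length - 0.3631

print(cbind(new_flowers, predicted, by_hand))
```

The two columns differ only by the rounding of the coefficients.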
What do you notice about the plot?
Data wrangling:
• Examining the Iris species variable
• Coding species labels as categories to color the points by
• The palette function describes a vector of default colors for plotting in R
Data inspection:
• Plotting with data points colored by species (setosa, versicolor, virginica)
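One way these wrangling steps might look in code is sketched below. The exact calls used in the talk are not visible in the text, so the use of `as.numeric` for the category codes and the legend placement are assumptions.

```r
# Examine the species variable: a factor with three levels, 50 flowers each.
print(levels(iris$Species))
print(table(iris$Species))

# Code the species labels as integer categories (1, 2, 3) to color points by.
species_code <- as.numeric(iris$Species)

# palette() returns the vector of default plotting colors; color i is palette()[i].
print(palette())

# Plot with points colored by species.
plot(iris$Petal.Length, iris$Petal.Width, col = species_code, pch = 19,
     xlab = "Petal length (cm)", ylab = "Petal width (cm)")
legend("topleft", legend = levels(iris$Species), col = 1:3, pch = 19)
```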
Train an algorithm to classify Iris flowers by species
Fisher’s Iris Data (n=150)
• Training Set: n=105 (70%)
• Test Set: n=45 (30%)
Defining training and test sets:
• Use the nrow function to store the total number of observations in the Iris dataset
• Use the sample function to assign observations to the training set: x is 1:n and size is round(0.7*n) -> 0.7*150 = 105
• Assign the 105 selected observations to the training set
• Assign the remaining 45 observations (those not selected) to the test set
Note: I did not set a seed for this tutorial so you may get slightly different results. For more about setting seeds, see here: https://stackoverflow.com/questions/13605271/reasons-for-using-the-set-seed-function
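The split described above can be sketched as follows. The talk did not set a seed; `set.seed(2017)` is added here only so the example is repeatable, and `train_rows` is an illustrative name. The names `iristrain` and `iristest` follow the slides.

```r
# Total number of observations in the iris dataset.
n <- nrow(iris)

# Assumption: a seed is set for repeatability (the original tutorial did not set one).
set.seed(2017)

# Randomly pick 70% of the row indices for the training set.
train_rows <- sample(x = 1:n, size = round(0.7 * n))

# Selected rows form the training set; everything else forms the test set.
iristrain <- iris[train_rows, ]
iristest  <- iris[-train_rows, ]

print(nrow(iristrain))
print(nrow(iristest))
```

Negative indexing (`-train_rows`) drops the selected rows, so the two sets never overlap.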
Fisher’s Iris Data (n=150)
• Training Set: “iristrain”, n=105 (70%)
• Test Set: “iristest”, n=45 (30%)
Train an algorithm to classify Iris flowers by species
Iris Data: Adding Regularization
• Model building with a large # of features/variables for a moderate number of observations can result in ‘overfitting’: the model is too specific to the training set and not generalizable enough for accurate predictions with new data
• Regularization is a technique for preventing this by introducing tuning parameters that penalize the coefficients of features/variables, shrinking those that are linearly dependent (redundant)
• With LASSO and elastic net, this results in FEATURE SELECTION (ridge shrinks coefficients but keeps every feature)
• Example methods of regression with regularization: ridge, elastic net, LASSO
LASSO:
Least Absolute Shrinkage and Selection Operator:
• Linear regression: predictive analysis fitting a single line through data to describe relationships between one dependent variable and one or more independent variables
• LASSO regression: performs variable selection by including a penalty that forces some coefficient estimates to be exactly zero based on a tuning parameter (λ), yielding a sparse model
Figure axes: Credit Card Balance, Credit Limit
Images: Adapted from Tibshirani, et al.
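For reference, the penalty described above is usually written as the following optimization problem. This is the common textbook form rather than something shown on the slides; the symbols (n observations, p features, intercept β₀) and the 1/(2n) scaling are assumed conventions, and other references scale the first term differently.

```latex
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta_0,\,\beta}\;
    \frac{1}{2n}\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j\Bigr)^{2}
    \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

With λ = 0 this reduces to ordinary least squares; as λ grows, more of the β_j are driven to exactly zero.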
LASSO Tuning Parameter Selection
• Select tuning parameter by cross-validation:
– Partition data multiple times
– Compute cross-validation error rate for each tuning parameter
– Select tuning parameter value with smallest error
Example: 5-Fold Cross-Validation
Image: goldenhelix.com
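The three cross-validation steps can be sketched in base R. To keep the example free of extra packages, the "tuning parameter" here is the degree of a polynomial regression, which is a stand-in chosen for illustration and not the LASSO λ from the slides; the procedure is the same.

```r
set.seed(1)                          # fold assignment is random
n <- nrow(iris)
k <- 5

# Step 1: partition the data into 5 folds of equal size.
fold <- sample(rep(1:k, length.out = n))

# Candidate tuning parameter values (assumed stand-in: polynomial degree).
degrees <- 1:3

# Step 2: compute the cross-validated error for each candidate value.
cv_error <- sapply(degrees, function(d) {
  fold_mse <- sapply(1:k, function(i) {
    fit  <- lm(Petal.Width ~ poly(Petal.Length, d), data = iris[fold != i, ])
    pred <- predict(fit, newdata = iris[fold == i, ])
    mean((iris$Petal.Width[fold == i] - pred)^2)   # error on the held-out fold
  })
  mean(fold_mse)
})

# Step 3: select the value with the smallest cross-validated error.
best_degree <- degrees[which.min(cv_error)]
print(data.frame(degree = degrees, cv_error = cv_error))
print(best_degree)
```

Each observation is held out exactly once, so every error estimate comes from data the model did not see during fitting.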
Building a model for Iris species prediction:
• The glmnet package
• The cv.glmnet function
Note: I did not set a seed for this tutorial so you may get slightly different results. For more about setting seeds, see here: https://stackoverflow.com/questions/13605271/reasons-for-using-the-set-seed-function
• The cv.glmnet function on the training set, with x = as.matrix(iristrain[,-5]) and y = iristrain[,5]
• Use the predict function to evaluate the model in the test set
• Use the table function to view predicted species vs. actual species
• View the resulting predict object
• Examine the cv.glmnet object
• Plot the cv.glmnet object (top axis: # of predictors in the model; y-axis: Error; x-axis: Tuning Parameter Penalty)
– λmin: lambda with minimum cross-validated error
– λ1SE: largest lambda where error is within 1 standard error of minimum
• Comparing coefficients at lambda.min and lambda.1se
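The steps above might look roughly like the sketch below. Several details are assumptions because the original code was shown only as screenshots: the seed, `family = "multinomial"` (chosen because species has three classes), and evaluating at `lambda.min`. The block skips itself if glmnet is not installed.

```r
if (requireNamespace("glmnet", quietly = TRUE)) {
  library(glmnet)

  # Recreate the 70/30 split (seed is an assumption; the tutorial did not set one).
  set.seed(2017)
  n <- nrow(iris)
  train_rows <- sample(1:n, round(0.7 * n))
  iristrain <- iris[train_rows, ]
  iristest  <- iris[-train_rows, ]

  # Cross-validated LASSO (glmnet's default alpha = 1) on the training set.
  # Column 5 is Species; family = "multinomial" is an assumed setting.
  cvfit <- cv.glmnet(x = as.matrix(iristrain[, -5]), y = iristrain[, 5],
                     family = "multinomial")

  # Evaluate the model in the test set.
  predicted <- predict(cvfit, newx = as.matrix(iristest[, -5]),
                       s = "lambda.min", type = "class")
  confusion <- table(predicted = as.vector(predicted), actual = iristest[, 5])
  print(confusion)

  # Error curve across the penalty values, with lambda.min and lambda.1se marked.
  plot(cvfit)
  print(c(lambda.min = cvfit$lambda.min, lambda.1se = cvfit$lambda.1se))

  # Compare coefficients at the two penalty values; "." marks a zeroed coefficient.
  print(coef(cvfit, s = "lambda.min"))
  print(coef(cvfit, s = "lambda.1se"))
} else {
  message("glmnet is not installed; run install.packages('glmnet') first")
}
```

The larger penalty at lambda.1se typically zeroes out more coefficients, giving a sparser model at a small cost in cross-validated error.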
Iris Data: Adding Regularization (LASSO)
• Model building with a large # of features for a moderate number of samples can result in ‘overfitting’: the model is too specific to the training set and not generalizable enough for accurate predictions with new data
• Regularization is a technique for preventing this by introducing tuning parameters that penalize the coefficients of variables that are linearly dependent (redundant) -> Feature Selection
Full model: Species(setosa) ~ A*Petal.Length + B*Sepal.Width + C*Sepal.Length + D*Petal.Width + b
LASSO sets C and D to 0: Species(setosa) ~ A*Petal.Length + B*Sepal.Width + b
Machine Learning: Data + Output -> Computer -> Program
Example: Petal.Length, Sepal.Width, Sepal.Length, Petal.Width and Species -> Computer -> Species(setosa) ~ 1.58*Sepal.Width - 2.36*Petal.Length + 5.96
Iris Data: Decision Trees
• Decision trees can take different data types (categorical, binary, numeric) as input/output variables, handle missing data and outliers well, and are intuitive
• Decision tree limitations include that each decision boundary at each split is a concrete binary decision and the decision criteria only consider one input feature at a time (not a combination of multiple input features)
• Examples: Video games, clinical decision models
Example tree:
– Petal.Length < 2.35 cm: Setosa (40/0/0)
– Otherwise, Petal.Width < 1.65 cm: Versicolor (0/40/12)
– Otherwise: Virginica (0/0/28)
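A tree like the one above can be grown with the rpart package, which ships with standard R installations. The talk does not say which package produced its tree, so rpart, the seed, and the 70/30 split used here are assumptions; the split points will vary with the training sample.

```r
library(rpart)

# Assumed 70/30 split with a seed for repeatability.
set.seed(2017)
n <- nrow(iris)
train_rows <- sample(1:n, round(0.7 * n))
iristrain <- iris[train_rows, ]
iristest  <- iris[-train_rows, ]

# Grow a classification tree: each split tests one feature against one threshold.
tree <- rpart(Species ~ ., data = iristrain, method = "class")
print(tree)   # text view of the splits and the class counts in each node

# Evaluate on the held-out test set.
tree_pred <- predict(tree, newdata = iristest, type = "class")
tree_confusion <- table(predicted = tree_pred, actual = iristest$Species)
tree_accuracy <- mean(tree_pred == iristest$Species)
print(tree_confusion)
print(tree_accuracy)
```

The printed tree shows the limitation from the slide directly: every rule involves a single feature.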
Deep Learning (i.e. neural nets)
• Subfield of machine learning describing ‘human-like AI’
• Algorithms are structured in layers to create artificial neural networks that learn and make decisions without human intervention
• These networks represent the world as a nested hierarchy of concepts, with each defined in relation to simpler concepts
• Deep learning algorithms (compared to other machine learning):
– need a lot more data to perform well
– need more/better hardware
– typically identify and extract features without human intervention
– usually solve problems end-to-end instead of in parts
– take a lot longer to train
– are typically less interpretable
• Ex: Deep learning to automate resume scoring
– Scoring performance may be excellent (i.e. near human performance)
– Does not reveal why a particular applicant was given a score
– Mathematically you can find out which nodes of the network were activated, but we don’t know what those neurons were supposed to model or what the layers of neurons were doing collectively
– Interpretation is difficult
‘Neuron’: inputs X1 and X2 -> summation of inputs and activation with sigmoid function -> Output
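The single ‘neuron’ in the diagram can be written out in a few lines of R. The weights, bias, and input values below are hypothetical numbers picked for illustration; in a real network they would be learned during training.

```r
# Sigmoid activation: squashes any real number into the interval (0, 1).
sigmoid <- function(z) 1 / (1 + exp(-z))

# One neuron: weighted summation of the inputs plus a bias, then the activation.
neuron <- function(x, weights, bias) {
  sigmoid(sum(weights * x) + bias)
}

x       <- c(X1 = 0.5, X2 = -1.2)   # hypothetical inputs
weights <- c(0.8, 0.3)              # hypothetical weights
bias    <- 0.1                      # hypothetical bias

output <- neuron(x, weights, bias)
print(output)
```

A deep network stacks many such units in layers, feeding each layer's outputs into the next.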
Other	Machine	Learning	Methods
• Neural	Nets	
• Ensemble	Methods	(e.g.	bagging,	boosting)	
• Naive	Bayes	(based	on	prior	probabilities)	
• Hidden	Markov	Models	(Bayesian	network	
with	hidden	states)	
• K	Nearest	Neighbors	(instance-based	
learning—clustering!)	
• Support	Vector	Machines	(discriminator	
defined	by	a	separating	hyperplane)	
• Additional	Ensemble	Method	Approaches	
(combining	multiple	models)	
• And	new	methods	coming	out	all	the	time…
[Workflow figure: Raw Data -> Clean/Normalize Data -> Training Set and Test Set -> Build Model -> Test -> Tune Model (repeat as needed) -> Apply to New Data (Validation Cohort or Model Application)]
Algorithm	Selection	is	an	Important	Step!
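The workflow on this slide (raw data, clean/normalize, training and test sets, build, test, tune, apply to new data) can be sketched as follows. This is a toy illustration under stated assumptions: the data are made up, k-nearest neighbors stands in for whichever algorithm is selected, and all function names are hypothetical.

```python
from collections import Counter

def clean_and_normalize(rows):
    """Drop rows with a missing feature, then min-max scale the feature to [0, 1]."""
    kept = [(x, y) for x, y in rows if x is not None]
    lo = min(x for x, _ in kept)
    hi = max(x for x, _ in kept)

    def scale(x):
        return (x - lo) / (hi - lo)

    return [(scale(x), y) for x, y in kept], scale

def knn_predict(train, x, k):
    """Label x by majority vote among its k nearest training samples."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, test, k):
    """Fraction of test samples whose predicted label matches the true label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

# Raw data (made up): one feature per sample, labeled "high" above 50.
raw = [(v, "high" if v > 50 else "low") for v in range(0, 101, 5)]
raw.append((None, "low"))  # a sample with a missing value

# Clean/normalize, then split into training and test sets.
data, scale = clean_and_normalize(raw)
test = data[::4]
train = [row for i, row in enumerate(data) if i % 4 != 0]

# Build, test, and tune: keep the k with the best test-set accuracy.
best_k = max([1, 3, 5], key=lambda k: accuracy(train, test, k))
test_accuracy = accuracy(train, test, best_k)

# Apply to new data, scaled the same way as the training data.
new_predictions = [knn_predict(train, scale(v), best_k) for v in (12, 88)]
print(best_k, test_accuracy, new_predictions)
```

Because the test set is reused while tuning, its accuracy tends to be optimistic; the separate validation cohort in the workflow is what gives an independent check.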
Applying Machine Learning to Genomics Data + Problems
Improve disease prevention, diagnosis, prognosis, and treatment efficacy
• Which patients are at high risk for developing cancer?
• What are early biomarkers of cancer?
• Which patients are likely to be short- or long-term cancer survivors?
• Which chemotherapeutic might a cancer patient benefit from?
Complex problems + Big Data -> Machine Learning
Integrating genomic data with machine learning to improve
predictive modeling
Cross-Cancer	Patient	Outcome	Prediction	Model
‘Common Survival Genes’ across 19 cancers
• ‘Common Survival Genes’: Cox regression uncorrected p-value < 0.05 for a gene in at least 9/19 cancers
  • 84 genes, enriched for proliferation-related processes including mitosis, cell and nuclear division, and spindle formation
• Clustering by Cox regression p-values: 7 ‘Proliferative Informative Cancers’ and 12 ‘Non-Proliferative Informative Cancers’
[Figure: heatmap of scaled -log10 Cox p-values for the top cross-cancer survival genes across 19 cancers: ESCA, STAD, OV, LUSC, GBM, LAML, LIHC, SARC, BLCA, CESC, HNSC, BRCA, ACC, MESO, KIRP, LUAD, PAAD, LGG, KIRC]
Ramaker & Lasseigne, et al. 2017.
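The selection rule for ‘Common Survival Genes’ (uncorrected Cox p-value < 0.05 in at least 9 of 19 cancers) can be sketched as a simple filter. The input layout (a dictionary mapping each gene to its list of per-cancer p-values), the gene names, and the p-values below are all hypothetical; this is not the authors' code.

```python
def common_survival_genes(p_values, alpha=0.05, min_cancers=9):
    """Return genes whose uncorrected p-value is below alpha in at least min_cancers cancers.

    p_values: dict mapping gene name -> list of Cox regression p-values, one per cancer.
    """
    selected = []
    for gene, per_cancer in p_values.items():
        n_significant = sum(p < alpha for p in per_cancer)
        if n_significant >= min_cancers:
            selected.append(gene)
    return selected

# Made-up p-values for three placeholder genes across 19 cancers
toy_p_values = {
    "GENE_A": [0.01] * 10 + [0.50] * 9,   # significant in 10 cancers
    "GENE_B": [0.01] * 8 + [0.50] * 11,   # significant in only 8 cancers
    "GENE_C": [0.04] * 9 + [0.20] * 10,   # significant in exactly 9 cancers
}
selected_genes = common_survival_genes(toy_p_values)
print(selected_genes)
```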
[Figure builds: the same heatmap shown twice more, highlighting first the Proliferative Informative Cancers (PICs) and then the Non-Proliferative Informative Cancers (Non-PICs). Ramaker & Lasseigne, et al. 2017.]
Cross-Cancer Patient Outcome Model
~20,000 gene expression values -> Cox regression with LASSO feature selection -> Cancer Patient Survival
Survival ~ -0.104 + 0.086*ADAM12 + 0.037*CKS1 - 0.088*CRYL1 + 0.056*DNA2 + 0.013*DONSON + 0.098*HJURP - 0.022*NDRG2 + 0.031*RAD54B + 0.040*SHOX2 - 0.155*SUOX
Ramaker & Lasseigne, et al. 2017.
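As a rough sketch of how the ten-gene model above would be evaluated for one patient, the coefficients from the slide can be applied to a set of expression values. The expression values used here are placeholders, and the slide does not state how expression was normalized or how the resulting score maps onto survival, so both are left open.

```python
# Intercept and coefficients copied from the model on the slide
INTERCEPT = -0.104
COEFFICIENTS = {
    "ADAM12": 0.086, "CKS1": 0.037, "CRYL1": -0.088, "DNA2": 0.056,
    "DONSON": 0.013, "HJURP": 0.098, "NDRG2": -0.022, "RAD54B": 0.031,
    "SHOX2": 0.040, "SUOX": -0.155,
}

def outcome_score(expression):
    """Linear model score: intercept plus the sum of coefficient * expression per gene."""
    missing = sorted(set(COEFFICIENTS) - set(expression))
    if missing:
        raise KeyError(f"missing expression values for: {missing}")
    return INTERCEPT + sum(coef * expression[gene] for gene, coef in COEFFICIENTS.items())

# Placeholder expression profile (every gene set to 1.0), not real patient data
example_patient = {gene: 1.0 for gene in COEFFICIENTS}
print(round(outcome_score(example_patient), 3))
```

Only 10 of the ~20,000 genes carry a non-zero coefficient, which is the practical effect of LASSO feature selection: every other gene drops out of the sum.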
Take-Home Message
• Genomics generates big data to address complex biological problems, e.g., improving human disease prevention, diagnosis, prognosis, and treatment efficacy
• Machine learning is a data analysis method that automates analytical model building to make data-driven predictions or discover patterns without explicit human intervention
• Machine learning is a subfield of computer science, so the algorithms are implemented in code
• Machine learning is useful when we have complex problems with lots of ‘big’ data
Traditional Programming: Data + Program -> Computer -> Output
Example: [2,3] and + -> Computer -> 5
Machine Learning: Data + Output -> Computer -> Program
Example: [2,3] and 5 -> Computer -> +
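The contrast between the two diagrams can be made concrete with the same [2,3], +, 5 example. This is a toy sketch: the ‘learning’ step simply searches a small, hand-picked set of candidate operations, which is an assumption made for illustration and far simpler than any real learning algorithm.

```python
import operator

# Candidate 'programs' the toy learner may choose from (an illustrative assumption)
CANDIDATE_PROGRAMS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def run_program(data, program):
    """Traditional programming: data and a program go in, output comes out."""
    return program(*data)

def learn_program(examples):
    """Machine learning (toy): data and outputs go in, the matching program comes out."""
    for name, candidate in CANDIDATE_PROGRAMS.items():
        if all(candidate(*data) == output for data, output in examples):
            return name
    return None  # no candidate explains every example

output = run_program((2, 3), operator.add)
learned = learn_program([((2, 3), 5)])
print(output, learned)
```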
HudsonAlpha:		
hudsonalpha.org	
R	Programming	Language	and/or	Machine	Learning	(mostly	free):		
Software	Carpentry	(software-carpentry.org)	and	Data	Carpentry	(datacarpentry.org)	
coursera.org	and	datacamp.com	
Stanford	Online’s	‘Statistical	Learning’	class		
Books:	
Rosalind	Franklin:	The	Dark	Lady	of	DNA	by	Brenda	Maddox	(Female	scientist	biography)	
The	Emperor	of	All	Maladies	by	Siddhartha	Mukherjee	(History	of	cancer)	
The	Gene	by	Siddhartha	Mukherjee	(History	of	genetics)	
Genome	by	Matt	Ridley	(Human	Genome)	
Headstrong:	52	Women	Who	Changed	Science-and	the	World	by	Rachel	Swaby
Thanks!
Brittany	N.	Lasseigne,	PhD	
@bnlasse					blasseigne@hudsonalpha.org
Iris	Data:		Ensemble	Methods		
Example:		tree	bagging	and	boosting
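• Instead of picking a single model, ensemble methods combine multiple models fit to the training data
• ‘Bagging’ fits each model to a bootstrap resample of the training data and combines the predictions by voting or averaging; ‘boosting’ fits models in sequence, with each new model concentrating on the samples the earlier ones got wrong
• Random Forest is a Decision Tree Ensemble Method
Image: Machado, et al. Veterinary Research, 2015.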
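Bagging can be sketched with very small trees. The example below is an illustration under stated assumptions, not the Iris analysis from the talk: it uses made-up one-dimensional data, and each ‘tree’ is a single-split stump rather than a full decision tree.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Fit a one-split 'tree': choose the threshold with the fewest training errors."""
    best_threshold, best_errors = None, None
    for threshold, _ in sample:
        errors = sum((x > threshold) != label for x, label in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagged_predict(stumps, x):
    """Combine the stumps by majority vote."""
    votes = Counter(x > threshold for threshold in stumps)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
data = [(x, x > 5) for x in range(11)]  # toy data: the label is True when x > 5

stumps = []
for _ in range(25):
    bootstrap = [rng.choice(data) for _ in data]  # resample with replacement
    stumps.append(fit_stump(bootstrap))

print(bagged_predict(stumps, 0), bagged_predict(stumps, 10))
```

Each stump sees a slightly different resample, so their thresholds differ; the vote averages out those differences. A Random Forest applies the same idea to full decision trees.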
Iris	Data:	Neural	Nets
• Neural Networks (NNs) emulate how the human brain works with a network of interconnected neurons (essentially logistic regression units) organized in multiple layers, allowing more complex, abstract, and subtle decisions
• Lots of tuning parameters (# of hidden layers, # of neurons in each layer, and multiple ways to tune learning)
• Learning is an iterative feedback mechanism: the error on the training data is used to adjust the input weights, and that error is propagated back to previous layers (i.e., back-propagation)
• NNs are good at learning non-linear functions and can handle multiple outputs, but they take a long time to train and can get trapped in local minima (this can be mitigated by running multiple rounds of training, which takes more time)
[Figure: a single ‘neuron’ with inputs X1 and X2; the output is the summation of the inputs followed by activation with a sigmoid function]
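The feedback loop described above (use the training error to adjust the input weights) can be sketched for a single sigmoid neuron. This is a deliberate simplification: with only one neuron there are no earlier layers to propagate the error back to, so it shows the weight update but not full back-propagation. The OR data, learning rate, and epoch count are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, inputs):
    """Output of one sigmoid neuron (a logistic regression unit)."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train_neuron(examples, epochs=2000, learning_rate=0.5):
    """Repeatedly nudge the weights in the direction that reduces the training error."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            error = predict(weights, bias, inputs) - target
            # For the log-loss of a sigmoid unit, the gradient per weight is error * input
            weights = [w - learning_rate * error * x for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
    return weights, bias

# Toy training data: the logical OR of two binary inputs
OR_EXAMPLES = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train_neuron(OR_EXAMPLES)
for inputs, target in OR_EXAMPLES:
    print(inputs, target, round(predict(weights, bias, inputs), 3))
```

In a multi-layer network the same error signal is passed backward through each layer with the chain rule, so every layer's weights receive their own update.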
