Deep Learning Practice and Theory
Presentation by Daisuke Okanohara
At Summer School of Correspondence and Fusion of AI and Brain Science
Aug. 3rd, 2017.
  1. Deep Learning: Filling the gap between practice and theory
     Preferred Networks, Daisuke Okanohara, hillbig@preferred.jp
     Aug. 3rd 2017, Summer School of Correspondence and Fusion of AI and Brain Science
  2. Background: Unreasonable success of deep learning
     - DL succeeds in solving many complex tasks
       - Image recognition, speech recognition, natural language processing, robot control, computational chemistry, etc.
     - But we don't understand why DL works so well
       - Its success far outpaces our understanding
  3. Background: the DL research process has become close to the scientific process
     - Try first, examine next
       - First, we obtain an unexpectedly good result experimentally
       - We then find a theory that explains why it works so well
     - This process is different from previous ML research
       - Careful design of new algorithms sometimes (or often) doesn't work
       - Many results contradict our intuition
  4. Outline: three main unsolved problems in deep learning
     - Why can DL learn?
     - Why can DL recognize and generate real-world data?
     - Why can DL keep and manipulate complex information?
  5. Why can DL learn?
  6. Optimization in training DL
     - Learn a NN model f(x; θ) by minimizing a training error L(θ):
         L(θ) = Σ_i l(f(x_i; θ), y_i)
       where l(f(x_i; θ), y_i) is a loss function and θ is the set of parameters
     - E.g. a two-layer feed-forward NN (see the sketch below):
         f(x; θ) = a(W2 a(W1 x))
       where a is an element-wise activation function such as a(z) = max(0, z), and
         l(f(x_i; θ), y_i) = ||f(x_i; θ) - y_i||^2 (L2 loss)
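The model and loss on this slide fit in a few lines of NumPy. This is only an illustrative sketch: the layer sizes and the random data are assumptions, not values from the talk.

```python
import numpy as np

def relu(z):
    # element-wise activation a(z) = max(0, z)
    return np.maximum(0.0, z)

def f(x, W1, W2):
    # two-layer feed-forward network f(x; theta) = a(W2 a(W1 x))
    return relu(W2 @ relu(W1 @ x))

def training_loss(X, Y, W1, W2):
    # L(theta) = sum_i ||f(x_i; theta) - y_i||^2
    return sum(np.sum((f(x, W1, W2) - y) ** 2) for x, y in zip(X, Y))

# illustrative sizes: 10-dim inputs, 5 hidden units, 3-dim outputs
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(5, 10)), rng.normal(size=(3, 5))
X, Y = rng.normal(size=(100, 10)), rng.normal(size=(100, 3))
print(training_loss(X, Y, W1, W2))
```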
  7. Gradient descent / Stochastic gradient descent
     - Gradient descent
       - Compute the gradient of L(θ) with respect to θ, g(θ), then update θ using g(θ) as
           θ_{t+1} := θ_t - α_t g(θ_t)
         where α_t > 0 is a learning rate
     - Stochastic gradient descent (see the sketch below)
       - Since the exact computation of the gradient is expensive, we instead use an approximate gradient computed on a sampled subset of the data (mini-batch):
           g'(θ_t) = (1/|B|) Σ_{i∈B} ∇l(f(x_i; θ_t), y_i)
     (Figure: contour plot of L(θ) with one update step -αg)
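A minimal mini-batch SGD loop, shown here on a linear model so that the gradient fits on one line; the learning rate, batch size and synthetic data are assumptions for illustration, not settings from the talk.

```python
import numpy as np

def sgd(X, Y, W, lr=0.01, batch_size=8, steps=1000, seed=0):
    # mini-batch SGD on the L2 loss of a linear model f(x) = W x
    rng = np.random.default_rng(seed)
    for t in range(steps):
        batch = rng.choice(len(X), size=batch_size, replace=False)
        grad = np.zeros_like(W)
        for i in batch:
            err = W @ X[i] - Y[i]              # f(x_i; W) - y_i
            grad += 2.0 * np.outer(err, X[i])  # gradient of ||W x_i - y_i||^2
        W -= lr * grad / batch_size            # theta_{t+1} = theta_t - alpha * g'(theta_t)
    return W

rng = np.random.default_rng(1)
W_true = rng.normal(size=(3, 10))
X = rng.normal(size=(500, 10))
Y = X @ W_true.T
W = sgd(X, Y, np.zeros((3, 10)))
print(np.abs(W - W_true).max())  # shrinks toward 0 as the number of steps grows
```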
  8. Optimization in deep learning
     - L(θ) is highly non-convex and includes many local optima, plateaus and saddle points
       - In plateau regions, the gradient becomes almost zero and convergence becomes significantly slow
       - At saddle points, only a few directions decrease L(θ), and it is hard to escape from such points
     (Figure: loss surface with a plateau, saddle points and a local optimum)
  9. Miracle of deep learning training
     - It was believed that we cannot train large NNs using SGD
       - Impossible to optimize a non-convex problem with over a million dimensions
     - However, SGD can find a solution with low training error
       - When using a large model, it often finds a solution with zero training error
       - Moreover, the initialization doesn't matter (c.f. K-means requires a good initializer)
     - More surprisingly, SGD can find a solution with low test error
       - Although the model is over-parameterized, it does not overfit and achieves generalization
     - Practically OK, but we want to know why
  10. Why can DL learn?
     - Why does DL succeed in finding a solution with a low training error?
       - Although training is a highly non-convex optimization problem
     - Why does DL succeed in finding a solution with a low test error?
       - Although the NN is over-parameterized and there is no effective regularization
  11. Loss surface analysis using the spherical spin glass model (1/5) [Choromanska+ 2015]
     - Consider a DNN with ReLU σ(x) = max(0, x); its output can be written, with a normalization factor q, as a sum over paths
     - This can be re-expressed with A_{i,j} = 1 if path (i, j) is active and A_{i,j} = 0 if it is inactive
       - ReLU can be considered as a switch: a path is active if all ReLUs on it are active, and inactive otherwise
     (Figure: paths from input x_i to output Y; a path is active only if every ReLU on it is active)
  12. Loss surface analysis using the spherical spin glass model (2/5)
     - After several assumptions, this function can be re-expressed as an H-spin spherical spin-glass model
     - Now we can use the known analysis of the spherical spin-glass model
       - We know the distribution of critical points
       - k: index (the number of negative eigenvalues of the Hessian); k = 0: local minimum, k > 0: saddle point
  13. Loss surface analysis using the spherical spin glass model (3/5): distribution of critical points
     - Almost no critical points with large k above LE_inf -> few local minima
     - In the band [LE_0, LE_inf], many critical points with small k are found near LE_0 -> local minima are close to the global minimum
  14. Loss surface analysis using the spherical spin glass model (4/5): distribution of test losses
  15. Loss surface analysis using the spherical spin glass model (5/5): remaining problems
     - This analysis relies on several unrealistic assumptions, such as
       - "Each activation is independent of the inputs"
       - "Each path's input is independent"
     - Can we remove these assumptions, or show that they hold in most training cases?
  16. Depth creates no bad local minima [Lu+ 2017]
     - Non-convexity comes from depth and nonlinearity
     - Depth alone already creates non-convexity
       - Weight-space symmetry means there are many distinct configurations with the same loss value, which results in a non-convex epigraph
     - Consider the following feed-forward linear NN:
         min_W L(W) = ||W_H W_{H-1} ... W_1 X - Y||^2
       If X and Y have full row rank, then all local minima of L(W) are global minima [Theorem 2.3, Lu, Kawaguchi 2017]
  17. Deep and wide NNs also create no bad local minima [Nguyen+ 2017]
     - If the following conditions hold:
       - (1) The activation function σ is analytic on R and strictly monotonically increasing
       - (2) σ is bounded
       - (3) The loss function l(a) is twice differentiable, and l'(a) = 0 if a is a global minimum
       - (4) Training samples are linearly independent
       then every critical point at which the weight matrices have full column rank is a global minimum
       - These conditions are satisfied if we use sigmoid, tanh or softplus for σ and the squared loss for l
       - -> Solved for non-linear NNs under some conditions
  18. Why can DL learn? (recap)
     - Why does DL succeed in finding a solution with a low training error?
       - Although training is a highly non-convex optimization problem
     - Why does DL succeed in finding a solution with a low test error?
       - Although the NN is over-parameterized and there is no effective regularization
  19. NNs are over-parameterized but still generalize
     - Although the number of parameters of a DNN is much larger than the number of samples, the DNN does not overfit and achieves generalization
     - Larger models tend to achieve lower test error
     (Figure: test error vs. number of parameters. For conventional ML models, overfitting appears once the number of parameters exceeds the number of training samples; for DNNs no overfitting is observed and the test error keeps decreasing as the number of parameters grows)
  20. Random labeling experiment [Zhang+ 17]
     - Model capacity should be restricted to achieve generalization
       - C.f. Rademacher complexity, VC dimension, uniform stability
     - Conduct an experiment on a copy of the data where the true labels are replaced by random labels (see the sketch below)
       -> The NN model easily fits even random labels
     - Compare the results with and without regularization techniques
       -> No significant difference
     - Therefore the NN model has enough capacity to fit random labels, yet it generalizes well without regularization
       - For random labels the NN memorizes the samples, while for true labels the NN learns patterns that generalize [Arpit+ 17]
     - ... WHY?
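A toy version of the random-label experiment can be reproduced in a few lines, here with scikit-learn's MLPClassifier on synthetic data; the dataset, network width and iteration count are arbitrary choices for illustration, not the setup of [Zhang+ 17].

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))             # small synthetic inputs
y_random = rng.integers(0, 5, size=100)    # labels are pure noise

# an over-parameterized network still fits the noise (near 100% training accuracy)
clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=4000, tol=1e-6, random_state=0)
clf.fit(X, y_random)
print("training accuracy on random labels:", clf.score(X, y_random))
```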
  21. SGD plays a significant role in generalization
     - SGD performs approximate Bayesian inference [Mandt+ 17]
       - Bayesian inference provides samples following θ ~ P(θ|D)
     - SGD's noise removes the information about the input that is unnecessary for estimating the output [Shwartz-Ziv+ 17]
       - During training, the mutual information between the input and the network decreases while that between the network and the output is kept
     - Sharpness and weight norms also relate to generalization
       - Flat minima achieve generalization, but flatness depends on the scale of the weights
       - If we find a flat minimum with a small weight norm, then it achieves generalization [Neyshabur+ 17]
     (Figure: sharp minimum vs. flat minimum)
  22. Training always converges to a solution with low test error [Wu+ 17]
     - Even when we optimize the model from different initializations, it always converges to a solution with low test error
     - Flat minima have large basins of attraction while sharp minima have small basins
       - Almost all initial parameters converge to flat minima
     - Flat minima correspond to low model complexity, i.e. low test error
     - Question: why does NN training induce flat minima?
  23. Why can DL recognize and generate real-world data?
  24. Why does deep learning work? Lin's hypothesis [Lin+ 16]
     - Real-world phenomena have the following characteristics:
       1. Low-order polynomials
          - Known physical interactions have at most 4th-order polynomials
       2. Local interactions
          - The number of interactions between objects increases only linearly
       3. Symmetry
          - Few degrees of freedom
       4. Markovian
          - Most generation processes depend only on the previous state
     - -> DNNs can exploit these characteristics
  25. Generation and recognition (1/2)
     - Data x is generated from unknown factors z
     - Generation and recognition are inverse operations
       - Generation: z -> x
       - Recognition (inference): x -> z, i.e. infer the posterior P(z|x)
     - E.g. image generation and recognition
       - z: object, camera position, lighting condition, e.g. (Dragon, [10, 2, -4], white)
       - x: image
  26. Generation and recognition (2/2)
     - Data is often generated from multiple factors
       - Uninteresting factors are sometimes called covariates or disturbance variables
     - The generation process can be very complex
       - Each step can be non-linear
       - Gaussian and non-Gaussian noise is added at several steps
       - E.g. image rendering requires dozens of steps
     - In general, the generation process is unknown
       - Any modeled generation process is an approximation of the actual process
     (Figure: graphical model with several hidden factors generating the observed data x)
  27. Why do we consider generative models?
     - For more accurate recognition and inference
       - If we know the generation process, we can improve recognition and inference
         - "What I cannot create, I do not understand" (Richard Feynman)
         - "Computer vision is inverse computer graphics" (Geoffrey Hinton)
       - By inverting the generation process, we obtain the recognition process
     - For transfer learning
       - By changing covariates, we can transfer the learned model to other environments
     - For sampling examples to compute statistics and for validation
  28. E.g. mapping hand-written digits into 2D using a VAE
     - The original hand-written data is high-dimensional (784-dim)
     - If we map the data into a 2-dim space, digit types and shapes change smoothly
     - To classify "1", we only need to find a simple boundary in this space
  29. Representation learning is more powerful than nearest-neighbor methods and manifold learning
     - We can significantly reduce the number of required training samples by using representation learning [Arora+ 2017]
     - A distance metric or neighborhood notion defined on the original space may not work
       - In reality, samples with the same label are located in very different places in the original space; their region may not even be connected there
       - Ideally, nearby samples would help determine the label
     (Figure: "man with glasses" example)
  30. Real-world data is distributed on a low-dimensional manifold
     - Each point corresponds to a possible data sample; the data is distributed in a low-dimensional subspace
       - C.f. the distribution of galaxies in the universe
     - Why does a low-dimensional manifold appear?
       - Low-dimensional factors are converted to high-dimensional data without increasing the complexity [Lin+ 16]
  31. Original space and latent space
     - Generation maps the latent space to the original space; recognition maps back
     - In the latent space, the meaning of the data changes smoothly
  32. Learning is easy in the latent space
     - Since many tasks are related to the factors, the classification boundary becomes simple in the latent space
       - Many training examples are required in the original space; few are required in the latent space
  33. How to learn generative and inference models?
     - The generation process and its counterpart recognition process are highly non-linear and complex
     - -> Use deep neural networks to approximate them
       - Generation: x = f(z)
       - Recognition: z = g(x)
  34. Deep generative models
     Model                             | Fast sampling of x                    | Compute the likelihood P(x)         | Produce sharp images | Stable training
     VAE [Kingma+ 14]                  | √                                     | △ (lower bound; IW-VAE [Burda+ 15]) | X                    | √
     GAN [Goodfellow+ 14, 16] (IPM)    | √                                     | X                                   | √                    | X-△
     AutoRegressive [Oord+ 16ab]       | △-√ (parallel multi-scale [Reed+ 17]) | √                                   | √                    | √
     Energy model [Zhao+ 16] [Dai+ 17] | △-√                                   | △ (up to a constant)                | √                    | △
  35. VAE: Variational AutoEncoder [Kingma+ 14]
     - A neural network (the decoder) outputs a mean and covariance: (μ, σ) = Dec(z; φ)
     - Generate x in the following steps (see the sketch below):
       (1) Sample z ~ N(0, I)
       (2) Compute (μ, σ) = Dec(z; φ)
       (3) Sample x ~ N(μ, σI)
     - The defined distribution is p(x) = ∫ p(x|z) p(z) dz
     (Figure: decoder network mapping z to (μ, σ), with x ~ N(μ, σ))
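The three sampling steps above, written as a NumPy sketch. The decoder here is just a random two-layer stand-in and the dimensions are assumptions; a real VAE would use a trained Dec(z; φ).

```python
import numpy as np

rng = np.random.default_rng(0)
Z_DIM, X_DIM, HIDDEN = 2, 784, 128   # illustrative sizes

# stand-in decoder Dec(z; phi): a random two-layer net producing (mu, sigma)
W_h = rng.normal(scale=0.1, size=(HIDDEN, Z_DIM))
W_mu = rng.normal(scale=0.1, size=(X_DIM, HIDDEN))
W_logsig = rng.normal(scale=0.1, size=(X_DIM, HIDDEN))

def decode(z):
    h = np.tanh(W_h @ z)
    return W_mu @ h, np.exp(W_logsig @ h)   # (mu, sigma) = Dec(z; phi)

z = rng.standard_normal(Z_DIM)   # (1) sample z ~ N(0, I)
mu, sigma = decode(z)            # (2) compute (mu, sigma)
x = rng.normal(mu, sigma)        # (3) sample x ~ N(mu, sigma I)
print(x.shape)                   # (784,)
```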
  36. VAE: induced distribution
     - p(x|z) is a Gaussian, so p(x) = ∫ p(x|z) p(z) dz corresponds to an (infinitely large) mixture of Gaussians
       - The neural network can model a complex relation between z and x
  37. VAE: learning with maximum likelihood
     - Use maximum likelihood estimation to learn the parameters θ
     - Since the exact likelihood is intractable, we instead maximize a lower bound of the likelihood known as the ELBO (evidence lower bound):
         log p(x) ≥ E_{q(z|x)}[log p(x|z)] - KL(q(z|x) || p(z))
     - The proposal distribution q(z|x) should be close to the true posterior p(z|x)
       - Maximizing with respect to q(z|x) corresponds to minimizing KL(q(z|x) || p(z|x)), so we learn the encoder as a side effect
  38. Reparameterization trick
     - Since we take an expectation with regard to q(z|x), it is difficult to compute the gradient of the ELBO w.r.t. q(z|x)
     - -> We can use the reparameterization trick: sample ε ~ N(0, I) and set z = μ + σε (see the sketch below)
     - The converted computation graph can be regarded as an autoencoder where a noise σε is added to the latent variable μ
     (Figure: computation graph of the reparameterized encoder and decoder)
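A tiny numerical illustration of why the trick helps: for a toy objective E_{z~N(μ,σ²)}[z²] (an assumption chosen so the exact answer is known, not the ELBO itself), reparameterizing z = μ + σε lets us push the gradient inside the expectation and estimate it by simple Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7

# reparameterize: z = mu + sigma * eps with eps ~ N(0, 1),
# so d(z^2)/d(mu) = 2 z, which we can average over samples
eps = rng.standard_normal(100_000)
z = mu + sigma * eps
grad_estimate = np.mean(2.0 * z)

print(grad_estimate)   # Monte Carlo estimate
print(2.0 * mu)        # exact gradient of E[z^2] with respect to mu
```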
  39. The problem of maximum likelihood estimation for low-dimensional manifold data (1/2) [Arjovsky+ 17ab]
     - Maximum likelihood estimation (MLE) estimates a distribution P(x) using a model Q(x):
         L_MLE(P, Q) = Σ_x P(x) log Q(x)
       - Usually this is replaced with the empirical distribution: (1/N) Σ_i log Q(x_i)
     - For low-dimensional manifold data, P(x) = 0 for most x
     - To model such a P, Q(x) should also satisfy Q(x) = 0 for most x
     - But with such a Q(x), log Q(x_i) is undefined (or NaN) when Q(x_i) = 0, so we cannot optimize Q(x) using MLE
     - To solve this -> use a Q(x) such that Q(x_i) > 0 for all {x_i}
       - E.g. Q(x) = N(μ, σ); this means a sample is μ with added noise σ
  40. The problem of maximum likelihood estimation for low-dimensional manifold data (2/2)
     - MLE requires Q(x_i) > 0 for all {x_i}
     - To solve this -> use a Q(x) such that Q(x_i) > 0 for all {x_i}
       - Q(x) = N(μ, σ): a sample is μ with added noise σ
       - This produces blurry images
     - Another difficulty is that there is no notion of closeness w.r.t. the geometry of the space
       - When the areas of the intersection are the same, MLE gives the same score: although the left distribution is closer to the true distribution, the MLE scores are the same
     (Figure: two model distributions overlapping the true manifold by the same area but at different distances)
  41. GAN (Generative Adversarial Net) [Goodfellow+ 14, 17]
     - Two neural networks compete to learn a distribution
     - Generator (counterfeiter)
       - Goal: deceive the discriminator
       - Learns to generate realistic samples that can fool the discriminator
     - Discriminator (police)
       - Goal: detect samples produced by the generator
       - Learns to detect the difference between real and generated samples
     (Figure: the discriminator receives either a real sample or a generated one, chosen randomly)
  42. GAN: generative adversarial sampling
     - Sample x in the following steps:
       (1) Sample z ~ U(0, I)
       (2) Compute x = G(z) (without adding noise)
     - There is no noise-adding step at the end
  43. Training a GAN
     - Use a discriminator D(x)
       - Outputs 1 if x is estimated to be real and 0 otherwise
     - Train D to maximize V and G to minimize V (a toy training sketch follows below)
       - If learning succeeds, it reaches the Nash equilibrium ∫ p(z) G(z) dz = P(x), D(x) = 1/2
       - Since D provides dD(x)/dx to update G, the two networks actually cooperate to learn P(x)
     (Figure: z -> x' = G(z); the discriminator y = D(x) outputs 1 (real) or 0 (fake))
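A minimal 1-D GAN training loop as a sketch, assuming PyTorch is available. The toy data distribution, network sizes and the non-saturating generator loss (the variant commonly used in practice) are my assumptions, not details from the slides.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
real_batch = lambda n: torch.randn(n, 1) * 0.5 + 2.0   # true P(x): N(2, 0.5^2)

G = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    x_real, z = real_batch(64), torch.rand(64, 1)       # z ~ U(0, 1)
    x_fake = G(z)

    # discriminator step: push D(x_real) toward 1 and D(x_fake) toward 0
    d_loss = bce(D(x_real), torch.ones(64, 1)) + bce(D(x_fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # generator step (non-saturating): make D believe the fakes are real
    g_loss = bce(D(x_fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# the generated samples should drift toward the real mean 2.0 (toy GANs can still be unstable)
print(G(torch.rand(1000, 1)).mean().item())
```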
  44. Modeling a low-dimensional manifold
     - When z is low-dimensional, the deterministic function x = F(z) outputs a low-dimensional manifold in the x space (see the sketch below)
     - Using CNNs for G(z) and D(x) is also important
       - D(x) gives similar scores when x and x' are similar
     - A recent study showed that training without a discriminator can also generate realistic data [Bojanowski+ 17]
     - These two factors are important for producing realistic data
     (Figure: z ∈ R^1 mapped by x = F(z) onto a curve in x ∈ R^2)
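A quick NumPy check of the first bullet: pushing a 1-D latent through a deterministic, non-linear F (here a random two-layer net, an arbitrary stand-in) produces points that occupy essentially zero area in R^2, as the occupied fraction of ever finer grids shows.

```python
import numpy as np

rng = np.random.default_rng(0)

# deterministic generator F: R^1 -> R^2 (a random two-layer net as a stand-in)
W1, W2 = rng.normal(size=(1, 16)), rng.normal(size=(16, 2))
F = lambda Z: np.tanh(Z @ W1) @ W2        # applied row-wise to a column of latents

Z = rng.uniform(-3, 3, size=(100_000, 1)) # 1-D latent codes
X = F(Z)                                  # generated points in R^2

# the generated set traces a curve: the fraction of occupied grid cells shrinks
for bins in (10, 100, 1000):
    H, _, _ = np.histogram2d(X[:, 0], X[:, 1], bins=bins)
    print(f"{bins}x{bins} grid: occupied fraction = {np.count_nonzero(H) / bins**2:.4f}")
```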
  45. Demonstration of GAN training
     http://www.inference.vc/an-alternative-update-rule-for-generative-adversarial-networks/
     - Each generated sample follows dD(x)/dx
  46. Training a GAN with https://github.com/mattya/chainer-DCGAN: generated samples after 30 minutes of training
  47. After 2 hours
  48. After 1 day
  49. (image-only slide: further generated samples)
  50. LSGAN [Mao+ 16]
  51. Stacked GAN
     http://mtyka.github.io/machine/learning/2017/06/06/highres-gan-faces.html
  52. New GAN papers are coming out every week
     GAN Zoo: https://github.com/hindupuravinash/the-gan-zoo
     - Since GAN provides a new way to train a probabilistic model, many GAN papers are coming out (about 20 papers/month as of Jul. 2017)
     - Interpretations of the GAN framework
       - Wasserstein distance, integral probability metrics, inverse RL
     - New stable training methods
       - Lipschitzness of D, ensembles of Ds, etc.
     - New applications
       - Speech, text, inference models (q(z|x))
     - Conditional GANs
       - Multi-class, super-resolution, ...
  53. Super-resolution + regression loss for a perception network [Chen+ 17]
     - Generate photo-realistic images from segmentation results
       - High resolution, globally consistent, stable training
     (Figure: input segmentation map and output photo-realistic image)
  54. ICA: Independent Component Analysis (reference: [Hyvärinen 01])
     - Find components z that generate the data x:
         x = f(z)
       where f is an unknown function called the mixing function and the components are independent of each other: p(z) = Π_i p(z_i)
     - When f is linear and p(z_i) is non-Gaussian, we can identify f and z correctly
     - However, when f is nonlinear, we cannot identify f and z
       - There are infinitely many possible f and z
     - -> When the data is a time series x(1), x(2), ..., x(n) generated from sources z that are (1) non-stationary or (2) stationary independent sources, we can identify the non-linear f and z
  55. Non-linear ICA for non-stationary time series data [Hyvärinen+ 16]
     - When the sources are independent and non-stationary, we can identify the non-linear mixing function f and the sources z
     - Assumption: the sources change slowly
       - The sources can be considered stationary within a short time segment
       - Many interesting datasets have this property
     - Procedure (see the sketch below):
       1. Divide the time series into segments
       2. Train a multi-class classifier to classify each data point into its segment
       3. The last layer's features correspond to (a linear mixture of) the independent sources
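A rough sketch of the segment-classification recipe above, assuming scikit-learn; the synthetic non-stationary sources, the random mixing net and all sizes are illustrative assumptions rather than the experimental setup of [Hyvärinen+ 16].

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_segments, seg_len, n_sources = 20, 200, 3

# non-stationary independent sources: the variance changes from segment to segment
scales = rng.uniform(0.1, 2.0, size=(n_segments, n_sources))
z = np.concatenate([rng.normal(scale=s, size=(seg_len, n_sources)) for s in scales])
segment = np.repeat(np.arange(n_segments), seg_len)     # 1. segment index of each point

# unknown non-linear mixing x = f(z): a random two-layer net
M1, M2 = rng.normal(size=(n_sources, 8)), rng.normal(size=(8, 10))
x = np.tanh(z @ M1) @ M2

# 2. train a classifier that predicts the segment of each observed point
clf = MLPClassifier(hidden_layer_sizes=(32, n_sources), max_iter=2000, random_state=0)
clf.fit(x, segment)

# 3. the last hidden layer's activations are the recovered (linearly mixed) sources
h = x
for W, b in zip(clf.coefs_[:-1], clf.intercepts_[:-1]):
    h = np.maximum(h @ W + b, 0.0)       # ReLU hidden layers (sklearn default)
print("recovered feature shape:", h.shape)
```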
  56. Non-linear ICA for stationary time series data [Hyvärinen+ 17]
     - When the sources are independent and stationary, we can also identify the non-linear mixing function f and the sources z
     - The sources should be uniformly dependent (for x = s(t) and y = s(t-1))
     - Procedure:
       1. Train a binary classifier to distinguish whether a given pair of data points is adjacent, (x(t), x(t+1)), or random, (x(t), x(u))
       2. The last layer's features correspond to (a linear mixture of) the independent sources
  57. Conjectures [Okanohara]
     - Train a multi-class classifier with a very large number of classes (e.g. ImageNet); then the features of the last layer correspond to (a mixture of) independent components
       - To show this, we need a reasonable model relating the set of labels to the independent components
       - Dark knowledge [Hinton 14] is effective for transferring a model because it reveals the independent components
     - Similarly, GAN discriminators (or energy functions) also extract the independent components
  58. Why can DL keep and manipulate complex information?
  59. Levels of information abstraction
     - Abstract knowledge
       - Text, relations
       - Small volume; independent of problem/task/context
     - Model
       - Simulator / generative model
     - Raw experience
       - Sensory stream
       - Large volume; dependent on problem/task/context
  60. Local representation vs. distributed representation
     - Local representation
       - Each concept is represented by one symbol
       - E.g. Giraffe = 1, Panda = 2, Lion = 3, Tiger = 4
       - No interference, noise immunity, precise
     - Distributed representation
       - Each concept is represented by a set of symbols, and each symbol participates in representing many concepts
       - Generalizable, but less accurate and subject to interference
     (Table: concepts Giraffe, Panda, Lion and Tiger described by shared features such as "long neck", "four legs", "body hair" and "paw pads"; each feature participates in representing several concepts)
  61. High-dimensional vectors vs. low-dimensional vectors
     - High-dimensional vectors
       - Two random vectors are almost always nearly orthogonal
       - Many concepts can be stored within one vector, e.g. w = x + y + z (see the sketch below)
       - Same characteristics as a local representation
     - Low-dimensional vectors
       - Concepts interfere with each other
       - Cannot keep precise memories
       - Beneficial for generalization
     - Interference and generalization are strongly related
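A small NumPy demo of the two claims above: random high-dimensional vectors are nearly orthogonal, and a superposition w = x + y + z still responds strongly to each stored vector. The dimension 10,000 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000
unit = lambda v: v / np.linalg.norm(v)

x, y, z, other = (unit(rng.standard_normal(d)) for _ in range(4))
print("cos(x, y) =", round(float(x @ y), 4))     # close to 0: nearly orthogonal

w = x + y + z                                    # store three concepts in one vector
for name, v in [("x", x), ("y", y), ("z", z), ("other", other)]:
    print(f"cos(w, {name}) = {float(unit(w) @ v):+.3f}")
# the stored vectors score around 1/sqrt(3) ~ 0.58, the unrelated vector stays near 0
```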
  62. A two-layer feedforward network is a memory-augmented network [Vaswani+ 17]
     - Memory-augmented network (see the side-by-side sketch below):
         a = V Softmax(K q)
       - K is a key matrix (the i-th row is the key of the i-th memory)
       - V is a value matrix (the i-th column is the value of the i-th memory)
       - We may use winner-take-all instead of Softmax
     - Two-layer feedforward network:
         a = W2 ReLU(W1 x)
       - The i-th row of W1 corresponds to the key of the i-th memory
       - The i-th column of W2 corresponds to the value of the i-th memory
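The two read operations on this slide, side by side in NumPy; the sizes and random matrices are illustrative assumptions, and the same K and V are reused for both forms to stress the structural analogy.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, n_mem, d_out = 16, 64, 8

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

K = rng.normal(size=(n_mem, d_in))    # row i: key of memory cell i
V = rng.normal(size=(d_out, n_mem))   # column i: value of memory cell i
q = rng.normal(size=d_in)

a_memory = V @ softmax(K @ q)          # memory-augmented read: a = V Softmax(K q)
a_ffn = V @ np.maximum(K @ q, 0.0)     # two-layer read: a = W2 ReLU(W1 x), with W1 = K, W2 = V

print(a_memory.shape, a_ffn.shape)     # both produce a d_out-dimensional readout
```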
  63. A three-layer feed-forward network is also a memory-augmented network [Okanohara, unpublished]
     - A three-layer feed-forward network
         a = W3 ReLU(W2 ReLU(W1 x))
       can be viewed as using the first layer to compute the lookup vector and the later layers to store keys and values
       - Key (lookup vector): ReLU(W1 x)
       - The i-th row of W2 corresponds to the key of the i-th memory cell
       - The i-th column of W3 corresponds to the value of the i-th memory cell
  64. Two-layer NN update rule interpretation [Okanohara, unpublished]
     - For the two-layer feedforward network
         h = ReLU(W1 x), a = W2 h
       the update rule is
         dh = W2^T da
         dW2 = da h^T
         dW1 = diag(ReLU'(W1 x)) dh x^T = diag(ReLU'(W1 x)) W2^T da x^T
     - These update rules correspond to storing the error (da) as a value and the input (x) as a key in a memory network
       - Updates apply only to the active memories (those with ReLU'(W1 x) = 1)
  65. ResNet is a memory-augmented network [Okanohara, unpublished]
     - Since a ResNet has the form
         h = h + Resnet(h)
       and Resnet(h) consists of two layers, we can interpret it as recalling a memory and adding it to the current vector (see the sketch below)
       - The squeeze operation corresponds to limiting the number of memory cells
     - A ResNet looks up memory iteratively
       - A large number of steps = a large number of memory lookups
     - This interpretation is different from the shortcut view [He+ 15] or unrolled iterative estimation [Greff+ 16]
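A minimal NumPy sketch of the residual update read as iterated memory lookup; the depth, width and small random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_mem, depth = 32, 128, 10

def residual_block(h, W_key, W_val):
    # two-layer block: match h against keys (rows of W_key), read out values (columns of W_val)
    return W_val @ np.maximum(W_key @ h, 0.0)

blocks = [(rng.normal(scale=0.05, size=(n_mem, d)),
           rng.normal(scale=0.05, size=(d, n_mem))) for _ in range(depth)]

h = rng.standard_normal(d)
for W_key, W_val in blocks:
    h = h + residual_block(h, W_key, W_val)   # h = h + Resnet(h): recall a memory and add it
print(h.shape)
```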
  66. Infinite memory network
     - What happens if we increase the number of hidden units iteratively for each training sample?
       - This is similar to Memory Networks, which store previous hidden activations in an explicit memory, and to Progressive Networks [Rusu+ 16], which incrementally add a new network (and fix the old one) for each new task
     - We expect that this can prevent catastrophic forgetting and achieve one-shot learning
       - How do we make sure it generalizes?
  67. Conclusion
     - There are still many unsolved problems in DNNs
       - Why can DNNs learn in general settings?
       - How should real-world information be represented?
     - There are still many unsolved problems in AI
       - Disentanglement of information
       - One-shot learning using attention and memory mechanisms
         - Avoiding catastrophic forgetting and interference
       - Stable, data-efficient reinforcement learning
       - How to abstract information
         - Grounding (language), strong noise (e.g. dropout), extracting hidden factors by using (non-)stationarity or commonality among tasks
  68. References
     - [Choromanska+ 2015] "The loss surface of multilayer networks", A. Choromanska et al., AISTATS 2015
     - [Lu+ 2017] "Depth Creates No Bad Local Minima", H. Lu et al., arXiv:1702.08580
     - [Nguyen+ 2017] "The loss surface of deep and wide neural networks", Q. Nguyen et al., arXiv:1704.08045
     - [Zhang+ 2017] "Understanding deep learning requires rethinking generalization", C. Zhang et al., ICLR 2017
     - [Arpit+ 2017] "A Closer Look at Memorization in Deep Networks", D. Arpit et al., ICML 2017
     - [Mandt+ 2017] "Stochastic Gradient Descent as Approximate Bayesian Inference", S. Mandt et al., arXiv:1704.04289
     - [Shwartz-Ziv+ 2017] "Opening the Black Box of Deep Neural Networks via Information", R. Shwartz-Ziv et al., arXiv:1703.00810
  69. References (continued)
     - [Neyshabur+ 17] "Exploring Generalization in Deep Learning", B. Neyshabur et al., arXiv:1706.08947
     - [Wu+ 17] "Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes", L. Wu et al., arXiv:1706.10239
     - [Lin+ 16] "Why does deep and cheap learning work so well", H. W. Lin et al., arXiv:1708.08226
     - [Arora+ 17] "Provable benefits of representation learning", S. Arora et al., arXiv:1706.04601
     - [Kingma+ 14] "Auto-Encoding Variational Bayes", D. P. Kingma et al., ICLR 2014
     - [Burda+ 15] "Importance Weighted Autoencoders", Y. Burda et al., arXiv:1509.00519
  70. References (continued)
     - [Goodfellow+ 14] "Generative Adversarial Nets", I. Goodfellow et al., NIPS 2014
     - [Goodfellow 16] "NIPS 2016 Tutorial: Generative Adversarial Networks", I. Goodfellow, arXiv:1701.00160
     - [Oord+ 16a] "Conditional Image Generation with PixelCNN Decoders", A. Oord et al., NIPS 2016
     - [Oord+ 16b] "WaveNet: A Generative Model for Raw Audio", A. Oord et al., arXiv:1609.03499
     - [Reed+ 17] "Parallel Multiscale Autoregressive Density Estimation", S. Reed et al., arXiv:1703.03664
     - [Zhao+ 17] "Energy-based Generative Adversarial Network", J. Zhao et al., arXiv:1609.03126
     - [Dai+ 17] "Calibrating Energy-based Generative Adversarial Networks", Z. Dai et al., ICLR 2017
  71. References (continued)
     - [Arjovsky+ 17a] "Towards Principled Methods for Training Generative Adversarial Networks", M. Arjovsky et al., arXiv:1701.04862
     - [Arjovsky+ 17b] "Wasserstein Generative Adversarial Networks", M. Arjovsky et al., ICML 2017
     - [Bojanowski+ 17] "Optimizing the Latent Space of Generative Networks", P. Bojanowski et al., arXiv:1707.05776
     - [Chen+ 17] "Photographic Image Synthesis with Cascaded Refinement Networks", Q. Chen et al., arXiv:1707.09405
     - [Hyvärinen+ 01] "Independent Component Analysis", A. Hyvärinen et al., John Wiley & Sons, 2001
     - [Hyvärinen+ 16] "Unsupervised Feature Extraction by Time-Contrastive Learning and Nonlinear ICA", A. Hyvärinen et al., NIPS 2016
     - [Hyvärinen+ 17] "Nonlinear ICA of Temporally Dependent Stationary Sources", A. Hyvärinen et al., AISTATS 2017
  72. References (continued)
     - [Vaswani+ 17] "Attention Is All You Need", A. Vaswani et al., arXiv:1706.03762 (the idea appears only in version 3: https://arxiv.org/abs/1706.03762v3)
     - [He+ 15] "Deep Residual Learning for Image Recognition", K. He et al., arXiv:1512.03385
     - [Rusu+ 16] "Progressive Neural Networks", A. Rusu et al., arXiv:1606.04671