3. Perceptrons
One of the earliest supervised training algorithms is that of the perceptron, a basic neural network building block.
Transfer function: f(x) = x · w + b
Activation function: h(x) = 1 if x > 0, 0 otherwise
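To make this concrete, here is a minimal perceptron sketch in plain Python (our illustration, not code from the slides; the AND dataset, learning rate, and epoch count are assumed choices):

```python
# Minimal perceptron: a weighted-sum transfer function followed by a
# step activation, trained with the classic perceptron learning rule.

def perceptron(x, w, b):
    """Output 1 if the transfer function f(x) = x . w + b exceeds 0, else 0."""
    f = sum(xi * wi for xi, wi in zip(x, w)) + b   # transfer function
    return 1 if f > 0 else 0                       # step activation

def train(samples, labels, w, b, lr=0.1, epochs=20):
    """Nudge the weights toward each misclassified example."""
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            err = y - perceptron(x, w, b)
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

# Learn a linearly separable function (logical AND):
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
w, b = train(inputs, [0, 0, 0, 1], [0.0, 0.0], 0.0)
print([perceptron(x, w, b) for x in inputs])  # [0, 0, 0, 1]
```

On the XOR labels [0, 1, 1, 0], the same training loop never settles on correct weights, which is exactly the drawback discussed next.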
4. Drawbacks
The single perceptron has one major drawback: it can only learn linearly separable functions.
How major is this drawback? Take XOR, a relatively simple function, and notice that it can't be classified by a linear separator, as the short derivation below shows.
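Here is an added derivation of that claim: suppose a perceptron with weights w_1, w_2 and bias b did compute XOR, and follow the four input cases to a contradiction.

```latex
% If a perceptron h(x) = [w_1 x_1 + w_2 x_2 + b > 0] computed XOR:
\begin{align*}
\mathrm{XOR}(0,0)=0 &\;\Rightarrow\; b \le 0\\
\mathrm{XOR}(1,0)=1 &\;\Rightarrow\; w_1 + b > 0\\
\mathrm{XOR}(0,1)=1 &\;\Rightarrow\; w_2 + b > 0\\
\mathrm{XOR}(1,1)=0 &\;\Rightarrow\; w_1 + w_2 + b \le 0
\end{align*}
% Adding the two middle inequalities gives w_1 + w_2 + 2b > 0, i.e.
% w_1 + w_2 + b > -b \ge 0 (since b \le 0), contradicting the last line.
```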
5. Multilayer networks
Multilayer networks could learn complicated things, and they did, but very slowly. What emerged from this second neural network revolution was that we had a good theory, but learning was slow and the results, while good, were not amazing.
The real question, which received very little attention for such an important one, was: why don't multilayer networks learn?
7. If each of our perceptrons is only allowed to use a linear activation function, then the final output of our network will still be some linear function of the inputs, just adjusted with a ton of different weights that it has collected throughout the network.
A linear composition of a bunch of linear functions is still just a linear function, so most neural networks use non-linear activation functions.
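A quick numerical check of this collapse (our sketch; the layer sizes are arbitrary):

```python
# Two stacked linear layers are exactly one linear layer with
# W = W2 @ W1 and b = W2 @ b1 + b2, so depth alone adds no power.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)  # layer 1: R^3 -> R^4
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)  # layer 2: R^4 -> R^2

x = rng.normal(size=3)
two_layers = W2 @ (W1 @ x + b1) + b2
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layers, one_layer))  # True
```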
8. Hidden layers
A single hidden layer is powerful enough to learn any function.
This means:
We often learn better in practice with multiple hidden layers.
In other words:
Deeper networks.
And finally…
9. The Problem with Large Networks
The problem is that it is fairly easy to create things that behave like neurons, the brain's major component. What is not easy is working out what the whole thing does once you have assembled it.
Why don't multilayer networks learn?
It all had to do with the way the training errors were being passed back from the output layer to the deeper layers of artificial neurons.
10. Vanishing gradient
"vanishing gradient" problem meant that as soon as a neural network got reasonably good at a task the lower layers didn't really get any information about how to change to help do the task better.
because the error in a layer gets "split up" and partitioned out to each unit in the layer. This, in turn, further reduces the gradient in layers below that.
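A rough numerical illustration of this effect (our sketch with assumed weight scales): the sigmoid's derivative never exceeds 0.25, so by the chain rule the gradient reaching each earlier layer shrinks roughly geometrically with depth.

```python
# Track the gradient magnitude flowing back through a chain of
# sigmoid units: each layer multiplies it by w * sigmoid'(z) <= 0.25 * |w|.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x, grad = rng.normal(), 1.0
for depth in range(1, 9):
    w = rng.normal(scale=0.5)        # a typical modest weight
    x = sigmoid(w * x)
    grad *= w * x * (1 - x)          # chain rule through sigmoid(w * x)
    print(f"layer {depth}: |gradient| ~ {abs(grad):.2e}")
```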
13. Autoencoders
An autoencoder is typically a feedforward neural network which aims to learn a compressed, distributed representation (encoding) of a dataset.
The key observation concerns the number of neurons in each layer:
• The input and output layers have the same number of neurons.
• The middle layer has fewer neurons than the input and output layers.
14. What… auto… what!?
The intuition behind this architecture is that the network will not learn a "mapping" between the training data and its labels, but will instead learn the internal structure and features of the data itself.
(Because of this, the hidden layer is also called a feature detector.)
15. So what…!
Usually, the number of hidden units is smaller than the input/output layers, which forces the network to learn only the most important features and achieves a dimensionality reduction.
Dimensionality reduction = feature selection and feature extraction.
In this way, we're attempting to learn the data in a truer sense.
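For concreteness, a minimal autoencoder sketch assuming TensorFlow's Keras is available (the 64-unit input/output, 16-unit bottleneck, and random training data are illustrative choices, not from the slides):

```python
# Autoencoder: same-size input and output layers, a smaller hidden
# bottleneck, and the input itself as the training target.
import numpy as np
from tensorflow import keras

inputs = keras.Input(shape=(64,))                             # input layer
code = keras.layers.Dense(16, activation="relu")(inputs)      # bottleneck
outputs = keras.layers.Dense(64, activation="sigmoid")(code)  # reconstruction

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

x = np.random.rand(1000, 64).astype("float32")
autoencoder.fit(x, x, epochs=5, batch_size=32, verbose=0)     # targets = inputs
```

After training, the 16-unit code is the compressed representation; the encoder half alone performs the dimensionality reduction.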
17. Restricted Boltzmann Machines
Let's start with classical factor analysis.
Example:
Suppose you ask a bunch of users to rate a set of movies on a 0-100 scale. In classical factor analysis, you could then try to explain each movie and user in terms of a set of latent factors.
18. Restricted Boltzmann Machines
Restricted Boltzmann Machines essentially perform a binary version of factor analysis.
Instead of users rating a set of movies on a continuous scale, they simply tell you whether they like a movie or not, and the RBM will try to discover latent factors that can explain the activation of these movie choices.
A Restricted Boltzmann Machine is a stochastic neural network
(stochastic meaning these activations have a probabilistic element).
19. Restricted Boltzmann Machines
RBMs are composed of a hidden and a visible layer.
Unlike feedforward networks, the connections between the visible and hidden layers are undirected
(the values can be propagated in both the visible-to-hidden and hidden-to-visible directions).
20. Contrastive divergence training
• Positive phase:
An input sample v is clamped to the input layer.
v is propagated to the hidden layer in a similar manner to the feedforward networks.
The result of the hidden layer activations is h.
After this, the so-called positive statistic is computed for every pair of nodes: it equals the product of the network's input at visible node i and its hidden activation at node j.
Positive(e_ij) = v_i * h_j
• Negative phase:
Propagate h back to the visible layer with result v'. This backward pass reconstructs the input from the hidden activations.
Propagate the new v' back to the hidden layer with activations result h'.
The so-called negative statistic is then computed for every pair of nodes: it equals the product of the reconstructed input at node i and the new hidden activation at node j.
Negative(e_ij) = v'_i * h'_j
• Weight update (a numpy sketch of the full CD-1 step follows below):
w_ij = w_ij + L * (Positive(e_ij) − Negative(e_ij))
where L is the learning rate.
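Putting the three phases together, here is one CD-1 update sketched in numpy (an illustrative implementation; the layer sizes, learning rate, and sampling details are our assumptions):

```python
# One step of contrastive divergence (CD-1) for a small RBM.
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 3, 0.1
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(v, W):
    # Positive phase: clamp v, propagate it to the hidden layer.
    h_prob = sigmoid(v @ W)
    h = (rng.random(n_hidden) < h_prob).astype(float)  # stochastic activations
    positive = np.outer(v, h_prob)                     # Positive(e_ij) = v_i * h_j

    # Negative phase: reconstruct v', then propagate it back up to get h'.
    v_prime = sigmoid(W @ h)        # undirected weights reused hidden -> visible
    h_prime = sigmoid(v_prime @ W)
    negative = np.outer(v_prime, h_prime)              # Negative(e_ij) = v'_i * h'_j

    # Weight update: w_ij = w_ij + L * (Positive(e_ij) - Negative(e_ij))
    return W + lr * (positive - negative)

v = np.array([1, 0, 1, 1, 0, 0], dtype=float)          # e.g. binary movie choices
W = cd1_step(v, W)
```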
22. Why deep learning now:
What's different is that we can run very large and very deep networks on fast GPUs (sometimes with billions of connections, and 12 layers) and train them on large datasets with millions of examples.
23. What is wrong with back-propagation?
• It requires labelled training data, yet almost all data is unlabeled.
• The learning time does not scale well: it is very slow in networks with multiple hidden layers.
• It can get stuck in poor local optima. These are often quite good, but for deep nets they are far from optimal.
24. Training a deep network
• First train a layer of features that receive input directly from the pixels.
• Then treat the activations of the trained features as if they were pixels and learn features of features in a second hidden layer.
• Do it again (a sketch of this layer-wise procedure follows below).
• It can be proved (we're not going to do it!) that each time we add another layer of features we improve a variational lower bound on the log probability of generating the training data.
• That is, each added layer is an improvement.
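A compact sketch of this greedy, layer-wise recipe (our illustration: train_rbm is a hypothetical helper using a simplified, mean-field CD-1, and the data and layer sizes are assumed):

```python
# Greedy layer-wise pretraining: train an RBM on the input, then treat
# its hidden activations as the "pixels" for the next RBM, and repeat.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, lr=0.1, epochs=10, seed=0):
    """Train one RBM layer with simplified CD-1; return its weights."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(data.shape[1], n_hidden))
    for _ in range(epochs):
        for v in data:
            h = sigmoid(v @ W)                 # positive phase
            v_prime = sigmoid(W @ h)           # reconstruction
            h_prime = sigmoid(v_prime @ W)     # negative phase
            W += lr * (np.outer(v, h) - np.outer(v_prime, h_prime))
    return W

data = (np.random.rand(200, 64) > 0.5).astype(float)  # stand-in "pixel" data
weights = []
for n_hidden in (32, 16, 8):                  # one pass per new layer
    W = train_rbm(data, n_hidden)
    weights.append(W)
    data = sigmoid(data @ W)                  # features become the next input
```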
25. Who is working on deep learning?
Who are the people actually working on this stuff?
26. Geoffrey Hinton
He is the co-inventor of the backpropagation and contrastive divergence training algorithms and is an important figure in the deep learning movement.
‘I get very excited when we discover a way of making neural networks better —and when that’s closely related to how the brain works.’
—Geoffrey Hinton
Google hired him along with two of his University of Toronto graduate students.
27. Yann LeCun
A computer science researcher with contributions in machine learning, known for his work on optical character recognition and computer vision using convolutional neural networks.
He has been much in the news lately as one of the leading experts in deep learning.
Facebook has created a new research laboratory with the ambitious, long-term goal of bringing about major advances in Artificial Intelligence.
Director of AI Research, Facebook
28. Andrew Ng
On May 16, 2014, Ng announced on his Coursera blog that he would be stepping away from his day-to-day responsibilities at Coursera and joining Baidu as Chief Scientist, working on the Baidu Brain project.
Coursera co-founder
29. Facts:
• At some point in the late 1990s, one of these ConvNet-based systems was reading 10 to 20% of all the checks in the US.
• ConvNets are now widely used by Facebook, Google, Microsoft, IBM, Baidu, NEC and others for image and speech recognition.
30. Example
Create an algorithm to distinguish dogs from cats
In this competition, you'll write an algorithm to classify whether images contain either a dog or a cat.
A student of Yann LeCun recently won the Dogs vs. Cats competition using a version of ConvNet, achieving 98.9% accuracy.
31. ImageNet LSVRC-2010 contest
• The task: classify 1.2 million high-resolution images into 1,000 different classes.
• The best system in the 2010 competition got 47% error for its first choice and 25% error for its top 5 choices.
• A very deep neural net (Krizhevsky et al., 2012) gets less than 40% error for its first choice and less than 20% for its top 5 choices.
35. The Speech Recognition Task (Mohamed, Dahl, & Hinton, 2012)
• Deep neural networks pioneered by George Dahl and Abdel-rahman Mohamed are now replacing the previous machine learning method for the acoustic model.
• After standard processing, a deep net with 8 layers gets a 20.7% error rate.
• The best previous speaker-independent result was 24.4%, and that required averaging several models.