Learn End-to-End Learning, Multi-Task Learning, Transfer Learning and Meta Learning. This was one of the lectures of a full course I taught at the University of Moratuwa, Sri Lanka, in the first half of 2024.
2. End-to-End Learning
• In earlier times, intermediate features were generated first and then used to
train another ML model
• But when you have more data, it is much more accurate to train directly from the
original data against the expected output
Source: https://www.youtube.com/watch?v=bkVCAk9Nsss
3. Multi-Task Learning
• Different tasks (e.g.: News Summarization, News Sentiment Analysis)
require different labeled datasets, which are rare
• The available datasets may be insufficient in size to train a model to
a sufficient level of accuracy
• When the business need changes, new ML tasks emerge for which
there are no labeled datasets to train on
• To address these problems, we need a way to learn more than one
task at a time, so that a new task can be trained on the same model
with little data and at higher speed. This is known as Multi-Task
Learning
5. Assumption of Multi-Task Learning
• To learn in a multi-task manner, the tasks should share some
structure
• Otherwise, single-task learning is the better choice
• Fortunately, most tasks have common structure. E.g.:
• They obey the same laws of physics
• Languages like English and French share common patterns for historical reasons
• The psychology and physiology of different humans are very similar
Source: https://www.youtube.com/watch?v=bkVCAk9Nsss
6. Notations of Multi-Task Learning
• In multi-task learning, a new variable z_i, known as the Task Descriptor, is
added to the approximation function; it is generally a one-hot encoded vector
• The task descriptor encodes which task the model should perform
Source: https://www.youtube.com/watch?v=vI46tzt4O7Y
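The one-hot task descriptor can be sketched as follows (my own minimal illustration, not taken from the slides): the descriptor z_i is concatenated to the input features so that a single model f(x, z_i) can serve several tasks.

```python
# Sketch: encoding the task descriptor z_i as a one-hot vector and
# concatenating it to the input features x.

def one_hot(task_index, num_tasks):
    """Return the one-hot task descriptor z_i for the given task."""
    z = [0.0] * num_tasks
    z[task_index] = 1.0
    return z

def with_task_descriptor(x, task_index, num_tasks):
    """Concatenate the input features with the task descriptor."""
    return x + one_hot(task_index, num_tasks)

# Example: 3 tasks, task 1 (say, sentiment analysis) selected.
features = [0.5, -1.2]
model_input = with_task_descriptor(features, 1, 3)
# model_input is [0.5, -1.2, 0.0, 1.0, 0.0]
```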
7. Encoding the Task Descriptor in NN
Source: https://www.youtube.com/watch?v=vI46tzt4O7Y
8. Weighted Multi-Task Learning
• Instead of giving an equal weight to each task during training,
different weights can be assigned based on criteria such as:
• Manually setting a priority-based weight
• Dynamically adjusting the weights during training
• These weights scale the per-task terms of the loss function during optimization
Source: https://www.youtube.com/watch?v=vI46tzt4O7Y
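The weighted objective above can be sketched as a weighted sum of per-task losses (an illustrative toy, not from the slides; the task names and weight values are made up):

```python
# Sketch: a weighted multi-task objective. Each task contributes its
# loss L_t scaled by a weight w_t; the optimizer minimizes the sum.

def weighted_multitask_loss(task_losses, task_weights):
    """Combine per-task losses into one scalar objective."""
    assert len(task_losses) == len(task_weights)
    return sum(w * l for w, l in zip(task_weights, task_losses))

# Manually set priority-based weights: summarization matters twice as much.
losses = {"summarization": 0.8, "sentiment": 0.4}
weights = {"summarization": 2.0, "sentiment": 1.0}
total = weighted_multitask_loss(list(losses.values()), list(weights.values()))
# total = 2.0 * 0.8 + 1.0 * 0.4 = 2.0
```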
9. Training With Vanilla Multi-Task Learning
Source: https://www.youtube.com/watch?v=vI46tzt4O7Y
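The vanilla procedure on this slide can be sketched as a loop that, at every step, sums the losses of a minibatch from each task and applies one shared gradient update (my own toy illustration with synthetic data and a single shared parameter, not taken from the lecture):

```python
import random

# Toy sketch of vanilla multi-task training: a single shared parameter w
# is trained on minibatches drawn from several tasks; each step sums the
# per-task losses and applies one gradient update.

random.seed(0)

# Each task t has targets generated by y = c_t * x; one shared linear
# model y_hat = w * x must compromise across the tasks.
task_coeffs = [1.0, 1.2, 0.8]

def sample_batch(t, n=8):
    xs = [random.uniform(-1, 1) for _ in range(n)]
    return [(x, task_coeffs[t] * x) for x in xs]

w = 0.0
lr = 0.1
for step in range(200):
    grad = 0.0
    for t in range(len(task_coeffs)):       # sum the losses over all tasks
        for x, y in sample_batch(t):
            grad += 2 * (w * x - y) * x     # d/dw of the squared error
    w -= lr * grad / (len(task_coeffs) * 8)

# w settles near the mean task coefficient, 1.0: the shared structure
# the tasks have in common.
```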
10. Introduction to Transfer Learning
• Transfer Learning refers to the process of leveraging knowledge
gained from solving one problem and applying it to a different, but
related, problem
• Unlike traditional ML, where models are trained to perform a
specific task on a specific dataset, Transfer Learning allows knowledge
to be transferred from one task/domain to another. This improves
performance on the target task, especially when labeled data for the
target task is limited or expensive to obtain
• E.g.: To train a cat image classifier, you can take a CNN pre-trained
on the huge, miscellaneous ImageNet dataset and then train only its
last few layers with the smaller available cat image dataset
11. Motivation of Transfer Learning
• Scarcity of Labeled Data: Annotated datasets required for training
machine learning models are often scarce and expensive to acquire.
Transfer learning mitigates this issue by utilizing knowledge from
related tasks or domains
• Model Generalization: By transferring knowledge from a pre-trained
model, the model can generalize better to new tasks or domains,
even with limited data
• Efficiency: Transfer learning can significantly reduce the
computational resources and time required for training models from
scratch, making it a practical approach in various real-world scenarios
12. Types of Transfer Learning
1. Inductive Transfer Learning: Involves transferring knowledge from a source
domain to a target domain by learning a new task in the target domain using
the knowledge gained from solving a related task in the source domain
Example: Suppose you have a model trained to classify different types of
fruits based on images in one dataset (source domain). You can then use the
knowledge gained from this task to classify different types of vegetables
based on images in a separate dataset (target domain)
2. Transductive Transfer Learning: Focuses on adapting a model to a new
domain where the target data distribution may differ from the source domain.
Instead of learning a new task, transductive transfer learning aims to adapt
the model to perform well on the target domain.
Example: Let's say you have a model trained on data from one country
(source domain) to predict housing prices. However, when you try to apply
this model to a different country (target domain), you encounter differences
in housing market dynamics. Transductive transfer learning involves
adapting the model to the target domain's characteristics without explicitly
learning a new task
13. Pre-Trained Models
• Specific models can be developed by training the available small labeled
data with supervised learning on top of commonly available pre-trained
models
• CNNs pre-trained on the large generic ImageNet dataset and GPT models
are examples of pre-trained models
• ImageNet is an example of a large labeled dataset
• There are also many pre-trained models built with unsupervised learning
available as open source, such as the GPT and BERT large language
models
14. Transfer Learning via Fine Tuning
• The model pre-trained on the source data is trained again on the
target-domain data
• Sometimes all the layers of the NN are trained,
• Either with a small Learning Rate for all the layers
• Or with smaller Learning Rates for the earlier layers
• Sometimes only the last layers are trained while the earlier layers are
frozen, gradually unfreezing the earlier layers
• Sometimes only the last one or few layers are trained while the other
layers are kept frozen
• When the target task is simpler than the source task, the earlier layers often do not need to be updated
• The best techniques/hyperparameters are selected with cross-validation
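Freezing earlier layers during fine tuning can be sketched with a toy two-layer model (an illustration of the idea, not from the slides; the "pre-trained" weights here are made up): only the last layer receives gradient updates, just as when early CNN layers are kept fixed.

```python
# Toy sketch of fine tuning with a frozen earlier layer. A "pre-trained"
# two-layer linear model is adapted to a target task by updating only
# the last layer; the first layer's weight stays frozen.

# Pretend these parameters came from pre-training on the source task.
w1 = 0.9   # earlier layer: frozen during fine tuning
w2 = 0.5   # last layer: trained on the target task

lr = 0.05
# Target task: y = 1.5 * x, with a small labeled dataset.
data = [(x / 10.0, 1.5 * (x / 10.0)) for x in range(-10, 11)]

for epoch in range(300):
    for x, y in data:
        h = w1 * x                        # forward pass through the frozen layer
        y_hat = w2 * h
        grad_w2 = 2 * (y_hat - y) * h     # gradient only for the last layer
        w2 -= lr * grad_w2                # w1 is intentionally never updated

# After fine tuning, w1 is unchanged and the composed model
# w1 * w2 approximates the target coefficient 1.5.
```

In a real framework this corresponds to disabling gradient updates for the frozen parameters (e.g. excluding them from the optimizer).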
15. Transfer Learning via Fine Tuning
• Overfitting can be mitigated with the Early Stopping technique
• New layers can be added with Random Initialization while keeping
the earlier layers as they are
16. Unintuitive Facts about Transfer Learning
• When pre-training is done with unsupervised ML and fine tuning with supervised ML
(e.g. Transformer models), you don’t need very diverse data to pre-train
• You can even use the same target dataset for pre-training without sacrificing much
accuracy!
• This may change when both pre-training and fine tuning are done with supervised ML
Source: https://www.youtube.com/watch?v=bVjCjdq06R4
17. Unintuitive Facts about Transfer Learning
• The last layer of a NN may not be the best layer to fine tune
• In some scenarios, fine tuning selected middle layers performs better than full
fine tuning
Source: https://www.youtube.com/watch?v=bVjCjdq06R4
18. Rule of Thumb for Transfer Learning
Source: https://www.youtube.com/watch?v=bVjCjdq06R4
19. Meta Learning
• “Given a set of training tasks, can we optimize for the ability to learn
these tasks quickly, so that we can learn new tasks quickly too?”
• This is what is achieved by Meta Learning
• In other words, optimizing for transferability is known as Meta Learning
Source: https://www.youtube.com/watch?v=bVjCjdq06R4
20. Two Views of Meta Learning Algorithms
Source: https://www.youtube.com/watch?v=bVjCjdq06R4
21. Bayes View of Meta Learning
• The probabilities of the label values y_i,j depend on the
parameters 𝜙𝑖 of the model for task i
• The parameters 𝜙𝑖 of all the tasks depend on the meta-level
parameters 𝜃
• If the 𝜙𝑖 were independent across the tasks i, then 𝜃 would
carry no information, and vice versa
• Learning 𝜃 is the idea of Meta
Learning
Source: https://www.youtube.com/watch?v=bVjCjdq06R4
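The hierarchical dependence described above can be written as a factorized probability model (standard hierarchical-Bayes notation, my own transcription of the structure on the slide):

```latex
p\bigl(\theta, \phi_{1:T}, \mathcal{D}\bigr)
  = p(\theta)\,
    \prod_{i=1}^{T} p(\phi_i \mid \theta)\,
    \prod_{j} p\bigl(y_{i,j} \mid x_{i,j}, \phi_i\bigr)
```

Here 𝜃 ties the task-specific parameters 𝜙𝑖 together: if the middle factor did not depend on 𝜃, the tasks would be independent and there would be nothing to meta-learn.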
22. Mechanistic View of Meta Learning
• Mechanistically, a meta learning algorithm can be seen as a function that reads the
training dataset of task i and directly outputs that task’s parameters: 𝜙𝑖 = f𝜃(Dᵢ)
• Meta training optimizes 𝜃 so that the produced 𝜙𝑖 perform well on each task’s
held-out test set
• This is the same objective as the Bayes view, expressed as learning a procedure
that maps datasets to task models
Source: https://www.youtube.com/watch?v=bVjCjdq06R4