Training separate models from scratch, or fine-tuning each one individually for different tasks, is costly in terms of computational resources, memory usage, and environmental impact. Multi-task learning leverages information across N tasks and datasets to improve performance on each of them. This approach offers benefits such as a shared model, representation bias, increased data efficiency, and eavesdropping. Various methods have been proposed to mitigate issues such as catastrophic forgetting and task interference. This talk explores a general approach to multi-task learning in transformer-based architectures, novel adapter-based and hypernetwork techniques, and solutions to task sampling and balancing problems.
5. SINGLE TASK LEARNING: BERT (large)
PARAMETERS: 345 M (1.34 GB)
GPU MEMORY: 5.65 GB
PRE-TRAINING: 64 TPUs, 4 days, ~$7,000
CHECKPOINT: 4 GB
CO2 EMISSION: 284 t of CO2 (average transatlantic flight)
Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." 2018.
10. MOTIVATION
SINGLE MODEL: N times storage reduction
DATA EFFICIENCY: Low-resource tasks benefit
KNOWLEDGE SHARING: Gradient updates from other tasks
SHARED ENCODERS
18. COMPOSITIONS (ADAPTERS)
FUNCTION: Augment the base model with new task-specific sub-functions
INPUT: Augment the function's input by concatenating the parameter vector
PARAMETER: Directly augment the parameters of the base model
20. ADAPTERS: FUNCTION (BOTTLENECK ADAPTER)
Houlsby, Neil, et al. "Parameter-Efficient Transfer Learning for NLP." International Conference on Machine Learning. PMLR, 2019.
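A minimal PyTorch sketch of a function-composition adapter in the style of Houlsby et al.; the class name, bottleneck size, and initialization choices here are illustrative assumptions, not the paper's reference code:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project,
    with a residual connection around the whole block."""

    def __init__(self, hidden_size: int, bottleneck_size: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.act = nn.GELU()
        # Near-identity initialization so training starts close to the
        # frozen pre-trained model (assumption: zero-init of the up-projection).
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual: base representation plus a small task-specific correction.
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```

In this setup the pre-trained transformer weights stay frozen; only the small adapter modules inserted after the attention and feed-forward sub-layers, the layer norms, and the task head are trained.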
21. ADAPTERS: INPUT (PREFIX-TUNING)
Li, Xiang Lisa, and Percy Liang. "Prefix-Tuning: Optimizing Continuous Prompts for Generation." 2021.
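A hedged sketch of the input-composition idea behind prefix-tuning, assuming a standard multi-head attention layout; the class and method names are hypothetical, and the MLP reparameterization the paper uses for training stability is omitted:

```python
import torch
import torch.nn as nn

class PrefixTuning(nn.Module):
    """Trainable prefix vectors that are prepended to the keys and values
    of every attention layer, while the transformer itself stays frozen."""

    def __init__(self, num_layers: int, num_heads: int,
                 head_dim: int, prefix_length: int = 10):
        super().__init__()
        # One trainable prefix per layer, stored per attention head.
        self.prefix_keys = nn.Parameter(
            torch.randn(num_layers, num_heads, prefix_length, head_dim) * 0.02)
        self.prefix_values = nn.Parameter(
            torch.randn(num_layers, num_heads, prefix_length, head_dim) * 0.02)

    def extend_kv(self, layer: int, keys: torch.Tensor, values: torch.Tensor):
        """Concatenate the learned prefix in front of one layer's keys/values.
        keys, values: (batch, num_heads, seq_len, head_dim)."""
        batch = keys.size(0)
        pk = self.prefix_keys[layer].unsqueeze(0).expand(batch, -1, -1, -1)
        pv = self.prefix_values[layer].unsqueeze(0).expand(batch, -1, -1, -1)
        return torch.cat([pk, keys], dim=2), torch.cat([pv, values], dim=2)
```

When this is used, the attention mask must also be extended so the prefix positions are always attendable; only the prefix tensors are optimized.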
22. ADAPTERS: PARAMETER (LoRA)
Hu, Edward J., et al. "LoRA: Low-Rank Adaptation of Large Language Models." 2021.
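A minimal sketch of LoRA-style parameter composition for a single linear layer; names and hyperparameters such as r and alpha are illustrative defaults, not a definitive implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Linear layer whose frozen weight W is combined with a trainable
    low-rank update B @ A, scaled by alpha / r."""

    def __init__(self, in_features: int, out_features: int,
                 r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # pre-trained weight stays frozen
        self.base.bias.requires_grad_(False)
        # A is initialized with small random values, B with zeros,
        # so the layer starts out identical to the frozen base layer.
        self.lora_a = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)
```

Because the update B A is low-rank, the extra trainable parameters per layer amount to only r * (in_features + out_features), and after training the update can be merged into the frozen weight so inference incurs no additional cost.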