Training separate models from scratch, or fine-tuning each one individually for different tasks, is costly in terms of computational resources, memory usage, and environmental impact. Multi-task learning leverages information across N tasks and datasets to improve performance on each of them. This approach offers benefits such as a shared model, representation bias, increased data efficiency, and eavesdropping. Various methods have been proposed to mitigate issues such as catastrophic forgetting and task interference. This talk explores a general approach to multi-task learning in transformer-based architectures, novel adapter-based and hypernetwork techniques, and solutions to task sampling and balancing problems.
5. SINGLE TASK LEARNING: BERT (large)
PARAMETERS: 345 M (1.34 GB)
GPU MEMORY: 5.65 GB
PRE-TRAINING: 64 TPUs, 4 days, ~$7,000
CHECKPOINT: 4 GB
CO2 EMISSION: 284 t of CO2 (average transatlantic flight)
Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." 2018.
10. MOTIVATION
SINGLE MODEL: N times storage reduction
DATA EFFICIENCY: Low-resource tasks benefit
KNOWLEDGE SHARING: Gradient updates from other tasks
SHARED ENCODERS
18. COMPOSITIONS (ADAPTERS)
FUNCTION: Augment the base model with new task-specific sub-functions
INPUT: Augment the function's input by concatenating the parameter vector
PARAMETER: Directly augment the parameters of the base model
20. ADAPTERS: FUNCTION (BOTTLENECK ADAPTER)
Houlsby, Neil, et al. "Parameter-Efficient Transfer Learning for NLP." International Conference on Machine Learning. PMLR, 2019.
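A minimal PyTorch sketch of a function-composition adapter in the style of Houlsby et al.; the class name, bottleneck size, and initialization choices here are illustrative assumptions, not the paper's reference code:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project,
    with a residual connection around the whole block."""

    def __init__(self, hidden_size: int, bottleneck_size: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.act = nn.GELU()
        # Near-identity initialization so training starts close to the
        # frozen pre-trained model (assumption: zero-init of the up-projection).
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual: base representation plus a small task-specific correction.
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```

In this setup the pre-trained transformer weights stay frozen; only the small adapter modules inserted after the attention and feed-forward sub-layers, the layer norms, and the task head are trained.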
21. ADAPTERS: INPUT (PREFIX-TUNING)
Li, Xiang Lisa, and Percy Liang. "Prefix-Tuning: Optimizing Continuous Prompts for Generation." 2021.
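A hedged sketch of the input-composition idea behind prefix-tuning, assuming a standard multi-head attention layout; the class and method names are hypothetical, and the MLP reparameterization the paper uses for training stability is omitted:

```python
import torch
import torch.nn as nn

class PrefixTuning(nn.Module):
    """Trainable prefix vectors that are prepended to the keys and values
    of every attention layer, while the transformer itself stays frozen."""

    def __init__(self, num_layers: int, num_heads: int,
                 head_dim: int, prefix_length: int = 10):
        super().__init__()
        # One trainable prefix per layer, stored per attention head.
        self.prefix_keys = nn.Parameter(
            torch.randn(num_layers, num_heads, prefix_length, head_dim) * 0.02)
        self.prefix_values = nn.Parameter(
            torch.randn(num_layers, num_heads, prefix_length, head_dim) * 0.02)

    def extend_kv(self, layer: int, keys: torch.Tensor, values: torch.Tensor):
        """Concatenate the learned prefix in front of one layer's keys/values.
        keys, values: (batch, num_heads, seq_len, head_dim)."""
        batch = keys.size(0)
        pk = self.prefix_keys[layer].unsqueeze(0).expand(batch, -1, -1, -1)
        pv = self.prefix_values[layer].unsqueeze(0).expand(batch, -1, -1, -1)
        return torch.cat([pk, keys], dim=2), torch.cat([pv, values], dim=2)
```

When this is used, the attention mask must also be extended so the prefix positions are always attendable; only the prefix tensors are optimized.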
22. ADAPTERS: PARAMETER (LoRA)
Hu, Edward J., et al. "LoRA: Low-Rank Adaptation of Large Language Models." 2021.
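A minimal sketch of LoRA-style parameter composition for a single linear layer; names and hyperparameters such as r and alpha are illustrative defaults, not a definitive implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Linear layer whose frozen weight W is combined with a trainable
    low-rank update B @ A, scaled by alpha / r."""

    def __init__(self, in_features: int, out_features: int,
                 r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # pre-trained weight stays frozen
        self.base.bias.requires_grad_(False)
        # A is initialized with small random values, B with zeros,
        # so the layer starts out identical to the frozen base layer.
        self.lora_a = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)
```

Because the update B A is low-rank, the extra trainable parameters per layer amount to only r * (in_features + out_features), and after training the update can be merged into the frozen weight so inference incurs no additional cost.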