Machine Learning for Big Data
Eirini Ntoutsi
(joint work with Vasileios Iosifidis)
Leibniz University Hannover & L3S Research Center
(Machine)Learning with limited labels
4th Alexandria workshop, 19-20.11.2017
A good conjuncture for ML/DM (data-driven learning)
 Data deluge
 Machine Learning advances
 Computer power
 Enthusiasm
More data = Better learning?
 However, more data does not necessarily imply better learning
(Diagram: data deluge ↔ machine learning advances)
• Data is the fuel for ML
• (Sophisticated) ML methods require more data for training
More data != Better learning
 More data != Better data
 The veracity issue/ data in doubt
 Data inconsistency, incompleteness, ambiguities, …
 The non-representative samples issue
 Biased data, not covering the population/problem we want to study
 The label scarcity issue
 Despite its volume, big data does not come with label information
 Unlabelled data: Abundant and free
 E.g., image classification: easy to get unlabeled images
 E.g., website classification: easy to get unlabeled webpages
 Labelled data: Expensive and scarce
 …
Why is label scarcity a problem?
 Standard supervised learning methods will not work
 Especially a big problem for complex models, like deep neural networks, which need large amounts of labelled training data.
(Diagram: learning algorithm → model)
Source: https://tinyurl.com/ya3svsxb
How to deal with label scarcity?
 A variety of methods is relevant
 Semi-supervised learning (this talk!)
 Exploit the unlabelled data together with the labelled one
 Active learning (past and ongoing work!)
 Ask the user to contribute labels for a few instances that are most useful for learning
 Data augmentation (ongoing work!)
 Expand the original labelled dataset by generating artificial data
 ….
In this presentation
Semi-supervised learning
(or, exploiting the unlabelled data together with the labelled one)
Semi-supervised learning
 Problem setting
 Given: Few initial labelled training data DL = (XL, YL) and unlabelled data DU = (XU)
 Goal: Build a model using not only DL but also DU
(Diagram: a small labelled set DL next to a much larger unlabelled set DU)
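A minimal sketch of this setting in Python (assuming scikit-learn; the synthetic dataset and the 5% labelled fraction are illustrative, not the talk's):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# A synthetic two-class dataset standing in for (X, Y).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Keep labels for only 5% of the instances (DL); treat the rest as unlabelled (DU).
X_l, X_u, y_l, y_u_hidden = train_test_split(
    X, y, train_size=0.05, stratify=y, random_state=0)
# y_u_hidden is kept here only for later evaluation; a real DU comes without labels.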
The intuition
 Let's consider only the labelled data
 We have two classes: red & blue
 Let's consider also some unlabelled data (light blue)
 The unlabelled data can give a better sense of the class separation boundary (in this case)
Important prerequisite: the distribution of examples, which the unlabelled data will help elucidate, should be relevant for the classification problem
Semi-supervised learning methods
 Self-learning
 Co-training
 Generative probabilistic models, e.g. trained with EM (not included in this work)
 …
Semi-supervised learning: Self-learning
 Given: Small amount of initial labelled training data DL
 Idea: Train, predict, re-train using classifier’s (best) predictions, repeat
 Can be used with any supervised learner.
Source: https://tinyurl.com/y98clzxb
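A minimal self-training sketch (assuming a scikit-learn-style base learner; the confidence threshold and stopping rules are illustrative, not the exact procedure used later in the talk):

import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_l, y_l, X_u, confidence=0.9, max_rounds=10):
    """Train, predict on DU, absorb confident predictions into DL, repeat."""
    X_l, y_l, X_u = np.asarray(X_l), np.asarray(y_l), np.asarray(X_u)
    clf = LogisticRegression(max_iter=1000)
    for _ in range(max_rounds):
        clf.fit(X_l, y_l)
        if len(X_u) == 0:
            break
        proba = clf.predict_proba(X_u)
        confident = proba.max(axis=1) >= confidence
        if not confident.any():
            break  # nothing confident enough to pseudo-label: stop
        # Add the confident instances with their predicted labels to DL.
        X_l = np.vstack([X_l, X_u[confident]])
        y_l = np.concatenate([y_l, clf.classes_[proba[confident].argmax(axis=1)]])
        X_u = X_u[~confident]
    return clf

model = self_train(X_l, y_l, X_u)  # using the DL/DU split sketched earlier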
Self-Learning: A good case
 Base learner: KNN classifier
Source: https://tinyurl.com/y98clzxb
Self-Learning: A bad case
 Base learner: KNN classifier
 Things can go wrong if there are outliers. Mistakes get reinforced.
Source: https://tinyurl.com/y98clzxb
Semi-supervised learning: Co-Training
 Given: Small amount of initial labelled training data
 Each instance x has two views: x = [x1, x2]
 E.g., in webpage classification:
1. Page view: words appearing on the web page
2. Hyperlink view: words in the hyperlinks pointing to the webpage from other pages
 Co-training utilizes both views to learn better with fewer labels
 Idea: each view teaches (trains) the other view
 By providing labelled instances
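A minimal co-training sketch (a simplification of the classic Blum & Mitchell scheme; the naive Bayes base learners, the per-round quota, and the tie-breaking rule are illustrative assumptions, and the features are assumed to be non-negative counts):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1_l, X2_l, y_l, X1_u, X2_u, per_round=10, rounds=5):
    """X1_*/X2_* are the two views of the same instances (e.g. page vs hyperlink words)."""
    X1_l, X2_l, y_l = np.asarray(X1_l), np.asarray(X2_l), np.asarray(y_l)
    X1_u, X2_u = np.asarray(X1_u), np.asarray(X2_u)
    c1, c2 = MultinomialNB(), MultinomialNB()
    for _ in range(rounds):
        c1.fit(X1_l, y_l)
        c2.fit(X2_l, y_l)
        if len(X1_u) == 0:
            break
        # Each view nominates its most confident unlabelled instances.
        picked = set()
        for clf, X_view in ((c1, X1_u), (c2, X2_u)):
            conf = clf.predict_proba(X_view).max(axis=1)
            picked.update(np.argsort(conf)[-per_round:].tolist())
        idx = np.array(sorted(picked))
        # Label each picked instance with the more confident of the two views.
        p1, p2 = c1.predict_proba(X1_u[idx]), c2.predict_proba(X2_u[idx])
        y_new = np.where(p1.max(axis=1) >= p2.max(axis=1),
                         c1.classes_[p1.argmax(axis=1)],
                         c2.classes_[p2.argmax(axis=1)])
        # Move the newly labelled instances from DU into DL (both views).
        X1_l = np.vstack([X1_l, X1_u[idx]])
        X2_l = np.vstack([X2_l, X2_u[idx]])
        y_l = np.concatenate([y_l, y_new])
        keep = np.ones(len(X1_u), dtype=bool)
        keep[idx] = False
        X1_u, X2_u = X1_u[keep], X2_u[keep]
    return c1, c2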
Semi-supervised learning: Co-Training
 Assumption
 Views should be independent
 Intuitively, we don't want redundancy between the views (we want classifiers that make different mistakes)
 Given sufficient data, each view is good enough to learn from
Self-learning vs co-training
 Despite their differences
 Co-training splits the features, self-learning does not
 Both follow a similar training-set expansion strategy
 They expand the training set by adding labels to (some of) the unlabeled data
 So, the training set is expanded via real (unlabeled) instances with predicted labels
 Both self-learning & co-training incrementally use the unlabeled data
 Both self-learning & co-training propagate the most confident predictions to the next round
This work
Semi-supervised learning for textual data
(self-learning, co-training)
The TSentiment15 dataset
 We used self-learning and co-training to annotate a big dataset
 the whole Twitter corpus of 2015 (228M tweets without retweets, 275M with)
 The annotated dataset is available at: https://l3s.de/~iosifidis/TSentiment15/
 The largest previous dataset is TSentiment (1.6M tweets collected over a period of 3 months in 2009)
 In both cases, labelling relates to sentiment
 2 classes: positive, negative
Annotation settings
 For self-learning:
 the features are the unigrams
 For co-training: we tried two alternatives
 Unigrams and bigrams
 Unigrams and language features like part-of-speech tags, #words in capital, #links, #mentions, etc.
 We considered two annotation modes:
 Batch annotation: the dataset was processed as a whole
 Stream annotation: the dataset was processed in a stream fashion
(Diagram: the labelled set DL and unlabelled set DU split into monthly batches (L1, U1), …, (L12, U12))
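A sketch of how the feature views might be built (assuming scikit-learn's CountVectorizer; the example tweets, the regexes, and the tokenisation are illustrative, and POS tags are omitted since they need a tagger):

import re
from sklearn.feature_extraction.text import CountVectorizer

tweets = ["I love this song", "worst service EVER", "check this out http://t.co/x @friend"]

unigrams = CountVectorizer(ngram_range=(1, 1))  # self-learning features / co-training view 1
bigrams = CountVectorizer(ngram_range=(2, 2))   # co-training view 2, alternative (a)
X1 = unigrams.fit_transform(tweets)
X2a = bigrams.fit_transform(tweets)

def language_features(t):
    # Co-training view 2, alternative (b): simple language features.
    return [sum(w.isupper() and len(w) > 1 for w in t.split()),  # #words in capital
            len(re.findall(r"https?://\S+", t)),                 # #links
            len(re.findall(r"@\w+", t))]                         # #mentions

X2b = [language_features(t) for t in tweets]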
How to build the ground truth (DL)
 We used two different label sources
 Distant Supervision
 Use emoticons as proxies for sentiment
 Only clearly-labelled tweets (with only positive or only negative emoticons) are kept
 SentiWordNet: a lexicon-based approach
 The sentiment score of a tweet is an aggregation of the sentiment scores of its words (the latter come from the lexicon)
 The two sources agree on ~2.5M tweets → these form the ground truth
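A hedged sketch of the two label sources and their agreement (the emoticon sets and the score-aggregation rule below are illustrative placeholders, not the paper's exact choices):

POS_EMO = {":)", ":-)", ":D"}
NEG_EMO = {":(", ":-(", ":'("}

def distant_label(tweet):
    # Distant supervision: emoticons as sentiment proxies; ambiguous tweets are dropped.
    has_pos = any(e in tweet for e in POS_EMO)
    has_neg = any(e in tweet for e in NEG_EMO)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return None

def lexicon_label(tweet, lexicon):
    # Lexicon-based: aggregate per-word scores (e.g. derived from SentiWordNet).
    total = sum(lexicon[w] for w in tweet.lower().split() if w in lexicon)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return None

def ground_truth_label(tweet, lexicon):
    # Keep a tweet for DL only when the two independent sources agree.
    a, b = distant_label(tweet), lexicon_label(tweet, lexicon)
    return a if a is not None and a == b else None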
Labeled-unlabeled volume (and over time)
 On monthly average, DU is 82 times larger than DL
 The positive class is overrepresented; the average positive/negative ratio per month is 3
Batch annotation: Self-learning vs co-training
(Plots: annotation statistics for self-learning vs co-training)
 The more selective δ is, the more tweets remain unlabeled
 The majority of the predictions refer to the positive class
 The model is more confident on the positive class
 Co-training labels more instances than self-learning
 Co-training learns the negative class better than self-learning
Batch annotation: Effect of labelled set sample
 When the number of labels is small, co-training performs better
 With >=40% of labels, self-learning is better
Stream annotation
 Input: stream in monthly batches: ((L1, U1), (L2, U2), …, (L12, U12))
 Two variants are evaluated for training:
 Without history: We learn a model on each month i (using Li, Ui).
 With history: for a month i, we consider Li = L1 ∪ L2 ∪ … ∪ Li; similarly for Ui.
 Two variants also for testing:
 Prequential evaluation: use Li+1 as the test set for the model of month i
 Holdout evaluation: we split D into Dtrain, Dtest. Training/testing as before, but only on data from Dtrain and Dtest, respectively.
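A sketch of the stream-training variants with prequential testing (train_fn, eval_fn and the batch representation are placeholders; window=1 corresponds to "without history", window=None to full history, and an intermediate value to the sliding window discussed below):

def stream_annotate(batches, train_fn, eval_fn, window=None):
    """batches: [(L1, U1), ..., (L12, U12)]; train_fn(L, U) -> model;
    eval_fn(model, L_test) -> score. Month i+1's labels serve as the test set."""
    results = []
    for i in range(len(batches) - 1):
        lo = 0 if window is None else max(0, i + 1 - window)
        L = [x for Lb, _ in batches[lo:i + 1] for x in Lb]  # labelled history
        U = [x for _, Ub in batches[lo:i + 1] for x in Ub]  # unlabelled history
        model = train_fn(L, U)
        results.append(eval_fn(model, batches[i + 1][0]))   # prequential: test on L_{i+1}
    return results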
Stream: Self-learning vs co-training
(Plots: prequential and holdout evaluation over the monthly stream)
 History improves the performance
 For the models with history, co-training is better in the beginning, but as the history grows self-learning wins
Stream: the effect of the history length
 We used a sliding window approach
 E.g., training on months [1-3] using both labeled and unlabeled data, testing on month 4
 Small decrease in performance compared to the full-history case, but much lighter models
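With the stream sketch above, this sliding-window variant is just a different (hypothetical) window argument:

scores = stream_annotate(batches, train_fn, eval_fn, window=3)  # train on months [i-2, i], test on month i+1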
Class distribution of the predictions
 Self-learning produces more positive predictions than co-training
 The version with retweets results in more balanced predictions
 Original class distribution without retweets: 87%-13%
 Original class distribution with retweets: 75%-25%
Summary
 We annotated a big dataset with semi-supervised learning
 Self-learning
 Co-training
 When the number of labels is small, co-training performs better
 Batch vs stream annotation
 History helps (but we don't need to keep the whole history; a sliding-window-based approach is also ok)
 Learning with redundancy (retweets)
 Better class balance in the predictions when retweets are used (because the original dataset with retweets is more balanced)
Ongoing work
 Thus far: Semi-supervised learning which focuses on label scarcity
 Another way to get around lack of data is data augmentation
 i.e., increasing the size of the training set by generating artificial data based on the original labeled set
 Useful for many purposes
 Deal with class imbalance, create more robust models, etc.
 We investigate different augmentation approaches
 At the input layer
 At the intermediate layer
 And how to control the augmentation process
 The goal is to generate plausible data that help with the classification task
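A hedged illustration of input-level augmentation for text (a generic token-level perturbation used to oversample a class; this is a technique sketch, not the specific approach under investigation):

import random

def augment(tokens, n_swaps=1, rng=random.Random(0)):
    # Create a plausible variant of a tweet by swapping two token positions.
    tokens = list(tokens)
    for _ in range(n_swaps):
        i, j = rng.randrange(len(tokens)), rng.randrange(len(tokens))
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

minority = [["not", "happy", "with", "this"], ["worst", "day", "ever"]]
extra = [augment(t) for t in minority]  # artificial examples for the minority class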
Thank you for your attention!
www.kbs.uni-hannover.de/~ntoutsi/
ntoutsi@l3s.de
Questions/ Thoughts?
 Relevant work
 V. Iosifidis, E. Ntoutsi, "Large scale sentiment annotation with limited labels", KDD, Halifax, Canada, 2017
 TSentiment15 available at:
 https://l3s.de/~iosifidis/TSentiment15/