This document discusses methods for building machine learning models that can handle concept drift and evolving data distributions when classifying tweets in real-time. It proposes using both a global deep learning model and a local online learning model that incorporates feedback. The local model, which uses an algorithm like Crammer's PA-II, adapts quickly to feedback but is prone to bias towards one class. The document suggests combining the models through online stacking into an ensemble called "glocal" and detecting concept drift periodically to replace outdated models. Handling concept drift and evolving data is important for domains with changing user preferences, markets, or adversarial settings.
2. Agenda
1. Problem v 1.0
2. Solution
3. Issues
a. Drift
b. Evolving Vocab
c. Feedback loop
4. Problem v 2.0
5. Our Solution
a. Global
b. Local
c. glocal
d. Drift Detection
6. Local – pros and cons
7. Way Forward
8. Conclusion/takeaway
3. Problem Statement – v 1.0
• Build a spam filter for Twitter.
• Use case: in customer service, we listen to Twitter on behalf of brands and figure out what the brands can respond to.
• Goal: filter spam from the actionable tweets in brands' real-time Twitter streams.
4. Twitter is noisy
There is ~65-70% noise in consumer-to-business communication
[and ~100% noise in business-to-consumer].
The percentage of noise is only higher if you are a big B2C company.
5. Solution
• Model it as a (binary) classification problem: Actionable vs. Spam.
• Acquire a good-quality dataset.
• Engineer features – there are some very good indicators.
• Select an algorithm.
• Train, test, tune – ~85% accuracy.
• Deploy.
6. Paradise lost
In production the model started out very well; however, as time* went by, the running accuracy of our
model started falling.
*within a couple of weeks of deployment
7. Behind the Scenes
• Our data was changing, and changing fast.
• Non-stationary distributions: a stationary process is time-independent – its averages remain more or less constant over time.
• This is also called drift – the distribution generating the data changes over time.
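A toy sketch (synthetic data, not our pipeline) makes non-stationarity concrete: the windowed averages of a drifting stream diverge, while those of a stationary stream stay put.

```python
import random

random.seed(0)

def stream(n, mean):
    """Draw n points from a Gaussian with the given mean."""
    return [random.gauss(mean, 1.0) for _ in range(n)]

# A stationary stream: the generating distribution never changes.
stationary = stream(5000, 0.0) + stream(5000, 0.0)

# A drifting stream: halfway through, the mean shifts from 0 to 3.
drifting = stream(5000, 0.0) + stream(5000, 3.0)

def window_means(xs, w=5000):
    """Average of each consecutive window of w points."""
    return [sum(xs[i:i + w]) / w for i in range(0, len(xs), w)]

print(window_means(stationary))  # two similar averages
print(window_means(drifting))    # averages diverge: the process is non-stationary
```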
8. Behind the Scenes
• The vocabulary of our dataset kept growing.
o Unlike most language corpora, Twitter vocabulary evolves faster – significantly faster.
10. Behind the Scenes
• Not learning from mistakes: in our system, the user (brand agent) has the option to tell the system when a classification is wrong.
• The model was not utilizing these signals to improve.
11. In a Nutshell
• Given the last few slides, the degradation (over time) in the prediction accuracy
of our model shouldn't come as a surprise.
• This is not specific to Twitter data. In general, these problems are likely
to occur in the following domains:
o Monitoring & anomaly detection (one-class classification) in adversarial settings
o Recommendations (where user preferences are continuously changing; evolving labels)
o Stock market predictions (concept drift; evolving distributions)
12. Problem Statement – v 2.0
• Build a spam filter for Twitter which can:
o Handle drift in the data.
o Learn (and improve) from feedback.
o Handle a fast-evolving vocabulary.
• More generally, build a classifier which can:
o Handle drift in the data.
o Learn (and improve) from feedback.
o Handle a fast-evolving vocabulary.
13. Possible Solutions
• Frequently retrain your model on updated data and redeploy.
o Training, testing, fine-tuning – a lot of work. Doesn't scale at all.
o You lose all the old learnings.
• Continuous learning: the model adapts to new incoming data.
14. What worked for us
• Global: a deep learning model, batch trained on a large corpus, with no short-term updates.
• Local: a per-brand model – a fast learner that takes instant feedback and detects drift.
15. Text Representation
• Preprocess the tweets – replace mentions,
hashtags, URLs, emojis, dates, numbers, and
currency with placeholder constants. Remove
stop words.
• How good is your preprocessing?
– Check it against Zipf's law.
• Given a large corpus, if t1, t2, t3, … are the
most common terms (in decreasing order of
frequency) and cf_i is the collection
frequency of the i-th most common term,
then cf_i ∝ 1/i.
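The Zipf diagnostic can be sketched as follows. The corpus here is synthetic, constructed so that cf_i ∝ 1/i holds exactly, purely to illustrate the rank × frequency check (under Zipf's law the product is roughly constant).

```python
from collections import Counter

def zipf_check(tokens, top_k=10):
    """Rank terms by collection frequency and report rank * frequency.
    Under Zipf's law (cf_i proportional to 1/i) this product is
    roughly constant across ranks."""
    counts = Counter(tokens)
    ranked = counts.most_common(top_k)
    return [(rank, term, cf, rank * cf)
            for rank, (term, cf) in enumerate(ranked, start=1)]

# Toy corpus whose frequencies follow 1/rank exactly, by construction.
vocab = ["the", "to", "a", "help", "order", "refund"]
tokens = []
for rank, word in enumerate(vocab, start=1):
    tokens += [word] * (600 // rank)   # cf_i = 600 / i

for rank, term, cf, product in zipf_check(tokens):
    print(rank, term, cf, product)     # product stays at 600 for every rank
```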
18. Text Representation
• Word embeddings:
o Use Google's pre-trained word2vec model to replace each word with its corresponding embedding (300
dimensions).
o For a tweet, we average the word embedding vectors of its constituent words.
o For missing words, we generate a random number in (-0.25, 0.25) for each of the 300 dimensions (Yann
LeCun, 2014).
o Final representation:
Tweet = 300-dim vector of real numbers
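A minimal sketch of this representation, with a tiny random dictionary standing in for Google's pre-trained word2vec model (the real pipeline would load the 300-dim Google News vectors, e.g. via gensim):

```python
import random

DIM = 300
random.seed(42)

# Stand-in for the pre-trained word2vec lookup; in the real pipeline this
# would be a gensim KeyedVectors model with 300-dimensional embeddings.
pretrained = {w: [random.uniform(-1, 1) for _ in range(DIM)]
              for w in ["order", "refund", "delayed"]}

oov_cache = {}

def embed_word(word):
    """Return the pre-trained vector; for out-of-vocabulary words,
    generate (once, then cache) a random vector in (-0.25, 0.25)."""
    if word in pretrained:
        return pretrained[word]
    if word not in oov_cache:
        oov_cache[word] = [random.uniform(-0.25, 0.25) for _ in range(DIM)]
    return oov_cache[word]

def embed_tweet(tokens):
    """Tweet vector = element-wise average of its word vectors."""
    vecs = [embed_word(t) for t in tokens]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

v = embed_tweet(["order", "refund", "yolo123"])  # 'yolo123' is OOV
print(len(v))  # 300
```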
19. ● DeepNet
○ CNN
○ Trained over a corpus of ~8 million tweets
○ An off-the-shelf architecture gave us ~86% CV accuracy.
Global model
20. Local
• Goals
o Strictly improves with every feedback point.
o Higher retention of older concepts.
• Desired properties
o Online learner.
o Fast learner; aggressive model updates.
o Incorporates every last feedback point successfully.
(After a model update, if the same data point is presented again, the model must predict its class label correctly.)
o Doesn't forget recent data points.
(After a model update, if any of the last N data points is presented, the model should predict its class label with higher accuracy.)
22. Possible Approaches
● Reinforcement learning: reward/punish the model when the prediction is right/wrong. For a binary classification problem the underlying MDP is too small (2 states) – the model doesn't learn much.
● Mini-batches: works fine if the velocity of feedback data is high (you don't have to wait long to accumulate a mini-batch of feedback). Many applications don't have high velocity.
● Instant feedback (tiny batches): just 1 data point can skew the model.
23. Building feedback loop
• We model a feedback point <Tweet, Y> as a data point presented to the local model
in an online setting.
• Thus, a stream of feedback points = an incoming data stream.
• Thus, we use an online learner.
• The online method in ML:
o Data is modeled as a stream.
o The model makes a prediction (y'), when presented with a data point (x).
o The environment reveals the correct class label (y).
o If y ≠ y', update the model.
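The protocol above can be sketched as a test-then-train loop. The `MajorityLearner` here is a deliberately trivial stand-in for the real PA-II local model, used only to exercise the loop:

```python
def online_run(model, stream):
    """Online protocol from the slide: predict, reveal the true label,
    update the model whenever the prediction was wrong."""
    mistakes = 0
    for x, y in stream:
        y_pred = model.predict(x)
        if y_pred != y:
            mistakes += 1
            model.update(x, y)   # incorporate the feedback immediately
    return mistakes

class MajorityLearner:
    """Toy online learner (illustrative only): predicts the label it has
    been corrected towards most often."""
    def __init__(self):
        self.counts = {+1: 0, -1: 0}
    def predict(self, x):
        return +1 if self.counts[+1] >= self.counts[-1] else -1
    def update(self, x, y):
        self.counts[y] += 1

stream = [(None, -1)] * 10
print(online_run(MajorityLearner(), stream))  # 1: one mistake, then it adapts
```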
25. Results of Local
• Dataset – 160K tweets from 2015, time-sequenced.
• Feedback incorporation improves accuracy:
o Trained the model (offline, batch mode) on the first 100K data points.
o On the test set (last 60K data points) it gave 74% accuracy (offline batch mode).
o Then ran the model on the test data (60K data points) in an online fashion:
The model made a total of 9028 mistakes.
These mistakes were instantaneously fed into the local model as feedback.
This gives an accuracy of ~85% across the test set (1 − 9028/60000 ≈ 0.85).
○ We gained ~11% accuracy by incorporating feedback.
28. It's no fluke
We stress-tested the local model by deliberately feeding it wrong feedback:
29. glocal: Ensembling global and local
• We use online stacking to ensemble our continuously adapting local model and the
erudite DeepNet (global) model.
• The outputs of the global and local models go to an Online SVM.
• We train the ensemble offline in batch mode, but continue to train it on
feedback points in an online fashion.
• We get a CV accuracy of 82%.
[Diagram: Global + Local → Online SVM → glocal]
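A sketch of the online-stacking idea on toy base-model scores. A simple online perceptron stands in for the Online SVM meta-learner, and the names `OnlineLinearMeta` and `glocal_predict` are illustrative, not from our codebase:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class OnlineLinearMeta:
    """Stand-in for the Online SVM meta-learner (a perceptron here):
    its 2-d feature vector is the pair of base-model scores."""
    def __init__(self, lr=0.1):
        self.w = [0.0, 0.0]
        self.b = 0.0
        self.lr = lr
    def predict(self, scores):
        return +1 if dot(self.w, scores) + self.b >= 0 else -1
    def update(self, scores, y):
        # Online update on a feedback point: nudge weights on a mistake.
        if self.predict(scores) != y:
            self.w = [wi + self.lr * y * si for wi, si in zip(self.w, scores)]
            self.b += self.lr * y

def glocal_predict(meta, global_score, local_score):
    """glocal = stacked ensemble: base scores in, meta decision out."""
    return meta.predict([global_score, local_score])

# Assumed toy scores: (global_score, local_score, true_label).
feedback_stream = [(0.9, 0.8, +1), (-0.7, -0.9, -1),
                   (0.6, -0.2, +1), (-0.8, 0.1, -1)] * 25

meta = OnlineLinearMeta()
for g, l, y in feedback_stream:
    meta.update([g, l], y)

print(glocal_predict(meta, 0.7, 0.5))  # 1
```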
31. Last but not Least
● Handle drift
○ Periodically replace the model.
■ Shooting in the dark, especially when drifts are few and far between.
○ Better: detect whether a drift has actually occurred.
■ If it has, adapt to the changes.
■ 3 main algorithms:
● DDM (Gama et al., 2004)
● EDDM
● DDD
■ What about the old model? It knows the old concept, so keep it around in case the old
distribution lingers.
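A sketch of DDM (Gama et al., 2004), the first of the algorithms named above. It tracks the online error rate p and its standard deviation s = sqrt(p(1-p)/n) and fires at the paper's p + s ≥ p_min + 2·s_min (warning) and p + s ≥ p_min + 3·s_min (drift) thresholds; the 30-sample warm-up value here is an assumption.

```python
import math
import random

class DDM:
    """Sketch of the Drift Detection Method (Gama et al., 2004)."""
    def __init__(self, min_samples=30):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")
        self.min_samples = min_samples

    def add(self, error):
        """error: 1 if the classifier misclassified this point, else 0.
        Returns 'ok', 'warning', or 'drift'."""
        self.n += 1
        self.errors += error
        p = self.errors / self.n
        s = math.sqrt(p * (1 - p) / self.n)
        if self.n < self.min_samples:
            return "ok"
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s   # remember the best (lowest) regime
        if p + s >= self.p_min + 3 * self.s_min:
            return "drift"
        if p + s >= self.p_min + 2 * self.s_min:
            return "warning"
        return "ok"

# Simulate a stream: ~10% error for 500 points, then the error rate jumps.
random.seed(1)
ddm = DDM()
status = "ok"
for i in range(1000):
    err = 1 if random.random() < (0.1 if i < 500 else 0.6) else 0
    status = ddm.add(err)
    if status == "drift":
        print("drift detected at point", i)
        break
```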
33. Pros
• Improves running accuracy.
• Personalization: the notion of spam varies from brand to brand. Some
brands treat 'Hi', 'Hello' as spam, while others treat them as actionable.
The local model serves well as a per-user statistical model, thus bringing in user
personalization. Learning from feedback, the model adapts to the notions of each brand.
• It is lightweight and fast, thus easy to bootstrap, deploy, and scale.
34. Cons
• The PA-II decision boundary is a hyperplane that divides the feature space into 2 half-spaces.
• The margin of a data point is (proportional to) its distance from the hyperplane.
• An update produces a new hyperplane that remains as close as
possible to the current one while achieving at least a unit margin on the most
recent data point.
• Thus, incorporating a feedback point is nothing but shifting the hyperplane towards a unit
margin on that point.
• Let's see this visually.
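The update described above (Crammer et al., 2006, labels in {-1, +1}) can be written in a few lines. PA-II's aggressiveness parameter C caps the step, so the new hyperplane moves towards, but not necessarily all the way to, a unit margin:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pa2_update(w, x, y, C=1.0):
    """PA-II update: loss = max(0, 1 - y * (w . x));
    tau = loss / (||x||^2 + 1/(2C));  w_new = w + tau * y * x.
    This is the closest hyperplane to w that (approximately) achieves
    a unit margin on the feedback point (x, y)."""
    loss = max(0.0, 1.0 - y * dot(w, x))
    if loss == 0.0:
        return w                      # margin already >= 1: no shift
    tau = loss / (dot(x, x) + 1.0 / (2.0 * C))
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
w = pa2_update(w, [1.0, 0.0], +1)     # feedback point pulls the hyperplane
print(w)
print(dot(w, [1.0, 0.0]))             # margin moved towards 1, not all the way (C caps it)
```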
35. Cons
• This shifting of the hyperplane increases the model's accuracy on one class (the correct
label of the feedback point) while decreasing its accuracy on the other class.
• To verify this, split the test set into 2 chunks by class, and run the
local model on only 1 chunk. If the hypothesis above is true, then:
• The #feedbacks should be very small, and occur only in the initial part of the data set.
• The running accuracy should only increase.
36. • Changing the algorithm doesn't help much – most online learning classifiers in the current literature are linear.
37. Way Forward
• Instead of modeling the problem as classification, model it as ranking
(Gmail’s priority inbox does this).
• Actionable tweets are high in ranking, spam tweets are low in ranking.
• Actionable vs. Spam = finding a cut-off in the ranking.
• Incorporating feedback = updating the algorithm to get a better ranking
without getting biased towards one class.
• This is a work in progress.
38. Take Home
• Incorporating feedback is an important step in improving your model’s
performance.
• Global + Local is a great way to introduce personalization in ML.
• PA-II does well as local provided your data is such that most data points are far
from the decision hyperplane.
• For domains where distributions are continuously evolving, handling drift is a
must.
39. References
1. "Online Passive-Aggressive Algorithms" – Crammer et al., JMLR 2006
2. "The Learning Behind Gmail Priority Inbox" – Aberdeen et al., LCCC: NIPS Workshop 2010
3. "Learning with Drift Detection" – Gama et al., SBIA 2004
4. "Early Drift Detection Method" – Baena-García et al., IWKDSD 2006
5. "DDD: A New Ensemble Approach for Dealing with Concept Drift" – Minku et al., IEEE Transactions 2012
6. "Adaptive Regularization of Weight Vectors" – Crammer et al., NIPS 2009
7. "Exact Soft Confidence-Weighted Learning" – Wang et al., ICML 2012
8. LIBOL – A Library for Online Learning Algorithms. https://github.com/LIBOL/LIBOL
40. Thank You
Please feel free to reach out post this talk or on the interwebs.
@anujgupta82, @tanish2k
Anuj Gupta Saurabh Arora
Editor's Notes
Data points are often non-stationary or have means, variances and covariances that change over time. Non-stationary behaviors can be trends, cycles, random walks or combinations of the three.
If t_1, t_2, t_3 are the most common terms (in decreasing order of frequency) in the corpus and cf_i is the collection frequency of the i-th most common term, then cf_i is proportional to 1/i.