How many folders do you really need? Classifying email into a handful of categories.
1. +
How Many Folders Do You Really Need?
Classifying Email into a Handful of Categories
2014/1/23 (Fri.)
Chang Wei-Yuan @ MakeLab Group Meeting
Mihajlo Grbovic, Guy Halawi, Zohar Karnin, Yoelle Maarek
Yahoo Labs, CIKM '14
6. +
Introduction
• Current email traffic is dominated by non-spam, machine-generated email from:
  • Social networks
  • Commerce sites
  • Official institutions
11. +
Discovering Latent Categories
• All messages have the potential to be classified.
  • by retrieving the most popular folders from users
• This paper applies LDA to these "document folders" to find latent categories.
  • latent topics map to "latent categories"
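As a rough illustration of the idea on this slide, the sketch below runs a small collapsed Gibbs sampler for LDA over toy "document folders". Everything in it is an assumption made for illustration: the toy folder contents, the hyperparameters `alpha` and `beta`, the number of topics, and the function name `lda_gibbs`. The slide does not say how the paper represents folders or configures LDA.

```python
import random
from collections import Counter


def lda_gibbs(docs, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    n_dk = [[0] * k for _ in docs]        # topic counts per document
    n_kw = [Counter() for _ in range(k)]  # word counts per topic
    n_k = [0] * k                         # total word count per topic

    def bump(d, t, w, delta):
        n_dk[d][t] += delta
        n_kw[t][w] += delta
        n_k[t] += delta

    # Start from a random topic assignment for every word occurrence.
    z = [[rng.randrange(k) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            bump(d, z[d][i], w, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                bump(d, z[d][i], w, -1)  # take the word out of the counts
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in range(k)
                ]
                z[d][i] = rng.choices(range(k), weights=weights)[0]
                bump(d, z[d][i], w, +1)  # put it back under the new topic

    # Smoothed topic mixture per document and top words per topic.
    doc_topic = [
        [(n_dk[d][t] + alpha) / (len(doc) + k * alpha) for t in range(k)]
        for d, doc in enumerate(docs)
    ]
    top_words = [
        [w for w, c in n_kw[t].most_common(3) if c > 0] for t in range(k)
    ]
    return doc_topic, top_words


# Hypothetical "document folders": each folder is a bag of words.
folders = [
    ["flight", "hotel", "booking", "flight", "itinerary"],
    ["hotel", "booking", "flight", "boarding"],
    ["order", "shipping", "receipt", "order", "sale"],
    ["sale", "order", "receipt", "coupon"],
]
doc_topic, top_words = lda_gibbs(folders, k=2)
```

Each row of `doc_topic` is one folder's mixture over the latent topics; in the slide's terms, a topic that dominates many folders is a candidate "latent category".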
14. +
Discovering Latent Categories
• Our objective was to find a value of K (the number of topics) for which
  • each individual topic and the overall set of topics achieve significant coverage
• We further examined K = 6
  • good balance between total and individual coverage
17. +
Modeling Data
• Original method: each individual message is a single data point
  • various features are extracted from the message header and body
18. +
Modeling Data
• Extracting features
  • content features
    • the message subject and body
  • address features
    • sender email address, including the subdomain
  • behavioral features
    • sender's and recipient's actions over a given message
[Figure: a message (msg) with its fields: subject, body, action, time, sender address, domain]
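The three feature groups can be sketched as a single feature dictionary per message. This is a minimal illustration, not the paper's feature set: the `Message` class, the feature-name prefixes, and the bag-of-words tokenization are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    # Hypothetical message record; field names are assumptions.
    sender: str
    subject: str
    body: str
    actions: list = field(default_factory=list)  # e.g. ["read", "delete"]


def extract_features(msg):
    """Return content, address and behavioral features for one message."""
    features = {}
    # Content features: word counts over the subject and body.
    for token in (msg.subject + " " + msg.body).lower().split():
        key = "content:" + token
        features[key] = features.get(key, 0) + 1
    # Address features: full sender address and host part (with subdomain).
    sender = msg.sender.lower()
    features["address:sender=" + sender] = 1
    features["address:domain=" + sender.rpartition("@")[2]] = 1
    # Behavioral features: actions taken on the message.
    for action in msg.actions:
        features["action:" + action] = 1
    return features


example = extract_features(
    Message("News@Mail.Shop.example", "Big sale", "sale ends today", ["read"])
)
```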
19. +
Modeling Data
• Extended method: aggregating messages at higher levels
  • address/mail domain level
• This paper considers three levels of aggregation.
[Figure: message fields (subject, body, action, time, sender address, domain) aggregated at the sender level and at the domain level]
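Aggregation at the sender and domain levels can be sketched as summing per-message feature counts under a shared key. The function name `aggregate`, the `(sender, features)` input shape, and the choice of summation as the aggregation rule are assumptions; the slide does not state how the paper combines message features.

```python
from collections import Counter, defaultdict


def aggregate(messages, level):
    """Sum per-message feature counts by sender address or by mail domain."""
    if level not in ("sender", "domain"):
        raise ValueError("level must be 'sender' or 'domain'")
    totals = defaultdict(Counter)
    for sender, features in messages:
        sender = sender.lower()
        # The domain is everything after the last "@" in the address.
        key = sender if level == "sender" else sender.rpartition("@")[2]
        totals[key].update(features)  # adds counts feature by feature
    return dict(totals)


# Hypothetical per-message feature counts.
msgs = [
    ("deals@shop.example", {"sale": 2, "read": 1}),
    ("deals@shop.example", {"sale": 1, "delete": 1}),
    ("billing@shop.example", {"invoice": 1, "read": 1}),
]
by_sender = aggregate(msgs, "sender")
by_domain = aggregate(msgs, "domain")
```

Here two senders collapse into one domain-level data point, which is the effect the figure describes.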
23. +
Training Data
• Labeling techniques
  • the 6 latent categories are used as labels
  • we create a two-stage classifier at the msg level and the sender level
[Figure: a msg record (subject, action, …, sender, domain, category) and a sender record (sender, domain, category); the category is marked "known by LDA" in one case and "unknown" in the other]
29. +
Classification Mechanism
• Offline creation of the classified senders table and the message-level classifier
  • We use the training set to train a logistic regression model.
  • For each category we train a separate model in a one-vs-all manner.
• The classification process is run periodically to account for new senders.
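One-vs-all training with logistic regression can be sketched in plain Python as follows. This is a toy illustration under stated assumptions: the three-dimensional feature vectors, the learning rate, the epoch count, and the function names are invented here, and batch gradient descent stands in for whatever optimizer the paper actually used.

```python
import math


def sigmoid(x):
    # Numerically safe logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def score(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_binary(rows, targets, epochs=500, lr=0.5):
    """Batch gradient descent for one binary logistic regression model."""
    dim, n = len(rows[0]), len(rows)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(rows, targets):
            err = score(w, b, x) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def train_one_vs_all(rows, labels):
    """Train one model per category: that category versus all the others."""
    return {
        c: train_binary(rows, [1.0 if lab == c else 0.0 for lab in labels])
        for c in sorted(set(labels))
    }


def predict(models, x):
    """Pick the category whose model gives the highest probability."""
    return max(models, key=lambda c: score(*models[c], x))


# Hypothetical sender-level features: share of shopping, travel, finance words.
rows = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0.1, 0.9, 0], [0, 0, 1], [0, 0.1, 0.9]]
labels = ["shopping", "shopping", "travel", "travel", "financial", "financial"]
models = train_one_vs_all(rows, labels)
prediction = predict(models, [0.8, 0.1, 0.1])
```

Training one independent binary model per category keeps each model simple and lets categories be retrained separately, which fits the periodic offline refresh the slide mentions.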
30. +
Classification Mechanism
[Figure: 35% of senders form the training data and 65% the testing data; the diagram also shows two classifiers, the senders table, and msg training data]
32. +
Classification Mechanism
• Online light-weight classification
  • the initial classification
  • hard-coded rules designed to classify quickly
• This process requires very few resources and covers 32% of the email traffic.
33. +
Classification Mechanism
• Online sender-based classification
  • the second phase in our classification cascade
  • looks for senders with known categories
  • uses the senders table
• The amount of traffic not covered by this phase is roughly 8%.
34. +
Classification Mechanism
• Online heavy-weight classification
  • As only 8% of the traffic ends up in this last phase, we can afford slightly heavier computation in the classifier.
  • uses all relevant features pertaining to the message body, subject line and sender name
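The three online phases form a cascade in which each message stops at the first phase that can label it. The sketch below shows only that control flow. The reply rule, the senders table contents, and the keyword-based stand-in for the heavy-weight classifier are hypothetical; the slides do not list the actual rules or model.

```python
def classify(msg, rules, senders_table, heavy_classifier):
    """Three-phase cascade: rules, then sender lookup, then full classifier."""
    # Phase 1: light-weight hard-coded rules.
    for rule in rules:
        category = rule(msg)
        if category is not None:
            return category, "light-weight"
    # Phase 2: sender-based lookup in the offline senders table.
    category = senders_table.get(msg["sender"].lower())
    if category is not None:
        return category, "sender-based"
    # Phase 3: heavy-weight message-level classifier for what remains.
    return heavy_classifier(msg), "heavy-weight"


def reply_rule(msg):
    # Hypothetical rule: treat replies as human-generated.
    return "human" if msg["subject"].lower().startswith("re:") else None


def keyword_classifier(msg):
    # Stand-in for the trained message-level model (assumption).
    text = (msg["subject"] + " " + msg["body"]).lower()
    return "travel" if "flight" in text else "shopping"


table = {"deals@shop.example": "shopping"}  # hypothetical senders table
inbox = [
    {"sender": "alice@example.org", "subject": "Re: lunch", "body": "see you"},
    {"sender": "Deals@shop.example", "subject": "Big sale", "body": ""},
    {"sender": "new@air.example", "subject": "Your flight", "body": "boarding pass"},
]
results = [classify(m, [reply_rule], table, keyword_classifier) for m in inbox]
```

Ordering the phases from cheapest to most expensive means the costly classifier only runs on the small share of traffic that the first two phases leave unlabeled.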
41. +
Experiment
• This paper estimated the actual volume of machine-generated messages on a very large Yahoo mail dataset.
• This dataset was built for the purpose of this work
  • 6 months of email traffic
  • more than 500 billion messages
42. +
Experiment
• 5 sender-based classifiers for the machine latent categories
  • Shopping, Financial, Travel, Career and Social
• 1 sender-based classifier for machine versus human
47. +
Conclusion
• We presented a Web-scale categorization approach.
  • offline learning
  • online classification
• Discovered latent categories.
• Discriminated between human and machine-generated email.
• Built a scalable online system that can be applied to Web mail.