Understanding email traffic 
David Graus, University of Amsterdam 
d.p.graus@uva.nl 
@dvdgrs
Dec. 12, 2014 - Frontiers of Forensic Science 2 
Some background… 
• PhD candidate at ILPS 
• Information Extraction & Retrieval 
• Project in NWO’s Forensic Science program 
• Semantic Search in E-Discovery
Information Retrieval? 
Ò Finding material of unstructured nature from 
large collections
Information Extraction?
• Text mining
• Discovering patterns in text data
Semantic Search in E-Discovery? 
Semantic Search?
E-Discovery?
• Retrieving and securing digital forensic evidence
Semantic Search in E-Discovery 
• Supporting search for digital forensic evidence 
• from emails, hard drives, mobile phones, etc… 
• not the open web 
• (Google won’t help us here)
Search in E-Discovery
• Finding out who knew what, from whom, and when
• We don’t know what we’re looking for
• What we’re looking for might be deliberately hidden
• Communication might be very domain-specific, contextualized, or incomplete
Approach
• Generic search is not the answer
• Google: high-precision search
• E-Discovery: high-recall & exploratory search
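To make the precision/recall contrast concrete, here is a minimal sketch. The document IDs and both result lists are made up for illustration. Precision asks how much of what was returned is relevant; recall asks how much of what is relevant was returned.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved items against the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical collection: four documents are relevant to the investigation.
relevant = {"d1", "d2", "d3", "d4"}
web_style = ["d1", "d2"]                                # short, clean result list
discovery_style = ["d1", "d2", "d3", "d4", "d7", "d9"]  # broad result list

print(precision_recall(web_style, relevant))        # precision 1.0, recall 0.5
print(precision_recall(discovery_style, relevant))  # precision 4/6, recall 1.0
```

The broad list contains noise but misses nothing, which is the trade-off an investigator would typically prefer.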
Tasks
• Support iterative search
• Support (re)formulating questions and hypotheses
• Retrieve all relevant traces
Recipient recommendation 
Ò Given a sender, an email, all possible 
recipients (in an enterprise); 
Ò Predict which recipient(s) are most likely to 
receive the email 
Dec. 12, 2014 - Frontiers of Forensic Science 17
Dec. 12, 2014 - Frontiers of Forensic Science 18 
Why? 
Ò Understanding communication in/structure of an 
enterprise 
Ò Finding “unexpected” communication 
Ò Applications in: 
Ò enterprise search 
Ò expert finding 
Ò community detection 
Ò spam classification 
Ò anomaly detection
Dec. 12, 2014 - Frontiers of Forensic Science 19 
How? 
Ò Gmail 
Ò Who do you frequently “co-address” 
Ò egonetwork 
Ò Related work 
Ò Social Network Analysis (SNA) 
Ò Email content 
Ò Us 
Ò SNA + email content
Part 1: Social Network Analysis? 
d.p.graus@uva.nl z.ren@uva.nl 
derijke@uva.nl 
image by Calvinius - Creative Commons Attribution-Share Alike 3.0
SNA for predicting recipients?
1. Importance of a node in the network
   Prior probability
   More important people are more likely to be recipients of an(y) email
2. Connection strength between two nodes
   Conditional probability
   Given the sender, recipients who are strongly associated with the sender are more likely to receive the email
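As a rough illustration of the two SNA signals, the sketch below estimates a prior P(R) and a conditional P(S|R) from plain email counts. The toy log, the function names, and the choice to count only the sender-to-recipient direction are assumptions made for this example, not details taken from the talk.

```python
from collections import Counter

# Hypothetical toy log: (sender, [recipients]) pairs.
emails = [
    ("alice", ["bob", "carol"]),
    ("alice", ["bob"]),
    ("dave", ["bob"]),
    ("bob", ["alice"]),
]

received = Counter()  # emails received per node
sent_to = Counter()   # emails sent per (sender, recipient) pair
for sender, recipients in emails:
    for r in recipients:
        received[r] += 1
        sent_to[(sender, r)] += 1

def prior(r):
    """P(R): share of all received emails that went to r."""
    return received[r] / sum(received.values())

def conditional(s, r):
    """P(S|R): share of r's received emails that came from s."""
    return sent_to[(s, r)] / received[r] if received[r] else 0.0

print(prior("bob"))                 # bob received 3 of 5 emails
print(conditional("alice", "bob"))  # 2 of bob's 3 emails came from alice
```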
Part 2: Email content
• Statistical Language Models (LMs)
  • Assign a probability to [a sequence of] words
  • By counting words
• Used in lots of places:
  • Web search
  • Machine translation
  • Speech recognition
Language Models 
Ò Language models as communication “profiles” 
1. Incoming LM (how people talk to user) 
2. Outgoing LM (how user talks to people) 
3. Interpersonal LM (how node1 
talks with node2) 
4. Corpus LM (how everyone 
talks)
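The four profiles differ only in which emails feed their word counts. A minimal sketch, assuming one recipient per email and treating a pair of nodes as unordered for the interpersonal profile (both are assumptions of this example, as are the toy emails):

```python
from collections import Counter, defaultdict

# Hypothetical toy emails: (sender, recipient, text).
emails = [
    ("alice", "bob", "budget draft attached"),
    ("bob", "alice", "thanks for the draft"),
    ("carol", "bob", "lunch today"),
]

incoming = defaultdict(Counter)       # how people talk to a user
outgoing = defaultdict(Counter)       # how a user talks to people
interpersonal = defaultdict(Counter)  # how one pair of nodes talks
corpus = Counter()                    # how everyone talks

for sender, recipient, text in emails:
    words = text.lower().split()
    incoming[recipient].update(words)
    outgoing[sender].update(words)
    interpersonal[frozenset((sender, recipient))].update(words)
    corpus.update(words)

print(incoming["bob"])                                # words addressed to bob
print(interpersonal[frozenset(("alice", "bob"))])     # words exchanged by the pair
```

Dividing each count by the profile's total turns these counters into word distributions.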
Why language models?
• Comparisons between communication profiles:
  • Find nodes with the most similar communication
Model
• Given sender and email, predict recipients
• Ranking function:
  • Email likelihood: estimated using language modeling
  • Sender likelihood: using SNA to estimate the closeness of R and S
  • Recipient likelihood: using SNA to estimate the importance of R
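The ranking formula itself did not survive in this transcript. Assuming the three components are combined via Bayes' rule (an assumption that fits the email, sender, and recipient likelihoods named on this slide, not something the slide text states), scoring a candidate recipient R for sender S and email E would look like:

```latex
P(R \mid S, E) \;\propto\; P(E \mid R, S)\, P(S \mid R)\, P(R)
```

The normalizer P(S, E) is the same for every candidate, so it can be dropped when only the ranking matters.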
Email likelihood
• P(word|R,S): interpersonal LM
• P(word|R): recipient LM
• P(word): corpus LM
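One way to combine the three word distributions is linear interpolation. The sketch below assumes that form, and assumes that the 60%-20%-20% figure reported in the findings maps to the interpersonal, recipient, and corpus models in that order; neither is confirmed by the slide text. The function name and toy distributions are hypothetical.

```python
import math

def email_log_likelihood(words, interpersonal, recipient, corpus,
                         weights=(0.6, 0.2, 0.2)):
    """Log P(E|R,S) under a linear mixture of three unigram models (word -> prob)."""
    w_i, w_r, w_c = weights
    total = 0.0
    for word in words:
        p = (w_i * interpersonal.get(word, 0.0)
             + w_r * recipient.get(word, 0.0)
             + w_c * corpus.get(word, 0.0))
        if p == 0.0:
            return float("-inf")  # word unseen even in the corpus model
        total += math.log(p)
    return total

# Hypothetical toy distributions.
corpus_lm = {"budget": 0.25, "draft": 0.25, "lunch": 0.25, "today": 0.25}
recipient_lm = {"budget": 0.5, "lunch": 0.5}
interpersonal_lm = {"budget": 0.5, "draft": 0.5}

score = email_log_likelihood(["budget", "draft"],
                             interpersonal_lm, recipient_lm, corpus_lm)
print(score)
```

The corpus model acts as a fallback: a word the pair never used still gets a small probability as long as someone in the collection used it.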
Sender likelihood P(S|R)
Strength of connection between two nodes:
1. Number of emails sent between nodes
2. Number of times two nodes are addressed together
Recipient likelihood P(R)
Importance of node:
1. Number of emails received
2. PageRank score
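For the PageRank score as a node-importance measure, a small power-iteration sketch over sender-to-recipient edges might look as follows. The damping factor of 0.85, the fixed iteration count, and the toy edge list are assumptions of this example; the talk does not give its PageRank settings.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on a directed graph of (sender, recipient) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            targets = out_links[src] or nodes  # dangling nodes spread rank evenly
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank

# Hypothetical toy network: three people email bob, bob emails alice.
edges = [("alice", "bob"), ("carol", "bob"), ("dave", "bob"), ("bob", "alice")]
scores = pagerank(edges)
print(max(scores, key=scores.get))
```

Unlike a raw count of received emails, this also rewards alice for being the person the most-contacted node writes to.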
SNA
1. Importance of a node in the network
2. Strength of connection between nodes
Email content
1. Interpersonal LM
2. Recipient LM
3. Corpus LM
Approach: time-based
Training period: build models (SNA + LM)
Testing period: predict recipients
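A time-based setup means the models only ever see emails sent before the ones they are tested on. A minimal sketch of such a split, with a hypothetical `sent` field and cutoff date:

```python
from datetime import datetime

def split_by_time(emails, cutoff):
    """Split emails into (training, testing); testing starts at `cutoff`."""
    training = [e for e in emails if e["sent"] < cutoff]
    testing = [e for e in emails if e["sent"] >= cutoff]
    return training, testing

# Hypothetical emails with made-up timestamps.
emails = [
    {"id": 1, "sent": datetime(2001, 1, 15)},
    {"id": 2, "sent": datetime(2001, 3, 2)},
    {"id": 3, "sent": datetime(2001, 6, 20)},
]
training, testing = split_by_time(emails, cutoff=datetime(2001, 4, 1))
print([e["id"] for e in training], [e["id"] for e in testing])
```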
Testing
• Remove recipients from email
• Rank all nodes in the network by computing:
  1. P(E|R,S): similarity between sender and candidate LMs
  2. P(S|R): strength of connection between sender and candidate
  3. P(R): importance of candidate
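Putting the three scores together, ranking could be sketched as below. The sketch assumes the scores are multiplied (summed in log space), and it takes the three estimators as plain callables; the lookup tables standing in for trained models are invented for illustration.

```python
import math

def rank_candidates(sender, candidates, email_loglik, sender_lik, recipient_prior):
    """Rank candidates by log P(E|R,S) + log P(S|R) + log P(R), highest first."""
    def score(r):
        p_s = sender_lik(sender, r)
        p_r = recipient_prior(r)
        if p_s <= 0.0 or p_r <= 0.0:
            return float("-inf")  # no smoothing in this sketch
        return email_loglik(r) + math.log(p_s) + math.log(p_r)
    return sorted(candidates, key=score, reverse=True)

# Hypothetical precomputed scores standing in for trained models.
loglik = {"bob": -2.0, "carol": -5.0, "dave": -1.0}
s_lik = {("alice", "bob"): 0.5, ("alice", "carol"): 0.3, ("alice", "dave"): 0.0}
prior = {"bob": 0.6, "carol": 0.2, "dave": 0.2}

ranking = rank_candidates("alice", ["bob", "carol", "dave"],
                          loglik.get, lambda s, r: s_lik[(s, r)], prior.get)
print(ranking)
```

Here dave has the best content match but no recorded connection to the sender, so he drops to the bottom; a smoothed estimate would soften that effect.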
Findings: What works? 
Ò Importance of node: 
Number of received emails of node 
Pagerank 
Ò Strength of connection: 
Number of emails between nodes 
Number of times co-addressed 
Ò LM Similarity: 
Interpersonal LM is most important (60%-20%-20%)
Analysis: SNA vs email content 
• SNA:
  • SNA signals deteriorate over time
  • SNA signals are most informative for highly active users
• Email content:
  • LM signal improves over time
  • LM signal does worse with highly active users
Finally 
Ò Combining Social Network Analysis with 
Language Modeling is better than doing either.
Dec. 12, 2014 - Frontiers of Forensic Science 43 
Future work 
Ò Consider structure of network in more detail 
Ò Departments? 
Ò Friends/family? 
Ò Include ‘time decay’ 
Ò Dynamically weight LM/SNA?
Applications in E-Discovery/Digital Forensics 
• Anomaly detection
  • Given a working prediction model, identify “unexpected” communication
• Language models for communication
  • For a node, find the most different interpersonal communication
    • Friends/family vs colleagues?
  • Find communication that differs from the corpus-based communication
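One simple reading of "unexpected" communication is an email whose actual recipient the model did not rank near the top. The sketch below follows that reading; the top-k threshold, the email fields, and the stand-in ranking function are all hypothetical.

```python
def unexpected_emails(emails, rank_fn, top_k=3):
    """Return emails with at least one actual recipient outside the top-k predictions."""
    flagged = []
    for email in emails:
        predicted = rank_fn(email)[:top_k]
        if any(r not in predicted for r in email["recipients"]):
            flagged.append(email)
    return flagged

def toy_rank(email):
    """Stand-in for a trained ranking model; returns a fixed ranking."""
    return ["bob", "carol", "dave", "erin"]

emails = [
    {"id": 1, "sender": "alice", "recipients": ["bob"]},
    {"id": 2, "sender": "alice", "recipients": ["erin"]},
]
flagged = unexpected_emails(emails, toy_rank, top_k=2)
print([e["id"] for e in flagged])
```

Flagged emails are candidates for review, not findings in themselves: a low-ranked recipient may simply be a new colleague.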
Fin
• Questions?

Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 

Understanding Email Traffic

  • 1. Understanding email traffic David Graus, University of Amsterdam d.p.graus@uva.nl @dvdgrs
  • 2. Dec. 12, 2014 - Frontiers of Forensic Science. Some background… • PhD candidate at ILPS • Information Extraction & Retrieval • Project in NWO’s Forensic Science program • Semantic Search in E-Discovery
  • 4. Information Retrieval?
  • 5. Information Retrieval? • Finding material of unstructured nature from large collections
  • 6. Information Extraction? • Text mining • Discovering patterns in text data
  • 7. Semantic Search in E-Discovery?
  • 8. Semantic Search?
  • 9. E-Discovery? • Retrieving and securing digital forensic evidence
  • 10. E-Discovery • Semantic Search in E-Discovery
  • 11. Semantic Search in E-Discovery • Supporting search for digital forensic evidence • from emails, hard drives, mobile phones, etc. • not the open web • (Google won’t help us here)
  • 12. Search in E-Discovery • Finding out who knew what, from whom, and when • We don’t know what we’re looking for • What we’re looking for might be deliberately hidden • Communication might be very domain-specific, contextualized or incomplete
  • 13. Approach • Generic search is not the answer • Google: high precision search • E-Discovery: high recall & exploratory search
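
The precision/recall contrast on this slide can be made concrete with a small sketch. The document ids and the `precision_recall` helper below are illustrative assumptions, not something taken from the talk.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection in which four emails are relevant to the case.
relevant = {"e1", "e2", "e3", "e4"}

# A narrow result list: everything returned is relevant, but most evidence is missed.
narrow = precision_recall({"e1"}, relevant)

# A broad result list: all evidence is found, at the cost of some noise.
broad = precision_recall({"e1", "e2", "e3", "e4", "e9", "e10"}, relevant)
```

In an e-discovery setting the second result list would usually be preferred, since missing a relevant trace is more costly than reviewing a few irrelevant ones.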
  • 14. Tasks • Support iterative search • Support (re)formulating questions and hypotheses • Retrieve all relevant traces
  • 17. Recipient recommendation • Given a sender, an email, all possible recipients (in an enterprise) • Predict which recipient(s) are most likely to receive the email
  • 18. Why? • Understanding communication in/structure of an enterprise • Finding “unexpected” communication • Applications in: enterprise search, expert finding, community detection, spam classification, anomaly detection
  • 19. How? • Gmail: who do you frequently “co-address” (egonetwork) • Related work: Social Network Analysis (SNA) or email content • Us: SNA + email content
  • 20. Part 1: Social Network Analysis? d.p.graus@uva.nl z.ren@uva.nl derijke@uva.nl
  • 21. Image by Calvinius - Creative Commons Attribution-Share Alike 3.0
  • 22. SNA for predicting recipients? 1. Importance of a node in the network (prior probability): more important people are more likely to be recipients of an(y) email 2. Connection strength between two nodes (conditional probability): given the sender, candidates who are strongly associated with the sender are more likely to be recipients
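
As a minimal sketch of these two signals, the snippet below estimates a prior P(R) and a conditional P(S|R) from raw email counts. The toy email log and the function names are assumptions made for illustration; the talk does not specify how the probabilities were normalized.

```python
from collections import Counter

# Hypothetical email log: (sender, [recipients]) pairs.
emails = [
    ("alice", ["bob", "carol"]),
    ("alice", ["bob"]),
    ("dave", ["bob"]),
    ("bob", ["alice"]),
]

received = Counter()   # number of emails received per node
pair_sent = Counter()  # number of emails sent from a sender to a recipient
for sender, recipients in emails:
    for r in recipients:
        received[r] += 1
        pair_sent[(sender, r)] += 1

total_received = sum(received.values())


def prior(r):
    """P(R): share of all delivered emails that went to r (node importance)."""
    return received[r] / total_received


def sender_given_recipient(s, r):
    """P(S|R): share of r's incoming emails that came from s (connection strength)."""
    return pair_sent[(s, r)] / received[r] if received[r] else 0.0
```

With this toy log, bob is the most "important" node because he receives three of the five delivered emails.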
  • 23. Part 2: Email content • Statistical Language Models (LMs) • Assign a probability to [a sequence of] words, by counting words • Used in lots of places: web search, machine translation, speech recognition
  • 29. Language Models • Language models as communication “profiles” 1. Incoming LM (how people talk to user) 2. Outgoing LM (how user talks to people) 3. Interpersonal LM (how node1 talks with node2) 4. Corpus LM (how everyone talks)
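
The four profiles can be sketched as unigram language models built by counting words. This is an illustration only: the toy emails, the whitespace tokenization, and the choice to key the interpersonal model on an unordered pair of nodes are all assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical emails: (sender, recipient, text).
emails = [
    ("alice", "bob", "budget meeting tomorrow"),
    ("bob", "alice", "budget approved"),
    ("alice", "carol", "lunch tomorrow"),
]

incoming = defaultdict(Counter)       # how people talk to a user
outgoing = defaultdict(Counter)       # how a user talks to people
interpersonal = defaultdict(Counter)  # how two nodes talk with each other
corpus = Counter()                    # how everyone talks

for sender, recipient, text in emails:
    words = text.lower().split()
    incoming[recipient].update(words)
    outgoing[sender].update(words)
    interpersonal[frozenset((sender, recipient))].update(words)
    corpus.update(words)


def probability(counts, word):
    """Maximum-likelihood unigram estimate: count(word) / total words."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0
```

For example, `probability(interpersonal[frozenset(("alice", "bob"))], "budget")` is higher than `probability(corpus, "budget")`, reflecting that this pair talks about budgets more than the organization as a whole.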
  • 30. Why language models? • Comparisons between communication profiles: find nodes with most similar communication
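
One way to compare two communication profiles is a divergence between their word distributions. The slides do not say which measure was used; the sketch below uses Jensen-Shannon divergence purely as an assumed stand-in, with hypothetical profiles.

```python
import math
from collections import Counter


def to_distribution(counts):
    """Normalize word counts into a probability distribution."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2): 0 for identical profiles, 1 for disjoint ones."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl(a):
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical profiles built from word counts.
profiles = {
    "bob": to_distribution(Counter("budget meeting budget report".split())),
    "carol": to_distribution(Counter("budget report meeting".split())),
    "dave": to_distribution(Counter("lunch football weekend".split())),
}


def most_similar(name):
    """Return the other node whose profile has the lowest divergence from name's."""
    others = (n for n in profiles if n != name)
    return min(others, key=lambda n: js_divergence(profiles[name], profiles[n]))
```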
  • 31. Model • Given sender and email, predict recipients • Ranking function:
  • 32. Email likelihood: estimated using language modeling • Sender likelihood: using SNA to estimate closeness of R and S • Recipient likelihood: using SNA to estimate importance of R
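
The ranking function itself is not reproduced in this transcript. Given the three components named on the slide, one consistent reading (an assumption, not a quotation of the slide) is Bayes' rule applied to the probability of a recipient R given sender S and email E, where the denominator is the same for every candidate and can be dropped for ranking:

```latex
\begin{align}
P(R \mid S, E) &= \frac{P(E \mid R, S)\, P(S \mid R)\, P(R)}{P(S, E)} \\
               &\propto P(E \mid R, S)\, P(S \mid R)\, P(R)
\end{align}
```

Here P(E|R,S) is the email likelihood, P(S|R) the sender likelihood, and P(R) the recipient likelihood.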
  • 33. Email likelihood
  • 34. Email likelihood • P(word|R,S) • P(word|R) • P(word)
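
A plausible way to combine these three word probabilities is linear interpolation, summed in log space over the words of the email. This is a sketch under assumptions: the slides do not show the exact formula, and the weights below simply reuse the 60%-20%-20% split from the findings slide in the assumed order interpersonal, recipient, corpus.

```python
import math
from collections import Counter

# Assumed mixing weights, in the order interpersonal / recipient / corpus.
WEIGHTS = (0.6, 0.2, 0.2)


def p(counts, word):
    """Maximum-likelihood unigram probability from raw word counts."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


def email_log_likelihood(words, interpersonal, recipient, corpus, weights=WEIGHTS):
    """log P(E|R,S): sum over words of the log of the interpolated word probability."""
    l1, l2, l3 = weights
    score = 0.0
    for w in words:
        mixed = l1 * p(interpersonal, w) + l2 * p(recipient, w) + l3 * p(corpus, w)
        if mixed == 0.0:  # word unseen even in the corpus model
            return float("-inf")
        score += math.log(mixed)
    return score


# Hypothetical word counts for one (recipient, sender) pair.
corpus = Counter("budget meeting lunch report budget tomorrow".split())
recipient = Counter("budget report".split())
interpersonal = Counter("budget budget".split())

on_topic = email_log_likelihood(["budget"], interpersonal, recipient, corpus)
off_topic = email_log_likelihood(["lunch"], interpersonal, recipient, corpus)
```

The corpus model acts as a fallback, so a word the pair has never exchanged (here "lunch") still receives a small non-zero probability.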
  • 35. Recipient likelihood P(R): importance of node 1. Number of emails received 2. PageRank score • Sender likelihood P(S|R): strength of connection between two nodes 1. Number of emails sent between nodes 2. Number of times two nodes are addressed together
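
For the PageRank-based importance score, a minimal power-iteration sketch is shown below. The damping factor, iteration count, edge list, and the treatment of nodes with no outgoing mail are assumptions; the talk does not describe its PageRank configuration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over directed (sender, recipient) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for sender, recipient in edges:
        out_links[sender].append(recipient)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = out_links[n] or nodes  # node with no outgoing mail: spread evenly
            share = damping * rank[n] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical graph with one edge per email sent.
edges = [("alice", "bob"), ("carol", "bob"), ("dave", "bob"), ("bob", "alice")]
scores = pagerank(edges)
```

Unlike a plain count of received emails, this score also rewards receiving mail from nodes that are themselves important.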
  • 36. SNA 1. Importance of a node in the network 2. Strength of connection between nodes • Email content 1. Interpersonal LM 2. Recipient LM 3. Corpus LM
  • 37. Approach: time-based • Training period: build models (SNA + LM) • Testing period: predict recipients
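
A time-based split of this kind might look like the following sketch. The timestamps, the cutoff date, and the `split_by_time` helper are hypothetical.

```python
from datetime import datetime

# Hypothetical emails: (timestamp, sender, recipients).
emails = [
    (datetime(2001, 1, 5), "alice", ["bob"]),
    (datetime(2001, 2, 9), "bob", ["carol"]),
    (datetime(2001, 3, 2), "alice", ["carol", "bob"]),
    (datetime(2001, 3, 20), "dave", ["alice"]),
]


def split_by_time(emails, cutoff):
    """Emails before the cutoff train the models; the rest are held out for testing."""
    ordered = sorted(emails, key=lambda e: e[0])
    train = [e for e in ordered if e[0] < cutoff]
    test = [e for e in ordered if e[0] >= cutoff]
    return train, test


train, test = split_by_time(emails, cutoff=datetime(2001, 3, 1))
```

Splitting on time rather than at random keeps the models from seeing communication that happened after the emails they are asked to predict.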
  • 38. Testing • Remove recipients from email • Rank all nodes in the network by computing: 1. P(E|R,S): similarity between sender and candidate LMs 2. P(S|R): strength of connection between sender and candidate 3. P(R): importance of candidate
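
The ranking step can be sketched as follows, assuming the three component scores have already been computed for each candidate. The function name, the log-space combination, and all numbers are illustrative assumptions.

```python
import math


def rank_recipients(sender, candidates, email_loglik, sender_lik, recipient_prior):
    """Rank candidates by log P(E|R,S) + log P(S|R) + log P(R), best first."""
    scored = []
    for r in candidates:
        if r == sender:
            continue  # the sender is not a candidate recipient
        p_s = sender_lik.get((sender, r), 0.0)
        p_r = recipient_prior.get(r, 0.0)
        if p_s > 0.0 and p_r > 0.0:
            score = email_loglik[r] + math.log(p_s) + math.log(p_r)
        else:
            score = float("-inf")
        scored.append((score, r))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored]


# Hypothetical precomputed component scores for one email sent by "alice".
email_loglik = {"bob": -2.0, "carol": -1.5, "dave": -6.0}  # log P(E|R,S)
sender_lik = {("alice", "bob"): 0.5, ("alice", "carol"): 0.1,
              ("alice", "dave"): 0.05}                      # P(S|R)
recipient_prior = {"bob": 0.5, "carol": 0.3, "dave": 0.2}   # P(R)

ranking = rank_recipients("alice", ["alice", "bob", "carol", "dave"],
                          email_loglik, sender_lik, recipient_prior)
```

In this toy example carol has the best content match, but bob is ranked first because the network signals favor him.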
  • 40. Findings: What works? • Importance of node: number of emails received by node, PageRank • Strength of connection: number of emails between nodes, number of times co-addressed • LM similarity: interpersonal LM is most important (60%-20%-20%)
  • 41. Analysis: SNA vs email content • SNA: SNA signals deteriorate over time; SNA signals are most informative on highly active users • Email content: LM signal improves over time; LM signal does worse with highly active users
  • 42. Finally • Combining Social Network Analysis with Language Modeling is better than using either alone.
  • 43. Future work • Consider structure of network in more detail: departments? friends/family? • Include ‘time decay’ • Dynamically weight LM/SNA?
  • 44. Applications in E-Discovery/Digital Forensics • Anomaly detection: given a working prediction model, identify “unexpected” communication • Language models for communication: for a node, find the most different interpersonal communication (friends/family vs colleagues?); find communication that differs from the corpus-based communication
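
The anomaly-detection idea can be illustrated with a small sketch: if an email's actual recipient falls outside the model's top predictions, the communication is flagged as "unexpected". The `top_k` threshold, the function name, and the example ranking are assumptions, not details from the talk.

```python
def unexpected_recipients(ranking, actual_recipients, top_k=3):
    """Return actual recipients that the model did not place in its top_k predictions."""
    expected = set(ranking[:top_k])
    return [r for r in actual_recipients if r not in expected]


# Hypothetical model output for one email, best candidate first.
ranking = ["bob", "carol", "dave", "erin", "frank"]

# The email actually went to bob (expected) and frank (not expected).
flagged = unexpected_recipients(ranking, ["bob", "frank"], top_k=3)
```

Flagged emails are candidates for manual review rather than conclusions; a low-ranked recipient may simply reflect a new working relationship.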
  • 45. Fin • Questions?