Old Dominion University
Department of Computer Science
Hany SalahEldeen
Hany SalahEldeen Khalil hany@cs.odu.edu
Zen & the Art of Data Mining
07-08-14
Social Media Data Collection and the path to
Modeling & Predicting User Intention
Web Science & Digital Libraries Lab
1
Before we start..
here is a lil bit about me…
Hany SalahEldeen 2
Hany SalahEldeen
Education:
• PhD Candidate
• Web Science and Digital Libraries Group
• Masters Degree in Computer Vision and Artificial Intelligence
• Universitat Autonoma de Barcelona
• Bachelors of Computer Systems Engineering
• University of Alexandria
Hany SalahEldeen 3
Research & Technical Experience
• Microsoft Research Cairo
• Google GmbH Zurich
• Microsoft Inc. Mountain View
• National University of Singapore
Hany SalahEldeen 4
Hany SalahEldeen
Detecting, Modeling, & Predicting
User Temporal Intention
in Social Media
Web Mining Pattern Analysis Machine Learning
Human Behavioral Analysis
Social Media Analysis
So what am I investigating?
5
Publications
Hany SalahEldeen
Shanghai CIKM 2014 Conference
- 1 first author paper
- 1 second author paper
London DL 2014 Conference
- 1 third author paper
Malta TPDL 2013 Conference
- 1 first author paper
6
Publications
Hany SalahEldeen
Indianapolis JCDL 2013 Conference
- 1 first author paper
Rio de Janeiro WWW 2013 Conference
- 1 first author paper
Cyprus TPDL 2012 Conference
- 1 first author paper
7
Besides the perks of
travelling, our research has
been popular…
Hany SalahEldeen 8
MIT Technology Review
Hany SalahEldeen 9
MIT Technology Review
Hany SalahEldeen 10
MIT Technology Review
Hany SalahEldeen 11
Mashable
Hany SalahEldeen 12
Popular Mechanics
Hany SalahEldeen 13
BBC
Hany SalahEldeen 14
The Virginian Pilot
Hany SalahEldeen 15
Our Research’s Popularity
Hany SalahEldeen
• Local newspaper: The Virginian Pilot
• 4 x MIT Technology Review
• BBC
• Mashable
• The Atlantic
• Yahoo News
• Articles in > 11 different languages
• We have been called:
• The Internet Archeologists
• Web Time Travelers
16
My goal:
Detect, model, and predict
user intention in social media
Hany SalahEldeen 17
Ok hold on, let’s go back to
the basics…
Hany SalahEldeen 18
Web 2.0
Definition: Web 2.0 is a concept that
takes the network as a platform for
information sharing, interoperability,
user-centered design, and collaboration
on the World Wide Web.*
* http://en.wikipedia.org/wiki/Web_2.0
Hany SalahEldeen 19
Web 2.0
• Yes, Web 2.0 is about “user-generated
content”
• But explicit content contributed by
users is just 20% of what “matters”
• 80% is in the implicitly contributed
data*
Hany SalahEldeen 20
*Toby Segaran, Programming Collective Intelligence, 2007
Systems & Web 2.0
• Google: Utilizes PageRank which is a
technique for extracting intelligence from
the link structure
• Flickr: Utilizes “interestingness” algorithm
• Amazon: Utilizes “people who bought this
product also bought” feature
• Pandora: Utilizes “similar artist radio”
• eBay: Utilizes “reputation system”
Hany SalahEldeen 21
So why do we even care
about all that?
Hany SalahEldeen 22
Power to the People!
Hany SalahEldeen 23
Power to the People!
• Because analyzing a huge dataset of
millions of users will yield a lot of potential
insights into:
• User Experience
• Marketing
• Personal Taste
• Human Behavior in general.
Hany SalahEldeen 24
So what is Data Mining?
Hany SalahEldeen 25
Data Mining
• Definition: It is the computational process of
discovering patterns in large data
sets involving methods at the intersection
of artificial intelligence, machine
learning, statistics, and database systems. The
overall goal of the data mining process is to
extract information from a data set and
transform it into an understandable structure
for further use.
http://en.wikipedia.org/wiki/Data_mining
Hany SalahEldeen 26
Back to my goal:
Hany SalahEldeen
Detecting, Modeling, & Predicting
User Temporal Intention
in Social Media
27
Let’s break down the title first…
Hany SalahEldeen
Detecting, Modeling, & Predicting
User Temporal Intention
in Social Media
28
Scenario 1:
Jenny reading Jeff’s tweets
Hany SalahEldeen 30
Michael Jackson Dies
Hany SalahEldeen
Snapshot on: June 25th 2009
http://web.archive.org/web/20090625232522/http://www.cnn.com/
31
Jeff tweets about it…
Hany SalahEldeen
Published on: June 25th 2009
https://twitter.com/mdnitehk/status/2333993907
32
Jeff’s friend Jenny was on a vacation in Hawaii for a
month
Jenny is off the grid…
Hany SalahEldeen 33
When she came back she checked Jeff’s tweets and
was shocked!
Jenny starts catching up a month later
Hany SalahEldeen
Read on: July 26th 2009
https://twitter.com/mdnitehk/status/2333993907
34
She quickly clicked on the link in the tweet…
Jenny follows the link on July 26th
Hany SalahEldeen
http://web.archive.org/web/20090726234411/http://www.cnn.com/
CNN page on:
July 26th 2009
35
• Implication:
• Jenny thought Jeff was making a joke about her
favorite singer and she got mad at him
• Problem:
• The tweet and the resource the tweet links
to have become unsynchronized.
Jenny is confused!
Hany SalahEldeen 36
Scenario 2:
The Egyptian Revolution
Hany SalahEldeen 37
The Egyptian Revolution Jan 2011
Hany SalahEldeen 38
Reading about it in Storify.com a year
later in March 2012
Hany SalahEldeen
http://storify.com/maq4sure/egypts-revolution
39
I noticed some shared images are missing
Hany SalahEldeen
http://storify.com/maq4sure/egypts-revolution
40
Some tweets are still intact
Hany SalahEldeen
https://twitter.com/miss_amy_qb/status/32477898581483521
41
…and some lost their meaning with
the disappearance of the images
Hany SalahEldeen
Missing ?
https://twitter.com/aishes/status/32485352102952960
https://twitter.com/omar_chaaban/status/32203697597452289
42
The tweet remains but the shared
image disappeared…
Hany SalahEldeen
http://yfrog.com/h5923xrvbqqvgzj
43
• Implication:
• The reader cannot understand what the
author of the tweet meant because the image
is not available.
• Problem:
• The post is available but the linked resource
(image) is completely missing.
Cairo….we have a problem!
Hany SalahEldeen 44
…back to the title
Hany SalahEldeen
Detecting, Modeling, & Predicting
User Temporal Intention
in Social Media
45
47
The Anatomy of a Tweet
Hany SalahEldeen 47
48
The Anatomy of a Tweet
• Author’s username
• Other user mention
• Tweet body
• Hashtag
• Shortened URL to resource
• Publishing timestamp
• Social post
• Shared resource
• Interaction options
Hany SalahEldeen 48
49
3 URIs = 3 Chances to fail
Hany SalahEldeen
http://news.blogs.cnn.com/2012/04/26/norwegians-sing-to-annoy-mass-killer/
https://twitter.com/KentEiler/status/195535749754527745
49
50
[Figure: timeline of timestamps t1, t2, t3, … tn]
Explanation in MJ’s example
50
51
If I click on a link in a tweet, which
version should I get?
t_tweet or t_click?
Hany SalahEldeen 51
52
Sometimes you want a
previous version
The Correct Temporal
Intention
CNN.com at the closest time to the tweet: 25th June 2009 ~ 7pm
Hany SalahEldeen 52
53
Sometimes you want the
current version
The Correct Temporal
Intention
In this case the current state of the press releases page
Hany SalahEldeen 53
54
Research Question
Can we estimate the users’
intention at the time of posting
and reading to predict and
maintain temporal consistency?
Hany SalahEldeen 54
55
People rely on social media for the most
up-to-date information
Hany SalahEldeen 55
Hany SalahEldeen
So if you are posting a tweet about
your cat…
…No one cares!
56
Hany SalahEldeen
Regardless of how cool your cat is!
57
All tweets are equal…
…but some are more equal than the others
Hany SalahEldeen 58
Preliminary Research Questions:
1. How long would these last?
2. And if lost, are they archived?
3. Is this what the author intended?
Hany SalahEldeen 59
60
Since tweets are considered the first draft
of history… the historical integrity of the
tweets could be compromised.
Hany SalahEldeen
Historical Integrity
60
66
The life cycle of a social post
The author tweets a post that links to a resource.
What the reader receives:
• Same state the author intended
• The resource has disappeared
• The resource has changed
Hany SalahEldeen 66
67
The Resource’s Possibilities
What the reader receives:
• Same state the author intended
• The resource has disappeared
• The resource has changed: a bigger problem since the reader might not know.
Hany SalahEldeen 67
68
We could lose the linked resource
Hany SalahEldeen 68
69
The attack on the embassy was in February
2013
Or the resource could change
Hany SalahEldeen 69
70
Why do we want to detect the
Author’s Temporal Intention?
• Match: and convey the intended information.
• Notify:
– the author that the resource is prone to change.
– the reader that the resource has changed.
• Preserve: the resource by pushing snapshots into the
archive automatically.
• Retrieve: the closest archived version to maintain the
consistency.
Hany SalahEldeen 70
71
Our investigation angles
1. The state of the archived content
2. The age of the shared resource
3. The states of the resource:
1. Missing from the live web
2. Changed from what the author intended to share
4. Detect the author’s intention and collect a dataset
5. Model this intention
6. Create a time-based navigation tool to match the
predicted intention
Hany SalahEldeen 71
73
Estimating Web Archiving Coverage
• Goal: Estimate how much of the public web is present in the public archives
and how many copies are available.
• Action:
– Getting 4 different datasets from 4 different sources:
• Search Engines Indices
• Bit.ly
• DMOZ
• Delicious.
• Results: 16%-79% archived, depending on the source
• Publications:
– How much of the web is archived? JCDL '11
– http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-archived.html
Hany SalahEldeen
73
74
Our investigation angles
1. The state of the archived content
2. The age of the shared resource
3. The states of the resource:
1. Missing from the live web
2. Changed from what the author intended to share
4. Detect the author’s intention and collect a dataset
5. Model this intention
6. Create a time-based navigation tool to match the
predicted intention
Hany SalahEldeen 74
75
The timeline of the resource
Hany SalahEldeen 75
http://ws-dl.blogspot.com/2013/04/2013-04-19-carbon-dating-web.html
76
Timestamps Accumulation
Hany SalahEldeen 76
77
Actual Vs. Estimated Dates
Hany SalahEldeen
• Successfully estimated the creation date
for >75% of the resources
• For >33% we estimated the exact date
77
78
Our investigation angles
1. The state of the archived content
2. The age of the shared resource
3. The states of the resource:
1. Missing from the live web
2. Changed from what the author intended to share
4. Detect the author’s intention and collect a dataset
5. Model this intention
6. Create a time-based navigation tool to match the
predicted intention
Hany SalahEldeen 78
Six Socially Significant Events
• From Twitter, websites, books:
– The Egyptian revolution
• From Twitter only:
– Stanford’s SNAP dataset:
• Iranian elections
• H1N1 virus outbreak
• Michael Jackson’s death
• Obama’s Nobel Peace Prize
– Twitter API:
• The Syrian uprising
Hany SalahEldeen 79
Resources Missing & Archived
Hany SalahEldeen 80
Revisiting after a year…
Hany SalahEldeen
• There is a nearly linear relationship between the amount
missing from the web and time.
• After 1 year ~11% is gone, and 0.02% is lost every day
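The linear trend above can be turned into a rough back-of-the-envelope estimate. The sketch below assumes that about 11% of shared resources are gone after the first 365 days and that a further 0.02% disappears on each following day. Treating the first year as a straight line from 0% to 11% is an extra simplifying assumption, and the function name is illustrative only.

```python
def estimated_fraction_missing(days_since_shared):
    """Rough fraction (0.0 to 1.0) of shared resources expected to be missing."""
    first_year_loss = 0.11   # ~11% gone after one year (figure from the slide)
    daily_loss = 0.0002      # 0.02% lost per day (figure from the slide)

    if days_since_shared < 0:
        raise ValueError("days_since_shared must be non-negative")

    if days_since_shared <= 365:
        # Assumption: loss grows linearly from 0% to 11% during the first year.
        fraction = first_year_loss * days_since_shared / 365
    else:
        # Assumption: the daily rate applies to every day after the first year.
        fraction = first_year_loss + daily_loss * (days_since_shared - 365)

    # A fraction cannot exceed 100%.
    return min(fraction, 1.0)
```

Under these assumptions, a resource shared two years ago has roughly an 18% chance of being missing.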
81
Measured Vs. Predicted
Hany SalahEldeen 82
First Attempts at Shared Content
Replacement
Hany SalahEldeen 83
• We performed an experiment to gauge how many
of the resources that are missing could be
replaced with other similar resources.
• Collected a dataset with available resources
which we assumed to be missing
• Used our method to extract the replacement
resources
• Measured the similarity with the original resource
First Attempts at Shared Content
Replacement
Hany SalahEldeen
We were able to extract another resource with >70% similarity
to the missing resource in >40% of the cases
84
85
Our investigation angles
1. The state of the archived content
2. The age of the shared resource
3. The states of the resource:
1. Missing from the live web
2. Changed from what the author intended to share
4. Detect the author’s intention and collect a dataset
5. Model this intention
6. Create a time-based navigation tool to match the
predicted intention
Hany SalahEldeen 85
86
Temporal Intention Relevancy Model
(TIRM)
Between t_tweet and t_click:
The linked resource could have:
• Changed
• Not changed
The tweet and the linked resource could be:
• Still relevant
• No longer relevant
Hany SalahEldeen 86
87
Resource is changed but relevant
• The resource changed
• But it is still relevant
→ Intention: need the current version of the resource at any time
Hany SalahEldeen 87
88
Relevancy and Intention Mapping
Current
Hany SalahEldeen 88
89
Resource is changed and not relevant
• The resource changed
• And it is no longer relevant
→ Intention: need the past version of the resource at any time
Hany SalahEldeen 89
90
Past
Relevancy and Intention Mapping
Current
Hany SalahEldeen 90
91
Resource is not changed and relevant
• The resource is not changed
• And it is relevant
→ Intention: need the past version of the resource at any time
Hany SalahEldeen 91
92
Past
Relevancy and Intention Mapping
Current
Past
Hany SalahEldeen 92
93
Resource is not changed and not relevant
• The resource is not changed
• But it is not relevant
→ Intention: I am not sure which version of the resource I need
Hany SalahEldeen 93
94
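Relevancy and Intention Mapping
• Changed, relevant: Current
• Changed, not relevant: Past
• Not changed, relevant: Past
• Not changed, not relevant: Not Sure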
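The four cases of the relevancy and intention mapping can be written as a small decision function. This is only a sketch of the mapping as the slides describe it; the function name, argument names, and return strings are illustrative and not taken from the original work.

```python
def temporal_intention(resource_changed, still_relevant):
    """Map the two TIRM observations to the version the reader likely needs."""
    if resource_changed:
        # Changed but still relevant: the current version is wanted.
        # Changed and no longer relevant: the past version is wanted.
        return "current" if still_relevant else "past"
    # Not changed and relevant: the past version is wanted.
    # Not changed and not relevant: the intention is unclear.
    return "past" if still_relevant else "not sure"
```

For example, `temporal_intention(True, False)` corresponds to the "changed and not relevant" case and yields the past version.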
Hany SalahEldeen 94
95
Our investigation angles
1. The state of the archived content
2. The age of the shared resource
3. The states of the resource:
1. Missing from the live web
2. Changed from what the author intended to share
4. Detect the author’s intention and collect a dataset
5. Model this intention
6. Create a time-based navigation tool to match the
predicted intention
Hany SalahEldeen 95
96
Feature extraction
• For each tweet we perform:
– Link analysis
– Social Media Mining
– Archival Existence
– Sentiment Analysis
– Content Similarity
– Entity Identification
Hany SalahEldeen 96
97
1- Link analysis
• Since the tweets have embedded resources shortened by
Bit.ly we can extract:
– Total number of clicks
– Hourly click logs
– Creation dates
– Referring websites
– Referring countries
• We calculate the depth of the resource in relation to its domain
(either it is a leaf node or a root page)
– We calculated the number of slashes in the resource’s URI path
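The depth feature can be sketched with the Python standard library. The version below counts non-empty path segments instead of raw slash characters, which is an assumption made here so that a trailing slash does not change the result; the function name is illustrative.

```python
from urllib.parse import urlparse


def uri_depth(uri):
    """Return how many path segments separate a URI from its domain root."""
    path = urlparse(uri).path
    # Empty segments come from leading, trailing, or doubled slashes.
    segments = [segment for segment in path.split("/") if segment]
    return len(segments)
```

A depth of 0 indicates a root page such as http://www.cnn.com/, while a deep article URI yields a larger value.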
Hany SalahEldeen 97
98
2- Social Media Mining
• Twitter:
– Using Topsy.com’s API to
extract:
• Total number of tweets.
• The most recent 500.
• Number of tweets by
influential users.
The collection of tweets extracted provided an extended context of the
resource authored by users in the twittersphere.
Hany SalahEldeen 98
99
2- Social Media Mining
• Facebook:
– Also mined for likes, shares, posts, and clicks related to each
resource.
Hany SalahEldeen 99
100
3- Archival Existence
• Using Memento Time
Maps we get:
– Total mementos
available
– Different archives count.
– The closest archived
version to the tweet
time.
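The three archival features above can be sketched as follows, assuming the TimeMap is available as link-format text in which each memento carries `rel` and `datetime` attributes. The sample TimeMap is a hand-written illustration built from the two archived CNN snapshots cited earlier in this talk, not a real archive response, and all function names are hypothetical.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.parse import urlparse

# Illustrative TimeMap text (assumption: link-format with rel/datetime attributes).
SAMPLE_TIMEMAP = (
    '<http://www.cnn.com/>; rel="original",\n'
    '<http://web.archive.org/web/20090625232522/http://www.cnn.com/>; '
    'rel="memento"; datetime="Thu, 25 Jun 2009 23:25:22 GMT",\n'
    '<http://web.archive.org/web/20090726234411/http://www.cnn.com/>; '
    'rel="memento"; datetime="Sun, 26 Jul 2009 23:44:11 GMT"\n'
)

LINK_RE = re.compile(r'<([^>]+)>\s*;([^<]*)')
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_mementos(timemap_text):
    """Return a list of (memento_uri, capture_datetime) pairs."""
    mementos = []
    for uri, attr_text in LINK_RE.findall(timemap_text):
        attrs = dict(ATTR_RE.findall(attr_text))
        relations = attrs.get("rel", "").split()
        if "memento" in relations and "datetime" in attrs:
            mementos.append((uri, parsedate_to_datetime(attrs["datetime"])))
    return mementos


def count_archives(mementos):
    """Count distinct archive hosts among the mementos."""
    return len({urlparse(uri).netloc for uri, _ in mementos})


def closest_memento(mementos, tweet_time):
    """Return the (uri, datetime) pair captured closest to the tweet time."""
    return min(mementos, key=lambda item: abs(item[1] - tweet_time))
```

With a tweet time of 25 June 2009, `closest_memento` selects the snapshot captured on that day instead of the one from a month later.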
Hany SalahEldeen 100
101
4- Sentiment Analysis
• Using the NLTK natural language processing library
• Extract the most prominent sentiment in the text
Hany SalahEldeen 101
102
5- Content Similarity
• Steps:
– We download the HTML content using the Lynx browser.
– We apply a boilerplate removal algorithm and extract the full text.
– We calculate the cosine similarity between the two pages.
→ Example: 70% similarity
Hany SalahEldeen 102
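The last step of the content similarity pipeline can be sketched with plain term-frequency vectors. This sketch assumes the page text has already been downloaded and stripped of boilerplate; the tokenization rule and the function names are illustrative choices made here, not the ones used in the original experiments.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count lowercase alphanumeric tokens in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    freq_a = term_frequencies(text_a)
    freq_b = term_frequencies(text_b)
    # Only terms present in both texts contribute to the dot product.
    dot = sum(freq_a[term] * freq_b[term] for term in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(count * count for count in freq_a.values()))
    norm_b = math.sqrt(sum(count * count for count in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty page has no direction, so report no similarity.
        return 0.0
    return dot / (norm_a * norm_b)
```

Identical texts score 1.0 and texts with no shared terms score 0.0, so a value such as 0.70 indicates substantial overlap in wording.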
103
6- Entity Identification
• By visual inspection we observed that the majority of tweets about
celebrities are related to current events.
• We harvested Wikipedia for lists of actors, politicians, and athletes.
• Checked the existence of a celebrity mention in the tweets.
Actor: Johnny Depp
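The mention check can be sketched as a whole-word match of each harvested name against the tweet text. The name list passed in stands for the lists harvested from Wikipedia; the normalization rule and function names are assumptions made for this illustration.

```python
import re


def _normalize(text):
    """Lowercase the text and keep only word tokens separated by single spaces."""
    return " ".join(re.findall(r"\w+", text.lower()))


def find_celebrity_mentions(tweet_text, celebrity_names):
    """Return the names from celebrity_names that appear in the tweet."""
    # Padding with spaces ensures that only whole-word matches count.
    padded_tweet = " " + _normalize(tweet_text) + " "
    return [
        name
        for name in celebrity_names
        if " " + _normalize(name) + " " in padded_tweet
    ]
```

For a tweet such as "Just saw Johnny Depp's new movie!" and a list containing "Johnny Depp", the function reports that name as a mention.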
Hany SalahEldeen 103
104
The trained classifier
• From the feature extraction phase we extracted 39
different features to train the classifier.
• Using 10-fold cross validation, a cost-sensitive classifier
based on random forests gave the highest success rate:
90.32%
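The evaluation protocol can be illustrated with a minimal k-fold splitter. This sketch only shows how 10-fold cross validation partitions the samples; it does not reproduce the cost-sensitive random forest classifier itself, and the function name, the shuffling, and the seed are assumptions made here.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    # Shuffle once so each fold is a random sample of the data.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_indices = folds[i]
        train_indices = [
            index
            for j, fold in enumerate(folds)
            if j != i
            for index in fold
        ]
        yield train_indices, test_indices
```

Each sample is used for testing exactly once and for training in the other nine rounds; the reported success rate would then be computed over the held-out predictions of all folds.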
Hany SalahEldeen 104
105
What’s Next for Hany?
• Finish up my dissertation
• Defend.
• Get a research/data scientist position
• Interests:
– L3S Research Center Germany
– Microsoft Research
Hany SalahEldeen 105
106
1. The state of the archived content
2. The age of the shared resource
3. The states of the resource:
1. Missing from the live web
2. Changed from what the author intended to share
4. Detect the author’s intention and collect a dataset
5. Model this intention
6. Create a time-based navigation tool to match the
predicted intention
Hany SalahEldeen
Summary:
Email: hany@cs.odu.edu
Office: 3102
Website: http://www.cs.odu.edu/~hany/
Twitter: @hanysalaheldeen
106

Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...D. B. S. College Kanpur
 
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPirithiRaju
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)Columbia Weather Systems
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPirithiRaju
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.PraveenaKalaiselvan1
 

Recently uploaded (20)

Carbon Dioxide Capture and Storage (CSS)
Carbon Dioxide Capture and Storage (CSS)Carbon Dioxide Capture and Storage (CSS)
Carbon Dioxide Capture and Storage (CSS)
 
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
 
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
 
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
 
Forensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxForensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptx
 
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingBase editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptx
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
 
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
 
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdf
 
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
 
Volatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -IVolatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -I
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
 

Zen & the art of data mining

  • 1. Old Dominion University Department of Computer Science Hany SalahEldeen Hany SalahEldeen Khalil hany@cs.odu.edu Zen & the Art of Data Mining 07-08-14 Social Media Data Collection and the path to Modeling & Predicting User Intention Web Science & Digital Libraries Lab 1
  • 2. Before we start.. here is a lil bit about me… Hany SalahEldeen 2
  • 3. Hany SalahEldeen Education: • PhD Candidate • Web Science and Digital Libraries Group • Master's Degree in Computer Vision and Artificial Intelligence • Universitat Autonoma de Barcelona • Bachelor's in Computer Systems Engineering • University of Alexandria Hany SalahEldeen 3
  • 4. Research & Technical Experience • Microsoft Research Cairo • Google GmbH Zurich • Microsoft Inc. Mountain View • National University of Singapore Hany SalahEldeen 4
  • 5. Hany SalahEldeen Detecting, Modeling, & Predicting User Temporal Intention in Social Media Web Mining Pattern Analysis Machine Learning Human Behavioral Analysis Social Media Analysis So what am I investigating? 5
  • 6. Publications Hany SalahEldeen Shanghai CIKM 2014 Conference - 1 first author paper - 1 second author paper London DL 2014 Conference - 1 third author paper Malta TPDL 2013 Conference - 1 first author paper 6
  • 7. Publications Hany SalahEldeen Indianapolis JCDL 2013 Conference - 1 first author paper Rio de Janeiro WWW 2013 Conference - 1 first author paper Cyprus TPDL 2012 Conference - 1 first author paper 7
  • 8. Besides the perks of travelling, our research has been popular… Hany SalahEldeen 8
  • 10. MIT Technology Review Hany SalahEldeen 10
  • 11. MIT Technology Review Hany SalahEldeen 11
  • 15. The Virginian Pilot Hany SalahEldeen 15
  • 16. Our Research’s Popularity Hany SalahEldeen • Local newspaper: The Virginian Pilot • 4 x MIT Technology Review • BBC • Mashable • The Atlantic • Yahoo News • Articles in > 11 different languages • We have been called: • The Internet Archeologists • Web Time Travelers 16
  • 17. My goal: Detect, model, and predict user intention in social media Hany SalahEldeen 17
  • 18. Ok hold on, let’s go back to the basics… Hany SalahEldeen 18
  • 19. Web 2.0 Definition: Web 2.0 is a concept that takes the network as a platform for information sharing, interoperability, user-centered design, and collaboration on the World Wide Web.* * http://en.wikipedia.org/wiki/Web_2.0 Hany SalahEldeen 19
  • 20. Web 2.0 • Yes, Web 2.0 is about “user-generated content” • But explicit content contributed by users is just 20% of what “matters” • 80% is in the implicitly contributed data* Hany SalahEldeen 20 *Toby Segaran, Programming Collective Intelligence, 2007
  • 21. Systems & Web 2.0 • Google: Utilizes PageRank which is a technique for extracting intelligence from the link structure • Flickr: Utilizes “interestingness” algorithm • Amazon: Utilizes “people who bought this product also bought” feature • Pandora: Utilizes “similar artist radio” • eBay: Utilizes “reputation system” Hany SalahEldeen 21
  • 22. So why do we even care about all that? Hany SalahEldeen 22
  • 23. Power to the People! Hany SalahEldeen 23
  • 24. Power to the People! • Because analyzing a huge dataset of millions of users will yield a lot of potential insights into: • User Experience • Marketing • Personal Taste • Human Behavior in general. Hany SalahEldeen 24
  • 25. So what is Data Mining? Hany SalahEldeen 25
  • 26. Data Mining • Definition: It is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. http://en.wikipedia.org/wiki/Data_mining Hany SalahEldeen 26
  • 27. Back to my goal: Hany SalahEldeen Detecting, Modeling, & Predicting User Temporal Intention in Social Media 27
  • 28. Let’s break down the title first… Hany SalahEldeen Detecting, Modeling, & Predicting User Temporal Intention in Social Media 28
  • 29. Let’s break down the title first… Hany SalahEldeen Detecting, Modeling, & Predicting User Temporal Intention in Social Media 29
  • 30. Scenario 1: Jenny reading Jeff’s tweets Hany SalahEldeen 30
  • 31. Michael Jackson Dies Hany SalahEldeen Snapshot on: June 25th 2009 http://web.archive.org/web/20090625232522/http://www.cnn.com/ 31
  • 32. Jeff tweets about it… Hany SalahEldeen Published on: June 25th 2009 https://twitter.com/mdnitehk/status/2333993907 32
  • 33. Jeff’s friend Jenny was on a vacation in Hawaii for a month Jenny is off the grid… Hany SalahEldeen 33
  • 34. When she came back she checked Jeff’s tweets and was shocked! Jenny starts catching up a month later Hany SalahEldeen Read on: July 26th 2009 https://twitter.com/mdnitehk/status/2333993907 34
  • 35. She quickly clicked on the link in the tweet… Jenny follows the link on July 26th Hany SalahEldeen http://web.archive.org/web/20090726234411/http://www.cnn.com/ CNN page on: July 26th 2009 35
  • 36. • Implication: • Jenny thought Jeff was making a joke about her favorite singer and she got mad at him • Problem: • The tweet and the resource the tweet links to have become unsynchronized. Jenny is confused! Hany SalahEldeen 36
  • 37. Scenario 2: The Egyptian Revolution Hany SalahEldeen 37
  • 38. The Egyptian Revolution Jan 2011 Hany SalahEldeen 38
  • 39. Reading about it in Storify.com a year later in March 2012 Hany SalahEldeen http://storify.com/maq4sure/egypts-revolution 39
  • 40. I noticed some shared images are missing Hany SalahEldeen http://storify.com/maq4sure/egypts-revolution 40
  • 41. Some tweets are still intact Hany SalahEldeen https://twitter.com/miss_amy_qb/status/32477898581483521 41
  • 42. …and some lost their meaning with the disappearance of the images Hany SalahEldeen Missing ? https://twitter.com/aishes/status/32485352102952960 https://twitter.com/omar_chaaban/status/32203697597452289 42
  • 43. The tweet remains but the shared image disappeared… Hany SalahEldeen http://yfrog.com/h5923xrvbqqvgzj 43
  • 44. • Implication: • The reader cannot understand what the author of the tweet meant because the image is not available. • Problem: • The post is available but the linked resource (image) is completely missing. Cairo….we have a problem! Hany SalahEldeen 44
  • 45. …back to the title Hany SalahEldeen Detecting, Modeling, & Predicting User Temporal Intention in Social Media 45
  • 46. …back to the title Hany SalahEldeen Detecting, Modeling, & Predicting User Temporal Intention in Social Media 46
  • 47. 47 The Anatomy of a Tweet Hany SalahEldeen 47
  • 48. 48 The Anatomy of a Tweet Author’s username Other user mention Tweet Body Hashtag Shortened URL to resource Publishing timestamp Social Post Shared Resource Interaction options Hany SalahEldeen 48
  • 49. 49 3 URIs = 3 Chances to fail Hany SalahEldeen http://news.blogs.cnn.com/2012/04/26/norwegians-sing-to-annoy-mass-killer/ https://twitter.com/KentEiler/status/195535749754527745 49
  • 50. 50 … t1 t4 t2 t3 t5 t7 t8 t9 tn t6 Explanation in MJ’s example 50
  • 51. 51 If I click on a link in a tweet, which version should I get? ttweet or tclick ? Hany SalahEldeen 51
  • 52. 52 Sometimes you want a previous version The Correct Temporal Intention CNN.com at the closest time to the tweet: 25th June 2009 ~ 7pm Hany SalahEldeen 52
  • 53. 53 Sometimes you want the current version The Correct Temporal Intention In this case the current state of the press releases page Hany SalahEldeen 53
  • 54. 54 Research Question Can we estimate the users’ intention at the time of posting and reading to predict and maintain temporal consistency? Hany SalahEldeen 54
  • 55. 55 People rely on social media for most updated information Hany SalahEldeen 55
  • 56. Hany SalahEldeen So if you are posting a tweet about your cat… …No one cares! 56
  • 57. Hany SalahEldeen Regardless of how cool your cat was! 57
  • 58. All tweets are equal… …but some are more equal than the others Hany SalahEldeen 58
  • 59. Preliminary Research Questions: 1. How long would these last? 2. And if lost, are they archived? 3. Is this what the author intended? Hany SalahEldeen 59
  • 60. 60 Since tweets are considered the first draft of history… the historical integrity of the tweets could be compromised. Hany SalahEldeen Historical Integrity 60
  • 61. 61 The life cycle of a social post Hany SalahEldeen 61
  • 62. 62 The life cycle of a social post tweets Hany SalahEldeen 62
  • 63. 63 The life cycle of a social post tweets Links to Hany SalahEldeen 63
  • 64. 64 The life cycle of a social post tweets What the reader receives Links to Same state the author intended Hany SalahEldeen 64
  • 65. 65 The life cycle of a social post tweets What the reader receives Links to Same state the author intended Hany SalahEldeen The resource has disappeared 65
  • 66. 66 The life cycle of a social post tweets What the reader receives Links to Same state the author intended The resource has disappeared The resource has changed Hany SalahEldeen 66
  • 67. 67 Same state the author intended The Resource’s Possibilities a bigger problem since the reader might not know. What the reader receives The resource has disappeared The resource has changed Hany SalahEldeen 67
  • 68. 68 We could lose the linked resource Hany SalahEldeen 68
  • 69. 69 The attack on the embassy was in February 2013 Or the resource could change Hany SalahEldeen 69
  • 70. 70 Why do we want to detect the Author’s Temporal Intention? • Match: and convey the intended information. • Notify: – the author that the resource is prone to change. – the reader that the resource has changed. • Preserve: the resource by pushing snapshots into the archive automatically. • Retrieve: the closest archived version to maintain the consistency. Hany SalahEldeen 70
  • 71. 71 Our investigation angles 1. The state of the archived content 2. The age of the shared resource 3. The states of the resource: 1. Missing from the live web 2. Changed from what the author intended to share 4. Detect the author’s intention and collect a dataset 5. Model this intention 6. Create a time-based navigation tool to match the predicted intention Hany SalahEldeen 71
  • 72. 72 Our investigation angles 1. The state of the archived content 2. The age of the shared resource 3. The states of the resource: 1. Missing from the live web 2. Changed from what the author intended to share 4. Detect the author’s intention and collect a dataset 5. Model this intention 6. Create a time-based navigation tool to match the predicted intention Hany SalahEldeen 72
  • 73. 73 Estimating Web Archiving Coverage • Goal: Estimate how much of the public web is present in the public archives and how many copies are available? • Action: – Getting 4 different datasets from 4 different sources: • Search Engines Indices • Bit.ly • DMOZ • Delicious. • Results: * • Publications: – How much of the web is archived? JCDL '11 – http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-archived.html Hany SalahEldeen 16%-79% Archived according to the source 73
  • 74. 74 Our investigation angles 1. The state of the archived content 2. The age of the shared resource 3. The states of the resource: 1. Missing from the live web 2. Changed from what the author intended to share 4. Detect the author’s intention and collect a dataset 5. Model this intention 6. Create a time-based navigation tool to match the predicted intention Hany SalahEldeen 74
  • 75. 75 The timeline of the resource Hany SalahEldeen 75 http://ws-dl.blogspot.com/2013/04/2013-04-19-carbon-dating-web.html
  • 77. 77 Actual Vs. Estimated Dates Hany SalahEldeen • Successfully estimated the creation date for >75% of the resources • For >33% we estimated the exact date 77
  • 78. 78 Our investigation angles 1. The state of the archived content 2. The age of the shared resource 3. The states of the resource: 1. Missing from the live web 2. Changed from what the author intended to share 4. Detect the author’s intention and collect a dataset 5. Model this intention 6. Create a time-based navigation tool to match the predicted intention Hany SalahEldeen 78
  • 79. • From Twitter, Websites, Books: • The Egyptian revolution • From Twitter Only: • Stanford’s SNAP dataset: • Iranian elections • H1N1 virus outbreak • Michael Jackson’s death • Obama’s Nobel Peace Prize • Twitter API: • The Syrian uprising Six Socially Significant Events Hany SalahEldeen 79
  • 80. Resources Missing & Archived Hany SalahEldeen 80
  • 81. Revisiting after a year… Hany SalahEldeen • There is a nearly linear relationship between the amount missing from the web and time. • After 1 year ~11% is gone, and 0.02% is lost every day 81
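The two figures on this slide can be turned into a rough estimator. The sketch below is one possible reading, not the authors' fitted model: it assumes loss grows linearly to about 11% over the first 365 days and then continues at 0.02% per day. The function name and the piecewise form are illustrative assumptions.

```python
FIRST_YEAR_LOSS = 0.11   # ~11% missing after one year (from the slide)
DAILY_LOSS = 0.0002      # 0.02% lost per day afterwards (from the slide)


def estimated_fraction_missing(days_since_shared: float) -> float:
    """Rough linear estimate of the fraction of shared resources gone.

    Assumption: a straight line up to 11% at day 365, then a slower
    straight line of 0.02% per day. Capped at 1.0.
    """
    if days_since_shared <= 0:
        return 0.0
    if days_since_shared <= 365:
        return FIRST_YEAR_LOSS * days_since_shared / 365
    extra_days = days_since_shared - 365
    return min(1.0, FIRST_YEAR_LOSS + DAILY_LOSS * extra_days)


if __name__ == "__main__":
    for days in (180, 365, 730):
        print(days, round(estimated_fraction_missing(days), 4))
```

Under this reading, a resource shared two years ago has roughly an 18% chance of being missing from the live web.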
  • 82. Measured Vs. Predicted Hany SalahEldeen 82
  • 83. First Attempts to Shared Content Replacement Hany SalahEldeen 83 • We performed an experiment to gauge how many of the resources that are missing could be replaced with other similar resources. • Collected a dataset with available resources which we assumed to be missing • Used our method to extract the replacement resources • Measured the similarity with the original resource
  • 84. First Attempts to Shared Content Replacement Hany SalahEldeen We were able to extract another resource with >70% similarity to the missing resource in >40% of the cases 84
  • 85. 85 Our investigation angles 1. The state of the archived content 2. The age of the shared resource 3. The states of the resource: 1. Missing from the live web 2. Changed from what the author intended to share 4. Detect the author’s intention and collect a dataset 5. Model this intention 6. Create a time-based navigation tool to match the predicted intention Hany SalahEldeen 85
  • 86. 86 Temporal Intention Relevancy Model (TIRM) Between ttweet and tclick: The linked resource could have: • Changed • Not changed The tweet and the linked resource could be: • Still relevant • No longer relevant Hany SalahEldeen 86
  • 87. 87 Resource is changed but relevant • The resource changed • But it is still relevant  Intention: need the current version of the resource at any time Hany SalahEldeen 87
  • 88. 88 Relevancy and Intention Mapping Current Hany SalahEldeen 88
  • 89. 89 Resource is changed and not relevant  Intention: need the past version of the resource at any time • The resource changed • But it is no longer relevant Hany SalahEldeen 89
  • 90. 90 Past Relevancy and Intention Mapping Current Hany SalahEldeen 90
  • 91. 91 Resource is not changed and relevant  Intention: need the past version of the resource at any time • The resource is not changed • And it is relevant Hany SalahEldeen 91
  • 92. 92 Past Relevancy and Intention Mapping Current Past Hany SalahEldeen 92
  • 93. 93 Resource is not changed and not relevant  Intention: I am not sure which version of the resource I need • The resource is not changed • But it is not relevant Hany SalahEldeen 93
  • 94. 94 Past Relevancy and Intention Mapping Current Past Not Sure Hany SalahEldeen 94
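The four TIRM cases walked through above reduce to a small lookup table. The sketch below encodes that mapping exactly as the slides state it; the `Intention` enum and the function name are hypothetical names chosen for illustration.

```python
from enum import Enum


class Intention(Enum):
    CURRENT = "current version"
    PAST = "past version"
    NOT_SURE = "not sure"


# (resource changed?, still relevant?) -> intended version, per the slides
_TIRM_MAP = {
    (True, True): Intention.CURRENT,     # changed but relevant
    (True, False): Intention.PAST,       # changed and not relevant
    (False, True): Intention.PAST,       # not changed and relevant
    (False, False): Intention.NOT_SURE,  # not changed and not relevant
}


def tirm_intention(changed: bool, relevant: bool) -> Intention:
    """Map the two TIRM observations to the version the reader should get."""
    return _TIRM_MAP[(changed, relevant)]


if __name__ == "__main__":
    print(tirm_intention(changed=True, relevant=False))
```

In the Michael Jackson scenario, the CNN front page changed and was no longer relevant to the tweet a month later, so the mapping selects the past version.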
  • 95. 95 Our investigation angles 1. The state of the archived content 2. The age of the shared resource 3. The states of the resource: 1. Missing from the live web 2. Changed from what the author intended to share 4. Detect the author’s intention and collect a dataset 5. Model this intention 6. Create a time-based navigation tool to match the predicted intention Hany SalahEldeen 95
  • 96. 96 Feature extraction • For each tweet we perform: – Link analysis – Social Media Mining – Archival Existence – Sentiment Analysis – Content Similarity – Entity Identification Hany SalahEldeen 96
  • 97. 97 1- Link analysis • Since the tweets have embedded resources shortened by Bit.ly we can extract: – Total number of clicks – Hourly click logs – Creation dates – Referring websites – Referring countries • We calculate the depth of the resource in relation to its domain (whether it is a leaf node or a root page) – We calculated the number of forward slashes in the resource’s URI Hany SalahEldeen 97
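The depth feature can be approximated in a few lines. The sketch below is a variant of the slash count described on the slide: it counts non-empty path segments, so a trailing slash does not inflate the depth. The function name is an illustrative assumption.

```python
from urllib.parse import urlsplit


def uri_depth(uri: str) -> int:
    """Depth of a resource below its domain root.

    0 means a root page; larger values mean deeper (leaf-like) pages.
    Counts non-empty path segments instead of raw slashes.
    """
    path = urlsplit(uri).path
    return len([segment for segment in path.split("/") if segment])


if __name__ == "__main__":
    print(uri_depth("http://www.cnn.com/"))
    print(uri_depth(
        "http://news.blogs.cnn.com/2012/04/26/"
        "norwegians-sing-to-annoy-mass-killer/"
    ))
```

Both URIs come from earlier slides: the CNN front page is a root page, while the blog post sits four segments deep.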
  • 98. 98 2- Social Media Mining • Twitter: – Using Topsy.com’s API to extract: • Total number of tweets. • The most recent 500. • Number of tweets by influential users. The collection of tweets extracted provided an extended context of the resource authored by users in the twittersphere. Hany SalahEldeen 98
  • 99. 99 2- Social Media Mining • Facebook: – Mined too for likes, shares, posts, and clicks related to each resource. Hany SalahEldeen 99
  • 100. 100 3- Archival Existence • Using Memento Time Maps we get: – Total mementos available – Different archives count. – The closest archived version to the tweet time. Hany SalahEldeen 100
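Once the capture datetimes are extracted from a TimeMap, picking the closest archived version is a minimum over absolute time differences. The sketch below assumes the datetimes are already parsed and share one timezone; it does not fetch or parse a TimeMap, and the function name is hypothetical.

```python
from datetime import datetime
from typing import Iterable, Optional


def closest_memento(memento_times: Iterable[datetime],
                    tweet_time: datetime) -> Optional[datetime]:
    """Return the capture time nearest to the tweet time, or None if empty."""
    times = list(memento_times)
    if not times:
        return None
    return min(times, key=lambda t: abs(t - tweet_time))


if __name__ == "__main__":
    # The first two capture times are taken from the Wayback URLs on the
    # earlier slides; the tweet time is the "~7pm" quoted on slide 52.
    captures = [
        datetime(2009, 6, 25, 23, 25, 22),
        datetime(2009, 7, 26, 23, 44, 11),
    ]
    print(closest_memento(captures, datetime(2009, 6, 25, 19, 0)))
```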
  • 101. 101 4- Sentiment Analysis • Using the NLTK natural language processing libraries • Extract the most prominent sentiment in the text Hany SalahEldeen 101
  • 102. 102 5- Content Similarity • Steps: – We download the content HTML using Lynx browser. – We apply boilerplate removal algorithm and full text extraction. – Calculate the cosine similarity between the two pages.  70% similarity  Hany SalahEldeen 102
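The last step, cosine similarity, can be sketched with the standard library alone. The version below assumes the two pages have already been downloaded and stripped of boilerplate, and it uses raw term counts with a simple regex tokenizer; the weighting and tokenization actually used in the study are not stated on the slide.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between term-frequency vectors of two texts."""
    vec_a = Counter(re.findall(r"\w+", text_a.lower()))
    vec_b = Counter(re.findall(r"\w+", text_b.lower()))
    # Dot product over the terms the two texts share.
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    old = "Michael Jackson dies at 50"
    new = "Michael Jackson memorial planned"
    print(round(cosine_similarity(old, new), 3))
```

A score near 1 suggests the page still says what it said when it was shared; a low score suggests it has changed.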
  • 103. 103 6- Entity Identification • By visual inspection we observed that the majority of tweets about celebrities are related to current events. • We harvested Wikipedia for lists of actors, politicians, and athletes. • Checked the existence of a celebrity mention in the tweets. Actor: Johnny Depp Hany SalahEldeen 103
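The celebrity check can be sketched as a case-insensitive whole-phrase match against a name list. The list below is a tiny hypothetical stand-in for the names harvested from Wikipedia, and the matching rule is an assumption; the slides do not say how mentions were detected.

```python
import re
from typing import Iterable


def mentions_celebrity(tweet_text: str, celebrity_names: Iterable[str]) -> bool:
    """True if any listed name appears in the tweet as a whole phrase."""
    for name in celebrity_names:
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, tweet_text, flags=re.IGNORECASE):
            return True
    return False


if __name__ == "__main__":
    names = ["Johnny Depp", "Michael Jackson"]  # hypothetical sample list
    print(mentions_celebrity("RIP michael jackson :(", names))
    print(mentions_celebrity("My cat is very cool", names))
```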
  • 104. 104 The trained classifier • From the feature extraction phase we extracted 39 different features to train the classifier. • Using 10-fold cross validation, the Cost Sensitive Classifier Based on Random Forests gave the highest success rate = 90.32% Hany SalahEldeen 104
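For readers unfamiliar with 10-fold cross validation, the sketch below shows only the splitting step: the data is cut into ten folds, and each fold serves once as the test set while the other nine are used for training. It does not reproduce the cost-sensitive random forest or the 39 features, and it splits in order without shuffling, which is a simplifying assumption.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


if __name__ == "__main__":
    for train, test in k_fold_indices(25, k=10):
        print(len(train), test)
```

The reported success rate would then be the average accuracy over the ten test folds.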
  • 105. 105 What’s Next for Hany? • Finish up my dissertation • Defend. • Get a research/Data scientist position • Interests: – L3S Research Center Germany – Microsoft Research Hany SalahEldeen 105
  • 106. 106 1. The state of the archived content 2. The age of the shared resource 3. The states of the resource: 1. Missing from the live web 2. Changed from what the author intended to share 4. Detect the author’s intention and collect a dataset 5. Model this intention 6. Create a time-based navigation tool to match the predicted intention Hany SalahEldeen Summary: Email: hany@cs.odu.edu Office: 3102 Website: http://www.cs.odu.edu/~hany/ Twitter: @hanysalaheldeen 106