2. Agenda: Part 1
ā¢ Introduction
ā
ā
ā
ā
Bio
Some definitions
Growing Interest
Resources
ā¢ Social Homophily as an Example
ā Meaning and Import
ā Political Cyberbalkanization
ā¢ Social Network Analysis
ā¢ Text Mining
2
3. Agenda: Part 2
ā¢ Introduction to Social Network
Analysis Metrics
ā Degrees of Separation
ā Hooray for Bollywood
ā¢ Standard Measures
ā Centrality
ā Cohesion
ā¢ Levels of Social Network Analysis
ā Facebook & Egocentric Analysis
ā Searching for Cohesive Subgroups
3
4. Dave King
Biography
University
Professor
ā70
App Stats
Math Modeling
Dir R&D
Execucom
ā80
SVP & CTO
Comshare
ā90
DSS
Optimization
App Stats
Nat Lang for
DB Query
DSS Interface
HICSS (83)
EVP R&D
JDA Software
ā00
DSS
Multi-Dim DB
EIS & BI
ā10
Merch Planning
SCP Planning
Optimization
ECom
HICSS
Track Intern
& Dig Econ
Sentiment
Analysis
HICSS
Tutorial
Web Mining &
Analysis
4
14. Mining and Analyzing Social Media
The Content
Data Mining
Discovering meaningful patterns
from large data sets using
pattern recognition technologies
Web Mining
Data Mining focused on the
analysis of Web Usage, Structure
& Content
Text Mining
Using natural language processing
& data mining to discover patterns
in a collection of ādocuments
15. Mining and Analyzing Social Media
The Connections
Social Network Analysis
[SNA]
The mapping and measuring
of relationships and flows
between people, groups,
organizations, computers,
URLs, and other connected
information/knowledge
entities.
Network science
Study of the theoretical
foundations of network
structure and behavior and
the application of networks
to many subfields including
SNA, collaboration networks,
emergent systems, and
physical and life science
systems
15
17. Interest
Search - Google Trends
Web Mining
Text
Social Network Analysis
Web Mining
Network Science
Data Mining
Social Media Mining
Big Data
Sentiment Analysis
Data Analytics
Mathematical Statistics
17
18. Interest
Book Titles ā Google Ngram Viewer
Web Mining
Social Network Analysis
Text Mining
Network Science
Data Mining
Social Media**
Big Data
Sentiment Analysis
Data Analytics
Mathematical Statistics
18
26. Social Homophily
aka Assortative Mixing
Sound bites
Love of the same
Birds of a feather flock together
Likes associate with likes
Likes copy likes
Likes think alike
26
28. Social Network Analysis
Key Elements
Vertices or Nodes
The āthingsā
Graph
[ V,E, f ]
A
Edges or Links
The ārelationshipsā
Graph or Network
The set of
vertices/nodes,
edges/links and the
relationship/function
connecting them.
B
Edge
(Link)
C
Vertex
(Node)
D
29. Social Network Analysis
Types of Vertices
ā¢ Social Entities
ā People or social structures such as
workgroups, teams, organizations,
institutions, states, or even countries.
ā¢ Content
A
ā Web pages, keyword tags, or videos.
What does a
Vertex
represent?
ā¢ Locations
B
ā Physical or virtual locations or events.
ā¢ Primary building blocks of social
media
ā Friends in social networking sites,
posts or authors in blogs, or pages in
wikis.
C
D
29
30. Social Network Analysis
Types of Relations
ā¢ Similarities
ā Location (same spatial and temporal space)
ā Participation (same club, same event, ā¦)
ā Attributes (Age, gender, same attitudes, ā¦)
A
ā¢ Relational Roles
What does
an Edge
represent?
ā Kinship (mother of, sibling of, ā¦)
ā Other Roles (friend of, boss of, ā¦)
B
ā¢ Relational Cognition
ā Affective (Likes, Hatesā¦)
ā Perceptual (Knows, Knows of, ā¦)
ā¢ Relational Events
C
D
ā Interactions (Sold to, talked to, helped, ā¦)
ā Flows (Information, beliefs, money, ā¦)
30
31. Social Network Analysis
Types of Edges or Links
Undirected,
Unweighted
A
Facebook
Friends
Directed,
Unweighted
B
A
B
Twitter
Followers
C
Undirected,
Weighted
A
Width of
lines
C
Directed, Weighted
Collaboration
Network
100
B
A
5
10
C
60
Email 70
Network
B
20
C
31
37. Social Homophily
Gender homophily in adolescent school group
Density or Distinction? The Roles of Data Structure and Group Detection Methods in Describing Adolescent Peer Groups
Gest, S. et al. Journal of Social Structure: Vol. 6, http://www.cmu.edu/joss/content/articles/volume8/GestMoody/
37
38. Social Networks
Role of Attributes
ā¢ Add insights to visualizations
and analysis
ā¢ May include:
ā Demographic characteristics
(age, gender, race)
ā Other characteristics (income,
education, location)
ā Use of the system (number of
logins, blogs posted, edits
made)
ā Network Metrics (centralization)
A
B
C
D
ā¢ Often designated by color,
shape, size and/or opacity of
the vertices
38
40. Social Homophily
The beauty of social networks
ted.com/talks/nicholas_christakis_the_hidden_influence_of_social_networks.html
40
41. Social Homophily
Regardless of cause the resulting pattern is
Clusters of connections
emerge throughout the social
space such that there is
dense intra-connectivity
within clusters
and sparse inter-connectivity
between clusters
41
42. Social Homophily
What produces a pattern of homophily?
Cause: Similarity breeds connection
Effect:
Connection breeds similarity
Spurious: Some other factor is @ work
42
43. Social Homophily
In the world of U.S. politics?
Discourse
Viewing
Voting
Purchasing
ā¦
Antagonistic Polarization?
43
44. Social Homophily
Balkanization
Process of fragmentation or division of a region or state
into smaller regions or states that are often hostile or
non-cooperative with each other
44
45. Social Homophily
Balkanization of the US Congress (Data Sets)
senate.gov/
reference/
Index/Votes.htm
clerk.house.gov/
legislative/
legvotes.aspx
Before
Threshold
After
Threshold
govtrack.us/
developers
"Senate Voting Patterns," Morris, F.
analyticalvisions.blogspot.com/2006/04/senate-votingpatterns-part-2.html
45
46. Social Homophily
Balkanization of US Senate (1989-2013) - Votes
U.S. Senate More Divided
Than Ever Data Shows,
Swallow,E. ,Forbes, Oct. 17,
2013, forbes.com/sites/
ericaswallow/2013/11/17/sen
ate-voting-relationships-data/
Partisan voting patterns ā
1989 to present
Interactive data visualization
static.davidchouinard.com/c
ongress/
46
47. Social Homophily
Balkanization of US Senate (1975-2012) - Votes
Portrait of Political Party Polarization, Moody J. and P. Mucha, April 2013, Network Science,
journals.cambridge.org/action/displayAbstract?fromPage=online&aid=8888768
47
48. Social Homophily
Are political books of āa featherā purchased together?
The Social Life of Books,Krebs V.,
orgnet.com/divided.html
Aug 2008
Oct 2008
49. Social Homophily
Cyberbalkanization of the Political Blogsphere
ā¢ Single day snapshot of a
Snowball Sample of
Political Blogs (N=1490)
ā¢ Manually assigned as
Liberal or Conservative
ā¢ Focus on blogrolls and
front page citations
ā¢ Are we witnessing an era
of āCyberbalkinization?ā
49
50. Social Homophily
An aside ā whatās a (We)blog?
ā¢ Information or discussion site
ā Focused on a particular subject area
ā Discrete entries posted in reverse
chronological order
ā Primarily text
ā Trend toward multiple author entries
ā¢ Some stats?
ā Wordpress ~ 60M blogs with 100M
pageviews/day
ā Tumblr ~ 150M blogs with 90M posts/day
ā Blogger ~ 46M unique visitors/month
50
55. Social Homophily
Social Network Analysis of Blogs with Pajek
F-R 2D Layout w/o labels
partitioned by pol. position
Circular Layout w. labels
Draw
Network
Fructerman-Reingold
2D Layout with labels
Draw
Partition
F-R 2D Layout w/o labels
55
56. Social Homophily
Cyberbalkanization of the Political Blogsphere?
Citation Network ā link
implies that a political
blog has a URL on itās
front page referencing
another blog
Proliferation of specialized online political news sources allows
people with different political leanings to be exposed only to
information in agreement with their previously held views?
56
58. Social Homophily
Cyberbalkanization of the Political Blogsphere
Blogrolls
Page Citation
90%
N = 1151
(82%)
N=758
Avg. Links=13.2
L
L
N=Top 20
2 MnthPosts = 12470
Citations =1398
67%
74%
N=1490
Pol
Blog
8%
10%
N = 312
(13%)
N = 247
(18%)
82%
84%
N=732
Avg. Links = 15.1
C
92%
C
N=Top 20
2 MnthPosts = 10414
Citations =2422
N = 2110
(87%)
58
59. Social Homophily
Polarization on Twitter
ā¢ Twitter communications
six weeks prior to 2010
mid-term elections (9/1411/1)
ā¢ Examined 250,000
āpolitically relevantā
tweets (specific hashtags)
ā¢ Focused on two networks
ā retweets and mentions
using āstreaming APIā
59
60. Social Homophily
An Aside ā What is Twitter?
ā¢ Each tweet <= 140 characters
(avg. 10-15 words/message)
ā¢ Heavy presence of non-alpha
symbols, abbrevs, misspellings
and slang
ā¢ Tweets often include
ā Retweets (original tweet
repeated)
ā Mentions (user mentions
another user in a tweet)
ā Hashtags ā#ā (metadata
annotation denoting topic or
audience)
http://www.forbes.com/sites/marketshare/2013/01/23/
the-number-one-mistake-retail-brands-make-when-it-comes-to-twitter/
60
61. Social Homophily
Polarization on Twitter ā Social Networks (in study)
Network 1:
N=23,766
A
B
retweets
Network 2:
A
mentions
N=10,142
B
Both representing potential pathways
for information to flow from A to B
61
63. Social Homophily
Polarization on Twitter ā Gephi Analysis
Forced
Directed
Layout (Force
Atlas 2)
Nodes colored
by Cluster
(Political
Ideology)
63
64. Social Homophily
Contributions of Twitter Study
ā¢
Cluster analysis of networks derived
from corpus shows that the networks of
retweets exhibits clear segregation
while the mentioned network is
dominated by single cluster.
ā¢
Manual classification of Twitter user by
political alignment, demonstrating that
the retweet network clusters
corresponding to political left and right.
They also show the mention network is
political.
ā¢
Interpretation of the observed
community structures based on
injection of partisan content into
ideologically opposed hashtag
information streams
Retweets
Mentions
64
65. Social Homophily
Polarization ā The other side of the coin
Oh, buy the wayā¦
ā¢ In addition to citations, Adamic
and Glance also considered the
similarity in the textual content of
the blogs
ā¢ Identified 498 ābigramsā across
the Top 40 blogs
ā¢ Computed cosine similarity
measure between TD-IDF scores
for all pairs of blogs
ā¢ Average similarity between
opposites was .10, among
conservatives was .54 and
liberals was .57.
65
66. Social Homophily
Mining Text Content
ā¢ āScholars have long recognized that
much of politics is expressed in
words.ā
ā¢ āTo understand what politics is about
we need to know what political actors
are saying and writing.ā
ā¢ āā¦The assumption is that ideology
dominates the language used in the
text.ā
ā¢ Ergo, another way to study the
polarization of political blogs is to
compare what they say.
66
67. Social Homophily
Mining Text Content - Political Blogs Example
Conservative or Liberal?
Republican nightmare begins: Obamacare is 'a godsend' for people getting
coverage
U.S. President Barack Obama delivers remarks alongside Human Services
Secretary Kathleen Sebelius (R) and other Americans the White House says
will benefit from the opening of health insurance marketplaces under the
Affordable Care Act, in the Rose Garden.
The health care exchangesāfederal and stateāare now functioning and not
sucking up all the oxygen around the implementation of Obamacare. Finally,
the "good news" news stories are finally being told, like this one in The New
York Times.
ā¦
That's the kind of story that makes Republicans quake in their shoes. For more
of them, go below the fold.
67
68. Social Homophily
Mining Text Content ā Political Blogs Example
Conservative or Liberal?
The Colorado #obamacare exchange is a disaster.
Iām sorry, but thereās no other way to describe these enrollment numbers. Which are, by the way, from
a state exchange ā meaning that the problems of healthcare.gov should theoretically be irrelevant to
the conversation:
Enrollment have grown to 15,074, marketplace officials announced Monday. Marketplace officials set
a goal of 136,000 people covered on exchanged-based plans by the end of 2014, but so far the
exchange has failed to reach even worst-case enrollment projections.
The Democratic-controlled Coloradan state government is blaming this mess on people supposedly
not having to replace their insurance now and bad press about the national exchange. Which is easier
than blaming the Democrats now running the state exchange, particularly since the head of the
exchange wanted a merit raise for all her work on it. Itās certainly easier than simply admitting that
they, maybe there wasnāt all that much of a burning need for Obamacare in the first place.
But surely thatās crazy talk.
Yes, I am sorry. Bad numbers = people not being insured next year. What I am not is responsible for
any of this.
68
69. Social Homophily
Mining Text Content ā Political Blogs Example
Are they separated into 2 distinct clusters
based on their content?
Sample of 50 blog entries about
āObamacareā from popular Liberal &
Conservative Political Blogs
69
70. Social Homophily
Mining Text Content vs. Data Mining
Real-World
Data
Data Consolidation
Business
Understanding
Data
Understanding
Data
Preparation
Data Cleaning
Deployment
Modeling
Evaluation
Cross-Industry Standard Process for Data Mining
Data Transformation
Data Reduction
Well-Formed
Data
70
70
71. Social Homophily
Mining Text Content vs. Data Mining
Some Basic DM Tasks
ā¢ Anomaly Detection
ā¢ Association Rule
Learning
ā¢ Clustering
ā¢ Classification
ā¢ Regression
ā¢ Summarization
71
72. Social Homophily
Mining Text Content vs. Data Mining
DM Data is usually
ā¢ Structured
ā¢ Transformed
ā¢ Well-formed
72
73. Social Homophily
Mining Text Content ā The Process
Unstructured Data
ā¢ No specified format
ā¢ Variable length
ā¢ Variable spelling
ā¢ Punctuation and
non-alphanumeric
characters
ā¢ Contents are not
predefined and no
predefined set of
values
73
74. Social Homophily
Mining Text Content ā The Process
NLP
The overarching goal is, essentially, to turn
text into data for analysis, via application
of natural language processing (NLP) and
analytical methods.
74
75. Social Homophily
Mining Text Content ā The Process
NLP
āThe most consequential, and shocking, step we will take
is to discard the order in which words occur in documents.
We will assume documents are a bag of words, where
order does not inform our analysesā¦ If this assumption is
unpalatable, we can retain some word order by including
bigrams (word pairs) or trigrams..ā
75
76. Social Homophily
Mining Text Content ā The Process
Real-World
Text Data
Business
Understanding
Document
Understanding
Document
Preparation
Deployment
Document
Consolidation
Establish the
Corpus
Documents
Modeling
Evaluation
CRISP-Like Processes
Corpus Refinement
(Token, Stem, Stopā¦)
Feature Selection
& Weighting
Doc-Term
Matrix*
76
77. Social Homophily
Mining Text Content ā The Process
Common representation of tokens within and between documents
Tokenization
Normalize
Eliminate
Stop Words
Stemming
ā¢
Tokenization āParse the text to generate terms. Sophisticated analyzers can also
extract phrases from the text.
ā¢
Normalize ā Convert them to lowercase.
ā¢
Eliminate stop words ā Eliminate terms that appear very often (e.g. the, and, ā¦).
ā¢
Stemming ā Convert the terms into their stemmed formāremove plurals and
different word forms (e.g. achieve, achieves, achieved ā achiev) [note: word about
synonyms ā WordNet Synset]
77
78. Social Homophily
Mining Text Content ā The Process
Documents
Four score and seven years
ago our fathers brought forth
on this continent, a new
nation, conceived in Liberty,
and dedicated to the
proposition that all mean are
created equal.
Now we are engaged in a
great civil war, testing
whether that nation, orā¦
Token Sets
Feature
Extraction
nation ā 5
civil ā 1
war ā 2
men ā 2
died ā 4
people ā 5
liberty ā 1
god ā 1
ā¦
Feature Extraction & Weighting
āBag of Words, Terms
or Tokensā
Doc/Token Matrix:
Vectors of Words, Terms or Tokens by Doc
āBag of Wordsā (BOW) or
Vector Space Model (VSM):
Words or Tokens are
attributes and documents
are examples
78
80. Social Homophily
Mining Text Content ā Political Blog Example
DocStem
Matrix
(from
Python
NLTK
Program)
Stem Distribution from Obamacare Postings
50 Most Frequently Used out of 792 Stems by Political Ideology
80
81. Social Homophily
Mining Text Content ā Political Blog Example
Stem Distribution from Obamacare Postings
For Stems whose proportions are between .1 and .7 by Political Ideology
Differences in Stem Frequencies by Political Ideology
81
82. Social Homophily
Mining Text Content ā The Process
Transforming Frequencies and Weights
ā¢
ā¢
ā¢
ā¢
Binary Frequencies: tf =1 for tf>0; otherwise 0
Term Frequencies: tf(i,j)/Sum of tf(i,j) in Doc K
Log Frequencies: 1 + log(tf) for tf>0; otherwise 0
Normalized Frequencies: Divide each frequency by
SQRT of Sum of Squares of the frequencies within the
vector (column)
ā¢ Term FrequencyāInverse Document Frequency
ā TF: Freq of term for given doc/ sum of words for given doc
ā Inverse Document Frequency: log(N/(1+D)) where N is total
number of docs and D is number with term
ā TF * IDF
82
84. Social Homophily
Mining Text Content ā Political Blog Example
Average
Cosine Similarities
Cons = .25
Lib = .27
Cons-Lib = .26
Hierarchical
Clustering
Based on TD-IDF
Of Stems from
Obamacare Postings
By Political Position
85. Social Homophily
Mining Text Content ā Retweets and Mentions
Content Homogenity
ā¢
An interesting question, therefore,
is whether it has any significance in
terms of the actual content of the
discussions involved.
ā¢
To address this issue we associate
each user with a profile vector
containing all the hashtags in her
tweets, weighted by their
frequencies.
ā¢
We can then compute the cosine
similarities between each pair of
user profiles, separately for users
in the same cluster and users in
different clusters.
85
First step in textual data preparation is to systematically collect samples of text, i.e. the documents related to the context being studiedRange of possibilities: word documents, PDFs, emails, IM chat, Web pages, RSS Feeds, Blogs, Tweets, Open ended surveys, Transcripts of Helpline calls ā¦Convert into organized set of texts ā called a corpus ā standardized and prepared for the purpose of knowledge discovery