Self-disclosure in twitter conversations - talk in QCRI

Self-disclosure in Twitter conversations
JinYeong Bak
jy.bak@kaist.ac.kr
Department of Computer Science, KAIST

About Me
2014-10-23
• JinYeong Bak
• Ph.D. student at KAIST, U&I Lab
• Research interests
  Bayesian Data Analysis
  Computational Social Science
• Research Intern, MSRA, 2013 (supervisor: Chin-Yew Lin)
• Related publications
  Self-Disclosure and Relationship Strength in Twitter Conversations, ACL 2012 (with Suin Kim, Alice Oh)
  Self-disclosure topic model for classifying and analyzing Twitter conversations, EMNLP 2014 (with Chin-Yew Lin, Alice Oh)

Overview

Limitations in Previous Works
• Survey
• Hand coding
• Lab environment
Hard to identify self-disclosure in large, naturally occurring datasets

Twitter Conversations
https://twitter.com/britneyspears
Example of a Twitter conversation

Self-Disclosure Topic Model (SDTM)
[Figure: graphical model of the Self-Disclosure Topic Model]
[Figure: accuracy and average F1]

Self-disclosure & Social features
What are the relations between self-disclosure and social features in Twitter conversations?

Self-disclosure (SD)

Self-disclosure: Definition
• The verbal expressions by which a person reveals aspects of the self to others [Jourard1971b]
• The process of making the self known to others [Jourard&Lasakow1958]
• 30~40% of everyday conversation consists of self-disclosure [Dunbar et al.1997]

Self-disclosure: Level
Self-disclosure level [Vondracek et al.1971, Barak et al.2007]
• No disclosure (G level)
  General information and ideas
• Medium disclosure (M level)
  General information about self or someone close to him
• High disclosure (H level)
  Sensitive information about self or someone close to him

Self-disclosure: G level
• General information and ideas
• No information about self or someone close to him

Self-disclosure: M level
• General information about self or someone close to him
• Personal events, age, occupation and family members

Self-disclosure: H level
• Sensitive information about self or someone close to him
• Problematic behaviors of self and family members
• Physical appearance, health, death, sexual topics

Self-disclosure: Relations
Human relationship
• The degree of self-disclosure in a relationship depends on the strength of the relationship [Duck2007]
• Strategic self-disclosure can strengthen the relationship

Benefits
• Can get social support from others [Derlega et al.1993]
• Can cope with stress [Derlega et al.1993, Tamir and Mitchell2012]
• Examples

Consideration
• Disclosing private information makes a person easy to attack
• Need to manage privacy boundaries (e.g. people, topics) [Petronio2002]
• Example

Limitations in Previous Works
• Survey
  Asking questions to participants
  Cons: biased by participants' memory
• Hand coding
  Analyzing the dataset by hand
  Cons: cannot be applied to a large dataset
• Lab environment
  Experiments held in a lab or artificial environment
  Cons: not a real, naturally occurring dataset

Research Questions
• How can we automatically find self-disclosure in a large, naturally occurring corpus?
• What are the relations between self-disclosure and social features in a large, naturally occurring corpus?

Twitter
• Online social networking service
• www.twitter.com
• 200 million users send over 400 million tweets daily (2013.09)
https://twitter.com/NoSyu

Tweet
• Users write messages of up to 140 characters
• Users mention others or retweet others' tweets

Conversation in Twitter
• Users have conversations on Twitter

Conversation Topics
• Users discuss several topics with others
  Soccer
  Politics
  Places
  Family


A Twitter conversation
• 5 or more tweets
• At least one reply by each user

Twitter conversation data
• Aug 2007 to Jul 2013
• 102K users
• 2M conversations
• 17M tweets
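
The two conversation criteria above (5 or more tweets, at least one reply by each user) can be sketched as a filter. This is a minimal illustration, not the code used to build the dataset: the `Tweet` record, its field names, and the `is_conversation` helper are hypothetical, and the reading of "each user" as "every participant in the chain" is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tweet:
    """Hypothetical tweet record; field names are assumptions."""
    tweet_id: int
    user: str
    reply_to: Optional[int] = None  # id of the tweet this one replies to


def is_conversation(chain: List[Tweet], min_tweets: int = 5) -> bool:
    """True if the chain has at least `min_tweets` tweets and every
    participant has posted at least one reply."""
    if len(chain) < min_tweets:
        return False
    users = {t.user for t in chain}
    repliers = {t.user for t in chain if t.reply_to is not None}
    # A conversation needs at least two people, all of whom replied.
    return len(users) >= 2 and users == repliers


chain = [
    Tweet(1, "alice"),
    Tweet(2, "bob", reply_to=1),
    Tweet(3, "alice", reply_to=2),
    Tweet(4, "bob", reply_to=3),
    Tweet(5, "alice", reply_to=4),
]
print(is_conversation(chain))
```

A chain in which one user only posts the opening tweet and never replies is rejected under this reading.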

Self-disclosure and relationship strength in Twitter conversations
ACL 2012 short paper

Human relationship
• The degree of self-disclosure in a relationship depends on the strength of the relationship [Duck2007]
• Strategic self-disclosure can strengthen the relationship

Research Question
Do Twitter conversations also show a similar pattern?
• Dyads with high relationship strength show more self-disclosure behavior
• Dyads with low relationship strength show less self-disclosure behavior

Methodology
• Twitter data
  131K users
  2M conversations
• Relationship strength
  Conversation frequency (CF)
  Conversation length (CL)
• Self-disclosure
  Personal information
  Profanity
• Analysis with topic models
  Latent Dirichlet allocation (LDA, [Blei, JMLR 2003])

Relationship Strength
• CF: conversation frequency
  The number of conversational chains between the dyad, averaged per month
• CL: conversation length
  The length of conversational chains between the dyad, averaged per month
• A high CF or CL for a dyad means the relationship is strong
• A low CF or CL for a dyad means the relationship is weak
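
One way to compute CF and CL for a single dyad is sketched below. The slide does not say how the monthly averaging is done, so two assumptions are made here: CF divides the number of chains by the number of calendar months between the first and last conversation (inclusive), and CL averages the mean chain length over the months that contain at least one conversation. The function name and the input format are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def relationship_strength(conversations):
    """conversations: list of (start_date, n_tweets) pairs for one dyad.
    Returns (cf, cl) under the averaging assumptions stated above."""
    if not conversations:
        raise ValueError("need at least one conversation")
    by_month = defaultdict(list)
    for start, n_tweets in conversations:
        by_month[(start.year, start.month)].append(n_tweets)
    first, last = min(by_month), max(by_month)
    # Calendar months spanned by the dyad's conversations, inclusive.
    n_months = (last[0] - first[0]) * 12 + (last[1] - first[1]) + 1
    cf = len(conversations) / n_months
    cl = mean(mean(lengths) for lengths in by_month.values())
    return cf, cl


dyad = [(date(2013, 1, 5), 5), (date(2013, 1, 20), 7), (date(2013, 3, 2), 9)]
cf, cl = relationship_strength(dyad)
print(cf, cl)
```

Here the dyad has three chains over a three-month span, so CF is 1.0; the monthly mean lengths are 6 and 9, so CL is 7.5.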

Self-disclosure
• Personally Identifiable Information (PII)
• Personally Embarrassing Information (PEI)
• Profanity
  Ex) nigga, ass, wtf, lmao

Self-disclosure: Personal Information
• Personally Identifiable Information (PII)
  Ex) name, location, email address, job, social security number
• Personally Embarrassing Information (PEI)
  Ex) clinical history, sexual life, job loss, family problem


Discover topics in each conversation
• Use LDA [Blei2003] with k = 300
• LDA outputs a topic proportion for each conversation
• LDA outputs a multinomial word distribution for each topic

Find related topics
• Annotate conversations that best represent each topic
• Use Amazon Mechanical Turk
• Turkers annotated conversations for
  Existence of PII
  Existence of PEI
  Keywords
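
Selecting the conversations that best represent each topic can be sketched from the per-conversation topic proportions that LDA outputs. This is an illustrative sketch only: it assumes "best represent" means "highest proportion of that topic", and `theta`, `top_conversations` and the cutoff `n` are hypothetical names and parameters, not taken from the paper.

```python
def top_conversations(theta, n=3):
    """theta[d][k] is the proportion of topic k in conversation d.
    Returns {topic index: indices of the n conversations with the
    highest proportion of that topic, best first}."""
    n_topics = len(theta[0])
    return {
        k: sorted(range(len(theta)), key=lambda d: theta[d][k], reverse=True)[:n]
        for k in range(n_topics)
    }


# Toy proportions for 3 conversations and 2 topics (the talk uses k = 300).
theta = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
]
picked = top_conversations(theta, n=2)
print(picked)
```

The selected conversations per topic would then be the ones sent to annotators.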

Example of PII, PEI and Profanity topics
Shown by high-probability words in each topic

PII 1    PII 2      PEI 1     PEI 2     PEI 3     Profanity
san      tonight    pants     teeth     family    nigga
live     time       wear      doctor    brother   lmao
state    tomorrow   boobs     dr        sister    shit
texas    good       naked     dentist   uncle     ass
south    ill        wearing   tooth     cousin    bitch

Results
[Figure: Profanity and PII & PEI by relationship strength (weak to strong); panels: Conversation Frequency, Conversation Length]

Results: Interpretation

• PII
  When people meet new acquaintances, they use PII to introduce themselves

Summary

• Used a large corpus of Twitter conversations
• Measured relationship strength by conversation frequency and conversation length
• Measured self-disclosure by
  PII, PEI
  Profanity
• Confirmed the hypothesis that stronger relationships show more self-disclosure behaviors in Twitter conversations

Weakness of the Paper
• Uses a naïve definition of the degree of self-disclosure
  PII, PEI, Profanity
  Needs a more concrete definition of the self-disclosure degree → Self-disclosure level
• Uses a naïve computational method
  LDA with post-processing
  Needs a more concrete, novel method → Self-disclosure Topic Model

Self-disclosure Topic Model (SDTM)
EMNLP 2014 long paper

Difficulties for SD research
• Lack of a ground-truth dataset of SD levels
  No tagged dataset for Twitter conversations
  No accessible self-disclosure datasets
• Lack of studies on SD in computational linguistics
  Definitions and relations with others in social psychology
  Survey or hand-coding
  Related word categories in LIWC [Houghton et al.2012]

Ground-truth Dataset
• Process
  Randomly sample 301 Twitter conversations
  Give them to three judges
  Tag a self-disclosure level for each tweet
  Work on a web-based platform
• Result
  Tagged G: 122, M: 147, H: 32 conversations
  Fleiss kappa: 0.68
[Figure: screenshot of the web-based annotation platform]
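
For reference, Fleiss' kappa, the agreement statistic reported above, can be computed from a table of rating counts. The sketch below is the standard formula applied to toy data; the toy ratings are made up for illustration and are not the annotation data from the talk.

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters who put item i in category j.
    Every item must be rated by the same number of raters (at least 2)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])
    # Share of all ratings that fall in each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement on each item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy example: 4 items, 3 raters, 2 categories.
toy = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(round(fleiss_kappa(toy), 3))
```

With three judges and the three levels G, M and H, each row would hold three counts that sum to 3.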

Assumptions: First person pronouns
First-person pronouns are good indicators of self-disclosure
• Ex) ‘I’, ‘My’
• Used in previous research [Joinson et al.2001, Barak et al.2007]
• Observed as highly discriminative features between G and M/H in the annotated dataset

Unigram   Bigram    Trigram
my        I love    I have a
I         I was     is going to
I’m       I have    to go to
but       my dad    want to go
was       go to     and I was
I’ve      my mom    going to miss

Assumptions: Topics
M and H levels have different topics
• [General vs. Sensitive] information about self or intimate

Self-disclosure related topics by LDA

Location   Time       Adult     Health    Family    Profanity
san        tonight    pants     teeth     family    nigga
live       time       wear      doctor    brother   lmao
state      tomorrow   boobs     dr        sister    shit
texas      good       naked     dentist   uncle     ass
south      ill        wearing   tooth     cousin    bitch

Can be formalized as topics
• Personally Identifiable Information
  General information about self
  Ex) name, location, email address, job, …
• Secrets
  Sensitive information about self
  Ex) physical appearance, health, sexuality, death, …


• Based on probabilistic topic modeling
• Classifying G and M/H level
  Observed first-person pronouns
  Using a learned maximum entropy classifier
• Classifying M and H level
  Observed words
  Using seed words for each level

[Figure: rough description of how to infer self-disclosure in SDTM. Elements: Tweet, Maximum Entropy Classifier, Topic Model (G level), Topic Model with Seed Words (M level, H level)]
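
The two-stage idea in the description above can be illustrated with a deliberately simplified, rule-based stand-in. This is not SDTM: the learned maximum entropy classifier is replaced by a plain first-person-pronoun check, and the seeded topic model is replaced by a seed-word lookup. The pronoun list, the function names, and the sample seed set are all assumptions made for illustration.

```python
import re

# Assumed pronoun list; a stand-in for the learned first-stage classifier.
FIRST_PERSON = {"i", "i'm", "i've", "i'll", "my", "me", "mine", "myself"}


def has_first_person(tokens):
    """Stand-in for stage 1: does the tweet talk about the self?"""
    return any(token in FIRST_PERSON for token in tokens)


def disclosure_level(text, h_seeds, is_self_related=has_first_person):
    """Return 'G', 'M' or 'H' for one tweet.
    Stage 1 separates G from M/H; stage 2 separates M from H by
    checking for H-level seed words (stand-in for the seeded topic model)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not is_self_related(tokens):
        return "G"
    return "H" if any(token in h_seeds for token in tokens) else "M"


h_seeds = {"surgery", "funeral", "disorder"}  # sample H-level seed words
for tweet in ["Obama will win the vote", "I live in Seoul", "My surgery is tomorrow"]:
    print(tweet, "->", disclosure_level(tweet, h_seeds))
```

The real model infers both decisions jointly from data, so this sketch only shows the order of the two decisions, not how they are learned.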

Maximum Entropy Classifier
• Learned from the annotated dataset
• Works better than other classifiers (C4.5, Naïve Bayes, SVM with linear, polynomial and radial basis kernels)
• Used to identify aspects and opinions in a topic model [Zhao2010]

Seed Words
Seed words are prior knowledge for each level
• G level
  No seed words (symmetric prior)
• M level
  Data-driven approach in Twitter conversations
• H level
  Data-driven approach from an external dataset

Seed Words
• M level
  Data-driven approach
  Use the Twitter conversation dataset
  Get frequently occurring trigrams that begin with ‘I’ and ‘my’
• Example seed words

Name           Birthday            Location     Occupation
My name is     My birthday is      I live in    My job is
My last name   My birthday party   I lived in   My new job
My real name   My bday is          I live on    My high school
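
Collecting frequent trigrams that begin with ‘I’ or ‘my’ can be sketched as a simple count. The tokenizer, the function name, and the `top_n` cutoff below are assumptions; the talk does not state how tweets were tokenized or how many trigrams were kept.

```python
import re
from collections import Counter


def frequent_first_person_trigrams(tweets, starts=("i", "my"), top_n=10):
    """Count word trigrams whose first word is in `starts` and
    return the `top_n` most frequent as (trigram, count) pairs."""
    counts = Counter()
    for text in tweets:
        # Simple lowercase tokenizer; an assumption, not the authors' one.
        tokens = re.findall(r"[a-z']+", text.lower())
        for i in range(len(tokens) - 2):
            if tokens[i] in starts:
                counts[" ".join(tokens[i:i + 3])] += 1
    return counts.most_common(top_n)


sample = [
    "My name is Ann",
    "my name is Bob and I live in Seoul",
    "I live in Doha",
]
print(frequent_first_person_trigrams(sample))
```

Candidates from such a list would still need to be grouped by hand into categories such as name, birthday, location and occupation.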

Seed Words
• H level
  Use an external dataset (Six Billion Secrets)
  http://www.sixbillionsecrets.com
  Users write and share their secrets
  26,523 posts
  Extract high-ranked word features
• Example seed words

Physical appearance   Health condition   Death
chubby                addicted           dead
fat                   surgery            died
scar                  syndrome           suicide
acne                  disorder           funeral

[Figure: example of secret posts in Six Billion Secrets]

Classifying Performance
• Data
  Annotated Twitter conversations
  Randomly shuffled 80/20 train/test split
• Methods
  BOW+: bag of words + bigrams + trigrams features, maximum entropy
  FirstP: occurrence of first-person pronouns features, maximum entropy
  SEED: seed words and trigrams features, maximum entropy
  FirstP+SEED: FirstP and SEED features, two-stage maximum entropy
  SDTM: Self-disclosure Topic Model
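
The evaluation setup above (random 80/20 split, accuracy and average F1) can be sketched as follows. Two assumptions are made: "average F1" is read as the unweighted mean of the per-level F1 scores over G, M and H, and the split uses a fixed random seed. The function names are hypothetical.

```python
import random


def train_test_split(items, test_ratio=0.2, seed=0):
    """Shuffle a copy of `items` and split it into (train, test)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(round(len(items) * (1 - test_ratio)))
    return items[:cut], items[cut:]


def accuracy(gold, pred):
    """Fraction of items whose predicted level equals the gold level."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def average_f1(gold, pred, labels=("G", "M", "H")):
    """Unweighted mean of per-label F1 (assumed meaning of 'average F1')."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)


train, test = train_test_split(range(10))
gold = ["G", "G", "M", "H"]
pred = ["G", "M", "M", "H"]
print(len(train), len(test), accuracy(gold, pred), round(average_f1(gold, pred), 3))
```

Averaging F1 over levels rather than over items keeps the small H class from being swamped by G and M.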

EMNLP 2014 long paper

What are the relations between self-disclosure and social features in Twitter conversations?
• Research questions
  1. Does high self-disclosure lead to longer conversations?
  2. Is there a difference in conversation length patterns over time depending on the overall self-disclosure level?
  3. Do high self-disclosure users have many conversation partners?
  4. Do high self-disclosure users have conversations more frequently?

Research Questions
Q1) Does high self-disclosure lead to longer conversations?
Q2) Is there a difference in conversation length patterns over time depending on the overall self-disclosure level?
    [Figure: high SD level dyad vs. low SD level dyad]
Q3) Do high self-disclosure users have many conversation partners?
    [Figure: high SD level user vs. low SD level user]
Q4) Do high self-disclosure users have conversations more frequently?
    [Figure: high SD level user vs. low SD level user]

Results
High-ranked topics in each level (G, M, H levels)
Shown by high-probability words in each topic

G 1         G 2      M 1       M 2       H 1      H 2
obama       league   send      going     better   ass
he’s        win      email     party     sick     bitch
romney      game     i’ll      weekend   feel     fuck
vote        season   sent      day       throat   yo
right       team     dm        night     cold     shit
president   cup      address   dinner    hope     fucking

Results
Q1) Does high self-disclosure lead to longer conversations?
Ans) Positive relation between the initial SD level and changes in CL

Results
Q2) Is there a difference in CL patterns over time by overall SD level?
Ans) The ‘high’ and ‘mid’ groups increase CL over time; the ‘low’ group does not
     The ‘high’ group talks more in a conversation than the ‘mid’ and ‘low’ groups

Results
Q3) Do high self-disclosure users have many conversation partners?
Ans) ‘mid’ self-disclosure users have more conversation partners than others
Q4) Do high self-disclosure users have conversations more frequently?
Ans) ‘high’ self-disclosure users have more conversations per day than others
Finding)
• Researchers often look at the number of words in a conversation for its relation with self-disclosure
• Conversation length is more significant than the number of words

          # Partners   # Conv / Day   Words / Conv   Conv Length
low       3.33         0.46           59.17          4.13
mid       3.55         0.52           61.17          4.28
high      3.47         0.54           63.26          4.45
p-value   <0.001       <0.001         <0.1           <0.001

Summary
• Self-disclosure (SD)
  Definition from social psychology
  Limitations in previous research
• Computational approaches for self-disclosure
  Twitter conversation dataset
  Self-disclosure topic model (SDTM)
• Self-disclosure & Social features
  Relationship strength over time
  Conversation partners and frequency

Future Work
• Self-disclosure in a user's timeline tweets
  Has positive relations with
    Loneliness [Al-Saggaf.2014]
    Online social network usage [Trepte.2013]
  Predict a user's
    Loneliness, and give social support
    Usage patterns in online social networks, and give feedback
• Self-disclosure by machine
  Looks like a human in a dialogue system
  Can increase satisfaction in a talking-cure dialogue system

Reference
• [Jourard1971b] Sidney M Jourard. 1971b. The transparent self (rev. ed.). Princeton, NJ: Van Nostrand.
• [Jourard&Lasakow1958] Sidney M Jourard and Paul Lasakow. 1958. Some factors in self-disclosure. The Journal of Abnormal and Social Psychology, 56(1):91.
• [Dunbar et al.1997] Robin IM Dunbar, Anna Marriott, and Neil DC Duncan. 1997. Human conversational behavior. Human Nature, 8(3):231–246.
• [Vondracek and Vondracek1971] Sarah I Vondracek and Fred W Vondracek. 1971. The manipulation and measurement of self-disclosure in preadolescents. Merrill-Palmer Quarterly of Behavior and Development, 17(1):51–58.
• [Chelune and others1979] Gordon J Chelune et al. 1979. Self-disclosure: Origins, patterns, and implications of openness in interpersonal relationships. Jossey-Bass, San Francisco.
• [Barak&Gluck-Ofri2007] Azy Barak and Orit Gluck-Ofri. 2007. Degree and reciprocity of self-disclosure in online forums. CyberPsychology & Behavior, 10(3):407–417.
• [Jo2011] Yohan Jo and Alice H. Oh. 2011. Aspect and sentiment unification model for online review analysis. In Proceedings of the fourth ACM international conference on Web search and data mining. ACM.
• [Tamir and Mitchell2012] Diana I Tamir and Jason P Mitchell. 2012. Disclosing information about the self is intrinsically rewarding. Proceedings of the National Academy of Sciences, 109(21):8038–8043.
• [Duck2007] Steve Duck. 2007. Human relationships. Sage.
• [Bak et al.2012] JinYeong Bak, Suin Kim, and Alice Oh. 2012. Self-disclosure and relationship strength in twitter conversations. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2, pages 60–64. Association for Computational Linguistics.
• [Derlega et al.1993] Valerian J. Derlega, Sandra Metts, Sandra Petronio, and Stephen T. Margulis. 1993. Self-Disclosure, volume 5 of SAGE Series on Close Relationships. SAGE Publications, Inc.
• [Wills1985] Thomas Ashby Wills. 1985. Supportive functions of interpersonal relationships.
• [Trepte.2013] Sabine Trepte and Leonard Reinecke. 2013. The reciprocal effects of social network site use and the disposition for self-disclosure: A longitudinal study. Computers in Human Behavior, 29(3):1102–1112.
• [Harris, J, 2009] Sep Kamvar and Jonathan Harris. 2009. We feel fine: An almanac of human emotion. Simon and Schuster.
• [Houghton and Joinson2012] David J Houghton and Adam N Joinson. 2012. Linguistic markers of secrets and sensitive self-disclosure in twitter. In System Science (HICSS), 2012 45th Hawaii International Conference on, pages 3480–3489. IEEE.
• [Steinfield et al.2008] Charles Steinfield, Nicole B Ellison, and Cliff Lampe. 2008. Social capital, self-esteem, and use of online social network sites: A longitudinal analysis. Journal of Applied Developmental Psychology, 29(6):434–445.
• [Petronio2002] S. Petronio. 2002. Boundaries of privacy: Dialectics of disclosure. 29. Albany, NY.
• [Valkenburg2011] Patti M Valkenburg, Sindy R Sumter, and Jochen Peter. 2011. Gender differences in online and offline self-disclosure in pre-adolescence and adolescence. British Journal of Developmental Psychology.
• [Sprecher2012] Susan Sprecher, Stanislav Treger, and Joshua D. Wondra. 2012. Effects of self-disclosure role on liking, closeness, and other impressions in get-acquainted interactions. Journal of Social and Personal Relationships.
• [Zhao2010] Wayne Xin Zhao, Jing Jiang, Hongfei Yan, and Xiaoming Li. 2010. Jointly modeling aspects and opinions with a maxent-lda hybrid. In Proceedings of EMNLP.

Thank you!
Any questions or comments?
JinYeong Bak
jy.bak@kaist.ac.kr
Department of Computer Science, KAIST
