RESLVE: Leveraging User Interest to Improve Entity Disambiguation on Short Text
Noticing the Nuance: Designing intelligent systems that can understand semantic, psychological, and behavioral dimensions of our digital footprints
1. NOTICING THE NUANCE: Designing intelligent systems that can understand semantic, psychological, and behavioral dimensions of our digital footprints
Elizabeth L. Murnane
elm236@cornell.edu
www.cs.cornell.edu/~elm236/
2. ABOUT ELIZABETH
Currently
• 3rd-year PhD at Cornell Information Science
• Committee: Profs. Dan Cosley (chair), Claire Cardie, Geri Gay
Research
• Personalization; IR/NLP; Personal Informatics; Affective, Semantic, and Social Computing
• 2011 NSF Graduate Research Fellow
Background
• 2007 MIT S.B. in Mathematics with Computer Science
• Co-founded MIT CSAIL startup
3. USER-CENTRIC DATA
• Explicit & Implicit
• User-generated content
• Sensor data
• Big Data & Big Personal Data ("Little Data")
8. DIGITAL FOOTPRINTS
• Search Queries
• Social web, microblogs, media sharing
• Mobile sensing, personal informatics, life-logging, check-ins
• Social networking
9. NUANCED DIMENSIONS OF DATA
• Semantics
  • Helping machines extract intended meaning from an individual's content
• Personality & Emotion
  • Helping machines interpret psychological, affective, and subjective characteristics of users and their data
• Behavior
  • Helping machines understand the dynamics of both private and interpersonal activities
13. THE RESLVE PROJECT
• Gain better understanding of challenges machines face in understanding semantic meaning of social Web data
• Use those insights to develop more advanced computational methods that can more reliably make sense of this data
Information Retrieval | Knowledge Sharing | Personal Informatics
20. TASK DEFINITION
Named Entity Recognition (NER)
• Systematically identifying mentions of entities (e.g., people, places, concepts, ideas)
Named Entity Disambiguation (NED)
• Resolving the intended meaning of ambiguous entities from multiple candidate meanings
21. AMBIGUOUS ENTITIES
aaahh one more day until finn!!! #cantwait
office holiday party
Beetle
30. Episode 4
office holiday party
office, december 3
Footage:
• Workplace?
• TV Show?
  • US Version?
  • UK Version?
31. ANALYSIS
Data Sample
• Twitter: tweets
• YouTube: video titles, descriptions
• Flickr: photo tags, titles, descriptions
32. TEXT LENGTH
• Longest utterances still shorter than even the shortest texts from NER task corpora like Reuters-21578 and the Brown Corpus
[Figure: text length chart; series: Twitter, YouTube, Flickr, Reuters, Brown]
34. HIGH AMBIGUITY
• NER services have low confidence
• Many potential candidates (2 to 163, avg. 5-6, median 4)
[Figure: chart on a 0 to 1 scale; series: Wikipedia Miner, DBPedia Spotlight]
35. HIGH AMBIGUITY
• 91% of utterances contain at least 1 ambiguous entity
• 2/3 of entities detected are ambiguous
• Almost no entities without at least 2 senses to disambiguate
36. CHALLENGES & FOCUS
• Short Length
• Sparse Lexical Context
• Noisy
• Highly personal in nature
42. LIMITATIONS OF EXTANT RESEARCH
Tweets severely degrade traditional techniques
• Stanford NER: F1 drops 90% → 46%
• DBPedia Spotlight & Wikipedia Miner: P@1 < 40%
Recent strategies
• Crowd-sourcing
  • Limitation: Dependent on reliable human workers
• Automated attempts
  • Limitation: Focus on NER not NED
  • Limitation: Generalizability beyond Twitter?
44. HYPOTHESES
• User has core interests
• User more likely to mention an entity about a topic relevant to personal interests than mention a topic of non-interest
• User expresses these interests consistently in content she posts online in multiple communities
• Can use a semantic knowledge base to formally represent these topics of interest
  • Wikipedia
    • Articles, categories effectively represent topic
    • Compatible with NER toolkits (DBPedia Spotlight, Wikipedia Miner)
Article editing behavior ≈ interests
47. QUALITATIVE ANALYSIS: STABLE INTERESTS
A user's topics of contribution are similar across the Web:
• Same topic
  Ambiguous YouTube post: office, december 3
  Same user's recent Wikipedia edit: <item userid="xxxx" user="xxxx" pageid="31841130" title="The Office (U.S. season 8)"/>
• Same categories
  On average, 52.4% of the entities a user mentions on the social Web (e.g., "Java") have at least 1 candidate sense in the same parent category as a Wikipedia article the same user edited (e.g., "Programming language")
  Extending to just 4 parents up the category hierarchy covers all 100%
48. STRATEGY
• Bridge user identity between social Web and knowledge base, K
• Model interests using K's organizational scheme
• Rank entity senses according to relevance to interests
53. EXPLORING A PERSONALIZED SOLUTION
Individual-centric approach to NED
Incorporates external, user-specific semantic data
Model personal interests with respect to this information
Determine user's likely intended meaning of ambiguous entity based on similarity between potential meanings and interests
RESLVE: Resolving Entity Sense by LeVeraging Edits
Personal Context
58. IMPLEMENTATION: THE RESLVE SYSTEM
RESLVE (Resolving Entity Sense by LeVeraging Edits) addresses NED by:
I. Connecting social Web + Wikipedia editor identity
II. Modeling topics of interest using article edits
III. Ranking entity candidates by personal relevance
[Figure: RESLVE pipeline. Labels: user utterances (unstructured short texts); pre-processor; Wikipedia Miner; DBPedia Spotlight; detected entities & candidate meanings ("m"); username; user contributed structured documents; user interest model; (I) bridging user identity; (II) modeling user interest; (III) ranking candidates by personal relevance; top ranked personally-relevant candidates]
60. PHASE 1: BRIDGING WEB IDENTITIES
• Connect identity of social media user with Wikipedia editor
• Simple string matching (Iofciu, 2011; Perito, 2011)
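The slide says only that identities are bridged by "simple string matching". A minimal sketch of what that could look like follows; the normalization rule (lowercasing and dropping non-alphanumeric characters) and the function names are assumptions for illustration, not the RESLVE implementation.

```python
def normalize(username: str) -> str:
    """Lowercase a username and keep only letters and digits (assumed rule)."""
    return "".join(ch for ch in username.lower() if ch.isalnum())


def bridge_identity(social_username, wikipedia_editors):
    """Return the first Wikipedia editor name that matches after normalization,
    or None when no editor name matches."""
    target = normalize(social_username)
    for editor in wikipedia_editors:
        if normalize(editor) == target:
            return editor
    return None


if __name__ == "__main__":
    # Hypothetical usernames, for illustration only.
    print(bridge_identity("Jane_Doe", ["JohnSmith", "janedoe"]))
```

Exact matching after normalization keeps false links rare at the cost of missing users who pick different names on each site.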
62. PHASE 2: REPRESENTING USERS AND ENTITIES
Models user's topics of interest using bridged Wiki account's editing history
Compares similarity of those topics to topic associated with candidate sense
Content-based & knowledge-graph based similarity
63. MODELING A KNOWLEDGE CONTEXT
Knowledge base, K
K = (N, E)
2 node types: Categories, Topics
[Figure: example knowledge graph with nodes labeled c1-c4, t1-t3, and d1-d3]
64. USER INTEREST MODEL
• Editing a description signals interest in associated topic
• Topic nodes: all topics user edited description of
• Category nodes: categories reachable in knowledge graph from those topics
• Edge weight = inverse of shortest path length
[Table: example weights for topics t1-t3 (rows) against categories c1-c4 (columns)]
• Same representation for candidates
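The weighting rule above (edge weight = inverse of shortest path length) can be sketched with a breadth-first search over the category graph. The graph encoding (`parents` dictionary), the depth cap, and the choice to keep the largest weight when several edited topics reach the same category are assumptions made for this illustration.

```python
from collections import deque


def category_weights(topic, parents, max_depth=4):
    """Breadth-first search upward from a topic node.
    Each reachable category gets weight 1 / (shortest path length)."""
    weights = {}
    queue = deque([(topic, 0)])
    seen = {topic}
    while queue:
        node, dist = queue.popleft()
        if dist > 0:
            weights[node] = 1.0 / dist
        if dist == max_depth:
            continue  # do not expand beyond the depth cap
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, dist + 1))
    return weights


def interest_model(edited_topics, parents, max_depth=4):
    """Merge per-topic category weights into one vector for the user.
    Keeping the maximum weight per category is an assumed merge rule."""
    model = {}
    for topic in edited_topics:
        for cat, w in category_weights(topic, parents, max_depth).items():
            model[cat] = max(model.get(cat, 0.0), w)
    return model


if __name__ == "__main__":
    # Hypothetical graph: node -> list of parent categories.
    parents = {"t1": ["c2"], "t2": ["c2", "c4"], "c2": ["c1"]}
    print(interest_model(["t1", "t2"], parents))
```

The same function can be applied to a candidate sense's topic node, which gives the "same representation for candidates" the slide mentions.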
65. PHASE 2: REPRESENTING USERS AND ENTITIES
Models user's topics of interest using bridged Wiki account's editing history
Compares similarity of those topics to topic associated with candidate sense
Content-based & knowledge-graph based similarity
Weighted vectors used to represent user and candidate sense
66. PHASE 3: RANKING BY PERSONAL RELEVANCE
Output highest scoring candidate as intended meaning by measuring:
sim(u, m) = α · sim_content(u, m) + (1 − α) · sim_category(u, m)
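The scoring rule above can be sketched as follows. The slide does not say which similarity measure is used for either component, so cosine similarity over sparse weighted vectors is an assumption here, as are the data layout and the example values.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def sim(user, cand, alpha=0.5):
    """sim(u, m) = alpha * sim_content(u, m) + (1 - alpha) * sim_category(u, m)."""
    return (alpha * cosine(user["content"], cand["content"])
            + (1 - alpha) * cosine(user["category"], cand["category"]))


def rank_candidates(user, candidates, alpha=0.5):
    """Return candidate names sorted from most to least personally relevant."""
    return sorted(candidates,
                  key=lambda name: sim(user, candidates[name], alpha),
                  reverse=True)


if __name__ == "__main__":
    # Hypothetical vectors, for illustration only.
    user = {"content": {"episode": 1.0, "season": 1.0},
            "category": {"Television": 1.0}}
    candidates = {
        "The Office (TV series)": {"content": {"episode": 1.0},
                                   "category": {"Television": 1.0}},
        "Office (workplace)": {"content": {"desk": 1.0},
                               "category": {"Workplaces": 1.0}},
    }
    print(rank_candidates(user, candidates)[0])
```

Setting `alpha` to 1 ranks by content similarity alone and setting it to 0 ranks by category similarity alone.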
67. PRE-PROCESSING & PREPARATION MODULES
[Figure: RESLVE pipeline diagram]
69. EXPERIMENT
Labeling correct entity meaning
• 1545 valid ambiguous entities
• Mechanical Turk Categorization Masters
• Averaged observed agreement across all coders and items = 0.866
• Average Fleiss' kappa = 0.803
• 918 unanimously labeled ambiguous entities
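For readers unfamiliar with the two agreement figures, the sketch below computes observed agreement and Fleiss' kappa from a table of label counts using the standard textbook formula. The input layout and example counts are made up for illustration; the code does not reproduce the study's data or its averaging procedure.

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters who assigned item i to category j.
    Every item must have the same total number of raters.
    Returns (mean observed agreement, Fleiss' kappa)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])

    # Per-item agreement: share of rater pairs that agree on the item.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items

    # Chance agreement from the overall category proportions.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(n_cats)]
    p_e = sum(p * p for p in p_j)

    # Note: undefined (division by zero) when p_e == 1.
    return p_bar, (p_bar - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Hypothetical counts: 4 items, 3 raters, 2 candidate senses.
    print(fleiss_kappa([[3, 0], [2, 1], [0, 3], [1, 2]]))
```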
71. PERFORMANCE
Metric
• Precision at rank 1 (P@1)
Methods of comparison
• Human annotated gold standard
• RC: Randomly sorted candidates
• PF: Prior frequency
• RU: RESLVE given a random Wikipedia user's interest model
• DS: DBPedia Spotlight
• WM: Wikipedia Miner
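As a reminder of what P@1 measures, here is a small sketch: the fraction of ambiguous entities whose top-ranked candidate equals the gold-standard label. The dictionary layout and example labels are assumptions for illustration.

```python
def precision_at_1(rankings, gold):
    """rankings: entity -> list of candidate senses, best first.
    gold: entity -> correct sense.
    Returns the share of entities whose first-ranked sense is correct."""
    hits = sum(1 for entity, ranked in rankings.items()
               if ranked and ranked[0] == gold[entity])
    return hits / len(rankings)


if __name__ == "__main__":
    # Hypothetical rankings and labels.
    rankings = {"office": ["tv show", "workplace"],
                "finn": ["fish anatomy", "person"]}
    gold = {"office": "tv show", "finn": "person"}
    print(precision_at_1(rankings, gold))
```

Any of the comparison methods listed above can be scored this way once it produces a ranked candidate list per entity.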
75. RESULTS
• Best performance on YouTube texts (longest) due to content-based sim
• Outperforms on more personal text (e.g., tweets)
• Less effective on impersonal text (e.g., photo geo-tags)
  • High prior frequency so standard methods suffice
  • Personally-unfamiliar topics so not likely to make Wiki edits about them
  • Stable interests assumption breaks down here
78. SENTIMENT BASED SEARCH
• Zip codes of 10 most populated cities, 10 least populated cities, 10 random cities across the country
• 54,015 places across 1500 US cities
• Movie theaters, hotels, spas, stores, restaurants, etc.
83. CHALLENGES FOR USERS
• Interpreting mixed reviews
• Confidence in reviewer's subjective opinions
• Reading multiple reviews
84. RESEARCH QUESTIONS
• Language and rating
  • How does a place's rating relate to the language used in its reviews?
• Personality and rating
  • Do people with similar personalities tend to like or dislike the same places?
• Search interfaces
  • How can we rank search results in order to recommend places according to how appealing their atmosphere is likely to be to a user based on her personality and mood?
85. STRATEGY
• Extract features from reviews using Linguistic Inquiry and Word Count (LIWC) and MRC Psycholinguistic Database
• Support vector models trained by Mairesse algorithm to derive Big Five personality types of reviewers
• Average personality score of reviewers of a place who rated the place 5 or higher/lower as proxy for people who like/dislike a location's essence
92. CERI: CORNELL E-RULEMAKING INITIATIVE
• Law School
• Legal Information Institute (LII)
• Information Science
• Computer Science
93. BACKGROUND
• Rulemaking: process federal agencies use to create regulations (called "rules")
• e-Rulemaking: the use of digital technologies during this process
• Regulations.gov, RegulationRoom.org: online communities that allow people to learn about, discuss, and react to proposed rules during e-Rulemaking process
95. PARTICIPATION PATTERNS
• Regulations.gov
  • 14,000 rules
  • 2 million comments
• Regulation Room
  • 5 live rules
  • 1,318 comments
• Common problem: under-contribution
97. PARTICIPATION PATTERNS
[Figures: frequency of comments per rule; comments per rule across agencies]
98. CHALLENGE
• A major goal of eRulemaking is to increase public participation across a broad audience and make the process more representative
• A major challenge is sustained participation by multiple actors across rules
99. SOLUTION
A solution is to bring new users to an e-rule
• Twitter is a popular medium where people express views and ideas
• Identify and target Twitter users who may be interested in contributing feedback on a rule
100. EXPERIMENT
• How useful is Twitter content for drawing inferences about people's interests and knowledgeability about a topic?
• Are users who create content about topics relevant to an e-rule more likely to engage in related e-Rulemaking processes if targeted with requests for participation?
101. 1. Identify Subjects
Conditions: Bio, Tweet, Combo, Control
User: document term matrix
  D1 = bio, D2 = tweets, D3 = bio+tweets ("combo")
Rule: query
  q = words in rule + query expansion *
• Similarity between query and each document
• Highest score used to assign user to condition
* via Google Keyword Tool, which provides less technical words used by public to discuss same topics
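The assignment step above can be sketched as follows. The slide names a document-term matrix and a query-to-document similarity but not the exact measure, so raw term counts with cosine similarity are assumed here; the function names and example texts are hypothetical, and the control condition is not modeled.

```python
import math
from collections import Counter


def tf_vector(text):
    """Term-frequency vector from whitespace tokens (assumed tokenization)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def assign_condition(query, bio, tweets):
    """Score D1 = bio, D2 = tweets, D3 = bio + tweets against the query
    and return the best-scoring condition together with all scores."""
    q = tf_vector(query)
    joined = " ".join(tweets)
    docs = {"bio": bio, "tweet": joined, "combo": bio + " " + joined}
    scores = {name: cosine(q, tf_vector(text)) for name, text in docs.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Hypothetical rule query and user texts.
    best, scores = assign_condition(
        "airline passenger rights",
        "frequent flyer and airline passenger",
        ["stuck at the gate again", "coffee time"],
    )
    print(best, scores)
```

In this sketch the expanded query terms would simply be appended to the `query` string before scoring.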
102. 2. Send Tweets
• Highest ranked users in each group sent an outreach tweet
103. 3. Measure Response
• Engagement (retweets, replies, and follows)
• Click Through Rate
• Contributed to the rule
105. PSYCHOLOGICAL TRAITS OF EFFECTIVE CONTRIBUTORS
• Connecting psychological traits, language use, and contribution capability
• Classification, Outreach, and Task Routing
• Inventories
  • Self-efficacy & self-esteem
  • Big 5 personality
  • Self-regulation & self-monitoring
  • Trendsetting & Opinion Leadership
  • Pro-social & altruistic value orientations
106. COMPUTATIONAL SUPPORTS FOR KNOWLEDGE SHARING
• Meaningful games to teach community norms
• Personalized rule recommendation
• Providing assistance, prompts, and examples to improve the quality of contributions
108. RESEARCH PROJECTS
Computational Problem: Information Retrieval, Knowledge Sharing, Personal Informatics
Dimensions Mined: Semantic, Psychological, Behavioral
Projects:
• RESLVE
• Sentiment-based search
• CeRI: Outreach, Task routing, Commenting interface
• Smart Pensieve
• Activity Rhythms
• Smoking Cessation
111. REMINISCENCE
• Current tools are too technically focused
  • Emphasize data capture and logging (photos, videos, scanned documents)
  • Treat memories as information to be later manipulated
• But the activity of reminiscence is actually...
  • Imprecise
  • Social
  • Nuanced
112. SMART PENSIEVE: WHAT MAKES A MEMORY MEANINGFUL?
• Content type
  • Photos, wall posts, status updates, event information
• Social dynamics
  • Tie strength, kind of relationship, amount of interaction
• Temporal features
  • Recent, distant past
121. SMOKING CESSATION
• Leading cause of preventable death & leading form of chemical dependence in U.S.
• 44 million smokers in the U.S. alone (1/5 of population)
• 68.8% report they want to quit and over 50% have tried for at least 1 day in the past year
• Relapse common & a minority permanently abstain
123. INTERVENTION
• Requires tailoring to individual conditions
• Lack of long term patient assessment & follow-up
• Access and affordability are obstacles
• Technology based interventions have major shortcomings
  • Low adherence to established guidelines
  • Not personalized
  • Unable to handle user struggles and setbacks
124. FACTORS INFLUENCING OUTCOME
• Personal, psychological, emotional traits
• Behaviors & activities
• Environment and social interactions
• Cessation motivations and process
125. LEVERAGING DIGITAL FOOTPRINTS
• Naturally expressed language
• Content is posted spontaneously and regularly
• Social setting
• Low-cost, large-scale, longitudinal data access
129. MAKE A PREDICTION
General illness + coughing + wheezing = Today I quit smoking.
Just saw a cigarette commercial with people with holes in their throat. It's official. No more cigarettes.
Today, I quit smoking. My son came home with an ashtray he made in arts and crafts class. FML
• i'm cool, day 4 no cigs but my mom smokes, i stay with her, does not respect me trying to quit :
• I quit smoking on Sunday evening. Day 3 today. I feel exhausted, annoyed, bored. But the fight must go on. Keep fighting :)
• somebody is getting punched in the f***ing mouth today. #coldturkey
130. METHODOLOGY & DATA COLLECTION
• Identify smokers
• Query Twitter firehose for cessation event tweets
• Sample 2000 users
• 3 Mechanical Turkers per tweet for verification
• 2 years' worth of tweets per verified smoker (1 year before cessation event, 1 year after)
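The verification and windowing steps above can be sketched roughly as follows. This is a minimal illustration, not the study's actual pipeline: the slides do not say how the three Turker judgments were combined, so a simple majority vote is assumed, "1 year" is approximated as 365 days, and the function names (`is_verified`, `split_before_after`) are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: "1 year" is approximated as 365 days.
WINDOW = timedelta(days=365)


def is_verified(votes):
    """Treat a cessation tweet as verified when a strict majority of
    annotators (three per tweet on the slide) marked it as genuine.
    Majority voting is an assumption; the slides do not state the rule."""
    return sum(bool(v) for v in votes) * 2 > len(votes)


def split_before_after(tweet_times, cessation_time, window=WINDOW):
    """Split one user's tweet timestamps into those in the window before
    the cessation event and those in the window after it.
    Tweets outside both windows are dropped."""
    before = [t for t in tweet_times
              if cessation_time - window <= t < cessation_time]
    after = [t for t in tweet_times
             if cessation_time <= t <= cessation_time + window]
    return before, after
```

For a user whose verified cessation tweet is dated 1 January 2013, `split_before_after` would keep only tweets from roughly January 2012 through January 2014 and return them as two lists.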
131. MEASURES
Activity variables
• Tweet volume, burstiness, frequency
Social variables
• Friends, followers, tweets with @mentions, unique mentions
Personal & Emotional variables
• Location, sentiment intensity
Behavior Change Process variables
• Cessation date, motive to quit, treatment, stages of behavior change
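As a rough illustration of the activity variables, the sketch below computes volume, a tweets-per-day rate, and a burstiness score from one user's tweet timestamps. The slides do not define burstiness or frequency, so both definitions are assumptions: burstiness uses the Goh-Barabasi coefficient over inter-tweet gaps, which is bounded in [-1, 1] and therefore clearly not the measure behind the values reported later in the deck. All function names are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev


def inter_tweet_gaps(times):
    """Seconds between consecutive tweets, after sorting the timestamps."""
    ordered = sorted(times)
    return [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]


def burstiness(times):
    """Goh-Barabasi coefficient (sigma - mu) / (sigma + mu) over the gaps.

    Assumed definition: -1 means perfectly regular posting, values near 1
    mean highly bursty posting. Returns 0.0 when there is too little data.
    """
    gaps = inter_tweet_gaps(times)
    if len(gaps) < 2:
        return 0.0
    mu, sigma = mean(gaps), pstdev(gaps)
    if mu + sigma == 0:
        return 0.0
    return (sigma - mu) / (sigma + mu)


def activity_features(times):
    """Volume, tweets per day, and burstiness for one user's timestamps."""
    volume = len(times)
    if volume < 2:
        return {"volume": volume, "per_day": 0.0, "burstiness": 0.0}
    span_days = (max(times) - min(times)).total_seconds() / 86400
    per_day = volume / span_days if span_days > 0 else 0.0
    return {"volume": volume, "per_day": per_day,
            "burstiness": burstiness(times)}


# Example: five tweets posted exactly one day apart (perfectly regular).
regular = [datetime(2013, 1, 1) + timedelta(days=i) for i in range(5)]
print(activity_features(regular))
```

Computing the same features separately on the before and after windows of each user would give the paired values that the later comparison slides report.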
136. RESPONSE VARIABLES
Outcome: Survival / Relapse
Survivors
Congratulations to me, still smoke free :)
@username nope i don’t smoke anymore first few weeks were hard but I haven’t craved a cig in months
Relapsers
Day 26: Broke down and bought a pack of smokes last weekend. Smoked the last one today.
Well, tried to quit smokin tobacco but..had a fucked up day
So day 3 of not smoking is about to get cut short..i can’t do it lol
137. ALIGNMENT WITH CDC REPORTS
          Men   Women
CDC       54%   46%
Twitter   59%   41%
[Figure panels: Location, Gender, Abstinence Rates]
141. RESULTS
• Survivors (S) and Relapsers (R)
• Before (B) and After (A) the cessation point
142. SIGNIFICANT DIFFERENCES: ACTIVITY
          Tweets before   Tweets after   Burst before   Burst after   Freq before   Freq after
FAIL      1243            3551           10.119         10.943        3.56          2.704
SUCCEED   412             771            4.459          4.278         9.906         11.254
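To make the before/after shifts in the table easier to compare, the short sketch below computes the relative change for each measure and group. The numbers are copied from the slide; the dictionary layout and the `relative_change` helper are illustrative assumptions, and the snippet does not reproduce whatever significance test the slide title refers to.

```python
# (before, after) pairs copied from the slide's activity table.
activity = {
    "FAIL": {"tweets": (1243, 3551), "burst": (10.119, 10.943), "freq": (3.56, 2.704)},
    "SUCCEED": {"tweets": (412, 771), "burst": (4.459, 4.278), "freq": (9.906, 11.254)},
}


def relative_change(before, after):
    """Fractional change from the before value to the after value."""
    return (after - before) / before


# Relative change per group and measure, e.g. changes["FAIL"]["tweets"].
changes = {
    group: {name: relative_change(*pair) for name, pair in measures.items()}
    for group, measures in activity.items()
}

for group, measures in changes.items():
    for name, value in measures.items():
        print(f"{group:8s} {name:7s} {value:+.1%}")
```

On these figures, tweet counts rise after the cessation point in both groups, with a much larger relative rise in the FAIL group, while the frequency measure moves in opposite directions for the two groups.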
143. TIME OF DAY
“im really considering smoking tonight bcause im so stressed”
144. TIME OF DAY
“outside the club and guy beside me smoking makes me wanna”
“im really considering smoking tonight bcause im so stressed”
145. SIGNIFICANT DIFFERENCES: SOCIAL
          Friends before   Friends after   Followers before   Followers after
FAIL      .093             .073            .074               .064
SUCCEED   .187             .207            .114               .125
“Starting the patch today. Everyone please support me on the road to quitting smoking”
“Ok I started a really big challenge yesterday... I quit smoking! I may need some help from you guys in the upcoming days/weeks”
146. SIGNIFICANT DIFFERENCES: SOCIAL
          Friends before   Friends after   Followers before   Followers after
FAIL      .093             .073            .074               .064
SUCCEED   .187             .207            .114               .125
Day 2 of not smoking #bittersweet
I quit smoking yesterday and everyone is pissing me off!
Day 3 without a cig. Ooo I'm about to shoot someone
151. SUMMARY & CONCLUSION
• Advance our understanding of what our digital footprints reveal about us as humans
• Develop new computational techniques that can make sense of and utilize this data’s nuanced semantic, psychological, and behavioral dimensions
• Apply the resulting intelligent systems across multiple domains in order to help people use digital information and have meaningful experiences with technology
152. THANK YOU!
• Questions, comments, and guidance welcome!
Elizabeth L. Murnane
elm236@cornell.edu
www.cs.cornell.edu/~elm236/