第47回TokyoWebMining, トピックモデリングによる評判分析

トピックモデリングによる評判分析
@I_eric_Y

第47回TokyoWebMining
1

Agenda
目的

1.  トピックモデルの拡張で問題を乗り越えて行く例/定性的な
可視化をコントロールする例をご紹介する

2.  上記技術に関してビジネス観点からご意見を頂く

•  テキストの評判分析概観

•  テキストの評点回帰

•  トピックモデリングによるアプローチのメリット

–  定性的な可視化

–  柔軟なモデルの改造

•  実例：

Domain-‐Dependent/Independent
Topic
Switching
Model

2

テキストの評判分析
•  何らかの評価に関する情報をテキストから抽出,
整理,
集約,

提示する技術

•  アンケートの分析やお客様の声の分析に使える技術

–  製品やお店の改善,
顧客ニーズの把握,
マーケティングとか

•  データ分析ソリューションの一つとして備えられていることも
多い

–  IBM,
NTTコミュニケーションズ,…

•  Consumer
Generated
Mediaが出てきたときに急速に注目を
浴びた

–  Amazon,
TwiNer,
食べログ,…

–  今では個人の意見や評判に関するデータがネットにあふれている

3

評判分析を構成する技術
•  評価表現辞書の構築

–  評価表現（言葉）とその表現の極性（肯定/否定的）がペアとなった辞書

•  評価情報の観点からの文書分類【今回主に扱うところ】

–  文書の粒度で評価極性を決める

•  評価情報を含む文の抽出

–  文の粒度で評価極性を決める

•  評価情報の要素組の抽出

–  要素組の粒度で評価極性を決める

•  他にもあるかもしれません

4

評判情報による文書分類
•  更に2つに分けられる

–  評価表現の比率で分類する

•  極性が肯定的な表現が多かったら肯定的な文書とする

•  研究はたくさんあるようです

–  機械学習で分類する

•  2002年に初めて適用され,
最初は主に2値分類（肯定/否定）

–  高い分類精度を達成

•  興味深い知見も出ている

–  形容詞だけでなく,
全ての単語を用いた方がよい結果に

–  表層素性に加え,
言語素性も入れるとよい結果に

•  中立を含めた3値分類は意外に最近

–  2005年〜

•  更に細かい粒度の分類,
そして評点の回帰へ【今回扱うところ】

–  AmazonのレビューのraRngをテキストから回帰する

5

hNp://www.amazon.com/
6
テキストの評点回帰
•  テキストを見て評点を当てるシンプルなタスク

トピックモデリングによるアプローチのメリット
•  ネット上の荒れたテキストにある程度対応できる

–  次元圧縮でトピックに丸め込み,
トピック分布を特徴量として回帰を行う

–  分かち書きが適当でも分け方が同じ規則であればトピック特徴量としては
問題ない場合もある

•  定性的な可視化【今回主に扱うところ】

–  トピックの単語分布を見ると何となく内容がわかる

–  評価極性付きトピックとしてモデリングすると,
トピックの単語分布が極性
を反映したものになり,
「評判のトピック」を可視化できる

　　→　評判要約やopinion
miningに繋がる

•  柔軟にモデルを改造できる【今回主に扱うところ】

–  何らかの問題にぶつかったときへの対応
ex)ドメイン依存問題への対応

–  定性的な可視化（トピックの単語分布）のコントロール

ex)評判トピックの極性をコントロールしたい

7




•  定性的な可視化



miningに繋がる

•  柔軟にモデルを改造できる




8

•  Supervised
Latent
Dirichlet
AllocaRon[1]

–  LDAと線形回帰のjoint
model

linear
regression
y = ˆT
ˆw
9
topic
distribuRon
=
feature

0

0.25

0.5

0.75

1

A
B
C
D

ˆ
基本手法

Supervised
LDA
•  LDA+線形回帰やbag-‐of-‐words
+
Lassoよりも精度が改善

•  jointで解くことでトピックに線形回帰の係数と対応した極性が
付与され,
単語分布もそれに応じる【評判のトピック】

10
Figure 1: (Left) A graphical model
representation of Supervised Latent
Dirichlet allocation. (Bottom) The
topics of a 10-topic sLDA model fit to
the movie review data of Section 3.
both
motion
simple
perfect
fascinating
power
complex
however
cinematography
screenplay
performances
pictures
effective
picture
his
their
character
many
while
performance
between
!30 !20 !10 0 10 20
! !! ! !! ! ! ! !
more
has
than
films
director
will
characters
one
from
there
which
who
much
what
awful
featuring
routine
dry
offered
charlie
paris
not
about
movie
all
would
they
its
have
like
you
was
just
some
out
bad
guys
watchable
its
not
one
movie
least
problem
unfortunately
supposed
worse
flat
dull
d Zd,n Wd,n
N
D
Kk
Yd η, σ2
By regressing the response on the empirical topic frequencies, we treat the response as non-
exchangeable with the words. The document (i.e., words and their topic assignments) is generated
first, under full word exchangeability; then, based on the document, the response variable is gen-
erated. In contrast, one could formulate a model in which y is regressed on the topic proportions
✓. This treats the response and all the words as jointly exchangeable. But as a practical matter,
of an unconstrained real-valued response. Then, in Section 2.3, w
sLDA, and explain how it handles diverse response types.
Focus now on the case y 2 R. Fix for a moment the model param
k a vector of term probabilities), the Dirichlet parameter ↵, and the
Under the sLDA model, each document and response arises from th
1. Draw topic proportions ✓ | ↵ ⇠ Dir(↵).
2. For each word
(a) Draw topic assignment zn | ✓ ⇠ Mult(✓).
(b) Draw word wn | zn, 1:K ⇠ Mult( zn ).
3. Draw response variable y | z1:N , ⌘, 2 ⇠ N ⌘> ¯z, 2 .
Here we define ¯z := (1/N)
PN
n=1 zn. The family of probability dis
generative process is depicted as a graphical model in Figure 1.
Notice the response comes from a normal linear model. The cova
observed) empirical frequencies of the topics in the document. The
frequencies constitute ⌘. Note that a linear model usually includes a
to adding a covariate that always equals one. Here, such a term is
nents of ¯z always sum to one.
2
Figure 1: (Left) A graphical model
representation of Supervised Latent
Dirichlet allocation. (Bottom) The
topics of a 10-topic sLDA model fit to
the movie review data of Section 3.
both
motion
simple
perfect
fascinating
power
complex
however
cinematography
screenplay
performances
pictures
effective
picture
his
their
character
many
while
performance
between
!30 !20 !10 0 10 20
! !! ! !! ! ! ! !
more
has
than
films
director
will
characters
one
from
there
which
who
much
what
awful
featuring
routine
dry
offered
charlie
paris
not
about
movie
all
would
they
its
have
like
you
was
just
some
out
bad
guys
watchable
its
not
one
movie
least
problem
unfortunately
supposed
worse
flat
dull
d Zd,n Wd,n
N
D
Kk
Yd η, σ2

models
ra)ngs
mul)-‐
lingual
mul)-‐
aspects
polari)es
domain

dependency
observed-‐
labels
Bayesian
Nonpara
metrics
supervised
LDA
○
○
ML-‐sLDA
○
○
○
MAS
○
○
○
MG-‐LDA
○
JST/ASUM
○
JAS
○
○
Yoshida
et
al.
○
○
DDI-‐TSM
○
○
○
○
○
11
評判分析系トピックモデル
•  多言語拡張/Aspect毎の評点回帰/離散極性値付
与/ドメイン適応…







miningに繋がる





12

•  Domain
Dependent/Independent-‐Topic
Switching
Modelを例に説明

•  単語のドメイン依存問題

•  ドメイン
=
ECサイトのカテゴリ:
BOOK,
CD,
DVD,…

13
モデル拡張による問題解決の例

•  A
domain
can
contain
both
Domain-‐dependent
and

-‐independent
words.

BOOK

-‐
The
story
is
good

-‐
Too
small
leNer

-‐
Boring
magazine

-‐
Product
was
scratched
KITCHEN

-‐
The
toaster
doesn’t
work

-‐
The
knife
is
sturdy

-‐
This
dishcloth
is
easy
to
use

-‐
Customer
support
is
not
good

14
ドメイン依存問題

BOOK

-‐
The
story
is
good

-‐
Too
small
le?er

-‐
Boring
magazine

-‐
Product
was
scratched
KITCHEN

-‐
The
toaster
doesn’t
work

-‐
The
knife
is
sturdy

-‐
This
dishcloth
is
easy
to
use

-‐
Customer
support
is
not
good

15
•  A
domain
can
contain
both
Domain-‐dependent
and

-‐independent
words.


BOOK

-‐
The
story
is
good

-‐
Too
small
le?er

-‐
Boring
magazine

-‐
Product
was
scratched
KITCHEN

-‐
The
toaster
doesn’t
work

-‐
The
knife
is
sturdy

-‐
This
dishcloth
is
easy
to
use

-‐
Customer
support
is
not
good

16
•  A
domain
can
contain
both
Domain-‐dependent
and

-‐independent
words.


BOOK

-‐
The
story
is
good

-‐
Too
small
le?er

-‐
Boring
magazine

-‐
Product
was
scratched
KITCHEN

-‐
The
toaster
doesn’t
work

-‐
The
knife
is
sturdy

-‐
This
dishcloth
is
easy
to
use

-‐
Customer
support
is
not
good

17
•  A
domain
can
contain
both
Domain-‐dependent
and

-‐independent
words.

•  製品のドメインで単語分布が異なるがsLDAではこれを考慮で
きない

–  ドメインで使われる評価表現は異なるにも関わらず

•  ドメイン適応と類似した対策を取りたい

BOOK

-‐
The
story
is
good

-‐
Too
small
le?er

-‐
Boring
magazine

-‐
Product
was
scratched
KITCHEN

-‐
The
toaster
doesn’t
work

-‐
The
knife
is
sturdy

-‐
This
dishcloth
is
easy
to
use

-‐
Customer
support
is
not
good

1.
Introducing
domain-‐dependent/independent

topics
into
sLDA

2.
Domain-‐Dependent/Independent
Topic

Switching
Model
(DDI-‐TSM)

Proposal

18
•  A
domain
can
contain
both
Domain-‐dependent
and

-‐independent
words.


BOOK$ CD$ DVD$ ELECRRONICS$
Domain$Dependent$Topics$
switch
topic word
A$ B$ C$ D$ E$ F$ G$ H$ I$ …$
Domain$Independent$Topics$
Document
observed$domain$labels
ELECTRONICS
19
モデリングによるアプローチ

switch
topic word
A$ B$ C$ D$ E$ F$ G$ H$ I$ …$
Document
observed$domain$labels
ELECTRONICS
20

domain$dependent
CD music
A$ B$ C$ D$ E$ F$ G$ H$ I$ …$
Document$in$domain$‘CD’
………"
ELECTRONICS
21

domain$dependent
CD music
A$ B$ C$ D$ E$ F$ G$ H$ I$ …$
………"
ELECTRONICS
22

domain$dependent
CD music
A$ B$ C$ D$ E$ F$ G$ H$ I$ …$
………"
ELECTRONICS
23

domain$independent
C good
A$ B$ C$ D$ E$ F$ G$ H$ I$ …$
………"
ELECTRONICS
24

•  Domain
Dependent/Independentは0/1の値をとるス
イッチの潜在変数で切り替える

switching

latent

variable
word

(observed)
topical

latent

variable

(same
as
LDA)
Z
0
music
Z
1
good
domain

dependent

topic
dist.
domain

independent

topic
dist.
Z
X
W
DD
DI
DD
DI
DD
DI
25

26
K:
The
number
of

topics
in
sLDA

DDI-‐TSMは総トピック

数10~20で高速に

この精度を達成可能
評点回帰実験結果







miningに繋がる





27

Book−negative
Book−positive
DVD−negative
DVD−poisitive
Electronics−negative
Electronics−positive
Kitchen−negative
Kitchen−positive
weight that did not correspond to labels
bias
−16
−14
−12
−10
−8
−6
−4
−2
0
2
4 Domain-‐Dependent
Domain-‐Independent
weight
parameters
posiRve
neutral
negaRve
28
可視化のコントロールの例
Domain
Dependent

Topicに対応する係
数は奇麗に揃って
いる

•  Domain
Dependent
Topicはドメイン情報で制約をかける

–  Labeled
LDA[3]

Observed
labels:
BOOK,
CD
0

0.2

0.4

0.6

0.8

1

All
observed
labels:
BOOK,
CD,
DVD,
KITCHEN
-‐>
4
domain
dependent
topics
29
可視化のコントロールの例
•  更に評点が3以上であればBOOK-‐posiRve,
2以下であれば
BOOK-‐negaRveというように細分化

Book−negative
Book−positive
DVD−negative
DVD−poisitive
Kitchen−negative
Kitchen−positive
bias
−16
−14
−12
−10
−8
−6
−4
−2
0
2
weight
parameters
music video
great good best
better funny
interesting nice
like love wonderful
worst waste
poor boring
wrong terrible
like good best
funny interesting
-
product ipod work
player printer sony
phone battery
keyboard audio
button speaker
monitor memory
great good like
better excellent
perfect happy clear
product speaker work
sound phone player
software dvd radio
tv device printer
ipod computer
battery sony
button headphones
nothing waste never
didn’t cannot problem
disappointed doesn’t
good great
-
coffee water
machine filter
cooking food
glass steel
stainless ice
rice espresso
wine tea toaster
wonderful sturdy
sharp love great
good well easy
best better
product water coffee
steel tank kitchen
knives hot heat
maker design machine
work vacuum filter
don’t doesn’t
didn’t never problem
few broke less
disappointed poor
cheap nothing no
good great better nice
-
arthur harold
Electronics-‐posiRve
Kitchen-‐posiRve
30
コントロールされたトピックの単語分布

Book−negative
Book−positive
DVD−negative
DVD−poisitive
Kitchen−negative
Kitchen−positive
bias
−16
−14
−12
−10
−8
−6
−4
−2
0
2
weight
parameters
Kitchen-‐negaRve
actor music cast
no never didn’t
worst waste
poor boring
wrong terrible
like good best
funny interesting
-
product speaker work
sound phone player
software dvd radio
tv device printer
ipod computer
battery sony
button headphones
nothing waste never
didn’t cannot problem
disappointed doesn’t
good great
-
product water coﬀee
steel tank kitchen
knives hot heat
work vacuum ﬁlter
don’t doesn’t
few broke less
disappointed poor
cheap nothing no
-
amazon product
service customer
arthur harold
bravo vincent
moor america
adventure comedy
Electronics-‐negaRve
31

Book−negative
Book−positive
DVD−negative
DVD−poisitive
Kitchen−negative
Kitchen−positive
bias
−16
−14
−12
−10
−8
−6
−4
−2
0
2
weight
parameters
Domain-‐Independent-‐neutral
Domain-‐Independent-‐posiRve
wine tea toaster
wonderful sturdy
sharp love great
good well easy
best better
few broke less
disappointed poor
cheap nothing no
good great love
work funny sexy
greatly best fans
amusing cool
thrilling succinctly
accurately
satisfying gracious
amazon product
service customer
quality warranty
support manufacturer
vendor
damage matter poor
scratched blame wrong
problem defective
arthur harold
bravo vincent
moor america
adventure comedy
minelli john
manhattan roxanne
bob napoleon
benjamin ghostbusters
book dvd
amazon
dependent topics, and we
se topics are positive or
ous values. Fourth, we
s the baseline model that
ndence in the sentiment
ical ratings from reviews
The experimental results showed two interesting ﬁndings.
First, DDITSM converged more rapidly than the baseline
model because of the strong constraint due to observed
domain information. Second, domain-independent topics
had positive, negative, and neutral polarities in the form
of continuous values. Neutral domain-independent topics
included proper nouns, and this means that proper nouns
-
-
arthur harold
bravo vincent
moor america
adventure comedy
minelli john
manhattan roxanne
bob napoleon
book dvd
amazon
32

Book−negative
Book−positive
DVD−negative
DVD−poisitive
Kitchen−negative
Kitchen−positive
bias
−16
−14
−12
−10
−8
−6
−4
−2
0
2
weight
parameters
Domain-‐Independent-‐negaRve
knives hot heat
work vacuum ﬁlter
don’t doesn’t
few broke less
disappointed poor
cheap nothing no
-
amazon product
service customer
quality warranty
support manufacturer
vendor
damage matter poor
scratched blame wrong
problem defective
arthur harold
bravo vincent
moor america
adventure comedy
minelli john
manhattan roxanne
bob napoleon
book dvd
amazon
The experimental results showed two interesting ﬁndings.
First, DDITSM converged more rapidly than the baseline
model because of the strong constraint due to observed
domain information. Second, domain-independent topics
had positive, negative, and neutral polarities in the form
of continuous values. Neutral domain-independent topics
included proper nouns, and this means that proper nouns
33
Complaints
about
the
e-‐commerce
site,

customer
support,
and
delivery

•  評判の要約/レポート
–  トピックモデルでできるのか？
–  例えば(名詞,形容詞)のペアを入力にする
•  複雑な構造の扱い
–  構文構造などをどう取り込んでいくか
•  学習時間の増加
–  モデルの巨大化/複雑化
–  データの巨大化
–  第６回DSIRNLP “大規模データに対するベイズ学習” @slideshare
•  ハイパーパラメータチューニング
–  定量的指標と定性的可視化の乖離
34
残された課題

•  評判分析の中でもトピックモデリングによる評点回帰を扱った
•  メリットとしては
–  荒れたテキストでもある程度うまくいく
–  単語分布に極性が反映され, 定性的な可視化がコントロールできる
–  可視化したいイメージや問題に合わせてモデルを柔軟に変更できる
•  例としてDDI-TSMを扱った
–  ドメイン適応の問題をモデリングで対応
–  ドメイン毎の評判をなんとなく可視化（可視化のコントロール）
•  課題は結構あります
–  Neural Networkが興味深い結果を出ているので, そもそもトピックモデ
ルでのアプローチが妥当か検討の余地がある
35
まとめ

References
36
[1]
D.
M.
Blei
and
J.
D.
McAuliﬀe.
Supervised
topic
models.
In
Neural

InformaRon
Processing
Systems,
20:121–128,
2008.

[2]
D.
M.
Blei,
A.
Y.
Ng,
and
M.
I.
Jordan.
Latent
dirichlet
allocaRon.

Journal
of
Machine
Learning
Research,
3:993–1022,
2003.

[3]
D.
Ramage,
D.
Hall,
R.
NallapaR,
and
C.
D.
Manning.
Labeled
lda:
A

supervised
topic
model
for
credit
aNribuRon
in
mulR-‐labeled
corpora.

In
Proceedings
of
the
2009
Conference
on
Empirical
Methods
in

Natural
Language
Processing,
1:248–256,
2009.

[4]
Y.
W.
Teh,
M.
I.
Jordan,
M.
J.
Beal,
and
D.
M.
Blei.
Hierarchical

dirichlet
processes.
Journal
of
the
American
StaRsRcal
AssociaRon,

101(476):1566–1581,
2006.

[5]
J.
Blitzer,
M.
Dredze,
and
F.
Pereir.
Biographies,
bollywood,
boom-‐
boxes
and
blenders:
Domain
adaptaRon
for
senRment
classiﬁcaRon.
In

Annual
MeeRng-‐AssociaRon
For
ComputaRonal
LinguisRcs,
45(1):440,

2007.

第47回TokyoWebMining, トピックモデリングによる評判分析

Recommandé

Recommandé

Contenu connexe

Similaire à 第47回TokyoWebMining, トピックモデリングによる評判分析

Similaire à 第47回TokyoWebMining, トピックモデリングによる評判分析 (15)

第47回TokyoWebMining, トピックモデリングによる評判分析