1. Harnessing Deep Neural Networks with Logic Rules
Zhiting Hu, Xuezhe Ma,
Zhengzhong Liu, Eduard Hovy, and Eric P. Xing
ACL 2016
Presenter: Sho Takase (高瀬翔), Tohoku University
2016/09/12, 8th Advanced NLP Study Group (最先端NLP勉強会)
Figures and tables in these slides are taken from [Hu+ 16].
2. Goal
• We want to inject general rules and human intuition into neural networks
  – In sentiment analysis, the positive/negative polarity of a sentence "A but B" agrees with that of B
  – In named entity recognition, I-ORG cannot follow B-PER
• Rules and intuitions are expressed in first-order logic
  – equal(y_{i-1}, B-PER) ⇒ ¬equal(y_i, I-ORG)
• The logic rules are used as constraints during training
6. Computing q(y|x) (1/2)
• What q should achieve:
  1. it satisfies the constraints (the logic rules)
  2. it stays close to the model output p_θ(y|x)
• Objective function (from the paper):
That is, for each rule (indexed by l) and each of its groundings (indexed by g_l) on (X, Y), we expect E_{q(Y|X)}[r_{l,g_l}(X, Y)] = 1, with confidence λ_l. The constraints define a rule-regularized space of all valid distributions. For the second property, we measure the closeness between q and p_θ with KL-divergence, and wish to minimize it. Combining the two factors together and further allowing slackness for the constraints, we finally get the following optimization problem:
$$\min_{q,\,\xi \ge 0}\ \mathrm{KL}\big(q(Y|X)\,\|\,p_\theta(Y|X)\big) + C \sum_{l,g_l} \xi_{l,g_l}$$
$$\text{s.t.}\quad \lambda_l \big(1 - \mathbb{E}_q[r_{l,g_l}(X, Y)]\big) \le \xi_{l,g_l},\qquad g_l = 1,\dots,G_l,\quad l = 1,\dots,L \tag{3}$$
where ξ_{l,g_l} ≥ 0 is the slack variable for the respective logic constraint, and C is the regularization parameter. The problem can be seen as projecting p_θ into the constrained subspace.
Annotations from the slide:
• The KL term corresponds to goal 2; the constraint corresponds to goal 1.
• ξ relaxes (softens) the constraints.
• r_{l,g_l}(X, Y) is a continuous value in [0, 1]; it is 1 when grounding g_l satisfies rule r_l.
• λ_l is the strength of rule r_l (a large λ means a rule that must be satisfied).
7. Computing q(y|x) (2/2)
• q(y|x) can be solved analytically (via the Lagrangian dual problem)
  – Solved in the same way as posterior regularization [Ganchev+ 10]
• The strength of each constraint is determined by C (a constant) and λ (a per-rule value)
• When a rule is not satisfied, q(y|x) becomes small
From the paper: the problem is convex and can be efficiently solved in its dual form with closed-form solutions; the detailed derivation is given in the supplementary materials. The solution is:
$$q^*(Y|X) \;\propto\; p_\theta(Y|X)\,\exp\Big\{-\sum_{l,g_l} C\,\lambda_l\,\big(1 - r_{l,g_l}(X, Y)\big)\Big\} \tag{4}$$
Intuitively, a strong rule with a large λ_l will lead to low probabilities for predictions that fail to meet the constraints.
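Eq. (4) is simple to compute once the rule truth values are available. Below is a minimal numpy sketch (the function and variable names are mine, not the paper's): it reweights the base model's distribution by exp(−Cλ_l(1 − r_l)) per label and renormalizes.

```python
import numpy as np

def teacher_q(p_theta, rule_truths, lambdas, C=1.0):
    """Sketch of Eq. (4): reweight p_theta by how well each label satisfies
    the rules, then renormalize.

    p_theta     : (K,) base-model distribution over K labels for one example
    rule_truths : list of (K,) arrays; rule_truths[l][y] = r_l in [0, 1],
                  the soft truth of rule l if label y were predicted
    lambdas     : list of per-rule confidences lambda_l
    C           : regularization strength from Eq. (3)
    """
    penalty = np.zeros_like(p_theta)
    for r, lam in zip(rule_truths, lambdas):
        penalty += C * lam * (1.0 - r)      # C * lambda_l * (1 - r_l)
    q = p_theta * np.exp(-penalty)          # unnormalized q*(y|x)
    return q / q.sum()

# Hypothetical example: the base model slightly prefers "positive" (index 0),
# and an A-but-B rule with truth values [0.9, 0.2] also favors it.
p = np.array([0.6, 0.4])
print(teacher_q(p, [np.array([0.9, 0.2])], [1.0], C=6.0))  # ~[0.99, 0.01]
```

With a strong penalty, the teacher shifts almost all mass to the rule-consistent label, which is exactly the behavior described above: q(y|x) becomes small where the rule is violated.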
8. The Rules
• Expressed in first-order logic
  – the positive/negative polarity of a sentence "A but B" agrees with that of B
• Converted to continuous values in [0, 1] using the probabilistic soft logic framework
  – the logic operators become:
(A rule grounding r_{l,g_l} is typically relevant to only a single example or a subset of examples, though the paper gives the most general form on the entire set.) From the paper: the FOL rules are encoded using soft logic (Bach et al., 2015) for flexible encoding and stable optimization. Specifically, soft logic allows continuous truth values from the interval [0, 1] instead of {0, 1}, and the Boolean logic operators are reformulated as:
$$\begin{aligned} A \mathbin{\&} B &= \max\{A + B - 1,\, 0\} \\ A \vee B &= \min\{A + B,\, 1\} \\ A_1 \wedge \dots \wedge A_N &= \textstyle\sum_i A_i / N \\ \neg A &= 1 - A \end{aligned} \tag{1}$$
Here & and ∧ are two different approximations to logical conjunction.
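The operators of Eq. (1) translate directly into code. A small sketch (the function names are mine):

```python
def soft_and(a, b):        # A & B  (Lukasiewicz conjunction)
    return max(a + b - 1.0, 0.0)

def soft_or(a, b):         # A v B
    return min(a + b, 1.0)

def soft_avg(*terms):      # A_1 ^ ... ^ A_N  (averaging conjunction)
    return sum(terms) / len(terms)

def soft_not(a):           # not A
    return 1.0 - a

print(soft_and(0.7, 0.6))  # ~0.3: truth values live in [0, 1], not {0, 1}
```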
9. Worked Example of a Rule
• The positive/negative polarity of a sentence "A but B" agrees with that of B
• When the sentence is positive, the rule's truth value is (1 + σ_θ(B)_+)/2
• When the sentence is negative, it is (2 − σ_θ(B)_+)/2
(derivation below)
From the paper: the conjunction word "but" is one of the strong indicators for such sentiment changes in a sentence, where the sentiment of clauses following "but" generally dominates. We thus consider sentences S with an "A-but-B" structure, and expect the sentiment of the whole sentence to be consistent with the sentiment of clause B. The logic rule is written as:

$$\text{has-`A-but-B'-structure}(S) \Rightarrow \big(\mathbf{1}(y = +) \Rightarrow \sigma_\theta(B)_+ \ \wedge\ \sigma_\theta(B)_+ \Rightarrow \mathbf{1}(y = +)\big) \tag{5}$$
Annotations from the slide:
• has-'A-but-B'-structure(S): sentence S has an "A but B" structure
• 1(y = +): 1 when the sentence is positive, 0 otherwise
• σ_θ(B)_+: the probability the model assigns to clause B being positive
$$\begin{aligned} &\big(\mathbf{1}(y=+) \Rightarrow \sigma_\theta(B)_+\big) \wedge \big(\sigma_\theta(B)_+ \Rightarrow \mathbf{1}(y=+)\big) \\ \Leftrightarrow\ &\big(\neg\mathbf{1}(y=+) \vee \sigma_\theta(B)_+\big) \wedge \big(\neg\sigma_\theta(B)_+ \vee \mathbf{1}(y=+)\big) \\ \Leftrightarrow\ &\big((1 - \mathbf{1}(y=+)) \vee \sigma_\theta(B)_+\big) \wedge \big((1 - \sigma_\theta(B)_+) \vee \mathbf{1}(y=+)\big) \\ \Leftrightarrow\ &\min\{1 - \mathbf{1}(y=+) + \sigma_\theta(B)_+,\, 1\} \wedge \min\{1 - \sigma_\theta(B)_+ + \mathbf{1}(y=+),\, 1\} \end{aligned}$$

which equals (1 + σ_θ(B)_+)/2 when y = + (positive), and (2 − σ_θ(B)_+)/2 when y = − (negative).
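As a sanity check on the derivation, the same computation in code; sigma_B_pos is a hypothetical model output and the helper name is mine:

```python
def rule5_truth(y_is_positive, sigma_B_pos):
    """Soft-logic truth of rule (5) for a sentence with A-but-B structure."""
    ind = 1.0 if y_is_positive else 0.0        # 1(y = +)
    left = min(1.0 - ind + sigma_B_pos, 1.0)   # 1(y=+) => sigma(B)+
    right = min(1.0 - sigma_B_pos + ind, 1.0)  # sigma(B)+ => 1(y=+)
    return (left + right) / 2.0                # averaging conjunction

sigma = 0.8                                    # hypothetical sigma_theta(B)+
assert abs(rule5_truth(True, sigma) - (1 + sigma) / 2) < 1e-9   # positive: 0.9
assert abs(rule5_truth(False, sigma) - (2 - sigma) / 2) < 1e-9  # negative: 0.6
```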
11. Experimental Setup (Sentiment Analysis)
• Binary positive/negative classification task
• Datasets:
  – Stanford Sentiment Treebank (SST2)
  – Movie Review (MR)
  – Customer Review (CR)
• Baseline: a (simple) CNN
  – the same model as [Kim+ 14]
• Applied constraint (rule):
  – the positive/negative polarity of a sentence "A but B" agrees with that of B
  – importance λ = 1
Algorithm 1: Harnessing NN with Rules
Input: the training data D = {(x_n, y_n)}_{n=1}^N, the rule set R = {(R_l, λ_l)}_{l=1}^L
Parameters: π (imitation parameter), C (regularization strength)
1: Initialize neural network parameters θ
2: repeat
3:   Sample a minibatch (X, Y) ⊂ D
4:   Construct the teacher network q with Eq. (4)
5:   Transfer knowledge into p_θ by updating θ with Eq. (2)
6: until convergence
Output: distilled student network p_θ and teacher network q
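A schematic PyTorch-style rendering of this loop on toy data may help; everything here (the linear stand-in for the network, the π schedule, the inactive rule) is an illustrative assumption, not the paper's exact setup:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(10, 2)                 # toy stand-in for p_theta
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def teacher_q(logits, rule_truth, C=1.0, lam=1.0):
    # Eq. (4) in log space: log q = log p_theta - C*lambda*(1 - r), renormalized
    return F.softmax(F.log_softmax(logits, -1) - C * lam * (1.0 - rule_truth), -1)

for step in range(100):
    x = torch.randn(16, 10)                    # minibatch of 16 toy examples
    y = (x[:, 0] > 0).long()                   # toy hard labels
    r = torch.ones(16, 2)                      # rule truth; all-ones = inactive
    logits = model(x)
    with torch.no_grad():
        q = teacher_q(logits, r)               # step 4: construct teacher
    pi = min(0.95, 1.0 - 0.6 ** step)          # an illustrative pi schedule
    hard = F.cross_entropy(logits, y)          # true-label term of Eq. (2)
    soft = -(q * F.log_softmax(logits, -1)).sum(-1).mean()  # imitate teacher
    loss = (1 - pi) * hard + pi * soft         # step 5: update theta
    opt.zero_grad(); loss.backward(); opt.step()
```

The schedule starts at π = 0 (trust only the hard labels, since the freshly initialized teacher is poor) and grows toward its cap as the teacher improves, matching the motivation quoted below.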
From the paper: q is more suitable when the logic rules introduce additional dependencies (e.g., spanning over multiple examples), requiring joint inference. In contrast, p is more lightweight and efficient, and useful when rule evaluation is expensive or impossible at prediction time. The experiments compare the performance of p and q extensively.

Imitation strength π: the imitation parameter π in Eq. (2) balances between emulating the teacher's soft predictions and predicting the true hard labels. Since the teacher network is constructed from p_θ, which at the beginning of training produces low-quality predictions, the true labels are favored in the early stages.
[Figure 2: The CNN architecture for sentence-level sentiment analysis. The sentence representation vector is followed by a fully-connected layer with softmax output activation, to output sentiment predictions.]
4.1 Sentiment Classification: Sentence-level sentiment analysis is to identify the sentiment (e.g., positive or negative) underlying an individual sentence. The task is crucial for many opinion mining applications. One challenging point of the task is to capture the contrastive sense within a sentence.
Multiple filters with varying window sizes are used to obtain multiple features; Figure 2 shows the network architecture.
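For reference, a compact PyTorch sketch of this style of CNN; the hyperparameters here are illustrative choices, not necessarily the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KimCNN(nn.Module):
    """Sketch of the base CNN (Kim-2014 style): parallel convolutions over
    word windows, max-over-time pooling, then a softmax classifier."""

    def __init__(self, vocab_size=10000, emb_dim=300, n_classes=2,
                 widths=(3, 4, 5), n_filters=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, w) for w in widths)
        self.fc = nn.Linear(n_filters * len(widths), n_classes)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        x = self.emb(tokens).transpose(1, 2)      # (batch, emb_dim, seq_len)
        feats = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(feats, dim=1))   # logits: (batch, n_classes)

logits = KimCNN()(torch.randint(0, 10000, (4, 20)))  # 4 sentences, 20 tokens
```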
Logic rules: one difficulty for the plain neural network is to identify the contrastive sense in order to capture the dominant sentiment precisely. The conjunction word "but" is one of the strong indicators for such sentiment changes; the "A-but-B" rule of Eq. (5) above encodes this.
12. Results (Sentiment Analysis) (1/3)
• Performance improves over the baseline [Kim+ 14]
• State-of-the-art on MR and CR
• Comparable to MVCNN (multiple word embeddings + a more complex CNN (multichannel, multilayer))
Model SST2 MR CR
1 CNN (Kim, 2014) 87.2 81.3±0.1 84.3±0.2
2 CNN-Rule-p 88.8 81.6±0.1 85.0±0.3
3 CNN-Rule-q 89.3 81.7±0.1 85.3±0.3
4 MGNC-CNN (Zhang et al., 2016) 88.4 – –
5 MVCNN (Yin and Schütze, 2015) 89.4 – –
6 CNN-multichannel (Kim, 2014) 88.1 81.1 85.0
7 Paragraph-Vec (Le and Mikolov, 2014) 87.8 – –
8 CRF-PR (Yang and Cardie, 2014) – – 82.7
9 RNTN (Socher et al., 2013) 85.4 – –
10 G-Dropout (Wang and Manning, 2013) – 79.0 82.1
Table 1: Accuracy (%) of Sentiment Classification. Row 1, CNN (Kim, 2014) is the base network
corresponding to the “CNN-non-static” model in (Kim, 2014). Rows 2-3 are the networks enhanced by
our framework: CNN-Rule-p is the student network and CNN-Rule-q is the teacher network. For MR and
CR, we report the average accuracy±one standard deviation using 10-fold cross validation.
From the paper: with the base networks, substantial improvements are obtained on both tasks, achieving state-of-the-art or comparable results relative to previous best-performing systems. (On the data: CR (Hu and Liu, 2004) consists of customer reviews of various products, containing 2 classes and 3,775 instances; for MR and CR, 10-fold cross validation is used as in previous work.)
14. Results (Sentiment Analysis) (3/3)
• Effect of training-data size, and use of unlabeled data
• Using unlabeled data improves performance
  – the constraints let the model exploit unlabeled data effectively
Data size 5% 10% 30% 100%
1 CNN 79.9 81.6 83.6 87.2
2 -Rule-p 81.5 83.2 84.5 88.8
3 -Rule-q 82.5 83.9 85.6 89.3
4 -semi-PR 81.5 83.1 84.6 –
5 -semi-Rule-p 81.7 83.3 84.7 –
6 -semi-Rule-q 82.7 84.2 85.7 –
Table 3: Accuracy (%) on SST2 with varying sizes of labeled data and semi-supervised learning. The header row is the percentage of labeled examples.
• Rows 1–3: trained on ◯% of the data, with labels.
• Rows 4–6 (-semi-): trained on ◯% labeled data plus the remaining (100 − ◯)% as unlabeled data.
15. Experimental Setup (Named Entity Recognition)
• Task: recognizing four entity types (PER, ORG, LOC, MISC)
• Dataset: CoNLL-2003
  – using the BIOES tagging scheme (as in [Lample+ 16] and others)
• Baseline: a bidirectional LSTM
  – [Chiu and Nichols, 15] with the CNN removed
• Applied constraints (rules):
  – the output tag sequence must be well-formed (see the sketch below)
    • importance λ = ∞ (a hard constraint)
  – in a list structure, corresponding items receive the same type of tag
    • e.g., in "1. Juventus, 2. Barcelona, 3. …", Juventus and Barcelona receive the same tag type
    • importance λ = 1
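To make the hard constraint concrete, here is a small sketch of a BIOES transition checker whose output plays the role of the rule truth value r (the helper name is mine); with λ = ∞, Eq. (4) gives exp(−∞) = 0, i.e., q assigns zero probability to any sequence containing an invalid tag bigram:

```python
# Tag format is "X-Y" with X in BIOES and Y an entity type; "O" is outside.

def valid_transition(prev_tag, tag):
    """Return 1.0 for a well-formed tag bigram, 0.0 for a violation."""
    if prev_tag == "O" or prev_tag.startswith(("E-", "S-")):
        # previous entity (if any) is closed: next tag must start fresh
        return 0.0 if tag.startswith(("I-", "E-")) else 1.0
    # prev is B-Y or I-Y: entity must continue as I-Y or E-Y of the SAME type
    prev_type = prev_tag.split("-", 1)[1]
    ok = tag.startswith(("I-", "E-")) and tag.split("-", 1)[1] == prev_type
    return 1.0 if ok else 0.0

assert valid_transition("B-PER", "I-ORG") == 0.0   # the rule from slide 2
assert valid_transition("B-PER", "E-PER") == 1.0
assert valid_transition("O", "I-LOC") == 0.0
```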
From the paper: where 1(·) is an indicator function that takes 1 when its argument is true, and 0 otherwise; class '+' represents 'positive'; and σ_θ(B)_+ is the element of σ_θ(B) for class '+'. By Eq. (1), when S has the 'A-but-B' structure, the truth value of the above logic rule equals (1 + σ_θ(B)_+)/2 when y = +, and (2 − σ_θ(B)_+)/2 otherwise. Note that two-way classification (i.e., positive and negative) is assumed here, though it is straightforward to design rules for finer-grained sentiment classification.
4.2 Named Entity Recognition: NER is to locate and classify elements in text into entity categories such as "persons" and "organizations". It is an essential first step for downstream language understanding applications. The task assigns to each word a named entity tag in an "X-Y" format, where X is one of BIEOS (Beginning, Inside, End, Outside, and Singleton) and Y is the entity category. A valid tag sequence has to follow certain constraints by the definition of the tagging scheme.
[Figure 3: The architecture of the bidirectional LSTM recurrent network for NER. The CNN for extracting character representation is omitted.]
The tag-transition rules are written as, e.g.:

$$\text{equal}(y_{i-1}, \text{B-PER}) \Rightarrow \neg\,\text{equal}(y_i, \text{I-ORG}) \tag{6}$$

(and similarly for the other invalid transitions). The confidence levels are set to ∞ to prevent any violation.
The paper further leverages the list structures within and across sentences of the same documents. Specifically, named entities at corresponding positions in a list are likely to be in the same categories. For instance, in "1. Juventus, 2. Barcelona, 3. ..." we know "Barcelona" must be an organization rather than a location, since its counterpart entity "Juventus" is an organization. A simple procedure for identifying lists and counterparts is described in the supplementary materials. The logic rule is encoded as:

$$\text{is-counterpart}(X, A) \Rightarrow 1 - \big\| c(e_y) - c(\sigma_\theta(A)) \big\|_2 \tag{7}$$

where e_y is the one-hot encoding of y (the class prediction of X), and c(·) collapses the probability mass on the labels of the same category into a single probability.
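A sketch of how Eq. (7) could be evaluated (the tag inventory and helper names are my assumptions). Note that, as written, the value can dip below 0 for a hard category mismatch, which penalizes it strongly in Eq. (4):

```python
import numpy as np

TAGS = [f"{p}-{c}" for c in ("PER", "ORG", "LOC", "MISC") for p in "BIES"] + ["O"]
CATS = ("PER", "ORG", "LOC", "MISC", "O")

def collapse(dist):
    """c(.): sum the probability mass of tags sharing an entity category."""
    out = np.zeros(len(CATS))
    for p, tag in zip(dist, TAGS):
        out[CATS.index(tag.split("-")[-1])] += p
    return out

def rule7_truth(y_pred_tag, sigma_A):
    """1 - ||c(e_y) - c(sigma_theta(A))||_2 for counterpart entities."""
    e_y = np.zeros(len(TAGS))
    e_y[TAGS.index(y_pred_tag)] = 1.0          # one-hot prediction for X
    return 1.0 - np.linalg.norm(collapse(e_y) - collapse(sigma_A))

# Counterpart "Juventus" confidently tagged ORG pushes "Barcelona" toward ORG.
sigma_juventus = np.zeros(len(TAGS))
sigma_juventus[TAGS.index("S-ORG")] = 1.0
print(rule7_truth("S-ORG", sigma_juventus))    # 1.0: categories agree
print(rule7_truth("S-LOC", sigma_juventus))    # 1 - sqrt(2): hard mismatch
```

Because c(·) collapses the BIOES positions, the rule only ties the entity *categories* of counterparts, not their exact tags.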
16. Results (Named Entity Recognition)
• Performance improves over the baseline
• Comparable to a method using large external resources [Luo+ 15] and to a neural network with more parameters [Ma and Hovy, 16]
Model F1
1 BLSTM 89.55
2 BLSTM-Rule-trans p: 89.80, q: 91.11
3 BLSTM-Rules p: 89.93, q: 91.18
4 NN-lex (Collobert et al., 2011) 89.59
5 S-LSTM (Lample et al., 2016) 90.33
6 BLSTM-lex (Chiu and Nichols, 2015) 90.77
7 BLSTM-CRF1 (Lample et al., 2016) 90.94
8 Joint-NER-EL (Luo et al., 2015) 91.20
9 BLSTM-CRF2 (Ma and Hovy, 2016) 91.21
Table 4: Performance of NER on CoNLL-2003. Row 2, BLSTM-Rule-trans imposes the transition rules (Eq. (6)) on the base BLSTM. Row 3, BLSTM-Rules further incorporates the list rule (Eq. (7)). The performance of both the student model p and the teacher model q is reported.
• BLSTM-Rule-trans: without the list constraint
• BLSTM-Rules: with the list constraint