12. 权重计算方法
布尔权重(boolean weighting)
– aij=1(TFij>0) or (TFij=0)0
TFIDF型权重
– TF: aij=TFij
– TF*IDF: aij=TFij*log(N/DFi)
– TFC: 对上面进行归一化
– LTC: 降低TF的作用
基于熵概念的权重(Entropy weighting)
– 称为term i的某种熵
– 如果term分布极度均匀:熵等于-1
– 只在一个文档中出现:熵等于0
k
kkj
iij
ij
DFNTF
DFNTF
a
2
)]/log(*[
)/log(*
k
kkj
iij
ij
DFNTF
DFNTF
a
2
)]/log(*)0.1[log(
)/log(*)0.1log(
))]log([
log
1
1*)0.1log(
1
N
j i
ij
i
ij
ijij
DF
TF
DF
TF
N
TFa
13. 特征选择(1)
基于DF
– Term的DF小于某个阈值去掉(太少,没有代表性)
– Term的DF大于某个阈值也去掉(太多,没有区分度)
信息增益(Information Gain, IG):该term为整个
分类所能提供的信息量(不考虑任何特征的熵和
考虑该特征后的熵的差值)
)}]|(log)|(){(
)}|(log)|(){([
)}(log)({
)Entropy(Expected)(EntropyGain(t)
1
1
1
tcPtcPtP
tcPtcPtP
cPcP
SS
i
M
i i
i
M
i i
i
M
i i
t
22. Rocchio方法
可以认为类中心向量法是它的特例
– Rocchio公式
– 分类
C
Ci ij
C
Ci ij
jcjc
nn
x
n
x
ww
'
类C中心向量的权重 训练样本中正例个数 文档向量的权重
22c xw)(
ijcj
ijcj
iic
xw
xw
dCSV
23. Naïve Bayes
)()|(
)(
)()|(
)|( jji
i
jji
ij cPcdP
dP
cPcdP
dcP
)()|(
)(
)()|(
)|( jji
i
jji
ij cPcdP
dP
cPcdP
dcP
,独立性假设
r
k
jikji cwPcdP
1
)|()|(
参数计算
Bayes公式
1
)(||
)(1
)(
)(
)(
k
k
j
k
k
jj
j
cNc
cN
cN
cNc
cP
总文档个数
的文档个数
k
kj
ij
j
ji
ji
N
N
c
cw
cwP
不同词个数的次数类所有文档中出现的词在
类别文档中出现的次数在 1
)|(
32. 分类方法的评估
邻接表
每个类
– Precision=a/(a+b), Recall=a/(a+c), fallout=b/(b+d)=false alarm rate,
accuracy=(a+d)/(a+b+c+d), error=(b+c)/(a+b+c+d)=1-accuracy, miss
rate=1-recall
– F=(β2+1)p.r/(β2p+r)
– Break Even Point, BEP, p=r的点
– 如果多类排序输出,采用interpolated 11 point average precision
所有类:
– 宏平均:对每个类求值,然后平均
– 微平均:将所有文档一块儿计算,求值
真正对的 错误
标YES a b
标NO c d
33. 其他分类方法
Regression based on Least Squares Fit (1991)
Nearest Neighbor Classification (1992) *
Bayesian Probabilistic Models (1992) *
Symbolic Rule Induction (1994)
Decision Tree (1994) *
Neural Networks (1995)
Rocchio approach (traditional IR, 1996) *
Support Vector Machines (1997)
Boosting or Bagging (1997)*
Hierarchical Language Modeling (1998)
First-Order-Logic Rule Induction (1999)
Maximum Entropy (1999)
Hidden Markov Models (1999)
Error-Correcting Output Coding (1999)
...
35. 文献及其他资源
Papers
– K. Aas and L. Eikvil. Text categorisation: A survey. Technical report,
Norwegian Computing Center, June 1999
http://citeseer.nj.nec.com/aas99text.html
– Xiaomeng Su, “Text categorization”,Lesson Presentation
– Yiming Yang and Xin Liu. 1999. "A re-examination of text categorization
methods." 22ndAnnual International SIGIR
http://www.cs.cmu.edu/~yiming/publications.html
– A Survey on Text Categorization, NLP Lab, Korean U.
– 庞剑峰,基于向量空间模型的自反馈的文本分类系统的研究与实现,中科院计
算所硕士论文,2001
– 黄萱菁等,独立于语种的文本分类方法,中文信息学报,2000年第6期
Software:
– Rainbow http://www-2.cs.cmu.edu/~mccallum/bow/
– BoosTexter http://www.research.att.com/~schapire/BoosTexter/
– TiMBL http://ilk.kub.nl/software.html#timbl
– C4.5 http://www.cs.uregina.ca/~dbd/cs831/notes/ml/dtrees/c4.5/tutorial.html
Corpus
– http://www.cs.cmu.edu/~textlearning