2. Speaker
Lecturer: Lun-Wei Ku
Currently: Assistant Research Fellow, IIS, Academia Sinica
Adjunct Assistant Professor, NCTU
• Working on NLP and Sentiment Analysis
• Running NLPSA Lab
– http://www.lunweiku.com/
– http://academiasinicanlplab.github.io/
• Ongoing Projects:
– Graph Embedding, Emotion Enabled Dialog System, Cross-lingual Text
Suggestion, Proactive Dialog Generation from Images and Texts
2
6. What is Natural Language
Processing?
• Natural language processing (NLP) is a field of computer
science, artificial intelligence and computational
linguistics concerned with the interactions
between computers and human (natural) languages, and, in
particular, concerned with programming computers to
fruitfully process large natural language corpora. Challenges in
natural language processing frequently involve natural
language understanding, natural language
generation (frequently from formal, machine-readable logical
forms), connecting language and machine perception, dialog
systems, or some combination thereof. (Wikipedia)
page 6
14. Applications (4)
• Problem Solver
– Math solver: https://www.cymath.com/
Step by step, NLP + others (graph, formula, …)
page 14
15. Applications (5)
• AI doctor
– IBM Watson Health
• Optimize performance
• Engage consumers
• Enable effective care
• Manage population health
– Why is AI doctor related
to NLP?
• MedNLP: medical records, communication…
page 15
16. Applications (6)
• Summarization
– Best demonstration: 谷阿莫 (Gu A-Mo) *blog.investis.com
• Sentiment/Opinion/Review
• Social Media/Network application
– Full of texts!
*techxb.com
page 16
19. Other Close Disciplines
• Artificial Intelligence (AI)
• Information Retrieval (IR)
• Machine Learning (ML)
• Human Computer Interaction (HCI)
page 19
20. NLP and AI
• NLP handles the input and output of
unstructured information for AI applications.
• AI applications are expected to write and speak
like people.
• NLP is becoming more and more important in AI.
• However, NLP is challenging.
page 20
21. NLP and IR
• NLP borrows some concepts from IR,
especially word-weighting schemes.
• For IR, efficiency is very important. Some
time-limited NLP tasks also incorporate
ideas from IR to save time, e.g., clustering and
offline preprocessing.
page 21
22. NLP and ML
• In the past, NLP techniques relied heavily on
linguistic knowledge in the form of rules or
probabilities.
• Nowadays, NLP uses many ML/DL
techniques.
page 22
23. NLP and HCI
• Language (written or spoken) is a way for
computers to communicate with people.
• Representing information and using it in
an appropriate way can mitigate the errors
people may perceive.
• NLP + HCI may lead to killer apps.
page 23
28. Wrap Up -1
• What is NLP?
• What applications are related to NLP?
• NLP and NLU
• What are the current challenges?
• Next, let’s go ahead to NLP!
– about introducing the concept and trying the tools
online (if available)
page 28
31. Natural Language Processing
• Basic Functions
– (Word Segmentation)
– Part of Speech Tagging
– (Stemming)
– Named Entity Extraction
– (Syntactic) Parsing
– Coreference resolution
– Text Categorization
page 31
32. Word Segmentation
• Some written languages have no explicit word
boundary markers, such as Chinese or
Japanese.
• If words are to be the basic units for text
processing, we need to know the boundaries.
• 下雨天留客天留我不留 (a classic Chinese string whose meaning flips with segmentation)
• 私は自然言語処理を好む (Japanese: "I like natural language processing")
• أنا أفضل معالجة اللغة الطبيعية (Arabic: "I prefer natural language processing")
page 32
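Forward maximum matching is one classic dictionary-based way to find such boundaries. The sketch below is a minimal illustration only, and its tiny vocabulary is invented for the slide's example; real segmenters such as Jieba or CKIP combine large lexicons with statistics.

```python
def max_match(text, vocab, max_len=4):
    """Greedy forward maximum matching: at each position, take the longest
    vocabulary entry that matches, falling back to a single character."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# Toy vocabulary for the slide's ambiguous example (illustrative only)
vocab = {"下雨", "下雨天", "留客", "不留", "我", "天", "留"}
print(max_match("下雨天留客天留我不留", vocab))
```

With a different vocabulary (say, one containing 天留 but not 下雨天), the greedy match segments the same string differently, which is exactly the ambiguity the slide's example plays on.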
33. Stemmer (English)
• The process of reducing inflected (or sometimes
derived) words to their word stem, base or root
form—generally a written word form*
*wikipedia
Stemming:
I love natural language processing. → I love natur languag process .
page 33
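A real Porter stemmer (available in NLTK as `nltk.stem.PorterStemmer`) applies ordered rewrite rules; the toy suffix stripper below only hints at the idea and does not reproduce Porter's output exactly.

```python
def naive_stem(word):
    """Toy suffix stripping (NOT the Porter algorithm): remove the first
    matching suffix, longest first, if a stem of >= 3 letters remains."""
    for suffix in ("ational", "ing", "ed", "al", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in "I love natural language processing".split()])
```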
34. TF‧IDF (1)
• A weighting scheme used a lot in IR
• term frequency × inverse document frequency
• Calculates the weight of each term (usually
words) in a dataset
• An example of how to represent documents
page 34
Term | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
Antony | 5.25 | 3.18 | 0 | 0 | 0 | 0.35
Brutus | 1.21 | 6.1 | 0 | 1 | 0 | 0
Caesar | 8.59 | 2.54 | 0 | 1.51 | 0.25 | 0
Calpurnia | 0 | 1.54 | 0 | 0 | 0 | 0
Cleopatra | 2.85 | 0 | 0 | 0 | 0 | 0
mercy | 1.51 | 0 | 1.9 | 0.12 | 5.25 | 0.88
worser | 1.37 | 0 | 0.11 | 4.15 | 0.25 | 1.95
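Weights like those in the table come from tf × idf. Below is a minimal from-scratch sketch using raw counts and idf = log(N/df); note that real implementations such as scikit-learn's `TfidfVectorizer` use smoothed variants and normalization, so their numbers differ.

```python
import math

def tfidf(docs):
    """docs: list of token lists. Returns one {term: tf*idf} dict per doc,
    with tf = raw count in the doc and idf = log(N / document frequency)."""
    n_docs = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    out = []
    for doc in docs:
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        out.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return out

docs = [["brutus", "caesar", "caesar"], ["caesar", "mercy"], ["mercy", "worser"]]
weights = tfidf(docs)  # "caesar" occurs in 2 of 3 docs, so its idf is log(3/2)
```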
36. Bag of Words Model
• Often abbreviated as “BOW”
• Words are used as features
WITHOUT their order.
• 我給你一百萬 = 你給我一百萬 ("I give you a million" = "you give me a million": word order is discarded)
• Usually working with
N-gram features
我 給 你 一 百 萬 我給 給你 你一 一百
百萬 我給你 給你一 你一百 一百萬
page 36
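A sketch of the uni-/bi-gram feature extraction illustrated above; it shows why the two 一百萬 sentences collide on unigrams but are separated once bigrams are added.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bow_features(tokens, max_n=2):
    """Bag-of-words counts over uni- up to max_n-grams; the bag discards
    global order, but each n-gram itself preserves local order."""
    feats = {}
    for n in range(1, max_n + 1):
        for gram in ngrams(tokens, n):
            feats[gram] = feats.get(gram, 0) + 1
    return feats

a = bow_features(list("我給你一百萬"))
b = bow_features(list("你給我一百萬"))
```

The unigram parts of `a` and `b` are identical, while bigrams such as (我, 給) vs. (你, 給) tell the two sentences apart.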
37. TFIDF + BOW (uni- & bi-grams) is still the state of the
art for some tasks today, or at least comes very close to
state-of-the-art performance; it is a very strong baseline.
page 37
38. • So far we have word-level information.
• Next, we start to add more information on
words and further to larger segments.
page 38
39. Part of Speech (POS) Tagging
• I love natural language processing.
• (PRP I) (VBP love) (JJ natural) (NN language) (NN processing) (. .)
PRP: personal pronoun; VBP: verb, non-3rd person singular present;
JJ: adjective; NN: noun, singular or mass
Tags may vary across tagging sets; here, the
Penn Treebank tag set is used.
page 39
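As a sketch only: the simplest possible tagger assigns each word one hard-coded most-likely tag (the mini-lexicon below is invented for the slide's sentence). Real taggers such as NLTK's `nltk.pos_tag` use context and learned weights.

```python
# Minimal lookup tagger over Penn Treebank tags; unknown words default to NN.
LEXICON = {"i": "PRP", "love": "VBP", "natural": "JJ",
           "language": "NN", "processing": "NN", ".": "."}

def tag(tokens):
    return [(tok, LEXICON.get(tok.lower(), "NN")) for tok in tokens]

print(tag(["I", "love", "natural", "language", "processing", "."]))
```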
42. Semantic Role Labeling
Labels the semantic role each word (or phrase)
plays in a sentence.
https://www.slideshare.net/marinasantini1/semantic-role-labeling
page 42
49. • For the traditional Chinese text environment…
NLP Tools Comparison
page 49
Tools compared: Stanford CoreNLP, Jieba, CKIP
Dimensions: language support, ease of use, domain adaptation, performance, price
[Ratings for each tool appear as a table on the original slide]
50. Using Tools (3)
• NLTK (python): tokenize, tag, NE extraction,
show parsing trees
– Porter stemmer
– n-grams
• TF-IDF is not in NLTK; use scikit-learn
(machine learning in Python).
page 50
53. Word Embeddings (2)
Pre-trained, or train your own!
• word2vec
• GloVe
What if I don't know deep learning? (我不會deep learning怎麼辦?)
You can find a variety of pre-trained embeddings on the Web.
[Check here!]
page 53
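Once you have pre-trained vectors, most uses boil down to vector arithmetic such as cosine similarity. The 3-dimensional vectors below are made up purely for illustration; real word2vec/GloVe vectors have hundreds of dimensions and are loaded from a pre-trained file.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

emb = {"king": [0.9, 0.1, 0.2],       # toy vectors, not real embeddings
       "queen": [0.85, 0.15, 0.25],
       "banana": [0.0, 0.9, 0.1]}
assert cosine(emb["king"], emb["queen"]) > cosine(emb["king"], emb["banana"])
```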
61. Differences
• Web or social texts are in a written form of the
spoken language.
– New words
– Typos
– Urban language
– Cyber language
– Abbreviations
– A lot of (homo)phonic/semantic puns (諧音、雙關語)
– Foreign languages (激安殿堂, 牛逼)
page 61
64. More Preprocessing Needed
• Need to filter out noisy text and find the main
content.
– Ad text
– Formatting text
• Need to split the text into sentences before
sending them into the parser.
page 64
11 December 2016
65. Skills We Might Need
• Text Normalization
• Multimedia multimodal
• User and Text Networking
• Social Network
page 65
67. Text Normalization
• Normalization converts text written in web
language into formal language before further processing.
• 私心喜翻的日式簡約風 → 私心喜歡的日式簡約風 (喜翻 is a phonetic spelling of 喜歡, "like")
• 想一起去ㄉ水水們 → 想一起去的漂亮女生們 (ㄉ → 的; 水水們 → 漂亮女生們, "pretty girls")
• 漂漂是今年才成為麻麻 → 漂漂是今年才成為媽媽 (麻麻 → 媽媽, "mom")
page 67
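The examples above can be handled by a simple substitution lexicon. This sketch covers only the slide's three pairs; a real normalizer needs a much larger lexicon plus context to resolve ambiguous variants.

```python
# Web-language variant -> formal form, from the slide's examples only.
WEB2FORMAL = {"喜翻": "喜歡", "ㄉ": "的", "水水們": "漂亮女生們", "麻麻": "媽媽"}

def normalize(text):
    """Replace every known web-language variant with its formal form."""
    for web, formal in WEB2FORMAL.items():
        text = text.replace(web, formal)
    return text

assert normalize("想一起去ㄉ水水們") == "想一起去的漂亮女生們"
```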
70. Or, A Parser for Web Text
• Tweet POS Tagger/Parser like: ARK
• Train with web texts to capture their characteristics.
ikr smh he asked fir yo last name
so he can add u on fb lololol
• Unfortunately, so far we don’t have any for the
Chinese language.
page 70
71. Skills We Might Need
• Text Normalization
• Multimedia multimodal
• User and Text Networking
• Social Network
page 71
79. Sentiment Analysis Is…
• Studying opinions, sentiments, subjectivity,
affect, emotions, views, etc. in text such as
news, blogs, reviews, comments, dialogs, or
other kinds of documents.
• An important research question:
– Sentiment information is global and powerful.
– Sentiment information is valuable for companies,
customers and personal communication.
page 79
82. CSentiPackage
• Datasets
– Chinese Morphological Dataset Cmorph (former
version of ACBiMA)*
– Chinese Opinion Treebank
• Resources
– NTUSD/ANTUSD
• Tools
– CopeOpi + Tag Mapping File
– UTCNN
*https://github.com/windx0303/ACBiMA
page 82
83. Statistics
• NTUSD: Sentiment Dictionary (with 10,371
words): free for research, 400+ applications
• ANTUSD: Augmented NTUSD (with 27,221
words, now integrating with e-Hownet)
• Cmorph (with 8,000+ words) -> ACBiMA
(with 11,000+ words)
• Chinese Opinion Treebank: labels on Chinese
Treebank 5.1
page 83
84. Materials:
From Words to Sentences
• NTUSD: words (binary sentiment)
• ANTUSD: words (annotation features)
• Chinese Morphological Dataset: words
(morphological structures)
• Chinese Opinion Treebank: phrases (sentence
structure)
• Chinese Opinion Treebank: sentences (binary
sentiment)
page 84
85. Tools:
From Words to Sentences,
Documents, and Beyond
• CopeOpi Sentiment Scoring Tool: words,
sentences, documents, documents+ (text)
• UTCNN: posts and users (text and social
media)
page 85
86. NTUSD
• Simplified Chinese and traditional Chinese
versions
• A positive word collection of 2,812 words
• A negative word collection of 8,276 words
• No degree, no estimated scores and other
information.
page 86
87. ANTUSD
• 6 Fields
– CopeOpi Score
– Number of positive annotations
– Number of neutral annotations
– Number of negative annotations
– Number of non-sentiment annotations
– Number of not-a-word annotations
• Not-a-word entries are useful, as they are collected from
real segmented data
開心 0.434168 1 0 0 0 0
酣聲 0 0 0 1 3 0
憤怒 -0.80011 0 0 5 0 0
page 87
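Given the field layout above, an ANTUSD-style line can be parsed as follows. This is a sketch that assumes whitespace-separated columns in the order shown on the slide; the released file's exact delimiter may differ.

```python
def parse_antusd(line):
    """Split a line into: word, CopeOpi score, and the five annotation
    counts (positive, neutral, negative, non-sentiment, not-a-word)."""
    parts = line.split()
    pos, neu, neg, non, naw = (int(x) for x in parts[2:7])
    return {"word": parts[0], "score": float(parts[1]), "pos": pos,
            "neu": neu, "neg": neg, "non_sentiment": non, "not_a_word": naw}

entry = parse_antusd("憤怒 -0.80011 0 0 5 0 0")
assert entry["score"] < 0 and entry["neg"] == 5
```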
89. ANTUSD and E-HOWNET
• An integration of two resources which may help us play with
sentiment and semantics.
• Related English resource: SentiWordNet
– Built on WordNet
– With PosScore and NegScore added
– ObjScore = 1 − (PosScore + NegScore)
E-HowNet
– A frame-based entity-relation model extended from HowNet
– Defines lexical senses (concepts) in a hierarchical manner
– Now integrated with ANTUSD; covers 47.7% of the words in ANTUSD
page 89
93. Chinese Opinion Treebank
• Based on Chinese Treebank 5.1.
• Includes the opinion label of each sentence.
• Includes the word pairs and their composition
types in opinionated sentences.
• To avoid copyright issues, you need to obtain
Chinese Treebank 5.1 yourself in order to
use the Chinese Opinion Treebank!
page 93
95. Notation (Parsing Tree)
• T: the parsing tree of a sentence S
• O = {o1, o2, …}: in-order set of tree nodes
• tri_i = ⟨o_p, o_l, o_r, R_pt⟩: an opinion trio, i.e., a parent
node, its two child nodes, and their syntactic inter-word relation
• R_pt ∈ {Substantive-Modifier, Subjective-Predicate,
Verb-Object, Verb-Complement, Other}
Tri(S) =
1: ⟨IP, 活动, VP⟩, Subjective-Predicate
2: ⟨VP, 取得, NP-OBJ⟩, Verb-Object
3: ⟨NP-OBJ, 圆满, 成功⟩, Substantive-Modifier
page 95
96. Chinese Opinion Treebank
• Align the opinion labels of sentences to
Chinese Treebank 5.1 by sentence IDs.
• Align Opinion trios to Chinese Treebank 5.1
by node IDs.
• Can be used to do opinion cause analysis.
page 96
97. CopeOpi
• A statistical sentiment analysis tool
• Can be used without any training
• Users can update character weights or add any
sentiment words
• It runs fast.
page 97
98. The First Idea
• Chinese characters are mostly morphemes and they
bear sentiment, too.
• Simple example: some characters are preferred for
naming, but some are not.
• For example, 德(ethic) 胜(win) 高(high) good for
names; 笨(stupid) 悲(sorrow) 惨(terrible) are not
good choices for names.
• There are some exceptions, but it is still quite reliable if the
sentiment of a character is acquired statistically from a
large naming corpus (or just from sentiment dictionaries).
One exception: 徐悲鸿 (the painter Xu Beihong, whose given name contains 悲, "sorrow").
page 98
99. [仇 (-1.0) + 視 (0.0)] / 2 = -1/2 = -0.5 (NEG)
[富(1.0) + 貴(0.936)] / 2 = 0.968 (POS)
好人、美麗、憤怒、弱小…
Bag of Unit: a character's sentiment is estimated from its frequencies in positive and negative words:

P(c_i) = (fp_{c_i} / Σ_{j=1..n} fp_{c_j}) / (fp_{c_i} / Σ_{j=1..n} fp_{c_j} + fn_{c_i} / Σ_{j=1..m} fn_{c_j})
N(c_i) = (fn_{c_i} / Σ_{j=1..m} fn_{c_j}) / (fp_{c_i} / Σ_{j=1..n} fp_{c_j} + fn_{c_i} / Σ_{j=1..m} fn_{c_j})
S(c_i) = P(c_i) − N(c_i)
S_w = (1/p) Σ_{j=1..p} S(c_j)

where fp_{c_i} / fn_{c_i} are the frequencies of character c_i in positive/negative words, n and m are the numbers of character occurrences in the positive and negative word lists, and p is the number of characters in word w.
page 99
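The character-scoring scheme above can be sketched in a few lines. The two tiny word lists here are invented for illustration; CopeOpi derives its statistics from full sentiment dictionaries.

```python
POS_WORDS = ["富貴", "富有", "開心"]  # toy positive lexicon (illustrative)
NEG_WORDS = ["仇恨", "仇視", "憤怒"]  # toy negative lexicon (illustrative)

def char_score(c):
    """S(c) = P(c) - N(c), with P and N from the character's relative
    frequency in positive vs. negative words, as in the formulas above."""
    fp = sum(w.count(c) for w in POS_WORDS)
    fn = sum(w.count(c) for w in NEG_WORDS)
    if fp == 0 and fn == 0:
        return 0.0
    p_rel = fp / sum(len(w) for w in POS_WORDS)  # fp_ci / sum_j fp_cj
    n_rel = fn / sum(len(w) for w in NEG_WORDS)  # fn_ci / sum_j fn_cj
    return p_rel / (p_rel + n_rel) - n_rel / (p_rel + n_rel)  # P - N

def word_score(word):
    """S_w: average of the word's character scores."""
    return sum(char_score(c) for c in word) / len(word)

assert char_score("富") > 0 > char_score("仇")
```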
100. Aggregation
• Word sentiment
– Summing up opinion scores of characters
• Sentence sentiment
– Summing up opinion scores of words
So is there any way we can give them weights?
page 100
101. Weighted by Structures
• Linguistic Information:
– Morphological structures
• Intra-word structures
– Sentence syntactic structures
• Inter-word structures
page 101
102. Morphological Structure
Linguistic Morpho. Type and Example:
1. Parallel 財富、打罵
2. Substantive-Modifier 低級、痛哭
3. Subjective-Predicate 心疼、氣虛
4. Verb-Object 失控、免職
5. Verb-Complement 看清、擊潰
Opinion Morpho. Type and Example:
6. Negation 無法、不慎
7. Confirmation 有賴、有愧
8. Others 姪子、薄荷
Get types by SVM, CRF, or handcrafted rules…
page 102
103. Example of Sentiment Trios in
Chinese Opinion Treebank
Linguistic Morpho. Type Example
Parallel (Skip) 美麗而聰慧
1. Substantive-Modifier 高大的樓房
2. Subjective-Predicate 學習認真
3. Verb-Object 恢復疲勞
4. Verb-Complement 收拾乾淨
Morpho. Type Opinion Example
n. Others 為…/以…
page 103
105. Example of Using Sentiment Trios
• Score: 0.6736
page 105
[The slide gives piecewise composition rules. Substantive-Modifier type: depending on the signs of S(C1) and S(C2), S(C1C2) is S(C1), −1 × S(C1), or S(C1) + S(C2). Verb-Object type: when S(C1) and S(C2) are both nonzero, S(C1C2) is computed from S(C1) together with SIGN(S(C1)) and SIGN(S(C2)); otherwise S(C1C2) = S(C1) + S(C2). In the worked example, the intermediate scores 0.3018 and 0.4109 combine into the final score 0.6736.]
106. Preprocessing
• Tokenize (segmentation)
– Jieba
– CKIP
– Stanford parser
• Part-of-speech tagging
– CKIP
– Stanford parser
Tokenization is mandatory; we will release an
"optional" version in the future.
page 106
107. CopeOpi – example
• $ ./run_trad.sh
– Runs CopeOpi on the files listed in "file.lst"
• Check the results in out/0001.txt
page 107
test_trad.txt 0001
109. Deep Neural Network Example
Word
• Morphological structure
for a better
word representation.
• Same idea but
for *Chinese sentiment
analysis*
• Luong, Thang, Richard Socher, and Christopher D. Manning. "Better Word Representations with Recursive Neural Networks
for Morphology." CoNLL. 2013.
page 109
110. Deep Neural Network Example
Sentence
• Learned composition function (of semantics): Richard Socher
(RNN, a series of works since 2011)
page 110
111. Learning by Neural Network
• Word Sentiment
• Sentence Sentiment
• Document Sentiment
• Social Media Post Sentiment
page 111
112. Learning by Deep Neural Network
• Word Sentiment: CNN + ANTUSD
• Sentence Sentiment
• Document Sentiment
• Social Media Post Sentiment: Text + User
Context
– Structures are not yet considered!
page 112
113. CSentiPackage: UTCNN
Learning by Deep Neural Network
• Word Sentiment: CNN + ANTUSD
• Sentence Sentiment
• Document Sentiment
• Social Media Post Sentiment: Text + User
Context
page 113
114. User Topic Comment Neural Network
(UTCNN)
• A deep learning model of stance classification
on social media text
page 114
[Figure: the deep learning model takes post content, comment content, authors, likers, commenters, and topics as inputs]
115. UTCNN
• Stance tendency
– Author
– Liker
– Topic
– Commenter
• Semantic preference
– Author
– Liker
– Topic
– Commenter
page 115
We should reject the re-construction of the Nuclear power plant. (post)
Great! (comment)
NO! …… (comment)
116. If you don’t know anything about deep learning
(again) …
– I won’t talk too much about it. No worries.
– You can take the courses organized by 臺灣資料科學協會
(Taiwan Data Science Association)
– Knowing that it’s a DNN Chinese sentiment model
for now is enough.
page 116
117. Social Media Dataset Released
in CSentiPackage
• Facebook fan groups (Chinese)
– Author/liker/comment/commenter
– Single topic (latent topics learned by LDA)
– Unbalanced
– Chinese
• Create Debate (English)
– Author
– Four topics
– Balanced
– English
page 117
118. Environment
• Software
– OS: Linux
– Programming language
• Java 6 or higher
• python 2.7
– Theano 0.8.2
– Keras 1.0.3
– sklearn
• Hardware
– Graphic cards (deep learning)
page 118
119. Demo Environment
• CPU
– Intel Xeon E5-2630 v3 ×2
• RAM
– 64 GB
• OS
– Ubuntu 14.04 LTS
• Graphic cards
– Nvidia Tesla K40 ×2
page 119
121. UTCNN - demo
page 121
http://doraemon.iis.sinica.edu.tw/wordforce/
122. UTCNN - demo
page 122
http://doraemon.iis.sinica.edu.tw/wordforce/
123. Something Important About
CSentiPackage
page 123
• The CSentiPackage you obtained is only for your group to
use, for research purposes.
• It has been officially released, so it can be
downloaded at any time.
• Download or check what’s new @
http://academiasinicanlplab.github.io/
• Find the tutorial materials of CSentiPackage @
http://www.lunweiku.com/
124. Skills We Might Need
• Text Normalization
• Multimedia multimodal
• User and Text Networking
• Social Network
page 124
125. NLP and Social Network
• NLP sometimes serves as the pre-processing of
the social network research to deal with
unstructured data.
• NLP in social media is sometimes referred to as
Social Media Analytics.
• NLP models can help find information such as
events, sentiment, named entities for social
network analysis
• The network analysis algorithm can benefit NLP
research by bringing in heterogeneous features.
page 125
126. Challenges
• Integrating features is not easy
• Integrating knowledge is not easy, either
• Data are big; performance and efficiency trade
off against each other.
• Social media are always changing and
different over generations.
• Visualizing both texts and the network is
challenging.
page 126
127. Wrap Up – Part III
• More context, more to know
• More context, better for guessing
• Inner context, outer context, inter context
• Pay more attention to the relations
page 127
130. Industrial Needs
• Techniques can make
money
• Techniques can provide
better services (then to
make money)
• Techniques can make
users engaged (then to
make money)
page 130
133. Ads (1)
• Google AdSense
– How AdSense works: website owners can monetize their
online content through Google AdSense. AdSense serves
text and multimedia ads that match your site's content and
visitors. These ads are created and paid for by advertisers
who want to promote their products; advertisers pay
different amounts for different ads, so the amount you
earn varies.
• Ad market share: Google + FB take about 90%
• But there is very little you can do (with NLP).
page 133
135. Recommendation (產品推薦)
• Content-based
• Collaborative filtering
• User behavior
NLP techniques are needed mostly for content-based
recommendation (items on e-commerce websites).
page 135
136. • User behavior can be related to unstructured
data.
page 136
147. Chatbot Challenges
• It is difficult to cross domains.
• It needs very big data.
• It is challenging to connect to background
knowledge.
However, a chatbot performs satisfactorily as a
small, limited bot. Many Facebook stores use
this kind of chatbot to sell things and provide
services.
page 147
149. Final Wrap Up
• You now know what NLP is
• You have seen the major NLP tools
• You have heard about the cool things NLP can do
• Start NLP today!
page 149