2. Speaker
Lecturer: Lun-Wei Ku
Currently: Assistant Research Fellow, IIS, Academia Sinica
Adjunct Assistant Professor, NCTU
• Working on NLP and Sentiment Analysis
• Running NLPSA Lab
– http://www.lunweiku.com/
– http://academiasinicanlplab.github.io/
• Ongoing Projects:
– Graph Embedding, Emotion Enabled Dialog System, Cross-lingual Text
Suggestion, Proactive Dialog Generation from Images and Texts
2
6. What is Natural Language
Processing?
• Natural language processing (NLP) is a field of computer
science, artificial intelligence and computational
linguistics concerned with the interactions
between computers and human (natural) languages, and, in
particular, concerned with programming computers to
fruitfully process large natural language corpora. Challenges in
natural language processing frequently involve natural
language understanding, natural language
generation (frequently from formal, machine-readable logical
forms), connecting language and machine perception, dialog
systems, or some combination thereof. (Wikipedia)
page 6
14. Applications (4)
• Problem Solver
– Math solver: https://www.cymath.com/
Step by step, NLP + others (graph, formula, …)
page 14
15. Applications (5)
• AI doctor
– IBM Watson Health
• Optimize performance
• Engage consumers
• Enable effective care
• Manage population health
– Why is AI doctor related
to NLP?
• MedNLP: medical records, communication…
page 15
16. Applications (6)
• Summarization
– Best demonstration: 谷阿莫 (Gu A-Mo) *blog.investis.com
• Sentiment/Opinion/Review
• Social Media/Network application
– Full of texts!
*techxb.com
page 16
19. Other Close Disciplines
• Artificial Intelligence (AI)
• Information Retrieval (IR)
• Machine Learning (ML)
• Human Computer Interaction (HCI)
page 19
20. NLP and AI
• NLP handles the input and output of
unstructured information for AI applications.
• AI applications are expected to write and speak
like people.
• NLP is becoming more and more important in AI.
• However, NLP is challenging.
page 20
21. NLP and IR
• NLP borrows some concepts from IR,
especially word-weighting schemes.
• For IR, efficiency is very important. Some
time-limited NLP tasks also incorporate
ideas from IR to save time, e.g., clustering and
offline preprocessing.
page 21
22. NLP and ML
• In the past, NLP techniques relied heavily on
linguistic knowledge in the form of rules or
probabilities.
• Nowadays, NLP uses many ML/DL
techniques.
page 22
23. NLP and HCI
• Language (written or spoken) is a way for
computers to communicate with people.
• Representing information and using it in
an appropriate way can mitigate the errors
people may perceive.
• NLP + HCI may lead to killer apps.
page 23
28. Wrap Up -1
• What is NLP?
• What applications are related to NLP?
• NLP and NLU
• What are the current challenges?
• Next, let’s go ahead to NLP!
– about introducing the concept and trying the tools
online (if available)
page 28
31. Natural Language Processing
• Basic Functions
– (Word Segmentation)
– Part of Speech Tagging
– (Stemming)
– Named Entity Extraction
– (Syntactic) Parsing
– Coreference resolution
– Text Categorization
page 31
32. Word Segmentation
• Some written languages have no explicit word
boundary markers, such as Chinese or
Japanese.
• If words are to be the basic units for text
processing, we need to know the boundaries.
• 下雨天留客天留我不留 (a classic Chinese string whose meaning flips with segmentation)
• 私は自然言語処理を好む (Japanese: "I like natural language processing")
• أنا أفضل معالجة اللغة الطبيعية (Arabic: "I prefer natural language processing")
page 32
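Forward maximum matching is one classic dictionary-based way to find such boundaries. The sketch below is a minimal illustration only, and its tiny vocabulary is invented for the slide's example; real segmenters such as Jieba or CKIP combine large lexicons with statistics.

```python
def max_match(text, vocab, max_len=4):
    """Greedy forward maximum matching: at each position, take the longest
    vocabulary entry that matches, falling back to a single character."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# Toy vocabulary for the slide's ambiguous example (illustrative only)
vocab = {"下雨", "下雨天", "留客", "不留", "我", "天", "留"}
print(max_match("下雨天留客天留我不留", vocab))
```

With a different vocabulary (say, one containing 天留 but not 下雨天), the greedy match segments the same string differently, which is exactly the ambiguity the slide's example plays on.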
33. Stemmer (English)
• The process of reducing inflected (or sometimes
derived) words to their word stem, base or root
form—generally a written word form*
*wikipedia
Stemming:
I love natural language processing. → I love natur languag process .
page 33
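A real Porter stemmer (available in NLTK as `nltk.stem.PorterStemmer`) applies ordered rewrite rules; the toy suffix stripper below only hints at the idea and does not reproduce Porter's output exactly.

```python
def naive_stem(word):
    """Toy suffix stripping (NOT the Porter algorithm): remove the first
    matching suffix, longest first, if a stem of >= 3 letters remains."""
    for suffix in ("ational", "ing", "ed", "al", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in "I love natural language processing".split()])
```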
34. TF‧IDF (1)
• A weighting scheme used a lot in IR
• term frequency × inverse document frequency
• Calculates the weight of each term (usually
words) in a dataset
• An example of how to represent documents
page 34
Term | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
Antony | 5.25 | 3.18 | 0 | 0 | 0 | 0.35
Brutus | 1.21 | 6.1 | 0 | 1 | 0 | 0
Caesar | 8.59 | 2.54 | 0 | 1.51 | 0.25 | 0
Calpurnia | 0 | 1.54 | 0 | 0 | 0 | 0
Cleopatra | 2.85 | 0 | 0 | 0 | 0 | 0
mercy | 1.51 | 0 | 1.9 | 0.12 | 5.25 | 0.88
worser | 1.37 | 0 | 0.11 | 4.15 | 0.25 | 1.95
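Weights like those in the table come from tf × idf. Below is a minimal from-scratch sketch using raw counts and idf = log(N/df); note that real implementations such as scikit-learn's `TfidfVectorizer` use smoothed variants and normalization, so their numbers differ.

```python
import math

def tfidf(docs):
    """docs: list of token lists. Returns one {term: tf*idf} dict per doc,
    with tf = raw count in the doc and idf = log(N / document frequency)."""
    n_docs = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    out = []
    for doc in docs:
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        out.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return out

docs = [["brutus", "caesar", "caesar"], ["caesar", "mercy"], ["mercy", "worser"]]
weights = tfidf(docs)  # "caesar" occurs in 2 of 3 docs, so its idf is log(3/2)
```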
36. Bag of Words Model
• Often abbreviated as “BOW”
• Words are used as features
WITHOUT their order.
• 我給你一百萬 = 你給我一百萬 ("I give you a million" = "you give me a million": word order is discarded)
• Usually working with
N-gram features
我 給 你 一 百 萬 我給 給你 你一 一百
百萬 我給你 給你一 你一百 一百萬
page 36
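A sketch of the uni-/bi-gram feature extraction illustrated above; it shows why the two 一百萬 sentences collide on unigrams but are separated once bigrams are added.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bow_features(tokens, max_n=2):
    """Bag-of-words counts over uni- up to max_n-grams; the bag discards
    global order, but each n-gram itself preserves local order."""
    feats = {}
    for n in range(1, max_n + 1):
        for gram in ngrams(tokens, n):
            feats[gram] = feats.get(gram, 0) + 1
    return feats

a = bow_features(list("我給你一百萬"))
b = bow_features(list("你給我一百萬"))
```

The unigram parts of `a` and `b` are identical, while bigrams such as (我, 給) vs. (你, 給) tell the two sentences apart.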
37. TFIDF + BOW (uni- & bi-grams) is still the state of the
art for some tasks today, or at least comes very close to
state-of-the-art performance; it is a very strong baseline.
page 37
38. • So far we have word-level information.
• Next, we start to add more information on
words and further to larger segments.
page 38
39. Part of Speech (POS) Tagging
• I love natural language processing.
• (PRP I) (VBP love) (JJ natural) (NN language) (NN processing) (. .)
PRP: personal pronoun; VBP: verb, non-3rd person singular present;
JJ: adjective; NN: noun, singular or mass
Tags may vary across tagging sets; here, the
Penn Treebank tag set is used.
page 39
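As a sketch only: the simplest possible tagger assigns each word one hard-coded most-likely tag (the mini-lexicon below is invented for the slide's sentence). Real taggers such as NLTK's `nltk.pos_tag` use context and learned weights.

```python
# Minimal lookup tagger over Penn Treebank tags; unknown words default to NN.
LEXICON = {"i": "PRP", "love": "VBP", "natural": "JJ",
           "language": "NN", "processing": "NN", ".": "."}

def tag(tokens):
    return [(tok, LEXICON.get(tok.lower(), "NN")) for tok in tokens]

print(tag(["I", "love", "natural", "language", "processing", "."]))
```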
42. Semantic Role Labeling
Labels the semantic role each word (or phrase)
plays in a sentence.
https://www.slideshare.net/marinasantini1/semantic-role-labeling
page 42
49. • For the traditional Chinese text environment…
NLP Tools Comparison
page 49
Tools compared: Stanford CoreNLP, Jieba, CKIP
Dimensions: language support, ease of use, domain adaptation, performance, price
[Ratings for each tool appear as a table on the original slide]
50. Using Tools (3)
• NLTK (python): tokenize, tag, NE extraction,
show parsing trees
– Porter stemmer
– n-grams
• TF-IDF is not in NLTK; use scikit-learn
(machine learning in Python).
page 50
53. Word Embeddings (2)
Pre-trained, or train your own!
• word2vec
• GloVe
What if I don't know deep learning? (我不會deep learning怎麼辦?)
You can find a variety of pre-trained embeddings on the Web.
[Check here!]
page 53
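Once you have pre-trained vectors, most uses boil down to vector arithmetic such as cosine similarity. The 3-dimensional vectors below are made up purely for illustration; real word2vec/GloVe vectors have hundreds of dimensions and are loaded from a pre-trained file.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

emb = {"king": [0.9, 0.1, 0.2],       # toy vectors, not real embeddings
       "queen": [0.85, 0.15, 0.25],
       "banana": [0.0, 0.9, 0.1]}
assert cosine(emb["king"], emb["queen"]) > cosine(emb["king"], emb["banana"])
```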
61. Differences
• Web or social texts are in a written form of the
spoken language.
– New words
– Typos
– Urban language
– Cyber language
– Abbreviations
– A lot of (homo)phonic/semantic puns (諧音、雙關語)
– Foreign languages (激安殿堂, 牛逼)
page 61
64. More Preprocessing Needed
• Need to filter out noisy text and find the main
content.
– Ad text
– Formatting text
• Need to split the text into sentences before
sending them into the parser.
page 64
11 December 2016
65. Skills We Might Need
• Text Normalization
• Multimedia multimodal
• User and Text Networking
• Social Network
page 65
67. Text Normalization
• Normalization converts text written in web
language into formal language before further processing.
• 私心喜翻的日式簡約風 → 私心喜歡的日式簡約風 (喜翻 is a phonetic spelling of 喜歡, "like")
• 想一起去ㄉ水水們 → 想一起去的漂亮女生們 (ㄉ → 的; 水水們 → 漂亮女生們, "pretty girls")
• 漂漂是今年才成為麻麻 → 漂漂是今年才成為媽媽 (麻麻 → 媽媽, "mom")
page 67
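The examples above can be handled by a simple substitution lexicon. This sketch covers only the slide's three pairs; a real normalizer needs a much larger lexicon plus context to resolve ambiguous variants.

```python
# Web-language variant -> formal form, from the slide's examples only.
WEB2FORMAL = {"喜翻": "喜歡", "ㄉ": "的", "水水們": "漂亮女生們", "麻麻": "媽媽"}

def normalize(text):
    """Replace every known web-language variant with its formal form."""
    for web, formal in WEB2FORMAL.items():
        text = text.replace(web, formal)
    return text

assert normalize("想一起去ㄉ水水們") == "想一起去的漂亮女生們"
```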
70. Or, A Parser for Web Text
• Tweet POS Tagger/Parser like: ARK
• Train with web texts to capture their characteristics.
ikr smh he asked fir yo last name
so he can add u on fb lololol
• Unfortunately, so far we don’t have any for the
Chinese language.
page 70
71. Skills We Might Need
• Text Normalization
• Multimedia multimodal
• User and Text Networking
• Social Network
page 71
79. Sentiment Analysis Is…
• Studying opinions, sentiments, subjectivity,
affect, emotions, views, etc. in text such as
news, blogs, reviews, comments, dialogs, or
other kinds of documents.
• An important research question:
– Sentiment information is global and powerful.
– Sentiment information is valuable for companies,
customers and personal communication.
page 79
82. CSentiPackage
• Datasets
– Chinese Morphological Dataset Cmorph (former
version of ACBiMA)*
– Chinese Opinion Treebank
• Resources
– NTUSD/ANTUSD
• Tools
– CopeOpi + Tag Mapping File
– UTCNN
*https://github.com/windx0303/ACBiMA
page 82
83. Statistics
• NTUSD: Sentiment Dictionary (with 10,371
words): free for research, 400+ applications
• ANTUSD: Augmented NTUSD (with 27,221
words, now integrating with e-Hownet)
• Cmorph (with 8,000+ words) -> ACBiMA
(with 11,000+ words)
• Chinese Opinion Treebank: labels on Chinese
Treebank 5.1
page 83
84. Materials:
From Words to Sentences
• NTUSD: words (binary sentiment)
• ANTUSD: words (annotation features)
• Chinese Morphological Dataset: words
(morphological structures)
• Chinese Opinion Treebank: phrases (sentence
structure)
• Chinese Opinion Treebank: sentences (binary
sentiment)
page 84
85. Tools:
From Words to Sentences,
Documents, and Beyond
• CopeOpi Sentiment Scoring Tool: words,
sentences, documents, documents+ (text)
• UTCNN: posts and users (text and social
media)
page 85
86. NTUSD
• Simplified Chinese and traditional Chinese
versions
• A positive word collection of 2,812 words
• A negative word collection of 8,276 words
• No degree, no estimated scores and other
information.
page 86
87. ANTUSD
• 6 Fields
– CopeOpi Score
– Number of positive annotations
– Number of neutral annotations
– Number of negative annotations
– Number of non-sentiment annotations
– Number of not-a-word annotations
• Not-a-word entries are useful, as they are collected from
real segmented data
開心 0.434168 1 0 0 0 0
酣聲 0 0 0 1 3 0
憤怒 -0.80011 0 0 5 0 0
page 87
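Given the field layout above, an ANTUSD-style line can be parsed as follows. This is a sketch that assumes whitespace-separated columns in the order shown on the slide; the released file's exact delimiter may differ.

```python
def parse_antusd(line):
    """Split a line into: word, CopeOpi score, and the five annotation
    counts (positive, neutral, negative, non-sentiment, not-a-word)."""
    parts = line.split()
    pos, neu, neg, non, naw = (int(x) for x in parts[2:7])
    return {"word": parts[0], "score": float(parts[1]), "pos": pos,
            "neu": neu, "neg": neg, "non_sentiment": non, "not_a_word": naw}

entry = parse_antusd("憤怒 -0.80011 0 0 5 0 0")
assert entry["score"] < 0 and entry["neg"] == 5
```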
89. ANTUSD and E-HOWNET
• An integration of two resources which may help us play with
sentiment and semantics.
• Related English resource: SentiWordNet
– Built on WordNet
– With PosScore and NegScore added
– ObjScore = 1 − (PosScore + NegScore)
E-HowNet
– A frame-based entity-relation model extended from HowNet
– Defines lexical senses (concepts) in a hierarchical manner
– Now integrated with ANTUSD; covers 47.7% of the words in ANTUSD
page 89
93. Chinese Opinion Treebank
• Based on Chinese Treebank 5.1.
• Includes the opinion label of each sentence.
• Includes the word pairs and their composition
types in opinionated sentences.
• To avoid copyright issues, you need to obtain
Chinese Treebank 5.1 yourself in order to
use the Chinese Opinion Treebank!
page 93
95. Notation (Parsing Tree)
• T: the parsing tree of a sentence S
• O = {o1, o2, …}: in-order set of tree nodes
• tri_i = ⟨o_p, o_l, o_r, R_pt⟩: an opinion trio, i.e., a parent
node, its two child nodes, and their syntactic inter-word relation
• R_pt ∈ {Substantive-Modifier, Subjective-Predicate,
Verb-Object, Verb-Complement, Other}
Tri(S) =
1: ⟨IP, 活动, VP⟩, Subjective-Predicate
2: ⟨VP, 取得, NP-OBJ⟩, Verb-Object
3: ⟨NP-OBJ, 圆满, 成功⟩, Substantive-Modifier
page 95
96. Chinese Opinion Treebank
• Align the opinion labels of sentences to
Chinese Treebank 5.1 by sentence IDs.
• Align Opinion trios to Chinese Treebank 5.1
by node IDs.
• Can be used to do opinion cause analysis.
page 96
97. CopeOpi
• A statistical sentiment analysis tool
• Can be used without any training
• Users can update character weights or add any
sentiment words
• It runs fast.
page 97
98. The First Idea
• Chinese characters are mostly morphemes and they
bear sentiment, too.
• Simple example: some characters are preferred for
naming, but some are not.
• For example, 德(ethic) 胜(win) 高(high) good for
names; 笨(stupid) 悲(sorrow) 惨(terrible) are not
good choices for names.
• There are some exceptions, but it is still quite reliable if the
sentiment of a character is acquired statistically from a
large naming corpus (or just from sentiment dictionaries).
One exception: 徐悲鸿 (the painter Xu Beihong, whose given name contains 悲, "sorrow").
page 98
99. [仇 (-1.0) + 視 (0.0)] / 2 = -1/2 = -0.5 (NEG)
[富(1.0) + 貴(0.936)] / 2 = 0.968 (POS)
好人、美麗、憤怒、弱小…
Bag of Unit: a character's sentiment is estimated from its frequencies in positive and negative words:

P(c_i) = (fp_{c_i} / Σ_{j=1..n} fp_{c_j}) / (fp_{c_i} / Σ_{j=1..n} fp_{c_j} + fn_{c_i} / Σ_{j=1..m} fn_{c_j})
N(c_i) = (fn_{c_i} / Σ_{j=1..m} fn_{c_j}) / (fp_{c_i} / Σ_{j=1..n} fp_{c_j} + fn_{c_i} / Σ_{j=1..m} fn_{c_j})
S(c_i) = P(c_i) − N(c_i)
S_w = (1/p) Σ_{j=1..p} S(c_j)

where fp_{c_i} / fn_{c_i} are the frequencies of character c_i in positive/negative words, n and m are the numbers of character occurrences in the positive and negative word lists, and p is the number of characters in word w.
page 99
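The character-scoring scheme above can be sketched in a few lines. The two tiny word lists here are invented for illustration; CopeOpi derives its statistics from full sentiment dictionaries.

```python
POS_WORDS = ["富貴", "富有", "開心"]  # toy positive lexicon (illustrative)
NEG_WORDS = ["仇恨", "仇視", "憤怒"]  # toy negative lexicon (illustrative)

def char_score(c):
    """S(c) = P(c) - N(c), with P and N from the character's relative
    frequency in positive vs. negative words, as in the formulas above."""
    fp = sum(w.count(c) for w in POS_WORDS)
    fn = sum(w.count(c) for w in NEG_WORDS)
    if fp == 0 and fn == 0:
        return 0.0
    p_rel = fp / sum(len(w) for w in POS_WORDS)  # fp_ci / sum_j fp_cj
    n_rel = fn / sum(len(w) for w in NEG_WORDS)  # fn_ci / sum_j fn_cj
    return p_rel / (p_rel + n_rel) - n_rel / (p_rel + n_rel)  # P - N

def word_score(word):
    """S_w: average of the word's character scores."""
    return sum(char_score(c) for c in word) / len(word)

assert char_score("富") > 0 > char_score("仇")
```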
100. Aggregation
• Word sentiment
– Summing up opinion scores of characters
• Sentence sentiment
– Summing up opinion scores of words
So is there any way we can give them weights?
page 100
101. Weighted by Structures
• Linguistic Information:
– Morphological structures
• Intra-word structures
– Sentence syntactic structures
• Inter-word structures
page 101
102. Morphological Structure
Linguistic Morpho. Type and Example:
1. Parallel 財富、打罵
2. Substantive-Modifier 低級、痛哭
3. Subjective-Predicate 心疼、氣虛
4. Verb-Object 失控、免職
5. Verb-Complement 看清、擊潰
Opinion Morpho. Type and Example:
6. Negation 無法、不慎
7. Confirmation 有賴、有愧
8. Others 姪子、薄荷
Get types by SVM, CRF, or handcrafted rules…
page 102
103. Example of Sentiment Trios in
Chinese Opinion Treebank
Linguistic Morpho. Type Example
Parallel (Skip) 美麗而聰慧
1. Substantive-Modifier 高大的樓房
2. Subjective-Predicate 學習認真
3. Verb-Object 恢復疲勞
4. Verb-Complement 收拾乾淨
Morpho. Type Opinion Example
n. Others 為…/以…
page 103
105. Example of Using Sentiment Trios
• Score: 0.6736
page 105
[The slide gives piecewise composition rules. Substantive-Modifier type: depending on the signs of S(C1) and S(C2), S(C1C2) is S(C1), −1 × S(C1), or S(C1) + S(C2). Verb-Object type: when S(C1) and S(C2) are both nonzero, S(C1C2) is computed from S(C1) together with SIGN(S(C1)) and SIGN(S(C2)); otherwise S(C1C2) = S(C1) + S(C2). In the worked example, the intermediate scores 0.3018 and 0.4109 combine into the final score 0.6736.]
106. Preprocessing
• Tokenize (segmentation)
– Jieba
– CKIP
– Stanford parser
• Part-of-speech tagging
– CKIP
– Stanford parser
Tokenization is mandatory; we will release an
"optional" version in the future.
page 106
107. CopeOpi – example
• $ ./run_trad.sh
– Runs CopeOpi on the files listed in "file.lst"
• Check the results in out/0001.txt
page 107
test_trad.txt 0001
109. Deep Neural Network Example
Word
• Morphological structure
for a better
word representation.
• Same idea but
for *Chinese sentiment
analysis*
• Luong, Thang, Richard Socher, and Christopher D. Manning. "Better Word Representations with Recursive Neural Networks
for Morphology." CoNLL. 2013.
page 109
110. Deep Neural Network Example
Sentence
• Learned composition function (of semantics): Richard Socher
(RNN, a series of works since 2011)
page 110
111. Learning by Neural Network
• Word Sentiment
• Sentence Sentiment
• Document Sentiment
• Social Media Post Sentiment
page 111
112. Learning by Deep Neural Network
• Word Sentiment: CNN + ANTUSD
• Sentence Sentiment
• Document Sentiment
• Social Media Post Sentiment: Text + User
Context
– Structures are not yet considered!
page 112
113. CSentiPackage: UTCNN
Learning by Deep Neural Network
• Word Sentiment: CNN + ANTUSD
• Sentence Sentiment
• Document Sentiment
• Social Media Post Sentiment: Text + User
Context
page 113
114. User Topic Comment Neural Network
(UTCNN)
• A deep learning model of stance classification
on social media text
page 114
[Figure: the deep learning model takes post content, comment content, authors, likers, commenters, and topics as inputs]
115. UTCNN
• Stance tendency
– Author
– Liker
– Topic
– Commenter
• Semantic preference
– Author
– Liker
– Topic
– Commenter
page 115
We should reject the re-construction of the Nuclear power plant. (post)
Great! (comment)
NO! …… (comment)
116. If you don’t know anything about deep learning
(again) …
– I won’t talk too much about it. No worries.
– You can take the courses organized by 臺灣資料科學協會
(Taiwan Data Science Association)
– Knowing that it’s a DNN Chinese sentiment model
for now is enough.
page 116
117. Social Media Dataset Released
in CSentiPackage
• Facebook fan groups (Chinese)
– Author/liker/comment/commenter
– Single topic (latent topics learned by LDA)
– Unbalanced
– Chinese
• Create Debate (English)
– Author
– Four topics
– Balanced
– English
page 117
118. Environment
• Software
– OS: Linux
– Programming language
• Java 6 or higher
• python 2.7
– Theano 0.8.2
– Keras 1.0.3
– sklearn
• Hardware
– Graphic cards (deep learning)
page 118
119. Demo Environment
• CPU
– Intel Xeon E5-2630 v3 ×2
• RAM
– 64 GB
• OS
– Ubuntu 14.04 LTS
• Graphic cards
– Nvidia Tesla K40 ×2
page 119
121. UTCNN - demo
page 121
http://doraemon.iis.sinica.edu.tw/wordforce/
122. UTCNN - demo
page 122
http://doraemon.iis.sinica.edu.tw/wordforce/
123. Something Important About
CSentiPackage
page 123
• The CSentiPackage you obtained is only for your group to
use, for research purposes.
• It has been officially released, so it can be
downloaded at any time.
• Download or check what’s new @
http://academiasinicanlplab.github.io/
• Find the tutorial materials of CSentiPackage @
http://www.lunweiku.com/
124. Skills We Might Need
• Text Normalization
• Multimedia multimodal
• User and Text Networking
• Social Network
page 124
125. NLP and Social Network
• NLP sometimes serves as the pre-processing of
the social network research to deal with
unstructured data.
• NLP in social media is sometimes referred to as
Social Media Analytics.
• NLP models can help find information such as
events, sentiment, named entities for social
network analysis
• The network analysis algorithm can benefit NLP
research by bringing in heterogeneous features.
page 125
126. Challenges
• Integrating features is not easy
• Integrating knowledge is not easy, either
• Data are big; performance and efficiency trade
off against each other.
• Social media are always changing and
different over generations.
• Visualizing both texts and the network is
challenging.
page 126
127. Wrap Up – Part III
• More context, more to know
• More context, better for guessing
• Inner context, outer context, inter context
• Pay more attention to the relations
page 127
130. Industrial Needs
• Techniques can make
money
• Techniques can provide
better services (then to
make money)
• Techniques can make
users engaged (then to
make money)
page 130
133. Ads (1)
• Google AdSense
– How AdSense works: website owners can monetize their
online content through Google AdSense. AdSense serves
text and multimedia ads that match your site's content and
visitors. These ads are created and paid for by advertisers
who want to promote their products; advertisers pay
different amounts for different ads, so the amount you
earn varies.
• Ad market share: Google + FB take about 90%
• But there is very little you can do (with NLP).
page 133
135. Recommendation (產品推薦)
• Content-based
• Collaborative filtering
• User behavior
NLP techniques are needed mostly for content-based
recommendation (items on e-commerce websites).
page 135
136. • User behavior can be related to unstructured
data.
page 136
147. Chatbot Challenges
• It is difficult to cross domains.
• It needs very big data.
• It is challenging to connect to background
knowledge.
However, a chatbot performs satisfactorily as a
small, limited bot. Many Facebook stores use
this kind of chatbot to sell things and provide
services.
page 147
149. Final Wrap Up
• You now know what NLP is
• You have seen the major NLP tools
• You have heard about the cool things NLP can do
• Start NLP today!
page 149