KoNLPy, NLTK, Gensim for Sentiment Analysis of Korean Movie Reviews
PyCon Korea 2015
2015-08-29
Lucy Park (Eunjeong Park)
1 / 69
KoNLPy: Korean morphological analysis and POS tagging
NLTK: text exploration and classification
Gensim: document embeddings

We will look at two ways to represent text, bag-of-words and document embeddings,
using KoNLPy, NLTK, and Gensim together.

KoNLPy: http://konlpy.org
NLTK: http://nltk.org
Gensim: http://radimrehurek.com/gensim
2 / 69
Lucy Park (a.k.a. lucypark, echojuliett, e9t)
CAVEAT: I develop KoNLPy.
Just another yak shaver...
11:49 <@sanxiyn> once again, the mysterious world of yak shaving
11:51 <@sanxiyn> you all know what yak shaving is, right?
11:51 <디토군> just went and looked it up
11:51 <@mana> (quietly hoping for an explanation)
11:51 <@sanxiyn> you're trying to cut down a tree
11:52 <@sanxiyn> and while you're swinging the axe
11:52 <@sanxiyn> you figure a sharper axe would cut the tree more easily
11:52 <@sanxiyn> so you start sharpening the axe blade
11:52 <@sanxiyn> then you figure a better grindstone would sharpen the blade faster
11:52 <@sanxiyn> so you ask around for where to find a good whetstone
11:52 <@mana> …
11:52 <&홍민희> that's typical me
11:52 <@sanxiyn> they say the world's best whetstone is found somewhere far away
11:52 <@sanxiyn> so you set off to ride a yak there
11:52 <@mana> I couldn't type because that's exactly what I always do
11:52 <@sanxiyn> and you end up shaving the yak's wool…
11:52 <@sanxiyn> etc.
3 / 69
/
KoNLPy, NLTK, Gensim (?)
!
Gensim ,
.
,
.
.
?
3
4 / 69
1997
20 ...
5 / 69
IBM .
6 / 69
. ,
.
.
-- Bengio et al., Deep Learning, Ch. 1, book in preparation for MIT Press, 2015.
7 / 69
" "
Representations of text
8 / 69
?
, .
" " .
9 / 69
{ , }
≈
⊂
10 / 69
Representing words as binary vectors:
- as a set
- as a one-hot vector (a.k.a. 1-of-V)
$$
w_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad
w_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad
w_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}, \quad
\dots, \quad
w_{|V|} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}
$$

The one-hot vector $w_i$ has a single 1 at position $i$; its length is the vocabulary size $|V|$.
11 / 69
Problem: one-hot vectors carry no notion of similarity.
ex: the dot product of any two distinct one-hot vectors is 0:

$$ (w^{(i)})^\top w^{(j)} = 0 \quad \text{for } i \neq j $$

* BOW-style (discrete) representations like the above are local representations;
the alternative is a distributed representation. 12 / 69
Bag of words (BOW): represent a document by its words, ignoring order
- binary / term existence
- word counts (a.k.a. term frequency (TF))
- weighted counts, ex: term frequency–inverse document frequency (TF-IDF)
$$
d_1^{\text{binary}} = \begin{bmatrix} 1 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad
d_1^{\text{tf}} = \begin{bmatrix} 3 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
$$
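For concreteness, a minimal TF-IDF sketch over tokenized documents (not from the original slides; plain Python, and the helper name tf_idf is just illustrative):

import math
from collections import Counter

def tf_idf(documents):
    # documents: a list of token lists; returns one {term: weight} dict per document
    n_docs = len(documents)
    # df[t] = number of documents containing term t
    df = Counter(t for doc in documents for t in set(doc))
    weighted = []
    for doc in documents:
        tf = Counter(doc)
        # classic tf * log(N / df) weighting
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted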
13 / 69
These BOW vectors are long (|V|-dimensional) and sparse,
which invites the curse of dimensionality.
* Figure: Vincent Spruyt, The Curse of Dimensionality in classification, 2014. 14 / 69
" ?"
(WordNet) X is-a Y (hypernym)
(DAG; directed acyclic graph) . ( : .)
from nltk.corpus import wordnet as wn
wn.synsets('computer') # "computer"
#=> [Synset('computer.n.01'), Synset('calculator.n.01')]
: ( ) /
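The is-a links can be read straight off a synset with NLTK's WordNet API (outputs below are from WordNet 3.0):

computer = wn.synsets('computer')[0]  # Synset('computer.n.01')
computer.hypernyms()                  # one step up the is-a DAG
#=> [Synset('machine.n.01')]
computer.root_hypernyms()             # the top of the hierarchy
#=> [Synset('entity.n.01')]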
15 / 69
Problems with thesauri:
1. Domain-specific terms are missing:
from nltk.corpus import wordnet as wn
wn.synsets('nginx')
#=> []
2. Newly coined words are missing too:
wn.synsets('yolo') # yolo: you only live once
#=> []
3. And NLTK's WordNet does not support Korean... (yet!)
wn.synsets('컴퓨터', lang='kor') # any Korean word, ex: "computer"
#=> WordNetError: Language is not supported.
16 / 69
.. . ?
17 / 69
.
18 / 69
A word's meaning comes from the words around it.
You shall know a word by the company it keeps.
-- J.R. Firth (1957)
19 / 69
1: ?
__ .
, __ .
__ .
: [ ]
20 / 69
2: " " ?
?
21 / 69
.
?
SO .
22 / 69
;;
23 / 69
A word can be represented by the contexts it appears in.
context := the words within a window around the target word
Window sizes vary, commonly 1-8 words on each side, sometimes 3-17 *
ex: "PyCon" appearing near "R" and "Python" tells us something about "PyCon", right?
* Jurafsky, Martin, Speech and Language Processing, 3rd ed., Ch. 19.1, book in preparation, 2015. 24 / 69
"Co-occurrence"
25 / 69
Co-occurrence
documents = [[" ", " ", " ", " "],
[" ", "R", " ", " "]]
1. Term-document matrix (terms × documents):
easy to compute (!)
x_td = [[1, 1], #
[1, 0], #
[0, 1], # R
[1, 1], #
[1, 1]] #
2. Term-term matrix (terms × terms):
captures syntactic and semantic relations
(window size == 5):
x_tt = [[0, 1, 1, 2, 0], #
[1, 0, 0, 1, 1], #
[1, 0, 0, 1, 1], # R
[2, 1, 1, 0, 2], #
[0, 1, 1, 2, 0]] #
$X_{td} \in \mathbb{R}^{|V| \times M}$, where $M$ is the number of documents.
$X_{tt} \in \mathbb{R}^{|V| \times |V|}$, where $|V|$ is the vocabulary size.
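One way to build such a matrix, as a sketch (not from the slides): here the whole document acts as the window, and the helper name term_term_matrix is illustrative.

from itertools import combinations

def term_term_matrix(documents):
    # count co-occurrences of word pairs within each document
    vocab = sorted({w for doc in documents for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    counts = [[0] * len(vocab) for _ in vocab]
    for doc in documents:
        for w1, w2 in combinations(doc, 2):
            counts[index[w1]][index[w2]] += 1
            counts[index[w2]][index[w1]] += 1
    return vocab, counts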
26 / 69
Co-occurrence matrix
Rows can be compared with cosine similarity:
import math
def dot_product(v1, v2):
return sum(map(lambda x: x[0] * x[1], zip(v1, v2)))
def cosine_measure(v1, v2):
prod = dot_product(v1, v2)
len1 = math.sqrt(dot_product(v1, v1))
len2 = math.sqrt(dot_product(v2, v2))
return prod / (len1 * len2)
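A quick check on two rows of x_tt from the previous slide:

v1 = x_tt[0]  # context counts of the first term
v2 = x_tt[2]  # context counts of "R"
print(cosine_measure(v1, v2))
#=> 0.4714045207910317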
Remaining problems:
- counts are heavily skewed (a few frequent words dominate)
- frequent tokens are not very discriminative (ex: particles and punctuation)
What else can we do?
27 / 69
Neural embeddings
Idea: learn the representation from data!
Train a language model: ex, given [" ", " ", " ", " "], predict the next token (ex: "!", "." or "?").
Then reuse the model's learned weights as word vectors.
Neural network language model (NNLM)*
* Bengio, Ducharme, Vincent, A Neural Probabilistic Language Model, {NIPS, 2000; JMLR, 2003}. 28 / 69
word2vec *
T. Mikolov's word2vec made neural embeddings fast and popular!
(it maps words from n dimensions down to m dense dimensions)
* Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their
Compositionality. In Proceedings of NIPS, 2013. 29 / 69
: word2vec *
* , , , (1946-2015) , 49 2 , 2015. 30 / 69
Can we apply neural embeddings to documents, not just words?
31 / 69
doc2vec*
Idea: embed documents (paragraphs), not just words!
Document vectors are trained jointly with the word vectors,
and are dense and short compared to |V|-dimensional BOW vectors.
* Le, Mikolov, Distributed Representations of Sentences and Documents, ICML, 2014. (Gensim's doc2vec implements this paper's paragraph vector.) 32 / 69
ex: with term frequency, every document becomes a long |V|-dimensional vector:
documents = [
[' ', ' ', ' ', ' ', ' '],
[' ', ' ', ' ', ' ', ' '],
[' ', ' ', ' ', ' ', ' '],
[' ', ' ', ' ', ' ', ' '],
[' ', ' ', ' ', ' ', ' '],
[' ', ' ', ' ', ' ', ' ']
]
words = set(sum(documents, []))
print(words)
# => {' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '}
def term_frequency(document):
    # a minimal sketch: count each vocabulary word in this document (sorted for a stable order)
    return [document.count(word) for word in sorted(words)]
vectors = [term_frequency(doc) for doc in documents]
print(vectors)
# => {
# 'S1': [1, 1, 0, 1, 0, 0, 1, 1],
# 'S2': [1, 1, 0, 1, 0, 0, 1, 1],
# 'S3': [1, 0, 0, 1, 1, 1, 1, 0],
# 'S4': [1, 1, 0, 1, 1, 1, 0, 0],
# 'S5': [1, 0, 1, 1, 1, 0, 1, 0],
# 'S6': [1, 1, 1, 1, 1, 0, 0, 0]
# }
Each of these vectors is |V|-dimensional (the vocabulary size).
33 / 69
ex: with doc2vec, every document becomes a short dense vector:
documents = [
[' ', ' ', ' ', ' ', ' '],
[' ', ' ', ' ', ' ', ' '],
[' ', ' ', ' ', ' ', ' '],
[' ', ' ', ' ', ' ', ' '],
[' ', ' ', ' ', ' ', ' '],
[' ', ' ', ' ', ' ', ' ']
]
words = set(sum(documents, []))
print(words)
# => {' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '}
def doc2vec(document):
    # do something else (in practice: let Gensim do it! see below)
    return vector
vectors = [doc2vec(doc) for doc in documents]
print(vectors)
# => {
# 'S1': [-0.01599941, 0.02135301, 0.0927715],
# 'S2': [0.02333261, 0.00211833, 0.00254255],
# 'S3': [-0.00117272, -0.02043504, 0.00593186],
# 'S4': [0.0089237, -0.00397806, -0.13199195],
# 'S5': [0.02555059, 0.01561624, 0.03603476],
# 'S6': [-0.02114407, -0.01552016, 0.01289922]
# }
Only 3 dimensions each here, instead of |V|.
34 / 69
Recap of the representations so far:

Sparse, long vectors                      Dense, short vectors
1-hot vectors                             word2vec
bag-of-words                              doc2vec
(term existence, TF, TF-IDF, ...)

Let's use them for a real task.
35 / 69
Sentiment analysis for Korean movie reviews
36 / 69
, ?
37 / 69
!
38 / 69
Naver sentiment movie corpus v1.0
Modeled after the IMDB review corpus of Maas et al., 2011
Specs:
- Reviews of up to 140 characters
- 200K reviews in total
- ratings_train.txt: 150K reviews, ratings_test.txt: 50K reviews
- Positive/negative labels are balanced (i.e., random guess yields 50% accuracy)
- Size: 19MB
URL: http://github.com/e9t/nsmc/
39 / 69
README coming soon
40 / 69
$ cat ratings_train.txt | head -n 25
id document label
9976970 .. 0
3819312 ... .... 1
10265843 0
9045019 .. .. 0
6483659 !
5403919 3 1 8 . ... . 0
7797314 . 0
9443947 ..
7156791 1
5912145 ? .. ?
9008700 . ♥ 1
10217543 90 !! ~
5957425 0
8628627 . . .
9864035
6852435 1
9143163 .
4891476 0
7465483 !!♥
3989148 , . . 1
4581211 . 1
2718894 1
9705777 . ....
471131 . 1
41 / 69
So, how shall we go about it?
1. Preprocess the data with KoNLPy
2. Explore the data with NLTK
3. Classify with bag-of-words (term existence) features
4. Classify with Gensim's doc2vec features
42 / 69
1. Data preprocessing (feat. KoNLPy)
def read_data(filename):
    with open(filename, 'r') as f:
        data = [line.split('\t') for line in f.read().splitlines()]
        data = data[1:]   # drop the header row
    return data
train_data = read_data('ratings_train.txt')
test_data = read_data('ratings_test.txt')
NOTE: if you load this file with pandas.read_csv(), set the quoting parameter to
csv.QUOTE_NONE (3), because reviews can contain unmatched quote characters.
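For reference, a roughly equivalent pandas call could look like this (a sketch, assuming the same tab-separated layout):

import csv
import pandas as pd
train_df = pd.read_csv('ratings_train.txt', sep='\t', quoting=csv.QUOTE_NONE)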
43 / 69
1. Data preprocessing (feat. KoNLPy)
# check the numbers of rows and columns
print(len(train_data)) # nrows: 150000
print(len(train_data[0])) # ncols: 3
print(len(test_data)) # nrows: 50000
print(len(test_data[0])) # ncols: 3
44 / 69
1. Data preprocessing (feat. KoNLPy)
from konlpy.tag import Twitter
pos_tagger = Twitter()
def tokenize(doc):
# norm, stem optional
return ['/'.join(t) for t in pos_tagger.pos(doc, norm=True, stem=True)]
train_docs = [(tokenize(row[1]), row[2]) for row in train_data]
test_docs = [(tokenize(row[1]), row[2]) for row in test_data]
# sanity check the first document
from pprint import pprint
pprint(train_docs[0])
# => [([' /Exclamation',
# ' /Noun',
# '../Punctuation',
# ' /Noun',
# ' /Noun',
# ' /Verb',
# ' /Noun'],
# '0')]
NOTE:
. 45 / 69
1. Data preprocessing (feat. KoNLPy)
"Do we have to tokenize?"
- Not necessarily; character-level models exist too,* but tokens keep the feature space manageable.
"Why attach part-of-speech (POS) tags?"
- To tell homographs apart, ex: ' /Josa' vs ' /Noun'
* https://github.com/karpathy/char-rnn 46 / 69
2. Data exploration (feat. NLTK)
Let's explore the data!
Collect every token in the training data:
tokens = [t for d in train_docs for t in d[0]]
print(len(tokens))
# => 2194536
47 / 69
2. Data exploration (feat. NLTK)
nltk.Text() makes exploration easy!
It wraps the token list of our documents and provides counting, plotting,
collocations, concordance, and more.
import nltk
text = nltk.Text(tokens, name='NMSC')
print(text)
# => <Text: NMSC>
48 / 69
2. Data exploration (feat. NLTK)
Count tokens
print(len(text.tokens)) # returns number of tokens
# => 2194536
print(len(set(text.tokens))) # returns number of unique tokens
# => 48765
pprint(text.vocab().most_common(10)) # returns frequency distribution
# => [('./Punctuation', 68630),
# (' /Noun', 51365),
# (' /Verb', 50281),
# (' /Josa', 39123),
# (' /Verb', 34764),
# (' /Josa', 30480),
# ('../Punctuation', 29055),
# (' /Josa', 27108),
# (' /Josa', 26696),
# (' /Josa', 23481)]
49 / 69
2. Data exploration (feat. NLTK)
Plot tokens by frequency
text.plot(50) # Plot sorted frequency of top 50 tokens
...but all the Hangul labels come out broken ;;
50 / 69
2. Data exploration (feat. NLTK)
!
Troubleshooting: For those who see rectangles instead of letters in the saved plot file,
include the following configurations before drawing the plot:
Some example fonts:
Mac OS: /Library/Fonts/AppleGothic.ttf
from matplotlib import font_manager, rc
font_fname = 'c:/windows/fonts/gulim.ttc' # A font of your choice
font_name = font_manager.FontProperties(fname=font_fname).get_name()
rc('font', family=font_name)
51 / 69
2. Data exploration (feat. NLTK)
. .
.
52 / 69
2. Data exploration (feat. NLTK)
Collocations: words that frequently appear together (ex: "text" + "mining")
text.collocations() # print common collocations
# => /Determiner /Noun; /Suffix /Josa; /Determiner /Noun; /Noun
# /Verb; /Noun /Josa; 10/Number /Noun; /Noun /Suffix; /Noun
# /Adjective; /Noun /Josa; /Noun /Josa; /Noun /Josa; /Suffix
# /Josa; /Noun /Noun; /Noun /Josa; /Suffix /Josa; /Noun
# /Josa; /Noun /Josa; /Suffix /Josa; /Noun /Suffix; /Noun
# /Josa
NOTE: nltk.Text() has more features; when the docs fall short, read the
source code and API reference. It's worth it.
53 / 69
3. Sentiment classification with term existence*
Build features from the presence/absence of frequent terms.
# use the 2000 most frequent tokens as features
# WARNING: this is neither time- nor memory-efficient
selected_words = [f[0] for f in text.vocab().most_common(2000)]
def term_exists(doc):
    return {'exists({})'.format(word): (word in set(doc)) for word in selected_words}
# for speed, use only part of the training corpus
train_docs = train_docs[:10000]
train_xy = [(term_exists(d), c) for d, c in train_docs]
test_xy = [(term_exists(d), c) for d, c in test_docs]
Try this:
- Vary the number of features; what happens with only the top 50 terms?
- What about TF or TF-IDF weights instead of existence?
- Use a sparse matrix, or scikit-learn's TfidfVectorizer() (see the sketch below)!
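A minimal sketch of that TF-IDF route (not from the original slides; variable names like train_texts are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer expects strings, so re-join the token lists first
train_texts = [' '.join(d) for d, c in train_docs]
train_labels = [c for d, c in train_docs]
vectorizer = TfidfVectorizer(max_features=2000)
train_tfidf = vectorizer.fit_transform(train_texts)  # sparse matrix of shape (n_docs, 2000)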
* Natural Language Processing with Python Chapter 6. Learning to classify text 54 / 69
3. Sentiment classification with term existence
Features ready; time to train a classifier!
NLTK ships with several:
- Naive Bayes Classifiers
- Decision Tree Classifiers
- Maximum Entropy Classifiers
You can also plug in scikit-learn classifiers.
Here, let's try NLTK's Naive Bayes classifier!
55 / 69
3. Sentiment classification with term existence
Naive Bayes classifier
# train on train_xy, evaluate on test_xy
classifier = nltk.NaiveBayesClassifier.train(train_xy)
print(nltk.classify.accuracy(classifier, test_xy))
# => 0.80418
classifier.show_most_informative_features(10)
# => Most Informative Features
# exists( /Noun) = True 1 : 0 = 38.0 : 1.0
# exists( /Noun) = True 0 : 1 = 30.1 : 1.0
# exists(♥/Foreign) = True 1 : 0 = 24.5 : 1.0
# exists( /Noun) = True 0 : 1 = 22.1 : 1.0
# exists( /Noun) = True 0 : 1 = 19.5 : 1.0
# exists( /Noun) = True 0 : 1 = 19.4 : 1.0
# exists( /Noun) = True 1 : 0 = 18.9 : 1.0
# exists( /Noun) = True 0 : 1 = 16.9 : 1.0
# exists( /Noun) = True 1 : 0 = 16.9 : 1.0
# exists( /Noun) = True 1 : 0 = 15.9 : 1.0
The labels are balanced, so baseline accuracy is 0.50;
with only 10,000 training documents we already reach accuracy 0.80.
Can we do better?
56 / 69
4. Sentiment classification with doc2vec (feat. Gensim)
Let's compute document vectors with doc2vec.
We use Gensim.*
Creator: Radim Řehůřek
* Based on Gensim 0.12.0; the API may change in newer versions! 57 / 69
4. Sentiment classification with doc2vec (feat. Gensim)
!!!
Gensim ...
58 / 69
4. Sentiment classification with doc2vec (feat. Gensim)
doc2vec expects each document to be paired with its tags.
from gensim.models.doc2vec import TaggedDocument would be fine, but a namedtuple of the same shape works too:
from collections import namedtuple
TaggedDocument = namedtuple('TaggedDocument', 'words tags')
# tag each of the 150K training documents with its label
tagged_train_docs = [TaggedDocument(d, [c]) for d, c in train_docs]
tagged_test_docs = [TaggedDocument(d, [c]) for d, c in test_docs]
59 / 69
4. Sentiment classification with doc2vec (feat. Gensim)
from gensim.models import doc2vec
# initialize the model
doc_vectorizer = doc2vec.Doc2Vec(size=300, alpha=0.025, min_alpha=0.025, seed=1234)
doc_vectorizer.build_vocab(tagged_train_docs)
# Train document vectors!
for epoch in range(10):
doc_vectorizer.train(tagged_train_docs)
doc_vectorizer.alpha -= 0.002 # decrease the learning rate
doc_vectorizer.min_alpha = doc_vectorizer.alpha # fix the learning rate, no decay
# To save
# doc_vectorizer.save('doc2vec.model')
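Aside (not from the original talk): Gensim's API has changed since 0.12. In Gensim 4.x the same setup would look roughly like this; size became vector_size, and train() takes explicit total_examples and epochs:

from gensim.models import doc2vec
doc_vectorizer = doc2vec.Doc2Vec(vector_size=300, alpha=0.025, min_alpha=0.025, seed=1234)
doc_vectorizer.build_vocab(tagged_train_docs)
doc_vectorizer.train(tagged_train_docs, total_examples=doc_vectorizer.corpus_count, epochs=10)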
Try this:
- Vary the vector size and the number of epochs.
60 / 69
4. Sentiment classification with doc2vec (feat. Gensim)
Sanity-check the trained word vectors! (1/3)
pprint(doc_vectorizer.most_similar(' /Noun'))
# => [(' /Noun', 0.5669919848442078),
# (' /Noun', 0.5522832274436951),
# (' /Noun', 0.5021427869796753),
# (' /Noun', 0.5000861287117004),
# (' /Noun', 0.4368450343608856),
# (' /Noun', 0.42848479747772217),
# (' /Noun', 0.42714330554008484),
# (' /Noun', 0.41590073704719543),
# (' /Noun', 0.41056352853775024),
# (' /Noun', 0.4052993059158325)]
Hm... the neighbors look plausible.
61 / 69
4. Sentiment classification with doc2vec (feat. Gensim)
Sanity-check the trained word vectors! (2/3)
pprint(doc_vectorizer.most_similar(' /KoreanParticle'))
# => [(' /KoreanParticle', 0.5768033862113953),
# (' /KoreanParticle', 0.4822016954421997),
# ('!!!!/Punctuation', 0.4395076632499695),
# ('!!!/Punctuation', 0.4077949523925781),
# ('!!/Punctuation', 0.4026390314102173),
# ('~/Punctuation', 0.40038347244262695),
# ('~~/Punctuation', 0.39946430921554565),
# ('!/Punctuation', 0.3899948000907898),
# ('^^/Punctuation', 0.3852730989456177),
# ('~^^/Punctuation', 0.3797937035560608)]
Even emphatic particles and exclamation marks cluster together!
62 / 69
4. Sentiment classification with doc2vec (feat. Gensim)
Famously, v(KING) – v(MAN) + v(WOMAN) = v(QUEEN).
Does that kind of analogy work here too?
63 / 69
4. Sentiment classification with doc2vec (feat. Gensim)
Sanity-check the trained word vectors! (3/3)
pprint(doc_vectorizer.most_similar(positive=[' /Noun', ' /Noun'], negative=[' /Noun']))
# => [(' /Noun', 0.32974398136138916),
# (' /Noun', 0.305545836687088),
# (' /Noun', 0.2899821400642395),
# (' /Noun', 0.2856029272079468),
# (' /Noun', 0.2840743064880371),
# (' /Noun', 0.28247544169425964),
# (' /Noun', 0.2795347571372986),
# (' /Noun', 0.2794691324234009),
# (' /Noun', 0.27567577362060547),
# (' /Noun', 0.2734225392341614)]
Hmm...? Not quite convincing?
64 / 69
4. Sentiment classification with doc2vec (feat. Gensim)
Some of these analogies look off;;
Maybe because one vector per word conflates multiple senses.* Let's inspect contexts with a concordance:
text.concordance(' /Noun', lines=10)
# => Displaying 10 of 145 matches:
# Josa /Noun /Josa ,,/Punctuation /Noun /Noun ...../Punctuation /Noun
# /Noun /Noun ../Punctuation /Noun /Noun /Noun /Noun /Noun /Josa /Nou
# nction /Noun /Josa /Adjective /Noun /Verb /Verb /Noun /Josa
# /Noun /Noun /Noun ?/Punctuation /Noun /Noun ./Punctuation /Noun /No
# b /Noun /Noun /KoreanParticle /Noun /Noun /Adjective ./Punctuation
# osa /Noun /Josa /Noun /Josa /Noun /Josa /Noun /Noun /Josa /Nou
# /Noun /Josa /Noun ./Punctuation /Noun ,/Punctuation /Noun /Noun /Suf
# Josa /Noun /Noun /Noun /Noun /Noun /Josa /Noun /Suffix /Josa /
# /Verb ./Punctuation /Adjective /Noun /Noun .../Punctuation /Noun
# tive /Noun /Noun .../Punctuation /Noun /Noun ../Punctuation /Noun /J
* Huang, Socher, Improving Word Representations via Global Context and Multiple Word Prototypes, ACL, 2012. 65 / 69
4. Sentiment classification with doc2vec (feat. Gensim)
Finally, turn every document into a vector for the classifier.
train_x = [doc_vectorizer.infer_vector(doc.words) for doc in tagged_train_docs]
train_y = [doc.tags[0] for doc in tagged_train_docs]
len(train_x) # same number of documents as with term existence
# => 150000
len(train_x[0])
# => 300
test_x = [doc_vectorizer.infer_vector(doc.words) for doc in tagged_test_docs]
test_y = [doc.tags[0] for doc in tagged_test_docs]
len(test_x)
# => 50000
len(test_x[0])
# => 300
66 / 69
4. Sentiment classification with doc2vec (feat. Gensim)
Train scikit-learn's LogisticRegression() on the document vectors.
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=1234)
classifier.fit(train_x, train_y)
classifier.score(test_x, test_y)
# => 0.78246000000000004
Accuracy 0.78!
About 2 points below term existence, but remember:
term existence used 2000 sparse features, while doc2vec needs only 300 dense dimensions!
67 / 69
Wrapping up.
We preprocessed Korean text with KoNLPy, explored it with NLTK,
and built document embeddings with Gensim.
We compared term existence and document vectors for sentiment classification.*
Try applying these techniques to your own tasks!
* SNUDM Technical Reports, Apr 14, 2015. 68 / 69
:D
http://lucypark.kr
@echojuliett
69 / 69