Communicating KnowledgeSentiment Analysis Symposium
Lessons Learned from a VOC
Analysis System for a big Korean
Telecommunication Company
Ivan Berlocher
SALTLUX
Sentiment Analysis Symposium
Nov. 9th 2011
Introduction
• Saltlux Inc. is located in Seoul, Korea, established in 1979 and renovated in 2003.
• Expertise domain:
Information Retrieval, Text/Data/Web/Graph Mining solutions and services based on
Semantic Web Technology.
• Main languages supported: Korean, Japanese, English. For other languages, external
solutions are used.
• 70 employees in Seoul, one Development Center in Vietnam (12 employees)
One sales office in Japan (3 employees)
• Several partnerships with other companies/institutes:
– Ontoprise in Germany
– Franz in California
– DERI in Ireland
• Many R&D partnerships (ETRI, KAIST, universities…)
Table of Contents
• Project & Environment Description
– Needs of Customer
– System (Main) Requirements
• VOC Data
– Sample Data
– Data Analysis
• System Overview
• Korean Linguistics
• Sentiment Analysis
• Lessons Learned
• Future work
Project & Environment Description
• Needs of Customer
– Customer: a Korean telecommunications corporation
– Department: Voice of Customer (VOC) Analysis
– Mission: analyze (human-typed) memos from all call centers to
identify major problems and produce reports for decision makers, in
order to improve quality of service and increase customer
satisfaction.
– Data: human-typed notes covering any kind of question from
customers
• Information about subscriptions
• Inquiries or complaints about devices (phones), services, or dealerships
• Complaints about quality of communication
• etc.
Number of notes: ~200 thousand a day (~5 million a month).
Notes must remain searchable for 1 year (~60 million).
Project & Environment Description
• System (Main) Requirements
• Distinguish simple inquiries from complaints
• Classify into categories/departments of services
• Monitor trends of topics in real time, daily, weekly, monthly
• Compare trends/tendencies between time slices
• Find related topics
• Manage personal vocabulary
• Anonymize personal data (people's names, telephone numbers, social
IDs, addresses, etc.)
Project started in October 2010 with a 3-month POC (~10 MM).
After acceptance (success), integration with the real system took
another 3 months (~10 MM).
2 phases: ~$200,000
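The anonymization requirement above can be partly illustrated with pattern matching. The sketch below is only an assumption about how such masking might look, not the system's actual method. The two regular expressions (a Korean-style phone number and a 6-digit/7-digit social ID) and the placeholder tags are hypothetical. Names and addresses would need the Named Entities Recognition component and are not covered.

```python
import re

# Hypothetical patterns (not from the slides):
#  - a phone number such as 010-1234-5678, with or without separators
#  - a social ID written as 6 digits, a hyphen, then 7 digits
# Digit lookarounds are used instead of \b because Hangul letters count as
# word characters, so \b would not fire between "처" and "0".
SOCIAL_ID = re.compile(r"(?<!\d)\d{6}-\d{7}(?!\d)")
PHONE = re.compile(r"(?<!\d)0\d{1,2}[-\s]?\d{3,4}[-\s]?\d{4}(?!\d)")

def anonymize(note: str) -> str:
    """Replace matched identifiers with fixed placeholder tags."""
    note = SOCIAL_ID.sub("<SOCIAL_ID>", note)  # mask social IDs first
    note = PHONE.sub("<PHONE>", note)
    return note

print(anonymize("연락처:010-1234-5678"))
```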
VOC Data Sample
• Data often contain some structured information (metadata), but without any
standard.
• Most of the time, however, there is no particular mark/metadata,
which makes Named Entity Recognition more complex.
The same information is entered in many different ways
(연락처: Phone Number)
VOC Data Analysis
• Data contain lots of named entities:
products/services/people/social IDs/phone numbers,
often related to privacy
• Data contain lots of technical (domain) terms
• The real content to analyze is mostly very short (tweet-like)
but sometimes very long
• Lots of misspellings/mistypings
• Korean (Asian) segmentation problem, amplified by the
speed constraint
• Lots of (non-standard) abbreviations
System Overview
• Overall Architecture
[Architecture diagram. Components shown: Text Segmentation; Morphological Analyzer; Chunk/Phrase Identification; Named Entities Recognition; Synonyms & Normalization; Indexing; Distributed Indexes; Classifier (Hybrid SVM & Rules); Analysis Phase; Searching/Clustering (TopicRank); Timelines Dumper; DFS Timelines (files 20110713_0700_1.df, 20110713_0700_2.df, 20110713_0700_3.df, 20110713_0710_1.df, 20110713_0710_2.df, 20110713_0710_3.df); Scheduler; Merger & Ranker; Trend (TopN) DB; Web Server (Web UI); Complaint Detector.]
In the real system, indexing was parallelized across 18 Linux machines for speed.
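The "Merger & Ranker" and "Trend (TopN)" boxes suggest that partial results are combined into a ranked keyword list. The sketch below shows one way this could work and rests on several assumptions not stated on the slide: that a file name such as 20110713_0700_1.df encodes date, time window and part number, and that each file holds keyword counts. The function names are hypothetical.

```python
from collections import Counter

def parse_dump_name(name):
    """Split a name like '20110713_0700_1.df' into (date, time, part).

    The meaning of the three fields is an assumption, not given in the slides.
    """
    stem = name.rsplit(".", 1)[0]
    date, time, part = stem.split("_")
    return date, time, int(part)

def merge_top_n(partials, window, n):
    """Sum keyword counts of all dumps in one (date, time) window; return top n."""
    total = Counter()
    for name, counts in partials.items():
        date, time, _ = parse_dump_name(name)
        if (date, time) == window:
            total.update(counts)
    return total.most_common(n)

# Toy counts, invented for illustration only.
partials = {
    "20110713_0700_1.df": Counter({"요금": 3, "해지": 1}),
    "20110713_0700_2.df": Counter({"요금": 2, "통화": 4}),
    "20110713_0710_1.df": Counter({"요금": 10}),
}
print(merge_top_n(partials, ("20110713", "0700"), 2))
```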
System Overview
• Home page
• Top N Keywords Extraction
• Related Keywords (Word Clustering)
• Trend (Timeline) view
• Tweets view
Korean Linguistics
• Brief introduction
Korean is alphabet-based: consonants and vowels are composed into blocks of
consonant/vowel or consonant/vowel/consonant.
“나는 학생입니다.” => 나 = ㄴ (N) + ㅏ (A) = NA
=> 학 = ㅎ (H) + ㅏ (A) + ㄱ (K) = HAK
One unit of consonant/vowel or consonant/vowel/consonant is a
syllable (“eumjeol”); words are composed of several syllables, and a
space-delimited unit is called an “eojeol”.
Basic grammar:
Words are composed of one root (noun, adjective/verb) followed
by an inflection marking the grammatical role (subject/object/location, etc.)
for nouns (called “josa”)
or aspect/mood (tense, honorific form, etc.) for verbs/adjectives
(called “eomi”).
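The syllable composition shown for 나 and 학 can be reproduced with the arithmetic layout of precomposed Hangul syllables in Unicode (starting at U+AC00, with 19 initial consonants, 21 vowels and 28 final slots). The sketch below is an illustration only. The function name `decompose` is hypothetical and not part of the presented system.

```python
# Jamo tables in Unicode order for precomposed Hangul syllables.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"        # 21 vowels
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # none + 27 finals

def decompose(syllable):
    """Return the jamo of one Hangul syllable as a tuple (2 or 3 items)."""
    code = ord(syllable) - 0xAC00
    if not 0 <= code < 19 * 21 * 28:
        return (syllable,)  # not a precomposed Hangul syllable
    initial = CHOSEONG[code // (21 * 28)]
    vowel = JUNGSEONG[(code % (21 * 28)) // 28]
    final = JONGSEONG[code % 28]
    return (initial, vowel, final) if final else (initial, vowel)

print(decompose("나"), decompose("학"))
```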
Korean Linguistics
• Examples:
“나는 학생입니다.” => “나는” = “나” (na: I/me) + “는” (neun: topic marker)
학생입니다 = “학생” + “입니다” = “학생” (hak-seng: student) +
“입니다” (im-ni-da: am) => I'm (a) student.
Lots of (composite) inflectional forms:
학생 + 입니다 = Noun + Be
학생 + 인/이예요/이다/입니까?/인데/인데요 etc. (was, will be …) (eomi)
학생 + syntactic role (이: subject / 에게: to / 한테: from / 을: object) etc. (josa)
Korean is highly agglutinative (it concatenates prefixes/nouns/josa/eomi)
Search engine: 검색엔진.
High-performance search engine: 고성능검색엔진
But the use of spaces is free/arbitrary.
One can write equivalently: 검색엔진 or 검색 엔진
Especially with SNS, space-limited devices, and speed constraints
(such as real-time transcription of conversations), spaces are more and more
often omitted or misused.
=> Need Automatic Segmentation Correction.
Project & Environment Description
• Automatic Segmentation Correction Illustration
Korean Linguistics
• Automatic Segmentation Correction Implementation
Binary classification approach:
tag each syllable as preceded by a space or not.
Any kind of classifier can be used.
Here we use a CRF model (an SVM would also work)
with the set of features listed below.
Example input: 프랑스의 세계적인 디자이너 …
CRF results:
Accuracy at character level: 96.25%
Precision at word level: 95.58%
• Features
– 1-gram, 2-gram, 3-gram, 4-gram of characters (syllables)
– Korean or not, contains number
• Evaluation
– Accuracy (character)
– Word precision
= # correctly spaced words / # words produced by the system
• Very simple to train (easy to get huge amounts of data)
• No need for a lexicon or any lexical information
• Performs surprisingly well
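The labeling scheme, the character n-gram features and the word-precision measure above can be sketched as follows. This is a minimal illustration under assumptions. The exact feature templates of the CRF are not given on the slide, so the n-grams here simply start at the current syllable. All function names are hypothetical and the CRF training step itself is omitted.

```python
def to_labels(spaced):
    """Turn correctly spaced text into (syllables, labels); 1 = space before."""
    chars, labels, space_before = [], [], False
    for ch in spaced:
        if ch == " ":
            space_before = True
            continue
        chars.append(ch)
        labels.append(1 if space_before else 0)
        space_before = False
    return chars, labels

def char_ngram_features(chars, i, max_n=4):
    """Character 1..max_n-grams starting at position i, plus two type flags."""
    feats = {}
    for n in range(1, max_n + 1):
        gram = "".join(chars[i:i + n])
        if len(gram) == n:
            feats[f"{n}gram"] = gram
    feats["is_korean"] = "\uac00" <= chars[i] <= "\ud7a3"
    feats["has_digit"] = chars[i].isdigit()
    return feats

def apply_labels(chars, labels):
    """Rebuild spaced text from syllables and predicted labels."""
    out = []
    for ch, lab in zip(chars, labels):
        if lab and out:
            out.append(" ")
        out.append(ch)
    return "".join(out)

def word_precision(gold, predicted):
    """# correctly spaced words / # words produced by the system."""
    def spans(text):
        result, pos = set(), 0
        for word in text.split():
            result.add((pos, pos + len(word)))
            pos += len(word)
        return result
    produced = spans(predicted)
    return len(spans(gold) & produced) / len(produced)

chars, labels = to_labels("검색 엔진")
print(labels, word_precision("고성능 검색 엔진", "고성능 검색엔진"))
```

Because the labels are derived directly from correctly spaced text, any well-edited corpus can serve as training data, which matches the remark that training data is easy to obtain.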
Korean Linguistics
• Transliteration
– Korean uses more and more English-derived words,
transliterated phonetically into the Korean alphabet
(the reverse of “Romanization”),
especially for foreign names (companies, products, people,
technical/domain terms).
– Transcription is non-unique and non-standard.
Examples:
tablet: 태블릿, 타블렛, 테블릿
Hitachi: 히타치, 히타찌, 히다찌
iPhone 4s: 아이폰 4s, 아이폰포에스, 아이폰 포에스
Korean Linguistics
• Automatic transliteration recognition
– Build a rule-based matcher based on phonetic
transliteration that acts similarly to Soundex, adapted for
Korean pronunciation.
tablet, 태블릿
T => ㅌ/ㄸ/ㄷ
A => ㅏ/ㅓ/ㅔ/ㅐ
etc.
This method has high recall but low precision and needs post-processing filtering (remove
known Korean words found in lexicons, remove nouns that are too short, etc.).
The result has to be corrected by humans, so an efficient workbench is needed for productivity.
We gathered a dictionary of 130 thousand entries, mainly IT-oriented.
More academic research is still needed to solve this problem.
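The Soundex-like idea can be sketched by reducing each Korean spelling to a key in which confusable jamo share one class symbol. Only the two classes given on the slide (T and A) are used here. The real rule set is larger ("etc.") and is not reproduced. The function name `phonetic_key` and the key format are hypothetical.

```python
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

# The two confusable classes taken from the slide; a full system needs more.
CLASSES = {"T": "ㅌㄸㄷ", "A": "ㅏㅓㅔㅐ"}
JAMO_TO_CLASS = {jamo: cls for cls, jamos in CLASSES.items() for jamo in jamos}

def phonetic_key(word):
    """Map a word to a key where confusable jamo collapse to one symbol."""
    key = []
    for ch in word:
        code = ord(ch) - 0xAC00
        if 0 <= code < 11172:  # precomposed Hangul syllable
            jamo = (CHOSEONG[code // 588],
                    JUNGSEONG[(code % 588) // 28],
                    JONGSEONG[code % 28])
        else:
            jamo = (ch,)
        key.extend(JAMO_TO_CLASS.get(j, j) for j in jamo if j)
    return "".join(key)

# Variants from the slide that differ only inside the T or A class collide.
print(phonetic_key("태블릿") == phonetic_key("테블릿"))
print(phonetic_key("히타찌") == phonetic_key("히다찌"))
```

With only these two classes, 타블렛 does not collide with 태블릿 (ㅔ versus ㅣ in the last syllable). Widening the classes raises recall but also admits unrelated Korean words, which matches the high-recall, low-precision behavior described above.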
Sentiment Analysis
• Complaint Detection
A problem similar to standard subjectivity detection
(detecting whether a sentence is sentiment-bearing or not).
Simple approach: binary classification
using an SVM, with
manually tagged training/test corpora
(more than 20 thousand).
Feature space:
n-grams of characters (syllables) + n-grams of words;
2- to 4-grams gave the best results.
Feature selection is important to reduce the feature space.
Chi-square/information gain gave the best results.
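The feature pipeline described above (character 2- to 4-grams scored by chi-square) can be sketched in plain Python. This is an illustration under assumptions, not the system's code. It treats each n-gram as a binary present/absent feature and uses the standard 2x2 chi-square statistic. SVM training on the selected features is omitted, and the function names are hypothetical.

```python
from collections import Counter

def char_ngrams(text, n_min=2, n_max=4):
    """Set of character n-grams (spaces removed) with n_min <= n <= n_max."""
    s = text.replace(" ", "")
    return {s[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(s) - n + 1)}

def chi_square_scores(docs, labels):
    """Chi-square of each binary n-gram feature against a 0/1 label."""
    n, n_pos = len(docs), sum(labels)
    df, df_pos = Counter(), Counter()
    for text, y in zip(docs, labels):
        for gram in char_ngrams(text):
            df[gram] += 1
            df_pos[gram] += y
    scores = {}
    for gram, present in df.items():
        a = df_pos[gram]        # present, complaint
        b = present - a         # present, not a complaint
        c = n_pos - a           # absent, complaint
        d = n - present - c     # absent, not a complaint
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[gram] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores

def select_top_k(scores, k):
    """Keep the k highest-scoring n-grams as the reduced feature space."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy documents, invented for illustration; label 1 marks a complaint.
docs, labels = ["abab", "abcd", "cdcd", "cdef"], [1, 1, 0, 0]
scores = chi_square_scores(docs, labels)
print(scores["ab"], select_top_k(scores, 1))
```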
Sentiment Analysis
Problem: no freely available resource such as SentiWordNet.
Need to build it!
Built our general-domain dictionary as a baseline:
20,000 verbs/adjectives classified as positive/negative/neutral.
The result is a lexicon of ~5,000 entries (positive/negative only).
Enriched with features manually extracted from n-grams.
Precision-oriented (92%), but recall is still quite low (75%).
Overall accuracy: 85%
=> Still working on ways to improve recall without
sacrificing precision.
Basic ideas:
Bagging/boosting (combining several classifiers)
Hybrid models combining (linguistic: semantic/syntactic) rules
and machine learning (statistics)
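To put the precision/recall trade-off in one figure, the two reported values can be combined into an F1 score (their harmonic mean). F1 is not reported on the slide. The value computed below is derived only from the stated 92% precision and 75% recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values reported on the slide: precision 92%, recall 75%.
f1 = f1_score(0.92, 0.75)
print(round(f1, 3))
```

The harmonic mean is pulled toward the lower of the two values, so raising recall would move this figure more than a further gain in precision.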
Lessons Learned
• Lessons Learned
- There is still quite a big gap between customer expectations and
reality. We need to explain this and keep the customer involved in the
assessment process and in knowledge/domain vocabulary acquisition.
- Need to acquire a lot of lexicons:
=> named entities/synonyms/stopwords/sentiment words
- The quality and quantity of these lexicons are a real asset for the
company. Acquiring lexicons requires workbenches for
efficient semi-supervised methods (manually filtering the output of automatic
methods) to reduce costs.
- Tuning classifier parameters, feature extraction, linguistic
knowledge, etc. consumes time and expertise.
- Simple academic methods work quite well (even if they need a lot of
tuning).
- Beyond the simple search engine, the quality of NLP components
is becoming more and more important, especially for sentiment
analysis.
Lessons Learned
• Lessons Learned
- Customers are more and more interested in “Big Data”, “Listening Platform”, “Cloud”, “Social
Network/Intelligence”…
- More and more customers want to get data/opinions from outside their in-house systems
(blogs, communities (BBS), tweets, etc.). Typical questions:
How many crawlers are needed to crawl all Korean tweets/blogs?
How about crawling Facebook?
- How to identify “anti communities” (like “Anti-Samsung”)? Who are the power bloggers?
The solutions required go far beyond sentiment analysis.
But customers often can't afford, or don't want, crawling infrastructure and maintenance fees.
New opportunities to deliver software in forms other than traditional package sales: SaaS/PaaS/IaaS
(Software/Platform/Infrastructure as a Service).
Even in the enterprise, a distributed framework is required (not only for web-scale services).
- Customers (at least in Korea) love knowing the technology and are increasingly high-level users.
They buy not only solutions but also consulting/expertise.
- Projects are more and more expensive, and many require benchmarks or a POC.
Future Work & Plan
• Future Work (ongoing)
- Acquire more entries for the sentiment dictionary
- Build a framework for handling linguistic rules and statistical models
(SVM/Rocchio)
- Coupling with antonyms and/or hints
- Better handling of negation
- Better workbench for faster acquisition / (re-)training
- Co-reference resolution
- (Full/semi) parsing?
- More complex models than binary classification?
- Building/maintaining a platform for PaaS/SaaS
A long, long way to go…
Questions?
Thank you.

QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 

Dernier (20)

[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 

VOC real world enterprise needs

  • 1. Lessons Learned from a VOC Analysis System for a Big Korean Telecommunication Company. Ivan Berlocher, SALTLUX. Sentiment Analysis Symposium, Nov. 9th 2011
  • 2. Introduction
      • Saltlux Inc. is located in Seoul, Korea; it was established in 1979 and renovated in 2003.
      • Domain of expertise: Information Retrieval and Text/Data/Web/Graph Mining solutions and services based on Semantic Web technology.
      • Main languages supported: Korean, Japanese, English. Other languages rely on external solutions.
      • 70 employees in Seoul, one development center in Vietnam (12 employees), and one sales office in Japan (3 employees).
      • Several partnerships with other companies and institutes:
        – Ontoprise in Germany
        – Franz in California
        – DERI in Ireland
      • Many R&D partnerships (ETRI, KAIST, universities…)
  • 3. Table of Contents
      • Project & Environment Description
        – Needs of Customer
        – System (Main) Requirements
      • VOC Data
        – Sample Data
        – Data Analysis
      • System Overview
      • Korean Linguistic
      • Sentiment Analysis
      • Lessons Learned
      • Future Work
  • 4. Project & Environment Description
      • Needs of Customer
        – Customer: a Korean telecommunication corporation
        – Department: Voice of Customer Analysis
        – Mission: analyze (human-typed) memos from all call centers to identify major problems and produce reports for decision makers, in order to improve quality of service and increase customer satisfaction.
        – Data: human-typed notes covering any kind of question from customers:
          • Information about subscriptions
          • Inquiries or complaints about devices (phones), services, or dealerships
          • Complaints about quality of communication
          • etc.
      • Number of notes: ~200 thousand a day (~5 million a month). Notes must remain searchable for 1 year (~60 million).
  • 5. Project & Environment Description
      • System (Main) Requirements
        – Distinguish simple inquiries from complaints
        – Classify notes into service categories/departments
        – Monitor topic trends in real time, daily, weekly, and monthly
        – Compare trends between time slices
        – Find related topics
        – Manage a personal vocabulary
        – Anonymize personal data (names, telephone numbers, social IDs, addresses, etc.)
      • The project started in October 2010 with a 3-month POC (~10 MM). After acceptance, integration with the real system took another 3 months (~10 MM). Total for the 2 phases: ~$200,000.
  • 6. VOC Data Sample
  • 7. VOC Data Sample
      • Data often contains some structured information (metadata), but without any standard format.
      • Most of the time there is no particular marker or metadata at all, which makes Named Entity Recognition more complex.
      • The same information is entered in many different ways (e.g. 연락처: phone number).
  • 8. VOC Data Analysis
      • Data contains lots of named entities (products, services, people, social IDs, phone numbers), often privacy-related
      • Data contains lots of technical (domain) terms
      • The real content to analyze is mostly very short (tweet-like) but sometimes very long
      • Lots of misspellings and typing errors
      • The Korean (Asian) segmentation problem, amplified by the speed constraint
      • Lots of (non-standard) abbreviations
  • 9. System Overview
      • Overall Architecture (components shown in the diagram): Text Segmentation, Morphological Analyzer, Chunk/Phrase Identification, Named Entities Recognition, Synonyms & Normalization, Indexing, Distributed Indexes, Classifier (Hybrid SVM & Rules), Analysis Phase, Searching/Clustering (TopicRank), Timelines Dumper, DFS Timelines (files such as 20110713_0700_1.df, 20110713_0700_2.df, 20110713_0700_3.df, 20110713_0710_1.df, 20110713_0710_2.df, 20110713_0710_3.df), Scheduler, Merger & Ranker, Trend (TopN) DB, Web Server (Web UI), Complaint Detector.
      • In the real system, indexing was parallelized across 18 Linux machines for speed.
  • 10. System Overview: Home page
  • 11. System Overview: Top N Keywords Extraction
  • 12. System Overview: Related Keywords (Word Clustering)
  • 13. System Overview: Trend (Timeline) view
  • 14. System Overview: Tweets view
  • 15. Korean Linguistic
      • Brief introduction
        – Korean writing is alphabetic, built from consonants and vowels. Each written block combines consonant + vowel or consonant + vowel + consonant.
        – Example: “나는 학생입니다.”
          • 나 = ㄴ (N) + ㅏ (A) = NA
          • 학 = ㅎ (H) + ㅏ (A) + ㄱ (K) = HAK
        – One consonant/vowel or consonant/vowel/consonant unit is a syllable (“eumjeol”). Words are composed of several syllables, and a space-delimited unit of text is called an “eojeol”.
      • Basic grammar: a word is composed of one root (noun, adjective, or verb) followed by an inflection. For nouns the inflection marks the grammatical role (subject, object, location, etc.) and is called “josa”; for verbs and adjectives it marks aspect or mood (tense, honorific form, etc.) and is called “eomi”.
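The composition of a syllable block from consonants and vowels can be reproduced with plain Unicode arithmetic. The following is a minimal sketch, not part of the original slides: the function name `decompose` is hypothetical, and it relies only on the standard layout of precomposed Hangul syllables (U+AC00 to U+D7A3, 19 initials × 21 vowels × 28 finals).

```python
# Jamo tables in Unicode order for precomposed Hangul syllables.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"        # 21 vowels
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals

HANGUL_BASE = 0xAC00
HANGUL_COUNT = 19 * 21 * 28  # 11172 precomposed syllables


def decompose(syllable):
    """Split one Hangul syllable into (initial, vowel, final).

    The final is an empty string when the syllable has no closing consonant.
    Raises ValueError for anything that is not a precomposed Hangul syllable.
    """
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < HANGUL_COUNT:
        raise ValueError(f"not a Hangul syllable: {syllable!r}")
    initial = CHOSEONG[index // (21 * 28)]
    vowel = JUNGSEONG[(index % (21 * 28)) // 28]
    final = JONGSEONG[index % 28]
    return initial, vowel, final


if __name__ == "__main__":
    for ch in "나학":
        print(ch, decompose(ch))
```

Applied to the slide's examples, 나 splits into ㄴ + ㅏ with no final consonant, and 학 splits into ㅎ + ㅏ + ㄱ.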
  • 16. Korean Linguistic
      • Examples: “나는 학생입니다.”
        – “나는” = “나” (na: I/me) + “는” (neun: topic marker)
        – “학생입니다” = “학생” (hak-seng: student) + “입니다” (im-ni-da: am)
        – => “I'm (a) student.”
      • Lots of (composite) inflectional forms:
        – 학생 + 입니다 = noun + “be”
        – 학생 + 인/이예요/이다/입니까?/인데/인데요 etc. (was, will be …) (eomi)
        – 학생 + syntactic role (이: subject / 에게: to / 한테: from / 을: object) etc. (josa)
      • Korean is highly agglutinative (it concatenates prefixes, nouns, josa, and eomi). Search engine: 검색엔진. High-performance search engine: 고성능검색엔진.
      • The use of spaces, however, is free/arbitrary: 검색엔진 and 검색 엔진 are equivalent.
      • Especially on SNS, on space-limited devices, or under speed constraints (such as real-time transcription of conversations), spaces are more and more often omitted or misused. => Automatic Segmentation Correction is needed.
  • 17. Project & Environment Description: Automatic Segmentation Correction Illustration
  • 18. Korean Linguistic
      • Automatic Segmentation Correction: Implementation
      • Binary classification approach: tag each syllable according to whether a space precedes it or not. Any kind of classifier can be used; here we use a CRF model (an SVM would also work).
      • Example input: 프랑스의 세계적인 디자이너 …
      • Features
        – 1-gram, 2-gram, 3-gram, and 4-gram of characters (syllables)
        – Whether the character is Korean or not, and whether it contains a number
      • Evaluation
        – Accuracy at character level: 96.25%
        – Precision at word level: 95.58%, where word precision = (# correctly spaced words) / (# words produced by the system)
      • Very simple to train (huge amounts of training data are easy to get)
      • No need for a lexicon or any lexical information
      • Performs surprisingly well
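The data preparation and the word-level metric described on this slide can be sketched in a few lines. This is an illustrative sketch only: the function names (`to_labels`, `features`, `word_precision`) and the exact feature layout (n-grams to the left and right of the position) are assumptions, and the CRF itself is left out; any sequence or binary classifier could consume these features.

```python
def to_labels(spaced):
    """Turn correctly spaced text into (characters, labels).

    labels[i] is 1 when a space precedes characters[i], else 0.
    """
    chars, labels = [], []
    space_before = False
    for ch in spaced:
        if ch == " ":
            space_before = True
            continue
        labels.append(1 if space_before and chars else 0)
        chars.append(ch)
        space_before = False
    return chars, labels


def is_hangul(ch):
    return "\uac00" <= ch <= "\ud7a3"


def features(chars, i, max_n=4):
    """Character 1..4-grams on both sides of position i, plus two flags."""
    feats = {}
    for n in range(1, max_n + 1):
        feats[f"left{n}"] = "".join(chars[max(0, i - n):i])
        feats[f"right{n}"] = "".join(chars[i:i + n])
    feats["is_korean"] = is_hangul(chars[i])
    feats["is_digit"] = chars[i].isdigit()
    return feats


def apply_labels(chars, labels):
    """Rebuild a spaced string from characters and predicted labels."""
    return "".join((" " if y else "") + ch for ch, y in zip(chars, labels))


def word_spans(spaced):
    """Character-offset spans (start, end) of each word, ignoring spaces."""
    spans, pos = set(), 0
    for word in spaced.split():
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans


def word_precision(predicted, reference):
    """# correctly spaced words / # words produced by the system."""
    pred = word_spans(predicted)
    return len(pred & word_spans(reference)) / len(pred)


if __name__ == "__main__":
    reference = "프랑스의 세계적인 디자이너"
    chars, labels = to_labels(reference)
    print(labels)
    print(features(chars, 4))
    print(word_precision("프랑스의 세계 적인 디자이너", reference))
```

Because the labels come directly from text that is already correctly spaced, any large, clean corpus yields training data for free, which is why the slide describes the approach as very simple to train.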
  • 19. Korean Linguistic
      • Transliteration
        – Korean uses more and more English-derived words transliterated phonetically into the Korean alphabet (the reverse of “Romanization”), especially for foreign names (companies, products, people, technical/domain terms).
        – Transcription is neither unique nor standardized.
      • Examples:
        – tablet: 태블릿, 타블렛, 테블릿
        – Hitachi: 히타치, 히타찌, 히다찌
        – iPhone 4s: 아이폰 4s, 아이폰포에스, 아이폰 포에스
  • 20. Korean Linguistic
      • Automatic transliteration recognition
        – Build a rule-based recognizer on top of phonetic transliteration, acting similarly to Soundex but adapted to Korean pronunciation. Example for tablet / 태블릿: T => ㅌ/ㄸ/ㄷ, A => ㅏ/ㅓ/ㅔ/ㅐ, etc.
        – This method has high recall but low precision and needs post-processing filters (remove known Korean words found in lexicons, remove nouns that are too short, etc.).
        – The result has to be corrected by humans, so an efficient workbench is needed for productivity.
        – Gathered a dictionary of 130 thousand entries, mainly IT oriented.
        – More academic research is still needed to solve this problem.
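A Soundex-like key for Hangul can be sketched as follows. This is an illustration under stated assumptions, not the rule set used in the project: the consonant classes and the name `phonetic_key` are hypothetical, vowels are dropped entirely (as in classic Soundex) rather than grouped into classes like the slide's A => ㅏ/ㅓ/ㅔ/ㅐ, and only the Korean side is covered, so the sketch groups variant Hangul spellings but does not map English spellings to them.

```python
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

# Hypothetical sound classes: plain, tense and aspirated consonants share a code.
INITIAL_CLASS = {
    "ㄱ": "K", "ㄲ": "K", "ㅋ": "K",
    "ㄴ": "N",
    "ㄷ": "T", "ㄸ": "T", "ㅌ": "T",
    "ㄹ": "L",
    "ㅁ": "M",
    "ㅂ": "P", "ㅃ": "P", "ㅍ": "P",
    "ㅅ": "S", "ㅆ": "S",
    "ㅇ": "",              # silent in initial position
    "ㅈ": "C", "ㅉ": "C", "ㅊ": "C",
    "ㅎ": "H",
}
# Syllable-final consonants: several of them are pronounced as an unreleased [t].
FINAL_CLASS = {
    "ㄱ": "K", "ㄲ": "K", "ㅋ": "K",
    "ㄴ": "N",
    "ㄷ": "T", "ㅅ": "T", "ㅆ": "T", "ㅈ": "T", "ㅊ": "T", "ㅌ": "T",
    "ㄹ": "L",
    "ㅁ": "M",
    "ㅂ": "P", "ㅍ": "P",
    "ㅇ": "G",
}


def phonetic_key(word):
    """Consonant-skeleton key; adjacent identical codes are collapsed."""
    codes = []
    for ch in word:
        index = ord(ch) - 0xAC00
        if not 0 <= index < 11172:
            continue  # skip anything that is not a precomposed Hangul syllable
        initial = CHOSEONG[index // 588]
        final = JONGSEONG[index % 28]
        for code in (INITIAL_CLASS[initial], FINAL_CLASS.get(final, "")):
            if code and (not codes or codes[-1] != code):
                codes.append(code)
    return "".join(codes)


if __name__ == "__main__":
    for w in ["태블릿", "타블렛", "테블릿", "히타치", "히타찌", "히다찌"]:
        print(w, phonetic_key(w))
```

A key this coarse merges many unrelated words as well, which matches the slide's observation that the approach gives high recall but low precision and needs lexicon-based filtering afterwards.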
  • 21. Sentiment Analysis
      • Complaint Detection
        – A problem similar to standard Subjectivity Detection (detecting whether a sentence is sentiment-bearing or not).
        – Simple approach: binary classification using an SVM, with manually tagged training/test corpora (more than 20 thousand notes).
        – Feature space: n-grams of characters (syllables) plus n-grams of words; 2-grams to 4-grams gave the best results.
        – Feature selection is important to reduce the feature space. Chi-square and Information Gain gave the best results.
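Character n-gram extraction and chi-square feature scoring can be sketched in pure Python as below. The function names and the four toy English notes are made up for illustration; the project's actual corpus, tokenization, and SVM training are not shown.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=4):
    """Set of character n-grams (2- to 4-grams by default) found in text."""
    return {
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    }


def chi_square_scores(docs, labels):
    """Chi-square score of each n-gram against a binary label (1 = complaint)."""
    n_docs = len(docs)
    n_pos = sum(labels)
    doc_freq = Counter()  # number of documents containing the n-gram
    pos_freq = Counter()  # number of complaint documents containing it
    for text, label in zip(docs, labels):
        for gram in char_ngrams(text):
            doc_freq[gram] += 1
            pos_freq[gram] += label
    scores = {}
    for gram, total in doc_freq.items():
        a = pos_freq[gram]        # present, complaint
        b = total - a             # present, not a complaint
        c = n_pos - a             # absent, complaint
        d = n_docs - n_pos - b    # absent, not a complaint
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[gram] = n_docs * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def top_features(docs, labels, k):
    """Keep the k highest-scoring n-grams (ties broken alphabetically)."""
    scores = chi_square_scores(docs, labels)
    return sorted(scores, key=lambda g: (-scores[g], g))[:k]


if __name__ == "__main__":
    # Made-up toy notes: 1 = complaint, 0 = simple inquiry.
    docs = ["bad signal", "bad service", "plan info", "bill info"]
    labels = [1, 1, 0, 0]
    print(top_features(docs, labels, 5))
```

Only the selected n-grams would then be used as dimensions of the classifier's feature vectors, which keeps the feature space manageable.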
  • 22. Sentiment Analysis
      • Problem: no freely available resource such as SentiWordNet exists for Korean, so it has to be built.
      • Built our own general-domain dictionary as a baseline: 20,000 verbs/adjectives classified as positive, negative, or neutral. The result is a lexicon of ~5,000 entries (positive/negative only).
      • Enriched it with features manually extracted from n-grams.
      • Results: precision-oriented (92%) but recall is still quite low (75%). Overall accuracy: 85%.
      • => Still working on ways to improve recall without sacrificing precision.
      • Basic ideas:
        – Bagging / Boosting (combining several classifiers)
        – Hybrid models combining linguistic (semantic/syntactic) rules with machine learning (statistics)
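To put the reported precision and recall on a single scale, the standard F1 score (their harmonic mean) can be computed as below. The F1 value is derived here from the slide's two figures and is not a number stated in the original deck.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Figures reported on the slide: precision 92%, recall 75%.
    print(round(f1_score(0.92, 0.75), 4))
```

With these inputs the score comes out at roughly 0.83, showing how much the lower recall pulls the combined measure below the 92% precision.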
  • 23. Lessons Learned
      – There is still quite a big gap between customer expectations and reality. It is necessary to explain this and keep the customer involved in the assessment process and in the acquisition of knowledge and domain vocabulary.
      – A lot of lexicons need to be acquired: named entities, synonyms, stopwords, sentiment words.
      – The quality and quantity of these lexicons are a real asset for the company. Acquiring lexicons requires workbenches for efficient semi-supervised methods (manually filtering the output of automatic methods) to reduce costs.
      – Tuning classifier parameters, feature extraction, linguistic knowledge, etc. consumes time and expertise.
      – Simple academic methods work quite well (even if they need a lot of tuning).
      – Beyond simple search engines, the quality of NLP components is becoming more and more important, especially for Sentiment Analysis.
  • 24. Lessons Learned
      – Customers are more and more interested in “Big Data”, “Listening Platform”, “Cloud”, “Social Network/Intelligence”…
      – More and more customers want data and opinions from outside their in-house systems (blogs, communities (BBS), tweets, etc.). Typical questions: How many crawlers are needed to crawl all Korean tweets/blogs? What about crawling Facebook?
      – How to identify “anti communities” (like “Anti-Samsung”)? Who are the power bloggers? The solutions required go far beyond Sentiment Analysis, but customers often can't afford or don't want the crawling infrastructure and maintenance fees.
      – New opportunities to deliver software in forms other than traditional package sales: SaaS/PaaS (Software/Platform/Infrastructure as a Service). Even in the enterprise, a distributed framework is required (not only for web-scale services).
      – Customers (at least in Korea) love knowing the technology and are more and more advanced users. They buy not only solutions but also consulting and expertise.
      – Projects are more and more expensive, and many require benchmarks or a POC.
  • 25. Future Work & Plan
      • Future Work (on-going)
        – Acquire more entries for the sentiment dictionary
        – Build a framework for handling both linguistic rules and statistical models (SVM/Rocchio)
        – Coupling with antonyms and/or hints
        – Better handling of negation
        – A better workbench for faster acquisition and (re-)training
        – Co-reference resolution
        – (Full/Semi) parsing?
        – More complex models than binary classification?
        – Building/maintaining a platform for PaaS/SaaS
      • A long, long way to go…
  • 26. Questions? Thank you.