Michael Ferretti
Tony Russell-Rose
Vladimir Zelevinsky
Text Analytics: Yesterday, Today and Tomorrow
2
Part 1 (of 3)
WHAT?
3
What is Text Analytics?
• A set of linguistic, analytical and predictive techniques to extract structure and meaning from unstructured documents
– Text Analytics ~= Natural Language Processing ~= Text Mining
– Text Mining → Scientific / technical context, automated processing
– Text Analytics → Business context, interactive apps
Copyright ©2010 Endeca Technologies, Inc. All rights reserved. Proprietary and confidential.
“NLP”
vs.
“Text Analytics”
4
Why is Text Analytics Important?
• ‘80% of corporate information is unstructured’
– Entire value chain for some organisations (media, publishing, etc.)
– Retail / eCommerce: product reviews
– User-generated content: blogs, forums, wikis
– Voice of the Customer: social media + sentiment analysis
• 161 billion gigabytes (161 exabytes) of digital information in 2006
– approximately 988 exabytes by 2010
– Audio / video still needs summaries, tags, etc.
5
How Difficult Can It Be?
• As humans we do it effortlessly ... don’t we?
• DRUNK GETS NINE YEARS IN VIOLIN CASE
• PROSTITUTES APPEAL TO POPE
• STOLEN PAINTING FOUND BY TREE
• RED TAPE HOLDS UP NEW BRIDGE
• DEER KILL 300,000
• RESIDENTS CAN DROP OFF TREES
• INCLUDE CHILDREN WHEN BAKING COOKIES
• MINERS REFUSE TO WORK AFTER DEATH
6
Some Fundamentals
• Language is AMBIGUOUS
– To find structure, we must remove ambiguity!
• Lexical analysis (tokenisation)
– The cat sat on the mat
– I can’t tokenise this sentence
• Morphology (term variations, prefixes, suffixes, etc.)
– Computer, computing, compute, computed = comput*
– Delegate = de-leg-ate (?)
– Ratify = rat-ify (?)
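The tokenisation and morphology bullets above can be illustrated with a minimal sketch. The regular expression, the suffix list and the helper names `tokenise` and `crude_stem` are assumptions made for this example; they are not a real stemmer and not anything the slides prescribe.

```python
import re

# Illustrative pattern: a word (optionally with an internal apostrophe),
# a run of digits, or a single punctuation character.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)?|\d+|[^\w\s]")

# Illustrative suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "er", "e")


def tokenise(text):
    """Split text into word, number and punctuation tokens."""
    return TOKEN_RE.findall(text)


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print(tokenise("I can’t tokenise this sentence"))
print({w: crude_stem(w) for w in ["Computer", "computing", "compute", "computed"]})
```

All four "comput*" variants collapse to the same stem, but the same rules also turn "mother" into "moth", which is the kind of over-eager segmentation the "rat-ify (?)" bullet warns about.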
7
More Fundamentals
• Syntax (part of speech tagging)
– Time flies like an arrow
– Fruit flies like a banana
– Eats shoots and leaves
• Parsing (grammar)
– I saw a Venetian blind
– I saw a blind Venetian
– Rugby is a game played by men with odd-shaped balls
• Sentence boundary detection
– Punctuation denotes the end of a sentence!
– “But not always!”, said Fred...
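To make the sentence boundary point concrete, here is a small sketch comparing a naive splitter with one that knows a few abbreviations. The abbreviation list and both function names are assumptions for illustration only; production systems typically use trained models instead.

```python
import re

# Illustrative abbreviation list; a real system would need a far larger one.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "etc.", "vs."}


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_sentences(text):
    """Like naive_split, but re-join pieces that end in a known abbreviation."""
    sentences = []
    carry = ""
    for piece in naive_split(text):
        candidate = f"{carry} {piece}".strip()
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            carry = candidate  # not a real boundary, keep accumulating
        else:
            sentences.append(candidate)
            carry = ""
    if carry:
        sentences.append(carry)
    return sentences


text = "Mrs. Clinton spoke. Dr. Smith listened."
print(naive_split(text))      # four pieces: the abbreviations are cut off
print(split_sentences(text))  # two sentences
```

The naive version treats every full stop as a boundary; the second version only helps for abbreviations it has been told about, so it is a patch rather than a solution.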
8
Named Entity Recognition/Information Extraction
• Companies in New York != New companies in York
• People, places, organisations ...
– Increase precision
– Support navigation
– Facilitate translation, summarisation, speech synthesis, etc.
• IE = template filling
– Entities + relationships
– Highly context dependent
• Problems with:
– Anaphora resolution
– Word sense disambiguation
9
Question Answering
• Give me answers, not documents!
• Fact-finding vs. exploratory search
– Yes/no questions: ‘Is George W. Bush the current president of the USA?’
– ‘Who’ questions: ‘Who was the British Prime Minister before Margaret Thatcher?’
– List questions: ‘Which football teams have won the Champions League this decade?’
– Instruction-based questions: ‘How do I cook lasagne?’
– Explanation questions: ‘Why did World War I start?’
– Commands: ‘Tell me the height of the Eiffel Tower.’
• Question analysis → document retrieval → answer extraction
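The first stage of the pipeline above, question analysis, can be sketched as a lookup on the opening word. The rule table and the `question_type` function are hypothetical and deliberately simplistic; they only cover the question types listed on this slide.

```python
# Hypothetical rules: (set of opening words, question type label).
RULES = [
    ({"is", "are", "was", "were", "do", "does", "did", "can"}, "yes/no"),
    ({"who"}, "who"),
    ({"which"}, "list"),
    ({"how"}, "instruction"),
    ({"why"}, "explanation"),
    ({"tell", "give", "show"}, "command"),
]


def question_type(question):
    """Guess the question type from the first word of the question."""
    words = question.strip("‘’'\" ").lower().split()
    if not words:
        return "unknown"
    for starters, label in RULES:
        if words[0] in starters:
            return label
    return "unknown"


for q in [
    "Is George W. Bush the current president of the USA?",
    "Which football teams have won the Champions League this decade?",
    "Tell me the height of the Eiffel Tower.",
]:
    print(question_type(q), "<-", q)
```

A first-word rule breaks quickly (for example, "How tall is the Eiffel Tower?" is a fact question, not an instruction request), which is one reason question analysis is usually treated as a classification problem in its own right.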
10
Part 2 (of 3)
HOW?
11
Definitions
Text Analytics is Computer Science + Semantics.
Semantics is the study of meaning.
Universal flowchart:
12
Meaning and structure
No mind reading (yet). Have to use text.
Text approximates meaning.
Meaning is structured.
CONCEPT CONCEPT CONCEPT
13
Problems with text
Synonymy: one concept maps to different words.
Polysemy: one word maps to different concepts.
14
Simplest structure: salient terms
Many years later, as he faced the firing squad, Colonel
Aureliano Buendía was to remember that distant afternoon
when his father took him to discover ice.
– García Márquez (1967)
15
Typed entities
People, places, organizations; etc.
Simple approach: word lists.
More difficult: trained extractors (including sentiment).
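The "simple approach" of word lists can be sketched as a gazetteer lookup that prefers the longest match. The gazetteer contents and the `tag_entities` function are assumptions made up for this illustration.

```python
import re

# Hypothetical word lists (gazetteer), keyed by entity type.
GAZETTEER = {
    "person": {"Mrs. Clinton", "Colonel Aureliano Buendía"},
    "place": {"New York", "York"},
    "organisation": {"Endeca"},
}


def tag_entities(text):
    """Return (start, end, type, surface) for non-overlapping longest matches."""
    candidates = []
    for etype, names in GAZETTEER.items():
        for name in names:
            pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
            for match in re.finditer(pattern, text):
                candidates.append((match.start(), match.end(), etype, name))
    # Prefer longer matches, then earlier ones; drop anything that overlaps.
    candidates.sort(key=lambda c: (-(c[1] - c[0]), c[0]))
    chosen = []
    for cand in candidates:
        if all(cand[1] <= c[0] or cand[0] >= c[1] for c in chosen):
            chosen.append(cand)
    return sorted(chosen)


print(tag_entities("Companies in New York"))
print(tag_entities("New companies in York"))
```

Longest-match lookup is enough to keep "New York" and "York" apart in these two phrases, but a word list cannot tell which "York" or which "Clinton" is meant; that is where the trained extractors come in.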
16
Highest clarity organizations for “baseball”
Year | Top terms              | World Series teams
1987 | Cardinals; Twins       | Cardinals; Twins
1988 | Dodgers; Mets          | Dodgers; Athletics
1989 | Athletics; Giants      | Athletics; Giants
1991 | Braves; Twins          | Braves; Twins
1992 | Blue Jays; Braves      | Blue Jays; Braves
1996 | Yankees; Braves        | Yankees; Braves
1997 | Indians; Marlins       | Indians; Marlins
1998 | Yankees; Padres        | Yankees; Padres
1999 | Braves; Mets; Yankees  | Braves; Yankees
2000 | Yankees; Mets          | Yankees; Mets
2001 | Diamondbacks; Yankees  | Diamondbacks; Yankees
2003 | Marlins                | Marlins; Yankees
Fail: 1990, 1993, 1995, 2002.
17
Salient terms on a timeline: baseball
No event in 1994!
Clarity scores for top terms for “baseball” search:
18
Salient terms on a timeline: Iraq
19
Case study: ACM
Excellent corpus:
Research articles.
Written by humans.
Tagged by authors.
But:
Half the articles untagged.
Tags sparse (90% of tags used once!).
Synonyms abound.
tags → controlled tag vocabulary → high-scoring salient tags
20
21
Clusters
Co-occurrence: salient terms that tend to occur together belong together.
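One way to read the co-occurrence idea is: count how often two salient terms share a document, then link the pairs that do so often enough. The sketch below does exactly that with made-up documents; the threshold, the union-find grouping and the function names are assumptions, not the method used in the talk.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(docs):
    """Count how many documents each unordered pair of terms shares."""
    counts = Counter()
    for terms in docs:
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return counts


def clusters(docs, min_count=2):
    """Group terms linked by pairs that co-occur in at least min_count docs."""
    parent = {}

    def find(term):
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path compression
            term = parent[term]
        return term

    for (a, b), n in cooccurrence(docs).items():
        if n >= min_count:
            parent[find(a)] = find(b)

    groups = {}
    for term in parent:
        groups.setdefault(find(term), set()).add(term)
    return sorted(sorted(group) for group in groups.values())


# Made-up salient terms per document.
docs = [
    ["java", "island", "indonesia"],
    ["java", "code", "sun"],
    ["java", "coffee", "beans", "brew"],
    ["coffee", "beans"],
    ["island", "indonesia"],
]
print(clusters(docs))
```

With this toy data, "coffee"/"beans" and "island"/"indonesia" each share two documents and form clusters, while "java" appears with everything exactly once and stays unclustered at the chosen threshold.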
22
Clusters for disambiguation
23
Apple, meaning 1
24
Apple, meaning 2
25
Apple, meaning 3
26
Information Scent
The human brain is great at extracting information scent:
[word, word, word, …] → meaning
[island, Indonesia] [code, Sun] [coffee, beans, brew] → Java
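The "Java" example suggests a simple way to follow the scent mechanically: pick the meaning whose cue words overlap most with the surrounding words. The sense labels and the `best_sense` function are assumptions for this sketch; the cue words are the ones on the slide.

```python
# Cue words taken from the slide; the sense labels are made up for this sketch.
SENSES = {
    "Java (island)": {"island", "indonesia"},
    "Java (programming language)": {"code", "sun"},
    "Java (coffee)": {"coffee", "beans", "brew"},
}


def best_sense(context_words):
    """Pick the sense whose cue words overlap most with the context."""
    context = {w.lower() for w in context_words}
    scores = {sense: len(cues & context) for sense, cues in SENSES.items()}
    winner = max(scores, key=scores.get)
    return winner if scores[winner] > 0 else None


print(best_sense("I brew Java from fresh beans".split()))
print(best_sense("Sun released new Java code".split()))
print(best_sense("Java".split()))  # no cues in context
```

When no cue word is present the function returns `None` rather than guessing, which mirrors the point that a bare query term often carries too little scent to disambiguate.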
27
Vector model
– Salton (1983)
Similarity between documents = cosine of the angle between their vectors
Can also rotate basis for the best representation: LSI
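As a minimal sketch of the vector model, each document below becomes a bag-of-words count vector and similarity is the cosine of the angle between two such vectors. Whitespace tokenisation and raw counts are simplifying assumptions; real systems usually add term weighting such as TF-IDF, and LSI is not shown here.

```python
import math
from collections import Counter


def cosine(doc_a, doc_b):
    """Cosine similarity between two documents as raw term-count vectors."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine("time flies like an arrow", "fruit flies like a banana"))
print(cosine("the cat sat", "the cat sat"))
print(cosine("cat", "dog"))
```

The two "flies like" sentences share two of their five terms each, so they score 0.4 even though they mean very different things: the vector model measures shared vocabulary, not shared meaning.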
28
Semantic networks
emotion →
29
Custom Dimensions
30
Custom Dimensions
31
Sentence structure parsing
32
Sentence structure parsing
It is said Mrs. Clinton promises new jobs will be created by her.
N V V N N V A N V V V N
part of speech tagging
noun / verb phrase extraction
sentence structure analysis
anaphora resolution
passive voice flipping
triple filtering
hierarchy generation
33
Hierarchy generation (also semantic network!)
Nouns by head noun:
[Mrs. + Hillary + Bill + President] → Clinton
Verbs by hypernyms (broadening synonyms):
[say + tell + propose + suggest + declare] → express
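The two generalisation rules can be sketched on subject-verb-object triples. This is only an illustration under stated assumptions: the head noun is taken to be the last word of the phrase, the hypernym table is a hand-written stand-in for a lexical resource such as WordNet, and the function names are hypothetical.

```python
from collections import defaultdict

# Hand-written stand-in for a hypernym lookup, using the verbs on the slide.
HYPERNYM = {
    "say": "express",
    "tell": "express",
    "propose": "express",
    "suggest": "express",
    "declare": "express",
}


def head_noun(phrase):
    """Assume the head of a noun phrase is its last word."""
    return phrase.split()[-1]


def generalise(triple):
    """Map a (subject, verb, object) triple to its broader form."""
    subject, verb, obj = triple
    return (head_noun(subject), HYPERNYM.get(verb, verb), head_noun(obj))


def group_triples(triples):
    """Group specific triples under their generalised parent."""
    groups = defaultdict(list)
    for triple in triples:
        groups[generalise(triple)].append(triple)
    return dict(groups)


triples = [
    ("Mrs. Clinton", "propose", "new jobs"),
    ("Hillary Clinton", "suggest", "jobs"),
    ("President Clinton", "declare", "victory"),
]
for parent, children in group_triples(triples).items():
    print(parent, "<-", children)
```

The first two triples collapse under ("Clinton", "express", "jobs"), giving one level of the hierarchy. The third shows the cost of the last-word assumption: a different Clinton is merged into the same parent noun.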
34
Idea Navigation
35
Idea Navigation
36
Idea Navigation
37
Idea Navigation
38
Part 3 (of 3)
WHO?
39
Text Analytics for fun and profit
• For profit:
– Lexalytics
  • Text Enrichment module
  • Text Enrichment with Sentiment Analysis
– Alias-i
  • Term Discovery
– Nstein
  • Newssift
• For fun:
– GATE (Sheffield University)
  • Open source, linguistic focus
– RapidMiner (University of Dortmund / Rapid-I)
  • Open source community edition, data mining focus
– WordNet, OpenCalais, LingPipe, NLPwiki, etc.
40
Market Outlook: 2010
• Market maturing & expanding: 75-200%
– Most vendors on target
• Dominant markets:
– CX, media/publishing, FS & insurance, intelligence, life sciences, e-discovery
• Solutions still not standardized
– Need for self-service tuning & configuration
• Massive expansion in social media
– Lightweight NLP for buzz analysis, brand monitoring, etc.
• Partner ecosystem developing
– Marketing services providers, platform vendors, CRM + call centre vendors, system integrators
41
Conclusions
What do we expect in the future?
• Extraction leads to generation
• Summarization
• Generalization
• Narratives
• Inference and conflict resolution
We are all interested in the future, for that is where you and I
are going to spend the rest of our lives. And remember, my
friend, future events such as these will affect you in the future.
– Edward Wood Jr. (1957)
42
Text analytics: what does it mean?
Unstructured text isn't unstructured. There's always structure.
Find the information scent. Let the users follow it.
Don’t trust that one query is enough. Let the users interact.
Text does not matter. Meaning does.

Contenu connexe

Tendances

Text Analytics Past, Present & Future: An Industry View
Text Analytics Past, Present & Future: An Industry ViewText Analytics Past, Present & Future: An Industry View
Text Analytics Past, Present & Future: An Industry ViewSeth Grimes
 
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerOpenSource Connections
 
Best Practices for Large Scale Text Mining Processing
Best Practices for Large Scale Text Mining ProcessingBest Practices for Large Scale Text Mining Processing
Best Practices for Large Scale Text Mining ProcessingOntotext
 
Directed versus undirected network analysis of student essays
Directed versus undirected network analysis of student essaysDirected versus undirected network analysis of student essays
Directed versus undirected network analysis of student essaysRoy Clariana
 
Twitter data analysis using R
Twitter data analysis using RTwitter data analysis using R
Twitter data analysis using Rsantoshi mangalgi
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceGabriel Moreira
 
Finding Influential People from Historical News Repository (Master's thesis w...
Finding Influential People from Historical News Repository (Master's thesis w...Finding Influential People from Historical News Repository (Master's thesis w...
Finding Influential People from Historical News Repository (Master's thesis w...Aayushee Gupta
 
An Introduction to Information Retrieval and Applications
 An Introduction to Information Retrieval and Applications An Introduction to Information Retrieval and Applications
An Introduction to Information Retrieval and Applications sathish sak
 
Searching for Meaning
Searching for MeaningSearching for Meaning
Searching for MeaningTrey Grainger
 
Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Trey Grainger
 
South Big Data Hub: Text Data Analysis Panel
South Big Data Hub: Text Data Analysis PanelSouth Big Data Hub: Text Data Analysis Panel
South Big Data Hub: Text Data Analysis PanelTrey Grainger
 
Semantic Perspectives for Contemporary Question Answering Systems
Semantic Perspectives for Contemporary Question Answering SystemsSemantic Perspectives for Contemporary Question Answering Systems
Semantic Perspectives for Contemporary Question Answering SystemsAndre Freitas
 
How to be successful with search in your organisation
How to be successful with search in your organisationHow to be successful with search in your organisation
How to be successful with search in your organisationvoginip
 
The Next Generation of AI-powered Search
The Next Generation of AI-powered SearchThe Next Generation of AI-powered Search
The Next Generation of AI-powered SearchTrey Grainger
 
Information Retrieval Fundamentals - An introduction
Information Retrieval Fundamentals - An introduction Information Retrieval Fundamentals - An introduction
Information Retrieval Fundamentals - An introduction Grace Hui Yang
 
Topic map for Topic Maps case examples
Topic map for Topic Maps case examplesTopic map for Topic Maps case examples
Topic map for Topic Maps case examplestmra
 
GATE: a text analysis tool for social media
GATE: a text analysis tool for social mediaGATE: a text analysis tool for social media
GATE: a text analysis tool for social mediaDiana Maynard
 

Tendances (20)

Text Analytics Past, Present & Future: An Industry View
Text Analytics Past, Present & Future: An Industry ViewText Analytics Past, Present & Future: An Industry View
Text Analytics Past, Present & Future: An Industry View
 
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
 
Best Practices for Large Scale Text Mining Processing
Best Practices for Large Scale Text Mining ProcessingBest Practices for Large Scale Text Mining Processing
Best Practices for Large Scale Text Mining Processing
 
Directed versus undirected network analysis of student essays
Directed versus undirected network analysis of student essaysDirected versus undirected network analysis of student essays
Directed versus undirected network analysis of student essays
 
Twitter data analysis using R
Twitter data analysis using RTwitter data analysis using R
Twitter data analysis using R
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Finding Influential People from Historical News Repository (Master's thesis w...
Finding Influential People from Historical News Repository (Master's thesis w...Finding Influential People from Historical News Repository (Master's thesis w...
Finding Influential People from Historical News Repository (Master's thesis w...
 
An Introduction to Information Retrieval and Applications
 An Introduction to Information Retrieval and Applications An Introduction to Information Retrieval and Applications
An Introduction to Information Retrieval and Applications
 
Tesxt mining
Tesxt miningTesxt mining
Tesxt mining
 
Searching for Meaning
Searching for MeaningSearching for Meaning
Searching for Meaning
 
Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...
 
South Big Data Hub: Text Data Analysis Panel
South Big Data Hub: Text Data Analysis PanelSouth Big Data Hub: Text Data Analysis Panel
South Big Data Hub: Text Data Analysis Panel
 
Semantic Perspectives for Contemporary Question Answering Systems
Semantic Perspectives for Contemporary Question Answering SystemsSemantic Perspectives for Contemporary Question Answering Systems
Semantic Perspectives for Contemporary Question Answering Systems
 
How to be successful with search in your organisation
How to be successful with search in your organisationHow to be successful with search in your organisation
How to be successful with search in your organisation
 
The Next Generation of AI-powered Search
The Next Generation of AI-powered SearchThe Next Generation of AI-powered Search
The Next Generation of AI-powered Search
 
Information Retrieval Fundamentals - An introduction
Information Retrieval Fundamentals - An introduction Information Retrieval Fundamentals - An introduction
Information Retrieval Fundamentals - An introduction
 
Text Mining
Text MiningText Mining
Text Mining
 
Topic map for Topic Maps case examples
Topic map for Topic Maps case examplesTopic map for Topic Maps case examples
Topic map for Topic Maps case examples
 
The impact of standardized terminologies and domain-ontologies in multilingua...
The impact of standardized terminologies and domain-ontologies in multilingua...The impact of standardized terminologies and domain-ontologies in multilingua...
The impact of standardized terminologies and domain-ontologies in multilingua...
 
GATE: a text analysis tool for social media
GATE: a text analysis tool for social mediaGATE: a text analysis tool for social media
GATE: a text analysis tool for social media
 

En vedette

Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural HeritageBuild Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural HeritageOntotext
 
Analyzing social conversation: a guide to data mining and data visualization
Analyzing social conversation: a guide to data mining and data visualization Analyzing social conversation: a guide to data mining and data visualization
Analyzing social conversation: a guide to data mining and data visualization Tempero UK
 
Tip from IBM Connect 2014: Socialytics = Social Business, Big Social Data and...
Tip from IBM Connect 2014: Socialytics = Social Business, Big Social Data and...Tip from IBM Connect 2014: Socialytics = Social Business, Big Social Data and...
Tip from IBM Connect 2014: Socialytics = Social Business, Big Social Data and...SocialBiz UserGroup
 
Influence mapping Toolbox Presentation London 2015
Influence mapping Toolbox Presentation London 2015Influence mapping Toolbox Presentation London 2015
Influence mapping Toolbox Presentation London 2015Jun Julien Matsushita
 
Analysis and Visualization of Real-Time Twitter Data
Analysis and Visualization of Real-Time Twitter DataAnalysis and Visualization of Real-Time Twitter Data
Analysis and Visualization of Real-Time Twitter DataEducational Technology
 
Searching lexis nexis in power search mode
Searching lexis nexis in power search modeSearching lexis nexis in power search mode
Searching lexis nexis in power search modeJoyce Johnston
 
Can Digital Data help predict the results of the US elections?
Can Digital Data help predict the results of the US elections? Can Digital Data help predict the results of the US elections?
Can Digital Data help predict the results of the US elections? Laurence Borel
 
Data Driven PR: 8 Steps to Building Media Attention with Research
Data Driven PR: 8 Steps to Building Media Attention with ResearchData Driven PR: 8 Steps to Building Media Attention with Research
Data Driven PR: 8 Steps to Building Media Attention with ResearchWalkerSands
 
Unleashing Twitter Data for Fun and Insight
Unleashing Twitter Data for Fun and InsightUnleashing Twitter Data for Fun and Insight
Unleashing Twitter Data for Fun and InsightMatthew Russell
 
Business Models in the Data Economy: A Case Study from the Business Partner D...
Business Models in the Data Economy: A Case Study from the Business Partner D...Business Models in the Data Economy: A Case Study from the Business Partner D...
Business Models in the Data Economy: A Case Study from the Business Partner D...Boris Otto
 
What is 1st, 2nd, 3rd party data?
What is 1st, 2nd, 3rd party data?What is 1st, 2nd, 3rd party data?
What is 1st, 2nd, 3rd party data?Sparc Media Poland
 
Digital Winners 2013: Aleksander stensby
Digital Winners 2013: Aleksander stensbyDigital Winners 2013: Aleksander stensby
Digital Winners 2013: Aleksander stensbyTelenor Group
 
Topic and text analysis for sentiment, emotion, and computational social science
Topic and text analysis for sentiment, emotion, and computational social scienceTopic and text analysis for sentiment, emotion, and computational social science
Topic and text analysis for sentiment, emotion, and computational social scienceAlice Oh
 
Big Data: Mapping Twitter Communities
Big Data: Mapping Twitter CommunitiesBig Data: Mapping Twitter Communities
Big Data: Mapping Twitter CommunitiesSocialphysicist
 
Learn How a New Kind of Marketing Mix Modeling is Better for Media Planning
Learn How a New Kind of Marketing Mix Modeling is Better for Media PlanningLearn How a New Kind of Marketing Mix Modeling is Better for Media Planning
Learn How a New Kind of Marketing Mix Modeling is Better for Media PlanningThinkVine
 
How to Build a Basic Model with Analytica
How to Build a Basic Model with AnalyticaHow to Build a Basic Model with Analytica
How to Build a Basic Model with AnalyticaTorsten Röhner
 
Deep Social Insight
Deep Social InsightDeep Social Insight
Deep Social InsightSysomos
 
Staying on the Right Side of the Fence when Analyzing Human Data
Staying on the Right Side of the Fence when Analyzing Human DataStaying on the Right Side of the Fence when Analyzing Human Data
Staying on the Right Side of the Fence when Analyzing Human DataDataSift
 

En vedette (20)

Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural HeritageBuild Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
 
Analyzing social conversation: a guide to data mining and data visualization
Analyzing social conversation: a guide to data mining and data visualization Analyzing social conversation: a guide to data mining and data visualization
Analyzing social conversation: a guide to data mining and data visualization
 
Tip from IBM Connect 2014: Socialytics = Social Business, Big Social Data and...
Tip from IBM Connect 2014: Socialytics = Social Business, Big Social Data and...Tip from IBM Connect 2014: Socialytics = Social Business, Big Social Data and...
Tip from IBM Connect 2014: Socialytics = Social Business, Big Social Data and...
 
Influence mapping Toolbox Presentation London 2015
Influence mapping Toolbox Presentation London 2015Influence mapping Toolbox Presentation London 2015
Influence mapping Toolbox Presentation London 2015
 
Analysis and Visualization of Real-Time Twitter Data
Analysis and Visualization of Real-Time Twitter DataAnalysis and Visualization of Real-Time Twitter Data
Analysis and Visualization of Real-Time Twitter Data
 
Searching lexis nexis in power search mode
Searching lexis nexis in power search modeSearching lexis nexis in power search mode
Searching lexis nexis in power search mode
 
Can Digital Data help predict the results of the US elections?
Can Digital Data help predict the results of the US elections? Can Digital Data help predict the results of the US elections?
Can Digital Data help predict the results of the US elections?
 
Data Driven PR: 8 Steps to Building Media Attention with Research
Data Driven PR: 8 Steps to Building Media Attention with ResearchData Driven PR: 8 Steps to Building Media Attention with Research
Data Driven PR: 8 Steps to Building Media Attention with Research
 
Unleashing Twitter Data for Fun and Insight
Unleashing Twitter Data for Fun and InsightUnleashing Twitter Data for Fun and Insight
Unleashing Twitter Data for Fun and Insight
 
Business Models in the Data Economy: A Case Study from the Business Partner D...
Business Models in the Data Economy: A Case Study from the Business Partner D...Business Models in the Data Economy: A Case Study from the Business Partner D...
Business Models in the Data Economy: A Case Study from the Business Partner D...
 
What is 1st, 2nd, 3rd party data?
What is 1st, 2nd, 3rd party data?What is 1st, 2nd, 3rd party data?
What is 1st, 2nd, 3rd party data?
 
Digital Winners 2013: Aleksander stensby
Digital Winners 2013: Aleksander stensbyDigital Winners 2013: Aleksander stensby
Digital Winners 2013: Aleksander stensby
 
Market Mix Models: Shining a Light in the Black Box
Market Mix Models: Shining a Light in the Black BoxMarket Mix Models: Shining a Light in the Black Box
Market Mix Models: Shining a Light in the Black Box
 
Topic and text analysis for sentiment, emotion, and computational social science
Topic and text analysis for sentiment, emotion, and computational social scienceTopic and text analysis for sentiment, emotion, and computational social science
Topic and text analysis for sentiment, emotion, and computational social science
 
Text mining and Visualizations
Text mining  and VisualizationsText mining  and Visualizations
Text mining and Visualizations
 
Big Data: Mapping Twitter Communities
Big Data: Mapping Twitter CommunitiesBig Data: Mapping Twitter Communities
Big Data: Mapping Twitter Communities
 
Learn How a New Kind of Marketing Mix Modeling is Better for Media Planning
Learn How a New Kind of Marketing Mix Modeling is Better for Media PlanningLearn How a New Kind of Marketing Mix Modeling is Better for Media Planning
Learn How a New Kind of Marketing Mix Modeling is Better for Media Planning
 
How to Build a Basic Model with Analytica
How to Build a Basic Model with AnalyticaHow to Build a Basic Model with Analytica
How to Build a Basic Model with Analytica
 
Deep Social Insight
Deep Social InsightDeep Social Insight
Deep Social Insight
 
Staying on the Right Side of the Fence when Analyzing Human Data
Staying on the Right Side of the Fence when Analyzing Human DataStaying on the Right Side of the Fence when Analyzing Human Data
Staying on the Right Side of the Fence when Analyzing Human Data
 

Similaire à Text Analytics: Yesterday, Today and Tomorrow

Broad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition ResourceBroad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition ResourceLeon Derczynski
 
Voice Search and Conversation Action Assistive Systems - Challenges & Opportu...
Voice Search and Conversation Action Assistive Systems - Challenges & Opportu...Voice Search and Conversation Action Assistive Systems - Challenges & Opportu...
Voice Search and Conversation Action Assistive Systems - Challenges & Opportu...Dawn Anderson MSc DigM
 
Big Data LDN 2018: PROMISE AND PITFALLS OF TEXT ANALYTICS
Big Data LDN 2018: PROMISE AND PITFALLS OF TEXT ANALYTICSBig Data LDN 2018: PROMISE AND PITFALLS OF TEXT ANALYTICS
Big Data LDN 2018: PROMISE AND PITFALLS OF TEXT ANALYTICSMatt Stubbs
 
16-nlp (2).ppt
16-nlp (2).ppt16-nlp (2).ppt
16-nlp (2).ppttestbest6
 
Information Architecture 101
Information Architecture 101Information Architecture 101
Information Architecture 101Christina Wodtke
 
Manichean Progress: Positive and Negative States of the Art in Web-Scale Data...
Manichean Progress: Positive and Negative States of the Art in Web-Scale Data...Manichean Progress: Positive and Negative States of the Art in Web-Scale Data...
Manichean Progress: Positive and Negative States of the Art in Web-Scale Data...Lewis Shepherd
 
Metadata & Taxonomy: The Foundation for Content and Digital Asset Management
Metadata & Taxonomy: The Foundation for Content and Digital Asset ManagementMetadata & Taxonomy: The Foundation for Content and Digital Asset Management
Metadata & Taxonomy: The Foundation for Content and Digital Asset ManagementVeeva Systems
 
Houston / Galveston PIO Network Social Media Training
Houston / Galveston PIO Network Social Media TrainingHouston / Galveston PIO Network Social Media Training
Houston / Galveston PIO Network Social Media TrainingNate Ritter
 
Beyond Siri on the iPhone: How could intelligent systems change the way we in...
Beyond Siri on the iPhone: How could intelligent systems change the way we in...Beyond Siri on the iPhone: How could intelligent systems change the way we in...
Beyond Siri on the iPhone: How could intelligent systems change the way we in...Yousif Almas
 
Intro to Data Science
Intro to Data ScienceIntro to Data Science
Intro to Data ScienceTJ Stalcup
 
Houston / Galveston PIO Nework Social Media Training (ppt)
Houston / Galveston PIO Nework Social Media Training (ppt)Houston / Galveston PIO Nework Social Media Training (ppt)
Houston / Galveston PIO Nework Social Media Training (ppt)Nate Ritter
 
Big, Open, Data and Semantics for Real-World Application Near You
Big, Open, Data and Semantics for Real-World Application Near YouBig, Open, Data and Semantics for Real-World Application Near You
Big, Open, Data and Semantics for Real-World Application Near YouBiplav Srivastava
 
Python 101 for Data Science to Absolute Beginners
Python 101 for Data Science to Absolute BeginnersPython 101 for Data Science to Absolute Beginners
Python 101 for Data Science to Absolute BeginnersSai Linn Thu
 
Social Knowledge: Are you ready for the Future?
Social Knowledge: Are you ready for the Future?Social Knowledge: Are you ready for the Future?
Social Knowledge: Are you ready for the Future?John Girard
 
What is Watson – An Overvie.pdf
What is Watson – An Overvie.pdfWhat is Watson – An Overvie.pdf
What is Watson – An Overvie.pdfskyadav35
 
NewMR - Kyle Findlay - June 2021
NewMR - Kyle Findlay - June 2021NewMR - Kyle Findlay - June 2021
NewMR - Kyle Findlay - June 2021Ray Poynter
 
Thinkful - Intro to Data Science - Washington DC
Thinkful - Intro to Data Science - Washington DCThinkful - Intro to Data Science - Washington DC
Thinkful - Intro to Data Science - Washington DCTJ Stalcup
 

Similaire à Text Analytics: Yesterday, Today and Tomorrow (20)

Broad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition ResourceBroad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition Resource
 
Voice Search and Conversation Action Assistive Systems - Challenges & Opportu...
Voice Search and Conversation Action Assistive Systems - Challenges & Opportu...Voice Search and Conversation Action Assistive Systems - Challenges & Opportu...
Voice Search and Conversation Action Assistive Systems - Challenges & Opportu...
 
Big Data LDN 2018: PROMISE AND PITFALLS OF TEXT ANALYTICS
Big Data LDN 2018: PROMISE AND PITFALLS OF TEXT ANALYTICSBig Data LDN 2018: PROMISE AND PITFALLS OF TEXT ANALYTICS
Big Data LDN 2018: PROMISE AND PITFALLS OF TEXT ANALYTICS
 
16-nlp (2).ppt
16-nlp (2).ppt16-nlp (2).ppt
16-nlp (2).ppt
 
Information Architecture 101
Information Architecture 101Information Architecture 101
Information Architecture 101
 
Text Analytics: Yesterday, Today and Tomorrow

  • 1. Text Analytics: Yesterday, Today, Tomorrow. Michael Ferretti, Tony Russell-Rose, Vladimir Zelevinsky
  • 2. Part 1 (of 3): WHAT?
  • 3. What is Text Analytics?
      ◦ A set of linguistic, analytical and predictive techniques to extract structure and meaning from unstructured documents
          – Text Analytics ~= Natural Language Processing ~= Text Mining
          – Text Mining → scientific / technical context, automated processing
          – Text Analytics → business context, interactive apps
      ◦ “NLP” vs. “Text Analytics”
      Copyright ©2010 Endeca Technologies, Inc. All rights reserved. Proprietary and confidential.
  • 4. Why is Text Analytics Important?
      ◦ ‘80% of corporate information is unstructured’
          – Entire value chain for some organisations (media / publishing etc.)
          – Retail / eCommerce: product reviews
          – User-generated content: blogs, forums, wikis
          – Voice of the Customer: social media + sentiment analysis
      ◦ 161 billion gigabytes of digital information in 2006
          – approximately 988 exabytes by 2010
          – Audio / video still needs summaries & tags etc.
  • 5. How Difficult Can It Be?
      ◦ As humans we do it effortlessly ... don’t we?
      ◦ DRUNK GETS NINE YEARS IN VIOLIN CASE
      ◦ PROSTITUTES APPEAL TO POPE
      ◦ STOLEN PAINTING FOUND BY TREE
      ◦ RED TAPE HOLDS UP NEW BRIDGE
      ◦ DEER KILL 300,000
      ◦ RESIDENTS CAN DROP OFF TREES
      ◦ INCLUDE CHILDREN WHEN BAKING COOKIES
      ◦ MINERS REFUSE TO WORK AFTER DEATH
  • 6. Some Fundamentals
      ◦ Language is AMBIGUOUS
          – To find structure, we must remove ambiguity!
      ◦ Lexical analysis (tokenisation)
          – The cat sat on the mat
          – I can’t tokenise this sentence
      ◦ Morphology (term variations, prefixes, suffixes, etc.)
          – Computer, computing, compute, computed = comput*
          – Delegate = de-leg-ate (?)
          – Ratify = rat-ify (?)
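The tokenisation and morphology points on slide 6 can be illustrated with a minimal Python sketch. The suffix list and the function names `tokenise` and `naive_stem` are assumptions made for illustration; production stemmers (for example Porter's) use ordered rule sets rather than a flat suffix list, and this toy version will over-strip many words.

```python
import re

# Hypothetical suffix list, chosen only to reproduce the slide's comput* example.
SUFFIXES = ("ing", "ed", "er", "e")


def tokenise(text):
    """Split text into lowercase word tokens, keeping internal apostrophes."""
    return re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())


def naive_stem(word, min_stem=4):
    """Strip the first matching suffix if at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    print(tokenise("I can’t tokenise this sentence"))
    print({naive_stem(w) for w in tokenise("Computer computing compute computed")})
```

Keeping "can’t" as one token is a design choice; other tokenisers split it into "ca" and "n’t", which is exactly the kind of ambiguity the slide points at.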
  • 7. More Fundamentals
      ◦ Syntax (part of speech tagging)
          – Time flies like an arrow
          – Fruit flies like a banana
          – Eats shoots and leaves
      ◦ Parsing (grammar)
          – I saw a venetian blind
          – I saw a blind venetian
          – Rugby is a game played by men with odd-shaped balls
      ◦ Sentence boundary detection
          – Punctuation denotes the end of a sentence!
          – “But not always!”, said Fred...
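A rough sketch of why sentence boundary detection is harder than splitting on punctuation. It assumes a small, hand-picked abbreviation list (`ABBREVIATIONS` is hypothetical and far from exhaustive) and treats terminal punctuation as a boundary only when it is followed by whitespace.

```python
import re

# Illustrative abbreviation list; a real system would need a much larger one
# or a trained model.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "etc."}


def split_sentences(text):
    """Split on ., ! or ? followed by whitespace, unless the word is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]+[\"”']?\s+", text):
        candidate = text[start:match.end()].strip()
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # e.g. "Mrs." does not end the sentence
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    print(split_sentences("Mrs. Clinton spoke. Fred left!"))
    print(split_sentences("“But not always!”, said Fred..."))
```

The slide's own counterexample stays in one piece here only because the exclamation mark is followed by a closing quote and a comma rather than whitespace; slightly different punctuation would defeat the rule.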
  • 8. Named Entity Recognition / Information Extraction
      ◦ Companies in New York != New companies in York
      ◦ People, places, organisations ...
          – Increase precision
          – Support navigation
          – Facilitate translation, summarisation, speech synthesis, etc.
      ◦ IE = template filling
          – Entities + relationships
          – Highly context dependent
      ◦ Problems with:
          – Anaphora resolution
          – Word sense disambiguation
  • 9. Question Answering
      ◦ Give me answers, not documents!
      ◦ Fact-finding vs. exploratory search
          – Yes/no questions: ‘Is George W. Bush the current president of the USA?’
          – ‘Who’ questions: ‘Who was the British Prime Minister before Margaret Thatcher?’
          – List questions: ‘Which football teams have won the Champions League this decade?’
          – Instruction-based questions: ‘How do I cook lasagne?’
          – Explanation questions: ‘Why did World War I start?’
          – Commands: ‘Tell me the height of the Eiffel Tower.’
      ◦ Question analysis → document retrieval → answer extraction
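The first stage of the pipeline on slide 9, question analysis, can be sketched as a classifier over the question types the slide lists. The keyword rules below are a hypothetical illustration keyed on the first word only; they are not the presenters' method, and real systems rely on parsing and trained models.

```python
# Hypothetical first-word rules; deliberately simplistic.
YES_NO_STARTERS = {"is", "are", "was", "were", "do", "does", "did", "can", "will"}


def question_type(question):
    """Map a question to one of the slide's categories using its first word."""
    words = question.strip().lower().split()
    if not words:
        return "unknown"
    first = words[0]
    if first in YES_NO_STARTERS:
        return "yes/no"
    if first == "who":
        return "who"
    if first == "which":
        return "list"
    if first == "how":
        return "instruction"
    if first == "why":
        return "explanation"
    return "command"  # fallback, e.g. "Tell me ..."


if __name__ == "__main__":
    for q in ["How do I cook lasagne?", "Tell me the height of the Eiffel Tower."]:
        print(q, "->", question_type(q))
```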
  • 10. Part 2 (of 3): HOW?
  • 11. Definitions
      Text Analytics is Computer Science + Semantics. Semantics is the study of meaning.
      Universal flowchart: [figure not preserved in transcript]
  • 12. Meaning and structure
      No mind reading (yet). Have to use text. Text approximates meaning. Meaning is structured.
      [Diagram labels: CONCEPT, CONCEPT, CONCEPT]
  • 13. Problems with text
      ◦ Synonymy: one concept maps to different words.
      ◦ Polysemy: one word maps to different concepts.
  • 14. Simplest structure: salient terms
      “Many years later, as he faced the firing squad, Colonel Aureliano Buendía was to remember that distant afternoon when his father took him to discover ice.” – García Márquez (1967)
  • 15. Typed entities
      ◦ People, places, organizations, etc.
      ◦ Simple approach: word lists.
      ◦ More difficult: trained extractors (including sentiment).
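The "simple approach: word lists" on slide 15 can be sketched as a gazetteer lookup that prefers the longest match. The `GAZETTEER` contents and the `tag_entities` function are invented for illustration; the example reuses the "New York" versus "York" contrast from slide 8.

```python
import re

# Hypothetical word lists; a real gazetteer would be far larger.
GAZETTEER = {
    "person": {"aureliano buendía"},
    "place": {"new york", "york"},
    "organisation": {"acm"},
}


def tag_entities(text, gazetteer=GAZETTEER, max_len=3):
    """Return (surface form, type) pairs, trying the longest phrase first."""
    lookup = {name: etype for etype, names in gazetteer.items() for name in names}
    tokens = re.findall(r"\w+", text)
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase.lower() in lookup:
                found.append((phrase, lookup[phrase.lower()]))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no entity starts at this token
    return found


if __name__ == "__main__":
    print(tag_entities("Companies in New York"))
    print(tag_entities("New companies in York"))
```

A word list cannot tell which sense of an ambiguous name is meant, which is one reason the slide calls trained extractors the more difficult but more capable route.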
  • 16. Highest clarity organizations for “baseball”
      Year   Top terms                 World Series teams
      1987   Cardinals; Twins          Cardinals; Twins
      1988   Dodgers; Mets             Dodgers; Athletics
      1989   Athletics; Giants         Athletics; Giants
      1991   Braves; Twins             Braves; Twins
      1992   Blue Jays; Braves         Blue Jays; Braves
      1996   Yankees; Braves           Yankees; Braves
      1997   Indians; Marlins          Indians; Marlins
      1998   Yankees; Padres           Yankees; Padres
      1999   Braves; Mets; Yankees     Braves; Yankees
      2000   Yankees; Mets             Yankees; Mets
      2001   Diamondbacks; Yankees     Diamondbacks; Yankees
      2003   Marlins                   Marlins; Yankees
      Fail: 1990, 1993, 1995, 2002.
  • 17. Salient terms on a timeline: baseball
      Clarity scores for top terms for “baseball” search: [chart not preserved in transcript]
      No event in 1994!
  • 18. Salient terms on a timeline: Iraq
  • 19. Case study: ACM
      ◦ Excellent corpus: research articles. Written by humans. Tagged by authors.
      ◦ But: half the articles untagged. Tags sparse (90% of tags used once!). Synonyms abound.
      ◦ tags → controlled tag vocabulary → high-scoring salient tags
  • 20. 20
  • 21. Clusters
      Co-occurrence: salient terms that tend to occur together belong together.
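A minimal sketch of the co-occurrence idea on slide 21: count how often two salient terms appear in the same document, then treat frequently co-occurring pairs as candidates for the same cluster. The function name and the toy documents are assumptions; the slides do not say which clustering algorithm was actually used.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count, for each unordered term pair, the number of documents containing both."""
    counts = Counter()
    for terms in docs:
        # set() so a term repeated within one document counts once;
        # sorted() so each pair has a single canonical order.
        for a, b in combinations(sorted(set(terms)), 2):
            counts[(a, b)] += 1
    return counts


if __name__ == "__main__":
    # Toy documents, each already reduced to its salient terms.
    docs = [
        ["java", "island", "indonesia"],
        ["java", "coffee", "beans"],
        ["coffee", "beans", "brew"],
    ]
    for pair, n in cooccurrence_counts(docs).most_common(3):
        print(pair, n)
```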
  • 26. Information Scent
      The human brain is great at extracting information scent: [word, word, word, …] → meaning
      [island, Indonesia] [code, Sun] [coffee, beans, brew] → Java
  • 27. Vector model – Salton (1983)
      ◦ Similarity between documents = cosine of the angle between their vectors
      ◦ Can also rotate the basis for the best representation: LSI
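The similarity measure on slide 27 is cos θ = (a · b) / (|a| |b|), where a and b are the term vectors of two documents. The sketch below assumes raw term counts and whitespace tokenisation, which are simplifications; weighting schemes such as tf-idf and the LSI basis rotation are not shown.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the raw term-count vectors of two texts."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    print(cosine_similarity("coffee beans", "coffee brew"))
    print(cosine_similarity("island Indonesia", "code Sun"))
```

Because the measure depends only on the angle, a long document and a short one with the same term proportions score as identical.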
  • 32. Sentence structure parsing
      It is said Mrs. Clinton promises new jobs will be created by her.
      N V V N N V A N V V V N
      ◦ part of speech tagging
      ◦ noun / verb phrase extraction
      ◦ sentence structure analysis
      ◦ anaphora resolution
      ◦ passive voice flipping
      ◦ triple filtering
      ◦ hierarchy generation
  • 33. Hierarchy generation (also semantic network!)
      ◦ Nouns by head noun: [Mrs. + Hillary + Bill + President] → Clinton
      ◦ Verbs by hypernyms (broadening synonyms): [say + tell + propose + suggest + declare] → express
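Grouping nouns by head noun, as on slide 33, can be sketched under the simplifying assumption that the head of an English noun phrase is its last token. That assumption fails for phrases such as "Secretary of State", and `group_by_head_noun` is a name invented for this example.

```python
from collections import defaultdict


def group_by_head_noun(noun_phrases):
    """Group phrases under their last token, used here as a stand-in for the head noun."""
    groups = defaultdict(list)
    for phrase in noun_phrases:
        head = phrase.split()[-1]
        groups[head].append(phrase)
    return dict(groups)


if __name__ == "__main__":
    phrases = ["Mrs. Clinton", "Hillary Clinton", "Bill Clinton", "President Clinton", "new jobs"]
    for head, members in group_by_head_noun(phrases).items():
        print(head, "<-", members)
```

The verb side of the slide (hypernyms) needs a lexical resource such as WordNet, listed on slide 39, and is not attempted here.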
  • 38. Part 3 (of 3): WHO?
  • 39. Text Analytics for fun and profit
      ◦ For profit:
          – Lexalytics: Text Enrichment module; Text Enrichment with Sentiment Analysis
          – Alias-i: Term Discovery
          – Nstein: Newssift
      ◦ For fun:
          – GATE (Sheffield University): open source, linguistic focus
          – RapidMiner (University of Dortmund / Rapid-I): open source community edition, data mining focus
          – WordNet, OpenCalais, LingPipe, NLPwiki, etc.
  • 40. Market Outlook: 2010
      ◦ Market maturing & expanding: 75-200%
          – Most vendors on target
      ◦ Dominant markets:
          – CX, media/publishing, FS & insurance, intelligence, life sciences, e-discovery
      ◦ Solutions still not standardized
          – Need for self-service tuning & configuration
      ◦ Massive expansion in social media
          – Lightweight NLP for buzz analysis, brand monitoring, etc.
      ◦ Partner ecosystem developing
          – Marketing services providers, platform vendors, CRM + call centre vendors, system integrators
  • 41. Conclusions: What do we expect in the future?
      ◦ Extraction leads to generation
      ◦ Summarization
      ◦ Generalization
      ◦ Narratives
      ◦ Inference and conflict resolution
      “We are all interested in the future, for that is where you and I are going to spend the rest of our lives. And remember, my friend, future events such as these will affect you in the future.” – Edward Wood Jr. (1957)
  • 42. Text analytics: what does it mean?
      ◦ Unstructured text isn't unstructured. There's always structure.
      ◦ Find the information scent. Let the users follow it.
      ◦ Don’t trust that one query is enough. Let the users interact.
      ◦ Text does not matter. Meaning does.