Words and More Words:
Challenges of Big (Text) Data
Edie Rasmussen
Visiting Professor, Nanyang Technological University
Professor, University of British Columbia
WKWSCI SYMPOSIUM 2014
Big Data, Big Ideas for Smarter Communities
Outline
• The Rise of Big Text Data
• Challenges for Text Data
• Research Opportunities
– Counting and Culturomics
– Extracting Meaning from Text
The Rise of Big Text Data
• Before there was Big Data, there were large
bibliographic databases:
– Dialog: ~180 scholarly databases
– Lexis/Nexis: 5 billion documents (business/law/news)
– Citation Indexes: > 40 million records
• IR techniques designed for rapid access to very
large (text) databases
• Swanson: “Undiscovered public knowledge”
(1986)
Current Text Sources
• Digitized Legacy Materials
– Google Books, HathiTrust (11 million volumes, 500 TB)
• The Web
• Search Logs (over 2 million queries per minute)
• Wikipedia (~4.5 million English articles)
• Blogs (The Blogosphere)
• Twitter (The Twitterverse)
• Test Collections
– Smaller
– Experimentally more robust
Challenges of Text
• Legacy Text/Digitization Costs
• Quality (OCR Errors; Metadata Errors)
• Availability (Access, Copyright, Privacy)
• Reliability
– Algorithmic dependencies
– Creator trustworthiness
• Authorship Issues (Identification, Authority)
• Lack of Structure
• Lack of Context
• Ambiguity of human language
• Breadth vs. Depth
Processing Text
• Tokenizing, stopping, stemming
• Statistics of text: term values (tf*idf)
• “Bag of Words” approach
• Other evidence: network structures
• Similarity calculations
• Creating ranked lists
• Note: Probabilistic rather than Deterministic
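
The steps on this slide can be sketched end to end in a few lines of Python. This is a minimal illustration, not the method of any particular IR system: the stop list, the suffix-stripping rules (a crude stand-in for a real stemmer such as Porter's), the toy documents, and all function names are assumptions made for the example. It weights terms by tf*idf, treats each document as a bag of words, and ranks documents by cosine similarity to the query.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop list (an assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to"}

# Crude suffix rules standing in for a real stemmer (an assumption).
SUFFIX_RULES = (("ies", "y"), ("ing", ""), ("es", ""), ("s", ""))


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix, replacement in SUFFIX_RULES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: len(token) - len(suffix)] + replacement
    return token


def preprocess(text):
    """Tokenize, remove stop words, and stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one {term: tf*idf} bag-of-words dict per document, plus the idf table."""
    bags = [Counter(preprocess(d)) for d in docs]
    n = len(bags)
    df = Counter(term for bag in bags for term in bag)  # document frequency
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in bag.items()} for bag in bags]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return (score, document index) pairs, best match first."""
    vectors, idf = tfidf_vectors(docs)
    q_bag = Counter(preprocess(query))
    q_vec = {t: tf * idf.get(t, 0.0) for t, tf in q_bag.items()}
    scored = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Made-up documents, for illustration only.
docs = [
    "Digitized books and the counting of words",
    "Search logs record millions of queries",
    "Ranking documents for search queries",
]
ranking = rank("searching query logs", docs)
```

The output is a ranked list of scores rather than a yes/no match, which is the sense in which retrieval is probabilistic rather than deterministic.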
Counting and the Rise of Culturomics
• “Culturomics is the application of high-
throughput data collection and analysis to the
study of human culture”
• Database of >5 million digitized books (~4% of all books ever published)
• Michel et al. (Science, 2011): “Quantitative
analysis of culture using millions of digitized
books”
• Google’s N-Gram Viewer
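
The counting behind an n-gram trend line can be sketched as follows. This is only an illustration of the idea, not Google's implementation: the function names and the tiny (year, text) corpus are made up, and the normalization (occurrences divided by all n-grams of the same length in that year) is an assumption about how a relative frequency might be computed.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield each run of n consecutive tokens as a space-joined string."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i : i + n])


def relative_frequency(corpus, phrase):
    """Map year -> occurrences of phrase / all n-grams of that length in that year."""
    phrase = phrase.lower()
    n = len(phrase.split())
    hits = Counter()
    totals = Counter()
    for year, text in corpus:
        for gram in ngrams(text.lower().split(), n):
            totals[year] += 1
            if gram == phrase:
                hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Made-up (year, text) pairs, purely for illustration.
toy_corpus = [
    (1850, "cholera spread through the city"),
    (1850, "the cholera outbreak of that year"),
    (1990, "research on hiv expanded"),
    (1990, "cholera is now rare in the city"),
]
trend = relative_frequency(toy_corpus, "cholera")
```

Plotting `trend` against year gives one curve of the kind the N-Gram Viewer draws.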
Using the N-Gram Viewer
[Figure: Google N-Gram Viewer plot of the terms typhoid, gout, HIV, and cholera, 1800 to 2000]
How Far Will Counting Take Us?
• Many limitations (e.g. incomplete data set)
• Some surprisingly sophisticated analyses:
– Size of English lexicon
– Change in word usage (irregular verbs) over time
– Cultural turnover (inventions)
– The nature (duration) of fame
– Patterns of censorship (“suppression index”)
Critiques of Culturomics
• “The death of theory”
• “…second-rate scholars will use the Google
Books corpus to churn out gigabytes of
uninformative graphs and insignificant
conclusions.” (Nunberg, 2011)
• Books as a representation of human history
• A “time sink”
Social Media as Big Data
• ‘Internet Minute’
– 320+ new Twitter accounts
– 100,000 new Tweets
– 2+ million search queries
– 6 new Wikipedia articles
– 30 hours of video uploaded
(Source: Intel, http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html)
Text Mining (TM): Topic Detection and Tracking
• Tracking a story line over time
• News wire input, identify new story, find
subsequent instances
• Story segmentation, First story detection,
Clustering of like stories
• Of interest to news, business, and security analysts
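
First story detection and clustering of like stories can be sketched as a single pass over the stream: compare each incoming item with the stories seen so far, and start a new story when nothing is similar enough. This is a simplified illustration, not a published TDT system. The similarity measure (cosine over raw word counts), the threshold of 0.3, the function names, and the headlines are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def bag(text):
    """Bag of words: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two Counters."""
    dot = sum(c * v[t] for t, c in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def track_stories(stream, threshold=0.3):
    """Assign each item to the most similar earlier story, or start a new one.

    Returns one (story_id, is_first_story) pair per item.
    """
    stories = []      # one merged bag of words per story
    assignments = []
    for text in stream:
        vec = bag(text)
        scores = [cosine(vec, story) for story in stories]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is None or scores[best] < threshold:
            stories.append(vec)  # nothing similar enough: a first story
            assignments.append((len(stories) - 1, True))
        else:
            stories[best] += vec  # fold the item into the matching story
            assignments.append((best, False))
    return assignments


# Made-up headlines, for illustration only.
stream = [
    "Earthquake strikes coastal city",
    "Central bank raises interest rates",
    "Rescue teams reach city after earthquake",
    "Markets react as bank raises rates again",
]
assignments = track_stories(stream)
```

The threshold controls the trade-off: set it too low and unrelated items merge into one story, too high and every follow-up report looks like a first story.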
TM: Sentiment Analysis/Opinion Mining
• Rich data from Blogs and Tweets
• Basically a classification problem (SVM, Naïve
Bayes, etc.) -> positive, negative, neutral
• Involves Entity Extraction, NLP, sentiment
vocabularies
• Of interest to government and businesses
• See Stanford SA of movie reviews:
http://nlp.stanford.edu:8080/sentiment/rntnDemo.html
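
To make the "classification problem" framing concrete, here is a minimal multinomial Naïve Bayes classifier with add-one smoothing, written from scratch. It is a sketch under stated assumptions: the class and function names are hypothetical, the six training sentences are made up, and a real system would add entity extraction, sentiment vocabularies, and far more training data. It is unrelated to the Stanford demo linked above.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data, for illustration only.
train = [
    ("a wonderful moving film", "positive"),
    ("great acting and a great story", "positive"),
    ("a dull and boring film", "negative"),
    ("terrible plot and awful acting", "negative"),
    ("the film opens on friday", "neutral"),
    ("the story is set in paris", "neutral"),
]
model = NaiveBayes().fit([t for t, _ in train], [l for _, l in train])
label = model.predict("a great and wonderful story")
```

Because the model is a bag of words, it ignores negation and word order ("not great" scores like "great"), which is one reason sentiment analysis in practice also draws on NLP.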
TM: Trends and Predictions
• Can Tweets and Search Logs be used to
predict the future?
• Google Flu Trends, Google Dengue Trends
– Correlated with Search Terms
• Network analysis on Tweets on Arab Spring
• Assessing tone of global news data to predict
national stability, location of terrorists, etc.
(Leetaru)
• Predicting opinions (recommender systems)
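
"Correlated with search terms" can be made concrete with a Pearson correlation between a weekly query count and a weekly case count. The numbers below are invented for illustration and are not Google Flu Trends data. The function name is an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up weekly numbers, for illustration only.
flu_queries = [120, 150, 310, 480, 390, 200]
reported_cases = [10, 14, 30, 52, 41, 18]
r = pearson(flu_queries, reported_cases)
```

A high correlation on past data shows only that the two series moved together. It does not by itself establish that search volume will predict future cases.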
TM: Question Answering
• Combines multiple sources of evidence:
– Question type identification
– Information retrieval of candidate text
– Natural language processing
– Entity extraction
– Hypothesis generation and scoring (confidence)
– Ranking hypotheses
[Images: Hans Peter Luhn, 1952; Watson, 2011]
Structuring Research:
“Digging Into Data” Program
• Addresses: “how "big data" changes the research
landscape for the humanities and social sciences”
• 3 rounds of international research funding
• Canada, US, UK, plus Netherlands
• Team approach: scholars, scientists, information
professionals
• Requires international teams; funding from at
least two countries
• Wide range of datasets made available
• http://www.diggingintodata.org/
Thank you!
