Visual Analytics for Linguistics - Day 3 ESSLLI
Olga Scrivner
Outline: Course Info, Charts, Text Visualization, ITMS, Preprocessing Data, Data Visualization, Cluster Analysis, Topic Modeling, Google Book API
What You Will Learn
DAY 1 Introduction to Visual Analytics
DAY 2 Visualization Methods, Design, and Tools
DAY 3 Working with Unstructured Data
DAY 4 Working with Structured Data
DAY 5 Advanced Analytics
Our Materials - Web Site
http://obscrivn.wixsite.com/visualization
What We Need
Interactive Text Mining Suite
Voyant
R and RStudio
R libraries: ggplot2, plotly, reshape2
Quiz: Which Chart Are You?
https://www.sisense.com/blog/quiz-chart/
Creating a Bar Chart
The height of each bar can represent one of two things:
The value of a column in the data set. This is done with stat="identity", which leaves the y values unchanged.
The count of cases for each group, where each x value represents one group.
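The two kinds of bar height can be sketched in plain Python. This is only an analogy for what the two stat settings compute, not ggplot2 code; the toy rows and the summary table are made up for illustration.

```python
from collections import Counter

# Hypothetical toy data: one row per token, labeled with its part of speech.
rows = ["noun", "verb", "noun", "adj", "noun", "verb"]

# "count" mode: no y variable is supplied, so the bar height is the
# number of cases (rows) that fall into each group.
count_heights = dict(Counter(rows))

# "identity" mode: the data already holds one value per group, and the
# bar height is that value, left unchanged.
summary = [("noun", 3), ("verb", 2), ("adj", 1)]
identity_heights = {group: value for group, value in summary}

print(count_heights)
print(identity_heights)
```

Both modes give the same bars here because the summary table was pre-counted from the same rows.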
Creating a Bar Chart - Sample
Creating a Bar Chart - Values
Creating a Bar Chart - Counts
To get a bar graph of counts, we do not map a variable to y, and we use stat="count" (the default for geom_bar).
Title
Creating Line Chart
Creating Area Chart
http://www.r-graph-gallery.com/136-stacked-area-chart/
Creating Scatter Plot
http://www.r-graph-gallery.com/272-basic-scatterplot-with-ggplot2/
Creating Bubble Plot
https://plot.ly/r/bubble-charts/
Creating Heatmap
http://www.r-graph-gallery.com/215-interactive-heatmap-with-plotly/
Creating Word Cloud
Word Cloud - Contest - 10 min
Create your own word cloud
Look at the function - type ?wordcloud2 and run
Can you change the shape of your cloud?
Save it (or take a screenshot) and post it on Twitter, Facebook, etc.
Why Analyze Text?
The “epic transformation of archives” - shifting from print to
digital archival form (Folsom, 2007)
“As our collective knowledge continues to be digitized and
stored (...) it becomes more difficult to find and discover
what we are looking for.” (Blei 2012)
Text Mining Challenges
Sources - 1) Dan Jurafsky, 2) Text Mining with R for Social Science Research (Ryan Wesslen)
Basic Terminology
What is Bag of Words?
Simplest way to quantify text
Word order ignored
Term counts per document
N-grams (uni-grams, bi-grams)
Source - Chris Manning
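A minimal sketch of a bag-of-words representation, assuming whitespace tokenization and two made-up documents. It counts terms per document while ignoring word order, and builds bi-grams with a small helper (`ngrams` is an illustrative name, not a library function).

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (joined with spaces) in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical toy corpus.
docs = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog sat",
}

# Uni-gram bag of words: term counts per document, word order ignored.
unigram_counts = {name: Counter(text.split()) for name, text in docs.items()}

# Bi-gram counts keep a little local word order.
bigram_counts = {name: Counter(ngrams(text.split(), 2)) for name, text in docs.items()}

print(unigram_counts["doc1"])
print(bigram_counts["doc2"])
```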
Preprocessing
Tokenization (splitting words)
Cleaning (lower case, punctuation)
Stemming
works, worked → work
Filter (stopwords)
and, the, a
Source - Wesslen
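The four steps above can be sketched as a small pipeline. This is an illustration under simplifying assumptions: the stopword list holds only the three examples from the slide, and `toy_stem` is a crude suffix stripper written for this example, not a real stemming algorithm such as Porter's.

```python
import re

STOPWORDS = {"and", "the", "a"}  # only the examples given on the slide

def tokenize(text):
    """Tokenization: split the text into words on whitespace."""
    return text.split()

def clean(token):
    """Cleaning: lower-case the token and drop punctuation."""
    return re.sub(r"[^a-z0-9]", "", token.lower())

def toy_stem(token):
    """Stemming (toy version): strip a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = [clean(t) for t in tokenize(text)]
    tokens = [toy_stem(t) for t in tokens if t]       # skip tokens emptied by cleaning
    return [t for t in tokens if t not in STOPWORDS]  # filter stopwords

print(preprocess("The editor works, and a reviewer worked."))
```

With this input, works and worked both reduce to work, and the stopwords the, and, a are dropped.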
Macro-analysis
Concept: Macro-analysis (Jockers, 2013), “the construction of abstract models” (Jasinski, 2001)
Methods: Tag clouds, heat maps, clusters, topics, network graphs
Tools (GUI): Voyant, Papermachine, ITMS
Tools (TUI): Mallet, Meta, R and Python packages
Visual Analytics
Visual Analytics - “The science of analytical reasoning facilitated by interactive visual interfaces” (Thomas et al., 2005)
Graphs, maps and trees for literature analysis (Moretti,
2005)
Visualization Methods
Word clouds to analyze a novel (Vuillemot et al., 2009)
Social network graphs of characters in Greek tragedies
(Rydberg-Cox, 2011)
Literary fingerprint and summaries (Oelke et al., 2012)
Tracking emotion and sentiment in fairy tales
(Mohammad, 2012)
Topic Modeling
Discovering the underlying themes of a collection of Science magazine articles, 1990-2000 (Blei 2012)
Topics - Word Term
Wikipedia Topics
http://www.princeton.edu/~achaney/tmve/wiki100k/browse/topic-presence.html
Wikipedia Topics - Assignment - 10 min
1. Language Related Topic
2. Words: Dialect
3. Related Document: Macedonian Language
4. Related Document: Egyptian hieroglyphs
5. Go to Full article:
6. Find meaning:
Voyant
http://voyant-tools.org/
Voyant - 10 min
http://voyant-tools.org/
Examine visualization charts (identify types
and properties)
Apply various filters and queries
Voyant Tools - Bubblelines - 7 min
http://docs.voyant-tools.org/tools/
Delete top terms
Search for man and woman
Make sure to have “separate lines for terms” clicked
Change search terms
Voyant Tools - Pair Work - 10 min
http://docs.voyant-tools.org/tools/
Examine visualization methods
Select 5 methods
Look at the documentation and how to use them
Interactive Text Mining Suite
A user-friendly tool for quantitative analysis and
visualization of unstructured data
Platform-independent
Interactive
ITMS Structure
1. File Uploads
Upload files (txt, pdf, rdf and Google Books API)
2. Data Preparation
Data preprocessing (stopwords, stemming, metadata)
3. Data Visualization
Word frequencies, cluster analysis and topic modeling
Workshop Files
Download 3 text files
https://iu.box.com/s/knua9af3bip7g63s3zdax9ti4z243ldz
NY Times articles (3 documents in a plain text format)
ITMS Web site:
http://www.interactivetextminingsuite.com
Upload File
Preprocessing Data
Before performing data analysis, we should preprocess the data.
Preprocessing Options
Select preprocessing options and click apply.
Stopwords
Stopwords (e.g. the, and): select Default for English
Manual Removal of Stopwords
As needed, remove any additional stopwords that you consider noise, e.g. paper, shows, etc.
Select apply
Stemming
To improve the analysis, you can stem all your tokens: e.g. instead of worked, works, working, you will have only one relevant stem, work.
Metadata Extraction
You can extract or upload metadata. You will need
datestamp (year) information for chronological topic
modeling.
Visualization
Word Cloud Representation
Customization
Cluster Analysis
You need to have at least three documents
Documents will be grouped based on their term similarity
measures
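One way to picture grouping by term similarity is to compare term-count vectors with cosine similarity and look for the closest pair, which is the first merge step in a hierarchical clustering. The source does not say which similarity measure or clustering method ITMS uses, so the measure, the helper name `cosine`, and the three toy documents are assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy corpus of three documents.
docs = {
    "doc1": "election vote senate vote",
    "doc2": "senate election campaign",
    "doc3": "soccer goal match goal",
}
vectors = {name: Counter(text.split()) for name, text in docs.items()}

# Pairwise similarities; the most similar pair would be merged first.
similarity = {
    (x, y): cosine(vectors[x], vectors[y])
    for x, y in combinations(sorted(vectors), 2)
}
closest_pair = max(similarity, key=similarity.get)

print(similarity)
print(closest_pair)
```

In this toy corpus, the two politics documents share terms and pair up, while the sports document shares none and stays apart.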
Topic Modeling
LDA (Latent Dirichlet Allocation)
STM (Structural Topic Model)
Chronological topic visualization (lda): requires metadata
Topic Modeling Tuning
Selection of topics (how many different themes)
Selection of words per theme (how many words per topic)
Selection of the number of iterations
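To show where the three tuning choices enter, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling in plain Python. It is not the implementation ITMS uses; the function names, the priors `alpha` and `beta`, the seed, and the toy documents are all assumptions made for illustration.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics, iterations, alpha=0.1, beta=0.01, seed=0):
    """Fit a toy LDA model; returns one word-count Counter per topic."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]        # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics
    assignments = []
    for d, doc in enumerate(docs):                       # random initial topics
        z_doc = [rng.randrange(num_topics) for _ in doc]
        for w, z in zip(doc, z_doc):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]                    # remove current assignment
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z                    # record the resampled topic
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word

def top_words(topic_word, words_per_topic):
    """Most frequent words of each topic, skipping zero counts."""
    return [[w for w, c in counts.most_common(words_per_topic) if c > 0]
            for counts in topic_word]

# Hypothetical, already preprocessed toy documents.
docs = [
    "vote senate election vote".split(),
    "election campaign senate".split(),
    "goal match soccer goal".split(),
    "soccer match team".split(),
]
model = lda_gibbs(docs, num_topics=2, iterations=50)
print(top_words(model, words_per_topic=3))
```

The three settings from the slide map to `num_topics`, `words_per_topic`, and `iterations`. The sampler is random, so the topics found depend on the seed.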
Topic Model Selection
LDA Topic Model
STM Topic Model
Other Formats - Google Books
Before switching to other data formats, refresh your local
browser.
Start with File Uploads and select Structured Data
Other Formats - Google Books
Select your search terms and submit
The current limit is 40 books
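For orientation, a search against the public Google Books API volumes endpoint can be sketched as below. How ITMS issues its requests is not stated in the source, so treat this as an assumption. The public endpoint accepts at most 40 results per request through `maxResults`, which may be where the limit comes from. The helper names are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://www.googleapis.com/books/v1/volumes"
MAX_RESULTS = 40  # largest maxResults value the endpoint accepts per request

def build_query_url(search_terms, max_results=MAX_RESULTS):
    """Build the request URL, capping the result count at the API limit."""
    params = {"q": search_terms, "maxResults": min(max_results, MAX_RESULTS)}
    return API_URL + "?" + urlencode(params)

def fetch_descriptions(search_terms):
    """Fetch book descriptions for the search terms (needs network access)."""
    with urlopen(build_query_url(search_terms)) as response:
        data = json.load(response)
    return [item.get("volumeInfo", {}).get("description", "")
            for item in data.get("items", [])]

print(build_query_url("visual analytics"))
```

The returned descriptions are short texts, so they could be fed into the same preprocessing and visualization steps as uploaded files.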
Resources
http://www.rdatamining.com/examples/text-mining
https://en.wikibooks.org/wiki/R_Programming/Text_Processing
http://data.library.virginia.edu/reading-pdf-files-into-r-for-text-mining/
http://www.katrinerk.com/courses/words-in-a-haystack-an-introductory-statistics-course/schedule-words-in-a-haystack/r-code-the-text-mining-package
tm package