How does digital text relate to written, non-digital text? What do we need to think about when using large-scale digital methods and interpreting the results?
Workshop on Digital Literacy - Digital text and data-intensive research
1. Digital Text and
Data-Intensive Research
Nina Tahmasebi, Associate Professor
University of Gothenburg
Digital Literacy | 2020-2021
Nina Tahmasebi, Digital Literacy, Sept. 2020
2. Centre for Digital Humanities (2018-2019)
Mathematics (B.Sc. & M.Sc.), 2003-2008
Computer/Data Science (PhD + Postdoc), 2008-2014
NLP / Language Technology (Researcher, Associate Professor), 2014→
5. Based on
• Tahmasebi, Nina, and Simon Hengchen. "The Strengths and Pitfalls of Large-Scale Text Mining for Literary Studies." Samlaren: tidskrift för svensk litteraturvetenskaplig forskning 140 (2019): 198-227.
• Tahmasebi, Nina, Hagen, Niclas, Brodén, Daniel, & Malm, Mats. (2019). "A Convergence of Methodologies: Notes on a Data-intensive research methodology." DHN2019. p. 437-449.
6. When do we benefit from
computational methods?
7. A single physical
piece can be
studied in detail.
A few physical pieces
can be studied and
compared in detail.
Too many physical
pieces cannot be
treated manually.
10. From text to answers
text
text mining
method
research question
results
11. From text to answers
text
research question
text mining
method
results
12. Today’s outline
1. Digital Text
2. Data-intensive research methodology
3. Research results and interpretation
14. A book:
• Empty pages in the
beginning / end
• Large letter at the
beginning of each chapter
• Images?
16. Too many physical
pieces cannot be
treated manually.
Digital Text
17. Too many digital texts cannot
be studied in too much
detail either!
We need to ignore a lot of formatting
• White pages
• White space
• Fonts
• Capitalization of letters
• Etc…
19. Digital text
Printed texts
not available digitally
Printed texts
born digital
Other digital
publications
User generated text
Edited text
Fewer errors of the kind
• OCR errors due to modern
fonts,
• Less dirty pages, younger age.
• Modern language
Data of the kind:
• News
• Professional blogs
• Reviews
A lot of errors
• Spelling errors
• Grammatical errors
• Abbreviations
• Smileys
(automatic) Metadata
The older the text, the more
errors
• Paper of poor quality
• Different fonts
• Skewed columns
• (Spelling variations)
21. Corpus /dataset
• Corpus → linguistically oriented
• Dataset → any collection of text!
• Thematic
• Time periods
• Media types
• Genre
• …
• There are certain types of
questions that cannot be
answered by any text
Digital text
22. Individual
Individual text
With individual intent
Multiple texts –
dataset/corpus
Bits and pieces from
a large dataset
Researcher/group
analyzing in detail
Smart search scenario
23. Smart search/selection
• All interpretation and analysis are
left to the human
• Often, the correctness of each
individual bit is simple to verify
• But what happens when we have
millions of bits and pieces?
→ We still cannot study them manually
25. Sources of error:
• We made a bad model:
• E.g. Lost formatting
• Too many OCR errors in the text
→ We cannot find what we are looking for
→ We find much more than we need
• What we are looking for
semantically is not covered by the
terms we use for search:
kvinna ≠ quinna
• Other sources of error?
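The kvinna ≠ quinna problem can be made concrete with a small sketch. The three sentences and the single variant rule (treating "qu" and "kv" alike) are invented for illustration; real historical spelling variation needs far more than one rule.

```python
import re

# Toy corpus; the sentences are invented for illustration only.
documents = [
    "En quinna gick till torget.",
    "En kvinna gick till torget.",
    "Quinnan och mannen talade.",
]


def exact_search(term, docs):
    """Return the documents that contain the term exactly as typed (case-insensitive)."""
    return [d for d in docs if term.lower() in d.lower()]


# Hypothetical variant rule: the older "qu" spelling and the modern "kv" are treated alike.
VARIANT_PATTERN = re.compile(r"(kv|qu)inna", re.IGNORECASE)


def variant_search(docs):
    """Return the documents that match either spelling."""
    return [d for d in docs if VARIANT_PATTERN.search(d)]


exact_hits = exact_search("kvinna", documents)
variant_hits = variant_search(documents)
print(len(exact_hits), "hit(s) with the exact term")
print(len(variant_hits), "hit(s) when spelling variants are covered")
```

The exact search silently misses every text that uses the older spelling, and nothing in its output signals that anything was missed.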
26. Researcher/group
analyzing in detail
Individual
Individual text
With individual intent
Signal change
Signal
topic, cluster, vector…
Multiple texts –
dataset/corpus
Researcher/group
analyzing in detail
Text mining scenario
27. NLP
Evaluation
Information extraction
ICALL
Meaning change
Change in grammar, sentiment, argumentation
Temporal Information Retrieval
Macro analysis and signal change
Word sense induction
Word sense disambiguation
Role labeling
Event detection
IR
IE/signal/information
Word vectors/word matrices
tf-idf/mutual information
Models of grammar
Language Models
Language Technology / NLP
Lemmatization
Part of speech tagging
Parsing
Semantic enrichment (e.g., word sense disambiguation)
Extract temporal references in text
Link data
Filter (e.g., boilerplate, ads, recurrent data)
Clean words from noise
Normalize/remove stop words
Temporal references based on metadata
Pre-processing
Gather dataset
Include links
How long was each passage viewed
Metadata
Data collection
28. NLP
Text mining scenario
Smart search scenario
30. Clean much – keep much information
Tokenize
Remove low-frequency words
Remove veeeery high-frequency words
Tokens with little information
• Numbers, punctuation marks etc.
Remove capitalization
Normalize (é → e, eeee→e)
→ Choices all depend on application
and research question
Matter of economy:
• We cannot afford
to keep it all
• So we keep what gives us
most value (= information)
[Figure: information versus frequency]
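The cleaning steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a recommended pipeline: the thresholds `min_count` and `max_share`, the function names, and the sample text are all assumptions made for the example, and as the slide says, the right choices depend on the application and the research question.

```python
import re
import unicodedata
from collections import Counter


def normalize(token):
    """Lowercase, strip accents (é -> e) and collapse long letter runs (eeee -> e)."""
    token = token.lower()
    decomposed = unicodedata.normalize("NFD", token)
    token = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Only runs of three or more are collapsed, so "room" keeps its double "o".
    return re.sub(r"(.)\1{2,}", r"\1", token)


def preprocess(text, min_count=2, max_share=0.2):
    """Tokenize, normalize, and drop numbers, rare words and very frequent words."""
    # Tokenizing on word characters also removes punctuation marks.
    tokens = [normalize(t) for t in re.findall(r"\w+", text)]
    tokens = [t for t in tokens if not t.isdigit()]
    counts = Counter(tokens)
    total = len(tokens)
    return [
        t for t in tokens
        if counts[t] >= min_count and counts[t] / total <= max_share
    ]


# Invented sample text for illustration.
sample_text = (
    "The caf\u00e9 is veeeery nice. The cafe has 2 rooms, "
    "the rooms are nice, the staff is nice!"
)
result = preprocess(sample_text)
print(result)
```

Every threshold here throws information away. Raising `min_count` or lowering `max_share` changes which words survive, and with them what any later method is able to find.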
31. I like the room but not the sheet. (only verbs)
I like the room but not the sheet. (frequency filtering)
I like the room but not the sheet. (only nouns)
I like the room but not the sheet. (after lemmatization)
I like the room but not the sheets. (after stop word filtering)
I like the room but not the sheets.
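The filters named on this slide can be tried out on the same sentence. In the sketch below, the part-of-speech tags, the lemmas and the tiny stop word list are written by hand as assumptions; a real pipeline would obtain them from a tagger, a lemmatizer and a proper stop word list.

```python
# Hand-made annotations: (word, part-of-speech tag, lemma).
tagged = [
    ("I", "PRON", "I"),
    ("like", "VERB", "like"),
    ("the", "DET", "the"),
    ("room", "NOUN", "room"),
    ("but", "CCONJ", "but"),
    ("not", "PART", "not"),
    ("the", "DET", "the"),
    ("sheets", "NOUN", "sheet"),
]

# Tiny illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"i", "the", "but", "not"}

after_stop_words = [item for item in tagged if item[0].lower() not in STOP_WORDS]
after_lemmatization = [lemma for _, _, lemma in after_stop_words]
only_nouns = [lemma for _, pos, lemma in after_stop_words if pos == "NOUN"]
only_verbs = [lemma for _, pos, lemma in after_stop_words if pos == "VERB"]

print("after stop word filtering:", [word for word, _, _ in after_stop_words])
print("after lemmatization:", after_lemmatization)
print("only nouns:", only_nouns)
print("only verbs:", only_verbs)
```

With this stop word list the negation "not" disappears in the first step, so everything downstream sees "like", "room" and "sheet" without any trace of the complaint about the sheets.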
32. “3. Nouns. After a series of experiments, it was determined that the thematic
information in this corpus could best be captured by modeling only the remaining
nouns. Using the Stanford POS tagger, each word in each segment was marked up with
a part of speech indicator and all but the nouns were removed.”
Jockers and Mimno, Significant Themes in
19th-Century Literature
33. When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebrating his eleventy-first birthday
with a party of special magnificence, there was much talk and excitement in Hobbiton.
Bilbo was very rich and very peculiar, and had been the wonder of the Shire for sixty years, ever since his
remarkable disappearance and unexpected return. The riches he had brought back from his travels had now
become a local legend, and it was popularly believed, whatever the old folk might say, that the Hill at Bag End was
full of tunnels stuffed with treasure. And if that was not enough for fame, there was also his prolonged vigour to
marvel at. Time wore on, but it seemed to have little effect on Mr. Baggins. At ninety he was much the same as at
fifty. At ninety-nine they began to call him well-preserved, but unchanged would have been nearer the mark.
There were some that shook their heads and thought this was too much of a good thing; it seemed unfair that
anyone should possess (apparently) perpetual youth as well as (reputedly) inexhaustible wealth.
‘It will have to be paid for,’ they said. ‘It isn’t natural, and trouble will come of it!’
But so far trouble had not come; and as Mr. Baggins was generous with his money, most people were willing to
forgive him his oddities and his good fortune. He remained on visiting terms with his relatives (except, of course,
the Sackville-Bagginses), and he had many devoted admirers among the hobbits of poor and unimportant
families. But he had no close friends, until some of his younger cousins began to grow up.
The eldest of these, and Bilbo’s favourite, was young Frodo Baggins. When Bilbo was ninety-nine, he adopted
Frodo as his heir, and brought him to live at Bag End; and the hopes of the Sackville-Bagginses were finally
dashed. Bilbo and Frodo happened to have the same birthday, September 22nd. ‘You had better come and live
here, Frodo my lad,’ said Bilbo one day; ‘and then we can celebrate our birthday-parties comfortably together.’ At
that time Frodo was still in his tweens, as the hobbits called the irresponsible twenties between childhood and
coming of age at thirty-three.
37. In short, ladies and gentlemen, my
message today is that
data is gold. … Let's start mining it.
Neelie Kroes
Vice-President of the European Commission responsible for the
Digital Agenda, SPEECH/11/872, 2011
38. Is it true that data is gold?
39. same data
+ different methods
= different answers
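A small sketch of how the same data can give different answers: two common ways of asking "which word is most important in this text?" are applied to the same toy collection. The three texts and the function names are invented for illustration, and the tf-idf weighting used here (term count times the log of inverse document frequency) is only one of several variants.

```python
import math
from collections import Counter

# Invented toy collection.
texts = [
    "the room was clean and the bed in the room was soft",
    "the staff was rude and the food was cold",
    "the view was great and the pool was warm",
]
tokenized = [t.split() for t in texts]


def top_by_raw_frequency(doc_tokens):
    """Method 1: the most frequent word in the text."""
    return Counter(doc_tokens).most_common(1)[0][0]


def top_by_tf_idf(doc_tokens, all_docs):
    """Method 2: the word with the highest tf-idf score in the text."""
    n_docs = len(all_docs)
    counts = Counter(doc_tokens)

    def score(word):
        document_frequency = sum(1 for doc in all_docs if word in doc)
        return counts[word] * math.log(n_docs / document_frequency)

    return max(counts, key=score)


answer_1 = top_by_raw_frequency(tokenized[0])
answer_2 = top_by_tf_idf(tokenized[0], tokenized)
print("raw frequency:", answer_1)
print("tf-idf:", answer_2)
```

Neither answer is wrong. Each method measures something different, which is why a result has to be read together with the method that produced it.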
40. Since there is an infinite amount
of information in the text,
the text becomes infinitely
complex.
→ Currently, there are no
methods to mine all the
information
47. Text-mining method
Dimensions
Filtering: Function words
Filtering: Stopwords
Part-of-speech tagging
Lemmatization
Tokenization
NLP pipeline: From text to result
55. The better your method
(with respect to the information related to
your research question)
→ the better the pieces
[Figure: amount of information versus amount of text; label: text mining method]
59. Results and research questions
Research
question
Sometimes the results
do not answer
the research question in full
62. Research questions
Evidence
• Attack/demonstrations
• Homicide investigation
• Financial irregularities
• Data breach
Majority
• How well is our product received?
• Which of our issues are
most/least attractive to our
voters?
• How will people vote?
63. Digital research needs to be
evaluated on the combination
of data, method, and
research question
64. Truths about data-intensive research
Not all methods fit all data
Not all data fit all questions
Not all methods can answer all questions
Nothing stands alone;
everything must be evaluated together:
Hypothesis
Text mining
method
results
Text
(digital large-scale text)
68. Truths about data-intensive research (II)
Gives us the possibility to ask
new kinds of questions
Which kinds of questions fit
your purposes?
Hypothesis
Text mining
method
results
Text
(digital large-scale text)
71. ?
Store
Writer A
Male authors
Journal 1
Written language
Pharmacy
Writer B
Female authors
Journal 2
Spoken language
Are these different (significantly) or the same?
Sample 1 vs. Sample 2
H0 vs. H1
72. Inference requires
random selection
• Only if the selection is random,
can we use the sample to draw
conclusions about the world
• We almost NEVER have a random
sample in a textual corpus!
→ We cannot draw conclusions
about the world
73. When we have little data, the uncertainty
is large:
• Is A larger than B?
But when we have large data, we are more
certain about our observations. Still, our
errors can be much larger
• Because our selection is biased
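The difference between uncertainty and error can be simulated. The sketch below assumes an invented population in which exactly half of the texts have some property, and an invented selection bias in which texts with the property are twice as likely to end up in the collection. All numbers are made up for illustration.

```python
import math
import random

random.seed(0)

# Toy population: 5,000 texts with the property (1) and 5,000 without (0).
population = [1] * 5000 + [0] * 5000
true_share = sum(population) / len(population)


def share_and_standard_error(sample):
    """Observed share of 1s and its estimated standard error."""
    p = sum(sample) / len(sample)
    return p, math.sqrt(p * (1 - p) / len(sample))


# Small random sample: every text is equally likely to be selected.
small_random = random.sample(population, 400)

# Large biased sample: texts with the property are twice as likely to be
# selected (drawn with replacement to keep the sketch short).
weights = [2 if x == 1 else 1 for x in population]
large_biased = random.choices(population, weights=weights, k=5000)

p_small, se_small = share_and_standard_error(small_random)
p_large, se_large = share_and_standard_error(large_biased)

for name, p, se in [("small random", p_small, se_small),
                    ("large biased", p_large, se_large)]:
    print(f"{name}: share={p:.3f}, standard error={se:.3f}, "
          f"actual error={abs(p - true_share):.3f}")
```

The large sample reports a much smaller standard error, so it looks more trustworthy, yet its estimate is further from the true share. More data reduces the uncertainty but does nothing about the bias.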
74. In corpus studies, we frequently do have enough data, so
the fact that a relation between two phenomena is
demonstrably non-random, does not support the
inference that it is not arbitrary. Language is never,
ever, ever, random.
Adam Kilgarriff, 2005
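One way to see what "enough data" does to a significance test is to keep a small relative difference fixed and let the corpus grow. The sketch below uses the standard Pearson chi-square statistic for a 2x2 table; the word rates (105 versus 100 occurrences per million tokens) and the corpus sizes are invented for illustration.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


CRITICAL_VALUE = 3.841  # chi-square, 1 degree of freedom, 5% significance level


def compare(tokens_per_corpus):
    """Chi-square for a word at 105 vs. 100 occurrences per million tokens."""
    millions = tokens_per_corpus // 1_000_000
    in_a = 105 * millions  # occurrences of the word in corpus A
    in_b = 100 * millions  # occurrences of the word in corpus B
    return chi_square_2x2(in_a, tokens_per_corpus - in_a,
                          in_b, tokens_per_corpus - in_b)


small = compare(1_000_000)
large = compare(100_000_000)

for size, value in [("1 million tokens each", small),
                    ("100 million tokens each", large)]:
    verdict = "significant" if value > CRITICAL_VALUE else "not significant"
    print(f"{size}: chi-square={value:.2f} ({verdict})")
```

The relative difference is identical in both runs; only the amount of data changes. With large corpora a test like this will flag almost any difference, so "significant" alone says little about whether the difference matters.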
75. Method + Data = Results
result
81. NUMBER OF INDIVIDUALS WITH
DIFFERENT MATH SCORES 2016
Men
Women
Range of math scores
Source: Factfulness
82. Comparison of the same data
NUMBER OF INDIVIDUALS WITH
DIFFERENT MATH SCORES 2016
Men
Women
Source: Factfulness
85. Experimental design
Even when the math is right, we need to question the
selection and the grounds on which our conclusions rest.
• What is the corresponding number elsewhere?
• What are we measuring?
• Why will this answer our questions?
93. You can’t understand
the world without
numbers…
… and you cannot
understand it
only with numbers.
Prof. Hans Rosling, Factfulness