Digital Text and
Data-Intensive Research
Nina Tahmasebi, Associate Professor
University of Gothenburg
Digital Literacy | 2020-2021
Nina Tahmasebi, Digital Literacy, Sept. 2020
Centre for Digital Humanities (2018-2019)
Mathematics (B.Sc. & M.Sc.), 2003-2008
Computer / Data Science (PhD + postdoc), 2008-2014
NLP / Language Technology (Researcher, Associate Professor), 2014→
Views on text
Language
1010011010010
1001010010101
0011010010101
Data
Based on
• Tahmasebi, Nina, and Simon Hengchen. "The Strengths and Pitfalls of Large-Scale Text Mining for Literary Studies." Samlaren: tidskrift för svensk litteraturvetenskaplig forskning 140 (2019): 198-227.
• Tahmasebi, Nina, Niclas Hagen, Daniel Brodén, and Mats Malm. "A Convergence of Methodologies: Notes on a Data-intensive research methodology." DHN2019 (2019): 437-449.
When do we benefit from
computational methods?
A single physical
piece can be
studied in detail.
A few physical pieces
can be studied and
compared in detail.
Too many physical
pieces cannot be
treated manually.
From text to answers
text
text mining
method
research question
results
From text to answers
text
research question
text mining
method
results
Today’s outline
1. Digital Text
2. Data-intensive research methodology
3. Research results and interpretation
Digital Text
A book:
• Empty pages in the
beginning / end
• Large letter at the
beginning of each chapter
• Images?
Too many physical
pieces cannot be
treated manually.
Digital Text
Too many digital texts cannot
be studied in too much
detail either!
We need to ignore a lot of formatting
• White pages
• White space
• Fonts
• Capitalization of letters
• Etc…
Digital text
Printed texts
not available digitally
Printed texts
born digital
Other digital
publications
User-generated text
Edited text
Fewer errors of the kind
• OCR errors, thanks to modern
fonts
• Cleaner pages, younger age
• Modern language
Data of the kind:
• News
• Professional blogs
• Reviews
A lot of errors
• Spelling errors
• Grammatical errors
• Abbreviations
• Smileys
(automatic) Metadata
The older the text, the more
errors
• Paper of poor quality
• Different fonts
• Skewed columns
• (Spelling variations)
Corpus /dataset
• Corpus → linguistically oriented
• Dataset → any collection of text!
• Thematic
• Time periods
• Media types
• Genre
• …
• There are certain types of
questions that cannot be
answered by any text
Digital text
Individual
Individual text
With individual intent
Multiple texts –
dataset/corpus
Bits and pieces from
a large dataset
Researcher/group
analyzing in detail
Smart search scenario
Smart search/selection
• All interpretation and analysis is
left to the human
• Often, the correctness of each
individual bit is simple to verify
• But what happens when we have
millions of bits and pieces?
→ We still cannot study them manually
Researcher/group
analyzing in detail
Multiple texts –
dataset/corpus
Bits and pieces from
a large dataset
Sources of error:
• We made a bad model:
• E.g. Lost formatting
• Too many OCR errors in the text
→ We cannot find what we are looking
for
→ We find much more than we need
• What we are looking for
semantically is not covered by the
terms we use for search:
kvinna ≠ quinna (modern vs. older Swedish spelling of "woman")
• Other sources of error?
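The mismatch between a search term and a spelling variant (kvinna vs. quinna) can be illustrated with a minimal sketch. It assumes a small hypothetical vocabulary and uses approximate string matching from Python's standard `difflib` module as one possible way to surface variants. This is not the method used in the lecture, and the cutoff value is an arbitrary assumption.

```python
from difflib import get_close_matches

# Hypothetical vocabulary extracted from a historical corpus (illustrative only).
vocabulary = ["quinna", "qvinna", "kvinna", "kvinnor", "man"]

def exact_hits(term, vocab):
    """Plain search: only identical strings are found."""
    return [word for word in vocab if word == term]

def fuzzy_hits(term, vocab, cutoff=0.6):
    """Approximate search: also returns similar-looking strings.

    The cutoff (between 0 and 1) is an assumption; a lower value finds more
    variants but also more unrelated words that must be checked manually.
    """
    return get_close_matches(term, vocab, n=len(vocab), cutoff=cutoff)

exact = exact_hits("kvinna", vocabulary)
fuzzy = fuzzy_hits("kvinna", vocabulary)
print(exact)
print(fuzzy)
```

The exact search misses the older spellings, while the approximate search finds them but also returns other forms such as the plural, which mirrors the two failure modes above: not finding what we look for, and finding more than we need.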
Researcher/group
analyzing in detail
Multiple texts –
dataset/corpus
Bits and pieces from
a large dataset
Researcher/group
analyzing in detail
Individual
Individual text
With individual intent
Signal change
Signal
topic, cluster, vector…
Multiple texts –
dataset/corpus
Researcher/group
analyzing in detail
Text mining scenario
NLP
Evaluation
Information extraction
ICALL
Meaning change
Change in grammar, sentiment, argumentation
Temporal Information Retrieval
Macro analysis and signal change
Word sense induction
Word sense disambiguation
Role labeling
Event detection
IR
IE/signal/information
Word vectors/word matrices
TF-IDF / mutual information
Models of grammar
Language Models
Language Technology /NLP
Lemmatization
Part of speech tagging
Parsing
Semantic enrichment (e.g., word sense disambiguation)
Extract temporal references in text
Link data
Filter (text boilerplate, ads, recurrent data)
Clean words from noise
Normalize/remove stop words
Temporal references based on metadata
Pre-processing
Gather dataset
Include links
How long was each passage viewed
Metadata
Data collection
Text mining scenario
Smart search
scenario
Clean much – keep much information
Tokenize
Remove low-frequency words
Remove very high-frequency words
Tokens with little information
• Numbers, punctuation marks etc.
Remove capitalization
Normalize (é → e, eeee→e)
→ Choices all depend on application
and research question
Matter of economy:
• We cannot afford
to keep it all
• So we keep what gives us
most value (= information)
frequency
information
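The cleaning steps listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions: the example texts, the function names, and the frequency thresholds are all hypothetical, and each choice (for example stripping diacritics) would have to be reconsidered for a given language and research question.

```python
import re
import unicodedata
from collections import Counter

def tokenize(text):
    """Split text into word tokens; numbers and punctuation are dropped."""
    return re.findall(r"[^\W\d_]+", text)

def normalize(token):
    """Lowercase, strip diacritics (é -> e), collapse letter runs (eeee -> e)."""
    token = token.lower()
    decomposed = unicodedata.normalize("NFKD", token)
    token = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse three or more repeats of a character; a language-dependent choice.
    return re.sub(r"(.)\1{2,}", r"\1", token)

def filter_by_frequency(documents, min_count=2, max_share=0.3):
    """Keep tokens that are neither too rare nor too dominant in the corpus.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    counts = Counter(token for doc in documents for token in doc)
    total = sum(counts.values())
    return [
        [t for t in doc if counts[t] >= min_count and counts[t] / total <= max_share]
        for doc in documents
    ]

# Hypothetical mini corpus.
texts = ["The room was veeeery nice.", "The café near the room had 2 tables!"]
documents = [[normalize(t) for t in tokenize(text)] for text in texts]
filtered = filter_by_frequency(documents, min_count=2, max_share=0.2)
print(documents)
print(filtered)
```

With these thresholds only "room" survives: "the" is too frequent and every other word occurs once. Changing the thresholds changes what is kept, which is the economy trade-off described above.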
I like the room but not the sheet. (only verbs)
I like the room but not the sheet. (frequency filtering)
I like the room but not the sheet. (only nouns)
I like the room but not the sheet. (after lemmatization)
I like the room but not the sheets. (after stop word filtering)
I like the room but not the sheets.
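The different views of this sentence can be reproduced with a small sketch. The part-of-speech tags and lemmas below are annotated by hand (an assumption; a real pipeline would obtain them from a tagger and a lemmatizer), and the stop list is a hypothetical one chosen for illustration.

```python
# Hand-annotated (token, lemma, part of speech) triples; illustrative only.
SENTENCE = [
    ("I", "I", "PRON"), ("like", "like", "VERB"), ("the", "the", "DET"),
    ("room", "room", "NOUN"), ("but", "but", "CONJ"), ("not", "not", "ADV"),
    ("the", "the", "DET"), ("sheets", "sheet", "NOUN"),
]

# Hypothetical stop list; it includes "not", as some common stop lists do.
STOP_WORDS = {"i", "the", "but", "not"}

def only_pos(sentence, pos):
    """Keep only the tokens with the given part-of-speech tag."""
    return [token for token, _, tag in sentence if tag == pos]

def lemmas(sentence):
    """Replace every token with its lemma."""
    return [lemma for _, lemma, _ in sentence]

def without_stop_words(sentence):
    """Drop tokens that appear in the stop list."""
    return [token for token, _, _ in sentence if token.lower() not in STOP_WORDS]

only_verbs = only_pos(SENTENCE, "VERB")
only_nouns = only_pos(SENTENCE, "NOUN")
lemmatized = lemmas(SENTENCE)
no_stop_words = without_stop_words(SENTENCE)
print(only_verbs, only_nouns, lemmatized, no_stop_words, sep="\n")
```

Note that the stop-word view keeps "like", "room", and "sheets" but loses "not", so the negative opinion about the sheets is no longer recoverable from the filtered tokens.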
3. Nouns. After a series of experiments, it was determined that the thematic
information in this corpus could best be captured by modeling only the remaining
nouns. Using the Stanford POS tagger, each word in each segment was marked up with
a part of speech indicator and all but the nouns were removed.
Jockers and Mimno, Significant Themes in
19th-Century Literature
When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebrating his eleventy-first birthday
with a party of special magnificence, there was much talk and excitement in Hobbiton.
Bilbo was very rich and very peculiar, and had been the wonder of the Shire for sixty years, ever since his
remarkable disappearance and unexpected return. The riches he had brought back from his travels had now
become a local legend, and it was popularly believed, whatever the old folk might say, that the Hill at Bag End was
full of tunnels stuffed with treasure. And if that was not enough for fame, there was also his prolonged vigour to
marvel at. Time wore on, but it seemed to have little effect on Mr. Baggins. At ninety he was much the same as at
fifty. At ninety-nine they began to call him well-preserved, but unchanged would have been nearer the mark.
There were some that shook their heads and thought this was too much of a good thing; it seemed unfair that
anyone should possess (apparently) perpetual youth as well as (reputedly) inexhaustible wealth.
‘It will have to be paid for,’ they said. ‘It isn’t natural, and trouble will come of it!’
But so far trouble had not come; and as Mr. Baggins was generous with his money, most people were willing to
forgive him his oddities and his good fortune. He remained on visiting terms with his relatives (except, of course,
the Sackville-Bagginses), and he had many devoted admirers among the hobbits of poor and unimportant
families. But he had no close friends, until some of his younger cousins began to grow up.
The eldest of these, and Bilbo’s favourite, was young Frodo Baggins. When Bilbo was ninety-nine, he adopted
Frodo as his heir, and brought him to live at Bag End; and the hopes of the Sackville-Bagginses were finally
dashed. Bilbo and Frodo happened to have the same birthday, September 22nd. ‘You had better come and live
here, Frodo my lad,’ said Bilbo one day; ‘and then we can celebrate our birthday-parties comfortably together.’ At
that time Frodo was still in his tweens, as the hobbits called the irresponsible twenties between childhood and
coming of age at thirty-three.
Amount of
information
Amount of text
Text mining
method
In short, ladies and gentlemen, my
message today is that
data is gold. … Let's start mining it.
Neelie Kroes
Vice-President of the European Commission responsible for the
Digital Agenda, SPEECH/11/872 , 2011
Is it true that data is gold?
same data
+ different methods
= different answers
Since there is an infinite amount
of information in the text,
the text becomes infinitely
complex.
→ Currently, there are no
methods to mine all the
information
Data-intensive
research methodology
Traditional research methodology
Research
question
Text
Data-intensive research methodology
Research
question
Text
(digital large-scale text)
Data-intensive research methodology
Research
question
Text
(digital large-scale text)
Hypothesis
Data Hypothesis
Data Hypothesis
Hypothesis
Data-intensive research methodology
Text mining
method
Text
(digital large-scale text)
Text-mining method
Dimensions
Filtering: Function words
Filtering: Stopwords
Part-of-speech tagging
Lemmatization
Tokenization
NLP pipeline: From text to result
Hypothesis
Data-intensive research methodology
Text mining
method
results
Text
(digital large-scale text)
Results as a window to the text
Viewpoint on the data
The better your method
(WRT the information related to
your research question)
→ the better the pieces
Amount of information
Amount of text
Text mining
method
Hypothesis
Data-intensive research methodology
Text mining
method
results
Text
(digital large-scale text)
Research
question
Data-intensive research methodology
Hypothesis
Text mining
method
results
Text
(digital large-scale text)
Research
question
Data-intensive research methodology
results
results
results
Text mining
method
Text
(digital large-scale text)
Research
question
Results and research questions
Research
question
Sometimes the results
do not answer
the research question in full
Image: https://ipec.co.zw
Research questions
Evidence
• Attack/demonstrations
• Homicide investigation
• Financial irregularities
• Data breach
Majority
• How well is our product received?
• Which of our issues are
most/least attractive to our
voters?
• How will people vote?
Digital research needs to be
evaluated on the combination
of data, method, and
research question
Truths about data-intensive research
Not all methods fit all data
Not all data fit all questions
Not all methods can answer all questions
Nothing lives separately,
it must be evaluated together:
Hypothesis
Text mining
method
results
Text
(digital large-scale text)
Truths about data-intensive research (II)
Gives us the possibility to ask
new kinds of questions
Which kind of questions fit
your purposes?
Hypothesis
Text mining
method
results
Text
(digital large-scale text)
Results and
research questions
Hypothesis
Text mining
method
results
Text
(digital large-scale text)
Reduction vs. representation
digitization
preprocessing
method
hypothesis
choice
Are these different (significantly) or the same?
Sample 1 vs. Sample 2:
• Store vs. Pharmacy
• Writer A vs. Writer B
• Male authors vs. Female authors
• Journal 1 vs. Journal 2
• Written language vs. Spoken language
Hypotheses: H0 (the same), H1 (different)
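One way to ask whether two samples differ significantly is a permutation test, sketched below with the standard library only. The measurements are hypothetical (made-up relative frequencies of some word in texts by two writers), and the function name, number of permutations, and seed are assumptions. The test is one of several possible choices, and its result only supports conclusions beyond the two samples if the samples were selected randomly.

```python
import random
from statistics import mean

def permutation_test(sample_1, sample_2, n_permutations=2000, seed=0):
    """Estimate a p-value for H0: both samples come from the same distribution.

    The test statistic is the absolute difference in means.
    """
    rng = random.Random(seed)
    observed = abs(mean(sample_1) - mean(sample_2))
    pooled = list(sample_1) + list(sample_2)
    n1 = len(sample_1)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign the labels "sample 1" / "sample 2"
        diff = abs(mean(pooled[:n1]) - mean(pooled[n1:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical relative frequencies (per 1,000 words) of a word, one value per text.
writer_a = [1.1, 1.3, 0.9, 1.2, 1.0, 1.4, 1.2, 1.1]
writer_b = [2.1, 2.4, 1.9, 2.2, 2.0, 2.3, 2.5, 2.2]

p_different = permutation_test(writer_a, writer_b)
p_same = permutation_test(writer_a, writer_a)
print(p_different, p_same)
```

A small p-value speaks against H0 for these two samples; comparing a sample with itself gives a p-value of 1, since no relabeling can produce a smaller difference than zero.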
Inference requires
random selection
• Only if the selection is random,
can we use the sample to draw
conclusions about the world
• We almost NEVER have a random
sample in a textual corpus!
→ We cannot draw conclusions
about the world
Sample 1
Sample 2
random
inference
When we have little data, the uncertainty
is large:
• Is A larger than B?
But when we have a lot of data, we are more
certain about our observations. Still, our
errors can be much larger
• Because our selection is biased
Sample 1
Sample 2
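The contrast between a small random sample and a large biased one can be made concrete with a short simulation. Everything in it is a hypothetical assumption chosen for illustration: the size of the "world", the true share of texts with some property, and the inclusion probabilities of the biased selection.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical "world": 200,000 texts, roughly 30% of which have some property.
population = [1 if rng.random() < 0.3 else 0 for _ in range(200_000)]
true_share = sum(population) / len(population)

# Small but random sample of 200 texts.
random_sample = rng.sample(population, 200)
random_estimate = sum(random_sample) / len(random_sample)

# Large but biased sample: texts with the property are three times as likely
# to be included (for example, because they were more often digitized).
biased_sample = [x for x in population if rng.random() < (0.9 if x else 0.3)]
biased_estimate = sum(biased_sample) / len(biased_sample)

print(f"true share:      {true_share:.3f}")
print(f"random, n={len(random_sample)}: {random_estimate:.3f}")
print(f"biased, n={len(biased_sample)}: {biased_estimate:.3f}")
```

The biased sample is hundreds of times larger, so its estimate barely varies between runs, yet it settles far from the true share. The small random sample is noisier but centered on the right value.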
In corpus studies, we frequently do have enough data, so
the fact that a relation between two phenomena is
demonstrably non-random, does not support the
inference that it is not arbitrary. Language is never,
ever, ever, random.
Adam Kilgarriff, 2005
Method + Data = Results
result
result
hypothesis
Reject
1. Data
2. Method / Preprocessing
3. Hypothesis
result
hypothesis
Accept
1. Method
2. Correct interpretation of the results
result
hypothesis
Math results, average difference: Men vs. Women (figure; source: Factfulness)
Number of individuals with different math scores, 2016: Men vs. Women over the range of math scores (figure; source: Factfulness)
Comparison of the same data (figure; source: Factfulness)
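The point of comparing an average difference with the full distributions can be sketched numerically. The two normal distributions below use made-up parameters (they are not the figures from Factfulness), and modeling scores as normally distributed is itself an assumption. `NormalDist.overlap` from Python's standard `statistics` module returns the share of area the two distributions have in common.

```python
from statistics import NormalDist

# Hypothetical score distributions for two groups (made-up numbers).
group_a = NormalDist(mu=500, sigma=100)
group_b = NormalDist(mu=510, sigma=100)

# View 1: only the averages.
mean_difference = group_b.mean - group_a.mean

# View 2: the whole distributions; a value of 1.0 would mean identical.
overlap = group_a.overlap(group_b)

print(f"difference in averages: {mean_difference}")
print(f"overlap of distributions: {overlap:.2f}")
```

Plotted as two bars on a truncated axis, a difference of 10 points can look large. Seen as distributions, the two groups overlap almost completely, so the same data supports very different impressions.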
result
hypothesis
1. Method
2. Correct interpretation of the results
3. Where do the results live?
Experimental design
Even when the math is right, we need to question the
selection and the grounds on which our conclusions rest.
• What is the corresponding number elsewhere?
• What are we measuring?
• Why will this answer our questions?
Evaluation
Evaluation
individual
individual text
signal
topic, cluster, vector…
signal change
collective text
minimum optimum medium
Representativeness
Conclusions
Digital research needs to be
evaluated on the combination
of data, method, and
research question
Experimental design
• What is the corresponding number elsewhere?
• What are we measuring?
• Why will this answer our questions?
Prof. Hans Rosling
You can’t understand
the world without
numbers…
Factfulness
… and you cannot
understand it
only with numbers.
Thank you!
Nina.tahmasebi@gu.se
nina@tahmasebi.se
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Workshop on Digital Literacy - Digital text and data-intensive research

  • 1. Digital Text and Data-Intensive Research Nina Tahmasebi, Associate Professor University of Gothenburg Digital Literacy | 2020-2021 Nina Tahmasebi, Digital Literacy, Sept. 2020
  • 2. Centre for Digital Humanities (2018-2019); Mathematics (B.Sc. & M.Sc.), 2003-2008; Computer/Data Science (PhD + postdoc), 2008-2014; NLP / Language Technology (Researcher, Associate Professor), 2014→
  • 5. Based on: • Tahmasebi, Nina, and Simon Hengchen. "The Strengths and Pitfalls of Large-Scale Text Mining for Literary Studies." Samlaren: tidskrift för svensk litteraturvetenskaplig forskning 140 (2019): 198-227. • Tahmasebi, Nina, Niclas Hagen, Daniel Brodén, and Mats Malm. "A Convergence of Methodologies: Notes on a Data-intensive research methodology." DHN2019 (2019): 437-449.
  • 6. When do we benefit from computational methods?
  • 7. A single physical piece can be studied in detail. A few physical pieces can be studied and compared in detail. Too many physical pieces cannot be treated manually.
  • 10. From text to answers: text; text mining method; research question; results
  • 11. From text to answers: text; research question; text mining method; results
  • 12. Today’s outline: 1. Digital Text 2. Data-intensive research methodology 3. Research results and interpretation
  • 13. Digital Text
  • 14. A book: • Empty pages at the beginning / end • Large letter at the beginning of each chapter • Images?
  • 15. A single physical piece can be studied in detail. A few physical pieces can be studied and compared in detail. Too many physical pieces cannot be treated manually.
  • 16. Too many physical pieces cannot be treated manually. Digital Text
  • 17. Too many digital texts cannot be studied in TOO MUCH DETAIL either! We need to ignore a lot of formatting: • Blank pages • White space • Fonts • Capitalization of letters • Etc…
  • 19. Digital text: printed texts not available digitally; printed texts born digital; other digital publications; user-generated text; edited text. Fewer errors of the kind: • OCR errors, due to modern fonts • Less dirty pages, younger age • Modern language. Data of the kind: • News • Professional blogs • Reviews. A lot of errors: • Spelling errors • Grammatical errors • Abbreviations • Smileys. (Automatic) metadata. The older the text, the more errors: • Paper of bad quality • Different fonts • Skewed columns • (Spelling variations)
  • 21. Digital text: corpus / dataset • Corpus → linguistically oriented • Dataset → any collection of text! • Thematic • Time periods • Media types • Genre • … • There are certain types of questions that cannot be answered by any text
  • 22. Smart search scenario: individual; individual text with individual intent; multiple texts (dataset/corpus); bits and pieces from a large dataset; researcher/group analyzing in detail
  • 23. Smart search/selection • All interpretation and analysis is left to the human • Often, the correctness of each individual bit is simple to verify • But what happens when we have millions of bits and pieces? → We still cannot study them manually. Diagram: multiple texts (dataset/corpus); bits and pieces from a large dataset; researcher/group analyzing in detail
  • 25. Sources of error: • We made a bad model, e.g. lost formatting • Too many OCR errors in the text → We cannot find what we are looking for → We find much more than we need • What we are looking for semantically is not covered by the terms we use for search: kvinna ≠ quinna • Other sources of error? Diagram: multiple texts (dataset/corpus); bits and pieces from a large dataset; researcher/group analyzing in detail
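The search problem named above (kvinna ≠ quinna) can be illustrated with a small sketch. It is only one possible approach, and it assumes that a table of spelling variants is available. The table `SPELLING_VARIANTS`, the function `find_hits` and the toy documents are all hypothetical names and data invented for this example; they are not taken from the slides.

```python
import re

# Hypothetical variant table (an assumption for illustration); in practice it
# would come from a historical lexicon or be curated by a domain expert.
SPELLING_VARIANTS = {"kvinna": ["kvinna", "quinna", "qvinna"]}


def find_hits(term, documents, variants=None):
    """Return the indices of the documents that contain `term`.

    If a variant table is given, every listed spelling of the term counts.
    """
    forms = (variants or {}).get(term, [term])
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(form) for form in forms) + r")\b",
        re.IGNORECASE,
    )
    return [i for i, doc in enumerate(documents) if pattern.search(doc)]


# Toy documents, invented for this example.
documents = [
    "En quinna gick ut.",
    "En kvinna och en man.",
    "Ingen match.",
]

exact_hits = find_hits("kvinna", documents)
expanded_hits = find_hits("kvinna", documents, SPELLING_VARIANTS)
print("exact search:   ", exact_hits)
print("expanded search:", expanded_hits)
```

The exact search misses the document with the older spelling, while the expanded search finds it. A variant table only covers spellings someone has thought of, so OCR errors and unknown variants would still be missed.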
  • 26. Text mining scenario: individual; individual text with individual intent; multiple texts (dataset/corpus); signal (topic, cluster, vector…); signal change; researcher/group analyzing in detail
  • 28. NLP Evaluation; Information extraction; ICALL; Meaning change; Change in grammar, sentiment, argumentation; Temporal Information Retrieval; Macro analysis and signal change; Word sense induction; Word sense disambiguation; Role labeling; Event detection; IR; IE/signal/information; Word vectors/word matrices; tf-idf/mutual information; Models of grammar; Language Models; Language Technology / NLP; Lemmatization; Part-of-speech tagging; Parsing; Semantic enrichment (e.g., word sense disambiguation); Extract temporal references in text; Link data; Filter (text boilerplate, ads, recurrent data); Clean words from noise; Normalize/remove stop words; Temporal references based on metadata; Pre-processing; Gather dataset; Include links; How long was each passage viewed; Metadata; Data collection; Text mining scenario; Smart search scenario
  • 30. Clean much – keep much information: • Tokenize • Remove low-frequency words • Remove veeeery high-frequency words • Remove tokens with little information (numbers, punctuation marks, etc.) • Remove capitalization • Normalize (é → e, eeee → e) → Choices all depend on application and research question. Matter of economy: • We cannot afford to keep it all • So we keep what gives us most value (= information). Plot axes: frequency; information
  • 31. I like the room but not the sheets. The same sentence is shown after stop word filtering, after lemmatization (sheets → sheet), with only nouns, after frequency filtering, and with only verbs (the word highlighting is not preserved in the transcript).
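Some of the cleaning steps above (tokenize, remove capitalization, normalize, filter stop words) can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the pipeline used in the lecture: the stop word list `STOP_WORDS` is a tiny hand-written list, and the function names are invented here. Lemmatization and part-of-speech filtering are left out because they need a trained tagger or lexicon.

```python
import re
import unicodedata

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"i", "the", "but", "not", "a", "an", "and", "of"}


def tokenize(text):
    """Split text into word tokens, dropping punctuation and digits."""
    return re.findall(r"[^\W\d_]+", text)


def normalize(token):
    """Lowercase, strip accents (é → e), collapse runs of 3+ equal letters."""
    token = token.lower()
    decomposed = unicodedata.normalize("NFKD", token)
    token = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"(.)\1{2,}", r"\1", token)


def preprocess(text, remove_stop_words=True):
    """Tokenize and normalize a text, optionally filtering stop words."""
    tokens = [normalize(token) for token in tokenize(text)]
    if remove_stop_words:
        tokens = [token for token in tokens if token not in STOP_WORDS]
    return tokens


sentence = "I like the room but not the sheets."
print(preprocess(sentence, remove_stop_words=False))
print(preprocess(sentence))
```

With this stop word list the negation "not" is filtered out, so the cleaned sentence no longer says that the sheets were disliked. Whether that loss is acceptable depends on the application and the research question.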
  • 32. "3. Nouns. After a series of experiments, it was determined that the thematic information in this corpus could best be captured by modeling only the remaining nouns. Using the Stanford POS tagger, each word in each segment was marked up with a part of speech indicator and all but the nouns were removed." Jockers and Mimno, Significant Themes in 19th-Century Literature
  • 33. When Mr. Bilbo Baggins of Bag End announced that he would shortly be celebrating his eleventy-first birthday with a party of special magnificence, there was much talk and excitement in Hobbiton. Bilbo was very rich and very peculiar, and had been the wonder of the Shire for sixty years, ever since his remarkable disappearance and unexpected return. The riches he had brought back from his travels had now become a local legend, and it was popularly believed, whatever the old folk might say, that the Hill at Bag End was full of tunnels stuffed with treasure. And if that was not enough for fame, there was also his prolonged vigour to marvel at. Time wore on, but it seemed to have little effect on Mr. Baggins. At ninety he was much the same as at fifty. At ninety-nine they began to call him well-preserved, but unchanged would have been nearer the mark. There were some that shook their heads and thought this was too much of a good thing; it seemed unfair that anyone should possess (apparently) perpetual youth as well as (reputedly) inexhaustible wealth. ‘It will have to be paid for,’ they said. ‘It isn’t natural, and trouble will come of it!’ But so far trouble had not come; and as Mr. Baggins was generous with his money, most people were willing to forgive him his oddities and his good fortune. He remained on visiting terms with his relatives (except, of course, the Sackville-Bagginses), and he had many devoted admirers among the hobbits of poor and unimportant families. But he had no close friends, until some of his younger cousins began to grow up. The eldest of these, and Bilbo’s favourite, was young Frodo Baggins. When Bilbo was ninety-nine, he adopted Frodo as his heir, and brought him to live at Bag End; and the hopes of the Sackville-Bagginses were finally dashed. Bilbo and Frodo happened to have the same birthday, September 22nd. 
‘You had better come and live here, Frodo my lad,’ said Bilbo one day; ‘and then we can celebrate our birthday-parties comfortably together.’ At that time Frodo was still in his tweens, as the hobbits called the irresponsible twenties between childhood and coming of age at thirty-three.
  • 36. Amount of information; amount of text; text mining method
  • 37. "In short, ladies and gentlemen, my message today is that data is gold. … Let's start mining it." Neelie Kroes, Vice-President of the European Commission responsible for the Digital Agenda, SPEECH/11/872, 2011
  • 38. Is it true that data is gold?
  • 39. same data + different methods = different answers
  • 40. Since there is an infinite amount of information in the text, the text becomes infinitely complex. → Currently, there are no methods to mine all the information
  • 42. Traditional research methodology: research question; text
  • 43. Data-intensive research methodology: research question; text (digital large-scale text)
  • 44. Data-intensive research methodology: research question; text (digital large-scale text); hypothesis
  • 45. Data; hypothesis; data; hypothesis
  • 46. Data-intensive research methodology: hypothesis; text mining method; text (digital large-scale text)
  • 47. NLP pipeline, from text to result: tokenization; lemmatization; part-of-speech tagging; filtering: stopwords; filtering: function words; dimensions; text-mining method
  • 48. Data-intensive research methodology: hypothesis; text mining method; results; text (digital large-scale text)
  • 49. Results as a window to the text
  • 50. Viewpoint on the data
  • 55. The better your method (with respect to the information related to your research question) → the better the pieces. Amount of information; amount of text; text mining method
  • 56. Data-intensive research methodology: research question; hypothesis; text mining method; results; text (digital large-scale text)
  • 57. Data-intensive research methodology: research question; hypothesis; text mining method; results; text (digital large-scale text)
  • 58. Data-intensive research methodology: research question; text mining method; results, results, results; text (digital large-scale text)
  • 59. Results and research questions: sometimes the results do not answer the research question in full
  • 61. Image: https://ipec.co.zw
  • 62. Research questions. Evidence: • Attack/demonstrations • Homicide investigation • Financial irregularities • Data breach. Majority: • How well is our product received? • Which of our issues are most/least attractive to our voters? • How will people vote?
  • 63. Digital research needs to be evaluated on the combination of data, method, and research question
  • 64. Truths about data-intensive research: • Not all methods fit all data • Not all data fit all questions • Not all methods can answer all questions. Nothing lives separately, it must be evaluated together: hypothesis; text mining method; results; text (digital large-scale text)
  • 65. Truths about data-intensive research (II): it gives us the possibility to ask new kinds of questions. Hypothesis; text mining method; results; text (digital large-scale text)
  • 68. Truths about data-intensive research (II): it gives us the possibility to ask new kinds of questions. Which kind of questions fit your purposes? Hypothesis; text mining method; results; text (digital large-scale text)
  • 69. Results and research questions: hypothesis; text mining method; results; text (digital large-scale text)
  • 71. Are these (significantly) different or the same? Sample 1 vs. Sample 2: store vs. pharmacy; Writer A vs. Writer B; male authors vs. female authors; Journal 1 vs. Journal 2; written language vs. spoken language. H0; H1
  • 72. Inference requires random selection • Only if the selection is random can we use the sample to draw conclusions about the world • We almost NEVER have a random sample in a textual corpus! → We cannot draw conclusions about the world. Diagram: Sample 1; Sample 2; random; inference
  • 73. When we have little data, the uncertainty is large: • Is A larger than B? But when we have a lot of data, we are more certain about our observations. STILL, our errors can be much larger • Because our selection is biased
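The point about large but biased selections can be made concrete with a small simulation. Everything in it is an assumption made for illustration: the population is synthetic, the two groups and their means are invented, and the 90% share in `biased_sample` is arbitrary. The sketch compares a random sample with a biased one at two sample sizes and reports the mean together with its standard error.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Synthetic population (invented for illustration): two equally large groups
# whose values differ on average.
group_a = [rng.gauss(10.0, 2.0) for _ in range(50_000)]
group_b = [rng.gauss(14.0, 2.0) for _ in range(50_000)]
population = group_a + group_b
true_mean = statistics.fmean(population)


def estimate(sample):
    """Return the sample mean and its standard error."""
    mean = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / len(sample) ** 0.5
    return mean, std_error


def random_sample(n):
    """Draw n values uniformly at random from the whole population."""
    return rng.sample(population, n)


def biased_sample(n):
    """Draw 90% of the sample from group A, although A is half the population."""
    n_a = int(0.9 * n)
    return rng.sample(group_a, n_a) + rng.sample(group_b, n - n_a)


results = {}
for n in (100, 10_000):
    results[n] = {
        "random": estimate(random_sample(n)),
        "biased": estimate(biased_sample(n)),
    }
    for kind, (mean, std_error) in results[n].items():
        print(f"n={n:>6} {kind:>6}: mean={mean:.2f} ± {std_error:.2f} "
              f"(true mean {true_mean:.2f})")
```

With more data the standard error shrinks for both samples, so the biased estimate looks more and more certain while staying far from the true mean. More data reduces the random error but does nothing about the selection bias.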
  • 74. "In corpus studies, we frequently do have enough data, so the fact that a relation between two phenomena is demonstrably non-random, does not support the inference that it is not arbitrary." Adam Kilgarriff, "Language is never, ever, ever, random", 2005
  • 75. Method + Data = Results
  • 76. Result; hypothesis
  • 77. Reject (result vs. hypothesis): 1. Data 2. Method / preprocessing 3. Hypothesis
  • 78. Accept (result vs. hypothesis): 1. Method 2. Correct interpretation of the results
  • 79. Math results, average difference: men; women. Source: Factfulness
  • 80. Math results, average difference: men; women. Source: Factfulness
  • 81. Number of individuals with different math scores, 2016: men; women; range of math scores. Source: Factfulness
  • 82. Comparison of the same data: number of individuals with different math scores, 2016; men; women. Source: Factfulness
  • 83. Result vs. hypothesis: 1. Method 2. Correct interpretation of the results 3. Where do the results live?
  • 85. Experimental design: even when the math is right, we need to question the selection and the grounds on which our conclusions rest. • What is the corresponding number elsewhere? • What are we measuring? • Why will this answer our questions?
  • 86. Evaluation
  • 87. Evaluation: individual; individual text; signal (topic, cluster, vector…); signal change; collective text; minimum; medium; optimum
  • 89. Conclusions
  • 91. Digital research needs to be evaluated on the combination of data, method, and research question
  • 92. Experimental design • What is the corresponding number elsewhere? • What are we measuring? • Why will this answer our questions?
  • 93. "You can’t understand the world without numbers… and you cannot understand it only with numbers." Prof. Hans Rosling, Factfulness