MacroMicroZoom.pdf

Microscope, macroscope and zoom lens: close, distant and scalable reading in the Humanities
Digital Humanities Summer School, University of Oxford, 7th July 2023
Martin Wynne
Senior Researcher in Corpus Linguistics
Faculty of Linguistics, Philology and Phonetics
https://orcid.org/0000-0002-4155-0530
martin.wynne@ling-phil.ox.ac.uk

Summary
● What is text analysis?
● Close reading, distant reading, textual interpretation
● Corpus linguistics: the vanguard of digital humanities

Types of text analysis
● the study of rhetoric ('how language is used to persuade')
● close reading ('focus on the words')
● stylistics ('study of the language of literature')
● stylometry ('quantifying aspects of the language of texts, especially for authorship attribution and investigating genres')
● corpus linguistics ('developing and analysing large electronic datasets representative of particular language varieties')
● distant reading ('studying and doing things with more text than you can read')
● macroanalysis ('plotting features in large corpora over time')
● discourse analysis ('categorizing and analysing structural elements of discourse')
● critical corpus discourse analysis ('using corpus linguistic methods to reveal hidden agendas and motivations in texts')
● deconstruction ('what the words are not saying or failing to say')
● forensic linguistics ('gathering legal evidence about attribution and meaning')
● qualitative social science ('annotation and analysis of interviews, survey results, etc.')
● ...and more...

My meaning of text analysis
A diverse, open and fluid set of methods, datasets and tools, to be
used in support of a variety of research processes, with the aim of
interpreting texts

My sort of text analysis
...is not just for linguists, but also for literary scholars, historians, political scientists, sociologists, journalists, activists, forensic scientists…
...and it can be useful for all or any of them

“Text analysis tools aid the interpreter asking questions of texts”
Geoffrey Rockwell
https://web.archive.org/web/20150410205354/http://tada.mcmaster.ca/Main/WhatTA

Methods and Techniques
Search
- search large texts quickly
- search a collection of texts or corpus
- search enhanced by linguistic annotation
- complex searches
Analyze
- patterns of words
- collocations
- expanded co-text around words
- wordlists, keywords
- clusters, ngrams
Compare
- compare texts
- compare sections of a text with each other
- compare a text with a reference corpus
Visualize
- concordances
- distribution of features in a text or corpus
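
The "Analyze" techniques listed above (wordlists, clusters, ngrams) can be sketched in a few lines of Python. This is only a minimal illustration, assuming a deliberately crude tokenizer and an invented sample sentence; the function names are hypothetical and do not come from any of the tools discussed in these slides.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude tokenizer, for illustration only)."""
    return re.findall(r"[a-z']+", text.lower())

def wordlist(tokens, top=5):
    """Return the `top` most frequent word types with their counts."""
    return Counter(tokens).most_common(top)

def ngrams(tokens, n=2):
    """Count every run of n adjacent tokens (also called clusters)."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Invented sample text, not taken from any corpus
sample = "the cat sat on the mat and the cat slept"
tokens = tokenize(sample)

print(wordlist(tokens, top=2))
print(ngrams(tokens, 2).most_common(1))
```

Real tools add much more (proper tokenization, lemmatization, stop lists), but the underlying operation is counting.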

Close Reading
Elaine Showalter describes close reading
as:
“...slow reading, a deliberate attempt to
detach ourselves from the magical power of
story-telling and pay attention to language,
imagery, allusion, intertextuality, syntax
and form.”
It is, in her words, ‘a form of
defamiliarisation we use in order to break
through our habitual and casual reading
practices’ (Teaching Literature, p.98).
Further introductory reading:
● https://www.york.ac.uk/english/writing-at-york/writing-resources/close-reading/
● https://writingcenter.fas.harvard.edu/pages/how-do-close-reading
● http://theliterarylink.com/closereading.html

Close Reading
● Traditional criticism (biographical, social, historical, psychological)
...the new paradigm...
● Practical criticism, New Criticism (concentrate on the ‘words on the page’)
● Hermeneutics (theory and practice of interpretation)
● Interpretation (always provisional, never final)
● Inductive reasoning (not deductive and mathematical, based on experience and probabilities)
...the next new paradigm...
● New Historicism (literature must be understood in its historical context)
...
https://www.english.cam.ac.uk/classroom/pracrit.htm
https://www.oxfordbibliographies.com/view/document/obo-9780190221911/obo-9780190221911-0015.xml

Close reading as the paradigm for
text-based humanities scholarship

But what do you do with a million books?
There are only about 30,000 days in a human life -- at a book a
day, it would take 30 lifetimes to read a million books and our
research libraries contain more than ten times that number. Only
machines can read through the 400,000 books already publicly
available for free download from the Open Content Alliance.
 Gregory Crane, “What do you do with a million books?”
D-Lib Magazine, March 2006

And 5 million books?
We constructed a corpus of digitized texts containing about 4% of all books ever
printed. Analysis of this corpus enables us to investigate cultural trends
quantitatively. We survey the vast terrain of “culturomics” focusing on linguistic
and cultural phenomena that were reflected in the English language between
1800 and 2000. We show how this approach can provide insights about fields as
diverse as lexicography, the evolution of grammar, collective memory, the
adoption of technology, the pursuit of fame, censorship, and historical
epidemiology. “Culturomics” extends the boundaries of rigorous quantitative
inquiry to a wide array of new phenomena spanning the social sciences and the
humanities.
www.sciencexpress.org / 16 December 2010

Distant reading: where distance, let me
repeat it, is a condition of knowledge: it
allows you to focus on units that are much
smaller or much larger than the text:
devices, themes, tropes—or genres and
systems. And if, between the very small
and the very large, the text itself
disappears, well, it is one of those cases
when one can justifiably say, less is more.
If we want to understand the system in its
entirety, we must accept losing something.
We always pay a price for theoretical
knowledge: reality is infinitely rich;
concepts are abstract, are poor. But it’s
precisely this ‘poverty’ that makes it
possible to handle them, and therefore to
know. This is why less is actually more.
Franco Moretti, “Conjectures on World
Literature”, in Distant Reading, 2013.

Distant Reading
A canon of 200 novels, for instance,
sounds very large for 19th-century
Britain (and is much larger than the
current one), but is still less than 1% of
the novels that were actually published
[…] and close reading won’t help here, a
novel a day every day of the year would
take a century or so … And it’s not even
a matter of time, but of method: a field
this large cannot be understood by
stitching together separate bits of
knowledge about individual cases,
because it isn’t a sum of individual
cases: it’s a collective system, that
should be grasped as such, as a whole.
Franco Moretti, Graphs, Maps, Trees:
Abstract Models for Literary History, 2005

What are we ultimately aiming for
when it comes to digital scholarship in the Humanities?
Ways to combine close reading with
big data approaches.

From “distant” (not) reading to close reading and back again...
Digital Humanities as a locus for “scalable” reading practices
DATA: digitally assisted text analysis
Martin Mueller, Northwestern
What do you need to know in order to move to
interpretation?
1. You need to know what’s in your dataset.
2. You need to know how to find what you are looking for.
3. You need to know how to make sense of what you find.

Software tools
● AntConc
● Sketch Engine
● CQPweb
● #LancsBox
● English-corpora.org
● KonText
● Voyant Tools
● CliC
● Hansard at Huddersfield
● ...and more

Finding resources
● CLARIN Virtual Language Observatory (https://vlo.clarin.eu/)
● CLARIN Resource Families (https://www.clarin.eu/resource-families/)

Corpus Query Tools:
a CLARIN Resource Family
https://www.clarin.eu/resource-families/corpus-query-tools

The 'aftermath' of the seminar
Subject: Les Francais des Corpus – Aftermath
Dear colleagues,
First, many thanks for presenting at/attending the Francais des Corpus Workshop and for making it such a success.
I promised I would keep you in touch with one another and hope that the full list of your e-mail addresses above makes that possible.
…

KWIC concordance from Written BNC2014 generated in #LancsBox X
(a representative corpus of British English released in 2021).
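
A KWIC (key word in context) concordance lines up every hit of a search word with a fixed amount of text on either side. The sketch below shows the basic idea only; it is not how #LancsBox X works internally, the `kwic` function is a hypothetical helper, and the two sample sentences are invented.

```python
import re

def kwic(text, node, width=30):
    """Return (left context, hit, right context) for each occurrence of `node`."""
    lines = []
    pattern = r"\b" + re.escape(node) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left, m.group(0), right))
    return lines

# Invented sample text
sample = ("In the aftermath of the war, prices rose. "
          "The aftermath of the election was calmer.")

hits = kwic(sample, "aftermath", width=20)
for left, hit, right in hits:
    # Right-align the left context so the node word forms a column
    print(f"{left:>20} [{hit}] {right}")
```

Aligning the node word in a single column is what lets the eye pick out recurring patterns to its left and right.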

'aftermath'
Collocates:
War
Gulf
coup
World
disaster
Tiananmen
death
revolution
defeat
Chernobyl
affair
riots
battle
massacre
wars
election
Crisis
events
explosion
invasion
trial
fire
June
Square
victory
accident
attempt
Significant collocates in the British National Corpus
(a representative corpus of British English released in 1994).
BNCWeb parameters:
There are 1486 different types in your collocation database
for the query "[word="aftermath"%c] [word="of"%c]".
(Your query "aftermath of" returned 544 hits in 337 different texts)
The selected range was 1 to 4.
Corpus basis for calculation: the whole BNC.
Type of calculation: Log-likelihood
Tag restriction: any noun
Collocates occur at least 5 times in the whole BNC.
Words collocate at least 5 times.
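
The parameters above (a window of positions 1 to 4, a log-likelihood score, minimum frequencies) can be illustrated with a small sketch. It uses the standard log-likelihood (G2) statistic for a 2x2 contingency table and assumes the window lies to the right of the node word; BNCweb's exact implementation may differ, and the function names and toy sentence are invented for illustration.

```python
import math
from collections import Counter

def log_likelihood(o11, o12, o21, o22):
    """Log-likelihood (G2) for a 2x2 table of observed frequencies."""
    n = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    observed = ((o11, o12), (o21, o22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2 * total

def collocates(tokens, node, span=(1, 4), min_freq=1):
    """Score words found `span` positions to the right of `node` (assumed window)."""
    freq = Counter(tokens)
    window = Counter()
    window_size = 0
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        for offset in range(span[0], span[1] + 1):
            if i + offset < len(tokens):
                window[tokens[i + offset]] += 1
                window_size += 1
    scores = {}
    for word, o11 in window.items():
        if o11 < min_freq:
            continue
        o12 = window_size - o11                       # other words inside the windows
        o21 = max(freq[word] - o11, 0)                # this word outside the windows
        o22 = max(len(tokens) - window_size - o21, 0) # other words outside the windows
        scores[word] = log_likelihood(o11, o12, o21, o22)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Invented toy text; a real calculation needs a large corpus
tokens = "the aftermath of the war was grim and the aftermath of the coup was worse".split()
ranked = collocates(tokens, "aftermath")
scores = dict(ranked)
print(ranked[:3])
```

On a corpus this small the scores mean little; the point is only that a collocate list ranks words by how much more often they appear near the node than their overall frequency would predict.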

J. R. Firth (1890-1960)
“The complete meaning of a word is
always contextual, and no study of
meaning apart from context can be taken
seriously.”
J. R. Firth (1935). "The Technique of Semantics." Transactions of the Philological Society,
36-72; p. 37. (Reprinted in Firth (1957).)
“You shall know a word by the company
it keeps.”
J. R. Firth (1957). "Papers in Linguistics, 1934-1951". Oxford: Oxford University Press.

What is a corpus?
“…a collection of pieces of language, selected and
ordered according to explicit linguistic criteria in
order to be used as a sample of the language.”
(Sinclair 1996)

What is Corpus Linguistics?
(1) Focus on linguistic performance, rather than competence
(2) Focus on linguistic description, rather than linguistic universals
(3) Focus on quantitative, as well as qualitative models of language
(4) Focus on a more empiricist, rather than rationalist view of
scientific inquiry.
(Leech 1992)

AntConc: explore your own texts and corpora
● Download for free from https://www.laurenceanthony.net/software/antconc/
● Use with any 'plain' text
● Multilingual capabilities
● Does not interpret mark-up or metadata

#LancsBox
● Download for free from https://lancsbox.lancs.ac.uk/
● Works with your own data or existing corpora
● Visualizes language data
● Analyses data in any language
● Automatically annotates data for part-of-speech (for some languages)
● Wizard tool produces a prose report
● Works with major operating systems (Windows, Mac, Linux)
● Latest version #LancsBox X launched 2023

CQPweb:
Online interface for indexed corpora
http://cqpweb.lancs.ac.uk
...but now also with a new feature
to upload data, in limited ways...

Sketch Engine: an online interface for your corpus
https://www.sketchengine.eu/
Access to Sketch Engine is by paid subscription. Individual licences are available from €6.56
per month, with free trials available.

A new opportunity
"It is not easy to justify assertions about the alleged frequency or infrequency of
some particular belief or attitude in the past. How many examples does one need to
cite in order to prove the point? Lacking any satisfactory method of quantifying
these matters, all I can do is to record my impressions after long immersion in the
period."
Keith Thomas, The Ends of Life, Oxford University Press, 2009.
“But the sad truth is that much of what it has taken me a lifetime to build up by
painful accumulation can now be achieved by a moderately diligent student in the
course of a morning.”
Keith Thomas, Diary, London Review of Books, 10 June 2010.

Some (more or less) testable assertions
Tudor
● “The idea of a "Tudor era" in history is a misleading invention, claims an Oxford University historian. Cliff Davies says his research shows the term "Tudor" was barely ever used during the time of Tudor monarchs.” (http://www.bbc.co.uk/news/education-18240901, May 2012)
Holocaust
● “I will argue that “The Holocaust” is an ideological representation of the Nazi holocaust...Until recently, however, the Nazi holocaust barely figured in American life. Between the end of World War II and the late 60s, only a handful of books and films touched on the subject”. (Norman Finkelstein, The Holocaust Industry. Verso, 2000.)
State
● “...no political writer before the middle of the sixteenth century used the word 'state' in anything like its modern political sense” [referring to the machinery of government and social control] (Quentin Skinner, The Foundations of Modern Political Thought, Cambridge University Press, 1978).
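
Assertions like these become testable once a dated corpus is available: count how often the term occurs in each period, relative to the amount of text from that period. The sketch below shows one way this might be done, assuming each text comes with a year of publication; the function name is hypothetical and the three "texts" are invented toy data, so the output says nothing about real historical usage.

```python
import re
from collections import defaultdict

def relative_frequency_by_decade(dated_texts, term):
    """Occurrences of `term` per million tokens, grouped by decade."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    word = re.compile(r"[a-z]+")
    for year, text in dated_texts:
        decade = (year // 10) * 10
        tokens = word.findall(text.lower())
        totals[decade] += len(tokens)
        hits[decade] += tokens.count(term.lower())
    return {d: 1_000_000 * hits[d] / totals[d] for d in sorted(totals) if totals[d]}

# Invented toy data: (year, text) pairs, not real historical evidence
toy = [
    (1595, "the queen and her court"),
    (1598, "the house of tudor"),
    (1702, "the tudor age and the tudor kings"),
]

result = relative_frequency_by_decade(toy, "Tudor")
print(result)
```

Normalizing by the total number of tokens per decade matters because the amount of surviving text differs greatly from one period to another.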

06/07/23
Annotation
Annotation of texts should include structural markup, metadata, and linguistic
annotation, including:
- Standardized metadata for basic categories such as language, relevant dates,
author, title and text type;
- Part-of-speech tagging;
- Lemmatization; and
- Modernized (or otherwise normalized) forms
...and these can be the basis for further levels of annotation, such as:
- semantic tags
- named entity recognition
- etc.
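
The layers listed above can be pictured as a simple data structure: metadata attached to the text as a whole, and several annotations attached to each token. The sketch below is one hypothetical way to model this in Python; the class names, field names, tag labels and sample values are all assumptions made for illustration, not a standard such as TEI.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    form: str        # the word as it appears in the source
    normalized: str  # modernized (or otherwise normalized) spelling
    lemma: str       # dictionary headword
    pos: str         # part-of-speech tag (hypothetical tag labels)

@dataclass
class AnnotatedText:
    # Basic metadata categories named on the slide
    language: str
    date: str
    author: str
    title: str
    text_type: str
    tokens: list = field(default_factory=list)

    def find_lemma(self, lemma):
        """Return the original forms of every token with the given lemma."""
        return [t.form for t in self.tokens if t.lemma == lemma]

# Invented example with early modern style spellings
doc = AnnotatedText(
    language="en", date="1600", author="(unknown)",
    title="(sample)", text_type="verse",
    tokens=[
        Token("loue", "love", "love", "NOUN"),
        Token("is", "is", "be", "VERB"),
        Token("loued", "loved", "love", "VERB"),
    ],
)

print(doc.find_lemma("love"))
```

Searching by lemma rather than by surface form is what allows one query to find variant spellings and inflected forms together.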

Digital scholarship in the Humanities
and Digital Science
Issues and assumptions in scientific research:
● Consensus (and compromise) about funding priorities
● Adoption of technical standards
● Standards for the representation of knowledge and interpretations (agreement on concepts and categories!)
● Reproducibility and replicability of research
● Sharing of generic tools
● Curation of tools and data in professional service centres
● Support for software sustainability
● Promotion of interoperability of resources and tools
● Sharing research outputs
● Research leading to an accumulation of knowledge
● Increasingly data-driven research

CLARIN ERIC: members and centres
Official membership
• 23 members
• 3 observers
• 1 linked party
A distributed network of >60 centres
25 CTS certified data centres,
strong focus on FAIRness & interoperability
• federated login:
• central metadata harvesting for easy discovery:
• chained services:
• language data - in written, spoken, video or multimodal form
• advanced tools - to discover, explore, exploit, annotate, analyse
or combine data sets, wherever they are located

CLARIN corpus resources and tools
Corpora: at least 4130 - see VLO (https://vlo.clarin.eu/)!
Online interfaces:
● Corpuscle
● Korp
● KonText
● NoSketch Engine
● D* (Diacollo demo)
● TEITOK
Federated content search: https://contentsearch.clarin.eu/
Resource Families:
● 13 curated guides to different types of corpora and how to get them
● Coming soon: Desktop corpus tools and Online corpus tools

Online and desktop tools for corpus analysis
“Corpus, concordance, collocation”

Diachronic collocations in a text collection: DiaCollo from the Deutsches Textarchiv

Types of Text Analysis: Further Reading
● Baker, P (2006), Using Corpora in Discourse Analysis, London: Continuum [summary and further information at https://www.lancaster.ac.uk/staff/bakerjp/usingcorpora.htm]
● Baker, P (2012), ‘Acceptable Bias? Using Corpus Linguistics Methods with Critical Discourse Analysis’, Critical Discourse Studies 9.3: 247-56.
● Bode, K (2017), ‘The Equivalence of “Close” and “Distant” Reading; or, Toward a New Object for Data-Rich Literary History’, Modern Language Quarterly 78 (1): 77–106. DOI 10.1215/00267929-3699787
● Cheng, W (2013), ‘Corpus-based linguistic approaches to critical discourse analysis’, in The Encyclopedia of Applied Linguistics (pp. 1-8), Wiley-Blackwell. https://doi.org/10.1002/9781405198431.wbeal0262 [full book chapter available from https://www.researchgate.net/publication/262070226]
● Gadd, Ian, ‘The Use and Misuse of Early English Books Online’, Literature Compass 6/3 (2009): 680–692. https://doi.org/10.1111/j.1741-4113.2009.00632.x
● Hamed, D (2020), ‘Keywords and collocations in US presidential discourse since 1993: a corpus-assisted analysis’, Journal of Humanities and Applied Social Sciences, Vol. 3 No. 2, 2021, pp. 137-158, Emerald Publishing Limited 2632-279X. DOI 10.1108/JHASS-01-2020-0019
● Kichuk, Diana, ‘Metamorphosis: Remediation in Early English Books Online (EEBO)’, Literary and Linguistic Computing 22.3 (2007): 291–303 [available from https://hfroehlich.files.wordpress.com/2016/07/lit-linguist-computing-2007-kichuk-291-303.pdf]
● Leech, G. N., & Short, M. H. (1981), Style in Fiction, London: Longman.
● Mahlberg, M (2013), Corpus Stylistics and Dickens’s Fiction, Routledge.
● Martin, Shawn, ‘EEBO, Microfilm, and Umberto Eco: Historical Lessons and Future Directions for Building Electronic Collections’, Microform & Imaging Review 36.4 (2007): 159–64 [available from https://repository.upenn.edu/cgi/viewcontent.cgi?article=1072&context=library_papers]
● Showalter, E (2002), Teaching Literature, London: Wiley-Blackwell.
● Sinclair, J (1991), Corpus, Concordance, Collocation, Oxford: OUP.
● Rockwell, G (2005), ‘What is Text Analysis’ [https://web.archive.org/web/20150410205354/http://tada.mcmaster.ca/Main/WhatTA]
● Underwood, Ted (2015), ‘Seven ways humanists are using computers to understand text’ [blog post at https://tedunderwood.com/2015/06/04/seven-ways-humanists-are-using-computers-to-understand-text/]
● Unsworth, John, ‘How Not To Read A Million Books’, with Tanya Clement, Sara Steger, and Kirsten Uszkalo, Harvard University, Cambridge, MA (October 2008) [blog post at https://people.brandeis.edu/~unsworth/hownot2read.rutgers.html]
● Text Analysis in ‘Tooling up for Digital Humanities’ blog at http://toolingup.stanford.edu/?page_id=981
● More information about the Text Creation Partnership https://quod.lib.umich.edu/e/eebogroup/
