An overview of current approaches to the study of literature which make use of the techniques, tools and resources of corpus linguistics. Written and presented in 2008.
6. 6
What is corpus stylistics?
The use of the resources, tools and
methodologies of corpus linguistics to
carry out literary analysis on the basis
of the language of literature.
7. 7
Corpus Stylistics - Methods
Examining and analysing texts and
corpora
Comparing texts and corpora
Building and annotating resources
8. 8
“I'm just going out to
commit certain deeds.”
In an episode of The Simpsons, Homer has
planned with Moe to steal Moe's car and drive it
into the water, so that Moe can claim the
insurance money. Before Homer goes out to steal
the car, he is eating dinner with the family, and is
trying to act innocently, as if it is a normal
evening. He makes various mistakes, and when
he gets up to leave, he says, “I'm just going out to
commit certain deeds.”
9. 9
Consult a corpus to see how a
word / phrase / construction /
collocation 'normally' occurs.
For example, we can look at 'commit' and 'deeds'
in the British National Corpus, and try to answer
questions like “Why is this funny?”, and “Why are
commit and deeds the wrong words to use here?”
Requirements: access to a general reference
corpus and analysis tools (preferably online) for
concordance, collocation, cluster, distribution,
word frequency lists
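A minimal sketch of what a concordance and a collocate count do, assuming a plain-text string stands in for the corpus. The sample sentences are invented for illustration and are not taken from the BNC; the function names `kwic` and `collocates` are hypothetical, not part of any particular tool.

```python
import re
from collections import Counter

def kwic(text, node, width=30):
    """Return keyword-in-context lines for each whole-word match of `node`."""
    lines = []
    pattern = r"\b" + re.escape(node) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the node words line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

def collocates(text, node, span=2):
    """Count the words found within `span` tokens either side of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - span):i])
            counts.update(tokens[i + 1:i + 1 + span])
    return counts

# Invented toy sample, standing in for real corpus text.
sample = ("He was accused of trying to commit a crime. "
          "They commit fraud and commit perjury, but nobody commits good deeds.")

for line in kwic(sample, "commit"):
    print(line)
print(collocates(sample, "commit").most_common(3))
```

With a real reference corpus, the same two views would show which words typically follow 'commit', which is the evidence needed to explain why 'commit certain deeds' sounds odd.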
10. 10
Analyse an electronic
version of a literary text,
using text analysis tools.
How does author X use expression Y?
How often does she use Y?
Does she prefer another expression in certain
contexts?
In what parts of the novel / play / poem does
she tend to use Y?
Requirements: (reliable) electronic version of the
text (in an appropriate format), plus relevant tools
(preferably online)
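The questions above can be sketched in a few lines, assuming the text is available as a plain string and that expression Y is a single word. The function names and the toy text are hypothetical; splitting into equal-sized slices is only a stand-in for real chapters, acts or stanzas.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def distribution(text, word, parts=4):
    """Count `word` in each of `parts` consecutive, equal-sized slices."""
    tokens = tokenize(text)
    size = max(1, -(-len(tokens) // parts))  # ceiling division
    return [tokens[i * size:(i + 1) * size].count(word) for i in range(parts)]

# Invented toy text: 'dear' occurs only in the first half.
toy_text = "my dear sir " * 2 + "the rain fell " * 2

counts = distribution(toy_text, "dear", parts=4)
print("total:", sum(counts))
print("by part:", counts)
```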
11. 11
Analysing a literary corpus
Ask questions like those above, but across the
oeuvre of an author, or across a literary genre or
time period.
Furthermore, analyse variation in an author's work
(e.g. compare one novel with the rest)
Requirements: a relevant corpus, plus tools that
allow for internal comparisons
12. 12
Analysing an author's work
Clusters >4 words in Dickens -
among the top 25:
AS IF HE HAD BEEN
IN THE COURSE OF THE
A QUARTER OF AN HOUR
AT THE BOTTOM OF THE
WHAT DO YOU THINK OF
IN THE MIDDLE OF THE
AS IF IT HAD BEEN
AT THE TOP OF THE
ON THE OTHER SIDE OF
AT THE END OF THE
AS A MATTER OF COURSE
THE OTHER SIDE OF THE
UP AND DOWN THE ROOM
Categorisation of cluster types
(more than 4 words):
– Names and labels 21
– Speech 16
– “As if” 6
– Body parts 12
– Other 22
Mahlberg, M. “Corpus stylistics: bridging the gap
between linguistic and literary studies” In M. Hoey,
M. Mahlberg, M. Stubbs, W. Teubert. Text,
Discourse, and Corpora. London: Continuum. 2007.
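A minimal sketch of how repeated clusters (n-grams) like those listed above can be counted, assuming a plain-text string and a simple regular-expression tokeniser. This is not the software used in the cited study; the function name `clusters` and the sample sentences are invented for illustration.

```python
import re
from collections import Counter

def clusters(text, n=5, min_freq=2):
    """Return (cluster, frequency) pairs for n-word sequences that repeat."""
    tokens = re.findall(r"[a-z']+", text.lower())
    grams = Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    # most_common() sorts by descending frequency.
    return [(g, c) for g, c in grams.most_common() if c >= min_freq]

# Invented toy sample, not a quotation from Dickens.
toy = ("He looked as if he had been asleep. "
       "She spoke as if he had been away.")

for cluster, freq in clusters(toy):
    print(freq, cluster.upper())
```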
13. 13
Making internal
comparisons within a text
Comparing the speech of one character with the
rest, e.g. in Romeo and Juliet.
Comparing one act or scene with the rest.
Comparing the style of one section of a novel
with the rest.
Requirements: text processing tools to separate
text elements, or markup to tag text structure and
markup-aware tools, plus keywords software
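A minimal sketch of a markup-aware step, assuming each speech is wrapped in an `<sp who="...">` element. This tag scheme is a simplified, hypothetical one; real encoded editions usually add namespaces and more structure. Once the speech is grouped by character, one character's words can be compared with everyone else's.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def speech_by_character(xml_text):
    """Group the words of each speech under the speaker named in `who`."""
    root = ET.fromstring(xml_text)
    speeches = defaultdict(list)
    for sp in root.iter("sp"):
        words = " ".join(sp.itertext()).split()
        speeches[sp.get("who")].extend(words)
    return dict(speeches)

# A tiny fragment in the assumed markup scheme.
play = """<play>
  <sp who="ROMEO"><l>But soft, what light through yonder window breaks?</l></sp>
  <sp who="JULIET"><l>O Romeo, Romeo, wherefore art thou Romeo?</l></sp>
  <sp who="ROMEO"><l>It is the east, and Juliet is the sun.</l></sp>
</play>"""

by_speaker = speech_by_character(play)
romeo = by_speaker["ROMEO"]
# Everything not spoken by Romeo forms the comparison set.
rest = [w for who, ws in by_speaker.items() if who != "ROMEO" for w in ws]
print(len(romeo), "words for Romeo;", len(rest), "words for the rest")
```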
14. 14
Comparing a text to a
reference corpus
Compare the frequency, distribution and usage of
words in the text with a reference corpus.
E.g. A Connecticut Yankee in King Arthur's Court
by Mark Twain, compared to the British National
Corpus (BNC)
Requirements: many reference corpora, literary
and non-literary, different languages, genres, time
periods, etc.
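One common way to compare word frequencies in a text with a reference corpus is a keyness score. The sketch below uses the log-likelihood statistic, which is an assumption on our part: the slides do not say which statistic is meant. The function names and the toy token lists are hypothetical.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Keyness of one word.

    a, b: frequency of the word in the study text and the reference corpus.
    c, d: total number of tokens in each (both assumed to be above zero).
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in the study text
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(study_tokens, ref_tokens, top=10):
    """Rank words that are relatively more frequent in the study text."""
    study, ref = Counter(study_tokens), Counter(ref_tokens)
    c, d = len(study_tokens), len(ref_tokens)
    scored = [
        (word, log_likelihood(a, ref[word], c, d))
        for word, a in study.items()
        if a / c > ref[word] / d  # positive keywords only
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]

# Invented toy token lists, standing in for a novel and a reference corpus.
study = "knight knight knight the the".split()
reference = "the the the the car".split()
print(keywords(study, reference))
```

A word that is equally frequent in both scores 0; the higher the score, the more unusual the word's frequency in the text relative to the reference corpus.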
16. 16
Comparing a literary corpus
to a general reference corpus
Identifying and characterizing an author's style,
e.g. comparing all of Mark Twain's work with US
fiction in the period 1870-1910;
Identifying and characterizing literary style (of a
period, or genre, etc),
e.g. comparing a corpus of US fiction with a
corpus of non-fiction from the same period, or
comparing dramatic dialogue in plays with real
conversation in a spoken corpus.
Requirements: More literary corpora, more
reference corpora, more computing power!
17. 17
Tracing historical change
Diachronic studies of the language of literature,
studying language change, changes in style,
genre, etc.
Requirements: sets of historical literary corpora
of various time periods, or a diachronic corpus
which allows internal comparisons, or a collection
of texts (with dates) which can be cross-searched
18. 18
Annotating and manually
analyzing texts and
corpora
Can be used to test, refine and develop theories
about the language of literature.
Theories must be supported by textual evidence and
must account for all textual phenomena.
Frequencies and relative frequencies can be
calculated.
Requirements: lots of time, money and expertise!
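Raw counts from corpora of different sizes are not directly comparable, so frequencies are usually normalised to a common base. A minimal sketch, assuming a base of one million words; the function name and the figures in the example are hypothetical.

```python
def normalised_frequency(count, corpus_size, per=1_000_000):
    """Scale a raw count to a frequency per `per` words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return per * count / corpus_size

# Hypothetical figures: 50 annotated instances in a 200,000-word corpus.
print(normalised_frequency(50, 200_000))
```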
19. 19
Building and Annotating
The Speech, Thought and Writing Presentation Corpus
Elena Semino, Mick Short, Martin Wynne et al.
Lancaster University
Identifying, categorising and analysing the functions of all
occurrences of reported speech, thought and writing (e.g.
direct speech, indirect speech, free indirect speech, direct
thought, etc.) in a small corpus of fictional and non-fictional
texts (and later also speech)
20. 20
Building and annotating (2)
VICI
Free University of Amsterdam
Gerard Steen et al
Identifying and categorising metaphorical
expressions in a subset of the BNC;
analysing usage and distributions across text
types and modes
21. 21
Further types of analysis
More levels of annotation: parsing, semantic
tagging, etc.
Stylometry
Text mining
Multilingual, parallel, comparable, translation
corpora
Socio-cultural and historical investigations in literary
corpora
But note, please, that you don't need annotation for
many useful techniques!
Requirements: various!
22. 22
A new type of Shakespeare
dictionary: Jonathan Culpeper
A proposal for a dictionary of the language of Shakespeare, involving
better integration of linguistic description, frequency information and
non-linguistic information.
− How often does X occur?
− How often do the particular meanings of X occur?
− What kind of words does X tend to co-occur with?
− How often do the particular ‘grammatical categories’ of X occur?
− What kinds of register does X co-occur with?
− What kinds of speaker/addressee does X co-occur with?
− Is X part of a particular lexical field (semantic category) and how does
that field distribute across the plays?
− How can the above help differentiate word X from word Y?
− Etc.
(1) a particular theoretical approach to meanings, (2) a particular
methodology ….. enter Corpus Linguistics
23. 23
Using large-scale literary
corpora
For example, Matthew Jockers, Sarah Allison
and others at Stanford University are using large
collections of literary texts from commercial
providers and applying corpus linguistic and data
mining techniques to address literary research
questions
e.g. Joe Shapiro comparing quantity of narrative
v. descriptive passages in US 19th-century literature
Perhaps, particular potential for historical literary
and linguistic studies
24. 24
Basic methods: summary
1. Examine the norms in a general reference corpus
2. Perform text analysis on an electronic literary text
3. Make internal comparisons in a literary text
4. Analyse a literary corpus
5. Make internal comparisons in a literary corpus
6. Compare a text to a reference corpus
7. Compare a literary corpus to a non-literary corpus
8. Compare different literary corpora with each other
9. Build and annotate corpora
10. Others!
25. 25
Methods: conclusion
It is becoming increasingly possible to test
claims about the language of literature empirically,
to search for and provide evidence from texts, and
to establish the norms of literary and non-literary
style.
Stylistics typically makes use of a toolkit of
linguistic techniques, methods and resources.
Corpus stylistics will become a powerful addition
to this toolkit in the future.
26. 26
Resources for Corpus
Stylistics
What do we need?
● Reliable electronic editions of literary texts
● Relevant reference corpora
● Analysis tools
● Interoperability
● Shared access
● Sustainability
● Methodology
● Expertise
27. 27
Research Infrastructure
The vision is for a set of relevant texts, corpora and tools,
hosted in various locations around the world, available
online from the user's desktop, via a single sign-on; all
the resources and tools working together using
high-speed connections and high-performance computing.
Plus tools for showing, sharing and collaborating in a
virtual workspace.
CLARIN is working to build this infrastructure for the use
of language resources and technologies across the
humanities and social sciences.
28. 28
Links
Oxford Text Archive (OTA)
http://www.ota.ox.ac.uk/
PALA Corpus Stylistics Special Interest Group
http://www.pala.ac.uk/sigs/corpus-style/
Corpus-style mailing list
http://www.jiscmail.ac.uk/lists/corpus-style.html
Speech, Thought and Writing Presentation Project
http://bowland-files.lancs.ac.uk/stwp/
British National Corpus
http://www.natcorp.ox.ac.uk/
Brigham Young University Corpora from Mark Davies
http://corpus.byu.edu/