1. What might a corpus of spoken data tell us about language?
UCL Digital Humanities Seminar
November 15
Sean Wallis
Survey of English Usage
University College London
s.wallis@ucl.ac.uk
2. Outline
• What can a corpus tell us?
• The 3A cycle
• What can a parsed corpus tell us?
• ICE-GB and DCPSE
• Diachronic changes
– Modal shall/will over time
• Intra-structural priming
– NP premodification
• The value of interaction evidence
3. What can a corpus tell us?
• Three kinds of evidence may be obtained from a corpus:
– Frequency (distribution) evidence of a particular known linguistic event
– Coverage (discovery) evidence of new events
– Interaction evidence of the relationship between events
• But if these ‘events’ are lexical, this evidence can only really tell us about lexis
– So corpus linguistics has always involved annotation
4. The 3A cycle
• Plain text corpora
– evidence of lexical phenomena
• Need to annotate
– add knowledge of frameworks
– classify and relate phenomena
– general annotation scheme
• not focused on particular research goals
• Corpus research = the ‘3A’ cycle
– Annotation ↔ Abstraction ↔ Analysis
[Diagram: the 3A cycle. Annotation turns Text into a Corpus; Abstraction (data transformation, “operationalisation”) turns the Corpus into a Dataset; Analysis turns the Dataset into Hypotheses]
5. Annotation ↔ Abstraction
• Abstraction
– selects data from the annotated corpus
– maps it to a regular dataset for statistical analysis
– bi-directional (“concretisation”)
• allows us to interpret statistically significant results
• Even ‘lexical’ questions need annotation:
– e.g. first person declarative modal shall/will: abstraction relies on annotation (see the sketch below)
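A minimal sketch of this abstraction step, assuming a POS-tagged corpus held as (word, tag) pairs. The tag names ('MD', 'PNP') and the crude subject test are hypothetical illustrations, not the actual ICE-GB annotation scheme:

def count_shall_will(tagged_sentences):
    """Count first person declarative modal shall vs. will.

    tagged_sentences: list of sentences, each a list of (word, tag)
    pairs, where modal verbs carry the hypothetical tag 'MD'.
    """
    counts = {"shall": 0, "will": 0}
    for sentence in tagged_sentences:
        for i, (word, tag) in enumerate(sentence):
            # Crude stand-in for 'first person declarative': a modal
            # immediately preceded by a first person subject pronoun.
            if (tag == "MD" and word.lower() in counts
                    and i > 0 and sentence[i - 1][0].lower() in {"i", "we"}):
                counts[word.lower()] += 1
    return counts

# Usage (illustrative data, not corpus output):
counts = count_shall_will([[("I", "PNP"), ("shall", "MD"), ("go", "VB")]])
total = counts["shall"] + counts["will"]
p_shall = counts["shall"] / total if total else None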
6. What can a parsed corpus tell us?
• Three kinds of evidence may be obtained from a parsed corpus:
– Frequency evidence of a particular known rule, structure or linguistic event
– Coverage evidence of new rules, etc.
– Interaction evidence of the relationship between rules, structures and events
• BUT evidence is necessarily framed within a particular grammatical scheme
– So… (an obvious question) how might we evaluate this grammar?
7. What can a parsed corpus tell us?
• Parsed corpora contain (lots of) trees
– Use Fuzzy Tree Fragment (FTF) queries to get data, using ICECUP (Nelson et al. 2002)
[Figure: an FTF and a matching case in a corpus tree, displayed in ICECUP]
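ICECUP’s FTF engine is far richer (grammatical functions and features, flexible word-order and link constraints); the toy sketch below, with an entirely hypothetical tree encoding, only conveys the core idea of matching a partial tree anywhere in a parse tree:

def matches(fragment, node):
    """True if `fragment` matches at `node`; both are (label, children)
    tuples. Fragment children must match an in-order subsequence of the
    node's children, so intervening material is allowed ('fuzziness')."""
    flabel, fchildren = fragment
    nlabel, nchildren = node
    if flabel != nlabel:
        return False
    i = 0
    for fchild in fchildren:
        while i < len(nchildren) and not matches(fchild, nchildren[i]):
            i += 1
        if i == len(nchildren):
            return False
        i += 1
    return True

def find_all(fragment, tree):
    """Yield every subtree of `tree` that the fragment matches."""
    if matches(fragment, tree):
        yield tree
    for child in tree[1]:
        yield from find_all(fragment, child)

# Usage (illustrative): find NPs with an attributive AJP before a noun head.
tree = ("NP", [("AJP", [("ADJ", [])]), ("N", [])])
fragment = ("NP", [("AJP", []), ("N", [])])
hits = list(find_all(fragment, tree))  # one hit: the whole NP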
8. What can a parsed corpus tell us?
• Trees as a handle on data
– make useful distinctions
– retrieve cases reliably
– not necessary to “agree” with the framework used
• provided distinctions are meaningful
• Trees as a trace of the language production process
– interaction between decisions leaves a probabilistic effect on overall performance
• not simple to distinguish between sources
– depends on the framework
• but may also validate it
9. Why spoken corpora?
• Speech predates writing
– historically: literacy growth and spread
– child development: internal speech during writing
• Scale
– professional authors recommend 1,000 words/day
– 1 hour of speech ≈ 8,000 words (DCPSE)
• Spontaneity
– the production process is lost in writing: many written sources are edited
• Dialogue
– interaction between speakers
10. ICE-GB and DCPSE
• British Component of the International Corpus of English, ICE-GB (1990–92)
– 1 million words (nominal)
– 60% spoken, 40% written
– speech component is orthographically transcribed
– fully parsed
• marked up, POS-tagged, parsed, hand-corrected
• Diachronic Corpus of Present-day Spoken English (DCPSE)
– 800,000 words (nominal)
– orthographically transcribed and fully parsed
– created from subsamples of the LLC and ICE-GB
• matching numbers of texts in text categories
• not sampled over equal durations: LLC (1958–1977), ICE-GB (1990–1992)
11. Modal shall vs. will over time
• Plotting modal shall/will over time (DCPSE)
[Figure: p(shall | {shall, will}) by year, 1955–1995, with confidence intervals]
• Small amounts of data per year
• Confidence intervals identify the degree of certainty in our results
• Highly skewed p in some cases: p = 0 or 1 (circled in the plot)
• We can now estimate an approximate downward curve (Aarts et al. 2013)
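A minimal sketch of the per-year interval calculation, assuming the Wilson score interval mentioned later in these slides; the counts in the usage lines are illustrative, not DCPSE figures:

import math

def wilson(k, n, z=1.96):
    """Wilson score interval for k successes out of n (95% if z=1.96)."""
    if n == 0:
        return (0.0, 1.0)
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))

# Usage (illustrative counts, not DCPSE data): one year's observations.
shall, will = 4, 11
p = shall / (shall + will)            # p(shall | {shall, will})
lo, hi = wilson(shall, shall + will)  # error bar for that year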
12. Intra-structural priming
• Priming effects within a structure
– Study repeating an additive step in structures
• Consider
– a phrase or clause that may (in principle) be extended ad infinitum
• e.g. an NP with a noun head
– a single additive step applied to this structure
• e.g. add an attributive AJP before the head
– Q. What is the effect of repeatedly applying this operation to the structure?
[Diagram: successive attributive AJPs (tall, very green, old) added before the noun head ship]
13. NP premodification
• Sequential probability analysis (see the sketch below)
– calculate the probability of adding each AJP
– error bars: Wilson intervals
– probability falls
• second < first
• third < second
– decisions interact
– every AJP added makes it harder to add another
[Figure: probability of adding the first to fifth AJP, with Wilson interval error bars]
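A minimal sketch of the sequential probability analysis, assuming we already hold counts of NPs by number of attributive AJPs; the counts below are illustrative, not ICE-GB/DCPSE figures:

# The chance of adding one more AJP, given x AJPs already present, is
# estimated as
#   p(x+1 | x) = count(NPs with >= x+1 AJPs) / count(NPs with >= x AJPs).

def sequential_probabilities(np_counts):
    """np_counts[x] = number of NPs with exactly x attributive AJPs."""
    max_x = max(np_counts)
    at_least = [sum(v for x, v in np_counts.items() if x >= k)
                for k in range(max_x + 2)]
    return [at_least[k + 1] / at_least[k]
            for k in range(max_x + 1) if at_least[k] > 0]

# Usage (illustrative counts, not corpus figures):
probs = sequential_probabilities({0: 9000, 1: 900, 2: 70, 3: 4})
# probs[0] = p(first AJP), probs[1] = p(second | first), ...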
14. NP premodification: explanations?
• Feedback loop: for each successive AJP, it is more difficult to add a further AJP
• Possible explanations include:
– logical and semantic constraints
• we tend to say the tall green ship
• we do not tend to say tall short ship or green tall ship
– communicative economy
• once a speaker has said tall green ship, they tend to say just ship thereafter
– memory/processing constraints
• unlikely: the NP is a small structure, as are AJPs
15. NP premodification: speech vs. writing
• Spoken vs. written subcorpora
– same overall pattern
– spoken data tends to have fewer attributive AJPs
[Figure: probability of adding each successive AJP, written vs. spoken]
• Support for the communicative economy or memory/processing hypotheses?
– significance tests: paired 2x1 Wilson tests (Wallis 2011)
• the first and second observed spoken probabilities are significantly smaller than the written ones (a rough test sketch follows)
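Wallis (2011) specifies the exact tests used here. As a rough, hypothetical stand-in, the sketch below compares two independent proportions with a Newcombe-style difference interval built from Wilson bounds (significant if the interval excludes zero); the counts are illustrative, not corpus figures:

import math

def wilson(k, n, z=1.96):
    """Wilson score interval for k successes out of n."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (centre - half, centre + half)

def newcombe_wilson(k1, n1, k2, n2, z=1.96):
    """Interval for p1 - p2 built from the two Wilson intervals."""
    p1, p2 = k1 / n1, k2 / n2
    l1, u1 = wilson(k1, n1, z)
    l2, u2 = wilson(k2, n2, z)
    d = p1 - p2
    return (d - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2),
            d + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2))

# Usage (illustrative counts): written vs. spoken first-AJP probability.
lo, hi = newcombe_wilson(1200, 8000, 900, 9000)
significant = lo > 0 or hi < 0  # does the interval exclude zero?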
16. Potential sources of interaction
• shared context
– topic or ‘content words’ (e.g. Noriega)
• idiomatic conventions
– semantic ordering of attributive adjectives (tall green ship)
• logical-semantic constraints
– exclusion of incompatible adjectives (?tall short ship)
• communicative constraints
– brevity on repetition (just say ship next time)
• psycholinguistic processing constraints
– attention and memory of speakers
17. What use is interaction evidence?
• Corpus linguistics
– optimising existing grammar
• e.g. co-ordination, compound nouns
• Theoretical linguistics
– comparing different grammars for the same language
– comparing different languages or periods
• Psycholinguistics
– searching for evidence of language production constraints in spontaneous speech corpora
• speech and language therapy
• language acquisition and development
18. What can a parsed corpus tell us?
• Trees as a handle on data
– make useful distinctions
– retrieve cases reliably
– not necessary to “agree” with the framework used
• provided distinctions are meaningful
• Trees as a trace of the language production process
– interaction between decisions leaves a probabilistic effect on overall performance
• not simple to distinguish between sources
– results enabled by the framework
• but may also validate it
19. The importance of annotation
• Key element of the ‘3A’ cycle
– Annotation ↔ Abstraction ↔ Analysis
• Richer annotation
– more effective abstraction
– deeper research questions?
• Multiple layers of annotation
– new research questions
– studying interaction between layers
• Algorithmic vs. human annotation
20. More information
• Full paper
– Wallis, S.A. (2014) What might a corpus of parsed spoken data tell us about language? In L. Veselovská and M. Janebová (eds.) Complex Visibles Out There. Olomouc: Palacký University. pp. 641-662.
– Published at http://corplingstats.wordpress.com/2014/06/24/corpus
• References
– Aarts, B., Close, J. and Wallis, S.A. (2013) Choices over time: methodological issues in current change. In Aarts, Close, Leech and Wallis (eds.) The Verb Phrase in English. Cambridge: CUP.
– Nelson, G., Wallis, S.A. and Aarts, B. (2002) Exploring Natural Language. Amsterdam: John Benjamins.
– Wallis, S.A. (2011) Comparing χ² tests for separability of distribution and effect. London: Survey of English Usage.
– Published at http://corplingstats.wordpress.com/2012/03/31/comparing
21. More information
• Useful links
– Survey of English Usage
• www.ucl.ac.uk/english-usage
– Fuzzy Tree Fragments
• www.ucl.ac.uk/english-usage/resources/ftfs
– Author’s corpus linguistics statistics and methodology research blog
• http://corplingstats.wordpress.com