What might a corpus of spoken data tell us about language?
UCL Digital Humanities Seminar
November 15
Sean Wallis
Survey of English Usage
University College London
s.wallis@ucl.ac.uk
Outline
• What can a corpus tell us?
• The 3A cycle
• What can a parsed corpus tell us?
• ICE-GB and DCPSE
• Diachronic changes
– Modal shall/will over time
• Intra-structural priming
– NP premodification
• The value of interaction evidence
What can a corpus tell us?
• Three kinds of evidence may be obtained from a corpus
– Frequency (distribution) evidence of a particular known linguistic event
– Coverage (discovery) evidence of new events
– Interaction evidence of the relationship between events
• But if these ‘events’ are lexical, this evidence can only really tell us about lexis
– So corpus linguistics has always involved annotation
The 3A cycle
• Plain text corpora
– evidence of lexical phenomena
• Need to annotate
– add knowledge of frameworks
– classify and relate phenomena
– general annotation scheme
• not focused on particular research goals
• Corpus research = the ‘3A’ cycle
– Annotation ↔ Abstraction ↔ Analysis
[Diagram: Text, Corpus, Dataset and Hypotheses, linked by Annotation, Abstraction and Analysis; abstraction is labelled as a data transformation (“operationalisation”)]
Annotation ↔ Abstraction
• Abstraction
– selects data from annotated corpus
– maps it to a regular dataset for statistical analysis
– bi-directional (“concretisation”)
• allows us to interpret statistically significant results
• Even ‘lexical’ questions need annotation:
– e.g. 1st person declarative modal verb shall/will: the abstraction relies on annotation
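The abstraction step above can be sketched in a few lines. This is a minimal illustration, not the method used with ICE-GB or DCPSE: the `Token` fields, the tag names and the toy corpus are all invented for the example. It shows how selecting first person declarative shall/will depends on annotation (word class, subject person, clause type), and how keeping a case index lets us go back from the dataset to the corpus (“concretisation”).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One annotated token. All field names and tag values are hypothetical."""
    text_id: str
    year: int
    word: str
    pos: str          # word class, e.g. "MODAL" or "NOUN"
    subj_person: int  # person of the clause subject
    clause: str       # "declarative" or "interrogative"


# Invented toy corpus: not real ICE-GB or DCPSE data.
CORPUS = [
    Token("A01", 1960, "shall", "MODAL", 1, "declarative"),
    Token("A01", 1960, "will", "MODAL", 3, "declarative"),
    Token("A02", 1960, "shall", "MODAL", 1, "interrogative"),
    Token("B01", 1991, "will", "MODAL", 1, "declarative"),
    Token("B02", 1991, "will", "NOUN", 3, "declarative"),
    Token("B03", 1991, "shall", "MODAL", 1, "declarative"),
    Token("B03", 1991, "will", "MODAL", 1, "declarative"),
]


def abstract(corpus):
    """Select 1st person declarative modal shall/will and map each case to a row."""
    rows = []
    for index, tok in enumerate(corpus):
        if (tok.pos == "MODAL" and tok.word in {"shall", "will"}
                and tok.subj_person == 1 and tok.clause == "declarative"):
            # "case" points back into the corpus, so a result can be inspected in context.
            rows.append({"case": index, "year": tok.year, "choice": tok.word})
    return rows


dataset = abstract(CORPUS)
counts = Counter((row["year"], row["choice"]) for row in dataset)
print(counts)
```

A plain-text search for the strings shall and will would also return the noun will, third person subjects and interrogatives; the filter is only expressible because the annotation is there.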
What can a parsed corpus tell us?
• Three kinds of evidence may be obtained from a parsed corpus
– Frequency evidence of a particular known rule, structure or linguistic event
– Coverage evidence of new rules, etc.
– Interaction evidence of the relationship between rules, structures and events
• BUT evidence is necessarily framed within a particular grammatical scheme
– So… (an obvious question) how might we evaluate this grammar?
What can a parsed corpus tell us?
• Parsed corpora contain (lots of) trees
– Use Fuzzy Tree Fragment queries to get data
– An FTF
– A matching case in a tree
– Using ICECUP (Nelson et al., 2002)
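The idea of querying trees with a partial tree can be sketched as follows. This is not ICECUP's FTF implementation or query format: the `Node` class, the matching rule (pattern children must match some of a node's children in left-to-right order, with gaps allowed) and the example tree are simplifying assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A tree node, or a pattern node. A label of None in a pattern matches any label."""
    label: Optional[str]
    children: List["Node"] = field(default_factory=list)


def matches(pattern, node):
    """True if the pattern fits at this node (labels agree, children found in order)."""
    if pattern.label is not None and pattern.label != node.label:
        return False
    return _match_children(pattern.children, node.children)


def _match_children(pats, kids):
    """Match each pattern child to a distinct child, preserving order, allowing gaps."""
    if not pats:
        return True
    for i, kid in enumerate(kids):
        if matches(pats[0], kid) and _match_children(pats[1:], kids[i + 1:]):
            return True
    return False


def find(pattern, tree):
    """Return every node in the tree at which the pattern matches."""
    hits = [tree] if matches(pattern, tree) else []
    for child in tree.children:
        hits.extend(find(pattern, child))
    return hits


# Invented example: a clause with a bare subject NP and a premodified object NP.
tree = Node("CL", [
    Node("NP", [Node("N")]),
    Node("VP", [
        Node("V"),
        Node("NP", [
            Node("DET"),
            Node("AJP", [Node("ADJ")]),
            Node("AJP", [Node("ADV"), Node("ADJ")]),
            Node("N"),
        ]),
    ]),
])

np_with_ajp = Node("NP", [Node("AJP"), Node("N")])
print(len(find(np_with_ajp, tree)))
```

The pattern is “fuzzy” in the sense that it leaves most of the tree unspecified: it says only that an NP contains an AJP somewhere before a noun head.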
What can a parsed corpus tell us?
• Trees as handle on data
– make useful distinctions
– retrieve cases reliably
– not necessary to “agree” to framework used
• provided distinctions are meaningful
• Trees as trace of language production process
– interaction between decisions leaves a probabilistic effect on overall performance
• not simple to distinguish between sources
– depends on the framework
• but may also validate it
Why spoken corpora?
• Speech predates writing
– historically – literacy growth and spread
– child development – internal speech during writing
• Scale
– professional authors recommend 1,000 words/day
– 1 hour of speech ≈ 8,000 words (DCPSE)
• Spontaneity
– production process lost: many written sources edited
• Dialogue
– interaction between speakers
ICE-GB and DCPSE
• British Component of the International Corpus of English (1990-92)
– 1 million words (nominal)
– 60% spoken, 40% written
– speech component is orthographically transcribed
– fully parsed
• marked up, POS-tagged, parsed, hand-corrected
• Diachronic Corpus of Present-day Spoken English
– 800,000 words (nominal)
– orthographically transcribed and fully parsed
– created from subsamples of LLC and ICE-GB
• Matching numbers of texts in text categories
• Not sampled over equal duration
– LLC (1958-1977) – ICE-GB (1990-1992)
Modal shall vs. will over time
• Plotting modal shall/will over time (DCPSE)
[Figure: p(shall | {shall, will}) by year, 1955 to 1995; y-axis from 0.0 to 1.0]
• Small amounts of data / year
• Confidence intervals identify the degree of certainty in our results
• Highly skewed p in some cases
– p = 0 or 1 (circled)
• We can now estimate an approximate downwards curve (Aarts et al., 2013)
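A confidence interval on an observed proportion such as p(shall | {shall, will}) can be sketched as below. The sketch assumes the Wilson score interval, which the deck names for the NP premodification plot; that the same interval was used for this plot is an assumption. The yearly counts are invented for illustration and are not DCPSE figures.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a proportion; z = 1.959964 gives a 95% interval."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point rounding at p = 0 or p = 1.
    return max(0.0, centre - half), min(1.0, centre + half)


# Invented (shall, will) counts per year: small samples, one with p = 0.
counts_by_year = {1960: (4, 6), 1975: (0, 5), 1991: (3, 17)}

for year, (shall, will) in sorted(counts_by_year.items()):
    n = shall + will
    low, high = wilson_interval(shall, n)
    print(f"{year}: p = {shall / n:.2f}, 95% interval = ({low:.2f}, {high:.2f})")
```

One reason to prefer this interval for skewed data: when p is 0 or 1 it still has non-zero width, and it never extends outside the range from 0 to 1.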
Intra-structural priming
• Priming effects within a structure
– Study repeating an additive step in structures
• Consider
– a phrase or clause that may (in principle) be extended ad infinitum
• e.g. an NP with a noun head
– a single additive step applied to this structure
• e.g. add an attributive AJP before the head
– Q. What is the effect of repeatedly applying this operation to the structure?
[Diagram: NP with noun head ship (N); attributive AJPs (tall, very green, old) are added one step at a time]
NP premodification
• Sequential probability analysis
– calculate probability of adding each AJP
– error bars: Wilson intervals
– probability falls
• second < first
• third < second
– decisions interact
– Every AJP added makes it harder to add another
[Figure: probability of adding each successive AJP; x-axis from 0 to 5, y-axis (probability) from 0.00 to 0.20]
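The sequential probability calculation can be sketched as follows, under one assumption about the definition: the probability of adding the x-th AJP is taken to be the number of NPs with at least x attributive AJPs divided by the number with at least x − 1. The function names and the counts are invented for illustration; they are not ICE-GB figures.

```python
def cumulate(exact):
    """Turn counts of NPs with exactly k AJPs into counts with at least k AJPs."""
    return [sum(exact[k:]) for k in range(len(exact))]


def sequential_probabilities(at_least):
    """p(x) = F(x) / F(x - 1): the chance of adding AJP number x, given x - 1 so far."""
    probs = []
    for previous, current in zip(at_least, at_least[1:]):
        if previous == 0:
            break  # nothing left to condition on
        probs.append(current / previous)
    return probs


# Invented counts of NPs with exactly 0, 1, 2 and 3 attributive AJPs.
exact_counts = [8100, 1670, 210, 20]

at_least = cumulate(exact_counts)
probs = sequential_probabilities(at_least)

for x, p in enumerate(probs, start=1):
    print(f"p({x}) = {p:.3f}")
```

If each decision to add an AJP were independent of the previous ones, these probabilities would be roughly constant; a steady fall is what the slide describes as decisions interacting.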
NP premodification: explanations?
• Feedback loop: for each successive AJP, it is more difficult to add a further AJP
• Possible explanations include:
– logical and semantic constraints
• tend to say the tall green ship
• do not tend to say tall short ship or green tall ship
– communicative economy
• once a speaker has said tall green ship, they tend to say only ship
– memory/processing constraints
• unlikely: this is a small structure, as are AJPs
NP premod’n: speech vs. writing
• Spoken vs. written subcorpora
– Same overall pattern
– Spoken data tends to have fewer attributive AJPs
• Support for communicative economy or memory/processing hypotheses?
– Significance tests
• Paired 2x1 Wilson tests (Wallis 2011)
• first and second observed spoken probabilities are significantly smaller than written
[Figure: probability of adding each successive AJP, written vs. spoken; x-axis from 0 to 5, y-axis (probability) from 0.00 to 0.25]
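One way to check whether two observed probabilities differ significantly is sketched below. It uses the Newcombe-Wilson interval for the difference between two independent proportions, which is built from Wilson intervals; it is a stand-in chosen for illustration and is not necessarily the paired 2x1 Wilson test the slide cites. The counts are invented.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a proportion (95% by default)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def newcombe_wilson_difference(s1, n1, s2, n2, z=1.959964):
    """Return (d, lower, upper) for d = p1 - p2, using Newcombe's Wilson-based interval."""
    p1, p2 = s1 / n1, s2 / n2
    l1, u1 = wilson_interval(s1, n1, z)
    l2, u2 = wilson_interval(s2, n2, z)
    d = p1 - p2
    lower = d - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return d, lower, upper


# Invented counts: NPs with at least one AJP out of all NPs, per subcorpus.
written = (1200, 5000)
spoken = (1000, 5000)

d, lower, upper = newcombe_wilson_difference(*written, *spoken)
significant = lower > 0 or upper < 0  # the interval excludes zero
print(f"difference = {d:.3f}, interval = ({lower:.3f}, {upper:.3f}), significant: {significant}")
```

The difference is treated as significant at the chosen level when the interval excludes zero.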
Potential sources of interaction
• shared context
– topic or ‘content words’ (Noriega)
• idiomatic conventions
– semantic ordering of attributive adjectives (tall green ship)
• logical-semantic constraints
– exclusion of incompatible adjectives (?tall short ship)
• communicative constraints
– brevity on repetition (just say ship next time)
• psycholinguistic processing constraints
– attention and memory of speakers
What use is interaction evidence?
• Corpus linguistics
– Optimising existing grammar
• e.g. co-ordination, compound nouns
• Theoretical linguistics
– Comparing different grammars, same language
– Comparing different languages or periods
• Psycholinguistics
– Search for evidence of language production constraints in spontaneous speech corpora
• speech and language therapy
• language acquisition and development
What can a parsed corpus tell us?
• Trees as handle on data
– make useful distinctions
– retrieve cases reliably
– not necessary to “agree” to framework used
• provided distinctions are meaningful
• Trees as trace of language production process
– interaction between decisions leaves a probabilistic effect on overall performance
• not simple to distinguish between sources
– results enabled by the framework
• but may also validate it
The importance of annotation
• Key element of a ‘3A cycle’
– Annotation ↔ Abstraction ↔ Analysis
• Richer annotation
– more effective abstraction
– deeper research questions?
• Multiple layers of annotation
– new research questions
– studying interaction between layers
• Algorithmic vs. human annotation
More information
• Full paper
Wallis, S.A. (2014) What might a corpus of parsed spoken data tell us about language? In L. Veselovská and M. Janebová (eds.) Complex Visibles Out There. Olomouc: Palacký University. pp 641-662.
• Published at http://corplingstats.wordpress.com/2014/06/24/corpus
• References
Aarts, B., Close, J. and Wallis, S.A. (2013) Choices over time: methodological issues in current change. In Aarts, Close, Leech and Wallis (eds.) The Verb Phrase in English. CUP.
Nelson, G., Wallis, S.A. and Aarts, B. (2002) Exploring Natural Language. Amsterdam: John Benjamins.
Wallis, S.A. (2011) Comparing χ² tests for separability of distribution and effect. London: Survey of English Usage.
• Published at http://corplingstats.wordpress.com/2012/03/31/comparing
More information
• Useful links
– Survey of English Usage
• www.ucl.ac.uk/english-usage
– Fuzzy Tree Fragments
• www.ucl.ac.uk/english-usage/resources/ftfs
– Author’s corpus linguistics statistics and methodology research blog
• http://corplingstats.wordpress.com

Contenu connexe

Similaire à What might a spoken corpus tell us about language

Writing and presenting literature review
Writing and presenting literature reviewWriting and presenting literature review
Writing and presenting literature review
ansarikharkovi
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
Abdullah al Mamun
 

Similaire à What might a spoken corpus tell us about language (20)

11 - qualitative research data analysis ( Dr. Abdullah Al-Beraidi - Dr. Ibrah...
11 - qualitative research data analysis ( Dr. Abdullah Al-Beraidi - Dr. Ibrah...11 - qualitative research data analysis ( Dr. Abdullah Al-Beraidi - Dr. Ibrah...
11 - qualitative research data analysis ( Dr. Abdullah Al-Beraidi - Dr. Ibrah...
 
https://www.slideshare.net/amaresimachew/hot-topics-132093738
https://www.slideshare.net/amaresimachew/hot-topics-132093738https://www.slideshare.net/amaresimachew/hot-topics-132093738
https://www.slideshare.net/amaresimachew/hot-topics-132093738
 
Writing and presenting literature review
Writing and presenting literature reviewWriting and presenting literature review
Writing and presenting literature review
 
Writing a scientific manuscript
Writing a scientific manuscriptWriting a scientific manuscript
Writing a scientific manuscript
 
Chapter Three of Your Thesis.ppt
Chapter Three of Your Thesis.pptChapter Three of Your Thesis.ppt
Chapter Three of Your Thesis.ppt
 
Chapter Three of Your Thesis.ppt
Chapter Three of Your Thesis.pptChapter Three of Your Thesis.ppt
Chapter Three of Your Thesis.ppt
 
Multi task learning stepping away from narrow expert models 7.11.18
Multi task learning stepping away from narrow expert models 7.11.18Multi task learning stepping away from narrow expert models 7.11.18
Multi task learning stepping away from narrow expert models 7.11.18
 
OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)
OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)
OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)
 
Temporal Web Dynamics and Implications for Information Retrieval
Temporal Web Dynamics and Implications for Information RetrievalTemporal Web Dynamics and Implications for Information Retrieval
Temporal Web Dynamics and Implications for Information Retrieval
 
Text Representations for Deep learning
Text Representations for Deep learningText Representations for Deep learning
Text Representations for Deep learning
 
Scientific and Technical Translation in English: Week 2
Scientific and Technical Translation in English: Week 2Scientific and Technical Translation in English: Week 2
Scientific and Technical Translation in English: Week 2
 
What is word2vec?
What is word2vec?What is word2vec?
What is word2vec?
 
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...
 
The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...
The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...
The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...
 
How to share useful data
How to share useful dataHow to share useful data
How to share useful data
 
Research 101: Academic Writing Style .
Research 101: Academic Writing Style   .Research 101: Academic Writing Style   .
Research 101: Academic Writing Style .
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
time_series.pptx
time_series.pptxtime_series.pptx
time_series.pptx
 
Hacks for academic writing
Hacks for academic writingHacks for academic writing
Hacks for academic writing
 
From Semantics to Self-supervised Learning for Speech and Beyond
From Semantics to Self-supervised Learning for Speech and BeyondFrom Semantics to Self-supervised Learning for Speech and Beyond
From Semantics to Self-supervised Learning for Speech and Beyond
 

Plus de UCLDH

Plus de UCLDH (20)

Neil Tarrant Defining Nature’s Limits 9 March 2022.pptx
Neil Tarrant Defining Nature’s Limits 9 March 2022.pptxNeil Tarrant Defining Nature’s Limits 9 March 2022.pptx
Neil Tarrant Defining Nature’s Limits 9 March 2022.pptx
 
Archiving the Medici: History and Future (1370s-2020s)
Archiving the Medici: History and Future (1370s-2020s)Archiving the Medici: History and Future (1370s-2020s)
Archiving the Medici: History and Future (1370s-2020s)
 
The Pleasures and Sorrows of digitising primary source collections: The Case ...
The Pleasures and Sorrows of digitising primary source collections: The Case ...The Pleasures and Sorrows of digitising primary source collections: The Case ...
The Pleasures and Sorrows of digitising primary source collections: The Case ...
 
CVT Connect: Co-producing a digital platform for people with learning disabil...
CVT Connect: Co-producing a digital platform for people with learning disabil...CVT Connect: Co-producing a digital platform for people with learning disabil...
CVT Connect: Co-producing a digital platform for people with learning disabil...
 
The opportunity of accessibility: increasing impact and improving the user ex...
The opportunity of accessibility: increasing impact and improving the user ex...The opportunity of accessibility: increasing impact and improving the user ex...
The opportunity of accessibility: increasing impact and improving the user ex...
 
National Trust 'For Everyone' strategy
National Trust 'For Everyone' strategyNational Trust 'For Everyone' strategy
National Trust 'For Everyone' strategy
 
Digital Lives of People with Learning Disabilities
Digital Lives of People with Learning DisabilitiesDigital Lives of People with Learning Disabilities
Digital Lives of People with Learning Disabilities
 
Digital Content and Disability - The Librarian Perspective
Digital Content and Disability - The Librarian PerspectiveDigital Content and Disability - The Librarian Perspective
Digital Content and Disability - The Librarian Perspective
 
SensusAccess: Alternate Media Made Easy
SensusAccess: Alternate Media Made EasySensusAccess: Alternate Media Made Easy
SensusAccess: Alternate Media Made Easy
 
Accessible Publishing
Accessible PublishingAccessible Publishing
Accessible Publishing
 
“It is Time for the Slaves to Speak”: Transatlantic Abolitionism and African ...
“It is Time for the Slaves to Speak”: Transatlantic Abolitionism and African ...“It is Time for the Slaves to Speak”: Transatlantic Abolitionism and African ...
“It is Time for the Slaves to Speak”: Transatlantic Abolitionism and African ...
 
Oceanic Exchanges presentation
Oceanic Exchanges presentationOceanic Exchanges presentation
Oceanic Exchanges presentation
 
Digital Face project presentation
Digital Face project presentationDigital Face project presentation
Digital Face project presentation
 
CrossCult presentation
CrossCult presentationCrossCult presentation
CrossCult presentation
 
Computational History and the Transformation of Public Discourse in Finland, ...
Computational History and the Transformation of Public Discourse in Finland, ...Computational History and the Transformation of Public Discourse in Finland, ...
Computational History and the Transformation of Public Discourse in Finland, ...
 
Where does the born- and reborn-digital material take the Digital Humanities?
Where does the born- and reborn-digital material take the Digital Humanities?Where does the born- and reborn-digital material take the Digital Humanities?
Where does the born- and reborn-digital material take the Digital Humanities?
 
Humanities Crowdsourcing on the Zooniverse Platform
Humanities Crowdsourcing on the Zooniverse PlatformHumanities Crowdsourcing on the Zooniverse Platform
Humanities Crowdsourcing on the Zooniverse Platform
 
Managing library collections with friends, favours and a spoonful of sugar
Managing library collections with friends, favours and a spoonful of sugarManaging library collections with friends, favours and a spoonful of sugar
Managing library collections with friends, favours and a spoonful of sugar
 
L taylor ucl_caribbean_digital_dreams_2017
L taylor ucl_caribbean_digital_dreams_2017L taylor ucl_caribbean_digital_dreams_2017
L taylor ucl_caribbean_digital_dreams_2017
 
Greta and Emily Franzini (UCLDH and Göttingen), 'Brothers Grimm, Jane Austen ...
Greta and Emily Franzini (UCLDH and Göttingen), 'Brothers Grimm, Jane Austen ...Greta and Emily Franzini (UCLDH and Göttingen), 'Brothers Grimm, Jane Austen ...
Greta and Emily Franzini (UCLDH and Göttingen), 'Brothers Grimm, Jane Austen ...
 

Dernier

SPLICE Working Group: Reusable Code Examples
SPLICE Working Group:Reusable Code ExamplesSPLICE Working Group:Reusable Code Examples
SPLICE Working Group: Reusable Code Examples
Peter Brusilovsky
 
Contoh Aksi Nyata Refleksi Diri ( NUR ).pdf
Contoh Aksi Nyata Refleksi Diri ( NUR ).pdfContoh Aksi Nyata Refleksi Diri ( NUR ).pdf
Contoh Aksi Nyata Refleksi Diri ( NUR ).pdf
cupulin
 
SURVEY I created for uni project research
SURVEY I created for uni project researchSURVEY I created for uni project research
SURVEY I created for uni project research
CaitlinCummins3
 
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
中 央社
 
Spellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPSSpellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPS
AnaAcapella
 

Dernier (20)

Supporting Newcomer Multilingual Learners
Supporting Newcomer  Multilingual LearnersSupporting Newcomer  Multilingual Learners
Supporting Newcomer Multilingual Learners
 
The Story of Village Palampur Class 9 Free Study Material PDF
The Story of Village Palampur Class 9 Free Study Material PDFThe Story of Village Palampur Class 9 Free Study Material PDF
The Story of Village Palampur Class 9 Free Study Material PDF
 
SPLICE Working Group: Reusable Code Examples
SPLICE Working Group:Reusable Code ExamplesSPLICE Working Group:Reusable Code Examples
SPLICE Working Group: Reusable Code Examples
 
Contoh Aksi Nyata Refleksi Diri ( NUR ).pdf
Contoh Aksi Nyata Refleksi Diri ( NUR ).pdfContoh Aksi Nyata Refleksi Diri ( NUR ).pdf
Contoh Aksi Nyata Refleksi Diri ( NUR ).pdf
 
Observing-Correct-Grammar-in-Making-Definitions.pptx
Observing-Correct-Grammar-in-Making-Definitions.pptxObserving-Correct-Grammar-in-Making-Definitions.pptx
Observing-Correct-Grammar-in-Making-Definitions.pptx
 
SURVEY I created for uni project research
SURVEY I created for uni project researchSURVEY I created for uni project research
SURVEY I created for uni project research
 
OSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & SystemsOSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & Systems
 
ESSENTIAL of (CS/IT/IS) class 07 (Networks)
ESSENTIAL of (CS/IT/IS) class 07 (Networks)ESSENTIAL of (CS/IT/IS) class 07 (Networks)
ESSENTIAL of (CS/IT/IS) class 07 (Networks)
 
UChicago CMSC 23320 - The Best Commit Messages of 2024
UChicago CMSC 23320 - The Best Commit Messages of 2024UChicago CMSC 23320 - The Best Commit Messages of 2024
UChicago CMSC 23320 - The Best Commit Messages of 2024
 
male presentation...pdf.................
male presentation...pdf.................male presentation...pdf.................
male presentation...pdf.................
 
24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...
24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...
24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...
 
Mattingly "AI and Prompt Design: LLMs with NER"
Mattingly "AI and Prompt Design: LLMs with NER"Mattingly "AI and Prompt Design: LLMs with NER"
Mattingly "AI and Prompt Design: LLMs with NER"
 
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
 
Rich Dad Poor Dad ( PDFDrive.com )--.pdf
Rich Dad Poor Dad ( PDFDrive.com )--.pdfRich Dad Poor Dad ( PDFDrive.com )--.pdf
Rich Dad Poor Dad ( PDFDrive.com )--.pdf
 
e-Sealing at EADTU by Kamakshi Rajagopal
e-Sealing at EADTU by Kamakshi Rajagopale-Sealing at EADTU by Kamakshi Rajagopal
e-Sealing at EADTU by Kamakshi Rajagopal
 
The Liver & Gallbladder (Anatomy & Physiology).pptx
The Liver &  Gallbladder (Anatomy & Physiology).pptxThe Liver &  Gallbladder (Anatomy & Physiology).pptx
The Liver & Gallbladder (Anatomy & Physiology).pptx
 
Spellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPSSpellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPS
 
OS-operating systems- ch05 (CPU Scheduling) ...
OS-operating systems- ch05 (CPU Scheduling) ...OS-operating systems- ch05 (CPU Scheduling) ...
OS-operating systems- ch05 (CPU Scheduling) ...
 
Including Mental Health Support in Project Delivery, 14 May.pdf
Including Mental Health Support in Project Delivery, 14 May.pdfIncluding Mental Health Support in Project Delivery, 14 May.pdf
Including Mental Health Support in Project Delivery, 14 May.pdf
 
AIM of Education-Teachers Training-2024.ppt
AIM of Education-Teachers Training-2024.pptAIM of Education-Teachers Training-2024.ppt
AIM of Education-Teachers Training-2024.ppt
 

What might a spoken corpus tell us about language

  • 1. What might a corpus of spoken data tell usWhat might a corpus of spoken data tell us about language?about language? UCL Digital Humanities Seminar November 15 Sean Wallis Survey of English Usage University College London s.wallis@ucl.ac.uk
  • 2. OutlineOutline • What can a corpus tell us? • The 3A cycle • What can a parsed corpus tell us? • ICE-GB and DCPSE • Diachronic changes – Modal shall/will over time • Intra-structural priming – NP premodification • The value of interaction evidence
  • 3. What can a corpus tell us?What can a corpus tell us? • Three kinds of evidence may be obtained from a corpus  Frequency (distribution) evidence of a particular known linguistic event  Coverage (discovery) evidence of new events  Interaction evidence of the relationship between events • But if these ‘events’ are lexical, this evidence can only really tell us about lexis – So corpus linguistics has always involved annotation
  • 4. The 3A cycleThe 3A cycle • Plain text corpora – evidence of lexical phenomena Text
  • 5. The 3A cycleThe 3A cycle • Plain text corpora – evidence of lexical phenomena • Need to annotate – add knowledge of frameworks – classify and relate phenomena – general annotation scheme • not focused on particular research goals Annotation Corpus Text
  • 6. The 3A cycleThe 3A cycle • Plain text corpora – evidence of lexical phenomena • Need to annotate – add knowledge of frameworks – classify and relate phenomena – general annotation scheme • not focused on particular research goals • Corpus research = the ‘3A’ cycle – Annotation Annotation Corpus Text
  • 7. The 3A cycleThe 3A cycle • Plain text corpora – evidence of lexical phenomena • Need to annotate – add knowledge of frameworks – classify and relate phenomena – general annotation scheme • not focused on particular research goals • Corpus research = the ‘3A’ cycle – Annotation ↔ Abstraction Annotation Abstraction Corpus Text Dataset data transformation (“operationalisation”)
  • 8. The 3A cycleThe 3A cycle • Plain text corpora – evidence of lexical phenomena • Need to annotate – add knowledge of frameworks – classify and relate phenomena – general annotation scheme • not focused on particular research goals • Corpus research = the ‘3A’ cycle – Annotation ↔ Abstraction ↔ Analysis Annotation Abstraction Analysis Corpus Text Dataset Hypotheses data transformation (“operationalisation”)
  • 9. AnnotationAnnotation ↔↔ AbstractionAbstraction • Abstraction – selects data from annotated corpus – maps it to a regular dataset for statistical analysis – bi-directional (“concretisation”) • allows us to interpret statistically significant results
  • 10. AnnotationAnnotation ↔↔ AbstractionAbstraction • Abstraction – selects data from annotated corpus – maps it to a regular dataset for statistical analysis – bi-directional (“concretisation”) • allows us to interpret statistically significant results • Even ‘lexical’ questions need annotation: – 1st person declarative modal verb shall/will abstraction relies on annotation
  • 11. What can aWhat can a parsedparsed corpus tellcorpus tell us?us? • Three kinds of evidence may be obtained from a parsed corpus  Frequency evidence of a particular known rule, structure or linguistic event  Coverage evidence of new rules, etc.  Interaction evidence of the relationship between rules, structures and events • BUT evidence is necessarily framed within a particular grammatical scheme – So… (an obvious question) how might we evaluate this grammar?
  • 12. What can a parsed corpus tellWhat can a parsed corpus tell us?us? • Parsed corpora contain (lots of) trees – Use Fuzzy Tree Fragment queries to get data – An FTF
  • 13. What can a parsed corpus tellWhat can a parsed corpus tell us?us? • Parsed corpora contain (lots of) trees – Use Fuzzy Tree Fragment queries to get data – An FTF – A matching case in a tree – Using ICECUP (Nelson et al, 2002)
  • 14. What can a parsed corpus tellWhat can a parsed corpus tell us?us? • Trees as handle on data – make useful distinctions – retrieve cases reliably – not necessary to “agree” to framework used • provided distinctions are meaningful
  • 15. What can a parsed corpus tell us? • Trees as handle on data – make useful distinctions – retrieve cases reliably – not necessary to “agree” to framework used • provided distinctions are meaningful • Trees as trace of language production process – interaction between decisions leaves a probabilistic effect on overall performance • not simple to distinguish between sources – depends on the framework • but may also validate it
  • 19. Why spoken corpora? • Speech predates writing – historically – literacy growth and spread – child development – internal speech during writing • Scale – professional authors recommend 1,000 words/day – 1 hour of speech ≈ 8,000 words (DCPSE) • Spontaneity – production process lost: many written sources edited • Dialogue – interaction between speakers
  • 20. ICE-GB and DCPSE • British Component of the International Corpus of English (1990-92) – 1 million words (nominal) – 60% spoken, 40% written – speech component is orthographically transcribed – fully parsed • marked up, POS-tagged, parsed, hand-corrected • Diachronic Corpus of Present-day Spoken English – 800,000 words (nominal) – orthographically transcribed and fully parsed – created from subsamples of LLC and ICE-GB • Matching numbers of texts in text categories • Not sampled over equal duration – LLC (1958-1977) – ICE-GB (1990-1992)
  • 23. Modal shall vs. will over time • Plotting modal shall/will over time (DCPSE) [Figure: p(shall | {shall, will}), 0.0 to 1.0, plotted by year, 1955 to 1995] • Small amounts of data / year • Confidence intervals identify the degree of certainty in our results • Highly skewed p in some cases – p = 0 or 1 (circled)
  • 24. Modal shall vs. will over time • Plotting modal shall/will over time (DCPSE) [Figure: p(shall | {shall, will}), 0.0 to 1.0, plotted by year, 1955 to 1995] • Small amounts of data / year • Confidence intervals identify the degree of certainty in our results • We can now estimate an approximate downwards curve (Aarts et al., 2013)
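To make the role of confidence intervals concrete, here is a minimal sketch that computes p(shall | {shall, will}) per year with a 95% interval. These slides do not say which interval was plotted; the sketch assumes the Wilson score interval, which the deck names later for NP premodification. The yearly counts are invented for illustration and are not DCPSE figures.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a proportion; z defaults to the 95% level."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at p = 0 or p = 1.
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical (shall, will) counts per year; NOT taken from DCPSE.
counts = {1960: (12, 8), 1975: (5, 0), 1991: (9, 21)}

for year, (shall, will) in sorted(counts.items()):
    n = shall + will
    p = shall / n
    low, high = wilson_interval(shall, n)
    print(f"{year}: p(shall)={p:.2f}  95% interval=({low:.2f}, {high:.2f})  n={n}")
```

With small yearly samples the intervals are wide, and an observed p of 0 or 1 (the invented 1975 row) still receives an interval that extends well into the range, which is one reason a score-based interval suits skewed cases like those circled on the plot.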
  • 30. Intra-structural priming • Priming effects within a structure – Study repeating an additive step in structures • Consider – a phrase or clause that may (in principle) be extended ad infinitum • e.g. an NP with a noun head – a single additive step applied to this structure • e.g. add an attributive AJP before the head – Q. What is the effect of repeatedly applying this operation to the structure? [Diagram: an NP with head N ship, extended step by step with the attributive AJPs tall, very green and old]
  • 31. NP premodification • Sequential probability analysis – calculate probability of adding each AJP – error bars: Wilson intervals – probability falls • second < first • third < second – decisions interact – every AJP added makes it harder to add another [Figure: probability (0.00 to 0.20) on the vertical axis, 0 to 5 on the horizontal axis]
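The sequential probability analysis can be sketched as follows. The sketch assumes one particular operationalisation: if F(x) is the number of NPs with at least x attributive AJPs, the probability of adding the x-th AJP is p(x) = F(x) / F(x-1). That definition, the function names and the frequencies are assumptions made for illustration; the counts are invented and are not results from ICE-GB.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a proportion; z defaults to the 95% level."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def sequential_probabilities(at_least):
    """at_least[x] = number of NPs with at least x attributive AJPs.

    Returns rows (x, p, low, high), where p = at_least[x] / at_least[x - 1]
    is the probability of adding the x-th AJP given x - 1 are present.
    """
    rows = []
    for x in range(1, len(at_least)):
        n = at_least[x - 1]
        k = at_least[x]
        low, high = wilson_interval(k, n)
        rows.append((x, k / n, low, high))
    return rows


# Hypothetical frequencies; NOT corpus results.
at_least = [200000, 38000, 2900, 160, 7]

rows = sequential_probabilities(at_least)
for x, p, low, high in rows:
    print(f"AJP {x}: p={p:.4f}  95% interval=({low:.4f}, {high:.4f})")
```

If the decisions to add each AJP were independent, p(x) would stay roughly constant across x; a successive fall of the kind the slide reports is what signals interaction. Note that the intervals widen as x grows, because each step is estimated from fewer cases.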
  • 32. NP premodification: explanations? • Feedback loop: for each successive AJP, it is more difficult to add a further AJP • Possible explanations include: – logical and semantic constraints • tend to say the tall green ship • do not tend to say tall short ship or green tall ship – communicative economy • once a speaker has said tall green ship, they tend to say only ship – memory/processing constraints • unlikely: this is a small structure, as are AJPs
  • 33. NP premod’n: speech vs. writing • Spoken vs. written subcorpora – Same overall pattern – Spoken data tends to have fewer attributive AJPs • Support for communicative economy or memory/processing hypotheses? – Significance tests • Paired 2x1 Wilson tests (Wallis 2011) • first and second observed spoken probabilities are significantly smaller than written [Figure: probability (0.00 to 0.25) on the vertical axis, 0 to 5 on the horizontal axis, with written and spoken series]
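One way to compare an observed spoken probability with the corresponding written one is sketched below. It is not the test cited on the slide: it swaps in a Newcombe-Wilson difference interval for two independent proportions, built from the Wilson interval of each, and treats the difference as significant when that interval excludes zero. The counts and function names are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a proportion; z defaults to the 95% level."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def newcombe_wilson(k1, n1, k2, n2, z=1.959964):
    """Interval for the difference p1 - p2 (Newcombe's hybrid score method)."""
    p1, p2 = k1 / n1, k2 / n2
    l1, u1 = wilson_interval(k1, n1, z)
    l2, u2 = wilson_interval(k2, n2, z)
    d = p1 - p2
    lower = d - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return d, lower, upper


def significantly_different(k1, n1, k2, n2):
    """True when the difference interval excludes zero."""
    _, lower, upper = newcombe_wilson(k1, n1, k2, n2)
    return lower > 0 or upper < 0


# Hypothetical counts: NPs taking a first AJP, written vs. spoken.
written = (2200, 10000)
spoken = (1800, 10000)
d, lower, upper = newcombe_wilson(*written, *spoken)
print(f"difference={d:.3f}  95% interval=({lower:.3f}, {upper:.3f})")
print("significant:", significantly_different(*written, *spoken))
```

The same comparison would be repeated at each step (first AJP, second AJP, and so on), each time using the number of NPs that reached the previous step as the denominator.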
  • 34. Potential sources of interaction • shared context – topic or ‘content words’ (Noriega) • idiomatic conventions – semantic ordering of attributive adjectives (tall green ship) • logical-semantic constraints – exclusion of incompatible adjectives (?tall short ship) • communicative constraints – brevity on repetition (just say ship next time) • psycholinguistic processing constraints – attention and memory of speakers
  • 35. What use is interaction evidence? • Corpus linguistics – Optimising existing grammar • e.g. co-ordination, compound nouns • Theoretical linguistics – Comparing different grammars, same language – Comparing different languages or periods • Psycholinguistics – Search for evidence of language production constraints in spontaneous speech corpora • speech and language therapy • language acquisition and development
  • 36. What can a parsed corpus tell us? • Trees as handle on data – make useful distinctions – retrieve cases reliably – not necessary to “agree” to framework used • provided distinctions are meaningful • Trees as trace of language production process – interaction between decisions leaves a probabilistic effect on overall performance • not simple to distinguish between sources – results enabled by the framework • but may also validate it
  • 37. The importance of annotation • Key element of a ‘3A cycle’ – Annotation ↔ Abstraction ↔ Analysis • Richer annotation – more effective abstraction – deeper research questions? • Multiple layers of annotation – new research questions – studying interaction between layers • Algorithmic vs. human annotation
  • 38. More information • Full paper Wallis, S.A. (2014) What might a corpus of parsed spoken data tell us about language? In L. Veselovská and M. Janebová (eds.) Complex Visibles Out There. Olomouc: Palacký University, 2014. pp 641-662. • Published at http://corplingstats.wordpress.com/2014/06/24/corpus • References Aarts, B., Close, J. and Wallis, S.A. (2013) Choices over time: methodological issues in current change. In Aarts, Close, Leech and Wallis (eds.) The Verb Phrase in English. CUP. Nelson, G., Wallis, S.A. and Aarts, B. (2002) Exploring Natural Language. Amsterdam: John Benjamins. Wallis, S.A. (2011) Comparing χ2 tests for separability of distribution and effect. London: Survey of English Usage. • Published at http://corplingstats.wordpress.com/2012/03/31/comparing
  • 39. More information • Useful links – Survey of English Usage • www.ucl.ac.uk/english-usage – Fuzzy Tree Fragments • www.ucl.ac.uk/english-usage/resources/ftfs – Author’s corpus linguistics statistics and methodology research blog • http://corplingstats.wordpress.com