I discuss the basics of corpus linguistics, its application to linguistic studies and second language learning, and some freely available resources for beginner corpus linguists.
Citation: Zubaidi, N. (2021). Corpus linguistics: An introduction. UM de Universe 2021. doi: 10.13140/RG.2.2.25479.11683
4. Jens Martensson
• Very few language education study
programs in Indonesia offer a corpus
linguistics (CL) course to students.
• CL has been treated as a method only,
rather than as a theory and field of
study.
• Zubaidi et al. (2021): most senior high
school English teachers in Malang (N=27)
have never learned or used CL in their
teaching.
• Purpose: Introduce corpus linguistics to
beginner language practitioners
Corpus (pl. Corpora)
• Corpus (Latin for “body”)
• collection of LARGE, STRUCTURED,
AUTHENTIC TEXTS
• ELECTRONICALLY stored and processed
(MACHINE READABLE DATA/TEXT)
• SAMPLED to be REPRESENTATIVE of a
particular language use/variety (Xiao, n.d.)
• Corpus vs text archive vs database
• The LARGER the corpus, the MORE RELIABLE
the generalizations about language use.
(Zubaidi, 2021) UM de Universe
Most corpora are written
• Written text is EASIER TO OBTAIN than
spoken text
• Newspapers
• Fiction (e.g. Novels, poems)
• Technical Literature (e.g. manuals,
medicine)
• Personal letters & e-mail
• Advertising (e.g. political propaganda)
• Belief and Thought (e.g. Quran, Bible)
• www
Corpus Linguistics (CL)
• The study of language as expressed in
corpora (samples) of "real world" text.
• Aim: checking OCCURRENCES/validating
LINGUISTIC RULES in a specific language
area
• Four primary characteristics of CL:
• SAMPLING and REPRESENTATIVENESS;
• FINITE SIZE;
• MACHINE-READABLE form;
• standard reference.
Theoretical, Interdisciplinary and Applied Linguistics
(Dendrinos, n.d.)
Comparison: Theoretical Linguistics vs. Corpus Linguistics
Theoretical Linguistics
• competence (what is grammatical?)
• introspection
• indefinitely many types, productivity
• grammatical vs. ungrammatical
Corpus Linguistics
• performance (what is attested?)
• instances
• finite number of types
• degrees of grammaticality
Size of Corpora
• CORPUS SIZE increases with the
DEVELOPMENT OF TECHNOLOGY
• 1960s-70s: 1 million words (Brown and LOB)
• 1980s: 20 million words (The
Birmingham/Cobuild)
• 1990s: 100 million words (BNC)
• 2000s: 645 million words (The Bank of English)
• 2021: billions of words (BYU corpora)
Types of Corpora
• Raw vs. annotated corpora
• Automatically annotated vs. manually annotated
corpora
• General/balanced/reference vs. special corpora
• Spoken vs. written language
• Monolingual vs. Multilingual Corpora
• Parallel vs. comparable corpora
• Synchronic vs. diachronic corpora
• Static/sample vs. dynamic/monitor corpora
• Native vs. learner corpora
• Developmental vs. learner/interlanguage corpora
Popular English Corpora
• The British National Corpus (BNC)
• The Bank of English (BoE)
• BYU American English Corpus (COCA,
Wikipedia, Google, Google Books)
• Corpora of Brown family (Brown, LOB, FLOB, Frown)
• ICE corpora (GB, EA, HK, Singapore, Philippines, New
Zealand, etc.)
• London-Lund corpus of spoken English
• SBCSAE
• The Helsinki Diachronic Corpus of English Texts (8th -
18th Century)
• The International Corpus of Learner English (ICLE)
• MICASE
CL in Language Teaching & the
Classroom
• Technology has become globally
widespread and accessible
• More powerful computers that can
analyze large data sets are available
• Many corpus-related resources are
available FOR FREE
• Language teachers and learners can use
corpora
• http://iteslj.org/Articles/Krieger-Corpus.html
CL analysis
• Basic analysis:
• Listing, Sorting, Counting of
Concordances (KWIC)
• Complex analysis:
• Processing using complex programs
(e.g. Complex Ana, WordSmith Tools)
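The basic analysis named above (listing, sorting, and counting concordance lines) can be sketched in a few lines of Python. This is only an illustrative sketch, not how any particular concordancer works: the function names `kwic` and `format_hit`, the `width` parameter, and the sample text are all assumptions made up for this example.

```python
import re

# Invented sample text, standing in for a real corpus file.
SAMPLE = (
    "The corpus is large. A corpus is a collection of authentic texts. "
    "Teachers can search the corpus for examples."
)

def kwic(text, keyword, width=30):
    """List every occurrence of `keyword` with `width` characters of context."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits

def format_hit(hit, width=30):
    """Align the keyword in a fixed column, as in a KWIC display."""
    left, key, right = hit
    return f"{left:>{width}} {key} {right:<{width}}"

hits = kwic(SAMPLE, "corpus", width=25)

# Counting: how many times the keyword is attested.
print(f"{len(hits)} occurrences")

# Sorting: order the lines by right-hand context so patterns line up.
for hit in sorted(hits, key=lambda h: h[2].lower()):
    print(format_hit(hit, width=25))
```

Sorting by the right-hand context is one common choice; sorting by the left-hand context (`h[0]`) would instead group the words that precede the keyword.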
Possible Applications of CL
1. Corpora as a source of empirical data
2. Corpora in language teaching and learning
3. Corpora in lexical studies
4. Corpora in speech research
5. Corpora in grammar studies
6. Corpora and semantic studies
7. Corpora in pragmatic and discourse studies
8. Corpora in sociolinguistic studies
9. Corpora and stylistic studies
10. Corpora in historical linguistics
11. Corpora in dialectology and variational studies
12. Corpora in psycholinguistics
13. Corpora in cultural studies
(Adorjan, 2020; Nkemleke, 2008, 2009)
Examples: CL application
1. Corpora in Teaching and Learning
• Real life language data for textbook
examples
• Critical look at existing language teaching
material
(Adorjan, 2020; Nkemleke, 2008, 2009)
CL Classroom Application
• Low contact vs high contact uses (Kitao &
Kitao, n.d.)
• Low: teacher uses CL to help in teaching
• High: student uses the corpora to learn
about language
• Data-driven learning: e.g. determining a word’s
connotation (whether positive or negative)
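The connotation check mentioned above can be approximated by counting the words that appear near a target word and letting learners judge whether those neighbours are mostly positive or negative. The following is a minimal sketch under stated assumptions: the sentences are invented for illustration (they are not real corpus data), and the function name `collocates`, the `window` size, and the small stopword set are hypothetical choices.

```python
import re
from collections import Counter

# Invented example sentences; a real activity would use corpus output.
SENTENCES = [
    "The storm will cause serious damage.",
    "Smoking can cause cancer.",
    "Delays cause problems for commuters.",
    "Stress may cause serious problems.",
]

# Tiny illustrative stopword set (an assumption, not a standard list).
STOPWORDS = frozenset({"the", "will", "can", "for", "may"})

def collocates(sentences, node, window=3, stopwords=frozenset()):
    """Count words within `window` tokens to the left or right of `node`."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for i, token in enumerate(tokens):
            if token == node:
                context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                counts.update(t for t in context if t not in stopwords)
    return counts

counts = collocates(SENTENCES, "cause", window=3, stopwords=STOPWORDS)
for word, freq in counts.most_common():
    print(word, freq)
```

In a high-contact activity, students would inspect the resulting list themselves and decide whether the neighbours of the target word lean positive or negative.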
CL Classroom Application:
Strengths
• Corpora offer a rich, varied, and authentic language
database.
• Effective tool for teaching and learning vocabulary
• May motivate and attract students to the
language type
• Valuable tool for conducting linguistic research
• Trains learners to actively control their learning
(Adorjan, 2020; Ebrahimi & Faghih, 2016)
CL Classroom Application:
Weaknesses
• Only accessible using computers/mobile phones
and the internet
• Learning the software and designing corpus-based
activities is difficult and takes time and energy.
(Adorjan, 2020; Ebrahimi & Faghih, 2016)
Corpus-related Internet
resources
1. General resources on corpus linguistics
2. Vocabulary frequency lists and
frequency level checkers
3. Online corpora, concordancers, and
other text-analysis software
4. E-texts
5. Information about using corpus
linguistics for language teaching
FREE Corpus Analysis Tools
• Types: Tools with specific corpora vs tools
with any/collection of texts
• General: Word, Excel, etc.
• Specialized:
• Counting words
• Finding examples of specific words or
parts of speech
• Analyzing word frequencies
• Evaluating readability
http://www.cis.doshisha.ac.jp/kkitao/library/resource/corpus/corpus.htm
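Two of the specialized tasks listed above, counting words and analyzing word frequencies, can also be tried with Python's standard library when no dedicated tool is at hand. This is a rough sketch only: the sample text is invented, and the simple regular-expression tokenizer in `tokenize` is an assumption that ignores numbers, hyphenated words, and non-English characters.

```python
import re
from collections import Counter

# Invented sample text; replace with the contents of any plain-text file.
SAMPLE = "The cat sat on the mat. The dog sat on the log."

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

tokens = tokenize(SAMPLE)
counts = Counter(tokens)

# Counting words: running words (tokens) and distinct words (types).
print("tokens:", len(tokens))
print("types:", len(counts))

# Analyzing word frequencies: the most frequent words first.
for word, freq in counts.most_common(3):
    print(word, freq)
```

Dedicated concordancers add features such as part-of-speech search and readability measures, which this sketch does not attempt.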
Corpus Analysis Tools: Concordancer
• SOFTWARE: AntConc, MonoConc, WordSmith Tools
• ONLINE:
• Turbo Lingo: http://www.staff.amu.edu.pl/~sipkadan/lingo.htm
• VIEW (Variation in English Words and Phrases):
http://view.byu.edu/
• BNCweb: http://bncweb.lancs.ac.uk/bncwebSignup/
• Lextutor: http://www.lextutor.ca/concordancers/concord_e.html
• WebCorp: http://www.webcorp.org.uk/
• Text Lex Compare: http://www.lextutor.ca/text_lex_compare/
References
• Dendrinos, B. (n.d.) Unit 1: An Introduction to Applied Linguistics. Applied
Linguistics to Foreign Language Teaching and Learning. University of Athens,
Greece.
• Nkemleke, D. (2008). Corpus Linguistics and Language Education: Development
and Utility of the Corpus of Cameroon English. Humboldt Kolleg Kamerun.
• Nkemleke, D.A. (n.d.). Corpus Linguistic Development with reference to
Cameroon. University of Yaounde I.
• Say, B.
• Volk, M. (n.d.). Korpuslinguistik mit und für Computerlinguistik. Universität Zürich.
• Xiao, R. (n.d.). Corpus design and types of corpora. University of Lancaster.
Conclusion
CL is a fast-developing area.
• The trend is toward using technology in
various areas of linguistics (theoretical &
applied) and literature.
• It needs to be taught not only as a research
method, but also as a field of study.
• Our ever-changing curriculum has not
accommodated it. We need a breakthrough.