SlideShare une entreprise Scribd logo
1  sur  12
What can corpus
software do?
PRESENTED
BY
Ata ul ghafer and shoiba sabir
Applied Linguistics
GCUF
outlines
1. Things computers can do really well
2. Things computers cannot do at all
3. Sorting out your data
4. Concordances
5. Word lists
6. Key word lists
Things computers do really
well
 Computers don’t get tired
 Computers can multi-task. Strictly speaking, most of them don’t
actually do all the operations I shall describe simultaneously.
 computers are quite unable to get bored or frustrated.
 The computer, at the same time as it is polling the keyboard, is
also displaying results on the screen, calculating more results and,
perhaps in another tab or another window, running numerous
other operations.
 It is also true but less well known that it may actually get the
wrong answer.
Things computers cannot do
at all
 Computers do not notice what they are doing at all. Ask them
to perform a task 200 times and they will not assume the next
request will again be for that same task, or helpfully do it
without being asked. This is a tremendous strength. Many
human errors come from our noticing patterns of repetition:
‘she asked me to pass the salt again’ leading to ‘she fancies
me’,
 Computers cannot prefer one answer to another, as a mouse
could, or complain. Nor could they know what any data means.
 Nor can a computer know what the user meant. If s/he leans
over the computer and a sleeve accidentally presses a key,
there’s no way the computer can guess that this was
involuntary.
2.Sorting out your data
 Re-formatting and re-organising to match up better with what
computers can manage
 Because computers have no intelligence, we need to prepare
our corpora carefully in advance so that the texts contain
nothing unexpected and are in the right format for use. We
have already seen that there is an issue of whether the
underlying text format assumes that each written character
is to be stored using one byte or more than one. What if
some characters take up one byte but others take up two,
three, four or more.
3. Concordances
 There are two basic methods: the ‘on the fly’ system
and an index-based one. If working ‘on the fly’, the
concordancer reads each text file in order and searches
for the desired string(s) in the long stream of bytes
which it has just read in.
 If it finds the desired string, it then checks to see
whether any other contextual requirements are met,
such as whether the string is bounded appropriately to
left and right, for example by a space or a tab or a
punctuation symbol
 Alternatively, the entire corpus is first processed to
build up an index which gives a set of pointers. This
processing usually involves tokenising where each word
of the source text is represented by a number, and a
sizeable database is created which ‘knows’ (i.e.) holds
pointers to) all the instances of all the words. Then
when the search starts, the database is asked for all
instances of the word being sought, and additional
requirements such as context words can be put to the
database too. Once the request has been met, again the
program builds up a set of structured pieces of
information about the context, the filename, the exact
location of each word in the context, and so forth.
4. Word lists
 Word lists are usually created ‘on the fly’. A ‘current
word’ variable is first allocated in the computer’s
memory and set to be empty. Then, as the stream of
bytes is processed, every time a character is
encountered that is alphanumeric, it’s added to that
‘current word’. If on the other hand the character turns
out to be non-alphanumeric, such as a comma or a tab
or a space, the program assumes that an end-of-word
may have been reached It may have to check this in
certain cases such as the apostrophe in mother’s, where
a setting may allow for that character not to count as
an end-of-word character.
5. Key word lists
 A key word, as identified in WordSmith Tools, is a word (or
word cluster) which is found to occur with unusual
frequency in a given text or set of texts. In order to know
what is expected, a reference of some sort must be used. In
WordSmith’s implementation, key word lists use as their
starting points word lists created as described above. One
word list is first made of the text or set of texts which one is
interested in studying, and another one is then made of
some suitable reference corpus. This may be a superset of
the text-type including the one(s) being studied, such as
when one uses a word list based on the Daily Mirror’s news
texts 2004–7 as a reference when studying one specific Daily
Mirror text. Or it might be a general-purpose corpus like the
British National Corpus (BNC) used as a reference corpus
when one is studying the specific Daily Mirror text.
 The screenshot in Figure 11.9 shows the key word clusters of the play
Hamlet, compared with the clusters of all the Shakespeare plays
(including Hamlet) as identified in the word list above.
 What can corpus software do? Routledge chpt 11

Contenu connexe

Tendances

An Overview of Syllabuses in English Language Teaching
An Overview of Syllabuses in English Language TeachingAn Overview of Syllabuses in English Language Teaching
An Overview of Syllabuses in English Language Teaching
jetnang
 
An Introduction to Applied Linguistics part 2
An Introduction to Applied Linguistics part 2An Introduction to Applied Linguistics part 2
An Introduction to Applied Linguistics part 2
Samira Rahmdel
 
Standardization
StandardizationStandardization
Standardization
Sama Ahmad
 
Beginning concepts in psycholinguistics
Beginning concepts in psycholinguisticsBeginning concepts in psycholinguistics
Beginning concepts in psycholinguistics
Ahmed Qadoury Abed
 

Tendances (20)

Language Testing/ Assessment
Language Testing/ AssessmentLanguage Testing/ Assessment
Language Testing/ Assessment
 
Different Levels of Stylistics Analysis 1.Phonological level 2.Graphologic...
Different Levels of Stylistics Analysis  1.Phonological level   2.Graphologic...Different Levels of Stylistics Analysis  1.Phonological level   2.Graphologic...
Different Levels of Stylistics Analysis 1.Phonological level 2.Graphologic...
 
An Overview of Syllabuses in English Language Teaching
An Overview of Syllabuses in English Language TeachingAn Overview of Syllabuses in English Language Teaching
An Overview of Syllabuses in English Language Teaching
 
An Introduction to Applied Linguistics part 2
An Introduction to Applied Linguistics part 2An Introduction to Applied Linguistics part 2
An Introduction to Applied Linguistics part 2
 
Language policy dilemma in pakistan
Language policy dilemma in pakistanLanguage policy dilemma in pakistan
Language policy dilemma in pakistan
 
Genre Analysis.pptx
Genre Analysis.pptxGenre Analysis.pptx
Genre Analysis.pptx
 
Second Language Acquisition 631
Second Language Acquisition 631Second Language Acquisition 631
Second Language Acquisition 631
 
Cultural approaches to discourse
Cultural approaches to discourseCultural approaches to discourse
Cultural approaches to discourse
 
Corpus linguistics intro
Corpus linguistics introCorpus linguistics intro
Corpus linguistics intro
 
Standardization
StandardizationStandardization
Standardization
 
ESP Trends
ESP TrendsESP Trends
ESP Trends
 
Me1
Me1Me1
Me1
 
Generative grammar
Generative grammarGenerative grammar
Generative grammar
 
Translation Studies
Translation StudiesTranslation Studies
Translation Studies
 
Applied Linguistics
Applied LinguisticsApplied Linguistics
Applied Linguistics
 
Corpus linguistics
Corpus linguisticsCorpus linguistics
Corpus linguistics
 
Language planning
Language planningLanguage planning
Language planning
 
Corpus Linguistics
Corpus LinguisticsCorpus Linguistics
Corpus Linguistics
 
Introduction to corpus linguistics 1
Introduction to corpus linguistics 1Introduction to corpus linguistics 1
Introduction to corpus linguistics 1
 
Beginning concepts in psycholinguistics
Beginning concepts in psycholinguisticsBeginning concepts in psycholinguistics
Beginning concepts in psycholinguistics
 

Similaire à What can corpus software do? Routledge chpt 11

Boolean Retrieval
Boolean RetrievalBoolean Retrieval
Boolean Retrieval
mghgk
 
14. Michael Oakes (UoW) Natural Language Processing for Translation
14. Michael Oakes (UoW) Natural Language Processing for Translation14. Michael Oakes (UoW) Natural Language Processing for Translation
14. Michael Oakes (UoW) Natural Language Processing for Translation
RIILP
 
(Assmt 1; Week 3 paper) Using ecree Doing the paper and s.docx
(Assmt 1; Week 3 paper)  Using ecree        Doing the paper and s.docx(Assmt 1; Week 3 paper)  Using ecree        Doing the paper and s.docx
(Assmt 1; Week 3 paper) Using ecree Doing the paper and s.docx
AASTHA76
 
Savage_Samuel.doc
Savage_Samuel.docSavage_Samuel.doc
Savage_Samuel.doc
butest
 

Similaire à What can corpus software do? Routledge chpt 11 (20)

The search engine index
The search engine indexThe search engine index
The search engine index
 
Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habibConceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Boolean Retrieval
Boolean RetrievalBoolean Retrieval
Boolean Retrieval
 
NLP todo
NLP todoNLP todo
NLP todo
 
Parser
ParserParser
Parser
 
Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with Solr
 
Information_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_HabibInformation_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_Habib
 
14. Michael Oakes (UoW) Natural Language Processing for Translation
14. Michael Oakes (UoW) Natural Language Processing for Translation14. Michael Oakes (UoW) Natural Language Processing for Translation
14. Michael Oakes (UoW) Natural Language Processing for Translation
 
An Efficient Search Engine for Searching Desired File
An Efficient Search Engine for Searching Desired FileAn Efficient Search Engine for Searching Desired File
An Efficient Search Engine for Searching Desired File
 
From ELIZA to Alexa and Beyond
From ELIZA to Alexa and BeyondFrom ELIZA to Alexa and Beyond
From ELIZA to Alexa and Beyond
 
Aaai 2006 Pedersen
Aaai 2006 PedersenAaai 2006 Pedersen
Aaai 2006 Pedersen
 
Open domain Question Answering System - Research project in NLP
Open domain  Question Answering System - Research project in NLPOpen domain  Question Answering System - Research project in NLP
Open domain Question Answering System - Research project in NLP
 
Fuzzy Matching or Fuzzy Logic Explained
Fuzzy Matching or Fuzzy Logic ExplainedFuzzy Matching or Fuzzy Logic Explained
Fuzzy Matching or Fuzzy Logic Explained
 
Cleaning and sorting data
Cleaning and sorting dataCleaning and sorting data
Cleaning and sorting data
 
(Assmt 1; Week 3 paper) Using ecree Doing the paper and s.docx
(Assmt 1; Week 3 paper)  Using ecree        Doing the paper and s.docx(Assmt 1; Week 3 paper)  Using ecree        Doing the paper and s.docx
(Assmt 1; Week 3 paper) Using ecree Doing the paper and s.docx
 
Ijarcet vol-3-issue-1-9-11
Ijarcet vol-3-issue-1-9-11Ijarcet vol-3-issue-1-9-11
Ijarcet vol-3-issue-1-9-11
 
Naming Things (with notes)
Naming Things (with notes)Naming Things (with notes)
Naming Things (with notes)
 
Savage_Samuel.doc
Savage_Samuel.docSavage_Samuel.doc
Savage_Samuel.doc
 
E1f14172a91215931ed787d97dee1301fe7d
E1f14172a91215931ed787d97dee1301fe7dE1f14172a91215931ed787d97dee1301fe7d
E1f14172a91215931ed787d97dee1301fe7d
 

Plus de RajpootBhatti5

Plus de RajpootBhatti5 (13)

what is stylistics and its levels 1.Phonological level 2.Graphological leve...
what is stylistics and its levels 1.Phonological level   2.Graphological leve...what is stylistics and its levels 1.Phonological level   2.Graphological leve...
what is stylistics and its levels 1.Phonological level 2.Graphological leve...
 
Universal grammar (ug)
Universal grammar (ug)Universal grammar (ug)
Universal grammar (ug)
 
ILR
ILRILR
ILR
 
Register theory
Register theoryRegister theory
Register theory
 
Types of corpus linguistics Parallel ,aligned...
 Types of corpus linguistics Parallel ,aligned... Types of corpus linguistics Parallel ,aligned...
Types of corpus linguistics Parallel ,aligned...
 
Binding theory
Binding theoryBinding theory
Binding theory
 
Researching language learning in the age of social
Researching language learning in the age of socialResearching language learning in the age of social
Researching language learning in the age of social
 
Call and less commonly taught languages
Call and less commonly taught languagesCall and less commonly taught languages
Call and less commonly taught languages
 
Call tele collaboration
Call  tele collaborationCall  tele collaboration
Call tele collaboration
 
What are the basics of Analysing a corpus? chpt.10 Routledge
What are the basics of Analysing a corpus? chpt.10 RoutledgeWhat are the basics of Analysing a corpus? chpt.10 Routledge
What are the basics of Analysing a corpus? chpt.10 Routledge
 
What corpora are available? by David Y. W.D
What corpora are available? by David Y. W.DWhat corpora are available? by David Y. W.D
What corpora are available? by David Y. W.D
 
phonemes
 phonemes  phonemes
phonemes
 
Marxism theory
Marxism theoryMarxism theory
Marxism theory
 

Dernier

4 TRIK CARA MENGGUGURKAN JANIN ATAU ABORSI KANDUNGAN
4 TRIK CARA MENGGUGURKAN JANIN ATAU ABORSI KANDUNGAN4 TRIK CARA MENGGUGURKAN JANIN ATAU ABORSI KANDUNGAN
4 TRIK CARA MENGGUGURKAN JANIN ATAU ABORSI KANDUNGAN
Cara Menggugurkan Kandungan 087776558899
 

Dernier (20)

Enhancing Business Visibility PR Firms in San Francisco
Enhancing Business Visibility PR Firms in San FranciscoEnhancing Business Visibility PR Firms in San Francisco
Enhancing Business Visibility PR Firms in San Francisco
 
What is Google Search Console and What is it provide?
What is Google Search Console and What is it provide?What is Google Search Console and What is it provide?
What is Google Search Console and What is it provide?
 
Major SEO Trends in 2024 - Banyanbrain Digital
Major SEO Trends in 2024 - Banyanbrain DigitalMajor SEO Trends in 2024 - Banyanbrain Digital
Major SEO Trends in 2024 - Banyanbrain Digital
 
Unveiling the Legacy of the Rosetta stone A Key to Ancient Knowledge.pptx
Unveiling the Legacy of the Rosetta stone A Key to Ancient Knowledge.pptxUnveiling the Legacy of the Rosetta stone A Key to Ancient Knowledge.pptx
Unveiling the Legacy of the Rosetta stone A Key to Ancient Knowledge.pptx
 
Busty Desi⚡Call Girls in Sector 135 Noida Escorts >༒8448380779 Escort Service
Busty Desi⚡Call Girls in Sector 135 Noida Escorts >༒8448380779 Escort ServiceBusty Desi⚡Call Girls in Sector 135 Noida Escorts >༒8448380779 Escort Service
Busty Desi⚡Call Girls in Sector 135 Noida Escorts >༒8448380779 Escort Service
 
personal branding kit for music business
personal branding kit for music businesspersonal branding kit for music business
personal branding kit for music business
 
Five Essential Tools for International SEO - Natalia Witczyk - SearchNorwich 15
Five Essential Tools for International SEO - Natalia Witczyk - SearchNorwich 15Five Essential Tools for International SEO - Natalia Witczyk - SearchNorwich 15
Five Essential Tools for International SEO - Natalia Witczyk - SearchNorwich 15
 
Busty Desi⚡Call Girls in Sector 49 Noida Escorts >༒8448380779 Escort Service
Busty Desi⚡Call Girls in Sector 49 Noida Escorts >༒8448380779 Escort ServiceBusty Desi⚡Call Girls in Sector 49 Noida Escorts >༒8448380779 Escort Service
Busty Desi⚡Call Girls in Sector 49 Noida Escorts >༒8448380779 Escort Service
 
TAM_AdEx-Cross_Media_Report-Banking_Finance_Investment_(BFSI)_2023.pdf
TAM_AdEx-Cross_Media_Report-Banking_Finance_Investment_(BFSI)_2023.pdfTAM_AdEx-Cross_Media_Report-Banking_Finance_Investment_(BFSI)_2023.pdf
TAM_AdEx-Cross_Media_Report-Banking_Finance_Investment_(BFSI)_2023.pdf
 
Alpha Media March 2024 Buyers Guide.pptx
Alpha Media March 2024 Buyers Guide.pptxAlpha Media March 2024 Buyers Guide.pptx
Alpha Media March 2024 Buyers Guide.pptx
 
Micro-Choices, Max Impact Personalizing Your Journey, One Moment at a Time.pdf
Micro-Choices, Max Impact Personalizing Your Journey, One Moment at a Time.pdfMicro-Choices, Max Impact Personalizing Your Journey, One Moment at a Time.pdf
Micro-Choices, Max Impact Personalizing Your Journey, One Moment at a Time.pdf
 
2024 Social Trends Report V4 from Later.com
2024 Social Trends Report V4 from Later.com2024 Social Trends Report V4 from Later.com
2024 Social Trends Report V4 from Later.com
 
4 TRIK CARA MENGGUGURKAN JANIN ATAU ABORSI KANDUNGAN
4 TRIK CARA MENGGUGURKAN JANIN ATAU ABORSI KANDUNGAN4 TRIK CARA MENGGUGURKAN JANIN ATAU ABORSI KANDUNGAN
4 TRIK CARA MENGGUGURKAN JANIN ATAU ABORSI KANDUNGAN
 
Digital-Marketing-Into-by-Zoraiz-Ahmad.pptx
Digital-Marketing-Into-by-Zoraiz-Ahmad.pptxDigital-Marketing-Into-by-Zoraiz-Ahmad.pptx
Digital-Marketing-Into-by-Zoraiz-Ahmad.pptx
 
Analysis of Sineing Website and how to fix
Analysis of Sineing Website and how to fixAnalysis of Sineing Website and how to fix
Analysis of Sineing Website and how to fix
 
Instant Digital Issuance: An Overview With Critical First Touch Best Practices
Instant Digital Issuance: An Overview With Critical First Touch Best PracticesInstant Digital Issuance: An Overview With Critical First Touch Best Practices
Instant Digital Issuance: An Overview With Critical First Touch Best Practices
 
Elevate Your Advertising Game: Introducing Billion Broadcaster Lift Advertising
Elevate Your Advertising Game: Introducing Billion Broadcaster Lift AdvertisingElevate Your Advertising Game: Introducing Billion Broadcaster Lift Advertising
Elevate Your Advertising Game: Introducing Billion Broadcaster Lift Advertising
 
SP Search Term Data Optimization Template.pdf
SP Search Term Data Optimization Template.pdfSP Search Term Data Optimization Template.pdf
SP Search Term Data Optimization Template.pdf
 
Social Media Marketing Portfolio - Maharsh Benday
Social Media Marketing Portfolio - Maharsh BendaySocial Media Marketing Portfolio - Maharsh Benday
Social Media Marketing Portfolio - Maharsh Benday
 
20180928 Hofstede Insights Conference Milan The Power of Culture Led Brands.pptx
20180928 Hofstede Insights Conference Milan The Power of Culture Led Brands.pptx20180928 Hofstede Insights Conference Milan The Power of Culture Led Brands.pptx
20180928 Hofstede Insights Conference Milan The Power of Culture Led Brands.pptx
 

What can corpus software do? Routledge chpt 11

  • 1. What can corpus software do? PRESENTED BY Ata ul ghafer and shoiba sabir Applied Linguistics GCUF
  • 2. outlines 1. Things computers can do really well 2. Things computers cannot do at all 3. Sorting out your data 4. Concordances 5. Word lists 6. Key word lists
  • 3. Things computers do really well  Computers don’t get tired  Computers can multi-task. Strictly speaking, most of them don’t actually do all the operations I shall describe simultaneously.  computers are quite unable to get bored or frustrated.  The computer, at the same time as it is polling the keyboard, is also displaying results on the screen, calculating more results and, perhaps in another tab or another window, running numerous other operations.  It is also true but less well known that it may actually get the wrong answer.
  • 4. Things computers cannot do at all  Computers do not notice what they are doing at all. Ask them to perform a task 200 times and they will not assume the next request will again be for that same task, or helpfully do it without being asked. This is a tremendous strength. Many human errors come from our noticing patterns of repetition: ‘she asked me to pass the salt again’ leading to ‘she fancies me’,  Computers cannot prefer one answer to another, as a mouse could, or complain. Nor could they know what any data means.  Nor can a computer know what the user meant. If s/he leans over the computer and a sleeve accidentally presses a key, there’s no way the computer can guess that this was involuntary.
  • 5. 2.Sorting out your data  Re-formatting and re-organising to match up better with what computers can manage  Because computers have no intelligence, we need to prepare our corpora carefully in advance so that the texts contain nothing unexpected and are in the right format for use. We have already seen that there is an issue of whether the underlying text format assumes that each written character is to be stored using one byte or more than one. What if some characters take up one byte but others take up two, three, four or more.
  • 6. 3. Concordances  There are two basic methods: the ‘on the fly’ system and an index-based one. If working ‘on the fly’, the concordancer reads each text file in order and searches for the desired string(s) in the long stream of bytes which it has just read in.  If it finds the desired string, it then checks to see whether any other contextual requirements are met, such as whether the string is bounded appropriately to left and right, for example by a space or a tab or a punctuation symbol
  • 7.  Alternatively, the entire corpus is first processed to build up an index which gives a set of pointers. This processing usually involves tokenising where each word of the source text is represented by a number, and a sizeable database is created which ‘knows’ (i.e.) holds pointers to) all the instances of all the words. Then when the search starts, the database is asked for all instances of the word being sought, and additional requirements such as context words can be put to the database too. Once the request has been met, again the program builds up a set of structured pieces of information about the context, the filename, the exact location of each word in the context, and so forth.
  • 8. 4. Word lists  Word lists are usually created ‘on the fly’. A ‘current word’ variable is first allocated in the computer’s memory and set to be empty. Then, as the stream of bytes is processed, every time a character is encountered that is alphanumeric, it’s added to that ‘current word’. If on the other hand the character turns out to be non-alphanumeric, such as a comma or a tab or a space, the program assumes that an end-of-word may have been reached It may have to check this in certain cases such as the apostrophe in mother’s, where a setting may allow for that character not to count as an end-of-word character.
  • 9.
  • 10. 5. Key word lists  A key word, as identified in WordSmith Tools, is a word (or word cluster) which is found to occur with unusual frequency in a given text or set of texts. In order to know what is expected, a reference of some sort must be used. In WordSmith’s implementation, key word lists use as their starting points word lists created as described above. One word list is first made of the text or set of texts which one is interested in studying, and another one is then made of some suitable reference corpus. This may be a superset of the text-type including the one(s) being studied, such as when one uses a word list based on the Daily Mirror’s news texts 2004–7 as a reference when studying one specific Daily Mirror text. Or it might be a general-purpose corpus like the British National Corpus (BNC) used as a reference corpus when one is studying the specific Daily Mirror text.
  • 11.  The screenshot in Figure 11.9 shows the key word clusters of the play Hamlet, compared with the clusters of all the Shakespeare plays (including Hamlet) as identified in the word list above.