Lexicographic Evidence
This part explains how to design, acquire, and process a
collection of linguistic data which will form the raw
material for a dictionary.
Comprehension Questions (1)
1. What is a reliable dictionary?
2. What is subjective evidence and its limits?
3. What is a citation?
4. What should be the basic steps in setting up a
reading programme?
5. What are the advantages and disadvantages of
citations?
Comprehension Questions (2)
6. What is a corpus?
7. What are the points that should be considered in
designing a corpus?
8. How large should a corpus be?
9. How do we decide what kinds of written or spoken
material our corpus should include?
10. Can a corpus be representative?
Comprehension Questions (3)
11. What is ‘skewing’?
12. What are the questions that should be
answered before starting to form the corpus?
13. What is linguistic annotation?
A ‘Reliable’ Dictionary
A reliable dictionary is one whose generalizations about word
behavior approximate closely to the ways in which people
normally use language when engaging in real communicative
acts. Yet it is difficult to determine how people normally
use words. There is a need for evidence.
Subjective Evidence and Its Limits
Introspection, that is, consulting your own mental lexicon, is a
form of evidence, but it cannot on its own form the basis of a
reliable dictionary, since one individual’s store of
linguistic knowledge is inevitably incomplete and
idiosyncratic.
Informant-testing, in which speakers of a language are
questioned about their use of words, is also of limited
value for mainstream lexicography, for similar reasons.
Both are essentially subjective forms of evidence.
Creating a reliable dictionary involves a number of
challenging tasks, but the observation of language in use
is certainly the indispensable first stage in the process.
Citations
A citation is a short extract from a text which provides
evidence for a word, phrase, usage, or meaning in
authentic use.
Until the late twentieth century, the OED’s citations were
written in longhand on index cards known as slips. These
were filed alphabetically according to the keyword of the
citation.
DNA
If a blog has a common ancestor with the diary, one can
say that it has a DNA.
E.g. ‘MySpace’ shares at least some of its DNA with the
‘scrapbook’.
Setting up a Reading Programme
Some dictionary publishers provide online forms to enable
members of the public to contribute citations. Most of
these publishers get unusable citations because their
programmes are not well planned. A good reading
programme, on the other hand, will often have great value.
Setting up a Reading Programme
There is a need for at least four main data fields:
1- Keyword or phrase: the usage that the citation illustrates,
filed under the headword to which it relates.
2- The citation itself: usually a single sentence is adequate, but
there may be more than one.
3- Information about the source of the citation: the date, title,
and author’s name are all important; additional information
(such as the page number) may be useful for specialized or
historical dictionaries.
4- A comment field: this gives readers the option of adding a
note to clarify the citation; it may, for example, illustrate a new
meaning that needs explaining, or it may be characteristic of
one particular dialect.
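The four data fields above could be modelled as a simple record. The following is a minimal sketch in Python, not a description of any publisher's actual system: the class name `Citation`, its field names, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Citation:
    """One citation record with the four main data fields (hypothetical names)."""
    keyword: str                 # 1. keyword or phrase the citation illustrates
    text: str                    # 2. the citation itself, usually one sentence
    source_date: str             # 3. source information: date, title, author
    source_title: str
    source_author: str
    page: Optional[str] = None   # extra source detail for specialized dictionaries
    comment: str = ""            # 4. optional note clarifying the citation

    def is_complete(self) -> bool:
        """True when keyword, citation, and all three source details are filled in."""
        required = [self.keyword, self.text, self.source_date,
                    self.source_title, self.source_author]
        return all(value.strip() for value in required)


# Placeholder values, for illustration only.
slip = Citation(
    keyword="DNA",
    text="'MySpace' shares at least some of its DNA with the 'scrapbook'.",
    source_date="(date of source)",
    source_title="(title of source)",
    source_author="(author of source)",
    comment="(optional note from the reader)",
)
```

A check such as `is_complete()` is one way an online contribution form might reject submissions that lack source details before they reach the editors.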
Advantages of Citations
1- They are helpful for monitoring language change.
2- They give information about the terminology
of a specific subject field or a particular
variety or dialect.
3- They are helpful in training lexicographers.
Disadvantages of Citations
1- Collecting data in this way is labour-intensive,
so volumes will always be low.
2- Although the instances of usage are authentic,
there is a big subjective element in their
selection.
The Central Role of the Corpus
A citation bank alone, even the largest one,
cannot usually supply language data in the
required volumes, so the case for a large
corpus is clear.
A “corpus” is a collection of pieces of
language text in electronic form, selected
according to external criteria to represent,
as far as possible, a language or language
variety as a source of data for linguistic
research (Sinclair 2005).
Some Inescapable Truths
There is no such thing as a perfect corpus for
lexicography.
First of all, the corpus is a sample. It is not possible to
examine every extant example of usage for a language. To
create a sample that fairly reflects the wider population,
carefully selected criteria are needed.
Secondly, selecting texts on the basis of their ‘quality’, and
excluding those which fail this test, is fundamentally at odds
with the descriptive ethos of corpus linguistics. Who is to
judge which texts are ‘good’, and on what basis? It is clear
that a lexicographic corpus must be a genuine, and inclusive,
snapshot of a language, not a set of texts that have been
specially chosen to advance someone’s notion of what
constitutes ‘good’ usage.
Corpora: Design Issues
Designing a corpus means making decisions about:
1- how large it will be.
2- which broad categories of text it will include.
3- what proportions of each category it will include.
4- which individual texts it will include.
Size: How large is large enough?
It is certain that the more data we have, the more we
learn. Yet there are also some hypotheses about how large
a corpus needs to be. Zipf’s Law predicts that the tenth most
frequent word in a corpus will occur twice as often as
the 20th most frequent word, ten times as often as the
100th most frequent word, and 100 times as often as
the 1000th most frequent word. Thus, it can be said
that in a corpus of 100 million words, a simple right- or
left-sorted concordance clearly shows most of the normal
patterns of usage for all words except the very rare.
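The ratios quoted above follow from the simplest form of Zipf's Law, in which a word's frequency is proportional to 1 divided by its rank. The sketch below assumes that idealized form (exponent 1); real corpora only approximate it, and the function name `zipf_ratio` is introduced here purely for illustration.

```python
def zipf_ratio(rank_a: int, rank_b: int, exponent: float = 1.0) -> float:
    """How many times more often the word at rank_a is expected to occur
    than the word at rank_b, if frequency is proportional to 1 / rank**exponent."""
    if rank_a < 1 or rank_b < 1:
        raise ValueError("ranks start at 1")
    return (rank_b / rank_a) ** exponent


# Compare the 10th most frequent word with the 20th, 100th, and 1000th.
for other in (20, 100, 1000):
    print(f"rank 10 vs rank {other}: {zipf_ratio(10, other):g} times as often")
```

Under this assumption the three comparisons give 2, 10, and 100, matching the figures in the text. The practical consequence for corpus size is that words far down the ranking occur very rarely, so a large corpus is needed before they appear often enough to study.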
Different texts, different styles
However large the corpus may be, if the words are
taken from only a limited area (for instance, from
newspapers), they cannot represent all aspects of
the language, and the results may be misleading.
For instance, in such a corpus the word party will
most frequently occur in the sense of a political organization
rather than a social event. A corpus consisting of a
single type of text will reflect only the stylistic
and subject-matter features of that particular
genre. It will be, as corpus linguists say, a ‘skewed’
corpus. Therefore, the corpus should include
different texts and different styles.
Can a Corpus be Representative?
The standard way of avoiding bias is to collect a ‘random sample’.
Yet random sampling may not represent the language well. One
partial solution is to apply stratified sampling. This involves
breaking up the total population into a number of subcategories or
types, then creating independent random samples from each of
these groupings. But this immediately raises two questions:
1- How do we define these subcategories?
2- How do we decide what proportions of each subcategory the
corpus should include?
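Stratified sampling as described above can be sketched in a few lines of Python. This is an illustration only: the function name `stratified_sample`, the input shape, the category names, and the proportions are all assumptions. The code does not answer the two questions; it simply takes the subcategories and proportions as given.

```python
import random
from collections import defaultdict


def stratified_sample(texts, proportions, total, seed=0):
    """Draw an independent random sample from each subcategory.

    texts:       list of (category, text_id) pairs (hypothetical input shape)
    proportions: dict mapping category -> share of the sample
    total:       target number of texts in the whole sample
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for category, text_id in texts:
        by_category[category].append(text_id)

    sample = {}
    for category, share in proportions.items():
        pool = by_category[category]
        k = min(len(pool), round(total * share))  # cannot draw more than exist
        sample[category] = rng.sample(pool, k)
    return sample


# Invented population: 50 newspaper, 30 fiction, and 20 academic texts.
population = ([("news", f"n{i}") for i in range(50)]
              + [("fiction", f"f{i}") for i in range(30)]
              + [("academic", f"a{i}") for i in range(20)])
chosen = stratified_sample(population,
                           {"news": 0.4, "fiction": 0.4, "academic": 0.2},
                           total=20)
```

Here newspapers make up half of the available texts but only 40 per cent of the sample, which shows how the chosen proportions, rather than availability, determine the make-up of the corpus.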
It is almost impossible to define the population
that the corpus should be representative of,
and since the population is unlimited, it is
logically impossible to establish ‘correct’
proportions of each component. An achievable
objective is “a balanced corpus”.
Selecting Texts
Corpus collection is usually recursive.
First, some texts from a range of sources are gathered.
Next, the texts are analyzed to identify recurring clusters
of linguistic features.
This enables us to establish provisional categories of texts,
grouped on the basis of shared linguistic features.
Then more texts are collected to reflect these feature
distributions.
Then the analysis is repeated on the enlarged corpus.
The process thus proceeds in a cyclical fashion until we
collect a large corpus whose contents reflect the proportions
in which the various key features are observable in large
bodies of text.
Spoken Data: A Special Case
With a corpus of spoken language, there are no
obvious objective measures that can be used to
define the target population. The spoken data
should represent variables like gender,
social class, age, and religion. The conversations
that form the corpus should reflect the
diversity of the spoken language.
A Note on ‘Skewing’
Skewing refers to a form of bias in data
whereby a particular feature is either over- or
under-represented to a degree that distorts
the general picture. As corpora grow larger,
problems with skewing usually recede gradually.
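One simple way to look for over- or under-representation is to compare the share of a feature's occurrences that fall in one part of the corpus with that part's share of the whole corpus. The sketch below is an illustrative measure assumed for this example, not a standard metric from the source, and the counts are invented.

```python
def representation_ratio(hits_in_part: int, hits_total: int,
                         tokens_in_part: int, tokens_total: int) -> float:
    """Share of a feature's occurrences found in one part of the corpus,
    divided by that part's share of all tokens.
    About 1.0 means proportionate; well above 1.0 suggests over-representation,
    well below 1.0 suggests under-representation."""
    if hits_total <= 0 or tokens_total <= 0 or tokens_in_part <= 0:
        raise ValueError("totals and part size must be positive")
    share_of_hits = hits_in_part / hits_total
    share_of_tokens = tokens_in_part / tokens_total
    return share_of_hits / share_of_tokens


# Invented counts: 900 of 1,000 hits for a word come from newspapers,
# which make up only 30% of a 100-million-word corpus.
ratio = representation_ratio(900, 1000, 30_000_000, 100_000_000)
```

With these invented counts the ratio is about 3, meaning the word turns up in newspapers roughly three times as often as the size of the newspaper component alone would suggest.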
There are some questions that should be answered
before starting to form the corpus.
Language: Will the corpus be monolingual, bilingual, or
multilingual?
Time: Will the corpus be synchronic or diachronic? In
a synchronic corpus, the constituent texts come from
one specific period of time, whereas the texts making
up a diachronic corpus come from an extended period.
Mode: Will the corpus include written texts, spoken
texts, or both? The status of chat room
conversations, which have characteristics of both
spoken and written texts, is another point that requires
attention in corpus formation.
Medium
Medium refers to the channel in which the text
appears. A simple classification here would
distinguish print media and spoken media. The
former include books, newspapers, magazines,
journals, dissertations, movie scripts, government
documents, and legal statutes. Spoken media
include face-to-face conversations, broadcasts and
podcasts, public meetings, and educational settings.
Once again, traditional categories become blurred
when we add the web to the mix. Some ‘new’ text
types (blogs and social networking sites, for
example) are exclusive to the web, but many
documents exist in both print and electronic media.
Dealing with Sublanguages
When we think about the vocabulary of a
language, it is useful to make a broad
distinction between core usages and
sublanguages. The word deuce is part of a
sublanguage: it belongs to the vocabulary of
tennis. A word like important, on the other
hand, belongs to the core vocabulary of
English. The following question arises at this
point: will we include the sublanguages?
Collecting Written Data
In the past, the work of lexicographers was
not so easy. Earlier corpora made extensive
use of scanning and keyboarding, which were
both slow and labour-intensive processes.
Today it is possible to find various texts in
digital form.
Collecting Spoken Data
Traditionally, spoken data has been difficult
and expensive to collect. Consequently,
although the majority of communicative events
in a language occur in spoken mode, few
corpora include high proportions of spoken
material. For instance, only 10 per cent of the
BNC is spoken. Nowadays, web-derived spoken
data, which offers up-to-date material in large
quantities and at low cost, begins to look like an
attractive alternative.
Collecting Data from the Web
The question of ‘whether the web is a
corpus’ is a hotly debated topic in
language engineering circles. For
lexicography, it is better to see the
web as a source of texts from which
a lexicographic corpus can be
assembled.
Sample Size
There are arguments for using complete
texts rather than extracts. In many
registers, the discourse structure and
rhetorical features of a text may vary as it
proceeds from its opening paragraphs,
through its central sections, to the
concluding chapters. The BNC’s solution to
this was to ensure that 40,000-word samples
were taken variously from the beginning,
middle, and end of its source documents.
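Taking extracts from different positions in a document can be sketched as follows. This is a hypothetical illustration of the idea, not the BNC's actual procedure: the function name `take_extract`, the position labels, and the rule of keeping short documents whole are assumptions.

```python
def take_extract(tokens, size, position):
    """Return a contiguous extract of at most `size` tokens from the
    'beginning', 'middle', or 'end' of a tokenized document."""
    if size <= 0:
        raise ValueError("size must be positive")
    if len(tokens) <= size:
        return list(tokens)          # short documents are kept whole
    if position == "beginning":
        start = 0
    elif position == "middle":
        start = (len(tokens) - size) // 2
    elif position == "end":
        start = len(tokens) - size
    else:
        raise ValueError("position must be 'beginning', 'middle', or 'end'")
    return list(tokens[start:start + size])


doc = [f"w{i}" for i in range(100)]   # stand-in for a 100-word document
opening = take_extract(doc, 10, "beginning")
centre = take_extract(doc, 10, "middle")
closing = take_extract(doc, 10, "end")
```

Varying the position from one source document to the next keeps the corpus from being dominated by, say, introductions, while still limiting how much of any single text is included.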
Copyright and Permissions
Unless a corpus is made up of much older texts, most
of its source material is likely to be protected by
copyright. So, corpus builders should get permission
from the copyright owners to include the documents in
their corpus. This is not an easy task. It is one of the
most time-consuming aspects of the project. It is
recommended that corpus builders never
offer to pay for permission to include a text. Once
money starts changing hands, a precedent would be
established that could have fatal consequences for
corpus-creation efforts worldwide.
Processing and Annotating the Data
To bring the corpus from its raw state to its final
form, some operations are carried out.
Clean-up, standardization, and text encoding
Essentially the process of taking a
heterogeneous collection of input documents
and converting them all to a standard, usable
form. For instance, non-linguistic sounds in
spoken data (like erm, ooh, mhm) and unusable
material in written data (like indexes, tables,
diagrams) are not included in the corpus.
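A very small part of the clean-up step, removing non-linguistic sounds from a transcript line, might look like the sketch below. The filler list contains only the three examples given in the text; a real project would need its own list and rules, and the function name `clean_transcript` is an assumption.

```python
# Filler list taken from the examples in the text; extend as needed.
FILLERS = {"erm", "ooh", "mhm"}


def clean_transcript(line: str) -> str:
    """Drop filler tokens and normalize whitespace in one transcript line."""
    words = line.split()
    # Compare each word without surrounding punctuation, ignoring case.
    kept = [w for w in words if w.strip(".,!?;:").lower() not in FILLERS]
    return " ".join(kept)


cleaned = clean_transcript("Erm, I think  mhm it was ooh yesterday.")
# cleaned == "I think it was yesterday."
```

Splitting on whitespace and rejoining with single spaces also standardizes irregular spacing, which is one small instance of converting input to a uniform form.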
Documentation
Providing each input text with a unique
‘header document’ which records its essential
features. Headers typically give bibliographic
information (title, author’s name, date and
place of publication, and the like) and
precisely locate each text in whatever
typology is being used.
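A header of this kind can be thought of as a small structured record stored alongside each text. The sketch below is one possible shape, assuming hypothetical field names and placeholder values; it is not the header format of any particular corpus.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class TextHeader:
    """Header for one corpus text (field names are hypothetical)."""
    text_id: str   # unique identifier linking the header to its text
    title: str
    author: str
    date: str
    place: str
    mode: str      # position in the typology, e.g. written or spoken
    medium: str    # e.g. book, newspaper, broadcast


def duplicate_ids(headers):
    """Return the set of text_ids that occur more than once."""
    seen, duplicates = set(), set()
    for header in headers:
        if header.text_id in seen:
            duplicates.add(header.text_id)
        seen.add(header.text_id)
    return duplicates


# Placeholder values, for illustration only.
headers = [
    TextHeader("T0001", "(title)", "(author)", "(date)", "(place)", "written", "newspaper"),
    TextHeader("T0002", "(title)", "(author)", "(date)", "(place)", "spoken", "broadcast"),
]
print(json.dumps(asdict(headers[0]), indent=2))
```

Because each header is supposed to be unique, a check like `duplicate_ids` is a cheap safeguard when new texts are added to the corpus.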
Linguistic Annotation
Enriching raw text by adding grammatical
information which will enable corpus users
to frame sophisticated queries and extract
maximum benefit from the data. For
instance, She is tagged as a personal
pronoun, and Really is tagged as a general
adverb. A well-tagged corpus allows us to
focus on each pattern in turn and view a
manageable number of examples.
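To illustrate how tags make more precise queries possible, the sketch below searches a hand-tagged toy sentence for a sequence of tags. The tag labels (PRON, ADV, VERB, DET, NOUN), the sentence, and the function name `find_pattern` are placeholders chosen for this example, not the tagset or tools of any particular corpus.

```python
# Hand-tagged toy sentence as (word, tag) pairs.
tagged = [
    ("She", "PRON"), ("really", "ADV"), ("likes", "VERB"),
    ("the", "DET"), ("dictionary", "NOUN"),
]


def find_pattern(tokens, tag_sequence):
    """Return every run of words whose tags match tag_sequence exactly."""
    n = len(tag_sequence)
    if n == 0:
        return []
    matches = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if [tag for _, tag in window] == list(tag_sequence):
            matches.append([word for word, _ in window])
    return matches


adverb_verb = find_pattern(tagged, ["ADV", "VERB"])
# adverb_verb == [["really", "likes"]]
```

A plain-text search for "really" would return every occurrence of the word, whereas a query on the tag sequence returns only the cases where an adverb directly precedes a verb, which is how tagging helps narrow the examples to a manageable number.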
Final Thoughts
In this part, a methodology for building a
corpus for use in lexicography has been
outlined. This is certainly a difficult
task, and there is no perfect corpus, since
language is diverse and dynamic. The aim is to
form a balanced, standardized, well-tagged
corpus. For many kinds of research, a corpus
with meticulously detailed headers and fine-grained
linguistic annotation is precisely what
is needed.
Summary (translated from Turkish): Lexicographic Evidence
This part has explained how the data that will serve as the source for a
planned dictionary are designed, collected, and processed.
First, the importance of the lexicographers’ own mental lexicons
should be emphasized. It is certain, however, that an individual’s
lexicon, however extensive, is not sufficient for such a work.
In the past, the most frequently used method of data collection was
the citation method. Although it has lost some of its importance
compared with the past, this method is still in use today. Special
programmes have even been developed to collect data from volunteers
over the internet. With the development of computer and
internet technologies, the corpus method has become the most widely
used method of data collection. Throughout the part, it has been
emphasized that those who build a corpus face many questions and that
their work is very difficult. Language is diverse and dynamic,
so a perfect corpus cannot be expected.
The lexicographers’ aim should be to build a corpus that is balanced,
that best represents the language in use, that takes into account the
variables arising from the people who use the language and the settings
in which it is used, and that is organized in a way that is useful for
compiling a dictionary.

Designing IA for AI - Information Architecture Conference 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 

lexicographic evidence

  • 1. Lexicographic Evidence: This part explains how to design, acquire, and process a collection of linguistic data that will form the raw material for a dictionary.
  • 2. Comprehension Questions (1): 1. What is a reliable dictionary? 2. What is subjective evidence, and what are its limits? 3. What is a citation? 4. What should be the basic steps in setting up a reading programme? 5. What are the advantages and disadvantages of citations?
  • 3. Comprehension Questions (2): 6. What is a corpus? 7. What are the points that should be considered in designing a corpus? 8. How large should a corpus be? 9. How do we decide what kinds of written or spoken material our corpus should include? 10. Can a corpus be representative?
  • 4. Comprehension Questions (3): 11. What is ‘skewing’? 12. What are the questions that should be answered before starting to form the corpus? 13. What is linguistic annotation?
  • 5. A ‘Reliable’ Dictionary: A reliable dictionary is one whose generalizations about word behavior approximate closely to the ways in which people normally use language when engaging in real communicative acts. Yet it is difficult to determine how people normally use words. There is a need for evidence.
  • 6. Subjective Evidence and Its Limits: Introspection, that is, consulting your own mental lexicon, is a form of evidence, but it cannot form the basis of a reliable dictionary on its own, since one individual’s store of linguistic knowledge is inevitably incomplete and idiosyncratic. Informant-testing, in which speakers of a language are questioned about their use of words, is also of limited value for mainstream lexicography for similar reasons. Both are essentially subjective forms of evidence. Creating a reliable dictionary involves a number of challenging tasks, but the observation of language in use is certainly the indispensable first stage in the process.
  • 7. Citations: A citation is a short extract from a text which provides evidence for a word, phrase, usage, or meaning in authentic use. Until the late twentieth century, the OED’s citations were written in longhand on index cards known as slips. These were filed alphabetically according to the keyword of the citation.
  • 8. DNA: If a blog has a common ancestor with the diary, one can say that it has a DNA. E.g. ‘MySpace’ shares at least some of its DNA with the ‘scrapbook’.
  • 9. Setting up a Reading Programme: Some dictionary publishers provide online forms to enable members of the public to contribute citations. Most of these publishers receive unusable citations because their programmes are not well planned. A good reading programme, on the other hand, will often have great value.
  • 10. Setting up a Reading Programme: There is a need for at least four main data fields: 1- Keyword or phrase: the usage that the citation illustrates, filed under the headword to which it relates. 2- The citation itself: usually a single sentence is adequate, but there may be more than one. 3- Information about the source of the citation: the date, title, and author’s name are all important; additional information (such as the page number) may be useful for specialized or historical dictionaries. 4- A comment field: this gives readers the option of adding a note to clarify the citation; it may, for example, be a new meaning that needs explaining, or it may be characteristic of one particular dialect.
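The four data fields above map naturally onto a small record type. The following is a minimal sketch, not a format used by any actual publisher: the class name `Citation`, its field names, and the helper `file_alphabetically` are assumptions made for illustration, and the source details in the sample slips are invented placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Citation:
    """One citation slip holding the four data fields (hypothetical layout)."""
    keyword: str                       # 1- keyword or phrase the citation illustrates
    text: str                          # 2- the citation itself
    source_date: str                   # 3- source details: date, title, author
    source_title: str
    source_author: str
    source_page: Optional[str] = None  # extra source detail, useful for historical work
    comment: str = ""                  # 4- optional note from the reader


def file_alphabetically(slips):
    """Return the slips ordered by keyword, ignoring case."""
    return sorted(slips, key=lambda slip: slip.keyword.casefold())


# Placeholder values only: the dates, titles, and authors are invented.
slips = [
    Citation("scrapbook", "She kept a scrapbook of clippings.",
             "2007", "Example Title", "A. Reader"),
    Citation("DNA", "'MySpace' shares at least some of its DNA with the 'scrapbook'.",
             "2007", "Example Title", "A. Reader",
             comment="Figurative sense: shared origins."),
]

ordered = [slip.keyword for slip in file_alphabetically(slips)]
print(ordered)
```

Sorting on the case-folded keyword mirrors the practice of filing slips alphabetically by keyword.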
  • 11. Advantages of Citations: 1- They are helpful for monitoring language change. 2- They give information about the terminology of a specific subject field or a particular variety or dialect. 3- They are helpful in training lexicographers.
  • 12. Disadvantages of Citations: 1- Collecting data in this way is labour-intensive, so volumes will always be low. 2- Although instances of usage are authentic, there is a big subjective element in their selection.
  • 13. The Central Role of the Corpus: A citation bank alone, even the largest one, cannot usually supply language data in the required volumes, so the case for a large corpus is clear. A “corpus” is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research (Sinclair 2005).
  • 14. Some Inescapable Truths: There is no such thing as a perfect corpus for lexicography. First of all, the corpus is a sample. It is not possible to examine every extant example of usage for a language. To create a sample that fairly reflects the wider population, there is a need for carefully selected criteria. Secondly, selecting texts on the basis of their ‘quality’, and excluding those which fail this test, is fundamentally at odds with the descriptive ethos of corpus linguistics. Who is to judge which texts are ‘good’, and on what basis? It is clear that a lexicographic corpus must be a genuine and inclusive snapshot of a language, not a set of texts that have been specially chosen to advance someone’s notion of what constitutes ‘good’ usage.
  • 15. Corpora: Design Issues: Designing a corpus means making decisions about: 1- how large it will be. 2- which broad categories of text it will include. 3- what proportions of each category it will include. 4- which individual texts it will include.
  • 16. Size: How large is large enough? It is certain that the more data we have, the more we learn. Yet there are also some hypotheses about the size of the corpus. Zipf’s Law predicts that the tenth most frequent word in a corpus will occur twice as often as the 20th most frequent word, ten times as often as the 100th most frequent word, and 100 times as often as the 1000th most frequent word. Thus, it can be said that in a corpus of 100 million words, a simple right- or left-sorted concordance clearly shows most of the normal patterns of usage for all words except the very rare.
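The ratios quoted for Zipf’s Law follow from the assumption that a word’s frequency is inversely proportional to its rank. A minimal sketch of that arithmetic, in which the function name `zipf_ratio` and the `exponent` parameter are illustrative assumptions (the idealized law corresponds to an exponent of 1, and real corpora only approximate it):

```python
def zipf_ratio(rank_a, rank_b, exponent=1.0):
    """How many times more frequent the word at rank_a is expected to be
    than the word at rank_b, if frequency is proportional to 1 / rank**exponent."""
    return (rank_b / rank_a) ** exponent


# Compare the 10th most frequent word with lower-ranked words.
for other_rank in (20, 100, 1000):
    print(f"rank 10 vs rank {other_rank}: {zipf_ratio(10, other_rank):g}x")
```

The steep fall-off is why rare words need a very large corpus before they yield enough examples to study.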
  • 17. Different texts, different styles: However large a corpus may be, if its words are taken from only a limited area (for instance, from newspapers), it cannot represent all aspects of the language, and the results may be misleading. (For instance, the word party will most frequently occur in the sense of a political organization rather than a social event.) A corpus consisting of a single type of text will reflect only the stylistic and subject-matter features of that particular genre. It will be, as corpus linguists say, a ‘skewed’ corpus. Therefore, the corpus should include different texts and different styles.
  • 18. Can a Corpus be Representative? The standard way of avoiding bias is to collect a ‘random sample’. Yet random sampling may not represent the language well. One partial solution is to apply stratified sampling. This involves breaking up the total population into a number of subcategories or types, then creating independent random samples from each of these groupings. But this immediately raises two questions: 1- How do we define these subcategories? 2- How do we decide what proportions of each subcategory the corpus should include?
  • 19. It is almost impossible to define the population that the corpus should be representative of, and since the population is unlimited, it is logically impossible to establish ‘correct’ proportions of each component. An achievable objective should be “a balanced corpus”.
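Stratified sampling, as described in slide 18, can be sketched in a few lines. This is only an illustration under stated assumptions: the category names, the quotas, and the function `stratified_sample` are hypothetical, and choosing real categories and proportions is exactly the open question raised above.

```python
import random
from collections import defaultdict


def stratified_sample(texts, quotas, seed=0):
    """Draw an independent random sample from each category.

    texts  -- iterable of (text_id, category) pairs
    quotas -- dict mapping category -> number of texts wanted
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    by_category = defaultdict(list)
    for text_id, category in texts:
        by_category[category].append(text_id)

    sample = {}
    for category, wanted in quotas.items():
        pool = by_category[category]
        # Never ask for more texts than the category actually contains.
        sample[category] = rng.sample(pool, min(wanted, len(pool)))
    return sample


# Hypothetical population: 10 news texts, 5 fiction texts, 5 speech transcripts.
texts = ([(f"news{i}", "news") for i in range(10)]
         + [(f"fic{i}", "fiction") for i in range(5)]
         + [(f"sp{i}", "speech") for i in range(5)])
quotas = {"news": 3, "fiction": 2, "speech": 2}

sample = stratified_sample(texts, quotas)
print({category: len(ids) for category, ids in sample.items()})
```

Setting equal or hand-picked quotas instead of proportional ones is one way of aiming for a balanced rather than a strictly representative corpus.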
  • 20. Selecting Texts: Corpus collection is usually recursive. First, some texts from a range of sources are gathered. Next, the texts are analyzed to identify recurring clusters of linguistic features. This enables us to establish provisional categories of texts, grouped on the basis of shared linguistic features. Then more texts are collected to reflect these feature distributions, and the analysis is repeated on the enlarged corpus. The process thus proceeds in a cyclical fashion until we collect a large corpus whose contents reflect the proportions in which the various key features are observable in large bodies of text.
  • 21. Spoken Data: A Special Case: With a corpus of spoken language, there are no obvious objective measures that can be used to define the target population. The spoken data should represent variables such as gender, social class, age, and religion. The conversations that form the corpus should reflect the diversity of the spoken language.
  • 22. A Note on ‘Skewing’: Skewing refers to a form of bias in data whereby a particular feature is either over- or under-represented to a degree that distorts the general picture. As corpora grow larger, problems with skewing usually recede gradually.
  • 23. There are some questions that should be answered before starting to form the corpus. Language: Will the corpus be monolingual, bilingual, or multilingual? Time: Will the corpus be synchronic or diachronic? In a synchronic corpus, the constituent texts come from one specific period of time, whereas the texts making up a diachronic corpus come from an extended period. Mode: Will the corpus include written texts, spoken texts, or both? The status of chat-room conversations, which have the characteristics of both spoken and written texts, is another point that requires attention in corpus formation.
  • 24. Medium: Medium refers to the channel in which the text appears. A simple classification here would distinguish print media and spoken media. The former include books, newspapers, magazines, journals, dissertations, movie scripts, government documents, and legal statutes. Spoken media include face-to-face conversations, broadcasts and podcasts, public meetings, and educational settings. Once again, traditional categories become blurred when we add the web to the mix. Some ‘new’ text types (blogs and social networking sites, for example) are exclusive to the web, but many documents exist in both print and electronic media.
  • 25. Dealing with Sublanguages: When we think about the vocabulary of a language, it is useful to make a broad distinction between core usages and sublanguages. The word deuce is part of a sublanguage: it belongs to the vocabulary of tennis. A word like important, on the other hand, belongs to the core vocabulary of English. The following question arises at this point: will we include the sublanguages?
  • 26. Collecting Written Data: In the past, the work of lexicographers was harder. Earlier corpora made extensive use of scanning and keyboarding, which were both slow and labour-intensive processes. Today it is possible to find various texts already in digital form.
  • 27. Collecting Spoken Data: Traditionally, spoken data has been difficult and expensive to collect. Consequently, although the majority of communicative events in a language occur in spoken mode, few corpora include high proportions of spoken material. For instance, only 10 per cent of the BNC is spoken. Nowadays, web-derived spoken data, which offers up-to-date material in large quantities and at low cost, is beginning to look like an attractive alternative.
  • 28. Collecting Data from the Web: The question of whether the web is a corpus is a hotly debated topic in language engineering circles. For lexicography, it is better to see the web as a source of texts from which a lexicographic corpus can be assembled.
  • 29. Sample Size: There are arguments for using complete texts rather than extracts. In many registers, the discourse structure and rhetorical features of a text may vary as it proceeds from its opening paragraphs, through its central sections, to the concluding chapters. The BNC’s solution to this was to ensure that 40,000-word samples were taken variously from the beginning, middle, and end of its source documents.
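Taking a fixed-size extract from the beginning, middle, or end of a document can be sketched as follows. This is an illustrative assumption about how such sampling might be coded, not a description of the BNC’s actual tooling; the function name `take_sample` and its parameters are hypothetical.

```python
def take_sample(tokens, size, position):
    """Return a contiguous extract of `size` tokens from the given position.

    position -- one of "beginning", "middle", "end"
    Documents shorter than `size` are returned whole.
    """
    if size >= len(tokens):
        return list(tokens)
    if position == "beginning":
        start = 0
    elif position == "middle":
        start = (len(tokens) - size) // 2
    elif position == "end":
        start = len(tokens) - size
    else:
        raise ValueError(f"unknown position: {position!r}")
    return list(tokens[start:start + size])


# Toy document of ten numbered 'words' and a sample size of four.
document = [f"w{i}" for i in range(10)]
for position in ("beginning", "middle", "end"):
    print(position, take_sample(document, 4, position))
```

Rotating the position from one source document to the next spreads the extracts across openings, central sections, and conclusions.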
  • 30. Copyright and Permissions: Unless a corpus is made up of much older texts, most of its source material is likely to be protected by copyright. So corpus-builders should get permission from the copyright owners to include the documents in their corpus. This is not an easy task. It is one of the most time-consuming aspects of the project. It is recommended that corpus-builders never offer to pay for permission to include a text. Once money starts changing hands, a precedent would be established that could have fatal consequences for corpus-creation efforts worldwide.
  • 31. Processing and Annotating the Data: To bring the corpus from its raw state to its final form, some operations are carried out.
  • 32. Clean-up, standardization, and text encoding: Essentially, this is the process of taking a heterogeneous collection of input documents and converting them all to a standard, usable form. For instance, non-linguistic sounds in spoken data (like erm, ooh, mhm) and unusable material in written data (like indexes, tables, diagrams) are not included in the corpus.
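One small part of the clean-up step, dropping non-linguistic sounds from transcripts and normalizing whitespace, might look like the sketch below. The filler list is taken from the examples above; the function name `clean_transcript` and the whitespace-based tokenization are simplifying assumptions, and a real pipeline would need a far more careful treatment.

```python
import string

# Non-linguistic sounds named in the slide; a real list would be longer.
FILLERS = {"erm", "ooh", "mhm"}


def clean_transcript(text):
    """Remove filler tokens and collapse runs of whitespace."""
    kept = []
    for token in text.split():
        # Compare without surrounding punctuation or case: "Erm," -> "erm".
        if token.strip(string.punctuation).lower() in FILLERS:
            continue
        kept.append(token)
    return " ".join(kept)


raw = "Erm, I think   ooh that is mhm fine."
cleaned = clean_transcript(raw)
print(cleaned)
```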
  • 33. Documentation: Providing each input text with a unique ‘header document’ which records its essential features. Headers typically give bibliographic information (title, author’s name, date and place of publication, and the like) and precisely locate each text in whatever typology is being used.
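A header of this kind can be modelled as a simple record stored alongside each text. The sketch below is an assumption for illustration only: the class `TextHeader`, its field names, and the sample values are hypothetical, and real corpora typically follow an established encoding scheme rather than ad hoc JSON.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TextHeader:
    """Bibliographic details plus the text's place in the corpus typology."""
    text_id: str   # unique identifier linking the header to its text
    title: str
    author: str
    date: str
    place: str
    language: str  # typology: language of the text
    mode: str      # typology: "written" or "spoken"
    medium: str    # typology: e.g. "book", "newspaper", "broadcast"


# Invented placeholder values.
header = TextHeader(
    text_id="T0001",
    title="Example Title",
    author="A. Writer",
    date="2005",
    place="London",
    language="en",
    mode="written",
    medium="newspaper",
)

header_json = json.dumps(asdict(header), indent=2)
print(header_json)
```

Keeping the typology fields in the header is what later allows queries to be restricted to, say, written newspaper texts only.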
  • 34. Linguistic Annotation: Enriching raw text by adding grammatical information which will enable corpus users to frame sophisticated queries and extract maximum benefit from the data. For instance, She is tagged as a personal pronoun, and Really is tagged as a general adverb. A well-tagged corpus allows us to focus on each pattern in turn and view a manageable number of examples.
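To show how part-of-speech tags let a user focus on one pattern at a time, here is a minimal sketch that searches a tagged sentence for a sequence of tags. The tag labels (`PNP` for personal pronoun, `AV0` for general adverb, `VVZ` for a verb form) are used purely as illustrative labels, the sentence is hand-tagged, and `find_pattern` is a hypothetical helper rather than part of any corpus tool.

```python
def find_pattern(tagged, pattern):
    """Return every run of words whose tags match `pattern` in order.

    tagged  -- list of (word, tag) pairs
    pattern -- list of tags to match consecutively
    """
    width = len(pattern)
    hits = []
    for start in range(len(tagged) - width + 1):
        window = tagged[start:start + width]
        if all(tag == wanted for (_, tag), wanted in zip(window, pattern)):
            hits.append([word for word, _ in window])
    return hits


# A hand-tagged toy sentence.
sentence = [("She", "PNP"), ("really", "AV0"), ("likes", "VVZ"), ("it", "PNP")]

pronoun_adverb = find_pattern(sentence, ["PNP", "AV0"])
print(pronoun_adverb)
```

Searching on tags rather than word forms is what keeps the number of examples manageable when a word has several grammatical uses.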
  • 35. Final Thoughts: In this part, a methodology for building a corpus for use in lexicography has been outlined. This is certainly a difficult task, and there is no perfect corpus, since language is diverse and dynamic. The aim is to form a balanced, standardized, well-tagged corpus. For many kinds of research, a corpus with meticulously detailed headers and fine-grained linguistic annotation is precisely what is needed.
  • 36. Turkish Summary: Lexicographic Evidence: This chapter explains how the data that will serve as the source for a planned dictionary is designed, collected, and processed. First, the importance of the dictionary compilers’ own vocabularies should be emphasized. However, it is certain that an individual’s vocabulary, however large, is not sufficient for such a work. In the past, the most frequently used data collection method was the citation method. Today, although this method has lost some of its former importance, it is still in use. Indeed, special programmes have been developed to collect data from volunteers over the internet. With the development of computer and internet technologies, the corpus method has become the most widely used data collection method. Throughout the chapter, it is emphasized that those who build a corpus face many questions and that their task is a difficult one. Language is diverse and dynamic, so a perfect corpus cannot be expected. The lexicographers’ aim should be to build a corpus that is balanced, best represents the language in use, takes into account the variables arising from the people who use the language and the settings in which it is used, and is organized in a way that is useful for dictionary-making.