1. Lexicographic Evidence
This part explains how to design, acquire, and process a
collection of linguistic data which will form the raw
material for a dictionary.
2. Comprehension Questions (1)
1. What is a reliable dictionary?
2. What is subjective evidence and its limits?
3. What is a citation?
4. What should be the basic steps in setting up a
reading programme?
5. What are the advantages and disadvantages of
citations?
3. Comprehension Questions (2)
6. What is a corpus?
7. What are the points that should be considered in
designing a corpus?
8. How large should a corpus be?
9. How do we decide what kinds of written or spoken
material our corpus should include?
10. Can a corpus be representative?
4. Comprehension Questions (3)
11. What is ‘skewing’?
12. What are the questions that should be
answered before starting to form the corpus?
13. What is linguistic annotation?
5. A ‘Reliable’ Dictionary
A reliable dictionary is one whose generalizations
about word behavior approximate closely to the ways
in which people normally use language when engaging
in real communicative acts. Yet it is difficult to
determine how people normally use words. There is a
need for evidence.
6. Subjective Evidence and Its Limits
Introspection, that is, consulting your own mental lexicon, is a
form of evidence, but it cannot alone form the basis of a
reliable dictionary, since one individual’s store of
linguistic knowledge is inevitably incomplete and
idiosyncratic.
Informant-testing, in which speakers of a language are
questioned about their use of words, is also of limited
value for mainstream lexicography for similar reasons.
Both are essentially subjective forms of evidence.
Creating a reliable dictionary involves a number of
challenging tasks, but the observation of language in use
is certainly the indispensable first stage in the process.
7. Citations
A citation is a short extract from a
text which provides evidence for a
word, phrase, usage, or meaning in
authentic use.
Until the late twentieth century, the
OED’s citations would be written in
longhand on index cards known as
slips. These were filed alphabetically
according to the keyword of the
citation.
8. DNA
If a blog has a common ancestor
with the diary, one can say that it
shares the diary’s DNA.
E.g. ‘MySpace’ shares at least some
of its DNA with the ‘scrapbook’.
9. Setting up a Reading Programme
Some dictionary publishers provide online
forms to enable members of the public to
contribute citations. Most of these publishers
get unusable citations since their programmes
are not well planned. A good reading
programme, on the other hand, will often have
great value.
10. Setting up a Reading Programme
There is a need for at least four main data fields:
1- Keyword or phrase: the usage that the citation illustrates,
filed under the headword to which it relates.
2- The citation itself: usually a single sentence is adequate, but
there may be more than one.
3- Information about the source of the citation: the date, title,
and author’s name are all important; additional information
(such as the page number) may be useful for specialized or
historical dictionaries.
4- A comment field: this gives readers the option of adding a
note to clarify the citation; it may, for example, be a new
meaning that needs explaining, or it may be characteristic of
one particular dialect.
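The four data fields map naturally onto a simple record. The following is a minimal sketch in Python, not a description of any publisher's actual system: the names `Citation` and `file_under`, the way the source information is split into three fields, and the sample sentences are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """One citation with the four main data fields (names are illustrative)."""
    keyword: str        # 1- the word or phrase the citation illustrates
    text: str           # 2- the citation itself, usually one sentence
    source_title: str   # 3- source information: title, author, date
    source_author: str
    source_date: str
    comment: str = ""   # 4- optional clarifying note from the reader


def file_under(citations):
    """Group citations by keyword, with keywords in alphabetical order."""
    bank = defaultdict(list)
    for citation in citations:
        bank[citation.keyword.lower()].append(citation)
    return dict(sorted(bank.items()))


# Invented example data, for illustration only.
slips = [
    Citation("scrapbook", "She kept an online scrapbook of her travels.",
             "An Invented Title", "A. Author", "2006", "possible new sense"),
    Citation("DNA", "The blog shares some DNA with the diary.",
             "Another Invented Title", "B. Writer", "2007"),
]
bank = file_under(slips)
print(list(bank))  # ['dna', 'scrapbook']
```

Filing by keyword mirrors the alphabetical filing of paper slips; lowercasing the keyword is a design choice of this sketch, not a requirement.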
11. Advantages of Citations
1- They are helpful for monitoring language change.
2- They give information about the terminology
of a specific subject field or a particular
variety or dialect.
3- They are helpful in training lexicographers.
12. Disadvantages of Citations
1- Collecting data in this way is labour-intensive,
so volumes will always be low.
2- Although instances of usage are authentic,
there is a big subjective element in their
selection.
13. The Central Role of the Corpus
A citation bank alone, even the largest one,
cannot usually supply language data in the
required volumes, so the case for a large
corpus is clear.
A “corpus” is a collection of pieces of
language text in electronic form, selected
according to external criteria to represent,
as far as possible, a language or language
variety as a source of data for linguistic
research (Sinclair 2005).
14. Some Inescapable Truths
There is no such thing as a perfect corpus for
lexicography.
First of all, the corpus is a sample. It is not possible to
examine every extant example of usage in a language. To
create a sample that fairly reflects the wider population,
there is a need for carefully selected criteria.
Secondly, selecting texts on the basis of their ‘quality’, and
excluding those which fail this test, is fundamentally at odds
with the descriptive ethos of corpus linguistics. Who is to
judge which texts are ‘good’, and on what basis? It is clear
that a lexicographic corpus must be a genuine, and inclusive,
snapshot of a language, not a set of texts that have been
specially chosen to advance someone’s notion of what
constitutes ‘good’ usage.
15. Corpora: Design Issues
Designing a corpus means making decisions about:
1- how large it will be.
2- which broad categories of text it will include.
3- what proportions of each category it will include.
4- which individual texts it will include.
16. Size: How large is large enough?
It is certain that the more data we have, the more we
learn. Yet there are also some hypotheses on the size
of the corpus. Zipf’s Law predicts that the tenth most
frequent word in a corpus will occur twice as often as
the 20th most frequent word, ten times as often as the
100th most frequent word, and 100 times as often as
the 1000th most frequent word. Thus, it can be said
that in a corpus of 100 million words, a simple right- or
left-sorted concordance clearly shows most of the normal
patterns of usage for all words except the very rare.
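The ratios quoted for Zipf's Law follow from the idealized assumption that a word's frequency is inversely proportional to its rank. The short Python sketch below checks that arithmetic; the function name `zipf_ratio` is hypothetical, and real corpora only approximate this distribution.

```python
from fractions import Fraction


def zipf_ratio(rank_a: int, rank_b: int) -> Fraction:
    """Expected frequency of the word at rank_a divided by that at rank_b,
    assuming an idealized Zipf distribution (frequency proportional to 1/rank)."""
    if rank_a < 1 or rank_b < 1:
        raise ValueError("ranks start at 1")
    # (1 / rank_a) / (1 / rank_b) simplifies to rank_b / rank_a.
    return Fraction(rank_b, rank_a)


for other in (20, 100, 1000):
    print(f"rank 10 vs rank {other}: {zipf_ratio(10, other)} times as frequent")
```

The practical consequence is that frequencies fall away quickly with rank, which is why rare words are scarce even in a very large corpus.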
17. Different texts, different styles
However large a corpus may be, if the words are
taken from only a limited area (for instance, from
newspapers), they cannot represent all aspects of
the language, and the results may be misleading.
For instance, in a newspaper corpus the word party will
most frequently occur in the sense of a political
organization rather than a social event. A corpus
consisting of a single type of text will reflect only the
stylistic and subject-matter features of that particular
genre. It will be, as corpus linguists say, a ‘skewed’
corpus. Therefore, the corpus should include
different texts and different styles.
18. Can a Corpus be Representative?
The standard way of avoiding bias is to collect a ‘random sample’.
Yet random sampling may not represent the language well. One
partial solution is to apply stratified sampling. This involves
breaking up the total population into a number of subcategories or
types, then creating independent random samples from each of
these groupings. But this immediately raises two questions:
1- How do we define these subcategories?
2- How do we decide what proportions of each subcategory the
corpus should include?
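Stratified sampling as described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `stratified_sample`, the three categories, and the proportions are invented, and the code does not answer the two questions, it simply takes the subcategories and proportions as given.

```python
import random
from collections import defaultdict


def stratified_sample(texts, proportions, total, seed=0):
    """Draw an independent random sample from each subcategory.

    texts: list of (category, text_id) pairs
    proportions: dict mapping category -> share of the sample (should sum to 1)
    total: desired number of texts in the whole sample
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for category, text_id in texts:
        by_category[category].append(text_id)

    sample = {}
    for category, share in proportions.items():
        wanted = round(total * share)
        pool = by_category[category]
        # Never ask for more texts than the stratum contains.
        sample[category] = rng.sample(pool, min(wanted, len(pool)))
    return sample


# Hypothetical population: 50 news texts, 30 fiction texts, 20 transcripts.
population = ([("news", f"n{i}") for i in range(50)]
              + [("fiction", f"f{i}") for i in range(30)]
              + [("spoken", f"s{i}") for i in range(20)])
picked = stratified_sample(population,
                           {"news": 0.4, "fiction": 0.4, "spoken": 0.2},
                           total=10)
print({category: len(ids) for category, ids in picked.items()})
# {'news': 4, 'fiction': 4, 'spoken': 2}
```

Note that the proportions here deliberately differ from the make-up of the population (news is 50 per cent of the texts but 40 per cent of the sample); choosing them is exactly the judgement call the second question refers to.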
19. It is almost impossible to define the population
that the corpus should be representative of,
and since the population is unlimited, it is
logically impossible to establish ‘correct’
proportions of each component. An achievable
objective should be “a balanced corpus”.
20. Selecting Texts
The corpus collection is usually recursive.
First, some texts from a range of sources are gathered.
Next, the texts are analyzed to identify recurring clusters
of linguistic features.
This enables us to establish provisional categories of texts,
grouped on the basis of shared linguistic features.
Then more texts are collected to reflect these feature
distributions.
Then the analysis is repeated on the enlarged corpus.
The process thus proceeds in a cyclical fashion until we
collect a large corpus whose contents reflect the proportions
in which the various key features are observable in large
bodies of text.
21. Spoken Data: A Special Case
With a corpus of spoken language, there are no
obvious objective measures that can be used to
define the target population. The spoken data
should represent variables like gender,
social class, age and religion. The conversations
that form the corpus should reflect the
diversity of the spoken language.
22. A Note on ‘Skewing’
Skewing refers to a form of bias in data
whereby a particular feature is either over- or
under-represented to a degree that distorts
the general picture. As corpora grow larger,
problems with skewing usually recede gradually.
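One crude way to see skewing, and to see why it recedes as a corpus grows, is to ask what share of a word's occurrences comes from a single source text. The Python sketch below is an assumption-laden illustration: the measure, the function name `largest_source_share`, and all the counts are invented and are not a standard from the corpus literature.

```python
def largest_source_share(counts_by_text):
    """Fraction of a word's occurrences that come from its single biggest source text."""
    total = sum(counts_by_text.values())
    if total == 0:
        return 0.0
    return max(counts_by_text.values()) / total


# Invented counts for one word in a small corpus: one text dominates.
small = {"novel_A": 40, "news_1": 5, "news_2": 5}
# The same word after hypothetically adding 50 more texts with 5 hits each.
large = dict(small, **{f"text_{i}": 5 for i in range(50)})

print(largest_source_share(small))            # 0.8
print(round(largest_source_share(large), 3))  # 0.133
```

In the small corpus a single novel supplies 80 per cent of the evidence for the word; in the larger one its influence drops to about 13 per cent, so its idiosyncrasies distort the picture far less.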
23. There are some questions that should be answered
before starting to form the corpus.
Language: Will the corpus be monolingual, bilingual, or
multilingual?
Time: Will the corpus be synchronic or diachronic? In
a synchronic corpus, the constituent texts come from
one specific period of time, whereas the texts making
up a diachronic corpus come from an extended period.
Mode: Will the corpus include written texts, spoken
texts, or both? The status of chat room
conversations, which have the characteristics of both
spoken and written texts, is another point that requires
attention in corpus formation.
24. Medium
Medium refers to the channel in which the text
appears. A simple classification here would
distinguish print media and spoken media. The
former include books, newspapers, magazines,
journals, dissertations, movie scripts, government
documents and legal statutes. Spoken media
include face-to-face conversations, broadcasts and
podcasts, public meetings, and educational settings.
Once again, traditional categories become blurred
when we add the web to the mix. Some ‘new’ text
types (blogs and social networking sites, for
example) are exclusive to the web, but many
documents exist in both print and electronic media.
25. Dealing with Sublanguages
When we think about the vocabulary of a
language, it is useful to make a broad
distinction between core usages and
sublanguages. The word deuce is part of a
sublanguage: it belongs to the vocabulary of
tennis. A word like important, on the other
hand, belongs to the core vocabulary of
English. The following question arises at this
point: will we include the sublanguages?
26. Collecting Written Data
In the past, the work of lexicographers was
not so easy. Earlier corpora made extensive
use of scanning and keyboarding, which were
both slow and labour-intensive processes.
Today it is possible to find various texts
in digital form.
27. Collecting Spoken Data
Traditionally, spoken data has been difficult
and expensive to collect. Consequently,
although the majority of communicative events
in a language occur in spoken mode, few
corpora include high proportions of spoken
material. For instance, only 10 per cent of the
BNC is spoken. Nowadays, web-derived spoken
data, which offers up-to-date material in large
quantities and at low cost, begins to look like an
attractive alternative.
28. Collecting Data from the Web
The question of whether ‘the web is a
corpus’ is a hotly debated topic in
language engineering circles. For
lexicography, it is better to see the
web as a source of texts from which
a lexicographic corpus can be
assembled.
29. Sample Size
There are arguments for using complete
texts rather than extracts. In many
registers, the discourse structure and
rhetorical features of a text may vary as it
proceeds from its opening paragraphs,
through its central sections, to the
concluding chapters. The BNC’s solution to
this was to ensure that 40,000-word samples
were taken variously from the beginning,
middle, and end of its source documents.
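Taking a fixed-size sample from different parts of a document is easy to express in code. The following Python sketch is illustrative only: the function name `take_sample` and the way the "middle" is centred are assumptions, not a description of how the BNC was actually compiled.

```python
def take_sample(words, size, position):
    """Return a contiguous sample of up to `size` words from the
    'beginning', 'middle', or 'end' of a tokenized document."""
    if size >= len(words):
        return list(words)
    if position == "beginning":
        start = 0
    elif position == "middle":
        start = (len(words) - size) // 2
    elif position == "end":
        start = len(words) - size
    else:
        raise ValueError("position must be 'beginning', 'middle', or 'end'")
    return words[start:start + size]


# Toy document of 10 'words'; a real sample would be far larger (e.g. size=40000).
doc = [f"w{i}" for i in range(10)]
print(take_sample(doc, 4, "beginning"))  # ['w0', 'w1', 'w2', 'w3']
print(take_sample(doc, 4, "middle"))     # ['w3', 'w4', 'w5', 'w6']
print(take_sample(doc, 4, "end"))        # ['w6', 'w7', 'w8', 'w9']
```

Varying the position from one source document to the next keeps openings, central sections, and conclusions all represented in the corpus.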
30. Copyright and Permissions
Unless a corpus is made up of much older texts, most
of its source material is likely to be protected by
copyright. So, corpus-builders should get permission
from the copyright owners to include the documents in
their corpus. This is not an easy task. It is one of the
most time-consuming aspects of the project. It is
recommended that corpus-builders never
offer to pay for permission to include a text. Once
money starts changing hands, a precedent would be
established that could have fatal consequences for
corpus-creation efforts worldwide.
31. Processing and Annotating the Data
To give the corpus its final form from its raw
state, some operations are carried out.
32. Clean-up, standardization, and text encoding
This is essentially the process of taking a
heterogeneous collection of input documents
and converting them all to a standard, usable
form. For instance, non-linguistic sounds in
spoken data (like erm, ooh, mhm) and unusable
material in written data (like indexes, tables,
diagrams) are not included in the corpus.
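A small part of this clean-up, dropping filler tokens from a transcript, can be sketched as follows. The filler list contains only the three examples named above, and the function name `clean_transcript` and the punctuation handling are assumptions of this sketch; a real project would define its own inventory and rules.

```python
# Illustrative list only; taken from the three examples in the text.
FILLERS = {"erm", "ooh", "mhm"}


def clean_transcript(line):
    """Remove filler tokens (with any attached punctuation) and normalize spacing."""
    kept = [token for token in line.split()
            if token.strip(".,!?;:").lower() not in FILLERS]
    return " ".join(kept)


print(clean_transcript("Erm,  I think   ooh that is mhm right."))
# I think that is right.
```

Splitting on whitespace and re-joining with single spaces also standardizes irregular spacing, which is one simple instance of converting input to a uniform form.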
33. Documentation
This means providing each input text with a unique
‘header document’ which records its essential
features. Headers typically give bibliographic
information (title, author’s name, date and
place of publication, and the like) and
precisely locate each text in whatever
typology is being used.
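A header can be modelled as a small record stored alongside each text. The Python sketch below is one possible shape, not a standard: the class `Header`, the helper `register`, the three typology fields, and the sample values are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Header:
    text_id: str    # unique identifier for the input text
    title: str      # bibliographic information
    author: str
    date: str
    place: str
    mode: str       # typology fields: these three are illustrative choices
    medium: str
    language: str


def register(index, header):
    """Add a header to the index, refusing duplicate identifiers."""
    if header.text_id in index:
        raise ValueError(f"duplicate header id: {header.text_id}")
    index[header.text_id] = header


index = {}
header = Header("T0001", "An Invented Title", "A. Author", "2005",
                "London", mode="written", medium="newspaper", language="en")
register(index, header)
record = json.dumps(asdict(header), sort_keys=True)
print(record)
```

Serializing the header (here to JSON) keeps the documentation machine-readable, so that corpus users can later select texts by any of the recorded features.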
34. Linguistic Annotation
This means enriching raw text by adding grammatical
information which will enable corpus users
to frame sophisticated queries and extract
maximum benefit from the data. For
instance, She is tagged as a personal
pronoun, and Really is tagged as a general
adverb. A well-tagged corpus allows us to
focus on each pattern in turn and view a
manageable number of examples.
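To show why tags make queries more precise, here is a minimal Python sketch that searches a tagged sentence for a sequence of word classes. The tag labels (`PRON`, `ADV`, and so on), the function name `find_pattern`, and the sample sentence are invented for illustration and do not correspond to any particular tagset.

```python
# A tiny hand-tagged sentence; tag labels are illustrative, not a real tagset.
tagged = [("She", "PRON"), ("really", "ADV"), ("likes", "VERB"),
          ("really", "ADV"), ("old", "ADJ"), ("dictionaries", "NOUN")]


def find_pattern(tokens, tags):
    """Return every run of words whose tags match the given tag sequence."""
    n = len(tags)
    hits = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if [tag for _, tag in window] == list(tags):
            hits.append(" ".join(word for word, _ in window))
    return hits


print(find_pattern(tagged, ["ADV", "ADJ"]))   # ['really old']
print(find_pattern(tagged, ["ADV", "VERB"]))  # ['really likes']
```

A plain search for the word really would return both uses together; querying by tag sequence separates the adverb-plus-adjective pattern from the adverb-plus-verb pattern, so each can be examined in turn.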
35. Final Thoughts
In this part, a methodology for building a
corpus for use in lexicography has been
outlined. This is certainly a difficult
task, and there is no perfect corpus since
language is diverse and dynamic. The aim is to
form a balanced, standardized, well-tagged
corpus. For many kinds of research, a corpus
with meticulously detailed headers and
fine-grained linguistic annotation is precisely what
is needed.
36. Summary: Lexicographic Evidence
This part has explained how the data that will serve as the source for
a planned dictionary are designed, collected, and processed.
First of all, the importance of the lexicographers’ own mental lexicons
should be emphasized. It is certain, however, that an individual’s
vocabulary, however large, is not sufficient for such a task.
In the past, the most frequently used method of data collection was
the citation method. Although this method has lost some of its
importance, it is still in use today. Indeed, special programmes have
been developed to collect data from volunteers over the internet.
With the development of computer and internet technologies, the corpus
method has become the most widely used method of data collection.
Throughout the part, it has been emphasized that corpus builders face
many questions and that their work is very difficult. Language is
diverse and dynamic, so a perfect corpus cannot be expected.
The lexicographers’ aim should be to build a corpus that is balanced,
best represents the language in use, takes into account the variables
arising from the people who use the language and the settings in which
it is used, and is organized in a way that is useful for preparing a
dictionary.