4 salient features of corpus

•

0 j'aime•2,239 vues

ThennarasuSakkan

Read more Dr. Niladri's work. These are from his work.

Formation

SALIENT FEATURES OF CORPUS
THENNARASU SAKKAN
26TH JULY 2019

QUANTITY:
It should be big in size containing large amount of data either in
spoken or written form.
Size is virtually the sum of its components, which constitute its
body.

QUALITY:
Quality = authenticity. All the texts should be obtained from
actual examples of speech and writing.
The role of a linguist is very important here. He has to verify
if language data is collected from ordinary communication,
and not from experimental conditions or artificial
circumstances.

REPRESENTATION:
It should be balanced to all areas of language use to represent
maximum linguistic diversities, as future analysis devised on it
needs verification and authentication of information from the
corpus presenting a language.
It should include samples from a wide range of texts.

SIMPLICITY:
It should contain plain texts in simple format.
This means that we expect an unbroken string of characters (or
words) without any additional linguistic information marked-up
within texts.
A simple plain text is opposed to any kind of annotation with
various types of linguistic and non-linguistic information.

EQUALITY:
Samples used in corpus should be of even size.
However, this is a controversial issue and will not be adopted
everywhere.
Sampling model may change considerably to make a corpus more
representative and multi-dimensional.

RETREAVABILITY:
Data, information, examples, and references should be easily
retrievable from corpus by the end-users.
This pays attention to preserving techniques of language data in
electronic format in computer.
The present technology makes it possible to generate corpus in PC
and preserve it in such way that we can easily retrieve data as and
when required.

VERIFIABILITY:
Corpus should be open to any kind of empirical verification.
We can use data form corpus for any kind of verification.
This puts corpus linguistics steps ahead of intuitive approach
to language study.

AUGMENTATION:
It should be increased regularly.
This will put corpus 'at par‘(original value of share) to register
linguistic changes occurring in a language in course of time.
Over time, by addition of new linguistic data, a corpus achieves
historical dimension for diachronic studies, and for displaying
linguistic cues to arrest changes in life and society.

DOCUMENTATION:
Full information of components should be kept separate from the
text itself.
It is always better to keep documentation information separate
from the text, and include only a minimal header containing
reference to documentation.
In case of corpus management, this allows effective separation of
plain texts from annotation with only a small amount of
programming effort.

Contenu connexe

Tendances

The locality principle- kirankiran nazir

History of linguistics presentationFariha asghar

Corpus LinguisticsProf.Ravindra Borse

StylisticsIqra Tehseen

DiglossiaMah Noor

Transformational-Generative GrammarRuth Ann Llego

Language planningErhan Bektaş

Language VariationDr. Cupid Lucid

Language and its componentsMIMOUN SEHIBI

The London School of LinguisticsUniversidad Autónoma de Nuevo León: Facultad de Filosofía y Letras

Corpus linguistics the basicsJorge Baptista

StylisticsНаталья Завьялова

Schools of thoughtValeria Roldán

History of translstudiesMuhmmad Asif

Cohesion, Coherence and TextualityDr. Mohsin Khan

Contrastive analysis abderrahim bellahcen

Branches of linguisticsMr. El-Sayed Ramadan

Transformational generative grammarBaishakhi Amin

Principles of parametersVelnar

Tendances (20)

The locality principle- kiran

History of linguistics presentation

Corpus Linguistics

Stylistics

Diglossia

Transformational-Generative Grammar

Language planning

Language Variation

Language and its components

The London School of Linguistics

Corpus linguistics the basics

Stylistics

Schools of thought

History of translstudies

Cohesion, Coherence and Textuality

Contrastive analysis

Branches of linguistics

Transformational generative grammar

Principles of parameters

Similaire à 4 salient features of corpus

Corpus linguisticsJitendra Patil

Corpus LinguisticsProf.Ravindra Borse

Corpus linguisticsAlicia Ruiz

Corpus study designbikashtaly

IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language ModelsIRJET Journal

Computer assisted text and corpus analysisRubyaShaheen

Treebank annotationMohit Jasapara

Building of Database for English-Azerbaijani Machine Translation Expert SystemWaqas Tariq

Interpretation of Sadhu into Cholit Bhasha by Cataloguing and Translation Systemijtsrd

A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGEScsandit

A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGESLinda Garcia

ELA Appendix Bjwalts

Generating similarity cluster of Indonesian languages with semi-supervised cl...IJECEIAES

Parafraseo-Chenggang.pdfUniversidad Nacional de San Martin

Corpus linguisticsjesuspickers80

lexicographyayfa

A statistical approach to term extraction.pdfJasmine Dixon

Cross-lingual event-mining using wordnet as a shared knowledge interfacepathsproject

Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...IJITE

Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...ijrap

Similaire à 4 salient features of corpus (20)

Corpus linguistics

Corpus Linguistics

Corpus linguistics

Corpus study design

IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models

Computer assisted text and corpus analysis

Treebank annotation

Building of Database for English-Azerbaijani Machine Translation Expert System

Interpretation of Sadhu into Cholit Bhasha by Cataloguing and Translation System

A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES

ELA Appendix B

Generating similarity cluster of Indonesian languages with semi-supervised cl...

Parafraseo-Chenggang.pdf

Corpus linguistics

lexicography

A statistical approach to term extraction.pdf

Cross-lingual event-mining using wordnet as a shared knowledge interface

Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology ...

Plus de ThennarasuSakkan

11 terms in corpus linguistics1 (1)ThennarasuSakkan

11 terms in Corpus Linguistics1 (2)ThennarasuSakkan

8 issues in pos taggingThennarasuSakkan

7 probability and statistics an introductionThennarasuSakkan

6 shallow parsing introductionThennarasuSakkan

5a use of annotated corpusThennarasuSakkan

5 relevance of annotated corpusThennarasuSakkan

2 why python for nlpThennarasuSakkan

1 computational linguistics an introductionThennarasuSakkan

Plus de ThennarasuSakkan (9)

11 terms in corpus linguistics1 (1)

11 terms in Corpus Linguistics1 (2)

8 issues in pos tagging

7 probability and statistics an introduction

6 shallow parsing introduction

5a use of annotated corpus

5 relevance of annotated corpus

2 why python for nlp

1 computational linguistics an introduction

Dernier

Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George

Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari

POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar

“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr

9953330565 Low Rate Call Girls In Rohini Delhi NCR9953056974 Low Rate Call Girls In Saket, Delhi NCR

ESSENTIAL of (CS/IT/IS) class 06 (database)Dr. Mazin Mohamed alkathiri

Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfUjwalaBharambe

OS-operating systems- ch04 (Threads) ...Dr. Mazin Mohamed alkathiri

DATA STRUCTURE AND ALGORITHM for beginnersSabitha Banu

KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...M56BOOKSTORE PRODUCT/SERVICE

Hierarchy of management that covers different levels of managementmkooblal

Historical philosophical, theoretical, and legal foundations of special and i...jaredbarbolino94

Software Engineering Methodologies (overview)eniolaolutunde

ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood

Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝9953056974 Low Rate Call Girls In Saket, Delhi NCR

History Class XII Ch. 3 Kinship, Caste and Class (1).pptxsocialsciencegdgrohi

call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR

internship ppt on smartinternz platform as salesforce developerunnathinaik

CELL CYCLE Division Science 8 quarter IV.pptxJiesonDelaCerna

Interactive Powerpoint_How to Master effective communicationnomboosow

Dernier (20)

Incoming and Outgoing Shipments in 1 STEP Using Odoo 17

Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf

POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx

“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...

9953330565 Low Rate Call Girls In Rohini Delhi NCR

ESSENTIAL of (CS/IT/IS) class 06 (database)

Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf

OS-operating systems- ch04 (Threads) ...

DATA STRUCTURE AND ALGORITHM for beginners

KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...

Hierarchy of management that covers different levels of management

Historical philosophical, theoretical, and legal foundations of special and i...

Software Engineering Methodologies (overview)

ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT

Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝

History Class XII Ch. 3 Kinship, Caste and Class (1).pptx

call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️

internship ppt on smartinternz platform as salesforce developer

CELL CYCLE Division Science 8 quarter IV.pptx

Interactive Powerpoint_How to Master effective communication

4 salient features of corpus

1. SALIENT FEATURES OF CORPUS THENNARASU SAKKAN 26TH JULY 2019

2. QUANTITY: It should be big in size containing large amount of data either in spoken or written form. Size is virtually the sum of its components, which constitute its body.

3. QUALITY: Quality = authenticity. All the texts should be obtained from actual examples of speech and writing. The role of a linguist is very important here. He has to verify if language data is collected from ordinary communication, and not from experimental conditions or artificial circumstances.

4. REPRESENTATION: It should be balanced to all areas of language use to represent maximum linguistic diversities, as future analysis devised on it needs verification and authentication of information from the corpus presenting a language. It should include samples from a wide range of texts.

5. SIMPLICITY: It should contain plain texts in simple format. This means that we expect an unbroken string of characters (or words) without any additional linguistic information marked-up within texts. A simple plain text is opposed to any kind of annotation with various types of linguistic and non-linguistic information.

6. EQUALITY: Samples used in corpus should be of even size. However, this is a controversial issue and will not be adopted everywhere. Sampling model may change considerably to make a corpus more representative and multi-dimensional.

7. RETREAVABILITY: Data, information, examples, and references should be easily retrievable from corpus by the end-users. This pays attention to preserving techniques of language data in electronic format in computer. The present technology makes it possible to generate corpus in PC and preserve it in such way that we can easily retrieve data as and when required.

8. VERIFIABILITY: Corpus should be open to any kind of empirical verification. We can use data form corpus for any kind of verification. This puts corpus linguistics steps ahead of intuitive approach to language study.

9. AUGMENTATION: It should be increased regularly. This will put corpus 'at par‘(original value of share) to register linguistic changes occurring in a language in course of time. Over time, by addition of new linguistic data, a corpus achieves historical dimension for diachronic studies, and for displaying linguistic cues to arrest changes in life and society.

10. DOCUMENTATION: Full information of components should be kept separate from the text itself. It is always better to keep documentation information separate from the text, and include only a minimal header containing reference to documentation. In case of corpus management, this allows effective separation of plain texts from annotation with only a small amount of programming effort.

11. ANY QUESTION!

4 salient features of corpus

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

Similaire à 4 salient features of corpus

Similaire à 4 salient features of corpus (20)

Plus de ThennarasuSakkan

Plus de ThennarasuSakkan (9)

Dernier

Dernier (20)

4 salient features of corpus