Introduction
What are ‘Low resource’ languages?

Half of the world’s 7,000 languages have been
predicted to go extinct within this century
(Krauss 1992).

Corpora are available for statistically none of
them.
Introduction
• Only around thirty languages currently enjoy
  full technological resources

• Only about 100 have basic resources such as
  dictionaries, spellcheckers, or parsers
  (Scannell 2007; Krauwer 2003).
Introduction
Why make corpora?
• Linguistic data can be analysed by linguists
  interested in theoretical questions
• Utilised by data scientists and computational
  linguists to provide better tools and
  applications
• Archived for posterity.
Outline
•   The Tʉlʉʉsɨke Kɨlaangi Facebook Group
•   Previous work (in brief)
•   Legality of using Facebook
•   Corpus creation process
•   An XML Schema for data archival
Tʉlʉʉsɨke Kɨlaangi
Rangi:
  –   Bantu language
  –   350,000 speakers
  –   Spoken mainly in Tanzania
  –   A few linguists working on it – mainly Oliver
      Stegen (Edinburgh, SIL)
Tʉlʉʉsɨke Kɨlaangi
Facebook Group:
  –   Founded by Oliver Stegen
  –   339 Members
  –   Since February 11, 2011
  –   Created for corpus generation.
  –   For talking in Rangi – but there is often English
      and Swahili code switching.
Previous Work
• Twitter corpora: Large datasets, lots of opinion
  mining.
  – Examples: US elections, Arab Spring


• Án Crúbadán by Kevin Scannell
Previous Work
• Work on Facebook corpora:
  –
  –
  – Ok, there is some work, but it is very sparse. (If
    you know of any, let me know.)
Legal Issues
• Disclaimer: This is not sound legal advice, and
  I am not opening a lawyer-client relationship
  with you by telling you any of this. This is
  merely what I think I’ve figured out by staring
  at the literature and Facebook for a very, very
  long time.
Legal Issues
• Facebook’s Statement of Rights and
  Responsibilities, section 3.2 states:
  – “You will not collect users’ content or information,
    or otherwise access Facebook, using automated
    means (such as harvesting bots, robots, spiders, or
    scrapers) without our permission.”
• Automated Data Collection Terms:
  – All automated processes on the site are
    forbidden, unless there is express written
    consent.
Legal Issues
• “You agree that any violation of these terms
  may result in your immediate ban from all
  Facebook websites, products and services.
  You acknowledge and agree that a breach or
  threatened breach of these terms would
  cause irreparable injury…” – Facebook
Legal Issues
• Workarounds:
  –   Use only ‘public’ information
  –   EU Directive 96/9/EC
  –   ‘Fair Use’
  –   Implied licenses
  –   Not using a crawler or scraper.
Privacy
• Facebook wants written consent from each
  user.
• Standard procedure in language
  documentation.
• Required by most universities (and often by
  journals).
Privacy
• Unnecessary here:
  –   All data is in the public domain.
  –   The data will not be shared or monetized
  –   All names and personal data are anonymised
  –   The data is being used purely for research.
  –   The group I’m looking at was set up for this
      purpose, and there has been personal
      communication confirming this by Stegen.
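
A minimal sketch of how the anonymisation step could work, assuming a known list of member names and a private salt. The function names, the `user_` prefix and the sample names are hypothetical and made up for illustration; the slides do not say how names were actually replaced.

```python
import hashlib


def pseudonym(name, salt):
    """Map a name to a stable pseudonym such as 'user_3fa9c1d2'."""
    digest = hashlib.sha256((salt + name).encode("utf-8")).hexdigest()
    return "user_" + digest[:8]


def anonymise(text, names, salt):
    """Replace every known name in the text with its pseudonym.

    Longer names are replaced first, so a full name is not partly
    rewritten by a shorter name that it contains.
    """
    for name in sorted(names, key=len, reverse=True):
        text = text.replace(name, pseudonym(name, salt))
    return text


if __name__ == "__main__":
    # Invented example names, not taken from the group.
    members = ["Asha Juma", "Asha"]
    print(anonymise("Asha Juma replied to Asha", members, salt="keep-me-private"))
```

The same name always maps to the same pseudonym, so threads stay readable as conversations. Names that are not in the list are left untouched, so a manual check would still be needed.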
The Tool
• Load page into a browser normally
  – the source code has already been collected into the
    system, and automation is not necessary for
    retrieving more URLs.
• Manually click on “Display more posts...” and
  “View all comments”
  – An Ajax request is sent to the server, and the posts
    are loaded in the browser.
• Copy and save the HTML source code.
• Clean and sort with Python (Beautiful Soup).
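
The cleaning step above can be sketched as follows. To stay dependency-free, this sketch swaps in Python's standard-library `html.parser` for Beautiful Soup, which is what the slides name. The class names `post` and `comment` are hypothetical: the real Facebook markup is not given here and would have to be checked by hand in the saved source.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PostExtractor(HTMLParser):
    """Collect the visible text of every <div> carrying a target class."""

    def __init__(self, target_classes=("post", "comment")):
        super().__init__(convert_charrefs=True)
        self.target_classes = set(target_classes)  # hypothetical class names
        self.items = []      # finished (kind, text) pairs, in page order
        self._kind = None    # class of the div currently being read
        self._depth = 0      # nesting depth inside that div
        self._chunks = []    # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._kind is not None:
            self._depth += 1
            return
        if tag == "div":
            classes = set((dict(attrs).get("class") or "").split())
            hit = classes & self.target_classes
            if hit:
                self._kind = sorted(hit)[0]
                self._depth = 1
                self._chunks = []

    def handle_endtag(self, tag):
        if self._kind is None or tag in VOID_TAGS:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace left over from the markup.
            text = " ".join(" ".join(self._chunks).split())
            self.items.append((self._kind, text))
            self._kind = None

    def handle_data(self, data):
        if self._kind is not None:
            self._chunks.append(data)


def extract_posts(html):
    """Return a list of (kind, text) pairs from one saved page."""
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return parser.items
```

This sketch treats posts and comments as sibling blocks. If comments are nested inside their post in the real markup, the extractor would need a stack instead of a single depth counter.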
XML Storage
• The data is massive.
• February 11, 2011 to February 17, 2011 comes
  to almost 300k lines of HTML.
• Mining this is not trivial.
XML Storage
• XML = Extensible Markup Language
• Not reliant on any single, particular program.
• Widely used for data storage already.
• XML works by conforming to a schema.
• Easily converted into RDF and other useful
  storage formats.
• Easy to understand for both humans and
  machines.
• The schema can also be stored independently of
  the data.
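
A minimal sketch of writing one thread to XML with Python's standard `xml.etree.ElementTree`. The element and attribute names (`thread`, `post`, `comment`, `author`, `date`) are assumptions made for illustration; they are not the schema from the talk, which is not reproduced in this text.

```python
import xml.etree.ElementTree as ET


def thread_to_xml(thread_id, post, comments):
    """Build a <thread> element holding one post and its comments.

    `post` and each item of `comments` are dicts with the keys
    'author', 'date' and 'text' (a hypothetical layout).
    """
    thread = ET.Element("thread", id=str(thread_id))
    post_el = ET.SubElement(
        thread, "post", author=post["author"], date=post["date"]
    )
    post_el.text = post["text"]
    for comment in comments:
        comment_el = ET.SubElement(
            thread, "comment", author=comment["author"], date=comment["date"]
        )
        comment_el.text = comment["text"]
    return thread


if __name__ == "__main__":
    thread = thread_to_xml(
        1,
        {"author": "user_a", "date": "2011-02-11", "text": "Tʉlʉʉsɨke Kɨlaangi"},
        [{"author": "user_b", "date": "2011-02-12", "text": "..."}],
    )
    # encoding="unicode" returns a str and keeps non-ASCII letters readable.
    print(ET.tostring(thread, encoding="unicode"))
```

Keeping comments as children of their thread preserves the context of each comment, which is the stated purpose of the storage format.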
Results
• The largest corpus currently available for
  Rangi:
  – Án Crúbadán crawler: 108 documents, comprising
    17,908 words and 123,354 characters.
• This Facebook corpus:
  – 990 threads, 64,891 words and 571,182
    characters.
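
For comparison, counts like these can be produced with a few lines of Python. This is a sketch under stated assumptions: words are split on whitespace and characters include spaces, which may not match how the figures above were actually counted.

```python
def corpus_stats(threads):
    """Count threads, whitespace-separated words, and characters."""
    words = sum(len(text.split()) for text in threads)
    characters = sum(len(text) for text in threads)
    return {"threads": len(threads), "words": words, "characters": characters}


def size_ratio(larger, smaller):
    """How many times bigger one count is than another."""
    return larger / smaller


if __name__ == "__main__":
    print(corpus_stats(["a bc", "def"]))
    # Word counts quoted above: Facebook corpus vs. Án Crúbadán corpus.
    print(round(size_ratio(64891, 17908), 2))
```

By word count, the Facebook corpus is a little over three and a half times the size of the Án Crúbadán corpus.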
Future Work
• Eventually, I hope to make this corpus public.



• Multilingual identification.
THANKS

  Questions?


https://github.com/RichardLitt/lrl

Building Corpora from Social Media


Editor's Notes

  1. The schema above does not allow for other linguistic annotation, such as part-of-speech tagging, or morphological or syntactic annotation. It is meant primarily as a storage format, to maintain the context of each comment and all detail that may be relevant to linguists from the original page. A different annotation format would need to be used for further annotations, but that is beyond the scope of this paper.