2. Introduction
What are ‘low-resource’ languages?
Half of the world’s 7,000 languages have been
predicted to go extinct within this century
(Krauss 1992).
Corpora exist for statistically none of
them.
3. Introduction
• Only around thirty languages currently enjoy
full technological resources
• Only around 100 have basic resources such as
dictionaries, spellcheckers, or parsers
(Scannell 2007; Krauwer 2003).
4. Introduction
Why make corpora?
• Linguistic data can be analysed by linguists
interested in theoretical questions
• Utilised by data scientists and computational
linguists to provide better tools and
applications
• Archived for posterity.
5. Outline
• The Tʉlʉʉsɨke Kɨlaangi Facebook Group
• Previous work (in brief)
• Legality of using Facebook
• Corpus creation process
• An XML Schema for data archival
6. Tʉlʉʉsɨke Kɨlaangi
Rangi:
– Bantu language
– 350,000 speakers
– Spoken mainly in Tanzania
– A few linguists working on it – mainly Oliver
Stegen (Edinburgh, SIL)
8. Tʉlʉʉsɨke Kɨlaangi
Facebook Group:
– Founded by Oliver Stegen
– 339 Members
– Since February 11, 2011
– Created for corpus generation.
– For talking in Rangi – but code-switching into
English and Swahili is common.
9. Previous Work
• Twitter corpora: Large datasets, lots of opinion
mining.
– Examples: US elections, Arab Spring
• An Crúbadán by Kevin Scannell
11. Previous Work
• Work on Facebook corpora:
–
–
– Ok, there is some work, but it is very sparse. (If
you know of any, let me know.)
12. Legal Issues
• Disclaimer: This is not sound legal advice, and
I am not entering into a lawyer-client
relationship with you by telling you any of
this. This is merely what I think I’ve figured
out by staring at the literature and Facebook
for a very, very long time.
13. Legal Issues
• Facebook’s Statement of Rights and
Responsibilities, section 3.2 states:
– “You will not collect users’ content or information,
or otherwise access Facebook, using automated
means (such as harvesting bots, robots, spiders, or
scrapers) without our permission.”
• Automated Data Collection Terms:
– All automated processes on the site are
forbidden, unless there is express written
consent.
14. Legal Issues
• “You agree that any violation of these terms
may result in your immediate ban from all
Facebook websites, products and services.
You acknowledge and agree that a breach or
threatened breach of these terms would
cause irreparable injury…” – Facebook
15. Legal Issues
• Workarounds:
– Use only ‘public’ information
– EU Directive 96/9/EC
– ‘Fair Use’
– Implied licenses
– Not using a crawler or scraper.
16. Privacy
• Facebook wants written consent from each
user.
• Standard procedure in language
documentation.
• Required by most universities (and often
journals).
17. Privacy
• Unnecessary here:
– All data is in the public domain.
– The data will not be shared or monetized
– All names and personal data are anonymised
– The data is being used purely for research.
– The group I’m looking at was set up for this
purpose, and Stegen has confirmed this in
personal communication.
18. The Tool
• Load the page into a browser normally
– the source code has already been collected into the
system, and automation is not necessary for
retrieving more URLs.
• Manually click on “Display more posts...” and
“View all comments”
– An Ajax query is sent to the database, and the posts
are loaded in the browser.
• Copy and save the HTML source code.
• Clean and sort with Python (Beautiful Soup).
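The cleaning step can be sketched in Python with Beautiful Soup. This is a minimal illustration, not the actual tool: the class name `userContent` is an assumed selector, and the real one depends on the markup found in the saved Facebook source.

```python
# Hedged sketch: extract post text from saved Facebook HTML.
# The "userContent" class name is an illustrative assumption --
# the real selector must be read off the saved page source.
from bs4 import BeautifulSoup

def extract_posts(html):
    """Return the visible text of each post-like <div>."""
    soup = BeautifulSoup(html, "html.parser")
    return [div.get_text(" ", strip=True)
            for div in soup.find_all("div", class_="userContent")]

sample = """
<html><body>
  <div class="userContent">Tʉlʉʉsɨke Kɨlaangi!</div>
  <div class="userContent">Habari za leo?</div>
</body></html>
"""
print(extract_posts(sample))
```

The same pass would also be the natural place to strip navigation chrome and anonymise user names before sorting the posts into threads.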
19. XML Storage
• The data is massive.
• February 11 to February 17, 2011 alone
amounts to almost 300k lines of HTML.
• Mining this is not trivial.
20. XML Storage
• XML = extensible markup language
• Not reliant on any single, particular program.
• Widely used for data storage already.
• XML documents can be validated against a schema.
• Easily converted into RDF and other useful
storage formats.
• Easy to understand for both humans and
machines.
• Can also be stored independently of the data.
21. [XML schema example]
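As a rough illustration of what one stored record could look like, here is a minimal sketch using Python's standard library. The element and attribute names (`thread`, `post`, `comment`, the anonymised author IDs) are assumptions for illustration, not the actual schema from the talk.

```python
# Minimal sketch of a possible XML record for one Facebook thread.
# Element/attribute names are illustrative assumptions, NOT the
# actual schema presented here. Language codes: lag = Rangi,
# swh = Swahili (ISO 639-3).
import xml.etree.ElementTree as ET

thread = ET.Element("thread", id="t0001", date="2011-02-11")
post = ET.SubElement(thread, "post", author="anon_01", lang="lag")
post.text = "Tʉlʉʉsɨke Kɨlaangi!"
comment = ET.SubElement(thread, "comment", author="anon_02", lang="swh")
comment.text = "Habari za leo?"

record = ET.tostring(thread, encoding="unicode")
print(record)
```

Keeping each comment nested under its thread is what preserves the conversational context that the storage format is meant to maintain.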
22. Results
• The previously largest corpus available for
Rangi:
– An Crúbadán crawler: 108 documents,
comprising 17,908 words and 123,354
characters.
• This Facebook corpus:
– 990 threads, 64,891 words and 571,182
characters.
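The gap between the two corpora can be quantified directly from the reported counts:

```python
# Corpus sizes as reported in the Results slide.
crubadan_words, crubadan_chars = 17_908, 123_354
facebook_words, facebook_chars = 64_891, 571_182

print(f"words: {facebook_words / crubadan_words:.1f}x larger")
print(f"chars: {facebook_chars / crubadan_chars:.1f}x larger")
# roughly 3.6x the words and 4.6x the characters
```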
The schema above does not allow for other linguistic annotation, such as part-of-speech tagging or morphological or syntactic annotation. It is meant primarily as a storage format, maintaining the context of each comment and all details from the original page that may be relevant to linguists. A different annotation format would be needed for further annotation, but that is beyond the scope of this paper.