As part of the ContentMine project (http://contentmine.org), this talk gives an up-to-date (in 2014) overview of text and data mining.
Written by Michelle Brook, Charles Oppenheim and Peter Murray-Rust.
This presentation was given by Charles Oppenheim at WOSP 2014.
Social, Political and Legal Aspects of Text and Data Mining (TDM)
1. SOCIAL, POLITICAL AND LEGAL
ASPECTS OF TEXT AND DATA
MINING (TDM)
Michelle Brook, The Content Mine,
michelle@contentmine.org
Peter Murray-Rust, University of Cambridge and
Shuttleworth Fellow, pm286@cam.ac.uk
Charles Oppenheim, Visiting Professor at City, Northampton
and Robert Gordon Universities,
c.oppenheim@btinternet.com
2. SO WHAT ARE THE NON-TECHNICAL
PROBLEMS OF TDM?
• LEGAL - copyright, database rights and licensing
• SOCIAL - a lack of awareness of TDM, and a gap between the
technical demands of many TDM tools and the skills of
many academics
• POLITICAL – the massive gap between publishers’
approaches to TDM and researchers’ needs; also the
lack of specific TDM exceptions to copyright in most
countries’ laws
3. COPYING IS OFTEN INVOLVED IN TDM
• PDFs, the lingua franca of academic journals, are not
machine readable
• For TDM purposes, they must be converted into a
different digital form
• That form is often custom and specific to the
research question being asked and the most
appropriate tools to answer that question
• So there is a need to copy/adapt the original PDF
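As a minimal sketch of what such a "custom form" might look like: assuming the text of an article has already been extracted from its PDF by some separate tool (not shown), the snippet below turns it into counts of terms relevant to one research question. The function name `term_counts`, the sample text and the term list are all hypothetical and chosen only for illustration.

```python
import re

def term_counts(text, terms):
    """Count case-insensitive whole-word occurrences of each term in text."""
    counts = {}
    for term in terms:
        # re.escape keeps any punctuation in a term from being read as regex syntax
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

# Hypothetical text, standing in for the output of a PDF-to-text conversion step
sample_text = "Copyright law affects text mining. Text mining needs lawful access."

print(term_counts(sample_text, ["text mining", "copyright", "licence"]))
# → {'text mining': 2, 'copyright': 1, 'licence': 0}
```

A different research question would need a different derived form (for example, extracted tables or chemical names), which is why the copy or adaptation tends to be specific to each project.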
4.
5. COPYRIGHT/DATABASE RIGHT
• Gives the owner the right to authorise, or to refuse to authorise, any of
the so-called restricted acts, including: copying; adapting; redisseminating
all, or a “substantial” part, of a copyright work (similar rules apply to
databases)
• Substantial does not mean “most of”, but rather “what is important”
• If someone does such restricted acts without permission, they have
infringed the right and can be sued
• However, there are certain (but very restricted) exceptions to copyright,
whereby someone CAN copy, etc., without having to ask for permission or
pay fees
• Only a few countries (the UK recently became one of them) have a specific
exception for TDM in their laws
• In the absence of such an exception in a country’s national law,
researchers must ask for permission (request a licence) from the
copyright owners. Generally, the copyright owners are publishers,
because authors have (foolishly) assigned their copyright to them
6. THE NEW UK LAW
• Came into force in June 2014
• Specific exception to copyright for TDM
• UK researchers do not have to ask for permission, pay fees,
etc., to do TDM as long as it is for “non-commercial” purposes
and as long as they have “lawful access” to the raw materials.
• What is, or is not, “non-commercial” is controversial, but what
is clear is that the question must be asked at the time the
TDM is undertaken, so unexpected commercial benefits at
the end of the project are OK, so long as the intent at the time
was non-commercial
• “Lawful access” usually means licensed content, whether OA
or obtained through a subscription to the materials
7. THE PROBLEMS OF APPROACHING
PUBLISHERS FOR LICENCES FOR TDM
• Many publishers want unreasonably high fees and/or place restrictions on
what can be done with their materials after TDM, and/or require
researchers to use their API, and/or take an extremely long time to decide
how to respond to a TDM request
• TDM researchers have to approach multiple publishers, each of which
has different attitudes, conditions, and speed of response to such
requests.
• This is very costly to a researcher, and inhibits academics from
sharing the outputs of their TDM research
• These problems are inhibiting the take-up of TDM, thereby limiting the
potential benefits this technology enables.
• This also explains why so many TDM experiments are limited to OA materials
8. PUBLISHER TDM LICENCE INITIATIVES
GENERALLY DO NOT HELP
• Publishers have started offering their own TDM licences and policies
• Their licences often impose unfair (and in the case of the UK,
unenforceable) constraints on researchers’ freedom to exploit TDM.
• Why “unenforceable”? Because UK law specifically states that any
contract or licence term that prevents anyone from doing TDM in the
manner prescribed in the new exception shall be deemed null and void
• There are exceptions of course – Springer and Royal Society in particular
offer generous TDM provisions.
• So why are publishers offering restrictive licences in the UK?
• One can only surmise that they hope licensees are ignorant of the new
law, or that the publishers themselves do not know about it. So they are
either deliberately misleading, or ignorant
9. WHAT POLITICAL INITIATIVES ARE
NEEDED?
• Under EU law, countries in the EU are able to introduce
exceptions for non-commercial TDM research.
• However, so far only the UK has taken advantage of this. The
EC is considering an EU-wide exception for TDM, and the
Republic of Ireland is also considering such a change to its
national law.
• Outside of Europe, only one or two Far East countries have
introduced such exceptions.
• There needs to be an international treaty requiring all
countries to include an exception for TDM in their national
laws
10. WHAT CAN PUBLISHERS DO TO HELP?
• Offer all researchers world-wide the same freedom as is now
available to UK researchers to undertake TDM for non-commercial
research purposes, so long as the user has lawful
access to the original materials
• Earn goodwill amongst the TDM research community by
offering user-friendly APIs (without, of course, REQUIRING a
researcher to use them), free advice, and discussion fora for
the exchange of experience and ideas in the theory and
practice of TDM
• Develop clear agreed statements as to which types of research
they regard as “non-commercial” and which as “commercial”.
11. ADDRESSING THE
RESEARCHER/TECHNOLOGY GAP
• Current TDM researchers are very technologically adept; work
is needed to make the existing tools easier to use for those
with less expertise.
• While The Content Mine and other organisations such as
Software Carpentry are running workshops to help academics
become more technically confident, much more needs to be
done.
• The TDM community needs to help close the gaps in
knowledge, ability and awareness
• Funders and institutions also have a responsibility to ensure
academics and PhD students are trained in such skills and
technologies
12. IN CONCLUSION
• The main barriers to the uptake of TDM are a lack of
awareness among academics, a skills gap, legal issues around
copyright and database rights, and restrictions imposed
by publishers’ licences. These problems are all solvable
• Other countries should change their laws to make TDM lawful
• Publishers should work with the TDM academic community to
develop agreed statements as to which types of research they regard
as “non-commercial” and which as “commercial”, and so prevent any
possible chilling effect from ambiguity around these terms
• Funders and institutions should be exploring how to teach TDM
techniques to interested academics and research students
• Thank you for your attention.
13. SOME USEFUL
RESOURCES/ACKNOWLEDGEMENT
• Use of TDM to detect scientific fraud -
http://www.nature.com/news/fraud-found-by-reading-between-the-lines-1.15859
• General overview of benefits of TDM - D. McDonald and U. Kelly, The value
and benefits of text mining (2012),
http://www.jisc.ac.uk/reports/value-and-benefits-of-text-mining
• Official guidance on the new UK copyright exception for TDM -
https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/315014/copyright-guidance-research.pdf
• Excellent general overview of the change to UK law and its implications -
http://copyrightuser.org/topics/text-and-data-mining/ - provides link to
the precise wording in the law
• Details of Springer’s and Royal Society’s initiatives at
http://www.springer.com/gb/rights-permissions/springer-s-text-and-data-mining-policy/29056
and http://royalsocietypublishing.org/text-data-mining
• Image shown in this presentation is from Wikipedia and is covered by a
Creative Commons CC BY licence