Text and Data Mining (TDM), whether for scientific research or for any other purpose, falls within the provisions of Directive 2019/790/EU on Copyright in the Digital Single Market. This presentation reports empirical research on TDM operations in the National Libraries of EU Member States.
Text & Data Mining in Archives, Libraries & Museums:
Research on TDM of National Libraries in the EU
Centre of International and European Economic Law
&
Jean Monnet Foundation For Europe
Prof. Maria Kanellopoulou Botti
Department of Archives, Library Science & Museology
Ionian University
Attorney-at-Law
&
Dr. Marinos Papadopoulos
Attorney-at-Law
Prof. M. Kanellopoulou Botti & Dr. M. Papadopoulos | e-Conference on Mass Digitization and the EU Policy for Intellectual Property @ 30-31/03/2022
Text and Data Mining (TDM)
Art.3 & Art.4 of Directive 2019/790/EU on Copyright in the Digital Single Market
(DSM Directive).
TDM includes Web Harvesting and Web Archiving activities.
A statutory, mandatory exception to Copyright that has long been requested (e.g., IFLA
Statement on Text and Data Mining, 2013).
The TDM exception is inspired by, and contains partly the same conditions as, the scientific
research exception. The TDM exception:
1. Has to be implemented across all EU Member States in order to ensure effective
harmonization of the law.
2. Must not be subject to contractual overrides where TDM is carried out for scientific
purposes.
3. Must not be locked up behind technological protection measures.
What TDM is
TDM is any automated analytical technique aimed at analyzing text and data in digital form in
order to generate information, which includes but is not limited to patterns, trends and
correlations. It covers any activity where computer technology is used to index, analyze,
evaluate and interpret mass quantities of content and data (Recitals 8, 11).
TDM is an inherent part of Artificial Intelligence and Machine Learning research.
TDM works in the following manner:
1. Identifying
2. Copying
   a. Pre-processing
      i. Tokenization
      ii. Normalization (stemming or lemmatization)
      iii. Parsing (POS tagging)
   b. Uploading
3. Extracting
4. Recombining
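The pre-processing, extracting and recombining steps listed above can be illustrated with a minimal Python sketch. It is not the pipeline of any surveyed library: the function names, the suffix rules and the two sample sentences are assumptions made for illustration, the suffix stripping is only a crude stand-in for real stemming or lemmatization, and POS tagging is left out because it needs a trained tagger.

```python
import re
from collections import Counter

# Hypothetical suffix rules: a crude stand-in for a real stemmer or lemmatizer.
SUFFIX_RULES = (("ies", "y"), ("ing", ""), ("s", ""))


def tokenize(text):
    """Tokenization: split raw text into alphabetic word tokens."""
    return re.findall(r"[A-Za-z]+", text)


def normalize(token):
    """Normalization: lowercase the token and strip one common English suffix."""
    token = token.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem = token[: -len(suffix)]
        # Keep very short words (e.g. "is") untouched.
        if token.endswith(suffix) and len(stem) >= 3:
            return stem + replacement
    return token


def extract(documents):
    """Extracting: count normalized terms separately for each document."""
    return [Counter(normalize(t) for t in tokenize(doc)) for doc in documents]


def recombine(per_document_counts):
    """Recombining: merge per-document counts into one corpus-wide table."""
    total = Counter()
    for counts in per_document_counts:
        total.update(counts)
    return total


# Two made-up sentences standing in for the copied (harvested) content.
documents = [
    "National libraries harvest websites.",
    "The library is harvesting a website.",
]
corpus_counts = recombine(extract(documents))
```

Here `corpus_counts` maps each normalized term to its frequency across both sample sentences, which is the kind of pattern-level information the extracting and recombining steps aim at.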
European v American perspective on TDM
In the US legal environment, courts have found that reproducing copyrighted works as one step
in the process of knowledge discovery through text and data mining is transformative, and thus
the act of reproducing works in the course of the TDM process is ultimately a fair use of those
works under the first fair use factor of the US Copyright Act. The concept of “transformative
use” encompasses that of “non-expressive use”, the latter being considered a subset of the
former.
In EU Copyright law, the notion of reproduction is given its broadest meaning, as is clearly
stated in art.2 of the InfoSoc Directive and also indicated in Recital 21 of the InfoSoc
Directive. In EU Copyright law, the meaning of reproduction is to be determined technically
rather than functionally. Thus, copying of works in the framework of the TDM process falls
within the legal meaning of reproduction, which is an exclusive right of the author of a work.
Lawful Access
1. Access to a work through a subscription or access to content based on open access
(Recitals 10, 14). Access to content that is freely available online. Access to work
that is allowed by an existing exception or limitation to Copyright.
2. Access to a database in compliance with the terms of use and the conditions of access
set by the rightholder of the database. Access to work that is allowed by an
existing exception or limitation to Copyright.
3. Lawful access = normal use = lawful use (Recital 33 of InfoSoc Directive).
4. Lawful access does not allow the circumvention of Technological Protection Measures (TPM).
Purpose-specific TDM (art.3) v TDM for any purpose (art.4)
1. Art.3: the TDM exception to Copyright is provided for the purpose of scientific
research.
2. Art.4: the TDM exception to Copyright is not purpose-specific.
Scientific research: “creative work undertaken on a systematic basis in order to increase the
stock of knowledge, including knowledge of man, culture and society, and the use of this stock
of knowledge to devise new applications.” (OECD definition)
The beneficiary of purpose-specific TDM
1. Art.3: a research organization and/or a cultural heritage organization.
Research organization: a university, including its libraries, a research institute or any
other entity, the primary goal of which is to conduct “scientific research” or to carry out
educational activities involving also the conduct of scientific research: (a) on a
not-for-profit basis or by reinvesting all the profits in its scientific research; or (b)
pursuant to a public interest mission recognized by a Member State, and in such a way that the
access to the results generated by such scientific research cannot be enjoyed on a preferential
basis by an undertaking that exercises a decisive influence upon such organization (Art.2§1).
Cultural heritage organization: a publicly accessible library or museum, an archive or a film
or audio heritage institution, regardless of the type of works or other subject matter that
they hold in their permanent collections; cultural heritage organizations should also be
understood to include, inter alia, national libraries, national archives, educational
establishments, research organizations and public sector broadcasting organizations (Art.2§3,
Recital 13).
The beneficiary of TDM (that is not purpose-specific)
1. Art.4: any public or private, non-profit or for-profit, legal or natural person (Recital 18)
Empirical Research of TDM in National Libraries of EU Member States
SURVEY IDENTITY
Name: A survey on web archiving in EU Member States’ national libraries
Kind: Empirical research via questionnaire
Medium: Internet (Google Forms)
Provider: Ionian University
Co-funded by: Greece and the European Union – European Social Fund
Part of: A research project titled “Web Archiving in Public Libraries and IP Law” within the framework of the Operational Program “Human Resources Development, Education and Lifelong Learning” of NSRF - Partnership Agreement 2014-2020
Duration: March – July 2019
Target group: National Libraries of EU Member States
Language: English
Basic fields/components: 1. Library’s policies on Web-harvesting / Arrangement / Procedures, 2. Technological issues, 3. Legal issues, 4. Access/Utilization, 5. Co-operation & Perspectives, 6. Proposals and useful observations
Number of questions: 17
Main scope: Collecting data on the current web archiving situation
Expected results: Enhancing countries involved in Web Archiving, complications, perspectives, new projects
Empirical Research of TDM in National Libraries of EU Member States
EU Member States National Libraries that responded to the research undertaken
Empirical Research of TDM in National Libraries of EU Member States
17 Questions:
1. Policy issues questions on Web-harvesting, library arrangements and procedures
2. Technological issues questions
3. Legal issues questions
4. Access/utilization questions
5. Co-operation & perspectives questions
6. Proposals and useful observations (open-ended questions)
Empirical Research of TDM in National Libraries of EU Member States
The importance of Web-Harvesting/Archiving for EU National Libraries
Empirical Research of TDM in National Libraries of EU Member States
Number of operators engaged in Web-harvesting/archiving in EU National Libraries (not all
surveyed EU National Libraries answered this question)
Country     Operators No.
Denmark     7-8
France      4
Slovenia    3
Greece      3
Spain       5
Hungary     5
Sweden      3
Belgium     1
Germany     4
Estonia     4
Empirical Research of TDM in National Libraries of EU Member States
Use of quality filters for Web-Harvesting of EU National Libraries
Empirical Research of TDM in National Libraries of EU Member States
The main purpose for Web-Harvesting/Archiving of EU National Libraries
Empirical Research of TDM in National Libraries of EU Member States
The use of third parties for Web-Harvesting of EU National Libraries
Empirical Research of TDM in National Libraries of EU Member States
The use of software for Web-Harvesting of EU National Libraries
Empirical Research of TDM in National Libraries of EU Member States
Software
Archive-it of Internet Archive
W3ACT
Heritrix crawl engine, Annotation Curation Tool
Repox Software
Heritrix
Proprietary software of the service provider
Heritrix bundled with NetarchiveSuite
Heritrix with Net Archive Suite (NAS)
Heritrix 3, ArchiveIt, Webrecorder (as an experiment)
NetarchiveSuite, Heritrix, Free text search using Solr, and Wayback. Developing search frontend and playback engine SolrWayback.
Archive-It
Heritrix, Net Archive Suite, Open Wayback, SolR
Heritrix (and the Web Curator Tool)
Web Curator Tool, Heritrix
NetarchiveSuite and Heritrix.
Heritrix web harvesting software
OWA-Client, developed by service provider
Heritrix (harvesting), SOLR (indexing), Wayback (search and representation)
Heritrix
Empirical Research of TDM in National Libraries of EU Member States
Concern for author’s consent before execution of Web-Harvesting of EU National Libraries
Empirical Research of TDM in National Libraries of EU Member States
Concern for personal data protection before execution of Web-Harvesting of EU National Libraries
Empirical Research of TDM in National Libraries of EU Member States
Concern for intellectual property protection in process of Web-Harvesting of EU National Libraries
Empirical Research of TDM in National Libraries of EU Member States
The terms of access to and use of works harvested from the Web and archived by EU
National Libraries
1. Usually only inside the library in the research reading rooms (7).
2. On legal deposit terminals with firewall (3).
3. Only on Library premises to registered users (6).
4. Available online with the specific permission of the website holder and publishers (5).
5. Available online with the permission of the National Library (1).
6. The web archive is publicly available without restrictions. Intellectual property right
holders can request their material to be accessible only on library premises (1).
7. The archived websites are available for research purposes only (3).
8. Only printing is permitted and not in all libraries (3).
Empirical Research of TDM in National Libraries of EU Member States
Inquiry into user satisfaction with the Web-harvesting service of EU National Libraries
Empirical Research of TDM in National Libraries of EU Member States
Forms of co-operation for Web-harvesting service of EU National Libraries
Empirical Research of TDM in National Libraries of EU Member States
Connection of Web-harvesting systems of EU National Libraries and e-book publishers
Empirical Research of TDM in National Libraries of EU Member States
Answers to the question on plans for new projects related to Web-harvesting in EU National Libraries
1. Integration of the web documents' metadata in the National Library Service Catalog.
2. Exploring the use of the Webrecorder tool to archive websites and push the WARCs gathered
in this way into the library’s collection.
3. More stakeholder involvement and projects to raise awareness of web harvesting.
4. Searching for new tools for harvesting content from social and streaming media
platforms.
5. Harvesting of press websites with a paywall (automated authentication of the crawler).
6. Cooperation with the Internet Archive, in order to achieve better bulk harvesting.
7. Upgrading the library’s services with the support of additional software (MINT), which will
make it possible to enrich metadata during the harvesting process.
8. Web-harvesting of new thematic fields such as digital music, climate change etc.
9. Constantly increasing the Web-harvested collections.
10. Modernizing and expanding the Web-harvesting environment, including the system used
for access to harvested works, where the library will switch from an in-house system to the
Open Wayback system.
11. Social media harvesting, depending on whether funding becomes available.
Empirical Research of TDM in National Libraries of EU Member States
The most important problem in Web-harvesting operation of EU National Libraries
Empirical Research of TDM in National Libraries of EU Member States
Proposals & Observations for Web-harvesting operation of EU National Libraries
1. Technology in general needs to be improved continually (e.g., to extract material from
large and dynamic web pages, which is not yet satisfactory or feasible with Heritrix).
2. Legal issues are always at the forefront of interest because the legislation is general and
incomplete and allows only for limited access to content harvested from the Web; library
experts also noted the necessity of protecting and securing their web collections.
3. Libraries prefer to start by developing small collections of works harvested from different
websites (quality and variety are important to them); they consider developing extensive
collections only at a later stage in their Web-harvesting operation (quantity is not an
immediate goal).
4. Improving technical infrastructures and tools is at the forefront of upcoming library
research projects, along with expanding collections, better description of web archive
metadata and extracting pages on new topics and fields such as social media and live
streaming.
5. The libraries most experienced in web harvesting aim at extracting materials from
“difficult” websites such as complex websites and sites with paywalls. Less experienced
libraries aim at developing collaboration and co-operation and at awareness-raising
programs for their Web-harvesting operation.