Digitizing all Dutch books, newspapers & magazines -
730 million pages in 20 years -
storing it, and getting it out there
Olaf D. Janssen
Koninklijke Bibliotheek (KB), National Library of the Netherlands,
Prins Willem-Alexanderhof 5, The Hague, The Netherlands
olaf.janssen@kb.nl
Abstract. In the next 20 years, the Dutch national library will digitize all
printed publications since 1470, some 730M pages. To realize the first
milestone of this ambition, KB made deals with Google and Proquest to digitize
42M pages.
Since 2003 KB has operated its e-Depot, a system for permanent digital object
storage. KB is now replacing it with a new solution to better deal with future
demands, allowing improved storage of its mass digitization output.
To meet user demand for centralized access, KB is also replacing its scattered
full-text online portfolio with a National Platform for Digital Publications, both a
content delivery platform for its mass digitization output and a national domain
aggregator for publications. From 2011 onwards, this collaborative, open and
scalable platform will be expanded with more partners, content and
functionalities.
The KB is also involved in setting up a Dutch cross-domain aggregator,
enabling content exposure in Europeana.
Keywords: National libraries, Digital library workflows, Mass digitization,
Google, Proquest, Permanent storage, Integrated access, Cross-domain cultural
heritage, Aggregation, Interoperability, Europeana
1 Digitizing the KB
The KB 1 started digitizing its holdings in 1995, for reasons of accessibility and
long-term preservation. In the first years, small-scale efforts focused on scanning visually
attractive materials: highlights of the collection for the widest possible audiences. One
of the first projects was 100 highlights of the Koninklijke Bibliotheek 2 , followed by
Memory of the Netherlands 3 , the national programme for digitizing Dutch cultural
heritage, which focused on image-based materials. It was not until 1999 that the
KB started digitizing historical textual publications (books, newspapers &
magazines).
For the last 8 years, the focus has been on large-scale digitization of text corpora for
study and research in the humanities, using public funding. In 2003 a project started
to scan the complete run of Dutch Parliamentary Papers 4 . Consisting of 2.3 million
pages, this was at that time an unprecedented quantity for the Netherlands. At the end
of 2006 the KB was awarded the Historical Newspapers project 5 . By the end of
2011, it will have scanned 8 million pages from popular Dutch regional, national and
colonial newspapers from the period 1618-1995.
In addition, in February 2011 the Early Dutch Books Online digitization effort 6
delivered 2.1 million full-text pages from the special book collections of the KB and
the university libraries of Amsterdam and Leiden. Furthermore, by the end of this
year, some 1.5 million pages from the most frequently consulted old magazines
(1840-1950) will have been converted into full text.
In 2010 the KB announced its ambitious plans to digitize all Dutch books,
newspapers, magazines and other printed publications from 1470 onwards, a total of
730 million pages. A first milestone is set for 2013, by which time the library should have
scanned 10% of this amount. To realize its ambition, the KB cannot rely on public
funding alone, especially at a time when government support for cultural heritage is
declining. It has therefore entered into strategic public-private partnerships
with both Google 7 and Proquest 8 to digitize 210,000 books (some 42M pages) from
its public domain collections.
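The scale of these figures is easier to judge with a quick back-of-the-envelope calculation. The Python sketch below uses only numbers quoted in the text; the derived values (pages per year, pages per book, share of the milestone) are plain arithmetic rather than figures published by the KB, and the even spread over 20 years is a simplifying assumption.

```python
# Figures quoted in the text.
TOTAL_PAGES = 730_000_000     # all Dutch printed publications since 1470
PROGRAMME_YEARS = 20          # planned duration of the programme
MILESTONE_PERCENT = 10        # share to be scanned by 2013
PARTNER_BOOKS = 210_000       # books in the Google and Proquest partnerships
PARTNER_PAGES = 42_000_000    # pages those books represent

# Derived values (assumption: scanning is spread evenly over the years).
milestone_pages = TOTAL_PAGES * MILESTONE_PERCENT // 100
pages_per_year = TOTAL_PAGES / PROGRAMME_YEARS
pages_per_book = PARTNER_PAGES / PARTNER_BOOKS
partner_share = PARTNER_PAGES / milestone_pages

print(f"2013 milestone:          {milestone_pages:,} pages")
print(f"Average yearly volume:   {pages_per_year:,.0f} pages")
print(f"Average book length:     {pages_per_book:.0f} pages")
print(f"Partnerships / milestone: {partner_share:.1%}")
```

Under these assumptions the 2013 milestone amounts to 73 million pages, the partnership books average 200 pages each, and the two partnerships together would cover somewhat more than half of the milestone.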
2 Permanent storage, now & in the future
As the national library, the KB has a duty to permanently store not only printed
publications, but also digital ones (both born-digital and digitized). As early as 1994
the KB recognized the importance of such an electronic depot and took action
accordingly. It started making pilot agreements with major international publishers for
depositing e-journals (“safehaven”) and undertook market research to acquire a
technical solution for permanent storage. Such a system turned out not to be available
off-the-shelf, so in 2000 KB joined forces with IBM to build the world’s first OAIS-
based processing and preservation system for permanent storage of digital objects.
This has resulted in the operational e-Depot 9 , which the KB has been running since
2003.
Nowadays, this deposit is a safehaven for over 15 million scientific articles from
some of the world’s biggest publishers 10 , focusing on international scientific,
technical and medical journals (STM-publications). In addition it houses digital
monographs, periodicals and reports from Dutch publishers and materials from the
scientific repositories of Dutch universities, as part of the NARCIS 11 initiative.
2.1 Towards a new e-Depot
In 2012 the KB’s maintenance contract with IBM will run out and components of the
system will no longer be supported. The current implementation of the e-Depot is
based on requirements set in the late ‘90s. Some of these have become outdated with
respect to current & expected future requirements for speed and collection
management facilities. Additionally, with the 'seven-year itch' of the system 12 having
passed, it has already outlived most other IT systems. Other reasons for
upgrading the e-Depot are:
• Volume & scalability: digital publishing has led to enormous growth of
KB’s digital collections. Furthermore, the KB wants to permanently store the
hundreds of millions of files resulting from its mass digitization programme.
• Heterogeneity & flexibility: the current system is only optimized for
processing and storing relatively small numbers of homogeneous single
objects, i.e. mostly PDFs. In other words, it is not able to give fast access to
large numbers of diverse and compound objects, which will become
increasingly common in the near future (e.g. enriched publications, e-books,
websites).
In defining new requirements, the KB consulted its international
colleagues, most notably the National Library of Germany (DNB) and SUB
Göttingen. This collaboration was based on the joint use of the IBM-based system.
In early 2009, the KB and DNB sought cooperation with other European national
libraries to share experience, knowledge and resources. Another reason for doing so
was the lack of suitable commercial off-the-shelf products; the solutions that are
available bring the risk of vendor lock-in. If national libraries joined forces in
defining requirements and tendering, this could prompt commercial suppliers to invest
more in developing solutions that meet those requirements. Together with the
national libraries of the UK, Germany, Norway, Spain, Portugal, Switzerland and the
Czech Republic, KB defined an architectural outline, based on a two-layered OAIS
model and a modular setup of the preservation system. Unfortunately, later that year
the libraries decided against a joint tender because of their differing timelines.
To guarantee continued technical innovation and development of the e-Depot, the KB
is a partner in the SCAPE project 13 . This EU-funded initiative will provide ongoing
technical input by developing scalable preservation planning and execution services
that can be deployed in the new e-Depot system within the next three to five years.
3 Providing access & adding value
The back-end data standards 14 are identical across all KB-run mass digitization
projects, making the outputs in theory fully interoperable. However, this potential has
not yet been exploited in the front-end presentation of the KB’s full-text collections.
So far these have been presented via separate websites (4, 5, 15), each with its own specific
branding, URLs, design and search & object display functionalities.
For end-users the KB collections thus appear to be unrelated and scattered, making
them relatively difficult to use given the expectations of modern users. They expect
all content to be available via a single point of entry, with the ability to apply multiple
views & filters (by theme, by time, by geographical location, by object type etc.) to
the interoperable, contextualized, enriched and re-usable content, with minimum
copyright limitations. In addition, users are primarily interested in the digital content
itself, and much less in the physical object or institution from which it was derived.
3.1 Providing access – the Dutch National Platform for Digital Publications
The KB has taken these user demands seriously and has just finished designing and
implementing the first basic iteration of the Dutch National Platform for Digital
Publications (working name). This full-text content distribution platform will give
access to digitized books, newspapers and magazines. Not only will it include the
output of the KB’s mass digitization projects, but it will also be open for text
collections from other libraries. Access will be central via a modern Web 2.0 site, as
well as distributed via search and display APIs. These can deliver content to users in
their normal workflows (via regular social networks, on mobile devices, in
professional virtual research environments & communities, in products like Zotero,
RefWorks, EndNote etc.), as well as allow others (both businesses and consumers) to
build their own applications based on the content.
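The text does not say which protocol the search API will use. As one possibility, SRU (named in Section 4 among the open standards of the national infrastructure) expresses a search as a plain URL that any third-party application can call. The Python sketch below builds such a request; the endpoint https://platform.example.org/sru, the function name and the page size are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint, for illustration only; not a real KB service.
BASE_URL = "https://platform.example.org/sru"

def search_url(cql_query, page=1, page_size=10, base_url=BASE_URL):
    """Build an SRU 1.2 searchRetrieve URL for one page of results."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,                         # CQL query string
        "startRecord": (page - 1) * page_size + 1,  # SRU counts records from 1
        "maximumRecords": page_size,
    }
    return f"{base_url}?{urlencode(params)}"

# Third page of results for a two-term CQL query.
url = search_url("almanak and 1795", page=3)
print(url)
print(parse_qs(urlsplit(url).query))
```

Because the whole request is a URL, the same call can be made from a widget, a mobile app or a reference manager plug-in without any client library.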
Further key design choices of the platform include:
1. Open: everybody can bring and get content, as long as it fits the scope
(Dutch textual publications) and certain standards (e.g. metadata & object
quality). This will enable small institutions without much in-house expertise
or infrastructure to expose their content on a national level. Depending on
the rights on the objects, the content can be used, re-used, shared or enriched
by third parties.
2. Scalable: given the ambitions of the KB to make all (to be) digitized
collections available online, the platform must be able to cope with huge
amounts of metadata and objects in the future. This means the service should
allow for step-by-step upscaling towards more content and functionalities,
with as little manual programming or data conversion work as possible.
3. Collaborative: as said above, the platform will be an open network of KB
and other institutions, starting with a coalition of the willing. To guarantee
buy-in from the start, partners will need to work collaboratively on both
operational and strategic levels. This includes not only technical but also
organizational issues, such as funding, sustainability, governance and policy
development.
This collaborative approach means that
• responsibilities (e.g. financial, technical, business, product development) are
shared among the partners,
• national expertise about e.g. semantic & metadata interoperability is brought
together,
• barriers for new partners to join the network are lowered,
• positions for joint support funding requests (both on national and European
levels) become stronger, and thus
• future sustainability of the platform is more likely.
Furthermore, the National Platform for Digital Publications will improve the
visibility of the KB as an attractive business-to-business service & data provider for
partners in the Netherlands. KB could for instance offer a package of (paid)
permanent object storage in its e-Depot, with an option to present the object on the
platform to end users free of charge.
The platform marks a turning point towards centralized access of KB text collections.
Starting with the output of the Early Dutch Books Online project in May 2011, the
content of the platform will be expanded step-by-step in the years to come. The
current planning is as follows:
• 2011: Early Dutch Books Online (2.1M pages); first set of old magazines
(1840-1950, up to 1.5M pages); first set of early 20th century books (1913
onwards)
• 2012: Historical Newspaper collection (8M pages, by transferring the content
of http://kranten.kb.nl into the platform); collection of historical children’s
books from the Rotterdam public library
• 2012-2014: output from the Google & Proquest efforts to be included, up to
42M pages
Finally, the National Platform for Digital Publications will be positioned as a full-text
and metadata aggregator, with the aim of making the content interoperable and
exporting it to cross-domain initiatives at national, European and global levels.
See Section 4 for more details.
3.2 Improved access leads to added value creation
In the past decade, cultural heritage institutions have invested increasingly in their
digital services, making their collections accessible and at the same time bringing new
economic and social benefits within reach. A report 16 by the Dutch Foundation for
Economic Research has shown that the total benefits of digitization and accessibility
outweigh the costs. The heritage sector, creative industries, the education sector and
consumers will all experience immediate benefits from widespread availability of
cultural heritage objects. In other words, digital collections represent significant
potential economic and social value, provided they are made easily accessible.
The BMICE 17 distribution ring model 18 of Figure 1 gives guidance on how
institutions should make their collections accessible to generate maximum added value.
Figure 1. The BMICE ring model - Distribution rings showing four forms of access
to cultural heritage. The outward arrow represents the direction of added value.
The four rings represent the following levels of access:
1. Analogue in house: The work is displayed physically or made physically
accessible in an archive, exhibition or reading room.
2. Digital in house: The work is described digitally and may be digitized. It is
made available within the walls of the institution by means of a closed
network (or through digital data carriers), such as a computer or terminal at
the institution that visitors can use to search through the collection database.
3. Online: All or part of the digital collection of the institution is offered online
through the institution’s website, but without explicit rights of use or reuse.
4. Online in the network: Digital collections of the institution are made
available in online networks. Rights of use are granted to third parties (the
public, other institutions) for use or reuse.
Heritage institutions have traditionally focused on - and felt safe in - the first ring,
with ring 2 opening up since the start of the digital age in the late ‘80s. The 3rd ring
has come into view since the mid ‘90s, when the web entered everyday life. The rise
of the social web in the ‘00s has given momentum to providing access to objects in the 4th
ring. Even now, many content holders are only just beginning to enter this ring
and to understand the huge benefits of opening up their collections within rights-controlled
networks & communities; for many this means a big step outside their
trusted safe zones. The yellow outward arrow in Figure 1 represents the direction of
added value. It can thus be concluded that “the more heritage institutions move
outside their comfort zones, the greater the value that is created.”
Some examples of activities in the outermost ring are:
• On-demand digital archive: Users can search & order (free or paid,
depending on the rights) cultural heritage sources using various search
functions.
• Online museum experience: Alternative to or expansion of the museum using
web 2.0 tools and platforms. Target users are approached actively by
offering widgets, setting up discussion groups on social networks, and so on.
• Collaborative storytelling: Users tell their own personal stories on platforms.
Heritage institutions often provide specific rights-cleared archive material
that users can then integrate into their narrative.
• Distributed online research: Technical platforms, tools and social networks
where users can jointly conduct and present research. This guarantees a
certain degree of reliability with regard to the information, the relationship
between the sources and the members of the community. An example of this
is wikipedia.org.
• Social tagging: Users are given the facility of tagging digitized cultural
heritage sources. The tags can contain a description or can express some
appreciation, and they enrich the collection, making it easier and more
worthwhile to discover.
• Online marketplace: This offers users the chance to bid online for cultural
heritage objects and works of art.
Another example of a 4th ring service is the National Platform for Digital
Publications. As said above, it will be an open & collaborative service, providing
search and display APIs for delivering content to the places and networks where users are.
Similar to YouTube, it will offer widget-based embeddable content, possibilities for
user annotation, user profile pages, and cross-collection searching & display.
4 The cross-domain & international dimensions
As the national library, the KB has a very important facilitating and networking role
in the Dutch scientific and cultural infrastructure. Using this position, it has the
potential to set up and stimulate different levels of collaboration to make online
heritage more accessible. This is illustrated by the 3-tier collaborative model in Figure 2.
Figure 2. Dutch national collaborative aggregation model. The KB is responsible for
aggregating publications in the National Platform for Digital Publications
Lower level: domain specific collaboration & aggregation
As said in Section 3, KB’s National Platform for Digital Publications will be
positioned as an aggregator for Dutch full-texts, aiming to make the content - and the
network of content delivering partners - interoperable and ready for participation in
cross-domain initiatives on national and international levels.
Besides the KB with its platform, organizations from other domains are working on
interoperability and aggregation for their specific sectors. Led by the Institute for
Sound & Vision 19 , institutions from the audio-visual domain collaborate to enable
aggregation of AV-materials. Similar initiatives are taking place for the archival
domain, with the National Archives 20 as the facilitator, and for the museum sector.
For the latter, the Rijksdienst voor het Cultureel Erfgoed 21 is the main player.
The ways in which content aggregation and the supporting technical and organizational
structures are set up are not uniform but differ across the domains. Based on
sector-specific best practices, knowledge and culture, each aggregator is setting up domain
interoperability in the best possible way. This is however not done in isolation; the
domains are in regular contact to reach consensus on issues such as “which content
goes where”, to learn from each other and to avoid overlapping work. This way
responsibilities & roles are kept clear, while at the same time synergies are exploited
where possible.
Middle level: national cross-domain collaboration & aggregation
To enable these sector-specific aggregation initiatives to come together, the results of
the NED! project 22 are used. It delivered a basic infrastructure for the interoperability
of Dutch digital heritage, using open standards including XML, Dublin Core, OAI-PMH
and SRU. It is now being expanded to build a cross-domain heritage aggregator
that can become the national hub for content delivery to international initiatives.
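How harvesting over these standards works can be sketched in a few lines. The Python example below builds an OAI-PMH ListRecords request for unqualified Dublin Core (oai_dc) and reads identifiers, titles and dates from a response. The verb, the metadataPrefix parameter and the XML namespaces come from the OAI-PMH 2.0 and Dublin Core specifications; the endpoint URL, the set name and the sample record are invented for illustration and do not describe the actual NED! infrastructure.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records_url(base_url, set_spec=None):
    """Build an OAI-PMH ListRecords request for unqualified Dublin Core."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    if set_spec:
        params["set"] = set_spec  # optional: harvest a single collection
    return f"{base_url}?{urlencode(params)}"

def parse_records(xml_text):
    """Extract identifier, titles and dates from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": record.findtext("oai:header/oai:identifier", namespaces=NS),
            "titles": [e.text for e in record.iterfind(".//dc:title", NS)],
            "dates": [e.text for e in record.iterfind(".//dc:date", NS)],
        })
    return records

# Invented sample response containing a single record.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:book-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example title</dc:title>
          <dc:date>1795</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

# Hypothetical endpoint and set name, for illustration only.
print(list_records_url("https://aggregator.example.org/oai", set_spec="books"))
print(parse_records(SAMPLE_RESPONSE))
```

A cross-domain aggregator built this way only needs each domain aggregator to expose such an endpoint; the internal organization of each domain can remain different.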
Building a national aggregator is however a step-by-step process, not finished
overnight. Until that time domain-specific aggregators - in case of the library domain
the Dutch National Platform for Digital Publications or The European Library 23 -
will continue to have an important role in routing Dutch library content directly to
top-level services. Finally, it should be noted that the cross-domain hub is envisioned
as a “dark aggregator”, i.e. a B2B service without an interface (website) for end users
(however, see item 5 below).
Top level: International cross-country collaboration & aggregation
Having established national cross-domain aggregation and interoperability on as
many levels as possible 24 , Dutch content can be shown and used on international
stages, most notably Europeana 25 .
This fast growing, largely EU-funded, metadata aggregator and display space for
European digitized works enables people to explore the resources of Europe's
museums, libraries, archives and audio-visual collections. It promotes discovery and
networking opportunities in a multilingual space where users can engage, share in and
be inspired by the rich diversity of Europe's cultural and scientific heritage.
Europeana always connects users to the original source of the material so authenticity
is ensured. The digital objects they can find are not stored centrally with Europeana,
but remain hosted at the providing cultural institutions.
Europeana offers the following added values for (Dutch) content holding institutions:
1. It enriches the experience of their users by making relations between their
objects and information from other countries and in other formats. This
enables cross-border and interdisciplinary research, as well as enriching the
content by presenting it in a wider context.
2. Users expect integrated content – they want to see videos, listen to sound
recordings, look at images and read texts, all in one place. Using Europeana
they can find related content in multiple formats, from different countries
and from diverse domains and disciplines.
3. Europeana makes their content findable in search engines.
4. Europeana generates extra visits to their holdings by redirecting users to the
original source of the content (i.e. the content holders’ websites).
5. Europeana offers a set of APIs 26 . These not only enable reuse of Europeana
content by third parties, but also allow the contextualized & enriched content
of the providing institutions to be used in their own environments. The APIs,
in other words, make it possible to create user interface elements for (dark)
aggregation services on the lower and middle levels, as indicated in Figure 2
by the dotted API arrows.
6. Knowledge transfer can be a major added value for participants in the
Europeana network. Europeana collaborates with professionals from digital
libraries across Europe and the US. Knowledge generated by these experts is
fed back into the network via presentations, workshops and seminars. This
way valuable knowledge about the theory and practice on metadata
standards, multilinguality, semantic web, information architectures, usability,
geolocation, object modeling and many other subjects becomes available for
content suppliers.
All advantages mentioned in Section 3 about openness, scalability and collaboration
apply equally to Europeana, as these key design choices were also the foundations on
which Europeana was built. Similar to the National Platform for Digital Publications,
Europeana is also a service in the 4th ring of the BMICE model. Becoming partners in
the Europeana network and making their content (re-)usable there will thus allow
Dutch institutions to add another layer of value to Dutch cultural & scientific
heritage.
1
Koninklijke Bibliotheek (KB), national library of the Netherlands, http://www.kb.nl
2
100 highlights of the KB, http://www.kb.nl/galerie/100hoogtepunten/index-en.html
3
Memory of the Netherlands, the national programme for digitizing Dutch cultural heritage,
http://www.geheugenvannederland.nl
4
Filming and digitization of the Dutch parliamentary papers 1814-1995,
http://www.kb.nl/hrd/digitalisering/archief/staten-generaal-en.html (project information) &
http://www.statengeneraaldigitaal.nl/ (website)
5
Dutch Historical Newspapers 1618-1945, http://www.kb.nl/hrd/digi/ddd/index-en.html
(project information) & http://kranten.kb.nl (website)
6
EDBO – Early Dutch Books Online - 10,000 full-text digitized books from 1781-1800, 2.1
million pages, http://www.earlydutchbooksonline.nl (from 26-5-2011 onwards)
7
KB and Google sign book digitization agreement, http://www.kb.nl/nieuws/2010/google-
en.html
8
Digitization by Proquest of early printed books in KB collection,
http://www.kb.nl/nieuws/2011/proquest-en.html
9
E-Depot, the KB’s digital archiving environment for permanent access to digital objects -
http://www.kb.nl/hrd/dd/index-en.html
10
Including, but not limited to Elsevier, BioMed Central, Blackwell Publishing, Oxford
University Press, Springer and Brill. For a complete list, see http://www.kb.nl/dnp/e-
depot/operational/background/policy_archiving_agreements-en.html
11
NARCIS, National Academic Research and Collaborations Information System,
http://www.narcis.nl/about/Language/en
12
Wijngaarden, H. van.: The seven year itch. Developing a next generation e-Depot at the
KB. Paper for the 76th IFLA General Conference and Assembly, 10-15 August 2010,
Gothenburg, Sweden, http://www.ifla.org/files/hq/papers/ifla76/157-wijngaarden-en.pdf
(accessed on 28-03-2011)
13
SCAPE - SCAlable Preservation Environments, http://www.scape-project.eu/
14
KB’s open digitization & accessibility standards,
http://www.kb.nl/hrd/digitalisering/standaarden-en.html
15
Digitization of ANP news items, http://www.kb.nl/hrd/digitalisering/archief/anp-en.html
(project information) & http://anp.kb.nl (website)
16
Hof, B.J.F. et al.: Baten in beeld; Kengetallen kosten-batenanalyse: beelden voor de
toekomst, SEO Amsterdam (2006), ISBN13 9789067333405,
http://www.kennisland.nl/uploads/.../8ba66f40-51c9-4f7f-9e60-8404c8aa84e8 (accessed on 27-
03-2011)
17
BMICE, Business Model Innovatie Cultureel Erfgoed / Business Model Innovation
Cultural Heritage, http://www.bmice.nl/
18
BMICE ring model, taken from
http://www.den.nl/getasset.aspx?id=Businessmodellen/KL_BusModIn_web_eng_04.pdf&asset
type=attachments
19
The Netherlands Institute for Sound & Vision, http://instituut.beeldengeluid.nl
20
National Archives of the Netherlands, http://www.en.nationaalarchief.nl/default.asp
21
Rijksdienst voor het Cultureel Erfgoed, http://www.cultureelerfgoed.nl
22
NED! - Nederlands Erfgoed Digitaal!, http://www.nederlandserfgoeddigitaal.nl/
23
The European Library; on the one hand a free service that offers access to the resources of
the 48 national libraries of Europe in 35 languages, on the other hand an international library
domain aggregator for Europeana, http://www.theeuropeanlibrary.org
24
Establishing interoperability on as many levels as possible: technical, metadata,
semantic, human, inter-domain, organizational, political, etc.
25
Europeana; paintings, music, films and books from over 1500 of Europe's galleries,
libraries, archives and museums, http://www.europeana.eu
26
Europeana Application Programming Interfaces, http://version1.europeana.eu/web/api