These slides were presented at the "Our Digital Future - Multidisciplinary Perspectives on Long Term Data Preservation and Access" conference in Cambridge, UK, in March 2016, in the session "Current and Future perspectives on technology for data preservation and sharing". They describe work in progress in the E-ARK project, which is co-funded by the European Commission. The project's main objective is to create a scalable, open source digital archiving system offering efficient search of and access to the content of very large digital object collections. The presentation focuses on the core big data technologies (Apache Hadoop, Apache HBase, and the document repository Lily developed by NGData), the architecture of the E-ARK integrated prototype implementation, and data mining use cases related to geographical data, named entity extraction, and OLAP data analysis.
5. Advisory Boards
Archival
• Archives of Emilia-Romagna, Italy
• Directorate-General of the Book, of Archives & of Libraries, Portugal
• EC Archives & Records Management
• EC Historical Archives
• German Federal Archives
• National Archives of Bulgaria
• National Archives of Finland
• National Archives of France
• National Archives of Sweden
• National Archives of the Netherlands
• Polish Data Archive
• Queensland State Archives
• Swiss Federal Archives
• UK National Archives
• UK Parliamentary Archives
Commercial
Technical
• Arkivum
• ARMA Europe
• DigitalForever
• Discovery Garden
• Microsoft Research
• Open Preservation Foundation
• Open Text Initiative
• Preservica
• Versity
Data Providers
• Danish Agency for Digitisation
• Estonian Ministry of Economic Affairs & Communication
• Estonian Unemployment Insurance Fund
• James Lappin, RM Consultant
6. Project mission
• Improve access to the archived records of European Archives
• Create guidelines and recommended practices
• Cover relational databases, record management systems, and geographical data
• Create open source implementation evaluated in several pilots
7. Outcomes
Standardisation of available best practices
• Common terminology (Knowledge Center)
• SIP, AIP and DIP format specifications
• Pre-ingest, ingest and access workflows
Open source tools
• Scalable, modular, and reusable implementation of specifications
• Individual deployments (Pilots) and an integrated reference implementation
11. Map/Reduce in a nutshell
Input data
• Input split 1: Record 1, Record 2, Record 3
• Input split 2: Record 4, Record 5, Record 6
• Input split 3: Record 7, Record 8, Record 9
Map: Task 1, Task 2, Task 3
Sort, Shuffle, Merge
Reduce
Output data: Aggregated Result, Aggregated Result
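The stages named on this slide (input splits, map tasks, sort/shuffle/merge, reduce, aggregated result) can be illustrated with a minimal single-process sketch. It is not Hadoop code: the word-count job, the function names, and the sample records are assumptions chosen for illustration only.

```python
from collections import defaultdict
from itertools import chain


def map_task(records):
    """Map phase: emit one (word, 1) pair per word in each record of a split."""
    for record in records:
        for word in record.lower().split():
            yield word, 1


def shuffle(pairs):
    """Sort/shuffle/merge phase: group emitted values by key, in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_task(key, values):
    """Reduce phase: aggregate all values collected for one key."""
    return key, sum(values)


def run_job(input_splits):
    """Run one map task per input split, then shuffle and reduce."""
    mapped = chain.from_iterable(map_task(split) for split in input_splits)
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())


# Hypothetical input: three splits with three records each, as on the slide.
splits = [
    ["archive record", "digital archive", "record"],
    ["digital object", "archive", "object store"],
    ["record", "digital record", "archive"],
]
result = run_job(splits)
```

In a real Hadoop job the map tasks run in parallel on the nodes holding the input splits, and the framework performs the shuffle between nodes; here everything runs sequentially in one process.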
15. AIP to DIP, SIP to AIP
Hadoop Distributed File System
NAS
Working area
Search and Access
Lily Repository
DIP Delivery
Workers
Celery
Information Package processing & Access Repository
17. Ingest and Preservation
Access
E-ARK SIP
SIP Creation Tools
Archival records
Content and Records Management Systems
SIP – AIP Conversion
E-ARK AIP
CMIS Interface
Data Mining Interface
Digital preservation systems
AIP - DIP Conversion
Scalable Computation
E-ARK DIP
Archival Search, Access and Display Tools
Content and Records Management Systems
Data Mining Showcase
24. Text mining: Text classification
Training
• Train classifier using annotated text corpus
• SVM – based on statistical features
Classification
• Scan for texts during ingest (or run MR after)
• Text category estimation
Search
• Add category as a searchable field to Lily index
• Full-text search using Lily's SolR search interface
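The three steps above (train, classify, add the category as a field) can be sketched in plain Python. This is a simplified illustration and not the E-ARK implementation: it trains a bias-free linear SVM with hinge-loss subgradient descent on word-count features, and the corpus, the two categories, and the record field names (`id`, `text`, `category`) are all hypothetical.

```python
from collections import Counter


def featurize(text):
    """Statistical features: simple word counts (bag of words)."""
    return Counter(text.lower().split())


def score(weights, features):
    """Linear decision value w . x (no bias term, for simplicity)."""
    return sum(weights.get(term, 0.0) * count for term, count in features.items())


def train_linear_svm(corpus, epochs=20, lr=0.1, lam=0.01):
    """Train with hinge loss and L2 regularisation; labels are +1 or -1."""
    weights = {}
    for _ in range(epochs):
        for text, label in corpus:
            features = featurize(text)
            margin = label * score(weights, features)
            # L2 regularisation: shrink every weight a little.
            for term in weights:
                weights[term] *= 1.0 - lr * lam
            # Hinge loss: update only when the example is inside the margin.
            if margin < 1.0:
                for term, count in features.items():
                    weights[term] = weights.get(term, 0.0) + lr * label * count
    return weights


def classify(weights, text):
    """Estimate the text category from the sign of the decision value."""
    return "finance" if score(weights, featurize(text)) > 0 else "health"


def to_index_record(doc_id, text, weights):
    """Build a record carrying the category as an extra field for indexing."""
    return {"id": doc_id, "text": text, "category": classify(weights, text)}


# Hypothetical annotated corpus: +1 = finance, -1 = health.
corpus = [
    ("budget tax invoice payment", 1),
    ("tax payment bank invoice", 1),
    ("hospital patient treatment doctor", -1),
    ("patient doctor clinic treatment", -1),
]
weights = train_linear_svm(corpus)
record = to_index_record("doc-1", "invoice for tax payment", weights)
```

The classification step could run per document during ingest or as a batch MapReduce job afterwards, as the slide notes; in both cases the resulting `category` value is what would be indexed alongside the full text.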
25. OLAP (Online Analytical Processing)
• Database archiving and re-use (SIARD2)
• Normalization - OLAP/Oracle Data Warehouse
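As a rough illustration of OLAP-style analysis on re-used archived database content, the sketch below aggregates a fact table along two dimensions and then rolls it up to one. It uses an in-memory SQLite database as a stand-in for a data warehouse; the tables, columns, and values are hypothetical and are not taken from any E-ARK pilot or SIARD2 package.

```python
import sqlite3

# In-memory SQLite stands in for the data warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_claims (region_id INTEGER, year INTEGER, amount REAL);
INSERT INTO dim_region VALUES (1, 'North'), (2, 'South');
INSERT INTO fact_claims VALUES
    (1, 2014, 100.0), (1, 2014, 50.0), (1, 2015, 70.0),
    (2, 2014, 30.0), (2, 2015, 20.0), (2, 2015, 10.0);
""")


def totals_by_region_and_year(connection):
    """Aggregate the fact table along both dimensions (region and year)."""
    return connection.execute(
        """
        SELECT r.name, f.year, SUM(f.amount)
        FROM fact_claims AS f
        JOIN dim_region AS r ON r.region_id = f.region_id
        GROUP BY r.name, f.year
        ORDER BY r.name, f.year
        """
    ).fetchall()


def totals_by_region(connection):
    """Roll up: drop the year dimension and aggregate per region only."""
    return connection.execute(
        """
        SELECT r.name, SUM(f.amount)
        FROM fact_claims AS f
        JOIN dim_region AS r ON r.region_id = f.region_id
        GROUP BY r.name
        ORDER BY r.name
        """
    ).fetchall()


detailed = totals_by_region_and_year(conn)
rolled_up = totals_by_region(conn)
```

The split into a fact table and a dimension table mirrors the normalisation step named on the slide: once archived tables are loaded into such a schema, the same data can be summarised at different levels of detail.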
Purpose is to assess contributions to and from the project
Open to interested parties
Meetings of these groups
Gather information and contribute to a knowledge base (maintained by the DLM Forum)
Technologies: Hadoop MapReduce, SolR, HDFS, Lily Repository, ESSArch Preservation Platform, E-ARK Web
Vertical Integration: [MapReduce] works atop [HDFS], [SolR] indexes [Lily] Records
Horizontal Integration: [MapReduce] used to build [SolR] index, [HDFS] used to store [Lily] content, packages ingested via [EPP] UI are searched/accessed via [E-ARK WEB] UI