The Use of Big Data Techniques for Digital Archiving


These slides were used in a presentation at the "Our Digital Future - Multidisciplinary Perspectives on Long Term Data Preservation and Access" conference in Cambridge, UK, in March 2016, in the session "Current and Future perspectives on technology for data preservation and sharing". They describe work in progress in the E-ARK project, which is co-funded by the European Commission and whose main objective is to create a scalable, open source digital archiving system offering efficient search of, and access to, the content of very large digital object collections. The presentation focuses on the core big data technologies (Apache Hadoop, Apache HBase, and the document repository Lily developed by NGData), the architecture of the E-ARK integrated prototype implementation, and data mining use cases related to geographical data, named entity extraction, and OLAP data analysis.

The Use of Big Data Techniques
for Digital Archiving
Sven Schlarb, Austrian Institute of
Technology
Tuesday 15th March 2016, Cambridge

OUTLINE
• E-ARK Project Overview
• Technical Background
• Integrated Prototype
• Data Mining Use Cases

THE
E-ARK PROJECT
IS
CO-FUNDED
BY THE
EUROPEAN
COMMISSION
UNDER THE
ICT-PSP
PROGRAMME
www.eark-project.eu

Advisory Boards
Archival
• Archives of Emilia-Romagna, Italy
• Directorate-General of the Book, of
Archives & of Libraries, Portugal
• EC Archives & Records Management
• EC Historical Archives
• German Federal Archives
• National Archives of Bulgaria
• National Archives of Finland
• National Archives of France
• National Archives of Sweden
• National Archives of the
Netherlands
• Polish Data Archive
• Queensland State Archives
• Swiss Federal Archives
• UK National Archives
• UK Parliamentary Archives
Commercial
Technical
• Arkivum
• ARMA Europe
• DigitalForever
• Discovery Garden
• Microsoft Research
• Open Preservation Foundation
• Open Text Initiative
• Preservica
• Versity
Data Providers
• Danish Agency for Digitisation
• Estonian Ministry of Economic
Affairs & Communication
• Estonian Unemployment Insurance
Fund
• James Lappin, RM Consultant

Project mission
• Improve access to the archived records of
European Archives
• Create guidelines and recommended
practices
• Cover relational databases, record
management systems, and geographical
data
• Create open source implementation
evaluated in several pilots

Outcomes
Standardisation of
available best-
practices
• Common terminology
(Knowledge Center)
• SIP, AIP and DIP
format specifications
• Pre-ingest, ingest and
access workflows
Open source tools
• Scalable, modular,
and reusable
implementation of
specifications
• Individual
deployments (Pilots)
and an integrated
reference
implementation

Hadoop Cluster
Task Trackers
Data Nodes
Job Tracker
Name Node

Hadoop = MapReduce + HDFS
Distributed processing (MapReduce)
Distributed storage (HDFS)
example: 2 x quad-core CPUs:
10 Map (parallelisation)
4 Reduce (aggregation)
example: 4 x 1 TB hard disks (replication factor 3):
approx. 1.33 TB
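The storage figure in the example above follows from dividing raw disk capacity by the replication factor. A minimal sketch of that arithmetic, assuming all disks are fully available to HDFS (the function name is illustrative, and real clusters also reserve some space for non-HDFS use):

```python
def usable_capacity_tb(disks, disk_size_tb, replication_factor):
    """Approximate usable HDFS capacity: raw capacity / replication factor."""
    raw_capacity_tb = disks * disk_size_tb
    return raw_capacity_tb / replication_factor

# The slide's example: 4 disks of 1 TB each, every block stored 3 times.
usable = usable_capacity_tb(disks=4, disk_size_tb=1.0, replication_factor=3)
print(f"approx. {usable:.2f} TB")  # approx. 1.33 TB
```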

Map/Reduce in a nutshell
Diagram: the input data is divided into three input splits of three records each (records 1 to 9). Map tasks 1 to 3 each process one split; the map output is sorted, shuffled and merged; the reduce step produces the aggregated results as output data.
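The flow on this slide can be imitated in a few lines of plain Python. This is only a single-process sketch of the idea, not Hadoop code: the records (file format names) and the counting job are made up for illustration, and the three splits mirror the three input splits of the diagram.

```python
from collections import defaultdict
from itertools import chain

# Three input splits with three records each (hypothetical records).
input_splits = [
    ["pdf", "xml", "pdf"],
    ["tiff", "pdf", "xml"],
    ["xml", "xml", "tiff"],
]

def map_task(split):
    """Map: emit one (key, 1) pair per record of a split."""
    return [(record, 1) for record in split]

def shuffle_and_sort(mapped_outputs):
    """Shuffle/sort/merge: group all values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped_outputs):
        groups[key].append(value)
    return sorted(groups.items())

def reduce_task(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)

mapped = [map_task(split) for split in input_splits]  # one map task per split
result = dict(reduce_task(k, v) for k, v in shuffle_and_sort(mapped))
print(result)  # {'pdf': 3, 'tiff': 2, 'xml': 4}
```

On a real cluster the map tasks run in parallel on the nodes that hold the splits, and the grouping step moves data between nodes.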

E-ARK Integrated Prototype
Architecture & Implementation

AIP to DIP
SIP to AIP
Hadoop Distributed
File System
NAS
Working area
Search and Access
Lily Repository
DIP Delivery
Workers
Celery
Information Package processing &
Access Repository
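As a rough illustration of how information package processing might be handed to a pool of workers, the sketch below chains a SIP to AIP and an AIP to DIP conversion. Everything in it is an assumption made for illustration: the standard-library thread pool merely stands in for the Celery workers named on the slide, and the package dictionaries and function names are hypothetical, not the E-ARK package formats.

```python
from concurrent.futures import ThreadPoolExecutor

def sip_to_aip(sip):
    """Hypothetical conversion: wrap submitted content as an archival package."""
    return {"type": "AIP", "id": sip["id"], "files": list(sip["files"])}

def aip_to_dip(aip):
    """Hypothetical conversion: derive a dissemination package for access."""
    return {"type": "DIP", "id": aip["id"], "files": list(aip["files"])}

# Made-up submission packages.
sips = [
    {"type": "SIP", "id": f"pkg-{n}", "files": [f"record-{n}.xml"]}
    for n in range(3)
]

# The thread pool plays the role of the worker processes.
with ThreadPoolExecutor(max_workers=2) as workers:
    aips = list(workers.map(sip_to_aip, sips))
    dips = list(workers.map(aip_to_dip, aips))

print([dip["id"] for dip in dips])  # ['pkg-0', 'pkg-1', 'pkg-2']
```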

Ingest and Preservation
Access
E-ARK
SIP
SIP
Creation
Tools
Archival
records
Content and
Records
Management
Systems
SIP – AIP
Conversion
E-ARK
AIP
CMIS
Interface
Data
Mining
Interface
Digital preservation systems
AIP - DIP
Conversion
Scalable
Computation
E-ARK
DIP
Archival Search,
Access and
Display Tools
Content and
Records
Management
Systems
Data Mining
Showcase

Geographical/timeline search
Peripleo - PELAGIOS Project

Text mining: Text classification
Training
• Train classifier using annotated text corpus
• SVM – based on statistical features
Classification
• Scan for texts during ingest (or run MR after)
• Text category estimation
Search
• Add category as a searchable field to the Lily index
• Full-text search using Lily's Solr search interface
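The three steps above (training, classification, adding the category as a field) can be sketched without any external library. This is a toy under stated assumptions, not the project's classifier: it trains a linear SVM without a bias term by sub-gradient descent on the hinge loss, uses plain term counts as the statistical features, and the corpus, category names and record field names are all invented for illustration.

```python
from collections import Counter

def features(text):
    """Bag-of-words term counts, a simple statistical feature set."""
    return Counter(text.lower().split())

def score(weights, text):
    return sum(weights.get(t, 0.0) * c for t, c in features(text).items())

def train_linear_svm(samples, epochs=20, lr=0.1, lam=0.01):
    """samples: (text, label) pairs with label +1 or -1. Returns term weights."""
    weights = {}
    for _ in range(epochs):
        for text, label in samples:
            margin = label * score(weights, text)
            for term in weights:  # L2 regularisation: shrink all weights
                weights[term] *= 1.0 - lr * lam
            if margin < 1.0:  # hinge loss is active: move towards the label
                for term, count in features(text).items():
                    weights[term] = weights.get(term, 0.0) + lr * label * count
    return weights

def classify(weights, text, positive, negative):
    return positive if score(weights, text) >= 0 else negative

# Training: a tiny annotated corpus (hypothetical texts and categories).
corpus = [
    ("annual tax revenue report budget", +1),
    ("budget audit tax payments", +1),
    ("land parcel boundary survey map", -1),
    ("map parcel ownership land use", -1),
]
weights = train_linear_svm(corpus)

# Classification during ingest, then the category becomes a searchable field.
text = "tax budget"
record = {"id": "doc-1", "text": text,
          "category": classify(weights, text, "finance", "cadastre")}
print(record["category"])  # finance
```

In the prototype the resulting field would be indexed alongside the full text; the dictionary here only shows where the estimated category would be attached.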

OLAP (Online Analytical Processing)
• Database archiving
and re-use (SIARD2)
• Normalization -
OLAP/Oracle Data
Warehouse
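To give a feel for the kind of analysis an OLAP layer over an archived database supports, the following sketch aggregates a small fact table along two dimensions and then rolls it up to one. It uses an in-memory SQLite database purely as a stand-in; the table, columns and figures are invented, and nothing here reflects the SIARD2 format or the data warehouse used in the project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (year INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [
        (2014, "North", 100.0),
        (2014, "South", 50.0),
        (2015, "North", 70.0),
        (2015, "South", 30.0),
        (2015, "South", 20.0),
    ],
)

# Aggregate along two dimensions (year, region).
by_year_region = conn.execute(
    "SELECT year, region, SUM(amount) FROM claims "
    "GROUP BY year, region ORDER BY year, region"
).fetchall()

# Roll up to a single dimension (year).
by_year = conn.execute(
    "SELECT year, SUM(amount) FROM claims GROUP BY year ORDER BY year"
).fetchall()

print(by_year)  # [(2014, 150.0), (2015, 120.0)]
conn.close()
```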

Thank you!
• http://www.eark-project.eu
• https://github.com/eark-project

The Use of Big Data Techniques for Digital Archiving

1. The Use of Big Data Techniques for Digital Archiving Sven Schlarb, Austrian Institute of Technology Tuesday 15th March 2016, Cambridge

2. OUTLINE • E-ARK Project Overview • Technical Background • Integrated Prototype • Data Mining Use Cases

3. Project Overview

4. THE E-ARK PROJECT IS CO-FUNDED BY THE EUROPEAN COMMISSION UNDER THE ICT-PSP PROGRAMME www.eark-project.eu

5. Advisory Boards Archival • Archives of Emilia-Romagna, Italy • Directorate-General of the Book, of Archives & of Libraries, Portugal • EC Archives & Records Management • EC Historical Archives • German Federal Archives • National Archives of Bulgaria • National Archives of Finland • National Archives of France • National Archives of Sweden • National Archives of the Netherlands • Polish Data Archive • Queensland State Archives • Swiss Federal Archives • UK National Archives • UK Parliamentary Archives Commercial Technical • Arkivum • ARMA Europe • DigitalForever • Discovery Garden • Microsoft Research • Open Preservation Foundation • Open Text Initiative • Preservica • Versity Data Providers • Danish Agency for Digitisation • Estonian Ministry of Economic Affairs & Communication • Estonian Unemployment Insurance Fund • James Lappin, RM Consultant

6. Project mission • Improve access to the archived records of European Archives • Create guidelines and recommended practices • Cover relational databases, record management systems, and geographical data • Create open source implementation evaluated in several pilots

7. Outcomes Standardisation of available best- practices • Common terminology (Knowledge Center) • SIP, AIP and DIP format specifications • Pre-ingest, ingest and access workflows Open source tools • Scalable, modular, and reusable implementation of specifications • Individual deployments (Pilots) and an integrated reference implementation

8. Technical Background

9. Hadoop Cluster Task Trackers Data Nodes Job Tracker Name Node

10. Hadoop = MapReduce + HDFS Distributed processing (MapReduce) Distributed storage (HDFS) example: 2 x quad-core CPUs: 10 Map (parallelisation) 4 Reduce (aggregation) example: 4 x 1 TB hard disks (replication factor 3): approx. 1.33 TB

11. Map/Reduce in a nutshell. Diagram: the input data is divided into three input splits of three records each (records 1 to 9). Map tasks 1 to 3 each process one split; the map output is sorted, shuffled and merged; the reduce step produces the aggregated results as output data.

12. E-ARK Integrated Prototype Architecture & Implementation

13. Base technology stack

14. E-ARK Web “Integrated” Prototype?

15. AIP to DIP SIP to AIP Hadoop Distributed File System NAS Working area Search and Access Lily Repository DIP Delivery Workers Celery Information Package processing & Access Repository

16. Access Repository - Interfaces

17. Ingest and Preservation Access E-ARK SIP SIP Creation Tools Archival records Content and Records Management Systems SIP – AIP Conversion E-ARK AIP CMIS Interface Data Mining Interface Digital preservation systems AIP - DIP Conversion Scalable Computation E-ARK DIP Archival Search, Access and Display Tools Content and Records Management Systems Data Mining Showcase

21. E-ARK Data Mining

22. Geographical/timeline search Peripleo - PELAGIOS Project

23. Geographical/timeline search Peripleo - PELAGIOS Project

24. Text mining: Text classification Training • Train classifier using annotated text corpus • SVM – based on statistical features Classification • Scan for texts during ingest (or run MR after) • Text category estimation Search • Add category as a searchable field to the Lily index • Full-text search using Lily's Solr search interface

25. OLAP (Online Analytical Processing) • Database archiving and re-use (SIARD2) • Normalization - OLAP/Oracle Data Warehouse

26. Thank you! • http://www.eark-project.eu • https://github.com/eark-project

Editor's notes

Purpose is to assess contributions to and from the project. Open to interested parties. Meetings of these groups. Gather information and contribute to a knowledge base (maintained by the DLM Forum).
Technologies: Hadoop MapReduce, Solr, HDFS, Lily Repository, ESSArch Preservation Platform, E-ARK Web.
Vertical integration: [MapReduce] works atop [HDFS]; [Solr] indexes [Lily] records.
Horizontal integration: [MapReduce] is used to build the [Solr] index; [HDFS] is used to store [Lily] content; packages ingested via the [EPP] UI are searched and accessed via the [E-ARK Web] UI.
