The Use of Big Data Techniques for Digital Archiving
Sven Schlarb, Austrian Institute of Technology
Tuesday 15th March 2016, Cambridge
OUTLINE
• E-ARK Project Overview
• Technical Background
• Integrated Prototype
• Data Mining Use Cases
Project Overview
The E-ARK project is co-funded by the European Commission under the ICT-PSP Programme
www.eark-project.eu
Advisory Boards
Archival
• Archives of Emilia-Romagna, Italy
• Directorate-General of the Book, of Archives & of Libraries, Portugal
• EC Archives & Records Management
• EC Historical Archives
• German Federal Archives
• National Archives of Bulgaria
• National Archives of Finland
• National Archives of France
• National Archives of Sweden
• National Archives of the Netherlands
• Polish Data Archive
• Queensland State Archives
• Swiss Federal Archives
• UK National Archives
• UK Parliamentary Archives
Commercial
Technical
• Arkivum
• ARMA Europe
• DigitalForever
• Discovery Garden
• Microsoft Research
• Open Preservation Foundation
• Open Text Initiative
• Preservica
• Versity
Data Providers
• Danish Agency for Digitisation
• Estonian Ministry of Economic Affairs & Communication
• Estonian Unemployment Insurance Fund
• James Lappin, RM Consultant
Project mission
• Improve access to the archived records of European Archives
• Create guidelines and recommended practices
• Cover relational databases, record management systems, and geographical data
• Create open source implementation evaluated in several pilots
Outcomes
Standardisation of available best practices
• Common terminology (Knowledge Center)
• SIP, AIP and DIP format specifications
• Pre-ingest, ingest and access workflows
Open source tools
• Scalable, modular, and reusable implementation of specifications
• Individual deployments (Pilots) and an integrated reference implementation
Technical Background
Hadoop Cluster
Task Trackers
Data Nodes
Job Tracker
Name Node
Hadoop = MapReduce + HDFS
Distributed processing (MapReduce)
Distributed Storage (HDFS)
example: 2 x quad-core CPUs:
10 Map (parallelisation)
4 Reduce (aggregation)
example: 4 x 1 TB hard disks (replication factor 3):
approx. 1.33 TB
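The storage figure above follows from dividing raw disk capacity by the replication factor. A minimal sketch of that arithmetic, assuming the whole disk is given to HDFS and ignoring file system overhead (the function name is hypothetical):

```python
def usable_hdfs_capacity_tb(disks: int, disk_size_tb: float, replication: int) -> float:
    """Raw capacity divided by the replication factor (simplified estimate)."""
    return disks * disk_size_tb / replication


# Values taken from the slide: 4 disks of 1 TB each, replication factor 3.
capacity = usable_hdfs_capacity_tb(disks=4, disk_size_tb=1.0, replication=3)
print(f"approx. {capacity:.2f} TB usable")
```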
HADOOP
[Diagram: input data is divided into input splits 1 to 3 (records 1 to 9); map tasks 1 to 3 each process one split; the map output is sorted, shuffled, and merged; reduce writes the aggregated results as output data]
Map/Reduce in a nutshell
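The map, shuffle/sort/merge, and reduce stages can be sketched in plain Python. This is a single-process illustration only, not Hadoop code; the word-count job, the sample records, and all function names are assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import chain


def map_task(split):
    """Map: emit one (word, 1) pair for every word in every record of a split."""
    for record in split:
        for word in record.split():
            yield word.lower(), 1


def shuffle(pairs):
    """Shuffle/sort/merge: group all values by key and return keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_task(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)


def run_job(splits):
    """Run one map task per input split, then shuffle and reduce."""
    mapped = chain.from_iterable(map_task(split) for split in splits)
    return dict(reduce_task(key, values) for key, values in shuffle(mapped))


# Three input splits, as in the diagram; the records themselves are made up.
splits = [
    ["archive record", "digital archive"],
    ["record access"],
    ["digital record"],
]
result = run_job(splits)
print(result)
```

On a real cluster each map task would run on the node holding its split, and the shuffle would move data between nodes; here everything happens in memory.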
E-ARK Integrated Prototype
Architecture & Implementation
Base technology stack
E-ARK Web
“Integrated” Prototype?
SIP to AIP
AIP to DIP
Hadoop Distributed
File System
NAS
Working area
Search and Access
Lily Repository
DIP Delivery
Workers
Celery
Information Package processing &
Access Repository
Access Repository - Interfaces
Ingest and Preservation
Access
E-ARK
SIP
SIP
Creation
Tools
Archival
records
Content and
Records
Management
Systems
SIP – AIP
Conversion
E-ARK
AIP
CMIS
Interface
Data
Mining
Interface
Digital preservation systems
AIP - DIP
Conversion
Scalable
Computation
E-ARK
DIP
Archival Search,
Access and
Display Tools
Content and
Records
Management
Systems
Data Mining
Showcase
E-ARK Data Mining
Geographical/timeline search
Peripleo - PELAGIOS Project
Text mining: Text classification
Training
• Train classifier using annotated text corpus
• SVM – based on statistical features
Classification
• Scan for texts during ingest (or run a MapReduce job afterwards)
• Text category estimation
Search
• Add category as a searchable field to the Lily index
• Full-text search using Lily's Solr search interface
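The training and classification steps can be sketched as follows. The slides do not say which SVM implementation or features the project uses, so this is an assumption throughout: a tiny linear SVM trained by sub-gradient descent on the hinge loss, word counts as the statistical features, a made-up two-category corpus, and a hypothetical `category` field standing in for the searchable index field.

```python
from collections import Counter

# Hypothetical annotated corpus: +1 stands for "finance", -1 for "health".
CORPUS = [
    ("budget tax revenue report", 1),
    ("tax audit budget", 1),
    ("hospital patient care report", -1),
    ("patient vaccine hospital", -1),
]
VOCAB = sorted({word for text, _ in CORPUS for word in text.split()})


def features(text):
    """Bag-of-words counts over the training vocabulary; unknown words are ignored."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]


def train(corpus, epochs=20, lr=0.1, lam=0.01):
    """Linear SVM: sub-gradient descent on the L2-regularised hinge loss."""
    w, b = [0.0] * len(VOCAB), 0.0
    for _ in range(epochs):
        for text, y in corpus:
            x = features(text)
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1.0 - lr * lam) for wi in w]  # regularisation shrink
            if margin < 1.0:  # example is inside the margin: move towards it
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def classify(text, w, b):
    """Estimate the category from the sign of the decision function."""
    score = sum(wi * xi for wi, xi in zip(w, features(text))) + b
    return "finance" if score >= 0 else "health"


w, b = train(CORPUS)

# During ingest, the estimated category would be stored as an extra field.
record = {"text": "tax budget audit"}
record["category"] = classify(record["text"], w, b)
print(record)
```

A production setup would train on a much larger annotated corpus and would likely use an established SVM library rather than this hand-written loop.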
OLAP (Online Analytical Processing)
• Database archiving and re-use (SIARD2)
• Normalization - OLAP/Oracle Data Warehouse
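As a rough illustration of the kind of analytical query OLAP supports, the sketch below rolls up a small fact table along two dimensions. It uses an in-memory SQLite database rather than an Oracle data warehouse, and the table, columns, and values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical fact table as it might look after loading an archived database.
conn.execute("CREATE TABLE claims (year INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [
        (2014, "North", 100.0),
        (2014, "South", 50.0),
        (2015, "North", 70.0),
        (2015, "South", 30.0),
    ],
)

# Roll-up along the time dimension: total amount per year.
by_year = conn.execute(
    "SELECT year, SUM(amount) FROM claims GROUP BY year ORDER BY year"
).fetchall()

# Roll-up along the region dimension: total amount per region.
by_region = conn.execute(
    "SELECT region, SUM(amount) FROM claims GROUP BY region ORDER BY region"
).fetchall()

print(by_year)
print(by_region)
conn.close()
```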
Thank you!
• http://www.eark-project.eu
• https://github.com/eark-project

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

The Use of Big Data Techniques for Digital Archiving

  • 1. The Use of Big Data Techniques for Digital Archiving • Sven Schlarb, Austrian Institute of Technology • Tuesday 15th March 2016, Cambridge
  • 2. OUTLINE • E-ARK Project Overview • Technical Background • Integrated Prototype • Data Mining Use Cases
  • 4. THE E-ARK PROJECT IS CO-FUNDED BY THE EUROPEAN COMMISSION UNDER THE ICT-PSP PROGRAMME www.eark-project.eu
  • 5. Advisory Boards Archival • Archives of Emilia-Romagna, Italy • Directorate-General of the Book, of Archives & of Libraries, Portugal • EC Archives & Records Management • EC Historical Archives • German Federal Archives • National Archives of Bulgaria • National Archives of Finland • National Archives of France • National Archives of Sweden • National Archives of the Netherlands • Polish Data Archive • Queensland State Archives • Swiss Federal Archives • UK National Archives • UK Parliamentary Archives Commercial Technical • Arkivum • ARMA Europe • DigitalForever • Discovery Garden • Microsoft Research • Open Preservation Foundation • Open Text Initiative • Preservica • Versity Data Providers • Danish Agency for Digitisation • Estonian Ministry of Economic Affairs & Communication • Estonian Unemployment Insurance Fund • James Lappin, RM Consultant
  • 6. Project mission • Improve access to the archived records of European Archives • Create guidelines and recommended practices • Cover relational databases, record management systems, and geographical data • Create open source implementation evaluated in several pilots
  • 7. Outcomes Standardisation of available best practices • Common terminology (Knowledge Center) • SIP, AIP and DIP format specifications • Pre-ingest, ingest and access workflows Open source tools • Scalable, modular, and reusable implementation of specifications • Individual deployments (Pilots) and an integrated reference implementation
  • 9. Hadoop Cluster Task Trackers Data Nodes Job Tracker Name Node
  • 10. Hadoop = MapReduce + HDFS • Distributed processing (MapReduce), example: 2 x quad-core CPUs: 10 Map (parallelisation), 4 Reduce (aggregation) • Distributed storage (HDFS), example: 4 x 1 TB hard disks (replication factor 3): approx. 1.33 TB
  • 11. Map/Reduce in a nutshell (diagram) • Input data is divided into input splits 1 to 3 of three records each (records 1 to 9) • Each split is processed by one map task (tasks 1 to 3) • Map output passes through sort, shuffle and merge • Reduce writes the aggregated results as output data
  • 15. Information Package processing & Access Repository (diagram) • SIP to AIP • AIP to DIP • Celery workers • NAS working area • Hadoop Distributed File System • Lily Repository • Search and Access • DIP Delivery
  • 16. Access Repository - Interfaces
  • 17. Ingest and Preservation Access E-ARK SIP SIP Creation Tools Archival records Content and Records Management Systems SIP – AIP Conversion E-ARK AIP CMIS Interface Data Mining Interface Digital preservation systems AIP – DIP Conversion Scalable Computation E-ARK DIP Archival Search, Access and Display Tools Content and Records Management Systems Data Mining Showcase
  • 24. Text mining: Text classification Training • Train classifier using annotated text corpus • SVM – based on statistical features Classification • Scan for texts during ingest (or run MR after) • Text category estimation Search • Add category as a searchable field to Lily index • Full-text search using Lily's SolR search interface
  • 25. OLAP (Online Analytical Processing) • Database archiving and re-use (SIARD2) • Normalization - OLAP/Oracle Data Warehouse
  • 26. Thank you! • http://www.eark-project.eu • https://github.com/eark-project
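
The Map/Reduce flow on slide 11 and the storage example on slide 10 can be illustrated with a minimal single-process Python sketch. It is not the E-ARK implementation and does not use Hadoop: the word-count job, the sample records, and all function names (`split_input`, `map_task`, `shuffle`, `reduce_task`, `run_job`, `usable_capacity_tb`) are assumptions chosen for illustration. The capacity helper simply divides raw disk space by the replication factor and ignores any file system overhead.

```python
from collections import defaultdict
from itertools import chain


def usable_capacity_tb(disks, disk_size_tb, replication_factor):
    """Raw capacity divided by the number of copies kept of each block."""
    return disks * disk_size_tb / replication_factor


def split_input(records, split_size):
    """Divide the input records into fixed-size input splits."""
    return [records[i:i + split_size] for i in range(0, len(records), split_size)]


def map_task(split):
    """Map: emit a (word, 1) pair for every word in every record of one split."""
    return [(word.lower(), 1) for record in split for word in record.split()]


def shuffle(mapped):
    """Sort/shuffle/merge: group all emitted values by key, in key order."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped):
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_task(key, values):
    """Reduce: aggregate all values of one key into a single result."""
    return key, sum(values)


def run_job(records, split_size=3):
    """Run map, shuffle and reduce in sequence (a cluster would run map tasks in parallel)."""
    splits = split_input(records, split_size)
    mapped = [map_task(split) for split in splits]
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input: nine short records, giving three splits of three records.
    records = [
        "archive record", "record ingest", "ingest archive",
        "access record", "archive access", "ingest record",
        "record record", "archive archive", "access ingest",
    ]
    print(run_job(records))
    # Slide 10 example: 4 x 1 TB disks with replication factor 3.
    print(round(usable_capacity_tb(4, 1.0, 3), 2))
```

With nine records and a split size of three, `split_input` yields three splits, matching the three map tasks in the diagram; the reduce step then produces one aggregated result per key.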

Editor's notes

  1. Purpose is to assess contributions to and from the project; open to interested parties; meetings of these groups; gather information and contribute to a knowledge base (maintained by the DLM Forum).
  2. Technologies: Hadoop MapReduce, SolR, HDFS, Lily Repository, ESSArch Preservation Platform, E-ARK Web. Vertical integration: [MapReduce] works atop [HDFS], [SolR] indexes [Lily] records. Horizontal integration: [MapReduce] is used to build the [SolR] index, [HDFS] is used to store [Lily] content, and packages ingested via the [EPP] UI are searched and accessed via the [E-ARK WEB] UI.