SlideShare une entreprise Scribd logo
1  sur  30
Sven Schlarb
Österreichische Nationalbibliothek
LIBER Satellite Event: APARSEN & SCAPE Workshop
21 May 2014, Austrian National Library, Vienna
Application scenarios of the SCAPE project at the Austrian
National Library
• Examples of Big Data in memory institutions
• What are the SCAPE Testbeds?
• Motivation for the Austrian National Library
• Hadoop in a nutshell
• SCAPE Platform setup at the Austrian National Library
• Selected SCAPE tools
• Application scenarios
• Web Archiving
• Austrian Books Online
Overview
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
• Google Books Project: 30 Million digital books
• http://www.nybooks.com/articles/archives/2013/apr/25/national-digital-public-library-launched
• Europeana: Metadata about over 24 million objects
• Europeana annual report and accounts 2012, Europeana Foundation, April 2013
• Hathi Trust: 10 million volumes (over 5,6 million titles)
comprising over 3,7 billion book page images
• http://www.hathitrust.org/statistics_info
• Internet Archive: 364 billion pages, about 10 Petabyte.
• http://archive.org und http://archive.org/web/petabox.php
Books, Journals, Newspapers, Websites. Big data?
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
Takeup
•Stakeholders and Communities
•Dissemination
•Training Activities
•Sustainability
Platform
•Automation
•Workflows
•Parallelization
•Virtualization
SCAPE Project Overview
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
• Good:
• Storing structured data
• Expressive query language
• ACID, type safety
• But:
• SQL Joins not efficient at scale
• ÖNB 2011: Failed creating a complete web-archive index
using single-instance-MySQL (write performance!)
• Solution?
• Scaling vertically  Bigger servers  hardware costs!
• Scaling horizontally  Sharding  maintenance costs!
Pushing the boundaries of RDBMs (e.g. MySQL)
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
• Hadoop means a cost-
advantage because
• It usually runs on relatively
inexpensive (commodity)
hardware
• No binding to specific
vendors
• Open-Source-Software
Comparison of storage costs
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
Quelle: BITKOM Leitfaden Big-Data-Technologien-Wissen
für Entscheider 2014, S. 39
• Required to move data
• From NAS to Server
• To Cloud
• Multi-Terabyte senarios?
Dealing with large amounts of data
• Immediate processing
• Unified storage and
processing capabilities
• Distributed I/O
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
• When dealing with large data sets it is usually easier to
bring the processor to the data than the data to the
processor
• Fine-granular parallelisation: All processing cores of
the cluster are used as processors
• Designed for failure. In large clusters hardware failure
is the norm rather than the exception
• Redundancy : Redundant storage of data blocks
(default: 3 copies)
• Data locality: Free nodes with direct access to data do
the processing
Some Basic hadoop assumptions
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
What is Hadoop (physically)?
Distributed processing (MapReduce)
Distributed Storage (HDFS)
Hadoop = MapReduce + HDFS
2 x Quad-Core-CPUs:
10 Map (parallelisation)
4 Reduce (aggregation)
4 x 1 TB hard disks with redundancy 3:
1,33 TB effective
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
Configuration per CPU
Configuration of one Quad-Core-CPU (= 1 node)
4 physical cores
8 hyperthreading-cores (System „sees“ 8 cores)
OS
Map
Map
Map
Map
Map
Reduce
Reduce
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
Experimental Cluster
Job TrackerTask Trackers
Data Nodes
Name Node
CPU: 1 x 2.53GHz Quadcore CPU (8 HyperThreading)
RAM: 16GB
DISK: 2 x 1TB DISKs configured as RAID0 (Performance) – 2 TB effective
• Of 16 HT cores: 5 for Map; 2 for Reduce; 1 für Betriebssystem.
 25 processing cores for Map tasks
 10 processing cores for Reduce tasks
CPU: 2 x 2.40GHz Quadcore CPU (16 HyperThreading cores)
RAM: 24GB
DISK: 3 x 1TB DISKs configured as RAID5 (Redundanz) – 2 TB effective
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
Sort
Shuffle
Merge
Input data
Input split 1
Record 1
Record 2
Record 3
Input split 2
Record 4
Record 5
Record 6
Input split 3
Record 7
Record 8
Record 9
What is Hadoop (conceptually)?
Task1
Map Reduce
Task 2
Task 3
Output data
Aggregated
Result
Aggregated
Result
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
Platform instance architecture at the Austrian National Library
• Access via REST API
• Workflow engine for complex
jobs
• Hive as the frontend for
analytic queries
• MapReduce/Pig for
Extraction, Transform, and
Load (ETL)
• „Small“ objects in HDFS or
HBase
• „Large “ Digital objects stored
on NetApp Filer
Taverna Workflow engine
REST API
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
Scalable
Command
Line
Processing
ToMaR
Large-
scale
content
profiling
C3PO
JPEG2000
file format
validation
Jpylyzer
Duplicate
image
detection
Matchbox
Selected SCAPE tools in various application scenarios
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
• Web Archiving
• Web Archive Mime Type Identification
• Characterisation of web archive data
• Austrian Books Online
• Scenario 2: Image File Format Migration
• Scenario 3: Comparison of Book Derivatives
• Scenario 4: MapReduce in Digitised Book Quality Assurance
Overview about application scenarios
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
Webarchiving
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
• Storage: ca. 45TB
• ca. 1.7 Billion Objekts
• Domain harvesting
• Entire top-level-domain .at
every 2 years
• Selective harvesting
• Important websites that
change regularly
• Event harvesting
• Special occasions and
events (e.g. elections)
File format identification in web archives
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
(W)ARC Container
JPG
GIF
HTM
HTM
MID
(W)ARC InputFormat
(W)ARC RecordReader
Basiert auf
HERITRIX
Web Crawler
MapReduce
JPG
Apache Tika
detect MIME
Map
Reduce
image/jpg
image/jpg 1
image/gif 1
text/html 2
audio/midi 1
File format identification in web archives
Software-Integration Durchsatz(GB/min)
TIKA detector API in Map Phase 6,17 GB/min
FILE als Kommandozeilen-Applikation mit MapReduce 1,70 GB/min
TIKA JAR als Kommandozeilen-Applikation mit MapReduce 0,01 GB/min
Datenmenge Anzahl der ARC-Dateien Durchsatz(GB/min)
1 GB 10 x 100 MB 1,57 GB/min
2 GB 20 x 100 MB 2,5 GB/min
10 GB 100 x 100 MB 3,06 GB/min
20 GB 200 x 100 MB 3,40 GB/min
100 GB 1000 x 100 MB 3,71 GB/min
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
Characterisation of web archive data
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
Characterisation of web archive data
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
Characterisation of web archive data
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
• Public private partnership with Google
• Only public domain
• Objective to scan ~ 600.000 Volumes
• ~ 200 Mio. pages
• ~ 70 project team members
• 20+ in core team
• ~ 200K physical volumes scanned so far
• ~ 60 Mio pages
Austrian Books Online
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
ADOCO (Austrian Books Online Download & Control)
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
https://confluence.ucop.edu/display/Curation/PairTree
Google
Public Private Partnership
ADOCO
• TIFF to JPEG2000 migration
• Objective: Reduce storage costs by
reducing the size of the images
• JPEG2000 to TIFF migration
• Objective: Mitigation of the
JPEG2000 file format obsolescense
risk
• Different preservation tool
categories:
• Validation
• Migration
• Quality assurance
Quality assured image file format migration
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
Comparison of book derivatives
• Compare different versions of the same book
• Images come from different scanning sources
• Images have been manipulated (cropped, rotated)
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
• 60.000 books, ~ 24 Million pages
• Using Taverna‘s „Tool service“ (remote ssh execution)
• Orchestration of different types of hadoop jobs
• Hadoop-Streaming-API
• Hadoop Map/Reduce
• Hive
• Workflow available on myExperiment:
http://www.myexperiment.org/workflows/3105
• See Blogpost:
http://www.openplanetsfoundation.org/blogs/2012-08-07-big-
data-processing-chaining-hadoop-jobs-using-taverna
Using MapReduce for Quality Assurance
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
Using MapReduce for Quality Assurance
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
Bildbreite
Blockbreite
Assumption: „Significant“ difference between average blockwidth and image width
is an indicator for possible text loss due to cropping error.
Cropping errorCorrect cropping
Using MapReduce for Quality Assurance
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
• Create input text files w.
file paths (JP2 & HTML)
• Read image metadata
using Exiftool (Hadoop
Streaming API)
• Create sequence file
containing all HTML files
• Calculate average block
width using MapReduce
• Load data in Hive tables
• Execute SQL test query
• Possibility for libraries to build cost-efficient solutions
for storing large data collections
• HDFS as storage master or staging area?
• Local cluster vs. cloud?
• Apache Hadoop offers a stable core for building a large
scale processing platform; ready to be used in
production
• Carefully select additional components from the Apache
Hadoop Ecosystem (HBase, Hive, Pig, Oozie, Yarn,
Ambari, etc.) that fit your needs
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
Summary
These slides on Slideshare:
• http://de.slideshare.net/SvenSchlarb/
application-scenarios-of-the-scape-project-at-the-austrian-national-library
Further information
• Project website: www.scape-project.eu
• Github repository: www.github.com/openplanets
• Project Wiki: www.wiki.opf-labs.org/display/SP/Home
SCAPE tools mentioned
• ToMaR: http://openplanets.github.io/ToMaR/#
• Jpylyzer: http://www.openplanetsfoundation.org/software/jpylyzer
• Matchbox: https://github.com/openplanets/scape/tree/master/pc-qa-
matchbox
• C3PO: http://ifs.tuwien.ac.at/imp/c3po
Thank you! Questions?
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).

Contenu connexe

Tendances

RIPEstat Public demo January 24, 2012
RIPEstat Public demo January 24, 2012RIPEstat Public demo January 24, 2012
RIPEstat Public demo January 24, 2012RIPE NCC
 
Design phase kick-off event and Ceremony
Design phase kick-off event and CeremonyDesign phase kick-off event and Ceremony
Design phase kick-off event and CeremonyArchiver
 
20141103 cern open_stack_paris_v3
20141103 cern open_stack_paris_v320141103 cern open_stack_paris_v3
20141103 cern open_stack_paris_v3Tim Bell
 
RIPE Atlas and IXPs "Stitchin' it up"
RIPE Atlas and IXPs "Stitchin' it up"RIPE Atlas and IXPs "Stitchin' it up"
RIPE Atlas and IXPs "Stitchin' it up"RIPE NCC
 
The RIPE NCC, Internet Measurements and IXPs
The RIPE NCC, Internet Measurements and IXPsThe RIPE NCC, Internet Measurements and IXPs
The RIPE NCC, Internet Measurements and IXPsRIPE NCC
 
Case Study: Developing and Integrating a Cloud Sourcing Strategy at CERN
Case Study: Developing and Integrating a Cloud Sourcing Strategy at CERNCase Study: Developing and Integrating a Cloud Sourcing Strategy at CERN
Case Study: Developing and Integrating a Cloud Sourcing Strategy at CERNHelix Nebula The Science Cloud
 
Inspire Compliant Weather Data
Inspire Compliant Weather DataInspire Compliant Weather Data
Inspire Compliant Weather DataRoope Tervo
 
Countries, IXPs and RIPE Atlas
Countries, IXPs and RIPE AtlasCountries, IXPs and RIPE Atlas
Countries, IXPs and RIPE AtlasRIPE NCC
 
Visualisation of Big Imaging Data
Visualisation of Big Imaging DataVisualisation of Big Imaging Data
Visualisation of Big Imaging DataSlava Kitaeff, PhD
 
Dragon and cinder v brownbag
Dragon and cinder v brownbagDragon and cinder v brownbag
Dragon and cinder v brownbagAlon Marx
 
Prototype Phase Kick-off Event and Ceremony
Prototype Phase Kick-off Event and CeremonyPrototype Phase Kick-off Event and Ceremony
Prototype Phase Kick-off Event and CeremonyArchiver
 
Mining Historical Data for DBpedia via Temporal Tagging of Wikipedia Infoboxes
Mining Historical Data for DBpedia via Temporal Tagging of Wikipedia InfoboxesMining Historical Data for DBpedia via Temporal Tagging of Wikipedia Infoboxes
Mining Historical Data for DBpedia via Temporal Tagging of Wikipedia InfoboxesVolha Bryl
 
Opening the Path to Technical Excellence
Opening the Path to Technical ExcellenceOpening the Path to Technical Excellence
Opening the Path to Technical ExcellenceNETWAYS
 
Pic archiver stansted
Pic archiver stanstedPic archiver stansted
Pic archiver stanstedArchiver
 

Tendances (20)

RIPEstat Public demo January 24, 2012
RIPEstat Public demo January 24, 2012RIPEstat Public demo January 24, 2012
RIPEstat Public demo January 24, 2012
 
Design phase kick-off event and Ceremony
Design phase kick-off event and CeremonyDesign phase kick-off event and Ceremony
Design phase kick-off event and Ceremony
 
20141103 cern open_stack_paris_v3
20141103 cern open_stack_paris_v320141103 cern open_stack_paris_v3
20141103 cern open_stack_paris_v3
 
Hybrid Cloud for CERN
Hybrid Cloud for CERNHybrid Cloud for CERN
Hybrid Cloud for CERN
 
RIPE Atlas and IXPs "Stitchin' it up"
RIPE Atlas and IXPs "Stitchin' it up"RIPE Atlas and IXPs "Stitchin' it up"
RIPE Atlas and IXPs "Stitchin' it up"
 
The RIPE NCC, Internet Measurements and IXPs
The RIPE NCC, Internet Measurements and IXPsThe RIPE NCC, Internet Measurements and IXPs
The RIPE NCC, Internet Measurements and IXPs
 
Census Update Webinar: Census Geography Tools
Census Update Webinar: Census Geography ToolsCensus Update Webinar: Census Geography Tools
Census Update Webinar: Census Geography Tools
 
Case Study: Developing and Integrating a Cloud Sourcing Strategy at CERN
Case Study: Developing and Integrating a Cloud Sourcing Strategy at CERNCase Study: Developing and Integrating a Cloud Sourcing Strategy at CERN
Case Study: Developing and Integrating a Cloud Sourcing Strategy at CERN
 
Inspire Compliant Weather Data
Inspire Compliant Weather DataInspire Compliant Weather Data
Inspire Compliant Weather Data
 
HDF and netCDF Data Support in ArcGIS
HDF and netCDF Data Support in ArcGISHDF and netCDF Data Support in ArcGIS
HDF and netCDF Data Support in ArcGIS
 
Countries, IXPs and RIPE Atlas
Countries, IXPs and RIPE AtlasCountries, IXPs and RIPE Atlas
Countries, IXPs and RIPE Atlas
 
HDF Kita Lab: JupyterLab + HDF Service
HDF Kita Lab: JupyterLab + HDF ServiceHDF Kita Lab: JupyterLab + HDF Service
HDF Kita Lab: JupyterLab + HDF Service
 
Visualisation of Big Imaging Data
Visualisation of Big Imaging DataVisualisation of Big Imaging Data
Visualisation of Big Imaging Data
 
Dragon and cinder v brownbag
Dragon and cinder v brownbagDragon and cinder v brownbag
Dragon and cinder v brownbag
 
Prototype Phase Kick-off Event and Ceremony
Prototype Phase Kick-off Event and CeremonyPrototype Phase Kick-off Event and Ceremony
Prototype Phase Kick-off Event and Ceremony
 
Mining Historical Data for DBpedia via Temporal Tagging of Wikipedia Infoboxes
Mining Historical Data for DBpedia via Temporal Tagging of Wikipedia InfoboxesMining Historical Data for DBpedia via Temporal Tagging of Wikipedia Infoboxes
Mining Historical Data for DBpedia via Temporal Tagging of Wikipedia Infoboxes
 
HDF & HDF-EOS Data & Support at NSIDC
HDF & HDF-EOS Data & Support at NSIDCHDF & HDF-EOS Data & Support at NSIDC
HDF & HDF-EOS Data & Support at NSIDC
 
Opening the Path to Technical Excellence
Opening the Path to Technical ExcellenceOpening the Path to Technical Excellence
Opening the Path to Technical Excellence
 
Pic archiver stansted
Pic archiver stanstedPic archiver stansted
Pic archiver stansted
 
Status of HDF-EOS, Related Software and Tools
 Status of HDF-EOS, Related Software and Tools Status of HDF-EOS, Related Software and Tools
Status of HDF-EOS, Related Software and Tools
 

En vedette

SCAPE Skalierbare Langzeitarchivierung
SCAPE Skalierbare LangzeitarchivierungSCAPE Skalierbare Langzeitarchivierung
SCAPE Skalierbare LangzeitarchivierungSven Schlarb
 
Capitulo ix.relación de la mente y el cerebro
Capitulo ix.relación de la mente y el cerebroCapitulo ix.relación de la mente y el cerebro
Capitulo ix.relación de la mente y el cerebroFrancisco Xavier
 
CISSP Week 14
CISSP Week 14CISSP Week 14
CISSP Week 14jemtallon
 
Моя перша презентація - 5 клас
Моя перша презентація - 5 класМоя перша презентація - 5 клас
Моя перша презентація - 5 класIrina Genih
 
2012 october-1-boe-child-abuse
2012 october-1-boe-child-abuse2012 october-1-boe-child-abuse
2012 october-1-boe-child-abuseLadystellas
 
E-ARK-iPRES2016-Bern-October-2016
E-ARK-iPRES2016-Bern-October-2016E-ARK-iPRES2016-Bern-October-2016
E-ARK-iPRES2016-Bern-October-2016Sven Schlarb
 
quy_trinh_ky_thuat_cay_cao_su_2_5556
quy_trinh_ky_thuat_cay_cao_su_2_5556quy_trinh_ky_thuat_cay_cao_su_2_5556
quy_trinh_ky_thuat_cay_cao_su_2_5556ma ga ka lom
 
CISSP Week 9
CISSP Week 9CISSP Week 9
CISSP Week 9jemtallon
 
Hemophilia
HemophiliaHemophilia
Hemophiliagisellg
 
SCAPE Presentation at the Elag2013 conference in Gent/Belgium
SCAPE Presentation at the Elag2013 conference in Gent/BelgiumSCAPE Presentation at the Elag2013 conference in Gent/Belgium
SCAPE Presentation at the Elag2013 conference in Gent/BelgiumSven Schlarb
 

En vedette (15)

SCAPE Skalierbare Langzeitarchivierung
SCAPE Skalierbare LangzeitarchivierungSCAPE Skalierbare Langzeitarchivierung
SCAPE Skalierbare Langzeitarchivierung
 
Capitulo ix.relación de la mente y el cerebro
Capitulo ix.relación de la mente y el cerebroCapitulo ix.relación de la mente y el cerebro
Capitulo ix.relación de la mente y el cerebro
 
London winter2013 peopleto_peopleambassadorprogram
London winter2013 peopleto_peopleambassadorprogramLondon winter2013 peopleto_peopleambassadorprogram
London winter2013 peopleto_peopleambassadorprogram
 
CISSP Week 14
CISSP Week 14CISSP Week 14
CISSP Week 14
 
Моя перша презентація - 5 клас
Моя перша презентація - 5 класМоя перша презентація - 5 клас
Моя перша презентація - 5 клас
 
Memories photos
Memories photosMemories photos
Memories photos
 
3d manuu1
3d manuu13d manuu1
3d manuu1
 
2012 october-1-boe-child-abuse
2012 october-1-boe-child-abuse2012 october-1-boe-child-abuse
2012 october-1-boe-child-abuse
 
E-ARK-iPRES2016-Bern-October-2016
E-ARK-iPRES2016-Bern-October-2016E-ARK-iPRES2016-Bern-October-2016
E-ARK-iPRES2016-Bern-October-2016
 
quy_trinh_ky_thuat_cay_cao_su_2_5556
quy_trinh_ky_thuat_cay_cao_su_2_5556quy_trinh_ky_thuat_cay_cao_su_2_5556
quy_trinh_ky_thuat_cay_cao_su_2_5556
 
Proyecto
ProyectoProyecto
Proyecto
 
CISSP Week 9
CISSP Week 9CISSP Week 9
CISSP Week 9
 
iCademy prezentace
iCademy prezentaceiCademy prezentace
iCademy prezentace
 
Hemophilia
HemophiliaHemophilia
Hemophilia
 
SCAPE Presentation at the Elag2013 conference in Gent/Belgium
SCAPE Presentation at the Elag2013 conference in Gent/BelgiumSCAPE Presentation at the Elag2013 conference in Gent/Belgium
SCAPE Presentation at the Elag2013 conference in Gent/Belgium
 

Similaire à Application scenarios of the SCAPE project at the Austrian National Library

SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...SCAPE Project
 
Scape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation EnvironmentsScape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation EnvironmentsSCAPE Project
 
SCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs AvailableSCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs AvailableSCAPE Project
 
SCAPE general presentation
SCAPE general presentationSCAPE general presentation
SCAPE general presentationSCAPE Project
 
What is Hadoop?
What is Hadoop?What is Hadoop?
What is Hadoop?cneudecker
 
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositoriesSCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositoriesSCAPE Project
 
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...SCAPE Project
 
SCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation EnvironmentsSCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation EnvironmentsSCAPE Project
 
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...SCAPE Project
 
Preservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, AarhusPreservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, AarhusSCAPE Project
 
Scalable Preservation Workflows
Scalable Preservation WorkflowsScalable Preservation Workflows
Scalable Preservation WorkflowsSCAPE Project
 
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014SCAPE Project
 
Content profiling and C3PO
Content profiling and C3POContent profiling and C3PO
Content profiling and C3POSCAPE Project
 
Automatic Preservation Watch
Automatic Preservation WatchAutomatic Preservation Watch
Automatic Preservation WatchSCAPE Project
 
Automatic Preservation Watch Using Information Extraction on the Web
Automatic Preservation Watch Using Information Extraction on the WebAutomatic Preservation Watch Using Information Extraction on the Web
Automatic Preservation Watch Using Information Extraction on the WebLuis Faria
 
IMPACT at OCR Summit
IMPACT at OCR SummitIMPACT at OCR Summit
IMPACT at OCR Summitcneudecker
 
SCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with NaniteSCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with NaniteSCAPE Project
 
2019 05-21 egi and eosc - final
2019 05-21 egi and eosc - final2019 05-21 egi and eosc - final
2019 05-21 egi and eosc - finalEOSC-hub project
 
An image based approach for content analysis in document collections
An image based approach for content analysis in document collectionsAn image based approach for content analysis in document collections
An image based approach for content analysis in document collectionsSCAPE Project
 
EOSC-hub and OpenAIRE Advance webinar - introduction
EOSC-hub and OpenAIRE Advance webinar - introductionEOSC-hub and OpenAIRE Advance webinar - introduction
EOSC-hub and OpenAIRE Advance webinar - introductionOpenAIRE
 

Similaire à Application scenarios of the SCAPE project at the Austrian National Library (20)

SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
 
Scape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation EnvironmentsScape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation Environments
 
SCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs AvailableSCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs Available
 
SCAPE general presentation
SCAPE general presentationSCAPE general presentation
SCAPE general presentation
 
What is Hadoop?
What is Hadoop?What is Hadoop?
What is Hadoop?
 
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositoriesSCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
 
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
 
SCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation EnvironmentsSCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation Environments
 
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
 
Preservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, AarhusPreservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, Aarhus
 
Scalable Preservation Workflows
Scalable Preservation WorkflowsScalable Preservation Workflows
Scalable Preservation Workflows
 
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
 
Content profiling and C3PO
Content profiling and C3POContent profiling and C3PO
Content profiling and C3PO
 
Automatic Preservation Watch
Automatic Preservation WatchAutomatic Preservation Watch
Automatic Preservation Watch
 
Automatic Preservation Watch Using Information Extraction on the Web
Automatic Preservation Watch Using Information Extraction on the WebAutomatic Preservation Watch Using Information Extraction on the Web
Automatic Preservation Watch Using Information Extraction on the Web
 
IMPACT at OCR Summit
IMPACT at OCR SummitIMPACT at OCR Summit
IMPACT at OCR Summit
 
SCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with NaniteSCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with Nanite
 
2019 05-21 egi and eosc - final
2019 05-21 egi and eosc - final2019 05-21 egi and eosc - final
2019 05-21 egi and eosc - final
 
An image based approach for content analysis in document collections
An image based approach for content analysis in document collectionsAn image based approach for content analysis in document collections
An image based approach for content analysis in document collections
 
EOSC-hub and OpenAIRE Advance webinar - introduction
EOSC-hub and OpenAIRE Advance webinar - introductionEOSC-hub and OpenAIRE Advance webinar - introduction
EOSC-hub and OpenAIRE Advance webinar - introduction
 

Dernier

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 

Dernier (20)

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 

Application scenarios of the SCAPE project at the Austrian National Library

  • 1. Sven Schlarb Österreichische Nationalbibliothek LIBER Satellite Event: APARSEN & SCAPE Workshop 21 May 2014, Austrian National Library, Vienna Application scenarios of the SCAPE project at the Austrian National Library
  • 2. • Examples of Big Data in memory institutions • What are the SCAPE Testbeds? • Motivation for the Austrian National Library • Hadoop in a nutshell • SCAPE Platform setup at the Austrian National Library • Selected SCAPE tools • Application scenarios • Web Archiving • Austrian Books Online Overview This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
  • 3. • Google Books Project: 30 Million digital books • http://www.nybooks.com/articles/archives/2013/apr/25/national-digital-public-library-launched • Europeana: Metadata about over 24 million objects • Europeana annual report and accounts 2012, Europeana Foundation, April 2013 • Hathi Trust: 10 million volumes (over 5,6 million titles) comprising over 3,7 billion book page images • http://www.hathitrust.org/statistics_info • Internet Archive: 364 billion pages, about 10 Petabyte. • http://archive.org und http://archive.org/web/petabox.php Books, Journals, Newspapers, Websites. Big data? This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
  • 4. Takeup •Stakeholders and Communities •Dissemination •Training Activities •Sustainability Platform •Automation •Workflows •Parallelization •Virtualization SCAPE Project Overview This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
  • 5. • Good: • Storing structured data • Expressive query language • ACID, type safety • But: • SQL Joins not efficient at scale • ÖNB 2011: Failed creating a complete web-archive index using single-instance-MySQL (write performance!) • Solution? • Scaling vertically  Bigger servers  hardware costs! • Scaling horizontally  Sharding  maintenance costs! Pushing the boundaries of RDBMs (e.g. MySQL) This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
  • 6. • Hadoop means a cost- advantage because • It usually runs on relatively inexpensive (commodity) hardware • No binding to specific vendors • Open-Source-Software Comparison of storage costs This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). Quelle: BITKOM Leitfaden Big-Data-Technologien-Wissen für Entscheider 2014, S. 39
  • 7. • Required to move data • From NAS to Server • To Cloud • Multi-Terabyte senarios? Dealing with large amounts of data • Immediate processing • Unified storage and processing capabilities • Distributed I/O This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
  • 8. • When dealing with large data sets it is usually easier to bring the processor to the data than the data to the processor • Fine-granular parallelisation: All processing cores of the cluster are used as processors • Designed for failure. In large clusters hardware failure is the norm rather than the exception • Redundancy : Redundant storage of data blocks (default: 3 copies) • Data locality: Free nodes with direct access to data do the processing Some Basic hadoop assumptions This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
  • 9. What is Hadoop (physically)? Distributed processing (MapReduce) Distributed Storage (HDFS) Hadoop = MapReduce + HDFS 2 x Quad-Core-CPUs: 10 Map (parallelisation) 4 Reduce (aggregation) 4 x 1 TB hard disks with redundancy 3: 1,33 TB effective This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
  • 10. Configuration per CPU Configuration of one Quad-Core-CPU (= 1 node) 4 physical cores 8 hyperthreading-cores (System „sees“ 8 cores) OS Map Map Map Map Map Reduce Reduce This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
  • 11. Experimental Cluster Job TrackerTask Trackers Data Nodes Name Node CPU: 1 x 2.53GHz Quadcore CPU (8 HyperThreading) RAM: 16GB DISK: 2 x 1TB DISKs configured as RAID0 (Performance) – 2 TB effective • Of 16 HT cores: 5 for Map; 2 for Reduce; 1 für Betriebssystem.  25 processing cores for Map tasks  10 processing cores for Reduce tasks CPU: 2 x 2.40GHz Quadcore CPU (16 HyperThreading cores) RAM: 24GB DISK: 3 x 1TB DISKs configured as RAID5 (Redundanz) – 2 TB effective This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
  • 12. Sort Shuffle Merge Input data Input split 1 Record 1 Record 2 Record 3 Input split 2 Record 4 Record 5 Record 6 Input split 3 Record 7 Record 8 Record 9 What is Hadoop (conceptually)? Task1 Map Reduce Task 2 Task 3 Output data Aggregated Result Aggregated Result This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
  • 13. Platform instance architecture at the Austrian National Library • Access via REST API • Workflow engine for complex jobs • Hive as the frontend for analytic queries • MapReduce/Pig for Extraction, Transform, and Load (ETL) • „Small“ objects in HDFS or HBase • „Large “ Digital objects stored on NetApp Filer Taverna Workflow engine REST API This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
  • 14. Scalable Command Line Processing ToMaR Large- scale content profiling C3PO JPEG2000 file format validation Jpylyzer Duplicate image detection Matchbox Selected SCAPE tools in various application scenarios This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
  • 15. • Web Archiving • Web Archive Mime Type Identification • Characterisation of web archive data • Austrian Books Online • Scenario 2: Image File Format Migration • Scenario 3: Comparison of Book Derivatives • Scenario 4: MapReduce in Digitised Book Quality Assurance Overview about application scenarios This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
  • 16. Webarchiving This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). • Storage: ca. 45TB • ca. 1.7 Billion Objekts • Domain harvesting • Entire top-level-domain .at every 2 years • Selective harvesting • Important websites that change regularly • Event harvesting • Special occasions and events (e.g. elections)
  • 17. File format identification in web archives This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
  • 18. (W)ARC Container JPG GIF HTM HTM MID (W)ARC InputFormat (W)ARC RecordReader Basiert auf HERITRIX Web Crawler MapReduce JPG Apache Tika detect MIME Map Reduce image/jpg image/jpg 1 image/gif 1 text/html 2 audio/midi 1 File format identification in web archives Software-Integration Durchsatz(GB/min) TIKA detector API in Map Phase 6,17 GB/min FILE als Kommandozeilen-Applikation mit MapReduce 1,70 GB/min TIKA JAR als Kommandozeilen-Applikation mit MapReduce 0,01 GB/min Datenmenge Anzahl der ARC-Dateien Durchsatz(GB/min) 1 GB 10 x 100 MB 1,57 GB/min 2 GB 20 x 100 MB 2,5 GB/min 10 GB 100 x 100 MB 3,06 GB/min 20 GB 200 x 100 MB 3,40 GB/min 100 GB 1000 x 100 MB 3,71 GB/min This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
  • 19. Characterisation of web archive data This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
  • 20. Characterisation of web archive data This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
  • 21. Characterisation of web archive data This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
  • 22. • Public private partnership with Google • Only public domain • Objective to scan ~ 600.000 Volumes • ~ 200 Mio. pages • ~ 70 project team members • 20+ in core team • ~ 200K physical volumes scanned so far • ~ 60 Mio pages Austrian Books Online This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
  • 23. ADOCO (Austrian Books Online Download & Control) This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). https://confluence.ucop.edu/display/Curation/PairTree Google Public Private Partnership ADOCO
  • 24. • TIFF to JPEG2000 migration • Objective: Reduce storage costs by reducing the size of the images • JPEG2000 to TIFF migration • Objective: Mitigation of the JPEG2000 file format obsolescense risk • Different preservation tool categories: • Validation • Migration • Quality assurance Quality assured image file format migration This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
  • 25. Comparison of book derivatives • Compare different versions of the same book • Images come from different scanning sources • Images have been manipulated (cropped, rotated) This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
  • 26. • 60.000 books, ~ 24 Million pages • Using Taverna‘s „Tool service“ (remote ssh execution) • Orchestration of different types of hadoop jobs • Hadoop-Streaming-API • Hadoop Map/Reduce • Hive • Workflow available on myExperiment: http://www.myexperiment.org/workflows/3105 • See Blogpost: http://www.openplanetsfoundation.org/blogs/2012-08-07-big- data-processing-chaining-hadoop-jobs-using-taverna Using MapReduce for Quality Assurance This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
  • 27. Using MapReduce for Quality Assurance This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). Bildbreite Blockbreite Assumption: „Significant“ difference between average blockwidth and image width is an indicator for possible text loss due to cropping error. Cropping errorCorrect cropping
  • 28. Using MapReduce for Quality Assurance This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). • Create input text files w. file paths (JP2 & HTML) • Read image metadata using Exiftool (Hadoop Streaming API) • Create sequence file containing all HTML files • Calculate average block width using MapReduce • Load data in Hive tables • Execute SQL test query
  • 29. • Possibility for libraries to build cost-efficient solutions for storing large data collections • HDFS as storage master or staging area? • Local cluster vs. cloud? • Apache Hadoop offers a stable core for building a large scale processing platform; ready to be used in production • Carefully select additional components from the Apache Hadoop Ecosystem (HBase, Hive, Pig, Oozie, Yarn, Ambari, etc.) that fit your needs This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). Summary
  • 30. These slides on Slideshare: • http://de.slideshare.net/SvenSchlarb/ application-scenarios-of-the-scape-project-at-the-austrian-national-library Further information • Project website: www.scape-project.eu • Github repository: www.github.com/openplanets • Project Wiki: www.wiki.opf-labs.org/display/SP/Home SCAPE tools mentioned • ToMaR: http://openplanets.github.io/ToMaR/# • Jpylyzer: http://www.openplanetsfoundation.org/software/jpylyzer • Matchbox: https://github.com/openplanets/scape/tree/master/pc-qa- matchbox • C3PO: http://ifs.tuwien.ac.at/imp/c3po Thank you! Questions? This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).