Rainer Schmidt, AIT Austrian Institute of Technology, presented Scalable Preservation Workflows from SCAPE at the five-day ‘Digital Preservation Advanced Practitioner Training’ event (http://bit.ly/1fYCvMO), hosted by the DPC in Glasgow on 15-19 July 2013.
The presentation introduces the SCAPE Platform, presents scenarios from the SCAPE Testbeds, and describes how to create scalable workflows and execute them on the SCAPE Platform.
1. SCAPE
Rainer Schmidt
DP Advanced Practitioners Training
July 16th, 2013
University of Glasgow
Scalable Preservation Workflows
design, parallelisation, and execution
2. SCAlable Preservation Environments
SCAPE
• European Commission FP7 Integrated Project
• 16 Organizations, 8 Countries
• 42 months: February 2011 – July 2014
• Budget: 11.3 Million Euro (8.6 Million Euro funded)
• Consortium: data centers, memory institutions,
research centers, universities & commercial partners
• Recently extended to involve HPC computing centers
• Dealing with (digital) preservation processes at scale
• such as ingestion, migration, analysis and monitoring
of digital data sets
• Focus on scalability, robustness, and automation.
The Project
3. SCAlable Preservation Environments
SCAPE
What I will show you
• Example scenarios from the SCAPE DL Testbed and how they are formalized using workflow technology
• Introduction to the SCAPE Platform: underlying technologies, preservation services, and how to set it up
• How the paradigm differs from a client-server set-up, and whether a standard tool can be executed against my data
• How to create scalable workflows and execute them on the platform
• A practical demonstration (and available VM) for creating and running such workflows
5. SCAlable Preservation Environments
SCAPE
• Ability to process large and complex data sets in preservation scenarios
• Increasing amount of data in data centers and memory institutions
• Volume, Velocity, and Variety of data
[Chart: projected growth of data volume, 1970–2030]
cf. Jisc (2012) Activity Data: Delivering benefits from the data deluge.
available at http://www.jisc.ac.uk/publications/reports/2012/activity-data-delivering-benefits.aspx
Motivation
6. SCAlable Preservation Environments
SCAPE
Austrian National Library (ONB)
• Web Archiving
• Scenario 1: Web Archive Mime Type Identification
• Austrian Books Online
• Scenario 2: Image File Format Migration
• Scenario 3: Comparison of Book Derivatives
• Scenario 4: MapReduce in Digitised Book Quality Assurance
7. SCAlable Preservation Environments
SCAPE
• Physical storage: 19 TB
• Raw data: 32 TB
• Number of objects: 1,241,650,566
• Domain harvesting
  • The entire top-level domain .at every 2 years
• Selective harvesting
  • Interesting, frequently changing websites
• Event harvesting
  • Special occasions and events (e.g. elections)
Web Archiving - File Format identification
8. SCAlable Preservation Environments
SCAPE
• Public-private partnership with Google Inc.
• Only public domain works
• Objective: scan ~600,000 volumes (~200 million pages)
• ~70 project team members, 20+ in the core team
• ~130,000 physical volumes (~40 million pages) scanned so far
Austrian Books Online
10. SCAlable Preservation Environments
SCAPE
• Task: Image file format migration
• TIFF to JPEG2000 migration
• Objective: Reduce storage costs by
reducing the size of the images
• JPEG2000 to TIFF migration
• Objective: Mitigation of the JPEG2000
file format obsolescence risk
• Challenges:
• Integrating validation, migration,
and quality assurance
• Computing intensive quality
assurance
Image file format migration
11. SCAlable Preservation Environments
SCAPE
Comparison of book derivatives – Matchbox tool
• Quality Assurance for different book versions
• Images have been manipulated (cropped,
rotated) and stored in different locations
• Images subject to different modification
procedures
• Detailed image comparison and detection of
near duplicates and corresponding images
• Feature extraction invariant under color
space, scale, rotation, cropping
• Detecting visual keypoints and
structural similarity
• Automated Quality Assurance workflows
• Austrian National Library - Book scan project
• The British Library - “Dunhuang” manuscripts
12. SCAlable Preservation Environments
SCAPE
Data Preparation and QA
• Goal: Preparing large document collections for data analysis.
• Example: Detecting quality issues due to cropping errors.
• Large volumes of HTML files generated as part of a book collection
  • Representing the layout and text of the corresponding book page
  • HTML tags representing e.g. the width and height of a text or image block
• QA workflow using multiple tools (sketched below)
  • Generate image metadata using Exiftool
  • Parse the HTML and calculate the block size of each book page
  • Normalize the data and load it into a database
  • Execute a query to detect quality issues
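A minimal shell sketch of such a pipeline, assuming JP2 page images and OCR HTML whose block tags carry width/height attributes; the file layout, table name, and query threshold are illustrative assumptions, not the SCAPE implementation:

    # 1) Image metadata via Exiftool (CSV columns: SourceFile, ImageWidth, ImageHeight)
    exiftool -csv -ImageWidth -ImageHeight pages/*.jp2 > image_meta.csv

    # 2) Block sizes from the OCR HTML (assumes width="..." height="..." attributes)
    grep -ho 'width="[0-9]*" height="[0-9]*"' html/*.html > block_sizes.txt

    # 3) Normalize into a database and flag implausibly narrow pages (possible crop errors)
    sqlite3 -csv qa.db ".import image_meta.csv image_meta"
    sqlite3 qa.db "SELECT SourceFile FROM image_meta WHERE CAST(ImageWidth AS INTEGER) < 1000;"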
14. SCAlable Preservation Environments
SCAPE
Goal of the SCAPE Platform
• Hardware and software platform to support scalable preservation in terms of computation and storage
• Employing a scale-out architecture to support preservation activities against large amounts of data
• Integration of existing tools, workflows, and data sources and sinks
• A data center service providing a scalable execution and storage backend for different object management systems
• Based on a minimal set of defined services for processing tools and/or queries close to the data
15. SCAlable Preservation Environments
SCAPE
Underlying Technologies
• The SCAPE Platform is built on top of existing data-intensive
computing technologies.
• Reference Implementation leverages Hadoop Software Stack (HDFS,
MapReduce, Hive, …)
• Virtualization and packaging model for dynamic deployments of
tools and environments
• Debian packages and IaaS support
• Repository Integration and Services
• Data/Storage Connector API (Fedora and Lily)
• Object Exchange Format (METS/PREMIS representation)
• Workflow modeling, translation, and provisioning.
• Taverna Workbench and Component Catalogue
• Workflow Compiler and Job Submission Service
16. SCAlable Preservation Environments
SCAPE
Components of the Platform
• Execution Platform
• Deploy SCAPE tools and parallelized (WF) applications
• Executable via CLI and Service API
• Scripts/Drivers aiding integration.
• Workflow Support
• Describe and validate preservation workflows using a
defined component model
• Register and semantic search using Component Catalogue
• Repository Integration
• Fedora implementation on top of CI
• Loader Application, Object Model, and Connector APIs.
20. SCAlable Preservation Environments
SCAPE
• Open-source software framework for large-scale data-intensive computations running on large clusters of commodity hardware.
• Derived from Google's publications on the Google File System and MapReduce.
• Hadoop = MapReduce + HDFS
• MapReduce: Programming Model (Map, Shuffle/Sort,
Reduce) and Execution Environment.
• HDFS: Virtual distributed file system overlay on top of local
file systems.
Hadoop Overview #1
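Since HDFS presents itself as a file system overlay, basic interaction from the shell mirrors ordinary file operations; the paths below are illustrative:

    # Copy local data into HDFS (blocks are distributed and replicated across data nodes)
    hadoop fs -put /local/books /user/scape/books

    # List and read it back through the HDFS client
    hadoop fs -ls /user/scape/books
    hadoop fs -cat /user/scape/books/page_0001.xml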
21. SCAlable Preservation Environments
SCAPE
• Designed for a write-once, read-many-times access model.
• Data IO is handled via HDFS.
  • Data is divided into blocks (typically 64MB) and distributed and replicated over data nodes.
• Parallelization logic is strictly separated from the user program.
  • Automated data decomposition and communication between processing steps.
  • Applications benefit from built-in support for data locality and fail-safety.
  • Applications scale out on big clusters, processing very large data volumes.
Hadoop Overview #2
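To make the model concrete, here is word count written as two shell filters in the spirit of the Hadoop Streaming API (introduced later in this deck); an illustrative sketch, not SCAPE code:

    # mapper.sh -- Map: turn each input record (line of text) into <word, 1> pairs
    awk '{for (i = 1; i <= NF; i++) print $i "\t1"}'

    # reducer.sh -- Reduce: receive pairs grouped and sorted by key, emit <word, count>
    awk -F'\t' '{c[$1] += $2} END {for (w in c) print w "\t" c[w]}'

Saved as two executable scripts, the whole model can be simulated locally with `cat input.txt | ./mapper.sh | sort | ./reducer.sh`, where `sort` plays the role of Hadoop's Shuffle/Sort phase.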
22. SCAlable Preservation Environments
SCAPE
MapReduce/Hadoop in a nutshell
[Diagram: input data is divided into splits of records; parallel Map tasks process the splits; the framework sorts, shuffles, and merges the intermediate pairs; Reduce tasks aggregate them into the output data.]
23. SCAlable Preservation Environments
SCAPE
MapReduce/Hadoop in a nutshell
• Map takes <k1, v1> and transforms it to <k2, v2> pairs
[Diagram as above, highlighting the Map phase.]
24. SCAlable Preservation Environments
SCAPE
MapReduce/Hadoop in a nutshell
• Shuffle/Sort takes <k2, v2> and transforms it to <k2, list(v2)>
[Diagram as above, highlighting the Shuffle/Sort phase.]
25. SCAlable Preservation Environments
SCAPE
MapReduce/Hadoop in a nutshell
• Reduce takes <k2, list(v2)> and transforms it to <k3, v3>
[Diagram as above, highlighting the Reduce phase.]
27. SCAlable Preservation Environments
SCAPE
Platform Deployment
• There is no prescribed deployment model
• Private, institutionally-shared, external data center
• Possible to deploy on “bare-metal” or using
virtualization and cloud middleware.
• Platform Environment packaged as VM image
• Automated and scalable deployment.
• Presently supporting Eucalyptus (and AWS) clouds.
• SCAPE provides two shared Platform instances
• Stable non-virtualized data-center cluster
• Private-cloud based development cluster
• Partitioning and dynamic reconfiguration
28. SCAlable Preservation Environments
SCAPE
Deploying Environments
• IaaS enables packaging and dynamic deployment of (complex) software environments
  • But requires a complex virtualization infrastructure
• Data-intensive technology is able to deal with a constantly varying number of cluster nodes
  • Node failures are expected and handled automatically
  • The system can grow/shrink on demand
• A Network Attached Storage solution can be used as a data source
  • But does not meet the scalability and performance needs of the computation
• SCAPE Hadoop Clusters
• Linux + Preservation tools + SCAPE Hadoop libraries
• Optionally Higher-level services (repository, workflow, …)
29. SCAlable Preservation Environments
SCAPE
ONB Experimental Cluster
[Diagram: one Name Node / Job Tracker machine and the Task Tracker / Data Node machines]
• Worker nodes (Task Trackers / Data Nodes)
  • CPU: 1 x 2.53GHz quad-core CPU (8 HyperThreading cores)
  • RAM: 16GB
  • DISK: 2 x 1TB disks configured as RAID0 (performance) – 2 TB effective
• Of the HyperThreading cores on each worker: 5 for Map, 2 for Reduce, 1 for the operating system
  • 25 processing cores for Map tasks and 10 cores for Reduce tasks in total
• Master node (Name Node / Job Tracker)
  • CPU: 2 x 2.40GHz quad-core CPUs (16 HyperThreading cores)
  • RAM: 24GB
  • DISK: 3 x 1TB disks configured as RAID5 (redundancy) – 2 TB effective
30. SCAlable Preservation Environments
SCAPE
SCAPE Shared Clusters
• AIT (dev. cluster)
• 10 dual core nodes, 4 six-core
nodes, ~85 TB disk storage.
• Xen and Eucalyptus virtualization
and cloud management
• IMF (central instance)
• Low consumption machines in
NoRack column
• dual core AMD 64-bit processor,
8GB RAM, 15TB on 5 disks
• production data center facility
32. SCAlable Preservation Environments
SCAPE
• Wrapping Sequential Tools
• Using a wrapper script (Hadoop Streaming API)
• PT’s generic Java wrapper allows one to use pre-defined
patterns (based on toolspec language)
• Works well for processing a moderate number of files
• e.g. applying migration tools or FITS.
• Writing a custom MapReduce application
• Much more powerful and usually performs better.
• Suitable for more complex problems and file formats, such
as Web archives.
• Using a high-level language like Hive or Pig
  • Very useful for performing analysis of (semi-)structured data, e.g. characterization output (see the Hive sketch below).
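As an illustration of the third option, a hypothetical Hive query over characterization output loaded into a table; the table and column names are assumptions:

    # Aggregate MIME types across a collection's characterization records
    hive -e "SELECT mime_type, COUNT(*) AS n
             FROM characterization
             GROUP BY mime_type
             ORDER BY n DESC;"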
33. SCAlable Preservation Environments
SCAPE
• Preservation tools and libraries are pre-packaged so they
can be automatically deployed on cluster nodes
• SCAPE Debian Packages
• Supporting SCAPE Tool Specification Language
• MapReduce libs for processing large container files
• For example METS and (W)arc RecordReader
• Application Scripts
• Based on Apache Hive, Pig, Mahout
• Software components to assemble complex data-parallel workflows
• Taverna and Oozie Workflows
Available Tools
34. SCAlable Preservation Environments
SCAPE
Sequential Workflows
• In order to run a workflow (or activity) on the cluster it will
have to be parallelized first!
• A number of different parallelization strategies exist
• Approach typically determined on a case-by-case basis
• May lead to changes of activities, workflow structure, or
the entire application.
• Automated parallelization will only work to a certain degree
• Trivial workflows can be deployed/executed without requiring individual parallelization (wrapper approach).
• SCAPE driver program for parallelizing Taverna workflows.
• SCAPE template workflows developed for different institutional scenarios.
35. SCAlable Preservation Environments
SCAPE
Parallel Workflows
• Are typically derived from sequential (conceptual) workflows created for a desktop environment (but may differ substantially!)
• Rely on MapReduce as the parallel programming model and Apache Hadoop as the execution environment
• Data decomposition is handled by the Hadoop framework based on input format handlers (e.g. text, warc, mets-xml)
• Can make use of a workflow engine (like Taverna or Oozie) for orchestrating complex (composite) processes
• May include interactions with data management systems (repositories) and sequential (concurrently executed) tools
• Tool invocations are based on an API or command-line interface and performed as part of a MapReduce application
37. SCAlable Preservation Environments
SCAPE
Tool Specification Language
• The SCAPE Tool Specification Language (toolspec) provides a
schema to formalize command line tool invocations.
• Can be used to automate a complex tool invocation (many arguments) based on a keyword (e.g. ps2pdfs)
• Provides a simple and flexible mechanism to define tool dependencies, for example those of a workflow
  • These can be resolved by the execution system using Linux packages
• The toolspec is minimalistic and can easily be created for individual tools and scripts (a minimal example follows below)
• Tools provided as SCAPE Debian packages come with a
toolspec document by default.
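A minimal, hypothetical toolspec in the spirit of the <command> element shown in the WARC example later in this deck; apart from <command> and the ${input} placeholder, the element names here are assumptions rather than the normative schema:

    <tool name="file">
      <operation name="file-mime">
        <command>file --mime-type ${input}</command>
      </operation>
    </tool>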
39. SCAlable Preservation Environments
SCAPE
MapRed Toolwrapper
• Hadoop provides the scalability, reliability, and robustness needed for processing data that does not fit on a single machine.
• Applications must, however, be made compliant with the execution environment.
• Our intention was to provide a wrapper that lets one execute a command-line tool on the cluster much as in a desktop environment (invocation sketch below).
• User simply specifies toolspec file, command name, and payload
data.
• Supports HDFS references and (optionally) standard IO streams.
• Supports the SCAPE toolspec to execute preinstalled tools or
other applications available via OS command-line interface.
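The invocation pattern, as it also appears on the Image Migration #2 slide later in this deck; the job name and input list are placeholders, and the input file lists the payload paths (see Ongoing Work):

    # Run toolspec actions over a list of input files stored in HDFS
    hadoop jar mpt-mapred.jar -j myjob -i input-files.txt -r toolspecs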
40. SCAlable Preservation Environments
SCAPE
Hadoop Streaming API
• Hadoop streaming API supports the execution of scripts (e.g.
bash or python) which are automatically translated and
executed as MapReduce applications.
• Can be used to process data with common UNIX filters such as echo, awk, and tr (see the job submission sketch below).
• Hadoop is designed to process its input based on key/value pairs, meaning the input data is interpreted and split by the framework.
  • Perfect for processing text, but difficult for binary data.
• The streaming API uses streams to read/write from/to HDFS.
• Preservation tools typically do not support HDFS file pointers and/or IO streaming through stdin/stdout.
  • Hence, DP tools are largely unusable with the Streaming API.
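For data that streaming does suit, submitting the word-count filters sketched earlier could look like the following; the streaming jar name and paths vary across Hadoop distributions and are placeholders here:

    hadoop jar hadoop-streaming.jar \
      -input  /user/scape/text \
      -output /user/scape/wordcount \
      -mapper  mapper.sh \
      -reducer reducer.sh \
      -file mapper.sh -file reducer.sh   # ship the scripts to the task nodes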
41. SCAlable Preservation Environments
SCAPE
Suitable Use-Cases
• Use the MapRed Toolwrapper when dealing with (a large number of) single files.
  • Be aware that this may not be an ideal strategy; there are more efficient ways to deal with many files on Hadoop (SequenceFiles, HBase, etc.).
  • However, it is practical and sufficient in many cases, as no additional application development is required.
• A typical example is file format migration on a moderate number of files (e.g. 100,000s), which can be included in a workflow with additional QA components.
• Very helpful when the payload is simply too big to be computed on a single machine.
42. SCAlable Preservation Environments
SCAPE
Example – Exploring an uncompressed WARC
• Unpacked a 1GB WARC.GZ on a local computer
  • 2.2 GB unpacked => 343,288 files
  • `ls` took ~40s
  • Counting the *.html files with `file` took ~4 hrs => 60,000 HTML files
• Provided the corresponding bash command as a toolspec:
  • <command>if [ "$(file ${input} | awk '{print $2}')" == HTML ]; then echo "HTML"; fi</command>
• Moved the data to HDFS and executed pt-mapred with the toolspec:
  • 236 min on the local file system
  • 160 min with 1 mapper on HDFS (this was a surprise!)
  • 85 min (2 mappers), 52 min (4), 27 min (8)
  • 26 min with 8 mappers and IO streaming (also a surprise)
43. SCAlable Preservation Environments
SCAPE
Ongoing Work
• Source project and README on Github presently under
openplanets/scape/pt-mapred*
• Will be migrated to its own repository soon.
• Presently one must generate an input file that specifies the input file paths (along with optional output file names).
  • TODO: read binaries directly from an input directory path, allowing Hadoop to take advantage of data locality.
• Input/output streaming and piping between toolspec commands has already been implemented.
• TODO: Add support for Hadoop Sequence Files.
• Look into possible integration with Hadoop Streaming API.
* https://github.com/openplanets/scape/tree/master/pt-mapred
45. SCAlable Preservation Environments
SCAPE
What we mean by Workflow
• Formalized (and repeatable) processes/experiments consisting
of one or more activities interpreted by a workflow engine.
• Usually modeled as DAGs based on control-flow and/or
data-flow logic.
• The workflow engine functions as a coordinator/scheduler that triggers the execution of the involved activities.
  • May be performed by a desktop or server-side component.
• Example workflow engines are Taverna workbench, Taverna
server, and Apache Oozie.
• These are not equally rich, and are designed for different purposes: experimentation & science, SOA, and Hadoop integration, respectively.
46. SCAlable Preservation Environments
SCAPE
Taverna
• A workflow language and graphical editing environment based
on a dataflow model.
• Linking activities (tools, web services) based on data pipes.
• High level workflow diagram abstracting low level
implementation details
• Think of a workflow as a kind of configurable script.
  • Easier to explain, share, reuse, and repurpose.
• Taverna workbench provides a desktop environment to run
instances of that language.
• Workflows can also be run in headless and server mode (see the example below).
• It doesn't necessarily run on a grid, cloud, or cluster but can be
used to interact with those resources.
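For instance, with the Taverna command-line tool a workflow can be run headlessly; the script name and flag below reflect the Taverna 2.x command-line distribution and should be treated as an assumption:

    # Execute a workflow without the workbench GUI, writing outputs to ./results
    sh executeworkflow.sh -outputdir results migration-workflow.t2flow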
47. SCAlable Preservation Environments
SCAPE
• Extract TIFF Metadata with
Matchbox and Jpylyzer
• Perform OpenJpeg
TIFF to JP2 migration
• Extract JP2 Metadata with
Matchbox and Jpylyzer
• Validation based on Jpylyzer
profiles
• Compare SIFT image
features to test visual
similarity
• Generate Report
Image Migration #1
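The core migrate-and-validate steps of this workflow can be sketched on the command line with OpenJPEG and the jpylyzer CLI; the flags and file names are illustrative, and the Matchbox and profile-validation steps are omitted:

    # TIFF -> JP2 migration with OpenJPEG
    opj_compress -i page_0001.tif -o page_0001.jp2

    # Validate the result and extract JP2 properties with jpylyzer
    jpylyzer page_0001.jp2 > page_0001_jpylyzer.xml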
48. SCAlable Preservation Environments
SCAPE
• No significant changes in workflow structure compared to the sequential workflow.
• Orchestrating remote activities using Taverna's Tool Plugin over SSH.
• Using the Platform's MapRed toolwrapper to invoke command-line tools on the cluster.
Image Migration #2
Command: hadoop jar mpt-mapred.jar -j $jobname -i $infile -r toolspecs
49. SCAlable Preservation Environments
SCAPE
WARC Identification #1
[Diagram: a (W)ARC RecordReader (based on the Heritrix web crawler libraries for reading/writing (W)ARC files) splits a (W)ARC container holding JPG, GIF, HTM, and MID records into records; the Map phase detects each record's MIME type with Apache Tika; the Reduce phase aggregates the counts, e.g. image/jpg 1, image/gif 1, text/html 2, audio/midi 1.]
Tool integration pattern                              Throughput (GB/min)
TIKA detector API call in Map phase                   6.17
FILE called as command-line tool from map/reduce      1.70
TIKA JAR command-line tool called from map/reduce     0.01

Amount of data    Number of ARC files    Throughput (GB/min)
1 GB              10 x 100 MB            1.57
2 GB              20 x 100 MB            2.50
10 GB             100 x 100 MB           3.06
20 GB             200 x 100 MB           3.40
100 GB            1000 x 100 MB          3.71
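The spread in the first table is largely process start-up cost: invoking the Tika jar per record spawns a fresh JVM each time, while the detector API runs in-process inside the Map task. For comparison, the per-record CLI calls look roughly like this (jar and file names assumed):

    # One JVM start-up per invocation (the 0.01 GB/min pattern above)
    java -jar tika-app.jar --detect record.bin

    # file(1) is cheaper to fork, but still one process per record (1.70 GB/min)
    file --mime-type record.bin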
54. SCAlable Preservation Environments
SCAPE
Quality Assurance #3 – Using Apache Oozie
• Remote workflow scheduler for Hadoop
• Accessible via a REST interface (see the sketch below)
• Control-flow oriented workflow language
• Well integrated with the Hadoop stack (MapReduce, Pig, HDFS)
• Hadoop API is called directly; no more ssh interaction required
• Deals with classpath problems and different library versions
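Submitting and starting a job through Oozie's web services API can be sketched with curl; the host, port, and properties file are assumptions:

    # Submit and start a workflow job via the Oozie REST API
    curl -X POST -H "Content-Type: application/xml" \
         -d @job-config.xml \
         "http://oozie-host:11000/oozie/v1/jobs?action=start"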
56. SCAlable Preservation Environments
SCAPE
• When dealing with large amounts of data in terms of #files, #objects, #records, and #TB of storage, traditional data management techniques begin to fail (file system operations, databases, tools, etc.).
• Scalability and robustness are key.
• Data-intensive technologies can help a great deal but do not
support desktop tools and workflows used in many domains
out of the box.
• SCAPE has ported a number of preservation scenarios identified
by its user groups from sequential workflows to a scalable
(Hadoop-based) environment.
• The required effort can vary a lot depending on the
infrastructure in place, the nature of the data, scale, complexity,
and required performance.
Conclusions
57. SCAlable Preservation Environments
SCAPE
• Project website: www.scape-project.eu
• Github: https://github.com/openplanets/
• SCAPE Group on MyExperiment: http://www.myexperiment.org
• SCAPE tools: http://www.scape-project.eu/tools
• SCAPE on Slideshare: http://www.slideshare.net/SCAPEproject
• SCAPE Application Areas at the Austrian National Library:
  • http://www.slideshare.net/SvenSchlarb/elag2013-schlarb
• Submission and execution of SCAPE workflows:
  • http://www.scape-project.eu/deliverable/d5-2-job-submission-language-and-interface
Resources