Rainer Schmidt
DP Advanced Practitioners Training
July 16th, 2013
University of Glasgow
Scalable Preservation Workflows
design, parallelisation, and execution
SCAlable Preservation Environments
SCAPE
• European Commission FP7 Integrated Project
• 16 Organizations, 8 Countries
• 42 months: February 2011 – July 2014
• Budget: 11.3 Million Euro (8.6 Million Euro funded)
• Consortium: data centers, memory institutions,
research centers, universities & commercial partners
• Recently extended to involve HPC centers.
• Dealing with (digital) preservation processes at scale
• such as ingestion, migration, analysis and monitoring
of digital data sets
• Focus on scalability, robustness, and automation.
The Project
What I will show you
• Example Scenarios from the SCAPE DL Testbed and how
they are formalized using Workflow Technology
• Introduction to the SCAPE Platform: underlying
technologies, preservation services, and how to set it up.
• How the paradigm differs from a client-server set-up, and
whether a standard tool can be run against your data.
• How to create scalable workflows and execute them on
the platform.
• A practical demonstration (and available VM) for creating
and running such workflows.
Example Scenarios
and workflows
• Ability to process large and
complex data sets in
preservation scenarios
• Increasing amount of data in
data centers and memory
institutions
Volume, Velocity, and Variety
of data
[Chart: growth of data volume, 1970–2030]
cf. Jisc (2012) Activity Data: Delivering benefits from the data deluge,
available at http://www.jisc.ac.uk/publications/reports/2012/activity-data-delivering-benefits.aspx
Motivation
Austrian National Library (ONB)
• Web Archiving
• Scenario 1: Web Archive Mime Type Identification
• Austrian Books Online
• Scenario 2: Image File Format Migration
• Scenario 3: Comparison of Book Derivatives
• Scenario 4: MapReduce in Digitised Book Quality Assurance
• Physical storage 19 TB
• Raw data 32 TB
• Number of objects: 1,241,650,566
• Domain harvesting
• Entire top-level-domain
.at every 2 years
• Selective harvesting
• Interesting frequently
changing websites
• Event harvesting
• Special occasions and
events (e.g. elections)
Web Archiving - File Format identification
• Public private partnership with
Google Inc.
• Only public domain material
• Objective: scan ~600,000 volumes
• ~200 million pages
• ~70 project team members
• 20+ in core team
• ~130,000 physical volumes scanned so far
• ~40 million pages
Austrian Books Online
[Workflow diagram: Digitisation (Google public-private partnership) → Download & Storage (PairTree, https://confluence.ucop.edu/display/Curation/PairTree) → Quality Control (ADOCO) → Access]
• Task: Image file format migration
• TIFF to JPEG2000 migration
• Objective: Reduce storage costs by
reducing the size of the images
• JPEG2000 to TIFF migration
• Objective: Mitigation of the JPEG2000
file format obsolescence risk
• Challenges:
• Integrating validation, migration,
and quality assurance
• Computing intensive quality
assurance
Image file format migration
Comparison of book derivatives – Matchbox tool
• Quality Assurance for different book versions
• Images have been manipulated (cropped,
rotated) and stored in different locations
• Images subject to different modification
procedures
• Detailed image comparison and detection of
near duplicates and corresponding images
• Feature extraction invariant under color
space, scale, rotation, cropping
• Detecting visual keypoints and
structural similarity
• Automated Quality Assurance workflows
• Austrian National Library - Book scan project
• The British Library - “Dunhuang” manuscripts
Data Preparation and QA
• Goal: Preparing large document collections for data analysis.
• Example: Detecting quality issues due to cropping errors.
• Large volumes of HTML files generated as part of a book
collection
• Representing layout and text of corresponding book page
• HTML tags representing e.g. width and height of text or image block
• QA Workflow using multiple tools
• Generate image metadata using Exiftool
• Parse HTML and calculate block size of book page
• Normalize data and load it into a database
• Execute query to detect quality issues
The SCAPE Platform
Goal of the SCAPE Platform
• Hardware and software platform to support scalable
preservation in terms of computation and storage.
• Employing a scale-out architecture to support
preservation activities against large amounts of data.
• Integration of existing tools, workflows, and
data sources and sinks.
• A data center service providing a scalable execution
and storage backend for different object management
systems.
• Based on a minimal set of defined services for
processing tools and/or queries close to the data.
Underlying Technologies
• The SCAPE Platform is built on top of existing data-intensive
computing technologies.
• Reference Implementation leverages Hadoop Software Stack (HDFS,
MapReduce, Hive, …)
• Virtualization and packaging model for dynamic deployments of
tools and environments
• Debian packages and IaaS support.
• Repository Integration and Services
• Data/Storage Connector API (Fedora and Lily)
• Object Exchange Format (METS/PREMIS representation)
• Workflow modeling, translation, and provisioning.
• Taverna Workbench and Component Catalogue
• Workflow Compiler and Job Submission Service
Components of the Platform
• Execution Platform
• Deploy SCAPE tools and parallelized (WF) applications
• Executable via CLI and Service API
• Scripts/Drivers aiding integration.
• Workflow Support
• Describe and validate preservation workflows using a
defined component model
• Register and semantic search using Component Catalogue
• Repository Integration
• Fedora implementation on top of CI
• Loader Application, Object Model, and Connector APIs.
Architectural Overview (Core)
[Diagram: the Workflow Modeling Environment connected to the Component Catalogue via the Component Lookup API and the Component Registration API]
Architectural Overview (Core)
[Same diagram: Component Catalogue, Workflow Modeling Environment, Component Lookup API, Component Registration API — highlighting the focus of this talk]
Hadoop Overview
• Open-source software framework for large-scale data-
intensive computations running on large clusters of
commodity hardware.
• Derived from Google's File System and MapReduce
publications.
• Hadoop = MapReduce + HDFS
• MapReduce: Programming Model (Map, Shuffle/Sort,
Reduce) and Execution Environment.
• HDFS: Virtual distributed file system overlay on top of local
file systems.
Hadoop Overview #1
• Designed for a write-once, read-many access model.
• Data IO is handled via HDFS.
• Data divided into blocks (typically 64MB) and distributed and
replicated over data nodes.
• Parallelization logic is strictly separated from user
program.
• Automated data decomposition and communication between
processing steps.
• Applications benefit from built-in support for data locality and
fail-safety.
• Applications scale-out on big clusters processing very large data
volumes.
Hadoop Overview #2
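To make the block model concrete, here is a small sketch. The 64MB block size is the default mentioned above; the replication factor of 3 is Hadoop's classic default, added here as an assumption, and the 1GB file is a made-up example:

```python
# Sketch: how HDFS divides a file into blocks and replicates it.
# 64MB block size per the slide; replication factor 3 is Hadoop's
# classic default (an assumption here); file sizes are examples.
BLOCK_SIZE_MB = 64
REPLICATION = 3

def hdfs_footprint(file_size_mb):
    """Return (number of HDFS blocks, raw replicated storage in MB)."""
    blocks = -(-file_size_mb // BLOCK_SIZE_MB)   # ceiling division
    raw_mb = file_size_mb * REPLICATION          # HDFS stores actual bytes, replicated
    return blocks, raw_mb

# A 1000MB file occupies 16 blocks and consumes 3000MB of raw storage.
print(hdfs_footprint(1000))
```

The last block of a file is usually partial, which is why the raw footprint is computed from the file size rather than from whole blocks.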
MapReduce/Hadoop in a nutshell
[Diagram: input data is divided into splits (Input split 1–3), each holding records (Record 1–9); Map tasks (Task 1–3) process the records; intermediate pairs pass through Sort/Shuffle/Merge; Reduce tasks write the aggregated results to the output data]
• Map takes <k1, v1> and transforms it to <k2, v2> pairs.
• Shuffle/Sort takes <k2, v2> and transforms it to <k2, list(v2)>.
• Reduce takes <k2, list(v2)> and transforms it to <k3, v3>.
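The three transformations can be mimicked in a few lines of plain Python. This is a toy simulation of the programming model, not Hadoop itself; MIME-type counting is used as the example, echoing the web-archive scenario:

```python
from itertools import groupby
from operator import itemgetter

# Toy MapReduce: count records per MIME type.
records = ["image/jpg", "image/gif", "text/html", "text/html", "audio/midi"]

# Map: <k1, v1> -> list of <k2, v2> pairs
mapped = [(mime, 1) for mime in records]

# Shuffle/Sort: <k2, v2> -> <k2, list(v2)>
mapped.sort(key=itemgetter(0))
grouped = {k: [v for _, v in g] for k, g in groupby(mapped, key=itemgetter(0))}

# Reduce: <k2, list(v2)> -> <k3, v3>
counts = {k: sum(vs) for k, vs in grouped.items()}
# counts == {"audio/midi": 1, "image/gif": 1, "image/jpg": 1, "text/html": 2}
```

In Hadoop the same three phases run distributed over the input splits; only the map and reduce functions are user code.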
Cluster Set-up
Platform Deployment
• There is no prescribed deployment model
• Private, institutionally-shared, external data center
• Possible to deploy on “bare-metal” or using
virtualization and cloud middleware.
• Platform Environment packaged as VM image
• Automated and scalable deployment.
• Presently supporting Eucalyptus (and AWS) clouds.
• SCAPE provides two shared Platform instances
• Stable non-virtualized data-center cluster
• Private-cloud based development cluster
• Partitioning and dynamic reconfiguration
Deploying Environments
• IaaS enables packaging and dynamic deployment of (complex)
software environments
• But requires a complex virtualization infrastructure
• Data-intensive technology is able to deal with a constantly
varying number of cluster nodes.
• Node failures are expected and automatically handled
• System can grow/shrink on demand
• A Network Attached Storage solution can be used as data source
• But does not meet the scalability and performance needs of the computation
• SCAPE Hadoop Clusters
• Linux + Preservation tools + SCAPE Hadoop libraries
• Optionally Higher-level services (repository, workflow, …)
ONB Experimental Cluster
• Name Node / Job Tracker:
CPU: 2 x 2.40GHz Quadcore CPU (16 HyperThreading cores)
RAM: 24GB
DISK: 3 x 1TB DISKs configured as RAID5 (redundancy) – 2 TB effective
• Data Nodes / Task Trackers (5 worker nodes):
CPU: 1 x 2.53GHz Quadcore CPU (8 HyperThreading cores)
RAM: 16GB
DISK: 2 x 1TB DISKs configured as RAID0 (performance) – 2 TB effective
• Per worker node: 5 cores for Map; 2 for Reduce; 1 for the operating system.
 25 processing cores for Map tasks and
 10 cores for Reduce tasks across the cluster
SCAPE Shared Clusters
• AIT (dev. cluster)
• 10 dual core nodes, 4 six-core
nodes, ~85 TB disk storage.
• Xen and Eucalyptus virtualization
and cloud management
• IMF (central instance)
• Low consumption machines in
NoRack column
• dual core AMD 64-bit processor,
8GB RAM, 15TB on 5 disks
• production data center facility
Using the Cluster
• Wrapping Sequential Tools
• Using a wrapper script (Hadoop Streaming API)
• PT’s generic Java wrapper allows one to use pre-defined
patterns (based on toolspec language)
• Works well for processing a moderate number of files
• e.g. applying migration tools or FITS.
• Writing a custom MapReduce application
• Much more powerful and usually performs better.
• Suitable for more complex problems and file formats, such
as Web archives.
• Using a High-level Language like Hive and Pig
• Very useful to perform analysis of (semi-)structured data,
e.g. characterization output.
• Preservation tools and libraries are pre-packaged so they
can be automatically deployed on cluster nodes
• SCAPE Debian Packages
• Supporting SCAPE Tool Specification Language
• MapReduce libs for processing large container files
• For example METS and (W)arc RecordReader
• Application Scripts
• Based on Apache Hive, Pig, Mahout
• Software components to assemble complex data-parallel
workflows
• Taverna and Oozie Workflows
Available Tools
Sequential Workflows
• In order to run a workflow (or activity) on the cluster it will
have to be parallelized first!
• A number of different parallelization strategies exist
• Approach typically determined on a case-by-case basis
• May lead to changes of activities, workflow structure, or
the entire application.
• Automated parallelization will only work to a certain degree
• Trivial workflows can be deployed and executed without
requiring individual parallelization (wrapper approach).
• SCAPE driver program for parallelizing Taverna workflows.
• SCAPE template workflows have been developed for different
institutional scenarios.
Parallel Workflows
• Are typically derived from sequential (conceptual) workflows
created for a desktop environment (but may differ
substantially!).
• Rely on MapReduce as the parallel programming model and
Apache Hadoop as execution environment
• Data decomposition is handled by Hadoop framework based
on input format handlers (e.g text, warc, mets-xml, etc. )
• Can make use of a workflow engine (like Taverna and Oozie)
for orchestrating complex (composite) processes.
• May include interactions with data management systems (repositories)
and sequential (concurrently executed) tools.
• Tool invocations are based on an API or command-line interface and
performed as part of a MapReduce application.
MapRed Tool Wrapper
Tool Specification Language
• The SCAPE Tool Specification Language (toolspec) provides a
schema to formalize command line tool invocations.
• Can be used to automate a complex tool invocation (many
arguments) based on a keyword (e.g. ps2pdfs)
• Provides a simple and flexible mechanism to define the tool
dependencies, for example, of a workflow.
• Can be resolved by the execution system using Linux
packages.
• The toolspec is minimalistic and can be easily created for
individual tools and scripts.
• Tools provided as SCAPE Debian packages come with a
toolspec document by default.
Ghostscript Example
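The original slide is an image of a Ghostscript toolspec that is not reproduced here. As a stand-in, the sketch below illustrates the idea only: apart from the <command> element, which appears verbatim later in this deck, the element and attribute names are illustrative assumptions rather than the exact SCAPE toolspec schema.

```xml
<!-- Illustrative sketch only: element/attribute names other than
     <command> are assumptions, not the exact SCAPE toolspec schema. -->
<tool name="ghostscript">
  <operation name="ps2pdf">
    <description>Convert PostScript to PDF using Ghostscript</description>
    <command>gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=${output} ${input}</command>
  </operation>
</tool>
```

The execution system substitutes ${input} and ${output} and invokes the command by its keyword.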
MapRed Toolwrapper
• Hadoop provides scalability, reliability, and robustness,
supporting the processing of data that does not fit on a single
machine.
• Applications must, however, be made compliant with the
execution environment.
• Our intention was to provide a wrapper that allows one to
execute a command-line tool on the cluster in much the same way
as on a desktop environment.
• User simply specifies toolspec file, command name, and payload
data.
• Supports HDFS references and (optionally) standard IO streams.
• Supports the SCAPE toolspec to execute preinstalled tools or
other applications available via OS command-line interface.
Hadoop Streaming API
• Hadoop streaming API supports the execution of scripts (e.g.
bash or python) which are automatically translated and
executed as MapReduce applications.
• Can be used to process data with common UNIX filters using
commands like echo, awk, tr.
• Hadoop is designed to process its input based on key/value
pairs. This means the input data is interpreted and split by the
framework.
• Perfect for processing text but difficult to process binary data.
• The streaming API uses streams to read/write from/to HDFS.
• Preservation tools typically do not support HDFS file pointers
and/or IO streaming through stdin/stdout.
• Hence, DP tools are largely unusable with the streaming API.
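For reference, a minimal streaming-style mapper/reducer pair sketched in Python. The jar name in the comment is indicative (it varies by Hadoop version), and the extension-counting task is a made-up example of the line-in/line-out contract the streaming API expects:

```python
# Minimal Hadoop-Streaming-style mapper/reducer sketch: count file
# extensions. Streaming feeds lines on stdin and expects
# tab-separated key/value pairs on stdout.
#
# Indicative invocation (jar name/path varies by Hadoop version):
#   hadoop jar hadoop-streaming.jar \
#     -input /paths.txt -output /out \
#     -mapper "python mapper.py" -reducer "python reducer.py"

def map_line(line):
    """Map: emit 'extension\\t1' for one file path line."""
    path = line.strip()
    ext = path.rsplit(".", 1)[-1] if "." in path else "none"
    return f"{ext}\t1"

def reduce_pairs(lines):
    """Reduce: sum the counts per key (collected in a dict for simplicity;
    real streaming reducers can rely on the shuffle's sorted input)."""
    counts = {}
    for line in lines:
        key, value = line.split("\t")
        counts[key] = counts.get(key, 0) + int(value)
    return counts

sample = ["a.jp2", "b.jp2", "c.html"]
mapped = sorted(map_line(l) for l in sample)   # shuffle/sort
result = reduce_pairs(mapped)
# result == {"html": 1, "jp2": 2}
```

Binary payloads do not fit this line-oriented contract, which is exactly the limitation the bullets above describe.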
Suitable Use-Cases
• Use MapRed Toolwrapper when dealing with (a large number
of) single files.
• Be aware that this may not be an ideal strategy and there
are more efficient ways to deal with many files on Hadoop
(Sequence Files, Hbase, etc. ).
• However, practical and sufficient in many cases, as there is
no additional application development required.
• A typical example is file format migration on a moderate
number of files (e.g. 100,000s), which can be included in a
workflow with additional QA components.
• Very helpful when payload is simply too big to be computed
on a single machine.
Example – Exploring an uncompressed WARC
• Unpacked a 1GB WARC.GZ on a local computer
• 2.2 GB unpacked => 343,288 files
• `ls` took ~40s
• Counting *.html files with `file` took ~4 hrs => 60,000 HTML files
• Provided the corresponding bash command as a toolspec:
• <command>if [ "$(file ${input} | awk '{print $2}')" == "HTML" ]; then
echo "HTML"; fi</command>
• Moved data to HDFS and executed pt-mapred with toolspec.
• 236min on local file system
• 160min with 1 mapper on HDFS (this was a surprise!)
• 85min (2), 52min (4), 27min (8)
• 26min with 8 mappers and IO streaming (also a surprise)
Ongoing Work
• Source project and README on Github presently under
openplanets/scape/pt-mapred*
• Will be migrated to its own repository soon.
• Presently it is required to generate an input file that specifies input
file paths (along with optional output file names).
• TODO: read binary input directly from an input directory path,
allowing Hadoop to take advantage of data locality.
• Input/output streaming and piping between toolspec
commands has already been implemented.
• TODO: Add support for Hadoop Sequence Files.
• Look into possible integration with Hadoop Streaming API.
* https://github.com/openplanets/scape/tree/master/pt-mapred
Example Workflows
What we mean by Workflow
• Formalized (and repeatable) processes/experiments consisting
of one or more activities interpreted by a workflow engine.
• Usually modeled as DAGs based on control-flow and/or
data-flow logic.
• Workflow engine functions as a coordinator/scheduler that
triggers the execution of the involved activities
• May be performed by a desktop or server-side
component.
• Example workflow engines are Taverna workbench, Taverna
server, and Apache Oozie.
• Not equally rich and designed for different purposes:
experimentation & science, SOA, Hadoop integration.
Taverna
• A workflow language and graphical editing environment based
on a dataflow model.
• Linking activities (tools, web services) based on data pipes.
• High level workflow diagram abstracting low level
implementation details
• Think of a workflow as a kind of configurable script.
• Easier to explain, share, reuse and repurpose.
• Taverna workbench provides a desktop environment to run
instances of that language.
• Workflows can also be run in headless and server mode.
• It doesn't necessarily run on a grid, cloud, or cluster but can be
used to interact with those resources.
• Extract TIFF Metadata with
Matchbox and Jpylyzer
• Perform OpenJpeg
TIFF to JP2 migration
• Extract JP2 Metadata with
Matchbox and Jpylyzer
• Validation based on Jpylyzer
profiles
• Compare SIFT image
features to test visual
similarity
• Generate Report
Image Migration #1
• No significant changes in
workflow structure
compared to sequential
workflow.
• Orchestrating remote
activities using Taverna’s
Tool Plugin over SSH.
• Using the Platform's MapRed
toolwrapper to invoke command-line
tools on the cluster
Image Migration #2
Command: hadoop jar mpt-mapred.jar
-j $jobname -i $infile -r toolspecs
WARC Identification #1
[Diagram: a (W)ARC container holding JPG, GIF, HTM, and MID records is read by a (W)ARC RecordReader based on the HERITRIX web crawler (read/write (W)ARC); in the Map phase each record's MIME type is detected with Apache Tika (e.g. image/jpg); the Reduce phase aggregates the counts: image/jpg 1, image/gif 1, text/html 2, audio/midi 1]
Tool integration pattern                            | Throughput (GB/min)
TIKA detector API call in Map phase                 | 6.17
FILE called as command line tool from map/reduce    | 1.70
TIKA JAR command line tool called from map/reduce   | 0.01

Amount of data | Number of ARC files | Throughput (GB/min)
1 GB           | 10 x 100 MB         | 1.57
2 GB           | 20 x 100 MB         | 2.50
10 GB          | 100 x 100 MB        | 3.06
20 GB          | 200 x 100 MB        | 3.40
100 GB         | 1000 x 100 MB       | 3.71
WARC Identification #2
[Chart: identification throughput, DROID 6.01 vs. TIKA 1.0]
• ETL processing of 60,000 books, ~24 million pages
• Using Taverna's "Tool service" (remote ssh execution)
• Orchestration of different types of hadoop jobs
• Hadoop-Streaming-API
• Hadoop Map/Reduce
• Hive
• Workflow available on myExperiment:
http://www.myexperiment.org/workflows/3105
• See Blogpost:
http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-
processing-chaining-hadoop-jobs-using-taverna
Quality Assurance #1
• Create input text files
containing file paths (JP2 &
HTML)
• Read image metadata using
Exiftool (Hadoop Streaming
API)
• Create sequence file
containing all HTML files
• Calculate average block
width using MapReduce
• Load data in Hive tables
• Execute SQL test query
Quality Assurance #2
Quality Assurance – Using Apache Oozie
Quality Assurance #3 – Using Apache Oozie
• Remote Workflow
scheduler for Hadoop
• Accessible via REST interface
• Control-flow oriented
Workflow language
• Well integrated with Hadoop
stack (MapRed, Pig, HDFS).
• Hadoop API called directly;
no more ssh interaction required.
• Deals with classpath
problems and different
library versions.
Conclusions &
Resources
• When dealing with large amounts of data, in terms of #files,
#objects, #records, or #TB of storage, traditional data management
techniques begin to fail (file system operations, databases, tools, etc.).
• Scalability and robustness are key.
• Data-intensive technologies can help a great deal but do not
support desktop tools and workflows used in many domains
out of the box.
• SCAPE has ported a number of preservation scenarios identified
by its user groups from sequential workflows to a scalable
(Hadoop-based) environment.
• The required effort can vary a lot depending on the
infrastructure in place, the nature of the data, scale, complexity,
and required performance.
Conclusions
• Project website: www.scape-project.eu
• Github: https://github.com/openplanets/
• SCAPE Group on MyExperiment: http://www.myexperiment.org
• SCAPE tools: http://www.scape-project.eu/tools
• SCAPE on Slideshare: http://www.slideshare.net/SCAPEproject
• SCAPE Application Areas at Austrian National Library:
• http://www.slideshare.net/SvenSchlarb/elag2013-schlarb
• Submission and execution of SCAPE workflows:
• http://www.scape-project.eu/deliverable/d5-2-job-
submission-language-and-interface
Resources
Thank you! Questions?
Backup Slides
Reading image metadata
• find (reading files from NAS) produces a list of JP2 paths (1.4 GB), e.g.:
/NAS/Z119585409/00000001.jp2
/NAS/Z119585409/00000002.jp2
/NAS/Z119585409/00000003.jp2
…
• Jp2PathCreator → HadoopStreamingExiftoolRead produces (page id, width) pairs (1.2 GB), e.g.:
Z119585409/00000001 2345
Z119585409/00000002 2340
Z119585409/00000003 2543
…
• 60,000 books (24 million pages): ~5 h + ~38 h = ~43 h
SequenceFile creation
• find (reading files from NAS) produces a list of HTML paths (1.4 GB), e.g.:
/NAS/Z119585409/00000707.html
/NAS/Z119585409/00000708.html
/NAS/Z119585409/00000709.html
…
• HtmlPathCreator → SequenceFileCreator packs the HTML files into a SequenceFile keyed by page id (e.g. Z119585409/00000707); 997 GB uncompressed.
• 60,000 books (24 million pages): ~5 h + ~24 h = ~29 h
Calculate average block width using MapReduce
• HadoopAvBlockWidthMapReduce reads the SequenceFile; the Map phase emits (page id, block width) pairs, e.g.:
Z119585409/00000001 2100
Z119585409/00000001 2200
Z119585409/00000001 2300
Z119585409/00000001 2400
• The Reduce phase averages the widths per page into a text file, e.g.:
Z119585409/00000001 2250
• 60,000 books (24 million pages): ~6 h
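The Map/Reduce step above reduces to a group-by-key average; a toy Python equivalent, using the sample values from the slide:

```python
from collections import defaultdict

# Toy equivalent of the average-block-width MapReduce step:
# group the emitted (page id, block width) pairs by page id,
# then average per page. Sample values taken from the slide.
pairs = [
    ("Z119585409/00000001", 2100),
    ("Z119585409/00000001", 2200),
    ("Z119585409/00000001", 2300),
    ("Z119585409/00000001", 2400),
]

grouped = defaultdict(list)
for page_id, width in pairs:        # shuffle/sort: group by key
    grouped[page_id].append(width)

averages = {k: sum(v) // len(v) for k, v in grouped.items()}   # reduce
# averages == {"Z119585409/00000001": 2250}
```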
HiveLoadExifData & HiveLoadHocrData
CREATE TABLE jp2width (jid STRING, jwidth INT)
CREATE TABLE htmlwidth (hid STRING, hwidth INT)

jp2width (jid, jwidth):          htmlwidth (hid, hwidth):
Z119585409/00000001 2250         Z119585409/00000001 1870
Z119585409/00000002 2150         Z119585409/00000002 2100
Z119585409/00000003 2125         Z119585409/00000003 2015
Z119585409/00000004 2125         Z119585409/00000004 1350
Z119585409/00000005 2250         Z119585409/00000005 1700

Analytic Queries
HiveSelect
select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid

jid                  jwidth  hwidth
Z119585409/00000001  2250    1870
Z119585409/00000002  2150    2100
Z119585409/00000003  2125    2015
Z119585409/00000004  2125    1350
Z119585409/00000005  2250    1700

Analytic Queries
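The Hive query behaves like a standard SQL inner join, so the same result can be checked with any SQL engine. A small sqlite3 sketch using two of the sample rows from the slides:

```python
import sqlite3

# Reproduce the Hive inner join locally with sqlite3,
# using sample rows from the slides.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE jp2width (jid TEXT, jwidth INT)")
con.execute("CREATE TABLE htmlwidth (hid TEXT, hwidth INT)")
con.executemany("INSERT INTO jp2width VALUES (?, ?)", [
    ("Z119585409/00000001", 2250),
    ("Z119585409/00000002", 2150),
])
con.executemany("INSERT INTO htmlwidth VALUES (?, ?)", [
    ("Z119585409/00000001", 1870),
    ("Z119585409/00000002", 2100),
])
rows = con.execute(
    "SELECT jid, jwidth, hwidth FROM jp2width "
    "INNER JOIN htmlwidth ON jid = hid"
).fetchall()
# rows pairs each page's JP2 width with its HTML block width,
# ready for the cropping-error quality check.
```

On the cluster, Hive runs the equivalent join as MapReduce jobs over the loaded tables.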
Big Data Hoopla Simplified - TDWI Memphis 2014
 
HDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the CloudHDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the Cloud
 
High-Scale Entity Resolution in Hadoop
High-Scale Entity Resolution in HadoopHigh-Scale Entity Resolution in Hadoop
High-Scale Entity Resolution in Hadoop
 
GDAL Enhancement for ESDIS Project
GDAL Enhancement for ESDIS ProjectGDAL Enhancement for ESDIS Project
GDAL Enhancement for ESDIS Project
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
HDF Update
HDF UpdateHDF Update
HDF Update
 
How do you decide where your customer was?
How do you decide where your customer was?How do you decide where your customer was?
How do you decide where your customer was?
 
PostgreSQL Finland October meetup - PostgreSQL monitoring in Zalando
PostgreSQL Finland October meetup - PostgreSQL monitoring in ZalandoPostgreSQL Finland October meetup - PostgreSQL monitoring in Zalando
PostgreSQL Finland October meetup - PostgreSQL monitoring in Zalando
 

En vedette

Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...SCAPE Project
 
Taverna and myExperiment. SCAPE presentation at a Hack-a-thon
Taverna and myExperiment. SCAPE presentation at a Hack-a-thonTaverna and myExperiment. SCAPE presentation at a Hack-a-thon
Taverna and myExperiment. SCAPE presentation at a Hack-a-thonSCAPE Project
 
Matchbox tool. Quality control for digital collections – SCAPE Training event...
Matchbox tool. Quality control for digital collections – SCAPE Training event...Matchbox tool. Quality control for digital collections – SCAPE Training event...
Matchbox tool. Quality control for digital collections – SCAPE Training event...SCAPE Project
 
TAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionalityTAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionalitySCAPE Project
 
Planets, OPF & SCAPE - presentation of tools on digital preservation
Planets, OPF & SCAPE - presentation of tools on digital preservationPlanets, OPF & SCAPE - presentation of tools on digital preservation
Planets, OPF & SCAPE - presentation of tools on digital preservationSCAPE Project
 
Characterisation - 101. An introduction to the identification and characteris...
Characterisation - 101. An introduction to the identification and characteris...Characterisation - 101. An introduction to the identification and characteris...
Characterisation - 101. An introduction to the identification and characteris...SCAPE Project
 
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014SCAPE Project
 
Digital Preservation Policies - SCAPE
Digital Preservation Policies - SCAPEDigital Preservation Policies - SCAPE
Digital Preservation Policies - SCAPESCAPE Project
 
SCAPE Preservation Platform. Design and Deployment
SCAPE Preservation Platform. Design and DeploymentSCAPE Preservation Platform. Design and Deployment
SCAPE Preservation Platform. Design and DeploymentSCAPE Project
 
Jpylyzer, a validation and feature extraction tool developed in SCAPE project
Jpylyzer, a validation and feature extraction tool developed in SCAPE projectJpylyzer, a validation and feature extraction tool developed in SCAPE project
Jpylyzer, a validation and feature extraction tool developed in SCAPE projectSCAPE Project
 
Audio Quality Assurance. An application of cross correlation
Audio Quality Assurance. An application of cross correlationAudio Quality Assurance. An application of cross correlation
Audio Quality Assurance. An application of cross correlationSCAPE Project
 
Duplicate detection for quality assurance of document image collections
Duplicate detection for quality assurance of document image collectionsDuplicate detection for quality assurance of document image collections
Duplicate detection for quality assurance of document image collectionsSCAPE Project
 
PDF/A-3 for preservation. Notes on embedded files and JPEG2000
PDF/A-3 for preservation. Notes on embedded files and JPEG2000PDF/A-3 for preservation. Notes on embedded files and JPEG2000
PDF/A-3 for preservation. Notes on embedded files and JPEG2000SCAPE Project
 
Presentation of SCAPE Project
Presentation of SCAPE ProjectPresentation of SCAPE Project
Presentation of SCAPE ProjectSCAPE Project
 
Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...
Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...
Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...SCAPE Project
 
Quality assurance for document image collections in digital preservation
Quality assurance for document image collections in digital preservation Quality assurance for document image collections in digital preservation
Quality assurance for document image collections in digital preservation SCAPE Project
 
SCAPE - Building Digital Preservation Infrastructure
SCAPE - Building Digital Preservation InfrastructureSCAPE - Building Digital Preservation Infrastructure
SCAPE - Building Digital Preservation InfrastructureSCAPE Project
 
SCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with HadoopSCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with HadoopSCAPE Project
 
Evolving Domains, Problems and Solutions for Long Term Digital Preservation
Evolving Domains, Problems and Solutions for Long Term Digital PreservationEvolving Domains, Problems and Solutions for Long Term Digital Preservation
Evolving Domains, Problems and Solutions for Long Term Digital PreservationSCAPE Project
 

En vedette (19)

Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
 
Taverna and myExperiment. SCAPE presentation at a Hack-a-thon
Taverna and myExperiment. SCAPE presentation at a Hack-a-thonTaverna and myExperiment. SCAPE presentation at a Hack-a-thon
Taverna and myExperiment. SCAPE presentation at a Hack-a-thon
 
Matchbox tool. Quality control for digital collections – SCAPE Training event...
Matchbox tool. Quality control for digital collections – SCAPE Training event...Matchbox tool. Quality control for digital collections – SCAPE Training event...
Matchbox tool. Quality control for digital collections – SCAPE Training event...
 
TAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionalityTAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionality
 
Planets, OPF & SCAPE - presentation of tools on digital preservation
Planets, OPF & SCAPE - presentation of tools on digital preservationPlanets, OPF & SCAPE - presentation of tools on digital preservation
Planets, OPF & SCAPE - presentation of tools on digital preservation
 
Characterisation - 101. An introduction to the identification and characteris...
Characterisation - 101. An introduction to the identification and characteris...Characterisation - 101. An introduction to the identification and characteris...
Characterisation - 101. An introduction to the identification and characteris...
 
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
 
Digital Preservation Policies - SCAPE
Digital Preservation Policies - SCAPEDigital Preservation Policies - SCAPE
Digital Preservation Policies - SCAPE
 
SCAPE Preservation Platform. Design and Deployment
SCAPE Preservation Platform. Design and DeploymentSCAPE Preservation Platform. Design and Deployment
SCAPE Preservation Platform. Design and Deployment
 
Jpylyzer, a validation and feature extraction tool developed in SCAPE project
Jpylyzer, a validation and feature extraction tool developed in SCAPE projectJpylyzer, a validation and feature extraction tool developed in SCAPE project
Jpylyzer, a validation and feature extraction tool developed in SCAPE project
 
Audio Quality Assurance. An application of cross correlation
Audio Quality Assurance. An application of cross correlationAudio Quality Assurance. An application of cross correlation
Audio Quality Assurance. An application of cross correlation
 
Duplicate detection for quality assurance of document image collections
Duplicate detection for quality assurance of document image collectionsDuplicate detection for quality assurance of document image collections
Duplicate detection for quality assurance of document image collections
 
PDF/A-3 for preservation. Notes on embedded files and JPEG2000
PDF/A-3 for preservation. Notes on embedded files and JPEG2000PDF/A-3 for preservation. Notes on embedded files and JPEG2000
PDF/A-3 for preservation. Notes on embedded files and JPEG2000
 
Presentation of SCAPE Project
Presentation of SCAPE ProjectPresentation of SCAPE Project
Presentation of SCAPE Project
 
Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...
Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...
Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...
 
Quality assurance for document image collections in digital preservation
Quality assurance for document image collections in digital preservation Quality assurance for document image collections in digital preservation
Quality assurance for document image collections in digital preservation
 
SCAPE - Building Digital Preservation Infrastructure
SCAPE - Building Digital Preservation InfrastructureSCAPE - Building Digital Preservation Infrastructure
SCAPE - Building Digital Preservation Infrastructure
 
SCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with HadoopSCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with Hadoop
 
Evolving Domains, Problems and Solutions for Long Term Digital Preservation
Evolving Domains, Problems and Solutions for Long Term Digital PreservationEvolving Domains, Problems and Solutions for Long Term Digital Preservation
Evolving Domains, Problems and Solutions for Long Term Digital Preservation
 

Similaire à Scalable Preservation Workflows

SCAPE Presentation at the Elag2013 conference in Gent/Belgium
SCAPE Presentation at the Elag2013 conference in Gent/BelgiumSCAPE Presentation at the Elag2013 conference in Gent/Belgium
SCAPE Presentation at the Elag2013 conference in Gent/BelgiumSven Schlarb
 
SCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation EnvironmentsSCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation EnvironmentsSCAPE Project
 
CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4Michael Kehoe
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentationargonauts007
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
Sa introduction to big data pipelining with cassandra &amp; spark   west mins...Sa introduction to big data pipelining with cassandra &amp; spark   west mins...
Sa introduction to big data pipelining with cassandra &amp; spark west mins...Simon Ambridge
 
Im symposium presentation - OCR and Text analytics for Medical Chart Review ...
Im symposium presentation -  OCR and Text analytics for Medical Chart Review ...Im symposium presentation -  OCR and Text analytics for Medical Chart Review ...
Im symposium presentation - OCR and Text analytics for Medical Chart Review ...Alex Zeltov
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven productsLars Albertsson
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptxbetalab
 
Bootcamp Data Science using Cloudera
Bootcamp Data Science using ClouderaBootcamp Data Science using Cloudera
Bootcamp Data Science using ClouderaAntónio Rodrigues
 
Data Analysis on AWS
Data Analysis on AWSData Analysis on AWS
Data Analysis on AWSPaolo latella
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem DataWorks Summit/Hadoop Summit
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Gyula Fóra
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...ssuserd3a367
 
Data processing with spark in r &amp; python
Data processing with spark in r &amp; pythonData processing with spark in r &amp; python
Data processing with spark in r &amp; pythonMaloy Manna, PMP®
 
Data Science with the Help of Metadata
Data Science with the Help of MetadataData Science with the Help of Metadata
Data Science with the Help of MetadataJim Dowling
 
GEO Analytics Canada Overview April 2020
GEO Analytics Canada Overview April 2020GEO Analytics Canada Overview April 2020
GEO Analytics Canada Overview April 2020GEO Analytics Canada
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09Chris Purrington
 

Similaire à Scalable Preservation Workflows (20)

SCAPE Presentation at the Elag2013 conference in Gent/Belgium
SCAPE Presentation at the Elag2013 conference in Gent/BelgiumSCAPE Presentation at the Elag2013 conference in Gent/Belgium
SCAPE Presentation at the Elag2013 conference in Gent/Belgium
 
SCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation EnvironmentsSCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation Environments
 
Big Data training
Big Data trainingBig Data training
Big Data training
 
CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentation
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
Sa introduction to big data pipelining with cassandra &amp; spark   west mins...Sa introduction to big data pipelining with cassandra &amp; spark   west mins...
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
 
Im symposium presentation - OCR and Text analytics for Medical Chart Review ...
Im symposium presentation -  OCR and Text analytics for Medical Chart Review ...Im symposium presentation -  OCR and Text analytics for Medical Chart Review ...
Im symposium presentation - OCR and Text analytics for Medical Chart Review ...
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven products
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptx
 
Bootcamp Data Science using Cloudera
Bootcamp Data Science using ClouderaBootcamp Data Science using Cloudera
Bootcamp Data Science using Cloudera
 
Data Analysis on AWS
Data Analysis on AWSData Analysis on AWS
Data Analysis on AWS
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
 
Data processing with spark in r &amp; python
Data processing with spark in r &amp; pythonData processing with spark in r &amp; python
Data processing with spark in r &amp; python
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
Data Science with the Help of Metadata
Data Science with the Help of MetadataData Science with the Help of Metadata
Data Science with the Help of Metadata
 
GEO Analytics Canada Overview April 2020
GEO Analytics Canada Overview April 2020GEO Analytics Canada Overview April 2020
GEO Analytics Canada Overview April 2020
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 

Plus de SCAPE Project

SCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with NaniteSCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with NaniteSCAPE Project
 
SCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs AvailableSCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs AvailableSCAPE Project
 
SCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation ToolSCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation ToolSCAPE Project
 
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositoriesSCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositoriesSCAPE Project
 
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...SCAPE Project
 
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...SCAPE Project
 
Hadoop and its applications at the State and University Library, SCAPE Inform...
Hadoop and its applications at the State and University Library, SCAPE Inform...Hadoop and its applications at the State and University Library, SCAPE Inform...
Hadoop and its applications at the State and University Library, SCAPE Inform...SCAPE Project
 
Scape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation EnvironmentsScape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation EnvironmentsSCAPE Project
 
LIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven SchlarbLIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven SchlarbSCAPE Project
 
Content profiling and C3PO
Content profiling and C3POContent profiling and C3PO
Content profiling and C3POSCAPE Project
 
Control policy formulation
Control policy formulationControl policy formulation
Control policy formulationSCAPE Project
 
Preservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, AarhusPreservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, AarhusSCAPE Project
 
An image based approach for content analysis in document collections
An image based approach for content analysis in document collectionsAn image based approach for content analysis in document collections
An image based approach for content analysis in document collectionsSCAPE Project
 
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...SCAPE Project
 
Automatic Preservation Watch
Automatic Preservation WatchAutomatic Preservation Watch
Automatic Preservation WatchSCAPE Project
 
Policy levels in SCAPE
Policy levels in SCAPEPolicy levels in SCAPE
Policy levels in SCAPESCAPE Project
 
Large scale preservation workflows with Taverna – SCAPE Training event, Guima...
Large scale preservation workflows with Taverna – SCAPE Training event, Guima...Large scale preservation workflows with Taverna – SCAPE Training event, Guima...
Large scale preservation workflows with Taverna – SCAPE Training event, Guima...SCAPE Project
 
Digital Preservation - The Saga Continues - SCAPE Training event, Guimarães 2012
Digital Preservation - The Saga Continues - SCAPE Training event, Guimarães 2012Digital Preservation - The Saga Continues - SCAPE Training event, Guimarães 2012
Digital Preservation - The Saga Continues - SCAPE Training event, Guimarães 2012SCAPE Project
 

Plus de SCAPE Project (19)

C sz z6
C sz z6C sz z6
C sz z6
 
SCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with NaniteSCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with Nanite
 
SCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs AvailableSCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs Available
 
SCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation ToolSCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation Tool
 
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositoriesSCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
 
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
 
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
 
Hadoop and its applications at the State and University Library, SCAPE Inform...
Hadoop and its applications at the State and University Library, SCAPE Inform...Hadoop and its applications at the State and University Library, SCAPE Inform...
Hadoop and its applications at the State and University Library, SCAPE Inform...
 
Scape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation EnvironmentsScape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation Environments
 
LIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven SchlarbLIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven Schlarb
 
Content profiling and C3PO
Content profiling and C3POContent profiling and C3PO
Content profiling and C3PO
 
Control policy formulation
Control policy formulationControl policy formulation
Control policy formulation
 
Preservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, AarhusPreservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, Aarhus
 
An image based approach for content analysis in document collections
An image based approach for content analysis in document collectionsAn image based approach for content analysis in document collections
An image based approach for content analysis in document collections
 
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
 
Automatic Preservation Watch
Automatic Preservation WatchAutomatic Preservation Watch
Automatic Preservation Watch
 
Policy levels in SCAPE
Policy levels in SCAPEPolicy levels in SCAPE
Policy levels in SCAPE
 
Large scale preservation workflows with Taverna – SCAPE Training event, Guima...
Scalable Preservation Workflows

  • 1. SCA PE Rainer Schmidt DP Advanced Practitioners Training July 16th, 2013 University of Glasgow Scalable Preservation Workflows design, parallelisation, and execution
  • 2. SCAlable Preservation Environments SCAPE 2 • European Commission FP7 Integrated Project • 16 Organizations, 8 Countries • 42 months: February 2011 – July 2014 • Budget: 11.3 Million Euro (8.6 Million Euro funded) • Consortium: data centers, memory institutions, research centers, universities & commercial partners • recently extended to involve HPC computing centers. • Dealing with (digital) preservation processes at scale • such as ingestion, migration, analysis and monitoring of digital data sets • Focus on scalability, robustness, and automation. The Project
  • 3. SCAlable Preservation Environments SCAPE 3 What I will show you • Example Scenarios from the SCAPE DL Testbed and how they are formalized using Workflow Technology • Introduction to the SCAPE Platform. Underlying technologies, preservation services, and how to set it up. • How the paradigm differs from a client-server set-up, and whether you can execute a standard tool against your data. • How to create scalable workflows and execute them on the platform. • A practical demonstration (and available VM) for creating and running such workflows.
  • 5. SCAlable Preservation Environments SCAPE 5 • Ability to process large and complex data sets in preservation scenarios • Increasing amount of data in data centers and memory institutions Volume, Velocity, and Variety of data 1970 2000 2030 cf. Jisc (2012) Activity Data: Delivering benefits from the data deluge. available at http://www.jisc.ac.uk/publications/reports/2012/activity-data-delivering-benefits.aspx Motivation
  • 6. SCAlable Preservation Environments SCAPE Austrian National Library (ONB) • Web Archiving • Scenario 1: Web Archive Mime Type Identification • Austrian Books Online • Scenario 2: Image File Format Migration • Scenario 3: Comparison of Book Derivatives • Scenario 4: MapReduce in Digitised Book Quality Assurance
  • 7. SCAlable Preservation Environments SCAPE • Physical storage 19 TB • Raw data 32 TB • Number of objects 1.241.650.566 • Domain harvesting • Entire top-level-domain .at every 2 years • Selective harvesting • Interesting frequently changing websites • Event harvesting • Special occasions and events (e.g. elections) Web Archiving - File Format identification
  • 8. SCAlable Preservation Environments SCAPE • Public private partnership with Google Inc. • Only public domain • Objective to scan ~ 600.000 Volumes • ~ 200 Mio. pages • ~ 70 project team members • 20+ in core team • ~ 130K physical volumes scanned • ~ 40 Mio pages Austrian Books Online
  • 9. SCAlable Preservation Environments SCAPE Digitisation Download & Storage Quality Control Access 9 https://confluence.ucop.edu/display/Curation/PairTree Google Public Private Partnership ADOCO
  • 10. SCAlable Preservation Environments SCAPE • Task: Image file format migration • TIFF to JPEG2000 migration • Objective: Reduce storage costs by reducing the size of the images • JPEG2000 to TIFF migration • Objective: Mitigation of the JPEG2000 file format obsolescence risk • Challenges: • Integrating validation, migration, and quality assurance • Compute-intensive quality assurance Image file format migration
  • 11. SCAlable Preservation Environments SCAPE Comparison of book derivatives – Matchbox tool • Quality Assurance for different book versions • Images have been manipulated (cropped, rotated) and stored in different locations • Images subject to different modification procedures • Detailed image comparison and detection of near duplicates and corresponding images • Feature extraction invariant under color space, scale, rotation, cropping • Detecting visual keypoints and structural similarity • Automated Quality Assurance workflows • Austrian National Library - Book scan project • The British Library - “Dunhuang” manuscripts
  • 12. SCAlable Preservation Environments SCAPE Data Preparation and QA • Goal: Preparing large document collections for data analysis. • Example: Detecting quality issues due to cropping errors. • Large volumes of HTML files generated as part of a book collection • Representing layout and text of corresponding book page • HTML tags representing e.g. width and height of text or image block • QA Workflow using multiple tools • Generate image metadata using Exiftool • Parse HTML and calculate block size of book page • Normalize data and load it into a database • Execute query to detect quality issues
  • 14. SCAlable Preservation Environments SCAPE Goal of the SCAPE Platform • Hardware and software platform to support scalable preservation in terms of computation and storage. • Employing a scale-out architecture to support preservation activities against large amounts of data. • Integration of existing tools, workflows, and data sources and sinks. • A data center service providing a scalable execution and storage backend for different object management systems. • Based on a minimal set of defined services for • processing tools and/or queries close to the data.
  • 15. SCAlable Preservation Environments SCAPE Underlying Technologies • The SCAPE Platform is built on top of existing data-intensive computing technologies. • Reference Implementation leverages Hadoop Software Stack (HDFS, MapReduce, Hive, …) • Virtualization and packaging model for dynamic deployments of tools and environments • Debian packages and IaaS support. • Repository Integration and Services • Data/Storage Connector API (Fedora and Lily) • Object Exchange Format (METS/PREMIS representation) • Workflow modeling, translation, and provisioning. • Taverna Workbench and Component Catalogue • Workflow Compiler and Job Submission Service
  • 16. SCAlable Preservation Environments SCAPE 16 Components of the Platform • Execution Platform • Deploy SCAPE tools and parallelized (WF) applications • Executable via CLI and Service API • Scripts/Drivers aiding integration. • Workflow Support • Describe and validate preservation workflows using a defined component model • Register and semantic search using Component Catalogue • Repository Integration • Fedora implementation on top of CI • Loader Application, Object Model, and Connector APIs.
  • 17. SCAlable Preservation Environments SCAPE Architectural Overview (Core) Component Catalogue Workflow Modeling Environment Component Lookup API Component Registration API
  • 18. SCAlable Preservation Environments SCAPE Architectural Overview (Core) Component Catalogue Workflow Modeling Environment Component Lookup API Component Registration API Focus of this talk
  • 20. SCAlable Preservation Environments SCAPE • Open-source software framework for large-scale data- intensive computations running on large clusters of commodity hardware. • Derived from Google's File System and MapReduce publications. • Hadoop = MapReduce + HDFS • MapReduce: Programming Model (Map, Shuffle/Sort, Reduce) and Execution Environment. • HDFS: Virtual distributed file system overlay on top of local file systems. Hadoop Overview #1
  • 21. SCAlable Preservation Environments SCAPE • Designed for a write-once, read-many-times access model. • Data IO is handled via HDFS. • Data divided into blocks (typically 64MB) and distributed and replicated over data nodes. • Parallelization logic is strictly separated from user program. • Automated data decomposition and communication between processing steps. • Applications benefit from built-in support for data-locality and fail-safety. • Applications scale out on big clusters processing very large data volumes. Hadoop Overview #2
  • 22. SCAlable Preservation Environments SCAPE 22 MapReduce/Hadoop in a nutshell [diagram: input data is divided into splits of records, processed by parallel Map tasks, sorted/shuffled/merged, and aggregated by Reduce tasks into the output data]
  • 23. SCAlable Preservation Environments SCAPE 23 MapReduce/Hadoop in a nutshell: Map takes <k1, v1> and transforms it to <k2, v2> pairs
  • 24. SCAlable Preservation Environments SCAPE 24 MapReduce/Hadoop in a nutshell: Shuffle/Sort takes <k2, v2> and transforms it to <k2, list(v2)>
  • 25. SCAlable Preservation Environments SCAPE 25 MapReduce/Hadoop in a nutshell: Reduce takes <k2, list(v2)> and transforms it to <k3, v3>
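The three phases on the slides above can be sketched in a few lines of plain Python (a toy word count with no Hadoop involved; the function and variable names are ours, not Hadoop API names):

```python
from collections import defaultdict

# Map: each input record -> zero or more <k2, v2> pairs
def map_fn(record):
    for word in record.split():
        yield (word, 1)

# Shuffle/Sort: group <k2, v2> pairs into <k2, list(v2)>
def shuffle(pairs):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

# Reduce: <k2, list(v2)> -> <k3, v3>
def reduce_fn(key, values):
    return (key, sum(values))

records = ["to be or", "not to be"]
mapped = [kv for r in records for kv in map_fn(r)]
reduced = dict(reduce_fn(k, vs) for k, vs in shuffle(mapped).items())
print(reduced)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Hadoop, the framework itself performs the shuffle/sort between the Map and Reduce phases; the user supplies only the two functions.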
  • 27. SCAlable Preservation Environments SCAPE Platform Deployment • There is no prescribed deployment model • Private, institutionally-shared, external data center • Possible to deploy on “bare-metal” or using virtualization and cloud middleware. • Platform Environment packaged as VM image • Automated and scalable deployment. • Presently supporting Eucalyptus (and AWS) clouds. • SCAPE provides two shared Platform instances • Stable non-virtualized data-center cluster • Private-cloud based development cluster • Partitioning and dynamic reconfiguration
  • 28. SCAlable Preservation Environments SCAPE Deploying Environments • IaaS enabling packaging and dynamic deployment of (complex) Software Environments • But requires complex virtualization infrastructure • Data-intensive technology is able to deal with a constantly varying number of cluster nodes. • Node failures are expected and automatically handled • System can grow/shrink on demand • Network Attached Storage solution can be used as data source • But does not meet scalability and performance needs for computation • SCAPE Hadoop Clusters • Linux + Preservation tools + SCAPE Hadoop libraries • Optionally Higher-level services (repository, workflow, …)
  • 29. SCAlable Preservation Environments SCAPE ONB Experimental Cluster Job Tracker, Task Trackers, Data Nodes, Name Node • CPU: 1 x 2.53GHz Quadcore CPU (8 HyperThreading cores), RAM: 16GB, DISK: 2 x 1TB DISKs configured as RAID0 (performance) – 2 TB effective • Of 16 HT cores: 5 for Map; 2 for Reduce; 1 for operating system → 25 processing cores for Map tasks and → 10 cores for Reduce tasks • CPU: 2 x 2.40GHz Quadcore CPU (16 HyperThreading cores), RAM: 24GB, DISK: 3 x 1TB DISKs configured as RAID5 (redundancy) – 2 TB effective
  • 30. SCAlable Preservation Environments SCAPE SCAPE Shared Clusters • AIT (dev. cluster) • 10 dual core nodes, 4 six-core nodes, ~85 TB disk storage. • Xen and Eucalyptus virtualization and cloud management • IMF (central instance) • Low consumption machines in NoRack column • dual core AMD 64-bit processor, 8GB RAM, 15TB on 5 disks • production data center facility
  • 32. SCAlable Preservation Environments SCAPE 32 • Wrapping Sequential Tools • Using a wrapper script (Hadoop Streaming API) • PT’s generic Java wrapper allows one to use pre-defined patterns (based on toolspec language) • Works well for processing a moderate number of files • e.g. applying migration tools or FITS. • Writing a custom MapReduce application • Much more powerful and usually performs better. • Suitable for more complex problems and file formats, such as Web archives. • Using a High-level Language like Hive and Pig • Very useful to perform analysis of (semi-)structured data, e.g. characterization output.
  • 33. SCAlable Preservation Environments SCAPE • Preservation tools and libraries are pre-packaged so they can be automatically deployed on cluster nodes • SCAPE Debian Packages • Supporting SCAPE Tool Specification Language • MapReduce libs for processing large container files • For example METS and (W)arc RecordReader • Application Scripts • Based on Apache Hive, Pig, Mahout • Software components to assemble a complex data-parallel workflows • Taverna and Oozie Workflows Available Tools
  • 34. SCAlable Preservation Environments SCAPE 34 Sequential Workflows • In order to run a workflow (or activity) on the cluster it will have to be parallelized first! • A number of different parallelization strategies exist • Approach typically determined on a case-by-case basis • May lead to changes of activities, workflow structure, or the entire application. • Automated parallelization will only work to a certain degree • Trivial workflows can be deployed/executed without requiring individual parallelization (wrapper approach). • SCAPE driver program for parallelizing Taverna workflows. • SCAPE template workflows for different institutional scenarios developed.
  • 35. SCAlable Preservation Environments SCAPE 35 Parallel Workflows • Are typically derived from sequential (conceptual) workflows created for desktop environment (but may differ substantially!). • Rely on MapReduce as the parallel programming model and Apache Hadoop as execution environment • Data decomposition is handled by Hadoop framework based on input format handlers (e.g. text, warc, mets-xml, etc.) • Can make use of a workflow engine (like Taverna and Oozie) for orchestrating complex (composite) processes. • May include interactions with data management systems (repositories) and sequential (concurrently executed) tools. • Tools invocations are based on API or cmd-line interface and performed as part of a MapReduce application.
  • 37. SCAlable Preservation Environments SCAPE 37 Tool Specification Language • The SCAPE Tool Specification Language (toolspec) provides a schema to formalize command line tool invocations. • Can be used to automate a complex tool invocation (many arguments) based on a keyword (e.g. ps2pdfs) • Provides a simple and flexible mechanism to define tool dependencies, for example of a workflow. • Can be resolved by the execution system using Linux packages. • The toolspec is minimalistic and can be easily created for individual tools and scripts. • Tools provided as SCAPE Debian packages come with a toolspec document by default.
  • 39. SCAlable Preservation Environments SCAPE 39 MapRed Toolwrapper • Hadoop provides scalability, reliability, and robustness supporting processing data that does not fit on a single machine. • Application must however be made compliant with the execution environment. • Our intention was to provide a wrapper allowing one to execute a command-line tool on the cluster in a similar way like on a desktop environment. • User simply specifies toolspec file, command name, and payload data. • Supports HDFS references and (optionally) standard IO streams. • Supports the SCAPE toolspec to execute preinstalled tools or other applications available via OS command-line interface.
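As an illustration, a keyword such as ps2pdfs could be formalized roughly as follows. This is a hedged sketch only: apart from the `<command>` element and the `${input}` placeholder, which this deck itself uses (slide 42), the element names are assumptions and not the normative toolspec schema:

```xml
<!-- Illustrative toolspec sketch; element names other than <command>
     are assumptions, not the published SCAPE schema. -->
<tool name="ghostscript">
  <operations>
    <operation name="ps2pdfs">
      <description>Convert PostScript to PDF with many arguments pre-set</description>
      <command>ps2pdf ${input} ${output}</command>
    </operation>
  </operations>
</tool>
```

The execution system can then resolve the tool dependency (here, Ghostscript) via the corresponding Linux package.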
  • 40. SCAlable Preservation Environments SCAPE 40 Hadoop Streaming API • Hadoop streaming API supports the execution of scripts (e.g. bash or python) which are automatically translated and executed as MapReduce applications. • Can be used to process data with common UNIX filters using commands like echo, awk, tr. • Hadoop is designed to process its input based on key/value pairs. This means the input data is interpreted and split by the framework. • Perfect for processing text but difficult to process binary data. • The streaming API uses streams to read/write from/to HDFS. • Preservation tools typically do not support HDFS file pointers and/or IO streaming through stdin/stdout. • Hence, DP tools are largely unusable with the streaming API
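The stdin/stdout contract described above is easy to see in a minimal streaming-style mapper (a sketch in Python; function names are ours). The framework pipes input records to the script line by line and expects tab-separated key/value lines back on stdout, which is exactly why tools that expect file paths or binary input do not fit:

```python
# Minimal Hadoop Streaming-style mapper: records arrive one per line,
# and the framework expects tab-separated <key, value> lines on stdout.
# Under Hadoop Streaming the input would be sys.stdin instead of a list.
def mapper(lines):
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

# Emit mapper output for two sample records.
for pair in mapper(["to be or", "not to be"]):
    print(pair)
```

Hadoop then sorts these lines by key and feeds them to a reducer script with the same line-oriented contract.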
  • 41. SCAlable Preservation Environments SCAPE 41 Suitable Use-Cases • Use MapRed Toolwrapper when dealing with (a large number of) single files. • Be aware that this may not be an ideal strategy and there are more efficient ways to deal with many files on Hadoop (Sequence Files, HBase, etc.). • However, practical and sufficient in many cases, as there is no additional application development required. • A typical example is file format migration on a moderate number of files (e.g. 100.000s), which can be included in a workflow with additional QA components. • Very helpful when payload is simply too big to be computed on a single machine.
  • 42. SCAlable Preservation Environments SCAPE 42 Example – Exploring an uncompressed WARC • Unpacked a 1GB WARC.GZ on local computer • 2.2 GB unpacked => 343.288 files • `ls` took ~40s, • count *.html files with `file` took ~4 hrs => 60.000 html files • Provided corresponding bash command as toolspec: • <command>if [ "$(file ${input} | awk '{print $2}' )" == HTML ]; then echo "HTML" ; fi</command> • Moved data to HDFS and executed pt-mapred with toolspec. • 236min on local file system • 160min with 1 mapper on HDFS (this was a surprise!) • 85min (2), 52min (4), 27min (8) • 26min with 8 mappers and IO streaming (also a surprise)
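The same check can be scripted outside the toolspec as well. A rough Python equivalent of the `file`-based test above (the function names are ours, and the `file` utility must be on PATH, as in the bash version):

```python
import subprocess

def classify(file_output):
    # `file -b` prints e.g. "HTML document, ASCII text"; mirror the
    # awk check above by looking only at the first token.
    tokens = file_output.split()
    return bool(tokens) and tokens[0] == "HTML"

def is_html(path):
    # -b suppresses the leading "path:" prefix in file's output
    result = subprocess.run(["file", "-b", path],
                            capture_output=True, text=True)
    return classify(result.stdout)
```

Run per record by the MapRed toolwrapper, each map task would evaluate this check on its share of the unpacked files.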
  • 43. SCAlable Preservation Environments SCAPE 43 Ongoing Work • Source project and README on Github presently under openplanets/scape/pt-mapred* • Will be migrated to its own repository soon. • Presently required to generate an input file that specifies input file paths (along with optional output file names). • TODO: Input binary directly based on input directory path allowing Hadoop to take advantage of data locality. • Input/output streaming and piping between toolspec commands has already been implemented. • TODO: Add support for Hadoop Sequence Files. • Look into possible integration with Hadoop Streaming API. * https://github.com/openplanets/scape/tree/master/pt-mapred
  • 45. SCAlable Preservation Environments SCAPE 45 What we mean by Workflow • Formalized (and repeatable) processes/experiments consisting of one or more activities interpreted by a workflow engine. • Usually modeled as DAGs based on control-flow and/or data-flow logic. • Workflow engine functions as a coordinator/scheduler that triggers the execution of the involved activities • May be performed by a desktop or server-side component. • Example workflow engines are Taverna workbench, Taverna server, and Apache Oozie. • Not equally rich and designed for different purposes: experimentation & science, SOA, Hadoop integration.
  • 46. SCAlable Preservation Environments SCAPE 46 Taverna • A workflow language and graphical editing environment based on a dataflow model. • Linking activities (tools, web services) based on data pipes. • High level workflow diagram abstracting low level implementation details • Think of workflow as a kind of a configurable script. • Easier to explain, share, reuse and repurpose. • Taverna workbench provides a desktop environment to run instances of that language. • Workflows can also be run in headless and server mode. • It doesn't necessarily run on a grid, cloud, or cluster but can be used to interact with those resources.
  • 47. SCAlable Preservation Environments SCAPE 47 • Extract TIFF Metadata with Matchbox and Jpylyzer • Perform OpenJpeg TIFF to JP2 migration • Extract JP2 Metadata with Matchbox and Jpylyzer • Validation based on Jpylyzer profiles • Compare SIFT image features to test visual similarity • Generate Report Image Migration #1
  • 48. SCAlable Preservation Environments SCAPE 48 • No significant changes in workflow structure compared to sequential workflow. • Orchestrating remote activities using Taverna’s Tool Plugin over SSH. • Using Platform’s MapRed toolwrapper to invoke cmd-line tools on cluster Image Migration #2 Command: hadoop jar mpt-mapred.jar -j $jobname -i $infile -r toolspecs
  • 49. SCAlable Preservation Environments SCAPE WARC Identification #1 [diagram: a (W)ARC container (JPG, GIF, HTM, HTM, MID records) is read by a (W)ARC RecordReader based on the HERITRIX Web crawler (read/write (W)ARC MapReduce); the Map phase detects each record's MIME type with Apache Tika, the Reduce phase aggregates the counts: image/jpg 1, image/gif 1, text/html 2, audio/midi 1]
    Tool integration pattern / Throughput (GB/min): TIKA detector API call in Map phase 6,17 GB/min; FILE called as command line tool from map/reduce 1,70 GB/min; TIKA JAR command line tool called from map/reduce 0,01 GB/min
    Amount of data / Number of ARC files / Throughput (GB/min): 1 GB (10 x 100 MB) 1,57 GB/min; 2 GB (20 x 100 MB) 2,5 GB/min; 10 GB (100 x 100 MB) 3,06 GB/min; 20 GB (200 x 100 MB) 3,40 GB/min; 100 GB (1000 x 100 MB) 3,71 GB/min
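The map/reduce on this slide boils down to a MIME-type histogram. A toy Python rendering (the extension lookup and record names are made-up stand-ins for the Tika detector call in the map phase):

```python
from collections import Counter

# Stand-in for Tika detection in the map phase: a toy
# extension-to-MIME lookup over fake (W)ARC record names.
TOY_MIME = {"jpg": "image/jpg", "gif": "image/gif",
            "htm": "text/html", "mid": "audio/midi"}

def map_record(name):
    # Map: record -> <mime, 1> pair
    yield TOY_MIME[name.rsplit(".", 1)[-1]], 1

def mime_histogram(records):
    # Reduce: sum the 1s emitted per MIME type.
    counts = Counter()
    for rec in records:
        for mime, n in map_record(rec):
            counts[mime] += n
    return counts

hist = mime_histogram(["a.jpg", "b.gif", "c.htm", "d.htm", "e.mid"])
# Matches the slide: image/jpg 1, image/gif 1, text/html 2, audio/midi 1
```

The throughput table on the slide shows why the detection call matters: invoking Tika in-process in the map phase (6,17 GB/min) vastly outperforms forking a command-line tool per record.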
  • 50. SCAlable Preservation Environments SCAPE WARC Identification #2 [charts comparing DROID 6.01 and TIKA 1.0]
  • 51. SCAlable Preservation Environments SCAPE • ETL Processing of 60.000 books, ~ 24 Million pages • Using Taverna's "Tool service" (remote ssh execution) • Orchestration of different types of hadoop jobs • Hadoop-Streaming-API • Hadoop Map/Reduce • Hive • Workflow available on myExperiment: http://www.myexperiment.org/workflows/3105 • See Blogpost: http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna Quality Assurance #1
  • 52. SCAlable Preservation Environments SCAPE 52 • Create input text files containing file paths (JP2 & HTML) • Read image metadata using Exiftool (Hadoop Streaming API) • Create sequence file containing all HTML files • Calculate average block width using MapReduce • Load data in Hive tables • Execute SQL test query Quality Assurance #2
  • 53. SCAlable Preservation Environments SCAPE 53 Quality Assurance – Using Apache Oozie
  • 54. SCAlable Preservation Environments SCAPE 54 Quality Assurance #3 – Using Apache Oozie • Remote Workflow scheduler for Hadoop • Accessible via REST interface • Control-flow oriented Workflow language • Well integrated with Hadoop stack (MapRed, Pig, HDFS). • Hadoop API called directly, no more ssh interaction required. • Deals with classpath problems and different library versions.
  • 56. SCAlable Preservation Environments SCAPE 56 • When dealing with large amounts of data in terms of #files, #objects, #records, #TB storage traditional data management techniques begin to fail (file system operations, databases, tools, etc.). • Scalability and Robustness are key. • Data-intensive technologies can help a great deal but do not support desktop tools and workflows used in many domains out of the box. • SCAPE has ported a number of preservation scenarios identified by its user groups from sequential workflows to a scalable (Hadoop-based) environment. • The required effort can vary a lot depending on the infrastructure in place, the nature of the data, scale, complexity, and required performance. Conclusions
  • 57. SCAlable Preservation Environments SCAPE 57 • Project website: www.scape-project.eu • Github: https://github.com/openplanets/ • SCAPE Group on MyExperiment: http://www.myexperiment.org • SCAPE tools: http://www.scape-project.eu/tools • SCAPE on Slideshare: http://www.slideshare.net/SCAPEproject • SCAPE Application Areas at Austrian National Library: • http://www.slideshare.net/SvenSchlarb/elag2013-schlarb • Submission and execution of SCAPE workflows: • http://www.scape-project.eu/deliverable/d5-2-job-submission-language-and-interface Resources
• 60. SCAlable Preservation Environments SCAPE 60
Reading image metadata
[Diagram: a `find` over the NAS produces a text file of JP2 paths (Jp2PathCreator, ~1.4 GB); a Hadoop Streaming job (HadoopStreamingExiftoolRead) reads each image's metadata with ExifTool and emits one "book/page ID, width" record per page (~1.2 GB), e.g. Z119585409/00000001 2345.]
• 60,000 books (24 million pages): ~5 h + ~38 h = ~43 h, reading the files from NAS
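A Hadoop Streaming mapper for this step can be sketched in a few lines: read one JP2 path per stdin line, ask ExifTool for the image width, and emit a tab-separated record. This is an illustrative sketch, not the SCAPE implementation; the helper names are assumptions, while the `exiftool -s3 -ImageWidth` invocation is standard ExifTool usage (`-s3` prints the bare value).

```python
import os
import subprocess
import sys

def record_key(path):
    """Derive the "book/page" key from a NAS path, e.g.
    /NAS/Z119585409/00000001.jp2 -> Z119585409/00000001"""
    book = os.path.basename(os.path.dirname(path))
    page = os.path.splitext(os.path.basename(path))[0]
    return "%s/%s" % (book, page)

def image_width(path):
    """Ask ExifTool for just the ImageWidth value of one file."""
    out = subprocess.check_output(["exiftool", "-s3", "-ImageWidth", path])
    return int(out.strip())

def run_mapper(stream=sys.stdin):
    """Hadoop Streaming would execute this over each input split,
    one file path per line, emitting key<TAB>width records."""
    for line in stream:
        path = line.strip()
        if path:
            print("%s\t%d" % (record_key(path), image_width(path)))
```

Hadoop Streaming handles the parallelisation: each mapper instance processes a slice of the path list, so the per-file ExifTool calls run concurrently across the cluster.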
• 61. SCAlable Preservation Environments SCAPE 61
SequenceFile creation
[Diagram: a `find` over the NAS produces a text file of HTML paths (HtmlPathCreator, ~1.4 GB); SequenceFileCreator then packs the many small HTML files into one large Hadoop SequenceFile keyed by "book/page" ID, e.g. Z119585409/00000707 (~997 GB uncompressed).]
• 60,000 books (24 million pages): ~5 h + ~24 h = ~29 h, reading the files from NAS
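The point of the SequenceFile step is to consolidate millions of small HTML files into one streamable key/value container, which Hadoop handles far better than many small files. A real implementation would use Hadoop's `SequenceFile.Writer`; the following is a minimal Python stand-in for the idea, using simple length-prefixed records (the format is invented for illustration).

```python
import os
import struct

def pack_small_files(paths, out_path):
    """Toy stand-in for Hadoop's SequenceFile.Writer: append each small
    file as a length-prefixed (key, value) record to one big file."""
    with open(out_path, "wb") as out:
        for path in paths:
            book = os.path.basename(os.path.dirname(path))
            page = os.path.splitext(os.path.basename(path))[0]
            key = ("%s/%s" % (book, page)).encode()
            with open(path, "rb") as f:
                value = f.read()
            out.write(struct.pack(">I", len(key)) + key)
            out.write(struct.pack(">I", len(value)) + value)

def unpack_records(path):
    """Iterate the (key, value) records back out of the packed file."""
    with open(path, "rb") as f:
        while True:
            head = f.read(4)
            if not head:
                return
            key = f.read(struct.unpack(">I", head)[0])
            vlen = struct.unpack(">I", f.read(4))[0]
            yield key.decode(), f.read(vlen)
```

Downstream jobs then read records sequentially from the one big file instead of issuing millions of small-file opens against the NAS.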
• 62. SCAlable Preservation Environments SCAPE 62
Calculate average block width using MapReduce
[Diagram: HadoopAvBlockWidthMapReduce reads the SequenceFile; the map phase emits one width per OCR text block of a page (e.g. Z119585409/00000001 → 2100, 2200, 2300, 2400), and the reduce phase averages them into one record per page (Z119585409/00000001 → 2250), written to a text file.]
• 60,000 books (24 million pages): ~6 h
• 63. SCAlable Preservation Environments SCAPE 63
Analytic Queries
HiveLoadExifData & HiveLoadHocrData
[Diagram: the two result files are loaded into Hive tables: jp2width (jid, jwidth) with rows such as Z119585409/00000001 → 2250, and htmlwidth (hid, hwidth) with rows such as Z119585409/00000001 → 1870.]
CREATE TABLE jp2width (jid STRING, jwidth INT)
CREATE TABLE htmlwidth (hid STRING, hwidth INT)
• 64. SCAlable Preservation Environments SCAPE 64
Analytic Queries
HiveSelect
[Diagram: joining the two tables pairs each page's JP2 image width with its HTML (hOCR) width, e.g. Z119585409/00000001 → jwidth 2250, hwidth 1870.]
select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid
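Since HiveQL is close to standard SQL, the quality-assurance join can be tried locally before running it on the cluster. Here is a miniature stand-in using SQLite on the five-page sample from the slides; only the `ORDER BY` is added to make the output order deterministic.

```python
import sqlite3

# Same schema and join as the Hive step, run locally on a small sample.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE jp2width (jid TEXT, jwidth INT)")
con.execute("CREATE TABLE htmlwidth (hid TEXT, hwidth INT)")
jp2 = [("Z119585409/0000000%d" % i, w)
       for i, w in enumerate([2250, 2150, 2125, 2125, 2250], start=1)]
html = [("Z119585409/0000000%d" % i, w)
        for i, w in enumerate([1870, 2100, 2015, 1350, 1700], start=1)]
con.executemany("INSERT INTO jp2width VALUES (?, ?)", jp2)
con.executemany("INSERT INTO htmlwidth VALUES (?, ?)", html)
rows = con.execute(
    "SELECT jid, jwidth, hwidth FROM jp2width "
    "INNER JOIN htmlwidth ON jid = hid ORDER BY jid").fetchall()
```

Each result row pairs a page's JP2 width with its HTML width, which is exactly the comparison the QA scenario needs: large discrepancies between the two widths flag pages worth inspecting.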