Archiving and managing a million or more data files on BiG Grid
1. ‘Archiving and managing a
million or more data files on
BiG Grid’
Peter Doorn, Data Archiving and Networked
Services (DANS)
With Jan Just Keijser (NIKHEF)
BiG Grid & Beyond, Amsterdam, 26/9/2012
2. Contents
Promises and ideas at the kick-off of BiG Grid in 2007: what
became of them?
In NL
SSH in UK, DE, ESFRI
Two sub-projects of BiG Grid with DANS
Analyzing and visualizing big humanities data (briefly)
Archiving and managing a million or so humanities files
Beyond BiG Grid: next requirements and challenges for the
future of SSH research and infrastructure
An example of analysis of Big Social Science Data (GPS
traces) from Italy
Challenges for data infrastructure
3. From the original BiG Grid proposal:
“BIG GRID is crucial to the success and continuity of
many Dutch research communities, covering
important areas such as life sciences, astronomy,
particle physics, meteorology, and climate
research, water management, to name just a few.
However, the very nature of the new infrastructure, a
multidimensional collaboration enabler and
accelerator, allows for direct participation of also
social sciences, humanities, and even addresses
communities in administrative domains, like digital
academic repositories.”
4. ESFRI projects in SSH about grid
CESSDA: grid technologies for facilitating the merging of
distributed data sources
DARIAH:
grid services for an open semantic architecture facilitating arts
and humanities research
need for ‘easy’ interfaces for humanities scholars, services need
to be usable without the complexities of the grid infrastructure …
CLARIN: grid technology for
access to guidance and advice through distributed knowledge centres
access to repositories of data with standardized descriptions, processing
tools ready to operate on standardized data
5. Tools for processing, analysing,
annotating, editing and publishing text data
• Grid-enabled workbench to process, analyse,
annotate, edit and publish XML-encoded textual
data for academic research
• Connect to the D-Grid Integration Platform (DGI)
via TextGrid-specific middleware components
• Demonstrate the efficiency of the grid-enabled tools
in the areas of publishing, processing, retrieval, and
linking
• Semantic TextGrid: semantic methods for
processing text assets, and for interweaving texts
and dictionaries
8. UK: e-Social Science
“The National Centre for e-Social Science
(NCeSS) investigates how innovative and
powerful computer-based infrastructure and
tools, developed under the UK e-Science
programme, can benefit the social science
research community”
Examples of grid projects:
Mixed Media Grid (MiMeG): generate tools and
techniques for social scientists to analyse audio-visual
qualitative data and related materials collaboratively
SABRE software has been specifically designed for the
statistical analysis of multi-process random effect
response data, using parallel processing
10. Dutch example from humanities
Subject: organization of knowledge
Comparison of designed classification system (UDC) with
a socially grown knowledge system (Wikipedia)
Multidisciplinary research group, including DANS
researcher Andrea Scharnhorst
Big data set (dump of Wikipedia: 2.8 TB)
Mine the data to extract the page and category link
changes over time
Create complex visualizations
Computational support by BiG Grid team: Tom Visser,
Coen Schrijvers and Ammar Benabadelkader
11.
12. Archiving experiments since 2007
Grid middleware not very suitable for our archiving
purposes
Use case:
How can you be sure that what you store on the grid
is valid?
Giving proof of data integrity is a requirement of ISO
standard 16363 for trusted digital archives
Advantages of grid storage:
Fast access to grid worker node
Hierarchical storage manager: e.g. efficient automated
backup procedures
Shared facility is efficient and economically attractive
13. Large numbers of datasets and files
> 23,000 data sets in DANS archives
Every data set consists of 1+ data files, sometimes 1000+
Most data sets are small (98% < 1 GB)
For example, the entire population census of 1960 (>11 million
records) fits on one CD-ROM (< 700 MB)
Total number of files >1 million
Total storage volume ca. 70 TB
Long processing times with large numbers of datasets and files
Management operations on the whole archive: slow and
problematic on normal servers
Mass conversions (e.g. thumbnails of images)
Data integrity control (checksums)
Compressing the data
Copying of the whole archive to the grid is not trivial
14. Datasets in DANS EASY (Sept. 2012)
1.8% of datasets > 2 GB
2.8% of datasets > 1 GB
23,560 datasets
1,693,413 files
15. The experiment
Experiment with five digital archives (not in EASY),
containing a total of 290,341 files, grouped into
1,695 'tar' files of 5 GB each (c. 8.5 TB)
Carried out by Jan Just Keijser (Nikhef)
Three-phase workflow
16. DANS Workflow phase 1:
• Create checksums
• Create tarballs (.tar files)
• Upload tarballs to the grid
Diagram: 1) md5sum, 2) tar, 3) upload to grid storage
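Phase 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool used in the experiment: the function names (md5_of_file, build_manifest, create_tarball) are hypothetical, and the upload to grid storage is left out because the slides do not name the transfer tool.

```python
import hashlib
import tarfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to root, POSIX style) to its MD5 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): md5_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def create_tarball(root, tar_path):
    """Pack all files under root into an uncompressed .tar file.

    tar_path is assumed to lie outside root, so the archive does not
    try to include itself.
    """
    root = Path(root)
    with tarfile.open(tar_path, "w") as tar:
        for p in sorted(root.rglob("*")):
            if p.is_file():
                tar.add(p, arcname=p.relative_to(root).as_posix())
    return Path(tar_path)
```

The manifest is kept at the archive side, so that the checksums computed later on the grid can be compared against values that never left the local system.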
17. DANS Workflow phase 2:
• Download .tar file
• Compress it to a .tar.gz file
• Upload compressed tarball
Diagram: grid storage and worker node: 1) download, 2) compress, 3) upload
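The compression step on the worker node might look like the sketch below. It assumes the .tar file has already been downloaded to local disk; compress_tarball is a hypothetical name, and the download and upload steps are omitted because the slides do not specify the grid transfer commands.

```python
import gzip
import shutil
from pathlib import Path


def compress_tarball(tar_path):
    """Compress archive.tar to archive.tar.gz next to it and return the new path."""
    tar_path = Path(tar_path)
    gz_path = tar_path.with_name(tar_path.name + ".gz")
    # Stream the data so that a 5 GB tarball is never held in memory at once.
    with open(tar_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path
```

Streaming with shutil.copyfileobj keeps memory use flat regardless of tarball size, which matters when many such jobs share one worker node.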
18. DANS Workflow phase 3:
• Download .tar.gz file
• Unpack it
• Calculate checksums
• Send checksums back and compare
Diagram: grid storage and worker node: 1) download, 2) unpack, 3) md5sum, 4) compare
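The verification in phase 3 can be illustrated as follows. This is a sketch, not the actual tool: the names checksums_from_targz and compare_manifests are hypothetical, and as a simplification the code hashes each member straight from the archive stream instead of unpacking it to disk first. The expected manifest is assumed to be a mapping from file path to MD5 digest, created before upload.

```python
import hashlib
import tarfile


def checksums_from_targz(targz_path, chunk_size=1 << 20):
    """Return {member name: MD5 hex digest} for every regular file in a .tar.gz."""
    result = {}
    with tarfile.open(targz_path, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            digest = hashlib.md5()
            stream = tar.extractfile(member)
            for chunk in iter(lambda: stream.read(chunk_size), b""):
                digest.update(chunk)
            result[member.name] = digest.hexdigest()
    return result


def compare_manifests(expected, actual):
    """Report files that are missing, unexpected, or whose checksums differ."""
    shared = set(expected) & set(actual)
    return {
        "missing": sorted(set(expected) - set(actual)),
        "unexpected": sorted(set(actual) - set(expected)),
        "mismatched": sorted(n for n in shared if expected[n] != actual[n]),
    }
```

An empty report on all three keys means the round trip through grid storage preserved every file; any entry under "mismatched" points at corruption somewhere along the way.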
19. Results
The tool works
One checksum mismatch detected: disk
failure on grid worker node!
20. SSH: big data challenges
Data generated by people tend to be small
Data generated by social processes (Twitter,
Facebook), transactions (financial),
administrations and by devices (GSM, GPS) tend
to be big
More analytical projects of big data in SSH (but
few in NL)
Millions of digitized books (“Culturomics”)
Sentiment analysis of Twitter feeds to predict
markets and economic trends
Traffic flows using GPS
21. An example from Italy
GPS traces
17K private cars
one week of ordinary mobility
200K trips (trajectories)
Milan, Italy
From a presentation
by Dino Pedreschi,
Pisa
Data donated by OCTO Telematics
22.
Where is traffic concentrated between midnight and 2 a.m.?
(red = most intense)
23.
Where is traffic concentrated between 6 p.m. and 8 p.m.?
24.
Select only trips that start in the city centre (orange) and head
north-west
25.
Where are people between 6 p.m. and 8 p.m. on Wednesday, April
4th?
26.
Where are people between 8 p.m. and 10 p.m. on Wednesday, April
4th? (A high-density spot has appeared)
27.
Where are people between 10 p.m. and midnight on Wednesday,
April 4th? (The dense spot has disappeared. What happened?)
28.
Focus on the high-density spot: it is centred on the parking lots of the
stadium, where a football match took place...
29. SSH Research beyond BiG Grid
Acceptance of grid technology by SSH community is low
and slow: “my laptop has enough processing power”
Grid is still perceived as “complicated”
Researchers are not aware of:
data management issues
the research potential of “Big SSH Data”
Demonstrator projects are still needed:
Social scientists need to focus more on the analytical
potential of “Big Social Data”
“Culturomics” in humanities
DANS can help to make that accessible, although we are
not only driven by data, but also by… demand!
30. Archiving beyond BiG Grid
Storage capacity: joining forces with other parties: 3TU
Data Centre, National Coalition for Digital Preservation
(NCDD with Royal Library, National Archives, Institute for
Sound and Vision, museum sector), Roadmap projects
Archiving is more than storage: archival management
requires repeated operations on masses of files, many
small, but also big (e.g. audio/visual)
Set of procedures to support archival management
Continuity of grid infrastructure is prerequisite
Is cloud the answer?
Public cloud is not without risk
Costs are not yet attractive enough
Private community cloud is attractive
31.
32. Thank you for your attention
peter.doorn@dans.knaw.nl
janjust@nikhef.nl
www.dans.knaw.nl