Big Data in the Arts and Humanities
1. • It has been calculated that the books in the Library of Congress represent 235 terabytes of data
• In 1994, at a conference on data management at ULCC, I heard the word ‘petabyte’ for the first time: ‘the problems which confront the meteorologist today will worry the arts and humanities in ten years’ time’
• A petabyte equals approximately four Libraries of Congress
• The Large Hadron Collider generates one petabyte of data every second (which would cost $1 trillion a year to store)
• The Square Kilometre Array will generate 4 petabytes of data a second
• A new word for me in 2013: zettabyte (the equivalent of 250 billion DVDs)
• By 2020, the digital universe will be 35 zettabytes (44 times as big as in 2009; see the arithmetic check below)
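These figures mix several units, so here is a quick back-of-the-envelope check in Python (my own illustration, assuming decimal prefixes and a ~4.7 GB single-layer DVD; neither assumption comes from the slides):

```python
# Sanity-check the storage comparisons quoted above.
LOC_TB = 235                 # Library of Congress book data, as quoted
DVD_BYTES = 4.7e9            # assumed single-layer DVD capacity

petabyte_in_loc = 1_000 / LOC_TB        # TB per PB / TB per Library of Congress
zettabyte_in_dvds = 1e21 / DVD_BYTES    # bytes per ZB / bytes per DVD
universe_2009_zb = 35 / 44              # implied size of the 2009 digital universe

print(f"1 PB ~ {petabyte_in_loc:.1f} Libraries of Congress")    # ~4.3
print(f"1 ZB ~ {zettabyte_in_dvds:.2e} DVDs")                    # ~2.1e11
print(f"2009 digital universe ~ {universe_2009_zb:.2f} ZB")      # ~0.80
```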
2. Is there big data in the arts and humanities?
Letter of Gladstone to Disraeli, 1878: British Library, Add. MS. 44457, f. 166
The political and literary papers of Gladstone preserved in the British Library comprise 762 volumes containing approx. 160,000 documents.
3. George W. Bush Presidential Library:
200 million e-mails
4 million photographs
4.
5.
6. • FamilySearch
Genealogy information from around the world
2.8 billion names; 300 million added every year
560 million digital images
10 million hits per day; 7,000 transactions per minute
Digitising microfilm to JPEG2000
Digitised in Utah, preserved in Maryland
Initial storage: 7 PB
Ingest rate: 20 TB per day, rising to 50 TB per day (a rough timeline sketch follows below)
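As a rough illustration of what those ingest figures imply (my own arithmetic, not FamilySearch's), the sketch below estimates how long it would take to add another 7 PB at each quoted rate:

```python
# Back-of-the-envelope ingest timeline for the figures above.
INITIAL_STORAGE_TB = 7_000      # 7 PB initial storage, as quoted
RATE_NOW_TB_PER_DAY = 20
RATE_LATER_TB_PER_DAY = 50

days_now = INITIAL_STORAGE_TB / RATE_NOW_TB_PER_DAY       # 350 days
days_later = INITIAL_STORAGE_TB / RATE_LATER_TB_PER_DAY   # 140 days

print(f"Doubling the archive at 20 TB/day: ~{days_now:.0f} days")
print(f"Doubling the archive at 50 TB/day: ~{days_later:.0f} days")
```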
7. Anglo-American Legal Tradition (http://aalt.law.uh.edu) contains
over 8,500,000 images of the major series of legal records for
medieval and early modern England, but how do we navigate,
annotate and share these materials?
8. Tomographic reconstruction (credit: Mark Basham, Diamond)
Raw images: 2,600 x 4,000 px per image/projection; 6,000 images/projections in total
Sinograms from the projections: 4,000 sinograms in total (each sinogram 2,600 x 4,000 x 1 px)
Reconstructed slices in a 3D volume: 4,000 slices in total
~100 GB per 3D image; ~40 mins on a 16-GPU cluster
~10 TB per experiment; ~3 days on site
~1 PB per year (per beamline)
Working on using the Emerald cluster (376 GPUs)
(A minimal reconstruction sketch follows below.)
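The projections-to-sinograms-to-slices pipeline above can be sketched with scikit-image's filtered back-projection. The dimensions here are toy stand-ins; the real experiment uses 6,000 projections of 2,600 x 4,000 px and GPU reconstruction codes rather than this CPU routine:

```python
import numpy as np
from skimage.transform import iradon

# Toy stand-ins for the real dimensions quoted above.
n_proj, height, width = 180, 64, 128
angles = np.linspace(0.0, 180.0, n_proj, endpoint=False)

# Stack of radiographs, shape (n_projections, detector_rows, detector_cols);
# random data stands in for the real detector images.
projections = np.random.rand(n_proj, height, width).astype(np.float32)

def sinogram_for_row(projections, row):
    """One sinogram per detector row: the same horizontal line taken
    from every projection, giving shape (n_projections, width)."""
    return projections[:, row, :]

def reconstruct_slice(sino, angles):
    """Filtered back-projection of one sinogram into a 2-D slice.
    iradon expects (detector_pixels, n_angles), hence the transpose."""
    return iradon(sino.T, theta=angles)

# One slice; the full 3-D volume repeats this for every detector row.
slice_0 = reconstruct_slice(sinogram_for_row(projections, 0), angles)
print(slice_0.shape)
```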
9. Use of Diamond Light Source to image
fragile manuscripts
Use of Diamond Light Source to investigate
silver degradation in altarpieces
10. Laser Scans of St Kilda and Rosslyn Chapel, produced by Digital Design Studio, Glasgow
School of Art, as part of the ‘Scottish Ten’ project. Scans such as these produce data sets
containing billions of points, amounting to many terabytes. Such data can have great
cultural and economic value.
14. Big Data in the Humanities
• Frequently (but not always) heterogeneous, comprising different layers of information: comparable to climate and environmental data
• Extremely varied in format, from linguistic corpora to images and video. Navigation and processing are major issues, even when the material is in digital format
• Nevertheless, there are strong common characteristics with big data issues in other disciplines, and scope for shared dialogue
• One characteristic of ‘big data’ hype is the use of predictive analytics. Can predictive analytics work with some arts and humanities data (e.g. sound)?
• But data-driven research (as opposed to hypothesis-driven research) is potentially important in the arts and humanities
• The arts and humanities can potentially help in dealing with big data by offering new interpretative approaches
15.
16. Lise Autogena and Joshua Portway, Most Blue Skies, 2010: http://www.autogena.org/mbs.html
17. Fabio Lattanzi Antinori, The Obelisk (2012): Open Data Institute
Scanning online news websites in real time, the artwork changes from opaque to transparent according to the quantity of references to the four main crimes against peace (genocide, crimes against humanity, crimes of aggression and crimes of war).
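A minimal sketch (my own, not the artist's implementation) of the kind of real-time counting the piece depends on, tallying mentions of the four categories in fetched news text; the search strings are illustrative choices:

```python
# Category labels follow the slide; the keywords are my own choices.
KEYWORDS = {
    "genocide": "genocide",
    "crimes against humanity": "crimes against humanity",
    "crimes of aggression": "crime of aggression",
    "crimes of war": "war crime",
}

def count_references(text: str) -> dict[str, int]:
    """Case-insensitive substring counts per category."""
    lowered = text.lower()
    return {name: lowered.count(keyword) for name, keyword in KEYWORDS.items()}

sample = "Reports of war crimes and allegations of genocide dominated the headlines."
print(count_references(sample))
# {'genocide': 1, 'crimes against humanity': 0, 'crimes of aggression': 0, 'crimes of war': 1}
```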