A whirlwind introduction to digital humanities for CDP Digital Humanities: Collections & Heritage - current challenges and futures workshop. February 22, 2018 Imperial War Museum
2. www.bl.uk 2
Over the next hour we will…
• Try to define digital humanities
• Understand some of the buzzwords in and around DH
– Text/Data Mining & Machine Learning
– Data & Data Visualisation
– Georeferencing
– and a little Computer Vision & 3D modelling for good measure
…through lots of examples!
• Get tips for finding further info & support
3. www.bl.uk 3
But first, who am I?!
Founded in 2010, the Digital
Scholarship Department at British
Library supports researchers and
staff to make innovative use of our
digital collections and data.
We are a group of cross disciplinary
experts in the areas of digitisation,
librarianship, digital history &
humanities, computer and data
science, looking at how technology is
transforming research, and in turn,
our services.
@BL_DigiSchol
4. www.bl.uk 4
Getting (& staying) in the game
The Digital Scholarship Training Programme is
an internal staff training initiative by the Digital
Curator team that launched in November 2012.
Informed by the Digital Humanities, we looked at
what researchers in the field were
learning/doing.
6. www.bl.uk 6
“Unlike many other interdisciplinary
experiments, humanities computing
has a very well-known beginning. In
1949, an Italian Jesuit priest, Father
Roberto Busa, began what even to
this day is a monumental task: to
make an index verborum of all the
words in the works of St Thomas
Aquinas and related authors, totaling
some 11 million words of medieval
Latin.
http://www.digitalhumanities.org/companion
/view?docId=blackwell/9781405103213/97
81405103213.xml&chunk.id=ss1-2-1
The origin story
7. www.bl.uk 7
“The real origin of that term [digital humanities] was in conversation
with Andrew McNeillie, the original acquiring editor for the Blackwell
Companion to Digital Humanities. We started talking with him about that
book project in 2001, in April, and by the end of November we’d lined up
contributors and were discussing the title, for the contract. Ray
[Siemens] wanted “A Companion to Humanities Computing” as that was
the term commonly used at that point; the editorial and marketing folks
at Blackwell wanted “Companion to Digitized Humanities.” I suggested
“Companion to Digital Humanities” to shift the emphasis away from
simple digitization.”
-John Unsworth, founding director of the
Institute for Advanced Technology in the Humanities
at the University of Virginia and co-editor of the
Blackwell Companion to Digital Humanities
The origin story, part II
8. www.bl.uk 8
• An area of scholarly activity, born from humanities computing, at the
intersection of computing/digital technologies and the
humanities.
• The field both employs technology in the pursuit of humanities
research, and subjects technology to humanistic questioning and
interrogation.
• DH is collaborative, cross-disciplinary, and computationally
engaged research, teaching, and publishing.
https://en.wikipedia.org/wiki/Digital_humanities
Defining digital humanities (DH)
9. www.bl.uk 9
“The emergence of the new digital humanities isn’t an isolated academic
phenomenon. The institutional and disciplinary changes are part of a
larger cultural shift, inside and outside the academy, a rapid cycle of
emergence and convergence in technology and culture.”
-Steven E Jones, The Emergence of the Digital Humanities (2014)
http://lisacharlotterost.github.io/2015/06/20/Searching-through-the-years/
10. www.bl.uk 10
Defining digital humanities (DH)
Is it a discipline? Or a set of methods that can be
used across disciplines (like textual criticism)?
Lots of debate, but for today we can safely
agree…
DH combines the methodologies from
traditional humanities & social science
disciplines…
…with computational tools provided by
computing disciplines:
• Machine learning
• Data mining
• Georeferencing
• Text mining
• Data visualisation
• Crowdsourcing
11. www.bl.uk 11
How might digital humanities
techniques benefit your research?
• Explore a bigger body of material computationally than by individually
reading entire texts
• Sometimes see trends, patterns and relationships not apparent from
close reading
• Gain a broad overview of a topic
• Test an idea or hypothesis on a large dataset
• Provide skills and tools for keeping your research data clean
• New sources of funding, collaborations, connections
• …..and more!
14. www.bl.uk 14
Text & Data Mining
Using a variety of computational techniques to derive information
from and find patterns in texts and large datasets. Two common text
mining (TM) tasks:
• Named-entity recognition: find and classify words in texts that might
refer to names of things, such as a person or company
• Topic modelling: a method for finding a group of words (i.e. a topic) from
a collection of documents that best represents the information in the
collection.
Machine Learning
• Constructing algorithms that can learn from and make predictions on
data...employed in a range of computing tasks relevant to humanities
scholarship such as TM & automatic Handwritten Text Recognition (HTR)
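To make named-entity recognition a bit more concrete, here is a deliberately naive sketch. It only looks words up in a list and matches a title pattern, whereas real NER tools usually rely on trained statistical models. The place list, the titles, the sample sentence and the function name are all assumptions made up for illustration.

```python
import re

# Hypothetical lookup lists; a real project would use a trained model
# or a much larger gazetteer.
PLACES = {"Manchester", "Leeds", "London"}
TITLES = ("Mr", "Mrs", "Dr", "Father")

# A title followed by one or more capitalised words, e.g. "Dr Navickas".
PERSON_RE = re.compile(
    r"\b(?:%s)\.? (?:[A-Z][a-z]+)(?: [A-Z][a-z]+)*" % "|".join(TITLES)
)


def find_entities(text):
    """Return (entity_text, label) pairs in the order they appear."""
    found = [(m.start(), m.group(0), "PERSON") for m in PERSON_RE.finditer(text)]
    for place in PLACES:
        for m in re.finditer(r"\b%s\b" % re.escape(place), text):
            found.append((m.start(), place, "PLACE"))
    # Sort by position in the text, then drop the position.
    return [(name, label) for _, name, label in sorted(found)]


# Invented sample sentence.
sample = ("Dr Navickas searched newspapers for meetings in Manchester and Leeds, "
          "long after Father Roberto Busa began his index.")
entities = find_entities(sample)
print(entities)
```

A lookup like this misses anything not on its lists and cannot tell a person called "London" from the city, which is exactly the gap machine-learned models try to close.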
16. www.bl.uk 16
Transkribus
Transkribus is open-source software
for the automated recognition,
transcription, indexing and enrichment of
handwritten archival documents. It relies
on crowdsourcing and machine learning.
Each contribution
helps train the model
for automatic
recognition.
17. www.bl.uk 17
Political Meetings Mapper
Dr. Katrina Navickas, a self-professed
luddite, wanted to know how many, and
where, Chartist movement meetings
took place in the 19th Century and if
there was a more efficient way to
extract this information
programmatically from our digitised
newspapers, rather than by hand.
5,519 meetings held from 1838 to 1850
discovered in 462 towns and villages
across the UK!
Will be added to her existing findings:
http://protesthistory.org.uk/the-story-
1789-1848/database-of-meetings
“I was able to do in minutes with a python code what
I’d spent the last ten years trying to do by hand!”
-Dr. Katrina Navickas, BL Labs Winner 2015
18. www.bl.uk 18
Data Visualisation
• The graphical display of quantitative or qualitative information to
create insights by highlighting patterns, trends, variations and
anomalies.
• For 'sense-making (also called data analysis) and communication'
(Stephen Few)
• '…interactive, visual representations of abstract data to amplify
cognition' (Card et al)
• Visual perception is faster; interactive visualisations let you move
between the shape and the detail of a collection
20. www.bl.uk 20
Big Data History of Music
How can vast amounts of bibliographic data held by research libraries be
unlocked for music researchers to analyse?
Can this data be interrogated in ways that challenge the traditional narratives
of music history?
Analyses and visualisations
exposed previously
uncharted patterns in the
history of music, for instance
the rise and fall of music
printing in 16th- and 17th-
century Europe (huge dips in
output in Venice were down
to plague and war).
https://www.royalholloway.ac
.uk/music/research/abigdata
historyofmusic/home.aspx
22. www.bl.uk 22
Georeferencing
• Linking data with a physical location. It relates information (documents,
texts, maps, images) to geographic locations through place names and
place codes or geospatial referencing (longitude and latitude coordinates).
• Some representative modes of enquiry enabled by georeferencing…
• Correspondence, Networks & Relationships (Republic of Letters)
• Mapping Literature (Willa Cather)
• Historical Social Movements (Political Meetings Mapper)
• Historical reconstructions (Orbis)
• Cities & Memory (Bomb Sight)
• Spread of Technology & Ideas (Atlas of Early Printing)
• Human-Environment Interaction (London Sound Survey)
23. www.bl.uk 23
Orbis: "Google Maps for Ancient Rome"
Video: https://www.youtube.com/watch?v=eWz7vXzmreg
View Interactive Map: http://atlas.lib.uiowa.edu/
Project Site: http://atlas.lib.uiowa.edu/about.php
The Stanford Geospatial
Network Model of the Roman
World reconstructs the time cost
and financial expense
associated with a wide range of
different types of travel in
antiquity.
ORBIS was created using data
from both primary sources and
computational geography
simulations about travel, wind
and sea patterns, seasonal
access, costs and other
considerations to plot realistic
transport networks.
24. www.bl.uk 24
Canada Through the Lens:
mapping a collection
Phil Hatfield, Curator, created an
interactive map enabling access to
the Canadian copyright collection
by location, providing users with
metadata and, where possible,
access to the rights-cleared (public
domain) images held on the
Library's Wikimedia site.
He used openly available tools
(Google Fusion Tables) which
automatically georeferenced the
data for him.
He discovered that much of the
collection followed closely along
railway lines.
25. www.bl.uk 25
Computer Vision
• Closely related to Machine Learning, it’s concerned with the automatic
extraction, analysis and understanding of useful information from a
single image or a sequence of images.
It’s not ALL text based!
26. www.bl.uk 26
3D modelling
• Creating a three dimensional
computer model which
represents a three dimensional
object. 3D models are made
from points or vertices in 3D
space connected by geometric
data, such as lines and curves.
This forms a wireframe
representation which can be
displayed with a solid surface
through a process called
‘rendering’. Textures and
images can then be mapped to
the surfaces of the 3D model to
create ‘visualisations’.
It’s not ALL text based!
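The vertices-and-wireframe idea can be shown with a very small model. This is a sketch under stated assumptions: the tetrahedron, its coordinates and the function names are invented for illustration, and the export uses the plain-text Wavefront OBJ layout ("v" lines for vertices, "f" lines for faces with 1-based indices). Surface orientation, textures and rendering are left out.

```python
# Four corner points (vertices) of a tetrahedron, as (x, y, z) coordinates.
vertices = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0),
    (0.0, 0.0, 1.0),
]

# Each face is a triangle, given as indices into the vertex list.
faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]


def wireframe_edges(faces):
    """Collect the unique edges that outline the faces."""
    edges = set()
    for face in faces:
        for i in range(len(face)):
            a, b = face[i], face[(i + 1) % len(face)]
            # Store each edge with the smaller index first so that
            # (1, 2) and (2, 1) count as the same edge.
            edges.add((min(a, b), max(a, b)))
    return sorted(edges)


def to_obj(vertices, faces):
    """Serialise the model as Wavefront OBJ text (face indices are 1-based)."""
    lines = ["v %g %g %g" % v for v in vertices]
    lines += ["f " + " ".join(str(i + 1) for i in face) for face in faces]
    return "\n".join(lines) + "\n"


edges = wireframe_edges(faces)
obj_text = to_obj(vertices, faces)
print(edges)
print(obj_text)
```

Saving `obj_text` to a file ending in .obj should give something most 3D viewers can open, which makes it a handy way to see how little data a simple model actually needs.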
28. www.bl.uk 28
Humanities Data
• Facts and statistics collected together for reference or analysis
• Humanities data might be sets of bibliographic information, images,
image processing details, texts, texts with mark-up and annotations,
historical tabular data, archived webpages…you name it!
• A data set represents a distinct collection of data ideally packaged,
preserved and made accessible for enquiry.
• Humanities data can be “big”, “small”, “smart”…..but mostly
complex!
29. www.bl.uk 29
Messiness in historical data
• 'Begun in Kiryu, Japan, finished in France'
• 'Bali? Java? Mexico?'
• Variations on USA:
– U.S.
– U.S.A
– U.S.A.
– USA
– United States of America
– USA ?
– United States (case)
• Inconsistency in uncertainty
– U.S.A. or England
– U.S.A./England ?
– England & U.S.A.
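One way to tame variants like these is a small lookup that maps cleaned-up spellings to a preferred form while keeping a separate flag for uncertainty. This is only a sketch: the mapping table, the function name and the rule that "or" and "?" mean uncertain while "&" means both places are assumptions for illustration, not a cataloguing standard.

```python
import re

# Hypothetical mapping from cleaned-up spellings to one preferred form.
CANONICAL = {
    "us": "United States",
    "usa": "United States",
    "united states": "United States",
    "united states of america": "United States",
    "england": "England",
}


def normalise_place(raw):
    """Return (list of places, uncertain flag) for a raw field value."""
    # Assumption: a question mark or the word "or" signals uncertainty.
    uncertain = "?" in raw or " or " in raw
    # Split alternatives such as "U.S.A. or England" and "U.S.A./England ?".
    parts = re.split(r"\s+or\s+|/|&", raw)
    places = []
    for part in parts:
        key = re.sub(r"[^a-z ]", "", part.lower())
        key = " ".join(key.split())
        # Unknown values are kept as written rather than silently dropped.
        places.append(CANONICAL.get(key, part.strip()))
    return places, uncertain


for value in ["U.S.", "USA ?", "U.S.A. or England", "England & U.S.A.",
              "Bali? Java? Mexico?"]:
    print(repr(value), "->", normalise_place(value))
```

Keeping the uncertainty in its own field means the place names can be made consistent without throwing away what the original cataloguer was trying to say.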
31. www.bl.uk 31
Ships' Log Books & Modern
Forecast Models
The East India Company archives
include 900 log-books of ships containing
daily instrumental measurements of
temperature and pressure, and
subjective estimates of wind speed and
direction, from voyages across the
Atlantic and Indian Oceans between
1789 and 1834.
The Met Office digitised and transcribed
these books, providing 273,000 new
weather records offering an
unprecedentedly detailed view of the
weather and climate of the late
eighteenth and early nineteenth centuries
in certain locations, which can be used to
test the accuracy of their forecasting
models.
32. www.bl.uk 32
Things to consider: Data + Tools
• Cultural heritage records contain uncertainty and fuzziness (e.g. date ranges, multiple
values, uncertain or unavailable information). Curators and staff at institutions often
have unique expertise in deciphering these anomalies, so ask them! ([1960] vs. 1960 can
have a big impact depending on what you’re doing.)
• Optical Character Recognition (OCR) in particular is an imperfect art: you need to consider
how bad it is, how this might affect your findings, and what needs doing to mitigate it.
• Keeping data clean, organised, open and described well will not only make your life
easier, but enable its widespread re-use beyond the life of your PhD and increase
future impact. (Datasets you’ve created in the course of your research projects could
even be used to enhance national collections!)
• Decisions always need to be made while normalising information for visualisation.
Documenting them is important for your research but also future re-use!
• Is your aim enquiry or presentation? All of this will have an impact on the tools and
data cleaning choices you make.
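On the "how bad is the OCR?" point, one rough way to put a number on it is to hand-correct a small sample and count how many character edits separate the OCR output from the corrected text. The sketch below is an assumption about how you might do that, using a standard Levenshtein edit distance; the function names and the sample strings are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text, corrected_text):
    """Edits needed per character of the hand-corrected text."""
    if not corrected_text:
        raise ValueError("corrected_text must not be empty")
    return edit_distance(ocr_text, corrected_text) / len(corrected_text)


# Invented sample: a line as OCR might read it, and the same line corrected.
ocr_line = "Chartift meeting at Leecls"
corrected_line = "Chartist meeting at Leeds"
rate = character_error_rate(ocr_line, corrected_line)
print(round(rate, 2))
```

A rate measured on a few sample pages won't tell you everything, but it gives you something to report alongside your findings and a way to compare sources.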
For more details on the process: http://www.historyofinformation.com/expanded.php?id=2321
Father Busa imagined that a machine might be able to help him, and, having heard of computers, went to visit Thomas J. Watson at IBM in the United States in search of support (Busa 1980). The entire texts were gradually transferred to punched cards and a concordance program written.”
400 Million views since 2013
Video: http://www.bl.uk/case-studies/political-meetings-mapper
Research Question:
Chartism was the biggest popular movement for democracy in 19th Century British history. The Chartists campaigned for the vote for all men, and advertised their meetings in the Northern Star newspaper from 1838 to 1850.
The question is, how many meetings took place and where? We started with 1841-1845.
Source Collections:
19th Century Digitised Newspapers, specifically Northern Star newspaper
Digitised and Georeferenced Map of Oxford Street
Digital/Computational Techniques:
The images of the relevant pages of the Northern Star were run through an Optical Character Recognition program (ABBYY FineReader 12) and the resulting text was checked manually.
We developed a set of Python scripts to extract and geo-code the place of each meeting, using a gazetteer of places, and to parse the date of the meeting.
Outcome: 5,519 meetings discovered in 462 towns and villages across the UK! http://politicalmeetingsmapper.co.uk/maps/
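The extract-and-geo-code step described above might look something like this. It is a minimal sketch, not the project's actual code: the gazetteer entries, the approximate coordinates, the sample notice and the function name are all made up for illustration, and the year is passed in because notices often leave it out.

```python
import re
from datetime import date

# Hypothetical mini-gazetteer: place name -> approximate (latitude, longitude).
GAZETTEER = {
    "Manchester": (53.48, -2.24),
    "Leeds": (53.80, -1.55),
    "Bradford": (53.79, -1.75),
}

MONTHS = {name: number for number, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

# Matches dates written like "March 8th" or "June 21".
DATE_RE = re.compile(r"\b(%s) (\d{1,2})(?:st|nd|rd|th)?\b" % "|".join(MONTHS))


def parse_notice(text, year):
    """Return place, coordinates and ISO date for a notice, or None."""
    place = next(
        (p for p in GAZETTEER if re.search(r"\b%s\b" % re.escape(p), text)),
        None,
    )
    match = DATE_RE.search(text)
    if place is None or match is None:
        return None  # leave it for a human to check
    lat, lon = GAZETTEER[place]
    when = date(year, MONTHS[match.group(1)], int(match.group(2)))
    return {"place": place, "lat": lat, "lon": lon, "date": when.isoformat()}


# Invented notice text in the style of a newspaper announcement.
notice = ("A public meeting will be held at the Carpenters' Hall, Manchester, "
          "on Monday, March 8th, at eight o'clock.")
result = parse_notice(notice, 1841)
print(result)
```

Returning None when either the place or the date is missing keeps doubtful notices out of the map until someone has looked at them, which fits with the manual checking of the OCR text.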
Research Question:
Brought together for the first time the world's biggest datasets about published sheet music, music manuscripts and classical concerts (in excess of 5 million records) for statistical analysis, manipulation and visualisation. Aim was to unlock musical-bibliographical data held by libraries in order to create new research opportunities. The project cleaned and enhanced aspects of the British Library catalogues of printed and manuscript music, which are now available as open data from www.bl.uk/bibliographic/download.html and piloted big data research techniques on these and five other datasets.
Source Collections:
Data from seven existing databases and catalogues were used as the basis of this project: the British Library's catalogues of printed and manuscript music; the bibliographies created by Répertoire International des Sources Musicales (RISM) that list European music printed 1500-1800 and music manuscripts in European libraries; and the RISM UK Music Manuscripts Database and the Concert Programmes Project database.
Digital/Computational Techniques:
Data wrangling using OpenRefine and MarcEdit. Data visualisation using Google Fusion Tables and Palladio.
Project slides: http://www.slideshare.net/historyspot/ihr-big-data-history-of-music-9-june15
Outcome: Analyses and visualisations of these datasets exposed previously uncharted patterns in the history of music, for instance involving the rise and fall of music printing in 16th- and 17th-century Europe (huge dips in output in Venice were down to plague and war!), or the rise of nationalist colourings in music of the late 18th and early 19th centuries. The detection of these long-term trends permits new ways of linking music history to wider histories of culture, economics, society and politics
Sketchfab has tutorials: https://blog.sketchfab.com/category/tutorial/ - in general Sketchfab would be a good place to start, exploring models, blogs, tutorials
Examples from the Cooper Hewitt collection. I spent 3/5 of my time at the Cooper Hewitt just trying to get the data clean enough to vaguely represent the collection. The problem is that computers think U.S., U. S. , U.S.A., U. S. A. , United States, United States of America are six different places.
Fields also contain things like internal notes about potential duplicates, unexpected extra information - notes on what type of location, etc. Lots of inconsistencies - uncertainty and date ranges expressed in different ways.
More common GLAM issues - What year is 'early 18th century'? What do you do with '1836 (probably)'?
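There is no single right answer to those questions, but whatever you decide can be written down as code so the decision is documented. The sketch below turns a few date styles into an earliest year, a latest year and an uncertainty flag. The rule that "early" means the first third of a century, the treatment of square brackets as uncertain, and the function name are all assumptions for illustration.

```python
import re


def parse_date(raw):
    """Return (earliest_year, latest_year, uncertain) or None if unrecognised."""
    text = raw.strip()
    # Assumption: these markers all mean the date is not certain.
    uncertain = bool(re.search(r"\?|\[|probably|circa|\bc\.", text, re.IGNORECASE))

    century = re.search(
        r"(early|mid|late)?[\s-]*(\d{1,2})(?:st|nd|rd|th) century",
        text, re.IGNORECASE,
    )
    if century:
        start = (int(century.group(2)) - 1) * 100
        # Assumed convention: early, mid and late are thirds of the century.
        spans = {"early": (0, 33), "mid": (33, 66), "late": (66, 99), None: (0, 99)}
        part = century.group(1).lower() if century.group(1) else None
        low, high = spans[part]
        # A century-level date is always a range, so flag it as uncertain.
        return start + low, start + high, True

    year = re.search(r"\b(\d{4})\b", text)
    if year:
        value = int(year.group(1))
        return value, value, uncertain
    return None


for value in ["early 18th century", "1836 (probably)", "[1960]", "1960"]:
    print(repr(value), "->", parse_date(value))
```

Storing a range plus a flag means a visualisation can still place the object on a timeline while a later user can see, and disagree with, the choice that was made.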
Open Refine is an amazing tool, and I wouldn't have gotten anywhere at Cooper Hewitt without it. It will suggest ways to make the data more consistent. You can then export the data and keep working on it in other tools, or bring it back into Open Refine later. Because Refine runs locally it can be used for sensitive data you mightn't put online.
One issue is that GLAMs tend to use question marks to record uncertainty in attribution, but Refine's clustering ignores punctuation when it groups similar values, so merging a cluster can lose the question mark. You have to be careful about preserving it (if that's what you want).
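To see why the question marks are at risk, here is a simplified imitation of the "fingerprint" idea behind key-collision clustering: lowercase the value, drop punctuation, then sort and de-duplicate the words. It is a sketch for illustration, not Open Refine's own code, and the function names and the list of variants are assumptions.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Lowercase, drop punctuation, then sort and de-duplicate the words."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group raw values that share a fingerprint."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return dict(groups)


variants = ["U.S.A", "U.S.A.", "USA", "USA ?", "U. S. A.", "United States"]
clusters = cluster(variants)
for key, members in clusters.items():
    print(repr(key), members)
```

In this sketch "USA ?" lands in the same group as the certain "USA", so merging the group would erase the uncertainty, while the spaced-out "U. S. A." is not caught at all and would still need fixing by hand.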
Takes in TSV, CSV, *SV, Excel (.xls and .xlsx), JSON, XML, RDF as XML, and Google Data documents.
http://freeyourmetadata.org/cleanup/ useful advice
When plotted on a map, though, some of the recorded location data was WAY off. This wasn’t down to modern transcription error: it was sailor error!
http://www.clim-past.net/8/1551/2012/cp-8-1551-2012.html
More on the project:
http://blogs.bl.uk/untoldlives/2013/04/history-and-science-meet-1.html
Video: https://vimeo.com/43884291