There is still some misunderstanding about the very definition of "big data" and the perceived hype around the term. This workshop clarified the concepts and gave examples of relevant big data projects.
4. » Introduction to the topic and its importance to education and
research
» Presentations from some key projects at the coal face of this issue
› COSMOS - Collaborative online social media observatory (Pete
Burnap)
› Mining Biodiversity - Enriching biodiversity heritage with text mining
and social media (Riza Batista-Navarro)
› Trees and Tweets - combining Twitter data with family trees (Jack Grieve)
Structure of session
10/03/2015 Jisc Digital Festival, 9-10 March 2015, ICC Birmingham
5. In 2012, Gartner updated its definition as follows:
"Big data is high volume, high velocity, and/or high variety
information assets that require new forms of processing to
enable enhanced decision making, insight discovery and
process optimization."[16] Additionally, a new V, "Veracity", is
added by some organizations to describe it.[17]
(http://en.wikipedia.org/wiki/Big_data)
6. » Better use of Big data through high performance analytics could
add £216 billion to the UK economy by 2017 (CEBR via sas.com)
» Data has moved from a backroom issue to a boardroom issue
(strategy insight and competitive advantage)
chiefdataofficersummit.com/
» Therefore data ownership is also a very important issue
» Tim Berners-Lee (as paraphrased in Guardian):
“the data we create about ourselves should be owned by each of us,
not the large companies that harvest it”
theguardian.com/technology/2014/oct/08/sir-tim-berners-lee-
speaks-out-on-data-ownership
Big data: big issue
7. » Total investment is in the region of £550 m (2012-15)
» This is across all 7 research councils but also includes collaborative
programmes (17 programmes)
» Includes production of:
› Methodologies, tools and new aggregated datasets
› Infrastructure - giving access to public and private data
› Infrastructure - providing storage, compute
› Centres of Expertise - Capacity and skills development
» RCUK overview of Big data investments
rcuk.ac.uk/research/infrastructure/big-data/
RCUK “Big data” investment overview
8. » Power
» Responsibility
» Opportunity
Big data for Universities
9. » Enterprise Data: about learners, researchers and staff and the
University as a business (including research grants)
› Held in structured systems, databases but maybe not all interoperable
» Research Data (generally not structured or centrally held, Jisc
supporting universities to address this challenge “Research at Risk”)
› But Open Access publications (and some other material) in
Institutional Repositories (about 125 universities have one)
» Sensitive Data (e.g. medical data – secure networks, anonymised etc.)
» Activity data (data about performance, benchmarking, student
and researcher behaviour)
Big data for Universities
10. » Big data enables much better analytics -
Key area for universities and for Jisc to support
» Jisc-HESA Business Intelligence Service (in development)
» LAMP (shared academic library analytics service)
» Effective Learner Analytics challenge
» All designed to help support effective analytics at institutional and
national (aggregate) level
Big data: analytics
11.
» “Your recent Amazon purchases,
Tweet score and location history
makes you 23.5% welcome here.”
(Cartoon critical of big data application, by T. Gregorius
en.wikipedia.org/wiki/Big_data)
12. » Big data research is not all about analysing very big data
» It can be about bringing data together from different sources
» It can be about using techniques from the big data field to build more
interesting ways of interacting with digital libraries
» It can be about using and building new techniques and tools to interact
with data and address research questions
» Project presentations will illustrate this
Big data: For research
13. » Issues around curation and preservation of research data (variable
size and condition)
» Performance of infrastructure required
» Why should we share and re-use research data?
» What tools, methodologies, techniques can be used?
» Do researchers have the right skills to exploit data effectively?
» How does all of the above impact on research and the research
process?
Big data: For research
14. » Two of the projects presenting today are part of
Digging into Data Challenge
» Digging into Data has been addressing many of the challenges
that were flagged earlier
» Digging into Data brings together 10 funders in four countries (UK,
US, Canada, NL)
» 36 projects funded since 2011
» Addresses “big data for research” in the humanities and social
sciences
Big data: Digging into data
Machine Anatomy 101 - UK funders & universities 17/10/2013
15. » Pioneered and legitimised big data based research in the humanities
– for computer scientists and others. (from zero to hero)
» “digital humanities” and “computational social sciences” working
together
» Engaged GLAM sector and others and encouraged them to make
their data available in forms useful to researchers and to work with
them (encourages joint data curation)
Digging into data: Achievements so far
16. » Progress on the policy side toward reforming copyright and IP to
allow for big data research on cultural heritage materials - (more to
do here)
» International & multidisciplinary cooperation had high impact
(more than anticipated). Increased visibility also strengthened
research (bringing new teams together)
Digging into data: Achievements so far
17. » Bringing data together to make Big data can create exciting
research opportunities
» Article in Nature 2013
» Mummies reveal that clogged arteries plagued the ancient world
» Based on Digging into Data programme project that brought
together CT scans on 137 mummies from four very different
ancient populations: Egyptian, Peruvian, the Ancestral Puebloans
of southwest America and the Unangans of the Aleutian Islands in
Alaska
» nature.com/news/mummies-reveal-that-clogged-arteries-
plagued-the-ancient-world-1.12568
Big data: For research
18. Big data: For research
19. » “Big data” covers a very wide set of activities
» But it has inspired, and is still inspiring, major investments and changes in practice
» Jisc is helping to support institutions in making the most of big
data through:
› Developing shared services, advice and guidance to help manage
research data effectively and comply with funders' requirements
(Research at Risk Challenge)
› Promoting effective use of data analytics and delivering some key
analytics services
› Working with the Research Councils to help exploit the benefits
of big data for research
Big data: In summary
20. Gerd Leonhard, Big Data and the Future of Journalism
flickr.com/photos/gleonhard/8978372783/
22. Collaborative Online Social Media
Observatory
COSMOS
Dr. Pete Burnap (@pbFeed)
Cardiff School of Computer Science and
Informatics
Cardiff University, UK
With Matthew Williams, Jeffrey Morgan, Omer Rana, Luke Sloan, Alex Voss
Adam Edwards, William Housley and Rob Procter
23. What is COSMOS?
• Aim to establish a coordinated interdisciplinary response to “Big
Social Data”
• Led from Cardiff (Computer Science and Social Sciences),
Warwick and St. Andrews
• Additional input from Edinburgh, UCL, Leeds, Manchester and
Wolverhampton
• Brings together social, computer, political, health and
mathematical scientists to study the methodological, theoretical,
and empirical dimensions of Big Data in technical, social and policy
contexts
• Developing a research programme to help understand and explain
how social processes and interactions manifest on the Web, with
a focus upon the challenges posed by big social data to government,
digital economy and civil society
• Development of new methodological tools and technical/data
solutions for UK academia and public sector…a Web Observatory
24. What is COSMOS?
• COSMOS has attracted 17 research grants
amounting to over £1.25M in funding from
JISC/ESRC/EPSRC/AHRC and £500K from the
public and private sectors (DoH/FSA/HPC Wales).
• A significant proportion of these funds has been
awarded to collect and analyse social media data in
the contexts of Societal Safety and Security e.g.
social tension, hate speech, crime reporting and
fear of crime, suicidal ideation
25. Research Programme
Digital Social Research Tools, Tension Indicators and Safer
Communities: A demonstration of COSMOS (ESRC DSR)
COSMOS: Supporting Empirical Social Scientific Research with a
Virtual Research Environment (JISC)
Small items of research equipment at Cardiff University (EPSRC)
Hate Speech and Social Media: Understanding Users, Networks and
Information Flows (ESRC Google)
Social Media and Prediction: Crime Sensing, Data Integration and
Statistical Modelling (ESRC NCRM)
Understanding the Role of Social Media in the Aftermath of Youth
Suicides (Department of Health)
Scaling the Computational Analysis of “Big Social Data” & Massive
Temporal Social Media Datasets (HPC Wales)
Digital Wildfire: (Mis)information flows, propagation and responsible
governance, (ESRC Global Uncertainties)
Public perceptions of the UK food system: public understanding and
engagement, and the impact of crises and scares (ESRC/FSA)
(Programme timeline: 2011 to 2016)
26. COSMOS Web Observatory
Integrated
Open (“plug and play”)
Scalable (MongoDB data stores/
Hadoop Back End)
Burnap, P. et al. (2014) ‘COSMOS: Towards an Integrated and Scalable Service for Analyzing Social Media
on Demand’, International Journal of Parallel, Emergent and Distributed Systems
Usable – developed with social
scientists for social scientists
Reproducible/Citable Research
- export/share workflow
27. Web Observatory Features
• Data Collection
– Persistent connection to Twitter 1% Stream (~4 billion)
– ONS/Police API
– Drag and drop RSS
– Import CSV/JSON
• Data Transformation
– Word Frequency
– Point data frequency over time
– Social Network Analysis
– Geospatial Clustering
– Sentiment Analysis
– …API to plug new modules and benchmark tools
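Two of the transformations listed above, word frequency and frequency over time, can be sketched in a few lines of Python. This is a minimal illustration and not the COSMOS implementation: the record fields (`created_at`, `text`), the sample posts and the function names are all hypothetical.

```python
from collections import Counter
from datetime import datetime
import re

# Hypothetical records; the field names are assumptions, not the COSMOS schema.
posts = [
    {"created_at": "2015-03-09T09:15:00", "text": "Big data at the Digital Festival"},
    {"created_at": "2015-03-09T09:40:00", "text": "Big data, big issue"},
    {"created_at": "2015-03-10T11:05:00", "text": "Text mining the biodiversity literature"},
]

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequency(records):
    """Count how often each token occurs across all records."""
    counts = Counter()
    for record in records:
        counts.update(tokenize(record["text"]))
    return counts

def posts_per_hour(records):
    """Count records per hour bucket (frequency over time)."""
    counts = Counter()
    for record in records:
        stamp = datetime.fromisoformat(record["created_at"])
        counts[stamp.strftime("%Y-%m-%d %H:00")] += 1
    return counts

freq = word_frequency(posts)
hourly = posts_per_hour(posts)
print(freq.most_common(3))
print(sorted(hourly.items()))
```

At the scale of a full stream, the same counting would be distributed across the data store rather than done in memory.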
30. COSMOS Infrastructure
COSMOS Desktop
•Small local datasets
•Users’ API credentials
•Local analysis
•Sept ‘14 launch (>100 downloads in 17
countries)
COSMOS Cloud
•Scalable storage
• Massive datasets
•Scalable compute
• On-demand nodes
• Fast search & retrieve
• Fast analysis
•Workflow management
•Collaboration support
•2015 launch
31. Web Observatory Examples
• Policy/impact driven (benefit to society/economy)
• Focus on ethical research into human safety and
security
• Augment terrestrial methods
• Comparison to existing methods
• Experimental applied stats & machine learning
• Provide examples of machine intelligence tasks
integrated into social research workflow…
• Radio 5 Live Hit List (#5LiveHitList) - biggest impact
stories across social media and online
33. Mining Biodiversity:
Enriching biodiversity literature with
OCR corrections and text-mined
semantic metadata
Riza Batista-Navarro
National Centre for Text Mining,
University of Manchester
35. Mining biodiversity
» Transform BHL into a next-generation social digital library
» Bring together strengths from multiple disciplines:
› Text mining
› Machine learning
› Data visualisation
› History
› Library and information science
› Social media
Project aims
36. Mining biodiversity
What do we want to accomplish?
› Social media
› Semantic metadata
› Visualisation
37. Mining biodiversity
» A consortium of botanical and natural history libraries
» Stores digitised legacy literature on biodiversity
» Currently holds 130,000 volumes = millions of pages (PDFs and
OCR-generated text)
» Open-access
Biodiversity Heritage Library (BHL)
38. Mining biodiversity
» Supports keyword-based search
» Species annotated and linked to the Encyclopedia of Life
» Integrates automatic taxonomic name finding tools
» Data access through export functionalities and Web services
BHL: Current features
40. Mining biodiversity
BHL: Metadata included in advanced search functionality
41. Mining biodiversity
BHL: Page viewing
Page in PDF/image format
OCR-generated text
Annotated species names
43. Mining biodiversity
Enhanced BHL: Proposed page view
Page in PDF/image format
OCR-corrected text with annotations
45.
Big data analytics: Compilation and visualisation of (evolving) terms
Mining biodiversity
46.
Big data analytics: Compilation and visualisation of (evolving) terms
Mining biodiversity
47. Sample OCR errors detected and corrected
Mining biodiversity
» Original
I mean by habit, that law in virtiie of
which all the actions and the characters
of living beings tend to repeat and to
T)err)etuatf
vi I'REFACE.
themselves, not only in tlie individual but
in its offspring.
» Result
I mean by habit, that law in virtue of
which all the actions and the characters
of living beings tend to repeat and to
perpetuate
vi PREFACE.
themselves, not only in the individual but
in its offspring.
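The corrections in this sample can be listed by aligning the two passages token by token. The sketch below only compares the "Original" and "Result" text shown on this slide; it is not the project's correction method, and the function name is hypothetical. It assumes both passages split into the same number of whitespace-separated tokens, which holds for this sample.

```python
# Text copied from the slide: OCR output before and after correction.
original = (
    "I mean by habit, that law in virtiie of which all the actions and "
    "the characters of living beings tend to repeat and to T)err)etuatf "
    "vi I'REFACE. themselves, not only in tlie individual but in its offspring."
)
result = (
    "I mean by habit, that law in virtue of which all the actions and "
    "the characters of living beings tend to repeat and to perpetuate "
    "vi PREFACE. themselves, not only in the individual but in its offspring."
)

def list_corrections(before, after):
    """Return (ocr_token, corrected_token) pairs where the two texts differ."""
    before_tokens = before.split()
    after_tokens = after.split()
    if len(before_tokens) != len(after_tokens):
        raise ValueError("texts must have the same number of tokens")
    return [(b, a) for b, a in zip(before_tokens, after_tokens) if b != a]

fixes = list_corrections(original, result)
for ocr_token, corrected in fixes:
    print(f"{ocr_token!r} -> {corrected!r}")
```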
49. Mining biodiversity
Examples of semantic metadata (annotations)
» Observation
» Habitation
50. Mining biodiversity
Examples of semantic metadata (annotations)
» Nutrition
» Trait
51. Mining biodiversity
» Web-based, graphical TM workbench
» Conforms with the Unstructured Information Management
Architecture (UIMA) standard
» Facilitates the straightforward integration of various analytics into
workflows
» Allows for the validation of annotations
: Automatic annotation by text mining (TM)
53.
Reconfigurable, reusable, modular workflows
Mining biodiversity
ENVO
Catalogue of Life
PATO
GAZ
55.
» Semantic metadata is generated and visualised using big data
analytics
» Enhanced searching through historical archives is facilitated
» Outcomes
› More informative search results
› Discovery of novel associations
In summary…
Mining biodiversity
56. Find out more…
Contact…
Riza Batista-Navarro
Research associate, NaCTeM
riza.batista@manchester.ac.uk
nactem.ac.uk/
57. Big data for lexical research
Jack Grieve, Aston University
58. » The problem with analyzing the lexicon is that most words are very
rare. For example, a majority of the 100,000 most common words
in English occur on average less than once per 25 million words.
However, even the largest standard linguistic datasets (e.g. the
British National Corpus) are smaller than 100 million words
» To observe the usage of most words, we therefore require access
to incredibly large corpora, which is now possible with the
availability of big data
Big data for lexical research
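The arithmetic behind this point can be made explicit. The sketch below takes the rate quoted above (at most one occurrence per 25 million words) and compares a 100-million-word corpus with the 8.9 billion word Twitter corpus analysed in this talk. The function name is hypothetical, and because the quoted rate is itself an upper bound, so are the results.

```python
def expected_occurrences(corpus_words, words_per_occurrence):
    """Expected count of a word that occurs once per `words_per_occurrence` words."""
    return corpus_words / words_per_occurrence

# Rate quoted on the slide: less than once per 25 million words.
WORDS_PER_OCCURRENCE = 25_000_000

# A corpus of 100 million words (the British National Corpus is smaller).
bnc_upper = expected_occurrences(100_000_000, WORDS_PER_OCCURRENCE)

# The 8.9 billion word Twitter corpus.
twitter_upper = expected_occurrences(8_900_000_000, WORDS_PER_OCCURRENCE)

print(f"100M-word corpus: at most {bnc_upper:.0f} occurrences")
print(f"8.9B-word corpus: at most {twitter_upper:.0f} occurrences")
```

A handful of occurrences is too few to study usage, whereas a few hundred begins to support analysis over time and space.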
59. » Today, I’m going to demonstrate how taking advantage of big data
mined from Twitter allows us to study for the first time how newly
emerging words enter and spread within a language
» In particular, I’ll be analysing an 8.9 billion word corpus of American
Tweets posted by over 7 million different users using geo-enabled
smart phones from October 2013 – November 2014, which was
collected for the Digging into Data Challenge
Big data for lexical research
60. » To find newly emerging words we looked for words that were very
rare at the start of the period represented by our corpus but that
rose considerably over the course of this period by analysing the
relative frequency of the 67,000 most common words in our corpus
over each day of the corpus
Finding newly emerging words
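The idea described above can be sketched as follows: compute each word's relative frequency per day, then keep words that are rare in the first days and clearly more frequent in the last days. This is a toy illustration, not the project's actual procedure. The thresholds (`max_start`, `min_end`, `min_growth`), the window size and the four-day sample are assumptions chosen for the example.

```python
from collections import Counter

def relative_frequencies(daily_tokens):
    """Return one {word: share of that day's tokens} dict per day."""
    per_day = []
    for tokens in daily_tokens:
        counts = Counter(tokens)
        total = sum(counts.values())
        per_day.append({word: n / total for word, n in counts.items()})
    return per_day

def emerging_words(daily_tokens, window=2, max_start=0.01,
                   min_end=0.1, min_growth=3.0):
    """Words rare in the first `window` days that rose by the last `window` days."""
    rel = relative_frequencies(daily_tokens)
    vocab = set().union(*rel)

    def mean_rate(word, days):
        return sum(day.get(word, 0.0) for day in days) / len(days)

    found = []
    for word in sorted(vocab):
        start = mean_rate(word, rel[:window])
        end = mean_rate(word, rel[-window:])
        # Rare at the start, frequent at the end, and a clear rise in between.
        if start <= max_start and end >= min_end and end >= min_growth * start:
            found.append(word)
    return found

# Hypothetical four-day corpus, one token list per day.
days = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat sat unbothered on the mat".split(),
    "the unbothered dog sat unbothered on the log".split(),
]
rising = emerging_words(days)
print(rising)
```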
63. » “Unbothered by the negativity and foolishness”
» “I starting to enjoying being unbothered”
» “What's that new s**t bitches are saying. Unbothered whatever
that means”
» “I'm always Unbothered I have no need to worry about the
next person.”
» “I'm so unbothered omg I've never felt more in my zone”
» “The FACT That Beyoncé Was So Unbothered About
Michelle Falling”
Unbothered examples
72. » In addition to finding newly emerging words, we can also map the
spread of these words across space for the first time, by taking
advantage of the geocoded information provided by Twitter,
which consists of a longitude and latitude for each tweet
Mapping newly emerging words
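One simple way to turn per-tweet coordinates into a map is to bin tweets into grid cells and compute, for each cell, the share of tweets that contain the word. The sketch below assumes this grid approach purely for illustration. The one-degree cell size, the record fields (`lat`, `lon`, `text`) and the sample tweets are hypothetical, and the project's own mapping method may differ.

```python
from collections import defaultdict
import math

def cell_of(lat, lon, size_deg=1.0):
    """Map a coordinate to the (row, column) index of its grid cell."""
    return (math.floor(lat / size_deg), math.floor(lon / size_deg))

def word_share_by_cell(tweets, word, size_deg=1.0):
    """Share of tweets in each grid cell that contain `word`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for tweet in tweets:
        cell = cell_of(tweet["lat"], tweet["lon"], size_deg)
        totals[cell] += 1
        if word in tweet["text"].lower().split():
            hits[cell] += 1
    return {cell: hits[cell] / totals[cell] for cell in totals}

# Hypothetical geocoded tweets.
tweets = [
    {"lat": 33.75, "lon": -84.39, "text": "I'm so unbothered today"},
    {"lat": 33.76, "lon": -84.40, "text": "traffic is terrible again"},
    {"lat": 40.71, "lon": -74.01, "text": "coffee first"},
]
shares = word_share_by_cell(tweets, "unbothered")
print(shares)
```

Repeating this for successive time slices of the corpus would show a word's spread from cell to cell.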
86. » By taking advantage of big data we are thus able to investigate
language in far greater detail than was previously possible, including
identifying and mapping the spread of newly emerging words
» Big data is therefore incredibly useful for understanding complex
systems that involve very large numbers of rare events, including
the lexicon of modern languages
Conclusion