Big data and the dark arts - Jisc Digital Media 2015

There is still some misunderstanding about the very definition of "big data" and the perceived hype around the term. This workshop clarified the concepts and gave examples of relevant big data projects.

  1. 1. Big data and the dark arts: Demystifying the world of big data Catherine Grout, Jisc
  2. 2. http://fc00.deviantart.net/fs71/f/2013/073/5/e/defence_against_the_dark_arts_lesson_by_asiapasek-d5y0oc7.jpg
  3. 3. Structure of session » Introduction to the topic and its importance for education and research » Presentations from some key projects at the coal face of this issue › COSMOS - Collaborative Online Social Media Observatory (Pete Burnap) › Mining Biodiversity - Enriching biodiversity heritage with text mining and social media (Riza Batista-Navarro) › Trees and Tweets - combining Twitter data with family trees (Jack Grieve) 10/03/2015 Jisc Digital Festival, 9-10 March 2015, ICC Birmingham
  4. 4. In 2012, Gartner updated its definition as follows: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." Additionally, a new V, "Veracity", is added by some organizations to describe it. (http://en.wikipedia.org/wiki/Big_data)
  5. 5. Big data: big issue » Better use of big data through high-performance analytics could add £216 billion to the UK economy by 2017 (CEBR via sas.com) » Data has moved from a backroom issue to a boardroom issue (strategy, insight and competitive advantage) chiefdataofficersummit.com/ » Data ownership is therefore also a very important issue » Tim Berners-Lee (as paraphrased in the Guardian): “the data we create about ourselves should be owned by each of us, not the large companies that harvest it” theguardian.com/technology/2014/oct/08/sir-tim-berners-lee-speaks-out-on-data-ownership
  6. 6. RCUK “Big data” investment overview » Total investment is in the region of £550m (2012-15) » This is across all 7 research councils but also includes collaborative programmes (17 programmes) » Includes production of: › Methodologies, tools and new aggregated datasets › Infrastructure - giving access to public and private data › Infrastructure - providing storage and compute › Centres of Expertise - capacity and skills development » RCUK overview of big data investments: rcuk.ac.uk/research/infrastructure/big-data/
  7. 7. Big data for universities » Power » Responsibility » Opportunity
  8. 8. Big data for universities » Enterprise data: about learners, researchers and staff, and the university as a business (including research grants) › Held in structured systems and databases, but maybe not all interoperable » Research data (generally not structured or centrally held; Jisc is supporting universities to address this challenge through “Research at Risk”) › But Open Access publications (and some other material) are held in institutional repositories (about 125 universities have one) » Sensitive data (e.g. medical data – secure networks, anonymised etc.) » Activity data (data about performance, benchmarking, student and researcher behaviour)
  9. 9. Big data: analytics » Big data enables much better analytics - a key area for universities and for Jisc to support » Jisc-HESA Business Intelligence Service (in development) » LAMP (shared academic library analytics service) » Effective Learner Analytics challenge » All designed to help support effective analytics at institutional and national (aggregate) level
  10. 10. » “Your recent Amazon purchases, Tweet score and location history makes you 23.5% welcome here.” (Cartoon critical of big data application, by T. Gregorius, en.wikipedia.org/wiki/Big_data)
  11. 11. Big data: For research » Big data research is not all about analysing very big data » It can be about bringing data together from different sources » It can be about using techniques from the big data field to build more interesting ways of interacting with digital libraries » It can be about using and building new techniques and tools to interact with data and address research questions » The project presentations will illustrate this
  12. 12. Big data: For research » Issues around curation and preservation of research data (variable size and condition) » Performance of infrastructure required » Why should we share and re-use research data? » What tools, methodologies and techniques can be used? » Do researchers have the right skills to exploit data effectively? » How does all of the above impact on research and the research process?
  13. 13. Big data: Digging into Data » Two of the projects presenting today are part of the Digging into Data Challenge » Digging into Data has been addressing many of the challenges that were flagged earlier » Digging into Data brings together 10 funders in four countries (UK, US, Canada, NL) » 36 projects funded since 2011 » Addresses “big data for research” in the humanities and social sciences
  14. 14. Digging into Data: Achievements so far » Pioneered and legitimised big-data-based research in the humanities – for computer scientists and others (from zero to hero) » “Digital humanities” and “computational social sciences” working together » Engaged the GLAM sector and others, and encouraged them to make their data available in forms useful to researchers and to work with them (encourages joint data curation)
  15. 15. Digging into Data: Achievements so far » Progress on the policy side toward reforming copyright and IP to allow for big data research on cultural heritage materials (more to do here) » International and multidisciplinary cooperation had high impact (more than anticipated); increased visibility also strengthened research, bringing new teams together
  16. 16. Big data: For research » Bringing data together to make big data can create exciting research opportunities » Article in Nature, 2013: “Mummies reveal that clogged arteries plagued the ancient world” » Based on a Digging into Data programme project that brought together CT scans of 137 mummies from four very different ancient populations: Egyptian, Peruvian, the Ancestral Puebloans of southwest America and the Unangans of the Aleutian Islands in Alaska » nature.com/news/mummies-reveal-that-clogged-arteries-plagued-the-ancient-world-1.12568
  17. 17. Big data: For research
  18. 18. Big data: In summary » “Big data” covers a very wide set of activities » But it has inspired, and is still inspiring, major investments and changes in practice » Jisc is helping to support institutions in making the most of big data through: › Developing shared services, advice and guidance to help manage research data effectively and comply with funders' requirements (Research at Risk Challenge) › Promoting effective use of data analytics and delivering some key analytics services › Working with the Research Councils to help exploit the benefits of big data for research
  19. 19. Gerd Leonhard, Big Data and the Future of Journalism, flickr.com/photos/gleonhard/8978372783/
  20. 20. Find out more… Contact… Catherine Grout Head of change – research, Jisc catherine.grout@jisc.ac.uk
  21. 21. Collaborative Online Social Media Observatory (COSMOS) Dr. Pete Burnap (@pbFeed), Cardiff School of Computer Science and Informatics, Cardiff University, UK. With Matthew Williams, Jeffrey Morgan, Omer Rana, Luke Sloan, Alex Voss, Adam Edwards, William Housley and Rob Procter
  22. 22. What is COSMOS? • Aims to establish a coordinated interdisciplinary response to “Big Social Data” • Led from Cardiff (Computer Science and Social Sciences), Warwick and St. Andrews • Additional input from Edinburgh, UCL, Leeds, Manchester and Wolverhampton • Brings together social, computer, political, health and mathematical scientists to study the methodological, theoretical, and empirical dimensions of Big Data in technical, social and policy contexts • Developing a research programme to help understand and explain how social processes and interactions manifest on the Web, with a focus upon the challenges posed by big social data to government, digital economy and civil society • Development of new methodological tools and technical/data solutions for UK academia and public sector…a Web Observatory
  23. 23. What is COSMOS? • COSMOS has attracted 17 research grants amounting to over £1.25M in funding from JISC/ESRC/EPSRC/AHRC and £500K from the public and private sectors (DoH/FSA/HPC Wales). • A significant proportion of these funds has been awarded to collect and analyse social media data in the contexts of Societal Safety and Security, e.g. social tension, hate speech, crime reporting and fear of crime, suicidal ideation
  24. 24. Research Programme (timeline 2011 to 2016) • Digital Social Research Tools, Tension Indicators and Safer Communities: A demonstration of COSMOS (ESRC DSR) • COSMOS: Supporting Empirical Social Scientific Research with a Virtual Research Environment (JISC) • Small items of research equipment at Cardiff University (EPSRC) • Hate Speech and Social Media: Understanding Users, Networks and Information Flows (ESRC Google) • Social Media and Prediction: Crime Sensing, Data Integration and Statistical Modelling (ESRC NCRM) • Understanding the Role of Social Media in the Aftermath of Youth Suicides (Department of Health) • Scaling the Computational Analysis of “Big Social Data” & Massive Temporal Social Media Datasets (HPC Wales) • Digital Wildfire: (Mis)information flows, propagation and responsible governance (ESRC Global Uncertainties) • Public perceptions of the UK food system: public understanding and engagement, and the impact of crises and scares (ESRC/FSA)
  25. 25. COSMOS Web Observatory • Integrated • Open (“plug and play”) • Scalable (MongoDB data stores / Hadoop back end) • Usable – developed with social scientists for social scientists • Reproducible/citable research - export/share workflow. Reference: Burnap, P. et al. (2014) ‘COSMOS: Towards an Integrated and Scalable Service for Analyzing Social Media on Demand’, International Journal of Parallel, Emergent and Distributed Systems
  26. 26. Web Observatory Features • Data Collection – Persistent connection to Twitter 1% Stream (~4 billion) – ONS/Police API – Drag and drop RSS – Import CSV/JSON • Data Transformation – Word Frequency – Point data frequency over time – Social Network Analysis – Geospatial Clustering – Sentiment Analysis – …API to plug new modules and benchmark tools
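Two of the transformations listed on slide 26, word frequency and frequency over time, can be illustrated with a minimal sketch. It is not COSMOS code: the sample posts, the tokenizer and the function names `word_frequency` and `frequency_over_time` are assumptions made purely for illustration.

```python
from collections import Counter
import re

# Hypothetical sample posts as (day, text) pairs; a real collection would
# come from a stream connection or a CSV/JSON import.
posts = [
    ("2015-03-09", "Big data is a big issue"),
    ("2015-03-09", "Data ownership matters"),
    ("2015-03-10", "Big data for research"),
]

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequency(posts):
    """Count how often each word occurs across all posts."""
    counts = Counter()
    for _, text in posts:
        counts.update(tokenize(text))
    return counts

def frequency_over_time(posts, word):
    """Count occurrences of a single word for each day."""
    per_day = Counter()
    for day, text in posts:
        per_day[day] += tokenize(text).count(word)
    return dict(per_day)

freq = word_frequency(posts)
timeline = frequency_over_time(posts, "big")
```

Here `freq` holds corpus-wide counts and `timeline` holds the per-day counts for one word, which is the kind of series a frequency-over-time chart would plot.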
  27. 27. Observing Events
  28. 28. Observing Events
  29. 29. COSMOS Infrastructure. COSMOS Desktop • Small local datasets • Users’ API credentials • Local analysis • Sept ’14 launch (>100 downloads in 17 countries). COSMOS Cloud • Scalable storage (massive datasets) • Scalable compute (on-demand nodes, fast search & retrieve, fast analysis) • Workflow management • Collaboration support • 2015 launch
  30. 30. Web Observatory Examples • Policy/impact driven (benefit to society/economy) • Focus on ethical research into human safety and security • Augment terrestrial methods • Comparison to existing methods • Experimental applied stats & machine learning • Provide examples of machine intelligence tasks integrated into social research workflow… • Radio 5 Live Hit List (#5LiveHitList) - biggest impact stories across social media and online
  31. 31. Questions? Pete Burnap (@pbFeed) burnapp@cardiff.ac.uk
  32. 32. Mining Biodiversity: Enriching biodiversity literature with OCR corrections and text-mined semantic metadata Riza Batista-Navarro, National Centre for Text Mining, University of Manchester
  33. 33. Mining biodiversity: The Partners (partner logos, including Social Media Lab)
  34. 34. Project aims » Transform BHL into a next-generation social digital library » Bring together strengths from multiple disciplines: › Text mining › Machine learning › Data visualisation › History › Library and information science › Social media
  35. 35. What do we want to accomplish? Social media; semantic metadata; visualisation
  36. 36. Biodiversity Heritage Library (BHL) » A consortium of botanical and natural history libraries » Stores digitised legacy literature on biodiversity » Currently holds 130,000 volumes = millions of pages (PDFs and OCR-generated text) » Open-access
  37. 37. BHL: Current features » Supports keyword-based search » Species annotated and linked to the Encyclopedia of Life » Integrates automatic taxonomic name finding tools » Data access through export functionalities and Web services
  38. 38. BHL: Keyword-based search and browsing
  39. 39. BHL: Metadata included in advanced search functionality
  40. 40. BHL: Page viewing. Page in PDF/image format; OCR-generated text; annotated species names
  41. 41. Enhanced BHL: Proposed search functionalities. Faceted search; time-sensitive search; automatically generated questions
  42. 42. Enhanced BHL: Proposed page view. Page in PDF/image format; OCR-corrected text with annotations
  43. 43. Big data analytics: OCR correction and text mining
  44. 44. Big data analytics: Compilation and visualisation of (evolving) terms
  45. 45. Big data analytics: Compilation and visualisation of (evolving) terms
  46. 46. Sample OCR errors detected and corrected » Original: I mean by habit, that law in virtiie of which all the actions and the characters of living beings tend to repeat and to T)err)etuatf vi I'REFACE. themselves, not only in tlie individual but in its offspring. » Result: I mean by habit, that law in virtue of which all the actions and the characters of living beings tend to repeat and to perpetuate vi PREFACE. themselves, not only in the individual but in its offspring.
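The slides do not say how the project corrects OCR errors, so the following is only a rough sketch of one simple, generic approach: replace each unknown token with its closest dictionary entry. The tiny vocabulary, the similarity cutoff of 0.5 and the function names are assumptions for illustration; a realistic system would need a much larger lexicon and context-aware scoring.

```python
import difflib

# Hypothetical mini-vocabulary; a real corrector would use a large lexicon.
VOCAB = ["habit", "law", "in", "virtue", "of", "which",
         "the", "perpetuate", "preface"]

def correct_token(token, vocabulary=VOCAB, cutoff=0.5):
    """Return the closest vocabulary word, or the token itself if none is close."""
    lowered = token.lower()
    if lowered in vocabulary:
        return token
    matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token

def correct_text(text):
    """Correct each whitespace-separated token independently."""
    return " ".join(correct_token(tok) for tok in text.split())

fixed = correct_text("in virtiie of which")
```

Applied to fragments of the sample on slide 46, this maps "virtiie" to "virtue" and "tlie" to "the". Heavily damaged tokens such as "T)err)etuatf" show why string similarity alone is a weak baseline.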
  47. 47. Semantic metadata generation » Entity types › Taxonomic entities › Geographic locations › Habitats › Anatomical entities › Qualities › Temporal expressions › Persons » Association types › Observation › Habitation › Nutrition › Trait
  48. 48. Examples of semantic metadata (annotations) » Observation » Habitation
  49. 49. Examples of semantic metadata (annotations) » Nutrition » Trait
  50. 50. Automatic annotation by text mining (TM) » Web-based, graphical TM workbench » Conforms with the Unstructured Information Management Architecture (UIMA) standard » Facilitates the straightforward integration of various analytics into workflows » Allows for the validation of annotations
  51. 51. Main interface
  52. 52. Reconfigurable, reusable, modular workflows (resources shown: ENVO, Catalogue of Life, PATO, GAZ)
  53. 53. Validation interface
  54. 54. In summary… » Semantic metadata is generated and visualised using big data analytics » Enhanced searching through historical archives is facilitated » Outcomes › More informative search results › Discovery of novel associations
  55. 55. Find out more… Contact… Riza Batista-Navarro Research associate, NaCTeM riza.batista@manchester.ac.uk nactem.ac.uk/
  56. 56. Big data for lexical research Jack Grieve, Aston University
  57. 57. Big data for lexical research » The problem with analyzing the lexicon is that most words are very rare. For example, a majority of the 100,000 most common words in English occur on average less than once per 25 million words. However, even the largest standard linguistic datasets (e.g. the British National Corpus) are smaller than 100 million words » To observe the usage of most words, we therefore require access to incredibly large corpora, which is now possible with the availability of big data
  58. 58. Big data for lexical research » Today, I’m going to demonstrate how taking advantage of big data mined from Twitter allows us to study for the first time how newly emerging words enter and spread within a language » In particular, I’ll be analysing an 8.9-billion-word corpus of American Tweets posted by over 7 million different users using geo-enabled smartphones from October 2013 – November 2014, which was collected for the Digging into Data Challenge
  59. 59. Finding newly emerging words » To find newly emerging words we looked for words that were very rare at the start of the period represented by our corpus but that rose considerably over the course of this period, by analysing the relative frequency of the 67,000 most common words in our corpus over each day of the corpus
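The scale argument on slides 57 and 58 can be checked with a short calculation. The sketch below only uses figures quoted on those slides; the function name is an assumption, and the results are upper bounds because the slide says such words occur less than once per 25 million words.

```python
def expected_occurrences(corpus_size_words, words_per_occurrence):
    """Expected count of a word that occurs once per `words_per_occurrence` words."""
    return corpus_size_words / words_per_occurrence

# A corpus of 100 million words, the upper limit quoted for standard datasets.
standard_corpus = expected_occurrences(100_000_000, 25_000_000)

# The 8.9-billion-word Twitter corpus described on slide 58.
twitter_corpus = expected_occurrences(8_900_000_000, 25_000_000)
```

A word at that rate is expected at most about 4 times in a 100-million-word corpus, too few to study its usage, but up to about 356 times in the Twitter corpus.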
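The search described on slide 59 can be sketched as follows. The slide does not give the statistic used to measure a "considerable" rise, so this example simply compares mean relative frequency in the first and second halves of the period. The thresholds `max_start` and `min_rise`, the function names and the toy data are all assumptions for illustration.

```python
from collections import Counter

def relative_frequencies(daily_tokens):
    """Return one {word: relative frequency} mapping per day."""
    rel = []
    for tokens in daily_tokens:
        counts = Counter(tokens)
        total = len(tokens)
        rel.append({word: n / total for word, n in counts.items()})
    return rel

def emerging_words(daily_tokens, max_start=0.05, min_rise=2.0):
    """Words that are rare in the first half of the period and rise in the second."""
    rel = relative_frequencies(daily_tokens)
    half = len(rel) // 2
    if half == 0:
        return []
    vocabulary = set().union(*rel)
    found = []
    for word in vocabulary:
        early = sum(day.get(word, 0.0) for day in rel[:half]) / half
        late = sum(day.get(word, 0.0) for day in rel[half:]) / (len(rel) - half)
        if early <= max_start and late > min_rise * early:
            found.append(word)
    return sorted(found)

# Toy corpus: one token list per day.
days = [
    ["the", "cat", "the", "dog"],
    ["the", "dog", "the", "cat"],
    ["the", "unbothered", "the", "cat"],
    ["unbothered", "the", "unbothered", "dog"],
]
rising = emerging_words(days)
```

On the toy data only "unbothered" is flagged, because it is absent in the first two days and frequent in the last two. On a real corpus the same idea would be applied to each of the most common words across every day of the collection period.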
  62. 62. Unbothered examples » “Unbothered by the negativity and foolishness” » “I starting to enjoying being unbothered” » “What's that new s**t bitches are saying. Unbothered whatever that means” » “I'm always Unbothered I have no need to worry about the next person.” » “I'm so unbothered omg I've never felt more in my zone” » “The FACT That Beyoncé Was So Unbothered About Michelle Falling”
  71. 71. Mapping newly emerging words » In addition to finding newly emerging words, we can also map the spread of these words across space for the first time, by taking advantage of the geocoded information provided by Twitter, which consists of a longitude and latitude for each tweet
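As a minimal sketch of how geocoded tweets could be aggregated for mapping, the example below counts tweets containing a word in cells of a latitude/longitude grid. The one-degree grid, the sample coordinates and the function names are assumptions; the slides do not state which spatial units the project used.

```python
from collections import Counter
import math

def grid_cell(lat, lon, cell_degrees=1.0):
    """Map a coordinate to the integer index of its grid cell."""
    return (math.floor(lat / cell_degrees), math.floor(lon / cell_degrees))

def map_word(tweets, word, cell_degrees=1.0):
    """Count tweets containing `word` per grid cell."""
    cells = Counter()
    for lat, lon, text in tweets:
        if word in text.lower().split():
            cells[grid_cell(lat, lon, cell_degrees)] += 1
    return dict(cells)

# Hypothetical geocoded tweets as (latitude, longitude, text).
tweets = [
    (33.75, -84.39, "so unbothered today"),
    (33.20, -84.90, "Unbothered whatever"),
    (40.71, -74.01, "feeling unbothered"),
    (40.71, -74.01, "good morning"),
]
counts = map_word(tweets, "unbothered")
```

Dividing each cell's count by the total number of tweets in that cell would give a relative frequency, which compares regions more fairly than raw counts when tweet volume varies by location.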
  85. 85. Conclusion » By taking advantage of big data we are thus able to investigate language in far greater detail than was previously possible, including identifying and mapping the spread of newly emerging words » Big data is therefore incredibly useful for understanding complex systems that involve very large numbers of rare events, including the lexicon of modern languages
  86. 86. Find out more… Contact… Jack Grieve Aston University j.grieve1@aston.ac.uk @JWGrieve
