I was invited to redo the talk about Big Data I gave in Berlin earlier this year - slides also here.
Slides are similar but updated to reflect my new company and some slides are new.
Enjoy
5. Big Data –
Either VERY large datasets AND/OR other complexities
Characteristics of big data
Source: IBM methodology
6. A couple of words about scale
• 100’s of Megabytes
• This should not be a problem. Can be handled with Matlab, R, Ruby
• 100/500 Gigabytes – 1 Terabyte
• 2 Terabyte hard drives can be bought in the local shop for €100
• Connect one to your laptop and install PostgreSQL or a NoSQL database on it
• > 5 Terabytes
• Now you might have a size issue
Inspired by: http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html
7. Big Data - “Definition”
"Big Data is high volume, high velocity, and/or high variety information
assets that require new forms of processing to enable
enhanced decision making, insight discovery and process
optimization."
10. What is Big Data in Pharma R&D?
• Many ideas/possibilities across Pharma R&D and market access
• But many of them are likely NOT “real” Big Data problems!
• Are they relevant and can they bring insights?
• Yes, very much so
• Should we then find a way to handle them?
• Absolutely
11. Disclaimer
• I am a (web) tech geek
• I have nothing against new technologies
• Like many other geeks I like it
• But do try to use the right tool for the right job
13. Another great tool - for some
Q: “Could you help me get to Nürnberg, pls?”
A: “Yes, absolutely. Not a problem”
Q: “Ok, btw I want to try the Endeavour”
A: “...ahh why?”
Q: “Because I have read it’s great”
A: “Yes, but the ICE….”
14. MapReduce explained in 41 words
Goal: Count the number of books in the library.
Map: You count up shelf #1, I count up shelf #2.
(The more people we get, the faster this part goes. )
Reduce: We all get together and add up our individual counts.
http://www.chrisstucchio.com/blog/2011/mapreduce_explained.html
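The library analogy can be sketched in a few lines of Python. This is only a toy illustration on one machine, not how Hadoop or any other framework is implemented: the `shelves` data, the function names and the use of a thread pool to stand in for "more people" are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical library: each shelf is a list of book titles.
shelves = [
    ["Book A", "Book B", "Book C"],
    ["Book D", "Book E"],
    ["Book F", "Book G", "Book H", "Book I"],
]


def count_shelf(shelf):
    """Map step: one person counts the books on one shelf."""
    return len(shelf)


def add_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


def count_books(all_shelves, workers=3):
    """Count shelves independently, then add up the partial counts."""
    # Each worker plays the role of one person counting a shelf.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_shelf, all_shelves))
    # Everyone gets together and adds up the individual counts.
    return reduce(add_counts, partial_counts, 0)


total = count_books(shelves)
print(total)  # 9
```

The map step needs no coordination between workers, which is why adding people speeds it up; only the reduce step has to bring the results together.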
18. For many people/companies
”Big data technology” is a black box
”A lot of stuff”
And then the vendors go:
If
{ box = magic or money}
then
{ box = expensive}
19. Working within a community
A lot of tools available
From: http://people10.com/blog/ruby-on-rails-the-popular-platform-for-web-development/
23. Elasticsearch text indexes
• Indexed research assay metadata
=> Google-like search to find the relevant assay
• Indexed SharePoint project workspaces
=> Enable easy, fast cross-project queries to find trends
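To give a feel for what a text index does, here is a minimal inverted-index sketch in plain Python. It is not Elasticsearch and does not use its API; the assay descriptions, identifiers and function names are invented for illustration, and a real engine adds ranking, stemming, fuzzy matching and much more.

```python
import re
from collections import defaultdict

# Hypothetical assay metadata: id -> free-text description.
assays = {
    "assay-1": "Kinase inhibition assay in human liver cells",
    "assay-2": "Cell viability assay for kinase inhibitors",
    "assay-3": "Receptor binding study in rat brain tissue",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map every word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every word of the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set.intersection(*(index.get(token, set()) for token in tokens))
    return sorted(matches)


index = build_index(assays)
print(search(index, "kinase assay"))  # ['assay-1', 'assay-2']
```

Because the lookup goes from word to documents rather than scanning every document, queries stay fast as the collection grows, which is the idea behind the Google-like search mentioned above.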
24. Conclusion – Big data in Pharma R&D
• Many opportunities across R&D and market access
• More data linking and data analytics than Big Data
• You can use freely available tools on ”normal” hardware
• No magic ”Under the hood” – it’s just data
25. BUT you still need to define
the questions you
want to answer
– before diving into technology!