BIG DATA - AS OPPOSED
TO SMALL DATA




Mark Whitehorn
What is Big data?

Is it really just a marketing campaign?

http://www.perceptualedge.com/articles/visual_business_intelligence/big_data_big_ruse.pdf




“If you’re like me, the mere mention of Big Data
now turns your stomach….Why all the fuss? Why,
indeed. Essentially, Big Data is a marketing
campaign, pure and simple.”
Stephen Few
                                                                                            2
Big data
Clearly I am not like Stephen Few.

I don’t believe I have a particular axe to grind; I
simply find this interesting.

This talk is designed to try to explain:
• what Big Data is
• what characteristics we have found useful
• why it may be of interest to you
• a paradox
                                                      3
Data

All computer applications manipulate data




                                            4
Data

So, in the ’60s and ’70s we rapidly learnt to
separate the data, and its manipulation, from
the application
Which led directly to the development of
database engines and, ultimately, relational
ones (DB2, Oracle, SQL Server)

                                                6
Data

Data has always existed in two, very broad,
flavours…..
  • Data that is treated as small, discrete
    packages and is a good fit with the
    relational way of storing and querying data
  • Data that is not as above

                                                  7
Data is stored in tables

      Car
      LicenseNo   Make      Model      Year       Colour
      CER 162 C   Triumph   Spitfire   1965       Green
      EF 8972     Bentley   Mk. VI     1946       Black
      YSK 114     Bentley   Mk. VI     1949       Red

• Each table has a name (here, Car)
• Data is atomic
• A table is made up of columns and rows
• Each row represents a unique entity in the ‘real’ world……

                                                           8-13
14
Data

The manipulation consists typically of sub-
setting the data by rows and columns and
then doing some sums
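
A minimal sketch of that kind of manipulation, using the Car table from the earlier slides. Plain Python lists and dictionaries stand in for a relational engine here, which is an assumption made purely for illustration; in practice this would be a SQL query.

```python
# The Car table from the slides, as a list of rows (row order carries no meaning).
cars = [
    {"LicenseNo": "CER 162 C", "Make": "Triumph", "Model": "Spitfire", "Year": 1965, "Colour": "Green"},
    {"LicenseNo": "EF 8972", "Make": "Bentley", "Model": "Mk. VI", "Year": 1946, "Colour": "Black"},
    {"LicenseNo": "YSK 114", "Make": "Bentley", "Model": "Mk. VI", "Year": 1949, "Colour": "Red"},
]

# Sub-set by rows: keep only the Bentleys.
bentleys = [row for row in cars if row["Make"] == "Bentley"]

# Sub-set by columns: keep only LicenseNo and Year.
projected = [{"LicenseNo": r["LicenseNo"], "Year": r["Year"]} for r in bentleys]

# "Do some sums": count the rows and average the year.
count = len(projected)
average_year = sum(r["Year"] for r in projected) / count

print(count, average_year)  # prints: 2 1947.5
```

Every value is used whole (a make, a year); nothing in the query needs to look inside a value.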




                                              15
Data

Note that this kind of manipulation is treating
the data as atomic, which is fine, because the
relational model assumes atomicity of data

Note also, that the rows are unordered


                                                  16
Data

• Data has always existed in two, very broad,
  flavours…..
   • Data that is inherently atomic and is a good
     fit with the relational way of storing and
     querying data
   • Data that is not as above

                                                    17
Examples
• Examples of ‘other’ data:
  • Images
  • Music
  • Word docs
  • Sensor data
  • Web logs
  • Twitter
  • Machines
    • Point of Sale
    • Mass spectrometers
                              18
What’s in a name?
So, what do we call the ‘rest’?
  • Un-structured?
  • Semi-structured?
  • Multi-structured?
  • Non-relational?
  • Non-tabular?

                                  19
What’s in a name?
• What about:
  • Big data?




                    20
Other definitions?
  •VVVvvvv
   • Volume
   • Variety
   • Velocity
   • Value
   • Very interesting
   • Various other words beginning with V…..
                                               21
Big Data – not new?

• So why have we focused, for the last 30 years,
  almost exclusively on the first flavour?
• Because it:
 • is easy (relatively easy – Jim Gray*)
 • represents a significant proportion of the
   available data
 *Jim Gray and Andreas Reuter - Transaction Processing: Concepts and
 Techniques (1993)
 Turing Award 1998
                                                                       22
Big Data has come of age

• Two factors have changed
 • Rise of the Machines
 • Increase in computational power
• There is a great synergy here
 • We are acquiring far more big data and we
   have computational power to extract the
   information it contains
                                               23
Big Data is hard

• 3 Vs
• It is highly variable
• We often want to look inside the data
 • Frequently non-atomic
 • Need custom functions for virtually every
   operation
    • Find the rotary-wing aircraft in the image
    • Identify the best customer
    • What does the blogosphere think of our
      company?
                                                     24
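
To illustrate "looking inside the data", here is a small sketch of a custom function for one class of big data, a web log. The log lines are invented for illustration and assume the Common Log Format; real logs vary, which is exactly why such functions tend to be written per data source.

```python
import re
from collections import Counter

# Hypothetical web-server log lines in Common Log Format (invented for illustration).
LOG_LINES = [
    '192.0.2.1 - - [10/Oct/2012:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.7 - - [10/Oct/2012:13:55:40 +0000] "GET /missing.html HTTP/1.1" 404 512',
    '192.0.2.1 - - [10/Oct/2012:13:56:02 +0000] "POST /login HTTP/1.1" 200 128',
]

LINE_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)


def parse_line(line):
    """Look inside one log line and return its parts, or None if it does not match."""
    match = LINE_PATTERN.match(line)
    return match.groupdict() if match else None


# Each line is not atomic: it has to be taken apart before it can be analysed.
records = [r for r in (parse_line(line) for line in LOG_LINES) if r is not None]
status_counts = Counter(r["status"] for r in records)
print(status_counts)
```

Only once the line has been taken apart does the familiar "sub-set and sum" style of analysis become possible.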
Big Data
• Examples
  • Log file
  • Mass spec.
  • Images




                 25
What is Big Data?
• Examples
  • Log file
  • Mass spec.
  • Images




                    BIG DATA
Summary so far……
• Just as you can always fit an aircraft engine
  into a car chassis, you can always put Big
  Data in a table, but you probably don’t
  want to
• The analysis is not sub-setting the data by
  rows and columns
• So each class of big data usually requires a
  (lovingly hand-crafted) custom analysis
                                                  30
Case Study

Big Data in the Life Sciences World
    The massed spectrometers
       Why would anyone do that?



                                      31
Human Genome Project
$3 billion – 13 Years

Sequencing completed
(2003).




                        32
Human Genome Project
$3 billion – 13 Years

Our genes define us.

Errr…. how does
that work exactly?



                        33
What is a protein?

 DNA: blueprint
 Protein: product
                                 34
Why study proteins?

GENOME: genes contain instructions for creating proteins
PROTEOME: proteins carry out functions within a cell
                                                        35
Example Proteins

• Protein: Actin. Function: Contracts Muscles
• Protein: Insulin. Function: Controls Blood Sugar
• Protein: Keratin. Function: Forms Hair and Nails
• Protein: Hemoglobin. Function: Carries Oxygen
• Protein: Antibody. Function: Fights Viruses
                                                                          36
biSCIENCE
20-25,000 genes in the human
genome.
Every nucleated cell in the same
human has the same genome.
But not all genes are active at
the same time.
Perhaps 15-18,000 active
proteins in any one cell at any
one time.
                                     37
Slowly changing: millions of years
Rapidly changing: over a day
                                                   38
Studying Proteins

Proteins are chopped up using an
enzyme to make them easier to measure.



A specialised instrument (Mass Spectrometer) is
used to measure (‘weigh’) the small protein
fragments.




We can use the mass of the small fragments to
carry out intelligent database searches to identify
which protein was detected.
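
The three steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the method used in the case study: the enzyme is assumed to be trypsin (the slide only says "an enzyme"), measured values are treated as neutral monoisotopic masses with charge ignored, and the names `protein_A` and `protein_B`, the sequence of `protein_B`, and the "observed" masses are all invented. `protein_A` uses the first 25 residues of the sequence shown on the Protein slide.

```python
import re

# Standard monoisotopic residue masses, in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per peptide for its free ends


def digest(protein):
    """Chop a protein into peptides: cut after K or R unless the next residue is P
    (the usual rule for trypsin, which is an assumption here)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """'Weigh' a peptide: sum of its residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def identify(observed, database, tolerance=0.01):
    """Score each protein by how many observed masses match one of its peptides."""
    scores = {}
    for name, sequence in database.items():
        theoretical = [peptide_mass(p) for p in digest(sequence)]
        scores[name] = sum(
            any(abs(m - t) <= tolerance for t in theoretical) for m in observed
        )
    return max(scores, key=scores.get), scores


# Hypothetical two-entry "database"; real searches run against whole proteomes.
database = {
    "protein_A": "MKLNISFPATGCQKLIEVDDERKLR",
    "protein_B": "GASPVTCLINDQKEMHFRYW",
}

# Simulated measurements: two peptide masses with a small instrument error added.
observed = [peptide_mass("LIEVDDER") + 0.002, peptide_mass("LNISFPATGCQK") - 0.003]

best, scores = identify(observed, database)
print(digest(database["protein_A"]))
print(best, scores)
```

Even in this toy form, every step is specific to the data: the cleavage rule, the mass table and the matching tolerance have no equivalent in a row-and-column query.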


                                                      39
Protein

                              Peptides


MKLNISFPATGCQKLIEVDDERKLRTFYEKRMATEVAADALGEEWKGYVV
RISGGNDKQGFPMKQGVLTHGRVRLLLSKGHSCYRPRRTGERKRKSVRGCI
VDANLSVLNLVIVKKGEKDIPGLTDTTVPRRLGPKRASRIRKLFNLSKEDDVR
QYVVRKPLNKEGKKPRTKAPKIQRLVTPRVLQHKRRRIALKKQRTKKNKEEA
AEYAKLLAKRMKEAKEKRQEQIAKRRRLSSLRASTSKSESSQK

       Amino Acids

                                                        40
Mass Spectrometry
An analytical technique for the determination of
the elemental composition of a sample.




                                                   41
Spectra

[Figure: spectrum with peaks labelled P1, P2 and P3]
                         42
Mass Spectra
File Sizes: typically several
gigabytes per MS run.
Identifications: range from 500 to
8,000 protein identifications.
                                   43
pepTRACKER
TRACK. VISUALISE. DISCOVER.
                              44
Localisation

• Protein Peptide Alignment Map
• Normalised Profiles for Synthesis, Degradation and Turnover
• Comparison Between Compartments
                                                  46
Custom analysis and custom visualisation –
vital tools in understanding big data




                                             47
Intensive Data Processing Required to derive
Information from the raw data

• Base Line Correction
• Peak Detection
• Deisotoping

Bioconductor PROcess R Package

Proteomics Volume 3, Issue 8, Article first published online: 12 AUG 2003
                                                  48
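
To give a feel for what these processing steps involve, here is a deliberately naive sketch of each one. It is not the algorithm used by the PROcess package or by the case study: the rolling-minimum baseline, the local-maximum peak test, the singly charged isotope spacing and all of the data values are assumptions chosen to keep the example short.

```python
def correct_baseline(intensities, window=2):
    """Subtract a rolling minimum so the slowly varying background sits near zero."""
    n = len(intensities)
    corrected = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        corrected.append(intensities[i] - min(intensities[lo:hi]))
    return corrected


def detect_peaks(mz, intensities, threshold):
    """Return (m/z, intensity) for points above the threshold and above their neighbours."""
    peaks = []
    for i in range(1, len(intensities) - 1):
        y = intensities[i]
        if y > threshold and y > intensities[i - 1] and y >= intensities[i + 1]:
            peaks.append((mz[i], y))
    return peaks


def deisotope(peaks, spacing=1.00335, tolerance=0.01):
    """Keep only the first peak of each isotope series (assumes singly charged ions
    and series that do not interleave)."""
    kept = []
    previous_mz = None
    for mz, intensity in sorted(peaks):
        if previous_mz is not None and abs(mz - previous_mz - spacing) <= tolerance:
            previous_mz = mz  # same series: drop this peak but keep walking the series
            continue
        kept.append((mz, intensity))
        previous_mz = mz
    return kept


# Invented scan: two real peaks sitting on a background of roughly 10-13 counts.
mz = [100.0, 100.5, 101.0, 101.5, 102.0, 102.5, 103.0, 103.5, 104.0]
raw = [10, 11, 30, 12, 11, 12, 50, 13, 12]

flat = correct_baseline(raw)
peaks = detect_peaks(mz, flat, threshold=5)
print(peaks)  # prints: [(101.0, 20), (103.0, 39)]

# Invented peak list: one three-peak isotope series plus an unrelated peak.
series = [(500.0, 100), (501.00335, 55), (502.0067, 18), (620.3, 40)]
print(deisotope(series))  # prints: [(500.0, 100), (620.3, 40)]
```

A real pipeline repeats steps like these for every scan in a multi-gigabyte file, which is where the computational cost comes from.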
“proteomics is much more complicated
than genomics . . . while an organism's
genome is more or less constant, the
proteome differs from cell to cell and
over time”

Computationally, perhaps three orders
of magnitude more complex than the
Human Genome Project (HGP)
                                          49
Why bother trying to quantify it?

Because this is payback time.

Documenting the proteome
opens the door to a whole new
world.                                50
So, what is a data scientist?
My favourite description comes from Twitter:
“Yeah, so I'm actually a data scientist. I just do this
barista thing in between gigs.”
More cynically:
“A data scientist is just an analyst who lives in
California.”

                                                            51
Possibly more accurate is that a data scientist (DS) is
“a better software engineer than any statistician and
a better statistician than any software engineer”.



                                                            52
DSs are also part artist and part engineer. They
need a toolbox of techniques, skills, processes and
abilities from which to construct novel solutions.
And they need the ability to create a UI that turns
their abstract findings into something that the users
of the system can understand, so DSs also need the
skills to create elegant visualisations that turn raw
data into information.
                                                          53
And (yes, there’s more) they need to be able to
communicate well with people. There is little use in
creating a superb analytical process if you can’t
communicate how and why it works to the board
members.


                                                         54
And then there is the curiosity. Duncan Ross
(Director of Data Sciences at Teradata) characterised
data scientists well:
“The first and most important trait is curiosity. Insane
curiosity. In many walks of life evolution selects
against the kind of person who decides to find out
what happens ‘if I push that button’. Data Science
selects for it.”
                                                            55
So, what are the general characteristics of a DS?

They include:
• insatiable curiosity (see above)
• interdisciplinary interests
• excellent communication skills
• excellent analytical capabilities


                                                      56
DSs also need a good working knowledge of:
• machine learning techniques
• data mining
• statistics
• maths
• algorithm development
• code development
• data visualisation
• multi-dimensional database design and
   implementation

                                               57
Specific skills include the technologies to handle big
data:
• NoSQL databases
• Hadoop and related technologies
• MapReduce and its implementation on differing
   software platforms
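
For readers who have not met MapReduce, the idea can be sketched in a single process. This is only an illustration of the map, shuffle and reduce phases; it is not how Hadoop or any particular platform implements them, and the two input "documents" are invented.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


documents = ["big data is big", "small data is small"]

# On a cluster, each document could be mapped on a different machine.
mapped = chain.from_iterable(map_phase(d) for d in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each map call sees only its own document and each reduce call sees only its own key, both phases can be spread across many machines, which is the attraction for big data.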



                                                           58
DSs also have an intimate knowledge of languages
such as:
• SQL
• MDX
• R
• Functional and OOP languages such as Erlang and
   Java


                                                      59
Most of all, no matter what they are called, all true
data scientists have started playing with some data
at 8:00PM and suddenly found it is 3:00AM.
Case Study

          Twitter
       Who loves you?
       Social/text/sentiment



                               61
Consider the humble tweet…




                             62
Consider the humble tweet…


As, indeed, Sally Bercow should
have done *Innocent Face*


                                  64
Consider the humble tweet…
I’d just like to apologise for that
 last slide but I would point out
that it “contained no accusation
 whatsoever … Mischievous but
           not libellous.”
                                      65
Case Study

        Oil Rig data
        Gone fishing
             Sensor data



                           66
Lessons learned

• Engagement
• Choose your battles – look for an area where you
  can gain competitive advantage
• Choose your platform carefully
• Programming – algorithm development
• Data scientists
  • Custom algorithms
  • Custom visualisations
                                                    67
Thank you very much for listening



Any Questions?
Mark Whitehorn
(MarkWhitehorn@computing.dundee.ac.uk)
                                         68

Editor's notes

  1. Each scan in the data file requires a lot of highly intensive processing in order to determine what proteins were present in the cell. Some examples… Currently a single-threaded, PC-based application is used.
  2. 4 minutes