The road lies plain before me;--'tis a theme
Single and of determined bounds; …
- Wordsworth, The Prelude

Krishna Sankar, http://doubleclix.wordpress.com
EC4000–PhD Guest Seminar, Naval Post Graduate School
April 27, 2012
What is Big Data?
Big Data to smart data
Big Data Pipeline

o  Agenda
   o  To cover the broad picture
   o  Touch upon instances of the technologies employed
o  Of the Big Data domain …

[Figure labels: Analytics/Modeling, R, Analytic Algorithms, Cloud Architectures, Visualization, Processing - Hadoop, Storage - NOSQL]
Thanks to …
The giants whose shoulders I am standing on

Special thanks to:
  Peter Ateshian, NPS
  Prof Murali Tummala, NPS
  Shirley Bailes, O’Reilly
  Ed Dumbill, O’Reilly
  Jeff Barr, AWS
  Jenny Kohr Chynoweth, AWS
  
Porcelain vs. Plumbing

• The balance is always interesting …
• This talk has both
• Would be happy to dive deep into plumbing topics like Hadoop, R, MongoDB, Cassandra et al…
EBC322

①  Volume
   o  Scale
②  Velocity
   o  Data change rate vs. decision window
③  Variety
   o  Different sources & formats
   o  Structured vs. Unstructured
④  Variability
   o  Breadth of interpretation &
   o  Depth of analytics

http://doubleclix.wordpress.com/2011/09/13/when-is-big-data-really-big-data/
http://www.hpts.ws/posters/Poster2011_13_Bulkowski.pdf
http://www.quora.com/Business-Intelligence/What-is-the-future-of-business-intelligence
  
EBC322

①  Volume
   o  Scale
②  Velocity
   o  Data change rate vs. decision window
③  Variety
   o  Different sources & formats
   o  Structured vs. Unstructured
④  Variability
   o  Breadth of interpretation &
   o  Depth of analytics
⑤  Contextual
   o  Dynamic variability
   o  Recommendation
⑥  Connectedness

http://doubleclix.wordpress.com/2011/09/13/when-is-big-data-really-big-data/
http://www.hpts.ws/posters/Poster2011_13_Bulkowski.pdf
  
•  “… they didn’t need a genius, … but build the world’s most impressive dilettante … battling the efficient human mind with spectacular flamboyant inefficiency” – Final Jeopardy by Stephen Baker
   •  15 TB memory, across 90 IBM 760 servers, in 10 racks
   •  1 TB of dataset
   •  200 Million pages processed by Hadoop
   •  This is a good example of Connected data
          –  Contextual w/ variability
          –  Breadth of interpretation
          –  Analytics depth


http://doubleclix.wordpress.com/2011/03/01/the-education-of-a-machine-%E2%80%93-review-of-book-%E2%80%9Cfinal-jeopardy%E2%80%9D-by-stephen-baker/
http://doubleclix.wordpress.com/2011/02/17/watson-at-jeopardy-a-race-of-machines/
Ref: http://www.ciol.com/News/News/News-Reports/Vinod-Khosla%E2%80%99s-cool-dozen-tech-innovations/156307/0/
http://yourstory.in/2011/11/vinod-khoslas-keynote-at-nasscom-product-conclave-reject-punditry-believe-in-an-idea-take-risk-and-succeed/
Ref: http://goo.gl/Mm83k

[Figure: layers of big data, bottom to top: Volume, Velocity, Variability, Variety, Connectedness, Context, Model, Infer-ability. Tools named in the figure: Scribe, Flume, Storm; Logs, HDFS, Hadoop, …; SQL, NOSQL, XML, files, …; SQL, Hadoop, Pig, Hive, Dryad, .NET, BI Tools, various other tools; hand coded programs, R, Mahout, …; internal dashboards, Tableau]

Decomplexify!  Contextualize!  Network!  Reason!  Infer!
Twitter
  §  200 million tweets/day
  §  Peak 10,000/second
  §  How would you handle the fire hose for social network analytics?

AWS – 900 Billion objects!

Zynga
  §  “Analytics company, not a gaming company!”
  §  Harvests data: 15 TB/day
  §  Test new features
  §  Target advertising
  §  230 million players/month

Storage
  §  4 U box = 40 TB, 1 PB = 25 boxes!

http://goo.gl/dcBsQ
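
A back-of-envelope check of the Twitter, Zynga and storage numbers, as a minimal Python sketch. The input figures come from the slide; decimal units (1 PB = 1,000 TB), the variable names, and the choice to ignore replication and compression are assumptions made for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Twitter figures from the slide.
tweets_per_day = 200_000_000
peak_tweets_per_second = 10_000

average_tweets_per_second = tweets_per_day / SECONDS_PER_DAY
peak_to_average = peak_tweets_per_second / average_tweets_per_second

# Storage figures from the slide; decimal units are an assumption.
box_capacity_tb = 40          # one 4U box
TB_PER_PB = 1_000
boxes_per_pb = TB_PER_PB / box_capacity_tb

# Zynga harvest rate from the slide; raw capacity only,
# ignoring replication and compression (an assumption).
zynga_tb_per_day = 15
zynga_boxes_per_year = zynga_tb_per_day * 365 / box_capacity_tb

print(f"average tweets/second: {average_tweets_per_second:,.0f}")
print(f"peak / average:        {peak_to_average:.2f}")
print(f"4U boxes per PB:       {boxes_per_pb:.0f}")
print(f"Zynga boxes per year:  {zynga_boxes_per_year:.1f}")
```

The peak works out to a little over four times the daily average, so anything consuming the fire hose would have to be sized for bursts rather than for the mean.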
  
•  6 Billion Messages per day
•  2 PB (w/compression) online
•  6 PB w/ replication
•  250 TB/Month growth
•  HBase Infrastructure
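
The figures above imply a message rate and a replication factor. A minimal sketch of that arithmetic follows, assuming the 6 PB figure is the same 2 PB stored several times over and that the growth figure is quoted before replication; both are assumptions, as are the variable names.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures from the slide.
messages_per_day = 6_000_000_000
online_pb = 2                 # with compression
replicated_pb = 6
growth_tb_per_month = 250

average_messages_per_second = messages_per_day / SECONDS_PER_DAY

# Assumption: replicated size = online size x number of copies.
implied_replication_factor = replicated_pb / online_pb

# Assumption: the 250 TB/month growth is quoted before replication.
replicated_growth_tb_per_month = growth_tb_per_month * implied_replication_factor

print(f"average messages/second:    {average_messages_per_second:,.0f}")
print(f"implied replication factor: {implied_replication_factor:.0f}")
print(f"replicated growth TB/month: {replicated_growth_tb_per_month:.0f}")
```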
  
eBay Extreme Analytics Architecture

•  50 TB/Day
•  240 nodes, 84 PB
•  Teradata Installation
•  Path Analysis
•  A/B Testing
•  Very systematic
•  Diagram speaks volumes!

Ref: http://www.hpts.ws/sessions/2011HPTS-TomFastner.pdf
  
[Figure: Big Data pipeline stages and the tools named at each stage]
Collect: Splunk, Scribe, Flume, Storm
Store: NOSQL (Cassandra, MongoDB, HBase, Neo4j)
Transform & Analyze: Hadoop, Pig/Hive, R
Model & Reason: R, Mahout, BI Tools
Predict, Recommend & Visualize: D3.js, Tableau, Dashboard

When I think of my own native land,
In a moment I seem to be there;
But, alas! recollection at hand
Soon hurries me back to despair.
- Cowper, The Solitude Of Alexander Selkirk
NOSQL

Key Value: In-memory (Memcached); Disk Based (Redis, Tokyo Cabinet, Dynamo, Voldemort)
Column: SimpleDB, Google BigTable, HBase, Cassandra, HyperTable, Azure TS
Document: CouchDB, MongoDB, Lotus Domino, Riak
Graph: Neo4j, FlockDB, InfiniteGraph
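
To make the four families concrete, here is a minimal sketch of one tiny dataset shaped four ways, using only Python built-ins. It does not use any of the products listed above; the user names, ids, field names and the `follows` helper are all hypothetical.

```python
import json

# Hypothetical data: user 42 ("Ada") follows user 7 ("Bob").

# Key-value: the store sees only a key and an opaque value.
key_value = {"user:42": json.dumps({"name": "Ada", "follows": [7]})}

# Column: row key -> column family -> column name -> value.
column = {"42": {"profile": {"name": "Ada"}, "follows": {"7": "2012-04-27"}}}

# Document: one nested, self-describing record per entity.
document = {"_id": 42, "name": "Ada", "follows": [{"id": 7, "since": "2012-04-27"}]}

# Graph: nodes plus first-class edges that queries can traverse.
nodes = {42: {"name": "Ada"}, 7: {"name": "Bob"}}
edges = [(42, "FOLLOWS", 7)]


def follows(user_id):
    """Names of the users that user_id follows, read off the graph edges."""
    return [nodes[dst]["name"] for src, rel, dst in edges
            if src == user_id and rel == "FOLLOWS"]


print(follows(42))
```

The same fact lives in all four shapes; what differs is which questions are cheap to ask. The key-value form answers only "give me user 42", while the graph form answers "who follows whom" by walking edges.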
  
MapReduce

•  Data parallelism
•  Large Installations (many ~5000 node clusters!)
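
The data-parallel idea can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce steps using word count, not Hadoop's API; the function names are assumptions chosen for readability.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    # Each map_phase call depends only on its own line, which is what
    # would let a cluster spread the calls across many nodes.
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


print(word_count(["big data", "Big pipeline"]))
```

Because no map call looks at another line and no reduce call looks at another key, both phases can be split across machines without coordination beyond the shuffle.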
  
Software As A Service

Platform As A Service

Infrastructure As A Service
  
Amazon – Canonical Cloud

       •     S3 – Blob storage
       •     Dynamo DB – NOSQL
       •     EMR – Elastic Map Reduce
       •     EC2 – Compute
       •     1% of Internet traffic

“Scalability is about building wider roads, not about building faster cars” – Steve Swartz

http://blog.deepfield.net/2012/04/18/how-big-is-amazons-cloud/
http://www.slideshare.net/AmazonWebServices/keynote-your-future-with-cloud-computing-dr-werner-vogels-aws-summit-2012-nyc
  
[Figure: EC2]

http://openclipart.org/detail/152311/internet-cloud-by-b.gaultier
http://openclipart.org/detail/17847
  
•    Social Network Analysis
   •    Sentiment Analysis
   •    Brand Strength
   •    Citation/co-citation ≅ Followed by/Also Follows
   •    Metrics
         –    Network diameter,
         –    Weak-ties,
         –    Erdős-Rényi model &
         –    Kronecker Graphs

[Figure labels: Tweets, Followers, Follow/Unfollow]

http://www.oscon.com/oscon2012/public/schedule/detail/23130
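
Two of the metrics above are easy to try on a laptop. The sketch below builds an Erdős-Rényi G(n, p) random graph with the standard library and measures its diameter by breadth-first search. It is a minimal illustration on an undirected graph, not a model of real follower data; the function names and the sample parameters are assumptions.

```python
import random
from collections import deque
from itertools import combinations


def erdos_renyi(n, p, seed=None):
    """Undirected G(n, p): each of the n*(n-1)/2 possible edges appears with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def eccentricity(adj, source):
    """Breadth-first search; returns (farthest distance, number of nodes reached)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values()), len(dist)


def diameter(adj):
    """Longest shortest path in the graph, or None if it is disconnected."""
    best = 0
    for v in adj:
        ecc, reached = eccentricity(adj, v)
        if reached < len(adj):
            return None
        best = max(best, ecc)
    return best


graph = erdos_renyi(200, 0.05, seed=1)
print("diameter:", diameter(graph))
```

Running one BFS per node costs O(n * (n + m)), which is fine for a few thousand nodes but is exactly the kind of computation that stops fitting on one machine at fire-hose scale.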
  
Was it a vision, or a waking dream?
Fled is that music:--do I wake or sleep?
- Keats, Ode to a Nightingale

ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
 
Breaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdfBreaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdf
 
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfThe Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
 
Introduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxIntroduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptx
 

Big Data Engineering - Top 10 Pragmatics

  • 1. The road lies plain before me;--'tis a theme Single and of determined bounds; … - Wordsworth, The Prelude Krishna Sankar, http://doubleclix.wordpress.com EC4000 – PhD Guest Seminar, Naval Postgraduate School April 27, 2012
  • 2. Agenda o To cover the broad picture o Touch upon instances of the technologies employed o Of the Big Data domain … Topics: What is Big Data? • Big Data to smart data • Big Data Pipeline • Analytics/Modeling (R) • Analytic Algorithms • Cloud Architectures • Visualization • Processing (Hadoop) • Storage (NOSQL)
  • 3. Thanks to … The giants whose shoulders I am standing on. Special thanks to: Peter Ateshian, NPS; Prof Murali Tummala, NPS; Shirley Bailes, O’Reilly; Ed Dumbill, O’Reilly; Jeff Barr, AWS; Jenny Kohr Chynoweth, AWS
  • 4. Porcelain vs. Plumbing • The balance is always interesting … • This talk has both • Would be happy to dive deep into plumbing topics like Hadoop, R, MongoDB, Cassandra et al…
  • 5. EBC322 ① Volume o Scale ② Velocity o Data change rate vs. decision window ③ Variety o Different sources & formats o Structured vs. Unstructured ④ Variability o Breadth of interpretation & o Depth of analytics http://doubleclix.wordpress.com/2011/09/13/when-is-big-data-really-big-data/ http://www.hpts.ws/posters/Poster2011_13_Bulkowski.pdf http://www.quora.com/Business-Intelligence/What-is-the-future-of-business-intelligence
  • 9. EBC322 ① Volume o Scale ② Velocity o Data change rate vs. decision window ③ Variety o Different sources & formats o Structured vs. Unstructured ④ Variability o Breadth of interpretation & o Depth of analytics ⑤ Contextual o Dynamic variability o Recommendation ⑥ Connectedness http://doubleclix.wordpress.com/2011/09/13/when-is-big-data-really-big-data/ http://www.hpts.ws/posters/Poster2011_13_Bulkowski.pdf
  • 10. • “… they didn’t need a genius, … but build the world’s most impressive dilettante … battling the efficient human mind with spectacular flamboyant inefficiency” – Final Jeopardy by Stephen Baker • 15 TB memory, across 90 IBM 760 servers, in 10 racks • 1 TB of dataset • 200 million pages processed by Hadoop • This is a good example of Connected data – Contextual w/ variability – Breadth of interpretation – Analytics depth http://doubleclix.wordpress.com/2011/03/01/the-education-of-a-machine-%E2%80%93-review-of-book-%E2%80%9Cfinal-jeopardy%E2%80%9D-by-stephen-baker/ http://doubleclix.wordpress.com/2011/02/17/watson-at-jeopardy-a-race-of-machines/
  • 12. [Diagram: data dimensions (Volume, Velocity, Variability, Variety, Connectedness, Context, Infer-ability, Model) set against sources and tools: logs, XML, files, …; Scribe, Flume, Storm, …; HDFS, Hadoop, NOSQL; SQL, Pig, Hive, .NET Dryad; BI tools; R, Mahout, hand-coded programs, …; Tableau, internal dashboards; various other tools] Decomplexify! Contextualize! Network! Reason! Infer! Ref: http://goo.gl/Mm83k
  • 13. Twitter § 200 million tweets/day § Peak 10,000/second § How would you handle the fire hose for social network analytics? AWS – 900 billion objects! Zynga § “Analytics company, not a gaming company!” § Harvests data: 15 TB/day § Test new features § Target advertising § 230 million players/month Storage § 4U box = 40 TB § 1 PB = 25 boxes! http://goo.gl/dcBsQ
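The figures on this slide can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the function and variable names are hypothetical, and it assumes decimal units (1 PB = 1000 TB), which is what the slide's 25-box figure implies.

```python
import math

TB_PER_PB = 1000          # assumption: decimal units, consistent with "1 PB = 25 boxes"
SECONDS_PER_DAY = 86_400


def boxes_needed(capacity_tb: float, box_tb: float = 40.0) -> int:
    """Number of 4U boxes (40 TB each, per the slide) needed to hold capacity_tb."""
    return math.ceil(capacity_tb / box_tb)


def average_rate_per_second(events_per_day: float) -> float:
    """Average event rate implied by a daily total."""
    return events_per_day / SECONDS_PER_DAY


# Storage: how many boxes make up one petabyte?
boxes_per_pb = boxes_needed(TB_PER_PB)

# Twitter: average tweets per second implied by 200 million tweets/day,
# and how far the quoted 10,000/second peak sits above that average.
avg_tweets_per_s = average_rate_per_second(200e6)
peak_to_avg = 10_000 / avg_tweets_per_s

# Zynga: days for a 15 TB/day harvest to fill one petabyte (no compression assumed).
days_to_fill_pb = TB_PER_PB / 15.0

print(boxes_per_pb, round(avg_tweets_per_s), round(peak_to_avg, 1), round(days_to_fill_pb, 1))
```

The gap between average and peak rate is the point of the "fire hose" question: a collection tier sized for the daily average would be overrun at peak.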
  • 14. • 6 billion messages per day • 2 PB (w/ compression) online • 6 PB w/ replication • 250 TB/month growth • HBase infrastructure
  • 15. eBay Extreme Analytics Architecture • Very systematic • Diagram speaks volumes! • 50 TB/day • 240 nodes, 84 PB • Teradata installation • Path analysis • A/B testing Ref: http://www.hpts.ws/sessions/2011HPTS-TomFastner.pdf
  • 16. [Diagram: Big Data pipeline. Stages: Collect • Store • Transform & Analyze • Model & Reason • Predict, Recommend & Visualize. Tools shown: Scribe, Flume, Storm, NOSQL, Cassandra, MongoDB, HBase, Neo4j, Hadoop, Pig/Hive, Splunk, BI Tools, Mahout, R, D3.js, Tableau, Dashboard] When I think of my own native land, In a moment I seem to be there; But, alas! recollection at hand Soon hurries me back to despair. - Cowper, The Solitude of Alexander Selkirk
  • 17. NOSQL • Key Value: In-memory (Memcached); Disk based (Redis, Tokyo Cabinet, Dynamo, Voldemort, Azure TS) • Column: SimpleDB, Google BigTable, HBase, Cassandra, HyperTable • Document: CouchDB, MongoDB, Lotus Domino, Riak • Graph: Neo4j, FlockDB, InfiniteGraph
  • 18. MapReduce • Data parallelism • Large installations (many ~5000 node clusters!)
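To make "data parallelism" concrete, here is a minimal single-machine sketch of the MapReduce pattern as a word count. It is not Hadoop code: the function names are hypothetical, a thread pool stands in for the cluster's worker nodes, and the in-memory `shuffle` step stands in for the grouping a real framework performs between the two phases.

```python
from __future__ import annotations

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(split: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(mapped: list[list[tuple[str, int]]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key across every split."""
    groups: dict[str, list[int]] = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(splits: list[str], workers: int = 4) -> dict[str, int]:
    # Each split is mapped independently of the others; that independence
    # is what lets the work spread across many nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, splits))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


splits = ["big data big pipeline", "data pipeline", "big"]
counts = word_count(splits)
print(counts)
```

Because no map task reads another task's split, adding nodes increases throughput without changing the program, which is why the pattern scales to clusters of the size the slide mentions.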
  • 19. Software as a Service • Platform as a Service • Infrastructure as a Service
  • 20.
  • 21. Amazon – Canonical Cloud • S3 – Blob storage • DynamoDB – NOSQL • EMR – Elastic MapReduce • EC2 – Compute • 1% of Internet traffic “Scalability is about building wider roads, not about building faster cars” – Steve Swartz http://blog.deepfield.net/2012/04/18/how-big-is-amazons-cloud/
  • 23. EC2 EC2 http://openclipart.org/detail/152311/internet-cloud-by-b.gaultier, http://openclipart.org/detail/17847
  • 24. • Social Network Analysis • Sentiment Analysis • Brand Strength • Citation/co-citation ≅ Followed by/Also Follows • Metrics – Network diameter, – Weak-ties, – Erdős-Rényi model & – Kronecker Graphs (Diagram labels: Tweets, Followers, Follow/Unfollow) http://www.oscon.com/oscon2012/public/schedule/detail/23130
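Two of the metrics named on this slide can be illustrated in plain Python: an Erdős-Rényi random graph G(n, p) and the network diameter, computed by breadth-first search. This is a sketch under stated assumptions, not the talk's method: the function names are hypothetical, follower links are treated as undirected edges, and the diameter is taken over the largest connected component.

```python
import random
from collections import deque
from itertools import combinations


def erdos_renyi(n, p, seed=0):
    """Erdős-Rényi G(n, p): each of the n*(n-1)/2 possible edges appears with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def bfs(adj, source):
    """Return (eccentricity of source, set of nodes reachable from source)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values()), set(dist)


def diameter(adj):
    """Longest shortest path within the largest connected component (first one on ties)."""
    seen = set()
    best_size, best_diameter = 0, 0
    for start in adj:
        if start in seen:
            continue
        _, component = bfs(adj, start)
        seen |= component
        if len(component) > best_size:
            best_size = len(component)
            best_diameter = max(bfs(adj, v)[0] for v in component)
    return best_diameter


graph = erdos_renyi(n=200, p=0.05, seed=42)
print("diameter of largest component:", diameter(graph))
```

Running one BFS per node costs O(V·(V+E)), which is fine for a toy graph; at the scale of a real follower graph the diameter is usually estimated by sampling rather than computed exactly.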
  • 25. Was it a vision, or a waking dream? Fled is that music:—do I wake or sleep? - Keats, Ode to a Nightingale