SlideShare une entreprise Scribd logo
1  sur  31
What Does ‘Big Data’ Mean
   and Who Will Win?


   Michael Stonebraker
The Meaning of Big Data - 3 V’s

• Big Volume
   — With simple (SQL) analytics

   — With complex (non-SQL) analytics



• Big Velocity
   — Drink from the fire hose



• Big Variety
   — Large number of diverse data sources to integrate




                                                         2
Big Volume - Little Analytics

• Well addressed by data warehouse crowd

• Who are pretty good at SQL analytics on
  — Hundreds of nodes

  — Petabytes of data



• I know of a dozen or so multi-petabyte data
  warehouses in production from multiple vendors on
  more than 100 nodes of iron….


                                                      3
Big Data - Big Analytics

• Complex math operations (machine learning, clustering,
  trend detection, ….)
   — In your market, the world of the “quants”

   — Mostly specified as linear algebra on array data



• A dozen or so common ‘inner loops’
   — Matrix multiply

   — QR decomposition

   — SVD decomposition

   — Linear regression




                                                           4
Big Data - Big Analytics
                     An Example


• Consider closing price on all trading days for the last
  5 years for two stocks A and B

• What is the covariance between the two time-
  series?
      (1/N) * sum (Ai - mean(A)) * (Bi - mean (B))




                                                            5
Now Make It Interesting …

• Do this for all pairs of 4000 stocks
   — The data is the following 4000 x 1000 matrix


Stock     t1     t2    t3    t4     t5   t6   t7   ….   t1000

S1
S2
…
S4000

     Hourly data?     All securities?

                                                                6
Array Answer

• Ignoring the (1/N) and subtracting off the
  means ….

           Stock * StockT

• Try this in SQL with some relational simulation
  of the stock array!!!!




                                                    7
Solution Options
                    R, SAS, Matlab, et al


• Weak or non-existent data management
   —   Do the correlation only for companies with revenue > $1B ?


• File system storage

• R doesn’t scale and is not a parallel system
   — Revolution does a bit better




                                                                    8
Solution Options
                    RDBMS alone


• SQL simulator (MadLib) is slooooow
   — And only does some of the required operations



• Coding operations as UDFs still requires you to
  simulate arrays on top of tables --- sloooow
   — And current UDF model not powerful enough to
     support iteration




                                                     9
Solution Options
                       R + RDBMS


• Have to extract and transform the data from RDBMS
  table to math package data format (e.g. data frames)
• ‘move the world’ nightmare
• Need to learn 2 systems
• And R still doesn’t scale and is not a parallel system




                                                           10
Solution Options


• New Array DBMS designed with this market in mind




                                                11
An Example Array Engine DB
                       Paradigm4/SciDB

• All-in-one: data management with massively scalable
  advanced analytics

• Data is updated; not overwritten
   —   Supports reproducibility for research and compliance
   —   Time-series data
   —   Scenario testing

• Supports uncertain data, provenance

• Open source

• Runs in cloud or private grid of commodity HW

                                                              12
Solution Options: Hadoop


• Simple analytics (Hive queries)
   — 100 times slower than a parallel DBMS

• Complex analytics (Mahout or roll-your-own)
   — 100 times slower than Scalapack

• Parallel programming
   — Parallel grep (great)

   — Everything else (awful)

• Hadoop lacks
   — Stateful computations

   — Point-to-point communication




                                                13
Solution Options: Hadoop


• Lot and lots and lots of people are piloting Hadoop
• Many will hit a scalability wall when they get to
  production
   — Unless they are doing parallel grep




• My prediction: the bloom will come off the rose




                                                        14
Big Velocity

• Sensor tagging everything of material
  significance – and reporting state in real
  time -- sends volumes through the roof
  —   Including shopping carts (customers)
  —   And retail items
  —   Marathon runners
  —   Library books
  —   Broken equipment

• Breaks all your infrastructure

• And it will just get worse

                                               15
Two Different Solutions

• Big pattern - little state (electronic trading)
   — Find me a ‘strawberry’ followed within
     100 msec by a ‘banana’

• Complex event processing (CEP) is focused
  on this problem
   — Patterns in a firehose




                                                    16
Two Different Solutions

• Big state - little pattern
   — For every customer location in a store,
     decide whether to offer a real-time
     coupon to an item at eye-level

• Looks like high performance OLTP
   — Want to update a database at very high
     speed

• Looks exactly like ad placement on the web

                                               17
My Suspicion

• Your have 10 Big state - little pattern
  problems for every one Big pattern – little
  state problem




                                                18
New OLTP

• You need to ingest a fire
  hose in real-time

• You need to perform high
  volume OLTP

• You often need real-time
  analytics




                              19
Solution Choices

• Old SQL
   — The elephants


• No SQL
   — 75 or so vendors giving up both SQL and ACID


• New SQL
   — Retain SQL and ACID but go fast with a new
     architecture




                                                    20
Why Not Use Old SQL?

• Sloooow
   — By a couple orders of magnitude



• Because of
   — Disk

   — Heavy-weight transactions

   — Multi-threading



• See “Through the OLTP Looking Glass”
   — VLDB 2007




                                         21
No SQL

• Give up SQL
   — Interesting to note that

     Cassandra and Mongo are
     moving to (yup) SQL

• Give up ACID
   — If you need ACID, this is a

     decision to tear your hair out
     by doing it in user code
   — Can you guarantee you won’t
     need ACID tomorrow?


                                      22
VoltDB: an example of New SQL

• A main memory SQL engine

• Open source

• Shared nothing, Linux, TCP/IP on jelly beans

• Light-weight transactions
   — Run-to-completion with no locking


• Single-threaded
   — Multi-core by splitting main memory


• About 100x RDBMS on TPC-C


                                                 23
Big Variety – Traditional Solution

• Construct a global schema
• Have a programmer understand each data source –
  and map local objects to the global schema
• Using some scripting language
   — Sometimes this is very expensive…..

   — I.e. ask the customer

• And then you have the problem of cleaning the
  data….
   — More scripts



• Works for 20 (or so) data sources

                                                    24
Big Variety

• Typical enterprise has 5000 operational systems
   — Only a few get into the data warehouse

   — What about the rest?



• And what about all the rest of your data?
   — Spreadsheets

   — Access data bases

   — Web pages



• And public data from the web?


                                                    25
The World of Data Integration

        the rest of your data




     enterprise         text
     data warehouse




                                26
Summary

• The rest of your data (public and private)
   — Is a treasure trove of incredibly valuable

     information

   —   Largely untapped




                                                  27
Data Tamer

• Integrate the rest of your data

• Has to
   — Be scalable to 1000s of sites

   — Deal with incomplete, conflicting, and incorrect data

   — Be incremental

      • Task is never done




                                                        28
Data Tamer in a Nutshell

• Apply machine learning and statistics to perform
  automatic:
   — Discovery of structure

   — Entity resolution

   — Transformation



• With a human assist if necessary
  — WYSIWYG tool (Wrangler)




                                                     29
Data Tamer

• MIT research project

• Looking for more integration problems
   — Wanna partner?




                                          30
Take away

• One size does not fit all

• Plan on (say) 6 DBMS architectures
   — Use the right tool for the job



• Elephants are not competitive
   — At anything

   — Have a bad ‘innovator’s dilemma’ problem




                                                31

Contenu connexe

Tendances

NoSQL A brief look at Apache Cassandra Distributed Database
NoSQL A brief look at Apache Cassandra Distributed DatabaseNoSQL A brief look at Apache Cassandra Distributed Database
NoSQL A brief look at Apache Cassandra Distributed DatabaseJoe Alex
 
Hybrid my sql_hadoop_datawarehouse
Hybrid my sql_hadoop_datawarehouseHybrid my sql_hadoop_datawarehouse
Hybrid my sql_hadoop_datawarehouseLaine Campbell
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvewKunal Khanna
 
Intro to Big Data and NoSQL
Intro to Big Data and NoSQLIntro to Big Data and NoSQL
Intro to Big Data and NoSQLDon Demcsak
 
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...Facultad de Informática UCM
 
SQL or NoSQL, that is the question!
SQL or NoSQL, that is the question!SQL or NoSQL, that is the question!
SQL or NoSQL, that is the question!Andraz Tori
 
Column Stores and Google BigQuery
Column Stores and Google BigQueryColumn Stores and Google BigQuery
Column Stores and Google BigQueryCsaba Toth
 
Oracle vs NoSQL – The good, the bad and the ugly
Oracle vs NoSQL – The good, the bad and the uglyOracle vs NoSQL – The good, the bad and the ugly
Oracle vs NoSQL – The good, the bad and the uglyJohn Kanagaraj
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Milind Bhandarkar
 
Scylla Summit 2018: Scylla Feature Talks - SSTables 3.0 File Format
Scylla Summit 2018: Scylla Feature Talks - SSTables 3.0 File FormatScylla Summit 2018: Scylla Feature Talks - SSTables 3.0 File Format
Scylla Summit 2018: Scylla Feature Talks - SSTables 3.0 File FormatScyllaDB
 
Sql vs NoSQL
Sql vs NoSQLSql vs NoSQL
Sql vs NoSQLRTigger
 
[db tech showcase Tokyo 2017] C34: Replacing Oracle Database at DBS Bank ~Ora...
[db tech showcase Tokyo 2017] C34: Replacing Oracle Database at DBS Bank ~Ora...[db tech showcase Tokyo 2017] C34: Replacing Oracle Database at DBS Bank ~Ora...
[db tech showcase Tokyo 2017] C34: Replacing Oracle Database at DBS Bank ~Ora...Insight Technology, Inc.
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overviewharithakannan
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Relational and non relational database 7
Relational and non relational database 7Relational and non relational database 7
Relational and non relational database 7abdulrahmanhelan
 
NoSQL databases and managing big data
NoSQL databases and managing big dataNoSQL databases and managing big data
NoSQL databases and managing big dataSteven Francia
 
Big Data and NoSQL in Microsoft-Land
Big Data and NoSQL in Microsoft-LandBig Data and NoSQL in Microsoft-Land
Big Data and NoSQL in Microsoft-LandAndrew Brust
 
Fb talk arch_summit
Fb talk arch_summitFb talk arch_summit
Fb talk arch_summitdrewz lin
 

Tendances (20)

NoSQL A brief look at Apache Cassandra Distributed Database
NoSQL A brief look at Apache Cassandra Distributed DatabaseNoSQL A brief look at Apache Cassandra Distributed Database
NoSQL A brief look at Apache Cassandra Distributed Database
 
Hybrid my sql_hadoop_datawarehouse
Hybrid my sql_hadoop_datawarehouseHybrid my sql_hadoop_datawarehouse
Hybrid my sql_hadoop_datawarehouse
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvew
 
Intro to Big Data and NoSQL
Intro to Big Data and NoSQLIntro to Big Data and NoSQL
Intro to Big Data and NoSQL
 
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
 
SQL or NoSQL, that is the question!
SQL or NoSQL, that is the question!SQL or NoSQL, that is the question!
SQL or NoSQL, that is the question!
 
Column Stores and Google BigQuery
Column Stores and Google BigQueryColumn Stores and Google BigQuery
Column Stores and Google BigQuery
 
Oracle vs NoSQL – The good, the bad and the ugly
Oracle vs NoSQL – The good, the bad and the uglyOracle vs NoSQL – The good, the bad and the ugly
Oracle vs NoSQL – The good, the bad and the ugly
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?
 
Scylla Summit 2018: Scylla Feature Talks - SSTables 3.0 File Format
Scylla Summit 2018: Scylla Feature Talks - SSTables 3.0 File FormatScylla Summit 2018: Scylla Feature Talks - SSTables 3.0 File Format
Scylla Summit 2018: Scylla Feature Talks - SSTables 3.0 File Format
 
Sql vs NoSQL
Sql vs NoSQLSql vs NoSQL
Sql vs NoSQL
 
Hadoop overview
Hadoop overviewHadoop overview
Hadoop overview
 
[db tech showcase Tokyo 2017] C34: Replacing Oracle Database at DBS Bank ~Ora...
[db tech showcase Tokyo 2017] C34: Replacing Oracle Database at DBS Bank ~Ora...[db tech showcase Tokyo 2017] C34: Replacing Oracle Database at DBS Bank ~Ora...
[db tech showcase Tokyo 2017] C34: Replacing Oracle Database at DBS Bank ~Ora...
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overview
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Relational and non relational database 7
Relational and non relational database 7Relational and non relational database 7
Relational and non relational database 7
 
NoSQL databases and managing big data
NoSQL databases and managing big dataNoSQL databases and managing big data
NoSQL databases and managing big data
 
Big Data and NoSQL in Microsoft-Land
Big Data and NoSQL in Microsoft-LandBig Data and NoSQL in Microsoft-Land
Big Data and NoSQL in Microsoft-Land
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Fb talk arch_summit
Fb talk arch_summitFb talk arch_summit
Fb talk arch_summit
 

Similaire à What Does Big Data Mean and Who Will Win

Big Data Cloud Meetup - Jan 29 2013 - Mike Stonebraker & Scott Jarr of VoltDB
Big Data Cloud Meetup - Jan 29 2013 - Mike Stonebraker & Scott Jarr of VoltDBBig Data Cloud Meetup - Jan 29 2013 - Mike Stonebraker & Scott Jarr of VoltDB
Big Data Cloud Meetup - Jan 29 2013 - Mike Stonebraker & Scott Jarr of VoltDBBigDataCloud
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewKonstantin V. Shvachko
 
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...Qian Lin
 
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
Teradata Partners Conference Oct 2014   Big Data Anti-PatternsTeradata Partners Conference Oct 2014   Big Data Anti-Patterns
Teradata Partners Conference Oct 2014 Big Data Anti-PatternsDouglas Moore
 
Big Data Rampage
Big Data RampageBig Data Rampage
Big Data RampageNiko Vuokko
 
OSDC 2018 | The operational brain: how new Paradigms like Machine Learning ar...
OSDC 2018 | The operational brain: how new Paradigms like Machine Learning ar...OSDC 2018 | The operational brain: how new Paradigms like Machine Learning ar...
OSDC 2018 | The operational brain: how new Paradigms like Machine Learning ar...NETWAYS
 
North Bay Ruby Meetup 101911
North Bay Ruby Meetup 101911North Bay Ruby Meetup 101911
North Bay Ruby Meetup 101911Ines Sombra
 
Webinar: The Future of SQL
Webinar: The Future of SQLWebinar: The Future of SQL
Webinar: The Future of SQLCrate.io
 
No SQL- The Future Of Data Storage
No SQL- The Future Of Data StorageNo SQL- The Future Of Data Storage
No SQL- The Future Of Data StorageBethmi Gunasekara
 
Pldc2012 monitoring-and-trending-with-mysql
Pldc2012 monitoring-and-trending-with-mysqlPldc2012 monitoring-and-trending-with-mysql
Pldc2012 monitoring-and-trending-with-mysqlradiocats
 
Database & Technology 1 | Andrew Holdsworth | Orace Database Performance.pdf
Database & Technology 1 | Andrew Holdsworth | Orace Database Performance.pdfDatabase & Technology 1 | Andrew Holdsworth | Orace Database Performance.pdf
Database & Technology 1 | Andrew Holdsworth | Orace Database Performance.pdfInSync2011
 
100 Exadata Implementations Later-Tim Fox
100 Exadata Implementations Later-Tim Fox100 Exadata Implementations Later-Tim Fox
100 Exadata Implementations Later-Tim FoxEnkitec
 
Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Zohar Elkayam
 
The Hadoop Ecosystem for Developers
The Hadoop Ecosystem for DevelopersThe Hadoop Ecosystem for Developers
The Hadoop Ecosystem for DevelopersZohar Elkayam
 
PayPal Big Data and MySQL Cluster
PayPal Big Data and MySQL ClusterPayPal Big Data and MySQL Cluster
PayPal Big Data and MySQL ClusterMat Keep
 
Big Data Overview Part 1
Big Data Overview Part 1Big Data Overview Part 1
Big Data Overview Part 1William Simms
 
Big data hadoop-no sql and graph db-final
Big data hadoop-no sql and graph db-finalBig data hadoop-no sql and graph db-final
Big data hadoop-no sql and graph db-finalramazan fırın
 

Similaire à What Does Big Data Mean and Who Will Win (20)

Big Data Cloud Meetup - Jan 29 2013 - Mike Stonebraker & Scott Jarr of VoltDB
Big Data Cloud Meetup - Jan 29 2013 - Mike Stonebraker & Scott Jarr of VoltDBBig Data Cloud Meetup - Jan 29 2013 - Mike Stonebraker & Scott Jarr of VoltDB
Big Data Cloud Meetup - Jan 29 2013 - Mike Stonebraker & Scott Jarr of VoltDB
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
 
Lean & agile with MongoDB
Lean & agile with MongoDBLean & agile with MongoDB
Lean & agile with MongoDB
 
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
 
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
Teradata Partners Conference Oct 2014   Big Data Anti-PatternsTeradata Partners Conference Oct 2014   Big Data Anti-Patterns
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
 
Big Data Rampage
Big Data RampageBig Data Rampage
Big Data Rampage
 
OSDC 2018 | The operational brain: how new Paradigms like Machine Learning ar...
OSDC 2018 | The operational brain: how new Paradigms like Machine Learning ar...OSDC 2018 | The operational brain: how new Paradigms like Machine Learning ar...
OSDC 2018 | The operational brain: how new Paradigms like Machine Learning ar...
 
North Bay Ruby Meetup 101911
North Bay Ruby Meetup 101911North Bay Ruby Meetup 101911
North Bay Ruby Meetup 101911
 
Webinar: The Future of SQL
Webinar: The Future of SQLWebinar: The Future of SQL
Webinar: The Future of SQL
 
No SQL- The Future Of Data Storage
No SQL- The Future Of Data StorageNo SQL- The Future Of Data Storage
No SQL- The Future Of Data Storage
 
Pldc2012 monitoring-and-trending-with-mysql
Pldc2012 monitoring-and-trending-with-mysqlPldc2012 monitoring-and-trending-with-mysql
Pldc2012 monitoring-and-trending-with-mysql
 
Database & Technology 1 | Andrew Holdsworth | Orace Database Performance.pdf
Database & Technology 1 | Andrew Holdsworth | Orace Database Performance.pdfDatabase & Technology 1 | Andrew Holdsworth | Orace Database Performance.pdf
Database & Technology 1 | Andrew Holdsworth | Orace Database Performance.pdf
 
Big data nyu
Big data nyuBig data nyu
Big data nyu
 
Is hadoop for you
Is hadoop for youIs hadoop for you
Is hadoop for you
 
100 Exadata Implementations Later-Tim Fox
100 Exadata Implementations Later-Tim Fox100 Exadata Implementations Later-Tim Fox
100 Exadata Implementations Later-Tim Fox
 
Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016
 
The Hadoop Ecosystem for Developers
The Hadoop Ecosystem for DevelopersThe Hadoop Ecosystem for Developers
The Hadoop Ecosystem for Developers
 
PayPal Big Data and MySQL Cluster
PayPal Big Data and MySQL ClusterPayPal Big Data and MySQL Cluster
PayPal Big Data and MySQL Cluster
 
Big Data Overview Part 1
Big Data Overview Part 1Big Data Overview Part 1
Big Data Overview Part 1
 
Big data hadoop-no sql and graph db-final
Big data hadoop-no sql and graph db-finalBig data hadoop-no sql and graph db-final
Big data hadoop-no sql and graph db-final
 

Plus de BigDataCloud

Webinar - Comparative Analysis of Cloud based Machine Learning Platforms
Webinar - Comparative Analysis of Cloud based Machine Learning PlatformsWebinar - Comparative Analysis of Cloud based Machine Learning Platforms
Webinar - Comparative Analysis of Cloud based Machine Learning PlatformsBigDataCloud
 
Crime Analysis & Prediction System
Crime Analysis & Prediction SystemCrime Analysis & Prediction System
Crime Analysis & Prediction SystemBigDataCloud
 
REAL-TIME RECOMMENDATION SYSTEMS
REAL-TIME RECOMMENDATION SYSTEMS REAL-TIME RECOMMENDATION SYSTEMS
REAL-TIME RECOMMENDATION SYSTEMS BigDataCloud
 
Cloud Computing Services
Cloud Computing ServicesCloud Computing Services
Cloud Computing ServicesBigDataCloud
 
Google Enterprise Cloud Platform - Resources & $2000 credit!
Google Enterprise Cloud Platform - Resources & $2000 credit!Google Enterprise Cloud Platform - Resources & $2000 credit!
Google Enterprise Cloud Platform - Resources & $2000 credit!BigDataCloud
 
Big Data in the Cloud - Solutions & Apps
Big Data in the Cloud - Solutions & AppsBig Data in the Cloud - Solutions & Apps
Big Data in the Cloud - Solutions & AppsBigDataCloud
 
Big Data Analytics in Motorola on the Google Cloud Platform
Big Data Analytics in Motorola on the Google Cloud PlatformBig Data Analytics in Motorola on the Google Cloud Platform
Big Data Analytics in Motorola on the Google Cloud PlatformBigDataCloud
 
Streak + Google Cloud Platform
Streak + Google Cloud PlatformStreak + Google Cloud Platform
Streak + Google Cloud PlatformBigDataCloud
 
Using Advanced Analyics to bring Business Value
Using Advanced Analyics to bring Business Value Using Advanced Analyics to bring Business Value
Using Advanced Analyics to bring Business Value BigDataCloud
 
Creating Business Value from Big Data, Analytics & Technology.
Creating Business Value from Big Data, Analytics & Technology.Creating Business Value from Big Data, Analytics & Technology.
Creating Business Value from Big Data, Analytics & Technology.BigDataCloud
 
Deep Learning for NLP (without Magic) - Richard Socher and Christopher Manning
Deep Learning for NLP (without Magic) - Richard Socher and Christopher ManningDeep Learning for NLP (without Magic) - Richard Socher and Christopher Manning
Deep Learning for NLP (without Magic) - Richard Socher and Christopher ManningBigDataCloud
 
Recommendation Engines - An Architectural Guide
Recommendation Engines - An Architectural GuideRecommendation Engines - An Architectural Guide
Recommendation Engines - An Architectural GuideBigDataCloud
 
Why Hadoop is the New Infrastructure for the CMO?
Why Hadoop is the New Infrastructure for the CMO?Why Hadoop is the New Infrastructure for the CMO?
Why Hadoop is the New Infrastructure for the CMO?BigDataCloud
 
Hadoop : A Foundation for Change - Milind Bhandarkar Chief Scientist, Pivotal
Hadoop : A Foundation for Change - Milind Bhandarkar Chief Scientist, PivotalHadoop : A Foundation for Change - Milind Bhandarkar Chief Scientist, Pivotal
Hadoop : A Foundation for Change - Milind Bhandarkar Chief Scientist, PivotalBigDataCloud
 
Big Data Cloud Meetup - Jan 24 2013 - Zettaset
Big Data Cloud Meetup - Jan 24 2013 - ZettasetBig Data Cloud Meetup - Jan 24 2013 - Zettaset
Big Data Cloud Meetup - Jan 24 2013 - ZettasetBigDataCloud
 
A Survey of Petabyte Scale Databases and Storage Systems Deployed at Facebook
A Survey of Petabyte Scale Databases and Storage Systems Deployed at FacebookA Survey of Petabyte Scale Databases and Storage Systems Deployed at Facebook
A Survey of Petabyte Scale Databases and Storage Systems Deployed at FacebookBigDataCloud
 
Big Data Analytics in a Heterogeneous World - Joydeep Das of Sybase
Big Data Analytics in a Heterogeneous World - Joydeep Das of SybaseBig Data Analytics in a Heterogeneous World - Joydeep Das of Sybase
Big Data Analytics in a Heterogeneous World - Joydeep Das of SybaseBigDataCloud
 
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentation
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentationBigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentation
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentationBigDataCloud
 
BigDataCloud Sept 8 2011 Meetup - Fail-Proofing Hadoop Clusters with Automati...
BigDataCloud Sept 8 2011 Meetup - Fail-Proofing Hadoop Clusters with Automati...BigDataCloud Sept 8 2011 Meetup - Fail-Proofing Hadoop Clusters with Automati...
BigDataCloud Sept 8 2011 Meetup - Fail-Proofing Hadoop Clusters with Automati...BigDataCloud
 
BigDataCloud Sept 8 2011 Meetup - Big Data Analytics for DoddFrank Regulation...
BigDataCloud Sept 8 2011 Meetup - Big Data Analytics for DoddFrank Regulation...BigDataCloud Sept 8 2011 Meetup - Big Data Analytics for DoddFrank Regulation...
BigDataCloud Sept 8 2011 Meetup - Big Data Analytics for DoddFrank Regulation...BigDataCloud
 

Plus de BigDataCloud (20)

Webinar - Comparative Analysis of Cloud based Machine Learning Platforms
Webinar - Comparative Analysis of Cloud based Machine Learning PlatformsWebinar - Comparative Analysis of Cloud based Machine Learning Platforms
Webinar - Comparative Analysis of Cloud based Machine Learning Platforms
 
Crime Analysis & Prediction System
Crime Analysis & Prediction SystemCrime Analysis & Prediction System
Crime Analysis & Prediction System
 
REAL-TIME RECOMMENDATION SYSTEMS
REAL-TIME RECOMMENDATION SYSTEMS REAL-TIME RECOMMENDATION SYSTEMS
REAL-TIME RECOMMENDATION SYSTEMS
 
Cloud Computing Services
Cloud Computing ServicesCloud Computing Services
Cloud Computing Services
 
Google Enterprise Cloud Platform - Resources & $2000 credit!
Google Enterprise Cloud Platform - Resources & $2000 credit!Google Enterprise Cloud Platform - Resources & $2000 credit!
Google Enterprise Cloud Platform - Resources & $2000 credit!
 
Big Data in the Cloud - Solutions & Apps
Big Data in the Cloud - Solutions & AppsBig Data in the Cloud - Solutions & Apps
Big Data in the Cloud - Solutions & Apps
 
Big Data Analytics in Motorola on the Google Cloud Platform
Big Data Analytics in Motorola on the Google Cloud PlatformBig Data Analytics in Motorola on the Google Cloud Platform
Big Data Analytics in Motorola on the Google Cloud Platform
 
Streak + Google Cloud Platform
Streak + Google Cloud PlatformStreak + Google Cloud Platform
Streak + Google Cloud Platform
 
Using Advanced Analyics to bring Business Value
Using Advanced Analyics to bring Business Value Using Advanced Analyics to bring Business Value
Using Advanced Analyics to bring Business Value
 
Creating Business Value from Big Data, Analytics & Technology.
Creating Business Value from Big Data, Analytics & Technology.Creating Business Value from Big Data, Analytics & Technology.
Creating Business Value from Big Data, Analytics & Technology.
 
Deep Learning for NLP (without Magic) - Richard Socher and Christopher Manning
Deep Learning for NLP (without Magic) - Richard Socher and Christopher ManningDeep Learning for NLP (without Magic) - Richard Socher and Christopher Manning
Deep Learning for NLP (without Magic) - Richard Socher and Christopher Manning
 
Recommendation Engines - An Architectural Guide
Recommendation Engines - An Architectural GuideRecommendation Engines - An Architectural Guide
Recommendation Engines - An Architectural Guide
 
Why Hadoop is the New Infrastructure for the CMO?
Why Hadoop is the New Infrastructure for the CMO?Why Hadoop is the New Infrastructure for the CMO?
Why Hadoop is the New Infrastructure for the CMO?
 
Hadoop : A Foundation for Change - Milind Bhandarkar Chief Scientist, Pivotal
Hadoop : A Foundation for Change - Milind Bhandarkar Chief Scientist, PivotalHadoop : A Foundation for Change - Milind Bhandarkar Chief Scientist, Pivotal
Hadoop : A Foundation for Change - Milind Bhandarkar Chief Scientist, Pivotal
 
Big Data Cloud Meetup - Jan 24 2013 - Zettaset
Big Data Cloud Meetup - Jan 24 2013 - ZettasetBig Data Cloud Meetup - Jan 24 2013 - Zettaset
Big Data Cloud Meetup - Jan 24 2013 - Zettaset
 
A Survey of Petabyte Scale Databases and Storage Systems Deployed at Facebook
A Survey of Petabyte Scale Databases and Storage Systems Deployed at FacebookA Survey of Petabyte Scale Databases and Storage Systems Deployed at Facebook
A Survey of Petabyte Scale Databases and Storage Systems Deployed at Facebook
 
Big Data Analytics in a Heterogeneous World - Joydeep Das of Sybase
Big Data Analytics in a Heterogeneous World - Joydeep Das of SybaseBig Data Analytics in a Heterogeneous World - Joydeep Das of Sybase
Big Data Analytics in a Heterogeneous World - Joydeep Das of Sybase
 
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentation
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentationBigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentation
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentation
 
BigDataCloud Sept 8 2011 Meetup - Fail-Proofing Hadoop Clusters with Automati...
BigDataCloud Sept 8 2011 Meetup - Fail-Proofing Hadoop Clusters with Automati...BigDataCloud Sept 8 2011 Meetup - Fail-Proofing Hadoop Clusters with Automati...
BigDataCloud Sept 8 2011 Meetup - Fail-Proofing Hadoop Clusters with Automati...
 
BigDataCloud Sept 8 2011 Meetup - Big Data Analytics for DoddFrank Regulation...
BigDataCloud Sept 8 2011 Meetup - Big Data Analytics for DoddFrank Regulation...BigDataCloud Sept 8 2011 Meetup - Big Data Analytics for DoddFrank Regulation...
BigDataCloud Sept 8 2011 Meetup - Big Data Analytics for DoddFrank Regulation...
 

Dernier

A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 

Dernier (20)

A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

What Does Big Data Mean and Who Will Win

  • 1. What Does ‘Big Data’ Mean and Who Will Win? Michael Stonebraker
  • 2. The Meaning of Big Data - 3 V’s • Big Volume — With simple (SQL) analytics — With complex (non-SQL) analytics • Big Velocity — Drink from the fire hose • Big Variety — Large number of diverse data sources to integrate 2
  • 3. Big Volume - Little Analytics • Well addressed by data warehouse crowd • Who are pretty good at SQL analytics on — Hundreds of nodes — Petabytes of data • I know of a dozen or so multi-petabyte data warehouses in production from multiple vendors on more than 100 nodes of iron…. 3
  • 4. Big Data - Big Analytics • Complex math operations (machine learning, clustering, trend detection, ….) — In your market, the world of the “quants” — Mostly specified as linear algebra on array data • A dozen or so common ‘inner loops’ — Matrix multiply — QR decomposition — SVD decomposition — Linear regression 4
  • 5. Big Data - Big Analytics An Example • Consider closing price on all trading days for the last 5 years for two stocks A and B • What is the covariance between the two time- series? (1/N) * sum (Ai - mean(A)) * (Bi - mean (B)) 5
  • 6. Now Make It Interesting … • Do this for all pairs of 4000 stocks — The data is the following 4000 x 1000 matrix Stock t1 t2 t3 t4 t5 t6 t7 …. t1000 S1 S2 … S4000 Hourly data? All securities? 6
  • 7. Array Answer • Ignoring the (1/N) and subtracting off the means …. Stock * StockT • Try this in SQL with some relational simulation of the stock array!!!! 7
  • 8. Solution Options R, SAS, Matlab, et al • Weak or non-existent data management — Do the correlation only for companies with revenue > $1B ? • File system storage • R doesn’t scale and is not a parallel system — Revolution does a bit better 8
  • 9. Solution Options RDBMS alone • SQL simulator (MadLib) is slooooow — And only does some of the required operations • Coding operations as UDFs still requires you to simulate arrays on top of tables --- sloooow — And current UDF model not powerful enough to support iteration 9
  • 10. Solution Options R + RDBMS • Have to extract and transform the data from RDBMS table to math package data format (e.g. data frames) • ‘move the world’ nightmare • Need to learn 2 systems • And R still doesn’t scale and is not a parallel system 10
  • 11. Solution Options • New Array DBMS designed with this market in mind 11
  • 12. An Example Array Engine DB Paradigm4/SciDB • All-in-one: data management with massively scalable advanced analytics • Data is updated; not overwritten — Supports reproducibility for research and compliance — Time-series data — Scenario testing • Supports uncertain data, provenance • Open source • Runs in cloud or private grid of commodity HW 12
  • 13. Solution Options: Hadoop • Simple analytics (Hive queries) — 100 times slower than a parallel DBMS • Complex analytics (Mahout or roll-your-own) — 100 times slower than Scalapack • Parallel programming — Parallel grep (great) — Everything else (awful) • Hadoop lacks — Stateful computations — Point-to-point communication 13
  • 14. Solution Options: Hadoop • Lot and lots and lots of people are piloting Hadoop • Many will hit a scalability wall when they get to production — Unless they are doing parallel grep • My prediction: the bloom will come off the rose 14
  • 15. Big Velocity • Sensor tagging everything of material significance – and reporting state in real time -- sends volumes through the roof — Including shopping carts (customers) — And retail items — Marathon runners — Library books — Broken equipment • Breaks all your infrastructure • And it will just get worse 15
  • 16. Two Different Solutions • Big pattern - little state (electronic trading) — Find me a ‘strawberry’ followed within 100 msec by a ‘banana’ • Complex event processing (CEP) is focused on this problem — Patterns in a firehose 16
  • 17. Two Different Solutions • Big state - little pattern — For every customer location in a store, decide whether to offer a real-time coupon to an item at eye-level • Looks like high performance OLTP — Want to update a database at very high speed • Looks exactly like ad placement on the web 17
  • 18. My Suspicion • Your have 10 Big state - little pattern problems for every one Big pattern – little state problem 18
  • 19. New OLTP • You need to ingest a fire hose in real-time • You need to perform high volume OLTP • You often need real-time analytics 19
  • 20. Solution Choices • Old SQL — The elephants • No SQL — 75 or so vendors giving up both SQL and ACID • New SQL — Retain SQL and ACID but go fast with a new architecture 20
  • 21. Why Not Use Old SQL? • Sloooow — By a couple orders of magnitude • Because of — Disk — Heavy-weight transactions — Multi-threading • See “Through the OLTP Looking Glass” — VLDB 2007 21
  • 22. No SQL • Give up SQL — Interesting to note that Cassandra and Mongo are moving to (yup) SQL • Give up ACID — If you need ACID, this is a decision to tear your hair out by doing it in user code — Can you guarantee you won’t need ACID tomorrow? 22
  • 23. VoltDB: an example of New SQL • A main memory SQL engine • Open source • Shared nothing, Linux, TCP/IP on jelly beans • Light-weight transactions — Run-to-completion with no locking • Single-threaded — Multi-core by splitting main memory • About 100x RDBMS on TPC-C 23
  • 24. Big Variety – Traditional Solution • Construct a global schema • Have a programmer understand each data source – and map local objects to the global schema • Using some scripting language — Sometimes this is very expensive….. — I.e. ask the customer • And then you have the problem of cleaning the data…. — More scripts • Works for 20 (or so) data sources 24
  • 25. Big Variety • Typical enterprise has 5000 operational systems — Only a few get into the data warehouse — What about the rest? • And what about all the rest of your data? — Spreadsheets — Access data bases — Web pages • And public data from the web? 25
  • 26. The World of Data Integration the rest of your data enterprise text data warehouse 26
  • 27. Summary • The rest of your data (public and private) — Is a treasure trove of incredibly valuable information — Largely untapped 27
  • 28. Data Tamer • Integrate the rest of your data • Has to — Be scalable to 1000s of sites — Deal with incomplete, conflicting, and incorrect data — Be incremental • Task is never done 28
  • 29. Data Tamer in a Nutshell • Apply machine learning and statistics to perform automatic: — Discovery of structure — Entity resolution — Transformation • With a human assist if necessary — WYSIWYG tool (Wrangler) 29
  • 30. Data Tamer • MIT research project • Looking for more integration problems — Wanna partner? 30
  • 31. Take away • One size does not fit all • Plan on (say) 6 DBMS architectures — Use the right tool for the job • Elephants are not competitive — At anything — Have a bad ‘innovator’s dilemma’ problem 31