Scalability in Hadoop and Similar Systems
©MapR Technologies - Confidential              1
Big is the next big thing

     Big data and Hadoop are exploding


     Companies are being funded


     Books are being written


     Applications sprouting up everywhere




Slow Motion Explosion




Hadoop Explosion




Why Now?

        But Moore’s law has applied for a long time


        Why is Hadoop exploding now?


        Why not 10 years ago?


        Why not 20?




9/18/2012
Size Matters, but …

     If it were just availability of data then existing big companies would
      adopt big data technology first


                       They didn’t




Or Maybe Cost

     If it were just a net positive value then finance companies should
      adopt first because they have higher opportunity value / byte


                       They didn’t




Backwards adoption

     Under almost any threshold argument startups would not adopt
      big data technology first


                       They did




Everywhere at Once?

     Something very strange is happening
       –   Big data is being applied at many different scales
       –   At many value scales
       –   By large companies and small


                                    Why?




The Conventional Answer
More data is being produced more quickly
Data sizes are bigger than even a very large computer can hold
Cost to create and store continues to decrease




Analytics Scaling Laws

     Analytics scaling is all about the 80-20 rule
       –   Big gains for little initial effort
       –   Rapidly diminishing returns
     The key to net value is how costs scale
       –   Old school – exponential scaling
       –   Big data – linear scaling, low constant
     Cost/performance has changed radically
       –   IF you can use many commodity boxes
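
A minimal sketch of the net-value argument above, assuming a saturating value curve and two made-up cost curves. The functional forms and every constant (`half_scale`, `base`, `rate`, `fixed`, `per_unit`) are illustrative assumptions, not figures from the talk; only the axis ranges (value 0 to 1, scale 0 to 2,000) follow the charts in the deck.

```python
import math

def value(scale, half_scale=200.0):
    # Assumed 80-20 style curve: big early gains, then diminishing returns toward 1.
    return scale / (scale + half_scale)

def exponential_cost(scale, base=0.01, rate=0.004):
    # "Old school" cost: cheap at first, then explodes (assumed constants).
    return base * math.exp(rate * scale)

def linear_cost(scale, fixed=0.05, per_unit=0.0001):
    # "Big data" cost: an up-front cost plus a low constant per unit of scale (assumed).
    return fixed + per_unit * scale

def best_scale(cost, scales):
    # Scale with the highest net value, where net value = value - cost.
    return max(scales, key=lambda s: value(s) - cost(s))

scales = range(0, 2001, 10)
old_optimum = best_scale(exponential_cost, scales)
new_optimum = best_scale(linear_cost, scales)
print("optimum with exponential cost:", old_optimum)
print("optimum with linear cost:", new_optimum)
```

With these assumed numbers the linear cost is the more expensive option at very small scale, yet its net-value optimum sits at a much larger scale than the exponential one.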




You’re kidding, people do that?


     We didn’t know that!
     We should have known that
     We knew that




[Figure: Value (0 to 1) versus Scale (0 to 2,000). Labels along the curve, from lowest value to highest: Anybody with eyes; Intern with a spreadsheet; In-house analytics; Industry-wide data consortium; NSA, non-proliferation]
[Figure: Value (0 to 1) versus Scale (0 to 2,000)]
Net value optimum has a sharp peak well before maximum effort
But scaling laws are changing both slope and shape




[Figure: Value (0 to 1) versus Scale (0 to 2,000)]
More than just a little
[Figure: Value (0 to 1) versus Scale (0 to 2,000)]
They are changing a LOT!
[Figure: Value (0 to 1) versus Scale (0 to 2,000)]
Initially, linear cost scaling actually makes things worse
A tipping point is reached and things change radically …
Pre-requisites for Tipping

     To reach the tipping point,
     Algorithms must scale out horizontally
       –   On commodity hardware
       –   That can and will fail
     Data practice must change
       –   Denormalized is the new black
       –   Flexible data dictionaries are the rule
       –   Structured data becomes rare




Yeah… but wait




The Standard Sort of Model

     People talk about the law of large numbers as if it were …



     Well, as if it were a law


     It’s not …


     It is a context and assumption dependent theorem




What if …

     These assumptions are:


     Changes have a
       –   stationary,
       –   independent,
       –   finite variance distribution




     What happens if these assumptions are wrong?


     And which of them is really wrong?

For Example
[Figure: Stuff versus Time]
End point has nice tractable distribution
[Figure: Stuff versus Time]
What if the Assumptions are Wrong?

     Take the finite variance assumption as a simple example


     Dropping it leads to Lévy stable distributions


     Like the Cauchy distribution
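
A small simulation sketch of what changes when the finite variance assumption fails. It compares the spread of sample means for standard normal draws with the spread for standard Cauchy draws; the sample sizes, repeat count, seed, and helper names are arbitrary choices for illustration.

```python
import math
import random
import statistics

def normal_draw(rng):
    # Finite variance: the law of large numbers applies.
    return rng.gauss(0.0, 1.0)

def cauchy_draw(rng):
    # Standard Cauchy via the inverse CDF; its variance is infinite.
    return math.tan(math.pi * (rng.random() - 0.5))

def sample_means(draw, n_samples, n_repeats=200, seed=0):
    # Repeat the experiment n_repeats times, keeping the mean of each run.
    rng = random.Random(seed)
    return [statistics.fmean(draw(rng) for _ in range(n_samples))
            for _ in range(n_repeats)]

def iqr(values):
    # Interquartile range: a spread measure that stays finite for heavy tails.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

spread = {}
for n in (10, 1000):
    spread[("normal", n)] = iqr(sample_means(normal_draw, n))
    spread[("cauchy", n)] = iqr(sample_means(cauchy_draw, n))
    print(n, spread[("normal", n)], spread[("cauchy", n)])
```

The mean of n independent standard Cauchy draws is itself standard Cauchy, so averaging more samples does not narrow its spread, while the normal sample mean tightens roughly as 1/sqrt(n).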




Is it Really Different?




[Figure: Stuff versus Time]
What About Real Life?




But is it Really Infinite Variance?

     Or are there other kinds of phenomena that show this?


     What about the independence assumption?



     What if the supposedly independent components of the system
      communicate?


     Like we do. Every day. All the time.




Why the Difference?


[Diagram labels: The space of all things that change; Infinite variance; The space of interacting things; Law of large numbers; Interacting agents]

Apologies and credit to Simon DeDeo, SFI
What Happens with Interactions

     Social phenomena defeat the law of large numbers
     Distributions are well modeled by “rich get richer” processes
       –   Pitman-Yor process, Indian buffet process
     Limiting distributions are heavy tailed, power law
     We see these distributions everywhere
       –   price of cotton in the 19th century
       –   word frequencies
       –   popularity of Github projects
       –   equity pricing and volumes
       –   sizes of cities
       –   popularity of web-sites
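
As one illustration of a "rich get richer" process, here is a sketch of the Chinese restaurant construction of the Pitman-Yor process: each new customer joins an existing table with probability proportional to its size minus a discount, or opens a new table. The function name and the parameter values (`discount=0.5`, `concentration=1.0`, 5,000 customers) are assumptions chosen for illustration.

```python
import random
import statistics

def pitman_yor_table_sizes(n_customers, discount=0.5, concentration=1.0, seed=0):
    # Requires 0 <= discount < 1 and concentration > -discount.
    rng = random.Random(seed)
    sizes = []  # sizes[k] = number of customers at table k
    for n in range(n_customers):  # n customers are already seated
        new_table_weight = concentration + discount * len(sizes)
        r = rng.random() * (n + concentration)  # total weight is n + concentration
        if r < new_table_weight:
            sizes.append(1)
            continue
        r -= new_table_weight
        for k, size in enumerate(sizes):
            r -= size - discount  # popular tables get proportionally more weight
            if r < 0:
                sizes[k] += 1
                break
        else:
            sizes[-1] += 1  # guard against floating-point round-off
    return sorted(sizes, reverse=True)

sizes = pitman_yor_table_sizes(5000)
print("tables:", len(sizes))
print("largest tables:", sizes[:5])
print("median table size:", statistics.median(sizes))
```

A few tables end up holding a large share of the customers while most tables stay tiny, which is the heavy-tailed pattern described above.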


What are the Implications?



In a Nutshell

     Scalability is much more important than we thought


     Mashups are more important than we thought


     Network effects are more important than we thought


     Exploration is more important than we thought


     Hadoop-style linear scaling must be mixed with ad hoc analysis



Thank You




whoami?

     Ted Dunning
       –   @ted_dunning
       –   tdunning@maprtech.com (MapR distribution for Hadoop)
       –   tdunning@apache.com (Mahout, Hadoop, Lucene, Zookeeper, Drill)
       –   ted.dunning@gmail.com (me)


     More info:

       http://www.mapr.com/company/events/hadoop-in-finance-2012





Contenu connexe

Tendances

Bda-dunning-2012-12-06
Bda-dunning-2012-12-06Bda-dunning-2012-12-06
Bda-dunning-2012-12-06Ted Dunning
 
Real-time and Long-time Together
Real-time and Long-time TogetherReal-time and Long-time Together
Real-time and Long-time TogetherMapR Technologies
 
Strata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionStrata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionTed Dunning
 
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ..."Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...Edge AI and Vision Alliance
 
"How to Test and Validate an Automated Driving System," a Presentation from M...
"How to Test and Validate an Automated Driving System," a Presentation from M..."How to Test and Validate an Automated Driving System," a Presentation from M...
"How to Test and Validate an Automated Driving System," a Presentation from M...Edge AI and Vision Alliance
 
Talk on commercialising space data
Talk on commercialising space data Talk on commercialising space data
Talk on commercialising space data Alison B. Lowndes
 
Optimization of Resource Provisioning Cost in Cloud Computing
Optimization of Resource Provisioning Cost in Cloud Computing Optimization of Resource Provisioning Cost in Cloud Computing
Optimization of Resource Provisioning Cost in Cloud Computing Sivadon Chaisiri
 
“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...
“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...
“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...Edge AI and Vision Alliance
 
New Media Services from a Mobile Chipset Vendor and Standardization Perspective
New Media Services from a Mobile Chipset Vendor and Standardization PerspectiveNew Media Services from a Mobile Chipset Vendor and Standardization Perspective
New Media Services from a Mobile Chipset Vendor and Standardization PerspectiveFörderverein Technische Fakultät
 
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co..."New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...Edge AI and Vision Alliance
 
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ..."Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...Edge AI and Vision Alliance
 
Implementing AI: High Performance Architectures: A Universal Accelerated Comp...
Implementing AI: High Performance Architectures: A Universal Accelerated Comp...Implementing AI: High Performance Architectures: A Universal Accelerated Comp...
Implementing AI: High Performance Architectures: A Universal Accelerated Comp...KTN
 

Tendances (15)

Dunning strata-2012-27-02
Dunning strata-2012-27-02Dunning strata-2012-27-02
Dunning strata-2012-27-02
 
Bda-dunning-2012-12-06
Bda-dunning-2012-12-06Bda-dunning-2012-12-06
Bda-dunning-2012-12-06
 
Hcj 2013-01-21
Hcj 2013-01-21Hcj 2013-01-21
Hcj 2013-01-21
 
Real-time and Long-time Together
Real-time and Long-time TogetherReal-time and Long-time Together
Real-time and Long-time Together
 
Strata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionStrata 2014 Anomaly Detection
Strata 2014 Anomaly Detection
 
Devoxx Real-Time Learning
Devoxx Real-Time LearningDevoxx Real-Time Learning
Devoxx Real-Time Learning
 
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ..."Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...
 
"How to Test and Validate an Automated Driving System," a Presentation from M...
"How to Test and Validate an Automated Driving System," a Presentation from M..."How to Test and Validate an Automated Driving System," a Presentation from M...
"How to Test and Validate an Automated Driving System," a Presentation from M...
 
Talk on commercialising space data
Talk on commercialising space data Talk on commercialising space data
Talk on commercialising space data
 
Optimization of Resource Provisioning Cost in Cloud Computing
Optimization of Resource Provisioning Cost in Cloud Computing Optimization of Resource Provisioning Cost in Cloud Computing
Optimization of Resource Provisioning Cost in Cloud Computing
 
“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...
“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...
“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...
 
New Media Services from a Mobile Chipset Vendor and Standardization Perspective
New Media Services from a Mobile Chipset Vendor and Standardization PerspectiveNew Media Services from a Mobile Chipset Vendor and Standardization Perspective
New Media Services from a Mobile Chipset Vendor and Standardization Perspective
 
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co..."New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
 
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ..."Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...
 
Implementing AI: High Performance Architectures: A Universal Accelerated Comp...
Implementing AI: High Performance Architectures: A Universal Accelerated Comp...Implementing AI: High Performance Architectures: A Universal Accelerated Comp...
Implementing AI: High Performance Architectures: A Universal Accelerated Comp...
 

En vedette

Drill at the Chug 9-19-12
Drill at the Chug 9-19-12Drill at the Chug 9-19-12
Drill at the Chug 9-19-12Ted Dunning
 
Intelligent Search
Intelligent SearchIntelligent Search
Intelligent SearchTed Dunning
 
Graphlab dunning-clustering
Graphlab dunning-clusteringGraphlab dunning-clustering
Graphlab dunning-clusteringTed Dunning
 
Transactional Data Mining
Transactional Data MiningTransactional Data Mining
Transactional Data MiningTed Dunning
 
Research Support in an Open Science Framework - Ron Dekker, seconded national...
Research Support in an Open Science Framework - Ron Dekker, seconded national...Research Support in an Open Science Framework - Ron Dekker, seconded national...
Research Support in an Open Science Framework - Ron Dekker, seconded national...Mari Tinnemans
 
R Cheat Sheet – Data Management
R Cheat Sheet – Data ManagementR Cheat Sheet – Data Management
R Cheat Sheet – Data ManagementDr. Volkan OBAN
 
Big Data Certification
Big Data CertificationBig Data Certification
Big Data CertificationExperfy
 
Framework for Data Informed Science Policy
Framework for Data Informed Science PolicyFramework for Data Informed Science Policy
Framework for Data Informed Science PolicyBrian Wee
 
InfosysPublicServices - Healthcare SOA | Program Management Framework
InfosysPublicServices - Healthcare SOA | Program Management FrameworkInfosysPublicServices - Healthcare SOA | Program Management Framework
InfosysPublicServices - Healthcare SOA | Program Management FrameworkInfosys
 
The Framework Program for the Sustainable Management of La Plata Basin's Wate...
The Framework Program for the Sustainable Management of La Plata Basin's Wate...The Framework Program for the Sustainable Management of La Plata Basin's Wate...
The Framework Program for the Sustainable Management of La Plata Basin's Wate...Iwl Pcu
 
Open Source Framework for Deploying Data Science Models and Cloud Based Appli...
Open Source Framework for Deploying Data Science Models and Cloud Based Appli...Open Source Framework for Deploying Data Science Models and Cloud Based Appli...
Open Source Framework for Deploying Data Science Models and Cloud Based Appli...ETCenter
 
Business Analyst Training - Gain America
Business Analyst Training - Gain AmericaBusiness Analyst Training - Gain America
Business Analyst Training - Gain AmericaGainAmerica
 
Scientific Data Cataloging Framework
Scientific Data Cataloging FrameworkScientific Data Cataloging Framework
Scientific Data Cataloging FrameworkSupun Nakandala
 
Open Science Framework (OSF)
Open Science Framework (OSF)Open Science Framework (OSF)
Open Science Framework (OSF)Andrew Sallans
 
Big Data University DS0103EN Certificate _ Data Science Methodology
Big Data University DS0103EN Certificate _ Data Science MethodologyBig Data University DS0103EN Certificate _ Data Science Methodology
Big Data University DS0103EN Certificate _ Data Science MethodologySumit Mattey
 

En vedette (20)

Iss
IssIss
Iss
 
Het Iss
Het IssHet Iss
Het Iss
 
Drill at the Chug 9-19-12
Drill at the Chug 9-19-12Drill at the Chug 9-19-12
Drill at the Chug 9-19-12
 
Intelligent Search
Intelligent SearchIntelligent Search
Intelligent Search
 
Graphlab dunning-clustering
Graphlab dunning-clusteringGraphlab dunning-clustering
Graphlab dunning-clustering
 
Transactional Data Mining
Transactional Data MiningTransactional Data Mining
Transactional Data Mining
 
Research Support in an Open Science Framework - Ron Dekker, seconded national...
Research Support in an Open Science Framework - Ron Dekker, seconded national...Research Support in an Open Science Framework - Ron Dekker, seconded national...
Research Support in an Open Science Framework - Ron Dekker, seconded national...
 
R Cheat Sheet – Data Management
R Cheat Sheet – Data ManagementR Cheat Sheet – Data Management
R Cheat Sheet – Data Management
 
Big Data Certification
Big Data CertificationBig Data Certification
Big Data Certification
 
Framework for Data Informed Science Policy
Framework for Data Informed Science PolicyFramework for Data Informed Science Policy
Framework for Data Informed Science Policy
 
InfosysPublicServices - Healthcare SOA | Program Management Framework
InfosysPublicServices - Healthcare SOA | Program Management FrameworkInfosysPublicServices - Healthcare SOA | Program Management Framework
InfosysPublicServices - Healthcare SOA | Program Management Framework
 
The Framework Program for the Sustainable Management of La Plata Basin's Wate...
The Framework Program for the Sustainable Management of La Plata Basin's Wate...The Framework Program for the Sustainable Management of La Plata Basin's Wate...
The Framework Program for the Sustainable Management of La Plata Basin's Wate...
 
R Cheat Sheet
R Cheat SheetR Cheat Sheet
R Cheat Sheet
 
Open Source Framework for Deploying Data Science Models and Cloud Based Appli...
Open Source Framework for Deploying Data Science Models and Cloud Based Appli...Open Source Framework for Deploying Data Science Models and Cloud Based Appli...
Open Source Framework for Deploying Data Science Models and Cloud Based Appli...
 
Big data framework
Big data frameworkBig data framework
Big data framework
 
Business Analyst Training - Gain America
Business Analyst Training - Gain AmericaBusiness Analyst Training - Gain America
Business Analyst Training - Gain America
 
Scientific Data Cataloging Framework
Scientific Data Cataloging FrameworkScientific Data Cataloging Framework
Scientific Data Cataloging Framework
 
Program Mgmt Framework
Program Mgmt FrameworkProgram Mgmt Framework
Program Mgmt Framework
 
Open Science Framework (OSF)
Open Science Framework (OSF)Open Science Framework (OSF)
Open Science Framework (OSF)
 
Big Data University DS0103EN Certificate _ Data Science Methodology
Big Data University DS0103EN Certificate _ Data Science MethodologyBig Data University DS0103EN Certificate _ Data Science Methodology
Big Data University DS0103EN Certificate _ Data Science Methodology
 

Similaire à Chicago finance-big-data

Big data, why now?
Big data, why now?Big data, why now?
Big data, why now?Ted Dunning
 
Chicago Hadoop in Finance - Ted Dunning
Chicago Hadoop in Finance - Ted DunningChicago Hadoop in Finance - Ted Dunning
Chicago Hadoop in Finance - Ted DunningMapR Technologies
 
What is the past future tense of data?
What is the past future tense of data?What is the past future tense of data?
What is the past future tense of data?Ted Dunning
 
Steve Jenkins - Business Opportunities for Big Data in the Enterprise
Steve Jenkins - Business Opportunities for Big Data in the Enterprise Steve Jenkins - Business Opportunities for Big Data in the Enterprise
Steve Jenkins - Business Opportunities for Big Data in the Enterprise WeAreEsynergy
 
EMC's IT's Cloud Transformation, Thomas Becker, EMC
EMC's IT's Cloud Transformation, Thomas Becker, EMCEMC's IT's Cloud Transformation, Thomas Becker, EMC
EMC's IT's Cloud Transformation, Thomas Becker, EMCCloudOps Summit
 
How to Plan and Budget for 2013 with Cloud in Mind
How to Plan and Budget for 2013 with Cloud in MindHow to Plan and Budget for 2013 with Cloud in Mind
How to Plan and Budget for 2013 with Cloud in MindBluelock
 
predictive-analytics-san-diego-2013-02-21
predictive-analytics-san-diego-2013-02-21predictive-analytics-san-diego-2013-02-21
predictive-analytics-san-diego-2013-02-21Ted Dunning
 
Data Warehouse Evolution Roadshow
Data Warehouse Evolution RoadshowData Warehouse Evolution Roadshow
Data Warehouse Evolution RoadshowMapR Technologies
 
2012 Future of Cloud Computing
2012 Future of Cloud Computing 2012 Future of Cloud Computing
2012 Future of Cloud Computing Michael Skok
 
Dell panel cloud computing - small biz summit 2012
Dell panel   cloud computing - small biz summit 2012Dell panel   cloud computing - small biz summit 2012
Dell panel cloud computing - small biz summit 2012Ramon Ray
 
Progress with confidence into next generation IT
Progress with confidence into next generation ITProgress with confidence into next generation IT
Progress with confidence into next generation ITPaul Muller
 
Super-Fast Clustering Report in MapR
Super-Fast Clustering Report in MapRSuper-Fast Clustering Report in MapR
Super-Fast Clustering Report in MapRData Science London
 
Lax breakfast forum_developing_your_cloud_strategy_05_10_2012
Lax breakfast forum_developing_your_cloud_strategy_05_10_2012Lax breakfast forum_developing_your_cloud_strategy_05_10_2012
Lax breakfast forum_developing_your_cloud_strategy_05_10_2012Internap
 
Nyc lunch and learn 03 15 2012 final
Nyc lunch and learn   03 15 2012 finalNyc lunch and learn   03 15 2012 final
Nyc lunch and learn 03 15 2012 finalInternap
 
The Smarter Way to Commercialize Algorithms
The Smarter Way to Commercialize AlgorithmsThe Smarter Way to Commercialize Algorithms
The Smarter Way to Commercialize AlgorithmsCloudNSci
 
Managing your Cloud with Confidence
Managing your Cloud with Confidence Managing your Cloud with Confidence
Managing your Cloud with Confidence CA Nimsoft
 
CloudOps with OpsRamp: From Discovery to Resolution
CloudOps with OpsRamp: From Discovery to ResolutionCloudOps with OpsRamp: From Discovery to Resolution
CloudOps with OpsRamp: From Discovery to ResolutionOpsRamp
 
Spark and MapR Streams: A Motivating Example
Spark and MapR Streams: A Motivating ExampleSpark and MapR Streams: A Motivating Example
Spark and MapR Streams: A Motivating ExampleIan Downard
 
Dr Markus Pleier - Datadeluge and big data, how IT operation get transformed
Dr Markus Pleier - Datadeluge and big data, how IT operation get transformedDr Markus Pleier - Datadeluge and big data, how IT operation get transformed
Dr Markus Pleier - Datadeluge and big data, how IT operation get transformedGlobal Business Events
 

Similaire à Chicago finance-big-data (20)

Big data, why now?
Big data, why now?Big data, why now?
Big data, why now?
 
Chicago Hadoop in Finance - Ted Dunning
Chicago Hadoop in Finance - Ted DunningChicago Hadoop in Finance - Ted Dunning
Chicago Hadoop in Finance - Ted Dunning
 
What is the past future tense of data?
What is the past future tense of data?What is the past future tense of data?
What is the past future tense of data?
 
Steve Jenkins - Business Opportunities for Big Data in the Enterprise
Chicago finance-big-data

  • 1. Scalability in Hadoop and Similar Systems ©MapR Technologies - Confidential
  • 2. Big is the next big thing
      • Big data and Hadoop are exploding
      • Companies are being funded
      • Books are being written
      • Applications sprouting up everywhere
  • 3. Slow Motion Explosion
  • 5. Why Now? (9/18/2012)
      • But Moore’s law has applied for a long time
      • Why is Hadoop exploding now?
      • Why not 10 years ago?
      • Why not 20?
  • 6. Size Matters, but …
      • If it were just availability of data then existing big companies would adopt big data technology first
  • 7. Size Matters, but …
      • If it were just availability of data then existing big companies would adopt big data technology first
      They didn’t
  • 8. Or Maybe Cost
      • If it were just a net positive value then finance companies should adopt first because they have higher opportunity value / byte
  • 9. Or Maybe Cost
      • If it were just a net positive value then finance companies should adopt first because they have higher opportunity value / byte
      They didn’t
  • 10. Backwards adoption
      • Under almost any threshold argument startups would not adopt big data technology first
  • 11. Backwards adoption
      • Under almost any threshold argument startups would not adopt big data technology first
      They did
  • 12. Everywhere at Once?
      • Something very strange is happening
          – Big data is being applied at many different scales
          – At many value scales
          – By large companies and small
  • 13. Everywhere at Once?
      • Something very strange is happening
          – Big data is being applied at many different scales
          – At many value scales
          – By large companies and small
      Why?
  • 14. The Conventional Answer
      • More data is being produced more quickly
      • Data sizes are bigger than even a very large computer can hold
      • Cost to create and store continues to decrease
  • 15. Analytics Scaling Laws
      • Analytics scaling is all about the 80-20 rule
          – Big gains for little initial effort
          – Rapidly diminishing returns
      • The key to net value is how costs scale
          – Old school – exponential scaling
          – Big data – linear scaling, low constant
      • Cost/performance has changed radically
          – IF you can use many commodity boxes
  • 16. You’re kidding, people do that? / We didn’t know that! / We should have known that / We knew that
  • 17. [Chart: Value (0 to 1) against Scale (0 to 2,000). Annotations: NSA, non-proliferation; Industry-wide data consortium; In-house analytics; Intern with a spreadsheet; Anybody with eyes]
  • 18. Net value optimum has a sharp peak well before maximum effort [Chart: Value (0 to 1) against Scale (0 to 2,000)]
  • 19. But scaling laws are changing both slope and shape
  • 20. More than just a little [Chart: Value (0 to 1) against Scale (0 to 2,000)]
  • 21. They are changing a LOT! [Chart: Value (0 to 1) against Scale (0 to 2,000)]
  • 22. (no text)
  • 23. (no text)
  • 24. [Chart: Value (0 to 1) against Scale (0 to 2,000)]
  • 25. [Chart: Value (0 to 1) against Scale (0 to 2,000)]
  • 26. Initially, linear cost scaling actually makes things worse. A tipping point is reached and things change radically … [Chart: Value (0 to 1) against Scale (0 to 2,000)]
  • 27. Pre-requisites for Tipping
      • To reach the tipping point,
      • Algorithms must scale out horizontally
          – On commodity hardware
          – That can and will fail
      • Data practice must change
          – Denormalized is the new black
          – Flexible data dictionaries are the rule
          – Structured data becomes rare
  • 28. Yeah… but wait
  • 29. The Standard Sort of Model
      • People talk about the law of large numbers as if it were …
      • Well, as if it were a law
      • It’s not …
      • It is a context and assumption dependent theorem
  • 30. What if …
      • These assumptions are:
      • Changes have a
          – stationary,
          – independent,
          – finite variance distribution
      • What happens if these assumptions are wrong?
      • And which of them is really wrong?
  • 31. For Example [Chart: Stuff against Time]
  • 32. End point has nice tractable distribution [Chart: Stuff against Time]
  • 33. What if the Assumptions are Wrong?
      • Take the finite variance as a simple example
      • This leads to Levy stable distributions
      • Like the Cauchy distribution
  • 34. Is it Really Different?
  • 35. [Chart: Stuff against Time]
  • 36. What About Real Life?
  • 37. (no text)
  • 38. But is it Really Infinite Variance?
      • Or are there other kinds of phenomena that show this?
      • What about the independence assumption?
      • What if the supposedly independent components of the system communicate?
      • Like we do. Every day. All the time.
  • 39. Why the Difference? [Diagram labels: The space of all things that change; Infinite variance; The space of interacting things; Law of large numbers; Interacting agents] Apologies and credit to Simon DeDeo, SFI
  • 40. What Happens with Interactions
      • Social phenomena defeat the law of large numbers
      • Distributions are well modeled by “rich get richer” processes
          – Pitman-Yor process, Indian Buffet
      • Limiting distributions are heavy tailed, power law
      • We see these distributions everywhere
          – price of cotton in the 19th century
          – word frequencies
          – popularity of Github projects
          – equity pricing and volumes
          – sizes of cities
          – popularity of web-sites
  • 41. What are the Implications?
  • 42. [Chart: Value (0 to 1) against Scale (0 to 2,000)]
  • 43. In a Nutshell
      • Scalability is much more important than we thought
      • Mashups are more important than we thought
      • Network effects are more important than we thought
      • Exploration is more important than we thought
      • Hadoop style linear scaling must be mixed with ad hoc analysis
  • 44. Thank You
  • 45. whoami?
      • Ted Dunning
          – @ted_dunning
          – tdunning@maprtech.com (MapR distribution for Hadoop)
          – tdunning@apache.com (Mahout, Hadoop, Lucene, Zookeeper, Drill)
          – ted.dunning@gmail.com (me)
      • More info: http://www.mapr.com/company/events/hadoop-in-finance-2012
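
The net-value argument of slides 15 to 26 can be sketched numerically. The curves below are illustrative assumptions, not taken from the slides: a saturating value curve `scale / (scale + 100)`, an "old school" cost that doubles every 250 units of scale, and linear costs with two hypothetical slopes. Only the shapes matter; the constants are made up.

```python
def value(scale, half_scale=100.0):
    """Assumed value curve: fast early gains, then diminishing returns."""
    return scale / (scale + half_scale)


def exponential_cost(scale, base_cost=0.01, doubling_scale=250.0):
    """Assumed 'old school' cost: doubles every `doubling_scale` units."""
    return base_cost * 2.0 ** (scale / doubling_scale)


def linear_cost(slope):
    """Return an assumed 'big data' cost function with the given slope."""
    return lambda scale: slope * scale


def net_value(scale, cost):
    return value(scale) - cost(scale)


def best_scale(cost, max_scale=2000, step=10):
    """Scale on a coarse grid that maximizes net value for a cost model."""
    scales = range(0, max_scale + 1, step)
    return max(scales, key=lambda s: net_value(s, cost))


models = {
    "exponential": exponential_cost,
    "linear, steep slope": linear_cost(0.001),
    "linear, shallow slope": linear_cost(0.00005),
}
for name, cost in models.items():
    s = best_scale(cost)
    print(f"{name}: best scale {s}, net value {net_value(s, cost):.3f}")
```

Under these assumed constants, the steep linear cost gives a lower peak than the exponential one, while the shallow linear cost moves the optimum to a much larger scale. That is the tipping-point pattern the charts describe.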

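Slides 29 to 35 question the finite-variance assumption behind the law of large numbers. A minimal sketch of that point, assuming standard normal and standard Cauchy draws (the function names, trial count, and seed are illustrative choices): it measures how widely the sample mean varies across repeated trials as the sample size grows.

```python
import math
import random
import statistics


def normal_sample(rng):
    """One draw from a standard normal (finite variance)."""
    return rng.gauss(0.0, 1.0)


def cauchy_sample(rng):
    """One draw from a standard Cauchy (infinite variance), via inverse CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))


def spread_of_means(sample, n, trials=200, seed=0):
    """Interquartile range of the mean of n draws, over repeated trials."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(sample(rng) for _ in range(n))
        for _ in range(trials)
    ]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1


for n in (10, 100, 1000):
    normal_spread = spread_of_means(normal_sample, n)
    cauchy_spread = spread_of_means(cauchy_sample, n)
    print(f"n={n}: normal {normal_spread:.3f}, Cauchy {cauchy_spread:.3f}")
```

For the normal draws the spread of the mean shrinks roughly as 1/sqrt(n). For the Cauchy draws it does not shrink, because the mean of n standard Cauchy variables is itself standard Cauchy, so averaging more data does not tame the tails.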
Editor's notes

  1. Why is big data sooo fashionable with big and small companies from different industries? What has suddenly changed?
  2. Google searches are up 10x over just four years ago.
  3. Hadoop use is exploding. We chose this example, which shows job trends for Hadoop. Further evidence that you should pay attention during this talk.
  4. But we have seen constant growth for a long time. And simple growth would only explain some kinds of companies starting with big data (probably big ones) and then slow adoption. Databases started with big companies and took 20 years or more to reach everywhere because the need exceeded cost at different times for different companies. The internet, on the other hand, largely happened to everybody at the same time so it changed things in nearly all industries at all scales nearly simultaneously. Why is big data exploding right now and why is it exploding at all?
  5. The different kinds of scaling laws have different shapes, and I think that shape is the key.
  6. The value of analytics always increases with more data, but the rate of increase drops dramatically after an initial quick increase.
  7. In classical analytics, the cost of doing analytics increases sharply.
  8. The result is a net value that has a sharp optimum in the area where value is increasing rapidly and cost is not yet increasing so rapidly.
  9. New techniques such as Hadoop result in linear scaling of cost. This is a change in shape and it causes a qualitative change in the way that costs trade off against value to give net value. As technology improves, the slope of this cost line is also changing rapidly over time.
  10. This next sequence shows how the net value changes with different slope linear cost models.
  11. Notice how the best net value has jumped up significantly.
  12. And as the line approaches horizontal, the highest net value occurs at dramatically larger data scale.
  13. And as the line approaches horizontal, the highest net value occurs at dramatically larger data scale.