Bayesian Counters
aka In-Memory Data Mining for Large Data Sets
 Alex Kozlov, Ph.D., Principal Solutions Architect, Cloudera Inc.

 @alexvk2009 (Twitter)
June 13, 2012
My past (aka about me)
Agenda
• Current trends (large data, real time, uncertainty)
• What are Bayesian Counters
• Naïve Bayes
• NN
• Clique ranking
• Association Rules
• Some performance results
• Conclusions

                      ©2012 Cloudera, Inc. All Rights Reserved.
A Distributed System
Centralized
• SPoF
• Strict synchronization/Locking
• Better Resource Management

Distributed
• Availability
• Redundancy/Fault Tolerance
• Flexible
• Interactive
Data collection
State space explosion
• Chess alpha-beta tree has 10^45 nodes
• We can solve only a 10^18 state space
• Go has 10^360 nodes
• Given Moore’s law, we’ll get there only by 2120
                       Can we help?
              Uncertainty rules the world!
               Or use distributed systems
More zeros

• Most powerful computer (2019): 10^24 ops/sec

• Seconds in a year: 3 x 10^7 seconds

• Sun’s expected life: 10^7 years

     We can probably be done with chess!
Time
Examples
• Advertising: if you don’t figure out what the user wants within 5 minutes, you have lost them
• Intrusion detection: the damage may be significantly bigger a few minutes after a break-in
• Missing/misconfigured pages

[Figure: "Value vs time" chart; x-axis 0 to 9; series: Value, Precision]

http://cetas.net
http://www.woopra.com
http://www.wibidata.com/
What we’ve learned so far
• There is a lot of data out there
• The storage capacity of distributed systems
  today is overwhelming
• We need to admit that some problems will
  never be solved
• Time is a critical factor
Why (not) to Mine from HD?
• L1 Cache: 64 bits per CPU clock cycle (10^-9 sec), 10^10 bytes per second, latency in ns
• HD – 12 x 100 x 10^6 bytes per second, latency in ms
• Network – 10 GbE switches (depends on distance, topology)
• East-West coast latency 20-40 ms (ms within a datacenter)

• Move computation to the data: but ML wants all your data!
• And sorted…

What if it does not fit in RAM?

• Work on reasonable subsets
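
The bandwidth figures above can be turned into a rough scan-time comparison. The sketch below is only back-of-the-envelope arithmetic: the 1 TB dataset size is an assumption made for illustration, and "12 x 100 x 10^6" is read here as an aggregate disk rate of 1.2 x 10^9 bytes per second.

```python
# Rates taken from the slide; the dataset size is an assumed example value.
L1_BYTES_PER_SEC = 10**10             # L1 cache: ~10^10 bytes per second
HD_BYTES_PER_SEC = 12 * 100 * 10**6   # HD: 12 x 100 x 10^6 bytes per second
DATASET_BYTES = 10**12                # assumption: a 1 TB dataset


def scan_seconds(n_bytes, bytes_per_sec):
    """Time to stream n_bytes once at a sustained rate (latency ignored)."""
    return n_bytes / bytes_per_sec


if __name__ == "__main__":
    print("L1-rate scan:", scan_seconds(DATASET_BYTES, L1_BYTES_PER_SEC), "s")
    print("HD-rate scan:", round(scan_seconds(DATASET_BYTES, HD_BYTES_PER_SEC), 1), "s")
```

Under these assumptions a single pass from disk takes several times longer than the same pass at cache speed, before any seek latency is counted.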
Push computations to the source

• Collect relevant information at the source
  (pairwise correlations, can be done in parallel
  using HBase)

Compare:
    -> computations to data = MapReduce

    -> data to computations = map side join
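
Collecting pairwise information at the source could look roughly like the sketch below. It is an in-memory stand-in, not the HBase implementation from the talk: the `var=value;var=value` key format follows the slides, while the function names and the choice to sort variables by name are assumptions.

```python
from collections import Counter
from itertools import combinations


def pairwise_keys(record):
    """Yield one 'v1=x1;v2=x2' key per pair of variables in the record.

    Variables are sorted by name (an assumption) so that a given pair
    always produces the same key.
    """
    items = sorted(record.items())
    for (v1, x1), (v2, x2) in combinations(items, 2):
        yield f"{v1}={x1};{v2}={x2}"


def update(counters, record):
    """Increment every pairwise counter touched by a single record."""
    counters.update(pairwise_keys(record))


if __name__ == "__main__":
    counters = Counter()
    update(counters, {"A": "a1", "B": "b1", "C": "c1"})
    update(counters, {"A": "a1", "B": "b1", "C": "c2"})
    print(counters["A=a1;B=b1"])
```

Each record only produces increments, so records can be processed independently and in parallel, which is what makes the approach fit a distributed store.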
Bayesian Counters
Pr(A|B) = Pr(AB)/Pr(B) = Count(AB)/Count(B)

• [A=a1;B=b1] -> 5
• [A=a1;B=b2] -> 15
• …
• [A=a2;B=b1] -> 3
• …
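
As a worked instance of Pr(A|B) = Count(AB)/Count(B), the sketch below uses the three counter values shown on the slide. It assumes, purely for illustration, that these are the only counters (the slide elides the rest), and it derives Count(B) by summing the joint counters over A.

```python
# Joint counters keyed by (A value, B value). The three counts come from the
# slide; treating them as the complete set is an assumption.
joint = {("a1", "b1"): 5, ("a1", "b2"): 15, ("a2", "b1"): 3}


def count_b(b):
    """Count(B=b), obtained by summing the joint counters over A."""
    return sum(n for (_, b_val), n in joint.items() if b_val == b)


def pr_a_given_b(a, b):
    """Pr(A=a | B=b) = Count(A=a, B=b) / Count(B=b)."""
    denom = count_b(b)
    if denom == 0:
        raise ValueError(f"no observations with B={b}")
    return joint.get((a, b), 0) / denom


if __name__ == "__main__":
    print(pr_a_given_b("a1", "b1"))  # 5 / (5 + 3)
```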
Time
What if we want to access more recent data more often?

• Key: subset of variables with their values + timestamp (variable length)
• Value: count (8 bytes)

[Figure: file layout as a sequence of key/value pairs (Key 1, Value, Key 2, Value, Key 3, Value, Key 4, Value) with an index]

Column families are different HFiles (30 min, 2 hours, 24 hours, 5 days, etc.)

Pr(A|B, last 20 minutes)
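
A query such as Pr(A|B, last 20 minutes) can be sketched with time-bucketed counters. This is a simplified in-memory model, not the column-family layout described above: the class name, the 60-second bucket width, and the dictionary storage are all assumptions.

```python
from collections import defaultdict


class WindowedCounters:
    """Counters bucketed by time so that recent windows can be summed."""

    def __init__(self, bucket_seconds=60):  # bucket width is an assumption
        self.bucket_seconds = bucket_seconds
        self.counts = defaultdict(lambda: defaultdict(int))

    def increment(self, key, timestamp, by=1):
        # Round the timestamp down to the start of its bucket.
        bucket = int(timestamp) // self.bucket_seconds * self.bucket_seconds
        self.counts[key][bucket] += by

    def count_since(self, key, now, window_seconds):
        """Sum the buckets of `key` that start within the last window."""
        cutoff = now - window_seconds
        buckets = self.counts.get(key, {})
        return sum(n for start, n in buckets.items() if cutoff <= start <= now)

    def conditional(self, joint_key, cond_key, now, window_seconds):
        """Count(AB)/Count(B) restricted to the window; None if no data."""
        denom = self.count_since(cond_key, now, window_seconds)
        if denom == 0:
            return None
        return self.count_since(joint_key, now, window_seconds) / denom
```

Because whole buckets are included or excluded, the window boundary is only accurate to one bucket width.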
Anatomy of a counter
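[Figure: anatomy of a counter. Labels recovered from the diagram:]
• Counter/Table: Iris, Cars, …
• Region (divide between)
• File
• Column family: 30 mins, 2 hours
• Column qualifier
• Version: 1321038671, 1321038998
• Value (data): 15
• Example key: [sepal_width=2;class=0]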
File/Memory Structure
HBase schema design

• Push computations into distributed realm

• Column family for data locality

• Key is a tuple of var=value combinations

• No random salt

• Value is a counter (8 bytes)
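
The key and value encoding described above might be sketched as follows. The `var=value;var=value` form mirrors the keys shown in the slides; sorting variables by name and using big-endian byte order are assumptions made here, and the function names are hypothetical.

```python
import struct


def make_key(assignment):
    """Encode {variable: value} as b'var1=val1;var2=val2'.

    Variables are sorted by name (an assumption) so that the same
    combination always maps to the same key.
    """
    text = ";".join(f"{var}={assignment[var]}" for var in sorted(assignment))
    return text.encode("utf-8")


def pack_count(n):
    """Pack a count into 8 bytes (signed 64-bit, big-endian assumed)."""
    return struct.pack(">q", n)


def unpack_count(raw):
    """Inverse of pack_count."""
    return struct.unpack(">q", raw)[0]


if __name__ == "__main__":
    print(make_key({"sepal_width": 2, "class": 0}), pack_count(15))
```

A deterministic key with no random salt keeps related combinations adjacent in sort order, at the cost of possible hot spots on popular keys.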
Implementations

• Naïve Bayes

• Nearest Neighbor

• Association rules

• Clique ranking
Naïve Bayes


$$\Pr(C \mid F_1, F_2, \ldots, F_N) = \frac{1}{z} \Pr(C) \prod_i \Pr(F_i \mid C)$$


Requires only pairwise counters (complexity N^2)


*Linear if we fix the target node
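
A minimal sketch of Naïve Bayes scoring from counters, assuming the class counts and the (class, feature) pairwise counts have already been collected. The dictionary layout and function name are hypothetical, and no smoothing is applied, so an unseen feature value drives a class score to zero.

```python
def naive_bayes(class_counts, pair_counts, evidence):
    """Return Pr(C | evidence) for every class value.

    class_counts: {c: Count(C=c)}
    pair_counts:  {(c, feature, value): Count(C=c, feature=value)}
    evidence:     {feature: value}
    """
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        if n_c == 0:
            scores[c] = 0.0
            continue
        score = n_c / total  # Pr(C=c)
        for feature, value in evidence.items():
            # Pr(F_i = value | C=c) estimated as a ratio of counters
            score *= pair_counts.get((c, feature, value), 0) / n_c
        scores[c] = score
    z = sum(scores.values())  # normalisation constant
    return {c: s / z for c, s in scores.items()} if z else scores
```

Only counters that pair the target with one feature are read, which is why the cost is linear in the number of features once the target is fixed.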
k-NN


       P(C) for k nearest neighbors

       count(C|X) = ΣXi count(C|Xi)

where X1, X2, ..., XN are in the vicinity of X
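
The neighbor sum above can be sketched for a single numeric feature. In this sketch the "vicinity" is taken to be the k stored feature values closest to X, which is an assumption: it selects k distinct values rather than k individual records. The function name and counter layout are hypothetical.

```python
from collections import Counter


def knn_class_distribution(counters, x, k):
    """Estimate P(C) near x from counters of the form {(xi, c): count}."""
    # Distinct stored feature values, closest to x first (ties broken by value).
    points = sorted({xi for xi, _ in counters}, key=lambda xi: (abs(xi - x), xi))
    nearest = set(points[:k])

    # count(C|X) = sum over neighbors Xi of count(C|Xi)
    totals = Counter()
    for (xi, c), n in counters.items():
        if xi in nearest:
            totals[c] += n

    total = sum(totals.values())
    return {c: n / total for c, n in totals.items()} if total else {}
```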
Clique ranking
What is the best structure of a Bayesian Network?

     I(X;Y) = ΣΣ p(x,y) log[p(x,y)/(p(x)p(y))]

           where x in X and y in Y

Using random projection, this can be generalized to an
              abstract subset Z
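
The mutual information I(X;Y) can be computed directly from pairwise counts, as in this sketch. The natural logarithm is used, so the result is in nats; the function name and input layout are assumptions.

```python
import math
from collections import Counter


def mutual_information(joint_counts):
    """I(X;Y) in nats from pairwise counters {(x, y): count}."""
    n = sum(joint_counts.values())
    if n == 0:
        return 0.0

    # Marginal counts are obtained by summing the joint counters.
    count_x, count_y = Counter(), Counter()
    for (x, y), c in joint_counts.items():
        count_x[x] += c
        count_y[y] += c

    mi = 0.0
    for (x, y), c in joint_counts.items():
        if c > 0:
            # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y)
            mi += (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
    return mi
```

Pairs with high mutual information are natural candidates for edges when ranking candidate structures.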
Association rules
• Confidence (A -> B): count(A and B)/count(A)

• Lift (A -> B): N x count(A and B)/[count(A) x count(B)], where N is the total number of transactions



• Usually filtered on support: count(A and B)

• Frequent itemset search
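
The rule metrics can be sketched over a small list of transactions. Lift is computed here in its usual normalized form, which multiplies by the total number of transactions N so that a value of 1 means A and B are independent. The function name and the sample data in the usage lines are made up for illustration.

```python
def rule_metrics(transactions, a, b):
    """Support, confidence and lift for the rule A -> B.

    transactions: list of sets of items; a, b: iterables of items.
    """
    a, b = frozenset(a), frozenset(b)
    n = len(transactions)
    count_a = sum(1 for t in transactions if a <= t)
    count_b = sum(1 for t in transactions if b <= t)
    count_ab = sum(1 for t in transactions if (a | b) <= t)

    confidence = count_ab / count_a if count_a else 0.0
    lift = n * count_ab / (count_a * count_b) if count_a and count_b else 0.0
    return {"support": count_ab, "confidence": confidence, "lift": lift}


if __name__ == "__main__":
    baskets = [{"milk", "bread"}, {"milk"}, {"bread"}, {"milk", "bread", "eggs"}]
    print(rule_metrics(baskets, {"milk"}, {"bread"}))
```

With precomputed counters, the three sums become direct lookups, so a rule can be scored without rescanning the transactions.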
Performance

retail.dat – 88K transactions over 14,246 items

• Mahout FPGrowth – 0.5 sec per pattern
  (58,623 patterns with min support 2)

• < 1 ms per pattern on a 5 node cluster
FPGrowth performance

Row   Support   Rules    Time (ms)
1     1         69,309   25,659,052
2     2         58,623   23,103,547
3     4         48,270   20,782,325
4     8         38,661   17,643,592
5     16        28,988   13,994,334
6     32        19,939    9,714,935
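
The per-pattern cost can be derived from the table by dividing total time by the number of rules. The sketch below does only that arithmetic on the figures above; the variable and function names are hypothetical.

```python
# (min support, rules, total time in ms), copied from the table above
FPGROWTH_RUNS = [
    (1, 69_309, 25_659_052),
    (2, 58_623, 23_103_547),
    (4, 48_270, 20_782_325),
    (8, 38_661, 17_643_592),
    (16, 28_988, 13_994_334),
    (32, 19_939, 9_714_935),
]


def ms_per_rule(runs):
    """Map each minimum support to the average milliseconds spent per rule."""
    return {support: time_ms / rules for support, rules, time_ms in runs}


if __name__ == "__main__":
    for support, ms in ms_per_rule(FPGROWTH_RUNS).items():
        print(f"min support {support:>2}: {ms:.0f} ms per rule")
```

Every row works out to a few hundred milliseconds per rule, the same order of magnitude as the roughly 0.5 sec per pattern quoted for Mahout FPGrowth.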
FPGrowth performance
Time
  nb iris class=2 sepal_length=5;petal_length=1.4 300

• class=2: target variable
• sepal_length=5;petal_length=1.4: predictors
• 300: time (seconds from now)
Conclusions
• Storing n-wise counts is a powerful data
  analysis paradigm
• We can implement a number of powerful
  algorithms on top of counters
• A system that will know more about the world
  than you would ever dare to admit
Thank you!




Questions?




                           freenode: #cloudera / #hadoop
                           http://www.cloudera.com
Do not hesitate to email alexvk@{gmail,cloudera}.com

 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 


Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Data Sets

  • 1. Bayesian Counters, aka In Memory Data Mining for Large Data Sets. Alex Kozlov, Ph.D., Principal Solutions Architect, Cloudera Inc. @alexvk2009 (Twitter). June 13, 2012
  • 2.
  • 3. My past (aka about me)
  • 4. Agenda • Current trends (large data, real time, uncertainty) • What are Bayesian Counters • Naïve Bayes • NN • Clique ranking • Association Rules • Some performance results • Conclusions ©2012 Cloudera, Inc. All Rights Reserved.
  • 5. A Distributed System. Centralized: • SPoF • Strict synchronization/Locking • Better Resource Management. Distributed: • Availability • Redundancy/Fault Tolerance • Flexible • Interactive
  • 7. State space explosion • Chess alpha-beta tree has 10^45 nodes • We can solve only a 10^18 state space • Go has 10^360 nodes • Given Moore’s law we’ll be there only by 2120. Can we help? Uncertainty rules the world! Or use distributed systems
  • 8. More zeros • Most powerful computer (2019): 10^24 ops/sec • Seconds in a year: 3 x 10^7 seconds • Sun’s expected life: 10^7 years. We can probably be done with chess!
  • 9. Time. Examples: • Advertising: if you don’t figure out what the user wants in 5 minutes, you’ve lost him • Intrusion detection: the damage may be significantly bigger a few minutes after the break-in • Missing/misconfigured pages. [Chart: Value vs time; series: Value, Precision] http://cetas.net http://www.woopra.com http://www.wibidata.com/
  • 10. What we’ve learned so far • There is a lot of data out there • The storage capacity of distributed systems today is overwhelming • We need to admit that some problems will never be solved • Time is a critical factor
  • 11. Why (not) to Mine from HD? • L1 Cache: 64 bits per CPU clock cycle (10^-9 sec), 10^10 bytes per second, latency in ns • HD – 12 x 100 x 10^6 bytes per second, latency in ms • Network – 10 GbE switches (depends on distance, topology) • East-West coast latency 20-40 ms (ms within a datacenter) • Move computation to the data: but ML wants all your data! • And sorted… What if it does not fit in RAM? • Work on reasonable subsets
  • 12. Push computations to the source • Collect relevant information at the source (pairwise correlations, can be done in parallel using HBase). Compare: -> computations to data = MapReduce; -> data to computations = map side join
  • 13. Bayesian Counters • [A=a1;B=b1] -> 5 • [A=a1;B=b2] -> 15 • … • [A=a2;B=b1] -> 3 • … Pr(A|B) = Pr(AB)/Pr(B) = Count(AB)/Count(B)
  • 14. Time: What if we want to access more recent data more often? • Key: subset of variables with their values + timestamp (variable length) • Value: count (8 bytes) [Diagram: an index over Key/Value pairs, Key 1 through Key 4] Column families are different HFiles (30 min, 2 hours, 24 hours, 5 days, etc.). Pr(A|B, last 20 minutes)
  • 15. Anatomy of a counter [Diagram. Labels: Region (divide between), Counter/Table, File, Column family, Column qualifier, Version, Value (data). Example values shown: Iris, [sepal_width=2;class=0], 30 mins, 1321038671, 1321038998, 15, 2 hours, Cars, …]
  • 17. HBase schema design • Push computations into distributed realm • Column family for data locality • Key is a tuple of var=value combinations • No random salt • Value is a counter (8 bytes)
  • 18. Implementations • Naïve Bayes • Nearest Neighbor • Association rules • Clique ranking
  • 19. Naïve Bayes: Pr(C|F1, F2, ..., FN) = 1/z Pr(C) Πi Pr(Fi|C). Requires only pairwise counters (complexity N^2)*. *Linear if we fix the target node
  • 20. k-NN: P(C) for k nearest neighbors. count(C|X) = Σi count(C|Xi), where X1, X2, ..., XN are in the vicinity of X
  • 21. Clique ranking: What is the best structure of a Bayesian Network? I(X;Y) = Σx Σy p(x,y) log[p(x,y)/(p(x)p(y))], where x in X and y in Y. Using random projection can generalize on abstract subset Z
  • 22. Assoc • Confidence (A -> B): count(A and B)/count(A) • Lift (A -> B): count(A and B)/[count(A) x count(B)] • Usually filtered on support: count(A and B) • Frequent itemset search
  • 23. Performance retail.dat – 88K transactions over 14,246 items • Mahout FPGrowth – 0.5 sec per pattern (58,623 patterns with min support 2) • < 1 ms per pattern on a 5 node cluster
  • 24. FPGrowth performance
      Row   Support   Rules     Time (ms)
      1     1         69,309    25,659,052
      2     2         58,623    23,103,547
      3     4         48,270    20,782,325
      4     8         38,661    17,643,592
      5     16        28,988    13,994,334
      6     32        19,939     9,714,935
  • 26. Time: nb iris class=2 sepal_length=5;petal_length=1.4 300, where class=2 is the target variable, sepal_length=5;petal_length=1.4 are the predictors, and 300 is the time (seconds from now)
  • 27. Conclusions • Storing n-wise counts is a powerful data analysis paradigm • We can implement a number of powerful algorithms on top of counters • A system that will know about the world more than you would ever dare to admit
  • 29. Questions? freenode: #cloudera / #hadoop http://www.cloudera.com Do not hesitate to email alexvk@{gmail,cloudera}.com
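
Slides 13 and 19 reduce both Pr(A|B) and Naïve Bayes to ratios of stored counts. The following is a minimal Python sketch of that idea, assuming an in-memory Counter in place of the HBase table. The function names (build_counters, cond_prob, naive_bayes) and the toy records are hypothetical and do not come from the talk; no smoothing is applied, so an unseen combination yields a zero probability.

```python
from collections import Counter
from itertools import combinations
from math import prod

def build_counters(records):
    """Count every var=value item and every pair of items, as in [A=a1;B=b1] -> n."""
    counts = Counter()
    for rec in records:
        items = sorted(rec.items())              # canonical key order
        counts.update((item,) for item in items)  # single-variable counters
        counts.update(combinations(items, 2))     # pairwise counters
    return counts

def count(counts, *items):
    """Look up a counter by its (var, value) items, in any order."""
    return counts[tuple(sorted(items))]

def cond_prob(counts, a, b):
    """Pr(a | b) = Count(a, b) / Count(b)."""
    denom = count(counts, b)
    return count(counts, a, b) / denom if denom else 0.0

def naive_bayes(counts, total, target_var, classes, evidence):
    """Pr(C | F1..FN) = 1/z * Pr(C) * prod_i Pr(Fi | C), using counters only."""
    scores = {}
    for c in classes:
        cls = (target_var, c)
        prior = count(counts, cls) / total
        scores[c] = prior * prod(cond_prob(counts, f, cls) for f in evidence)
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()} if z else scores

# Hypothetical toy records (not data from the talk).
records = [
    {"class": 0, "petal": "short", "sepal": "wide"},
    {"class": 0, "petal": "short", "sepal": "narrow"},
    {"class": 1, "petal": "long", "sepal": "narrow"},
    {"class": 1, "petal": "long", "sepal": "wide"},
    {"class": 1, "petal": "short", "sepal": "narrow"},
]
counts = build_counters(records)
posterior = naive_bayes(
    counts, len(records), "class", [0, 1],
    [("petal", "short"), ("sepal", "narrow")],
)
print(posterior)
```

Only counters that include the target variable are read at prediction time, which is the sense in which the cost becomes linear once the target node is fixed.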

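The association-rule measures of slide 22 and the mutual information of slide 21 can also be computed from counts alone. The sketch below is illustrative Python; the function names and the example counts are hypothetical. One assumption to note: lift is computed here with the total number of transactions N in the numerator, which is the usual normalization, whereas the slide writes it without N.

```python
from math import log

def confidence(n_ab, n_a):
    """Confidence (A -> B) = count(A and B) / count(A)."""
    return n_ab / n_a

def lift(n_ab, n_a, n_b, n_total):
    """Lift (A -> B) = Pr(AB) / (Pr(A) Pr(B)) = N * count(AB) / (count(A) * count(B))."""
    return (n_ab * n_total) / (n_a * n_b)

def mutual_information(joint, total):
    """I(X;Y) = sum_x sum_y p(x,y) log[p(x,y) / (p(x) p(y))], in nats.

    joint maps (x, y) value pairs to their observed counts.
    """
    px, py = {}, {}
    for (x, y), c in joint.items():
        px[x] = px.get(x, 0) + c   # marginal count of x
        py[y] = py.get(y, 0) + c   # marginal count of y
    mi = 0.0
    for (x, y), c in joint.items():
        if c:                      # zero cells contribute nothing
            mi += (c / total) * log(c * total / (px[x] * py[y]))
    return mi

# Hypothetical counts: 100 transactions, A in 20, B in 25, both in 10.
print(confidence(10, 20), lift(10, 20, 25, 100))

# Independent variables give zero mutual information.
independent = {(0, 0): 25, (0, 1): 25, (1, 0): 25, (1, 1): 25}
print(mutual_information(independent, 100))
```

A lift above 1 means A and B co-occur more often than independence would predict; pairs with high mutual information are the natural candidates for edges when ranking Bayesian network structures.
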
Editor's notes

  1. Not about HDFS and/or Hadoop/HBase. Not about DB design (we can store PBs of data). I go to a lot of customers and “recommend” things. Sometimes they listen. If not, they come back. Most “data scientists” say that only simple/linear algorithms work on large data. How to go beyond a simple “grep” or unique count? My approach to data mining and knowledge discovery (a.k.a. “data science”). Common problems: not enough memory, state space explosion, exponential running times. How to solve them?
  2. Josh Wills’s definition of a computer scientist: + I am better at physics than any data scientist (besides maybe Kevin Weil from Twitter)
  3. I heard talks about data locality, task workload distribution, and reliability in 1998-1999. (Almost) everyone thought that distributed computation on commodity workstations was not a great idea. MapReduce was born in 2002-2004. Hadoop had a world record of sorting 1 TB of 100-byte records (just under a minute). Can do the same on a ~50 node cluster today. 1 PB takes close to 30 minutes. I will talk about some tendencies that I see in the data analysis area. Outline: About Cloudera and CDH; About Distributed Systems; Why to keep the Dataset in Memory; Current Trends; What are Bayesian Counters; Naïve Bayes; NN; Bayesian Networks; Association rules; Conclusions.
  4. Interest in Hadoop is surging… Hadoop is: ‘A scalable fault-tolerant distributed system for data storage and processing’. Hadoop history: 2002-2004: Doug Cutting and Mike Cafarella started working on Nutch; 2003-2004: Google publishes GFS and MapReduce papers; 2004: Cutting adds DFS & MapReduce support to Nutch; 2006: Yahoo! hires Cutting, Hadoop spins out of Nutch; 2007: NY Times converts 4 TB of archives over 100 EC2 instances; 2008: Web-scale deployments at Y!, Facebook, Last.fm; April 2008: Yahoo does fastest sort of a TB, 3.5 mins over 910 nodes; May 2009: Yahoo does fastest sort of a TB, 62 secs over 1460 nodes; Yahoo sorts a PB in 16.25 hours over 3658 nodes; June 2009, Oct 2009: Hadoop Summit, Hadoop World; September 2009: Doug Cutting joins Cloudera; September 2011: sort 1 PB in 32 minutes on 8,000 nodes. Cloudera helps other companies to embrace the technology.
  5. Centralized systems have more global barriers (as a rule). Distributed systems are less resource efficient (unless one recomputes certain things over and over). Democracy vs. dictatorship. Everything does look simple in a centralized system.
  6. If you’ve been at Cloudera long enough you remember the April 1st, 2010 blog post http://www.cloudera.com/blog/2010/04/pushing-the-limits-of-distributed-processing/ written by Omer. Apple Q2: 35.1 million iPhones, 11.8 million iPads. 30 million iPads x 32 GB = 10^18 bytes (an exabyte). RFID collects a bunch of information; remote devices will collect more. Moreover, they are stateful devices (another way to say smart).
  7. What do you do when the data is collected (beyond ETL)? You expand it. You can pre-create some of the combinations in a distributed way. A few algorithms run in linear time, but they are not really interesting. Random projections do work, but they are just an artifact of poor problem formulation in the first place. Admit that certain problems are not solvable (by brute force). Learn to live with uncertainty. Why not build heuristics? Turkey paradox: one wrong move can lead to a disastrous outcome. (Jolly Chen) Using an IBM machine with 2,880 cores at 4.25 GHz and 16 terabytes of RAM, running for about 10,750,000 single-core CPU-hours, they solved the King's Gambit (a classical chess opening). Where is the limit? We cannot solve a Schrödinger equation for the whole universe (we will not be able to store the state).
  8. Some of the problems will never be solved. Go – 10^360 nodes in a game tree. 2,598,960 possible hands in poker (but it is a more complex game as it involves dealing with incomplete information and emotions, as is bridge). It can also be noted that since there are about 31 million seconds in a year, it would take about 2¼ years, playing 16 hours a day at one move per second, to play 47 million moves. As to 10^48, since the future age of the universe is projected to be less than 1000 trillion years [10] and no computer is projected to compute anything close to a trillion teraflops (one yottaflop), any number higher than 10^39 is beyond possibility of being played.
  9. At least some of the problems will not be solved in time. If we had all the time (the universe is projected to be less than 1000 trillion years) we could (probably) get the exact answer. Some analytics companies: http://cetas.net/ (acquired by VMware); http://www.woopra.com (analyzes traffic to a website in real time); http://www.wibidata.com/ (our friends).
  10. Data mining likes to have every bit of information in one place. This is neither necessary nor required for probabilistic computations. And more and more computations are about uncertainty and risk management. Should be ~15 minutes
  11. Disk moves at 50 m/s vs 300,000,000 m/s. It is much easier for me to grab a remote from a table than to go to LA and back with a remote. RAM is faster than disks (RAM ns, disk ms). There are 1,832,160 feet in 347 miles. Combining storage or processing capabilities across a distributed system of machines is non-trivial. Can we do at least 1,000 feet (300 m)? Network? There is no “virtual memory”. HBase (sparse map column-family oriented DB) with enhanced consistency guarantees
  12. There was: push computation to the data (MapReduce); push data to the computations (map side join). We need to push something to where it can be done in a distributed fashion. What we did with storage needs to be done with statistical computations. If you carry one thing out of my talk, I want it to be this: push computations to the source
  13. Pre-compute pieces at the source of data. We will show that we can do Naïve Bayes, Assoc, NN. And potentially push new reqs to the source
  14. More recent column families are accessed more often. Versioning can be used for that, but we didn’t go down this path. Column family gives you data locality (more recent data are accessed more frequently)
  15. Value is just 8 bytes -> sweet case for HBase
  16. No salt (or random key). Column families, keys and column names are just ASCII for now
  17. Data mining likes to have every bit of information in one place. This is neither necessary nor required for probabilistic computations. And more and more computations are about uncertainty and risk management. Should be ~15 minutes
  18. Should be around 20 minutes
  19. Assumes conditional independence of predictors given the target. Can be completely substantiated given pairwise counts. Remember each key starts with a prefix var=val; there are only N such prefixes for each record!
  20. Need full cardinality of the counters
  21. A generic measure of mutual information between two subsets of nodes. For independent random variables this is 0. After this, BN learning is just a min span tree
  22. Assoc is an itemset generation (which can be done by dynamically adjusting the types of counts we collect). A frequent measure of importance is support
  23. The # of rules grows with decreasing support. Even for a min support of one it is not disastrous
  24. This is DataDesk (just to show that Microsoft is not my only tool). To convert exponential things to linear just take a log of the X axis. Linear deals with additions, exponential with multiplication (or addition of the logs). The actual time per pattern increases with min support!
  25. The amount of time per itemset increases
  26. Transparently trade recency vs statistical error (time can be replaced with min # of trials or counts)
  27. Push computations to the source. Convert an exponential problem to linear. The problem is linear in # of observations. The non-linear part has been moved out to the source. In plans: dynamic adjustment of counters (depth, time buckets) to collect. What we accomplished is to make an exponential problem linear by distributing the compute-intensive part, a.k.a. MapReduce for data mining (+ get time dependence for free). The code will be in the public domain (still working with the contractor of this work)
  28. Anyone to work on this?
  29. I hope that I got you interested... If you want to contribute let me know. Cloudera is hiring
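
Notes 24 and 25 state that the time per pattern increases with min support. This can be checked against the slide 24 table with a few lines of Python, assuming that "Rules" is the number of patterns found and "Time(ms)" is the total run time. Both readings are assumptions, as are the variable names.

```python
# (min support, rules, total time in ms), copied from the slide 24 table.
rows = [
    (1, 69_309, 25_659_052),
    (2, 58_623, 23_103_547),
    (4, 48_270, 20_782_325),
    (8, 38_661, 17_643_592),
    (16, 28_988, 13_994_334),
    (32, 19_939, 9_714_935),
]

# Time per pattern in ms, assuming Time(ms) is the total for the whole run.
per_pattern = [(support, time_ms / rules) for support, rules, time_ms in rows]

for support, ms in per_pattern:
    print(f"min support {support:>2}: {ms:6.1f} ms per pattern")

# True if every row takes longer per pattern than the row before it.
increasing = all(a[1] < b[1] for a, b in zip(per_pattern, per_pattern[1:]))
print("time per pattern increases with min support:", increasing)
```

Under these assumptions the ratio rises from roughly 0.37 s to 0.49 s per pattern, in line with the 0.5 sec per pattern quoted for Mahout FPGrowth on slide 23.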