Exact Inference in Bayesian
Networks using MapReduce
Alex Kozlov
Cloudera, Inc.
Session Agenda


• About Me
• About Cloudera
• Bayesian (Probabilistic) Networks
• BN Inference 101
• CPCS Network
• Why BN Inference
• Inference with MR
• Results
• Conclusions
About Me



• Worked on BN Inference in 1995-1998 (for Ph.D.)
 ›   Published the fastest implementation at the time
• Worked on DM/BI field since then
• Recently joined Cloudera, Inc.
 ›   Started looking at how to solve world’s hardest problems




About Cloudera


Founded in the summer of 2008
Cloudera helps organizations profit from all of their data. We deliver the
  industry-standard platform which consolidates, stores and processes
  any kind of data, from any source, at scale. We make it possible to do
  more powerful analysis of more kinds of data, at scale, than ever
  before. With Cloudera, you get better insight into your customers,
  partners, vendors and businesses.


Cloudera’s platform is built on the popular open source Apache Hadoop
  project. We deliver the innovative work of a global community of
  contributors in a package that makes it easy for anyone to put the
  power of Google, Facebook and Yahoo! to work on their own problems.


Bayesian Networks


1. Nodes
2. Edges
3. Probabilities


 Bayes, Thomas (1763). An essay towards solving a problem in the doctrine
 of chances (published posthumously by his friend). Philosophical
 Transactions of the Royal Society of London, 53:370-418.



Applications


1. Computational biology and bioinformatics (gene regulatory networks,
   protein structure, gene expression analysis)
2. Medicine
3. Document classification, information retrieval
4. Image processing
5. Data fusion
6. Gaming
7. Law
8. On-line advertising!



A Simple BN Network


Nodes: Rain, Sprinkler, Wet Driveway

Pr(Rain)
   T      F
   0.2    0.8

Pr(Sprinkler | Rain)
   Rain   T      F
   F      0.4    0.6
   T      0.1    0.9

Pr(Wet Driveway | Sprinkler, Rain)
   Sprinkler, Rain   T       F
   F, F              0.01    0.99
   F, T              0.8     0.2
   T, F              0.9     0.1
   T, T              0.99    0.01

Example queries:
   Pr(Rain | Wet Driveway)
   Pr(Sprinkler Broken | !Wet Driveway & !Rain)
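
The first query on this slide can be worked out by brute-force enumeration of the joint distribution. The sketch below is only an illustration: it assumes the left-hand table is Pr(Sprinkler | Rain) and the right-hand one is Pr(Rain), and the function and variable names are hypothetical.

```python
from itertools import product

# CPTs as read from the slide (this reading of the flattened tables is an assumption).
P_RAIN = {True: 0.2, False: 0.8}
P_SPRINKLER_ON = {False: 0.4, True: 0.1}            # Pr(Sprinkler=T | Rain)
P_WET = {(False, False): 0.01, (False, True): 0.8,
         (True, False): 0.9, (True, True): 0.99}    # Pr(Wet=T | Sprinkler, Rain)

def joint(rain, sprinkler, wet):
    """Pr(Rain, Sprinkler, Wet) as the product of the three tables."""
    p_s = P_SPRINKLER_ON[rain] if sprinkler else 1.0 - P_SPRINKLER_ON[rain]
    p_w = P_WET[(sprinkler, rain)] if wet else 1.0 - P_WET[(sprinkler, rain)]
    return P_RAIN[rain] * p_s * p_w

def query(target, evidence):
    """Pr(target | evidence) by summing the joint over all consistent worlds."""
    names = ("rain", "sprinkler", "wet")
    num = den = 0.0
    for values in product((True, False), repeat=3):
        world = dict(zip(names, values))
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(**world)
            den += p
            if all(world[k] == v for k, v in target.items()):
                num += p
    return num / den

p_rain_given_wet = query({"rain": True}, {"wet": True})
print(f"Pr(Rain | Wet Driveway) = {p_rain_given_wet:.3f}")
```

Under these assumed tables the result is 0.1638 / 0.4566, roughly 0.359. Enumeration touches every row of the joint, which is fine for three binary nodes and hopeless for hundreds.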
Asia Network

[Figure: Asia network. Factors shown on the nodes:]
   Pr(Visit to Asia)
   Pr(Smoking)
   Pr(Tuberculosis | Visit to Asia)
   Pr(Lung Cancer | Smoking)
   Pr(Bronchitis | Smoking)
   Pr(C | BE)
   Pr(X-Ray | Lung Cancer or Tuberculosis)
   Pr(Dyspnea | CG)

Example query:
   Pr(Lung Cancer | Neg X-Ray & Positive Dyspnea)
BN Inference 101 (in Hive)


JPD = <product of all probabilities and conditional
  probabilities in the network> = Pr(A, B, …, H)
PAB =
   SELECT A, B, SUM(PROB) FROM JPD GROUP BY A, B;
PB = SELECT B, SUM(PROB) FROM PAB GROUP BY B;
Pr(A|B) = Pr(A,B)/Pr(B) – Bayes’ rule


CPCS is 422 nodes, a table of at least 2^422 rows!
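
The same recipe can be tried end to end on a toy joint table. The sketch below uses Python's built-in sqlite3 as a stand-in for Hive; the three-variable chain A -> B -> C and all of its numbers are hypothetical and chosen only so the result is easy to check by hand.

```python
import itertools
import sqlite3

# Hypothetical CPTs for a chain A -> B -> C (not from the slides).
p_a = {0: 0.7, 1: 0.3}
p_b_given_a = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.7}   # keyed (b, a)
p_c_given_b = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}   # keyed (c, b)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE JPD (A INTEGER, B INTEGER, C INTEGER, PROB REAL)")

# JPD = product of all probabilities and conditional probabilities.
for a, b, c in itertools.product((0, 1), repeat=3):
    prob = p_a[a] * p_b_given_a[(b, a)] * p_c_given_b[(c, b)]
    conn.execute("INSERT INTO JPD VALUES (?, ?, ?, ?)", (a, b, c, prob))

# Marginalize: Pr(A, B) from the joint, then Pr(B) from Pr(A, B).
conn.execute("CREATE TABLE PAB AS SELECT A, B, SUM(PROB) AS PROB FROM JPD GROUP BY A, B")
conn.execute("CREATE TABLE PB AS SELECT B, SUM(PROB) AS PROB FROM PAB GROUP BY B")

def conditional(a, b):
    """Pr(A=a | B=b) = Pr(A=a, B=b) / Pr(B=b)."""
    (p_ab,) = conn.execute(
        "SELECT PROB FROM PAB WHERE A = ? AND B = ?", (a, b)).fetchone()
    (p_b,) = conn.execute("SELECT PROB FROM PB WHERE B = ?", (b,)).fetchone()
    return p_ab / p_b

print(conditional(1, 1))
```

With these numbers Pr(A=1, B=1) = 0.21 and Pr(B=1) = 0.35, so the conditional is about 0.6. The JPD table has 2^n rows for n binary variables, which is why this direct approach stops working long before 422 nodes.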


Junction Tree
[Figure: junction tree for the Asia network. Factors shown in the figure:]
   Pr(Visit to Asia)
   Pr(Tuberculosis | Visit to Asia)
   Pr(F)
   Pr(E | F)
   Pr(G | F)
   Pr(C | BE)
   Pr(D | C)
   Pr(H | CG)

Example query:
   Pr(Lung Cancer | Dyspnea) = Pr(E | H)
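
Reading the factors on this slide together with the Asia network slide, the joint distribution appears to factor as shown below. The letter assignment is an assumption inferred from the labels, not stated on the slides: A = Visit to Asia, B = Tuberculosis, C = Lung Cancer or Tuberculosis, D = X-Ray, E = Lung Cancer, F = Smoking, G = Bronchitis, H = Dyspnea.

```latex
\Pr(A, B, \ldots, H) =
  \Pr(A)\,\Pr(B \mid A)\,\Pr(F)\,\Pr(E \mid F)\,\Pr(G \mid F)\,
  \Pr(C \mid B, E)\,\Pr(D \mid C)\,\Pr(H \mid C, G)
```

Each factor involves at most three variables, which is what lets the junction tree avoid materializing the full joint table.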
CPCS Networks


                     422 nodes
                     14 nodes describe diseases
                     33 risk factors
                     375 various findings related to diseases




Why Bayesian Network Inference?


                Choose the right tool for the right job!


   BN is an abstraction for reasoning and decision making
   Easy to incorporate human insight and intuitions
   Very general, no specific ‘label’ node
   Easy to do ‘what-if’, strength-of-influence and value-of-information
    analysis
   Immune to Gaussian assumptions


              It’s all just a joint probability distribution

Map & Reduces
Map, Aggregation 1 (+):
   A1B1, A2B1, A1B2, A2B2   ->   B1, B2
   C1D1, C2D1, C1D2, C2D2   ->   C1, C2

Keys:
   B1C1E1, B1C1E2, B1C2E1, B1C2E2, B2C1E1, B2C1E2, B2C2E1, B2C2E2

Reduce, Aggregation 2 (x), producing clique BCE:
   ∑ Pr(B1| A) x ∑ Pr(D| C1)
   Pr(C| BE) x ∑ Pr(B1| A) x ∑ Pr(D| C1)
MapReduce Implementation


for each clique in depth-first order:
   MAP:
       Sum over the variables to get ‘clique message’ (requires state, custom
         partitioner and input format)
       Emit factors for the next clique

   REDUCE:
       Multiply the factors from all children
       Include probabilities assigned to the clique
       Form the new clique values

the MAP is done over all child cliques
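
The two phases can be sketched in memory for a single parent clique BCE with child cliques AB and CD, following the earlier Map & Reduces slide. This is not the talk's Hadoop implementation: there is no partitioner, input format or job state here, and all names and numbers are hypothetical. In particular, a made-up prior on E stands in for whatever the rest of the tree would contribute.

```python
from collections import defaultdict
from itertools import chain, product

STATES = (1, 2)
PARENT = ("B", "C", "E")

def map_clique(child_vars, table, keep_var):
    """MAP: sum a child clique over everything except keep_var (Aggregation 1, +),
    then emit that message once for every key of the parent clique."""
    idx = child_vars.index(keep_var)
    message = defaultdict(float)
    for assignment, prob in table.items():
        message[assignment[idx]] += prob
    pos = PARENT.index(keep_var)
    for key in product(STATES, repeat=len(PARENT)):
        yield key, message[key[pos]]

def reduce_clique(key, factors, local):
    """REDUCE: multiply the child messages with the probabilities assigned
    to the clique itself (Aggregation 2, x)."""
    value = local[key]
    for factor in factors:
        value *= factor
    return value

# Hypothetical numbers, keyed by state tuples.
p_ab = {(1, 1): 0.54, (1, 2): 0.06, (2, 1): 0.20, (2, 2): 0.20}       # Pr(A) * Pr(B | A), keyed (A, B)
p_d_given_c = {(1, 1): 0.3, (1, 2): 0.7, (2, 1): 0.8, (2, 2): 0.2}   # Pr(D | C), keyed (C, D)
p_e = {1: 0.25, 2: 0.75}                                             # assumed stand-in for the rest of the tree
p_c1 = {(1, 1): 0.9, (1, 2): 0.6, (2, 1): 0.7, (2, 2): 0.1}          # Pr(C=1 | B, E), keyed (B, E)
local = {(b, c, e): (p_c1[(b, e)] if c == 1 else 1 - p_c1[(b, e)]) * p_e[e]
         for b, c, e in product(STATES, repeat=3)}

# "Shuffle": group the emitted factors by parent-clique key.
grouped = defaultdict(list)
for key, value in chain(map_clique(("A", "B"), p_ab, "B"),
                        map_clique(("C", "D"), p_d_given_c, "C")):
    grouped[key].append(value)

clique_bce = {key: reduce_clique(key, factors, local)
              for key, factors in grouped.items()}
```

Because no evidence is entered, the CD message sums to 1 for each state of C and the resulting BCE table sums to 1. With evidence on D, the same code would simply start from a CD table restricted to the observed state.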

Cliques, Trees, and Parallelism


 o Topological parallelism: compute branches C2 and C4 in parallel
 o Clique parallelism: divide computation of each clique into
   maps/reducers
 o Fall back to optimal factoring if a corresponding subtree is small
 o Combine multiple phases together
 o Reduce replication level

 [Figure: junction tree of cliques C1 through C6]
 Cliques may be larger than they appear!
CPCS Inference


CPCS:
The 360-node subnet has a largest ‘clique’ of 11,739,896 floats
 (fits into 2 GB)
The full 422-node version (absent, mild, moderate, severe):
 3,377,699,720,527,872 floats (or 12 PB of storage, but not all
 queries need it)


In most cases there is no need to do inference on the full network
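
The storage figure follows from the float count by simple arithmetic. The sketch below assumes 4-byte single-precision floats and binary units (1 PB read as 2^50 bytes); neither is stated on the slide, but together they reproduce the 12 PB figure exactly.

```python
BYTES_PER_FLOAT = 4  # assumption: single-precision floats

def storage_bytes(num_floats, bytes_per_float=BYTES_PER_FLOAT):
    """Bytes needed to hold a clique table of num_floats entries."""
    return num_floats * bytes_per_float

cpcs360_bytes = storage_bytes(11_739_896)
cpcs422_bytes = storage_bytes(3_377_699_720_527_872)

print(f"cpcs360 largest clique: {cpcs360_bytes / 2**20:.1f} MiB")
print(f"cpcs422: {cpcs422_bytes / 2**50:.1f} PiB")
```

The 422-node count equals 3 x 2^50, so at 4 bytes per float it comes to 12 x 2^50 bytes.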



Results

Network      Memory     Time (1997) [1]   Macbook Pro (2010) [2]   Hadoop (& future) [3]
Random (B)   10 MB      33 sec            < 1 sec
Random (A)   254 MB     260 sec           10 sec
cpcs360      2 GB       640 sec           15 sec                   1 min
cpcs422      > 12 PB    N/A               N/A                      Minutes to hours for most of the
                                                                   queries on most of the clusters

[1] ‘used an SGI Origin 2000 machine with sixteen MIPS R10000 processors (195
    MHz clock speed)’ in 1997
[2] Macbook Pro 4 GB DDR3 2.53 GHz
[3] 10 node Linux Xeon cluster 24 GB quad 2-core

Conclusions


   Exact probabilistic inference is finally in sight for the full 422 node
    CPCS network
   Hadoop helps to solve the world’s hardest problems


         What you should know after this talk

BN is a DAG and represents a joint probability distribution (JPD)
Can compute conditional probabilities by multiplying and summing JPD
For large networks, this may be PBytes of intermediate data, but it’s MR




Questions?


   alexvk@{cloudera,gmail}.com

Exact Inference in Bayesian Networks using MapReduce (Hadoop Summit 2010)

  • 1. Exact Inference in Bayesian Networks using MapReduce. Alex Kozlov, Cloudera, Inc.
  • 2. Session Agenda: About Me; About Cloudera; Bayesian (Probabilistic) Networks; BN Inference 101; CPCS Network; Why BN Inference; Inference with MR; Results; Conclusions
  • 3. About Me: Worked on BN inference in 1995-1998 (for Ph.D.) and published the fastest implementation at the time. Worked in the DM/BI field since then. Recently joined Cloudera, Inc. and started looking at how to solve the world’s hardest problems.
  • 4. About Cloudera: Founded in the summer of 2008. Cloudera helps organizations profit from all of their data. We deliver the industry-standard platform which consolidates, stores and processes any kind of data, from any source, at scale. We make it possible to do more powerful analysis of more kinds of data, at scale, than ever before. With Cloudera, you get better insight into your customers, partners, vendors and businesses. Cloudera’s platform is built on the popular open source Apache Hadoop project. We deliver the innovative work of a global community of contributors in a package that makes it easy for anyone to put the power of Google, Facebook and Yahoo! to work on their own problems.
  • 5. Bayesian Networks: 1. Nodes 2. Edges 3. Probabilities. Reference: Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances, published posthumously by his friend. Philosophical Transactions of the Royal Society of London, 53:370-418.
  • 6. Applications: 1. Computational biology and bioinformatics (gene regulatory networks, protein structure, gene expression analysis) 2. Medicine 3. Document classification, information retrieval 4. Image processing 5. Data fusion 6. Gaming 7. Law 8. On-line advertising!
  • 7. A Simple Bayesian Network (nodes: Rain, Sprinkler, Wet Driveway)
       Pr(Rain):  T 0.2, F 0.8
       Pr(Sprinkler | Rain):
         Rain              T      F
         F                 0.4    0.6
         T                 0.1    0.9
       Pr(Wet Driveway | Sprinkler, Rain):
         Sprinkler, Rain   T      F
         F, F              0.01   0.99
         F, T              0.8    0.2
         T, F              0.9    0.1
         T, T              0.99   0.01
       Example queries: Pr(Rain | Wet Driveway); Pr(Sprinkler Broken | !Wet Driveway & !Rain)
  • 8. Asia Network (figure). Factors shown: Pr(Visit to Asia); Pr(Smoking); Pr(Tuberculosis | Visit to Asia); Pr(Lung Cancer | Smoking); Pr(Bronchitis | Smoking); Pr(C | BE); Pr(X-Ray | Lung Cancer or Tuberculosis); Pr(Dyspnea | CG). Example query: Pr(Lung Cancer | Neg X-Ray & Positive Dyspnea)
  • 9. BN Inference 101 (in Hive): JPD = <product of all probabilities and conditional probabilities in the network> = Pr(A, B, …, H). PAB = SELECT A, B, SUM(PROB) AS PROB FROM JPD GROUP BY A, B; PB = SELECT B, SUM(PROB) AS PROB FROM PAB GROUP BY B; Pr(A|B) = Pr(A,B)/Pr(B) (the definition of conditional probability, from which Bayes’ rule follows). CPCS is 422 nodes, a table of at least 2^422 rows!
  • 10. Junction Tree (figure). Factors shown: Pr(F); Pr(E | F); Pr(G | F); Pr(Visit to Asia); Pr(Tuberculosis | Visit to Asia); Pr(C | BE); Pr(H | CG); Pr(D | C). Example query: Pr(Lung Cancer | Dyspnea) = Pr(E | H)
  • 11. CPCS Networks: 422 nodes, of which 14 describe diseases, 33 are risk factors, and 375 are various findings related to diseases.
  • 13. Why Bayesian Network Inference? Choose the right tool for the right job! BN is an abstraction for reasoning and decision making. Easy to incorporate human insight and intuitions. Very general, no specific ‘label’ node. Easy to do ‘what-if’, strength of influence, and value of information analysis. Immune to Gaussian assumptions. It’s all just a joint probability distribution.
  • 14. Map & Reduces (figure). Map keys A1B1, A2B1, A1B2, A2B2 reduce to B1, B2; map keys C1D1, C2D1, C1D2, C2D2 reduce to C1, C2. Aggregation 1 (+): sums such as ∑ Pr(B1 | A) and ∑ Pr(D | C1). Aggregation 2 (x): products over the BCE clique entries B1C1E1 ... B2C2E2, e.g. Pr(C | BE) x ∑ Pr(B1 | A) x ∑ Pr(D | C1).
  • 15. MapReduce Implementation
       for each clique in depth-first order:
         MAP (done over all child cliques):
           Sum over the variables to get the ‘clique message’ (requires state, custom partitioner and input format)
           Emit factors for the next clique
         REDUCE:
           Multiply the factors from all children
           Include probabilities assigned to the clique
           Form the new clique values
  • 16. Cliques, Trees, and Parallelism (figure: clique tree C1 to C6) o Topological parallelism: compute branches C2 and C4 in parallel o Clique parallelism: divide computation of each clique into maps/reducers o Fall back to optimal factoring if a corresponding subtree is small o Combine multiple phases together o Reduce replication level. Cliques may be larger than they appear!
  • 17. CPCS Inference: In the 360-node subnet, the largest ‘clique’ has 11,739,896 floats (fits into 2 GB). In the full 422-node version (states: absent, mild, moderate, severe), it has 3,377,699,720,527,872 floats (or 12 PB of storage, though not all queries need it). In most cases there is no need to do inference on the full network.
  • 18. Results
       Network      Memory    Time (1997 [1])   Macbook Pro (2010 [2])   Hadoop (& future [3])
       Random (B)   10 MB     33 sec            < 1 sec
       Random (A)   254 MB    260 sec           10 sec
       cpcs360      2 GB      640 sec           15 sec                   1 min
       cpcs422      > 12 PB   N/A               N/A
       Minutes to hours for most of the queries on most of the clusters
       [1] ‘used an SGI Origin 2000 machine with sixteen MIPS R10000 processors (195 MHz clock speed)’ in 1997
       [2] Macbook Pro 4 GB DDR3 2.53 GHz
       [3] 10 node Linux Xeon cluster 24 GB quad 2-core
  • 19. Conclusions: Exact probabilistic inference is finally in sight for the full 422-node CPCS network. Hadoop helps to solve the world’s hardest problems. What you should know after this talk: a BN is a DAG and represents a joint probability distribution (JPD); conditional probabilities can be computed by multiplying and summing the JPD; for large networks, this may be PBytes of intermediate data, but it’s MR.
  • 20. Questions? alexvk@{cloudera,gmail}.com
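
The brute-force recipe from slide 9 (build the JPD, sum it with GROUP BY, then divide) can be tried on the three-node network of slide 7. The following is a minimal Python sketch, not the author's code. The assignment of the slide's numbers to the three probability tables is an assumption, since the slide layout is flattened in this transcript, and the helper names `joint` and `marginal` are made up for illustration.

```python
from itertools import product

# Probability tables as read from slide 7 (row-to-table assignment is an assumption).
P_RAIN = {True: 0.2, False: 0.8}
# P_SPRINKLER[rain][sprinkler] = Pr(Sprinkler = sprinkler | Rain = rain)
P_SPRINKLER = {False: {True: 0.4, False: 0.6},
               True: {True: 0.1, False: 0.9}}
# P_WET[(sprinkler, rain)] = Pr(Wet Driveway = True | sprinkler, rain)
P_WET = {(False, False): 0.01, (False, True): 0.8,
         (True, False): 0.9, (True, True): 0.99}

NAMES = ("rain", "sprinkler", "wet")


def joint():
    """Return the full JPD as rows (rain, sprinkler, wet, prob)."""
    rows = []
    for rain, sprinkler, wet in product([True, False], repeat=3):
        p_wet_true = P_WET[(sprinkler, rain)]
        prob = (P_RAIN[rain]
                * P_SPRINKLER[rain][sprinkler]
                * (p_wet_true if wet else 1.0 - p_wet_true))
        rows.append((rain, sprinkler, wet, prob))
    return rows


def marginal(rows, keep):
    """Equivalent of SELECT keep, SUM(PROB) FROM rows GROUP BY keep."""
    idx = [NAMES.index(name) for name in keep]
    out = {}
    for row in rows:
        key = tuple(row[i] for i in idx)
        out[key] = out.get(key, 0.0) + row[3]
    return out


rows = joint()
p_rain_wet = marginal(rows, ["rain", "wet"])   # Pr(Rain, Wet Driveway)
p_wet = marginal(rows, ["wet"])                # Pr(Wet Driveway)
# Pr(Rain = T | Wet Driveway = T) = Pr(Rain, Wet) / Pr(Wet)
posterior = p_rain_wet[(True, True)] / p_wet[(True,)]
print(f"Pr(Rain | Wet Driveway) = {posterior:.4f}")
```

The table here has only 2^3 rows. The same approach on a 422-node network would need at least 2^422 rows, which is why the deck moves on to junction trees.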
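
The two operations named on slide 15, summing over variables to form a clique message (MAP) and multiplying factors into the next clique (REDUCE), can be sketched in plain Python without Hadoop. This is an illustration under stated assumptions rather than the implementation from the talk: variables are binary, the two-clique chain A-B-C and all of its numbers are hypothetical, and `sum_out` and `multiply` are made-up helper names.

```python
from itertools import product

# A factor is (variable names, table); the table maps value tuples to floats.


def sum_out(factor, var):
    """MAP-style step: sum a factor over `var` to get a message."""
    names, table = factor
    i = names.index(var)
    out = {}
    for assignment, value in table.items():
        key = assignment[:i] + assignment[i + 1:]
        out[key] = out.get(key, 0.0) + value
    return names[:i] + names[i + 1:], out


def multiply(f, g):
    """REDUCE-style step: multiply two factors, matching shared variables."""
    f_names, f_table = f
    g_names, g_table = g
    names = f_names + tuple(n for n in g_names if n not in f_names)
    out = {}
    for assignment in product([0, 1], repeat=len(names)):  # binary variables only
        env = dict(zip(names, assignment))
        f_key = tuple(env[n] for n in f_names)
        g_key = tuple(env[n] for n in g_names)
        out[assignment] = f_table[f_key] * g_table[g_key]
    return names, out


# Hypothetical child clique AB holding Pr(A) * Pr(B | A).
clique_ab = (("A", "B"), {(0, 0): 0.56, (0, 1): 0.14,
                          (1, 0): 0.03, (1, 1): 0.27})
# Hypothetical probabilities assigned to the parent clique BC: Pr(C | B).
cpt_c_given_b = (("B", "C"), {(0, 0): 0.9, (0, 1): 0.1,
                              (1, 0): 0.3, (1, 1): 0.7})

message = sum_out(clique_ab, "A")              # message over B, here Pr(B)
clique_bc = multiply(message, cpt_c_given_b)   # new clique values, here Pr(B, C)
p_c = sum_out(clique_bc, "B")                  # marginal Pr(C)
print(p_c)
```

Each clique only ever holds a table over its own variables, so the work is bounded by the largest clique rather than by the full joint table.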
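
The sizes quoted on slides 9 and 17 can be checked with a few lines of arithmetic. The sketch below assumes 4-byte (single-precision) floats and binary units (1 PB taken as 2^50 bytes). Neither is stated in the deck, but both are consistent with the 12 PB figure.

```python
BYTES_PER_FLOAT = 4      # assumption: single-precision floats
MIB = 2 ** 20
PIB = 2 ** 50

cpcs360_floats = 11_739_896                 # largest clique, 360-node subnet
cpcs422_floats = 3_377_699_720_527_872      # largest clique, full 422-node network

cpcs360_mib = cpcs360_floats * BYTES_PER_FLOAT / MIB
cpcs422_pib = cpcs422_floats * BYTES_PER_FLOAT / PIB

# Number of decimal digits in the 2^422-row lower bound for the naive JPD table.
jpd_rows_digits = len(str(2 ** 422))

print(f"cpcs360 largest clique: {cpcs360_mib:.1f} MiB")
print(f"cpcs422 largest clique: {cpcs422_pib:.0f} PiB")
print(f"2^422 has {jpd_rows_digits} decimal digits")
```

Under these assumptions the 422-node clique is exactly 3 x 2^50 floats, which matches the 12 PB on the slide, while the largest 360-node clique alone takes well under the 2 GB quoted for the whole computation.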