The Artful Business of Data Mining
Distributed Schema-less Document-Based Databases




Wednesday 27 March 13
David Coallier
@davidcoallier

Data Scientist
At Engine Yard (.com)

RDBMS

Structure
Restrictions
Safety

id    name     age     address
1     david    1       315
2     divad    3       51
3     foo      41      31
4     bar      42      98
5     john     3315    85
6     jack     4       11
7     jill     8       66
...   ...      ...     ...




What If?


id    name     age    address    phone
1     david    26     IE         353
2     divad    27     US         1
3     foo      42     IE         353
4     bar      31     CA         1
5     john     17     NZ         131
6     jack     128    DK         311
7     jill     21     IE         353
...   ...      ...    ...        ...




Before moving on

JSON

What is JSON?

{
    "firstName": "David",
    "lastName": "Coallier",
    "age": 26,
    "address": {
        "streetAddress": "Mansfield House",
        "city": "Crosshaven"
    },
    "phoneNumbers": [
        {
            "type": "mobile",
            "number": "0863299999"
        }
    ]
}
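
To make "What is JSON?" concrete, here is a minimal sketch of reading that document from JavaScript. The variable names are chosen for this example only. Note that strict JSON does not allow a trailing comma after the last member of an object, which the last lines of the sketch check.

```javascript
// The document from the slide as a JSON string.
const text = `{
  "firstName": "David",
  "lastName": "Coallier",
  "age": 26,
  "address": {
    "streetAddress": "Mansfield House",
    "city": "Crosshaven"
  },
  "phoneNumbers": [
    { "type": "mobile", "number": "0863299999" }
  ]
}`;

// JSON.parse turns the text into plain objects, arrays, strings and numbers.
const person = JSON.parse(text);

const city = person.address.city;                 // nested object
const firstPhone = person.phoneNumbers[0].number; // array of objects
const ageType = typeof person.age;                // numbers stay numbers

// JSON.stringify goes the other way; a round trip preserves the data.
const roundTrip = JSON.parse(JSON.stringify(person));

// A trailing comma is rejected by the parser.
let trailingCommaRejected = false;
try {
  JSON.parse('{"city": "Crosshaven",}');
} catch (e) {
  trailingCommaRejected = e instanceof SyntaxError;
}

console.log(city, firstPhone, ageType, trailingCommaRejected);
```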




What is HTTP?

What is a Schema?

Alternative

Schema-less

Does NOT mean structure-less

Documents and K-V Buckets

CouchDB
Cluster of unreliable commodity hardware

Replication Attachments
Generated “random” ids
Dictionary Revisions?
JSON Objects
HTTP CRUD

Documents

{
    "_id": "131dafsd1vasd",
    "_rev": "12-fva32asdf",
    "firstName": "David",
    "lastName": "Coallier",
    "age": 26,
    "address": {
        "streetAddress": "Mansfield House",
        "city": "Crosshaven"
    },
    "phoneNumbers": [
        {
            "type": "mobile",
            "number": "0863299999"
        }
    ]
}
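
As a rough illustration of "HTTP CRUD" on such a document, the sketch below only builds the requests and never sends them. The helper name `couchRequest`, the database name `people` and the base URL are assumptions for this example; the URL shapes follow CouchDB's document API (`PUT`, `GET` and `DELETE` on `/{db}/{docid}`).

```javascript
// Build (but do not send) the HTTP request for one CRUD operation on a document.
// `baseUrl` and `db` are placeholders chosen for this sketch.
function couchRequest(op, doc, baseUrl = "http://localhost:5984", db = "people") {
  const url = `${baseUrl}/${db}/${encodeURIComponent(doc._id)}`;
  switch (op) {
    case "create": // PUT with a body that has no _rev yet
    case "update": // PUT with the current _rev in the body, else CouchDB answers 409 Conflict
      return { method: "PUT", url, body: JSON.stringify(doc) };
    case "read":
      return { method: "GET", url };
    case "delete": // the revision being deleted goes in the query string
      return { method: "DELETE", url: `${url}?rev=${encodeURIComponent(doc._rev)}` };
    default:
      throw new Error(`unknown operation: ${op}`);
  }
}

const doc = { _id: "131dafsd1vasd", _rev: "12-fva32asdf", firstName: "David" };
const readReq = couchRequest("read", doc);
const updateReq = couchRequest("update", doc);
const deleteReq = couchRequest("delete", doc);

console.log(readReq.method, readReq.url);
console.log(deleteReq.method, deleteReq.url);
```

The `_rev` field is what makes updates and deletes safe without locks: a request that carries a stale revision is refused rather than silently overwriting newer data.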




How do you find anything?

Map/Reduce

...

Riak

Dynamo Paper

CAP Theorem

Key-Value Buckets

Differences?

                 CouchDB                    Riak
Storage Model    append-only                bitcask
Access           HTTP                       HTTP, PB
Retrieval        Views (M/R)                M/R, Indexes, Search
Versioning       Eventual Consistency       Vector Clocks
Concurrency      No Locking                 Client Resolution
Replication      master/master/slave        replication, clustering
Scaling In/Out   Big Couch                  Built-in
Management       Futon/Fauxton              Riak Control
                 http://guide.couchdb.org   http://downloads.basho.com/papers/bitcask-intro.pdf



Map/Reduce

Mapper:
Executed on document

Reducer:
Receives output from mappers


{ "_id": "...", "_rev": "...", "age": 26 }
{ "_id": "...", "_rev": "...", "age": 32, "heads": 3 }
{ "_id": "...", "_rev": "...", "age": 42 }
{ "_id": "...", "_rev": "...", "age": 17 }








{ "age": 32, "heads": 3 }

Map: find-ages
function find_ages(doc) {
  // typeof returns a string, so compare against "undefined"
  if (typeof doc.age !== "undefined") {
    emit(doc._id, doc.age);
  }
}




Reduce: sum

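function reduce_sum(keys, values, rereduce) {
  // sum() is the helper CouchDB provides to reduce functions;
  // naming this function "sum" as well would make it call itself forever
  return sum(values);
}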


Map: find-ages

26   32   42   17

Reduce: sum

117
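
The flow above can be tried outside a database with a small stand-in for the view server. This is only a sketch: the `emit` collector, the `_id` and `_rev` values and the function names `findAges` and `sumReduce` are made up for the example. The ages are stored as numbers, since adding their string form would concatenate instead of summing.

```javascript
// Four documents like the ones on the slides (placeholder ids and revisions).
const docs = [
  { _id: "a", _rev: "1-x", age: 26 },
  { _id: "b", _rev: "1-x", age: 32, heads: 3 },
  { _id: "c", _rev: "1-x", age: 42 },
  { _id: "d", _rev: "1-x", age: 17 },
];

// Stand-in for the emit() that a view server provides to map functions.
const rows = [];
function emit(key, value) {
  rows.push({ key, value });
}

// Map: runs once per document.
function findAges(doc) {
  if (typeof doc.age !== "undefined") {
    emit(doc._id, doc.age);
  }
}

// Reduce: receives everything the mappers emitted.
function sumReduce(keys, values) {
  return values.reduce((total, v) => total + v, 0);
}

docs.forEach(findAges);
const total = sumReduce(rows.map((r) => r.key), rows.map((r) => r.value));
console.log(total); // 117
```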
So what?

The Machines, They Lurn.

The Problem

Statistics Example

Mean, Std. Deviation
Age

$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$$

$$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$$
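
A short derivation shows why a single pass is enough: the variance can be rewritten in terms of the sum of the values and the sum of their squares, so it suffices to collect each value together with its square. The worked numbers use the four ages 26, 32, 42 and 17 from the example documents.

```latex
\begin{align}
\sigma^2 &= \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2 \\
         &= \frac{1}{n}\sum_{i=1}^{n}x_i^2 \;-\; 2\mu\cdot\frac{1}{n}\sum_{i=1}^{n}x_i \;+\; \mu^2 \\
         &= \frac{1}{n}\sum_{i=1}^{n}x_i^2 \;-\; \mu^2
\end{align}

% Worked example with x = (26, 32, 42, 17), n = 4:
\begin{align}
\mu      &= \frac{26+32+42+17}{4} = \frac{117}{4} = 29.25 \\
\frac{1}{n}\sum x_i^2 &= \frac{676+1024+1764+289}{4} = \frac{3753}{4} = 938.25 \\
\sigma^2 &= 938.25 - 29.25^2 = 938.25 - 855.5625 = 82.6875 \\
\sigma   &\approx 9.09
\end{align}
```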




Mapper:
Retrieve values, pre-process

Reducer:
Receive, process further.

[
    [26, 676],
    [32, 1024],
    [42, 1764],
    [17, 289]
]

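// The map and reduce fields of a view (`view` is just an illustrative name).
var view = {
  /**
   * Our mapper function: emit each age together with its square.
   */
  map: function (doc) {
    emit(null, [doc.age, doc.age * doc.age]);
  },

  /**
   * Our reducer: mean and standard deviation of the ages.
   * Assumes rereduce is false, i.e. values holds mapper output.
   */
  reduce: function (keys, values, rereduce) {
    var N = 0;
    var summed = 0;
    var summedSquare = 0;

    for (var i = 0; i < values.length; i++) {
      N += 1;
      summed += values[i][0];
      summedSquare += values[i][1];
    }

    var mean = summed / N;
    var standard_deviation = Math.sqrt(
      (summedSquare / N) - (mean * mean)
    );

    return [mean, standard_deviation];
  }
};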




// Same view, shortened (`view` is just an illustrative name).
var view = {
  /**
   * Our mapper function.
   */
  map: function (doc) {
    emit(null, [doc.age, doc.age * doc.age]);
  },

  /**
   * Our reducer, using the sum() helper CouchDB provides.
   * Assumes rereduce is false, i.e. values holds mapper output.
   */
  reduce: function (keys, values, rereduce) {
    var N = values.length;
    var summed = sum(values.map(function (v) { return v[0]; }));
    var summedSquares = sum(values.map(function (v) { return v[1]; }));

    var mean = summed / N;
    var standard_deviation = Math.sqrt(
      (summedSquares / N) - (mean * mean)
    );

    return [mean, standard_deviation];
  }
};
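
One caveat with returning `[mean, standard_deviation]` straight from the reducer: CouchDB may call a reduce function again on its own earlier outputs, with `rereduce` set to true, and a pair of means cannot be combined correctly. A possible variant, sketched here with hypothetical names `reduceStats` and `finalize`, keeps the partial sums `[N, sum, sumOfSquares]` through every reduce step and computes the statistics once at the end.

```javascript
// Reduce to partial sums [N, sum, sumOfSquares] so that combining
// partial results (rereduce === true) stays correct.
function reduceStats(keys, values, rereduce) {
  let n = 0;
  let total = 0;
  let totalSq = 0;
  for (const v of values) {
    if (rereduce) {
      // v is a partial result from an earlier reduce call: [N, sum, sumSq]
      n += v[0];
      total += v[1];
      totalSq += v[2];
    } else {
      // v is a mapper value: [age, age * age]
      n += 1;
      total += v[0];
      totalSq += v[1];
    }
  }
  return [n, total, totalSq];
}

// Final step, done once on the fully reduced triple (for example client-side).
function finalize([n, total, totalSq]) {
  const mean = total / n;
  return [mean, Math.sqrt(totalSq / n - mean * mean)];
}

// Simulate two partial reduces followed by a rereduce.
const left = reduceStats(null, [[26, 676], [32, 1024]], false);
const right = reduceStats(null, [[42, 1764], [17, 289]], false);
const combined = reduceStats(null, [left, right], true);
const [mean, sd] = finalize(combined);

console.log(combined, mean, sd);
```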


Naive Bayes

Real Life Fraud

$P(x_j = k \mid y = \text{fraudulent})$
$P(x_j = k \mid y = \text{normal})$
$P(y)$

We need to:
Sum the occurrences of $x_j = k$ for each $y$
to calculate $P(x \mid y)$

We need:
More than 1 mapper.

We need 4 mappers

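Mapper #1:
$\sum_i \mathbf{1}\{x_j = k \mid y = \text{fraudulent}\}$ (counts for $P(x_j = k \mid y = \text{fraudulent})$)

Mapper #2:
$\sum_i \mathbf{1}\{x_j = k \mid y = \text{normal}\}$ (counts for $P(x_j = k \mid y = \text{normal})$)

Mapper #3:
$\sum_i \mathbf{1}\{y = \text{fraudulent}\}$ (counts for $P(y = \text{fraudulent})$)

Mapper #4:
$\sum_i \mathbf{1}\{y = \text{normal}\}$ (counts for $P(y = \text{normal})$)
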
Reducer:
Sums up results for parameters

Cluster Analysis

k-means

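
The counting scheme can be sketched in a few lines. Everything here is illustrative: the training rows are made up, and the four mappers are folded into two functions that take the class label from the row instead of hard-coding "fraudulent" and "normal". The estimates are plain relative frequencies with no smoothing.

```javascript
// Hypothetical training rows: x is a feature vector, y is the class label.
const rows = [
  { x: ["IE", "card"], y: "fraudulent" },
  { x: ["IE", "cash"], y: "normal" },
  { x: ["US", "card"], y: "normal" },
  { x: ["IE", "card"], y: "normal" },
];

// Mappers #1 and #2: emit a count of 1 for every (label, feature index j, value k).
function mapFeatureCounts(row) {
  return row.x.map((k, j) => [`${row.y}|${j}|${k}`, 1]);
}

// Mappers #3 and #4: emit a count of 1 for the label itself.
function mapLabelCounts(row) {
  return [[row.y, 1]];
}

// Reducer: sum the emitted counts per key.
function reduceCounts(pairs) {
  const totals = {};
  for (const [key, count] of pairs) {
    totals[key] = (totals[key] || 0) + count;
  }
  return totals;
}

const featureCounts = reduceCounts(rows.flatMap(mapFeatureCounts));
const labelCounts = reduceCounts(rows.flatMap(mapLabelCounts));

// Relative-frequency estimates of P(x_j = k | y) and P(y).
function pFeature(j, k, y) {
  return (featureCounts[`${y}|${j}|${k}`] || 0) / labelCounts[y];
}
function pLabel(y) {
  return labelCounts[y] / rows.length;
}

console.log(pFeature(1, "card", "normal"), pLabel("fraudulent"));
```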
Mapper:
Divide vectors into subgroups, calculate d(p,q) between vectors, find centroids, sum them up.

Reducer:
Sum up the sums, get new centroids.
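
One k-means iteration in this style might look like the sketch below. It assumes d(p,q) is the Euclidean distance (compared in squared form) and that each mapper returns per-centroid partial sums and counts; the function names and the sample points are invented for the example.

```javascript
// Squared Euclidean distance; enough for picking the nearest centroid.
function dist2(p, q) {
  return p.reduce((acc, pi, i) => acc + (pi - q[i]) ** 2, 0);
}

// Mapper: for one subgroup of vectors, assign each vector to its nearest
// centroid and return per-centroid partial sums and counts.
function mapSubgroup(vectors, centroids) {
  const partial = centroids.map((c) => ({ sum: c.map(() => 0), count: 0 }));
  for (const v of vectors) {
    let best = 0;
    for (let c = 1; c < centroids.length; c++) {
      if (dist2(v, centroids[c]) < dist2(v, centroids[best])) best = c;
    }
    partial[best].count += 1;
    v.forEach((vi, i) => { partial[best].sum[i] += vi; });
  }
  return partial;
}

// Reducer: sum up the partial sums, then divide to get the new centroids.
function reduceCentroids(partials, centroids) {
  return centroids.map((old, c) => {
    const count = partials.reduce((n, p) => n + p[c].count, 0);
    if (count === 0) return old; // keep a centroid that attracted no vectors
    return old.map((_, i) => partials.reduce((s, p) => s + p[c].sum[i], 0) / count);
  });
}

// One iteration over two subgroups (made-up 2-D points).
const centroids = [[0, 0], [10, 10]];
const subgroups = [
  [[1, 1], [9, 9]],
  [[1, 3], [11, 11]],
];
const partials = subgroups.map((g) => mapSubgroup(g, centroids));
const updated = reduceCentroids(partials, centroids);
console.log(updated); // [[1, 2], [10, 10]]
```

Repeating the map and reduce steps with `updated` as the new centroids, until they stop moving, gives the usual k-means loop.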

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 

The Artful Business of Data Mining: Distributed Schema-less Document-Based Databases

  • 1. The Artful Business of Data Mining Distributed Schema-less Document-Based Databases Wednesday 27 March 13
  • 2. David Coallier @davidcoallier
  • 3. Data Scientist At Engine Yard (.com)
  • 5. Structure Restrictions Safety
  • 6. id    name     age    address
       1     david    1      315
       2     divad    3      51
       3     foo      41     31
       4     bar      42     98
       5     john     3315   85
       6     jack     4      11
       7     jill     8      66
       ...   ...      ...    ...
  • 12. id    name     age    address    phone
        1     david    26     IE         353
        2     divad    27     US         1
        3     foo      42     IE         353
        4     bar      31     CA         1
        5     john     17     NZ         131
        6     jack     128    DK         311
        7     jill     21     IE         353
        ...   ...      ...    ...        ...
  • 13. Before Moving on
  • 15. What is JSON?
  • 16. {
          "firstName": "David",
          "lastName": "Coallier",
          "age": 26,
          "address": {
            "streetAddress": "Mansfield House",
            "city": "Crosshaven"
          },
          "phoneNumbers": [
            { "type": "mobile", "number": "0863299999" }
          ]
        }
  • 17. What is HTTP?
  • 18. What is a Schema?
  • 21. Does NOT Mean Structure-less
  • 22. Documents and K-V Buckets
  • 23. CouchDB Cluster of unreliable commodity hardware
  • 24. Replication Attachments Generated “random” ids Dictionary Revisions? JSON Objects HTTP CRUD
  • 27. {
          "_id": "131dafsd1vasd",
          "_rev": "12-fva32asdf",
          "firstName": "David",
          "lastName": "Coallier",
          "age": 26,
          "address": {
            "streetAddress": "Mansfield House",
            "city": "Crosshaven"
          },
          "phoneNumbers": [
            { "type": "mobile", "number": "0863299999" }
          ]
        }
  • 28. How do you find Anything?
  • 32. Dynamo Paper
  • 33. CAP Theorem
  • 36.                 CouchDB                 Riak
        Storage Model   append-only             bitcask
        Access          HTTP                    HTTP, PB
        Retrieval       Views (M/R)             M/R, Indexes, Search
        Versioning      Eventual Consistency    Vector Clocks
        Concurrency     No Locking              Client Resolution
        Replication     master/master/slave     replication, clustering
        Scaling In/Out  Big Couch               Built-in
        Management      Futon/Fauxton           Riak Control
        http://guide.couchdb.org
        http://downloads.basho.com/papers/bitcask-intro.pdf
  • 38. Mapper: Executed on document. Reducer: Receives output from mappers.
  • 39. { "_id": "...", "_rev": "...", "age": "32", "heads": "3" }
        { "_id": "...", "_rev": "...", "age": "26" }
        { "_id": "...", "_rev": "...", "age": "42" }
        { "_id": "...", "_rev": "...", "age": "17" }
  • 41. { "age": "32", "heads": "3" }
  • 42. Map: find-ages
        { "_id": "...", "_rev": "...", "age": "32", "heads": "3" }
        { "_id": "...", "_rev": "...", "age": "26" }
        { "_id": "...", "_rev": "...", "age": "42" }
        { "_id": "...", "_rev": "...", "age": "17" }
  • 43. Map: find-ages
        function find_ages(doc) {
          // typeof yields a string, so compare against "undefined"
          if (typeof doc.age !== "undefined") {
            emit(doc._id, doc.age);
          }
        }
  • 45. Map: find-ages
        { "_id": "...", "_rev": "...", "age": "32", "heads": "3" }
        { "_id": "...", "_rev": "...", "age": "26" }
        { "_id": "...", "_rev": "...", "age": "42" }
        { "_id": "...", "_rev": "...", "age": "17" }
        26 32 42 17
  • 46. Map: find-ages
        26 32 42 17
        Reduce: sum
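  • 47. Reduce: sum
        // Renamed so it no longer shadows (and endlessly calls) itself;
        // sum() here is CouchDB's built-in array-summing helper.
        function sum_ages(keys, values, rereduce) {
          return sum(values);
        }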
  • 48. Map: find-ages
        26 32 42 17
        Reduce: sum
        117
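The find-ages and sum steps above can be tried outside a database. What follows is only a minimal sketch in plain JavaScript: `runView` is a hypothetical local driver standing in for the view engine, `findAges` and `sumAges` are illustrative names, and the ages are stored as numbers (the slides show them as strings, which would concatenate rather than add).

```javascript
// Toy documents mirroring the slides; numeric ages are an assumption.
const docs = [
  { _id: "a", age: 32, heads: 3 },
  { _id: "b", age: 26 },
  { _id: "c", age: 42 },
  { _id: "d", age: 17 },
  { _id: "e", name: "no age field" }, // skipped by the mapper
];

// Mapper: runs once per document and emits zero or more [key, value] rows.
function findAges(doc, emit) {
  if (typeof doc.age !== "undefined") {
    emit(doc._id, doc.age);
  }
}

// Reducer: receives the values emitted by the mappers.
function sumAges(values) {
  return values.reduce((total, v) => total + v, 0);
}

// Hypothetical local driver, not a CouchDB or Riak API.
function runView(documents, map, reduce) {
  const rows = [];
  for (const doc of documents) {
    map(doc, (key, value) => rows.push([key, value]));
  }
  return reduce(rows.map(([, value]) => value));
}

const totalAge = runView(docs, findAges, sumAges);
console.log(totalAge); // 117
```

The mapper never sees more than one document at a time, which is what lets the database run it on every node that holds part of the data.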
  • 49. Mapper: Executed on document. Reducer: Receives output from mappers.
  • 50. So What?
  • 51. The Machines They Lurn.
  • 52. The Problem
  • 53. Statistics Example
  • 54. Mean, Std. Deviation Age
  • 55. $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$
  • 56. $\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$
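Expanding the square in the deviation gives an equivalent form that needs only the count, the sum, and the sum of squares. This is presumably why the mapper on the later slides emits both the age and its square. The worked numbers use the four ages from the slides (26, 32, 42, 17).

```latex
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2
         = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - 2\mu\,\frac{1}{n}\sum_{i=1}^{n} x_i + \mu^2
         = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \mu^2

\mu = \frac{26 + 32 + 42 + 17}{4} = \frac{117}{4} = 29.25

\sigma = \sqrt{\frac{676 + 1024 + 1764 + 289}{4} - 29.25^2}
       = \sqrt{938.25 - 855.5625}
       = \sqrt{82.6875} \approx 9.09
```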
  • 57. Mapper: Executed on document. Reducer: Receives output from mappers.
  • 58. Mapper: Retrieve values, pre-process. Reducer: Receive, process further.
  • 59. { "_id": "...", "_rev": "...", "age": "32", "heads": "3" }
        { "_id": "...", "_rev": "...", "age": "26" }
        { "_id": "...", "_rev": "...", "age": "42" }
        { "_id": "...", "_rev": "...", "age": "17" }
  • 60. [ [ 26, 676 ], [ 32, 1024 ], [ 42, 1764 ], [ 17, 289 ] ]
  • 61. /**
         * Our mapper function.
         */
        map: function(doc) {
          emit(null, [doc.age, doc.age * doc.age]);
        }
        /**
         * Our reducer...
         * Note: rereduce is ignored, so the result is only correct when
         * all mapped values reach the reducer in a single pass.
         */
        reduce: function(keys, values, rereduce) {
          var N = 0;
          var summed = 0;
          var summedSquare = 0;
          for (var i = 0; i < values.length; i++) {
            N += 1;
            summed += values[i][0];
            summedSquare += values[i][1];
          }
          var mean = summed / N;
          var standard_deviation = Math.sqrt(
            (summedSquare / N) - (mean * mean)
          );
          return [mean, standard_deviation];
        }
  • 62. /**
         * Our mapper function.
         */
        map: function(doc) {
          emit(null, [doc.age, doc.age * doc.age]);
        }
        /**
         * Our reducer...
         * sum() is CouchDB's built-in helper; rereduce is still ignored.
         */
        reduce: function(keys, values, rereduce) {
          var N = values.length;
          var summed = sum(values.map(function(v) { return v[0]; }));
          var summedSquares = sum(values.map(function(v) { return v[1]; }));
          var mean = summed / N;
          var standard_deviation = Math.sqrt(
            (summedSquares / N) - (mean * mean)
          );
          return [mean, standard_deviation];
        }
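Both reducers above return the finished mean and standard deviation, so their output cannot be fed back into the same reducer when the database reduces in several passes. One possible variant, sketched here in plain JavaScript rather than as a CouchDB design document, keeps the intermediate value as a [count, sum, sum of squares] triple and derives the statistics only at the end. The names `mapAge`, `combine`, and `finalize` are illustrative and do not come from the slides.

```javascript
// Mapper output per document: [count, sum, sum of squares].
function mapAge(doc) {
  return [1, doc.age, doc.age * doc.age];
}

// Reducer: adds partial triples component-wise. Its output has the same
// shape as its input, so it can be applied again to its own results.
function combine(triples) {
  return triples.reduce(
    (acc, t) => [acc[0] + t[0], acc[1] + t[1], acc[2] + t[2]],
    [0, 0, 0]
  );
}

// Final step, done once on the fully reduced triple.
function finalize([n, sum, sumSq]) {
  const mean = sum / n;
  const std = Math.sqrt(sumSq / n - mean * mean);
  return { mean, std };
}

// Ages taken from the slides.
const ages = [{ age: 26 }, { age: 32 }, { age: 42 }, { age: 17 }];
const mapped = ages.map(mapAge);

// Reduce two shards separately, then reduce the partial results.
const partialA = combine(mapped.slice(0, 2));
const partialB = combine(mapped.slice(2));
const stats = finalize(combine([partialA, partialB]));
console.log(stats.mean, stats.std.toFixed(3)); // 29.25 9.093
```

Splitting the input differently gives the same answer, because addition of the triples is associative and commutative.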
  • 63. Naive Bayes
  • 64. Real Life Fraud
  • 65. $P(x_j = k \mid y = \text{fraudulent})$, $P(x_j = k \mid y = \text{normal})$, $P(y)$
  • 66. We need to: sum the occurrences of $x_j = k$ for each $y$ to calculate $P(x \mid y)$
  • 67. We need: More than 1 mapper.
  • 68. We need 4 mappers
  • 69. Mapper #1: $\sum 1_i\, P(x_j = k \mid y = \text{fraudulent})$
  • 70. Mapper #2: $\sum 1_i\, P(x_j = k \mid y = \text{normal})$
  • 71. Mapper #3: $\sum 1_i\, P(y = \text{fraudulent})$
  • 72. Mapper #4: $\sum 1_i\, P(y = \text{normal})$
  • 73. Reducer Sums up results for parameters
  • 76. Mapper: Divide vectors into subgroups, Calculate d(p,q) between vectors, find centroids, sum them up. Reducer: Sum up the sums, get new centroids.
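The four mappers and the summing reducer above might look roughly like this in plain JavaScript. This is a sketch under stated assumptions: the records, the single categorical feature `x`, and every function name are hypothetical, the four counters are computed in one pass per shard for brevity, and no smoothing is applied to the estimates.

```javascript
// Hypothetical toy records: one categorical feature x and a label y.
const shards = [
  [
    { x: "foreign", y: "fraudulent" },
    { x: "domestic", y: "normal" },
    { x: "domestic", y: "normal" },
  ],
  [
    { x: "foreign", y: "normal" },
    { x: "foreign", y: "fraudulent" },
    { x: "domestic", y: "fraudulent" },
  ],
];

// The four per-shard counters, one per "mapper" on the slides.
function mapShard(records, k) {
  const counts = { xkFraud: 0, xkNormal: 0, fraud: 0, normal: 0 };
  for (const r of records) {
    if (r.y === "fraudulent") {
      counts.fraud += 1;
      if (r.x === k) counts.xkFraud += 1;
    } else {
      counts.normal += 1;
      if (r.x === k) counts.xkNormal += 1;
    }
  }
  return counts;
}

// Reducer: sums the partial counts from every shard.
function reduceCounts(partials) {
  const total = { xkFraud: 0, xkNormal: 0, fraud: 0, normal: 0 };
  for (const p of partials) {
    for (const key of Object.keys(total)) total[key] += p[key];
  }
  return total;
}

const counts = reduceCounts(shards.map((s) => mapShard(s, "foreign")));
const totalRecords = counts.fraud + counts.normal;

// Parameters estimated from the summed counts.
const params = {
  pXkGivenFraud: counts.xkFraud / counts.fraud,
  pXkGivenNormal: counts.xkNormal / counts.normal,
  pFraud: counts.fraud / totalRecords,
};
console.log(counts, params);
```

Only small count objects travel from the shards to the reducer, so the amount of data moved does not grow with the number of records.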
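One iteration of the mapper and reducer described on this last slide could be sketched as below, assuming d(p, q) is the Euclidean distance and that the current centroids are handed to every mapper. The 2-D points, the starting centroids, and all function names are hypothetical.

```javascript
// Euclidean distance d(p, q) between two vectors of equal length.
function distance(p, q) {
  return Math.sqrt(p.reduce((acc, pi, i) => acc + (pi - q[i]) ** 2, 0));
}

// Mapper: for one subgroup of vectors, assign each vector to its nearest
// centroid and accumulate a per-centroid vector sum and count.
function mapSubgroup(vectors, centroids) {
  const partial = centroids.map((c) => ({ sum: c.map(() => 0), count: 0 }));
  for (const v of vectors) {
    let best = 0;
    for (let c = 1; c < centroids.length; c++) {
      if (distance(v, centroids[c]) < distance(v, centroids[best])) best = c;
    }
    partial[best].count += 1;
    v.forEach((value, i) => { partial[best].sum[i] += value; });
  }
  return partial;
}

// Reducer: sum up the sums, then divide by the counts to get new centroids.
function reduceCentroids(partials, centroids) {
  return centroids.map((old, c) => {
    const sum = old.map(() => 0);
    let count = 0;
    for (const p of partials) {
      count += p[c].count;
      p[c].sum.forEach((value, i) => { sum[i] += value; });
    }
    // An empty cluster keeps its previous centroid (a choice made here).
    return count === 0 ? old : sum.map((s) => s / count);
  });
}

// Hypothetical 2-D data split into two subgroups.
const subgroups = [
  [[1, 1], [2, 1], [9, 9]],
  [[1, 2], [8, 9], [9, 8]],
];
const initial = [[0, 0], [10, 10]];
const partials = subgroups.map((g) => mapSubgroup(g, initial));
const updated = reduceCentroids(partials, initial);
console.log(updated);
```

A full run would repeat the map and reduce steps, feeding `updated` back in as the centroids, until the centroids stop moving.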