Bayes on your (Big)Couch




                      Mike Miller
                      _milleratmit
                      July 25, 2011
I want my app to do _this_




             Mike Miller, Oscon 2011   2
CouchDB in a slide
• Schema-free document database management system
 Documents are JSON objects
 Able to store binary attachments

• RESTful API
 http://wiki.apache.org/couchdb/reference

• Views: Custom, persistent representations of your data
 Incremental MapReduce with results persisted to disk
 Fast querying by primary key (views stored in a B-tree)

• Bi-Directional Replication
 Master-slave and multi-master topologies supported
 Optional ‘filters’ to replicate a subset of the data
 Edge devices (mobile phones, sensors, etc.)
BigCouch = Couch+Scaling
• Open Source, Apache License
• Horizontal Scalability
 Easily add storage capacity by adding more servers
 Computing power (views, compaction, etc.) scales with more servers

• No single point of failure (SPOF)
 Any node can handle any request
 Individual nodes can come and go

• Transparent to the Application
 All clustering operations take place “behind the curtain”
 Looks (mostly) like a single-server instance of CouchDB


...back to making my app smart




Sample Data
 [Figure: "Height vs. Weight" scatter plot of Height [in] against Weight [lbs], with separate series for Girls and Boys]
Naive Bayes Classifier
 [Figure: Gaussian curve for one field (height), annotated with the mean male height and the male height variance]
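For reference, the Gaussian naive Bayes model that this slide illustrates is usually written as below. This is the standard textbook form rather than something taken from the slides; the symbols $c$ (class), $x_f$ (value of field $f$), $\mu_{c,f}$ and $\sigma_{c,f}^{2}$ (per-class mean and variance of field $f$) are notation introduced here.

```latex
p(x_f \mid c) = \frac{1}{\sqrt{2\pi\sigma_{c,f}^{2}}}
  \exp\!\left(-\frac{(x_f-\mu_{c,f})^{2}}{2\sigma_{c,f}^{2}}\right),
\qquad
P(c \mid x) \propto P(c)\prod_{f} p(x_f \mid c)
```

The "naive" part is the product over fields: each field is treated as independent given the class, so only a mean and a variance per class and field are needed.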
Implementation Plan
 Model people as documents in CouchDB
 Calculate Means/Variances with MapReduce
 Run classifier in the CouchDB as post-MapReduce hook (“_list”)

 [Figure: the Height vs. Weight scatter plot from the Sample Data slide]

 • Note:
  do not need to specify fields to use in classification
  multi-class implementation
  continuous, incremental training! Results improve as training data trickles in.
3 ways to follow along

 couchapp: Python tool to push/pull from other CouchDBs
 > sudo easy_install -U couchapp
 > couchapp clone 'http://millertime.cloudant.com/bitb'
 Create an account at cloudant.com
 > curl -X PUT 'http://<username>:<pwd>@<username>.cloudant.com/bitb'
 > couchapp push 'http://<username>:<pwd>@<username>.cloudant.com/bitb'

 GitHub
 > git clone git@github.com:mlmiller/bayes.git

 CouchDB replication to your Cloudant account
 Bonus: brings along the data, too!
The Code

 [Figure: the project's code, annotated with these callouts]
 Post-MapReduce hook (“_list” method)
 Classifier (probability calculator)
 Client-side test via node.js
 View code to calculate means and variances
 You can ignore everything else
Data Model

 [Figure: example document, annotated with these callouts]
 Arbitrary number of numerical fields allowed
 ‘class’ field present => the document is training data
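As a rough illustration of this data model, documents might look like the following. The field values, the class labels `boy` and `girl`, the `_id` values, and the helpers `isTrainingDoc` and `numericFields` are all assumptions made for this sketch.

```javascript
// Hypothetical training document: it carries a 'class' label plus any
// number of numerical fields (values invented for illustration).
const trainingDoc = { _id: "person-001", class: "boy", height: 70.1, weight: 182.4 };

// Hypothetical unlabeled document: no 'class' field, so it is a candidate
// for classification rather than training.
const unlabeledDoc = { _id: "person-002", height: 65.65, weight: 168.61 };

// Assumed convention from the slide: the presence of 'class' marks training data.
function isTrainingDoc(doc) {
  return Object.prototype.hasOwnProperty.call(doc, "class");
}

// Collect the names of numerical fields without hard-coding any of them.
function numericFields(doc) {
  return Object.keys(doc).filter(
    (k) => k !== "class" && !k.startsWith("_") && typeof doc[k] === "number"
  );
}

console.log(isTrainingDoc(trainingDoc), numericFields(trainingDoc));
```

Because the numerical fields are discovered at run time, adding a new measurement to the documents would not require any change to this code.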
Training via MapReduce
 views/training/map.js
 [Figure: map function source, with the callout: ‘class’ => training Data]
 Calculate mean/variance for all numerical fields in a document
 emit: ([<class>, <field>], <value>)
 Reduce: _stats (Erlang builtin)
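A map function along these lines would produce the emitted rows described above. It is a sketch, not the contents of views/training/map.js: inside CouchDB the function receives only `doc` and calls the global `emit`, whereas here `emit` is passed in as a parameter so the snippet runs on its own under node.js. The name `trainingMap` and the rule of skipping fields that start with an underscore are assumptions.

```javascript
// Sketch of the training map step. In a CouchDB design document this would be
// written as function(doc) { ... } using the global emit(); emit is injected
// here only so the example can run standalone.
function trainingMap(doc, emit) {
  if (!doc || typeof doc["class"] === "undefined") {
    return; // documents without a 'class' field are not training data
  }
  Object.keys(doc).forEach(function (field) {
    // Assumption: skip the label itself and CouchDB metadata such as _id and _rev.
    if (field === "class" || field.charAt(0) === "_") return;
    var value = doc[field];
    if (typeof value === "number" && isFinite(value)) {
      emit([doc["class"], field], value); // key: [<class>, <field>]
    }
  });
}

// Standalone usage with a stub emit that records rows.
var rows = [];
trainingMap(
  { _id: "p1", class: "girl", height: 64.2, weight: 130.5, name: "x" },
  function (key, value) { rows.push({ key: key, value: value }); }
);
console.log(JSON.stringify(rows));
```

Keying on `[<class>, <field>]` means a grouped reduce yields one set of statistics per class and field, which is the shape a naive Bayes classifier needs.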
Bayes: Trained State




 [Figure: pre-reduce output of the training view]
Bayes: Trained State




 Count, Min, Max, Mean, Variance
 Automatically updated as new training data arrives
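CouchDB's built-in `_stats` reducer returns `sum`, `count`, `min`, `max` and `sumsqr` for each key, so the mean and variance listed above are derived from those values. A minimal sketch, assuming the population form of the variance (the slides do not say which form the talk's code uses); the function name `momentsFromStats` is introduced here.

```javascript
// Derive mean and variance from a _stats value {sum, count, min, max, sumsqr}.
function momentsFromStats(stats) {
  var mean = stats.sum / stats.count;
  // E[x^2] - (E[x])^2; clamp tiny negative values caused by rounding.
  var variance = Math.max(0, stats.sumsqr / stats.count - mean * mean);
  return {
    count: stats.count,
    min: stats.min,
    max: stats.max,
    mean: mean,
    variance: variance
  };
}

// Values 1, 2, 3 give sum = 6 and sumsqr = 14.
var m = momentsFromStats({ sum: 6, count: 3, min: 1, max: 3, sumsqr: 14 });
console.log(m.mean, m.variance);
```

Since `sum`, `count` and `sumsqr` are all additive, the reduced state can absorb new documents without revisiting old ones, which is what makes the incremental update possible.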
Bayes Classifier
 lib/bayes_classifier.js
 Load state from DB
 No assumptions on field names
 Calculate prob. for all possible hypotheses
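The steps listed for the classifier can be sketched as below. This is not the code in lib/bayes_classifier.js: the shape of `model` (class name mapped to a `count` and per-field `mean` and `variance`), the function names, the use of class counts as priors, and all numbers in the usage example are assumptions.

```javascript
// Log of the Gaussian density N(x; mean, variance).
function logGaussian(x, mean, variance) {
  return -0.5 * Math.log(2 * Math.PI * variance) -
    ((x - mean) * (x - mean)) / (2 * variance);
}

// model: { <class>: { count: n, fields: { <field>: { mean, variance } } } }
// features: { <field>: number }. Field names are taken from the input,
// so nothing about 'height' or 'weight' is hard-coded.
function classify(model, features) {
  var classes = Object.keys(model);
  var total = classes.reduce(function (s, c) { return s + model[c].count; }, 0);
  var logScores = {};
  classes.forEach(function (c) {
    var score = Math.log(model[c].count / total); // prior from class frequency
    Object.keys(features).forEach(function (f) {
      var st = model[c].fields[f];
      if (st && st.variance > 0) {
        score += logGaussian(features[f], st.mean, st.variance);
      }
    });
    logScores[c] = score;
  });
  // Normalise with the log-sum-exp trick to get posterior probabilities.
  var maxLog = Math.max.apply(null, classes.map(function (c) { return logScores[c]; }));
  var norm = classes.reduce(function (s, c) {
    return s + Math.exp(logScores[c] - maxLog);
  }, 0);
  var probs = {};
  classes.forEach(function (c) {
    probs[c] = Math.exp(logScores[c] - maxLog) / norm;
  });
  return probs;
}

// Invented numbers, for illustration only.
var model = {
  boy: { count: 500, fields: { height: { mean: 69, variance: 9 }, weight: { mean: 180, variance: 400 } } },
  girl: { count: 500, fields: { height: { mean: 64, variance: 9 }, weight: { mean: 135, variance: 400 } } }
};
var probs = classify(model, { height: 70, weight: 185 });
console.log(probs);
```

Working in log space avoids underflow when many fields are multiplied together, and returning a probability for every class keeps the sketch multi-class.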
A brief aside...

 • Let's test our classifier
  Select 2000 documents for the test
  Randomly choose 1000 documents as the training sample
  Use the remaining 1000 documents for validation

 • Simulate continuous training
  Add training documents one at a time
  After each addition, test on all 1000 documents in the validation sample
  Record and plot the fraction of the validation sample that is correctly classified
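The procedure above could be scripted roughly as follows. Everything here is a stand-in: `shuffled`, `learningCurve`, `train` and `classifyOne` are hypothetical names, and the toy nearest-mean model in the usage example exists only to make the sketch runnable; it is not the talk's classifier.

```javascript
// Shuffle a copy of docs with Fisher-Yates, using an injectable random source.
function shuffled(docs, random) {
  var a = docs.slice();
  for (var i = a.length - 1; i > 0; i--) {
    var j = Math.floor(random() * (i + 1));
    var t = a[i]; a[i] = a[j]; a[j] = t;
  }
  return a;
}

// Add training docs one at a time; after each addition, record the fraction
// of the validation sample that is classified correctly.
// train(trainingDocs) -> model, classifyOne(model, doc) -> predicted class.
function learningCurve(docs, nTrain, train, classifyOne, random) {
  var mixed = shuffled(docs, random || Math.random);
  var training = mixed.slice(0, nTrain);
  var validation = mixed.slice(nTrain);
  var curve = [];
  for (var n = 1; n <= training.length; n++) {
    var model = train(training.slice(0, n));
    var correct = validation.filter(function (d) {
      return classifyOne(model, d) === d["class"];
    }).length;
    curve.push({ trainingSize: n, fractionCorrect: correct / validation.length });
  }
  return curve;
}

// Toy stand-ins: the model is the mean height per class; predict the nearest mean.
function train(ds) {
  var sums = {};
  ds.forEach(function (d) {
    var s = sums[d["class"]] || (sums[d["class"]] = { sum: 0, n: 0 });
    s.sum += d.height;
    s.n += 1;
  });
  return sums;
}
function classifyOne(model, d) {
  var best = null, bestDist = Infinity;
  Object.keys(model).forEach(function (c) {
    var dist = Math.abs(d.height - model[c].sum / model[c].n);
    if (dist < bestDist) { bestDist = dist; best = c; }
  });
  return best;
}

var docs = [
  { class: "boy", height: 70 }, { class: "girl", height: 63 },
  { class: "boy", height: 71 }, { class: "girl", height: 64 }
];
console.log(learningCurve(docs, 2, train, classifyOne));
```

This sketch retrains from scratch at every step for simplicity; with CouchDB the view is updated incrementally as each document arrives, so no retraining pass is needed.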
A brief aside...


 [Figure: classification performance plotted against the number of documents in the training set]
 Dramatic improvement with additional training data
... and back to the code




test it yourself
• Client-side test via node.js
 > ./test.js height=<some number> weight=<some number>
 Classifier runs server side; the server is configured on line 6 of test.js
 (you can point this to your own DB)
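A client-side script in the spirit of test.js might turn its `key=value` arguments into a request against the server-side classifier. The sketch below only builds the URL and makes no network call; the function names and the argument parsing are assumptions, not the contents of test.js, and the base URL is the public `_list` endpoint quoted in the talk.

```javascript
// Parse arguments such as ["height=65.65", "weight=168.61"] into an object.
function parseArgs(argv) {
  var features = {};
  argv.forEach(function (arg) {
    var parts = arg.split("=");
    if (parts.length === 2 && parts[1] !== "" && !isNaN(Number(parts[1]))) {
      features[parts[0]] = Number(parts[1]);
    }
  });
  return features;
}

// Build the _list request URL for a set of numerical features.
function buildClassifyUrl(baseUrl, features) {
  var url = new URL(baseUrl);
  Object.keys(features).forEach(function (name) {
    url.searchParams.set(name, String(features[name]));
  });
  url.searchParams.set("format", "json");
  url.searchParams.set("group", "true");
  return url.toString();
}

var base = "http://millertime.cloudant.com/bitb/_design/bayes/_list/index/training";
var features = parseArgs(process.argv.slice(2));
console.log(buildClassifyUrl(base, features));
```

Pointing the script at a different database would only mean changing `base`.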
Running as CouchApp



 Create a database (e.g., ‘bitb’) at cloudant.com
 Add data
 Then push your code
 > couchapp push 'http://<user>:<pwd>@<user>.cloudant.com/bitb'
 HTML & CSS served directly from BigCouch to the browser
 Heavy lifting of classification done server side

 http://millertime.cloudant.com/bitb/_design/bayes/index.html
Running as API (_list)
 > curl 'http://millertime.cloudant.com/bitb/_design/bayes/_list/index/training?height=65.65&weight=168.61&format=json&group=true'
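On the server, a `_list` function receives the grouped view rows together with the query parameters. The sketch below shows the general shape only: in CouchDB the signature is `function(head, req)`, and `getRow()` and `send()` are globals supplied by the query server, whereas here they are passed in so the snippet runs standalone. The function name, the response format and the stubbed rows are assumptions, and the classification step itself is left out.

```javascript
// Sketch of a list function body. With group=true, each row has
// key [<class>, <field>] and a _stats value {sum, count, min, max, sumsqr}.
function listTrainingState(head, req, getRow, send) {
  var state = {};
  var row;
  while ((row = getRow())) {
    var cls = row.key[0];
    var field = row.key[1];
    state[cls] = state[cls] || {};
    state[cls][field] = row.value;
  }
  // Numerical query parameters (e.g. height, weight) are the features to classify.
  var features = {};
  Object.keys(req.query).forEach(function (name) {
    var v = Number(req.query[name]);
    if (name !== "format" && name !== "group" && !isNaN(v)) {
      features[name] = v;
    }
  });
  send(JSON.stringify({ features: features, state: state }));
}

// Standalone usage with stubbed rows and a stubbed send().
var pending = [
  { key: ["boy", "height"], value: { sum: 140, count: 2, min: 69, max: 71, sumsqr: 9802 } },
  { key: ["girl", "height"], value: { sum: 127, count: 2, min: 63, max: 64, sumsqr: 8065 } }
];
var output = "";
listTrainingState(
  null,
  { query: { height: "65.65", format: "json", group: "true" } },
  function () { return pending.shift(); },
  function (chunk) { output += chunk; }
);
console.log(output);
```

Running the classifier at this point means the client sends only the feature values and gets back a result, without ever downloading the trained state.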
Wrapping Up: Bayes on BigCouch
• Simple code, powerful results
 Light requirements on the data model
  can be relaxed with more complex view code
 Continuous learning is very powerful
  e.g., time-based learning (automatically adapt to changing conditions)
 Classification can be performed client- or server-side
  push documents into the DB and they are auto-tagged!
 More sophisticated classifiers are easily implemented
  e.g., Cloudant Search pre-calculates and exposes TF-IDF scores for textual classification, weighted classifiers, etc.
 View Engine allows simple, massively parallel deployment of sophisticated domain libraries
  e.g., Lucene, R, SciPy, NumPy, Matlab/Octave, etc.
Give it a spin




 Hosting, Management, Support for CouchDB and BigCouch
                  http://cloudant.com
        http://github.com/cloudant/bigcouch