The scheduled Redis speaker was sick, so I whipped this talk up in about an hour and filled in on a different subject. It's a bit crude, but you get a big-picture view of how to build a simple AI application using BigCouch. The accompanying video is up at http://www.youtube.com/watch?v=QEBDNxbSRuk
1. Bayes on your (Big)Couch
Mike Miller
_milleratmit
July 25, 2011
2. I want my app to do _this_
Mike Miller, Oscon 2011 2
3. CouchDB in a slide
• Schema-free document database management system
Documents are JSON objects
Able to store binary attachments
• RESTful API
http://wiki.apache.org/couchdb/reference
• Views: Custom, persistent representations of your data
Incremental MapReduce with results persisted to disk
Fast querying by primary key (views stored in a B-tree)
• Bi-Directional Replication
Master-slave and multi-master topologies supported
Optional ‘filters’ to replicate a subset of the data
Edge devices (mobile phones, sensors, etc.)
4. BigCouch = Couch+Scaling
• Open Source, Apache License
• Horizontal Scalability
Easily add storage capacity by adding more servers
Computing power (views, compaction, etc.) scales with more servers
• No SPOF
Any node can handle any request
Individual nodes can come and go
• Transparent to the Application
All clustering operations take place “behind the curtain”
Looks (mostly) like a single-server instance of CouchDB
6. Sample Data
[Figure: Height vs. Weight scatter plot. Height [in] (35 to 80) against Weight [lbs] (80 to 220), with separate markers for Girls and Boys]
7. Naive Bayes Classifier
[Figure: Gaussian (“gaus”) probability density plotted from -3 to 3, annotated with its inputs: male height, mean male height, and male variance]
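The Gaussian on this slide can be sketched in a few lines of JavaScript. This is a minimal illustration, not code from the talk: the function name `gaussian` and the numbers for the mean and variance are assumptions chosen only to show the shape of the calculation.

```javascript
// Gaussian probability density (the "gaus" label on the slide).
// x: observed value (e.g. a height); mean and variance: per-class statistics.
function gaussian(x, mean, variance) {
  const norm = 1 / Math.sqrt(2 * Math.PI * variance);
  return norm * Math.exp(-((x - mean) ** 2) / (2 * variance));
}

// Hypothetical numbers for illustration only (not from the talk's data set).
const meanMaleHeight = 69; // inches
const maleVariance = 9;    // inches squared
const likelihood = gaussian(72, meanMaleHeight, maleVariance);
console.log('likelihood of height 72 given male:', likelihood);
```

A naive Bayes classifier multiplies one such likelihood per field, assuming the fields are independent within each class.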
8. Implementation Plan
• Model people as documents in CouchDB
• Calculate means/variances with MapReduce
• Run classifier in CouchDB as a post-MapReduce hook (“_list”)
[Figure: Height [in] vs. Weight [lbs] scatter plot for Girls and Boys]
• Note:
do not need to specify fields to use in classification
multi-class implementation
continuous, incremental training! Results improve as training data trickles in.
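The “_list” step in the plan above can be sketched as a CouchDB list function. Inside CouchDB, `getRow()` and `send()` are supplied by the query server and `req.query` carries the URL parameters; below they are stubbed so the sketch runs under node. The row values, the class names `boy` and `girl`, and the response layout are assumptions for illustration, not the talk's actual code.

```javascript
// Stand-ins for what the CouchDB query server provides to a list function.
const fakeRows = [
  { key: ['boy', 'height'], value: { sum: 138, count: 2, min: 68, max: 70, sumsqr: 9524 } },
  { key: ['girl', 'height'], value: { sum: 128, count: 2, min: 63, max: 65, sumsqr: 8194 } },
];
let cursor = 0;
let output = '';
function getRow() { return cursor < fakeRows.length ? fakeRows[cursor++] : null; }
function send(chunk) { output += chunk; }

// Sketch of the list function: fold the grouped view rows into per-class
// state, then respond. In a design document this would be stored as an
// anonymous function(head, req) { ... }.
function list(head, req) {
  const state = {};
  let row;
  while ((row = getRow())) {
    const cls = row.key[0];
    const field = row.key[1];
    state[cls] = state[cls] || {};
    state[cls][field] = row.value;
  }
  // A real classifier would score req.query against state here.
  send(JSON.stringify({ query: req.query, classes: Object.keys(state) }));
}

list({}, { query: { height: '65.65', weight: '168.61' } });
console.log(output);
```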
9. 3 ways to follow along
couchapp: Python tool to push/pull from other CouchDB instances
> sudo easy_install -U couchapp
> couchapp clone 'http://millertime.cloudant.com/bitb'
create an account at cloudant.com
> curl -X PUT 'http://<username>:<pwd>@<username>.cloudant.com/bitb'
> couchapp push 'http://<username>:<pwd>@<username>.cloudant.com/bitb'
github
> git clone git@github.com:mlmiller/bayes.git
CouchDB replication to your Cloudant account
bonus: brings along the data, too!
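The slide gives no command for the replication route, so here is a hedged sketch of the request it would involve. CouchDB exposes a `_replicate` endpoint that accepts a JSON body with `source` and `target`; the helper name `buildReplicationRequest`, the example credentials, and the choice to pass credentials in the URL (mirroring the curl line above) are assumptions. The sketch only builds the request and does not send it.

```javascript
// Build (but do not send) a POST /_replicate request that copies the
// talk's public database into your own account.
function buildReplicationRequest(username, password, dbName) {
  const account =
    `http://${encodeURIComponent(username)}:${encodeURIComponent(password)}` +
    `@${username}.cloudant.com`;
  return {
    method: 'POST',
    url: `${account}/_replicate`,
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      source: 'http://millertime.cloudant.com/bitb',
      target: `${account}/${dbName}`,
      create_target: true, // ask CouchDB to create the target if it is missing
    }),
  };
}

// 'alice' and 'secret' are placeholder credentials.
const request = buildReplicationRequest('alice', 'secret', 'bitb');
console.log(request.url);
console.log(request.body);
```

The resulting object can be handed to any HTTP client, for example curl with `-X POST -H` and `-d`.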
10. The Code
[Figure: the code, annotated as follows]
post-MapReduce hook (“_list” method)
classifier (probability calculator)
client-side test via node.js
view code to calculate means and variances
you can ignore everything else
11. Data Model
Arbitrary number of numerical fields allowed
A ‘class’ field marks a document as training data
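As a concrete picture of this data model, two documents might look like the following. The field names `height` and `weight` come from the talk's examples; the class value `boy` and the helper names are assumptions for illustration.

```javascript
// A training document: numerical fields plus a 'class' label.
const trainingDoc = { class: 'boy', height: 70.1, weight: 172.4 };

// A document to classify: same kind of fields, no 'class'.
const unlabeledDoc = { height: 65.65, weight: 168.61 };

// Hypothetical helpers showing how the two kinds are told apart.
function isTrainingDoc(doc) {
  return typeof doc['class'] === 'string';
}
function numericFields(doc) {
  return Object.keys(doc).filter((name) => typeof doc[name] === 'number');
}

console.log(isTrainingDoc(trainingDoc), numericFields(trainingDoc));
console.log(isTrainingDoc(unlabeledDoc), numericFields(unlabeledDoc));
```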
12. Training via MapReduce
Documents with a ‘class’ field are training data
views/training/map.js
Calculate mean/variance for all numerical fields in a document
emit: ([<class>, <field>], <value>)
Reduce: _stats (Erlang builtin)
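A map function matching the emit signature above might look like the sketch below. This is not the contents of views/training/map.js, which the slides do not show. The `emit` stub and `statsByKey` are local stand-ins so the sketch runs under node; `statsByKey` only imitates the shape of what the built-in `_stats` reduce returns with `group=true` (sum, count, min, max, sumsqr).

```javascript
// Local stand-in for CouchDB's emit(); collects rows in memory.
const rows = [];
function emit(key, value) { rows.push({ key, value }); }

// Sketch of the map step: one row per numerical field of a training doc.
function map(doc) {
  if (!doc['class']) return; // only training documents carry a class
  for (const field in doc) {
    if (field !== 'class' && typeof doc[field] === 'number') {
      emit([doc['class'], field], doc[field]);
    }
  }
}

// Rough imitation of the _stats reduce, grouped by key.
function statsByKey(emitted) {
  const out = {};
  for (const { key, value } of emitted) {
    const k = JSON.stringify(key);
    const s = out[k] || (out[k] = { sum: 0, count: 0, min: value, max: value, sumsqr: 0 });
    s.sum += value;
    s.count += 1;
    s.min = Math.min(s.min, value);
    s.max = Math.max(s.max, value);
    s.sumsqr += value * value;
  }
  return out;
}

// Hypothetical documents; the last one has no class and is skipped.
[
  { class: 'boy', height: 70, weight: 170 },
  { class: 'boy', height: 68, weight: 150 },
  { height: 65, weight: 120 },
].forEach(map);
const trained = statsByKey(rows);
console.log(trained);
```

Because string fields such as `_id` and `_rev` fail the `typeof` check, no field names need to be hard-coded.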
14. Bayes: Trained State
Count, Min, Max, Mean, Variance
Automatically updated as new training data arrives
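CouchDB's built-in `_stats` reduce reports sum, count, min, max, and sumsqr, so the mean and variance listed here have to be derived from those. A minimal sketch, assuming the population variance is what is wanted (the helper name `meanAndVariance` and the sample numbers are hypothetical):

```javascript
// Derive mean and (population) variance from a _stats-shaped object.
function meanAndVariance(stats) {
  const mean = stats.sum / stats.count;
  const variance = stats.sumsqr / stats.count - mean * mean;
  return { count: stats.count, min: stats.min, max: stats.max, mean, variance };
}

// Hypothetical stats for two heights, 68 and 70 inches.
const summary = meanAndVariance({ sum: 138, count: 2, min: 68, max: 70, sumsqr: 9524 });
console.log(summary);
```

Since sum, count, and sumsqr are all simple running totals, the trained state stays current as each new training document is indexed.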
15. Bayes Classifier
lib/bayes_classifier.js
Load state from DB
No assumptions on field names
Calculate prob. for all possible hypotheses
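The probability calculation can be sketched as below. This is not the contents of lib/bayes_classifier.js, which the slides do not show: the state layout (`count` plus per-field `mean` and `variance`), the class names, and all numbers are assumptions. Scores are accumulated in log space to avoid underflow, then normalized so the probabilities over all hypotheses sum to 1.

```javascript
// Log of the Gaussian density, numerically safer than multiplying densities.
function gaussianLogPdf(x, mean, variance) {
  return -0.5 * Math.log(2 * Math.PI * variance) - ((x - mean) ** 2) / (2 * variance);
}

// Return P(class | observation) for every class in the trained state.
// Field names are read from the observation, never hard-coded.
function classify(state, observation) {
  const classes = Object.keys(state);
  const total = classes.reduce((n, cls) => n + state[cls].count, 0);
  const logScores = classes.map((cls) => {
    let score = Math.log(state[cls].count / total); // class prior
    for (const field of Object.keys(observation)) {
      const stats = state[cls].fields[field];
      if (stats) score += gaussianLogPdf(observation[field], stats.mean, stats.variance);
    }
    return score;
  });
  const maxScore = Math.max(...logScores);
  const weights = logScores.map((s) => Math.exp(s - maxScore));
  const norm = weights.reduce((a, b) => a + b, 0);
  const probabilities = {};
  classes.forEach((cls, i) => { probabilities[cls] = weights[i] / norm; });
  return probabilities;
}

// Hypothetical trained state, for illustration only.
const state = {
  boy: { count: 100, fields: { height: { mean: 69, variance: 9 }, weight: { mean: 170, variance: 400 } } },
  girl: { count: 100, fields: { height: { mean: 64, variance: 9 }, weight: { mean: 130, variance: 400 } } },
};
const result = classify(state, { height: 70, weight: 175 });
console.log(result);
```

Adding a third class to `state` requires no code change, which is what makes the approach multi-class.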
16. A brief aside...
• Let's test our classifier
Select 2000 documents for test
Randomly choose 1000 documents for training sample
Remaining documents used for validation
• Simulate continuous training
Add documents one at a time
After each document addition, test on all 1000 documents in our validation sample
Record and plot fraction of validation sample properly classified
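The procedure above can be imitated offline with a short simulation. Everything about the data here is synthetic: the seeded generator, the class means and spreads, and the class names are assumptions, so the numbers it prints say nothing about the talk's actual results. Because the documents are generated independently at random, taking the first 1000 as the training sample stands in for a random choice.

```javascript
// Small seeded generator (mulberry32-style) so runs are repeatable.
function seededRandom(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}
// Box-Muller transform: one normally distributed sample.
function normal(rand, mean, sd) {
  const u = 1 - rand(); // in (0, 1], keeps Math.log finite
  return mean + sd * Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * rand());
}

const FIELDS = ['height', 'weight'];
// Incremental training: update running totals for the document's class.
function train(state, doc) {
  const cls = state[doc.class] || (state[doc.class] = { count: 0, fields: {} });
  cls.count += 1;
  for (const f of FIELDS) {
    const s = cls.fields[f] || (cls.fields[f] = { sum: 0, sumsqr: 0 });
    s.sum += doc[f];
    s.sumsqr += doc[f] * doc[f];
  }
}
// Pick the class with the highest Gaussian naive Bayes log score.
function predict(state, doc) {
  let best = null;
  let bestScore = -Infinity;
  for (const name of Object.keys(state)) {
    const cls = state[name];
    if (cls.count < 2) continue; // need two points for a variance
    let score = Math.log(cls.count);
    for (const f of FIELDS) {
      const mean = cls.fields[f].sum / cls.count;
      const variance = Math.max(cls.fields[f].sumsqr / cls.count - mean * mean, 1e-9);
      score += -0.5 * Math.log(2 * Math.PI * variance) - ((doc[f] - mean) ** 2) / (2 * variance);
    }
    if (score > bestScore) { bestScore = score; best = name; }
  }
  return best;
}

// Synthetic population (assumed parameters, not the talk's data).
const rand = seededRandom(2011);
const docs = [];
for (let i = 0; i < 2000; i++) {
  const isBoy = rand() < 0.5;
  docs.push({
    class: isBoy ? 'boy' : 'girl',
    height: normal(rand, isBoy ? 69 : 64, 3),
    weight: normal(rand, isBoy ? 170 : 130, 20),
  });
}
const training = docs.slice(0, 1000);
const validation = docs.slice(1000);

// Add training documents one at a time; score the full validation sample each time.
const simState = {};
const curve = [];
for (const doc of training) {
  train(simState, doc);
  const correct = validation.filter((v) => predict(simState, v) === v.class).length;
  curve.push(correct / validation.length);
}
console.log('after 10 docs:', curve[9], 'after 1000 docs:', curve[999]);
```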
17. A brief aside...
Dramatic improvement with additional training data
[Figure: classifier performance plotted against the number of documents in the training set]
18. ... and back to the code
19. test it yourself
• Client side test via node.js
> ./test.js height=<some number> weight=<some number>
Classifier runs server side; the server is configured in line 6 of test.js
You can point this to your own DB
20. Running as CouchApp
create a database (e.g., ‘bitb’) at cloudant.com
add data
then push your code
> couchapp push 'http://<user>:<pwd>@<user>.cloudant.com/bitb'
HTML & CSS served directly from BigCouch to the browser
Heavy lifting of classification done server side
http://millertime.cloudant.com/bitb/_design/bayes/index.html
21. Running as API (_list)
> curl 'http://millertime.cloudant.com/bitb/_design/bayes/_list/index/training?height=65.65&weight=168.61&format=json&group=true'
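The same request URL can be assembled programmatically, which helps when the fields vary. A minimal sketch using the standard `URL` class; the helper name `classifyUrl` is an assumption, while the path and the `format` and `group` parameters are copied from the curl command above.

```javascript
// Build the _list URL for a set of numerical fields.
function classifyUrl(base, fields) {
  const url = new URL(`${base}/_design/bayes/_list/index/training`);
  for (const [name, value] of Object.entries(fields)) {
    url.searchParams.set(name, String(value));
  }
  url.searchParams.set('format', 'json');
  url.searchParams.set('group', 'true');
  return url.toString();
}

const requestUrl = classifyUrl('http://millertime.cloudant.com/bitb', {
  height: 65.65,
  weight: 168.61,
});
console.log(requestUrl);
```

Swapping the base for your own `<username>.cloudant.com/bitb` database points the same call at your copy.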
22. Wrapping Up: Bayes on BigCouch
• Simple code, powerful results
light requirements on data model
can be relaxed with more complex view code
• Continuous learning is very powerful
e.g., time-based learning (automatically adapt to changing conditions)
• Classification can be performed client- or server-side
push documents into DB and they are auto-tagged!
• More sophisticated classifiers easily implemented
e.g., Cloudant Search pre-calculates and exposes TF-IDF scores for textual classification, weighted classifiers, etc.
• View Engine allows simple deployment of sophisticated domain libraries in massively parallel fashion
e.g., Lucene, R, SciPy, NumPy, Matlab/Octave, etc.
23. Give it a spin
Hosting, Management, Support for CouchDB and BigCouch
http://cloudant.com
http://github.com/cloudant/bigcouch