SV Big Data Science, Cliff Click, 5/2/2013
1. Big Data for Big Questions
Cliff Click, CTO 0xdata
cliffc@0xdata.com
http://0xdata.com
http://cliffc.org/blog
2. ● Motivation: What & Why Big Math?
● Better Mousetrap
● Demo
● Fork: Deep Dive into Math Hacking ...or... K/V Store
Source: https://github.com/0xdata/h2o
9. 0xdata.com 9
42!
What was the question again?
Oh yeah, it was:
● How do I place ads based on a clickstream?
● Detect fraud in a credit-card swipe stream?
● Detect cancer from sensor data?
● Predict equipment failure ahead of time?
● Find people (un)like me?
● ... or ... or ... or... ????
10.
How do I figure it all out?
● Well... what are my tools?
● Domain Knowledge,
● (me! The Expert)
● Math & Science! Data Science, and
● Data – lots and lots and lots of it
● Old logs, new logs, databases, historical records, click-streams, CSV files, dumps
● Often TBs, sometimes PBs of it
11.
Data: The Main Player
● Data: I got lots of it
● But it's a messy mixed-up lot
● Stored in HDFS, S3, DB2 or scattered about
● Incompatible formats, older & newer bits
● Missing stuff, or "known broken" fields
● And it's Big
● Too big for my laptop, or even one server
12.
Data: Cleaning it Up
● Just the parts I want:
● SQL, Hive, HBase, grep
● Data is Big, so this is slow
● Wrong format:
● Awk, shell scripts, files, disk-to-disk
● Inspection (do I got it right yet?)
● Grep/awk, histograms, plots/prints
● Visualization tools
13.
From Facts to Knowledge
● Data cleaned up: lots of neat rows of facts
● Lots of rows: millions and billions ...
● But facts are not knowledge
● Too much to "get it" by looking
● Time for a mathematical Model!
● Here again, Big limits my tools
● Either can't deal, or deal very very slowly
14.
Modeling: math(data)
● Modeling gives a simpler view
● A way to understand
● And predict in real time
● Modeling is Math!
● Generalized Linear Modeling
– Oldest, most well known & used
● Random Forest
● K-Means Clustering
15.
Big Data vs Modeling
● Model: a concise description of my data
● A more accurate model predicts better
● Generally More Data builds a better Model
● But only if the tool can handle it
● (some datasets are not helped, but more data rarely hurts)
● Tools can't handle Big: so down-sample, and use a better (more complex) algorithm
16.
Big Data vs Better Algorithm
● Don't want to choose Big vs Better
● Down sampling loses information
● Want a way to manipulate Big Data like it's small: interactive & fast. Subtle when I need it, brute force when I don't
● Build the Better Algorithm and use Big Data
● Seeing 10x more data can yield prediction increases, e.g. from 75% to 85%
17.
Building The Better Big Data Mousetrap
● Want fast: means dram instead of disk
● Fall back to disk, if data >>> dram
● Want fast: use all cpus
● Problems are mostly data-parallel anyways
● Want ease-of-programming:
● “parallelism without effort”
● Well understood programming model
18.
Building The Better Big Data Mousetrap
● Want ease-of-use:
● python, json, REST/HTML interfaces
● Full R semantics (via the fastr project)
● Data ingest:
● where: HDFS, S3, NFS, URL, URI, browser
● what: csv, hive, rdata
19.
Building The Better Big Data Mousetrap
● Want ease-of-admin:
● e.g. java -jar h2o.jar
● auto-cluster (no config at all) or hadoop Job
● Want ease-of-upgrade: adding more servers gives
● More CPU (faster exec)
● More DRAM (larger data in dram)
● More network/disk bandwidth (faster ingest)
20.
H2O: An Engine for Big Math
● Built in layers – pick your abstraction level
● Analysts, starters: REST, browser
– "clicky clicky" load data, build model, score
● Scientists: R, JSON, python to drive engine
– Complex math
● Math hackers: building new algos
– Full (distributed) Java Memory Model
– "codes like Java, runs distributed"
● Core Engineering: call us, we're hiring
21.
Core Engineering: K/V Store
● Classic distributed Key/Value store
● get/put/atomic-transaction
● Full JMM semantics, exact consistency
● Full caching as-needed
– Cached keys "get" in 150 nanos
– Misses limited by network speed
● Hardware-like cache coherency protocol
● Distributed fork/join (thanks Doug Lea)
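The get/put/atomic-transaction semantics above can be sketched single-node with a plain `ConcurrentHashMap` standing in for the distributed store. This is an illustrative toy, not H2O's actual API (class and method names here are invented for the sketch):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

// Toy single-node stand-in for the K/V semantics described above:
// get/put plus a lock-free atomic transaction on a single key.
public class ToyKV {
    private final ConcurrentHashMap<String, byte[]> store = new ConcurrentHashMap<>();

    public byte[] get(String key)             { return store.get(key); }
    public void   put(String key, byte[] val) { store.put(key, val); }

    // Single-key atomic transaction: no locks; ConcurrentHashMap.compute
    // applies the update function atomically, so racing updaters never
    // lose each other's writes.
    public byte[] atomic(String key, UnaryOperator<byte[]> f) {
        return store.compute(key, (k, v) -> f.apply(v));
    }
}
```

The `atomic` call mirrors the deck's "no locks, use 'atomic' instead" rule: a racing loser is simply serialized behind the winner, rather than blocking on a lock.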
22.
Core Engineering: D/F/J
● Distributed fork/join (jsr 166y)
● Recursive-descent for data-parallel
● Distribution handled by the core
– Log-tree scatter/gather across cluster
● Supports map/reduce-style directly
● But also "do this on all nodes" style
● Or random graph hacking
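The recursive-descent, data-parallel shape can be shown single-node with plain `java.util.concurrent` fork/join (the jsr166y descendant the slide names). H2O does the same split across nodes first, then across CPUs within each node; this sketch only shows the within-node half, and the names are illustrative:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Recursive-descent sum: split the index range in half until chunks are
// small, run leaves in parallel, then roll results back up the tree --
// the same log-tree scatter/gather shape used across the cluster.
public class SumTask extends RecursiveTask<Double> {
    static final int LEAF = 1_000;          // leaf size, below this run serially
    final double[] data; final int lo, hi;
    SumTask(double[] d, int lo, int hi) { this.data = d; this.lo = lo; this.hi = hi; }

    @Override protected Double compute() {
        if (hi - lo <= LEAF) {              // leaf: plain for-loop
            double s = 0;
            for (int i = lo; i < hi; i++) s += data[i];
            return s;
        }
        int mid = (lo + hi) >>> 1;
        SumTask left = new SumTask(data, lo, mid);
        left.fork();                        // run left half asynchronously
        double right = new SumTask(data, mid, hi).compute();
        return left.join() + right;         // log-tree roll-up
    }

    public static double sum(double[] d) {
        return ForkJoinPool.commonPool().invoke(new SumTask(d, 0, d.length));
    }
}
```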
23.
Math Hacking
● “Tastes like (distributed) java”
(actual inner loop, auto-parallel, auto-distributed)
● Big “vector math” is easy
● The obvious for-loop "just works"
for( int i=0; i<rows; i++ ) {
  double X = ary.datad(bits,i,A);
  double Y = ary.datad(bits,i,B);
  _sumX += X;
  _sumY += Y;
  _sumX2+= X*X;
}
24.
Math Hacking
● Dense-vector algorithms are easy
● Generalized Linear Modeling: 2 weeks
● K-means: 2 days
● Histogram: 2 hours
● Random Forest: not dense vectors
● Still makes good use of D/F/J
● All-CPUs, all-nodes still light up
– Very fast tree building
25.
Science: dancing with the data
● Like the belle of the ball, the main algos
(GLM, k-means, RF) only arrive when the
data is properly dressed
● Munging data: dropping junk columns,
replacing missing bits, adding features
● H2O provides a tool-kit
● Big vector calculator: "d := a+b*c"
● dram speeds: "msec per Gbyte"
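Per element, the big-vector-calculator expression d := a+b*c is just the loop below. This is a hypothetical single-node rendering (H2O runs the same loop as a map over 1MB chunks on every node; `VecCalc` and `fma` are names invented for the sketch):

```java
// d := a + b*c, element-wise. At DRAM speeds this loop streams a vector
// at roughly "msec per Gbyte"; the distributed version chunks it across
// the cluster but the per-element work is identical.
public class VecCalc {
    public static double[] fma(double[] a, double[] b, double[] c) {
        double[] d = new double[a.length];
        for (int i = 0; i < a.length; i++)
            d[i] = a[i] + b[i] * c[i];   // the whole "calculator"
        return d;
    }
}
```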
26.
Science: APIs
● Need to script, automate repetitive tasks
● R via fastr and bigmemory package
● Full R semantics, 5x R speed single-thread
● But your vectors can be very very big...
● https://github.com/allr/fastr
● REST / URL / JSON
● Drive from e.g. python, scripts, curl, wget
– e.g. h2o testing harness is all python
27.
Demos & Quick Starts
● Full browser interface
● Tutorials
● Handful of clicks to run e.g. RF or GLM
on gigabytes of data
● Auto-cluster in seconds
● On EC2 (or your laptops right now)
● Good enough for serious work
● (and have customers using this interface!)
29.
H2O: An Engine for Big Math
● Focus on Big Math
● Easy to extend via M/R or K/V programming
● Auto-cluster
● Data-parallel exec across all CPUs
● dram caching across all servers
● Parallel ingest across all servers
● Open source: https://github.com/0xdata/h2o
30.
Math Hacking: The M/R API
● Make a 'golden object'
● Will be endlessly replicated across cluster
● Set 'input' fields:
– Auto-serialized, distributed
– Shallow-copy on nodes: e.g. arrays share state
● golden.map(key_1mb)
● map() called on clone for each 1mb
● Set 'output' fields now
31.
Math Hacking: The M/R API
● gold.reduce(gold)
● Combine pairs of 'golden' objects
● Both locally and remotely (distributed)
● Log-tree roll-up
● 'output' fields will be shipped over the wire
● null-out 'input' fields
● transient marker available
32.
Math Hacking: Example
CalcSumsTask cst = new CalcSumsTask();
cst._arykey = ary._key;   // BigData Table key
cst._colA = colA;         // integer indices to columns
cst._colB = colB;
cst.invoke(ary._key);     // Do It!
// Results returned directly in 'cst' object
...cst._sumX...           // use results

public static class CalcSumsTask extends MRTask {
  Key _arykey;               // BigData Table key
  int _colA, _colB;          // Column indices to work on
  double _sumX,_sumY,_sumX2; // Sum of X's, Y's, X^2's
33.
Math Hacking: Example
public static class CalcSumsTask extends MRTask {
  Key _arykey;               // BigData Table key
  int _colA, _colB;          // Column indices to work on
  double _sumX,_sumY,_sumX2; // Sum of X's, Y's, X^2's

  // map called for every 1Mb of data, or so
  public void map( Key key1Mb ) {
    ... boiler plate ...     // lots of unimportant details
    // Standard for-loop over the data
    for( int i=0; i<rows; i++ ) {
      double X = ary.datad(bits,i,A);
      double Y = ary.datad(bits,i,B);
      _sumX += X;
      _sumY += Y;
      _sumX2+= X*X;
    }
  }
34.
Math Hacking: Example
public static class CalcSumsTask extends MRTask {
  Key _arykey;               // BigData Table key
  int _colA, _colB;          // Column indices to work on
  double _sumX,_sumY,_sumX2; // Sum of X's, Y's, X^2's

  // reduce called between pairs of golden objects
  // always reduce right-side into 'this' object
  public void reduce( DRemoteTask rt ) {
    CalcSumsTask cst = (CalcSumsTask)rt;
    _sumX += cst._sumX ;
    _sumY += cst._sumY ;
    _sumX2+= cst._sumX2;
  }
}
35.
A Fast K/V Store
● Distributed in-memory K/V Store
● Peer-to-peer, no master
● Full JMM semantics, get/put/atomic/remove
● Hardware-style cache-coherency protocol
● Fast: 150 nanos for a cache-hitting 'get'
● Fast: 50 micros for a cache-missing 'put'
● No persistence (see above for 'fast')
● No locks: use 'atomic' instead
36.
K/V Design Goals
● JMM semantics on all get/put
● Cache-hitting 'gets' as fast as possible
● Local hashtable lookup + few tests
● 'puts' as lazy as possible (still JMM)
● Typically do not block for remote put
● Arbitrary transactions on single Keys
37.
K/V Coherency Protocol
● Many are possible
● Picked a {fast-enough,easy} one
● Faster is possible
● Every Key has 1 master node
● And everybody knows it from Key hash
● Master orders racing writes
● Winner of NBHM insert
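"Everybody knows it from the Key hash" means every node can compute a key's master locally, with zero lookup traffic. A minimal sketch of that idea (names and hash choice are illustrative; H2O's actual placement logic has more detail):

```java
import java.util.Arrays;

// Deterministic key-to-master mapping: every node runs the same pure
// function over the key bytes, so all nodes agree on the master with
// no coordination or directory lookups.
public class KeyHome {
    public static int master(byte[] key, int clusterSize) {
        int h = Arrays.hashCode(key);           // same hash on every node
        return Math.floorMod(h, clusterSize);   // always in [0, clusterSize)
    }
}
```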
38.
K/V Coherency Protocol
● Master tracks replicas
● Single CAS update
● Invalidate replicas on update
● Single CAS required, plus the invalidates
● Cache miss on replica will reload
● Interlocking get/put races solved with a finite state machine
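The master-side write path above can be sketched with a single `AtomicReference`: racing writers are ordered by one CAS on the current value, and the winner then invalidates replicas. This is an illustrative sketch, not H2O's internal API; the replica invalidation is stubbed out where the real system sends network messages:

```java
import java.util.concurrent.atomic.AtomicReference;

// Master's view of one key: a single CAS orders racing writes, then the
// winner invalidates cached replicas so their next read reloads.
public class MasterCell {
    private final AtomicReference<byte[]> val = new AtomicReference<>();

    public byte[] get() { return val.get(); }

    // Returns false if another writer won the race; the loser re-reads
    // and retries with the fresh value (no locks anywhere).
    public boolean update(byte[] expected, byte[] newer) {
        if (!val.compareAndSet(expected, newer))
            return false;
        invalidateReplicas();   // stub: network invalidates in the real system
        return true;
    }

    void invalidateReplicas() { /* tell caching nodes to drop their copies */ }
}
```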
41.
The Expert
● Domain Expert:
● What data is useful, which is trash
● What needs help to become useful
● Missing elements? Toss outliers?
● Build new features from old?
● All through this process Big Data is, well, Big, hence Slow to cp / awk / grep
● And Big limits my tools