0xdata.com
H2O – A Platform for Big Math
● In-memory distributed & parallel vector math
● Pure Java, runs in cloud, server, laptop
● Open source: http://0xdata.github.com/h2o
● java -jar h2o.jar -name meetup
● Will auto-cluster in this room
● Best with default GC, largest heap
● Inner loops: near-FORTRAN speed with Java ease
● for( int i=0; i<N; i++ )
    ...do_something... // auto-distributed & parallel
GLM & Logistic Regression
● Vector Math (for non-math majors):
● At the core, we compute a Gram Matrix
● i.e., we touch all the data
● Logistic Regression – solve with Iterative RLS
● Iterative: multiple passes, multiple Grams
ηₖ = X·ßₖ
μₖ = link⁻¹(ηₖ)
z = ηₖ + (y − μₖ)·link'(μₖ)
ßₖ₊₁ = (Xᵀ·w·X)⁻¹·(Xᵀ·z)
Inverse solved with Cholesky Decomposition
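The update above can be sketched as a minimal single-node pass, assuming the logistic link; class and method names are illustrative, not H2O's API, and the solve below uses plain Gaussian elimination where H2O uses a Cholesky decomposition. Note the textbook IRLS form also weights z by w on the right-hand side, which is what this sketch does.

```java
import java.util.Arrays;

// Minimal single-node sketch of one IRLS step for logistic regression.
// Illustrative only: the real GLM distributes the Gram and also handles
// regularization, centering, sampling, etc.
public class IrlsSketch {
  // Solve A·x = b by Gaussian elimination with partial pivoting
  // (stand-in for the Cholesky solve of the SPD Gram matrix).
  static double[] solve(double[][] A, double[] b) {
    int n = b.length;
    double[][] M = new double[n][];
    for (int i = 0; i < n; i++) M[i] = Arrays.copyOf(A[i], n);
    double[] x = Arrays.copyOf(b, n);
    for (int c = 0; c < n; c++) {
      int p = c;                       // pick the largest pivot in column c
      for (int r = c + 1; r < n; r++)
        if (Math.abs(M[r][c]) > Math.abs(M[p][c])) p = r;
      double[] t = M[c]; M[c] = M[p]; M[p] = t;
      double tb = x[c]; x[c] = x[p]; x[p] = tb;
      for (int r = c + 1; r < n; r++) {
        double f = M[r][c] / M[c][c];
        for (int k = c; k < n; k++) M[r][k] -= f * M[c][k];
        x[r] -= f * x[c];
      }
    }
    for (int r = n - 1; r >= 0; r--) { // back substitution
      for (int k = r + 1; k < n; k++) x[r] -= M[r][k] * x[k];
      x[r] /= M[r][r];
    }
    return x;
  }

  // One IRLS step: eta = X·beta, mu = link⁻¹(eta), working response z,
  // then beta' = (Xᵀ·w·X)⁻¹·(Xᵀ·w·z).
  static double[] irlsStep(double[][] X, double[] y, double[] beta) {
    int n = X.length, p = beta.length;
    double[][] xtwx = new double[p][p];
    double[] xtwz = new double[p];
    for (int r = 0; r < n; r++) {
      double eta = 0;
      for (int j = 0; j < p; j++) eta += X[r][j] * beta[j];
      double mu = 1.0 / (1.0 + Math.exp(-eta)); // inverse logit link
      double w = mu * (1 - mu);                 // link'(mu) = 1/w
      double z = eta + (y[r] - mu) / w;         // working response
      for (int i = 0; i < p; i++) {
        for (int j = 0; j < p; j++) xtwx[i][j] += w * X[r][i] * X[r][j];
        xtwz[i] += w * X[r][i] * z;
      }
    }
    return solve(xtwx, xtwz);
  }
}
```

A few such steps usually suffice; the slide's 5-50 iteration count covers the hard cases.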
GLM Running Time
● n – number of rows or observations
● p – number of features
● Gram Matrix: O(n·p²) / #cpus
● n can be billions; constant is really small
● Data is distributed across machines
● Cholesky Decomp: O(p³)
● Real limit: memory is O(p²), on a single node
● Times a small number of iterations (5-50)
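The bounds above can be checked back-of-the-envelope, assuming 8-byte doubles (class and method names are illustrative):

```java
// Rough cost model: Gram work is O(n·p²) multiply-adds, while the Gram
// itself needs only O(p²) doubles of memory on a single node.
public class CostSketch {
  static double gramFlops(long n, long p) { return (double) n * p * p; }
  static double gramBytes(long p)         { return 8.0 * p * p; } // 8-byte doubles
}
```

At a billion rows and 1,000 features, that is ~10¹⁵ multiply-adds of work but only ~8 MB of Gram per node, which is why n can be huge while p is the real limit.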
Gram Matrix
● Requires computing Xᵀ·X
● A single observation: double x[], y;
for( int i=0; i<P; i++ ) {
  for( int j=0; j<=i; j++ )
    _xx[i][j] += x[i]*x[j];
  _xy[i] += y*x[i];
}
_yy += y*y;
● Computed per-row
● Millions to billions of rows
● Parallelize / distribute per-row
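The fragment above, filled out into a compilable form (hypothetical class name; the real loop also handles row/column exclusions and normalization). Only the lower triangle of _xx is updated, since the Gram is symmetric:

```java
// Per-row Gram accumulation: for each observation (x, y), update the lower
// triangle of _xx = Xᵀ·X, plus _xy = Xᵀ·y and _yy = yᵀ·y.
public class GramRow {
  final int P;
  final double[][] _xx;   // lower triangle of XᵀX
  final double[] _xy;
  double _yy;
  GramRow(int p) { P = p; _xx = new double[p][p]; _xy = new double[p]; }
  void addRow(double[] x, double y) {
    for (int i = 0; i < P; i++) {
      for (int j = 0; j <= i; j++)
        _xx[i][j] += x[i] * x[j];
      _xy[i] += y * x[i];
    }
    _yy += y * y;
  }
}
```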
Distributed Vector Coding
● Map-Reduce Style
● Start with a Plain Olde Java Object
● Private clone per-Map
● Shallow-copy within a JVM; deep-copy across JVMs
● Map a “chunk” of data into private clone
● "chunk" == all the rows that fit in 4Meg
● Reduce: combine pairs of cloned objects
Plain Old Java Object
● Using the POJO:
Gram G = new Gram();
G.invoke(A); // Compute the Gram of A
...G._xx[][]... // Use the Gram for more math
● Defining the POJO:
class Gram extends MRTask {
  Key _data;                  // Input variable(s)
  double _xx[][], _xy[], _yy; // Output variables
  void map( Key chunk ) { … }
  void reduce( Gram other ) { … }
}
Gram.map
● Define the map:
void map( Key chunk ) {
  // Pull in 4M chunk of data
  ...boiler plate...
  for( int r=0; r<rows; r++ ) {
    double y, x[] = decompress(r);
    for( int i=0; i<P; i++ ) {
      for( int j=0; j<=i; j++ )
        _xx[i][j] += x[i]*x[j];
      _xy[i] += y*x[i];
    }
    _yy += y*y;
  }
}
Gram.reduce
● Define the reduce:
// Fold 'other' into 'this'
void reduce( Gram other ) {
  for( int i=0; i<P; i++ ) {
    for( int j=0; j<=i; j++ )
      _xx[i][j] += other._xx[i][j];
    _xy[i] += other._xy[i];
  }
  _yy += other._yy;
}
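In a single JVM the whole pattern can be exercised end-to-end: map each chunk into its own clone, fold the clones with reduce, and the result matches one pass over all the rows. A sketch under assumed names (not the real MRTask plumbing):

```java
// Single-JVM sketch of the map/reduce pattern: each "chunk" of rows gets
// its own Gram object, then reduce folds the clones pairwise.  The folded
// result is identical to one pass over all rows, which is what lets the
// framework run map(chunk) wherever the chunk lives and reduce anywhere.
public class GramMR {
  final int P;
  final double[][] _xx;
  final double[] _xy;
  double _yy;
  GramMR(int p) { P = p; _xx = new double[p][p]; _xy = new double[p]; }
  void map(double[][] xs, double[] ys) {       // one chunk of rows
    for (int r = 0; r < xs.length; r++) {
      double[] x = xs[r]; double y = ys[r];
      for (int i = 0; i < P; i++) {
        for (int j = 0; j <= i; j++) _xx[i][j] += x[i] * x[j];
        _xy[i] += y * x[i];
      }
      _yy += y * y;
    }
  }
  void reduce(GramMR other) {                  // fold 'other' into 'this'
    for (int i = 0; i < P; i++) {
      for (int j = 0; j <= i; j++) _xx[i][j] += other._xx[i][j];
      _xy[i] += other._xy[i];
    }
    _yy += other._yy;
  }
}
```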
Distributed Vector Coding 2
● Gram Matrix computed in parallel & distributed
● Excellent CPU & load-balancing
● About 1sec per Gig for 32 medium EC2 instances
● The whole Logistic Regression, about 10sec/Gig
– Varies by #features (e.g., a billion rows, 1,000 features)
● Distribution & Parallelization handled by H2O
● Data is pre-split by rows during parse/ingest
● map(chunk) is run where chunk is local
● reduce runs both local & distributed
– Gram object auto-serialized, auto-cloned
Other Inner-Loop Considerations
● Real inner loop has more cruft
● Some columns excluded by user
● Some rows excluded by sampling, or missing data
● Data is normalized & centered
● Categorical column expansion
– Math is straightforward, but needs another indirection
● Iterative Reweighted Least Squares
– Adds weight to each row
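The categorical expansion can be pictured as a one-hot encoding with the first level as the reference; the helper below is hypothetical, only to show the extra indirection:

```java
// Expand one categorical value (an integer level in [0, nLevels)) into
// indicator columns appended after the numeric columns.  Level 0 is the
// reference level and maps to all-zero indicators, so a column with k
// levels becomes k-1 columns and the Gram loops can stay purely numeric.
public class CatExpand {
  static double[] expandRow(double[] numeric, int level, int nLevels) {
    double[] out = new double[numeric.length + nLevels - 1];
    System.arraycopy(numeric, 0, out, 0, numeric.length);
    if (level > 0)
      out[numeric.length + level - 1] = 1.0;
    return out;
  }
}
```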
GLM + GLMGrid
● Gram matrix is computed in parallel & distributed
● Rest of GLM is all single-threaded pure Java
● Includes JAMA for Cholesky Decomposition
● Default 10-fold x-val runs in parallel
● Warm-start all models for faster solving
● GLMGrid: Parameter search for GLM
● In parallel, try all combos of λ & α
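The λ/α search amounts to scoring the cross-product of the two candidate lists in parallel; a sketch with a dummy scoring function standing in for fitting and cross-validating one GLM (names are illustrative, not GLMGrid's API):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Enumerate all (lambda, alpha) combos and score them in parallel,
// keeping the best-scoring pair.
public class GridSketch {
  static List<double[]> combos(double[] lambdas, double[] alphas) {
    List<double[]> out = new ArrayList<>();
    for (double l : lambdas)
      for (double a : alphas)
        out.add(new double[]{l, a});     // cross-product of candidates
    return out;
  }
  static double[] best(double[] lambdas, double[] alphas,
                       ToDoubleFunction<double[]> score) {
    return combos(lambdas, alphas).parallelStream()
        .min(Comparator.comparingDouble(score))
        .get();
  }
}
```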
Meta Considerations: Math @ Scale
● Easy coding style is key:
● 1st cut of GLM ready in 2 weeks, but
● Code was changing for months
● Incremental evolution of a number of features
● Distributed/parallel borders kept clean & simple
● Java
● Runs fine in a single JVM, under the Eclipse debugger
● Well understood programming model
H2O: Memory Considerations
● Runs best with default GC, largest -Xmx
● Data cached in Java heap
● Cache size vs heap monitored, spill-to-disk
● FullGC typically <1sec even for >30G heap
● If data fits – math runs at memory speeds
● Else disk-bound
● Ingest: Typically need 4x to 6x more memory
● Depends on GZIP ratios & column-compress ratios
H2O: Reliable Network I/O
● Uses both UDP & TCP
● UDP for fast point-to-point control logic
● Reliable UDP via timeout & retry
● TCP, under load, reliably fails silently
– No data at receiver, no errors at sender
– 100% fail, <5mins in our labs or EC2
● (so not a fault of virtualization)
● TCP uses the same reliable comm layer as UDP
– Only use TCP for congestion control of large xfers
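The timeout-&-retry idea can be sketched generically; this is purely illustrative, not H2O's actual comm layer:

```java
import java.util.concurrent.Callable;

// Retry an unreliable operation with exponential backoff.  The same idea
// covers lost UDP packets and silently failing TCP transfers: keep
// resending until an acknowledgement arrives.
public class RetrySketch {
  static <T> T withRetry(Callable<T> op, int maxTries, long backoffMs)
      throws Exception {
    Exception last = null;
    for (int t = 0; t < maxTries; t++) {
      try { return op.call(); }
      catch (Exception e) {
        last = e;
        Thread.sleep(backoffMs << t);  // exponential backoff between tries
      }
    }
    throw last;                        // give up after maxTries attempts
  }
}
```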
H2O: S3 Ingest
● H2O can inhale from S3 (and many others)
● S3, under load, reliably fails
● Unlike TCP, appears to throw an exception every time
● Again, wrap in a reliability retry layer
● HDFS backed by S3 (jets3t)
● New failure mode: reports premature EOF
● Again, wrap in a reliability retry layer