A Distributed Parallel
Logistic Regression & GLM
Cliff Click, CTO 0xdata
cliffc@0xdata.com
http://0xdata.com
http://cliffc.org/blog
0xdata.com 2
H2O – A Platform for Big Math
● In-memory distributed & parallel vector math
● Pure Java, runs in cloud, server, laptop
● Open source: http://0xdata.github.com/h2o
● java -jar h2o.jar -name meetup
● Will auto-cluster in this room
● Best with default GC, largest heap
● Inner loops: near FORTRAN speeds & Java ease
● for( int i=0; i<N; i++ )
    ...do_something... // auto-distributed & parallelized
GLM & Logistic Regression
● Vector Math (for non-math majors):
● At the core, we compute a Gram Matrix
● i.e., we touch all the data
● Logistic Regression – solve with Iteratively Reweighted Least Squares (IRLS)
● Iterative: multiple passes, multiple Grams
$\eta_k = X\beta_k$
$\mu_k = \mathrm{link}^{-1}(\eta_k)$
$z = \eta_k + (y - \mu_k)\cdot\mathrm{link}'(\mu_k)$
$\beta_{k+1} = (X^T \cdot w \cdot X)^{-1} \cdot (X^T \cdot z)$
Inverse solved with Cholesky Decomposition
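The update above can be sketched on a single node as follows. This is a minimal illustration for the logit link, not H2O's implementation: the class and method names are hypothetical, and the row weight w is applied on both sides (X^T·w·X and X^T·w·z), which is the standard IRLS form and an assumption about what the shorthand above intends.

```java
// Hypothetical single-node IRLS sketch for logistic regression (logit link).
class IrlsSketch {
  // Solve A x = b for symmetric positive-definite A via Cholesky (A = L L^T).
  // Only the lower triangle of A is read.
  static double[] choleskySolve(double[][] a, double[] b) {
    int p = b.length;
    double[][] l = new double[p][p];
    for (int i = 0; i < p; i++) {
      for (int j = 0; j <= i; j++) {
        double s = a[i][j];
        for (int k = 0; k < j; k++) s -= l[i][k] * l[j][k];
        if (i == j) {
          if (s <= 0) throw new ArithmeticException("matrix is not positive definite");
          l[i][i] = Math.sqrt(s);
        } else {
          l[i][j] = s / l[j][j];
        }
      }
    }
    double[] x = new double[p];
    for (int i = 0; i < p; i++) {            // forward substitution: L t = b
      double s = b[i];
      for (int k = 0; k < i; k++) s -= l[i][k] * x[k];
      x[i] = s / l[i][i];
    }
    for (int i = p - 1; i >= 0; i--) {       // back substitution: L^T x = t
      double s = x[i];
      for (int k = i + 1; k < p; k++) s -= l[k][i] * x[k];
      x[i] = s / l[i][i];
    }
    return x;
  }

  // Run a fixed number of IRLS passes; each pass rebuilds a weighted Gram.
  static double[] fit(double[][] x, double[] y, int iterations) {
    int n = x.length, p = x[0].length;
    double[] beta = new double[p];
    for (int it = 0; it < iterations; it++) {
      double[][] xwx = new double[p][p];     // lower triangle of X^T·w·X
      double[] xwz = new double[p];          // X^T·w·z
      for (int r = 0; r < n; r++) {
        double eta = 0;
        for (int i = 0; i < p; i++) eta += x[r][i] * beta[i];
        double mu = 1.0 / (1.0 + Math.exp(-eta));   // link^-1(eta)
        double w = mu * (1.0 - mu);                 // 1 / link'(mu) for logit
        double z = eta + (y[r] - mu) / w;           // eta + (y - mu)·link'(mu)
        for (int i = 0; i < p; i++) {
          for (int j = 0; j <= i; j++) xwx[i][j] += w * x[r][i] * x[r][j];
          xwz[i] += w * x[r][i] * z;
        }
      }
      beta = choleskySolve(xwx, xwz);
    }
    return beta;
  }

  public static void main(String[] args) {
    // Intercept-only model: the fitted coefficient approaches log(3) for 3 ones and 1 zero.
    double[] beta = fit(new double[][]{{1}, {1}, {1}, {1}}, new double[]{1, 1, 1, 0}, 10);
    System.out.println(beta[0] + " vs " + Math.log(3.0));
  }
}
```

The sketch does not guard against fitted probabilities reaching exactly 0 or 1, which a production solver would need to handle.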
GLM Running Time
● n – number of rows or observations
● p – number of features
● Gram Matrix: $O(np^2)$ / #cpus
● n can be billions; constant is really small
● Data is distributed across machines
● Cholesky Decomp: $O(p^3)$
● Real limit: memory is $O(p^2)$, on a single node
● Times a small number of iterations (5-50)
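To make these bounds concrete, here is a rough worked estimate. The sizes are illustrative assumptions (n = 10^9 rows, p = 10^3 features, 8-byte doubles, a full p-by-p matrix) and constant factors are ignored.

```latex
\begin{align*}
\text{Gram work per pass:}\quad & n p^2 = 10^{9}\cdot(10^{3})^{2} = 10^{15}\ \text{multiply-adds, divided across all CPUs}\\
\text{Cholesky work per pass:}\quad & p^3 = (10^{3})^{3} = 10^{9}\ \text{operations, on one node}\\
\text{Gram memory:}\quad & 8\,p^2\ \text{bytes} = 8\cdot 10^{6}\ \text{bytes} = 8\ \text{MB}
\end{align*}
```

Under the same assumptions, p = 10^5 would need 8·10^10 bytes (80 GB) for the Gram alone, which is why the p^2 memory on a single node becomes the limit long before the row count does.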
Gram Matrix
● Requires computing $X^T \cdot X$
● A single observation: double x[], y;
    // Accumulate one row into the lower triangle of X^T·X, plus X^T·y and y·y
    for( int i=0; i<P; i++ ) {
      for( int j=0; j<=i; j++ )
        _xx[i][j] += x[i]*x[j];
      _xy[i] += y*x[i];
    }
    _yy += y*y;
● Computed per-row
● Millions to billions of rows
● Parallelize / distribute per-row
Distributed Vector Coding
● Map-Reduce Style
● Start with a Plain Old Java Object
● Private clone per-Map
● Shallow-copy within a JVM, deep-copy across JVMs
● Map a "chunk" of data into private clone
● "chunk" == all the rows that fit in 4 MB
● Reduce: combine pairs of cloned objects
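The pattern in the list above can be sketched inside one JVM with plain Java. Everything here is a hypothetical stand-in rather than the H2O API: `Task`, `SumTask`, `split` and `invoke` are made-up names, chunks are cut by element count instead of by bytes, and a fresh object from a `Supplier` plays the role of the private clone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;

class ChunkedMapReduceSketch {
  // Hypothetical task shape: map fills a private object from one chunk, reduce folds another in.
  interface Task<T extends Task<T>> {
    void map(double[] chunk);
    void reduce(T other);
  }

  // Example task: count, sum and sum of squares over all values.
  static class SumTask implements Task<SumTask> {
    long count;
    double sum, sumSq;
    public void map(double[] chunk) {
      for (double v : chunk) { count++; sum += v; sumSq += v * v; }
    }
    public void reduce(SumTask other) {
      count += other.count; sum += other.sum; sumSq += other.sumSq;
    }
  }

  // Cut the data into fixed-size chunks (the last one may be shorter).
  static List<double[]> split(double[] data, int chunkSize) {
    List<double[]> chunks = new ArrayList<>();
    for (int from = 0; from < data.length; from += chunkSize)
      chunks.add(Arrays.copyOfRange(data, from, Math.min(data.length, from + chunkSize)));
    return chunks;
  }

  // Map every chunk into its own private task in parallel, then combine tasks pairwise.
  static <T extends Task<T>> T invoke(Supplier<T> fresh, List<double[]> chunks) {
    List<T> parts = chunks.parallelStream()
        .map(c -> { T t = fresh.get(); t.map(c); return t; })
        .collect(Collectors.toList());
    while (parts.size() > 1) {
      List<T> next = new ArrayList<>();
      for (int i = 0; i + 1 < parts.size(); i += 2) {
        parts.get(i).reduce(parts.get(i + 1));
        next.add(parts.get(i));
      }
      if (parts.size() % 2 == 1) next.add(parts.get(parts.size() - 1));
      parts = next;
    }
    return parts.isEmpty() ? fresh.get() : parts.get(0);
  }

  public static void main(String[] args) {
    double[] data = new double[1000];
    for (int i = 0; i < data.length; i++) data[i] = i;
    SumTask t = invoke(SumTask::new, split(data, 64));
    System.out.println(t.count + " values, sum " + t.sum);
  }
}
```

Because each map call writes only to its own object, the inner loop needs no locks; all sharing happens in reduce.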
Plain Old Java Object
● Using the POJO:
    Gram G = new Gram();
    G.invoke(A);       // Compute the Gram of A
    ...G._xx[][]...    // Use the Gram for more math
● Defining the POJO:
    class Gram extends MRTask {
      Key _data;                    // Input variable(s)
      // Output variables
      double _xx[][], _xy[], _yy;
      void map( Key chunk ) { … }
      void reduce( Gram other ) { … }
    }
Gram.map
● Define the map:
    void map( Key chunk ) {
      // Pull in 4M chunk of data
      ...boiler plate...
      for( int r=0; r<rows; r++ ) {
        double y,x[] = decompress(r);   // shorthand: response y and features x[] of row r
        for( int i=0; i<P; i++ ) {
          for( int j=0; j<=i; j++ )
            _xx[i][j] += x[i]*x[j];     // lower triangle of X^T·X
          _xy[i] += y*x[i];             // X^T·y
        }
        _yy += y*y;
      }
    }
Gram.reduce
● Define the reduce:
    // Fold 'other' into 'this'
    void reduce( Gram other ) {
      for( int i=0; i<P; i++ ) {
        for( int j=0; j<=i; j++ )
          _xx[i][j] += other._xx[i][j];
        _xy[i] += other._xy[i];
      }
      _yy += other._yy;
    }
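A small self-contained check of the map and reduce above: two partial Grams built from disjoint row ranges and folded together should equal the Gram of all rows. The `GramSketch` class is a hypothetical stand-in that reads rows from plain arrays instead of decompressing a chunk; the `full()` helper, which mirrors the lower triangle into a symmetric matrix, is an addition for illustration.

```java
import java.util.Arrays;

// Illustrative stand-in for the Gram task, using plain arrays instead of H2O chunks.
class GramSketch {
  final int p;
  final double[][] xx;   // lower triangle of X^T·X: row i holds columns 0..i
  final double[] xy;     // X^T·y
  double yy;             // y·y

  GramSketch(int p) {
    this.p = p;
    xx = new double[p][];
    for (int i = 0; i < p; i++) xx[i] = new double[i + 1];
    xy = new double[p];
  }

  // "map": accumulate rows [from, to) into this object's private sums.
  void map(double[][] x, double[] y, int from, int to) {
    for (int r = from; r < to; r++) {
      for (int i = 0; i < p; i++) {
        for (int j = 0; j <= i; j++) xx[i][j] += x[r][i] * x[r][j];
        xy[i] += y[r] * x[r][i];
      }
      yy += y[r] * y[r];
    }
  }

  // "reduce": fold another partial Gram into this one.
  void reduce(GramSketch other) {
    for (int i = 0; i < p; i++) {
      for (int j = 0; j <= i; j++) xx[i][j] += other.xx[i][j];
      xy[i] += other.xy[i];
    }
    yy += other.yy;
  }

  // Mirror the lower triangle into a full symmetric p-by-p matrix.
  double[][] full() {
    double[][] g = new double[p][p];
    for (int i = 0; i < p; i++)
      for (int j = 0; j <= i; j++) g[i][j] = g[j][i] = xx[i][j];
    return g;
  }

  public static void main(String[] args) {
    double[][] x = {{1, 2}, {3, 4}, {5, 6}};
    double[] y = {1, 0, 1};
    GramSketch a = new GramSketch(2), b = new GramSketch(2);
    a.map(x, y, 0, 2);   // first "chunk": rows 0 and 1
    b.map(x, y, 2, 3);   // second "chunk": row 2
    a.reduce(b);
    System.out.println(Arrays.deepToString(a.full()));  // [[35.0, 44.0], [44.0, 56.0]]
  }
}
```

Storing only the lower triangle roughly halves both the inner-loop work and the size of the object that has to be copied between nodes.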
Distributed Vector Coding 2
● Gram Matrix computed in parallel & distributed
● Excellent CPU & load-balancing
● About 1 sec per GB on 32 medium EC2 instances
● The whole Logistic Regression, about 10 sec/GB
– Varies by #features (e.g. billion rows, 1000 features)
● Distribution & Parallelization handled by H2O
● Data is pre-split by rows during parse/ingest
● map(chunk) is run where chunk is local
● reduce runs both local & distributed
– Gram object auto-serialized, auto-cloned
Other Inner-Loop Considerations
● Real inner loop has more cruft
● Some columns excluded by user
● Some rows excluded by sampling, or missing data
● Data is normalized & centered
● Categorical column expansion
– Math is straightforward, but needs another indirection
● Iteratively Reweighted Least Squares
– Adds weight to each row
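One way to picture the extra indirection and the per-row weight is sketched below. It is an assumption about the general technique, not H2O's code: each categorical column is expanded into a block of 0/1 indicator columns (full one-hot, no reference level dropped), and the row is handled as a short list of non-zero (index, value) pairs. The names `ExpandedRowSketch`, `catOffsets` and `addRow` are hypothetical.

```java
// Hypothetical sketch: weighted Gram update for a row with numeric and categorical columns.
class ExpandedRowSketch {
  final int numNumeric;
  final int[] catOffsets;    // catOffsets[c] = first expanded column of categorical column c
  final int expandedWidth;   // numeric columns + total number of categorical levels

  ExpandedRowSketch(int numNumeric, int[] levelsPerCat) {
    this.numNumeric = numNumeric;
    catOffsets = new int[levelsPerCat.length];
    int next = numNumeric;
    for (int c = 0; c < levelsPerCat.length; c++) {
      catOffsets[c] = next;
      next += levelsPerCat[c];
    }
    expandedWidth = next;
  }

  // Add w * x * x^T for one row into the lower triangle of xx, where x is the
  // expanded row. Only non-zero entries are visited: the indirection is idx[].
  void addRow(double[][] xx, double[] nums, int[] cats, double w) {
    int nz = numNumeric + cats.length;
    int[] idx = new int[nz];
    double[] val = new double[nz];
    for (int i = 0; i < numNumeric; i++) { idx[i] = i; val[i] = nums[i]; }
    for (int c = 0; c < cats.length; c++) {
      idx[numNumeric + c] = catOffsets[c] + cats[c];   // expanded column of the active level
      val[numNumeric + c] = 1.0;
    }
    // idx is increasing, so idx[a] >= idx[b] whenever b <= a (lower triangle only).
    for (int a = 0; a < nz; a++)
      for (int b = 0; b <= a; b++)
        xx[idx[a]][idx[b]] += w * val[a] * val[b];
  }

  public static void main(String[] args) {
    ExpandedRowSketch e = new ExpandedRowSketch(1, new int[]{3});
    double[][] xx = new double[e.expandedWidth][e.expandedWidth];
    e.addRow(xx, new double[]{2.0}, new int[]{1}, 0.5);
    System.out.println(xx[0][0] + " " + xx[2][0] + " " + xx[2][2]);  // 2.0 1.0 0.5
  }
}
```

With this layout the cost per row depends on the number of original columns, not on the number of categorical levels.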
GLM + GLMGrid
● Gram matrix is computed in parallel & distributed
● Rest of GLM is all single-threaded pure Java
● Includes JAMA for Cholesky Decomposition
● Default 10-fold cross-validation runs in parallel
● Warm-start all models for faster solving
● GLMGrid: Parameter search for GLM
● In parallel, try all combos of λ & α
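The parameter search can be pictured as a Cartesian product of λ and α values evaluated in parallel. The sketch below is illustrative only: `GridSearchSketch`, `Point`, `grid` and `best` are hypothetical names, the loss function is supplied by the caller, and the `penalty` helper assumes the common elastic-net parameterisation λ(α‖β‖₁ + (1−α)/2·‖β‖₂²), which the slides do not spell out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

class GridSearchSketch {
  // One (lambda, alpha) combination.
  static final class Point {
    final double lambda, alpha;
    Point(double lambda, double alpha) { this.lambda = lambda; this.alpha = alpha; }
  }

  // All combinations of the given lambda and alpha values.
  static List<Point> grid(double[] lambdas, double[] alphas) {
    List<Point> points = new ArrayList<>();
    for (double l : lambdas)
      for (double a : alphas) points.add(new Point(l, a));
    return points;
  }

  // Evaluate the loss for every grid point in parallel and keep the smallest.
  static Point best(List<Point> grid, ToDoubleFunction<Point> loss) {
    return grid.parallelStream()
        .min(Comparator.comparingDouble(loss))
        .orElseThrow(() -> new IllegalArgumentException("empty grid"));
  }

  // Assumed elastic-net penalty: lambda * (alpha * |beta|_1 + (1 - alpha) / 2 * |beta|_2^2).
  static double penalty(double[] beta, double lambda, double alpha) {
    double l1 = 0, l2 = 0;
    for (double b : beta) { l1 += Math.abs(b); l2 += b * b; }
    return lambda * (alpha * l1 + (1.0 - alpha) / 2.0 * l2);
  }

  public static void main(String[] args) {
    List<Point> g = grid(new double[]{0.01, 0.1, 1.0}, new double[]{0.0, 0.5, 1.0});
    // Toy loss with its minimum at lambda = 0.1, alpha = 0.5; a real search would
    // fit and cross-validate a model per point instead.
    Point p = best(g, q -> Math.pow(q.lambda - 0.1, 2) + Math.pow(q.alpha - 0.5, 2));
    System.out.println(p.lambda + " " + p.alpha);
  }
}
```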
Meta Considerations: Math @ Scale
● Easy coding style is key:
● 1st cut GLM ready in 2 weeks, but
● Code was changing for months
● Incremental evolution of a number of features
● Distributed/parallel borders kept clean & simple
● Java
● Runs fine in a single-JVM in debugger + Eclipse
● Well understood programming model
H2O: Memory Considerations
● Runs best with default GC, largest -Xmx
● Data cached in Java heap
● Cache size vs heap monitored, spill-to-disk
● FullGC typically <1sec even for >30G heap
● If data fits – math runs at memory speeds
● Else disk-bound
● Ingest: Typically need 4x to 6x more memory
● Depends on GZIP ratios & column-compress ratios
H2O: Reliable Network I/O
● Uses both UDP & TCP
● UDP for fast point-to-point control logic
● Reliable UDP via timeout & retry
● TCP, under load, reliably fails silently
– No data at receiver, no errors at sender
– 100% fail, <5mins in our labs or EC2
● (so not a fault of virtualization)
● TCP uses the same reliable comm layer as UDP
– Only use TCP for congestion control of large xfers
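The timeout-and-retry idea can be sketched without any real sockets. This is a deterministic simulation under stated assumptions, not H2O's comm layer: `Link`, `Receiver`, `FlakyLink` and `sendReliably` are hypothetical names, a `false` return stands for "no ack before the timeout", and the receiver de-duplicates by message id so that a retry after a lost ack is harmless.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ReliableSendSketch {
  // Hypothetical lossy transport: returns true only if an ack arrived before the timeout.
  interface Link {
    boolean sendAndAwaitAck(long msgId, String payload);
  }

  // Receiver that applies each message id at most once, so resends do no harm.
  static class Receiver {
    final Set<Long> seen = new HashSet<>();
    final List<String> delivered = new ArrayList<>();
    void onPacket(long msgId, String payload) {
      if (seen.add(msgId)) delivered.add(payload);
    }
  }

  // Simulated link: loses the first dropData packets, then loses the first dropAcks acks.
  static class FlakyLink implements Link {
    final Receiver rx;
    int dropData, dropAcks;
    FlakyLink(Receiver rx, int dropData, int dropAcks) {
      this.rx = rx; this.dropData = dropData; this.dropAcks = dropAcks;
    }
    public boolean sendAndAwaitAck(long msgId, String payload) {
      if (dropData > 0) { dropData--; return false; }   // packet never arrived
      rx.onPacket(msgId, payload);
      if (dropAcks > 0) { dropAcks--; return false; }   // arrived, but the ack was lost
      return true;
    }
  }

  // Resend the same message id until it is acknowledged; returns the attempt that succeeded.
  static int sendReliably(Link link, long msgId, String payload, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++)
      if (link.sendAndAwaitAck(msgId, payload)) return attempt;
    throw new IllegalStateException("no ack after " + maxAttempts + " attempts");
  }

  public static void main(String[] args) {
    Receiver rx = new Receiver();
    int attempts = sendReliably(new FlakyLink(rx, 2, 1), 7L, "hello", 10);
    System.out.println(attempts + " attempts, delivered " + rx.delivered);  // 4 attempts, delivered [hello]
  }
}
```

The same sender loop works regardless of whether the underlying transport is UDP or TCP, since it only relies on seeing (or not seeing) an acknowledgement.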
H2O: S3 Ingest
● H2O can inhale from S3 (and many others)
● S3, under load, reliably fails
● Unlike TCP, appears to throw an exception every time
● Again, wrap in a reliability retry layer
● HDFS backed by S3 (jets3)
● New failure mode: reports premature EOF
● Again, wrap in a reliability retry layer
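A retry layer that covers both failure modes (an exception, or a stream that ends early) might look like the sketch below. It assumes the expected length is known up front and that the source can be reopened at a byte offset; `RangeOpener` and `readFully` are hypothetical names and no real S3 or HDFS client is involved.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

class RetryingReadSketch {
  // Hypothetical source that can reopen the object starting at a given byte offset.
  interface RangeOpener {
    InputStream open(long offset) throws IOException;
  }

  // Read exactly expectedLength bytes, reopening at the current offset after an
  // exception or a premature end of stream, for at most maxAttempts opens.
  static byte[] readFully(RangeOpener opener, int expectedLength, int maxAttempts) throws IOException {
    byte[] out = new byte[expectedLength];
    int done = 0;
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts && done < expectedLength; attempt++) {
      try (InputStream in = opener.open(done)) {
        int n;
        while (done < expectedLength && (n = in.read(out, done, expectedLength - done)) != -1)
          done += n;                       // progress survives into the next attempt
        if (done < expectedLength) last = new EOFException("premature EOF at byte " + done);
      } catch (IOException e) {
        last = e;                          // remember the failure and retry from 'done'
      }
    }
    if (done < expectedLength) throw (last != null ? last : new EOFException("no attempts made"));
    return out;
  }

  public static void main(String[] args) throws IOException {
    byte[] data = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    int[] calls = {0};
    byte[] got = readFully(off -> {
      // The first open returns only 4 bytes (simulated premature EOF); later opens are complete.
      int len = (calls[0]++ == 0) ? 4 : data.length - (int) off;
      return new ByteArrayInputStream(data, (int) off, len);
    }, data.length, 3);
    System.out.println(got.length + " bytes after " + calls[0] + " opens");  // 10 bytes after 2 opens
  }
}
```

A production layer would typically also pause between attempts; that is left out here to keep the example deterministic.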
