2. About Me
• http://www.linkedin.com/in/milindb
• Founding member of Hadoop team at Yahoo! [2005-2010]
• Contributor to Apache Hadoop since v0.1
• Built and led Grid Solutions Team at Yahoo! [2007-2010]
• Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)
• Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)
20. W-1-W
• WebMap: Graph processing for WWW
• Dreadnaught: Infrastructure for WebMap
• W-1-W: WebMap In One Week
• Juggernaut: Infrastructure for W-1-W
• JFS, JMR, Condor: Abandoned for Hadoop
35. Hadoop Maturity
• ETL Offload: Accommodate massive data growth with existing EDW investments
• Data Lakes: Unify unstructured and structured data access
• Big Data Apps: Build analytic-led applications impacting top-line revenue
• Data-Driven Enterprise: App dev and operational management on HDFS
Data Architecture
36. The Big Gap
Average Enterprises
• 70% of data generated by customers
• 80% of data being stored
• 3% being prepared for analysis
• 0.5% being analyzed
• <0.5% being operationalized
42. In-Database Analytics
MADlib provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.
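One common way to make a statistical method data-parallel is to reduce each data partition to a small mergeable state and combine the states at the end. The sketch below illustrates that pattern for a one-variable linear regression. It is an illustration only, not MADlib's code: the names `Stats`, `transition`, `merge`, `final`, and `fit` and the sample segments are hypothetical.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class Stats:
    """Sufficient statistics for fitting y = a + b*x on one data partition."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0


def transition(state, row):
    """Fold one (x, y) row into a partition-local state."""
    x, y = row
    return Stats(state.n + 1, state.sx + x, state.sy + y,
                 state.sxx + x * x, state.sxy + x * y)


def merge(a, b):
    """Combine the states of two partitions; the order does not matter."""
    return Stats(a.n + b.n, a.sx + b.sx, a.sy + b.sy,
                 a.sxx + b.sxx, a.sxy + b.sxy)


def final(s):
    """Turn the merged state into (intercept, slope) by least squares."""
    slope = (s.n * s.sxy - s.sx * s.sy) / (s.n * s.sxx - s.sx ** 2)
    intercept = (s.sy - slope * s.sx) / s.n
    return intercept, slope


def fit(partitions):
    # Each partition could be scanned by a different worker in parallel;
    # here the scans simply run one after another.
    local = [reduce(transition, part, Stats()) for part in partitions]
    return final(reduce(merge, local, Stats()))


# Rows of y = 1 + 2x split over three hypothetical segments.
segments = [[(0, 1), (1, 3)], [(2, 5)], [(3, 7), (4, 9)]]
intercept, slope = fit(segments)
```

Because only the five running sums leave each partition, the amount of data exchanged is independent of the number of rows.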
45. k-Means Usage
SELECT * FROM madlib.kmeanspp (
    'customers', -- name of the input table
    'features',  -- name of the feature array column
    2            -- k: number of clusters
);

centroids | objective_fn | frac_reassigned | …
------------------------------------------------------------------------+------------------+-----------------+ …
{{68.01668579784,48.9667382972952},{28.1452167573446,84.5992507653263}} | 586729.010675982 | 0.001 | …
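To make the output columns concrete, here is a minimal single-machine sketch of k-means with k-means++ seeding that returns the same three quantities: centroids, an objective value, and the fraction of points reassigned in the last iteration. It is not MADlib's implementation. The function names, the default thresholds, and the use of summed squared Euclidean distance as the objective are assumptions made for illustration, and the input is assumed to contain at least k distinct points.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def seed_pp(points, k, rng):
    """k-means++ seeding: each later seed is drawn with probability
    proportional to its squared distance to the nearest seed so far."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def kmeans(points, k, max_iter=20, min_frac_reassigned=0.001, seed=0):
    rng = random.Random(seed)
    centroids = seed_pp(points, k, rng)
    assignment = [None] * len(points)
    frac = 1.0
    for _ in range(max_iter):
        # Assignment step: attach every point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        changed = sum(a != b for a, b in zip(assignment, new_assignment))
        frac = changed / len(points)
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
        if frac < min_frac_reassigned:
            break
    objective = sum(sq_dist(p, centroids[a])
                    for p, a in zip(points, assignment))
    return centroids, objective, frac


# Two well-separated groups of hypothetical customer feature vectors.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, objective_fn, frac_reassigned = kmeans(data, k=2)
```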
47. PivotalR
•Interface is R client
•Execution is in database
•Parallelism handled by PivotalR
•Supports a portion of R
R> x = db.data.frame("t1")
R> l = madlib.lm(interlocks ~ assets + nation, data = x)
51. A wrapper of MADlib
• Linear regression
• Logistic regression
• Elastic Net
• ARIMA
• Table summary
• Categorical variable as.factor()
• $ [ [[ $<- [<- [[<-
• is.na
• + - * / %% %/% ^
• & | !
• == != > < >= <=
• merge
• by
• db.data.frame
• as.db.data.frame
• preview
• sort
• c mean sum sd var min max length colMeans colSums
• db.connect db.disconnect db.list db.objects db.existsObject delete
• dim names
• content
And more ... (SQL wrapper)
• predict
52. In-Database Execution
• All data stays in DB: R objects merely point to DB objects
• All model estimation and heavy lifting done in DB by MADlib
• R → SQL translation done in the R client
• Only strings of SQL and model output transferred across ODBC/DBI
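The pattern in the bullets above can be sketched with a client-side proxy object that stores only a table name and turns each operation into a SQL string. This is a rough analogy in Python, not PivotalR's code: the `DbFrame` class and its methods are hypothetical, SQLite stands in for the database, and the table and column names are assumed to be trusted identifiers.

```python
import sqlite3


class DbFrame:
    """Hypothetical stand-in for a db.data.frame: a table name, no rows."""

    def __init__(self, conn, table):
        self.conn = conn
        self.table = table
        self.sent = []  # every SQL string sent to the database

    def _run(self, sql):
        # Only the SQL text goes to the database; only the result comes back.
        self.sent.append(sql)
        return self.conn.execute(sql).fetchall()

    def nrow(self):
        return self._run(f'SELECT COUNT(*) FROM "{self.table}"')[0][0]

    def mean(self, column):
        # The aggregate is computed inside the database, not in the client.
        return self._run(f'SELECT AVG("{column}") FROM "{self.table}"')[0][0]


# An in-memory SQLite table plays the role of the remote table "t1".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (assets REAL, interlocks REAL)")
conn.executemany("INSERT INTO t1 VALUES (?, ?)",
                 [(1.0, 2.0), (2.0, 4.0), (3.0, 9.0)])

x = DbFrame(conn, "t1")
avg_assets = x.mean("assets")
row_count = x.nrow()
```

Inspecting `x.sent` afterwards shows that the client issued two short SQL strings and never fetched the table's rows.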
59. YARN
•Yet Another Resource Negotiator
•Resource Manager
•Node Managers
•Application Masters
•Specific to paradigm, e.g. the MR Application Master (takes over the per-job role of the JobTracker)
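The division of labor among these three roles can be illustrated with a toy model. The sketch below is a deliberate simplification and does not use the real YARN API: the class names, the slot-based resource model, and the one-container-per-map-task policy are all assumptions, and details such as heartbeats and the Application Master running in its own container are left out.

```python
class NodeManager:
    """Tracks the free container slots on one worker node."""

    def __init__(self, name, slots):
        self.name = name
        self.free = slots


class ResourceManager:
    """Cluster-wide arbiter: hands out containers, knows nothing about jobs."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, n):
        """Grant up to n containers, scanning the nodes in order."""
        granted = []
        for node in self.nodes:
            while node.free > 0 and len(granted) < n:
                node.free -= 1
                granted.append(node.name)
        return granted


class MapReduceAppMaster:
    """Paradigm-specific logic: here, one container per map task."""

    def __init__(self, rm, map_tasks):
        self.rm = rm
        self.pending = list(map_tasks)
        self.placement = {}

    def run(self):
        # Ask for as many containers as there are pending tasks and
        # place one task in each container that was actually granted.
        for node_name in self.rm.allocate(len(self.pending)):
            self.placement[self.pending.pop(0)] = node_name
        return self.placement


rm = ResourceManager([NodeManager("nm1", 2), NodeManager("nm2", 1)])
am = MapReduceAppMaster(rm, ["m0", "m1", "m2", "m3"])
placement = am.run()
```

A different paradigm, such as MPI or graph processing, would supply its own application master while reusing the same resource manager and node managers.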
60. Beyond MapReduce
•Apache Giraph - BSP & Graph Processing
•Storm on YARN - Streaming Computation
•HOYA - HBase on YARN
•Hamster - MPI on Hadoop
•More to come ...
61. Hamster
• Hadoop and MPI on the same cluster
• Open MPI runtime on Hadoop YARN
• Hadoop provides: resource scheduling, process monitoring, distributed file system
• Open MPI provides: process launching, communication, I/O forwarding
63. About GraphLab
• Graph-based, high-performance distributed computation framework
• Started by Prof. Carlos Guestrin at CMU in 2009
• GraphLab Inc. was recently founded to commercialize GraphLab.org
65. Graphs Alone Are Not Enough
• Full data processing workflow requires ETL/postprocessing, visualization, data wrangling, serving
• MapReduce excels at data wrangling
• OLTP/NoSQL row-based stores excel at serving
• GraphLab should co-exist with other Hadoop frameworks
66. Data Platform of the Future ?
• Analytic Data Marts
• SQL Services
• Operational Intelligence
• In-Memory Database
• Run-Time Applications
• Data Staging Platform
• Data Mgmt. Services
• Stream Ingestion
• Streaming Services
• Software-Defined Datacenter
• New Data-fabrics
• In-Memory Grid
• ...ETC