Pivotal Confidential–Internal Use Only
SQL & Machine Learning on Hadoop
Mukund Babbar
Pivotal
Feb, 2015
Journey to Apache
[Timeline figure: 1986 to 2015]
Events on the timeline:
•  Michael Stonebraker develops Postgres at UCB
•  Postgres adds support for SQL
•  Open Source PostgreSQL
•  PostgreSQL 7.0 released
•  PostgreSQL 8.0 released
•  Greenplum forks PostgreSQL
•  Hadoop 1.0 Released
•  HAWQ & MADlib go Apache
•  HAWQ launched
•  Hadoop 2.0 Released
•  MADlib launched
•  Greenplum open sourced
Apache HAWQ Overview
HAWQ – SQL on Hadoop
Shared-Nothing Database Architecture
[Architecture diagram: Master Host and Standby Master connected through the Interconnect to Segment Hosts node1, node2, node3, … nodeN, each running several Segment Instances]
•  Master Host and Standby Master Host: the Master receives SQL and coordinates work with the Segment Hosts
•  Segment Host with one or more Segment Instances
•  Segment Instances process queries in parallel
•  Segment Hosts have their own CPU, disk and memory (shared nothing)
•  High-speed interconnect for continuous pipelining of data processing
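As a rough illustration of the shared-nothing idea on this slide, the following Python sketch hash-distributes rows across a few simulated segments, lets each segment aggregate only its own rows, and has a master merge the partial results. The segment count, the `distribute`, `segment_sum` and `master_query` helpers, and the sample rows are all hypothetical; this is not HAWQ code, and real segments run as separate processes on separate hosts rather than as threads.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32

NUM_SEGMENTS = 4  # hypothetical cluster size


def distribute(rows, num_segments=NUM_SEGMENTS):
    """Assign each (key, value) row to one segment by hashing its key."""
    segments = [[] for _ in range(num_segments)]
    for key, value in rows:
        # crc32 is stable across runs, unlike the built-in hash() on strings.
        segments[crc32(key.encode("utf-8")) % num_segments].append((key, value))
    return segments


def segment_sum(local_rows):
    """Work done by one segment: aggregate only the rows it owns."""
    partial = defaultdict(int)
    for key, value in local_rows:
        partial[key] += value
    return dict(partial)


def master_query(rows):
    """Master: dispatch work to all segments in parallel, then merge the partials."""
    segments = distribute(rows)
    with ThreadPoolExecutor(max_workers=NUM_SEGMENTS) as pool:
        partials = list(pool.map(segment_sum, segments))
    merged = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            merged[key] += value
    return dict(merged)


rows = [("east", 10), ("west", 5), ("east", 7), ("north", 1), ("west", 3)]
totals = master_query(rows)
print(totals)
```

Because rows with the same key always hash to the same segment, each segment can finish its part of the aggregate without looking at any other segment's data.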
Key Features of HAWQ
1. Advanced Analytics Performance
•  Up to 30x SQL-on-Hadoop performance advantage
•  Faster time to insight
•  Massive MPP scalability to petabytes
Benefits: Near real-time latency, complex queries and advanced analytics at scale
HAWQ Performance vs Impala
[Chart: HAWQ faster vs. Impala faster, by TPC-DS query]
HAWQ
•  Faster on 46 of 62 TPC-DS queries completed*
•  4.55x mean avg.
•  12 hrs faster total
* Impala supported 74 of 99 queries, 12 crashed mid-run
HAWQ vs Apache Hive w/Tez
[Chart: HAWQ faster vs. Hive faster, by TPC-DS query]
HAWQ
•  Faster on 45 of 60 TPC-DS queries completed*
•  3.44x mean avg.
•  9 hrs faster total
* Hive supported 65 of 99 queries, 5 crashed mid-run
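The footnotes and headline counts on the two benchmark slides appear to fit together if "completed" means "supported minus crashed mid-run". That reading is an assumption, not something the slides state; the short Python check below simply redoes the arithmetic with the numbers quoted above.

```python
# Numbers quoted on the two benchmark slides.
benchmarks = {
    "Impala": {"supported": 74, "crashed": 12, "completed": 62, "hawq_faster": 46},
    "Hive on Tez": {"supported": 65, "crashed": 5, "completed": 60, "hawq_faster": 45},
}

shares = {}
for engine, b in benchmarks.items():
    # Assumption: completed = supported - crashed mid-run.
    derived_completed = b["supported"] - b["crashed"]
    consistent = derived_completed == b["completed"]
    # Share of completed queries on which HAWQ was the faster engine.
    shares[engine] = b["hawq_faster"] / b["completed"]
    print(f"{engine}: derived completed = {derived_completed} "
          f"(matches slide: {consistent}), "
          f"HAWQ faster on {shares[engine]:.0%} of completed queries")
```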
2. 100% ANSI SQL Compliant
•  ANSI SQL-92, -99, -2003
•  All 99 TPC-DS queries tested, no modifications
•  Plus, OLAP extensions
•  Complete ACID integrity and reliability
Benefits: 100% SQL compliant; no risk to SQL applications; all native on HDP via HAWQ
3. Integrated Machine Learning
•  Advanced machine learning for big data
•  Local, in-database operation
•  Exceptional MPP/parallel performance
•  Open source, Postgres-based
Benefits: Advanced, highly scalable machine learning, directly on data in Hadoop
4. Flexible Deployment
•  HDP, PHD, other ODPi-derived distros
•  Easily managed via Ambari
•  On premises, in cloud, or PaaS
•  HBase, Avro, Parquet and more
•  Connectors to make HAWQ data available to other SQL query tools
Benefits: Flexibility, accessibility, portability
5. Query Optimization Options
•  Cost-based query optimization
•  Robust query plan optimization
•  Complex big data management
Benefits: Optimize performance and costs; maximize Hadoop cluster resources; offload EDW w/o compromise
Advanced MPP: Polymorphic Storage™
•  Columnar storage is well suited to scanning a large percentage of the data
•  Row storage excels at small lookups
•  Most systems need to do both
•  Row and column orientation can be mixed within a table or database
•  Both types can be dramatically more efficient with compression
•  Compression is definable column by column:
–  Blockwise: Gzip1-9 & QuickLZ
–  Streamwise: Run Length Encoding (RLE) (levels 1-4)
•  Flexible indexing, partitioning enable more granular control and enable true ILM
[Figure: TABLE 'SALES', Mar to Nov: row-oriented for small scans, column-oriented for full scans]
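To make the storage bullets concrete, here is a small Python sketch, written under simplifying assumptions, that lays the same hypothetical SALES rows out row-wise and column-wise and applies a toy run-length encoding to one column. The `rle_encode` and `rle_decode` helpers and the sample data are illustrative only and do not reflect HAWQ's on-disk format or its RLE levels.

```python
from itertools import groupby

# Hypothetical SALES rows: (month, region, amount).
rows = [
    ("Mar", "east", 10),
    ("Mar", "east", 12),
    ("Mar", "west", 7),
    ("Apr", "west", 9),
    ("Apr", "west", 4),
]

# Row orientation: each record is stored together, so a single-row lookup is cheap.
row_store = rows

# Column orientation: each column is stored together, so a scan that needs one
# column reads only that column.
column_store = {
    "month": [r[0] for r in rows],
    "region": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}


def rle_encode(values):
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


encoded_month = rle_encode(column_store["month"])
total_amount = sum(column_store["amount"])  # full scan touches one column only
third_row = row_store[2]                    # small lookup touches one record

print(encoded_month)
print(total_amount, third_row)
```

Run-length encoding pays off on columns whose adjacent values repeat, which is one reason it helps to choose compression column by column.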
  
PL/X : X in {pgsql, R, Python, Java, Perl, C, etc.}
•  Allows users to write HAWQ functions in R, Python, Java, Perl, pgsql or C
•  The interpreter/VM of the language 'X' is installed on each node of the HAWQ cluster
•  Data parallelism:
–  PL/X piggybacks on HAWQ's MPP architecture
Apache HAWQ
Hardened, 10+ Years Investment, Production Proven
●  Discover New Relationships
●  Enable Data Science
●  Analyze External Sources
●  Query All Data Types!
●  Manage Multiple Workloads
●  Petabyte Scale Analytics
●  Security controls
●  Leverage Existing SQL Skills & BI Tools
●  Easily Integrate with Other Tools
●  Sub-second Performance
●  Hadoop-Native
●  Supports Pivotal HD and Hortonworks Data Platform
●  Ambari-Integrated
[Diagram labels: Multi-level Fault Tolerance; Granular Authorization; Resource Mgmt (+ YARN); high multi-tenancy; ANSI SQL Standard; OLAP Extensions; JDBC ODBC Connectivity; Parallel Processing; Online Expansion; HDFS; Petabyte Scale; Cost Based Optimizer; Dynamic Pipelining; ACID + Transactional; Multi-Language UDF Support; Built-in Data Science Library; Extensible (PXF); Query External Sources; Accessibility + Usability; HDFS Native File Formats; Compression + Partitioning; core; compliance]
Apache HAWQ 2.0 (new features)
Areas of Enhancement
•  Elastic & Scalable Architecture
•  Hadoop-Native Integrations
•  Simplified External Data Access/Queries
•  Performance & Optimizations
New Features
•  On-Demand Virtual Segments
•  Flexible Query Dispatch on subset nodes
•  3 Tier RM: YARN level > User > Query-Operator
•  Dynamic Cluster Expansion (no redistribute)
•  New Fault Tolerance Service
•  HCatalog integration - Read Access
•  HDFS Catalog Cache
•  Per Table Directory storage (user friendly)
•  Single physical segment per node
•  Easier Administration/Usage
•  Cloud-Ready
•  Simpler Management Commands
Apache HAWQ 2.0 Architecture
[Architecture diagram; components shown:]
•  Client
•  HAWQ Masters: Parser/Analyzer, Optimizer, Dispatcher, Resource Manager, Fault Tolerance Service, Catalog Service, HDFS Catalog Cache, Resource Broker (libYARN)
•  Yarn, NameNode
•  HAWQ Segments: a Physical Segment with Virtual Segments on each node, alongside a DataNode and NodeManager
•  Interconnect
•  External Data Stores via Xtension Framework (Hive/HBase/etc)
Apache MADlib Overview
Scalable, In-Database Machine Learning
•  Open Source https://github.com/apache/incubator-madlib
•  Supports Greenplum DB, Apache HAWQ/HDB and PostgreSQL
•  Downloads and Docs: http://madlib.incubator.apache.org/
Apache (incubating)
Functions
Predictive Modeling Library
Linear Systems
•  Sparse and Dense Solvers
•  Linear Algebra
Matrix Factorization
•  Singular Value Decomposition (SVD)
•  Low Rank
Generalized Linear Models
•  Linear Regression
•  Logistic Regression
•  Multinomial Logistic Regression
•  Cox Proportional Hazards Regression
•  Elastic Net Regularization
•  Robust Variance (Huber-White), Clustered Variance, Marginal Effects
Other Machine Learning Algorithms
•  Principal Component Analysis (PCA)
•  Association Rules (Apriori)
•  Topic Modeling (Parallel LDA)
•  Decision Trees
•  Random Forest
•  Support Vector Machines (SVM)
•  Conditional Random Field (CRF)
•  Clustering (K-means)
•  Cross Validation
•  Naïve Bayes
Descriptive Statistics
Sketch-Based Estimators
•  CountMin (Cormode-Muthukrishnan)
•  FM (Flajolet-Martin)
•  MFV (Most Frequent Values)
Correlation
Summary
Support Modules
Array Operations
Sparse Vectors
Random Sampling
Probability Functions
Data Preparation
PMML Export
Conjugate Gradient
Inferential Statistics
Hypothesis Tests
Time Series
•  ARIMA
Oct 2014
MADlib Advantages
Ÿ  Better parallelism
–  Algorithms designed to leverage MPP and
Hadoop architecture
Ÿ  Better scalability
–  Algorithms scale as your data set scales
Ÿ  Better predictive accuracy
–  Can use all data, not a sample
Ÿ  ASF open source (incubating)
–  Available for customization and optimization
Calling MADlib Functions: Fast Training & Scoring
•  MADlib allows users to easily create models without moving data out of the system
–  Model generation
–  Model validation
–  Scoring (evaluation of) new data
•  All the data can be used in one model
•  Built-in functionality to create multiple smaller models (e.g. classification grouped by feature)
•  Open source lets you tweak and extend methods, or build your own
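The "multiple smaller models" bullet can be pictured with a plain-Python stand-in: group the rows by one column and fit a separate simple linear model per group, then score new data with the model for its group. This is only a sketch of the idea; MADlib does this inside the database through its own SQL functions, which are not shown here, and the `fit_line`, `fit_per_group` and `score` helpers, the column layout, and the sample data are hypothetical.

```python
from collections import defaultdict


def fit_line(points):
    """Closed-form least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_per_group(rows):
    """Train one small model per group key instead of one global model."""
    groups = defaultdict(list)
    for group, x, y in rows:
        groups[group].append((x, y))
    return {group: fit_line(points) for group, points in groups.items()}


def score(models, group, x):
    """Score a new observation with the model trained for its group."""
    intercept, slope = models[group]
    return intercept + slope * x


# Hypothetical training rows: (group, x, y).
rows = [
    ("a", 0, 1), ("a", 1, 3), ("a", 2, 5),   # generated from y = 1 + 2x
    ("b", 0, 4), ("b", 2, 2), ("b", 4, 0),   # generated from y = 4 - x
]
models = fit_per_group(rows)
print(models)
print(score(models, "a", 3), score(models, "b", 1))
```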
Challenges in computing OLS solution
The rows of X are stored across Segment 1 and Segment 2:
\[
X = \begin{pmatrix} a & b \\ c & d \\ e & f \\ g & h \end{pmatrix},
\qquad
X^T = \begin{pmatrix} a & c & e & g \\ b & d & f & h \end{pmatrix}
\]
\[
X^T X = \begin{pmatrix} a^2+c^2+e^2+g^2 & ab+cd+ef+gh \\ ab+cd+ef+gh & b^2+d^2+f^2+h^2 \end{pmatrix}
\]
•  Data across nodes are multiplied!
•  Looks like the result can be decomposed:
\[
X^T X =
\begin{pmatrix} a \\ b \end{pmatrix}\begin{pmatrix} a & b \end{pmatrix}
+ \begin{pmatrix} c \\ d \end{pmatrix}\begin{pmatrix} c & d \end{pmatrix}
+ \begin{pmatrix} e \\ f \end{pmatrix}\begin{pmatrix} e & f \end{pmatrix}
+ \begin{pmatrix} g \\ h \end{pmatrix}\begin{pmatrix} g & h \end{pmatrix}
\]
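The decomposition on this slide can be checked with a few lines of plain Python. In the sketch below each simulated segment computes the sum of outer products of its own rows, and the partial 2x2 matrices are then added; the result equals X^T X computed on the full matrix. The numeric values standing in for a through h, the two-rows-per-segment split, and the helper names are assumptions made for illustration, not MADlib code.

```python
def outer(row):
    """Outer product row^T row of a single row vector."""
    return [[a * b for b in row] for a in row]


def mat_add(m1, m2):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


def partial_xtx(local_rows, width):
    """Per-segment work: sum of outer products over the rows stored locally."""
    total = [[0] * width for _ in range(width)]
    for row in local_rows:
        total = mat_add(total, outer(row))
    return total


def full_xtx(rows):
    """Reference: X^T X computed directly on the whole matrix."""
    width = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(width)]
            for i in range(width)]


# Hypothetical values for the rows (a, b), (c, d), (e, f), (g, h).
segment1 = [[1, 2], [3, 4]]
segment2 = [[5, 6], [7, 8]]

# Each segment works only on local data; only small 2x2 partials are combined.
combined = mat_add(partial_xtx(segment1, 2), partial_xtx(segment2, 2))
print(combined)  # prints [[84, 100], [100, 120]]
```

Only the small partial matrices need to be exchanged, so the cost of the final merge depends on the number of columns rather than the number of rows.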
Linear Regression on 10 Million Rows in Seconds
Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.
Contributors Welcome!
•  Web sites
–  http://hawq.incubator.apache.org/
–  http://madlib.incubator.apache.org/
–  https://cran.r-project.org/web/packages/PivotalR/index.html
•  Github
–  https://github.com/apache/incubator-hawq
–  https://github.com/apache/incubator-madlib
–  https://github.com/pivotalsoftware/PivotalR
?

 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 

SQL and Machine Learning on Hadoop

  • 1. Pivotal Confidential–Internal Use Only SQL & Machine Learning on Hadoop Mukund Babbar Pivotal Feb, 2015
  • 2. Journey to Apache (timeline figure, axis 1986 to 2015). Events shown: Michael Stonebraker develops Postgres at UCB; Postgres adds support for SQL; Open Source PostgreSQL; PostgreSQL 7.0 released; PostgreSQL 8.0 released; Greenplum forks PostgreSQL; Hadoop 1.0 released; HAWQ & MADlib go Apache; HAWQ launched; Hadoop 2.0 released; MADlib launched; Greenplum open sourced
  • 3. Apache HAWQ Overview
  • 4. HAWQ – SQL on Hadoop
  • 5. Shared-Nothing Database Architecture (diagram) • SQL enters at the Master Host, which is paired with a Standby Master Host • Master coordinates work with Segment Hosts • Each Segment Host (node1, node2, node3 … nodeN) has one or more Segment Instances • Segment Instances process queries in parallel • Segment Hosts have their own CPU, disk and memory (shared nothing) • High speed interconnect for continuous pipelining of data processing
  • 6. 5 Key Features of HAWQ
  • 7. Key Feature 1: Advanced Analytics Performance • Up to 30x SQL-on-Hadoop performance advantage • Faster time to insight • Massive MPP scalability to petabytes • Benefits: near real-time latency, complex queries and advanced analytics at scale
  • 8. HAWQ Performance vs Impala (chart with "HAWQ Faster" and "Impala Faster" regions; axis labels 2, 28, 46, 66, 73, 76, 79, 80, 88, 90, 96) • HAWQ faster on 46 of 62 TPC-DS queries completed* • 4.55x faster on average • 12 hrs faster in total • * Impala supported 74 of 99 queries, 12 crashed mid-run
  • 9. HAWQ vs Apache Hive w/Tez (chart with "HAWQ Faster" and "Hive Faster" regions; axis labels 3, 7, 15, 25, 27, 34, 46, 48, 76, 79, 89, 90, 96) • HAWQ faster on 45 of 60 TPC-DS queries completed* • 3.44x faster on average • 9 hrs faster in total • * Hive supported 65 of 99 queries, 5 crashed mid-run
  • 10. Key Feature 2: 100% ANSI SQL Compliant • ANSI SQL-92, -99, -2003 • All 99 TPC-DS queries tested, no modifications • Plus, OLAP extensions • Complete ACID integrity and reliability • Benefits: 100% SQL compliant; no risk to SQL applications; all native on HDP via HAWQ
  • 11. Key Feature 3: Integrated Machine Learning • Advanced machine learning for big data • Local, in-database operation • Exceptional MPP/parallel performance • Open source, Postgres-based • Benefits: advanced, highly scalable machine learning, directly on data in Hadoop
  • 12. Key Feature 4: Flexible Deployment • HDP, PHD, other ODPi-derived distros • Easily managed via Ambari • On premises, in cloud, or PaaS • HBase, Avro, Parquet and more • Connectors to make HAWQ data available to other SQL query tools • Benefits: flexibility, accessibility, portability
  • 13. Key Feature 5: Query Optimization Options • Cost-based query optimization • Robust query plan optimization • Complex big data management • Benefits: optimize performance and costs; maximize Hadoop cluster resources; offload EDW w/o compromise
  • 14. Advanced MPP: Polymorphic Storage™ • Columnar storage is well suited to scanning a large percentage of the data • Row storage excels at small lookups • Most systems need to do both • Row and column orientation can be mixed within a table or database • Both types can be dramatically more efficient with compression • Compression is definable column by column: blockwise (Gzip 1-9 & QuickLZ) or streamwise (Run Length Encoding (RLE), levels 1-4) • Flexible indexing and partitioning enable more granular control and enable true ILM • Figure: TABLE 'SALES' partitioned by month (Mar to Nov), row-oriented for small scans and column-oriented for full scans
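The interaction between column orientation and run-length encoding on slide 14 can be illustrated with a small sketch. This is not HAWQ's storage code: the `rows` data, the `rle_encode` / `rle_decode` helpers, and the three-field SALES layout are all hypothetical, chosen only to show why values stored column by column tend to form long runs while the same values stored row by row do not.

```python
from itertools import groupby

def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out

# Hypothetical SALES rows: (month, region, amount). Illustrative data only.
rows = [
    ("Mar", "EU", 10), ("Mar", "EU", 12), ("Mar", "US", 7),
    ("Apr", "US", 9), ("Apr", "US", 11), ("Apr", "US", 5),
]

# Column orientation: all values of one column are stored together,
# so repeated months sit next to each other and compress into few runs.
columns = [list(col) for col in zip(*rows)]
month_runs = rle_encode(columns[0])

# Row orientation: the fields of each row are adjacent, so equal values
# are separated by other fields and almost no runs form.
row_stream = [field for row in rows for field in row]
row_runs = rle_encode(row_stream)

print(len(columns[0]), "month values ->", len(month_runs), "runs")
print(len(row_stream), "row-ordered fields ->", len(row_runs), "runs")
```

In this toy data the month column shrinks from six values to two runs, while the row-ordered stream gains nothing from the encoding, which is consistent with the slide's point that compression is chosen column by column.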
  • 15. PL/X : X in {pgsql, R, Python, Java, Perl, C, etc.} • Allows users to write HAWQ functions in R, Python, Java, Perl, pgsql or C • The interpreter/VM of the language 'X' is installed on each node of the HAWQ cluster • Data parallelism: PL/X piggybacks on HAWQ's MPP architecture
  • 16. Apache HAWQ (feature diagram): Hardened, 10+ Years Investment, Production Proven • Benefits: Discover New Relationships; Enable Data Science; Analyze External Sources; Query All Data Types; Manage Multiple Workloads; Petabyte Scale Analytics; Security controls; Leverage Existing SQL Skills & BI Tools; Easily Integrate with Other Tools; Sub-second Performance; Hadoop-Native; Supports Pivotal HD and Hortonworks Data Platform; Ambari-Integrated • Capabilities: Multi-level Fault Tolerance; Granular Authorization; Resource Mgmt (+ YARN); high multi-tenancy; ANSI SQL Standard; OLAP Extensions; JDBC/ODBC Connectivity; Parallel Processing; Online Expansion; HDFS; Petabyte Scale; Cost Based Optimizer; Dynamic Pipelining; ACID + Transactional; Multi-Language UDF Support; Built-in Data Science Library; Extensible (PXF); Query External Sources; Accessibility + Usability; HDFS Native File Formats; Compression + Partitioning; core compliance
  • 17. Apache HAWQ 2.0 (new features), table of Areas of Enhancement and New Features: Elastic & Scalable Architecture; Hadoop-Native Integrations; Simplified External Data Access/Queries; Performance & Optimizations; On-Demand Virtual Segments; Flexible Query Dispatch on subset nodes; 3 Tier RM: YARN level > User > Query-Operator; Dynamic Cluster Expansion (no redistribute); New Fault Tolerance Service; HCatalog integration (Read Access); HDFS Catalog Cache; Per Table Directory storage (user friendly); Single physical segment per node; Easier Administration/Usage; Cloud-Ready; Simpler Management Commands
  • 18. Apache HAWQ 2.0 Architecture (diagram). Labelled components: Client; HAWQ Masters (Parser/Analyzer, Optimizer, Dispatcher, Resource Manager, Fault Tolerance Service, Catalog Service, HDFS Catalog Cache, Resource Broker, libYARN); YARN; NameNode; HAWQ Segments, shown as three nodes that each carry a Physical Segment, Virtual Segments, a DataNode and a NodeManager; Interconnect; External Data Stores via Xtension Framework (Hive/HBase/etc)
  • 19. Apache MADlib Overview
  • 20. Scalable, In-Database Machine Learning •  Open Source https://github.com/apache/incubator-madlib •  Supports Greenplum DB, Apache HAWQ/HDB and PostgreSQL •  Downloads and Docs: http://madlib.incubator.apache.org/ Apache (incubating)
  • 21. Functions (Oct 2014) • Predictive Modeling Library: Linear Systems (Sparse and Dense Solvers, Linear Algebra); Matrix Factorization (Singular Value Decomposition (SVD), Low Rank); Generalized Linear Models (Linear Regression, Logistic Regression, Multinomial Logistic Regression, Cox Proportional Hazards Regression, Elastic Net Regularization, Robust Variance (Huber-White), Clustered Variance, Marginal Effects); Other Machine Learning Algorithms (Principal Component Analysis (PCA), Association Rules (Apriori), Topic Modeling (Parallel LDA), Decision Trees, Random Forest, Support Vector Machines (SVM), Conditional Random Field (CRF), Clustering (K-means), Cross Validation, Naïve Bayes); Time Series (ARIMA) • Descriptive Statistics: Sketch-Based Estimators (CountMin (Cormode-Muth.), FM (Flajolet-Martin), MFV (Most Frequent Values)); Correlation; Summary • Support Modules: Array Operations; Sparse Vectors; Random Sampling; Probability Functions; Data Preparation; PMML Export; Conjugate Gradient • Inferential Statistics: Hypothesis Tests
  • 22. MADlib Advantages • Better parallelism: algorithms designed to leverage MPP and Hadoop architecture • Better scalability: algorithms scale as your data set scales • Better predictive accuracy: can use all data, not a sample • ASF open source (incubating): available for customization and optimization
  • 23. Calling MADlib Functions: Fast Training & Scoring •  MADlib allows users to easily create models without moving data out of the systems –  Model generation –  Model validation –  Scoring (evaluation of) new data •  All the data can be used in one model •  Built-in functionality to create multiple smaller models (e.g. classification grouped by feature) •  Open source lets you tweak and extend methods, or build your own
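  • 24. Challenges in computing OLS solution: the rows of $X = \begin{pmatrix} a & b \\ c & d \\ e & f \\ g & h \end{pmatrix}$ are stored across Segment 1 and Segment 2
  • 25. Challenges in computing OLS solution: the columns of $X^T = \begin{pmatrix} a & c & e & g \\ b & d & f & h \end{pmatrix}$ are likewise split across Segment 1 and Segment 2
  • 26. Challenges in computing OLS solution: the first entry of $X^T X$ is $a^2 + c^2 + e^2 + g^2$. Data across nodes are multiplied
  • 27. Challenges in computing OLS solution: $X^T X = \begin{pmatrix} a^2+c^2+e^2+g^2 & ab+cd+ef+gh \\ ab+cd+ef+gh & b^2+d^2+f^2+h^2 \end{pmatrix}$. Looks like the result can be decomposed
  • 28. Challenges in computing OLS solution: data across nodes are multiplied! $X^T X = \begin{pmatrix} a \\ b \end{pmatrix}\begin{pmatrix} a & b \end{pmatrix} + \begin{pmatrix} c \\ d \end{pmatrix}\begin{pmatrix} c & d \end{pmatrix} + \begin{pmatrix} e \\ f \end{pmatrix}\begin{pmatrix} e & f \end{pmatrix} + \begin{pmatrix} g \\ h \end{pmatrix}\begin{pmatrix} g & h \end{pmatrix}$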
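The decomposition on slides 24 to 28 can be checked with a short sketch: because X^T X is a sum of one outer product per row, each segment can sum the outer products of its own rows and only the small k x k partial results need to be combined. The code below is an illustration in plain Python, not MADlib's implementation; the helper names `outer_sum` and `merge`, the numeric values, and the assignment of two rows to each segment are assumptions made for the example.

```python
def outer_sum(rows):
    """Sum of x * x^T over the given non-empty list of rows, as a k x k nested list."""
    k = len(rows[0])
    acc = [[0] * k for _ in range(k)]
    for x in rows:
        for i in range(k):
            for j in range(k):
                acc[i][j] += x[i] * x[j]
    return acc

def merge(a, b):
    """Element-wise sum of two k x k partial results."""
    k = len(a)
    return [[a[i][j] + b[i][j] for j in range(k)] for i in range(k)]

# Hypothetical values for a..h, with two rows held by each segment.
segment1 = [[1, 2], [3, 4]]  # rows (a, b) and (c, d)
segment2 = [[5, 6], [7, 8]]  # rows (e, f) and (g, h)

# Each segment works only on its local rows.
partial1 = outer_sum(segment1)
partial2 = outer_sum(segment2)

# Only the two small partial matrices are combined.
xtx_distributed = merge(partial1, partial2)

# Reference: X^T X computed directly over all rows in one place.
X = segment1 + segment2
xtx_reference = [
    [sum(row[i] * row[j] for row in X) for j in range(2)]
    for i in range(2)
]

print(xtx_distributed)
print(xtx_reference)
```

The per-segment step touches each row once and the combine step depends only on the number of columns, not the number of rows, which is why this kind of aggregate fits a shared-nothing layout.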
  • 29. Linear Regression on 10 Million Rows in Seconds Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.
  • 30. Contributors Welcome! •  Web sites –  http://hawq.incubator.apache.org/ –  http://madlib.incubator.apache.org/ –  https://cran.r-project.org/web/packages/PivotalR/index.html •  Github –  https://github.com/apache/incubator-hawq –  https://github.com/apache/incubator-madlib –  https://github.com/pivotalsoftware/PivotalR
  • 31. ?