1) HAWQ is an SQL and machine learning engine that runs on Hadoop, providing SQL capabilities and machine learning functionality directly on HDFS data.
2) HAWQ provides up to 30x faster performance than other SQL-on-Hadoop engines like Impala and Hive, through its massively parallel processing (MPP) architecture and query optimization capabilities.
3) Key features of HAWQ include ANSI SQL compliance, integrated machine learning via the MADlib library, flexible deployment across on-premises and cloud environments, and high scalability to petabytes of data.
7. 5 Key Features of HAWQ
1. Advanced Analytics Performance
• Up to 30x SQL-on-Hadoop performance advantage
• Faster time to insight
• Massive MPP scalability to petabytes
Benefits: Near real-time latency, complex queries and advanced analytics at scale
8. HAWQ Performance vs Impala
[Chart: per-query TPC-DS comparison, "HAWQ Faster" vs "Impala Faster"; labeled queries: 2, 28, 46, 66, 73, 76, 79, 80, 88, 90, 96]
HAWQ
• Faster on 46 of 62 TPC-DS queries completed*
• 4.55x mean avg.
• 12 hrs faster total
* Impala supported 74 of 99 queries, 12 crashed mid-run
9. HAWQ vs Apache Hive w/Tez
[Chart: per-query TPC-DS comparison, "HAWQ Faster" vs "Hive Faster"; labeled queries: 3, 7, 15, 25, 27, 34, 46, 48, 76, 79, 89, 90, 96]
HAWQ
• Faster on 45 of 60 TPC-DS queries completed*
• 3.44x mean avg.
• 9 hrs faster total
* Hive supported 65 of 99 queries, 5 crashed mid-run
10. 5 Key Features of HAWQ
2. 100% ANSI SQL Compliant
• ANSI SQL-92, -99, -2003
• All 99 TPC-DS queries tested, no modifications
• Plus, OLAP extensions
• Complete ACID integrity and reliability
Benefits: 100% SQL compliant. No risk to SQL applications. All native on HDP via HAWQ
11. 5 Key Features of HAWQ
3. Integrated Machine Learning
• Advanced machine learning for big data
• Local, in-database operation
• Exceptional MPP/parallel performance
• Open source, Postgres-based
Benefits: Advanced, highly scalable machine learning, directly on data in Hadoop
12. 5 Key Features of HAWQ
4. Flexible Deployment
• HDP, PHD, other ODPi-derived distros
• Easily managed via Ambari
• On premises, in cloud, or PaaS
• HBase, Avro, Parquet and more
• Connectors to make HAWQ data available to other SQL query tools
Benefits: Flexibility, Accessibility, Portability
13. 5 Key Features of HAWQ
5. Query Optimization Options
• Cost-based query optimization
• Robust query plan optimization
• Complex big data management
Benefits: Optimize performance and costs. Maximize Hadoop cluster resources. Offload EDW w/o compromise
14. Advanced MPP: Polymorphic Storage™
• Columnar storage is well suited to scanning a large percentage of the data
• Row storage excels at small lookups
• Most systems need to do both
• Row and column orientation can be mixed within a table or database
• Both types can be dramatically more efficient with compression
• Compression is definable column by column:
– Blockwise: Gzip1-9 & QuickLZ
– Streamwise: Run Length Encoding (RLE) (levels 1-4)
• Flexible indexing, partitioning enable more granular control and enable true ILM
[Diagram: TABLE ‘SALES’ partitioned by month (Mar, Apr, May, Jun, Jul, Aug, Sept, Oct, Nov); some partitions row-oriented for small scans, others column-oriented for full scans]
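To make the column-orientation and RLE bullets concrete, here is a minimal Python sketch, not HAWQ's storage code. It pivots a few hypothetical SALES rows into columns and run-length encodes the month column; the sample rows and the helper names `rle_encode` and `rle_decode` are assumptions made up for illustration.

```python
from itertools import groupby

def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]

# Hypothetical rows of a SALES table: (month, amount).
rows = [
    ("Mar", 120), ("Mar", 80), ("Mar", 95),
    ("Apr", 60), ("Apr", 75),
    ("May", 110),
]

# Column orientation: each attribute is stored contiguously.
months = [month for month, _ in rows]
amounts = [amount for _, amount in rows]

# A sorted, low-cardinality column collapses into a few runs.
encoded_months = rle_encode(months)
print(encoded_months)
```

Storing a column contiguously is what makes long runs of equal values likely, which is why stream-wise schemes such as RLE pair naturally with column orientation.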
15. PL/X : X in {pgsql, R, Python, Java, Perl, C, etc.}
• Allows users to write HAWQ functions in R, Python, Java, Perl, pgsql or C languages
• The interpreter/VM of the language ‘X’ is installed on each node of the HAWQ cluster
• Data Parallelism:
– PL/X piggybacks on HAWQ’s MPP architecture
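The data-parallel idea can be sketched in plain Python. This is an illustration only, not PL/Python or any HAWQ API: the threads stand in for segment hosts, and `udf`, `run_on_segment`, and the sample rows are hypothetical names and data.

```python
from concurrent.futures import ThreadPoolExecutor

def udf(row):
    """Hypothetical user-defined function evaluated on a single row."""
    return row["price"] * row["qty"]

def run_on_segment(rows):
    """Each segment applies the UDF only to the rows it stores locally."""
    return [udf(row) for row in rows]

# Assumed layout: the table's rows are already distributed across two segments.
segments = [
    [{"price": 2.0, "qty": 3}, {"price": 1.5, "qty": 4}],
    [{"price": 10.0, "qty": 1}],
]

# One worker per segment; pool.map returns results in segment order.
with ThreadPoolExecutor(max_workers=len(segments)) as pool:
    per_segment = list(pool.map(run_on_segment, segments))

# A coordinator gathers the per-segment outputs into one result set.
results = [value for segment in per_segment for value in segment]
print(results)
```

Because the function only ever sees one row at a time, no data has to move between segments, which is what lets a per-row function reuse the existing MPP distribution.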
16. Apache HAWQ
● Discover New Relationships
● Enable Data Science
● Analyze External Sources
● Query All Data Types!
Multi-level Fault Tolerance
Granular Authorization
Resource Mgmt (+ YARN)
high multi-tenancy
ANSI SQL Standard
OLAP Extensions
JDBC ODBC Connectivity
Parallel Processing
Online Expansion
HDFS
Petabyte Scale
Cost Based Optimizer
Dynamic Pipelining
ACID + Transactional
Multi-Language UDF Support
Built-in Data Science Library
Extensible (PXF) Query External Sources
Hardened, 10+ Years Investment, Production Proven
Accessibility + Usability
HDFS Native File Formats
● Manage Multiple Workloads
● Petabyte Scale Analytics
● Security controls
● Leverage Existing SQL Skills & BI Tools
● Easily Integrate with Other Tools
● Sub-second Performance
Compression + Partitioning
core compliance
● Hadoop-Native
● Supports Pivotal HD and Hortonworks Data Platform
● Ambari-Integrated
17. Apache HAWQ 2.0 (new features..)
Areas of Enhancement
• Elastic & Scalable Architecture
• Hadoop-Native Integrations
• Simplified External Data Access/Queries
• Performance & Optimizations
New Features
• On-Demand Virtual Segments
• Flexible Query Dispatch on subset nodes
• 3 Tier RM: YARN level > User > Query-Operator
• Dynamic Cluster Expansion (no redistribute)
• New Fault Tolerance Service
• HCatalog integration - Read Access
• HDFS Catalog Cache
• Per Table Directory storage (user friendly)
• Single physical segment per node
• Easier Administration/Usage
• Cloud-Ready
• Simpler Management Commands
20. Scalable, In-Database Machine Learning
• Open Source https://github.com/apache/incubator-madlib
• Supports Greenplum DB, Apache HAWQ/HDB and PostgreSQL
• Downloads and Docs: http://madlib.incubator.apache.org/
Apache (incubating)
21. Functions
Predictive Modeling Library
Linear Systems
• Sparse and Dense Solvers
• Linear Algebra
Matrix Factorization
• Singular Value Decomposition (SVD)
• Low Rank
Generalized Linear Models
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Robust Variance (Huber-White), Clustered Variance, Marginal Effects
Other Machine Learning Algorithms
• Principal Component Analysis (PCA)
• Association Rules (Apriori)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Random Forest
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation
• Naïve Bayes
• Support Vector Machines (SVM)
Descriptive Statistics
Sketch-Based Estimators
• CountMin (Cormode-Muthukrishnan)
• FM (Flajolet-Martin)
• MFV (Most Frequent Values)
Correlation
Summary
Support Modules
Array Operations
Sparse Vectors
Random Sampling
Probability Functions
Data Preparation
PMML Export
Conjugate Gradient
Inferential Statistics
Hypothesis Tests
Time Series
• ARIMA
Oct 2014
22. MADlib Advantages
• Better parallelism
– Algorithms designed to leverage MPP and Hadoop architecture
• Better scalability
– Algorithms scale as your data set scales
• Better predictive accuracy
– Can use all data, not a sample
• ASF open source (incubating)
– Available for customization and optimization
23. Calling MADlib Functions: Fast Training & Scoring
• MADlib allows users to easily create models without moving data out of the system
– Model generation
– Model validation
– Scoring (evaluation of) new data
• All the data can be used in one model
• Built-in functionality to create multiple smaller models (e.g. classification grouped by feature)
• Open source lets you tweak and extend methods, or build your own
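The "multiple smaller models" bullet can be illustrated with a small Python sketch that fits one model per group. It is plain Python, not a MADlib call, and it swaps in a one-variable least-squares line for the classification example on the slide; `fit_line`, the `region` grouping column, and the sample rows are hypothetical.

```python
from collections import defaultdict

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical rows: (region, x, y). The region column plays the grouping role.
rows = [
    ("north", 1.0, 3.0), ("north", 2.0, 5.0), ("north", 3.0, 7.0),
    ("south", 1.0, 1.0), ("south", 2.0, 0.0), ("south", 3.0, -1.0),
]

# Partition the rows by group, then train one small model per partition.
groups = defaultdict(list)
for region, x, y in rows:
    groups[region].append((x, y))

models = {region: fit_line(points) for region, points in groups.items()}
print(models)
```

Each group's model depends only on that group's rows, so the per-group fits are independent of one another and can be trained side by side.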
25. Challenges in computing OLS solution
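The rows of X are split between Segment 1 and Segment 2, so the columns of X^T are split the same way:
\[
X = \begin{pmatrix} a & b \\ c & d \\ e & f \\ g & h \end{pmatrix},
\qquad
X^T = \begin{pmatrix} a & c & e & g \\ b & d & f & h \end{pmatrix}
\]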
26. Challenges in computing OLS solution
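\[
X^T X
= \begin{pmatrix} a & c & e & g \\ b & d & f & h \end{pmatrix}
  \begin{pmatrix} a & b \\ c & d \\ e & f \\ g & h \end{pmatrix}
= \begin{pmatrix} a^2+c^2+e^2+g^2 & \cdots \\ \cdots & \cdots \end{pmatrix}
\]
Data across nodes are multiplied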
27. Challenges in computing OLS solution
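\[
X^T X
= \begin{pmatrix} a & c & e & g \\ b & d & f & h \end{pmatrix}
  \begin{pmatrix} a & b \\ c & d \\ e & f \\ g & h \end{pmatrix}
= \begin{pmatrix} a^2+c^2+e^2+g^2 & ab+cd+ef+gh \\ ab+cd+ef+gh & b^2+d^2+f^2+h^2 \end{pmatrix}
\]
Looks like the result can be decomposed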
28. Challenges in computing OLS solution
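\[
X^T X
= \begin{pmatrix} a^2+c^2+e^2+g^2 & ab+cd+ef+gh \\ ab+cd+ef+gh & b^2+d^2+f^2+h^2 \end{pmatrix}
= \begin{pmatrix} a \\ b \end{pmatrix}\begin{pmatrix} a & b \end{pmatrix}
+ \begin{pmatrix} c \\ d \end{pmatrix}\begin{pmatrix} c & d \end{pmatrix}
+ \begin{pmatrix} e \\ f \end{pmatrix}\begin{pmatrix} e & f \end{pmatrix}
+ \begin{pmatrix} g \\ h \end{pmatrix}\begin{pmatrix} g & h \end{pmatrix}
\]
Data across nodes are multiplied!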
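The decomposition can be checked numerically with a short, pure-Python sketch. It is an illustration under assumptions, not MADlib's implementation: the values 1 to 8 stand in for a to h, the two-rows-per-segment split is assumed, and `outer_sum`, `merge`, and `gram_direct` are hypothetical helper names.

```python
from functools import reduce

def outer_sum(rows):
    """Per-segment partial result: the sum of outer products x x^T over local rows."""
    k = len(rows[0])
    acc = [[0.0] * k for _ in range(k)]
    for x in rows:
        for i in range(k):
            for j in range(k):
                acc[i][j] += x[i] * x[j]
    return acc

def merge(p, q):
    """Combine two partial results by element-wise addition."""
    return [[p[i][j] + q[i][j] for j in range(len(p))] for i in range(len(p))]

def gram_direct(X):
    """Reference computation of X^T X on the full, undistributed matrix."""
    k = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]

# Assumed distribution: rows (a b), (c d) on segment 1; rows (e f), (g h) on segment 2.
segment_1 = [[1.0, 2.0], [3.0, 4.0]]
segment_2 = [[5.0, 6.0], [7.0, 8.0]]

# Each segment works only on its own rows; only small k-by-k matrices are exchanged.
partials = [outer_sum(segment_1), outer_sum(segment_2)]
xtx = reduce(merge, partials)
print(xtx)
```

Each segment ships a k-by-k matrix instead of its rows, so the amount of data exchanged depends on the number of columns rather than the number of rows.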
29. Linear Regression on 10 Million Rows in Seconds
Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.