SQL and Machine Learning on Hadoop

Mukund Babbar
Pivotal
Feb, 2015

Journey to Apache
[Timeline figure spanning 1986 to 2015; events shown:]
•  Michael Stonebraker develops Postgres at UCB
•  Postgres adds support for SQL
•  Open Source PostgreSQL
•  PostgreSQL 7.0 released
•  PostgreSQL 8.0 released
•  Greenplum forks PostgreSQL
•  Hadoop 1.0 released
•  HAWQ & MADlib go Apache
•  HAWQ launched
•  Hadoop 2.0 released
•  MADlib launched
•  Greenplum open sourced

Apache HAWQ Overview

Shared-Nothing Database Architecture
•  Master Host and Standby Master Host: the Master coordinates work with the Segment Hosts (SQL enters at the Master Host)
•  High-speed interconnect for continuous pipelining of data processing
•  Segment Hosts (node1, node2, node3, … nodeN) have their own CPU, disk and memory (shared nothing)
•  Each Segment Host runs one or more Segment Instances
•  Segment Instances process queries in parallel
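
The master/segment split can be illustrated with a small simulation. This is only a sketch under stated assumptions, not HAWQ code: the `SEGMENTS` data, `segment_partial` and `master_average` are hypothetical names, and threads stand in for separate hosts. Each "segment" aggregates only the rows it owns, and the "master" combines the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each "segment" owns its own slice of a table column.
SEGMENTS = {
    "node1": [120.0, 80.0, 45.5],
    "node2": [300.0, 10.0],
    "node3": [99.5, 0.5, 50.0, 25.0],
}

def segment_partial(rows):
    """Runs on one segment: aggregate only the locally stored rows."""
    return sum(rows), len(rows)

def master_average(segments):
    """Plays the master: dispatch work to every segment, then combine."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(segment_partial, segments.values()))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

if __name__ == "__main__":
    print(master_average(SEGMENTS))
```

Only the small (sum, count) pairs travel back to the master, which is the point of a shared-nothing design: the bulk of the data stays where it is stored.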

Key Features of HAWQ
1. Advanced Analytics Performance
•  Up to 30x SQL-on-Hadoop performance advantage
•  Faster time to insight
•  Massive MPP scalability to petabytes
Benefits:
Near real-time latency, complex queries and advanced analytics at scale

HAWQ Performance vs Impala
[Chart: TPC-DS queries, showing which were faster on HAWQ and which on Impala]
•  HAWQ faster on 46 of 62 TPC-DS queries completed*
•  4.55x faster on average (mean)
•  12 hrs faster in total
* Impala supported 74 of 99 queries; 12 crashed mid-run

HAWQ vs Apache Hive w/Tez
[Chart: TPC-DS queries, showing which were faster on HAWQ and which on Hive]
•  HAWQ faster on 45 of 60 TPC-DS queries completed*
•  3.44x faster on average (mean)
•  9 hrs faster in total
* Hive supported 65 of 99 queries; 5 crashed mid-run
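
The counts on the two comparison slides can be cross-checked with a little arithmetic. The sketch below assumes that "completed" means supported queries minus those that crashed mid-run, which matches the 62 and 60 quoted above; the `RESULTS` structure and `summarize` helper are illustrative names only.

```python
# Figures copied from the two comparison slides.
RESULTS = {
    "Impala": {"supported": 74, "crashed": 12, "hawq_faster": 46},
    "Hive on Tez": {"supported": 65, "crashed": 5, "hawq_faster": 45},
}

def summarize(stats):
    """Return (queries completed, share of completed queries where HAWQ won)."""
    # Assumption: completed = supported - crashed.
    completed = stats["supported"] - stats["crashed"]
    share = stats["hawq_faster"] / completed
    return completed, share

if __name__ == "__main__":
    for engine, stats in RESULTS.items():
        completed, share = summarize(stats)
        print(f"{engine}: {completed} completed, HAWQ faster on {share:.0%}")
```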

Key Features of HAWQ
2. 100% ANSI SQL Compliant
•  ANSI SQL-92, -99, -2003
•  All 99 TPC-DS queries tested, no modifications
•  Plus, OLAP extensions
•  Complete ACID integrity and reliability
Benefits:
100% SQL compliant
No risk to SQL applications
All native on HDP via HAWQ

Key Features of HAWQ
3. Integrated Machine Learning
•  Advanced machine learning for big data
•  Local, in-database operation
•  Exceptional MPP/parallel performance
•  Open source, Postgres-based
Benefits:
Advanced, highly scalable machine learning, directly on data in Hadoop

Key Features of HAWQ
4. Flexible Deployment
•  HDP, PHD, other ODPi-derived distros
•  Easily managed via Ambari
•  On premises, in cloud, or PaaS
•  HBase, Avro, Parquet and more
•  Connectors to make HAWQ data available to other SQL query tools
Benefits:
Flexibility
Accessibility
Portability

Key Features of HAWQ
5. Query Optimization Options
•  Cost-based query optimization
•  Robust query plan optimization
•  Complex big data management
Benefits:
Optimize performance and costs
Maximize Hadoop cluster resources
Offload EDW w/o compromise

Advanced MPP: Polymorphic Storage™
•  Columnar storage is well suited to scanning a large percentage of the data
•  Row storage excels at small lookups
•  Most systems need to do both
•  Row and column orientation can be mixed within a table or database
•  Both types can be dramatically more efficient with compression
•  Compression is definable column by column:
–  Blockwise: Gzip1-9 & QuickLZ
–  Streamwise: Run Length Encoding (RLE) (levels 1-4)
•  Flexible indexing and partitioning enable more granular control and enable true ILM
[Figure: TABLE 'SALES' divided by month (Mar through Nov); row-oriented for small scans, column-oriented for full scans]
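
To see why column orientation and run-length encoding work well together, here is a minimal sketch. It is not HAWQ's storage format: the sample `rows` and the `to_columns`, `rle_encode` and `rle_decode` helpers are assumptions made for illustration. Pivoting rows into columns groups similar values together, and a column with long runs of equal values then collapses to a few (value, count) pairs.

```python
from itertools import groupby

# Hypothetical rows of a SALES table: (month, region, amount).
rows = [
    ("Mar", "EU", 10), ("Mar", "EU", 12), ("Mar", "US", 7),
    ("Apr", "US", 9), ("Apr", "US", 11), ("Apr", "US", 3),
]

def to_columns(rows):
    """Pivot row-oriented tuples into one list per column."""
    return [list(col) for col in zip(*rows)]

def rle_encode(values):
    """Collapse each run of equal adjacent values into (value, run_length)."""
    return [(v, len(list(run))) for v, run in groupby(values)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [v for v, n in pairs for _ in range(n)]

months, regions, amounts = to_columns(rows)

if __name__ == "__main__":
    print(rle_encode(months))   # six month values become two pairs
    print(rle_encode(amounts))  # no repeated runs, so RLE gains nothing here
```

The `amounts` column shows the other side of the trade-off: with no repeated runs, RLE stores one pair per value, which is why compression is worth choosing column by column.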

PL/X : X in {pgsql, R, Python, Java, Perl, C, etc.}
•  Allows users to write HAWQ functions in R, Python, Java, Perl, pgsql or C
•  The interpreter/VM of the language ‘X’ is installed on each node of the HAWQ cluster
•  Data Parallelism:
–  PL/X piggybacks on HAWQ’s MPP architecture
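
The data-parallelism point can be sketched in plain Python. Everything here is hypothetical: `normalize_sku` stands in for the body of a user-defined function, and `SEGMENT_ROWS` for the rows stored on each segment. The idea shown is that the same function runs next to each segment's data, with no communication between segments.

```python
def normalize_sku(sku):
    """Hypothetical UDF body: a per-row transformation."""
    return sku.strip().upper()

# Hypothetical rows, already distributed across two segments.
SEGMENT_ROWS = {
    "seg0": [" ab-1 ", "cd-2"],
    "seg1": ["ef-3 "],
}

def run_on_segment(rows, udf):
    """Apply the UDF to every row a single segment holds."""
    return [udf(row) for row in rows]

# Each segment is processed independently; in an MPP database these
# calls would run concurrently on different hosts.
results = {seg: run_on_segment(rows, normalize_sku)
           for seg, rows in SEGMENT_ROWS.items()}

if __name__ == "__main__":
    print(results)
```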

Apache HAWQ
●  Discover New Relationships
●  Enable Data Science
●  Analyze External Sources
●  Query All Data Types!
●  Manage Multiple Workloads
●  Petabyte Scale Analytics
●  Security controls
●  Leverage Existing SQL Skills & BI Tools
●  Easily Integrate with Other Tools
●  Sub-second Performance
●  Hadoop-Native
●  Supports Pivotal HD and Hortonworks Data Platform
●  Ambari-Integrated

[Diagram labels:]
Multi-level Fault Tolerance
Granular Authorization
Resource Mgmt (+ YARN)
high multi-tenancy
ANSI SQL Standard
OLAP Extensions
JDBC ODBC Connectivity
Parallel Processing
Online Expansion
HDFS
Petabyte Scale
Cost Based Optimizer
Dynamic Pipelining
ACID + Transactional
Multi-Language UDF Support
Built-in Data Science Library
Extensible (PXF)
Query External Sources
Hardened, 10+ Years Investment, Production Proven
Accessibility + Usability
HDFS Native File Formats
Compression + Partitioning
core compliance

Apache HAWQ 2.0 (new features)
Areas of Enhancement:
•  Elastic & Scalable Architecture
•  Hadoop-Native Integrations
•  Simplified External Data Access/Queries
•  Performance & Optimizations
•  Easier Administration/Usage
New Features:
•  On-Demand Virtual Segments
•  Flexible Query Dispatch on subset nodes
•  3 Tier RM: YARN level > User > Query-Operator
•  Dynamic Cluster Expansion (no redistribute)
•  New Fault Tolerance Service
•  HCatalog integration - Read Access
•  HDFS Catalog Cache
•  Per Table Directory storage (user friendly)
•  Single physical segment per node
•  Cloud-Ready
•  Simpler Management Commands

Apache HAWQ 2.0 Architecture
[Architecture diagram; components shown:]
•  Client
•  HAWQ Masters: Parser/Analyzer, Optimizer, Dispatcher, Resource Manager, Fault Tolerance Service, Catalog Service, HDFS Catalog Cache, Resource Broker (libYARN)
•  YARN and NameNode
•  HAWQ Segments: each node has a Physical Segment hosting Virtual Segments, next to a DataNode and a NodeManager
•  Interconnect
•  External Data Stores via Xtension Framework (Hive/HBase/etc)

Apache MADlib Overview

Scalable, In-Database Machine Learning
•  Open Source https://github.com/apache/incubator-madlib
•  Supports Greenplum DB, Apache HAWQ/HDB and PostgreSQL
•  Downloads and Docs: http://madlib.incubator.apache.org/
Apache (incubating)

Functions
Predictive Modeling Library
Linear Systems
•  Sparse and Dense Solvers
•  Linear Algebra
Matrix Factorization
•  Singular Value Decomposition (SVD)
•  Low Rank
Generalized Linear Models
•  Linear Regression
•  Logistic Regression
•  Multinomial Logistic Regression
•  Cox Proportional Hazards Regression
•  Elastic Net Regularization
•  Robust Variance (Huber-White), Clustered Variance, Marginal Effects
Other Machine Learning Algorithms
•  Principal Component Analysis (PCA)
•  Association Rules (Apriori)
•  Topic Modeling (Parallel LDA)
•  Decision Trees
•  Random Forest
•  Conditional Random Field (CRF)
•  Clustering (K-means)
•  Cross Validation
•  Naïve Bayes
•  Support Vector Machines (SVM)
Descriptive Statistics
Sketch-Based Estimators
•  CountMin (Cormode-Muthukrishnan)
•  FM (Flajolet-Martin)
•  MFV (Most Frequent Values)
Correlation
Summary
Support Modules
Array Operations
Sparse Vectors
Random Sampling
Probability Functions
Data Preparation
PMML Export
Conjugate Gradient
Inferential Statistics
Hypothesis Tests
Time Series
•  ARIMA
Oct 2014

MADlib Advantages
•  Better parallelism
–  Algorithms designed to leverage MPP and Hadoop architecture
•  Better scalability
–  Algorithms scale as your data set scales
•  Better predictive accuracy
–  Can use all data, not a sample
•  ASF open source (incubating)
–  Available for customization and optimization

Calling MADlib Functions: Fast Training & Scoring
•  MADlib allows users to easily create models without moving data out of the systems
–  Model generation
–  Model validation
–  Scoring (evaluation of) new data
•  All the data can be used in one model
•  Built-in functionality to create multiple smaller models (e.g. classification grouped by feature)
•  Open source lets you tweak and extend methods, or build your own
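Challenges in computing OLS solution

The rows of X are spread across segments (Segment 1, Segment 2):

$$X = \begin{pmatrix} a & b \\ c & d \\ e & f \\ g & h \end{pmatrix}, \qquad X^T = \begin{pmatrix} a & c & e & g \\ b & d & f & h \end{pmatrix}$$

Data across nodes are multiplied!

$$X^T X = \begin{pmatrix} a^2+c^2+e^2+g^2 & ab+cd+ef+gh \\ ab+cd+ef+gh & b^2+d^2+f^2+h^2 \end{pmatrix}$$

Looks like the result can be decomposed:

$$X^T X = \begin{pmatrix} a \\ b \end{pmatrix}\begin{pmatrix} a & b \end{pmatrix} + \begin{pmatrix} c \\ d \end{pmatrix}\begin{pmatrix} c & d \end{pmatrix} + \begin{pmatrix} e \\ f \end{pmatrix}\begin{pmatrix} e & f \end{pmatrix} + \begin{pmatrix} g \\ h \end{pmatrix}\begin{pmatrix} g & h \end{pmatrix}$$

Each term uses a single row of X, so every segment can sum the terms for its own rows locally and only the small partial results need to be added together.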
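
The decomposition can be checked numerically. The following is a minimal pure-Python sketch, not MADlib's implementation: the helper names, the numeric values standing in for a through h, and the assignment of two rows to each segment are all assumptions made for illustration. Each segment sums the outer products of its own rows, and the per-segment results are then added.

```python
def outer(row):
    """Outer product of a row with itself: a k x k matrix."""
    return [[x * y for y in row] for x in row]

def add(m1, m2):
    """Element-wise sum of two equally sized matrices."""
    return [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

def local_xtx(rows):
    """Work done on one segment: sum of outer products of its local rows."""
    k = len(rows[0])
    acc = [[0.0] * k for _ in range(k)]
    for row in rows:
        acc = add(acc, outer(row))
    return acc

def full_xtx(x):
    """Reference result: X^T X computed with all rows in one place."""
    k = len(x[0])
    return [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]

# Hypothetical values for (a, b), (c, d) on one segment and (e, f), (g, h) on the other.
segment1 = [[1.0, 2.0], [3.0, 4.0]]
segment2 = [[5.0, 6.0], [7.0, 8.0]]

# Only two small 2 x 2 matrices are combined, regardless of the row count.
combined = add(local_xtx(segment1), local_xtx(segment2))

if __name__ == "__main__":
    print(combined)
    print(combined == full_xtx(segment1 + segment2))
```

The size of each partial result depends only on the number of columns, not on the number of rows, which is what makes this step cheap to combine across segments.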

Linear Regression on 10 Million Rows in Seconds
Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.

Contributors Welcome!
•  Web sites
–  http://hawq.incubator.apache.org/
–  http://madlib.incubator.apache.org/
–  https://cran.r-project.org/web/packages/PivotalR/index.html
•  GitHub
–  https://github.com/apache/incubator-hawq
–  https://github.com/apache/incubator-madlib
–  https://github.com/pivotalsoftware/PivotalR
