2. What’s MPP in data warehousing?
MPP (massively parallel processing) data warehouse systems
differ from SMP (symmetric multiprocessing) databases in four ways:
1. Use shared-nothing architectures, with no single point of failure and often hot-swappable components
2. Scale horizontally by adding nodes, rather than moving to a server with more CPUs or higher storage capacity
3. Break large queries across nodes for simultaneous processing
4. Achieve higher data ingestion rates through parallelized data movement
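To make points 1 and 3 concrete, here is a minimal sketch of how a shared-nothing system might split one aggregate query across nodes and merge the partial results. It is an illustration only, not any vendor's implementation: the node count, the `SALES` rows, and every function name are hypothetical, and threads stand in for separate servers.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_NODES = 4  # hypothetical cluster size


def distribute(rows, num_nodes=NUM_NODES):
    """Hash-distribute rows by order id so each node owns a disjoint slice."""
    nodes = [[] for _ in range(num_nodes)]
    for order_id, region, amount in rows:
        target = zlib.crc32(str(order_id).encode()) % num_nodes
        nodes[target].append((order_id, region, amount))
    return nodes


def local_sum(node_rows):
    """Work done on one node: SUM(amount) GROUP BY region over local rows only."""
    partial = defaultdict(int)
    for _order_id, region, amount in node_rows:
        partial[region] += amount
    return dict(partial)


def parallel_sum_by_region(rows, num_nodes=NUM_NODES):
    """Scatter the query to every node, then merge the partial aggregates."""
    nodes = distribute(rows, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(local_sum, nodes))
    total = defaultdict(int)
    for partial in partials:
        for region, amount in partial.items():
            total[region] += amount
    return dict(total)


# Hypothetical sample data: (order_id, region, amount)
SALES = [(1, "east", 10), (2, "west", 5), (3, "east", 7), (4, "north", 3), (5, "west", 1)]

if __name__ == "__main__":
    print(parallel_sum_by_region(SALES))
```

Adding a node in this sketch only changes `NUM_NODES`; each node still scans just its own slice, which is the intuition behind horizontal scaling.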
3. Who are the players?
Previously, we discussed just the specialized MPP data warehouse vendors:
Teradata
Netezza
Vertica
Greenplum
But we should keep in mind that most major database vendors also have
their own MPP products for data warehousing. Examples include:
Microsoft PDW (Parallel Data Warehouse)
DB2 UDB with Database Partitioning Feature (DPF)
Oracle Big Data Appliance, which just provides a gateway between Hadoop and
its SMP RDBMS platform
Finally, we need to consider the emergence of SQL-oriented, low-latency
Hadoop solutions. Examples include:
Impala; Stinger; Apache Drill; Phoenix; Shark; Hadapt
Teradata’s SQL-H (Aster Data); EMC’s HAWQ; IBM’s BigSQL
See related writeup: http://www.slideshare.net/DavidPortnoy/hybrid-data-warehouse-hadoop-implementations
4. How do the architectures compare?
Looking at the specialized MPP data warehouse vendors
Hardware
  Teradata: Custom MPP, shared nothing
  Netezza: Custom MPP (SPU + FPGA logic)
  Greenplum: Commodity hardware
  Vertica: Custom hybrid MPP, shared everything
Type of processing
  Teradata: OLTP or OLAP; can handle high user load
  Netezza: OLAP; assumes few users for heavy analytics
  Greenplum: OLAP
  Vertica: OLAP optimized for large fact tables
Inception / Maturity
  Teradata: 1979, from Caltech
  Netezza: 2000, by Saxena & Hinshaw
  Greenplum: 2003, from Metapa & Didera
  Vertica: 2005, by MIT’s Stonebraker
Performance & maintenance
  Teradata: Auto-recommended optimization; columnar compression available
  Netezza: No need for performance tuning; must manually reclaim space
  Greenplum: Based on PostgreSQL, but optimized for MPP and enterprise maintenance
  Vertica: Column-oriented optimization for ingestion, storage/compression, and access
Hardware (proprietary or commodity)
  Teradata: Proprietary
  Netezza: Proprietary
  Greenplum: Commodity
  Vertica: Commodity
Definitions
* OLAP: Online Analytical Processing
* OLTP: Online Transaction Processing
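The "column oriented" entry for storage/compression in the comparison above can be illustrated with a small sketch. It pivots rows into columns and run-length encodes each one, showing why repetitive columns in a fact table tend to compress well. This is a generic illustration, not a description of how any of the listed products actually stores data; the sample rows and function names are hypothetical.

```python
from itertools import groupby


def to_columns(rows, names):
    """Pivot row tuples into one list of values per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def rle_encode(values):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical fact-table rows: (day, region, amount)
ROWS = [
    ("2013-01-01", "east", 10),
    ("2013-01-01", "east", 7),
    ("2013-01-01", "west", 5),
    ("2013-01-02", "west", 1),
]

columns = to_columns(ROWS, ["day", "region", "amount"])
encoded_day = rle_encode(columns["day"])
encoded_region = rle_encode(columns["region"])

if __name__ == "__main__":
    print(encoded_day)     # runs of identical dates collapse into pairs
    print(encoded_region)  # same for the region column
```

A query that only needs `amount` can read that one list and skip the others, which is the access-side benefit of the same layout.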
5. The industry is moving towards open, commodity solutions
Traditional database servers, such as IBM DB2, Oracle Exadata and
Microsoft SQL Server, license proprietary software but run on
commodity hardware, although the nature of SMP architecture typically
favors a few large, expensive servers.
But the biggest MPP data warehouse vendors all have proprietary
software. That’s despite the fact that Netezza and Vertica were built on the
open source PostgreSQL database. Teradata and Netezza even
implement custom hardware, which drives up the price.
Hadoop has open sourced the software component, leading to a vibrant
ecosystem of tools and applications. And with built-in redundancy, it’s
easy to deploy on cheap commodity servers.
6. So the trend looks something like this
[Figure: chart positioning Traditional Database, MPP Data Warehouse, and Hadoop along two axes: hardware (specialized vs. commodity) and software (proprietary vs. open source, standardized).]
** While the up-front cost of Hadoop may be lower, the TCO (total cost of ownership)
could be much higher. This is due to the maturity of the product, the complexity of
solutions, and the scarcity of talent.
7. So let’s look at the relative cost breakdown
Teradata: Hardware and licenses are the most expensive of all options. Staff costs can be high, and it takes a great deal of effort to configure and administer.
IBM Netezza: Hardware and licenses used to cost much less than Teradata, but prices have been converging. Staff costs are some of the highest due to scarcity, but that’s tempered by the lower effort to configure and administer a single-purpose appliance.
Greenplum: Commodity hardware. Moderately priced licenses. Few Greenplum specialists, but can be staffed by PostgreSQL DBAs and developers.
Vertica: Commodity hardware. Moderately priced licenses, but special-purpose orientation limits usefulness. Few specialists, but can be staffed by traditional DBAs and developers.
Hadoop / HBase: Commodity hardware and no license cost, resulting in the lowest up-front cost. Likely to buy more hardware for redundancy and load. But it requires highly technical staff, and implementation is less productive than with more mature options.
[Figure: relative cost bar for each platform, split into hardware, licenses, and development segments.]
8. What’s their relative adoption today?
Comparing the supply and demand for administrators and developers can
be a proxy for the strength and staying power of a platform.
Teradata has been around for many years longer than the alternatives and
still dominates the market in terms of install base (3 times the next rival) and
vibrant development community (6 times the next rival).
But in recent years Hadoop solutions have outstripped Teradata by a
significant margin. (Of course, it should be noted that Hadoop includes use
cases outside of traditional data warehousing.)
9. Over time, interest in market leader Teradata has been consistent, but flat
While Netezza, Vertica, and Greenplum have grown, they didn’t take significant
market share away from Teradata.
(The spike in Netezza interest is attributed to its acquisition by IBM.)
10. But when Hadoop is added into the mix, the picture changes drastically
Interest in Hadoop has quickly overtaken even traditional Teradata.
That might explain why Teradata has been on an acquisition spree for
Hadoop-related products and services, such as Aster Data.
The future of its next biggest rival, Netezza, is uncertain as it seeks its
niche within IBM’s product lineup.
11. Related Reading
Hybrid Data Warehouse-Hadoop Implementations:
http://www.slideshare.net/DavidPortnoy/hybrid-data-warehouse-hadoop-implementations
Agile Business Intelligence:
http://www.slideshare.net/DavidPortnoy/agile-bi-18491924
Blog:
http://david.portnoy.us