How Hadoop & Hive Change the Data Warehousing Game Forever
Dave Mariani
CEO & Founder of AtScale
@dmariani
Atscale.com
2014 Hadoop Summit
San Jose, CA
June 3, 2014
“We think only 3% of the potentially useful data is tagged, and even less is analyzed.”
Source: IDC Predictions 2013: Big Data, IDC

“90% of the data in the world today has been created in the last two years.”
Source: IBM

In 2012, 2.5 quintillion bytes of data were generated every day.
Source: IBM
SELECT { [Measures].[Counter],
         [Measures].[PreviousPeriodCounter] } ON COLUMNS,
NON EMPTY CROSSJOIN (
    exists( [Date].[Date].[Date].allmembers,
            [Date].[Date].&[2012-05-19T00:00:00] : [Date].[Date].&[2012-06-02T00:00:00] ),
    [Events].[Event].[Event].allmembers
) DIMENSION PROPERTIES MEMBER_CAPTION ON ROWS
FROM [ProductInsight]
WHERE ( { [Projects].[Project].[plusK] } )
TPC-H Query Run Times (Impala vs. HANA)
lineitem table, 60 million rows; all times in seconds

| Select Statement | Records Returned | HANA Small | Impala Small (1 Node, Parquet) | Impala Small (3 Nodes, Parquet) | Impala Small (1 Node, Text) | Impala Small (3 Nodes, Text) |
|---|---|---|---|---|---|---|
| select count(*) from lineitem | 1 | 1 | 3 | 1 | 74 | 31 |
| select count(*), sum(l_extendedprice) from lineitem | 1 | 4 | 12 | 3 | 73 | 29 |
| select l_shipmode, count(*), sum(l_extendedprice) from lineitem group by l_shipmode | 7 | 8 | 23 | 5 | 74 | 28 |
| select l_shipmode, count(*), sum(l_extendedprice) from lineitem where l_shipmode = 'AIR' group by l_shipmode | 1 | 1 | 20 | 4 | 73 | 28 |
| select l_shipmode, l_linestatus, count(*), sum(l_extendedprice) from lineitem group by l_shipmode, l_linestatus | 14 | 10 | 32 | 7 | 74 | 28 |
| select l_shipmode, l_linestatus, count(*), sum(l_extendedprice) from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' group by l_shipmode, l_linestatus | 1 | 1 | 27 | 5 | 72 | 29 |
| select count(*) from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1 | 45 | 1 | 23 | 5 | 73 | 30 |
| select l_shipmode, l_linestatus, l_extendedprice from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1 | 45 | 1 | 29 | 5 | 73 | 31 |
| select * from lineitem where l_shipmode = 'AIR' and l_linestatus = 'F' and l_suppkey = 1 | 45 | 1 | 104 | 21 | 73 | 30 |
| Data size | | 1.9Gb (5 partitions) | 3.2Gb (40 files x 80mb) | 3.2Gb (40 files x 80mb) | 7.2Gb (1 file, no compression) | 7.2Gb (1 file, no compression) |
| Est. monthly cost of production environment on AWS (HANA m2.xlarge, Impala m1.medium) | | $1022 | $175 | $350 | $175 | $350 |

Source: Aron MacDonald, http://scn.sap.com/community/developer-center/hana/blog/2013/05/30/big-data-analytics-hana-vs-hadoop-impala-on-aws
Our ability to capture data has far exceeded our ability to analyze it.
Traditional data warehousing tools have not kept pace with the growth of data.
Hadoop allows us to capture and store data economically, but traditional BI tools and approaches don’t work.
IDC: “Currently a quarter of the information in the Digital Universe would be useful for big data if it were tagged and analyzed. We think only 3% of the potentially useful data is tagged, and even less is analyzed.”
Sad panda
Happy panda!
Klout “scores” 400M people every day and collects 12B signals from various social networks that need to factor into the Klout score.
The Klout architecture is made up of open source tools.
Since we could not afford an expensive software/hardware solution, we chose Hadoop and Hive. We built a 150-node Hadoop cluster from $3K commodity nodes to create a 1.5-petabyte warehouse.
We connected SQL Server Analysis Services directly to our Hive data warehouse to provide an interactive query environment.
Google Analytics gave us great page-by-page analysis and great reports, but we couldn’t send it user-identifiable data.
Mixpanel has great support for real-time events, but we couldn’t send all the data needed to draw really interesting conclusions. Joining across data sets was still going to be a huge challenge.
We had all our data, but of course, that was about it.
We couldn’t cross the streams. We wanted to discover really interesting patterns and make advanced recommendations based on who the user was.
At Klout, we used web analytics tools like Google Analytics and Mixpanel to understand how our users interacted with our web site and mobile app. However, we could not join the usage data with our profile data. This made for an incomplete view of our users.
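To make the contrast concrete, here is the kind of question we could not answer with hosted web analytics, but could once both data sets lived in Hive. This is a hypothetical sketch; the table and column names are illustrative, not Klout’s actual schema:

-- Hypothetical HiveQL: join raw usage events with profile data.
-- Table and column names are illustrative, not Klout's actual schema.
select p.score_bucket, e.event, count(*) as events
from events e
join profiles p on e.user_id = p.user_id
where e.dt = '2014-06-01'
group by p.score_bucket, e.event;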
We decided to build a flexible, event-oriented architecture to capture all user activity events. This is the architecture.
First, we invented a simple, JSON-oriented event capture method. This allowed our web and app designers to add instrumentation without regard to how it would affect the downstream analytics applications or the Hive warehouse.
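For illustration, an event in this style might carry a small fixed envelope plus a free-form attribute map. The field names here are assumptions, not Klout’s actual schema:

{
  "event": "profile_view",
  "timestamp": "2014-06-03T18:21:07Z",
  "user_id": 123456,
  "attributes": {
    "source": "mobile_app",
    "viewed_user_id": "654321",
    "referrer": "search"
  }
}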
Next, using Flume, we mapped the semi-structured data stream into time-partitioned files in Hadoop HDFS.
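A minimal Flume 1.x agent along these lines might look like the sketch below. The source type, port, and HDFS path layout are assumptions for illustration, not Klout’s actual configuration:

# Sketch: JSON events in over HTTP, time-partitioned files out to HDFS.
agent.sources = events-in
agent.channels = mem
agent.sinks = hdfs-out

# The built-in HTTP source accepts JSON-formatted events by default.
agent.sources.events-in.type = http
agent.sources.events-in.port = 8081
agent.sources.events-in.channels = mem

agent.channels.mem.type = memory
agent.channels.mem.capacity = 10000

# Escape sequences in hdfs.path produce the time-based partitioning.
agent.sinks.hdfs-out.type = hdfs
agent.sinks.hdfs-out.channel = mem
agent.sinks.hdfs-out.hdfs.path = /data/events/dt=%Y-%m-%d/hr=%H
agent.sinks.hdfs-out.hdfs.fileType = DataStream
agent.sinks.hdfs-out.hdfs.useLocalTimeStamp = true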
We then created an EXTERNAL Hive table on top of this file structure. That allowed us to “query” the incoming files in HDFS.
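A sketch of such a table, assuming the time-partitioned layout above and one JSON event per line (the column names and SerDe choice are illustrative):

-- Sketch: an external, time-partitioned Hive table over the raw event files.
-- Column names and the JSON SerDe are assumptions for illustration.
create external table events (
  event      string,
  ts         string,
  user_id    bigint,
  attributes map<string, string>
)
partitioned by (dt string, hr string)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/data/events';

-- Register a newly landed partition so it becomes queryable:
alter table events add partition (dt='2014-06-03', hr='18');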
In order to provide an interactive query environment (OLAP), we connected SQL Server Analysis Services directly to the Hive warehouse and continuously updated a MOLAP cube with the data.
We could then hook up internally developed applications (Event Tracker) to our data by having them generate MDX (the multidimensional query language) queries and run them against our cube.
Or we could use the Hive CLI (command line interface) to execute SQL queries directly against our Hive warehouse.
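For example, an ad hoc aggregation straight from the CLI, using the illustrative events table sketched above:

hive -e "
select event, count(*) as cnt
from events
where dt = '2014-06-03'
group by event
order by cnt desc;"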
Thumbs up!
By leveraging Hive’s non-scalar data types (map, array, struct, union), we can store data in an event/attributes (key/value pairs) format. This allows us to perform transformations at query time (schema on read), keeps our data modeling (pre-structuring) to a minimum, and lets us add new data without affecting the schema. In this way, we can capture all the data we want in the simplest terms (log files) and structure it later, on read. This drastically simplifies data modeling, creates huge flexibility, and reduces cost.
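A sketch of what schema on read looks like against the illustrative events table above: the attributes map is stored untyped and pulled apart at query time, so new attributes require no schema change:

-- Schema on read: new attributes show up in the map with no alter table.
select attributes["source"]   as source,
       attributes["referrer"] as referrer,
       count(*)               as events
from events
where dt = '2014-06-03'
group by attributes["source"], attributes["referrer"];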
Apache Hive’s reliance on MapReduce as its core data processing engine makes it unsuitable for interactive queries, due to job startup times and MapReduce’s batch nature.
Several approaches are emerging to address these deficiencies while remaining Hive-catalog compatible. These developments are what make Hadoop/Hive viable as the world’s least expensive yet scalable data warehousing platform.
Source: Aron MacDonald, http://scn.sap.com/community/developer-center/hana/blog/2013/05/30/big-data-analytics-hana-vs-hadoop-impala-on-aws
Cloudera Impala, which is essentially free, performs almost as well as an expensive alternative that relies on in-memory caching to deliver its performance.
Shark, Impala, and similar engines turn Hive into a real interactive SQL query environment. This is a huge advancement and the missing piece that makes Hadoop the world’s cheapest, most scalable database.
Here’s a query that demonstrates Shark/Hive’s support for non-scalar data types:

use aw_demo;

describe factinternetsales;

-- Aggregate orders per year and style, reading the style out of the
-- product_info map column at query time, then resolve the style name
-- with a join against the dimstyle lookup table.
select a.year, s.stylename, a.num_orders
from (
  select part_year as year,
         product_info["style"] as style,
         sum(orderquantity) as num_orders
  from factinternetsales
  where part_year < 2007
  group by product_info["style"], part_year
) a
left outer join dimstyle s on a.style = s.stylekey
order by year, num_orders desc;