An overview of Hadoop and Data Warehouse from technology and business viewpoints. The presentation also includes some of my personal observations and suggestions for people who want to join the Big Data field.
3. Goals
1. Understand the technologies and the relationships between Hadoop, Big Data and Data Warehouse
2. Understand the vocabulary needed to present Big Data and Data Warehouse topics
Raise your hand when you are in doubt
4. Self-Introduction
Profile
• Name: Bui Hong Ha
• Company: SBCloud (SoftBank + Alibaba Cloud JV)
• Role: Cloud Architect
• Internet: telescreen
Skills
• Video Delivery System
• Big Data
• I built one cluster (100 nodes, 1.5 PB)
• CDH 4.3, CDH 5.4
• AWS Certified Solution Architect
• Alibaba Cloud Professional / MVP
10. Big Data Timeline
[Timeline graphic: 2004 Google MapReduce paper → 2006 Hadoop initial release → 2008 Google published the BigTable paper, HBase release, Yahoo launches its Hadoop cluster, Pig and Hive development → 2009 Google Flu Trends, "Statisticians will be the sexy job of the next decade" → 2012 YARN, Impala (MPP SQL on Hadoop) → 2014 Spark, Kudu → 2017 Beam, Big Data hype]
Big Data technologies and their hype originated from the innovations made by Google engineers/analysts and the hard work of open-source hackers
11. Hadoop: map-reduce framework
Map-Reduce first splits the data into several parts (splitting), processes those parts on different computers (mapping and shuffling), and then aggregates the results (reducing)
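To make the flow concrete, here is a minimal pure-Python sketch of the split/map/shuffle/reduce phases using word count; the function names are illustrative, not Hadoop APIs.

from collections import defaultdict

def map_phase(chunk):
    # Emit a (word, 1) pair for every word in this chunk
    return [(word, 1) for word in chunk.split()]

def shuffle_phase(mapped):
    # Group the emitted values by key, as Hadoop does between map and reduce
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each group of values into a final count
    return {word: sum(counts) for word, counts in groups.items()}

# Splitting: the input is divided into chunks processed on different machines
chunks = ["the quick brown fox jumps", "over the lazy dog the end"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]  # Mapping
print(reduce_phase(shuffle_phase(mapped)))  # Shuffling + Reducing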
12. Hadoop Architecture
Each Hadoop worker node runs 2 components: the NodeManager and the DataNode
• NodeManager: manages tasks and computing resources (CPU and memory)
• DataNode: manages data stored on local disks
13. Features of Hadoop
• Fault tolerance
• Scalability and economy
• Data locality: move computation to the data
14. Hardware
• Lots of cores with mid-frequency CPUs (to reduce energy consumption)
• Lots of memory (32 GB – 128 GB)
• Lots of HDDs (e.g., 10 data HDDs + 2 system HDDs)
• SATA (not SAS or SSD)
• No RAID (or RAID 0), excluding system areas
• Produces a huge amount of heat
Hadoop uses commodity-type servers; using specialized hardware goes against the design philosophy of Hadoop
15. Network and Rack Designs
Hadoop tasks involve a lot of moving data around, and "moving data around" generates heavy network traffic
• 10 HDDs × 100 MB/s ≈ 8 Gbps (far more than 1 Gbps Ethernet can carry)
Design strategy:
• 10G switches for top-of-rack switches
• 40G switches for core switches
• Enable "rack-awareness" in Hadoop
Hadoop performance comes not only from the power of the machines in the cluster but also from how we design the cluster network
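As a sketch of how rack-awareness is typically enabled: Hadoop invokes a user-supplied topology script (configured through the net.topology.script.file.name property in core-site.xml) that maps each node to a rack path. A minimal hypothetical script in Python; the IP-to-rack mapping below is made up and would need to match your own network layout:

#!/usr/bin/env python
# Hypothetical Hadoop rack-topology script: Hadoop passes node IPs/hostnames
# as arguments and expects one rack path per argument on stdout.
import sys

# Made-up mapping from IP prefix to rack path.
RACKS = {
    "10.0.1.": "/dc1/rack1",
    "10.0.2.": "/dc1/rack2",
}
DEFAULT_RACK = "/dc1/default-rack"

for node in sys.argv[1:]:
    rack = next((r for p, r in RACKS.items() if node.startswith(p)), DEFAULT_RACK)
    print(rack)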
16. Pig
- High-level platform for creating programs that run on Hadoop
- Jobs run on:
  - Map-Reduce
  - Spark
  - Apache Tez
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\w+';
-- create a group for each word
word_groups = GROUP filtered_words BY word;
-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
The ideas for Pig come from Sawzall, developed at Google by legendary programmer Rob Pike
https://static.googleusercontent.com/media/research.google.com/en//archive/sawzall-sciprog.pdf
17. Hive
- Supports an SQL-like query language: HiveQL
- Compatible with multiple processing frameworks:
  - MapReduce
  - Apache Tez
  - Spark
Traditional data analysis and reporting tools require SQL-like query languages, hence the need for SQL on Hadoop
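As a hedged illustration of what HiveQL looks like (the table name `lines` and its single `line STRING` column are assumptions, not from the deck), here is the word count from the Pig slide expressed as a HiveQL query submitted from Python through the hive CLI:

import subprocess

# Hypothetical HiveQL word count, equivalent to the Pig example above.
# Assumes a table `lines(line STRING)` already exists in the default database.
query = """
SELECT word, COUNT(*) AS cnt
FROM (SELECT explode(split(line, ' ')) AS word FROM lines) w
GROUP BY word
ORDER BY cnt DESC;
"""

# `hive -e` executes a query string passed on the command line.
subprocess.run(["hive", "-e", query], check=True)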
20. Data Warehouse vs Transactional Database
Suitable workloads
• Data Warehouse: analytics, Big Data
• Transactional Database: transaction processing
Types of operations
• Data Warehouse: optimized for batched write operations and for reading high volumes of data, to minimize I/O and maximize data throughput
• Transactional Database: optimized for continuous write operations and high volumes of small read operations, to maximize transaction throughput
Data normalization
• Data Warehouse: employs denormalized schemas such as the Star schema and Snowflake schema
• Transactional Database: employs highly normalized schemas, which are better suited to high transaction throughput requirements
Storage
• Data Warehouse: requires columnar or other specialized storage
• Transactional Database: row-oriented storage that keeps whole rows in a physical block
22. OLTP: Forms of Data Normalization
First Normal Form (1NF)
“An entity type is in 1NF when it contains no repeating groups of data.”
Second Normal Form (2NF)
“An entity type is in 2NF when it is in 1NF and when all of its non-key attributes are fully dependent on its Primary Key”
Third Normal Form (3NF)
“An entity type is in 3NF when it is in 2NF and when all of its attributes are directly dependent on the Primary Key”
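To make 1NF concrete, here is a toy illustration in Python of removing a repeating group; the order data is made up:

# Toy example (made-up data): a 1NF violation and its fix.
# Violates 1NF: the items form a repeating group inside one order record.
order_unnormalized = {
    "order_id": 1001,
    "customer": "Alice",
    "item1": "pen", "item2": "notebook", "item3": "eraser",
}

# 1NF: move the repeating group into its own rows, keyed by the order.
orders = [
    {"order_id": 1001, "customer": "Alice"},
]
order_items = [
    {"order_id": 1001, "item": "pen"},
    {"order_id": 1001, "item": "notebook"},
    {"order_id": 1001, "item": "eraser"},
]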
23. OLAP: Data Modeling
The FACT TABLE holds the PRIMARY KEYS of the DIMENSION TABLES as foreign keys; analysis queries work by JOINing the FACT and DIMENSION tables
[Diagrams: an abstract star schema and a detailed example of a star schema]
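A toy sketch of the idea in Python with pandas (all table names and contents are made up): the fact table stores dimension keys plus a measure, and analysis joins it back to the dimension tables:

import pandas as pd

# Made-up star schema: one fact table plus two dimension tables.
dim_product = pd.DataFrame({"product_id": [1, 2], "product": ["pen", "book"]})
dim_store = pd.DataFrame({"store_id": [10, 20], "city": ["Tokyo", "Osaka"]})
fact_sales = pd.DataFrame({
    "product_id": [1, 2, 1],
    "store_id": [10, 10, 20],
    "amount": [120, 450, 80],   # the measure
})

# Analysis = JOIN the fact table with the dimensions, then aggregate.
report = (fact_sales
          .merge(dim_product, on="product_id")
          .merge(dim_store, on="store_id")
          .groupby(["city", "product"])["amount"].sum())
print(report)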
24. Columnar vs Row Storage
Columnar storage is used when queries read only some fields of a table
• Values in the same column share a data type, so they compress well
• Only the queried columns are read from disk
Row storage is used when queries read all fields of a table
• Whole rows can be fetched by primary key
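A toy illustration in Python of the two layouts (data made up): reading one field touches every record in the row layout but only a single array in the columnar layout:

# Made-up table with three fields, stored two ways.
# Row layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "name": "Alice", "age": 30},
    {"id": 2, "name": "Bob", "age": 25},
]

# Columnar layout: each field is stored contiguously as its own array.
columns = {
    "id": [1, 2],
    "name": ["Alice", "Bob"],
    "age": [30, 25],
}

# "SELECT age FROM t": the row layout scans every record...
ages_from_rows = [r["age"] for r in rows]
# ...while the columnar layout reads just the one column.
ages_from_columns = columns["age"]
assert ages_from_rows == ages_from_columns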
26. Big Data Timeline (recap of slide 10)
27. The 3 Vs of Big Data
• Volume
• Velocity
• Variety
Plus two Vs that are often added:
• Value
• Veracity
28. Hype Cycle 2011: On Radar (Nobody even knows what Big Data is)
33. “But what’s happening is that big data has quickly moved over the Peak of Inflated Expectations, and has become prevalent in our lives across many hype cycles. So big data has become a part of many hype cycles.”
— Betsy Burton
35. Obs + Sugg 1: mrjob is good for learning
• https://github.com/Yelp/mrjob
• Python
• Run on local machine or clusters
• Hadoop streaming
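A minimal word-count job with mrjob, close to the canonical example in the mrjob documentation:

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordCount(MRJob):
    # mapper is called once per input line; the input key is unused here
    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    # reducer receives all the counts emitted for one word
    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()

Run it locally with `python mr_word_count.py input.txt`, or submit it to a cluster over Hadoop streaming with `-r hadoop`.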
http://calcite.apache.org/docs/stream.html
https://hadoop.apache.org/docs/current/hadoop-streaming/HadoopStreaming.html
36. Obs + Sugg 2: Moving to the Cloud
[Diagram: on-premise vs cloud-based Big Data architectures]
37. Obs + Sugg 3: Data Scientists use SQL
Hadoop is solely a data processing framework
• Map-Reduce is primitive
• Sometimes an overkill solution
SQL is great
• Mature analysis tools: BI, UI