An overview of Hadoop and Data Warehouse from technology and business viewpoints. The presentation also includes some of my personal observations and suggestions for people who want to join the Big Data field.
3. Goals
1. Understand the technologies and the relationships between Hadoop, Big Data and Data Warehouse
2. Understand the vocabulary needed to present Big Data and Data Warehouse topics
Raise your hand when you are in doubt
4. Self-Introduction
Profile
• Name: Bui Hong Ha
• Company: SBCloud (SoftBank + Alibaba Cloud JV)
• Role: Cloud Architect
• Internet: telescreen
Skills
• Video Delivery System
• Big Data
• I built one cluster (100 nodes, 1.5 PB)
• CDH 4.3, CDH 5.4
• AWS Certified Solution Architect
• Alibaba Cloud Professional / MVP
10. Big Data Timeline
[Timeline graphic: 2004 Google MapReduce paper → 2006 Hadoop initial release → 2008 Google published the BigTable paper, HBase release, Yahoo launches its Hadoop cluster, Pig and Hive development → 2009 Google Flu Trends, "Statisticians will be the sexy job of the next decade" → 2012 YARN, Impala (MPP SQL on Hadoop) → 2014 Spark, Kudu → 2017 Beam, Big Data hype]
Big Data technologies and their hype originated from the innovations made by Google engineers/analysts and the hard work of open-source hackers
11. Hadoop: map-reduce framework
Map-Reduce first splits the data into several parts (splitting), processes those parts on different computers (mapping and shuffling), and then aggregates the results (reducing)
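To make the flow concrete, here is a minimal pure-Python sketch of the split/map/shuffle/reduce phases using word count; the function names are illustrative, not Hadoop APIs.

from collections import defaultdict

def map_phase(chunk):
    # Emit a (word, 1) pair for every word in this chunk
    return [(word, 1) for word in chunk.split()]

def shuffle_phase(mapped):
    # Group the emitted values by key, as Hadoop does between map and reduce
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each group of values into a final count
    return {word: sum(counts) for word, counts in groups.items()}

# Splitting: the input is divided into chunks processed on different machines
chunks = ["the quick brown fox jumps", "over the lazy dog the end"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]  # Mapping
print(reduce_phase(shuffle_phase(mapped)))  # Shuffling + Reducing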
12. Hadoop Architecture
Each Hadoop worker node runs 2 components: the NodeManager and the DataNode
• NodeManager: manages tasks and computing resources (CPU and memory)
• DataNode: manages data stored on local disks
13. Features of Hadoop
• Fault tolerance
• Scalability and economy
• Data locality: move computation to the data
14. Hardware
• Lots of cores with mid-frequency CPUs (to reduce energy consumption)
• Lots of memory (32 GB – 128 GB)
• Lots of HDDs (e.g., 10 data HDDs + 2 system HDDs)
• SATA (not SAS or SSD)
• No RAID (or RAID 0), excluding system areas
• Produces a huge amount of heat
Hadoop uses commodity-type servers; using specialized hardware goes against the design philosophy of Hadoop
15. Network and Rack Designs
Hadoop tasks involve a lot of moving data around, and "moving data around" generates heavy network traffic
• 10 HDDs × 100 MB/s ≈ 8 Gbps (far more than 1 Gbps Ethernet can carry)
Design strategy:
• 10G switches for top-of-rack switches
• 40G switches for core switches
• Enable "rack-awareness" in Hadoop
Hadoop performance comes not only from the power of the machines in the cluster but also from how we design the cluster network
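As a sketch of how rack-awareness is typically enabled: Hadoop invokes a user-supplied topology script (configured through the net.topology.script.file.name property in core-site.xml) that maps each node to a rack path. A minimal hypothetical script in Python; the IP-to-rack mapping below is made up and would need to match your own network layout:

#!/usr/bin/env python
# Hypothetical Hadoop rack-topology script: Hadoop passes node IPs/hostnames
# as arguments and expects one rack path per argument on stdout.
import sys

# Made-up mapping from IP prefix to rack path.
RACKS = {
    "10.0.1.": "/dc1/rack1",
    "10.0.2.": "/dc1/rack2",
}
DEFAULT_RACK = "/dc1/default-rack"

for node in sys.argv[1:]:
    rack = next((r for p, r in RACKS.items() if node.startswith(p)), DEFAULT_RACK)
    print(rack)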
16. Pig
- High-level platform for creating programs that run on Hadoop
- Jobs run on:
  - Map-Reduce
  - Spark
  - Apache Tez
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\w+';
-- create a group for each word
word_groups = GROUP filtered_words BY word;
-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
The ideas for Pig come from Sawzall, developed at Google by legendary programmer Rob Pike
https://static.googleusercontent.com/media/research.google.com/en//archive/sawzall-sciprog.pdf
17. Hive
- Supports an SQL-like query language: HiveQL
- Compatible with multiple processing frameworks:
  - MapReduce
  - Apache Tez
  - Spark
Traditional data analysis and reporting tools require SQL-like query languages, hence the need for SQL on Hadoop
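As a hedged illustration of what HiveQL looks like (the table name `lines` and its single `line STRING` column are assumptions, not from the deck), here is the word count from the Pig slide expressed as a HiveQL query submitted from Python through the hive CLI:

import subprocess

# Hypothetical HiveQL word count, equivalent to the Pig example above.
# Assumes a table `lines(line STRING)` already exists in the default database.
query = """
SELECT word, COUNT(*) AS cnt
FROM (SELECT explode(split(line, ' ')) AS word FROM lines) w
GROUP BY word
ORDER BY cnt DESC;
"""

# `hive -e` executes a query string passed on the command line.
subprocess.run(["hive", "-e", query], check=True)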
20. Data Warehouse vs Transactional Database
Suitable workloads
• Data Warehouse: analytics, Big Data
• Transactional Database: transaction processing
Types of operations
• Data Warehouse: optimized for batched write operations and for reading high volumes of data, to minimize I/O and maximize data throughput
• Transactional Database: optimized for continuous write operations and high volumes of small read operations, to maximize transaction throughput
Data normalization
• Data Warehouse: employs denormalized schemas such as the Star schema and Snowflake schema
• Transactional Database: employs highly normalized schemas, which are better suited to high transaction throughput requirements
Storage
• Data Warehouse: requires columnar or other specialized storage
• Transactional Database: row-oriented storage that keeps whole rows in a physical block
22. OLTP: Forms of Data Normalization
First Normal Form (1NF)
“An entity type is in 1NF when it contains no repeating groups of data.”
Second Normal Form (2NF)
“An entity type is in 2NF when it is in 1NF and when all of its non-key attributes are fully dependent on its Primary Key”
Third Normal Form (3NF)
“An entity type is in 3NF when it is in 2NF and when all of its attributes are directly dependent on the Primary Key”
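To make 1NF concrete, here is a toy illustration in Python of removing a repeating group; the order data is made up:

# Toy example (made-up data): a 1NF violation and its fix.
# Violates 1NF: the items form a repeating group inside one order record.
order_unnormalized = {
    "order_id": 1001,
    "customer": "Alice",
    "item1": "pen", "item2": "notebook", "item3": "eraser",
}

# 1NF: move the repeating group into its own rows, keyed by the order.
orders = [
    {"order_id": 1001, "customer": "Alice"},
]
order_items = [
    {"order_id": 1001, "item": "pen"},
    {"order_id": 1001, "item": "notebook"},
    {"order_id": 1001, "item": "eraser"},
]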
23. OLAP: Data Modeling
The FACT TABLE holds the PRIMARY KEYS of the DIMENSION TABLES as foreign keys; analysis queries work by JOINing the FACT and DIMENSION tables
[Diagrams: an abstract star schema and a detailed example of a star schema]
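A toy sketch of the idea in Python with pandas (all table names and contents are made up): the fact table stores dimension keys plus a measure, and analysis joins it back to the dimension tables:

import pandas as pd

# Made-up star schema: one fact table plus two dimension tables.
dim_product = pd.DataFrame({"product_id": [1, 2], "product": ["pen", "book"]})
dim_store = pd.DataFrame({"store_id": [10, 20], "city": ["Tokyo", "Osaka"]})
fact_sales = pd.DataFrame({
    "product_id": [1, 2, 1],
    "store_id": [10, 10, 20],
    "amount": [120, 450, 80],   # the measure
})

# Analysis = JOIN the fact table with the dimensions, then aggregate.
report = (fact_sales
          .merge(dim_product, on="product_id")
          .merge(dim_store, on="store_id")
          .groupby(["city", "product"])["amount"].sum())
print(report)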
24. Columnar vs Row Storage
Columnar storage is used when queries read only some fields of a table
• Values in the same column share a data type, so they compress well
• Only the queried columns are read from disk
Row storage is used when queries read all fields of a table
• Whole rows can be fetched by primary key
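A toy illustration in Python of the two layouts (data made up): reading one field touches every record in the row layout but only a single array in the columnar layout:

# Made-up table with three fields, stored two ways.
# Row layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "name": "Alice", "age": 30},
    {"id": 2, "name": "Bob", "age": 25},
]

# Columnar layout: each field is stored contiguously as its own array.
columns = {
    "id": [1, 2],
    "name": ["Alice", "Bob"],
    "age": [30, 25],
}

# "SELECT age FROM t": the row layout scans every record...
ages_from_rows = [r["age"] for r in rows]
# ...while the columnar layout reads just the one column.
ages_from_columns = columns["age"]
assert ages_from_rows == ages_from_columns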
26. Big Data Timeline (recap of slide 10)
27. The 3 Vs of Big Data
• Volume
• Velocity
• Variety
Plus two Vs that are often added:
• Value
• Veracity
28. Hype Cycle 2011: On Radar (Nobody even knows what Big Data is)
33. “But what’s happening is that big data has quickly moved over the Peak of Inflated Expectations, and has become prevalent in our lives across many hype cycles. So big data has become a part of many hype cycles.”
— Betsy Burton
35. Obs + Sugg 1: mrjob is good for learning
• https://github.com/Yelp/mrjob
• Python
• Run on local machine or clusters
• Hadoop streaming
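A minimal word-count job with mrjob, close to the canonical example in the mrjob documentation:

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordCount(MRJob):
    # mapper is called once per input line; the input key is unused here
    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    # reducer receives all the counts emitted for one word
    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()

Run it locally with `python mr_word_count.py input.txt`, or submit it to a cluster over Hadoop streaming with `-r hadoop`.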
http://calcite.apache.org/docs/stream.html
https://hadoop.apache.org/docs/current/hadoop-streaming/HadoopStreaming.html
36. Obs + Sugg 2: Moving to the Cloud
[Diagram: on-premise vs cloud-based Big Data architectures]
37. Obs + Sugg 3: Data Scientists use SQL
Hadoop is solely a data processing framework
• Map-Reduce is primitive
• Sometimes an overkill solution
SQL is great
• Mature analysis tools: BI, UI