From Hadoop to Data Warehouse
Bui Hong Ha
2018/3/31
For Vietnamese AI Community in Japan
2018/4/1 1
Agenda
1. Hadoop Technologies
2. Data Warehouse
3. From Data Warehouse to Big Data
4. Observations
Goals
1. Understand the technologies behind Hadoop, Big Data and Data Warehouse, and how they relate to each other
2. Understand the vocabulary needed to present about Big Data and Data Warehouse
Raise your hand whenever you are in doubt
Self-Introduction
Profile
• Name: Bui Hong Ha
• Company: SBCloud (SoftBank + Alibaba Cloud JV)
• Role: Cloud Architect
• Internet: telescreen
Skills
• Video Delivery System
• Big Data
• I built one cluster (100 nodes, 1.5 PB)
• CDH 4.3, CDH 5.4
• AWS Certified Solutions Architect
• Alibaba Cloud Professional / MVP
Interests: taking photos with famous people
Quiz
Software Positions
1. Hadoop technologies
1. Hadoop
2. Query methods
3. UI
Big Data Timeline
• 2004: Google publishes the MapReduce paper
• 2006: Hadoop initial release; Google publishes the BigTable paper
• 2008: HBase release; Yahoo launches its Hadoop cluster; Pig and Hive development
• 2009: "Statisticians will be the next sexy job in the next decade"; Google Flu Trends
• 2012: YARN; Impala (MPP SQL on Hadoop)
• 2014: Spark
• 2017: Kudu; Beam
Big Data Hype
Big Data technologies and the hype around them originated from the innovations of Google engineers and analysts and from the hard work of open-source hackers.
Hadoop: MapReduce framework
MapReduce first splits the input data into several parts (splitting), processes those parts on different computers (mapping and shuffling), and then aggregates the results (reducing).
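The split, map, shuffle and reduce phases described above can be sketched on a single machine in plain Python. This is only an illustration of the data flow, not Hadoop's API: the function names (split_input, map_phase, shuffle, reduce_phase) are made up for this example, and a real cluster would run each part on a different node.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def split_input(lines: List[str], num_parts: int) -> List[List[str]]:
    """Splitting: divide the input lines into num_parts roughly equal parts."""
    return [lines[i::num_parts] for i in range(num_parts)]


def map_phase(part: List[str]) -> List[Tuple[str, int]]:
    """Mapping: emit a (word, 1) pair for every word in one part."""
    return [(word, 1) for line in part for word in line.split()]


def shuffle(mapped_parts: List[List[Tuple[str, int]]]) -> Dict[str, List[int]]:
    """Shuffling: group all emitted values by key across every part."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for pairs in mapped_parts:
        for word, count in pairs:
            groups[word].append(count)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Reducing: aggregate the grouped values into one result per key."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines: List[str], num_parts: int = 2) -> Dict[str, int]:
    """Run the four phases in order on an in-memory list of lines."""
    parts = split_input(lines, num_parts)
    mapped = [map_phase(part) for part in parts]
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["big data is big", "data is data"]))
```

In Hadoop the shuffle step is what moves data over the network, which is why it matters for cluster design.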
Hadoop Architecture
Each Hadoop worker node runs 2 components: a Node Manager and a Data Manager (the DataNode in HDFS terms)
• Node Manager: manages tasks and computing resources (CPU and memory)
• Data Manager: manages the data stored on local disks
Features of Hadoop
• Fault tolerance
• Scalability (economical)
• Data locality
• Move computation to the data
Hardware
• Lots of cores, average-frequency CPUs (to reduce energy consumption)
• Lots of memory (32 GB to 128 GB)
• Lots of HDDs (10 HDDs + 2 HDDs)
• SATA (not SAS, not SSD)
• No RAID (or RAID 0), except for system areas
• Produces a huge amount of heat
Hadoop uses commodity servers. Using special hardware goes against the design philosophy of Hadoop.
Network and Rack Designs
• Hadoop tasks involve moving a lot of data around
• Moving data around produces heavy network traffic
• 10 HDDs × 100 MB/s ≈ 8 Gbps of disk throughput (vs. 1 Gbps Ethernet)
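The arithmetic behind the 8 Gbps figure can be checked with a few lines of Python. This is a rough sketch that assumes decimal units (1 MB = 10^6 bytes, 1 Gbps = 10^9 bits per second) and that all disks stream at their full sequential rate at the same time; the function name is made up for this example.

```python
def aggregate_disk_gbps(num_disks: int, mb_per_s_per_disk: float) -> float:
    """Convert the combined disk throughput of one server to gigabits per second."""
    megabytes_per_s = num_disks * mb_per_s_per_disk
    megabits_per_s = megabytes_per_s * 8  # 8 bits per byte
    return megabits_per_s / 1000          # 1000 megabits per gigabit


if __name__ == "__main__":
    disk_gbps = aggregate_disk_gbps(num_disks=10, mb_per_s_per_disk=100)
    link_gbps = 1.0  # a 1 Gbps Ethernet link, as on the slide
    print(f"disks: {disk_gbps} Gbps, link: {link_gbps} Gbps, "
          f"ratio: {disk_gbps / link_gbps:.0f}x")
```

Under these assumptions the disks can deliver about eight times what a 1 Gbps link can carry, so the network becomes the bottleneck before the disks do.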
Design Strategy
• 10G switches for top-of-rack switches
• 40G switches for core switches
• Enable "rack awareness" in Hadoop
Hadoop performance comes not only from the power of the machines in the cluster but also from how the cluster network is designed.
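Rack awareness is commonly enabled by pointing the Hadoop property net.topology.script.file.name (in core-site.xml) at an executable that receives node addresses as arguments and prints one rack path per address. The script below is a minimal sketch of such an executable. The subnet-to-rack table and the rack names are hypothetical; a real cluster would use its own addressing plan.

```python
#!/usr/bin/env python3
import sys
from typing import Dict, List

# Hypothetical mapping from IP prefix to rack path; replace with the real layout.
RACK_BY_PREFIX: Dict[str, str] = {
    "10.0.1.": "/dc1/rack1",
    "10.0.2.": "/dc1/rack2",
}

# Fallback used for any address that matches no known prefix.
DEFAULT_RACK = "/default-rack"


def resolve_rack(address: str) -> str:
    """Return the rack path for one address, or the default rack if unknown."""
    for prefix, rack in RACK_BY_PREFIX.items():
        if address.startswith(prefix):
            return rack
    return DEFAULT_RACK


def resolve_all(addresses: List[str]) -> List[str]:
    """Resolve every address, keeping the order of the input arguments."""
    return [resolve_rack(address) for address in addresses]


if __name__ == "__main__":
    # Hadoop passes one or more addresses as command-line arguments.
    print("\n".join(resolve_all(sys.argv[1:])))
```

With rack information available, Hadoop can place block replicas on different racks and prefer reading from nearby nodes, which reduces traffic over the core switches.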
Pig
- High-level platform for creating programs that run on Hadoop
- Jobs run on
- MapReduce
- Spark
- Apache Tez
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';
-- create a group for each word
word_groups = GROUP filtered_words BY word;
-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
Ideas for Pig come from Sawzall, developed by legendary Google programmer Rob Pike
https://static.googleusercontent.com/media/research.google.com/en//archive/sawzall-sciprog.pdf
Hive
- Supports SQL-like queries: HiveQL
- Runs on several processing frameworks
- MapReduce
- Apache Tez
- Spark
Traditional data analysis and reporting tools require SQL-like query languages → hence the need for SQL on Hadoop
Hue
2. Data Warehouse Technologies
1. OLAP vs OLTP
2. Column vs Row storage
Data Warehouse vs Transactional Database
Suitable workloads
• Data Warehouse: analytics, Big Data
• Transactional Database: transaction processing
Types of operations
• Data Warehouse: optimized for batched write operations and for reading high volumes of data, to minimize I/O and maximize data throughput
• Transactional Database: optimized for continuous write operations and high volumes of small read operations, to maximize transaction throughput
Data normalization
• Data Warehouse: employs denormalized schemas like the Star schema and the Snowflake schema
• Transactional Database: employs highly normalized schemas, which are better suited to high transaction throughput requirements
Storage
• Data Warehouse: requires columnar or other specialized storage
• Transactional Database: row-oriented databases that store whole rows in a physical block
Analytical vs Transactional (OLAP vs OLTP)
Source: Understanding Analytic Workloads, IBM
OLTP: Forms of Data Normalization
First Normal Form (1NF)
“An entity type is in 1NF when it contains no repeating groups of data.”
Second Normal Form (2NF)
“An entity type is in 2NF when it is in 1NF and when all of its non-key attributes are fully dependent on its Primary Key”
Third Normal Form (3NF)
“An entity type is in 3NF when it is in 2NF and when all of its attributes are directly dependent on the Primary Key”
OLAP: Data Modeling
The FACT TABLE holds keys that reference the PRIMARY KEY of every DIMENSION TABLE. Analysis queries JOIN the FACT table with the DIMENSION tables.
Figures: abstract star schema; detailed example of a star schema
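A tiny star schema makes the fact and dimension relationship concrete. The sketch below uses Python's built-in sqlite3 module as a stand-in for a data warehouse; the table names (fact_sales, dim_product, dim_date) and the sample rows are invented for this example.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")

# Two dimension tables and one fact table that references both of them.
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'book'), (2, 'music');
INSERT INTO dim_date VALUES (10, 2017), (11, 2018);
INSERT INTO fact_sales VALUES (1, 10, 5.0), (1, 11, 7.0), (2, 11, 3.0);
""")

# A typical analytical query: join the fact table to its dimensions,
# then aggregate the measure by dimension attributes.
rows = conn.execute("""
SELECT p.category, d.year, SUM(f.amount)
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_id = f.product_id
JOIN dim_date AS d ON d.date_id = f.date_id
GROUP BY p.category, d.year
ORDER BY p.category, d.year
""").fetchall()

if __name__ == "__main__":
    for category, year, total in rows:
        print(category, year, total)
```

The fact table stays narrow (keys plus measures) while descriptive attributes live in the dimensions, which is what keeps such joins and aggregations simple to write.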
Columnar vs Row Storage
• Columnar storage is used when only some fields are queried
• Same column → same data type
• Only the queried columns are read
• Row storage is used when all fields of a table are queried
• All fields of a row can be fetched by primary key
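The difference between the two layouts can be illustrated with plain Python data structures. This is a conceptual sketch only: real storage engines work on disk blocks and add compression, and the sample records and function names here are invented.

```python
from typing import Any, Dict, List

# Row layout: each record keeps all of its fields together.
ROWS: List[Dict[str, Any]] = [
    {"id": 1, "name": "pen", "price": 10},
    {"id": 2, "name": "book", "price": 25},
    {"id": 3, "name": "bag", "price": 40},
]


def to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Column layout: one list per field, so every list holds a single data type."""
    return {field: [row[field] for row in rows] for field in rows[0]}


def lookup_row(rows: List[Dict[str, Any]], primary_key: int) -> Dict[str, Any]:
    """Row-store style access: fetch every field of one record by primary key."""
    return next(row for row in rows if row["id"] == primary_key)


def total_price(columns: Dict[str, List[Any]]) -> int:
    """Column-store style access: read only the 'price' column and aggregate it."""
    return sum(columns["price"])


if __name__ == "__main__":
    columns = to_columns(ROWS)
    print(lookup_row(ROWS, 2))
    print(total_price(columns))
```

An aggregate such as total_price touches one list and ignores the others, while lookup_row returns a whole record at once, which mirrors the analytical versus transactional access patterns.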
3. Big Data Hype
The 3 Vs of Big Data (later extended to 5 Vs)
• Volume
• Velocity
• Variety
• Value
• Veracity
Hype Cycle 2011: On the Radar (nobody even knew what Big Data was)
Hype Cycle 2012: Rising
Hype Cycle 2013: Peak of Inflated Expectations
Hype Cycle 2014: Trough of Disillusionment (false claims of simplicity, promises beyond reason)
Hype Cycle 2015: Big Data disappeared (adoption > 20% of the market)
“But what’s happening is that big data has quickly moved over the Peak of Inflated Expectations, and has become prevalent in our lives across many hype cycles. So big data has become a part of many hype cycles.”
Betsy Burton
4. Personal Observations and Suggestions
Observation + Suggestion 1: mrjob is good for learning
• https://github.com/Yelp/mrjob
• Python
• Runs on a local machine or on clusters
• Hadoop streaming
http://calcite.apache.org/docs/stream.html
https://hadoop.apache.org/docs/current/hadoop-streaming/HadoopStreaming.html
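Hadoop streaming exchanges records with the mapper and reducer as lines of text, by default with a tab between key and value, and hands the reducer its input sorted by key. The sketch below imitates that contract with plain Python generators so it can be tried without a cluster. It does not use mrjob itself, and the function names are invented for this example.

```python
from itertools import groupby
from typing import Iterable, Iterator, List


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' record for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_records: Iterable[str]) -> Iterator[str]:
    """Sum the counts of consecutive records that share the same key.

    The input must already be sorted by key, as the framework guarantees.
    """
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines: Iterable[str]) -> List[str]:
    """Chain mapper, sort and reducer in one process to mimic a streaming job."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for record in run_locally(["big data is big", "data is data"]):
        print(record)
```

On a real cluster the mapper and reducer would be two separate scripts reading from standard input; keeping them as functions here makes the logic easy to test first, which is the same appeal mrjob has for learning.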
Observation + Suggestion 2: Moving to the Cloud
On-premises → cloud-based Big Data
Observation + Suggestion 3: Data scientists use SQL
• Hadoop is solely a data processing framework
• MapReduce is primitive
• Sometimes it is an overkill solution
• SQL is great
• Mature analysis tools: BI, UI
The End