1. The Data Platform Administration
Handling the 100 PB
May 19th, 2022
Yongduck Lee
Cloud Platform Department
Rakuten Group, Inc.
2. 2
About me
Lecture History
- Colloquium Lecturer at KAIST
Program Committee
- BigComp2017/2019
- EDB 2016
Certification
- Certified Scrum Master (CSM)
- Certified Project Management Professional (PMP #1255421)
… ETC
Lee Yongduck Daniel
Vice Section Manager and Senior Architect in the Data Storage and
Processing Section at Rakuten Group, Inc.
Started as a recommendation engine developer and now focuses on
researching and verifying new big data technologies and on supporting
users who want to use big data systems.
B.Sc. from Korea University in 2001.
21 years in Japan, having worked for organizations and companies
such as NHK, NTTD, and Rakuten Group, Inc.
3. 3
CONTENTS
1. Global Internet & Data Explosion
2. Data in Rakuten
3. Data platform & Big Data Administrator in Rakuten
4. Advantages of Being an Engineer at Rakuten
4. 4
Internet & Globalization
The Internet is the global system of interconnected computer networks that use the Internet protocol
suite (TCP/IP) to link devices worldwide. It is a network of networks that consists of private, public, academic,
business, and government networks of local to global scope, linked by a broad array of electronic, wireless,
and optical networking technologies.
The origins of the Internet date back to research
commissioned by the federal government of the
United States in the 1960s to build robust, fault-tolerant
communication with computer networks.
https://en.wikipedia.org/wiki/Internet#World_Wide_Web
Globalization
Chances
Vast Information
- Structure: Unstructured 80%, Structured 20%
- Volume: 35.2 ZB in 2020
* From IDC white paper & EMC
5. 5
Internet Users
Internet users are defined as persons who accessed the Internet in the last 12 months from any device,
including mobile phones.
https://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users#cite_note-UN_WPP-14
9. 9
The Big Data in Rakuten
Diversity and Synergy
The diversity of services and users, in Japan and globally, offers huge potential value
and possibilities. It is a very interesting and ideal environment for data scientists and
data analysts.
Combining data from various services increases the synergy effect on personalization,
clustering, segmentation, etc.
Scale
A large volume of data arrives every day, every month, and every year from services and users.
For system infrastructure engineers and data engineers, it is a big challenge to store this
data and make it easy for data users to utilize.
10. 10
Rakuten Hadoop and Kafka
Kafka: 800 Core, 20 TB Mem, 4728 Topics
- Supports near-real-time and streaming processing in each region.
- Handles about 1.3 million messages/sec in total (10 GB/sec IN/OUT) at peak time on a
normal day.
- At the 2021 Super Sale, we handled more than 2.5 times the usual messages and traffic.
Hadoop: 80K Core, 600 TB Mem, 160K TB Disk
- Supports Data Lake, Data Mart, and Data Analysis for Rakuten services in each region.
- Data scientists mine a great deal of value from big data and contribute it to Rakuten
services.
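To put the figures on this slide in proportion, here is a small back-of-the-envelope calculation. It is only a sketch: it assumes that "10 GB/sec" means decimal gigabytes and that the whole figure can be attributed to the 1.3 million messages/sec, which the slide does not state, so the average message size is an order-of-magnitude estimate at best.

```python
# Figures quoted on the slide.
PEAK_MESSAGES_PER_SEC = 1_300_000   # about 1.3 million messages/sec
PEAK_BYTES_PER_SEC = 10 * 10**9     # 10 GB/sec (assumption: decimal GB)
SUPER_SALE_FACTOR = 2.5             # "more than 2.5 times" at the 2021 Super Sale
HADOOP_DISK_TB = 160_000            # 160K TB of disk


def average_message_bytes(bytes_per_sec, messages_per_sec):
    """Average payload per message implied by the two throughput figures."""
    return bytes_per_sec / messages_per_sec


avg_bytes = average_message_bytes(PEAK_BYTES_PER_SEC, PEAK_MESSAGES_PER_SEC)
super_sale_messages = PEAK_MESSAGES_PER_SEC * SUPER_SALE_FACTOR  # lower bound
disk_pb = HADOOP_DISK_TB / 1000     # decimal TB -> PB

print(f"average message size: about {avg_bytes / 1000:.1f} KB")
print(f"Super Sale peak: more than {super_sale_messages:,.0f} messages/sec")
print(f"Hadoop raw disk: {disk_pb:.0f} PB")
```

Under these assumptions the average message is a few kilobytes, the Super Sale peak exceeds three million messages per second, and the Hadoop cluster holds 160 PB of raw disk.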
12. 12
The Big Data in Rakuten
- Platform/Middleware Administrator
- Users
- Project/Product Manager
- Big Data Platform Administrator
- Infra/Server Administrator
- Network Administrator
- Software/System Architect
- Software Developer
13. 13
Administration Use CASE (HBase)
A user reported performance issues on HBase, but there were no issues or reports from users of
other Hadoop components.
Confirm the way to get/put data on HBase
• HBase
  - Configuration
  - Architecture, work/dataflow
  - Application/GC logs
• Dependency component (*HDFS)
  - Read/write performance logs
  - Application/GC logs
• Disk/Mem/CPU load
• Kernel log
• Network connection
Date & time matching
Data hot spotting
Data or configuration caching
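The date and time matching step can be sketched as follows: take the timestamps of the slow HBase requests and look for events in the other logs (GC, HDFS, kernel) that fall within a short window around them. This is a minimal illustration, not the tooling used at Rakuten; the log entries, the timestamp format, and the 30-second window are all hypothetical.

```python
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format


def parse(ts):
    return datetime.strptime(ts, TS_FORMAT)


def match_events(slow_requests, candidate_events, window_seconds=30):
    """Pair each slow request with the events logged within the time window."""
    window = timedelta(seconds=window_seconds)
    matches = {}
    for req_ts, req_msg in slow_requests:
        t = parse(req_ts)
        matches[(req_ts, req_msg)] = [
            (ts, source, msg)
            for ts, source, msg in candidate_events
            if abs(parse(ts) - t) <= window
        ]
    return matches


# Hypothetical sample data for illustration only.
slow_requests = [
    ("2022-01-10 12:00:05", "slow get"),
    ("2022-01-10 13:30:00", "slow scan"),
]
candidate_events = [
    ("2022-01-10 12:00:01", "gc", "long GC pause"),
    ("2022-01-10 12:45:00", "kernel", "disk I/O error"),
    ("2022-01-10 13:30:20", "hdfs", "slow block read"),
]

matches = match_events(slow_requests, candidate_events)
for request, events in matches.items():
    print(request, "->", events)
```

Events that repeatedly line up with the slow requests point to the component worth examining first; events that never line up can be set aside.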
HDFS: JVM config change; increasing handlers; increasing scanner interval
HW Improvement: master node replacement; reduced RegionServers; move HDD to NVMe; dedicated RegionServers
OS Configuration: increasing root nproc and nofile limits on dedicated RegionServers
HBASE: TCPNoDelay; Parallel Seeking; Master Table Locality; WRITE/Short-READ/Long-READ Queue
DEADLINE Scheduler, Hedged Reads, Short Circuit READ
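Several of the keywords above correspond to settings in the public Hadoop and HBase configuration files. The sketch below renders a few of them as a site XML fragment. The mapping from slide keyword to property name is my reading rather than something the slide states, and every value (including the socket path) is an illustrative placeholder, not Rakuten's actual setting.

```python
import xml.etree.ElementTree as ET

# Assumed mapping of slide keywords to properties; values are placeholders.
SETTINGS = {
    # Short Circuit READ
    "dfs.client.read.shortcircuit": "true",
    "dfs.domain.socket.path": "/var/lib/hadoop-hdfs/dn_socket",  # hypothetical path
    # Hedged Reads
    "dfs.client.hedged.read.threadpool.size": "20",
    "dfs.client.hedged.read.threshold.millis": "10",
    # WRITE/Short-READ/Long-READ Queue
    "hbase.ipc.server.callqueue.read.ratio": "0.5",
    "hbase.ipc.server.callqueue.scan.ratio": "0.5",
    # Parallel Seeking
    "hbase.storescanner.parallel.seek.enable": "true",
    # TCPNoDelay
    "hbase.ipc.client.tcpnodelay": "true",
}


def render_site_xml(settings):
    """Render a dict of properties in the Hadoop-style <configuration> layout."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


site_xml = render_site_xml(SETTINGS)
print(site_xml)
```

Any change of this kind should be checked against the documentation for the exact Hadoop and HBase versions in use, since defaults and property names can differ between releases.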
14. 14
Advantages of Being a Data Engineer at Rakuten
You can work across all the domains of the Big Data Platform and gain the rich experience a Big Data Platform
Administrator needs. Rakuten has experts with deep knowledge and experience in each technical and
management domain.
15. 15
Advantages of Being a Data Engineer at Rakuten
You can also work with stakeholders from various service domains from the standpoint of data utilization.
DB
Services
Event
INFRA
…