Presentation slides from LINE Developer Meetup #68 - Big Data Platform. They cover the HDFS major version upgrade and the adoption of Router-based Federation (RBF). Event page: https://line.connpass.com/event/188176/
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
1. Copyright (C) 2020 Yahoo Japan Corporation. All Rights Reserved.
2020/09/17
Akira Ajisaka
Upgrading HDFS to 3.3.0 and deploying RBF in production
LINE Developer Meetup #68 – Big Data Platform
Self introduction
• Akira Ajisaka (鯵坂 明, Twitter: @ajis_ka)
• Apache Hadoop PMC member (2016~)
• Yahoo! JAPAN (2018~)
Outdoor bouldering for the first time in Mitake
Agenda
• Why and how we upgraded the largest HDFS cluster to 3.3.0
• Hadoop clusters in Yahoo! JAPAN
• Short intro of RBF and why we chose it
• How to upgrade
• How to split namespace
• What we considered and experimented
• Many troubles and lessons learned from them
Why and how we upgraded the cluster?
Yahoo! JAPAN's largest HDFS cluster
• 100PB actually used
• 500+ DataNodes
• 240M files + directories
• 290M blocks
• 400GB NameNode Java heap
• HDP 2.6.x + patches
(as of Dec. 2019)
Reference: https://www.slideshare.net/techblogyahoo/hadoop-yjtc19-in-shibuya-b2-yjtc
Major existing problems
• The namespace is too large
  • NameNode does not scale infinitely due to heavy GC
• The Hadoop version is too old
  • HDP 2.6 is based on Apache Hadoop 2.7.3
  • 2.7.3 was released 4 years ago
• We upgraded to HDFS 3.3.0 and used RBF to split the namespace
RBF (Router-based Federation)
[Figure: a DFSRouter with a StateStore on ZooKeeper; the directories top/, shp/ and auc/ under / belong to separate namespaces, each served by its own NameNode]
Note: Kerberos authentication is supported in Hadoop 3.3.0
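The routing idea in the figure can be sketched as a longest-prefix lookup in a mount table. This is only an illustration of the concept, not the DFSRouter implementation: the nameservice names (ns-top, ns-shp, ns-auc) and the `resolve` helper are hypothetical and do not appear in the talk.

```python
from typing import Dict, Tuple

# Hypothetical mount table: mount point -> (nameservice, path inside that namespace).
# The nameservice names are made up for illustration.
MOUNT_TABLE: Dict[str, Tuple[str, str]] = {
    "/top": ("ns-top", "/top"),
    "/shp": ("ns-shp", "/shp"),
    "/auc": ("ns-auc", "/auc"),
}


def resolve(path: str) -> Tuple[str, str]:
    """Return (nameservice, remote path) for the longest matching mount point."""
    best = None
    for mount in MOUNT_TABLE:
        # Match the mount point itself or anything below it ("/top" must not match "/topology").
        if path == mount or path.startswith(mount + "/"):
            if best is None or len(mount) > len(best):
                best = mount
    if best is None:
        raise KeyError(f"no mount point covers {path}")
    nameservice, target = MOUNT_TABLE[best]
    return nameservice, target + path[len(best):]


if __name__ == "__main__":
    print(resolve("/top/logs/2020-08-01"))
    print(resolve("/auc"))
```

Clients keep using a single logical path such as /top/logs, and only the router needs to know which NameNode owns it.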
How to enable RBF without client configuration changes
Before: NameNode @ host1 (port 8020)
After: DFSRouter @ host1 (port 8020) with a StateStore on ZooKeeper, in front of NameNode @ host1 (port 8021), NameNode @ host2 and NameNode @ host3
Note: We could not do a rolling upgrade of the cluster because the NN RPC port changed
How to split namespaces
• Calculated the # of files/directories/blocks from the fsimage
• Calculated the # of RPCs from audit logs
  • RPCs are classified into two groups (update/read)
• We had to check the audit logs to ensure that there is no rename operation between namespaces
  • RBF does not support it for now
  • Xiaomi has developed HDFS Federation Rename (HFR)
  • https://issues.apache.org/jira/browse/HDFS-15087 (work in progress)
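The audit-log checks above can be sketched as follows. This is a minimal sketch, not the tooling used for the upgrade. It assumes the common tab-separated key=value audit line layout (cmd=, src=, dst=); the set of "update" commands and the planned directory-to-namespace mapping are hypothetical.

```python
import re
from collections import Counter

# Assumed grouping of audit commands; everything else is counted as "read".
UPDATE_CMDS = {"create", "append", "delete", "rename", "mkdirs",
               "setPermission", "setOwner", "setReplication", "setTimes"}

# Hypothetical planned split: top-level directory -> namespace.
PLANNED_NAMESPACES = {"top": "ns-top", "shp": "ns-shp", "auc": "ns-auc"}

# key=value pairs separated by tabs, e.g. "cmd=rename\tsrc=/top/a\tdst=/shp/a".
FIELD = re.compile(r"(\w+)=([^\t\n]*)")


def namespace_of(path):
    """Map a path to its planned namespace (None if not covered)."""
    first = path.strip("/").split("/")[0]
    return PLANNED_NAMESPACES.get(first)


def analyze(lines):
    """Count RPCs per (namespace, update/read) and collect cross-namespace renames."""
    counts = Counter()
    cross_renames = []
    for line in lines:
        fields = dict(FIELD.findall(line))
        # Keep only the command name in case extra text follows it.
        cmd = fields.get("cmd", "").split(" ")[0]
        if not cmd:
            continue
        ns = namespace_of(fields.get("src", ""))
        kind = "update" if cmd in UPDATE_CMDS else "read"
        counts[(ns, kind)] += 1
        if cmd == "rename" and namespace_of(fields.get("dst", "")) != ns:
            cross_renames.append((fields["src"], fields["dst"]))
    return counts, cross_renames
```

A non-empty `cross_renames` list means the planned split would break an existing workload, so the split (or the workload) has to change before the upgrade.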
Split DataNodes or not?
• Split: DataNodes are split for each namespace
• No-split: DNs register with all the NameNodes
We chose splitting DNs because it is simple
Split DataNodes – Pros and Cons
Pros
• Simple
• Easy to troubleshoot and operate
• No limitation on the # of namespaces
• East-west traffic can be controlled easily
Cons
• Need to calculate how many DNs are required for each namespace
• Possible unbalanced resource usage among namespaces
• HFR uses hard links for rename and assumes non-split DNs
Check HDFS client-server compatibility
• We upgraded HDFS only
• Old (HDP 2.6) clients still exist, so we had to check the compatibility
  • We read the ".proto" files and verified that they are compatible
  • In addition, we upgraded HDFS in the development cluster for end-users
• Wrote a blog post: https://techblog.yahoo.co.jp/entry/20191206786320/ (Japanese and English)
Load-balancing DFSRouters
• If a client is configured as follows, the client always connects to host1
<property name="dfs.nameservices" value="ns"/>
<property name="dfs.ha.namenodes.ns" value="dr1,dr2"/>
<property name="dfs.namenode.rpc-address.ns.dr1" value="host1:8020"/>
<property name="dfs.namenode.rpc-address.ns.dr2" value="host2:8020"/>
• To avoid this problem, set "dfs.client.failover.random.order" to true
  • This feature is available from Hadoop 2.9.0 and not available in the old clients, so we patched them internally
  • The default value is true in Hadoop 3.4.0+ (HDFS-15350)
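The effect of "dfs.client.failover.random.order" can be illustrated with a small simulation. This is a sketch of the idea only (shuffling the configured address list before the first connection attempt), not the HDFS client code; the function names are hypothetical.

```python
import random
from collections import Counter

# The two DFSRouter addresses from the client configuration on this slide.
ROUTERS = ["host1:8020", "host2:8020"]


def first_router(random_order: bool, rng: random.Random) -> str:
    """Return the router a client tries first."""
    order = list(ROUTERS)
    if random_order:
        # Mimics the effect of dfs.client.failover.random.order=true.
        rng.shuffle(order)
    return order[0]


def simulate(clients: int, random_order: bool, seed: int = 0) -> Counter:
    """Count how many clients land on each router first."""
    rng = random.Random(seed)
    return Counter(first_router(random_order, rng) for _ in range(clients))


if __name__ == "__main__":
    print("fixed order :", simulate(1000, random_order=False))
    print("random order:", simulate(1000, random_order=True))
```

With the fixed order every client starts at host1, and host2 only receives traffic after a failover. With the random order the first attempts are spread roughly evenly over both routers.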
Try Java 11
• Hadoop 3.3.0 supports Java 11 as runtime
• Upgrade to Java 11 to improve GC performance
• We contributed many patches to support Java 11 in Apache Hadoop community
  • https://www.slideshare.net/techblogyahoo/java11-apache-hadoop-146834504 (Japanese)
Upgrade ZooKeeper to 3.5.x
• Error log w/ Hadoop 3.3.0 and ZK 3.4.x:
(snip)
Caused by: org.apache.zookeeper.KeeperException$UnimplementedException: KeeperErrorCode = Unimplemented for /zkdtsm-router/ZKDTSMRoot/ZKDTSMSeqNumRoot
at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1637)
at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1180)
at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1156)
(snip)
• Hadoop 3.3.0 upgraded its Curator version, and the new Curator depends on ZooKeeper 3.5.x (HADOOP-16579)
• We rolling-upgraded the ZK cluster before upgrading HDFS
• The upgrade succeeded without any major problems
Planned schedule
• 2019.9 Upgraded to trunk in the dev cluster
• 2020.3 Apache Hadoop 3.3.0 released
• 2020.3 Upgraded to 3.3.0 in the staging cluster
• 2020.5 Upgraded to 3.3.0 in production
Actual schedule
• 2019.9 Upgraded to trunk in the dev cluster (with 1 retry)
• 2020.7 Apache Hadoop 3.3.0 released
• 2020.8 Upgraded to 3.3.0 in the staging cluster (with 2 retries)
• 2020.8 Upgraded to 3.3.0 in production (no retry! but faced many troubles...)
• The upgrade was completed remotely
Many troubles
DistCp is slower than expected
• We used DistCp to move recent data between namespaces after the upgrade, but it did not finish by the deadline
  • Directory listing of src/dst is serial
  • Increasing the number of Map tasks does not help
• DistCp always fails if (# of Map tasks) > 200 and the dynamic option is true
  • It fails with a configuration error
  • To make matters worse, it fails after directory listing, which takes a very long time
• DistCp does not work well for very large directories
  • We recommend splitting the job
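One way to split the job is to run one DistCp per subdirectory, so that each job lists a smaller tree and the listings can proceed in parallel. The sketch below only builds the command lines. The source and destination URIs, the subdirectory names and the default of 100 map tasks are hypothetical; `-m` is the DistCp option for the maximum number of map tasks.

```python
import shlex
from typing import List


def split_distcp(src_root: str, dst_root: str, subdirs: List[str],
                 maps_per_job: int = 100) -> List[List[str]]:
    """Build one `hadoop distcp` command per subdirectory instead of one huge job."""
    commands = []
    for name in subdirs:
        commands.append([
            "hadoop", "distcp",
            "-m", str(maps_per_job),   # cap the number of map tasks per job
            f"{src_root}/{name}",
            f"{dst_root}/{name}",
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical source/destination namespaces and subdirectories.
    jobs = split_distcp("hdfs://ns-old/data", "hdfs://ns-new/data",
                        ["2020-08-01", "2020-08-02", "2020-08-03"])
    for cmd in jobs:
        print(shlex.join(cmd))
```

The printed commands can then be submitted concurrently by a scheduler or a shell loop, which also makes it possible to retry a single failed subdirectory instead of the whole copy.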
DN traffic reached the NW bandwidth limit
• We faced many job failures just after the upgrade
• When splitting DNs, we considered only the data size, but that was not sufficient
  • Read/write requests must be considered as well
[Figure: DN out traffic in a subcluster, with 25Gbps marked]
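The sizing lesson can be written as a small calculation: the number of DNs for a namespace is the larger of the capacity-driven and the traffic-driven requirement. This is a sketch under stated assumptions. The function, the headroom factor and all input numbers are hypothetical and are not figures from the talk.

```python
import math


def datanodes_needed(data_tb: float, peak_out_gbps: float,
                     dn_capacity_tb: float, dn_nic_gbps: float,
                     nic_headroom: float = 0.7) -> int:
    """Size a subcluster by both stored data and peak outbound traffic."""
    # DNs required to hold the data.
    by_capacity = math.ceil(data_tb / dn_capacity_tb)
    # DNs required to serve the peak traffic while keeping some NIC headroom.
    by_traffic = math.ceil(peak_out_gbps / (dn_nic_gbps * nic_headroom))
    return max(by_capacity, by_traffic)


if __name__ == "__main__":
    # Hypothetical read-heavy namespace: little data, heavy traffic.
    print(datanodes_needed(data_tb=3000, peak_out_gbps=900,
                           dn_capacity_tb=100, dn_nic_gbps=25))
    # Hypothetical archive namespace: much data, light traffic.
    print(datanodes_needed(data_tb=10000, peak_out_gbps=100,
                           dn_capacity_tb=100, dn_nic_gbps=25))
```

In the first case the traffic term dominates, so sizing by data size alone would leave the subcluster short of network bandwidth.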
DFSRouter slowdown
• DFSRouter slowed down drastically when the active NameNode was restarted
• Wrote a patch and fixed it in HDFS-15555
[Figure: DFSRouter average RPC queue time (30 sec marked), annotated with "Restarted active NameNode" and "Finished loading fsimage"]
HttpFS incompatibilities
• The implementation of the web server is different
  • Hadoop 2.x: Tomcat 6.x
  • Hadoop 3.x: Jetty 9.x
• The behavior is very different
  • Jetty supports HTTP/1.1 (chunked encoding)
  • Default idle timeout is different
    • Tomcat: 60 seconds
    • Jetty: Set by "hadoop.http.idle_timeout.ms" (default 1 second)
  • Response flow (when the server returns 401) is different
  • Response body itself is different
  • and more...
• Need to test very carefully if you are using HttpFS
Lessons learned
• We changed many configurations at once, which should be avoided as much as possible
  • For example, we changed the block placement policy to rack fault-tolerant, and the number of under-replicated blocks grew to 300M+ after the upgrade
  • Troubleshooting became more difficult
  • The HttpFS upgrade could also have been separated from this upgrade, as the ZooKeeper upgrade was
• Imagine what will happen in production and test as much as possible in advance
  • Consider the difference between dev/staging and prod
  • There is a limit to what one person can imagine. Ask many colleagues!
HDFS Future works
• Router-based Federation
  • Rebalance DNs/namespaces between subclusters well
  • Considering multiple subclusters, non-split DNs (or even a hybrid), HFR, and so on
• Erasure Coding in production
  • We are internally backporting the EC feature to the old HDFS client, and the work is mostly finished
• Try new low-pause-time GC algorithms
  • ZGC, Shenandoah
We are hiring!
https://about.yahoo.co.jp/hr/job-info/role/1247/
(Japanese)