

Scaling HDFS at Xiaomi

  1. Scaling HDFS at Xiaomi Chen Zhang
  2. Outline • Introduction of Xiaomi • Scenarios and challenges • Improvements on HDFS federation • Experience on scaling up single NameNode • Efficient management of hundreds of clusters
  3. About Xiaomi World’s 4th largest smartphone maker Sold 118 Million phones in 2018
  4. About Xiaomi World’s Largest consumer IoT platform Over 150 Million smart devices connected
  5. Software and Internet Services MIUI MiPay/Finance App Market Ads MiCloud Game MiPush Smart Home News Feeds …
  6. Scenarios HDFS HBase EMQ Yarn Talos FDS(S3) Spark Hive Impala
  7. Scenarios (MiCloud, MiPush, Feeds, User Profile, Talos, Ads) • Online Services: 100+ independent clusters; low latency; high availability • Offline Services (Hadoop): several huge clusters; high throughput; high scalability, high availability
  8. Data Growth [Chart: Data Growth of The Largest Cluster, 2015 to 2018. Series: File counts (10 million) and Data Size (PB). Data labels in extraction order: 2, 23, 41, 71 and 3, 30, 60, 150]
  9. Challenges • Challenges in late 2016: data growth is too fast; dependencies are too complex; code changes are almost impossible
  10. What We Need We need A Huge Single HDFS Cluster
  11. Improvements on HDFS Federation • Problems of HDFS Federation in late 2016 – NameNodes are independent, metadata is not shared – Client-side MountTable config is hard to maintain – MountTable doesn’t support nested mount points – ViewFileSystem is not compatible with DistributedFileSystem – RBF was not stable and not fully functional in late 2016
  12. Improvements on HDFS Federation [Diagram: Original HDFS Federation. Namespace layer: viewfs over NS 1 … NS k … Foreign NS n, served by NN-1 … NN-k … NN-n. Block storage layer: Block Pools (Pool 1 … Pool k … Pool n) on common storage, Datanode 1 … Datanode m. Directory labels: /, user, yarn, hive, service1, service2, small dir1, small dir2, small service1, small service2, …]
  13. Improvements on HDFS Federation • Support nested mount points • Add a default NameSpace • Support rename across NameSpaces • hdfs:// -> FederatedDFSFileSystem extends DistributedFileSystem: compatible with hdfs://, no code changes needed • Update MountTable config from ZK [Diagram: the federation layout (viewfs, NS 1 … NS n, NN-1 … NN-n, block pools on shared Datanodes) with an added NS 0 served by NN-0; directory labels: /, user, yarn, hive, service1, service2]
  14. Nested Mount table and Default NameSpace 1. Xiaomi is not only a hardware company but also an Internet company, and it is developing very fast 2. There are more than 100 internet services, and new businesses and services emerge quickly, based on our smart devices and more than 300 million users 3. It’s hard for us to use a fixed, pre-divided mount table
  15. Nested Mount table and Default NameSpace 1. At first, we divide the initial mount points by data volume and QPS. Only about a dozen mount points need to be configured, for the largest services; all other paths fall into the default NameSpaces 2. When new infrastructure services and internet services emerge (e.g. /some_new_nosql_service, /user/live_show_services, /user/short_video_services), the mount table doesn’t need any update 3. HADOOP-13055 supports linkFallback, but our solution is more flexible [Diagram: directory tree (/, user, yarn, hive, service1, service2) mapped to NS 0, NS 1 … NS k … NS n on NN-0, NN-1 … NN-k … NN-n]
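
The lookup described on this slide can be sketched as a longest-prefix match over path components with a fallback. This is only an illustration under stated assumptions: the mount points, the namespace names (`ns0` … `ns4`) and the `resolve` function are hypothetical and are not taken from Xiaomi's FederatedDFSFileSystem, and a single default namespace is assumed for simplicity.

```python
from typing import Dict, Tuple

# Hypothetical mount table (mount point -> namespace); names are illustrative only.
MOUNT_TABLE: Dict[str, str] = {
    "/user/service1": "ns1",
    "/user/service2": "ns2",
    "/hive": "ns3",
    "/hive/warehouse/big_table": "ns4",  # nested under the /hive mount point
}
DEFAULT_NS = "ns0"  # assumption: one default namespace


def resolve(path: str,
            table: Dict[str, str] = MOUNT_TABLE,
            default_ns: str = DEFAULT_NS) -> Tuple[str, str]:
    """Return (namespace, matched mount point) for an absolute path."""
    parts = [p for p in path.split("/") if p]
    # Try the deepest prefix first so a nested mount point wins over its parent.
    for depth in range(len(parts), 0, -1):
        prefix = "/" + "/".join(parts[:depth])
        if prefix in table:
            return table[prefix], prefix
    # Nothing matched: the path falls into the default namespace.
    return default_ns, "/"
```

Matching on whole path components (rather than raw string prefixes) keeps `/user/service10` from being routed to the `/user/service1` mount point. A new top-level directory such as `/some_new_nosql_service` resolves to the default namespace without any mount table change.
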
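  16. Client Transparency • ViewFileSystem is wrapped in FederatedDFSFileSystem (mount points such as /user/service1, /user/service2) • Config: fs.hdfs.impl=FederatedDFSFileSystem • Clients access hdfs://clustername/user/service1 • The client fetches the mount table from ZooKeeper and watches it for changes • An Admin Tool updates the mount table in ZooKeeper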
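
The fetch-and-watch flow on this slide can be illustrated without a real ZooKeeper ensemble. The sketch below uses an in-memory stand-in (`FakeZNode`, a hypothetical class) and a hypothetical `MountTableClient`; the JSON encoding of the mount table is also an assumption. Note that classic ZooKeeper watches fire once and must be re-registered, which this stand-in does not model.

```python
import json
from typing import Callable, Dict, List


class FakeZNode:
    """In-memory stand-in for the ZooKeeper node that stores the mount table."""

    def __init__(self, data: bytes) -> None:
        self._data = data
        self._watchers: List[Callable[[bytes], None]] = []

    def get(self, watcher: Callable[[bytes], None]) -> bytes:
        # Register the watcher and return the current content.
        self._watchers.append(watcher)
        return self._data

    def set(self, data: bytes) -> None:
        # Store the new content and notify every registered watcher.
        self._data = data
        for notify in list(self._watchers):
            notify(data)


class MountTableClient:
    """Loads the mount table at start-up and refreshes it on every change."""

    def __init__(self, znode: FakeZNode) -> None:
        self.table: Dict[str, str] = json.loads(znode.get(self._on_change))

    def _on_change(self, data: bytes) -> None:
        self.table = json.loads(data)


znode = FakeZNode(json.dumps({"/hive": "ns3"}).encode())
client = MountTableClient(znode)
before = dict(client.table)

# What an admin tool would do: publish a new mount table.
znode.set(json.dumps({"/hive": "ns3", "/user/service1": "ns1"}).encode())
after = dict(client.table)
```
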
  17. Client Transparency • RPC integration: listStatus, getContentSummary, setQuota/getQuota • Admin Tools: refreshNodes, setBalancerBandwidth, DataNode decommission • Trash optimization (/user/service1/.Trash/): moveToTrash is a rename operation; moveToTrash across NameNodes is very expensive [Diagram: directory tree (/, user, yarn, hive, service1, service2) mapped to NS 0 … NS n on NN-0 … NN-n]
  18. Rename Across NameSpaces [Diagram: the Client works with namenode1 and namenode2; the directory is locked; on datanode1, datanode2 and datanode3, blocks are hardlinked between blockpool1 and blockpool2 (Link block)]
  19. Rename Across NameSpaces in Detail Source Phase 1 1. Sanity Check. • Existence • Permission • Can’t be reserved directory • Can’t be symlink • Not in encryption zones 2. Serialize the inode-tree and blocks information with ProtoBuf • Name • Permissions • mtime/atime • Replication factor • Block locations • Acl / Xattr / Quota …
  20. Rename Across NameSpaces in Detail Source Phase 1 3. Lock the directory • Add a FederationRenameFeature that records the renameId and the source and destination paths • With FederationRenameFeature, all sub-directories and files in this directory, and all inodes in the parent path, are not writable 4. Add a federation-rename record 5. Return the serialized data to the client
  21. Rename Across NameSpaces in Detail Dest Phase 1 1. Sanity Check • permission, quota, not in encryption zones 2. Deserialize the inode-tree, graft it to the destination path • Allocate inode id for each inode • Allocate block id and new GS for each block • Update acl and other features
  22. Rename Across NameSpaces in Detail Dest Phase 1 3. Lock the directory • Also use FederationRenameFeature 4. Update quota count 5. Add a federation-rename record 6. Return a list of block information, including: • srcBlockId, destBlockId, blockSize, srcGenStamp, destGenStamp for each block
  23. Rename Across NameSpaces in Detail Link Block 1. For each DN, send request in batch • Create new block file by hardlink, one by one • With a total operation timeout 2. Using a ThreadPoolExecutor 3. For each block, count as complete if at least 2/3 replicas succeed • Slow DN will not affect the total progress
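
The "count as complete if at least 2/3 replicas succeed" rule can be sketched with a thread pool and one overall timeout. This is a simplified illustration, not the actual implementation: `link_blocks`, `fake_send_batch`, the DataNode names and the block ids are all hypothetical, and the real hardlink request to a DataNode is replaced by a fake function.

```python
import time
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, wait
from typing import Callable, Dict, List


def link_blocks(replicas: Dict[str, List[int]],
                send_batch: Callable[[str, List[int]], List[int]],
                timeout_s: float,
                workers: int = 8) -> Dict[int, bool]:
    """replicas maps DataNode -> block ids it stores; send_batch returns the ids it linked."""
    expected = Counter(b for blocks in replicas.values() for b in blocks)
    linked: Counter = Counter()
    pool = ThreadPoolExecutor(max_workers=workers)
    # One batched request per DataNode, all running in parallel.
    futures = [pool.submit(send_batch, dn, blocks) for dn, blocks in replicas.items()]
    done, _pending = wait(futures, timeout=timeout_s)  # total operation timeout
    for future in done:
        if future.exception() is None:
            linked.update(future.result())
    pool.shutdown(wait=False)  # do not block on slow DataNodes
    # A block is complete when at least 2/3 of its replicas were linked.
    return {b: 3 * linked[b] >= 2 * expected[b] for b in expected}


def fake_send_batch(dn: str, blocks: List[int]) -> List[int]:
    """Hypothetical DataNode behaviour used only for this demo."""
    if dn == "dn3":
        time.sleep(1.0)                            # slow DataNode, misses the timeout
    if dn == "dn2":
        return [b for b in blocks if b != 102]     # fails to link block 102
    return list(blocks)


replicas = {dn: [101, 102] for dn in ("dn1", "dn2", "dn3")}
result = link_blocks(replicas, fake_send_batch, timeout_s=0.2)
```

In this run block 101 reaches two of three replicas and counts as complete, while block 102 reaches only one and does not; the slow `dn3` does not hold up the result.
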
  24. Rename Across NameSpaces in Detail Source Phase 2 1. Delete the source directory/file 2. Delete all the inodes and blocks asynchronously 3. Remove federation-rename record Dest Phase 2 1. Remove FederationRenameFeature, make the target directory visible 2. Remove federation-rename record
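
The five steps above (Source Phase 1, Dest Phase 1, Link Block, Source Phase 2, Dest Phase 2) can be walked through with a toy in-memory model. Everything in this sketch is an assumption made for illustration: the `NameNode` dataclass, the function names and the id numbering are hypothetical. Sanity checks are reduced to an existence check, ProtoBuf serialization is replaced by a plain dict, and deletion of source blocks on the DataNodes is omitted.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Set, Tuple


@dataclass
class NameNode:
    files: Dict[str, List[int]] = field(default_factory=dict)          # path -> block ids
    locked: Set[str] = field(default_factory=set)                       # roots carrying the rename feature
    records: Dict[str, Tuple[str, str]] = field(default_factory=dict)   # renameId -> (src, dst)
    ids: Iterator[int] = field(default_factory=lambda: itertools.count(9000))


def under(path: str, root: str) -> bool:
    return path == root or path.startswith(root.rstrip("/") + "/")


def source_phase1(nn: NameNode, rename_id: str, src: str, dst: str) -> Dict[str, List[int]]:
    subtree = {p: list(b) for p, b in nn.files.items() if under(p, src)}
    if not subtree:
        raise FileNotFoundError(src)       # sanity check (existence only)
    nn.locked.add(src)                     # lock the directory
    nn.records[rename_id] = (src, dst)     # federation-rename record
    return subtree                         # stands in for the serialized inode tree


def dest_phase1(nn: NameNode, rename_id: str, src: str, dst: str,
                subtree: Dict[str, List[int]]) -> Dict[int, int]:
    block_map: Dict[int, int] = {}
    for path, blocks in subtree.items():
        new_ids = [next(nn.ids) for _ in blocks]     # allocate new block ids
        block_map.update(zip(blocks, new_ids))
        nn.files[dst + path[len(src):]] = new_ids    # graft under the destination path
    nn.locked.add(dst)
    nn.records[rename_id] = (src, dst)
    return block_map                                 # srcBlockId -> destBlockId


def link_blocks(datanodes: Dict[str, Dict[int, str]], block_map: Dict[int, int]) -> None:
    # Each DataNode maps block id -> underlying file; a "hardlink" shares the file.
    for store in datanodes.values():
        for src_id, dst_id in block_map.items():
            if src_id in store:
                store[dst_id] = store[src_id]


def source_phase2(nn: NameNode, rename_id: str) -> None:
    src, _dst = nn.records.pop(rename_id)            # remove the rename record
    for path in [p for p in nn.files if under(p, src)]:
        del nn.files[path]                           # delete the source subtree
    nn.locked.discard(src)


def dest_phase2(nn: NameNode, rename_id: str) -> None:
    _src, dst = nn.records.pop(rename_id)
    nn.locked.discard(dst)                           # target becomes visible


src_nn = NameNode(files={"/a/dir/f1": [1, 2], "/a/dir/sub/f2": [3], "/a/other": [4]})
dst_nn = NameNode()
datanodes = {"dn1": {1: "blk_A", 2: "blk_B", 3: "blk_C", 4: "blk_D"}}

payload = source_phase1(src_nn, "r1", "/a/dir", "/b/dir")
block_map = dest_phase1(dst_nn, "r1", "/a/dir", "/b/dir", payload)
link_blocks(datanodes, block_map)
source_phase2(src_nn, "r1")
dest_phase2(dst_nn, "r1")
```

No block data is copied at any point in this model: the destination namespace only receives new block ids that point at the same underlying files.
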
  25. Error Handling (Failed at: How to handle -> Result) • Source Phase 1: fail -> Fail • Dest Phase 1: cancel Source Phase 1 -> Fail • Link Block: request fails, NameNode Fixer will redo the remaining steps -> Will succeed finally • Source Phase 2: request fails, NameNode Fixer will redo the remaining steps -> Will succeed finally • Dest Phase 2: request fails, NameNode Fixer will redo the remaining steps -> Will succeed finally
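
The error-handling table can be read as a small decision function: failures in the first two steps abort the rename, while failures from Link Block onwards are rolled forward. The sketch below only encodes that table; the step names and the `handle_failure` function are hypothetical labels, not identifiers from the actual code.

```python
from typing import List, Tuple

STEPS = ["source_phase1", "dest_phase1", "link_block", "source_phase2", "dest_phase2"]


def handle_failure(failed_step: str) -> Tuple[List[str], str]:
    """Return (recovery actions, final result of the rename) for a failed step."""
    if failed_step == "source_phase1":
        return [], "fail"                          # nothing to undo yet
    if failed_step == "dest_phase1":
        return ["cancel source_phase1"], "fail"    # roll back the source lock
    # From link_block onwards the fixer rolls forward instead of back.
    remaining = STEPS[STEPS.index(failed_step):]
    return ["redo " + step for step in remaining], "success"
```
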
  26. Error Handling NameNode Failover and Restart 1. All operations are recorded in the edit log 2. FederationRenameFeature is serialized to the FsImage 3. Federation-rename records are not serialized to the FsImage; they are rebuilt during log replay or FsImage loading (if an inode has a FederationRenameFeature, a federation-rename record is added)
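
Point 3 says the in-memory records can be derived from the inodes that still carry the feature. A minimal sketch of that rebuild follows, assuming inodes are plain dicts and using hypothetical field names (`federation_rename_feature`, `rename_id`, `src`, `dst`).

```python
from typing import Dict, Iterable


def rebuild_rename_records(inodes: Iterable[dict]) -> Dict[str, dict]:
    """Recreate federation-rename records while loading inodes from an image."""
    records: Dict[str, dict] = {}
    for inode in inodes:
        feature = inode.get("federation_rename_feature")
        if feature is not None:
            # An inode that still carries the feature belongs to an unfinished rename.
            records[feature["rename_id"]] = {
                "path": inode["path"],
                "src": feature["src"],
                "dst": feature["dst"],
            }
    return records


loaded = [
    {"path": "/a/other"},
    {"path": "/a/dir",
     "federation_rename_feature": {"rename_id": "r1", "src": "/a/dir", "dst": "/b/dir"}},
]
records = rebuild_rename_records(loaded)
```
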
  27. Scaling up NameNodes Our Largest NameNode 1. 150GB heap 2. Use CMS GC 3. More than 500 million objects (240 million files and 260 million blocks) 4. More than 20000 QPS
  28. Scaling up NameNodes Experience • Throttle – BlockReport / Incremental-BlockReport throttle – Concurrent GetContentSummary throttle • Lock optimization • Config optimization • Add more tracing information
  29. Block Report Throttle • Problem: Full GC when NameNode starts up – Thousands of DNs send block reports at almost the same time – NameNode can only process one block report at a time • Solution: throttle the max concurrent block reports; extra reports are rejected, and the DN retries later [Diagram: NameNode (heap at 60%) receiving block reports from many DNs]
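
The reject-and-retry idea can be sketched with a non-blocking semaphore: a report either gets a slot immediately or is turned away, so nothing queues up in memory. The class and method names below are hypothetical, and the real NameNode RPC path is not modelled.

```python
import threading
from typing import Callable, List


class BlockReportThrottle:
    """Admits at most max_concurrent block reports; extra reports are rejected."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self.rejected = 0  # simple counter for the demo (not synchronized)

    def try_process(self, report: str, process: Callable[[str], None]) -> bool:
        # Non-blocking acquire: a rejected report is not buffered on the heap.
        if not self._slots.acquire(blocking=False):
            self.rejected += 1
            return False  # the DataNode is expected to retry later
        try:
            process(report)
            return True
        finally:
            self._slots.release()


throttle = BlockReportThrottle(max_concurrent=1)
processed: List[str] = []
nested_results: List[bool] = []


def process(report: str) -> None:
    processed.append(report)
    if report == "dn1":
        # While dn1's report occupies the only slot, dn2's report arrives.
        nested_results.append(throttle.try_process("dn2", process))


first = throttle.try_process("dn1", process)   # admitted
retry = throttle.try_process("dn2", process)   # dn2 retries after dn1 finished
```

The demo simulates the overlap by submitting `dn2` from inside `dn1`'s processing, which keeps it deterministic without real threads.
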
  30. Other optimization • Lock optimization on expensive operations – When processing a block report, release and re-acquire the lock for every storage – When processing getContentSummary, release the lock every N files • Config optimization – More handlers – Longer heartbeat interval – Longer full block report interval – Disable retry-cache and access-time
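
Releasing the lock every N files can be sketched as follows. This is a simplified model that assumes a plain `threading.Lock` in place of the FSNamesystem lock and a list of file sizes in place of the inode tree; `content_summary` and `yield_every` are hypothetical names.

```python
import threading
from typing import Iterable, Tuple


def content_summary(file_sizes: Iterable[int],
                    lock: threading.Lock,
                    yield_every: int) -> Tuple[int, int, int]:
    """Return (file count, total bytes, number of times the lock was released)."""
    count = total = yields = 0
    lock.acquire()
    try:
        for size in file_sizes:
            count += 1
            total += size
            if count % yield_every == 0:
                # Give waiting operations a chance, then continue the scan.
                lock.release()
                yields += 1
                lock.acquire()
    finally:
        lock.release()
    return count, total, yields


lock = threading.Lock()
summary = content_summary([10] * 2500, lock, yield_every=1000)
```

The trade-off in such a scheme is that the result is no longer a snapshot taken under one continuous lock hold, since other operations may run during each release.
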
  31. More tracing information • Record Operations that hold the FSNamesystem lock too long • Record QPS monitor on both server-side and client-side, push these data to our internal monitor system • Record failure reason and statistics of block allocation failure • Add log for slow block report processing
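
Recording operations that hold the lock too long can be sketched with a context manager that times each hold. `TracedLock`, its threshold and the injected clock are assumptions made for this illustration; the fake clock exists only to make the demo deterministic.

```python
import threading
import time
from contextlib import contextmanager
from typing import Callable, Iterator, List, Tuple


class TracedLock:
    """A lock that records every operation holding it longer than a threshold."""

    def __init__(self, threshold_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lock = threading.Lock()
        self._clock = clock
        self.threshold_s = threshold_s
        self.slow_ops: List[Tuple[str, float]] = []

    @contextmanager
    def hold(self, op_name: str) -> Iterator[None]:
        self._lock.acquire()
        start = self._clock()
        try:
            yield
        finally:
            held = self._clock() - start
            self._lock.release()
            if held > self.threshold_s:
                # In a real system this would go to a log or a metrics pipeline.
                self.slow_ops.append((op_name, held))


# Fake clock so the demo does not depend on real timing.
ticks = iter([0.0, 0.5, 10.0, 10.25])
traced = TracedLock(threshold_s=0.3, clock=lambda: next(ticks))

with traced.hold("getContentSummary"):   # held for 0.5 s, recorded
    pass
with traced.hold("mkdir"):               # held for 0.25 s, not recorded
    pass
```
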
  32. How We Efficiently Manage 100+ Clusters • We use HBase heavily at Xiaomi • 20~30 HBase clusters for sensitive services and businesses in each datacenter • With the rapid growth of the global business, there are now more than 5 datacenters distributed around the world • The total number of clusters also grows very quickly, making them hard to maintain
  33. How We Efficiently Manage 100+ Clusters • Initially… cluster-1 Canary cluster-2 Canary cluster-3 Canary cluster-n Canary
  34. Efficiently manage 100+ clusters [Diagram labels: ClustrerOne; Canary Task and Balancer Task for cluster-1, cluster-2, cluster-3 … cluster-n; Monitor System; ZooKeeper; NameService; metrics; generated configuration]
  35. Q&A

Editor's notes

  1. Introduce myself. Today I’ll share some of the work we did on scaling HDFS. (Spoken English.)
  2. Survey of Xiaomi phone sales: the main markets are India and China; we also have a good market share in Southeast Asia and Europe, but not in America.
  3. IoT sales: a variety of smart devices; they sell very well in China.
  4. Based on these phones and devices, we build lots of internet services and businesses; these are the most important of them.
  5. Most services on this page are well known, so I will introduce only some of the services that we developed ourselves. Talos is a data integration and distribution system. FDS is an object storage system, quite similar to AWS S3. EMQ is a cloud message queue, similar to AWS SQS.
  6. Our clusters can be divided into 2 parts, online and offline. These 2 scenarios are quite different and bring us different challenges. For online services, most HDFS clusters are deployed for HBase; we use HBase heavily, and there are more than 100 online HDFS clusters with more than 3000 nodes. The biggest challenge for online clusters is latency, especially the impact of slow nodes and slow disks. That part does not belong to this session, so I will not go into detail. On the other hand, for offline analysis we build several huge clusters; for these clusters the biggest challenge is scalability, that is, how to serve more data and files.
  7. Let’s take a look at the data growth. This is the chart for our largest cluster, from 4 years ago to the end of last year. Everybody knows what this means for an HDFS cluster: a single NameNode can hardly serve so much data.
  8. With the rapid growth, we hit the scalability limit in 2016. After a bunch of work, we successfully made the NameNode stable, but that would not last long; we had to enable federation. But the dependencies are too complex, and it’s almost impossible to divide the data into different namespaces. It’s also very hard for us to ask users to change their code to use viewfs.
  9. So the only way that makes sense for us is to build a huge single cluster. More accurately, we need to modify federation to make it work like a single HDFS cluster. How did we do that? Let’s first take a look at the defects of federation.
  10. In this solution, you need to assign a namespace to every directory, so you have to add a mount point for each one.
  11. If a path is not in the mount table, it is mapped to one of the default namespaces. In addition, to make the federation work like a single cluster, we support rename across NameNodes. To avoid code changes, we created a new filesystem that wraps viewfs. Finally, we moved the mount table to ZooKeeper and can update it automatically, so users don’t need to worry about the mount table. This is our whole solution for making Federation work as a single cluster; next, I’ll introduce each part in detail.
  12. First, we created a wrapper FileSystem that extends DistributedFileSystem. Our users don’t need to change any code, just update some configs. When the client initializes, it fetches the mount table from ZK. In addition, we add a watcher, so clients get the latest config whenever it is updated. Finally, we made an admin tool to operate on the mount table config in ZooKeeper.
  13. To make the federation transparent to users, there is still a lot of work to do; here is some of it. Another improvement worth mentioning is the trash optimization. By default, every user has only one trash folder, and since moveToTrash is a rename operation and we support rename across NameNodes, a user’s delete operation in another namespace may cause a rename across NameNodes. The cost of this operation is high, and we don’t want it to be triggered too frequently by moving data to the trash, so we did some optimization.
  14. I’ll first give an overview and then introduce some details. It’s very complex; I’ll try to explain it as clearly as I can. There are 5 steps to complete a federation-rename.
  15. OK, next is some experience with tuning a single NameNode.
  16. Let me show the reason first. Let’s assume that in the normal case heap usage is 60%. When the NN restarts, it starts receiving a lot of block reports. Block reports that are waiting to be processed are stored in memory, and the report rate is much higher than the processing rate, so the reports in memory keep accumulating until the heap is full.