
Distributed Storage System, May 2014

An architecture document for building a distributed storage system, plus a sample distributed storage implementation from the author.



  1. 1. DISTRIBUTED STORAGE SYSTEM Mr. Dương Công Lợi Company: VNG-Corp Tel: +84989510016 / +84908522017 Email: loidc@vng.com.vn / loiduongcong@gmail.com
  2. 2. CONTENTS  1. What is distributed-computing system?  2. Principle of distributed database/storage system  3. Distributed storage system paradigm  4. Canonical problems in distributed systems  5. Common solution for canonical problems in distributed systems  6. UniversalDistributedStorage  7. Appendix
  3. 3. 1. WHAT IS DISTRIBUTED-COMPUTING SYSTEM?  Distributed-Computing is the process of solving a computational problem using a distributed system.  A distributed system is a computing system in which a number of components on multiple computers cooperate by communicating over a network to achieve a common goal.
  4. 4. DISTRIBUTED DATABASE/STORAGE SYSTEM  In a distributed database system, the database is stored on several computers.  A distributed database is a collection of multiple, logically interrelated databases distributed over a computer network.
  5. 5. DISTRIBUTED SYSTEM ADVANTAGES  Advantages  Avoids bottlenecks & single points of failure  More scalability  More availability  Routing models  Client routing: the client sends its request to the appropriate server to read/write data  Server routing: a server forwards the client's request to the appropriate server and returns the result to that client * the two models above can be combined in one system
  6. 6. DISTRIBUTED STORAGE SYSTEM  Store the data {1,2,3,4,6,7,8} on 1 server  Or store it across 3 distributed servers, e.g. {1,2,3}, {4,6}, {7,8}
  7. 7. 2. PRINCIPLE OF DISTRIBUTED DATABASE/STORAGE SYSTEM  Shard by data key and store each key on the appropriate server using a Distributed Hash Table (DHT)  The DHT hash function must support consistent hashing:  Uniform distribution of generated values  Consistent  Jenkins and Murmur are good choices; others such as MD5 and SHA are slower
  8. 8. 3. DISTRIBUTED STORAGE SYSTEM PARADIGM  Data Hashing/Addressing  Determine the server a data item is stored on  Data Replication  Store data on multiple server nodes for higher availability and fault tolerance
  9. 9. DISTRIBUTED STORAGE SYSTEM ARCHITECTURE  Data Hashing/Addressing  Use the DHT to map each server (by its server name) to a number, placing it on a circle called the key space  Use the DHT to hash each data key and find the server that stores it: successor(k) = ceiling(addressing(k))  successor(k): the server that stores k (ring diagram: server1, server2, server3 placed on the key-space circle starting from 0)
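The successor(k) lookup can be illustrated in a few lines of code. This is a minimal sketch, assuming a 32-bit key space and using MD5 as a dependency-free stand-in for the Murmur/Jenkins hashing the deck recommends; the names addressing() and Ring are illustrative, not part of UniversalDistributedStorage.

```python
# Minimal consistent-hashing ring sketch. MD5 stands in for Murmur/Jenkins
# to keep the example stdlib-only; all names are illustrative.
import bisect
import hashlib

def addressing(value: str) -> int:
    """Hash a server name or data key onto the key-space circle [0, 2**32)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2 ** 32)

class Ring:
    def __init__(self, servers):
        # Sorted (position, server) pairs on the circle.
        self.points = sorted((addressing(s), s) for s in servers)

    def successor(self, key: str) -> str:
        """successor(k): first server at or after addressing(k), wrapping past 0."""
        positions = [p for p, _ in self.points]
        idx = bisect.bisect_left(positions, addressing(key))
        return self.points[idx % len(self.points)][1]

ring = Ring(["server1", "server2", "server3"])
print(ring.successor("k1"))  # the server that stores key k1
```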
  10. 10. DISTRIBUTED STORAGE SYSTEM ARCHITECTURE  Addressing – Virtual nodes  Each server node is expanded into several node IDs so that keys are evenly distributed and load is balanced: Server1: n1, n4, n6; Server2: n2, n7; Server3: n3, n5, n8 (ring diagram: virtual nodes n1…n8 interleaved around the key-space circle)
  11. 11. DISTRIBUTED STORAGE SYSTEM ARCHITECTURE  Data Replication  Data k1 is stored on server1 as master and on server2 as slave (ring diagram: k1 placed on the key-space circle between its master and slave servers)
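Continuing the ring sketch above, master/slave placement can be sketched as taking the next distinct servers clockwise from the key: the first is the master, the rest are slaves. The helper name and the replica count of 2 are assumptions for illustration.

```python
# Replica placement sketch, reusing Ring and addressing() from the previous sketch.
def replicas(ring, key, n=2):
    """Return the master plus n-1 distinct slave servers clockwise from the key."""
    positions = [p for p, _ in ring.points]
    start = bisect.bisect_left(positions, addressing(key))
    owners = []
    for i in range(len(ring.points)):
        server = ring.points[(start + i) % len(ring.points)][1]
        if server not in owners:
            owners.append(server)
        if len(owners) == n:
            break
    return owners  # owners[0] is the master, the rest are slaves

print(replicas(ring, "k1", n=2))  # e.g. ['server1', 'server2']
```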
  12. 12. 4. CANONICAL PROBLEMS IN DISTRIBUTED SYSTEMS  Distributed transactions: ACID (Atomicity, Consistency, Isolation, Durability) requirement  Distributed data independence  Fault tolerance  Transparency
  13. 13. 5. COMMON SOLUTION FOR CANONICAL PROBLEMS IN DISTRIBUTED SYSTEMS  Atomicity and Consistency with the Two-Phase Commit protocol  Distributed data independence with a consistent hashing algorithm  Fault tolerance with leader election, multi-master and data replication  Transparency with server routing, so the client sees the distributed system as a single server
  14. 14. TWO-PHASE COMMIT PROTOCOL  What is this?  Two-phase commit is a transaction protocol designed for the complications that arise with distributed resource managers.  Two-phase commit technology is used for hotel and airline reservations, stock market transactions, banking applications, and credit card systems.  With a two-phase commit protocol, the distributed transaction manager employs a coordinator to manage the individual resource managers. The commit process proceeds as follows:
  15. 15. TWO-PHASE COMMIT PROTOCOL  Phase 1: Obtaining a Decision  Step 1  The coordinator asks all participants to prepare to commit transaction Ti.  Ci adds the record <prepare T> to the log and forces the log to stable storage (a log is a file which maintains a record of all changes to the database)  It then sends prepare T messages to all sites where T executed
  16. 16. TWO-PHASE COMMIT PROTOCOL  Phase 1: Making a Decision  Step 2  Upon receiving the message, the transaction manager at each site determines whether it can commit the transaction  if not: add a record <no T> to the log and send an abort T message to Ci  if the transaction can be committed, then: 1) add the record <ready T> to the log 2) force all records for T to stable storage 3) send a ready T message to Ci
  17. 17. TWO-PHASE COMMIT PROTOCOL  Phase 2: Recording the Decision  Step 1  T can be committed if Ci received a ready T message from all the participating sites; otherwise T must be aborted.  Step 2  The coordinator adds a decision record, <commit T> or <abort T>, to the log and forces the record onto stable storage. Once the record is in stable storage, it cannot be revoked (even if failures occur)  Step 3  The coordinator sends a message to each participant informing it of the decision (commit or abort)  Step 4  Participants take the appropriate action locally.
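The two phases above can be condensed into a small sketch. This is a minimal in-process illustration, not the deck's implementation: the Participant and Coordinator classes and the string log entries are assumptions standing in for the resource managers, network messages, and stable storage described on the slides.

```python
# Two-phase commit sketch; in-process objects stand in for networked sites.
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit   # whether this site can commit (illustrative)
        self.log = []

    def prepare(self, txn) -> bool:
        # Phase 1: vote; force <ready T> (or <no T>) to the local log first.
        self.log.append(f"<ready {txn}>" if self.can_commit else f"<no {txn}>")
        return self.can_commit

    def finish(self, txn, commit: bool):
        # Phase 2: apply the coordinator's decision locally.
        self.log.append(f"<{'commit' if commit else 'abort'} {txn}>")

class Coordinator:
    def __init__(self, participants):
        self.participants = participants
        self.log = []

    def run(self, txn) -> bool:
        self.log.append(f"<prepare {txn}>")                    # forced to stable storage
        votes = [p.prepare(txn) for p in self.participants]    # ask every site
        decision = all(votes)
        self.log.append(f"<{'commit' if decision else 'abort'} {txn}>")  # final, not revocable
        for p in self.participants:
            p.finish(txn, decision)                            # inform every participant
        return decision

print(Coordinator([Participant("site1"), Participant("site2")]).run("T1"))                       # True
print(Coordinator([Participant("site1"), Participant("site2", can_commit=False)]).run("T2"))     # False
```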
  18. 18. TWO-PHASE COMMIT PROTOCOL  Costs and Limitations  If one database server is unavailable, none of the servers gets the updates.  This is correctable through network tuning and correctly building the data distribution through database optimization techniques.
  19. 19. LEADER ELECTION  Some leader election algorithms that can be used: LCR (LeLann-Chang-Roberts), Peterson, HS (Hirschberg-Sinclair)
  20. 20. LEADER ELECTION  Bully Leader Election algorithm
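As a rough illustration of the Bully approach (a node that notices the leader is gone asks all higher-ID nodes to take over; if none of them responds, it declares itself leader and announces the result), here is a minimal sketch. The `alive` callback and the node IDs are assumptions standing in for real failure detection and messaging.

```python
# Bully election sketch; alive(node_id) stands in for pinging that node.
def bully_election(my_id, all_ids, alive):
    """Return the elected leader id as seen from node `my_id`."""
    higher = [i for i in all_ids if i > my_id and alive(i)]
    if not higher:
        # No higher-ID node answered: this node becomes leader and would
        # broadcast a coordinator/victory message to all other nodes.
        return my_id
    # Otherwise the election continues at the highest responding node.
    return bully_election(max(higher), all_ids, alive)

ids = [1, 2, 3, 4, 5]
up = {1, 2, 3, 4}                                # node 5 (the old leader) has died
print(bully_election(2, ids, up.__contains__))   # -> 4, the highest node still alive
```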
  21. 21. MULTI MASTER  Multi-master replication  Problems of multi-master replication
  22. 22. MULTI MASTER  Solution, 2 candidate models:  Two-phase commit (always consistent)  Asynchronous data sync among multiple nodes  Stays active even if some nodes die  Faster than 2PC
  23. 23. MULTI MASTER  Asynchronous data sync  Data is stored on the main master (called the sub-leader), then posted to a queue to be synced to the other masters.
  24. 24. MULTI MASTER  Asynchronous data sync (diagram: requests req1 and req2, Server1 (leader), Server2, a data queue between them; req2 is forwarded to the leader)
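A minimal sketch of the queue-based asynchronous sync described above: the leader applies the write locally, enqueues it, and the other masters are updated later when the queue is drained. The in-memory dict, the class names, and the explicit `sync_once()` call are assumptions for illustration; the real system presumably drains the queue on a background thread.

```python
# Asynchronous multi-master sync sketch; names and storage are illustrative.
import queue

class LeaderNode:
    def __init__(self, followers):
        self.store = {}
        self.followers = followers
        self.sync_queue = queue.Queue()

    def write(self, key, value):
        self.store[key] = value             # 1) apply the write locally on the leader
        self.sync_queue.put((key, value))   # 2) enqueue it for asynchronous replication

    def sync_once(self):
        # A background thread would drain this queue continuously; here it is
        # drained explicitly to keep the sketch deterministic.
        while not self.sync_queue.empty():
            key, value = self.sync_queue.get()
            for follower in self.followers:
                follower.store[key] = value

class FollowerNode:
    def __init__(self):
        self.store = {}

follower = FollowerNode()
leader = LeaderNode([follower])
leader.write("k1", "v1")
print(follower.store)   # {}  -> not yet replicated
leader.sync_once()
print(follower.store)   # {'k1': 'v1'}
```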
  25. 25. UNIVERSALDISTRIBUTEDSTORAGE a distributed storage system
  26. 26. 6. UNIVERSALDISTRIBUTEDSTORAGE  UniversalDistributedStorage is a distributed storage system developed for:  Distributed transactions (ACID)  Distributed data independence  Fault tolerance  Leader election (decides when server nodes join or leave)  Replication with multi-master replication  Transparency
  27. 27. UNIVERSALDISTRIBUTEDSTORAGE ARCHITECTURE  Overview (diagram: each server stacks a Business Layer, a Distributed Layer and a Storage Layer)
  28. 28. UNIVERSALDISTRIBUTEDSTORAGE ARCHITECTURE  Internal Overview (diagram: client request(s) enter the Business Layer; the Distributed Layer calls dataLocate()/dataRemote() with remote queuing; the Storage Layer answers localData(); result(s) flow back to the client)
  29. 29. ARCHITECTURE OVERVIEW
  30. 30. UNIVERSALDISTRIBUTEDSTORAGE FEATURE  Data hashing/addressing  Uses the Murmur hash function
  31. 31. UNIVERSALDISTRIBUTEDSTORAGE FEATURE  Leader election  Uses the Bully leader election algorithm
  32. 32. UNIVERSALDISTRIBUTEDSTORAGE FEATURE  Multi-master replication  Uses asynchronous data sync among server nodes
  33. 33. UNIVERSALDISTRIBUTEDSTORAGE STATISTICS  System information:  3 machines, 8GB RAM, Core i5 3.2GHz  LAN/WAN network  7 server nodes running on the 3 machines above  Concurrent writes of 16,500,000 items in 3680s, rate ~4480 req/sec (measured at the client)  Concurrent reads of 16,500,000 items in 1458s, rate ~11,320 req/sec (measured at the client) * This is not the limit of the system; the bottleneck is at the clients (this test used 3 client threads)
  34. 34. Q & A Contact: Duong Cong Loi loidc@vng.com.vn loiduongcong@gmail.com https://www.facebook.com/duongcong.loi
  35. 35. 7. APPENDIX
  36. 36. APPENDIX - 001  How to join/leave server(s) (diagram: 1. a join/leave request arrives at Server A/B/C; 2. it is forwarded to the leader server; 3. the leader processes the join/leave; 4. the leader broadcasts the result)
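A small sketch of the join/leave flow in the diagram above, with in-memory objects standing in for the real servers and RPCs; the class names, method names, and membership view are illustrative assumptions, not the deck's API.

```python
# Join/leave flow sketch; only the leader changes membership, then broadcasts it.
class Node:
    def __init__(self, name):
        self.name = name
        self.view = [name]            # this node's view of the membership

class Leader(Node):
    def __init__(self, name, nodes):
        super().__init__(name)
        self.nodes = nodes            # every node the leader can reach
        self.view = sorted([name] + [n.name for n in nodes])

    def process(self, server, joining=True):
        # Step 3: the leader updates the membership view.
        view = set(self.view)
        if joining:
            view.add(server)
        else:
            view.discard(server)
        # Step 4: broadcast the result to all nodes (including itself).
        for node in self.nodes + [self]:
            node.view = sorted(view)
        return sorted(view)

a, b = Node("serverA"), Node("serverB")
leader = Leader("leaderServer", [a, b])
# Steps 1-2: serverC's join request arrives at some node and is forwarded to the leader.
print(leader.process("serverC", joining=True))
print(b.view)   # every node now shares the same membership view
```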
  37. 37. APPENDIX - 002  How to move data when server(s) join/leave  Determine the data that needs to be moved  Move that data asynchronously on a thread, and control the speed of the move
  38. 38. APPENDIX - 003  How to detect that the leader or a sub-leader has died  Easily detected by polling the connection
  39. 39. APPENDIX - 004  How to make multiple virtual nodes for one server  Easily generate multiple virtual nodes for one server by hashing the server name  Ex: to make 200 virtual nodes for server ‘photoTokyo’, use the hash values of photoTokyo1, photoTokyo2, …, photoTokyo200
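Reusing the addressing() helper from the ring sketch earlier, the photoTokyo example can be expressed directly; the server name and the 200-node count come from the appendix, while the function name and return shape are illustrative.

```python
# Virtual-node generation sketch, reusing addressing() from the ring sketch above.
def virtual_nodes(server_name, count=200):
    """Map 'photoTokyo' to ring positions photoTokyo1 .. photoTokyo200."""
    return sorted((addressing(f"{server_name}{i}"), server_name)
                  for i in range(1, count + 1))

points = virtual_nodes("photoTokyo")
print(len(points), points[:2])   # 200 ring positions, all owned by 'photoTokyo'
```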
  40. 40. APPENDIX - 005  For fast data moving  Use a Bloom filter to detect whether the hash value of a data key exists  Use a store that keeps all data keys of the local server
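A minimal Bloom filter sketch for the key-existence check mentioned above. The bit-array size, hash count, and the MD5-based double hashing are assumptions for illustration; the real system may well use a library implementation instead.

```python
# Minimal Bloom filter sketch; parameters are illustrative, not tuned.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 16, hashes=4):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        digest = hashlib.md5(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big")
        return [(h1 + i * h2) % self.size for i in range(self.hashes)]

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

bf = BloomFilter()
bf.add("key:42")
print(bf.might_contain("key:42"), bf.might_contain("key:43"))  # True, (almost surely) False
```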
  41. 41. APPENDIX - 006  How to avoid network tuning issues  Use a client connection pool with an up-front screening strategy; it avoids many hanging connections when making remote calls over the network between two servers
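A minimal sketch of a connection pool with a screening step: before a pooled connection is handed out it is checked, and anything that looks hung or dead is discarded and replaced. The `factory` and `is_healthy` callables are assumptions supplied by the caller; socket and RPC details are deliberately left out.

```python
# Screened connection pool sketch; factory/is_healthy are caller-supplied.
import queue

class ScreenedPool:
    def __init__(self, factory, is_healthy, size=4):
        self.factory = factory          # creates a new connection
        self.is_healthy = is_healthy    # screening check before reuse
        self.idle = queue.Queue()
        for _ in range(size):
            self.idle.put(factory())

    def acquire(self):
        # Screen pooled connections first; skip any that fail the check.
        while not self.idle.empty():
            conn = self.idle.get()
            if self.is_healthy(conn):
                return conn
        return self.factory()           # nothing healthy left: open a fresh one

    def release(self, conn):
        self.idle.put(conn)

# Usage with hypothetical helpers open_conn() / ping(conn):
# pool = ScreenedPool(open_conn, ping)
# conn = pool.acquire(); ...; pool.release(conn)
```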
