Ce diaporama a bien été signalé.
Nous utilisons votre profil LinkedIn et vos données d’activité pour vous proposer des publicités personnalisées et pertinentes. Vous pouvez changer vos préférences de publicités à tout moment.
Building reliable real-time services
with Apache DistributedLog
@sijieg
Logs are Everywhere
● DB Storage Engines - WAL
● DB Replication - Binlog, Log shipping
● Distributed Consensus - Replicate...
Apache DistributedLog
Log Stream
An endless, totally ordered,
sequence of immutable records
Log Stream
1 2 3 4 5 6 7 11
1
2
1
3
1
4
1
5
1
6
1
7
Oldest Newest
Sequence Numbers - DLSN
1 2 3 4 5 6 7 11
1
2
1
3
1
4
1
5
1
6
1
7
Oldest Newest
DLSN - System Sequence Number
Sequence Numbers - Transaction ID
1 2 3 4 5 6 7 11
1
2
1
3
1
4
1
5
1
6
1
7
Oldest Newest
DLSN - System Sequence Number
Tra...
Sequence Numbers - Sequence ID
1 2 3 4 5 6 7 11
1
2
1
3
1
4
1
5
1
6
1
7
Oldest Newest
DLSN - System Sequence Number
Transa...
Writer & Readers
1 2 3 4 5 6 7 11
1
2
1
3
1
4
1
5
1
6
1
7
Oldest Newest
New records added here
Tailing Reads
(close to hea...
Read Parallelism
1 2 3 4 5 6 7 11
1
2
1
3
1
4
1
5
1
6
1
7
Oldest Newest
Read from multiple positions in parallel
Log Segments
1 2 3 4 5 6 7 11
1
2
1
3
1
4
1
5
1
6
1
7
Oldest Newest
Log Segment
X
Log Segment
X+1
Log Segment
X+2
Log Segment Store
1 2 3 4 5 6 7 11
1
2
1
3
1
4
1
5
1
6
1
7
Oldest Newest
Log Segment
X
Log Segment
X+1
Log Segment
X+2
Apa...
Log Stream Metadata
1 2 3 4 5 6 7
1
1
1
2
1
3
1
4
1
5
1
6
1
7
Oldest Newest
Writer Reader
Reader
Reader
- List of segments...
Namespace
1 2 3 4 5 6 7
1
1
1
2
1
3
1
4
1
5
1
6
1
7
Oldest Newest
Writer Reader
Reader
Reader
/manhattan/stream-x
...
/ads...
Architecture
MetadataStore
Log Segment
Store
(BK)
- Segments
Architecture
MetadataStore
Log Segment
Store
(BK)
Log Streams - Abstraction & Naming
- Data Management
- Efficient Write &...
Architecture
MetadataStore
Log Segment
Store
(BK)
Log Streams - Abstraction & Naming
- Data Management
- Efficient Write &...
Architecture
MetadataStore
Log Segment
Store
(BK)
Cold
Storage
(HDFS)
Log Streams - Abstraction & Naming
- Data Management...
Data Flow
Write
Client
Write
Proxy
Bookie
Bookie
Bookie
Read
Proxy
Read
Client
Read
Client
Read
Client
1. write records
4....
Consensus
Consensus - Primary Leader Approach
Consensus - Log Replication
Consensus - Safety Ensurance
● Election Safety - CAS operation on metadata store
○ Log Segment Sequence Number monotonical...
User Cases
Architecture
MetadataStore
Log Segment
Store
(BK)
Cold
Storage
(HDFS)
Log Streams - Abstraction & Naming
- Data Management...
Database
Stronger Consistency
Stronger Consistency in Manhattan
MH
Coordinator
MH
Coordinator
MH
Coordinator
1 2 3 4 5 6 7
1
1
1
2
1
3
1
4
1
5
1
6
1
7
O...
Self-Serve Pub/Sub
Message Delivery
Topic
Partitioned Pub/Sub
1 2 3 4 5 6 7
1
1
1
2
1
3
1
4
1
5
1
6
1
7
1
8
1
9
2
0
2
1
2
2
1 2 3 4 5 6 7
1
1
1
2
1
3
1
4
1
5
...
Deferred RPC
Reliable Queuing
Reliable RPC System
E D E A E D A A E D E A E D A E D E A E D A
Web
Server
RPC
Queue
RPC
Worker
RPC
Worker
Service A
Servi...
Scale at Twitter
Performance - Basic (GCP)
● Disk & Network Bound
● 1 Journal Disk + 5 Ledger Diks
● Each Disk can write/read at ~220MB/sec...
Performance - Effect of Record Size
Applications at Twitter
● Manhattan Key/Value Store - Stronger Consistency
● Durable Deferred RPC - Journal
● Real-time se...
Scale at Twitter
● O(1) trillion records per day, O(10) petabytes per day
● O(10) thousands streams, O(1) million live log...
Future
Not Just Messaging
● Stream - Events between services
○ Persistent
○ Rewindable
○ Replayable
○ Time independent
● Unificat...
Apache DistributedLog (incubating)
● Open sourced on 05/09/2016.
● Landed at Apache Incubator on 06/25/2016.
● Website
○ h...
Apache DistributedLog (incubating)
● Mail List -
dev@distributedlog.incubator.apache.org
● Jira - https://issues.apache.or...
Q/A
● Twitter: @sijieg
● Email: sijie@apache.org
Appendix
● Kafka vs DistributedLog
Kafka vs DL - Overall
Kafka vs DL - Data Segmentation
Kafka vs DL - Data Retention
● Kafka
○ Time based Retention
○ Log compaction by keys
● DL
○ Time based Retention (messagin...
Kafka vs DL - Cluster Expand
● Kafka - Partition Rebalance
○ Adding new brokers
○ Partitions outgrow of brokers’ capacity
...
Kafka vs DL - Writer
● Kafka
○ Multiple-Writers Semantic via Brokers
● DL
○ Multiple-Writers Semantic via Write Proxies
(m...
Kafka vs DL - Reader
● Kafka
○ Both writes and reads are served by the leader brokers
○ Polling
● DL
○ Reads from any stor...
Kafka vs DL - Replication Scheme
● Kafka
○ ISR Replication
○ Follower brokers catchup with Leader broker
● DL
○ Quorum-Vot...
Kafka vs DL - Storage/Durability
● Kafka
○ File (set of files) per partition
○ Only write to filesystem page cache
● DL (B...
Prochain SlideShare
Chargement dans…5
×

Apache distributed log @ q con 2017

726 vues

Publié le

Apache DistributedLog (incubating) is a low-latency, high-throughput replicated log service. Sijie Guo shares how Twitter has used DistributedLog as the real-time data foundation in production for years, supporting services like distributed databases, pub-sub messaging, and real-time stream computing and delivering more than 1.5 trillion (17 PB) events per day.

Topics include:

- Overview of Apache DistributedLog
- How Twitter uses DistributedLog to build strong consistency in its databases
- How Twitter uses DistributedLog for data replication across regions
- How Twitter uses DistributedLog for pub-sub and real-time stream computing
- Lessons learned from production

Publié dans : Technologie
  • Soyez le premier à commenter

Apache distributed log @ q con 2017

  1. 1. Building reliable real-time services with Apache DistributedLog @sijieg
  2. 2. Logs are Everywhere ● DB Storage Engines - WAL ● DB Replication - Binlog, Log shipping ● Distributed Consensus - Replicated log ● Messaging/Pub-Sub - Kafka
  3. 3. Apache DistributedLog
  4. 4. Log Stream An endless, totally ordered, sequence of immutable records
  5. 5. Log Stream 1 2 3 4 5 6 7 11 1 2 1 3 1 4 1 5 1 6 1 7 Oldest Newest
  6. 6. Sequence Numbers - DLSN 1 2 3 4 5 6 7 11 1 2 1 3 1 4 1 5 1 6 1 7 Oldest Newest DLSN - System Sequence Number
  7. 7. Sequence Numbers - Transaction ID 1 2 3 4 5 6 7 11 1 2 1 3 1 4 1 5 1 6 1 7 Oldest Newest DLSN - System Sequence Number Transaction ID - Application Sequence Number E.g. Offset or Timestamp
  8. 8. Sequence Numbers - Sequence ID 1 2 3 4 5 6 7 11 1 2 1 3 1 4 1 5 1 6 1 7 Oldest Newest DLSN - System Sequence Number Transaction ID - Application Sequence Number E.g. Offset or Timestamp Sequence ID
  9. 9. Writer & Readers 1 2 3 4 5 6 7 11 1 2 1 3 1 4 1 5 1 6 1 7 Oldest Newest New records added here Tailing Reads (close to head of stream) Catching-up Reads (rewind to any positions)
  10. 10. Read Parallelism 1 2 3 4 5 6 7 11 1 2 1 3 1 4 1 5 1 6 1 7 Oldest Newest Read from multiple positions in parallel
  11. 11. Log Segments 1 2 3 4 5 6 7 11 1 2 1 3 1 4 1 5 1 6 1 7 Oldest Newest Log Segment X Log Segment X+1 Log Segment X+2
  12. 12. Log Segment Store 1 2 3 4 5 6 7 11 1 2 1 3 1 4 1 5 1 6 1 7 Oldest Newest Log Segment X Log Segment X+1 Log Segment X+2 Apache BookKeeper
  13. 13. Log Stream Metadata 1 2 3 4 5 6 7 1 1 1 2 1 3 1 4 1 5 1 6 1 7 Oldest Newest Writer Reader Reader Reader - List of segments - Transaction Id Index - Truncation point - ... Stream Metadata Updates Notifications
  14. 14. Namespace 1 2 3 4 5 6 7 1 1 1 2 1 3 1 4 1 5 1 6 1 7 Oldest Newest Writer Reader Reader Reader /manhattan/stream-x ... /ads/stream_xxx /ads/stream_yyy Namespace Lookup
  15. 15. Architecture MetadataStore Log Segment Store (BK) - Segments
  16. 16. Architecture MetadataStore Log Segment Store (BK) Log Streams - Abstraction & Naming - Data Management - Efficient Write & Read - Intra-cluster & Geo Replication - Segments - Raw Streams
  17. 17. Architecture MetadataStore Log Segment Store (BK) Log Streams - Abstraction & Naming - Data Management - Efficient Write & Read - Intra-cluster & Geo Replication - Segments - Raw Streams Write Proxy Read Proxy - Ownership Tracking - Batching, Compression Record Cache - Rate Limiting, Quota - - Serving
  18. 18. Architecture MetadataStore Log Segment Store (BK) Cold Storage (HDFS) Log Streams - Abstraction & Naming - Data Management - Efficient Write & Read - Intra-cluster & Geo Replication - Segments - Raw Streams Write Proxy Read Proxy - Ownership Tracking - Batching, Compression Record Cache - Rate Limiting, Quota - - Serving
  19. 19. Data Flow Write Client Write Proxy Bookie Bookie Bookie Read Proxy Read Client Read Client Read Client 1. write records 4. acknowledge 2. transmit buffer 3. Flush - Write a batched entry to bookies 5. Commit - Write Control Record 6. Long poll read 7. Speculative Read 8. Cache Records 9. Long poll read
  20. 20. Consensus
  21. 21. Consensus - Primary Leader Approach
  22. 22. Consensus - Log Replication
  23. 23. Consensus - Safety Ensurance ● Election Safety - CAS operation on metadata store ○ Log Segment Sequence Number monotonically increase ○ A log segment sequence number is guaranteed to only hand over to a writer once ● Log Segment Append-Only ○ A writer can only append entries to the log segment that is allocated to it ● Fencing - Termination mechanism of a log segment ○ No entries can be appended to a log segment if it is fenced
  24. 24. User Cases
  25. 25. Architecture MetadataStore Log Segment Store (BK) Cold Storage (HDFS) Log Streams - Abstraction & Naming - Data Management - Efficient Write & Read - Intra-cluster & Geo Replication - Segments - Raw Streams Write Proxy Read Proxy - Ownership Tracking - Batching, Compression Record Cache - Rate Limiting, Quota - - Serving - Applications - Different Consumer models DBs - e.g., Twitter’s Manhattan Deferred RPC (queuing) Self-serve Pub/Sub Stream Computing Cross DC Replication
  26. 26. Database Stronger Consistency
  27. 27. Stronger Consistency in Manhattan MH Coordinator MH Coordinator MH Coordinator 1 2 3 4 5 6 7 1 1 1 2 1 3 1 4 1 5 1 6 1 7 Oldest Newest MH Replica MH Replica MH Replica 1 2 3
  28. 28. Self-Serve Pub/Sub Message Delivery
  29. 29. Topic Partitioned Pub/Sub 1 2 3 4 5 6 7 1 1 1 2 1 3 1 4 1 5 1 6 1 7 1 8 1 9 2 0 2 1 2 2 1 2 3 4 5 6 7 1 1 1 2 1 3 1 4 1 5 1 6 1 7 1 8 1 9 2 0 2 1 2 2 1 2 3 4 5 6 7 1 1 1 2 1 3 1 4 1 5 1 6 1 7 1 8 1 9 2 0 2 1 2 2 New messages appended here Reads from anyposition - last position stored in offset store - rewind to any positions -rewind by time (e.g 15 mins ago)
  30. 30. Deferred RPC Reliable Queuing
  31. 31. Reliable RPC System E D E A E D A A E D E A E D A E D E A E D A Web Server RPC Queue RPC Worker RPC Worker Service A Service B Service C1 2 3 4
  32. 32. Scale at Twitter
  33. 33. Performance - Basic (GCP) ● Disk & Network Bound ● 1 Journal Disk + 5 Ledger Diks ● Each Disk can write/read at ~220MB/second ● 6 log streams, 1 write proxy + 3 bookies ● 1 writer + 1 tailing reader => 2 million records/second ● 3 catch-up raders => 7.5 million records/second ● End-to-End Latency : within 30ms when network is around 30% untilized
  34. 34. Performance - Effect of Record Size
  35. 35. Applications at Twitter ● Manhattan Key/Value Store - Stronger Consistency ● Durable Deferred RPC - Journal ● Real-time search indexing - Change propagation ● Self-serve Pub/Sub - Message Delivery, Ads Pipeline ● Stream Computing ○ Source & Sink ○ Stateful Processing in Heron (coming soon) ● Reliable cross datacenter replication ● ...
  36. 36. Scale at Twitter ● O(1) trillion records per day, O(10) petabytes per day ● O(10) thousands streams, O(1) million live log segments ● O(10^2 ) bookies, O(10^3 ) proxies ● Record size from 100 bytes to 20KB to even more ● Data is kept from hours to days, even up to a year
  37. 37. Future
  38. 38. Not Just Messaging ● Stream - Events between services ○ Persistent ○ Rewindable ○ Replayable ○ Time independent ● Unification of Messaging and Storage
  39. 39. Apache DistributedLog (incubating) ● Open sourced on 05/09/2016. ● Landed at Apache Incubator on 06/25/2016. ● Website ○ http://distributedlog.io/ ○ http://incubator.apache.org/projects/distributedlog.ht ml ● Code - https://github.com/apache/incubator-distributedlog
  40. 40. Apache DistributedLog (incubating) ● Mail List - dev@distributedlog.incubator.apache.org ● Jira - https://issues.apache.org/jira/browse/DL ● Project Ideas - https://cwiki.apache.org/confluence/display/DL/Project+Id eas ● Paper: “DistributedLog: A high performance replicated log service” (ICDE 2017)
  41. 41. Q/A ● Twitter: @sijieg ● Email: sijie@apache.org
  42. 42. Appendix ● Kafka vs DistributedLog
  43. 43. Kafka vs DL - Overall
  44. 44. Kafka vs DL - Data Segmentation
  45. 45. Kafka vs DL - Data Retention ● Kafka ○ Time based Retention ○ Log compaction by keys ● DL ○ Time based Retention (messaging) ○ Explicit truncation (database, replicated state machines)
  46. 46. Kafka vs DL - Cluster Expand ● Kafka - Partition Rebalance ○ Adding new brokers ○ Partitions outgrow of brokers’ capacity ○ Adding new partitions ● DL ○ New log segments will automatically allocated to new storage nodes ○ Scaling proxies (cpu, memory) independent of scaling storage
  47. 47. Kafka vs DL - Writer ● Kafka ○ Multiple-Writers Semantic via Brokers ● DL ○ Multiple-Writers Semantic via Write Proxies (messaging) ○ Single-Writer Semantic using Core Library (database, replicated state machines) ■ Fencing, Exclusive Writer
  48. 48. Kafka vs DL - Reader ● Kafka ○ Both writes and reads are served by the leader brokers ○ Polling ● DL ○ Reads from any storage replicas ○ Long poll + Speculative Reads
  49. 49. Kafka vs DL - Replication Scheme ● Kafka ○ ISR Replication ○ Follower brokers catchup with Leader broker ● DL ○ Quorum-Vote Replication ○ Ack Quorum is adjustable ○ Replication Repair
  50. 50. Kafka vs DL - Storage/Durability ● Kafka ○ File (set of files) per partition ○ Only write to filesystem page cache ● DL (BookKeeper) ○ Interleaved Storage ○ All writes are persisted to disk via explicit fsync before acknowledges ○ Physical I/O Isolation

×