3. LinkedIn by Numbers
• World’s largest professional network
• 187M+ members worldwide as of Q3 2012
• Growing at a rate of two members per second
• 85 of the Fortune 100 companies use Talent Solutions to hire
• > 2.6M company pages
• > 4B search queries
• 75K+ developers leveraging our APIs
• 1.3M unique publishers
4. The Consequence of Specialization in Data Systems
Data Flow is essential
Data Consistency is critical !!!
5. Solution: Databus
[Diagram: the Primary DB emits Data Change Events into Databus, which applies standardization and pushes updates to the Search Index, the Graph Index, and Read Replicas.]
6. Two Ways
• Application code dual writes to the database and a pub-sub system
  – Easy on the surface
  – Consistent?
• Extract changes from the database commit log
  – Tough, but possible
  – Consistent!!!
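The difference between the two approaches can be made concrete with a toy sketch (all names here are illustrative, not LinkedIn code): with dual writes, a failure between the database write and the pub-sub publish silently loses an event, while events derived from the commit log can never diverge from the database.

```python
# Sketch: why dual writes can diverge while commit-log extraction cannot.
db, commit_log, pubsub = {}, [], []

def dual_write(key, value, pubsub_fails=False):
    """Application writes the DB and the pub-sub system separately."""
    db[key] = value
    commit_log.append((key, value))
    if not pubsub_fails:
        pubsub.append((key, value))   # a crash here loses the event

def extract_from_log(cursor):
    """Derive events from the commit log: always matches the DB."""
    return commit_log[cursor:], len(commit_log)

dual_write("member:1", "alice")
dual_write("member:2", "bob", pubsub_fails=True)  # DB and stream now disagree

events, cursor = extract_from_log(0)
assert dict(events) == db            # log-derived view is consistent
assert dict(pubsub) != db            # dual-written view has drifted
```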
7. Key Design Decisions: Semantics
• Logical clocks attached to the source
  – Physical offsets could be used for internal transport
  – Simplifies data portability
• Pull model
  – Restarts are simple
  – Derived State = f (Source state, Clock)
  – + Idempotence = Timeline Consistent!
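The "Derived State = f (Source state, Clock)" equation can be sketched in a few lines (an illustrative model, not the actual Databus implementation): each event carries a logical clock value (an SCN), and applying events is idempotent, so a consumer that restarts and re-pulls converges to the same derived state.

```python
# Events carry a logical clock (SCN); apply() skips anything at or below
# the consumer's current SCN, so replay after a restart is harmless.
events = [(100, "a", 1), (101, "b", 2), (102, "a", 3)]  # (scn, key, value)

def apply(state, batch):
    scn, derived = state
    for e_scn, key, value in batch:
        if e_scn <= scn:            # idempotence: skip already-applied events
            continue
        derived = {**derived, key: value}
        scn = e_scn
    return scn, derived

s1 = apply((0, {}), events)
s2 = apply(apply((0, {}), events[:2]), events)   # restart mid-stream, re-pull all
assert s1 == s2 == (102, {"a": 3, "b": 2})       # timeline consistent
```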
8. Key Design Decisions: Systems
• Isolate fast consumers from slow consumers
  – Workload separation between online, catch-up, and bootstrap
• Isolate sources from consumers
  – Schema changes
  – Physical layout changes
  – Speed mismatch
• Schema-aware
  – Filtering, projections
  – Typically network-bound; can burn more CPU
9. Requirements
• Timeline consistency
• Guaranteed, at-least-once delivery
• Low latency
• Schema evolution
• Source independence
• Scalable consumers
• Handle slow/new consumers without affecting happy ones (look-back requirements)
11. Initial Design (2007)
[Diagram: the source DB, stamped with a logical clock (SCN), feeds a relay's in-memory buffer (~3 hrs of events); happy consumers pull directly from the relay, while a slow consumer's pull is proxied through to the DB.]
Pros:
1. Consumer scaling
2. Some isolation
Cons:
Slow consumers overwhelming the DB
12. Software Architecture
Four Logical Components
• Fetcher
  – Fetch from db, relay…
• Log Store
  – Store log snippet
• Snapshot Store
  – Store moving data snapshot
• Subscription Client
  – Orchestrate pull across these
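How the subscription client orchestrates a pull across these components can be sketched roughly as follows (all names and the interface shape are assumptions for illustration, not the real Databus APIs): serve from the log snippet when the consumer's SCN is still retained, otherwise replay the snapshot first and then the log tail.

```python
# Illustrative pull orchestration: log store for recent SCNs,
# snapshot store + log tail for consumers that have fallen behind.
LOG_MIN_SCN = 90
log_store = [(95, "b", 2), (99, "c", 3)]          # recent log snippet
snapshot_store = ({"a": 1, "b": 2}, 95)           # state as of SCN 95

def pull(since_scn):
    if since_scn >= LOG_MIN_SCN:                  # serve from the log
        return ("log", [e for e in log_store if e[0] > since_scn])
    snap, snap_scn = snapshot_store               # bootstrap path
    return ("snapshot", snap, [e for e in log_store if e[0] > snap_scn])

assert pull(95) == ("log", [(99, "c", 3)])
assert pull(0) == ("snapshot", {"a": 1, "b": 2}, [(99, "c", 3)])
```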
13. The Databus System
[Diagram: the source DB, stamped with a logical clock (SCN), feeds a relay's in-memory buffer (~3 hrs of events); happy consumers pull from the relay, while slow consumers are served by the Bootstrap Service, which combines Log Storage (~10 days) with an infinite-retention Snapshot Store.]
14. The Relay
• Change event buffering (~2–7 days)
• Low latency (10–15 ms)
• Filtering, projection
• Hundreds of consumers per relay
• Scale-out, high availability through redundancy
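A toy model of the relay's role (illustrative only; the real relay is far more involved) is a bounded buffer of change events keyed by SCN: consumers pull everything after their last SCN, and a consumer whose SCN has already aged out of the buffer must fall back to the bootstrap service.

```python
from collections import deque

class Relay:
    """Toy relay: bounded in-memory buffer of (scn, event) pairs."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # old events age out

    def publish(self, scn, event):
        self.buffer.append((scn, event))

    def pull(self, since_scn):
        if self.buffer and since_scn < self.buffer[0][0] - 1:
            raise LookupError("SCN aged out; fall back to bootstrap")
        return [(s, e) for s, e in self.buffer if s > since_scn]

relay = Relay(capacity=3)
for scn in range(1, 6):
    relay.publish(scn, f"event-{scn}")

assert relay.pull(3) == [(4, "event-4"), (5, "event-5")]
try:
    relay.pull(0)          # too far behind the retained window
    fell_back = False
except LookupError:
    fell_back = True
assert fell_back
```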
16. The Bootstrap Service
• Catch-all for slow / new consumers
• Isolate source OLTP instance from large scans
• Log Store + Snapshot Store
• Optimizations
– Periodic merge
– Predicate push-down
– Catch-up versus full bootstrap
• Guaranteed progress for consumers via chunking
• Implementations
– Database (MySQL)
– Raw Files
• Bridges the continuum between stream and batch systems
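A full bootstrap followed by catch-up can be sketched like this (an illustrative model with made-up names, not the MySQL-backed implementation): serve a consistent snapshot in chunks so every pull makes progress, then replay the retained log from the snapshot's SCN so the consumer can rejoin the relay.

```python
# Phase 1: chunked snapshot scan; Phase 2: catch-up from the log snippet.
snapshot = {"a": ("v1", 90), "b": ("v2", 95)}       # key -> (value, scn)
log = [(96, "b", "v3"), (101, "c", "v4")]           # retained log snippet

def bootstrap(chunk_size=1):
    items = sorted(snapshot.items())
    state, max_scn = {}, 0
    for i in range(0, len(items), chunk_size):      # guaranteed progress per chunk
        for key, (value, scn) in items[i:i + chunk_size]:
            state[key] = value
            max_scn = max(max_scn, scn)
    for scn, key, value in log:                     # catch-up from snapshot SCN
        if scn > max_scn:
            state[key] = value
            max_scn = scn
    return state, max_scn

state, scn = bootstrap()
assert state == {"a": "v1", "b": "v3", "c": "v4"} and scn == 101
```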
17. The Consumer Client Library
• Glue between Databus infra and business logic in the consumer
• Isolates the consumer from changes in the Databus layer
• Switches between relay and bootstrap as needed
• API
  – Callbacks with transactions
  – Iterators over windows
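The callback style of the API can be sketched as follows (an illustrative shape, not the actual Databus client interface): the library delivers events grouped into transaction windows, and the consumer checkpoints its progress only when a window-end callback fires.

```python
# Toy consumer callbacks: events arrive per window; the checkpoint is
# advanced only at window boundaries, keeping restarts transactionally safe.
class Consumer:
    def __init__(self):
        self.seen, self.checkpoint = [], 0

    def on_event(self, scn, event):
        self.seen.append(event)

    def on_end_window(self, scn):
        self.checkpoint = scn       # safe point to persist progress

def deliver(consumer, windows):
    """Library side: replay each window of (scn, event) pairs."""
    for window in windows:
        for scn, event in window:
            consumer.on_event(scn, event)
        consumer.on_end_window(window[-1][0])

c = Consumer()
deliver(c, [[(1, "a"), (2, "b")], [(3, "c")]])
assert c.seen == ["a", "b", "c"] and c.checkpoint == 3
```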
18. Fetcher Implementations
• Oracle
– Trigger-based
• MySQL
– Custom-storage-engine based
• In Labs
– Alternative implementations for Oracle
– OpenReplicator integration for MySQL
19. Meta-data Management
• Event definition, serialization, and transport
  – Avro
• Oracle, MySQL
  – Avro definition generated from the table schema
• Schema evolution
  – Only backwards-compatible changes allowed
• Isolation between upgrades on producer and consumer
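The backwards-compatibility rule works the way Avro schema resolution does: a new field is only allowed if it carries a default, so records written by an older producer still decode under the newer reader schema. A hand-rolled sketch (not using the Avro library; the schema format here is made up for illustration):

```python
# A reader schema where a default of None marks a required field;
# "email" was added later with a default, which is backwards-compatible.
reader_schema = {"id": None, "name": None, "email": ""}

def resolve(record, schema):
    """Fill missing fields from defaults; reject missing required fields."""
    out = {}
    for field, default in schema.items():
        if field in record:
            out[field] = record[field]
        elif default is not None:
            out[field] = default      # new field: use its default
        else:
            raise ValueError(f"missing required field: {field}")
    return out

old_record = {"id": 1, "name": "alice"}          # written pre-upgrade
assert resolve(old_record, reader_schema) == {"id": 1, "name": "alice", "email": ""}
```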
20. Scaling the Consumers (Partitioning)
• Server-side filtering
  – Range, mod, hash
  – Allows the client to control the partitioning function
• Consumer groups
  – Distribute partitions evenly across a group
  – Move partitions to available consumers on failure
  – Minimize re-processing
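These two mechanisms can be sketched together (illustrative names and logic, not the Databus implementation): a mod filter applied server-side, and a group assignment that keeps surviving ownership intact on failure so only orphaned partitions are re-processed.

```python
def mod_filter(key, partition, num_partitions):
    """Server-side filter: deliver only this partition's keys."""
    return hash(key) % num_partitions == partition

def assign(partitions, consumers, previous=None):
    """Keep surviving ownership; hand orphaned partitions to the
    least-loaded consumer (minimizes re-processing on failure)."""
    previous = previous or {}
    assignment = {p: c for p, c in previous.items() if c in consumers}
    for p in partitions:
        if p not in assignment:
            loads = {c: list(assignment.values()).count(c) for c in consumers}
            assignment[p] = min(consumers, key=loads.get)
    return assignment

a1 = assign(range(4), ["c1", "c2"])
assert sorted(a1.values()).count("c1") == 2      # even spread
a2 = assign(range(4), ["c2"], previous=a1)       # c1 fails
assert all(a2[p] == "c2" for p in range(4))      # orphans moved to survivor
```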