This is the presentation given at analytics@webscale in 2013 (http://analyticswebscale.splashthat.com/?em=187&utm_campaign=website&utm_source=sg&utm_medium=em)
7. LinkedIn Data Infrastructure: Sample Stack
Infrastructure challenges in the three-phase ecosystem are diverse, complex, and specific.
Some components are off-the-shelf; there is significant investment in home-grown, deep, and interesting platforms.
8. LinkedIn Data Infrastructure Solutions
Voldemort: Highly-Available
Distributed KV Store
• Key/value access at scale
9. Voldemort: Architecture
• Pluggable components
• Tunable consistency / availability
• Key/value model, server-side “views”
• 10 clusters, 100+ nodes
• Largest cluster – 10K+ qps
• Avg latency: 3 ms
• Hundreds of stores
• Largest store – 2.8TB+
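The “tunable consistency / availability” bullet above refers to the N/R/W quorum scheme Voldemort inherits from Dynamo: with N replicas, a write waits for W acknowledgements and a read consults R replicas, and choosing R + W > N guarantees the sets overlap. The sketch below is illustrative only (it is not Voldemort’s actual API; all names are hypothetical):

```python
# Illustrative sketch (not Voldemort's API): tunable consistency via
# N/R/W quorums. R + W > N guarantees every read quorum overlaps
# every write quorum, so a read sees the latest acknowledged write.

class QuorumStore:
    def __init__(self, n=3, r=2, w=2):
        assert r + w > n, "R + W > N: read and write quorums must overlap"
        self.n, self.r, self.w = n, r, w
        # One dict per replica; values are (version, value) pairs.
        self.replicas = [{} for _ in range(n)]

    def put(self, key, value, version):
        # The write succeeds once W replicas have acknowledged it.
        for replica in self.replicas[: self.w]:
            replica[key] = (version, value)

    def get(self, key):
        # Read from R replicas and keep the highest-versioned answer.
        answers = [rep[key] for rep in self.replicas[: self.r] if key in rep]
        if not answers:
            return None
        return max(answers, key=lambda a: a[0])[1]

store = QuorumStore(n=3, r=2, w=2)
store.put("member:42", {"name": "Alice"}, version=1)
print(store.get("member:42"))  # {'name': 'Alice'}
```

Lowering R (or W) trades consistency for latency and availability, which is the knob the slide alludes to.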
10. LinkedIn Data Infrastructure Solutions
Espresso: Indexed Timeline-Consistent
Distributed Data Store
• Fills the gap between Oracle and a key/value store
11. Espresso: System Components
• Hierarchical data model
• Timeline consistency
• Rich functionality
  • Transactions
  • Secondary index
  • Text search
• Partitioning/replication
• Change propagation
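To make the “hierarchical data model” and “secondary index” bullets concrete, here is a toy sketch (hypothetical code, not Espresso’s actual API): documents are addressed by database → table → key, and a side index answers field-equality queries:

```python
# Illustrative sketch (not Espresso's API): a hierarchical data model
# (database -> table -> key -> document) with a toy secondary index.
from collections import defaultdict

class HierarchicalStore:
    def __init__(self):
        # store[db][table][key] -> document (a flat dict)
        self.store = defaultdict(lambda: defaultdict(dict))
        # index[(db, table, field, value)] -> set of matching keys
        self.index = defaultdict(set)

    def put(self, db, table, key, doc):
        self.store[db][table][key] = doc
        for field, value in doc.items():
            self.index[(db, table, field, value)].add(key)

    def get(self, db, table, key):
        return self.store[db][table].get(key)

    def query(self, db, table, field, value):
        # Secondary-index lookup instead of a full table scan.
        return sorted(self.index[(db, table, field, value)])

s = HierarchicalStore()
s.put("Mailbox", "Messages", "bob/1", {"from": "alice", "unread": True})
s.put("Mailbox", "Messages", "bob/2", {"from": "carol", "unread": True})
print(s.query("Mailbox", "Messages", "unread", True))  # ['bob/1', 'bob/2']
```

Grouping a member’s documents under one hierarchy key is what lets a store like this offer transactions within that hierarchy while still partitioning across machines.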
12. Generic Cluster Manager: Helix
• Generic Distributed State Model
• Config management
• Automatic load balancing
• Fault tolerance
• Cluster expansion and rebalancing
• Used by Espresso, Databus and Search
• Open-sourced Apr 2012
• https://github.com/linkedin/helix
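Helix’s core idea, the “generic distributed state model,” is that a system declares the states its partitions can be in and which transitions are legal, and the cluster manager drives every replica through those transitions. A minimal sketch of that idea (hypothetical code, not the Helix API) using the familiar Master–Slave model:

```python
# Illustrative sketch of a declared state model (not the Helix API):
# a partition replica may only move between states along declared
# legal transitions, e.g. OFFLINE -> SLAVE -> MASTER.

LEGAL_TRANSITIONS = {
    ("OFFLINE", "SLAVE"),
    ("SLAVE", "MASTER"),
    ("MASTER", "SLAVE"),
    ("SLAVE", "OFFLINE"),
}

class PartitionReplica:
    def __init__(self, name):
        self.name = name
        self.state = "OFFLINE"  # every replica starts offline

    def transition(self, target):
        # The cluster manager may only request declared transitions.
        if (self.state, target) not in LEGAL_TRANSITIONS:
            raise ValueError(f"illegal transition {self.state} -> {target}")
        self.state = target

replica = PartitionReplica("MyDB_partition_0")
replica.transition("SLAVE")
replica.transition("MASTER")
print(replica.state)  # MASTER
```

Because the model is declarative, the same manager can run Espresso, Databus, and Search: each system just plugs in its own states, transitions, and constraints (e.g. at most one MASTER per partition).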
13. LinkedIn Data Infrastructure Solutions
Databus: Timeline-Consistent
Change Data Capture
• Deliver data store changes to apps
14. Databus at LinkedIn
[Architecture diagram: the source DB’s on-line changes are captured into a Relay’s event window; Databus client libraries feed Consumers 1…n; a Bootstrap DB serves a consistent snapshot at time U for clients that fall behind]
• Transport independent of data source: Oracle, MySQL, …
• Transactional semantics
• In-order, at-least-once delivery
• Tens of relays
• Hundreds of sources
• Low latency – milliseconds
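The “in-order, at-least-once delivery” property can be sketched as follows (hypothetical code, not the Databus API): a relay buffers change events in commit order, keyed by a system change number (SCN), and each consumer checkpoints the last SCN it applied, so after a crash it simply re-streams from its checkpoint:

```python
# Illustrative sketch (not the Databus API): a relay delivers change
# events in commit (SCN) order; consumers checkpoint their last applied
# SCN, so redelivery after a failure is at-least-once, never out of order.

class Relay:
    def __init__(self):
        self.events = []  # [(scn, change)] appended in commit order

    def capture(self, scn, change):
        self.events.append((scn, change))

    def stream_from(self, checkpoint_scn):
        # Everything after the consumer's checkpoint, in order.
        return [e for e in self.events if e[0] > checkpoint_scn]

class Consumer:
    def __init__(self):
        self.checkpoint = 0  # highest SCN applied so far
        self.applied = []

    def poll(self, relay):
        for scn, change in relay.stream_from(self.checkpoint):
            self.applied.append(change)  # apply the change first...
            self.checkpoint = scn        # ...then advance the checkpoint

relay = Relay()
relay.capture(1, "UPDATE member 42")
relay.capture(2, "INSERT member 43")
consumer = Consumer()
consumer.poll(relay)
print(consumer.applied)  # ['UPDATE member 42', 'INSERT member 43']
```

Advancing the checkpoint only after applying a change is what makes delivery at-least-once: a crash between the two steps causes a redelivery, not a loss.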
15. LinkedIn Data Infrastructure Solutions
Kafka: High-Volume Low-Latency
Messaging System
• Log aggregation and queuing
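The abstraction underneath “log aggregation and queuing” is a partitioned, append-only commit log: producers append, and each consumer pulls from an offset it controls, so it can lag or replay without affecting anyone else. A minimal sketch of that abstraction (hypothetical code, not the Kafka client API):

```python
# Illustrative sketch of the commit-log abstraction behind Kafka
# (not the Kafka client API): an append-only partition log addressed
# by offset, with pull-based consumption.

class PartitionLog:
    def __init__(self):
        self.log = []  # append-only sequence of messages

    def append(self, message):
        self.log.append(message)
        return len(self.log) - 1  # offset assigned to the new message

    def read(self, offset, max_messages=10):
        # Consumers pull by offset, so each one can replay or lag
        # independently; the broker keeps no per-consumer state here.
        return self.log[offset : offset + max_messages]

p = PartitionLog()
p.append("pageview:/home")
p.append("pageview:/jobs")
print(p.read(0))  # ['pageview:/home', 'pageview:/jobs']
print(p.read(1))  # ['pageview:/jobs']
```

Keeping the consumer position out of the broker is the design choice that makes this both a queue (consumers advance their offset) and an aggregation log (late consumers re-read from older offsets).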
Enterprise-facing is all about segmentation and connections. Our base data leads to revenue-generating products. Enterprise application-building problems have deterministic life-cycles. Science is key for targeting and matching (e.g. CAP, Marketing Solutions). A key back-office play for hiring, sales, and marketing for 85% of the Fortune 500.
The transition needs to be good. Products drive the data infrastructure requirements in the previous slide. Not all products make the same latency and freshness demands on our data infrastructure. The way we bucketize this is…. News and recommendations show up in both nearline and offline.
Data integration is hard: it means having sane, consistent metadata across systems and a schema that works across the three phases. We want rich, evolving schemas, and to push as much data cleaning as possible to the source and upstream, so that both nearline and offline benefit. Sessionization logic lives in the warehouse, which makes it hard for nearline systems to use. We want an extensible system where changing a schema in one phase does not break downstream systems. Don't build over-specialized systems: e.g., instead of a monitoring system just for PYMK, build Azkaban.