5. Scale of Processing @ LinkedIn
- 2.3 Trillion Messages per Day
- 0.6 PB in, 2.3 PB out per Day (compressed)
- 16 Million Messages per Second at peaks!
- 4.6K users
- 125 TB ingested per day
- 120 PB of HDFS
- 224K jobs per day across 13 clusters (9K nodes)
Samza
- 220+ Applications
- Most Applications require Stateful Processing (several TBs overall)
- 800+ nodes across 9 clusters
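A few ratios follow directly from the figures on this slide. The sketch below is back-of-envelope arithmetic only: the constant and function names are illustrative, decimal units (1 PB = 1000 TB) are assumed, and the ingest estimate ignores replication and retention.

```python
# Figures copied from the slide above; all names here are illustrative.
PB_IN_PER_DAY = 0.6          # compressed data written per day
PB_OUT_PER_DAY = 2.3         # compressed data read per day
JOBS_PER_DAY = 224_000
CLUSTERS = 13
NODES = 9_000
HDFS_PB = 120
INGEST_TB_PER_DAY = 125


def read_fanout(pb_in: float, pb_out: float) -> float:
    """How many times each written byte is read, on average."""
    return pb_out / pb_in


def per_cluster(total: float, clusters: int) -> float:
    """Even split of a total across clusters (real clusters differ in size)."""
    return total / clusters


def days_of_ingest(hdfs_pb: float, ingest_tb_per_day: float) -> float:
    """Days of ingest the HDFS footprint corresponds to.

    Assumes 1 PB = 1000 TB and ignores replication and retention.
    """
    return hdfs_pb * 1000 / ingest_tb_per_day


if __name__ == "__main__":
    print(f"read fan-out:        {read_fanout(PB_IN_PER_DAY, PB_OUT_PER_DAY):.1f}x")
    print(f"jobs per cluster:    {per_cluster(JOBS_PER_DAY, CLUSTERS):,.0f} per day")
    print(f"nodes per cluster:   {per_cluster(NODES, CLUSTERS):,.0f}")
    print(f"HDFS vs daily ingest: {days_of_ingest(HDFS_PB, INGEST_TB_PER_DAY):,.0f} days")
```

The read fan-out of roughly 3.8x is the most useful of these: every byte collected is consumed several times downstream.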
6. Big Data!
Collect
- Collect User Events from Across the Globe
- E.g. Page Views, Feed Impressions, Connections
- Multiple Sources of Data
- Transport Data with Low Latency
- Scale: 2.3 trillion msgs/day (~2.5 PB) (PYMK Scale ~10K msg/sec)
Process
- Highly Reliable and Fault-tolerant Processing of Events
- Offline Batch Processing
- Near-realtime Stream Processing
- Seamlessly Transport Results from Offline Processing to Online Services
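To make "stateful" near-realtime stream processing concrete, here is a toy sketch that counts page views per member in fixed time windows. It is not the Samza API: the class, the event fields (`member_id`, `timestamp`), and the in-memory dictionary are all assumptions for illustration. A production job would keep this state in a durable, fault-tolerant store.

```python
from collections import defaultdict
from typing import Dict, Tuple


class PageViewCounter:
    """Toy stateful task: running page-view count per (member, window)."""

    def __init__(self, window_seconds: int = 60) -> None:
        self.window_seconds = window_seconds
        # State that must survive between events: (member_id, window_start) -> count.
        self.state: Dict[Tuple[str, int], int] = defaultdict(int)

    def process(self, event: dict) -> Tuple[str, int, int]:
        """Update state for one event and return (member, window_start, count)."""
        ts = event["timestamp"]
        window_start = ts - ts % self.window_seconds  # align to window boundary
        key = (event["member_id"], window_start)
        self.state[key] += 1
        return event["member_id"], window_start, self.state[key]


if __name__ == "__main__":
    # Hypothetical events; timestamps are seconds since an arbitrary origin.
    events = [
        {"member_id": "m1", "timestamp": 5},
        {"member_id": "m1", "timestamp": 30},
        {"member_id": "m2", "timestamp": 45},
        {"member_id": "m1", "timestamp": 65},
    ]
    task = PageViewCounter(window_seconds=60)
    for e in events:
        print(task.process(e))
```

The point of the sketch is that each output depends on earlier events, so losing the state on a node failure would corrupt the results. That is why reliability and fault tolerance appear in the same list as stream processing.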
Access
- Persist Data Durably
- High Availability for Serving Online Services
- Data should be Searchable
11. Analytics Infrastructure Challenges
The Scaling Pyramid (levels: Computation, Cluster Management, System)
Scaling up computation
● Limited shared computation resources
● Efficient computation to cut down cost of jobs
Scaling up cluster management
● Thousands of daily active cluster users
● Hundreds of thousands of daily jobs
● A mix of SLA requirements
Scaling up system
● Tens of thousands of nodes
● Tens of PB of data
12. Our Solutions
Scaling up system
● Federated HDFS
● Dali - Logical Data Access Layer for Hadoop
Scaling up cluster management
● Hadoop OrgQueue
● Elasticity Tuner
Scaling up computation
● Dr. Elephant
● Better computation strategy for handling large datasets
13. LinkedIn Open Source Projects
● Pinot (OLAP Storage)
● Dr. Elephant (Performance Tuning)
● Cubert (Computation Engine)
● Samza (Near Realtime Stream Processing)
● Photon-ML
Other categories shown: Streaming, Data Management, Workflow Manager