LinkedIn has several data-driven products that improve the experience of its users, whether they are professionals or enterprises. Supporting these products is a large ecosystem of systems and processes that deliver data and insights to them in a timely manner.
This talk provides an overview of the various components of this ecosystem, which are:
- Hadoop
- Teradata
- Kafka
- Databus
- Camus
- Lumos
etc.
1. The Big Data Analytics
Ecosystem at LinkedIn
Rajappa Iyer
September 17, 2013
2. Agenda
LinkedIn by the numbers
An Overview of Data Driven Products / Insights
The Big Data Analytics Ecosystem
– Storage and Compute Platforms
– Data Transport Pipelines
– Data Processing Pipelines
– Operational Tooling - Metadata
Q&A
3. LinkedIn: The World’s Largest
Professional Network
238M+ Members Worldwide
2 new Members Per Second
100M+ Monthly Unique Visitors
3M+ Company Pages
Connecting Talent with Opportunity. At scale…
9. A Simplified Overview of Data Flow
[Diagram] Member data lives in Espresso / Voldemort / Oracle, and changes are streamed out of those stores via Databus; activity data from the site (member-facing products) is streamed via Kafka. On the Hadoop side, Camus ingests the Kafka activity data and Lumos applies the Databus change streams, while ingest utilities bring in external partner data. DWH ETL builds the core data set and derived data sets in Teradata for product, sciences, and enterprise analytics. Computed results flow back to the member-facing products and to enterprise products.
18. Operational Support - Metadata
The ETL pipeline is a complex graph of workflows
– Our comprehensive dashboard production flow is nearly 30 levels deep, with complex dependencies
To manage this, we needed to capture:
– Process dependencies
– Data dependencies
– Process execution history
– Data load status
– Data consumption status (watermarks)
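The kinds of operational metadata listed above can be sketched as a small data model. This is an illustrative sketch only; the class and field names are invented, not LinkedIn's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

# Hypothetical model of the operational metadata described above.

@dataclass
class Workflow:
    name: str
    owner: str
    upstream: list = field(default_factory=list)   # process dependencies
    consumes: list = field(default_factory=list)   # data dependencies (inputs)
    produces: list = field(default_factory=list)   # data dependencies (outputs)

@dataclass
class Execution:                       # process execution history
    workflow: str
    started_at: datetime
    status: str                        # e.g. "running", "success", "failure"
    error_log: Optional[str] = None    # pointer to error logs

@dataclass
class Watermark:                       # data consumption status
    flow: str
    dataset: str
    last_consumed: str                 # time-based or sequence-based marker

f = Workflow("dwh_etl", owner="dw-team",
             upstream=["ingest"], consumes=["activity_events"], produces=["core_facts"])
```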
19. Operational Metadata – v1
Capture the process dependency graph
– Also capture useful metadata such as process owners
Capture stats for each execution of a workflow
– Time of execution
– Status
– Pointer to error logs
This has proved quite useful for monitoring critical chains
[Diagram] Workflow F shown as a graph: Start leads into work units W1–W5, which are linked by "on success" / "on failure" transitions and end at Stop.
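The transition graph in the diagram can be sketched as a tiny interpreter. The work-unit names follow the slide; the function and the routing table are hypothetical, not the actual system.

```python
# Hypothetical sketch of a workflow like F: work units linked by
# "on success" / "on failure" transitions between Start and Stop.

def run_workflow(units, transitions, start):
    """units: name -> callable returning True (success) / False (failure).
    transitions: (name, outcome) -> next unit name; missing key means Stop."""
    history = []
    current = start
    while current != "Stop":
        ok = units[current]()
        history.append((current, "success" if ok else "failure"))
        current = transitions.get((current, ok), "Stop")
    return history

# Five work units that all succeed, chained on success as in the diagram.
units = {f"W{i}": (lambda: True) for i in range(1, 6)}
transitions = {
    ("W1", True): "W2",
    ("W2", True): "W3",
    ("W3", True): "W4",   # an "on failure" edge would route elsewhere
    ("W4", True): "W5",
    ("W5", True): "Stop",
}
print(run_workflow(units, transitions, "W1"))
# → [('W1', 'success'), ('W2', 'success'), ('W3', 'success'), ('W4', 'success'), ('W5', 'success')]
```

Capturing the same `(unit, outcome)` pairs per run gives exactly the execution history the slide describes.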
20. Operational Metadata – v2
[Diagram] Workflow F consumes data entities D1 and D2 and produces data entity D3.
For each flow, capture input and output data elements
For each execution, capture stats on the data elements, e.g.
– Number of records / lines read
– Number of records / lines written
– Error counts
– Last processed record (can be time based or sequence based; kept per flow, since more than one flow can consume a data element)
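A minimal sketch of the per-flow stats and watermark described above; the class and method names are illustrative, not the actual metadata store.

```python
# Hypothetical v2 metadata: per-flow, per-execution stats on a data
# element, with a per-flow watermark (time-based or sequence-based).

class FlowStats:
    def __init__(self, flow):
        self.flow = flow
        self.records_read = 0
        self.records_written = 0
        self.errors = 0
        self.watermark = None  # last processed record: timestamp or sequence id

    def record(self, read=0, written=0, errors=0, watermark=None):
        self.records_read += read
        self.records_written += written
        self.errors += errors
        if watermark is not None:
            self.watermark = watermark

# Two flows consuming the same data element each keep their own watermark.
etl = FlowStats("dwh_etl")
dashboard = FlowStats("dashboard")
etl.record(read=1000, written=990, errors=10, watermark=1000)
dashboard.record(read=500, watermark=500)
```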
21. Operational Metadata – The Payoff
Restartable ETL jobs
– Process only data that is new since the last successful run
Catch-up mode for ETL jobs
– A single run can consume data from multiple intervals in one batch
– The next run will resume from the correct place
Data freshness and availability dashboard
Coarse form of data lineage
– Impact analysis for the unfortunately all-too-common upstream changes
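Both restartability and catch-up fall out of the watermark: each run consumes everything newer than the stored watermark, however many intervals that is, then advances it. A sketch of that pattern (function and variable names are invented):

```python
# Hypothetical restart / catch-up loop driven by a watermark.

def run_etl(intervals, watermark, process):
    """Consume all intervals after `watermark` in one batch (catch-up),
    then return the new watermark so the next run resumes correctly."""
    pending = [iv for iv in intervals if iv > watermark]
    for iv in sorted(pending):
        process(iv)
    return max(pending, default=watermark)

seen = []
available = [1, 2, 3, 4, 5]       # e.g. hourly partitions landed so far
wm = run_etl(available, watermark=2, process=seen.append)
# seen == [3, 4, 5]; wm == 5, so the next run starts after interval 5
```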
23. `whoami`
Sr. Manager / DWH Architect @ LinkedIn
since 2011
Prior to that:
– Director of Engineering at Digg
– Enterprise Data Architect at eBay
www.linkedin.com/in/rajappaiyer/