LinkedIn has several data-driven products that improve the experience of its users, whether they are professionals or enterprises. Supporting these products is a large ecosystem of systems and processes that deliver data and insights to them in a timely manner.
This talk provides an overview of the various components of this ecosystem, which are:
- Hadoop
- Teradata
- Kafka
- Databus
- Camus
- Lumos
etc.
1. The Big Data Analytics
Ecosystem at LinkedIn
Rajappa Iyer
September 17, 2013
2. Agenda
LinkedIn by the numbers
An Overview of Data Driven Products / Insights
The Big Data Analytics Ecosystem
– Storage and Compute Platforms
– Data Transport Pipelines
– Data Processing Pipelines
– Operational Tooling - Metadata
Q&A
3. LinkedIn: The World’s Largest
Professional Network
238M+ Members Worldwide
2 new Members Per Second
100M+ Monthly Unique Visitors
3M+ Company Pages
Connecting Talent with Opportunity. At scale…
9. A Simplified Overview of Data Flow
[Diagram] Member data lives in Espresso / Voldemort / Oracle, and changes are streamed out of those stores via Databus; activity data from the site (member-facing products) is streamed via Kafka. On the Hadoop side, Camus ingests the Kafka activity data and Lumos applies the Databus change streams, while ingest utilities bring in external partner data. DWH ETL builds the core data set and derived data sets in Teradata for product, sciences, and enterprise analytics. Computed results flow back to the member-facing products and to enterprise products.
18. Operational Support - Metadata
The ETL pipeline is a complex graph of workflows
– Our comprehensive dashboard production flow is nearly 30 levels deep, with complex dependencies
To manage this, we needed to capture:
– Process dependencies
– Data dependencies
– Process execution history
– Data load status
– Data consumption status (watermarks)
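The kinds of operational metadata listed above can be sketched as a small data model. This is an illustrative sketch only; the class and field names are invented, not LinkedIn's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

# Hypothetical model of the operational metadata described above.

@dataclass
class Workflow:
    name: str
    owner: str
    upstream: list = field(default_factory=list)   # process dependencies
    consumes: list = field(default_factory=list)   # data dependencies (inputs)
    produces: list = field(default_factory=list)   # data dependencies (outputs)

@dataclass
class Execution:                       # process execution history
    workflow: str
    started_at: datetime
    status: str                        # e.g. "running", "success", "failure"
    error_log: Optional[str] = None    # pointer to error logs

@dataclass
class Watermark:                       # data consumption status
    flow: str
    dataset: str
    last_consumed: str                 # time-based or sequence-based marker

f = Workflow("dwh_etl", owner="dw-team",
             upstream=["ingest"], consumes=["activity_events"], produces=["core_facts"])
```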
19. Operational Metadata – v1
Capture the process dependency graph
– Also capture useful metadata such as process owners
Capture stats for each execution of a workflow
– Time of execution
– Status
– Pointer to error logs
This has proved quite useful for monitoring critical chains
[Diagram] Workflow F shown as a graph: Start leads into work units W1–W5, which are linked by "on success" / "on failure" transitions and end at Stop.
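The transition graph in the diagram can be sketched as a tiny interpreter. The work-unit names follow the slide; the function and the routing table are hypothetical, not the actual system.

```python
# Hypothetical sketch of a workflow like F: work units linked by
# "on success" / "on failure" transitions between Start and Stop.

def run_workflow(units, transitions, start):
    """units: name -> callable returning True (success) / False (failure).
    transitions: (name, outcome) -> next unit name; missing key means Stop."""
    history = []
    current = start
    while current != "Stop":
        ok = units[current]()
        history.append((current, "success" if ok else "failure"))
        current = transitions.get((current, ok), "Stop")
    return history

# Five work units that all succeed, chained on success as in the diagram.
units = {f"W{i}": (lambda: True) for i in range(1, 6)}
transitions = {
    ("W1", True): "W2",
    ("W2", True): "W3",
    ("W3", True): "W4",   # an "on failure" edge would route elsewhere
    ("W4", True): "W5",
    ("W5", True): "Stop",
}
print(run_workflow(units, transitions, "W1"))
# → [('W1', 'success'), ('W2', 'success'), ('W3', 'success'), ('W4', 'success'), ('W5', 'success')]
```

Capturing the same `(unit, outcome)` pairs per run gives exactly the execution history the slide describes.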
20. Operational Metadata – v2
[Diagram] Workflow F consumes data entities D1 and D2 and produces data entity D3.
For each flow, capture input and output data elements
For each execution, capture stats on the data elements, e.g.
– Number of records / lines read
– Number of records / lines written
– Error counts
– Last processed record (can be time based or sequence based; kept per flow, since more than one flow can consume a data element)
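A minimal sketch of the per-flow stats and watermark described above; the class and method names are illustrative, not the actual metadata store.

```python
# Hypothetical v2 metadata: per-flow, per-execution stats on a data
# element, with a per-flow watermark (time-based or sequence-based).

class FlowStats:
    def __init__(self, flow):
        self.flow = flow
        self.records_read = 0
        self.records_written = 0
        self.errors = 0
        self.watermark = None  # last processed record: timestamp or sequence id

    def record(self, read=0, written=0, errors=0, watermark=None):
        self.records_read += read
        self.records_written += written
        self.errors += errors
        if watermark is not None:
            self.watermark = watermark

# Two flows consuming the same data element each keep their own watermark.
etl = FlowStats("dwh_etl")
dashboard = FlowStats("dashboard")
etl.record(read=1000, written=990, errors=10, watermark=1000)
dashboard.record(read=500, watermark=500)
```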
21. Operational Metadata – The Payoff
Restartable ETL jobs
– Process only data that is new since the last successful run
Catch-up mode for ETL jobs
– A single run can consume data from multiple intervals in one batch
– The next run will resume from the correct place
Data freshness and availability dashboard
Coarse form of data lineage
– Impact analysis for the unfortunately all-too-common upstream changes
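Both restartability and catch-up fall out of the watermark: each run consumes everything newer than the stored watermark, however many intervals that is, then advances it. A sketch of that pattern (function and variable names are invented):

```python
# Hypothetical restart / catch-up loop driven by a watermark.

def run_etl(intervals, watermark, process):
    """Consume all intervals after `watermark` in one batch (catch-up),
    then return the new watermark so the next run resumes correctly."""
    pending = [iv for iv in intervals if iv > watermark]
    for iv in sorted(pending):
        process(iv)
    return max(pending, default=watermark)

seen = []
available = [1, 2, 3, 4, 5]       # e.g. hourly partitions landed so far
wm = run_etl(available, watermark=2, process=seen.append)
# seen == [3, 4, 5]; wm == 5, so the next run starts after interval 5
```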
23. `whoami`
Sr. Manager / DWH Architect @ LinkedIn
since 2011
Prior to that:
– Director of Engineering at Digg
– Enterprise Data Architect at eBay
www.linkedin.com/in/rajappaiyer/