Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
1. Data processing at LinkedIn with Apache Kafka
Jeff Weiner
Chief Executive Officer
Joel Koshy
Sr. Staff Software Engineer
Kartik Paramasivam
Director, Software Engineering
2. Outline
Kafka growth at LinkedIn
Canonical use cases
Search, analytics and storage platforms
Data pipelines
Stream processing
Conclusion
Q&A
7. Distributed near real-time OLAP datastore with SQL query interface
Pinot
• 100B documents
• 1B documents ingested per day
• 100M queries per day
• Latency in the tens of milliseconds
13. Galene
• Base index generated weekly (offline)
• Live updater pulls from Kafka and Brooklin (DB changes)
• Periodically combine incremental snapshot and live update buffer
14. Distributed replicated NoSQL store
[Architecture diagram. Components shown: Client (HTTP), three Routers (each holding a routing table), four Storage Nodes (each with an API Server and MySQL), Apache Helix, ZooKeeper. Arrows distinguish data and control paths.]
33. Distributed stream processing framework
Samza
• Top-level Apache project since 2014
• In use at LinkedIn, Uber, Metamarkets, Netflix, Intuit, TripAdvisor, MobileAware, Optimizely, etc.
• Increase in production usage at LinkedIn – from ~20 to ~350 applications in two years
34. Stateless processing – message in, message out
• Schema translation
• Data transformation (e.g., ID obfuscation)
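A minimal sketch of a message-in, message-out transformation of this kind. The `PageView` type, its fields, and the salted SHA-256 hash are assumptions made for illustration; the slides do not say which obfuscation scheme is used.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical event type; field names are illustrative only.
final class PageView {
    final String memberId;
    final String pageUrl;

    PageView(String memberId, String pageUrl) {
        this.memberId = memberId;
        this.pageUrl = pageUrl;
    }
}

final class IdObfuscator {
    private final byte[] salt;

    IdObfuscator(String salt) {
        this.salt = salt.getBytes(StandardCharsets.UTF_8);
    }

    // Stateless: the output depends only on the input message and the fixed salt.
    PageView process(PageView in) {
        return new PageView(hash(in.memberId), in.pageUrl);
    }

    // Salted SHA-256, hex-encoded (an assumed scheme, not the one from the talk).
    private String hash(String id) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(salt);
            byte[] digest = md.digest(id.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }
}
```

Because no state is kept between messages, any number of instances can process partitions of the input in parallel and produce the same output for the same member ID.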
35. Stateless processing – accessing adjunct data
Key issues:
• Accidental DoS of the member DB
• Dealing with spikes
• Remote I/O makes processing slow
37. Stateless processing – locally accessible adjunct data
Pros
• Awesome performance at low cost (100x faster)
• No issues with accidental DoS
• No need to over-provision the remote database
Cons
• Does not work for cases where the adjunct data is large and not co-partitionable with the input stream
• Auto-scaling the processor gets trickier
• Repartitioning the input Kafka topic can mess up local state
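The idea of locally accessible adjunct data can be sketched as follows, assuming the adjunct data arrives as a change stream partitioned by the same key as the input stream. The class, method names, and the in-memory map are hypothetical stand-ins for a task-local store.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class LocalAdjunctJoin {
    // Local copy of the adjunct data for the keys this task owns.
    private final Map<String, String> memberIndustry = new HashMap<>();

    // Called for each record of the (hypothetical) member-profile change stream.
    void onProfileChange(String memberId, String industry) {
        if (industry == null) {
            memberIndustry.remove(memberId); // a null value models a delete
        } else {
            memberIndustry.put(memberId, industry);
        }
    }

    // Called for each input event; enrichment is a local lookup with no remote I/O.
    Optional<String> onPageView(String memberId, String pageUrl) {
        String industry = memberIndustry.get(memberId);
        if (industry == null) {
            return Optional.empty(); // adjunct record not seen (yet)
        }
        return Optional.of(memberId + "," + pageUrl + "," + industry);
    }
}
```

The lookup only works if both streams route a given member ID to the same task, which is why co-partitioning appears among the cons above.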
38. Stateless processing – async data access
Synchronous API (existing)
// execute on multiple threads
public interface StreamTask {
  // process message
  void process(IncomingMessageEnvelope envelope,
               MessageCollector collector,
               TaskCoordinator coordinator);
}
Asynchronous API
// callback-based
public interface AsyncStreamTask {
  // process message with asynchronous calls;
  // fire callback upon completion
  void processAsync(IncomingMessageEnvelope envelope,
                    MessageCollector collector,
                    TaskCoordinator coordinator,
                    TaskCallback callback);
}
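A minimal sketch of the callback-based style using only the JDK. `DoneCallback` is a stand-in for the callback on the slide, not the real Samza type, and the `lookup` function is a hypothetical non-blocking remote client.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Function;

// Stand-in for the callback shown on the slide; not the real Samza interface.
interface DoneCallback {
    void complete();
    void failure(Throwable t);
}

final class AsyncEnricher {
    // Hypothetical non-blocking client: member ID -> future profile.
    private final Function<String, CompletableFuture<String>> lookup;
    final List<String> output = new CopyOnWriteArrayList<>();

    AsyncEnricher(Function<String, CompletableFuture<String>> lookup) {
        this.lookup = lookup;
    }

    // Returns immediately; the callback fires when the remote call finishes.
    void processAsync(String memberId, DoneCallback callback) {
        lookup.apply(memberId).whenComplete((profile, error) -> {
            if (error != null) {
                callback.failure(error);
            } else {
                output.add(memberId + ":" + profile);
                callback.complete();
            }
        });
    }
}
```

The processing thread is never blocked on I/O, so many remote calls can be in flight at once; the framework can use the callback to decide when a message is safe to checkpoint.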
40. Managing state
● Full state checkpointing
  ○ Simply does not scale for non-trivial application state
  ○ … but makes it easier to achieve “repeatable results” when recovering from failure
● Incremental state checkpointing
  ○ Scales to any type of application state
  ○ Achieving repeatable results requires additional techniques (e.g. variants of de-dup or transaction support)
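Incremental checkpointing can be sketched as a store that appends every write to a changelog and rebuilds itself by replaying that log. The class below is an illustration under that assumption; the in-memory list stands in for a durable log such as a Kafka topic.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ChangelogBackedStore {
    // One changelog entry per write; a null value models a delete.
    static final class Change {
        final String key;
        final Integer value;

        Change(String key, Integer value) {
            this.key = key;
            this.value = value;
        }
    }

    private final Map<String, Integer> state = new HashMap<>();
    private final List<Change> changelog;

    // Recovery: replay the existing changelog to rebuild local state.
    ChangelogBackedStore(List<Change> changelog) {
        this.changelog = changelog;
        for (Change c : changelog) {
            apply(c);
        }
    }

    void put(String key, int value) { record(new Change(key, value)); }

    void delete(String key) { record(new Change(key, null)); }

    Integer get(String key) { return state.get(key); }

    // Only the delta is persisted, never a full snapshot of the state.
    private void record(Change c) {
        changelog.add(c);
        apply(c);
    }

    private void apply(Change c) {
        if (c.value == null) {
            state.remove(c.key);
        } else {
            state.put(c.key, c.value);
        }
    }
}
```

The cost of each checkpoint is proportional to the number of writes rather than to the size of the state. Replaying input that was already applied before a failure can still produce duplicates downstream, which is where the de-dup or transactional techniques come in.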
41. Managing local state
• Durably store “host-to-task” mapping
• Minimize reseeding during failures, adding/removing capacity
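One way to picture how a stored host-to-task mapping reduces reseeding: on restart, each task goes back to its previous host if that host is still alive, and only displaced tasks are placed elsewhere. The assigner below is a hypothetical sketch of that policy, not Samza's actual scheduler.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class StickyAssigner {
    // previous: durably stored task -> host mapping from the last run.
    static Map<String, String> assign(List<String> tasks,
                                      Set<String> liveHosts,
                                      Map<String, String> previous) {
        if (liveHosts.isEmpty()) {
            throw new IllegalArgumentException("no live hosts");
        }
        Map<String, Integer> load = new TreeMap<>();
        for (String host : liveHosts) {
            load.put(host, 0);
        }

        Map<String, String> result = new LinkedHashMap<>();
        List<String> displaced = new ArrayList<>();
        for (String task : tasks) {
            String prev = previous.get(task);
            if (prev != null && load.containsKey(prev)) {
                result.put(task, prev); // local state is already there
                load.merge(prev, 1, Integer::sum);
            } else {
                displaced.add(task); // must reseed state on a new host
            }
        }
        for (String task : displaced) {
            String best = null;
            for (Map.Entry<String, Integer> e : load.entrySet()) {
                if (best == null || e.getValue() < load.get(best)) {
                    best = e.getKey(); // least-loaded host, ties broken by name
                }
            }
            result.put(task, best);
            load.merge(best, 1, Integer::sum);
        }
        return result;
    }
}
```

Only the displaced tasks pay the cost of rebuilding local state; the sketch deliberately does not rebalance tasks that can stay where they are.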
42. Samza processing pipeline
• Natural back-pressure
• Per-stage checkpointing instead of global checkpointing
• Cost considerations – new Kafka feature (KIP-107: deleteDataBefore)
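The per-stage checkpointing and cost points can be illustrated with a toy intermediate log between pipeline stages: each downstream stage checkpoints its own read position, and records below the slowest stage's checkpoint can be deleted. The class and method names are hypothetical; in a real pipeline the log would be a Kafka topic and the deletion would go through the feature referenced above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class IntermediateLog {
    private final List<String> records = new ArrayList<>();
    private long startOffset = 0; // offset of the first retained record
    private final Map<String, Long> checkpoints = new HashMap<>();

    // Appends a record and returns its offset.
    long append(String record) {
        records.add(record);
        return startOffset + records.size() - 1;
    }

    // A stage checkpoints the next offset it will read, independently of other stages.
    void checkpoint(String stage, long nextOffset) {
        checkpoints.put(stage, nextOffset);
    }

    // Deletes everything below the slowest stage's checkpoint; returns the new start offset.
    long truncate() {
        long min = checkpoints.values().stream()
                .mapToLong(Long::longValue)
                .min()
                .orElse(startOffset);
        while (startOffset < min && !records.isEmpty()) {
            records.remove(0);
            startOffset++;
        }
        return startOffset;
    }

    int retained() {
        return records.size();
    }
}
```

No stage has to wait for a global barrier, and intermediate data only needs to be kept until every consumer of that stage has moved past it.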
50. Samza: a common API for data processing
● Application code does not change
● Stream Processing
● Batch data processing
● Configurable input sources and sinks (e.g. Kafka, Kinesis, Event Hubs, HDFS, etc.)
51. Fluent API (0.13 release)
public class PageViewCounterExample implements StreamApplication {
  @Override
  public void init(StreamGraph graph, Config config) {
    MessageStream<PageViewEvent> pageViewEvents = graph.createInputStream("myinput");
    MessageStream<MyStreamOutput> outputStream = graph.createOutputStream("myoutput");
    pageViewEvents
        .partitionBy(m -> m.getMessage().memberId)
        .window(Windows.<PageViewEvent, String, Integer>keyedTumblingWindow(
            m -> m.getMessage().memberId, Duration.ofSeconds(10), (m, c) -> c + 1))
        .map(MyStreamOutput::new)
        .sendTo(outputStream);
  }
}
52. Fluent API (0.13 release)
public class PageViewCounterExample implements StreamApplication {
  @Override
  public void init(StreamGraph graph, Config config) {
    MessageStream<PageViewEvent> pageViewEvents = graph.createInputStream("myinput");
    MessageStream<MyStreamOutput> outputStream = graph.createOutputStream("myoutput");
    pageViewEvents
        .partitionBy(m -> m.getMessage().memberId)
        .window(Windows.<PageViewEvent, String, Integer>keyedTumblingWindow(
            m -> m.getMessage().memberId, Duration.ofSeconds(10), (m, c) -> c + 1))
        .map(MyStreamOutput::new)
        .sendTo(outputStream);
  }

  public static void main(String[] args) throws Exception {
    CommandLine cmdLine = new CommandLine();
    Config config = cmdLine.loadConfig(cmdLine.parser().parse(args));
    ApplicationRunner localRunner = ApplicationRunner.getLocalRunner(config);
    localRunner.run(new PageViewCounterExample());
  }
}
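For readers unfamiliar with keyed tumbling windows, the computation the example expresses (a count per member per 10-second window) can be sketched in plain Java. This is not Samza code: the class is hypothetical, and it buckets by a timestamp supplied by the caller, which is an assumption about how time is assigned.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

final class TumblingWindowCounter {
    private final long windowMillis;
    // window start (ms) -> member ID -> count
    private final Map<Long, Map<String, Integer>> counts = new TreeMap<>();

    TumblingWindowCounter(Duration size) {
        this.windowMillis = size.toMillis();
    }

    // Each event falls into exactly one fixed-size, non-overlapping window.
    void add(String memberId, long eventTimeMillis) {
        long windowStart = eventTimeMillis - Math.floorMod(eventTimeMillis, windowMillis);
        counts.computeIfAbsent(windowStart, w -> new TreeMap<>())
              .merge(memberId, 1, Integer::sum); // same fold as (m, c) -> c + 1
    }

    int count(String memberId, long windowStart) {
        return counts.getOrDefault(windowStart, Collections.emptyMap())
                     .getOrDefault(memberId, 0);
    }
}
```

A usage example: with a 10-second window, events for the same member at 1,000 ms and 9,999 ms land in the window starting at 0, while an event at 10,000 ms starts a new window.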
53. Deployment options
Standalone
• Full control over application lifecycle
• Can be part of a bigger application
• ZK-based coordination
YARN-based
• Dashboard
• Management service
• Monitoring/alerts
• Long-running service in YARN