
Flink Streaming

Flink Streaming is the real-time data processing framework of Apache Flink. It provides high-level functional APIs in Scala and Java, backed by a high-performance true-streaming runtime.

  1. Real-time data processing with Flink Streaming. Gyula Fora, gyfora@apache.org, @GyulaFora
  2. What is Flink Streaming • Part of Apache Flink • Real-time data processing • High performance • Expressive functional APIs • Programmable in Java or Scala
  3. This Talk • General introduction • Flink Streaming APIs • Running Flink programs • Overview of Flink internals • Development roadmap • Summary • Questions
  4. Overview of stream processing trends. Apache Storm: true streaming with low latency but lower throughput; low-level API (Bolts, Spouts) plus Trident. Spark Streaming: stream processing on top of a batch system, with high throughput but higher latency; functional API (DStreams), restricted by the batch runtime. Flink Streaming: true streaming with an adjustable latency-throughput trade-off; rich functional API exploiting the streaming runtime, e.g. rich windowing semantics.
  5. Programming model. Data abstraction: Data Stream. [Diagram: a logical program of operators X and Y transforming Data Stream A into B into C, and its parallel execution with two instances of each operator over the partitioned streams A(1)/A(2), B(1)/B(2), C(1)/C(2)]
  6. Flink Streaming APIs
  7. Word count – Java

         DataStream<String> text = env.socketTextStream(host, port);

         DataStream<Tuple2<String, Integer>> result = text
             .flatMap((str, out) -> {
                 // split on non-word characters and count each token once
                 for (String token : str.split("\\W")) {
                     out.collect(new Tuple2<>(token, 1));
                 }
             })
             .groupBy(0)
             .sum(1);

     Pipeline: socket stream → map → reduce → output stream
  8. Word count – Scala

         case class Word(word: String, count: Long)

         val input = env.socketTextStream(host, port)

         val words = input flatMap { line =>
           line.split("\\W+").map(Word(_, 1))
         }

         val counts = words groupBy "word" sum "count"

     Pipeline: socket stream → map → reduce → output stream
  9. Overview of the API • Data stream sources – File system – Message queue connectors – Arbitrary source functionality • Stream transformations – Basic transformations: Map, Reduce, Filter, Aggregations… – Windowing semantics: policy-based flexible windowing (Time, Count, Delta…) – Binary stream transformations: CoMap, CoReduce… – Temporal binary stream operators: Joins, Crosses… – Iterative stream transformations • Data stream outputs
  10. Data stream sources • Process data from anywhere • File-system sources • Socket stream • Message queues – Kafka – RabbitMQ – Flume • Scala/Java collections, streams, sequence generators for development & testing • Arbitrary source functionality using the SourceFunction interface – only an invoke(out: Collector) method has to be implemented (see the sketch below)
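     A minimal sketch of such a custom source in Scala, built only on the invoke(out: Collector) contract named above. The import paths follow the 2015-era streaming API but may differ between versions, and CountingSource, env, and the usage line are illustrative assumptions:

         import org.apache.flink.streaming.api.function.source.SourceFunction
         import org.apache.flink.util.Collector

         // Hypothetical custom source: emits the numbers 0..99, then finishes.
         // Per the slide above, only invoke(out: Collector) is implemented.
         class CountingSource extends SourceFunction[Long] {
           override def invoke(out: Collector[Long]): Unit = {
             for (i <- 0L until 100L) {
               out.collect(i) // push one record downstream
             }
           }
         }

         // Assumed usage with the environment from the earlier examples:
         // val numbers = env.addSource(new CountingSource)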
  11. Basic transformations • Rich set of functional transformations: Map, FlatMap, Reduce, GroupReduce, Filter, Project… • Aggregations by field name or position: Sum, Min, Max, MinBy, MaxBy, Count… (a short example follows below) [Diagram: two sources feeding Map, FlatMap, Reduce, Merge, and Sum operators into a sink]
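     A minimal sketch chaining a few of the transformations and aggregations listed above, reusing the positional groupBy/sum style from the word-count slides; the fromElements test source and the example tuples are assumptions for illustration:

         // Hypothetical (word, count) pairs from a small test source
         val pairs = env.fromElements(("a", 1), ("b", 4), ("a", 2))

         val result = pairs
           .filter(_._2 > 1)            // Filter: drop pairs with count <= 1
           .map(p => (p._1, p._2 * 10)) // Map: scale each count by 10
           .groupBy(0)                  // group on the first tuple field
           .sum(1)                      // Sum aggregation on the second field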
  12. Windowing • Flexible policy-based windowing • Trigger and eviction policies • Built-in policies: – Time: Time.of(length, TimeUnit/custom timestamp) – Count: Count.of(windowSize) – Delta: Delta.of(threshold, delta function, start value) • Window transformations: – Reduce – ReduceGroup – Grouped Reduce/ReduceGroup • Custom trigger and eviction policies can also be implemented easily (see the count-window sketch below)
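     Before the time-window example on the next slide, a minimal count-window sketch combining the Count policy with the window/every pattern used later in the deck; the numbers stream is an assumption, and the exact reduce signature may differ between versions:

         // Hypothetical Long stream: sum the last 100 elements,
         // re-evaluated every 10 elements (Count eviction and trigger).
         val slidingSums = numbers
           .window(Count.of(100)) // eviction: keep the last 100 elements
           .every(Count.of(10))   // trigger: fire every 10 elements
           .reduce(_ + _)         // window Reduce from the list above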
  13. Windowing example

         // Build a new model every minute on the last
         // 5 minutes' worth of data
         val model = trainingData
           .window(Time.of(5, MINUTES))
           .every(Time.of(1, MINUTES))
           .reduceGroup(buildModel)

         // Predict new data using the most up-to-date model
         val prediction = newData
           .connect(model)
           .map(predict)

     [Diagram: training data → M (model builder); new data + latest model → P (predictor) → prediction]
  14. Temporal operators • Binary stream operators that work on time windows • Database-style operators: – Join: s1.join(s2).onWindow(…).every(…).where(key1).equalTo(key2) – Cross: s1.cross(s2).onWindow(…).every(…) • UDFs can also be used for custom operator logic on the elements in the windows
  15. Window join example

         case class Name(id: Long, name: String)
         case class Age(id: Long, age: Int)
         case class Person(name: String, age: Int)

         val names = ...
         val ages = ...

         names.join(ages)
           .onWindow(5, SECONDS)
           .where("id")
           .equalTo("id") { (n, a) => Person(n.name, a.age) }
  16. Iterative stream processing

         def iterate[R](
             stepFunction: DataStream[T] => (DataStream[T], DataStream[R]),
             maxWaitTimeMillis: Long = 0): DataStream[R]

     [Diagram: input stream T → step function → feedback stream looped back, output stream R emitted]
  17. Iterative processing example

         val env = StreamExecutionEnvironment.getExecutionEnvironment

         env.generateSequence(1, 10)
           .iterate(incrementToTen, 1000)
           .print

         env.execute("Iterative example")

         def incrementToTen(input: DataStream[Long]) = {
           val incremented = input.map { _ + 1 }
           val split = incremented.split { x =>
             if (x >= 10) "out" else "feedback"
           }
           (split.select("feedback"), split.select("out"))
         }

     [Diagram: number stream → map → split; "feedback" loops back into the map, "out" goes to the output stream]
  18. Running Flink Programs
  19. Flink programs run everywhere: on a cluster (batch), locally for debugging as Java collection programs, or embedded (e.g., in a web container), on the Flink runtime or Apache Tez.
  20. Little tuning or configuration needed • Requires no memory thresholds to configure – Flink manages its own memory • Requires no complicated network configs – the pipelining engine requires much less memory for data exchange • Requires no serializers to be configured – Flink handles its own type extraction and data representation • Programs can be adjusted to data automatically – Flink’s optimizer can choose execution strategies automatically
  21. Under the hood
  22. Distributed runtime • The master (Job Manager) handles job submission, scheduling, and metadata • Workers (Task Managers) execute operations • Data can be streamed between nodes • Data output is buffered for higher throughput (tunable)
  23. Hybrid batch/streaming • True data streaming on the runtime layer • Dataflow-based runtime • No unnecessary synchronization steps • Batch and stream processing seamlessly integrated
  24. Development roadmap
  25. Roadmap • Fault tolerance – 2015 Q1-2 • Lambda architecture – 2015 Q2 • Integration with other frameworks – SAMOA – 2015 Q1 – Zeppelin – 2015 ? • Streaming machine learning library – 2015 Q3 • Streaming graph processing library – 2015 Q3
  26. Fault tolerance • At-least-once semantics – currently an alpha version – source-level in-memory replication – record acknowledgments • Exactly-once semantics – final goal, current research – upstream backup with state checkpointing
  27. Lambda architecture in other systems [Diagram source: https://www.mapr.com/developercentral/lambda-architecture]
  28. Lambda architecture in Apache Flink: one system, one API, one cluster
  29. Performance
  30. Flink vs Storm
  31. Flink vs Spark [Chart: processing time in ms for 300 million packets vs. memory in GB, comparing Spark and Flink]
  32. Summary • Flink combines a true streaming runtime with expressive high-level APIs for a next-gen stream processing solution • Tunable throughput-latency trade-off with competitive performance at both ends • Iterative processing support opens new horizons in online machine learning • We are just getting started! – Lambda architecture – Integrations
  33. Where to find us • flink.apache.org • github.com/apache/flink • @ApacheFlink • gyfora@apache.org
  34. Thank you!
