Data Intensive Applications with Apache Flink

A brief introduction to Apache Flink and an overview of the current possibilities it offers to develop Machine Learning solutions.

Milan – July 13 2016
Data Intensive Applications with Apache Flink
Simone Robutti
Machine Learning Engineer at Radicalbit
@SimoneRobutti

Agenda
1. Brief Introduction to Apache Flink
○ Why
○ What
○ How
2. Machine Learning on Flink
○ Present landscape
○ Future of the Ecosystem
3. Closing notes on Radicalbit (shameless plug ahead)

100% Buzzword-free guaranteed
Big Data
Machine Intelligence
Web-scale
400x
It’s like the human brain
Exactly-once
Exactly-once

Why Flink (and not Spark/Storm/Samza...)
Because it’s
production-ready
streaming-first
low-latency
fault-tolerant
high-throughput
processing engine

Flink: what is it?
From Flink’s Documentation

Flink’s Runtime
From Flink’s Documentation

Flink’s DataFlow
From Flink’s Documentation
Written by the user through DataSet/DataStream API
Compiled and optimized in the client

Flink’s DataFlow
From Flink’s Documentation
The compiled job is translated to distributed tasks by the master and executed by workers
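
The two DataFlow slides above can be sketched roughly as follows. This is a toy in plain Python, not Flink's API: the `Plan` class, its `map`/`filter`/`execute` methods and the thread pool standing in for workers are all assumptions made for illustration, and no optimization step is modeled. It only shows the idea that the user's calls build a plan lazily, and that the plan is later split into parallel tasks.

```python
# Toy illustration only: this is NOT Flink's API.
from concurrent.futures import ThreadPoolExecutor


class Plan:
    """Records operators lazily; nothing runs until execute() is called."""

    def __init__(self, source, ops=()):
        self.source = list(source)
        self.ops = tuple(ops)

    def map(self, fn):
        return Plan(self.source, self.ops + (("map", fn),))

    def filter(self, pred):
        return Plan(self.source, self.ops + (("filter", pred),))

    def execute(self, parallelism=2):
        # "Master" step: split the input into one partition per worker.
        partitions = [self.source[i::parallelism] for i in range(parallelism)]
        # "Worker" step: each task applies the whole operator chain to its partition.
        with ThreadPoolExecutor(max_workers=parallelism) as pool:
            results = pool.map(self._run_task, partitions)
        return sorted(x for part in results for x in part)

    def _run_task(self, partition):
        out = partition
        for kind, fn in self.ops:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out


# Building the plan does no work; execute() triggers the parallel run.
plan = Plan(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
result = plan.execute(parallelism=3)
print(result)
```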

ML on Flink
Ready and awesome for parallel ML
Work in progress for distributed ML

Flink for Model Evaluation Pipelines
[Diagram of a pipeline. Nodes: Source, Data Preparation, Evaluation, Post-processing, Sink, plus a second Source. Caption: Composable, modular Flink Operator]
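
A minimal sketch of what "composable, modular" stages could look like, using plain Python functions in place of Flink operators. The stage names follow the slide; the `compose` helper, the record layout and the linear model with its weights are hypothetical and only there to make the example run.

```python
# Toy illustration only: plain Python functions standing in for Flink operators.
from functools import reduce


def compose(*stages):
    """Chain stages left to right; each stage maps an iterable to an iterable."""
    return lambda records: reduce(lambda data, stage: stage(data), stages, records)


def prepare(records):
    # Data preparation: drop incomplete records and scale the feature.
    return ({"x": r["x"] / 10.0} for r in records if r.get("x") is not None)


def evaluate(records):
    # Evaluation: apply a hypothetical, already-trained linear model.
    weight, bias = 2.0, -1.0
    return ({**r, "score": weight * r["x"] + bias} for r in records)


def postprocess(records):
    # Post-processing: turn the raw score into a decision.
    return ({**r, "label": "high" if r["score"] > 0 else "low"} for r in records)


pipeline = compose(prepare, evaluate, postprocess)
source = [{"x": 8}, {"x": None}, {"x": 2}]
sink = list(pipeline(source))
print(sink)
```

Because every stage has the same shape (records in, records out), a stage can be swapped or tested on its own without touching the rest of the pipeline.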

Evaluation with Flink-JPMML
Small library that implements basic model eval.
[Diagram. Nodes: Source Operator, Data Preparation, Flink-JPMML Operator, Sink Operator, plus a second Source Operator. Input file: model.pmml]

“I have seen people insisting on using Hadoop for
datasets that could easily fit on a flash drive and could
easily be processed on a laptop.”
- Yann LeCun
ML on Flink

FlinkML
What: Out-of-the-box workhorse algorithms (ALS,
SVM, LinReg, LogReg …)
Status: early phase, slow development

FlinkML
Pro: available out of the box, written with Flink API
Cons: reinvents the wheel, only a few algorithms,
no model persistence

Samsara
What: Linear algebra framework
Status: mature

Samsara
Pro: generic algorithms with platform-specific
bindings, skilled community
Cons: covers only a few use cases

SAMOA
What: Online learning algorithm framework (VHT,
AMR, …)
Status: early phase, complicated relationship with
the industry

SAMOA
Pro: many powerful generic online learning
algorithms, backed by academics (MOA, Weka)
Cons: not production ready, academic focus

ML on Flink: the future of the ecosystem

Apache Beam
Programming model for data processing pipelines
● Streaming first, batch as a bounded stream
● Layered API: What, Where, When, How
● Platform agnostic: same program, different
runners
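
The "batch as a bounded stream" bullet can be illustrated with a small sketch. This is plain Python, not the Beam SDK: `windowed_sum` and the sample data are made up for the example. Only the "What" (a sum) and the "Where" (fixed event-time windows) are modeled; the "When" and "How" layers (triggers and refinements) are left out.

```python
# Toy illustration only: this is NOT the Beam SDK.
from collections import defaultdict
from itertools import islice


def windowed_sum(events, window_size):
    """What: sum of values. Where: fixed event-time windows of `window_size`."""
    totals = defaultdict(int)
    for timestamp, value in events:
        window_start = (timestamp // window_size) * window_size
        totals[window_start] += value
    return dict(totals)


# Batch: a bounded collection of (timestamp, value) pairs.
batch = [(0, 1), (3, 2), (5, 4), (11, 8)]


def stream():
    # Stream: a generator that could in principle never end.
    t = 0
    while True:
        yield (t, 1)
        t += 1


# The same function handles both inputs; the stream is cut to 12 events
# here only so that the example terminates.
batch_result = windowed_sum(batch, window_size=5)
stream_result = windowed_sum(islice(stream(), 12), window_size=5)
print(batch_result)
print(stream_result)
```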

Apache Beam - Runners
● Flink
● Spark (Partial)
● Google Cloud Dataflow
● Plain Java
● Gearpump (WIP)
● Apex (WIP)

FlinkML Roadmap
● More algorithms!
● Evaluation framework
● Persistence/export
● Online Learning Framework

Proteus
Online Learning Platform - based on Flink
Source: Proteus’ website

Contributions
● Cassandra Connector
● Scala API extensions
● FlinkML (Linear Algebra Framework, MinHash)
● Akka Connector

Our vision
Flink can become the ideal choice to build real-time decision-heavy applications with high data-throughput
To achieve this:
● Ambitious applications (aim for real-time services)
● Reliable distributed online learning (Proteus?)
● A Pipelining Framework (experiment fast, increase testability and modularity)

THANKS!
Simone Robutti
Mail: simone.robutti@radicalbit.io Medium: @simone.robutti
Twitter: @SimoneRobutti — @weareradicalbit

Data Intensive Applications with Apache Flink

1. Milan – July 13 2016 Data Intensive Applications with Apache Flink Simone Robutti Machine Learning Engineer at Radicalbit @SimoneRobutti

2. Agenda 1. Brief Introduction to Apache Flink ○ Why ○ What ○ How 2. Machine Learning on Flink ○ Present landscape ○ Future of the Ecosystem 3. Closing notes on Radicalbit (shameless plug ahead)

3. 100% Buzzword-free guaranteed Big Data Machine Intelligence Web-scale 400x It’s like the human brain Exactly-once Exactly-once

4. Why Flink (and not Spark/Storm/Samza...) Because it’s production-ready streaming-first low-latency fault-tolerant high-throughput processing engine

5. Flink: what is it? From Flink’s Documentation

6. Connectors and integrations

7. Flink’s Runtime From Flink’s Documentation

8. Flink’s DataFlow From Flink’s Documentation Written by the user through DataSet/DataStream API Compiled and optimized in the client

9. Flink’s DataFlow From Flink’s Documentation The compiled job is translated to distributed tasks by the master and executed by workers

10. Machine Learning on Flink

11. Ready and awesome for parallel ML Work in progress for distributed ML ML on Flink

12. Flink for Model Evaluation Pipelines Source Data Preparation Evaluation Sink Source Post-processing Composable, modular Flink Operator

13. Evaluation with Flink-JPMML Source Operator Flink - JPMML Operator Sink Operator Source Operator model.pmml Small library that implements basic model eval. Data Preparation

14. “I have seen people insisting on using Hadoop for datasets that could easily fit on a flash drive and could easily be processed on a laptop.” - Yann LeCun - ML on Flink

15.

16. FlinkML What: Out-of-the-box workhorse algorithms (ALS, SVM, LinReg, LogReg …) Status: early phase, slow development

17. FlinkML Pro: available out of the box, written with Flink API Cons: reinvents the wheel, only a few algorithms, no model persistence

18. Samsara What: Linear algebra framework Status: mature

19. Samsara Pro: generic algorithms with platform-specific bindings, skilled community Cons: covers only a few use cases

20. SAMOA What: Online learning algorithm framework (VHT, AMR, …) Status: early phase, complicated relationship with the industry

21. SAMOA Pro: many powerful generic online learning algorithms, backed by academics (MOA, Weka) Cons: not production ready, academic focus

22. ML on Flink: the future of the ecosystem

23. Apache Beam Programming model for data processing pipelines ● Streaming first, batch as a bounded stream ● Layered API: What, Where, When, How ● Platform agnostic: same program, different runners

24. Apache Beam - Runners ● Flink ● Spark (Partial) ● Google Cloud Dataflow ● Plain Java ● Gearpump (WIP) ● Apex (WIP)

25. BeamML: a runner-agnostic ML library

26. FlinkML Roadmap ● More algorithms! ● Evaluation framework ● Persistence/export ● Online Learning Framework

27. Proteus Online Learning Platform - based on Flink Source: Proteus’ website

28. The role of Radicalbit

29. Contributions ● Cassandra Connector ● Scala API extensions ● FlinkML (Linear Algebra Framework, MinHash) ● Akka Connector

30. Our vision Flink can become the ideal choice to build real-time decision-heavy applications with high data-throughput To achieve this: ● Ambitious applications (aim for real-time services) ● Reliable distributed online learning (Proteus?) ● A Pipelining Framework (experiment fast, increase testability and modularity)

31. Q&A

32. THANKS! Simone Robutti Mail: simone.robutti@radicalbit.io Medium: @simone.robutti Twitter: @SimoneRobutti — @weareradicalbit