PROTEUS H2020

PROTEUS: Scalable Online Machine Learning for Predictive Analytics and Real-Time Interactive Visualization
BONAVENTURA DEL MONTE
RESEARCHER @DFKI GMBH
PH.D. STUDENT @TU BERLIN
EUROPRO WORKSHOP, EDBT 2017
This project is funded by the European Union.
Horizon 2020

[Figure: the five Vs of Big Data: Value, Velocity, Variety, Veracity, Volume]

PROTEUS is an EU H2020-funded research project that aims to design,
develop, and provide an open-source, ready-to-use Big Data solution able to
perform real-time interactive analytics and predictive analysis through
massive online machine learning, while dealing efficiently with extremely large
historical data and data streams.

CONTENTS
1. PROJECT DETAILS
2. VALIDATION SCENARIO
3. HYBRID PROCESSING ENGINE
4. SCALABLE ONLINE MACHINE LEARNING
5. REAL-TIME INTERACTIVE VISUAL ANALYTICS
6. CONCLUSION

Project details
• Expected Outcomes
  • Hybrid processing
    • Batch & stream processing engine
    • Declarative language for batch & stream analytics
  • Scalable online machine learning
    • SOLMA library
  • Real-time interactive visual analytics
    • Web charts library
    • Incremental engine for interactive analytics
• Business Impact
  • Validation in a realistic industrial use case

Hot Strip Mill: Big Data scenario

Hybrid Processing
• Smoother processing of data streams and historical data in the same Flink job
• A declarative language for batch and streaming analytics
• ETL and ML pipelines expressed in a unified language are holistically optimized
[Figure: pipeline that gathers and cleans sensor data, applies PCA, and trains an ML model; labels D1, D2, D3]
"Bridging the Gap: Towards Optimizations across Linear and Relational Algebra". Andreas Kunft, Alexander Alexandrov, Asterios Katsifodimos, Volker Markl. BeyondMR workshop @ SIGMOD 2016.

Scalable Online Machine Learning
• ML challenge: distributed data streams
  • The current state of the art in machine learning for Big Data is dominated by offline learning algorithms that process data-at-rest
  • Many current data sources are streaming (online, data-in-motion): sensors, social networks, clickstreams, etc.
  • In online learning, algorithms see each data item only once. Traditionally, "online" meant that data is processed sequentially, one item at a time, but possibly over many epochs. Evaluation is prequential: each item is first used to test the model and then to train it.
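The single-pass, test-then-train idea behind prequential evaluation can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `OnlinePerceptron`, `prequential_accuracy`, and `synthetic_stream` are hypothetical names invented here, the perceptron is a stand-in learner, and none of this is SOLMA or Flink code.

```python
import random


class OnlinePerceptron:
    """Tiny online linear classifier; a stand-in learner, not a SOLMA algorithm."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score >= 0 else 0

    def learn_one(self, x, y):
        error = y - self.predict(x)  # -1, 0, or +1
        if error:
            self.w = [wi + self.lr * error * xi for wi, xi in zip(self.w, x)]
            self.b += self.lr * error


def prequential_accuracy(model, stream):
    """Test-then-train: predict on each item before learning from it."""
    correct = total = 0
    for x, y in stream:
        correct += int(model.predict(x) == y)  # test first
        model.learn_one(x, y)                  # then train; the item is never revisited
        total += 1
    return correct / total if total else 0.0


def synthetic_stream(n, seed=0):
    """Hypothetical stream: the label is 1 when x0 + x1 > 1."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.random(), rng.random()]
        yield x, int(x[0] + x[1] > 1.0)


accuracy = prequential_accuracy(OnlinePerceptron(n_features=2), synthetic_stream(5000))
print(f"prequential accuracy: {accuracy:.3f}")
```

Because every item is scored before the model learns from it, the reported accuracy needs no separate hold-out set and no second pass over the stream.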

Real-time Interactive Visual Analytics
• How to interactively visualize Big Data?
  • Incremental Analytics engine: incremental partial results in ~O(1)
  • Visualization layer: SSR-enabled web-based library seamlessly connected to the Incremental Analytics engine
https://github.com/proteus-h2020/proteic
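One way to read "incremental partial results in ~O(1)" is constant work per incoming element, with a fresh partial result available after each update. The sketch below illustrates that reading with Welford's running mean and variance. It is an assumption about what such an engine might compute, not the PROTEUS incremental engine itself; `RunningStats`, `partial_results`, and the sample readings are hypothetical.

```python
class RunningStats:
    """Running count, mean, and variance with O(1) work and memory per update."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        # Welford's update: no past readings are stored or rescanned.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance of the readings seen so far.
        return self._m2 / self.n if self.n else 0.0


def partial_results(readings):
    """Yield a snapshot after every reading, e.g. to refresh a chart."""
    stats = RunningStats()
    for x in readings:
        stats.update(x)
        yield stats.n, stats.mean, stats.variance


# Hypothetical sensor readings, for illustration only.
snapshots = list(partial_results([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
for n, mean, variance in snapshots:
    print(f"n={n} mean={mean:.3f} variance={variance:.3f}")
```

A visualization layer could redraw from each snapshot as it arrives instead of waiting for a full recomputation over the historical data.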

Conclusions
• PROTEUS is an EU H2020 international research project
• PROTEUS will contribute to the Big Data ecosystem with:
  • An innovative hybrid engine for processing both data-at-rest and data-in-motion
  • SOLMA: a new library for scalable online machine learning
  • Big Data visualization guidelines: new ways of presenting and working with Big Data
  • Real-time interactive visualization technology: incremental engine & web-based library
• PROTEUS will validate its innovations in a realistic industrial scenario
• PROTEUS will provide full-scale evaluation and impact assessment, including benchmarks, KPIs, and anonymized datasets
  • Specific metrics for the ArcelorMittal use case
  • Generic indicators on the advancements in scalable machine learning, hybrid computation, and real-time interactive visual analytics

Thanks for your attention!
Questions?
• Contact us:
  • Bonaventura Del Monte
  • bonaventura dot delmonte at dfki dot de
  • www.dfki.berlin
www.proteus-bigdata.com
www.github.com/proteus-h2020

Apache Flink 101
• Massively parallel dataflow engine with unified batch and stream processing
• Rich set of operators (including native iteration)
• Flink Optimizer
  • Inspired by optimizers of parallel database systems
  • Physical optimization follows a cost-based approach
• Memory Management
  • Flink manages its own memory
  • Never breaks the JVM heap

Scalable Online Machine Learning
• PROTEUS contribution: SOLMA
  • User-friendly
  • Extensible
  • Basic scalable stream sketches that make it possible to query the stream
  • Iterative algorithms for approximating the outcome of offline computation
  • Ready-to-use (supervised & unsupervised) online ML algorithms in Apache Flink
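As an illustration of what a "stream sketch that makes it possible to query the stream" can look like, here is a minimal Count-Min sketch in plain Python. The slides do not say which sketches SOLMA provides, so the choice of Count-Min is an assumption. The class name, the `width` and `depth` defaults, and the sample events are hypothetical, and this is not SOLMA's API.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counts in fixed memory; estimates never undercount."""

    def __init__(self, width=256, depth=4):
        # width and depth are arbitrary illustrative sizes.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One salted hash per row, so that rows collide on different items.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the tightest bound.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))


# Hypothetical event stream, for illustration only.
events = ["sensor-a"] * 50 + ["sensor-b"] * 20 + ["sensor-c"] * 5
sketch = CountMinSketch()
for event in events:
    sketch.add(event)

for name in ("sensor-a", "sensor-b", "sensor-c"):
    print(name, sketch.estimate(name))
```

Memory use stays at `width * depth` counters regardless of stream length, and each update or query touches only `depth` cells, which is what makes this kind of summary attractive for unbounded streams.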
