PROTEUS is an EU-funded research project that aims to develop an open-source big data solution for real-time predictive analytics and interactive visualization. It will provide a hybrid processing engine for both batch and streaming data, a library called SOLMA for scalable online machine learning, and tools for real-time interactive visual analytics. The system will be validated using a use case from ArcelorMittal on analyzing sensor data from a hot strip mill.
PROTEUS H2020
1. PROTEUS
Scalable Online Machine
Learning for Predictive
Analytics and Real-Time
Interactive Visualization
BONAVENTURA DEL MONTE
RESEARCHER @DFKI GMBH
PH.D. STUDENT @TU BERLIN
EUROPRO WORKSHOP, EDBT 2017
This project is funded by the European Union.
Horizon 2020
4. 4
PROTEUS is an EU H2020 funded research project which aims to design,
develop, and provide an open-source, ready-to-use Big Data solution able to
perform real-time interactive analytics and predictive analysis through
massive online machine learning, efficiently dealing with extremely large
historical data and data streams
10. 10
Hybrid Processing
Smoother processing of data streams and historical data in the same Flink job
A declarative language for batch and streaming analytics
ETL and ML pipelines expressed in a unified language are holistically optimized
[Figure: pipeline diagram with the stages "Gather and clean sensor data", "PCA", and "Train ML Model", and the datasets D1, D2, D3]
"Bridging the Gap: Towards Optimizations across Linear and Relational Algebra": Andreas Kunft, Alexander Alexandrov,
Asterios Katsifodimos, Volker Markl. BeyondMR workshop @SIGMOD 2016.
11. 11
Scalable Online Machine Learning
ML challenge: Distributed Data Streams
The current state of the art in machine learning for Big Data is dominated by offline learning
algorithms that process data-at-rest
Many current data sources are streaming (online, data-in-motion): sensors, social networks,
clickstreams, etc.
In online learning, the algorithms see the data only once. In the traditional meaning of online,
data is processed sequentially, one item at a time, but possibly for many epochs
Prequential evaluation: the model first predicts each incoming item and then performs a single training step on it
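A minimal sketch of prequential (test-then-train) evaluation, in plain Python rather than Flink. The `OnlinePerceptron` class, the `prequential_accuracy` helper, and the synthetic stream are hypothetical illustrations, not part of SOLMA or of the PROTEUS code base.

```python
import random


class OnlinePerceptron:
    """Tiny binary linear classifier updated one item at a time (illustrative only)."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score >= 0 else 0

    def learn_one(self, x, y):
        # Single training step: adjust the weights only when the prediction is wrong.
        error = y - self.predict(x)  # -1, 0, or +1
        if error != 0:
            self.w = [wi + self.lr * error * xi for wi, xi in zip(self.w, x)]
            self.b += self.lr * error


def prequential_accuracy(model, stream):
    """Test-then-train: each item is predicted first, then used for one update."""
    correct = 0
    total = 0
    for x, y in stream:
        correct += int(model.predict(x) == y)  # test on the unseen item
        model.learn_one(x, y)                  # then train on it, exactly once
        total += 1
    return correct / total if total else 0.0


def synthetic_stream(n, seed=0):
    """Hypothetical stand-in for a sensor stream: the label is 1 when x0 > x1."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, int(x[0] > x[1])


accuracy = prequential_accuracy(OnlinePerceptron(n_features=2), synthetic_stream(2000))
```

Every item contributes to the accuracy estimate before the model has seen its label, so no separate hold-out set is needed and the stream is read only once.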
12. 12
Real-time Interactive Visual Analytics
How to interactively visualize Big Data?
Incremental Analytics engine: incremental partial results in ~ O(1)
Visualization Layer: SSR-enabled web-based library seamlessly connected to
the Incremental Analytics engine
https://github.com/proteus-h2020/proteic
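As an illustration of incremental partial results, the sketch below keeps a running mean that is updated in O(1) per incoming value and emits a fresh partial result after every update, which a chart could then redraw. The names `IncrementalMean` and `partial_results` are hypothetical; this is not the API of the PROTEUS incremental engine or of the proteic library.

```python
class IncrementalMean:
    """Running mean with constant time and memory per update."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental formula: no need to store or rescan past values.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


def partial_results(stream):
    """Yield the current aggregate after each item, ready to refresh a chart."""
    aggregate = IncrementalMean()
    for value in stream:
        yield aggregate.update(value)


results = list(partial_results([2.0, 4.0, 6.0, 8.0]))
# results == [2.0, 3.0, 4.0, 5.0]
```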
13. 13
Conclusions
PROTEUS is an EU H2020 international research project
PROTEUS will contribute to the Big Data ecosystem with:
An innovative hybrid engine for processing both data-at-rest and data-in-motion
SOLMA: A new library for scalable online machine learning
Big Data Visualization guidelines: new ways of presenting and working with Big Data
Real-time interactive visualization technology: Incremental engine & web-based library
PROTEUS will validate its innovations in a realistic industrial scenario
PROTEUS will provide full-scale evaluation and impact assessment including
benchmarks, KPIs and anonymized datasets
Specific metrics for the ArcelorMittal use case
Generic indicators on the advancements in scalable machine learning, hybrid computation and real-time
interactive visual analytics.
14. 14
Thanks for your attention!
Questions?
Contact us:
Bonaventura Del Monte
bonaventura dot delmonte at dfki dot de
www.dfki.berlin
www.proteus-bigdata.com
www.github.com/proteus-h2020
16. 16
Apache Flink 101
Massively parallel data flow engine with unified batch and stream
processing
Rich set of operators (including native iteration)
Flink Optimizer
Inspired by optimizers of parallel database systems
Physical optimization follows cost‐based approach
Memory Management
Flink manages its own memory
Never breaks the JVM heap
17. 17
Scalable Online Machine Learning
PROTEUS contribution: SOLMA
User-friendly
Extensibility
Basic scalable stream sketches that make it possible to query the stream
Iterative algorithms for approximating the outcome of offline computation
Ready-to-use (supervised & unsupervised) online ML algorithms in Apache Flink
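To make the idea of a stream sketch concrete, here is a minimal Count-Min sketch in plain Python. It is one well-known example of a sketch that answers frequency queries over a stream in bounded memory. The slides do not say which sketches SOLMA provides, so the class below is an assumption chosen for illustration, not SOLMA's API.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter: may overestimate, never underestimates."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One deterministic hash per row, derived by salting with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the tightest bound.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))


sketch = CountMinSketch()
for sensor_id in ["s1", "s2", "s1", "s3", "s1"]:  # hypothetical stream of sensor IDs
    sketch.add(sensor_id)

s1_estimate = sketch.estimate("s1")  # at least 3, the true count
```

Memory use is fixed by `width` and `depth` regardless of how many items arrive, which is what makes this kind of structure suitable for unbounded streams.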
Editor's Notes
As you have probably learned over the last couple of years, big data is not just a huge quantity of heterogeneous data that is complex to analyze and is ingested at a high rate into your data processing system. At the end of the day, what really matters is how much you can capitalize on big data.
However, if you are going to start a new big data project, you will face a zoo of technologies, and the final choice, which strictly depends on the use case, is biased by the knowledge of the IT people leading the project.
If, however, you need to deal with large historical data as well as data streams, and to perform predictive analysis and real-time interactive analytics, then you may consider PROTEUS, as it is an open-source, ready-to-use big data solution offering such capabilities. I will show that in the next slides.
The presentation goes as follows
Research partners, pure IT companies and ArcelorMittal
Our validation scenario, which comes from our partner ArcelorMittal, a leader in the steelmaking industry, deals with predicting anomalies in the coils produced by the so-called hot strip mill process. To do so, we need to perform analytics on both streaming data and historical data.
Three main subsystems: a hybrid processing engine for large historical data and data streams, powered by an enhanced version of Apache Flink (a distributed dataflow system for batch and stream data in a single engine); a library for scalable online machine learning built on top of our processing engine; and the visualization stack, which queries the SOLMA library and the engine in real time.
The ML challenge we are facing deals with data streams: we need online machine learning, which suits stream processing better than traditional batch ML. Online machine learning algorithms see data items one by one; generally speaking, the algorithm first predicts the class of the item and then performs a single training step on the model. This is called prequential evaluation.
Our real-time interactive visual analytics stack tries to answer the question: How to interactively visualize big data? The answer is through incremental partial results that update the charts, and an SSR-enabled web chart library.