We are witnessing a proliferation of big data, which has led to a zoo of data processing systems, each providing a different set of features. For example, Spark provides scalability for analytic tasks, whereas Java 8 Streams provides low latency. Furthermore, complex applications, such as ETL and ML, now require a mixture of platforms to perform tasks efficiently. In such complex data analytics pipelines, multiple data processing systems are used not only for performance reasons but also because of data diversity: datasets often natively reside in different data formats and storage engines. Unfortunately, developers are left alone with the challenging tasks of (1) choosing the right platform for their applications and (2) performing tedious and costly data migration and integration tasks to obtain the results.
In this talk, we will present Rheem, an open-source, scalable cross-platform system that frees developers from these burdens. Rheem provides an abstraction layer on top of Spark (and other processing platforms) with the aim of enabling cross-platform optimization and interoperability. It automatically selects the best data processing platforms for a given task and also handles the cross-platform execution. In particular, we will discuss how Rheem allows Spark to work in tandem with other platforms in order to achieve higher performance. We will also show how easily a developer can write complex applications on top of Rheem that seamlessly use multiple data processing platforms according to the tasks at hand. Using Rheem, developers do not have to worry about the integration or data migration between Spark and other platforms.
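To make the idea of automatic platform selection concrete, here is a minimal sketch in plain Python. It is not Rheem's API or its optimizer: the `PlatformCost` class, the `choose_platform` function, and all cost numbers are hypothetical, chosen only to show how a low-latency platform can win on small inputs while a scalable one wins on large inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlatformCost:
    """Toy linear cost model: fixed start-up cost plus a per-record cost."""
    name: str
    startup_ms: float      # hypothetical job start-up overhead
    per_record_ms: float   # hypothetical processing cost per input record

    def estimate(self, num_records: int) -> float:
        return self.startup_ms + self.per_record_ms * num_records


# Made-up numbers, chosen only to show the crossover between platforms.
PLATFORMS = [
    PlatformCost("Java Streams", startup_ms=5.0, per_record_ms=0.01),
    PlatformCost("Spark", startup_ms=3000.0, per_record_ms=0.0005),
]


def choose_platform(num_records: int) -> str:
    """Return the name of the platform with the lowest estimated cost."""
    return min(PLATFORMS, key=lambda p: p.estimate(num_records)).name


if __name__ == "__main__":
    for n in (1_000, 1_000_000):
        print(n, "->", choose_platform(n))
```

Under these assumed costs, the small input is assigned to Java Streams and the large one to Spark. A real optimizer would presumably rely on learned or calibrated cost functions rather than fixed constants.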
25. RHEEM in a Nutshell
[Architecture diagram]
Rheem App
Rheem API
Cross-Platform Optimizer, Cost Learner, Cross-Platform Executor, Monitor
Java Driver, Spark Driver, Pg Driver
Apache Spark, Java Streams, GraphChi, Postgres
HDFS, LFS, S3
26. Fine-grained Platform Selection
Logical plan: Get real and predicted weights from the year 2017 -> Calculate MSE -> Group by Airline -> Output results
RHEEM plan: Table source -> Filter -> Map -> Group by -> Collect
Execution plan: Table source -> Filter -> Map -> Group by -> Collect
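The logical plan on this slide can be read as the computation sketched below. This is plain Python, not Rheem code; the row layout, the sample values, and the function name `mse_per_airline` are hypothetical and serve only to show what each operator (Filter, Map, Group by, Collect) contributes.

```python
from collections import defaultdict

# Hypothetical rows: (year, airline, real_weight, predicted_weight).
ROWS = [
    (2017, "A", 10.0, 12.0),
    (2017, "A", 20.0, 20.0),
    (2017, "B", 5.0, 8.0),
    (2016, "B", 7.0, 1.0),   # dropped by the filter: wrong year
]


def mse_per_airline(rows, year=2017):
    # Filter: keep only rows from the requested year.
    selected = (r for r in rows if r[0] == year)
    # Map: turn each row into (airline, squared error).
    errors = ((airline, (real - pred) ** 2) for _, airline, real, pred in selected)
    # Group by: gather the squared errors of each airline.
    grouped = defaultdict(list)
    for airline, err in errors:
        grouped[airline].append(err)
    # Collect: compute the mean squared error per airline.
    return {airline: sum(errs) / len(errs) for airline, errs in grouped.items()}


if __name__ == "__main__":
    print(mse_per_airline(ROWS))
```

Each step is independent of where it runs, which is what lets a cross-platform system assign the operators to platforms individually.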
27. Automatic Data Movement
Logical plan: Get real and predicted weights from the year 2017 -> Calculate MSE -> Group by Airline -> Output results
RHEEM plan: Table source -> Filter -> Map -> Group by -> Collect
Execution plan: Table source -> Filter -> Map -> Group by -> Collect
Stage 1: Postgres
Stage 2: Spark
28. Automatic Data Movement
Logical plan: Get real and predicted weights from the year 2017 -> Calculate MSE -> Group by Airline -> Output results
RHEEM plan: Table source -> Filter -> Map -> Group by -> Collect
Execution plan: Table source -> Filter -> Map -> Group by -> Collect
Stage 1: Postgres
Stage 2: Spark
Data movement: SQL 2 RDD
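The staging and data movement shown on these slides can be sketched as follows. This is an illustrative sketch in plain Python, not Rheem's executor: the `CONVERSIONS` table, the `build_stages` function, and the particular operator-to-platform assignment in `PLAN` are assumptions, since the slides do not state which operators run in which stage. The sketch also assumes the conversion runs at the start of the receiving stage.

```python
# Hypothetical conversion operators keyed by (source platform, target platform).
CONVERSIONS = {("Postgres", "Spark"): "SQL 2 RDD"}

# Assumed assignment of operators to platforms, in execution order.
PLAN = [
    ("Table source", "Postgres"),
    ("Filter", "Postgres"),
    ("Map", "Spark"),
    ("Group by", "Spark"),
    ("Collect", "Spark"),
]


def build_stages(assigned_ops):
    """Group consecutive operators by platform and insert conversions.

    assigned_ops: list of (operator name, platform name) in execution order.
    Returns a list of (platform, [operator names]) stages.
    """
    stages = []
    for op, platform in assigned_ops:
        if stages and stages[-1][0] == platform:
            # Same platform as the previous operator: extend the current stage.
            stages[-1][1].append(op)
            continue
        ops = []
        if stages:
            # Platform boundary: look up how to move data across it.
            prev = stages[-1][0]
            conversion = CONVERSIONS.get((prev, platform))
            if conversion is None:
                raise ValueError(f"no data exchange path from {prev} to {platform}")
            ops.append(conversion)
        ops.append(op)
        stages.append((platform, ops))
    return stages


if __name__ == "__main__":
    for number, (platform, ops) in enumerate(build_stages(PLAN), start=1):
        print(f"Stage {number} ({platform}): {' -> '.join(ops)}")
```

With the assumed assignment, the plan splits into a Postgres stage followed by a Spark stage, and the SQL 2 RDD conversion appears without the developer writing any migration code.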
33. The dRHEEM goes on…
[Ecosystem diagram]
Scala, REST, PIGLATIN, Python, Java
Cross-Platform Apps
RHEEM
Spark (Spark SQL, MLlib, Spark Streaming, GraphX, Spark Batch), Postgres, Java Streams, GraphChi
LFS, S3, HDFS
Add more platforms
Integrate with resource managers
Enhance data exchange paths
Continuously improve optimizer
Re-use intermediate datasets across jobs
34. We want you!
Website: http://da.qcri.org/rheem/
GitHub: https://github.com/rheem-ecosystem
Apache Incubator Very Soon!
Interested? Then become a dRHEEMer!
Gitter: https://gitter.im/rheem-ecosystem/Lobby