1. Booking Hotel, Flight, Train, Event & Rental Car
Apache Spark
Created By Josi Aranda @ Tiket.com
2. Apache Spark
• Apache Spark is a powerful open-source distributed querying and processing
engine.
• It provides the flexibility and extensibility of MapReduce but at significantly higher
speeds: up to 100 times faster than Apache Hadoop when data is stored in memory,
and up to 10 times faster when accessing disk.
3. Spark’s Features
Apache Spark achieves high performance
for both batch and streaming data, using a
state-of-the-art DAG scheduler, a query
optimizer, and a physical execution engine.
Speed
[Chart: logistic regression performance in Hadoop vs. Spark]
4. Spark’s Features
Write applications quickly in Java, Scala,
Python, R, and SQL. Spark offers over 80
high-level operators that make it easy to
build parallel apps. And you can use it
interactively from the Scala, Python, R, and
SQL shells.
Ease of Use
[Example: Spark's Python DataFrame API, reading JSON files with automatic schema inference]
5. Spark’s Features
Combine SQL, streaming, and complex
analytics. Spark powers a stack of libraries
including SQL and DataFrames, MLlib for
machine learning, GraphX, and Spark
Streaming. You can combine these libraries
seamlessly in the same application.
Generality
6. Spark’s Features
Spark runs on Hadoop, Apache Mesos,
Kubernetes, standalone, or in the cloud. It
can access diverse data sources.
Runs Everywhere
7. Spark Execution Process
• Any Spark application spins up a single driver process (which can contain multiple
jobs) on the master node, which then directs executor processes (each containing multiple
tasks) distributed across a number of worker nodes.
• The driver process determines the number and the composition of the task
processes directed to the executor nodes based on the graph generated for the
given job. Note that any worker node can execute tasks from a number of different
jobs.
8. Resilient Distributed Dataset (RDD)
• Resilient Distributed Datasets (RDDs) are distributed collections of immutable JVM
objects that allow you to perform calculations very quickly, and they are the
backbone of Apache Spark.
• RDDs support two sets of parallel operations: transformations (which return pointers to
new RDDs) and actions (which return values to the driver after running a
computation).
• RDD transformation operations are lazy, in the sense that they do not compute their
results immediately. The transformations are only computed when an action is
executed and the results need to be returned to the driver.*
* An RDD is like a teenager doing chores: they won't do it until their mom starts to check
(and then they do it fast and effectively).
11. RDD (cont.)
Input records (order_id, status, channel):
56312, paid, native_apps
56313, paid, web
56314, shopping_cart, web
56315, paid, web

.filter(lambda line: line[1] == 'paid')
56312, paid, native_apps
56313, paid, web
56315, paid, web

.map(lambda line: (line[2], 1))
(native_apps, 1)
(web, 1)
(web, 1)

.reduceByKey(lambda x, y: x + y)
(native_apps, 1)
(web, 2)
12. Spark DataFrame
• A DataFrame is an immutable distributed collection of data organized into
named columns, analogous to a table in a relational database. Introduced as an
experimental feature within Apache Spark 1.0 as SchemaRDD, they were renamed
to DataFrames as part of the Apache Spark 1.3 release.
• By imposing a structure onto a distributed collection of data, DataFrames allow Spark users
to query structured data in Spark SQL or using expression methods (instead of
lambdas).
14. Ways to Create DataFrame (cont.)
a) Traditional DataFrame creation. b) DataFrame creation with direct SQL. Both
return the same result.
15. Spark Dataset
• Introduced in Apache Spark 1.6, the goal of Spark Datasets was to provide an API
that allows users to easily express transformations on domain objects, while also
providing the performance benefits of the robust Spark SQL execution engine.
As part of the Spark 2.0 release, the DataFrame API was merged into the Dataset
API, unifying data processing capabilities across all libraries.
• Conceptually, the Spark DataFrame is an alias for a collection of generic objects,
Dataset[Row], where a Row is a generic untyped JVM object. A Dataset, by contrast,
is a collection of strongly typed JVM objects, dictated by a case class in Scala or a
class in Java.