Spark is a unified analytics engine for large-scale data processing. It provides APIs for SQL queries, streaming data, and machine learning. Spark uses RDDs (Resilient Distributed Datasets) as its fundamental data abstraction, allowing data to be operated on in parallel. RDDs track lineage information so that lost data can be recovered efficiently. Spark offers advantages over MapReduce such as better performance, more concise code, and support for iterative algorithms. It can also be used for both batch and streaming workloads through the same APIs. While still maturing, Spark is gaining popularity for its ease of use and performance.
Who has Spark in production?
Who has a POC?
Who has tried using the REPL?
Who has never heard of Spark before today?
I’d like to tell you a little about why I think Spark is cool, and why you should consider applying it to your data problems.
I’ll talk about the world of MapReduce, a.k.a. the world before Spark.
I’ll talk about what Spark is and why it’s cool, and we’ll talk a little about how Spark compares to Storm.
Finally, we’ll talk about the issues people report.
MapReduce is conceptually simple, but it is difficult to program and really only suitable for batch-style work. Today, we want a more interactive, iterative experience for processing big data.
MapReduce involves mappers, an automatic shuffle phase, and a reduce phase. As we move through these stages, intermediate data is produced, and it needs to be stored somewhere for subsequent phases and tasks. This means that you’ll often read and write the same data multiple times in MapReduce.
Later, DSLs like Hive and Pig came along to make processing easier, compiling a simpler language down to MapReduce. With the DSLs, we gain efficiency in developing processing pipelines, but MapReduce remains underneath. DSLs eliminate boilerplate and can do some optimization, but they do not get around the multiple stages of MapReduce, which necessarily involve extra I/O.
Spark was developed at UC Berkeley in recent years and went open source in 2010. To me, Spark represents a real refinement over Hadoop, one that brings a long list of benefits you’ll want as a part of your life.
In about 4 years Spark has seen tremendous adoption.
330 contributors, 50 companies, 10 platforms distributing Spark (as of 10/31). About a year ago, Databricks was formed to support Spark commercially.
Yahoo, Adobe, Ooyala, Sharethrough are in production with Spark, and all the logos you see here (and probably more) have devoted some effort to supporting and advancing the Spark community.
http://spark-summit.org/wp-content/uploads/2014/07/Productionizing-a-247-Spark-Streaming-Service-on-YARN-Ooyala.pdf
Spark aims to be a unified platform for data processing, handling both batch and streaming use cases.
The general-purpose execution engine plugs into Mesos and YARN for resource management, and can also use its own standalone clustering capability.
We can write programs for Spark in Scala, Python, and Java, and there are APIs included that allow us to:
(Spark SQL) write SQL queries in our code, and get back RDDs
(Streaming) process a never-ending stream of data at low latency
(MLlib) use a machine learning library for batch and iterative use cases
(GraphX) use a graph computation library to solve graph problems
Data sources include HDFS, Amazon S3, Kafka and of course your local file system - chances are if you’ve got a bunch of data, it’s going to be pretty easy for Spark to get at it.
All of this is supported by a really user-friendly API and very good and constantly improving documentation.
Let’s talk about some of the components that make up the Spark platform…
===
Resources:
Intro to Scala: http://www.artima.com/scalazine/articles/steps.html
Spark aims to generalize MapReduce and provides a relatively small API. It uses memory and pipelining to achieve dramatically better performance than MapReduce.
Spark appears to be poised to overtake MapReduce as the dominant big data execution engine.
SQL Queries over your data, plus a mode of operation that acts as a “faster Hive”. Spark SQL can use your hive metastore, and query your hive tables using the Spark engine, which should be faster than Hive in many (but not all) cases.
JDBC available today.
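To make that concrete, here’s a minimal sketch of querying a Hive table from Scala. It assumes a Spark 1.x-style HiveContext and a hypothetical "logs" table in your metastore:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // The HiveContext picks up your existing Hive metastore configuration.
    val sc = new SparkContext(new SparkConf().setAppName("SparkSQLSketch"))
    val hive = new HiveContext(sc)

    // Query a (hypothetical) Hive table; the result behaves like an RDD of rows,
    // so you can keep transforming it with the regular Spark API.
    val topErrorHosts = hive.sql(
      "SELECT host, COUNT(*) AS errors FROM logs WHERE level = 'ERROR' GROUP BY host")
    topErrorHosts.collect().foreach(println)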
Spark Streaming - operate on a stream of data.
Data can come from a distributed file system such as HDFS, a messaging system such as Kafka, or even a network socket. The framework turns the stream into a series of RDDs, or micro-batches based on a window of time, which can then be operated on using the same API as is used for batch.
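For instance, a minimal streaming sketch with a 10-second batch window might look like this. The socket host and port are hypothetical; Kafka or HDFS sources are wired up similarly through their own input streams:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Each 10-second window of the stream becomes one RDD (a micro-batch).
    val conf = new SparkConf().setAppName("StreamingSketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Read lines arriving on a network socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    // The familiar RDD-style API applies to each micro-batch.
    lines.filter(_.contains("ERROR")).count().print()

    ssc.start()
    ssc.awaitTermination()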
MLlib is a machine learning library that aims to be fast, and more correct through iteration. My data scientist friends tell me that one way to get better results is to use a simple algorithm over more data. Spark and MLlib are terrific at iteration, so it’s a natural fit for the application of algorithms that benefit from iteration.
MLlib isn’t just for Ph.D.s, though. It also has summary statistics that those of us with a high-school-level understanding of statistics can use on our data.
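As a small example of those summary statistics, here’s a sketch using MLlib’s column statistics. It assumes a SparkContext "sc" is already available (as it is in the shell), and the numbers are made up:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.Statistics

    // A tiny RDD of feature vectors; in practice you'd parse these from your data.
    val observations = sc.parallelize(Seq(
      Vectors.dense(1.0, 10.0, 100.0),
      Vectors.dense(2.0, 20.0, 200.0),
      Vectors.dense(3.0, 30.0, 300.0)))

    // Column-wise summary statistics: per-column mean, variance, and non-zero counts.
    val summary = Statistics.colStats(observations)
    println(summary.mean)
    println(summary.variance)
    println(summary.numNonzeros)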
Graph Basics: http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=graphsDataStrucs1
Advanced Training on GraphX - https://www.youtube.com/watch?v=Rb2oBG039dM
Taking Wikipedia as an example, you could:
Compute page rank using the link graph - the connections between Wikipedia pages - to generate a top-N pages report (sketched after this list)
Analyze the connections between terms and the documents they’re used in
Perform analysis of Wikipedia editors, to gain some understanding of the Wikipedia editor community
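Here’s a rough sketch of that first analysis in GraphX. The edge-list file path and tolerance are hypothetical, and the file is assumed to contain one "sourcePageId destinationPageId" pair per line:

    import org.apache.spark.graphx.GraphLoader

    // Load the link graph from an edge-list file.
    val linkGraph = GraphLoader.edgeListFile(sc, "hdfs:///wikipedia/links.txt")

    // Run PageRank until it converges to the given tolerance, then report the top 10 pages.
    val ranks = linkGraph.pageRank(0.0001).vertices
    val topPages = ranks.top(10)(Ordering.by { case (_, rank) => rank })
    topPages.foreach(println)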
These are the headlines - Spark is easy to develop in, and it’s FAST. Let’s unroll some of this.
The REPL is a shell for Spark. You can write scala or python code, and get immediate feedback.
You can explore your data and test out ideas extremely quickly.
Even in Java (8), you can express programs with minimal cruft.
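For example, a quick exploratory session in the Scala shell (bin/spark-shell, which provides "sc" for you) might look like this - the file name is hypothetical:

    val readme = sc.textFile("README.md")
    readme.count()                                // how many lines?
    readme.filter(_.contains("Spark")).count()    // how many mention Spark?
    readme.flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .take(5)                                // a few word counts, immediately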
RDDs are Resilient Distributed Datasets, and they are a fundamental component of Spark.
You can create RDDs in two ways - by parallelizing an existing data set (for example, generating an array then telling Spark to distribute it to workers) or by obtaining data from some external storage system.
Once created, you can transform the data. However, when you make transformations, you are not actually modifying the data in place, because RDDs are immutable - transformations always create new RDDs.
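A minimal sketch of both creation paths (the collection and the file path are made up):

    val numbers = sc.parallelize(1 to 1000)               // distribute an in-memory collection
    val lines = sc.textFile("hdfs:///data/events.txt")    // or read from external storage

    // Transformations never modify an RDD in place; they return a new RDD.
    val doubled = numbers.map(_ * 2)                       // "numbers" is untouched, "doubled" is new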
Transformations are your instructions on how to modify an RDD. Spark takes your transformations, and creates a graph of operations to carry out against the data. Nothing actually happens with your data until you perform an action, which forces Spark to evaluate and execute the graph in order to present you some result.
As with the MapReduce DSLs, this allows for a “compile” stage of execution that enables optimization - pipelining, for example. Thus, even though your Spark program may spawn many stages, Spark can avoid the intermediate I/O that is one of the reasons MapReduce is considered “slow”.
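In the sketch below, for example, nothing is read or computed until the final count(); the transformations only describe the graph of operations Spark will eventually run (the path is hypothetical):

    val lines = sc.textFile("hdfs:///data/events.txt")    // transformation: nothing runs yet
    val errors = lines.filter(_.contains("ERROR"))         // transformation: still nothing runs
    val n = errors.count()                                 // action: Spark now plans and executes the job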
Check the documentation for a complete list
http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
By default, Spark will recompute your RDDs each time you call an action.
You can, however, have Spark persist your RDDs according to your preference. Among your options is to cache the RDDs in memory. Caching an RDD in memory means you avoid recomputing the data from the transformation graph, which can obviously have a dramatic effect on performance.
You can also persist RDDs to disk, or use a combination of the two. Spark will automatically spill to disk if you request caching, but lack the RAM to hold the working set.
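A sketch of the two common options, reusing the hypothetical RDDs from the earlier examples (an RDD’s storage level can only be set once, so pick one per RDD):

    import org.apache.spark.storage.StorageLevel

    // Keep the RDD in memory after the first action computes it.
    errors.cache()

    // Or ask for memory first, with spill-to-disk as the fallback.
    lines.persist(StorageLevel.MEMORY_AND_DISK)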
If you’re storing stuff in memory, at some point you’ll become concerned about how Spark will react to the loss of a node. The R in RDD stands for resilient, and the way Spark achieves resilience is by tracking data lineage. This means Spark maintains a graph of transformations from which it can recompute lost data. In the example here, we’ve got a simple scenario: we’ve loaded a partition of data from HDFS, filtered it to yield a filtered RDD, then applied a function to each element of that RDD, yielding a mapped RDD.
If you cache the RDD “msgs”, and the node fails, Spark will recompute the RDD using the lineage information.
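That pipeline looks roughly like this (the path and the field being extracted are hypothetical):

    val lines = sc.textFile("hdfs:///logs/app.log")        // base RDD, partitioned across the cluster
    val filtered = lines.filter(_.startsWith("ERROR"))     // filtered RDD
    val msgs = filtered.map(_.split("\t")(1))              // mapped RDD: pull out the message field
    msgs.cache()

    // If a node holding cached partitions of "msgs" fails, Spark replays
    // textFile -> filter -> map for just the lost partitions.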
The lineage information is stored in a directed acyclic graph, or DAG.