2. Outline
An Overview on Spark
Spark Programming Guide
An Example on Spark
Running Applications on Spark
Spark Streaming
Spark Streaming Programming Guide
An Example on Spark Streaming
Spark and Storm: A Comparison
Spark SQL
15 January 2015Majid Hajibaba - Spark 2
4. Cluster Mode Overview
Spark applications run as independent sets of processes on a cluster
Executor processes run tasks in multiple threads
The driver should run close to the worker nodes
To operate remotely, open an RPC to the driver instead of running the driver far from the workers
• Coordinator
• Standalone
• Mesos
• YARN
http://spark.apache.org/docs/1.0.1/cluster-overview.html
5. Spark - A Unified Stack
Spark Core is a “computational engine” that is responsible for scheduling,
distributing, and monitoring applications in a cluster
Higher-level components (Shark, GraphX, Streaming, …) are like
libraries in a software project
Tight integration has several benefits
Simple improvements, minimized costs, combined processing models
6. Spark Processing Model
In memory iterative MapReduce
MapReduce
Processing Model
7. Spark Goal
Provide distributed memory abstractions for clusters to support apps
with working sets
Retain the attractive properties of MapReduce:
Fault tolerance
Data locality
Scalability
Solution: augment data flow model with “resilient distributed datasets”
(RDDs)
8. Resilient Distributed Datasets (RDDs)
Immutable collection of elements that can be operated on in parallel
Created by transforming data using data flow operators (e.g. map)
Parallel operations on RDDs
Benefits
Consistency is easy
due to immutability
Inexpensive fault tolerance
log lineage
no replicating/checkpointing
Locality-aware scheduling of tasks on partitions
Applicable to a broad variety of applications
9. RDDs
Immutable
Collection of
Objects
Partitioned and Distributed
11. Linking with Spark
Spark 1.2.0 works with Java 6 and higher
To write a Spark application in Java, you need to add a dependency on
Spark. Spark is available through Maven Central at:
groupId = org.apache.spark
artifactId = spark-core_2.10
version = 1.2.0
Importing Spark classes into the program:
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.SparkConf;
12. Initializing Spark - Creating a SparkContext
Tells Spark how to access a cluster
The entry point: creating it is the first thing a Spark program does
This is done through the following constructor:
new SparkContext(master, appName, [sparkHome], [jars])
Example:
import org.apache.spark.api.java.JavaSparkContext;
// sparkHome and jars are optional; pass them as two further arguments if needed
JavaSparkContext ctx = new JavaSparkContext("master_url", "application name");
Or through SparkConf for advanced configuration
13. SparkConf
Configuration for a Spark application
Sets various Spark parameters as key-value pairs
SparkConf object contains information about the application
The constructor loads values from any spark.* Java system
properties set in the application and from the classpath
Example
import org.apache.spark.SparkConf;
SparkConf conf =
    new SparkConf().setAppName(appName).setMaster(master);
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
SparkConf sparkConf = new SparkConf().setAppName("application name");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
14. Loading data into an RDD
Spark's primary unit for data representation
Allows for easy parallel operations on the data
Native collections in Java can serve as the basis for an RDD
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
JavaRDD<Integer> dataRDD = ctx.parallelize(Arrays.asList(1, 2, 4));
The number of partitions can be set manually by passing it as a second parameter to
parallelize (e.g. ctx.parallelize(data, 10)).
To load external data from a file, use the textFile method of SparkContext:
textFile(path: String, minSplits: Int)
path: the path of the text file
minSplits: minimum number of splits for Hadoop RDDs
The result is an RDD of strings, with each line of the file being one element of
the RDD
15. textFile method
Read a text file and return it as an RDD of Strings
The file can be read from
a local file system (available on all nodes in distributed mode)
import org.apache.spark.api.java.JavaRDD;
JavaRDD<String> lines = ctx.textFile("file_path", 1);
import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaRDD;
...
ctx.addFile("file_path");
JavaRDD<String> lines = ctx.textFile(SparkFiles.get("file_path"));
HDFS
import org.apache.spark.api.java.JavaRDD;
...
JavaRDD<String> lines = ctx.textFile("hdfs://...");
any other Hadoop-supported file system URI
16. Manipulating RDD
Transformations: to create a new dataset from an existing one
map: works on each individual element in the input RDD and produces a new
output element
Transformation functions do not transform the existing elements, rather they
return a new RDD with the new elements
Actions: to return a value to the driver program after running a computation
on the dataset
reduce: aggregates all the elements of the dataset by combining them pairwise
import org.apache.spark.api.java.function.Function;
rdd.map(new Function<Integer, Integer>() {
public Integer call(Integer x) { return x+1;}
});
import org.apache.spark.api.java.function.Function2;
rdd.reduce(new Function2<Integer, Integer, Integer>() {
public Integer call(Integer x, Integer y) { return x+y;}
});
17. RDD Basics
A simple program
JavaRDD<String> lines = ctx.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);
The dataset is not loaded in memory
lines is merely a pointer to the file
lineLengths is not immediately computed
reduce is an action: Spark breaks the computation into tasks to run on separate machines
Each machine runs both its part of the map and a local reduction
Each machine returns only the answer of its local reduction to the driver program
To use lineLengths again later, we could add the following before the reduce:
lineLengths.persist();
This would cause lineLengths to be saved in memory after the first time it is
computed.
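The lazy behaviour described on this slide can be imitated without a cluster. The following is a minimal plain-Java sketch, not Spark code: it relies on java.util.stream, whose intermediate operations are also deferred until a terminal operation runs. The class name LazyPipelineSketch, the call counter and the sample lines are assumptions made purely for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

final class LazyPipelineSketch {
    // Counts how many times the map function has actually been invoked.
    static final AtomicInteger mapCalls = new AtomicInteger();

    // Builds the "lineLengths" pipeline without running it (plays the role of a transformation).
    static Stream<Integer> lineLengths(List<String> lines) {
        return lines.stream().map(s -> {
            mapCalls.incrementAndGet();
            return s.length();
        });
    }

    // Runs the pipeline and sums the lengths (plays the role of the reduce action).
    static int totalLength(Stream<Integer> lengths) {
        return lengths.reduce(0, (a, b) -> a + b);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("spark", "rdd", "streaming");
        Stream<Integer> lengths = lineLengths(lines);
        System.out.println("map calls before reduce: " + mapCalls.get());
        System.out.println("total length: " + totalLength(lengths));
        System.out.println("map calls after reduce: " + mapCalls.get());
    }
}
```

Unlike an RDD, a Java stream can be consumed only once, so this sketch illustrates deferred computation but does not model persist().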
18. Passing Functions to Spark
Functions are represented by classes implementing the interfaces in the
org.apache.spark.api.java.function package
Two ways to create such functions:
1. Use lambda expressions to concisely define an implementation (in Java 8)
JavaRDD<String> lines = ctx.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);
2. Implement the Function interfaces in your own class, and pass an instance of
it to Spark
As an anonymous inner class:
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(new Function<String, Integer>() {
    public Integer call(String s) { return s.length(); }
});
int totalLength = lineLengths.reduce(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) { return a + b; }
});
As a named class:
class GetLength implements Function<String, Integer> {
    public Integer call(String s) { return s.length(); }
}
class Sum implements Function2<Integer, Integer, Integer> {
    public Integer call(Integer a, Integer b) { return a + b; }
}
JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(new GetLength());
int totalLength = lineLengths.reduce(new Sum());
19. Working with Key-Value Pairs
key-value pairs are represented using the scala.Tuple2 class
call new Tuple2(a, b) to create a tuple
access its fields with tuple._1() and tuple._2()
RDDs of key-value pairs
distributed “shuffle” operations (e.g. grouping or aggregating the elements
by a key)
Represented by the JavaPairRDD class
JavaPairRDDs can be constructed from JavaRDDs using special versions of
the map operations (mapToPair, flatMapToPair)
The JavaPairRDD will have both standard RDD functions and special key-value ones:
reduceByKey
sortByKey
import scala.Tuple2;
...
Tuple2<String, String> tuple = new Tuple2<String, String>("foo", "bar");
System.out.println(tuple._1() + " " + tuple._2());
20. Working with Key-Value Pairs
reduceByKey example
to count how many times each line of text occurs in a file
sortByKey example
to sort the pairs alphabetically
and to bring them back to the driver program as an array of objects
import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
...
JavaRDD<String> lines = ctx.textFile("data.txt");
JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new
Tuple2(s, 1));
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a +
b);
...
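// sortByKey returns a new RDD, so chain collect() to bring the sorted pairs to the driver
// (needs import java.util.List)
List<Tuple2<String, Integer>> sorted = counts.sortByKey().collect();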
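To see what the line-count example computes without starting Spark, the same logic can be modelled on an in-memory list. This is a plain-Java sketch using java.util.stream, not the Spark API: the class name LineCountSketch and the sample input are hypothetical, and a TreeMap stands in for the sorted result that sortByKey would produce.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class LineCountSketch {
    // Pairs every line with 1, merges the values of equal keys by addition,
    // and keeps the keys sorted: a local model of mapToPair + reduceByKey + sortByKey.
    static Map<String, Integer> countLines(List<String> lines) {
        return lines.stream().collect(
            Collectors.toMap(line -> line, line -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("b", "a", "b", "c", "a", "b");
        System.out.println(countLines(lines));
    }
}
```

The merge function Integer::sum plays the role of the (a, b) -> a + b lambda passed to reduceByKey on the slide.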
21. flatMap
flatMap is a combination of map and flatten
Each input item returns a sequence rather than a single item; the results are then flattened
Use case: parsing input where some items may fail to parse (a failed item returns an empty sequence)
http://www.slideshare.net/frodriguezolivera/apache-spark-streaming
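The difference between map and flatMap, and the parsing use case mentioned above, can be sketched in plain Java. This example uses java.util.stream rather than Spark, so it only illustrates the semantics; the class FlatMapSketch and its method names are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class FlatMapSketch {
    // map: exactly one output element (here an array of words) per input line.
    static List<String[]> mapToWordArrays(List<String> lines) {
        return lines.stream()
                    .map(line -> line.split(" "))
                    .collect(Collectors.toList());
    }

    // flatMap: each line yields a sequence of words, and the sequences are flattened.
    static List<String> flatMapToWords(List<String> lines) {
        return lines.stream()
                    .flatMap(line -> Arrays.stream(line.split(" ")))
                    .collect(Collectors.toList());
    }

    // Returns a one-element stream for a valid integer and an empty stream otherwise.
    static Stream<Integer> tryParse(String token) {
        try {
            return Stream.of(Integer.parseInt(token.trim()));
        } catch (NumberFormatException e) {
            return Stream.empty();
        }
    }

    // Parses all tokens and silently drops the ones that fail to parse.
    static List<Integer> parseValid(List<String> tokens) {
        return tokens.stream()
                     .flatMap(FlatMapSketch::tryParse)
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or", "not to be");
        System.out.println(mapToWordArrays(lines).size() + " arrays from map");
        System.out.println(flatMapToWords(lines));
        System.out.println(parseValid(Arrays.asList("1", "oops", " 3 ")));
    }
}
```

Returning an empty sequence for an unparsable item is what lets flatMap act as a combined map and filter.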
25. A Complete Example
Word Counter Program
Package and classes
Import needed classes
Package’s name
(will be passed to spark submitter)
26. A Complete Example
Main Class
Creating a SparkContext
Creating a SparkConf
Application name
(will be passed to spark submitter)
Loading data into an RDD
Base RDD
27. A Complete Example
JavaRDDs and JavaPairRDDs functions
construct JavaPairRDDs from JavaRDDs
count how many times each word of text occurs in a file
values for each key are aggregated
create a tuple (key-value pairs)
Transformed RDD
28. A Complete Example
Printing results
accessing tuples
action
29. Iteration 1
output = count.collect();
Spark Execution Model
30. Iteration 2
output = count.reduce(func);
Spark Execution Model
35. Overview
Data can be ingested from many sources like Kafka, Flume, Twitter,
ZeroMQ, Kinesis or TCP sockets
Data can be processed using complex algorithms expressed with high-
level functions like map, reduce, join and window
Processed data can be pushed out to filesystems, databases, and live
dashboards
Potential for combining batch processing and streaming processing in
the same system
you can apply Spark’s machine learning and graph processing algorithms on
data streams
36. Spark Streaming – How It Works
Run a streaming computation as a series of very small, deterministic
batch jobs
Chop up the live stream into batches of X seconds
Spark treats each batch of data as RDDs and processes them using RDD operations
Finally, the processed results of the RDD operations are returned in batches
Batch sizes as low as ½ second, latency of about 1 second
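The "chop up the live stream into batches" step can be pictured with a small plain-Java sketch. It is not how Spark Streaming is implemented; it only shows how timestamped records fall into fixed-length batch intervals. The Event type, the chop method and the assumption that timestamps are non-negative milliseconds are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MicroBatchSketch {
    // A record with an arrival time in milliseconds (hypothetical type, timestamps assumed >= 0).
    static final class Event {
        final long timestampMs;
        final String payload;

        Event(long timestampMs, String payload) {
            this.timestampMs = timestampMs;
            this.payload = payload;
        }
    }

    // Assigns every event to the batch whose interval contains its timestamp.
    // Key: start time of the batch. Value: payloads of that batch in arrival order.
    static Map<Long, List<String>> chop(List<Event> stream, long batchIntervalMs) {
        Map<Long, List<String>> batches = new TreeMap<>();
        for (Event e : stream) {
            long batchStart = (e.timestampMs / batchIntervalMs) * batchIntervalMs;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(e.payload);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Event> stream = Arrays.asList(
            new Event(100, "a"), new Event(900, "b"),
            new Event(1000, "c"), new Event(2500, "d"));
        // With a 1000 ms interval, each map entry corresponds to one small batch job.
        System.out.println(chop(stream, 1000));
    }
}
```

Each list in the resulting map corresponds to what the slide calls a batch, which Spark would then process with ordinary RDD operations.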
37. DStreams (Discretized Streams)
represents a continuous stream of data
is represented as a sequence of RDDs
can be created from
input data streams from sources such as Kafka, Flume, and Kinesis
by applying high-level operations on other DStreams
Example: lines to words
38. Running Example - JavaNetworkWordCount
You will first need to run Netcat as a data server by using
$ nc -lk 9999
Note that Spark must be installed
Then, in a different terminal, you can start the example by using
$ ./bin/run-example streaming.JavaNetworkWordCount localhost 9999
Then, any lines typed in the terminal running the Netcat server will be
counted and printed on screen every second.
40. Linking with Spark
As with Spark batch processing
Spark 1.2.0 works with Java 6 and higher
To write a Spark Streaming application in Java, you need to add a dependency on
Spark Streaming.
Add the following dependency to your Maven project:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>1.2.0</version>
</dependency>
Or add the following dependency to your SBT project:
libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.2.0"
41. Initializing – Creating StreamingContext
As with SparkContext
Using the constructor
new StreamingContext(master, appName, batchDuration, [sparkHome], [jars])
The batchDuration is the size of the batches
the time interval at which streaming data will be divided into batches
can be created from a SparkConf object
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(1000));
can also be created from an existing JavaSparkContext
...
JavaSparkContext ctx = ... //existing JavaSparkContext
JavaStreamingContext ssc =
    new JavaStreamingContext(ctx, Durations.seconds(1));
42. Setting the Right Batch Size
Batches of data should be processed as fast as they are being generated
The batch interval used may have a significant impact on the data rates that can be sustained
To figure out the right batch size for an application
test it with a conservative batch interval and a low data rate
5-10 seconds
If the system is stable (the delay is comparable to the batch size)
increase the data rate and/or reduce the batch size
If the system is unstable (the delay is continuously increasing)
go back to the previous stable batch size
43. Input DStreams and Receivers
Input DStream is associated with a Receiver
except file stream
Receiver
receives the data from a source and
stores it in memory for processing
Spark Streaming provides two categories of built-in streaming sources.
Basic sources
like file systems, socket connections, and Akka actors
directly available in the StreamingContext API
Advanced sources
like Kafka, Flume, Kinesis, Twitter, etc.
are available through extra utility classes
Custom sources
44. Basic Sources
File Streams
streamingContext.fileStream<KeyClass, ValueClass,
    InputFormatClass>(dataDirectory);
will monitor the directory dataDirectory and process any files created in that directory
For simple text files
streamingContext.textFileStream(dataDirectory)
Socket Streams
streamingContext.socketStream(String hostname, int port,
    Function converter, StorageLevel storageLevel)
Custom Actors
streamingContext.actorStream(actorProps, actor-name)
Actors are concurrent processes that communicate by exchanging messages
Queue of RDDs
streamingContext.queueStream(queueOfRDDs)
Each RDD pushed into the queue will be treated as a batch of data in the DStream, and
processed like a stream
45. Advanced Sources
require interfacing with external non-Spark libraries
Twitter
Linking: Add the artifact spark-streaming-twitter_2.10 to the SBT/Maven project dependencies
Programming: Import the TwitterUtils class and create a DStream with
TwitterUtils.createStream:
import org.apache.spark.streaming.twitter.*;
TwitterUtils.createStream(jssc);
Deploying: Generate an uber JAR with all the dependencies (including the
dependency spark-streaming-twitter_2.10 and its transitive dependencies) and
then deploy the application. This is further explained in the Deploying section.
Flume
Kafka
Kinesis
47. Socket Text Stream
Create an input stream from network source hostname:port
Data is received using a TCP socket
Received bytes are interpreted as UTF-8 encoded, \n-delimited lines
socketTextStream(String hostname, int port);
An overload takes the storage level to use for storing the received objects
socketTextStream(String hostname, int port, StorageLevel storageLevel)
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.api.java.StorageLevels;
...
ssc.socketTextStream("localhost", 9999,
    StorageLevels.MEMORY_AND_DISK_SER);
48. Class ReceiverInputDStream
Abstract class for defining any InputDStream
that has to start a receiver on worker nodes to receive external data
JavaReceiverInputDStream
A Java-friendly interface to ReceiverInputDStream,
the abstract class for defining any input stream that receives data over the network
Example:
Creates a DStream from text data received over a TCP socket connection
import org.apache.spark.api.java.StorageLevels;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
...
JavaReceiverInputDStream<String> lines =
    ssc.socketTextStream("localhost", 9999, StorageLevels.MEMORY_ONLY);
49. Output Operations on DStreams
Allow a DStream’s data to be pushed out to external systems
Trigger the actual execution of all the DStream transformations
Similar to actions for RDDs
Output Operation | Meaning
print() | Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application.
saveAsTextFiles(prefix, [suffix]) | Saves the DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix.
saveAsObjectFiles(prefix, [suffix]) | Saves the DStream's contents as a SequenceFile of serialized Java objects.
saveAsHadoopFiles(prefix, [suffix]) | Saves the DStream's contents as a Hadoop file.
foreachRDD(func) | Applies a function to each RDD generated from the stream. This function should push the data in each RDD to an external system, like saving the RDD to files, or writing it over the network to a database. The function is executed in the driver process running the streaming application.
50. RDD Persistence
Persisting (or caching) a dataset in memory across operations
Each node stores any computed partitions in memory and reuses them
Methods
.cache() just memory - for iterative algorithms
.persist() just memory - reuses in other actions on dataset
.persist(storageLevel) storageLevel:
MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
DISK_ONLY
Example:
import org.apache.spark.api.java.StorageLevels;
...
JavaReceiverInputDStream<String> lines = ssc.socketTextStream(
    args[0], Integer.parseInt(args[1]),
    StorageLevels.MEMORY_AND_DISK_SER);
51. UpdateStateByKey
To maintain state
Update state with new information
Define the state
Define the state update function
Using updateStateByKey requires checkpointing to be enabled
import com.google.common.base.Optional;
...
Function2<List<Integer>, Optional<Integer>, Optional<Integer>> updateFunction =
    new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>() {
        @Override
        public Optional<Integer> call(List<Integer> values, Optional<Integer> state) {
            Integer newSum = ... // add the new values to the previous running count
            return Optional.of(newSum);
        }
    };
...
Applied on a DStream containing words:
JavaPairDStream<String, Integer> runningCounts =
    pairs.updateStateByKey(updateFunction);
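The state update idea can be tried out without a streaming context. The following is a plain-Java sketch under stated assumptions: it uses java.util.Optional instead of the Guava Optional shown on the slide, the class RunningCountSketch and the applyBatch loop are hypothetical, and only keys present in the current batch are updated (their resulting counts are the same as a running sum would give).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class RunningCountSketch {
    // State update function: adds the new values of one key to its previous count.
    static Optional<Integer> update(List<Integer> values, Optional<Integer> state) {
        int newSum = state.orElse(0);
        for (int v : values) {
            newSum += v;
        }
        return Optional.of(newSum);
    }

    // Applies the update function to every key seen in one batch (hypothetical driver loop).
    static void applyBatch(Map<String, Integer> state, Map<String, List<Integer>> batch) {
        for (Map.Entry<String, List<Integer>> e : batch.entrySet()) {
            Optional<Integer> previous = Optional.ofNullable(state.get(e.getKey()));
            state.put(e.getKey(), update(e.getValue(), previous).get());
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> state = new HashMap<>();

        Map<String, List<Integer>> firstBatch = new HashMap<>();
        firstBatch.put("spark", Arrays.asList(1, 1));
        firstBatch.put("storm", Arrays.asList(1));
        applyBatch(state, firstBatch);

        Map<String, List<Integer>> secondBatch = new HashMap<>();
        secondBatch.put("spark", Arrays.asList(1));
        applyBatch(state, secondBatch);

        System.out.println(state);
    }
}
```

The state map here lives in a local variable; in a streaming job that state has to survive failures, which is why the slide notes that checkpointing is required.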
52. Checkpointing
To operate 24/7 and be resilient to failures
A streaming application needs to checkpoint enough information to recover from failures
Two types of data are checkpointed
Metadata checkpointing
To recover from failure of the node running the driver
Includes configuration, DStream operations, incomplete batches
Data checkpointing
To cut off the dependency chains
Remove accumulated metadata in stateful operations
To enable checkpointing:
ctx.checkpoint(hdfsPath)
The checkpoint interval of a DStream can be set by using
dstream.checkpoint(checkpointInterval)
A checkpoint interval of 5 to 10 times the sliding interval of the DStream is a good setting to try
54. A Complete Example
Network Word Counter Program
Package and classes
Import needed classes
Package’s name
(will be passed to spark submitter)
55. A Complete Example
Main Class
Creating a SparkStreamingContext
Creating a SparkConf
Application name
(will be passed to spark submitter)
Setting batch size
Socket Streams as Source
Input DStream
56. A Complete Example
JavaDStream and JavaPairDStream functions
construct JavaPairDStream from JavaDStream
count how many times each word of text occurs in a stream
values for each key are aggregated
create a tuple (key-value pairs)
Transformed DStream
57. A Complete Example
Printing results
Print the first ten elements
Start the execution of the streams
Wait for the execution to stop
58. Spark and Storm
A Comparison
59. Spark vs. Storm
                          | Spark                                 | Storm
Origin                    | UC Berkeley, 2009                     | Twitter
Implemented in            | Scala                                 | Clojure (Lisp-like)
Enterprise Support        | Yes                                   | No
Source Model              | Open Source                           | Open Source
Big Data Processing       | Batch and Stream                      | Stream
Processing Type           | processing in short interval batches  | real time
Latency                   | a few seconds                         | sub-second
Programming API           | Scala, Java, Python                   | Any PL
Guarantee Data Processing | Exactly once                          | At least once
Batch Processing          | Yes                                   | No
Coordination With         | ZooKeeper                             | ZooKeeper
64. Spark SQL
Allows relational queries expressed in SQL to be executed using Spark
Data Sources are in JavaSchemaRDDs
JavaSchemaRDD
new type of RDD
is similar to a table in a traditional relational database
are composed of Row objects along with a schema that describes it
can be created from an existing RDD, a JSON dataset, or …
66. Initializing - Creating JavaSQLContext
To create a basic JavaSQLContext, all you need is a JavaSparkContext
It must be based on an existing Spark context
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.api.java.JavaSQLContext;
...
JavaSparkContext sc = ...; // An existing JavaSparkContext.
JavaSQLContext sqlContext = new JavaSQLContext(sc);
67. SchemaRDD
SchemaRDD can be operated on
as normal RDDs
as a temporary table
allows you to run SQL queries over it
Converting RDDs into SchemaRDDs
Reflection based approach
Uses reflection to infer the schema of an RDD
More concise code
Works well when we know the schema while writing the application
Programmatic based approach
Construct a schema and then apply it to an existing RDD
More verbose
Allows SchemaRDDs to be constructed when the columns and types are not known until
runtime
68. JavaBean
Is just a standard (a convention)
Is a class that encapsulates many objects into a single object
All properties are private (accessed through getters and setters)
Has a public no-argument constructor
Implements Serializable
Lots of libraries depend on it
public static class Person implements Serializable {
private String name;
private int age;
public String getName() { return name; }
public void setName(String name) { this.name = name; }
public int getAge() { return age; }
public void setAge(int age) { this.age = age; }
}
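The reason the reflection-based approach needs JavaBeans is that the bean conventions make property names and types discoverable at runtime. The sketch below shows that discovery step with the standard java.beans API. It is not Spark's implementation: the class BeanSchemaSketch and the inferSchema method are hypothetical names, and the Person bean simply repeats the shape of the one on the slide.

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.io.Serializable;
import java.util.Map;
import java.util.TreeMap;

final class BeanSchemaSketch {
    // Same shape as the Person bean shown above.
    public static class Person implements Serializable {
        private String name;
        private int age;

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    // Returns property name -> property type for the given bean class, sorted by name.
    static Map<String, Class<?>> inferSchema(Class<?> beanClass) {
        Map<String, Class<?>> schema = new TreeMap<>();
        try {
            // Stop at Object.class so that getClass() is not reported as a property.
            PropertyDescriptor[] properties =
                Introspector.getBeanInfo(beanClass, Object.class).getPropertyDescriptors();
            for (PropertyDescriptor property : properties) {
                schema.put(property.getName(), property.getPropertyType());
            }
        } catch (IntrospectionException e) {
            throw new IllegalStateException("Cannot introspect " + beanClass, e);
        }
        return schema;
    }

    public static void main(String[] args) {
        // Prints the discovered column names with their Java types.
        System.out.println(inferSchema(Person.class));
    }
}
```

The getter and setter pairs are what the introspector keys on, which is why the bean conventions listed on the slide matter for schema inference.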
69. Reflection based - An Example
Load a text file like people.txt
Convert each line to a JavaBean
people now is an RDD of JavaBeans
JavaRDD<Person> people = sc.textFile("people.txt").map(
new Function<String, Person>() {
public Person call(String line) throws Exception {
String[] parts = line.split(",");
Person person = new Person();
person.setName(parts[0]);
person.setAge(Integer.parseInt(parts[1].trim()));
return person;
}
});
70. Reflection based - An Example
Apply a schema to an RDD of JavaBeans (people)
Register it as a temporary table
SQL can be run over RDDs that have been registered as tables
The result is a SchemaRDD and supports all the normal RDD operations
The columns of a row in the result can be accessed by ordinal
JavaSchemaRDD schemaPeople =
    sqlContext.applySchema(people, Person.class);
schemaPeople.registerTempTable("people");
JavaSchemaRDD teenagers = sqlContext.sql(
    "SELECT name FROM people WHERE age >= 13 AND age <= 19");
List<String> teenagerNames = teenagers.map(
    new Function<Row, String>() {
        public String call(Row row) {
            return "Name: " + row.getString(0);
        }
    }).collect();
71. Programmatic based
Used when JavaBean classes cannot be defined ahead of time
SchemaRDD can be created programmatically with three steps
Create an RDD of Rows from the original RDD
Create the schema represented by a StructType matching the structure of
Rows in the RDD created in Step 1.
Apply the schema to the RDD of Rows via applySchema method provided by
JavaSQLContext.
Example
The structure of records (schema) is encoded in a string
Load a text file as an RDD of strings
String schemaString = "name age";
JavaRDD<String> people =
sc.textFile("examples/src/main/resources/people.txt");
72. Programmatic based – An Example
Generate the schema based on the string of schema
Convert records of the RDD (people) to Rows
import org.apache.spark.sql.api.java.DataType;
import org.apache.spark.sql.api.java.StructField;
import org.apache.spark.sql.api.java.StructType;
...
List<StructField> fields = new ArrayList<StructField>();
for (String fieldName: schemaString.split(" ")) {
fields.add(DataType.createStructField(fieldName,
DataType.StringType, true));}
StructType schema = DataType.createStructType(fields);
import org.apache.spark.sql.api.java.Row;
...
JavaRDD<Row> rowRDD = people.map(
new Function<String, Row>() {
public Row call(String record) throws Exception {
String[] fields = record.split(",");
return Row.create(fields[0], fields[1].trim());
}
});
73. Programmatic based – An Example
Apply the schema to the RDD.
Register the SchemaRDD as a table.
SQL can be run over RDDs that have been registered as tables
The result is a SchemaRDD and supports all the normal RDD operations
The columns of a row in the result can be accessed by ordinal
JavaSchemaRDD peopleSchemaRDD =
sqlContext.applySchema(rowRDD, schema);
peopleSchemaRDD.registerTempTable("people");
JavaSchemaRDD results = sqlContext.sql("SELECT name FROM people");
List<String> names = results.map(
new Function<Row, String>() {
public String call(Row row) {
return "Name: " + row.getString(0);
}
}).collect();
74. JSON Datasets
Inferring the schema of a JSON dataset and load it to JavaSchemaRDD
Two methods in a JavaSQLContext
jsonFile() : loads data from a directory of JSON files where each line of the
files is a JSON object – but not regular multi-line JSON file
jsonRDD(): loads data from an existing RDD where each element of the RDD
is a string containing a JSON object
A JSON file can be like this:
JavaSchemaRDD people = sqlContext.jsonFile(path);
75. JSON Datasets
The inferred schema can be visualized using the printSchema()
The result is something like this:
Register this JavaSchemaRDD as a table
SQL statements can be run by using the sql methods
people.printSchema();
people.registerTempTable("people");
JavaSchemaRDD teenagers = sqlContext.sql(
"SELECT name FROM people WHERE age >= 13 AND age <= 19");
76. JSON Datasets
JavaSchemaRDD can be created for a JSON dataset represented by an
RDD[String] storing one JSON object per string
Arrays are native examples of RDDs
Register this JavaSchemaRDD as a table
SQL statements can be run by using the sql methods
List<String> jsonData = Arrays.asList(
    "{\"name\":\"Yin\",\"address\":{\"city\":\"Columbus\",\"state\":\"Ohio\"}}");
JavaRDD<String> anotherPeopleRDD = sc.parallelize(jsonData);
JavaSchemaRDD anotherPeople =
sqlContext.jsonRDD(anotherPeopleRDD);
people.registerTempTable("people");
JavaSchemaRDD teenagers = sqlContext.sql(
"SELECT name FROM people WHERE age >= 13 AND age <= 19");
77. Thrift JDBC/ODBC server
To start the JDBC/ODBC server:
$ ./sbin/start-thriftserver.sh
By default, the server listens on localhost:10000
We can use beeline to test the Thrift JDBC/ODBC server
$ ./bin/beeline
Connect to the JDBC/ODBC server in beeline with
beeline> !connect jdbc:hive2://localhost:10000
Beeline will ask for a username and password
Simply enter the username on your machine and a blank password
See existing databases:
0: jdbc:hive2://localhost:10000> SHOW DATABASES;
Create a database:
0: jdbc:hive2://localhost:10000> CREATE DATABASE DBTEST;
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program). Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager or Mesos/YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks for the executors to run.
Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.
The Spark project contains multiple closely-integrated components. At its core, Spark is a “computational engine” that is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines, or a computing cluster. Because the core engine of Spark is both fast and general-purpose, it powers multiple higher-level components specialized for various workloads, such as SQL or machine learning. These components are designed to interoperate closely, letting you combine them like libraries in a software project.
A philosophy of tight integration has several benefits. First, all libraries and higher-level components in the stack benefit from improvements at the lower layers. Second, the costs (deployment, maintenance, testing, support) associated with running the stack are minimized, because instead of running 5-10 independent software systems, an organization only needs to run one. Also, each time a new component is added to the Spark stack, every organization that uses Spark will immediately be able to try this new component.
Finally, one of the largest advantages of tight integration is the ability to build applications that seamlessly combine different processing models. For example, in Spark you can write one application that uses machine learning to classify data in real time as it is ingested from streaming sources. Simultaneously, analysts can query the resulting data, also in real time, via SQL, e.g. to join the data with unstructured log files.
Spark Streaming
Spark Streaming is a Spark component that enables processing live streams of data. Examples of data streams include log files generated by production web servers, or queues of messages containing status updates posted by users of a web service. Spark Streaming provides an API for manipulating data streams that closely matches the Spark Core’s RDD API, making it easy for programmers to learn the project and move between applications that manipulate data stored in memory, on disk, or arriving in real-time. Underneath its API, Spark Streaming was designed to provide the same degree of fault tolerance, throughput, and scalability that the Spark Core provides.
Spark SQL
Spark SQL provides support for interacting with Spark via SQL as well as the Apache Hive variant of SQL, called the Hive Query Language (HiveQL). Spark SQL represents database tables as Spark RDDs and translates SQL queries into Spark operations. Beyond providing the SQL interface to Spark, Spark SQL allows developers to intermix SQL queries with the programmatic data manipulations supported by RDDs in Python, Java and Scala, all within a single application. This tight integration with the rich and sophisticated computing environment provided by the rest of the Spark stack makes Spark SQL unlike any other open source data warehouse tool.
Provide distributed memory abstractions for clusters to support apps with working sets
Retain the attractive properties of MapReduce:
» Fault tolerance (for crashes & stragglers)
» Data locality
» Scalability
Solution: augment data flow model with “resilient distributed datasets” (RDDs)
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
Spark 1.2.0 works with Java 6 and higher. If you are using Java 8, Spark supports lambda expressions for concisely writing functions; otherwise, you can use the classes in the org.apache.spark.api.java.function package.
To write a Spark application in Java, you need to add a dependency on Spark. Spark is available through Maven Central at:
groupId = org.apache.spark
artifactId = spark-core_2.10
version = 1.2.0
In addition, if you wish to access an HDFS cluster, you need to add a dependency on hadoop-client for your version of HDFS. Some common HDFS version tags are listed on the third party distributions page.
groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>
Finally, you need to import some Spark classes into your program, typically org.apache.spark.api.java.JavaSparkContext, org.apache.spark.api.java.JavaRDD, and org.apache.spark.SparkConf.
A SparkContext class represents the connection to a Spark cluster and provides the entry point for interacting with Spark. We need to create a SparkContext instance so that we can interact with Spark and distribute our jobs.
master: A string specifying a Spark or Mesos cluster URL to connect to, or a special “local” string to run in local mode, as described below.
appName: A name for your application, which will be shown in the cluster web UI.
sparkHome: The path at which Spark is installed on your worker machines (it should be the same on all of them).
jars: A list of JAR files on the local machine containing your application’s code and any dependencies, which Spark will deploy to all the worker nodes. You’ll need to package your application into a set of JARs using your build system.
Alternatively, a context can be created through new SparkContext(conf), which takes a SparkConf object for more advanced configuration.
Most of the time, you would create a SparkConf object with new SparkConf(), which will load values from any spark.* Java system properties set in your application as well. In this case, parameters you set directly on the SparkConf object take priority over system properties.
For unit tests, you can also call new SparkConf(false) to skip loading external settings and get the same configuration no matter what the system properties are.
All setter methods in this class support chaining. For example, you can write new SparkConf().setMaster("local").setAppName("My app").
Note that once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.
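The precedence rule described above (explicit settings win over spark.* system properties, and new SparkConf(false) ignores system properties) can be pictured with a small plain-Java model. This is only a sketch: ConfSketch is a hypothetical class that does not depend on Spark and is not how SparkConf is implemented.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for SparkConf, used only to illustrate precedence.
final class ConfSketch {
    private final Map<String, String> settings = new HashMap<>();

    // When loadDefaults is true, copy every "spark.*" system property.
    ConfSketch(boolean loadDefaults) {
        if (loadDefaults) {
            for (String name : System.getProperties().stringPropertyNames()) {
                if (name.startsWith("spark.")) {
                    settings.put(name, System.getProperty(name));
                }
            }
        }
    }

    // Explicit settings overwrite anything loaded from system properties.
    // Returning "this" is what makes setter chaining possible.
    ConfSketch set(String key, String value) {
        settings.put(key, value);
        return this;
    }

    String get(String key, String defaultValue) {
        return settings.getOrDefault(key, defaultValue);
    }

    public static void main(String[] args) {
        System.setProperty("spark.app.name", "from-system-property");
        ConfSketch conf = new ConfSketch(true).set("spark.app.name", "My app");
        System.out.println(conf.get("spark.app.name", "unset"));
    }
}
```

Passing false to the constructor gives the same configuration regardless of system properties, which is the behaviour the notes recommend for unit tests.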
The Spark context provides a function called parallelize; this takes a Scala collection and turns it into an RDD that is of the same type as the data input.
The simplest method for loading external data is loading text from a file. This requires the file to be available on all the nodes in the cluster, which isn't much of a problem in local mode. In distributed mode, you will want to use Spark's addFile functionality to copy the file to all the machines in your cluster.
One important parameter for parallel collections is the number of partitions to cut the dataset into. Spark will run one task for each partition of the dataset. Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).
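To make the idea of cutting a collection into partitions concrete, here is a minimal plain-Java sketch that splits a list into a requested number of contiguous, near-equal slices. The class and method names are hypothetical, and the slicing rule is an assumption chosen for illustration, not a statement of how Spark divides data internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PartitionSketch {
    // Cut data into numSlices contiguous slices whose sizes differ by at most one.
    static <T> List<List<T>> slice(List<T> data, int numSlices) {
        if (numSlices < 1) {
            throw new IllegalArgumentException("numSlices must be >= 1");
        }
        List<List<T>> slices = new ArrayList<>();
        int n = data.size();
        for (int i = 0; i < numSlices; i++) {
            // Integer arithmetic keeps the slice boundaries evenly spread.
            int start = (int) ((long) i * n / numSlices);
            int end = (int) ((long) (i + 1) * n / numSlices);
            slices.add(new ArrayList<>(data.subList(start, end)));
        }
        return slices;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        // Each inner list would be processed by one task.
        System.out.println(slice(data, 3));
    }
}
```

With one task per slice, a larger number of slices gives the scheduler more, smaller units of work to spread across CPUs.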
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new distributed dataset representing the results. On the other hand, reduce is an action that aggregates all the elements of the dataset using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently.
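For example, to produce a new RDD in which one is added to every number, use rdd.map(x => x+1) in Scala, or rdd.map(x -> x + 1) with a Java 8 lambda.
A second example is to sum all the elements, which can be done with the reduce action, e.g. rdd.reduce((a, b) -> a + b).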
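Lazy evaluation can be observed without a cluster, because java.util.stream pipelines are lazy in a similar way. The sketch below is only an analogy and does not use the Spark API: map plays the role of a transformation, reduce plays the role of an action, and a counter records when the mapping function actually runs. The class name is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

final class LazySketch {
    // Returns { calls before the action, calls after the action, sum }.
    static int[] demo() {
        AtomicInteger calls = new AtomicInteger();
        List<Integer> base = Arrays.asList(1, 2, 3);

        // "Transformation": add one to every element. Nothing runs yet;
        // the pipeline only remembers what has to be done.
        Stream<Integer> plusOne = base.stream().map(x -> {
            calls.incrementAndGet();
            return x + 1;
        });
        int callsBeforeAction = calls.get();

        // "Action": summing forces the pipeline to execute and returns a value.
        int sum = plusOne.reduce(0, Integer::sum);

        return new int[] { callsBeforeAction, calls.get(), sum };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo()));
    }
}
```

The mapping function is not called at all until the reduction asks for a result, which mirrors the way Spark defers transformations until an action needs them.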
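To illustrate RDD basics, consider a simple three-line program: lines = sc.textFile(path), then lineLengths = lines.map(s -> s.length()), then totalLength = lineLengths.reduce((a, b) -> a + b).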
The first line defines a base RDD from an external file. This dataset is not loaded in memory or otherwise acted on: lines is merely a pointer to the file.
The second line defines lineLengths as the result of a map transformation. Again, lineLengths is not immediately computed, due to laziness.
Finally, we run reduce, which is an action. At this point Spark breaks the computation into tasks to run on separate machines, and each machine runs both its part of the map and a local reduction, returning only its answer to the driver program.
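The split between per-machine work and the final combination at the driver can be sketched in plain Java. In this illustration each inner list stands in for one partition of the lines dataset; the names are hypothetical and no Spark classes are involved.

```java
import java.util.Arrays;
import java.util.List;

final class LineLengthSketch {
    // Each inner list models the lines held by one partition.
    static int totalLength(List<List<String>> partitions) {
        return partitions.stream()
            // Work done where the data lives: map each line to its length,
            // then reduce locally to a single partial sum per partition.
            .map(p -> p.stream().map(String::length).reduce(0, (a, b) -> a + b))
            // Work done at the "driver": combine the partial sums.
            .reduce(0, (a, b) -> a + b);
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
            Arrays.asList("to be", "or not"),
            Arrays.asList("to be"));
        System.out.println(totalLength(partitions));
    }
}
```

Only one number per partition travels to the final reduction, which is why the local reduction keeps the amount of data returned to the driver small.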
Spark’s API relies heavily on passing functions in the driver program to run on the cluster. In Java, functions are represented by classes implementing the interfaces in the org.apache.spark.api.java.function package. There are two ways to create such functions:
Implement the Function interfaces in your own class, either as an anonymous inner class or a named one, and pass an instance of it to Spark.
In Java 8, use lambda expressions to concisely define an implementation.
While much of this guide uses lambda syntax for conciseness, it is easy to use all the same APIs in long-form. For example, we could have written our code above as follows:
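The difference between the long form and the lambda form can be shown with the standard java.util.function interfaces, used here as stand-ins for the interfaces in org.apache.spark.api.java.function. This is a sketch with hypothetical class names, not the original slide code.

```java
import java.util.function.BinaryOperator;
import java.util.function.Function;

final class FunctionStyles {
    // Long form 1: a named class implementing the function interface.
    static final class GetLength implements Function<String, Integer> {
        @Override
        public Integer apply(String s) {
            return s.length();
        }
    }

    // Long form 2: an anonymous inner class.
    static final BinaryOperator<Integer> SUM_ANONYMOUS = new BinaryOperator<Integer>() {
        @Override
        public Integer apply(Integer a, Integer b) {
            return a + b;
        }
    };

    // Java 8 form: a lambda expression with the same behaviour.
    static final BinaryOperator<Integer> SUM_LAMBDA = (a, b) -> a + b;

    public static void main(String[] args) {
        System.out.println(new GetLength().apply("spark"));
        System.out.println(SUM_ANONYMOUS.apply(2, 3).equals(SUM_LAMBDA.apply(2, 3)));
    }
}
```

All three forms produce ordinary objects that can be passed wherever the interface is expected; the lambda is simply the shortest way to write one.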
While most Spark operations work on RDDs containing any type of objects, a few special operations are only available on RDDs of key-value pairs. The most common ones are distributed “shuffle” operations, such as grouping or aggregating the elements by a key.
In Java, key-value pairs are represented using the scala.Tuple2 class from the Scala standard library. You can simply call new Tuple2(a, b) to create a tuple, and access its fields later with tuple._1() and tuple._2().
RDDs of key-value pairs are represented by the JavaPairRDD class. You can construct JavaPairRDDs from JavaRDDs using special versions of the map operations, like mapToPair and flatMapToPair. The JavaPairRDD will have both standard RDD functions and special key-value ones.
For example, the following code uses the reduceByKey operation on key-value pairs to count how many times each line of text occurs in a file:
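The shape of that computation (turn each line into a (line, 1) pair, then merge the values of equal keys) can be sketched in plain Java. The sketch assumes java.util.AbstractMap.SimpleEntry as a stand-in for scala.Tuple2 and Collectors.toMap as a stand-in for reduceByKey; it runs locally and is not the Spark code from the slide.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class LineCountSketch {
    static Map<String, Integer> countLines(List<String> lines) {
        return lines.stream()
            // Like mapToPair: pair every line with the count 1.
            .map(line -> new SimpleEntry<String, Integer>(line, 1))
            // Like reduceByKey: values that share a key are merged with (a, b) -> a + b.
            .collect(Collectors.toMap(
                e -> e.getKey(),
                e -> e.getValue(),
                (a, b) -> a + b,
                TreeMap::new));
    }

    public static void main(String[] args) {
        System.out.println(countLines(Arrays.asList("a", "b", "a")));
    }
}
```

The merge function only ever sees two values for the same key, which is the property that lets a distributed engine apply it piecewise on different machines.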
The flatMap function is a useful utility, which lets you write a function that returns an Iterable object of the type you want and then flattens the results. A simple example of this is a case where you want to parse all the data, but may fail to parse some of it. The flatMap function can be used to output an empty list if parsing failed, or a list containing the parsed value if it succeeded.
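The parse-or-skip pattern can be tried locally with java.util.stream, whose flatMap behaves analogously. The following sketch uses hypothetical names and integer parsing as the example; it does not use the Spark API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class FlatMapSketch {
    // One element on success, no elements on failure.
    static Stream<Integer> tryParse(String s) {
        try {
            return Stream.of(Integer.parseInt(s.trim()));
        } catch (NumberFormatException e) {
            return Stream.empty();
        }
    }

    // flatMap concatenates the per-element results, so failures simply vanish.
    static List<Integer> parseAll(List<String> raw) {
        return raw.stream()
            .flatMap(FlatMapSketch::tryParse)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(parseAll(Arrays.asList("1", "oops", " 3")));
    }
}
```

Because each input may contribute zero or one output, the result can be shorter than the input, which a plain map cannot express.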
In addition to the reduce function, there is a corresponding reduceByKey function that works on RDDs of key-value pairs and produces another RDD. Unlike when using map on a list in Scala, your function will run on a number of different machines, so you can't depend on shared state.
flatMap is a combination of map and flatten: it first runs map on the sequence and then flattens the results.
The spark-submit script in Spark’s bin directory is used to launch applications on a cluster. It can use all of Spark’s supported cluster managers through a uniform interface so you don’t have to configure your application especially for each one.
If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To create an assembly jar containing your code and its dependencies, both sbt and maven can be used.
--class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
--master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
--conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap “key=value” in quotes.
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
application-arguments: Arguments passed to the main method of your main class, if any
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s machine learning and graph processing algorithms on data streams.
Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
Any operation applied on a DStream translates to operations on the underlying RDDs. For example, when converting a stream of lines to words, the flatMap operation is applied on each RDD in the lines DStream to generate the RDDs of the words DStream.
For a Spark Streaming application running on a cluster to be stable, the system should be able to process data as fast as it is being received. In other words, batches of data should be processed as fast as they are being generated. Whether this is true for an application can be found by monitoring the processing times in the streaming web UI, where the batch processing time should be less than the batch interval.
Depending on the nature of the streaming computation, the batch interval used may have significant impact on the data rates that can be sustained by the application on a fixed set of cluster resources. For example, let us consider the earlier WordCountNetwork example. For a particular data rate, the system may be able to keep up with reporting word counts every 2 seconds (i.e., batch interval of 2 seconds), but not every 500 milliseconds. So the batch interval needs to be set such that the expected data rate in production can be sustained.
A good approach to finding the right batch size for your application is to test it with a conservative batch interval (say, 5-10 seconds) and a low data rate. To verify whether the system is able to keep up with the data rate, check the end-to-end delay experienced by each processed batch (either look for “Total delay” in the Spark driver log4j logs, or use the StreamingListener interface). If the delay stays comparable to the batch size, the system is stable. If the delay is continuously increasing, the system is unable to keep up and is therefore unstable. Once you have an idea of a stable configuration, you can try increasing the data rate and/or reducing the batch size. Note that a momentary increase in the delay due to a temporary data rate increase may be fine as long as the delay falls back to a low value (i.e., less than the batch size).
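The stability rule above can be turned into a rough check over a series of observed delays. The sketch below is a hypothetical heuristic written for illustration: it assumes the "Total delay" values have already been collected (for example from the driver logs) and flags a run as unstable only when the delay grew at every step and ended above the batch interval. The class name, method name, and threshold rule are assumptions, not part of Spark.

```java
final class BatchDelaySketch {
    // totalDelaysMs: end-to-end delay of consecutive batches, oldest first.
    // batchIntervalMs: the configured batch interval.
    static boolean looksUnstable(long[] totalDelaysMs, long batchIntervalMs) {
        if (totalDelaysMs.length < 2) {
            return false; // not enough samples to see a trend
        }
        for (int i = 1; i < totalDelaysMs.length; i++) {
            if (totalDelaysMs[i] <= totalDelaysMs[i - 1]) {
                return false; // the delay recovered at least once
            }
        }
        // Continuously increasing and already above the batch interval.
        return totalDelaysMs[totalDelaysMs.length - 1] > batchIntervalMs;
    }

    public static void main(String[] args) {
        long batchIntervalMs = 2000;
        System.out.println(looksUnstable(new long[] {1800, 2100, 2600, 3400}, batchIntervalMs));
        System.out.println(looksUnstable(new long[] {1900, 2600, 1700, 1800}, batchIntervalMs));
    }
}
```

The second series contains a temporary spike that falls back below the batch interval, which the notes describe as acceptable, so it is not flagged.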
Spark Streaming will monitor the directory dataDirectory and process any files created in that directory (files written in nested directories not supported).
For simple text files, there is an easier method: streamingContext.textFileStream(dataDirectory). File streams do not require running a receiver, and hence do not require allocating cores.
DStreams can be created with data streams received through Akka actors by using streamingContext.actorStream.
Actors are basically concurrent processes that communicate by exchanging messages.
This category of sources requires interfacing with external non-Spark libraries, some of them with complex dependencies (e.g., Kafka and Flume). Hence, to minimize issues related to version conflicts of dependencies, the functionality to create DStreams from these sources has been moved to separate libraries that can be linked to explicitly when necessary. For example, if you want to create a DStream using data from Twitter’s stream of tweets, you have to do the following.
Linking: Add the artifact spark-streaming-twitter_2.10 to the SBT/Maven project dependencies.
Programming: Import the TwitterUtils class and create a DStream with TwitterUtils.createStream.
Deploying: Generate an uber JAR with all the dependencies (including the dependency spark-streaming-twitter_2.10 and its transitive dependencies) and then deploy the application. This is further explained in the Deploying section.
Input DStreams can also be created out of custom data sources. All you have to do is implement a user-defined receiver (see next section to understand what that is) that can receive data from the custom sources and push it into Spark. See the Custom Receiver Guide for details.
One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.
You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
In addition, each persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), replicate it across nodes, or store it off-heap in Tachyon. These levels are set by passing a StorageLevel object to persist(). The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory).
The updateStateByKey operation allows you to maintain arbitrary state while continuously updating it with new information. To use this, you will have to do two steps.
Define the state - The state can be of arbitrary data type.
Define the state update function - Specify with a function how to update the state using the previous state and the new values from input stream.
Let’s illustrate this with an example. Say you want to maintain a running count of each word seen in a text data stream. Here, the running count is the state and it is an integer. The update function adds the new values for a word to its previous count and returns the result as the new state.
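A minimal sketch of such an update function is shown here in plain Java. It assumes java.util.Optional as a stand-in for the Optional type used by the Spark 1.2 Java API, and the class and method names are hypothetical; the logic is the part that carries over.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class RunningCountSketch {
    // newValues: the counts seen for one word in the current batch.
    // runningCount: the previous state, empty the first time the word appears.
    static Optional<Integer> updateCount(List<Integer> newValues, Optional<Integer> runningCount) {
        int sum = runningCount.orElse(0);
        for (int v : newValues) {
            sum += v;
        }
        return Optional.of(sum);
    }

    public static void main(String[] args) {
        // Batch 1: the word appears three times and has no previous state.
        Optional<Integer> state = updateCount(Arrays.asList(1, 1, 1), Optional.empty());
        // Batch 2: the word appears once more.
        state = updateCount(Arrays.asList(1), state);
        System.out.println(state.get());
    }
}
```

The function is applied once per key per batch, so the state for each word evolves independently of the others.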
Because stateful operations depend on previous batches of data, they continuously accumulate metadata over time. To clear this metadata, Spark Streaming supports periodic checkpointing by saving intermediate data to HDFS. Note that checkpointing also incurs the cost of saving to HDFS, which may cause the corresponding batch to take longer to process. Hence, the checkpoint interval needs to be set carefully. At small batch sizes (say, 1 second), checkpointing every batch may significantly reduce operation throughput. Conversely, checkpointing too infrequently causes the lineage and task sizes to grow, which may have detrimental effects. Typically, a checkpoint interval of 5-10 times the sliding interval of a DStream is a good setting to try.
Metadata checkpointing - Saving of the information defining the streaming computation to fault-tolerant storage like HDFS. This is used to recover from failure of the node running the driver of the streaming application (discussed in detail later). Metadata includes:
Configuration - The configuration that was used to create the streaming application.
DStream operations - The set of DStream operations that define the streaming application.
Incomplete batches - Batches whose jobs are queued but have not completed yet.
Data checkpointing - Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such an unbounded increase in recovery time (proportional to the dependency chain), intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (e.g. HDFS) to cut off the dependency chains.
Spark SQL supports two different methods for converting existing RDDs into SchemaRDDs. The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection based approach leads to more concise code and works well when you already know the schema while writing your Spark application.
The second method for creating SchemaRDDs is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct SchemaRDDs when the columns and their types are not known until runtime.
Spark SQL supports automatically converting an RDD of JavaBeans into a SchemaRDD. JavaBeans are classes that encapsulate many objects into a single object (the bean). The BeanInfo, obtained using reflection, defines the schema of the table. Currently, Spark SQL does not support JavaBeans that are nested or contain complex types such as Lists or Arrays. You can create a JavaBean by creating a class that implements Serializable and has getters and setters for all of its fields.
When JavaBean classes cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a SchemaRDD can be created programmatically with three steps.
Create an RDD of Rows from the original RDD;
Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.
Apply the schema to the RDD of Rows via the applySchema method provided by JavaSQLContext.
Spark SQL can automatically infer the schema of a JSON dataset and load it as a JavaSchemaRDD. This conversion can be done using one of two methods in a JavaSQLContext:
jsonFile - loads data from a directory of JSON files where each line of the files is a JSON object.
jsonRDD - loads data from an existing RDD where each element of the RDD is a string containing a JSON object.
Note that the file that is offered as jsonFile is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.