Spark
Majid Hajibaba
Outline
 An Overview on Spark
 Spark Programming Guide
 An Example on Spark
 Running Applications on Spark
 Spark Streaming
 Spark Streaming Programming Guide
 An Example on Spark Streaming
 Spark and Storm: A Comparison
 Spark SQL
An Overview
Cluster Mode Overview
 Spark applications run as independent sets of processes on a cluster
 Executor processes run tasks in multiple threads
 The driver should run close to the workers
 For remote operation, open an RPC to the driver and have it submit operations from nearby, rather than running the driver far from the workers
• Cluster managers (coordinators): Standalone, Mesos, YARN
http://spark.apache.org/docs/1.0.1/cluster-overview.html
Spark - A Unified Stack
 The core is a “computational engine” responsible for scheduling, distributing, and monitoring applications in a cluster
 Higher-level components (Shark, GraphX, Streaming, …) are like libraries in a software project
 Tight integration has several benefits
 simple improvements, minimized costs, combined processing models
Spark Processing Model
 In-memory, iterative MapReduce (in contrast with the classic MapReduce processing model)
Spark Goal
 Provide distributed memory abstractions for clusters to support apps
with working sets
 Retain the attractive properties of MapReduce:
 Fault tolerance
 Data locality
 Scalability
 Solution: augment data flow model with “resilient distributed datasets”
(RDDs)
Resilient Distributed Datasets (RDDs)
 Immutable collection of elements that can be operated on in parallel
 Created by transforming data using data flow operators (e.g. map)
 Parallel operations on RDDs
 Benefits
 Consistency is easy
 due to immutability
 Inexpensive fault tolerance
 log lineage rather than replicating or checkpointing the data
 Locality-aware scheduling of tasks on partitions
 Applicable to a broad variety of applications
RDDs
 An immutable collection of objects, partitioned and distributed across the cluster
Spark Programming Guide
Linking with Spark
 Spark 1.2.0 works with Java 6 and higher
 To write a Spark application in Java, you need to add a dependency on
Spark. Spark is available through Maven Central at:
 Importing Spark classes into the program:
groupId = org.apache.spark
artifactId = spark-core_2.10
version = 1.2.0
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.SparkConf;
Initializing Spark - Creating a SparkContext
 Tells Spark how to access a cluster
 The entry point: the first thing a Spark program must create
 This is done through the following constructor:
 Example:
 Or through SparkConf for advanced configuration
new SparkContext(master, appName, [sparkHome], [jars])
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext ctx = new JavaSparkContext("master_url", "application name",
    ["path_to_spark_home", "path_to_jars"]);
SparkConf
 Configuration for a Spark application
 Sets various Spark parameters as key-value pairs
 SparkConf object contains information about the application
 The constructor will load values from any spark.* Java system properties set in the application
 Example
import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf sparkConf = new SparkConf().setAppName("application name");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
Loading data into an RDD
 Spark's primary unit for data representation
 Allows for easy parallel operations on the data
 Native collections in Java can serve as the basis for an RDD
 the number of partitions can be set manually by passing it as a second parameter to parallelize (e.g. ctx.parallelize(data, 10)); see the sketch after the code below
 To load external data from a file, you can use the textFile method of SparkContext:
 textFile(path: String, minSplits: Int)
 path: the path of the text file
 minSplits: minimum number of splits (partitions) for Hadoop RDDs
 The result is an RDD of strings, with each line of the file being a separate element in the RDD
import org.apache.spark.api.java.JavaRDD;
JavaRDD<Integer> dataRDD = ctx.parallelize(Arrays.asList(1,2,4));
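For example, a sketch passing the partition count explicitly (the values and variable names are illustrative):

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;

// Distribute the collection over 10 partitions instead of the default
JavaRDD<Integer> distData = ctx.parallelize(Arrays.asList(1, 2, 3, 4, 5), 10);
int sum = distData.reduce((a, b) -> a + b);  // 15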
textFile method
 Read a text file and return it as an RDD of Strings
 The file can be taken from:
 a local file system (must be available on all nodes in distributed mode)
 HDFS
 any Hadoop-supported file system URI
import org.apache.spark.api.java.JavaRDD;

JavaRDD<String> lines = ctx.textFile("file_path", 1);

import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaRDD;
...
ctx.addFile("file_path");
JavaRDD<String> lines = ctx.textFile(SparkFiles.get("file_path"));

import org.apache.spark.api.java.JavaRDD;
...
JavaRDD<String> lines = ctx.textFile("hdfs://...");
Manipulating RDD
 Transformations: to create a new dataset from an existing one
 map: works on each individual element in the input RDD and produces a new
output element
 Transformation functions do not transform the existing elements, rather they
return a new RDD with the new elements
 Actions: to return a value to the driver program after running a computation on the dataset
 reduce: operates on pairs of elements to aggregate all the data elements of the dataset
import org.apache.spark.api.java.function.Function;
rdd.map(new Function<Integer, Integer>() {
public Integer call(Integer x) { return x+1;}
});
import org.apache.spark.api.java.function.Function2;
rdd.reduce(new Function2<Integer, Integer, Integer>() {
public Integer call(Integer x, Integer y) { return x+y;}
});
RDD Basics
 A simple program
 This dataset is not loaded in memory
 lines is merely a pointer to the file
 lineLengths is not immediately computed
 Spark breaks the computation into tasks to run on separate machines
 Each machine runs both its part of the map and a local reduction, returning only its answer to the driver program
 To use lineLengths again later, we could add the following (the last line below) before the reduce:
 This would cause lineLengths to be saved in memory after the first time it is computed.
JavaRDD<String> lines = ctx.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);
lineLengths.persist();
Passing Functions to Spark
 Functions are represented by classes implementing the interfaces in the org.apache.spark.api.java.function package
 Two ways to create such functions:
1. Use lambda expressions to concisely define an implementation (in Java 8):

JavaRDD<String> lines = ctx.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
int totalLength = lineLengths.reduce((a, b) -> a + b);

2. Implement the Function interfaces, either as an anonymous inner class or in your own named class, and pass an instance to Spark:

JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(new Function<String, Integer>() {
  public Integer call(String s) { return s.length(); }
});
int totalLength = lineLengths.reduce(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});

Or with named classes:

class GetLength implements Function<String, Integer> {
  public Integer call(String s) { return s.length(); }
}
class Sum implements Function2<Integer, Integer, Integer> {
  public Integer call(Integer a, Integer b) { return a + b; }
}

JavaRDD<String> lines = sc.textFile("data.txt");
JavaRDD<Integer> lineLengths = lines.map(new GetLength());
int totalLength = lineLengths.reduce(new Sum());
Working with Key-Value Pairs
 key-value pairs are represented using the scala.Tuple2 class
 call new Tuple2(a, b) to create a tuple
 access its fields with tuple._1() and tuple._2()
 RDDs of key-value pairs
 distributed “shuffle” operations (e.g. grouping or aggregating the elements
by a key)
 Represented by the JavaPairRDD class
 JavaPairRDDs can be constructed from JavaRDDs using special versions of the map operations (mapToPair, flatMapToPair)
 A JavaPairRDD has both the standard RDD functions and special key-value functions such as:
 reduceByKey
 sortByKey
import scala.Tuple2;
...
Tuple2<String, String> tuple = new Tuple2("foo", "bar");
System.out.println(tuple._1() + " " + tuple._2());
Working with Key-Value Pairs
 reduceByKey example
 to count how many times each line of text occurs in a file
 sortByKey example
 to sort the pairs alphabetically
 and collect() to bring them back to the driver program as an array of objects
import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
...
JavaRDD<String> lines = ctx.textFile("data.txt");
JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2(s, 1));
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);
...
counts.sortByKey();
counts.collect();
flatMap
 flatMap is a combination of map and flatten
 returns a sequence for each input element rather than a single item, then flattens the results
 Use case: parsing all the data even when some records fail to parse (failed records can simply yield empty sequences); see the sketch below
http://www.slideshare.net/frodriguezolivera/apache-spark-streaming
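A minimal sketch of the flatMap usage described above, assuming a JavaRDD<String> of text lines named lines (Spark 1.x Java API with Java 8 lambdas):

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;

// Each line yields a sequence of words; flatMap flattens the sequences
// into a single RDD of words
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")));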
RDD Operations
An Example
Counting Words
A Complete Example - Word Counter Program
 Package and classes: import the needed classes; the package name will be passed to the Spark submitter
 Main class: create a SparkConf (the application name will be passed to the Spark submitter) and a SparkContext, then load data into a base RDD
 JavaRDD and JavaPairRDD functions: construct a JavaPairRDD from the JavaRDD by creating tuples (key-value pairs), then aggregate the values for each key to count how many times each word of the text occurs in the file (a transformed RDD)
 Printing results: collect the counts (an action) and access the tuples
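Since the slides show the code itself only as screenshots, here is a minimal sketch of such a word counter, assuming the Spark 1.2 Java API with Java 8 lambdas (the class and package names match the ones used in the spark-submit example later; variable names are illustrative):

package org.apache.spark.examples;

import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public final class JavaWordCount {
  public static void main(String[] args) {
    // The application name is what identifies the submitted job
    SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);

    // Base RDD: one element per line of the input file (path given as first argument)
    JavaRDD<String> lines = ctx.textFile(args[0], 1);

    // Transformations: split lines into words, create (word, 1) tuples,
    // then aggregate the values for each key
    JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(s.split(" ")));
    JavaPairRDD<String, Integer> ones = words.mapToPair(s -> new Tuple2<String, Integer>(s, 1));
    JavaPairRDD<String, Integer> counts = ones.reduceByKey((a, b) -> a + b);

    // Action: collect the results to the driver and print the tuples
    for (Tuple2<String, Integer> tuple : counts.collect()) {
      System.out.println(tuple._1() + ": " + tuple._2());
    }
    ctx.stop();
  }
}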
Spark Execution Model
 Iteration 1: output = count.collect();
 Iteration 2: output = count.reduce(func);
Running Applications on Spark
Building Application
 Directory layout:
./src
./src/main
./src/main/java
./src/main/java/app.java
 With Maven ($ mvn package) - pom.xml:
<project>
  <artifactId>word-counter</artifactId>
  <name>Word Counter</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.2.0</version>
    </dependency>
  </dependencies>
</project>
 With sbt ($ sbt package) - name.sbt:
name := "Word Counter"
organization := "org.apache.spark"
version := "1.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"
Submitting Application
 Starting Spark (Master and Slaves)
$ ./sbin/start-all.sh
 Submitting a job - submission syntax:
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
 Example:
$ sudo ./bin/spark-submit \
  --class "org.apache.spark.examples.JavaWordCount" \
  --master spark://127.0.0.1:7077 \
  test/target/word-counter-1.0.jar /var/log/syslog
Spark Streaming
Overview
 Data can be ingested from many sources like Kafka, Flume, Twitter,
ZeroMQ, Kinesis or TCP sockets
 Data can be processed using complex algorithms expressed with high-
level functions like map, reduce, join and window
 Processed data can be pushed out to filesystems, databases, and live
dashboards
 Potential for combining batch processing and streaming processing in
the same system
 you can apply Spark’s machine learning and graph processing algorithms on
data streams
Spark Streaming - How It Works
 Run a streaming computation as a series of very small, deterministic batch jobs
 Chop up the live stream into batches of X seconds
 Spark treats each batch of data as an RDD and processes it using RDD operations
 Finally, the processed results of the RDD operations are returned in batches
 Batch sizes as low as half a second, with latency of about 1 second
DStreams (Discretized Streams)
 represent a continuous stream of data
 are represented internally as a sequence of RDDs
 can be created from
 input data streams from sources such as Kafka, Flume, and Kinesis
 high-level operations applied on other DStreams
 Example: lines to words (see the sketch below)
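A minimal sketch of the lines-to-words example, assuming lines is a DStream of text lines (as created by socketTextStream in the running example that follows):

import java.util.Arrays;
import org.apache.spark.streaming.api.java.JavaDStream;
...
// Each line in the lines DStream is split into words,
// producing a new DStream of words
JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(x.split(" ")));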
Running Example - JavaNetworkWordCount
 You will first need to run Netcat as a data server (Spark must already be installed):
 Then, in a different terminal, start the example:
 Any lines typed in the terminal running the Netcat server will then be counted and printed on screen every second
$ nc -lk 9999
$ ./bin/run-example streaming.JavaNetworkWordCount localhost 9999
Spark Streaming Programming Guide
Linking with Spark
 Same as for Spark batch processing
 Spark 1.2.0 works with Java 6 and higher
 To write a Spark application in Java, you need to add a dependency on
Spark.
 add the following dependency to your Maven project.
 add the following dependency to your SBT project.
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.10</artifactId>
  <version>1.2.0</version>
</dependency>

libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.2.0"
Initializing – Creating StreamingContext
 Similar to creating a SparkContext
 Using constructor
 The batchDuration is the size of the batches
 the time interval at which streaming data will be divided into batches
 can be created from a SparkConf object
 can also be created from an existing JavaSparkContext
new StreamingContext(master, appName, batchDuration, [sparkHome], [jars])

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(1000));
...
JavaSparkContext ctx = ... // existing JavaSparkContext
JavaStreamingContext ssc =
    new JavaStreamingContext(ctx, Durations.seconds(1));
Setting the Right Batch Size
 Batches of data should be processed as fast as they are being generated
 The batch interval can have a significant impact on the data rates the application can sustain
 To figure out the right batch size for an application:
 test it with a conservative batch interval (5-10 seconds) and a low data rate
 If the system is stable (the delay stays comparable to the batch size):
 increase the data rate and/or reduce the batch size
 If the system is unstable (the delay is continuously increasing):
 go back to the previous stable batch size
Input DStreams and Receivers
 Every input DStream is associated with a Receiver (except file streams)
 Receiver
 receives the data from a source and
 stores it in memory for processing
 Spark Streaming provides two categories of built-in streaming sources.
 Basic sources
 like file systems, socket connections, and Akka actors
 directly available in the StreamingContext API
 Advanced sources
 like Kafka, Flume, Kinesis, Twitter, etc.
 are available through extra utility classes
 Custom sources
Basic Sources
 File Streams
 will monitor the directory dataDirectory and process any files created in that directory
 For simple text files
 Socket Streams
 Custom Actors
 Actors are concurrent processes that communicate by exchanging messages
 Queue of RDDs
 Each RDD into the queue will be treated as a batch of data in the DStream, and
processed like a stream
streamingContext.fileStream<KeyClass, ValueClass, InputFormatClass>(dataDirectory);
streamingContext.textFileStream(dataDirectory)
streamingContext.actorStream(actorProps, actor-name)
streamingContext.queueStream(queueOfRDDs)
streamingContext.socketStream(String hostname, int port, Function converter, StorageLevel storageLevel)
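For example, a minimal sketch of a simple text file stream (the directory path and variable names are illustrative):

import org.apache.spark.streaming.api.java.JavaDStream;
...
// New text files created in the monitored directory become records of this DStream
JavaDStream<String> logLines = streamingContext.textFileStream("hdfs://namenode:8020/incoming/logs");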
Advanced Sources
 require interfacing with external non-Spark libraries
 Twitter
 Linking: Add the artifact spark-streaming-twitter_2.10 to the SBT/Maven
 Programming: Import the TwitterUtils class and create a DStream with
TwitterUtils.createStream as shown below
 Deploying: Generate an uber JAR with all the dependencies (including the
dependency spark-streaming-twitter_2.10 and its transitive dependencies) and
then deploy the application. This is further explained in the Deploying section.
 Flume
 Kafka
 Kinesis
import org.apache.spark.streaming.twitter.*;
TwitterUtils.createStream(jssc);
Custom Sources
 implement a user-defined receiver
Socket Text Stream
 Creates an input stream from a network source hostname:port
 Data is received using a TCP socket
 Received bytes are interpreted as UTF-8 encoded, newline-delimited lines
 A storage level can be specified for storing the received objects
socketTextStream(String hostname, int port);

import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.api.java.StorageLevels;
...
ssc.socketTextStream("localhost", 9999, StorageLevels.MEMORY_AND_DISK_SER);

socketTextStream(String hostname, int port, StorageLevel storageLevel)
Class ReceiverInputDStream
 Abstract class for defining any InputDStream
 Start a receiver on worker nodes to receive external data
 JavaReceiverInputDStream
 a Java-friendly interface to ReceiverInputDStream
 defines input streams received over the network
 Example:
 Creates a DStream from text data received over a TCP socket connection
import org.apache.spark.api.java.StorageLevels;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
...
JavaReceiverInputDStream<String> lines =
    ssc.socketTextStream("localhost", 9999, StorageLevels.MEMORY_ONLY);
Output Operations on DStreams
 Allow a DStream’s data to be pushed out to external systems
 Trigger the actual execution of all the DStream transformations
 Similar to actions for RDDs
 print(): prints the first ten elements of every batch of data in a DStream, on the driver node running the streaming application
 saveAsTextFiles(prefix, [suffix]): saves the DStream's contents as text files; the file name at each batch interval is generated from prefix and suffix
 saveAsObjectFiles(prefix, [suffix]): saves the DStream's contents as SequenceFiles of serialized Java objects
 saveAsHadoopFiles(prefix, [suffix]): saves the DStream's contents as Hadoop files
 foreachRDD(func): applies a function to each RDD generated from the stream; the function should push the data in each RDD to an external system (e.g. saving the RDD to files, or writing it over the network to a database) and is executed in the driver process running the streaming application
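A minimal sketch of foreachRDD, assuming a JavaPairDStream<String, Integer> named wordCounts as in the streaming example later in this deck (Java 8 lambda; the real push-to-external-system logic is omitted):

wordCounts.foreachRDD(rdd -> {
    // rdd is the RDD of one batch; here we only report its size from the driver,
    // whereas a real job would write rdd's data to an external system instead
    System.out.println("Records in this batch: " + rdd.count());
    return null;  // the Function<..., Void> contract requires returning null
});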
RDD Persistence
 Persisting (or caching) a dataset in memory across operations
 Each node stores any computed partitions in memory and reuses them
 Methods
 .cache()  memory only; for iterative algorithms
 .persist()  memory only; reuse in other actions on the dataset
 .persist(storageLevel)  storageLevel is one of:
MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
DISK_ONLY
 Example:
import org.apache.spark.api.java.StorageLevels;
...
JavaReceiverInputDStream<String> lines = ssc.socketTextStream(
args[0], Integer.parseInt(args[1]),
StorageLevels.MEMORY_AND_DISK_SER);
UpdateStateByKey
 To maintain state
 Update state with new information
 Define the state
 Define the state update function
 using updateStateByKey requires checkpointing to be enabled
import com.google.common.base.Optional;
...
Function2<List<Integer>, Optional<Integer>, Optional<Integer>> updateFunction =
    new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>() {
      @Override
      public Optional<Integer> call(List<Integer> values, Optional<Integer> state) {
        Integer newSum = ... // add the new values to the previous running count
        return Optional.of(newSum);
      }
    };
...
JavaPairDStream<String, Integer> runningCounts =
    pairs.updateStateByKey(updateFunction); // applied on a DStream containing words
Checkpointing
 To operate 24/7 and be resilient to failures, a streaming application needs to checkpoint enough information to recover from failures
 Two types of data are checkpointed
 Metadata checkpointing
 to recover from failure of the node running the driver
 includes configuration, DStream operations, and incomplete batches
 Data checkpointing
 to cut off the dependency chains
 removes accumulated metadata in stateful operations
 To enable checkpointing:
ctx.checkpoint(hdfsPath)
 The checkpointing interval of a DStream can be set by using
dstream.checkpoint(checkpointInterval)
 a checkpoint interval of 5 - 10 times the sliding interval is good
A Streaming Example
A Complete Example - Network Word Counter Program
 Package and classes: import the needed classes; the package name will be passed to the Spark submitter
 Main class: create a SparkConf (the application name will be passed to the Spark submitter) and a JavaStreamingContext, setting the batch size; use a socket stream as the source to obtain the input DStream
 JavaDStream and JavaPairDStream functions: construct a JavaPairDStream from the JavaDStream by creating tuples (key-value pairs), then aggregate the values for each key to count how many times each word occurs in the stream (a transformed DStream)
 Printing results: print the first ten elements of each batch, start the execution of the streams, and wait for the execution to stop
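Again the slides show the code only as screenshots; here is a minimal sketch of such a network word counter, assuming the Spark 1.2 streaming Java API with Java 8 lambdas (hostname and port are taken from the command line, as in the running example):

package org.apache.spark.examples.streaming;

import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.StorageLevels;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public final class JavaNetworkWordCount {
  public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("JavaNetworkWordCount");
    // Batch size: divide the stream into batches of one second
    JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));

    // Socket stream as source: an input DStream of text lines from hostname:port
    JavaReceiverInputDStream<String> lines = ssc.socketTextStream(
        args[0], Integer.parseInt(args[1]), StorageLevels.MEMORY_AND_DISK_SER);

    // Split lines into words, create (word, 1) tuples, and aggregate per key
    JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(x.split(" ")));
    JavaPairDStream<String, Integer> wordCounts = words
        .mapToPair(s -> new Tuple2<String, Integer>(s, 1))
        .reduceByKey((i1, i2) -> i1 + i2);

    wordCounts.print();      // print the first ten elements of each batch
    ssc.start();             // start the execution of the streams
    ssc.awaitTermination();  // wait for the execution to stop
  }
}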
Spark and Storm
A Comparison
Spark vs. Storm

                            Spark                    Storm
Origin                      UC Berkeley, 2009        Twitter
Implemented in              Scala                    Clojure (Lisp-like)
Enterprise support          Yes                      No
Source model                Open source              Open source
Big data processing         Batch and stream         Stream
Processing type             Short-interval batches   Real time
Latency                     A few seconds            Sub-second
Programming API             Scala, Java, Python      Any PL
Guaranteed data processing  Exactly once             At least once
Batch processing            Yes                      No
Coordination                With ZooKeeper           ZooKeeper
Apache Spark architecture (diagram; credit: Ippon USA)
Apache Storm architecture (diagram)
Comparison
 Higher throughput than Storm
 Spark Streaming: 670k records/sec/node
 Storm: 115k records/sec/node
 Commercial systems: 100-500k records/sec/node
Spark SQL
Spark SQL
 Allows relational queries expressed in SQL to be executed using Spark
 Data sources are exposed as JavaSchemaRDDs
 JavaSchemaRDD
 a new type of RDD
 similar to a table in a traditional relational database
 composed of Row objects along with a schema that describes them
 can be created from an existing RDD, a JSON dataset, or …
Spark SQL Programming Guide
Initializing - Creating JavaSQLContext
 To create a basic JavaSQLContext, all you need is a JavaSparkContext
 It is built on an existing SparkContext
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.api.java.JavaSQLContext;
...
JavaSparkContext sc = ...; // An existing JavaSparkContext.
JavaSQLContext sqlContext = new JavaSQLContext(sc);
SchemaRDD
 SchemaRDD can be operated on
 as normal RDDs
 as a temporary table
 allows you to run SQL queries over it
 Converting RDDs into SchemaRDDs
 Reflection based approach
 Uses reflection to infer the schema of an RDD
 More concise code
 Works well when we know the schema while writing the application
 Programmatic based approach
 Construct a schema and then apply it to an existing RDD
 More verbose
 Allows constructing SchemaRDDs when the columns and their types are not known until runtime
JavaBean
 Just a standard (a convention)
 A class that encapsulates many fields into a single object
 All properties private (accessed via getters/setters)
 A public no-argument constructor
 Implements Serializable
 Lots of libraries depend on it
public static class Person implements Serializable {
private String name;
private int age;
public String getName() { return name; }
public void setName(String name) { this.name = name; }
public int getAge() { return age; }
public void setAge(int age) { this.age = age; }
}
Reflection based - An Example
 Load a text file like people.txt
 Convert each line to a JavaBean
 people is now an RDD of JavaBeans
JavaRDD<Person> people = sc.textFile("people.txt").map(
new Function<String, Person>() {
public Person call(String line) throws Exception {
String[] parts = line.split(",");
Person person = new Person();
person.setName(parts[0]);
person.setAge(Integer.parseInt(parts[1].trim()));
return person;
}
});
Reflection based - An Example
 Apply a schema to an RDD of JavaBeans (people)
 Register it as a temporary table
 SQL can be run over RDDs that have been registered as tables
 The result is SchemaRDD and support all the normal RDD operations
 The columns of a row in the result can be accessed by ordinal
JavaSchemaRDD schemaPeople =
sqlContext.applySchema(people, Person.class);
schemaPeople.registerTempTable("people");
JavaSchemaRDD teenagers = sqlContext.sql(
    "SELECT name FROM people WHERE age >= 13 AND age <= 19");
List<String> teenagerNames = teenagers.map(
new Function<Row, String>() {
public String call(Row row) {
return "Name: " + row.getString(0);
}
}).collect();
Programmatic based
 Used when JavaBean classes cannot be defined ahead of time
 SchemaRDD can be created programmatically with three steps
 Create an RDD of Rows from the original RDD
 Create the schema represented by a StructType matching the structure of
Rows in the RDD created in Step 1.
 Apply the schema to the RDD of Rows via applySchema method provided by
JavaSQLContext.
 Example
 The structure of records (the schema) is encoded in a string
 Load a text file; each line will later be converted to a Row
String schemaString = "name age";
JavaRDD<String> people =
sc.textFile("examples/src/main/resources/people.txt");
Programmatic based – An Example
 Generate the schema based on the string of schema
 Convert records of the RDD (people) to Rows
import org.apache.spark.sql.api.java.DataType;
import org.apache.spark.sql.api.java.StructField;
import org.apache.spark.sql.api.java.StructType;
...
List<StructField> fields = new ArrayList<StructField>();
for (String fieldName : schemaString.split(" ")) {
  fields.add(DataType.createStructField(fieldName, DataType.StringType, true));
}
StructType schema = DataType.createStructType(fields);
import org.apache.spark.sql.api.java.Row;
...
JavaRDD<Row> rowRDD = people.map(
new Function<String, Row>() {
public Row call(String record) throws Exception {
String[] fields = record.split(",");
return Row.create(fields[0], fields[1].trim());
}
});
Programmatic based – An Example
 Apply the schema to the RDD.
 Register the SchemaRDD as a table.
 SQL can be run over RDDs that have been registered as tables
 The result is SchemaRDD and support all the normal RDD operations
 The columns of a row in the result can be accessed by ordinal
JavaSchemaRDD peopleSchemaRDD =
sqlContext.applySchema(rowRDD, schema);
peopleSchemaRDD.registerTempTable("people");
JavaSchemaRDD results = sqlContext.sql("SELECT name FROM people");
List<String> names = results.map(
new Function<Row, String>() {
public String call(Row row) {
return "Name: " + row.getString(0);
}
}).collect();
JSON Datasets
 Spark SQL can infer the schema of a JSON dataset and load it as a JavaSchemaRDD
 Two methods in JavaSQLContext
 jsonFile(): loads data from a directory of JSON files where each line of the files is a JSON object (not a regular multi-line JSON file)
 jsonRDD(): loads data from an existing RDD where each element of the RDD is a string containing a JSON object
 Loading a JSON file:
JavaSchemaRDD people = sqlContext.jsonFile(path);
JSON Datasets
 The inferred schema (field names and types) can be visualized using printSchema()
 Register this JavaSchemaRDD as a table
 SQL statements can be run by using the sql methods
people.printSchema();
people.registerTempTable("people");
JavaSchemaRDD teenagers = sqlContext.sql(
"SELECT name FROM people WHERE age >= 13 AND age <= 19");
JSON Datasets
 JavaSchemaRDD can be created for a JSON dataset represented by an
RDD[String] storing one JSON object per string
 such an RDD can be built from a native collection (e.g. Arrays.asList) via parallelize
 Register this JavaSchemaRDD as a table
 SQL statements can be run by using the sql methods
List<String> jsonData = Arrays.asList(
    "{\"name\":\"Yin\",\"address\":{\"city\":\"Columbus\",\"state\":\"Ohio\"}}");
JavaRDD<String> anotherPeopleRDD = sc.parallelize(jsonData);
JavaSchemaRDD anotherPeople = sqlContext.jsonRDD(anotherPeopleRDD);
anotherPeople.registerTempTable("anotherPeople");
JavaSchemaRDD names = sqlContext.sql("SELECT name FROM anotherPeople");
Thrift JDBC/ODBC server
 To start the JDBC/ODBC server:
 By default, the server listens on localhost:10000
 We can use beeline to test the Thrift JDBC/ODBC server
 Connect to the JDBC/ODBC server in beeline with
 Beeline will ask for a username and password
 Simply enter the username on your machine and a blank password
 See existing databases;
 Create a database;
$ ./sbin/start-thriftserver.sh
$ ./bin/beeline
beeline> !connect jdbc:hive2://localhost:10000
0: jdbc:hive2://localhost:10000> SHOW DATABASES;
0: jdbc:hive2://localhost:10000> CREATE DATABASE DBTEST;
End
Any questions?
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...SelfMade bd
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024VictoriaMetrics
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...masabamasaba
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park masabamasaba
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastPapp Krisztián
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdfPearlKirahMaeRagusta1
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Bert Jan Schrijver
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...masabamasaba
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 

Recently uploaded (20)

%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 

Apache Spark

  • 14. Loading data into an RDD  Spark's primary unit for data representation  Allows for easy parallel operations on the data  Native collections in Java can serve as the basis for an RDD  The number of partitions can be set manually by passing it as a second parameter to parallelize (e.g. ctx.parallelize(data, 10))  To load external data from a file, use the textFile method of the SparkContext:  textFile(path: String, minSplits: Int)  path: the path of the text file  minSplits: minimum number of splits for Hadoop RDDs  The resulting RDD contains one String element per line of the file import org.apache.spark.api.java.JavaRDD; JavaRDD<Integer> dataRDD = ctx.parallelize(Arrays.asList(1,2,4)); 15 January 2015 14Majid Hajibaba - Spark
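A minimal, self-contained sketch of the two loading paths described above (the class name, file path, and partition counts are illustrative, not part of the original slides):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LoadingExample {
  public static void main(String[] args) {
    JavaSparkContext ctx = new JavaSparkContext(new SparkConf().setAppName("loading example"));
    // Parallelize a local collection, explicitly requesting 10 partitions
    JavaRDD<Integer> numbers = ctx.parallelize(Arrays.asList(1, 2, 3, 4), 10);
    // Load an external text file; the second argument is the minimum number of splits
    JavaRDD<String> lines = ctx.textFile("data.txt", 2);
    System.out.println(numbers.count() + " numbers, " + lines.count() + " lines");
    ctx.stop();
  }
}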
  • 15. textFile method  Read a text file and return it as an RDD of Strings  The file can be taken from  a local file system (must be available on all nodes in distributed mode)  HDFS  any Hadoop-supported file system URI import org.apache.spark.api.java.JavaRDD; JavaRDD<String> lines = ctx.textFile("file_path", 1); import org.apache.spark.SparkFiles; import org.apache.spark.api.java.JavaRDD; ... ctx.addFile("file_path"); JavaRDD<String> lines = ctx.textFile(SparkFiles.get("file_path")); import org.apache.spark.api.java.JavaRDD; ... JavaRDD<String> lines = ctx.textFile("hdfs://..."); 15 January 2015 15Majid Hajibaba - Spark
  • 16. Manipulating RDD  Transformations: create a new dataset from an existing one  map: works on each individual element in the input RDD and produces a new output element  Transformation functions do not transform the existing elements; rather, they return a new RDD with the new elements  Actions: return a value to the driver program after running a computation on the dataset  reduce: operates on pairs to aggregate all the data elements of the dataset import org.apache.spark.api.java.function.Function; rdd.map(new Function<Integer, Integer>() { public Integer call(Integer x) { return x+1;} }); import org.apache.spark.api.java.function.Function2; rdd.reduce(new Function2<Integer, Integer, Integer>() { public Integer call(Integer x, Integer y) { return x+y;} }); 15 January 2015 16Majid Hajibaba - Spark
  • 17. RDD Basics  A simple program: JavaRDD<String> lines = ctx.textFile("data.txt"); JavaRDD<Integer> lineLengths = lines.map(s -> s.length()); int totalLength = lineLengths.reduce((a, b) -> a + b);  The dataset is not loaded in memory  lines is merely a pointer to the file  lineLengths is not immediately computed  The reduce breaks the computation into tasks to run on separate machines  Each machine runs both its part of the map and a local reduction, returning only its answer to the driver program  To use lineLengths again later, we could add the following before the reduce: lineLengths.persist();  This would cause lineLengths to be saved in memory after the first time it is computed. 15 January 2015 17Majid Hajibaba - Spark
  • 18. Passing Functions to Spark  Functions are represented by classes implementing the interfaces in the org.apache.spark.api.java.function package  Two ways to create such functions: 1. Use lambda expressions to concisely define an implementation (in Java 8): JavaRDD<String> lines = ctx.textFile("data.txt"); JavaRDD<Integer> lineLengths = lines.map(s -> s.length()); int totalLength = lineLengths.reduce((a, b) -> a + b); 2. Implement the Function interfaces in your own class, and pass an instance of it to Spark: JavaRDD<String> lines = sc.textFile("data.txt"); JavaRDD<Integer> lineLengths = lines.map(new Function<String, Integer>() { public Integer call(String s) { return s.length(); } }); int totalLength = lineLengths.reduce(new Function2<Integer, Integer, Integer>() { public Integer call(Integer a, Integer b) { return a + b; } }); or, with named classes: class GetLength implements Function<String, Integer> { public Integer call(String s) { return s.length(); } } class Sum implements Function2<Integer, Integer, Integer> { public Integer call(Integer a, Integer b) { return a + b;} } JavaRDD<String> lines = sc.textFile("data.txt"); JavaRDD<Integer> lineLengths = lines.map(new GetLength()); int totalLength = lineLengths.reduce(new Sum()); 15 January 2015 18Majid Hajibaba - Spark
  • 19. Working with Key-Value Pairs  Key-value pairs are represented using the scala.Tuple2 class  call new Tuple2(a, b) to create a tuple  access its fields with tuple._1() and tuple._2()  RDDs of key-value pairs support distributed “shuffle” operations (e.g. grouping or aggregating the elements by a key)  Represented by the JavaPairRDD class  JavaPairRDDs can be constructed from JavaRDDs using special versions of the map operations (mapToPair, flatMapToPair)  The JavaPairRDD has both standard RDD functions and special key-value ones, e.g.:  reduceByKey  sortByKey import scala.Tuple2; ... Tuple2<String, String> tuple = new Tuple2("foo", "bar"); System.out.println(tuple._1() + " " + tuple._2()); 15 January 2015 19Majid Hajibaba - Spark
  • 20. Working with Key-Value Pairs  reduceByKey example  to count how many times each line of text occurs in a file  sortByKey example  to sort the pairs alphabetically  and to bring them back to the driver program as an array of objects import scala.Tuple2; import org.apache.spark.api.java.JavaPairRDD; import org.apache.spark.api.java.JavaRDD; ... JavaRDD<String> lines = ctx.textFile("data.txt"); JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2(s, 1)); JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b); ... counts.sortByKey(); counts.collect(); 15 January 2015 20Majid Hajibaba - Spark
  • 21. flatMap  flatMap is a combination of map and flatten  Returns a sequence rather than a single item, then flattens the result  Use case: parsing all the data when some records may fail to parse (see the sketch below) 15 January 2015Majid Hajibaba - Spark 21 http://www.slideshare.net/frodriguezolivera/apache-spark-streaming
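A small sketch of the use case above, under the hypothetical rule that blank lines count as parse failures; such lines contribute an empty list, which flatMap flattens away (file path and class name are illustrative):

import java.util.Arrays;
import java.util.Collections;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;

public class FlatMapExample {
  public static void main(String[] args) {
    JavaSparkContext ctx = new JavaSparkContext(new SparkConf().setAppName("flatMap example"));
    JavaRDD<String> lines = ctx.textFile("data.txt");
    // Each input line yields zero or more words
    JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      public Iterable<String> call(String line) {
        if (line.trim().isEmpty()) {
          return Collections.<String>emptyList();  // "failed" parse: produce no output
        }
        return Arrays.asList(line.split(" "));
      }
    });
    System.out.println("word count: " + words.count());
    ctx.stop();
  }
}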
  • 22. RDD Operations 15 January 2015 23Majid Hajibaba - Spark
  • 24. Counting Words 15 January 2015 25Majid Hajibaba - Spark
  • 25. A Complete Example  Word Counter Program  Package and classes Import needed classes Package’s name (will be passed to spark submitter) 15 January 2015 26Majid Hajibaba - Spark
  • 26. A Complete Example  Main Class Creating a SparkContext Creating a SparkConf Application name (will be passed to spark submitter) Loading data into an RDD Base RDD 15 January 2015 27Majid Hajibaba - Spark
  • 27. A Complete Example  JavaRDDs and JavaPairRDDs functions construct JavaPairRDDs from JavaRDDs count how many times each word of text occurs in a file values for each key are aggregated create a tuple (key-value pairs ) Transformed RDD 15 January 2015 28Majid Hajibaba - Spark
  • 28. A Complete Example  Printing results accessing tuples action 15 January 2015 29Majid Hajibaba - Spark
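The word-counter code on these slides appears only as screenshots. The following is a reconstruction of such a program, consistent with the class and package names used by the spark-submit command later in the deck (org.apache.spark.examples.JavaWordCount); treat it as a sketch rather than the exact code shown on the slides.

package org.apache.spark.examples;

import java.util.Arrays;
import java.util.List;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public final class JavaWordCount {
  public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);

    JavaRDD<String> lines = ctx.textFile(args[0], 1);                              // base RDD
    JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" "))); // transformed RDD
    JavaPairRDD<String, Integer> ones =
        words.mapToPair(word -> new Tuple2<String, Integer>(word, 1));             // (word, 1) tuples
    JavaPairRDD<String, Integer> counts = ones.reduceByKey((a, b) -> a + b);       // aggregate values per key

    List<Tuple2<String, Integer>> output = counts.collect();                       // action: bring results back
    for (Tuple2<String, Integer> tuple : output) {
      System.out.println(tuple._1() + ": " + tuple._2());                          // access tuple fields
    }
    ctx.stop();
  }
}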
  • 29.  Iteration 1  output = count.collect(); Spark Execution Model 15 January 2015 30Majid Hajibaba - Spark
  • 30.  Iteration 2  output = count.reduce(func); Spark Execution Model 15 January 2015 31Majid Hajibaba - Spark
  • 32. Building Application  With sbt ($ sbt package)  With maven ($ mvn package) ./src ./src/main ./src/main/java ./src/main/java/app.java <project> <artifactId>word-counter</artifactId> <name>Word Counter</name> <packaging>jar</packaging> <version>1.0</version> <dependencies> <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.10</artifactId> <version>1.2.0</version> </dependency> </dependencies> </project> name := "Word Counter" organization := "org.apache.spark" version := "1.0" libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0" Directory layout Pom.xml name.sbt 15 January 2015 33Majid Hajibaba - Spark
  • 33. Submitting Application  Starting Spark (Master and Slaves)  Submitting a job  Submission syntax: ./bin/spark-submit --class <main-class> --master <master-url> --deploy-mode <deploy-mode> --conf <key>=<value> ... # other options <application-jar> [application-arguments] $ sudo ./bin/spark-submit --class "org.apache.spark.examples.JavaWordCount" --master spark://127.0.0.1:7077 test/target/word-counter-1.0.jar /var/log/syslog $ ./sbin/start-all.sh 15 January 2015 34Majid Hajibaba - Spark
  • 34. Spark Streaming 15 January 2015Majid Hajibaba - Spark 35
  • 35. Overview  Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis or TCP sockets  Data can be processed using complex algorithms expressed with high- level functions like map, reduce, join and window  Processed data can be pushed out to filesystems, databases, and live dashboards  Potential for combining batch processing and streaming processing in the same system  you can apply Spark’s machine learning and graph processing algorithms on data streams 15 January 2015Majid Hajibaba - Spark 36
  • 36. Spark Streaming – How It Works  Run a streaming computation as a series of very small, deterministic batch jobs  Chop up the live stream into batches of X seconds  Spark treats each batch of data as RDDs and processes them using RDD operations  Finally, the processed results of the RDD operations are returned in batches  Batch sizes as low as ½ second, latency of about 1 second 15 January 2015Majid Hajibaba - Spark 37
  • 37. DStreams (Discretized Streams)  A DStream represents a continuous stream of data  Internally represented as a sequence of RDDs  Can be created from  input data streams from sources such as Kafka, Flume, and Kinesis  or by applying high-level operations on other DStreams  Example: lines to words (sketched below) 15 January 2015Majid Hajibaba - Spark 38
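The lines-to-words example mentioned above, written against the Java API (a fragment in the style of the other snippets here; it assumes a JavaStreamingContext named ssc already exists, and the host and port are illustrative):

import java.util.Arrays;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
...
// Each batch (RDD) of lines becomes a batch (RDD) of words;
// the flatMap is applied to every RDD in the lines DStream.
JavaReceiverInputDStream<String> lines = ssc.socketTextStream("localhost", 9999);
JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")));
words.print();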
  • 38. Running Example - JavaNetworkWordCount  You will first need to run Netcat as a data server: $ nc -lk 9999  Remember that Spark must be installed  Then, in a different terminal, start the example: $ ./bin/run-example streaming.JavaNetworkWordCount localhost 9999  Any lines typed in the terminal running the netcat server will be counted and printed on screen every second. 15 January 2015Majid Hajibaba - Spark 39
  • 39. Spark Streaming Programming Guide 15 January 2015Majid Hajibaba - Spark 40
  • 40. Linking with Spark  As with Spark batch processing  Spark 1.2.0 works with Java 6 and higher  To write a Spark Streaming application in Java, you need to add a dependency on Spark Streaming  For a Maven project: <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-streaming_2.10</artifactId> <version>1.2.0</version> </dependency>  For an SBT project: libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.2.0" 15 January 2015 41Majid Hajibaba - Spark
  • 41. Initializing – Creating StreamingContext  As with SparkContext  Using the constructor: new StreamingContext(master, appName, batchDuration, [sparkHome], [jars])  The batchDuration is the size of the batches  the time interval at which streaming data will be divided into batches  Can be created from a SparkConf object  Can also be created from an existing JavaSparkContext import org.apache.spark.SparkConf; import org.apache.spark.streaming.Duration; import org.apache.spark.streaming.Durations; import org.apache.spark.streaming.api.java.JavaStreamingContext; SparkConf conf = new SparkConf().setAppName(appName).setMaster(master); JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(1000)); ... JavaSparkContext ctx = ... //existing JavaSparkContext JavaStreamingContext ssc = new JavaStreamingContext(ctx, Durations.seconds(1)); 15 January 2015Majid Hajibaba - Spark 42
  • 42. Setting the Right Batch Size  Batches of data should be processed as fast as they are being generated  The batch interval used may have a significant impact on the sustainable data rate  To figure out the right batch size for an application  test it with a conservative batch interval and a low data rate  e.g. 5-10 seconds  If the system is stable (the delay stays comparable to the batch size)  try increasing the data rate and/or reducing the batch size  If the system is unstable (the delay keeps increasing)  revert to the previous stable batch size 15 January 2015Majid Hajibaba - Spark 43
  • 43. Input DStreams and Receivers  Input DStream is associated with a Receiver  except file stream  Receiver  receives the data from a source and  stores it in memory for processing  Spark Streaming provides two categories of built-in streaming sources.  Basic sources  like file systems, socket connections, and Akka actors  directly available in the StreamingContext API  Advanced sources  like Kafka, Flume, Kinesis, Twitter, etc.  are available through extra utility classes  Custom sources 15 January 2015Majid Hajibaba - Spark 44
  • 44. Basic Sources  File Streams  will monitor the directory dataDirectory and process any files created in that directory  For simple text files  Socket Streams  Custom Actors  Actors are concurrent processes that communicate by exchanging messages  Queue of RDDs  Each RDD into the queue will be treated as a batch of data in the DStream, and processed like a stream 15 January 2015Majid Hajibaba - Spark 45 streamingContext.fileStream<KeyClass, ValueClass, InputFormatClass>(dataDirectory); streamingContext.textFileStream(dataDirectory) streamingContext.actorStream(actorProps, actor-name) streamingContext.queueStream(queueOfRDDs) streamingContext.socketStream(String hostname, int port, Function converter, StorageLevel storageLevel)
  • 45. Advanced Sources  require interfacing with external non-Spark libraries  Twitter  Linking: Add the artifact spark-streaming-twitter_2.10 to the SBT/Maven  Programming: Import the TwitterUtils class and create a DStream with TwitterUtils.createStream as shown below  Deploying: Generate an uber JAR with all the dependencies (including the dependency spark-streaming-twitter_2.10 and its transitive dependencies) and then deploy the application. This is further explained in the Deploying section.  Flume  Kafka  Kinesis 15 January 2015Majid Hajibaba - Spark 46 import org.apache.spark.streaming.twitter.*; TwitterUtils.createStream(jssc);
  • 46. Custom Sources  Implement a user-defined receiver 15 January 2015Majid Hajibaba - Spark 47
  • 47. Socket Text Stream  Create an input stream from a network source hostname:port  Data is received using a TCP socket  Received bytes are interpreted as UTF-8 encoded, newline-delimited lines  A storage level can be given for storing the received objects socketTextStream(String hostname, int port) socketTextStream(String hostname, int port, StorageLevel storageLevel) import org.apache.spark.streaming.api.java.JavaStreamingContext; import org.apache.spark.api.java.StorageLevels; ... ssc.socketTextStream("localhost", 9999, StorageLevels.MEMORY_AND_DISK_SER); 15 January 2015Majid Hajibaba - Spark 48
  • 48. Class ReceiverInputDStream  Abstract class for defining any InputDStream  Start a receiver on worker nodes to receive external data  JavaReceiverInputDStream  An interface to ReceiverInputDStream  The abstract class for defining input stream received over the network  Example:  Creates a DStream from text data received over a TCP socket connection 15 January 2015Majid Hajibaba - Spark 49 import org.apache.spark.api.java.StorageLevels; import org.apache.spark.streaming.api.java.JavaStreamingContext; import org.apache.spark.streaming.api.java.JavaReceiverInputDStream; ... JavaReceiverInputDStream<String> lines = ssc.socketTextStream(“localhost”, 9999, StorageLevels.MEMORY);
  • 49. Output Operations on DStreams
 Allow DStream's data to be pushed out to external systems
 Trigger the actual execution of all the DStream transformations
 Similar to actions for RDDs
 print() - prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application
 saveAsTextFiles(prefix, [suffix]) - saves the DStream's contents as text files; the file name at each batch interval is generated based on prefix and suffix
 saveAsObjectFiles(prefix, [suffix]) - saves the DStream's contents as a SequenceFile of serialized Java objects
 saveAsHadoopFiles(prefix, [suffix]) - saves the DStream's contents as a Hadoop file
 foreachRDD(func) - applies a function to each RDD generated from the stream (see the sketch below); the function should push the data in each RDD to an external system, like saving the RDD to files or writing it over the network to a database, and is executed in the driver process running the streaming application
15 January 2015Majid Hajibaba - Spark 50
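A sketch of foreachRDD, the most general output operation in the list above. In the 1.2-era Java API the callback is a Function that returns Void (later releases use VoidFunction); the count-and-print body here is only a stand-in for pushing the data to a real external system:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class ForeachRDDExample {
  public static void main(String[] args) throws Exception {
    JavaStreamingContext ssc = new JavaStreamingContext(
        new SparkConf().setAppName("foreachRDD example"), Durations.seconds(1));
    JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);
    // foreachRDD gives access to each RDD generated from the stream
    lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
      public Void call(JavaRDD<String> rdd) {
        // A real application would write rdd's data to an external system here
        System.out.println("records in this batch: " + rdd.count());
        return null;
      }
    });
    ssc.start();
    ssc.awaitTermination();
  }
}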
  • 50.  Persisting (or caching) a dataset in memory across operations  Each node stores any computed partitions in memory and reuses them  Methods  .cache()  just memory - for iterative algorithms  .persist()  just memory - reuses in other actions on dataset  .persist(storageLevel)  storageLevel:  Example: . RDD Persistence 15 January 2015 51Majid Hajibaba - Spark MEMORY_ONLY MEMORY_ONLY_SER MEMORY_AND_DISK MEMORY_AND_DISK_SER DISK_ONLY import org.apache.spark.api.java.StorageLevels; ... JavaReceiverInputDStream<String> lines = ssc.socketTextStream( args[0], Integer.parseInt(args[1]), StorageLevels.MEMORY_AND_DISK_SER);
  • 51. UpdateStateByKey  To maintain state  Update state with new information  Define the state  Define the state update function  using updateStateByKey requires the checkpointing 15 January 2015Majid Hajibaba - Spark 52 import com.google.common.base.Optional; ... Function2<List<Integer>, Optional<Integer>, Optional<Integer>> updateFunction = new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>() { @Override public Optional<Integer> call(List<Integer> values, Optional<Integer> state) { Integer newSum = ... // add the new values with the //previous running count return Optional.of(newSum); }}; ... JavaPairDStream<String, Integer> runningCounts = pairs.updateStateByKey(updateFunction); applied on a DStream containing words
  • 52. Checkpointing  To operate 24/7 and be resilient to failures  Spark Streaming needs to checkpoint enough information to recover from failures  Two types of data are checkpointed  Metadata checkpointing  to recover from failure of the node running the driver  includes configuration; DStream operations; incomplete batches  Data checkpointing  to cut off the dependency chains  removes accumulated metadata in stateful operations  To enable checkpointing: ctx.checkpoint(hdfsPath)  The checkpoint interval of a DStream can be set using dstream.checkpoint(checkpointInterval)  a checkpoint interval of 5-10 times the batch interval is a good starting point 15 January 2015Majid Hajibaba - Spark 53
  • 54. A Complete Example  Network Word Counter Program  Package and classes Import needed classes Package’s name (will be passed to spark submitter) 15 January 2015 55Majid Hajibaba - Spark
  • 55. A Complete Example  Main Class Creating a SparkStreamingContext Creating a SparkConf Application name (will be passed to spark submitter) Socket Streams as Source Input DStream 15 January 2015 56Majid Hajibaba - Spark Setting batch size
  • 56. A Complete Example  JavaDStream and JavaPairDStream functions construct JavaPairDstream from JavaDstream count how many times each word of text occurs in an stream values for each key are aggregated create a tuple (key-value pairs ) Transformed DStream 15 January 2015 57Majid Hajibaba - Spark
  • 57. A Complete Example  Printing results Wait for the execution to stop Start the execution of the streams 15 January 2015 58Majid Hajibaba - Spark Print the first ten elements
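As with the batch example, the streaming word-counter code on these slides is shown as screenshots. The following reconstruction assembles the pieces discussed above (socket source, batch size, pair DStream, print); the class name is illustrative, and the bundled JavaNetworkWordCount example remains the authoritative version.

import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.StorageLevels;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public final class NetworkWordCounter {
  public static void main(String[] args) throws Exception {
    SparkConf sparkConf = new SparkConf().setAppName("NetworkWordCounter");
    // Batch size: the live stream is chopped into one-second batches
    JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));

    // Socket stream as source (input DStream)
    JavaReceiverInputDStream<String> lines = ssc.socketTextStream(
        "localhost", 9999, StorageLevels.MEMORY_AND_DISK_SER);

    JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")));
    JavaPairDStream<String, Integer> pairs =
        words.mapToPair(word -> new Tuple2<String, Integer>(word, 1));
    JavaPairDStream<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);

    counts.print();          // print the first ten elements of each batch
    ssc.start();             // start the execution of the streams
    ssc.awaitTermination();  // wait for the execution to stop
  }
}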
  • 58. Spark and Storm A Comparison 15 January 2015 59Majid Hajibaba - Spark
  • 59. Spark vs. Storm
 Origin: Spark - UC Berkeley, 2009; Storm - Twitter
 Implemented in: Spark - Scala; Storm - Clojure (Lisp-like)
 Enterprise support: Spark - yes; Storm - no
 Source model: both open source
 Big data processing: Spark - batch and stream; Storm - stream
 Processing type: Spark - processing in short-interval batches; Storm - real time
 Latency: Spark - a few seconds; Storm - sub-second
 Programming API: Spark - Scala, Java, Python; Storm - any programming language
 Guaranteed data processing: Spark - exactly once; Storm - at least once
 Batch processing: Spark - yes; Storm - no
 Coordination: both via ZooKeeper
15 January 2015 60Majid Hajibaba - Spark
  • 60. Apache Spark Ippon USA 15 January 2015 61Majid Hajibaba - Spark
  • 61. Apache Storm 15 January 2015Majid Hajibaba - Spark 62
  • 62. Comparison  Higher throughput than Storm  Spark Streaming: 670k records/sec/node  Storm: 115k records/sec/node  Commercial systems: 100-500k records/sec/node 15 January 2015Majid Hajibaba - Spark 63
  • 63. Spark SQL 15 January 2015Majid Hajibaba - Spark 64
  • 64. Spark SQL  Allows relational queries expressed in SQL to be executed using Spark  Data sources are exposed as JavaSchemaRDDs  JavaSchemaRDD  a new type of RDD  similar to a table in a traditional relational database  composed of Row objects along with a schema that describes them  can be created from an existing RDD, a JSON dataset, or … 15 January 2015Majid Hajibaba - Spark 65
  • 65. Spark SQL Programming Guide 15 January 2015Majid Hajibaba - Spark 66
  • 66. Initializing - Creating JavaSQLContext  To create a basic JavaSQLContext, all you need is a JavaSparkContext  It must be based on an existing SparkContext import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.sql.api.java.JavaSQLContext; ... JavaSparkContext sc = ...; // An existing JavaSparkContext. JavaSQLContext sqlContext = new JavaSQLContext(sc); 15 January 2015Majid Hajibaba - Spark 67
  • 67. SchemaRDD  SchemaRDD can be operated on  as normal RDDs  as a temporary table  allows you to run SQL queries over it  Converting RDDs into SchemaRDDs  Reflection based approach  Uses reflection to infer the schema of an RDD  More concise code  Works well when we know the schema while writing the application  Programmatic based approach  Construct a schema and then apply it to an existing RDD  More verbose  Allows to construct SchemaRDDs when the columns and types are not known until runtime 15 January 2015Majid Hajibaba - Spark 68
  • 68. JavaBean  Is just a standard (a convention)  Is a class that encapsulates many objects into a single object  All properties private (using get/set)  A public no-argument constructor  Implements Serializable  Lots of libraries depend on it 15 January 2015Majid Hajibaba - Spark 69 public static class Person implements Serializable { private String name; private int age; public String getName() { return name; } public void setName(String name) { this.name = name; } public int getAge() { return age; } public void setAge(int age) { this.age = age; } }
  • 69. Reflection based - An Example  Load a text file like people.txt  Convert each line to a JavaBean  people now is an RDD of JavaBeans 15 January 2015Majid Hajibaba - Spark 70 JavaRDD<Person> people = sc.textFile("people.txt").map( new Function<String, Person>() { public Person call(String line) throws Exception { String[] parts = line.split(","); Person person = new Person(); person.setName(parts[0]); person.setAge(Integer.parseInt(parts[1].trim())); return person; } });
  • 70. Reflection based - An Example  Apply a schema to an RDD of JavaBeans (people)  Register it as a temporary table  SQL can be run over RDDs that have been registered as tables  The result is SchemaRDD and support all the normal RDD operations  The columns of a row in the result can be accessed by ordinal 15 January 2015Majid Hajibaba - Spark 71 JavaSchemaRDD schemaPeople = sqlContext.applySchema(people, Person.class); schemaPeople.registerTempTable("people"); JavaSchemaRDD teenagers = sqlContext.sql( "SELECT name FROM people WHERE age >= 13 AND age <= 19") List<String> teenagerNames = teenagers.map( new Function<Row, String>() { public String call(Row row) { return "Name: " + row.getString(0); } }).collect();
  • 71. Programmatic based  When JavaBean classes cannot be defined ahead of time  a SchemaRDD can be created programmatically in three steps  Create an RDD of Rows from the original RDD  Create the schema, represented by a StructType matching the structure of the Rows in the RDD created in step 1  Apply the schema to the RDD of Rows via the applySchema method provided by JavaSQLContext  Example  The structure of the records (the schema) is encoded in a string  Load a text file as an RDD of Strings String schemaString = "name age"; JavaRDD<String> people = sc.textFile("examples/src/main/resources/people.txt"); 15 January 2015Majid Hajibaba - Spark 72
  • 72. Programmatic based – An Example  Generate the schema based on the string of schema  Convert records of the RDD (people) to Rows 15 January 2015Majid Hajibaba - Spark 73 import org.apache.spark.sql.api.java.DataType; import org.apache.spark.sql.api.java.StructField; import org.apache.spark.sql.api.java.StructType; ... List<StructField> fields = new ArrayList<StructField>(); for (String fieldName: schemaString.split(" ")) { fields.add(DataType.createStructField(fieldName, DataType.StringType, true));} StructType schema = DataType.createStructType(fields); import org.apache.spark.sql.api.java.Row; ... JavaRDD<Row> rowRDD = people.map( new Function<String, Row>() { public Row call(String record) throws Exception { String[] fields = record.split(","); return Row.create(fields[0], fields[1].trim()); } });
  • 73. Programmatic based – An Example  Apply the schema to the RDD.  Register the SchemaRDD as a table.  SQL can be run over RDDs that have been registered as tables  The result is SchemaRDD and support all the normal RDD operations  The columns of a row in the result can be accessed by ordinal 15 January 2015Majid Hajibaba - Spark 74 JavaSchemaRDD peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema); peopleSchemaRDD.registerTempTable("people"); JavaSchemaRDD results = sqlContext.sql("SELECT name FROM people"); List<String> names = results.map( new Function<Row, String>() { public String call(Row row) { return "Name: " + row.getString(0); } }).collect();
  • 74. JSON Datasets  Spark SQL can infer the schema of a JSON dataset and load it as a JavaSchemaRDD  Two methods in JavaSQLContext  jsonFile(): loads data from a directory of JSON files where each line of the files is a JSON object – but not a regular multi-line JSON file  jsonRDD(): loads data from an existing RDD where each element of the RDD is a string containing a JSON object JavaSchemaRDD people = sqlContext.jsonFile(path);  A JSON file can be like this: 15 January 2015Majid Hajibaba - Spark 75
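The sample file referred to above is shown as an image on the slide; a plausible stand-in, consistent with the name/age queries used on the next slide, is a file with one JSON object per line, for example:

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}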
  • 75. JSON Datasets  The inferred schema can be visualized using the printSchema()  The result is something like this:  Register this JavaSchemaRDD as a table  SQL statements can be run by using the sql methods 15 January 2015Majid Hajibaba - Spark 76 people.printSchema(); people.registerTempTable("people"); JavaSchemaRDD teenagers = sqlContext.sql( "SELECT name FROM people WHERE age >= 13 AND age <= 19");
  • 76. JSON Datasets  A JavaSchemaRDD can be created for a JSON dataset represented by an RDD[String] storing one JSON object per string  A Java List (e.g. from Arrays.asList) can be parallelized into such an RDD  Register this JavaSchemaRDD as a table  SQL statements can then be run using the sql method List<String> jsonData = Arrays.asList("{\"name\":\"Yin\",\"address\":{\"city\":\"Columbus\",\"state\":\"Ohio\"}}"); JavaRDD<String> anotherPeopleRDD = sc.parallelize(jsonData); JavaSchemaRDD anotherPeople = sqlContext.jsonRDD(anotherPeopleRDD); anotherPeople.registerTempTable("anotherPeople"); JavaSchemaRDD results = sqlContext.sql("SELECT name FROM anotherPeople"); 15 January 2015Majid Hajibaba - Spark 77
  • 77. Thrift JDBC/ODBC server  To start the JDBC/ODBC server:  By default, the server listens on localhost:10000  We can use beeline to test the Thrift JDBC/ODBC server  Connect to the JDBC/ODBC server in beeline with  Beeline will ask for a username and password  Simply enter the username on your machine and a blank password  See existing databases;  Create a database; 15 January 2015Majid Hajibaba - Spark 78 $ ./sbin/start-thriftserver.sh $ ./bin/beeline beeline> !connect jdbc:hive2://localhost:10000 0: jdbc:hive2://localhost:10000> SHOW DATABASES; 0: jdbc:hive2://localhost:10000> CREATE DATABASE DBTEST;
  • 78. The End. Any questions? 15 January 2015Majid Hajibaba - Spark 79

Editor's Notes

  1. Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program). Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager or Mesos/YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks for the executors to run. Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.
  2. The Spark project contains multiple closely-integrated components. At its core, Spark is a “computational engine” that is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines, or a computing cluster. Because the core engine of Spark is both fast and general-purpose, it powers multiple higher-level components specialized for various workloads, such as SQL or machine learning. These components are designed to interoperate closely, letting you combine them like libraries in a software project. A philosophy of tight integration has several benefits. First, all libraries and higher level components in the stack benefit from improvements at the lower layers. Second, the costs (deployment, maintenance, testing, support) associated with running the stack are minimized, because instead of running 5-10 independent software systems, an organization only needs to run one. also each time a new component is added to the Spark stack, every organization that uses Spark will immediately be able to try this new component. Finally, is the ability to build applications that seamlessly combine different processing models. For example, in Spark you can write one application that uses machine learning to classify data in real time as it is ingested from streaming sources. Simultaneously analysts can query the resulting data, also in real-time, via SQL, e.g. to join the data with unstructured log files. Spark Streaming Spark Streaming is a Spark component that enables processing live streams of data. Examples of data streams include log files generated by production web servers, or queues of messages containing status updates posted by users of a web service. Spark Streaming provides an API for manipulating data streams that closely matches the Spark Core’s RDD API, making it easy for programmers to learn the project and move between applications that manipulate data stored in memory, on disk, or arriving in real-time. Underneath its API, Spark Streaming was designed to provide the same degree of fault tolerance, throughput, and scalability that the Spark Core provides. Spark SQL Spark SQL provides support for interacting with Spark via SQL as well as the Apache Hive variant of SQL, called the Hive Query Language (HiveQL). Spark SQL represents database tables as Spark RDDs and translates SQL queries into Spark operations. Beyond providing the SQL interface to Spark, Spark SQL allows developers to intermix SQL queries with the programmatic data manipulations supported by RDDs in Python, Java and Scala, all within a single application. This tight integration with the rich and sophisticated computing environment provided by the rest of the Spark stack makes Spark SQL unlike any other open source data warehouse tool.
  3. Provide distributed memory abstractions for clusters to support apps with working sets Retain the attractive properties of MapReduce: » Fault tolerance (for crashes & stragglers) » Data locality » Scalability Solution: augment data flow model with “resilient distributed datasets” (RDDs)
  4. Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
  5. Spark 1.2.0 works with Java 6 and higher. If you are using Java 8, Spark supports lambda expressions for concisely writing functions, otherwise you can use the classes in the org.apache.spark.api.java.function package. To write a Spark application in Java, you need to add a dependency on Spark. Spark is available through Maven Central at: groupId = org.apache.spark artifactId = spark-core_2.10 version = 1.2.0 In addition, if you wish to access an HDFS cluster, you need to add a dependency on hadoop-client for your version of HDFS. Some common HDFS version tags are listed on the third party distributions page. groupId = org.apache.hadoop artifactId = hadoop-client version = <your-hdfs-version> Finally, you need to import some Spark classes into your program. Add the following lines:
  6. A SparkContext class represents the connection to a Spark cluster and provides the entry point for interacting with Spark. We need to create a SparkContext instance so that we can interact with Spark and distribute our jobs. Master: is a string specifying a Spark or Mesos cluster URL to connect to, or a special “local” string to run in local mode, as described below. appName: is a name for your application, which will be shown in the cluster web UI. sparkHome: The path at which Spark is installed on your worker machines (it should be the same on all of them). jars: A list of JAR files on the local machine containing your application’s code and any dependencies, which Spark will deploy to all the worker nodes. You’ll need to package your application into a set of JARs using your build system. or through new SparkContext(conf), which takes a SparkConf object for more advanced configuration.
  7. Most of the time, you would create a SparkConf object with new SparkConf(), which will load values from any spark.* Java system properties set in your application as well. In this case, parameters you set directly on the SparkConf object take priority over system properties. For unit tests, you can also call new SparkConf(false) to skip loading external settings and get the same configuration no matter what the system properties are. All setter methods in this class support chaining. For example, you can write new SparkConf().setMaster("local").setAppName("My app"). Note that once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.
  8. The Spark context provides a function called parallelize; this takes a Scala collection and turns it into an RDD that is of the same type as the data input. The simplest method for loading external data is loading text from a file. This requires the file to be available on all the nodes in the cluster, which isn't much of a problem for a local mode. When in a distributed mode, you will want to use Spark's addFile functionality to copy the file to all the machines in your cluster. One important parameter for parallel collections is the number of partitions to cut the dataset into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).
  9. The Spark context provides a function called parallelize; this takes a Scala collection and turns it into an RDD that is of the same type as the data input. The simplest method for loading external data is loading text from a file. This requires the file to be available on all the nodes in the cluster, which isn't much of a problem for a local mode. When in a distributed mode, you will want to use Spark's addFile functionality to copy the file to all the machines in your cluster.
  10. RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new distributed dataset representing the results. On the other hand, reduce is an action that aggregates all the elements of the dataset using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset). All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. Example is to produce a new RDD where you add one to every number, use rdd.map(x => x+1) or in Java Example 2 is to sum all the elements
  11. To illustrate RDD basics, consider the simple program below: The first line defines a base RDD from an external file. This dataset is not loaded in memory or otherwise acted on: lines is merely a pointer to the file. The second line defines lineLengths as the result of a map transformation. Again, lineLengths is not immediately computed, due to laziness. Finally, we run reduce, which is an action. At this point Spark breaks the computation into tasks to run on separate machines, and each machine runs both its part of the map and a local reduction, returning only its answer to the driver program.
  12. Spark’s API relies heavily on passing functions in the driver program to run on the cluster. In Java, functions are represented by classes implementing the interfaces in the org.apache.spark.api.java.function package. There are two ways to create such functions: Implement the Function interfaces in your own class, either as an anonymous inner class or a named one, and pass an instance of it to Spark. In Java 8, use lambda expressions to concisely define an implementation. While much of this guide uses lambda syntax for conciseness, it is easy to use all the same APIs in long-form. For example, we could have written our code above as follows:
  13. While most Spark operations work on RDDs containing any type of objects, a few special operations are only available on RDDs of key-value pairs. The most common ones are distributed “shuffle” operations, such as grouping or aggregating the elements by a key. In Java, key-value pairs are represented using the scala.Tuple2 class from the Scala standard library. You can simply call new Tuple2(a, b) to create a tuple, and access its fields later with tuple._1() and tuple._2(). RDDs of key-value pairs are represented by the JavaPairRDD class. You can construct JavaPairRDDs from JavaRDDs using special versions of the map operations, like mapToPair and flatMapToPair. The JavaPairRDD will have both standard RDD functions and special key-value ones. For example, the following code uses the reduceByKey operation on key-value pairs to count how many times each line of text occurs in a file:
  14. The JavaPairRDD will have both standard RDD functions and special key-value ones. For example, the following code uses the reduceByKey operation on key-value pairs to count how many times each line of text occurs in a file:
  15. The flatMap function is a useful utility, which lets you write a function that returns an Iterable object of the type you want and then flattens the results. A simple example of this is a case where you want to parse all the data, but may fail to parse some of it. The flatMap function can be used to output an empty list if it failed, or a list with the success if it worked. In addition to the reduce function, there is a corresponding reduceByKey function that works on RDDs, which are key-value pairs to produce another RDD. Unlike when using map on a list in Scala, your function will run on a number of different machines, so you can't depend on a shared state with this flatMap is a combination of map and flatten, so it first runs map on the sequence, then runs flatten, giving the result shown.
  16. The spark-submit script in Spark’s bin directory is used to launch applications on a cluster. It can use all of Spark’s supported cluster managers through a uniform interface so you don’t have to configure your application especially for each one. If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To create an assembly jar containing your code and its dependencies, both sbt and maven can be used. --class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi) --master: The master URL for the cluster (e.g. spark://23.195.26.187:7077) --deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client) † --conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap “key=value” in quotes (as shown). application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes. application-arguments: Arguments passed to the main method of your main class, if any
  17. Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s machine learning and graph processing algorithms on data streams.
  18. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs. Any operation applied on a DStream translates to operations on the underlying RDDs. For example, in the earlier example of converting a stream of lines to words, the flatMap operation is applied on each RDD in the lines DStream to generate the RDDs of the words DStream. This is shown in the following figure.
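  A minimal sketch of that lines-to-words step in Java (assuming an existing JavaStreamingContext named jssc and a hypothetical socket source on localhost:9999):

  import java.util.Arrays;
  import org.apache.spark.api.java.function.FlatMapFunction;
  import org.apache.spark.streaming.api.java.JavaDStream;
  import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;

  // A DStream of text lines arriving over a TCP socket (hypothetical host/port)
  JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

  // flatMap is applied to each RDD of the lines DStream to produce the words DStream
  JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String line) { return Arrays.asList(line.split(" ")); }
  });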
  19. Spark 1.2.0 works with Java 6 and higher. If you are using Java 8, Spark supports lambda expressions for concisely writing functions; otherwise you can use the classes in the org.apache.spark.api.java.function package. To write a Spark application in Java, you need to add a dependency on Spark. Spark is available through Maven Central at:
  groupId = org.apache.spark
  artifactId = spark-core_2.10
  version = 1.2.0
  In addition, if you wish to access an HDFS cluster, you need to add a dependency on hadoop-client for your version of HDFS. Some common HDFS version tags are listed on the third party distributions page.
  groupId = org.apache.hadoop
  artifactId = hadoop-client
  version = <your-hdfs-version>
  Finally, you need to import some Spark classes into your program. Add the following lines:
  20. For a Spark Streaming application running on a cluster to be stable, the system should be able to process data as fast as it is being received. In other words, batches of data should be processed as fast as they are being generated. Whether this is true for an application can be found by monitoring the processing times in the streaming web UI, where the batch processing time should be less than the batch interval. Depending on the nature of the streaming computation, the batch interval used may have significant impact on the data rates that can be sustained by the application on a fixed set of cluster resources. For example, let us consider the earlier WordCountNetwork example. For a particular data rate, the system may be able to keep up with reporting word counts every 2 seconds (i.e., batch interval of 2 seconds), but not every 500 milliseconds. So the batch interval needs to be set such that the expected data rate in production can be sustained. A good approach to figure out the right batch size for your application is to test it with a conservative batch interval (say, 5-10 seconds) and a low data rate. To verify whether the system is able to keep up with the data rate, you can check the value of the end-to-end delay experienced by each processed batch (either look for “Total delay” in the Spark driver log4j logs, or use the StreamingListener interface). If the delay stays comparable to the batch size, then the system is stable. Otherwise, if the delay is continuously increasing, it means that the system is unable to keep up and is therefore unstable. Once you have an idea of a stable configuration, you can try increasing the data rate and/or reducing the batch size. Note that a momentary increase in the delay due to temporary data rate increases may be fine as long as the delay reduces back to a low value (i.e., less than the batch size).
  21. Spark Streaming will monitor the directory dataDirectory and process any files created in that directory (files written in nested directories are not supported). For simple text files, there is an easier method, streamingContext.textFileStream(dataDirectory). File streams do not require running a receiver, hence they do not require allocating cores. DStreams can also be created from data streams received through Akka actors by using streamingContext.actorStream. Actors are basically concurrent processes that communicate by exchanging messages.
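  For example, a minimal sketch (assuming an existing JavaStreamingContext named jssc and a hypothetical directory path):

  import org.apache.spark.streaming.api.java.JavaDStream;

  // Watches the directory and turns each newly created text file into lines of a DStream
  JavaDStream<String> fileLines = jssc.textFileStream("/path/to/dataDirectory");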
  22. This category of sources requires interfacing with external non-Spark libraries, some of them with complex dependencies (e.g., Kafka and Flume). Hence, to minimize issues related to version conflicts of dependencies, the functionality to create DStreams from these sources has been moved to separate libraries that can be linked to explicitly when necessary. For example, if you want to create a DStream using data from Twitter’s stream of tweets, you have to do the following. Linking: add the artifact spark-streaming-twitter_2.10 to the SBT/Maven project dependencies. Programming: import the TwitterUtils class and create a DStream with TwitterUtils.createStream as shown below. Deploying: generate an uber JAR with all the dependencies (including the dependency spark-streaming-twitter_2.10 and its transitive dependencies) and then deploy the application. This is further explained in the Deploying section.
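  A sketch of the programming step (assuming an existing JavaStreamingContext named jssc and that Twitter OAuth credentials are already configured, e.g. via system properties):

  import twitter4j.Status;
  import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
  import org.apache.spark.streaming.twitter.TwitterUtils;

  // Creates a DStream of tweets using the default Twitter credentials
  JavaReceiverInputDStream<Status> tweets = TwitterUtils.createStream(jssc);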
  23. Input DStreams can also be created out of custom data sources. All you have to do is implement a user-defined receiver (see the next section to understand what that is) that can receive data from the custom sources and push it into Spark. See the Custom Receiver Guide for details.
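  A rough sketch of such a receiver; the class name and the generated record are hypothetical, and only the Receiver API calls shown (store(), isStopped(), onStart(), onStop()) come from Spark:

  import org.apache.spark.storage.StorageLevel;
  import org.apache.spark.streaming.receiver.Receiver;

  public class MyCustomReceiver extends Receiver<String> {     // hypothetical receiver class
    public MyCustomReceiver() { super(StorageLevel.MEMORY_AND_DISK_2()); }

    @Override
    public void onStart() {
      // Start a thread that reads from the custom source and pushes records into Spark
      new Thread(new Runnable() {
        public void run() {
          while (!isStopped()) {
            store("a record read from the custom source");      // hypothetical record
          }
        }
      }).start();
    }

    @Override
    public void onStop() {
      // Clean up any connections to the custom source here
    }
  }

  // Usage: JavaReceiverInputDStream<String> stream = jssc.receiverStream(new MyCustomReceiver());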
  24. One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use. You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it. In addition, each persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), replicate it across nodes, or store it off-heap in Tachyon. These levels are set by passing a StorageLevel object (Scala, Java, Python) to persist(). The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory).
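  A minimal sketch of both options (assuming an existing JavaSparkContext named sc and hypothetical input files); note that an RDD’s storage level can only be set once:

  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.StorageLevels;

  // Default storage level (MEMORY_ONLY)
  JavaRDD<String> lines = sc.textFile("data.txt");        // hypothetical input file
  lines.cache();

  // An explicit storage level: keep serialized objects in memory, spilling to disk if needed
  JavaRDD<String> other = sc.textFile("other.txt");       // hypothetical input file
  other.persist(StorageLevels.MEMORY_AND_DISK_SER);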
  25. The updateStateByKey operation allows you to maintain arbitrary state while continuously updating it with new information. To use this, you will have to do two steps. Define the state - The state can be of arbitrary data type. Define the state update function - Specify with a function how to update the state using the previous state and the new values from input stream. Let’s illustrate this with an example. Say you want to maintain a running count of each word seen in a text data stream. Here, the running count is the state and it is an integer. We define the update function as
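  A sketch of that update function in Java for Spark 1.2 (where the state argument uses Guava’s Optional), assuming a JavaPairDStream<String, Integer> named pairs holding (word, 1) pairs:

  import java.util.List;
  import com.google.common.base.Optional;
  import org.apache.spark.api.java.function.Function2;
  import org.apache.spark.streaming.api.java.JavaPairDStream;

  Function2<List<Integer>, Optional<Integer>, Optional<Integer>> updateFunction =
    new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>() {
      public Optional<Integer> call(List<Integer> newValues, Optional<Integer> runningCount) {
        int sum = runningCount.or(0);            // previous state, or 0 for a new key
        for (Integer v : newValues) {            // add the counts seen in this batch
          sum += v;
        }
        return Optional.of(sum);                 // the new state for this key
      }
    };

  // Produces a DStream of running counts; requires checkpointing to be enabled (see below)
  JavaPairDStream<String, Integer> runningCounts = pairs.updateStateByKey(updateFunction);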
  26. Because stateful operations have a dependency on previous batches of data, they continuously accumulate metadata over time. To clear this metadata, streaming supports periodic checkpointing by saving intermediate data to HDFS. Note that checkpointing also incurs the cost of saving to HDFS, which may cause the corresponding batch to take longer to process. Hence, the interval of checkpointing needs to be set carefully. At small batch sizes (say 1 second), checkpointing every batch may significantly reduce operation throughput. Conversely, checkpointing too infrequently causes the lineage and task sizes to grow, which may have detrimental effects. Typically, a checkpoint interval of 5 - 10 times the sliding interval of a DStream is a good setting to try. Metadata checkpointing - saving the information defining the streaming computation to fault-tolerant storage like HDFS. This is used to recover from failure of the node running the driver of the streaming application (discussed in detail later). Metadata includes: configuration - the configuration that was used to create the streaming application; DStream operations - the set of DStream operations that define the streaming application; incomplete batches - batches whose jobs are queued but have not completed yet. Data checkpointing - saving the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increase in recovery time (proportional to the dependency chain), intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (e.g. HDFS) to cut off the dependency chains.
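  As a small sketch (assuming an existing JavaStreamingContext named jssc; the HDFS path is a hypothetical placeholder), checkpointing is enabled by pointing the context at a fault-tolerant directory:

  // Enables metadata and data checkpointing for the streaming application
  jssc.checkpoint("hdfs://namenode:8020/spark/checkpoints");   // hypothetical directory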
  27. http://www.ipponusa.com/spark-storm-spring-xd-comparison/
  28. Spark SQL supports two different methods for converting existing RDDs into SchemaRDDs. The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection based approach leads to more concise code and works well when you already know the schema while writing your Spark application. The second method for creating SchemaRDDs is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct SchemaRDDs when the columns and their types are not known until runtime.
  29. Spark SQL supports automatically converting an RDD of JavaBeans into a SchemaRDD. The BeanInfo, obtained using reflection, defines the schema of the table. Currently, Spark SQL does not support JavaBeans that contain nested or complex types such as Lists or Arrays. You can create a JavaBean by creating a class that implements Serializable and has getters and setters for all of its fields. (JavaBeans are classes that encapsulate many objects into a single object, the bean.)
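  A sketch of the reflection-based approach (assuming an existing JavaSparkContext named sc, a hypothetical comma-separated input file people.txt, and a hypothetical Person bean):

  import java.io.Serializable;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.function.Function;
  import org.apache.spark.sql.api.java.JavaSQLContext;
  import org.apache.spark.sql.api.java.JavaSchemaRDD;

  // A hypothetical JavaBean, declared as a nested class of the application:
  // Serializable, with getters and setters for every field
  public static class Person implements Serializable {
    private String name;
    private int age;
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
  }

  JavaSQLContext sqlContext = new JavaSQLContext(sc);

  // Build an RDD of beans from a text file of "name,age" lines
  JavaRDD<Person> people = sc.textFile("people.txt").map(new Function<String, Person>() {
    public Person call(String line) {
      String[] parts = line.split(",");
      Person person = new Person();
      person.setName(parts[0]);
      person.setAge(Integer.parseInt(parts[1].trim()));
      return person;
    }
  });

  // The schema is inferred from the bean via reflection
  JavaSchemaRDD schemaPeople = sqlContext.applySchema(people, Person.class);
  schemaPeople.registerTempTable("people");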
  30. When JavaBean classes cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a SchemaRDD can be created programmatically with three steps: create an RDD of Rows from the original RDD; create the schema, represented by a StructType, matching the structure of the Rows in the RDD created in step 1; apply the schema to the RDD of Rows via the applySchema method provided by JavaSQLContext.
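  A sketch of those three steps against the Spark 1.2 Java SQL API (assuming an existing JavaSparkContext sc, an existing JavaSQLContext sqlContext, and a hypothetical file people.txt of "name,age" lines):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.function.Function;
  import org.apache.spark.sql.api.java.DataType;
  import org.apache.spark.sql.api.java.JavaSchemaRDD;
  import org.apache.spark.sql.api.java.Row;
  import org.apache.spark.sql.api.java.StructField;
  import org.apache.spark.sql.api.java.StructType;

  // Step 1: create an RDD of Rows from the original RDD of text lines
  JavaRDD<Row> rowRDD = sc.textFile("people.txt").map(new Function<String, Row>() {
    public Row call(String line) {
      String[] parts = line.split(",");
      return Row.create(parts[0], parts[1].trim());
    }
  });

  // Step 2: build a StructType that matches the structure of the Rows (two string columns here)
  List<StructField> fields = new ArrayList<StructField>();
  fields.add(DataType.createStructField("name", DataType.StringType, true));
  fields.add(DataType.createStructField("age", DataType.StringType, true));
  StructType schema = DataType.createStructType(fields);

  // Step 3: apply the schema to the RDD of Rows
  JavaSchemaRDD peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema);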
  31. Spark SQL can automatically infer the schema of a JSON dataset and load it as a JavaSchemaRDD. This conversion can be done using one of two methods in a JavaSQLContext: jsonFile - loads data from a directory of JSON files where each line of the files is a JSON object; jsonRDD - loads data from an existing RDD where each element of the RDD is a string containing a JSON object. Note that the file that is offered as jsonFile is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
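  A small sketch of both methods (assuming an existing JavaSparkContext sc and JavaSQLContext sqlContext; the file path and the JSON record are hypothetical examples):

  import java.util.Arrays;
  import org.apache.spark.sql.api.java.JavaSchemaRDD;

  // jsonFile: each line of the files under the path must be a self-contained JSON object
  JavaSchemaRDD fromFile = sqlContext.jsonFile("examples/people.json");

  // jsonRDD: build a JavaSchemaRDD from an in-memory RDD of JSON strings
  JavaSchemaRDD fromRdd = sqlContext.jsonRDD(
      sc.parallelize(Arrays.asList("{\"name\":\"Yin\",\"address\":{\"city\":\"Columbus\"}}")));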