This session covers how to unit test Spark applications and recommends a better way of doing it. It includes writing unit tests both with and without the Spark Testing Base package, a Spark package that provides base classes to use when writing tests with Spark.
Unit Testing of Spark Applications
1. Unit Testing of Spark Applications
Himanshu Gupta
Sr. Software Consultant
Knoldus Software LLP
2. Agenda
● What is Spark?
● What is Unit Testing?
● Why do we need Unit Testing?
● Unit Testing of Spark Applications
● Demo
3. What is Spark?
● Distributed compute engine for large-scale data processing.
● Up to 100x faster than Hadoop MapReduce (in memory).
● Provides APIs in Python, Scala, Java and R (R support since Spark 1.4).
● Combines SQL, streaming and complex analytics.
● Runs on Hadoop, Mesos, or in the cloud.
src: http://spark.apache.org/
4. What is Unit Testing?
● Unit Testing is a software testing method in which individual units of source code are tested to determine whether they are fit for use.
● Unit tests ensure that code meets its design specification and behaves as intended.
● The goal is to isolate each part of the program and show that the individual parts are correct.
src: https://en.wikipedia.org/wiki/Unit_testing
5. Why do we need Unit Testing?
● Find problems early
- Finds bugs or missing parts of the specification early in the development cycle.
● Facilitates change
- Helps in refactoring and upgrading without worrying about breaking functionality.
● Simplifies integration
- Makes integration tests easier to write.
● Documentation
- Provides living documentation of the system.
● Design
- Can act as the formal design of the project.
src: https://en.wikipedia.org/wiki/Unit_testing
6. Unit Testing of Spark Applications
7. Unit to Test
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class WordCount {
  // Reads a text file and returns an RDD of (word, count) pairs.
  def get(url: String, sc: SparkContext): RDD[(String, Int)] = {
    val lines = sc.textFile(url)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
  }
}
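For reference, a minimal driver that exercises this unit against a local SparkContext could look like the sketch below (the object name, app name and master URL are assumptions for illustration):

import org.apache.spark.{ SparkConf, SparkContext }

object WordCountDriver extends App {
  // Local SparkContext for a quick manual check (assumed settings).
  val sc = new SparkContext(new SparkConf().setAppName("word-count").setMaster("local"))

  // For an input file containing "to be or not to be", this prints pairs
  // such as ("to", 2), ("be", 2), ("or", 1), ("not", 1).
  new WordCount().get("file.txt", sc).collect().foreach(println)

  sc.stop()
}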
8. Method 1
import org.scalatest.{ BeforeAndAfterAll, FunSuite }
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

class WordCountTest extends FunSuite with BeforeAndAfterAll {
  private var sparkConf: SparkConf = _
  private var sc: SparkContext = _

  // Create a local SparkContext before any test runs.
  override def beforeAll() {
    sparkConf = new SparkConf().setAppName("unit-testing").setMaster("local")
    sc = new SparkContext(sparkConf)
  }

  private val wordCount = new WordCount

  test("get word count rdd") {
    val result = wordCount.get("file.txt", sc)
    assert(result.take(10).length === 10)
  }

  // Stop the SparkContext after all tests have finished.
  override def afterAll() {
    sc.stop()
  }
}
9. Cons of Method 1
● Explicit management of SparkContext creation and destruction.
● The developer has to write more lines of code for testing.
● Code duplication, as the before and after steps have to be repeated in every test suite (see the sketch after this list).
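One way to remove that duplication by hand is to factor the setup and teardown into a shared trait that every suite mixes in; a rough sketch (trait name chosen here for illustration):

import org.scalatest.{ BeforeAndAfterAll, Suite }
import org.apache.spark.{ SparkConf, SparkContext }

// Hypothetical helper trait: creates one local SparkContext per suite
// and stops it afterwards, so tests only refer to sc.
trait LocalSparkContext extends BeforeAndAfterAll { self: Suite =>
  @transient protected var sc: SparkContext = _

  override def beforeAll() {
    super.beforeAll()
    sc = new SparkContext(new SparkConf().setAppName("unit-testing").setMaster("local"))
  }

  override def afterAll() {
    if (sc != null) sc.stop()
    super.afterAll()
  }
}

This is essentially what the spark-testing-base library packages up for us, which leads to Method 2.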
10. Method 2 (Better Way)
"com.holdenkarau" %% "spark-testing-base" % "1.6.1_0.3.2"
Spark Testing Base
A Spark package containing base classes to use when writing tests with Spark.
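To use it, the dependency is typically added to build.sbt with test scope; a sketch, assuming an sbt project whose Spark and Scala versions match the artifact:

// build.sbt (sketch)
libraryDependencies += "com.holdenkarau" %% "spark-testing-base" % "1.6.1_0.3.2" % "test"

// Tests start local Spark contexts, so it is common to disable
// parallel test execution in the same JVM.
parallelExecution in Test := false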
How?
11. Method 2 (Better Way) contd...
import org.scalatest.FunSuite
import com.holdenkarau.spark.testing.SharedSparkContext

class WordCountTest extends FunSuite with SharedSparkContext {
  private val wordCount = new WordCount

  test("get word count rdd") {
    val result = wordCount.get("file.txt", sc)
    assert(result.take(10).length === 10)
  }
}
Example 1
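Note: the sc used above is supplied by the SharedSparkContext trait, which creates a local SparkContext once for the suite and stops it after all tests, so no explicit setup or teardown code is needed.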
12. Method 2 (Better Way) contd...
import org.scalatest.FunSuite
import com.holdenkarau.spark.testing.SharedSparkContext
import com.holdenkarau.spark.testing.RDDComparisons

class WordCountTest extends FunSuite with SharedSparkContext {
  private val wordCount = new WordCount

  test("get word count rdd with comparison") {
    val expected =
      sc.textFile("file.txt")
        .flatMap(_.split(" "))
        .map((_, 1))
        .reduceByKey(_ + _)
    val result = wordCount.get("file.txt", sc)
    assert(RDDComparisons.compare(expected, result).isEmpty)
  }
}
Example 2
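Note: RDDComparisons.compare reports the differences between the two RDDs, so an empty result means expected and result contain the same elements; the assertion therefore passes only when the word counts match.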
13. Pros of Method 2
● Succinct code.
● Rich Test API.
● Supports Scala, Java and Python.
● Provides an API for testing Spark Streaming applications too (see the sketch after this list).
● Has in-built RDD comparators.
● Supports both Local & Cluster mode testing.
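As an illustration of the Streaming support mentioned above, a minimal sketch using the library's StreamingSuiteBase could look like this (the input batch and expected counts are made up for the example):

import org.scalatest.FunSuite
import com.holdenkarau.spark.testing.StreamingSuiteBase
import org.apache.spark.streaming.dstream.DStream

class WordCountStreamingTest extends FunSuite with StreamingSuiteBase {
  // Counts words in each batch of a DStream (same logic as WordCount.get).
  def count(lines: DStream[String]): DStream[(String, Int)] =
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

  test("count words in a single batch") {
    val input = List(List("to be or not to be"))
    val expected = List(List(("to", 2), ("be", 2), ("or", 1), ("not", 1)))
    // testOperation runs the operation over the batches and compares output;
    // ordered = false ignores element order within each batch.
    testOperation(input, count _, expected, ordered = false)
  }
}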
14. When to use What?
Method 1
● For small-scale Spark applications.
● When the extended capabilities of spark-testing-base are not needed.
● For sample applications.
Method 2
● For large-scale Spark applications.
● When cluster-mode or performance testing is required.
● For production applications.