Hands-on Session on Big Data processing using Apache Spark and Hadoop Distributed File System
This is the first session in the "Apache Spark Hands-on" series.
Topics Covered
+ Introduction to Apache Spark
+ Introduction to RDD (Resilient Distributed Datasets)
+ Loading data into an RDD
+ RDD Operations - Transformation
+ RDD Operations - Actions
+ Hands-on demos using CloudxLab
1. Welcome to the Hands-on Session on Big Data Processing using Apache Spark
reachus@cloudxlab.com
CloudxLab.com
+1 419 665 3276 (US)
+91 803 959 1464 (IN)
2. Agenda
1 Apache Spark Introduction
2 CloudxLab Introduction
3 Introduction to RDD (Resilient Distributed Datasets)
4 Loading data into an RDD
5 RDD Operations - Transformations
6 RDD Operations - Actions
7 Hands-on demos using CloudxLab
8 Questions and Answers
3. Hands-On: Objective
Compute the word frequency of a text file stored in HDFS (Hadoop Distributed File System) using Apache Spark.
4. Welcome to the CloudxLab Session
A cloud-based lab for students to gain hands-on experience in Big Data technologies such as Hadoop and Spark.
• Learn Through Practice
• Real Environment
• Connect From Anywhere
• Connect From Any Device
• Centralized Data Sets
• No Installation
• No Compatibility Issues
• 24x7 Support
5. About the Instructor
2015 - CloudxLab: A big data platform.
2014 - KnowBigData: Founded.
2014 - Amazon: Built high-throughput systems for the Amazon.com site using in-house NoSQL.
2012 - InMobi: Built a recommender system that churns 200 TB of data.
2011 - tBits Global: Founded tBits Global; built an enterprise-grade Document Management System.
2006 - D.E. Shaw: Built big data systems before the term was coined.
2002 - IIT Roorkee: Finished B.Tech.
6. Apache Spark
A fast and general engine for large-scale data processing.
• Really fast MapReduce: 100x faster than Hadoop MapReduce in memory, 10x faster on disk
• Builds on paradigms similar to MapReduce
• Integrated with Hadoop
8. Getting Started - Launching the Console
Open CloudxLab.com to get your login/password and log in to the console.
Or, on your own machine:
• Download Spark from http://spark.apache.org/downloads.html
• Install Python
• (Optional) Install Hadoop
Then run pyspark, as sketched below.
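A minimal sketch of running Spark locally instead of on CloudxLab: in the pyspark shell the SparkContext is already available as sc, while in a standalone script (run with spark-submit) you create it yourself. The app name "WordCountDemo" below is only illustrative.
from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCountDemo")  # local master using all cores; app name is illustrative
print(sc.version)                               # confirm the driver is talking to Spark
sc.stop()                                       # release resources when the script is done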
9. SPARK - CONCEPTS - RESILIENT DISTRIBUTED DATASET
A collection of elements partitioned across the cluster
• RDDs can be persisted in memory
• RDDs recover automatically from node failures
• Can hold any data type, with a special dataset type for key-value pairs
• Supports two types of operations: transformations and actions
• RDDs are read-only
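A minimal sketch of these concepts, assuming the pyspark shell where sc already exists: transformations are lazy, actions trigger the computation, and cache() asks Spark to keep the RDD in memory.
nums = sc.parallelize(range(1, 11))   # an RDD of 1..10 partitioned across the cluster
squares = nums.map(lambda x: x * x)   # transformation: lazy, nothing runs yet
squares.cache()                       # persist the RDD in memory once it is computed
print(squares.count())                # action: triggers the computation -> 10
print(squares.take(3))                # action: reuses the cached RDD -> [1, 4, 9]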
10. SPARK - CONCEPTS - RESILIENT DISTRIBUTED DATASET
Convert an existing list into an RDD:
myarray = sc.parallelize([1, 3, 5, 6, 19, 21])
Load a file from HDFS:
lines = sc.textFile('/data/mr/wordcount/input/big.txt')
Check the first 10 lines:
lines.take(10)
# Triggers the actual execution: loads the data and returns the first 10 lines.
12. SPARK - TRANSFORMATIONS
map(func) - Returns a new distributed dataset formed by passing each element of the source through the function func. Analogous to FOREACH in Pig.
filter(func) - Returns a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func) - Similar to map, but each input item can be mapped to 0 or more output items.
groupByKey([numTasks]) - When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
See more: sample, union, intersection, distinct, reduceByKey, sortByKey, join
https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html
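A small sketch of the transformations above on a toy RDD, assuming the pyspark shell's sc; the sample sentences are illustrative only.
lines = sc.parallelize(["to be or not to be", "to do is to be"])
words = lines.flatMap(lambda line: line.split())  # flatMap: one line -> many words
short = words.filter(lambda w: len(w) <= 2)       # filter: keep only elements where the predicate is true
pairs = words.map(lambda w: (w, 1))               # map: word -> (word, 1) key-value pair
grouped = pairs.groupByKey()                      # groupByKey: (K, V) pairs -> (K, Iterable<V>)
print(sorted((k, len(list(v))) for k, v in grouped.collect()))
# [('be', 3), ('do', 1), ('is', 1), ('not', 1), ('or', 1), ('to', 4)]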
13. SPARK - Break the lines into words
Define a function to convert a line into words:
def toWords(mystr):
    wordsArr = mystr.split()
    return wordsArr
Execute the flatMap() transformation:
words = lines.flatMap(toWords)
Check the first 10 words:
words.take(10)
# Triggers the actual execution and returns the first 10 words.
14. SPARK - Cleaning the data
Define a function to clean a word and convert it to a key-value pair:
import re
def cleanKV(mystr):
    mystr = mystr.lower()
    mystr = re.sub("[^0-9a-z]", "", mystr)  # strip non-alphanumeric characters
    return (mystr, 1)  # return a tuple of word and count
Execute the map() transformation, passing cleanKV as the argument:
cleanWordsKV = words.map(cleanKV)
Check the first 10 (word, count) pairs:
cleanWordsKV.take(10)
# Triggers the actual execution and returns the first 10 pairs.
16. SPARK - ACTIONS
reduce(func) - Aggregates the elements of the dataset using the function func, which takes 2 arguments and returns one, and must be commutative and associative so it can be computed in parallel.
count() - Returns the number of elements in the dataset.
collect() - Returns all elements of the dataset as an array at the driver. Use only for small outputs.
take(n) - Returns an array with the first n elements of the dataset. Not parallel.
See more: first(), takeSample(), takeOrdered(), saveAsTextFile(path), reduceByKey()
https://spark.apache.org/docs/1.5.0/api/python/pyspark.html#pyspark.RDD
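A small sketch of the actions above, again assuming the pyspark shell's sc and an illustrative RDD of numbers.
nums = sc.parallelize([5, 3, 8, 1, 9, 2])
print(nums.reduce(lambda x, y: x + y))  # reduce: 2-argument, commutative and associative -> 28
print(nums.count())                     # count: number of elements -> 6
print(nums.collect())                   # collect: all elements at the driver; use for small data only
print(nums.take(3))                     # take: first 3 elements, not computed in parallel -> [5, 3, 8]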
17. SPARK - Compute the word counts
Define an aggregation function:
def sum(x, y):
    return x + y
Execute the reduceByKey() transformation:
wordsCount = cleanWordsKV.reduceByKey(sum)
Check the first 10 counts:
wordsCount.take(10)
# Triggers the actual execution and returns the first 10 (word, count) pairs.
Save the result:
wordsCount.saveAsTextFile("mynewdirectory")
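Putting the steps together, a minimal end-to-end sketch of the word-count pipeline, assuming the pyspark shell on CloudxLab (where sc exists) and the sample file used above; the output directory name is illustrative and must not already exist in HDFS.
import re

lines = sc.textFile('/data/mr/wordcount/input/big.txt')
words = lines.flatMap(lambda line: line.split())                             # break lines into words
cleanWordsKV = words.map(lambda w: (re.sub("[^0-9a-z]", "", w.lower()), 1))  # clean and pair with 1
wordsCount = cleanWordsKV.reduceByKey(lambda x, y: x + y)                    # sum the counts per word
print(wordsCount.take(10))                                                   # peek at the first 10 pairs
wordsCount.saveAsTextFile("wordcount_output")                                # illustrative output directory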
18. [Diagram: word-count data flow for the input line "After taking a shot with his bow, the archer took a bow." lines.flatMap(toWords) splits the line into words; map(cleanKV) lowercases and cleans each word into (word, 1) pairs; reduceByKey(sm) sums the counts per word, e.g. (a, 2) and (bow, 2); the result is saved to an HDFS file.]