Welcome to
Hands-on Session on Big Data
processing using Apache Spark
reachus@cloudxlab.com
CloudxLab.com
+1 419 665 3276 (US)
+91 803 959 1464 (IN)
Agenda
1 Apache Spark Introduction
2 CloudxLab Introduction
3 Introduction to RDD (Resilient Distributed Datasets)
4 Loading data into an RDD
5 RDD Operations Transformation
6 RDD Operations Actions
7 Hands-on demos using CloudxLab
8 Questions and Answers
Hands-On: Objective
Compute the word frequency of a text file stored in
HDFS - Hadoop Distributed File System
Using
Apache Spark
Welcome to CloudxLab Session
• Learn Through Practice
• Real Environment
• Connect From Anywhere
• Connect From Any Device
A cloud-based lab for students to gain hands-on
experience in Big Data technologies such as
Hadoop and Spark
• Centralized Data sets
• No Installation
• No Compatibility Issues
• 24x7 Support
About the Instructor
2015: CloudxLab. A big data platform.
2014: KnowBigData founded.
2012-2014: Amazon. Built high-throughput systems for the Amazon.com site using an in-house NoSQL store.
2012: InMobi. Built a recommender system that churns 200 TB.
2006-2011: tBits Global. Founded tBits Global; built an enterprise-grade document management system.
2002-2006: D.E. Shaw. Built big data systems before the term was coined.
2002: IIT Roorkee. Finished B.Tech.
Apache Spark
A fast and general engine for large-scale data
processing.
• Really fast MapReduce
• 100x faster than Hadoop MapReduce in memory,
• 10x faster on disk.
• Builds on similar paradigms as MapReduce
• Integrated with Hadoop
Spark Architecture
Spark Core
Cluster managers: Standalone, Amazon EC2, Hadoop YARN, Apache Mesos
Storage: HDFS, HBase, Hive, Tachyon, ...
Libraries: SQL, Streaming, MLlib, GraphX, SparkR
Languages: Java, Python, Scala
Getting Started - Launching the console
Open CloudxLab.com to get login/password
Log in to the console
Or
• Download
• http://spark.apache.org/downloads.html
• Install Python
• (optional) Install Hadoop
Run pyspark
SPARK - CONCEPTS - RESILIENT DISTRIBUTED DATASET
A collection of elements partitioned across the cluster
• RDDs can be persisted in memory
• RDDs auto-recover from node failures
• Can hold any data type, with a special dataset type for key-value pairs
• Supports two types of operations: transformations and actions
• RDDs are read-only
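Transformations being lazy and actions being eager is the key idea here. A plain-Python generator (an analogy only, not Spark) gives a feel for it: nothing runs until a terminal operation consumes the result.

```python
# Analogy only: like RDD transformations, generators do no work
# until a terminal operation consumes them (similar to an action).
data = [1, 3, 5, 6, 19, 21]

doubled = (x * 2 for x in data)   # nothing computed yet ("transformation")
result = list(doubled)            # computation happens here ("action")

print(result)  # [2, 6, 10, 12, 38, 42]
```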
Convert an existing array into RDD:
myarray = sc.parallelize([1, 3, 5, 6, 19, 21])
SPARK - CONCEPTS - RESILIENT DISTRIBUTED DATASET
Load a file from HDFS:
lines = sc.textFile('/data/mr/wordcount/input/big.txt')
Check the first 10 lines:
lines.take(10)
# Triggers the actual execution: loads the file and returns 10 lines.
SPARK - TRANSFORMATIONS
Transformations are lazy; to reuse an RDD across computations, keep it around:
persist(): store the RDD at a chosen storage level once computed
cache(): shorthand for persist() with the default memory-only storage level
SPARK - TRANSFORMATIONS
map(func)
Return a new distributed dataset formed by passing each
element of the source through the function func. Analogous to
FOREACH in Pig.
filter(func)
Return a new dataset formed by selecting those elements of
the source on which func returns true.
flatMap(func)
Similar to map, but each input item can be mapped to 0 or
more output items.
groupByKey([numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of
(K, Iterable<V>) pairs.
See more: sample, union, intersection, distinct, reduceByKey, sortByKey, join
https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html
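The builtin equivalents of these transformations on an ordinary list give a feel for what each one returns. This is plain Python on a tiny made-up dataset, not Spark:

```python
from itertools import groupby

data = ["apple pie", "banana", "apple tart"]

# map: one output element per input element
lengths = [len(s) for s in data]                     # [9, 6, 10]

# filter: keep elements where the predicate is true
apples = [s for s in data if s.startswith("apple")]  # ['apple pie', 'apple tart']

# flatMap: each input can yield zero or more outputs, flattened into one list
words = [w for s in data for w in s.split()]
# ['apple', 'pie', 'banana', 'apple', 'tart']

# groupByKey: (K, V) pairs -> (K, iterable of Vs); groupby needs sorted input
pairs = sorted((w, 1) for w in words)
grouped = {k: [v for _, v in g] for k, g in groupby(pairs, key=lambda p: p[0])}
# {'apple': [1, 1], 'banana': [1], 'pie': [1], 'tart': [1]}
```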
SPARK - Break the line into words
Define a function to convert a line into words:
def toWords(mystr):
    wordsArr = mystr.split()
    return wordsArr
Execute the flatMap() transformation:
words = lines.flatMap(toWords)
Check the first 10 words:
words.take(10)
# Triggers the actual execution and returns the first 10 words.
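Since toWords is ordinary Python, it can be sanity-checked without a cluster before handing it to flatMap:

```python
def toWords(mystr):
    wordsArr = mystr.split()  # split on whitespace
    return wordsArr

print(toWords("the archer took a bow"))  # ['the', 'archer', 'took', 'a', 'bow']
```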
SPARK - Cleaning the data
Define a function to clean a word and convert it to a key-value pair:
import re
def cleanKV(mystr):
    mystr = mystr.lower()
    mystr = re.sub("[^0-9a-z]", "", mystr)  # strip non-alphanumeric characters
    return (mystr, 1)  # return a (word, count) tuple
Execute the map() transformation:
cleanWordsKV = words.map(cleanKV)  # passing cleanKV as the argument
Check the first 10 word pairs:
cleanWordsKV.take(10)
# Triggers the actual execution and returns the first 10 pairs.
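cleanKV is also plain Python, so its behavior can be checked directly. Note why the cleaning matters: "Bow." and "bow," both collapse to the key "bow", so they count as the same word downstream.

```python
import re

def cleanKV(mystr):
    mystr = mystr.lower()
    mystr = re.sub("[^0-9a-z]", "", mystr)  # strip non-alphanumeric characters
    return (mystr, 1)  # (word, count) pair

print(cleanKV("Bow."))  # ('bow', 1)
print(cleanKV("bow,"))  # ('bow', 1) -- punctuation variants share one key
```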
SPARK - ACTIONS
Actions return a value to the driver program.
SPARK - ACTIONS
reduce(func)
Aggregate the elements of the dataset using a function that:
• takes 2 arguments and returns one
• is commutative and associative, for parallelism
count()
Return the number of elements in the dataset.
collect()
Return all elements of the dataset as an array at the driver. Use
only for small outputs.
take(n)
Return an array with the first n elements of the dataset.
Not parallel.
See More: first(), takeSample(), takeOrdered(), saveAsTextFile(path), reduceByKey()
https://spark.apache.org/docs/1.5.0/api/python/pyspark.html#pyspark.RDD
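The builtin equivalents of these actions on an ordinary list show what each one hands back to the driver. Plain Python, not Spark:

```python
from functools import reduce

data = [1, 3, 5, 6, 19, 21]

# reduce(func): two-argument, commutative and associative aggregation
total = reduce(lambda x, y: x + y, data)  # 55

# count(): number of elements
count = len(data)        # 6

# collect(): the whole dataset back at the "driver" -- keep it small
collected = list(data)

# take(n): just the first n elements
first_three = data[:3]   # [1, 3, 5]
```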
SPARK - Compute the word count
Define an aggregation function:
def sum(x, y):  # note: shadows Python's built-in sum
    return x + y
Execute the reduceByKey() transformation:
wordsCount = cleanWordsKV.reduceByKey(sum)
Check the first 10 counts:
wordsCount.take(10)
# Triggers the actual execution and returns the first 10 counts.
Save the result:
wordsCount.saveAsTextFile("mynewdirectory")
www.KnowBigData.com
INPUT: After taking a shot with his bow, the archer took a bow.

words = lines.flatMap(toWords)
→ After, taking, a, shot, with, his, bow,, the, archer, took, a, bow.

cleanWordsKV = words.map(cleanKV)
→ (after,1) (taking,1) (a,1) (shot,1) (with,1) (his,1) (bow,1) (the,1) (archer,1) (took,1) (a,1) (bow,1)

wordsCount = cleanWordsKV.reduceByKey(sum)
→ pairs sharing a key are merged by sum: (after,1) (taking,1) (a,2) (shot,1) (with,1) (his,1) (bow,2) (the,1) (archer,1) (took,1)

SAVE TO HDFS file
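The whole pipeline from this diagram can be traced in plain Python over the same sentence, with reduceByKey simulated by a dict (no Spark needed):

```python
import re

line = "After taking a shot with his bow, the archer took a bow."

# flatMap(toWords): split the line into words
words = line.split()

# map(cleanKV): lowercase, strip punctuation, pair each word with 1
pairs = [(re.sub("[^0-9a-z]", "", w.lower()), 1) for w in words]

# reduceByKey(sum): merge counts per key (simulated with a dict)
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(counts["a"])    # 2
print(counts["bow"])  # 2 -- 'bow,' and 'bow.' were cleaned to the same key
```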
Thank you.
+1 419 665 3276 (US)
+91 803 959 1464 (IN)
reachus@cloudxlab.com
Subscribe to our YouTube channel for the latest videos - https://www.youtube.com/channel/UCxugRFe5wETYA7nMH6VGyEA
