Notes Sharing
Richard Kuo, Professional-Technical Architect,
Domain 2.0 Architecture & Planning
Agenda
• Big Data
• Overview of Spark
• Main Concepts
– RDD
– Transformations
– Programming Model
• Observation
01/06/15 Creative Commons, BY, SA, NC
What is Apache Spark?
• Fast and general cluster computing system, interoperable with Hadoop.
• Improves efficiency through:
– In-memory computing primitives
– General computational graph
• Improves usability through:
– Rich APIs in Scala, Java, Python
– Interactive shell
Big Data: Hadoop Ecosystem
Distributed Computing
Comparison with Hadoop
Hadoop                          | Spark
Map Reduce framework            | Generalized computation
Usually data is on disk (HDFS)  | On disk or in memory
Not ideal for iterative work    | Data can be cached in memory, great for iterative work
Batch processing                | Real-time streaming or batch
Additional points for Spark:
• Up to 10x faster when data is on disk
• Up to 100x faster when data is in memory
• 2-5x less code to write
• Supports Scala, Java and Python
• Code re-use across modules
• Interactive shell for ad-hoc exploration
• Library support: GraphX, Machine Learning, SQL, R, Streaming, …
Compare to Hadoop:
System performance degrades gracefully with less RAM
Execution time (s) by % of working set in cache:
• Cache disabled: 69
• 25%: 58
• 50%: 41
• 75%: 30
• Fully cached: 12
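As a rough reading aid, the execution times on this slide can be turned into speedups relative to the uncached run. This is only arithmetic on the five numbers shown; the variable names are made up for the sketch and nothing here measures Spark itself.

```python
# Execution times in seconds, copied from the figures on this slide.
times = {
    "Cache disabled": 69,
    "25%": 58,
    "50%": 41,
    "75%": 30,
    "Fully cached": 12,
}

# Speedup of each configuration relative to running with the cache disabled.
baseline = times["Cache disabled"]
speedups = {label: round(baseline / t, 2) for label, t in times.items()}

for label, factor in speedups.items():
    print(f"{label}: {factor}x")
```

On these numbers the fully cached run is a little under 6x faster than the uncached one, and the intermediate settings fall in between rather than dropping off a cliff.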
Software Components
• Spark runs as a library in your program (1 instance per app)
• Runs tasks locally or on cluster
– Mesos, YARN or standalone mode
• Accesses storage systems via Hadoop InputFormat API
– Can use HBase, HDFS, S3, …
[Diagram: Your application (SparkContext) → Local threads, or Cluster manager → Workers (Spark executor) → HDFS or other storage]
Spark Architecture
[Diagram labels: cluster manager (Spark Standalone | Mesos | YARN), Node, Client]
Key Concept: RDDs
Resilient Distributed Datasets
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure
Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
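The split between transformations and actions can be sketched with a toy class in plain Python. This is not the Spark API: ToyRDD and its internals are hypothetical names invented for illustration, and everything runs in one process. It only shows the idea that transformations record work while actions trigger it.

```python
class ToyRDD:
    """Toy stand-in for an RDD: a data source plus a chain of pending steps."""

    def __init__(self, source, steps=()):
        self._source = source        # any re-iterable, e.g. a list of lines
        self._steps = tuple(steps)   # recorded transformations, not yet run

    # Transformations: record the step and return a new ToyRDD (lazy).
    def map(self, func):
        return ToyRDD(self._source, self._steps + (("map", func),))

    def filter(self, pred):
        return ToyRDD(self._source, self._steps + (("filter", pred),))

    def _evaluate(self):
        data = iter(self._source)
        for kind, func in self._steps:
            data = map(func, data) if kind == "map" else filter(func, data)
        return data

    # Actions: run the recorded steps and return a result to the caller.
    def collect(self):
        return list(self._evaluate())

    def count(self):
        return sum(1 for _ in self._evaluate())


calls = []

def is_error(line):
    calls.append(line)               # track when the predicate actually runs
    return line.startswith("ERROR")

lines = ToyRDD(["ERROR\tdisk\tfull", "INFO\tok\tfine", "ERROR\tnet\tdown"])
errors = lines.filter(is_error)      # transformation only: nothing evaluated yet
calls_before_action = len(calls)

messages = errors.map(lambda s: s.split("\t")[2]).collect()   # action
error_count = errors.count()                                   # action
print(calls_before_action, messages, error_count)
```

The predicate is not called until collect or count runs, which is the behaviour the transformation/action distinction describes.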
Write programs in terms of operations on distributed datasets
Fault Recovery
RDDs track the series of transformations used to build them (their lineage) and use it to recompute lost data, so no data needs to be replicated across the wire.
// Scala
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()
# Python
msgs = (textFile.filter(lambda s: s.startswith("ERROR"))
                .map(lambda s: s.split("\t")[2]))
HDFS File → filter (func = startsWith(…)) → Filtered RDD → map (func = split(...)) → Mapped RDD
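Lineage-based recovery can be illustrated with a small plain-Python model. It assumes a dataset split into two partitions and a recorded filter-then-map lineage; the names (source_partitions, lineage, get_partition) are hypothetical and this is not how Spark is implemented internally, only a sketch of the idea that a lost partition is recomputed from its source rather than restored from a replica.

```python
# Source data, split into partitions (a stand-in for blocks of an HDFS file).
source_partitions = {
    0: ["ERROR\ta\tdisk full", "INFO\tb\tok"],
    1: ["ERROR\tc\tnet down", "WARN\td\tslow"],
}

# Lineage: the ordered steps used to build the derived dataset (filter -> map).
lineage = [
    ("filter", lambda s: s.startswith("ERROR")),
    ("map", lambda s: s.split("\t")[2]),
]

def compute_partition(rows, steps):
    """Replay the lineage over one source partition."""
    for kind, func in steps:
        if kind == "filter":
            rows = [r for r in rows if func(r)]
        else:
            rows = [func(r) for r in rows]
    return rows

# Materialise every partition once, as if cached in memory.
cache = {pid: compute_partition(rows, lineage)
         for pid, rows in source_partitions.items()}
original = dict(cache)

del cache[1]   # simulate losing the node that held partition 1

def get_partition(pid):
    """Return a cached partition, recomputing it from lineage if it was lost."""
    if pid not in cache:
        cache[pid] = compute_partition(source_partitions[pid], lineage)
    return cache[pid]

recovered = get_partition(1)
print(recovered == original[1])
```

Only the lost partition is recomputed; partition 0 is served from the cache untouched.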
Language Support
Standalone Programs
• Python, Scala, & Java
Interactive Shells
• Python & Scala
Performance
• Java & Scala are faster due to static typing
• …but Python is often fine
Python
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()
Scala
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()
Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();
Interactive Shell
• The fastest way to learn Spark
• Available in Python and Scala
• Runs as an application on an existing Spark cluster…
• Or can run locally
DEMO
Transformation
Spark Streaming
Spark Streaming: Word Count
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }
    // Helper from Spark's bundled streaming examples; drop this line when running outside that package
    StreamingExamples.setStreamingLogLevels()

    // Create the context with a 1 second batch size
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1))

    // Read lines of text from a TCP socket at <hostname>:<port>
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    // Split each line into words, then count each word within the batch
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    // Print the first few counts of every batch
    wordCounts.print()

    // Start the computation and block until it is stopped
    ssc.start()
    ssc.awaitTermination()
  }
}
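The per-batch logic of the word count can be tried without a cluster using plain Python. The sketch below is a single-process imitation, not Spark Streaming: each inner list stands in for the lines received during one batch interval, and word_counts is a hypothetical helper that mirrors the flatMap, map and reduceByKey steps.

```python
from collections import Counter

def word_counts(batch_lines):
    """Count words in one micro-batch, mirroring flatMap -> map -> reduceByKey."""
    words = [w for line in batch_lines for w in line.split(" ")]   # flatMap
    pairs = [(w, 1) for w in words]                                # map
    counts = Counter()
    for word, n in pairs:                                          # reduceByKey(_ + _)
        counts[word] += n
    return dict(counts)

# Each inner list plays the role of the lines that arrived in one batch.
batches = [
    ["spark is fast", "spark streaming"],
    ["hello spark"],
]

results = [word_counts(batch) for batch in batches]
for i, counts in enumerate(results):
    print(f"batch {i}: {sorted(counts.items())}")
```

As in the Scala program, counts are computed independently per batch; nothing is carried over from one batch to the next.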
Create Spark Context
Create, map, reduce
Output
Start
Analytics
Conclusion
• Spark offers a rich API that makes data analytics fast: less code to write and faster to run.
• Achieves speedups of up to 100x when data is in memory.
• Growing community.
Observations:
• More data, of more kinds, is generated faster and needs to be analyzed in real time.
• All* products are data products.
• More complicated analytic algorithms are being applied to commercial products and services.
• Not all data analysis requires the same accuracy.
• Expectations for service delivery are increasing.
Reference:
• AMPLab at UC Berkeley
• Databricks
• UC BerkeleyX
– CS100.1x Introduction to Big Data with Apache Spark, starts 23 Feb 2015, 5 weeks
– CS190.1x Scalable Machine Learning, starts 14 Apr 2015, 5 weeks
• Spark Summit 2014 Training
• Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
• An Architecture for Fast and General Data Processing on Large Clusters
• Richard’s Study Notes
– Self Study AMPCamp
– Hortonworks HDP 2.2 Study
Real Time Object Detection Using Open CV
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 

Spark Study Notes

  • 1. Notes Sharing Richard Kuo, Professional-Technical Architect, Domain 2.0 Architecture & Planning
  • 2. Agenda
      • Big Data
      • Overview of Spark
      • Main Concepts
          – RDD
          – Transformations
          – Programming Model
      • Observation
  • 3. What is Apache Spark?
      • Fast and general cluster computing system, interoperable with Hadoop.
      • Improves efficiency through:
          – In-memory computing primitives
          – General computational graph
      • Improves usability through:
          – Rich APIs in Scala, Java, Python
          – Interactive shell
  • 4. Big Data: Hadoop Ecosystem
  • 6. Comparison with Hadoop
      Hadoop                          | Spark
      Map Reduce framework            | Generalized computation
      Usually data is on disk (HDFS)  | On disk or in memory
      Not ideal for iterative work    | Data can be cached in memory, great for iterative work
      Batch processing                | Real-time streaming or batch
      Spark also offers:
      • Up to 10x faster when data is on disk
      • Up to 100x faster when data is in memory
      • 2-5x less code to write
      • Supports Scala, Java and Python
      • Code re-use across modules
      • Interactive shell for ad-hoc exploration
      • Library support: GraphX, Machine Learning, SQL, R, Streaming, …
  • 8. Compare to Hadoop:
  • 9. System performance degrades gracefully with less RAM
      Chart: execution time (s) by % of working set in cache
      • Cache disabled: 69
      • 25%: 58
      • 50%: 41
      • 75%: 30
      • Fully cached: 12
  • 10. Software Components
      • Spark runs as a library in your program (1 instance per app)
      • Runs tasks locally or on cluster
          – Mesos, YARN or standalone mode
      • Accesses storage systems via Hadoop InputFormat API
          – Can use HBase, HDFS, S3, …
      Diagram labels: Your application (SparkContext); Local threads; Cluster manager; Workers, each running a Spark executor; HDFS or other storage
  • 11. Spark Architecture
      Diagram labels: [Spark Standalone | Mesos | YARN]; Node; Client
  • 12. Key Concept: RDDs
      Write programs in terms of operations on distributed datasets.
      Resilient Distributed Datasets
      • Collections of objects spread across a cluster, stored in RAM or on disk
      • Built through parallel transformations
      • Automatically rebuilt on failure
      Operations
      • Transformations (e.g. map, filter, groupBy)
      • Actions (e.g. count, collect, save)
  • 13. Fault Recovery
      RDDs track the series of transformations used to build them (their lineage) and use it to recompute lost data, so no data has to be replicated across the network.

      Scala:
          val lines = sc.textFile(...)
          lines.filter(x => x.contains("ERROR")).count()

      Python:
          msgs = textFile.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])

      Lineage: HDFS File → filter (func = startsWith(…)) → Filtered RDD → map (func = split(...)) → Mapped RDD
  • 14. Language Support
      Standalone Programs
      • Python, Scala, & Java
      Interactive Shells
      • Python & Scala
      Performance
      • Java & Scala are faster due to static typing
      • …but Python is often fine

      Python:
          lines = sc.textFile(...)
          lines.filter(lambda s: "ERROR" in s).count()

      Scala:
          val lines = sc.textFile(...)
          lines.filter(x => x.contains("ERROR")).count()

      Java:
          JavaRDD<String> lines = sc.textFile(...);
          lines.filter(new Function<String, Boolean>() {
            // call() must be public to implement the Function interface
            public Boolean call(String s) {
              return s.contains("ERROR");
            }
          }).count();
  • 15. Interactive Shell
      • The fastest way to learn Spark
      • Available in Python and Scala
      • Runs as an application on an existing Spark cluster…
      • Or can run locally
  • 18. Spark Streaming
  • 19. Spark Streaming
  • 20. Spark Streaming: Word Count

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.StreamingContext._
      import org.apache.spark.storage.StorageLevel

      object NetworkWordCount {
        def main(args: Array[String]): Unit = {
          if (args.length < 2) {
            System.err.println("Usage: NetworkWordCount <hostname> <port>")
            System.exit(1)
          }

          // Helper from the Spark examples package
          StreamingExamples.setStreamingLogLevels()

          // Create Spark context: streaming context with a 1 second batch size
          val sparkConf = new SparkConf().setAppName("NetworkWordCount")
          val ssc = new StreamingContext(sparkConf, Seconds(1))

          // Create, map, reduce: read lines from the socket, split into words, count per word
          val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
          val words = lines.flatMap(_.split(" "))
          val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

          // Output: print the counts of each batch
          wordCounts.print()

          // Start the computation and wait for it to terminate
          ssc.start()
          ssc.awaitTermination()
        }
      }
  • 22. Conclusion
      • Spark offers a rich API to make data analytics fast: both less code to write and fast to run.
      • Achieves up to 100x speedups when data is in memory.
      • Growing community.
  • 23. Observations:
      • There is a lot of data, of different kinds, generated ever faster, and it needs to be analyzed in real time.
      • All* products are data products.
      • More complicated analytic algorithms are being applied to commercial products and services.
      • Not all data analysis requires the same accuracy.
      • Expectations for service delivery are increasing.
  • 24. Reference:
      • AMPLab at UC Berkeley
      • Databricks
      • UC BerkeleyX
          – CS100.1x Introduction to Big Data with Apache Spark, starts 23 Feb 2015, 5 weeks
          – CS190.1x Scalable Machine Learning, starts 14 Apr 2015, 5 weeks
      • Spark Summit 2014 Training
      • Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
      • An Architecture for Fast and General Data Processing on Large Clusters
      • Richard’s Study Notes
          – Self Study AMPCamp
          – Hortonworks HDP 2.2 Study
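
The lineage idea from slides 12 and 13 can be sketched in plain Python. This is a toy model, not the Spark API: `ToyRDD`, `lose_cache`, `read_log` and the sample log lines are all hypothetical names invented for illustration. It assumes only what the slides state, namely that transformations are recorded rather than executed, that actions trigger the computation, and that lost data is rebuilt by replaying the recorded transformations against the source.

```python
# Toy model of an RDD in plain Python (hypothetical, NOT the Spark API).


class ToyRDD:
    """A dataset described by its source plus the transformations applied to it."""

    def __init__(self, source, lineage=()):
        self._source = source    # callable that re-reads the base data
        self._lineage = lineage  # tuple of ("map" | "filter", function) steps
        self._cache = None       # materialized result; may be lost

    # Transformations are lazy: they only extend the lineage.
    def map(self, fn):
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self._source, self._lineage + (("filter", fn),))

    def _compute(self):
        # Replay every recorded step against the base data.
        data = list(self._source())
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

    # Actions trigger the computation (or reuse the cached result).
    def collect(self):
        if self._cache is None:
            self._cache = self._compute()
        return list(self._cache)

    def count(self):
        return len(self.collect())

    def lose_cache(self):
        """Simulate losing the in-memory copy, e.g. after a worker failure."""
        self._cache = None


# Hypothetical tab-separated log lines standing in for an HDFS file.
log_lines = [
    "ERROR\tdisk\tdisk full",
    "INFO\tnet\tlink up",
    "ERROR\tnet\ttimeout",
]
reads = []  # one entry per read of the base data


def read_log():
    reads.append(1)
    return log_lines


lines = ToyRDD(read_log)
msgs = lines.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])
reads_before_action = len(reads)  # transformations alone read nothing

first = msgs.collect()   # action: computes from the source
msgs.lose_cache()        # simulated failure
second = msgs.collect()  # rebuilt from lineage, no replica needed
```

In this sketch the recovery cost is a second read of the source, which is the trade the slide describes: recomputation instead of replication.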

Editor's notes

  1. MPI (Message Passing Interface)
  2. http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf
  3. Gracefully
  4. The barrier to entry for working with the spark API is minimal
  5. (word, 1L) reduceByKey(_ + _)
  6. From http://spark.apache.org/docs/latest/streaming-programming-guide.html
     Usage: NetworkWordCount <hostname> <port>
     To run this on your local machine, you need to first run a Netcat server
         $ nc -lk 9999
     and then run the example
         $ bin/run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999
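
To see what the word count on slide 20 computes for each one-second batch without a cluster or a Netcat server, the same three steps can be mimicked in plain Python. This is only a sketch of the data flow, not Spark Streaming: `word_count` and the sample `batches` are hypothetical, and each inner list stands in for the lines received during one batch interval.

```python
from collections import Counter


def word_count(batch_lines):
    """Count words in one batch, mirroring flatMap -> map -> reduceByKey."""
    # flatMap(_.split(" ")): one flat list of words from all lines
    words = [w for line in batch_lines for w in line.split(" ")]
    # map(x => (x, 1)): pair every word with a count of 1
    pairs = [(w, 1) for w in words]
    # reduceByKey(_ + _): sum the counts per word
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


# Hypothetical input: two batches of lines, as if read from the socket.
batches = [
    ["spark is fast", "spark streams"],
    ["hello spark"],
]
results = [word_count(batch) for batch in batches]
```

Each batch is counted independently, which matches the slide's example: it prints per-batch counts rather than a running total.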