Introduction to Apache Spark
Hubert 范姜 @hubert
HadoopCon Taiwan 2015
Sep. 19, 2015, Taipei
Photo from http://quotesgram.com/spark-quotes/
Who are we?
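• 亦思科技
• Located in Hsinchu Science Park
• Main customers to date have been the park's major manufacturers
• 2010.07 Admitted to Hsinchu Science Park on the basis of an investment plan to develop cloud computing software tools
• 2011 Began industry-academia collaboration with Prof. Yeh-Ching Chung of the Department of Computer Science, National Tsing Hua University
• One of the few companies invited to take part in IEEE CloudCom, the international cloud computing conference
• One of the few IT vendors with hands-on experience helping customers deploy Hadoop systems
• 2012.01 JackHare (ANSI SQL JDBC Driver)
• 2012.11 HareDB HBase Client
• 2013.08 Hare (High Speed Query in HBase)
• 2013.12 Won the Science Park Innovative Product Award
• 2014.12 Won the IT Month Innovation Gold Award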
HareDB Arch.
[Architecture diagram. Components shown: Hadoop, HBase, Hive, Spark; HareDB Core; HBase Client, HDFS Client; Indexing: Solr Cloud; Security: Kerberos, Sentry; Restful Service, JDBC/ODBC, Cluster Monitor]
WHAT IS SPARK ?
What is Apache Spark ?
• It is an open source cluster computing framework.
• In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's multi-stage in-memory primitives provide performance up to 100 times faster for certain applications.
Databricks
• Founded in late 2013
• By the creators of Apache Spark
• Original team from UC Berkeley AMPLab (Algorithms, Machines, People)
• Contributed more than 75% of the code
added to Spark in 2014
World Record
From Spark Summit 2015, Matei Zaharia
Spark is hot !
From Spark Summit 2015, Matei Zaharia
WILL SPARK REPLACE HADOOP ?
From Spark Summit 2015, Mike Olson (Cloudera),
http://www.slideshare.net/SparkSummit/spark-in-the-hadoop-ecosystem-mike-olson
From http://hortonworks.com/blog/apache-spark-yarn-ready-hortonworks-data-platform/
&& Spark Summit 2015, Arun C. Murthy
From Spark Summit 2015, Anil Gadre (MapR),
http://www.slideshare.net/SparkSummit/spark-in-the-hadoop-ecosystem-mike-olson
http://www.slideshare.net/SparkSummit/intro-to-spark-development
Spark Software Stack
Resource Virtualization: Mesos, Hadoop Yarn
Storage: HDFS, S3
Processing Engine: Spark Core
Access and Interfaces: Spark Streaming, Spark SQL, Spark R, GraphX, MLlib, Splash, ML Pipelines, BlinkDB
Physical Bottleneck
VS: 10X ~ 100X
300 MB/s
600 MB/s
10 GB/s
Nodes in the same rack: 1 Gb/s = 125 MB/s
Nodes in another rack: 0.1 Gb/s = 12.5 MB/s
1. Keep data in memory
2. Keep data local
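The link speeds on this slide can be turned into rough transfer times with simple arithmetic. The sketch below is illustrative only: the helper names and the 1 GB payload are assumptions, not part of the talk. It uses decimal units (8 bits per byte, 1 Gb/s = 1000 Mb/s), which is how 1 Gb/s becomes 125 MB/s.

```python
def mbit_to_mbyte_per_s(mbit_per_s):
    """Convert a link speed from megabits/s to megabytes/s (8 bits per byte)."""
    return mbit_per_s / 8


def transfer_seconds(size_mb, mbyte_per_s):
    """Ideal time to move size_mb megabytes at the given rate (no overhead)."""
    return size_mb / mbyte_per_s


# Link speeds from the slide, written in Mb/s: 1 Gb/s and 0.1 Gb/s.
same_rack = mbit_to_mbyte_per_s(1000)
other_rack = mbit_to_mbyte_per_s(100)

payload_mb = 1000  # hypothetical 1 GB block, chosen only for illustration

for label, rate in [("same rack", same_rack), ("another rack", other_rack)]:
    seconds = transfer_seconds(payload_mb, rate)
    print(f"{label}: {rate} MB/s, about {seconds:.0f} s per GB")
```

Under these assumptions a cross-rack read takes ten times as long as a same-rack read, which is the motivation for keeping computation close to the data.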
Spark execution flow
http://www.codeproject.com/Articles/1023037/Introduction-to-Apache-Spark
RDD
http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf
“The main abstraction in Spark is that of a resilient distributed dataset (RDD), which represents a read-only
collection of objects partitioned across a set of machines
that can be rebuilt if a partition is lost.
Users can explicitly cache an RDD in memory across
machines and reuse it in multiple MapReduce-like parallel
operations.
RDDs achieve fault tolerance through a notion of lineage:
if a partition of an RDD is lost, the RDD has enough
information about how it was derived from other RDDs to
be able to rebuild just that partition.”
What is RDD ?
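The lineage idea in the quote can be illustrated with a toy, single-process model. This is not the Spark API: the `ToyRDD` class and its methods are hypothetical names invented for this sketch, and the base data is assumed to sit in stable storage. Each derived dataset remembers how to compute a partition from its parent, so a lost partition can be rebuilt on its own.

```python
class ToyRDD:
    """Toy stand-in for an RDD: partitions plus the recipe to rebuild them."""

    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self._compute = compute      # lineage: how to derive partition i
        self._materialized = {}      # partition index -> computed list

    @classmethod
    def from_partitions(cls, partitions):
        # Assumption: the base data is always available (stable storage).
        stored = [list(p) for p in partitions]
        return cls(len(stored), lambda i: list(stored[i]))

    def map(self, fn):
        parent = self
        return ToyRDD(parent.num_partitions,
                      lambda i: [fn(x) for x in parent.partition(i)])

    def partition(self, i):
        # Compute the partition only if it is missing.
        if i not in self._materialized:
            self._materialized[i] = self._compute(i)
        return self._materialized[i]

    def lose_partition(self, i):
        # Simulate a machine failure that drops one partition.
        self._materialized.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions)
                for x in self.partition(i)]


base = ToyRDD.from_partitions([[1, 2], [3, 4], [5, 6]])
squares = base.map(lambda x: x * x)
print(squares.collect())

squares.lose_partition(1)        # partitions 0 and 2 stay materialized
rebuilt = squares.partition(1)   # only partition 1 is recomputed
print(rebuilt)
```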
Interactive Shell (Scala & Python only)
# Read a local txt file in Python
linesRDD = sc.textFile("/path/to/README.md")
// Read a local txt file in Scala
val linesRDD = sc.textFile("/path/to/README.md")
// Read a local txt file in Java
JavaRDD<String> lines = sc.textFile("/path/to/README.md");
Read From TextFile
[Diagram: an RDD of 25 items, item-1 through item-25, divided into partitions distributed across boxes labeled "Ex" and "W"]
more partitions = more parallelism
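A minimal sketch of why more partitions allow more parallelism, using the 25 items from this slide. It is plain Python, not Spark: `split_into_partitions` is a hypothetical helper, the choice of five partitions is arbitrary, and the thread pool only stands in for executors working on separate partitions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(items, num_partitions):
    """Split items into num_partitions contiguous, near-equal chunks."""
    size, extra = divmod(len(items), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions


items = [f"item-{n}" for n in range(1, 26)]
partitions = split_into_partitions(items, 5)

# One task per partition: the number of partitions caps how many
# workers can be busy at the same time.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    counts = list(pool.map(len, partitions))

print(counts)
```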
Where is RDD ?
http://www.slideshare.net/SparkSummit/intro-to-spark-development
logLinesRDD (input/base RDD):
Error, ts, msg1
Warn, ts, msg2
Error, ts, msg1
Info, ts, msg8
Warn, ts, msg2
Info, ts, msg8
Error, ts, msg3
Info, ts, msg5
Info, ts, msg5
Error, ts, msg4
Warn, ts, msg9
Error, ts, msg1
.filter( )
errorsRDD:
Error, ts, msg1
Error, ts, msg1
Error, ts, msg3
Error, ts, msg4
Error, ts, msg1
http://www.slideshare.net/SparkSummit/intro-to-spark-development
errorsRDD
.coalesce( 2 )
cleanedRDD:
Error, ts, msg1
Error, ts, msg3
Error, ts, msg1
Error, ts, msg4
Error, ts, msg1
.collect( )
Driver
http://www.slideshare.net/SparkSummit/intro-to-spark-development
.collect( )
Execute !
Driver
http://www.slideshare.net/SparkSummit/intro-to-spark-development
.collect( )
Driver
logLinesRDD
http://www.slideshare.net/SparkSummit/intro-to-spark-development
.collect( )
logLinesRDD
errorsRDD
cleanedRDD
.filter( )
.coalesce( 2 )
Driver
Error, ts, msg1
Error, ts, msg3
Error, ts, msg1
Error, ts, msg4
Error, ts, msg1
http://www.slideshare.net/SparkSummit/intro-to-spark-development
.collect( )
Driver
logLinesRDD
errorsRDD
cleanedRDD
data
.filter( )
.coalesce( 2, shuffle= False)
http://www.slideshare.net/SparkSummit/intro-to-spark-development
Driver
logLinesRDD
errorsRDD
cleanedRDD
http://www.slideshare.net/SparkSummit/intro-to-spark-development
Driver
data
http://www.slideshare.net/SparkSummit/intro-to-spark-development
logLinesRDD
errorsRDD
cleanedRDD:
Error, ts, msg1
Error, ts, msg3
Error, ts, msg1
Error, ts, msg4
Error, ts, msg1
.filter( )
errorMsg1RDD:
Error, ts, msg1
Error, ts, msg1
Error, ts, msg1
.collect( )
.saveToCassandra( )
.count( )
5
http://www.slideshare.net/SparkSummit/intro-to-spark-development
Lifecycle of a Spark program
1. Create some input RDDs from external data or
parallelize a collection in your driver program.
2. Lazily transform them to define new RDDs
using transformations like filter() or map()
3. Ask Spark to cache() any intermediate RDDs
that will need to be reused.
4. Launch actions such as count() and collect()
to kick off a parallel computation, which is then
optimized and executed by Spark.
http://www.slideshare.net/SparkSummit/intro-to-spark-development
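The four lifecycle steps can be sketched with a toy model of lazy evaluation. This is plain Python and not the Spark API: `LazyDataset`, its `evaluations` counter, and the sample log lines are assumptions made for illustration. Transformations only record work, `cache()` marks a dataset for reuse, and nothing runs until an action is called.

```python
class LazyDataset:
    """Toy model of lazy transformations and eager actions (not Spark)."""

    def __init__(self, compute):
        self._compute = compute        # zero-argument function returning a list
        self._cache_enabled = False
        self._cached = None
        self.evaluations = 0           # how many times this dataset was computed

    @classmethod
    def parallelize(cls, items):
        data = list(items)
        return cls(lambda: list(data))

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        self.evaluations += 1
        result = self._compute()
        if self._cache_enabled:
            self._cached = result
        return result

    # Transformations: return a new dataset, run nothing.
    def filter(self, pred):
        return LazyDataset(lambda: [x for x in self._materialize() if pred(x)])

    def map(self, fn):
        return LazyDataset(lambda: [fn(x) for x in self._materialize()])

    def cache(self):
        self._cache_enabled = True
        return self

    # Actions: trigger the computation.
    def count(self):
        return len(self._materialize())

    def collect(self):
        return list(self._materialize())


# 1. Create an input dataset from a collection in the driver program.
log_lines = LazyDataset.parallelize([
    "Error, ts, msg1", "Warn, ts, msg2", "Error, ts, msg1",
    "Info, ts, msg8", "Error, ts, msg3", "Error, ts, msg4",
])
# 2. and 3. Lazily define a filtered dataset and mark it for reuse.
errors = log_lines.filter(lambda line: line.startswith("Error")).cache()
runs_before_action = errors.evaluations
# 4. Actions kick off the computation; the second one reuses the cache.
total_errors = errors.count()
msg1_errors = errors.filter(lambda line: line.endswith("msg1")).collect()
print(runs_before_action, total_errors, msg1_errors, errors.evaluations)
```

Because `errors` is cached, the second action reads the stored result instead of filtering the input again, which is the point of step 3.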
map() intersection() cartesian()
flatMap() distinct() pipe()
filter() groupByKey() coalesce()
mapPartitions() reduceByKey() repartition()
mapPartitionsWithIndex() sortByKey() partitionBy()
sample() join() ...
union() cogroup() ...
Transformations
http://www.slideshare.net/SparkSummit/intro-to-spark-development
reduce() takeOrdered()
collect() saveAsTextFile()
count() saveAsSequenceFile()
first() saveAsObjectFile()
take() countByKey()
takeSample() foreach()
... ...
Actions
http://www.slideshare.net/SparkSummit/intro-to-spark-development
SPARK SQL
# Query a Hive table through Spark SQL (PySpark, Spark 1.x API)
from pyspark.sql import HiveContext
sqlCtx = HiveContext(sc)
results = sqlCtx.sql(
    "SELECT * FROM people")
names = results.map(lambda p: p.name)
What is Spark SQL
NEW FEATURES IN 1.4 AND 1.5
DataFrames
http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
http://www.slideshare.net/SparkSummit/reynold-xin
Spark R
http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
Machine Learning Pipelines
http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
External Data Sources
http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
Tungsten
http://www.slideshare.net/SparkSummit/reynold-xin
All New Spark
http://www.slideshare.net/SparkSummit/reynold-xin
Spark 1.5
• A large part of Spark 1.5 focuses on under-the-hood changes that improve Spark's performance, usability, and operational stability.
• Spark 1.5 delivers the first phase of Project Tungsten.
Reference: https://databricks.com/blog/2015/08/18/spark-1-5-preview-now-available-in-databricks.html
Thank you

 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 

Editor's notes

  1. Mike Olson: Chief Strategy Officer at Cloudera
  2. Senior Vice President, Product Management
  3. This RDD has 5 partitions. An RDD is simply a distributed collection of elements. You can think of a distributed collection as kind of like an array or list in your single-machine program, except that it’s spread out across multiple nodes in the cluster. In Spark, all work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. Under the hood, Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them. So Spark gives you APIs and functions that let you operate on the whole collection in parallel using all the nodes.
  4. Introduce that Spark has operations, which can be transformations or actions. The 4 green blocks are the unique blocks of a single HDFS file. Here we filter out the warning and info messages so that only errors are left in the RDD. This doesn’t actually read the file from HDFS just yet; we’re just building out a lineage graph.
  5. DAG stands for directed acyclic graph. That is, it is formed by a collection of vertices and directed edges, each edge connecting one vertex to another, such that there is no way to start at some vertex v and follow a sequence of edges that eventually loops back to v again. A collection of tasks that must be ordered into a sequence, subject to constraints that certain tasks must be performed earlier than others, may be represented as a DAG with a vertex for each task and an edge for each constraint. Source: https://en.wikipedia.org/wiki/Directed_acyclic_graph
  6. Meeting notes (6/15/15 16:02): This is a stage (which we'll talk about later).
  7. Now the RDDs disappear and get destroyed.
  8. It’s okay if only part of the RDD actually fits in memory. Talk about lineage: parent RDD and child RDD.
  9. Meeting notes (6/15/15 16:08): Also note that an application can have many such 1 through 4 procedures.
  10. Actions force the evaluation of the transformations required for the RDD they are called on, since they are required to actually produce output.
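
The notes above on partitions, lineage, and lazy evaluation can be illustrated with a small sketch. This is a toy model in plain Python, not Spark's API: the `ToyRDD` class, its method names, and the sample log lines are all hypothetical and chosen only to mirror the "filter the errors out of a log file" example. It assumes a single process and simulates partitions as lists.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Hypothetical stand-in for an RDD (NOT the Spark API).

    A source ToyRDD holds its partitions. A derived ToyRDD only remembers
    its parent and the per-partition function to apply (its lineage).
    """

    def __init__(
        self,
        partitions: Optional[List[list]] = None,
        parent: Optional["ToyRDD"] = None,
        op: Optional[Callable[[list], list]] = None,
        name: str = "source",
    ) -> None:
        self.partitions = partitions  # set only on a source ToyRDD
        self.parent = parent
        self.op = op
        self.name = name

    @classmethod
    def parallelize(cls, data, num_partitions: int) -> "ToyRDD":
        """Split data into contiguous chunks, one list per partition."""
        data = list(data)
        size = -(-len(data) // num_partitions)  # ceiling division
        parts = [data[i * size:(i + 1) * size] for i in range(num_partitions)]
        return cls(partitions=parts)

    # Transformations: record the step and return a new ToyRDD; no data is touched.
    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(parent=self, name="filter",
                      op=lambda part: [x for x in part if pred(x)])

    def map(self, fn: Callable) -> "ToyRDD":
        return ToyRDD(parent=self, name="map",
                      op=lambda part: [fn(x) for x in part])

    def lineage(self) -> List[str]:
        """Names of the recorded steps, from the source to this ToyRDD."""
        chain, node = [], self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return list(reversed(chain))

    def _compute(self) -> List[list]:
        """Walk the lineage and evaluate every partition (nothing is cached)."""
        if self.parent is None:
            return self.partitions
        return [self.op(part) for part in self.parent._compute()]

    # Actions: force evaluation of the recorded transformations.
    def collect(self) -> list:
        return [x for part in self._compute() for x in part]

    def count(self) -> int:
        return sum(len(part) for part in self._compute())


if __name__ == "__main__":
    seen = []  # records every line the predicate actually inspects

    def is_error(line: str) -> bool:
        seen.append(line)
        return line.startswith("ERROR")

    logs = ToyRDD.parallelize(
        ["ERROR disk full", "WARN slow", "INFO started", "ERROR timeout", "INFO done"],
        num_partitions=5,
    )
    errors = logs.filter(is_error)  # transformation: only lineage is built
    print("inspected before action:", len(seen))
    print("lineage:", errors.lineage())
    print("collected:", errors.collect())  # action: evaluation happens here
    print("inspected after action:", len(seen))
```

In this sketch a second action on `errors` re-runs the filter from the source, because no intermediate result is kept; that loosely corresponds to the note that RDDs disappear after use unless they are explicitly kept in memory.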