Ericsson Internal | 2015-08-11 | Page 2
AGENDA
• What?
• Why?
• How?
• Demo
• EDR Analytics
Spark eco-system
Technology landscape
“Fast and general engine for big data processing, with libraries for SQL, streaming, and advanced analytics (machine learning)”
WHAT?
Originally developed in 2009 at UC Berkeley’s AMPLab
Fully open sourced in 2010; now at the Apache Software Foundation
http://spark.apache.org
Spark is the Most Active Open Source Project in Big Data
[Bar chart: project contributors in the past year; labelled projects include Giraph, Storm, and Tez]
The Spark Community
Distributors, Applications
2015 SNAPSHOT
WHY SPARK?
Speed: Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Ease of Use: Write Spark applications in different languages (Scala, Java, Python).
Generality: Combine SQL, streaming, and complex analytics in one platform.
Runs Everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud.
Easy: Get Started Immediately
Interactive Shell
Monitoring
FEATURE COMPARISON
Source: Daytona GraySort benchmark, sortbenchmark.org
WORD COUNT
Spark eco-system
Libraries: Spark Streaming, Spark SQL, GraphX, MLlib
Engine: Spark Core Engine (Scala/Java/Python)
Cluster Manager: Local, Standalone cluster, YARN, Mesos
Persistence: …
SPARK ON HDFS
ECOSYSTEM
                      HADOOP          SPARK
SQL Query interface   HIVE            SPARK SQL
Machine Learning      APACHE MAHOUT   MLLIB
Graph processing      APACHE GIRAPH   GRAPHX
Streaming             APACHE STORM    SPARK STREAMING
HOW?
So, how is it better?
THE BIG QUESTION
Is Spark going to replace Hadoop?
Answer: Yes, in part. Spark will run on top of Hadoop and replace MapReduce.
Reasons:
1. Hadoop MapReduce cannot handle real-time processing
2. Hadoop MapReduce is slower than Spark
3. With the rise of IoT, Spark is a must
RDD & SPARK COMPONENTS
Resilient Distributed Dataset (RDD)
RDDs track lineage information that can be used to efficiently recompute lost data.
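As a minimal sketch of what lineage looks like in practice, the snippet below builds a short chain of RDDs and prints it with `toDebugString`. The object name `LineageSketch`, the `local[2]` master, and the sample numbers are assumptions made for illustration; they do not come from the slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LineageSketch {
  // Builds numbers -> evens -> squares; each RDD remembers its parent.
  def squaresOfEvens(sc: SparkContext): RDD[Int] = {
    val numbers = sc.parallelize(1 to 10)      // base RDD (illustrative data)
    val evens = numbers.filter(_ % 2 == 0)     // child of `numbers`
    evens.map(n => n * n)                      // child of `evens`
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are illustrative assumptions.
    val conf = new SparkConf().setMaster("local[2]").setAppName("LineageSketch")
    val sc = new SparkContext(conf)
    try {
      val squares = squaresOfEvens(sc)
      // Prints the chain of parent RDDs that Spark can replay
      // to rebuild a lost partition of `squares`.
      println(squares.toDebugString)
      println(squares.collect().mkString(", "))
    } finally sc.stop()
  }
}
```

If a partition of `squares` is lost, Spark only needs the recorded steps (parallelize, filter, map) for that partition, not a replicated copy of the data.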
Partitions in the cluster
[Diagram: an RDD split into partitions spread across the cluster, with one Spark master (SparkM) and four Spark workers (SparkW). Credit: @doanduy]
RDD TRANSFORMATIONS & ACTIONS
PARTITION TRANSFORMATION
[Diagram: the partitions of an RDD passing through the operations below]
map(tuple => (tuple._3, tuple))   direct transformation
groupByKey()                      shuffle
countByKey()
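The three operations above can be tried on a small data set. This is only a sketch: the records, the object name `PartitionTransformationSketch`, and the `local[2]` master are hypothetical, and `groupByKey` and `countByKey` are applied separately to the keyed RDD here so that both results are meaningful.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionTransformationSketch {
  // Hypothetical records of the form (id, name, country).
  val records = Seq((1, "ann", "SE"), (2, "bob", "IN"), (3, "cid", "SE"))

  def run(sc: SparkContext): (Map[String, Int], Map[String, Long]) = {
    // map is a transformation: lazy, and applied within each partition.
    val keyed = sc.parallelize(records, 2).map(tuple => (tuple._3, tuple))

    // groupByKey is also a lazy transformation, but it needs a shuffle so that
    // all values of a key end up together. collect() is the action that runs it.
    val groupSizes = keyed.groupByKey().mapValues(_.size).collect().toMap

    // countByKey is an action: it runs a job and returns the counts to the driver.
    val counts = keyed.countByKey().toMap

    (groupSizes, counts)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("PartitionTransformationSketch")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

Nothing is computed when `map` or `groupByKey` is called; work starts only when an action such as `collect()` or `countByKey()` asks for a result.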
Stages
[Diagram: Stage 1 and Stage 2 separated by a shuffle operation. Credit: @doanduy]
A shuffle operation delimits the "shuffle" frontiers between stages.
SPARK COMPONENTS
SPARK STREAMING
SPARK SQL
Let’s try some examples…
Spark Shell
./bin/spark-shell --master local[2]
The --master option specifies the master URL for a distributed cluster, or local to run
locally with one thread, or local[N] to run locally with N threads. You should start by
using local for testing.
Basic operations…
scala> val textFile = sc.textFile("../README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
scala> textFile.count() // Number of items in this RDD
res0: Long = 126
scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
Simpler:
scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15
Map - Reduce
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res4: Int = 15
scala> import java.lang.Math
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res5: Int = 15
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts: spark.RDD[(String, Int)] = spark.ShuffledAggregatedRDD@71f027b8
scala> wordCounts.collect()
With Caching…
scala> linesWithSpark.cache() // Mark the RDD to be kept in memory
res7: spark.RDD[String] = spark.FilteredRDD@17e51082
scala> linesWithSpark.count() // First action computes the RDD and caches it
res8: Long = 15
scala> linesWithSpark.count() // Later actions reuse the cached data
res9: Long = 15
With HDFS…
val lines = sc.textFile("hdfs://...")
val errors = lines.filter(line => line.startsWith("ERROR"))
println("Total errors: " + errors.count())
Job Submission
$SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/scala-2.10/simple-project_2.10-1.0.jar
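The `--class "SimpleApp"` argument names the application's main class, which the slides do not show. A minimal sketch of what such a class could look like follows; the helper `countLinesContaining`, the default path `README.md`, and the search word are assumptions for illustration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  // Counts the lines of a text file that contain `needle`.
  def countLinesContaining(sc: SparkContext, path: String, needle: String): Long =
    sc.textFile(path).filter(line => line.contains(needle)).count()

  def main(args: Array[String]): Unit = {
    // Hypothetical input path; pass a real file as the first argument.
    val path = if (args.nonEmpty) args(0) else "README.md"
    // No setMaster here: spark-submit supplies the master through --master.
    val conf = new SparkConf().setAppName("SimpleApp")
    val sc = new SparkContext(conf)
    try {
      println("Lines with Spark: " + countLinesContaining(sc, path, "Spark"))
    } finally sc.stop()
  }
}
```

Leaving the master out of the code keeps the same jar usable with `local[4]` during testing and with a cluster master URL later.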
Configuration
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setMaster("local")                  // run locally with one thread
  .setAppName("CountingSheep")
  .set("spark.executor.memory", "1g")  // memory per executor
val sc = new SparkContext(conf)
SQL to RDD Translation
Projection & selection
SELECT name, age
FROM people
WHERE age >= 13 AND age <= 19
// people: RDD[Person] is assumed to be defined elsewhere
val teenagers: RDD[(String, Int)] = people
  .filter(p => p.age >= 13 && p.age <= 19) // selection: WHERE age >= 13 AND age <= 19
  .map(p => (p.name, p.age))               // projection: SELECT name, age
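A self-contained version of this translation might look like the sketch below. The `Person` case class, the sample rows, the object name `TeenagersSketch`, and the `local[2]` master are assumptions for illustration; the slides only name the `name` and `age` fields.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Assumed shape of a row in the `people` table.
case class Person(name: String, age: Int)

object TeenagersSketch {
  def teenagers(people: RDD[Person]): RDD[(String, Int)] =
    people
      .filter(p => p.age >= 13 && p.age <= 19) // WHERE age >= 13 AND age <= 19
      .map(p => (p.name, p.age))               // SELECT name, age

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("TeenagersSketch")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample rows.
      val people = sc.parallelize(Seq(
        Person("Ann", 12), Person("Bob", 15), Person("Cid", 19), Person("Dee", 30)))
      teenagers(people).collect().foreach(println)
    } finally sc.stop()
  }
}
```

The filter runs before the map because, once the projection has turned each `Person` into a `(name, age)` tuple, `p.age` is no longer available.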
THANK
YOU

Contenu connexe

Tendances

Operational Tips for Deploying Spark
Operational Tips for Deploying SparkOperational Tips for Deploying Spark
Operational Tips for Deploying SparkDatabricks
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark Aakashdata
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Edureka!
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark OverviewairisData
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache sparkUserReport
 
Spark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonDatabricks
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLDatabricks
 
Parallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRParallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRDatabricks
 
NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin
NigthClazz Spark - Machine Learning / Introduction à Spark et ZeppelinNigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin
NigthClazz Spark - Machine Learning / Introduction à Spark et ZeppelinZenika
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
Python and Bigdata - An Introduction to Spark (PySpark)
Python and Bigdata -  An Introduction to Spark (PySpark)Python and Bigdata -  An Introduction to Spark (PySpark)
Python and Bigdata - An Introduction to Spark (PySpark)hiteshnd
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSpark Summit
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
Performant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIPerformant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIRyuji Tamagawa
 

Tendances (20)

Operational Tips for Deploying Spark
Operational Tips for Deploying SparkOperational Tips for Deploying Spark
Operational Tips for Deploying Spark
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache spark
 
Spark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science London
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
 
Parallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRParallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkR
 
Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
 
NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin
NigthClazz Spark - Machine Learning / Introduction à Spark et ZeppelinNigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin
NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Python and Bigdata - An Introduction to Spark (PySpark)
Python and Bigdata -  An Introduction to Spark (PySpark)Python and Bigdata -  An Introduction to Spark (PySpark)
Python and Bigdata - An Introduction to Spark (PySpark)
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
 
PySaprk
PySaprkPySaprk
PySaprk
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Performant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIPerformant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame API
 

En vedette

Node Security: The Good, Bad & Ugly
Node Security: The Good, Bad & UglyNode Security: The Good, Bad & Ugly
Node Security: The Good, Bad & UglyBishan Singh
 
Hadoop Operations at LinkedIn
Hadoop Operations at LinkedInHadoop Operations at LinkedIn
Hadoop Operations at LinkedInAllen Wittenauer
 
Hadoop Performance at LinkedIn
Hadoop Performance at LinkedInHadoop Performance at LinkedIn
Hadoop Performance at LinkedInAllen Wittenauer
 
NodeJS ecosystem
NodeJS ecosystemNodeJS ecosystem
NodeJS ecosystemYukti Kaura
 
Spark - The beginnings
Spark -  The beginningsSpark -  The beginnings
Spark - The beginningsDaniel Leon
 
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of DatabricksBig Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of DatabricksData Con LA
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...Chris Fregly
 
Node.js Enterprise Middleware
Node.js Enterprise MiddlewareNode.js Enterprise Middleware
Node.js Enterprise MiddlewareBehrad Zari
 
New directions for Apache Spark in 2015
New directions for Apache Spark in 2015New directions for Apache Spark in 2015
New directions for Apache Spark in 2015Databricks
 
Baptist Health: Solving Healthcare Problems with Big Data
Baptist Health: Solving Healthcare Problems with Big DataBaptist Health: Solving Healthcare Problems with Big Data
Baptist Health: Solving Healthcare Problems with Big DataMapR Technologies
 
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...Chris Fregly
 
Extreme-scale Ad-Tech using Spark and Databricks at MediaMath
Extreme-scale Ad-Tech using Spark and Databricks at MediaMathExtreme-scale Ad-Tech using Spark and Databricks at MediaMath
Extreme-scale Ad-Tech using Spark and Databricks at MediaMathSpark Summit
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Databricks
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelMartin Zapletal
 
Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark
Spark After Dark:  Real time Advanced Analytics and Machine Learning with SparkSpark After Dark:  Real time Advanced Analytics and Machine Learning with Spark
Spark After Dark: Real time Advanced Analytics and Machine Learning with SparkChris Fregly
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkDatabricks
 

En vedette (20)

Node Security: The Good, Bad & Ugly
Node Security: The Good, Bad & UglyNode Security: The Good, Bad & Ugly
Node Security: The Good, Bad & Ugly
 
Hadoop Operations at LinkedIn
Hadoop Operations at LinkedInHadoop Operations at LinkedIn
Hadoop Operations at LinkedIn
 
Hadoop Performance at LinkedIn
Hadoop Performance at LinkedInHadoop Performance at LinkedIn
Hadoop Performance at LinkedIn
 
NodeJS ecosystem
NodeJS ecosystemNodeJS ecosystem
NodeJS ecosystem
 
Spark - The beginnings
Spark -  The beginningsSpark -  The beginnings
Spark - The beginnings
 
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of DatabricksBig Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Node.js Enterprise Middleware
Node.js Enterprise MiddlewareNode.js Enterprise Middleware
Node.js Enterprise Middleware
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
New directions for Apache Spark in 2015
New directions for Apache Spark in 2015New directions for Apache Spark in 2015
New directions for Apache Spark in 2015
 
Baptist Health: Solving Healthcare Problems with Big Data
Baptist Health: Solving Healthcare Problems with Big DataBaptist Health: Solving Healthcare Problems with Big Data
Baptist Health: Solving Healthcare Problems with Big Data
 
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
 
Extreme-scale Ad-Tech using Spark and Databricks at MediaMath
Extreme-scale Ad-Tech using Spark and Databricks at MediaMathExtreme-scale Ad-Tech using Spark and Databricks at MediaMath
Extreme-scale Ad-Tech using Spark and Databricks at MediaMath
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming model
 
Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark
Spark After Dark:  Real time Advanced Analytics and Machine Learning with SparkSpark After Dark:  Real time Advanced Analytics and Machine Learning with Spark
Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Apache Spark & Streaming
Apache Spark & StreamingApache Spark & Streaming
Apache Spark & Streaming
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 

Similaire à Apache spark linkedin

Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch ProcessingEdureka!
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!Edureka!
 
Spark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingSpark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingRamaninder Singh Jhajj
 
5 reasons why spark is in demand!
5 reasons why spark is in demand!5 reasons why spark is in demand!
5 reasons why spark is in demand!Edureka!
 
A short introduction to Spark and its benefits
A short introduction to Spark and its benefitsA short introduction to Spark and its benefits
A short introduction to Spark and its benefitsJohan Picard
 
Apache Spark beyond Hadoop MapReduce
Apache Spark beyond Hadoop MapReduceApache Spark beyond Hadoop MapReduce
Apache Spark beyond Hadoop MapReduceEdureka!
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdfMaheshPandit16
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014mahchiev
 
Apache spark
Apache spark Apache spark
Apache spark Edureka!
 
Spark SQL | Apache Spark
Spark SQL | Apache SparkSpark SQL | Apache Spark
Spark SQL | Apache SparkEdureka!
 
Big Data Processing With Spark
Big Data Processing With SparkBig Data Processing With Spark
Big Data Processing With SparkEdureka!
 
Review on Apache Spark Technology
Review on Apache Spark TechnologyReview on Apache Spark Technology
Review on Apache Spark TechnologyIRJET Journal
 
spark interview questions & answers acadgild blogs
 spark interview questions & answers acadgild blogs spark interview questions & answers acadgild blogs
spark interview questions & answers acadgild blogsprateek kumar
 
5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!Edureka!
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
 

Similaire à Apache spark linkedin (20)

Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!
 
Spark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingSpark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data Processing
 
5 reasons why spark is in demand!
5 reasons why spark is in demand!5 reasons why spark is in demand!
5 reasons why spark is in demand!
 
A short introduction to Spark and its benefits
A short introduction to Spark and its benefitsA short introduction to Spark and its benefits
A short introduction to Spark and its benefits
 
Apache Spark beyond Hadoop MapReduce
Apache Spark beyond Hadoop MapReduceApache Spark beyond Hadoop MapReduce
Apache Spark beyond Hadoop MapReduce
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014
 
Apache spark
Apache spark Apache spark
Apache spark
 
Spark SQL | Apache Spark
Spark SQL | Apache SparkSpark SQL | Apache Spark
Spark SQL | Apache Spark
 
Big Data Processing With Spark
Big Data Processing With SparkBig Data Processing With Spark
Big Data Processing With Spark
 
Review on Apache Spark Technology
Review on Apache Spark TechnologyReview on Apache Spark Technology
Review on Apache Spark Technology
 
spark interview questions & answers acadgild blogs
 spark interview questions & answers acadgild blogs spark interview questions & answers acadgild blogs
spark interview questions & answers acadgild blogs
 
Apache Spark Fundamentals Training
Apache Spark Fundamentals TrainingApache Spark Fundamentals Training
Apache Spark Fundamentals Training
 
Module01
 Module01 Module01
Module01
 
5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
PYSPARK PROGRAMMING.pdf
PYSPARK PROGRAMMING.pdfPYSPARK PROGRAMMING.pdf
PYSPARK PROGRAMMING.pdf
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 

Plus de Yukti Kaura

Cloud computing saas
Cloud computing   saasCloud computing   saas
Cloud computing saasYukti Kaura
 
Cloud computing - Basics and Beyond
Cloud computing - Basics and BeyondCloud computing - Basics and Beyond
Cloud computing - Basics and BeyondYukti Kaura
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big dataYukti Kaura
 
Web services for Laymen
Web services for LaymenWeb services for Laymen
Web services for LaymenYukti Kaura
 
Clean code - Agile Software Craftsmanship
Clean code - Agile Software CraftsmanshipClean code - Agile Software Craftsmanship
Clean code - Agile Software CraftsmanshipYukti Kaura
 
Basics of Flex Components, Skinning
Basics of Flex Components, SkinningBasics of Flex Components, Skinning
Basics of Flex Components, SkinningYukti Kaura
 

Plus de Yukti Kaura (8)

Cloud computing saas
Cloud computing   saasCloud computing   saas
Cloud computing saas
 
Cloud computing - Basics and Beyond
Cloud computing - Basics and BeyondCloud computing - Basics and Beyond
Cloud computing - Basics and Beyond
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big data
 
Web services for Laymen
Web services for LaymenWeb services for Laymen
Web services for Laymen
 
Spring batch
Spring batch Spring batch
Spring batch
 
Clean code - Agile Software Craftsmanship
Clean code - Agile Software CraftsmanshipClean code - Agile Software Craftsmanship
Clean code - Agile Software Craftsmanship
 
Maven overview
Maven overviewMaven overview
Maven overview
 
Basics of Flex Components, Skinning
Basics of Flex Components, SkinningBasics of Flex Components, Skinning
Basics of Flex Components, Skinning
 

Dernier

Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Apache spark linkedin

  • 1.
  • 2. Ericsson Internal | 2015-08-11 | Page 2 AGENDA: What? Why? How? Demo. EDR Analytics.
  • 3. Spark eco-system: technology landscape
  • 4. “Fast and general engine for big data processing with libraries for SQL, streaming, advanced analytics (machine learning)”
  • 5. WHAT? Originally developed in 2009 in UC Berkeley’s AMPLab. Fully open sourced in 2010; now at the Apache Software Foundation. http://spark.apache.org
  • 6. Spark is the Most Active Open Source Project in Big Data. [Bar chart: project contributors in past year; labeled bars include Giraph, Storm, and Tez; axis from 0 to 140]
  • 7. The Spark Community: Distributors, Applications
  • 8. 2015 SNAPSHOT
  • 9. WHY SPARK? Speed: run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Ease of Use: supports different languages for developing applications using Spark. Generality: combine SQL, streaming, and complex analytics into one platform. Runs Everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud.
  • 10. Easy: Get Started Immediately. Interactive Shell.
  • 11. Monitoring
  • 12. FEATURE COMPARISON. Source: Daytona GraySort benchmark, sortbenchmark.org
  • 13. WORD COUNT
  • 14. Spark eco-system: Spark Streaming, Spark SQL, GraphX, and MLlib on top of the Spark Core Engine (Scala/Java/Python). Cluster Manager: Local, Standalone cluster, YARN, Mesos. Persistence: …
  • 15. SPARK ON HDFS
  • 16. ECOSYSTEM (Hadoop vs. Spark). SQL query interface: Hive vs. Spark SQL. Machine learning: Apache Mahout vs. MLlib. Graph processing: Apache Giraph vs. GraphX. Streaming: Apache Storm vs. Spark Streaming.
  • 17. HOW?
  • 18. So, HOW is it BETTER?
  • 19. THE BIG QUESTION: Is Spark going to replace Hadoop? Answer: Yes, Spark will be used on top of Hadoop and will replace MapReduce. Reasons: 1. Hadoop MapReduce cannot handle real-time processing. 2. Hadoop MapReduce is slower than Spark. 3. With the rise of IoT, Spark is a must.
  • 20. RDD & SPARK COMPONENTS
  • 21. RESILIENT Distributed Dataset: RDDs track lineage information that can be used to efficiently re-compute lost data.
  • 22. Partitions in the cluster. [Diagram: nodes labeled SparkM and SparkW, with the RDD split into partitions across them. Credit: @doanduy]
  • 23. RDD TRANSFORMATIONS & ACTIONS
  • 24. PARTITION TRANSFORMATION. [Diagram: RDD partitions passing through map(tuple => (tuple._3, tuple)), groupByKey(), and countByKey(), with arrows labeled "direct transformation" and "shuffle"]
  • 25. Stages: Stage 1, shuffle operation, Stage 2. A shuffle delimits the frontier between stages. Credit: @doanduy
  • 26. SPARK COMPONENTS
  • 27. SPARK STREAMING
  • 28. SPARK SQL
  • 29. Let’s try some examples…
  • 30. Spark Shell
      ./bin/spark-shell --master local[2]
      The --master option specifies the master URL for a distributed cluster, or local to run locally with one thread, or local[N] to run locally with N threads. You should start by using local for testing.
  • 31. Basic operations…
      scala> val textFile = sc.textFile("../README.md")
      textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
      scala> textFile.count() // Number of items in this RDD
      res0: Long = 126
      scala> textFile.first() // First item in this RDD
      res1: String = # Apache Spark
      scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
      Simpler:
      scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
      res3: Long = 15
  • 32. Map - Reduce
      scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
      res4: Int = 15
      scala> import java.lang.Math
      scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
      res5: Int = 15
      scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
      wordCounts: spark.RDD[(String, Int)] = spark.ShuffledAggregatedRDD@71f027b8
      scala> wordCounts.collect()
  • 33. With Caching…
      scala> linesWithSpark.cache()
      res7: spark.RDD[String] = spark.FilteredRDD@17e51082
      scala> linesWithSpark.count()
      res8: Long = 15
      scala> linesWithSpark.count()
      res9: Long = 15
  • 34. With HDFS…
      val lines = sc.textFile("hdfs://...")
      val errors = lines.filter(line => line.startsWith("ERROR"))
      println("Total errors: " + errors.count())
  • 35. Job Submission
      $SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] target/scala-2.10/simple-project_2.10-1.0.jar
  • 36. Configuration
      val conf = new SparkConf()
        .setMaster("local")
        .setAppName("CountingSheep")
        .set("spark.executor.memory", "1g")
      val sc = new SparkContext(conf)
  • 37. SQL to RDD Translation: projection & selection
      SELECT name, age FROM people WHERE age >= 13 AND age <= 19
      val people: RDD[Person]
      val teenagers: RDD[(String, Int)] = people
        .filter(p => p.age >= 13 && p.age <= 19) // selection (WHERE)
        .map(p => (p.name, p.age)) // projection (SELECT)
      The same result with projection applied first:
      people
        .map(p => (p.name, p.age))
        .filter(p => p._2 >= 13 && p._2 <= 19)
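
The SQL-to-RDD translation on the last slide can be tried end to end with a small standalone program. What follows is a minimal sketch, not the presenter's code: the `Person` case class, the `TeenagersExample` object, and the sample rows are hypothetical, and it assumes the Spark core library is on the classpath. It runs locally with two threads, in the same way as the `local[N]` master setting described for the shell.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical record type: the slide only names the fields `name` and `age`.
case class Person(name: String, age: Int)

object TeenagersExample {
  // SELECT name, age FROM people WHERE age >= 13 AND age <= 19
  // Selection (filter) runs first, then projection (map).
  def teenagers(people: RDD[Person]): RDD[(String, Int)] =
    people
      .filter(p => p.age >= 13 && p.age <= 19)
      .map(p => (p.name, p.age))

  def main(args: Array[String]): Unit = {
    // Local master with two threads; the app name is arbitrary.
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("TeenagersExample")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample rows, for illustration only.
      val people = sc.parallelize(Seq(
        Person("Ann", 12),
        Person("Bob", 15),
        Person("Cid", 19),
        Person("Dee", 30)
      ))
      // collect() is an action: it triggers the job and returns the rows to the driver.
      teenagers(people).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Both transformations work partition by partition, so this job needs no shuffle. Once packaged as a jar, a program of this shape could be launched with spark-submit by passing the object name to --class.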