Spark Streaming & Spark SQL
Yousun Jeong
jerryjung@sk.com
History - Spark
Developed in 2009 at UC Berkeley AMPLab, then
open sourced in 2010, Spark has since become one
of the largest OSS communities in big data, with over
200 contributors in 50+ organizations
“Organizations that are looking at big data challenges – including collection, ETL,
storage, exploration and analytics – should consider Spark for its in-memory
performance and the breadth of its model. It supports advanced analytics solutions
on Hadoop clusters, including the iterative model required for machine learning and
graph analysis.”
Gartner, Advanced Analytics and Data Science (2014)
History - Spark
Some key points about Spark:
• handles batch, interactive, and real-time processing within a single framework
• native integration with Java, Python, and Scala
• programming at a higher level of abstraction
• multi-step Directed Acyclic Graphs (DAGs): many stages, compared to Hadoop's Map and Reduce only
Data Sharing in MR
http://www.slideshare.net/jamesskillsmatter/zaharia-sparkscaladays2012
Spark
Benchmark Test
databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
RDD
Resilient Distributed Datasets (RDD) are the primary
abstraction in Spark – a fault-tolerant collection of
elements that can be operated on in parallel
There are currently two types:
• parallelized collections – take an existing Scala collection
and run functions on it in parallel
• Hadoop datasets – run functions on each record of a file
in Hadoop distributed file system or any other storage
system supported by Hadoop
Fault Tolerance
• An RDD is an immutable, deterministically re-computable, distributed dataset.
• Each RDD tracks lineage information, which is used to rebuild lost data.
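The lineage idea can be sketched in plain Python. This is a toy model and not the Spark API: `ToyRDD`, `lose_data`, and the sample numbers are hypothetical names chosen for illustration. Each dataset remembers its parent and the deterministic step that produced it, so a lost result can be rebuilt by replaying those steps from the base data.

```python
# Toy model of lineage-based recovery (plain Python, NOT the Spark API).
class ToyRDD:
    def __init__(self, source=None, parent=None, fn=None):
        self._source = source  # base data, assumed to sit in reliable storage
        self._parent = parent  # lineage: the dataset this one was derived from
        self._fn = fn          # lineage: the deterministic step that derives it
        self._cache = None     # materialized result, which may be lost

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # Recompute from lineage whenever the materialized result is missing.
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = self._fn(self._parent.collect())
        return list(self._cache)

    def lose_data(self):
        # Simulate losing the materialized result (e.g. a failed worker).
        self._cache = None


base = ToyRDD(source=[1, 2, 3, 4, 5])
squares = base.map(lambda x: x * x)
even_squares = squares.filter(lambda x: x % 2 == 0)

first = even_squares.collect()
squares.lose_data()
even_squares.lose_data()
recovered = even_squares.collect()  # rebuilt by replaying map, then filter
```

Because every step is deterministic and the datasets are never mutated, `recovered` equals `first` without any replicated copy of the intermediate results.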
Benefit of Spark
Spark helps us gain processing speed and implement a variety of big data applications quickly and easily:
▪ Support for Event Stream Processing
▪ Fast Data Queries in Real Time
▪ Improved Programmer Productivity
▪ Fast Batch Processing of Large Data Sets
Why I use Spark …
Big Data
Big Data is not just “big”
The 3 Vs of Big Data: volume, velocity, variety
Big Data Processing
1. Batch Processing
• processing data en masse
• big & complex
• higher latencies (e.g., MapReduce)
2. Stream Processing
• one-at-a-time processing
• computations are relatively simple and generally independent
• sub-second latency (e.g., Storm)
3. Micro-Batching
• small batch size (batch+streaming)
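Micro-batching can be illustrated without any streaming framework. The sketch below is plain Python with hypothetical names (`micro_batches`, the sample events). It assumes each event carries a timestamp in seconds and groups events into consecutive fixed-length intervals, each of which would then be processed as a small batch job.

```python
from collections import defaultdict


def micro_batches(events, interval):
    """Group (timestamp, value) events into fixed-length batches.

    Returns a list of (batch_start_time, values) in time order.
    Intervals with no events are skipped in this sketch.
    """
    buckets = defaultdict(list)
    for ts, value in events:
        buckets[int(ts // interval)].append(value)
    return [(idx * interval, buckets[idx]) for idx in sorted(buckets)]


# Hypothetical event stream: (timestamp in seconds, payload)
events = [(0.5, "a"), (1.2, "b"), (9.9, "c"), (10.0, "d"), (25.3, "e")]
batches = micro_batches(events, interval=10)
# batches == [(0, ['a', 'b', 'c']), (10, ['d']), (20, ['e'])]
```

The interval length is the trade-off knob: shorter intervals lower latency toward stream processing, while longer ones amortize per-batch overhead as in batch processing.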
Spark Streaming Integration
Spark Streaming In Action
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

// create a StreamingContext with a SparkConf configuration
val ssc = new StreamingContext(sparkConf, Seconds(10))

// create a DStream that will connect to serverIP:serverPort
val lines = ssc.socketTextStream(serverIP, serverPort)

// split each line into words
val words = lines.flatMap(_.split(" "))

// count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// print a few of the counts to the console
wordCounts.print()

ssc.start()            // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
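To make the per-batch semantics concrete, here is a plain-Python sketch of what the pipeline above computes for each batch. It does not use Spark. `word_counts` and the sample batches are hypothetical, and each inner list stands in for the lines received during one 10-second interval.

```python
from collections import Counter
from itertools import chain


def word_counts(batch_lines):
    """Apply the flatMap -> map -> reduceByKey steps to one batch of lines."""
    # flatMap: split every line into words and flatten into one sequence
    words = chain.from_iterable(line.split(" ") for line in batch_lines)
    # map: pair each word with the count 1
    pairs = ((word, 1) for word in words)
    # reduceByKey(_ + _): sum the counts for each distinct word
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return dict(counts)


# Two hypothetical batches of input lines.
batches = [["to be or not to be"], ["be quick", "be kind"]]
results = [word_counts(batch) for batch in batches]
```

Each batch is counted independently: the counts for "be" in the second batch do not include those from the first, which matches the "count each word in each batch" comment in the Scala code.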
Spark UI
Spark SQL
Spark SQL In Action
// Data can easily be extracted from existing sources,
// such as Apache Hive.
val trainingDataTable = sql("""
  SELECT e.action, u.age, u.latitude, u.logitude
  FROM Users u
  JOIN Events e
  ON u.userId = e.userId""")
// Since `sql` returns an RDD, the results of the above
// query can be easily used in MLlib
val trainingData = trainingDataTable.map { row =>
  val features = Array[Double](row(1), row(2), row(3))
  LabeledPoint(row(0), features)
}
val model =
  new LogisticRegressionWithSGD().run(trainingData)
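The shape of this step (run a SQL join, then turn each result row into a label plus a feature vector) can be tried without a cluster. The sketch below uses Python's built-in `sqlite3` in place of Spark SQL. The table contents are made up for illustration, and the column types are assumptions, since the slide does not show the schema.

```python
import sqlite3

# In-memory database standing in for the Hive tables (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users  (userId INTEGER, age REAL, latitude REAL, logitude REAL);
    CREATE TABLE Events (userId INTEGER, action REAL);
    INSERT INTO Users  VALUES (1, 34.0, 37.5, 127.0), (2, 27.0, 35.1, 129.0);
    INSERT INTO Events VALUES (1, 1.0), (2, 0.0), (1, 0.0);
""")

# Same join as on the slide; ORDER BY only makes the output deterministic.
rows = conn.execute("""
    SELECT e.action, u.age, u.latitude, u.logitude
    FROM Users u
    JOIN Events e ON u.userId = e.userId
    ORDER BY e.rowid
""").fetchall()

# Map each row to (label, features), mirroring the LabeledPoint construction.
training_data = [(row[0], [float(v) for v in row[1:]]) for row in rows]
conn.close()
```

Each element of `training_data` pairs the `action` column (used as the label) with the three user attributes, which is the input form a classifier would expect.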
Spark SQL In Action
val allCandidates = sql("""
  SELECT userId,
         age,
         latitude,
         logitude
  FROM Users
  WHERE subscribed = FALSE""")
// Results of ML algorithms can be used as tables
// in subsequent SQL statements.
case class Score(userId: Int, score: Double)
val scores = allCandidates.map { row =>
  val features = Array[Double](row(1), row(2), row(3))
  Score(row(0), model.predict(features))
}
scores.registerAsTable("Scores")
MR vs RDD - Compute an Average
RDD vs DF - Compute an Average
Using RDDs
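# split each tab-separated line into fields: [name, age]
data = sc.textFile(...).map(lambda line: line.split("\t"))
# build (sum, count) per key, then divide to get the average
(data.map(lambda x: (x[0], [int(x[1]), 1]))
     .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]])
     .map(lambda x: [x[0], x[1][0] / x[1][1]])
     .collect())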
Using DataFrames
from pyspark.sql.functions import avg
sqlCtx.table("people").groupBy("name").agg(avg("age")).collect()
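The RDD version works by carrying a (sum, count) pair per key and dividing only at the end. The same logic can be checked in plain Python, with no Spark involved. `average_by_key` and the sample lines are hypothetical, and the input is assumed to be tab-separated name and age fields.

```python
from collections import defaultdict
from functools import reduce


def average_by_key(records):
    """Average the second field per key using (sum, count) pairs."""
    # map: (name, age) -> (name, (age, 1))
    pairs = [(name, (int(age), 1)) for name, age in records]
    # group the pairs by key so they can be reduced per key
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # reduceByKey: add sums and counts component-wise
    totals = {
        key: reduce(lambda x, y: (x[0] + y[0], x[1] + y[1]), values)
        for key, values in grouped.items()
    }
    # final map: sum / count
    return {key: total / count for key, (total, count) in totals.items()}


lines = ["ann\t30", "bob\t40", "ann\t20"]
records = [line.split("\t") for line in lines]
averages = average_by_key(records)
```

Averages cannot be combined directly across partial results, but sums and counts can, which is why the pair is carried through the reduce step. The DataFrame version expresses the same intent declaratively and leaves that detail to the engine.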
Spark 2.0: Structured Streaming
• Structured Streaming
• High-level streaming API built on Spark SQL engine
• Runs the same queries on DataFrames
• Event time, windowing, sessions, sources & sinks
• Unifies streaming, interactive and batch queries
• Aggregate data in a stream, then serve using JDBC
• Change queries at runtime
• Build and apply ML models
Spark 2.0 Example: Page View Count
Input: records in Kafka
Query: select count(*) group by page, minute(evtime)
Trigger: “every 5 sec”
Output mode: “update-in-place”, into MySQL sink
logs = ctx.read.format("json").stream("s3://logs")
(logs.groupBy(logs.user_id)
     .agg(sum(logs.time))
     .write.format("jdbc")
     .stream("jdbc:mysql://..."))
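The combination of a grouped count, a periodic trigger, and an "update-in-place" output can be modeled in plain Python. This sketch does not use Spark, Kafka, or MySQL. `PageViewCounter`, the record fields, and the sample data are hypothetical, and `evtime` is assumed to be a timestamp in whole seconds that is bucketed by whole minute.

```python
from collections import Counter


class PageViewCounter:
    """Running count of page views per (page, minute), updated per trigger."""

    def __init__(self):
        # Stands in for the sink table, keyed by (page, minute bucket).
        self.table = Counter()

    def on_trigger(self, new_records):
        """Fold the records that arrived since the last trigger into the table."""
        for rec in new_records:
            minute = rec["evtime"] // 60  # assumed: evtime in seconds
            self.table[(rec["page"], minute)] += 1
        return dict(self.table)  # current state of the result table


counter = PageViewCounter()
counter.on_trigger([
    {"page": "/home", "evtime": 5},
    {"page": "/home", "evtime": 30},
    {"page": "/cart", "evtime": 61},
])
snapshot = counter.on_trigger([
    {"page": "/home", "evtime": 59},
    {"page": "/cart", "evtime": 75},
])
```

The second trigger does not append new rows for keys it has already seen. It raises the existing counts, which is what "update-in-place" means for the sink.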
Spark 2.0 Use Case: Fraud Detection
Spark 2.0 Performance
Q & A
Thank You!
