Prajod Vettiyattil 
Architect, Open source 
Wipro 
in.linkedin.com/in/prajod 
@prajods 
Apache Spark The Next Gen toolset for Big Data Processing 
Namitha M S 
Architect, Advanced Technologies 
Wipro 
in.linkedin.com/in/namithams 
Open Source India 
Nov 2014 
Bangalore
Agenda
•Big Data
•Hadoop stack and its limitations
•Spark: An overview
•Streaming, GraphX and MLlib
•Performance characteristics of Spark
Big Data
•Data too huge for normal systems
•3 Vs: Volume, Variety, Velocity
•Storage challenge
•Analysis challenge
•Query results take hours, days or months
Diagram: data disks
The Big Data Analysis Triad 
Batch 
Interactive 
Streaming
The Hadoop stack 
•Distributed data processing 
•Fault tolerant 
•Processes petabyte data sets
•Ecosystem tools
•Hive, HBase
•Pig 
•Storm 
•Hadoop 
•Map 
•Reduce 
•Shuffle, partition, sort 
•HDFS
Hadoop: Data flow
•Map side: Input data files → Map → Buffer in memory → Partition for target reducers → Sort each partition by key → Potential spill to disk → Merge all partitions and write to disk
•Reduce side: HTTP fetch from map node → Merge round 1, Merge round 2, … Merge round N (merge sort) → Reduce → Output
•High disk I/O on map nodes and on reduce nodes
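The data flow on this slide can be sketched in plain Python. This is a minimal, in-memory illustration of the partition, sort, merge and reduce steps using a word count; it is not Hadoop code, and the function names (`map_phase`, `partition_and_sort`, `reduce_phase`, `run_job`) are hypothetical. A real job writes the sorted runs to disk and fetches them over HTTP, which is where the high disk I/O comes from.

```python
import zlib
from heapq import merge
from itertools import groupby


def map_phase(lines):
    """Emit (word, 1) pairs for every word in the input split."""
    for line in lines:
        for word in line.split():
            yield (word, 1)


def partition_and_sort(pairs, num_reducers):
    """Assign each pair to a target reducer, then sort each partition by key."""
    partitions = {}
    for key, value in pairs:
        target = zlib.crc32(key.encode("utf-8")) % num_reducers
        partitions.setdefault(target, []).append((key, value))
    return {target: sorted(kv) for target, kv in partitions.items()}


def reduce_phase(sorted_runs):
    """Merge the sorted runs fetched from every mapper, then sum per key."""
    merged = merge(*sorted_runs)
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(merged, key=lambda kv: kv[0])
    }


def run_job(splits, num_reducers=2):
    """Run one map task per split and one reduce task per partition."""
    map_outputs = [partition_and_sort(map_phase(s), num_reducers) for s in splits]
    result = {}
    for reducer in range(num_reducers):
        runs = [output.get(reducer, []) for output in map_outputs]
        result.update(reduce_phase(runs))
    return result


if __name__ == "__main__":
    print(run_job([["a b a"], ["b c"]]))
```

Each mapper produces one sorted run per reducer, so the reducer only needs a merge rather than a full sort. That matches the "Merge round" boxes in the diagram.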
Limitations of Hadoop
•Batch mode
•Only the batch layer in the Lambda pattern
•No real time
•No repetitive queries
•Poor fit for iterative algorithms
•Poor fit for interactive data querying
•Poor support for distributed memory
Spark: An overview 
•“Over time, fewer projects will use MapReduce, and more will use Spark” 
•Doug Cutting, creator of Hadoop 
•New architecture: scale better and simplify 
•In-memory processing for Big Data
•Cached intermediate data sets
•Multi-step DAG-based execution
•Resilient Distributed Datasets (RDDs)
•The core innovation in Spark
Spark Ecosystem tools 
Apache Spark 
Spark SQL 
Streaming 
MLlib 
GraphX 
SparkR
BlinkDB
Shark 
Bagel
DAG Execution Engine 
Map 
Collect 
Filter 
Map 
Reduce 
Sort 
Collect 
DAG = Directed Acyclic Graph
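The idea behind a DAG execution engine can be sketched as follows: transformations only record a plan, and nothing runs until a final action such as collect. The `LazyDataset` class below is a hypothetical toy in plain Python, not the Spark API, and it models only a linear chain, which is the simplest possible DAG.

```python
class LazyDataset:
    """Toy dataset that records transformations and runs them on collect()."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)

    def map(self, fn):
        # Record the step; no data is touched yet.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._ops + (("filter", predicate),))

    def plan(self):
        """Return the names of the recorded steps, in execution order."""
        return [name for name, _ in self._ops]

    def collect(self):
        """Action: execute the whole recorded plan in one pass."""
        data = iter(self._source)
        for name, fn in self._ops:
            data = map(fn, data) if name == "map" else filter(fn, data)
        return list(data)


if __name__ == "__main__":
    squares_of_evens = (
        LazyDataset(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
    )
    print(squares_of_evens.plan())
    print(squares_of_evens.collect())
```

Because the full plan is known before execution, an engine can chain steps without writing intermediate results to disk between them, in contrast to a fixed map-then-reduce cycle.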
RDD
•Resilient Distributed Datasets
•Features
•Read only
•Fault tolerance without replication
•Uses data lineage for recovery
•Low network I/O
•Partitions/Slices
•Parallel tasks
Diagram: Disk → Transform 1 → RDD 1 → Transform 2 → RDD 2 (data partitions)
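Lineage-based recovery can be illustrated with a small sketch. The `ToyRDD` class below is hypothetical plain Python, not Spark's RDD implementation: each dataset remembers how to compute a partition from its parent, so a lost partition is rebuilt by re-running that function rather than by reading a replica.

```python
class ToyRDD:
    """Toy read-only, partitioned dataset that recovers partitions via lineage."""

    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self._compute = compute  # lineage: partition index -> list of records
        self._cache = {}         # partitions currently held in memory

    @classmethod
    def from_list(cls, data, num_partitions):
        chunks = [list(data[i::num_partitions]) for i in range(num_partitions)]
        return cls(num_partitions, lambda i: list(chunks[i]))

    def map(self, fn):
        # The child stores only the recipe; the parent data is not copied.
        return ToyRDD(
            self.num_partitions,
            lambda i: [fn(x) for x in self.partition(i)],
        )

    def partition(self, i):
        if i not in self._cache:
            self._cache[i] = self._compute(i)  # recompute from lineage
        return self._cache[i]

    def lose_partition(self, i):
        """Simulate losing the in-memory copy of one partition."""
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


if __name__ == "__main__":
    doubled = ToyRDD.from_list([1, 2, 3, 4, 5, 6], 3).map(lambda x: x * 2)
    before = doubled.collect()
    doubled.lose_partition(1)
    print(doubled.collect() == before)
```

Only the lost partition is recomputed, and partitions are independent of each other, which is also what allows them to be processed by parallel tasks.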
Lambda architecture pattern 
•Spark can be used to implement the Lambda architecture
•Batch layer 
•Speed layer 
•Serving layer 
Diagram labels: Input, Batch Layer, Speed Layer, Serving Layer, Query, Data consumers
Spark Streaming 
•For stream processing in Spark 
•Real time data 
•Like Twitter queries 
•Discretized streams (DStreams)
•Micro batches 
•Sequence of RDDs
Discretized Streams
Diagram: Input → Spark Streaming → Batches of x seconds → Spark → Output
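Micro-batching can be sketched in a few lines of plain Python. The functions `discretize` and `process_stream` are hypothetical names, not Spark Streaming calls: timestamped events are grouped into fixed intervals of x seconds, and the same batch function is then applied to every interval, including empty ones.

```python
def discretize(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    buckets = {}
    for timestamp, value in events:
        buckets.setdefault(int(timestamp // batch_seconds), []).append(value)
    if not buckets:
        return []
    first, last = min(buckets), max(buckets)
    # Intervals with no events still yield an (empty) batch.
    return [buckets.get(i, []) for i in range(first, last + 1)]


def process_stream(events, batch_seconds, batch_fn):
    """Apply an ordinary batch function to each micro-batch in order."""
    return [batch_fn(batch) for batch in discretize(events, batch_seconds)]


if __name__ == "__main__":
    events = [(0.1, "a"), (0.7, "b"), (1.2, "c"), (3.5, "d")]
    print(discretize(events, 1.0))
    print(process_stream(events, 1.0, len))
```

The batch interval sets a floor on latency, since an event is not processed until its interval closes. This is the trade-off behind the sub-second to seconds latency range usually quoted for micro-batch systems.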
Why Spark Streaming 
•Near real time processing (0.5 – 2 sec latency) 
•Parallel recovery of lost nodes and stragglers 
•Implementation of Lambda architecture 
•Single engine for batch and stream 
•Not suited for low-latency requirements (e.g., 100 ms)
Apache Storm vs Spark Streaming
| Feature | Spark Streaming | Storm |
|---|---|---|
| Processing model | Micro-batching | Event stream processing |
| Message delivery options | Inherently fault tolerant, exactly-once delivery | At least once, at most once, exactly once |
| Flexibility | Coarse-grained transformation | Fine-grained transformation |
| Implemented in | Scala | Clojure |
| Development cost | Common platform for both batch and stream | Only stream; separate setup for batch |
| Applicability | Machine learning, interactive analytics, near-real-time analytics | Near-real-time analytics, natural language processing |
GraphX & MLlib 
•Data-parallel vs graph-parallel processing
•Wikipedia search vs Facebook connection search, PageRank
•Spark MLlib implements high-quality machine learning algorithms
•Iterative algorithm paradigm
•Leverages Spark's in-memory data sets
$x^{(t+1)} = f\left(x^{(t)}\right)$
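The iterative paradigm, where each state is computed from the previous one as x(t+1) = f(x(t)), can be sketched with a small PageRank loop in plain Python. The helper names `iterate` and `pagerank_step` are hypothetical and this is not GraphX or MLlib code. The sketch assumes every node has at least one outgoing link.

```python
def iterate(f, x0, tol=1e-9, max_steps=1000):
    """Repeat x <- f(x) until successive states differ by less than tol."""
    x = x0
    for step in range(1, max_steps + 1):
        nxt = f(x)
        if max(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt, step
        x = nxt
    return x, max_steps


def pagerank_step(links, damping=0.85):
    """Build f for PageRank; links[i] lists the nodes that node i points to."""
    n = len(links)

    def f(rank):
        nxt = [(1.0 - damping) / n] * n
        for src, outs in enumerate(links):
            share = damping * rank[src] / len(outs)  # assumes outs is non-empty
            for dst in outs:
                nxt[dst] += share
        return nxt

    return f


if __name__ == "__main__":
    links = [[1, 2], [2], [0]]  # tiny 3-node graph kept in memory
    ranks, steps = iterate(pagerank_step(links), [1.0 / 3] * 3)
    print(ranks, steps)
```

Every step reads the same `links` structure again. Keeping that data in memory between steps, instead of reloading it from disk on each pass, is the benefit the slide attributes to in-memory data sets.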
Performance characteristics 
Performance of Spark 
•Up to 100x faster than Hadoop MapReduce in memory
•Up to 10x faster on disk
Graph courtesy: spark.apache.org
Hadoop vs Spark
| | Hadoop World Record | Spark 100 TB * | Spark 1 PB |
|---|---|---|---|
| Data size | 102.5 TB | 100 TB | 1000 TB |
| Elapsed time | 72 mins | 23 mins | 234 mins |
| # Nodes | 2100 | 206 | 190 |
| # Cores | 50400 | 6592 | 6080 |
| # Reducers | 10,000 | 29,000 | 250,000 |
| Rate | 1.42 TB/min | 4.27 TB/min | 4.27 TB/min |
| Rate/node | 0.67 GB/min | 20.7 GB/min | 22.5 GB/min |
Data courtesy: databricks.com
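The Rate/node row can be cross-checked from the Rate and # Nodes rows. The sketch below is a plain arithmetic check with hypothetical names, assuming decimal units (1 TB = 1000 GB). The inputs are the rounded figures from the table, so small differences from the published per-node values are expected.

```python
GB_PER_TB = 1000  # assumption: decimal units

# Figures copied from the table: (rate in TB/min, nodes, published GB/min per node)
RUNS = {
    "Hadoop World Record": (1.42, 2100, 0.67),
    "Spark 100 TB": (4.27, 206, 20.7),
    "Spark 1 PB": (4.27, 190, 22.5),
}


def per_node_rate(rate_tb_per_min, nodes):
    """Convert a cluster-wide rate in TB/min to GB/min per node."""
    return rate_tb_per_min * GB_PER_TB / nodes


def per_node_speedup(run, baseline):
    """Ratio of the per-node rates of two runs from RUNS."""
    run_rate, run_nodes, _ = RUNS[run]
    base_rate, base_nodes, _ = RUNS[baseline]
    return per_node_rate(run_rate, run_nodes) / per_node_rate(base_rate, base_nodes)


if __name__ == "__main__":
    for name, (rate, nodes, published) in RUNS.items():
        print(f"{name}: {per_node_rate(rate, nodes):.2f} GB/min/node "
              f"(published {published})")
    print(f"{per_node_speedup('Spark 100 TB', 'Hadoop World Record'):.1f}x per node")
```

On these numbers the Spark 100 TB run moves roughly 30 times more data per node per minute than the Hadoop record run, with about a tenth of the nodes.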
1 TB performance test: data per sec
1 TB performance test data rate vs RAM size
Summary
Apache Spark
•New architecture
•RDD, DAG
•In-memory processing
•MapReduce and more
•GraphX
•MLlib
•Spark Streaming
Ecosystem tools
•SparkR
•BlinkDB
•Storm
Spark performance
•GBs per second
•RAM to data size
•Inflexion point
Questions 
Prajod Vettiyattil 
Architect, Open source 
Wipro 
@prajods 
in.linkedin.com/in/prajod 
Namitha M S 
Architect, Advanced Technologies 
Wipro 
in.linkedin.com/in/namithams
