Introduction to Spark
Ivan Morozov
In this talk…
• Introduction to Spark
• Resilient Distributed Datasets (RDDs)
• How to build a statistical model
• Lessons Learned
History
• Started as a project of the AMPLab at UC Berkeley in 2009
• Open sourced in 2010
• Apache Incubator project since June 2013
• Databricks was founded in 2013 as the company behind Spark
• Top-level project at Apache since February 2014
State of Play
• 100% open source
• 300+ contributors
• 50+ organisations
Why a new programming model for BigData analysis?
[Figure: two disk-based workflows. Iterative model: Input, then ITERATION steps separated by HDFS write and HDFS read. Ad-hoc querying: Input, HDFS read, Query 1 to Query 3, Result 1 to Result 3.]
Why a new programming model for BigData analysis?
[Figure: the same two workflows with data kept in MEMORY. Iterative model: Input, one HDFS read, then ITERATION steps that share data through MEMORY. Ad-hoc querying: Input, one HDFS read / pre-process step into MEMORY, Query 1 to Query 3, Result 1 to Result 3.]
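The difference between the two slides can be sketched in plain Scala, without Spark. This is only an illustration under stated assumptions: `readInput` is a hypothetical stand-in for an HDFS read, `step` is an arbitrary iteration step chosen for the example, and the sketch models only the repeated reads, not the intermediate HDFS writes.

```scala
object IterationSketch {
  // Counts how often the (simulated) stable storage is read.
  var storageReads = 0

  // Hypothetical stand-in for reading the input from HDFS.
  def readInput(): Vector[Double] = {
    storageReads += 1
    Vector(1.0, 2.0, 3.0, 4.0)
  }

  // One iteration step: move the estimate halfway towards the mean of the data.
  def step(data: Vector[Double], estimate: Double): Double =
    estimate + 0.5 * (data.sum / data.size - estimate)

  // Disk-based model: every iteration reads the input again.
  def runWithReload(iterations: Int): Double =
    (1 to iterations).foldLeft(0.0)((est, _) => step(readInput(), est))

  // In-memory model: the input is read once and reused by all iterations.
  def runCached(iterations: Int): Double = {
    val cached = readInput()
    (1 to iterations).foldLeft(0.0)((est, _) => step(cached, est))
  }
}
```

Both variants compute the same estimate; only the number of storage reads differs (one per iteration versus one in total).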
Facts
• Implemented in ~14,000 lines of Scala
• APIs for Scala, Python and Java
• Scales vertically and horizontally
• Fault tolerance and fast recomputation
• Load balancing
• Built on top of an in-memory cluster computing data structure with a rich set of operations
• Libraries for machine learning, graph computation, stream processing and ad-hoc querying
• API to control computation flow and persistence management
• Different deployment options (Standalone, YARN, MR)
• Interoperability with a lot of systems (Hive, EC2, Mesos, HBase …)
Architecture
[Figure: layered stack. Top: Spark SQL, Spark Streaming, MLlib, GraphX. Below: Apache Spark. Below: MESOS. Below: four WORKER nodes. Storage: HDFS, Cassandra, S3.]
Programming Model
[Figure: a Driver with the Spark Context takes the Input, sends Input Data and Tasks to three WORKER nodes, and receives Results.]
• Input: HDFS cluster, Hadoop, Hive…
• Input Data: RDD[T]
• Tasks: serialized Java objects
• Result: RDD[A]
• Computing: in memory
• Driver: Spark program
Programming Model
import org.apache.spark.{SparkConf, SparkContext}

// Run locally with two worker threads
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("ScalaMeetup")
  .set("spark.executor.memory", "1g")

val sc = new SparkContext(conf)

val data = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
val biggerAsFive = data.filter(_ > 5)

// Mark the filtered RDD for caching; collect() is the action that materializes it
biggerAsFive.cache().collect()

val exp = biggerAsFive.map(x => x * x)
val result = exp.reduce(_ + _)
INFO DAGScheduler: Stage 62 (reduce at <console>:18) finished in 0.021 s
INFO SparkContext: Job finished: reduce at <console>:18, took 0.024262042 s
result: Int = 330
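The result of 330 can be checked by hand with ordinary Scala collections, which offer the same filter, map and reduce operations. This is a local sketch for verifying the arithmetic only; it does not involve Spark, and the object name `LocalPipelineCheck` is made up for the example.

```scala
object LocalPipelineCheck {
  val data = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

  // Keep the values greater than five: 6, 7, 8, 9, 10
  val biggerAsFive = data.filter(_ > 5)

  // Square each value: 36, 49, 64, 81, 100
  val exp = biggerAsFive.map(x => x * x)

  // Sum the squares: 36 + 49 + 64 + 81 + 100 = 330
  val result = exp.reduce(_ + _)
}
```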
Launch with Spark-Submit
# Run on a YARN cluster
# (--master can also be `yarn-client` for client mode)
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.ScalaMeetupMain \
  --master yarn-cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000
You can run your packaged apps with the spark-submit command
Not covered today…
• MLlib, GraphX, Spark SQL, Spark Streaming
• Communication with third-party systems
• …
RDD Creation
• Distributed, immutable memory abstraction
• Can only be created through:
  • reading from stable storage
  • transforming another RDD
dataFile.txt:
line1 with
line2 some
line3 data
import org.apache.spark.rdd.RDD

// Creation by reading from stable storage (RDD[T] with T = String)
val data: RDD[String] = sc.textFile("dataFile.txt")
data.foreach(println)
// line1 with
// line2 some
// line3 data

// Creation by transforming another RDD (RDD[A] with A = String)
val data_ : RDD[String] = data.filter(_.startsWith("line1"))
data_.foreach(println)
// line1 with
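RDD Operations
Picture from Resilient Distributed Datasets paper.
val data: RDD[String] = sc.parallelize(
  List("a", "b", "I_AM_A_VERY_LONG_STRING"))

case class LongString(element: String)

val nothingFound = "NO Long STRING FOUND"

// Wrap strings longer than 10 characters, otherwise return the fallback text
val result = data.map { (x: String) =>
  if (x.nonEmpty) {
    // Option stands in for a lookup that may or may not find a long string
    val longString = Some(x).filter(_.length > 10)

    longString match {
      case Some(s) => LongString(s)
      case None    => nothingFound
    }
  } else {
    nothingFound
  }
}

result.collect()
// three elements: the fallback text twice, then LongString(I_AM_A_VERY_LONG_STRING)
• Transformations are lazy operations: they are only evaluated when an action needs their result
• Actions (such as reduce or collect) trigger the computation and return a result to the driver
• Functions passed to reduce should be associative (and commutative), because partial results are combined in no fixed order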
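Why associativity matters can be shown with a small plain-Scala sketch. It assumes, as a simplification, that a distributed reduce first reduces each partition and then combines the partial results; `partitionedReduce` is a hypothetical helper written for this example, not a Spark API.

```scala
object AssociativitySketch {
  // Reduce each partition on its own, then combine the partial results.
  def partitionedReduce(partitions: Seq[Seq[Int]])(f: (Int, Int) => Int): Int =
    partitions.filter(_.nonEmpty).map(_.reduce(f)).reduce(f)

  // The same six numbers, split into one, two and three partitions.
  val onePartition    = Seq(Seq(1, 2, 3, 4, 5, 6))
  val twoPartitions   = Seq(Seq(1, 2, 3), Seq(4, 5, 6))
  val threePartitions = Seq(Seq(1, 2), Seq(3, 4), Seq(5, 6))

  // Addition is associative: the partitioning does not change the result.
  val sums = Seq(onePartition, twoPartitions, threePartitions)
    .map(p => partitionedReduce(p)(_ + _))

  // Subtraction is not associative: the result depends on the partitioning.
  val differences = Seq(onePartition, twoPartitions, threePartitions)
    .map(p => partitionedReduce(p)(_ - _))
}
```

With addition all three partitionings give 21, while subtraction gives -19, 3 and 1, so a non-associative function makes the result depend on how the data happens to be distributed.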
RDD Representation
• Partitions: list of atomic pieces of the given dataset
• Dependencies between parent and child RDDs
  • Narrow (e.g. map)
  • Wide (e.g. join)
• RDDs are immutable
• Lineage graph to ensure fault recovery
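The idea of an immutable dataset that remembers its lineage can be sketched on a single machine in a few lines of Scala. This is a toy model, not Spark's implementation: `Dataset`, `Source`, `Mapped` and `Filtered` are names invented for the example, there are no partitions, and only narrow dependencies are modelled.

```scala
object LineageSketch {
  // Each node only records how to derive its data from its parent (its lineage).
  sealed trait Dataset[A] {
    def compute(): Vector[A]
    def map[B](f: A => B): Dataset[B] = Mapped(this, f)
    def filter(p: A => Boolean): Dataset[A] = Filtered(this, p)
  }

  // Root of the lineage: data loaded from (simulated) stable storage.
  final case class Source[A](load: () => Vector[A]) extends Dataset[A] {
    def compute(): Vector[A] = load()
  }

  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    def compute(): Vector[B] = parent.compute().map(f)
  }

  final case class Filtered[A](parent: Dataset[A], p: A => Boolean) extends Dataset[A] {
    def compute(): Vector[A] = parent.compute().filter(p)
  }

  // Counts how often the source is actually loaded.
  var loads = 0

  val source: Dataset[Int] = Source(() => {
    loads += 1
    Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
  })

  // Building the lineage is lazy: nothing is loaded or computed here.
  val squares: Dataset[Int] = source.filter(_ > 5).map(x => x * x)
}
```

Calling `squares.compute()` walks the lineage back to the source and recomputes the data. A lost result can be rebuilt the same way, which is the principle behind lineage-based fault recovery.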
RDD Job Scheduling & Fault Recovery
myRDD0.persist()
myRDD1.map(func)
  .union(myRDD2)
  .join(myRDD0)
[Figure: lineage graph of the job. map(…) is applied to myRDD1, the output is combined with myRDD2 by union(…) and then with the persisted myRDD0 by join(…) to produce the Result.]
Not covered today…
• Interpreter integration
• Memory management
• Checkpointing policies
• …
Coffee Time
Conclusion
Code
Most important advice
Read the excellent documentation
JAR Assembling
• Use consistent Scala, Akka and Spark versions
• If you combine Akka and Spark, use the same Akka build as Spark (shaded-protobuf)
• A project that runs well unpackaged will not necessarily behave the same way as a JAR
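One way to keep versions consistent is to pin them in a single place in the sbt build. The fragment below is a minimal sketch, not the build used in the talk: the version numbers are assumptions picked for illustration and should be replaced by whatever the target cluster runs.

```scala
// build.sbt (sketch; version numbers are assumptions, not taken from the talk)
val sparkVersion = "1.0.2"

// Must match the Scala version the Spark artifacts were built for
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // "provided": the cluster supplies Spark at runtime, so it stays out of the assembled JAR
  "org.apache.spark" %% "spark-core"  % sparkVersion % "provided",
  "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided"
)
```

Using `%%` lets sbt append the Scala binary version to the artifact name, which avoids mixing artifacts built for different Scala versions.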
Work with data
• Cache your data: 1 TB of data in memory, 5-7 seconds vs. ~3 minutes
• Work with a small subset of the data to validate complex transformations
• MLlib is not enough; consult a statistician to validate your model
• Well-placed logging can save you a lot of time
Some evaluation
[Figure: bar chart comparing Spark with the old system for 30 GB of data; time axis from 0 to 1d 12h.]
Future
https://databricks.com/spark
Resources
Spark Doc: https://spark.apache.org/docs/
Databricks Doc: https://databricks.com/spark
Typesafe Spark Workshop: http://typesafe.com/activator/template/spark-workshop
Spark Paper: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
RDD Paper: http://www.cs.berkeley.edu/~matei/papers/2011/tr_spark.pdf
Conclusion
• Fits best into batch processing use cases
• Scala was absolutely the right choice for Spark
• Large community support is helpful
• RDDs are a powerful and flexible data structure
Question Time
Thanks for Watching

Scala Meetup Hamburg - Spark

  • 2. In this talk… Introduction to Spark Resilient Distributed Datasets (RDDs) How to build a statistical model Lessons Learned
  • 3. 3 • Started as a project of AMP Lab at UC Berkley in 2009 ! • Open sourced in 2010 ! • Apache Incubator project since June 2013 ! • Databricks was founded 2013 as company behind Spark ! • Top Level Project at Apache in February 2014 History
  • 4. State of Play4 100 % Open  Source 300+ 50+ Contributors Organisations
  • 5. Introduction to Spark Resilient Distributed Datasets (RDDs) How to build a statistical model Lessons Learned
  • 6. Why a new programming model for BigData analysis?6 ITERATION ITERATION Input HDFS write HDFS write HDFS write … HDFS read Input Query  1 Query  3 Query  2 HDFS read Result  1 Result  2 Result  3 Iterative Model Ad-Hoc querying
  • 7. Why a new programming model for Big Data analysis?
      [Figure: the same two panels, "Iterative Model" and "Ad-Hoc querying", now with MEMORY between the steps. Labels in the iterative panel: Input, HDFS read, ITERATION, MEMORY. Labels in the ad-hoc panel: Input, HDFS read / pre-process, MEMORY, Query 1-3, Result 1-3.]
  • 8. Facts
      • Implemented in ~14,000 lines of Scala
      • APIs for Scala, Python and Java
      • Scales vertically and horizontally
      • Fault tolerance and fast recomputation
      • Load balancing
      • Built on an in-memory cluster computing data structure with a rich set of operations
      • Libraries for machine learning, graph computation, stream processing and ad-hoc querying
      • API to control computation flow and persistence management
      • Different deployment options (Standalone, YARN, MR)
      • Interoperability with many systems (Hive, EC2, Mesos, HBase …)
  • 9. Architecture
      [Figure: layered stack, top to bottom: Spark SQL, Spark Streaming, MLlib, GraphX; Apache Spark; Mesos; four WORKER nodes; HDFS, Cassandra, S3.]
  • 10. Programming Model
      [Figure: a Driver with a Spark Context and three WORKER nodes; labelled flows: Input, Input Data, Tasks, Results.]
      • Input: HDFS cluster, Hadoop, Hive…
      • Input Data: RDD[T]
      • Tasks: serialized Java objects
      • Result: RDD[A]
      • Computing: in memory
      • Driver: Spark program
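  • 11. Programming Model

      import org.apache.spark.{SparkConf, SparkContext}

      // Run locally with two worker threads
      val conf = new SparkConf()
        .setMaster("local[2]")
        .setAppName("ScalaMeetup")
        .set("spark.executor.memory", "1g")

      val sc = new SparkContext(conf)

      // Distribute a local collection and keep the elements greater than 5
      val data = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
      val biggerAsFive = data.filter(_ > 5)

      // cache() marks the RDD for in-memory reuse; collect() triggers the job
      biggerAsFive.cache().collect()

      // Square the cached elements and sum them
      val exp = biggerAsFive.map(x => x * x)
      val result = exp.reduce(_ + _)

      INFO DAGScheduler: Stage 62 (reduce at <console>:18) finished in 0.021 s
      INFO SparkContext: Job finished: reduce at <console>:18, took 0.024262042 s
      result: Int = 330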
  • 12. Launch with Spark-Submit
      You can run your packaged apps with the spark-submit command.

      # Run on a YARN cluster
      # (--master can also be `yarn-client` for client mode)
      export HADOOP_CONF_DIR=XXX
      ./bin/spark-submit \
        --class org.apache.spark.examples.ScalaMeetupMain \
        --master yarn-cluster \
        --executor-memory 20G \
        --num-executors 50 \
        /path/to/examples.jar \
        1000
  • 13. Not covered today…
      • MLlib, GraphX, Spark SQL, Spark Streaming
      • Communication with third-party systems
      • …
  • 15. RDD Creation
      • Distributed, immutable memory abstraction
      • Can only be created by:
          • reading from stable storage
          • transforming another RDD

      dataFile.txt:
        line1 with
        line2 some
        line3 data

      // Create an RDD by reading from stable storage
      val data: RDD[String] = sc.textFile("dataFile.txt")
      data.foreach(println)
      // line1 with
      // line2 some
      // line3 data

      // Create an RDD by transforming another RDD
      val data_ : RDD[String] = data.filter(_.startsWith("line1"))
      data_.foreach(println)
      // line1 with
  • 16. RDD Operations
      [Picture from the Resilient Distributed Datasets paper.]
      • Transformations are lazy operations
      • Actions are reduce-style operations; they trigger the computation
      • Functions passed to reduce should be associative (and commutative)

      import org.apache.spark.rdd.RDD

      // sc is the SparkContext
      val data: RDD[String] =
        sc.parallelize(List("a", "b", "I_AM_A_VERY_LONG_STRING"))

      case class LongString(element: String)

      val nothingFound = "NO Long STRING FOUND"

      // Wrap strings longer than 10 characters in LongString;
      // every other element maps to the fallback message.
      val result = data.map { (x: String) =>
        if (x.nonEmpty) {
          val longString = Option(x).filter(_.length > 10)
          longString match {
            case Some(s) => LongString(s)
            case None    => nothingFound
          }
        } else nothingFound
      }

      result.collect()
      // "a" and "b" yield nothingFound; the long string yields
      // LongString("I_AM_A_VERY_LONG_STRING")
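The two pieces of advice on this slide, that transformations are lazy and that reduce functions should be associative, can be tried without a cluster. The sketch below is plain Scala, not Spark: a collection view stands in for an RDD's transformations, and `partitionedReduce` is a hypothetical helper that mimics reducing each partition locally and then combining the partial results. All names are made up for illustration.

```scala
object LazyDemo {
  // Counts how often the filter predicate actually runs.
  var predicateCalls = 0

  val data = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

  // "Transformations": building the view runs nothing yet.
  val squaresAboveFive = data.view
    .filter { x => predicateCalls += 1; x > 5 }
    .map(x => x * x)
  val callsBeforeAction = predicateCalls // still 0

  // "Action": folding the view forces the whole pipeline once.
  val total = squaresAboveFive.foldLeft(0)(_ + _)
  val callsAfterAction = predicateCalls // one call per input element
}

object ReduceDemo {
  // Reduce every partition locally, then combine the partial results.
  def partitionedReduce(partitions: List[List[Int]])(op: (Int, Int) => Int): Int =
    partitions.map(_.reduce(op)).reduce(op)

  val onePartition  = List(List(1, 2, 3, 4))
  val twoPartitions = List(List(1, 2), List(3, 4))

  // Addition is associative: the partitioning does not change the result.
  val sumOne = partitionedReduce(onePartition)(_ + _)
  val sumTwo = partitionedReduce(twoPartitions)(_ + _)

  // Subtraction is not: the same data gives different results.
  val diffOne = partitionedReduce(onePartition)(_ - _)
  val diffTwo = partitionedReduce(twoPartitions)(_ - _)
}
```

Here `sumOne` and `sumTwo` agree, while `diffOne` and `diffTwo` do not. Since the partitioning of a real dataset is usually not under the caller's control, a non-associative function would give results that depend on the cluster layout.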
  • 17. RDD representation
      • Partitions: a list of atomic pieces of the dataset
      • Dependencies between parent and child RDDs
          • Narrow (e.g. map)
          • Wide (e.g. join)
      • RDDs are immutable
      • Lineage graph to ensure fault recovery
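To make the lineage idea on this slide concrete, here is a toy model in plain Scala. It is a sketch under stated assumptions, not Spark's implementation: `Node`, `Source`, `Mapped`, `Filtered` and `LineageDemo` are hypothetical names, and only narrow dependencies are modelled. Each node remembers its parent and the function applied to it, so a lost partition can be recomputed from the source rather than restored from a replica.

```scala
// Toy lineage model (hypothetical names, not Spark classes).
sealed trait Node[A] {
  // Recompute a single partition by walking the lineage back to the source.
  def compute(partition: Int): List[A]
}

// Data in stable storage, already split into partitions.
final case class Source[A](partitions: Vector[List[A]]) extends Node[A] {
  def compute(partition: Int): List[A] = partitions(partition)
}

// Narrow dependency: partition i depends only on parent partition i.
final case class Mapped[A, B](parent: Node[A], f: A => B) extends Node[B] {
  def compute(partition: Int): List[B] = parent.compute(partition).map(f)
}

// Also a narrow dependency.
final case class Filtered[A](parent: Node[A], p: A => Boolean) extends Node[A] {
  def compute(partition: Int): List[A] = parent.compute(partition).filter(p)
}

object LineageDemo {
  val source = Source(Vector(List(1, 2, 3), List(4, 5, 6)))

  // Lineage: source -> filter(even) -> map(square). Nothing is stored
  // for the intermediate steps, only the recipe.
  val squaresOfEven =
    Mapped(Filtered(source, (x: Int) => x % 2 == 0), (x: Int) => x * x)

  // If partition 1 of the result is lost, only partition 1 of each
  // ancestor has to be recomputed.
  val recovered = squaresOfEven.compute(1)
}
```

A wide dependency such as join would need every parent partition to rebuild one child partition, which is why recovering from a failure there is more expensive. In Spark itself, `toDebugString` on an RDD prints its lineage.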
  • 18. RDD job scheduling & fault recovery

      myRDD0.persist()
      myRDD1.map(func)
        .union(myRDD2)
        .join(myRDD0)

      [Figure: lineage graph for the code above. Labels: myRDD1, myRDD2, myRDD0, map(…), union(…), join(…), Result. A second graph repeats map(…), union(…) and join(…) for RDD1, RDD2 and RDD0.]
  • 19. Not covered today…
      • Interpreter integration
      • Memory management
      • Checkpointing policies
      • …
  • 24. Most important advice
      • Read the excellent documentation
  • 25. JAR Assembling
      • Use consistent Scala, Akka and Spark versions
      • If you combine Akka and Spark, use the same Akka build (shaded-protobuf)
      • A project that runs well unpackaged will not necessarily behave the same way as an assembled JAR
  • 26. Work with data
      • Cache your data: 1 TB of data in memory: 5-7 seconds vs. ~3 minutes
      • Work with a small subset of the data to validate complex transformations
      • MLlib is not enough; consult a statistician to validate your model
      • Well-placed logging can save you a lot of time
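Why caching pays off when the same data feeds several actions can be shown with a small plain-Scala analogy. This is a sketch, not Spark code: `CacheDemo`, `load` and `runActions` are hypothetical names, the counter stands in for expensive reads, and materialising the view with `toVector` plays the role that `cache()` plays for an RDD.

```scala
object CacheDemo {
  // Counts how many records were "read"; stands in for expensive I/O.
  private var loads = 0
  private def load(id: Int): Int = { loads += 1; id * 2 }

  private val ids = (1 to 1000).toList

  // Two "actions" over the same records: a sum and a maximum.
  // Returns (sum, maximum, number of reads performed).
  private def runActions(records: Iterable[Int]): (Int, Int, Int) = {
    val total   = records.foldLeft(0)(_ + _)
    val maximum = records.foldLeft(Int.MinValue)(_ max _)
    (total, maximum, loads)
  }

  // Without caching: the lazy view re-reads every record for each action.
  def withoutCache(): (Int, Int, Int) = {
    loads = 0
    runActions(ids.view.map(load))
  }

  // With caching: materialise once, then both actions reuse the result.
  def withCache(): (Int, Int, Int) = {
    loads = 0
    runActions(ids.view.map(load).toVector)
  }
}
```

Both variants return the same sum and maximum, but the uncached one performs twice as many reads. The gap grows with every additional action on the same data.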
  • 27. Some evaluation
      [Figure: chart comparing Spark with the old system on 30 GB of data; time axis from 0 to 1d 12h.]
  • 29. Resources
      • Spark Doc: https://spark.apache.org/docs/
      • Databricks Doc: https://databricks.com/spark
      • Typesafe Spark Workshop: http://typesafe.com/activator/template/spark-workshop
      • Spark Paper: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
      • RDD Paper: http://www.cs.berkeley.edu/~matei/papers/2011/tr_spark.pdf
  • 30. Conclusion
      • Fits best into batch processing use cases
      • Scala was absolutely the right choice for Spark
      • Large community support is helpful
      • RDDs are a powerful and flexible data structure