SlideShare une entreprise Scribd logo
1  sur  42
Télécharger pour lire hors ligne
Performance Optimization Case
Study: Shattering Hadoop's Sort
Record with Spark and Scala
Reynold Xin (@rxin)
March 17, 2015
Who am I?
Reynold Xin (@rxin)
Co-founder, Databricks
Apache Spark committer & maintainer on core
API, shuffle, network, block manager, SQL, GraphX
On leave from PhD @ UC Berkeley AMPLab
Spark is In-Memory and Fast
0.96
110
0 25 50 75 100 125
Logistic Regression
4.1
155
0 30 60 90 120 150 180
K-Means Clustering
Hadoop MR
Spark
Time per Iteration (s)
“Spark is in-memory. It doesn’t work with BIG
DATA.”
“It is too expensive to buy a cluster with enough
memory to fit our data.”
Common Misconception About Spark
“Spark is in-memory. It doesn’t work with BIG
DATA.”
“It is too expensive to buy a cluster with enough
memory to fit our data.”
Spark Project Goals
Works well with GBs, TBs, or PBs or data
Works well with different storage media (RAM,
HDDs, SSDs)
Works well from low-latency streaming jobs to
long batch jobs
Spark Project Goals
Works well with KBs, MBs, … or PBs or data
Works well with different storage media (RAM,
HDDs, SSDs)
Works well from low-latency streaming jobs to
long batch jobs
Sort Benchmark
Originally sponsored by Jim Gray to measure
advancements in software and hardware in 1987
Participants often used purpose-built hardware/
software to compete
–  Large companies: IBM, Microsoft, Yahoo, …
–  Academia: UC Berkeley, UCSD, MIT, …
Sort Benchmark
1MB -> 100MB -> 1TB (1998) -> 100TB (2009)
Sorting 100 TB and 1 PB
Participated in the most challenging category
–  Daytona GraySort
–  Sort 100TB in a fault-tolerant manner
–  1 trillion records generated by a 3rd-party harness (10B
key + 90B payload)
Set a new world record (tied with UCSD)
–  Saturated throughput of 8 SSDs and 10Gbps network
–  1st time public cloud + open source won
Sorted 1PB on our own to push scalability
11
On-Disk Sort Record:
Time to sort 100TB
2100 machines2013 Record:
Hadoop
2014 Record:
Spark
Source: Daytona GraySort benchmark, sortbenchmark.org
72 minutes
207 machines
23 minutes
Also sorted 1PB in 4 hours
Why Sorting?
Sorting stresses “shuffle”, which underpins
everything from SQL to machine learning (group
by, join, sort, ALS, etc)
Sorting is challenging because there is no
reduction in data.
Sort 100TB = 500TB disk I/O + 200TB network
What made this possible?
Power of the Cloud
Engineering Investment in Spark
–  Sort-based shuffle (SPARK-2045)
–  Netty native network transport (SPARK-2468)
–  External shuffle service (SPARK-3796)
Clever Application Level Techniques
–  Avoid Scala/JVM gotchas
–  Cache-friendly & GC-friendly sorting
–  Pipelining
What made this possible?
Power of the Cloud
Engineering Investment in Spark
–  Sort-based shuffle (SPARK-2045)
–  Netty native network transport (SPARK-2468)
–  External shuffle service (SPARK-3796)
Clever Application Level Techniques
–  Avoid Scala/JVM gotchas
–  Cache-friendly & GC-friendly sorting
–  Pipelining
Sort-based Shuffle
Old hash-based shuffle: require R
(#reduce tasks) concurrent streams
with buffers, i.e. limits R.
New sort-based shuffle: sort
records by partition first, and then
write them out. One active stream
at a time.
Went up to 250,000 tasks in PB sort
Network Transport
High performance event-driven architecture using
Netty (SEDA-like)
Explicitly manages pools of memory outside JVM
garbage collection
Zero-copy (avoid user space buffer through
FileChannel.transferTo)
Network Transport
Sustaining ~1.1GB/node/s
What made this possible?
Power of the Cloud
Engineering Investment in Spark
–  Sort-based shuffle (SPARK-2045)
–  Netty native network transport (SPARK-2468)
–  External shuffle service (SPARK-3796)
Clever Application Level Techniques
–  Avoid Scala/JVM gotchas
–  Cache-friendly & GC-friendly sorting
–  Pipelining
Warning
The guidelines here apply only to hyper
performance optimizations
–  Majority of the code should not be perf sensitive
Measure before you optimize
–  It is ridiculously hard to write good benchmarks
–  Use jmh if you want to do that
http://github.com/databricks/style-style-guide
Prefer while loops
array.zipWithIndex.foreach { case (x, i) =>
…
}
vs
var i = 0
while (i < array.length) {
...
i += 1
}
Avoid Collection Library
Avoid Scala collection library
–  Generally slower than java.util.*
Avoid Java collection library
–  Not specialized (too many objects)
–  Bad cache locality
Prefer private[this]
A.scala:
class A {
private val field = 0
def accessor = field
}
scalac –print A.scala
Class A extends Object {
private[this] val field: Int = _;
<stable> <accessor> private def field(): Int = A.this.field;
def accessor(): Int = A.this.field();
def <init>(): A = {
A.super.<init>();
A.this.field = 0;
()
}
};
JIT might not be able to inline this
Prefer private[this]
B.scala:
class B {
private[this] val field = 0
def accessor = field
}
scalac –print B.scala
Class B extends Object {
private[this] val field: Int = _;
def accessor(): Int = B.this.field;
def <init>(): B = {
B.super.<init>();
B.this.field = 0;
()
}
};
What made this possible?
Power of the Cloud
Engineering Investment in Spark
–  Sort-based shuffle (SPARK-2045)
–  Netty native network transport (SPARK-2468)
–  External shuffle service (SPARK-3796)
Clever Application Level Techniques
–  Avoid Scala/JVM gotchas
–  Cache-friendly & GC-friendly sorting
–  Pipelining
Garbage Collection
Hard to tune GC when memory utilization > 60%
Do away entirely with GC: off-heap allocation
GC Take 1: DirectByteBuffer
java.nio.ByteBuffer.allocateDirect(long size);
Limited size: 2GB only
Hard to recycle
–  only claimed by GC
–  possible to control with reflection, but there is a
static synchronize!
GC Take 2: sun.misc.Unsafe
Available on most JVM implementations
(frequently exploited by trading platforms)
Provide C-like explicit memory allocation
class Unsafe {
public native long allocateMemory(long bytes);
public native void freeMemory(long address);
…
}
GC Take 2: sun.misc.Unsafe
class Unsafe {
…
public native copyMemory(
long srcAddress, long destAddress,
long bytes);
public native long getLong(long address);
public native int getInt(long address);
...
}
Caveats when using Unsafe
Maintain a thread local pool of memory blocks to
avoid allocation overhead and fragmentation
(Netty uses jemalloc)
-ea during debugging, -da in benchmark mode
Cache Friendly Sorting
Naïve Scala/Java collection sort is extremely slow
–  object dereferences (random memory access)
–  polymorphic comparators (cannot be inlined)
Spark’s TimSort allows operating on custom data
memory layout
–  Be careful with inlining again
Data Layout Take 1
100B …
Consecutive Memory Block (size N * 100B)
Comparison: uses the 10B key to compare.
Swap: requires swapping 100B.
Too much memcpy!
Record format: 10B key + 90B payload
Data Layout Take 2
Swap: cheap (8B or 4B)
Comparison: requires
dereference 2 pointers
(random access)
Bad cache locality!
…
100bytearray
100bytearray
100bytearray
100bytearray
100bytearray
Array[Array[Byte]]
Data Layout Take 3
14 bytes …
Consecutive Memory Block (size N * 14B)
10B key prefix + 4B position
Sort comparison: using the 10B prefix inline
Swap: only swap 14B
Good cache locality & low memcpy overhead!
What made this possible?
Power of the Cloud
Engineering Investment in Spark
–  Sort-based shuffle (SPARK-2045)
–  Netty native network transport (SPARK-2468)
–  External shuffle service (SPARK-3796)
Clever Application Level Techniques
–  Avoid Scala/JVM gotchas
–  Cache-friendly & GC-friendly sorting
–  Pipelining
Pipelining
Add a semaphore to throttle concurrent “reads”
What did we do exactly?
Improve single-thread sorting speed
–  Better sort algorithm
–  Better data layout for cache locality
Reduce garbage collection overhead
–  Explicitly manage memory using Unsafe
Reduce resource contention via pipelining
Increase network transfer throughput
–  Netty (zero-copy, jemalloc)
Extra Readings
Databricks Scala Guide (Performance Section)
https://github.com/databricks/scala-style-guide
Blog post about this benchmark entry
http://tinyurl.com/spark-sort
Thanks!
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spark and Scala

Contenu connexe

Tendances

Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and RDatabricks
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Databricks
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkDatabricks
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
 
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Databricks
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Databricks
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesDatabricks
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Databricks
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's comingDatabricks
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkDatabricks
 
New directions for Apache Spark in 2015
New directions for Apache Spark in 2015New directions for Apache Spark in 2015
New directions for Apache Spark in 2015Databricks
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsDatabricks
 
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and REnabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and RDatabricks
 
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick WendellDatabricks
 
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...Databricks
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Spark Summit
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Databricks
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityDatabricks
 

Tendances (20)

Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and R
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache Spark
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 
New directions for Apache Spark in 2015
New directions for Apache Spark in 2015New directions for Apache Spark in 2015
New directions for Apache Spark in 2015
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
 
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and REnabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and R
 
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
 
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
 

En vedette

5 How The Model Works (With Notes)
5 How The Model Works (With Notes)5 How The Model Works (With Notes)
5 How The Model Works (With Notes)Abhishek Datta
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Introduction to scala for a c programmer
Introduction to scala for a c programmerIntroduction to scala for a c programmer
Introduction to scala for a c programmerGirish Kumar A L
 
Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)Denny Lee
 
Scala - A Scalable Language
Scala - A Scalable LanguageScala - A Scalable Language
Scala - A Scalable LanguageMario Gleichmann
 
Scala: Pattern matching, Concepts and Implementations
Scala: Pattern matching, Concepts and ImplementationsScala: Pattern matching, Concepts and Implementations
Scala: Pattern matching, Concepts and ImplementationsMICHRAFY MUSTAFA
 
Scala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaScala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaAlexander Dean
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebookragho
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopHakka Labs
 
Spark Meetup TensorFrames
Spark Meetup TensorFramesSpark Meetup TensorFrames
Spark Meetup TensorFramesJen Aman
 
Apache Spark and Online Analytics
Apache Spark and Online Analytics Apache Spark and Online Analytics
Apache Spark and Online Analytics Databricks
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Casesnzhang
 
A Basic Hive Inspection
A Basic Hive InspectionA Basic Hive Inspection
A Basic Hive InspectionLinda Tillman
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big DataDataWorks Summit
 
Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009Martin Odersky
 

En vedette (20)

5 How The Model Works (With Notes)
5 How The Model Works (With Notes)5 How The Model Works (With Notes)
5 How The Model Works (With Notes)
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012
 
Introduction to scala for a c programmer
Introduction to scala for a c programmerIntroduction to scala for a c programmer
Introduction to scala for a c programmer
 
Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)
 
Apache hive
Apache hiveApache hive
Apache hive
 
Python to scala
Python to scalaPython to scala
Python to scala
 
Scala - A Scalable Language
Scala - A Scalable LanguageScala - A Scalable Language
Scala - A Scalable Language
 
Indexed Hive
Indexed HiveIndexed Hive
Indexed Hive
 
Fun[ctional] spark with scala
Fun[ctional] spark with scalaFun[ctional] spark with scala
Fun[ctional] spark with scala
 
Scala: Pattern matching, Concepts and Implementations
Scala: Pattern matching, Concepts and ImplementationsScala: Pattern matching, Concepts and Implementations
Scala: Pattern matching, Concepts and Implementations
 
Scala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaScala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in Scala
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
 
Spark Meetup TensorFrames
Spark Meetup TensorFramesSpark Meetup TensorFrames
Spark Meetup TensorFrames
 
Apache Spark and Online Analytics
Apache Spark and Online Analytics Apache Spark and Online Analytics
Apache Spark and Online Analytics
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
A Basic Hive Inspection
A Basic Hive InspectionA Basic Hive Inspection
A Basic Hive Inspection
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big Data
 
Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009
 

Similaire à Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spark and Scala

Spark Programming
Spark ProgrammingSpark Programming
Spark ProgrammingTaewook Eom
 
A Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceA Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceTim Ellison
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Why Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data EraWhy Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data EraHandaru Sakti
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizationsGal Marder
 
Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark Summit
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemBojan Babic
 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkResource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkDatabricks
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Jen Aman
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopDatabricks
 
Yet another intro to Apache Spark
Yet another intro to Apache SparkYet another intro to Apache Spark
Yet another intro to Apache SparkSimon Lia-Jonassen
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson
 
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine LearningChris Fregly
 
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...Databricks
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2Gal Marder
 

Similaire à Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spark and Scala (20)

Spark Programming
Spark ProgrammingSpark Programming
Spark Programming
 
A Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceA Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark Performance
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Why Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data EraWhy Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data Era
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizations
 
Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin Odersky
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkResource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache Spark
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
 
Yet another intro to Apache Spark
Yet another intro to Apache SparkYet another intro to Apache Spark
Yet another intro to Apache Spark
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
 

Plus de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Plus de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Dernier

Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commercemanigoyal112
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
How To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTROHow To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTROmotivationalword821
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 

Dernier (20)

Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
How To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTROHow To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTRO
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 

Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spark and Scala

  • 1. Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spark and Scala Reynold Xin (@rxin) March 17, 2015
  • 2. Who am I? Reynold Xin (@rxin) Co-founder, Databricks Apache Spark committer & maintainer on core API, shuffle, network, block manager, SQL, GraphX On leave from PhD @ UC Berkeley AMPLab
  • 3. Spark is In-Memory and Fast 0.96 110 0 25 50 75 100 125 Logistic Regression 4.1 155 0 30 60 90 120 150 180 K-Means Clustering Hadoop MR Spark Time per Iteration (s)
  • 4. “Spark is in-memory. It doesn’t work with BIG DATA.” “It is too expensive to buy a cluster with enough memory to fit our data.”
  • 5. Common Misconception About Spark “Spark is in-memory. It doesn’t work with BIG DATA.” “It is too expensive to buy a cluster with enough memory to fit our data.”
  • 6. Spark Project Goals Works well with GBs, TBs, or PBs or data Works well with different storage media (RAM, HDDs, SSDs) Works well from low-latency streaming jobs to long batch jobs
  • 7. Spark Project Goals Works well with KBs, MBs, … or PBs or data Works well with different storage media (RAM, HDDs, SSDs) Works well from low-latency streaming jobs to long batch jobs
  • 8. Sort Benchmark Originally sponsored by Jim Gray to measure advancements in software and hardware in 1987 Participants often used purpose-built hardware/ software to compete –  Large companies: IBM, Microsoft, Yahoo, … –  Academia: UC Berkeley, UCSD, MIT, …
  • 9. Sort Benchmark 1MB -> 100MB -> 1TB (1998) -> 100TB (2009)
  • 10. Sorting 100 TB and 1 PB Participated in the most challenging category –  Daytona GraySort –  Sort 100TB in a fault-tolerant manner –  1 trillion records generated by a 3rd-party harness (10B key + 90B payload) Set a new world record (tied with UCSD) –  Saturated throughput of 8 SSDs and 10Gbps network –  1st time public cloud + open source won Sorted 1PB on our own to push scalability
  • 11. 11 On-Disk Sort Record: Time to sort 100TB 2100 machines2013 Record: Hadoop 2014 Record: Spark Source: Daytona GraySort benchmark, sortbenchmark.org 72 minutes 207 machines 23 minutes Also sorted 1PB in 4 hours
  • 12. Why Sorting? Sorting stresses “shuffle”, which underpins everything from SQL to machine learning (group by, join, sort, ALS, etc) Sorting is challenging because there is no reduction in data. Sort 100TB = 500TB disk I/O + 200TB network
  • 13. What made this possible? Power of the Cloud Engineering Investment in Spark –  Sort-based shuffle (SPARK-2045) –  Netty native network transport (SPARK-2468) –  External shuffle service (SPARK-3796) Clever Application Level Techniques –  Avoid Scala/JVM gotchas –  Cache-friendly & GC-friendly sorting –  Pipelining
  • 14. What made this possible? Power of the Cloud Engineering Investment in Spark –  Sort-based shuffle (SPARK-2045) –  Netty native network transport (SPARK-2468) –  External shuffle service (SPARK-3796) Clever Application Level Techniques –  Avoid Scala/JVM gotchas –  Cache-friendly & GC-friendly sorting –  Pipelining
  • 15. Sort-based Shuffle Old hash-based shuffle: require R (#reduce tasks) concurrent streams with buffers, i.e. limits R. New sort-based shuffle: sort records by partition first, and then write them out. One active stream at a time. Went up to 250,000 tasks in PB sort
  • 16. Network Transport High performance event-driven architecture using Netty (SEDA-like) Explicitly manages pools of memory outside JVM garbage collection Zero-copy (avoid user space buffer through FileChannel.transferTo)
  • 18. What made this possible? Power of the Cloud Engineering Investment in Spark –  Sort-based shuffle (SPARK-2045) –  Netty native network transport (SPARK-2468) –  External shuffle service (SPARK-3796) Clever Application Level Techniques –  Avoid Scala/JVM gotchas –  Cache-friendly & GC-friendly sorting –  Pipelining
  • 19. Warning The guidelines here apply only to hyper performance optimizations –  Majority of the code should not be perf sensitive Measure before you optimize –  It is ridiculously hard to write good benchmarks –  Use jmh if you want to do that http://github.com/databricks/style-style-guide
  • 20. Prefer while loops array.zipWithIndex.foreach { case (x, i) => … } vs var i = 0 while (i < array.length) { ... i += 1 }
  • 21. Avoid Collection Library Avoid Scala collection library –  Generally slower than java.util.* Avoid Java collection library –  Not specialized (too many objects) –  Bad cache locality
  • 22. Prefer private[this] A.scala: class A { private val field = 0 def accessor = field } scalac –print A.scala Class A extends Object { private[this] val field: Int = _; <stable> <accessor> private def field(): Int = A.this.field; def accessor(): Int = A.this.field(); def <init>(): A = { A.super.<init>(); A.this.field = 0; () } }; JIT might not be able to inline this
  • 23. Prefer private[this] B.scala: class B { private[this] val field = 0 def accessor = field } scalac –print B.scala Class B extends Object { private[this] val field: Int = _; def accessor(): Int = B.this.field; def <init>(): B = { B.super.<init>(); B.this.field = 0; () } };
  • 24. What made this possible? Power of the Cloud Engineering Investment in Spark –  Sort-based shuffle (SPARK-2045) –  Netty native network transport (SPARK-2468) –  External shuffle service (SPARK-3796) Clever Application Level Techniques –  Avoid Scala/JVM gotchas –  Cache-friendly & GC-friendly sorting –  Pipelining
  • 25. Garbage Collection Hard to tune GC when memory utilization > 60% Do away entirely with GC: off-heap allocation
  • 26. GC Take 1: DirectByteBuffer java.nio.ByteBuffer.allocateDirect(long size); Limited size: 2GB only Hard to recycle –  only claimed by GC –  possible to control with reflection, but there is a static synchronize!
  • 27. GC Take 2: sun.misc.Unsafe Available on most JVM implementations (frequently exploited by trading platforms) Provide C-like explicit memory allocation class Unsafe { public native long allocateMemory(long bytes); public native void freeMemory(long address); … }
  • 28. GC Take 2: sun.misc.Unsafe class Unsafe { … public native copyMemory( long srcAddress, long destAddress, long bytes); public native long getLong(long address); public native int getInt(long address); ... }
  • 29. Caveats when using Unsafe Maintain a thread local pool of memory blocks to avoid allocation overhead and fragmentation (Netty uses jemalloc) -ea during debugging, -da in benchmark mode
  • 30. Cache Friendly Sorting Naïve Scala/Java collection sort is extremely slow –  object dereferences (random memory access) –  polymorphic comparators (cannot be inlined) Spark’s TimSort allows operating on custom data memory layout –  Be careful with inlining again
  • 31. Data Layout Take 1 100B … Consecutive Memory Block (size N * 100B) Comparison: uses the 10B key to compare. Swap: requires swapping 100B. Too much memcpy! Record format: 10B key + 90B payload
  • 32. Data Layout Take 2 Swap: cheap (8B or 4B) Comparison: requires dereference 2 pointers (random access) Bad cache locality! … 100bytearray 100bytearray 100bytearray 100bytearray 100bytearray Array[Array[Byte]]
  • 33. Data Layout Take 3 14 bytes … Consecutive Memory Block (size N * 14B) 10B key prefix + 4B position Sort comparison: using the 10B prefix inline Swap: only swap 14B Good cache locality & low memcpy overhead!
  • 34.
  • 35. What made this possible? Power of the Cloud Engineering Investment in Spark –  Sort-based shuffle (SPARK-2045) –  Netty native network transport (SPARK-2468) –  External shuffle service (SPARK-3796) Clever Application Level Techniques –  Avoid Scala/JVM gotchas –  Cache-friendly & GC-friendly sorting –  Pipelining
  • 37. Add a semaphore to throttle concurrent “reads”
  • 38. What did we do exactly? Improve single-thread sorting speed –  Better sort algorithm –  Better data layout for cache locality Reduce garbage collection overhead –  Explicitly manage memory using Unsafe Reduce resource contention via pipelining Increase network transfer throughput –  Netty (zero-copy, jemalloc)
  • 39.
  • 40. Extra Readings Databricks Scala Guide (Performance Section) https://github.com/databricks/scala-style-guide Blog post about this benchmark entry http://tinyurl.com/spark-sort