How Spark Beat Hadoop @ 100 TB Sort
+ 
Project Tungsten
London Spark Meetup
Chris Fregly, Principal Data Solutions Engineer
IBM Spark Technology Center
Oct 12, 2015
Power of data. Simplicity of design. Speed of innovation.
IBM | spark.tc

Announcements
Skimlinks
Martin Goodson
Organizer, London Spark Meetup

Who am I?
Streaming Data Engineer
Netflix Open Source Committer

Data Solutions Engineer
Apache Contributor

Principal Data Solutions Engineer
IBM Technology Center
Meetup Organizer
Advanced Apache Meetup
Book Author
Advanced Spark (2016)

Advanced Apache Spark Meetup
Total Spark Experts: 1200+ in only 3 mos!
#5 most active Spark Meetup in the world!

Goals
Dig deep into the Spark & extended-Spark codebase

Study integrations such as Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc.

Surface and share the patterns and idioms of these well-designed, distributed, big data components

Recent Events
Cassandra Summit 2015
Real-time Advanced Analytics w/ Spark & Cassandra

Strata NYC 2015
Practical Data Science w/ Spark: Recommender Systems

Available on Slideshare
http://slideshare.net/cfregly

Freg-a-palooza Upcoming World Tour
  London Spark Meetup (Oct 12th)
  Scotland Data Science Meetup (Oct 13th)
  Dublin Spark Meetup (Oct 15th)
  Barcelona Spark Meetup (Oct 20th)
  Madrid Spark/Big Data Meetup (Oct 22nd)
  Paris Spark Meetup (Oct 26th)
  Amsterdam Spark Summit (Oct 27th – Oct 29th)
  Delft Dutch Data Science Meetup (Oct 29th)
  Brussels Spark Meetup (Oct 30th)
  Zurich Big Data Developers Meetup (Nov 2nd)
High probability I’ll end up in jail or married!
Daytona GraySort Challenge
sortbenchmark.org

Topics of this Talk: Mechanical Sympathy
Tungsten => Bare Metal
Seek Once, Scan Sequentially!
CPU Cache Locality and Efficiency
Use Data Structs Customized to Your Workload
Go Off-Heap Whenever Possible
spark.unsafe.offHeap=true

What is the Daytona GraySort Challenge?
Key Metric
Throughput of sorting 100 TB of 100-byte records with 10-byte keys
Total time includes launching app and writing output file

Daytona
App must be general purpose

Gray
Named after Jim Gray

Daytona GraySort Challenge: Input and Resources
Input
Records are 100 bytes in length
First 10 bytes are random key
Input generator: ordinal.com/gensort.html
28,000 fixed-size partitions for 100 TB sort
250,000 fixed-size partitions for 1 PB sort
1 partition = 1 HDFS block = 1 executor
Aligned to avoid partial-read I/O (i.e. reading imaginary data)
Hardware and Runtime Resources
Commercially available and off-the-shelf
Unmodified, no over/under-clocking
Generates 500 TB of disk I/O, 200 TB network I/O
1st record, 1st 10 bytes: “JimGrayRIP”

Daytona GraySort Challenge: Rules
Must sort to/from OS files in secondary storage

No raw disk since I/O subsystem is being tested

File and device striping (RAID 0) are encouraged

Output file(s) must have correct key order

Daytona GraySort Challenge: Task Scheduling
Types of Data Locality (in decreasing level of read performance)
PROCESS_LOCAL
NODE_LOCAL
RACK_LOCAL
ANY

Delay Scheduling
spark.locality.wait.node: time to wait before falling back to the next, less local level
Set to infinite to force NODE_LOCAL
Straggling Executor JVMs naturally fade away on each run

Daytona GraySort Challenge: Winning Results
On-disk only, in-memory caching disabled!
[Results table; recoverable cells: EC2 (i2.8xlarge) (two entries), 28,000 partitions, 250,000 partitions (!), 3 GB/s/node * 206 nodes]

Daytona GraySort Challenge: EC2 Configuration
206 EC2 Worker nodes, 1 Master node
AWS i2.8xlarge
32 Intel Xeon CPU E5-2670 @ 2.5 GHz
244 GB RAM, 8 x 800 GB SSD, RAID 0 striping, ext4
NOOP I/O scheduler: FIFO, request merging, no reordering
3 GB/s mixed read/write disk I/O per node
Deployed within Placement Group/VPC
Enhanced Networking
Single Root I/O Virtualization (SR-IOV): extension of PCIe
10 Gbps, low latency, low jitter (iperf showed ~9.5 Gbps)

Daytona GraySort Challenge: Winning Configuration
Spark 1.2, OpenJDK 1.7_<amazon-something>_u65-b17
Disabled in-memory caching -- all on-disk!
HDFS 2.4.1 short-circuit local reads, 2x replication
Writes flushed after each of the 5 runs
28,000 partitions / (206 nodes * 32 cores) = 4.25 runs, rounded up to 5 runs
Netty 4.0.23.Final with native epoll
Speculative Execution disabled: spark.speculation=false
Force NODE_LOCAL: spark.locality.wait.node=Infinite
Force Netty Off-Heap: spark.shuffle.io.preferDirectBuffers
Spilling disabled: spark.shuffle.spill=false
All compression disabled (network, on-disk, etc.)
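The "5 runs" figure above is just the number of task waves needed when every core runs one partition at a time. A minimal sketch of that arithmetic, assuming one task per partition and one task slot per core (the helper name `waves` is hypothetical, not a Spark API):

```python
import math

# Figures quoted on the slides above.
partitions = 28_000
nodes = 206
cores_per_node = 32


def waves(num_tasks: int, num_slots: int) -> float:
    """Hypothetical helper: how many rounds of tasks the cluster must run."""
    return num_tasks / num_slots


slots = nodes * cores_per_node      # concurrent tasks across the cluster
exact = waves(partitions, slots)    # fractional number of waves
rounded = math.ceil(exact)          # the last, partial wave still costs a run

print(slots, round(exact, 2), rounded)
```

The last wave is only about a quarter full, which is why stragglers in that wave matter so much for total time.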

Daytona GraySort Challenge: Partitioning
Range Partitioning (vs. Hash Partitioning)
Take advantage of sequential key space
Similar keys grouped together within a partition
Ranges defined by sampling 79 values per partition
Driver sorts samples and defines range boundaries
Sampling took ~10 seconds for 28,000 partitions
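The sampling steps above can be sketched roughly as follows. This is an illustration of range partitioning by sampling, not Spark's RangePartitioner; the function names, the tiny input size, and the evenly spaced choice of split points are all assumptions made for the example.

```python
import bisect
import random


def range_boundaries(partitions, samples_per_partition, num_ranges):
    """Sample keys from every input partition, sort the samples (the
    'driver' step), and pick num_ranges - 1 evenly spaced split points."""
    samples = []
    for part in partitions:
        k = min(samples_per_partition, len(part))
        samples.extend(random.sample(part, k))
    samples.sort()
    step = len(samples) / num_ranges
    return [samples[int(step * i)] for i in range(1, num_ranges)]


def range_partition(key, boundaries):
    """Index of the output range a key falls into (binary search)."""
    return bisect.bisect_right(boundaries, key)


random.seed(0)
# Hypothetical input: 8 partitions of 1,000 random 10-byte keys each.
parts = [[bytes(random.getrandbits(8) for _ in range(10)) for _ in range(1000)]
         for _ in range(8)]
bounds = range_boundaries(parts, samples_per_partition=79, num_ranges=4)

buckets = [[] for _ in range(4)]
for part in parts:
    for key in part:
        buckets[range_partition(key, bounds)].append(key)

# Each bucket can be sorted independently; concatenating them in range
# order gives a globally sorted result.
merged = [k for b in buckets for k in sorted(b)]
```

Because every key in range i is smaller than every key in range i+1, no merge step across reducers is needed, which is the property hash partitioning cannot offer.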

Daytona GraySort Challenge: Why Bother?
Sorting relies heavily on shuffle, I/O subsystem

Shuffle is major bottleneck in big data processing
Large number of partitions can exhaust OS resources

Shuffle optimization benefits all high-level libraries

Goal is to saturate network controller on all nodes
~125 MB/s (1 Gb Ethernet), 1.25 GB/s (10 Gb Ethernet)

Daytona GraySort Challenge: Per Node Results
Reducers: ~1.1 GB/s/node network I/O
(max 1.25 GB/s for 10 Gb Ethernet)
Mappers: 3 GB/s/node disk I/O (8 x 800 GB SSD)
206 nodes * 1.1 GB/s/node ~= 220 GB/s
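The per-node figures are easier to check once link speed (bits) and throughput (bytes) are in the same unit. A small sketch of that conversion using only the numbers quoted on the slides; the variable names are illustrative:

```python
LINK_GBIT_PER_S = 10        # 10 Gb Ethernet, in gigabits per second
NODES = 206
OBSERVED_GB_PER_S = 1.1     # per-node reducer network I/O, in gigabytes per second

link_gb_per_s = LINK_GBIT_PER_S / 8                 # bits -> bytes
utilization = OBSERVED_GB_PER_S / link_gb_per_s     # fraction of the link in use
aggregate_gb_per_s = NODES * OBSERVED_GB_PER_S      # cluster-wide shuffle rate

print(f"link ceiling: {link_gb_per_s} GB/s per node")
print(f"utilization:  {utilization:.0%}")
print(f"aggregate:    {aggregate_gb_per_s:.1f} GB/s")
```

With the rounded 1.1 GB/s input the aggregate comes out slightly above the slide's ~220 GB/s, which is consistent with both numbers being approximations.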
Quick Shuffle Refresher

Shuffle Overview
All to All, Cartesian Product Operation
[Diagram: least useful example I could find]

Spark Shuffle Overview
[Diagram: most confusing example I could find]
Stages are Defined by Shuffle Boundaries

Shuffle Intermediate Data: Spill to Disk
Intermediate shuffle data stored in memory
Spill to Disk
spark.shuffle.spill=true
spark.shuffle.memoryFraction=% of all shuffle buffers
Competes with spark.storage.memoryFraction
Bump this up from default! Will help Spark SQL, too.
Skipped Stages
Reuse intermediate shuffle data found on reducer
DAG for that partition can be truncated

Shuffle Intermediate Data: Compression
spark.shuffle.compress
Compress outputs (mapper)

spark.shuffle.spill.compress
Compress spills (reducer)

spark.io.compression.codec
LZF: Most workloads (new default for Spark)
Snappy: LARGE workloads (less memory required to compress)

Spark Shuffle Operations
join
distinct
cogroup
coalesce
repartition
sortByKey
groupByKey
reduceByKey
aggregateByKey

Spark Shuffle Managers
spark.shuffle.manager = {
  hash: < 10,000 Reducers
    Output file determined by hashing the key of (K,V) pair
    Each mapper creates an output buffer/file per reducer
    Leads to M*R number of output buffers/files per shuffle
  sort: >= 10,000 Reducers
    Default since Spark 1.2
    Wins Daytona GraySort Challenge w/ 250,000 reducers!
  tungsten-sort: Default in Spark 1.5
    Uses sun.misc.Unsafe for direct access to off heap
}
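To make the M*R claim for the hash manager concrete, here is a rough sketch of a hash shuffle write in which every mapper keeps one bucket (standing in for an output file) per reducer. The function names are hypothetical and `zlib.crc32` is only a deterministic stand-in for a key hash; Spark's own partitioner works differently.

```python
import zlib


def hash_partition(key: bytes, num_reducers: int) -> int:
    # Deterministic stand-in for hashing the key of a (K, V) pair.
    return zlib.crc32(key) % num_reducers


def hash_shuffle_write(map_outputs, num_reducers):
    """Return {(mapper_id, reducer_id): [records]}: one bucket ("file")
    per mapper/reducer pair, whether or not it receives any records."""
    files = {}
    for mapper_id, records in enumerate(map_outputs):
        for reducer_id in range(num_reducers):
            files[(mapper_id, reducer_id)] = []
        for key, value in records:
            target = hash_partition(key, num_reducers)
            files[(mapper_id, target)].append((key, value))
    return files


# Hypothetical output of 3 mappers.
map_outputs = [
    [(b"apple", 1), (b"pear", 2)],
    [(b"apple", 3), (b"plum", 4)],
    [(b"kiwi", 5)],
]
files = hash_shuffle_write(map_outputs, num_reducers=4)
print(len(files))  # M * R buckets
```

Scaling the same count to tens of thousands of mappers and reducers is what exhausts file handles and buffers, which motivates the sort-based managers listed above.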

Shuffle Managers

Shuffle Performance Tuning
Hash Shuffle Manager (no longer default)
spark.shuffle.consolidateFiles: mapper output files
o.a.s.shuffle.FileShuffleBlockResolver
Intermediate Files
Increase spark.shuffle.file.buffer: reduce seeks & sys calls
Increase spark.reducer.maxSizeInFlight if memory allows
Use smaller number of larger workers to reduce total files
SQL: BroadcastHashJoin vs. ShuffledHashJoin
spark.sql.autoBroadcastJoinThreshold
Use DataFrame.explain(true) or EXPLAIN to verify
Mechanical Sympathy

Mechanical Sympathy
Use as much of the CPU cache line as possible!

Naïve Matrix Multiplication: Not Cache Friendly
Naive:
  /* Inner loop walks mat2 down a column (stride N): prefetch not effective */
  for (i = 0; i < N; ++i)
    for (j = 0; j < N; ++j)
      for (k = 0; k < N; ++k)
        res[i][j] += mat1[i][k] * mat2[k][j];
Clever:
  /* Transpose mat2 once so both operands are traversed row-wise */
  double mat2transpose[N][N];
  for (i = 0; i < N; ++i)
    for (j = 0; j < N; ++j)
      mat2transpose[i][j] = mat2[j][i];
  for (i = 0; i < N; ++i)
    for (j = 0; j < N; ++j)
      for (k = 0; k < N; ++k)
        res[i][j] += mat1[i][k] * mat2transpose[j][k];
Winning Optimizations 
Deployed across Spark 1.1 and 1.2

Daytona GraySort Challenge: Winning Optimizations
CPU-Cache Locality: Mechanical Sympathy & Cache Locality/Alignment

Optimized Sort Algorithm: Elements of (K, V) Pairs

Reduce Network Overhead: Async Netty, epoll

Reduce OS Resource Utilization: Sort Shuffle

CPU-Cache Locality: Mechanical Sympathy
AlphaSort paper ~1995
Chris Nyberg and Jim Gray

Naïve
List (Pointer-to-Record)
Requires Key to be dereferenced for comparison

AlphaSort
List (Key, Pointer-to-Record)
Key is directly available for comparison
[Diagram labels: Ptr; Key | Ptr]

CPU-Cache Locality: Cache Locality/Alignment
Key (10 bytes) + Pointer (4 bytes*) = 14 bytes
*4 bytes when using compressed OOPs (< 32 GB heap)
Not a power of 2 in size
Not CPU-cache friendly
Cache Alignment Options
Add Padding (2 bytes)
Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes
(Key-Prefix, Pointer-to-Record)
Key distribution affects performance
Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes
[Diagram labels: Key | Ptr; Key | Pad | Ptr (with padding); Key-Prefix | Ptr; cache-line friendly]
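The (Key-Prefix, Pointer-to-Record) layout can be sketched with Python's `struct` module: each sort entry is exactly 8 bytes, and the full key is only dereferenced when two prefixes tie. This is a rough illustration of the idea, not Spark's sorter; the record contents and function names are made up for the example, and a record index stands in for the pointer.

```python
import functools
import struct

# 4-byte key prefix + 4-byte record index ("pointer") = 8 bytes per entry.
ENTRY = struct.Struct(">II")


def make_entries(records):
    """Build one packed (prefix, pointer) entry per record."""
    return [ENTRY.pack(int.from_bytes(rec[:4], "big"), i)
            for i, rec in enumerate(records)]


def sort_entries(entries, records, key_len=10):
    def compare(a, b):
        prefix_a, index_a = ENTRY.unpack(a)
        prefix_b, index_b = ENTRY.unpack(b)
        if prefix_a != prefix_b:              # common case: no dereference
            return -1 if prefix_a < prefix_b else 1
        key_a = records[index_a][:key_len]    # tie: follow the pointers
        key_b = records[index_b][:key_len]
        return (key_a > key_b) - (key_a < key_b)
    return sorted(entries, key=functools.cmp_to_key(compare))


# Hypothetical 100-byte records: 10-byte key followed by a 90-byte payload.
records = [b"JimGrayRIP" + b"x" * 90,
           b"AlphaSort!" + b"y" * 90,
           b"JimGraySRT" + b"z" * 90]
order = [ENTRY.unpack(e)[1] for e in sort_entries(make_entries(records), records)]
print(ENTRY.size, order)
```

Records 0 and 2 share the prefix `JimG`, so only that comparison touches the full keys. This is also why key distribution affects performance: heavily shared prefixes force more dereferences.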

CPU-Cache Locality: Performance Comparison

Similar Technique: Direct Cache Access
Packet header placed into CPU cache

Optimized Sort Algorithm: Elements of (K, V) Pairs
o.a.s.util.collection.TimSort
Based on JDK 1.7 TimSort
Performs best on partially-sorted datasets
Optimized for elements of (K,V) pairs
Sorts impl of SortDataFormat (i.e. KVArraySortDataFormat)

o.a.s.util.collection.AppendOnlyMap
Open addressing hash, quadratic probing
Array of [(K, V), (K, V)]
Good memory locality
Keys never removed, values only append
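A minimal sketch of an append-only, open-addressing map in the spirit of the bullets above: keys and values sit side by side in one flat array, there is no remove operation, and collisions are resolved by probing with a growing step. The class name, the 0.7 load factor, and the `change_value` signature are assumptions for this example rather than Spark's actual implementation.

```python
class AppendOnlyMapSketch:
    """Flat layout [k0, v0, k1, v1, ...]; keys are never removed."""

    _EMPTY = object()

    def __init__(self, capacity=8):
        assert capacity & (capacity - 1) == 0, "capacity must be a power of two"
        self._capacity = capacity
        self._data = [self._EMPTY] * (2 * capacity)
        self._size = 0

    def _find(self, key):
        """Return the slot holding key, or the empty slot where it belongs."""
        mask = self._capacity - 1
        pos, delta = hash(key) & mask, 1
        while True:
            slot_key = self._data[2 * pos]
            if slot_key is self._EMPTY or slot_key == key:
                return pos
            pos = (pos + delta) & mask   # offsets 1, 3, 6, 10, ... from the start
            delta += 1

    def change_value(self, key, update):
        """update(had_value, old_value) returns the new value for key."""
        pos = self._find(key)
        if self._data[2 * pos] is self._EMPTY:
            self._data[2 * pos] = key
            self._data[2 * pos + 1] = update(False, None)
            self._size += 1
            if self._size * 10 > self._capacity * 7:   # keep load factor <= 0.7
                self._grow()
        else:
            self._data[2 * pos + 1] = update(True, self._data[2 * pos + 1])

    def _grow(self):
        old = self._data
        self._capacity *= 2
        self._data = [self._EMPTY] * (2 * self._capacity)
        for i in range(0, len(old), 2):
            if old[i] is not self._EMPTY:
                pos = self._find(old[i])
                self._data[2 * pos] = old[i]
                self._data[2 * pos + 1] = old[i + 1]

    def items(self):
        return [(self._data[i], self._data[i + 1])
                for i in range(0, len(self._data), 2)
                if self._data[i] is not self._EMPTY]


counts = AppendOnlyMapSketch()
for word in "to be or not to be that is the question to".split():
    counts.change_value(word, lambda had, old: old + 1 if had else 1)
print(sorted(counts.items()))
```

Dropping deletion is what keeps this simple: without tombstones, a probe can stop at the first empty slot, and the whole table is one contiguous array that can later be sorted in place.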

Reduce Network Overhead: Async Netty, epoll
New Network Module based on Async Netty
Replaces old java.nio, low-level, socket-based code
Zero-copy epoll uses kernel-space between disk & network
Custom memory management reduces GC pauses
spark.shuffle.blockTransferService=netty
Spark-Netty Performance Tuning
spark.shuffle.io.numConnectionsPerPeer
Increase to saturate hosts with multiple disks
spark.shuffle.io.preferDirectBuffers
On or Off-heap (Off-heap is default)

Hash Shuffle Manager
M*R num open files per shuffle; M = num mappers, R = num reducers
Mapper Opens 1 File per Partition/Reducer
[Diagram: input and output on HDFS (2x repl)]

Reduce OS Resource Utilization: Sort Shuffle
M open files per shuffle; M = num of mappers
spark.shuffle.sort.bypassMergeThreshold
Mapper Merge Sorts Partitions into 1 Master File Indexed by Partition Range Offsets
TimSort (RAM), Merge Sort (Disk)
Reducers seek and scan from range offset of Master File on Mapper
SPARK-2926: Replace TimSort w/ Merge Sort (Memory)
[Diagram: input and output on HDFS (2x repl)]
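The "one master file plus partition offsets" idea can be sketched as below: a mapper orders its records by target partition, writes them back to back into a single buffer, and records where each partition starts, so a reducer needs one seek and one sequential scan. The record format (fixed 4-byte records partitioned by their first byte) and the function names are assumptions made for the illustration.

```python
def partition_of(record: bytes, num_partitions: int) -> int:
    # Hypothetical range partitioner on the first byte of the record.
    return record[0] * num_partitions // 256


def write_map_output(records, num_partitions):
    """Return (data, offsets): one contiguous buffer per mapper and an
    index with num_partitions + 1 byte offsets into it."""
    ordered = sorted(records, key=lambda rec: partition_of(rec, num_partitions))
    data = bytearray()
    offsets = []
    i = 0
    for p in range(num_partitions):
        offsets.append(len(data))
        while i < len(ordered) and partition_of(ordered[i], num_partitions) == p:
            data += ordered[i]
            i += 1
    offsets.append(len(data))
    return bytes(data), offsets


def read_partition(data, offsets, p):
    """Reducer side: seek once to offsets[p], then scan sequentially."""
    return data[offsets[p]:offsets[p + 1]]


records = [b"\xf0aaa", b"\x10bbb", b"\x80ccc", b"\x11ddd"]
data, offsets = write_map_output(records, num_partitions=4)
print(offsets)
print(read_partition(data, offsets, 0))
```

An empty partition simply shows up as two equal consecutive offsets, so the number of files per mapper stays constant no matter how many reducers there are.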
Project Tungsten
Deployed across Spark 1.4 and 1.5

Significant Spark Core Changes
Disk
Network
CPU
Memory
Daytona GraySort Optimizations (Spark 1.1-1.2, Late 2014)
Tungsten Optimizations (Spark 1.4-1.5, Late 2015)

Why is CPU the Bottleneck?
Network and Disk I/O bandwidth are relatively high

GraySort optimizations improved network & shuffle

Predicate pushdowns and partition pruning

Columnar file formats like Parquet and ORC

CPU used for serialization, hashing, compression

tungsten-sort Shuffle Manager
“I don’t know your data structure, but my array[] will beat it!”
Custom Data Structures for Sort/Shuffle Workload
UnsafeRow:
Rows are 8-byte aligned
Primitives are inlined
Row.equals(), Row.hashCode() operate on raw bytes
Offset (Int) and Length (Int) stored in a single Long
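A rough sketch of the row layout described above: fixed 8-byte slots, primitives stored inline, and variable-length data referenced by an (offset, length) pair packed into one 64-bit word. Putting the offset in the upper 32 bits is an assumption of this example, and the real UnsafeRow also carries a null bitset that is omitted here; all function names are hypothetical.

```python
import struct

MASK32 = 0xFFFFFFFF


def pack_offset_and_length(offset: int, length: int) -> int:
    """Two 32-bit ints in one 64-bit word (offset assumed in the high half)."""
    assert 0 <= offset <= MASK32 and 0 <= length <= MASK32
    return (offset << 32) | length


def unpack_offset_and_length(word: int):
    return word >> 32, word & MASK32


def encode_row(number: int, text: str) -> bytes:
    payload = text.encode("utf-8")
    fixed_size = 16                                      # two 8-byte slots
    padded = payload + b"\x00" * (-len(payload) % 8)     # keep 8-byte alignment
    slot0 = struct.pack("<q", number)                    # primitive is inlined
    slot1 = struct.pack("<Q", pack_offset_and_length(fixed_size, len(payload)))
    return slot0 + slot1 + padded


def decode_row(row: bytes):
    (number,) = struct.unpack_from("<q", row, 0)
    (word,) = struct.unpack_from("<Q", row, 8)
    offset, length = unpack_offset_and_length(word)
    return number, row[offset:offset + length].decode("utf-8")


row = encode_row(42, "tungsten!")
print(len(row), decode_row(row))
```

Because every field lives in the same byte array, equality and hashing can run over the raw bytes without materializing any objects, which is the point of the `Row.equals()`/`Row.hashCode()` bullet.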

sun.misc.Unsafe
Info
  addressSize()
  pageSize()
Objects
  allocateInstance()
  objectFieldOffset()
Classes
  staticFieldOffset()
  defineClass()
  defineAnonymousClass()
  ensureClassInitialized()
Synchronization
  monitorEnter()
  tryMonitorEnter()
  monitorExit()
  compareAndSwapInt()
  putOrderedInt()
Arrays
  arrayBaseOffset()
  arrayIndexScale()
Memory
  allocateMemory()
  copyMemory()
  freeMemory()
  getAddress(): not guaranteed correct if GC occurs
  getInt()/putInt()
  getBoolean()/putBoolean()
  getByte()/putByte()
  getShort()/putShort()
  getLong()/putLong()
  getFloat()/putFloat()
  getDouble()/putDouble()
  getObjectVolatile()/putObjectVolatile()
[Slide legend: "Used by Spark" marks the highlighted methods]

Spark + sun.misc.Unsafe
org.apache.spark.sql.execution.
  aggregate.SortBasedAggregate
  aggregate.TungstenAggregate
  aggregate.AggregationIterator
  aggregate.udaf
  aggregate.utils
  SparkPlanner
  rowFormatConverters
  UnsafeFixedWidthAggregationMap
  UnsafeExternalSorter
  UnsafeExternalRowSorter
  UnsafeKeyValueSorter
  UnsafeKVExternalSorter
  local.ConvertToUnsafeNode
  local.ConvertToSafeNode
  local.HashJoinNode
  local.ProjectNode
  local.LocalNode
  local.BinaryHashJoinNode
  local.NestedLoopJoinNode
  joins.HashJoin
  joins.HashSemiJoin
  joins.HashedRelation
  joins.BroadcastHashJoin
  joins.ShuffledHashOuterJoin (not yet converted)
  joins.BroadcastHashOuterJoin
  joins.BroadcastLeftSemiJoinHash
  joins.BroadcastNestedLoopJoin
  joins.SortMergeJoin
  joins.LeftSemiJoinBNL
  joins.SortMergeOuterJoin
  Exchange
  SparkPlan
  UnsafeRowSerializer
  SortPrefixUtils
  sort
  basicOperators
  aggregate.SortBasedAggregationIterator
  aggregate.TungstenAggregationIterator
  datasources.WriterContainer
  datasources.json.JacksonParser
  datasources.jdbc.JDBCRDD
  Window
org.apache.spark.
  unsafe.Platform
  unsafe.KVIterator
  unsafe.array.LongArray
  unsafe.array.ByteArrayMethods
  unsafe.array.BitSet
  unsafe.bitset.BitSetMethods
  unsafe.hash.Murmur3_x86_32
  unsafe.map.BytesToBytesMap
  unsafe.map.HashMapGrowthStrategy
  unsafe.memory.TaskMemoryManager
  unsafe.memory.ExecutorMemoryManager
  unsafe.memory.MemoryLocation
  unsafe.memory.UnsafeMemoryAllocator
  unsafe.memory.MemoryAllocator (trait/interface)
  unsafe.memory.MemoryBlock
  unsafe.memory.HeapMemoryAllocator
  unsafe.sort.RecordComparator
  unsafe.sort.PrefixComparator
  unsafe.sort.PrefixComparators
  unsafe.sort.UnsafeSorterSpillWriter
  serializer.DummySerializationInstance
  shuffle.unsafe.UnsafeShuffleManager
  shuffle.unsafe.UnsafeShuffleSortDataFormat
  shuffle.unsafe.SpillInfo
  shuffle.unsafe.UnsafeShuffleWriter
  shuffle.unsafe.UnsafeShuffleExternalSorter
  shuffle.unsafe.PackedRecordPointer
  shuffle.ShuffleMemoryManager
  util.collection.unsafe.sort.UnsafeSorterSpillMerger
  util.collection.unsafe.sort.UnsafeSorterSpillReader
  util.collection.unsafe.sort.UnsafeSorterSpillWriter
  util.collection.unsafe.sort.UnsafeShuffleInMemorySorter
  util.collection.unsafe.sort.UnsafeInMemorySorter
  util.collection.unsafe.sort.RecordPointerAndKeyPrefix
  util.collection.unsafe.sort.UnsafeSorterIterator
  network.shuffle.ExternalShuffleBlockResolver
  scheduler.Task
  rdd.SqlNewHadoopRDD
  executor.Executor
org.apache.spark.sql.catalyst.expressions.
  regexpExpressions
  BoundAttribute
  SortOrder
  SpecializedGetters
  ExpressionEvalHelper
  UnsafeArrayData
  UnsafeReaders
  UnsafeMapData
  Projection
  LiteralGenerator
  UnsafeRow
  JoinedRow
  InputFileName
  SpecificMutableRow
  codegen.CodeGenerator
  codegen.GenerateProjection
  codegen.GenerateUnsafeRowJoiner
  codegen.GenerateSafeProjection
  codegen.GenerateUnsafeProjection
  codegen.BufferHolder
  codegen.UnsafeRowWriter
  codegen.UnsafeArrayWriter
  complexTypeCreator
  rows
  literals
  misc
  stringExpressions
Over 200 source files affected!

CPU and Memory Optimizations
Custom Managed Memory
  Reduces GC overhead
  Both on and off heap
  Exact size calculations
Direct Binary Processing
  Operate on serialized/compressed arrays
  Kryo can reorder serialized records
  LZF can reorder compressed records
More CPU Cache-aware Data Structs & Algorithms
  o.a.s.unsafe.map.BytesToBytesMap vs. j.u.HashMap
Code Generation (default in 1.5)
  Generate source code from overall query plan
  Janino generates bytecode from source code
  100+ UDFs converted to use code generation
Details in SPARK-7075
Classes called out on the slide:
  UnsafeFixedWidthAggregationMap & TungstenAggregationIterator
  CodeGenerator & GenerateUnsafeRowJoiner
  UnsafeSortDataFormat & UnsafeShuffleSortDataFormat & PackedRecordPointer & UnsafeRow
  UnsafeInMemorySorter & UnsafeExternalSorter & UnsafeShuffleWriter
  Mostly same join code, added if (isUnsafeMode)
  UnsafeShuffleManager & UnsafeShuffleInMemorySorter & UnsafeShuffleExternalSorter

Code Generation
Turned on by default in Spark 1.5
Problem: Generic expression evaluation
Expensive on JVM
Virtual func calls
Branches based on expression type
Excessive object creation due to primitive boxing
Implementation
Defer the source code generation to each operator, type, etc.
Scala quasiquotes provide Scala AST manipulation/rewriting
Generated source code is compiled to bytecode w/ Janino
100+ UDFs now using code gen
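The contrast between generic expression evaluation and code generation can be sketched in a few lines. In this illustration each node knows how to evaluate itself (one method call per node per row) and how to emit source for itself; the emitted source is then compiled once. Python's `compile`/`exec` stand in for the quasiquotes and Janino steps named above, and every class and function name here is hypothetical.

```python
class Column:
    def __init__(self, index):
        self.index = index

    def eval(self, row):               # interpreted path
        return row[self.index]

    def gen(self):                     # code-generation path
        return f"row[{self.index}]"


class Literal:
    def __init__(self, value):
        self.value = value

    def eval(self, row):
        return self.value

    def gen(self):
        return repr(self.value)


class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def eval(self, row):
        return self.left.eval(row) + self.right.eval(row)

    def gen(self):
        return f"({self.left.gen()} + {self.right.gen()})"


class GreaterThan:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def eval(self, row):
        return self.left.eval(row) > self.right.eval(row)

    def gen(self):
        return f"({self.left.gen()} > {self.right.gen()})"


def compile_expr(expr):
    """Each node contributes its own source; the whole tree becomes one function."""
    source = f"def generated(row):\n    return {expr.gen()}\n"
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["generated"], source


# col0 + 1 > col1
expr = GreaterThan(Add(Column(0), Literal(1)), Column(1))
fn, source = compile_expr(expr)
rows = [(1, 5), (7, 3)]
print(source)
print([expr.eval(r) for r in rows], [fn(r) for r in rows])
```

The generated function contains no per-node dispatch at all, which is the effect the slide is after: virtual calls and type branches are paid once at generation time instead of once per row.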

Code Generation: Spark SQL UDFs
100+ UDFs now using code gen – More to come in Spark 1.6
Details in SPARK-8159

Project Tungsten: Beyond Core and Spark SQL
SortDataFormat<K, Buffer>: Base trait, implemented by:
UncompressedInBlockSort: MLlib.ALS
EdgeArraySortDataFormat: GraphX.Edge

Relevant Links
  http://sortbenchmark.org/ApacheSpark2014.pdf
  https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
  http://0x0fff.com/spark-architecture-shuffle/
  http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
  http://stackoverflow.com/questions/763262/how-does-one-write-code-that-best-utilizes-the-cpu-cache-to-improve-performance
  http://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf
  http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/
  http://docs.scala-lang.org/overviews/quasiquotes/intro.html

More Relevant Links from Scott Meyers
  http://lwn.net/Articles/252125/ <-- Memory Part 2: CPU Caches
  http://lwn.net/Articles/255364/ <-- Memory Part 5: What Programmers Can Do
ScottMatrix * BruceMatrix^T = CaitlynMatrix

Special thanks to Skimlinks!
IBM Spark Technology Center is Hiring!
Nice and collaborative people only, please!
IBM | spark.tc
Sign up for our newsletter at
Thank You, London!
Power of data. Simplicity of design. Speed of innovation.
IBM Spark

More Related Content

What's hot

Frustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFramesFrustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFramesIlya Ganelin
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best PracticesCloudera, Inc.
 
A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...
A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...
A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...Spark Summit
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slidesDat Tran
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkDatabricks
 
A Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen FanA Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen FanDatabricks
 
Spark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus GoehausenSpark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus GoehausenSpark Summit
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Python and Bigdata - An Introduction to Spark (PySpark)
Python and Bigdata -  An Introduction to Spark (PySpark)Python and Bigdata -  An Introduction to Spark (PySpark)
Python and Bigdata - An Introduction to Spark (PySpark)hiteshnd
 
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...Chris Fregly
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Databricks
 
A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van La...
A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van La...A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van La...
A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van La...jaxLondonConference
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauSpark Summit
 
Performant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIPerformant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIRyuji Tamagawa
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
 

What's hot (20)

Frustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFramesFrustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFrames
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best Practices
 
A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...
A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...
A Spark Framework For &lt; $100, &lt; 1 Hour, Accurate Personalized DNA Analy...
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
 
A Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen FanA Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen Fan
 
Spark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus GoehausenSpark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus Goehausen
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Python and Bigdata - An Introduction to Spark (PySpark)
Python and Bigdata -  An Introduction to Spark (PySpark)Python and Bigdata -  An Introduction to Spark (PySpark)
Python and Bigdata - An Introduction to Spark (PySpark)
 
Up and running with pyspark
Up and running with pysparkUp and running with pyspark
Up and running with pyspark
 
Spark tutorial
Spark tutorialSpark tutorial
Spark tutorial
 
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
 
PySaprk
PySaprkPySaprk
PySaprk
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for lar...
 
A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van La...
A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van La...A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van La...
A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van La...
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
 
Performant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIPerformant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame API
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 

Similar to London Spark Meetup Project Tungsten Oct 12 2015

Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ...
Scotland Data Science Meetup Oct 13, 2015:  Spark SQL, DataFrames, Catalyst, ...Scotland Data Science Meetup Oct 13, 2015:  Spark SQL, DataFrames, Catalyst, ...
Scotland Data Science Meetup Oct 13, 2015: Spark SQL, DataFrames, Catalyst, ...Chris Fregly
 
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...Chris Fregly
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopDatabricks
 
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Chris Fregly
 
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5Chris Fregly
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and PigRicardo Varela
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to SparkLi Ming Tsai
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationDatabricks
 
Advanced Apache Spark Meetup Approximations and Probabilistic Data Structures...






London Spark Meetup Project Tungsten Oct 12 2015

  • 1. How Spark Beat Hadoop @ 100 TB Sort + Project Tungsten London Spark Meetup Chris Fregly, Principal Data Solutions Engineer IBM Spark Technology Center Oct 12, 2015 Power of data. Simplicity of design. Speed of innovation. IBM | spark.tc
  • 2. Announcements. Skimlinks. Martin Goodson, Organizer, London Spark Meetup.
  • 3. Who am I? Streaming Data Engineer, Netflix Open Source Committer. Data Solutions Engineer, Apache Contributor. Principal Data Solutions Engineer, IBM Technology Center. Meetup Organizer, Advanced Apache Meetup. Book Author, Advanced Spark (2016).
  • 4. Advanced Apache Spark Meetup. Total Spark Experts: 1200+ in only 3 mos! #5 most active Spark Meetup in the world! Goals: Dig deep into the Spark & extended-Spark codebase. Study integrations such as Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc. Surface and share the patterns and idioms of these well-designed, distributed, big data components.
  • 5. Recent Events. Cassandra Summit 2015: Real-time Advanced Analytics w/ Spark & Cassandra. Strata NYC 2015: Practical Data Science w/ Spark: Recommender Systems. Available on Slideshare: http://slideshare.net/cfregly
  • 6. Freg-a-palooza Upcoming World Tour: London Spark Meetup (Oct 12th); Scotland Data Science Meetup (Oct 13th); Dublin Spark Meetup (Oct 15th); Barcelona Spark Meetup (Oct 20th); Madrid Spark/Big Data Meetup (Oct 22nd); Paris Spark Meetup (Oct 26th); Amsterdam Spark Summit (Oct 27th – Oct 29th); Delft Dutch Data Science Meetup (Oct 29th); Brussels Spark Meetup (Oct 30th); Zurich Big Data Developers Meetup (Nov 2nd). High probability I’ll end up in jail or married!
  • 8. Topics of this Talk: Mechanical Sympathy. Tungsten => Bare Metal. Seek Once, Scan Sequentially! CPU Cache Locality and Efficiency. Use Data Structs Customized to Your Workload. Go Off-Heap Whenever Possible: spark.unsafe.offHeap=true
  • 9. What is the Daytona GraySort Challenge? Key Metric: throughput of sorting 100 TB of 100-byte records with 10-byte keys. Total time includes launching the app and writing the output file. Daytona: app must be general purpose. Gray: named after Jim Gray.
  • 10. Daytona GraySort Challenge: Input and Resources. Input: records are 100 bytes in length; first 10 bytes are a random key; input generator: ordinal.com/gensort.html; 28,000 fixed-size partitions for 100 TB sort; 250,000 fixed-size partitions for 1 PB sort; 1 partition = 1 HDFS block = 1 executor, aligned to avoid partial read I/O (i.e. imaginary data). Hardware and Runtime Resources: commercially available and off-the-shelf; unmodified, no over/under-clocking; generates 500 TB of disk I/O, 200 TB of network I/O. (Callout: 1st 10 bytes of the 1st record: “JimGrayRIP”.)
  • 11. Daytona GraySort Challenge: Rules. Must sort to/from OS files in secondary storage. No raw disk since I/O subsystem is being tested. File and device striping (RAID 0) are encouraged. Output file(s) must have correct key order.
  • 12. Daytona GraySort Challenge: Task Scheduling. Types of Data Locality, in decreasing level of read performance: PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, ANY. Delay Scheduling: spark.locality.wait.node: time to wait for next shitty level. Set to infinite to reduce shittiness, force NODE_LOCAL. Straggling Executor JVMs naturally fade away on each run.
  • 13. Daytona GraySort Challenge: Winning Results. On-disk only, in-memory caching disabled! (Results table: EC2 (i2.8xlarge), 28,000 partitions; EC2 (i2.8xlarge), 250,000 partitions (!!); 3 GBps/node * 206 nodes.)
  • 14. Daytona GraySort Challenge: EC2 Configuration. 206 EC2 Worker nodes, 1 Master node. AWS i2.8xlarge: 32 Intel Xeon CPU E5-2670 @ 2.5 GHz; 244 GB RAM, 8 x 800 GB SSD, RAID 0 striping, ext4; NOOP I/O scheduler: FIFO, request merging, no reordering; 3 GBps mixed read/write disk I/O per node. Deployed within Placement Group/VPC. Enhanced Networking: Single Root I/O Virtualization (SR-IOV): extension of PCIe; 10 Gbps, low latency, low jitter (iperf showed ~9.5 Gbps).
  • 15. Daytona GraySort Challenge: Winning Configuration. Spark 1.2, OpenJDK 1.7_<amazon-something>_u65-b17. Disabled in-memory caching, all on-disk! HDFS 2.4.1 short-circuit local reads, 2x replication. Writes flushed after each of the 5 runs: 28,000 partitions / (206 nodes * 32 cores) = 4.25 runs, rounded up to 5 runs. Netty 4.0.23.Final with native epoll. Speculative Execution disabled: spark.speculation=false. Force NODE_LOCAL: spark.locality.wait.node=Infinite. Force Netty Off-Heap: spark.shuffle.io.preferDirectBuffers. Spilling disabled: spark.shuffle.spill=false. All compression disabled (network, on-disk, etc).
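The "4.25 runs, round up 5 runs" figure on this slide can be reproduced with a few lines of arithmetic. This is only an illustrative sketch: the function name `waves_needed` is made up here, and it assumes one task slot per core, as the slide's formula implies.

```python
import math


def waves_needed(partitions: int, nodes: int, cores_per_node: int) -> int:
    """Number of scheduling rounds needed to run one task per partition.

    Assumes (as the slide's formula does) one concurrent task per core.
    """
    slots = nodes * cores_per_node      # tasks that can run at the same time
    return math.ceil(partitions / slots)


slots = 206 * 32                        # 6592 concurrent tasks
exact_runs = 28_000 / slots             # about 4.25
runs = waves_needed(28_000, 206, 32)    # rounded up to whole runs

print(slots, round(exact_runs, 2), runs)
```

The last run is only partly full, which is why the fractional 4.25 still costs a fifth pass over the cluster.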
  • 16. Daytona GraySort Challenge: Partitioning. Range Partitioning (vs. Hash Partitioning). Take advantage of sequential key space. Similar keys grouped together within a partition. Ranges defined by sampling 79 values per partition. Driver sorts samples and defines range boundaries. Sampling took ~10 seconds for 28,000 partitions.
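The sampling idea on this slide can be sketched roughly as follows. This is not Spark's RangePartitioner code: the names `range_boundaries` and `partition_for` are hypothetical, the evenly spaced cut points are an assumption, and only the 79 samples per partition come from the slide.

```python
import bisect
import random


def range_boundaries(partitions, samples_per_partition, num_ranges, rng):
    """Sample keys from every input partition, sort them, and pick cut points."""
    samples = []
    for part in partitions:
        k = min(samples_per_partition, len(part))
        samples.extend(rng.sample(part, k))
    samples.sort()                      # the "driver sorts samples" step
    step = len(samples) / num_ranges
    # num_ranges - 1 cut points split the key space into num_ranges ranges
    return [samples[int(i * step)] for i in range(1, num_ranges)]


def partition_for(key, boundaries):
    """Index of the output range that a key belongs to."""
    return bisect.bisect_right(boundaries, key)


rng = random.Random(42)
# 8 hypothetical input partitions of 1000 random 10-byte keys each
inputs = [[bytes(rng.getrandbits(8) for _ in range(10)) for _ in range(1000)]
          for _ in range(8)]

bounds = range_boundaries(inputs, 79, 4, rng)
ranges = [[] for _ in range(4)]
for part in inputs:
    for key in part:
        ranges[partition_for(key, bounds)].append(key)

# Sorting each range independently and concatenating gives a global sort.
globally_sorted = [k for r in ranges for k in sorted(r)]
print([len(r) for r in ranges])
```

Because every key in one range is smaller than every key in the next, each reducer can sort its range locally and the output files are already in global key order.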
  • 17. Daytona GraySort Challenge: Why Bother? Sorting relies heavily on shuffle, I/O subsystem. Shuffle is major bottleneck in big data processing. Large number of partitions can exhaust OS resources. Shuffle optimization benefits all high-level libraries. Goal is to saturate network controller on all nodes: ~125 MB/s (1 Gbps Ethernet), 1.25 GB/s (10 Gbps Ethernet).
  • 18. Daytona GraySort Challenge: Per Node Results. Reducers: ~1.1 GB/s/node network I/O (max 1.25 GB/s for 10 Gbps Ethernet). Mappers: 3 GB/s/node disk I/O (8 x 800 GB SSD). 206 nodes * 1.1 GB/s/node ~= 220 GB/s.
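The network figures on the last two slides mix bits and bytes, so a quick conversion may help. The sketch below assumes decimal units (1 Gbps = 10^9 bits per second) and takes the per-node reducer rate as ~1.1 GB/s; the helper name `link_bytes_per_sec` is made up for illustration.

```python
def link_bytes_per_sec(gigabits_per_sec: float) -> float:
    """Convert a link speed in Gbps to bytes per second (8 bits per byte)."""
    return gigabits_per_sec * 1e9 / 8


one_gbe = link_bytes_per_sec(1)      # 1 Gbps Ethernet, about 125 MB/s
ten_gbe = link_bytes_per_sec(10)     # 10 Gbps Ethernet, about 1.25 GB/s

per_node = 1.1e9                     # ~1.1 GB/s observed per reducer node
utilization = per_node / ten_gbe     # fraction of the 10 Gbps link in use
cluster = 206 * per_node             # aggregate bytes per second, 206 nodes

print(f"{one_gbe / 1e6:.0f} MB/s, {ten_gbe / 1e9:.2f} GB/s")
print(f"{utilization:.0%} of line rate, {cluster / 1e9:.1f} GB/s cluster-wide")
```

Under these assumptions each node runs at a little under 90% of line rate, and the cluster total lands in the low 220s of GB/s, in line with the rounded figure on the slide.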
  • 20. Shuffle Overview. All to All, Cartesian Product Operation. (Figure callout: Least Useful Example I Could Find.)
  • 21. Spark Shuffle Overview. Stages are Defined by Shuffle Boundaries. (Figure callout: Most Confusing Example I Could Find.)
  • 22. Shuffle Intermediate Data: Spill to Disk. Intermediate shuffle data stored in memory. Spill to Disk: spark.shuffle.spill=true; spark.shuffle.memoryFraction=% of all shuffle buffers; competes with spark.storage.memoryFraction; bump this up from default! Will help Spark SQL, too. Skipped Stages: reuse intermediate shuffle data found on reducer; DAG for that partition can be truncated.
  • 23. Shuffle Intermediate Data: Compression. spark.shuffle.compress: compress outputs (mapper). spark.shuffle.spill.compress: compress spills (reducer). spark.io.compression.codec: LZF: most workloads (new default for Spark); Snappy: LARGE workloads (less memory required to compress).
  • 24. Spark Shuffle Operations: join, distinct, cogroup, coalesce, repartition, sortByKey, groupByKey, reduceByKey, aggregateByKey.
  • 25. Spark Shuffle Managers. spark.shuffle.manager = { hash: < 10,000 Reducers; output file determined by hashing the key of (K,V) pair; each mapper creates an output buffer/file per reducer; leads to M*R number of output buffers/files per shuffle. sort: >= 10,000 Reducers; default since Spark 1.2; wins Daytona GraySort Challenge w/ 250,000 reducers! tungsten-sort: default in Spark 1.5; uses sun.misc.Unsafe for direct access to off heap. }
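To see why the M*R file count matters at this scale, here is a rough back-of-the-envelope comparison. It assumes, as I understand the sort-based shuffle, that each mapper writes one data file plus one index file; the function names and the 28,000 x 28,000 example are illustrative and not figures reported in the talk.

```python
def hash_shuffle_files(mappers: int, reducers: int) -> int:
    """Hash shuffle (without consolidation): one output file per (mapper, reducer)."""
    return mappers * reducers


def sort_shuffle_files(mappers: int) -> int:
    """Sort shuffle, assuming one sorted data file plus one index file per mapper."""
    return mappers * 2


mappers = reducers = 28_000             # hypothetical: one task per partition
hash_files = hash_shuffle_files(mappers, reducers)
sort_files = sort_shuffle_files(mappers)

print(f"hash: {hash_files:,} files, sort: {sort_files:,} files")
```

Hundreds of millions of small files and open buffers is the kind of OS resource exhaustion the earlier "Why Bother?" slide warns about; the sort-based layout grows only with the number of mappers.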
  • 27. Shuffle Performance Tuning. Hash Shuffle Manager (no longer default): spark.shuffle.consolidateFiles: mapper output files; o.a.s.shuffle.FileShuffleBlockResolver. Intermediate Files: increase spark.shuffle.file.buffer to reduce seeks & sys calls; increase spark.reducer.maxSizeInFlight if memory allows; use smaller number of larger workers to reduce total files. SQL: BroadcastHashJoin vs. ShuffledHashJoin: spark.sql.autoBroadcastJoinThreshold; use DataFrame.explain(true) or EXPLAIN to verify.
  • 29. Mechanical Sympathy
      • Use as much of the CPU cache line as possible!
  • 30. Naïve Matrix Multiplication: Not Cache Friendly
      • Naive (mat2 is walked column-wise in the inner loop, so prefetch is not effective):
            for (i = 0; i < N; ++i)
              for (j = 0; j < N; ++j)
                for (k = 0; k < N; ++k)
                  res[i][j] += mat1[i][k] * mat2[k][j];
      • Clever (transpose matrix 2 so that every inner-loop access is a row-wise traversal):
            double mat2transpose[N][N];
            for (i = 0; i < N; ++i)
              for (j = 0; j < N; ++j)
                mat2transpose[i][j] = mat2[j][i];
            for (i = 0; i < N; ++i)
              for (j = 0; j < N; ++j)
                for (k = 0; k < N; ++k)
                  res[i][j] += mat1[i][k] * mat2transpose[j][k];
  • 31. Winning Optimizations Deployed across Spark 1.1 and 1.2
  • 32. Daytona GraySort Challenge: Winning Optimizations
      • CPU-Cache Locality: Mechanical Sympathy & Cache Locality/Alignment
      • Optimized Sort Algorithm: Elements of (K, V) Pairs
      • Reduce Network Overhead: Async Netty, epoll
      • Reduce OS Resource Utilization: Sort Shuffle
  • 33. CPU-Cache Locality: Mechanical Sympathy
      • AlphaSort paper (~1995), Chris Nyberg and Jim Gray
      • Naïve: List (Pointer-to-Record)
          • Requires the key to be dereferenced for comparison
      • AlphaSort: List (Key, Pointer-to-Record)
          • Key is directly available for comparison
      • [Diagram: (Ptr) entries vs. (Key, Ptr) entries]
  • 34. CPU-Cache Locality: Cache Locality/Alignment
      • Key (10 bytes) + Pointer (4 bytes*) = 14 bytes
          • *4 bytes when using compressed OOPs (<32 GB heap)
          • Not a power of 2 in size
          • Not CPU-cache friendly
      • Cache alignment options
          • Add padding (2 bytes): Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes
          • (Key-Prefix, Pointer-to-Record): Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes
              • Key distribution affects performance
      • [Diagram: (Key, Ptr); with padding (Key, Pad, Ptr); cache-line friendly (Key-Prefix, Ptr)]
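The (Key-Prefix, Pointer-to-Record) layout can be sketched in plain Java. This is a minimal illustration under stated assumptions, not Spark's implementation: the class `KeyPrefixSort` and its methods are hypothetical names, the "pointer" is simply an index into an in-memory array, and each 8-byte entry is one `long` holding a 4-byte prefix in the high bits and the index in the low bits. Ties on the prefix are the only case where the full key is dereferenced, which is why key distribution affects performance.

```java
import java.util.Arrays;

class KeyPrefixSort {
    /** First 4 key bytes, big-endian and zero-padded, as an unsigned 32-bit value. */
    static long prefixOf(byte[] key) {
        long prefix = 0;
        for (int i = 0; i < 4; i++) {
            int b = i < key.length ? (key[i] & 0xFF) : 0;
            prefix = (prefix << 8) | b;
        }
        return prefix;
    }

    /** Unsigned lexicographic comparison of full keys; only used to break prefix ties. */
    static int compareKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return Integer.compare(a.length, b.length);
    }

    /** Returns record indexes in ascending key order. */
    static int[] sortedOrder(byte[][] keys) {
        int n = keys.length;
        long[] packed = new long[n];
        for (int i = 0; i < n; i++) {
            // High 32 bits: key prefix. Low 32 bits: "pointer" (here, an array index).
            // Flipping the sign bit makes signed long order match unsigned prefix order.
            packed[i] = ((prefixOf(keys[i]) << 32) | i) ^ Long.MIN_VALUE;
        }
        Arrays.sort(packed); // compares 8-byte entries only; no record is dereferenced

        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = (int) (packed[i] & 0xFFFFFFFFL);
        }
        // Entries sharing a prefix are ordered by index so far; re-sort each run by full key.
        int start = 0;
        while (start < n) {
            int end = start + 1;
            while (end < n && (packed[end] >>> 32) == (packed[start] >>> 32)) end++;
            if (end - start > 1) {
                Arrays.sort(order, start, end, (a, b) -> compareKeys(keys[a], keys[b]));
            }
            start = end;
        }
        int[] result = new int[n];
        for (int i = 0; i < n; i++) result[i] = order[i];
        return result;
    }
}
```

Calling `KeyPrefixSort.sortedOrder(keys)` returns the record indexes in key order. With well-distributed keys almost every comparison is settled inside the packed `long[]`, which is the cache-friendly case.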
  • 35. CPU-Cache Locality: Performance Comparison
  • 36. Similar Technique: Direct Cache Access
      • [Diagram: packet header placed into CPU cache]
  • 37. Optimized Sort Algorithm: Elements of (K, V) Pairs
      • o.a.s.util.collection.TimSort
          • Based on JDK 1.7 TimSort
          • Performs best on partially sorted datasets
          • Optimized for elements of (K,V) pairs
          • Sorts implementations of SortDataFormat (e.g. KVArraySortDataFormat)
      • o.a.s.util.collection.AppendOnlyMap
          • Open addressing hash, quadratic probing
          • Array of [(K, V), (K, V)]
          • Good memory locality
          • Keys are never removed; values are only appended
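The AppendOnlyMap bullets (open addressing, quadratic probing, one flat array of alternating keys and values, no removal) can be sketched as follows. This is an illustrative reimplementation, not the Spark class: `AppendOnlyMapSketch`, `merge`, and the 0.7 load factor are assumptions chosen for the example, and null keys are not supported. Probing uses increasing step sizes (1, 2, 3, ...), which visits every slot when the capacity is a power of two.

```java
import java.util.function.BiFunction;

class AppendOnlyMapSketch<K, V> {
    private Object[] data;     // key at index 2*i, value at index 2*i + 1
    private int capacity = 8;  // always a power of two
    private int size = 0;

    AppendOnlyMapSketch() {
        data = new Object[2 * capacity];
    }

    int size() {
        return size;
    }

    private static int spread(Object key) {
        int h = key.hashCode();
        return h ^ (h >>> 16);
    }

    /** Looks up a key; returns null if it was never inserted. */
    @SuppressWarnings("unchecked")
    V get(K key) {
        int mask = capacity - 1;
        int pos = spread(key) & mask;
        int delta = 1;
        while (true) {
            Object k = data[2 * pos];
            if (k == null) return null;
            if (k.equals(key)) return (V) data[2 * pos + 1];
            pos = (pos + delta) & mask; // quadratic (triangular) probing
            delta++;
        }
    }

    /** Inserts the value, or combines it with the existing value for the key. */
    @SuppressWarnings("unchecked")
    void merge(K key, V value, BiFunction<V, V, V> combine) {
        if ((size + 1) * 10 > capacity * 7) grow(); // keep load factor under 0.7
        int mask = capacity - 1;
        int pos = spread(key) & mask;
        int delta = 1;
        while (true) {
            Object k = data[2 * pos];
            if (k == null) {
                data[2 * pos] = key;
                data[2 * pos + 1] = value;
                size++;
                return;
            }
            if (k.equals(key)) {
                data[2 * pos + 1] = combine.apply((V) data[2 * pos + 1], value);
                return;
            }
            pos = (pos + delta) & mask;
            delta++;
        }
    }

    /** Doubles the table and re-inserts every pair; there is no remove operation. */
    @SuppressWarnings("unchecked")
    private void grow() {
        Object[] old = data;
        capacity *= 2;
        data = new Object[2 * capacity];
        size = 0;
        for (int i = 0; i < old.length; i += 2) {
            if (old[i] != null) merge((K) old[i], (V) old[i + 1], (a, b) -> b);
        }
    }
}
```

A word count, for example, would call `counts.merge(word, 1, Integer::sum)` for each word. Keeping each key next to its value in one array means a successful probe usually touches a single cache line.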
  • 38. Reduce Network Overhead: Async Netty, epoll
      • New network module based on async Netty
          • Replaces old java.nio, low-level, socket-based code
          • Zero-copy epoll uses kernel space between disk & network
          • Custom memory management reduces GC pauses
          • spark.shuffle.blockTransferService=netty
      • Spark-Netty performance tuning
          • spark.shuffle.io.numConnectionsPerPeer: increase to saturate hosts with multiple disks
          • spark.shuffle.io.preferDirectBuffers: on- or off-heap (off-heap is default)
  • 39. Hash Shuffle Manager
      • M*R open files per shuffle; M = num mappers, R = num reducers
      • Mapper opens 1 file per partition/reducer
      • [Diagram labels: HDFS (2x repl)]
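The file-count arithmetic behind the two shuffle managers is easy to check with a small sketch. The class and method names below are hypothetical, and the mapper count in the usage note is an invented figure for illustration (the slides give only the 250,000 reducers). The sort shuffle count covers data files only, matching the "M open files per shuffle" figure on the sort shuffle slide.

```java
class HashShuffleSketch {
    /** Reducer (partition) index for a key: non-negative hash modulo R. */
    static int reducerFor(Object key, int numReducers) {
        return Math.floorMod(key.hashCode(), numReducers);
    }

    /** Hash shuffle: each of M mappers opens one output file per reducer. */
    static long hashShuffleFiles(long mappers, long reducers) {
        return mappers * reducers;
    }

    /** Sort shuffle: each mapper writes a single data file. */
    static long sortShuffleFiles(long mappers) {
        return mappers;
    }
}
```

For a hypothetical 1,000 mappers and the 250,000 reducers mentioned earlier, `hashShuffleFiles(1000, 250000)` gives 250,000,000 files, against 1,000 for `sortShuffleFiles(1000)`.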
  • 40. Reduce OS Resource Utilization: Sort Shuffle
      • M open files per shuffle; M = num of mappers
      • spark.shuffle.sort.bypassMergeThreshold
      • Mapper merge sorts partitions into 1 master file indexed by partition range offsets
      • Reducers seek and scan from range offset of master file on mapper
      • SPARK-2926: Replace TimSort w/ Merge Sort (Memory)
      • [Diagram labels: HDFS (2x repl), TimSort (RAM), Merge Sort (Disk), Master File]
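The "one master file indexed by partition range offsets" idea can be sketched with an in-memory byte array standing in for the mapper's output file. Everything here is an assumption made for illustration: `SortShuffleSketch` is a hypothetical class, records are newline-terminated strings, and records are bucketed in memory rather than sorted and spilled as a real mapper would do. The point is the reducer side: one seek to the partition's start offset, then a sequential scan.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SortShuffleSketch {
    final byte[] dataFile; // stands in for the mapper's single output file
    final int[] offsets;   // partition p occupies bytes [offsets[p], offsets[p + 1])

    SortShuffleSketch(String[] keys, int numReducers) {
        // Group records by reducer; a real mapper would sort and merge spills instead.
        List<List<String>> buckets = new ArrayList<>();
        for (int p = 0; p < numReducers; p++) buckets.add(new ArrayList<>());
        for (String key : keys) {
            buckets.get(Math.floorMod(key.hashCode(), numReducers)).add(key);
        }
        // Write partitions back to back and remember where each one starts.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        offsets = new int[numReducers + 1];
        for (int p = 0; p < numReducers; p++) {
            offsets[p] = out.size();
            for (String key : buckets.get(p)) {
                byte[] bytes = (key + "\n").getBytes(StandardCharsets.UTF_8);
                out.write(bytes, 0, bytes.length);
            }
        }
        offsets[numReducers] = out.size();
        dataFile = out.toByteArray();
    }

    /** Reducer side: one seek to offsets[p], then a sequential scan of that range. */
    String[] readPartition(int p) {
        int length = offsets[p + 1] - offsets[p];
        if (length == 0) return new String[0];
        String chunk = new String(dataFile, offsets[p], length, StandardCharsets.UTF_8);
        return chunk.split("\n");
    }
}
```

Each mapper ends up with one data file plus a small offset index, however many reducers there are.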
  • 41. Project Tungsten Deployed across Spark 1.4 and 1.5
  • 42. Significant Spark Core Changes
      • Areas: Disk, Network, CPU, Memory
      • Daytona GraySort Optimizations (Spark 1.1-1.2, Late 2014)
      • Tungsten Optimizations (Spark 1.4-1.5, Late 2015)
  • 43. Why is CPU the Bottleneck?
      • Network and disk I/O bandwidth are relatively high
      • GraySort optimizations improved network & shuffle
      • Predicate pushdowns and partition pruning
      • Columnar file formats like Parquet and ORC
      • CPU used for serialization, hashing, compression
  • 44. tungsten-sort Shuffle Manager
      • “I don’t know your data structure, but my array[] will beat it!”
      • Custom data structures for sort/shuffle workload
      • UnsafeRow:
          • Rows are 8-byte aligned
          • Primitives are inlined
          • Row.equals(), Row.hashCode() operate on raw bytes
          • Offset (Int) and Length (Int) stored in a single Long
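Two of the UnsafeRow bullets reduce to simple bit arithmetic, sketched below. The class `PackedFieldSketch` and its method names are hypothetical; the sketch only shows how an (offset, length) pair of ints fits in one long and how a byte count is rounded up to an 8-byte boundary. It is not taken from the Spark source.

```java
class PackedFieldSketch {
    /** Packs an int offset (high 32 bits) and an int length (low 32 bits) into one long. */
    static long pack(int offset, int length) {
        return ((long) offset << 32) | (length & 0xFFFFFFFFL);
    }

    static int offsetOf(long packed) {
        return (int) (packed >>> 32);
    }

    static int lengthOf(long packed) {
        return (int) packed;
    }

    /** Rounds a byte count up to the next multiple of 8 so fields stay word-aligned. */
    static int roundUpTo8(int numBytes) {
        return (numBytes + 7) & ~7;
    }
}
```

Storing both values in one 8-byte word means a variable-length field costs the same fixed slot as an inlined primitive, so every field can be located with one aligned read.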
  • 46. Spark + sun.misc.Unsafe
      • org.apache.spark.sql.execution.
          • aggregate.SortBasedAggregate, aggregate.TungstenAggregate, aggregate.AggregationIterator, aggregate.udaf, aggregate.utils, aggregate.SortBasedAggregationIterator, aggregate.TungstenAggregationIterator
          • SparkPlanner, SparkPlan, Exchange, Window, sort, basicOperators, rowFormatConverters, SortPrefixUtils, UnsafeRowSerializer
          • UnsafeFixedWidthAggregationMap, UnsafeExternalSorter, UnsafeExternalRowSorter, UnsafeKeyValueSorter, UnsafeKVExternalSorter
          • local.ConvertToUnsafeNode, local.ConvertToSafeNode, local.HashJoinNode, local.ProjectNode, local.LocalNode, local.BinaryHashJoinNode, local.NestedLoopJoinNode
          • joins.HashJoin, joins.HashSemiJoin, joins.HashedRelation, joins.BroadcastHashJoin, joins.ShuffledHashOuterJoin (not yet converted), joins.BroadcastHashOuterJoin, joins.BroadcastLeftSemiJoinHash, joins.BroadcastNestedLoopJoin, joins.SortMergeJoin, joins.LeftSemiJoinBNL, joins.SortMergeOuterJoin
          • datasources.WriterContainer, datasources.json.JacksonParser, datasources.jdbc.JDBCRDD
      • org.apache.spark.
          • unsafe.Platform, unsafe.KVIterator, unsafe.array.LongArray, unsafe.array.ByteArrayMethods, unsafe.array.BitSet, unsafe.bitset.BitSetMethods, unsafe.hash.Murmur3_x86_32
          • unsafe.map.BytesToBytesMap, unsafe.map.HashMapGrowthStrategy
          • unsafe.memory.TaskMemoryManager, unsafe.memory.ExecutorMemoryManager, unsafe.memory.MemoryLocation, unsafe.memory.UnsafeMemoryAllocator, unsafe.memory.MemoryAllocator (trait/interface), unsafe.memory.MemoryBlock, unsafe.memory.HeapMemoryAllocator
          • unsafe.sort.RecordComparator, unsafe.sort.PrefixComparator, unsafe.sort.PrefixComparators, unsafe.sort.UnsafeSorterSpillWriter
          • serializer.DummySerializationInstance
          • shuffle.unsafe.UnsafeShuffleManager, shuffle.unsafe.UnsafeShuffleSortDataFormat, shuffle.unsafe.SpillInfo, shuffle.unsafe.UnsafeShuffleWriter, shuffle.unsafe.UnsafeShuffleExternalSorter, shuffle.unsafe.PackedRecordPointer, shuffle.ShuffleMemoryManager
          • util.collection.unsafe.sort.UnsafeSorterSpillMerger, util.collection.unsafe.sort.UnsafeSorterSpillReader, util.collection.unsafe.sort.UnsafeSorterSpillWriter, util.collection.unsafe.sort.UnsafeShuffleInMemorySorter, util.collection.unsafe.sort.UnsafeInMemorySorter, util.collection.unsafe.sort.RecordPointerAndKeyPrefix, util.collection.unsafe.sort.UnsafeSorterIterator
          • network.shuffle.ExternalShuffleBlockResolver, scheduler.Task, rdd.SqlNewHadoopRDD, executor.Executor
      • org.apache.spark.sql.catalyst.expressions.
          • regexpExpressions, BoundAttribute, SortOrder, SpecializedGetters, ExpressionEvalHelper, UnsafeArrayData, UnsafeReaders, UnsafeMapData, Projection, LiteralGenerator, UnsafeRow, JoinedRow, InputFileName, SpecificMutableRow
          • codegen.CodeGenerator, codegen.GenerateProjection, codegen.GenerateUnsafeRowJoiner, codegen.GenerateSafeProjection, codegen.GenerateUnsafeProjection, codegen.BufferHolder, codegen.UnsafeRowWriter, codegen.UnsafeArrayWriter
          • complexTypeCreator, rows, literals, misc, stringExpressions
      • Over 200 source files affected!!
  • 47. CPU and Memory Optimizations
      • Custom managed memory
          • Reduces GC overhead
          • Both on and off heap
          • Exact size calculations
      • Direct binary processing
          • Operate on serialized/compressed arrays
          • Kryo can reorder serialized records
          • LZF can reorder compressed records
      • More CPU cache-aware data structures & algorithms
          • o.a.s.unsafe.map.BytesToBytesMap vs. j.u.HashMap
      • Code generation (default in 1.5)
          • Generate source code from overall query plan
          • Janino generates bytecode from source code
          • 100+ UDFs converted to use code generation
      • Details in SPARK-7075
      • Classes called out on the slide
          • UnsafeFixedWidthAggregationMap & TungstenAggregationIterator
          • CodeGenerator & GenerateUnsafeRowJoiner
          • UnsafeSortDataFormat & UnsafeShuffleSortDataFormat & PackedRecordPointer & UnsafeRow
          • UnsafeInMemorySorter & UnsafeExternalSorter & UnsafeShuffleWriter
          • UnsafeShuffleManager & UnsafeShuffleInMemorySorter & UnsafeShuffleExternalSorter
          • Joins: mostly same join code, added if (isUnsafeMode)
  • 48. Code Generation
      • Turned on by default in Spark 1.5
      • Problem: generic expression evaluation
          • Expensive on JVM
          • Virtual func calls
          • Branches based on expression type
          • Excessive object creation due to primitive boxing
      • Implementation
          • Defer the source code generation to each operator, type, etc.
          • Scala quasiquotes provide Scala AST manipulation/rewriting
          • Generated source code is compiled to bytecode w/ Janino
          • 100+ UDFs now using code gen
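The cost of generic expression evaluation, and what generated code avoids, can be illustrated without a code generator. The sketch below is hand-written Java, not output from Spark or Janino: `CodegenSketch`, `Expr`, and the expression `(col0 + col1) * 2` are made up for the example. The interpreted tree pays one virtual call and at least one boxed `Long` per node; the "generated" method is the kind of straight-line primitive code a generator could emit for the same expression.

```java
class CodegenSketch {
    /** Generic expression node: evaluated through an interface call, returns boxed values. */
    interface Expr {
        Object eval(Object[] row);
    }

    static Expr column(int index) {
        return row -> row[index];
    }

    static Expr literal(long value) {
        return row -> value; // boxed to Long on every evaluation
    }

    static Expr add(Expr left, Expr right) {
        return row -> ((Long) left.eval(row)) + ((Long) right.eval(row));
    }

    static Expr multiply(Expr left, Expr right) {
        return row -> ((Long) left.eval(row)) * ((Long) right.eval(row));
    }

    /**
     * Hand-written stand-in for generated code for (col0 + col1) * 2:
     * primitives only, no per-node dispatch, no allocation.
     */
    static long generated(long col0, long col1) {
        return (col0 + col1) * 2L;
    }
}
```

Both forms compute the same value: `multiply(add(column(0), column(1)), literal(2)).eval(row)` and `generated(col0, col1)` agree for any row of two longs, but only the second is a single method the JIT can compile without dispatching per node.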
  • 49. Code Generation: Spark SQL UDFs
      • 100+ UDFs now using code gen; more to come in Spark 1.6
      • Details in SPARK-8159
  • 50. Project Tungsten: Beyond Core and Spark SQL
      • SortDataFormat<K, Buffer>: base trait, implemented by
          • UncompressedInBlockSort: MLlib.ALS
          • EdgeArraySortDataFormat: GraphX.Edge
  • 51. Relevant Links
      • http://sortbenchmark.org/ApacheSpark2014.pdf
      • https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
      • http://0x0fff.com/spark-architecture-shuffle/
      • http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
      • http://stackoverflow.com/questions/763262/how-does-one-write-code-that-best-utilizes-the-cpu-cache-to-improve-performance
      • http://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf
      • http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/
      • http://docs.scala-lang.org/overviews/quasiquotes/intro.html
  • 52. More Relevant Links from Scott Meyers
      • http://lwn.net/Articles/252125/ (Memory Part 2: CPU Caches)
      • http://lwn.net/Articles/255364/ (Memory Part 5: What Programmers Can Do)
      • ScottMatrix * BruceMatrix^T = CaitlynMatrix
  • 53. Thank You, London!
      • Special thanks to Skimlinks!
      • IBM Spark Technology Center is hiring! Nice and collaborative people only, please!
      • Sign up for our newsletter at
  • 55. Power of data. Simplicity of design. Speed of innovation. IBM Spark