Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Apache Spark Meetup to Date

IBM Spark
spark.tc

After Dark 1.5
High Performance, Real-time, Streaming,
Machine Learning, Natural Language Processing,
Text Analytics, and Recommendations

Chris Fregly
Principal Data Solutions Engineer
IBM Spark Technology Center
** We’re Hiring -- Only Nice People, Please!! **
Paris Spark Meetup
October 26, 2015

Power of data. Simplicity of design. Speed of innovation.
Who Am I?

Streaming Data Engineer
Netflix Open Source Committer

Data Solutions Engineer 
Apache Contributor

Principal Data Solutions Engineer
IBM Spark Technology Center
Meetup Organizer
Advanced Apache Spark Meetup
Book Author
Advanced (2016)

Advanced Apache Spark Meetup
Meetup Metrics
1400+ members in just 3 mos!
4th most active Spark Meetup!!
meetup.com/Advanced-Apache-Spark-Meetup
Meetup Goals
  Dig deep into Spark & extended-Spark codebase
  Study integrations incl. Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R
  Surface & share patterns & idioms of these well-designed, distributed, big data components

Upcoming Meetups and Conferences
London Spark Meetup (Oct 12th)
Scotland Data Science Meetup (Oct 13th)
Dublin Spark Meetup (Oct 15th)
Barcelona Spark Meetup (Oct 20th)
Madrid Spark/Big Data Meetup (Oct 22nd)
Paris Spark Meetup (Oct 26th)
Amsterdam Spark Summit & Meetup (Oct 27th)
Delft Dutch Data Science Meetup (Oct 29th)
Brussels Spark Meetup (Oct 30th)
Zurich Big Data Developers Meetup (Nov 2nd)
Geneva Spark Meetup (Nov 5th)
San Francisco Datapalooza (Nov 10th)
San Francisco Advanced Apache Spark Meetup (Nov 12th)
Oslo Big Data Hadoop Meetup (Nov 18th)
Helsinki Spark Meetup (Nov 20th)
Stockholm Spark Meetup (Nov 23rd)
Copenhagen Spark Meetup (Nov 25th)
Budapest Spark Meetup (Nov 27th)
Singapore Strata Conference (Dec 1st)
San Francisco Advanced Apache Spark Meetup (Dec 8th)
Mountain View Advanced Apache Spark Meetup (Dec 10th)
Washington DC Advanced Apache Spark Meetup (Dec 17th)

Freg-a-palooza!

What is Spark After Dark?
Fun, Spark-based dating reference application
*Not a movie recommendation engine!!
Generate recommendations based on user similarity
Demonstrate Apache Spark and related big data projects

Tools of this Talk
  Redis
  Docker
  Ganglia
  Streaming, Kafka
  Cassandra, NoSQL
  Parquet, JSON, ORC, Avro
  Apache Zeppelin Notebooks
  Spark SQL, DataFrames, Hive
  ElasticSearch, Logstash, Kibana
  Spark ML, GraphX, Stanford CoreNLP
and…

Overall Themes of this Talk
  Filter Early, Filter Deep
  Approximations are OK
  Minimize Random Seeks
  Maximize Sequential Scans
  Go Off-Heap when Possible
  Parallelism is Required at Scale
  Must Reduce Dimensions at Scale
  Seek Performance Gains at all Layers
  Customize Data Structs for your Workload
  Be Nice and Collaborate with your Peers!

High-Level Sections
Spark Core: Performance Tuning
Spark SQL: DataSources and Tuning
Spark Streaming: Scale, Tuning, Approx
Spark ML: Scale, Dim Reduce, NLP

Spark Core: Performance Tuning
Acknowledging Mechanical Sympathy

100TB Daytona GraySort Challenge

Project Tungsten

Acknowledging Mechanical Sympathy
“Hardware and software working together in harmony”

-Martin Thompson

http://mechanical-sympathy.blogspot.com

Spark Mechanical Sympathy Concerns

Saturate Network I/O

Saturate Disk I/O

Minimize Memory Footprint and GC

Maximize CPU Cache Locality


Spark and Mechanical Sympathy
Saturate Network I/O and Disk I/O: Daytona GraySort (Spark 1.1-1.2)
Minimize Memory and GC, Maximize CPU Cache Locality: Project Tungsten (Spark 1.4-1.6)

AlphaSort Trick for Sorting
AlphaSort paper, 1995

Chris Nyberg and Jim Gray

Naïve

List (Pointer-to-Record)

Requires Key to be dereferenced for comparison

AlphaSort

List (Key, Pointer)

Key is directly available for comparison
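The (Key, Pointer) idea can be sketched in Java as follows. This is an illustration under stated assumptions, not Spark's or AlphaSort's code: the class and method names are hypothetical, keys are byte arrays compared as unsigned bytes, and each entry packs a 4-byte key prefix plus a 31-bit record index into one 8-byte long. The primitive sort compares prefixes without touching the records; only runs with equal prefixes fall back to the full key.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch only: class and method names are hypothetical, not Spark's.
final class KeyPrefixSortSketch {

    // First 4 key bytes as an unsigned, big-endian number (short keys are zero-padded).
    static long prefixOf(byte[] key) {
        long prefix = 0;
        for (int i = 0; i < 4; i++) {
            int b = i < key.length ? (key[i] & 0xFF) : 0;
            prefix = (prefix << 8) | b;
        }
        return prefix;
    }

    // Returns record indexes ("pointers") in ascending unsigned-byte key order.
    static int[] sortPointers(byte[][] records) {
        // One 8-byte entry per record: prefix in bits 31..62, record index in bits 0..30.
        long[] entries = new long[records.length];
        for (int i = 0; i < records.length; i++) {
            entries[i] = (prefixOf(records[i]) << 31) | i;
        }
        Arrays.sort(entries); // compares prefixes without dereferencing any record

        int[] pointers = new int[entries.length];
        for (int i = 0; i < entries.length; i++) {
            pointers[i] = (int) (entries[i] & 0x7FFFFFFFL);
        }
        // Only runs of equal prefixes need the full key, so only they touch the records.
        int start = 0;
        while (start < entries.length) {
            int end = start + 1;
            while (end < entries.length && (entries[end] >>> 31) == (entries[start] >>> 31)) {
                end++;
            }
            if (end - start > 1) {
                sortRunByFullKey(pointers, start, end, records);
            }
            start = end;
        }
        return pointers;
    }

    private static void sortRunByFullKey(int[] pointers, int from, int to, byte[][] records) {
        Integer[] run = new Integer[to - from];
        for (int i = 0; i < run.length; i++) run[i] = pointers[from + i];
        Arrays.sort(run, (a, b) -> Arrays.compareUnsigned(records[a], records[b]));
        for (int i = 0; i < run.length; i++) pointers[from + i] = run[i];
    }

    public static void main(String[] args) {
        String[] keys = {"pear", "apple", "peach", "fig", "applesauce"};
        byte[][] records = new byte[keys.length][];
        for (int i = 0; i < keys.length; i++) records[i] = keys[i].getBytes(StandardCharsets.UTF_8);
        for (int p : sortPointers(records)) System.out.println(keys[p]);
    }
}
```

How often the fallback runs depends on how well the prefix separates the keys, which is one way to read the later remark that key distribution affects performance.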


CPU Cache Line and Memory Sympathy
Key (10 bytes) + Pointer (4 bytes*) = 14 bytes

*4 bytes when using compressed OOPs (<32 GB heap)

Not a power of 2 in size

Not CPU-cache friendly
Add Padding (2 bytes)

Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes

Cache-line friendly
Key-Prefix, Pointer

Key distribution affects perf

Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes

2x cache-line friendly

Performance Comparison

Similar Technique: Direct Cache Access
Packet header placed into CPU cache


CPU Cache Lines

Instrumenting and Monitoring CPU
Linux perf command!

CPU Cache Naïve Matrix Multiplication
// Find dot product of each row and column vector
for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    for (k = 0; k < N; ++k)
      res[i][j] += matA[i][k] * matB[k][j];

Skipping row-wise, not using full CPU cache line, ineffective pre-fetching

CPU Cache Friendly Matrix Multiplication

// Transpose B
for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    matBtran[i][j] = matB[j][i];

// Modify dot product calculation for B transpose
for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    for (k = 0; k < N; ++k)
      res[i][j] += matA[i][k] * matBtran[j][k];
Good use of CPU cache line, effective prefetching
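The two loops can be put side by side in a small runnable Java class. This is a sketch assuming square `double[][]` matrices; the class name and the timing harness in `main` are illustrative and are not the demo code from the talk. Both methods add the products in the same order, so they return identical results and differ only in how matrix B is walked in memory.

```java
import java.util.Random;

// Illustrative sketch: not the talk's demo code. Assumes square matrices.
final class MatrixMultiplySketch {

    static double[][] multiplyNaive(double[][] a, double[][] b) {
        int n = a.length;
        double[][] res = new double[n][n];
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                for (int k = 0; k < n; ++k)
                    res[i][j] += a[i][k] * b[k][j]; // walks b down a column, one row per step
        return res;
    }

    static double[][] multiplyCacheFriendly(double[][] a, double[][] b) {
        int n = a.length;
        double[][] bTran = new double[n][n];
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                bTran[i][j] = b[j][i];
        double[][] res = new double[n][n];
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                for (int k = 0; k < n; ++k)
                    res[i][j] += a[i][k] * bTran[j][k]; // walks both operands sequentially
        return res;
    }

    public static void main(String[] args) {
        int n = 256;
        Random rng = new Random(42);
        double[][] a = new double[n][n];
        double[][] b = new double[n][n];
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                a[i][j] = rng.nextDouble();
                b[i][j] = rng.nextDouble();
            }
        long t0 = System.nanoTime();
        multiplyNaive(a, b);
        long t1 = System.nanoTime();
        multiplyCacheFriendly(a, b);
        long t2 = System.nanoTime();
        System.out.println("naive ms:          " + (t1 - t0) / 1_000_000);
        System.out.println("cache-friendly ms: " + (t2 - t1) / 1_000_000);
    }
}
```

A single timed run like this is dominated by JIT warm-up, so hardware counters are a better basis for comparison than the wall-clock numbers printed here.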

Demo!
Comparing CPU Naïve & Cache-Friendly Matrix Multiplication

Results of Naïve vs. Cache Friendly
Naïve Matrix Multiply vs. Cache Friendly Matrix Multiply
Ratios annotated on the slide: ~72x, ~8x, ~3x, ~3x, ~2x, ~7x, ~10x
perf stat --repeat 5 --scale --event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses,cache-misses,stalled-cycles-frontend \
  java -Xmx13G -XX:-Inline -jar ~/sbt/bin/sbt-launch.jar "tungsten/run-main com.advancedspark.tungsten.matrix.Cache[Friendly|Naïve]MatrixMultiply 256 1"

Visualizing and Finding Hotspots
Flame Graphs with Java Stack Traces
Images courtesy of http://techblog.netflix.com/2015/07/java-in-flames.html

100TB Daytona GraySort Challenge
Focus on Network and Disk I/O Optimizations
Improve Data Structs/Algos for Sort & Shuffle
Saturate Network and Disk Controllers

Winning Results
Spark Goals:
  Saturate Network I/O
  Saturate Disk I/O
(2013) (2014)

Winning Hardware Configuration
Compute

206 EC2 Worker nodes, 1 Master node

AWS i2.8xlarge

32 vCPUs (Intel Xeon E5-2670 @ 2.5 GHz)

244 GB RAM, 8 x 800GB SSD, RAID 0 striping, ext4

NOOP I/O scheduler: FIFO, request merging, no reordering

3 GBps mixed read/write disk I/O per node

Network

Deployed within Placement Group/VPC

Using AWS Enhanced Networking

Single Root I/O Virtualization (SR-IOV): extension of PCIe

10 Gbps, low latency, low jitter (iperf showed ~9.5 Gbps)


Winning Software Configuration
Spark 1.2, OpenJDK 1.7_<amazon-something>_u65-b17
Disable caching, compression, speculative execution, shuffle spill
Force NODE_LOCAL task scheduling for optimal data locality
HDFS 2.4.1 short-circuit for local reads, 2x replication
Spark recommendation: 4-6 partitions (tasks) per core

206 nodes * 32 cores = 6592 cores

6592 cores * 4 = 26,368 partitions

6592 cores * 6 = 39,552 partitions

6592 cores * 4.25 = 28,000 partitions was empirically best
Range partitioning takes advantage of sequential keyspace
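The partition arithmetic above can be checked with a few lines of Java. The class and method names are hypothetical. Note that 6592 * 4.25 is exactly 28,016, which the slide rounds to 28,000.

```java
// Illustrative arithmetic only; names are hypothetical.
final class PartitionCountSketch {

    static long totalCores(int nodes, int coresPerNode) {
        return (long) nodes * coresPerNode;
    }

    static long partitions(long cores, double partitionsPerCore) {
        return Math.round(cores * partitionsPerCore);
    }

    public static void main(String[] args) {
        long cores = totalCores(206, 32);
        System.out.println("cores:                " + cores);
        System.out.println("4 partitions/core:    " + partitions(cores, 4));
        System.out.println("6 partitions/core:    " + partitions(cores, 6));
        System.out.println("4.25 partitions/core: " + partitions(cores, 4.25));
    }
}
```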

New Shuffle Manager
New “Sort-based” shuffle manager replaces Hash-based

New Data Structures and Algos for Shuffle Sort

ie. New TimSort for Arrays of (K,V) Pairs

New Network Module
Replaces old java.nio, low-level, socket-based code

Zero-copy epoll: kernel-space between disk & network

Custom memory management

spark.shuffle.blockTransferService=netty

Spark-Netty Performance Tuning

spark.shuffle.io.numConnectionsPerPeer

Increase to saturate hosts with multiple disks

spark.shuffle.io.preferDirectBuffers

On or Off-heap (Off-heap is default)


New Algorithms and Data Structures
Optimized for sort and shuffle
o.a.s.util.collection.TimSort

Based on JDK 1.7 TimSort

Performs best on partially-sorted datasets

Optimized for elements of (K,V) pairs

Sorts impl of SortDataFormat (ie. KVArraySortDataFormat)
o.a.s.util.collection.AppendOnlyMap

Open addressing hash, quadratic probing

Array of [(K, V), (K, V)]

Good memory locality

Keys never removed, values only append
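The AppendOnlyMap bullets above (open addressing, quadratic probing, one flat array of keys and values, no removals) can be sketched as below. This is a simplified illustration rather than Spark's `AppendOnlyMap`: the class name, the 0.7 load factor, the hash spreading step, and the lack of null-key support are all assumptions made for brevity.

```java
// Illustrative sketch only; not Spark's AppendOnlyMap. Keys must be non-null.
final class AppendOnlyMapSketch {
    private int capacity = 16;                         // always a power of two
    private Object[] data = new Object[2 * capacity];  // [k0, v0, k1, v1, ...] in one array
    private int size = 0;

    int size() {
        return size;
    }

    // Inserts a new key or overwrites the value of an existing one. There is no remove.
    void put(Object key, Object value) {
        int mask = capacity - 1;
        int pos = spread(key.hashCode()) & mask;
        int delta = 1;
        while (true) {
            Object current = data[2 * pos];
            if (current == null) {
                data[2 * pos] = key;
                data[2 * pos + 1] = value;
                size++;
                if (size > capacity * 0.7) grow();
                return;
            }
            if (current.equals(key)) {
                data[2 * pos + 1] = value;
                return;
            }
            pos = (pos + delta) & mask; // offsets 1, 3, 6, 10, ... from the home slot
            delta++;
        }
    }

    Object get(Object key) {
        int mask = capacity - 1;
        int pos = spread(key.hashCode()) & mask;
        int delta = 1;
        while (true) {
            Object current = data[2 * pos];
            if (current == null) return null;
            if (current.equals(key)) return data[2 * pos + 1];
            pos = (pos + delta) & mask;
            delta++;
        }
    }

    private void grow() {
        Object[] old = data;
        int oldCapacity = capacity;
        capacity *= 2;
        data = new Object[2 * capacity];
        size = 0;
        for (int i = 0; i < oldCapacity; i++) {
            if (old[2 * i] != null) put(old[2 * i], old[2 * i + 1]);
        }
    }

    private static int spread(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        AppendOnlyMapSketch map = new AppendOnlyMapSketch();
        for (int i = 0; i < 100; i++) map.put("user-" + i, i);
        System.out.println(map.size() + " entries, user-42 -> " + map.get("user-42"));
    }
}
```

Because keys are never removed, a lookup can stop at the first empty slot without needing tombstones, and each value sits next to its key in the same array.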

Met Performance Goals!
Reducers: 1.1 GB/s per node network I/O
(theoretical max = 1.25 GB/s for 10 Gbps Ethernet)
Mappers: 3 GB/s per node disk I/O (8 x 800 GB SSD)
206 nodes * 1.1 GB/s per node ~= 227 GB/s

Shuffle Performance Tuning Tips
Hash Shuffle Manager (no longer default)

spark.shuffle.consolidateFiles: mapper output files

o.a.s.shuffle.FileShuffleBlockResolver
Intermediate Files

Increase spark.shuffle.file.buffer: reduce seeks & sys calls

Increase spark.reducer.maxSizeInFlight if memory allows

Use smaller number of larger workers to reduce total files
SQL: BroadcastHashJoin vs. ShuffledHashJoin

spark.sql.autoBroadcastJoinThreshold

Use DataFrame.explain(true) or EXPLAIN to verify


Project Tungsten
Focus on CPU Cache and Memory Optimizations
Further Improve Data Structures and Algorithms
Operate on Serialized/Compressed Data
Provide Path to Off Heap

Why is CPU the Bottleneck?
Network and Disk I/O bandwidth are relatively high

GraySort optimizations improved network & shuffle

More partitioning, pruning, and predicate pushdowns

Popularity of columnar file formats like Parquet/ORC

CPU is used for serialization, hashing, compression!

Spark Shuffle Managers
spark.shuffle.manager =

hash < 10,000 Reducers

Output file determined by hashing the key of (K,V) pair

Each mapper creates an output buffer/file per reducer

Leads to M*R number of output buffers/files per shuffle

sort >= 10,000 Reducers

Default since Spark 1.2

Minimizes OS resources

Uses Netty to optimize Network I/O

Created custom Data Structs/Algos

Wins Daytona GraySort Challenge

unsafe -> Tungsten, Default in Spark 1.5

Uses sun.misc.Unsafe to self-manage binary array buffers

Uses custom serialization format

Can operate on compressed and serialized buffers

New Data Structures

“I don’t know your data structure, but my array[] will beat it!”
Custom Data Structures for Sort/Shuffle Workload

UnsafeRow

BytesToBytesMap

sun.misc.Unsafe
Info

addressSize()

pageSize()
Objects

allocateInstance()

objectFieldOffset()
Classes

staticFieldOffset()

defineClass()

defineAnonymousClass()

ensureClassInitialized()
Synchronization

monitorEnter()

tryMonitorEnter()

monitorExit()

compareAndSwapInt()

putOrderedInt()
Arrays

arrayBaseOffset()

arrayIndexScale()
Memory

allocateMemory()

copyMemory()

freeMemory()

getAddress() – not guaranteed after GC

getInt()/putInt()

getBoolean()/putBoolean()

getByte()/putByte()

getShort()/putShort()

getLong()/putLong()

getFloat()/putFloat()

getDouble()/putDouble()

getObjectVolatile()/putObjectVolatile()
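A few of the memory methods listed above can be exercised in a short Java sketch. The class name is hypothetical, and the reflective lookup of the `theUnsafe` field is a common workaround rather than a supported API. It assumes a JDK that still exposes `sun.misc.Unsafe`; newer JDKs print deprecation warnings for these memory-access methods. Memory allocated this way is invisible to the garbage collector and must be freed by hand.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Illustrative sketch only: writes longs to off-heap memory and reads them back.
final class UnsafeOffHeapSketch {

    static Unsafe loadUnsafe() {
        try {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            return (Unsafe) field.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is not accessible on this JVM", e);
        }
    }

    static long sumOffHeap(long[] values) {
        Unsafe unsafe = loadUnsafe();
        long bytes = values.length * 8L;
        long address = unsafe.allocateMemory(bytes);   // raw address outside the Java heap
        try {
            for (int i = 0; i < values.length; i++) {
                unsafe.putLong(address + i * 8L, values[i]);
            }
            long sum = 0;
            for (int i = 0; i < values.length; i++) {
                sum += unsafe.getLong(address + i * 8L);
            }
            return sum;
        } finally {
            unsafe.freeMemory(address);                // no GC involvement: free explicitly
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOffHeap(new long[] {1, 2, 3, 4}));
    }
}
```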

Spark + sun.misc.Unsafe
org.apache.spark.sql.execution.
aggregate.SortBasedAggregate
aggregate.TungstenAggregate
aggregate.AggregationIterator
aggregate.udaf
aggregate.utils
SparkPlanner
rowFormatConverters
UnsafeFixedWidthAggregationMap
UnsafeExternalSorter
UnsafeExternalRowSorter
UnsafeKeyValueSorter
UnsafeKVExternalSorter
local.ConvertToUnsafeNode
local.ConvertToSafeNode
local.HashJoinNode
local.ProjectNode
local.LocalNode
local.BinaryHashJoinNode
local.NestedLoopJoinNode
joins.HashJoin
joins.HashSemiJoin
joins.HashedRelation
joins.BroadcastHashJoin
joins.ShuffledHashOuterJoin (not yet converted)
joins.BroadcastHashOuterJoin
joins.BroadcastLeftSemiJoinHash
joins.BroadcastNestedLoopJoin
joins.SortMergeJoin
joins.LeftSemiJoinBNL
joins.SortMergeOuterJoin
Exchange
SparkPlan
UnsafeRowSerializer
SortPrefixUtils
sort
basicOperators
aggregate.SortBasedAggregationIterator
aggregate.TungstenAggregationIterator
datasources.WriterContainer
datasources.json.JacksonParser
datasources.jdbc.JDBCRDD
org.apache.spark.
unsafe.Platform
unsafe.KVIterator
unsafe.array.LongArray
unsafe.array.ByteArrayMethods
unsafe.array.BitSet
unsafe.bitset.BitSetMethods
unsafe.hash.Murmur3_x86_32
unsafe.map.BytesToBytesMap
unsafe.map.HashMapGrowthStrategy
unsafe.memory.TaskMemoryManager
unsafe.memory.ExecutorMemoryManager
unsafe.memory.MemoryLocation
unsafe.memory.UnsafeMemoryAllocator
unsafe.memory.MemoryAllocator (trait/interface)
unsafe.memory.MemoryBlock
unsafe.memory.HeapMemoryAllocator
unsafe.memory.ExecutorMemoryManager
unsafe.sort.RecordComparator
unsafe.sort.PrefixComparator
unsafe.sort.PrefixComparators
unsafe.sort.UnsafeSorterSpillWriter
serializer.DummySerializationInstance
shuffle.unsafe.UnsafeShuffleManager
shuffle.unsafe.UnsafeShuffleSortDataFormat
shuffle.unsafe.SpillInfo
shuffle.unsafe.UnsafeShuffleWriter
shuffle.unsafe.UnsafeShuffleExternalSorter
shuffle.unsafe.PackedRecordPointer
shuffle.ShuffleMemoryManager
util.collection.unsafe.sort.UnsafeSorterSpillMerger
util.collection.unsafe.sort.UnsafeSorterSpillReader
util.collection.unsafe.sort.UnsafeSorterSpillWriter
util.collection.unsafe.sort.UnsafeShuffleInMemorySorter
util.collection.unsafe.sort.UnsafeInMemorySorter
util.collection.unsafe.sort.RecordPointerAndKeyPrefix
util.collection.unsafe.sort.UnsafeSorterIterator
network.shuffle.ExternalShuffleBlockResolver
scheduler.Task
rdd.SqlNewHadoopRDD
executor.Executor
org.apache.spark.sql.catalyst.expressions.
regexpExpressions
BoundAttribute
SortOrder
SpecializedGetters
ExpressionEvalHelper
UnsafeArrayData
UnsafeReaders
UnsafeMapData
Projection
LiteralGenerator
UnsafeRow
JoinedRow
SpecializedGetters
InputFileName
SpecificMutableRow
codegen.CodeGenerator
codegen.GenerateProjection
codegen.GenerateUnsafeRowJoiner
codegen.GenerateSafeProjection
codegen.GenerateUnsafeProjection
codegen.BufferHolder
codegen.UnsafeRowWriter
codegen.UnsafeArrayWriter
complexTypeCreator
rows
literals
misc
stringExpressions
Over 200 source files affected!!

CPU & Memory Optimizations
Custom Managed Memory

Reduces GC overhead

Both on and off heap

Exact size calculations
Direct Binary Processing

Operate on serialized/compressed arrays

Kryo can reorder serialized records

LZF can reorder compressed records
More CPU Cache-aware Data Structs & Algorithms

o.a.s.unsafe.map.BytesToBytesMap vs. j.u.HashMap
Code Generation (default in 1.5)

Generate source code from overall query plan

Janino generates bytecode from source code

100+ UDFs converted to use code generation
Related classes:
UnsafeFixedWidthAggregationMap & TungstenAggregationIterator
CodeGenerator & GenerateUnsafeRowJoiner
UnsafeSortDataFormat & UnsafeShuffleSortDataFormat & PackedRecordPointer & UnsafeRow
UnsafeInMemorySorter & UnsafeExternalSorter & UnsafeShuffleWriter
Mostly Same Join Code, added if (isUnsafeMode)
UnsafeShuffleManager & UnsafeShuffleInMemorySorter & UnsafeShuffleExternalSorter
Details in SPARK-7075

Code Generation (Default in 1.5)
Problem
Generic expression evaluation
Expensive on JVM
Virtual func calls
Branches based on expression type
Boxing causes excessive object creation
Implementation
Defer source code generation to each operator, type, etc
Scala quasiquotes provide AST manipulation & rewriting
Generates source code, compiled to bytecode w/ Janino
100+ UDFs now using code gen

Code Generation: Spark SQL UDFs
100+ UDFs now using code gen – More to come in Spark 1.6!
Details in SPARK-8159

Project Tungsten in Other Spark Libraries
SortDataFormat<K, Buffer>: Base trait

UncompressedInBlockSort: MLlib.ALS

EdgeArraySortDataFormat: GraphX.Edge

Spark SQL: DataSources and Tuning
Understand Partitions, Pruning, Predicate Pushdowns

Understand DataFrames, Catalyst, DataSources

Create a DataSource Implementation


Partitions
Partition based on data usage patterns

/genders.parquet/gender=M/…

/gender=F/… <-- Use case: access users by gender

/gender=U/…
Partition Discovery (Read Path)

Infer partitions from organization of data (ie. gender=F)

Dynamic Partitions (Write Path)

Dynamically create partitions based on given column(s)

SQL: INSERT INTO TABLE genders PARTITION (gender) SELECT …

DF: gendersDF.write.format("parquet").partitionBy("gender").save(…)

Pruning
Partition Pruning

Skip entire partitions (directories of rows) whose partition key does not match the filter

SELECT id, gender FROM genders WHERE gender = 'U'

Column Pruning

Filter out entire columns for all rows if not required

Optimized for columnar storage formats (Parquet)

Minimize data shuffle during joins
Note: gender = partition key in the query above

Predicate Pushdowns
“Predicate” == “Filter”
Filters rows as deep into the data source as possible
Predicate returns [true|false] for given func/condition

Putting It All Together
Reduce Columns: Column Pruning
Reduce Rows: Partitioning, Predicate Pushdown
SELECT b FROM table WHERE a IN (a2, a3)
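A toy Java model of this query shows where rows and columns get skipped. It assumes the table is partitioned by column `a` and stored column by column inside each partition. Every name here is hypothetical, and no Spark API is involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative sketch only: an in-memory stand-in for a partitioned, columnar table.
final class PruningSketch {

    // partition value of column "a" -> column name -> that column's values
    static Map<String, Map<String, List<String>>> sampleTable() {
        Map<String, Map<String, List<String>>> table = new TreeMap<>();
        for (String a : Arrays.asList("a1", "a2", "a3")) {
            Map<String, List<String>> columns = new TreeMap<>();
            columns.put("b", Arrays.asList("b-" + a));
            columns.put("c", Arrays.asList("c-" + a));
            table.put(a, columns);
        }
        return table;
    }

    // SELECT b FROM table WHERE a IN (wantedA)
    static List<String> selectB(Map<String, Map<String, List<String>>> table, Set<String> wantedA) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Map<String, List<String>>> partition : table.entrySet()) {
            if (!wantedA.contains(partition.getKey())) {
                continue;                                  // reduce rows: partition a1 is never opened
            }
            result.addAll(partition.getValue().get("b"));  // reduce columns: column c is never read
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> wanted = new HashSet<>(Arrays.asList("a2", "a3"));
        System.out.println(selectB(sampleTable(), wanted));
    }
}
```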


DataFrames Overview
Inspired by R and Pandas DataFrames
Cross language support

SQL, Python, Scala, Java, R
Levels performance of Python, Scala, Java, and R

Generates JVM bytecode vs serialize/pickle to Python
DataFrame is Container for Logical Plan

Lazy transformations represented as tree

Catalyst Optimizer creates physical plan
DataFrame.rdd returns the underlying RDD if needed
Custom UDF using registerFunction()
New, experimental UDAF support
Use DataFrames instead of RDDs!!

Catalyst Optimizer

Optimize DataFrame Transformation Tree
Subquery elimination: use aliases to collapse subqueries
Constant folding: replace expression with constant
Simplify filters: remove unnecessary filters
Predicate/filter pushdowns: avoid unnecessary data load
Projection collapsing: avoid unnecessary projections
Create Custom Rules
Rules are Scala Case Classes
val newPlan = MyFilterRule(analyzedPlan)

Implements o.a.s.sql.catalyst.rules.Rule
Apply to any stage
JVM code generation
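Constant folding, one of the optimizations listed above, can be illustrated with a tiny expression tree in Java. This is a conceptual sketch only: Catalyst rules are written in Scala against Catalyst's own tree classes, and the `Lit`, `Col`, `Add`, and `fold` names below are invented for the example.

```java
// Illustrative sketch only: a toy expression tree, not Catalyst's classes.
final class ConstantFoldingSketch {

    interface Expr {}

    static final class Lit implements Expr {
        final int value;
        Lit(int value) { this.value = value; }
    }

    static final class Col implements Expr {
        final String name;
        Col(String name) { this.name = name; }
    }

    static final class Add implements Expr {
        final Expr left;
        final Expr right;
        Add(Expr left, Expr right) { this.left = left; this.right = right; }
    }

    // Rewrites the tree bottom-up, replacing Add(Lit, Lit) with a single Lit.
    static Expr fold(Expr e) {
        if (e instanceof Add) {
            Add add = (Add) e;
            Expr left = fold(add.left);
            Expr right = fold(add.right);
            if (left instanceof Lit && right instanceof Lit) {
                return new Lit(((Lit) left).value + ((Lit) right).value);
            }
            return new Add(left, right);
        }
        return e;
    }

    static String show(Expr e) {
        if (e instanceof Lit) return String.valueOf(((Lit) e).value);
        if (e instanceof Col) return ((Col) e).name;
        Add add = (Add) e;
        return "(" + show(add.left) + " + " + show(add.right) + ")";
    }

    public static void main(String[] args) {
        Expr plan = new Add(new Col("x"), new Add(new Lit(1), new Lit(2)));
        System.out.println(show(plan) + "  becomes  " + show(fold(plan)));
    }
}
```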

Columnar Storage Format
Skip whole chunks with min-max heuristics stored in each chunk (sorted data only)

Parquet File Format
  Based on Google Dremel
  Implemented by Twitter and Cloudera
  Columnar storage format
  Optimized for fast columnar aggregations
  Tight compression
  Supports pushdowns
  Nested, self-describing, evolving schema

Types of Compression
  Run Length Encoding: Repeated data
  Dictionary Encoding: Fixed set of values
  Delta, Prefix Encoding: Sorted data
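The encodings named above can each be sketched in a few lines of Java. These are textbook versions under simplifying assumptions (int, String, and long columns held in memory) and are not Parquet's actual on-disk encodings. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: textbook encodings, not Parquet's implementation.
final class ColumnEncodingSketch {

    // Run-length encoding: each element of the result is {value, runLength}.
    static List<int[]> runLengthEncode(int[] values) {
        List<int[]> runs = new ArrayList<>();
        for (int v : values) {
            if (!runs.isEmpty() && runs.get(runs.size() - 1)[0] == v) {
                runs.get(runs.size() - 1)[1]++;
            } else {
                runs.add(new int[] {v, 1});
            }
        }
        return runs;
    }

    // Dictionary encoding: appends distinct values to the (initially empty) dictionary
    // and returns one small integer code per input value.
    static int[] dictionaryEncode(String[] values, List<String> dictionary) {
        Map<String, Integer> codes = new HashMap<>();
        int[] encoded = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            Integer code = codes.get(values[i]);
            if (code == null) {
                code = dictionary.size();
                codes.put(values[i], code);
                dictionary.add(values[i]);
            }
            encoded[i] = code;
        }
        return encoded;
    }

    // Delta encoding: first value, then differences; small numbers when the data is sorted.
    static long[] deltaEncode(long[] values) {
        long[] deltas = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            deltas[i] = (i == 0) ? values[0] : values[i] - values[i - 1];
        }
        return deltas;
    }

    public static void main(String[] args) {
        List<String> dictionary = new ArrayList<>();
        int[] codes = dictionaryEncode(new String[] {"M", "F", "M", "U", "F"}, dictionary);
        System.out.println(dictionary + " " + java.util.Arrays.toString(codes));
    }
}
```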

Query Plan Debugging
gendersCsvDF.select($"id", $"gender").filter("gender != 'F'").filter("gender != 'M'").explain(true)
DataFrame.queryExecution.logical
DataFrame.queryExecution.analyzed
DataFrame.queryExecution.optimizedPlan
DataFrame.queryExecution.executedPlan

Query Plan Visualization & Query Metrics
Effectiveness of Filter
CPU Cache Friendly Binary Format
Cost-based Join Optimization
Similar to MapReduce Map-side Join
Peak Memory for Joins and Aggs

Demo!
Show Various File Formats, Partitioning Schemes, DataSource Implementations, and Query Plans
Anonymous, Public Dating Dataset
RATINGS: UserID, ProfileID, Rating (1-10)
GENDERS: UserID, Gender (M, F, U)

DataSources API
Relations (o.a.s.sql.sources.interfaces.scala)

BaseRelation (abstract class): Provides schema of data

TableScan (impl): Read all data from source

PrunedFilteredScan (impl): Column pruning & predicate pushdowns

InsertableRelation (impl): Insert/overwrite data based on SaveMode

RelationProvider (trait/interface): Handle options, BaseRelation factory
Execution (o.a.s.sql.execution.commands.scala)

RunnableCommand (trait/interface): Common commands like EXPLAIN

ExplainCommand(impl: case class)

CacheTableCommand(impl: case class)
Filters (o.a.s.sql.sources.filters.scala)

Filter (abstract class): Handles all predicates/filters supported by this source

EqualTo (impl)

GreaterThan (impl)

StringStartsWith (impl)

Native Spark SQL DataSources

JSON Data Source
DataFrame

val ratingsDF = sqlContext.read.format("json")
  .load("file:/root/pipeline/datasets/dating/ratings.json.bz2")
-- or --
val ratingsDF = sqlContext.read.json("file:/root/pipeline/datasets/dating/ratings.json.bz2")
SQL Code
CREATE TABLE genders USING json
OPTIONS
(path "file:/root/pipeline/datasets/dating/genders.json.bz2")

json() convenience method

JDBC Data Source
Add Driver to Spark JVM System Classpath

$ export SPARK_CLASSPATH=<jdbc-driver.jar>

DataFrame

val jdbcConfig = Map("driver" -> "org.postgresql.Driver",
  "url" -> "jdbc:postgresql://hostname:port/database",
  "dbtable" -> "schema.tablename")

val df = sqlContext.read.format("jdbc").options(jdbcConfig).load()

SQL

CREATE TABLE genders USING jdbc  

OPTIONS (url, dbtable, driver, …)


Parquet Data Source
Configuration

spark.sql.parquet.filterPushdown=true

spark.sql.parquet.mergeSchema=true

spark.sql.parquet.cacheMetadata=true

spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]
DataFrames

val gendersDF = sqlContext.read.format("parquet")

.load("file:/root/pipeline/datasets/dating/genders.parquet")

gendersDF.write.format("parquet").partitionBy("gender")

.save("file:/root/pipeline/datasets/dating/genders.parquet")
SQL

CREATE TABLE genders USING parquet

OPTIONS

(path "file:/root/pipeline/datasets/dating/genders.parquet")


ORC Data Source
Configuration

spark.sql.orc.filterPushdown=true
DataFrames

val gendersDF = sqlContext.read.format("orc")

.load("file:/root/pipeline/datasets/dating/genders")

gendersDF.write.format("orc").partitionBy("gender")

.save("file:/root/pipeline/datasets/dating/genders")
SQL

CREATE TABLE genders USING orc

OPTIONS

(path "file:/root/pipeline/datasets/dating/genders")


Third-Party Spark SQL DataSources

CSV DataSource (Databricks)
Github
https://github.com/databricks/spark-csv
Maven

com.databricks:spark-csv_2.10:1.2.0
Code

val gendersCsvDF = sqlContext.read
  .format("com.databricks.spark.csv")
  .load("file:/root/pipeline/datasets/dating/gender.csv.bz2")
  .toDF("id", "gender")
toDF() is required if CSV does not contain header

Avro DataSource (Databricks)
Github

https://github.com/databricks/spark-avro

Maven

com.databricks:spark-avro_2.10:2.0.1

Code

val df = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("file:/root/pipeline/datasets/dating/gender.avro")

ElasticSearch DataSource (Elastic.co)
Github

https://github.com/elastic/elasticsearch-hadoop

Maven

org.elasticsearch:elasticsearch-spark_2.10:2.1.0

Code

val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>",
  "es.port" -> "<port>")

df.write.format("org.elasticsearch.spark.sql").mode(SaveMode.Overwrite)
  .options(esConfig).save("<index>/<document-type>")

AWS Redshift Data Source (Databricks)
Github

https://github.com/databricks/spark-redshift

Maven

com.databricks:spark-redshift:0.5.0

Code

val df: DataFrame = sqlContext.read

.format("com.databricks.spark.redshift")

.option("url", "jdbc:redshift://<hostname>:<port>/<database>…")

.option("query", "select x, count(*) from my_table group by x")

.option("tempdir", "s3n://tmpdir")

.load(...)
UNLOAD and copy to tmp bucket in S3 enables parallel reads

Cassandra DataSource (DataStax)
Github

https://github.com/datastax/spark-cassandra-connector

Maven

com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1

Code

ratingsDF.write

.format("org.apache.spark.sql.cassandra")

.mode(SaveMode.Append)

.options(Map("keyspace"->"<keyspace>",

"table"->"<table>")).save(…)


Cassandra Pushdown Support
spark-cassandra-connector/…/o.a.s.sql.cassandra.PredicatePushDown.scala

Pushdown Predicate Rules

1. Only push down non-partition key column predicates with =, >, <, >=, <= predicate

2. Only push down primary key column predicates with = or IN predicate.

3. If there are regular columns in the pushdown predicates, they should have at least one EQ expression on an indexed column and no IN predicates.

4. All partition column predicates must be included in the predicates to be pushed down, only the last part of the partition key can be an IN predicate. For each partition column, only one predicate is allowed.

5. For cluster column predicates, only last predicate can be non-EQ predicate including IN predicate, and preceding column predicates must be EQ predicates. If there is only one cluster column predicate, the predicates could be any non-IN predicate.

6. There is no pushdown predicates if there is any OR condition or NOT IN condition.

7. We're not allowed to push down multiple predicates for the same column if any of them is equality or IN predicate.

Rumor of New Cassandra DataSource
By-pass CQL front door used for transactional data

Bulk read/write directly from/to SSTables

Similar to existing Netflix Open Source project

https://github.com/Netflix/aegisthus

Promotes Cassandra to first-class Analytics Option

Potentially only part of DataStax Enterprise?!

Please mail a nasty letter to your local DataStax office


Creating a Custom Data Source
Study Existing Native and Third-Party Data Source Impls

Native: JDBC (o.a.s.sql.execution.datasources.jdbc)

class JDBCRelation extends BaseRelation

with PrunedFilteredScan

with InsertableRelation
Third-Party: Cassandra (o.a.s.sql.cassandra)

class CassandraSourceRelation extends BaseRelation

with PrunedFilteredScan

with InsertableRelation

<Insert Your Custom Data Source Here!>


Cloudant DataSource (IBM)
Github

http://spark-packages.org/package/cloudant/spark-cloudant

Maven

com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1

Code

ratingsDF.write.format("com.cloudant.spark")

.mode(SaveMode.Append)

.options(Map("cloudant.host"->"<account>.cloudant.com",

"cloudant.username"->"<username>",

"cloudant.password"->"<password>"))

.save("<filename>")

DB2 and BigSQL DataSources (IBM)
Coming Soon!

Rumor of REST DataSource (Databricks)
Coming Soon?

Ask Michael Armbrust
Spark SQL Lead @ Databricks

Custom DataSource (Me and You All!)
Coming Right Now!
DEMO ALERT!!

Demo!
Create a Custom DataSource

Contributing a Custom Data Source
spark-packages.org

Managed by

Contains links to externally-managed github projects

Ratings and comments

Requires supported Spark version for each package
Examples

https://github.com/databricks/spark-csv

https://github.com/databricks/spark-avro

https://github.com/databricks/spark-redshift


Spark Streaming: Scaling & Approximations
Understand Parallelism, Recovery, and Back Pressure

Describe Common Streaming Count Approximations

Direct Kafka Streaming
  KafkaRDD partitions store relevant offsets
  Each partition acts as a Receiver
  Tasks/workers pull from Kafka in parallel
  Partitions rebuild from Kafka using offsets
  No Write Ahead Log (WAL) needed
  Optimizes happy path by avoiding the WAL
  At least once delivery guarantee

Parallelism of Direct Kafka Streaming

Not-so-direct Kinesis Streaming
  KinesisRDD partitions store relevant offsets
  Single receiver required to see all data/offsets
  Kinesis offsets not deterministic like Kafka
  Partitions rebuild from Kinesis using offsets
  No Write Ahead Log (WAL) needed
  Optimizes happy path by avoiding the WAL
  At least once delivery guarantee

Streaming Back Pressure
More than Throttling

Push back on the source

Requires buffered source (Kafka, Kinesis)

Based on fundamentals of Control Theory

Contributed by Typesafe

HyperLogLog
  Approximate cardinality (approx count distinct)
  Fixed, low memory
  Tunable error percentage
  Only 1.5 KB @ 2% error, 10^9 elements
  Twitter’s Algebird
  Streaming example in Spark codebase
  Spark’s countApproxDistinctByKey()
http://research.neustar.biz/
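The HyperLogLog idea can be sketched in plain Java as below. This is a minimal textbook version, not the implementation used by Algebird or Spark. It assumes items are 64-bit ids, uses a SplitMix64-style mixer as a stand-in hash, and spends one byte per register, so it uses 2^p bytes rather than the tighter packing behind the 1.5 KB figure. The class name and the choice of p are illustrative.

```java
// Illustrative sketch only: a basic HyperLogLog, not Algebird's or Spark's implementation.
final class HyperLogLogSketch {
    private final int p;             // number of index bits; use p >= 7 for the alpha constant below
    private final int m;             // number of registers = 2^p
    private final byte[] registers;  // each holds the largest "rank" seen for its bucket

    HyperLogLogSketch(int p) {
        this.p = p;
        this.m = 1 << p;
        this.registers = new byte[m];
    }

    // Stand-in 64-bit hash (SplitMix64 mixing steps); an assumption of this sketch.
    static long mix64(long z) {
        z += 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    void add(long item) {
        long h = mix64(item);
        int index = (int) (h >>> (64 - p));   // top p bits pick the register
        long rest = h << p;                   // remaining bits, left-aligned
        int rank = Math.min(Long.numberOfLeadingZeros(rest) + 1, 64 - p + 1);
        if (rank > registers[index]) {
            registers[index] = (byte) rank;
        }
    }

    double estimate() {
        double sum = 0.0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2.0, -r);
            if (r == 0) zeros++;
        }
        double alpha = 0.7213 / (1.0 + 1.079 / m);
        double raw = alpha * m * m / sum;
        if (raw <= 2.5 * m && zeros > 0) {
            return m * Math.log((double) m / zeros);  // small-range (linear counting) correction
        }
        return raw;
    }

    public static void main(String[] args) {
        HyperLogLogSketch hll = new HyperLogLogSketch(12);
        for (long i = 0; i < 1_000_000; i++) hll.add(i % 250_000);  // 250,000 distinct ids
        System.out.println("estimated distinct: " + Math.round(hll.estimate()));
    }
}
```

The expected relative error of this estimator is about 1.04 / sqrt(m), so the register count is the knob behind the tunable error percentage, and memory stays fixed no matter how many items are added.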

Count Min Sketch
  Approximate counters
  Better than HashMap
  Low, fixed memory
  Known error bounds
  Large num of counters
  From Twitter Algebird
  Streaming example in Spark codebase
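A Count-Min Sketch fits in a few dozen lines of Java. This version is a sketch under assumptions: items are 64-bit ids, each row uses a seeded SplitMix64-style mixer instead of a formally pairwise-independent hash family, and the class name is hypothetical. It is not Algebird's implementation.

```java
import java.util.Random;

// Illustrative sketch only: a basic Count-Min Sketch, not Algebird's implementation.
final class CountMinSketchExample {
    private final int depth;        // number of hash rows
    private final int width;        // counters per row
    private final long[][] counts;
    private final long[] seeds;

    CountMinSketchExample(int depth, int width, long seed) {
        this.depth = depth;
        this.width = width;
        this.counts = new long[depth][width];
        this.seeds = new long[depth];
        Random rng = new Random(seed);
        for (int row = 0; row < depth; row++) seeds[row] = rng.nextLong();
    }

    // Stand-in hash (SplitMix64 mixing steps); an assumption of this sketch.
    private static long mix64(long z) {
        z += 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    private int bucket(long item, int row) {
        return (int) Math.floorMod(mix64(item ^ seeds[row]), (long) width);
    }

    void add(long item, long count) {
        for (int row = 0; row < depth; row++) {
            counts[row][bucket(item, row)] += count;
        }
    }

    // Never underestimates: collisions can only inflate a counter, so take the minimum.
    long estimate(long item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, counts[row][bucket(item, row)]);
        }
        return min;
    }

    public static void main(String[] args) {
        CountMinSketchExample cms = new CountMinSketchExample(5, 2000, 42L);
        for (long id = 0; id < 10_000; id++) cms.add(id, 1);
        cms.add(42L, 1000);
        System.out.println("estimate for id 42: " + cms.estimate(42L));
    }
}
```

Memory is depth * width counters regardless of how many distinct items arrive, which is the trade-off against an exact HashMap of counters.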

Monte Carlo Simulations
From Manhattan Project (A-bomb)
Simulate movement of neutrons

Law of Large Numbers (LLN)
Average of results of many trials
converges on expected value

SparkPi example in Spark codebase

Pi ~= (# red dots / # total dots) * 4
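A single-JVM version of the Pi estimate looks like this in Java. It is a sketch of the same idea as the SparkPi example, not that example's code. The class name and the fixed seed are illustrative. Points are drawn in the unit square, and the "red dots" are those that land inside the quarter circle of radius 1.

```java
import java.util.Random;

// Illustrative sketch only: sequential Monte Carlo estimate of Pi.
final class MonteCarloPiSketch {

    static double estimatePi(int samples, long seed) {
        Random rng = new Random(seed);
        int inside = 0;
        for (int i = 0; i < samples; i++) {
            double x = rng.nextDouble();
            double y = rng.nextDouble();
            if (x * x + y * y <= 1.0) {
                inside++;                    // a "red dot": inside the quarter circle
            }
        }
        return 4.0 * inside / samples;       // (# red dots / # total dots) * 4
    }

    public static void main(String[] args) {
        for (int samples : new int[] {1_000, 100_000, 10_000_000}) {
            System.out.println(samples + " samples -> " + estimatePi(samples, 42L));
        }
    }
}
```

By the Law of Large Numbers the estimate tightens as the sample count grows, with the error shrinking roughly in proportion to 1 / sqrt(samples).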

Spark ML: High Scale Machine Learning
Define Similarity and Dimension Reduction

Describe Sampling and Bucketing

Generate 10 Recommendations

Live, Interactive Demo!
sparkafterdark.com

Audience Participation Needed!!
Audience Instructions
  Navigate to sparkafterdark.com
  Click 3 actresses and 3 actors

  Wait for us to analyze together!
Note: This is totally anonymous!!

Project Links
  https://github.com/fluxcapacitor/pipeline
  https://hub.docker.com/r/fluxcapacitor

Similarity

Types of Similarity
Euclidean: linear measure
Magnitude bias
Cosine: angle measure
Adjust for magnitude bias
Jaccard: (intersection / union)
Popularity bias
Log Likelihood
Adjust for popularity bias
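Three of the measures above are short enough to write out in Java. Log likelihood is left out because it needs more setup. The class and method names are hypothetical. The example in `main` scales one vector by 2: the Euclidean distance changes, while the cosine stays at 1, which is what "adjust for magnitude bias" refers to.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: basic similarity measures.
final class SimilaritySketch {

    static double euclideanDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // |intersection| / |union| of two sets of liked items.
    static double jaccardSimilarity(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / union;
    }

    public static void main(String[] args) {
        double[] v = {1, 0};
        double[] scaled = {2, 0};
        System.out.println("euclidean: " + euclideanDistance(v, scaled));
        System.out.println("cosine:    " + cosineSimilarity(v, scaled));
        Set<String> likesA = new HashSet<>(Arrays.asList("Ali", "Matei", "Reynold"));
        Set<String> likesB = new HashSet<>(Arrays.asList("Matei", "Reynold", "Patrick"));
        System.out.println("jaccard:   " + jaccardSimilarity(likesA, likesB));
    }
}
```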

Ali Matei Reynold Patrick Andy
Kimberly 1 1 1 1
Leslie 1 1
Meredith 1 1 1
Lisa 1 1 1
Holden 1 1 1 1 1

All-Pairs Similarity Comparison
Compare everything to everything
aka. “pair-wise similarity” or “similarity join”
Naïve shuffle: O(m*n^2); m=rows, n=cols

Minimize shuffle through approximations!
Reduce m (rows)
Sampling and bucketing
Reduce n (cols)
Remove most frequent value (ie. 0)
Principal Component Analysis
Dimension reduction!!

Dimension Reduction
Sampling and Bucketing

Reduce m: DIMSUM Sampling
“Dimension Independent Matrix Square using MapReduce”
Remove rows with low similarity probability
MLlib: RowMatrix.columnSimilarities(…)

Twitter: 40% efficiency gain over Cosine Similarity


Reduce m: LSH Bucketing
“Locality Sensitive Hashing”
Split m into b buckets
Use similarity hash algorithm
Requires pre-processing of data
Compare bucket contents in parallel
Converts O(m*n^2) -> O(m*n/b*b^2); m=rows, n=cols, b=buckets
e.g. 500k x 500k matrix: O(1.25e17) -> O(1.25e13); b=50

github.com/mrsqueeze/spark-hash
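A minimal sketch of the bucketing idea, assuming each row is a set of items and using MinHash signatures split into bands as the similarity hash. MinHash banding is one common LSH scheme, chosen here for illustration. The function names and parameters are hypothetical and are not the spark-hash API.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def _hash(value, seed):
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=20):
    """One minimum per seeded hash function; similar sets share many minima."""
    return tuple(min(_hash(item, seed) for item in items) for seed in range(num_hashes))


def lsh_candidate_pairs(rows, num_hashes=20, rows_per_band=2):
    """Bucket rows by bands of their signature; only rows sharing a bucket are paired."""
    buckets = defaultdict(set)
    for row_id, items in rows.items():
        signature = minhash_signature(items, num_hashes)
        for start in range(0, num_hashes, rows_per_band):
            band = (start, signature[start:start + rows_per_band])
            buckets[band].add(row_id)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


rows = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b", "c"},
    "u3": {"x", "y", "z"},
}
print(lsh_candidate_pairs(rows))
```

Only rows that land in the same bucket are compared, so most dissimilar pairs are never compared at all. Each bucket can be processed independently, which is what allows the comparison to run in parallel.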

Reduce n: Remove Most Frequent Value
Eliminate most-frequent value
Represent other values with (index,value) pairs
Converts O(m*n^2) -> O(m*nnz^2); nnz=number of nonzeros, nnz << n

Note: Choose most frequent value (may not be 0)
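A minimal sketch of the (index,value) representation, assuming a row is a plain Python list. The function names are illustrative. The most frequent value is detected per row rather than assumed to be 0.

```python
from collections import Counter


def to_sparse(row):
    """Drop the most frequent value; keep the rest as (index, value) pairs."""
    most_frequent, _ = Counter(row).most_common(1)[0]
    pairs = [(i, v) for i, v in enumerate(row) if v != most_frequent]
    return most_frequent, pairs


def to_dense(length, most_frequent, pairs):
    """Rebuild the original row from the sparse form."""
    row = [most_frequent] * length
    for i, v in pairs:
        row[i] = v
    return row


row = [0, 0, 3, 0, 0, 7, 0]
fill, pairs = to_sparse(row)
print(fill, pairs)
print(to_dense(len(row), fill, pairs) == row)
```

Work on the sparse form scales with the number of stored pairs rather than with the full row length.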

Recommendations
Summary Statistics and Historical Analysis
Collaborative Filtering and Clustering
Text Featurization and NLP

Types of Recommendations
Non-personalized 
No preference or behavior data for user, yet
aka “Cold Start Problem”

Personalized 
User-Item Similarity 
Items that others with similar prefs have liked
Item-Item Similarity 
Items similar to your previously-liked items

Recommendation Terminology
User
User seeking recommendations
Item
Item that has been liked or rated
Feedback
Explicit: like, rating
Implicit: search, click, hover, view, scroll
Feature Engineering
Dimension reduction

Non-Personalized Recommendations
Use Aggregate Data to Generate Recommendations

  Top Users by Like Count

“I might like users who have the most likes overall,
based on historical data.”
SparkSQL, DataFrames: Summary Stat, Aggs
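The aggregation behind this recommendation can be sketched without Spark, assuming likes arrive as (from_user, to_user) pairs. The function name and the sample pairs are made up for illustration. In Spark the same idea would be a group-by and count over a DataFrame.

```python
from collections import Counter


def top_users_by_like_count(likes, n=10):
    """Count likes received per user and return the n most-liked users."""
    counts = Counter(to_user for _, to_user in likes)
    return counts.most_common(n)


likes = [(1, 2), (3, 2), (4, 2), (1, 3), (2, 3), (1, 4)]
print(top_users_by_like_count(likes, n=2))
```

Because it uses only aggregate counts, this works for a brand-new user with no history.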


  Top Influencers by Like Graph

“I might like the most-influential users in the overall like graph.”
GraphX: PageRank
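A minimal power-iteration sketch of PageRank over a like graph, assuming edges are (liker, liked) pairs. The damping factor, iteration count, and function name are illustrative assumptions. This is not the GraphX implementation, which normalizes ranks differently.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Return a rank per node; ranks sum to 1 in this formulation."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A node with no outgoing likes spreads its rank evenly over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


ranks = pagerank([("a", "c"), ("b", "c"), ("c", "a")])
print(sorted(ranks.items(), key=lambda kv: kv[1], reverse=True))
```

Unlike a raw like count, a like from a highly ranked user is worth more than a like from a low-ranked one.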


Demo!
Generate Recommendations using Summary Stats & PageRank

Personalized Recommendations
Use Similarity to Generate Personalized Recommendations

  Like Behavior of Similar Users
“I like the same people that you like.  
What other people did you like that I haven’t seen?”
MLlib: Matrix Factorization, User-Item Similarity
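A minimal sketch of matrix factorization, assuming feedback is given as (user, item, rating) triples with small integer ids. It uses stochastic gradient descent for brevity, whereas MLlib's collaborative filtering is based on alternating least squares. The function names, hyperparameters, and sample ratings are illustrative.

```python
import random


def factorize(ratings, num_users, num_items, rank=2, steps=200, lr=0.05, reg=0.02, seed=0):
    """Learn user factors U and item factors V so that U[u] . V[i] approximates each rating."""
    rng = random.Random(seed)
    U = [[rng.uniform(0.0, 0.5) for _ in range(rank)] for _ in range(num_users)]
    V = [[rng.uniform(0.0, 0.5) for _ in range(rank)] for _ in range(num_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            err = r - sum(U[u][k] * V[i][k] for k in range(rank))
            for k in range(rank):
                uk, vk = U[u][k], V[i][k]
                # Gradient step on squared error with L2 regularization.
                U[u][k] += lr * (err * vk - reg * uk)
                V[i][k] += lr * (err * uk - reg * vk)
    return U, V


def predict(U, V, user, item):
    """Predicted preference of user for item."""
    return sum(a * b for a, b in zip(U[user], V[item]))


ratings = [(0, 0, 1.0), (0, 1, 1.0), (1, 0, 1.0), (1, 1, 1.0),
           (1, 2, 1.0), (2, 2, 0.0), (2, 3, 1.0), (0, 3, 0.0)]
U, V = factorize(ratings, num_users=3, num_items=4)
# User 0 has not rated item 2; users with similar likes (user 1) liked it.
print(predict(U, V, 0, 2))
```

The prediction for an unseen (user, item) pair comes from the learned factors, which is how the likes of similar users turn into a recommendation.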

Demo!
Generate Recommendations using  
Collaborative Filtering and Matrix Factorization

  Text-based Profiles Similar to Mine

“Our profiles have similar keywords and named entities.
We might like each other!”
MLlib: Word2Vec, TF/IDF, k-skip n-grams
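A minimal sketch of comparing profile text with TF/IDF weights and cosine similarity, assuming whitespace tokenization and a smoothed IDF. The function names and the sample profiles are made up for illustration. Word2Vec and n-grams are not covered here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Smoothed IDF: terms that appear in every document get weight 0.
        vectors.append({t: c * math.log((n + 1) / (doc_freq[t] + 1))
                        for t, c in term_freq.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


profiles = ["spark streaming kafka", "spark streaming cassandra", "wine cheese paris"]
v = tfidf_vectors(profiles)
print(cosine(v[0], v[1]), cosine(v[0], v[2]))
```

Profiles that share rare keywords score higher than profiles that share only common words.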

  Similar Profiles to Previous Likes 

“Your profile text has similar keywords and named entities to
other profiles of people I like. I might like you, too!”
MLlib: Word2Vec, TF/IDF, Doc Similarity

  Relevant, High-Value Emails

“Your initial email references a lot of things in my profile.
I might like you for making the effort!”
MLlib: Word2Vec, TF/IDF, Entity Recognition

[Figure: Her Email compared against My Profile]

The Future of Recommendations

  Eigenfaces: Facial Recognition
“Your face looks similar to others that I’ve liked. 
I might like you.”
MLlib: RowMatrix, PCA, Item-Item Similarity

Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html

  NLP Conversation Starter Bot!
“If your responses to my generic opening
lines are positive, I may read your profile.”
MLlib: TF/IDF, DecisionTrees, Sentiment Analysis
[Figure: responses classified as Positive or Negative]

Maintaining the Spark

  Recommendations for Couples
“I want Mad Max. You want Message In a Bottle.
Let’s find something in between to watch tonight.”
MLlib: RowMatrix, Item-Item Similarity 
GraphX: Nearest Neighbors, Shortest Path

[Figure: titles linked by similar plots and by similar actors]

Final Recommendation!

  Get Off the Computer & Meet People!
Thank you, Paris!!
Chris Fregly @cfregly
IBM Spark Technology Center
San Francisco, CA, USA
Relevant Links
advancedspark.com
Sign up for the book & global meetup!
github.com/fluxcapacitor/pipeline
Clone, contribute, and commit code!
hub.docker.com/r/fluxcapacitor/pipeline/wiki
Run all demos in your own environment with Docker!

More Relevant Links
http://meetup.com/Advanced-Apache-Spark-Meetup
http://advancedspark.com
http://github.com/fluxcapacitor/pipeline
http://hub.docker.com/r/fluxcapacitor/pipeline
https://www.safaribooksonline.com/library/view/java-performance-the/9781449363512/ch04.html
http://web.eece.maine.edu/~vweaver/projects/perf_events/perf_event_open.html
http://www.brendangregg.com/perf.html
https://perf.wiki.kernel.org/index.php/Tutorial
http://techblog.netflix.com/2015/07/java-in-flames.html
http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Java
http://sortbenchmark.org/ApacheSpark2014.pdf
https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
http://0x0fff.com/spark-architecture-shuffle/
http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
http://stackoverflow.com/questions/763262/how-does-one-write-code-that-best-utilizes-the-cpu-cache-to-improve-performance
http://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf
http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/
http://docs.scala-lang.org/overviews/quasiquotes/intro.html
http://lwn.net/Articles/252125/ <-- Memory Part 2: CPU Caches
http://lwn.net/Articles/255364/ <-- Memory Part 5: What Programmers Can Do


What’s Next?

What’s Next?
Autoscaling Spark Workers

Completely Docker-based

Docker Compose and Docker Machine
Lots of Demos and Examples!

Zeppelin & IPython/Jupyter notebooks

Advanced streaming use cases

Advanced ML, Graph, and NLP use cases
Performance Tuning and Profiling

Work closely with Brendan Gregg & Netflix

Surface & share more low-level details of Spark internals

Upcoming Meetups and Conferences
London Spark Meetup (Oct 12th)
Scotland Data Science Meetup (Oct 13th)
Dublin Spark Meetup (Oct 15th)
Barcelona Spark Meetup (Oct 20th)
Madrid Spark/Big Data Meetup (Oct 22nd)
Paris Spark Meetup (Oct 26th)
Amsterdam Spark Summit & Meetup (Oct 27th)
Delft Dutch Data Science Meetup (Oct 29th)
Brussels Spark Meetup (Oct 30th)
Zurich Big Data Developers Meetup (Nov 2nd)
Geneva Spark Meetup (Nov 5th)
San Francisco Datapalooza (Nov 10th)
San Francisco Advanced Apache Spark Meetup (Nov 12th)
Oslo Big Data Hadoop Meetup (Nov 18th)
Helsinki Spark Meetup (Nov 20th)
Stockholm Spark Meetup (Nov 23rd)
Copenhagen Spark Meetup (Nov 25th)
Budapest Spark Meetup (Nov 27th)
Singapore Strata Conference (Dec 1st)
San Francisco Advanced Apache Spark Meetup (Dec 8th)
Mountain View Advanced Apache Spark Meetup (Dec 10th)
Washington DC Advanced Apache Spark Meetup (Dec 17th)

Freg-a-palooza!
