* Title *
Spark After Dark 1.5: Deep Dive Into Latest Perf and Scale Improvements in Spark Ecosystem
* Abstract *
Combining the most popular and technically-deep material from his wildly popular Advanced Apache Spark Meetup, Chris Fregly will provide code-level deep dives into the latest performance and scalability advancements within the Apache Spark Ecosystem by exploring the following:
1) Building a Scalable and Performant Spark SQL/DataFrames Data Source Connector such as Spark-CSV, Spark-Cassandra, Spark-ElasticSearch, and Spark-Redshift
2) Speeding Up Spark SQL Queries using Partition Pruning and Predicate Pushdowns with CSV, JSON, Parquet, Avro, and ORC
3) Tuning Spark Streaming Performance and Fault Tolerance with KafkaRDD and KinesisRDD
4) Maintaining Stability during High Scale Streaming Ingestion using Approximations and Probabilistic Data Structures from Spark, Redis, and Twitter's Algebird
5) Building Effective Machine Learning Models using Feature Engineering, Dimension Reduction, and Natural Language Processing with MLlib/GraphX, ML Pipelines, DIMSUM, Locality Sensitive Hashing, and Stanford's CoreNLP
6) Tuning Core Spark Performance by Acknowledging Mechanical Sympathy for the Physical Limitations of OS and Hardware Resources such as CPU, Memory, Network, and Disk with Project Tungsten, Asynchronous Netty, and Linux epoll
* Demos *
This talk features many interesting and audience-interactive demos - as well as code-level deep dives into many of the projects listed above.
All demo code is available on Github at the following link: https://github.com/fluxcapacitor/pipeline/wiki
In addition, the entire demo environment has been Dockerized and made available for download on Docker Hub at the following link: https://hub.docker.com/r/fluxcapacitor/pipeline/
* Speaker Bio *
Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, a Netflix Open Source Committer, as well as the Organizer of the global Advanced Apache Spark Meetup and Author of the Upcoming Book, Advanced Spark.
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
When Chris isn’t contributing to Spark and other open source projects, he’s creating book chapters, slides, and demos to share knowledge with his peers at meetups and conferences throughout the world.
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Apache Spark Meetup to Date
1. Spark After Dark 1.5
IBM Spark
spark.tc
High Performance, Real-time, Streaming,
Machine Learning, Natural Language Processing,
Text Analytics, and Recommendations
Chris Fregly
Principal Data Solutions Engineer
IBM Spark Technology Center
** We’re Hiring -- Only Nice People, Please!! **
Paris Spark Meetup
October 26, 2015
2. Who Am I?
Power of data. Simplicity of design. Speed of innovation.
Streaming Data Engineer
Netflix Open Source Committer
Data Solutions Engineer
Apache Spark Contributor
Principal Data Solutions Engineer
IBM Spark Technology Center
Meetup Organizer
Advanced Apache Spark Meetup
Book Author
Advanced Spark (2016)
3. Advanced Apache Spark Meetup
Meetup Metrics
1400+ members in just 3 mos!
4th most active Spark Meetup!!
meetup.com/Advanced-Apache-Spark-Meetup
Meetup Goals
Dig deep into Spark & extended-Spark codebase
Study integrations incl. Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R
Surface & share patterns & idioms of these well-designed, distributed, big data components
4. Upcoming Meetups and Conferences
London Spark Meetup (Oct 12th)
Scotland Data Science Meetup (Oct 13th)
Dublin Spark Meetup (Oct 15th)
Barcelona Spark Meetup (Oct 20th)
Madrid Spark/Big Data Meetup (Oct 22nd)
Paris Spark Meetup (Oct 26th)
Amsterdam Spark Summit & Meetup (Oct 27th)
Delft Dutch Data Science Meetup (Oct 29th)
Brussels Spark Meetup (Oct 30th)
Zurich Big Data Developers Meetup (Nov 2nd)
Geneva Spark Meetup (Nov 5th)
San Francisco Datapalooza (Nov 10th)
San Francisco Advanced Apache Spark Meetup (Nov 12th)
Oslo Big Data Hadoop Meetup (Nov 18th)
Helsinki Spark Meetup (Nov 20th)
Stockholm Spark Meetup (Nov 23rd)
Copenhagen Spark Meetup (Nov 25th)
Budapest Spark Meetup (Nov 27th)
Singapore Strata Conference (Dec 1st)
San Francisco Advanced Apache Spark Meetup (Dec 8th)
Mountain View Advanced Apache Spark Meetup (Dec 10th)
Washington DC Advanced Apache Spark Meetup (Dec 17th)
Freg-a-palooza!
5. What is Spark After Dark?
Fun, Spark-based dating reference application
*Not a movie recommendation engine!!
Generate recommendations based on user similarity
Demonstrate Apache Spark and related big data projects
6. Tools of this Talk
Redis
Docker
Ganglia
Streaming, Kafka
Cassandra, NoSQL
Parquet, JSON, ORC, Avro
Apache Zeppelin Notebooks
Spark SQL, DataFrames, Hive
ElasticSearch, Logstash, Kibana
Spark ML, GraphX, Stanford CoreNLP
and…
7. Overall Themes of this Talk
Filter Early, Filter Deep
Approximations are OK
Minimize Random Seeks
Maximize Sequential Scans
Go Off-Heap when Possible
Parallelism is Required at Scale
Must Reduce Dimensions at Scale
Seek Performance Gains at all Layers
Customize Data Structs for your Workload
Be Nice and Collaborate with your Peers!
8. High-Level Sections
Spark Core: Performance Tuning
Spark SQL: DataSources and Tuning
Spark Streaming: Scale, Tuning, Approx
Spark ML: Scale, Dim Reduce, NLP
9. Spark Core: Performance Tuning
Acknowledging Mechanical Sympathy
100TB Daytona GraySort Challenge
Project Tungsten
10. Acknowledging Mechanical Sympathy
“Hardware and software working together in harmony”
-Martin Thompson
http://mechanical-sympathy.blogspot.com
Spark Mechanical Sympathy Concerns
Saturate Network I/O
Saturate Disk I/O
Minimize Memory Footprint and GC
Maximize CPU Cache Locality
11. Spark and Mechanical Sympathy
Daytona GraySort (Spark 1.1-1.2)
Saturate Network I/O
Saturate Disk I/O
Project Tungsten (Spark 1.4-1.6)
Minimize Memory and GC
Maximize CPU Cache Locality
12. AlphaSort Trick for Sorting
AlphaSort paper, 1995
Chris Nyberg and Jim Gray
Naïve
List (Pointer-to-Record): [Ptr]
Requires Key to be dereferenced for comparison
AlphaSort
List (Key, Pointer): [Key, Ptr]
Key is directly available for comparison
13. CPU Cache Line and Memory Sympathy
Key (10 bytes) + Pointer (4 bytes*) = 14 bytes: not cache-line friendly
*4 bytes when using compressed OOPs (<32 GB heap)
Not a power of two in size
Not CPU-cache friendly
Add Padding (2 bytes): cache-line friendly
Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes
Key-Prefix, Pointer: 2x cache-line friendly
Key distribution affects perf
Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes
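The key-prefix layout above can be sketched in plain Java. This is a minimal illustration, not Spark's implementation: the class and method names are hypothetical, the "pointer" is simply an index into an array of keys, and ties on the 4-byte prefix are resolved by falling back to a full key comparison.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class KeyPrefixSortSketch {

    // Pack a 4-byte key prefix (high 32 bits) and the record index (low 32 bits)
    // into one 8-byte long. Flipping the sign bit makes signed long ordering
    // match unsigned prefix ordering.
    static long pack(byte[] key, int index) {
        long prefix = 0;
        for (int i = 0; i < 4; i++) {
            prefix = (prefix << 8) | (i < key.length ? (key[i] & 0xFF) : 0);
        }
        return ((prefix << 32) | index) ^ Long.MIN_VALUE;
    }

    // Unsigned lexicographic comparison of two full keys (requires dereferencing).
    static int compareFull(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    // Returns record indexes in sorted key order.
    static int[] sortedOrder(byte[][] keys) {
        int n = keys.length;
        long[] packed = new long[n];
        for (int i = 0; i < n; i++) packed[i] = pack(keys[i], i);
        Arrays.sort(packed); // compares 8-byte primitives only, no key dereference
        int[] order = new int[n];
        for (int i = 0; i < n; i++) order[i] = (int) packed[i]; // low 32 bits = index
        int start = 0;
        while (start < n) { // fix up runs that share the same prefix
            int end = start + 1;
            while (end < n && (packed[end] >>> 32) == (packed[start] >>> 32)) end++;
            for (int i = start + 1; i < end; i++) { // insertion sort by full key
                int cur = order[i];
                int j = i - 1;
                while (j >= start && compareFull(keys[order[j]], keys[cur]) > 0) {
                    order[j + 1] = order[j];
                    j--;
                }
                order[j + 1] = cur;
            }
            start = end;
        }
        return order;
    }

    static int[] sortStrings(String... keys) {
        byte[][] raw = new byte[keys.length][];
        for (int i = 0; i < keys.length; i++) raw[i] = keys[i].getBytes(StandardCharsets.US_ASCII);
        return sortedOrder(raw);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sortStrings("delta", "alphabet", "charlie", "alpha", "bravo")));
    }
}
```

The more keys share a prefix, the more often the slow full-key path runs, which is one way to read "key distribution affects perf".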
14. Performance Comparison
15. Similar Technique: Direct Cache Access
Packet header placed into CPU cache
16. CPU Cache Lines
17. Instrumenting and Monitoring CPU
Linux perf command!
18. CPU Cache Naïve Matrix Multiplication
// Find dot product of each row and column vector
for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    for (k = 0; k < N; ++k)
      res[i][j] += matA[i][k] * matB[k][j];
Skipping row-wise, not using full CPU cache line, ineffective pre-fetching
19. CPU Cache Friendly Matrix Multiplication
// Transpose B
for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    matBtran[i][j] = matB[j][i];
// Modify dot product calculation for B transpose
for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    for (k = 0; k < N; ++k)
      res[i][j] += matA[i][k] * matBtran[j][k];
Good use of CPU cache line, effective prefetching
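The two loops above can be turned into a small runnable comparison. This is a sketch under stated assumptions: square matrices, a hypothetical class name, and a crude wall-clock timing in main that is only indicative (no JIT warm-up, no perf counters).

```java
import java.util.Arrays;

final class MatrixMultiplySketch {

    // Naive version: matB is walked column-wise, jumping between row arrays.
    static double[][] naive(double[][] a, double[][] b) {
        int n = a.length;
        double[][] res = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    res[i][j] += a[i][k] * b[k][j];
        return res;
    }

    // Cache-friendly version: transpose B once, then walk both operands row-wise.
    static double[][] cacheFriendly(double[][] a, double[][] b) {
        int n = a.length;
        double[][] bt = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                bt[i][j] = b[j][i];
        double[][] res = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    res[i][j] += a[i][k] * bt[j][k];
        return res;
    }

    public static void main(String[] args) {
        int n = 256; // same matrix size as the demo command line in this deck
        double[][] a = new double[n][n];
        double[][] b = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                a[i][j] = i + j;
                b[i][j] = i - j;
            }
        long t0 = System.nanoTime();
        double[][] r1 = naive(a, b);
        long t1 = System.nanoTime();
        double[][] r2 = cacheFriendly(a, b);
        long t2 = System.nanoTime();
        System.out.println("naive ms: " + (t1 - t0) / 1_000_000
            + ", cache-friendly ms: " + (t2 - t1) / 1_000_000
            + ", same result: " + Arrays.deepEquals(r1, r2));
    }
}
```

Both versions add the products in the same order of k, so the results are identical; only the memory access pattern differs.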
20. Demo!
Comparing CPU Naïve & Cache-Friendly Matrix Multiplication
21. Results of Naïve vs. Cache Friendly
Naïve Matrix Multiply vs. Cache Friendly Matrix Multiply
[perf stat output screenshots; improvement callouts: ~72x, ~8x, ~3x, ~3x, ~2x, ~7x, ~10x]
perf stat --repeat 5 --scale --event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses,cache-misses,stalled-cycles-frontend \
  java -Xmx13G -XX:-Inline -jar ~/sbt/bin/sbt-launch.jar "tungsten/run-main com.advancedspark.tungsten.matrix.Cache[Friendly|Naïve]MatrixMultiply 256 1"
22. Visualizing and Finding Hotspots
Flame Graphs with Java Stack Traces
Images courtesy of http://techblog.netflix.com/2015/07/java-in-flames.html
23. 100TB Daytona GraySort Challenge
Focus on Network and Disk I/O Optimizations
Improve Data Structs/Algos for Sort & Shuffle
Saturate Network and Disk Controllers
24. Winning Results
Spark Goals:
Saturate Network I/O
Saturate Disk I/O
[Results table columns: (2013), (2014)]
25. Winning Hardware Configuration
Compute
206 EC2 Worker nodes, 1 Master node
AWS i2.8xlarge
32 vCPUs (Intel Xeon E5-2670 @ 2.5 GHz)
244 GB RAM, 8 x 800GB SSD, RAID 0 striping, ext4
NOOP I/O scheduler: FIFO, request merging, no reordering
3 GBps mixed read/write disk I/O per node
Network
Deployed within Placement Group/VPC
Using AWS Enhanced Networking
Single Root I/O Virtualization (SR-IOV): extension of PCIe
10 Gbps, low latency, low jitter (iperf showed ~9.5 Gbps)
26. Winning Software Configuration
Spark 1.2, OpenJDK 1.7_<amazon-something>_u65-b17
Disable caching, compression, spec execution, shuffle spill
Force NODE_LOCAL task scheduling for optimal data locality
HDFS 2.4.1 short-circuit for local reads, 2x replication
4-6 partitions (tasks) per core is the Spark recommendation
206 nodes * 32 cores = 6592 cores
6592 cores * 4 = 26,368 partitions
6592 cores * 6 = 39,552 partitions
6592 cores * 4.25 = 28,000 partitions was empirically best
Range partitioning takes advantage of sequential keyspace
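The partition arithmetic above can be checked with a few lines of Java. This is only a sketch of the sizing rule as stated on the slide; the class and method names are hypothetical. Note that 4.25 partitions per core gives exactly 28,016, which the slide rounds to 28,000.

```java
final class PartitionCountSketch {

    static int totalCores(int nodes, int coresPerNode) {
        return nodes * coresPerNode;
    }

    // Number of partitions for a given partitions-per-core factor.
    static long partitions(int cores, double perCore) {
        return Math.round(cores * perCore);
    }

    public static void main(String[] args) {
        int cores = totalCores(206, 32);
        System.out.println("cores = " + cores);
        System.out.println("4 per core    -> " + partitions(cores, 4));
        System.out.println("6 per core    -> " + partitions(cores, 6));
        System.out.println("4.25 per core -> " + partitions(cores, 4.25));
    }
}
```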
27. New Shuffle Manager
New “Sort-based” shuffle manager replaces Hash-based
New Data Structures and Algos for Shuffle Sort
e.g. New TimSort for Arrays of (K,V) Pairs
28. New Network Module
Replaces old java.nio, low-level, socket-based code
Zero-copy with epoll: data stays in kernel space between disk & network
Custom memory management
spark.shuffle.blockTransferService=netty
Spark-Netty Performance Tuning
spark.shuffle.io.numConnectionsPerPeer
Increase to saturate hosts with multiple disks
spark.shuffle.io.preferDirectBuffers
On or Off-heap (Off-heap is default)
29. New Algorithms and Data Structures
Optimized for sort and shuffle
o.a.s.util.collection.TimSort
Based on JDK 1.7 TimSort
Performs best on partially-sorted datasets
Optimized for elements of (K,V) pairs
Sorts impl of SortDataFormat (e.g. KVArraySortDataFormat)
o.a.s.util.collection.AppendOnlyMap
Open addressing hash, quadratic probing
Array of [(K, V), (K, V)]
Good memory locality
Keys never removed, values only append
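The AppendOnlyMap bullets above describe an open-addressing table stored as one flat array of alternating keys and values. A minimal Java sketch of that idea follows. It is not Spark's code: the class name, the 0.7 load factor, and the hash-spreading step are assumptions, and null keys are not supported.

```java
final class AppendOnlyMapSketch {
    private Object[] data;      // [k0, v0, k1, v1, ...] in one array for memory locality
    private int capacity = 8;   // always a power of two
    private int size = 0;

    AppendOnlyMapSketch() {
        data = new Object[2 * capacity];
    }

    int size() {
        return size;
    }

    private static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // Insert a new key or overwrite the value of an existing one. No removal.
    void update(Object key, Object value) {
        int mask = capacity - 1;
        int pos = spread(key.hashCode()) & mask;
        int delta = 1;
        while (true) {
            Object cur = data[2 * pos];
            if (cur == null) {
                data[2 * pos] = key;
                data[2 * pos + 1] = value;
                size++;
                if (size > capacity * 0.7) grow();
                return;
            } else if (cur.equals(key)) {
                data[2 * pos + 1] = value;
                return;
            }
            pos = (pos + delta) & mask; // quadratic (triangular) probing
            delta++;
        }
    }

    Object get(Object key) {
        int mask = capacity - 1;
        int pos = spread(key.hashCode()) & mask;
        int delta = 1;
        while (true) {
            Object cur = data[2 * pos];
            if (cur == null) return null;
            if (cur.equals(key)) return data[2 * pos + 1];
            pos = (pos + delta) & mask;
            delta++;
        }
    }

    // Double the table and re-insert every existing pair.
    private void grow() {
        Object[] old = data;
        capacity *= 2;
        data = new Object[2 * capacity];
        size = 0;
        for (int i = 0; i < old.length; i += 2) {
            if (old[i] != null) update(old[i], old[i + 1]);
        }
    }

    public static void main(String[] args) {
        AppendOnlyMapSketch m = new AppendOnlyMapSketch();
        for (int i = 0; i < 100; i++) m.update("k" + i, i);
        System.out.println(m.size() + " entries, k42 -> " + m.get("k42"));
    }
}
```

Because keys are never removed, a probe can stop at the first empty slot without needing tombstones.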
30. Met Performance Goals!
Reducers: 1.1 GBps/node network I/O
(theoretical max = 1.25 GBps for 10 Gb Ethernet)
Mappers: 3 GBps/node disk I/O (8 x 800 GB SSD)
206 nodes * 1.1 GBps/node ~= 227 GBps
31. Shuffle Performance Tuning Tips
Hash Shuffle Manager (no longer default)
spark.shuffle.consolidateFiles: mapper output files
o.a.s.shuffle.FileShuffleBlockResolver
Intermediate Files
Increase spark.shuffle.file.buffer: reduce seeks & sys calls
Increase spark.reducer.maxSizeInFlight if memory allows
Use smaller number of larger workers to reduce total files
SQL: BroadcastHashJoin vs. ShuffledHashJoin
spark.sql.autoBroadcastJoinThreshold
Use DataFrame.explain(true) or EXPLAIN to verify
32. Project Tungsten
Focus on CPU Cache and Memory Optimizations
Further Improve Data Structures and Algorithms
Operate on Serialized/Compressed Data
Provide Path to Off Heap
33. Why is CPU the Bottleneck?
Network and Disk I/O bandwidth are relatively high
GraySort optimizations improved network & shuffle
More partitioning, pruning, and predicate pushdowns
Popularity of columnar file formats like Parquet/ORC
CPU is used for serialization, hashing, compression!
34. Spark Shuffle Managers
spark.shuffle.manager =
hash < 10,000 Reducers
Output file determined by hashing the key of (K,V) pair
Each mapper creates an output buffer/file per reducer
Leads to M*R number of output buffers/files per shuffle
sort >= 10,000 Reducers
Default since Spark 1.2
Minimizes OS resources
Uses Netty to optimize Network I/O
Created custom Data Structs/Algos
Wins Daytona GraySort Challenge
unsafe -> Tungsten, Default in Spark 1.5
Uses sun.misc.Unsafe to self-manage binary array buffers
Uses custom serialization format
Can operate on compressed and serialized buffers
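To make the M*R point above concrete, here is a small Java sketch. The names are hypothetical; the hash side follows the slide (one output per mapper per reducer, chosen by hashing the key), while the sort side assumes one data file plus one index file per map task.

```java
final class ShuffleFileCountSketch {

    // Hash shuffle: the reducer (and thus the output file) is picked by hashing the key.
    static int reducerFor(Object key, int numReducers) {
        return Math.floorMod(key.hashCode(), numReducers);
    }

    // Hash shuffle: each mapper writes one buffer/file per reducer.
    static long hashShuffleFiles(long mappers, long reducers) {
        return mappers * reducers;
    }

    // Sort shuffle (assumed layout): one sorted data file + one index file per mapper.
    static long sortShuffleFiles(long mappers) {
        return 2 * mappers;
    }

    public static void main(String[] args) {
        long mappers = 1_000;
        long reducers = 10_000;
        System.out.println("hash shuffle files: " + hashShuffleFiles(mappers, reducers));
        System.out.println("sort shuffle files: " + sortShuffleFiles(mappers));
        System.out.println("key 'user-42' goes to reducer " + reducerFor("user-42", (int) reducers));
    }
}
```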
35. New Data Structures
“I don’t know your data structure, but my array[] will beat it!”
Custom Data Structures for Sort/Shuffle Workload
UnsafeRow:
BytesToBytesMap:
36. sun.misc.Unsafe
Info
addressSize()
pageSize()
Objects
allocateInstance()
objectFieldOffset()
Classes
staticFieldOffset()
defineClass()
defineAnonymousClass()
ensureClassInitialized()
Synchronization
monitorEnter()
tryMonitorEnter()
monitorExit()
compareAndSwapInt()
putOrderedInt()
Arrays
arrayBaseOffset()
arrayIndexScale()
Memory
allocateMemory()
copyMemory()
freeMemory()
getAddress() – not guaranteed after GC
getInt()/putInt()
getBoolean()/putBoolean()
getByte()/putByte()
getShort()/putShort()
getLong()/putLong()
getFloat()/putFloat()
getDouble()/putDouble()
getObjectVolatile()/putObjectVolatile()
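A few of the memory calls listed above can be exercised with a short Java sketch. It assumes a JDK that still ships sun.misc.Unsafe (recent JDKs may print deprecation warnings for it), obtains the instance through the commonly used reflective "theUnsafe" field, and uses a hypothetical class name. It is an illustration only, not how Spark wraps these calls.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

final class UnsafeSketch {

    // Reflectively fetch the singleton; Unsafe.getUnsafe() rejects application code.
    static Unsafe unsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is not available", e);
        }
    }

    // Write a long into off-heap memory, read it back, and free the allocation.
    static long roundTrip(long value) {
        Unsafe u = unsafe();
        long address = u.allocateMemory(8); // off-heap, invisible to the GC
        try {
            u.putLong(address, value);
            return u.getLong(address);
        } finally {
            u.freeMemory(address); // manual memory management: no GC safety net
        }
    }

    public static void main(String[] args) {
        Unsafe u = unsafe();
        System.out.println("addressSize = " + u.addressSize() + ", pageSize = " + u.pageSize());
        System.out.println("round trip  = " + roundTrip(42L));
    }
}
```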
38. CPU & Memory Optimizations
Custom Managed Memory
Reduces GC overhead
Both on and off heap
Exact size calculations
Direct Binary Processing
Operate on serialized/compressed arrays
Kryo can reorder serialized records
LZF can reorder compressed records
More CPU Cache-aware Data Structs & Algorithms
o.a.s.unsafe.map.BytesToBytesMap vs. j.u.HashMap
Code Generation (default in 1.5)
Generate source code from overall query plan
Janino generates bytecode from source code
100+ UDFs converted to use code generation
UnsafeFixedWidthAggregationMap & TungstenAggregationIterator
CodeGenerator & GenerateUnsafeRowJoiner
UnsafeSortDataFormat & UnsafeShuffleSortDataFormat & PackedRecordPointer & UnsafeRow
UnsafeInMemorySorter & UnsafeExternalSorter & UnsafeShuffleWriter
Mostly Same Join Code, added if (isUnsafeMode)
UnsafeShuffleManager & UnsafeShuffleInMemorySorter & UnsafeShuffleExternalSorter
Details in SPARK-7075
39. Code Generation (Default in 1.5)
Problem
Generic expression evaluation
Expensive on JVM
Virtual func calls
Branches based on expression type
Boxing causes excessive object creation
Implementation
Defer source code generation to each operator, type, etc
Scala quasiquotes provide AST manipulation & rewriting
Generates source code, compiled to bytecode w/ Janino
100+ UDFs now using code gen
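The problem described above (virtual calls, type branches, boxing) can be seen by contrasting an interpreted expression tree with the flat code a generator might emit for the same expression. This Java sketch is illustrative only: the classes are hypothetical stand-ins, not Catalyst expressions, and the "generated" method is hand-written rather than compiled by Janino.

```java
final class ExprEvalSketch {

    // Generic evaluation: every node is a virtual call and every value is boxed.
    interface Expr {
        Object eval(Object[] row);
    }

    static final class Col implements Expr {
        final int ordinal;
        Col(int ordinal) { this.ordinal = ordinal; }
        public Object eval(Object[] row) { return row[ordinal]; }
    }

    static final class Lit implements Expr {
        final Object value;
        Lit(Object value) { this.value = value; }
        public Object eval(Object[] row) { return value; }
    }

    static final class Add implements Expr {
        final Expr left, right;
        Add(Expr left, Expr right) { this.left = left; this.right = right; }
        public Object eval(Object[] row) {
            // Unbox both sides, add, then box the result again.
            return (Integer) left.eval(row) + (Integer) right.eval(row);
        }
    }

    // Interpreted form of: col0 + col1 + 1
    static int interpreted(int col0, int col1) {
        Expr tree = new Add(new Add(new Col(0), new Col(1)), new Lit(1));
        return (Integer) tree.eval(new Object[] {col0, col1});
    }

    // What generated code for the same expression could look like:
    // primitives only, no virtual dispatch, no allocation.
    static int generated(int col0, int col1) {
        return col0 + col1 + 1;
    }

    public static void main(String[] args) {
        System.out.println(interpreted(2, 3) + " == " + generated(2, 3));
    }
}
```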
40. Code Generation: Spark SQL UDFs
100+ UDFs now using code gen – More to come in Spark 1.6!
Details in SPARK-8159
41. Project Tungsten in Other Spark Libraries
SortDataFormat<K, Buffer>: Base trait
UncompressedInBlockSort: MLlib.ALS
EdgeArraySortDataFormat: GraphX.Edge
42. Spark SQL: DataSources and Tuning
Understand Partitions, Pruning, Predicate Pushdowns
Understand DataFrames, Catalyst, DataSources
Create a DataSource Implementation
43. Partitions
Partition based on data usage patterns
/genders.parquet/gender=M/…
/genders.parquet/gender=F/… <-- Use case: access users by gender
/genders.parquet/gender=U/…
Partition Discovery (Read Path)
Infer partitions from organization of data (e.g. gender=F)
Dynamic Partitions (Write Path)
Dynamically create partitions based on given column(s)
SQL: INSERT INTO TABLE genders PARTITION (gender) SELECT …
DF: gendersDF.write.format("parquet").partitionBy("gender").save(…)
44. Pruning
Partition Pruning
Skip entire partitions whose partition-key value cannot match the filter
SELECT id, gender FROM genders WHERE gender = 'U' (gender = partition key)
Column Pruning
Filter out entire columns for all rows if not required
Optimized for columnar storage formats (Parquet)
Minimize data shuffle during joins
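Partition discovery and partition pruning can be sketched together in a few lines of Java. This is a simplified illustration, not Spark's implementation: the class name and the part-file names are hypothetical, and only equality on a single partition column is handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PartitionPruningSketch {

    // Partition discovery: read column=value pairs out of the directory names.
    static Map<String, String> partitionValues(String path) {
        Map<String, String> values = new LinkedHashMap<>();
        for (String segment : path.split("/")) {
            int eq = segment.indexOf('=');
            if (eq > 0) values.put(segment.substring(0, eq), segment.substring(eq + 1));
        }
        return values;
    }

    // Partition pruning: keep only files whose partition value matches the filter,
    // so non-matching directories are never opened.
    static List<String> prune(List<String> paths, String column, String wanted) {
        List<String> kept = new ArrayList<>();
        for (String path : paths) {
            if (wanted.equals(partitionValues(path).get(column))) kept.add(path);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
            "/genders.parquet/gender=M/part-0.parquet",  // hypothetical file names
            "/genders.parquet/gender=F/part-0.parquet",
            "/genders.parquet/gender=U/part-0.parquet");
        // Equivalent of: SELECT id, gender FROM genders WHERE gender = 'U'
        System.out.println(prune(files, "gender", "U"));
    }
}
```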
45. Predicate Pushdowns
“Predicate” == “Filter”
Filters rows as deep into the data source as possible
Predicate returns [true|false] for given func/condition
46. Putting It All Together
Reduce Columns: Column Pruning
Reduce Rows: Partitioning, Predicate Pushdown
SELECT b FROM table WHERE a IN (a2, a3)
47. DataFrames Overview
Inspired by R and Pandas DataFrames
Cross language support
SQL, Python, Scala, Java, R
Levels performance of Python, Scala, Java, and R
Generates JVM bytecode vs serialize/pickle to Python
DataFrame is Container for Logical Plan
Lazy transformations represented as tree
Catalyst Optimizer creates physical plan
DataFrame.rdd returns the underlying RDD if needed
Custom UDF using registerFunction()
New, experimental UDAF support
Use DataFrames instead of RDDs!!
48. Catalyst Optimizer
Optimize DataFrame Transformation Tree
Subquery elimination: use aliases to collapse subqueries
Constant folding: replace expression with constant
Simplify filters: remove unnecessary filters
Predicate/filter pushdowns: avoid unnecessary data load
Projection collapsing: avoid unnecessary projections
Create Custom Rules
Rules are Scala Case Classes
val newPlan = MyFilterRule(analyzedPlan)
Implements o.a.s.sql.catalyst.rules.Rule
Apply to any stage
JVM code generation
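Constant folding, one of the optimizations listed above, is easy to show on a toy expression tree. The Java sketch below is a stand-in for the idea of a tree-rewriting rule; the classes are hypothetical and deliberately much simpler than Catalyst's Scala case classes.

```java
final class ConstantFoldingSketch {

    static abstract class Expr {}

    static final class Literal extends Expr {
        final int value;
        Literal(int value) { this.value = value; }
    }

    static final class Column extends Expr {
        final String name;
        Column(String name) { this.name = name; }
    }

    static final class Add extends Expr {
        final Expr left, right;
        Add(Expr left, Expr right) { this.left = left; this.right = right; }
    }

    // Rule: rewrite bottom-up, replacing Add(Literal, Literal) with one Literal.
    static Expr fold(Expr e) {
        if (e instanceof Add) {
            Add add = (Add) e;
            Expr l = fold(add.left);
            Expr r = fold(add.right);
            if (l instanceof Literal && r instanceof Literal) {
                return new Literal(((Literal) l).value + ((Literal) r).value);
            }
            return new Add(l, r);
        }
        return e; // literals and columns are already in simplest form
    }

    static String show(Expr e) {
        if (e instanceof Literal) return String.valueOf(((Literal) e).value);
        if (e instanceof Column) return ((Column) e).name;
        Add add = (Add) e;
        return "(" + show(add.left) + " + " + show(add.right) + ")";
    }

    public static void main(String[] args) {
        Expr plan = new Add(new Column("x"), new Add(new Literal(1), new Literal(2)));
        System.out.println(show(plan) + "  =>  " + show(fold(plan)));
    }
}
```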
49. Columnar Storage Format
Skip whole chunks with min-max heuristics stored in each chunk (sorted data only)
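The min-max heuristic can be sketched as follows. This Java example is an assumption-laden illustration (hypothetical class names, in-memory int chunks, equality predicates only), not the Parquet or ORC reader logic. It also hints at why sorting matters: with unsorted data the chunk ranges overlap and fewer chunks can be skipped.

```java
import java.util.Arrays;
import java.util.List;

final class MinMaxSkipSketch {

    // A column chunk with min/max statistics computed when it is written.
    static final class Chunk {
        final int min, max;
        final int[] values;

        Chunk(int[] values) { // assumes a non-empty chunk
            this.values = values;
            int lo = values[0], hi = values[0];
            for (int v : values) {
                lo = Math.min(lo, v);
                hi = Math.max(hi, v);
            }
            this.min = lo;
            this.max = hi;
        }
    }

    // Returns {matchCount, chunksActuallyRead} for the predicate "value == target".
    static int[] scanEquals(List<Chunk> chunks, int target) {
        int matches = 0, chunksRead = 0;
        for (Chunk c : chunks) {
            if (target < c.min || target > c.max) continue; // skip using stats only
            chunksRead++;
            for (int v : c.values) if (v == target) matches++;
        }
        return new int[] {matches, chunksRead};
    }

    public static void main(String[] args) {
        List<Chunk> sorted = Arrays.asList(
            new Chunk(new int[] {1, 2, 3}), new Chunk(new int[] {4, 5, 6}), new Chunk(new int[] {7, 8, 9}));
        List<Chunk> unsorted = Arrays.asList(
            new Chunk(new int[] {1, 9, 5}), new Chunk(new int[] {2, 8, 4}), new Chunk(new int[] {3, 7, 6}));
        System.out.println("sorted:   " + Arrays.toString(scanEquals(sorted, 5)));
        System.out.println("unsorted: " + Arrays.toString(scanEquals(unsorted, 5)));
    }
}
```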
50. Parquet File Format
Based on Google Dremel
Implemented by Twitter and Cloudera
Columnar storage format
Optimized for fast columnar aggregations
Tight compression
Supports pushdowns
Nested, self-describing, evolving schema
51. Types of Compression
Run Length Encoding: Repeated data
Dictionary Encoding: Fixed set of values
Delta, Prefix Encoding: Sorted data
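Each of the encodings above fits in a few lines of Java. The sketch below uses hypothetical names and simple in-memory arrays; real columnar formats use bit-packed, hybrid variants of these ideas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class EncodingSketch {

    // Run-length encoding: flat array of (value, runLength) pairs. Good for repeated data.
    static int[] runLength(int[] values) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < values.length) {
            int j = i;
            while (j < values.length && values[j] == values[i]) j++;
            out.add(values[i]);
            out.add(j - i);
            i = j;
        }
        int[] result = new int[out.size()];
        for (int k = 0; k < result.length; k++) result[k] = out.get(k);
        return result;
    }

    // Dictionary encoding: replace each value by its index in a dictionary
    // built in first-seen order. Good for a small, fixed set of values.
    static int[] dictionaryEncode(String[] values, List<String> dictionary) {
        int[] codes = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            int code = dictionary.indexOf(values[i]);
            if (code < 0) {
                code = dictionary.size();
                dictionary.add(values[i]);
            }
            codes[i] = code;
        }
        return codes;
    }

    // Delta encoding: first value, then differences. Good for sorted data (small deltas).
    static int[] delta(int[] sorted) {
        int[] out = new int[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
            out[i] = (i == 0) ? sorted[0] : sorted[i] - sorted[i - 1];
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(runLength(new int[] {7, 7, 7, 2, 2, 9})));
        List<String> dict = new ArrayList<>();
        System.out.println(Arrays.toString(dictionaryEncode(new String[] {"M", "F", "M", "U"}, dict)) + " " + dict);
        System.out.println(Arrays.toString(delta(new int[] {10, 12, 15, 15})));
    }
}
```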
52. Query Plan Debugging
gendersCsvDF.select($"id", $"gender").filter("gender != 'F'").filter("gender != 'M'").explain(true)
DataFrame.queryExecution.logical
DataFrame.queryExecution.analyzed
DataFrame.queryExecution.optimizedPlan
DataFrame.queryExecution.executedPlan
53. Query Plan Visualization & Query Metrics
Effectiveness of Filter
CPU Cache Friendly Binary Format
Cost-based Join Optimization
Similar to MapReduce Map-side Join
Peak Memory for Joins and Aggs
54. Demo!
Show Various File Formats, Partitioning Schemes, DataSource Implementations, and Query Plans
Anonymous, Public Dating Dataset
RATINGS: UserID, ProfileID, Rating (1-10)
GENDERS: UserID, Gender (M, F, U)
55. DataSources API
Relations (o.a.s.sql.sources.interfaces.scala)
BaseRelation (abstract class): Provides schema of data
TableScan (impl): Read all data from source
PrunedFilteredScan (impl): Column pruning & predicate pushdowns
InsertableRelation (impl): Insert/overwrite data based on SaveMode
RelationProvider (trait/interface): Handle options, BaseRelation factory
Execution (o.a.s.sql.execution.commands.scala)
RunnableCommand (trait/interface): Common commands like EXPLAIN
ExplainCommand(impl: case class)
CacheTableCommand(impl: case class)
Filters (o.a.s.sql.sources.filters.scala)
Filter (abstract class): Handles all predicates/filters supported by this source
EqualTo (impl)
GreaterThan (impl)
StringStartsWith (impl)
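The contract behind PrunedFilteredScan is that the engine hands the source the columns it needs and the filters it would like applied, and the source returns only the matching, projected rows. The Java sketch below mimics that shape with in-memory rows. Everything in it is a hypothetical stand-in: it does not use Spark's classes, and its EqualTo is a simplified imitation of the filter of the same name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PrunedFilteredScanSketch {

    // Simplified stand-in for an equality filter pushed down to the source.
    static final class EqualTo {
        final String attribute;
        final Object value;
        EqualTo(String attribute, Object value) {
            this.attribute = attribute;
            this.value = value;
        }
    }

    // Apply the filters first (predicate pushdown), then keep only the
    // requested columns (column pruning).
    static List<Object[]> buildScan(List<Map<String, Object>> rows,
                                    String[] requiredColumns,
                                    EqualTo[] filters) {
        List<Object[]> out = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            boolean keep = true;
            for (EqualTo f : filters) {
                if (!f.value.equals(row.get(f.attribute))) {
                    keep = false;
                    break;
                }
            }
            if (!keep) continue;
            Object[] projected = new Object[requiredColumns.length];
            for (int i = 0; i < requiredColumns.length; i++) {
                projected[i] = row.get(requiredColumns[i]);
            }
            out.add(projected);
        }
        return out;
    }

    // Helper to build a row from alternating column names and values.
    static Map<String, Object> row(Object... kv) {
        Map<String, Object> m = new LinkedHashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) m.put((String) kv[i], kv[i + 1]);
        return m;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> genders = Arrays.asList(
            row("id", 1, "gender", "M"), row("id", 2, "gender", "U"), row("id", 3, "gender", "F"));
        // Equivalent of: SELECT id FROM genders WHERE gender = 'U'
        List<Object[]> result = buildScan(genders, new String[] {"id"},
            new EqualTo[] {new EqualTo("gender", "U")});
        for (Object[] r : result) System.out.println(Arrays.toString(r));
    }
}
```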
56. Native Spark SQL DataSources
57. JSON Data Source
DataFrame
val ratingsDF = sqlContext.read.format("json")
  .load("file:/root/pipeline/datasets/dating/ratings.json.bz2")
-- or --
val ratingsDF = sqlContext.read  // json() convenience method
  .json("file:/root/pipeline/datasets/dating/ratings.json.bz2")
SQL Code
CREATE TABLE genders USING json
OPTIONS
(path "file:/root/pipeline/datasets/dating/genders.json.bz2")
58. JDBC Data Source
Add Driver to Spark JVM System Classpath
$ export SPARK_CLASSPATH=<jdbc-driver.jar>
DataFrame
val jdbcConfig = Map("driver" -> "org.postgresql.Driver",
  "url" -> "jdbc:postgresql://hostname:port/database",
  "dbtable" -> "schema.tablename")
sqlContext.read.format("jdbc").options(jdbcConfig).load()
SQL
CREATE TABLE genders USING jdbc
OPTIONS (url, dbtable, driver, …)
59. Parquet Data Source
Configuration
spark.sql.parquet.filterPushdown=true
spark.sql.parquet.mergeSchema=true
spark.sql.parquet.cacheMetadata=true
spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]
DataFrames
val gendersDF = sqlContext.read.format("parquet")
.load("file:/root/pipeline/datasets/dating/genders.parquet")
gendersDF.write.format("parquet").partitionBy("gender")
.save("file:/root/pipeline/datasets/dating/genders.parquet")
SQL
CREATE TABLE genders USING parquet
OPTIONS
(path "file:/root/pipeline/datasets/dating/genders.parquet")
60. ORC Data Source
Configuration
spark.sql.orc.filterPushdown=true
DataFrames
val gendersDF = sqlContext.read.format("orc")
.load("file:/root/pipeline/datasets/dating/genders")
gendersDF.write.format("orc").partitionBy("gender")
.save("file:/root/pipeline/datasets/dating/genders")
SQL
CREATE TABLE genders USING orc
OPTIONS
(path "file:/root/pipeline/datasets/dating/genders")
61. Third-Party Spark SQL DataSources
62. CSV DataSource (Databricks)
Github
https://github.com/databricks/spark-csv
Maven
com.databricks:spark-csv_2.10:1.2.0
Code
val gendersCsvDF = sqlContext.read
  .format("com.databricks.spark.csv")
  .load("file:/root/pipeline/datasets/dating/gender.csv.bz2")
  .toDF("id", "gender")  // toDF() is required if CSV does not contain header
63. Avro DataSource (Databricks)
Github
https://github.com/databricks/spark-avro
Maven
com.databricks:spark-avro_2.10:2.0.1
Code
val df = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("file:/root/pipeline/datasets/dating/gender.avro")
ElasticSearch DataSource (Elastic.co)
Github
https://github.com/elastic/elasticsearch-hadoop
Maven
org.elasticsearch:elasticsearch-spark_2.10:2.1.0
Code
import org.apache.spark.sql.SaveMode

val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>",
  "es.port" -> "<port>")
df.write.format("org.elasticsearch.spark.sql").mode(SaveMode.Overwrite)
  .options(esConfig).save("<index>/<document-type>")
AWS Redshift Data Source (Databricks)
Github
https://github.com/databricks/spark-redshift
Maven
com.databricks:spark-redshift:0.5.0
Code
val df: DataFrame = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<hostname>:<port>/<database>…")
  .option("query", "select x, count(*) from my_table group by x")
  .option("tempdir", "s3n://tmpdir")
  .load(...)
Note: UNLOAD and copy to a tmp bucket in S3 enables parallel reads
Cassandra DataSource (DataStax)
Github
https://github.com/datastax/spark-cassandra-connector
Maven
com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1
Code
ratingsDF.write
  .format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Append)
  .options(Map("keyspace" -> "<keyspace>", "table" -> "<table>"))
  .save(…)
Cassandra Pushdown Support
spark-cassandra-connector/…/o.a.s.sql.cassandra.PredicatePushDown.scala
Pushdown Predicate Rules
1. Only push down no-partition key column predicates with =, >, <, >=, <= predicate
2. Only push down primary key column predicates with = or IN predicate.
3. If there are regular columns in the pushdown predicates, they should have
at least one EQ expression on an indexed column and no IN predicates.
4. All partition column predicates must be included in the predicates to be pushed down,
only the last part of the partition key can be an IN predicate. For each partition column,
only one predicate is allowed.
5. For cluster column predicates, only last predicate can be non-EQ predicate
including IN predicate, and preceding column predicates must be EQ predicates.
If there is only one cluster column predicate, the predicates could be any non-IN predicate.
6. There is no pushdown predicates if there is any OR condition or NOT IN condition.
7. We're not allowed to push down multiple predicates for the same column if any of them
is equality or IN predicate.
Rumor of New Cassandra DataSource
By-pass CQL front door used for transactional data
Bulk read/write directly from/to SSTables
Similar to existing Netflix Open Source project
https://github.com/Netflix/aegisthus
Promotes Cassandra to first-class Analytics Option
Potentially only part of DataStax Enterprise?!
Please mail a nasty letter to your local DataStax office
Creating a Custom Data Source
Study Existing Native and Third-Party Data Source Impls
Native: JDBC (o.a.s.sql.execution.datasources.jdbc)
class JDBCRelation extends BaseRelation
with PrunedFilteredScan
with InsertableRelation
Third-Party: Cassandra (o.a.s.sql.cassandra)
class CassandraSourceRelation extends BaseRelation
with PrunedFilteredScan
with InsertableRelation
<Insert Your Custom Data Source Here!>
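The relation shape above can be sketched without a Spark dependency. This is a simplified model, not Spark's API: `SimpleFilter`, `EqualTo`, `GreaterThan`, `PrunedFilteredSource`, and `InMemorySource` are hypothetical names chosen for illustration. They mirror the idea behind `PrunedFilteredScan.buildScan(requiredColumns, filters)`, where the source receives the projected columns and the pushed-down filters and returns only the matching, pruned rows.

```scala
// Simplified, hypothetical model of a pruned + filtered scan (no Spark dependency).
object CustomSourceSketch {
  type Record = Map[String, Any]

  // Hypothetical stand-ins for pushed-down filters.
  sealed trait SimpleFilter
  final case class EqualTo(column: String, value: Any) extends SimpleFilter
  final case class GreaterThan(column: String, value: Int) extends SimpleFilter

  // Mirrors the idea of buildScan(requiredColumns, filters).
  trait PrunedFilteredSource {
    def buildScan(requiredColumns: Seq[String], filters: Seq[SimpleFilter]): Seq[Seq[Any]]
  }

  final class InMemorySource(rows: Seq[Record]) extends PrunedFilteredSource {
    private def matches(row: Record, f: SimpleFilter): Boolean = f match {
      case EqualTo(c, v) => row.get(c).contains(v)
      case GreaterThan(c, v) => row.get(c).exists {
        case i: Int => i > v
        case _      => false
      }
    }

    def buildScan(requiredColumns: Seq[String], filters: Seq[SimpleFilter]): Seq[Seq[Any]] =
      rows
        .filter(row => filters.forall(f => matches(row, f))) // predicate pushdown
        .map(row => requiredColumns.map(c => row(c)))        // column pruning
  }

  // Hypothetical sample data.
  val genders = new InMemorySource(Seq[Record](
    Map("id" -> 1, "gender" -> "F"),
    Map("id" -> 2, "gender" -> "M"),
    Map("id" -> 3, "gender" -> "F")
  ))
}
```

For example, `CustomSourceSketch.genders.buildScan(Seq("id"), Seq(CustomSourceSketch.EqualTo("gender", "F")))` returns only the `id` values of the rows that pass the filter, so less data leaves the source.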
Cloudant DataSource (IBM)
Github
http://spark-packages.org/package/cloudant/spark-cloudant
Maven
com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1
Code
ratingsDF.write.format("com.cloudant.spark")
  .mode(SaveMode.Append)
  .options(Map("cloudant.host" -> "<account>.cloudant.com",
    "cloudant.username" -> "<username>",
    "cloudant.password" -> "<password>"))
  .save("<filename>")
DB2 and BigSQL DataSources (IBM)
Coming Soon!
Rumor of REST DataSource (Databricks)
Coming Soon?
Ask Michael Armbrust
Spark SQL Lead @ Databricks
Custom DataSource (Me and You All!)
Coming Right Now!
DEMO ALERT!!
Demo!
Create a Custom DataSource
Contributing a Custom Data Source
spark-packages.org
Managed by Databricks
Contains links to externally-managed GitHub projects
Ratings and comments
Requires supported Spark version for each package
Examples
https://github.com/databricks/spark-csv
https://github.com/databricks/spark-avro
https://github.com/databricks/spark-redshift
Spark Streaming: Scaling & Approximations
Understand Parallelism, Recovery, and Back Pressure
Describe Common Streaming Count Approximations
Direct Kafka Streaming
KafkaRDD partitions store relevant offsets
Each partition acts as a Receiver
Tasks/workers pull from Kafka in parallel
Partitions rebuild from Kafka using offsets
No Write Ahead Log (WAL) needed
Optimizes happy path by avoiding the WAL
At least once delivery guarantee
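The offset-based recovery described above can be sketched in plain Scala. This is a minimal model under stated assumptions: the `OffsetRange` case class here is a local stand-in for the per-partition offset metadata, and `log` is a hypothetical in-memory substitute for a Kafka broker. The point is that a partition is fully described by its offsets, so recomputing it is just a deterministic re-read.

```scala
object DirectStreamSketch {
  // Local stand-in for one partition's offset metadata.
  final case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

  // Hypothetical in-memory "broker": (topic, partition) -> ordered log of messages.
  val log: Map[(String, Int), Vector[String]] = Map(
    ("likes", 0) -> Vector("a", "b", "c", "d"),
    ("likes", 1) -> Vector("w", "x", "y", "z")
  )

  // Computing (or recomputing) a partition is a deterministic read of [fromOffset, untilOffset).
  def compute(range: OffsetRange): Vector[String] =
    log((range.topic, range.partition)).slice(range.fromOffset.toInt, range.untilOffset.toInt)
}
```

Because `compute` depends only on the stored offsets and the retained log, a lost partition can be rebuilt from the source instead of from a Write Ahead Log.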
Parallelism of Direct Kafka Streaming
Not-so-direct Kinesis Streaming
KinesisRDD partitions store relevant offsets
Single receiver required to see all data/offsets
Kinesis offsets not deterministic like Kafka
Partitions rebuild from Kinesis using offsets
No Write Ahead Log (WAL) needed
Optimizes happy path by avoiding the WAL
At least once delivery guarantee
Streaming Back Pressure
More than Throttling
Push back on the source
Requires buffered source (Kafka, Kinesis)
Based on fundamentals of Control Theory
Contributed by TypeSafe
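The control-theory idea can be illustrated with a proportional-only controller. This is a sketch, not Spark's implementation: `nextRate`, its parameters, and the default values are assumptions made for illustration. The controller compares the current ingestion rate with the rate at which the last batch was actually processed and moves the ingestion rate toward it.

```scala
object BackPressureSketch {
  // Proportional-only controller: move the ingestion rate toward the observed processing rate.
  // All names and default values here are illustrative assumptions.
  def nextRate(currentRate: Double, elements: Long, processingMillis: Long,
               proportional: Double = 1.0, minRate: Double = 100.0): Double = {
    val processingRate = elements.toDouble / processingMillis * 1000.0 // elements per second
    val error = currentRate - processingRate                           // > 0 means ingesting too fast
    math.max(minRate, currentRate - proportional * error)
  }
}
```

With `proportional = 1.0` the new rate equals the observed processing rate; smaller values react more gently. The `minRate` floor keeps the stream from stalling completely.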
HyperLogLog
Approximate cardinality
(approx count distinct)
Fixed, low memory
Tunable error percentage
Only 1.5 KB @ 2% error, 10^9 elements
Twitter’s Algebird
Streaming example in Spark codebase
Spark’s countApproxDistinctByKey()
http://research.neustar.biz/
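A minimal HyperLogLog can be sketched in a few lines of plain Scala. This is an illustrative version, not the Algebird or Spark implementation: the class name `HLL`, the default of 10 index bits, and the use of a 32-bit MurmurHash3 are assumptions. Each item is hashed, the first `p` bits select a register, and the register keeps the longest run of leading zeros seen in the remaining bits.

```scala
import scala.util.hashing.MurmurHash3

object HyperLogLogSketch {
  // p index bits -> m = 2^p registers; relative standard error is about 1.04 / sqrt(m).
  final class HLL(p: Int = 10) {
    private val m = 1 << p
    private val registers = new Array[Int](m)
    private val alpha = 0.7213 / (1.0 + 1.079 / m) // bias constant, valid for m >= 128

    def add(item: String): Unit = {
      val h = MurmurHash3.stringHash(item) // 32-bit hash
      val idx = h >>> (32 - p)             // first p bits pick a register
      val rest = h << p                    // remaining bits, left-aligned
      val rank = math.min(Integer.numberOfLeadingZeros(rest) + 1, 32 - p + 1)
      if (rank > registers(idx)) registers(idx) = rank
    }

    def estimate: Double = {
      val raw = alpha * m * m / registers.map(r => math.pow(2.0, -r)).sum
      val zeros = registers.count(_ == 0)
      // Small-range correction (linear counting) when many registers are still empty.
      if (raw <= 2.5 * m && zeros > 0) m * math.log(m.toDouble / zeros) else raw
    }
  }
}
```

Memory stays fixed at `m` small registers no matter how many items are added, and adding the same item twice never changes the estimate. Raising `p` lowers the error at the cost of more registers.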
Count Min Sketch
Approximate counters
Better than HashMap
Low, fixed memory
Known error bounds
Large num of counters
From Twitter Algebird
Streaming example in Spark codebase
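The counter structure can be sketched in plain Scala. This is a simplified illustration, not the Algebird implementation: the class name `CMS`, the default `depth` and `width`, and the seeded MurmurHash3 hashing are assumptions. Every item increments one counter per row, and the estimate is the minimum of those counters.

```scala
import scala.util.hashing.MurmurHash3

object CountMinSketchExample {
  final class CMS(depth: Int = 4, width: Int = 1024) {
    private val table = Array.ofDim[Long](depth, width)

    // One hash function per row, derived by seeding MurmurHash3 with the row index.
    private def bucket(item: String, row: Int): Int = {
      val h = MurmurHash3.stringHash(item, row)
      ((h % width) + width) % width // non-negative column index
    }

    def add(item: String, count: Long = 1L): Unit =
      for (row <- 0 until depth) table(row)(bucket(item, row)) += count

    // Collisions can only inflate a counter, so the minimum across rows is the tightest estimate.
    def estimate(item: String): Long =
      (0 until depth).map(row => table(row)(bucket(item, row))).min
  }
}
```

Memory is fixed at `depth * width` counters regardless of how many distinct items arrive. Estimates never undercount; they can only overcount when items collide in every row.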
Monte Carlo Simulations
From Manhattan Project (A-bomb)
Simulate movement of neutrons
Law of Large Numbers (LLN)
Average of results of many trials
Converge on expected value
SparkPi example in Spark codebase
Pi ~ 4 * (# red dots / # total dots)
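The Pi estimate can be reproduced locally in plain Scala. This is a sketch in the spirit of the SparkPi example rather than a copy of it: the object name, the fixed seed, and the use of a quarter circle inside the unit square are assumptions made to keep it short and repeatable.

```scala
import scala.util.Random

object MonteCarloPi {
  // Throw `trials` random darts at the unit square; the fraction landing inside the
  // quarter circle of radius 1 approaches Pi / 4 as the number of trials grows (LLN).
  def estimate(trials: Int, seed: Long = 42L): Double = {
    val rng = new Random(seed)
    val inside = (1 to trials).count { _ =>
      val x = rng.nextDouble()
      val y = rng.nextDouble()
      x * x + y * y <= 1.0
    }
    4.0 * inside / trials
  }
}
```

Each dart is independent, which is why the same computation parallelizes trivially: partitions count their own hits and the counts are summed.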
Spark ML: High Scale Machine Learning
Define Similarity and Dimension Reduction
Describe Sampling and Bucketing
Generate 10 Recommendations
Live, Interactive Demo!
sparkafterdark.com
Audience Participation Needed!!
[Figure: "You are here" marker]
Audience Instructions
Navigate to sparkafterdark.com
Click 3 actresses and 3 actors
Wait for us to analyze together!
Note: This is totally anonymous!!
Project Links
https://github.com/fluxcapacitor/pipeline
https://hub.docker.com/r/fluxcapacitor
Similarity
Types of Similarity
Euclidean: linear measure
Magnitude bias
Cosine: angle measure
Adjust for magnitude bias
Jaccard: (intersection / union)
Popularity bias
Log Likelihood
Adjust for popularity bias
Example matrix (column positions of the 1s were lost in extraction):
Ali Matei Reynold Patrick Andy
Kimberly 1 1 1 1
Leslie 1 1
Meredith 1 1 1
Lisa 1 1 1
Holden 1 1 1 1 1
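Three of the measures listed above can be written directly in plain Scala. This is a minimal sketch; the function names are illustrative and log likelihood is left out for brevity. Vectors are assumed to have equal length.

```scala
object SimilaritySketch {
  // Euclidean: straight-line distance, sensitive to magnitude.
  def euclidean(a: Seq[Double], b: Seq[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Cosine: angle between vectors, so scaling a vector does not change the result.
  def cosine(a: Seq[Double], b: Seq[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norm == 0.0) 0.0 else dot / norm
  }

  // Jaccard: size of the intersection over size of the union.
  def jaccard[T](a: Set[T], b: Set[T]): Double =
    if (a.isEmpty && b.isEmpty) 0.0 else (a intersect b).size.toDouble / (a union b).size
}
```

Comparing `Seq(1.0, 0.0)` with `Seq(2.0, 0.0)` shows the magnitude bias: the Euclidean distance is 1.0 even though the vectors point the same way, while the cosine similarity is 1.0.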
All-Pairs Similarity Comparison
Compare everything to everything
aka. “pair-wise similarity” or “similarity join”
Naïve shuffle: O(m*n^2); m=rows, n=cols
Minimize shuffle through approximations!
Reduce m (rows)
Sampling and bucketing
Reduce n (cols)
Remove most frequent value (i.e., 0)
Principal Component Analysis
Dimension reduction!!
Dimension Reduction
Sampling and Bucketing
Reduce m: DIMSUM Sampling
“Dimension Independent Matrix Square Using MR”
Remove rows with low similarity probability
MLlib: RowMatrix.columnSimilarities(…)
Twitter: 40% efficiency gain over Cosine Similarity
Reduce m: LSH Bucketing
“Locality Sensitive Hashing”
Split m into b buckets
Use similarity hash algorithm
Requires pre-processing of data
Compare bucket contents in parallel
Converts O(m*n^2) -> O(m*(n/b)*b^2);
m=rows, n=cols, b=buckets
i.e., 500k x 500k matrix
O(1.25e17) -> O(1.25e13); b=50
github.com/mrsqueeze/spark-hash
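The bucketing step can be sketched with MinHash signatures and banding in plain Scala. This is an illustrative version and is not taken from spark-hash: the function names, the choice of 8 hash functions, and 2 rows per band are assumptions. Input sets are assumed to be non-empty.

```scala
import scala.util.hashing.MurmurHash3

object LshSketch {
  // MinHash signature: for each seeded hash function, keep the minimum hash over the set.
  def signature(items: Set[String], numHashes: Int = 8): Vector[Int] =
    Vector.tabulate(numHashes)(seed => items.map(MurmurHash3.stringHash(_, seed)).min)

  // Split the signature into bands; rows sharing any band key land in the same bucket.
  def bandKeys(sig: Vector[Int], rowsPerBand: Int = 2): Vector[(Int, Vector[Int])] =
    sig.grouped(rowsPerBand).zipWithIndex.map { case (band, i) => (i, band) }.toVector

  // Candidate pairs: only rows that collide in at least one band are compared.
  def candidatePairs(rows: Map[String, Set[String]]): Set[(String, String)] = {
    val buckets = rows.toSeq
      .flatMap { case (id, items) => bandKeys(signature(items)).map(key => key -> id) }
      .groupBy(_._1)
      .values
      .map(_.map(_._2).sorted)
    buckets.flatMap(ids => ids.combinations(2).map { case Seq(a, b) => (a, b) }).toSet
  }
}
```

Identical sets always produce identical signatures and therefore always become candidates. Dissimilar sets rarely share a band, so most of the all-pairs comparisons are skipped, and each bucket can be processed in parallel.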
Reduce n: Remove Most Frequent Value
Eliminate most-frequent value
Represent other values with (index,value) pairs
Converts O(m*n^2) -> O(m*nnz^2);
nnz=num nonzeros, nnz << n
Note: Choose most frequent value (may not be 0)
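The (index,value) representation can be sketched in plain Scala. This is a minimal illustration: `Sparse`, `toSparse`, and `toDense` are hypothetical names, and the input is assumed to be non-empty. The most frequent value is detected rather than assumed to be 0.

```scala
object SparseSketch {
  // Keep only entries that differ from the most frequent value, as (index, value) pairs.
  final case class Sparse(size: Int, default: Double, entries: Seq[(Int, Double)])

  def toSparse(dense: Seq[Double]): Sparse = {
    val default = dense.groupBy(identity).maxBy(_._2.size)._1
    val entries = dense.zipWithIndex.collect { case (v, i) if v != default => (i, v) }
    Sparse(dense.size, default, entries)
  }

  // Rebuild the dense row to show that no information was lost.
  def toDense(s: Sparse): Seq[Double] = {
    val overrides = s.entries.toMap
    (0 until s.size).map(i => overrides.getOrElse(i, s.default))
  }
}
```

For the row `Seq(1.0, 1.0, 0.0, 1.0, 5.0)` the most frequent value is 1.0, so only two entries are stored. Pairwise comparisons then iterate over the stored entries instead of over all n columns.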
Recommendations
Summary Statistics and Historical Analysis
Collaborative Filtering and Clustering
Text Featurization and NLP
Types of Recommendations
Non-personalized
No preference or behavior data for user, yet
aka “Cold Start Problem”
Personalized
User-Item Similarity
Items that others with similar prefs have liked
Item-Item Similarity
Items similar to your previously-liked items
Recommendation Terminology
User
User seeking recommendations
Item
Item that has been liked or rated
Feedback
Explicit: like, rating
Implicit: search, click, hover, view, scroll
Feature Engineering
Dimension reduction
Non-Personalized Recommendations
Use Aggregate Data to Generate Recommendations
Top Users by Like Count
“I might like users who have the most-likes overall
based on historical data.”
SparkSQL, DataFrames: Summary Stat, Aggs
Top Influencers by Like Graph
“I might like the most-influential users in overall like graph.”
GraphX: PageRank
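The ranking idea can be sketched as a small iterative PageRank in plain Scala. This is not the GraphX implementation: the function name, the fixed iteration count, and the non-normalized update `(1 - damping) + damping * received` are assumptions chosen for brevity. Each edge is a (liker, liked) pair.

```scala
object PageRankSketch {
  // edges: (liker, liked). Returns a rank per user after a fixed number of iterations.
  def pageRank(edges: Seq[(String, String)], iterations: Int = 20,
               damping: Double = 0.85): Map[String, Double] = {
    val nodes = edges.flatMap { case (src, dst) => Seq(src, dst) }.distinct
    val outLinks = edges.groupBy(_._1).map { case (src, es) => src -> es.map(_._2) }
    var ranks = nodes.map(n => n -> 1.0).toMap
    for (_ <- 1 to iterations) {
      // Every user splits their current rank evenly across the users they like.
      val contribs = for {
        (src, dsts) <- outLinks.toSeq
        dst <- dsts
      } yield (dst, ranks(src) / dsts.size)
      val received = contribs.groupBy(_._1).map { case (n, cs) => n -> cs.map(_._2).sum }
      ranks = nodes.map(n => n -> ((1 - damping) + damping * received.getOrElse(n, 0.0))).toMap
    }
    ranks
  }
}
```

A user liked by many others, or by a few highly ranked others, ends up with a high rank, which is the "most influential" notion used for the non-personalized recommendation.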
Demo!
Generate Recommendations using Summary Stats & PageRank
Personalized Recommendations
Use Similarity to Generate Personalized Recommendations
Like Behavior of Similar Users
“I like the same people that you like.
What other people did you like that I haven’t seen?”
MLlib: Matrix Factorization, User-Item Similarity
Demo!
Generate Recommendations using
Collaborative Filtering and Matrix Factorization
Similar Text-based Profiles as Me
“Our profiles have similar keywords and named entities.
We might like each other!”
MLlib: Word2Vec, TF/IDF, k-skip n-grams
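The TF/IDF step can be sketched in plain Scala. This is a textbook variant using tf * log(N / df), not MLlib's implementation, and the function names and the simple tokenizer are assumptions. Words that appear in every profile get a weight of zero, so only distinguishing keywords contribute to profile similarity.

```scala
object TfIdfSketch {
  // Lowercase and split on non-word characters.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  // Returns one term -> weight map per document, using tf * log(N / df).
  def tfIdf(docs: Seq[String]): Seq[Map[String, Double]] = {
    val tokenized = docs.map(tokenize)
    val docFreq = tokenized.flatMap(_.distinct).groupBy(identity).map { case (t, ts) => t -> ts.size }
    tokenized.map { tokens =>
      tokens.groupBy(identity).map { case (t, ts) =>
        t -> ts.size * math.log(docs.size.toDouble / docFreq(t))
      }
    }
  }
}
```

The resulting weight maps can be treated as sparse vectors and compared with cosine similarity to find profiles with similar keywords.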
Similar Profiles to Previous Likes
“Your profile text has similar keywords and named entities to
other profiles of people I like. I might like you, too!”
MLlib: Word2Vec, TF/IDF, Doc Similarity
Relevant, High-Value Emails
“Your initial email references a lot of things in my profile.
I might like you for making the effort!”
MLlib: Word2Vec, TF/IDF, Entity Recognition
[Figure labels: Her Email, My Profile]
The Future of Recommendations
Eigenfaces: Facial Recognition
“Your face looks similar to others that I’ve liked.
I might like you.”
MLlib: RowMatrix, PCA, Item-Item Similarity
Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
NLP Conversation Starter Bot!
“If your responses to my generic opening
lines are positive, I may read your profile.”
MLlib: TF/IDF, DecisionTrees,
Sentiment Analysis
[Figure labels: Positive, Negative]
Maintaining the Spark
Recommendations for Couples
“I want Mad Max. You want Message In a Bottle.
Let’s find something in between to watch tonight.”
MLlib: RowMatrix, Item-Item Similarity
GraphX: Nearest Neighbors, Shortest Path
[Figure labels: similar plots, similar actors]
Final Recommendation!
Get Off the Computer & Meet People!
Thank you, Paris!!
Chris Fregly @cfregly
IBM Spark Technology Center
San Francisco, CA, USA
Relevant Links
advancedspark.com
Signup for the book & global meetup!
github.com/fluxcapacitor/pipeline
Clone, contribute, and commit code!
hub.docker.com/r/fluxcapacitor/pipeline
Run all demos in your own environment with Docker!
More Relevant Links
http://meetup.com/Advanced-Apache-Spark-Meetup
http://advancedspark.com
http://github.com/fluxcapacitor/pipeline
http://hub.docker.com/r/fluxcapacitor/pipeline
https://www.safaribooksonline.com/library/view/java-performance-the/9781449363512/ch04.html
http://web.eece.maine.edu/~vweaver/projects/perf_events/perf_event_open.html
http://www.brendangregg.com/perf.html
https://perf.wiki.kernel.org/index.php/Tutorial
http://techblog.netflix.com/2015/07/java-in-flames.html
http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Java
http://sortbenchmark.org/ApacheSpark2014.pdf
https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
http://0x0fff.com/spark-architecture-shuffle/
http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
http://stackoverflow.com/questions/763262/how-does-one-write-code-that-best-utilizes-the-cpu-cache-to-improve-performance
http://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf
http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/
http://docs.scala-lang.org/overviews/quasiquotes/intro.html
http://lwn.net/Articles/252125/ <-- Memory Part 2: CPU Caches
http://lwn.net/Articles/255364/ <-- Memory Part 5: What Programmers Can Do
What’s Next?
What’s Next?
Autoscaling Spark Workers
Completely Docker-based
Docker Compose and Docker Machine
Lots of Demos and Examples!
Zeppelin & IPython/Jupyter notebooks
Advanced streaming use cases
Advanced ML, Graph, and NLP use cases
Performance Tuning and Profiling
Work closely with Brendan Gregg & Netflix
Surface & share more low-level details of Spark internals
Upcoming Meetups and Conferences
London Spark Meetup (Oct 12th)
Scotland Data Science Meetup (Oct 13th)
Dublin Spark Meetup (Oct 15th)
Barcelona Spark Meetup (Oct 20th)
Madrid Spark/Big Data Meetup (Oct 22nd)
Paris Spark Meetup (Oct 26th)
Amsterdam Spark Summit & Meetup (Oct 27th)
Delft Dutch Data Science Meetup (Oct 29th)
Brussels Spark Meetup (Oct 30th)
Zurich Big Data Developers Meetup (Nov 2nd)
Geneva Spark Meetup (Nov 5th)
San Francisco Datapalooza (Nov 10th)
San Francisco Advanced Apache Spark Meetup (Nov 12th)
Oslo Big Data Hadoop Meetup (Nov 18th)
Helsinki Spark Meetup (Nov 20th)
Stockholm Spark Meetup (Nov 23rd)
Copenhagen Spark Meetup (Nov 25th)
Budapest Spark Meetup (Nov 27th)
Singapore Strata Conference (Dec 1st)
San Francisco Advanced Apache Spark Meetup (Dec 8th)
Mountain View Advanced Apache Spark Meetup (Dec 10th)
Washington DC Advanced Apache Spark Meetup (Dec 17th)
Freg-a-palooza!