4. • Started as a research project at UC Berkeley in 2009
• 600,000 lines of code (75% Scala)
• Latest release: Spark 1.6 (December 2015)
• Next release: Spark 2.0
• Open Source License (Apache 2.0)
• Built by 1000+ developers from 200+ companies
5. Apache Spark: Flexible and Unified
[Diagram: data sources (e.g. JSON) and live data feeding Spark Core, with the Spark SQL, Spark Streaming, MLlib, and GraphX libraries and the DataFrames and ML Pipelines APIs]
6.
7. RDDs vs. DataFrames / SQL
RDDs: Collections of Native JVM Objects
• Compile-time type-safety
• Easy to express certain types of logic
• Lots of existing code + users
• Lower level control of Spark
• Imperative
DataFrames / SQL: Structured Binary Data (Tungsten)
• Lower memory pressure (gc & space)
• Memory accounting (avoids OOMs)
• Faster sorting / hashing / serialization
• More opportunities for automatic optimization
• Declarative
10. Why Spark ML
Provide general purpose ML algorithms on top of Spark
• Let Spark handle the distribution of data and queries; scalability
• Leverage its improvements (e.g. DataFrames, Datasets, Tungsten)
Advantages of MLlib’s Design:
• Simplicity
• Scalability
• Streamlined end-to-end
• Compatibility
11. High-level functionality in MLlib
Learning tasks
• Classification
• Regression
• Recommendation
• Clustering
• Frequent itemsets
Workflow utilities
• Model import/export
• Pipelines
• DataFrames
• Cross validation
Data utilities
• Feature extraction & selection
• Statistics
• Linear algebra
12. ML Workflows are complex
[Diagram: workflow with several data sources (Datasource 1, Datasource 2), feature extraction, feature transforms 1 to 3, two trained models (Train model 1, Train model 2), an ensemble step, and evaluation]
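The workflow above chains feature extraction, feature transforms, model training, an ensemble step, and evaluation. As a rough plain-Python sketch (not Spark's Pipeline API; every function name and the toy data are hypothetical), each stage can be treated as a function and the ensemble as an average of two models' predictions:

```python
from statistics import mean

def extract_features(record):
    # Hypothetical extraction: turn a raw (x, y) record into a feature list.
    x, _ = record
    return [float(x)]

def scale(features, factor=0.1):
    # Hypothetical feature transform: rescale every feature.
    return [f * factor for f in features]

def train_mean_model(rows, labels):
    # "Model 1": always predicts the mean training label.
    avg = mean(labels)
    return lambda features: avg

def train_slope_model(rows, labels):
    # "Model 2": least-squares line through the origin on the first feature.
    num = sum(r[0] * y for r, y in zip(rows, labels))
    den = sum(r[0] ** 2 for r in rows)
    slope = num / den
    return lambda features: slope * features[0]

def ensemble(models):
    # Ensemble step: average the predictions of all models.
    return lambda features: mean(m(features) for m in models)

def mean_squared_error(model, rows, labels):
    # Evaluation step: average squared prediction error.
    return mean((model(r) - y) ** 2 for r, y in zip(rows, labels))

raw = [(10, 2.0), (20, 4.0), (30, 6.0), (40, 8.0)]  # toy data: y = 0.2 * x
rows = [scale(extract_features(r)) for r in raw]
labels = [y for _, y in raw]

model_1 = train_mean_model(rows, labels)
model_2 = train_slope_model(rows, labels)
combined = ensemble([model_1, model_2])
errors = {name: mean_squared_error(m, rows, labels)
          for name, m in [("model_1", model_1),
                          ("model_2", model_2),
                          ("ensemble", combined)]}
```

On this toy data the ensemble is not better than model 2 alone, which is exactly the kind of thing the evaluation step is there to reveal.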
13. Iterate on Your Models
[Diagram: iteration cycle of Analyze Data, Feature Engineering, Train, Tune, Test]
• Databricks Notebooks, Jupyter, or Zeppelin are great for iterative model development using the REPL
• Spark provides fast, scalable infrastructure so you don’t have to wait for your results
• Subsample during the early model development phase, but when in doubt use more data
• Better feature engineering can produce as good or better results than tuning the algorithm
14. The Advanced Analytics Gap
ADVANCED ANALYTICS SOLUTIONS: Anomaly Detection, Predictive Analytics, Next-Gen Product R&D
SILOED, UNSTRUCTURED, FAST-GROWING DATA: Hadoop / Data Lakes, Cloud Storage, Data Warehouses
15. Just-in-Time Data Platform
Powered by Apache Spark
• Your storage: data warehouses, Hadoop / data lakes, cloud storage
• Orchestrated Spark in the cloud
• Integrated workspace: dashboards (reports), notebooks (GitHub, viz, collaboration)
• Enterprise security: access control, auditing, encryption
• Management: scalability, resilience, multi-tenancy
• Interfaces: BI tools & RESTful APIs
• Data integration: universal access without centralization
• Managed services
• Open source
• Production jobs
• BI tools
• Your custom Spark apps
17. Media & Entertainment Use Cases
Use cases: Content Personalization, Churn & Cohort Analysis, Social Network Graph, Sentiment Analysis
Secure Managed Spark Platform
• ETL | Data Cleansing
• JIT Data Warehouse
• Advanced Analytics: Machine Learning / Graph Analysis
Data sources: Pixel Data, Social Media, Nielsen Rating, Image Data, Video Stream, Viewing Data, Survey Data, Wearable Data, CRM Data, Transactional
18. Content Personalization: Recommendations
• Broad Application
– Movie Streaming, Matching Sites, Mobile App, Music
• Key Trends
– Continuous application: near-real-time interactivity
– Recommend the products best suited to a user’s preferences to maximize revenue
– Provide a tailored and personalized view of pertinent data for each individual you serve
[Diagram components: events (Rating, Play, Browse, ..), Media Channels (devices, ..), Event Distribution, Content Serving, Recommendation.., Online Learning, Offline Learning, Social / Behavioral Feed, Dashboard, Content Repo, Event Analytics]
19. Content Personalization: Recommendations
• Key Considerations
– Cold start: content-based, similarity index
– Rating-based: user-item (ALS), item-item (K-Means)
– Social graph: PageRank of top influencers (Spark GraphFrames)
– Continuous application: Spark Streaming, real-time model, model serving
[Diagram: Continuous Application. Components: Input Stream (Movie Rating, Social Feed), Structured Streaming, Real-time ML, Model Updates, Off-line ML, Model Serving System, Interactive Dashboard, Storage (Realtime & Batch)]
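For the cold-start case, a content-based similarity index can be as simple as cosine similarity between item feature vectors. A minimal plain-Python sketch, assuming hypothetical genre-weight vectors (the titles and numbers are made up for illustration); a Spark job would compute the same quantity in a distributed fashion:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical content features: [drama, comedy, action] weights per title.
items = {
    "title_a": [1.0, 0.0, 0.0],
    "title_b": [0.9, 0.1, 0.0],
    "title_c": [0.0, 0.2, 1.0],
}

def most_similar(target, catalog):
    # Rank every other item by similarity to the target item.
    scores = [(other, cosine(catalog[target], vec))
              for other, vec in catalog.items() if other != target]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = most_similar("title_a", items)
```

Because the score depends only on item content, a brand-new title with no ratings can still be placed next to its nearest neighbors.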
21. Structured Streaming
Streaming Version
# Read JSON continuously from S3
logsDF = spark.readStream.json("s3://logs")
# Transform with DataFrame API and save
# (in practice the streaming source needs a schema and the sink a checkpointLocation)
(logsDF.select("user", "url", "date")
    .writeStream.format("parquet")
    .start("s3://out"))
Batch Version
# Read JSON once from S3
logsDF = spark.read.json("s3://logs")
# Transform with DataFrame API and save
(logsDF.select("user", "url", "date")
    .write.parquet("s3://out"))
22. Social Media Analytics: GraphFrames
• Application
– Influencers in a Social Graph (PageRank), Distance Measure, Co-occurrence, Clustering
• Key Trends
• Marketing Campaigns
• Finding Friends, Jobs, …
GraphFrames are built on top of Spark DataFrames: vertices and edges are represented as DataFrames, allowing us to store arbitrary data with each vertex and edge.
Shortest Path: How fast communication propagates.
Label Propagation Algorithm (LPA): Detect communities in a graph.
Motif finding: Search for structural patterns in a graph.
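To make the influencer ranking concrete, here is a plain-Python power-iteration sketch of PageRank on a tiny hypothetical follower graph. It illustrates the idea only; it is not the GraphFrames implementation, and the damping factor of 0.85 and the iteration count are assumed values.

```python
def pagerank(edges, damping=0.85, iterations=50):
    # edges: list of (src, dst) pairs meaning "src follows dst".
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # Nodes with no outgoing edges spread their rank evenly.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank

# Hypothetical graph: everyone follows "alice"; "alice" follows "bob".
follows = [("bob", "alice"), ("carol", "alice"),
           ("dave", "alice"), ("alice", "bob")]
ranks = pagerank(follows)
top_influencer = max(ranks, key=ranks.get)
```

The ranks always sum to 1, so they can be read as the share of attention each account receives in this toy graph.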
23. Viewership Prediction: Topic Modeling
• Application
– Programming Decisions (e.g. House of Cards)
• Key Trends
– Consumer sentiment in real time, augmenting Nielsen rating data with transcript analytics
– Netflix: meta-tags with information about millions of plays per day to determine what will be a hit, what viewers like, and what keeps them watching
Text Tokenization
Remove Stopwords
Vector of Token Counts
Create LDA model with Online Variational Bayes
Review Topics
Model Tuning: Create LDA model with Expectation Maximization
Visualize Results
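The first three steps above (tokenize, remove stopwords, build a vector of token counts) produce the input an LDA model is trained on. A minimal plain-Python sketch, assuming a tiny hypothetical stopword list and two made-up text snippets; it does not train an LDA model:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "is", "and", "of", "in"}  # hypothetical, tiny list

def tokenize(text):
    # Lowercase and split on anything that is not a letter or apostrophe.
    return [t for t in re.split(r"[^a-z']+", text.lower()) if t]

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def count_vectors(docs):
    # Build a shared, sorted vocabulary, then one count vector per document.
    token_lists = [remove_stopwords(tokenize(d)) for d in docs]
    vocab = sorted({t for tokens in token_lists for t in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

docs = ["The president is in the White House",
        "A house of cards and a house of lies"]
vocab, vectors = count_vectors(docs)
```

Each document becomes a fixed-length vector over the shared vocabulary, which is the shape topic models expect as input.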
24. Sentiment Analysis: Logistic Regression
• Application
– Not just Twitter: any type of text, e.g. viewers’ comments on movie streams
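As a rough illustration of sentiment classification with logistic regression, the sketch below trains a bag-of-words model with plain gradient descent on a handful of made-up comments. The data, learning rate, and epoch count are all assumptions for the example; a real job would use a library implementation on far more data.

```python
import math

def featurize(text, vocab):
    # Binary bag-of-words vector over a fixed vocabulary.
    words = set(text.lower().split())
    return [1.0 if term in words else 0.0 for term in vocab]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(rows, labels, lr=0.5, epochs=200):
    # Batch gradient descent on the logistic (log) loss.
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(rows, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(score) - y
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def predict_proba(text, vocab, weights, bias):
    # Probability that the text is positive.
    x = featurize(text, vocab)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Hypothetical labeled comments: 1 = positive, 0 = negative.
comments = [("loved this show", 1), ("great acting", 1),
            ("boring plot", 0), ("hated the ending", 0)]
vocab = sorted({w for text, _ in comments for w in text.lower().split()})
rows = [featurize(text, vocab) for text, _ in comments]
labels = [label for _, label in comments]
weights, bias = train(rows, labels)

p_pos = predict_proba("great show", vocab, weights, bias)
p_neg = predict_proba("boring ending", vocab, weights, bias)
```

Words unseen during training contribute nothing to the score, so the model falls back on the bias for text made up entirely of unknown words.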