1. A look ahead at Spark 2.0
Reynold Xin @rxin
2016-03-30, Strata Conference
2. About Databricks
Founded by creators of Spark in 2013
Cloud enterprise data platform
- Managed Spark clusters
- Interactive data science
- Production pipelines
- Data governance, security, …
3. Today’s Talk
Looking back at the last 12 months
Looking forward to Spark 2.0
• Project Tungsten, Phase 2
• Structured Streaming
• Unifying DataFrame & Dataset
Best resource for learning Spark
6. What is Spark?
Unified engine across data workloads and platforms
SQL, Streaming, ML, Graph, Batch, …
7. 2015: A Great Year for Spark
Most active open source project in (big) data
• 1000+ code contributors
New language: R
Widespread industry support & adoption
8. “Spark is the Taylor Swift
of big data software.”
- Derrick Harris, Fortune
12. Diverse Runtime Environments
How respondents are running Spark: 51% on a public cloud
Most common Spark deployment environments (cluster managers):
• Standalone mode: 48%
• YARN: 40%
• Mesos: 11%
13. Spark 2.0
Next major release, coming in May
Builds on all we learned in the past 2 years
14. Versioning in Spark
In reality, we hate breaking APIs!
Will not do so except for dependency conflicts (e.g. Guava) and experimental APIs
1.6.0 = major.minor.patch:
• Major version (may change APIs)
• Minor version (adds APIs/features)
• Patch version (only bug fixes)
15. Major Features in 2.0
Tungsten Phase 2: speedups of 5-10x
Structured Streaming: real-time engine on SQL/DataFrames
Unifying Datasets and DataFrames
17. Datasets and DataFrames
In 2015, we added DataFrames & Datasets as structured data APIs
• DataFrames are collections of rows with a schema
• Datasets add static types, e.g. Dataset[Person]
• Both run on Tungsten
Spark 2.0 will merge these APIs: DataFrame = Dataset[Row]
18. Example
case class User(name: String, id: Int)
case class Message(user: User, text: String)

import sqlContext.implicits._   // encoders for the case classes

val dataframe = sqlContext.read.json("log.json")   // DataFrame, i.e. Dataset[Row]
val messages = dataframe.as[Message]               // Dataset[Message]
val users = messages
  .filter(m => m.text.contains("Spark"))
  .map(m => m.user)                                // Dataset[User]

pipeline.train(users)   // MLlib takes either DataFrames or Datasets
19. Benefits
Simpler to understand
• Only kept Dataset separate to preserve binary compatibility in 1.x
Libraries can take data of both forms
With Streaming, same API will also work on streams
20. Long-Term
RDD will remain the low-level API in Spark
Datasets & DataFrames give richer semantics and optimizations
• New libraries will increasingly use these as interchange format
• Examples: Structured Streaming, MLlib, GraphFrames
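A minimal sketch, not from the deck, of how the two layers interoperate, assuming a Spark 2.0 SparkSession named spark and the User case class from the example above:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Dataset
import spark.implicits._   // encoders for case classes

// Build a typed Dataset on the high-level API.
val users: Dataset[User] = Seq(User("matei", 1), User("rxin", 2)).toDS()

// Drop down to the low-level RDD API when fine-grained control is needed.
val rdd: RDD[User] = users.rdd

// Come back up to regain Tungsten's optimizations and richer semantics.
val usersAgain: Dataset[User] = spark.createDataset(rdd)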
23. Complex Programming Models
• Data: late arrival, varying distribution over time, …
• Processing: business logic changes & new ops (windows, sessions)
• Output: how do we define output over time & correctness?
24. The simplest way to perform streaming analytics
is not having to reason about streaming.
26. Structured Streaming
High-level streaming API built on the Spark SQL engine
• Runs the same queries on DataFrames
• Event time, windowing, sessions, sources & sinks
Unifies streaming, interactive and batch queries
• Aggregate data in a stream, then serve using JDBC
• Change queries at runtime
• Build and apply ML models
See Michael's and TD's talks tomorrow for a deep dive!
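A minimal sketch of the API roughly as it landed in Spark 2.0 (the input path, schema source, and column names here are assumptions, not from the talk):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()
import spark.implicits._

// Treat a directory of JSON logs as an unbounded table; borrow the
// schema from a static sample so the stream can start immediately.
val schema = spark.read.json("logs/sample.json").schema
val stream = spark.readStream.schema(schema).json("logs/")

// The same DataFrame operations, now over a stream: count events
// per level in 10-minute event-time windows.
val counts = stream
  .groupBy(window($"timestamp", "10 minutes"), $"level")
  .count()

// Continuously maintain the aggregate in an in-memory table that
// interactive queries (e.g. served over JDBC) can read.
val query = counts.writeStream
  .outputMode("complete")
  .format("memory")
  .queryName("log_counts")
  .start()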
28. Demo
Run a join on a large table with 1 billion records and a small table with
1,000 records
In Spark 1.6, this took 60+ seconds.
In Spark 2.0, it took ~3 seconds.
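A rough sketch of a comparable query (not the actual demo code; only the table sizes come from the slide, and a Spark 2.0 SparkSession named spark is assumed):

// range() generates the synthetic tables; the 1,000-row side is small
// enough to be broadcast, and Tungsten Phase 2 fuses the join loop.
val big = spark.range(1000L * 1000 * 1000)   // 1 billion records
val small = spark.range(1000)                // 1,000 records
big.join(small, "id").count()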
30. Volcano Iterator Model
Standard for 30 years: almost all databases do it
Each operator is an "iterator" that consumes records from its input operator
class Filter(child: Operator, predicate: InternalRow => Boolean) {
  // Advance to the next row that satisfies the predicate.
  def next(): Boolean = {
    var found = false
    while (!found && child.next()) {
      found = predicate(child.fetch())
    }
    found
  }
  // Return the current row (the one next() stopped on).
  def fetch(): InternalRow = {
    child.fetch()
  }
  …
}
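To make the pattern concrete, a self-contained sketch (the Operator trait and these names are illustrative, not Spark's actual internals):

// Every operator pulls rows from its child via virtual calls,
// one next()/fetch() pair per row.
trait Operator {
  def next(): Boolean   // advance; true if a row is available
  def fetch(): Long     // return the current row
}

class Scan(rows: Array[Long]) extends Operator {
  private var i = -1
  def next(): Boolean = { i += 1; i < rows.length }
  def fetch(): Long = rows(i)
}

class FilterOp(child: Operator, p: Long => Boolean) extends Operator {
  def next(): Boolean = {
    while (child.next()) if (p(child.fetch())) return true
    false
  }
  def fetch(): Long = child.fetch()
}

// count(*) where value == 1000, driven one virtual call at a time:
val plan = new FilterOp(new Scan(Array(1L, 1000L, 7L, 1000L)), _ == 1000L)
var count = 0
while (plan.next()) count += 1   // count == 2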
31. What if we hire a college freshman to implement this query in Java in 10 mins?
select count(*) from store_sales
where ss_item_sk = 1000
var count = 0
for (ss_item_sk <- store_sales) {
if (ss_item_sk == 1000) {
count += 1
}
}
32. Volcano model
30+ years of database research
vs
a college freshman's hand-written code in 10 mins
34. How does a student beat 30 years of research?
Volcano
1. Many virtual function calls
2. Data in memory (or cache)
3. No loop unrolling, SIMD, pipelining
Hand-written code
1. No virtual function calls
2. Data in CPU registers
3. Compiler loop unrolling, SIMD, pipelining
Take advantage of all the information that is known after query compilation
35. Tungsten Phase 2: Spark as a "Compiler"
The operator pipeline (Scan → Filter → Project → Aggregate) is fused
into a single loop over the data:

var count = 0L
for (ss_item_sk <- store_sales) {
  if (ss_item_sk == 1000) {
    count += 1
  }
}

Functionality of a general purpose execution engine; performance as if you
had hand-built a system just to run your query
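A small sketch of how to observe this in Spark 2.0 (assuming a SparkSession named spark): explain() prefixes operators that run inside generated code with an asterisk.

import spark.implicits._

// A query equivalent in shape to the example above.
val q = spark.range(1000).filter($"id" === 500).groupBy().count()
q.explain()
// Operators printed with a leading '*' (e.g. *Filter, *Range) execute
// inside one generated function instead of as Volcano-style iterators.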
38. Today’s talk
Spark has been growing explosively
Spark 2.0 doubles down on what made Spark attractive:
• elegant APIs
• cutting-edge performance
Learn Spark on Databricks Community Edition
• join beta waitlist https://databricks.com/