The Apache Spark 2.3 release marks a big step forward in speed, unification, and API support.
This talk will quickly walk through what’s new and how you can benefit from the upcoming improvements:
* Continuous Processing in Structured Streaming.
* PySpark support for vectorization, giving Python developers the ability to run native Python code fast.
* Native Kubernetes support, marrying the best of container orchestration and distributed data processing.
What's New in Apache Spark 2.3 & Why Should You Care
1. What's New in Apache Spark™ 2.3
Jules S. Damji
BASM Meetup, Bloomberg
@2twitme
May 15, 2018
2. Spark Community & Developer Advocate @ Databricks
Program Chair, Spark + AI Summit
Developer Advocate @ Hortonworks
Software engineering @ Sun Microsystems, Netscape, @Home, VeriSign, Scalix, Centrify, LoudCloud/Opsware, ProQuest
https://www.linkedin.com/in/dmatrix
@2twitme
3. Databricks’ Unified Analytics Platform
COLLABORATIVE WORKSPACE (Data Engineers | Data Scientists)
DATABRICKS RUNTIME: Delta | SQL | Streaming
Powered by Apache Spark
CLOUD NATIVE SERVICE
• Unifies Data Engineers and Data Scientists
• Unifies Data and AI technologies
• Eliminates infrastructure complexity
4. Major Features in Apache Spark 2.3
• Continuous Processing
• Data Source API V2
• Stream-stream Join
• Spark on Kubernetes
• History Server V2
• UDF Enhancements
• Various SQL Features
• PySpark Performance
• Native ORC Support
• Stable Codegen
• Image Reader
• ML on Streaming
Over 1,400 issues resolved!
8. Complexities in stream processing
COMPLEX DATA
• Diverse data formats (JSON, Avro, binary, …)
• Data can be dirty, late, or out-of-order
COMPLEX SYSTEMS
• Diverse storage systems (Kafka, S3, Kinesis, RDBMS, …)
• System failures
COMPLEX WORKLOADS
• Combining streaming with interactive queries
• Joining two streams
• Machine learning on streams
9. Structured Streaming
Stream processing on the Spark SQL engine:
• fast, scalable, fault-tolerant
• rich, unified, high-level APIs
• handles complex data and complex workloads
• rich ecosystem of data sources
• integrates with many storage systems
13. Structured Streaming
Introduced in Spark 2.0; production-ready since Spark 2.2.
Among Databricks customers:
• 10x more usage than DStreams
• 100+ trillion records processed in production
18. Inner Join + Time Constraints + Watermarks
Time constraints:
• Impressions can be up to 2 hours late.
• Clicks can be up to 3 hours late.
• A click can occur within 1 hour after the corresponding impression.

val impressionsWithWatermark = impressions
  .withWatermark("impressionTime", "2 hours")

val clicksWithWatermark = clicks
  .withWatermark("clickTime", "3 hours")

impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
  """))

(Diagram: the stream-stream inner join expressed as a range join over event time)
19. Stream-stream Joins
• Inner: Supported. Optionally specify a watermark on both sides plus time constraints for state cleanup.
• Left outer: Conditionally supported. Must specify a watermark on the right plus time constraints for correct results; optionally specify a watermark on the left for all state cleanup.
• Right outer: Conditionally supported. Must specify a watermark on the left plus time constraints for correct results; optionally specify a watermark on the right for all state cleanup.
• Full outer: Not supported.
21. Streaming Machine Learning
Model transformation/prediction on batch and streaming data with a unified API: after fitting a model or Pipeline, you can deploy it in a streaming job.

val streamOutput = transformer.transform(streamDF)
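A minimal PySpark sketch of the same pattern (the column names, input path, and batch_df are assumptions for illustration): the fitted model's transform() call is identical for batch and streaming DataFrames.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

# One-time batch training on an existing batch_df with columns f1, f2, label.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(batch_df)

# Streaming sources need an explicit schema; reuse the batch schema here.
stream_df = spark.readStream.schema(batch_df.schema).parquet("/data/incoming")
predictions = model.transform(stream_df)  # same transform() as in batch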
23. Image Support in Spark
Spark image data source [SPARK-21866]:
• Defines a standard API in Spark for loading and reading images
• Deep learning frameworks can rely on this

val df = ImageSchema.readImages("/data/images")
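The same reader is exposed in PySpark via pyspark.ml.image; a minimal sketch (the directory path is an assumption):

from pyspark.ml.image import ImageSchema

# Reads a directory of images into a DataFrame with the standard image schema:
# a single "image" struct column (origin, height, width, nChannels, mode, data).
image_df = ImageSchema.readImages("/data/images")
image_df.printSchema()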
26. PySpark
Introduced in Spark 0.7 (~2013); became a first-class citizen in the DataFrame API in Spark 1.3 (~2015).
Much slower than Scala/Java with user-defined functions (UDFs), due to serialization and the Python interpreter.
Note: most PyData tooling, e.g., Pandas and NumPy, is implemented in native (C/C++) code.
27. PySpark Performance
Fast data serialization and execution using vectorized formats [SPARK-22216] [SPARK-21187]
• Conversion from/to Pandas:
  df.toPandas()
  spark.createDataFrame(pandas_df)
• Pandas/vectorized UDFs: UDFs that use Pandas to process data
  • Scalar Pandas UDFs
  • Grouped Map Pandas UDFs
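A minimal sketch of the Arrow-backed conversion path (spark.sql.execution.arrow.enabled is the Spark 2.3 configuration key; the sample data is made up):

import pandas as pd

# Opt in to Arrow-based columnar transfer between the JVM and Python.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pdf = pd.DataFrame({"x": range(1000)})
df = spark.createDataFrame(pdf)  # Pandas -> Spark, via Arrow
result_pdf = df.toPandas()       # Spark -> Pandas, via Arrow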
28. Pandas UDF
Scalar Pandas UDFs:
• Used with functions such as select and withColumn.
• The Python function should take a pandas.Series as input and return a pandas.Series of the same length.
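A minimal sketch of a scalar Pandas UDF (the column name x is an assumption): the function receives whole pandas.Series batches rather than one row at a time.

from pyspark.sql.functions import pandas_udf, PandasUDFType

# Receives a pandas.Series batch; must return a Series of the same length.
@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(v):
    return v + 1

df.withColumn("x_plus_one", plus_one(df["x"]))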
29. Pandas UDF
Grouped Map Pandas UDFs:
• Split-apply-combine.
• A Python function defines the computation for each group.
• The output schema must be declared up front, as the return type of the UDF.
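A minimal sketch of a grouped map Pandas UDF (the id and v columns are assumptions): each group arrives as a pandas.DataFrame, and the declared schema describes the returned frame.

from pyspark.sql.functions import pandas_udf, PandasUDFType

# The schema string declares the columns of the pandas.DataFrame returned per group.
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame containing all rows of one group.
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby("id").apply(subtract_mean)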
30. PySpark Performance
Blog: "Introducing Pandas UDFs for PySpark" (Two Sigma)
http://dbricks.co/2rMwmW0
Apache Arrow, a columnar format for data exchange:
https://arrow.apache.org
31. Demo
Try importing this notebook at home:
https://dbricks.co/pandas_udf
34. Native Spark App in K8s
• New Spark scheduler backend [SPARK-18278]
• The driver runs in a Kubernetes pod created by the submission client and creates pods that run the executors in response to requests from the Spark scheduler. [K8S-34377]
• Makes direct use of Kubernetes clusters for multi-tenancy and sharing through Namespaces and Quotas, as well as administrative features such as Pluggable Authorization and Logging.
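For orientation, this is the documented spark-submit shape for 2.3 cluster mode on Kubernetes; the API server address, image name, and executor count below are placeholder assumptions:

bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar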
35. Spark on Kubernetes
Supported:
• Kubernetes 1.6 and up
• Cluster mode only
• Static resource allocation only
• Java and Scala applications
• Container-local and downloadable remote dependencies

In roadmap (2.4):
• Client mode
• Dynamic resource allocation + external shuffle service
• Python and R support
• Submission-client local dependencies + Resource Staging Server (RSS)
• Non-secured and Kerberized HDFS access (injection of Hadoop configuration)