1 © Hortonworks Inc. 2011–2018. All rights reserved
What’s New in Apache Spark 2.3 & 2.4
DWS Melbourne, Australia 2019
Robert Hryniewicz
Agenda Highlights
• Apache Spark and Kubernetes
• Native Vectorized ORC and SQL Cache Readers
• Pandas UDFs for PySpark
• Continuous Stream Processing
• Barrier Execution
• Avro/Image Data Source
• Higher-order Functions
Apache Spark 2.3
Native Vectorized ORC Reader
• Native ORC read and write: set ‘spark.sql.orc.impl’ to ‘native’.
• Vectorized ORC reader: set ‘spark.sql.orc.enableVectorizedReader’ to ‘true’.
See also ORC Improvement in Apache Spark 2.3 by Dongjoon Hyun
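A minimal PySpark sketch of applying the two ORC settings to an existing session. Only the two configuration keys come from the slide; the helper names `enable_native_orc` and `read_orc` and the `path` argument are illustrative assumptions.

```python
# The two settings named on the slide; Spark accepts the values as strings.
ORC_NATIVE_CONF = {
    "spark.sql.orc.impl": "native",
    "spark.sql.orc.enableVectorizedReader": "true",
}


def enable_native_orc(spark):
    """Apply the ORC settings to an existing SparkSession and return it."""
    for key, value in ORC_NATIVE_CONF.items():
        spark.conf.set(key, value)
    return spark


def read_orc(spark, path):
    """Read ORC data with the native reader; `path` is a placeholder."""
    return enable_native_orc(spark).read.orc(path)
```

Usage would look like `read_orc(spark, "/data/events_orc")`, where `spark` is a running SparkSession and the path is hypothetical.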
Pandas UDFs (a.k.a. Vectorized UDFs)
Normal UDF
[Diagram: data flow between Apache Spark and the Python Worker. Labels: internal Spark data; convert to standard Java type; pickled; unpickled; evaluate row by row; convert to Python data.]
Pandas UDF
[Diagram: data flow between Apache Spark and the Python Worker. Labels: internal Spark data; Apache Arrow format; convert to Pandas (cheap); vectorized operation by Pandas API.]
See also Introducing Pandas UDFs for PySpark by Li Jin
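A small sketch of a scalar Pandas UDF, assuming pandas and PyArrow are installed wherever the Python workers run. The function `plus_one`, the column name `x`, and the output column name are hypothetical; `pandas_udf` and `PandasUDFType.SCALAR` are the PySpark 2.3/2.4 entry points.

```python
def plus_one(batch):
    # Inside Spark, `batch` is a pandas.Series holding many rows at once,
    # so the addition is a single vectorized operation, not a per-row call.
    return batch + 1


def make_plus_one_udf():
    # Imported lazily so this module loads without a Spark installation.
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    return pandas_udf(plus_one, "long", PandasUDFType.SCALAR)


def add_plus_one_column(df, column="x"):
    """Return `df` with an extra column computed by the vectorized UDF."""
    return df.withColumn("x_plus_one", make_plus_one_udf()(df[column]))
```

The plain function stays testable on its own, and the wrapping step is the only part that needs Spark.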
Conversion To/From Pandas With Apache Arrow
• Enable Apache Arrow optimization: set ‘spark.sql.execution.arrow.enabled’ to ‘true’.
See also Speeding up PySpark with Apache Arrow by Bryan Cutler
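A sketch of the conversions this setting affects, assuming PyArrow is installed. The helper names are illustrative; the configuration key is the one on the slide, which applies to Spark 2.3 and 2.4.

```python
ARROW_CONF_KEY = "spark.sql.execution.arrow.enabled"


def to_pandas_with_arrow(spark, df):
    """Collect a Spark DataFrame to pandas with Arrow-based transfer enabled."""
    spark.conf.set(ARROW_CONF_KEY, "true")
    return df.toPandas()


def from_pandas_with_arrow(spark, pdf):
    """Create a Spark DataFrame from a pandas DataFrame with Arrow enabled."""
    spark.conf.set(ARROW_CONF_KEY, "true")
    return spark.createDataFrame(pdf)
```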
Structured Streaming
Continuous Stream Processing
Structured Streaming: Microbatch
Continuous Stream Processing
See also Continuous Processing in Structured Streaming by Josh Torres
Structured Streaming: Continuous Processing
See also Spark Summit Keynote Demo by Michael Armbrust
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#continuous-processing
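A minimal sketch of starting a query in continuous mode, which is marked experimental in Spark 2.3. The `rate` source, the `console` sink, the one-second checkpoint interval, and the `checkpoint_dir` argument are assumptions chosen for illustration; the switch from micro-batch to continuous execution is the `trigger(continuous=...)` call.

```python
def start_continuous_query(spark, checkpoint_dir):
    """Start a continuous-mode streaming query and return its handle."""
    # The built-in "rate" source generates test rows at a fixed rate.
    stream = (spark.readStream
              .format("rate")
              .option("rowsPerSecond", "5")
              .load())
    return (stream.writeStream
            .format("console")
            .option("checkpointLocation", checkpoint_dir)
            # A continuous trigger replaces the default micro-batch trigger;
            # the interval controls how often progress is checkpointed.
            .trigger(continuous="1 second")
            .start())
```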
Stream-to-Stream Joins
See also Introducing Stream-Stream Joins in Apache Spark 2.3 by Tathagata Das and Joseph Torres
Apache Spark and Kubernetes
See also Running Spark on Kubernetes
Image Support in Spark
• Converts compressed image formats (e.g., PNG and JPG) to the raw image representation used by OpenCV
• One record per image file
See also SPARK-21866 by Ilya Matiach, and Deep Learning Pipelines for Apache Spark
(Stateless) History Server
History Server Using K-V Store
• Requires storing app lists and UI data in memory
• Requires reading and parsing the whole log file
See also SPARK-18085 and the proposal by Marcelo Vanzin
History Server Using K-V Store
• Store app lists and UI in a persistent K-V store (LevelDB)
• Set ‘spark.history.store.path’ to use this feature
• Event logs written by earlier versions remain compatible
R Structured Streaming
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
See also SSR: Structured Streaming on R for Machine Learning by Felix Cheung
R Native Function Execution Stability
See also SPARK-21093
Apache Spark 2.4
Barrier Execution
Apache Spark 2.4
• [SPARK-24374] barrier execution mode
Apache Spark 3.0
• [SPARK-24374] barrier execution mode
• [SPARK-24579] optimized data exchange
• [SPARK-24615] accelerator-aware scheduling
See also Project Hydrogen: Unifying State-of-the-art AI and Big Data in Apache Spark by Reynold Xin
See also Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark by Xiangrui Meng
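A sketch of a barrier stage in PySpark 2.4, where all tasks of the stage are launched together and can synchronize with each other. The function name, the data, and the partition count are illustrative assumptions; a real run also assumes the cluster has enough free slots to start every task at once.

```python
def run_barrier_stage(spark, num_tasks=4):
    """Run one barrier stage and return the row count seen by each task."""
    # Imported lazily so this module loads without a Spark installation.
    from pyspark import BarrierTaskContext

    def sync_then_count(rows):
        context = BarrierTaskContext.get()
        # Block until every task in this stage has reached this point.
        context.barrier()
        yield sum(1 for _ in rows)

    rdd = spark.sparkContext.parallelize(range(100), num_tasks)
    # barrier() marks the stage so its tasks are scheduled all at once.
    return rdd.barrier().mapPartitions(sync_then_count).collect()
```

This all-or-nothing scheduling is what distributed training frameworks need, since their workers must start together and talk to each other.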
Pandas UDFs: Grouped Aggregate Pandas UDFs
https://github.com/apache/spark/commit/9786ce66c
https://github.com/apache/spark/commit/b2ce17b4c
https://github.com/apache/spark/pull/22620/commits/06a7bd0c
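A sketch of a grouped aggregate Pandas UDF as available in PySpark 2.4, assuming pandas and PyArrow are installed. The function `mean_of` and the column names `id` and `v` are hypothetical; `PandasUDFType.GROUPED_AGG` is the function type that reduces each group's values to one scalar.

```python
def mean_of(values):
    # Inside Spark, `values` is a pandas.Series with all rows of one group;
    # the return value is a single scalar for that group.
    return values.mean()


def mean_by_group(df, key="id", value="v"):
    """Aggregate `value` per `key` using a grouped aggregate Pandas UDF."""
    # Imported lazily so this module loads without a Spark installation.
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    mean_udf = pandas_udf(mean_of, "double", PandasUDFType.GROUPED_AGG)
    return df.groupBy(key).agg(mean_udf(df[value]).alias("mean_" + value))
```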
Eager Evaluation
• Set ‘spark.sql.repl.eagerEval.enabled’ to true to enable eager evaluation in Jupyter
See also (ongoing) SPARK-24572 for eager evaluation on the R side
Flexible Streaming Sink
• Exposes the output rows of each micro-batch as a DataFrame
• foreachBatch(f: (Dataset[T], Long) => Unit): Scala/Java/Python APIs in DataStreamWriter; the Long argument is the batch ID
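A minimal PySpark sketch of `foreachBatch`. In Python the callback receives the micro-batch DataFrame and the batch ID. The output location `OUTPUT_DIR` and the function names are placeholders, not part of the talk.

```python
OUTPUT_DIR = "/tmp/stream_out"  # hypothetical output location


def write_batch(batch_df, batch_id):
    # `batch_df` is an ordinary (non-streaming) DataFrame holding the rows
    # of one micro-batch, so any batch writer can be reused here.
    (batch_df.write
     .mode("append")
     .parquet("%s/batch_id=%d" % (OUTPUT_DIR, batch_id)))


def start_query(stream_df):
    """Route every micro-batch of a streaming DataFrame to write_batch."""
    return stream_df.writeStream.foreachBatch(write_batch).start()
```

Because the callback gets a plain DataFrame, sinks that only have batch writers can still be used from a streaming query.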
Avro Data Source
• Apache Avro (https://avro.apache.org)
• A data serialization format
• Widely used in the Spark and Hadoop ecosystem, especially for Kafka-based data pipelines.
• Spark-Avro package (https://github.com/databricks/spark-avro)
• Spark SQL can read and write the Avro data.
• Inlining Spark-Avro package [SPARK-24768]
• Better experience for first-time users of Spark SQL and structured streaming
• Expected to further improve the adoption of Structured Streaming
• from_avro/to_avro functions to read and write Avro data within a DataFrame instead of just files.
• Example:
• Decode the Avro data into a struct
• Filter by column `favorite_color`
• Encode the column `name` in Avro format
• Refactor Avro Serializer and Deserializer
• External
• Avro Data -> Row -> InternalRow
• Native
• Avro Data -> InternalRow
• Options:
• compression: compression codec used on write
• ignoreExtension: whether to load files without the .avro extension on read
• recordNamespace: record namespace used on write
• recordName: top-level record name used on write
• avroSchema: Avro schema to use
• Logical type support:
• Date [SPARK-24772]
• Decimal [SPARK-24774]
• Timestamp [SPARK-24773]
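A sketch of reading and writing Avro files with the built-in data source and a few of the options listed above. It assumes the spark-avro module is on the classpath (it ships as a separate artifact). The option values `User` and `com.example` and the helper names are hypothetical.

```python
# Write-side options from the slide; the values are illustrative only.
AVRO_WRITE_OPTIONS = {
    "compression": "snappy",
    "recordName": "User",
    "recordNamespace": "com.example",
}


def write_avro(df, path):
    """Write a DataFrame as Avro files using the options above."""
    df.write.format("avro").options(**AVRO_WRITE_OPTIONS).save(path)


def read_avro(spark, path):
    """Read Avro files from `path` into a DataFrame."""
    return spark.read.format("avro").load(path)
```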
Image Data Source
• Spark datasource for image format
• ImageSchema.readImages is deprecated; use the image data source (format "image") instead
• SQL syntax support
• Partition discovery
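A sketch of loading images through the data source API in Spark 2.4. The helper name and the directory are placeholders; the selected fields are part of the `image` struct column that the source produces.

```python
def read_images(spark, path):
    """Load image files under `path`, one row per image file."""
    images = spark.read.format("image").load(path)
    # Each row carries a single struct column named "image".
    return images.select("image.origin", "image.width", "image.height")
```

Usage would look like `read_images(spark, "/data/images")`, with a hypothetical directory.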
Higher-order Functions
• Take functions as arguments to transform complex data types such as map, array, and struct
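A short sketch of higher-order functions in Spark SQL 2.4, run from PySpark. The inline table, the column name `vals`, and the helper name are assumptions; `transform`, `filter`, and `aggregate` each take a lambda written with the `->` syntax.

```python
HOF_QUERY = """
SELECT
  transform(vals, x -> x + 1)             AS plus_one,
  filter(vals, x -> x % 2 = 0)            AS evens,
  aggregate(vals, 0, (acc, x) -> acc + x) AS total
FROM VALUES (array(1, 2, 3)) AS t(vals)
"""


def run_hof_query(spark):
    """Apply three higher-order functions to an array column."""
    return spark.sql(HOF_QUERY)
```

The lambdas run inside the SQL engine, so no Python or Scala UDF is needed to process the array elements.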
Built-in Functions
• New or extended built-in functions for ArrayTypes and MapTypes
• 26 functions for ArrayTypes
• transform, filter, reduce, array_distinct, array_intersect, array_union, array_except, array_join,
array_max, array_min, ...
• 8 functions for MapTypes
• map_from_arrays, map_from_entries, map_entries, map_concat, map_filter, map_zip_with,
transform_keys, transform_values
Apache Spark and Kubernetes
• New Spark scheduler backend
• PySpark support [SPARK-23984]
• SparkR support [SPARK-24433]
• Client-mode support [SPARK-23146]
• Support for mounting K8S volumes [SPARK-23529]
Scala 2.12 (Beta) Support
Build Spark against Scala 2.12 [SPARK-14220]
PySpark Custom Worker
• Configuration to select the modules for daemon and worker in PySpark
• Set ‘spark.python.daemon.module’ and/or ‘spark.python.worker.module’ to the daemon or worker module to use
See also Remote Python Debugging 4 Spark
Data Source Changes
• CSV
• Option samplingRatio
• for schema inference [SPARK-23846]
• Option enforceSchema
• for forcibly applying the user-specified schema; set to false to throw an exception when it doesn't match the CSV header [SPARK-23786]
• Option encoding
• for specifying the encoding of outputs.
[SPARK-19018]
• JSON
• Option dropFieldIfAllNull
• for ignoring column of all null values or
empty array/struct during JSON schema
inference [SPARK-23772]
• Option lineSep
• for defining the line separator that should
be used for parsing [SPARK-23765]
• Option encoding
• for specifying the encoding of inputs and
outputs. [SPARK-23723]
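A sketch of passing these CSV and JSON reader options from PySpark. The helper names, the sampling ratio, and the paths are illustrative assumptions; the option names are the ones listed above.

```python
def read_csv_inferred(spark, path):
    """Infer the CSV schema from a 10% sample of the rows."""
    return (spark.read
            .option("header", "true")
            .option("inferSchema", "true")
            .option("samplingRatio", "0.1")
            .csv(path))


def read_csv_validated(spark, path, schema):
    """Check `schema` (e.g. a DDL string) against the CSV header."""
    return (spark.read
            .schema(schema)
            .option("header", "true")
            # false: validate the schema against the header instead of
            # applying it blindly.
            .option("enforceSchema", "false")
            .csv(path))


def read_json_lines(spark, path):
    """Read line-delimited JSON, skipping fields that are always null."""
    return (spark.read
            .option("dropFieldIfAllNull", "true")
            .option("lineSep", "\n")
            .option("encoding", "UTF-8")
            .json(path))
```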
• Parquet
• Push down
• STRING [SPARK-23972]
• Decimal [SPARK-24549]
• Timestamp [SPARK-24718]
• Date [SPARK-23727]
• Byte/Short [SPARK-24706]
• StringStartsWith [SPARK-24638]
• IN [SPARK-17091]
• ORC
• Native ORC reader is on by default
[SPARK-23456]
• Turn on ORC filter push-down by
default [SPARK-21783]
• Use native ORC reader to read Hive serde tables by default [SPARK-22279]
• JDBC
• Option queryTimeout
• for the number of seconds the driver will wait for a Statement object to execute [SPARK-23856]
• Option query
• for specifying the query to read from JDBC [SPARK-24423]
• Option pushDownPredicate
• for specifying whether the filter pushdown is allowed [SPARK-24288]
• Option cascadeTruncate [SPARK-22880]
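A sketch of a JDBC read that uses the `query` and `queryTimeout` options from PySpark. The SQL text, table name, timeout value, and helper name are hypothetical. The pushdown switch is spelled `pushDownPredicate` in the Spark JDBC documentation, and `query` is used in place of `dbtable`, not together with it.

```python
EXAMPLE_QUERY = "SELECT id, name FROM people WHERE id > 100"  # hypothetical


def read_jdbc_query(spark, url, user, password):
    """Load the result of EXAMPLE_QUERY from a JDBC source."""
    return (spark.read
            .format("jdbc")
            .option("url", url)
            .option("user", user)
            .option("password", password)
            .option("query", EXAMPLE_QUERY)
            # Seconds the driver waits for a statement to execute.
            .option("queryTimeout", "30")
            # Let Spark push filters down to the database where it can.
            .option("pushDownPredicate", "true")
            .load())
```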
What About Apache Spark 3.0?
• Spark 2.2.0 RC1: 2017/05
• Spark 2.2.0 RC2, RC3, RC4, RC5: 2017/06
• Spark 2.2.0 RC6: 2017/07
• Spark 2.2.0 released: 2017/07
• Spark 2.3.0 RC1: 2018/01
• Spark 2.3.0 RC2, RC3, RC4, RC5: 2018/02
• Spark 2.3.0 released: 2018/02
• Spark 2.4.0 RC1: 2018/09
• Spark 2.4.0 RC2: 2018/10
• Spark 2.4.0: 2018/11
• Spark 3.0.0: 2019/05 (?)
See also the thread in Spark dev mailing list for Spark 3.0 discussion
Newer Integration for Apache Hive with Apache Spark
• Apache Hive 3 support: Apache Spark provides basic Hive compatibility
• Apache Hive ACID table support
• Structured Streaming Support
• Apache Ranger integration support
• Use LLAP and vectorized read/write – fast!
See also this article for Hive warehouse connector
Questions?
Thanks!
Robert Hryniewicz

More Related Content

What's hot

Spring Caches with Protocol Buffers
Spring Caches with Protocol BuffersSpring Caches with Protocol Buffers
Spring Caches with Protocol BuffersVMware Tanzu
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 
Neo4j 4.1 overview
Neo4j 4.1 overviewNeo4j 4.1 overview
Neo4j 4.1 overviewNeo4j
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySparkRussell Jurney
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark Mostafa
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframeJaemun Jung
 
Kafka Summit NYC 2017 - Singe Message Transforms are not the Transformations ...
Kafka Summit NYC 2017 - Singe Message Transforms are not the Transformations ...Kafka Summit NYC 2017 - Singe Message Transforms are not the Transformations ...
Kafka Summit NYC 2017 - Singe Message Transforms are not the Transformations ...confluent
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkDatabricks
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark Aakashdata
 
MLFlow: Platform for Complete Machine Learning Lifecycle
MLFlow: Platform for Complete Machine Learning Lifecycle MLFlow: Platform for Complete Machine Learning Lifecycle
MLFlow: Platform for Complete Machine Learning Lifecycle Databricks
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
 
Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explainedconfluent
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache icebergAlluxio, Inc.
 
Zero-Copy Event-Driven Servers with Netty
Zero-Copy Event-Driven Servers with NettyZero-Copy Event-Driven Servers with Netty
Zero-Copy Event-Driven Servers with NettyDaniel Bimschas
 
From Zero to Hero with Kafka Connect
From Zero to Hero with Kafka ConnectFrom Zero to Hero with Kafka Connect
From Zero to Hero with Kafka Connectconfluent
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheDremio Corporation
 

What's hot (20)

Spring Caches with Protocol Buffers
Spring Caches with Protocol BuffersSpring Caches with Protocol Buffers
Spring Caches with Protocol Buffers
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Neo4j 4.1 overview
Neo4j 4.1 overviewNeo4j 4.1 overview
Neo4j 4.1 overview
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframe
 
Kafka Summit NYC 2017 - Singe Message Transforms are not the Transformations ...
Kafka Summit NYC 2017 - Singe Message Transforms are not the Transformations ...Kafka Summit NYC 2017 - Singe Message Transforms are not the Transformations ...
Kafka Summit NYC 2017 - Singe Message Transforms are not the Transformations ...
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache Spark
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
MLFlow: Platform for Complete Machine Learning Lifecycle
MLFlow: Platform for Complete Machine Learning Lifecycle MLFlow: Platform for Complete Machine Learning Lifecycle
MLFlow: Platform for Complete Machine Learning Lifecycle
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Apache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals ExplainedApache Kafka Architecture & Fundamentals Explained
Apache Kafka Architecture & Fundamentals Explained
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 
Zero-Copy Event-Driven Servers with Netty
Zero-Copy Event-Driven Servers with NettyZero-Copy Event-Driven Servers with Netty
Zero-Copy Event-Driven Servers with Netty
 
From Zero to Hero with Kafka Connect
From Zero to Hero with Kafka ConnectFrom Zero to Hero with Kafka Connect
From Zero to Hero with Kafka Connect
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 

Similar to What s new in spark 2.3 and spark 2.4

What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4DataWorks Summit
 
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3Dongjoon Hyun
 
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3DataWorks Summit
 
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonApache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonDataWorks Summit
 
What's new in Apache Spark 2.4
What's new in Apache Spark 2.4What's new in Apache Spark 2.4
What's new in Apache Spark 2.4boxu42
 
Performance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkPerformance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkDataWorks Summit
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinAlex Zeltov
 
Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with ZeppelinHortonworks
 
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4Dongjoon Hyun
 
Apache spark 2.4 and beyond
Apache spark 2.4 and beyondApache spark 2.4 and beyond
Apache spark 2.4 and beyondXiao Li
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark Hortonworks
 
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...Spark Summit
 
Apache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster ComputingApache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster ComputingAll Things Open
 
Hive on spark berlin buzzwords
Hive on spark berlin buzzwordsHive on spark berlin buzzwords
Hive on spark berlin buzzwordsSzehon Ho
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHortonworks
 
The structured streaming upgrade to Apache Spark and how enterprises can bene...
The structured streaming upgrade to Apache Spark and how enterprises can bene...The structured streaming upgrade to Apache Spark and how enterprises can bene...
The structured streaming upgrade to Apache Spark and how enterprises can bene...Impetus Technologies
 
What’s new in Apache Spark 2.3
What’s new in Apache Spark 2.3What’s new in Apache Spark 2.3
What’s new in Apache Spark 2.3DataWorks Summit
 

Similar to What s new in spark 2.3 and spark 2.4 (20)

What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4
 
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3
 
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3
 
Row/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache SparkRow/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache Spark
 
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonApache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
 
What's new in Apache Spark 2.4
What's new in Apache Spark 2.4What's new in Apache Spark 2.4
What's new in Apache Spark 2.4
 
Performance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkPerformance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache Spark
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
 
Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with Zeppelin
 
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
 
Apache spark 2.4 and beyond
Apache spark 2.4 and beyondApache spark 2.4 and beyond
Apache spark 2.4 and beyond
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark
 
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
 
Running Spark in Production
Running Spark in ProductionRunning Spark in Production
Running Spark in Production
 
Sparc solaris servers
Sparc solaris serversSparc solaris servers
Sparc solaris servers
 
Apache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster ComputingApache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster Computing
 
Hive on spark berlin buzzwords
Hive on spark berlin buzzwordsHive on spark berlin buzzwords
Hive on spark berlin buzzwords
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
 
The structured streaming upgrade to Apache Spark and how enterprises can bene...
The structured streaming upgrade to Apache Spark and how enterprises can bene...The structured streaming upgrade to Apache Spark and how enterprises can bene...
The structured streaming upgrade to Apache Spark and how enterprises can bene...
 
What’s new in Apache Spark 2.3
What’s new in Apache Spark 2.3What’s new in Apache Spark 2.3
What’s new in Apache Spark 2.3
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...DataWorks Summit
 

What's new in Spark 2.3 and Spark 2.4

  • 1. What’s New in Apache Spark 2.3 & 2.4
      DWS Melbourne, Australia 2019
      Robert Hryniewicz
      © Hortonworks Inc. 2011–2018. All rights reserved
  • 2. Agenda Highlights
      • Apache Spark and Kubernetes
      • Native Vectorized ORC and SQL Cache Readers
      • Pandas UDFs for PySpark
      • Continuous Stream Processing
      • Barrier Execution
      • Avro/Image Data Source
      • Higher-order Functions
  • 3. Apache Spark 2.3
  • 4. Native Vectorized ORC Reader
      • Native ORC read and write: set ‘spark.sql.orc.impl’ to ‘native’.
      • Vectorized ORC reader: set ‘spark.sql.orc.enableVectorizedReader’ to ‘true’.
      See also ORC Improvement in Apache Spark 2.3 by Dongjoon Hyun
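A minimal sketch of how the two ORC settings named on this slide could be applied when building a PySpark session. The helper name `build_orc_session` and the application name are illustrative assumptions, not part of the talk. `pyspark` is imported lazily so the settings dictionary can be inspected without a Spark installation.

```python
# ORC settings quoted on the slide (Spark 2.3 configuration keys).
ORC_SETTINGS = {
    "spark.sql.orc.impl": "native",
    "spark.sql.orc.enableVectorizedReader": "true",
}


def build_orc_session(app_name="orc-native-demo"):
    """Hypothetical helper: create a SparkSession with the ORC settings applied."""
    from pyspark.sql import SparkSession  # requires pyspark to be installed

    builder = SparkSession.builder.appName(app_name)
    for key, value in ORC_SETTINGS.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

The same keys can also be passed on the command line as `--conf key=value` pairs to `spark-submit`.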
  • 5. Pandas UDFs (a.k.a. Vectorized UDFs)
      Normal UDF
      [Diagram: Apache Spark and Python Worker. Labels: Internal Spark data; Convert to standard Java type; Pickled; Unpickled; Evaluate row by row; Convert to Python data]
  • 6. Pandas UDFs (a.k.a. Vectorized UDFs)
      Pandas UDF
      [Diagram: Apache Spark and Python Worker. Labels: Internal Spark data; Apache Arrow format; Convert to Pandas (cheap); Vectorized operation by Pandas API]
  • 7. Pandas UDFs (a.k.a. Vectorized UDFs)
      Pandas UDF
      [Diagram: Apache Spark and Python Worker. Labels: Internal Spark data; Apache Arrow format; Convert to Pandas (cheap); Vectorized operation by Pandas API]
  • 8. Pandas UDFs (a.k.a. Vectorized UDFs)
      Pandas UDF
      See also Introducing Pandas UDFs for PySpark by Li Jin
  • 9. Conversion To/From Pandas With Apache Arrow
      • Enable Apache Arrow optimization: set ‘spark.sql.execution.arrow.enabled’ to ‘true’.
      See also Speeding up PySpark with Apache Arrow by Bryan Cutler
  • 10. Continuous Stream Processing
      Structured Streaming
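The contrast between a row-at-a-time UDF and a Pandas UDF can be sketched as follows. This is an illustrative example, not code from the talk: the function names `plus_one` and `make_plus_one_udf` are made up, and the wrapper uses the Spark 2.3-style `pandas_udf(..., PandasUDFType.SCALAR)` form, which needs both `pyspark` and `pyarrow` installed.

```python
def plus_one(values):
    """Vectorized body: Spark passes a whole pandas.Series per batch, not one row."""
    return values + 1


def make_plus_one_udf():
    """Wrap plus_one as a scalar Pandas UDF returning a long column."""
    # Imported lazily; requires pyspark and pyarrow at call time.
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    return pandas_udf(plus_one, returnType="long", functionType=PandasUDFType.SCALAR)
```

Assuming a DataFrame `df` with a numeric column `x`, the UDF could then be applied as `df.withColumn("x_plus_one", make_plus_one_udf()(df["x"]))`.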
  • 11. Continuous Stream Processing
      Structured Streaming: Microbatch
      See also Continuous Processing in Structured Streaming by Josh Torres
  • 12. Continuous Stream Processing
      Structured Streaming: Continuous Processing
      See also Continuous Processing in Structured Streaming by Josh Torres
  • 13. Continuous Stream Processing
      Structured Streaming: Continuous Processing
      See also Spark Summit Keynote Demo by Michael Armbrust
      https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#continuous-processing
  • 14. Stream-to-Stream Joins
      See also Introducing Stream-Stream Joins in Apache Spark 2.3 by Tathagata Das and Joseph Torres
  • 15. Apache Spark and Kubernetes
      See also Running Spark on Kubernetes
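As a hedged illustration of switching a query from microbatch to continuous processing, the sketch below starts a streaming query with a continuous trigger. The helper name, the choice of the built-in `rate` source and `console` sink, and the one-second checkpoint interval are assumptions made for the example, not details from the slides.

```python
# Interval at which continuous processing records progress (assumed value).
CHECKPOINT_INTERVAL = "1 second"


def start_continuous_query(spark, checkpoint_interval=CHECKPOINT_INTERVAL):
    """Start a rate-source query that runs in continuous mode instead of microbatches."""
    stream = (
        spark.readStream
        .format("rate")                 # test source that generates rows
        .option("rowsPerSecond", 10)
        .load()
    )
    return (
        stream.writeStream
        .format("console")
        .trigger(continuous=checkpoint_interval)  # continuous trigger, Spark 2.3+
        .start()
    )
```

Replacing the `trigger(continuous=...)` call with `trigger(processingTime="1 second")` would give the microbatch behaviour shown on the earlier slide.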
  • 16. Image Support in Spark
      • Converts compressed image formats (e.g., PNG and JPG) to the raw image representation used by OpenCV
      • One record per image file
      See also SPARK-21866 by Ilya Matiach, and Deep Learning Pipelines for Apache Spark
  • 17. History Server Using K-V Store
      (Stateless) History Server
      • Requires storing app lists and UI in memory
      • Requires reading/parsing the whole log file
      See also SPARK-18085 and the proposal by Marcelo Vanzin
  • 18. History Server Using K-V Store
      • Stores app lists and UI in a persistent K-V store (LevelDB)
      • Set ‘spark.history.store.path’ to use this feature
      • Event logs written by earlier versions remain compatible
      See also SPARK-18085 and the proposal by Marcelo Vanzin
  • 19. R Structured Streaming
      https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
      See also SSR: Structured Streaming on R for Machine Learning by Felix Cheung
  • 20. R Native Function Execution Stability
      See also SPARK-21093
  • 21. Apache Spark 2.4
  • 22. Barrier Execution
      Apache Spark 2.4
      • [SPARK-24374] barrier execution mode
      Apache Spark 3.0
      • [SPARK-24374] barrier execution mode
      • [SPARK-24579] optimized data exchange
      • [SPARK-24615] accelerator-aware scheduling
      See also Project Hydrogen: Unifying State-of-the-art AI and Big Data in Apache Spark by Reynold Xin
      See also Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark by Xiangrui Meng
  • 23. Pandas UDFs
      Pandas UDFs: Grouped Aggregate
      https://github.com/apache/spark/commit/9786ce66c
      https://github.com/apache/spark/commit/b2ce17b4c
  • 24. Pandas UDFs
      Pandas UDFs: Grouped Aggregate
      https://github.com/apache/spark/pull/22620/commits/06a7bd0c
      [Diagram: Apache Spark and Python Worker. Labels: Internal Spark data; Apache Arrow format; Convert to Pandas (cheap); Vectorized operation by Pandas API]
  • 25. Eager Evaluation
      • Set ‘spark.sql.repl.eagerEval.enabled’ to ‘true’ to enable eager evaluation in Jupyter
  • 26. Eager Evaluation
      • Set ‘spark.sql.repl.eagerEval.enabled’ to ‘true’ to enable eager evaluation in Jupyter
      See also (ongoing) SPARK-24572 for eager evaluation on the R side
  • 27. Flexible Streaming Sink
      • Exposes the output rows of each microbatch as a DataFrame
      • foreachBatch(f: (Dataset[T], Long) => Unit), where the second argument is the batch ID
      • Scala/Java/Python APIs in DataStreamWriter
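A minimal PySpark sketch of the `foreachBatch` sink described above. The function names and the output path are hypothetical. The point is only that each microbatch arrives as an ordinary DataFrame, so any batch writer can be reused inside the callback.

```python
# Hypothetical output location for the example.
OUTPUT_PATH = "/tmp/foreach_batch_demo"


def write_batch(batch_df, batch_id):
    """Called once per microbatch; batch_df holds that batch's rows as a DataFrame."""
    # batch_id is available for deduplication or logging; it is unused here.
    batch_df.write.mode("append").parquet(OUTPUT_PATH)


def start_query(stream_df):
    """Attach write_batch to a streaming DataFrame and start the query (Spark 2.4+)."""
    return stream_df.writeStream.foreachBatch(write_batch).start()
```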
  • 28. Avro Data Source
      • Apache Avro (https://avro.apache.org)
          • A data serialization format
          • Widely used in the Spark and Hadoop ecosystem, especially for Kafka-based data pipelines
      • Spark-Avro package (https://github.com/databricks/spark-avro)
          • Lets Spark SQL read and write Avro data
      • Inlining the Spark-Avro package [SPARK-24768]
          • Better experience for first-time users of Spark SQL and structured streaming
          • Expected to further improve the adoption of structured streaming
  • 29. Avro Data Source
      • from_avro/to_avro functions read and write Avro data within a DataFrame instead of just files
      • Example:
          • Decode the Avro data into a struct
          • Filter by column `favorite_color`
          • Encode the column `name` in Avro format
  • 30. Avro Data Source
      • Refactored Avro serializer and deserializer
          • External: Avro data -> Row -> InternalRow
          • Native: Avro data -> InternalRow
  • 31. Avro Data Source
      • Options:
          • compression: compression codec used on write
          • ignoreExtension: whether to ignore the .avro extension on read
          • recordNamespace: record namespace used on write
          • recordName: top-level record name used on write
          • avroSchema: Avro schema to use
      • Logical type support:
          • Date [SPARK-24772]
          • Decimal [SPARK-24774]
          • Timestamp [SPARK-24773]
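The write-side options listed above could be passed to the built-in Avro data source roughly as follows. This is a sketch under assumptions: the option values (`snappy`, `User`, `com.example`) and the helper name `write_avro` are invented for illustration, and the Avro module usually has to be added to the classpath separately (for example with `--packages`).

```python
# Illustrative values for the write options named on the slide.
AVRO_WRITE_OPTIONS = {
    "compression": "snappy",
    "recordName": "User",
    "recordNamespace": "com.example",
}


def write_avro(df, path, options=None):
    """Write a DataFrame as Avro files, applying each option to the writer."""
    writer = df.write.format("avro")  # built-in format name since Spark 2.4
    for key, value in (options or AVRO_WRITE_OPTIONS).items():
        writer = writer.option(key, value)
    writer.save(path)
```

Reading works the same way in the other direction, for example `spark.read.format("avro").load(path)`.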
  • 32. Image Data Source
      • Spark data source for image formats
      • ImageSchema is deprecated; use instead:
      • SQL syntax support
      • Partition discovery
  • 33. Higher-order Functions
      • Take functions as arguments to transform complex data types such as map, array, and struct
  • 34. Higher-order Functions
  • 35. Built-in Functions
      • New or extended built-in functions for ArrayTypes and MapTypes
      • 26 functions for ArrayTypes
          • transform, filter, reduce, array_distinct, array_intersect, array_union, array_except, array_join, array_max, array_min, ...
      • 8 functions for MapTypes
          • map_from_arrays, map_from_entries, map_entries, map_concat, map_filter, map_zip_with, transform_keys, transform_values
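To make the higher-order functions concrete, here is a small sketch of Spark SQL queries that pass a lambda to `transform`, `filter`, and `aggregate`. The query texts and the helper `run_examples` are illustrative, not taken from the slides. The fold is written with `aggregate`, on the assumption that this is the function name available in Spark 2.4 for the operation the slide calls reduce.

```python
# Spark SQL snippets using lambda syntax (x -> expr) over array literals.
HOF_QUERIES = {
    "transform": "SELECT transform(array(1, 2, 3), x -> x + 1) AS result",
    "filter": "SELECT filter(array(1, 2, 3), x -> x > 1) AS result",
    "aggregate": "SELECT aggregate(array(1, 2, 3), 0, (acc, x) -> acc + x) AS result",
}


def run_examples(spark):
    """Run each query on the given SparkSession and return {name: result value}."""
    return {
        name: spark.sql(query).collect()[0]["result"]
        for name, query in HOF_QUERIES.items()
    }
```

Because the lambda runs inside the SQL engine, no Python UDF or explode-and-regroup step is needed to process the array elements.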
  • 36. Apache Spark and Kubernetes
      • New Spark scheduler backend
      • PySpark support [SPARK-23984]
      • SparkR support [SPARK-24433]
      • Client-mode support [SPARK-23146]
      • Support for mounting K8S volumes [SPARK-23529]
      Scala 2.12 (Beta) Support
      • Build Spark against Scala 2.12 [SPARK-14220]
  • 37. PySpark Custom Worker
      • Configuration to select the modules for daemon and worker in PySpark
      • Set ‘spark.python.daemon.module’ and/or ‘spark.python.worker.module’ to the daemon or worker module to use
      See also Remote Python Debugging 4 Spark
  • 38. Data Source Changes
      • CSV
          • Option samplingRatio: for schema inference [SPARK-23846]
          • Option enforceSchema: for throwing an exception when the user-specified schema doesn’t match the CSV header [SPARK-23786]
          • Option encoding: for specifying the encoding of outputs [SPARK-19018]
      • JSON
          • Option dropFieldIfAllNull: for ignoring columns of all null values or empty arrays/structs during JSON schema inference [SPARK-23772]
          • Option lineSep: for defining the line separator that should be used for parsing [SPARK-23765]
          • Option encoding: for specifying the encoding of inputs and outputs [SPARK-23723]
  • 39. Data Source Changes
      • Parquet
          • Push down
              • STRING [SPARK-23972]
              • Decimal [SPARK-24549]
              • Timestamp [SPARK-24718]
              • Date [SPARK-23727]
              • Byte/Short [SPARK-24706]
              • StringStartsWith [SPARK-24638]
              • IN [SPARK-17091]
      • ORC
          • Native ORC reader is on by default [SPARK-23456]
          • Turn on ORC filter push-down by default [SPARK-21783]
          • Use native ORC reader to read Hive serde tables by default [SPARK-22279]
  • 40. Data Source Changes
      • JDBC
          • Option queryTimeout: the number of seconds the driver will wait for a Statement object to execute [SPARK-23856]
          • Option query: for specifying the query to read from JDBC [SPARK-24423]
          • Option pushDownPredicate: for specifying whether filter pushdown is allowed [SPARK-24288]
          • Option cascadeTruncate [SPARK-22880]
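A short sketch of how the `query` and `queryTimeout` JDBC options might be combined in PySpark. The helper names, the connection URL, and the default timeout are assumptions for illustration; a matching JDBC driver would also need to be on the classpath.

```python
def jdbc_read_options(url, query, timeout_seconds=30):
    """Build the option map for a JDBC read that uses a query instead of a table name."""
    return {
        "url": url,
        "query": query,                        # used in place of "dbtable"
        "queryTimeout": str(timeout_seconds),  # seconds to wait for the statement
    }


def read_query(spark, options):
    """Load the query result as a DataFrame using the given SparkSession."""
    return spark.read.format("jdbc").options(**options).load()
```

For instance, `read_query(spark, jdbc_read_options("jdbc:postgresql://localhost/db", "SELECT id FROM users", 10))` would read the query result with a ten-second statement timeout, where the URL and table are placeholders.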
  • 41. What About Apache Spark 3.0?
      • 2017/05: Spark 2.2.0 RC1
      • 2017/06: Spark 2.2.0 RC2, RC3, RC4, RC5
      • 2017/07: Spark 2.2.0 RC6
      • 2017/07: Spark 2.2.0 released
      • 2018/01: Spark 2.3.0 RC1
      • 2018/02: Spark 2.3.0 RC2, RC3, RC4, RC5
      • 2018/02: Spark 2.3.0 released
      • 2018/09: Spark 2.4.0 RC1
      • 2018/10: Spark 2.4.0 RC2
      • 2018/11: Spark 2.4.0
      • 2019/05 (?): Spark 3.0.0
      See also the thread in Spark dev mailing list for Spark 3.0 discussion
  • 42. Newer Integration for Apache Hive with Apache Spark
      • Apache Hive 3 support: Apache Spark provides basic Hive compatibility
      • Apache Hive ACID table support
      • Structured Streaming support
      • Apache Ranger integration support
      • Use LLAP and vectorized read/write – fast!
      See also this article for Hive warehouse connector
  • 43. Questions?
  • 44. Thanks!
      Robert Hryniewicz