SlideShare une entreprise Scribd logo
1  sur  34
Jean-Georges Perrin • @jgperrin
Apache Spark v3.0.0
What’s new? A very personal view.
Jean-Georges Perrin
Software since 1983 $>0 1995
Big Data since 1984 $>0 2006
AI since 1994 $>0 2010
x12
@jgperrin • http://jgp.ai
Sources:

Wikipedia: https://en.wikipedia.org/wiki/Big_data

IBM: https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data
• volume 

• variety 

• velocity 

• variability

• value
3
V4
5
Biiiiiiiig Data
Dataisconsidered
bigwhenitneeds
morethanone
computertobe
processed
Apps
Analytics
Distrib.
Hardware
OS
Apps
HardwareHardware
OS OS
Distributed OS
Analytics OS
Apps
HardwareHardware
OS OS
An analytics operating system?
HardwareHardware
OS OS
Distributed OS
Analytics OS
Apps
{
An analytics operating system?
HardwareHardware
OS OS
Distributed OS
Analytics OS
Apps
{
An analytics operating system?
DATA
Engineer
DATA
Scientist
Develop, build, test, and operationalize
datastores and large-scale processing
systems.
DataOps is the new DevOps.
Clean, massage, and organize data.
Perform statistics and analysis to develop
insights, build models, and search for
innovative correlations.
Match architecture
with business needs.
Develop processes
for data modeling,
mining, and
pipelines.
Improve data
reliability and quality.
Prepare data for predictive
models.
Explore data to find
hidden gems and patterns.
Tells stories to key
stakeholders.
Sources:

Adapted from https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
DATA
Engineer
DATA
Scientist
SQL
Sources:

Adapted from https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
IBM Watson Studio
Sources:

Matei Zaharia, Spark + AI Summit 2020, https://youtu.be/p4PkA2huzVc

Databricks blog, https://databricks.com/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html
Python rules in
Notebooks
A few more figures
Who does not like performance figures?
• Databricks:

• Processes >5T records/day with Structured Streaming (introduced in Spark
v2.0, stable in Spark v2.2)

• >90% of all Spark API are Spark SQL, regardless of language used

• Community:

• Spark v3.0 is roughly two times faster than Spark v2.4 in the TPC-DS 30TB
benchmark
5,000,000,000,001
Sources:

Matei Zaharia, Spark + AI Summit 2020, https://youtu.be/p4PkA2huzVc

Databricks blog, https://databricks.com/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html

Spark v3.0.0 release notes, https://spark.apache.org/releases/spark-release-3-0-0.html
What’s new in v3?
Sources:

Spark v3 release notes, https://spark.apache.org/releases/spark-release-3-0-0.html

Databricks blog, https://databricks.com/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html
3400+ Jira tickets
Highlights in a nutshell
• Python

• Python v3 only (so long Python v2)

• Better error handling

• Koalas offer better Pandas support (close to 80%)

• SQL

• Better ANSI SQL compliance

• Core

• Adaptive query execution, including partition pruning

• Java v11 support, Scala v2.12 only

• Hydrogen - hardware & accelerator aware scheduler
GPU support for model training
Optimizing the optimizer
Got to love Catalyst
• v1.x: rule

• v2.x: rule & cost (thanks to IBM)

• v3.x: rule & cost & runtime (thanks to Databricks & Intel)
• Dynamically coalescing shuffle partitions

• Dynamically switching join strategies

• Dynamically optimizing skew joins
Sources:

Databricks blog, https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html
Adaptive query execution (AQE)
Yet another 3-letter acronym
SparkSession spark = SparkSession.builder()
.appName("Join using AQE")
.master("local[*]")
.config("spark.sql.adaptive.enabled", useAqe)
.getOrCreate();
…
Dataset<Row> institPerCountyDf = higherEdDf.join(
countyZipDf,
higherEdDf.col("zip").equalTo(countyZipDf.col("zip")),
"inner");
institPerCountyDf = institPerCountyDf.join(
censusDf,
institPerCountyDf.col("county").equalTo(censusDf.col("countyId")),
"left");
For the entire session
/jgperrin/net.jgp.books.spark.ch12
Chapter 12
Lab #302
In a free companion book
• sinh, cosh, tanh, asinh, acosh, atanh (SPARK-28133)

• any, every, some (SPARK-19851)

• bit_and, bit_or (SPARK-27879)

• bit_count (SPARK-29491)

• bit_xor (SPARK-29545)

• bool_and, bool_or (SPARK-30184)

• count_if (SPARK-27425)

• date_part (SPARK-28690)

• extract (SPARK-23903)

• forall (SPARK-27905)

• from_csv (SPARK-25393)

• make_date (SPARK-28432)

• make_interval (SPARK-29393)

• make_timestamp (SPARK-28459)

• map_entries (SPARK-23935)

• map_filter (SPARK-23937)

• map_zip_with (SPARK-23938)

• max_by, min_by (SPARK-27653)

• schema_of_csv (SPARK-25672)

• to_csv (SPARK-25638)

• transform_keys (SPARK-23939)

• transform_values (SPARK-23940)

• typeof (SPARK-29961)

• version (SPARK-29554)

• xxhash64 (SPARK-27099)
Sources:

Spark v3 release notes, https://spark.apache.org/releases/spark-release-3-0-0.html
New static functions
http://jgp.ai/sia
Always a soup
• Finally a reference guide

• http://jgp.ai/sparksql

• EXPLAIN can be FORMATTED

• Proleptic Gregorian calendar,
based on Java 8

• Overflow check

• ANSI compatibility through
configuration flag
SQL
Ingestion
Who needs a push down?
• Already available in databases

• Allow to filter what you ingest, before you ingest it

• Equivalent but easier than ingesting and filtering after
String sqlQuery =
"select actor.first_name, actor.last_name, film.title, "
+ "film.description "
+ "from actor, film_actor, film "
+ "where actor.actor_id = film_actor.actor_id "
+ "and film_actor.film_id = film.film_id";
Dataset<Row> df = spark.read().jdbc(
"jdbc:mysql://localhost:3306/sakila",
"(" + sqlQuery + ") actor_film_alias",
props);
Will only ingest the result of the MySQL query
/jgperrin/net.jgp.books.spark.ch08
Chapter 8
Lab #310
+---+--------+----------------------------------------------------------------------+-----------+----------------------+
| id|authorId| title|releaseDate| link|
+---+--------+----------------------------------------------------------------------+-----------+----------------------+
| 1| 1| Fantastic Beasts and Where to Find Them: The Original Screenplay| 11/18/2016|http://amzn.to/2kup94P|
| 2| 1|Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Har...| 10/06/2015|http://amzn.to/2l2lSwP|
| 3| 1| The Tales of Beedle the Bard, Standard Edition (Harry Potter)| 12/04/2008|http://amzn.to/2kYezqr|
| 4| 1|Harry Potter and the Chamber of Secrets: The Illustrated Edition (H...| 10/04/2016|http://amzn.to/2kYhL5n|
| 5| 2|Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the ...| 04/23/2017|http://amzn.to/2i3mthT|
| 6| 2|Development Tools in 2006: any Room for a 4GL-style Language?
An i...| 12/28/2016|http://amzn.to/2vBxOe1|
| 7| 3| Adventures of Huckleberry Finn| 05/26/1994|http://amzn.to/2wOeOav|
…
Dataset<Row> df = spark.read().format("csv")
…
.load("data/books.csv")
.filter("authorId = 1”);
+---+--------+----------------------------------------------------------------------+-----------+----------------------+
| id|authorId| title|releaseDate| link|
+---+--------+----------------------------------------------------------------------+-----------+----------------------+
| 1| 1| Fantastic Beasts and Where to Find Them: The Original Screenplay| 11/18/2016|http://amzn.to/2kup94P|
| 2| 1|Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Har...| 10/06/2015|http://amzn.to/2l2lSwP|
| 3| 1| The Tales of Beedle the Bard, Standard Edition (Harry Potter)| 12/04/2008|http://amzn.to/2kYezqr|
| 4| 1|Harry Potter and the Chamber of Secrets: The Illustrated Edition (H...| 10/04/2016|http://amzn.to/2kYhL5n|
+---+--------+----------------------------------------------------------------------+-----------+----------------------+
Will only ingest books where authorId is 1
/jgperrin/net.jgp.books.spark.ch07
Chapter 7
Lab #201
Migration tips
Yes, there are needed
• Compilation will detect some (new Exception in structured streaming)

• Runtime will throw you off:

• Parsing dates

• Data sources (v2 on the way)

• Reference

• https://spark.apache.org/docs/latest/migration-guide.html
org.apache.spark.SparkUpgradeException: You may get a different result due to
the upgrading of Spark 3.0: Fail to parse '2015-10-6' in the new parser. You can
set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before
Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
SparkSession spark = SparkSession.builder()
.appName("CSV to dataframe to Dataset<Book> and back")
.master("local")
.getOrCreate();
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY");
or:
SparkSession spark = SparkSession.builder()
.appName("CSV to dataframe to Dataset<Book> and back")
.master("local")
.config("spark.sql.legacy.timeParserPolicy", "LEGACY")
.getOrCreate();
Chapter 3
Lab #320
Lab #321
/jgperrin/net.jgp.books.spark.ch03
Mi-figue, mi-raisin
Mixed bags
The lakehouse is a full ecosystem
Or is it an operating system?
Streams
Systems
Files
Other
databases
Systems Streams
TBA?
FilesOther
databases
Business Data science Data engineering
Delta Lake &
Delta Engine
Outcome
Processing &
Storage
Data sources
Takeaways
• Apache Spark v3 is a major update, 3400+ patches

• Foundation for a rich data ecosystem

• Python increasingly popular, beats Scala

• Cornerstone for the lakehouse concept
Thank you! http://jgp.ai/sia
Join me for DataFriday:
http://jgp.ai/datafriday
Backup
Credits
• World of Watson by Jean-Georges Perrin CC BY-SA 4.0

• Digital Garage by Jean-Georges Perrin CC BY-SA 4.0

• Figs, grapes and rosehips by Marco Verch Professional Photographer and
Speaker, Flickr

• Soup by Valeria Boltneva from Pexels

Contenu connexe

Tendances

Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureRussell Spitzer
 
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...StampedeCon
 
Analyzing Log Data With Apache Spark
Analyzing Log Data With Apache SparkAnalyzing Log Data With Apache Spark
Analyzing Log Data With Apache SparkSpark Summit
 
Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6Shalin Shekhar Mangar
 
ElasticES-Hadoop: Bridging the world of Hadoop and Elasticsearch
ElasticES-Hadoop: Bridging the world of Hadoop and ElasticsearchElasticES-Hadoop: Bridging the world of Hadoop and Elasticsearch
ElasticES-Hadoop: Bridging the world of Hadoop and ElasticsearchMapR Technologies
 
Data stax academy
Data stax academyData stax academy
Data stax academyDuyhai Doan
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosRahul Kumar
 
Workshop: Learning Elasticsearch
Workshop: Learning ElasticsearchWorkshop: Learning Elasticsearch
Workshop: Learning ElasticsearchAnurag Patel
 
Owning time series with team apache Strata San Jose 2015
Owning time series with team apache   Strata San Jose 2015Owning time series with team apache   Strata San Jose 2015
Owning time series with team apache Strata San Jose 2015Patrick McFadin
 
Elasticsearch in Netflix
Elasticsearch in NetflixElasticsearch in Netflix
Elasticsearch in NetflixDanny Yuan
 
DataSource V2 and Cassandra – A Whole New World
DataSource V2 and Cassandra – A Whole New WorldDataSource V2 and Cassandra – A Whole New World
DataSource V2 and Cassandra – A Whole New WorldDatabricks
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Uwe Printz
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and SparkLucidworks
 
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...Lucidworks
 
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...Lucidworks
 
Spark Programming
Spark ProgrammingSpark Programming
Spark ProgrammingTaewook Eom
 
Hadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data ProcessingHadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data ProcessingYahoo Developer Network
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & SparkMatthias Niehoff
 

Tendances (20)

Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
 
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
 
Analyzing Log Data With Apache Spark
Analyzing Log Data With Apache SparkAnalyzing Log Data With Apache Spark
Analyzing Log Data With Apache Spark
 
Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6Cross Datacenter Replication in Apache Solr 6
Cross Datacenter Replication in Apache Solr 6
 
ElasticES-Hadoop: Bridging the world of Hadoop and Elasticsearch
ElasticES-Hadoop: Bridging the world of Hadoop and ElasticsearchElasticES-Hadoop: Bridging the world of Hadoop and Elasticsearch
ElasticES-Hadoop: Bridging the world of Hadoop and Elasticsearch
 
Data stax academy
Data stax academyData stax academy
Data stax academy
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesos
 
Workshop: Learning Elasticsearch
Workshop: Learning ElasticsearchWorkshop: Learning Elasticsearch
Workshop: Learning Elasticsearch
 
Owning time series with team apache Strata San Jose 2015
Owning time series with team apache   Strata San Jose 2015Owning time series with team apache   Strata San Jose 2015
Owning time series with team apache Strata San Jose 2015
 
High Performance Solr
High Performance SolrHigh Performance Solr
High Performance Solr
 
Elasticsearch in Netflix
Elasticsearch in NetflixElasticsearch in Netflix
Elasticsearch in Netflix
 
DataSource V2 and Cassandra – A Whole New World
DataSource V2 and Cassandra – A Whole New WorldDataSource V2 and Cassandra – A Whole New World
DataSource V2 and Cassandra – A Whole New World
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and Spark
 
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
Loading 350M documents into a large Solr cluster: Presented by Dion Olsthoorn...
 
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
Streaming Aggregation in Solr - New Horizons for Search: Presented by Erick E...
 
SolrCloud on Hadoop
SolrCloud on HadoopSolrCloud on Hadoop
SolrCloud on Hadoop
 
Spark Programming
Spark ProgrammingSpark Programming
Spark Programming
 
Hadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data ProcessingHadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data Processing
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & Spark
 

Similaire à Apache Spark v3.0.0

ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
Talavant Data Lake Analytics
Talavant Data Lake Analytics Talavant Data Lake Analytics
Talavant Data Lake Analytics Sean Forgatch
 
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...HostedbyConfluent
 
Implementing SharePoint on Azure, Lessons Learnt from a Real World Project
Implementing SharePoint on Azure, Lessons Learnt from a Real World ProjectImplementing SharePoint on Azure, Lessons Learnt from a Real World Project
Implementing SharePoint on Azure, Lessons Learnt from a Real World ProjectK.Mohamed Faizal
 
Web Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC ProjectWeb Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC ProjectSaltlux Inc.
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nlbartzon
 
Hotsos 2011: Mining the AWR repository for Capacity Planning, Visualization, ...
Hotsos 2011: Mining the AWR repository for Capacity Planning, Visualization, ...Hotsos 2011: Mining the AWR repository for Capacity Planning, Visualization, ...
Hotsos 2011: Mining the AWR repository for Capacity Planning, Visualization, ...Kristofferson A
 
MySQL Ecosystem in 2020
MySQL Ecosystem in 2020MySQL Ecosystem in 2020
MySQL Ecosystem in 2020Alkin Tezuysal
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataRahul Jain
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleSean Chittenden
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nltieleman
 
Drilling Cyber Security Data With Apache Drill
Drilling Cyber Security Data With Apache DrillDrilling Cyber Security Data With Apache Drill
Drilling Cyber Security Data With Apache DrillCharles Givre
 
High Performance, Scalable MongoDB in a Bare Metal Cloud
High Performance, Scalable MongoDB in a Bare Metal CloudHigh Performance, Scalable MongoDB in a Bare Metal Cloud
High Performance, Scalable MongoDB in a Bare Metal CloudMongoDB
 
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New FeaturesAmazon Web Services
 
SQL on Hadoop benchmarks using TPC-DS query set
SQL on Hadoop benchmarks using TPC-DS query setSQL on Hadoop benchmarks using TPC-DS query set
SQL on Hadoop benchmarks using TPC-DS query setKognitio
 
ECS19 - Patrick Curran, Eric Shupps - SHAREPOINT 24X7X365: ARCHITECTING FOR H...
ECS19 - Patrick Curran, Eric Shupps - SHAREPOINT 24X7X365: ARCHITECTING FOR H...ECS19 - Patrick Curran, Eric Shupps - SHAREPOINT 24X7X365: ARCHITECTING FOR H...
ECS19 - Patrick Curran, Eric Shupps - SHAREPOINT 24X7X365: ARCHITECTING FOR H...European Collaboration Summit
 
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica SarbuOSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica SarbuNETWAYS
 
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica SarbuOSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica SarbuNETWAYS
 
Streaming ETL - from RDBMS to Dashboard with KSQL
Streaming ETL - from RDBMS to Dashboard with KSQLStreaming ETL - from RDBMS to Dashboard with KSQL
Streaming ETL - from RDBMS to Dashboard with KSQLBjoern Rost
 

Similaire à Apache Spark v3.0.0 (20)

ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Talavant Data Lake Analytics
Talavant Data Lake Analytics Talavant Data Lake Analytics
Talavant Data Lake Analytics
 
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...
 
Implementing SharePoint on Azure, Lessons Learnt from a Real World Project
Implementing SharePoint on Azure, Lessons Learnt from a Real World ProjectImplementing SharePoint on Azure, Lessons Learnt from a Real World Project
Implementing SharePoint on Azure, Lessons Learnt from a Real World Project
 
Web Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC ProjectWeb Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC Project
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nl
 
Hotsos 2011: Mining the AWR repository for Capacity Planning, Visualization, ...
Hotsos 2011: Mining the AWR repository for Capacity Planning, Visualization, ...Hotsos 2011: Mining the AWR repository for Capacity Planning, Visualization, ...
Hotsos 2011: Mining the AWR repository for Capacity Planning, Visualization, ...
 
MySQL Ecosystem in 2020
MySQL Ecosystem in 2020MySQL Ecosystem in 2020
MySQL Ecosystem in 2020
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at Scale
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nl
 
Drilling Cyber Security Data With Apache Drill
Drilling Cyber Security Data With Apache DrillDrilling Cyber Security Data With Apache Drill
Drilling Cyber Security Data With Apache Drill
 
High Performance, Scalable MongoDB in a Bare Metal Cloud
High Performance, Scalable MongoDB in a Bare Metal CloudHigh Performance, Scalable MongoDB in a Bare Metal Cloud
High Performance, Scalable MongoDB in a Bare Metal Cloud
 
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features
 
SQL on Hadoop benchmarks using TPC-DS query set
SQL on Hadoop benchmarks using TPC-DS query setSQL on Hadoop benchmarks using TPC-DS query set
SQL on Hadoop benchmarks using TPC-DS query set
 
HotSpotコトハジメ
HotSpotコトハジメHotSpotコトハジメ
HotSpotコトハジメ
 
ECS19 - Patrick Curran, Eric Shupps - SHAREPOINT 24X7X365: ARCHITECTING FOR H...
ECS19 - Patrick Curran, Eric Shupps - SHAREPOINT 24X7X365: ARCHITECTING FOR H...ECS19 - Patrick Curran, Eric Shupps - SHAREPOINT 24X7X365: ARCHITECTING FOR H...
ECS19 - Patrick Curran, Eric Shupps - SHAREPOINT 24X7X365: ARCHITECTING FOR H...
 
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica SarbuOSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 - Monitor your infrastructure with Elastic Beats by Monica Sarbu
 
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica SarbuOSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu
OSMC 2016 | Monitor your Infrastructure with Elastic Beats by Monica Sarbu
 
Streaming ETL - from RDBMS to Dashboard with KSQL
Streaming ETL - from RDBMS to Dashboard with KSQLStreaming ETL - from RDBMS to Dashboard with KSQL
Streaming ETL - from RDBMS to Dashboard with KSQL
 

Plus de Jean-Georges Perrin

It's painful how much data rules the world
It's painful how much data rules the worldIt's painful how much data rules the world
It's painful how much data rules the worldJean-Georges Perrin
 
The road to AI is paved with pragmatic intentions
The road to AI is paved with pragmatic intentionsThe road to AI is paved with pragmatic intentions
The road to AI is paved with pragmatic intentionsJean-Georges Perrin
 
Spark Summit Europe Wrap Up and TASM State of the Community
Spark Summit Europe Wrap Up and TASM State of the CommunitySpark Summit Europe Wrap Up and TASM State of the Community
Spark Summit Europe Wrap Up and TASM State of the CommunityJean-Georges Perrin
 
Spark hands-on tutorial (rev. 002)
Spark hands-on tutorial (rev. 002)Spark hands-on tutorial (rev. 002)
Spark hands-on tutorial (rev. 002)Jean-Georges Perrin
 
Spark Summit 2017 - A feedback for TASM
Spark Summit 2017 - A feedback for TASMSpark Summit 2017 - A feedback for TASM
Spark Summit 2017 - A feedback for TASMJean-Georges Perrin
 
HTML (or how the web got started)
HTML (or how the web got started)HTML (or how the web got started)
HTML (or how the web got started)Jean-Georges Perrin
 
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...Jean-Georges Perrin
 
Vision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
Vision stratégique de l'utilisation de l'(Open)Data dans l'entrepriseVision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
Vision stratégique de l'utilisation de l'(Open)Data dans l'entrepriseJean-Georges Perrin
 
Informix is not for legacy applications
Informix is not for legacy applicationsInformix is not for legacy applications
Informix is not for legacy applicationsJean-Georges Perrin
 
GreenIvory : products and services
GreenIvory : products and servicesGreenIvory : products and services
GreenIvory : products and servicesJean-Georges Perrin
 
GreenIvory : produits & services
GreenIvory : produits & servicesGreenIvory : produits & services
GreenIvory : produits & servicesJean-Georges Perrin
 
A la découverte des nouvelles tendances du web (Mulhouse Edition)
A la découverte des nouvelles tendances du web (Mulhouse Edition)A la découverte des nouvelles tendances du web (Mulhouse Edition)
A la découverte des nouvelles tendances du web (Mulhouse Edition)Jean-Georges Perrin
 
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvoryMashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvoryJean-Georges Perrin
 
MashupXFeed et le référencement - Workshop Activis - Greenivory
MashupXFeed et le référencement - Workshop Activis - GreenivoryMashupXFeed et le référencement - Workshop Activis - Greenivory
MashupXFeed et le référencement - Workshop Activis - GreenivoryJean-Georges Perrin
 

Plus de Jean-Georges Perrin (20)

It's painful how much data rules the world
It's painful how much data rules the worldIt's painful how much data rules the world
It's painful how much data rules the world
 
Big data made easy with a Spark
Big data made easy with a SparkBig data made easy with a Spark
Big data made easy with a Spark
 
Why i love Apache Spark?
Why i love Apache Spark?Why i love Apache Spark?
Why i love Apache Spark?
 
Big Data made easy with a Spark
Big Data made easy with a SparkBig Data made easy with a Spark
Big Data made easy with a Spark
 
The road to AI is paved with pragmatic intentions
The road to AI is paved with pragmatic intentionsThe road to AI is paved with pragmatic intentions
The road to AI is paved with pragmatic intentions
 
Spark Summit Europe Wrap Up and TASM State of the Community
Spark Summit Europe Wrap Up and TASM State of the CommunitySpark Summit Europe Wrap Up and TASM State of the Community
Spark Summit Europe Wrap Up and TASM State of the Community
 
Spark hands-on tutorial (rev. 002)
Spark hands-on tutorial (rev. 002)Spark hands-on tutorial (rev. 002)
Spark hands-on tutorial (rev. 002)
 
Spark Summit 2017 - A feedback for TASM
Spark Summit 2017 - A feedback for TASMSpark Summit 2017 - A feedback for TASM
Spark Summit 2017 - A feedback for TASM
 
HTML (or how the web got started)
HTML (or how the web got started)HTML (or how the web got started)
HTML (or how the web got started)
 
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...
2CRSI presentation for ISC-HPC: When High-Performance Computing meets High-Pe...
 
Vision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
Vision stratégique de l'utilisation de l'(Open)Data dans l'entrepriseVision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
Vision stratégique de l'utilisation de l'(Open)Data dans l'entreprise
 
Informix is not for legacy applications
Informix is not for legacy applicationsInformix is not for legacy applications
Informix is not for legacy applications
 
Vendre des produits techniques
Vendre des produits techniquesVendre des produits techniques
Vendre des produits techniques
 
Vendre plus sur le web
Vendre plus sur le webVendre plus sur le web
Vendre plus sur le web
 
Vendre plus sur le Web
Vendre plus sur le WebVendre plus sur le Web
Vendre plus sur le Web
 
GreenIvory : products and services
GreenIvory : products and servicesGreenIvory : products and services
GreenIvory : products and services
 
GreenIvory : produits & services
GreenIvory : produits & servicesGreenIvory : produits & services
GreenIvory : produits & services
 
A la découverte des nouvelles tendances du web (Mulhouse Edition)
A la découverte des nouvelles tendances du web (Mulhouse Edition)A la découverte des nouvelles tendances du web (Mulhouse Edition)
A la découverte des nouvelles tendances du web (Mulhouse Edition)
 
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvoryMashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
MashupXFeed et la stratégie éditoriale - Workshop Activis - GreenIvory
 
MashupXFeed et le référencement - Workshop Activis - Greenivory
MashupXFeed et le référencement - Workshop Activis - GreenivoryMashupXFeed et le référencement - Workshop Activis - Greenivory
MashupXFeed et le référencement - Workshop Activis - Greenivory
 

Dernier

毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一F La
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 

Dernier (20)

毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 

Apache Spark v3.0.0

  • 1. Jean-Georges Perrin • @jgperrin Apache Spark v3.0.0 What’s new? A very personal view.
  • 2. Jean-Georges Perrin Software since 1983 $>0 1995 Big Data since 1984 $>0 2006 AI since 1994 $>0 2010 x12 @jgperrin • http://jgp.ai
  • 3.
  • 4.
  • 7.
  • 8. Apps Analytics Distrib. Hardware OS Apps HardwareHardware OS OS Distributed OS Analytics OS Apps HardwareHardware OS OS An analytics operating system?
  • 9. HardwareHardware OS OS Distributed OS Analytics OS Apps { An analytics operating system?
  • 10. HardwareHardware OS OS Distributed OS Analytics OS Apps { An analytics operating system?
  • 11. DATA Engineer DATA Scientist Develop, build, test, and operationalize datastores and large-scale processing systems. DataOps is the new DevOps. Clean, massage, and organize data. Perform statistics and analysis to develop insights, build models, and search for innovative correlations. Match architecture with business needs. Develop processes for data modeling, mining, and pipelines. Improve data reliability and quality. Prepare data for predictive models. Explore data to find hidden gems and patterns. Tells stories to key stakeholders. Sources: Adapted from https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
  • 13.
  • 14. Sources: Matei Zaharia, Spark + AI Summit 2020, https://youtu.be/p4PkA2huzVc Databricks blog, https://databricks.com/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html Python rules in Notebooks
  • 15. A few more figures Who does not like performance figures? • Databricks: • Processes >5T records/day with Structured Streaming (introduced in Spark v2.0, stable in Spark v2.2) • >90% of all Spark API are Spark SQL, regardless of language used • Community: • Spark v3.0 is roughly two times faster than Spark v2.4 in the TPC-DS 30TB benchmark 5,000,000,000,001 Sources: Matei Zaharia, Spark + AI Summit 2020, https://youtu.be/p4PkA2huzVc Databricks blog, https://databricks.com/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html Spark v3.0.0 release notes, https://spark.apache.org/releases/spark-release-3-0-0.html
  • 17. Sources: Spark v3 release notes, https://spark.apache.org/releases/spark-release-3-0-0.html
 Databricks blog, https://databricks.com/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html 3400+ Jira tickets
  • 18. Highlights in a nutshell • Python • Python v3 only (so long Python v2) • Better error handling • Koalas offer better Pandas support (close to 80%) • SQL • Better ANSI SQL compliance • Core • Adaptive query execution, including partition pruning • Java v11 support, Scala v2.12 only • Hydrogen - hardware & accelerator aware scheduler GPU support for model training
  • 19. Optimizing the optimizer Got to love Catalyst • v1.x: rule • v2.x: rule & cost (thanks to IBM) • v3.x: rule & cost & runtime (thanks to Databricks & Intel)
  • 20. • Dynamically coalescing shuffle partitions • Dynamically switching join strategies • Dynamically optimizing skew joins Sources: Databricks blog, https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html Adaptive query execution (AQE) Yet another 3-letter acronym
  • 21. SparkSession spark = SparkSession.builder() .appName("Join using AQE") .master("local[*]") .config("spark.sql.adaptive.enabled", useAqe) .getOrCreate(); … Dataset<Row> institPerCountyDf = higherEdDf.join( countyZipDf, higherEdDf.col("zip").equalTo(countyZipDf.col("zip")), "inner"); institPerCountyDf = institPerCountyDf.join( censusDf, institPerCountyDf.col("county").equalTo(censusDf.col("countyId")), "left"); For the entire session /jgperrin/net.jgp.books.spark.ch12 Chapter 12 Lab #302
  • 22. In a free companion book • sinh, cosh, tanh, asinh, acosh, atanh (SPARK-28133) • any, every, some (SPARK-19851) • bit_and, bit_or (SPARK-27879) • bit_count (SPARK-29491) • bit_xor (SPARK-29545) • bool_and, bool_or (SPARK-30184) • count_if (SPARK-27425) • date_part (SPARK-28690) • extract (SPARK-23903) • forall (SPARK-27905) • from_csv (SPARK-25393) • make_date (SPARK-28432) • make_interval (SPARK-29393) • make_timestamp (SPARK-28459) • map_entries (SPARK-23935) • map_filter (SPARK-23937) • map_zip_with (SPARK-23938) • max_by, min_by (SPARK-27653) • schema_of_csv (SPARK-25672) • to_csv (SPARK-25638) • transform_keys (SPARK-23939) • transform_values (SPARK-23940) • typeof (SPARK-29961) • version (SPARK-29554) • xxhash64 (SPARK-27099) Sources: Spark v3 release notes, https://spark.apache.org/releases/spark-release-3-0-0.html New static functions http://jgp.ai/sia
  • 23. Always a soup • Finally a reference guide • http://jgp.ai/sparksql • EXPLAIN can be FORMATTED • Proleptic Gregorian calendar, based on Java 8 • Overflow check • ANSI compatibility through configuration flag SQL
  • 24. Ingestion Who needs a push down? • Already available in databases • Allow to filter what you ingest, before you ingest it • Equivalent but easier than ingesting and filtering after
  • 25. String sqlQuery = "select actor.first_name, actor.last_name, film.title, " + "film.description " + "from actor, film_actor, film " + "where actor.actor_id = film_actor.actor_id " + "and film_actor.film_id = film.film_id"; Dataset<Row> df = spark.read().jdbc( "jdbc:mysql://localhost:3306/sakila", "(" + sqlQuery + ") actor_film_alias", props); Will only ingest the result of the MySQL query /jgperrin/net.jgp.books.spark.ch08 Chapter 8 Lab #310
  • 26. +---+--------+----------------------------------------------------------------------+-----------+----------------------+ | id|authorId| title|releaseDate| link| +---+--------+----------------------------------------------------------------------+-----------+----------------------+ | 1| 1| Fantastic Beasts and Where to Find Them: The Original Screenplay| 11/18/2016|http://amzn.to/2kup94P| | 2| 1|Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Har...| 10/06/2015|http://amzn.to/2l2lSwP| | 3| 1| The Tales of Beedle the Bard, Standard Edition (Harry Potter)| 12/04/2008|http://amzn.to/2kYezqr| | 4| 1|Harry Potter and the Chamber of Secrets: The Illustrated Edition (H...| 10/04/2016|http://amzn.to/2kYhL5n| | 5| 2|Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the ...| 04/23/2017|http://amzn.to/2i3mthT| | 6| 2|Development Tools in 2006: any Room for a 4GL-style Language? An i...| 12/28/2016|http://amzn.to/2vBxOe1| | 7| 3| Adventures of Huckleberry Finn| 05/26/1994|http://amzn.to/2wOeOav| … Dataset<Row> df = spark.read().format("csv") … .load("data/books.csv") .filter("authorId = 1”); +---+--------+----------------------------------------------------------------------+-----------+----------------------+ | id|authorId| title|releaseDate| link| +---+--------+----------------------------------------------------------------------+-----------+----------------------+ | 1| 1| Fantastic Beasts and Where to Find Them: The Original Screenplay| 11/18/2016|http://amzn.to/2kup94P| | 2| 1|Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Har...| 10/06/2015|http://amzn.to/2l2lSwP| | 3| 1| The Tales of Beedle the Bard, Standard Edition (Harry Potter)| 12/04/2008|http://amzn.to/2kYezqr| | 4| 1|Harry Potter and the Chamber of Secrets: The Illustrated Edition (H...| 10/04/2016|http://amzn.to/2kYhL5n| +---+--------+----------------------------------------------------------------------+-----------+----------------------+ Will only ingest books where authorId is 1 /jgperrin/net.jgp.books.spark.ch07 Chapter 7 Lab #201
  • 27. Migration tips Yes, there are needed • Compilation will detect some (new Exception in structured streaming) • Runtime will throw you off: • Parsing dates • Data sources (v2 on the way) • Reference • https://spark.apache.org/docs/latest/migration-guide.html
  • 28. org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '2015-10-6' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string. SparkSession spark = SparkSession.builder() .appName("CSV to dataframe to Dataset<Book> and back") .master("local") .getOrCreate(); spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY"); or: SparkSession spark = SparkSession.builder() .appName("CSV to dataframe to Dataset<Book> and back") .master("local") .config("spark.sql.legacy.timeParserPolicy", "LEGACY") .getOrCreate(); Chapter 3 Lab #320 Lab #321 /jgperrin/net.jgp.books.spark.ch03
  • 30. The lakehouse is a full ecosystem Or is it an operating system? Streams Systems Files Other databases Systems Streams TBA? FilesOther databases Business Data science Data engineering Delta Lake & Delta Engine Outcome Processing & Storage Data sources
  • 31. Takeaways • Apache Spark v3 is a major update, 3400+ patches • Foundation for a rich data ecosystem • Python increasingly popular, beats Scala • Cornerstone for the lakehouse concept
  • 32. Thank you! http://jgp.ai/sia Join me for DataFriday: http://jgp.ai/datafriday
  • 34. Credits • World of Watson by Jean-Georges Perrin CC BY-SA 4.0 • Digital Garage by Jean-Georges Perrin CC BY-SA 4.0 • Figs, grapes and rosehips by Marco Verch Professional Photographer and Speaker, Flickr • Soup by Valeria Boltneva from Pexels