Apache Spark v3 is a new milestone for the Big Data framework. In this session, you will (re)discover what Spark is, learn about the new features in its third major version, and go through a complete end-to-end project.
I like to call Spark an Analytics Operating System: it offers far more than just a framework or a library, and I will explain why. Spark v3 is the latest major evolution. It was released in mid-June 2020 and adds impressive new features. After looking at them from a high level, I will detail a few of my favorites.
Finally, as we all like code (well, at least I do), I will demonstrate a complete data & AI pipeline looking at Covid-19 data.
Key takeaways: Spark as an Analytics OS, Spark v3 highlights, building data/AI pipelines/models with Spark.
Audience: software engineers, data engineers, architects, data scientists.
11. Data Engineer vs. Data Scientist

Data Engineer:
• Develops, builds, tests, and operationalizes datastores and large-scale processing systems. DataOps is the new DevOps.
• Matches architecture with business needs.
• Develops processes for data modeling, mining, and pipelines.
• Improves data reliability and quality.

Data Scientist:
• Cleans, massages, and organizes data.
• Performs statistics and analysis to develop insights, build models, and search for innovative correlations.
• Prepares data for predictive models.
• Explores data to find hidden gems and patterns.
• Tells stories to key stakeholders.
Sources:
Adapted from https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
14. Python rules in notebooks

Sources:
Matei Zaharia, Spark + AI Summit 2020, https://youtu.be/p4PkA2huzVc
Databricks blog, https://databricks.com/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html
15. A few more figures
Who does not like performance figures?
• Databricks:
  • Processes >5T records/day with Structured Streaming (introduced in Spark v2.0, stable in Spark v2.2)
  • >90% of Spark API calls go through Spark SQL, regardless of the language used
• Community:
  • Spark v3.0 is roughly two times faster than Spark v2.4 on the TPC-DS 30TB benchmark
Sources:
Matei Zaharia, Spark + AI Summit 2020, https://youtu.be/p4PkA2huzVc
Databricks blog, https://databricks.com/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html
Spark v3.0.0 release notes, https://spark.apache.org/releases/spark-release-3-0-0.html
23. Always a soup
Spark SQL
• Finally a reference guide: http://jgp.ai/sparksql
• EXPLAIN can now be FORMATTED
• Proleptic Gregorian calendar, based on Java 8
• Overflow checks
• ANSI compatibility through a configuration flag
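Two of these features, EXPLAIN FORMATTED and the ANSI flag, are easy to try together. Below is a minimal sketch, reusing the books.csv file and authorId column that appear later in this deck; the application name and view name are made up for the example, and a local Spark installation is assumed:

```java
import org.apache.spark.sql.SparkSession;

public class ExplainFormattedApp {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("EXPLAIN FORMATTED demo") // hypothetical name
        .master("local")
        // Opt in to ANSI compatibility (e.g., fail on integer overflow
        // instead of silently wrapping around)
        .config("spark.sql.ansi.enabled", "true")
        .getOrCreate();

    // Register the CSV file as a temporary view so it can be queried in SQL
    spark.read().format("csv")
        .option("header", "true")
        .load("data/books.csv")
        .createOrReplaceTempView("books");

    // FORMATTED splits the plan into a readable operator outline
    // followed by per-node details
    spark.sql("EXPLAIN FORMATTED SELECT title FROM books WHERE authorId = 1")
        .show(false);

    spark.stop();
  }
}
```

The same FORMATTED mode is available from the Spark SQL shell, which is handy when you only want to inspect a plan without writing an application.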
24. Ingestion
Who needs a push down?
• Already available in databases
• Lets you filter what you ingest, before you ingest it
• Equivalent to, but easier than, ingesting everything and filtering afterwards
25. String sqlQuery =
    "select actor.first_name, actor.last_name, film.title, "
        + "film.description "
        + "from actor, film_actor, film "
        + "where actor.actor_id = film_actor.actor_id "
        + "and film_actor.film_id = film.film_id";
Dataset<Row> df = spark.read().jdbc(
    "jdbc:mysql://localhost:3306/sakila",
    "(" + sqlQuery + ") actor_film_alias",
    props);
Will only ingest the result of the MySQL query
/jgperrin/net.jgp.books.spark.ch08
Chapter 8
Lab #310
26. +---+--------+----------------------------------------------------------------------+-----------+----------------------+
| id|authorId| title|releaseDate| link|
+---+--------+----------------------------------------------------------------------+-----------+----------------------+
| 1| 1| Fantastic Beasts and Where to Find Them: The Original Screenplay| 11/18/2016|http://amzn.to/2kup94P|
| 2| 1|Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Har...| 10/06/2015|http://amzn.to/2l2lSwP|
| 3| 1| The Tales of Beedle the Bard, Standard Edition (Harry Potter)| 12/04/2008|http://amzn.to/2kYezqr|
| 4| 1|Harry Potter and the Chamber of Secrets: The Illustrated Edition (H...| 10/04/2016|http://amzn.to/2kYhL5n|
| 5| 2|Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the ...| 04/23/2017|http://amzn.to/2i3mthT|
| 6| 2|Development Tools in 2006: any Room for a 4GL-style Language? An i...| 12/28/2016|http://amzn.to/2vBxOe1|
| 7| 3| Adventures of Huckleberry Finn| 05/26/1994|http://amzn.to/2wOeOav|
…
Dataset<Row> df = spark.read().format("csv")
…
.load("data/books.csv")
.filter("authorId = 1");
+---+--------+----------------------------------------------------------------------+-----------+----------------------+
| id|authorId| title|releaseDate| link|
+---+--------+----------------------------------------------------------------------+-----------+----------------------+
| 1| 1| Fantastic Beasts and Where to Find Them: The Original Screenplay| 11/18/2016|http://amzn.to/2kup94P|
| 2| 1|Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Har...| 10/06/2015|http://amzn.to/2l2lSwP|
| 3| 1| The Tales of Beedle the Bard, Standard Edition (Harry Potter)| 12/04/2008|http://amzn.to/2kYezqr|
| 4| 1|Harry Potter and the Chamber of Secrets: The Illustrated Edition (H...| 10/04/2016|http://amzn.to/2kYhL5n|
+---+--------+----------------------------------------------------------------------+-----------+----------------------+
Will only ingest books where authorId is 1
/jgperrin/net.jgp.books.spark.ch07
Chapter 7
Lab #201
27. Migration tips
Yes, they are needed
• Compilation will detect some issues (e.g., new exceptions in Structured Streaming)
• Runtime will throw you off:
  • Parsing dates
  • Data sources (v2 on the way)
• Reference
  • https://spark.apache.org/docs/latest/migration-guide.html
28. org.apache.spark.SparkUpgradeException: You may get a different result due to
the upgrading of Spark 3.0: Fail to parse '2015-10-6' in the new parser. You can
set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before
Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
SparkSession spark = SparkSession.builder()
    .appName("CSV to dataframe to Dataset<Book> and back")
    .master("local")
    .getOrCreate();
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY");

or:

SparkSession spark = SparkSession.builder()
    .appName("CSV to dataframe to Dataset<Book> and back")
    .master("local")
    .config("spark.sql.legacy.timeParserPolicy", "LEGACY")
    .getOrCreate();
Chapter 3
Lab #320
Lab #321
/jgperrin/net.jgp.books.spark.ch03
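The exception above also mentions a CORRECTED policy. As a sketch of that alternative, with the same session setup as the lab code, you can opt into the new Java 8-based parser explicitly; this is a configuration fragment, not a full application:

```java
SparkSession spark = SparkSession.builder()
    .appName("CSV to dataframe to Dataset<Book> and back")
    .master("local")
    // CORRECTED: use the new Proleptic Gregorian parser and treat
    // strings only the legacy parser accepted (such as '2015-10-6')
    // as invalid datetime strings
    .config("spark.sql.legacy.timeParserPolicy", "CORRECTED")
    .getOrCreate();
```

LEGACY keeps the pre-3.0 behavior, while CORRECTED commits to the new semantics; the migration guide linked above discusses which to choose.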
30. The lakehouse is a full ecosystem
Or is it an operating system?
[Diagram: data sources (streams, systems, files, other databases) feed Delta Lake & Delta Engine for processing & storage, which deliver outcomes for business, data science, and data engineering]
31. Takeaways
• Apache Spark v3 is a major update, 3400+ patches
• Foundation for a rich data ecosystem
• Python increasingly popular, beats Scala
• Cornerstone for the lakehouse concept
34. Credits
• World of Watson by Jean-Georges Perrin CC BY-SA 4.0
• Digital Garage by Jean-Georges Perrin CC BY-SA 4.0
• Figs, grapes and rosehips by Marco Verch Professional Photographer and
Speaker, Flickr
• Soup by Valeria Boltneva from Pexels