Migrating to Spark 2.0 - Part 2
1. Migrating to Spark 2.0 -
Part 2
Moving to the next-generation Spark
https://github.com/phatak-dev/spark-two-migration
2. ● Madhukara Phatak
● Technical Lead at Tellius
● Consultant and Trainer at
datamantra.io
● Consults in Hadoop, Spark
and Scala
● www.madhukaraphatak.com
3. Agenda
● What’s New in Spark 2.0
● Recap of Part 1
● Subqueries
● Catalog API
● Hive Catalog
● Refresh Table
● Checkpoint for Iteration
● References
4. What’s new in 2.0?
● Dataset is the new single user-facing abstraction
● The RDD abstraction is used only at runtime
● Higher performance with whole-stage code generation
● Significant changes to the streaming abstraction with
Spark Structured Streaming
● Incorporates learnings from 4 years of production use
● Spark ML is replacing MLlib as the de facto ML library
● Breaks API compatibility for better APIs and features
5. Need for Migration
● A lot of real-world code is written in the 1.x series of Spark
● As the fundamental abstractions have changed, all this
code needs to migrate to make use of the performance
and API benefits
● More and more ecosystem projects require Spark 2.0
● The 1.x series will be out of maintenance mode very
soon, so no more bug fixes
6. Recap of Part 1
● Choosing the Scala Version
● New Connectors
● SparkSession Entry Point
● Built-in CSV Connector
● Moving from the DF/RDD API to Dataset
● Cross Joins
● Custom ML Transformers
8. Subqueries
● A query inside another query is known as a subquery
● Standard feature of SQL
● Example from MySQL:
SELECT AVG(sum_column1)
FROM (SELECT SUM(column1) AS sum_column1
FROM t1 GROUP BY column1) AS t1;
● Highly useful, as they allow us to combine multiple
different types of aggregation in one query
9. Types of Subquery
● In the select clause (scalar)
SELECT employee_id,
age,
(SELECT MAX(age) FROM employee) max_age
FROM employee
● In the from clause (derived tables)
SELECT AVG(sum_column1)
FROM (SELECT SUM(column1) AS sum_column1
FROM t1 GROUP BY column1) AS t1;
● In the where clause (predicate)
SELECT * FROM t1 WHERE column1 = (SELECT column1 FROM t2);
10. Subquery Support in Spark 1.x
● Subquery support in Spark 1.x mimics the support
available in Hive 0.12
● Hive only supported subqueries in the from clause, so
Spark supported only the same
● Subqueries in the from clause are fairly limited in what
they can do
● To support advanced querying in Spark SQL, the other
kinds of subqueries needed to be added in 2.0
11. Subquery Support in Spark 2.x
● Spark greatly improved its SQL dialect support in the
2.0 release
● Most of the standard features of the SQL-92 standard
have been added
● Full-fledged SQL parser, no more dependence on Hive
● Runs all 99 TPC-DS queries natively
● Makes Spark a full-fledged OLAP query engine
12. Scalar Subqueries
● Scalar subqueries are subqueries that return a
single (scalar) result
● There are two kinds of scalar subqueries:
○ Uncorrelated subqueries
The ones which don’t depend upon the outer query
○ Correlated subqueries
The ones which depend upon the outer query
13. Uncorrelated Scalar Subqueries
● Add the maximum sales amount to each row of the
sales data
● This normally helps us understand how far a given
transaction is from the maximum sale we have made
● In Spark 1.x
sparkone.SubQueries
● In Spark 2.x
sparktwo.SubQueries
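In Spark 2.x this can be expressed directly as a scalar subquery in the select clause. A minimal sketch, assuming a SparkSession `spark` and a registered temp view `sales` with hypothetical `item_id` and `amount` columns:

```scala
// Uncorrelated scalar subquery: the inner query runs once
// and its single result is attached to every outer row.
val withMax = spark.sql("""
  SELECT item_id,
         amount,
         (SELECT MAX(amount) FROM sales) AS max_amount
  FROM sales
""")
withMax.show()
```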
14. Correlated Subqueries
● Add the maximum sales amount to each row of the
sales data within each item category
● This normally helps us understand how far a given
transaction is from the maximum sale we have made in
that category
● In Spark 1.x
sparkone.SubQueries
● In Spark 2.x
sparktwo.SubQueries
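In 2.x the per-category version becomes a correlated scalar subquery, where the inner query references the outer row. A sketch using the same hypothetical `sales` view, additionally assuming a `category` column:

```scala
// Correlated scalar subquery: the inner MAX is computed
// against the outer row's category (supported only from Spark 2.0)
val withCategoryMax = spark.sql("""
  SELECT item_id,
         category,
         amount,
         (SELECT MAX(s2.amount) FROM sales s2
          WHERE s2.category = s1.category) AS max_in_category
  FROM sales s1
""")
withCategoryMax.show()
```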
16. Catalog in SQL
● A catalog is a metadata store which contains all the
metadata of a SQL system
● Typical contents of a catalog are:
○ Databases
○ Tables
○ Table Metadata
○ Functions
○ Partitions
○ Buckets
17. Catalog in Spark 1.x
● By default, Spark uses an in-memory catalog which keeps
track of Spark temp tables
● It is not persistent
● For any persistent operations, Spark advocated use of
the Hive metastore
● There was no standard API to query metadata
information from the in-memory / Hive metastore
● Ad hoc functions were added to SQLContext over time to fix
this
18. Need for a Catalog API
● Many interactive applications, like notebook systems,
often need an API to query the metastore to show relevant
information to the user
● Whenever we integrate with Hive, without a catalog API
we have to resort to running HQL queries and parsing
their output to get the data
● Cannot manipulate the Hive metastore directly from the Spark
API
● Needs to evolve to support more metastores in the future
19. Catalog API in 2.x
● Spark 2.0 has added a full-fledged catalog API to
SparkSession
● It lives in the sparkSession.catalog namespace
● This catalog has APIs to create, read and delete
elements in the in-memory store and also in the Hive metastore
● Having this standard API to interact with the catalog makes
developers’ lives much easier than before
● If we were using the non-standard APIs before, it’s time to
migrate
20. Catalog API Migration
● Migrate from the sqlContext APIs to the
sparkSession.catalog API
● Use SparkSession rather than HiveContext to
have access to the special operations
● Spark 1.x
sparkone.CatalogExample
● Spark 2.x
sparktwo.CatalogExample
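The 2.x side of the migration can be sketched as follows; the app name and view name are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CatalogExample")
  .master("local[*]")
  .getOrCreate()

spark.range(10).createOrReplaceTempView("numbers")

// The standard catalog API replaces the ad hoc SQLContext methods
spark.catalog.listDatabases().show()
spark.catalog.listTables().show()
spark.catalog.listColumns("numbers").show()
```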
22. Hive Integration in Spark 1.x
● Spark SQL had native support for Hive from the
beginning
● Initially, Spark SQL used the Hive query parser for
parsing and the Hive metastore for persistent storage
● To integrate with Hive, one had to create a HiveContext,
which is separate from SQLContext
● Some APIs were only available on SQLContext and
some Hive-specific ones on HiveContext
● No support for manipulating the Hive metastore
23. Hive Integration in 2.x
● No more separate HiveContext
● SparkSession has an enableHiveSupport API to enable
Hive support
● This makes both the Spark SQL and Hive APIs consistent
● The Spark 2.0 catalog API also supports the Hive metastore
● Example
sparktwo.CatalogHiveExample
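A minimal sketch of the new builder option; the app name is illustrative, and Hive classes must be on the classpath:

```scala
import org.apache.spark.sql.SparkSession

// Hive support is now a builder option on SparkSession,
// replacing the separate HiveContext
val spark = SparkSession.builder()
  .appName("CatalogHiveExample")
  .enableHiveSupport()
  .getOrCreate()

// The same catalog API now talks to the Hive metastore
spark.catalog.listTables("default").show()
```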
25. Need for a Refresh Table API
● In Spark, we cache a table (dataset) for performance
reasons
● Spark caches the metadata in its metadata store and the
actual data in the block manager
● If the underlying file/table changes, there was no direct API
in Spark to force a table refresh
● If you just uncache/recache, it will only reflect the
change in the data, not the metadata
● So we need a standard way to refresh the table
26. Refresh Table and By Path
● Spark 2.0 provides two APIs for refreshing datasets in
Spark
● The refreshTable API, which was imported from
HiveContext, is used for registered temp tables or
Hive tables
● The refreshByPath API is used for refreshing datasets
without having to register them as tables beforehand
● Spark 1.x
sparkone.RefreshExample
● Spark 2.x - sparktwo.RefreshExample
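The two APIs can be sketched as follows; the table name and path are illustrative, and a SparkSession `spark` is assumed:

```scala
// Refresh cached metadata and data for a registered table
spark.catalog.refreshTable("sales")

// Refresh any cached datasets backed by this path,
// without registering them as tables first
spark.catalog.refreshByPath("/data/sales")
```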
28. Iterative Programming in Spark
● Spark is one of the first big data frameworks to have
great native support for iterative programming
● Iterative programs go over the data again and again to
compute some result
● Spark ML is one of the iterative frameworks in Spark
● Even though caching and the RDD mechanisms worked
great with iterative programming, moving to DataFrames
has created new challenges
30. Iteration in the DataFrame API
● As every step of the iteration creates a new DF, the logical
plan keeps growing
● As Spark needs to keep the complete query plan for
recovery, the overhead of analysing the plan increases as the
number of iterations increases
● This overhead is compute-bound and incurred at the master
● As this overhead increases, it makes iteration very
slow
● Ex : sparkone.CheckPoint
31. Solution to the Query Plan Issue
● To solve the ever-growing query plan (lineage), we need to
truncate it to make iteration faster
● Whenever we truncate the query plan, we lose the
ability to recover
● To avoid that, we need to store the intermediate data
before we truncate the query plan
● Saving intermediate data along with truncating the query
plan results in faster performance
32. Dataset Checkpoint API
● In Spark 2.1, there is a new checkpoint API on Dataset
● It’s analogous to the RDD checkpoint API
● In RDD, checkpoint is used to persist the RDD and then
truncate its lineage
● Similarly, in the case of a Dataset, checkpoint will persist the
dataset and then truncate its query plan
● Make sure that the checkpoint times are much lower than the
overhead you are facing
● Ex : sparktwo.CheckPoint
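A minimal sketch of using Dataset checkpointing (Spark 2.1+) inside an iteration; the loop body, iteration count and checkpoint directory are illustrative:

```scala
import org.apache.spark.sql.functions.col

// Checkpointing needs a reliable checkpoint directory
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

var df = spark.range(1000).toDF("value")
for (i <- 1 to 50) {
  df = df.withColumn("value", col("value") + 1)
  // Every few iterations, persist the intermediate data
  // and truncate the ever-growing query plan
  if (i % 10 == 0) df = df.checkpoint()
}
```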
34. Best Practice Migration
● As the fundamental abstractions have changed in Spark 2.0,
we need to rethink our best practices
● Many best practices were centred around the RDD
abstraction, which is no longer the central abstraction
● Also, there are many optimisations in Catalyst, where
much of the optimisation is done for us by the platform
● So let’s look at some best practices of 1.x and see how
we can change them
35. Choice of Serializer
● Use the Kryo serializer over Java and register classes
with Kryo
● This best practice was devised for efficient caching and
transfer of RDD data
● But in Spark 2.0, Dataset uses a custom code-generated
serialization framework for most of the code and data
● So unless there is heavy use of RDDs in your project, you
don’t need to worry about the serializer in 2.0
36. Cache Format
● RDD uses MEMORY_ONLY as the default, and it’s the most
efficient caching for RDDs
● DataFrame/Dataset uses MEMORY_AND_DISK rather
than MEMORY_ONLY
● Recomputing a Dataset and converting it to the custom
serialization format is often costly
● So use MEMORY_AND_DISK as the default format over
MEMORY_ONLY
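A sketch of making the storage level explicit, assuming an existing Dataset `df`:

```scala
import org.apache.spark.storage.StorageLevel

// Dataset.cache() already defaults to MEMORY_AND_DISK;
// spell out the level when calling persist directly
val cached = df.persist(StorageLevel.MEMORY_AND_DISK)
```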
37. Use of Broadcast Variables
● Use broadcast variables for optimising lookups and
joins
● Broadcast variables played an important role in making
joins efficient in the RDD world
● These variables don’t have much scope in Dataset API
land
● By configuring the broadcast threshold, Spark SQL will do the
broadcasting automatically
● Don’t use them unless there is a reason
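The automatic and explicit variants can be sketched as follows; `bigDf`, `smallDf` and the join key are illustrative:

```scala
import org.apache.spark.sql.functions.broadcast

// Tables smaller than this threshold (in bytes) are
// broadcast automatically in joins
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)

// Or hint a broadcast explicitly for a specific join
val joined = bigDf.join(broadcast(smallDf), "item_id")
```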
38. Choice of Clusters
● Use YARN/Mesos for production; standalone is
mostly for simple apps
● Users were encouraged to use a dedicated cluster
manager over the standalone one shipped with Spark
● With Databricks Cloud putting its weight behind the standalone
cluster manager, it has become ready for production use
● Many companies run their Spark applications on a
standalone cluster today
● Choose standalone if you run only Spark applications
39. Use of HiveContext
● Use HiveContext over SQLContext for using Spark
SQL
● In Spark 1.x, Spark SQL was simplistic and depended
heavily on Hive for query parsing
● In Spark 2.0, Spark SQL is enriched and is now more
powerful than Hive itself
● Most of the Hive UDFs are now rewritten in Spark SQL
and code-generated
● Unless you want to use the Hive metastore, use
SparkSession without Hive
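A sketch of the Hive-free setup; the app name is illustrative:

```scala
import org.apache.spark.sql.SparkSession

// A plain SparkSession; Spark SQL no longer needs Hive for parsing
val spark = SparkSession.builder()
  .appName("NoHiveNeeded")
  .master("local[*]")
  .getOrCreate()

spark.sql("SELECT 1 + 1 AS result").show()
```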