Introduction to Data Source V2 API
Next Generation Datasource API for Spark 2.0
https://github.com/phatak-dev/spark2.0-examples
● Madhukara Phatak
● Technical Lead at Tellius
● Work on Hadoop, Spark, ML and Scala
● www.madhukaraphatak.com
Agenda
● Structured Data Processing
● Shortcomings of Datasource API
● Goals of Datasource V2
● Anatomy of Datasource V2
● In Memory Datasource
● MySQL Data Source
Structured Data Processing
Spark SQL Architecture
[Architecture diagram, bottom to top: CSV, JSON, JDBC → Data Source API → DataFrame API → Spark SQL and HQL / DataFrame DSL]
Spark SQL Components
● Data source API
Universal API for loading/saving structured data
● DataFrame API
Higher level representation for structured data
● SQL interpreter and optimizer
Express data transformation in SQL
● SQL service
Hive thrift server
Data source API
● Universal API for loading/saving structured data (usage sketch below)
● Built-in support for Hive, Avro, JSON, JDBC, Parquet
● Third-party integration through spark-packages
● Support for smart sources
● Introduced in Spark 1.3, along with DataFrame
● Third parties already supporting:
○ CSV
○ MongoDB
○ Cassandra (in the works)
etc.
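To illustrate the "universal API" point, a minimal sketch of format-agnostic load/save; the paths and format choices are placeholders:

import org.apache.spark.sql.SparkSession

object DataSourceApiDemo extends App {
  val spark = SparkSession.builder()
    .appName("datasource-api-demo")
    .master("local[*]")
    .getOrCreate()

  // Only the format string changes between sources;
  // the rest of the read/write API is identical
  val json = spark.read.format("json").load("/tmp/input.json")
  json.write.format("parquet").save("/tmp/output.parquet")

  spark.stop()
}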
Datasource V1 API
● Introduced in Spark 1.3
● Spark has evolved quite a bit since 1.3:
○ Custom Memory Management
○ Dataset abstraction
○ Structured Streaming
● The Datasource API has not evolved along with the new versions of Spark
Difficulty in evolving DataSource API
● The data source is the lowest-level abstraction in structured processing; it talks to the data sources directly
● Datasources are often written by third-party vendors to connect to different sources
● They are not updated as often as Spark's own code, which makes changing the API challenging
● So the data source API remained the same while the rest of Spark changed quite a bit
Shortcomings of V1 API
Dependency on High Level API
● Data sources are the lowest-level abstraction in the stack
● The data source V1 API depended on high-level user-facing abstractions like SQLContext and DataFrame
def createRelation(sqlContext: SQLContext,
parameters: Map[String, String]): BaseRelation
● As Spark evolved, these abstractions got deprecated and replaced by better ones
Dependency on High Level API
● But as the data source API did not change, it stayed stuck with the old abstractions (see the V1 sketch below)
● Having low-level abstractions depend upon high-level abstractions is not a good idea
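To make the coupling concrete, a minimal sketch of a V1 source; the hard-coded data is a placeholder, while RelationProvider, BaseRelation and TableScan are the actual V1 traits:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Both the provider and the relation must hold on to SQLContext,
// a high-level user-facing abstraction
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new SimpleRelation(sqlContext)
}

class SimpleRelation(val sqlContext: SQLContext)
    extends BaseRelation with TableScan {

  override def schema: StructType =
    StructType(Array(StructField("value", IntegerType)))

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row(1), Row(2), Row(3)))
}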
Lack of Support for Columnar Read
def buildScan(): RDD[Row]
● From the above API, it's apparent that the API reads the data in row format
● But many analytics sources today are columnar in nature
● So if the underlying columnar source is read into row format, it loses all the performance benefits
Lack of Partition and Sorting Info
● Many data sources distribute their data over multiple machines using partitioning
● In the datasource V1 API there was no support for sharing partition locality information with the Spark engine
● This resulted in random reads from the Spark engine, with a lot of network traffic
● Spark's built-in sources solved this issue using internal APIs
● But all third-party connectors suffered from it
No Transaction Support in Write
● The V1 API only supported a generic write interface, primarily meant for write-once sources like HDFS
● The interface did not have any transactional support, which is needed for sophisticated sources like databases
● For example, when data is partially written to a database and the job aborts, the API will not clean up those rows
● It's not an issue in HDFS, because unsuccessful writes are tracked using the _SUCCESS marker file
Limited Extensibility
● The data source V1 API only supports filter push-down and column pruning
● But many smart sources, i.e. data sources with processing power, have more capabilities than that
● These sources can do sorting and aggregation at the source level itself
● Currently the data source API doesn't have a good mechanism to push more Catalyst expressions down to the underlying source (V1's widest scan interface is sketched below)
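For reference, V1's most capable scan trait, org.apache.spark.sql.sources.PrunedFilteredScan, exposes only pruned columns and pushed filters:

trait PrunedFilteredScan {
  // requiredColumns supports column pruning, filters supports filter push-down;
  // there is no hook here for sorts, aggregates or other Catalyst expressions
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
}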
Introduction to Datasource V2 API
V2 API
● Datasource V2 is a new API introduced in Spark 2.3 to address the shortcomings of the V1 API
● The V2 API mimics the simplicity of the Hadoop input/output layers while keeping all the powerful features of V1
● Currently it's in beta; it will become GA in a future release
● The V1 API will then be deprecated
● No user-facing code change is needed to use V2 data sources (see the usage sketch below)
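To illustrate the last point, a spark-shell style sketch of consuming a V2 source; "com.example.MyV2Source" is a hypothetical class name, and the call is identical to reading any V1 format:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("v2-usage")
  .master("local[*]")
  .getOrCreate()

// The format string is just the fully qualified class name of the source;
// nothing here reveals whether the implementation is V1 or V2
val df = spark.read
  .format("com.example.MyV2Source")
  .load()
df.show()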
Goals
● No dependency on high-level abstractions like SparkContext, DataFrame, etc.
● Java Friendly
● Support for Filter Pushdown and Pruning
● Support for Partition locality
● Support for both columnar and row based reading
● Transaction support in write interface
● Get rid of internal APIs like HadoopFsRelation
Anatomy of V2 Read API
Java API
● Being Java friendly is one of the goals of the V2 API
● So all the base interfaces of the V2 API are defined as Java interfaces
● This makes interoperability with Java much easier compared to the V1 API
● It's a little painful in Scala to deal with the Java models
● The next slides cover all the different interfaces in the read path
Interfaces
[Diagram: user code on the master creates a Datasource Reader, which creates Data Reader Factories; each worker runs a Data Reader]
DataSourceReader Interface
● Entry point to the data source
● Returned by the createReader method of the ReadSupport interface
● Has two methods:
○ def readSchema(): StructType
○ def createDataReaderFactories: List[DataReaderFactory]
● Responsible for schema inference and for creating the data reader factories
● What might the list indicate here? (see the sketch below)
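A minimal in-memory sketch against the Spark 2.3 interfaces; SimpleDataReaderFactory is sketched under the next slide:

import java.util.{ArrayList, List => JList}

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.reader.{DataReaderFactory, DataSourceReader}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

class SimpleDataSourceReader extends DataSourceReader {

  // Schema is hard-coded here; a real source would infer it from the data
  override def readSchema(): StructType =
    StructType(Array(StructField("value", IntegerType)))

  // One factory per partition; a one-element list means a single partition,
  // which answers the question above
  override def createDataReaderFactories(): JList[DataReaderFactory[Row]] = {
    val factories = new ArrayList[DataReaderFactory[Row]]()
    factories.add(new SimpleDataReaderFactory())
    factories
  }
}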
DataReaderFactory Interface
● Follows the factory design pattern of Java to create the actual data reader
● This code runs on the master and may be responsible for code common across the data readers, like JDBC connections
● It has one method, to create the data reader (sketch below):
def createDataReader: DataReader
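Continuing the sketch, the factory just hands out readers; it is created on the master and shipped to a worker, so it must be serializable (DataReaderFactory already extends Serializable):

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory}

class SimpleDataReaderFactory extends DataReaderFactory[Row] {
  // Called on the worker; any fields of this class travel over the wire
  override def createDataReader(): DataReader[Row] = new SimpleDataReader()
}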
DataReader Interface
● Interface responsible for the actual reading of the data
● Runs on the worker nodes
● Should be serializable
● Methods exposed are:
def next: Boolean
def get: T
● Looks very similar to the Iterator interface (sketch below)
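And the reader itself, iterating over a hard-coded in-memory array in the next/get style shown above:

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.reader.DataReader

class SimpleDataReader extends DataReader[Row] {
  private val values = Array(1, 2, 3, 4, 5)
  private var index = 0

  // next() reports whether a row is available; get() returns it and advances
  override def next(): Boolean = index < values.length

  override def get(): Row = {
    val row = Row(values(index))
    index += 1
    row
  }

  // DataReader extends Closeable; nothing to release for an in-memory array
  override def close(): Unit = ()
}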
Observations from API
● The API doesn't use any high-level APIs like SparkContext, DataFrame, etc.
● The return type of DataReader is T, which suggests that it can support both row and columnar reads
● The reader interface is quite simple, so it can be used for a wide variety of sources
● No more fiddling with RDDs in the data source layer
In-memory Data Source
Simple In-Memory Interface
● An in-memory source which reads the data from an in-memory array
● Implements all the interfaces discussed earlier
● Has a single partition
● Shows how all the different APIs come together to build a full-fledged source
● Ex : SimpleDataSource.scala (entry-point sketch below)
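A sketch of what the entry point might look like (the actual SimpleDataSource.scala in the linked repo may differ); it wires the ReadSupport mix-in to the reader from the earlier slides:

import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.DataSourceReader

class SimpleDataSource extends DataSourceV2 with ReadSupport {
  // Spark calls this on the master when the format is resolved
  override def createReader(options: DataSourceOptions): DataSourceReader =
    new SimpleDataSourceReader()
}

// Usage: spark.read.format("<fully qualified name of SimpleDataSource>").load()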
Multiple Partitions
● In this example, we extend our simple in-memory data source to have multiple partitions
● In the code, we have multiple data reader factories, compared to the single one in the earlier example (sketch below)
● In the data reader code, we track the partition using its start and end
● Mimics the HDFS InputFormat
● Ex : SimpleMultiDataSource.scala
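A sketch of the multi-partition variant under the same Spark 2.3 interfaces; the ranges are placeholders, with each factory carrying a start/end pair like an HDFS split:

import java.util.{ArrayList, List => JList}

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory, DataSourceReader}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// One factory per slice of the data; start/end play the role of an input split
class RangeReaderFactory(start: Int, end: Int) extends DataReaderFactory[Row] {
  override def createDataReader(): DataReader[Row] = new DataReader[Row] {
    private var current = start
    override def next(): Boolean = current < end
    override def get(): Row = { val row = Row(current); current += 1; row }
    override def close(): Unit = ()
  }
}

class SimpleMultiDataSourceReader extends DataSourceReader {
  override def readSchema(): StructType =
    StructType(Array(StructField("value", IntegerType)))

  // Two factories => the resulting DataFrame has two partitions
  override def createDataReaderFactories(): JList[DataReaderFactory[Row]] = {
    val factories = new ArrayList[DataReaderFactory[Row]]()
    factories.add(new RangeReaderFactory(0, 5))
    factories.add(new RangeReaderFactory(5, 10))
    factories
  }
}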
Filter Push
Filter Push with MySQL Datasource
● Filter push-down allows Spark to push Spark SQL filters down to the data source
● Smart sources, like relational databases, have the capability to filter efficiently at the source
● In this example, filter push-down is implemented for a MySQL source (sketch below)
● This source uses the JDBC interface to communicate with MySQL
● Ex : SimpleMysqlDataSource.scala
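A sketch of the reader side of filter push-down; SupportsPushDownFilters is the actual Spark 2.3 interface, while the schema, table name and JDBC plumbing are placeholders:

import java.util.{ArrayList, List => JList}

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.{EqualTo, Filter, GreaterThan}
import org.apache.spark.sql.sources.v2.reader.{DataReaderFactory, DataSourceReader, SupportsPushDownFilters}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

class SimpleMysqlReader extends DataSourceReader with SupportsPushDownFilters {
  private var supported: Array[Filter] = Array.empty

  override def readSchema(): StructType =
    StructType(Array(StructField("value", IntegerType)))

  // Keep the filters we can translate into SQL; the ones we return
  // are re-evaluated by Spark on top of our output
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (pushable, rest) = filters.partition {
      case _: EqualTo | _: GreaterThan => true
      case _                           => false
    }
    supported = pushable
    rest
  }

  override def pushedFilters(): Array[Filter] = supported

  // A real implementation would render `supported` into the WHERE clause of
  // the JDBC query, e.g. "select value from my_table where value > 5",
  // and return one factory per resulting split
  override def createDataReaderFactories(): JList[DataReaderFactory[Row]] =
    new ArrayList[DataReaderFactory[Row]]()
}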
References
● http://blog.madhukaraphatak.com/categories/datasource-v2-series
● https://databricks.com/blog/2018/02/28/introducing-apache-spark-2-3.html
● https://issues.apache.org/jira/browse/SPARK-15689