SlideShare une entreprise Scribd logo
1  sur  43
Big data processing with Apache
Spark and Oracle database
Martin Toshev
Who am I
Software consultant (CoffeeCupConsulting)
BG JUG board member (http://jug.bg)
(BG JUG is a 2018 Oracle Duke’s
choice award winner)
Agenda
• Apache Spark from an eagle’s eye
• Apache Spark capabilities
• Using Oracle RDBMS as a Spark datasource
Apache Spark from an Eagle’s eye
Highlights
• A framework for large-scale distributed data processing
• Originally in Scala but extended with Java, Python and R
• One of the most contributed
open source/Apache/GitHub projects with over 1400
contributors
Spark vs MapReduce
• Spark has been developed in order to address the
shortcomings of the MapReduce programming model
• In particular MapReduce is unsuitable for:
– real-time processing (suitable for batch processing of present data)
– operations not limited to the key-value format of data
– large data on a network
– online transaction processing
– graph processing
– sequential program execution
Spark vs Hadoop
• Spark is faster as it depends more on RAM usage and
tries to minimize disk IO (on the storage system)
• Spark however can still use Hadoop:
– as a storage engine (HDFS)
– as a compute engine (MapReduce or Hadoop YARN)
• Spark has pluggable storage and compute engine
architecture
Spark components
Spark Framework
Spark Core
Spark
Streaming
MLib GraphXSpark SQL
Spark architecture
SparkContext
(driver)
Cluster
manager
Worker
node
Worker
node
Worker
node
Spark application
(JAR)
Input data
sources
Output data
sources
Apache Spark capabilities
Spark datasets
• The building block of Spark are RDDs (Resilient
Distributed Datasets)
• They are immutable collections of objects spread across
a Spark cluster and stored in RAM or on disk
• Created by means of distributed transformations
• Rebuilt on failure of a Spark node
Spark datasets
• The DataFrame API is a superset of RDDs introduced in
Spark 2.0
• The Dataset API provides a way to work with a
combination of RDDs and DataFrames
• The DataFrame API is preferred compared to RDDs due to
improved performance and more advanced operations
Spark datasets
List<Item> items = …;
SparkConf configuration = new
SparkConf().setAppName(“ItemsManager").setMaster("local");
JavaSparkContext context =
new JavaSparkContext(configuration);
JavaRDD<Item> itemsRDD = context.parallelize(items);
Spark transformations
map itemsRDD.map(i -> { i.setName(“phone”);
return i;});
filter itemsRDD.filter(i ->
i.getName().contains(“phone”))
flatMap itemsRDD.flatMap(i ->
Arrays.asList(i, i).iterator());
union itemsRDD.union(newItemsRDD);
intersection itemsRDD.intersection(newItemsRDD);
distinct itemsRDD.distinct()
cartesian itemsRDD.cartesian(otherDatasetRDD)
Spark transformations
groupBy pairItemsRDD = itemsRDD.mapToPair(i ->
new Tuple2(i.getType(), i));
modifiedPairItemsRDD =
pairItemsRDD.groupByKey();
reduceByKey pairItemsRDD = itemsRDD.mapToPair(o ->
new Tuple2(o.getType(), o));
modifiedPairItemsRDD =
pairItemsRDD.reduceByKey((o1, o2) ->
new Item(o1.getType(),
o1.getCount() + o2.getCount(),
o1.getUnitPrice())
);
• Other transformations include aggregateByKey,
sortByKey, join, cogroup …
Spark actions
• Spark actions are the terminal operations that produce
results from the transformations
• Actions are a way to communicate back from the
execution engine to the Spark driver instance
Spark actions
collect itemsRDD.collect()
reduce itemsRDD.map(i ->
i.getUnitPrice() * i.getCount()).
reduce((x, y) -> x + y);
count itemsRDD.count()
first itemsRDD.first()
take itemsRDD.take(4)
takeOrdered itemsRDD.takeOrdered(4, comparator)
foreach itemsRDD.foreach(System.out::println)
saveAsTextFile itemsRDD.saveAsTextFile(path)
saveAsObjectFile itemsRDD.saveAsObjectFile(path)
DataFrames/DataSets
• A dataframe can be created using an instance of the
org.apache.spark.sql.SparkSession class
• The DataFrame/DataSet APIs provide more advanced
operations and the capability to run SQL queries on the
data
itemsDS.createOrReplaceTempView(“items");
session.sql("SELECT * FROM items");
DataFrames/DataSets
• An existing RDD can be converted to a Spark dataframe:
• An RDD can be retrieved from a dataframe as well:
SparkSession session =
SparkSession.builder().appName("app").getOrCreate();
Dataset<Row> itemsDS =
session.createDataFrame(itemsRDD, Item.class);
itemsDS.rdd()
Spark data sources
• Spark can receive data from a variety of data sources in a
variety of ways (batching, real-time streaming)
• These datasources might be:
– files: Spark supports reading data from a variety of formats (JSON, CSV, Avro,
etc.)
– relational databases: using JDBC/ODBC driver Spark can extract data from an
RDBMS
– TCP sockets, messaging systems: using streaming capabilities of Spark data
can be read from messaging systems and raw TCP sockets
Spark data sources
• Spark provides support for operations on batch data or
real time data
• For real time data Spark provides two main APIs:
– Spark streaming is an older API working on RDDs
– Spark structured streaming is a newer API working on DataFrames/DataSets
Spark data sources
• Spark provides capabilities to plug-in additional data
sources not supported by Spark
• For streaming sources you can define your own custom
receivers
Spark streaming
• Data is divided into batches called Dstreams
(decentralized streams)
• Typical use case is the integration of Spark with
messaging systems such as Kafka, RabbitMQ and
ActiveMQ etc.
• Fault tolerance can be enabled in Spark Streaming
whereby data is stored in HDFS
Spark streaming
• To define a Spark stream you need to create a
JavaStreamingContext instance
SparkConf conf = new
SparkConf().setMaster("local[4]").setAppName("CustomerItems");
JavaStreamingContext jssc = new JavaStreamingContext(conf,
Durations.seconds(1));
Spark streaming
• Then a receiver can be created for the data:
– from sockets:
– from data directory:
– from RDD streams (for testing purposes):
jssc.socketTextStream("localhost", 7777);
jssc.textFileStream("... some data directory ...");
jssc.queueStream(... RDDs queue ... )
Spark streaming
• Then the data pipeline can be built using transformations
and actions on the streams
• Finally retrieval of data must be triggered from the
streaming context:
jssc.start();
jssc.awaitTermination();
Spark streaming
• Window streams can be created over stream data based
on two criteria:
– length of the window
– sliding interval for the windows
• Streaming datasets can also be joined with other
streaming or batch datasets
Spark structured streaming
• Newer streaming API working on DataSets/DataFrames:
• A schema can be specified on the streaming data using
the .schema(<schema>) method on the read stream
SparkSession context = SparkSession
.builder()
.appName("CustomerItems")
.getOrCreate();
Dataset<Row> lines = spark
.readStream()
.format("socket")
.option("host", "localhost")
.option("port", 7777)
.load();
Spark structured streaming
• Write sinks can also be used to write out streaming datasets:
• The following write sinks are provided by Spark:
- file
- Kafka
- foreach
- console (for testing purpose)
- memory (for testing purpose)
StreamingQuery query =
wordCounts.writeStream()
.outputMode("complete")
.format("console")
.start();
query.awaitTermination();
Clustering
• Spark supports the following cluster managers:
– Standalone scheduler (default)
– YARN
– Mesos
• Support for Kubernetes cluster manager is also
undergoing (experimental at present)
Using Oracle RDBMS
as a Spark datasource
Oracle RDBMS data source
• Spark supports retrieval of data through JDBC/ODBC
• Database driver must be supplied to the Spark classpath
(specified with the --driver-class-path) option
• For Oracle RDBMS that is the ojdbc driver
Oracle RDBMS data source
session.read()
.format("jdbc")
.option("url","jdbc:oracle:thin:@//127.0.0.1:1521/ORCL")
.option("dbtable", "items")
.option("user", "c##spark")
.option("password", "spark")
.load();
Oracle RDBMS data source
• You can use a variery of options when reading data from an
RDBMS using the jdbc format:
– query: a subquery that provides the possibility to limit retrieved data
– queryTimeout: specify the timeout for the JDBC query executed
against the RDBMS
• You can also save datasets to a table:
itemsDF.write().mode(org.apache.spark.sql.SaveMode.Append).
jdbc("jdbc:oracle:thin:@//127.0.0.1:1521/ORCL", “items",
prop);
Data processing options
• However the support provided by Spark is for batch
processing of data from the RDBMS …
• In many cases one might want to process data in a
streaming manner
Data processing options
• For stream processing of data from an Oracle RDBMS a
Spark instance may have to:
– process records as they are inserted in the RDBMS
Id Type OrderTime
1 Laptop 2019.11.05 11:55:05
2 Battery 2019.11.05 12:04:23
3 Headphones 2019.11.05 12:24:17
4 Laptop 2019.11.05 12:52:32
Data processing options
• For stream processing of data from an Oracle RDBMS a
Spark instance may have to:
– process records on evenly-sized batches
Id Type OrderTime
1 Laptop 2019.11.05 11:55:05
2 Battery 2019.11.05 12:04:23
3 Headphones 2019.11.05 12:24:17
4 Laptop 2019.11.05 12:52:32
Data processing options
• For stream processing of data from an Oracle RDBMS a
Spark instance may have to:
– process records on evenly-sized time intervals (record size may vary)
Id Type OrderTime
1 Laptop 2019.11.05 11:55:05
2 Battery 2019.11.05 12:04:23
3 Headphones 2019.11.05 12:24:17
4 Laptop 2019.11.05 12:52:32
Data processing options
• For stream processing of data from an Oracle RDBMS a
Spark instance may have to:
– process batches of overlapping records using a sized window
Id Type OrderTime
1 Laptop 2019.11.05 11:55:05
2 Battery 2019.11.05 12:04:23
3 Headphones 2019.11.05 12:24:17
4 Laptop 2019.11.05 12:52:32
Data processing options
• For stream processing of data from an Oracle RDBMS a
Spark instance may have to:
– processing of batches based on custom filter criteria
Id Type OrderTime
1 Laptop 2019.11.05 11:55:05
2 Battery 2019.11.05 12:04:23
3 Headphones 2019.11.05 12:24:17
4 Laptop 2019.11.05 12:52:32
Data processing options
• These can be achieved using the following mechanism:
– by duplicating writes over a streaming system such as Kafka
– via Spark streaming receiver that:
• buffer records (if a small delay is tolerable)
• creates an endpoint that an RDBMS trigger calls upon insertion
• listens for database changes using DCN (Database Change Notifications) via JDBC
(only pre-12c, DCN support dropped for PDBs as of 12c)
DEMO
Summary
• Apache Spark is one of the most feature-rich and
developed big data processing frameworks
• Provides a mechanism to distribute load over a large
number of nodes using different cluster managers
• A great option for fast and scalable processing of data
from an Oracle RDBMS

Contenu connexe

Tendances

Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceDatabricks
 
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeNear Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeDatabricks
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Eric Sun
 
Practical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsPractical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsFlink Forward
 
Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!Flink Forward
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path ForwardAlluxio, Inc.
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon Web Services
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMatei Zaharia
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...Lucidworks
 
Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan BlueDatabricks
 
Bridge to Cloud: Using Apache Kafka to Migrate to GCP
Bridge to Cloud: Using Apache Kafka to Migrate to GCPBridge to Cloud: Using Apache Kafka to Migrate to GCP
Bridge to Cloud: Using Apache Kafka to Migrate to GCPconfluent
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraFlink Forward
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 

Tendances (20)

Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
 
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeNear Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
 
Practical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsPractical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobs
 
Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Fast analytics kudu to druid
Fast analytics  kudu to druidFast analytics  kudu to druid
Fast analytics kudu to druid
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path Forward
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 
Delta Architecture
Delta ArchitectureDelta Architecture
Delta Architecture
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram S...
 
Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan Blue
 
Bridge to Cloud: Using Apache Kafka to Migrate to GCP
Bridge to Cloud: Using Apache Kafka to Migrate to GCPBridge to Cloud: Using Apache Kafka to Migrate to GCP
Bridge to Cloud: Using Apache Kafka to Migrate to GCP
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 

Similaire à Big data processing with Apache Spark and Oracle Database

Building highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkBuilding highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkMartin Toshev
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the SurfaceJosi Aranda
 
Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel AvivData science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel AvivAmazon Web Services
 
Intro to Spark
Intro to SparkIntro to Spark
Intro to SparkKyle Burke
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkVenkata Naga Ravi
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overviewKaran Alang
 
Data processing with spark in r &amp; python
Data processing with spark in r &amp; pythonData processing with spark in r &amp; python
Data processing with spark in r &amp; pythonMaloy Manna, PMP®
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study NotesRichard Kuo
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Sparkjlacefie
 
An Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark MeetupAn Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark Meetupjlacefie
 

Similaire à Big data processing with Apache Spark and Oracle Database (20)

Building highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache SparkBuilding highly scalable data pipelines with Apache Spark
Building highly scalable data pipelines with Apache Spark
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel AvivData science with spark on amazon EMR - Pop-up Loft Tel Aviv
Data science with spark on amazon EMR - Pop-up Loft Tel Aviv
 
Intro to Spark
Intro to SparkIntro to Spark
Intro to Spark
 
Spark core
Spark coreSpark core
Spark core
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
Data processing with spark in r &amp; python
Data processing with spark in r &amp; pythonData processing with spark in r &amp; python
Data processing with spark in r &amp; python
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Spark
 
An Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark MeetupAn Introduct to Spark - Atlanta Spark Meetup
An Introduct to Spark - Atlanta Spark Meetup
 

Plus de Martin Toshev

Semantic Technology In Oracle Database 12c
Semantic Technology In Oracle Database 12cSemantic Technology In Oracle Database 12c
Semantic Technology In Oracle Database 12cMartin Toshev
 
Practical security In a modular world
Practical security In a modular worldPractical security In a modular world
Practical security In a modular worldMartin Toshev
 
Java 9 Security Enhancements in Practice
Java 9 Security Enhancements in PracticeJava 9 Security Enhancements in Practice
Java 9 Security Enhancements in PracticeMartin Toshev
 
Writing Stored Procedures in Oracle RDBMS
Writing Stored Procedures in Oracle RDBMSWriting Stored Procedures in Oracle RDBMS
Writing Stored Procedures in Oracle RDBMSMartin Toshev
 
Security Architecture of the Java platform
Security Architecture of the Java platformSecurity Architecture of the Java platform
Security Architecture of the Java platformMartin Toshev
 
Oracle Database 12c Attack Vectors
Oracle Database 12c Attack VectorsOracle Database 12c Attack Vectors
Oracle Database 12c Attack VectorsMartin Toshev
 
RxJS vs RxJava: Intro
RxJS vs RxJava: IntroRxJS vs RxJava: Intro
RxJS vs RxJava: IntroMartin Toshev
 
Security Аrchitecture of Тhe Java Platform
Security Аrchitecture of Тhe Java PlatformSecurity Аrchitecture of Тhe Java Platform
Security Аrchitecture of Тhe Java PlatformMartin Toshev
 
Writing Stored Procedures with Oracle Database 12c
Writing Stored Procedures with Oracle Database 12cWriting Stored Procedures with Oracle Database 12c
Writing Stored Procedures with Oracle Database 12cMartin Toshev
 
Concurrency Utilities in Java 8
Concurrency Utilities in Java 8Concurrency Utilities in Java 8
Concurrency Utilities in Java 8Martin Toshev
 
The RabbitMQ Message Broker
The RabbitMQ Message BrokerThe RabbitMQ Message Broker
The RabbitMQ Message BrokerMartin Toshev
 
Security Architecture of the Java Platform (BG OUG, Plovdiv, 13.06.2015)
Security Architecture of the Java Platform (BG OUG, Plovdiv, 13.06.2015)Security Architecture of the Java Platform (BG OUG, Plovdiv, 13.06.2015)
Security Architecture of the Java Platform (BG OUG, Plovdiv, 13.06.2015)Martin Toshev
 
Modularity of The Java Platform Javaday (http://javaday.org.ua/)
Modularity of The Java Platform Javaday (http://javaday.org.ua/)Modularity of The Java Platform Javaday (http://javaday.org.ua/)
Modularity of The Java Platform Javaday (http://javaday.org.ua/)Martin Toshev
 
Writing Java Stored Procedures in Oracle 12c
Writing Java Stored Procedures in Oracle 12cWriting Java Stored Procedures in Oracle 12c
Writing Java Stored Procedures in Oracle 12cMartin Toshev
 
KDB database (EPAM tech talks, Sofia, April, 2015)
KDB database (EPAM tech talks, Sofia, April, 2015)KDB database (EPAM tech talks, Sofia, April, 2015)
KDB database (EPAM tech talks, Sofia, April, 2015)Martin Toshev
 

Plus de Martin Toshev (20)

Jdk 10 sneak peek
Jdk 10 sneak peekJdk 10 sneak peek
Jdk 10 sneak peek
 
Semantic Technology In Oracle Database 12c
Semantic Technology In Oracle Database 12cSemantic Technology In Oracle Database 12c
Semantic Technology In Oracle Database 12c
 
Practical security In a modular world
Practical security In a modular worldPractical security In a modular world
Practical security In a modular world
 
Java 9 Security Enhancements in Practice
Java 9 Security Enhancements in PracticeJava 9 Security Enhancements in Practice
Java 9 Security Enhancements in Practice
 
Java 9 sneak peek
Java 9 sneak peekJava 9 sneak peek
Java 9 sneak peek
 
Writing Stored Procedures in Oracle RDBMS
Writing Stored Procedures in Oracle RDBMSWriting Stored Procedures in Oracle RDBMS
Writing Stored Procedures in Oracle RDBMS
 
Spring RabbitMQ
Spring RabbitMQSpring RabbitMQ
Spring RabbitMQ
 
Security Architecture of the Java platform
Security Architecture of the Java platformSecurity Architecture of the Java platform
Security Architecture of the Java platform
 
Oracle Database 12c Attack Vectors
Oracle Database 12c Attack VectorsOracle Database 12c Attack Vectors
Oracle Database 12c Attack Vectors
 
JVM++: The Graal VM
JVM++: The Graal VMJVM++: The Graal VM
JVM++: The Graal VM
 
RxJS vs RxJava: Intro
RxJS vs RxJava: IntroRxJS vs RxJava: Intro
RxJS vs RxJava: Intro
 
Security Аrchitecture of Тhe Java Platform
Security Аrchitecture of Тhe Java PlatformSecurity Аrchitecture of Тhe Java Platform
Security Аrchitecture of Тhe Java Platform
 
Spring RabbitMQ
Spring RabbitMQSpring RabbitMQ
Spring RabbitMQ
 
Writing Stored Procedures with Oracle Database 12c
Writing Stored Procedures with Oracle Database 12cWriting Stored Procedures with Oracle Database 12c
Writing Stored Procedures with Oracle Database 12c
 
Concurrency Utilities in Java 8
Concurrency Utilities in Java 8Concurrency Utilities in Java 8
Concurrency Utilities in Java 8
 
The RabbitMQ Message Broker
The RabbitMQ Message BrokerThe RabbitMQ Message Broker
The RabbitMQ Message Broker
 
Security Architecture of the Java Platform (BG OUG, Plovdiv, 13.06.2015)
Security Architecture of the Java Platform (BG OUG, Plovdiv, 13.06.2015)Security Architecture of the Java Platform (BG OUG, Plovdiv, 13.06.2015)
Security Architecture of the Java Platform (BG OUG, Plovdiv, 13.06.2015)
 
Modularity of The Java Platform Javaday (http://javaday.org.ua/)
Modularity of The Java Platform Javaday (http://javaday.org.ua/)Modularity of The Java Platform Javaday (http://javaday.org.ua/)
Modularity of The Java Platform Javaday (http://javaday.org.ua/)
 
Writing Java Stored Procedures in Oracle 12c
Writing Java Stored Procedures in Oracle 12cWriting Java Stored Procedures in Oracle 12c
Writing Java Stored Procedures in Oracle 12c
 
KDB database (EPAM tech talks, Sofia, April, 2015)
KDB database (EPAM tech talks, Sofia, April, 2015)KDB database (EPAM tech talks, Sofia, April, 2015)
KDB database (EPAM tech talks, Sofia, April, 2015)
 

Dernier

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 

Dernier (20)

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

Big data processing with Apache Spark and Oracle Database

  • 1. Big data processing with Apache Spark and Oracle database Martin Toshev
  • 2. Who am I Software consultant (CoffeeCupConsulting) BG JUG board member (http://jug.bg) (BG JUG is a 2018 Oracle Duke’s choice award winner)
  • 3. Agenda • Apache Spark from an eagle’s eye • Apache Spark capabilities • Using Oracle RDBMS as a Spark datasource
  • 4. Apache Spark from an Eagle’s eye
  • 5. Highlights • A framework for large-scale distributed data processing • Originally in Scala but extended with Java, Python and R • One of the most contributed open source/Apache/GitHub projects with over 1400 contributors
  • 6. Spark vs MapReduce • Spark has been developed in order to address the shortcomings of the MapReduce programming model • In particular MapReduce is unsuitable for: – real-time processing (suitable for batch processing of present data) – operations not limited to the key-value format of data – large data on a network – online transaction processing – graph processing – sequential program execution
  • 7. Spark vs Hadoop • Spark is faster as it depends more on RAM usage and tries to minimize disk IO (on the storage system) • Spark however can still use Hadoop: – as a storage engine (HDFS) – as a compute engine (MapReduce or Hadoop YARN) • Spark has pluggable storage and compute engine architecture
  • 8. Spark components Spark Framework Spark Core Spark Streaming MLib GraphXSpark SQL
  • 11. Spark datasets • The building block of Spark are RDDs (Resilient Distributed Datasets) • They are immutable collections of objects spread across a Spark cluster and stored in RAM or on disk • Created by means of distributed transformations • Rebuilt on failure of a Spark node
  • 12. Spark datasets • The DataFrame API is a superset of RDDs introduced in Spark 2.0 • The Dataset API provides a way to work with a combination of RDDs and DataFrames • The DataFrame API is preferred compared to RDDs due to improved performance and more advanced operations
  • 13. Spark datasets List<Item> items = …; SparkConf configuration = new SparkConf().setAppName(“ItemsManager").setMaster("local"); JavaSparkContext context = new JavaSparkContext(configuration); JavaRDD<Item> itemsRDD = context.parallelize(items);
  • 14. Spark transformations map itemsRDD.map(i -> { i.setName(“phone”); return i;}); filter itemsRDD.filter(i -> i.getName().contains(“phone”)) flatMap itemsRDD.flatMap(i -> Arrays.asList(i, i).iterator()); union itemsRDD.union(newItemsRDD); intersection itemsRDD.intersection(newItemsRDD); distinct itemsRDD.distinct() cartesian itemsRDD.cartesian(otherDatasetRDD)
  • 15. Spark transformations groupBy pairItemsRDD = itemsRDD.mapToPair(i -> new Tuple2(i.getType(), i)); modifiedPairItemsRDD = pairItemsRDD.groupByKey(); reduceByKey pairItemsRDD = itemsRDD.mapToPair(o -> new Tuple2(o.getType(), o)); modifiedPairItemsRDD = pairItemsRDD.reduceByKey((o1, o2) -> new Item(o1.getType(), o1.getCount() + o2.getCount(), o1.getUnitPrice()) ); • Other transformations include aggregateByKey, sortByKey, join, cogroup …
  • 16. Spark actions • Spark actions are the terminal operations that produce results from the transformations • Actions are a way to communicate back from the execution engine to the Spark driver instance
  • 17. Spark actions collect itemsRDD.collect() reduce itemsRDD.map(i -> i.getUnitPrice() * i.getCount()). reduce((x, y) -> x + y); count itemsRDD.count() first itemsRDD.first() take itemsRDD.take(4) takeOrdered itemsRDD.takeOrdered(4, comparator) foreach itemsRDD.foreach(System.out::println) saveAsTextFile itemsRDD.saveAsTextFile(path) saveAsObjectFile itemsRDD.saveAsObjectFile(path)
  • 18. DataFrames/DataSets • A dataframe can be created using an instance of the org.apache.spark.sql.SparkSession class • The DataFrame/DataSet APIs provide more advanced operations and the capability to run SQL queries on the data itemsDS.createOrReplaceTempView(“items"); session.sql("SELECT * FROM items");
  • 19. DataFrames/DataSets • An existing RDD can be converted to a Spark dataframe: • An RDD can be retrieved from a dataframe as well: SparkSession session = SparkSession.builder().appName("app").getOrCreate(); Dataset<Row> itemsDS = session.createDataFrame(itemsRDD, Item.class); itemsDS.rdd()
  • 20. Spark data sources • Spark can receive data from a variety of data sources in a variety of ways (batching, real-time streaming) • These datasources might be: – files: Spark supports reading data from a variety of formats (JSON, CSV, Avro, etc.) – relational databases: using JDBC/ODBC driver Spark can extract data from an RDBMS – TCP sockets, messaging systems: using streaming capabilities of Spark data can be read from messaging systems and raw TCP sockets
  • 21. Spark data sources • Spark provides support for operations on batch data or real time data • For real time data Spark provides two main APIs: – Spark streaming is an older API working on RDDs – Spark structured streaming is a newer API working on DataFrames/DataSets
  • 22. Spark data sources • Spark provides capabilities to plug-in additional data sources not supported by Spark • For streaming sources you can define your own custom receivers
  • 23. Spark streaming • Data is divided into batches called Dstreams (decentralized streams) • Typical use case is the integration of Spark with messaging systems such as Kafka, RabbitMQ and ActiveMQ etc. • Fault tolerance can be enabled in Spark Streaming whereby data is stored in HDFS
  • 24. Spark streaming • To define a Spark stream you need to create a JavaStreamingContext instance SparkConf conf = new SparkConf().setMaster("local[4]").setAppName("CustomerItems"); JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
  • 25. Spark streaming • Then a receiver can be created for the data: – from sockets: – from data directory: – from RDD streams (for testing purposes): jssc.socketTextStream("localhost", 7777); jssc.textFileStream("... some data directory ..."); jssc.queueStream(... RDDs queue ... )
  • 26. Spark streaming • Then the data pipeline can be built using transformations and actions on the streams • Finally retrieval of data must be triggered from the streaming context: jssc.start(); jssc.awaitTermination();
  • 27. Spark streaming • Window streams can be created over stream data based on two criteria: – length of the window – sliding interval for the windows • Streaming datasets can also be joined with other streaming or batch datasets
  • 28. Spark structured streaming • Newer streaming API working on DataSets/DataFrames: • A schema can be specified on the streaming data using the .schema(<schema>) method on the read stream SparkSession context = SparkSession .builder() .appName("CustomerItems") .getOrCreate(); Dataset<Row> lines = spark .readStream() .format("socket") .option("host", "localhost") .option("port", 7777) .load();
  • 29. Spark structured streaming • Write sinks can also be used to write out streaming datasets: • The following write sinks are provided by Spark: - file - Kafka - foreach - console (for testing purpose) - memory (for testing purpose) StreamingQuery query = wordCounts.writeStream() .outputMode("complete") .format("console") .start(); query.awaitTermination();
  • 30. Clustering • Spark supports the following cluster managers: – Standalone scheduler (default) – YARN – Mesos • Support for Kubernetes cluster manager is also undergoing (experimental at present)
  • 31. Using Oracle RDBMS as a Spark datasource
  • 32. Oracle RDBMS data source • Spark supports retrieval of data through JDBC/ODBC • Database driver must be supplied to the Spark classpath (specified with the --driver-class-path) option • For Oracle RDBMS that is the ojdbc driver
  • 33. Oracle RDBMS data source session.read() .format("jdbc") .option("url","jdbc:oracle:thin:@//127.0.0.1:1521/ORCL") .option("dbtable", "items") .option("user", "c##spark") .option("password", "spark") .load();
  • 34. Oracle RDBMS data source • You can use a variery of options when reading data from an RDBMS using the jdbc format: – query: a subquery that provides the possibility to limit retrieved data – queryTimeout: specify the timeout for the JDBC query executed against the RDBMS • You can also save datasets to a table: itemsDF.write().mode(org.apache.spark.sql.SaveMode.Append). jdbc("jdbc:oracle:thin:@//127.0.0.1:1521/ORCL", “items", prop);
  • 35. Data processing options • However the support provided by Spark is for batch processing of data from the RDBMS … • In many cases one might want to process data in a streaming manner
  • 36. Data processing options • For stream processing of data from an Oracle RDBMS a Spark instance may have to: – process records as they are inserted in the RDBMS Id Type OrderTime 1 Laptop 2019.11.05 11:55:05 2 Battery 2019.11.05 12:04:23 3 Headphones 2019.11.05 12:24:17 4 Laptop 2019.11.05 12:52:32
  • 37. Data processing options • For stream processing of data from an Oracle RDBMS a Spark instance may have to: – process records on evenly-sized batches Id Type OrderTime 1 Laptop 2019.11.05 11:55:05 2 Battery 2019.11.05 12:04:23 3 Headphones 2019.11.05 12:24:17 4 Laptop 2019.11.05 12:52:32
  • 38. Data processing options • For stream processing of data from an Oracle RDBMS a Spark instance may have to: – process records on evenly-sized time intervals (record size may vary) Id Type OrderTime 1 Laptop 2019.11.05 11:55:05 2 Battery 2019.11.05 12:04:23 3 Headphones 2019.11.05 12:24:17 4 Laptop 2019.11.05 12:52:32
  • 39. Data processing options • For stream processing of data from an Oracle RDBMS a Spark instance may have to: – process batches of overlapping records using a sized window Id Type OrderTime 1 Laptop 2019.11.05 11:55:05 2 Battery 2019.11.05 12:04:23 3 Headphones 2019.11.05 12:24:17 4 Laptop 2019.11.05 12:52:32
  • 40. Data processing options • For stream processing of data from an Oracle RDBMS a Spark instance may have to: – processing of batches based on custom filter criteria Id Type OrderTime 1 Laptop 2019.11.05 11:55:05 2 Battery 2019.11.05 12:04:23 3 Headphones 2019.11.05 12:24:17 4 Laptop 2019.11.05 12:52:32
  • 41. Data processing options • These can be achieved using the following mechanism: – by duplicating writes over a streaming system such as Kafka – via Spark streaming receiver that: • buffer records (if a small delay is tolerable) • creates an endpoint that an RDBMS trigger calls upon insertion • listens for database changes using DCN (Database Change Notifications) via JDBC (only pre-12c, DCN support dropped for PDBs as of 12c)
  • 42. DEMO
  • 43. Summary • Apache Spark is one of the most feature-rich and developed big data processing frameworks • Provides a mechanism to distribute load over a large number of nodes using different cluster managers • A great option for fast and scalable processing of data from an Oracle RDBMS