Jim Hatcher
Using Spark to Load Oracle Data into Cassandra
1 Introduction
2 Problem Description
3 Methods of loading external data into Cassandra
4 What is Spark?
5 Lessons Learned
6 Resources
Introduction
At IHS Markit, we take raw data and turn it into information and insights for our customers.
• Automotive Systems (CarFax)
• Defense Systems (Jane’s)
• Oil & Gas Systems (Petra, Kingdom)
• Maritime Systems
• Technology Systems (Electronic Parts Database, Root Metrics)
• Chemicals
• Financial Systems (Wall Street on Demand)
• Lots of others
Problem Description
[Architecture diagram: on the back end, Factory applications and Oracle exchange data updates with a Cassandra + Spark cluster; load files feed the customer-facing systems, where customer-facing applications run against a Cassandra + Solr cluster.]
Methods of Loading External Data into C*
1. CQL Copy command
2. Sqoop
3. Write a custom program that uses the CQL driver
4. Write a Spark program
What is Spark?
Spark is a processing framework designed to work with distributed data. It is "up to 100X faster than MapReduce", according to spark.apache.org, and is used in any ecosystem where you want to work with distributed data (Hadoop, Cassandra, etc.).
It includes specialized libraries:
• SparkSQL
• Spark Streaming
• MLlib
• GraphX
Spark Facts
• Conceptually similar to: MapReduce
• Written in: Scala
• Supported by: Databricks
• Supported languages: Scala, Java, Python, R
Spark Architecture
[Diagram: the Driver (with its Spark Context) on the Spark Client (1) requests resources from the Spark Master, which (2) allocates resources on the Spark Workers; (3) Executors start on each Worker and (4) perform the computation.]
Credit: https://academy.datastax.com/courses/ds320-analytics-apache-spark/introduction-spark-architecture
Spark with Cassandra
[Diagram: a Cassandra cluster (nodes A, B, C) with a Spark Worker running alongside each node, plus a Spark Master and a Spark Client.]
Credit: https://academy.datastax.com/courses/ds320-analytics-apache-spark/introduction-spark-architecture
Spark Cassandra Connector – open source, supported by DataStax
https://github.com/datastax/spark-cassandra-connector
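For reference, reading a Cassandra table through the connector is a one-liner (a minimal sketch; the keyspace and table names are placeholders):

import com.datastax.spark.connector._

// Returns an RDD[CassandraRow]; the read is distributed across the
// Spark Workers co-located with the Cassandra nodes.
val tableRdd = sc.cassandraTable("keyspace_name", "table_name")
println(tableRdd.count())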
ETL (Extract, Transform, Load)
• Extract Data – Spark code creates an RDD or DataFrame from the data source(s): text file, JDBC data source, Cassandra, Hadoop.
• Transform Data – Spark code reshapes the rows, typically with a map function.
• Load Data – Spark code saves the result to Cassandra.
Typical Code - Example

// The connector's implicits provide saveToCassandra on RDDs
import com.datastax.spark.connector._

// Extract
val extracted = sqlContext
  .read
  .format("jdbc")
  .options(
    Map[String, String](
      "url" -> "jdbc:oracle:thin:username/password@//hostname:port/oracle_svc",
      "dbtable" -> "table_name"
    )
  )
  .load()

// Transform
val transformed = extracted.map { dbRow =>
  (dbRow.getAs[String]("field_one"), dbRow.getAs[Integer]("field_two"))
}

// Load
transformed.saveToCassandra("keyspace_name", "table_name", SomeColumns("field_one", "field_two"))
Lessons Learned
Lesson #1 - Spark SQL handles Oracle NUMBER fields with no precision incorrectly
https://issues.apache.org/jira/browse/SPARK-10909
All of our Oracle tables have ID fields defined as NUMBER(15,0).
When you use Spark SQL to access an Oracle table, there is a piece of code in the JDBC driver that reads the metadata and creates a dataframe with the proper schema. If your schema has a NUMBER(*, 0) field defined in it, you get an "Overflowed precision" error.
This is fixed in Spark 1.5, but we don't have the option of adopting a new version of Spark since we're using the Spark bundled with DSE 4.8.6 (which uses Spark 1.4.2). We were able to fix this by stealing the fix from the Spark 1.5 code and applying it to our own code (yay, open source!).
At some point, we'll update to DSE 5.*, which uses Spark 1.6, and we can remove this code.
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types._

private case object OracleDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

  override def getCatalystType(sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    // Handle NUMBER fields that have no precision/scale in a special way
    // because JDBC ResultSetMetaData converts this to 0 precision and -127 scale.
    // For more details, please see
    // https://github.com/apache/spark/pull/8780#issuecomment-145598968
    // and
    // https://github.com/apache/spark/pull/8780#issuecomment-144541760
    if (sqlType == Types.NUMERIC && size == 0) {
      // This is sub-optimal as we have to pick a precision/scale in advance whereas the data
      // in Oracle is allowed to have different precision/scale for each value.
      Option(DecimalType(38, 10))
    } else {
      None
    }
  }

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR2(255)", java.sql.Types.VARCHAR))
    case _ => None
  }
}

org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(OracleDialect)
Lesson #2 - Spark SQL doesn’t handle timeuuid fields correctly
https://issues.apache.org/jira/browse/SPARK-10501
Spark SQL doesn’t know what to do with a timeuuid field when reading a table from Cassandra. This is an
issue since we commonly use timeuuid columns in our Cassandra key structures.
We got this error: scala.MatchError: UUIDType (of class
org.apache.spark.sql.cassandra.types.UUIDType$)
We are able to work around this issue by casting the timeuuid values to strings, like this:
import org.apache.spark.sql.types.StringType

val dataFrameRaw = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "table_name", "keyspace" -> "keyspace_name"))
  .load()

val dataFrameFixed = dataFrameRaw
  .withColumn("timeuuid_column", dataFrameRaw("timeuuid_column").cast(StringType))
Lesson #3 – Be careful when generating ID fields
We created an RDD:

// UUIDs comes from the DataStax Java driver
import com.datastax.driver.core.utils.UUIDs

val baseRdd = rddInsertsAndUpdates.map { dbRow =>
  val keyColumn = {
    if (!dbRow.isNullAt(dbRow.fieldIndex("timeuuid_key_column"))) {
      dbRow.getAs[String]("timeuuid_key_column")
    } else {
      UUIDs.timeBased().toString
    }
  }
  //do some further processing
  (keyColumn, …other values)
}

Then, we took that RDD and transformed it into another RDD:

val invertedIndexTable = baseRdd.map { entry =>
  (entry.getString("timeuuid_key_column"), entry.getString("fld_1"))
}

Then we wrote them both to C*, like this:

baseRdd.saveToCassandra("keyspace_name", "table_name", SomeColumns("key_column", "fld_1", "fld_2"))
invertedIndexTable.saveToCassandra("keyspace_name", "inverted_index_table_name",
  SomeColumns("key_column", "fld_1"))
Lesson #3 – Be careful when generating ID fields
We kept finding that the ID values in the inverted index table were slightly different from the values in the base table. Because Spark re-evaluates an RDD's lineage for each action, each saveToCassandra call recomputed the map, and UUIDs.timeBased() generated a brand-new ID on every recomputation.
We fixed this by adding a cache() to our first RDD, which materializes it once so that both writes see the same generated IDs.
val baseRdd = rddInsertsAndUpdates.map { dbRow =>
  val keyColumn = {
    if (!dbRow.isNullAt(dbRow.fieldIndex("timeuuid_key_column"))) {
      dbRow.getAs[String]("timeuuid_key_column")
    } else {
      UUIDs.timeBased().toString
    }
  }
  //do some further processing
  (keyColumn, …other values)
}.cache()
Lesson #4 – You can only return an RDD of a tuple if you have 22 items or fewer
It’s pretty common in Spark to return an RDD of tuples:

val myNewRdd = myOldRdd.map { dbRow =>
  val firstName = dbRow.getAs[String]("FirstName")
  val lastName = dbRow.getAs[String]("LastName")
  val calcField1 = dbRow.getAs[Integer]("SomeColumn") * 3.14
  (firstName, lastName, calcField1)
}

This works great until you get to 22 fields in your tuple, and then Scala throws an error. (Later versions of Scala lift this restriction, but it's a problem for our version of Scala.)
Lesson #4 – You can only return an RDD of a tuple if you have 22 items or fewer
You can fix this by returning an RDD of CassandraRows instead (especially if your goal is to save them to C*):

// CassandraRow comes from the Spark Cassandra Connector
import com.datastax.spark.connector.CassandraRow

val myNewRdd = myOldRdd.map { dbRow =>
  val firstName = dbRow.getAs[String]("FirstName")
  val lastName = dbRow.getAs[String]("LastName")
  val calcField1 = dbRow.getAs[Integer]("SomeColumn") * 3.14
  // box the Double so it can live in a collection of AnyRef
  val allValues = IndexedSeq[AnyRef](firstName, lastName, Double.box(calcField1))
  val allColumnNames = Array[String](
    "first_name",
    "last_name",
    "calc_field_1")
  new CassandraRow(allColumnNames, allValues)
}
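If the column names in the rows match the target table, the result can then be saved without listing the columns explicitly (our understanding of the connector's behavior; keyspace and table names are placeholders):

myNewRdd.saveToCassandra("keyspace_name", "table_name")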
Lesson #5 – Getting a JDBC dataframe based on a SQL statement is not very intuitive
To get a dataframe from a JDBC source, you do this:

val exampleDataFrame = sqlContext
  .read
  .format("jdbc")
  .options(
    Map[String, String](
      "url" -> "jdbc:oracle:thin:username/password@//hostname:port/oracle_svc",
      "dbtable" -> "table_name"
    )
  )
  .load()
You would think there would be a version of this call that lets you pass in a SQL statement, but there is not.
However, when the JDBC data source builds your query from the above syntax, all it does is prepend your dbtable value with "SELECT * FROM".
Lesson #5 – Getting a JDBC dataframe based on a SQL statement is not very intuitive
So, the workaround is to do this:

val sql =
  "( " +
  " SELECT S.* " +
  " FROM Sample S " +
  " WHERE ID = 11111 " +
  " ORDER BY S.SomeField " +
  ")"

val exampleDataFrame = sqlContext
  .read
  .format("jdbc")
  .options(
    Map[String, String](
      "url" -> "jdbc:oracle:thin:username/password@//hostname:port/oracle_svc",
      "dbtable" -> sql
    )
  )
  .load()

You're effectively doing this in Oracle:

SELECT * FROM (
  SELECT S.*
  FROM Sample S
  WHERE ID = 11111
  ORDER BY S.SomeField
)
Lesson #6 – Creating a partitioned JDBC dataframe is not very intuitive
The code to get a partitioned JDBC dataframe looks like this:

val basePartitionedOracleData = sqlContext
  .read
  .format("jdbc")
  .options(
    Map[String, String](
      "url" -> "jdbc:oracle:thin:username/password@//hostname:port/oracle_svc",
      "dbtable" -> "ExampleTable",
      "lowerBound" -> "1",
      "upperBound" -> "10000",
      "numPartitions" -> "10",
      "partitionColumn" -> "KeyColumn"
    )
  )
  .load()

The last four arguments in that map exist to produce a partitioned dataset. If you pass any of them, you have to pass all of them.
Lesson #6 – Creating a partitioned JDBC dataframe is not very intuitive
When you pass these additional arguments in, here's what it does:
It builds a SQL statement template in the format "SELECT * FROM {tableName} WHERE {partitionColumn} >= ? AND {partitionColumn} < ?"
It sends {numPartitions} statements to the DB engine. If you supplied these values: {dbTable=ExampleTable, lowerBound=1, upperBound=10000, numPartitions=10, partitionColumn=KeyColumn}, it would create ten statements along these lines:

SELECT * FROM ExampleTable WHERE KeyColumn >= 1 AND KeyColumn < 1001
SELECT * FROM ExampleTable WHERE KeyColumn >= 1001 AND KeyColumn < 2001
SELECT * FROM ExampleTable WHERE KeyColumn >= 2001 AND KeyColumn < 3001
SELECT * FROM ExampleTable WHERE KeyColumn >= 3001 AND KeyColumn < 4001
SELECT * FROM ExampleTable WHERE KeyColumn >= 4001 AND KeyColumn < 5001
SELECT * FROM ExampleTable WHERE KeyColumn >= 5001 AND KeyColumn < 6001
SELECT * FROM ExampleTable WHERE KeyColumn >= 6001 AND KeyColumn < 7001
SELECT * FROM ExampleTable WHERE KeyColumn >= 7001 AND KeyColumn < 8001
SELECT * FROM ExampleTable WHERE KeyColumn >= 8001 AND KeyColumn < 9001
SELECT * FROM ExampleTable WHERE KeyColumn >= 9001 AND KeyColumn < 10001

And then it puts the results of each of those queries in its own partition in Spark.
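For intuition, here is a minimal sketch of how those ranges can be derived from the four options (illustrative only; Spark's actual implementation differs in details such as how it handles keys outside the bounds):

// Derive contiguous partition ranges from the four partitioning options.
val lowerBound = 1L
val upperBound = 10000L
val numPartitions = 10
val stride = (upperBound - lowerBound + 1) / numPartitions // 1000 keys per partition
val whereClauses = (0 until numPartitions).map { i =>
  val lo = lowerBound + i * stride
  val hi = lo + stride
  s"KeyColumn >= $lo AND KeyColumn < $hi"
}
whereClauses.foreach(println)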
Lesson #7 – JDBC *really* wants you to get your partitioned dataframe using a sequential ID column
In our Oracle database, we don’t have sequential integer ID columns.
We tried to get around that by doing a query like this and passing "ROW_NUMBER" as the partitioning column:

SELECT ST.*, ROW_NUMBER() OVER (ORDER BY ID_FIELD ASC) AS ROW_NUMBER
FROM SourceTable ST
WHERE …my criteria
ORDER BY ID_FIELD

But this didn't perform well.
We ended up creating a processing table:

CREATE TABLE SPARK_ETL_BATCH_SEQUENCE (
  SEQ_ID NUMBER(15,0) NOT NULL, -- populated from a sequence that auto-increments
  BATCH_ID NUMBER(15,0) NOT NULL,
  ID_FIELD NUMBER(15,0) NOT NULL
)
Lesson #7 – JDBC *really* wants you to get your partitioned dataframe using a sequential ID column
We insert into this table first:

INSERT INTO SPARK_ETL_BATCH_SEQUENCE ( BATCH_ID, ID_FIELD ) -- SEQ_ID gets auto-populated
SELECT {NextBatchID}, ID_FIELD
FROM SourceTable ST
WHERE …my criteria
ORDER BY ID_FIELD

Then we join to it in the query that fetches our data, which gives us a sequential ID:

SELECT ST.*, SEQ.SEQ_ID
FROM SourceTable ST
INNER JOIN SPARK_ETL_BATCH_SEQUENCE SEQ ON ST.ID_FIELD = SEQ.ID_FIELD
WHERE …my criteria
ORDER BY ID_FIELD

And we use SEQ_ID as our partitioning column.
Even though it has to talk to Oracle twice, this approach has proven much faster than living with uneven partitions.
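Putting Lessons #5, #6, and #7 together, the partitioned read ends up looking something like this (a sketch; the batch ID, bounds, and connection details are placeholders):

// The subquery trick from Lesson #5, joined to the sequence table from Lesson #7
val batchSql =
  "( SELECT ST.*, SEQ.SEQ_ID " +
  "  FROM SourceTable ST " +
  "  INNER JOIN SPARK_ETL_BATCH_SEQUENCE SEQ ON ST.ID_FIELD = SEQ.ID_FIELD " +
  "  WHERE SEQ.BATCH_ID = 42 )" // hypothetical batch id

val batchDataFrame = sqlContext
  .read
  .format("jdbc")
  .options(
    Map[String, String](
      "url" -> "jdbc:oracle:thin:username/password@//hostname:port/oracle_svc",
      "dbtable" -> batchSql,
      "partitionColumn" -> "SEQ_ID", // the sequential ID from Lesson #7
      "lowerBound" -> "1",           // assumed SEQ_ID range for this batch
      "upperBound" -> "500000",
      "numPartitions" -> "10"
    )
  )
  .load()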
Resources
Spark
• Books
• Learning Spark
  http://shop.oreilly.com/product/0636920028512.do
Scala (knowing Scala will really help you progress in Spark)
• Functional Programming Principles in Scala (videos)
  https://www.youtube.com/user/afigfigueira/playlists?shelf_id=9&view=50&sort=dd
• Books
  http://www.scala-lang.org/documentation/books.html
Spark and Cassandra
• DataStax Academy
  http://academy.datastax.com/
• Self-paced course: DS320: DataStax Enterprise Analytics with Apache Spark – really good!
• Tutorials
• Spark Cassandra Connector website – lots of good examples
  https://github.com/datastax/spark-cassandra-connector

Editor's Notes

  1. Co-organizer of the Dallas Cassandra Meetup Group; Certified Apache Cassandra Developer.
  2. CQL COPY command - a pretty quick and dirty way of getting data from a text file into a C* table. The primary limiting factor is that the data in the text file has to match the schema of the table. Sqoop - a tool from the Hadoop ecosystem, but it works for C*, too. It's meant for pulling to/from an RDBMS, and it's pretty limited in the kinds of transformation you can do. Write a custom program - it's pretty simple to write a Java program that reads from a text file and uses the CQL driver to write to C*. If you set the write consistency level to ANY and use the executeAsync() methods, you can get it to run pretty darn fast. Write a Spark program - a great option if you want to transform the schema of the source before writing to the C* destination. You can get the data from any number of sources (text files, RDBMS, etc.), use a map statement to transform the data into the right format, and then use the Spark Cassandra Connector to write the data to C*.
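To make option 3 concrete, here is a minimal sketch in Scala using the DataStax Java driver (the contact point, keyspace, and table schema are assumptions for illustration):

import com.datastax.driver.core.{Cluster, ConsistencyLevel}
import scala.io.Source

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("keyspace_name")

// Prepare once; consistency level ANY gives the fastest (least durable) writes.
val insert = session
  .prepare("INSERT INTO table_name (field_one, field_two) VALUES (?, ?)")
  .setConsistencyLevel(ConsistencyLevel.ANY)

// Fire asynchronous writes for throughput; a production loader would throttle
// the number of in-flight futures rather than letting them grow unbounded.
for (line <- Source.fromFile("data.csv").getLines()) {
  val Array(fieldOne, fieldTwo) = line.split(",")
  session.executeAsync(insert.bind(fieldOne, Integer.valueOf(fieldTwo)))
}

session.close()
cluster.close()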