Jim Hatcher
Using Spark to Load Oracle Data into Cassandra
1 Introduction
2 Problem Description
3 Methods of loading external data into Cassandra
4 What is Spark?
5 Lessons Learned
6 Resources
Introduction
At IHS Markit, we take raw data and turn it into information and insights for our customers.
• Automotive Systems (CarFax)
• Defense Systems (Jane’s)
• Oil & Gas Systems (Petra, Kingdom)
• Maritime Systems
• Technology Systems (Electronic Parts Database, Root Metrics)
• Chemicals
• Financial Systems (Wall Street on Demand)
• Lots of others
Problem Description
[Architecture diagram: on the back end, Factory applications and Oracle exchange data updates with a Cassandra + Spark cluster; load files feed the customer-facing systems, where customer-facing applications run against a Cassandra + Solr cluster.]
Methods of Loading External Data into C*
1. CQL Copy command
2. Sqoop
3. Write a custom program that uses the CQL driver
4. Write a Spark program
What is Spark?
Spark is a processing framework designed to work with distributed data. It is "up to 100X faster than MapReduce", according to spark.apache.org, and is used in any ecosystem where you want to work with distributed data (Hadoop, Cassandra, etc.).
It includes specialized libraries:
• SparkSQL
• Spark Streaming
• MLlib
• GraphX
Spark Facts
• Conceptually similar to: MapReduce
• Written in: Scala
• Supported by: Databricks
• Supported languages: Scala, Java, Python, R
Spark Architecture
[Diagram: the Driver (with its Spark Context) on the Spark Client (1) requests resources from the Spark Master, which (2) allocates resources on the Spark Workers; (3) Executors start on each Worker and (4) perform the computation.]
Credit: https://academy.datastax.com/courses/ds320-analytics-apache-spark/introduction-spark-architecture
Spark with Cassandra
[Diagram: a Cassandra cluster (nodes A, B, C) with a Spark Worker running alongside each node, plus a Spark Master and a Spark Client.]
Credit: https://academy.datastax.com/courses/ds320-analytics-apache-spark/introduction-spark-architecture
Spark Cassandra Connector – open source, supported by DataStax
https://github.com/datastax/spark-cassandra-connector
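For reference, reading a Cassandra table through the connector is a one-liner (a minimal sketch; the keyspace and table names are placeholders):

import com.datastax.spark.connector._

// Returns an RDD[CassandraRow]; the read is distributed across the
// Spark Workers co-located with the Cassandra nodes.
val tableRdd = sc.cassandraTable("keyspace_name", "table_name")
println(tableRdd.count())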
ETL (Extract, Transform, Load)
• Extract Data – Spark code creates an RDD or DataFrame from the data source(s): text file, JDBC data source, Cassandra, Hadoop.
• Transform Data – Spark code reshapes the rows, typically with a map function.
• Load Data – Spark code saves the result to Cassandra.
Typical Code - Example

// The connector's implicits provide saveToCassandra on RDDs
import com.datastax.spark.connector._

// Extract
val extracted = sqlContext
  .read
  .format("jdbc")
  .options(
    Map[String, String](
      "url" -> "jdbc:oracle:thin:username/password@//hostname:port/oracle_svc",
      "dbtable" -> "table_name"
    )
  )
  .load()

// Transform
val transformed = extracted.map { dbRow =>
  (dbRow.getAs[String]("field_one"), dbRow.getAs[Integer]("field_two"))
}

// Load
transformed.saveToCassandra("keyspace_name", "table_name", SomeColumns("field_one", "field_two"))
Lessons Learned
Lesson #1 - Spark SQL handles Oracle NUMBER fields with no precision incorrectly
https://issues.apache.org/jira/browse/SPARK-10909
All of our Oracle tables have ID fields defined as NUMBER(15,0).
When you use Spark SQL to access an Oracle table, there is a piece of code in the JDBC driver that reads the metadata and creates a dataframe with the proper schema. If your schema has a NUMBER(*, 0) field defined in it, you get an "Overflowed precision" error.
This is fixed in Spark 1.5, but we don't have the option of adopting a new version of Spark since we're using the Spark bundled with DSE 4.8.6 (which uses Spark 1.4.2). We were able to fix this by stealing the fix from the Spark 1.5 code and applying it to our own code (yay, open source!).
At some point, we'll update to DSE 5.*, which uses Spark 1.6, and we can remove this code.
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types._

private case object OracleDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

  override def getCatalystType(sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    // Handle NUMBER fields that have no precision/scale in a special way
    // because JDBC ResultSetMetaData converts this to 0 precision and -127 scale.
    // For more details, please see
    // https://github.com/apache/spark/pull/8780#issuecomment-145598968
    // and
    // https://github.com/apache/spark/pull/8780#issuecomment-144541760
    if (sqlType == Types.NUMERIC && size == 0) {
      // This is sub-optimal as we have to pick a precision/scale in advance whereas the data
      // in Oracle is allowed to have different precision/scale for each value.
      Option(DecimalType(38, 10))
    } else {
      None
    }
  }

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR2(255)", java.sql.Types.VARCHAR))
    case _ => None
  }
}

org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(OracleDialect)
Lesson #2 - Spark SQL doesn’t handle timeuuid fields correctly
https://issues.apache.org/jira/browse/SPARK-10501
Spark SQL doesn’t know what to do with a timeuuid field when reading a table from Cassandra. This is an
issue since we commonly use timeuuid columns in our Cassandra key structures.
We got this error: scala.MatchError: UUIDType (of class
org.apache.spark.sql.cassandra.types.UUIDType$)
We are able to work around this issue by casting the timeuuid values to strings, like this:
import org.apache.spark.sql.types.StringType

val dataFrameRaw = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "table_name", "keyspace" -> "keyspace_name"))
  .load()

val dataFrameFixed = dataFrameRaw
  .withColumn("timeuuid_column", dataFrameRaw("timeuuid_column").cast(StringType))
Lesson #3 – Be careful when generating ID fields
We created an RDD:

// UUIDs comes from the DataStax Java driver
import com.datastax.driver.core.utils.UUIDs

val baseRdd = rddInsertsAndUpdates.map { dbRow =>
  val keyColumn = {
    if (!dbRow.isNullAt(dbRow.fieldIndex("timeuuid_key_column"))) {
      dbRow.getAs[String]("timeuuid_key_column")
    } else {
      UUIDs.timeBased().toString
    }
  }
  //do some further processing
  (keyColumn, …other values)
}

Then, we took that RDD and transformed it into another RDD:

val invertedIndexTable = baseRdd.map { entry =>
  (entry.getString("timeuuid_key_column"), entry.getString("fld_1"))
}

Then we wrote them both to C*, like this:

baseRdd.saveToCassandra("keyspace_name", "table_name", SomeColumns("key_column", "fld_1", "fld_2"))
invertedIndexTable.saveToCassandra("keyspace_name", "inverted_index_table_name",
  SomeColumns("key_column", "fld_1"))
Lesson #3 – Be careful when generating ID fields
We kept finding that the ID values in the inverted index table were slightly different from the values in the base table. Because Spark re-evaluates an RDD's lineage for each action, each saveToCassandra call recomputed the map, and UUIDs.timeBased() generated a brand-new ID on every recomputation.
We fixed this by adding a cache() to our first RDD, which materializes it once so that both writes see the same generated IDs.
val baseRdd = rddInsertsAndUpdates.map { dbRow =>
  val keyColumn = {
    if (!dbRow.isNullAt(dbRow.fieldIndex("timeuuid_key_column"))) {
      dbRow.getAs[String]("timeuuid_key_column")
    } else {
      UUIDs.timeBased().toString
    }
  }
  //do some further processing
  (keyColumn, …other values)
}.cache()
Lesson #4 – You can only return an RDD of a tuple if you have 22 items or fewer
It’s pretty common in Spark to return an RDD of tuples:

val myNewRdd = myOldRdd.map { dbRow =>
  val firstName = dbRow.getAs[String]("FirstName")
  val lastName = dbRow.getAs[String]("LastName")
  val calcField1 = dbRow.getAs[Integer]("SomeColumn") * 3.14
  (firstName, lastName, calcField1)
}

This works great until you get to 22 fields in your tuple, and then Scala throws an error. (Later versions of Scala lift this restriction, but it's a problem for our version of Scala.)
Lesson #4 – You can only return an RDD of a tuple if you have 22 items or fewer
You can fix this by returning an RDD of CassandraRows instead (especially if your goal is to save them to C*):

// CassandraRow comes from the Spark Cassandra Connector
import com.datastax.spark.connector.CassandraRow

val myNewRdd = myOldRdd.map { dbRow =>
  val firstName = dbRow.getAs[String]("FirstName")
  val lastName = dbRow.getAs[String]("LastName")
  val calcField1 = dbRow.getAs[Integer]("SomeColumn") * 3.14
  // box the Double so it can live in a collection of AnyRef
  val allValues = IndexedSeq[AnyRef](firstName, lastName, Double.box(calcField1))
  val allColumnNames = Array[String](
    "first_name",
    "last_name",
    "calc_field_1")
  new CassandraRow(allColumnNames, allValues)
}
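If the column names in the rows match the target table, the result can then be saved without listing the columns explicitly (our understanding of the connector's behavior; keyspace and table names are placeholders):

myNewRdd.saveToCassandra("keyspace_name", "table_name")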
Lesson #5 – Getting a JDBC dataframe based on a SQL statement is not very intuitive
To get a dataframe from a JDBC source, you do this:

val exampleDataFrame = sqlContext
  .read
  .format("jdbc")
  .options(
    Map[String, String](
      "url" -> "jdbc:oracle:thin:username/password@//hostname:port/oracle_svc",
      "dbtable" -> "table_name"
    )
  )
  .load()
You would think there would be a version of this call that lets you pass in a SQL statement, but there is not.
However, when the JDBC data source builds your query from the above syntax, all it does is prepend your dbtable value with "SELECT * FROM".
Lesson #5 – Getting a JDBC dataframe based on a SQL statement is not very intuitive
So, the workaround is to do this:

val sql =
  "( " +
  " SELECT S.* " +
  " FROM Sample S " +
  " WHERE ID = 11111 " +
  " ORDER BY S.SomeField " +
  ")"

val exampleDataFrame = sqlContext
  .read
  .format("jdbc")
  .options(
    Map[String, String](
      "url" -> "jdbc:oracle:thin:username/password@//hostname:port/oracle_svc",
      "dbtable" -> sql
    )
  )
  .load()

You're effectively doing this in Oracle:

SELECT * FROM (
  SELECT S.*
  FROM Sample S
  WHERE ID = 11111
  ORDER BY S.SomeField
)
Lesson #6 – Creating a partitioned JDBC dataframe is not very intuitive
The code to get a partitioned JDBC dataframe looks like this:

val basePartitionedOracleData = sqlContext
  .read
  .format("jdbc")
  .options(
    Map[String, String](
      "url" -> "jdbc:oracle:thin:username/password@//hostname:port/oracle_svc",
      "dbtable" -> "ExampleTable",
      "lowerBound" -> "1",
      "upperBound" -> "10000",
      "numPartitions" -> "10",
      "partitionColumn" -> "KeyColumn"
    )
  )
  .load()

The last four arguments in that map exist to produce a partitioned dataset. If you pass any of them, you have to pass all of them.
Lesson #6 – Creating a partitioned JDBC dataframe is not very intuitive
When you pass these additional arguments in, here's what it does:
It builds a SQL statement template in the format "SELECT * FROM {tableName} WHERE {partitionColumn} >= ? AND {partitionColumn} < ?"
It sends {numPartitions} statements to the DB engine. If you supplied these values: {dbTable=ExampleTable, lowerBound=1, upperBound=10000, numPartitions=10, partitionColumn=KeyColumn}, it would create ten statements along these lines:

SELECT * FROM ExampleTable WHERE KeyColumn >= 1 AND KeyColumn < 1001
SELECT * FROM ExampleTable WHERE KeyColumn >= 1001 AND KeyColumn < 2001
SELECT * FROM ExampleTable WHERE KeyColumn >= 2001 AND KeyColumn < 3001
SELECT * FROM ExampleTable WHERE KeyColumn >= 3001 AND KeyColumn < 4001
SELECT * FROM ExampleTable WHERE KeyColumn >= 4001 AND KeyColumn < 5001
SELECT * FROM ExampleTable WHERE KeyColumn >= 5001 AND KeyColumn < 6001
SELECT * FROM ExampleTable WHERE KeyColumn >= 6001 AND KeyColumn < 7001
SELECT * FROM ExampleTable WHERE KeyColumn >= 7001 AND KeyColumn < 8001
SELECT * FROM ExampleTable WHERE KeyColumn >= 8001 AND KeyColumn < 9001
SELECT * FROM ExampleTable WHERE KeyColumn >= 9001 AND KeyColumn < 10001

And then it puts the results of each of those queries in its own partition in Spark.
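For intuition, here is a minimal sketch of how those ranges can be derived from the four options (illustrative only; Spark's actual implementation differs in details such as how it handles keys outside the bounds):

// Derive contiguous partition ranges from the four partitioning options.
val lowerBound = 1L
val upperBound = 10000L
val numPartitions = 10
val stride = (upperBound - lowerBound + 1) / numPartitions // 1000 keys per partition
val whereClauses = (0 until numPartitions).map { i =>
  val lo = lowerBound + i * stride
  val hi = lo + stride
  s"KeyColumn >= $lo AND KeyColumn < $hi"
}
whereClauses.foreach(println)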
Lesson #7 – JDBC *really* wants you to get your partitioned dataframe using a sequential ID column
In our Oracle database, we don’t have sequential integer ID columns.
We tried to get around that by doing a query like this and passing "ROW_NUMBER" as the partitioning column:

SELECT ST.*, ROW_NUMBER() OVER (ORDER BY ID_FIELD ASC) AS ROW_NUMBER
FROM SourceTable ST
WHERE …my criteria
ORDER BY ID_FIELD

But this didn't perform well.
We ended up creating a processing table:

CREATE TABLE SPARK_ETL_BATCH_SEQUENCE (
  SEQ_ID NUMBER(15,0) NOT NULL, -- populated from a sequence that auto-increments
  BATCH_ID NUMBER(15,0) NOT NULL,
  ID_FIELD NUMBER(15,0) NOT NULL
)
Lesson #7 – JDBC *really* wants you to get your partitioned dataframe using a sequential ID column
We insert into this table first:

INSERT INTO SPARK_ETL_BATCH_SEQUENCE ( BATCH_ID, ID_FIELD ) -- SEQ_ID gets auto-populated
SELECT {NextBatchID}, ID_FIELD
FROM SourceTable ST
WHERE …my criteria
ORDER BY ID_FIELD

Then we join to it in the query that fetches our data, which gives us a sequential ID:

SELECT ST.*, SEQ.SEQ_ID
FROM SourceTable ST
INNER JOIN SPARK_ETL_BATCH_SEQUENCE SEQ ON ST.ID_FIELD = SEQ.ID_FIELD
WHERE …my criteria
ORDER BY ID_FIELD

And we use SEQ_ID as our partitioning column.
Even though it has to talk to Oracle twice, this approach has proven much faster than living with uneven partitions.
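Putting Lessons #5, #6, and #7 together, the partitioned read ends up looking something like this (a sketch; the batch ID, bounds, and connection details are placeholders):

// The subquery trick from Lesson #5, joined to the sequence table from Lesson #7
val batchSql =
  "( SELECT ST.*, SEQ.SEQ_ID " +
  "  FROM SourceTable ST " +
  "  INNER JOIN SPARK_ETL_BATCH_SEQUENCE SEQ ON ST.ID_FIELD = SEQ.ID_FIELD " +
  "  WHERE SEQ.BATCH_ID = 42 )" // hypothetical batch id

val batchDataFrame = sqlContext
  .read
  .format("jdbc")
  .options(
    Map[String, String](
      "url" -> "jdbc:oracle:thin:username/password@//hostname:port/oracle_svc",
      "dbtable" -> batchSql,
      "partitionColumn" -> "SEQ_ID", // the sequential ID from Lesson #7
      "lowerBound" -> "1",           // assumed SEQ_ID range for this batch
      "upperBound" -> "500000",
      "numPartitions" -> "10"
    )
  )
  .load()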
Resources
Spark
• Books
• Learning Spark
  http://shop.oreilly.com/product/0636920028512.do
Scala (knowing Scala will really help you progress in Spark)
• Functional Programming Principles in Scala (videos)
  https://www.youtube.com/user/afigfigueira/playlists?shelf_id=9&view=50&sort=dd
• Books
  http://www.scala-lang.org/documentation/books.html
Spark and Cassandra
• DataStax Academy
  http://academy.datastax.com/
• Self-paced course: DS320: DataStax Enterprise Analytics with Apache Spark – really good!
• Tutorials
• Spark Cassandra Connector website – lots of good examples
  https://github.com/datastax/spark-cassandra-connector

Editor's Notes

  1. Co-organizer of the Dallas Cassandra Meetup Group; Certified Apache Cassandra Developer.
  2. CQL COPY command - a pretty quick and dirty way of getting data from a text file into a C* table. The primary limiting factor is that the data in the text file has to match the schema of the table. Sqoop - a tool from the Hadoop ecosystem, but it works for C*, too. It's meant for pulling to/from an RDBMS, and it's pretty limited in the kinds of transformation you can do. Write a custom program - it's pretty simple to write a Java program that reads from a text file and uses the CQL driver to write to C*. If you set the write consistency level to ANY and use the executeAsync() methods, you can get it to run pretty darn fast. Write a Spark program - a great option if you want to transform the schema of the source before writing to the C* destination. You can get the data from any number of sources (text files, RDBMS, etc.), use a map statement to transform the data into the right format, and then use the Spark Cassandra Connector to write the data to C*.
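To make option 3 concrete, here is a minimal sketch in Scala using the DataStax Java driver (the contact point, keyspace, and table schema are assumptions for illustration):

import com.datastax.driver.core.{Cluster, ConsistencyLevel}
import scala.io.Source

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("keyspace_name")

// Prepare once; consistency level ANY gives the fastest (least durable) writes.
val insert = session
  .prepare("INSERT INTO table_name (field_one, field_two) VALUES (?, ?)")
  .setConsistencyLevel(ConsistencyLevel.ANY)

// Fire asynchronous writes for throughput; a production loader would throttle
// the number of in-flight futures rather than letting them grow unbounded.
for (line <- Source.fromFile("data.csv").getLines()) {
  val Array(fieldOne, fieldTwo) = line.split(",")
  session.executeAsync(insert.bind(fieldOne, Integer.valueOf(fieldTwo)))
}

session.close()
cluster.close()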