This document provides an agenda and overview of Big Data analytics using Spark and Cassandra. It introduces Cassandra as a distributed database and Spark as a data-processing framework, covers connecting the two, reading and writing Cassandra tables as Spark RDDs, and using Spark SQL, Spark Streaming, and Spark MLlib with Cassandra data. Key capabilities of each technology are highlighted, such as Cassandra's tunable consistency and Spark's fault tolerance through RDD lineage. Examples demonstrate basic operations such as filtering, aggregating, and joining Cassandra data with Spark.
4. •Distributed database
•Highly Available
•Linearly Scalable
•Multi Datacenter Support
•No Single Point Of Failure
•CQL Query Language
• Similar to SQL
• No joins or aggregates
•Eventual Consistency ("Tunable Consistency")
Cassandra_
6. CQL - A Query Language With Limitations_
SELECT * FROM performer WHERE name = 'ACDC'
—> ok
SELECT * FROM performer WHERE name = 'ACDC' AND country = 'Australia'
—> not ok
SELECT country, COUNT(*) AS quantity FROM artists GROUP BY country ORDER BY quantity DESC
—> not supported
Table performer: name (PK), genre, country
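Spark (introduced on the next slides) closes exactly these gaps. A preview sketch using the Spark Cassandra Connector; the keyspace name music is an assumption, the table is the performer table above:

import com.datastax.spark.connector._

// filter on the non-key column "country", which plain CQL rejects
val australians = sc.cassandraTable("music", "performer")
  .filter(row => row.getString("country") == "Australia")

// the unsupported GROUP BY / ORDER BY query, expressed as RDD operations
val byCountry = sc.cassandraTable("music", "performer")
  .map(row => (row.getString("country"), 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)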
8. •Open Source since 2010, Apache project since 2013
•Data processing Framework
• Batch processing
• Stream processing
What Is Apache Spark_
9. •Fast
• up to 100 times faster than Hadoop
• a lot of in-memory processing
• linearly scalable by adding more nodes
•Easy
• Scala, Java and Python API
• Clean Code (e.g. with lambdas in Java 8)
• rich API: map, reduce, filter, groupBy, sort, union, join, reduceByKey, groupByKey, sample, take, first, count
•Fault-Tolerant
• easily reproducible
Why Use Spark_
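As an illustration of this API, a minimal word-count sketch in Scala (the input path is hypothetical):

val counts = sc.textFile("hdfs:///data/tracks.txt") // hypothetical input
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)       // transformations only build the execution plan
counts.take(10).foreach(println) // the action triggers the computation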
10. •RDDs – Resilient Distributed Datasets
• Read-only description of a collection of objects
• Distributed across the cluster (in memory or on disk)
• Defined through transformations
• Allows automatic rebuilds on failure
•Operations
• Transformations (map, filter, flatMap, ...) —> new RDD
• Actions (count, collect, save)
•Only Actions start processing!
Easily Reproducible?_
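A small sketch of this laziness and of rebuilds via lineage (file and filter are illustrative):

val lines = sc.textFile("events.log")            // nothing is read yet
val errors = lines.filter(_.contains("ERROR"))   // still nothing: only lineage is recorded
val n = errors.count()                           // action: now the whole chain executes
// if a partition is lost, Spark recomputes it from the recorded lineage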
•Partitions
• How the data is split into partitions (e.g. one per Cassandra partition)
•Dependencies
• Dependencies on parent RDDs
•Compute
• The function to compute the RDD's partitions
•(Optional) Partitioner
• How is the data partitioned? (Hash, Range, ...)
•(Optional) Preferred Location
• Where to get the data (e.g. a list of Cassandra node IPs)
Properties Of A RDD_
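These five properties mirror the methods of Spark's RDD base class; a simplified sketch (visibility modifiers and some details omitted):

import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

// simplified sketch of org.apache.spark.rdd.RDD
abstract class SketchRDD[T] {
  def getPartitions: Array[Partition]                               // Partitions
  def getDependencies: Seq[Dependency[_]]                           // Dependencies
  def compute(split: Partition, context: TaskContext): Iterator[T]  // Compute
  val partitioner: Option[Partitioner] = None                       // (Optional) Partitioner
  def getPreferredLocations(split: Partition): Seq[String] = Nil    // (Optional) Preferred Location
}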
16. •Memory
• A lot of data in memory
• More memory —> Less disk IO —> Faster processing
• Minimum 8 GB / Node
•Network
• Communication between Driver, Cluster Manager & Worker
• Important for reduce operations
• 10 Gigabit LAN or better
•CPU
• Less communication between threads
• Good to parallelize
• Minimum 8 – 16 Cores / Node
What About Hardware?_
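Translated into configuration, these sizing hints might look like the following sketch (application name, values, and host are illustrative; spark.cassandra.connection.host is the connector's contact-point setting):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("analytics")
  .set("spark.executor.memory", "8g")                  // >= 8 GB per node
  .set("spark.executor.cores", "8")                    // 8-16 cores per node
  .set("spark.cassandra.connection.host", "10.0.0.1")  // hypothetical Cassandra node
val sc = new SparkContext(conf)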
21. •Parallelization!
• keys are used for partitioning
• pairs with different keys are distributed across the cluster
•Efficient processing of
• aggregate by key
• group by key
• sort by key
• joins and unions based on keys
Why Use Pair RDDs_
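A small pair-RDD sketch (the data is made up): pairs with the same key land in the same partition, so per-key aggregation needs no full shuffle of the values.

val albums = sc.parallelize(Seq(
  ("ACDC", "Back in Black"), ("ACDC", "High Voltage"), ("Kyuss", "Blues for the Red Sun")))
val albumsPerArtist = albums.mapValues(_ => 1).reduceByKey(_ + _)
albumsPerArtist.sortByKey().collect() // Array((ACDC,2), (Kyuss,1))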
32. Read As Case Class
case class Movie(title: String, year: Int)
sc.cassandraTable[Movie]("movie","movies").select("title","year")
sc.cassandraTable("movie","movies").select("title","year").as(Movie)
Read A Cassandra Table_
33. •Every RDD can be saved
• Using Tuples
val tuples = sc.parallelize(Seq(("Hobbit",2012),("96 Hours",2008)))
tuples.saveToCassandra("movie","movies", SomeColumns("title","year"))
• Using Case Classes
case class Movie(title: String, year: Int)
val objects = sc.parallelize(Seq(Movie("Hobbit",2012),Movie("96 Hours",2008)))
objects.saveToCassandra("movie","movies")
Write Table_
35. •Joins can be expensive as they may require shuffling
val directors = sc.cassandraTable(..)
.map(r => (r.getString("name"),r))
val movies = sc.cassandraTable(..)
.map(r => (r.getString("director"),r))
movies.join(directors)
// RDD[(String, (CassandraRow, CassandraRow))]
Pair RDDs With Cassandra - Join_
Table director: name text (PK), country text
Table movie: title text (PK), director text
36. •Automatically on read
•Not automatically on write
• No shuffling Spark operations —> writes are local
• Shuffling Spark operations
• Fan-out writes to Cassandra
• repartitionByCassandraReplica("keyspace", "table") before write
•Joins with data locality
Using Data Locality With Cassandra_
sc.cassandraTable[CassandraRow](KEYSPACE, A)
.repartitionByCassandraReplica(KEYSPACE, B)
.joinWithCassandraTable[CassandraRow](KEYSPACE, B)
.on(SomeColumns("id"))
37. •cassandraCount()
• utilizes a Cassandra count query
• instead of loading the table into memory and counting
•spanBy(), spanByKey()
• group data by Cassandra partition key
• does not need shuffling
• should be preferred over groupBy/groupByKey
CREATE TABLE events (year int, month int, ts timestamp, data varchar, PRIMARY KEY (year,month,ts));
sc.cassandraTable("test", "events")
.spanBy(row => (row.getInt("year"), row.getInt("month")))
sc.cassandraTable("test", "events")
.keyBy(row => (row.getInt("year"), row.getInt("month")))
.spanByKey
Further Transformations & Actions_
45. •SQL Queries with Spark (SQL & HiveQL)
• On structured data
• On DataFrames
• Every result of Spark SQL is a DataFrame
• All operations of the generic RDD are available
•Supports (even on non-primary-key columns)
• Joins
• Union
• Group By
• Having
• Order By
Spark SQL_
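A sketch of the GROUP BY query from the CQL slide, now answered via Spark SQL (assumes Spark 1.x with the connector's data source; the keyspace music is hypothetical):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val performers = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "music", "table" -> "performer"))
  .load()
performers.registerTempTable("performer")

// the GROUP BY / ORDER BY query that plain CQL rejects
sqlContext.sql(
  "SELECT country, COUNT(*) AS quantity FROM performer GROUP BY country ORDER BY quantity DESC"
).show()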
50. •Real Time Processing using micro batches
•Supported sources: TCP, S3, Kafka, Twitter,..
•Data as Discretized Stream (DStream)
•Same programming model as for batches
•All Operations of the GenericRDD & SQL & MLLib
•Stateful Operations & Sliding Windows
Stream Processing With Spark Streaming_
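A minimal DStream sketch (socket source; host and port are hypothetical):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))       // micro-batches of 10 seconds
val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical TCP source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                        // output action, runs per micro-batch
ssc.start()
ssc.awaitTermination()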
52. •Maintain State for each key in a DStream: updateStateByKey
Spark Streaming - Stateful Operations_
// stateful operations require a checkpoint directory, e.g. ssc.checkpoint("checkpoint/")
def updateAlbumCount(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  val newCount = runningCount.getOrElse(0) + newValues.size
  Some(newCount)
}
val countStream = stream.updateStateByKey[Int](updateAlbumCount _)
stream is a DStream of pair RDDs
53. •One Receiver -> One Node
• Start more receivers and union them
val numStreams = 5
val kafkaStreams = (1 to numStreams).map { i => KafkaUtils.createStream(...) }
val unifiedStream = streamingContext.union(kafkaStreams)
unifiedStream.print()
•Received data will be split up into blocks
• 1 block => 1 task
• blocks per batch = batch interval / block interval
•Repartition data to distribute over cluster
Spark Streaming - Parallelism_
67. •In particular for huge amounts of external data
•Support for CSV, TSV, XML, JSON and other formats
Use Cases for Spark and Cassandra_
Data Loading
case class User(id: java.util.UUID, name: String)
val users = sc.textFile("users.csv")
  .repartition(2 * sc.defaultParallelism)
  .map(line => line.split(",") match {
    case Array(id, name) => User(java.util.UUID.fromString(id), name)
  })
users.saveToCassandra("keyspace", "users")
68. Validate consistency in a Cassandra database
•syntactic
• Uniqueness (only relevant for columns not in the PK)
• Referential integrity
• Integrity of duplicated (denormalized) data
•semantic
• Business or application constraints
• e.g. at least one genre per movie, a maximum of 10 tags per blog post
Use Cases for Spark and Cassandra_
Validation & Normalization
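A sketch of a referential-integrity check, reusing the director/movie tables from the join slide (the keyspace music is assumed):

import com.datastax.spark.connector._

// movies whose director has no entry in the director table
val directors = sc.cassandraTable("music", "director").map(r => (r.getString("name"), r))
val movies = sc.cassandraTable("music", "movie").map(r => (r.getString("director"), r))
val orphans = movies.subtractByKey(directors)
println(s"movies with unknown director: ${orphans.count()}")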
69. •Modelling, Mining, Transforming, ....
•Use Cases
• Recommendation
• Fraud Detection
• Link Analysis (Social Networks, Web)
• Advertising
• Data Stream Analytics (Spark Streaming)
• Machine Learning (Spark ML)
Use Cases for Spark and Cassandra_
Analyses (Joins, Transformations,..)
70. •Changes on existing tables
• New table required when changing primary key
• Otherwise changes could be performed in-place
•Creating new tables
• data derived from existing tables
• Support new queries
•Use the CassandraConnector in Spark
Use Cases for Spark and Cassandra_
Schema Migration
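A migration sketch: create the new table (with its new primary key) through the connector's CQL session, then copy the data with Spark. Keyspace and table names are hypothetical:

import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector

// create the new table with a different primary key via a plain CQL statement
CassandraConnector(sc.getConf).withSessionDo { session =>
  session.execute(
    "CREATE TABLE IF NOT EXISTS movie.movies_by_year (" +
    "year int, title text, PRIMARY KEY (year, title))")
}

// copy the data from the old table into the new one
sc.cassandraTable("movie", "movies")
  .select("title", "year")
  .saveToCassandra("movie", "movies_by_year", SomeColumns("title", "year"))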