We as an industry are collecting more data every year. IoT, web, and mobile applications send torrents of bits to our data centers that have to be processed and stored, even as users expect an always-on experience—leaving little room for error. Patrick McFadin explores how successful companies do this every day using the powerful Team Apache: Apache Kafka, Spark, and Cassandra.
Patrick walks you through organizing a stream of data into an efficient queue using Apache Kafka, processing the data in flight using Apache Spark Streaming, storing the data in a highly scaling and fault-tolerant database using Apache Cassandra, and transforming and finding insights in volumes of stored data using Apache Spark.
Topics include:
- Understanding the right use case
- Considerations when deploying Apache Kafka
- Processing streams with Apache Spark Streaming
- A deep dive into how Apache Cassandra stores data
- Integration between Cassandra and Spark
- Data models for time series
- Postprocessing without ETL using Apache Spark on Cassandra
2. Agenda
• Lecture
• Kafka
• Spark
• Cassandra
• Hands on
• Verify Cassandra up and running
• Load data into Cassandra
• Break 3:00 - 3:30
• Lecture
• Cassandra (continued)
• Spark and Cassandra
• PySpark
• Hands On
• Spark Shell
• Spark SQL
Section 1 Section 2
3. About me
• Chief Evangelist for Apache Cassandra
• Senior Solution Architect at DataStax
• Chief Architect, Hobsons
• Web applications and performance since 1996
4. What is time series data?
A sequence of data points, typically consisting of successive measurements
made over a time interval.
Source: https://en.wikipedia.org/wiki/Time_series
7. What is time series analysis?
Methods for analyzing time series data in order to extract meaningful
statistics and other characteristics of the data.
Source: https://en.wikipedia.org/wiki/Time_series
40. Kafka
Producer Consumer
Collection
API
Temperature
Processor
Precipitation
Processor
Topic = Temperature
Tem
1
Temp
2
Tem
3
Temp
4
Temp
5
Topic = Precipitation
Precip
1
Precip
2
Precip
3
Precip
4
Precip
5
Broker
Partition 0
Partition 0
Tem
1
Temp
2
Tem
3
Temp
4
Temp
5
Partition 1
Temperature
Processor
Topic = Temperature
Tem
1
Temp
2
Tem
3
Temp
4
Temp
5
Topic = Precipitation
Precip
1
Precip
2
Precip
3
Precip
4
Precip
5
Broker
Partition 0
Partition 0
Tem
1
Temp
2
Tem
3
Temp
4
Temp
5
Partition 1
Topic Temperature
Replication Factor = 2
Topic Precipitation
Replication Factor = 2
41. Kafka
Producer
Consumer
Collection
API
Temperature
Processor
Precipitation
Processor
Topic = Temperature
Tem
1
Temp
2
Tem
3
Temp
4
Temp
5
Topic = Precipitation
Precip
1
Precip
2
Precip
3
Precip
4
Precip
5
Broker
Partition 0
Partition 0
Tem
1
Temp
2
Tem
3
Temp
4
Temp
5
Partition 1 Temperature
Processor
Topic = Temperature
Tem
1
Temp
2
Tem
3
Temp
4
Temp
5
Topic = Precipitation
Precip
1
Precip
2
Precip
3
Precip
4
Precip
5
Broker
Partition 0
Partition 0
Tem
1
Temp
2
Tem
3
Temp
4
Temp
5
Partition 1
Temperature
Processor
Temperature
Processor
Precipitation
Processor
Topic Temperature
Replication Factor = 2
Topic Precipitation
Replication Factor = 2
42. Guarantees
Order
•Messages are ordered as they are sent by the
producer
•Consumers see messages in the order they were
inserted by the producer
Durability
•Messages are delivered at least once
•With a Replication Factor N up to N-1 server failures
can be tolerated without losing committed messages
73. val kafkaStream = KafkaUtils.createStream(streamingContext,
[ZK quorum], [consumer group id], [per-topic number of Kafka partitions to consume])
Zookeeper
Server IP
Consumer
Group Created
In Kafka
List of Kafka topics
and number of threads per topic
Receiver Based Approach
78. Direct Based Approach
val directKafkaStream = KafkaUtils.createDirectStream[
[key class], [value class], [key decoder class], [value decoder class] ](
streamingContext, [map of Kafka parameters], [set of topics to consume])
List of Kafka brokers
(and any other params)
Kafka topics
106. Window
•Amount of time in seconds to sample data
•Larger size creates memory pressure
Slide
•Amount of time in seconds to advance window
DStream
•Window of data as a set
•Same operations as an RDD
116. Dynamo Paper(2007)
• How do we build a data store that is:
• Reliable
• Performant
• “Always On”
• Nothing new and shiny
Evolutionary. Real. Computer Science
Also the basis for Riak and Voldemort
119. Cassandra - More than one server
• All nodes participate in a cluster
• Shared nothing
• Add or remove as needed
• More capacity? Add a server
119
132. Token
Server
•Each partition is a 128 bit value
•Consistent hash between 2-63
and 264
•Each node owns a range of those
values
•The token is the beginning of that
range to the next node’s token value
•Virtual Nodes break these down
further
Data
Token Range
0 …
141. Repair
DC1: RF=3
Node Primary Replica Replica
10.0.0.1 00-25 76-100 51-75
10.0.0.2 26-50 00-25 76-100
10.0.0.3 51-75 26-50 00-25
10.0.0.4 76-100 51-75 26-50
10.0.0.1
00-25
10.0.0.4
76-100
10.0.0.2
26-50
10.0.0.3
51-75
76-100
51-75
00-25
76-100
26-50
00-25
51-75
26-50
Client
Repair = Am
I consistent?
You are missing
some data. Here.
Have some of mine.
142. Consistency level
Consistency Level Number of Nodes Acknowledged
One One - Read repair triggered
Local One One - Read repair in local DC
Quorum 51%
Local Quorum 51% in local DC
150. Example: Weather Station
• Weather station collects data
• Cassandra stores in sequence
• Application reads in sequence
• Aggregations in fast lookup table
Windsor California
July 1, 2014
High: 73.4
Low : 51.4
Precipitation: 0.0
2014 Total: 8.3”
Weather for Windsor, California as of 9PM PST July 7th 2015
Current Temp: 71 F
Daily Precipitation: 0.0”
Up-to-date Weather
High: 85 F
Low 58 F
2015 Total Precipitation: 12.0 “
153. Relational Data Models
• 5 normal forms
• Foreign Keys
• Joins
deptId First Last
1 Edgar Codd
2 Raymond Boyce
id Dept
1 Engineering
2 Math
Employees
Department
156. CQL vs SQL
• No joins
• Limited aggregations
deptId First Last
1 Edgar Codd
2 Raymond Boyce
id Dept
1 Engineering
2 Math
Employees
Department
SELECT e.First, e.Last, d.Dept
FROM Department d, Employees e
WHERE ‘Codd’ = e.Last
AND e.deptId = d.id
157. Denormalization
• Combine table columns into a single view
• No joins
SELECT First, Last, Dept
FROM employees
WHERE id = ‘1’
id First Last Dept
1 Edgar Codd Engineering
2 Raymond Boyce Math
Employees
158. Queries supported
CREATE TABLE raw_weather_data (
wsid text,
year int,
month int,
day int,
hour int,
temperature double,
dewpoint double,
pressure double,
wind_direction int,
wind_speed double,
sky_condition int,
sky_condition_text text,
one_hour_precip double,
six_hour_precip double,
PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
Get weather data given
•Weather Station ID
•Weather Station ID and Time
•Weather Station ID and Range of Time
159. Aggregation Queries
CREATE TABLE daily_aggregate_temperature (
wsid text,
year int,
month int,
day int,
high double,
low double,
mean double,
variance double,
stdev double,
PRIMARY KEY ((wsid), year, month, day)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);
Get temperature stats given
•Weather Station ID
•Weather Station ID and Time
•Weather Station ID and Range of Time
Windsor California
July 1, 2014
High: 73.4
Low : 51.4
160. daily_aggregate_precip
CREATE TABLE daily_aggregate_precip (
wsid text,
year int,
month int,
day int,
precipitation counter,
PRIMARY KEY ((wsid), year, month, day)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);
Get precipitation stats given
•Weather Station ID
•Weather Station ID and Time
•Weather Station ID and Range of Time
Windsor California
July 1, 2014
High: 73.4
Low : 51.4
Precipitation: 0.0
161. year_cumulative_precip
CREATE TABLE year_cumulative_precip (
wsid text,
year int,
precipitation counter,
PRIMARY KEY ((wsid), year)
) WITH CLUSTERING ORDER BY (year DESC);
Get latest yearly precipitation accumulation
•Weather Station ID
•Weather Station ID and Time
•Provide fast lookup
Windsor California
July 1, 2014
High: 73.4
Low : 51.4
Precipitation: 0.0
2014 Total: 8.3”
166. Lightweight Transactions
INSERT INTO weather_station (id, call_sign, country_code, elevation, lat, long, name, state_code)
VALUES ('727930:24233', 'KSEA', 'US', 121.9, 47.467, -122.32, 'SEATTLE SEATTLE-TACOMA INTL A', ‘WA’)
IF NOT EXISTS;
Don’t overwrite!
167. Lightweight Transactions
CREATE TABLE IF NOT EXISTS weather_station (
id text,
name text,
country_code text,
state_code text,
call_sign text,
lat double,
long double,
elevation double,
PRIMARY KEY(id)
);
No-op. Don’t throw error
168. Select
id | call_sign | country_code | elevation | lat | long | name | state_code
--------------+-----------+--------------+-----------+--------+---------+-------------------------------+------------
727930:24233 | KSEA | US | 121.9 | 47.467 | -122.32 | SEATTLE SEATTLE-TACOMA INTL A | WA
SELECT id, call_sign, country_code, elevation, lat, long, name, state_code
FROM weather_station
WHERE id = '727930:24233';
Fields
Table Name
Primary Key: Partition Key Required
169. Update
UPDATE weather_station
SET name = 'SeaTac International Airport'
WHERE id = '727930:24233';
id | call_sign | country_code | elevation | lat | long | name | state_code
--------------+-----------+--------------+-----------+--------+---------+------------------------------+------------
727930:24233 | KSEA | US | 121.9 | 47.467 | -122.32 | SeaTac International Airport | WA
Table Name
Fields to Update: Not in Primary Key
Primary Key
172. Collections
Set
CREATE TABLE weather_station (
id text,
name text,
country_code text,
state_code text,
call_sign text,
lat double,
long double,
elevation double,
equipment set<text>
PRIMARY KEY(id)
);
equipment set<text>
CQL Type: For Ordering
Column Name
173. Collections
Set
List
CREATE TABLE weather_station (
id text,
name text,
country_code text,
state_code text,
call_sign text,
lat double,
long double,
elevation double,
equipment set<text>,
service_dates list<timestamp>,
PRIMARY KEY(id)
);
equipment set<text>
service_dates list<timestamp>Column Name
CQL Type: For Ordering
Column Name
CQL Type
174. Collections
Set
List
Map
CREATE TABLE weather_station (
id text,
name text,
country_code text,
state_code text,
call_sign text,
lat double,
long double,
elevation double,
equipment set<text>,
service_dates list<timestamp>,
service_notes map<timestamp,text>,
PRIMARY KEY(id)
);
equipment set<text>
service_dates list<timestamp>
service_notes map<timestamp,text>
Column Name
Column Name
CQL Key Type CQL Value Type
CQL Type: For Ordering
Column Name
CQL Type
175. User Defined Functions*
*As of Cassandra 2.2
•Built-in: avg, min, max, count(<column name>)
•Runs on server
•Always use with partition key
176. User Defined Functions
CREATE FUNCTION maxI(current int, candidate int)
CALLED ON NULL INPUT
RETURNS int LANGUAGE java AS
'if (current == null) return candidate; else return Math.max(current, candidate);' ;
CREATE AGGREGATE maxAgg(int)
SFUNC maxI
STYPE int
INITCOND null;
CQL Type
Pure Function
SELECT maxAgg(temperature)
FROM raw_weather_data
WHERE wsid='10010:99999'
AND year = 2005 AND month = 12 AND day = 1
Aggregate using
function over
partition
185. Partition keys
10010:99999 Murmur3 Hash Token = 7224631062609997448
722266:13850 Murmur3 Hash Token = -6804302034103043898
INSERT INTO raw_weather_data(wsid,year,month,day,hour,temperature)
VALUES (‘10010:99999’,2005,12,1,7,-5.6);
INSERT INTO raw_weather_data(wsid,year,month,day,hour,temperature)
VALUES (‘722266:13850’,2005,12,1,7,-5.6);
Consistent hash. 128 bit number
between 2-63
and 264
186. Partition keys
10010:99999 Murmur3 Hash Token = 15
722266:13850 Murmur3 Hash Token = 77
For this example, let’s make it a
reasonable number
INSERT INTO raw_weather_data(wsid,year,month,day,hour,temperature)
VALUES (‘10010:99999’,2005,12,1,7,-5.6);
INSERT INTO raw_weather_data(wsid,year,month,day,hour,temperature)
VALUES (‘722266:13850’,2005,12,1,7,-5.6);
190. Writes
CREATE TABLE raw_weather_data (
wsid text,
year int,
month int,
day int,
hour int,
temperature double,
dewpoint double,
pressure double,
wind_direction int,
wind_speed double,
sky_condition int,
sky_condition_text text,
one_hour_precip double,
six_hour_precip double,
PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
191. Writes
CREATE TABLE raw_weather_data (
wsid text,
year int,
month int,
day int,
hour int,
temperature double,
PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
INSERT INTO raw_weather_data(wsid,year,month,day,hour,temperature)
VALUES (‘10010:99999’,2005,12,1,10,-5.6);
INSERT INTO raw_weather_data(wsid,year,month,day,hour,temperature)
VALUES (‘10010:99999’,2005,12,1,9,-5.1);
INSERT INTO raw_weather_data(wsid,year,month,day,hour,temperature)
VALUES (‘10010:99999’,2005,12,1,8,-4.9);
INSERT INTO raw_weather_data(wsid,year,month,day,hour,temperature)
VALUES (‘10010:99999’,2005,12,1,7,-5.3);
192. Write Path
Client
INSERT INTO raw_weather_data(wsid,year,month,day,hour,temperature)
VALUES (‘10010:99999’,2005,12,1,7,-5.3);
year 1wsid 1 month 1 day 1 hour 1
year 2wsid 2 month 2 day 2 hour 2
Memtable
SSTable
SSTable
SSTable
SSTable
Node
Commit Log Data
* Compaction *
Temp
Temp
193. Storage Model - Logical View
2005:12:1:10
-5.6
2005:12:1:9
-5.1
2005:12:1:8
-4.9
10010:99999
10010:99999
10010:99999
wsid hour temperature
2005:12:1:7
-5.3
10010:99999
SELECT wsid, hour, temperature
FROM raw_weather_data
WHERE wsid=‘10010:99999’
AND year = 2005 AND month = 12 AND day = 1;
194. 2005:12:1:10
-5.6 -5.3-4.9-5.1
Storage Model - Disk Layout
2005:12:1:9 2005:12:1:8
10010:99999
2005:12:1:7
Merged, Sorted and Stored Sequentially
SELECT wsid, hour, temperature
FROM raw_weather_data
WHERE wsid=‘10010:99999’
AND year = 2005 AND month = 12 AND day = 1;
195. 2005:12:1:10
-5.6
2005:12:1:11
-4.9 -5.3-4.9-5.1
Storage Model - Disk Layout
2005:12:1:9 2005:12:1:8
10010:99999
2005:12:1:7
Merged, Sorted and Stored Sequentially
SELECT wsid, hour, temperature
FROM raw_weather_data
WHERE wsid=‘10010:99999’
AND year = 2005 AND month = 12 AND day = 1;
196. 2005:12:1:10
-5.6
2005:12:1:11
-4.9 -5.3-4.9-5.1
Storage Model - Disk Layout
2005:12:1:9 2005:12:1:8
10010:99999
2005:12:1:7
Merged, Sorted and Stored Sequentially
SELECT wsid, hour, temperature
FROM raw_weather_data
WHERE wsid=‘10010:99999’
AND year = 2005 AND month = 12 AND day = 1;
2005:12:1:12
-5.4
198. Query patterns
• Range queries
• “Slice” operation on disk
Single seek on disk
10010:99999
Partition key for locality
SELECT wsid,hour,temperature
FROM raw_weather_data
WHERE wsid='10010:99999'
AND year = 2005 AND month = 12 AND day = 1
AND hour >= 7 AND hour <= 10;
2005:12:1:10
-5.6 -5.3-4.9-5.1
2005:12:1:9 2005:12:1:8 2005:12:1:7
199. Query patterns
• Range queries
• “Slice” operation on disk
Programmers like this
Sorted by event_time
2005:12:1:10
-5.6
2005:12:1:9
-5.1
2005:12:1:8
-4.9
10010:99999
10010:99999
10010:99999
weather_station hour temperature
2005:12:1:7
-5.3
10010:99999
SELECT weatherstation,hour,temperature
FROM temperature
WHERE weatherstation_id=‘10010:99999'
AND year = 2005 AND month = 12 AND day = 1
AND hour >= 7 AND hour <= 10;
210. Type mapping
CQL Type Scala Type
ascii String
bigint Long
boolean Boolean
counter Long
decimal BigDecimal, java.math.BigDecimal
double Double
float Float
inet java.net.InetAddress
int Int
list Vector, List, Iterable, Seq, IndexedSeq, java.util.List
map Map, TreeMap, java.util.HashMap
set Set, TreeSet, java.util.HashSet
text, varchar String
timestamp Long, java.util.Date, java.sql.Date, org.joda.time.DateTime
timeuuid java.util.UUID
uuid java.util.UUID
varint BigInt, java.math.BigInteger
*nullable values Option
211. Execution of jobs
Local Cluster
•Connect to localhost
master
•Single system dev
•Runs stand alone
•Connect to spark master
IP
•Production configuration
•Submit using spark-
submit
212. Summary
•Cassandra acts as the storage layer for Spark
•Deploy in a mixed cluster configuration
•Spark executors access Cassandra using the
DataStax connector
•Deploy your jobs in either local or cluster modes
214. Attaching to Spark and Cassandra
// Import Cassandra-specific functions on SparkContext and RDD objects
import org.apache.spark.{SparkContext, SparkConf}
import com.datastax.spark.connector._
/** The setMaster("local") lets us run & test the job right in our IDE */
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "127.0.0.1")
.setMaster(“local[*]")
.setAppName(getClass.getName)
// Optionally
.set("cassandra.username", "cassandra")
.set("cassandra.password", “cassandra")
val sc = new SparkContext(conf)
215. Weather station example
CREATE TABLE raw_weather_data (
wsid text,
year int,
month int,
day int,
hour int,
temperature double,
dewpoint double,
pressure double,
wind_direction int,
wind_speed double,
sky_condition int,
sky_condition_text text,
one_hour_precip double,
six_hour_precip double,
PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
216. Simple example
/** keyspace & table */
val tableRDD = sc.cassandraTable("isd_weather_data", "raw_weather_data")
/** get a simple count of all the rows in the raw_weather_data table */
val rowCount = tableRDD.count()
println(s"Total Rows in Raw Weather Table: $rowCount")
sc.stop()
217. Simple example
/** keyspace & table */
val tableRDD = sc.cassandraTable("isd_weather_data", "raw_weather_data")
/** get a simple count of all the rows in the raw_weather_data table */
val rowCount = tableRDD.count()
println(s"Total Rows in Raw Weather Table: $rowCount")
sc.stop()
Executer
SELECT *
FROM isd_weather_data.raw_weather_data
Spark RDD
Spark Partition
Spark Connector
218. Using CQL
SELECT temperature
FROM raw_weather_data
WHERE wsid = '724940:23234'
AND year = 2008
AND month = 12
AND day = 1;
val cqlRRD = sc.cassandraTable("isd_weather_data", "raw_weather_data")
.select("temperature")
.where("wsid = ? AND year = ? AND month = ? AND DAY = ?",
"724940:23234", "2008", "12", “1")
219. Using SQL!
spark-sql> SELECT wsid, year, month, day, max(temperature) high, min(temperature) low
FROM raw_weather_data
WHERE month = 6
AND temperature !=0.0
GROUP BY wsid, year, month, day;
724940:23234 2008 6 1 15.6 10.0
724940:23234 2008 6 2 15.6 10.0
724940:23234 2008 6 3 17.2 11.7
724940:23234 2008 6 4 17.2 10.0
724940:23234 2008 6 5 17.8 10.0
724940:23234 2008 6 6 17.2 10.0
724940:23234 2008 6 7 20.6 8.9
220. SQL with a Join
spark-sql> SELECT ws.name, raw.hour, raw.temperature
FROM raw_weather_data raw
JOIN weather_station ws
ON raw.wsid = ws.id
WHERE raw.wsid = '724940:23234'
AND raw.year = 2008 AND raw.month = 6 AND raw.day = 1;
SAN FRANCISCO INTL AP 23 15.0
SAN FRANCISCO INTL AP 22 15.0
SAN FRANCISCO INTL AP 21 15.6
SAN FRANCISCO INTL AP 20 15.0
SAN FRANCISCO INTL AP 19 15.0
SAN FRANCISCO INTL AP 18 14.4
221. Analyzing large data sets
val spanRDD = sc.cassandraTable[Double]("isd_weather_data", "raw_weather_data")
.select("temperature")
.where("wsid = ? AND year = ? AND month = ? AND DAY = ?",
"724940:23234", "2008", "12", "1").spanBy(row => (row.getString("wsid")))
•Specify partition grouping
•Use with large partitions
•Perfect for time series
222. Saving back the weather data
val cc = new CassandraSQLContext(sc)
cc.setKeyspace("isd_weather_data")
cc.sql("""
SELECT wsid, year, month, day, max(temperature) high, min(temperature) low
FROM raw_weather_data
WHERE month = 6
AND temperature !=0.0
GROUP BY wsid, year, month, day;
""")
.map{row => (row.getString(0), row.getInt(1), row.getInt(2), row.getInt(3), row.getDouble(4), row.getDouble(5))}
.saveToCassandra("isd_weather_data", "daily_aggregate_temperature")
226. DataFrames
• Abstraction over RDDs
• Modeled after Pandas & R
• Structured data
• Python passes commands only
• Commands are pushed down
• Data Never Leaves the JVM
• You can still use the RDD if you
want
• Dataframe.rdd
RDD
DataFrame
228. Sample Dataset - Movielens
• Subset of movies (1-100)
• ~800k ratings
CREATE TABLE movielens.movie (
movie_id int PRIMARY KEY,
genres set<text>,
title text
)
CREATE TABLE movielens.rating (
movie_id int,
user_id int,
rating decimal,
ts int,
PRIMARY KEY (movie_id, user_id)
)
229. Reading Cassandra Tables
• DataFrames has a standard
interface for reading
• Cache if you want to keep dataset
in memory
cl = "org.apache.spark.sql.cassandra"
movies = sql.read.format(cl).
load(keyspace="movielens",
table="movie").cache()
ratings = sql.read.format(cl).
load(keyspace="movielens",
table="rating").cache()
230. Filtering
• Select specific rows matching
various patterns
• Fields do not require indexes
• Filtering occurs in memory
• You can use DSE Solr Search
Queries
• Filtering returns a DataFrame
movies.filter(movies.movie_id == 1)
movies[movies.movie_id == 1]
movies.filter("movie_id=1")
movie_id title genres
44 Mortal Kombat (1995)
['Action',
'Adventure',
'Fantasy']
movies.filter("title like '%Kombat%'")
231. Filtering
• Helper function:
explode()
• select() to keep
specific columns
• alias() to rename
title
Broken Arrow (1996)
GoldenEye (1995)
Mortal Kombat (1995)
White Squall (1996)
Nick of Time (1995)
from pyspark.sql import functions as F
movies.select("title", F.explode("genres").
alias("genre")).
filter("genre = 'Action'").select("title")
title genre
Broken Arrow (1996) Action
Broken Arrow (1996) Adventure
Broken Arrow (1996) Thriller
232. Aggregation
• Count, sum, avg
• in SQL: GROUP BY
• Useful with spark streaming
• Aggregate raw data
• Send to dashboards
ratings.groupBy("movie_id").
agg(F.avg("rating").alias('avg'))
ratings.groupBy("movie_id").avg("rating")
movie_id avg
31 3.24
32 3.8823
33 3.021
233. Joins
• Inner join by default
• Can do various outer joins
as well
• Returns a new DF with all
the columns
ratings.join(movies, "movie_id")
DataFrame[movie_id: int,
user_id: int,
rating: decimal(10,0),
ts: int,
genres: array<string>,
title: string]
234. Chaining Operations
• Similar to SQL, we can build up in
complexity
• Combine joins with aggregations,
limits & sorting
ratings.groupBy("movie_id").
agg(F.avg("rating").
alias('avg')).
sort("avg", ascending=False).
limit(3).
join(movies, "movie_id").
select("title", "avg")
title avg
Usual Suspects, The (1995) 4.32
Seven (a.k.a. Se7en) (1995) 4.054
Persuasion (1995) 4.053
235. SparkSQL
• Register DataFrame as Table
• Query using HiveSQL syntax
movies.registerTempTable("movie")
ratings.registerTempTable("rating")
sql.sql("""select title, avg(rating) as avg_rating
from movie join rating
on movie.movie_id = rating.movie_id
group by title
order by avg_rating DESC limit 3""")
236. Database Migrations
• DataFrame reader supports JDBC
• JOIN operations can be cross DB
• Read dataframe from JDBC, write
to Cassandra