Real-time Analytics with Cassandra, Spark and Shark

Saturday, December 7, 13
Who is this guy
• Staff Engineer, Compute and Data Services, Ooyala
• Building multiple web-scale real-time systems on top of C*, Kafka, Storm, etc.
• Scala/Akka guy
• github.com/velvia
• @evanfchan

Agenda
• Ooyala and Cassandra
• What problem are we trying to solve?
• Spark and Shark
• Our Spark/Cassandra Architecture
• Demo

Cassandra at Ooyala
Who is Ooyala, and how we use Cassandra

OOYALA
Powering personalized video
experiences across all screens.

CONFIDENTIAL—DO NOT DISTRIBUTE

COMPANY OVERVIEW
Founded in 2007
Commercially launched in 2009
300 employees in Silicon Valley, LA, NYC, London, Paris, Tokyo, Sydney & Guadalajara
Global footprint: 200M unique users, 110+ countries, and more than 6,000 websites
Over 1 billion videos played per month and 2 billion analytic events per day
25% of U.S. online viewers watch video powered by Ooyala

TRUSTED VIDEO PARTNER
CUSTOMERS

STRATEGIC PARTNERS

We are a large Cassandra user
• 12 clusters ranging in size from 3 to 107 nodes
• Total of 28TB of data managed over ~220 nodes
• Over 2 billion C* column writes per day
• Powers all of our analytics infrastructure
• DSE/C* 1.0.x, 1.1.x, 1.2.6
• Large prod cluster is one of the biggest Cassandra installations

What problem are we trying to solve?
Lots of data, complex queries, answered really quickly... but how??

From mountains of raw data...

To nuggets of truth...
• Quickly
• Painlessly
• At scale?

Today: Precomputed aggregates
• Video metrics computed along several high cardinality dimensions
• Very fast lookups, but inflexible, and hard to change
• Most computed aggregates are never read
• What if we need more dynamic queries?
– Top content for mobile users in France
– Engagement curves for users who watched recommendations
– Data mining, trends, machine learning

The static - dynamic continuum
100% Precomputation
• Super fast lookups
• Inflexible, wasteful
• Best for 80% most common queries

100% Dynamic
• Always compute results from raw data
• Flexible but slow

Where we want to be
Partly dynamic
• Pre-aggregate most common queries
• Flexible, fast dynamic queries
• Easily generate many materialized views
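
The two ends of the continuum can be sketched in plain Java, without Spark or Cassandra. This is only an illustration: the PlayEvent class, its fields and the sample data are hypothetical, not Ooyala's schema. precomputePlaysPerVideo builds a lookup table once for a common query, while playsPerVideoWhere scans the raw events with a filter chosen at query time.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical event shape; field names are illustrative only.
final class PlayEvent {
    final String videoId;
    final String country;
    final String device;

    PlayEvent(String videoId, String country, String device) {
        this.videoId = videoId;
        this.country = country;
        this.device = device;
    }
}

final class PartlyDynamic {
    // Precomputation: one pass over the raw events builds a lookup table
    // for the common query "plays per video".
    static Map<String, Long> precomputePlaysPerVideo(List<PlayEvent> events) {
        Map<String, Long> plays = new HashMap<>();
        for (PlayEvent e : events) {
            plays.merge(e.videoId, 1L, Long::sum);
        }
        return plays;
    }

    // Dynamic query: scan the raw events with a filter chosen at query time.
    static Map<String, Long> playsPerVideoWhere(List<PlayEvent> events,
                                                Predicate<PlayEvent> filter) {
        Map<String, Long> plays = new HashMap<>();
        for (PlayEvent e : events) {
            if (filter.test(e)) {
                plays.merge(e.videoId, 1L, Long::sum);
            }
        }
        return plays;
    }

    // Made-up sample data.
    static List<PlayEvent> sampleEvents() {
        return Arrays.asList(
                new PlayEvent("v1", "FR", "mobile"),
                new PlayEvent("v1", "US", "desktop"),
                new PlayEvent("v2", "FR", "mobile"),
                new PlayEvent("v1", "FR", "mobile"));
    }

    public static void main(String[] args) {
        List<PlayEvent> events = sampleEvents();
        System.out.println(precomputePlaysPerVideo(events));
        // "Top content for mobile users in France" as a dynamic filter.
        System.out.println(playsPerVideoWhere(events,
                e -> e.country.equals("FR") && e.device.equals("mobile")));
    }
}
```

The precomputed table answers only the question it was built for; the filtered scan answers any question but touches every event each time, which is the trade-off the "partly dynamic" approach tries to balance.
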
Industry Trends
• Fast execution frameworks
– Impala, Drill, Presto
• In-memory databases
– VoltDB, Druid
• Streaming and real-time
• Higher-level, productive data frameworks
– Cascading, Hive, Pig

Why Spark and Shark?
“Lightning-fast in-memory cluster computing”

Introduction to Spark
• In-memory distributed computing framework
• Created by UC Berkeley AMP Lab in 2010
• Targeted problems that MR is bad at:
– Iterative algorithms (machine learning)
– Interactive data mining
• More general purpose than Hadoop MR
• Active contributions from ~ 15 companies

[Diagram: data-flow figure with the labels Data Source, Source 2, Map, Reduce, HDFS, map(), join(), cache() and transform]
Throughput: Memory is king
[Bar chart: throughput for C*, cold cache; C*, warm cache; and Spark RDD. Axis ticks at 0, 37500, 75000, 112500 and 150000; bar values are not recoverable]
6-node C*/DSE 1.1.9 cluster, Spark 0.7.0

Developers love it
• “I wrote my first aggregation job in 30 minutes”
• High level “distributed collections” API
• No Hadoop cruft
• Full power of Scala, Java, Python
• Interactive REPL shell
• EASY testing!!
• Low latency - quick development cycles

Spark word count example

val file = spark.textFile("hdfs://...")

// Split lines into words, pair each word with 1, then sum the counts per word
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = new Job(conf, "wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }

}

The Spark Ecosystem
• Bagel - Pregel on Spark
• Shark - HIVE on Spark
• Spark Streaming - discretized stream processing
• Spark
• Tachyon - in-memory caching DFS

Shark - HIVE on Spark
• 100% HiveQL compatible
• 10-100x faster than HIVE, answers in seconds
• Reuse UDFs, SerDe’s, StorageHandlers
• Can use DSE / CassandraFS for Metastore
• Easy Scala/Java integration via Spark - easier than writing UDFs

Our new analytics architecture
How we integrate Cassandra and Spark/Shark

From raw events to fast queries
[Diagram: raw events pass through Ingestion into the C* event store; Spark jobs build View 1, View 2 and View 3; the views are read by Spark for predefined queries and by Shark for ad-hoc HiveQL]
Our Spark/Shark/Cassandra Stack
[Diagram: a Spark Master and a Job Server sit above three nodes (Node1, Node2, Node3); each node stacks Shark, a Spark Worker, a SerDe, an InputFormat and Cassandra]
Event Store Cassandra schema
Event CF
Row key                  t0            t1            t2            t3            t4
2013-04-05T00:00Z#id1    {event0: a0}  {event1: a1}  {event2: a2}  {event3: a3}  {event4: a4}

EventAttr CF
Row key                  Columns
2013-04-05T00:00Z#id1    ipaddr:10.20.30.40:t1    videoId:45678:t1    providerId:500:t0
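
One way to read the schema above: the row key joins a time bucket and a user id with "#", Event CF columns are keyed by timestamp, and EventAttr CF column names pack attribute, value and timestamp together with ":". The sketch below only builds and parses those strings in plain Java. The helper names are hypothetical, and nested sorted maps stand in for column families; no Cassandra client is involved.

```java
import java.util.Map;
import java.util.TreeMap;

final class EventStoreSketch {
    // Row key as shown on the slide: "<time bucket>#<user id>".
    static String rowKey(String timeBucket, String userId) {
        return timeBucket + "#" + userId;
    }

    // EventAttr column name as shown on the slide: "<attribute>:<value>:<timestamp>".
    static String attrColumn(String attribute, String value, String timestamp) {
        return attribute + ":" + value + ":" + timestamp;
    }

    // Splits a row key back into {timeBucket, userId}.
    static String[] parseRowKey(String rowKey) {
        int sep = rowKey.lastIndexOf('#');
        return new String[] { rowKey.substring(0, sep), rowKey.substring(sep + 1) };
    }

    public static void main(String[] args) {
        // Event CF stand-in: row key -> (timestamp column -> event payload),
        // with columns kept sorted by name as in a wide row.
        Map<String, Map<String, String>> eventCf = new TreeMap<>();
        String key = rowKey("2013-04-05T00:00Z", "id1");
        eventCf.computeIfAbsent(key, k -> new TreeMap<>()).put("t0", "{event0: a0}");
        eventCf.computeIfAbsent(key, k -> new TreeMap<>()).put("t1", "{event1: a1}");
        System.out.println(eventCf);
        System.out.println(attrColumn("videoId", "45678", "t1"));
    }
}
```
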
Unpacking raw events
Raw event rows:
Row key                  t0                     t1
2013-04-05T00:00Z#id1    {video: 10, type:5}    {video: 11, type:1}
2013-04-05T00:00Z#id2    {video: 20, type:5}    {video: 25, type:9}

Unpacked, one row per event (built up one row at a time over four slides):
UserID    Video    Type
id1       10       5
id1       11       1
id2       20       5
id2       25       9
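
The unpacking step is a flatten: every column of a wide row becomes one (UserID, Video, Type) record, with the user id taken from the part of the row key after "#". A minimal plain-Java sketch follows. The Row class and the int-pair event encoding are assumptions made for the example; in the talk's stack this work is done by the InputFormat and SerDe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class UnpackEvents {
    // One flat output row: (userId, video, type).
    static final class Row {
        final String userId;
        final int video;
        final int type;

        Row(String userId, int video, int type) {
            this.userId = userId;
            this.video = video;
            this.type = type;
        }

        @Override
        public String toString() {
            return userId + "\t" + video + "\t" + type;
        }
    }

    // Each wide row maps a row key ("<time bucket>#<user id>") to its event columns;
    // every event is modelled here as an int pair {video, type}.
    static List<Row> unpack(Map<String, List<int[]>> wideRows) {
        List<Row> rows = new ArrayList<>();
        for (Map.Entry<String, List<int[]>> wideRow : wideRows.entrySet()) {
            String key = wideRow.getKey();
            String userId = key.substring(key.lastIndexOf('#') + 1);
            for (int[] event : wideRow.getValue()) {
                rows.add(new Row(userId, event[0], event[1]));
            }
        }
        return rows;
    }

    // The two wide rows from the slide, in insertion order.
    static Map<String, List<int[]>> sampleRows() {
        Map<String, List<int[]>> wideRows = new LinkedHashMap<>();
        wideRows.put("2013-04-05T00:00Z#id1",
                Arrays.asList(new int[] {10, 5}, new int[] {11, 1}));
        wideRows.put("2013-04-05T00:00Z#id2",
                Arrays.asList(new int[] {20, 5}, new int[] {25, 9}));
        return wideRows;
    }

    public static void main(String[] args) {
        System.out.println("UserID\tVideo\tType");
        for (Row row : unpack(sampleRows())) {
            System.out.println(row);
        }
    }
}
```
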
Options for Spark/Cassandra Integration
• Hadoop InputFormat
– ColumnFamilyInputFormat - reads all rows from 1 CF
– CqlPagingInputFormat, etc. - CQL3, secondary indexes
– Roll your own (join multiple CFs, etc)
• Spark native RDD
– sc.parallelize(rowkeys).flatMap(readColumns(_))
– JdbcRDD + Cassandra JDBC driver
• http://tuplejump.github.io/calliope/
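
The "Spark native RDD" option boils down to distributing a list of row keys and letting each task fetch the columns of its keys. The shape of that pattern can be tried in plain Java, with a parallel stream standing in for the RDD and an in-memory map standing in for the column family. Both are assumptions made to keep the sketch runnable, and readColumns is a hypothetical helper named after the one on the slide.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class NativeReadSketch {
    // Stand-in for a Cassandra column family: row key -> column values.
    // A real readColumns would issue a client read per row key instead.
    static final Map<String, List<String>> FAKE_CF = new HashMap<>();

    static {
        FAKE_CF.put("2013-04-05T00:00Z#id1",
                Arrays.asList("{video: 10, type:5}", "{video: 11, type:1}"));
        FAKE_CF.put("2013-04-05T00:00Z#id2",
                Arrays.asList("{video: 20, type:5}", "{video: 25, type:9}"));
    }

    // Hypothetical helper: returns the columns of one row, or nothing if the row is absent.
    static List<String> readColumns(String rowKey) {
        return FAKE_CF.getOrDefault(rowKey, Collections.<String>emptyList());
    }

    // Spread the row keys over worker threads, then fetch and flatten each row's columns.
    static List<String> readAll(List<String> rowKeys) {
        return rowKeys.parallelStream()
                .flatMap(key -> readColumns(key).stream())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> rowKeys = Arrays.asList(
                "2013-04-05T00:00Z#id1", "2013-04-05T00:00Z#id2");
        readAll(rowKeys).forEach(System.out::println);
    }
}
```

This approach needs the row keys up front, which is the main difference from an InputFormat that discovers rows by scanning token ranges.
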

Tips for InputFormat Development
• Know which target platforms you are developing for
– Which API to write against? New? Old? Both?
• Be prepared to spend time tuning your split computation
– Low latency jobs require fast splits
• Consider sorting row keys by token for data locality
• Implement predicate pushdown for HIVE SerDe’s
– Use your indexes to reduce size of dataset
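
The tip about sorting row keys by token can be illustrated as follows: order the keys by their partitioner token, then cut the ordered list into contiguous splits, so that each split covers a narrow token range that is likely to live on one node. This is a sketch under a loud assumption: token() below is a placeholder based on String.hashCode(), not the token function of any real Cassandra partitioner, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class TokenSplits {
    // Placeholder token function. Real Cassandra partitioners compute tokens
    // differently; this only keeps the sketch self-contained.
    static long token(String rowKey) {
        return rowKey.hashCode();
    }

    // Returns a copy of the keys ordered by ascending token.
    static List<String> sortByToken(List<String> rowKeys) {
        List<String> sorted = new ArrayList<>(rowKeys);
        sorted.sort(Comparator.comparingLong(TokenSplits::token));
        return sorted;
    }

    // Cuts the token-ordered keys into contiguous splits of at most splitSize keys.
    static List<List<String>> splits(List<String> rowKeys, int splitSize) {
        List<String> sorted = sortByToken(rowKeys);
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i += splitSize) {
            int end = Math.min(i + splitSize, sorted.size());
            result.add(new ArrayList<>(sorted.subList(i, end)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("c", "a", "e", "b", "d");
        System.out.println(splits(keys, 2));
    }
}
```
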

Example: OLAP processing
[Diagram: Spark jobs turn C* event rows, such as 2013-04-05T00:00Z#id1 with {video: 10, type:5} and 2013-04-05T00:00Z#id2 with {video: 20, type:5}, into OLAP aggregates; a union of the aggregates forms the cached materialized views, which serve Query 1: Plays by Provider and Query 2: Top content for mobile]
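
A rough plain-Java sketch of this flow: aggregate each batch of events into counts per dimension cell, union the partial aggregates by summing matching cells, then answer a query such as "Plays by Provider" from the merged cells rather than from the raw events. The two dimensions (provider id and device), the "|" cell encoding and the sample batches are made up for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class OlapSketch {
    // Counts plays per (providerId, device) cell. Each event is {providerId, device};
    // both dimensions are illustrative, not the deck's actual schema.
    static Map<String, Long> aggregate(List<String[]> events) {
        Map<String, Long> cells = new HashMap<>();
        for (String[] e : events) {
            cells.merge(e[0] + "|" + e[1], 1L, Long::sum);
        }
        return cells;
    }

    // Union of two partial aggregates: counts for the same cell are summed.
    static Map<String, Long> union(Map<String, Long> left, Map<String, Long> right) {
        Map<String, Long> merged = new HashMap<>(left);
        right.forEach((cell, count) -> merged.merge(cell, count, Long::sum));
        return merged;
    }

    // "Plays by Provider": roll the merged cells up along the provider dimension.
    static Map<String, Long> playsByProvider(Map<String, Long> cells) {
        Map<String, Long> plays = new HashMap<>();
        cells.forEach((cell, count) ->
                plays.merge(cell.substring(0, cell.indexOf('|')), count, Long::sum));
        return plays;
    }

    // Two made-up event batches, aggregated separately and then merged.
    static Map<String, Long> sampleCells() {
        Map<String, Long> batch1 = aggregate(Arrays.asList(
                new String[] {"500", "mobile"},
                new String[] {"500", "desktop"},
                new String[] {"600", "mobile"}));
        Map<String, Long> batch2 = aggregate(Arrays.asList(
                new String[] {"500", "mobile"},
                new String[] {"600", "desktop"}));
        return union(batch1, batch2);
    }

    public static void main(String[] args) {
        Map<String, Long> cells = sampleCells();
        System.out.println(cells);
        System.out.println(playsByProvider(cells));
    }
}
```
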
Performance numbers
Spark: C* -> OLAP aggregates, cold cache, 1.4 million events    130 seconds
C* -> OLAP aggregates, warmed cache                             20-30 seconds
OLAP aggregate query via Spark (56k records)                    60 ms

6-node C*/DSE 1.1.9 cluster, Spark 0.7.0

OLAP WorkFlow

[Diagram: the REST Job Server receives Aggregate and Query requests and returns Results; on the Spark Executors, an Aggregation Job reads from Cassandra and builds the Dataset, which the Query Jobs read]
Fault Tolerance
• Cached dataset lives in Java Heap only - what if process dies?
• Spark lineage - automatic recomputation from source, but this is expensive!
• Can also replicate cached dataset to survive single node failures
• Persist materialized views back to C*, then load into cache -- now recovery path is much faster
• Persistence also enables multiple processes to hold cached dataset
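
The recovery paths above can be sketched as a small cache wrapper: on a miss it first tries the persisted copy of the view, and only falls back to recomputing from source when nothing has been persisted yet. Everything here is a stand-in chosen for the example. A HashMap plays the role of the view written back to C*, a Supplier plays the role of Spark's lineage recomputation, and the ViewCache class is hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

final class ViewCache {
    private final Supplier<Map<String, Long>> recomputeFromSource; // expensive path
    private final Map<String, Long> persistedView = new HashMap<>(); // stand-in for C*
    private boolean persisted = false;
    private Map<String, Long> cached; // in-heap copy, lost if the process dies
    int recomputations = 0;

    ViewCache(Supplier<Map<String, Long>> recomputeFromSource) {
        this.recomputeFromSource = recomputeFromSource;
    }

    Map<String, Long> get() {
        if (cached == null) {
            if (persisted) {
                // Fast recovery: reload the materialized view from the persisted copy.
                cached = new HashMap<>(persistedView);
            } else {
                // Slow path: rebuild from source, then persist for next time.
                cached = new HashMap<>(recomputeFromSource.get());
                recomputations++;
                persistedView.putAll(cached);
                persisted = true;
            }
        }
        return cached;
    }

    // Drops the in-heap copy, as if the caching process had died.
    void simulateProcessDeath() {
        cached = null;
    }

    public static void main(String[] args) {
        ViewCache cache = new ViewCache(() -> Collections.singletonMap("v1", 3L));
        cache.get();
        cache.simulateProcessDeath();
        System.out.println(cache.get() + " recomputations=" + cache.recomputations);
    }
}
```

Because the persisted copy is shared, several processes could each load it into their own cache, which matches the last bullet above.
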

Demo time

Shark Demo
• Local shark node, 1 core, MBP
• How to create a table from C* using our inputformat
• Creating a cached Shark table
• Running fast queries

Creating a Shark Table from InputFormat

Creating a cached table

Querying cached table

THANK YOU
• @evanfchan
• ev@ooyala.com
• WE ARE HIRING!!

Spark: Under the hood
[Diagram: a Driver dispatches Map and Reduce tasks over Dataset partitions]
One executor process per node

 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...amber724300
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfAarwolf Industries LLC
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialJoão Esperancinha
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentMahmoud Rabie
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...BookNet Canada
 

Recently uploaded (20)

Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance Toolbox
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorial
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career Development
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
 

Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

  • 1. Real-time Analytics with Cassandra, Spark and Shark Saturday, December 7, 13
  • 2. Who is this guy • Staff Engineer, Compute and Data Services, Ooyala • Building multiple web-scale real-time systems on top of C*, Kafka, Storm, etc. • Scala/Akka guy • github.com/velvia • @evanfchan
  • 3. Agenda • Ooyala and Cassandra • What problem are we trying to solve? • Spark and Shark • Our Spark/Cassandra Architecture • Demo
  • 4. Cassandra at Ooyala • Who is Ooyala, and how we use Cassandra
  • 5. OOYALA • Powering personalized video experiences across all screens.
  • 6. COMPANY OVERVIEW • Founded in 2007 • Commercially launched in 2009 • 300 employees in Silicon Valley, LA, NYC, London, Paris, Tokyo, Sydney & Guadalajara • Global footprint: 200M unique users, 110+ countries, and more than 6,000 websites • Over 1 billion videos played per month and 2 billion analytic events per day • 25% of U.S. online viewers watch video powered by Ooyala
  • 7. TRUSTED VIDEO PARTNER • CUSTOMERS • STRATEGIC PARTNERS
  • 8. We are a large Cassandra user • 12 clusters ranging in size from 3 to 107 nodes • Total of 28TB of data managed over ~220 nodes • Over 2 billion C* column writes per day • Powers all of our analytics infrastructure • DSE/C* 1.0.x, 1.1.x, 1.2.6 • Large prod cluster is one of the biggest Cassandra installations
  • 9. What problem are we trying to solve? • Lots of data, complex queries, answered really quickly... but how??
  • 10. From mountains of raw data...
  • 11. To nuggets of truth... • Quickly • Painlessly • At scale?
  • 12. Today: Precomputed aggregates • Video metrics computed along several high cardinality dimensions • Very fast lookups, but inflexible, and hard to change • Most computed aggregates are never read • What if we need more dynamic queries? – Top content for mobile users in France – Engagement curves for users who watched recommendations – Data mining, trends, machine learning
  • 13. The static - dynamic continuum • 100% Precomputation – Super fast lookups – Inflexible, wasteful – Best for 80% most common queries • 100% Dynamic – Always compute results from raw data – Flexible but slow
  • 14. Where we want to be • Partly dynamic – Pre-aggregate most common queries – Flexible, fast dynamic queries – Easily generate many materialized views
  • 15. Industry Trends • Fast execution frameworks – Impala, Drill, Presto • In-memory databases – VoltDB, Druid • Streaming and real-time • Higher-level, productive data frameworks – Cascading, Hive, Pig
  • 16. Why Spark and Shark? • “Lightning-fast in-memory cluster computing”
  • 17. Introduction to Spark • In-memory distributed computing framework • Created by UC Berkeley AMP Lab in 2010 • Targeted problems that MR is bad at: – Iterative algorithms (machine learning) – Interactive data mining • More general purpose than Hadoop MR • Active contributions from ~15 companies
  • 19. Throughput: Memory is king • (Bar chart comparing throughput of C* with cold cache, C* with warm cache, and Spark RDD; axis from 0 to 150000) • 6-node C*/DSE 1.1.9 cluster, Spark 0.7.0
  • 20. Developers love it • “I wrote my first aggregation job in 30 minutes” • High level “distributed collections” API • No Hadoop cruft • Full power of Scala, Java, Python • Interactive REPL shell • EASY testing!! • Low latency - quick development cycles
  • 21. Spark word count example

        // Read the input, split lines into words, and sum a count of 1 per word
        val file = spark.textFile("hdfs://...")
        file.flatMap(line => line.split(" "))
            .map(word => (word, 1))
            .reduceByKey(_ + _)

    Hadoop MapReduce word count, for comparison:

        package org.myorg;

        import java.io.IOException;
        import java.util.*;

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.conf.*;
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.mapreduce.*;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

        public class WordCount {

            // Emits (word, 1) for every token in each input line
            public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
                private final static IntWritable one = new IntWritable(1);
                private Text word = new Text();

                public void map(LongWritable key, Text value, Context context)
                        throws IOException, InterruptedException {
                    String line = value.toString();
                    StringTokenizer tokenizer = new StringTokenizer(line);
                    while (tokenizer.hasMoreTokens()) {
                        word.set(tokenizer.nextToken());
                        context.write(word, one);
                    }
                }
            }

            // Sums the counts emitted for each word
            public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

                public void reduce(Text key, Iterable<IntWritable> values, Context context)
                        throws IOException, InterruptedException {
                    int sum = 0;
                    for (IntWritable val : values) {
                        sum += val.get();
                    }
                    context.write(key, new IntWritable(sum));
                }
            }

            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();

                Job job = new Job(conf, "wordcount");

                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);

                job.setMapperClass(Map.class);
                job.setReducerClass(Reduce.class);

                job.setInputFormatClass(TextInputFormat.class);
                job.setOutputFormatClass(TextOutputFormat.class);

                // args[0] is the input path, args[1] the output path
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));

                job.waitForCompletion(true);
            }

        }
  • 22. The Spark Ecosystem • Bagel - Pregel on Spark • Shark - HIVE on Spark • Spark Streaming - discretized stream processing • Spark • Tachyon - in-memory caching DFS
  • 23. Shark - HIVE on Spark • 100% HiveQL compatible • 10-100x faster than HIVE, answers in seconds • Reuse UDFs, SerDe’s, StorageHandlers • Can use DSE / CassandraFS for Metastore • Easy Scala/Java integration via Spark - easier than writing UDFs
  • 24. Our new analytics architecture • How we integrate Cassandra and Spark/Shark
  • 25. From raw events to fast queries • (Diagram: Raw Events pass through Ingestion into the C* event store; Spark builds View 1, View 2 and View 3 from it; the views serve Predefined queries through Spark and Ad-hoc HiveQL through Shark)
  • 26. Our Spark/Shark/Cassandra Stack • (Diagram: a Spark Master and a Job Server above three nodes, Node1 to Node3; each node stacks Shark, Spark Worker, SerDe, InputFormat and Cassandra)
  • 27. Event Store Cassandra schema • Event CF – row key 2013-04-05T00:00Z#id1 – columns t0 to t4, holding {event0: a0}, {event1: a1}, {event2: a2}, {event3: a3}, {event4: a4} • EventAttr CF – row key 2013-04-05T00:00Z#id1 – columns ipaddr:10.20.30.40:t1, videoId:45678:t1, providerId:500:t0
  • 28. Unpacking raw events • Row 2013-04-05T00:00Z#id1 – t0: {video: 10, type: 5} – t1: {video: 11, type: 1} • Row 2013-04-05T00:00Z#id2 – t0: {video: 20, type: 5} – t1: {video: 25, type: 9} • Unpacked (UserID, Video, Type) – (id1, 10, 5)
  • 29. Unpacking raw events • Same two rows as slide 28 • Unpacked (UserID, Video, Type) – (id1, 10, 5) – (id1, 11, 1)
  • 30. Unpacking raw events • Same two rows as slide 28 • Unpacked (UserID, Video, Type) – (id1, 10, 5) – (id1, 11, 1) – (id2, 20, 5)
  • 31. Unpacking raw events • Same two rows as slide 28 • Unpacked (UserID, Video, Type) – (id1, 10, 5) – (id1, 11, 1) – (id2, 20, 5) – (id2, 25, 9)
  • 32. Options for Spark/Cassandra Integration • Hadoop InputFormat – ColumnFamilyInputFormat - reads all rows from 1 CF – CqlPagingInputFormat, etc. - CQL3, secondary indexes – Roll your own (join multiple CFs, etc) • Spark native RDD – sc.parallelize(rowkeys).flatMap(readColumns(_)) – JdbcRDD + Cassandra JDBC driver • http://tuplejump.github.io/calliope/
  • 33. Tips for InputFormat Development • Know which target platforms you are developing for – Which API to write against? New? Old? Both? • Be prepared to spend time tuning your split computation – Low latency jobs require fast splits • Consider sorting row keys by token for data locality • Implement predicate pushdown for HIVE SerDe’s – Use your indexes to reduce size of dataset
  • 34. Example: OLAP processing • (Diagram: C* events, e.g. 2013-04-05T00:00Z#id1 with t0 {video: 10, type:5} and 2013-04-05T00:00Z#id2 with t0 {video: 20, type:5}, are turned by Spark into OLAP Aggregates; the aggregates are combined by a Union into Cached Materialized Views, which Spark queries) • Query 1: Plays by Provider • Query 2: Top content for mobile
  • 35. Performance numbers (Spark) • C* -> OLAP aggregates, cold cache, 1.4 million events: 130 seconds • C* -> OLAP aggregates, warmed cache: 20-30 seconds • OLAP aggregate query via Spark (56k records): 60 ms • 6-node C*/DSE 1.1.9 cluster, Spark 0.7.0
  • 36. OLAP WorkFlow • (Diagram: Aggregate and Query requests go to the REST Job Server, which returns Results; on the Spark Executors an Aggregation Job reads from Cassandra into a Dataset, and Query Jobs run against that Dataset)
  • 37. Fault Tolerance • Cached dataset lives in Java Heap only - what if process dies? • Spark lineage - automatic recomputation from source, but this is expensive! • Can also replicate cached dataset to survive single node failures • Persist materialized views back to C*, then load into cache -- now recovery path is much faster • Persistence also enables multiple processes to hold cached dataset
  • 39. Shark Demo • Local shark node, 1 core, MBP • How to create a table from C* using our inputformat • Creating a cached Shark table • Running fast queries
  • 40. Creating a Shark Table from InputFormat
  • 41. Creating a cached table
  • 43. THANK YOU • @evanfchan • ev@ooyala.com • WE ARE HIRING!!
  • 44. Spark: Under the hood • (Diagram: a Driver dispatching Map and Reduce tasks over Datasets) • One executor process per node
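
Slides 28 to 31 step through unpacking wide event rows into flat (UserID, Video, Type) records, and slide 34 aggregates such records. A minimal sketch of that idea follows, using plain Scala collections as a stand-in for Spark RDDs so that it runs without a cluster. The names `EventRow`, `FlatEvent`, `unpack` and `countByType` are hypothetical and do not come from the talk; the row key layout `<time bucket>#<user id>` is read off the slides, and counting by type is just one illustrative aggregate.

```scala
// Hypothetical stand-in for one wide row of the Event CF:
// the row key is "<time bucket>#<user id>", and each column (t0, t1, ...)
// holds the attributes of one event.
final case class EventRow(rowKey: String, columns: Seq[(String, Map[String, Int])])

// One flat record per event, matching the UserID / Video / Type table on the slides.
final case class FlatEvent(userId: String, video: Int, eventType: Int)

object UnpackEvents {
  // Take the part of the row key after the last '#', e.g. "id1".
  def userIdOf(rowKey: String): String =
    rowKey.substring(rowKey.lastIndexOf('#') + 1)

  // Explode one wide row into one FlatEvent per column.
  // Assumes every event carries "video" and "type"; a missing key throws.
  def unpack(row: EventRow): Seq[FlatEvent] =
    row.columns.map { case (_, attrs) =>
      FlatEvent(userIdOf(row.rowKey), attrs("video"), attrs("type"))
    }

  // Illustrative aggregate: number of events per event type.
  def countByType(events: Seq[FlatEvent]): Map[Int, Int] =
    events.groupBy(_.eventType).map { case (t, es) => t -> es.size }

  // The two sample rows shown on slides 28 to 31.
  val rows: Seq[EventRow] = Seq(
    EventRow("2013-04-05T00:00Z#id1",
      Seq("t0" -> Map("video" -> 10, "type" -> 5), "t1" -> Map("video" -> 11, "type" -> 1))),
    EventRow("2013-04-05T00:00Z#id2",
      Seq("t0" -> Map("video" -> 20, "type" -> 5), "t1" -> Map("video" -> 25, "type" -> 9)))
  )

  def main(args: Array[String]): Unit = {
    // flatMap plays the same role here as it would on an RDD of rows.
    val flat = rows.flatMap(unpack)
    flat.foreach(println)
    println(countByType(flat))
  }
}
```

The `flatMap` over rows is the same shape as the `sc.parallelize(rowkeys).flatMap(readColumns(_))` option on slide 32; on a real cluster the sequence of rows would be an RDD and the aggregate could then be cached.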
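
Slide 33 suggests sorting row keys by token for data locality. The sketch below shows one way that could look: keys are ordered by token and then cut into contiguous splits, so each split covers a narrow token range and is likely to be owned by the same nodes. The `token` function here is an assumption for illustration only (the absolute value of the key's MD5 digest, loosely modeled on an MD5-based partitioner); real code should use the token function of the cluster's configured partitioner. `TokenSplits`, `splits` and `keysPerSplit` are hypothetical names.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object TokenSplits {
  // Illustrative token: absolute value of the key's MD5 digest.
  // This is an assumption, not the cluster's actual partitioner.
  def token(rowKey: String): BigInt = {
    val digest = MessageDigest.getInstance("MD5")
      .digest(rowKey.getBytes(StandardCharsets.UTF_8))
    BigInt(digest).abs
  }

  // Sort keys by token, then cut the sorted sequence into fixed-size splits.
  // Each split then spans one contiguous token range.
  def splits(rowKeys: Seq[String], keysPerSplit: Int): Seq[Seq[String]] = {
    require(keysPerSplit > 0, "keysPerSplit must be positive")
    rowKeys.sortBy(token).grouped(keysPerSplit).toVector
  }

  def main(args: Array[String]): Unit = {
    val keys = Seq("2013-04-05T00:00Z#id1", "2013-04-05T00:00Z#id2", "2013-04-05T00:00Z#id3")
    splits(keys, 2).zipWithIndex.foreach { case (split, i) =>
      println(s"split $i: ${split.mkString(", ")}")
    }
  }
}
```

Fixed-size splits keep the computation cheap, which fits the slide's note that low latency jobs require fast splits; a production InputFormat would more likely align splits with the token ranges each node owns.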