Interactive Analytics With Spark And Cassandra
Evan Chan
Ooyala, Inc.
April 7th, 2014
Who is this guy?
• Staff Engineer, Compute and Data Services, Ooyala
• Building multiple web-scale real-time systems on top of C*, Kafka, Storm, etc.
• Scala/Akka guy
• Very excited by open source, big data projects
• @evanfchan
Agenda
• Cassandra at Ooyala
• What problem are we trying to solve?
• Spark and Shark
• Integrating Cassandra and Spark
• Our Spark/Cassandra Architecture
CASSANDRA AT OOYALA
OOYALA
Powering personalized video experiences across all screens.
CONFIDENTIAL - DO NOT DISTRIBUTE
COMPANY OVERVIEW
Founded in 2007
Commercial launch in 2009
230+ employees in Silicon Valley, LA, NYC, London, Paris, Tokyo, Sydney & Guadalajara
Global footprint: 200M unique users, 110+ countries, and more than 6,000 websites
Over 1 billion videos played per month and 2 billion analytic events per day
25% of U.S. online viewers watch video powered by Ooyala
TRUSTED VIDEO PARTNER
STRATEGIC PARTNERS
CUSTOMERS
We are a large Cassandra user
• 12 clusters ranging in size from 3 to 107 nodes
• Total of 28TB of data managed over ~220 nodes
• Powers all of our analytics infrastructure
• Traditional analytics aggregations
• Recommendations and trends
• DSE/C* 1.0.x, 1.1.x, 1.2.6
Becoming a big Spark user...
• Started investing in Spark at the beginning of 2013
• 2 teams of developers doing stuff with Spark
• Actively contributing to Spark developer community
• Deploying Spark to a large (>100 node) production cluster
• Spark community very active, huge amount of interest
WHAT PROBLEM ARE WE TRYING TO SOLVE?
From mountains of raw data...
To nuggets of truth...
• Quickly
• Painlessly
• At scale?
Today: Precomputed Aggregates
• Video metrics computed along several high cardinality dimensions
• Very fast lookups, but inflexible, and hard to change
• Most computed aggregates are never read
• What if we need more dynamic queries?
• Top content for mobile users in France
• Engagement curves for users who watched recommendations
• Data mining, trends, machine learning
The Static - Dynamic Continuum
100% Precomputation
• Super fast lookups
• Inflexible, wasteful
• Best for 80% most common queries
100% Dynamic
• Always compute results from raw data
• Flexible but slow
Where We Want To Be
Partly dynamic
• Pre-aggregate most common queries
• Flexible, fast dynamic queries
• Easily generate many materialized views
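A minimal sketch of the "partly dynamic" idea, in plain Python rather than Spark: one common query is pre-aggregated for fast lookups, while an ad-hoc query (top content for mobile users in France) is computed from raw events on demand. The event layout, the sample data and the function names are assumptions made for illustration, not Ooyala's actual schema.

```python
from collections import defaultdict

# Hypothetical raw play events: (country, device, video_id, plays)
RAW_EVENTS = [
    ("FR", "mobile", "v1", 3),
    ("FR", "mobile", "v2", 5),
    ("FR", "desktop", "v1", 2),
    ("US", "mobile", "v1", 7),
]

def precompute_plays_by_country(events):
    """Pre-aggregate one common query: total plays per country."""
    totals = defaultdict(int)
    for country, _device, _video, plays in events:
        totals[country] += plays
    return dict(totals)

def dynamic_top_content(events, country, device):
    """Answer an ad-hoc query by scanning the raw events."""
    totals = defaultdict(int)
    for c, d, video, plays in events:
        if c == country and d == device:
            totals[video] += plays
    # Most played videos first
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Built once, then served as a plain dictionary lookup
PLAYS_BY_COUNTRY = precompute_plays_by_country(RAW_EVENTS)

print(PLAYS_BY_COUNTRY["FR"])                            # precomputed lookup
print(dynamic_top_content(RAW_EVENTS, "FR", "mobile"))   # computed on demand
```

The precomputed table answers only the question it was built for; the dynamic path can answer any filter combination, at the cost of a scan.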
WHY SPARK?
Introduction To Spark
• In-memory distributed computing framework
• Created by UC Berkeley AMP Lab in 2010
• Targeted problems that MR is bad at:
–Iterative algorithms (machine learning)
–Interactive data mining
• More general purpose than Hadoop MR
• Top level Apache project
• Active contributions from Intel, Yahoo, lots of
Spark Vs Hadoop
[Diagram: Hadoop pipeline (HDFS, Map, Reduce, Map, Reduce) vs. Spark pipeline (Data Source, Source 2, map(), join(), cache(), transform)]
Throughput: Memory Is King
6-node C*/DSE 1.1.9 cluster, Spark 0.7.0
Spark cached RDD 10-50x faster than raw Cassandra
Developers Love It
• “I wrote my first aggregation job in 30 minutes”
• High level “distributed collections” API
• No Hadoop cruft
• Full power of Scala, Java, Python
• Interactive REPL shell
Spark Vs Hadoop Word Count
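// Spark (Scala); `spark` is the SparkContext
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))  // split each line into words
    .map(word => (word, 1))                          // pair every word with a count of 1
    .reduceByKey(_ + _)                              // sum the counts per word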
// Hadoop MapReduce (Java)
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Emits (word, 1) for every token of every input line
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Sums the counts emitted for each word
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = new Job(conf, "wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }

}
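To make the three operators in the Spark word count concrete, here is a rough single-machine imitation in plain Python. It is not Spark code and uses no Spark API; the function name `word_count` and the sample lines are assumptions for illustration only.

```python
from itertools import chain

def word_count(lines):
    # flatMap: split every line into words and flatten into one stream
    words = chain.from_iterable(line.split(" ") for line in lines)
    # map: pair each word with a count of 1
    pairs = ((word, 1) for word in words)
    # reduceByKey: sum the counts that share the same word
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

print(word_count(["to be or", "not to be"]))
```

In Spark the same three steps run per partition across the cluster, with a shuffle before the per-key sum.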
One Platform To Rule Them All
[Diagram labels: HIVE on Spark; Spark Streaming - discretized stream processing]
• SQL, Graph, ML, Streaming all in one framework
• Much higher code sharing/reuse
• Easy integration between components
• Fewer platforms == lower TCO
• Integration with Mesos, YARN helps share resources
Shark - Hive On Spark
• 100% HiveQL compatible
• 10-100x faster than HIVE, answers in seconds
• Reuse UDFs, SerDe’s, StorageHandlers
• Can use DSE / CassandraFS for Metastore
INTEGRATING CASSANDRA & SPARK
Our Spark/Shark/Cassandra Stack
[Diagram: Node1, Node2 and Node3 each run Cassandra, InputFormat, SerDe, Spark Worker and Shark; Spark Master and Job Server are shown alongside the nodes]
OPTIONS FOR READING FROM C*
• Hadoop InputFormat
– ColumnFamilyInputFormat - reads all rows from 1 CF
– CqlPagingInputFormat, etc. - CQL3, secondary indexes
– Roll your own (join multiple CFs, etc)
• Spark native RDD
– sc.parallelize(rowkeys).flatMap(readColumns(_))
– JdbcRdd + Cassandra JDBC driver
ColumnFamilyInputFormat
[Diagram: the same records shown on Node 1 and Node 2]
id       Video  Type
Record1  10     1
Record2  11     5
• Must read from all rows
• One CF only, not very flexible
Spark RDD
• RDD = Resilient Distributed Dataset
• Multiple partitions living on different nodes
• Each partition has records
[Diagram: Partition 1, Partition 2, Partition 3, Partition 4]
InputFormat Vs RDD
InputFormat | RDD
Supports Hadoop, HIVE, Spark, Shark | Spark / Shark only
Have to implement multiple classes - InputFormat, RecordReader, Writable, etc. Clunky API. | One class - simple API.
Two APIs, and often need to implement both (HIVE needs older...) | Just one API.
• You can easily use InputFormats in Spark using newAPIHadoopRDD().
• Writing a custom RDD could have saved us lots of time.
Skipping The InputFormat
[Diagram: Rowkey1 to Rowkey4 map to Row 1 data to Row 4 data, spread over Node 1 and Node 2]
sc.parallelize(rowkeys).flatMap(readColumns(_))
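The pattern above (distribute the row keys, then read each row where the key lands) can be imitated on one machine with a thread pool. This is only a sketch of the shape of the computation: `FAKE_CF`, `read_columns` and `parallel_flat_map` are hypothetical names, and the in-memory dictionary stands in for a real Cassandra read.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain

# Hypothetical stand-in for a Cassandra column family: rowkey -> {column: value}
FAKE_CF = {
    "Rowkey1": {"video": 10, "type": 1},
    "Rowkey2": {"video": 11, "type": 5},
}

def read_columns(rowkey):
    """Hypothetical reader: return (rowkey, column, value) triples for one row."""
    row = FAKE_CF.get(rowkey, {})
    return [(rowkey, col, val) for col, val in sorted(row.items())]

def parallel_flat_map(rowkeys, fn, workers=2):
    """Apply fn to each rowkey in parallel and flatten the per-row results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the input order of the rowkeys
        return list(chain.from_iterable(pool.map(fn, rowkeys)))

records = parallel_flat_map(["Rowkey1", "Rowkey2"], read_columns)
print(records)
```

Because only the requested row keys are read, this avoids the full scan that ColumnFamilyInputFormat requires.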
OUR CASSANDRA / SPARK ARCHITECTURE
From Raw Logs To Fast Queries
[Diagram: Raw Logs (x3) -> Process -> C* columnar storage -> Spark (x3) -> OLAP Table 1, OLAP Table 2, OLAP Table 3 -> Spark (predefined queries) and Shark (ad-hoc HiveQL)]
Why Cassandra Alone Isn’t Enough
• Over 30 million multi-dimensional fact table rows per day
• Materializing every possible answer isn’t close to possible
• Multi-dimensional filtering and grouping alone leads to many billions of possible answers
• Querying fact tables in Cassandra is too slow
• Reading millions or billions of random rows
• CQL doesn’t support grouping
Our Approach
• Use Cassandra to store the raw fact tables
• Optimize the schema for OLAP workloads
• Fast full table reads
• Easily read fewer columns
• Use Spark for fast random row access and fast distributed computation
An OLAP Schema for Cassandra
Metadata CF
2013-04-05T00:00-part-0: {columns: [“country”, “city”, “plays”]}
Index CF
[Diagram labels: 2013-04-05T00:00Z#id1, uuid-part-0, uuid-part-1]
Columns CF
        | Section 0   | Section 1        | Section 2
country | rows 0-9999 | rows 10000-19999 | rows ....
city    | rows 0-9999 | rows 10000-19999 | rows ....
• Optimized for: selective column loading, maximum throughput and compression
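A small sketch of the column-sectioned layout, assuming each column is stored as a list of fixed-size sections so that a query can load one column without touching the others. The names `to_columnar` and `read_column`, the sample rows, and the section size of 2 (the slide shows 10000 rows per section) are illustrative assumptions, and a dictionary stands in for the Columns CF.

```python
SECTION_SIZE = 2  # the slide shows 10000 rows per section; 2 keeps the demo small

def to_columnar(rows, columns, section_size=SECTION_SIZE):
    """Split rows into per-column sections: {column: [section0, section1, ...]}."""
    store = {col: [] for col in columns}
    for start in range(0, len(rows), section_size):
        chunk = rows[start:start + section_size]
        for col in columns:
            store[col].append([row[col] for row in chunk])
    return store

def read_column(store, column):
    """Load a single column by concatenating only that column's sections."""
    return [value for section in store[column] for value in section]

rows = [
    {"country": "FR", "city": "Paris", "plays": 3},
    {"country": "FR", "city": "Lyon", "plays": 1},
    {"country": "US", "city": "NYC", "plays": 7},
]
store = to_columnar(rows, ["country", "city", "plays"])

print(store["plays"])                    # sections of the "plays" column
print(sum(read_column(store, "plays")))  # total plays, reading one column only
```

Summing `plays` never reads the `country` or `city` sections, which is the "selective column loading" the slide refers to.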
OLAP Workflow
[Diagram: Dataset; an Aggregation Job and Query Jobs go through the REST Job Server to the Spark Executors, which read from Cassandra; outputs are labeled Aggregate and Query Result]
Querying Data In Spark
[Diagram: Partitions 1 to 4 spread over Node 1 and Node 2, three records each (record1 to record12)]
• Convert to Spark SQL / Shark Table, then query with Shark / Spark SQL
• rdd.group / .map / .sort / .join etc
• rdd.reduce / .collect / .top
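As a rough illustration of how a grouped query runs over partitioned records, the sketch below aggregates each partition independently and then merges the partial results, which is the general shape of a distributed group-and-reduce. It is plain Python, not the Spark API; the partitions, the (provider, plays) record layout and the function names are assumptions.

```python
from collections import Counter
from functools import reduce

# Hypothetical partitions of (provider, plays) records
partitions = [
    [("p1", 2), ("p2", 1), ("p1", 4)],
    [("p2", 3), ("p3", 5), ("p1", 1)],
]

def aggregate_partition(records):
    """Sum plays per provider within a single partition."""
    totals = Counter()
    for provider, plays in records:
        totals[provider] += plays
    return totals

def merge(a, b):
    """Combine two partial results into one."""
    return a + b

# Step 1: each partition is aggregated on its own (in Spark, on its own node)
partials = [aggregate_partition(p) for p in partitions]
# Step 2: the small partial results are merged into the final answer
plays_by_provider = reduce(merge, partials, Counter())
top = plays_by_provider.most_common(2)

print(dict(plays_by_provider))
print(top)
```

Only the per-partition totals cross partition boundaries, never the raw records.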
Fault Tolerance
• Cached dataset lives in Java Heap only - what if process dies?
• Spark lineage - automatic recomputation from source, but this is expensive!
• Can also replicate cached dataset to survive single node failures
• Persist materialized views back to C*, then load into cache -- now recovery path is much faster
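The lineage idea can be sketched as a cached dataset that remembers how it was derived and rebuilds itself when the cache is lost. This toy class is an assumption-laden illustration, not how Spark implements lineage; `CachedDataset`, `load_source` and `transform` are hypothetical names.

```python
class CachedDataset:
    """Toy cached dataset that remembers how to rebuild itself (its lineage)."""

    def __init__(self, load_source, transform):
        self.load_source = load_source   # how to re-read the source data
        self.transform = transform       # how to derive each record from it
        self.cache = None
        self.recomputations = 0

    def get(self):
        if self.cache is None:           # cache lost or never built
            self.cache = [self.transform(x) for x in self.load_source()]
            self.recomputations += 1
        return self.cache

    def lose_cache(self):
        """Simulate the process holding the cache dying."""
        self.cache = None

ds = CachedDataset(lambda: [1, 2, 3], lambda x: x * 10)
first = ds.get()
ds.lose_cache()
second = ds.get()   # rebuilt from the source via the recorded steps

print(first == second, ds.recomputations)
```

If `load_source` read an already materialized view instead of the raw source, the rebuild would skip the expensive transformation, which is the faster recovery path the last bullet describes.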
DEMO
Creating a Shark Table
Creating a Cached Table
Querying a Cached Table
THANK YOU
And YES, We’re HIRING!!
ooyala.com/careers
@evanfchan
Industry Trends
• Fast execution frameworks
• Impala
• In-memory databases
• VoltDB, Druid
• Streaming and real-time
• Higher-level, productive data
PERFORMANCE #’S
Spark: C* -> OLAP aggregates, cold cache, 1.4 million events | 130 seconds
C* -> OLAP aggregates, warmed cache | 20-30 seconds
OLAP aggregate query via Spark (56k records) | 60 ms
6-node C*/DSE 1.1.9 cluster, Spark 0.7.0
EXAMPLE: OLAP PROCESSING
[Diagram: C* events (t0; keys 2013-04-05T00:00Z#i... with values {video: 10, ... and {video: 20, ...) -> Spark (x3) -> OLAP Aggregates (x3) -> Union -> Cached Materialized Views -> Query 1: Plays by Provider; Query 2: Top content for mobile]

Contenu connexe

Tendances

Building Big Data Applications with Serverless Architectures - June 2017 AWS...
Building Big Data Applications with Serverless Architectures -  June 2017 AWS...Building Big Data Applications with Serverless Architectures -  June 2017 AWS...
Building Big Data Applications with Serverless Architectures - June 2017 AWS...Amazon Web Services
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSAmazon Web Services
 
Amazon Relational Database Service Deep Dive
Amazon Relational Database Service Deep DiveAmazon Relational Database Service Deep Dive
Amazon Relational Database Service Deep DiveAmazon Web Services
 
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & DataductAmazon Web Services
 
Event Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaEvent Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaDataWorks Summit
 
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech TalksTackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech TalksAmazon Web Services
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudAmazon Web Services
 
2016 Utah Cloud Summit: Big Data Architectural Patterns and Best Practices on...
2016 Utah Cloud Summit: Big Data Architectural Patterns and Best Practices on...2016 Utah Cloud Summit: Big Data Architectural Patterns and Best Practices on...
2016 Utah Cloud Summit: Big Data Architectural Patterns and Best Practices on...1Strategy
 
(BDT316) Offloading ETL to Amazon Elastic MapReduce
(BDT316) Offloading ETL to Amazon Elastic MapReduce(BDT316) Offloading ETL to Amazon Elastic MapReduce
(BDT316) Offloading ETL to Amazon Elastic MapReduceAmazon Web Services
 
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAmazon Web Services
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRAmazon Web Services
 
Kafka spark cassandra webinar feb 16 2016
Kafka spark cassandra   webinar feb 16 2016 Kafka spark cassandra   webinar feb 16 2016
Kafka spark cassandra webinar feb 16 2016 Hiromitsu Komatsu
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Chris Fregly
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon Web Services
 
(BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:I...
(BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:I...(BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:I...
(BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:I...Amazon Web Services
 
Getting Started with Amazon ElastiCache
Getting Started with Amazon ElastiCacheGetting Started with Amazon ElastiCache
Getting Started with Amazon ElastiCacheAmazon Web Services
 
Getting started with Amazon DynamoDB
Getting started with Amazon DynamoDBGetting started with Amazon DynamoDB
Getting started with Amazon DynamoDBAmazon Web Services
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSAmazon Web Services
 
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2Amazon Web Services
 

Tendances (20)

Building Big Data Applications with Serverless Architectures - June 2017 AWS...
Building Big Data Applications with Serverless Architectures -  June 2017 AWS...Building Big Data Applications with Serverless Architectures -  June 2017 AWS...
Building Big Data Applications with Serverless Architectures - June 2017 AWS...
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
 
Amazon Relational Database Service Deep Dive
Amazon Relational Database Service Deep DiveAmazon Relational Database Service Deep Dive
Amazon Relational Database Service Deep Dive
 
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
 
Event Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaEvent Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache Kafka
 
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech TalksTackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
 
Amazon Redshift Deep Dive
Amazon Redshift Deep Dive Amazon Redshift Deep Dive
Amazon Redshift Deep Dive
 
2016 Utah Cloud Summit: Big Data Architectural Patterns and Best Practices on...
2016 Utah Cloud Summit: Big Data Architectural Patterns and Best Practices on...2016 Utah Cloud Summit: Big Data Architectural Patterns and Best Practices on...
2016 Utah Cloud Summit: Big Data Architectural Patterns and Best Practices on...
 
(BDT316) Offloading ETL to Amazon Elastic MapReduce
(BDT316) Offloading ETL to Amazon Elastic MapReduce(BDT316) Offloading ETL to Amazon Elastic MapReduce
(BDT316) Offloading ETL to Amazon Elastic MapReduce
 
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Kafka spark cassandra webinar feb 16 2016
Kafka spark cassandra   webinar feb 16 2016 Kafka spark cassandra   webinar feb 16 2016
Kafka spark cassandra webinar feb 16 2016
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
 
(BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:I...
(BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:I...(BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:I...
(BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:I...
 
Getting Started with Amazon ElastiCache
Getting Started with Amazon ElastiCacheGetting Started with Amazon ElastiCache
Getting Started with Amazon ElastiCache
 
Getting started with Amazon DynamoDB
Getting started with Amazon DynamoDBGetting started with Amazon DynamoDB
Getting started with Amazon DynamoDB
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
 
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
 

Similaire à Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra

C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...DataStax Academy
 
Introduction to Cassandra and CQL for Java developers
Introduction to Cassandra and CQL for Java developersIntroduction to Cassandra and CQL for Java developers
Introduction to Cassandra and CQL for Java developersJulien Anguenot
 
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014NoSQLmatters
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...Yahoo Developer Network
 
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkCassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkEvan Chan
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
Machine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLMachine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLArnab Biswas
 
Introduction to Apache Geode (Cork, Ireland)
Introduction to Apache Geode (Cork, Ireland)Introduction to Apache Geode (Cork, Ireland)
Introduction to Apache Geode (Cork, Ireland)Anthony Baker
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09Chris Purrington
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Mac Moore
 
SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsDatabricks
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Landon Robinson
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache sparkUserReport
 
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Databricks
 

Similaire à Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra (20)

C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
 
Introduction to Cassandra and CQL for Java developers
Introduction to Cassandra and CQL for Java developersIntroduction to Cassandra and CQL for Java developers
Introduction to Cassandra and CQL for Java developers
 
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
 
Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14
 
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkCassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Machine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLMachine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkML
 
Introduction to Apache Geode (Cork, Ireland)
Introduction to Apache Geode (Cork, Ireland)Introduction to Apache Geode (Cork, Ireland)
Introduction to Apache Geode (Cork, Ireland)
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
 
SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache spark
 
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
 
Data Science
Data ScienceData Science
Data Science
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 

Plus de DataStax Academy

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftDataStax Academy
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseDataStax Academy
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraDataStax Academy
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsDataStax Academy
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingDataStax Academy
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackDataStax Academy
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache CassandraDataStax Academy
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready CassandraDataStax Academy
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonDataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1DataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2DataStax Academy
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First ClusterDataStax Academy
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with DseDataStax Academy
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraDataStax Academy
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseDataStax Academy
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraDataStax Academy
 

Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra

  • 1. Interactive Analytics with Spark and Cassandra. Evan Chan, Ooyala, Inc. April 7th, 2014
  • 2. Who is this guy? • Staff Engineer, Compute and Data Services, Ooyala • Building multiple web-scale real-time systems on top of C*, Kafka, Storm, etc. • Scala/Akka guy • Very excited by open source, big data projects • @evanfchan
  • 3. Agenda • Cassandra at Ooyala • What problem are we trying to solve? • Spark and Shark • Integrating Cassandra and Spark • Our Spark/Cassandra Architecture
  • 6. COMPANY OVERVIEW • Founded in 2007 • Commercially launched in 2009 • 230+ employees in Silicon Valley, LA, NYC, London, Paris, Tokyo, Sydney & Guadalajara • Global footprint: 200M unique users, 110+ countries, and more than 6,000 websites • Over 1 billion videos played per month and 2 billion analytic events per day • 25% of U.S. online viewers watch video powered by Ooyala
  • 7. TRUSTED VIDEO PARTNER • STRATEGIC PARTNERS • CUSTOMERS
  • 8. We are a large Cassandra user • 12 clusters ranging in size from 3 to 107 nodes • Total of 28TB of data managed over ~220 nodes • Powers all of our analytics infrastructure • Traditional analytics aggregations • Recommendations and trends • DSE/C* 1.0.x, 1.1.x, 1.2.6
  • 9. Becoming a big Spark user... • Started investing in Spark beginning of 2013 • 2 teams of developers doing stuff with Spark • Actively contributing to Spark developer community • Deploying Spark to a large (>100 node) production cluster • Spark community very active, huge amount of interest
  • 10. WHAT PROBLEM ARE WE TRYING TO SOLVE?
  • 11. From mountains of raw data...
  • 12. To nuggets of truth... • Quickly • Painlessly • At scale?
  • 13. Today: Precomputed Aggregates • Video metrics computed along several high cardinality dimensions • Very fast lookups, but inflexible, and hard to change • Most computed aggregates are never read • What if we need more dynamic queries? • Top content for mobile users in France • Engagement curves for users who watched recommendations • Data mining, trends, machine learning
  • 14. The Static - Dynamic Continuum • 100% Precomputation: super fast lookups; inflexible, wasteful; best for 80% most common queries • 100% Dynamic: always compute results from raw data; flexible but slow
  • 15. Where We Want To Be Partly dynamic • Pre-aggregate most common queries • Flexible, fast dynamic queries • Easily generate many materialized views
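The two ends of the continuum on slides 13 to 15 can be sketched with plain Scala collections. This is an illustration only, not Ooyala's code: the `Play` record, its fields, and the sample data are hypothetical. `precomputePlaysPerVideo` stands for a fixed aggregate built ahead of time, while `topContent` answers an arbitrary question (such as "top content for mobile users in France") directly from raw events.

```scala
object AggregateSketch {
  // Hypothetical raw event: one play of a video, tagged with two dimensions.
  final case class Play(country: String, device: String, video: Int)

  // Precomputed side: plays per video, built once ahead of time.
  // Lookups are fast, but only this one question can be answered.
  def precomputePlaysPerVideo(events: Seq[Play]): Map[Int, Int] =
    events.groupBy(_.video).map { case (video, plays) => video -> plays.size }

  // Dynamic side: answer any filter straight from the raw events.
  // Ties are broken by video id so the result is deterministic.
  def topContent(events: Seq[Play], keep: Play => Boolean, n: Int): Seq[(Int, Int)] =
    events.filter(keep)
      .groupBy(_.video)
      .map { case (video, plays) => video -> plays.size }
      .toSeq
      .sortBy { case (video, count) => (-count, video) }
      .take(n)

  // Made-up sample data.
  val sample: Seq[Play] = Seq(
    Play("FR", "mobile", 10),
    Play("FR", "mobile", 10),
    Play("FR", "desktop", 20),
    Play("US", "mobile", 20),
    Play("FR", "mobile", 30)
  )
}
```

For example, `AggregateSketch.topContent(AggregateSketch.sample, p => p.country == "FR" && p.device == "mobile", 2)` ranks videos for French mobile users without any aggregate having been prepared for that question.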
  • 17. Introduction To Spark • In-memory distributed computing framework • Created by UC Berkeley AMP Lab in 2010 • Targeted problems that MR is bad at: –Iterative algorithms (machine learning) –Interactive data mining • More general purpose than Hadoop MR • Top level Apache project • Active contributions from Intel, Yahoo, lots of
  • 18. Spark Vs Hadoop [Diagram labels. Hadoop: HDFS, Map, Reduce, Map, Reduce. Spark: Data Source, Source 2, map(), join(), cache(), transform]
  • 19. Throughput: Memory Is King • Spark cached RDD 10-50x faster than raw Cassandra • Setup: 6-node C*/DSE 1.1.9 cluster, Spark 0.7.0
  • 20. Developers Love It • “I wrote my first aggregation job in 30 minutes” • High level “distributed collections” API • No Hadoop cruft • Full power of Scala, Java, Python • Interactive REPL shell
  • 21. Spark Vs Hadoop Word Count
    Spark (Scala):
        val file = spark.textFile("hdfs://...")
        file.flatMap(line => line.split(" "))
            .map(word => (word, 1))
            .reduceByKey(_ + _)
    Hadoop MapReduce (Java):
        package org.myorg;

        import java.io.IOException;
        import java.util.*;

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.conf.*;
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.mapreduce.*;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

        public class WordCount {

          // Emits (word, 1) for every token in the input line.
          public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
              String line = value.toString();
              StringTokenizer tokenizer = new StringTokenizer(line);
              while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
              }
            }
          }

          // Sums the counts emitted for each word.
          public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                sum += val.get();
              }
              context.write(key, new IntWritable(sum));
            }
          }

          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            Job job = new Job(conf, "wordcount");

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);

            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.waitForCompletion(true);
          }

        }
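To try the shape of the Spark word count on slide 21 without a cluster, the same pipeline can be approximated on ordinary Scala collections. This is a local sketch, not Spark itself: `reduceByKey` below is a hypothetical stand-in written for this example that mimics what Spark's operation of the same name does, namely combining all values that share a key.

```scala
object WordCountSketch {
  // Local stand-in for reduceByKey: group pairs by key, then fold each
  // group's values together with the supplied combine function.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(combine: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (key, group) =>
      key -> group.map(_._2).reduce(combine)
    }

  // Same three steps as the slide: split lines into words, pair each word
  // with 1, then sum the ones per word.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    reduceByKey(
      lines
        .flatMap(line => line.split(" "))
        .filter(_.nonEmpty) // ignore empty tokens from repeated spaces
        .map(word => (word, 1))
    )(_ + _)
}
```

In Spark the same three calls run per partition across the cluster; here everything happens in one process, which is enough to see why the program is so much shorter than the MapReduce version.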
  • 22. One Platform To Rule Them All HIVE on Spark Spark Streaming - discretized stream processing • SQL, Graph, ML, Streaming all in one framework • Much higher code sharing/reuse • Easy integration between components • Fewer platforms == lower TCO • Integration with Mesos, YARN helps share resources
  • 23. Shark - Hive On Spark • 100% HiveQL compatible • 10-100x faster than HIVE, answers in seconds • Reuse UDFs, SerDe’s, StorageHandlers • Can use DSE / CassandraFS for Metastore
  • 26. OPTIONS FOR READING FROM C* • Hadoop InputFormat – ColumnFamilyInputFormat - reads all rows from 1 CF – CqlPagingInputFormat, etc. - CQL3, secondary indexes – Roll your own (join multiple CFs, etc) • Spark native RDD – sc.parallelize(rowkeys).flatMap(readColumns(_)) – JdbcRDD + Cassandra JDBC driver
  • 27. ColumnFamilyInputFormat • Must read from all rows • One CF only, not very flexible [Table: columns id, video, type. Record1: video 10, type 1. Record2: video 11, type 5]
  • 28. Spark RDD • RDD = Resilient Distributed Dataset • Multiple partitions living on different nodes • Each partition has records [Diagram: Node 1 and Node 2 holding Partition 1, Partition 2, Partition 3, Partition 4]
  • 29. InputFormat Vs RDD • InputFormat: supports Hadoop, HIVE, Spark, Shark. RDD: Spark / Shark only. • InputFormat: have to implement multiple classes (InputFormat, RecordReader, Writable, etc.); clunky API. RDD: one class, simple API. • InputFormat: two APIs, and often need to implement both (HIVE needs older...). RDD: just one API. • You can easily use InputFormats in Spark using newAPIHadoopRDD(). • Writing a custom RDD could have saved us lots of time.
  • 30. Skipping The InputFormat • sc.parallelize(rowkeys).flatMap(readColumns(_)) [Diagram: Rowkey1 to Rowkey4, spread over Node 1 and Node 2, each mapped to Row 1 data to Row 4 data]
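The pattern on slide 30, sc.parallelize(rowkeys).flatMap(readColumns(_)), distributes the row keys and lets each worker fetch its own rows. The sketch below imitates that idea locally, under loud assumptions: `columnFamily` is an in-memory map standing in for a Cassandra column family, `readColumns` is a hypothetical helper (the slides do not show its body), and `partition` only approximates how keys would be sliced across partitions.

```scala
object RowKeyReadSketch {
  type Column = (String, String) // (column name, value)

  // Hypothetical in-memory stand-in for a Cassandra column family.
  val columnFamily: Map[String, Seq[Column]] = Map(
    "Rowkey1" -> Seq("video" -> "10", "type" -> "1"),
    "Rowkey2" -> Seq("video" -> "11", "type" -> "5"),
    "Rowkey3" -> Seq("video" -> "12", "type" -> "1")
  )

  // Hypothetical readColumns: fetch one row's columns, tagged with its key.
  // A real version would issue a Cassandra read here instead.
  def readColumns(rowKey: String): Seq[(String, Column)] =
    columnFamily.getOrElse(rowKey, Seq.empty).map(column => rowKey -> column)

  // Split the keys into roughly equal slices, one per simulated partition.
  // Assumes slices > 0.
  def partition(rowKeys: Seq[String], slices: Int): List[Seq[String]] = {
    val size = math.max(1, math.ceil(rowKeys.size.toDouble / slices).toInt)
    rowKeys.grouped(size).toList
  }

  // Each slice reads only its own rows; the results are concatenated.
  def readAll(rowKeys: Seq[String], slices: Int): Seq[(String, Column)] =
    partition(rowKeys, slices).flatMap(slice => slice.flatMap(readColumns(_)))
}
```

Because only the requested keys are read, this avoids the full scan over one column family that ColumnFamilyInputFormat requires.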
  • 31. OUR CASSANDRA / SPARK ARCHITECTURE
  • 32. From Raw Logs To Fast Queries [Diagram labels: Raw Logs (x3), Process, C* columnar storage, Spark (x3), OLAP Table 1, OLAP Table 2, OLAP Table 3, Spark: predefined queries, Shark: ad-hoc HiveQL]
  • 33. Why Cassandra Alone Isn’t Enough • Over 30 million multidimensional fact table rows per day • Materializing every possible answer isn’t close to possible • Multidimensional filtering and grouping alone leads to many billions of possible answers • Querying fact tables in Cassandra is too slow • Reading millions or billions of random rows • CQL doesn’t support grouping
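A rough count shows why materializing every answer (slide 33) is out of reach. The sketch assumes a simplified model: each dimension is either left unfiltered or pinned to one of its values, so a dimension with c distinct values contributes c + 1 choices, and the choices multiply. The dimension names and cardinalities are hypothetical, chosen only to illustrate the growth; they are not Ooyala's figures.

```scala
object AnswerSpaceSketch {
  // Each dimension contributes (cardinality + 1) choices:
  // one per distinct value, plus "not filtered on this dimension".
  def possibleAnswers(cardinalities: Seq[Long]): BigInt =
    cardinalities.foldLeft(BigInt(1))((acc, c) => acc * BigInt(c + 1))

  // Hypothetical cardinalities, for illustration only.
  val dimensions: Seq[(String, Long)] = Seq(
    "country" -> 200L,
    "device" -> 10L,
    "provider" -> 1000L,
    "video" -> 100000L
  )

  // Number of distinct filter combinations for a single metric.
  val total: BigInt = possibleAnswers(dimensions.map(_._2))
}
```

Even with only four dimensions the product lands in the hundreds of billions, before grouping choices or multiple metrics are taken into account.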
  • 34. Our Approach • Use Cassandra to store the raw fact tables • Optimize the schema for OLAP workloads • Fast full table reads • Easily read fewer columns • Use Spark for fast random row access and fast distributed computation
  • 35. An OLAP Schema for Cassandra • Optimized for: selective column loading, maximum throughput and compression [Diagram labels. Index CF: 2013-04-05T00:00Z#id1; uuid-part-0, uuid-part-1. Columns CF: Section 0, Section 1, Section 2; country: rows 0-9999, rows 10000-19999, rows ....; city: rows 0-9999, rows 10000-19999, rows ..... Metadata CF: 2013-04-05T00:00-part-0; Metadata: {columns: [“country”, “city”, “plays”]}]
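The columnar idea behind slide 35 can be sketched as follows: rows are cut into fixed-size sections, and each column of each section is stored as its own chunk, so a query loads only the columns it needs. This is a toy model with hypothetical names (`ColumnStore`, `toColumnar`, `loadColumn`); the real schema stores compressed chunks in Cassandra column families and is not shown in the slides in enough detail to reproduce.

```scala
object ColumnarSketch {
  type Row = Map[String, String]
  // Key: (column name, section number).
  // Value: that column's values for the rows in the section.
  type ColumnStore = Map[(String, Int), Vector[String]]

  // Split rows into fixed-size sections and store every column of every
  // section as a separate chunk. Missing values become empty strings.
  def toColumnar(rows: Seq[Row], columns: Seq[String], sectionSize: Int): ColumnStore =
    rows.grouped(sectionSize).zipWithIndex.flatMap { case (sectionRows, section) =>
      columns.map { column =>
        val values = sectionRows.map(row => row.getOrElse(column, "")).toVector
        ((column, section), values)
      }
    }.toMap

  // Selective column loading: touch only the chunks of the requested
  // column, in section order, and ignore every other column.
  def loadColumn(store: ColumnStore, column: String): Vector[String] =
    store.keys.toVector
      .filter { case (name, _) => name == column }
      .sortBy { case (_, section) => section }
      .flatMap(key => store(key))
}
```

With the slide's section size of 10,000 rows, a query over "country" would read one chunk per section and never touch the "city" or "plays" chunks.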
  • 37. Querying Data In Spark • rdd.group / .map / .sort / .join etc • rdd.reduce / .collect / .top • Convert to Spark SQL / Shark Table • Shark / Spark SQL [Diagram: Node 1 and Node 2 holding Partition 1 (record1, record2, record3), Partition 2 (record4, record5, record6), Partition 3 (record7, record8, record9), Partition 4 (record10, record11, record12)]
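Operations such as reduce and top on slide 37 work partition by partition: each partition produces a small partial result, and only those partial results are merged. The sketch below shows that two-step shape on local collections; nested sequences stand in for partitions, and the code is an illustration of the idea rather than Spark's actual implementation.

```scala
object PartitionedTopSketch {
  // Each inner Seq stands in for one partition of records on some node.
  def top(partitions: Seq[Seq[Int]], k: Int): Seq[Int] = {
    // Step 1: every partition keeps only its own k largest records.
    val perPartition = partitions.map(_.sorted(Ordering[Int].reverse).take(k))
    // Step 2: merge the small partial results and keep the overall k largest.
    perPartition.flatten.sorted(Ordering[Int].reverse).take(k)
  }

  // Same shape for a sum: partial sums per partition, then a final sum.
  def sum(partitions: Seq[Seq[Int]]): Int =
    partitions.map(_.sum).sum
}
```

Only k values per partition travel to the merge step, which is why such queries stay fast even when the cached dataset is large.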
  • 38. Fault Tolerance • Cached dataset lives in Java Heap only - what if process dies? • Spark lineage - automatic recomputation from source, but this is expensive! • Can also replicate cached dataset to survive single node failures • Persist materialized views back to C*, then load into cache -- now recovery path is much faster
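The lineage idea on slide 38 can be modeled in a few lines: a cached dataset remembers the function that produced it, so a lost cache is rebuilt by running that function again. The `Cached` class is a hypothetical toy written for this example, not a Spark type; it counts recomputations to make the cost of recovery visible.

```scala
object LineageSketch {
  // A cached value that remembers how to rebuild itself (its lineage).
  final class Cached[A](recompute: () => A) {
    private var cache: Option[A] = None
    var recomputations: Int = 0

    // Serve from the cache when possible; otherwise rebuild from source.
    def get: A = cache match {
      case Some(value) => value
      case None =>
        recomputations += 1
        val value = recompute()
        cache = Some(value)
        value
    }

    // Simulates the process that holds the cached data dying.
    def lose(): Unit = {
      cache = None
    }
  }
}
```

If `recompute` has to go all the way back to the raw events, every loss is expensive. Pointing it at a materialized view persisted in C*, as the last bullet suggests, keeps the same recovery mechanism but shortens the path.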
  • 43. THANK YOU And YES, We’re HIRING!! ooyala.com/careers @evanfchan
  • 44. Industry Trends • Fast execution frameworks • Impala • In-memory databases • VoltDB, Druid • Streaming and real-time • Higher-level, productive data
  • 45. PERFORMANCE #’S (6-node C*/DSE 1.1.9 cluster, Spark 0.7.0) • Spark: C* -> OLAP aggregates, cold cache, 1.4 million events: 130 seconds • C* -> OLAP aggregates, warmed cache: 20-30 seconds • OLAP aggregate query via Spark (56k records): 60 ms
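The figures on slide 45 imply a few ratios that are easy to work out. The snippet below only does arithmetic on the numbers quoted on the slide (130 seconds cold, 20-30 seconds warm, 1.4 million events); it adds no new measurements.

```scala
object SpeedupSketch {
  val coldSeconds = 130.0        // C* -> OLAP aggregates, cold cache
  val warmSeconds = (20.0, 30.0) // C* -> OLAP aggregates, warmed cache (range)
  val events = 1.4e6             // events processed in the cold-cache run

  // Warm cache versus cold cache: between 130 / 30 and 130 / 20.
  val speedupLow: Double = coldSeconds / warmSeconds._2
  val speedupHigh: Double = coldSeconds / warmSeconds._1

  // Cold-cache throughput in events per second.
  val coldEventsPerSecond: Double = events / coldSeconds
}
```

So a warmed cache is roughly 4 to 6.5 times faster for building the aggregates, and the cold run processes on the order of ten thousand events per second.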
  • 46. EXAMPLE: OLAP PROCESSING [Diagram labels: t0; C* events (row keys 2013-04-05T00:00Z#i, values {video: 10, and {video: 20,); Spark (x3); OLAP Aggregates (x3); Union; Cached Materialized Views; Query 1: Plays by Provider; Query 2: Top content for mobile]