What You Will Learn At This Meetup:
• Review of Cassandra analytics landscape: Hadoop & HIVE
• Custom input formats to extract data from Cassandra
• How Spark & Shark increase query speed & productivity over standard solutions
Abstract
This session covers our experience using the Spark and Shark frameworks to run real-time queries on top of Cassandra data. We will start by surveying the current Cassandra analytics landscape, including Hadoop and HIVE, and touch on the use of custom input formats to extract data from Cassandra. We will then dive into Spark and Shark, two memory-based cluster computing frameworks, and how they enable often dramatic improvements in query speed and productivity over today's standard solutions.
About Evan Chan
Evan Chan is a Software Engineer at Ooyala. In his own words: I love to design, build, and improve bleeding edge distributed data and backend systems using the latest in open source technologies. I am a big believer in GitHub, open source, and meetups, and have given talks at conferences such as the Cassandra Summit 2013.
South Bay Cassandra Meetup URL: http://www.meetup.com/DataStax-Cassandra-South-Bay-Users/events/147443722/
2. Who is this guy
• Staff Engineer, Compute and Data Services, Ooyala
• Building multiple web-scale real-time systems on top of C*, Kafka, Storm, etc.
• Scala/Akka guy
• github.com/velvia
• @evanfchan
Saturday, December 7, 13
3. Agenda
• Ooyala and Cassandra
• What problem are we trying to solve?
• Spark and Shark
• Our Spark/Cassandra Architecture
• Demo
6. COMPANY OVERVIEW
Founded in 2007
Commercially launched in 2009
300 employees in Silicon Valley, LA, NYC, London, Paris, Tokyo, Sydney & Guadalajara
Global footprint, 200M unique users, 110+ countries, and more than 6,000 websites
Over 1 billion videos played per month and 2 billion analytic events per day
25% of U.S. online viewers watch video powered by Ooyala
8. We are a large Cassandra user
• 12 clusters ranging in size from 3 to 107 nodes
• Total of 28TB of data managed over ~220 nodes
• Over 2 billion C* column writes per day
• Powers all of our analytics infrastructure
• DSE/C* 1.0.x, 1.1.x, 1.2.6
• Large prod cluster is one of the biggest Cassandra installations
9. What problem are we trying to solve?
Lots of data, complex queries, answered really quickly... but how??
11. To nuggets of truth...
• Quickly
• Painlessly
• At scale?
12. Today: Precomputed aggregates
• Video metrics computed along several high cardinality dimensions
• Very fast lookups, but inflexible, and hard to change
• Most computed aggregates are never read
• What if we need more dynamic queries?
– Top content for mobile users in France
– Engagement curves for users who watched recommendations
– Data mining, trends, machine learning
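The trade-off on this slide can be made concrete with a small sketch. It is illustrative only: the `PlayEvent` class, its fields, and the sample data are hypothetical and are not Ooyala's schema. It contrasts a precomputed aggregate (one fixed dimension, fast lookup) with a dynamic query such as "top content for mobile users in France" that scans raw events.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AggregateSketch {
    // Hypothetical event shape; the field names are assumptions for illustration.
    static class PlayEvent {
        final String videoId;
        final String country;
        final String device;

        PlayEvent(String videoId, String country, String device) {
            this.videoId = videoId;
            this.country = country;
            this.device = device;
        }
    }

    // Precomputation: play counts per video, built once ahead of time.
    // Lookups are fast, but only this one dimension can be answered.
    static Map<String, Integer> precomputePlaysByVideo(List<PlayEvent> events) {
        Map<String, Integer> counts = new HashMap<>();
        for (PlayEvent e : events) {
            counts.merge(e.videoId, 1, Integer::sum);
        }
        return counts;
    }

    // Dynamic query: scan the raw events with an arbitrary filter.
    // Flexible, but every query pays for a full pass over the data.
    static String topContent(List<PlayEvent> events, String country, String device) {
        Map<String, Integer> counts = new HashMap<>();
        for (PlayEvent e : events) {
            if (e.country.equals(country) && e.device.equals(device)) {
                counts.merge(e.videoId, 1, Integer::sum);
            }
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best; // null when no event matches the filter
    }

    // Made-up sample data.
    static List<PlayEvent> sampleEvents() {
        return Arrays.asList(
            new PlayEvent("v1", "FR", "mobile"),
            new PlayEvent("v2", "FR", "mobile"),
            new PlayEvent("v2", "FR", "mobile"),
            new PlayEvent("v1", "US", "desktop"),
            new PlayEvent("v1", "US", "mobile"));
    }

    public static void main(String[] args) {
        List<PlayEvent> events = sampleEvents();
        System.out.println(precomputePlaysByVideo(events).get("v1"));
        System.out.println(topContent(events, "FR", "mobile"));
    }
}
```

The precomputed map cannot answer the France/mobile question at all unless that dimension was anticipated, which is the inflexibility the slide describes.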
13. The static - dynamic continuum
100% Precomputation
• Super fast lookups
• Inflexible, wasteful
• Best for 80% most common queries
100% Dynamic
• Always compute results from raw data
• Flexible but slow
14. Where we want to be
Partly dynamic
• Pre-aggregate most common queries
• Flexible, fast dynamic queries
• Easily generate many materialized views
15. Industry Trends
• Fast execution frameworks
– Impala, Drill, Presto
• In-memory databases
– VoltDB, Druid
• Streaming and real-time
• Higher-level, productive data frameworks
– Cascading, Hive, Pig
16. Why Spark and Shark?
“Lightning-fast in-memory cluster computing”
17. Introduction to Spark
• In-memory distributed computing framework
• Created by UC Berkeley AMP Lab in 2010
• Targeted problems that MR is bad at:
– Iterative algorithms (machine learning)
– Interactive data mining
• More general purpose than Hadoop MR
• Active contributions from ~15 companies
19. Throughput: Memory is king
[Bar chart: throughput of C* (cold cache), C* (warm cache), and Spark RDD; axis ticks 0, 37500, 75000, 112500, 150000]
6-node C*/DSE 1.1.9 cluster, Spark 0.7.0
20. Developers love it
• “I wrote my first aggregation job in 30 minutes”
• High level “distributed collections” API
• No Hadoop cruft
• Full power of Scala, Java, Python
• Interactive REPL shell
• EASY testing!!
• Low latency - quick development cycles
21. Spark word count example
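val file = spark.textFile("hdfs://...")   // one record per line of text
file.flatMap(line => line.split(" "))     // split each line into words
    .map(word => (word, 1))               // pair each word with a count of 1
    .reduceByKey(_ + _)                   // sum the counts per word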
// For comparison: the same word count as a Hadoop MapReduce job
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in the input line
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = new Job(conf, "wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }

}
22. The Spark Ecosystem
• Bagel - Pregel on Spark
• Shark - HIVE on Spark
• Spark Streaming - discretized stream processing
• Spark
• Tachyon - in-memory caching DFS
23. Shark - HIVE on Spark
• 100% HiveQL compatible
• 10-100x faster than HIVE, answers in seconds
• Reuse UDFs, SerDes, StorageHandlers
• Can use DSE / CassandraFS for Metastore
• Easy Scala/Java integration via Spark - easier than writing UDFs
24. Our new analytics architecture
How we integrate Cassandra and Spark/Shark
25. From raw events to fast queries
[Diagram: Raw Events (three sources) -> Ingestion -> C* event store -> Spark -> Views 1-3 -> Spark (predefined queries) and Shark (ad-hoc HiveQL)]
32. Options for Spark/Cassandra Integration
• Hadoop InputFormat
– ColumnFamilyInputFormat - reads all rows from 1 CF
– CqlPagingInputFormat, etc. - CQL3, secondary indexes
– Roll your own (join multiple CFs, etc.)
• Spark native RDD
– sc.parallelize(rowkeys).flatMap(readColumns(_))
– JdbcRDD + Cassandra JDBC driver
• http://tuplejump.github.io/calliope/
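The "Spark native RDD" option, `sc.parallelize(rowkeys).flatMap(readColumns(_))`, distributes a list of row keys, reads each row, and flattens the columns into one dataset. A rough sketch of that shape in plain Java follows. It does not use Spark or a Cassandra driver: a parallel stream stands in for the RDD, an in-memory map stands in for the column family, and the body of `readColumns` is hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class RowKeyFanOutSketch {
    // Stand-in for a column family: row key -> columns. In a real job,
    // readColumns would issue a Cassandra read for the given key instead.
    static final Map<String, List<String>> FAKE_CF = new HashMap<>();
    static {
        FAKE_CF.put("row1", Arrays.asList("a=1", "b=2"));
        FAKE_CF.put("row2", Arrays.asList("c=3"));
    }

    // Hypothetical helper mirroring readColumns(_) from the slide.
    static List<String> readColumns(String rowKey) {
        return FAKE_CF.getOrDefault(rowKey, Collections.<String>emptyList());
    }

    // Spread the keys across workers, read each row, flatten the results.
    // The parallel stream plays the role of sc.parallelize(rowKeys) here.
    static List<String> readAll(List<String> rowKeys) {
        return rowKeys.parallelStream()
                .flatMap(key -> readColumns(key).stream())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(readAll(Arrays.asList("row1", "row2", "missing")));
    }
}
```

Keys with no row simply contribute nothing to the flattened result, which is the usual reason to reach for flatMap rather than map in this pattern.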
33. Tips for InputFormat Development
• Know which target platforms you are developing for
– Which API to write against? New? Old? Both?
• Be prepared to spend time tuning your split computation
– Low latency jobs require fast splits
• Consider sorting row keys by token for data locality
• Implement predicate pushdown for HIVE SerDe’s
– Use your indexes to reduce size of dataset
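One way to picture the "sort row keys by token" tip: Cassandra nodes own ranges of tokens, so keys that are adjacent in token order tend to live on the same node. The sketch below is an assumption-heavy illustration, not the talk's InputFormat. The `token` function uses the absolute value of an MD5 digest as a placeholder; the tokens on a real cluster depend on its configured partitioner, and the split size is arbitrary.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class TokenSortSketch {
    // Assumed token function: absolute value of the key's MD5 digest.
    // Treat this as a placeholder for whatever the cluster's partitioner does.
    static BigInteger token(String rowKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(rowKey.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(digest).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Sort keys by token so that keys likely owned by the same node
    // sit next to each other and can be read together.
    static List<String> sortByToken(List<String> rowKeys) {
        List<String> sorted = new ArrayList<>(rowKeys);
        sorted.sort(Comparator.comparing(TokenSortSketch::token));
        return sorted;
    }

    // Cut the token-sorted keys into contiguous splits of at most splitSize keys.
    static List<List<String>> toSplits(List<String> sortedKeys, int splitSize) {
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < sortedKeys.size(); i += splitSize) {
            int end = Math.min(i + splitSize, sortedKeys.size());
            splits.add(new ArrayList<>(sortedKeys.subList(i, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("video:1", "video:2", "video:3", "video:4", "video:5");
        for (List<String> split : toSplits(sortByToken(keys), 2)) {
            System.out.println(split);
        }
    }
}
```

A production split computation would also map each token range to the replicas that own it, so the scheduler can place the task on one of those nodes.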
37. Fault Tolerance
• Cached dataset lives in Java Heap only - what if process dies?
• Spark lineage - automatic recomputation from source, but this is expensive!
• Can also replicate cached dataset to survive single node failures
• Persist materialized views back to C*, then load into cache -- now recovery path is much faster
• Persistence also enables multiple processes to hold cached dataset
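The recovery paths listed above can be sketched as a three-tier lookup: heap cache first, then the persisted view, and only then a recompute from source. This is a simplified model and not Spark's API. The maps, the `getView` method, and the view name are all hypothetical, with a plain map standing in for views persisted to C*.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

class RecoverySketch {
    // In-heap cache: lost if the process dies.
    private Map<String, Long> heapCache = new HashMap<>();
    // Stand-in for materialized views persisted to Cassandra; survives a crash.
    private final Map<String, Long> persistedViews = new HashMap<>();
    // Counts how often the expensive path (recompute from source) was taken.
    int recomputations = 0;

    long getView(String name, Supplier<Long> recomputeFromSource) {
        Long cached = heapCache.get(name);
        if (cached != null) {
            return cached; // fastest path: already in memory
        }
        Long persisted = persistedViews.get(name);
        if (persisted == null) {
            // Slowest path: rebuild from raw data, then persist the result.
            persisted = recomputeFromSource.get();
            recomputations++;
            persistedViews.put(name, persisted);
        }
        heapCache.put(name, persisted); // reload the cache from the persisted view
        return persisted;
    }

    // Simulate the process dying: the heap cache is gone, persisted views remain.
    void simulateCrash() {
        heapCache = new HashMap<>();
    }

    public static void main(String[] args) {
        RecoverySketch views = new RecoverySketch();
        Supplier<Long> expensive = () -> 42L; // placeholder for a scan of raw events
        views.getView("plays_by_country", expensive);
        views.simulateCrash();
        views.getView("plays_by_country", expensive);
        System.out.println("recomputations = " + views.recomputations);
    }
}
```

After the simulated crash the second lookup is served from the persisted view, so the expensive recompute runs only once.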
39. Shark Demo
• Local Shark node, 1 core, MBP
• How to create a table from C* using our InputFormat
• Creating a cached Shark table
• Running fast queries
40. Creating a Shark Table from InputFormat
44. Spark: Under the hood
[Diagram: the Driver dispatches Map and Reduce tasks over partitions of a Dataset]
One executor process per node