Escape From Hadoop: 
Ultra Fast Data Analysis 
with Apache Cassandra & Spark 
Slides by Kurt Russell Spitzer 
Presented by Piotr Kołaczkowski 
DataStax
Why escape from Hadoop? 
Hadoop: 
● Many Moving Pieces 
● Map Reduce 
● Single Points of Failure 
● Lots of Overhead 
And there is a way out!
Spark Provides a Simple and Efficient 
Framework for Distributed Computations 
● Node Roles: 2 (Master, Worker) 
● In-Memory Caching: Yes! 
● Fault Tolerance: Yes! 
● Great Abstraction for Datasets? RDD (Resilient Distributed Dataset)! 
[Diagram: a Spark Master coordinating several Spark Workers, each Worker running a Spark Executor]
Spark is Compatible with 
HDFS, JDBC, Parquet, CSVs, … 
AND 
APACHE CASSANDRA
Apache Cassandra is a Linearly Scaling and 
Fault Tolerant NoSQL Database 
Linearly Scaling: 
the power of the database increases linearly 
with the number of machines. 
2x machines = 2x throughput 
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html 
Fault Tolerant: 
Nodes down != Database down 
Datacenter down != Database down
Apache Cassandra Architecture is Very Simple 
● Node Roles: 1 
● Replication: Tunable 
● Consistency: Tunable 
[Diagram: a client connected to a ring of identical C* nodes]
DataStax OSS Connector 
Spark to Cassandra 
https://github.com/datastax/spark-cassandra-connector 
Cassandra                Spark 
Keyspace / Table    →    RDD[CassandraRow] / RDD[Tuples] 
Bundled and Supported with 
DSE 4.5!
DataStax Connector 
Spark to Cassandra 
By the numbers: 
● 370 commits 
● 17 branches 
● 10 releases 
● 11 contributors 
● 168 issues (65 open) 
● 98 pull requests (6 open)
Spark Cassandra Connector uses the DataStax 
Java Driver to Read from and Write to C* 
Each Spark Executor maintains a connection 
to the C* cluster through the DataStax Java Driver. 
RDDs are read into different splits based on token ranges 
(e.g. tokens 1-1000, tokens 1001-2000, …), 
together covering the full token range.
Co-locate Spark and C* for Best Performance 
Running Spark Workers on 
the same nodes as your C* 
cluster will save network hops 
when reading and writing. 
[Diagram: Spark Workers running on the C* nodes themselves; the Spark Master runs separately]
Setting up C* and Spark 
DSE > 4.5.0 
Just start your nodes with 
dse cassandra -k 
Apache Cassandra 
Follow the excellent guide by Al Tobey 
http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html
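Once the stack is up, the Spark shell needs the connector on its classpath and the address of a C* node. A minimal sketch, assuming Spark 1.1+ and a connector assembly jar you built yourself (the jar path and host are placeholders; DSE wires this up for you): 
bin/spark-shell \ 
  --jars spark-cassandra-connector-assembly.jar \ 
  --conf spark.cassandra.connection.host=127.0.0.1 
scala> import com.datastax.spark.connector._   // brings cassandraTable / saveToCassandra into scope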
We need a Distributed System For 
Analytics and Batch Jobs 
But it doesn’t have to be complicated!
Even count needs to be distributed 
Ask me to write a MapReduce 
job for word count. I dare you. 
You could make this easier by adding yet another technology to your 
Hadoop stack (Hive, Pig, Impala), or 
we could just do one-liners in the Spark shell.
Basics: Getting a Table and Counting 
CREATE KEYSPACE newyork WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 }; 
USE newyork; 
CREATE TABLE presidentlocations ( time int, location text, PRIMARY KEY (time) ); 
INSERT INTO presidentlocations (time, location) VALUES ( 1, 'White House' ); 
INSERT INTO presidentlocations (time, location) VALUES ( 2, 'White House' ); 
INSERT INTO presidentlocations (time, location) VALUES ( 3, 'White House' ); 
INSERT INTO presidentlocations (time, location) VALUES ( 4, 'White House' ); 
INSERT INTO presidentlocations (time, location) VALUES ( 5, 'Air Force 1' ); 
INSERT INTO presidentlocations (time, location) VALUES ( 6, 'Air Force 1' ); 
INSERT INTO presidentlocations (time, location) VALUES ( 7, 'Air Force 1' ); 
INSERT INTO presidentlocations (time, location) VALUES ( 8, 'NYC' ); 
INSERT INTO presidentlocations (time, location) VALUES ( 9, 'NYC' ); 
INSERT INTO presidentlocations (time, location) VALUES ( 10, 'NYC' ); 

scala> sc.cassandraTable("newyork","presidentlocations").count 
res3: Long = 10
Basics: take() and toArray 
scala> sc.cassandraTable("newyork","presidentlocations").take(1) 
res2: Array[com.datastax.spark.connector.CassandraRow] = Array(CassandraRow{time: 9, location: NYC}) 

scala> sc.cassandraTable("newyork","presidentlocations").toArray 
res3: Array[com.datastax.spark.connector.CassandraRow] = Array( 
CassandraRow{time: 9, location: NYC}, 
CassandraRow{time: 3, location: White House}, 
…, 
CassandraRow{time: 6, location: Air Force 1})
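A side note: newer Spark releases deprecate RDD.toArray in favor of collect, which returns the same Array: 
scala> sc.cassandraTable("newyork","presidentlocations").collect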
Basics: Getting Row Values out of a 
CassandraRow 
scala> sc.cassandraTable("newyork","presidentlocations").first.get[Int]("time") 
res5: Int = 9 
first returns a single CassandraRow object; typed getters pull values out of it: 
get[Int], get[String], get[List[...]], … get[Any] 
Got null? get[Option[Int]] 
http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkSupportedTypes.html
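A quick sketch of the null-safe getter on the same row (the exact res number will differ): 
scala> sc.cassandraTable("newyork","presidentlocations").first.get[Option[String]]("location") 
res6: Option[String] = Some(NYC)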
Copy A Table 
Say we want to restructure our table or add a new column? 
CREATE TABLE characterlocations ( 
time int, 
character text, 
location text, 
PRIMARY KEY (time, character) 
); 

scala> sc.cassandraTable("newyork","presidentlocations") 
.map( row => ( 
row.get[Int]("time"), 
"president", 
row.get[String]("location"))) 
.saveToCassandra("newyork","characterlocations") 

cqlsh:newyork> SELECT * FROM characterlocations ; 
time | character | location 
------+-----------+------------- 
5 | president | Air Force 1 
10 | president | NYC 
…
Filter a Table 
What if we want to filter based on a non-clustering key column? 
scala> sc.cassandraTable("newyork","presidentlocations") 
.filter( _.getInt("time") > 7 ) 
.toArray 
res9: Array[com.datastax.spark.connector.CassandraRow] = Array( 
CassandraRow{time: 9, location: NYC}, 
CassandraRow{time: 10, location: NYC}, 
CassandraRow{time: 8, location: NYC} 
)
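This filter runs in Spark after the rows have been read. The connector also offers a where method that pushes a CQL predicate down to Cassandra; to my knowledge it only helps for columns Cassandra can filter server-side, such as the clustering column time in the timelines table defined on the next slide. A sketch: 
scala> sc.cassandraTable("newyork","timelines").where("time > ?", 7)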
Backfill a Table with a Different Key! 
If we actually want to have quick access to 
timelines we need a C* table with a 
different structure. 
CREATE TABLE timelines ( 
time int, 
character text, 
location text, 
PRIMARY KEY ((character), time) 
); 

sc.cassandraTable("newyork","characterlocations") 
.saveToCassandra("newyork","timelines") 

cqlsh:newyork> select * from timelines; 
character | time | location 
-----------+------+------------- 
president | 1 | White House 
president | 2 | White House 
president | 3 | White House 
president | 4 | White House 
president | 5 | Air Force 1 
president | 6 | Air Force 1 
president | 7 | Air Force 1 
president | 8 | NYC 
president | 9 | NYC 
president | 10 | NYC
Import a CSV 
I have some data in another source which I could really use in 
my Cassandra table. 
sc.textFile("file:///home/pkolaczk/ReallyImportantDocuments/PlisskenLocations.csv") 
.map(_.split(",")) 
.map(line => (line(0),line(1),line(2))) 
.saveToCassandra("newyork","timelines", SomeColumns("character", "time", "location")) 

cqlsh:newyork> select * from timelines where character = 'plissken'; 
character | time | location 
-----------+------+----------------- 
plissken | 1 | Federal Reserve 
plissken | 2 | Federal Reserve 
plissken | 3 | Federal Reserve 
plissken | 4 | Court 
plissken | 5 | Court 
plissken | 6 | Court 
plissken | 7 | Court 
plissken | 8 | Stealth Glider 
plissken | 9 | NYC 
plissken | 10 | NYC
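Note that line(1) arrives as a String while timelines.time is an int; in my experience the connector's type converters coerce this on write. To be explicit about it, a sketch: 
sc.textFile("file:///home/pkolaczk/ReallyImportantDocuments/PlisskenLocations.csv") 
  .map(_.split(",")) 
  .map(line => (line(0), line(1).trim.toInt, line(2)))   // parse time up front 
  .saveToCassandra("newyork","timelines", SomeColumns("character", "time", "location"))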
Perform a Join with MySQL 
Maybe a little more than one line … 
import java.sql._ 
import org.apache.spark.rdd.JdbcRDD 
Class.forName("com.mysql.jdbc.Driver").newInstance() 
val quotes = new JdbcRDD( 
sc, 
getConnection = () => DriverManager.getConnection("jdbc:mysql://localhost/escape_from_ny?user=root"), 
sql = "SELECT * FROM quotes WHERE ? <= ID and ID <= ?", 
lowerBound = 0, 
upperBound = 100, 
numPartitions = 5, 
mapRow = (r: ResultSet) => (r.getInt(2), r.getString(3)) 
) 
quotes: org.apache.spark.rdd.JdbcRDD[(Int, String)] = JdbcRDD[9] at JdbcRDD at <console>:23
Perform a Join with MySQL 
Maybe a little more than one line … 
val locations = sc.cassandraTable("newyork","timelines") 
.filter(_.getString("character") == "plissken") 
.map(row => (row.getInt("time"), row.getString("location"))) 

quotes.join(locations) 
.take(1) 
.foreach(println) 

(5, ( 
Bob Hauk: There was an accident. 
About an hour ago, a small jet went down inside New York City. 
The President was on board. 
Snake Plissken: The president of what?, 
Court 
))
Easy Objects with Case Classes 
We have the technology to make this even easier! 
case class TimelineRow(character: String, time: Int, location: String) 
sc.cassandraTable[TimelineRow]("newyork","timelines") 
.filter(_.character == "plissken") 
.filter(_.time == 8) 
.toArray 
res13: Array[TimelineRow] = Array(TimelineRow(plissken,8,Stealth Glider))
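The mapping works the other way too: an RDD of case class instances whose field names match the column names can be written back with saveToCassandra. A sketch (the new row is invented for illustration): 
val newRows = sc.parallelize(Seq(TimelineRow("plissken", 11, "Helicopter"))) 
newRows.saveToCassandra("newyork", "timelines")   // fields map to columns by name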
A Map Reduce for Word Count … 
scala> sc.cassandraTable("newyork","presidentlocations") 
.map(_.getString("location")) 
.flatMap(_.split(" ")) 
.map((_,1)) 
.reduceByKey(_ + _) 
.toArray 
res17: Array[(String, Int)] = Array((1,3), (House,4), (NYC,3), (Force,3), (White,4), (Air,3))
Selected RDD transformations 
● min(), max(), count() 
● reduce[T](f: (T, T) ⇒ T): T 
● fold[T](zeroValue: T)(op: (T, T) ⇒ T): T 
● aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U): U 
● flatMap[U](func: (T) ⇒ TraversableOnce[U]): RDD[U] 
● mapPartitions[U]( 
f: (Iterator[T]) ⇒ Iterator[U], 
preservesPartitioning: Boolean): RDD[U] 
● sortBy[K](f: (T) ⇒ K, ascending: Boolean = true) 
● groupBy[K](f: (T) ⇒ K): RDD[(K, Iterable[T])] 
● intersection(other: RDD[T]): RDD[T] 
● union(other: RDD[T]): RDD[T] 
● subtract(other: RDD[T]): RDD[T] 
● zip[U](other: RDD[U]): RDD[(T, U)] 
● keyBy[K](f: (T) ⇒ K): RDD[(K, T)] 
● sample(withReplacement: Boolean, fraction: Double)
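Most of these compose directly on the tables above. A small sketch, reusing the TimelineRow case class from the earlier slide: 
val locationsByCharacter = 
  sc.cassandraTable[TimelineRow]("newyork","timelines") 
    .keyBy(_.character)       // RDD[(String, TimelineRow)] 
    .mapValues(_.location)    // RDD[(String, String)] 
    .groupByKey               // RDD[(String, Iterable[String])]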
RDD can do even more...
How Fast is it? 
● Reading big data from Cassandra: 
– Spark ~2x faster than Hadoop 
● Minimum latency (1 node, vnodes disabled, tiny data): 
– Spark: 0.7s 
– Hadoop: ~20s 
● Minimum latency (1 node, vnodes enabled): 
– Spark: 1s 
– Hadoop: ~8 minutes 
● In memory processing: 
– up to 100x faster than Hadoop
source: https://amplab.cs.berkeley.edu/benchmark/
In-memory Processing 
Call cache or persist(storageLevel) to store RDD data in memory. 
val rdd = sc.cassandraTable("newyork","presidentlocations") 
.filter(...) 
.map(...) 
.cache 
rdd.first // slow: loads data from Cassandra and caches it in memory 
rdd.first // fast: doesn't read from Cassandra, reads from memory 
Multiple StorageLevels available: 
● MEMORY_ONLY 
● MEMORY_ONLY_SER 
● MEMORY_AND_DISK 
● MEMORY_AND_DISK_SER 
● DISK_ONLY 
Also replicated variants are available: just append _2 to the constant name.
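A minimal sketch of choosing an explicit level (the import path is standard Spark): 
import org.apache.spark.storage.StorageLevel 
val cached = sc.cassandraTable("newyork","timelines") 
  .persist(StorageLevel.MEMORY_AND_DISK_SER)   // serialized in memory, spills to disk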
Fault Tolerance 
[Diagram: a Cassandra RDD read across Node 1, Node 2 and Node 3 with Replication Factor = 2, 
flowing through filter (FilteredRDD) and map (MappedRDD); because the source data is 
replicated, lost partitions can be recomputed on the surviving nodes]
Standalone App Example 
https://github.com/RussellSpitzer/spark-cassandra-csv 
Car, Model, Color 
Dodge, Caravan, Red 
Ford, F150, Black 
Toyota, Prius, Green 
[Diagram: the CSV above is read into an RDD[CassandraRow] via a column mapping 
and written to the FavoriteCars table in Cassandra]
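Outside the REPL you build the SparkContext yourself. A minimal skeleton, assuming the connector is on the classpath (class name and host are placeholders): 
import org.apache.spark.{SparkConf, SparkContext} 
import com.datastax.spark.connector._ 

object CsvToCassandra { 
  def main(args: Array[String]): Unit = { 
    val conf = new SparkConf() 
      .setAppName("spark-cassandra-csv") 
      .set("spark.cassandra.connection.host", "127.0.0.1")  // a C* contact point 
    val sc = new SparkContext(conf) 
    // ... read the CSV and saveToCassandra, as in the shell examples above 
    sc.stop() 
  } 
}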
Useful modules / projects 
● Java API 
– for diehard Java developers 
● Python API 
– for those allergic to static types 
● Shark 
– Hive QL on Spark (discontinued) 
● Spark SQL 
– new SQL engine based on Catalyst query planner 
● Spark Streaming 
– microbatch streaming framework 
● MLlib 
– machine learning library 
● GraphX 
– efficient representation and processing of graph data
We're hiring! 
http://www.datastax.com/company/careers
Thanks for listening! 
There is plenty more we can do with Spark but … 
Questions?

Contenu connexe

Tendances

Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data prajods
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache SparkJosef Adersberger
 
Owning time series with team apache Strata San Jose 2015
Owning time series with team apache   Strata San Jose 2015Owning time series with team apache   Strata San Jose 2015
Owning time series with team apache Strata San Jose 2015Patrick McFadin
 
Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Matthias Niehoff
 
Spark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational DataSpark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational DataVictor Coustenoble
 
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Spark Summit
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosRahul Kumar
 
Nike Tech Talk: Double Down on Apache Cassandra and Spark
Nike Tech Talk:  Double Down on Apache Cassandra and SparkNike Tech Talk:  Double Down on Apache Cassandra and Spark
Nike Tech Talk: Double Down on Apache Cassandra and SparkPatrick McFadin
 
An Introduction to time series with Team Apache
An Introduction to time series with Team ApacheAn Introduction to time series with Team Apache
An Introduction to time series with Team ApachePatrick McFadin
 
Cassandra Basics, Counters and Time Series Modeling
Cassandra Basics, Counters and Time Series ModelingCassandra Basics, Counters and Time Series Modeling
Cassandra Basics, Counters and Time Series ModelingVassilis Bekiaris
 
Spark Streaming, Machine Learning and meetup.com streaming API.
Spark Streaming, Machine Learning and  meetup.com streaming API.Spark Streaming, Machine Learning and  meetup.com streaming API.
Spark Streaming, Machine Learning and meetup.com streaming API.Sergey Zelvenskiy
 
Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraRussell Spitzer
 
Apache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fireApache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the firePatrick McFadin
 
Meet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + KafkaMeet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + KafkaKnoldus Inc.
 
OLAP with Cassandra and Spark
OLAP with Cassandra and SparkOLAP with Cassandra and Spark
OLAP with Cassandra and SparkEvan Chan
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureRussell Spitzer
 
Spark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher BateySpark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher BateySpark Summit
 
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...DataStax
 
Cassandra 2.0 and timeseries
Cassandra 2.0 and timeseriesCassandra 2.0 and timeseries
Cassandra 2.0 and timeseriesPatrick McFadin
 
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...StampedeCon
 

Tendances (20)

Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data
 
Time Series Processing with Apache Spark
Time Series Processing with Apache SparkTime Series Processing with Apache Spark
Time Series Processing with Apache Spark
 
Owning time series with team apache Strata San Jose 2015
Owning time series with team apache   Strata San Jose 2015Owning time series with team apache   Strata San Jose 2015
Owning time series with team apache Strata San Jose 2015
 
Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra
 
Spark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational DataSpark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational Data
 
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
 
Real time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesosReal time data pipeline with spark streaming and cassandra with mesos
Real time data pipeline with spark streaming and cassandra with mesos
 
Nike Tech Talk: Double Down on Apache Cassandra and Spark
Nike Tech Talk:  Double Down on Apache Cassandra and SparkNike Tech Talk:  Double Down on Apache Cassandra and Spark
Nike Tech Talk: Double Down on Apache Cassandra and Spark
 
An Introduction to time series with Team Apache
An Introduction to time series with Team ApacheAn Introduction to time series with Team Apache
An Introduction to time series with Team Apache
 
Cassandra Basics, Counters and Time Series Modeling
Cassandra Basics, Counters and Time Series ModelingCassandra Basics, Counters and Time Series Modeling
Cassandra Basics, Counters and Time Series Modeling
 
Spark Streaming, Machine Learning and meetup.com streaming API.
Spark Streaming, Machine Learning and  meetup.com streaming API.Spark Streaming, Machine Learning and  meetup.com streaming API.
Spark Streaming, Machine Learning and meetup.com streaming API.
 
Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and Cassandra
 
Apache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fireApache cassandra and spark. you got the the lighter, let's start the fire
Apache cassandra and spark. you got the the lighter, let's start the fire
 
Meet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + KafkaMeet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + Kafka
 
OLAP with Cassandra and Spark
OLAP with Cassandra and SparkOLAP with Cassandra and Spark
OLAP with Cassandra and Spark
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
 
Spark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher BateySpark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher Batey
 
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
 
Cassandra 2.0 and timeseries
Cassandra 2.0 and timeseriesCassandra 2.0 and timeseries
Cassandra 2.0 and timeseries
 
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
 

En vedette

Resume of Vimal 4.1
Resume of Vimal 4.1Resume of Vimal 4.1
Resume of Vimal 4.1Vimal Suthar
 
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and SparkAlphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and SparkJongwook Woo
 
Unikernels: in search of a killer app and a killer ecosystem
Unikernels: in search of a killer app and a killer ecosystemUnikernels: in search of a killer app and a killer ecosystem
Unikernels: in search of a killer app and a killer ecosystemrhatr
 
Spark fundamentals i (bd095 en) version #1: updated: april 2015
Spark fundamentals i (bd095 en) version #1: updated: april 2015Spark fundamentals i (bd095 en) version #1: updated: april 2015
Spark fundamentals i (bd095 en) version #1: updated: april 2015Ashutosh Sonaliya
 
Basic Sentiment Analysis using Hive
Basic Sentiment Analysis using HiveBasic Sentiment Analysis using Hive
Basic Sentiment Analysis using HiveQubole
 
Traffic data analysis using HADOOP
Traffic data analysis using HADOOPTraffic data analysis using HADOOP
Traffic data analysis using HADOOPKirthan S Holla
 
Type Checking Scala Spark Datasets: Dataset Transforms
Type Checking Scala Spark Datasets: Dataset TransformsType Checking Scala Spark Datasets: Dataset Transforms
Type Checking Scala Spark Datasets: Dataset TransformsJohn Nestor
 
Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2Gabriele Modena
 
Hadoop - Stock Analysis
Hadoop - Stock AnalysisHadoop - Stock Analysis
Hadoop - Stock AnalysisVaibhav Jain
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015Robbie Strickland
 
臺灣高中數學講義 - 第一冊 - 數與式
臺灣高中數學講義 - 第一冊 - 數與式臺灣高中數學講義 - 第一冊 - 數與式
臺灣高中數學講義 - 第一冊 - 數與式Xuan-Chao Huang
 
Think Like Spark: Some Spark Concepts and a Use Case
Think Like Spark: Some Spark Concepts and a Use CaseThink Like Spark: Some Spark Concepts and a Use Case
Think Like Spark: Some Spark Concepts and a Use CaseRachel Warren
 
Resilient Distributed Datasets
Resilient Distributed DatasetsResilient Distributed Datasets
Resilient Distributed DatasetsGabriele Modena
 
TRAFFIC DATA ANALYSIS USING HADOOP
TRAFFIC DATA ANALYSIS USING HADOOPTRAFFIC DATA ANALYSIS USING HADOOP
TRAFFIC DATA ANALYSIS USING HADOOPKirthan S Holla
 
Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?rhatr
 
Log analysis with Hadoop in livedoor 2013
Log analysis with Hadoop in livedoor 2013Log analysis with Hadoop in livedoor 2013
Log analysis with Hadoop in livedoor 2013SATOSHI TAGOMORI
 
IBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark BasicsIBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark BasicsSatya Narayan
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonVitthal Gogate
 
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Jonathan Seidman
 

En vedette (20)

Resume of Vimal 4.1
Resume of Vimal 4.1Resume of Vimal 4.1
Resume of Vimal 4.1
 
Hadoop data analysis
Hadoop data analysisHadoop data analysis
Hadoop data analysis
 
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and SparkAlphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
 
Unikernels: in search of a killer app and a killer ecosystem
Unikernels: in search of a killer app and a killer ecosystemUnikernels: in search of a killer app and a killer ecosystem
Unikernels: in search of a killer app and a killer ecosystem
 
Spark fundamentals i (bd095 en) version #1: updated: april 2015
Spark fundamentals i (bd095 en) version #1: updated: april 2015Spark fundamentals i (bd095 en) version #1: updated: april 2015
Spark fundamentals i (bd095 en) version #1: updated: april 2015
 
Basic Sentiment Analysis using Hive
Basic Sentiment Analysis using HiveBasic Sentiment Analysis using Hive
Basic Sentiment Analysis using Hive
 
Traffic data analysis using HADOOP
Traffic data analysis using HADOOPTraffic data analysis using HADOOP
Traffic data analysis using HADOOP
 
Type Checking Scala Spark Datasets: Dataset Transforms
Type Checking Scala Spark Datasets: Dataset TransformsType Checking Scala Spark Datasets: Dataset Transforms
Type Checking Scala Spark Datasets: Dataset Transforms
 
Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2
 
Hadoop - Stock Analysis
Hadoop - Stock AnalysisHadoop - Stock Analysis
Hadoop - Stock Analysis
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
 
臺灣高中數學講義 - 第一冊 - 數與式
臺灣高中數學講義 - 第一冊 - 數與式臺灣高中數學講義 - 第一冊 - 數與式
臺灣高中數學講義 - 第一冊 - 數與式
 
Think Like Spark: Some Spark Concepts and a Use Case
Think Like Spark: Some Spark Concepts and a Use CaseThink Like Spark: Some Spark Concepts and a Use Case
Think Like Spark: Some Spark Concepts and a Use Case
 
Resilient Distributed Datasets
Resilient Distributed DatasetsResilient Distributed Datasets
Resilient Distributed Datasets
 
TRAFFIC DATA ANALYSIS USING HADOOP
TRAFFIC DATA ANALYSIS USING HADOOPTRAFFIC DATA ANALYSIS USING HADOOP
TRAFFIC DATA ANALYSIS USING HADOOP
 
Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?
 
Log analysis with Hadoop in livedoor 2013
Log analysis with Hadoop in livedoor 2013Log analysis with Hadoop in livedoor 2013
Log analysis with Hadoop in livedoor 2013
 
IBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark BasicsIBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark Basics
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College London
 
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
 

Similaire à Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra

Escape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* OpsEscape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* OpsRussell Spitzer
 
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...Data Con LA
 
Hadoop Integration in Cassandra
Hadoop Integration in CassandraHadoop Integration in Cassandra
Hadoop Integration in CassandraJairam Chandar
 
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...DataStax Academy
 
Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014Holden Karau
 
Cascading Through Hadoop for the Boulder JUG
Cascading Through Hadoop for the Boulder JUGCascading Through Hadoop for the Boulder JUG
Cascading Through Hadoop for the Boulder JUGMatthew McCullough
 
Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt...
 Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt... Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt...
Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt...Databricks
 
Store and Process Big Data with Hadoop and Cassandra
Store and Process Big Data with Hadoop and CassandraStore and Process Big Data with Hadoop and Cassandra
Store and Process Big Data with Hadoop and CassandraDeependra Ariyadewa
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with ScalaHimanshu Gupta
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousJen Aman
 
Spark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 FuriousSpark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 FuriousRussell Spitzer
 
Nosql hands on handout 04
Nosql hands on handout 04Nosql hands on handout 04
Nosql hands on handout 04Krishna Sankar
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Helena Edelson
 
Cassandra Java APIs Old and New – A Comparison
Cassandra Java APIs Old and New – A ComparisonCassandra Java APIs Old and New – A Comparison
Cassandra Java APIs Old and New – A Comparisonshsedghi
 
3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developersChristopher Batey
 
KSQL - Stream Processing simplified!
KSQL - Stream Processing simplified!KSQL - Stream Processing simplified!
KSQL - Stream Processing simplified!Guido Schmutz
 
SparkSQL et Cassandra - Tool In Action Devoxx 2015
 SparkSQL et Cassandra - Tool In Action Devoxx 2015 SparkSQL et Cassandra - Tool In Action Devoxx 2015
SparkSQL et Cassandra - Tool In Action Devoxx 2015Alexander DEJANOVSKI
 
CouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 HourCouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 HourPeter Friese
 

Similaire à Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra (20)

Escape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* OpsEscape From Hadoop: Spark One Liners for C* Ops
Escape From Hadoop: Spark One Liners for C* Ops
 
Escape from Hadoop
Escape from HadoopEscape from Hadoop
Escape from Hadoop
 
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
 
Hadoop Integration in Cassandra
Hadoop Integration in CassandraHadoop Integration in Cassandra
Hadoop Integration in Cassandra
 
Spark_Documentation_Template1
Spark_Documentation_Template1Spark_Documentation_Template1
Spark_Documentation_Template1
 
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
 
Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014
 
Cascading Through Hadoop for the Boulder JUG
Cascading Through Hadoop for the Boulder JUGCascading Through Hadoop for the Boulder JUG
Cascading Through Hadoop for the Boulder JUG
 
Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt...
 Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt... Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt...
Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatt...
 
Store and Process Big Data with Hadoop and Cassandra
Store and Process Big Data with Hadoop and CassandraStore and Process Big Data with Hadoop and Cassandra
Store and Process Big Data with Hadoop and Cassandra
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with Scala
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
 
Spark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 FuriousSpark and Cassandra 2 Fast 2 Furious
Spark and Cassandra 2 Fast 2 Furious
 
Nosql hands on handout 04
Nosql hands on handout 04Nosql hands on handout 04
Nosql hands on handout 04
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 
Cassandra Java APIs Old and New – A Comparison
Cassandra Java APIs Old and New – A ComparisonCassandra Java APIs Old and New – A Comparison
Cassandra Java APIs Old and New – A Comparison
 
3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers3 Dundee-Spark Overview for C* developers
3 Dundee-Spark Overview for C* developers
 
KSQL - Stream Processing simplified!
KSQL - Stream Processing simplified!KSQL - Stream Processing simplified!
KSQL - Stream Processing simplified!
 
SparkSQL et Cassandra - Tool In Action Devoxx 2015
 SparkSQL et Cassandra - Tool In Action Devoxx 2015 SparkSQL et Cassandra - Tool In Action Devoxx 2015
SparkSQL et Cassandra - Tool In Action Devoxx 2015
 
CouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 HourCouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 Hour
 

Dernier

Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
Best Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfBest Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfIdiosysTechnologies1
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfLivetecs LLC
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noidabntitsolutionsrishis
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 

Dernier (20)

Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Best Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfBest Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdf
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdf
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 

Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra

  • 1. Escape From Hadoop: Ultra Fast Data Analysis with Apache Cassandra & Spark Kurt Russell Spitzer Piotr Kołaczkowski Piotr Kołaczkowski DataStax slides by presented by
  • 2. Why escape from Hadoop? Hadoop Many Moving Pieces Map Reduce Single Points of Failure Lots of Overhead And there is a way out!
  • 3. Spark Provides a Simple and Efficient framework for Distributed Computations Node Roles 2 In Memory Caching Yes! Fault Tolerance Yes! Great Abstraction For Datasets? RDD! Spark Worker Spark Worker Spark Master Spark Worker Resilient Distributed Dataset SSppaarrkk EExxeeccuuttoorr
  • 4. Spark is Compatible with HDFS, JDBC, Parquet, CSVs, …. AND APACHE CASSANDRA Apache Cassandra
  • 5. Apache Cassandra is a Linearly Scaling and Fault Tolerant noSQL Database Linearly Scaling: The power of the database increases linearly with the number of machines 2x machines = 2x throughput http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html Fault Tolerant: Nodes down != Database Down Datacenter down != Database Down
  • 6. Apache Cassandra Architecture is Very Simple Node Roles 1 Replication Tunable Consistency Replication Tunable CC** CC** CC** CC** CClliieenntt
  • 7. DataStax OSS Connector Spark to Cassandra https://github.com/datastax/spark-cassandra-connector CCaassssaannddrraa SSppaarrkk KKeeyyssppaaccee TTaabbllee RRDDDD[[CCaassssaannddrraaRRooww]] RRDDDD[[TTuupplleess]] Bundled and Supported with DSE 4.5!
  • 8. DataStax Connector Spark to Cassandra By the numbers: ● 370 commits ● 17 branches ● 10 releases ● 11 contributors ● 168 issues (65 open) ● 98 pull requests (6 open)
  • 9.
  • 10. Spark Cassandra Connector uses the DataStax Java Driver to Read from and Write to C* CC** Full Token Range Each Executor Maintains a connection to the C* Cluster Spark Executor DataStax Java Driver Tokens 1001 -2000 Tokens 1-1000 Tokens … RDD’s read into different splits based on token ranges
  • 11. Co-locate Spark and C* for Best Performance CC** Running Spark Workers on the same nodes as your C* cluster will save network hops when reading and writing CC** CC** Spark Worker CC** Spark Worker Spark Master Spark Worker
  • 12. Setting up C* and Spark DSE > 4.5.0 Just start your nodes with dse cassandra -k Apache Cassandra Follow the excellent guide by Al Tobey http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html
  • 13. We need a Distributed System For Analytics and Batch Jobs But it doesn’t have to be complicated!
  • 14. Even count needs to be distributed Ask me to write a Map Reduce for word count, I dare you. You could make this easier by adding yet another technology to your Hadoop Stack (hive, pig, impala) or we could just do one liners on the spark shell.
  • 15. Basics: Getting a Table and Counting CREATE KEYSPACE newyork WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 }; USE newyork; CREATE TABLE presidentlocations ( time int, location text , PRIMARY KEY time ); INSERT INTO presidentlocations (time, location ) VALUES ( 1 , 'White House' ); INSERT INTO presidentlocations (time, location ) VALUES ( 2 , 'White House' ); INSERT INTO presidentlocations (time, location ) VALUES ( 3 , 'White House' ); INSERT INTO presidentlocations (time, location ) VALUES ( 4 , 'White House' ); INSERT INTO presidentlocations (time, location ) VALUES ( 5 , 'Air Force 1' ); INSERT INTO presidentlocations (time, location ) VALUES ( 6 , 'Air Force 1' ); INSERT INTO presidentlocations (time, location ) VALUES ( 7 , 'Air Force 1' ); INSERT INTO presidentlocations (time, location ) VALUES ( 8 , 'NYC' ); INSERT INTO presidentlocations (time, location ) VALUES ( 9 , 'NYC' ); INSERT INTO presidentlocations (time, location ) VALUES ( 10 , 'NYC' ); CREATE KEYSPACE newyork WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 }; USE newyork; CREATE TABLE presidentlocations ( time int, location text , PRIMARY KEY time ); INSERT INTO presidentlocations (time, location ) VALUES ( 1 , 'White House' ); INSERT INTO presidentlocations (time, location ) VALUES ( 2 , 'White House' ); INSERT INTO presidentlocations (time, location ) VALUES ( 3 , 'White House' ); INSERT INTO presidentlocations (time, location ) VALUES ( 4 , 'White House' ); INSERT INTO presidentlocations (time, location ) VALUES ( 5 , 'Air Force 1' ); INSERT INTO presidentlocations (time, location ) VALUES ( 6 , 'Air Force 1' ); INSERT INTO presidentlocations (time, location ) VALUES ( 7 , 'Air Force 1' ); INSERT INTO presidentlocations (time, location ) VALUES ( 8 , 'NYC' ); INSERT INTO presidentlocations (time, location ) VALUES ( 9 , 'NYC' ); INSERT INTO presidentlocations (time, location ) VALUES ( 10 , 'NYC' ); scala> sc.cassandraTable(“newyork","presidentlocations").count res3: Long = 10 scala> sc.cassandraTable(“newyork","presidentlocations").count res3: Long = 10 cassandraTable count 10
  • 16. Basics: take() and toArray scala> sc.cassandraTable("newyork","presidentlocations").take(1) res2: Array[com.datastax.spark.connector.CassandraRow] = Array(CassandraRow{time: 9, location: NYC}) scala> sc.cassandraTable("newyork","presidentlocations").take(1) res2: Array[com.datastax.spark.connector.CassandraRow] = Array(CassandraRow{time: 9, location: NYC}) cassandraTable take(1) Array of CassandraRows 99 NNYYCC scala> sc.cassandraTable(“newyork","presidentlocations").toArray res3: Array[com.datastax.spark.connector.CassandraRow] = Array( scala> sc.cassandraTable(“newyork","presidentlocations").toArray res3: Array[com.datastax.spark.connector.CassandraRow] = Array( CassandraRow{time: 9, location: NYC}, CassandraRow{time: 3, location: White House}, …, CassandraRow{time: 6, location: Air Force 1}) cassandraTable toArray Array of CassandraRows 99 NNYYCC CassandraRow{time: 9, location: NYC}, CassandraRow{time: 3, location: White House}, …, CassandraRow{time: 6, location: Air Force 1}) 9999 NNNNYYYYCCCC 9999 NNNNYYYYCCCC
  • 17. Basics: Getting Row Values out of a CassandraRow

    scala> sc.cassandraTable("newyork","presidentlocations").first.get[Int]("time")
    res5: Int = 9

    Typed getters: get[Int], get[String], get[List[...]], … get[Any]
    Got null? Use get[Option[Int]].

    http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkSupportedTypes.html
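Asking for a plain Int throws if the cell is null, which is what the Option getter above is for; a quick sketch:

    val row = sc.cassandraTable("newyork","presidentlocations").first
    row.get[Option[Int]]("time")  // Some(9); would be None if the cell were null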
  • 18. Copy A Table. Say we want to restructure our table or add a new column?

    CREATE TABLE characterlocations (
      time int,
      character text,
      location text,
      PRIMARY KEY (time, character)
    );

    scala> sc.cassandraTable("newyork","presidentlocations")
             .map(row => (row.get[Int]("time"), "president", row.get[String]("location")))
             .saveToCassandra("newyork","characterlocations")

    cqlsh:newyork> SELECT * FROM characterlocations;
     time | character | location
    ------+-----------+-------------
        5 | president | Air Force 1
       10 | president | NYC
     …
  • 19. Filter a Table. What if we want to filter based on a non-clustering key column?

    scala> sc.cassandraTable("newyork","presidentlocations")
             .filter( _.getInt("time") > 7 )
             .toArray
    res9: Array[com.datastax.spark.connector.CassandraRow] = Array(
      CassandraRow{time: 9, location: NYC},
      CassandraRow{time: 10, location: NYC},
      CassandraRow{time: 8, location: NYC})
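The filter above runs in Spark after every row has been read. For columns Cassandra itself can filter on (clustering or indexed columns), the connector's where method pushes the predicate down into the CQL query instead; a sketch against the timelines table defined on a later slide, where time is a clustering column:

    sc.cassandraTable("newyork","timelines")
      .where("time > ?", 7)  // predicate executed by Cassandra, not Spark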
  • 20. Backfill a Table with a Different Key! If we actually want quick access to timelines, we need a C* table with a different structure.

    CREATE TABLE timelines (
      time int,
      character text,
      location text,
      PRIMARY KEY ((character), time)
    );

    sc.cassandraTable("newyork","characterlocations")
      .saveToCassandra("newyork","timelines")

    cqlsh:newyork> select * from timelines;
     character | time | location
    -----------+------+-------------
     president |    1 | White House
     president |    2 | White House
     president |    3 | White House
     president |    4 | White House
     president |    5 | Air Force 1
     president |    6 | Air Force 1
     president |    7 | Air Force 1
     president |    8 | NYC
     president |    9 | NYC
     president |   10 | NYC
  • 21. Import a CSV. I have some data in another source which I could really use in my Cassandra table.

    sc.textFile("file:///home/pkolaczk/ReallyImportantDocuments/PlisskenLocations.csv")
      .map(_.split(","))
      .map(line => (line(0), line(1), line(2)))
      .saveToCassandra("newyork","timelines", SomeColumns("character", "time", "location"))

    cqlsh:newyork> select * from timelines where character = 'plissken';
     character | time | location
    -----------+------+-----------------
      plissken |    1 | Federal Reserve
      plissken |    2 | Federal Reserve
      plissken |    3 | Federal Reserve
      plissken |    4 | Court
      plissken |    5 | Court
      plissken |    6 | Court
      plissken |    7 | Court
      plissken |    8 | Stealth Glider
      plissken |    9 | NYC
      plissken |   10 | NYC
  • 22. Perform a Join with MySQL. Maybe a little more than one line …

    import java.sql._
    import org.apache.spark.rdd.JdbcRDD

    Class.forName("com.mysql.jdbc.Driver").newInstance()

    val quotes = new JdbcRDD(
      sc,
      getConnection = () => DriverManager.getConnection("jdbc:mysql://localhost/escape_from_ny?user=root"),
      sql = "SELECT * FROM quotes WHERE ? <= ID and ID <= ?",
      lowerBound = 0,
      upperBound = 100,
      numPartitions = 5,
      mapRow = (r: ResultSet) => (r.getInt(2), r.getString(3))
    )
    quotes: org.apache.spark.rdd.JdbcRDD[(Int, String)] = JdbcRDD[9] at JdbcRDD at <console>:23
  • 23. Perform a Join with MySQL. Maybe a little more than one line …

    val locations = sc.cassandraTable("newyork","timelines")
      .filter(_.getString("character") == "plissken")
      .map(row => (row.getInt("time"), row.getString("location")))

    quotes.join(locations)
      .take(1)
      .foreach(println)

    (5, ( Bob Hauk: There was an accident. About an hour ago, a small jet went down inside New York City. The President was on board. Snake Plissken: The president of what?, Court ))
  • 24. Easy Objects with Case Classes. We have the technology to make this even easier!

    case class TimelineRow(character: String, time: Int, location: String)

    sc.cassandraTable[TimelineRow]("newyork","timelines")
      .filter(_.character == "plissken")
      .filter(_.time == 8)
      .toArray
    res13: Array[TimelineRow] = Array(TimelineRow(plissken,8,Stealth Glider))
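The mapping works for writes too: case class field names line up with column names, so an RDD of instances can be saved directly. A sketch with a made-up row:

    val extra = sc.parallelize(Seq(TimelineRow("plissken", 11, "Helicopter")))
    extra.saveToCassandra("newyork", "timelines")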
  • 25. A Map Reduce for Word Count …

    scala> sc.cassandraTable("newyork","presidentlocations")
             .map(_.getString("location"))
             .flatMap(_.split(" "))
             .map((_, 1))
             .reduceByKey(_ + _)
             .toArray
    res17: Array[(String, Int)] = Array((1,3), (House,4), (NYC,3), (Force,3), (White,4), (Air,3))
  • 26. Selected RDD transformations and actions
    ● min(), max(), count()
    ● reduce[T](f: (T, T) ⇒ T): T
    ● fold[T](zeroValue: T)(op: (T, T) ⇒ T): T
    ● aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U): U
    ● flatMap[U](func: (T) ⇒ TraversableOnce[U]): RDD[U]
    ● mapPartitions[U](f: (Iterator[T]) ⇒ Iterator[U], preservesPartitioning: Boolean): RDD[U]
    ● sortBy[K](f: (T) ⇒ K, ascending: Boolean = true)
    ● groupBy[K](f: (T) ⇒ K): RDD[(K, Iterable[T])]
    ● intersection(other: RDD[T]): RDD[T]
    ● union(other: RDD[T]): RDD[T]
    ● subtract(other: RDD[T]): RDD[T]
    ● zip[U](other: RDD[U]): RDD[(T, U)]
    ● keyBy[K](f: (T) ⇒ K): RDD[(K, T)]
    ● sample(withReplacement: Boolean, fraction: Double)
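A contrived sketch exercising a few of these, just to show the shapes involved:

    val nums  = sc.parallelize(1 to 10)
    val other = sc.parallelize(6 to 15)

    nums.union(other).count                  // 20: union keeps duplicates
    nums.intersection(other).collect.sorted  // Array(6, 7, 8, 9, 10)
    nums.groupBy(_ % 2).collect              // evens and odds grouped under keys 0 and 1
    nums.aggregate(0)(_ + _, _ + _)          // 55: per-partition sums, then combined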
  • 27. RDDs can do even more...
  • 28. How Fast is it?
    ● Reading big data from Cassandra: Spark ~2x faster than Hadoop
    ● Minimum latency (1 node, vnodes disabled, tiny data): Spark 0.7s, Hadoop ~20s
    ● Minimum latency (1 node, vnodes enabled): Spark 1s, Hadoop ~8 minutes
    ● In-memory processing: up to 100x faster than Hadoop
  • 30. In-memory Processing. Call cache or persist(storageLevel) to store RDD data in memory.

    val rdd = sc.cassandraTable("newyork","presidentlocations")
      .filter(...)
      .map(...)
      .cache

    rdd.first // slow: loads data from Cassandra and keeps it in memory
    rdd.first // fast: doesn't read from Cassandra, reads from memory

    Multiple StorageLevels available:
    ● MEMORY_ONLY
    ● MEMORY_ONLY_SER
    ● MEMORY_AND_DISK
    ● MEMORY_AND_DISK_SER
    ● DISK_ONLY
    Also replicated variants available: just append _2 to the constant name.
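persist lets you pick one of those levels explicitly instead of cache's default; a sketch:

    import org.apache.spark.storage.StorageLevel

    val big = sc.cassandraTable("newyork","timelines")
      .persist(StorageLevel.MEMORY_AND_DISK_SER) // serialized blocks, spilled to disk when memory is tight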
  • 31. Fault Tolerance. [diagram: a cassandraTable RDD with partitions spread across Node 1, Node 2 and Node 3 (Replication Factor = 2), flowing through filter into a FilteredRDD and through map into a MappedRDD]
  • 32. Standalone App Example. https://github.com/RussellSpitzer/spark-cassandra-csv [diagram: a CSV (Car, Model, Color: Dodge Caravan Red; Ford F150 Black; Toyota Prius Green) is read into an RDD[CassandraRow] and written to the FavoriteCars table in Cassandra via column mapping]
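In the same spirit, a hypothetical skeleton of such an app (object name, file path, keyspace and column names are all assumptions, not the repo's actual code):

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    object FavoriteCars {
      def main(args: Array[String]) {
        val conf = new SparkConf()
          .setAppName("FavoriteCars")
          .set("spark.cassandra.connection.host", "127.0.0.1")
        val sc = new SparkContext(conf)

        sc.textFile("cars.csv")  // Car,Model,Color per line
          .map(_.split(","))
          .map(f => (f(0), f(1), f(2)))
          .saveToCassandra("test", "favoritecars", SomeColumns("car", "model", "color"))

        sc.stop()
      }
    }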
  • 33. Useful modules / projects
    ● Java API – for diehard Java developers
    ● Python API – for those allergic to static types
    ● Shark – Hive QL on Spark (discontinued)
    ● Spark SQL – new SQL engine based on the Catalyst query planner
    ● Spark Streaming – microbatch streaming framework
    ● MLlib – machine learning library
    ● GraphX – efficient representation and processing of graph data
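As a taste of Spark Streaming from that list, a minimal word-count sketch over a socket (host and port are assumptions); the resulting stream could then be written out much like the batch examples:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._ // pair-DStream operations like reduceByKey

    val ssc = new StreamingContext(sc, Seconds(5))
    ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()
    ssc.start()
    ssc.awaitTermination()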
  • 35. Thanks for listening! There is plenty more we can do with Spark but … Questions?