The state of analytics has changed dramatically over the last few years. Hadoop is now commonplace, and the ecosystem has evolved to include new tools such as Spark, Shark, and Drill that live alongside the old MapReduce-based standards. It can be difficult to keep up with the pace of change, and newcomers are left with a dizzying variety of seemingly similar choices. This is compounded by the number of possible deployment permutations, which can cause all but the most determined to simply stick with the tried and true. In this talk I will introduce you to a powerhouse combination of Cassandra and Spark, which provides a high-speed platform for both real-time and batch analysis.
14. Hadoop works fine, right?
● It’s 11 years old (!!)
● It’s relatively slow and inefficient
● … because everything gets written to disk
● Writing MapReduce code sucks
● Lots of code for even simple tasks
● Boilerplate
● Configuration isn’t for the faint of heart
17. Landscape has changed ...
Lots of new tools available:
● Cloudera Impala - MPP queries for Hadoop data
● Apache Drill - MPP queries for Hadoop data
● Proprietary (Splunk, keen.io, etc.) - varies by vendor
● Spark / Spark SQL
● Shark
26. Spark wins?
● Can be a Hadoop replacement
● Works with any existing InputFormat
● Doesn’t require Hadoop
● Supports batch & streaming analysis
● Functional programming model
● Direct Cassandra integration
41. What is Spark?
● In-memory cluster computing
● 10-100x faster than MapReduce
● Collection API over large datasets
● Scala, Python, Java
● Stream processing
● Supports any existing Hadoop input / output format
46. What is Spark?
● Native graph processing via GraphX
● Native machine learning via MLlib
● SQL queries via Spark SQL
● Works out of the box on EMR
● Easily join datasets from disparate sources
48. Deployment Options
3 cluster manager choices:
● Standalone - included with Spark & easy to set up
● Mesos - generic cluster manager that can also handle MapReduce
● YARN - Hadoop 2 resource manager
49. Spark Word Count
val file = sc.textFile("hdfs://…")
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://…")
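The same chain can be traced on a plain Scala collection, since Spark's RDD API mirrors the collection API. This is a local analogy only: `groupBy` plus a sum stands in for `reduceByKey`, and the input strings are made up.

```scala
// Word count on a plain Scala collection -- no cluster needed.
val lines = Seq("the quick brown fox", "the lazy dog")

val counts = lines
  .flatMap(line => line.split(" "))   // one element per word
  .map(word => (word, 1))             // pair each word with a 1
  .groupBy(_._1)                      // reduceByKey analogue:
  .map { case (word, pairs) =>        // group by word, sum the 1s
    (word, pairs.map(_._2).sum)
  }
```

In Spark, `reduceByKey(_ + _)` performs the group-and-sum in one distributed step; the collection version materializes the groups first.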
61. Spark on Cassandra
● Direct integration via DataStax driver - cassandra-driver-spark on github
● No job config cruft
● Supports server-side filters (where clauses)
● Data locality aware
● Uses HDFS, CassandraFS, or other distributed FS for checkpointing
74. Resilient Distributed Dataset (RDD)
● A distributed collection of items
● Transformations
○ Similar to those found in Scala collections
○ Lazily processed
● Can recalculate from any point of failure
75. RDD Transformations vs Actions
Transformations: produce new RDDs
Actions: require the materialization of the records to produce a value
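The lazy/eager split can be observed locally with Scala views, which behave analogously to RDD transformations and actions. This is an analogy, not Spark itself: `view` plays the RDD, `map` the transformation, `sum` the action.

```scala
// Transformations are lazy; actions force evaluation.
var evaluated = 0

val data = (1 to 10).view          // lazy sequence, like an RDD
val doubled = data.map { n =>      // "transformation": records
  evaluated += 1                   // nothing runs yet
  n * 2
}

assert(evaluated == 0)             // no work has been done so far

val total = doubled.sum            // "action": materializes records
assert(evaluated == 10)            // now every element was touched
```

This is also why a long chain of transformations costs nothing until an action (`collect`, `count`, `saveAsTextFile`, ...) pulls results through it.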
79. Spark SQL
● Provides SQL access to structured data
○ Existing RDDs
○ Hive warehouses (uses existing metastore, SerDes and UDFs)
○ JDBC/ODBC - use existing BI tools to query large datasets
81. Getting set up
● Download Spark 1.2.x
● Download Cassandra 2.1.x
● Add the spark-cassandra-connector to your project
"com.datastax.spark" % "spark-cassandra-connector_2.10" % "1.2.0-alpha3"
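Putting that dependency line in context, a minimal build.sbt might look like the sketch below. The connector coordinate and version come from the slide; the spark-core version is an assumed 1.2.x release and should be checked against your cluster.

```scala
// Minimal build.sbt sketch for the versions on this slide.
// Scala 2.10 matches the _2.10 suffix of the connector artifact.
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // "provided" because the Spark runtime supplies these jars
  "org.apache.spark"   %% "spark-core"                     % "1.2.1" % "provided",
  "com.datastax.spark"  % "spark-cassandra-connector_2.10" % "1.2.0-alpha3"
)
```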
90. Spark Streaming
● Creates RDDs from stream source on a defined interval
● Same ops as “normal” RDDs
● Supports a variety of sources
● Exactly once message guarantee
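The micro-batch model can be sketched with plain Scala: bucket timestamped events by a fixed interval, then run an ordinary computation per bucket, just as a streaming job runs "normal" RDD ops per interval. The `Event` type and sample timestamps are invented for illustration.

```scala
// Toy simulation of micro-batching: slice a timestamped event
// stream into 2-second batches, then count events per batch.
case class Event(timestampMs: Long, user: String)

val events = Seq(
  Event(0L, "a"), Event(500L, "b"),     // batch 0
  Event(2100L, "a"),                    // batch 1
  Event(4300L, "c"), Event(4400L, "a")  // batch 2
)

val batchIntervalMs = 2000L
val batches = events.groupBy(e => e.timestampMs / batchIntervalMs)

// Per-batch counts, as a streaming count job would produce them.
val countsPerBatch = batches.map { case (batch, es) => (batch, es.size) }
```

In real Spark Streaming the interval is set on the StreamingContext and each batch arrives as an RDD; the per-batch logic is the same code you would write for a batch job.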
91. Spark Streaming - Use Cases
● Applications: intrusion detection, fraud detection, log processing
● Sensors: malfunction detection, dynamic process optimization, supply chain planning
● Web: site analytics, recommendations, sentiment analysis
● Mobile: network analytics, location based advertising
● ...
97. Real World Task Distribution
Transformations and Actions:
Similar to Scala, but your choice of transformations must be made with task distribution and memory in mind
100. Real World Task Distribution
How do we do that?
● Partition the data appropriately for the number of cores (at least 2x the number of cores)
● Filter early and often (Queries, Filters, Distinct...)
● Use pipelines
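The 2x-cores rule of thumb can be sketched in plain Scala: split the dataset into roughly twice as many chunks as there are cores, so every core stays busy even when chunks finish at different speeds. The chunking here is illustrative; in Spark you would set this via the numPartitions argument or repartition.

```scala
// Split a dataset into roughly (2 x cores) partitions.
val cores = Runtime.getRuntime.availableProcessors()
val targetPartitions = cores * 2

val data = (1 to 1000).toVector
val partitionSize =
  math.ceil(data.size.toDouble / targetPartitions).toInt
val partitions = data.grouped(partitionSize).toVector

// Every record lands in exactly one partition, and we never
// exceed the target partition count.
```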
102. Real World Task Distribution
How do we do that?
● Partial aggregation
● Make the algorithm as simple and efficient as it can be while still answering the question
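Partial aggregation is why reduceByKey tends to beat groupByKey in practice: values are combined within each partition before anything crosses the network, so only one record per key per partition is shuffled. A local sketch with two hand-made "partitions" (the data is invented):

```scala
// Two partitions of (word, 1) pairs, as a word count would produce.
val partitionA = Seq(("x", 1), ("y", 1), ("x", 1))
val partitionB = Seq(("x", 1), ("y", 1))

// Step 1: map-side combine -- sum within each partition first.
def localCombine(p: Seq[(String, Int)]): Map[String, Int] =
  p.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

val partials = Seq(localCombine(partitionA), localCombine(partitionB))

// Step 2: merge the small partial maps. Only 4 records cross the
// "shuffle" here instead of all 5 raw pairs; the gap grows with
// how often keys repeat within a partition.
val merged = partials.flatten
  .groupBy(_._1)
  .map { case (k, vs) => (k, vs.map(_._2).sum) }
```

groupByKey, by contrast, ships every raw value to the reducer, which is why it appears on the costly list above.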
104. Real World Task Distribution
Some Common Costly Transformations:
● sorting
● groupByKey
● reduceByKey
● ...
105. Partitioning Example
User event log:
● Time-series events
● Tracks user interactions with the system over time
● Location check-ins, page/module views, profile changes, ad impressions/clicks, etc.
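A hedged sketch of how such an event log might be modeled in CQL, using a compound partition key so one user-day of events stays in a single partition (which also bounds partition size). All table, column names, and the day-bucketing choice are illustrative, not from the talk.

```sql
CREATE TABLE user_events (
  user_id    uuid,
  day        text,      -- e.g. '2015-02-14'; bounds partition size
  event_ts   timeuuid,  -- clustering column: events stay time-ordered
  event_type text,      -- check-in, page view, ad click, ...
  payload    text,
  PRIMARY KEY ((user_id, day), event_ts)
);
```

A layout like this lets Spark's locality-aware Cassandra integration read each partition's events from the node that owns them.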