The state of analytics has changed dramatically over the last few years. Hadoop is now commonplace, and the ecosystem has evolved to include new tools such as Spark, Shark, and Drill that live alongside the old MapReduce-based standards. It can be difficult to keep up with the pace of change, and newcomers are left with a dizzying variety of seemingly similar choices. This is compounded by the number of possible deployment permutations, which can cause all but the most determined to simply stick with the tried and true. But many of the new tools offer serious advantages, and this presentation gives an analysis of the current state, including pros and cons as well as what's needed to bootstrap and operate the various options.
About Robbie Strickland, Software Development Manager at The Weather Channel
Robbie works for The Weather Channel’s digital division as part of the team that builds backend services for weather.com and the TWC mobile apps. He has been involved in the Cassandra project since 2010 and has contributed in a variety of ways over the years; this includes work on drivers for Scala and C#, the Hadoop integration, heading up the Atlanta Cassandra Users Group, and answering lots of Stack Overflow questions.
10. Hadoop works fine, right?
● It’s 11 years old (!!)
● It’s relatively slow and inefficient
● … because everything gets written to disk
● Writing MapReduce code sucks
● Lots of code for even simple tasks
● Boilerplate
● Configuration isn’t for the faint of heart
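The "lots of code" complaint is about shape: even a word count in classic MapReduce means a mapper, a reducer, and job wiring. A rough sketch of that shape in plain Python (not the actual Hadoop API, just the structure) next to the one-liner a collection-style API allows:

```python
from collections import Counter, defaultdict

# MapReduce shape: explicit mapper, shuffle, and reducer (illustrative, not Hadoop's API)
def mapper(line):
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    return (word, sum(counts))

def run_job(lines):
    shuffled = defaultdict(list)            # stand-in for the shuffle phase
    for line in lines:
        for key, value in mapper(line):
            shuffled[key].append(value)
    return dict(reducer(k, v) for k, v in shuffled.items())

# Collection-style shape: the same job as one expression
def count_words(lines):
    return Counter(word for line in lines for word in line.split())

lines = ["to be or not to be"]
assert run_job(lines) == dict(count_words(lines))
```

Both produce identical results; the difference is that the first shape also comes with job configuration and boilerplate in real Hadoop deployments.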
11. Landscape has changed ...
Lots of new tools available:
● Cloudera Impala (MPP queries for Hadoop data)
● Apache Drill (MPP queries for Hadoop data)
● Proprietary (Splunk, keen.io, etc.; varies by vendor)
● Spark / Spark SQL
● Shark
22. Spark wins?
● Can be a Hadoop replacement
● Works with any existing InputFormat
● Doesn’t require Hadoop
● Supports batch & streaming analysis
● Functional programming model
● Direct Cassandra integration
37. What is Spark?
● In-memory cluster computing
● 10-100x faster than MapReduce
● Collection API over large datasets
● Scala, Python, Java
● Stream processing
● Supports any existing Hadoop input/output format
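"Collection API" means the driver program manipulates a distributed dataset as if it were a local collection. A minimal local mimic (a hypothetical `MiniRDD`, not Spark's implementation) showing the chaining style the deck is referring to:

```python
from functools import reduce as _reduce

class MiniRDD:
    """Toy stand-in for a Spark RDD: same chaining style, purely local data."""
    def __init__(self, items):
        self.items = list(items)

    def map(self, f):
        return MiniRDD(f(x) for x in self.items)

    def filter(self, p):
        return MiniRDD(x for x in self.items if p(x))

    def reduce(self, f):
        return _reduce(f, self.items)

# Same shape as sc.parallelize(nums).map(...).filter(...).reduce(...) in Spark
nums = MiniRDD(range(10))
total = (nums.map(lambda x: x * x)
             .filter(lambda x: x % 2 == 0)
             .reduce(lambda a, b: a + b))
assert total == 120   # 0 + 4 + 16 + 36 + 64
```

In real Spark the collection is partitioned across the cluster and the lambdas are shipped to the data, but the program reads the same way.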
42. What is Spark?
● Native graph processing via GraphX
● Native machine learning via MLlib
● SQL queries via Spark SQL
● Works out of the box on EMR
● Easily join datasets from disparate sources
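Joining disparate sources works because everything becomes a keyed dataset first; Spark's pair-RDD `join` then pairs values by key. A local sketch of those join semantics (the profile/click data is illustrative):

```python
from collections import defaultdict

def rdd_join(left, right):
    """Inner join on key, mirroring Spark's pair-RDD join:
    (k, v1) joined with (k, v2) yields (k, (v1, v2))."""
    right_index = defaultdict(list)
    for k, v in right:
        right_index[k].append(v)
    return [(k, (lv, rv)) for k, lv in left for rv in right_index.get(k, [])]

# e.g. user profiles from Cassandra joined with page views from HDFS
profiles = [("u1", "Atlanta"), ("u2", "Austin")]
clicks   = [("u1", "/radar"), ("u1", "/maps"), ("u3", "/home")]
joined = rdd_join(profiles, clicks)
assert joined == [("u1", ("Atlanta", "/radar")), ("u1", ("Atlanta", "/maps"))]
```

Keys with no match on the other side (u2, u3 above) drop out, exactly as in an inner join.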
48. Spark on Cassandra
● Direct integration via DataStax driver (cassandra-driver-spark on GitHub)
● No job config cruft
● Supports server-side filters (where clauses)
● Data locality aware
● Uses HDFS, CassandraFS, or other distributed FS for checkpointing
59. Resilient Distributed Dataset (RDD)
● A distributed collection of items
● Transformations
○ Similar to those found in Scala collections
○ Lazily processed
● Can recalculate from any point of failure
60. RDD Transformations vs Actions
Transformations:
Produce new RDDs
Actions:
Require the materialization of the records to produce a value
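The transformation/action split maps neatly onto Python's lazy iterators: `map` and `filter` only describe the work, and an action like `sum` forces records through the pipeline. A local analogy (not Spark itself) making the laziness observable:

```python
calls = []

def numbers():
    # record each element actually produced, so laziness is observable
    for i in range(5):
        calls.append(i)
        yield i

# Transformations: build the pipeline -- nothing has run yet
pipeline = map(lambda x: x * x, filter(lambda x: x % 2 == 0, numbers()))
assert calls == []                # no records materialized so far

# Action: forcing a value pulls every record through the whole pipeline
result = sum(pipeline)
assert calls == [0, 1, 2, 3, 4]   # all records materialized by the action
assert result == 20               # 0 + 4 + 16
```

Spark behaves the same way at cluster scale: chained transformations cost nothing until an action (count, collect, save, etc.) triggers the job.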
69. Spark Streaming
● Creates RDDs from stream source on a defined interval
● Same ops as “normal” RDDs
● Supports a variety of sources
○ Queues
○ Twitter
○ Flume
○ etc.
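Spark Streaming's model is micro-batching: slice the stream into intervals, and each interval becomes an ordinary RDD. A sketch with size-based batches standing in for Spark's time-based ones:

```python
import itertools

def micro_batches(stream, interval_size):
    """Slice a stream into fixed-size batches -- a stand-in for
    Spark Streaming's time-based intervals (size-based here for simplicity)."""
    it = iter(stream)
    while True:
        batch = list(itertools.islice(it, interval_size))
        if not batch:
            return
        yield batch

# Each batch is an ordinary collection: apply the same "RDD" ops per interval
events = ["click", "view", "click", "click", "view", "click", "view"]
counts_per_batch = [batch.count("click") for batch in micro_batches(events, 3)]
assert counts_per_batch == [2, 2, 0]
```

This is why "same ops as normal RDDs" holds: the streaming API is the batch API applied to each interval's data.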
73. Real World Task Distribution
Transformations and Actions:
Similar to Scala but …
Your choice of transformations needs to be made with task distribution and memory in mind
74. Real World Task Distribution
How do we do that?
● Partition the data appropriately for the number of cores (at least 2x the number of cores)
● Filter early and often (Queries, Filters, Distinct...)
● Use pipelines
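"Filter early and often" pays because every record that survives a stage costs work in all later stages. A sketch counting how many times an expensive transformation runs depending on where the filter sits (the `expensive` function and data are illustrative; the two pipelines produce the same result):

```python
work_done = {"late": 0, "early": 0}

def expensive(x, label):
    work_done[label] += 1         # tally of expensive-map invocations
    return x * x

records = list(range(1000))

# Filter late: the expensive map touches every record first
late = [v for v in (expensive(x, "late") for x in records) if v % 100 == 0]

# Filter early: only surviving records reach the expensive map
# (x % 10 == 0 is equivalent to (x*x) % 100 == 0, so results match)
early = [expensive(x, "early") for x in records if x % 10 == 0]

assert late == early
assert work_done == {"late": 1000, "early": 100}
```

Same answer, one tenth of the expensive work; on a cluster the savings also show up as less data shuffled between stages.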
75. Real World Task Distribution
How do we do that?
● Partial Aggregation
● Keep each algorithm as simple and efficient as possible while still answering the question
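Partial aggregation means combining values inside each partition before anything crosses the network, the idea behind Spark's `reduceByKey` as opposed to `groupByKey`. A local sketch with explicit partitions (the data is illustrative):

```python
from collections import Counter

partitions = [
    [("a", 1), ("b", 1), ("a", 1)],   # records held by node 1
    [("a", 1), ("b", 1), ("b", 1)],   # records held by node 2
]

def partial_sums(partition):
    """Combine locally: each node ships one record per key, not every record."""
    c = Counter()
    for k, v in partition:
        c[k] += v
    return c

partials = [partial_sums(p) for p in partitions]
shuffled_records = sum(len(c) for c in partials)   # records crossing the network
total = sum(partials, Counter())                   # final merge of partial sums

assert dict(total) == {"a": 3, "b": 3}
assert shuffled_records == 4    # vs 6 raw records with no partial aggregation
```

The savings grow with the number of duplicate keys per partition, which is why partial aggregation is a standard first lever for shuffle-heavy jobs.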