At StampedeCon 2012 in St. Louis, Scott Fines of NISC presents: Recent years have seen a sudden and rapid introduction of new technologies for distributing applications to essentially arbitrary levels. The variety and depth of these systems have grown to match, and it can be a challenge just to keep up. In this talk, I’ll discuss some of the more common systems such as Hadoop, HBase, and Cassandra, and some of the different scenarios and pitfalls of using them. I’ll cover when MapReduce is powerful and helpful, and when it’s better to use a different approach. Putting it all together, I’ll mention ZooKeeper, Flume, and some of the surrounding small projects that can help make a usable system.
8. ! MapReduce works if:
! Your algorithm needs to touch every piece of data in the set
! You can write your algorithm in a MapReduce structure
! Your data set is gigantic
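To make "a MapReduce structure" concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce steps using word count. It is plain Python, not Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample input are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)

def run_job(records):
    """Run map over every record, then reduce each group of values."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = run_job(["big data is big", "data is everywhere"])
print(counts)
```

Note that `run_job` reads every record, which matches the first condition on the slide: the approach fits when the algorithm has to touch all of the data anyway.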
9. ! MapReduce is not so good if:
! Your data set is very small
! Your algorithm doesn't need to touch everything
! You only want to query specific pieces of data
10. ! No Indexing
! Job startup cost
! No indices
! Always touches all the data
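The cost of having no index can be illustrated with a small sketch that counts how many records are touched to answer one specific query. The data set, field names, and helper functions below are hypothetical; the scan stands in for a job that must read everything, and the dictionary stands in for an index.

```python
# Hypothetical data set: 10,000 small records keyed by an integer id.
records = [{"id": i, "value": i * 2} for i in range(10_000)]

def full_scan(records, wanted_id):
    """Read every record, keeping the ones that match; return (matches, touched)."""
    touched = 0
    matches = []
    for record in records:
        touched += 1
        if record["id"] == wanted_id:
            matches.append(record)
    return matches, touched

# An index maps the lookup key straight to the record.
index = {record["id"]: record for record in records}

def indexed_lookup(index, wanted_id):
    """Fetch one record through the index; only that record is touched."""
    return index.get(wanted_id), 1

scan_matches, scan_touched = full_scan(records, 9_000)
hit, index_touched = indexed_lookup(index, 9_000)
print(scan_touched, index_touched)
```

For a query about one specific piece of data, the scan does the same amount of work no matter how few records match, which is why the previous slide lists that case under "not so good".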
11. ! MapReduce code is usually a pain to write
! requires a Java developer
! lots of boilerplate for common tasks
19. Column-Oriented Storage
! SQL =
! Fixed Columns, infinite rows
! Column-Oriented:
! Rows are groups of Key-Value pairs
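The contrast between the two models can be sketched with ordinary Python data structures. This is only an illustration of the shape of the data, not of how HBase or Cassandra store it; the column names, row keys, and the `get_column` helper are hypothetical.

```python
# SQL style: every row has the same fixed columns, even when a value is missing.
sql_columns = ("user_id", "name", "email", "phone")
sql_rows = [
    (1, "Ada", "ada@example.com", None),
    (2, "Grace", None, None),
]

# Column-oriented style: each row key maps to its own group of key-value pairs,
# so different rows can carry different (and different numbers of) columns.
wide_rows = {
    "user:1": {"name": "Ada", "email": "ada@example.com"},
    "user:2": {"name": "Grace", "visit:2012-08-01": "home", "visit:2012-08-02": "cart"},
}

def get_column(rows, row_key, column, default=None):
    """Look up one column of one row; absent columns simply do not exist."""
    return rows.get(row_key, {}).get(column, default)

print(get_column(wide_rows, "user:1", "email"))
print(get_column(wide_rows, "user:2", "email"))
```

In the second structure a row never stores a placeholder for a column it does not have, and adding a new column to one row requires no change to any other row.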
20. HBase/Cassandra
! Both Column-oriented stores
! Both highly available
! Both rely on memory for performance
21. Apache Cassandra
! Highly Available and Partition Tolerant
! Attempts to hold as much data as possible in memory
! Manages files on local disk
22. Eventual Consistency
! Cassandra has Eventual Consistency
! It is possible to read out-of-date data!
! Also possible to guarantee consistency, at a cost
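A toy model can show both halves of this slide: how a read can return out-of-date data, and how contacting more replicas removes that risk at the price of more work per request. The sketch assumes the usual quorum rule (reads and writes overlap when R + W > N). It is not Cassandra's API; the `Replica` class and the `write`/`read` helpers are hypothetical, and the worst-case choice of which replicas are contacted is deliberate.

```python
class Replica:
    """One copy of a single value, tagged with the timestamp of its last write."""
    def __init__(self):
        self.value = None
        self.timestamp = 0

def fresh_cluster(n=3):
    """Create n empty replicas (N in the R + W > N rule)."""
    return [Replica() for _ in range(n)]

def write(replicas, value, timestamp, w):
    """Acknowledge the write once the first w replicas have stored it."""
    for replica in replicas[:w]:
        replica.value, replica.timestamp = value, timestamp

def read(replicas, r):
    """Ask the last r replicas and return the value with the newest timestamp."""
    contacted = replicas[-r:]
    newest = max(contacted, key=lambda replica: replica.timestamp)
    return newest.value

# W=1, R=1 on three replicas: 1 + 1 <= 3, so the read can miss the write.
cluster = fresh_cluster()
write(cluster, "cart-v1", timestamp=1, w=1)
stale = read(cluster, r=1)

# W=2, R=2 on three replicas: 2 + 2 > 3, so at least one contacted replica has it.
cluster = fresh_cluster()
write(cluster, "cart-v1", timestamp=1, w=2)
fresh = read(cluster, r=2)

print(stale, fresh)
```

The "cost" in the second case is that every write waits on two replicas and every read contacts two, instead of one each.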
23. Why Eventual Consistency?
! Data is only written once
! Either it's there or not
! You don't care if you get out-of-date data
! Shopping Carts
24. Cassandra Strengths
! Fast
! Writes faster than Reads!
! Easy to maintain
! Self-contained
25. Cassandra Weaknesses
! Consistency Model is complex
! Scanning over rows is excruciating
26. Apache HBase
! Uses HDFS as storage mechanism
! Holds large proportion of data in RAM
! need RAM >= 1% of your data size!
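The 1% rule of thumb is easy to turn into arithmetic. The sketch below assumes the rule applies to the total data size and total RAM across the cluster, which the slide does not state; the function name and the 10 TB example are hypothetical.

```python
TB = 10**12  # decimal terabyte, in bytes
GB = 10**9   # decimal gigabyte, in bytes

def min_ram_bytes(data_bytes, percent=1):
    """RAM suggested by the rule of thumb: at least `percent`% of the data size."""
    return data_bytes * percent // 100

ram_needed = min_ram_bytes(10 * TB)
print(ram_needed // GB, "GB of RAM for 10 TB of data")
```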
27. HBase Strengths
! Strong consistency guarantee
! Good at scanning over rows
! Strong community
! part of the Hadoop ecosystem
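The slides do not say why row scans favor HBase. One common explanation, assumed here, is that HBase keeps rows sorted by row key, while Cassandra's random partitioner places rows by a hash of the key. The sketch below models only that difference; the key format and helper names are hypothetical, and MD5 is used simply as a stand-in hash.

```python
import bisect
import hashlib

keys = [f"user:{i:04d}" for i in range(1000)]

# Sorted layout: neighbouring keys are stored next to each other.
sorted_keys = sorted(keys)

def range_scan_sorted(sorted_keys, start, stop):
    """Binary-search to the start key, then read a contiguous slice."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi], hi - lo  # (rows returned, rows touched)

# Hashed layout: storage order follows the hash, not the key.
hashed_keys = sorted(keys, key=lambda key: hashlib.md5(key.encode()).hexdigest())

def range_scan_hashed(hashed_keys, start, stop):
    """Keys in the range are scattered, so every key has to be checked."""
    touched = 0
    hits = []
    for key in hashed_keys:
        touched += 1
        if start <= key < stop:
            hits.append(key)
    return sorted(hits), touched

sorted_hits, sorted_touched = range_scan_sorted(sorted_keys, "user:0100", "user:0110")
hashed_hits, hashed_touched = range_scan_hashed(hashed_keys, "user:0100", "user:0110")
print(sorted_touched, hashed_touched)
```

Both scans return the same ten rows, but the hashed layout has to look at all one thousand keys to find them.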
28. HBase weaknesses
! Slower than Cassandra
! HDFS is higher latency than direct disk
! Complex to maintain
! requires running
! HDFS
! ZooKeeper
29. HBase vs. Cassandra
! Pick Cassandra if:
! Doing lots of writes
! need easy maintenance
! don't care about consistency so much
! Pick HBase if
! Scanning over rows a lot
! comfortable with maintaining Hadoop/ZooKeeper
! Need simple consistency guarantees
30. Your cluster:
[Diagram: HBase/Hadoop, Cassandra, SQL, Application]
31. This is complicated!
! How do we configure it?
! What if we have to run an algorithm on only a single node at a time?
! What if we need to coordinate actions?
32. Apache ZooKeeper
! Distributed Coordination System
! Designed for creating distributed concurrency controls
! also good for storing configuration
! NOT good for storing anything else!
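To illustrate what a "distributed concurrency control" looks like, here is an in-memory sketch of the idea behind ZooKeeper's documented lock recipe: each participant takes a sequentially numbered ticket, and the lowest outstanding ticket holds the lock. It is a toy stand-in, not the ZooKeeper client API; the `ToyCoordinator` class, its methods, and the node names are hypothetical.

```python
import itertools

class ToyCoordinator:
    """In-memory stand-in for a coordination service; NOT the ZooKeeper API."""

    def __init__(self):
        self._sequence = itertools.count()
        self._waiting = {}  # ticket number -> node name

    def join(self, node_name):
        """Register a node and hand back its ticket (plays the role of a sequential znode)."""
        ticket = next(self._sequence)
        self._waiting[ticket] = node_name
        return ticket

    def leave(self, ticket):
        """Drop a ticket, as when a node finishes its work or its session ends."""
        self._waiting.pop(ticket, None)

    def holder(self):
        """The node with the lowest outstanding ticket is the one allowed to run."""
        if not self._waiting:
            return None
        return self._waiting[min(self._waiting)]

coordinator = ToyCoordinator()
tickets = {name: coordinator.join(name) for name in ["node-a", "node-b", "node-c"]}

first = coordinator.holder()
coordinator.leave(tickets["node-a"])
second = coordinator.holder()
print(first, second)
```

Because only one ticket can be the lowest at any moment, this pattern answers the earlier question of how to run an algorithm on only a single node at a time.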
33. ! Now you have:
! Bulk Processing with Hadoop
! Large data queries with HBase/Cassandra
! Coordination with ZooKeeper
! Your old SQL database!
34. ! Chances are, still need SQL for some stuff
! If the data sizes are manageable, SQL is tried-and-true
35. The People Problem
! Big Data systems are complicated
! Lots of moving parts
! Lots of places where things can go wrong
! Need good people!
36. ! Try and Hire an expert directly...
! Not that many out there
37. ! Train 2 or 3 experts instead
! Worth every penny
38. Who should I hire?
! Probably won't find direct experts
! Look instead for people who:
! are good with algorithms
! are fast learners
! not risk-averse