Welcome to the Jungle: Distributed Systems for Large Data Sets - StampedeCon 2012

Welcome to The Jungle
Building Distributed Systems for Large Data Sets

!   SQL solves all our problems!
!   Or does it?

The Problem with SQL

!   At some point, data is too large to fit on a single machine.
!   Then what do you do?

Your cluster:

SQL

Application

The first sign of trouble

!   Handles small queries pretty well
!   Large analytical queries?
!   Forget it!

!   Takes too long
!   Uses too many resources

!   Hadoop = HDFS + MapReduce
!   HDFS = Distributed, Fault-Tolerant File System
!   MapReduce = Highly distributed processing engine

!   MapReduce works if:
!   Your algorithm needs to touch every piece of data in the set
!   You can write your algorithm in a MapReduce structure
!   Your data set is gigantic
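
To make "a MapReduce structure" concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps using word counting. It is plain Python, not Hadoop's API; the function names (map_phase, shuffle, reduce_phase, run_job) and the sample records are assumptions made up for illustration.

```python
from collections import defaultdict

def map_phase(record):
    # Emit one (key, value) pair per word in the input record.
    for word in record.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Group all values by key, as a framework would between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Combine every value seen for one key into a single result.
    return key, sum(values)

def run_job(records):
    # Every record passes through map_phase: the job touches all the data.
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

records = ["big data is big", "data is everywhere"]  # hypothetical input
counts = run_job(records)
print(counts)
```

In a real cluster the map and reduce calls would run on many machines in parallel, which is why the approach pays off only when the data set is very large.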

!   MapReduce is not so good if:
!   Your data set is very small
!   Your algorithm doesn't need to touch everything
!   You only want to query specific pieces of data

!   No Indexing
!   Job startup cost
!   No indices
!   Always touches all the data
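
The cost of having no index can be sketched by counting how many records each approach touches to answer a single-key query. This is a toy in-memory comparison, not a benchmark of Hadoop or of any database; the record layout and the names scan_lookup and indexed_lookup are assumptions for illustration.

```python
# Hypothetical data set: 100,000 small records keyed by "id".
rows = [{"id": i, "payload": f"row-{i}"} for i in range(100_000)]

def scan_lookup(rows, wanted_id):
    # Without an index, every record is examined, even for a single match.
    touched = 0
    matches = []
    for row in rows:
        touched += 1
        if row["id"] == wanted_id:
            matches.append(row)
    return matches, touched

# Build an index once: id -> record.
index = {row["id"]: row for row in rows}

def indexed_lookup(index, wanted_id):
    # With an index, the query goes straight to the record it needs.
    row = index.get(wanted_id)
    return ([row] if row is not None else []), 1

scan_matches, scan_touched = scan_lookup(rows, 4242)
index_matches, index_touched = indexed_lookup(index, 4242)
print(scan_touched, index_touched)
```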

!   MapReduce code is usually a pain to write
!   requires a Java developer

!   lots of boilerplate for common tasks

Apache Pig

!   Data Flow Language
!   feels like using sed/awk

!   good at transformations of data

Apache Hive

!   SQL-like interface
!   good for large queries

!   maintains table information from files

Pig vs. Hive

!   Both can do the same thing
!   Hive is easier to learn

!   Pig is easier to maintain

!   Pretty much a matter of taste

The second sign

!   Your bulk processing and ad-hoc analysis are working great in Hadoop
!   But now your small queries are sucking

Scale SQL?

!   A Few options:
!   Buy Oracle RAC...$$$$

!   Static Sharding...hard to maintain

!   Don't do it?
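
One reason static sharding is hard to maintain can be sketched as follows. The sketch assumes the simplest scheme, shard = key modulo shard count, which the slides do not specify; the key range and the function name shard_for are made up for illustration. Adding a single shard forces most keys to move.

```python
def shard_for(key, shard_count):
    # Simplest static scheme: the shard is fixed by the key and the shard count.
    return key % shard_count

keys = range(10_000)  # hypothetical integer user ids

# Grow the cluster from 4 shards to 5 and count keys that change shard.
moved = sum(1 for k in keys if shard_for(k, 4) != shard_for(k, 5))
print(f"{moved} of {len(keys)} keys must be migrated")  # 8000 of 10000
```

Every one of those keys means data copied between machines while the application keeps running, and the application has to know which layout is current.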

Column-Oriented Storage

!   SQL =
!   Fixed Columns, infinite rows

!   Column-Oriented:
!   Rows are groups of Key-Value pairs
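
The contrast above can be sketched with nested dictionaries: each row key maps to its own set of column/value pairs, so two rows need not share any columns. This is a toy model only, not the HBase or Cassandra client API; the names put and get and the sample data are assumptions for illustration.

```python
# table: row key -> {column name -> value}
table = {}

def put(table, row_key, column, value):
    # Create the row on first write; each row keeps its own set of columns.
    table.setdefault(row_key, {})[column] = value

def get(table, row_key, column, default=None):
    # A missing row or a missing column is simply absent, not NULL-padded.
    return table.get(row_key, {}).get(column, default)

put(table, "user:1", "name", "Ada")
put(table, "user:1", "email", "ada@example.com")
put(table, "user:2", "name", "Grace")
put(table, "user:2", "last_login", "2012-08-01")  # column only user:2 has

print(sorted(table["user:1"]))  # ['email', 'name']
print(sorted(table["user:2"]))  # ['last_login', 'name']
```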

HBase/Cassandra

!   Both Column-oriented stores
!   Both highly available
!   Both rely on memory for performance

Apache Cassandra

!   Highly Available and Partition Tolerant
!   Attempts to hold as much data as possible in memory
!   Manages files on local disk

Eventual Consistency

!   Cassandra has Eventual Consistency
!   It is possible to read out-of-date data!
!   Also possible to guarantee consistency, at a cost
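
A stale read, and the cost of avoiding one, can be sketched with a simplified replica model. With N replicas, a write acknowledged by W of them and a read that contacts R of them must overlap whenever R + W > N. The sketch below is not Cassandra's API and ignores the background propagation a real cluster performs; the class and function names are assumptions, and the read deliberately contacts the replicas least likely to have the new value (the worst case).

```python
class Replica:
    def __init__(self):
        self.value = None
        self.timestamp = 0

def write(replicas, value, timestamp, w):
    # Only the first w replicas have applied the write so far.
    for replica in replicas[:w]:
        replica.value, replica.timestamp = value, timestamp

def read(replicas, r):
    # Worst case: contact the last r replicas, then keep the newest value seen.
    contacted = replicas[-r:]
    newest = max(contacted, key=lambda rep: rep.timestamp)
    return newest.value

def demo(n, w, r):
    replicas = [Replica() for _ in range(n)]
    write(replicas, "old", timestamp=1, w=n)  # earlier write reached everyone
    write(replicas, "new", timestamp=2, w=w)  # latest write reached only w
    return read(replicas, r)

print(demo(n=3, w=1, r=1))  # old  (R + W = 2 <= N: stale read possible)
print(demo(n=3, w=2, r=2))  # new  (R + W = 4 >  N: read overlaps the write)
```

The "cost" is visible in the parameters: the consistent configuration waits on two replicas for every read and every write instead of one.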

Why Eventual Consistency?

!   Data is only written once
!   Either it's there or not

!   You don't care if you get out-of-date data
!   Shopping Carts

Cassandra Strengths

!   Fast
!   Writes faster than Reads!

!   Easy to maintain
!   Self-contained

Cassandra Weaknesses

!   Consistency Model is complex
!   Scanning over rows is excruciating

Apache HBase

!   Uses HDFS as storage mechanism
!   Holds large proportion of data in RAM
!   need RAM >= 1% of your data size!
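
As a quick worked example of the 1% rule of thumb above, the sketch below turns a data size into a minimum RAM figure. Reading the rule as total RAM across the cluster is an assumption, and the function name min_ram_gib is made up for illustration.

```python
def min_ram_gib(data_tib, ram_fraction=0.01):
    # 1 TiB = 1024 GiB; ram_fraction is the slide's 1% rule of thumb.
    return data_tib * 1024 * ram_fraction

for data_tib in (1, 10, 100):
    print(f"{data_tib} TiB of data -> at least {min_ram_gib(data_tib):.1f} GiB of RAM")
```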

HBase Strengths

!   Strong consistency guarantee
!   Good at scanning over rows
!   Strong community
!   part of the Hadoop ecosystem
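
Row scans are cheap in HBase because rows are kept sorted by row key, so a scan over a key range reads one contiguous slice. A minimal in-memory sketch of that idea follows; it is not the HBase client API, and the class SortedTable and the row-key layout are assumptions for illustration.

```python
import bisect

class SortedTable:
    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> row data

    def put(self, row_key, row):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
        self._rows[row_key] = row

    def scan(self, start_key, stop_key):
        # Binary-search both ends, then walk the contiguous slice between them.
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]

table = SortedTable()
table.put("user2#2012-08-01", {"clicks": 7})
table.put("user1#2012-08-02", {"clicks": 3})
table.put("user1#2012-08-01", {"clicks": 5})
table.put("user3#2012-08-01", {"clicks": 1})

# All rows for user1: start is inclusive, stop is exclusive.
user1_rows = list(table.scan("user1#", "user2"))
print(user1_rows)
```

A store that places rows by a hash of the key scatters neighbouring keys across the cluster, which is one way to see why the same scan is painful there.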

HBase Weaknesses
!   Slower than Cassandra
!   HDFS is higher latency than direct disk
!   Complex to maintain
!   requires running

!   HDFS

!   ZooKeeper

HBase vs. Cassandra

!   Pick Cassandra if:
!   Doing lots of writes
!   need easy maintenance
!   don't care about consistency so much

!   Pick HBase if
!   Scanning over rows a lot
!   comfortable with maintaining Hadoop/ZooKeeper
!   Need simple consistency guarantees

Your cluster:
Hadoop
HBase/Cassandra
SQL

Application

This is complicated!

!   How do we configure it?
!   What if we have to run an algorithm on only a single node at a time?
!   What if we need to coordinate actions?

Apache ZooKeeper

!   Distributed Coordination System
!   Designed for creating distributed concurrency controls
!   also good for storing configuration
!   NOT good for storing anything else!
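
The "run on only a single node at a time" problem is commonly solved with a lock built from sequence-numbered entries: each node registers, the lowest number holds the lock, and when that node finishes or dies its entry disappears and the next one takes over. The sketch below simulates that pattern in memory. FakeCoordinator and its methods are made-up stand-ins, not the ZooKeeper API, and a real deployment would use a ZooKeeper client library instead.

```python
import itertools

class FakeCoordinator:
    """In-memory stand-in for a coordination service (not the ZooKeeper API)."""

    def __init__(self):
        self._counter = itertools.count()
        self._waiting = {}  # sequence number -> node name

    def join(self, node):
        # Each node gets the next sequence number, like a sequential entry.
        seq = next(self._counter)
        self._waiting[seq] = node
        return seq

    def leave(self, seq):
        # Called on release; an ephemeral entry would also vanish on a crash.
        self._waiting.pop(seq, None)

    def holder(self):
        # The node with the lowest sequence number holds the lock.
        if not self._waiting:
            return None
        return self._waiting[min(self._waiting)]

coord = FakeCoordinator()
seq_a = coord.join("node-a")
seq_b = coord.join("node-b")
seq_c = coord.join("node-c")

first = coord.holder()   # node-a runs the job
coord.leave(seq_a)       # node-a finishes (or its session expires)
second = coord.holder()  # node-b takes over
print(first, second)
```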

!   Now you have:
!   Bulk Processing with Hadoop

!   Large data queries with HBase/Cassandra
!   Coordination with ZooKeeper

!   Your old SQL database!

!   Chances are, you still need SQL for some stuff
!   If the data sizes are manageable, SQL is tried-and-true

The People Problem

!   Big Data systems are complicated
!   Lots of moving parts

!   Lots of places where things can go wrong
!   Need good people!

!   Try and Hire an expert directly...
!   Not that many out there

!   Train 2 or 3 experts instead
!   Worth every penny

Who should I hire?

!   Probably won't find direct experts
!   Look instead for people who:
!   are good with algorithms

!   are fast learners

!   are not risk-averse

Thank You

!   email:
!   scottfines@gmail.com
!   github:
!   scottfines

!   linkedin:
!   scottfines
