Cassandra, Modeling and Availability at AMUG

Conceptual Modeling Differences From A RDBMS
Matthew F. Dennis, DataStax // @mdennis

Austin MySQL User Group
January 11, 2012

Cassandra Is Not Relational
get out of the relational mindset when working
with Cassandra (or really any NoSQL DB)

Work Backwards From Queries
Think in terms of queries, not in terms of
normalizing the data; in fact, you often want to
denormalize (already common in the data
warehousing world, even in RDBMS)

OK great, but how do I do that?
Well, you need to know how Cassandra models
data (its data model follows Google's Bigtable)

research.google.com/archive/bigtable-osdi06.pdf

Go Read It!

In Cassandra:

➔ data is organized into Keyspaces (usually one per app)

➔ each Keyspace can have multiple Column Families

➔ each Column Family can have many Rows

➔ each Row has a Row Key and a variable number of Columns

➔ each Column consists of a Name, Value and Timestamp
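The hierarchy above can be pictured as nested maps. This is only a rough sketch of the logical model, not how Cassandra stores anything; the column family name "users" and the timestamp values are made up for illustration, and `get_column` is a hypothetical helper.

```python
# Illustrative only: nested dicts standing in for Cassandra's logical model.
# Keyspace -> Column Family -> Row Key -> Column Name -> (Value, Timestamp)
# The name "users" and the timestamps are assumptions made for this sketch.
app_keyspace = {
    "users": {
        "thepaul": {
            "office": ("Austin", 1),
            "OS": ("OSX", 1),
            "twitter": ("thepaul0", 1),
        },
        "thobbs": {
            # Rows in the same column family need not have the same columns.
            "office": ("Austin", 1),
            "twitter": ("tylhobbs", 1),
        },
    }
}


def get_column(keyspace, column_family, row_key, column_name):
    """Return (value, timestamp) for one column, or None if it is absent."""
    return keyspace.get(column_family, {}).get(row_key, {}).get(column_name)


os_column = get_column(app_keyspace, "users", "thepaul", "OS")
missing = get_column(app_keyspace, "users", "thobbs", "OS")
```

The second lookup returns None because that row simply has no "OS" column; there is no fixed schema forcing every row to carry every column.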

In Cassandra, Keyspaces:

➔ are similar in concept to a “database” in some RDBMSs

➔ are stored in separate directories on disk

➔ are usually one-to-one with applications

➔ are usually the administrative unit for things related to ops

➔ contain multiple column families

In Cassandra, In Keyspaces, Column Families:
➔ are similar in concept to a “table” in most RDBMSs

➔ are stored in separate files on disk (many per CF)

➔ are usually approximately one-one with query type

➔ are usually the administrative unit for things related to your data

➔ can contain many (~billion* per node) rows

* for a good sized node
(you can always add nodes)

In Cassandra, In Keyspaces, In Column Families ...

Rows

Row Key    Columns (Column Name: Column Value)
thepaul    office: Austin    OS: OSX      twitter: thepaul0
mdennis    office: UA        OS: Linux    twitter: mdennis
thobbs     office: Austin                 twitter: tylhobbs

Rows Are Randomly Ordered
(if using the RandomPartitioner)




Columns Are Ordered by Name
(by a configurable comparator)

Columns are ordered because
doing so allows very efficient
implementations of useful and
common operations

(e.g. merge joins)

In particular, within a row I can
find given columns by name very
quickly (ordered names => O(log n)
binary search).

More importantly, I can query for a
slice between a start and end

[Figure: one row, with Row Key RK and columns ordered by name]

RK | ts0 ts1 ... ... tsM ... ... ... ... tsN ... ... ... ... ...
                     ^ start             ^ end
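A slice over ordered column names can be sketched with two binary searches. This is a minimal illustration using Python's standard `bisect` module on in-memory lists, not Cassandra's actual read path; `slice_row`, the integer column names and the "v..." values are placeholders.

```python
from bisect import bisect_left, bisect_right


def slice_row(names, values, start, end):
    """Return (name, value) pairs with start <= name <= end.

    `names` must already be sorted; `values` is parallel to `names`.
    Two O(log n) binary searches find the bounds, then the result is
    one contiguous run of columns.
    """
    lo = bisect_left(names, start)
    hi = bisect_right(names, end)
    return list(zip(names[lo:hi], values[lo:hi]))


# Integers stand in for timestamps used as column names (placeholder data).
names = [0, 1, 5, 9, 12]
values = ["v0", "v1", "v5", "v9", "v12"]

middle = slice_row(names, values, 1, 9)
empty = slice_row(names, values, 2, 4)
```

Because the columns are contiguous in sorted order, the slice is a single range rather than a set of scattered lookups.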

Why does that matter?
Because columns within a row aren't static!

The Column Name Can Be Part of Your Data

INTC ts0: $25.20 ts1: $25.25 ...

AMR ts0: $6.20 ts9: $0.26 ...

CRDS ts0: $1.05 ts5: $6.82 ...

Columns Are Ordered by Name
(in this case by a TimeUUID Comparator)

Turns Out That Pattern Comes Up A Lot
➔ stock ticks
➔ event logs

➔ ad clicks/views

➔ sensor records

➔ access/error logs

➔ plane/truck/person/”entity” locations

➔…

OK, but I can do that in SQL
Not efficiently at scale, at least not easily ...

How it Looks In an RDBMS
(the AMR rows are the data I care about)

ticker  timestamp  bid  ask  ...
AMR     ts0        ...  ...  ...
...     ...        ...  ...  ...
CRDS    ts0        ...  ...  ...
...     ...        ...  ...  ...
...     ts0        ...  ...  ...
AMR     ts1        ...  ...  ...
...     ...        ...  ...  ...
...     ...        ...  ...  ...
...     ts1        ...  ...  ...
AMR     ts2        ...  ...  ...
...     ts2        ...  ...  ...

How it Looks In an RDBMS
ticker  timestamp  bid  ask  ...
AMR     ts0        ...  ...  ...
        (gap Larger Than Your Page Size => Disk Seeks)
AMR     ts1        ...  ...  ...
        (gap Larger Than Your Page Size => Disk Seeks)
AMR     ts2        ...  ...  ...
...     ts2        ...  ...  ...

OK, but what about ...
➔ PostgreSQL CLUSTER ... USING?

➔ MySQL [InnoDB] clustered indexes?

➔ Oracle index-organized tables?

➔ SQLServer clustered indexes?

(seriously, who uses SQLServer?!)

Meh ...

The on-disk management of that
clustering results in tons of IO …

In the case of PostgreSQL:

➔ clustering is a one-time operation
(implies you must periodically rewrite the entire table)

➔ new data is *not* written in clustered order
(which is often the data you care most about)

OK, so just partition the tables ...

Not a bad idea, except in MySQL there is a limit of
1024 partitions, and generally fewer if using NDB

(you should probably still do it if using MySQL though)

http://dev.mysql.com/doc/refman/5.5/en/partitioning-limitations.html

OK fine, I agree storing data that is queried
together on disk together is a good thing but
what's that have to do with modeling?

[Figure: row RK with ordered columns ts0 ts1 ... tsM ... tsN ...]

Seek To Here (the start of the slice), then Read Precisely My Data *

* more on some caveats later

Well, that's what is meant by “work backwards
from your queries” or “think in terms of queries”

(NB: this concept, in general, applies to RDBMS
at scale as well; it is not specific to Cassandra)

An Example From Fraud Detection
To calculate risk it is common to need to know all the
emails, destinations, origins, devices, locations, phone
numbers, et cetera ever used for the account in question

In a normalized model that usually translates to a
table for each type of entity being tracked

id  name  ...      id    device  ...
1   guy   ...      1000  0xdead  ...
2   gal   ...      2000  0xb33f  ...
... ...   ...      ...   ...     ...

id  dest     ...   id   email  ...   id   origin   ...
15  USA      ...   100  guy@   ...   150  USA      ...
25  Finland  ...   200  gal@   ...   250  Nigeria  ...
... ...      ...   ...  ...    ...   ...  ...      ...

The problem is that at scale that also means
a disk seek for each one …
(even with perfect index-organized tables et al., if the data is spread across multiple tables)

➔ Previous emails? That's a seek …

➔ Previous devices? That's a seek …

➔ Previous destinations? That's a seek ...

But In Cassandra I Store The Data I Query
Together On Disk Together
(remember, column names need not be static)

acctY  ...     ...   ...   ...     ...     ...    ...
acctX  dest21  dev2  dev7  email3  email9  orig4  ...    <= Data I Care About
acctZ  ...     ...   ...   ...     ...     ...    ...

Column Name                       Column Value
email:cassandra@mailinator.com  = dateEmailWasLastUsed
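Putting the entity type at the front of the column name means one account row can be sliced by prefix. The sketch below imitates that with a plain dict and `bisect`; the column names, the dates and the `columns_with_prefix` helper are all hypothetical, and using "\uffff" as an upper bound is just a convenient trick for this example.

```python
from bisect import bisect_left


def columns_with_prefix(row, prefix):
    """Return the (name, value) pairs whose column name starts with `prefix`.

    Cassandra keeps column names sorted already; here we sort on each call
    only because a plain dict is used as a stand-in for the row.
    """
    names = sorted(row)
    lo = bisect_left(names, prefix)
    hi = bisect_left(names, prefix + "\uffff")  # just past the last match
    return [(name, row[name]) for name in names[lo:hi]]


# Hypothetical wide row for one account: column name = "<type>:<thing used>",
# column value = date it was last used (made-up dates).
acct_x = {
    "dest:USA": "2011-12-01",
    "dev:0xdead": "2011-12-15",
    "email:cassandra@mailinator.com": "2012-01-05",
    "email:guy@example.com": "2011-11-20",
    "orig:USA": "2011-12-01",
}

emails = columns_with_prefix(acct_x, "email:")
everything = columns_with_prefix(acct_x, "")
```

Reading every entity type for the account is then a read of one row, which is the point of storing data that is queried together on disk together.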

Don't treat Cassandra (or any DB) as a black box
➔ Understand how your DBs (and data structures) work

➔ Understand the building blocks they provide

➔ Understand the work complexity (“big O”) of queries

➔ For data sets > memory, goal is to minimize seeks *

* on a related note, SSDs are awesome

Availability Has Many Levels
➔ Component Failure (disk)

➔ Machine Failure (NIC, cpu, power supply)

➔ Site Failure (UPS, power grid, tornado)

➔ Political Failure (war, coup)

The Common Theme In The Solutions?

Replication

Replication In Cassandra Follows The
Dynamo Model *
http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

Read It!

Every Node Has A Token
Token range: 0 - 2^127
[Figure: ring of four nodes with tokens t0 < t1 < t2 < t3 < 2^127]

Row Key Determines Node(s)
MD5(RK) => T
[Figure: the same ring; here t3 < T < 2^127, so T wraps around to t0, which is the First Replica]

Walk The Ring To Find Subsequent Replicas *
MD5(RK) => T
[Figure: with t3 < T < 2^127, t0 is the First Replica and the next node on the ring, t1, is the Second Replica]

* by default
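The lookup on the last few slides can be sketched as a hash followed by a binary search over the sorted node tokens. This is a simplified illustration, not Cassandra's partitioner code: reducing the MD5 digest modulo 2^127 is an assumption made to keep the token in range, and `token_for` and `replicas_for` are hypothetical names.

```python
import hashlib
from bisect import bisect_left

RING_SIZE = 2 ** 127


def token_for(row_key):
    """Map a row key to a token in [0, 2^127) via MD5 (simplified)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def replicas_for(token, node_tokens, replication_factor):
    """Return the node tokens holding replicas of `token`.

    The first replica is the first node whose token is >= `token`,
    wrapping around to the lowest token when `token` is past the last node.
    Further replicas are simply the next nodes around the ring.
    """
    ring = sorted(node_tokens)
    first = bisect_left(ring, token) % len(ring)
    return [ring[(first + i) % len(ring)] for i in range(replication_factor)]


# Small made-up tokens standing in for t0 < t1 < t2 < t3.
ring = [10, 20, 30, 40]
wrapped = replicas_for(45, ring, 2)   # past t3, so it wraps to t0 then t1
middle = replicas_for(25, ring, 3)
row_token = token_for("thepaul")
```

The first call mirrors the slide: a token beyond t3 lands on t0 as first replica and t1 as second.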

Writes Happen In Parallel To All Replicas
[Figure: the client sends “RK= ...” to a Coordinator (not a master), which forwards it in parallel to the First Replica and the Second Replica]

Some Or All Replicas Respond
[Figure: replicas reply “ok” to the coordinator; an X marks a message that does not arrive]
Coordinator Waits For Ack(s) From Destination Node(s)

The Coordinator Responds To Client
[Figure: the coordinator returns “ok” to the client even though one message (X) was lost]
Coordinator Waits For Ack(s) From Destination Node(s)

What Nodes Can Be A Coordinator?

The coordinator for any given read or
write is really just whatever node the
client connected to for that request

any node for any request at any time

How Many Replicas Does The
Coordinator Wait For?

➔ configurable, per query

➔ ONE / QUORUM are the most common

(more on this in a moment)
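As a rough sketch of what "how many replicas to wait for" means, the snippet below maps a consistency level to a required number of acks and checks whether a request succeeds. It is a toy model: `required_acks` and `request_succeeds` are hypothetical helpers, and replica failures are just booleans.

```python
def required_acks(consistency_level, replication_factor):
    """Number of replica acks the coordinator waits for."""
    if consistency_level == "ONE":
        return 1
    if consistency_level == "QUORUM":
        return replication_factor // 2 + 1  # a strict majority
    raise ValueError("unsupported consistency level: " + consistency_level)


def request_succeeds(replica_acked, consistency_level):
    """`replica_acked` holds one boolean per replica: True if it acked.

    The request is still sent to every replica; the consistency level
    only decides how many acks are needed before answering the client.
    """
    acks = sum(1 for acked in replica_acked if acked)
    return acks >= required_acks(consistency_level, len(replica_acked))


one_down_quorum = request_succeeds([True, False, True], "QUORUM")
two_down_quorum = request_succeeds([True, False, False], "QUORUM")
two_down_one = request_succeeds([True, False, False], "ONE")
```

With three replicas, QUORUM tolerates one unresponsive replica and ONE tolerates two.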

Writing At CL.One
[Figure: ring with First, Second and Third Replica; the write goes to all three and one message is lost (X)]
Wait For At Least One Node
(eventually all nodes get updates)
[Figure: a replica acks “ok” and the coordinator returns “ok” to the client]


Reading At CL.One
[Figure: the same three replicas; a single replica answers the read with “old”, and the coordinator passes “old” on to the client]
(so you might read stale data)

Writing At CL.Quorum
[Figure: the same three replicas; the write goes to all three and one message is lost (X)]
Wait For Majority Of Nodes
[Figure: two replicas ack “ok” and the coordinator returns “ok” to the client]


Reading At CL.Quorum
[Figure: the coordinator reads from a majority of the three replicas; one message is lost (X)]
(majority => overlap => consistent)
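The "majority => overlap" step can be checked by brute force: any two majorities drawn from the same set of replicas share at least one member, so a quorum read always touches a replica that saw the latest quorum write. The helper names below are made up for this sketch.

```python
from itertools import combinations


def quorum(replica_count):
    """Smallest strict majority of `replica_count` replicas."""
    return replica_count // 2 + 1


def all_pairs_overlap(replica_count, group_size):
    """True if every two groups of `group_size` replicas share a member."""
    groups = [set(c) for c in combinations(range(replica_count), group_size)]
    return all(a & b for a in groups for b in groups)


majorities_overlap = all(
    all_pairs_overlap(n, quorum(n)) for n in range(1, 8)
)
singles_overlap = all_pairs_overlap(3, 1)  # CL.ONE reads and writes
```

The second check fails for the reason the CL.One slides give: a read from one replica and a write acked by one replica may not involve the same node.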


[Figure: the coordinator receives the replica responses, one of them marked “old”]
Coordinator chooses client response based on client-supplied per-column TS


[Figure: the client Already Has Response; the “current” value is sent to the stale replica]

Read Repair Updates Stale Nodes
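A minimal sketch of that resolution step, assuming each replica returns a (value, timestamp) pair for the column: the newest timestamp wins, and any replica that returned something older is a candidate for repair. `resolve_read` and the replica labels are hypothetical.

```python
def resolve_read(responses):
    """Pick the newest response and list the replicas that are behind.

    `responses` maps a replica label to the (value, timestamp) it returned.
    Returns ((value, timestamp), [stale replica labels]).
    """
    newest = max(responses.values(), key=lambda pair: pair[1])
    stale = sorted(
        replica for replica, pair in responses.items() if pair[1] < newest[1]
    )
    return newest, stale


# Made-up responses from two of three replicas (a quorum).
responses = {"first": ("current", 2), "third": ("old", 1)}
winner, needs_repair = resolve_read(responses)
```

The client gets `winner` right away; writing it back to the replicas in `needs_repair` is what the slide calls read repair.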

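On A Side Note, A Lost Response
[Figure: an “ok” between t0 and t3 is lost (X)]

Is The Same As A Lost Request *
[Figure: the request “RK = ...” between t3 and t0 is lost (X)]

Which Is The Same As A Failed/Slow Node *
[Figure: the request “RK = ...” from t3 gets no answer because node t0 is down or slow (X)]

* In Regards To Meeting Consistency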

In fact, it is impossible for the originator
to reliably distinguish among the three cases

One More Important Piece:

writes are idempotent *

* except with the counter API, but if you want that it can be done

Why is that important?
It means we can replay/retry writes, even late
and/or out of order, and get the same results

➔ After/during node failures

➔ After/during network partitions

➔ After/during upgrades

In other words you can concurrently issue
conflicting updates to two different nodes while
those nodes have no communication between them
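One way to see why replays are safe is a last-write-wins merge keyed on the column timestamp. The sketch below is a toy model of that idea, not Cassandra's code; breaking timestamp ties by comparing values is an assumption made here so that the result is deterministic.

```python
from itertools import permutations


def apply_write(row, name, value, timestamp):
    """Keep the write only if it is newer than what the row already holds.

    Ties on timestamp are broken by comparing values (an assumption of this
    sketch), so the outcome never depends on arrival order.
    """
    current = row.get(name)
    if current is None or (timestamp, value) > (current[1], current[0]):
        row[name] = (value, timestamp)


def replay(writes):
    """Apply a sequence of (name, value, timestamp) writes to an empty row."""
    row = {}
    for name, value, timestamp in writes:
        apply_write(row, name, value, timestamp)
    return row


# Three conflicting updates to the same column (made-up data).
writes = [
    ("email", "a@example.com", 1),
    ("email", "b@example.com", 3),
    ("email", "c@example.com", 2),
]

# Try every arrival order, each time also retrying the first write once more.
outcomes = {
    tuple(sorted(replay(order + order[:1]).items()))
    for order in permutations(writes)
}
```

Every ordering, including the duplicated retries, ends in the same single state, which is what lets nodes that could not talk to each other reconcile later.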

Which is important because ...

If you care about global availability you must
serve reads and writes from multiple data centers

There is no way around this

Q?

A Brief Rant On Query Planners, Garbage
Collectors, Virtual Memory, Automatic
Transmissions and Data Structures
