Cassandra at Digby

Cody Koeninger
ckoeninger@digby.com

Localpoint Architecture
[Architecture diagram; labels recovered from the slide]
● Localpoint In-App SDK: Location Algorithm – Opt-in – Push – Rich Message – Message Management
● Localpoint Cloud
● Messaging: Campaign Management (Push, Triggered) – Mobile Offer Management – Campaign Reporting
● Location: Accuracy, Power, Privacy Optimization – Geofence Management – Cross-OS, Cross-Device
● Analytics / Events Engine: Visits – Dwell Time – Frequency – Occupancy
● Profiles: Identity • Attributes • Location History • Campaign History
● APIs: Campaign API, Location API, Real-Time API, CRM API, Analytics Engine API
● Other labels: Create/Manage (twice), Publish/Subscribe, Transaction Record Export
● Web Console
© 2013 Digby. CONFIDENTIAL

Why Cassandra?
● Somewhat of a greenfield project: add market segmentation (aka “Profiles”) to our existing geolocation / messaging infrastructure
● Horizontal scalability
● Homogeneous deployment, less ops pain
● No pre-existing investment in Hadoop
● Data model matches our problem

Devices
● Android and iOS mobile devices
● Unique ID
● Other parts of the codebase handle geolocation. Here we're concerned primarily with device as an ID
● ~Millions of devices

Attributes
● Arbitrary key-value pairs associated with devices
● Defined by marketers and app developers
● String, boolean, integer, date
● Encrypted due to PII concerns
● e.g. birthdate: 1989-01-01, ownsPs3: true
● ~100 attributes

Profiles
● Market segmentation on attributes of devices
● Boolean expressions comparing an attribute to a fixed value
● Combined via Boolean 'and', aka set intersection. No 'or'
● e.g. wantsPs4: birthdate >= 1978-01-01 && ownsPs3 == true && ownsPs4 == false
● May be defined long after attributes are defined
● ~100 profiles
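
The profile definition above can be sketched in Scala as a list of comparisons that must all hold. This is only an illustration, not Digby's code: every type and name below (AttrValue, Condition, Profile, matches, sampleAttrs) is hypothetical. Treating a missing attribute as "does not match" is an assumption, and so is the ownsPs4 value in the sample data.

```scala
import java.time.LocalDate

object ProfileSketch {
  // Attribute values limited to the four types named on the Attributes slide.
  sealed trait AttrValue
  final case class StrV(value: String) extends AttrValue
  final case class BoolV(value: Boolean) extends AttrValue
  final case class IntV(value: Long) extends AttrValue
  final case class DateV(value: LocalDate) extends AttrValue

  sealed trait Op
  case object Eq extends Op
  case object Ge extends Op
  case object Le extends Op

  // One condition compares a named attribute against a fixed value.
  final case class Condition(attr: String, op: Op, fixed: AttrValue)

  // A profile is the conjunction ("and") of its conditions; there is no "or".
  final case class Profile(name: String, conditions: List[Condition])

  // Compare two values of the same type; mismatched types are not comparable.
  private def compare(a: AttrValue, b: AttrValue): Option[Int] = (a, b) match {
    case (StrV(x), StrV(y))   => Some(x.compareTo(y))
    case (BoolV(x), BoolV(y)) => Some(java.lang.Boolean.compare(x, y))
    case (IntV(x), IntV(y))   => Some(java.lang.Long.compare(x, y))
    case (DateV(x), DateV(y)) => Some(x.compareTo(y))
    case _                    => None
  }

  // Assumption: a missing attribute or a type mismatch makes the condition false.
  def holds(c: Condition, attrs: Map[String, AttrValue]): Boolean =
    attrs.get(c.attr).flatMap(compare(_, c.fixed)).exists { cmp =>
      c.op match {
        case Eq => cmp == 0
        case Ge => cmp >= 0
        case Le => cmp <= 0
      }
    }

  // A device is in the profile only if every condition holds.
  def matches(p: Profile, attrs: Map[String, AttrValue]): Boolean =
    p.conditions.forall(holds(_, attrs))

  // The wantsPs4 example from the slide.
  val wantsPs4: Profile = Profile("wantsPs4", List(
    Condition("birthdate", Ge, DateV(LocalDate.parse("1978-01-01"))),
    Condition("ownsPs3", Eq, BoolV(true)),
    Condition("ownsPs4", Eq, BoolV(false))
  ))

  // Sample device attributes; the ownsPs4 entry is an assumed value.
  val sampleAttrs: Map[String, AttrValue] = Map(
    "birthdate" -> DateV(LocalDate.parse("1989-01-01")),
    "ownsPs3"   -> BoolV(true),
    "ownsPs4"   -> BoolV(false)
  )
}
```

Because conditions are only ever and-ed together, evaluating a profile is a single forall over its conditions, and checking ~100 profiles against one device's attributes stays cheap.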

Data Modeling
● For nonrelational data stores, you need to know what your queries are before you store data
● Probably true of relational databases as well, but they let you get away with it
● Answering queries via primary key is ideal
● Cassandra has 2 parts to a primary key lookup: partitioning (by hash), then clustering (by order)
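
As a rough mental model of the two-part lookup, a table can be pictured as a hash map from partition key to rows kept sorted by clustering key. The sketch below is an in-memory analogy only, not how Cassandra stores data; Table, insert, slice and the demo rows are all hypothetical names.

```scala
object PrimaryKeySketch {
  // Rows are grouped by partition key; within a partition they are kept
  // sorted by clustering key.
  final case class Table[P, C, V](partitions: Map[P, Vector[(C, V)]]) {

    // Writing an existing (partition, clustering) key overwrites the old row.
    def insert(p: P, c: C, v: V)(implicit ord: Ordering[C]): Table[P, C, V] = {
      val rows = partitions.getOrElse(p, Vector.empty).filterNot(_._1 == c) :+ (c -> v)
      Table(partitions.updated(p, rows.sortBy(_._1)))
    }

    // Step 1: find the partition by key (hash lookup).
    // Step 2: take an ordered slice of clustering keys inside that partition.
    def slice(p: P, from: C, to: C)(implicit ord: Ordering[C]): Vector[(C, V)] =
      partitions.getOrElse(p, Vector.empty)
        .dropWhile { case (c, _) => ord.lt(c, from) }
        .takeWhile { case (c, _) => ord.lteq(c, to) }
  }

  // Hypothetical demo data: two partitions, inserted out of order.
  val demo: Table[String, Int, String] =
    Table[String, Int, String](Map.empty)
      .insert("dev-1", 3, "c").insert("dev-1", 1, "a").insert("dev-1", 2, "b")
      .insert("dev-2", 5, "z")
}
```

A query that names the full partition key and a range of clustering keys touches one partition and reads a contiguous run of rows, which is why the tables that follow are shaped around their queries.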

Use Case 1: Triggered Messaging
● When a device breaches a geofence, check to see if it is in a profile, then send a promotion
● e.g. device is near a store, and is in the wantsPs4 profile, tell it there are PS4s in stock
● Latency is important
● Query: Given a device, which profiles is it in?

Use Case 2: Scheduled Messaging
● At some date and time, find all the devices in a given profile, and send them a promotion
● e.g. send all devices in the wantsPs4 profile a message telling them the PS4 is out of stock for months, but the Xbox One is on sale cheap
● Throughput is more important than latency
● Query: Given a profile, which devices are in it?

Use Case 3: Historical Analytics
● Marketers may want to analyze past data based on attributes that were known at that time, but not included in profiles at that time
● In other words, we need to know raw facts (attributes), not just derived conclusions (profile membership)
● Query: Given a device and time, what were the attributes for that device at that time?

Brainstorming
● Need to answer 3 questions:
  ● given Device, get Profiles
  ● given Profile, get Devices
  ● given (Device, Time), get Attributes

given (Device, Time), get Attributes
create table attributes (
  brandCode ascii,
  deviceId ascii,
  unixtime bigint,
  attrs blob,
  primary key ((brandCode, deviceId), unixtime)
) with compact storage
  and clustering order by (unixtime desc);

select attrs from attributes
  where brandCode = ? and deviceId = ? and unixtime <= ?
  limit 1;
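
To see why the descending clustering order makes "unixtime <= ? limit 1" return the attributes as of a given time, here is a small in-memory analogy in Scala. It is a sketch under assumptions, not the real storage engine: History, Version, write and asOf are hypothetical names, the attrs string stands in for the encrypted blob, and the "acme" brand and device data are made up.

```scala
object AttributeHistorySketch {
  type PartitionKey = (String, String) // (brandCode, deviceId)

  // attrs is a plain string here, standing in for the encrypted blob.
  final case class Version(unixtime: Long, attrs: String)

  // Each partition holds versions sorted by unixtime descending,
  // mirroring "clustering order by (unixtime desc)".
  final case class History(partitions: Map[PartitionKey, List[Version]]) {

    // A write with an existing unixtime replaces that version.
    def write(brand: String, device: String, v: Version): History = {
      val key = (brand, device)
      val others = partitions.getOrElse(key, Nil).filterNot(_.unixtime == v.unixtime)
      val sorted = (v :: others).sortBy(_.unixtime)(Ordering[Long].reverse)
      History(partitions.updated(key, sorted))
    }

    // Analogue of: where brandCode = ? and deviceId = ? and unixtime <= ? limit 1
    // The first version at or before the requested time is the newest such version.
    def asOf(brand: String, device: String, unixtime: Long): Option[Version] =
      partitions.getOrElse((brand, device), Nil).find(_.unixtime <= unixtime)
  }

  // Hypothetical demo data: one device whose attributes changed at t=200.
  val demo: History =
    History(Map.empty)
      .write("acme", "dev-1", Version(100L, "ownsPs3=false"))
      .write("acme", "dev-1", Version(200L, "ownsPs3=true"))
}
```

With this layout, asOf("acme", "dev-1", 150L) yields the version written at 100, and a time before the first write yields nothing.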

given Device, get Profiles
select attrs from attributes
  where brandCode = ? and deviceId = ?
  limit 1;

Then, in code, filter the (relatively small) set of profiles, keeping those that attrs matches

given Profile, get Devices
create table profile_devices (
  brandCode ascii,
  profileId bigint,
  deviceId ascii,
  primary key ((brandCode, profileId), deviceId)
) with compact storage;

select deviceId from profile_devices
  where brandCode = ? and profileId = ?;

Why Spark?
● Scala
● Distributed computing that will interop with Hadoop IO (and thus Cassandra), but doesn't depend on HDFS
● Approachable codebase (20kloc, vs 200kloc+)
● Interactive shell
● Fast to write, fast to run

Why Spark?
val file = spark.textFile("hdfs://...")
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

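For readers without a cluster at hand, the same word count can be mimicked on ordinary Scala collections. This is only a local analogy to show what flatMap, map and reduceByKey compute: the reduceByKey helper below is a hypothetical stand-in written for this sketch, not Spark's implementation, and nothing here is distributed.

```scala
object WordCountSketch {
  // Local stand-in for reduceByKey: group pairs by key, then combine
  // each group's values with f.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  // Same shape as the Spark snippet: split lines into words,
  // pair each word with 1, then sum the 1s per word.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    reduceByKey(
      lines.flatMap(line => line.split(" ").toSeq)
           .map(word => (word, 1))
    )(_ + _)
}
```

For example, WordCountSketch.wordCount(Seq("a b a", "b c")) counts "a" twice, "b" twice and "c" once.
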
Deployment
● http://spark.incubator.apache.org/docs/0.8.1/spark-standalone.html
● Spark worker processes on Cassandra storage nodes
● Gives data locality
● Spark master process on Cassandra monitoring machine
● Cluster start/stop done via ssh key from master
● Submit jobs to master url
● Consider pre-installing dependency jars on workers
● Must use exact same binary version of Scala throughout

Spark / Cassandra Interop
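// from CassandraTest.scala in the Spark distro
// sc (a SparkContext) and job (a Hadoop Job configured for Cassandra input)
// must already be defined
import java.nio.ByteBuffer
import java.util.SortedMap
import org.apache.cassandra.db.IColumn
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat
import org.apache.cassandra.utils.ByteBufferUtil
import org.apache.spark.SparkContext._

val casRdd = sc.newAPIHadoopRDD(job.getConfiguration(),
  classOf[ColumnFamilyInputFormat], classOf[ByteBuffer],
  classOf[SortedMap[ByteBuffer, IColumn]])
// Let us first get all the paragraphs from the retrieved rows
val paraRdd = casRdd.map {
  case (key, value) => {
    ByteBufferUtil.string(value.get(ByteBufferUtil.bytes("para")).value())
  }
}
// Let's get the word count in paras
val counts = paraRdd.flatMap(p => p.split(" ")).
  map(word => (word, 1)).
  reduceByKey(_ + _)
counts.collect().foreach {
  case (word, count) => println(word + ":" + count)
}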

Spark Resources
● Project homepage: http://spark.incubator.apache.org/
● AMP Camp tutorials: http://ampcamp.berkeley.edu/
● Introduction to Spark internals: http://www.youtube.com/watch?v=49Hr5xZyTEA
