3. Why Cassandra?
●
A mostly greenfield project: add market
segmentation (aka “Profiles”) to our existing
geolocation / messaging infrastructure
●
Horizontal scalability
●
Homogeneous deployment, less ops pain
●
No pre-existing investment in Hadoop
●
Data model matches our problem
4. Devices
●
Android and iOS mobile devices
●
Unique ID
●
Other parts of the codebase handle
geolocation. Here we're concerned primarily
with device as an ID
●
~Millions of devices
5. Attributes
●
Arbitrary key-value pairs associated with devices
●
Defined by marketers and app developers
●
String, boolean, integer, date
●
Encrypted due to PII concerns
●
e.g. birthdate: 1989-01-01, ownsPs3: true
●
~100 attributes
6. Profiles
●
Market segmentation on attributes of devices
●
Boolean expressions comparing an attribute to a
fixed value
●
Combined via Boolean 'and', aka set
intersection. No 'or'
●
e.g. wantsPs4: birthdate >= 1978-01-01 &&
ownsPs3 == true && ownsPs4 == false
●
May be defined long after attributes are defined
●
~100 profiles
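The profile rules above can be sketched in plain Scala. This is a minimal illustration, not the production code: the `Value`, `Clause`, and `Profile` types and the `matches` function are hypothetical names, and treating a missing attribute as a non-match is an assumption the slides do not confirm.

```scala
// Hypothetical model of attributes and profiles; all names are illustrative.
object ProfileSketch {
  sealed trait Value
  case class Str(v: String) extends Value
  case class Bool(v: Boolean) extends Value
  case class Num(v: Long) extends Value
  case class DateV(iso: String) extends Value // yyyy-MM-dd, so string order equals date order

  sealed trait Op
  case object Eq extends Op
  case object Ge extends Op
  case object Le extends Op

  // One comparison of a named attribute against a fixed value.
  case class Clause(attr: String, op: Op, fixed: Value)

  // A profile is the conjunction ('and') of its clauses; there is no 'or'.
  case class Profile(name: String, clauses: List[Clause])

  // Compare two values of the same type; None on a type mismatch.
  private def compare(a: Value, b: Value): Option[Int] = (a, b) match {
    case (Str(x), Str(y))     => Some(x.compareTo(y))
    case (Bool(x), Bool(y))   => Some(java.lang.Boolean.compare(x, y))
    case (Num(x), Num(y))     => Some(java.lang.Long.compare(x, y))
    case (DateV(x), DateV(y)) => Some(x.compareTo(y))
    case _                    => None
  }

  // Assumption: a missing or mistyped attribute makes the clause fail.
  def matches(p: Profile, attrs: Map[String, Value]): Boolean =
    p.clauses.forall { c =>
      attrs.get(c.attr).flatMap(compare(_, c.fixed)).exists { cmp =>
        c.op match {
          case Eq => cmp == 0
          case Ge => cmp >= 0
          case Le => cmp <= 0
        }
      }
    }

  // The wantsPs4 example from the slide.
  val wantsPs4 = Profile("wantsPs4", List(
    Clause("birthdate", Ge, DateV("1978-01-01")),
    Clause("ownsPs3", Eq, Bool(true)),
    Clause("ownsPs4", Eq, Bool(false))))

  val device = Map[String, Value](
    "birthdate" -> DateV("1989-01-01"),
    "ownsPs3" -> Bool(true),
    "ownsPs4" -> Bool(false))
}
```

With ~100 profiles, evaluating every profile against one device's attributes this way is a small in-memory loop.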
7. Data Modeling
●
For nonrelational data stores, you need to know
what your queries are before you store data
●
Probably true of relational databases as well,
but they let you get away with it
●
Answering queries via primary key is ideal
●
Cassandra has 2 parts to a primary key lookup:
partitioning (by hash), then clustering (by order)
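A rough in-memory analogy for the two-part lookup, in plain Scala. Everything here is a stand-in: `nodeFor`, `slice`, and the modulo placement are hypothetical simplifications (real Cassandra uses a partitioner and a token ring), and the sorted map only mimics how rows inside one partition are ordered by clustering key.

```scala
import scala.collection.immutable.TreeMap

object PrimaryKeySketch {
  // Rows of one partition, kept sorted by clustering key.
  type Partition = TreeMap[Long, String] // clustering key -> row value

  val nodeCount = 3

  // Part 1: hash the partition key to decide where the partition lives.
  // Modulo placement is a simplification for illustration only.
  def nodeFor(brandCode: String, deviceId: String): Int =
    java.lang.Math.floorMod((brandCode, deviceId).hashCode, nodeCount)

  // Part 2: inside the partition, rows are ordered, so a range of
  // clustering keys [from, until) is one contiguous slice.
  def slice(p: Partition, from: Long, until: Long): List[(Long, String)] =
    p.range(from, until).toList
}
```

The point of the analogy: equality on the partition key finds one partition, and only then do ordered comparisons on the clustering key make sense.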
8. Use Case 1: Triggered Messaging
●
When a device breaches a geofence, check to
see if it is in a profile, then send a promotion
●
e.g. device is near a store, and is in the
wantsPs4 profile, tell it there are Ps4s in stock
●
Latency is important
●
Query: Given a device, which profiles is it in?
9. Use Case 2: Scheduled Messaging
●
At some date and time, find all the devices in a
given profile, and send them a promotion
●
e.g. send all devices in the wantsPs4 profile a
message telling them Ps4 is out of stock for
months, but Xbox One is on sale cheap
●
Throughput is more important than latency
●
Query: Given a profile, which devices are in it?
10. Use Case 3: Historical Analytics
●
Marketers may want to analyze past data
based on attributes that were known at that
time, but not included in profiles at that time
●
In other words, we need to know raw facts
(attributes), not just derived conclusions (profile
membership)
●
Query: Given a device and time, what were the
attributes for that device at that time?
11. Brainstorming
●
Need to answer 3 questions:
●
given Device, get Profiles
●
given Profile, get Devices
●
given (Device, Time), get Attributes
12. given (Device, Time), get Attributes
create table attributes (
  brandCode ascii,
  deviceId ascii,
  unixtime bigint,
  attrs blob,
  primary key ((brandCode, deviceId), unixtime)
) with compact storage
  and clustering order by (unixtime desc);

select attrs from attributes
  where brandCode = ? and deviceId = ? and unixtime <= ?
  limit 1;
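To see why the descending clustering order plus "unixtime <= ? limit 1" gives the attributes as of a point in time, here is a minimal in-memory sketch in Scala. The `Row` type and both functions are hypothetical, and `attrs` is shown as a String for readability where the real column is a blob.

```scala
object AsOfLookupSketch {
  // One row within a single (brandCode, deviceId) partition.
  case class Row(unixtime: Long, attrs: String)

  // Mimics "clustering order by (unixtime desc)": newest row first.
  def newestFirst(rows: Seq[Row]): Seq[Row] =
    rows.sortBy(_.unixtime)(Ordering[Long].reverse)

  // Mimics "where unixtime <= ? limit 1" on a newest-first partition:
  // the first row at or before asOf is the latest one known at that time.
  def attrsAsOf(partition: Seq[Row], asOf: Long): Option[String] =
    partition.find(_.unixtime <= asOf).map(_.attrs)
}
```

A query time earlier than the first recorded row yields no result, which corresponds to the select returning zero rows.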
13. given Device, get Profiles
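select attrs from attributes
  where brandCode = ? and deviceId = ?
  limit 1;

Because rows cluster by unixtime descending, limit 1 returns the
device's most recent attrs. Then, in code, filter the (relatively small)
set of profiles based on whether attrs match each one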
14. given Profile, get Devices
create table profile_devices (
  brandCode ascii,
  profileId bigint,
  deviceId ascii,
  primary key ((brandCode, profileId), deviceId)
) with compact storage;

select deviceId from profile_devices
  where brandCode = ? and profileId = ?;
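The slides do not say how profile_devices is populated. As one possible sketch, assuming each device's matching profile IDs have already been computed, the rows are simply the inversion of that device-to-profiles mapping. `Row`, `rowsFor`, and `devicesIn` are hypothetical names, and the in-memory `Seq` stands in for the table.

```scala
object ProfileDevicesSketch {
  // Mirrors the profile_devices primary key columns.
  case class Row(brandCode: String, profileId: Long, deviceId: String)

  // Assumed input: for each deviceId, the profile IDs it currently matches.
  // Output: one row per (profile, device) pair, ordered like the table's
  // clustering would order devices within a profile.
  def rowsFor(brandCode: String, matches: Map[String, Set[Long]]): Seq[Row] =
    (for {
      (deviceId, profileIds) <- matches.toSeq
      profileId <- profileIds.toSeq
    } yield Row(brandCode, profileId, deviceId))
      .sortBy(r => (r.profileId, r.deviceId))

  // In-memory equivalent of:
  //   select deviceId from profile_devices where brandCode = ? and profileId = ?
  def devicesIn(rows: Seq[Row], brandCode: String, profileId: Long): Seq[String] =
    rows.collect { case Row(`brandCode`, `profileId`, d) => d }
}
```

Keying the partition on (brandCode, profileId) means one scheduled send reads one partition sequentially, which suits the throughput-oriented use case.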
15. Why Spark?
●
Scala
●
Distributed computing that will interop with
Hadoop IO (and thus Cassandra), but doesn't
depend on HDFS
●
Approachable codebase (20kloc, vs 200kloc+)
●
Interactive shell
●
Fast to write, fast to run
18. Spark / Cassandra Interop
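// from CassandraTest.scala in the Spark distro
// (sc is the SparkContext and job is the Hadoop Job set up earlier in that example)
import java.nio.ByteBuffer
import java.util.SortedMap
import org.apache.cassandra.db.IColumn
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat
import org.apache.cassandra.utils.ByteBufferUtil
import org.apache.spark.SparkContext._

// Read the column family as an RDD of (row key, columns) pairs
val casRdd = sc.newAPIHadoopRDD(job.getConfiguration(),
  classOf[ColumnFamilyInputFormat], classOf[ByteBuffer],
  classOf[SortedMap[ByteBuffer, IColumn]])

// Let us first get all the paragraphs from the retrieved rows
val paraRdd = casRdd.map {
  case (key, value) =>
    ByteBufferUtil.string(value.get(ByteBufferUtil.bytes("para")).value())
}

// Let's get the word count in paras
val counts = paraRdd.flatMap(p => p.split(" ")).
  map(word => (word, 1)).
  reduceByKey(_ + _)

// Bring the counts back to the driver and print them
counts.collect().foreach {
  case (word, count) => println(word + ":" + count)
}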