2. Outline
• What is real-time?
• How do databases implement real-time
queries?
• Why is Cassandra ideal for real-time
applications?
• Writing real-time applications with
Cassandra
4. “Of or relating to a system in which input data is processed within
milliseconds” (dictionary.com)
“Occurring immediately” (webopedia)
“...the most important requirement of a real-time system is
predictability and not performance” (wikipedia)
“...a time frame that is very brief, appearing to be immediate.”
(wisegeek.com)
“Often real-time response times are understood to be in the order of
milliseconds and sometimes microseconds” (wikipedia)
6. Real-time queries
• ‘Give me X’
• ‘How many Y?’
• ‘What is the top K?’
• ‘How many distinct Z from P?’
7. Real-time definition
• Definition: a query is processed in real time
if the time to get the answer is at most
a constant times the sum of the transfer time
and the round-trip time
$t_{\text{response}} \le C \, (t_{\text{transfer}} + t_{\text{ping}})$
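The definition can be read as a simple predicate. The sketch below is only an illustration: the function name and the value C = 2 are assumptions, since the slides do not fix the constant.

```python
def is_real_time(t_response, t_transfer, t_ping, c=2.0):
    """Return True if t_response <= C * (t_transfer + t_ping).

    All times are in seconds. The constant c is an assumed value chosen
    for illustration; the slides leave it unspecified.
    """
    return t_response <= c * (t_transfer + t_ping)


# With 1 ms transfer time and 1 ms ping, the bound is 4 ms when C = 2.
print(is_real_time(0.003, 0.001, 0.001))  # within the bound
print(is_real_time(0.005, 0.001, 0.001))  # exceeds the bound
```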
8. Real-time definition
• The more you ask for, the longer it takes
• For small queries, the response time is
dominated by the round-trip time
• No query can be answered in less time than
it takes to receive the answer
9. Real-time definition
• Users on faster networks expect a faster
response
• What we mean by real-time is getting faster
10. Implications
• What does this mean for the database?
• Use Google Analytics example
• Simple query:
‘How many page views have there
been from France in the last 24
hours?’
11. Requirement
• Response is one number
• With overhead, say ~1KB
• Ping time 1ms
• 10 Mbit/s connection => 1 KB in ~1 ms
• 2ms total
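The 2 ms budget follows from the figures above. As a rough check, assuming 1 KB means 1000 bytes and a 10 Mbit/s link carries 10,000,000 bits per second (helper name is illustrative):

```python
def transfer_time_s(size_bytes, bandwidth_bits_per_s):
    """Time to push size_bytes over a link, ignoring protocol overhead."""
    return size_bytes * 8 / bandwidth_bits_per_s


response_bytes = 1000      # ~1 KB response including overhead
bandwidth = 10_000_000     # 10 Mbit/s
ping_s = 0.001             # 1 ms round trip

t_transfer = transfer_time_s(response_bytes, bandwidth)
budget_ms = (t_transfer + ping_s) * 1000

print(f"transfer: {t_transfer * 1000:.1f} ms, total: {budget_ms:.1f} ms")
```

The transfer comes to about 0.8 ms, so the total is a little under the 2 ms quoted on the slide.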
12. Solution 1
• grep '\.fr' /var/log/apache2/*.log
• Suppose you have 1M hits an hour => 7GB of
logs a day
• A single disk would take 70s to scan them
(at ~100 MB/s sequential read)
• Need a beefy server to do this
• Needs to grow as your audience grows
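To see how far the log scan is from the 2 ms budget, here is a back-of-the-envelope calculation. The 100 MB/s disk throughput is an assumption, chosen because it is what the slide's own figures imply (7 GB in 70 s).

```python
def scan_time_s(log_bytes, disk_bytes_per_s):
    """Time to read the whole log sequentially from one disk."""
    return log_bytes / disk_bytes_per_s


log_bytes_per_day = 7 * 10**9    # 7 GB of logs a day
disk_throughput = 100 * 10**6    # assumed: 100 MB/s sequential read
budget_s = 0.002                 # the 2 ms real-time budget

full_scan_s = scan_time_s(log_bytes_per_day, disk_throughput)
slowdown = full_scan_s / budget_s

print(f"full scan: {full_scan_s:.0f} s, about {slowdown:,.0f}x over budget")
```

The scan time also grows linearly with traffic, which is why the hardware has to grow with the audience.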
13. Solution 2
• Maintain a counter for each country
• Increment the counter on each hit
• On query just read the counter
• Even if the counter is on disk, that is one ~5ms seek
• No need to scale speed with traffic
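A minimal in-memory sketch of the counter approach, using Python's `collections.Counter`. The function names are hypothetical, and a real deployment would keep the counters in the database rather than in process memory.

```python
from collections import Counter

# One counter per country code.
page_views = Counter()


def record_hit(country):
    """Ingest path: one increment per page view."""
    page_views[country] += 1


def views_from(country):
    """Query path: a single lookup, independent of traffic volume."""
    return page_views[country]


for country in ["fr", "fr", "de", "fr"]:
    record_hit(country)

print(views_from("fr"))  # 3
```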
14. Implications
• Real-time queries can only read about as
much data as they send to the requester
• Need to precompute answers
• Store data in a query-centric rather than
data-centric view
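One way to make the "last 24 hours" query fit this model is to store counts per (country, hour) bucket, so the read touches a fixed number of counters. This is a sketch under assumptions: the bucket layout and names are illustrative, and the window is approximated at hour granularity (the current partial hour plus the previous 23).

```python
from collections import defaultdict

HOUR = 3600

# Query-centric layout: (country, hour index) -> page views in that hour.
buckets = defaultdict(int)


def record_hit(country, ts):
    """Ingest: increment the bucket for the hour containing timestamp ts."""
    buckets[(country, int(ts // HOUR))] += 1


def views_last_24h(country, now):
    """Query: sum 24 hourly buckets, however many hits were recorded."""
    current = int(now // HOUR)
    hours = range(current - 23, current + 1)
    return sum(buckets.get((country, h), 0) for h in hours)


now = 30 * HOUR + 10
record_hit("fr", now)
record_hit("fr", now - 20 * HOUR)
record_hit("fr", now - 30 * HOUR)   # falls outside the window
record_hit("de", now)

print(views_last_24h("fr", now))  # 2
```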
15. Age of data
• A real-time query will often need to query
new data
• But not necessarily
• Could run a batch process to precompute
answers
34. What else do we need?
Real-time analytics
High value in getting a quick response
High cost if service is down
Need high availability
35. What else do we need?
Real-time analytics
High value in getting a quick response
Need low latency
Need data geographically close
36. Cassandra and HA
• No SPOF
• Choose a point on the consistency-availability
curve
• Tuneable consistency
• Replication
• Multi data-centre support
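The consistency/availability trade-off can be illustrated with replica-count arithmetic. With replication factor RF, a read that contacts R replicas is guaranteed to overlap a write acknowledged by W replicas when R + W > RF. The helper names below are illustrative and not part of any Cassandra client API.

```python
def quorum(rf):
    """Replicas needed for a majority of rf copies."""
    return rf // 2 + 1


def reads_see_latest_write(r, w, rf):
    """True when every read set overlaps every write set."""
    return r + w > rf


def tolerated_failures(required, rf):
    """Replicas that can be down while an operation still succeeds."""
    return rf - required


rf = 3
q = quorum(rf)

# QUORUM reads and writes: overlapping, survives one replica down.
print(q, reads_see_latest_write(q, q, rf), tolerated_failures(q, rf))
# ONE for reads and writes: more available and faster, but reads may be stale.
print(1, reads_see_latest_write(1, 1, rf), tolerated_failures(1, rf))
```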
37. Cassandra and low latency
• Can configure caches
• Can parallelise reads
• Multi-DC support enables world-wide
replication
• Can choose lower consistency to avoid
round-trips to other DCs
39. Real-time apps
• Need to write code using a client library
• Design data-model
• If queries change, code changes
40. Acunu Analytics
• Provides simple RESTful interface to
Cassandra counters
• Push processing into ingest phase
[Diagram: event → AA → counter updates → Cassandra]
41. Acunu Analytics
• Event template, e.g.,
select : ["COUNT", "AVG(loadTime)"],
type : {
time : [TIME(HOUR; MIN; SEC), ?, 0],
page : PATH(/),
loadTime : [LONG, 0, 0]
}
• Specifies “blow-up” strategy according to
supported queries
• Need to know the basic shape of the queries
in advance, but not every detail
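To make the "blow-up" idea concrete, the sketch below expands one event into a counter key for every time granularity and every path prefix, then updates a count and a loadTime sum per key. It is a hypothetical illustration of the general technique, not Acunu Analytics' actual implementation; all names, the key format, and the in-memory store are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Assumed key formats for the HOUR, MIN and SEC time buckets.
TIME_FORMATS = {
    "HOUR": "%Y-%m-%dT%H",
    "MIN": "%Y-%m-%dT%H:%M",
    "SEC": "%Y-%m-%dT%H:%M:%S",
}


def path_prefixes(path):
    """'/blog/post' -> ['/', '/blog', '/blog/post']."""
    parts = [p for p in path.split("/") if p]
    return ["/"] + ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def blow_up(event):
    """Every (time bucket, path prefix) key this event contributes to."""
    t = datetime.fromtimestamp(event["time"], tz=timezone.utc)
    return [
        (t.strftime(fmt), prefix)
        for fmt in TIME_FORMATS.values()
        for prefix in path_prefixes(event["page"])
    ]


# Stand-in for the counter store: key -> running count and loadTime sum.
counters = defaultdict(lambda: {"count": 0, "sum_load_time": 0})


def ingest(event):
    """Do the work at ingest: one increment per blown-up key."""
    for key in blow_up(event):
        counters[key]["count"] += 1
        counters[key]["sum_load_time"] += event["loadTime"]


ingest({"time": 0, "page": "/blog/post", "loadTime": 120})
ingest({"time": 30, "page": "/blog/other", "loadTime": 80})

hour_blog = counters[("1970-01-01T00", "/blog")]
print(hour_blog["count"], hour_blog["sum_load_time"] / hour_blog["count"])
```

A query such as COUNT or AVG(loadTime) for `/blog` in a given hour is then a single key lookup; the cost is several writes per event.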
42. Features
• Simple, real-time, incremental analytics
• work done on ingest
• sum, count, distinct, avg, stddev, min, max, etc.
• time + hierarchy bucketing
• efficient ‘group’ semantics
• works with Apache Cassandra
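Aggregates such as avg and stddev can be maintained incrementally because they derive from additive quantities (count, sum, sum of squares). The class below is a minimal sketch of that idea, not the product's implementation; it computes the population standard deviation, and the sum-of-squares form can lose precision for large values.

```python
import math


class RunningStats:
    """Incremental count, sum, avg, stddev, min and max."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.total_sq = 0.0
        self.minimum = math.inf
        self.maximum = -math.inf

    def add(self, x):
        """Update every aggregate in O(1) as a value arrives."""
        self.count += 1
        self.total += x
        self.total_sq += x * x
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    @property
    def avg(self):
        return self.total / self.count

    @property
    def stddev(self):
        # Population variance: E[x^2] - (E[x])^2, clamped against rounding.
        variance = self.total_sq / self.count - self.avg ** 2
        return math.sqrt(max(variance, 0.0))


stats = RunningStats()
for load_time in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(load_time)

print(stats.count, stats.avg, stats.stddev, stats.minimum, stats.maximum)
```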
43. Summary
• Formalised what real-time means
• Deduced how data must be stored
• Explored how Cassandra has these
properties
• Discussed how Acunu Analytics helps when
writing real-time apps