Building analytics systems is an increasingly common requirement for BI teams at companies large and small, and the job gets harder when results have to be produced in real time. In this presentation, the team from MarkedUp Analytics shows techniques for using Cassandra, Hadoop, and Hive to build a manageable, scalable analytics system that can handle a wide range of business cases and needs.
2. Real Time Analytics with Cassandra, Hive, and Solr
Aaron Stannard, Founder & CEO of MarkedUp
3. Powerful analytics tools for native apps
Understand your audience. Gain valuable data on your users.
Monitor your app’s health. Log errors and crashes remotely.
Drive more sales. Better data = more revenue.
13. Analytics Schema Strategy
• All row keys should be predictable (not always possible)
• Utilize physical sortability of columns
• Use predictably sortable data types for column names (integers, dates)
• Learn to love composite keys
• Batch mutations are your friend
• Use distributed counters for real-time metrics
• Use TTL for automatic data expiration (if necessary)
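
The key-design guidelines above can be illustrated without a cluster. What follows is a minimal sketch in plain Python, not MarkedUp's actual schema: the helper names `row_key` and `column_name`, the `CounterStore` class, the `app:metric:YYYY-MM` key layout, and the in-memory dictionary standing in for a counter column family are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone


def row_key(app_id: str, metric: str, ts: datetime) -> str:
    """Predictable composite row key: a reader can rebuild it from
    (app, metric, month) without a lookup. The layout is hypothetical."""
    return f"{app_id}:{metric}:{ts:%Y-%m}"


def column_name(ts: datetime) -> str:
    """Hour bucket as a fixed-width string, so lexical order equals time order."""
    return f"{ts:%Y%m%d%H}"


class CounterStore:
    """In-memory stand-in for a counter column family (illustration only)."""

    def __init__(self) -> None:
        self.rows = defaultdict(lambda: defaultdict(int))

    def increment(self, key: str, column: str, by: int = 1) -> None:
        self.rows[key][column] += by

    def slice(self, key: str, start: str, end: str) -> list:
        """Range scan over sorted column names, inclusive on both ends."""
        cols = self.rows.get(key, {})
        return [(c, cols[c]) for c in sorted(cols) if start <= c <= end]


store = CounterStore()
events = [
    datetime(2013, 6, 11, 9, 15, tzinfo=timezone.utc),
    datetime(2013, 6, 11, 9, 40, tzinfo=timezone.utc),
    datetime(2013, 6, 11, 14, 5, tzinfo=timezone.utc),
]
for ts in events:
    store.increment(row_key("app42", "sessions", ts), column_name(ts))

# Because column names sort by time, "sessions before noon" is one slice.
morning = store.slice("app42:sessions:2013-06", "2013061100", "2013061112")
print(morning)
```

The point of the sketch is that a query never has to search for a row: the key is computed from what the caller already knows, and a time range becomes a contiguous slice of sorted columns.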
19. When is Hadoop necessary?
• Large volumes of data (100GB+)
• Queries require retrospective / historical analysis
• Need consistent results
• Need to perform multi-stage analysis
• Speed isn’t a concern (Hadoop is sloooooooooow)
20. Hadoop on easy mode: Hive
• SQL abstraction on top of Hadoop (more familiar)
• Easier to deploy and test
• Simplifies data warehousing
• Easy to automatically import data from Cassandra
• DSE eliminates need for HDFS
22. Hive Syntax
Query: count the number of items where “key” is greater than 100
RDBMS> select key, count(1) from kv1
where key > 100 group by key;
Hive> select key, count(1) from kv1
where key > 100 group by key;
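
To see what the query on this slide returns, here is a small sketch that runs the RDBMS side using Python's built-in `sqlite3` module. The table schema (`key INTEGER, value TEXT`) and the sample rows are assumptions, and an `order by key` clause is added only to make the output order deterministic. It does not run Hive; the slide's claim is that Hive accepts the same statement.

```python
import sqlite3

# In-memory database standing in for the "RDBMS>" prompt on the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv1 (key INTEGER, value TEXT)")  # assumed schema
conn.executemany(
    "INSERT INTO kv1 (key, value) VALUES (?, ?)",
    [(86, "val_86"), (150, "val_150"), (150, "val_150"), (311, "val_311")],
)

# Same query as the slide, plus ORDER BY for a stable result order.
query = """
    select key, count(1) from kv1
    where key > 100 group by key
    order by key
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

Each result row is a `(key, count)` pair; the row with key 86 is filtered out by the `where` clause before grouping.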
23. Hive Tips and Tricks
• Don’t write data from Hive back to a hot Cassandra column family
• If writing data from Hive to Cassandra, use dedicated column families
• You can write to multiple places on a single Hive read (table, CSV file, etc…)
• Use sampling to test Hive queries on scaled-down data sets
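
The sampling tip can be pictured as bucket sampling: hash a column, and keep only the rows that land in one bucket out of N. Hive exposes this idea through its `TABLESAMPLE(BUCKET x OUT OF y ON ...)` clause. The sketch below is a plain-Python illustration of the concept only; the function name `in_sample`, the CRC32 hash, and the generated row keys are assumptions and do not reproduce Hive's own hashing.

```python
import zlib


def in_sample(key: str, bucket: int, out_of: int) -> bool:
    """True if `key` hashes into the 1-based `bucket` of `out_of` buckets.

    CRC32 is used because it is deterministic across runs; Hive uses its
    own hash function, so the selected rows would differ.
    """
    return zlib.crc32(key.encode("utf-8")) % out_of == bucket - 1


# Hypothetical data set: 10,000 row keys.
rows = [f"user-{i}" for i in range(10_000)]

# Keep bucket 1 of 32, roughly 3% of the rows, for a quick test run.
sample = [r for r in rows if in_sample(r, bucket=1, out_of=32)]
print(len(sample), "of", len(rows), "rows sampled")
```

Because the bucket is derived from the key, the same rows are selected on every run, which makes a test query's results comparable between iterations.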
24. How do you count millions of distinct items in real time?
25. Solr
• Lucene-based indexing engine
• Part of Apache Foundation
• Full-text search
• Faceted search
• Distributed
• Integrates well with Cassandra
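
One way faceted search can answer the distinct-count question is to facet on a field and count how many values come back. The sketch below is an assumption about how that might look, not the approach confirmed by the slides: the base URL, the core name `events`, the field name `user_id`, and the sample response are hypothetical. The query parameters (`facet`, `facet.field`, `facet.limit`, `facet.mincount`, `rows`, `wt`) are standard Solr request parameters, and no request is actually sent.

```python
from urllib.parse import urlencode


def facet_url(base_url: str, field: str) -> str:
    """Build a Solr select URL that facets on `field` and returns no documents."""
    params = {
        "q": "*:*",            # match everything
        "rows": 0,             # we only want facet counts, not documents
        "wt": "json",
        "facet": "true",
        "facet.field": field,
        "facet.limit": -1,     # no cap on the number of facet values
        "facet.mincount": 1,   # skip values with zero matching documents
    }
    return f"{base_url}/select?{urlencode(params)}"


def distinct_count(response: dict, field: str) -> int:
    """Count distinct values in Solr's default flat facet list:
    [value1, count1, value2, count2, ...]."""
    flat = response["facet_counts"]["facet_fields"][field]
    return len(flat) // 2


url = facet_url("http://localhost:8983/solr/events", "user_id")  # hypothetical core
print(url)

# Hand-written response in the shape Solr returns for a field facet.
sample_response = {
    "facet_counts": {
        "facet_fields": {"user_id": ["u1", 12, "u2", 7, "u3", 1]}
    }
}
print(distinct_count(sample_response, "user_id"))
```

Returning every facet value scales with the number of distinct items, so for very high-cardinality fields this is only a starting point.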