The document discusses how MetricsHub optimized its public cloud architecture using Cassandra to store massive amounts of telemetry data in a highly scalable and cost-effective way. It describes how MetricsHub grew to monitor over 8000 VMs and collect 200 million data points per hour. It explains how Cassandra provided a more scalable and reliable solution than Redis or SQL for MetricsHub's use case of aggregating and analyzing streaming telemetry data in real-time. The architecture diagram shows how MetricsHub uses a combination of Azure PaaS and IaaS resources, including a 32 node Cassandra cluster, to monitor customer resources.
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos
1. Optimizing the Public Cloud for Cost
and Scalability with Cassandra
Charles Lamanna
Senior Development Lead
@clamanna
Ricardo Villalobos
Senior Cloud Architect
@ricvilla
11. Planning for huge data ingestion rates
• MetricsHub requires high scale, real-time data:
• 1,000 data points per minute per VM
• 12 data points per endpoint per minute
• 500+ data points per storage account per hour
• Need to aggregate, analyze and take actions based on
this data stream (in near real-time)
• Must be cheap, scalable and reliable
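As a rough sanity check on these rates, the aggregate load can be estimated from the 200 million points/hour figure quoted in the summary above (a sketch; the per-VM peak rate is the one from this slide):

```python
# Back-of-envelope estimate of the aggregate write load, using the
# 200 million data points/hour figure from the summary above.
points_per_hour = 200_000_000
points_per_second = points_per_hour / 3600   # sustained writes/sec

# Per-VM peak rate quoted on this slide:
points_per_vm_per_minute = 1_000
per_vm_per_second = points_per_vm_per_minute / 60

print(round(points_per_second))      # tens of thousands of writes/sec
print(round(per_vm_per_second, 1))   # points/sec for a single VM at peak
```

That sustained rate is the same order of magnitude as the "100k writes / sec" mentioned in the notes at the end of the deck.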
12. Looked at Redis…
• Perform aggregation in memory (using INCR and other native
operations)
• Flush aggregate data from Redis to persistent storage at a
regular interval
• It is fast and powerful, with a good OSS community
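The pattern described above can be sketched as follows. A plain dict stands in for Redis here; against Redis itself these would be INCR/INCRBY calls on counter keys, with the metric names purely illustrative:

```python
# Sketch of "aggregate in memory, flush periodically" with a dict
# standing in for Redis. record() plays the role of INCR/INCRBY.
from collections import defaultdict

counters = defaultdict(int)   # in-memory aggregates (the "Redis" side)
persisted = {}                # stand-in for persistent storage

def record(metric: str, value: int) -> None:
    """Aggregate on write: bump the running sum and count."""
    counters[f"{metric}:sum"] += value   # INCRBY metric:sum value
    counters[f"{metric}:cnt"] += 1       # INCR  metric:cnt

def flush() -> None:
    """Copy aggregates to persistent storage and reset the counters.
    Coordinating this step safely, and fast enough to avoid running
    out of memory, is exactly the fragility called out next."""
    persisted.update(counters)
    counters.clear()

record("vm1/cpu", 40)
record("vm1/cpu", 60)
flush()
print(persisted["vm1/cpu:sum"] / persisted["vm1/cpu:cnt"])  # average: 50.0
```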
13. … but it was fragile, and expensive for this use
case
• RAM/Memory in the public cloud is *expensive* (but storage is
*cheap*)
• Flushing the data requires complex coordination
• If we did not flush quickly enough – out of memory!
14. Looked at SQL…
• Create tables for different time windows and granularities
• Roll over from table-to-table (and drop entire tables when
the data expires)
• Update in place (for counters, min, max, etc.) in a reliable
way
15. … but SQL did not fit
• Write volume far higher than read volume pushed the limits of the servers
• Would require complex sharding after just a few dozen new customers
• It is possible, but not worth the operational cost
16. Then we tried Cassandra (and
never went back)
• Scales fluidly
• Grows horizontally – double the nodes, double capacity
• Add / remove capacity / nodes with no downtime
• Highly available
• No single point of failure
• Replication factor (i.e. hot copies) is just a config switch
17. … and by the way
• Little-to-no operations cost
• New nodes take minutes to setup
• Nodes just keep running for months on end
• “Aggregate on write” – no jobs required!
• Atomic distributed counters make it easy to do aggregates on
write
• …and a nice kicker: has *great* perf / COGS in Azure
19. Architecture overview
• Clients: End User Web Browsers; Monitored Customer Resources (e.g. websites; SQL databases); Monitored Virtual Machines / Endpoints
• PaaS: Portal Web Role (3 instances); Web API Web Role (8 instances); Jobs Worker Role (24 instances)
• IaaS: Cassandra VM Cluster (32 XL instances) – replicated data in multiple datacenters
• Services: Table Storage; SQL Database; Blob storage
20. Avoiding state
• Application logic / code all
lives on stateless
machines
• Keeps it simple: decreases
human operations cost
• Use Azure PAAS offerings
(Web and Worker roles)
21. Windows Azure Cloud Services
(PAAS)
• Scale horizontally (grew from
1 to 30+ instances)
• Managed by the platform
(patched; coordinated
recycling; failover; etc.)
• 1 click deployment from
Visual Studio (with automatic
load balancer swaps)
22. Jobs Worker Role (24 instances)
• Runs recurring tasks to pull, generate and analyze data
• Jobs are synchronized and scheduled using Windows Azure Tables and Queues
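The queue-driven scheduling described on this slide can be sketched like so. Python's `queue.Queue` stands in for Windows Azure Queues, and the job names are illustrative, not from the talk:

```python
# Sketch of queue-driven job scheduling: a shared queue hands tasks
# to whichever worker instance pulls next. queue.Queue stands in for
# Windows Azure Queues; job names are illustrative.
import queue

jobs = queue.Queue()
for job in ("ping-endpoint", "pull-lb-stats", "check-vm-load"):
    jobs.put(job)                # scheduler enqueues recurring tasks

completed = []
while not jobs.empty():
    job = jobs.get()             # any of the 24 instances can take it
    completed.append(job)        # run the task, then record completion
    jobs.task_done()

print(completed)
```

Because each job is isolated by task and customer (per the notes at the end of the deck), a failed worker only delays its own queue items rather than taking down the pipeline.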
23. Web API Role (8 instances)
• RESTful endpoint for saving and reading custom metrics
• Highly concurrent, secure & scalable
24. Portal Web Role (3 instances)
• Interface for our customers – shows trends, charts and issues
25. Cassandra Cluster (32 XL instances)
• Maintains all state for metrics / time series data
• Replicated data in multiple datacenters
26. Windows Azure Virtual Machines
(IaaS)
Provisioning flow: Starting → Select Image and VM Size → New Disk Persisted in Storage
28. Exposing the pods
• Each pod of 4 nodes
has a single load
balanced endpoint
• Clients (on our stateless roles) treat the endpoints as a pool
• An endpoint is blacklisted and skipped if it starts producing a lot of errors
29. Where does the data go?
• Data files are on 8 mounted network
backed disks (*not* ephemeral disks)
• Data disks are geo-replicated (3
copies local; 1 remote) for “free” DR
• Azure data disks offer great
throughput (VMs end up CPU bound)
31. Updating values…
Real-time “average” values at any granularity, for any time window (the same update runs against the oneminute, tenminute and oneday tables):
update oneminute
set
sum = sum + {sample_value},
cnt = cnt + 1
where
rk = '{customer_name}' and
ck = '{metric_path}'
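One way to fold a single sample into all three granularity tables is to derive a time-bucketed clustering key per table. The key layout below is an illustrative assumption (rk = customer, ck = metric path plus bucket start time), not confirmed by the talk:

```python
# Sketch: fold one sample into the oneminute / tenminute / oneday
# tables by computing the start of each time bucket. The key layout
# (rk = customer, ck = "<metric_path>/<bucket_epoch>") is an assumption.
from datetime import datetime, timezone

GRANULARITIES = {"oneminute": 60, "tenminute": 600, "oneday": 86_400}

def bucket_keys(customer, metric_path, ts):
    epoch = int(ts.timestamp())
    for table, width in GRANULARITIES.items():
        bucket = epoch - epoch % width      # start of the time window
        yield table, customer, f"{metric_path}/{bucket}"

ts = datetime(2013, 6, 11, 10, 31, 45, tzinfo=timezone.utc)
for table, rk, ck in bucket_keys("contoso", "vm1/cpu", ts):
    # For each row: update <table> set sum = sum + value, cnt = cnt + 1
    #               where rk = '<rk>' and ck = '<ck>'
    print(table, rk, ck)
```

Because Cassandra counter increments are atomic, many writers can hit the same bucket row concurrently without coordination.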
32. Reading values…
*ONE* round trip to fetch a metric over time (e.g. CPU over past
week)
select * from oneminute
where
rk = '{customer_name}' and
ck < '{metric_path_start}' and
ck >= '{metric_path_end}'
order by ck desc;
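Since each row carries the running (sum, cnt) counter pair, the average per bucket falls out client-side with no aggregation job. A sketch, using fabricated rows standing in for a result set from the oneminute table:

```python
# Sketch: turn (sum, cnt) counter columns from the single-round-trip
# query into per-bucket averages. The rows are fabricated sample data.
rows = [
    {"ck": "vm1/cpu/1370946720", "sum": 180, "cnt": 3},
    {"ck": "vm1/cpu/1370946660", "sum": 100, "cnt": 2},
]

# One round trip returns every bucket in the window; the average for
# each bucket is simply sum / cnt.
averages = {row["ck"]: row["sum"] / row["cnt"] for row in rows}
print(averages)
```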
33. What’s next?
• Windows Azure Virtual Networks to connect /
secure all of our resources
(PAAS + IAAS + Services)
• Expand Cassandra cluster across datacenter
boundaries for improved availability
• Integrate with more off-the-shelf Azure
components to reduce operational overhead
Speaker notes
• Jobs Worker Role: Examples: ping customer endpoint; pull load balancer stats; identify if a VM set is overloaded. Huge scale and highly reliable framework (10s of thousands of jobs; no downtime). All jobs are isolated by task (e.g. ping URL) and customer. Communicates with Cassandra using FluentCassandra (.NET). Requests round robin balanced over 8 endpoints. Data stream is massive (100k writes / sec) and needs to be resilient.
• Web API Role: Integrates with other partner services (e.g. Windows Azure store). Used by MetricsHub client agents (on customer machines). Based on .NET (C#) WebAPIs. Persists all customer data (writes) to Cassandra only.
• Portal Web Role: .NET based using MVC + IIS. Heavy use of jQuery / JavaScript on the client side. 15+ OSS components are used in the portal. Bundled & shipped with 1-click deployment. Updated our production portal several times a day.
• Cassandra Cluster: FluentCassandra client. All reads / writes for metric data go to this cluster; no need for a cache. 40+ VMs connect to this cluster.