The document discusses how MetricsHub optimized its public cloud architecture using Cassandra to store massive amounts of telemetry data in a highly scalable and cost-effective way. It describes how MetricsHub grew to monitor over 8000 VMs and collect 200 million data points per hour. It explains how Cassandra provided a more scalable and reliable solution than Redis or SQL for MetricsHub's use case of aggregating and analyzing streaming telemetry data in real-time. The architecture diagram shows how MetricsHub uses a combination of Azure PaaS and IaaS resources, including a 32 node Cassandra cluster, to monitor customer resources.
C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos
1. Optimizing the Public Cloud for Cost
and Scalability with Cassandra
Charles Lamanna
Senior Development Lead
@clamanna
Ricardo Villalobos
Senior Cloud Architect
@ricvilla
11. Planning for huge data ingestion rates
• MetricsHub requires high scale, real-time data:
• 1,000 data points per minute per VM
• 12 data points per endpoint per minute
• 500+ data points per storage account per hour
• Need to aggregate, analyze and take actions based on
this data stream (in near real-time)
• Must be cheap, scalable and reliable
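As a rough sanity check on these rates, the aggregate load can be estimated from the 200 million points/hour figure quoted in the summary above (a sketch; the per-VM peak rate is the one from this slide):

```python
# Back-of-envelope estimate of the aggregate write load, using the
# 200 million data points/hour figure from the summary above.
points_per_hour = 200_000_000
points_per_second = points_per_hour / 3600   # sustained writes/sec

# Per-VM peak rate quoted on this slide:
points_per_vm_per_minute = 1_000
per_vm_per_second = points_per_vm_per_minute / 60

print(round(points_per_second))      # tens of thousands of writes/sec
print(round(per_vm_per_second, 1))   # points/sec for a single VM at peak
```

That sustained rate is the same order of magnitude as the "100k writes / sec" mentioned in the notes at the end of the deck.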
12. Looked at Redis…
• Perform aggregation in memory (using INCR and other native
operations)
• Flush aggregate data from Redis to persistent storage at a
regular interval
• It is fast and powerful, with a good OSS community
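The pattern described above can be sketched as follows. A plain dict stands in for Redis here; against Redis itself these would be INCR/INCRBY calls on counter keys, with the metric names purely illustrative:

```python
# Sketch of "aggregate in memory, flush periodically" with a dict
# standing in for Redis. record() plays the role of INCR/INCRBY.
from collections import defaultdict

counters = defaultdict(int)   # in-memory aggregates (the "Redis" side)
persisted = {}                # stand-in for persistent storage

def record(metric: str, value: int) -> None:
    """Aggregate on write: bump the running sum and count."""
    counters[f"{metric}:sum"] += value   # INCRBY metric:sum value
    counters[f"{metric}:cnt"] += 1       # INCR  metric:cnt

def flush() -> None:
    """Copy aggregates to persistent storage and reset the counters.
    Coordinating this step safely, and fast enough to avoid running
    out of memory, is exactly the fragility called out next."""
    persisted.update(counters)
    counters.clear()

record("vm1/cpu", 40)
record("vm1/cpu", 60)
flush()
print(persisted["vm1/cpu:sum"] / persisted["vm1/cpu:cnt"])  # average: 50.0
```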
13. … but it was fragile, and expensive for this use
case
• RAM/Memory in the public cloud is *expensive* (but storage is
*cheap*)
• Flushing the data requires complex coordination
• If we did not flush quickly enough – out of memory!
14. Looked at SQL…
• Create tables for different time windows and granularities
• Roll over from table-to-table (and drop entire tables when
the data expires)
• Update in place (for counters, min, max, etc.) in a reliable
way
15. … but SQL did not fit
• Write volume far higher than read volume pushed the limits of the servers
• Would require complex sharding after just a few dozen new customers
• It is possible, but not worth the operational cost
16. Then we tried Cassandra (and
never went back)
• Scales fluidly
• Grows horizontally – double the nodes, double capacity
• Add / remove capacity / nodes with no downtime
• Highly available
• No single point of failure
• Replication factor (i.e. hot copies) is just a config switch
17. … and by the way
• Little-to-no operations cost
• New nodes take minutes to setup
• Nodes just keep running for months on end
• “Aggregate on write” – no jobs required!
• Atomic distributed counters make it easy to do aggregates on
write
• …and a nice kicker: has *great* perf / COGS in Azure
19. Architecture overview
• Clients: End User Web Browsers; Monitored Customer Resources (e.g. websites; SQL databases); Monitored Virtual Machines / Endpoints
• PaaS: Portal Web Role (3 instances); Web API Web Role (8 instances); Jobs Worker Role (24 instances)
• IaaS: Cassandra VM Cluster (32 XL instances) – replicated data in multiple datacenters
• Services: Table Storage; SQL Database; Blob storage
20. Avoiding state
• Application logic / code all
lives on stateless
machines
• Keeps it simple: decreases
human operations cost
• Use Azure PAAS offerings
(Web and Worker roles)
21. Windows Azure Cloud Services
(PAAS)
• Scale horizontally (grew from
1 to 30+ instances)
• Managed by the platform
(patched; coordinated
recycling; failover; etc.)
• 1 click deployment from
Visual Studio (with automatic
load balancer swaps)
22. Jobs Worker Role (24 instances)
• Runs recurring tasks to pull, generate and analyze data
• Jobs are synchronized and scheduled using Windows Azure Tables and Queues
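The queue-driven scheduling described on this slide can be sketched like so. Python's `queue.Queue` stands in for Windows Azure Queues, and the job names are illustrative, not from the talk:

```python
# Sketch of queue-driven job scheduling: a shared queue hands tasks
# to whichever worker instance pulls next. queue.Queue stands in for
# Windows Azure Queues; job names are illustrative.
import queue

jobs = queue.Queue()
for job in ("ping-endpoint", "pull-lb-stats", "check-vm-load"):
    jobs.put(job)                # scheduler enqueues recurring tasks

completed = []
while not jobs.empty():
    job = jobs.get()             # any of the 24 instances can take it
    completed.append(job)        # run the task, then record completion
    jobs.task_done()

print(completed)
```

Because each job is isolated by task and customer (per the notes at the end of the deck), a failed worker only delays its own queue items rather than taking down the pipeline.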
23. Web API Role (8 instances)
• RESTful endpoint for saving and reading custom metrics
• Highly concurrent, secure & scalable
24. Portal Web Role (3 instances)
• Interface for our customers – shows trends, charts and issues
25. Cassandra Cluster (32 XL instances)
• Maintains all state for metrics / time series data
• Replicated data in multiple datacenters
26. Windows Azure Virtual Machines
(IaaS)
Provisioning flow: Starting → Select Image and VM Size → New Disk Persisted in Storage
28. Exposing the pods
• Each pod of 4 nodes
has a single load
balanced endpoint
• Clients (on our stateless roles) treat the endpoints as a pool
• An endpoint is blacklisted and skipped if it starts producing a lot of errors
29. Where does the data go?
• Data files are on 8 mounted network
backed disks (*not* ephemeral disks)
• Data disks are geo-replicated (3
copies local; 1 remote) for “free” DR
• Azure data disks offer great
throughput (VMs end up CPU bound)
31. Updating values…
Real-time “average” values at any granularity, for any time window (the same update runs against the oneminute, tenminute and oneday tables):
update oneminute
set
sum = sum + {sample_value},
cnt = cnt + 1
where
rk = '{customer_name}' and
ck = '{metric_path}'
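One way to fold a single sample into all three granularity tables is to derive a time-bucketed clustering key per table. The key layout below is an illustrative assumption (rk = customer, ck = metric path plus bucket start time), not confirmed by the talk:

```python
# Sketch: fold one sample into the oneminute / tenminute / oneday
# tables by computing the start of each time bucket. The key layout
# (rk = customer, ck = "<metric_path>/<bucket_epoch>") is an assumption.
from datetime import datetime, timezone

GRANULARITIES = {"oneminute": 60, "tenminute": 600, "oneday": 86_400}

def bucket_keys(customer, metric_path, ts):
    epoch = int(ts.timestamp())
    for table, width in GRANULARITIES.items():
        bucket = epoch - epoch % width      # start of the time window
        yield table, customer, f"{metric_path}/{bucket}"

ts = datetime(2013, 6, 11, 10, 31, 45, tzinfo=timezone.utc)
for table, rk, ck in bucket_keys("contoso", "vm1/cpu", ts):
    # For each row: update <table> set sum = sum + value, cnt = cnt + 1
    #               where rk = '<rk>' and ck = '<ck>'
    print(table, rk, ck)
```

Because Cassandra counter increments are atomic, many writers can hit the same bucket row concurrently without coordination.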
32. Reading values…
*ONE* round trip to fetch a metric over time (e.g. CPU over past
week)
select * from oneminute
where
rk = '{customer_name}' and
ck < '{metric_path_start}' and
ck >= '{metric_path_end}'
order by ck desc;
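Since each row carries the running (sum, cnt) counter pair, the average per bucket falls out client-side with no aggregation job. A sketch, using fabricated rows standing in for a result set from the oneminute table:

```python
# Sketch: turn (sum, cnt) counter columns from the single-round-trip
# query into per-bucket averages. The rows are fabricated sample data.
rows = [
    {"ck": "vm1/cpu/1370946720", "sum": 180, "cnt": 3},
    {"ck": "vm1/cpu/1370946660", "sum": 100, "cnt": 2},
]

# One round trip returns every bucket in the window; the average for
# each bucket is simply sum / cnt.
averages = {row["ck"]: row["sum"] / row["cnt"] for row in rows}
print(averages)
```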
33. What’s next?
• Windows Azure Virtual Networks to connect /
secure all of our resources
(PAAS + IAAS + Services)
• Expand Cassandra cluster across datacenter
boundaries for improved availability
• Integrate with more off-the-shelf Azure
components to reduce operational overhead
Speaker notes
• Jobs Worker Role: Examples: ping customer endpoint; pull load balancer stats; identify if a VM set is overloaded. Huge scale and highly reliable framework (10s of thousands of jobs; no downtime). All jobs are isolated by task (e.g. ping URL) and customer. Communicates with Cassandra using FluentCassandra (.NET). Requests round robin balanced over 8 endpoints. Data stream is massive (100k writes / sec) and needs to be resilient.
• Web API Role: Integrates with other partner services (e.g. Windows Azure store). Used by MetricsHub client agents (on customer machines). Based on .NET (C#) WebAPIs. Persists all customer data (writes) to Cassandra only.
• Portal Web Role: .NET based using MVC + IIS. Heavy use of jQuery / JavaScript on the client side. 15+ OSS components are used in the portal. Bundled & shipped with 1-click deployment. Updated our production portal several times a day.
• Cassandra Cluster: FluentCassandra client. All reads / writes for metric data go to this cluster; no need for a cache. 40+ VMs connect to this cluster.