From 100s to 100s of
Millions
July 2011
Erik Onnen
About Me
•   Director of Platform Engineering at Urban Airship (.75
    years)
•   Previously Principal Engineer at Jive Software (3 years)
•   12 years of large-scale, distributed systems experience going
    back to CORBA
•   Cassandra, HBase, Kafka and ZooKeeper contributor -
    most recently CASSANDRA-2463
In this Talk

•   About Urban Airship
•   Systems Overview
•   A Tale of Storage Engines
•   Our Cassandra Deployment
•   Battle Scars
    •   Development Lessons Learned
    •   Operations Lessons Learned
•   Looking Forward
What is an Urban Airship?
•   Hosting for mobile services that developers should not
    build themselves
•   Unified API for services across platforms
•   SLAs for throughput and latency
By The Numbers

•   Over 160 million active application installs use our system
    across over 80 million unique devices
•   Freemium API peaks at 700 requests/second, dedicated
    customer API 10K requests/second
    •   Over half of those are device check-ins
    •   Transactions - send push, check status, get content
•   At any given point in time, we have ~ 1.1 million secure
    socket connections into our transactional core
•   It took the company 6 months to deliver its first 1M messages;
    we just broke 4.2B
Transactional System

•   Edge Systems:
    •   API - Apache/Python/django+piston+pycassa
    •   Device negotiation - Java NIO + Hector
    •   Message Delivery - Python, Java NIO + Hector
    •   Device data - Java HTTPS endpoint
•   Persistence
    •   Sharded PostgreSQL
    •   Cassandra 0.7
    •   MongoDB 1.7
A Tale of Storage Engines

•   “Is there a NoSQL system you guys don’t use?”
    •   Riak :)
•   We do use:
    •   Cassandra
    •   HBase
    •   Redis
    •   MongoDB
•   We’re converging on Cassandra + PostgreSQL for
    transactional data and HBase for long-haul storage
A Tale of Storage Engines

•   PostgreSQL
    •   Bootstrapped the company on PostgreSQL in EC2
    •   Highly relational, large index model
    •   Layered in memcached (cache-aside; see the sketch after this slide)
    •   Writes weren’t scaling after ~ 6 months
    •   Continued to use it for several silos of data, but needed a
        way to grow more easily
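The memcached layering above is the standard cache-aside pattern: read from the
cache, fall back to PostgreSQL on a miss, then repopulate. A minimal Java sketch
of that pattern, assuming a hypothetical DeviceRecordCache class and key scheme;
an in-memory map stands in for a real memcached client so no client API is assumed:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class DeviceRecordCache {
        // Stand-in for a memcached client; a real deployment would use one.
        private final Map<String, String> cache = new ConcurrentHashMap<String, String>();

        public String getDeviceRecord(String deviceId) {
            String cached = cache.get(deviceId);        // 1. try the cache first
            if (cached != null) {
                return cached;
            }
            String fromDb = loadFromPostgres(deviceId); // 2. miss: hit the database
            if (fromDb != null) {
                cache.put(deviceId, fromDb);            // 3. repopulate for later reads
            }
            return fromDb;
        }

        private String loadFromPostgres(String deviceId) {
            // Placeholder for a JDBC query against the appropriate PostgreSQL shard.
            return null;
        }
    }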
A Tale of Storage Engines
•   MongoDB
    •   Initially, we loved Mongo
        •   Document databases are cool
        •   BSON is nice
    •   As data set grew, we learned a lot about MongoDB
    •   “MongoDB does not wait for a response by default when
        writing to the database.”
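For context on the quote above: with the 1.x/2.x-era MongoDB Java driver, writes
were fire-and-forget unless a stricter write concern was requested explicitly. A
minimal sketch assuming that driver generation; the database and collection names
are illustrative:

    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.Mongo;
    import com.mongodb.WriteConcern;
    import com.mongodb.WriteResult;

    public class SafeWriteExample {
        public static void main(String[] args) throws Exception {
            Mongo mongo = new Mongo("localhost", 27017);
            DB db = mongo.getDB("airship");                      // illustrative database name
            DBCollection devices = db.getCollection("devices");  // illustrative collection name

            BasicDBObject doc = new BasicDBObject("deviceId", "abc123").append("active", true);

            // The default write concern did not wait for acknowledgement; passing
            // WriteConcern.SAFE forces a getLastError round trip so failures surface.
            WriteResult result = devices.insert(doc, WriteConcern.SAFE);
            System.out.println("acknowledged, error = " + result.getError());
        }
    }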
A Tale of Storage Engines
•   MongoDB - Read/Write Problems
    •   Early days (1.2) one global lock (reads block
        writes and vice versa)
    •   Later, one read lock, one write lock per server
    •   Long-running queries were often devastating
        •   Replication would fall too far behind and stop
        •   No writes or updates
        •   Effectively a failure for most clients
    •   With sharding, queries for anything other than the shard
        key talk to every node in the cluster
A Tale of Storage Engines
•   MongoDB - Update Problems
    •   Simple updates (i.e. counters) were fine
    •   Bigger updates commonly resulted in large scans of the
        collection depending on position == heavy disk I/O
    •   Updates frequently spill to the end of the collection datafile,
        leaving “holes” but not sparse files
    •   Those “holes” get MMap’d even though they’re not used
    •   Updates that move data acquire multiple locks, commonly
        blocking other read/write operations
A Tale of Storage Engines
•   MongoDB - Optimization Problems
    •   Compacting a collection locks the entire collection
    •   Read slave was too busy to be a backup; needed moar
        RAMs but we were already on High-Memory EC2, with
        nowhere else to go
    •   Mongo MMaps everything - when your data set is
        bigger than RAM, you’d better have fast disks
    •   Until 1.8, no support for sparse indexes
A Tale of Storage Engines
•   MongoDB - Ops Issues
    •   Lots of good information in mongostat
    •   Recovering a crashed system was effectively impossible
        without disabling indexes first (not the default)
    •   Replica sets never worked for us in testing, lots of
        inconsistencies in failure scenarios
    •   Scattered records led to lots of I/O, which hurt on bad
        disks (EC2)
Cassandra at Urban Airship

•   Summer of 2010 - with no faith left in MongoDB, we started
    a migration to Cassandra
•   Lots of load & performance (L&P) testing, client analysis, etc.
•   December 2010 - Cassandra backed 85% of our Android
    stack’s persistence
    •   Six EC2 XLs, each serving:
        •   30GB data
        •   ~1000 reads/second/node
        •   ~750 writes/second/node
Cassandra at Urban Airship

•   Why Cassandra?
    •   Well suited for most of our data model (simple DAGs)
        •   Lots of UUIDs and hashes partition well
        •   Retrievals don’t need ordering beyond keys or TSD
    •   Rolling upgrades FTW
    •   Dynamic rebalancing and node addition
    •   Column TTLs are huge for us (see the sketch after this list)
    •   Awesome community :)
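A minimal sketch of the column TTLs called out above, written against a Hector
0.7-era client; the cluster, keyspace, column family, and row key are illustrative
and the exact factory signatures vary between Hector releases:

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class TtlColumnExample {
        public static void main(String[] args) {
            StringSerializer ss = StringSerializer.get();
            Cluster cluster = HFactory.getOrCreateCluster("Test Cluster", "localhost:9160");
            Keyspace keyspace = HFactory.createKeyspace("Airship", cluster);

            Mutator<String> mutator = HFactory.createMutator(keyspace, ss);
            int ttlSeconds = 86400; // the column silently expires after one day
            mutator.insert("device:abc123", "DeviceCheckins",
                    HFactory.createColumn("last_seen", "2011-07-01T00:00:00Z", ttlSeconds, ss, ss));
        }
    }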
Cassandra at Urban Airship

•   Why Cassandra cont’d?
    •   Particularly well suited to working around EC2 availability
        •   Needed a cross-AZ strategy - we had seen EBS issues
            in the past and didn’t trust fault containment within a zone
        •   Didn’t want locality of replication, so we needed to stripe
            across AZs
        •   Read repair and handoff generally did the right thing
            when a node would flap (Ubuntu #708920)
        •   No SPoF
    •   Ability to alter consistency levels (CLs) on a per-operation
        basis (see the sketch below)
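One way to vary consistency levels by operation type with a Hector 0.7-era client
is a ConfigurableConsistencyLevel policy; a minimal sketch, where the keyspace name
and the ONE/QUORUM choices are assumptions, not UA's production settings:

    import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.HConsistencyLevel;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;

    public class PerOperationConsistency {
        public static void main(String[] args) {
            Cluster cluster = HFactory.getOrCreateCluster("Test Cluster", "localhost:9160");

            // Reads and writes get different consistency levels from one policy object.
            ConfigurableConsistencyLevel policy = new ConfigurableConsistencyLevel();
            policy.setDefaultReadConsistencyLevel(HConsistencyLevel.ONE);     // cheap reads
            policy.setDefaultWriteConsistencyLevel(HConsistencyLevel.QUORUM); // durable writes

            Keyspace keyspace = HFactory.createKeyspace("Airship", cluster, policy);
            // Mutators and queries created from this keyspace inherit those levels.
            HFactory.createMutator(keyspace, StringSerializer.get());
        }
    }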
Battle Scars - Development
•   Know your data model
    •   Creating indexes after the fact is a PITA
    •   Design around wide rows
        •   I/O problems
        •   Thrift problems
        •   Count problems
    •   Favor JSON over packed binaries if possible
•   Careful with Thrift in the stack
•   Don’t fear the StorageProxy
Battle Scars - Development
•   Assume failure in the client
    •   Read timeout vs. connection refused
    •   When maintaining your own indexes, try to clean up
        after failures
    •   Be ready to clean up inconsistencies anyway
    •   Verify client library assumptions and exception handling
        •   Retry now vs. retry later?
        •   Compensating action during failures?
•   Don’t avoid the Cassandra code
•   Embed Cassandra for testing (see the sketch below)
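A minimal sketch of embedding Cassandra for tests, assuming the 0.7-era
EmbeddedCassandraService and a test cassandra.yaml on disk; the config path is
illustrative and the exact bootstrap differs across Cassandra versions:

    import org.apache.cassandra.service.EmbeddedCassandraService;

    public class EmbeddedCassandraHarness {
        private static EmbeddedCassandraService cassandra;

        // Call once before the suite runs, e.g. from a JUnit @BeforeClass method.
        public static void startEmbedded() throws Exception {
            // Point Cassandra at a test config; the path below is an assumption.
            System.setProperty("cassandra.config", "file:src/test/resources/cassandra.yaml");
            cassandra = new EmbeddedCassandraService();
            cassandra.start(); // boots an in-process node tests can reach on localhost
        }
    }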
Battle Scars - Ops
•   Cassandra in EC2:
    •   Ensure Dynamic Snitch is enabled
    •   Disk I/O
        •   Avoid EBS except for snapshot backups or use S3
        •   Stripe ephemerals, not EBS volumes
        •   Avoid smaller instances altogether
    •   Don’t always assume traversing close-proximity AZs is
        more expensive
    •   Balance RAM cost vs. the cost of additional hosts and
        time spent with GC logs
Battle Scars - Ops
•   Java Best Practices:
    •   All Java services are managed via the same set of
        scripts
        •   In most cases, operators don’t treat Cassandra
            differently from HBase
        •   Simple mechanism to take thread or heap dump
        •   All logging is consistent - GC, application, stdout/stderr
        •   Init scripts use the same scripts operators do
    •   Bare metal will rock your world
    •   +UseLargePages will rock your world too
Battle Scars - Ops
[Charts: ParNew GC effectiveness (MB collected), mean ParNew GC time
(collection time in ms), and ParNew collection count, comparing Bare Metal
vs. EC2 XL]
Battle Scars - Ops
•   Java Best Practices cont’d:
    •   Get familiar with GC logs (-XX:+PrintGCDetails)
        •   Understand what degenerate CMS collection looks like
        •   We settled at -XX:CMSInitiatingOccupancyFraction=60
        •   Possibly experiment with tenuring threshold
    •   When in doubt take a thread dump
    •   TDA (http://java.net/projects/tda/)
    •   Eclipse MAT (http://www.eclipse.org/mat/)
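Pulling the flags from this slide together, an illustrative JVM argument set; only
CMSInitiatingOccupancyFraction=60 comes from the deck, while heap size, tenuring
threshold, and the GC log path are assumptions to experiment with:

    -Xms8G -Xmx8G
    -XX:+UseParNewGC
    -XX:+UseConcMarkSweepGC
    -XX:CMSInitiatingOccupancyFraction=60
    -XX:+UseCMSInitiatingOccupancyOnly
    -XX:MaxTenuringThreshold=2
    -XX:+UseLargePages
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -Xloggc:/var/log/cassandra/gc.log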
Battle Scars - Ops
•   Understand when to compact
•   Understand upgrade implications for datafiles
•   Watch hinted handoff closely
•   Monitor JMX religiously
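A minimal sketch of "monitor JMX religiously" from plain Java; the host, the
0.7-era default JMX port (8080), and the MBean/attribute names are assumptions to
adjust for your cluster (browse them in jconsole first):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class JmxPendingTasksProbe {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://cassandra-host:8080/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // MBean and attribute names are illustrative for a 0.7-era node.
                ObjectName compaction = new ObjectName(
                        "org.apache.cassandra.db:type=CompactionManager");
                Object pending = mbs.getAttribute(compaction, "PendingTasks");
                System.out.println("compaction pending tasks: " + pending);
            } finally {
                connector.close();
            }
        }
    }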
Looking Forward
•   Cassandra is a great hammer but not everything is a nail
•   Coprocessors would be awesome (hint hint)
•   Still spend too much time worrying about GC
•   Glad to see the ecosystem around the product evolving
    •   CQL
    •   Pig
    •   Brisk
•   Guardedly optimistic about off-heap data management
Thanks to



•   jbellis, driftx
•   Datastax
•   Whoever wrote TDA
•   SAP
Thanks!




•   Urban Airship: http://urbanairship.com/
•   We’re hiring! http://urbanairship.com/company/jobs/
•   Me @eonnen or erik at

From 100s to 100s of Millions

  • 1. From 100s to 100s of Millions July 2011 Erik Onnen
  • 2. About Me • Director of Platform Engineering at Urban Airship (.75 years) • Previously Principal Engineer at Jive Software (3 years) • 12 years large scale, distributed systems experience going back to CORBA • Cassandra, HBase, Kafka and ZooKeeper contributor - most recently CASSANDRA-2463
  • 3. In this Talk • About Urban Airship • Systems Overview • A Tale of Storage Engines • Our Cassandra Deployment • Battle Scars • Development Lessons Learned • Operations Lessons Learned • Looking Forward
  • 4. What is an Urban Airship? • Hosting for mobile services that developers should not build themselves • Unified API for services across platforms • SLAs for throughput, latency
  • 6. By The Numbers • Over 160 million active application installs use our system across over 80 million unique devices
  • 7. By The Numbers • Over 160 million active application installs use our system across over 80 million unique devices • Freemium API peaks at 700 requests/second, dedicated customer API 10K requests/second
  • 8. By The Numbers • Over 160 million active application installs use our system across over 80 million unique devices • Freemium API peaks at 700 requests/second, dedicated customer API 10K requests/second • Over half of those are device check-ins
  • 9. By The Numbers • Over 160 million active application installs use our system across over 80 million unique devices • Freemium API peaks at 700 requests/second, dedicated customer API 10K requests/second • Over half of those are device check-ins • Transactions - send push, check status, get content
  • 10. By The Numbers • Over 160 million active application installs use our system across over 80 million unique devices • Freemium API peaks at 700 requests/second, dedicated customer API 10K requests/second • Over half of those are device check-ins • Transactions - send push, check status, get content • At any given point in time, we have ~ 1.1 million secure socket connections into our transactional core
  • 11. By The Numbers • Over 160 million active application installs use our system across over 80 million unique devices • Freemium API peaks at 700 requests/second, dedicated customer API 10K requests/second • Over half of those are device check-ins • Transactions - send push, check status, get content • At any given point in time, we have ~ 1.1 million secure socket connections into our transactional core • 6 months for the company to deliver 1M messages, just broke 4.2B
  • 13. Transactional System • Edge Systems: • API - Apache/Python/django+piston+pycassa • Device negotiation - Java NIO + Hector • Message Delivery - Python, Java NIO + Hector • Device data - Java HTTPS endpoint
  • 14. Transactional System • Edge Systems: • API - Apache/Python/django+piston+pycassa • Device negotiation - Java NIO + Hector • Message Delivery - Python, Java NIO + Hector • Device data - Java HTTPS endpoint • Persistence • Sharded PostgreSQL • Cassandra 0.7 • MongoDB 1.7
  • 15. A Tale of Storage Engines
  • 16. A Tale of Storage Engines • “Is there a NoSQL system you guys don’t use?”
  • 17. A Tale of Storage Engines • “Is there a NoSQL system you guys don’t use?” • Riak :)
  • 18. A Tale of Storage Engines • “Is there a NoSQL system you guys don’t use?” • Riak :) • We do use:
  • 19. A Tale of Storage Engines • “Is there a NoSQL system you guys don’t use?” • Riak :) • We do use: • Cassandra
  • 20. A Tale of Storage Engines • “Is there a NoSQL system you guys don’t use?” • Riak :) • We do use: • Cassandra • HBase
  • 21. A Tale of Storage Engines • “Is there a NoSQL system you guys don’t use?” • Riak :) • We do use: • Cassandra • HBase • Redis
  • 22. A Tale of Storage Engines • “Is there a NoSQL system you guys don’t use?” • Riak :) • We do use: • Cassandra • HBase • Redis • MongoDB
  • 23. A Tale of Storage Engines • “Is there a NoSQL system you guys don’t use?” • Riak :) • We do use: • Cassandra • HBase • Redis • MongoDB • We’re converging on Cassandra + PostgreSQL for transactional and HBase for long haul
  • 24. A Tale of Storage Engines
  • 25.
  • 26.
  • 27. A Tale of Storage Engines
  • 28. A Tale of Storage Engines • PostgreSQL
  • 29. A Tale of Storage Engines • PostgreSQL • Bootstrapped the company on PostgreSQL in EC2
  • 30. A Tale of Storage Engines • PostgreSQL • Bootstrapped the company on PostgreSQL in EC2 • Highly relational, large index model
  • 31. A Tale of Storage Engines • PostgreSQL • Bootstrapped the company on PostgreSQL in EC2 • Highly relational, large index model • Layered in memcached
  • 32. A Tale of Storage Engines • PostgreSQL • Bootstrapped the company on PostgreSQL in EC2 • Highly relational, large index model • Layered in memcached • Writes weren’t scaling after ~ 6 months
  • 33. A Tale of Storage Engines • PostgreSQL • Bootstrapped the company on PostgreSQL in EC2 • Highly relational, large index model • Layered in memcached • Writes weren’t scaling after ~ 6 months • Continued to use for several silos of data but needed a way to grow more easily
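
A minimal sketch of the kind of read-through memcached layer described in the slides above, assuming python-memcached and psycopg2; the table, key, and connection names are illustrative, not Urban Airship's actual schema. Note that this pattern only shields PostgreSQL from read traffic; every write still lands on the primary, which is consistent with writes becoming the bottleneck.

    import json
    import memcache          # python-memcached client
    import psycopg2          # PostgreSQL driver

    mc = memcache.Client(['10.0.0.5:11211'])
    pg = psycopg2.connect("dbname=airship user=app")   # hypothetical DSN

    def get_device(device_id):
        # Read-through cache: serve hot rows from memcached, fall back to
        # PostgreSQL on a miss and repopulate the cache entry.
        key = 'device:%s' % device_id
        cached = mc.get(key)
        if cached is not None:
            return json.loads(cached)
        cur = pg.cursor()
        cur.execute("SELECT alias, active FROM devices WHERE id = %s", (device_id,))
        row = cur.fetchone()
        cur.close()
        if row is None:
            return None
        doc = {'alias': row[0], 'active': row[1]}
        mc.set(key, json.dumps(doc), time=300)   # expire after 5 minutes
        return doc
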
  • 34. A Tale of Storage Engines
  • 35. A Tale of Storage Engines • MongoDB
  • 36. A Tale of Storage Engines • MongoDB • Initially, we loved Mongo
  • 37. A Tale of Storage Engines • MongoDB • Initially, we loved Mongo • Document databases are cool
  • 38. A Tale of Storage Engines • MongoDB • Initially, we loved Mongo • Document databases are cool • BSON is nice
  • 39. A Tale of Storage Engines • MongoDB • Initially, we loved Mongo • Document databases are cool • BSON is nice • As data set grew, we learned a lot about MongoDB
  • 40. A Tale of Storage Engines • MongoDB • Initially, we loved Mongo • Document databases are cool • BSON is nice • As data set grew, we learned a lot about MongoDB • “MongoDB does not wait for a response by default when writing to the database.”
  • 41. A Tale of Storage Engines • MongoDB • Initially, we loved Mongo • Document databases are cool • BSON is nice • As data set grew, we learned a lot about MongoDB • “MongoDB does not wait for a response by default when writing to the database.”
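
For context, a hedged sketch of what that default meant in the PyMongo API of the era (collection and field names are made up): the bare insert returns as soon as the message is on the wire, while safe=True issues a getLastError round trip and raises if the server reports a failure. Later driver releases changed the default to acknowledged writes.

    from pymongo import Connection   # PyMongo 1.x / 2.x era API

    db = Connection('mongo01', 27017).airship   # hypothetical host and database

    # Fire-and-forget (the default at the time): the client never learns
    # about server-side errors such as duplicate keys or failed writes.
    db.devices.insert({'_id': 'device:abc', 'alias': 'user-42'})

    # Acknowledged write: safe=True triggers getLastError and raises
    # OperationFailure if the server reports a problem.
    db.devices.insert({'_id': 'device:def', 'alias': 'user-43'}, safe=True)

    # Stronger still: block until the write has replicated to 2 members.
    db.devices.insert({'_id': 'device:ghi', 'alias': 'user-44'}, safe=True, w=2)
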
  • 42. A Tale of Storage Engines
  • 43. A Tale of Storage Engines • MongoDB - Read/Write Problems
  • 44. A Tale of Storage Engines • MongoDB - Read/Write Problems • Early days (1.2) one global lock (reads block writes and vice versa)
  • 45. A Tale of Storage Engines • MongoDB - Read/Write Problems • Early days (1.2) one global lock (reads block writes and vice versa) • Later, one read lock, one write lock per server
  • 46. A Tale of Storage Engines • MongoDB - Read/Write Problems • Early days (1.2) one global lock (reads block writes and vice versa) • Later, one read lock, one write lock per server • Long running queries were often devastating
  • 47. A Tale of Storage Engines • MongoDB - Read/Write Problems • Early days (1.2) one global lock (reads block writes and vice versa) • Later, one read lock, one write lock per server • Long running queries were often devastating • Replication would fall too far behind and stop
  • 48. A Tale of Storage Engines • MongoDB - Read/Write Problems • Early days (1.2) one global lock (reads block writes and vice versa) • Later, one read lock, one write lock per server • Long running queries were often devastating • Replication would fall too far behind and stop • No writes or updates
  • 49. A Tale of Storage Engines • MongoDB - Read/Write Problems • Early days (1.2) one global lock (reads block writes and vice versa) • Later, one read lock, one write lock per server • Long running queries were often devastating • Replication would fall too far behind and stop • No writes or updates • Effectively a failure for most clients
  • 50. A Tale of Storage Engines • MongoDB - Read/Write Problems • Early days (1.2) one global lock (reads block writes and vice versa) • Later, one read lock, one write lock per server • Long running queries were often devastating • Replication would fall too far behind and stop • No writes or updates • Effectively a failure for most clients • With replication, queries for anything other than the shard key talk to every node in the cluster
  • 51. A Tale of Storage Engines
  • 52. A Tale of Storage Engines • MongoDB - Update Problems
  • 53. A Tale of Storage Engines • MongoDB - Update Problems • Simple updates (i.e. counters) were fine
  • 54. A Tale of Storage Engines • MongoDB - Update Problems • Simple updates (i.e. counters) were fine • Bigger updates commonly resulted in large scans of the collection depending on position == heavy disk I/O
  • 55. A Tale of Storage Engines • MongoDB - Update Problems • Simple updates (i.e. counters) were fine • Bigger updates commonly resulted in large scans of the collection depending on position == heavy disk I/O • Frequently spill to end of the collection datafile leaving “holes” but not sparse files
  • 56. A Tale of Storage Engines • MongoDB - Update Problems • Simple updates (i.e. counters) were fine • Bigger updates commonly resulted in large scans of the collection depending on position == heavy disk I/O • Frequently spill to end of the collection datafile leaving “holes” but not sparse files • Those “holes” get MMap’d even though they’re not used
  • 57. A Tale of Storage Engines • MongoDB - Update Problems • Simple updates (i.e. counters) were fine • Bigger updates commonly resulted in large scans of the collection depending on position == heavy disk I/O • Frequently spill to end of the collection datafile leaving “holes” but not sparse files • Those “holes” get MMap’d even though they’re not used • Updates moving data acquire multiple locks commonly blocking other read/write operations
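
A small illustration of that difference, again with hypothetical collection names: the $inc counter stays in place, while an update that grows a document can exceed its allocated space and force MongoDB to rewrite it elsewhere in the datafile.

    from pymongo import Connection

    db = Connection('mongo01', 27017).airship   # hypothetical host and database

    # In-place update: $inc on a numeric field does not change the document's
    # size, so it is applied where the document already lives.
    db.stats.update({'_id': 'app:123'}, {'$inc': {'pushes_sent': 1}})

    # Growing update: pushing onto an array can overflow the space allocated
    # for the document, forcing a move to the end of the collection datafile
    # and leaving a "hole" behind at the old location.
    db.devices.update({'_id': 'device:abc'}, {'$push': {'tags': 'sports-scores'}})
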
  • 58. A Tale of Storage Engines
  • 59. A Tale of Storage Engines • MongoDB - Optimization Problems
  • 60. A Tale of Storage Engines • MongoDB - Optimization Problems • Compacting a collection locks the entire collection
  • 61. A Tale of Storage Engines • MongoDB - Optimization Problems • Compacting a collection locks the entire collection • Read slave was too busy to be a backup, needed moar RAMs but were already on High-Memory EC2, nowhere else to go
  • 62. A Tale of Storage Engines • MongoDB - Optimization Problems • Compacting a collection locks the entire collection • Read slave was too busy to be a backup, needed moar RAMs but were already on High-Memory EC2, nowhere else to go • Mongo MMaps everything - when your data set is bigger than RAM, you better have fast disks
  • 63. A Tale of Storage Engines • MongoDB - Optimization Problems • Compacting a collection locks the entire collection • Read slave was too busy to be a backup, needed moar RAMs but were already on High-Memory EC2, nowhere else to go • Mongo MMaps everything - when your data set is bigger than RAM, you better have fast disks • Until 1.8, no support for sparse indexes
  • 64. A Tale of Storage Engines
  • 65. A Tale of Storage Engines • MongoDB - Ops Issues
  • 66. A Tale of Storage Engines • MongoDB - Ops Issues • Lots of good information in mongostat
  • 67. A Tale of Storage Engines • MongoDB - Ops Issues • Lots of good information in mongostat • Recovering a crashed system was effectively impossible without disabling indexes first (not the default)
  • 68. A Tale of Storage Engines • MongoDB - Ops Issues • Lots of good information in mongostat • Recovering a crashed system was effectively impossible without disabling indexes first (not the default) • Replica sets never worked for us in testing, lots of inconsistencies in failure scenarios
  • 69. A Tale of Storage Engines • MongoDB - Ops Issues • Lots of good information in mongostat • Recovering a crashed system was effectively impossible without disabling indexes first (not the default) • Replica sets never worked for us in testing, lots of inconsistencies in failure scenarios • Scattered records lead to lots of I/O that hurt on bad disks (EC2)
  • 71. Cassandra at Urban Airship • Summer of 2010 - no faith left in MongoDB, started a migration to Cassandra
  • 72. Cassandra at Urban Airship • Summer of 2010 - no faith left in MongoDB, started a migration to Cassandra • Lots of L&P testing, client analysis, etc.
  • 73. Cassandra at Urban Airship • Summer of 2010 - no faith left in MongoDB, started a migration to Cassandra • Lots of L&P testing, client analysis, etc. • December 2010 - Cassandra backed 85% of our Android stack’s persistence
  • 74. Cassandra at Urban Airship • Summer of 2010 - no faith left in MongoDB, started a migration to Cassandra • Lots of L&P testing, client analysis, etc. • December 2010 - Cassandra backed 85% of our Android stack’s persistence • Six EC2 XLs with each serving:
  • 75. Cassandra at Urban Airship • Summer of 2010 - no faith left in MongoDB, started a migration to Cassandra • Lots of L&P testing, client analysis, etc. • December 2010 - Cassandra backed 85% of our Android stack’s persistence • Six EC2 XLs with each serving: • 30GB data
  • 76. Cassandra at Urban Airship • Summer of 2010 - no faith left in MongoDB, started a migration to Cassandra • Lots of L&P testing, client analysis, etc. • December 2010 - Cassandra backed 85% of our Android stack’s persistence • Six EC2 XLs with each serving: • 30GB data • ~1000 reads/second/node
  • 77. Cassandra at Urban Airship • Summer of 2010 - no faith left in MongoDB, started a migration to Cassandra • Lots of L&P testing, client analysis, etc. • December 2010 - Cassandra backed 85% of our Android stack’s persistence • Six EC2 XLs with each serving: • 30GB data • ~1000 reads/second/node • ~750 writes/second/node
  • 79. Cassandra at Urban Airship • Why Cassandra?
  • 80. Cassandra at Urban Airship • Why Cassandra? • Well suited for most of our data model (simple DAGs)
  • 81. Cassandra at Urban Airship • Why Cassandra? • Well suited for most of our data model (simple DAGs) • Lots of UUIDs and hashes partition well
  • 82. Cassandra at Urban Airship • Why Cassandra? • Well suited for most of our data model (simple DAGs) • Lots of UUIDs and hashes partition well • Retrievals don’t need ordering beyond keys or TSD
  • 83. Cassandra at Urban Airship • Why Cassandra? • Well suited for most of our data model (simple DAGs) • Lots of UUIDs and hashes partition well • Retrievals don’t need ordering beyond keys or TSD • Rolling upgrades FTW
  • 84. Cassandra at Urban Airship • Why Cassandra? • Well suited for most of our data model (simple DAGs) • Lots of UUIDs and hashes partition well • Retrievals don’t need ordering beyond keys or TSD • Rolling upgrades FTW • Dynamic rebalancing and node addition
  • 85. Cassandra at Urban Airship • Why Cassandra? • Well suited for most of our data model (simple DAGs) • Lots of UUIDs and hashes partition well • Retrievals don’t need ordering beyond keys or TSD • Rolling upgrades FTW • Dynamic rebalancing and node addition • Column TTLs huge for us
  • 86. Cassandra at Urban Airship • Why Cassandra? • Well suited for most of our data model (simple DAGs) • Lots of UUIDs and hashes partition well • Retrievals don’t need ordering beyond keys or TSD • Rolling upgrades FTW • Dynamic rebalancing and node addition • Column TTLs huge for us • Awesome community :)
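
A minimal pycassa sketch of the column TTL feature called out above; the keyspace, column family, and key names are hypothetical. The TTL is set per column at write time, so stale data expires without any external purge job.

    import pycassa

    pool = pycassa.ConnectionPool('Transactional', ['cass01:9160'])   # hypothetical
    checkins = pycassa.ColumnFamily(pool, 'DeviceCheckins')

    # The last_checkin column is removed automatically after 30 days.
    checkins.insert('device:abc123',
                    {'last_checkin': '2011-07-01T12:00:00Z'},
                    ttl=30 * 24 * 3600)
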
  • 88. Cassandra at Urban Airship • Why Cassandra cont’d?
  • 89. Cassandra at Urban Airship • Why Cassandra cont’d? • Particularly well suited to working around EC2 availability
  • 90. Cassandra at Urban Airship • Why Cassandra cont’d? • Particularly well suited to working around EC2 availability • Needed a cross AZ strategy - we had seen EBS issues in the past, didn’t trust fault containment w/n a zone
  • 91. Cassandra at Urban Airship • Why Cassandra cont’d? • Particularly well suited to working around EC2 availability • Needed a cross AZ strategy - we had seen EBS issues in the past, didn’t trust fault containment w/n a zone • Didn’t want locality of replication so needed to stripe across AZs
  • 92. Cassandra at Urban Airship • Why Cassandra cont’d? • Particularly well suited to working around EC2 availability • Needed a cross AZ strategy - we had seen EBS issues in the past, didn’t trust fault containment w/n a zone • Didn’t want locality of replication so needed to stripe across AZs • Read repair and handoff generally did the right thing when a node would flap (Ubuntu #708920)
  • 93. Cassandra at Urban Airship • Why Cassandra cont’d? • Particularly well suited to working around EC2 availability • Needed a cross AZ strategy - we had seen EBS issues in the past, didn’t trust fault containment w/n a zone • Didn’t want locality of replication so needed to stripe across AZs • Read repair and handoff generally did the right thing when a node would flap (Ubuntu #708920) • No SPoF
  • 94. Cassandra at Urban Airship • Why Cassandra cont’d? • Particularly well suited to working around EC2 availability • Needed a cross AZ strategy - we had seen EBS issues in the past, didn’t trust fault containment w/n a zone • Didn’t want locality of replication so needed to stripe across AZs • Read repair and handoff generally did the right thing when a node would flap (Ubuntu #708920) • No SPoF • Ability to alter CLs on a per operation basis
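
A hedged sketch of the per-operation consistency levels mentioned above, using pycassa with hypothetical keyspace and column family names: the same client can read at ONE where latency matters most and write at QUORUM where consistency matters most.

    import pycassa
    from pycassa.cassandra.ttypes import ConsistencyLevel   # Thrift-generated enum

    pool = pycassa.ConnectionPool('Transactional', ['cass01:9160'])
    devices = pycassa.ColumnFamily(pool, 'Devices')

    # Latency-sensitive path: any single replica is good enough for a check-in read.
    row = devices.get('device:abc123', read_consistency_level=ConsistencyLevel.ONE)

    # Correctness-sensitive path: a quorum write, so a later quorum read is
    # guaranteed to observe it even if one replica is flapping.
    devices.insert('device:abc123', {'active': 'true'},
                   write_consistency_level=ConsistencyLevel.QUORUM)
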
  • 95. Battle Scars - Development
  • 96. Battle Scars - Development • Know your data model
  • 97. Battle Scars - Development • Know your data model • Creating indexes after the fact is a PITA
  • 98. Battle Scars - Development • Know your data model • Creating indexes after the fact is a PITA • Design around wide rows
  • 99. Battle Scars - Development • Know your data model • Creating indexes after the fact is a PITA • Design around wide rows • I/O problems
  • 100. Battle Scars - Development • Know your data model • Creating indexes after the fact is a PITA • Design around wide rows • I/O problems • Thrift problems
  • 101. Battle Scars - Development • Know your data model • Creating indexes after the fact is a PITA • Design around wide rows • I/O problems • Thrift problems • Count problems
  • 102. Battle Scars - Development • Know your data model • Creating indexes after the fact is a PITA • Design around wide rows • I/O problems • Thrift problems • Count problems • Favor JSON over packed binaries if possible
  • 103. Battle Scars - Development • Know your data model • Creating indexes after the fact is a PITA • Design around wide rows • I/O problems • Thrift problems • Count problems • Favor JSON over packed binaries if possible • Careful with Thrift in the stack
  • 104. Battle Scars - Development • Know your data model • Creating indexes after the fact is a PITA • Design around wide rows • I/O problems • Thrift problems • Count problems • Favor JSON over packed binaries if possible • Careful with Thrift in the stack • Don’t fear the StorageProxy
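
A sketch of the wide-row plus JSON-values pattern the slides above argue for, assuming pycassa and a column family whose comparator is TimeUUIDType; all names are illustrative. One row per device with time-ordered column names means fetching recent messages is a single-row slice rather than an index scan.

    import json
    import time
    import pycassa
    from pycassa.util import convert_time_to_uuid

    pool = pycassa.ConnectionPool('Transactional', ['cass01:9160'])
    messages = pycassa.ColumnFamily(pool, 'MessagesByDevice')   # comparator: TimeUUIDType

    # Append one delivery record to the device's wide row; the column name is
    # a TimeUUID so columns stay sorted by time, the value is readable JSON.
    col = convert_time_to_uuid(time.time())
    messages.insert('device:abc123',
                    {col: json.dumps({'push_id': 42, 'status': 'delivered'})})

    # Newest 50 records, from a single row, without any secondary index.
    recent = messages.get('device:abc123', column_count=50, column_reversed=True)
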
  • 105. Battle Scars - Development
  • 106. Battle Scars - Development • Assume failure in the client
  • 107. Battle Scars - Development • Assume failure in the client • Read timeout vs. connection refused
  • 108. Battle Scars - Development • Assume failure in the client • Read timeout vs. connection refused • When maintaining your own indexes, try and clean up after failure
  • 109. Battle Scars - Development • Assume failure in the client • Read timeout vs. connection refused • When maintaining your own indexes, try and clean up after failure • Be ready to clean up inconsistencies anyway
  • 110. Battle Scars - Development • Assume failure in the client • Read timeout vs. connection refused • When maintaining your own indexes, try and clean up after failure • Be ready to clean up inconsistencies anyway • Verify client library assumptions and exception handling
  • 111. Battle Scars - Development • Assume failure in the client • Read timeout vs. connection refused • When maintaining your own indexes, try and clean up after failure • Be ready to clean up inconsistencies anyway • Verify client library assumptions and exception handling • Retry now vs. retry later?
  • 112. Battle Scars - Development • Assume failure in the client • Read timeout vs. connection refused • When maintaining your own indexes, try and clean up after failure • Be ready to clean up inconsistencies anyway • Verify client library assumptions and exception handling • Retry now vs. retry later? • Compensating action during failures?
  • 113. Battle Scars - Development • Assume failure in the client • Read timeout vs. connection refused • When maintaining your own indexes, try and clean up after failure • Be ready to clean up inconsistencies anyway • Verify client library assumptions and exception handling • Retry now vs. retry later? • Compensating action during failures? • Don’t avoid the Cassandra code
  • 114. Battle Scars - Development • Assume failure in the client • Read timeout vs. connection refused • When maintaining your own indexes, try and clean up after failure • Be ready to clean up inconsistencies anyway • Verify client library assumptions and exception handling • Retry now vs. retry later? • Compensating action during failures? • Don’t avoid the Cassandra code • Embed for testing
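
A hedged sketch of the client-side failure handling described above, using the Thrift exception types pycassa surfaces (import paths and names follow pycassa 1.x and may differ by version). Whether to retry now, retry later, or run a compensating action depends on whether the write is idempotent; a timeout only means the outcome is unknown.

    import time
    import pycassa
    from pycassa.cassandra.ttypes import TimedOutException, UnavailableException

    pool = pycassa.ConnectionPool('Transactional', ['cass01:9160'])
    devices = pycassa.ColumnFamily(pool, 'Devices')

    def register_device(device_id, alias, attempts=3):
        # This write is idempotent (same key, same columns), so an immediate
        # retry after a timeout is safe; a non-idempotent operation would need
        # a compensating action or a deferred retry queue instead.
        for attempt in range(attempts):
            try:
                devices.insert(device_id, {'alias': alias})
                return True
            except (TimedOutException, UnavailableException):
                time.sleep(0.1 * (attempt + 1))   # brief backoff, then retry now
        return False   # give up; hand off to a retry-later path
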
  • 116. Battle Scars - Ops • Cassandra in EC2:
  • 117. Battle Scars - Ops • Cassandra in EC2: • Ensure Dynamic Snitch is enabled
  • 118. Battle Scars - Ops • Cassandra in EC2: • Ensure Dynamic Snitch is enabled • Disk I/O
  • 119. Battle Scars - Ops • Cassandra in EC2: • Ensure Dynamic Snitch is enabled • Disk I/O • Avoid EBS except for snapshot backups or use S3
  • 120. Battle Scars - Ops • Cassandra in EC2: • Ensure Dynamic Snitch is enabled • Disk I/O • Avoid EBS except for snapshot backups or use S3 • Stripe ephemerals, not EBS volumes
  • 121. Battle Scars - Ops • Cassandra in EC2: • Ensure Dynamic Snitch is enabled • Disk I/O • Avoid EBS except for snapshot backups or use S3 • Stripe ephemerals, not EBS volumes • Avoid smaller instances altogether
  • 122. Battle Scars - Ops • Cassandra in EC2: • Ensure Dynamic Snitch is enabled • Disk I/O • Avoid EBS except for snapshot backups or use S3 • Stripe ephemerals, not EBS volumes • Avoid smaller instances altogether • Don’t always assume traversing close proximity AZs is more expensive
  • 123. Battle Scars - Ops • Cassandra in EC2: • Ensure Dynamic Snitch is enabled • Disk I/O • Avoid EBS except for snapshot backups or use S3 • Stripe ephemerals, not EBS volumes • Avoid smaller instances altogether • Don’t always assume traversing close proximity AZs is more expensive • Balance RAM cost vs. the cost of additional hosts and spending time w/ GC logs
  • 125. Battle Scars - Ops • Java Best Practices:
  • 126. Battle Scars - Ops • Java Best Practices: • All Java services are managed via the same set of scripts
  • 127. Battle Scars - Ops • Java Best Practices: • All Java services are managed via the same set of scripts • In most cases, operators don’t treat Cassandra different from HBase
  • 128. Battle Scars - Ops • Java Best Practices: • All Java services are managed via the same set of scripts • In most cases, operators don’t treat Cassandra different from HBase • Simple mechanism to take thread or heap dump
  • 129. Battle Scars - Ops • Java Best Practices: • All Java services are managed via the same set of scripts • In most cases, operators don’t treat Cassandra different from HBase • Simple mechanism to take thread or heap dump • All logging is consistent - GC, application, stdx
  • 130. Battle Scars - Ops • Java Best Practices: • All Java services are managed via the same set of scripts • In most cases, operators don’t treat Cassandra different from HBase • Simple mechanism to take thread or heap dump • All logging is consistent - GC, application, stdx • Init scripts use the same scripts operators do
  • 131. Battle Scars - Ops • Java Best Practices: • All Java services are managed via the same set of scripts • In most cases, operators don’t treat Cassandra different from HBase • Simple mechanism to take thread or heap dump • All logging is consistent - GC, application, stdx • Init scripts use the same scripts operators do • Bare metal will rock your world
  • 132. Battle Scars - Ops • Java Best Practices: • All Java services are managed via the same set of scripts • In most cases, operators don’t treat Cassandra different from HBase • Simple mechanism to take thread or heap dump • All logging is consistent - GC, application, stdx • Init scripts use the same scripts operators do • Bare metal will rock your world • +UseLargePages will rock your world too
  • 133. Battle Scars - Ops [charts comparing bare metal vs. EC2 XL: ParNew GC effectiveness (MB collected), mean ParNew GC pause time (ms), and ParNew collection count]
  • 135. Battle Scars - Ops • Java Best Practices cont’d:
  • 136. Battle Scars - Ops • Java Best Practices cont’d: • Get familiar with GC logs (-XX:+PrintGCDetails)
  • 137. Battle Scars - Ops • Java Best Practices cont’d: • Get familiar with GC logs (-XX:+PrintGCDetails) • Understand what degenerate CMS collection looks like
  • 138. Battle Scars - Ops • Java Best Practices cont’d: • Get familiar with GC logs (-XX:+PrintGCDetails) • Understand what degenerate CMS collection looks like • We settled at -XX:CMSInitiatingOccupancyFraction=60
  • 139. Battle Scars - Ops • Java Best Practices cont’d: • Get familiar with GC logs (-XX:+PrintGCDetails) • Understand what degenerate CMS collection looks like • We settled at -XX:CMSInitiatingOccupancyFraction=60 • Possibly experiment with tenuring threshold
  • 140. Battle Scars - Ops • Java Best Practices cont’d: • Get familiar with GC logs (-XX:+PrintGCDetails) • Understand what degenerate CMS collection looks like • We settled at -XX:CMSInitiatingOccupancyFraction=60 • Possibly experiment with tenuring threshold • When in doubt take a thread dump
  • 141. Battle Scars - Ops • Java Best Practices cont’d: • Get familiar with GC logs (-XX:+PrintGCDetails) • Understand what degenerate CMS collection looks like • We settled at -XX:CMSInitiatingOccupancyFraction=60 • Possibly experiment with tenuring threshold • When in doubt take a thread dump • TDA (http://java.net/projects/tda/)
  • 142. Battle Scars - Ops • Java Best Practices cont’d: • Get familiar with GC logs (-XX:+PrintGCDetails) • Understand what degenerate CMS collection looks like • We settled at -XX:CMSInitiatingOccupancyFraction=60 • Possibly experiment with tenuring threshold • When in doubt take a thread dump • TDA (http://java.net/projects/tda/) • Eclipse MAT (http://www.eclipse.org/mat/)
  • 144. Battle Scars - Ops • Understand when to compact
  • 145. Battle Scars - Ops • Understand when to compact • Understand upgrade implications for datafiles
  • 146. Battle Scars - Ops • Understand when to compact • Understand upgrade implications for datafiles • Watch hinted handoff closely
  • 147. Battle Scars - Ops • Understand when to compact • Understand upgrade implications for datafiles • Watch hinted handoff closely • Monitor JMX religiously
  • 149. Looking Forward • Cassandra is a great hammer but not everything is a nail
  • 150. Looking Forward • Cassandra is a great hammer but not everything is a nail • Coprocessors would be awesome (hint hint)
  • 151. Looking Forward • Cassandra is a great hammer but not everything is a nail • Coprocessors would be awesome (hint hint) • Still spend too much time worrying about GC
  • 152. Looking Forward • Cassandra is a great hammer but not everything is a nail • Coprocessors would be awesome (hint hint) • Still spend too much time worrying about GC • Glad to see the ecosystem around the product evolving
  • 153. Looking Forward • Cassandra is a great hammer but not everything is a nail • Coprocessors would be awesome (hint hint) • Still spend too much time worrying about GC • Glad to see the ecosystem around the product evolving • CQL
  • 154. Looking Forward • Cassandra is a great hammer but not everything is a nail • Coprocessors would be awesome (hint hint) • Still spend too much time worrying about GC • Glad to see the ecosystem around the product evolving • CQL • Pig
  • 155. Looking Forward • Cassandra is a great hammer but not everything is a nail • Coprocessors would be awesome (hint hint) • Still spend too much time worrying about GC • Glad to see the ecosystem around the product evolving • CQL • Pig • Brisk
  • 156. Looking Forward • Cassandra is a great hammer but not everything is a nail • Coprocessors would be awesome (hint hint) • Still spend too much time worrying about GC • Glad to see the ecosystem around the product evolving • CQL • Pig • Brisk • Guardedly optimistic about off heap data management
  • 157. Thanks to • jbellis, driftx • Datastax • Whoever wrote TDA • SAP
  • 158. Thanks! • Urban Airship: http://urbanairship.com/ • We’re hiring! http://urbanairship.com/company/jobs/ • Me @eonnen or erik at
