The Use Case for Cassandra at Ping Identity
How and why Ping Identity uses the Cassandra database inside PingOne.
By
Michael Ward, Site Reliability Engineer, On-Demand
Ping Identity
mward@pingidentity.com
@devoperandi
History: Cassandra, like most things at Ping, started out as a trial run. We implemented reporting for PingOne on Cassandra and let it bake. We wanted to see what direction it was going, get our feet wet, and see how it fit in with existing and future projects. We were also experimenting with MongoDB, and there was a great debate between Cassandra and MongoDB. Cassandra won due to:
- Write-anywhere technology
- More servers, each with smaller capacity
- Geographic distribution for data redundancy, availability, and performance
- Horizontal scalability
- No single point of failure
Remember to mention our migration from Mongo by year end. We haven't performed this migration yet.
Why? Built to provide insight into PingOne.
Why? SaaS applications are known for not providing logging and reporting information to their customers. We wanted to change that, and we continue to build this functionality out.
Reports range from the number of successful and failed SSOs to unique user access per application over any period of time, going back up to a year (see the sketch below).
Same schema – the use case still fits.
Client = Hector on the Thrift API.
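As a rough illustration only, here is a minimal CQL sketch of what such a reporting table could look like. The table and column names are hypothetical; our production schema is accessed through Hector over the Thrift API, and the CQL3 syntax shown here only became available to us later.

    -- Hypothetical column family in the reporting keyspace:
    -- SSO events partitioned by application, clustered by time,
    -- so "last N days/months" report queries read one contiguous slice.
    CREATE TABLE sso_events (
        app_id     text,
        event_time timestamp,
        user_id    text,
        success    boolean,
        PRIMARY KEY (app_id, event_time)
    );

    -- Example report query: every SSO event for one application
    -- over a time window (up to a year back).
    SELECT user_id, success
      FROM sso_events
     WHERE app_id = 'app-123'
       AND event_time >= '2013-01-01'
       AND event_time <  '2014-01-01';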
Requirements:
- Geographically distributed
- Respectable performance
- No updates or deletes (repairs suck)
Benefits:
- Easy management, due to the requirements for the cluster
Limitations:
- One big ring
- Writes could start in DC1 and actually be written to DC2 (see the sketch below)
- Lopsided data
- No compression
- Reads were slow
- Nodes recovered over the WAN
- Lack of security
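For illustration, this is roughly what the "one big ring" layout implies at the keyspace level. The keyspace name and replication factor are hypothetical, and the example uses CQL3 syntax for readability even though this cluster predates it. With SimpleStrategy, replicas are placed by walking the single ring with no datacenter awareness, which is why a write coordinated in DC1 could end up stored in DC2.

    -- One big ring: SimpleStrategy ignores datacenters entirely.
    -- Replicas go to the next node(s) on the ring, wherever they live,
    -- so a row written in DC1 may have its replica(s) land in DC2.
    CREATE KEYSPACE reporting
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};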
This upgrade happened in two parts:
- First to v0.8
- Second to v1.1.2
After upgrading the cluster in place, we found out this wasn't a good idea:
- We missed out on compression
- Our data was still not evenly distributed
- Replication was set to one per DC
Started with 9 nodes in the cluster, with the intent to horizontally scale:
- 25-35% performance improvement on reads
- 5-10% performance improvement on writes
- Compression enabled a 50% reduction in data size
- Token offsets gave better data distribution
- Node recovery happened locally
- Multiple replicas per datacenter means we always read locally (see the sketch below)
- The first write always happens locally, so the application gets a faster response
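A sketch of the datacenter-aware layout this bought us, again in CQL3 syntax with hypothetical names and replica counts. NetworkTopologyStrategy places a full set of replicas in each datacenter, so reads and the first write can always be served locally, and node recovery streams from local peers instead of over the WAN.

    -- Replicas per datacenter (names and counts are illustrative).
    -- Each DC holds its own copies, so local reads/writes never have
    -- to cross the WAN, and rebuilds stream from local replicas.
    CREATE KEYSPACE reporting
      WITH replication = {'class': 'NetworkTopologyStrategy',
                          'DC1': 2,
                          'DC2': 2};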
Migration steps:
- Traffic first directed at the old cluster
- Take a snapshot of the cluster
- Push it to the new cluster
- Copy the schema from the old cluster to the new cluster
- Add Snappy compression (see the sketch below)
- Bulk load into the new cluster
- Switch traffic to the new cluster
- Replay logs from the central log server from the bulk-load time onward
Compression: we chose to stream the data into a new cluster to allow for compression. Steps: tar up the snapshot, push it to the new cluster, and stream it in using the bulk loader. Because we did this during the day, we knew consistency between the clusters would fall behind. We allowed this because we were able to replay the missed traffic into the cluster after the switch.
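The "copy schema and add Snappy compression" step amounts to recreating each column family on the new cluster with compression turned on before bulk loading. A hedged CQL3 sketch, reusing the same hypothetical table as above:

    -- Recreate the schema on the new cluster with Snappy enabled,
    -- then stream the snapshot SSTables in with the bulk loader;
    -- the loaded data lands in this compressed column family.
    CREATE TABLE sso_events (
        app_id     text,
        event_time timestamp,
        user_id    text,
        success    boolean,
        PRIMARY KEY (app_id, event_time)
    ) WITH compression = {'sstable_compression': 'SnappyCompressor'};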
Here is what our Reporting Cluster looks like on the front end
New cluster:
- Much easier to implement
- No manual token generation
- More efficient memory utilization
- Implemented secondary indexes
- Better data distribution via vnodes
- Devs wanted to take advantage of CQL3, implement the Astyanax client, and use atomic batches
- Ops wanted internal auth
Performance boosts in v1.2:
- Reduced memory footprint via the partition summary (the last on-heap memory structure)
- 15% read performance increase by including the 'UseTLAB' JVM flag (localizes object allocation in memory): https://blogs.oracle.com/jonthecollector/entry/the_real_thing
- Auto token generation: just set the number of token ranges you want per server
- Data distribution: more token ranges means the cluster is less likely to be unbalanced
- Memory utilization: compression metadata and bloom filters moved off-heap
- Atomic batches: if one write in the batch is successful, they all are (see the sketch below)
- Request tracing: allows for performance testing of individual queries against the database
- Authentication/authorization: hey, security around the cluster. Go figure.
- Less manual cluster rebalancing when using something other than the random partitioner
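A few of these v1.2-era features in hedged CQL3 form. Table, column, index, and user names are hypothetical, and vnodes themselves are enabled with num_tokens in cassandra.yaml rather than through CQL.

    -- Secondary index: query events by user as well as by application.
    CREATE INDEX sso_events_user_idx ON sso_events (user_id);

    -- Atomic (logged) batch: either every statement in the batch is
    -- applied or none of them are.
    BEGIN BATCH
      INSERT INTO sso_events (app_id, event_time, user_id, success)
        VALUES ('app-123', '2013-06-01 12:00:00', 'alice', true);
      INSERT INTO sso_events (app_id, event_time, user_id, success)
        VALUES ('app-456', '2013-06-01 12:00:00', 'alice', false);
    APPLY BATCH;

    -- Internal authentication/authorization.
    CREATE USER reporting_app WITH PASSWORD 'changeme' NOSUPERUSER;
    GRANT SELECT ON TABLE sso_events TO reporting_app;

    -- Request tracing is enabled per query (e.g. TRACING ON in cqlsh)
    -- to see where an individual read or write spends its time.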
We aren't currently using any row cache. The number of replicas per datacenter can actually reduce the effectiveness of row caching, since requests for the same row are spread across more replicas and each node's cache sees fewer repeat hits.
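Row caching is configured per column family; a hedged sketch of keeping it off (key cache only) for the hypothetical table above, using the 1.2-era string form of the caching property:

    -- Key cache only, no row cache: with several replicas per DC,
    -- reads for the same row are spread across replicas, so a row
    -- cache returns fewer repeat hits for the memory it consumes.
    ALTER TABLE sso_events WITH caching = 'keys_only';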