6. Lessons learned
• partial in-memory file storage bug
• journal file on HDFS -> backup of the local master disk
• HDFS API
• RawTable in Shark
• persist(OFF_HEAP) for temporary storage
• RDD.persist(): OFF_HEAP outperforms MEMORY_AND_DISK_SER
• native API: getInStream(CACHE|NO_CACHE) -> local workers
• do not evict blocks when streaming to Tachyon/HDFS
• Tachyon > Spark JVM cache for long-running jobs
• Kryo/defaultCodec/SequenceFile format to minimize the memory footprint
• 25 million emails/month (2TB); 3-45 nodes; 120-170GB of RAM for Tachyon
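The journal-on-HDFS lesson above amounts to pointing the Tachyon master's journal at HDFS instead of the local disk, so losing the master node does not lose filesystem metadata. A minimal configuration sketch, assuming Tachyon-era property names; the namenode address is hypothetical:

```sh
# tachyon-env.sh (sketch): keep the master journal and under-filesystem on HDFS.
# Property names per Tachyon-era docs; hdfs://namenode:9000 is a placeholder.
export TACHYON_JAVA_OPTS="
  -Dtachyon.master.journal.folder=hdfs://namenode:9000/tachyon/journal
  -Dtachyon.underfs.address=hdfs://namenode:9000
"
```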
The physical architecture diagram for our largest customer deployment demonstrates the enterprise-grade attributes of the platform: scalability, high availability, performance, resilience and manageability, while providing means for geo-failover (warehouse), geo-replication (real-time DB), data and system monitoring, instrumentation, and backup & restore.
Cassandra rings are DC-replicated across the EC2 east- and west-coast regions; data between geo-replicas is synchronized in real time through an IPsec tunnel (VPC-to-VPC).
Serving the geo-replicated APIs behind an AWS Route 53 DNS service (latency-based resource record sets) and ELBs ensures that user requests are served from the closest geographic location. The failure of an entire region (which happened to us during a big conference!) does not affect our availability or SLAs.
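The latency-based routing described above corresponds to a pair of Route 53 resource record sets, one per region, sharing a name but distinguished by SetIdentifier; Route 53 answers each query with the record for the lowest-latency region. A sketch of the change batch, with hypothetical domain and ELB names:

```json
{
  "Comment": "Latency-based records for the geo-replicated API (sketch)",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com.",
        "Type": "CNAME",
        "SetIdentifier": "east",
        "Region": "us-east-1",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "api-east-elb.us-east-1.elb.amazonaws.com" }]
      }
    },
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com.",
        "Type": "CNAME",
        "SetIdentifier": "west",
        "Region": "us-west-1",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "api-west-elb.us-west-1.elb.amazonaws.com" }]
      }
    }
  ]
}
```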
User-facing dashboards are served from Cassandra (the real-time store), with data exported from a data warehouse (Shark/Hive) built on top of a Mesos-managed Spark/Hadoop cluster.
Export jobs are instrumented and provide a throttling mechanism to control throughput.
Export jobs run on the east coast only; data is synchronized in real time with the west-coast ring. Generated APIs are automatically instrumented (Graphite) and monitored (Nagios).
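The throttling mechanism mentioned above can be sketched as a simple rate cap: between batches, the exporter sleeps just long enough to keep the observed write rate under a configured rows-per-second limit. A minimal sketch in plain Scala; the object and parameter names are hypothetical, not the actual xPatterns API:

```scala
// Sketch of an export throttle: given how many rows have been written and how
// much wall-clock time has elapsed, compute how long to sleep so the overall
// rate stays at or below maxRowsPerSec (protecting the Cassandra ring).
object ExportThrottle {
  def throttleDelayMs(rowsWritten: Long, elapsedMs: Long, maxRowsPerSec: Long): Long = {
    // Minimum elapsed time required for rowsWritten at the allowed rate.
    val minElapsedMs = (rowsWritten * 1000L) / maxRowsPerSec
    math.max(0L, minElapsedMs - elapsedMs)
  }
}
```

An export loop would call `throttleDelayMs` after each batch and `Thread.sleep` for the returned duration; a delay of 0 means the job is already under the cap.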
Referral Provider Network (RPN): one of the six applications we built for our healthcare customer using the xPatterns APIs and tools on the new beyond-Hadoop infrastructure: the ELT pipeline and the export-to-NoSQL API. The dashboard for the RPN application was built with D3.js and Angular against the generic API published by the export tool.
The application allows building a graph of downstream and upstream referred and referring providers, grouped by specialty and annotated with computed aggregates such as patient counts, claim counts and total charged amounts. RPN is used both for fraud detection and for aiding a clinic-buying decision by following the busiest graph paths.
The dataset behind the app consists of 8 billion medical records, from which we extracted 1.7 million providers (Shark warehouse) and built 53 million relationships in the graph (persisted in Cassandra).
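The per-specialty aggregates described above can be sketched as a roll-up over referral edges. A minimal sketch in plain Scala (the `Referral` case class and field names are hypothetical illustrations, not the actual warehouse schema); in production this shape of aggregation ran over the Shark warehouse before the results were persisted to Cassandra:

```scala
// One edge in the referral graph: a referring provider, a referred-to
// provider, the target's specialty, and the aggregates attached to the edge.
case class Referral(from: String, to: String, specialty: String,
                    patients: Long, claims: Long, charged: Double)

object Rpn {
  // Roll up patient counts, claim counts and total charged amounts
  // per referred-to specialty, as shown on the RPN dashboard.
  def bySpecialty(edges: Seq[Referral]): Map[String, (Long, Long, Double)] =
    edges.groupBy(_.specialty).map { case (spec, es) =>
      spec -> ((es.map(_.patients).sum, es.map(_.claims).sum, es.map(_.charged).sum))
    }
}
```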