SlideShare a Scribd company logo
1 of 26
xPatterns on Spark, Shark,
Tachyon and Mesos
Seattle Spark Meetup
May 2014
2 Atigeo Confidential
• xPatterns Architecture
• xPatterns Infrastructure Evolution
• Ingestion API & GUI (Demo)
• Transformation API & GUI (Demo)
• Jaws Http SharkServer API & GUI (Demo)
• Export to NoSql API & GUI (Demo)
• xPatterns dashboard application (Demo)
• xPatterns monitoring and instrumentation (Demo)
• ELT pipeline rebuilt on BDAS: from 0.8.0 to 0.9.1
• Lessons Learned, Tips & Tricks
Agenda
3 Atigeo Confidential
4 Atigeo Confidential
5 Atigeo Confidential
• Hadoop -> Spark: faster distributed computing engine leveraging in-memory computation at a much lower
operational cost, machine learning primitives, simpler programming model (Scala, Python, Java), faster job
submission, shell for quick prototyping and testing, ideal for our iterative algorithms
• Hive -> Shark: interactive queries on large datasets have become reasonable requests (in-memory caching
yields 4-20x performance improvement, ELT script base migration required minimal effort (same familiar
HiveQL, with a few exceptions)
• NO resource manager - > Mesos: multiple workloads from multiple frameworks can co-exist and fairly
consume the cluster resources (policy based). More mature than YARN, allows us to separate production
from experimentation workloads, co-locates legacy Hadoop MR jobs, multiple Shark servers (Jaws), multiple
Spark Job servers, mixed Hive and Shark queries (ELT), and establish priority queues: no more
unmanageable contention and delayed execution while maximizing cluster utilization (dynamic scheduling)
• No Cache -> Tachyon: in-memory distributed file system, with HDFS backup, resilience through lineage
rather than replication, our out-of-process cache that survives Spark JVM restarts, allows for fine tuning
performance and experimenting against cached warehouse tables without reload. Faster than in process
cache due to delayed GC. Provides data sharing between multiple Spark/Shark jobs, efficient in-memory
columnar storage with compression support for minimal footprint
• Cloudera Manager Dashboards-> Ganglia: distributed monitoring system for dashboards with historical
metrics data (CPU, RAM, disk I/O, network I/O) and Spark/Hadoop metrics. This is a nice addition to our
Nagios (monitoring and alerts) and Graphite (instrumentation dashboards)
xPatterns Infrastructure Evolution
6 Atigeo Confidential
• Highly available, scalable and resilient distributed download tool exposed through Restful API
& GUI
• Supports encryption/decryption, compression/decompression, automatic backup & restore
(aws S3) and geo-failover (hdfs and S3 in both us-east and us-west ec2 regions)
• Support multiple input sources: sftp, S3 and 450+ sources through Talend Integration
• Configurable throughput (number of parallel Spark processors, in both fine-grained and
coarse-grained Mesos modes)
• File Transfer log and file transition state history for auditing purposes (pluggable persistence
model, Cassandra/hdfs), configurable alerts, reports
• Ingest + Backup: download + decompression + hdfs persistence + encryption + S3 upload
• Restore: S3 download + decryption + decompress + hdfs persistence
• Geo-failover: backup on S3 us-east + restore from S3 us-east into west-coast hdfs + backup on
S3 us-west
• Ingestion jobs can be resumed from any stage after failure (# of Spark task retries exhausted)
• Http streaming API exposed for high-throughput push model ingestion (ingestion into Kafka
pub-sub, batch Spark job for transfer into hdfs)
Distributed Data Ingestion API & GUI
7 Atigeo Confidential
8 Atigeo Confidential
T-Component API & GUI
• Data Transformation component for building a data pipeline with monitoring and quality gates
• Exposes all of Oozie’s action types and adds Spark (Java & Scala) and Shark (QL) stages
• Uses patched Ooyala SparkJobServer (multiple contexts in same JVM bug fixed by us!)
• Spark stage required to run code that accepts an xPatterns-managed Spark context (coarse-grained or
fine-grained) as parameter
• DAG and job execution info persistence in Hive Metastore
• Exposes full API for job, stages, resources management and scheduled pipeline execution
• Demo: submit a Spark driver program as Spark stage for transforming ingested hdfs files, submit a Shar
stage for creating a Shark table and further transforming the datasets through HiveQL statements
9 Atigeo Confidential
• T-component DAG executed by Oozie
• Spark and Shark stages executed through ssh actions
• Spark stage sent to SparkJobServer
• SharSk stage executed through shark CLI for now (SharkServer2 in the future)
• Support for pySpark stage coming soon
10 Atigeo Confidential
• Jaws: a highly scalable and resilient restful (http) interface on top of a managed Shark session
that can concurrently and asynchronously submit Shark queries, return persisted results
(automatically limited in size or paged), execution logs and job information (Cassandra or hdfs
persisted).
• Jaws can be load balanced for higher availability and scalability and it fuels a web-based GUI
that is integrated in the xPatterns Management Console (Warehouse Explorer)
• Jaws exposes configuration options for fine-tuning Spark & Shark performance and running
against a stand-alone Spark deployment, with or without Tachyon as in-memory distributed
file system on top of HDFS, and with or without Mesos as resource manager
• Provides different deployment recipes for all combinations of Spark, Mesos and Tachyon
• Shark editor provides analysts, data scientists with a view into the warehouse through a
metadata explorer, provides a query editor with intelligent features like auto-complete, a
results viewer, logs viewer and historical queries for asynchronously retrieving persisted
results, logs and query information for both running and historical queries
• Adding web-style pagination and query cancellation, spray io http layer (REST on Akka)
Jaws REST SharkServer & GUI
11 Atigeo Confidential
Jaws REST SharkServer & GUI
12 Atigeo Confidential
13 Atigeo Confidential
Export to NoSql API
• Datasets in the warehouse need to be exposed to high-throughput low-latency real-time
APIs. Each application requires extra processing performed on top of the core datasets,
hence additional transformations are executed for building data marts inside the
warehouse
• Exporter tool builds the efficient data model and runs an export of data from a Shark/Hive
table to a Cassandra Column Family, through a custom Spark job with configurable
throughput (configurable Spark processors against a Cassandra ring) (instrumentation
dashboard embedded, logs, progress and instrumentation events pushed though SSE)
• Data Modeling is driven by the read access patterns provided by an application engineer
building dashboards and visualizations: lookup key, columns (record fields to read), paging,
sorting, filtering
• The end result of a job run is a REST API endpoint (instrumented, monitored, resilient, geo-
replicated) that uses the underlying generated Cassandra data model and fuels the data in
the dashboards
• Configuration API provided for creating export jobs and executing them (ad-hoc or
scheduled).
14 Atigeo Confidential
15 Atigeo Confidential
Mesos/Spark cluster
16 Atigeo Confidential
Cassandra multi DC ring – write latency
17 Atigeo Confidential
Nagios monitoring
18 Atigeo Confidential
19 Atigeo Confidential
Referral Provider Network
• One of the many applications that we built for our largest healthcare customers using
the xPatterns APIs and tools on the new upgraded infrastructure: ELT Pipeline, Jaws,
Export to NoSql API. The dashboard for the RPN application was built using D3.js and
angular against the generic api published by the export tool.
• The application allows for building a graph of downstream and upstream referred and
referring providers, grouped by specialty, with computed aggregates like patient counts,
claim counts and total charged amounts. RPN is used for both fraud detection and for
aiding a clinic buying decision, by following the busiest graph paths.
• The dataset behind the app consists of 8 billion medical records, from which we
extracted 1.7 million providers (Shark warehouse) and built 53 million relationships in
the graph (persisted in Cassandra)
• While we demo the graph building we will also look at the Graphite instrumentation
dashboard for analyzing the runtime performance of the geo-replicated Cassandra read
operations (latency in the 20-50ms range)
20 Atigeo Confidential
21 Atigeo Confidential
Graphite – Cassandra multi DC ring
22 Atigeo Confidential
• 20 billion healthcare records, 200 TB of compressed hdfs data
• Processing pipeline, a mixture of custom MR and mostly Hive scripts, converted to Spark and
Shark, with performance gains of 3-4x (for disk intensive operations) to 20-40x for queries on
cached tables (Spark cache or Tachyon which is slightly faster with added resilience benefits)
• Daily processing reduced from 14 hours to 1.5hours!
• Shark 0.8.1 does not support: map join auto-conversion, automatic calculation of number of
reducers, reducer or map out phase disk spills, skew joins etc … we have to either manually
fine tune the cluster and the query based on the specific dataset, or we are better off with
Hive under these circumstances … so we use Mesos to manage Hadoop and Spark under the
same cluster, mixing Hive and Shark workloads (demo)
• 0.9.0 fixes many of the problems, but still requires patches! (spill & mesos fine-grained)
• Tested against multiple cluster configurations of the same cost, using 3 types of instances:
m1.xlarge (4c x 15GB), m2.4xlarge (8c x 68.4GB) and cc2.8xlarge (32c x 60.8GB).
• Jaws config settings explained: set mapreduce.job.reduces=…, set
shark.column.compress=true, spark.default.parallelism=384,
spark.storage.memoryFraction=0.3, spark.shuffle.memoryFraction=0.6,
spark.shuffle.consolidateFiles=true, spark.shuffle.spill=false|true, spark.mesos.coarse=false,
spark.scheduler.mode=FAIR
• Multiple sparkContexts in the same JVM, Mesos framework starvation bug
ELT processing and data quality pipeline
23 Atigeo Confidential
• Export to Semantic Search API (solrCloud/lucene)
• pySpark Job Server
• pySpark  Shark/Tachyon interop (either)
• pySpark  Spark SQL (1.0) interop (or)
• Parquet columnar storage for warehouse data
Coming soon …
24 Atigeo Confidential
Q & A
Oh btw … we’re hiring!
© 2013 Atigeo, LLC. All rights reserved. Atigeo and the xPatterns logo are trademarks of Atigeo. The information herein is for informational purposes only and represents the current view of Atigeo as of the date of this
presentation. Because Atigeo must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Atigeo, and Atigeo cannot guarantee the accuracy of any information provided
after the date of this presentation. ATIGEO MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
26 Atigeo Confidential

More Related Content

What's hot

Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Evan Chan
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionLightbend
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Helena Edelson
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkRahul Kumar
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Robert "Chip" Senkbeil
 
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Real-Time Anomaly Detection  with Spark MLlib, Akka and  CassandraReal-Time Anomaly Detection  with Spark MLlib, Akka and  Cassandra
Real-Time Anomaly Detection with Spark MLlib, Akka and CassandraNatalino Busa
 
Building production spark streaming applications
Building production spark streaming applicationsBuilding production spark streaming applications
Building production spark streaming applicationsJoey Echeverria
 
Spark, Tachyon and Mesos internals
Spark, Tachyon and Mesos internalsSpark, Tachyon and Mesos internals
Spark, Tachyon and Mesos internalsClaudiu Barbura
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLhuguk
 
Lambda Architecture with Spark
Lambda Architecture with SparkLambda Architecture with Spark
Lambda Architecture with SparkKnoldus Inc.
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...Holden Karau
 
Real-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stackReal-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stackAnirvan Chakraborty
 
Reactive programming on Android
Reactive programming on AndroidReactive programming on Android
Reactive programming on AndroidTomáš Kypta
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleEvan Chan
 
Feeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and KafkaFeeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and KafkaDataStax Academy
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Spark Summit
 
Developing a Real-time Engine with Akka, Cassandra, and Spray
Developing a Real-time Engine with Akka, Cassandra, and SprayDeveloping a Real-time Engine with Akka, Cassandra, and Spray
Developing a Real-time Engine with Akka, Cassandra, and SprayJacob Park
 
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion DubaiSMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion DubaiCodemotion Dubai
 
Spark Summit EU talk by Steve Loughran
Spark Summit EU talk by Steve LoughranSpark Summit EU talk by Steve Loughran
Spark Summit EU talk by Steve LoughranSpark Summit
 

What's hot (20)

Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 
Reactive dashboard’s using apache spark
Reactive dashboard’s using apache sparkReactive dashboard’s using apache spark
Reactive dashboard’s using apache spark
 
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
Spark Kernel Talk - Apache Spark Meetup San Francisco (July 2015)
 
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Real-Time Anomaly Detection  with Spark MLlib, Akka and  CassandraReal-Time Anomaly Detection  with Spark MLlib, Akka and  Cassandra
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
 
Building production spark streaming applications
Building production spark streaming applicationsBuilding production spark streaming applications
Building production spark streaming applications
 
Spark, Tachyon and Mesos internals
Spark, Tachyon and Mesos internalsSpark, Tachyon and Mesos internals
Spark, Tachyon and Mesos internals
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
 
Lambda Architecture with Spark
Lambda Architecture with SparkLambda Architecture with Spark
Lambda Architecture with Spark
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
 
Real-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stackReal-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stack
 
Reactive programming on Android
Reactive programming on AndroidReactive programming on Android
Reactive programming on Android
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Feeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and KafkaFeeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and Kafka
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
 
Developing a Real-time Engine with Akka, Cassandra, and Spray
Developing a Real-time Engine with Akka, Cassandra, and SprayDeveloping a Real-time Engine with Akka, Cassandra, and Spray
Developing a Real-time Engine with Akka, Cassandra, and Spray
 
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion DubaiSMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
SMACK Stack - Fast Data Done Right by Stefan Siprell at Codemotion Dubai
 
Spark Summit EU talk by Steve Loughran
Spark Summit EU talk by Steve LoughranSpark Summit EU talk by Steve Loughran
Spark Summit EU talk by Steve Loughran
 

Viewers also liked

Reactive Streams and RabbitMQ
Reactive Streams and RabbitMQReactive Streams and RabbitMQ
Reactive Streams and RabbitMQmkiedys
 
Docker. Does it matter for Java developer ?
Docker. Does it matter for Java developer ?Docker. Does it matter for Java developer ?
Docker. Does it matter for Java developer ?Izzet Mustafaiev
 
Akka in Practice: Designing Actor-based Applications
Akka in Practice: Designing Actor-based ApplicationsAkka in Practice: Designing Actor-based Applications
Akka in Practice: Designing Actor-based ApplicationsNLJUG
 
Akka and AngularJS – Reactive Applications in Practice
Akka and AngularJS – Reactive Applications in PracticeAkka and AngularJS – Reactive Applications in Practice
Akka and AngularJS – Reactive Applications in PracticeRoland Kuhn
 
Reactive Jersey Client
Reactive Jersey ClientReactive Jersey Client
Reactive Jersey ClientMichal Gajdos
 
A Journey to Reactive Function Programming
A Journey to Reactive Function ProgrammingA Journey to Reactive Function Programming
A Journey to Reactive Function ProgrammingAhmed Soliman
 
Resilient Applications with Akka Persistence - Scaladays 2014
Resilient Applications with Akka Persistence - Scaladays 2014Resilient Applications with Akka Persistence - Scaladays 2014
Resilient Applications with Akka Persistence - Scaladays 2014Björn Antonsson
 
12 Factor App: Best Practices for JVM Deployment
12 Factor App: Best Practices for JVM Deployment12 Factor App: Best Practices for JVM Deployment
12 Factor App: Best Practices for JVM DeploymentJoe Kutner
 
Micro services, reactive manifesto and 12-factors
Micro services, reactive manifesto and 12-factorsMicro services, reactive manifesto and 12-factors
Micro services, reactive manifesto and 12-factorsDejan Glozic
 

Viewers also liked (9)

Reactive Streams and RabbitMQ
Reactive Streams and RabbitMQReactive Streams and RabbitMQ
Reactive Streams and RabbitMQ
 
Docker. Does it matter for Java developer ?
Docker. Does it matter for Java developer ?Docker. Does it matter for Java developer ?
Docker. Does it matter for Java developer ?
 
Akka in Practice: Designing Actor-based Applications
Akka in Practice: Designing Actor-based ApplicationsAkka in Practice: Designing Actor-based Applications
Akka in Practice: Designing Actor-based Applications
 
Akka and AngularJS – Reactive Applications in Practice
Akka and AngularJS – Reactive Applications in PracticeAkka and AngularJS – Reactive Applications in Practice
Akka and AngularJS – Reactive Applications in Practice
 
Reactive Jersey Client
Reactive Jersey ClientReactive Jersey Client
Reactive Jersey Client
 
A Journey to Reactive Function Programming
A Journey to Reactive Function ProgrammingA Journey to Reactive Function Programming
A Journey to Reactive Function Programming
 
Resilient Applications with Akka Persistence - Scaladays 2014
Resilient Applications with Akka Persistence - Scaladays 2014Resilient Applications with Akka Persistence - Scaladays 2014
Resilient Applications with Akka Persistence - Scaladays 2014
 
12 Factor App: Best Practices for JVM Deployment
12 Factor App: Best Practices for JVM Deployment12 Factor App: Best Practices for JVM Deployment
12 Factor App: Best Practices for JVM Deployment
 
Micro services, reactive manifesto and 12-factors
Micro services, reactive manifesto and 12-factorsMicro services, reactive manifesto and 12-factors
Micro services, reactive manifesto and 12-factors
 

Similar to Spark, Shark, Tachyon and Mesos Infrastructure Evolution for Big Data Analytics

xPatterns - Spark Summit 2014
xPatterns - Spark Summit   2014xPatterns - Spark Summit   2014
xPatterns - Spark Summit 2014Claudiu Barbura
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Cask Data
 
Streaming Solutions for Real time problems
Streaming Solutions for Real time problemsStreaming Solutions for Real time problems
Streaming Solutions for Real time problemsAbhishek Gupta
 
Lessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsLessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsClaudiu Barbura
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformApache Apex
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016StampedeCon
 
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015Iulia Emanuela Iancuta
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09Chris Purrington
 
Yet another intro to Apache Spark
Yet another intro to Apache SparkYet another intro to Apache Spark
Yet another intro to Apache SparkSimon Lia-Jonassen
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Mac Moore
 
BBL KAPPA Lesfurets.com
BBL KAPPA Lesfurets.comBBL KAPPA Lesfurets.com
BBL KAPPA Lesfurets.comCedric Vidal
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...Databricks
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoopnvvrajesh
 
Coherence RoadMap 2018
Coherence RoadMap 2018Coherence RoadMap 2018
Coherence RoadMap 2018harvraja
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)Spark Summit
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkVenkata Naga Ravi
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizonArtem Ervits
 

Similar to Spark, Shark, Tachyon and Mesos Infrastructure Evolution for Big Data Analytics (20)

xPatterns - Spark Summit 2014
xPatterns - Spark Summit   2014xPatterns - Spark Summit   2014
xPatterns - Spark Summit 2014
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?
 
Streaming Solutions for Real time problems
Streaming Solutions for Real time problemsStreaming Solutions for Real time problems
Streaming Solutions for Real time problems
 
Lessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsLessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatterns
 
Cassandra in xPatterns
Cassandra in xPatternsCassandra in xPatterns
Cassandra in xPatterns
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
 
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Yet another intro to Apache Spark
Yet another intro to Apache SparkYet another intro to Apache Spark
Yet another intro to Apache Spark
 
Red Hat Storage Roadmap
Red Hat Storage RoadmapRed Hat Storage Roadmap
Red Hat Storage Roadmap
 
Red Hat Storage Roadmap
Red Hat Storage RoadmapRed Hat Storage Roadmap
Red Hat Storage Roadmap
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
 
BBL KAPPA Lesfurets.com
BBL KAPPA Lesfurets.comBBL KAPPA Lesfurets.com
BBL KAPPA Lesfurets.com
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Coherence RoadMap 2018
Coherence RoadMap 2018Coherence RoadMap 2018
Coherence RoadMap 2018
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizon
 

Recently uploaded

Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 

Recently uploaded (20)

Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 

Spark, Shark, Tachyon and Mesos Infrastructure Evolution for Big Data Analytics

  • 1. xPatterns on Spark, Shark, Tachyon and Mesos Seattle Spark Meetup May 2014
  • 2. 2 Atigeo Confidential • xPatterns Architecture • xPatterns Infrastructure Evolution • Ingestion API & GUI (Demo) • Transformation API & GUI (Demo) • Jaws Http SharkServer API & GUI (Demo) • Export to NoSql API & GUI (Demo) • xPatterns dashboard application (Demo) • xPatterns monitoring and instrumentation (Demo) • ELT pipeline rebuilt on BDAS: from 0.8.0 to 0.9.1 • Lessons Learned, Tips & Tricks Agenda
  • 5. 5 Atigeo Confidential • Hadoop -> Spark: faster distributed computing engine leveraging in-memory computation at a much lower operational cost, machine learning primitives, simpler programming model (Scala, Python, Java), faster job submission, shell for quick prototyping and testing, ideal for our iterative algorithms • Hive -> Shark: interactive queries on large datasets have become reasonable requests (in-memory caching yields 4-20x performance improvement, ELT script base migration required minimal effort (same familiar HiveQL, with a few exceptions) • NO resource manager - > Mesos: multiple workloads from multiple frameworks can co-exist and fairly consume the cluster resources (policy based). More mature than YARN, allows us to separate production from experimentation workloads, co-locates legacy Hadoop MR jobs, multiple Shark servers (Jaws), multiple Spark Job servers, mixed Hive and Shark queries (ELT), and establish priority queues: no more unmanageable contention and delayed execution while maximizing cluster utilization (dynamic scheduling) • No Cache -> Tachyon: in-memory distributed file system, with HDFS backup, resilience through lineage rather than replication, our out-of-process cache that survives Spark JVM restarts, allows for fine tuning performance and experimenting against cached warehouse tables without reload. Faster than in process cache due to delayed GC. Provides data sharing between multiple Spark/Shark jobs, efficient in-memory columnar storage with compression support for minimal footprint • Cloudera Manager Dashboards-> Ganglia: distributed monitoring system for dashboards with historical metrics data (CPU, RAM, disk I/O, network I/O) and Spark/Hadoop metrics. This is a nice addition to our Nagios (monitoring and alerts) and Graphite (instrumentation dashboards) xPatterns Infrastructure Evolution
  • 6. 6 Atigeo Confidential • Highly available, scalable and resilient distributed download tool exposed through Restful API & GUI • Supports encryption/decryption, compression/decompression, automatic backup & restore (aws S3) and geo-failover (hdfs and S3 in both us-east and us-west ec2 regions) • Support multiple input sources: sftp, S3 and 450+ sources through Talend Integration • Configurable throughput (number of parallel Spark processors, in both fine-grained and coarse-grained Mesos modes) • File Transfer log and file transition state history for auditing purposes (pluggable persistence model, Cassandra/hdfs), configurable alerts, reports • Ingest + Backup: download + decompression + hdfs persistence + encryption + S3 upload • Restore: S3 download + decryption + decompress + hdfs persistence • Geo-failover: backup on S3 us-east + restore from S3 us-east into west-coast hdfs + backup on S3 us-west • Ingestion jobs can be resumed from any stage after failure (# of Spark task retries exhausted) • Http streaming API exposed for high-throughput push model ingestion (ingestion into Kafka pub-sub, batch Spark job for transfer into hdfs) Distributed Data Ingestion API & GUI
  • 8. 8 Atigeo Confidential T-Component API & GUI • Data Transformation component for building a data pipeline with monitoring and quality gates • Exposes all of Oozie’s action types and adds Spark (Java & Scala) and Shark (QL) stages • Uses patched Ooyala SparkJobServer (multiple contexts in same JVM bug fixed by us!) • Spark stage required to run code that accepts an xPatterns-managed Spark context (coarse-grained or fine-grained) as parameter • DAG and job execution info persistence in Hive Metastore • Exposes full API for job, stages, resources management and scheduled pipeline execution • Demo: submit a Spark driver program as Spark stage for transforming ingested hdfs files, submit a Shar stage for creating a Shark table and further transforming the datasets through HiveQL statements
  • 9. 9 Atigeo Confidential • T-component DAG executed by Oozie • Spark and Shark stages executed through ssh actions • Spark stage sent to SparkJobServer • SharSk stage executed through shark CLI for now (SharkServer2 in the future) • Support for pySpark stage coming soon
  • 10. 10 Atigeo Confidential • Jaws: a highly scalable and resilient restful (http) interface on top of a managed Shark session that can concurrently and asynchronously submit Shark queries, return persisted results (automatically limited in size or paged), execution logs and job information (Cassandra or hdfs persisted). • Jaws can be load balanced for higher availability and scalability and it fuels a web-based GUI that is integrated in the xPatterns Management Console (Warehouse Explorer) • Jaws exposes configuration options for fine-tuning Spark & Shark performance and running against a stand-alone Spark deployment, with or without Tachyon as in-memory distributed file system on top of HDFS, and with or without Mesos as resource manager • Provides different deployment recipes for all combinations of Spark, Mesos and Tachyon • Shark editor provides analysts, data scientists with a view into the warehouse through a metadata explorer, provides a query editor with intelligent features like auto-complete, a results viewer, logs viewer and historical queries for asynchronously retrieving persisted results, logs and query information for both running and historical queries • Adding web-style pagination and query cancellation, spray io http layer (REST on Akka) Jaws REST SharkServer & GUI
  • 11. 11 Atigeo Confidential Jaws REST SharkServer & GUI
  • 13. 13 Atigeo Confidential Export to NoSql API • Datasets in the warehouse need to be exposed to high-throughput low-latency real-time APIs. Each application requires extra processing performed on top of the core datasets, hence additional transformations are executed for building data marts inside the warehouse • Exporter tool builds the efficient data model and runs an export of data from a Shark/Hive table to a Cassandra Column Family, through a custom Spark job with configurable throughput (configurable Spark processors against a Cassandra ring) (instrumentation dashboard embedded, logs, progress and instrumentation events pushed though SSE) • Data Modeling is driven by the read access patterns provided by an application engineer building dashboards and visualizations: lookup key, columns (record fields to read), paging, sorting, filtering • The end result of a job run is a REST API endpoint (instrumented, monitored, resilient, geo- replicated) that uses the underlying generated Cassandra data model and fuels the data in the dashboards • Configuration API provided for creating export jobs and executing them (ad-hoc or scheduled).
  • 16. 16 Atigeo Confidential Cassandra multi DC ring – write latency
  • 19. 19 Atigeo Confidential Referral Provider Network • One of the many applications that we built for our largest healthcare customers using the xPatterns APIs and tools on the new upgraded infrastructure: ELT Pipeline, Jaws, Export to NoSql API. The dashboard for the RPN application was built using D3.js and angular against the generic api published by the export tool. • The application allows for building a graph of downstream and upstream referred and referring providers, grouped by specialty, with computed aggregates like patient counts, claim counts and total charged amounts. RPN is used for both fraud detection and for aiding a clinic buying decision, by following the busiest graph paths. • The dataset behind the app consists of 8 billion medical records, from which we extracted 1.7 million providers (Shark warehouse) and built 53 million relationships in the graph (persisted in Cassandra) • While we demo the graph building we will also look at the Graphite instrumentation dashboard for analyzing the runtime performance of the geo-replicated Cassandra read operations (latency in the 20-50ms range)
  • 21. 21 Atigeo Confidential Graphite – Cassandra multi DC ring
  • 22. 22 Atigeo Confidential • 20 billion healthcare records, 200 TB of compressed hdfs data • Processing pipeline, a mixture of custom MR and mostly Hive scripts, converted to Spark and Shark, with performance gains of 3-4x (for disk intensive operations) to 20-40x for queries on cached tables (Spark cache or Tachyon which is slightly faster with added resilience benefits) • Daily processing reduced from 14 hours to 1.5hours! • Shark 0.8.1 does not support: map join auto-conversion, automatic calculation of number of reducers, reducer or map out phase disk spills, skew joins etc … we have to either manually fine tune the cluster and the query based on the specific dataset, or we are better off with Hive under these circumstances … so we use Mesos to manage Hadoop and Spark under the same cluster, mixing Hive and Shark workloads (demo) • 0.9.0 fixes many of the problems, but still requires patches! (spill & mesos fine-grained) • Tested against multiple cluster configurations of the same cost, using 3 types of instances: m1.xlarge (4c x 15GB), m2.4xlarge (8c x 68.4GB) and cc2.8xlarge (32c x 60.8GB). • Jaws config settings explained: set mapreduce.job.reduces=…, set shark.column.compress=true, spark.default.parallelism=384, spark.storage.memoryFraction=0.3, spark.shuffle.memoryFraction=0.6, spark.shuffle.consolidateFiles=true, spark.shuffle.spill=false|true, spark.mesos.coarse=false, spark.scheduler.mode=FAIR • Multiple sparkContexts in the same JVM, Mesos framework starvation bug ELT processing and data quality pipeline
  • 23. 23 Atigeo Confidential • Export to Semantic Search API (solrCloud/lucene) • pySpark Job Server • pySpark  Shark/Tachyon interop (either) • pySpark  Spark SQL (1.0) interop (or) • Parquet columnar storage for warehouse data Coming soon …
  • 24. 24 Atigeo Confidential Q & A Oh btw … we’re hiring!
  • 25. © 2013 Atigeo, LLC. All rights reserved. Atigeo and the xPatterns logo are trademarks of Atigeo. The information herein is for informational purposes only and represents the current view of Atigeo as of the date of this presentation. Because Atigeo must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Atigeo, and Atigeo cannot guarantee the accuracy of any information provided after the date of this presentation. ATIGEO MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Editor's Notes

  1. the logical architecture diagram with the 3 logical layers of xPatterns: Infrastructure, Analytics and Visualization and the roles: ELT Engineer, Data Scientist, Application Engineer. xPatterns is a big data analytics platform as a service that enables a rapid development of enterprise-grade analytical applications. It provides tools, api sets and a management console for building an ELT pipeline with data monitoring and quality gates, a data warehouse for ad-hoc and scheduled querying, analysis, model building and experimentation, tools for exporting data to NoSql and solrCloud cluster for real-time access through low-latency/high-throughput apis as well as dashboard and visualization api/tools leveraging the available data and models. In this presentation we will showcase one of the analytical applications build on top of xPatterns for our largest customer for that runs xPatterns in production on top a data warehouse consisting of several hundreds TB of medical, pharmacy and lab data records consisting of tens of billions of records. We will showcase the xPatterns components in the form of APIs and tools employed throughout the entire lifecycle of this application.”
  2. The physical architecture diagram for our largest customer deployment, demonstrating the enterprise-grade attributes of the platform: scalability, high availability, performance, resilience, manageability while providing means for geo-failover (warehouse), geo-replication (real-time DB), data and system monitoring, instrumentation, backup & restore. Cassandra rings are DC-replicated across EC2 east and west coast regions, data between geo-replicas synchronized in real time through an ipsec tunnel (VPC-to-VPC). Geo-replicated apis behind an AWS Route 53 DNS service (latency based resource records sets) and ELBs ensures users requests are served from the closest geographical location. Failure to an entire region (happened to us during a big conference!) does not affect our availability and SLAs. User facing dashboards are served from Cassandra (real-time store), with data being exported from a data warehouse (Shark/Hive) build on top a Mesos-managed Spark/Hadoop cluster. Export jobs are instrumented and provide a throttling mechanism to control throughput. Export jobs run on the east-coast only, data is synchronized in real time with the west coast ring. Generated apis are automatically instrumented (Graphite) and monitored (Nagios).
  3. Jaws: a highly scalable and resilient restful (http) interface on top of a managed Shark session that can concurrently and asynchronously submit Shark queries, return persisted results (automatically limited in size), execution logs and job information (Cassandra or hdfs persisted). Jaws can be load balanced for higher availability and scalability and it fuels a web-based GUI called Shark Editor that is integrated in the xPatterns Management Console Jaws exposes configuration options for fine-tuning Spark & Shark performance and running against a stand-alone Spark deployment, with or without Tachyon as in-memory distributed file system on top of HDFS, and with or without Mesos as resource manager Provides different deployment recipes for all combinations of Spark, Mesos and Tachyon Shark editor provides analysts, data scientists with a view into the warehouse through a metadata explorer, provides a query editor with intelligent features like auto-complete, a results viewer, logs viewer and historical queries for asynchronously retrieving persisted results, logs and query information for both running and historical queries (DEMO)
  4. Datasets in the warehouse need to be exposed to high-throughput low-latency real-time APIs. Each application requires extra processing performed on top of the core datasets, hence additional transformations are executed for building data marts inside the warehouse Pre-optimization Shark/Hive queries required for building an efficient data model for Cassandra persistence: minimal number of column families, wide rows (50-100 MB compressed). Resulting data model is efficient for both read (dashboard/API) and write (export/updates) requests Exporter tool builds the efficient data model and runs an export of data from a Shark/Hive table to a Cassandra Column Family, through a custom Spark job with configurable throughput (configurable Spark processors against a Cassandra ring) Data Modeling is driven by the read access patterns: lookup key, columns (record fields to read), paging, sorting, filtering. The data access patterns is used for automatically publishing a REST api that uses the underlying generated Cassandra data model and it fuels the data in the dashboards Execution logs behind workflows, progress report and instrumentation events for the dashboard are pushed to the browser through SSE (Zookeeper watchers used for synchronization)
  5. Datasets in the warehouse need to be exposed to high-throughput low-latency real-time APIs. Each application requires extra processing performed on top of the core datasets, hence additional transformations are executed for building data marts inside the warehouse Pre-optimization Shark/Hive queries required for building an efficient data model for Cassandra persistence: minimal number of column families, wide rows (50-100 MB compressed). Resulting data model is efficient for both read (dashboard/API) and write (export/updates) requests Exporter tool builds the efficient data model and runs an export of data from a Shark/Hive table to a Cassandra Column Family, through a custom Spark job with configurable throughput (configurable Spark processors against a Cassandra ring) Data Modeling is driven by the read access patterns: lookup key, columns (record fields to read), paging, sorting, filtering. The data access patterns is used for automatically publishing a REST api that uses the underlying generated Cassandra data model and it fuels the data in the dashboards Execution logs behind workflows, progress report and instrumentation events for the dashboard are pushed to the browser through SSE (Zookeeper watchers used for synchronization)
  6. Mesos/Spark context (CoarseGrainedMode) with a fixed 120 cores spread out across 4 nodes for the export job
  7. Instrumentation dashboard showcasing the write latency measured during the export to noSql job (7ms max). Writes are performed against the east-coast DC … they are propagated to the west coast, however the JMX metric exposed (Write.Latency.OneMinuteRate) does not reflect it … need to build a new dashboard with different metrics!
  8. Nagios monitoring for the geo-replicated, instrumented generated apis. The APIs (readers) and the Spark executors (writers) have a retry mechanism (AOP aspects) that implement throttling when Cassandra is under siege …
  9. Ganglia monitoring Dashboard
  10. Referral Provider Network: one of the 6 applications that we built for our healthcare customer using the xPatterns APIs and tools on the new beyond Hadoop infrastructure: ELT Pipeline, Export to NoSQL API. The dashboard for the RPN application was built using D3.js and angular against the generic api published by the export tool. The application allows for building a graph of downstream and upstream referred and referring providers, grouped by specialty and with computed aggregates like patient counts, claim counts and total charged amounts. RPN is used for both fraud detection and for aiding a clinic buying decision, by following the busiest graph paths. The dataset behind the app consists of 8 billion medical records, from which we extracted 1.7 million providers (Shark warehouse) and built 53 million relationships in the graph (persisted in Cassandra) While we demo the graph building we will also look at the Graphite instrumentation dashboard for analyzing the runtime performance of the geo-replicated Cassandra read operations during the demo  
  11. Referral Provider Network: one of the 6 applications that we built for our healthcare customer using the xPatterns APIs and tools on the new beyond Hadoop infrastructure: ELT Pipeline, Export to NoSQL API. The dashboard for the RPN application was built using D3.js and angular against the generic api published by the export tool. The application allows for building a graph of downstream and upstream referred and referring providers, grouped by specialty and with computed aggregates like patient counts, claim counts and total charged amounts. RPN is used for both fraud detection and for aiding a clinic buying decision, by following the busiest graph paths. The dataset behind the app consists of 8 billion medical records, from which we extracted 1.7 million providers (Shark warehouse) and built 53 million relationships in the graph (persisted in Cassandra) While we demo the graph building we will also look at the Graphite instrumentation dashboard for analyzing the runtime performance of the geo-replicated Cassandra read operations during the demo  
  12. Instrumentation dashboard showcasing the read latency measured during peak (40ms average, 60peak)
  13. We converted the data ingestion tool to Spark so we can completely replace Hadoop and Hive in the future and for the faster job submission. We are converting all of our MR jobs to Spark jobs, for both ingestion and exports to Cassandra Different dataset require different cluster configurations, biggest problem of Spark 0.8.1 is that intermediate output and reducer input is not spilled to disk, causing OOM frequently (0.9.0 solves the reducer part).
  14. Cassandra is xPatterns: real-time database for user facing apis and dashboards applications, system of records for real-time analytics use-cases (Kafka/Storm/Cassandra), distribute in-memory cache store for configuration data, persistence store for user feedback in semantic search and dynamic ontology use cases (soldCloud/Cassandra/Zookeeper).