This document summarizes a presentation about using Apache Spark for various data analytics use cases. It discusses how Spark can be used for interactive SQL queries on large datasets, log file enrichment by connecting to data stores like HBase, mixing SQL and machine learning by accessing training and query engines in the same platform, and building recommendation engines by performing ETL, training models with MLlib, and serving recommendations with NoSQL. The presentation argues that Spark helps flatten the adoption curve by providing a unified framework for all these tasks.
My approach to this presentation
Not an API presentation
Great documentation and examples already exist
Lots of presentations of that variety
Architecture presentation – use case study
What unique features facilitate these workloads
What is here now, and what is coming, for new workloads
Ad hoc Queries with Shark
An early use case.
Not usually the first production deployment.
MapR has a few.
In the long run...
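A Shark query is ordinary HiveQL issued interactively. As a minimal sketch of the kind of ad hoc aggregation involved, here is the same idea using Python's stdlib `sqlite3` as a stand-in for a Shark/Hive table (the table and column names are hypothetical, not from the deck):

```python
import sqlite3

# Stand-in for a Shark/Hive table; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (url TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("/home", 120), ("/docs", 45), ("/home", 30), ("/blog", 10)],
)

# An ad hoc aggregation of the sort you would issue interactively in Shark.
rows = conn.execute(
    "SELECT url, SUM(views) AS total FROM page_views "
    "GROUP BY url ORDER BY total DESC"
).fetchall()
for url, total in rows:
    print(url, total)
```

In Shark the same GROUP BY would run distributed over cached RDD partitions; the query text itself is what the analyst types ad hoc.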
Logfile enrichment
Streaming API
Leveraging near-real-time resolution for enrichment.
Not the same as Storm – it's micro-batch.
Hooks to other messaging tools: ZeroMQ, Kafka, etc.
Sliding Window features
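For intuition, Spark Streaming's windowed operations amount to aggregating over the last N micro-batches. A stdlib sketch (the batch contents and window length are made up for illustration):

```python
from collections import deque

def windowed_counts(batches, window_len):
    """Yield the event count over the last `window_len` micro-batches,
    mimicking the behavior of a windowed count in Spark Streaming."""
    window = deque(maxlen=window_len)  # old batch counts fall off automatically
    for batch in batches:
        window.append(len(batch))
        yield sum(window)

# Each inner list stands in for one micro-batch of log lines.
batches = [["a", "b"], ["c"], ["d", "e", "f"], []]
print(list(windowed_counts(batches, window_len=2)))
```

The sliding window is just a bounded buffer over batch results; Spark Streaming does the same thing distributed, with the window and slide durations as parameters.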
NoSQL capabilities
Access to tables via the HBase API
Access to in-memory RDDs
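The enrichment pattern itself is simple: each streamed log record picks up extra fields from a table lookup. A sketch with a plain dict standing in for an HBase table (the field names are hypothetical):

```python
# Dict standing in for an HBase table keyed by user id.
user_table = {
    "u1": {"country": "US", "plan": "pro"},
    "u2": {"country": "DE", "plan": "free"},
}

def enrich(log_record, table):
    """Join a raw log record with stored user attributes (the enrichment step)."""
    extra = table.get(log_record["user"], {})
    return {**log_record, **extra}

raw = {"user": "u1", "path": "/docs"}
print(enrich(raw, user_table))
```

In the Spark case the lookup side would be an HBase table or a cached RDD rather than a dict, but the per-record join is the same shape.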
SQL Mixing with Machine Learning
“ETL for the math nerd”
R and Python
Current access is via Shark
Spark SQL will drive toward native SQL support
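"ETL for the math nerd" means pulling training rows with SQL and fitting a model in the same program. A minimal stdlib sketch of that pattern, with `sqlite3` in place of Shark and an ordinary least-squares fit in place of MLlib (the table, columns, and numbers are invented for illustration):

```python
import sqlite3

# SQL side: pull training rows from a hypothetical ad-spend vs. revenue table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ad_spend REAL, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0)])
rows = conn.execute("SELECT ad_spend, revenue FROM sales").fetchall()

# ML side: closed-form ordinary least squares, standing in for MLlib training.
n = len(rows)
sx = sum(x for x, _ in rows)
sy = sum(y for _, y in rows)
sxx = sum(x * x for x, _ in rows)
sxy = sum(x * y for x, y in rows)
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
print(round(slope, 2), round(intercept, 2))
```

The point of the platform argument is that both halves run in one process over one dataset; with Shark/Spark the SQL result is an RDD that feeds MLlib directly, with no export step between the query engine and the trainer.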
Recommendation Engine
Spark provides all aspects:
ETL
Vector/matrix generation and model training (MLlib)
Near-real-time recommendation serving (NoSQL)
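The three stages above can be sketched end to end in a few lines. This is a toy item co-occurrence recommender in stdlib Python, standing in for the real pipeline (MLlib training, NoSQL serving); the users, items, and scoring rule are invented for illustration:

```python
from collections import Counter, defaultdict

# ETL: raw (user, item) interaction pairs, here simply hard-coded.
views = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "c"), ("u3", "b")]

# Training: item co-occurrence counts, in place of MLlib model fitting.
by_user = defaultdict(set)
for user, item in views:
    by_user[user].add(item)
cooccur = defaultdict(Counter)
for items in by_user.values():
    for i in items:
        for j in items:
            if i != j:
                cooccur[i][j] += 1

# Serving: score items co-viewed with the user's history, the way a
# NoSQL store would serve precomputed recommendations.
def recommend(user, k=2):
    scores = Counter()
    for item in by_user[user]:
        scores.update(cooccur[item])
    for seen in by_user[user]:
        scores.pop(seen, None)  # never recommend what the user already saw
    return [item for item, _ in scores.most_common(k)]

print(recommend("u1"))
```

In the Spark version, the ETL and training stages are RDD transformations plus an MLlib factorization, and the serving table is written out to HBase or another NoSQL store; the data flow is the same three steps.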
Pharma companies like ADAM; academia is behind.
Few graph use cases in development; none deployed, none on GraphX.
MLlib and Mahout may join forces.
PySpark, according to Databricks, is some of the most active code.
SparkR
BlinkDB: time-limited queries; lives in a separate GitHub repo, to be merged into the main Spark branch.
A couple of OEM vendors exist for Spark; they are covered on the Databricks site.
MapR has a very large and robust, growing ecosystem of partners. This is important for you because you have existing investments and relationships with other technologies which need to work well with MapR, integrate easily, and allow you to create a differentiated set of technologies.
(highlight key partners which are important to your customer)