Shark: SQL and Rich
Analytics at Scale
November 26, 2012
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.pdf
AMPLab, EECS, UC Berkeley
Presented by Alexander Ivanichev
Reynold S. Xin, Josh Rosen, Matei Zaharia,
Michael J. Franklin, Scott Shenker, Ion Stoica
Agenda
• Problems with existing database solutions
• What is Shark?
• Shark architecture overview
• Extending Spark for SQL
• Shark performance and examples
Challenges in modern data analysis
Data size growing
• Processing has to scale out over large clusters
• Faults and stragglers complicate DB design
Complexity of analysis increasing
• Massive ETL (web crawling)
• Machine learning, graph processing
• Leads to long running jobs
Existing analysis solutions
MPP Databases
• Vertica, SAP HANA, Teradata, Google Dremel, Google PowerDrill, Cloudera Impala...
• Fast!
• Generally not fault-tolerant; challenging for long running queries as clusters scale up.
• Lack rich analytics such as machine learning and graph algorithms.
Map Reduce
• Apache Hive, Google Tenzing, Turn Cheetah...
• Enables fine-grained fault-tolerance, resource sharing, scalability.
• Expressive Machine Learning algorithms.
• High-latency, dismissed for interactive workloads.
What’s good about MapReduce?
A data warehouse
• scales out to thousands of nodes in a fault-tolerant manner
• puts structure/schema onto HDFS data (schema-on-read)
• compiles HiveQL queries into MapReduce jobs
• flexible and extensible: support UDFs, scripts, custom serializers, storage formats
Data-parallel operations
• Apply the same operations on a defined set of data
Fine-grained, deterministic tasks
• Enables fault-tolerance & straggler mitigation
But slow: 30+ seconds even for simple queries
Shark research
Shows MapReduce model can be extended to support SQL efficiently
• Started from a powerful MR-like engine (Spark)
• Extended the engine in various ways
The artifact: Shark, a fast engine on top of MR
• Performant SQL
• Complex analytics in the same engine
• Maintains MR benefits, e.g. fault-tolerance
What is Shark?
A new data analysis system that is:
MapReduce-based architecture
• Built on top of RDDs and the Spark execution engine
• Scales out and is fault-tolerant
Performant
• Speedups of up to 100x over Hive
• Supports low-latency, interactive queries through in-memory computation
Expressive and flexible
• Supports both SQL and complex analytics such as machine learning
• Compatible with Apache Hive (storage, UDFs, types, metadata, etc.)
Shark on top of the Spark engine
Fast MapReduce-like engine
• In-memory storage for fast iterative computations
• General execution graphs
• Designed for low latency (~100ms jobs)
Compatible with Hadoop storage APIs
• Read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
Growing open source platform
• 17+ companies contributing code
Shark architecture
[Figure: Shark architecture diagram. Labeled components: Client (CLI, JDBC), Driver, SQL Parser, Query Optimizer, Physical Plan, Execution, Cache Mgr., Spark, Master Process, Metastore, HDFS.]
Shark architecture
Shark can be used to query an existing Hive warehouse without modification, and it returns results much faster.
[Figure: cluster diagram. Master node: HDFS NameNode, Metastore (System Catalog), Resource Manager Scheduler, Master Process. Each slave node: Spark Runtime (Execution Engine, Memstore), Resource Manager Daemon, HDFS DataNode.]
Shark architecture: overview
Spark
• Supports partial DAG execution
• Optimization of the join algorithm
Features of Shark
• Supports general computation
• Provides an in-memory storage abstraction (RDD)
• Engine is optimized for low latency
Shark architecture: overview
RDD
• Spark's main abstraction: the RDD (resilient distributed dataset)
• A collection stored in an external storage system, or a derived data set
• Contains arbitrary data types
Benefits of RDDs
• In-memory data is returned at the speed of DRAM
• Use of lineage
• Speedy recovery
• Immutable: a foundation for relational processing
[Figure: example RDD operators: groupByKey, join, mapPartitions, sortByKey, distinct, filter]
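The lineage and recovery points above can be illustrated with a small plain-Python sketch. This is not Spark's API: `LineageDataset` and its methods (`from_source`, `map`, `filter`, `compute`, `lose_partition`, `collect`) are hypothetical names chosen for illustration. The sketch assumes each derived partition depends only on the parent partition with the same index, so a lost partition can be rebuilt on its own by replaying the recorded transformations.

```python
class LineageDataset:
    """Toy stand-in for an RDD: partitioned and rebuilt from lineage on loss."""

    def __init__(self, num_partitions, compute_partition, parent=None):
        self.num_partitions = num_partitions
        self._compute_partition = compute_partition  # partition index -> list
        self.parent = parent                         # lineage pointer
        self._cache = {}                             # partition index -> list

    @staticmethod
    def from_source(partitions):
        # The source plays the role of stable external storage (e.g. HDFS).
        frozen = [tuple(p) for p in partitions]
        return LineageDataset(len(frozen), lambda i: list(frozen[i]))

    def map(self, fn):
        # Record how to derive partition i from the parent's partition i.
        return LineageDataset(
            self.num_partitions,
            lambda i: [fn(x) for x in self.compute(i)],
            parent=self,
        )

    def filter(self, pred):
        return LineageDataset(
            self.num_partitions,
            lambda i: [x for x in self.compute(i) if pred(x)],
            parent=self,
        )

    def compute(self, i):
        # Serve from memory when possible; otherwise recompute from lineage.
        if i not in self._cache:
            self._cache[i] = self._compute_partition(i)
        return list(self._cache[i])

    def lose_partition(self, i):
        # Simulate losing the worker that held partition i in memory.
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.compute(i)]


source = LineageDataset.from_source([[1, 2, 3], [4, 5, 6]])
evens_squared = source.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

before = evens_squared.collect()
evens_squared.lose_partition(1)   # drop one cached partition
after = evens_squared.collect()   # only partition 1 is recomputed
print(before, after)
```

Because only the missing partition is recomputed, recovery work is proportional to what was lost rather than to the whole data set.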
Shark architecture: overview
Fault tolerance guarantees
• Shark can tolerate the loss of any set of worker nodes.
• Recovery is parallelized across the cluster.
• The deterministic nature of RDDs also enables straggler mitigation
• Recovery works even in queries that combine SQL and machine learning UDFs
Executing SQL over RDDs
The process of executing a SQL query includes:
• Query parsing
• Logical plan generation
• Physical plan generation
Analyzing data on Shark
CREATE EXTERNAL TABLE wiki
(id BIGINT, title STRING, last_modified STRING, xml STRING, text STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3n://spark-data/wikipedia-sample/';
SELECT COUNT(*) FROM wiki WHERE text LIKE '%Berkeley%';
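Caching data in Shark
CREATE TABLE wiki_small_in_mem TBLPROPERTIES ("shark.cache" = "true") AS
SELECT * FROM wiki;
CREATE TABLE wiki_cached AS SELECT * FROM wiki;
Each statement creates a table that is stored in the cluster’s memory using RDD.cache(): the first sets the "shark.cache" table property, and the second relies on the _cached suffix in the table name.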
Tuning the degree of parallelism
• Relies on Spark to infer the number of map tasks (automatically based on input size).
• Number of reduce tasks needs to be specified by the user.
SET mapred.reduce.tasks=499;
• Out of memory error on slaves if the number is too small.
• It is usually OK to set a higher value since the overhead of task launching is low in Spark
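As a rough illustration of the trade-off above, the following plain-Python sketch estimates the smallest reduce-task count that keeps each task's input under a memory budget. The helper `min_reduce_tasks`, the 100 GiB shuffle size, and the 256 MiB budget are all assumptions made for this example; it also assumes shuffle data is spread evenly across reduce tasks, which skewed keys would violate.

```python
def min_reduce_tasks(shuffle_bytes: int, per_task_budget_bytes: int) -> int:
    """Smallest reduce-task count that keeps each task's input within budget.

    Assumes shuffle output is spread evenly across reduce tasks (no skew).
    """
    if shuffle_bytes < 0 or per_task_budget_bytes <= 0:
        raise ValueError("sizes must be non-negative and the budget positive")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    tasks = (shuffle_bytes + per_task_budget_bytes - 1) // per_task_budget_bytes
    return max(1, tasks)


GIB = 1024 ** 3
MIB = 1024 ** 2

# Hypothetical workload: 100 GiB of shuffle data, 256 MiB budget per task.
tasks = min_reduce_tasks(100 * GIB, 256 * MIB)
print(f"SET mapred.reduce.tasks={tasks};")
```

Since task launch overhead is low in Spark, rounding this estimate up is usually safer than rounding it down.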
Extending Spark for SQL
• Partial DAG Execution
• Columnar Memory Store
• Machine Learning Integration
• Hash-based Shuffle vs Sort-based Shuffle
• Data Co-partitioning
• Partition Pruning based on Range Statistics
• Distributed Data Loading
• Distributed sorting
• Better push-down of limits
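To make one item in the list above concrete, here is a minimal plain-Python sketch of partition pruning based on range statistics. It is not Shark's implementation: `PartitionStats` and `prune_partitions` are hypothetical names, and the sketch assumes that a minimum and maximum value were recorded per partition for the filtered column when the data was loaded.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PartitionStats:
    partition_id: int
    min_value: int  # smallest value of the column in this partition
    max_value: int  # largest value of the column in this partition


def prune_partitions(stats: List[PartitionStats], lo: int, hi: int) -> List[int]:
    """Return ids of partitions that may hold values in the closed range [lo, hi].

    A partition is skipped only when its [min, max] range cannot overlap the
    predicate range, so pruning never drops matching rows.
    """
    return [s.partition_id for s in stats
            if s.max_value >= lo and s.min_value <= hi]


stats = [
    PartitionStats(0, 1, 100),
    PartitionStats(1, 101, 200),
    PartitionStats(2, 201, 300),
]

# Equivalent to a predicate such as: WHERE col BETWEEN 150 AND 220
to_scan = prune_partitions(stats, 150, 220)
print(to_scan)
```

Pruning pays off most when data is naturally clustered on the filtered column (for example, logs ordered by timestamp), because the per-partition ranges then overlap very little.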
Partial DAG execution (PDE)
• Static query optimization
• Dynamic query optimization
• Modification of statistics
Examples of statistics
• Partition sizes and record counts
• List of “heavy hitters”
• Approximate histograms
How to optimize the following query?
SELECT * FROM table1 a
JOIN table2 b ON a.key = b.key
WHERE my_udf(b.field1, b.field2) = true;
• Hard to estimate cardinality!
• Without cardinality estimation, a cost-based optimizer breaks down
PDE: join optimization
Skew handling and degree of parallelism
• Task scheduling overhead
[Figure: two join plans for Table 1 and Table 2, a map join and a shuffle join, each producing a join result]
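The run-time choice between the two join plans can be sketched in plain Python. This is an illustration of the idea only, not Shark's code: `choose_join_strategy`, the per-partition size lists, and the 64 MiB broadcast threshold are all assumptions. The sketch supposes that the map stage has already run and reported the size of each output partition, which is the kind of statistic partial DAG execution collects.

```python
from typing import List, Optional, Tuple

MIB = 1024 ** 2


def choose_join_strategy(
    left_partition_bytes: List[int],
    right_partition_bytes: List[int],
    broadcast_threshold_bytes: int,
) -> Tuple[str, Optional[str]]:
    """Pick a join plan from sizes observed after the map stage.

    Returns ("map_join", side_to_broadcast) when the smaller input fits under
    the threshold, otherwise ("shuffle_join", None).
    """
    left_total = sum(left_partition_bytes)
    right_total = sum(right_partition_bytes)
    if min(left_total, right_total) <= broadcast_threshold_bytes:
        side = "left" if left_total <= right_total else "right"
        return ("map_join", side)
    return ("shuffle_join", None)


# Hypothetical observation: the UDF filter left only 3 MiB of table2.
table1_sizes = [512 * MIB, 480 * MIB, 530 * MIB]
table2_sizes = [1 * MIB, 2 * MIB]
plan = choose_join_strategy(table1_sizes, table2_sizes, 64 * MIB)
print(plan)
```

A static optimizer could not make this choice for the `my_udf` query, because the filtered size of the second table is unknown until the data has actually been scanned.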
Columnar memory store
• Simply caching records as JVM objects is insufficient
• Shark employs column-oriented storage: a partition of columns is one MapReduce “record”
• Benefits: compact representation, CPU-efficient compression, cache locality
Row storage:
1  John   4.1
2  Mike   3.5
3  Sally  6.4
Column storage:
1     2     3
John  Mike  Sally
4.1   3.5   6.4
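The difference between the two layouts can be shown with a short plain-Python sketch using the same three records. The column names `id`, `name`, and `score` are assumptions (the slide does not name the columns), and the standard-library `array` type stands in for the primitive arrays a JVM-based column store would use.

```python
from array import array

# Row storage: one tuple (object) per record.
rows = [(1, "John", 4.1), (2, "Mike", 3.5), (3, "Sally", 6.4)]


def to_columns(records):
    """Pivot a non-empty list of (id, name, score) rows into per-column storage.

    Numeric columns become packed primitive arrays instead of one object per
    field, which is where the compact representation comes from.
    """
    ids, names, scores = zip(*records)
    return {
        "id": array("q", ids),        # packed 64-bit signed integers
        "name": list(names),          # strings stay as ordinary objects
        "score": array("d", scores),  # packed 64-bit floats
    }


columns = to_columns(rows)

# A scan over one column touches only that column's contiguous storage.
total = sum(columns["score"])
print(list(columns["id"]), columns["name"], total)
```

An aggregate such as the sum above reads a single contiguous array and never touches the other columns, which is the cache-locality benefit the slide refers to.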
Machine learning support
Language integration
• Shark allows queries to perform logistic regression over a user database.
• Example: a data analysis pipeline that performs logistic regression over a database.
Execution engine integration
• A common abstraction allows machine learning computation and SQL queries to share workers and cached data.
• Enables end-to-end fault tolerance
Shark supports machine learning as a first-class citizen
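Integrating machine learning with SQL
import scala.math.exp

// Assumed to be defined elsewhere: Point (fields x: Vector, y: Double), D,
// ITERATIONS, rand (a scala.util.Random), extractFeature1, extractFeature2.

// Batch gradient descent for logistic regression over an RDD of points.
def logRegress(points: RDD[Point]): Vector = {
  // Start from a random weight vector with components in [-1, 1).
  var w = Vector(D, _ => 2 * rand.nextDouble - 1)
  for (i <- 1 to ITERATIONS) {
    // Per-point gradients are computed in parallel, then summed.
    val gradient = points.map { p =>
      val denom = 1 + exp(-p.y * (w dot p.x))
      (1 / denom - 1) * p.y * p.x
    }.reduce(_ + _)
    w -= gradient
  }
  w
}

// Select the training data with SQL; the result comes back as an RDD.
val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid = u.uid")

// Turn each row into a feature vector.
val features = users.mapRows { row =>
  new Vector(extractFeature1(row.getInt("age")),
             extractFeature2(row.getStr("country")),
             ...)}

// Cache the features in memory, since every iteration rescans them.
val trainedVector = logRegress(features.cache())
Unified system for query processing and machine learning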
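For readers who want to run the update rule without a cluster, the loop above can be mirrored in plain Python. This is a sketch, not part of Shark: `dot`, `log_regress`, and the four sample points are assumptions, labels are taken to be +1 or -1, and an ordinary list replaces the cached RDD. The gradient expression and the step size of 1 follow the slide's code.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_regress(points, dims, iterations, rng):
    """Batch gradient descent for logistic regression with labels in {-1, +1}."""
    # Random initial weights in [-1, 1), as in the slide's code.
    w = [2 * rng.random() - 1 for _ in range(dims)]
    for _ in range(iterations):
        gradient = [0.0] * dims
        for x, y in points:
            denom = 1 + math.exp(-y * dot(w, x))
            scale = (1 / denom - 1) * y
            for j in range(dims):
                gradient[j] += scale * x[j]
        w = [wj - gj for wj, gj in zip(w, gradient)]
    return w


# Tiny, linearly separable sample (made up for this sketch).
points = [
    ((1.0, 2.0), 1.0),
    ((2.0, 1.0), 1.0),
    ((-1.0, -2.0), -1.0),
    ((-2.0, -1.0), -1.0),
]

weights = log_regress(points, dims=2, iterations=10, rng=random.Random(0))
predictions = [1.0 if dot(weights, x) > 0 else -1.0 for x, _ in points]
print(weights, predictions)
```

Every iteration rescans the full data set, which is why caching the feature RDD in memory matters so much for this kind of workload.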
Machine learning support: performance
[Figure: bar chart, Conviva warehouse queries (1.7 TB), Q1 to Q4, comparing Shark, Shark (disk), and Hive; y-axis from 0 to 100. Data labels shown: 1.1, 0.8, 0.7, 1.]
Machine learning workflow comparison
Logistic regression, per-iteration runtime (seconds)
• Hadoop (text): 110
• Hadoop (binary): 80
• Shark: 0.96
K-means clustering, per-iteration runtime (seconds)
• Hadoop (text): 155
• Hadoop (binary): 120
• Shark: 4.1
Workflow:
1. Selecting the data of interest from the warehouse using SQL
2. Extracting features
3. Applying iterative algorithms:
• Logistic regression
• K-means clustering
Machine learning support
Machine learning (1B records, 10 features/record)
K-means
• Hadoop: 120
• Shark/Spark: 4.1
Logistic regression
• Hadoop: 80
• Shark/Spark: 0.96
Summary
• By using Spark as the execution engine and employing novel and traditional database
techniques, Shark bridges the gap between MapReduce and MPP databases.
• Users can conveniently mix the best parts of both SQL and MapReduce-style programming
and avoid moving data.
• Provides fine-grained fault recovery across both types of operations.
• In-memory computation: users can choose to load high-value data into Shark’s memory store
for fast analytics.
• Shark can answer queries up to 100X faster than Hive and machine learning 100X faster than
Hadoop MapReduce and can achieve comparable performance with MPP databases
Thank you
https://spark.apache.org/sql/

 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 

Shark

  • 1. Shark: SQL and Rich Analytics at Scale November 26, 2012 http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.pdf AMPLab, EECS, UC Berkeley Presented by Alexander Ivanichev Reynold S. Xin, Josh Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, Ion Stoica
  • 2. Agenda • Problems with existing database solutions • What is Shark? • Shark architecture overview • Extending Spark for SQL • Shark performance and examples
  • 3. Challenges in modern data analysis Data size growing • Processing has to scale out over large clusters • Faults and stragglers complicate DB design Complexity of analysis increasing • Massive ETL (web crawling) • Machine learning, graph processing • Leads to long running jobs
  • 4. Existing analysis solutions. MPP databases: • Vertica, SAP HANA, Teradata, Google Dremel, Google PowerDrill, Cloudera Impala... • Fast! • Generally not fault-tolerant; challenging for long-running queries as clusters scale up. • Lack rich analytics such as machine learning and graph algorithms. MapReduce: • Apache Hive, Google Tenzing, Turn Cheetah... • Enables fine-grained fault tolerance, resource sharing, scalability. • Expressive machine learning algorithms. • High latency, dismissed for interactive workloads.
  • 5. What’s good about MapReduce? A data warehouse: • scales out to thousands of nodes in a fault-tolerant manner • puts structure/schema onto HDFS data (schema-on-read) • compiles HiveQL queries into MapReduce jobs • flexible and extensible: supports UDFs, scripts, custom serializers, storage formats. Data-parallel operations: • Apply the same operations on a defined set of data. Fine-grained, deterministic tasks: • Enable fault tolerance and straggler mitigation. But slow: 30+ seconds even for simple queries
  • 6. Shark research. Shows that the MapReduce model can be extended to support SQL efficiently: • Started from a powerful MR-like engine (Spark) • Extended the engine in various ways. The artifact: Shark, a fast engine on top of MR • Performant SQL • Complex analytics in the same engine • Maintains MR benefits, e.g. fault tolerance
  • 7. What is Shark? A new data analysis system that is: MapReduce-based in architecture • Built on top of RDDs and the Spark execution engine • Scales out and is fault-tolerant. Performant • Speedups of up to 100x over Hive • Supports low-latency, interactive queries through in-memory computation. Expressive and flexible • Supports both SQL and complex analytics such as machine learning • Compatible with Apache Hive (storage, UDFs, types, metadata, etc.)
  • 8. Shark on top of the Spark engine. Fast MapReduce-like engine: • In-memory storage for fast iterative computations • General execution graphs • Designed for low latency (~100 ms jobs). Compatible with Hadoop storage APIs: • Read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc. Growing open source platform: • 17+ companies contributing code
  • 10. Shark architecture. Used to query an existing Hive warehouse; returns results much faster, without modification. [Diagram: Master Node with HDFS NameNode, Metastore (System Catalog), Resource Manager Scheduler, and Master Process; two Slave Nodes, each with Spark Runtime (Execution Engine, Memstore), Resource Manager Daemon, and HDFS DataNode]
  • 11. Shark architecture: overview. Spark • Supports partial DAG execution • Optimization of join algorithms. Features of Shark • Supports general computation • Provides an in-memory storage abstraction (RDD) • Engine is optimized for low latency
  • 12. Shark architecture: overview. RDD • Spark's main abstraction is the RDD • A collection stored in an external storage system, or a derived data set • Contains arbitrary data types. Benefits of RDDs • Return results at the speed of DRAM • Use of lineage • Speedy recovery • Immutable: a foundation for relational processing. Operators shown: groupByKey, join, mapPartitions, sortByKey, distinct, filter
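The lineage idea on slide 12 can be illustrated without Spark. The sketch below is a toy model, not Spark's RDD implementation: the `Dataset`, `Source`, `Mapped`, and `Filtered` names are hypothetical. It shows why an immutable, derived data set can rebuild a single lost partition by replaying its transformations on that partition only.

```scala
// Toy model of lineage-based recovery. All type names here are hypothetical
// and only mimic the idea behind RDDs; this is not Spark's API.
sealed trait Dataset[A] {
  def numPartitions: Int
  // Recompute one partition purely from lineage (parents plus functions).
  def compute(partition: Int): Seq[A]
}

final case class Source[A](partitions: Vector[Seq[A]]) extends Dataset[A] {
  def numPartitions: Int = partitions.length
  def compute(partition: Int): Seq[A] = partitions(partition)
}

final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
  def numPartitions: Int = parent.numPartitions
  def compute(partition: Int): Seq[B] = parent.compute(partition).map(f)
}

final case class Filtered[A](parent: Dataset[A], p: A => Boolean) extends Dataset[A] {
  def numPartitions: Int = parent.numPartitions
  def compute(partition: Int): Seq[A] = parent.compute(partition).filter(p)
}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val source = Source(Vector(Seq(1, 2, 3), Seq(4, 5, 6)))
    val derived = Filtered(Mapped(source, (x: Int) => x * 10), (x: Int) => x > 20)
    // Pretend the cached copy of partition 1 was lost with a worker:
    // replay map and filter on that partition alone.
    val rebuilt = derived.compute(1)
    println(rebuilt)
  }
}
```

Because every partition can be recomputed independently, recovery of several lost partitions could run in parallel, which matches the fault tolerance claims on the next slide.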
  • 13. Shark architecture: overview. Fault tolerance guarantees • Shark can tolerate the loss of any set of worker nodes. • Recovery is parallelized across the cluster. • The deterministic nature of RDDs also enables straggler mitigation. • Recovery works even in queries that combine SQL and machine learning UDFs.
  • 14. Executing SQL over RDDs. Executing a SQL query involves three steps: • Query parsing • Logical plan generation • Physical plan generation
  • 15. Analyzing data on Shark. CREATE EXTERNAL TABLE wiki (id BIGINT, title STRING, last_modified STRING, xml STRING, text STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3n://spark-data/wikipedia-sample/'; SELECT COUNT(*) FROM wiki_small WHERE text LIKE '%Berkeley%';
  • 16. Caching data in Shark. CREATE TABLE wiki_small_in_mem TBLPROPERTIES ("shark.cache" = "true") AS SELECT * FROM wiki; CREATE TABLE wiki_cached AS SELECT * FROM wiki; Each statement creates a table that is stored in the cluster's memory using RDD.cache().
  • 17. Tuning the degree of parallelism • Shark relies on Spark to infer the number of map tasks (automatically, based on input size). • The number of reduce tasks needs to be specified by the user: SET mapred.reduce.tasks=499; • Slaves run out of memory if the number is too small. • It is usually OK to set a higher value, since the overhead of task launching is low in Spark.
  • 18. Extending Spark for SQL • Partial DAG Execution • Columnar Memory Store • Machine Learning Integration • Hash-based Shuffle vs Sort-based Shuffle • Data Co-partitioning • Partition Pruning based on Range Statistics • Distributed Data Loading • Distributed Sorting • Better push-down of limits
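Of the techniques listed on slide 18, partition pruning based on range statistics is easy to sketch in isolation. The example below assumes each partition records the minimum and maximum of one numeric column; `PartitionStats` and `prune` are hypothetical names, and the slides do not show how Shark actually stores these statistics.

```scala
// Hypothetical per-partition range statistics for a single numeric column.
final case class PartitionStats(id: Int, min: Long, max: Long)

object PartitionPruning {
  // Keep only partitions whose [min, max] range overlaps the predicate
  // range [lo, hi]; every other partition cannot contain a matching row.
  def prune(stats: Seq[PartitionStats], lo: Long, hi: Long): Seq[Int] =
    stats.filter(s => s.max >= lo && s.min <= hi).map(_.id)

  def main(args: Array[String]): Unit = {
    val stats = Seq(
      PartitionStats(0, 1L, 100L),
      PartitionStats(1, 101L, 200L),
      PartitionStats(2, 201L, 300L)
    )
    // A predicate such as "col BETWEEN 150 AND 180" only needs partition 1.
    println(prune(stats, 150L, 180L))
  }
}
```

Pruning of this kind pays off most when data is loaded in an order that correlates with the filtered column, for example logs ordered by timestamp.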
  • 19. Partial DAG Execution (PDE) • Static query optimization • Dynamic query optimization • Modification of statistics. Examples of statistics • Partition sizes and record counts • List of “heavy hitters” • Approximate histogram. How to optimize the following query? SELECT * FROM table1 a JOIN table2 b ON a.key=b.key WHERE my_udf(b.field1, b.field2) = true; • Hard to estimate cardinality! • Without cardinality estimation, a cost-based optimizer breaks down
  • 20. PDE: join optimization • Skew handling and degree of parallelism • Task scheduling overhead [Diagram: Table 1 and Table 2 combined into a join result, once with a map join and once with a shuffle join]
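The choice between the map join and the shuffle join on slide 20 can be sketched as a decision made after the map stages have reported their output sizes. This is a minimal illustration under stated assumptions: the `BroadcastThresholdBytes` value and all names are hypothetical, and the slides do not give Shark's actual decision rule.

```scala
sealed trait JoinStrategy
case object MapJoin extends JoinStrategy
case object ShuffleJoin extends JoinStrategy

object PdeJoinChoice {
  // Assumed threshold for illustration only; not a value taken from Shark.
  val BroadcastThresholdBytes: Long = 32L * 1024 * 1024

  // Inputs are per-partition byte counts observed at run time for each side.
  // If the smaller side fits under the threshold it can be sent to every
  // task (map join); otherwise both sides are repartitioned (shuffle join).
  def choose(leftPartitionBytes: Seq[Long], rightPartitionBytes: Seq[Long]): JoinStrategy = {
    val smaller = math.min(leftPartitionBytes.sum, rightPartitionBytes.sum)
    if (smaller <= BroadcastThresholdBytes) MapJoin else ShuffleJoin
  }

  def main(args: Array[String]): Unit = {
    // A UDF filter left one side tiny, which a static optimizer could not know.
    println(choose(Seq(4096L, 2048L), Seq(1L << 33, 1L << 33)))
    println(choose(Seq(1L << 30), Seq(1L << 30)))
  }
}
```

The point of collecting sizes at run time is that the post-filter size of a table such as `table2` in the previous slide is only known once the UDF has actually run.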
  • 21. Columnar memory store • Simply caching records as JVM objects is insufficient • Shark employs column-oriented storage: a partition of columns is one MapReduce “record” • Benefits: compact representation, CPU-efficient compression, cache locality. Column storage: [1, 2, 3] [John, Mike, Sally] [4.1, 3.5, 6.4]. Row storage: (1, John, 4.1) (2, Mike, 3.5) (3, Sally, 6.4)
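The row versus column layouts on slide 21 can be made concrete with the slide's own three records. The sketch below is a toy layout, not Shark's storage format; `Row`, `ColumnarPartition`, and `toColumnar` are hypothetical names. It stores one array per column for a whole partition, so a scan over one column reads a single primitive array instead of visiting one JVM object per record.

```scala
// Row-oriented: one JVM object per record.
final case class Row(id: Int, name: String, score: Double)

// Column-oriented: one array per column for a whole partition.
// A toy layout for illustration, not Shark's actual format.
final case class ColumnarPartition(ids: Array[Int], names: Array[String], scores: Array[Double]) {
  def size: Int = ids.length
  // Reassemble a single record when a row view is needed.
  def row(i: Int): Row = Row(ids(i), names(i), scores(i))
}

object ColumnarDemo {
  def toColumnar(rows: Seq[Row]): ColumnarPartition =
    ColumnarPartition(
      rows.map(_.id).toArray,
      rows.map(_.name).toArray,
      rows.map(_.score).toArray
    )

  def main(args: Array[String]): Unit = {
    val rows = Seq(Row(1, "John", 4.1), Row(2, "Mike", 3.5), Row(3, "Sally", 6.4))
    val part = toColumnar(rows)
    // An aggregate over one column touches only the scores array.
    println(part.scores.max)
    println(part.row(1))
  }
}
```

Keeping each column in its own array is also what makes per-column compression practical, since values of one type and similar range sit next to each other.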
  • 22. Machine learning support. Shark supports machine learning as a first-class citizen. Language integration • Shark allows queries to perform logistic regression over a user database. • Example: a data analysis pipeline that performs logistic regression over a database. Execution engine integration • A common abstraction allows machine learning computation and SQL queries to share workers and cached data. • Enables end-to-end fault tolerance
  • 23. Integrating machine learning with SQL: a unified system for query processing and machine learning.

        // Logistic regression by gradient descent over an RDD of labeled points.
        def logRegress(points: RDD[Point]): Vector = {
          var w = Vector(D, _ => 2 * rand.nextDouble - 1)  // random initial weights
          for (i <- 1 to ITERATIONS) {
            val gradient = points.map { p =>
              val denom = 1 + exp(-p.y * (w dot p.x))
              (1 / denom - 1) * p.y * p.x
            }.reduce(_ + _)
            w -= gradient
          }
          w
        }

        // Select the training data with SQL, then extract features per row.
        val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid=u.uid")
        val features = users.mapRows { row =>
          new Vector(extractFeature1(row.getInt("age")),
                     extractFeature2(row.getStr("country")),
                     ...)
        }
        val trainedVector = logRegress(features.cache())
  • 24. Machine learning support: performance. [Chart: Conviva warehouse queries (1.7 TB), Q1 to Q4, comparing Shark, Shark (disk), and Hive; data labels shown: 1.1, 0.8, 0.7, 1]
  • 25. Machine learning workflow comparison. Workflow: 1. Selecting the data of interest from the warehouse using SQL 2. Extracting features 3. Applying iterative algorithms: • Logistic regression • K-means clustering. Logistic regression, per-iteration runtime (seconds): Hadoop (text) 110, Hadoop (binary) 80, Shark 0.96. K-means clustering, per-iteration runtime (seconds): Hadoop (text) 155, Hadoop (binary) 120, Shark 4.1.
  • 26. Machine learning support. Machine learning (1B records, 10 features/record). K-means: Hadoop 120, Shark/Spark 4.1. Logistic regression: Hadoop 80, Shark/Spark 0.96.
  • 27. Summary • By using Spark as the execution engine and employing both novel and traditional database techniques, Shark bridges the gap between MapReduce and MPP databases. • Users can conveniently mix the best parts of SQL and MapReduce-style programming and avoid moving data. • Shark provides fine-grained fault recovery across both types of operations. • With in-memory computation, users can choose to load high-value data into Shark’s memory store for fast analytics. • Shark can answer queries up to 100x faster than Hive, run machine learning up to 100x faster than Hadoop MapReduce, and achieve performance comparable to MPP databases.