Beyond SQL:
Speeding up Spark with DataFrames
Michael Armbrust - @michaelarmbrust
March 2015 – Spark Summit East
2
[Two growth charts: # of Unique Contributors and # of Commits Per Month, both rising; Spark SQL graduated from Alpha in 1.3]
About Me and SQL
Spark SQL
•  Part of the core distribution since Spark 1.0 (April 2014)
3

    SELECT COUNT(*)
    FROM hiveTable
    WHERE hive_udf(data)
About Me and SQL	
  
Spark SQL
•  Part of the core distribution since Spark 1.0 (April 2014)
•  Runs SQL / HiveQL queries, optionally alongside or
replacing existing Hive deployments
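To make this concrete, here is a minimal sketch of running a query like the one above from Python. It assumes a Spark 1.3-era deployment with Hive support and a UDF named hive_udf already registered in the metastore:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="hiveql-example")
    sqlCtx = HiveContext(sc)   # HiveQL and Hive UDFs require a HiveContext

    # sql() returns a DataFrame; collect() brings the single count row to the driver
    count = sqlCtx.sql("SELECT COUNT(*) FROM hiveTable WHERE hive_udf(data)")
    print(count.collect())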
4
About Me and SQL	
  
Spark SQL
•  Part of the core distribution since Spark 1.0 (April 2014)
•  Runs SQL / HiveQL queries, optionally alongside or
replacing existing Hive deployments
•  Connect existing BI tools to Spark through JDBC
Spark SQL
•  Part of the core distribution since Spark 1.0 (April 2014)
•  Runs SQL / HiveQL queries, optionally alongside or
replacing existing Hive deployments
•  Connect existing BI tools to Spark through JDBC
•  Bindings in Python, Scala, and Java
5
About Me and SQL	
  
Spark SQL
•  Part of the core distribution since Spark 1.0 (April 2014)
•  Runs SQL / HiveQL queries, optionally alongside or
replacing existing Hive deployments
•  Connect existing BI tools to Spark through JDBC
•  Bindings in Python, Scala, and Java
@michaelarmbrust
•  Lead developer of Spark SQL @databricks
About Me and SQL
6
The not-so-secret truth...
7
SQL is not about SQL.
Execution Engine Performance
8
[Bar chart: TPC-DS performance per query (queries 3 through 98), comparing Shark and Spark SQL]
The not-so-secret truth...
9
SQL is about more than SQL.
Spark SQL: The whole story
Creating and Running Spark Programs Faster:
•  Write less code
•  Read less data
•  Let the optimizer do the hard work
10
DataFrame
noun – [dey-tuh-freym]
11
1.  A distributed collection of rows organized into
named columns.
2.  An abstraction for selecting, filtering, aggregating
and plotting structured data (cf. R, Pandas).
3.  Archaic: Previously SchemaRDD (cf. Spark < 1.3).
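As a quick illustration of the definition above, here is a minimal sketch of building a DataFrame from an RDD of rows; the column names and values are made up, and sqlCtx is assumed to be a SQLContext:

    from pyspark.sql import Row

    rows = sc.parallelize([Row(name="alice", age=30), Row(name="bob", age=25)])
    people = sqlCtx.createDataFrame(rows)   # schema inferred from the Row fields
    people.printSchema()                    # root: name (string), age (long)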
	
  
Write Less Code: Input & Output
Spark SQL’s Data Source API can read and write DataFrames
using a variety of formats.
12
[Format logos, built-in and external: { JSON }, JDBC, and more…]
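A minimal sketch of the Data Source API from Python, using the Spark 1.3-era load/save calls; the paths and connection details are placeholders:

    # Read JSON into a DataFrame, then write it back out as Parquet
    df = sqlCtx.load("/path/to/events.json", "json")
    df.save("/path/to/events.parquet", "parquet")

    # External sources plug in by name, e.g. a JDBC table
    jdbc_df = sqlCtx.load(source="jdbc",
                          url="jdbc:postgresql://host/db",
                          dbtable="events")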
Write Less Code: High-Level Operations
Common operations can be expressed concisely as calls to the DataFrame API (see the sketch after this list):
•  Selecting required columns
•  Joining different data sources
•  Aggregation (count, sum, average, etc.)
•  Filtering
13
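A minimal sketch combining all four operations; the table names and columns are made up, and sqlCtx is assumed to be a SQLContext:

    users = sqlCtx.table("users")
    events = sqlCtx.table("events")

    avg_duration = (events
        .join(users, events.user_id == users.user_id)  # join different data sources
        .select(users.name, events.duration)           # select required columns
        .where(events.duration > 0)                    # filter
        .groupBy(users.name)                           # aggregate
        .agg({"duration": "avg"}))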
Write Less Code: Compute an Average
Using Hadoop MapReduce (Java):

    private IntWritable one = new IntWritable(1);
    private IntWritable output = new IntWritable();

    protected void map(
        LongWritable key,
        Text value,
        Context context) {
      String[] fields = value.toString().split("\t");
      output.set(Integer.parseInt(fields[1]));
      context.write(one, output);
    }

    IntWritable one = new IntWritable(1);
    DoubleWritable average = new DoubleWritable();

    protected void reduce(
        IntWritable key,
        Iterable<IntWritable> values,
        Context context) {
      int sum = 0;
      int count = 0;
      for (IntWritable value : values) {
        sum += value.get();
        count++;
      }
      average.set(sum / (double) count);
      context.write(key, average);
    }

Using PySpark (Python):

    data = sc.textFile(...).map(lambda line: line.split("\t"))
    data.map(lambda x: (x[0], [int(x[1]), 1])) \
        .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
        .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
        .collect()
Write Less Code: Compute an Average
15
Using RDDs:

    data = sc.textFile(...).map(lambda line: line.split("\t"))
    data.map(lambda x: (x[0], [int(x[1]), 1])) \
        .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
        .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
        .collect()

Using DataFrames:

    sqlCtx.table("people") \
        .groupBy("name") \
        .agg("name", avg("age")) \
        .collect()

Full API Docs
•  Python
•  Scala
•  Java
Not Just Less Code: Faster Implementations
16
[Bar chart: time to aggregate 10 million int pairs (secs), comparing RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, and DataFrame SQL]
17
Demo: Data Sources API
Using Spark SQL to read, write, and transform data in a variety of
formats.
http://people.apache.org/~marmbrus/talks/dataframe.demo.pdf
Read Less Data
The fastest way to process big data is to never read it.
Spark SQL can help you read less data automatically:
•  Converting to more efficient formats
•  Using columnar formats (e.g., Parquet)
•  Using partitioning (e.g., /year=2014/month=02/…)¹
•  Skipping data using statistics (e.g., min, max)²
•  Pushing predicates into storage systems (e.g., JDBC)
¹Only supported for Parquet and Hive; more support coming in Spark 1.4. ²Turned off by default in Spark 1.3.
18
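A minimal sketch of how the last few techniques surface in user code, assuming events stored as Parquet under a partitioned layout like /data/events/year=2014/month=02/…:

    events = sqlCtx.load("/data/events", "parquet")

    # Partition pruning: only the year=2014/month=02 directories are read.
    # Remaining predicates and the column selection can be pushed into the scan.
    feb = events.where((events.year == 2014) & (events.month == 2))
    feb.select("timestamp").collect()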
	
  
Plan Optimization & Execution
19
[Pipeline diagram] SQL AST / DataFrame → Unresolved Logical Plan → Analysis (using the Catalog) → Logical Plan → Logical Optimization → Optimized Logical Plan → Physical Planning → Physical Plans → Cost Model → Selected Physical Plan → Code Generation → RDDs
DataFrames and SQL share the same optimization/execution pipeline
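To inspect these plans for a concrete query, DataFrame.explain prints them; a minimal sketch, assuming a SQLContext named sqlCtx and a registered people table:

    df = sqlCtx.table("people").groupBy("name").agg({"age": "avg"})
    df.explain(True)   # prints the parsed, analyzed, and optimized logical plans plus the physical plan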
Optimization happens as late as possible, so Spark SQL can optimize even across functions.
20
21

    def add_demographics(events):
        u = sqlCtx.table("users")                        # Load Hive table
        return (events
            .join(u, events.user_id == u.user_id)        # Join on user_id
            .withColumn("city", zipToCity(events.zip)))  # Run udf to add city column

    events = add_demographics(sqlCtx.load("/data/events", "json"))
    training_data = (events
        .where(events.city == "New York")
        .select(events.timestamp)
        .collect())

Logical Plan: a filter over a join of the events file and the users table. The join is expensive; ideally we join only the relevant users.
Physical Plan: join(scan(events), filter(scan(users)))
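The zipToCity call above is a user-defined function. A minimal sketch of how such a UDF might be declared in the Spark 1.3 Python API; the zip-to-city mapping here is entirely hypothetical:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # Hypothetical lookup; a real version would consult a proper mapping
    zipToCity = udf(lambda z: "New York" if z == "10001" else "Other", StringType())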
22

    def add_demographics(events):
        u = sqlCtx.table("users")                        # Load partitioned Hive table
        return (events
            .join(u, events.user_id == u.user_id)        # Join on user_id
            .withColumn("city", zipToCity(events.zip)))  # Run udf to add city column

    events = add_demographics(sqlCtx.load("/data/events", "parquet"))
    training_data = (events
        .where(events.city == "New York")
        .select(events.timestamp)
        .collect())

Logical Plan: a filter over a join of the events file and the users table.
Physical Plan: join(scan(events), filter(scan(users)))
Physical Plan with Predicate Pushdown and Column Pruning: join(optimized scan(events), optimized scan(users))
Machine Learning Pipelines
23

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashingTF = HashingTF(inputCol="words", outputCol="features")
    lr = LogisticRegression(maxIter=10, regParam=0.01)
    pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

    df = sqlCtx.load("/path/to/data")
    model = pipeline.fit(df)

[Diagram: ds0 → tokenizer → ds1 → hashingTF → ds2 → lr → ds3; fitting lr yields lr.model, and the fitted stages form the Pipeline Model]
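Once fit, the pipeline model is itself a transformer over DataFrames; a minimal sketch of applying it, where test_df is an assumed held-out DataFrame with a text column:

    predictions = model.transform(test_df)   # appends words, features, and prediction columns
    predictions.select("text", "prediction").show()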
  
Create and Run Spark Programs Faster:
•  Write less code
•  Read less data
•  Let the optimizer do the hard work
Questions?
