Spark DataFrames:
Simple and Fast Analytics
on Structured Data
Michael Armbrust
Spark Summit 2015 - June 15th
About Me and Spark SQL
• Spark SQL
•  Part of the core distribution since Spark 1.0 (April 2014)
•  Graduated from Alpha in 1.3
•  Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments
SELECT COUNT(*)
FROM hiveTable
WHERE hive_udf(data)
•  Improved multi-version support in 1.4
•  Connect existing BI tools to Spark through JDBC
•  Bindings in Python, Scala, Java, and R
• @michaelarmbrust
•  Lead developer of Spark SQL @databricks
[Charts: # of commits per month; # of contributors]
The not-so-secret truth...
Spark SQL is about more than SQL.
Spark SQL: The whole story
Creating and Running Spark Programs Faster:
•  Write less code
•  Read less data
•  Let the optimizer do the hard work
DataFrame
noun – [dey-tuh-freym]
1.  A distributed collection of rows organized into named columns.
2.  An abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas).
3.  Archaic: Previously SchemaRDD (cf. Spark < 1.3).
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:
df = sqlContext.read
  .format("json")
  .option("samplingRatio", "0.1")
  .load("/home/michael/data.json")

df.write
  .format("parquet")
  .mode("append")
  .partitionBy("year")
  .saveAsTable("fasterData")
read and write functions create new builders for doing I/O.
Builder methods specify:
•  Format
•  Partitioning
•  Handling of existing data
load(…), save(…) or saveAsTable(…) finish the I/O specification.
Write Less Code: Input & Output
Spark SQL’s Data Source API can read and write DataFrames using a variety of formats.
[Figure: logos of built-in and external data sources, including JSON and JDBC, and more…]
Find more sources at http://spark-packages.org/
ETL Using Custom Data Sources
sqlContext.read
  .format("com.databricks.spark.jira")
  .option("url", "https://issues.apache.org/jira/rest/api/latest/search")
  .option("user", "marmbrus")
  .option("password", "*******")
  .option("query", """
    |project = SPARK AND
    |component = SQL AND
    |(status = Open OR status = "In Progress" OR status = Reopened)""".stripMargin)
  .load()
  .repartition(1)
  .write
  .format("parquet")
  .saveAsTable("sparkSqlJira")
Write Less Code: High-Level Operations
Solve common problems concisely using DataFrame functions:
•  Selecting columns and filtering
•  Joining different data sources
•  Aggregation (count, sum, average, etc)
•  Plotting results with Pandas
Write Less Code: Compute an Average
// Hadoop MapReduce (Java). Mapper: emits every value under a single key.
private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(
    LongWritable key,
    Text value,
    Context context) throws IOException, InterruptedException {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

// Reducer: sums and counts the values, then writes the average.
IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(
    IntWritable key,
    Iterable<IntWritable> values,
    Context context) throws IOException, InterruptedException {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

# Spark RDD API (Python): carry a [sum, count] pair per key, then divide.
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
   .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
   .map(lambda x: [x[0], float(x[1][0]) / x[1][1]]) \
   .collect()
Write Less Code: Compute an Average
Using RDDs

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
   .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
   .map(lambda x: [x[0], float(x[1][0]) / x[1][1]]) \
   .collect()

Using DataFrames

from pyspark.sql.functions import avg

# groupBy keeps the "name" column in the result (Spark 1.4+).
sqlCtx.table("people") \
   .groupBy("name") \
   .agg(avg("age")) \
   .collect()
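The RDD version works by carrying a (sum, count) pair for every key and dividing only at the end. The sketch below reproduces that idea in plain Python, with no Spark required, so the steps can be run locally. The helper name `average_by_key` and the sample lines are hypothetical, and the sort stands in for the shuffle that Spark performs across machines.

```python
from functools import reduce
from itertools import groupby


def average_by_key(lines):
    """Average the second tab-separated field, grouped by the first."""
    # Map step: parse each line into (key, (value, 1)).
    pairs = []
    for line in lines:
        fields = line.split("\t")
        pairs.append((fields[0], (int(fields[1]), 1)))

    # Shuffle step (local stand-in): bring equal keys next to each other.
    pairs.sort(key=lambda kv: kv[0])

    # Reduce step: add up (sum, count) pairs per key, then divide once.
    averages = {}
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total, count = reduce(
            lambda x, y: (x[0] + y[0], x[1] + y[1]),
            (value for _, value in group),
        )
        averages[key] = total / count
    return averages


sample = ["alice\t30", "bob\t20", "alice\t40"]
print(average_by_key(sample))
```

Keeping the count next to the sum is what makes the reduction safe to apply in any order; averaging partial averages directly would give the wrong answer when groups differ in size.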
Full API Docs
•  Python
•  Scala
•  Java
•  R
Not Just Less Code: Faster Implementations
[Bar chart: time to aggregate 10 million int pairs (secs), on a 0 to 10 scale, for RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame R, and DataFrame SQL]
Demo
Combine data from two sources
Running in Databricks:
•  Hosted Spark in the cloud
•  Notebooks with integrated visualization
•  Scheduled production jobs
https://accounts.cloud.databricks.com/
6/15/2015 demo - Databricks
https://demo.cloud.databricks.com/#notebook/43587 1/5
Command took 2.08s -- by dbadmin at 6/15/2015, 1:17:07 PM on michael (54 GB)
%run /home/michael/ss.2015.demo/spark.sql.lib ...
%sql SELECT * FROM sparkSqlJira
Command took 1.95s -- by dbadmin at 6/15/2015, 1:18:46 PM on michael (54 GB)
rawPRs: org.apache.spark.sql.DataFrame = [commenters: array<struct<data:struct<asked_to_close:boolean,avatar:string,body:string,date:array<string>,diff_hunk:string,said_lgtm:boolean,url:string>,username:string>>, components: array<string>, is_mergeable: boolean, jira_issuetype_icon_url: string, jira_issuetype_name: string, jira_priority_icon_url: string, jira_priority_name: string, last_jenkins_comment: struct<body:string,html_url:string,user:struct<login:string>>, last_jenkins_outcome: string, lines_added: bigint, lines_changed: bigint, lines_deleted: bigint, number: bigint, parsed_title: struct<jiras:array<bigint>,metadata:string,title:string>, state: string, updated_at: string, user: string]
Command took 2.01s -- by dbadmin at 6/15/2015, 1:19:08 PM on michael (54 GB)
val rawPRs = sqlContext.read
  .format("com.databricks.spark.rest")
  .option("url", "https://spark-prs.appspot.com/search-open-prs")
  .load()
display(rawPRs)
Command took 0.26s -- by dbadmin at 6/15/2015, 1:19:39 PM on michael (54 GB)
import org.apache.spark.sql.functions._
sparkPRs: org.apache.spark.sql.DataFrame = [component: string, pr_jira: string, title: string, jira_issuetype_icon_url: string, jira_priority_icon_url: string, number: bigint, commenters: array<struct<data:struct<asked_to_close:boolean,avatar:string,body:string,date:array<string>,diff_hunk:string,said_lgtm:boolean,url:string>,username:string>>, user: string, last_jenkins_outcome: string, is_mergeable: boolean]
import org.apache.spark.sql.functions._
val sparkPRs = rawPRs
  .select(
    // "Explode" nested array to create one row per item.
    explode($"components").as("component"),

    // Use a built-in function to construct the full 'SPARK-XXXX' key
    concat("SPARK-", $"parsed_title.jiras"(0)).as("pr_jira"),

    // Other required columns.
    $"parsed_title.title",
    $"jira_issuetype_icon_url",
    $"jira_priority_icon_url",
    $"number",
    $"commenters",
    $"user",
    $"last_jenkins_outcome",
    $"is_mergeable")
  .where($"component" === "SQL") // Select only SQL PRs
Command took 7.55s -- by dbadmin at 6/15/2015, 1:20:15 PM on michael (54 GB)
table("sparkSqlJira")
  .join(sparkPRs, $"key" === $"pr_jira")
  .jiraTable
Plan Optimization & Execution
[Diagram: a SQL AST or DataFrame becomes an Unresolved Logical Plan. Analysis, using the Catalog, produces a Logical Plan. Logical Optimization produces an Optimized Logical Plan. Physical Planning produces Physical Plans, from which a Cost Model picks the Selected Physical Plan. Code Generation turns it into RDDs.]
DataFrames and SQL share the same optimization/execution pipeline
Seamlessly Integrated
Intermix DataFrame operations with custom Python, Java, R, or Scala code
zipToCity = udf(lambda zipCode: <custom logic here>)

def add_demographics(events):
   u = sqlCtx.table("users")
   return (events
     .join(u, events.user_id == u.user_id)
     .withColumn("city", zipToCity(u.zip)))  # assumes zip is a column of users
Augments any DataFrame that contains user_id
Optimize Entire Pipelines
Optimization happens as late as possible, so Spark SQL can optimize even across functions.
events = add_demographics(sqlCtx.load("/data/events", "json"))

training_data = (events
  .where(events.city == "San Francisco")
  .select(events.timestamp)
  .collect())
def add_demographics(events):
   u = sqlCtx.table("users")                      # Load Hive table
   return (events
     .join(u, events.user_id == u.user_id)        # Join on user_id
     .withColumn("city", zipToCity(u.zip)))       # Run udf to add city column (assumes zip is a column of users)

events = add_demographics(sqlCtx.load("/data/events", "json"))
training_data = events.where(events.city == "San Francisco").select(events.timestamp).collect()

Logical Plan: a filter on top of a join of the events file with the users table. The join is expensive.
Physical Plan: a join of scan (events) with a filter over scan (users), so only relevant users are joined.
def add_demographics(events):
   u = sqlCtx.table("users")                      # Load partitioned Hive table
   return (events
     .join(u, events.user_id == u.user_id)        # Join on user_id
     .withColumn("city", zipToCity(u.zip)))       # Run udf to add city column (assumes zip is a column of users)

events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = events.where(events.city == "San Francisco").select(events.timestamp).collect()

Logical Plan: a filter on top of a join of the events file with the users table.
Physical Plan: a join of scan (events) with a filter over scan (users).
Optimized Physical Plan with Predicate Pushdown and Column Pruning: a join of optimized scan (events) with optimized scan (users).
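To make the benefit of moving the filter below the join concrete, here is a toy sketch in plain Python rather than Spark. Everything in it is hypothetical: the function names, the sample rows, and the simplification that `city` is already stored on each user instead of being computed by a UDF. It only illustrates that both orderings return the same answer while the pushed-down version feeds fewer rows through the join.

```python
def join_on_user_id(events, users):
    """Naive hash join of two lists of dicts on user_id."""
    users_by_id = {u["user_id"]: u for u in users}
    return [
        {**e, **users_by_id[e["user_id"]]}
        for e in events
        if e["user_id"] in users_by_id
    ]


def plan_as_written(events, users, city):
    # Join everything, then filter: mirrors the logical plan.
    joined = join_on_user_id(events, users)
    timestamps = [row["timestamp"] for row in joined if row["city"] == city]
    return timestamps, len(joined)


def plan_with_pushdown(events, users, city):
    # Filter users first, then join: only relevant users reach the join.
    relevant = [u for u in users if u["city"] == city]
    joined = join_on_user_id(events, relevant)
    return [row["timestamp"] for row in joined], len(joined)


users = [
    {"user_id": 1, "city": "San Francisco"},
    {"user_id": 2, "city": "Oakland"},
    {"user_id": 3, "city": "San Francisco"},
]
events = [
    {"user_id": 1, "timestamp": 100},
    {"user_id": 2, "timestamp": 101},
    {"user_id": 2, "timestamp": 102},
    {"user_id": 3, "timestamp": 103},
]

naive_result, naive_rows = plan_as_written(events, users, "San Francisco")
pushed_result, pushed_rows = plan_with_pushdown(events, users, "San Francisco")
print(naive_result == pushed_result, naive_rows, pushed_rows)
```

In Spark SQL the user writes the first form and the optimizer is responsible for finding the second; the sketch hard-codes both only to compare them.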
Machine Learning Pipelines
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

df = sqlCtx.load("/path/to/data")
model = pipeline.fit(df)
[Diagram: Pipeline Model. ds0 → tokenizer → ds1 → hashingTF → ds2 → lr.model → ds3, where lr.model is produced by fitting lr.]
Find out more during Joseph’s Talk: 3pm Today
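For readers unfamiliar with the first two pipeline stages, the following plain-Python sketch shows the general idea of tokenizing text and then hashing words into a fixed-length count vector. It is not Spark's implementation: the function names are hypothetical, `zlib.crc32` is a stand-in hash chosen only because it is deterministic, and the vector size of 16 is arbitrary.

```python
import zlib


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def hashing_tf(words, num_features=16):
    """Hash each word to a bucket and count occurrences per bucket."""
    vector = [0] * num_features
    for word in words:
        # crc32 is a stand-in hash for illustration, not the one Spark uses.
        bucket = zlib.crc32(word.encode("utf-8")) % num_features
        vector[bucket] += 1
    return vector


def run_stages(value, stages):
    """Feed the output of each stage into the next, like a pipeline."""
    for stage in stages:
        value = stage(value)
    return value


features = run_stages("Spark SQL is about more than SQL", [tokenize, hashing_tf])
print(len(features), sum(features))
```

The vector length is fixed in advance, so no vocabulary has to be built first; the trade-off is that different words can collide in the same bucket.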
Project Tungsten: Initial Results
[Chart: average GC time per node (seconds), on a 0 to 200 scale, against relative data set size (1x, 2x, 4x, 8x, 16x) for Default, Code Gen, Tungsten onheap, and Tungsten offheap]
Find out more during Josh’s Talk: 5pm Tomorrow
Questions?
Spark SQL Office Hours Today
-  Michael Armbrust 1:45-2:30
-  Yin Huai 3:40-4:15
Spark SQL Office Hours Tomorrow
-  Reynold 1:45-2:30

Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 

Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summit 2015

  • 1. Spark DataFrames: Simple and Fast Analytics on Structured Data Michael Armbrust Spark Summit 2015 - June, 15th
  • 2. About Me and SQL. Spark SQL: part of the core distribution since Spark 1.0 (April 2014); graduated from alpha in 1.3. [Charts: # of commits per month; # of contributors]
  • 3. Spark SQL runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments. Improved multi-version support in 1.4. Example: SELECT COUNT(*) FROM hiveTable WHERE hive_udf(data)
  • 4. Spark SQL: connect existing BI tools to Spark through JDBC.
  • 5. Spark SQL: bindings in Python, Scala, Java, and R.
  • 6. About me: @michaelarmbrust, lead developer of Spark SQL @databricks.
  • 7. The not-so-secret truth... Spark SQL is about more than SQL.
  • 8. Spark SQL: The whole story. Creating and Running Spark Programs Faster: •  Write less code •  Read less data •  Let the optimizer do the hard work
  • 9. DataFrame noun – [dey-tuh-freym] 1.  A distributed collection of rows organized into named columns. 2.  An abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas). 3.  Archaic: Previously SchemaRDD (cf. Spark < 1.3).
  • 10. Write Less Code: Input & Output. Unified interface to reading/writing data in a variety of formats:
      df = sqlContext.read
        .format("json")
        .option("samplingRatio", "0.1")
        .load("/home/michael/data.json")

      df.write
        .format("parquet")
        .mode("append")
        .partitionBy("year")
        .saveAsTable("fasterData")
  • 11. Write Less Code: Input & Output (code as on slide 10). The read and write functions create new builders for doing I/O.
  • 12. Write Less Code: Input & Output (code as on slide 10). Builder methods specify: •  Format •  Partitioning •  Handling of existing data
  • 13. Write Less Code: Input & Output (code as on slide 10). load(…), save(…) or saveAsTable(…) finish the I/O specification.
  • 14. Write Less Code: Input & Output. Spark SQL’s Data Source API can read and write DataFrames using a variety of formats. [Logo grid of built-in and external sources, including { JSON } and JDBC, and more…] Find more sources at http://spark-packages.org/
  • 15. ETL Using Custom Data Sources
      sqlContext.read
        .format("com.databricks.spark.jira")
        .option("url", "https://issues.apache.org/jira/rest/api/latest/search")
        .option("user", "marmbrus")
        .option("password", "*******")
        .option("query", """
          |project = SPARK AND
          |component = SQL AND
          |(status = Open OR status = "In Progress" OR status = Reopened)""".stripMargin)
        .load()
        .repartition(1)
        .write
        .format("parquet")
        .saveAsTable("sparkSqlJira")
  • 16. Write Less Code: High-Level Operations. Solve common problems concisely using DataFrame functions: •  Selecting columns and filtering •  Joining different data sources •  Aggregation (count, sum, average, etc.) •  Plotting results with Pandas
  • 17. Write Less Code: Compute an Average
      (Hadoop MapReduce, Java)
      private IntWritable one = new IntWritable(1);
      private IntWritable output = new IntWritable();

      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Split the tab-separated line and emit the integer in column 1.
        String[] fields = value.toString().split("\t");
        output.set(Integer.parseInt(fields[1]));
        context.write(one, output);
      }

      IntWritable one = new IntWritable(1);
      DoubleWritable average = new DoubleWritable();

      protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        int count = 0;
        for (IntWritable value : values) {
          sum += value.get();
          count++;
        }
        average.set(sum / (double) count);
        context.write(key, average);
      }

      (Spark RDDs, Python)
      data = sc.textFile(...).map(lambda line: line.split("\t"))
      data.map(lambda x: (x[0], [int(x[1]), 1])) \
         .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
         .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
         .collect()
  • 18. Write Less Code: Compute an Average
      Using RDDs
      data = sc.textFile(...).map(lambda line: line.split("\t"))
      data.map(lambda x: (x[0], [int(x[1]), 1])) \
         .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
         .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
         .collect()

      Using DataFrames
      sqlCtx.table("people") \
         .groupBy("name") \
         .agg(avg("age")) \
         .collect()

      Full API docs: •  Python •  Scala •  Java •  R
  • 19. Not Just Less Code: Faster Implementations. [Bar chart: time to aggregate 10 million int pairs (secs), axis 0 to 10, for RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame R, DataFrame SQL]
  • 20. Demo. Combine data from [logo] with data from [logo]. Running in [logo]: •  Hosted Spark in the cloud •  Notebooks with integrated visualization •  Scheduled production jobs https://accounts.cloud.databricks.com/
  • 21. 6/15/2015 demo - Databricks https://demo.cloud.databricks.com/#notebook/43587 (page 1/5)
      Command took 2.08s -- by dbadmin at 6/15/2015, 1:17:07 PM on michael (54 GB)
      > %run /home/michael/ss.2015.demo/spark.sql.lib
      ...
      > %sql SELECT * FROM sparkSqlJira
  • 22. Demo notebook, page 2/5. [No text recovered]
  • 23. Demo notebook, page 3/5.
      Command took 1.95s -- by dbadmin at 6/15/2015, 1:18:46 PM on michael (54 GB)
      > val rawPRs = sqlContext.read
          .format("com.databricks.spark.rest")
          .option("url", "https://spark-prs.appspot.com/search-open-prs")
          .load()
        display(rawPRs)
      rawPRs: org.apache.spark.sql.DataFrame = [commenters: array<struct<data:struct<asked_to_close:boolean,avatar:string,body:string,date:array<string>,diff_hunk:string,said_lgtm:boolean,url:string>,username:string>>, components: array<string>, is_mergeable: boolean, jira_issuetype_icon_url: string, jira_issuetype_name: string, jira_priority_icon_url: string, jira_priority_name: string, last_jenkins_comment: struct<body:string,html_url:string,user:struct<login:string>>, last_jenkins_outcome: string, lines_added: bigint, lines_changed: bigint, lines_deleted: bigint, number: bigint, parsed_title: struct<jiras:array<bigint>,metadata:string,title:string>, state: string, updated_at: string, user: string]
      Command took 2.01s -- by dbadmin at 6/15/2015, 1:19:08 PM on michael (54 GB)
  • 24. Demo notebook, page 4/5.
      > import org.apache.spark.sql.functions._
        val sparkPRs = rawPRs
          .select(
            // "Explode" nested array to create one row per item.
            explode($"components").as("component"),
            // Use a built-in function to construct the full 'SPARK-XXXX' key
            concat("SPARK-", $"parsed_title.jiras"(0)).as("pr_jira"),
            // Other required columns.
            $"parsed_title.title", $"jira_issuetype_icon_url", $"jira_priority_icon_url",
            $"number", $"commenters", $"user", $"last_jenkins_outcome", $"is_mergeable")
          .where($"component" === "SQL") // Select only SQL PRs
      sparkPRs: org.apache.spark.sql.DataFrame = [component: string, pr_jira: string, title: string, jira_issuetype_icon_url: string, jira_priority_icon_url: string, number: bigint, commenters: array<struct<data:struct<asked_to_close:boolean,avatar:string,body:string,date:array<string>,diff_hunk:string,said_lgtm:boolean,url:string>,username:string>>, user: string, last_jenkins_outcome: string, is_mergeable: boolean]
      Command took 0.26s -- by dbadmin at 6/15/2015, 1:19:39 PM on michael (54 GB)
  • 25. Demo notebook, page 5/5.
      > table("sparkSqlJira")
          .join(sparkPRs, $"key" === $"pr_jira")
          .jiraTable
      Command took 7.55s -- by dbadmin at 6/15/2015, 1:20:15 PM on michael (54 GB)
  • 26. Plan Optimization & Execution. DataFrames and SQL share the same optimization/execution pipeline:
      SQL AST / DataFrame
        → Unresolved Logical Plan
        → [Analysis, using the Catalog] → Logical Plan
        → [Logical Optimization] → Optimized Logical Plan
        → [Physical Planning] → Physical Plans
        → [Cost Model] → Selected Physical Plan
        → [Code Generation] → RDDs
  • 27. Seamlessly Integrated. Intermix DataFrame operations with custom Python, Java, R, or Scala code:
      zipToCity = udf(lambda zipCode: <custom logic here>)

      def add_demographics(events):
         u = sqlCtx.table("users")
         return (events
           .join(u, events.user_id == u.user_id)
           .withColumn("city", zipToCity(df.zip)))
      add_demographics augments any DataFrame that contains user_id.
  • 28. Optimize Entire Pipelines. Optimization happens as late as possible, so Spark SQL can optimize even across functions.
      events = add_demographics(sqlCtx.load("/data/events", "json"))

      training_data = (events
        .where(events.city == "San Francisco")
        .select(events.timestamp)
        .collect())
  • 30. def add_demographics(events):
         u = sqlCtx.table("users")                      # Load partitioned Hive table
         return (events
           .join(u, events.user_id == u.user_id)        # Join on user_id
           .withColumn("city", zipToCity(df.zip)))      # Run udf to add city column

      events = add_demographics(sqlCtx.load("/data/events", "parquet"))
      training_data = events.where(events.city == "San Francisco").select(events.timestamp).collect()

      Logical Plan: filter → join(events file, users table)
      Physical Plan: join(scan (events), filter → scan (users))
      Optimized Physical Plan with Predicate Pushdown and Column Pruning: join(optimized scan (events), optimized scan (users))
  • 32. Project Tungsten: Initial Results. [Chart: average GC time per node (seconds), axis 0 to 200, vs. relative data set size (1x, 2x, 4x, 8x, 16x), for Default, Code Gen, Tungsten onheap, Tungsten offheap] Find out more during Josh’s talk: 5pm tomorrow.
  • 33. Questions? Spark SQL Office Hours Today: Michael Armbrust 1:45-2:30; Yin Huai 3:40-4:15. Spark SQL Office Hours Tomorrow: Reynold 1:45-2:30.
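
Note on slide 18: the RDD version computes a per-key average by carrying a [sum, count] pair through reduceByKey and dividing only at the end. The same pattern can be sketched in plain Python without Spark. The helper name `average_by_key` and the sample rows are assumptions made for illustration, not part of the talk.

```python
def average_by_key(rows):
    """Average the integer in column 1, grouped by the key in column 0."""
    # "map" step: turn each row into (key, [value, 1]).
    pairs = [(row[0], [int(row[1]), 1]) for row in rows]

    # "reduceByKey" step: merge [sum, count] pairs that share a key.
    merged = {}
    for key, (value, count) in pairs:
        if key in merged:
            merged[key] = [merged[key][0] + value, merged[key][1] + count]
        else:
            merged[key] = [value, count]

    # Final "map" step: divide each sum by its count.
    return {key: total / count for key, (total, count) in merged.items()}


# Hypothetical tab-separated input, standing in for sc.textFile(...).
lines = ["alice\t30", "bob\t20", "alice\t40"]
rows = [line.split("\t") for line in lines]
averages = average_by_key(rows)
```

Carrying the count alongside the sum is what makes the merge step associative, which is why the RDD version cannot simply average the averages.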
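
Note on slide 24: the demo "explodes" the nested components array so that each array element gets its own row, then filters on the new column. A minimal plain-Python sketch of that row-multiplying behaviour follows. The `explode` helper here is a stand-in written for illustration (it is not Spark's function), and the two sample pull requests are invented.

```python
def explode(rows, array_field, out_field):
    """Yield one output row per element of row[array_field]."""
    for row in rows:
        for item in row[array_field]:
            # Copy every other column and replace the array with one element.
            out = {k: v for k, v in row.items() if k != array_field}
            out[out_field] = item
            yield out


# Hypothetical sample data shaped like the demo's rawPRs.
raw_prs = [
    {"number": 1, "components": ["SQL", "Core"]},
    {"number": 2, "components": ["MLlib"]},
]

exploded = list(explode(raw_prs, "components", "component"))
# Equivalent of .where($"component" === "SQL") on the exploded rows.
sql_prs = [row for row in exploded if row["component"] == "SQL"]
```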
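
Note on slide 30: predicate pushdown and column pruning change how much data is touched, not what the query returns. The toy below, in plain Python, compares a join-then-filter plan with one that filters users first and reads only the needed event columns. It is a sketch under simplifying assumptions: the sample tables are invented, city is stored directly on users rather than derived by a UDF, and user_id is unique in users.

```python
events = [
    {"user_id": 1, "timestamp": 100, "payload": "a"},
    {"user_id": 2, "timestamp": 200, "payload": "b"},
    {"user_id": 1, "timestamp": 300, "payload": "c"},
]
users = [
    {"user_id": 1, "city": "San Francisco", "name": "x"},
    {"user_id": 2, "city": "Oakland", "name": "y"},
]


def naive_plan(events, users):
    """Join everything, then filter, then project (the logical plan order)."""
    joined = [
        {**e, **u}
        for e in events
        for u in users
        if e["user_id"] == u["user_id"]
    ]
    filtered = [row for row in joined if row["city"] == "San Francisco"]
    return [row["timestamp"] for row in filtered]


def optimized_plan(events, users):
    """Filter users before the join and read only the columns that are used."""
    # Predicate pushdown: apply the city filter while scanning users.
    sf_user_ids = {u["user_id"] for u in users if u["city"] == "San Francisco"}
    # Column pruning: keep only user_id and timestamp from events.
    pruned = [(e["user_id"], e["timestamp"]) for e in events]
    # Join against the already-filtered user ids.
    return [ts for uid, ts in pruned if uid in sf_user_ids]


naive_result = naive_plan(events, users)
optimized_result = optimized_plan(events, users)
```

Both plans return the same timestamps; the optimized one never materializes the payload or name columns and never joins rows for users outside San Francisco.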