Spark DataFrames:
Simple and Fast Analytics
on Structured Data
Michael Armbrust
Spark Summit Amsterdam 2015 - October 28th
About Me and Spark SQL
Graduated from Alpha in 1.3
• Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
[Charts: # of commits per month (0-300) and # of contributors (0-200)]
About Me and Spark SQL
Improved multi-version support in 1.4
SELECT COUNT(*)
FROM hiveTable
WHERE hive_udf(data)
• Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments
About Me and Spark SQL
• Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments
• Connect existing BI tools to Spark through JDBC
About Me and Spark SQL
• Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments
• Connect existing BI tools to Spark through JDBC
• Bindings in Python, Scala, Java, and R
About Me and Spark SQL
• Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments
• Connect existing BI tools to Spark through JDBC
• Bindings in Python, Scala, Java, and R
• @michaelarmbrust
• Creator of Spark SQL @databricks
The not-so-secret truth...
Spark SQL is about more than SQL.
Create and run Spark programs faster with Spark SQL:
• Write less code
• Read less data
• Let the optimizer do the hard work
DataFrame
noun – [dey-tuh-freym]
1. A distributed collection of rows organized into named columns.
2. An abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas).
3. Archaic: Previously SchemaRDD (cf. Spark < 1.3).
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:

df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:
The read and write functions create new builders for doing I/O.

df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:
Builder methods specify:
• Format
• Partitioning
• Handling of existing data

df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:
load(…), save(…) or saveAsTable(…) finish the I/O specification.

df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")
Read Less Data: Efficient Formats
Parquet and ORC are efficient columnar storage formats:
• Compact binary encoding with intelligent compression (delta, RLE, etc.)
• Each column stored separately with an index that allows skipping of unread columns
• Support for partitioning (/data/year=2015)
• Data skipping using statistics (column min/max, etc.)
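A hedged sketch of how this pays off, against the PySpark 1.5-era API (the path and column names are invented for illustration):

# Write events partitioned by year, then read back one partition/column.
df.write \
  .format("parquet") \
  .partitionBy("year") \
  .save("/data/events")

events_2015 = sqlContext.read.parquet("/data/events") \
  .where("year = 2015") \
  .select("timestamp")
# Only the year=2015 directory is scanned, and only the timestamp
# column is decoded from the columnar files.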
Write Less Code: Data Source API
Spark SQL's Data Source API can read and write DataFrames using a variety of formats.
Built-in: { JSON }, JDBC, Parquet, ORC, plain text*, and more…
External: find more sources at http://spark-packages.org/
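The built-in JDBC source uses the same read builder; a minimal hedged sketch (the connection URL and table name are placeholders):

users = sqlContext.read \
  .format("jdbc") \
  .option("url", "jdbc:postgresql://host:5432/mydb") \
  .option("dbtable", "users") \
  .load()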
ETL Using Custom Data Sources
sqlContext.read
  .format("com.databricks.spark.jira")
  .option("url", "https://issues.apache.org/jira/rest/api/latest/search")
  .option("user", "marmbrus")
  .option("password", "*******")
  .option("query", """
    |project = SPARK AND
    |component = SQL AND
    |(status = Open OR status = "In Progress" OR status = Reopened)""".stripMargin)
  .load()
  .repartition(1)
  .write
  .format("parquet")
  .saveAsTable("sparkSqlJira")

Load data from JIRA (Spark's bug tracker) using a custom data source, and write the converted data out to a Parquet table stored in Hive.
Write Less Code: High-Level Operations
Solve common problems concisely using DataFrame functions (a combined sketch follows this list):
• Selecting columns and filtering
• Joining different data sources
• Aggregation (count, sum, average, etc.)
• Plotting results with Pandas
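A hedged end-to-end sketch of those four operations together, assuming hypothetical users and events tables registered in the catalog:

from pyspark.sql.functions import avg

users  = sqlCtx.table("users")    # hypothetical table
events = sqlCtx.table("events")   # hypothetical table

stats = events \
  .filter(events.duration > 0) \
  .select("user_id", "duration") \
  .join(users, events.user_id == users.user_id) \
  .groupBy(users.country) \
  .agg(avg("duration").alias("avg_duration"))

# Collect the small aggregate locally and plot it with Pandas.
stats.toPandas().plot(x="country", y="avg_duration", kind="bar")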
Write Less Code: Compute an Average

Hadoop MapReduce:
private IntWritable one =
  new IntWritable(1);
private IntWritable output =
  new IntWritable();
protected void map(
    LongWritable key,
    Text value,
    Context context) {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();
protected void reduce(
    IntWritable key,
    Iterable<IntWritable> values,
    Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

Spark RDDs:
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
  .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
  .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
  .collect()
Write Less Code: Compute an Average

Using RDDs:
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
  .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
  .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
  .collect()

Using DataFrames:
sqlCtx.table("people") \
  .groupBy("name") \
  .agg(avg("age")) \
  .map(lambda …) \
  .collect()

Using SQL:
SELECT name, avg(age)
FROM people
GROUP BY name

Full API Docs: Python, Scala, Java, R
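To run the DataFrame and SQL variants side by side, a self-contained hedged sketch (the people data is invented):

from pyspark.sql import Row
from pyspark.sql.functions import avg

people = sqlContext.createDataFrame([
    Row(name="Ann", age=31), Row(name="Ann", age=35),
    Row(name="Bob", age=24)])
people.registerTempTable("people")

people.groupBy("name").agg(avg("age")).show()
sqlContext.sql("SELECT name, avg(age) FROM people GROUP BY name").show()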
Not Just Less Code, Faster Too!
[Bar chart: time to aggregate 10 million int pairs (secs), comparing RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame R, and DataFrame SQL on a 0-10 second scale]
Plan Optimization & Execution
DataFrames and SQL share the same optimization/execution pipeline:
SQL AST / DataFrame
  → Unresolved Logical Plan
  → Analysis (using the Catalog)
  → Logical Plan
  → Logical Optimization
  → Optimized Logical Plan
  → Physical Planning → Physical Plans
  → Cost Model → Selected Physical Plan
  → Code Generation
  → RDDs
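You can watch this pipeline at work with explain(); a brief hedged sketch (the exact output shape varies by Spark version):

df = sqlContext.table("people").groupBy("name").agg({"age": "avg"})
df.explain(True)  # prints the parsed, analyzed, optimized, and physical plans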
Seamlessly Integrated
Intermix DataFrame operations with custom Python, Java, R, or Scala code:

zipToCity = udf(lambda zipCode: <custom logic here>)

def add_demographics(events):
    u = sqlCtx.table("users")
    return (events
        .join(u, events.user_id == u.user_id)
        .withColumn("city", zipToCity(u.zip)))

Augments any DataFrame that contains user_id.
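Filled in with a toy lookup, a runnable hedged sketch of the same pattern (the zip table is invented; note that a PySpark udf declares its return type):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

zip_lookup = {"94105": "San Francisco", "1012": "Amsterdam"}   # invented
zipToCity = udf(lambda z: zip_lookup.get(z, "unknown"), StringType())

with_city = users.withColumn("city", zipToCity(users.zip))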
Optimize Full Pipelines
Optimization happens as late as possible, so Spark SQL can optimize even across functions.

events = add_demographics(sqlCtx.load("/data/events", "json"))

training_data = events \
  .where(events.city == "Amsterdam") \
  .select(events.timestamp) \
  .collect()
def add_demographics(events):
    u = sqlCtx.table("users")                     # Load Hive table
    return (events
        .join(u, events.user_id == u.user_id)    # Join on user_id
        .withColumn("city", zipToCity(u.zip)))   # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "json"))
training_data = events.where(events.city == "Amsterdam").select(events.timestamp).collect()

[Diagram] Logical Plan: join the events file with the users table, then filter by city. The join is expensive, so the optimizer pushes the filter below the join to join only the relevant users.
Physical Plan: join(scan(events), filter by city → scan(users))
def add_demographics(events):
    u = sqlCtx.table("users")                     # Load partitioned Hive table
    return (events
        .join(u, events.user_id == u.user_id)    # Join on user_id
        .withColumn("city", zipToCity(u.zip)))   # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = events.where(events.city == "Amsterdam").select(events.timestamp).collect()

[Diagram] The same logical plan (join events file with users table, then filter by city) and physical plan (join of scan(events) with filter-then-scan(users)) now compile to an Optimized Physical Plan with Predicate Pushdown and Column Pruning: a join of an optimized scan (events) with an optimized scan (users).
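One hedged way to confirm the optimization actually happened is to inspect the physical plan of the same query:

events.where(events.city == "Amsterdam") \
      .select(events.timestamp) \
      .explain()
# Look for pushed filters and a pruned column list on the scan nodes,
# rather than a lone filter applied after the join.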
Machine Learning Pipelines

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

df = sqlCtx.load("/path/to/data")
model = pipeline.fit(df)

[Diagram] ds0 → tokenizer → ds1 → hashingTF → ds2 → lr.model → ds3; fitting the Pipeline (with lr as the final stage) produces a Pipeline Model.
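Once fit, the resulting model transforms new DataFrames the same way; a minimal hedged sketch continuing from the code above (the test path is invented, and the data is assumed to have a text column):

test_df = sqlCtx.load("/path/to/test/data")
predictions = model.transform(test_df)
predictions.select("text", "prediction").show()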
Rich Function Library
Added in Spark 1.5
• 100+ native functions with optimized codegen implementations
  – String manipulation – concat, format_string, lower, lpad
  – Date/Time – current_timestamp, date_format, date_add, …
  – Math – sqrt, randn, …
  – Other – monotonicallyIncreasingId, sparkPartitionId, …

from pyspark.sql.functions import *
yesterday = date_sub(current_date(), 1)
df2 = df.filter(df.created_at > yesterday)

import org.apache.spark.sql.functions._
val yesterday = date_sub(current_date(), 1)
val df2 = df.filter(df("created_at") > yesterday)
Optimized Execution with Project Tungsten
Compact encoding, cache-aware algorithms, runtime code generation
The Overheads of JVM Objects
“abcd”
• Native: 4 bytes with UTF-8 encoding
• Java: 48 bytes

java.lang.String object internals:
 OFFSET  SIZE    TYPE    DESCRIPTION      VALUE
      0     4            (object header)  ...
      4     4            (object header)  ...
      8     4            (object header)  ...
     12     4    char[]  String.value     []
     16     4    int     String.hash      0
     20     4    int     String.hash32    0
Instance size: 24 bytes (reported by Instrumentation API)

Callouts: 12-byte object header; 8 bytes of hashcode fields; 20 bytes of data + overhead.
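A back-of-the-envelope sketch of where the 48 bytes come from, assuming a 64-bit HotSpot JVM with compressed oops (the per-field sizes are assumptions based on the layout above):

header = 12         # object header (mark word + compressed class pointer)
value_ref = 4       # reference to the char[] backing array
hash_field = 4      # cached hashCode
hash32_field = 4    # alternate hash in Java 7/8-era String
string_obj = header + value_ref + hash_field + hash32_field
                    # = 24 bytes, matching the Instrumentation API figure

array_header = 16   # char[] header: object header + length field + padding
char_data = 4 * 2   # four UTF-16 code units, 2 bytes each
array_obj = array_header + char_data   # 24 bytes (already 8-byte aligned)

print(string_obj + array_obj)   # -> 48 bytes, versus 4 bytes of raw UTF-8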
Tungsten’s Compact Encoding
(123, “data”, “bricks”) is laid out as:
| 0x0 (null bitmap) | 123 | 32L (offset to data) | 48L (offset to data) | 4 “data” | 6 “bricks” |
with field lengths stored alongside the variable-length bytes.
“abcd” with Tungsten encoding: ~5-6 bytes
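A conceptual sketch of that layout in plain Python (not Spark's actual UnsafeRow code; the real format 8-byte-aligns the variable-length section, so its offsets differ from the simplified ones computed here):

import struct

def encode_row(i, s1, s2):
    null_bitmap = 0x0                 # no null fields
    base = 8 + 3 * 8                  # bitmap word + three 8-byte slots = 32
    b1, b2 = s1.encode("utf-8"), s2.encode("utf-8")
    # Variable-length slots pack (offset << 32 | length).
    slot1 = (base << 32) | len(b1)
    slot2 = ((base + len(b1)) << 32) | len(b2)
    return struct.pack("<qqqq", null_bitmap, i, slot1, slot2) + b1 + b2

row = encode_row(123, "data", "bricks")
print(len(row))   # 8 (bitmap) + 24 (slots) + 10 (string bytes) = 42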
Runtime Bytecode Generation

DataFrame Code / SQL:
df.where(df("year") > 2015)

Catalyst Expressions:
GreaterThan(year#234, Literal(2015))

Low-level bytecode:
bool filter(Object baseObject) {
  int offset = baseOffset + bitSetWidthInBytes + 3*8L;
  int value = Platform.getInt(baseObject, offset);
  return value > 2015;
}

Platform.getInt(baseObject, offset) is a JVM intrinsic, JIT-ed to pointer arithmetic.
Coming soon: Datasets
Preview in Spark 1.6
• Type-safe: operate on domain objects with compiled lambda functions
• Fast: Code-generated encoders for fast serialization
• Interoperable: Easily convert DataFrame ↔ Dataset without boilerplate

val df = ctx.read.json("people.json")

// Convert data to domain objects.
case class Person(name: String, age: Int)
val ds: Dataset[Person] = df.as[Person]
ds.filter(_.age > 30)

// Compute histogram of age by name.
val hist = ds.groupBy(_.name).mapGroups {
  case (name, people: Iter[Person]) =>
    val buckets = new Array[Int](10)
    people.map(_.age).foreach { a =>
      buckets(a / 10) += 1
    }
    (name, buckets)
}
Create and run your Spark programs faster with Spark SQL:
• Write less code
• Read less data
• Let the optimizer do the hard work

Questions?
Committer Office Hours:
Weds, 4:00-5:00 pm: Michael
Thurs, 10:30-11:30 am: Reynold
Thurs, 2:00-3:00 pm: Andrew