Soumettre la recherche
Mettre en ligne
Pivoting Data with SparkSQL by Andrew Ray
•
18 j'aime
•
16,393 vues
Spark Summit
Suivre
Spark Summit East Talk
Lire moins
Lire la suite
Données & analyses
Signaler
Partager
Signaler
Partager
1 sur 33
Télécharger maintenant
Télécharger pour lire hors ligne
Recommandé
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Spark Summit
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
Spark SQL Bucketing at Facebook
Spark SQL Bucketing at Facebook
Databricks
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
Spark SQL Beyond Official Documentation
Spark SQL Beyond Official Documentation
Databricks
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
Databricks
The Elastic ELK Stack
The Elastic ELK Stack
enterprisesearchmeetup
Recommandé
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Spark Summit
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
Spark SQL Bucketing at Facebook
Spark SQL Bucketing at Facebook
Databricks
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
Spark SQL Beyond Official Documentation
Spark SQL Beyond Official Documentation
Databricks
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
Databricks
The Elastic ELK Stack
The Elastic ELK Stack
enterprisesearchmeetup
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
Databricks
High-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQL
ScyllaDB
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
Databricks
Spark with Delta Lake
Spark with Delta Lake
Knoldus Inc.
Modeling Data and Queries for Wide Column NoSQL
Modeling Data and Queries for Wide Column NoSQL
ScyllaDB
Diving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction Log
Databricks
Change Data Feed in Delta
Change Data Feed in Delta
Databricks
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache Cassandra
Victor Coustenoble
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
Introduction to Apache Calcite
Introduction to Apache Calcite
Jordan Halterman
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
Databricks
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Databricks
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
Flink Forward
Apache Sentry for Hadoop security
Apache Sentry for Hadoop security
bigdatagurus_meetup
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Databricks
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation Engines
Databricks
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™
Databricks
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
Databricks
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache Spark
Databricks
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
StampedeCon
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS Developer
DataWorks Summit
Contenu connexe
Tendances
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
Databricks
High-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQL
ScyllaDB
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
Databricks
Spark with Delta Lake
Spark with Delta Lake
Knoldus Inc.
Modeling Data and Queries for Wide Column NoSQL
Modeling Data and Queries for Wide Column NoSQL
ScyllaDB
Diving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction Log
Databricks
Change Data Feed in Delta
Change Data Feed in Delta
Databricks
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache Cassandra
Victor Coustenoble
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
Introduction to Apache Calcite
Introduction to Apache Calcite
Jordan Halterman
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
Databricks
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Databricks
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
Flink Forward
Apache Sentry for Hadoop security
Apache Sentry for Hadoop security
bigdatagurus_meetup
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Databricks
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation Engines
Databricks
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™
Databricks
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
Databricks
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache Spark
Databricks
Tendances
(20)
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
High-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQL
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
Spark with Delta Lake
Spark with Delta Lake
Modeling Data and Queries for Wide Column NoSQL
Modeling Data and Queries for Wide Column NoSQL
Diving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction Log
Change Data Feed in Delta
Change Data Feed in Delta
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache Cassandra
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Introduction to Apache Calcite
Introduction to Apache Calcite
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
Apache Sentry for Hadoop security
Apache Sentry for Hadoop security
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation Engines
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache Spark
Similaire à Pivoting Data with SparkSQL by Andrew Ray
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
StampedeCon
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS Developer
DataWorks Summit
R Get Started II
R Get Started II
Sankhya_Analytics
RichardPughspatial.ppt
RichardPughspatial.ppt
EnnerHereniodeAlcnta
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Natalino Busa
NoSQL
NoSQL
Antonio Castellon
Grokking Engineering - Data Analytics Infrastructure at Viki - Huy Nguyen
Grokking Engineering - Data Analytics Infrastructure at Viki - Huy Nguyen
Huy Nguyen
Introduction to R for data science
Introduction to R for data science
Long Nguyen
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
Data Con LA
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
Databricks
IBM InterConnect 2016 - 3505 - Cloud-Based Analytics of The Weather Company i...
IBM InterConnect 2016 - 3505 - Cloud-Based Analytics of The Weather Company i...
Torsten Steinbach
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache Spark
DB Tsai
Spark DataFrames for Data Munging
Spark DataFrames for Data Munging
(Susan) Xinh Huynh
Adding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug Grall
Spark Summit
Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)
Daniel Chan
Why R? A Brief Introduction to the Open Source Statistics Platform
Why R? A Brief Introduction to the Open Source Statistics Platform
Syracuse University
MLconf NYC Shan Shan Huang
MLconf NYC Shan Shan Huang
MLconf
India software developers conference 2013 Bangalore
India software developers conference 2013 Bangalore
Satnam Singh
IBM Insight 2015 - 1823 - Geospatial analytics with dashDB in the cloud
IBM Insight 2015 - 1823 - Geospatial analytics with dashDB in the cloud
Torsten Steinbach
SRV405 Ancestry's Journey to Amazon Redshift
SRV405 Ancestry's Journey to Amazon Redshift
Amazon Web Services
Similaire à Pivoting Data with SparkSQL by Andrew Ray
(20)
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS Developer
R Get Started II
R Get Started II
RichardPughspatial.ppt
RichardPughspatial.ppt
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
NoSQL
NoSQL
Grokking Engineering - Data Analytics Infrastructure at Viki - Huy Nguyen
Grokking Engineering - Data Analytics Infrastructure at Viki - Huy Nguyen
Introduction to R for data science
Introduction to R for data science
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
IBM InterConnect 2016 - 3505 - Cloud-Based Analytics of The Weather Company i...
IBM InterConnect 2016 - 3505 - Cloud-Based Analytics of The Weather Company i...
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache Spark
Spark DataFrames for Data Munging
Spark DataFrames for Data Munging
Adding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug Grall
Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)
Why R? A Brief Introduction to the Open Source Statistics Platform
Why R? A Brief Introduction to the Open Source Statistics Platform
MLconf NYC Shan Shan Huang
MLconf NYC Shan Shan Huang
India software developers conference 2013 Bangalore
India software developers conference 2013 Bangalore
IBM Insight 2015 - 1823 - Geospatial analytics with dashDB in the cloud
IBM Insight 2015 - 1823 - Geospatial analytics with dashDB in the cloud
SRV405 Ancestry's Journey to Amazon Redshift
SRV405 Ancestry's Journey to Amazon Redshift
Plus de Spark Summit
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Spark Summit
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
Plus de Spark Summit
(20)
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Dernier
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
gajnagarg
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
suginr1
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
Timothy Spann
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
HyderabadDolls
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
HyderabadDolls
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
Rajesh Mondal
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
HyderabadDolls
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
kumargunjan9515
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
gargpaaro
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
khraisr
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
SayantanBiswas37
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
ranjankumarbehera14
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
Graham Ware
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
amy56318795
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
kumargunjan9515
Dernier
(20)
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Pivoting Data with SparkSQL by Andrew Ray
1.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 1 Pivoting Data with SparkSQL Andrew Ray Senior Data Engineer Silicon Valley Data Science
2.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 2 CODE git.io/vgy34 (github.com/silicon-valley-data-science/spark-pivot-examples)
3.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 3 OUTLINE • What’s a Pivot? • Syntax • Real world examples • Tips and Tricks • Implementation • Future work git.io/vgy34
4.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 4 WHAT’S A PIVOT?
5.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 5 WHAT’S A PIVOT?
6.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 6 WHAT’S A PIVOT? Group by A, pivot on B, and sum C A B C G X 1 G Y 2 G X 3 H Y 4 H Z 5 A X Y Z G 4 2 H 4 5
7.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 7 WHAT’S A PIVOT? Group by A and B Pivot on BA B C G X 1 G Y 2 G X 3 H Y 4 H Z 5 A B C G X 4 G Y 2 H Y 4 H Z 5 A X Y Z G 4 2 H 4 5
8.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 8 WHAT’S A PIVOT? Pivot on B (w/o agg.) Group by AA B C G X 1 G Y 2 G X 3 H Y 4 H Z 5 A X Y Z G 1 G 2 G 3 H 4 H 5 A X Y Z G 4 2 H 4 5
9.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 9 SYNTAX
10.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 10 SYNTAX • Dataframe/table with columns A, B, C, and D. • How to – group by A and B – pivot on C (with distinct values “small” and “large”) – sum of D
11.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 11 SYNTAX: COMPETITION • pandas (Python) – pivot_table(df, values='D', index=['A', 'B'], columns=['C'], aggfunc=np.sum) • reshape2 (R) – dcast(df, A + B ~ C, sum) • Oracle 11g – SELECT * FROM df PIVOT (sum(D) FOR C IN ('small', 'large')) p
12.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 12 SYNTAX: SPARKSQL • Simple – df.groupBy("A", "B").pivot("C").sum("D") • Explicit pivot values – df.groupBy("A", "B") .pivot("C", Seq("small", "large")) .sum("D")
13.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 13 PIVOT • Added to DataFrame API in Spark 1.6 – Scala – Java – Python – Not R L
14.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 14 REAL WORLD EXAMPLES
15.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 15 EXAMPLE 1: REPORTING • Retail sales • TPC-DS dataset – scale factor 1 • Docker image: docker run -it svds/spark-pivot-reporting TPC Benchmark™ DS - Standard Specification, Version 2.1.0 Page 18 of 135 2.2.2.3 The implementation chosen by the test sponsor for a particular datatype definition shall be applied consistently to all the instances of that datatype definition in the schema, except for identifier columns, whose datatype may be selected to satisfy database scaling requirements. 2.2.3 NULLs If a column definition includes an ‘N’ in the NULLs column this column is populated in every row of the table for all scale factors. If the field is blank this column may contain NULLs. 2.2.4 Foreign Key If the values in this column join with another column, the foreign columns name is listed in the Foreign Key field of the column definition. 2.3 Fact Table Definitions 2.3.1 Store Sales (SS) 2.3.1.1 Store Sales ER-Diagram 2.3.1.2 Store Sales Column Definitions Each row in this table represents a single lineitem for a sale made through the store channel and recorded in the store_sales fact table. Table 2-1 Store_sales Column Definitions Column Datatype NULLs Primary Key Foreign Key ss_sold_date_sk identifier d_date_sk ss_sold_time_sk identifier t_time_sk ss_item_sk (1) identifier N Y i_item_sk ss_customer_sk identifier c_customer_sk ss_cdemo_sk identifier cd_demo_sk ss_hdemo_sk identifier hd_demo_sk ss_addr_sk identifier ca_address_sk ss_store_sk identifier s_store_sk ss_promo_sk identifier p_promo_sk
16.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 16 SALES BY CATEGORY AND QUARTER sql("""select *, concat('Q', d_qoy) as qoy from store_sales join date_dim on ss_sold_date_sk = d_date_sk join item on ss_item_sk = i_item_sk""") .groupBy("i_category") .pivot("qoy") .agg(round(sum("ss_sales_price")/1000000,2)) .show
17.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 17 SALES BY CATEGORY AND QUARTER +-----------+----+----+----+----+ | i_category| Q1| Q2| Q3| Q4| +-----------+----+----+----+----+ | Books|1.58|1.50|2.84|4.66| | Women|1.41|1.36|2.54|4.16| | Music|1.50|1.44|2.66|4.36| | Children|1.54|1.46|2.74|4.51| | Sports|1.47|1.40|2.62|4.30| | Shoes|1.51|1.48|2.68|4.46| | Jewelry|1.45|1.39|2.59|4.25| | null|0.04|0.04|0.07|0.13| |Electronics|1.56|1.49|2.77|4.57| | Home|1.57|1.51|2.79|4.60| | Men|1.60|1.54|2.86|4.71| +-----------+----+----+----+----+
18.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 18 EXAMPLE 2: FEATURE GENERATION • MovieLens 1M Dataset – ~1M movie ratings – 6040 users – 3952 movies • Predict gender based on ratings – Using 100 most popular movies
19.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 19 LOAD RATINGS val ratings_raw = sc.textFile("Downloads/ml-1m/ratings.dat") case class Rating(user: Int, movie: Int, rating: Int) val ratings = ratings_raw.map(_.split("::").map(_.toInt)).map(r => Rating(r(0),r(1),r(2))).toDF ratings.show +----+-----+------+ |user|movie|rating| +----+-----+------+ | 11| 1753| 4| | 11| 1682| 1| | 11| 216| 4| | 11| 2997| 4| | 11| 1259| 3| ...
20.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 20 LOAD USERS val users_raw = sc.textFile("Downloads/ml-1m/users.dat") case class User(user: Int, gender: String, age: Int) val users = users_raw.map(_.split("::")).map(u => User(u(0).toInt, u(1), u(2).toInt)).toDF val sample_users = users.where(expr("gender = 'F' or ( rand() * 5 < 2 )")) sample_users.groupBy("gender").count().show +------+-----+ |gender|count| +------+-----+ | F| 1709| | M| 1744| +------+-----+
21.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 21 PREP DATA val popular = ratings.groupBy("movie") .count() .orderBy($"count".desc) .limit(100) .map(_.get(0)).collect val ratings_pivot = ratings.groupBy("user") .pivot("movie", popular.toSeq) .agg(expr("coalesce(first(rating),3)").cast("double")) ratings_pivot.where($"user" === 11).show +----+----+---+----+----+---+----+---+----+----+---+... |user|2858|260|1196|1210|480|2028|589|2571|1270|593|... +----+----+---+----+----+---+----+---+----+----+---+... | 11| 5.0|3.0| 3.0| 3.0|4.0| 3.0|3.0| 3.0| 3.0|5.0|... +----+----+---+----+----+---+----+---+----+----+---+...
22.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 22 BUILD MODEL val data = ratings_pivot.join(sample_users, "user") .withColumn("label", expr("if(gender = 'M', 1, 0)").cast("double")) val assembler = new VectorAssembler() .setInputCols(popular.map(_.toString)) .setOutputCol("features") val lr = new LogisticRegression() val pipeline = new Pipeline().setStages(Array(assembler, lr)) val Array(training, test) = data.randomSplit(Array(0.9, 0.1), seed = 12345) val model = pipeline.fit(training)
23.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 23 TEST val res = model.transform(test).select("label", "prediction") res.groupBy("label").pivot("prediction", Seq(1.0, 0.0)).count().show +-----+---+---+ |label|1.0|0.0| +-----+---+---+ | 1.0|114| 74| | 0.0| 46|146| +-----+---+---+ Accuracy 68%
24.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 24 TIPS AND TRICKS
25.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 25 USAGE NOTES • Specify the distinct values of the pivot column – Otherwise it does this: val values = df.select(pivotColumn) .distinct() .sort(pivotColumn) .map(_.get(0)) .take(maxValues + 1) .toSeq
26.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 26 MULTIPLE AGGREGATIONS df.groupBy("A", "B").pivot("C").agg(sum("D"), avg("D")).show +---+---+------------+------------+------------+------------+ | A| B|small_sum(D)|small_avg(D)|large_sum(D)|large_avg(D)| +---+---+------------+------------+------------+------------+ |foo|two| 6| 3.0| null| null| |bar|two| 6| 6.0| 7| 7.0| |foo|one| 1| 1.0| 4| 2.0| |bar|one| 5| 5.0| 4| 4.0| +---+---+------------+------------+------------+------------+
27.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 27 PIVOT MULTIPLE COLUMNS • Merge columns and pivot as usual df.withColumn(“p”, concat($”p1”, $”p2”)) .groupBy(“a”, “b”) .pivot(“p”) .agg(…)
28.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 28 MAX COLUMNS • spark.sql.pivotMaxValues – Default: 10,000 – When doing a pivot without specifying values for the pivot column this is the maximum number of (distinct) values that will be collected without error.
29.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 29 IMPLEMENTATION
30.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 30 PIVOT IMPLEMENTATION • pivot is a method of GroupedData and returns GroupedData with PivotType. • New logical operator: o.a.s.sql.catalyst.plans.logical.Pivot • Analyzer rule: o.a.s.sql.catalyst.analysis.Analyzer.ResolvePivot – Currently translates logical pivot into an aggregation with lots of if statements.
31.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 31 FUTURE WORK
32.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 32 FUTURE WORK • Add to R API • Add to SQL syntax • Add support for unpivot • Faster implementation
33.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 33 Pivoting Data with SparkSQL Andrew Ray andrew@svds.com We’re hiring! svds.com/careers THANK YOU. git.io/vgy34
Télécharger maintenant