SlideShare une entreprise Scribd logo
1  sur  156
Télécharger pour lire hors ligne
WHOAMI
> Ruben Berenguel (@berenguel)
> PhD in Mathematics
> (big) data consultant
> Lead data engineer using Python, Go and Scala
> Right now at Affectv
#UnifiedDataAnalytics #SparkAISummit
What is Pandas?
#UnifiedDataAnalytics #SparkAISummit
What is Pandas?
> Python Data Analysis library
#UnifiedDataAnalytics #SparkAISummit
What is Pandas?
> Python Data Analysis library
> Used everywhere data and Python appear in job offers
#UnifiedDataAnalytics #SparkAISummit
What is Pandas?
> Python Data Analysis library
> Used everywhere data and Python appear in job offers
> Efficient (is columnar and has a C and Cython backend)
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
HOW DOES PANDAS
MANAGE COLUMNAR DATA?
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
WHAT IS ARROW?
#UnifiedDataAnalytics #SparkAISummit
WHAT IS ARROW?
> Cross-language in-memory columnar format library
#UnifiedDataAnalytics #SparkAISummit
WHAT IS ARROW?
> Cross-language in-memory columnar format library
> Optimised for efficiency across languages
#UnifiedDataAnalytics #SparkAISummit
WHAT IS ARROW?
> Cross-language in-memory columnar format library
> Optimised for efficiency across languages
> Integrates seamlessly with Pandas
#UnifiedDataAnalytics #SparkAISummit
HOW DOES ARROW
MANAGE COLUMNAR DATA?
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
! ❤
#UnifiedDataAnalytics #SparkAISummit
! ❤
> Arrow uses RecordBatches
#UnifiedDataAnalytics #SparkAISummit
! ❤
> Arrow uses RecordBatches
> Pandas uses blocks handled by a BlockManager
#UnifiedDataAnalytics #SparkAISummit
! ❤
> Arrow uses RecordBatches
> Pandas uses blocks handled by a BlockManager
> You can convert an Arrow Table into a Pandas
DataFrame easily
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
WHAT IS SPARK?
#UnifiedDataAnalytics #SparkAISummit
WHAT IS SPARK?
> Distributed Computation framework
#UnifiedDataAnalytics #SparkAISummit
WHAT IS SPARK?
> Distributed Computation framework
> Open source
#UnifiedDataAnalytics #SparkAISummit
WHAT IS SPARK?
> Distributed Computation framework
> Open source
> Easy to use
#UnifiedDataAnalytics #SparkAISummit
WHAT IS SPARK?
> Distributed Computation framework
> Open source
> Easy to use
> Scales horizontally and vertically
#UnifiedDataAnalytics #SparkAISummit
HOW DOES
SPARK WORK?
#UnifiedDataAnalytics #SparkAISummit
SPARK
USUALLY
RUNS ON TOP
OF A CLUSTER
MANAGER
AND A
DISTRIBUTED
STORAGE
A SPARK PROGRAM
RUNS IN THE DRIVER
THE DRIVER REQUESTS
RESOURCES FROM THE
CLUSTER MANAGER TO
RUN TASKS
THE DRIVER REQUESTS
RESOURCES FROM THE
CLUSTER MANAGER TO
RUN TASKS
THE DRIVER REQUESTS
RESOURCES FROM THE
CLUSTER MANAGER TO
RUN TASKS
THE DRIVER REQUESTS
RESOURCES FROM THE
CLUSTER MANAGER TO
RUN TASKS
THE MAIN BUILDING BLOCK
IS THE RDD:
RESILIENT DISTRIBUTED
DATASET
#UnifiedDataAnalytics #SparkAISummit
PYSPARK
#UnifiedDataAnalytics #SparkAISummit
PYSPARK OFFERS A
PYTHON API TO THE SCALA
CORE OF SPARK
#UnifiedDataAnalytics #SparkAISummit
IT USES THE
PY4J BRIDGE
#UnifiedDataAnalytics #SparkAISummit
# Connect to the gateway
gateway = JavaGateway(
gateway_parameters=GatewayParameters(
port=gateway_port,
auth_token=gateway_secret,
auto_convert=True))
# Import the classes used by PySpark
java_import(gateway.jvm, "org.apache.spark.SparkConf")
java_import(gateway.jvm, "org.apache.spark.api.java.*")
java_import(gateway.jvm, "org.apache.spark.api.python.*")
.
.
.
return gateway
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
THE MAIN ENTRYPOINTS
ARE RDD AND
PipelinedRDD(RDD)
#UnifiedDataAnalytics #SparkAISummit
PipelinedRDD
BUILDS IN THE JVM A
PythonRDD
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
#UnifiedDataAnalytics #SparkAISummit
THE MAGIC IS
IN
compute
compute
IS RUN ON EACH
EXECUTOR AND STARTS
A PYTHON WORKER VIA
PythonRunner
Workers act as standalone processors of streams of
data
#UnifiedDataAnalytics #SparkAISummit
Workers act as standalone processors of streams of
data
> Connects back to the JVM that started it
#UnifiedDataAnalytics #SparkAISummit
Workers act as standalone processors of streams of
data
> Connects back to the JVM that started it
> Load included Python libraries
#UnifiedDataAnalytics #SparkAISummit
Workers act as standalone processors of streams of
data
> Connects back to the JVM that started it
> Load included Python libraries
> Deserializes the pickled function coming from the
stream
#UnifiedDataAnalytics #SparkAISummit
Workers act as standalone processors of streams of
data
> Connects back to the JVM that started it
> Load included Python libraries
> Deserializes the pickled function coming from the
stream
> Applies the function to the data coming from the stream
#UnifiedDataAnalytics #SparkAISummit
Workers act as standalone processors of streams of
data
> Connects back to the JVM that started it
> Load included Python libraries
> Deserializes the pickled function coming from the
stream
> Applies the function to the data coming from the stream
> Sends the output back
#UnifiedDataAnalytics #SparkAISummit
…
#UnifiedDataAnalytics #SparkAISummit
BUT… WASN'T SPARK
MAGICALLY OPTIMISING
EVERYTHING?
#UnifiedDataAnalytics #SparkAISummit
YES, FOR SPARK
DataFrame
#UnifiedDataAnalytics #SparkAISummit
SPARK WILL GENERATE
A PLAN
(A DIRECTED ACYCLIC GRAPH)
TO COMPUTE THE
RESULT
AND THE PLAN WILL BE
OPTIMISED USING
CATALYST
DEPENDING ON THE FUNCTION, THE
OPTIMISER WILL CHOOSE
PythonUDFRunner
OR
PythonArrowRunner
(BOTH EXTEND PythonRunner)
#UnifiedDataAnalytics #SparkAISummit
IF WE CAN DEFINE OUR FUNCTIONS
USING PANDAS Series
TRANSFORMATIONS WE CAN SPEED UP
PYSPARK CODE FROM 3X TO 100X!
#UnifiedDataAnalytics #SparkAISummit
QUICK
EXAMPLES
#UnifiedDataAnalytics #SparkAISummit
THE BASICS: toPandas
from pyspark.sql.functions import rand
df = spark.range(1 << 20).toDF("id").withColumn("x", rand())
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
pandas_df = df.toPandas() # we'll time this
#UnifiedDataAnalytics #SparkAISummit
from pyspark.sql.functions import rand
df = spark.range(1 << 20).toDF("id").withColumn("x", rand())
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
pandas_df = df.toPandas() # we'll time this
#UnifiedDataAnalytics #SparkAISummit
from pyspark.sql.functions import rand
df = spark.range(1 << 20).toDF("id").withColumn("x", rand())
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
pandas_df = df.toPandas() # we'll time this
#UnifiedDataAnalytics #SparkAISummit
THE FUN: .groupBy
from pyspark.sql.functions import rand, randn, floor
from pyspark.sql.functions import pandas_udf, PandasUDFType
df = spark.range(20000000).toDF("row").drop("row") 
.withColumn("id", floor(rand()*10000)).withColumn("spent", (randn()+3)*100)
@pandas_udf("id long, spent double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
spent = pdf.spent
return pdf.assign(spent=spent - spent.mean())
df_to_pandas_arrow = df.groupby("id").apply(subtract_mean).toPandas()
#UnifiedDataAnalytics #SparkAISummit
from pyspark.sql.functions import rand, randn, floor
from pyspark.sql.functions import pandas_udf, PandasUDFType
df = spark.range(20000000).toDF("row").drop("row") 
.withColumn("id", floor(rand()*10000)).withColumn("spent", (randn()+3)*100)
@pandas_udf("id long, spent double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
spent = pdf.spent
return pdf.assign(spent=spent - spent.mean())
df_to_pandas_arrow = df.groupby("id").apply(subtract_mean).toPandas()
#UnifiedDataAnalytics #SparkAISummit
BEFORE YOU MAY HAVE DONE SOMETHING LIKE..
import numpy as np
from pyspark.sql.functions import collect_list
grouped = df2.groupby("id").agg(collect_list('spent').alias("spent_list"))
as_pandas = grouped.toPandas()
as_pandas["mean"] = as_pandas["spent_list"].apply(np.mean)
as_pandas["substracted"] = as_pandas["spent_list"].apply(np.array) - as_pandas["mean"]
df_to_pandas = as_pandas.drop(columns=["spent_list", "mean"]).explode("substracted")
#UnifiedDataAnalytics #SparkAISummit
import numpy as np
from pyspark.sql.functions import collect_list
grouped = df2.groupby("id").agg(collect_list('spent').alias("spent_list"))
as_pandas = grouped.toPandas()
as_pandas["mean"] = as_pandas["spent_list"].apply(np.mean)
as_pandas["substracted"] = as_pandas["spent_list"].apply(np.array) - as_pandas["mean"]
df_to_pandas = as_pandas.drop(columns=["spent_list", "mean"]).explode("substracted")
#UnifiedDataAnalytics #SparkAISummit
TLDR:
USE1
ARROW AND PANDAS UDFS
1
in pyspark
#UnifiedDataAnalytics #SparkAISummit
RESOURCES
> Spark documentation
> High Performance Spark by Holden Karau
> The Internals of Apache Spark 2.4.2 by Jacek Laskowski
> Spark's Github
> Become a contributor
#UnifiedDataAnalytics #SparkAISummit
QUESTIONS?
#UnifiedDataAnalytics #SparkAISummit
THANKS!
#UnifiedDataAnalytics #SparkAISummit
Get the slides from my github:
github.com/rberenguel/
The repository is
pyspark-arrow-pandas
FURTHER
REFERENCES
#UnifiedDataAnalytics #SparkAISummit
ARROW
Arrow's home
Arrow's github
Arrow speed benchmarks
Arrow to Pandas conversion benchmarks
Post: Streaming columnar data with Apache Arrow
Post: Why Pandas users should be excited by Apache Arrow
Code: Arrow-Pandas compatibility layer code
Code: Arrow Table code
PyArrow in-memory data model
Ballista: a POC distributed compute platform (Rust)
PyJava: POC on Java/Scala and Python data interchange with Arrow
#UnifiedDataAnalytics #SparkAISummit
PANDAS
Pandas' home
Pandas' github
Guide: Idiomatic Pandas
Code: Pandas internals
Design: Pandas internals
Talk: Demystifying Pandas' internals, by Marc Garcia
Memory Layout of Multidimensional Arrays in numpy
#UnifiedDataAnalytics #SparkAISummit
SPARK/PYSPARK
Code: PySpark serializers
JIRA: First steps to using Arrow (only in the PySpark driver)
Post: Speeding up PySpark with Apache Arrow
Original JIRA issue: Vectorized UDFs in Spark
Initial doc draft
Post by Bryan Cutler (leader for the Vec UDFs PR)
Post: Introducing Pandas UDF for PySpark
Code: org.apache.spark.sql.vectorized
Post by Bryan Cutler: Spark toPandas() with Arrow, a Detailed Look
#UnifiedDataAnalytics #SparkAISummit
PY4J
Py4J's home
Py4J's github
Code: Reflection engine
#UnifiedDataAnalytics #SparkAISummit
TABLE FOR toPandas
2^x Direct (s) With Arrow (s) Factor
17 1,08 0,18 5,97
18 1,69 0,26 6,45
19 4,16 0,30 13,87
20 5,76 0,61 9,44
21 9,73 0,96 10,14
22 17,90 1,64 10,91
23 (OOM) 3,42
24 (OOM) 11,40
#UnifiedDataAnalytics #SparkAISummit
EOF
#UnifiedDataAnalytics #SparkAISummit

Contenu connexe

Tendances

The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
DataWorks Summit
 

Tendances (20)

Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
 
Getting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on KubernetesGetting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on Kubernetes
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analytics
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark Metrics
 
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsMonitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
 

Similaire à Internals of Speeding up PySpark with Arrow

An AI-Powered Chatbot to Simplify Apache Spark Performance Management
An AI-Powered Chatbot to Simplify Apache Spark Performance ManagementAn AI-Powered Chatbot to Simplify Apache Spark Performance Management
An AI-Powered Chatbot to Simplify Apache Spark Performance Management
Databricks
 
Connecting the Dots: Integrating Apache Spark into Production Pipelines
Connecting the Dots: Integrating Apache Spark into Production PipelinesConnecting the Dots: Integrating Apache Spark into Production Pipelines
Connecting the Dots: Integrating Apache Spark into Production Pipelines
Databricks
 

Similaire à Internals of Speeding up PySpark with Arrow (20)

How does that PySpark thing work? And why Arrow makes it faster?
How does that PySpark thing work? And why Arrow makes it faster?How does that PySpark thing work? And why Arrow makes it faster?
How does that PySpark thing work? And why Arrow makes it faster?
 
Apache Spark Data Validation
Apache Spark Data ValidationApache Spark Data Validation
Apache Spark Data Validation
 
An AI-Powered Chatbot to Simplify Apache Spark Performance Management
An AI-Powered Chatbot to Simplify Apache Spark Performance ManagementAn AI-Powered Chatbot to Simplify Apache Spark Performance Management
An AI-Powered Chatbot to Simplify Apache Spark Performance Management
 
Vectorized R Execution in Apache Spark
Vectorized R Execution in Apache SparkVectorized R Execution in Apache Spark
Vectorized R Execution in Apache Spark
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
Tactical Data Science Tips: Python and Spark Together
Tactical Data Science Tips: Python and Spark TogetherTactical Data Science Tips: Python and Spark Together
Tactical Data Science Tips: Python and Spark Together
 
Briefing on the Modern ML Stack with R
 Briefing on the Modern ML Stack with R Briefing on the Modern ML Stack with R
Briefing on the Modern ML Stack with R
 
Koalas: How Well Does Koalas Work?
Koalas: How Well Does Koalas Work?Koalas: How Well Does Koalas Work?
Koalas: How Well Does Koalas Work?
 
Speeding up PySpark with Arrow
Speeding up PySpark with ArrowSpeeding up PySpark with Arrow
Speeding up PySpark with Arrow
 
Extending Spark Graph for the Enterprise with Morpheus and Neo4j
Extending Spark Graph for the Enterprise with Morpheus and Neo4jExtending Spark Graph for the Enterprise with Morpheus and Neo4j
Extending Spark Graph for the Enterprise with Morpheus and Neo4j
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
 
PYSPARK PROGRAMMING.pdf
PYSPARK PROGRAMMING.pdfPYSPARK PROGRAMMING.pdf
PYSPARK PROGRAMMING.pdf
 
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...
 
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
 
Scaling ML-Based Threat Detection For Production Cyber Attacks
Scaling ML-Based Threat Detection For Production Cyber AttacksScaling ML-Based Threat Detection For Production Cyber Attacks
Scaling ML-Based Threat Detection For Production Cyber Attacks
 
Connecting the Dots: Integrating Apache Spark into Production Pipelines
Connecting the Dots: Integrating Apache Spark into Production PipelinesConnecting the Dots: Integrating Apache Spark into Production Pipelines
Connecting the Dots: Integrating Apache Spark into Production Pipelines
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
 
H2O PySparkling Water
H2O PySparkling WaterH2O PySparkling Water
H2O PySparkling Water
 
Intro to apache spark
Intro to apache sparkIntro to apache spark
Intro to apache spark
 

Plus de Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

Plus de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Dernier

Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
amitlee9823
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
AroojKhan71
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
only4webmaster01
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 

Dernier (20)

Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 

Internals of Speeding up PySpark with Arrow