Powering Custom Apps at Facebook
using Spark Script Transformation
Abdulrahman Alfozan
Spark Summit Europe
Agenda
1. Intro to Spark Script Transforms
2. Spark Transforms at Facebook
3. Core Engine Improvements
4. Efficiency Analysis and Results
5. Transforms Execution Model
6. Future Plans
Spark at Facebook
2015: Small-scale experiments
2016: Few pipelines in production
2017: Running 60TB+ shuffle pipelines
2018: Full-production deployment; successor to Apache Hive at Facebook
2019: Scaling Spark; largest compute engine at Facebook by CPU
Reliability and efficiency are our top priorities
1. Intro to Spark Script Transforms
Script Transforms
SQL query
SELECT
  TRANSFORM (inputs)
  USING 'script'
  AS (outputs)
FROM src_tbl;
Query plan
ScriptTransformation (inputs, script, outputs)
  TableScan (src_tbl)
Execution
Input Table -> Spark -> inputs -> External Process -> outputs -> Spark -> Output Table
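A minimal sketch of what the external "script" side of a transform can look like. It assumes the default text framing (one row per line, tab-separated columns), which is how Hive-style TRANSFORM hands rows to a script when no ROW FORMAT is given; the function name run_transform and the sample rows are hypothetical.

```python
import io

def run_transform(stdin, stdout):
    """Read tab-separated rows and emit (first column, column count) per row."""
    for line in stdin:
        fields = line.rstrip("\n").split("\t")
        # One output row per input row, again tab-separated.
        stdout.write(f"{fields[0]}\t{len(fields)}\n")

# Demo with in-memory streams standing in for the process's stdin/stdout.
demo_out = io.StringIO()
run_transform(io.StringIO("a\t1\nb\t2\t3\n"), demo_out)
print(demo_out.getvalue(), end="")
```

In a real pipeline the same function would be called with sys.stdin and sys.stdout, and the script would be shipped to the executors and named in the USING clause.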
Why Script Transforms?
1. Flexibility:
Unlike UDFs, transforms allow unlimited use-cases
2. Efficiency:
Most transformers are written in C++
Transforms provide custom data processing while relying on Spark for
ETL, data partitioning, distributed execution, and fault-tolerance.
e.g. Spark is optimized for ETL. PyTorch is optimized for model serving.
2. Spark Transforms at Facebook
Transform Pipelines Usage
[Chart: % of overall CPU, axis from 0% to 15%]
Transform Pipelines Usage
Query Count vs. CPU Comparison
[Two breakdowns, one by query count and one by CPU:
Pure SQL (54%), Transforms & UDFs (45%), DataFrames (1%)
Pure SQL (72%), Transforms & UDFs (20%), DataFrames (8%)]
Use-case 1: Batch Inference
SQL Query
ADD FILES inference_engine, model.md;                -- Transform resources
SELECT
  TRANSFORM (id INT, metadata STRING, image STRING)  -- Input columns
  ROW FORMAT SERDE 'JSONSimpleSerDe'                 -- Input format
  USING 'inference_engine --model=model.md'
  AS labels MAP<STRING, DOUBLE>                      -- Output: category -> confidence
  ROW FORMAT SERDE 'JSONSimpleSerDe'                 -- Output format
FROM tlb_images;
Use-case 1: Batch Inference
Transform main.cpp
#include "spark/Transformer.h"          // Transform lib
...
while (transformer.readRow(input)) {    // Row iterator
  // data processing
  auto prediction = predict(input);
  // write output map
  transformer.writeRow(prediction);
}
Use-case 1: Batch Inference
Execution
Spark Executor
  Spark Task: InternalRow serialization into JSON -> stdin of Transform Process
  Spark Task: stdout of Transform Process -> JSON deserialization into InternalRow
Transform Process (self-contained executable: PyTorch runtime container + Model)
  JSON deserialization into C++ objects
  C++ objects serialization into JSON
Input row: {id:1, metadata:, image:…}
Output row: {label_1: score, label_2: score}
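The JSON round trip on the transform side can be sketched as below. This is an illustration only: it assumes one JSON object per line on stdin and stdout (the exact framing of JSONSimpleSerDe is not given in the slides), and fake_predict is a hypothetical stand-in for the real model.

```python
import io
import json

def fake_predict(row):
    """Hypothetical scoring function; a real transform would call the model here."""
    n = len(row.get("image", ""))
    return {"label_1": n / (n + 1), "label_2": 1 / (n + 1)}

def run_inference(stdin, stdout, predict=fake_predict):
    """Deserialize each JSON row, score it, and serialize the label map back."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        stdout.write(json.dumps(predict(row)) + "\n")

# Demo: a single row whose image payload is three characters long.
demo_out = io.StringIO()
run_inference(io.StringIO('{"id": 1, "metadata": "", "image": "abc"}\n'), demo_out)
print(demo_out.getvalue(), end="")
```

Every row is parsed and re-serialized twice per hop (once in Spark, once in the transform), which is the overhead the later efficiency slides measure.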
Use-case 2: Batch Indexing
SQL Query
ADD FILES indexer;                        -- Transform resources
SELECT
  TRANSFORM (shard_id INT, data STRING)   -- Input columns
  ROW FORMAT SERDE 'RowFormatDelimited'   -- Input format
  USING 'indexer --schema=data<STRING>'
FROM src_tbl
CLUSTER BY shard_id;                      -- Partition operator
Use-case 2: Batch Indexing
Execution
Mappers (Spark Tasks)
  Mapper 1 output (shard_id, data): 1 {…} | 1 {…} | 2 {…}
  Mapper 2 output (shard_id, data): 1 {…} | 2 {…} | 2 {…}
Shuffle
Reducer Transforms (Spark Tasks)
  Reducer 1 input (shard_id, data): 1 {…} | 1 {…} | 1 {…} -> stdin -> Transform Process (indexer)
  Reducer 2 input (shard_id, data): 2 {…} | 2 {…} | 2 {…} -> …
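The effect of CLUSTER BY shard_id in the diagram above can be sketched as follows. This is a simplification: Spark hash-partitions on the clustering key with its own hash function, while the modulo below is only an illustrative assumption, and shuffle_by_shard is a hypothetical helper.

```python
from collections import defaultdict

def shuffle_by_shard(mapper_outputs, num_reducers):
    """Route every (shard_id, data) row to one reducer, then sort within each reducer."""
    reducers = defaultdict(list)
    for rows in mapper_outputs:
        for shard_id, data in rows:
            # All rows with the same shard_id land on the same reducer.
            reducers[shard_id % num_reducers].append((shard_id, data))
    for rows in reducers.values():
        rows.sort(key=lambda r: r[0])  # CLUSTER BY also sorts by the key
    return dict(reducers)

mapper_1 = [(1, "{a}"), (1, "{b}"), (2, "{c}")]
mapper_2 = [(1, "{d}"), (2, "{e}"), (2, "{f}")]
parts = shuffle_by_shard([mapper_1, mapper_2], num_reducers=2)
for reducer, rows in sorted(parts.items()):
    print(reducer, rows)
```

Each reducer then streams its rows to its own indexer process, so one process sees every row for the shards assigned to it.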
3. Core Engine Improvements
Core Engine Improvements
Operator
ScriptTransformationExec.scala
• Direct process invocation
• Class IOSchema to handle SerDe schema and config
• MonitorThread to track transform process progress
• Transform process error handling and surfacing
Core Engine Improvements
SerDe support
• DelimitedJSONSerDe.scala
JSON format standard RFC 8259
• SimpleSerDe.scala
ROW FORMAT DELIMITED
Configurable properties:
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':'
LINES TERMINATED BY '\n'
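To make the delimiter properties concrete, here is a small sketch of how a transform might parse one delimited row. The three-column schema (id INT, tags ARRAY<STRING>, scores MAP<STRING, DOUBLE>) and the function name are assumptions for illustration; escaping and NULL handling are left out.

```python
def parse_delimited_row(line, field_sep=",", item_sep="|", key_sep=":"):
    """Parse 'id,tag|tag,key:val|key:val' into (int, list, dict)."""
    raw_id, raw_tags, raw_scores = line.rstrip("\n").split(field_sep)
    # COLLECTION ITEMS TERMINATED BY separates array elements and map entries.
    tags = raw_tags.split(item_sep) if raw_tags else []
    entries = raw_scores.split(item_sep) if raw_scores else []
    scores = {}
    for entry in entries:
        # MAP KEYS TERMINATED BY separates each key from its value.
        key, value = entry.split(key_sep, 1)
        scores[key] = float(value)
    return int(raw_id), tags, scores

print(parse_delimited_row("7,a|b,x:0.5|y:1.5\n"))
```

Nested values are rebuilt by repeated string splitting and number parsing, which is part of why text SerDes become expensive for complex types.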
Core Engine Improvements
SerDe support
Development
• Text-based
DelimitedJSONSerDe.scala
SimpleSerDe.scala
Production
• Binary
?
Core Engine Improvements
Production SerDe Requirements
• Binary format
Text-based encoding is slow and less compact
• Zero-copy
Access to serialized data without parsing or unpacking
Improving Facebook’s performance on Android with FlatBuffers
• Word-aligned data
Allows for SIMD optimizations
Binary SerDe Considerations
• LazyBinarySerDe (Apache Hive)
Neither zero-copy nor word-aligned; requires converters in Spark
• Protocol Buffers / Thrift
Not zero-copy; more suited for RPC
• FlatBuffers / Cap’n Proto
Require converters (to/from InternalRow) in Spark Core
• Apache Arrow
Great future option
Binary SerDe Considerations
Chosen format
UnsafeRow
• Binary & Word-aligned
• Zero-copy
• Already part of Spark core
• Available converters to/from InternalRow
UnsafeRow SerDe
SPARK-7076: Introduced UnsafeRow format to Spark
apache/spark/sql/catalyst/expressions/UnsafeRow.java
UnsafeRow SerDe
SPARK-15962: Introduced UnsafeArrayData and UnsafeMapData
apache/spark/sql/catalyst/expressions/UnsafeArrayData.java
apache/spark/sql/catalyst/expressions/UnsafeMapData.java
UnsafeRow SerDe
UnsafeRow SerDe C++ library
SQL datatypes      C++ datatypes
INT                int32_t
BIGINT             int64_t
BOOLEAN            bool
FLOAT              float
DOUBLE             double
STRING             unsaferow::String
ARRAY<INT>         unsaferow::List<int32_t>
MAP<INT,STRING>    unsaferow::Map<int32_t, unsaferow::String>
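A rough sketch of the UnsafeRow layout that such a library has to read. It follows the layout described in Spark's UnsafeRow.java as far as I understand it: a null bitset padded to 8-byte words, one 8-byte slot per field, then a variable-length region addressed by (offset << 32) | size. Little-endian byte order, the BIGINT/STRING-only schema, and the helper names are assumptions of this example, not the C++ library's API.

```python
import struct

def encode_row(values):
    """Encode a list of None | int | str into an UnsafeRow-style byte string."""
    n = len(values)
    bitset_bytes = ((n + 63) // 64) * 8   # null bitset, padded to 8-byte words
    fixed_bytes = bitset_bytes + 8 * n    # one 8-byte slot per field
    bitset = bytearray(bitset_bytes)
    slots = bytearray(8 * n)
    var = bytearray()
    for i, v in enumerate(values):
        if v is None:
            bitset[i // 8] |= 1 << (i % 8)
        elif isinstance(v, int):
            struct.pack_into("<q", slots, 8 * i, v)
        else:
            data = v.encode("utf-8")
            offset = fixed_bytes + len(var)  # relative to the start of the row
            struct.pack_into("<q", slots, 8 * i, (offset << 32) | len(data))
            var += data + b"\x00" * (-len(data) % 8)  # keep word alignment
    return bytes(bitset + slots + var)

def decode_row(buf, types):
    """Decode using a schema of 'bigint' / 'string' entries; no text parsing involved."""
    bitset_bytes = ((len(types) + 63) // 64) * 8
    row = []
    for i, t in enumerate(types):
        if buf[i // 8] & (1 << (i % 8)):
            row.append(None)
            continue
        (word,) = struct.unpack_from("<q", buf, bitset_bytes + 8 * i)
        if t == "bigint":
            row.append(word)
        else:
            offset, size = word >> 32, word & 0xFFFFFFFF
            row.append(buf[offset:offset + size].decode("utf-8"))
    return row

encoded = encode_row([7, None, "spark"])
print(len(encoded), decode_row(encoded, ["bigint", "bigint", "string"]))
```

Fixed-width fields are read directly at a computed position and strings are located by offset and size, which is what makes zero-copy access possible in a native reader.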
UnsafeRow SerDe
UnsafeRow SerDe C++ library
SQL Query
SELECT
  TRANSFORM (id INT)
  ROW FORMAT SERDE 'UnsafeRowSerDe'
  USING 'script'
  AS (value BIGINT)
  ROW FORMAT SERDE 'UnsafeRowSerDe'
FROM src_tbl;
C++ Transformer
#include "spark/Transformer.h"
while (transformer.readRow(input)) {
  // data processing
  int32_t id = input->getID();
  // widen before multiplying so the BIGINT output cannot overflow int32_t
  output->setValue(static_cast<int64_t>(id) * id);
  // write output
  transformer.writeRow(output);
}
Core Engine Improvements
SerDe support summary
Development
• Text-based
DelimitedJSONSerDe.scala
SimpleSerDe.scala
Production
• Binary
UnsafeRowSerDe.scala
Core Engine Improvements
Aggregation and projection support (SQL)
SELECT
  TRANSFORM (id, AVG(value) AS value_avg)
  USING 'script'
  AS (output)
FROM src_tbl
GROUP BY id;
4. Efficiency Analysis and Results
Efficiency Analysis
SerDe overhead
• Text-based (UTF-8)
- JSON
- Row Format Delimited
• Binary:
- UnsafeRow
Efficiency Analysis
Text-SerDe CPU overhead: Spark
[CPU profile figure; labeled region: JSON lib]
Efficiency Analysis
Text-SerDe CPU overhead: Transform process
[CPU profile figure]
Efficiency Analysis
SerDe overhead
• Text-based SerDe overhead is non-negligible,
especially for complex types
• SerDe cost could be up to 70% of a pipeline’s CPU resources
Solution: use an efficient binary SerDe
Efficiency Analysis: UnsafeRow
Efficient Binary SerDe
[Figure labels: UnsafeRow, C++ lib]
Efficiency Analysis: UnsafeRow
Spark
[CPU profile figure]
Efficiency Analysis: UnsafeRow
Transform process
[CPU profile figure]
UnsafeRow SerDe Benchmark
Text-Based SerDe vs Binary SerDe (UnsafeRow)
Transform pipelines end-to-end CPU savings: up to 4x
Complex types SerDe impacted the most
5. Transforms Execution Model
Transforms Execution Model
Resource Request
CPU cores per container: spark.executor.cores = 4
Memory per container: spark.executor.memory = 4GB + spark.transform.memory = 4GB
[Diagram: the Spark Driver sends a Resource Request (CPU cores = 4, Memory = 8GB) to the Cluster Manager; Node Managers launch Spark Executors; each Executor runs Task 1 and Task 2 alongside transform Process 1 and Process 2]
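The container sizing above is simple arithmetic, sketched here for clarity. The helper names are hypothetical, and spark.transform.memory is taken from the slide as a setting of this deployment rather than a standard Spark property.

```python
def parse_gb(value):
    """Parse a size such as '4GB' into an integer number of gigabytes."""
    if not value.endswith("GB"):
        raise ValueError(f"expected a size in GB, got {value!r}")
    return int(value[:-2])

def container_request(conf):
    """Cores come from the executor; memory covers both the JVM and the transform."""
    return {
        "cores": int(conf["spark.executor.cores"]),
        "memory_gb": parse_gb(conf["spark.executor.memory"])
        + parse_gb(conf["spark.transform.memory"]),
    }

request = container_request({
    "spark.executor.cores": "4",
    "spark.executor.memory": "4GB",
    "spark.transform.memory": "4GB",
})
print(request)
```

The request covers the JVM and the external processes together, so the cluster manager accounts for memory that the JVM's own limits never see.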
Transforms Execution Model
Resource Control
• JVM’s memory limits: Xms, Xmx and Xss.
• CPU threads:
spark.executor.cores, spark.task.cpus
These limits are irrelevant when running an
external process!
Solution: cgroup v2 containers
Transforms Execution Model
Resource Control & Isolation
cgroup v2 controllers:
• cpu.weight
Allows multi-threaded transforms
• memory.max
OOM-kills offending processes
• io.latency
IO QoS
[Figure: cgroup hierarchy, e.g. /cgroup2/task_container/exec1]
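As a rough illustration of how these limits are applied, the sketch below writes the cgroup v2 interface files cpu.weight and memory.max and attaches a process through cgroup.procs. The helper names are hypothetical and the demo targets a temporary directory; on a real host the directory would live under the cgroup2 mount, the controllers would have to be enabled in the parent's cgroup.subtree_control, and the caller would need sufficient privileges.

```python
import os
import tempfile

def configure_cgroup(cgroup_dir, cpu_weight, memory_max_bytes):
    """Write CPU weight and a hard memory limit into a cgroup v2 directory."""
    if not 1 <= cpu_weight <= 10000:  # valid cpu.weight range in cgroup v2
        raise ValueError("cpu.weight must be between 1 and 10000")
    os.makedirs(cgroup_dir, exist_ok=True)
    with open(os.path.join(cgroup_dir, "cpu.weight"), "w") as f:
        f.write(f"{cpu_weight}\n")
    with open(os.path.join(cgroup_dir, "memory.max"), "w") as f:
        f.write(f"{memory_max_bytes}\n")

def attach_process(cgroup_dir, pid):
    """Move a process into the cgroup by writing its PID to cgroup.procs."""
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
        f.write(f"{pid}\n")

# Demo against a scratch directory standing in for /cgroup2/task_container/exec1.
demo_root = tempfile.mkdtemp()
demo_cgroup = os.path.join(demo_root, "task_container", "exec1")
configure_cgroup(demo_cgroup, cpu_weight=100, memory_max_bytes=8 * 1024**3)
attach_process(demo_cgroup, os.getpid())
print(sorted(os.listdir(demo_cgroup)))
```

Because the limits attach to the cgroup rather than to the JVM, they also cover the transform processes placed in it.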
6. Future Plans
Future Plans
• Binary SerDe based on Apache Arrow
• Vectorization
Questions
INFRASTRUCTURE

More Related Content

What's hot

Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...StreamNative
 
Streaming Auto-scaling in Google Cloud Dataflow
Streaming Auto-scaling in Google Cloud DataflowStreaming Auto-scaling in Google Cloud Dataflow
Streaming Auto-scaling in Google Cloud DataflowC4Media
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataDatabricks
 
Achieve Blazing-Fast Ingest Speeds with Apache Arrow
Achieve Blazing-Fast Ingest Speeds with Apache ArrowAchieve Blazing-Fast Ingest Speeds with Apache Arrow
Achieve Blazing-Fast Ingest Speeds with Apache ArrowNeo4j
 
Analyzing and processing streaming data with Amazon EMR - ADB204 - New York A...
Analyzing and processing streaming data with Amazon EMR - ADB204 - New York A...Analyzing and processing streaming data with Amazon EMR - ADB204 - New York A...
Analyzing and processing streaming data with Amazon EMR - ADB204 - New York A...Amazon Web Services
 
Elastic 101 - Get started
Elastic 101 - Get startedElastic 101 - Get started
Elastic 101 - Get startedIsmaeel Enjreny
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversScyllaDB
 
DataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven OrganizationsDataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven OrganizationsEllen Friedman
 
NATS Streaming - an alternative to Apache Kafka?
NATS Streaming - an alternative to Apache Kafka?NATS Streaming - an alternative to Apache Kafka?
NATS Streaming - an alternative to Apache Kafka?Anton Zadorozhniy
 
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data PlatformsData Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data PlatformsAnant Corporation
 
Dynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisationDynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisationOri Reshef
 
HBase in Practice
HBase in PracticeHBase in Practice
HBase in Practicelarsgeorge
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
Migrate and Modernize Hadoop-Based Security Policies for Databricks
Migrate and Modernize Hadoop-Based Security Policies for DatabricksMigrate and Modernize Hadoop-Based Security Policies for Databricks
Migrate and Modernize Hadoop-Based Security Policies for DatabricksDatabricks
 
Architecture patterns for distributed, hybrid, edge and global Apache Kafka d...
Architecture patterns for distributed, hybrid, edge and global Apache Kafka d...Architecture patterns for distributed, hybrid, edge and global Apache Kafka d...
Architecture patterns for distributed, hybrid, edge and global Apache Kafka d...Kai Wähner
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020Timothy McAliley
 
Understanding InfluxDB Basics: Tags, Fields and Measurements
Understanding InfluxDB Basics: Tags, Fields and MeasurementsUnderstanding InfluxDB Basics: Tags, Fields and Measurements
Understanding InfluxDB Basics: Tags, Fields and MeasurementsInfluxData
 
Simplifying Change Data Capture using Databricks Delta
Simplifying Change Data Capture using Databricks DeltaSimplifying Change Data Capture using Databricks Delta
Simplifying Change Data Capture using Databricks DeltaDatabricks
 
Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...
Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...
Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...Cloudera, Inc.
 

What's hot (20)

Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
 
Streaming Auto-scaling in Google Cloud Dataflow
Streaming Auto-scaling in Google Cloud DataflowStreaming Auto-scaling in Google Cloud Dataflow
Streaming Auto-scaling in Google Cloud Dataflow
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 
Achieve Blazing-Fast Ingest Speeds with Apache Arrow
Achieve Blazing-Fast Ingest Speeds with Apache ArrowAchieve Blazing-Fast Ingest Speeds with Apache Arrow
Achieve Blazing-Fast Ingest Speeds with Apache Arrow
 
Analyzing and processing streaming data with Amazon EMR - ADB204 - New York A...
Analyzing and processing streaming data with Amazon EMR - ADB204 - New York A...Analyzing and processing streaming data with Amazon EMR - ADB204 - New York A...
Analyzing and processing streaming data with Amazon EMR - ADB204 - New York A...
 
Elastic 101 - Get started
Elastic 101 - Get startedElastic 101 - Get started
Elastic 101 - Get started
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
DataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven OrganizationsDataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven Organizations
 
NATS Streaming - an alternative to Apache Kafka?
NATS Streaming - an alternative to Apache Kafka?NATS Streaming - an alternative to Apache Kafka?
NATS Streaming - an alternative to Apache Kafka?
 
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data PlatformsData Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
 
Dynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisationDynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisation
 
HBase in Practice
HBase in PracticeHBase in Practice
HBase in Practice
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Migrate and Modernize Hadoop-Based Security Policies for Databricks
Migrate and Modernize Hadoop-Based Security Policies for DatabricksMigrate and Modernize Hadoop-Based Security Policies for Databricks
Migrate and Modernize Hadoop-Based Security Policies for Databricks
 
Architecture patterns for distributed, hybrid, edge and global Apache Kafka d...
Architecture patterns for distributed, hybrid, edge and global Apache Kafka d...Architecture patterns for distributed, hybrid, edge and global Apache Kafka d...
Architecture patterns for distributed, hybrid, edge and global Apache Kafka d...
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
 
Understanding InfluxDB Basics: Tags, Fields and Measurements
Understanding InfluxDB Basics: Tags, Fields and MeasurementsUnderstanding InfluxDB Basics: Tags, Fields and Measurements
Understanding InfluxDB Basics: Tags, Fields and Measurements
 
Simplifying Change Data Capture using Databricks Delta
Simplifying Change Data Capture using Databricks DeltaSimplifying Change Data Capture using Databricks Delta
Simplifying Change Data Capture using Databricks Delta
 
Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...
Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...
Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...
 

Similar to Powering Custom Apps at Facebook using Spark Script Transformation

Seattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APISeattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APIshareddatamsft
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Simplilearn
 
Strata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapStrata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapJulien Le Dem
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Provectus
 
Productionizing Machine Learning - Bigdata meetup 5-06-2019
Productionizing Machine Learning - Bigdata meetup 5-06-2019Productionizing Machine Learning - Bigdata meetup 5-06-2019
Productionizing Machine Learning - Bigdata meetup 5-06-2019Iulian Pintoiu
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPDatabricks
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowDataWorks Summit
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internalDavid Lauzon
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Jason Dai
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to ProductionMostafa Majidpour
 
Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020Pavel Hardak
 
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020Eren Avşaroğulları
 
TensorFlow meetup: Keras - Pytorch - TensorFlow.js
TensorFlow meetup: Keras - Pytorch - TensorFlow.jsTensorFlow meetup: Keras - Pytorch - TensorFlow.js
TensorFlow meetup: Keras - Pytorch - TensorFlow.jsStijn Decubber
 
Practices and tools for building better APIs
Practices and tools for building better APIsPractices and tools for building better APIs
Practices and tools for building better APIsNLJUG
 
Practices and tools for building better API (JFall 2013)
Practices and tools for building better API (JFall 2013)Practices and tools for building better API (JFall 2013)
Practices and tools for building better API (JFall 2013)Peter Hendriks
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Michael Rys
 
Apache Arrow and Pandas UDF on Apache Spark
Apache Arrow and Pandas UDF on Apache SparkApache Arrow and Pandas UDF on Apache Spark
Apache Arrow and Pandas UDF on Apache SparkTakuya UESHIN
 

Similar to Powering Custom Apps at Facebook using Spark Script Transformation (20)

Seattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APISeattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp API
 
MLeap: Release Spark ML Pipelines
MLeap: Release Spark ML PipelinesMLeap: Release Spark ML Pipelines
MLeap: Release Spark ML Pipelines
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
 
Strata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapStrata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmap
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
 
Spark ML Pipeline serving
Spark ML Pipeline servingSpark ML Pipeline serving
Spark ML Pipeline serving
 
Productionizing Machine Learning - Bigdata meetup 5-06-2019
Productionizing Machine Learning - Bigdata meetup 5-06-2019Productionizing Machine Learning - Bigdata meetup 5-06-2019
Productionizing Machine Learning - Bigdata meetup 5-06-2019
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to Production
 
Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020Spark Development Lifecycle at Workday - ApacheCon 2020
Spark Development Lifecycle at Workday - ApacheCon 2020
 
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
 
TensorFlow meetup: Keras - Pytorch - TensorFlow.js
TensorFlow meetup: Keras - Pytorch - TensorFlow.jsTensorFlow meetup: Keras - Pytorch - TensorFlow.js
TensorFlow meetup: Keras - Pytorch - TensorFlow.js
 
Practices and tools for building better APIs
Practices and tools for building better APIsPractices and tools for building better APIs
Practices and tools for building better APIs
 
Practices and tools for building better API (JFall 2013)
Practices and tools for building better API (JFall 2013)Practices and tools for building better API (JFall 2013)
Practices and tools for building better API (JFall 2013)
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
 
Apache Arrow and Pandas UDF on Apache Spark
Apache Arrow and Pandas UDF on Apache SparkApache Arrow and Pandas UDF on Apache Spark
Apache Arrow and Pandas UDF on Apache Spark
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Powering Custom Apps at Facebook using Spark Script Transformation

  • 2. Powering Custom Apps at Facebook using Spark Script Transformation. Abdulrahman Alfozan, Spark Summit Europe.
  • 3. Agenda: 1. Intro to Spark Script Transforms; 2. Spark Transforms at Facebook; 3. Core Engine Improvements; 4. Efficiency Analysis and Results; 5. Transforms Execution Model; 6. Future Plans.
  • 4. Spark at Facebook. 2015: small-scale experiments. 2016: a few pipelines in production. 2017: running 60TB+ shuffle pipelines. 2018: full production deployment; successor to Apache Hive at Facebook. 2019: scaling Spark; largest compute engine at Facebook by CPU. Reliability and efficiency are our top priority.
  • 6. Script Transforms. SQL query:
      SELECT TRANSFORM (inputs)
      USING 'script'
      AS (outputs)
      FROM src_tbl;
  • 7. Script Transforms. Query plan for the same query: ScriptTransformation (inputs, script, outputs) on top of TableScan (src_tbl).
  • 8. Script Transforms. Execution: Spark reads the input table, sends the inputs to an external process that runs the script, and writes the outputs it gets back to the output table.
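The execution flow on slide 8 can be sketched as a small stand-alone transform. This is a minimal sketch, not the Transformer library from the talk: it assumes rows arrive as tab-separated text, one row per line, which is the default text row format of the TRANSFORM clause in open-source Spark when no SerDe is specified. The names `splitTabs` and `runTransform` and the per-row logic are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Split one input line into fields on the tab character (empty fields are kept).
std::vector<std::string> splitTabs(const std::string& line) {
  std::vector<std::string> fields;
  std::size_t start = 0;
  while (true) {
    std::size_t tab = line.find('\t', start);
    if (tab == std::string::npos) {
      fields.push_back(line.substr(start));
      return fields;
    }
    fields.push_back(line.substr(start, tab - start));
    start = tab + 1;
  }
}

// Read rows from `in` until EOF and write one output row per input row to `out`.
// Hypothetical per-row logic: emit the first column and the length of the second.
void runTransform(std::istream& in, std::ostream& out) {
  std::string line;
  while (std::getline(in, line)) {
    std::vector<std::string> fields = splitTabs(line);
    if (fields.size() < 2) continue;  // this sketch skips malformed rows
    out << fields[0] << '\t' << fields[1].size() << '\n';
  }
}
```

A real executable would call `runTransform(std::cin, std::cout)` from `main`, since Spark feeds rows to the process on stdin and reads results from stdout.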
  • 11. Why Script Transforms? 1. Flexibility: unlike UDFs, transforms are not restricted in the use cases they can support. 2. Efficiency: most transformers are written in C++. Transforms provide custom data processing while relying on Spark for ETL, data partitioning, distributed execution, and fault tolerance. For example, Spark is optimized for ETL, while PyTorch is optimized for model serving.
  • 13. Transform Pipelines Usage (chart; y-axis: % of overall CPU, from 0% to 15%).
  • 14. Transform Pipelines Usage: Query Count vs. CPU Comparison. Two charts (Count, CPU), values in transcript order: Pure SQL 54% / 72%; Transforms & UDFs 45% / 20%; DataFrames 1% / 8%.
  • 15. Use-case 1: Batch Inference. SQL query:
      ADD FILES inference_engine, model.md;
      SELECT TRANSFORM (id INT, metadata STRING, image STRING)
      ROW FORMAT SERDE 'JSONSimpleSerDe'
      USING 'inference_engine --model=model.md'
      AS labels MAP<STRING, DOUBLE>
      ROW FORMAT SERDE 'JSONSimpleSerDe'
      FROM tlb_images;
    Annotations: ADD FILES lists the transform resources; the TRANSFORM clause lists the input columns; the first ROW FORMAT SERDE sets the input format and the second sets the output format; the output labels map each category to its confidence.
  • 16. Use-case 1: Batch Inference. Transform main.cpp (the include pulls in the transform library; readRow acts as the row iterator):
      #include "spark/Transformer.h"
      ...
      while (transformer.readRow(input)) {
        // data processing
        auto prediction = predict(input);
        // write output map
        transformer.writeRow(prediction);
      }
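  • 17. Use-case 1: Batch Inference (execution diagram). Inside the Spark executor, a Spark task serializes each InternalRow into JSON, e.g. {id:1, metadata:, image:…}, and writes it to the stdin of the transform process. The transform process is a self-contained executable that bundles the PyTorch runtime container and the model. It deserializes the JSON into C++ objects, serializes the resulting C++ objects back into JSON, e.g. {label_1: score, label_2: score}, and writes them to stdout, where the Spark task deserializes the JSON into an InternalRow.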
  • 18. Use-case 2: Batch Indexing. SQL query:
      ADD FILES indexer;
      SELECT TRANSFORM (shard_id INT, data STRING)
      ROW FORMAT SERDE 'RowFormatDelimited'
      USING 'indexer --schema=data<STRING>'
      FROM src_tbl
      CLUSTER BY shard_id;
    Annotations: ADD FILES lists the transform resources; the TRANSFORM clause lists the input columns; ROW FORMAT SERDE sets the input format; CLUSTER BY is the partition operator.
  • 19. Use-case 2: Batch Indexing (execution diagram). Mapper 1 holds rows with shard_id 1, 1, 2 and Mapper 2 holds rows with shard_id 1, 2, 2. The shuffle routes all shard_id 1 rows to Reducer 1 and all shard_id 2 rows to Reducer 2. Each reducer is a Spark task that pipes its rows through stdin to a transform process running indexer.
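The shuffle step on slide 19 can be illustrated with a toy partitioner. This is only a sketch: it uses a plain non-negative modulo as a stand-in for Spark's hash partitioning, and `Row`, `reducerFor`, and `shuffleByShard` are hypothetical names. What it demonstrates is the CLUSTER BY guarantee that every row with a given shard_id reaches the same reducer, sorted by shard_id within it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct Row {
  int32_t shardId;
  std::string data;
};

// Stand-in partitioner (assumption): non-negative modulo of the key.
int reducerFor(int32_t shardId, int numReducers) {
  int r = shardId % numReducers;
  return r < 0 ? r + numReducers : r;
}

// Route every mapper's rows to a reducer, then sort each reducer's rows by shard_id.
std::map<int, std::vector<Row>> shuffleByShard(
    const std::vector<std::vector<Row>>& mappers, int numReducers) {
  std::map<int, std::vector<Row>> reducers;
  for (const auto& mapperRows : mappers) {
    for (const auto& row : mapperRows) {
      reducers[reducerFor(row.shardId, numReducers)].push_back(row);
    }
  }
  for (auto& entry : reducers) {
    std::stable_sort(entry.second.begin(), entry.second.end(),
                     [](const Row& a, const Row& b) { return a.shardId < b.shardId; });
  }
  return reducers;
}
```

With the two mappers from the slide and two reducers, each reducer ends up with three rows that all share one shard_id, so a single indexer process sees a complete shard.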
  • 21. Core Engine Improvements: Operator (ScriptTransformationExec.scala). Direct process invocation; class IOSchema to handle SerDe schema and config; MonitorThread to track transform process progress; transform process error handling and surfacing.
  • 22. Core Engine Improvements: SerDe support. DelimitedJSONSerDe.scala: JSON format following the RFC 8259 standard.
  • 23. Core Engine Improvements: SerDe support. SimpleSerDe.scala: ROW FORMAT DELIMITED, with configurable properties: FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '|' MAP KEYS TERMINATED BY ':' LINES TERMINATED BY '\n'.
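To make the four delimiter properties on slide 23 concrete, here is a minimal encoding sketch. It is not SimpleSerDe.scala: the `Delimiters` struct, the function names, and the example row schema (an INT, an ARRAY<INT>, and a MAP<STRING, INT>) are assumptions for illustration, and escaping of delimiter characters inside values is not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Delimiters matching the configurable properties listed on the slide.
struct Delimiters {
  char field = ',';
  char collectionItem = '|';
  char mapKey = ':';
  char line = '\n';
};

// ARRAY<INT>: items joined by the collection-item delimiter.
std::string encodeArray(const std::vector<int32_t>& items, const Delimiters& d) {
  std::string out;
  for (std::size_t i = 0; i < items.size(); ++i) {
    if (i > 0) out += d.collectionItem;
    out += std::to_string(items[i]);
  }
  return out;
}

// MAP<STRING, INT>: entries joined by the collection-item delimiter,
// key and value joined by the map-key delimiter.
std::string encodeMap(const std::map<std::string, int32_t>& m, const Delimiters& d) {
  std::string out;
  bool first = true;
  for (const auto& kv : m) {
    if (!first) out += d.collectionItem;
    first = false;
    out += kv.first;
    out += d.mapKey;
    out += std::to_string(kv.second);
  }
  return out;
}

// One row: fields joined by the field delimiter, terminated by the line delimiter.
std::string encodeRow(int32_t id, const std::vector<int32_t>& tags,
                      const std::map<std::string, int32_t>& counts, const Delimiters& d) {
  std::string out = std::to_string(id);
  out += d.field;
  out += encodeArray(tags, d);
  out += d.field;
  out += encodeMap(counts, d);
  out += d.line;
  return out;
}
```

For example, id 7 with tags [1, 2, 3] and counts {a: 10, b: 20} encodes as the line `7,1|2|3,a:10|b:20`.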
  • 24. Core Engine Improvements: SerDe support. Development: text-based (DelimitedJSONSerDe.scala, SimpleSerDe.scala). Production: binary (?).
  • 27. Core Engine Improvements: Production SerDe Requirements. Binary format: text-based encoding is slow and less compact. Zero-copy: access to serialized data without parsing or unpacking (see "Improving Facebook's performance on Android with FlatBuffers"). Word-aligned data: allows for SIMD optimizations.
  • 28. Binary SerDe Considerations. LazyBinarySerDe (Apache Hive): neither zero-copy nor word-aligned; requires converters in Spark. Protocol Buffers / Thrift: not zero-copy; more suited for RPC. FlatBuffers / Cap'n Proto: require converters (to/from InternalRow) in Spark core. Apache Arrow: a great future option.
  • 29. Binary SerDe Considerations: chosen format, UnsafeRow. Binary and word-aligned; zero-copy; already part of Spark core; converters to/from InternalRow are available.
  • 30. UnsafeRow SerDe SPARK-7076: Introduced UnsafeRow format to Spark apache/spark/sql/catalyst/expressions/UnsafeRow.java
  • 31. UnsafeRow SerDe SPARK-15962: Introduced UnsafeArrayData and UnsafeMapData apache/spark/sql/catalyst/expressions/UnsafeArrayData.java apache/spark/sql/catalyst/expressions/UnsafeMapData.java
  • 32. UnsafeRow SerDe: UnsafeRow SerDe C++ library. SQL datatypes and their C++ datatypes: INT -> int32_t; BIGINT -> int64_t; BOOLEAN -> bool; FLOAT -> float; DOUBLE -> double; STRING -> unsaferow::String; ARRAY<INT> -> unsaferow::List<int32_t>; MAP<INT,STRING> -> unsaferow::Map<int32_t, unsaferow::String>.
  • 33. UnsafeRow SerDe: UnsafeRow SerDe C++ library. SQL query:
      SELECT TRANSFORM (id INT)
      ROW FORMAT SERDE 'UnsafeRowSerDe'
      USING 'script'
      AS (value BIGINT)
      ROW FORMAT SERDE 'UnsafeRowSerDe'
      FROM src_tbl;
    C++ transformer:
      #include "spark/Transformer.h"
      while (transformer.readRow(input)) {
        // data processing
        int32_t id = input->getID();
        output->setValue(id*id);
        // write output
        transformer.writeRow(output);
      }
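The zero-copy and word-aligned properties become clearer with a sketch of how a reader can address fields inside an UnsafeRow buffer. This is not the unsaferow C++ library from the talk; the function names and the `StringRef` type are hypothetical. It follows the layout documented in Spark's UnsafeRow.java: a null bitset padded to 8-byte words, then one 8-byte slot per field, then variable-length data, where a variable-length slot packs the offset from the row start in its upper 32 bits and the size in its lower 32 bits. Host byte order is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Width in bytes of the null bitset: one bit per field, rounded up to 8-byte words.
std::size_t bitsetWidthBytes(std::size_t numFields) {
  return ((numFields + 63) / 64) * 8;
}

// True if the null bit for field `ordinal` is set.
bool isNullAt(const uint8_t* row, std::size_t ordinal) {
  uint64_t word;
  std::memcpy(&word, row + (ordinal / 64) * 8, sizeof(word));
  return ((word >> (ordinal % 64)) & 1ULL) != 0;
}

// INT fields occupy the first 4 bytes of their 8-byte slot.
int32_t readInt(const uint8_t* row, std::size_t numFields, std::size_t ordinal) {
  int32_t value;
  std::memcpy(&value, row + bitsetWidthBytes(numFields) + ordinal * 8, sizeof(value));
  return value;
}

// A view into the row buffer: no bytes are copied out of it.
struct StringRef {
  const char* data;
  std::size_t size;
};

// STRING slots hold (offset << 32) | size; the bytes live in the variable-length region.
StringRef readString(const uint8_t* row, std::size_t numFields, std::size_t ordinal) {
  uint64_t slot;
  std::memcpy(&slot, row + bitsetWidthBytes(numFields) + ordinal * 8, sizeof(slot));
  std::size_t offset = static_cast<std::size_t>(slot >> 32);
  std::size_t size = static_cast<std::size_t>(slot & 0xFFFFFFFFULL);
  return StringRef{reinterpret_cast<const char*>(row + offset), size};
}
```

Because every slot sits at a fixed, computable offset, reading a field is pointer arithmetic plus a small load, and `readString` returns a view into the buffer instead of a parsed copy.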
  • 34. Core Engine Improvements: SerDe support summary. Development: text-based (DelimitedJSONSerDe.scala, SimpleSerDe.scala). Production: binary (UnsafeRowSerDe.scala).
  • 35. Core Engine Improvements: aggregation and projection support (SQL):
      SELECT TRANSFORM (id, AVG(value) AS value_avg)
      USING 'script'
      AS (output)
      FROM src_tbl
      GROUP BY id;
  • 37. Efficiency Analysis: SerDe overhead. Formats compared: text-based (UTF-8): JSON and Row Format Delimited; binary: UnsafeRow. Diagram label: JSON lib.
  • 39. Efficiency Analysis Text-SerDe CPU overhead: Transform process
  • 41. Efficiency Analysis: SerDe overhead. Text-based SerDe overhead is non-negligible, especially for complex types. SerDe cost can be up to 70% of a pipeline's CPU resources. Solution: use an efficient binary SerDe.
  • 42. Efficiency Analysis: UnsafeRow Efficient Binary SerDe UnsafeRow C++ lib
  • 45. UnsafeRow SerDe Benchmark: text-based SerDe vs. binary SerDe (UnsafeRow). End-to-end CPU savings for transform pipelines: up to 4x. Complex-type SerDe was impacted the most.
  • 47. Transforms Execution Model: Resource Request. CPU cores per container: spark.executor.cores = 4. Memory per container: spark.executor.memory = 4GB + spark.transform.memory = 4GB. Diagram: the Spark driver sends a resource request (CPU cores = 4, memory = 8GB) to the cluster manager; node managers launch the Spark executors; each executor runs tasks (Task 1, Task 2) alongside their transform processes (Process 1, Process 2).
  • 50. Transforms Execution Model: Resource Control. JVM memory limits: -Xms, -Xmx and -Xss. CPU threads: spark.executor.cores, spark.task.cpus. These limits are irrelevant when running an external process! Solution: cgroup v2 containers.
  • 51. Transforms Execution Model: Resource Control & Isolation. cgroup v2 controllers: cpu.weight allows multi-threaded transforms; memory.max OOM-kills offending processes; io.latency provides IO QoS.
  • 52. Transforms Execution Model Resource Control & Isolation /cgroup2/task_container/exec1
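How such limits could be applied to a per-executor cgroup can be sketched as follows. This is an assumption-laden illustration, not Facebook's implementation: `cgroupSettings` and `applySettings` are hypothetical helpers, only cpu.weight and memory.max are covered, and the 4GB figure reuses spark.transform.memory from slide 47. The interface file names and the cpu.weight range of 1 to 10000 come from the Linux cgroup v2 interface.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <map>
#include <stdexcept>
#include <string>

// Build the interface-file contents for one transform container.
std::map<std::string, std::string> cgroupSettings(int cpuWeight, uint64_t memoryMaxBytes) {
  if (cpuWeight < 1 || cpuWeight > 10000) {
    throw std::invalid_argument("cpu.weight must be in [1, 10000]");
  }
  std::map<std::string, std::string> settings;
  settings["cpu.weight"] = std::to_string(cpuWeight);
  settings["memory.max"] = std::to_string(memoryMaxBytes);
  return settings;
}

// Write each setting into the matching file under the cgroup directory.
// Returns false if any file cannot be opened or written.
bool applySettings(const std::string& cgroupDir,
                   const std::map<std::string, std::string>& settings) {
  for (const auto& kv : settings) {
    std::ofstream file(cgroupDir + "/" + kv.first);
    if (!file) return false;
    file << kv.second << '\n';
    file.flush();
    if (!file) return false;
  }
  return true;
}
```

For the path shown on slide 52, a caller with sufficient privileges might run `applySettings("/cgroup2/task_container/exec1", cgroupSettings(100, 4ULL * 1024 * 1024 * 1024))`. The kernel then enforces the limits on every process in that cgroup, including the external transform process that the JVM flags cannot constrain.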
  • 54. Future Plans: binary SerDe based on Apache Arrow; vectorization.