Copyright 1995-2020 Arm Limited (or its affiliates). All rights reserved.
Taro L. Saito
Arm Treasure Data
July 31st, 2020
Spark Meetup Tokyo #3
td-spark internals
Extending Spark with Airframe
About Me: Taro L. Saito
● Ph.D., Principal Software Engineer at Arm Treasure Data
● Living in the US for 5 years
● Created Presto as a service
● Processing 1 million SQL queries / day on the cloud. Presto Webinar
● OSS:
● Airframe, snappy-java (used in Parquet, Spark core), sbt-sonatype, etc.
● Books: WIP
Challenge: Adding Treasure Data Support to Spark
● PlazmaDB: Cloud Data Store of Treasure Data
● MessagePack-based columnar format (MPC1)
● Each table column is represented as a sequence of MessagePack values
● What was necessary for supporting Spark?
● td-spark driver (td-spark-assembly.jar)
■ MPC1 <-> DataFrame conversion
● Plazma Public API
■ APIs for reading and writing MPC1 files from PlazmaDB
● Created these two components with Airframe OSS
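As a rough illustration of the columnar idea above, the following Python sketch turns row records into per-column value sequences and back. It is only a sketch under assumptions: the real MPC1 format encodes each value with MessagePack and its file layout is not described in these slides, so the encoding step is left out, and rows_to_columns / columns_to_rows are hypothetical helper names.

```python
def rows_to_columns(rows, schema):
    """Transpose row dicts into one value sequence per column.

    Missing fields become None, standing in for a nil value.
    (Hypothetical helper; actual MPC1 stores MessagePack-encoded values.)
    """
    return {name: [row.get(name) for row in rows] for name in schema}


def columns_to_rows(columns, schema):
    """Rebuild row dicts from per-column sequences of equal length."""
    length = len(columns[schema[0]]) if schema else 0
    return [{name: columns[name][i] for name in schema} for i in range(length)]


# Example: two rows become two column sequences.
example_rows = [{"time": 1, "path": "/a"}, {"time": 2, "path": "/b"}]
example_columns = rows_to_columns(example_rows, ["time", "path"])
```

Reading a single column then touches only that column's sequence, which is the usual motivation for a columnar layout.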
Airframe: Core Scala Modules of Treasure Data
● Airframe
● Scala OSS assets built from our knowledge, production experience, and design decisions
● 20+ Common Utilities for Scala
● Dependency Injection (DI)
● Airframe RPC
■ HTTP Server, Client builder (ScalaMatsuri, Tokyo, October 2020)
● AirSpec
■ Testing framework for Scala (ScalaDays, Seattle, May 2021)
[Figure labels: Knowledge, Experiences, Design Decisions; Products, 24/7 Services, Business Values; Programming; OSS Outcome; Airframe]
Airframe Modules Used Inside td-spark
[Figure labels: Airframe DI, DataFrame, MPC1, airframe-codec, airframe-msgpack, Plazma Public API, airframe-http, airframe-finagle, Airframe RPC, airframe-fluentd, Master, Worker, Design, SparkContext, TDSparkContext, TDSparkService, MPC1 Reader/Writer, IO Manager, airframe-config, airframe-launcher, airframe-jmx, airframe-metrics, airframe-control, airframe-log, airframe-json, td-spark.jar]
Reading MPC1 Partitions and Column Blocks
[Figure: Table Data (mpc1 partitions) -> Plazma Public API (Columnar Data Download) -> column blocks -> td-spark / td-pyspark (Parallel Read) -> DataFrames -> User Programs]
Uploading DataFrame as MPC1 Partitions
[Figure: User Programs (td-spark, td-pyspark) -> DataFrame -> Format Conversion -> mpc1 partitions -> Parallel Upload to Amazon S3 -> Plazma Public API (Copy, Transaction) -> Table Data]
Airframe RPC
[Figure labels: RPC Interface, Router, RPC Impl, RPC Web Server, Scala.js Client, HTTP/gRPC Client, Open API Spec, Cross-Language RPC Client, Scala.js Web Application, Micro Service, API Documentation, RPC Calls, JSON, Generate, Create; modules: sbt-airframe, airframe-http, airframe-http-finagle, airframe-http-rx, airframe-codec, airframe-gRPC]
● Use Scala As An RPC Interface
● Generate HTTP Server/Client (REST or gRPC)
● HTTP calls -> JSON/MessagePack data -> Remote Scala function calls
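The flow in the last bullet (HTTP call -> JSON data -> function call) can be sketched in a few lines of Python. This is not Airframe RPC, which generates servers and clients from Scala interfaces; it is a minimal stand-in using only the standard library, and the Router class, the /rpc/ path prefix, and the hello function are all hypothetical.

```python
import json


class Router:
    """Maps a request path to a plain function (hypothetical, not Airframe's Router)."""

    def __init__(self):
        self._routes = {}

    def add(self, func):
        # Register the function under a path derived from its name.
        self._routes["/rpc/" + func.__name__] = func
        return func

    def dispatch(self, path, body):
        # Decode the JSON request body into keyword arguments,
        # call the function, and encode its result as JSON.
        func = self._routes[path]
        kwargs = json.loads(body)
        return json.dumps(func(**kwargs))


router = Router()


@router.add
def hello(name):
    return {"message": f"Hello {name}!"}


# A request for /rpc/hello with a JSON body becomes a local function call.
response = router.dispatch("/rpc/hello", '{"name": "Spark"}')
```

The point of the design is that the function signature itself acts as the interface, so callers and servers only need to agree on the argument names and the serialization format.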
td-spark: Adding More Functions to Spark
● Using an implicit class to extend SparkSession (spark variable)
● Adding TD-specific functionalities
● Time series data queries
■ e.g., spark.td.table("TD's table").within("-1h").df
● Predicate pushdown for time-series data
● etc.
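To make the within("-1h") style of time-series query more concrete, here is a small Python sketch that turns a relative offset string into a [start, end) range of Unix timestamps. It assumes that "-1h" simply means "the hour ending now" with no alignment to hour boundaries; the actual td-spark semantics may differ, and parse_offset / time_range are hypothetical names, not td-spark functions.

```python
import re
from datetime import datetime, timedelta, timezone

# Supported unit suffixes (an assumption; only h, d, and w appear in the slides).
_UNITS = {"h": "hours", "d": "days", "w": "weeks"}


def parse_offset(expr):
    """Parse a relative offset such as '-1h' or '-7d' into a timedelta."""
    match = re.fullmatch(r"-(\d+)([hdw])", expr)
    if match is None:
        raise ValueError(f"unsupported offset: {expr!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(**{_UNITS[unit]: amount})


def time_range(expr, now):
    """Return (start, end) Unix timestamps for the window ending at `now`."""
    end = int(now.timestamp())
    start = end - int(parse_offset(expr).total_seconds())
    return start, end


# Example: the one-hour window ending at a fixed reference time.
reference = datetime(2020, 7, 31, 12, 0, tzinfo=timezone.utc)
start, end = time_range("-1h", reference)
```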
Tips: Avoid Task Serialization Errors with Airframe DI
● Serializable
● spark.conf key-value properties inside SparkContext
● Non-Serializable
● Complex service objects
● Solution: Airframe DI (Dependency Injection)
● Distribute the service design (= how to construct objects) with the jar file
● Build service objects from the design (20+ components and config objects)
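The same idea can be shown with a Python analogy, using the standard pickle module in place of Spark's task serialization. This is not Airframe DI: it only illustrates why shipping a small, serializable description of how to build a service works where shipping the service object itself fails. The Service class, build_service function, and the endpoint value are hypothetical.

```python
import pickle
import threading


class Service:
    """A 'complex service object' that holds a non-serializable resource."""

    def __init__(self, config):
        self.config = config
        self._lock = threading.Lock()  # thread locks cannot be pickled


def build_service(config):
    # The 'design': how to construct the service from plain configuration.
    return Service(config)


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False


# Plain key-value configuration is serializable (the endpoint is a made-up value).
config = {"endpoint": "https://api.example.com", "retries": 3}

# Ship only the configuration, then rebuild the service on the receiving side.
payload = pickle.dumps(config)
worker_service = build_service(pickle.loads(payload))
```

In this analogy, build_service plays the role of the design that is distributed with the jar file, and config plays the role of the serializable key-value properties.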
[Figure: sending TDSparkContext / TDSparkService objects from the Master to a Worker causes a serialization error; instead, the Design and Config travel with td-spark.jar, and the Worker builds TDSparkService with Airframe DI (Build OK!)]
Flexible Format Conversion with MessagePack
[Figure: DataFrame <-> Airframe Codec (Pack/Unpack) <-> MPC1 (Plazma Public API) and JDBC ResultSet]
Spark 3.0 and PySpark Support
Resources
● Airframe: https://wvlet.org/airframe/
● Airframe Meetup #1 ~ #3 reports
● ScalaMatsuri 2019 presentation
■ And more!!
● td-spark documentation: https://treasure-data.github.io/td-spark/
● See Also: Spark with Airframe (@smdmts)
● Spark to Spark data transfer with MessagePack-based airframe-codec
● Spark -> AWS service call management with airframe-control
Appendix
Treasure Data: A Ready-to-Use Cloud Data Platform
[Figure labels: Logs, Device Data, Batch Data; Data Collection, Cloud Storage, Distributed Data Processing; PlazmaDB, Table Schema; Jobs, Job Management, SQL Editor, Scheduler, Workflows, Machine Learning; Treasure Data OSS, Third Party OSS]
TDSparkContext
td-pyspark
● Supporting PySpark
● Access Scala methods of td-spark:
■ sparkContext._jvm.(jvm package name).method(...)
● Conversion to PySpark’s DataFrame
● DataFrame(Scala DataFrame, sqlContext)
Predicate Pushdown
● Traverse DataFrame Column Filters
● Extract time conditions (e.g., -1d, -1w, -7d, etc.)
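The two steps above can be sketched in Python as a pass over a list of column filters that separates time conditions from everything else. This is a simplified stand-in, not td-spark's implementation: it assumes filters arrive as flat (column, operator, value) triples, that the time column is named "time", and that only ">=" and "<" bounds are pushed down. Filter and split_time_filters are hypothetical names.

```python
from typing import NamedTuple


class Filter(NamedTuple):
    column: str
    op: str
    value: object


def split_time_filters(filters, time_column="time"):
    """Extract a (start, end) time range and return the remaining filters.

    start is the tightest lower bound (>=), end the tightest upper bound (<).
    Either may be None when no such condition is present.
    """
    start, end, residual = None, None, []
    for f in filters:
        if f.column == time_column and f.op == ">=":
            start = f.value if start is None else max(start, f.value)
        elif f.column == time_column and f.op == "<":
            end = f.value if end is None else min(end, f.value)
        else:
            # Not a pushable time condition: leave it for normal evaluation.
            residual.append(f)
    return (start, end), residual


# Example: two lower bounds and one upper bound on time, plus one other filter.
example_filters = [
    Filter("time", ">=", 100),
    Filter("code", "=", 200),
    Filter("time", "<", 500),
    Filter("time", ">=", 300),
]
time_bounds, remaining = split_time_filters(example_filters)
```

The extracted range can then be used to limit which partitions are read, while the remaining filters are applied to the rows that come back.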
Class Loader Hierarchy of Databricks
● Base class loader
● User library class loader
● td-spark.jar will be loaded here
● Shared between multiple notebooks
■ Static variables used inside td-spark.jar will be shared by multiple notebooks!
● REPL class loader
● Shared between multiple notebooks
● Spark-library class loader
● Notebook-local
● Notebook-local class loader
● Caching local instances in static variables in td-spark caused a ClassNotFound error in other notebooks
Using Presto with Spark
● presto-jdbc
● Submit select * from (Original SQL) limit 0 => Query result schema
● JDBC ResultSet => Airframe Codec => DataFrame
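The schema-probing trick in the second bullet can be tried out in Python. The sketch below wraps a query so that it returns no rows and then reads the column names from the cursor metadata. It uses the standard sqlite3 module as a stand-in for Presto and presto-jdbc, so it only demonstrates the wrapping idea; schema_probe_sql and result_columns are hypothetical names, and SQLite reports column names but not Presto-style types.

```python
import sqlite3


def schema_probe_sql(original_sql):
    """Wrap a query so it yields its result columns but zero rows."""
    inner = original_sql.strip().rstrip(";").strip()
    return f"select * from ({inner}) limit 0"


def result_columns(conn, original_sql):
    """Run the probe query and return the result column names."""
    cursor = conn.execute(schema_probe_sql(original_sql))
    return [column[0] for column in cursor.description]


# Example against an in-memory SQLite database (standing in for Presto).
conn = sqlite3.connect(":memory:")
columns = result_columns(conn, "select 1 as id, 'x' as name;")
```

Because of the limit 0, the probe does not need to fetch any result rows, so the schema can be known before the real query is run.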
td-prestobase: A Proxy Gateway to Presto Clusters
● td-prestobase is a proxy gateway to Presto clusters that speaks the standard Presto protocol, so it supports any Presto client (e.g., presto-cli, JDBC, ODBC)
● td-spark uses presto-jdbc and td-prestobase APIs for making Presto queries