
td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020


Techniques for extending Spark with the Airframe OSS. This presentation shows how td-spark extends Spark to support the Treasure Data platform.


  1. td-spark internals: Extending Spark with Airframe
     Taro L. Saito, Arm Treasure Data
     Spark Meetup Tokyo #3, July 31, 2020
     Copyright 1995-2020 Arm Limited (or its affiliates). All rights reserved.
  2. About Me: Taro L. Saito
     ● Ph.D., Principal Software Engineer at Arm Treasure Data
     ● Living in the US for 5 years
     ● Created Presto as a service, processing 1 million SQL queries/day on the cloud (Presto Webinar)
     ● OSS: Airframe, snappy-java (used in Parquet and Spark core), sbt-sonatype, etc.
     ● Books: WIP
  3. Challenge: Adding Treasure Data Support to Spark
     ● PlazmaDB: the cloud data store of Treasure Data
       ■ MessagePack-based columnar format (MPC1)
       ■ Each table column is represented as a sequence of MessagePack values
     ● What was necessary for supporting Spark?
       ■ td-spark driver (td-spark-assembly.jar): MPC1 <-> DataFrame conversion
       ■ Plazma Public API: APIs for reading and writing MPC1 files in PlazmaDB
     ● Both components were created with the Airframe OSS
  4. Airframe: Core Scala Modules of Treasure Data
     ● Scala OSS capturing our knowledge, production experience, and design decisions
     ● 20+ common utilities for Scala
       ■ Dependency injection (DI)
       ■ Airframe RPC: HTTP server and client builder (ScalaMatsuri Tokyo, October 2020)
       ■ AirSpec: a testing framework for Scala (ScalaDays Seattle, May 2021)
     (Diagram: knowledge, experience, and design decisions feed the Airframe OSS, which in turn supports products, 24/7 services, and business value.)
  5. Airframe Modules Used Inside td-spark
     (Diagram: td-spark.jar builds on Airframe DI together with airframe-codec, airframe-msgpack, airframe-json, airframe-log, airframe-config, airframe-launcher, airframe-jmx, airframe-metrics, airframe-control, and airframe-fluentd; the Plazma Public API side uses Airframe DI, Airframe RPC, airframe-http, and airframe-finagle. The design is shared between the Spark master (TDSparkContext, TDSparkService) and workers (MPC1 reader/writer, IO manager).)
  6. Reading MPC1 Partitions and Column Blocks
     (Diagram: the Plazma Public API exposes table data as MPC1 partitions and column blocks; td-spark downloads the columnar data in parallel and converts it into DataFrames for td-pyspark and user programs.)
  7. Uploading DataFrame as MPC1 Partitions
     (Diagram: td-spark and td-pyspark convert DataFrames into MPC1 partitions, upload them to Amazon S3 in parallel via the Plazma Public API, and commit them to the destination table with a copy transaction.)
  8. Airframe RPC
     ● Use Scala as an RPC interface
     ● Generate HTTP server/client (REST or gRPC)
     ● HTTP calls -> JSON/MessagePack data -> remote Scala function calls
     (Diagram: sbt-airframe generates HTTP/gRPC clients and an Open API spec from the RPC interface; airframe-http, airframe-http-finagle, airframe-http-rx, airframe-codec, and airframe-gRPC connect Scala.js web applications and cross-language RPC clients to the RPC web server and microservices.)
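The "Scala as an RPC interface" idea above can be sketched as a plain trait plus a router that dispatches an incoming (method name, payload) pair to a Scala method call. In Airframe RPC the router and clients are generated by sbt-airframe from the annotated trait; the service, method, and router below are hypothetical stand-ins, hand-written here to stay dependency-free.

```scala
// The RPC interface is an ordinary Scala trait.
trait PartitionService {
  def listPartitions(table: String): Seq[String]
}

// A server-side implementation of the interface.
class PartitionServiceImpl extends PartitionService {
  def listPartitions(table: String): Seq[String] =
    Seq(s"$table/p0.mpc1", s"$table/p1.mpc1")
}

// Hand-written stand-in for the generated router: maps a decoded
// (method name, argument) pair from an HTTP call onto a Scala method call.
class Router(service: PartitionService) {
  def dispatch(method: String, arg: String): Seq[String] = method match {
    case "listPartitions" => service.listPartitions(arg)
    case other            => sys.error(s"unknown RPC method: $other")
  }
}
```

In the real framework, request decoding (JSON/MessagePack to method arguments) is handled by airframe-codec, so the dispatch table never needs to be written by hand.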
  9. td-spark: Adding More Functions to Spark
     ● Uses an implicit class to extend SparkSession (the spark variable)
     ● Adds TD-specific functionality:
       ■ Time-series data queries, e.g., spark.td.table("TD's table").within("-1h").df
       ■ Predicate pushdown for time-series data
       ■ etc.
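The extension mechanism above is the standard Scala implicit-class pattern. A minimal, self-contained sketch, with a stub in place of SparkSession and a hypothetical query builder (only the pattern itself mirrors td-spark):

```scala
// Stand-in for SparkSession so the snippet runs without Spark.
case class SparkSessionStub(appName: String)

// Hypothetical time-series query builder, shaped like
// spark.td.table(...).within("-1h").
case class TDTable(name: String, range: Option[String]) {
  def within(duration: String): TDTable = copy(range = Some(duration))
  def describe: String = s"table=$name range=${range.getOrElse("all")}"
}

class TDSparkContext(spark: SparkSessionStub) {
  def table(name: String): TDTable = TDTable(name, None)
}

object TDImplicits {
  // Importing TDImplicits._ makes `spark.td` available on the session,
  // without modifying the session class itself.
  implicit class TDSparkSessionOps(val spark: SparkSessionStub) extends AnyVal {
    def td: TDSparkContext = new TDSparkContext(spark)
  }
}
```

Because the extension lives in an implicit class, users opt in with a single import and the original SparkSession API is untouched.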
  10. Tips: Avoid Task Serialization Errors with Airframe DI
      ● Serializable: spark.conf key-value properties inside SparkContext
      ● Non-serializable: complex service objects
      ● Solution: Airframe DI (dependency injection)
        ■ Distribute the service design (= how to construct objects) with the jar file
        ■ Build the service objects from the design (20+ components and config objects)
      (Diagram: shipping TDSparkService from the master to workers fails with a serialization error; shipping the Design inside td-spark.jar and building TDSparkService with Airframe on each worker works.)
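The core of this tip, serializing a small design (a recipe for building the service) rather than the service object itself, can be illustrated without Airframe. The class names below are stand-ins, and Airframe's actual Design API is richer than this sketch:

```scala
import java.io._

// A complex service object: NOT Serializable, so it cannot be
// captured in a Spark task closure.
class TDSparkService(val numComponents: Int)

// The design is a tiny, Serializable recipe for building the service.
// (Case classes are Serializable by default.)
case class ServiceDesign(numComponents: Int) {
  def build: TDSparkService = new TDSparkService(numComponents)
}

// Simulates shipping an object to a worker via Java serialization,
// which is what Spark does with task closures.
def sendToWorker[A <: Serializable](a: A): A = {
  val buf = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buf)
  out.writeObject(a)
  out.close()
  new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
    .readObject().asInstanceOf[A]
}
```

Serializing `new TDSparkService(...)` directly would throw NotSerializableException; serializing the design and calling `build` on the worker side succeeds, which is the pattern the slide describes.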
  11. Flexible Format Conversion with MessagePack
      (Diagram: Airframe Codec packs and unpacks data between DataFrame, MPC1, JDBC ResultSet, and the Plazma Public API, with MessagePack as the common intermediate format.)
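The pack/unpack pattern on this slide can be sketched without dependencies: one codec per record type converts to and from a shared intermediate representation, so any two endpoints that have a codec can exchange data. airframe-codec plays this role in td-spark with MessagePack as the intermediate format; the toy codec below uses a plain Map instead, and the record type is illustrative.

```scala
// An example record crossing format boundaries (MPC1, JDBC, DataFrame).
case class LogRecord(time: Long, path: String)

// The codec abstraction: pack to a shared intermediate form, unpack back.
trait Codec[A] {
  def pack(a: A): Map[String, Any]
  def unpack(m: Map[String, Any]): A
}

object LogRecordCodec extends Codec[LogRecord] {
  def pack(r: LogRecord): Map[String, Any] =
    Map("time" -> r.time, "path" -> r.path)
  def unpack(m: Map[String, Any]): LogRecord =
    LogRecord(m("time").asInstanceOf[Long], m("path").asInstanceOf[String])
}
```

With MessagePack as the intermediate, the same round-trip works across language and storage boundaries, which is why a single codec layer can bridge all four endpoints in the diagram.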
  12. Spark 3.0 and PySpark Support
  13. Resources
      ● Airframe: https://wvlet.org/airframe/
        ■ Airframe Meetup #1-#3 reports
        ■ ScalaMatsuri 2019 presentation, and more
      ● td-spark documentation: https://treasure-data.github.io/td-spark/
      ● See also: Spark with Airframe (@smdmts)
        ■ Spark-to-Spark data transfer with MessagePack-based airframe-codec
        ■ Spark -> AWS service call management with airframe-control
  14. Appendix
  15. Treasure Data: A Ready-to-Use Cloud Data Platform
      (Diagram: logs, device data, and batch data are collected into PlazmaDB, cloud storage with table schemas; distributed data processing jobs run on top with job management, a SQL editor, a scheduler, workflows, and machine learning, built from Treasure Data OSS and third-party OSS.)
  16. TDSparkContext
  17. td-pyspark
      ● Supporting PySpark
        ■ Access Scala methods of td-spark via sparkContext._jvm.(jvm package name).method(...)
      ● Conversion to PySpark's DataFrame: DataFrame(Scala DataFrame, sqlContext)
  18. Predicate Pushdown
      ● Traverse DataFrame column filters
      ● Extract time conditions (e.g., -1d, -1w, -7d, etc.)
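Extracting a time condition like -1h or -7d amounts to normalizing the relative-duration syntax into a concrete offset before pruning partitions. A minimal sketch, assuming a simple -<n><unit> grammar (td-spark's actual parser may support more forms):

```scala
// Parse a relative duration such as "-1h" or "-7d" into seconds.
// Supported units here: s, m, h, d, w (illustrative, not exhaustive).
def parseDuration(expr: String): Option[Long] = {
  val pattern = """-(\d+)([smhdw])""".r
  val unitSeconds = Map("s" -> 1L, "m" -> 60L, "h" -> 3600L,
                        "d" -> 86400L, "w" -> 604800L)
  expr match {
    case pattern(n, unit) => Some(n.toLong * unitSeconds(unit))
    case _                => None // not a recognized time condition
  }
}
```

Once a filter condition is normalized this way, the reader can request only the MPC1 partitions whose time range overlaps the window, instead of scanning the whole table.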
  19. Class Loader Hierarchy of Databricks
      ● Base class loader
      ● User library class loader
        ■ td-spark.jar is loaded here and shared between multiple notebooks
        ■ Static variables used inside td-spark.jar are therefore shared by multiple notebooks!
      ● REPL class loader: shared between multiple notebooks
      ● Spark-library class loader: notebook-local
      ● Notebook-local class loader
      ● Caching local instances in static variables in td-spark caused ClassNotFound errors in other notebooks
  20. Using Presto with Spark
      ● presto-jdbc
        ■ Submit select * from (Original SQL) limit 0 to obtain the query result schema
      ● JDBC ResultSet => Airframe Codec => DataFrame
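The limit 0 trick wraps the original SQL so Presto returns zero rows but a complete result schema, which presto-jdbc then exposes through ResultSet metadata. A minimal sketch of the query rewrite (the helper name is illustrative):

```scala
// Wrap a query so it returns no rows, only the result schema.
// Submitting this over JDBC lets ResultSetMetaData report the
// column names and types of the original query cheaply.
def schemaProbeQuery(originalSql: String): String =
  s"select * from ($originalSql) limit 0"
```

Probing the schema first means the DataFrame's column types can be fixed before any row data is fetched and converted through Airframe Codec.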
  21. td-prestobase: A Proxy Gateway to Presto Clusters
      ● td-prestobase is a proxy gateway to Presto clusters that speaks the standard Presto protocol, so it supports any Presto client (e.g., presto-cli, JDBC, ODBC)
      ● td-spark uses presto-jdbc and td-prestobase APIs to issue Presto queries