Virtualizing Apache Spark with Justin Murray

•

1 j'aime•847 vues

This talk explains the reasons why virtualizing Spark, in-house or elsewhere, is a requirement in today’s fast-moving and experimental world of data science and data engineering. Different teams want to spin up a Spark cluster “on the fly” to carry out some research and quickly answer business questions. They are not concerned with the availability of the server hardware – or with what any other team might be doing on it at the time. Virtualization provides the means of working within your own sandbox to try out the new query or Machine Learning algorithm. Deep performance test results will be shown that demonstrate that Spark and ML programs perform equally well on virtual machines just like native implementations do. An early introduction is given to the best practices you should adhere to when you do this. If time allows, a short demo will be given of creating an ephemeral, single-purpose Spark cluster, running an ML application test program on that cluster, and bringing it down when finished.

Données & analyses

Justin Murray, VMware
Virtualizing Spark on
VMware vSphere

Use Cases : Virtualization of Big
Data
• IT wants to provide Spark clusters as a service on-demand for its
end users
• Enterprises have development, test, pre-prod staging and production
clusters that are required to be separated from each other and
provisioned independently
• Organizations need different versions of Spark to be available to
different teams - with possibly different services available
• Enterprises do not wish to dedicate a specific set of hardware to each
different requirement above, and want to reduce overall costs
CONFIDENTIAL 3

Worker Node 1 Worker Node 2 Worker Node 3
Input File
The Traditional Hadoop Architecture
ResourcemanagerJob
Datanode
Nodemanager
Split 1 – 64MB
AppMaster - 1
Split 2 – 64MB
Split 3 – 64MB
Nodemanager Nodemanager
Datanode Datanode
Block 1 – 64MB Block 2 – 64MB Block 3 – 64MB
Container - 2 Container - 3
Master Roles
Namenode

Worker Node 1 Worker Node 2 Worker Node 3
Input File
Hadoop – in Virtual Machines
ResourceManagerJob
Datanode
Nodemanager
Split 1 – 64MB
AppMaster - 1
Split 2 – 64MB
Split 3 – 64MB
Nodemanager Nodemanager
Datanode Datanode
Block 1 – 64MB Block 2 – 64MB Block 3 – 64MB
Container - 2 Container - 3
Namenode
Master Roles

Worker Node 1 Worker Node 2 Worker Node 3
The Spark Architecture – Standalone
Driver
Job
Executor
JVM
Executor Executor
JVM JVM
Executor
JVM
Executor
JVM
Executor
JVM

Worker Node 1 Worker Node 2 Worker Node 3
Spark Standalone - Virtualized
Driver
Job
Executor
JVM
Executor Executor
JVM JVM
Executor
JVM
Executor
JVM
Executor
JVM
Virtual
Machine

NodemanagerNodemanagerNodemanager
Worker Node 1 Worker Node 2 Worker Node 3
The Spark Architecture (on YARN)
Job
Datanode
AppMaster - 1
Datanode Datanode
Block 1 – 64MB Block 2 – 64MB Block 3 – 64MB
Container - 2 Container - 3
Namenode
Driver Executor Executor
Resourcemanager

Virtualization
Host Server
VMDK
Hadoop
Node 1
Virtual
Machine
Datanode
Ext4
Nodemanager
Ext4 Ext4 Ext4
Six or More Local DAS disksper Virtual Machine
VMDK VMDK VMDK VMDK VMDK VMDK VMDK
Hadoop
Node 2
Virtual
Machine
Datanode
Ext4
Nodemanager
Ext4 Ext4 Ext4Ext4
VMDKVMDK VMDKVMDK
Ext4Ext4Ext4
Combined Model: Two Virtual Machines on a Host

Workloads - Spark
• Two standard analytic programs from the Spark MLLib (Machine Learning
Library)
• Driven using SparkBench (https://github.com/SparkTC/spark-bench)
– Support Vector Machine
– Logistic Regression
CONFIDENTIAL 13

Spark Support Vector Machine
Performance
CONFIDENTIAL 14

Spark Logistic Regression
Performance
CONFIDENTIAL 15

Results - Spark
•Support Vector Machines workload, which stayed in memory, ran
about 10% faster in virtualized form than on bare metal
•Logistic Regression workload, which was written to disk at the
larger dataset sizes, showed a slight advantage to bare metal
•part of the dataset was cached to disk,
•larger memory of the bare metal Spark executors may help
•Both workloads showed linear scaling from 5 to 10 hosts and as
dataset size increased
CONFIDENTIAL 16

1 TB RAM
on Server
Each NUMA
Node has 1024/2
512GB
482 GB RAM
for each VM
NUMA and Virtual
Machine Placement

§Spark workloads work very well on VMware
vSphere
• Various performance studies have shown that any
difference between virtualized performance and native
performance is minimal
• Follow the general best practice guidelines that VMware
has published
• Design patterns such as data-compute separation can be
used to provide elasticity of your Spark cluster.
Conclusions

Thank You.
Contact jmurray@vmware.com or
bigdata@vmware.com

Add Slides as Necessary
• Supporting points go here.

Recommandé

Cassandra on Docker @ Walmart LabsDataStax Academy

Kafka for begginerYousun Jeong

How to collect and utilize logs at Kubernetes with Elastic StackRakuten Group, Inc.

Introduction to Container Storage Interface (CSI)Idan Atias

Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Spark Summit

Big Data on Cloud Native PlatformSunil Govindan

Spark day 2017 - Spark on KubernetesYousun Jeong

Gocd – Kubernetes/Nomad Continuous DeploymentLeandro Totino Pereira

Recommandé

Cassandra on Docker @ Walmart LabsDataStax Academy

Kafka for begginerYousun Jeong

How to collect and utilize logs at Kubernetes with Elastic StackRakuten Group, Inc.

Introduction to Container Storage Interface (CSI)Idan Atias

Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Spark Summit

Big Data on Cloud Native PlatformSunil Govindan

Spark day 2017 - Spark on KubernetesYousun Jeong

Gocd – Kubernetes/Nomad Continuous DeploymentLeandro Totino Pereira

Scylla Virtual Workshop 2020ScyllaDB

12.07.2017 Docker Meetup - POSTGRE SQL ON KUBERNETESZalando adtech lab

February 2016 HUG: Running Spark Clusters in Containers with DockerYahoo Developer Network

Spark Powered by ScyllaScyllaDB

Simplifying the Move to OpenStackOpenStack

Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenDatabricks

Web後端技術的演變inwin stack

Virtualizing Apache Spark and Machine Learning with Justin MurrayDatabricks

Kafka Deployment to Steel Threadconfluent

State of the Container EcosystemVinay Rao

AKSgirish goudar

DUG'20: 10 - Storage Orchestration for Composable Storage ArchitecturesAndrey Kudryavtsev

Storage os kubernetes clusters need persistent dataLibbySchulze

Introducing Kubestr - A New Way to Explore Your Kubernetes Storage OptionsLibbySchulze

Scylla Summit 2018: Scylla 3.0 and BeyondScyllaDB

Scylla Summit 2016: Scylla at Samsung SDSScyllaDB

Managing (Schema) Migrations in CassandraDataStax Academy

Critical Attributes for a High-Performance, Low-Latency DatabaseScyllaDB

Persist your data in an ephemeral k8 ecosystemLibbySchulze

Cisco: Cassandra adoption on Cisco UCS & OpenStackDataStax Academy

Why Kubernetes as a container orchestrator is a right choice for running spar...DataWorks Summit

Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks

Contenu connexe

Tendances

Scylla Virtual Workshop 2020ScyllaDB

12.07.2017 Docker Meetup - POSTGRE SQL ON KUBERNETESZalando adtech lab

February 2016 HUG: Running Spark Clusters in Containers with DockerYahoo Developer Network

Spark Powered by ScyllaScyllaDB

Simplifying the Move to OpenStackOpenStack

Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenDatabricks

Web後端技術的演變inwin stack

Virtualizing Apache Spark and Machine Learning with Justin MurrayDatabricks

Kafka Deployment to Steel Threadconfluent

State of the Container EcosystemVinay Rao

AKSgirish goudar

DUG'20: 10 - Storage Orchestration for Composable Storage ArchitecturesAndrey Kudryavtsev

Storage os kubernetes clusters need persistent dataLibbySchulze

Introducing Kubestr - A New Way to Explore Your Kubernetes Storage OptionsLibbySchulze

Scylla Summit 2018: Scylla 3.0 and BeyondScyllaDB

Scylla Summit 2016: Scylla at Samsung SDSScyllaDB

Managing (Schema) Migrations in CassandraDataStax Academy

Critical Attributes for a High-Performance, Low-Latency DatabaseScyllaDB

Persist your data in an ephemeral k8 ecosystemLibbySchulze

Cisco: Cassandra adoption on Cisco UCS & OpenStackDataStax Academy

Tendances (20)

Scylla Virtual Workshop 2020

12.07.2017 Docker Meetup - POSTGRE SQL ON KUBERNETES

February 2016 HUG: Running Spark Clusters in Containers with Docker

Spark Powered by Scylla

Simplifying the Move to OpenStack

Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen

Web後端技術的演變

Virtualizing Apache Spark and Machine Learning with Justin Murray

Kafka Deployment to Steel Thread

State of the Container Ecosystem

AKS

DUG'20: 10 - Storage Orchestration for Composable Storage Architectures

Storage os kubernetes clusters need persistent data

Introducing Kubestr - A New Way to Explore Your Kubernetes Storage Options

Scylla Summit 2018: Scylla 3.0 and Beyond

Scylla Summit 2016: Scylla at Samsung SDS

Managing (Schema) Migrations in Cassandra

Critical Attributes for a High-Performance, Low-Latency Database

Persist your data in an ephemeral k8 ecosystem

Cisco: Cassandra adoption on Cisco UCS & OpenStack

Similaire à Virtualizing Apache Spark with Justin Murray

Why Kubernetes as a container orchestrator is a right choice for running spar...DataWorks Summit

Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks

Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformYao Yao

Big Telco - Yousun JeongSpark Summit

Big Telco Real-Time Network AnalyticsYousun Jeong

Simplify and Boost Spark 3 Deployments with Hypervisor-Native KubernetesDatabricks

Introduction to Apache Geode (Cork, Ireland)Anthony Baker

The Fast Path to Building Operational Applications with SparkSingleStore

DevOps Supercharged with Docker on ExadataMarketingArrowECS_CZ

Apache Sparkmasifqadri

EMC SRM vs. Sentinel Navigator - Deep divesansentinel

MySQL and Spark machine learning performance on Azure VMsbased on 3rd Gen AMD...Principled Technologies

IMC Summit 2016 Breakout - Yanping Wang - Non-volatile Generic Object Program...In-Memory Computing Summit

Introduce_non-volatile_generic_object_programming_model_for_In-Memory_ComputingYanpingWang

Spark introduction and architectureSohil Jain

Sparkfatemehjamalii

Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16MLconf

Hybrid Transactional/Analytics Processing with Spark and IMDGsAli Hodroj

Jump Start on Apache® Spark™ 2.x with Databricks Databricks

Similaire à Virtualizing Apache Spark with Justin Murray (20)

Why Kubernetes as a container orchestrator is a right choice for running spar...

Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3

Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform

Big Telco - Yousun Jeong

Big Telco Real-Time Network Analytics

Simplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes

Introduction to Apache Geode (Cork, Ireland)

The Fast Path to Building Operational Applications with Spark

DevOps Supercharged with Docker on Exadata

Apache Spark

EMC SRM vs. Sentinel Navigator - Deep dive

MySQL and Spark machine learning performance on Azure VMsbased on 3rd Gen AMD...

IMC Summit 2016 Breakout - Yanping Wang - Non-volatile Generic Object Program...

Introduce_non-volatile_generic_object_programming_model_for_In-Memory_Computing

Spark introduction and architecture

Spark

Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16

Hybrid Transactional/Analytics Processing with Spark and IMDGs

Jump Start on Apache® Spark™ 2.x with Databricks

Plus de Databricks

DW Migration Webinar-March 2022.pptxDatabricks

Data Lakehouse Symposium | Day 1 | Part 1Databricks

Data Lakehouse Symposium | Day 1 | Part 2Databricks

Data Lakehouse Symposium | Day 2Databricks

Data Lakehouse Symposium | Day 4Databricks

5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks

Democratizing Data Quality Through a Centralized PlatformDatabricks

Learn to Use Databricks for Data ScienceDatabricks

Why APM Is Not the Same As ML MonitoringDatabricks

The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks

Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks

Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks

Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks

Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks

Sawtooth Windows for Feature AggregationsDatabricks

Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks

Re-imagine Data Monitoring with whylogs and SparkDatabricks

Raven: End-to-end Optimization of ML Prediction QueriesDatabricks

Processing Large Datasets for ADAS Applications using Apache SparkDatabricks

Massive Data Processing in Adobe Using Delta LakeDatabricks

Plus de Databricks (20)

DW Migration Webinar-March 2022.pptx

Data Lakehouse Symposium | Day 1 | Part 1

Data Lakehouse Symposium | Day 1 | Part 2

Data Lakehouse Symposium | Day 2

Data Lakehouse Symposium | Day 4

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop

Democratizing Data Quality Through a Centralized Platform

Learn to Use Databricks for Data Science

Why APM Is Not the Same As ML Monitoring

The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix

Stage Level Scheduling Improving Big Data and AI Integration

Simplify Data Conversion from Spark to TensorFlow and PyTorch

Scaling your Data Pipelines with Apache Spark on Kubernetes

Scaling and Unifying SciKit Learn and Apache Spark Pipelines

Sawtooth Windows for Feature Aggregations

Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink

Re-imagine Data Monitoring with whylogs and Spark

Raven: End-to-end Optimization of ML Prediction Queries

Processing Large Datasets for ADAS Applications using Apache Spark

Massive Data Processing in Adobe Using Delta Lake

Dernier

Mature dropshipping via API with DroFx.pptxolyaivanovalion

BabyOno dropshipping via API with DroFx.pptxolyaivanovalion

Edukaciniai dropshipping via API with DroFxolyaivanovalion

RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh

Introduction-to-Machine-Learning (1).pptxfirstjob4

Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth

BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692

Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann

Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh9953056974 Low Rate Call Girls In Saket, Delhi NCR

Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls

VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor

(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo

Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson

April 2024 - Crypto Market Report's Analysismanisha194592

VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor

Industrialised data - the key to AI success.pdfLars Albertsson

Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor

04242024_CCC TUG_Joins and Relationshipsccctableauusergroup

Smarteg dropshipping via API with DroFx.pptxolyaivanovalion

Dernier (20)

Mature dropshipping via API with DroFx.pptx

BabyOno dropshipping via API with DroFx.pptx

Edukaciniai dropshipping via API with DroFx

RA-11058_IRR-COMPRESS Do 198 series of 1998

Introduction-to-Machine-Learning (1).pptx

Unveiling Insights: The Role of a Data Analyst

BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx

Generative AI on Enterprise Cloud with NiFi and Milvus

Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh

Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...

VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...

(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改

Schema on read is obsolete. Welcome metaprogramming..pdf

April 2024 - Crypto Market Report's Analysis

VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130

Industrialised data - the key to AI success.pdf

Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai

04242024_CCC TUG_Joins and Relationships

Smarteg dropshipping via API with DroFx.pptx

Virtualizing Apache Spark with Justin Murray

1. Justin Murray, VMware Virtualizing Spark on VMware vSphere

2. Why Virtualize Spark?

3. Use Cases : Virtualization of Big Data • IT wants to provide Spark clusters as a service on-demand for its end users • Enterprises have development, test, pre-prod staging and production clusters that are required to be separated from each other and provisioned independently • Organizations need different versions of Spark to be available to different teams - with possibly different services available • Enterprises do not wish to dedicate a specific set of hardware to each different requirement above, and want to reduce overall costs CONFIDENTIAL 3

4. Worker Node 1 Worker Node 2 Worker Node 3 Input File The Traditional Hadoop Architecture ResourcemanagerJob Datanode Nodemanager Split 1 – 64MB AppMaster - 1 Split 2 – 64MB Split 3 – 64MB Nodemanager Nodemanager Datanode Datanode Block 1 – 64MB Block 2 – 64MB Block 3 – 64MB Container - 2 Container - 3 Master Roles Namenode

5. Worker Node 1 Worker Node 2 Worker Node 3 Input File Hadoop – in Virtual Machines ResourceManagerJob Datanode Nodemanager Split 1 – 64MB AppMaster - 1 Split 2 – 64MB Split 3 – 64MB Nodemanager Nodemanager Datanode Datanode Block 1 – 64MB Block 2 – 64MB Block 3 – 64MB Container - 2 Container - 3 Namenode Master Roles

6. Worker Node 1 Worker Node 2 Worker Node 3 The Spark Architecture – Standalone Driver Job Executor JVM Executor Executor JVM JVM Executor JVM Executor JVM Executor JVM

7. Worker Node 1 Worker Node 2 Worker Node 3 Spark Standalone - Virtualized Driver Job Executor JVM Executor Executor JVM JVM Executor JVM Executor JVM Executor JVM Virtual Machine

8. NodemanagerNodemanagerNodemanager Worker Node 1 Worker Node 2 Worker Node 3 The Spark Architecture (on YARN) Job Datanode AppMaster - 1 Datanode Datanode Block 1 – 64MB Block 2 – 64MB Block 3 – 64MB Container - 2 Container - 3 Namenode Driver Executor Executor Resourcemanager

9. Reference Architectures

10. Virtualization Host Server VMDK Hadoop Node 1 Virtual Machine Datanode Ext4 Nodemanager Ext4 Ext4 Ext4 Six or More Local DAS disksper Virtual Machine VMDK VMDK VMDK VMDK VMDK VMDK VMDK Hadoop Node 2 Virtual Machine Datanode Ext4 Nodemanager Ext4 Ext4 Ext4Ext4 VMDKVMDK VMDKVMDK Ext4Ext4Ext4 Combined Model: Two Virtual Machines on a Host

11. #1 Reference Architecture from Cloudera

12. Performance

13. Workloads - Spark • Two standard analytic programs from the Spark MLLib (Machine Learning Library) • Driven using SparkBench (https://github.com/SparkTC/spark-bench) – Support Vector Machine – Logistic Regression CONFIDENTIAL 13

14. Spark Support Vector Machine Performance CONFIDENTIAL 14

15. Spark Logistic Regression Performance CONFIDENTIAL 15

16. Results - Spark •Support Vector Machines workload, which stayed in memory, ran about 10% faster in virtualized form than on bare metal •Logistic Regression workload, which was written to disk at the larger dataset sizes, showed a slight advantage to bare metal •part of the dataset was cached to disk, •larger memory of the bare metal Spark executors may help •Both workloads showed linear scaling from 5 to 10 hosts and as dataset size increased CONFIDENTIAL 16

17. 1 TB RAM on Server Each NUMA Node has 1024/2 512GB 482 GB RAM for each VM NUMA and Virtual Machine Placement

18. §Spark workloads work very well on VMware vSphere • Various performance studies have shown that any difference between virtualized performance and native performance is minimal • Follow the general best practice guidelines that VMware has published • Design patterns such as data-compute separation can be used to provide elasticity of your Spark cluster. Conclusions

19. Thank You. Contact jmurray@vmware.com or bigdata@vmware.com

20. Add Slides as Necessary • Supporting points go here.