Scaling your Data Pipelines with Apache Spark on Kubernetes

  1. Scaling Data Pipelines with Apache Spark on Kubernetes on Google Cloud. Rajesh Thallam, Machine Learning Specialist, Google; Sougata Biswas, Data Analytics Specialist, Google. May 2021.
  2. Outline: (1) Why Spark on Kubernetes? (2) Spark on Kubernetes on Google Cloud (3) Things to Know (4) Use Cases / Implementation Patterns (5) Wrap up
  3. Why Spark on Kubernetes?
  4. Why Spark on Kubernetes? Unique benefits of orchestrating Spark jobs on Kubernetes compared to other cluster managers (YARN and Mesos):
     ● Optimize costs: use existing Kubernetes infrastructure to run data engineering or ML workloads alongside other applications, without maintaining separate big data infrastructure
     ● Portability: containerizing Spark applications makes it possible to run the same application on-prem and in the cloud
     ● Isolation: packaging job dependencies in containers isolates workloads and lets teams scale independently
     ● Faster scaling: scaling containers is much faster than scaling VMs (virtual machines)
  5. Comparing Cluster Managers: Apache Hadoop YARN vs Kubernetes for Apache Spark
     Apache Hadoop YARN
     ● First cluster manager since the inception of Apache Spark
     ● Battle tested
     ● General-purpose scheduler for big data applications
     ● Runs on a cluster of VMs or physical machines (e.g. on-prem Hadoop clusters)
     ● Option to run: spark-submit to YARN
     Kubernetes (k8s)
     ● Resource manager since Spark 2.3 (experimental), GA with Spark 3.1.1
     ● Not at feature parity with YARN
     ● General-purpose scheduler for any containerized apps
     ● Runs as a container on a k8s cluster. Faster scaling in and out.
     ● Option to run: spark-submit, spark k8s operator
  6. Spark on Kubernetes on Google Cloud
  7. Cloud Dataproc: combining the best of open source and cloud, and simplifying Hadoop & Spark workloads on Cloud. Features of Dataproc:
     ● Built-in support for Hadoop & Spark: managed hardware and configuration, simplified version management, flexible job configuration
     ● Managed clusters: 90s cluster spin-up, autoscaling, autozone placement
     ● Managed jobs: Spark on GKE, Workflow Templates, Airflow Operators
     ● Secure: enterprise security, encryption, access control
     ● Cost effective: only pay for what you use
  8. Kubernetes: OS for your compute fleet
     ● Manage applications, not machines
       ○ Manages container clusters
       ○ Inspired and informed by Google’s experiences
       ○ Supports multiple cloud and bare-metal environments
       ○ Supports multiple container runtimes
     ● Features similar to an OS for a host
       ○ Scheduling workload
       ○ Finding the right host to fit your workload
       ○ Monitoring health of the workload
       ○ Scaling it up and down as needed
       ○ Moving it around as needed
  9. Google Kubernetes Engine (GKE): secured and fully managed Kubernetes service (GKE, Kubernetes-as-a-service)
     ● Turn-key solution to Kubernetes
       ○ Provision a cluster in minutes
       ○ Industry-leading automation
       ○ Scales to an industry-leading 15k worker nodes
       ○ Reliable and available
       ○ Deep GCP integration
     ● Generally available since August 2015
       ○ 99.5% or 99.95% SLA on Kubernetes APIs
       ○ $0.10 per cluster/hour + infrastructure cost
       ○ Supports GCE sole-tenant nodes and reservations
     [Diagram: kubectl, gcloud, Control Plane, Nodes]
  10. Dataproc on GKE (BETA): run Spark jobs on GKE clusters with the Dataproc Jobs API
      ● Simple way of executing Spark jobs on GKE clusters
      ● Single API to run Spark jobs on Dataproc as well as GKE
      ● Extensible with a custom Docker image for the Spark job
      ● Enterprise security controls out of the box
      ● Ease of logging and monitoring with Cloud Logging and Monitoring
      [Diagram: Create Cluster (Dataproc, GKE), Submit Job, Allocate resources, Run Spark Job]
  11. Dataproc on GKE: how does it work? Submit Spark jobs to a running GKE cluster from the Dataproc Jobs API
      ● The Dataproc agent runs as a container inside GKE and communicates with the GKE scheduler using the spark-kubernetes operator
      ● Users submit jobs through the Dataproc Jobs API, while job execution happens inside the GKE cluster
      ● The Spark driver and executors run in different Pods inside separate namespaces within the GKE cluster
      ● Driver and executor logs are sent to the Google Cloud Logging service
      [Diagram: Spark submit using the Dataproc API; Dataproc Agent on a node; Kubernetes master (API server, scheduler, ..) for job scheduling and monitoring; driver Pod (node 1); executor Pods (node 1, node 2, node n); all inside Google Kubernetes Engine (GKE)]
  12. How is Dataproc on GKE different from alternatives? Comparing against spark-submit and the Spark Operator for Kubernetes
      ● Easy to get started with the familiar Dataproc API
      ● Easy to set up and manage: no need to install the Spark Kubernetes operator or to set up monitoring and logging separately
      ● Built-in security features with the Dataproc API: access control, auditing, encryption and more
      ● Inherent benefits of managed services (Dataproc and GKE)
  13. Demo: Spark on GKE using the Dataproc Jobs API
  14. Step 1: Set up a GKE Cluster
          # set up environment variables
          GCE_REGION=us-west2               # GCP region
          GCE_ZONE=us-west2-a               # GCP zone
          GKE_CLUSTER=spark-on-gke          # GKE cluster name
          DATAPROC_CLUSTER=dataproc-gke     # Dataproc cluster name
          VERSION=1.4.27-beta               # Dataproc image version
          BUCKET=my-project-spark-on-k8s    # GCS bucket
          # create GKE cluster with autoscaling enabled
          gcloud container clusters create "${GKE_CLUSTER}" \
            --scopes=cloud-platform \
            --workload-metadata=GCE_METADATA \
            --machine-type=n1-standard-4 \
            --zone="${GCE_ZONE}" \
            --enable-autoscaling --min-nodes 1 --max-nodes 10
          # add Kubernetes Engine Admin role to service-
  15. Step 2: Create and Register Dataproc with GKE
          # create Dataproc cluster and register it with GKE under a K8s namespace
          gcloud dataproc clusters create "${DATAPROC_CLUSTER}" \
            --gke-cluster="${GKE_CLUSTER}" \
            --region="${GCE_REGION}" \
            --zone="${GCE_ZONE}" \
            --image-version="${VERSION}" \
            --bucket="${BUCKET}" \
            --gke-cluster-namespace="spark-on-gke"
  16. Step 3: Spark Job Execution
          # run a sample PySpark job using the Dataproc API
          # to read a table in BigQuery and generate word counts
          gcloud dataproc jobs submit pyspark \
            --cluster="${DATAPROC_CLUSTER}" \
            --region="${GCE_REGION}" \
            --properties="spark.dynamicAllocation.enabled=false,spark.executor.instances=5,spark.executor.cores=4" \
            --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.11.jar
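The `--properties` flag in Step 3 takes a single comma-separated string of `key=value` pairs, which is easy to mistype by hand. The following is a minimal sketch, not part of the talk: `format_properties` is a hypothetical helper (it does not exist in gcloud or Dataproc), and the cluster and region values simply mirror the environment variables from Step 1. It builds the string from a dict and prints the resulting command for inspection. The job script path is left out, as on the slide.

```python
import shlex


def format_properties(props):
    """Join Spark properties into the comma-separated key=value form used by --properties."""
    for key, value in props.items():
        # This sketch does not handle values that contain the delimiter itself.
        if "," in key or "," in str(value):
            raise ValueError(f"comma not allowed in property {key!r}")
    return ",".join(f"{key}={value}" for key, value in props.items())


# Values taken from the demo slide; the property names are standard Spark settings.
spark_props = {
    "spark.dynamicAllocation.enabled": "false",
    "spark.executor.instances": 5,
    "spark.executor.cores": 4,
}

# Assumed cluster and region names, mirroring the variables defined in Step 1.
command = [
    "gcloud", "dataproc", "jobs", "submit", "pyspark",
    "--cluster=dataproc-gke",
    "--region=us-west2",
    f"--properties={format_properties(spark_props)}",
]

# Print the command instead of running it, so it can be reviewed first.
print(shlex.join(command))
```

Keeping the properties in a dict makes it easier to review each key individually before the job is submitted.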
  17. Step 4a: Monitoring - GKE & Cloud Logging
          # Spark driver logs from Google Cloud Logging
          resource.type="k8s_container"
          resource.labels.cluster_name="spark-on-gke"
          resource.labels.namespace_name="spark-on-gke"
          resource.labels.container_name="spark-kubernetes-driver"
          # Spark executor logs from Google Cloud Logging
          resource.type="k8s_container"
          resource.labels.cluster_name="spark-on-gke"
          resource.labels.namespace_name="spark-on-gke"
          resource.labels.container_name="executor"
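The driver and executor filters in Step 4a differ only in the container name. As a small sketch under that observation, the hypothetical helper `container_log_filter` below (an illustration, not a Cloud Logging API) generates both filters from the cluster name, namespace and container name used in the demo.

```python
def container_log_filter(cluster_name, namespace, container_name):
    """Build a Cloud Logging filter for one container in a GKE namespace.

    Each clause goes on its own line; the Logging query language treats
    consecutive comparisons as an implicit AND.
    """
    clauses = {
        "resource.type": "k8s_container",
        "resource.labels.cluster_name": cluster_name,
        "resource.labels.namespace_name": namespace,
        "resource.labels.container_name": container_name,
    }
    return "\n".join(f'{field}="{value}"' for field, value in clauses.items())


# Names taken from the demo slides.
driver_filter = container_log_filter("spark-on-gke", "spark-on-gke", "spark-kubernetes-driver")
executor_filter = container_log_filter("spark-on-gke", "spark-on-gke", "executor")

print(driver_filter)
print(executor_filter)
```

The resulting text can be pasted into the Logs Explorer query box, or passed as the filter argument to `gcloud logging read`.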
  18. Step 4b: Monitoring with Spark Web UI
          # TCP port forwarding to the driver pod to view the Spark UI
          gcloud container clusters get-credentials "${GKE_CLUSTER}" \
            --zone "${GCE_ZONE}" --project "${PROJECT_ID}" && \
          kubectl port-forward --namespace "${GKE_NAMESPACE}" \
            $(kubectl get pod --namespace "${GKE_NAMESPACE}" \
              --selector="spark-role=driver,app_name" \
              --output jsonpath='{.items[0].metadata.name}') 8080:4040
  19. Things to Know: Dataproc with Apache Spark on GKE
  20. Autoscaling Spark Jobs: automatically resize node pools of the GKE cluster based on workload demands
      ● The Dataproc autoscaler is not supported with Dataproc on GKE
      ● Instead, enable autoscaling on the GKE cluster node pool
      ● Specify a minimum and maximum size for the GKE cluster’s node pool, and the rest is automatic
      ● You can combine the GKE cluster autoscaler with Horizontal/Vertical Pod Autoscaling
          # create GKE cluster with autoscaling enabled
          gcloud container clusters create "${GKE_CLUSTER}" \
            --scopes=cloud-platform \
            --workload-metadata=GCE_METADATA \
            --machine-type n1-standard-2 \
            --zone="${GCE_ZONE}" \
            --num-nodes 2 \
            --enable-autoscaling --min-nodes 1 --max-nodes 10
          # create Dataproc cluster on GKE
          gcloud dataproc clusters create "${DATAPROC_CLUSTER}" \
            --gke-cluster="${GKE_CLUSTER}" \
            --region="${GCE_REGION}" \
            --zone="${GCE_ZONE}" \
            --image-version="${VERSION}" \
            --bucket="${BUCKET}"
  21. Shuffle in Spark on Kubernetes: writes shuffle data to scratch space, a local volume or Persistent Volume Claims
      ● Shuffle is the data exchange between different stages in a Spark job.
      ● Shuffle is expensive, and its performance depends on disk IOPS and network throughput between the nodes.
      ● Spark supports writing shuffle data to Persistent Volume Claims, local volumes or scratch space.
      ● Local SSDs are more performant than Persistent Disks, but they are transient. Disk IOPS and throughput improve as disk size increases.
      ● An external shuffle service is not available today.
          # create GKE cluster or a node pool with local SSD
          gcloud container clusters create "${GKE_CLUSTER}" ... --local-ssd-count ${NUMBER_OF_DISKS}
          # config YAML to use local SSD as scratch space
          spec:
            volumes:
              - name: "spark-local-dir-1"
                hostPath:
                  path: "/tmp/spark-local-dir"
            executor:
              volumeMounts:
                - name: "spark-local-dir-1"
                  mountPath: "/tmp/spark-local-dir"
          # spark job conf to override scratch space
          spark.local.dir=/tmp/spark-local-dir/
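The YAML on the shuffle slide follows the Spark operator's application spec. When submitting with plain Spark properties instead, open-source Spark on Kubernetes exposes the same idea through `spark.kubernetes.executor.volumes.[VolumeType].[VolumeName].*` settings. The sketch below assumes that route; `local_dir_volume_conf` is a hypothetical helper, and whether Dataproc on GKE honors these properties is not confirmed by the talk.

```python
def local_dir_volume_conf(volume_name, host_path, mount_path):
    """Spark properties that mount a hostPath volume on executors and use it as scratch space."""
    prefix = f"spark.kubernetes.executor.volumes.hostPath.{volume_name}"
    return {
        # Directory on the node (for example, where a local SSD is mounted).
        f"{prefix}.options.path": host_path,
        # Where that directory appears inside the executor container.
        f"{prefix}.mount.path": mount_path,
        # Point Spark's scratch space (shuffle spill, temp files) at the mount.
        "spark.local.dir": mount_path,
    }


# Volume name and paths taken from the slide.
conf = local_dir_volume_conf("spark-local-dir-1", "/tmp/spark-local-dir", "/tmp/spark-local-dir")
for key, value in conf.items():
    print(f"{key}={value}")
```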
  22. Dynamic Resource Allocation*: dynamically adjust the resources a Spark application occupies based on the workload
      ● When enabled, Spark dynamically adjusts resources based on workload demand
      ● An external shuffle service is not available in Spark on Kubernetes (work in progress)
      ● Instead, soft dynamic resource allocation is available in Spark 3.0, where the driver tracks the shuffle files and evicts only executors that are not storing active shuffle files
      ● Dynamic allocation is a cost optimization technique with a cost vs latency trade-off
      ● To improve latency, consider over-provisioning the GKE cluster: fine-tune Horizontal Pod Autoscaling or configure pause Pods
          # spark job conf to enable dynamic allocation
          spark.dynamicAllocation.enabled=true
          spark.dynamicAllocation.shuffleTracking.enabled=true
      * Dataproc on GKE supports only Spark 2.4 at the time of this talk; support for Spark 3.x is coming soon. Spark 3.x is supported on Dataproc on GCE.
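The two settings on the dynamic allocation slide are usually paired with bounds on the executor count, which is where the cost vs latency trade-off is tuned. A minimal sketch, assuming Spark 3.x and using the hypothetical helper `dynamic_allocation_conf` with illustrative bounds that do not come from the talk:

```python
def dynamic_allocation_conf(min_executors, max_executors, idle_timeout_s=60):
    """Spark properties for dynamic allocation with shuffle tracking (no external shuffle service)."""
    if not 0 <= min_executors <= max_executors:
        raise ValueError("expected 0 <= min_executors <= max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        # The driver tracks shuffle files so only executors without active shuffle data are removed.
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        # A higher minimum lowers latency; a lower minimum lowers cost.
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # How long an executor may sit idle before it is released.
        "spark.dynamicAllocation.executorIdleTimeout": f"{idle_timeout_s}s",
    }


# Illustrative bounds only.
conf = dynamic_allocation_conf(min_executors=1, max_executors=10)
print(",".join(f"{key}={value}" for key, value in conf.items()))
```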
  23. Running Spark Jobs on Preemptible VMs (PVMs) on GKE: reduce the cost of running Spark jobs without sacrificing predictability
      ● PVMs are excess Compute Engine capacity; they last for a maximum of 24 hours and come with no availability guarantees
      ● Best suited for running batch or fault-tolerant jobs
      ● Much cheaper than standard VMs, so running Spark on GKE with PVMs reduces the cost of deployment. But:
        ○ PVMs can shut down unexpectedly, and rescheduling Pods to a new node may add latency
        ○ Active shuffle files on Spark executors that were shut down have to be recomputed, adding latency
          # create GKE cluster with preemptible VMs
          gcloud container clusters create "${GKE_CLUSTER}" --preemptible
          # or create GKE node pool with preemptible VMs
          gcloud container node-pools create "${GKE_NODE_POOL}" --preemptible --cluster "${GKE_CLUSTER}"
          # submit Dataproc job to node pool with preemptible VMs
          gcloud dataproc jobs submit pyspark --cluster="${DATAPROC_CLUSTER}" --region="${GCE_REGION}" -- com/gke-nodepool=${GKE_NODE_POOL}"
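The last command on the preemptible VM slide is truncated in this transcript, so the exact flag is unknown. One way open-source Spark pins Pods to a node pool is the `spark.kubernetes.node.selector.[labelKey]` property combined with the GKE node label `cloud.google.com/gke-nodepool`. The sketch below assumes that approach; it is not necessarily what the slide showed, and `node_pool_selector_conf` and the pool name are hypothetical.

```python
# GKE sets this label on every node to the name of its node pool.
GKE_NODE_POOL_LABEL = "cloud.google.com/gke-nodepool"


def node_pool_selector_conf(node_pool):
    """Spark property that schedules driver and executor Pods only on the given GKE node pool."""
    if not node_pool:
        raise ValueError("node_pool must be a non-empty name")
    return {f"spark.kubernetes.node.selector.{GKE_NODE_POOL_LABEL}": node_pool}


# Hypothetical node pool created with --preemptible.
conf = node_pool_selector_conf("preemptible-pool")
for key, value in conf.items():
    print(f"{key}={value}")
```

Because this selector applies to the driver as well as the executors, a preempted node can take the whole job down, which is one reason the slide recommends PVMs for batch or fault-tolerant jobs.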
  24. Create Dataproc Cluster on GKE with Custom Image: bring your own image or extend the default Dataproc image
      ● When a Dataproc cluster is created on GKE, the default Dataproc Docker image is used, based on the image version specified
      ● You can bring your own image or extend the default image as the container image to use for the Spark application
      ● Create the Dataproc cluster with a custom image when you need to include your own packages or applications
          # submit Dataproc job with custom container image
          gcloud dataproc jobs submit pyspark \
            --cluster="${DATAPROC_CLUSTER}" \
            --region="${GCE_REGION}" \
            --properties=spark.kubernetes.container.image="${PROJECT_ID}/my-spark-image"
  25. Integrating with Google Cloud Storage (GCS) and BigQuery (BQ): use the Spark BigQuery connector and the Google Cloud Storage connector for better performance
      ● The Cloud Storage connector is built into the Dataproc default image
      ● Add the Spark BigQuery connector as a dependency; it uses the BQ Storage API to stream data directly from BQ via gRPC without using GCS as an intermediary
          # submit Dataproc job to use BigQuery as source/sink
          gcloud dataproc jobs submit pyspark \
            --cluster="${DATAPROC_CLUSTER}" \
            --region="${GCE_REGION}" \
            --properties="spark.dynamicAllocation.enabled=false,spark.executor.instances=5,spark.executor.cores=4" \
            --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.11.jar
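The slides submit a PySpark job that reads a BigQuery table and produces word counts, but the job script itself is not shown. The following is a sketch of what such a script could look like, not the authors' code. It assumes the Spark BigQuery connector jar is on the classpath (as with `--jars` above) and uses the public sample table `bigquery-public-data:samples.shakespeare`, which has `word` and `word_count` columns; the function names are hypothetical. PySpark is imported inside the functions so the module can be loaded on a machine without Spark installed.

```python
# Assumed input: a public BigQuery sample table with `word` and `word_count` columns.
DEFAULT_TABLE = "bigquery-public-data:samples.shakespeare"


def bigquery_read_options(table=DEFAULT_TABLE):
    """Options passed to the Spark BigQuery connector's DataFrame reader."""
    return {"table": table}


def word_counts(spark, table=DEFAULT_TABLE):
    """Return a DataFrame of total occurrences per word, most frequent first."""
    from pyspark.sql import functions as F  # lazy import: only needed on the cluster

    words = spark.read.format("bigquery").options(**bigquery_read_options(table)).load()
    return (
        words.groupBy("word")
        .agg(F.sum("word_count").alias("total"))
        .orderBy(F.desc("total"))
    )


def main():
    from pyspark.sql import SparkSession  # lazy import: only needed on the cluster

    spark = SparkSession.builder.appName("bigquery-word-count").getOrCreate()
    word_counts(spark).show(20)
    spark.stop()
```

When saved as a job script, add a call to `main()` at the end of the file before submitting it with `gcloud dataproc jobs submit pyspark`.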
  26. Dataproc with Apache Spark on GKE - Things to Know at a Glance
      ● Autoscaling: automatically resize GKE cluster node pools based on workload demand
      ● Shuffle: writes to scratch space, a local volume or Persistent Volume Claims
      ● Dynamic Allocation: dynamically adjust the job resources based on the workload
      ● Preemptible VMs: reduce the cost of running Spark jobs without sacrificing predictability
      ● Custom Image: bring your own image or extend the default Dataproc image
      ● Integration with Google Cloud Services: built-in Cloud Storage connector; add the Spark BigQuery connector
  27. Use Cases / Architectural Patterns: Dataproc with Apache Spark on GKE
  28. Unified Infrastructure
      ● Unify all processing, whether a data processing pipeline, a machine learning pipeline, a web application or anything else
      ● By migrating Spark jobs to a single cluster manager, you can focus on modern cloud management in Kubernetes
      ● Leads to more efficient use of resources and provides a unified logging and management framework
      [Diagram: a Google Kubernetes Engine (GKE) cluster hosting Dataproc clusters on GKE (Apache Spark 2.4, Apache Spark 3.x), Airflow, Kubeflow and other workloads]
      Dataproc on GKE supports only Spark 2.4 at the time of this talk; support for Spark 3.x is coming soon. Spark 3.x is supported on Dataproc on GCE.
  29. What is Cloud Composer? Managed Apache Airflow service to create, schedule, monitor and manage workflows
      ● Author end-to-end workflows on GCP via triggers and integrations
      ● Enterprise security for your workflows through Google-managed credentials
      ● No need to think about managing the infrastructure after the initial configuration, which is done with a click
      ● Makes troubleshooting simple with observability through Cloud Logging and Monitoring
      [Diagram: workflow orchestration with Cloud Composer. GCP integrations: BigQuery, Cloud Dataproc, Cloud Dataflow, Cloud Pub/Sub, Cloud AI Platform, Cloud Storage, Cloud Datastore. Public cloud integrations: Azure Blob Storage, AWS EMR, AWS S3, AWS EC2, AWS Redshift, Databricks SubmitRunOperator. On-prem integration.]
  30. Orchestrating Apache Spark Jobs from Cloud Composer
      ● Trigger a DAG from Composer to submit a job to a Dataproc cluster running on GKE
      ● Save time by not creating and tearing down ephemeral Dataproc clusters
      ● One cluster manager to orchestrate and process jobs, with better utilization of resources
      ● Optimize costs, with better visibility and reliability
      [Diagram: Cloud Composer and Dataproc on GKE (data processing) on Google Kubernetes Engine (GKE); sources/targets: Cloud Storage, BigQuery, any other data sources or targets]
  31. Machine Learning Lifecycle
      ● Data (DATA ENGINEER): ingestion, cleaning, storage
      ● Exploration & Model Prototyping (DATA SCIENTIST): explore data; test features + algorithms; build model prototypes; prototype on SMALL or SAMPLED dataset
      ● Production Training & Evaluation (DATA SCIENTIST / ML ENGINEER): apply ML model code on large datasets; test performance and validate; train on LARGE or FULL dataset
      ● Model Scoring & Inference (DATA / ML ENGINEER): operationalize data processing; deploy models to production
      [Diagram labels: ML Model Code, ML Model, Model Accuracy Information]
  32. MLflow: open source platform to manage the ML lifecycle. Components of MLflow:
      ● Tracking: record and query experiments (code, data, config and results)
      ● Projects: package data science code to reproduce runs on any platform
      ● Models: deploy machine learning models in diverse serving environments
      ● Registry: store, annotate, discover, and manage models in a central repository
  33. Unifying Machine Learning & Data Pipeline Deployments
      [Diagram: API Connectors & Data Imports; Cloud Storage (Data Source); Cloud Scheduler (Trigger); Security & Integrations (Key Management Service, Secret Manager, Cloud IAM); AI Platform (Data Science / ML); Target (Bucket, Cloud Bigtable, BigQuery); BigQuery (Data Source); Artifacts Storage (Cloud Storage); Dataproc on GKE (Data Processing); Cloud Composer; Google Kubernetes Engine (GKE); ML Tracking; Kubeflow (Data Science / ML); Notebooks; Training; Experimentation]
  34. Wrapping up: Dataproc with Apache Spark on GKE
  35. Apache Spark on Kubernetes
      Why Spark on Kubernetes?
      ● Do you have apps running on Kubernetes clusters? Are those clusters underutilized?
      ● Is it painful to manage multiple cluster managers (YARN, Kubernetes)?
      ● Do you have difficulty managing Spark job dependencies and different Spark versions?
      ● Do you want the same benefits as apps running on Kubernetes: multitenancy, autoscaling, fine-grained access control?
      Why Dataproc on GKE?
      ● Faster scaling with reliability
      ● Inherent benefits of managed infrastructure
      ● Enterprise security control
      ● Unified logging and monitoring
      ● Optimized costs due to effective resource sharing
  36. Resources
      Open Source Documentation
      ● Running Spark on Kubernetes - Spark Documentation
      ● Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes
      ● Code example used in the demo
      Blog Posts & Solution
      ● Make the most out of your Data Lake with Google Cloud
      ● Cloud Dataproc Spark Jobs on GKE: How to get started
      Google Cloud Documentation
      ● Google Cloud Dataproc
      ● Google Kubernetes Engine (GKE)
      ● Google Cloud Composer
      ● Dataproc on Google Kubernetes Engine
  37. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.