Standalone Spark Deployment
For Stability and Performance
Totango
❖ Leading Customer Success Platform
❖ Helps companies retain and grow their customer base
❖ Advanced actionable analytics for subscription and recurring revenue
❖ Founded in 2010
❖ Infrastructure on AWS cloud
❖ Spark for batch processing
❖ ElasticSearch for serving layer
About Me
Romi Kuntsman
Senior Big Data Engineer @ Totango
Working with Apache Spark since v1.0
Working with AWS Cloud since 2008
Spark on AWS - first attempts
❖ We tried Amazon EMR (Elastic MapReduce) to install Spark on YARN
➢ Performance hit per application (starts a Spark instance for each)
➢ Performance hit per server (running services we don't use, like HDFS)
➢ Slow and unstable cluster resizing (often got stuck, forcing us to recreate the cluster)
❖ We tried the spark-ec2 script to install Spark Standalone on AWS EC2 machines
➢ Serial (not parallel) initialization of multiple servers - slow!
➢ Scripts unmaintained since Spark became available on EMR (see above)
➢ Doesn't integrate with our existing systems
Spark on AWS - road to success
❖ We decided to write our own scripts to integrate and control everything
❖ Understood all Spark components and configuration settings
❖ Deployment based on Chef, as on all our servers
❖ Integrated monitoring and logging, as in all our systems
❖ Full server utilization - running exactly what we need and nothing more
❖ Cluster hanging or crashing no longer happens
❖ Seamless cluster resize without hurting any existing jobs
❖ Able to upgrade to any version of Spark (not dependent on a third party)
What we'll discuss
❖ Separation of Spark Components
❖ Centralized Managed Logging
❖ Monitoring Cluster Utilization
❖ Auto Scaling Groups
❖ Termination Protection
❖ Upstart Mechanism
❖ NewRelic Integration
❖ Chef-based Instantiation
Data w/ Romi
Ops w/ Alon
Separation of Components
❖ Spark Master Server (single)
➢ Master Process - accepts requests to start applications
➢ History Process - serves history data of completed applications
❖ Spark Slave Server (multiple)
➢ Worker Process - handles workload of applications on server
➢ External Shuffle Service - handles data exchange between workers
➢ Executor Process (one per core - for running apps) - runs actual code
Configuration - Deploy Spread Out
❖ spark.deploy.spreadOut (SPARK_MASTER_OPTS)
➢ true = spread each application's cores across all workers
➢ false = fill up all cores on one worker before using the next
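As an illustration, the setting can be passed to the master JVM through SPARK_MASTER_OPTS in conf/spark-env.sh. The value false is used here only as an example; the slides do not say which value runs in production.

```shell
# conf/spark-env.sh on the master server (illustrative sketch)
# spreadOut=false consolidates each application onto as few workers as possible;
# Spark's default is true.
export SPARK_MASTER_OPTS="-Dspark.deploy.spreadOut=false"
```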
Configuration - Cleanup
❖ spark.worker.cleanup.* (SPARK_WORKER_OPTS)
➢ .enabled = true (turn on mechanism to clean up app folders)
➢ .interval = 1800 (run every 1800 seconds, or 30 minutes)
➢ .appDataTtl = 1800 (remove finished applications after 30 minutes)
❖ We have 100s of applications per day, each with its jars and logs
❖ Rapid cleanup is essential to avoid filling up disk space
❖ We collect the logs before cleanup - details in following slides ;-)
❖ Only cleans up files of completed applications
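The three cleanup settings above could be wired into conf/spark-env.sh on each worker roughly as follows. This is a sketch using the values from the slide; the file layout is an assumption.

```shell
# conf/spark-env.sh on each worker server (illustrative sketch)
# interval and appDataTtl are both expressed in seconds.
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS -Dspark.worker.cleanup.interval=1800"
SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS -Dspark.worker.cleanup.appDataTtl=1800"
export SPARK_WORKER_OPTS
```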
External Shuffle Service
❖ Preserves shuffle files written by executors
❖ Serves shuffle files to other executors that want to fetch them
❖ If (when) one executor crashes (OOM etc), others may still access its shuffle files
❖ We run the shuffle service itself in a separate process from the executor
❖ To enable: spark.shuffle.service.enabled=true
❖ Config: spark.shuffle.io.* (see documentation)
Logging - components
❖ Master Log (/logs/spark-runner-org.apache.spark.deploy.master.Master-*)
➢ Application registration, worker coordination
❖ History Log (/logs/spark-runner-org.apache.spark.deploy.history.HistoryServer-*)
➢ Access to history, errors reading (e.g. I/O from S3, not found)
❖ Worker Log (/logs/spark-runner-org.apache.spark.deploy.worker.Worker-*)
➢ Executor management (launch, kill, ACLs)
❖ Shuffle Log (/logs/org.apache.spark.deploy.ExternalShuffleService-*)
➢ External Executor Registrations
Logging - applications
❖Application Logs (/mnt/spark-work/app-12345/execid/stderr)
➢ All output from executor process, including your own code
❖Using LogStash to gather logs from all applications together
input {
  file {
    path => "/mnt/spark-work/app-*/*/std*"
    start_position => beginning
  }
}
filter {
  grok {
    match => [ "path", "/mnt/spark-work/%{NOTSPACE:application}/.+/%{NOTSPACE:logtype}" ]
  }
}
output {
  file {
    path => "/logs/applications.log"
    message_format => "%{application} %{logtype} %{message}"
  }
}
Monitoring Cluster Utilization
❖ Spark reports metrics (Codahale) to Graphite
➢ Master metrics - running applications and their status
➢ Worker metrics - used cores, free cores
➢ JVM metrics - memory allocation, GC
❖ We use Anodot to view and track metric trends and anomalies
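A minimal sketch of how Graphite reporting might be switched on, by writing a metrics.properties file into Spark's conf directory. The sink and source class names are Spark's own; the Graphite host, port, reporting period and the "spark" prefix are placeholder assumptions, not values from the talk.

```shell
#!/bin/sh
# Sketch: write a Graphite sink configuration for Spark's metrics system.
# GRAPHITE_HOST, GRAPHITE_PORT and the "spark" prefix are placeholder assumptions.
GRAPHITE_HOST="${GRAPHITE_HOST:-graphite.example.com}"
GRAPHITE_PORT="${GRAPHITE_PORT:-2003}"
# Falls back to a temporary directory so the sketch never touches a real install.
SPARK_CONF_DIR="${SPARK_CONF_DIR:-$(mktemp -d)}"

cat > "$SPARK_CONF_DIR/metrics.properties" <<EOF
# Send metrics from every instance (master, worker, driver, executor) to Graphite
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=$GRAPHITE_HOST
*.sink.graphite.port=$GRAPHITE_PORT
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=spark
# Also report JVM metrics (memory, GC) from the master and the workers
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
EOF
```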
And now, to the Ops side...
Alon Torres
DevOps Engineer @ Totango
Auto Scaling Group Components
❖ Auto Scaling Group
➢ Scale your group up or down flexibly
➢ Supports health checks and load balancing
❖ Launch Configuration
➢ Template used by the ASG to launch instances
➢ User Data script for post-launch configuration
❖ User Data
➢ Install prerequisites and fetch instance info
➢ Install and start Chef client
➢ Sanity checks throughout
[Diagram: Launch Configuration, User Data, Auto Scaling Group, EC2 Instances]
Auto Scaling Group resizing in AWS
❖ Scheduled
➢ Set the desired size according to a specified schedule
➢ Good for scenarios with predictable, cyclic workloads
❖ Alert-Based
➢ Set specific alerts that trigger a cluster action
➢ Alerts can monitor instance health properties (resource usage)
❖ Remote-triggered
➢ Using the AWS API/CLI, resize the cluster however you want
Resizing the ASG with Jenkins
❖ We use schedule-based Jenkins jobs that utilize the AWS CLI
➢ Each job sets the desired Spark cluster size
➢ Makes it easy for our Data team to make changes to the schedule
➢ Desired size can be manually overridden if needed
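A Jenkins job of this kind might boil down to a single AWS CLI call. The sketch below wraps it in a function with a dry-run switch; the group name spark-workers, the ASG_NAME and DRY_RUN variables and the function name are hypothetical, while set-desired-capacity and its two options are standard AWS CLI.

```shell
#!/bin/sh
# Sketch of a scheduled resize step. ASG_NAME and DRY_RUN are hypothetical knobs.
ASG_NAME="${ASG_NAME:-spark-workers}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = actually call AWS

resize_cluster() {
  size="$1"
  # Reject anything that is not a non-negative integer
  case "$size" in
    ''|*[!0-9]*) echo "size must be a non-negative integer" >&2; return 1 ;;
  esac
  set -- aws autoscaling set-desired-capacity \
    --auto-scaling-group-name "$ASG_NAME" \
    --desired-capacity "$size"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

A nightly job could then call `resize_cluster 20` before the batch window and `resize_cluster 2` after it; the numbers are made up for illustration.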
Termination Protection
❖ When scaling down, ASG treats all nodes as equal termination candidates
❖ We want to avoid killing instances with currently running jobs
❖ To achieve this, we used a built-in feature of ASG - termination protection
❖ Any instance in the ASG can be set as protected, thus preventing termination when scaling down the cluster.
if [ "$(ps -ef | grep executor | grep spark | wc -l)" -ne 0 ]; then
    aws autoscaling set-instance-protection --protected-from-scale-in …
fi
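The slide elides the remaining arguments, so the following is only a sketch of what a complete cron script could look like. It also removes protection once no executors are left, which the slide does not show. Assumptions: the group name spark-workers, the function names, executors being identified by the CoarseGrainedExecutorBackend class, and the instance metadata endpoint being reachable without a session token.

```shell
#!/bin/sh
# Sketch of a cron-driven protection toggle. ASG_NAME is a hypothetical group name.
ASG_NAME="${ASG_NAME:-spark-workers}"

# Count running Spark executor JVMs on this host.
count_executors() {
  pgrep -f org.apache.spark.executor.CoarseGrainedExecutorBackend | wc -l
}

# Map an executor count to the matching AWS CLI flag.
protection_flag() {
  if [ "$1" -gt 0 ]; then
    echo "--protected-from-scale-in"
  else
    echo "--no-protected-from-scale-in"
  fi
}

# Apply the flag to this instance; intended to be called from cron.
update_protection() {
  instance_id="$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"
  aws autoscaling set-instance-protection \
    --instance-ids "$instance_id" \
    --auto-scaling-group-name "$ASG_NAME" \
    "$(protection_flag "$(count_executors)")"
}
```

Nothing runs until `update_protection` is called, so the file can be sourced and its helpers checked without touching AWS.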
Upstart Jobs for Spark
❖ Every Spark component has an Upstart job that does the following
➢ Set Spark niceness (process priority in CPU resource distribution)
➢ Start the required Spark component and ensure it stays running
■ The default Spark daemon script runs in the background
■ For Upstart, we modified the script to run in the foreground
❖ Default (background): nohup nice -n "$SPARK_NICENESS"…&
❖ Upstart (foreground): nice -n "$SPARK_NICENESS" ...
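To make the foreground variant concrete, here is a sketch of a helper that assembles such a command line. The paths, the master URL and the function name are hypothetical; bin/spark-class and the Worker class name are Spark's own. The modified daemon script itself is not shown in the talk.

```shell
#!/bin/sh
# Hypothetical defaults; real values would come from the Chef-managed environment.
SPARK_HOME="${SPARK_HOME:-/opt/spark}"
SPARK_NICENESS="${SPARK_NICENESS:-0}"
MASTER_URL="${MASTER_URL:-spark://spark-master:7077}"

# Print the foreground command line for a Spark class and its arguments.
# No nohup and no trailing '&', so the process stays attached to its parent.
spark_foreground_cmd() {
  echo "nice -n $SPARK_NICENESS $SPARK_HOME/bin/spark-class $*"
}

# Inside an Upstart job's script stanza one could then run, for example:
#   exec $(spark_foreground_cmd org.apache.spark.deploy.worker.Worker "$MASTER_URL")
```

Using `exec` replaces the shell with the command instead of forking it into the background, so the supervisor can track and restart it.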
NewRelic Monitoring
❖ Cloud-based Application and Server monitoring
❖ Supports multiple alert policies for different needs
➢ Who to alert, and what triggers the alerts
❖ Newly created instances are automatically assigned the default alert policy
Policy Assignment using AWS Lambda
❖ Spark instances have their own policy in NewRelic
❖ Each instance has to ask NewRelic to be reassigned to the new policy
➢ Parallel reassignment requests may collide and override each other
❖ Solution - during provisioning and shutdown, each instance does the following:
➢ Puts a record in an AWS Kinesis stream containing its hostname and its desired NewRelic policy ID
➢ The record triggers an AWS Lambda script that uses the NewRelic API to reassign the given hostname to the given policy ID
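The instance side of this flow could look roughly like the sketch below. The stream name, the JSON field names and the function names are hypothetical; `aws kinesis put-record` with `--stream-name`, `--partition-key` and `--data` is standard AWS CLI. The Lambda function and the NewRelic API call are not shown, since the talk gives no details about them.

```shell
#!/bin/sh
# Sketch: publish a policy-reassignment request. STREAM_NAME is hypothetical.
STREAM_NAME="${STREAM_NAME:-newrelic-policy-requests}"

# Build the JSON payload; the field names are illustrative assumptions.
policy_record() {
  printf '{"hostname":"%s","policy_id":"%s"}' "$1" "$2"
}

# Put the record on the stream, keyed by hostname.
# With AWS CLI v2, add --cli-binary-format raw-in-base64-out so that
# --data is read as raw text rather than base64.
request_policy() {
  aws kinesis put-record \
    --stream-name "$STREAM_NAME" \
    --partition-key "$1" \
    --data "$(policy_record "$1" "$2")"
}
```

A provisioning script would call something like `request_policy "$(hostname)" 12345`, where the policy ID is a made-up value.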
Chef
❖ Configuration management tool, can provision and configure instances
➢ Describe an instance state as code, let Chef handle the rest
➢ Typically works in server/client mode - client updates every 30m
➢ Besides provisioning, also prevents configuration drift
❖ Vast number of plugins and cookbooks - the sky's the limit!
❖ Configures all the instances in our DC
Spark Instance Provisioning
❖ Set up Spark
➢ Set up prerequisites - users, directories, symlinks and jars
➢ Download and extract Spark package from S3
❖ Configure termination protection cron script
❖ Configure Upstart conf files
❖ Place Spark config files
❖ Assign NewRelic policy
❖ Add shutdown scripts
➢ Delete instance from Chef database
➢ Remove from NewRelic monitoring policy
Questions?
❖ Alon Torres, DevOps
https://il.linkedin.com/in/alontorres
❖ Romi Kuntsman, Senior Big Data Engineer
https://il.linkedin.com/in/romik
❖ Stay in touch!
Totango Engineering Technical Blog
http://labs.totango.com/

Contenu connexe

Tendances

Spark Summit 2014: Spark Job Server Talk
Spark Summit 2014:  Spark Job Server TalkSpark Summit 2014:  Spark Job Server Talk
Spark Summit 2014: Spark Job Server Talk
Evan Chan
 
Continuous Processing in Structured Streaming with Jose Torres
 Continuous Processing in Structured Streaming with Jose Torres Continuous Processing in Structured Streaming with Jose Torres
Continuous Processing in Structured Streaming with Jose Torres
Databricks
 

Tendances (20)

spark-kafka_mod
spark-kafka_modspark-kafka_mod
spark-kafka_mod
 
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
 
Building a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and SparkBuilding a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and Spark
 
Structured Streaming Use-Cases at Apple
Structured Streaming Use-Cases at AppleStructured Streaming Use-Cases at Apple
Structured Streaming Use-Cases at Apple
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
 
Zeppelin and spark sql demystified
Zeppelin and spark sql demystifiedZeppelin and spark sql demystified
Zeppelin and spark sql demystified
 
Spark Summit 2014: Spark Job Server Talk
Spark Summit 2014:  Spark Job Server TalkSpark Summit 2014:  Spark Job Server Talk
Spark Summit 2014: Spark Job Server Talk
 
Emr spark tuning demystified
Emr spark tuning demystifiedEmr spark tuning demystified
Emr spark tuning demystified
 
Spark Working Environment in Windows OS
Spark Working Environment in Windows OSSpark Working Environment in Windows OS
Spark Working Environment in Windows OS
 
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
 
Operational Tips for Deploying Spark by Miklos Christine
Operational Tips for Deploying Spark by Miklos ChristineOperational Tips for Deploying Spark by Miklos Christine
Operational Tips for Deploying Spark by Miklos Christine
 
Continuous Processing in Structured Streaming with Jose Torres
 Continuous Processing in Structured Streaming with Jose Torres Continuous Processing in Structured Streaming with Jose Torres
Continuous Processing in Structured Streaming with Jose Torres
 
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
 
Hoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkHoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on Spark
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
 
Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.js
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark Streaming
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
 
Bullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineBullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query Engine
 

En vedette

En vedette (13)

Multi dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframesMulti dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframes
 
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
 
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
 
DockerCon EU 2015: Deploying and Managing Containers for Developers
DockerCon EU 2015: Deploying and Managing Containers for DevelopersDockerCon EU 2015: Deploying and Managing Containers for Developers
DockerCon EU 2015: Deploying and Managing Containers for Developers
 
DockerCon EU 2015: Official Repos and Project Nautilus
DockerCon EU 2015: Official Repos and Project NautilusDockerCon EU 2015: Official Repos and Project Nautilus
DockerCon EU 2015: Official Repos and Project Nautilus
 
DockerCon EU 2015: The Latest in Docker Engine
DockerCon EU 2015: The Latest in Docker EngineDockerCon EU 2015: The Latest in Docker Engine
DockerCon EU 2015: The Latest in Docker Engine
 
DockerCon EU 2015: Docker Universal Control Plane (Gordon's Special Session)
DockerCon EU 2015: Docker Universal Control Plane (Gordon's Special Session)DockerCon EU 2015: Docker Universal Control Plane (Gordon's Special Session)
DockerCon EU 2015: Docker Universal Control Plane (Gordon's Special Session)
 
Docker Orchestration at Production Scale
Docker Orchestration at Production Scale Docker Orchestration at Production Scale
Docker Orchestration at Production Scale
 
Pivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RayPivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew Ray
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
DockerCon EU 2015: What's New with Docker Trusted Registry
DockerCon EU 2015: What's New with Docker Trusted RegistryDockerCon EU 2015: What's New with Docker Trusted Registry
DockerCon EU 2015: What's New with Docker Trusted Registry
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is Failing
 

Similaire à Standalone Spark Deployment for Stability and Performance

Infrastructure Considerations : Design : "webops"
Infrastructure Considerations : Design : "webops"Infrastructure Considerations : Design : "webops"
Infrastructure Considerations : Design : "webops"
Piyush Kumar
 
Automating Software Development Life Cycle - A DevOps Approach
Automating Software Development Life Cycle - A DevOps ApproachAutomating Software Development Life Cycle - A DevOps Approach
Automating Software Development Life Cycle - A DevOps Approach
Akshaya Mahapatra
 
Openstack devops challenges
Openstack devops challenges Openstack devops challenges
Openstack devops challenges
openstackindia
 
SOUG_Deployment__Automation_DB
SOUG_Deployment__Automation_DBSOUG_Deployment__Automation_DB
SOUG_Deployment__Automation_DB
UniFabric
 

Similaire à Standalone Spark Deployment for Stability and Performance (20)

Standalone Spark Deployment for Stability and Performance
Standalone Spark Deployment for Stability and PerformanceStandalone Spark Deployment for Stability and Performance
Standalone Spark Deployment for Stability and Performance
 
Spark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingSpark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computing
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
 
Women Who Code Connect 2018 Conference
Women Who Code Connect 2018 ConferenceWomen Who Code Connect 2018 Conference
Women Who Code Connect 2018 Conference
 
Infrastructure Considerations : Design : "webops"
Infrastructure Considerations : Design : "webops"Infrastructure Considerations : Design : "webops"
Infrastructure Considerations : Design : "webops"
 
How To Setup Highly Available Web Servers with Keepalived & Floating IPs on U...
How To Setup Highly Available Web Servers with Keepalived & Floating IPs on U...How To Setup Highly Available Web Servers with Keepalived & Floating IPs on U...
How To Setup Highly Available Web Servers with Keepalived & Floating IPs on U...
 
Automating Software Development Life Cycle - A DevOps Approach
Automating Software Development Life Cycle - A DevOps ApproachAutomating Software Development Life Cycle - A DevOps Approach
Automating Software Development Life Cycle - A DevOps Approach
 
Hosting Ruby Web Apps
Hosting Ruby Web AppsHosting Ruby Web Apps
Hosting Ruby Web Apps
 
AWS migration: getting to Data Center heaven with AWS and Chef
AWS migration: getting to Data Center heaven with AWS and ChefAWS migration: getting to Data Center heaven with AWS and Chef
AWS migration: getting to Data Center heaven with AWS and Chef
 
Introduction openstack-meetup-nov-28
Introduction openstack-meetup-nov-28Introduction openstack-meetup-nov-28
Introduction openstack-meetup-nov-28
 
Namos openstack-manager
Namos openstack-managerNamos openstack-manager
Namos openstack-manager
 
Writing your First Ansible Playbook
Writing your First Ansible PlaybookWriting your First Ansible Playbook
Writing your First Ansible Playbook
 
Lessons From A DevOps Transformation on AWS
Lessons From A DevOps Transformation on AWSLessons From A DevOps Transformation on AWS
Lessons From A DevOps Transformation on AWS
 
Openstack on Fedora, Fedora on Openstack: An Introduction to cloud IaaS
Openstack on Fedora, Fedora on Openstack: An Introduction to cloud IaaSOpenstack on Fedora, Fedora on Openstack: An Introduction to cloud IaaS
Openstack on Fedora, Fedora on Openstack: An Introduction to cloud IaaS
 
Infrastructure as Code
Infrastructure as CodeInfrastructure as Code
Infrastructure as Code
 
Openstack devops challenges
Openstack devops challenges Openstack devops challenges
Openstack devops challenges
 
SOUG_Deployment__Automation_DB
SOUG_Deployment__Automation_DBSOUG_Deployment__Automation_DB
SOUG_Deployment__Automation_DB
 
Ansible - Hands on Training
Ansible - Hands on TrainingAnsible - Hands on Training
Ansible - Hands on Training
 
Ansible Tutorial.pdf
Ansible Tutorial.pdfAnsible Tutorial.pdf
Ansible Tutorial.pdf
 

Dernier

TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
VictorSzoltysek
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
VishalKumarJha10
 

Dernier (20)

AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
ManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide DeckManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide Deck
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 

Standalone Spark Deployment for Stability and Performance

  • 1. Standalone Spark Deployment For Stability and Performance
  • 2. Totango ❖ Leading Customer Success Platform ❖ Helps companies retain and grow their customer base ❖ Advanced actionable analytics for subscription and recurring revenue ❖ Founded @ 2010 ❖ Infrastructure on AWS cloud ❖ Spark for batch processing ❖ ElasticSearch for serving layer
  • 3. About Me Romi Kuntsman Senior Big Data Engineer @ Totango Working with Apache Spark since v1.0 Working with AWS Cloud since 2008
  • 4. Spark on AWS - first attempts ❖ We tried Amazon EMR (Elastic MapReduce) to install Spark on YARN ➢ Performance hit per application (starts Spark instance for each) ➢ Performance hit per server (running services we don't use, like HDFS) ➢ Slow and unstable cluster resizing (often stuck and need to recreate) ❖We tried spark-ec2 script to install Spark Standalone on AWS EC2 machines ➢ Serial (not parallel) initialization of multiple servers - slow! ➢ Unmaintained scripts since availability of Spark on EMR (see above) ➢ Doesn't integrate with our existing systems
  • 5. Spark on AWS - road to success ❖ We decided to write our own scripts to integrate and control everything ❖Understood all Spark components and configuration settings ❖Deployment based on Chef, like we do in all servers ❖Integrated monitoring and logging, like we have in all our systems ❖Full server utilization - running exactly what we need and nothing more ❖Cluster hanging or crashing no longer happens ❖Seamless cluster resize without hurting any existing jobs ❖Able to upgrade to any version of Spark (not dependant on third party)
  • 6. What we'll discuss ❖Separation of Spark Components ❖Centralized Managed Logging ❖Monitoring Cluster Utilization ❖Auto Scaling Groups ❖Termination Protection ❖Upstart Mechanism ❖NewRelic Integration ❖Chef-based Instantiation Data w/ Romi Ops w/ Alon
  • 7. Separation of Components ❖Spark Master Server (single) ➢Master Process - accepts requests to start applications ➢History Process - serves history data of completed applications ❖Spark Slave Server (multiple) ➢Worker Process - handles workload of applications on server ➢External Shuffle Service - handles data exchange between workers ➢Executor Process (one per core - for running apps) - runs actual code
  • 8. Configuration - Deploy Spread Out ❖spark.deploy.spreadOut (SPARK_MASTER_OPTS) ➢true = use cores spread across all workers ➢false = fill up all worker cores before getting more
  • 9. Configuration - Cleanup ❖spark.worker.cleanup.* (SPARK_WORKER_OPTS) ➢.enabled = true (turn on mechanism to clean up app folders) ➢.interval = 1800 (run every 1800 seconds, or 30 minutes) ➢.appDataTtl = 1800 (remove finished applications after 30 minutes) ❖We have 100s of applications per day, each with it's jars and logs ❖Rapid cleanup is essential to avoid filling up disk space ❖We collect the logs before cleanup - details in following slides ;-) ❖Only cleans up files of completed applications
  • 10. External Shuffle Service ❖Preserves shuffle files written by executors ❖Servers shuffle files to other executors who want to fetch them ❖If (when) one executor crashes (OOM etc), others may still access it's shuffle ❖We run the shuffle service itself in a separate process from the executor ❖To enable: spark.shuffle.service.enable=true ❖Config: spark.shuffle.io.* (see documentation)
  • 11. Logging - components ❖ Master Log (/logs/spark-runner- org.apache.spark.deploy.master.Master-*) ➢ Application registration, worker coordination ❖History Log (/logs/spark-runner- org.apache.spark.deploy.history.HistoryServer-*) ➢ Access to history, errors reading (e.g. I/O from S3, not found) ❖Worker Log (/logs/spark-runner- org.apache.spark.deploy.worker.Worker-*) ➢ Executor management (launch, kill, ACLs) ❖Shuffle Log (/logs/org.apache.spark.deploy.ExternalShuffleService-*) ➢ External Executor Registrations
  • 12. Logging - applications ❖Application Logs (/mnt/spark-work/app-12345/execid/stderr) ➢ All output from executor process, including your own code ❖Using LogStash to gather logs from all applications together input { file { path => "/mnt/spark-work/app-*/*/std*" start_position => beginning } } filter { grok { match => [ "path", "/mnt/spark-work/%{NOTSPACE:application}/.+/%{NOTSPACE:logtype}" ] } } output { file { path => "/logs/applications.log" message_format => "%{application} %{logtype} %{message}" } }
  • 13. Monitoring Cluster Utilization ❖ Spark Reports Metrics (Codahale) through Graphite ➢Master metrics - running application and their status ➢Worker metrics - used cores, free cores ➢JVM metrics - memory allocation, GC ❖We use Anodot to view and track metrics trends and anomalies
  • 14. And now, to the Ops side... Alon Torres DevOps Engineer @ Totango
  • 15. Auto Scaling Group Components ❖Auto Scaling Group ➢ Scale your group up or down flexibly ➢ Supports health checks and load balancing ❖Launch Configuration ➢ Template used by the ASG to launch instances ➢ User Data script for post-launch configuration ❖User Data ➢ Install prerequisites and fetch instance info ➢ Install and start Chef client ➢ Sanity checks throughout [Diagram: Launch Configuration, User Data, Auto Scaling Group, EC2 instances]
  • 16. Auto Scaling Group resizing in AWS ❖ Scheduled ➢ Set the desired size according to a specified schedule ➢ Good for scenarios with predictable, cyclic workloads. ❖Alert-Based ➢ Set specific alerts that trigger a cluster action ➢ Alerts can monitor instance health properties (resource usage) ❖Remote-triggered ➢ Using the AWS API/CLI, resize the cluster however you want
  • 17. Resizing the ASG with Jenkins ❖We use schedule-based Jenkins jobs that utilize the AWS CLI ➢ Each job sets the desired Spark cluster size ➢ Makes it easy for our Data team to make changes to the schedule ➢ Desired size can be manually overridden if needed
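A schedule-based resize can be sketched as a lookup of the desired size for the current time, followed by an AWS CLI call. The schedule values, the group name, and both function names below are hypothetical. The deck uses one Jenkins job per size change, so the single table here is a simplification. The snippet only builds the command; running it is left to the caller.

```python
from datetime import time

# Hypothetical schedule: (start time, desired cluster size), sorted by start.
SCHEDULE = [(time(0, 0), 4), (time(8, 0), 12), (time(20, 0), 6)]


def desired_size(now, schedule=SCHEDULE):
    """Return the size of the latest schedule entry that has started by `now`."""
    size = schedule[0][1]
    for start, capacity in schedule:
        if start <= now:
            size = capacity
    return size


def resize_command(group_name, capacity):
    """Build the AWS CLI argv that sets the group's desired capacity."""
    return [
        "aws", "autoscaling", "set-desired-capacity",
        "--auto-scaling-group-name", group_name,
        "--desired-capacity", str(capacity),
    ]
```

Passing the returned list to `subprocess.run` would perform the resize. A manual override amounts to calling `resize_command` with a size chosen by hand.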
  • 18. Termination Protection ❖When scaling down, ASG treats all nodes as equal termination candidates ❖We want to avoid killing instances with currently running jobs ❖To achieve this, we used a built-in feature of ASG - termination protection ❖Any instance in the ASG can be set as protected, thus preventing termination when scaling down the cluster. if [ $(ps -ef | grep executor | grep spark | wc -l) -ne 0 ]; then aws autoscaling set-instance-protection --protected-from-scale-in … fi
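The shell check above can be mirrored in Python as a sketch: look for a Spark executor in the process list, then build the matching `set-instance-protection` command. The function names, instance ID, and group name are illustrative. Removing protection when no executor is running is an assumption, since the slide only shows the protect branch and elides the remaining arguments.

```python
def has_running_executor(ps_output):
    """Return True if any process line mentions both 'spark' and 'executor'."""
    return any(
        "executor" in line and "spark" in line
        for line in ps_output.splitlines()
    )


def protection_command(instance_id, group_name, protect):
    """Build the AWS CLI argv that sets or clears scale-in protection."""
    # Assumption: protection is cleared again once the instance is idle.
    flag = "--protected-from-scale-in" if protect else "--no-protected-from-scale-in"
    return [
        "aws", "autoscaling", "set-instance-protection",
        "--instance-ids", instance_id,
        "--auto-scaling-group-name", group_name,
        flag,
    ]
```

Run from cron, such a script would feed the output of `ps -ef` to `has_running_executor` and execute the resulting command, so the ASG only removes idle instances when scaling down.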
  • 19. Upstart Jobs for Spark ❖ Every Spark component has an Upstart job that does the following ➢ Set Spark niceness (process priority in CPU resource distribution) ➢ Start the required Spark component and ensure it stays running ■ The default Spark daemon script runs in the background ■ For Upstart, we modified the script to run in the foreground ❖ nohup nice -n "$SPARK_NICENESS"…& vs ❖ nice -n "$SPARK_NICENESS" ...
  • 20. NewRelic Monitoring ❖ Cloud-based application and server monitoring ❖Supports multiple alert policies for different needs ➢ Who to alert, and what triggers the alerts ❖Newly created instances are automatically assigned the default alert policy
  • 21. Policy Assignment using AWS Lambda ❖Spark instances have their own policy in NewRelic ❖Each instance has to ask NewRelic to be reassigned to the new policy ➢Parallel reassignment requests may collide and override each other ❖Solution - during provisioning and shutdown, each instance does the following: ➢Puts a record in an AWS Kinesis stream that contains its hostname and its desired NewRelic policy ID ➢The record triggers an AWS Lambda script that uses the NewRelic API to reassign the given hostname to the given policy ID
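The Lambda side of this flow can be sketched as follows, assuming each Kinesis record carries a JSON body with `hostname` and `policy_id` fields (the field names are hypothetical). Kinesis delivers record data base64-encoded under `Records[].kinesis.data`. The actual NewRelic API call is not shown: it is injected as a `reassign` callable, so the sketch makes no claims about that API.

```python
import base64
import json


def decode_records(event):
    """Yield (hostname, policy_id) for each Kinesis record in a Lambda event."""
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"])
        body = json.loads(payload)
        yield body["hostname"], body["policy_id"]


def make_handler(reassign):
    """Build a Lambda-style handler around an injected reassign(hostname, policy_id)."""
    def handler(event, context):
        count = 0
        # Records are handled one at a time, in the order they arrive.
        for hostname, policy_id in decode_records(event):
            reassign(hostname, policy_id)
            count += 1
        return {"reassigned": count}
    return handler
```

Funnelling every request through one stream consumer that applies them one after another is presumably what avoids the collisions seen with parallel requests.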
  • 22. Chef ❖Configuration management tool, can provision and configure instances ➢Describe an instance state as code, let Chef handle the rest ➢Typically works in server/client mode - client updates every 30m ➢Besides provisioning, also prevents configuration drift ❖A vast number of plugins and cookbooks - the sky's the limit! ❖Configures all the instances in our DC
  • 23. Spark Instance Provisioning ❖ Set up Spark ➢ Set up prerequisites - users, directories, symlinks and jars ➢ Download and extract the Spark package from S3 ❖Configure termination protection cron script ❖Configure Upstart conf files ❖Place Spark config files ❖Assign NewRelic policy ❖Add shutdown scripts ➢ Delete instance from Chef database ➢ Remove from NewRelic monitoring policy
  • 24. Questions? ❖ Alon Torres, DevOps https://il.linkedin.com/in/alontorres ❖Romi Kuntsman, Senior Big Data Engineer https://il.linkedin.com/in/romik ❖Stay in touch! Totango Engineering Technical Blog http://labs.totango.com/
