Using Apache Spark for processing
trillions of records each day at
Datadog
Vadim Semenov
Data Engineer @ Datadog
vadim@datadoghq.com
Initial setup
AWS EMR
100-200x r3.8xlarge: 32 cores, 244 GiB, 640 GB SSD, 10 GbE
3,200-6,400 cores
23.5-47 TiB memory (only 240.23 GiB per node is usable because of Xen)
spot instances
Spark 1.6 in yarn-cluster mode
Scala + RDD API
Some initial settings
yarn.nodemanager.resource.memory-mb 240g
yarn.scheduler.maximum-allocation-mb 240g
spark.driver.memory 8g
spark.yarn.driver.memoryOverhead 3g
spark.executor.memory 201g
spark.yarn.executor.memoryOverhead 28g
spark.driver.cores 4
spark.executor.cores 32
spark.executor.extraJavaOptions -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails
  -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'
  -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp
  -XX:ErrorFile=/tmp/hs_err_pid%p.log
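Sanity check on how these numbers fit together (arithmetic, not from the deck): heap plus overhead must fit under the YARN container cap, per executor and per driver.

  spark.executor.memory (201g) + spark.yarn.executor.memoryOverhead (28g) = 229g, under the 240g container cap
  spark.driver.memory   (8g)   + spark.yarn.driver.memoryOverhead   (3g)  = 11g, under the 240g container cap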
Trillion
How big is a trillion?
2^40 = 1,099,511,627,776
2^31 = 2,147,483,648 = Int.MaxValue
a trillion Integers = 4.3 TiB
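A quick back-of-envelope in plain Scala (assuming "a trillion" is taken as 2^40 Ints, as above):

  val ints  = 1L << 40   // 1,099,511,627,776 Ints
  val bytes = ints * 4L  // 4,398,046,511,104 bytes, i.e. roughly the figure above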
OOMs
- java.lang.OutOfMemoryError: Java heap space (Increase heap size)
- java.lang.OutOfMemoryError: GC overhead limit exceeded (Too much garbage)
- java.lang.OutOfMemoryError: Direct buffer memory (NIO)
- YARN: Container is running beyond physical memory limits. Killing container. (Increase memory overhead)
- There is insufficient memory for the Java Runtime Environment to continue (Add more memory, reduce memory consumption)
The driver must survive
spark.driver.memory                 8g   → 83g
spark.yarn.driver.memoryOverhead    3g   → 32g
spark.driver.cores                  4    → 15
spark.executor.memory               201g → 166g
spark.yarn.executor.memoryOverhead  28g  → 64g
spark.executor.cores                32   → 30
[Image credit: Tyne & Wear Archives & Museums]
Measure memory usage
https://github.com/etsy/statsd-jvm-profiler
spark.files = /tmp/statsd-jvm-profiler.jar
spark.executor.extraJavaOptions +=
  -javaagent:statsd-jvm-profiler.jar=server=localhost,port=8125,profilers=MemoryProfiler
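The same settings as a command-line sketch (the jar path, StatsD endpoint, and job class/jar are examples, not from the deck):

  spark-submit \
    --files /tmp/statsd-jvm-profiler.jar \
    --conf "spark.executor.extraJavaOptions=-javaagent:statsd-jvm-profiler.jar=server=localhost,port=8125,profilers=MemoryProfiler" \
    --class com.example.Job job.jar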
[Charts: memory usage and GC metrics from the profiler]
Off-heap OOMs
java.lang.OutOfMemoryError: Direct buffer memory
  at java.nio.Bits.reserveMemory(Bits.java:658)
  at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
  …
  at parquet.hadoop.codec.…
Off-heap memory
- Direct Allocated Buffers (NIO): Parquet, MessagePack, …
- Java Native Interface (JNI): dynamically-linked native libraries like libhadoop.so, GZIP, ZLIB, LZ4
- sun.misc.Unsafe: org.apache.hadoop.io.nativeio, org.apache.spark.unsafe
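Direct buffers live outside the heap and are capped by -XX:MaxDirectMemorySize rather than -Xmx; a minimal illustration:

  import java.nio.ByteBuffer

  // Allocated outside the JVM heap; counts against -XX:MaxDirectMemorySize.
  // Exhausting that limit raises java.lang.OutOfMemoryError: Direct buffer memory.
  val buf = ByteBuffer.allocateDirect(64 * 1024 * 1024) // 64 MiB off-heap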
Process memory
$ cat /proc/<spark driver/executor pid>/status
VmPeak: 190317312 kB
VmSize: 190268160 kB
VmHWM:  187586408 kB
VmRSS:  187586408 kB
VmData: 190044492 kB
Process memory
Solution: let the Java agent read its own process's memory usage straight from procfs
https://github.com/DataDog/spark-jvm-profiler
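A minimal sketch of that idea (not the actual spark-jvm-profiler code): read VmRSS for the current process from procfs.

  import scala.io.Source

  // Resident set size of this JVM in kB, as the kernel sees it:
  // heap, direct buffers, JNI allocations and all.
  def rssKb(): Option[Long] =
    Source.fromFile("/proc/self/status").getLines()
      .find(_.startsWith("VmRSS:"))
      .map(_.split("\\s+")(1).toLong)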
[Charts: memory usage of each executor]
Lessons
- Give more resources than you think you would need, and then reduce
- Measure memory usage of each executor
- Keep an eye on your GC metrics
Measure slow parts
val timer = MaxAndTotalTimeAccumulator  // custom accumulator; a sketch follows below
rdd.map(key => {
  val startTime = System.nanoTime()
  ...  // the expensive per-record work
  val endTime = System.nanoTime()
  val millisecondsPassed = ((endTime - startTime) / 1000000).toInt
  timer.add(millisecondsPassed)
})
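The deck doesn't show MaxAndTotalTimeAccumulator itself; a minimal Spark 2 sketch of such an accumulator (tracking the slowest single record and the total time) could be:

  import org.apache.spark.util.AccumulatorV2

  class MaxAndTotalTimeAccumulator extends AccumulatorV2[Long, (Long, Long)] {
    private var maxMs   = 0L
    private var totalMs = 0L
    override def isZero: Boolean = maxMs == 0L && totalMs == 0L
    override def copy(): MaxAndTotalTimeAccumulator = {
      val acc = new MaxAndTotalTimeAccumulator
      acc.maxMs = maxMs; acc.totalMs = totalMs
      acc
    }
    override def reset(): Unit = { maxMs = 0L; totalMs = 0L }
    override def add(ms: Long): Unit = { maxMs = math.max(maxMs, ms); totalMs += ms }
    override def merge(other: AccumulatorV2[Long, (Long, Long)]): Unit = other match {
      case o: MaxAndTotalTimeAccumulator =>
        maxMs = math.max(maxMs, o.maxMs)
        totalMs += o.totalMs
    }
    override def value: (Long, Long) = (maxMs, totalMs)
  }

  // register it before use: sc.register(timer, "slow-parts-timer")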
Watch skewed parts
.groupByKey().flatMap({ case (key, iter) =>
  val size = iter.size
  maxAccumulator.add(key, size)
  if (size >= 100000000) {  // a key with 100M+ values: log it and drop it
    log.info(s"Key $key has $size values")
    None
  } else {
    Some((key, iter))  // the slide cuts off here; presumably pass the group through
  }
})
Report accumulators per partition
sc.addSparkListener(new SparkListener {
override def onTaskEnd(
taskEnd: SparkListenerTaskEnd
): Unit =
Option(taskEnd.taskMetrics)
.foreach(taskMetrics => … )
})
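As an example of what to pull out (the deck elides the body; statsd here is a hypothetical metrics client), TaskMetrics exposes fields such as executorRunTime and jvmGCTime, both in milliseconds:

  Option(taskEnd.taskMetrics).foreach { m =>
    statsd.histogram("task.run_time_ms", m.executorRunTime)
    statsd.histogram("task.gc_time_ms",  m.jvmGCTime)
  }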
Collect executor metrics
Lessons
- Measure the slowest parts of your job
- Count records in the most skewed parts
- Keep track of how much CPU time your job actually consumes
- Have alerting on these metrics, so you know when your job gets slower
Spot instances
Spot instances mitigation
- Break the job into smaller survivable pieces
- Use `rdd.checkpoint` instead of `rdd.persist` to save data to HDFS (see the sketch after this list)
- Helps dynamic allocation, since executors don't hold any data, so they can leave the job and join other jobs
- Losing multiple executors won't result in recomputing partitions
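A minimal sketch of the checkpoint approach (the HDFS path and the pipeline are placeholders): checkpointing truncates the RDD's lineage, so recovery reads the data back from HDFS instead of recomputing it.

  // Set once per application; checkpoints need reliable storage such as HDFS.
  sc.setCheckpointDir("hdfs:///spark/checkpoints")  // example path

  val keyed = input.map(parse).reduceByKey(_ + _)   // placeholder pipeline
  keyed.checkpoint()  // mark the RDD for checkpointing
  keyed.count()       // the first action materializes the checkpoint;
                      // downstream stages then read from HDFS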
ExternalShuffleService
[Diagram: executors Ex1-Ex3 and the driver. Ex1 registers shuffle blocks 1-4 with its node's external shuffle service, so the blocks stay servable even after the executor itself exits. When the whole node is lost, the blocks go with it and other executors' fetches fail:]
ERROR o.a.s.n.shuffle.RetryingBlockFetcher:
Exception while beginning fetch of 13 outstanding blocks
java.io.IOException: Failed to connect to
ip-10-12-32-67.us-west-2.compute.internal/10.12.32.67:7337
[Diagram, continued: the lost blocks get recomputed on the surviving executors one fetch failure at a time, first block 1, then block 2, with the same RetryingBlockFetcher error repeating in between.]
ExternalShuffleService
SPARK-19753 Remove all shuffle files on a host in case of slave lost or fetch failure
SPARK-20832 Standalone master should explicitly inform
drivers of worker deaths and invalidate external shuffle
service outputs
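For context, the relevant settings (standard Spark configuration, not shown in the deck; on YARN, dynamic allocation requires the external shuffle service):

  spark.shuffle.service.enabled true
  spark.dynamicAllocation.enabled true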
Other FetchFailures
SPARK-20178 Improve Scheduler fetch failures
Keep an eye on failed tasks
Lessons
- Keep all logs
- Spark isn't super-resilient even when one node dies
- Monitor the number of failed tasks/stages/lost nodes
Late arriving partitions
rddA.cogroup(rddB, rddC).map({ case (k, (iterA, iterB, iterC)) =>
  // We should always have a one-to-one join, but who knows …
  if (iterA.toSet.size > 1)
    throw new RuntimeException(s"Key $k received more than 1 A record")
  if (iterB.toSet.size > 1)
    throw new RuntimeException(…)
  if (iterC.toSet.size > 1) …
Late arriving partitions
.map({ case (key, values: Iterable[(Long, Int)]) =>
  values.toList.sortBy(_._1)
  // sortBy(_._1) leaves ties unordered, so reruns can emit either:
  // (1L, 10), (1L, 1), (2L, 1)
  // (1L, 1), (1L, 10), (2L, 1)
})
SPARK-19263 DAGScheduler should avoid sending conflicting task set
Late arriving partitions
.map({ case (key, values: Iterable[(Long, Int)]) =>
  values.toList.sorted  // total order over the whole tuple: deterministic on recompute
  // (1L, 1), (1L, 10), (2L, 1)
})
Lessons
- Trust, but put in extra checks and log everything
- Add extra idempotency even if it should already be there
- Fail the job if some unexpected situation is encountered, but also think ahead of time about whether such situations can be overcome
- Have retries at the pipeline-scheduler level
Migration to Spark 2
SPARK-13850 TimSort Comparison method violates its general contract
SPARK-14560 Cooperative Memory Management for Spillables
SPARK-14363 Executor OOM due to a memory leak in Sorter
SPARK-22033 BufferHolder, other size checks should account for the specific VM array size limitations
Lessons
- Check the bug tracker periodically
- Subscribe to the mailing lists
- Participate in discussing issues
In conclusion
- Log everything (driver/executors, NodeManagers, GC)
- Measure everything (heap/off-heap, GC, executor CPU, failed tasks/stages, slow parts, skewed parts)
- Trust but be ready
- Smaller survivable pieces
Thanks!
Want to work with us on Spark, Kafka, ES, and
more? Come to our booth!
jobs.datadoghq.com
twitter.com/@databuryat
_@databuryat.com
vadim@datadoghq.com