SlideShare une entreprise Scribd logo
1  sur  60
Télécharger pour lire hors ligne
Emma Tang, Neustar
Optimal Strategies for
Large-Scale Batch ETL
Jobs
#EUDev3 October, 2017
2#EUdev3
https://www.neustar.biz/marketing
Neustar
• Help the world’s most valuable brands
understand and target their consumers both
online and offline
• Maximize ROI on Ad spend
• Billions of user events per day, petabytes of data
3#EUdev3
Architecture (simplified view)
4#EUdev3
Batch ETL
• Runs on schedule/ programmatically triggered
• Aim for complete utilization of cluster resources,
esp. memory and CPU
5#EUdev3
Why Batch?
• We care about historical state
• We don’t have SLA other than 1-3x daily delivery
• Efficient, tuned optimal use of resources, cost
efficiency
6#EUdev3
What we will talk about today
• Issues at scale
• Skew
• Optimizations
• Ganglia
7#EUdev3
The attribution problem
• At Neustar, we process large quantities of ad
events
• Familiar events like: impressions, clicks,
conversions
• Which impression/click contributed to
conversion?
8#EUdev3
Example attribution
• Alice goes to her favorite news site, and sees 3
ads – impressions
• She clicks on one of them that leads to Macy’s –
click
• She buys something on Macy’s – conversion
• Her purchase can be attributed to the click and
impression events
9#EUdev3
The approach
• Join conversions with impressions and clicks on
userId
• Go through each user and attribute conversions
to correct target event (impressions/clicks)
• Latest target events are valued more, so
timestamp matters
10#EUdev3
The scale
• Impression: 250 billion
• Clicks: 20 billion
• Conversions: 50 billion
• Join 50 billion x 250 billion
11#EUdev3
impressions
clicks
conversions
What we will talk about today
• Issues at scale
• Skew
• Optimizations
• Ganglia
12#EUdev3
Driver OOM
Exception in thread "map-output-dispatcher-12" java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
at java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:615)
at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:614)
at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:614)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1287)
at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:617)
13#EUdev3
Driver OOM
• Array of mapStatuses of size m, each status
contains info about how the block is used by
each reducer (n).
• m (mappers) x n (reducers)
14#EUdev3
Status1
reducer1
reducer2
Status2
reducer1
reducer2
Driver OOM
• 2 types of mapStatus: highly compressed vs
compressed
• HighlyCompressedMapStatus tracks reduce
partition average size, with bitmap tracking
which blocks are empty for each reducer
15#EUdev3
Driver OOM
• Reduce number of partitions on either side
• 300k x 75k à 100k x 75k
16#EUdev3
Disable unnecessary GC
• spark.cleaner.periodicGC.interval
• GC cycles “stop the world”.
• Large heaps means longer GC
• Set to a long period (e.g. twice the length of your
job)
17#EUdev3
Disable unnecessary GC
• ContextCleaner uses weak references to keep
track of every RDD, ShuffleDependency, and
Broadcast, and registers when the objects go
out of scope
• periodicGCService is a single-thread executor
service that calls the JVM garbage collector
periodically
18#EUdev3
Allow extra time
• spark.rpc.askTimeout
• spark.network.timeout
• in case of GC, our heap size is so large, we will
exceed the timeout.
19#EUdev3
Spurious failures
• Reading from s3 can be flaky, especially when
reading millions of files
• Set spark.task.maxFailures higher than default
of 3
• We set to < 10 to ensure true errors propagate
out quickly
20#EUdev3
What we will talk about today
• Issues at scale
• Skew
• Optimizations
• Ganglia
21#EUdev3
The skew
• Extreme skew in data
• A few users have 100k+ events for 90 days. The
average user has < 50
• Executors dying due to handful of extremely
large partitions
22#EUdev3
The skew
• Out of 20.5B users, 20.2B have < 50 events
23#EUdev3
0
5E+09
1E+10
1.5E+10
2E+10
2.5E+10
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
55000
60000
65000
70000
75000
80000
85000
90000
95000
100000
105000
110000
115000
120000
125000
130000
135000
140000
145000
150000
155000
161000
167000
172000
177000
183000
191000
197000
204000
211000
223000
248000
271000
305000
328000
393000
504000
#ofusers
# of events
count of users with number of events bucketed by 1000s
The skew: zoom
24#EUdev3
0
5
10
15
20
25
30
35
75000
79000
83000
87000
91000
95000
99000
103000
107000
111000
115000
119000
123000
127000
131000
135000
139000
143000
147000
151000
155000
160000
165000
169000
173000
177000
182000
189000
193000
198000
204000
210000
218000
231000
256000
271000
303000
314000
360000
395000
504000
#ofusers
# of events
# of users with # of events bucketed by 100s (> 75k)
Strategy: increase # of partitions
• First line of defense - increase number of
partitions so skewed data is more spread out
25#EUdev3
Strategy: Nest
• Group conversions by userId, group target event
by userId, then join the lists
• Avoid cartesian joins
26#EUdev3
Long tail: ganglia
27#EUdev3
Long tail: Spark UI
• 50 min long tail, median 24 s
28#EUdev3
Long tail: what else to do?
• If you have domain specific knowledge of your
data, use it to filter “bad” data out
• Salt your data, and shuffle twice (but shuffling is
expensive)
• Use bloom filter if one side of your join is much
smaller than the other
29#EUdev3
Bloom Filter
• Space-efficient probabilistic data structure to test
whether an element is a member of a set
• Size mainly determined by number of items in
the filter, and the probability of false positives
• No false negatives!
• Broadcast filter out to executors
30#EUdev3
Bloom Filter
• Using a high false positive rate, still very good
filter
• P = 5% -> 80% filtered out
• Subsequent join much faster
31#EUdev3
Bloom Filter
• Tradeoff between accuracy & size
• We’ve had great success with Bloom Filters with
size of < 5G
• Experiment with Bloom Filters
32#EUdev3
Bloom Filter Applied
• For conversions of 50 billion, false positive rate
of 0.1%, filter size is 80GB
• False positive rate of 5%, filter size is 35GB
• Still too big
33#EUdev3
Long tail: what else to do?
• If you have domain specific knowledge of your
data, use it to filter “bad” data out
• Salt your data, and shuffle twice (but shuffling is
expensive)
• Use bloom filter if one side of your join is much
smaller than the other
34#EUdev3
Long tail: ganglia
35#EUdev3
Long tail: what is it doing?
• Look at executor threads during long tail
com.esotericsoftware.kryo.util.IdentityObjectIntMap.clear(IdentityObjectIntMap.java:382)
com.esotericsoftware.kryo.util.MapReferenceResolver.reset(MapReferenceResolver.java:65)
com.esotericsoftware.kryo.Kryo.reset(Kryo.java:865)
com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:630)
org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:209
)
org.apache.spark.serializer.SerializationStream.writeValue(Serializer.scala:134)
org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:239)
org.apache.spark.util.collection.WritablePartitionedPairCollection$$anon$1.writeNext(Wri
tablePartitionedPairCollection.scala:56)
org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scal
a:699)
36#EUdev3
Long tail: what is it doing?
• Mappers writing to shuffle space taking long
• Need to reduce data size before going into
shuffle
37#EUdev3
Long tail: what is it doing?
• Events in the long tail had almost identical
information, spread over time.
• For each user, if we retain just 1 event per hour,
at 90 days, it is around 2k events.
• However, this means we need to group by user
first, which requires a shuffle, which defeats the
whole purpose of this exercise, right?
38#EUdev3
Strategy: Filter during map side combine
• Use combineByKey and maximize map side
combine
• Thin collection out during map side combine ->
less is written to shuffle space
39#EUdev3
40#EUdev3
Still slow…
• What else can I do?
41#EUdev3
What we will talk about today
• Issues at scale
• Skew
• Optimizations
• Ganglia
42#EUdev3
Avoid shuffles
• Reuse the same partitioner instance
43#EUdev3
The DAG
44#EUdev3
Avoid shuffles
• Denormalize data or union data to minimize
shuffle
• Rely on the fact we will reduce into a highly
compressed key space.
• For example, we want count of events by
campaign, also count of events by site
45#EUdev3
Avoid shuffle
46#EUdev3
Coalesce partitions when loading
• Loading many small files – coalesce down # of
partitions
• No shuffle
• Reduce task overhead, greatly improve speed
• Going from 300k partitions to 60k, cut time by
half
47#EUdev3
Coalesce partitions when loading
final JavaRDD<Event> eventRDD= loadDataFromS3(); // load data
final int loadingPartitions = eventRDD.getNumPartitions(); // inspect
how many partitions
final int coalescePartitions = loadingPartitions / 5; // use algorithm
to calculate new #
eventRDD
.coalesce(coalescePartitions) // coalesce to smaller #
.map(e -> transform(e)) // faster subsequent operations
48#EUdev3
Materialize data
• Large chunk of data persisted in memory
• Large RDD used to calculate small RDD
• Use an Action to materialize the smaller
calculated result so larger data can be
unpersisted
49#EUdev3
Materialize data
parent.cache() // persist large parent PairRDD to memory
child1 = parent.reduceByKey(a).cache() // calculate child1 from parent
child2 = parent.reduceByKey(b).cache() // calculate child2 from parent
child1.count()// perform an Action
child2.count()// perform an Action
parent.unpersist() // safe to mark parent as unpersisted
// rest of the code can use memory
50#EUdev3
What we will talk about today
• Issues at scale
• Skew
• Optimizations
• Ganglia
51#EUdev3
Ganglia
• Ganglia is an extremely useful tool to
understand performance bottlenecks, and to
tune for highest cluster utilization
52#EUdev3
Ganglia: CPU wave
53#EUdev3
Ganglia: CPU wave
• Executors are going into GC multiple times in
the same stage
• Running out of execution memory
• Persist to StorageLevel.DISK_ONLY()
54#EUdev3
Ganglia: inefficient use
55#EUdev3
Ganglia: inefficient use
• Decrease # of partitions of RDDs used in this
stage
56#EUdev3
Ganglia: much better
57#EUdev3
Final Configuration
• Master 1 r3.4xl
• Executors 110 r3.4xl
• Configurations:
58#EUdev3
spark maximizeResourceAllocation TRUE
spark-defaults spark.executor.cores 16
spark-defaults spark.dynamicAllocation.enabled FALSE
spark-defaults spark.driver.maxResultSize 8g
spark-defaults spark.rpc.message.maxSize 2047
spark-defaults spark.rpc.askTimeout 300
spark-defaults spark.network.timeout 300s
spark-defaults spark.executor.heartbeatInterval 20s
spark-defaults spark.executor.memory 92540m
spark-defaults spark.yarn.executor.memoryOverhead 23300
spark-defaults spark.task.maxFailures 10
spark-defaults spark.executor.extraJavaOptions -XX:+UseG1GC
spark-defaults spark.cleaner.periodicGC.interval 600min
Summary
• Large jobs are special, use special settings
• Outsmart the skew
• Use Ganglia!
59#EUdev3
Thank you
60#EUdev3
Emma Tang
@emmayolotang
@Neustar

Contenu connexe

Tendances

PYCON KR 2017 - 구름이 하늘의 일이라면 (윤상웅)
PYCON KR 2017 - 구름이 하늘의 일이라면 (윤상웅)PYCON KR 2017 - 구름이 하늘의 일이라면 (윤상웅)
PYCON KR 2017 - 구름이 하늘의 일이라면 (윤상웅)Haezoom Inc.
 
테라로 살펴본 MMORPG의 논타겟팅 시스템
테라로 살펴본 MMORPG의 논타겟팅 시스템테라로 살펴본 MMORPG의 논타겟팅 시스템
테라로 살펴본 MMORPG의 논타겟팅 시스템QooJuice
 
Spark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon WhitearSpark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon WhitearSpark Summit
 
게임에서 흔히 쓰이는 최적화 전략 by 엄윤섭 @ 지스타 컨퍼런스 2013
게임에서 흔히 쓰이는 최적화 전략 by 엄윤섭 @ 지스타 컨퍼런스 2013게임에서 흔히 쓰이는 최적화 전략 by 엄윤섭 @ 지스타 컨퍼런스 2013
게임에서 흔히 쓰이는 최적화 전략 by 엄윤섭 @ 지스타 컨퍼런스 2013영욱 오
 
[NDC 2018] 신입 개발자가 알아야 할 윈도우 메모리릭 디버깅
[NDC 2018] 신입 개발자가 알아야 할 윈도우 메모리릭 디버깅[NDC 2018] 신입 개발자가 알아야 할 윈도우 메모리릭 디버깅
[NDC 2018] 신입 개발자가 알아야 할 윈도우 메모리릭 디버깅DongMin Choi
 
온라인 게임에서 사례로 살펴보는 디버깅 in NDC2010
온라인 게임에서 사례로 살펴보는 디버깅 in NDC2010온라인 게임에서 사례로 살펴보는 디버깅 in NDC2010
온라인 게임에서 사례로 살펴보는 디버깅 in NDC2010Ryan Park
 
이무림, Enum의 Boxing을 어찌할꼬? 편리하고 성능좋게 Enum 사용하기, NDC2019
이무림, Enum의 Boxing을 어찌할꼬? 편리하고 성능좋게 Enum 사용하기, NDC2019이무림, Enum의 Boxing을 어찌할꼬? 편리하고 성능좋게 Enum 사용하기, NDC2019
이무림, Enum의 Boxing을 어찌할꼬? 편리하고 성능좋게 Enum 사용하기, NDC2019devCAT Studio, NEXON
 
[NDC2016] 신경망은컨텐츠자동생성의꿈을꾸는가
[NDC2016] 신경망은컨텐츠자동생성의꿈을꾸는가[NDC2016] 신경망은컨텐츠자동생성의꿈을꾸는가
[NDC2016] 신경망은컨텐츠자동생성의꿈을꾸는가Hwanhee Kim
 
Multiplayer Game Sync Techniques through CAP theorem
Multiplayer Game Sync Techniques through CAP theoremMultiplayer Game Sync Techniques through CAP theorem
Multiplayer Game Sync Techniques through CAP theoremSeungmo Koo
 
How Insurance Companies Use MongoDB
How Insurance Companies Use MongoDB How Insurance Companies Use MongoDB
How Insurance Companies Use MongoDB MongoDB
 
Albion Online - Software Architecture of an MMO (talk at Quo Vadis 2016, Berlin)
Albion Online - Software Architecture of an MMO (talk at Quo Vadis 2016, Berlin)Albion Online - Software Architecture of an MMO (talk at Quo Vadis 2016, Berlin)
Albion Online - Software Architecture of an MMO (talk at Quo Vadis 2016, Berlin)David Salz
 
Scalable Filesystem Metadata Services with RocksDB
Scalable Filesystem Metadata Services with RocksDBScalable Filesystem Metadata Services with RocksDB
Scalable Filesystem Metadata Services with RocksDBAlluxio, Inc.
 
Row Pattern Matching in SQL:2016
Row Pattern Matching in SQL:2016Row Pattern Matching in SQL:2016
Row Pattern Matching in SQL:2016Markus Winand
 
Hive join optimizations
Hive join optimizationsHive join optimizations
Hive join optimizationsSzehon Ho
 
Ndc14 분산 서버 구축의 ABC
Ndc14 분산 서버 구축의 ABCNdc14 분산 서버 구축의 ABC
Ndc14 분산 서버 구축의 ABCHo Gyu Lee
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDatabricks
 
Data Mining Seminar - Graph Mining and Social Network Analysis
Data Mining Seminar - Graph Mining and Social Network AnalysisData Mining Seminar - Graph Mining and Social Network Analysis
Data Mining Seminar - Graph Mining and Social Network Analysisvwchu
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
Multithread & shared_ptr
Multithread & shared_ptrMultithread & shared_ptr
Multithread & shared_ptr내훈 정
 
Microsoft SQL Server Dive Deep.pdf
Microsoft SQL Server Dive Deep.pdfMicrosoft SQL Server Dive Deep.pdf
Microsoft SQL Server Dive Deep.pdfAmazon Web Services
 

Tendances (20)

PYCON KR 2017 - 구름이 하늘의 일이라면 (윤상웅)
PYCON KR 2017 - 구름이 하늘의 일이라면 (윤상웅)PYCON KR 2017 - 구름이 하늘의 일이라면 (윤상웅)
PYCON KR 2017 - 구름이 하늘의 일이라면 (윤상웅)
 
테라로 살펴본 MMORPG의 논타겟팅 시스템
테라로 살펴본 MMORPG의 논타겟팅 시스템테라로 살펴본 MMORPG의 논타겟팅 시스템
테라로 살펴본 MMORPG의 논타겟팅 시스템
 
Spark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon WhitearSpark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon Whitear
 
게임에서 흔히 쓰이는 최적화 전략 by 엄윤섭 @ 지스타 컨퍼런스 2013
게임에서 흔히 쓰이는 최적화 전략 by 엄윤섭 @ 지스타 컨퍼런스 2013게임에서 흔히 쓰이는 최적화 전략 by 엄윤섭 @ 지스타 컨퍼런스 2013
게임에서 흔히 쓰이는 최적화 전략 by 엄윤섭 @ 지스타 컨퍼런스 2013
 
[NDC 2018] 신입 개발자가 알아야 할 윈도우 메모리릭 디버깅
[NDC 2018] 신입 개발자가 알아야 할 윈도우 메모리릭 디버깅[NDC 2018] 신입 개발자가 알아야 할 윈도우 메모리릭 디버깅
[NDC 2018] 신입 개발자가 알아야 할 윈도우 메모리릭 디버깅
 
온라인 게임에서 사례로 살펴보는 디버깅 in NDC2010
온라인 게임에서 사례로 살펴보는 디버깅 in NDC2010온라인 게임에서 사례로 살펴보는 디버깅 in NDC2010
온라인 게임에서 사례로 살펴보는 디버깅 in NDC2010
 
이무림, Enum의 Boxing을 어찌할꼬? 편리하고 성능좋게 Enum 사용하기, NDC2019
이무림, Enum의 Boxing을 어찌할꼬? 편리하고 성능좋게 Enum 사용하기, NDC2019이무림, Enum의 Boxing을 어찌할꼬? 편리하고 성능좋게 Enum 사용하기, NDC2019
이무림, Enum의 Boxing을 어찌할꼬? 편리하고 성능좋게 Enum 사용하기, NDC2019
 
[NDC2016] 신경망은컨텐츠자동생성의꿈을꾸는가
[NDC2016] 신경망은컨텐츠자동생성의꿈을꾸는가[NDC2016] 신경망은컨텐츠자동생성의꿈을꾸는가
[NDC2016] 신경망은컨텐츠자동생성의꿈을꾸는가
 
Multiplayer Game Sync Techniques through CAP theorem
Multiplayer Game Sync Techniques through CAP theoremMultiplayer Game Sync Techniques through CAP theorem
Multiplayer Game Sync Techniques through CAP theorem
 
How Insurance Companies Use MongoDB
How Insurance Companies Use MongoDB How Insurance Companies Use MongoDB
How Insurance Companies Use MongoDB
 
Albion Online - Software Architecture of an MMO (talk at Quo Vadis 2016, Berlin)
Albion Online - Software Architecture of an MMO (talk at Quo Vadis 2016, Berlin)Albion Online - Software Architecture of an MMO (talk at Quo Vadis 2016, Berlin)
Albion Online - Software Architecture of an MMO (talk at Quo Vadis 2016, Berlin)
 
Scalable Filesystem Metadata Services with RocksDB
Scalable Filesystem Metadata Services with RocksDBScalable Filesystem Metadata Services with RocksDB
Scalable Filesystem Metadata Services with RocksDB
 
Row Pattern Matching in SQL:2016
Row Pattern Matching in SQL:2016Row Pattern Matching in SQL:2016
Row Pattern Matching in SQL:2016
 
Hive join optimizations
Hive join optimizationsHive join optimizations
Hive join optimizations
 
Ndc14 분산 서버 구축의 ABC
Ndc14 분산 서버 구축의 ABCNdc14 분산 서버 구축의 ABC
Ndc14 분산 서버 구축의 ABC
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 
Data Mining Seminar - Graph Mining and Social Network Analysis
Data Mining Seminar - Graph Mining and Social Network AnalysisData Mining Seminar - Graph Mining and Social Network Analysis
Data Mining Seminar - Graph Mining and Social Network Analysis
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Multithread & shared_ptr
Multithread & shared_ptrMultithread & shared_ptr
Multithread & shared_ptr
 
Microsoft SQL Server Dive Deep.pdf
Microsoft SQL Server Dive Deep.pdfMicrosoft SQL Server Dive Deep.pdf
Microsoft SQL Server Dive Deep.pdf
 

En vedette

High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...Spark Summit
 
Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...
Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...
Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...Spark Summit
 
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...Spark Summit
 
Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and ...
Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and ...Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and ...
Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and ...Spark Summit
 
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Spark Summit
 
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai YuAn Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai YuDatabricks
 
Fast Data with Apache Ignite and Apache Spark with Christos Erotocritou
Fast Data with Apache Ignite and Apache Spark with Christos ErotocritouFast Data with Apache Ignite and Apache Spark with Christos Erotocritou
Fast Data with Apache Ignite and Apache Spark with Christos ErotocritouSpark Summit
 
Storage Engine Considerations for Your Apache Spark Applications with Mladen ...
Storage Engine Considerations for Your Apache Spark Applications with Mladen ...Storage Engine Considerations for Your Apache Spark Applications with Mladen ...
Storage Engine Considerations for Your Apache Spark Applications with Mladen ...Spark Summit
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Summit
 

En vedette (9)

High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
 
Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...
Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...
Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...
 
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
 
Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and ...
Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and ...Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and ...
Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and ...
 
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
 
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai YuAn Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
 
Fast Data with Apache Ignite and Apache Spark with Christos Erotocritou
Fast Data with Apache Ignite and Apache Spark with Christos ErotocritouFast Data with Apache Ignite and Apache Spark with Christos Erotocritou
Fast Data with Apache Ignite and Apache Spark with Christos Erotocritou
 
Storage Engine Considerations for Your Apache Spark Applications with Mladen ...
Storage Engine Considerations for Your Apache Spark Applications with Mladen ...Storage Engine Considerations for Your Apache Spark Applications with Mladen ...
Storage Engine Considerations for Your Apache Spark Applications with Mladen ...
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
 

Similaire à Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang

Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scalethelabdude
 
There's no magic... until you talk about databases
 There's no magic... until you talk about databases There's no magic... until you talk about databases
There's no magic... until you talk about databasesESUG
 
Drinking from the Firehose - Real-time Metrics
Drinking from the Firehose - Real-time MetricsDrinking from the Firehose - Real-time Metrics
Drinking from the Firehose - Real-time MetricsSamantha Quiñones
 
How to Make Norikra Perfect
How to Make Norikra PerfectHow to Make Norikra Perfect
How to Make Norikra PerfectSATOSHI TAGOMORI
 
Microservices, Continuous Delivery, and Elasticsearch at Capital One
Microservices, Continuous Delivery, and Elasticsearch at Capital OneMicroservices, Continuous Delivery, and Elasticsearch at Capital One
Microservices, Continuous Delivery, and Elasticsearch at Capital OneNoriaki Tatsumi
 
Gemtalk Systems Product Roadmap
Gemtalk Systems Product RoadmapGemtalk Systems Product Roadmap
Gemtalk Systems Product RoadmapESUG
 
Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekh...
Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekh...Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekh...
Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekh...Lucidworks
 
DEF CON 27 - CHRISTOPHER ROBERTS - firmware slap
DEF CON 27 - CHRISTOPHER ROBERTS - firmware slapDEF CON 27 - CHRISTOPHER ROBERTS - firmware slap
DEF CON 27 - CHRISTOPHER ROBERTS - firmware slapFelipe Prado
 
SAST, CWE, SEI CERT and other smart words from the information security world
SAST, CWE, SEI CERT and other smart words from the information security worldSAST, CWE, SEI CERT and other smart words from the information security world
SAST, CWE, SEI CERT and other smart words from the information security worldAndrey Karpov
 
Container Performance Analysis
Container Performance AnalysisContainer Performance Analysis
Container Performance AnalysisBrendan Gregg
 
The Heatmap
 - Why is Security Visualization so Hard?
The Heatmap
 - Why is Security Visualization so Hard?The Heatmap
 - Why is Security Visualization so Hard?
The Heatmap
 - Why is Security Visualization so Hard?Raffael Marty
 
Container Performance Analysis Brendan Gregg, Netflix
Container Performance Analysis Brendan Gregg, NetflixContainer Performance Analysis Brendan Gregg, Netflix
Container Performance Analysis Brendan Gregg, NetflixDocker, Inc.
 
Provenance for Data Munging Environments
Provenance for Data Munging EnvironmentsProvenance for Data Munging Environments
Provenance for Data Munging EnvironmentsPaul Groth
 
MSR 2009
MSR 2009MSR 2009
MSR 2009swy351
 
Using Riak for Events storage and analysis at Booking.com
Using Riak for Events storage and analysis at Booking.comUsing Riak for Events storage and analysis at Booking.com
Using Riak for Events storage and analysis at Booking.comDamien Krotkine
 
Making (Almost) Any Database Faster and Cheaper with Caching
Making (Almost) Any Database Faster and Cheaper with CachingMaking (Almost) Any Database Faster and Cheaper with Caching
Making (Almost) Any Database Faster and Cheaper with CachingAmazon Web Services
 
From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...
From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...
From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...Imperva Incapsula
 
Solr Lucene Revolution 2014 - Solr Compute Cloud - Nitin
Solr Lucene Revolution 2014 - Solr Compute Cloud - NitinSolr Lucene Revolution 2014 - Solr Compute Cloud - Nitin
Solr Lucene Revolution 2014 - Solr Compute Cloud - Nitinbloomreacheng
 
Introduction to SolrCloud
Introduction to SolrCloudIntroduction to SolrCloud
Introduction to SolrCloudVarun Thacker
 

Similaire à Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang (20)

Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scale
 
There's no magic... until you talk about databases
 There's no magic... until you talk about databases There's no magic... until you talk about databases
There's no magic... until you talk about databases
 
Drinking from the Firehose - Real-time Metrics
Drinking from the Firehose - Real-time MetricsDrinking from the Firehose - Real-time Metrics
Drinking from the Firehose - Real-time Metrics
 
How to Make Norikra Perfect
How to Make Norikra PerfectHow to Make Norikra Perfect
How to Make Norikra Perfect
 
Microservices, Continuous Delivery, and Elasticsearch at Capital One
Microservices, Continuous Delivery, and Elasticsearch at Capital OneMicroservices, Continuous Delivery, and Elasticsearch at Capital One
Microservices, Continuous Delivery, and Elasticsearch at Capital One
 
Gemtalk Systems Product Roadmap
Gemtalk Systems Product RoadmapGemtalk Systems Product Roadmap
Gemtalk Systems Product Roadmap
 
Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekh...
Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekh...Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekh...
Scaling SolrCloud to a Large Number of Collections: Presented by Shalin Shekh...
 
DEF CON 27 - CHRISTOPHER ROBERTS - firmware slap
DEF CON 27 - CHRISTOPHER ROBERTS - firmware slapDEF CON 27 - CHRISTOPHER ROBERTS - firmware slap
DEF CON 27 - CHRISTOPHER ROBERTS - firmware slap
 
SAST, CWE, SEI CERT and other smart words from the information security world
SAST, CWE, SEI CERT and other smart words from the information security worldSAST, CWE, SEI CERT and other smart words from the information security world
SAST, CWE, SEI CERT and other smart words from the information security world
 
Container Performance Analysis
Container Performance AnalysisContainer Performance Analysis
Container Performance Analysis
 
The Heatmap
 - Why is Security Visualization so Hard?
The Heatmap
 - Why is Security Visualization so Hard?The Heatmap
 - Why is Security Visualization so Hard?
The Heatmap
 - Why is Security Visualization so Hard?
 
Container Performance Analysis Brendan Gregg, Netflix
Container Performance Analysis Brendan Gregg, NetflixContainer Performance Analysis Brendan Gregg, Netflix
Container Performance Analysis Brendan Gregg, Netflix
 
Provenance for Data Munging Environments
Provenance for Data Munging EnvironmentsProvenance for Data Munging Environments
Provenance for Data Munging Environments
 
MSR 2009
MSR 2009MSR 2009
MSR 2009
 
Using Riak for Events storage and analysis at Booking.com
Using Riak for Events storage and analysis at Booking.comUsing Riak for Events storage and analysis at Booking.com
Using Riak for Events storage and analysis at Booking.com
 
Making (Almost) Any Database Faster and Cheaper with Caching
Making (Almost) Any Database Faster and Cheaper with CachingMaking (Almost) Any Database Faster and Cheaper with Caching
Making (Almost) Any Database Faster and Cheaper with Caching
 
Scalable IoT platform
Scalable IoT platformScalable IoT platform
Scalable IoT platform
 
From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...
From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...
From 1000/day to 1000/sec: The Evolution of Incapsula's BIG DATA System [Surg...
 
Solr Lucene Revolution 2014 - Solr Compute Cloud - Nitin
Solr Lucene Revolution 2014 - Solr Compute Cloud - NitinSolr Lucene Revolution 2014 - Solr Compute Cloud - Nitin
Solr Lucene Revolution 2014 - Solr Compute Cloud - Nitin
 
Introduction to SolrCloud
Introduction to SolrCloudIntroduction to SolrCloud
Introduction to SolrCloud
 

Plus de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Plus de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Dernier

Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlkumarajju5765
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 

Dernier (20)

Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 

Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang

  • 1. Emma Tang, Neustar Optimal Strategies for Large-Scale Batch ETL Jobs #EUDev3 October, 2017
  • 3. Neustar • Help the world’s most valuable brands understand and target their consumers both online and offline • Maximize ROI on Ad spend • Billions of user events per day, petabytes of data 3#EUdev3
  • 5. Batch ETL • Runs on schedule/ programmatically triggered • Aim for complete utilization of cluster resources, esp. memory and CPU 5#EUdev3
  • 6. Why Batch? • We care about historical state • We don’t have SLA other than 1-3x daily delivery • Efficient, tuned optimal use of resources, cost efficiency 6#EUdev3
  • 7. What we will talk about today • Issues at scale • Skew • Optimizations • Ganglia 7#EUdev3
  • 8. The attribution problem • At Neustar, we process large quantities of ad events • Familiar events like: impressions, clicks, conversions • Which impression/click contributed to conversion? 8#EUdev3
  • 9. Example attribution • Alice goes to her favorite news site, and sees 3 ads – impressions • She clicks on one of them that leads to Macy’s – click • She buys something on Macy’s – conversion • Her purchase can be attributed to the click and impression events 9#EUdev3
  • 10. The approach • Join conversions with impressions and clicks on userId • Go through each user and attribute conversions to correct target event (impressions/clicks) • Latest target events are valued more, so timestamp matters 10#EUdev3
  • 11. The scale • Impression: 250 billion • Clicks: 20 billion • Conversions: 50 billion • Join 50 billion x 250 billion 11#EUdev3 impressions clicks conversions
  • 12. What we will talk about today • Issues at scale • Skew • Optimizations • Ganglia 12#EUdev3
  • 13. Driver OOM Exception in thread "map-output-dispatcher-12" java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253) at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211) at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145) at java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:615) at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:614) at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:614) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1287) at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:617) 13#EUdev3
  • 14. Driver OOM • Array of mapStatuses of size m, each status contains info about how the block is used by each reducer (n). • m (mappers) x n (reducers) 14#EUdev3 Status1 reducer1 reducer2 Status2 reducer1 reducer2
  • 15. Driver OOM • 2 types of mapStatus: highly compressed vs compressed • HighlyCompressedMapStatus tracks reduce partition average size, with bitmap tracking which blocks are empty for each reducer 15#EUdev3
  • 16. Driver OOM • Reduce number of partitions on either side • 300k x 75k à 100k x 75k 16#EUdev3
  • 17. Disable unnecessary GC • spark.cleaner.periodicGC.interval • GC cycles “stop the world”. • Large heaps means longer GC • Set to a long period (e.g. twice the length of your job) 17#EUdev3
  • 18. Disable unnecessary GC • ContextCleaner uses weak references to keep track of every RDD, ShuffleDependency, and Broadcast, and registers when the objects go out of scope • periodicGCService is a single-thread executor service that calls the JVM garbage collector periodically 18#EUdev3
  • 19. Allow extra time • spark.rpc.askTimeout • spark.network.timeout • in case of GC, our heap size is so large, we will exceed the timeout. 19#EUdev3
  • 20. Spurious failures • Reading from s3 can be flaky, especially when reading millions of files • Set spark.task.maxFailures higher than default of 3 • We set to < 10 to ensure true errors propagate out quickly 20#EUdev3
  • 21. What we will talk about today • Issues at scale • Skew • Optimizations • Ganglia 21#EUdev3
  • 22. The skew • Extreme skew in data • A few users have 100k+ events for 90 days. The average user has < 50 • Executors dying due to handful of extremely large partitions 22#EUdev3
  • 23. The skew • Out of 20.5B users, 20.2B have < 50 events 23#EUdev3 0 5E+09 1E+10 1.5E+10 2E+10 2.5E+10 0 5000 10000 15000 20000 25000 30000 35000 40000 45000 50000 55000 60000 65000 70000 75000 80000 85000 90000 95000 100000 105000 110000 115000 120000 125000 130000 135000 140000 145000 150000 155000 161000 167000 172000 177000 183000 191000 197000 204000 211000 223000 248000 271000 305000 328000 393000 504000 #ofusers # of events count of users with number of events bucketed by 1000s
  • 25. Strategy: increase # of partitions • First line of defense - increase number of partitions so skewed data is more spread out 25#EUdev3
  • 26. Strategy: Nest • Group conversions by userId, group target event by userId, then join the lists • Avoid cartesian joins 26#EUdev3
  • 28. Long tail: Spark UI • 50 min long tail, median 24 s 28#EUdev3
  • 29. Long tail: what else to do? • If you have domain specific knowledge of your data, use it to filter “bad” data out • Salt your data, and shuffle twice (but shuffling is expensive) • Use bloom filter if one side of your join is much smaller than the other 29#EUdev3
  • 30. Bloom Filter • Space-efficient probabilistic data structure to test whether an element is a member of a set • Size mainly determined by number of items in the filter, and the probability of false positives • No false negatives! • Broadcast filter out to executors 30#EUdev3
  • 31. Bloom Filter • Using a high false positive rate, still very good filter • P = 5% -> 80% filtered out • Subsequent join much faster 31#EUdev3
  • 32. Bloom Filter • Tradeoff between accuracy & size • We’ve had great success with Bloom Filters with size of < 5G • Experiment with Bloom Filters 32#EUdev3
  • 33. Bloom Filter Applied • For conversions of 50 billion, false positive rate of 0.1%, filter size is 80GB • False positive rate of 5%, filter size is 35GB • Still too big 33#EUdev3
  • 34. Long tail: what else to do? • If you have domain specific knowledge of your data, use it to filter “bad” data out • Salt your data, and shuffle twice (but shuffling is expensive) • Use bloom filter if one side of your join is much smaller than the other 34#EUdev3
  • 36. Long tail: what is it doing? • Look at executor threads during long tail com.esotericsoftware.kryo.util.IdentityObjectIntMap.clear(IdentityObjectIntMap.java:382) com.esotericsoftware.kryo.util.MapReferenceResolver.reset(MapReferenceResolver.java:65) com.esotericsoftware.kryo.Kryo.reset(Kryo.java:865) com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:630) org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:209 ) org.apache.spark.serializer.SerializationStream.writeValue(Serializer.scala:134) org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:239) org.apache.spark.util.collection.WritablePartitionedPairCollection$$anon$1.writeNext(Wri tablePartitionedPairCollection.scala:56) org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scal a:699) 36#EUdev3
  • 37. Long tail: what is it doing? • Mappers writing to shuffle space taking long • Need to reduce data size before going into shuffle 37#EUdev3
  • 38. Long tail: what is it doing? • Events in the long tail had almost identical information, spread over time. • For each user, if we retain just 1 event per hour, at 90 days, it is around 2k events. • However, this means we need to group by user first, which requires a shuffle, which defeats the whole purpose of this exercise, right? 38#EUdev3
  • 39. Strategy: Filter during map side combine • Use combineByKey and maximize map side combine • Thin collection out during map side combine -> less is written to shuffle space 39#EUdev3
  • 41. Still slow… • What else can I do? 41#EUdev3
  • 42. What we will talk about today • Issues at scale • Skew • Optimizations • Ganglia 42#EUdev3
  • 43. Avoid shuffles • Reuse the same partitioner instance 43#EUdev3
  • 45. Avoid shuffles • Denormalize data or union data to minimize shuffle • Rely on the fact we will reduce into a highly compressed key space. • For example, we want count of events by campaign, also count of events by site 45#EUdev3
  • 47. Coalesce partitions when loading • Loading many small files – coalesce down # of partitions • No shuffle • Reduce task overhead, greatly improve speed • Going from 300k partitions to 60k, cut time by half 47#EUdev3
  • 48. Coalesce partitions when loading final JavaRDD<Event> eventRDD= loadDataFromS3(); // load data final int loadingPartitions = eventRDD.getNumPartitions(); // inspect how many partitions final int coalescePartitions = loadingPartitions / 5; // use algorithm to calculate new # eventRDD .coalesce(coalescePartitions) // coalesce to smaller # .map(e -> transform(e)) // faster subsequent operations 48#EUdev3
  • 49. Materialize data • Large chunk of data persisted in memory • Large RDD used to calculate small RDD • Use an Action to materialize the smaller calculated result so larger data can be unpersisted 49#EUdev3
  • 50. Materialize data parent.cache() // persist large parent PairRDD to memory child1 = parent.reduceByKey(a).cache() // calculate child1 from parent child2 = parent.reduceByKey(b).cache() // calculate child2 from parent child1.count()// perform an Action child2.count()// perform an Action parent.unpersist() // safe to mark parent as unpersisted // rest of the code can use memory 50#EUdev3
  • 51. What we will talk about today • Issues at scale • Skew • Optimizations • Ganglia 51#EUdev3
  • 52. Ganglia • Ganglia is an extremely useful tool to understand performance bottlenecks, and to tune for highest cluster utilization 52#EUdev3
  • 54. Ganglia: CPU wave • Executors are going into GC multiple times in the same stage • Running out of execution memory • Persist to StorageLevel.DISK_ONLY() 54#EUdev3
  • 56. Ganglia: inefficient use • Decrease # of partitions of RDDs used in this stage 56#EUdev3
  • 58. Final Configuration • Master 1 r3.4xl • Executors 110 r3.4xl • Configurations: 58#EUdev3 spark maximizeResourceAllocation TRUE spark-defaults spark.executor.cores 16 spark-defaults spark.dynamicAllocation.enabled FALSE spark-defaults spark.driver.maxResultSize 8g spark-defaults spark.rpc.message.maxSize 2047 spark-defaults spark.rpc.askTimeout 300 spark-defaults spark.network.timeout 300s spark-defaults spark.executor.heartbeatInterval 20s spark-defaults spark.executor.memory 92540m spark-defaults spark.yarn.executor.memoryOverhead 23300 spark-defaults spark.task.maxFailures 10 spark-defaults spark.executor.extraJavaOptions -XX:+UseG1GC spark-defaults spark.cleaner.periodicGC.interval 600min
  • 59. Summary • Large jobs are special, use special settings • Outsmart the skew • Use Ganglia! 59#EUdev3