Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail of a Shared-Nothing Architecture [Performance]

Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité
Publicité

Consultez-les par la suite

1 sur 26 Publicité

Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail of a Shared-Nothing Architecture [Performance]

Télécharger pour lire hors ligne

Talk Abstract

As with all open-source databases, Accumulo developers must often balance building exciting new features against hacking on performance and stability. As the core features solidify and expand, we see many opportunities to improve performance. An effective methodology for performance improvement is scientific in nature, and follows a well-defined modeling and simulation approach, matching theory to experimentation in an iterative fashion.

Ingest performance is one of Accumulo's most differentiating characteristics. However, there is still much room for improvement for typical ingest-heavy applications. Accumulo supports two mechanisms for bringing data in: streaming ingest and bulk ingest. In bulk ingest, the goal is to maximize throughput without constraining latency. Bulk ingest involves creating a set of files that conform to Accumulo's internal RFile format and then registering those files with Accumulo. MapReduce provides a framework for generating, sorting, and storing key/value pairs, which are the primary steps in preparing RFiles for bulk ingest. MapReduce has been used many times over the years to break sorting records, such as Terasort. We can expect it to be a reasonable choice for maximizing bulk ingest throughput. However, the theory often proves challenging to implement, as there are many performance pitfalls along the way.

In this talk, we dive deep into optimizing MapReduce for Accumulo bulk ingest. We share detailed theoretical and empirical performance models, we discuss techniques for profiling performance, and we suggest reusable techniques for squeezing the maximum performance out of enterprise-grade Accumulo bulk ingest.

Speaker

Chris McCubbin
Director of Data Science, Sqrrl

Chris is the Director of Data Science for Sqrrl. He has extensive experience with the Hadoop ecosystem and with applying scientific computation algorithms to real-world datasets. Previously, Chris developed Big Data analysis tools for the Intelligence Community and applied artificial intelligence techniques to unmanned vehicle systems. He holds an MS in Computer Science and a BS in Computer Science and Mathematics from the University of Maryland.

Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail of a Shared-Nothing Architecture [Performance]

  1. Securely explore your data
     PERFORMANCE MODELS FOR APACHE ACCUMULO: THE HEAVY TAIL OF A SHARED-NOTHING ARCHITECTURE
     Chris McCubbin, Director of Data Science, Sqrrl Data, Inc.
  2. TODAY'S TALK
     1. Quick intro to performance optimization
     2. Techniques for targeted, model-driven performance improvement of distributed applications
     3. A deep dive into improving bulk load application performance
     4. A shallow dive into partial schemas
  3. SO, YOUR DISTRIBUTED APPLICATION IS SLOW
     • Today's distributed applications run on tens or hundreds of library components
     • Many versions exist, so internet advice can be ineffective, or worse, flat-out wrong
     • Hundreds of settings
       • Some, shall we say, could be better documented
     • Shared-nothing architectures are usually "shared-little" architectures with tricky interactions
     • Profiling is hard and time-consuming
  4. ROUND UP THE 'USUAL SUSPECTS'?
     • "Common knowledge" says that some things can cause performance issues:
       • Too much network usage
       • Disk bound
       • Stragglers
       • Framework settings
       • Unbalanced distribution
       • SerDe (serialization/deserialization)
     • This might be a good start, but we really want to focus on the biggest problem if we can
     • Technology, installations, and use cases have high variability: what works for one job on one cluster may be useless on another
  5. PERFORMANCE ANALYSIS CYCLE
     [Cycle diagram] Start: Create Model → Simulate & Experiment → Analyze → Modify Code → Refine Model → repeat. Outputs: Better Code + Models
  6. MAKING A MODEL
     • Determine points of low-impact metrics
       • Add some if needed
     • Create parallel state machine models with components driven by these metrics
     • Estimate running times and bottlenecks from a priori information and/or apply measured statistics
     • Focus testing on validation of the initial model and the (estimated) pain points
     • Apply Amdahl's Law (see the formula below)
     • Rinse, repeat
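     For reference (not spelled out on the slide), Amdahl's Law bounds the overall speedup obtainable by optimizing a single phase of a job. With p the fraction of total runtime spent in the optimized phase and s the speedup achieved within that phase:

         S_{\text{overall}} = \frac{1}{(1 - p) + p/s}

     So a phase that accounts for only 10% of the runtime can never yield more than about a 1.11x overall speedup, no matter how much it is improved.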
  7. The Apache Accumulo™ sorted, distributed key/value store is a secure, robust, scalable, high performance data storage and retrieval system.
     • Many applications in real-time storage and analysis of "big data":
       • Spatio-temporal indexing in non-relational distributed databases - Fox et al., 2013 IEEE International Congress on Big Data
       • Big Data Dimensional Analysis - Gadepally et al., IEEE HPEC 2014
     • Leading its peers in performance and scalability:
       • Achieving 100,000,000 database inserts per second using Accumulo and D4M - Kepner et al., IEEE HPEC 2014
       • An NSA Big Graph experiment (Technical Report NSA-RD-2013-056002v1)
       • Benchmarking Apache Accumulo BigData Distributed Table Store Using Its Continuous Test Suite - Sen et al., 2013 IEEE International Congress on Big Data
     For more papers and presentations, see http://accumulo.apache.org/papers.html
  8. SCALING UP: DIVIDE & CONQUER
     • Collections of KV pairs form Tables
     • Tables are partitioned into Tablets
     • Metadata tablets hold info about other tablets, forming a 3-level hierarchy
     • A Tablet is a unit of work for a Tablet Server
     [Diagram: a well-known location in ZooKeeper points to the Root Tablet (-∞ to ∞), which points to Metadata Tablet 1 (-∞ to "Encyclopedia:Ocelot") and Metadata Tablet 2 ("Encyclopedia:Ocelot" to ∞); these in turn point to the data tablets of the example tables Adam's Table (-∞ to ∞), Encyclopedia (-∞ : Ocelot, Ocelot : Yak, Yak : ∞), and Foo (-∞ : thing, thing : ∞).]
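     As a side note (not from the slides), because a tablet is the unit of work, an application can pre-split a table into tablets before a large load so that the work spreads across tablet servers. A minimal sketch against the Accumulo 1.x client API; the instance, ZooKeeper hosts, credentials, and split values are illustrative placeholders:

         import java.util.TreeSet;
         import org.apache.accumulo.core.client.Connector;
         import org.apache.accumulo.core.client.ZooKeeperInstance;
         import org.apache.accumulo.core.client.security.tokens.PasswordToken;
         import org.apache.hadoop.io.Text;

         public class PreSplit {
           public static void main(String[] args) throws Exception {
             Connector conn = new ZooKeeperInstance("accumulo", "zk1:2181")
                 .getConnector("root", new PasswordToken("secret"));
             TreeSet<Text> splits = new TreeSet<>();
             splits.add(new Text("Ocelot"));  // split points mirror the slide's example ranges
             splits.add(new Text("Yak"));
             // Each split point ends one tablet and starts the next; more tablets = more parallel units.
             conn.tableOperations().addSplits("Encyclopedia", splits);
           }
         }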
  9. BULK INGEST OVERVIEW
     • Accumulo supports two mechanisms to bring data in: streaming ingest and bulk ingest
     • Bulk ingest:
       • Goal: maximize throughput without constraining latency
       • Create a set of Accumulo RFiles by some means, then register those files with Accumulo
       • RFiles are groups of sorted key-value pairs with some indexing information
     • MapReduce has a built-in key sorting phase: a good fit to produce RFiles (a sketch of such a job follows below)
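     A minimal sketch of the pipeline this slide describes, using the Accumulo 1.x MapReduce integration. The table name, paths, and parsing logic are placeholders rather than anything from the talk, and a production job would use multiple reducers with a RangePartitioner matched to the table's tablet splits:

         import java.io.IOException;
         import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat;
         import org.apache.accumulo.core.data.Key;
         import org.apache.accumulo.core.data.Value;
         import org.apache.hadoop.conf.Configuration;
         import org.apache.hadoop.fs.Path;
         import org.apache.hadoop.io.LongWritable;
         import org.apache.hadoop.io.Text;
         import org.apache.hadoop.mapreduce.Job;
         import org.apache.hadoop.mapreduce.Mapper;
         import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
         import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

         public class BulkIngestJob {

           /** Placeholder mapper: turns each input line into one Accumulo Key/Value pair. */
           public static class ParseMapper extends Mapper<LongWritable, Text, Key, Value> {
             @Override
             protected void map(LongWritable pos, Text line, Context ctx)
                 throws IOException, InterruptedException {
               ctx.write(new Key(new Text(line), new Text("cf"), new Text("cq"), 0L),
                         new Value(new byte[0]));
             }
           }

           public static void main(String[] args) throws Exception {
             Job job = Job.getInstance(new Configuration(), "rfile generation");
             job.setJarByClass(BulkIngestJob.class);
             job.setInputFormatClass(TextInputFormat.class);
             FileInputFormat.addInputPath(job, new Path("/data/raw"));
             job.setMapperClass(ParseMapper.class);
             job.setMapOutputKeyClass(Key.class);
             job.setMapOutputValueClass(Value.class);
             // The shuffle sorts the keys; the (identity) reducer writes them out as sorted RFiles.
             job.setNumReduceTasks(1);
             job.setOutputKeyClass(Key.class);
             job.setOutputValueClass(Value.class);
             job.setOutputFormatClass(AccumuloFileOutputFormat.class);
             AccumuloFileOutputFormat.setOutputPath(job, new Path("/tmp/bulk/files"));
             job.waitForCompletion(true);

             // Then register the generated files with Accumulo (1.x API), e.g.:
             // connector.tableOperations().importDirectory("mytable", "/tmp/bulk/files",
             //     "/tmp/bulk/failures", false);
           }
         }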
  10. BULK INGEST MODEL
      [Timeline diagram: Map, then Reduce, then Register, laid out against time.]
  11. BULK INGEST MODEL
      Hypothetical resource usage along the same timeline:
      • Map: 100% CPU, 20% disk, 0% network, 46 seconds
      • Reduce: 40% CPU, 100% disk, 20% network, 168 seconds
      • Register: 10% CPU, 20% disk, 40% network, 17 seconds
  12. INSIGHT
      (Same hypothetical resource usage as the previous slide.)
      • Spare disk here, spare CPU there – can we even out resource consumption?
      • Why did reduce take 168 seconds? It should be more like 40 seconds.
      • No clear bottleneck during registration – is there a synchronization or serialization problem?
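      To make the stakes concrete (my arithmetic, using the slide's hypothetical numbers together with Amdahl's Law from earlier):

          T = 46 + 168 + 17 = 231 \text{ s}, \qquad 168 / 231 \approx 73\% \text{ of wall time is in reduce}

      Eliminating registration entirely would buy at most 231/214 ≈ 1.08x, while getting reduce down to the expected ~40 s gives 231/103 ≈ 2.2x. The model says reduce is where to look first.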
  13. LOOKING DEEPER: REFINED BULK INGEST MODEL
      [Two-lane timeline diagram. Map thread: Map Setup, Map, Sort, Spill, Merge, Serve. Reduce thread: Shuffle, Sort, Reduce, Output. A parallel latch marks where the reduce side waits on map output.]
  14. BULK INGEST MODEL PREDICTIONS
      • We can constrain parts of the model by physical throughput limitations:
        • Disk -> memory (~100 MB/s average sequential read rate of a 7200 rpm disk): input reader
        • Memory -> disk (~100 MB/s): spill, output writer
        • Disk -> disk (~50 MB/s): merge
        • Network (gigabit ≈ 125 MB/s): shuffle
      • And/or algorithmic limitations: sort, (our) map, (our) reduce, SerDe
      (A sketch of turning these limits into phase-time lower bounds follows below.)
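      One way to turn such limits into predictions (my own sketch, not code from the talk) is to bound each phase by the data it must move divided by the bandwidth of its most constrained resource; the intermediate-data size below is purely illustrative:

          public class PhaseTimeModel {
            // Bandwidth limits from the slide, in MB/s.
            static final double DISK_READ = 100, DISK_WRITE = 100, DISK_TO_DISK = 50, NETWORK = 125;

            /** Lower-bound time (seconds) for a phase that moves dataMb megabytes at the given bandwidth. */
            static double phaseSeconds(double dataMb, double bandwidthMbPerSec) {
              return dataMb / bandwidthMbPerSec;
            }

            public static void main(String[] args) {
              // Hypothetical job: 8 GB of intermediate map output per node (illustrative number only).
              double intermediateMb = 8 * 1024;
              System.out.printf("spill   >= %.0f s%n", phaseSeconds(intermediateMb, DISK_WRITE));
              System.out.printf("merge   >= %.0f s%n", phaseSeconds(intermediateMb, DISK_TO_DISK));
              System.out.printf("shuffle >= %.0f s%n", phaseSeconds(intermediateMb, NETWORK));
              // A phase's predicted time is the max over the resources it touches; comparing these
              // bounds against measured times highlights phases running slower than the hardware allows.
            }
          }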
  15. PERFORMANCE GOAL MODEL
      Performance goals obtained through:
      • Simulation of individual components
      • Prediction of available resources at runtime
  16. INSTRUMENTATION
      Job / run data: description: baseline; application version: 1.3.3; application sha: 8d17baf8; input type: arcsight;
      map num containers: 20; red num containers: 20; input block size: 32; input block count: 20; input total: 672054649;
      output map: 9313303723; output map:combine input records: 243419324; output map:combine records out: 209318830;
      output map:combine: 7301374577; output map:spill: 7325671992; output final: 573802787
      SYSTEM DATA: node num: 1; cores physical: 12; cores logical: 24; disk num: 8; disk bandwidth: 100; replication: 1; monitoring: TRUE
      YARN / MapReduce configuration: yarn.nodemanager.resource.memory-mb: 43008; yarn.scheduler.minimum-allocation-mb: 2048;
      yarn.scheduler.maximum-allocation-mb: 43008; yarn.app.mapreduce.am.resource.mb: 2048; yarn.app.mapreduce.am.command-opts: -Xmx1536m;
      mapreduce.map.memory.mb: 2048; mapreduce.map.java.opts: -Xmx1638m; mapreduce.reduce.memory.mb: 2048; mapreduce.reduce.java.opts: -Xmx1638m;
      mapreduce.task.io.sort.mb: 100; mapreduce.map.sort.spill.percent: 0.8; mapreduce.task.io.sort.factor: 10;
      mapreduce.reduce.shuffle.parallelcopies: 5; mapreduce.job.reduce.slowstart.completedmaps: 1;
      mapreduce.map.output.compress: FALSE; mapred.map.output.compression.codec: n/a
      TIME: map:setup avg: 8; map:map avg: 12; map:sort avg: 12; map:spill avg: 12; map:spill count: 7; map:merge avg: 46; map total: 290;
      red:shuffle avg: 6; red:merge avg: 38; red:reduce avg: 68; red:total avg: 112; red:reducer count: 20; job:total: 396
      RATIOS: input explosion factor: 13.877904; compression intermediate: 1.003327786; load combiner output: 0.783972562; total ratio: 0.786581455
      CONSTANTS: avg schema entry size (bytes): 59; effective MB/sec: 1.618488025
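      Many of the I/O figures in a table like this can be pulled straight from MapReduce's built-in job counters rather than collected by hand. A small sketch using the stock Hadoop counter API (my own illustration, not the instrumentation code used for the talk):

          import org.apache.hadoop.mapreduce.Counters;
          import org.apache.hadoop.mapreduce.Job;
          import org.apache.hadoop.mapreduce.TaskCounter;

          public class JobMetrics {
            /** Print a few counters comparable to the slide's map output and spill figures. */
            static void report(Job job) throws Exception {
              Counters c = job.getCounters();
              System.out.println("map output bytes : " + c.findCounter(TaskCounter.MAP_OUTPUT_BYTES).getValue());
              System.out.println("map output recs  : " + c.findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue());
              System.out.println("spilled records  : " + c.findCounter(TaskCounter.SPILLED_RECORDS).getValue());
              System.out.println("combine out recs : " + c.findCounter(TaskCounter.COMBINE_OUTPUT_RECORDS).getValue());
            }
          }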
  17. PERFORMANCE MEASUREMENT
      Baseline (naive implementation)
      [Measured timeline using the refined model's phases: Map Setup, Map, Sort, Spill, Merge, Serve on the map thread; Shuffle, Sort, Reduce, Output on the reduce thread.]
  18. PATH TO IMPROVEMENT
      1. Profiling revealed much time spent serializing/deserializing Accumulo's Key class
         • Supported by recent investigations on e.g. Spark jobs: "as much as half of the CPU time is spent deserializing and decompressing data." https://www.eecs.berkeley.edu/~keo/publications/nsdi15-final147.pdf
      2. With proper configuration, MapReduce supports comparison of MR keys in serialized form (see the sketch below)
      3. Rewriting Key's serialization led to an order-preserving encoding, easy to compare in serialized form
      4. Configure MapReduce to use native code to compare Keys
      5. Tweak map input size and spill memory for as few spills as possible
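      A generic sketch of what comparing keys in serialized form looks like in MapReduce. The key type below is hypothetical (it is not Accumulo's Key, whose rewritten encoding the slide refers to); the point is that an order-preserving serialization lets the sort compare raw bytes without deserializing:

          import java.io.DataInput;
          import java.io.DataOutput;
          import java.io.IOException;
          import org.apache.hadoop.io.WritableComparable;
          import org.apache.hadoop.io.WritableComparator;

          /** Hypothetical key whose serialized form sorts in the same order as the key itself. */
          public class OrderPreservingKey implements WritableComparable<OrderPreservingKey> {
            private long id;

            public void set(long id) { this.id = id; }

            @Override public void write(DataOutput out) throws IOException {
              out.writeLong(id ^ Long.MIN_VALUE);   // flip the sign bit so unsigned byte order matches signed order
            }
            @Override public void readFields(DataInput in) throws IOException {
              id = in.readLong() ^ Long.MIN_VALUE;
            }
            @Override public int compareTo(OrderPreservingKey o) { return Long.compare(id, o.id); }

            /** Raw comparator: compares the serialized bytes directly, never building Key objects. */
            public static class Raw extends WritableComparator {
              public Raw() { super(OrderPreservingKey.class, false); }
              @Override public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
                return WritableComparator.compareBytes(b1, s1, l1, b2, s2, l2);
              }
            }
            static { WritableComparator.define(OrderPreservingKey.class, new Raw()); }
          }
          // Or register explicitly on the job: job.setSortComparatorClass(OrderPreservingKey.Raw.class);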
  19. PERFORMANCE MEASUREMENT
      Optimized sorting
      • Improvements:
        • Time for map-side merge went down
        • Sort performance drastically improved in both map and reduce phases
        • 300% faster
  20. PERFORMANCE MEASUREMENT
      Optimized sorting
      Insights:
      • Map is slower than expected
      • Intermediate data inflation ratio (output from map) is very high, and the mapper is now disk-bound
        • Amdahl's law strikes again
      • Reducer output is also already disk-bound
      • Can we trade disk time in Map for 'free' CPU time in Reduce? (one standard lever is sketched below)
      [Timeline diagram with the same map-thread and reduce-thread phases as before.]
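      One standard lever for trading map-side CPU for disk and network is compressing intermediate map output; the instrumentation slide shows it was off in the baseline. This is a generic option, not necessarily the change the team made:

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.io.compress.CompressionCodec;
          import org.apache.hadoop.io.compress.SnappyCodec;

          public class IntermediateCompression {
            static Configuration configure(Configuration conf) {
              // Compress map output before spill/shuffle; costs map CPU, saves disk and network.
              // Requires the native Snappy library; any CompressionCodec works here.
              conf.setBoolean("mapreduce.map.output.compress", true);
              conf.setClass("mapreduce.map.output.compress.codec", SnappyCodec.class, CompressionCodec.class);
              return conf;
            }
          }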
  21. PATH TO IMPROVEMENT
      • Evaluation of data passed from map to reduce revealed inefficiencies:
        • Constant timestamp cost 8 bytes per key
        • Repeated column names could be encoded/compressed
        • Some Key/Value pairs didn't need to be created until reduce
        • Blocks of data output from the mapper are guaranteed to transfer 'en masse' to the same reducer
      • Hypothesis:
        • Create 'dehydrated' key-value pairs of consecutive values when possible (a toy sketch follows below)
        • Spend CPU time in reduce to 'rehydrate' the key-values prior to output
        • Fewer keys in shuffle also means the sort phase is more efficient
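      A toy illustration of the dehydrate/rehydrate idea. The class name, field layout, and choice of what to share are my own assumptions, not Sqrrl's format: consecutive entries that share a row, family, and timestamp are packed into one intermediate record and expanded back into Accumulo Key/Value pairs on the reduce side.

          import java.util.AbstractMap.SimpleEntry;
          import java.util.ArrayList;
          import java.util.List;
          import java.util.Map.Entry;
          import org.apache.accumulo.core.data.Key;
          import org.apache.accumulo.core.data.Value;
          import org.apache.hadoop.io.Text;

          /** Packs entries that share one row, column family, and timestamp (Writable serialization omitted). */
          public class DehydratedRow {
            final Text row;
            final Text family;
            final long timestamp;                                          // stored once, not 8 bytes per key
            final List<Entry<Text, byte[]>> columns = new ArrayList<>();   // (qualifier, value) pairs

            DehydratedRow(Text row, Text family, long timestamp) {
              this.row = row; this.family = family; this.timestamp = timestamp;
            }

            void add(Text qualifier, byte[] value) {
              columns.add(new SimpleEntry<>(qualifier, value));
            }

            /** Rehydrate on the reduce side into full Accumulo Key/Value pairs. */
            List<Entry<Key, Value>> rehydrate() {
              List<Entry<Key, Value>> out = new ArrayList<>(columns.size());
              for (Entry<Text, byte[]> c : columns) {
                out.add(new SimpleEntry<>(new Key(row, family, c.getKey(), timestamp), new Value(c.getValue())));
              }
              return out;
            }
          }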
  22. PERFORMANCE MEASUREMENT
      Optimized map code
      • Improvements:
        • Big speedup in map function: twice as fast
        • Reduced intermediate inflation sped up all steps between map and reduce
  23. DO TRY THIS AT HOME
      Hints for Accumulo application optimization. With these steps, we achieved a 6X speedup:
      • Perform comparisons on serialized objects
      • With Map/Reduce, calculate how many merge steps are needed (see the estimate sketched below)
      • Avoid premature data inflation
      • Leverage compression to shift bottlenecks
      • Always consider how fast your code should run
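      A back-of-envelope way to calculate how many merge steps a map task will need, based on the stock MapReduce sort parameters. This is a generic approximation of my own, not a formula from the talk, and the numbers in main() are hypothetical:

          public class MergeEstimate {
            /**
             * Rough estimate: a map task spills whenever the in-memory sort buffer fills,
             * then merges spill files roughly io.sort.factor at a time.
             */
            static void estimate(double mapOutputMb, double ioSortMb, double spillPercent, int ioSortFactor) {
              int spills = (int) Math.ceil(mapOutputMb / (ioSortMb * spillPercent));
              int mergePasses = (spills <= 1) ? 0
                  : (int) Math.ceil(Math.log(spills) / Math.log(ioSortFactor));
              System.out.printf("spills=%d, merge passes=%d%n", spills, mergePasses);
            }

            public static void main(String[] args) {
              // Hypothetical task: 500 MB of map output with io.sort.mb=100 and spill percent 0.8
              // gives 7 spills and one merge pass; raising io.sort.mb shrinks both.
              estimate(500, 100, 0.8, 10);
            }
          }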
  24. POSTSCRIPT: CARRYING IMPROVEMENTS INTO THE APPLICATION
      • Recall that we "dehydrated" consecutive KVs into one KV out of map, and "rehydrated" them in reduce
        • Specifically, for document storage
      • We can do this if we know the schema of the document in advance
      • What if we just store dehydrated documents on disk?
  25. POSTSCRIPT: PARTIAL SCHEMAS
      • Advantages:
        • Bulk ingest just got even faster (no rehydrate step)
        • Smaller disk footprint
        • Potentially faster query response
      • Potential issues:
        • Need to keep schemas around (but we still want flexible schemas)
        • How do you handle (lazy) updates?
        • Documents need to be rehydrated at some point... when? And what's the performance trade-off?
        • Perhaps we should model this?
      • To be continued...
  26. Securely explore your data
      QUESTIONS?
      Chris McCubbin, Director of Data Science, Sqrrl Data, Inc.
