

Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches

  2. 2 COURSE TOPICS
  - Ordering optimization approaches
  - Demystifying a compiler
  - Mastering compiler optimizations
  - Modern computer architecture concepts
  3. 3 OUTLINE
  - What is optimization?
  - Pragmatic approach
  - Optimization trade-offs
  - Knowledge which is required
  - Where to get the performance?
  - Optimization cycle
  - Top-Down (High-low) approach
  - Optimization cycle (revised)
  - Optimization steps overview
  - How to learn optimization?
  - Recommended literature
  - Summary
  4. 4 WHAT IS OPTIMIZATION? In computing, optimization is the process of modifying a system to make some aspect of it work more efficiently or use fewer resources; in particular, the process of transforming a piece of code to make it more efficient without changing its output.
  5. 5 PRAGMATIC APPROACH “Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.” - Donald Knuth, Structured Programming with go to Statements
  1. Find what to start from (3%)
  2. Know when to stop (97%)
  6. 6 OPTIMIZATION TRADE-OFFS
  - Code portability decreases as we go deeper
  - Performance portability decreases as we go deeper
  - The cost of maintenance and extensibility increases as we go deeper
  - Optimizations are often not reusable
  - Optimizations become obsolete very quickly
  ...but performance is still a crucial requirement for most applications.
  7. 7 KNOWLEDGE WHICH IS REQUIRED
  1. The code: the problem it solves, the algorithm it implements, the algorithmic complexity
  2. The compiler: the compilation trajectory, the compiler's capabilities and obstacles
  3. The platform: architecture capabilities, the Instruction Set Architecture, micro-architecture specifics
  8. 8 WHERE TO GET THE PERFORMANCE?
  - High level: the programmer
  - Middle level: the compiler
  - Low level: the hardware
  10. 10 TOP-DOWN (HIGH-LOW) APPROACH
  1. Understand the code
  2. Use appropriate algorithms
  3. Optimize memory access patterns
  4. Minimize the number of operations
  5. Shrink the critical path
  6. Perform HW-specific optimizations
  7. Dive into assembly
  12. 12 STEP #1: UNDERSTAND THE CODE
  - Different people think differently, so you'll need some time to get used to the code
  - Understand the data flow: input/output parameters, data dependencies
  - Identify performance limiters
  - Time, profile, and collect metrics (e.g. CPI, power consumption)
  13. 13 . 1 STEP #2: USE APPROPRIATE ALGORITHM
  - Consider and lower big-O complexity
  - Choose data structures wisely
  - Look for optimized libraries
  - Find opportunities to scalarize & parallelize
  14. 13 . 2 STEP #2: USE APPROPRIATE ALGORITHM Compilers are not aware of the semantics of the code, so focus on the algorithmic aspect first.
  - Decrease big-O complexity
  - Use optimized libraries for subroutines
  - Restructure the code to use fewer resources
  - Split the problem into subtasks and organize them wisely
  - Parallelize
  What if you need to sort 100 MB of numerical data... WHAT SORTING ALGORITHM WOULD YOU CHOOSE?
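For 100 MB of fixed-width numerical keys, one plausible answer is a non-comparison sort: an LSD radix sort does O(n) work per digit instead of the O(n log n) comparisons of a generic sort. A minimal sketch for 32-bit unsigned keys (my illustration, not from the slides):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* LSD radix sort on 32-bit keys: four counting passes, one byte
 * ("digit") per pass, O(n) work per pass, sequential memory access. */
void radix_sort_u32(uint32_t *a, size_t n)
{
    uint32_t *tmp = malloc(n * sizeof *tmp);
    if (!tmp)
        return;
    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[257] = { 0 };
        /* Histogram of the current byte. */
        for (size_t i = 0; i < n; ++i)
            count[((a[i] >> shift) & 0xFFu) + 1]++;
        /* Prefix sums give each bucket's starting offset. */
        for (int b = 0; b < 256; ++b)
            count[b + 1] += count[b];
        /* Stable scatter into tmp, then copy back. */
        for (size_t i = 0; i < n; ++i)
            tmp[count[(a[i] >> shift) & 0xFFu]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
}
```

The trade-off is the extra O(n) buffer and the fixed key width; for general comparison keys a tuned library sort remains the safer choice.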
  15. 14 . 1 STEP #3: OPTIMIZE MEMORY ACCESSES You'll be surprised how many algorithms are memory bound! Optimization for memory usually involves:
  - Data restructuring, to load only the data that is really needed for computations
  - Data packing, to shrink the data in size
  - Loop transformations, to walk through the data in a more efficient way, increase temporal & spatial locality, and perform cache-aware optimization
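Data restructuring can be illustrated with the classic AoS-vs-SoA layout choice (my sketch, not from the slides): when only one field is consumed, a structure-of-arrays layout loads only the bytes that are actually needed, while an array-of-structures drags the unused fields through the cache.

```c
#include <stddef.h>

/* Array-of-structures: reading x alone still pulls y and z into the
 * cache line, wasting two thirds of the loaded bytes. */
typedef struct { float x, y, z; } PointAoS;

float sum_x_aos(const PointAoS *p, size_t n)
{
    float s = 0.f;
    for (size_t i = 0; i < n; ++i)
        s += p[i].x;
    return s;
}

/* Structure-of-arrays: the x values are contiguous, so every loaded
 * byte is used and the loop auto-vectorizes easily. */
typedef struct { const float *x, *y, *z; } PointsSoA;

float sum_x_soa(PointsSoA p, size_t n)
{
    float s = 0.f;
    for (size_t i = 0; i < n; ++i)
        s += p.x[i];
    return s;
}
```

Both functions compute the same sum; only the memory traffic differs.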
  16. 14 . 2 STEP #3: OPTIMIZE MEMORY ACCESSES Compilers are quite good at local optimizations, such as loop-body transformations, local function inlining, and arithmetic expression simplification, so help the compiler rather than trying to outfox it. Work cohesively with it: enable auto-vectorization, optimize critical loops, vectorize.
  17. 14 . 3 STEP #3: OPTIMIZE MEMORY ACCESSES
  for (int j = 0; j < height; j++)
      for (int i = 0; i < width; i++)
          if (img[j * width + i] > 0) count++;

  for (int i = 0; i < width; i++)
      for (int j = 0; j < height; j++)
          if (img[j * width + i] > 0) count++;
  WHICH IS MORE EFFICIENT ON A CONVENTIONAL CPU?
  18. 14 . 4 STEP #3: OPTIMIZE MEMORY ACCESSES
  for (int j = 0; j < height; j++)
      for (int i = 0; i < width; i++)
          if (img[j * width + i] > 0) count++;

  for (int i = 0; i < width; i++)
      for (int j = 0; j < height; j++)
          if (img[j * width + i] > 0) count++;
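Both loop nests count the same pixels, but C stores `img` row-major, so the first order walks memory sequentially and uses each fetched cache line fully, while the second strides by `width` bytes per access. A self-contained version of the two traversals (the wrapper functions are mine, not from the slides):

```c
#include <stddef.h>

/* Row-major: the inner index i walks adjacent bytes, so the traversal
 * is cache friendly. */
int count_row_major(const unsigned char *img, int width, int height)
{
    int count = 0;
    for (int j = 0; j < height; j++)
        for (int i = 0; i < width; i++)
            if (img[j * width + i] > 0)
                count++;
    return count;
}

/* Column-major: each access jumps `width` bytes; on a large image
 * nearly every access touches a different cache line. */
int count_col_major(const unsigned char *img, int width, int height)
{
    int count = 0;
    for (int i = 0; i < width; i++)
        for (int j = 0; j < height; j++)
            if (img[j * width + i] > 0)
                count++;
    return count;
}
```

The results are identical; only the access pattern, and hence the cache behavior, differs.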
  19. 15 . 1 STEP #4: MINIMIZE NUMBER OF OPERATIONS Reducing the number of operations in a program doesn't necessarily decrease its runtime, but it's a good heuristic. The compiler usually helps a lot here:
  - Machine-independent optimizations: common subexpression elimination, constant propagation, redundancy elimination, ...
  - Machine-dependent optimizations: register allocation, instruction selection, instruction scheduling, peephole optimization, ...
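Common subexpression elimination, for example, can be shown on a toy function (my illustration; compilers perform this transformation automatically at typical optimization levels):

```c
/* Before CSE: the subexpression (a + b) appears twice in the source. */
int eval_before(int a, int b, int c)
{
    return (a + b) * c + (a + b);
}

/* After CSE: the common subexpression is computed once and reused.
 * Both forms return the same value; this one does one addition less. */
int eval_after(int a, int b, int c)
{
    int t = a + b;
    return t * c + t;
}
```

The same before/after framing applies to constant propagation and redundancy elimination: the observable result is unchanged, only the operation count drops.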
  20. 15 . 2 STEP #4: MINIMIZE NUMBER OF OPERATIONS
  float pows(float a, float b, float c, float d, float e, float f, float x)
  {
      return a * powf(x, 5.f) + b * powf(x, 4.f) + c * powf(x, 3.f)
           + d * powf(x, 2.f) + e * x + f;
  }
  gcc -march=armv7-a -mfpu=neon-vfpv4 -mthumb -mfloat-abi=softfp -O3 1.c -S -o 1.s
  21. 15 . 3 STEP #4: MINIMIZE NUMBER OF OPERATIONS
  pows:
      push {r3, lr}
      flds s17, [sp, #56]
      fmsr s24, r1
      movs r1, #0
      fmsr s22, r0
      movt r1, 16544
      fmrs r0, s17
      fmsr s21, r2
      fmsr s20, r3
      flds s19, [sp, #48]
      flds s18, [sp, #52]
      bl powf(PLT)
      mov r1, #1082130432
      fmsr s23, r0
      fmrs r0, s17
      bl powf(PLT)
      movs r1, #0
      movt r1, 16448
      fmsr s16, r0
      fmrs r0, s17
      bl powf(PLT)
      fmuls s16, s16, s24
      vfma.f32 s16, s23, s22
      fmsr s15, r0
      vfma.f32 s16, s15, s21
      fmuls s15, s17, s17
      vfma.f32 s16, s20, s15
      vfma.f32 s16, s19, s17
      fadds s15, s16, s18
      fldmfdd sp!, {d8-d12}
      fmrs r0, s15
      pop {r3, pc}
  ... Let's apply Horner's rule.
  22. 15 . 4 STEP #4: MINIMIZE NUMBER OF OPERATIONS
  float horner(float a, float b, float c, float d, float e, float f, float x)
  {
      return ((((a * x + b) * x + c) * x + d) * x + e) * x + f;
  }
  horner:
      flds s15, [sp, #8]
      fmsr s11, r0
      fmsr s12, r1
      flds s14, [sp]
      vfma.f32 s12, s11, s15
      fmsr s11, r2
      flds s13, [sp, #4]
      vfma.f32 s11, s12, s15
      fcpys s12, s11
      fmsr s11, r3
      vfma.f32 s11, s12, s15
      vfma.f32 s14, s11, s15
      vfma.f32 s13, s14, s15
      fmrs r0, s13
      bx lr
  23. 15 . 5 STEP #4: MINIMIZE NUMBER OF OPERATIONS Unfortunately, the compiler sometimes fails at some optimization steps (e.g. register allocation, scalarization) and harms performance by introducing redundant operations. Starting from this optimization step it is worth looking at the assembly code to check whether the compiler is actually performing a particular optimization.
  24. 16 . 1 STEP #5: SHRINK THE CRITICAL PATH
  - The critical path is the longest sequence of operations in a code block that must be completed in order, usually caused by dependencies between steps or operations.
  - The critical path of a code block is hardly deducible from high-level code and requires assembly inspection.
  - Knowledge of the architecture's capabilities is required to estimate the critical path more precisely.
  - Some profilers are able to do critical path analysis.
  - The term can also refer to the longest sequence of dependent steps in a pipeline that limits its parallelization; a control-flow diagram is used to find that critical path.
  25. 16 . 2 STEP #5: SHRINK THE CRITICAL PATH Let's look at the critical path of the following code block.
  const uint8_t* p0 = src.ptr(row0);
  const uint8_t* p1 = src.ptr(row1);
  uint8_t* dptr = dst.ptr(row);
  for (int col = 0; col < cols; ++col)
  {
      dptr[col] = (p0[col*2] + p0[col*2+1] + p1[col*2] + p1[col*2+1] + 2) >> 2;
  }
  WHAT IS THE CRITICAL PATH OF THIS CODE LINE?
  26. 16 . 3 STEP #5: SHRINK THE CRITICAL PATH Let's create a 3-address representation of the code block:
  r0 = col*2            // 1
  r1 = r0+1             // 2
  r2 = load(sptr0, r0)  // 3
  r3 = load(sptr0, r1)  // 4
  r4 = load(sptr1, r0)  // 5
  r5 = load(sptr1, r1)  // 6
  r6 = r2+r3            // 7
  r7 = r6+r4            // 8
  r8 = r7+r5            // 9
  r9 = r8+2             // 10
  r10 = shr(r9, 2)      // 11
  11? Let's construct the dependency graph...
  27. 16 . 4 STEP #5: SHRINK THE CRITICAL PATH 8 ?
  28. 16 . 5 STEP #5: SHRINK THE CRITICAL PATH But the compiler reorders instructions, since integer addition is associative: 6?
  29. 16 . 6 STEP #5: SHRINK THE CRITICAL PATH Now let's assume the hardware schedules 1 arithmetic and 1 memory operation per clock. AND BACK TO 8 AGAIN
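The dependency-graph count can be checked mechanically: model each operation as a node, add an edge for every register dependency, and take the longest chain. A minimal sketch (my code; the dependency lists are transcribed from the 3-address form, with operations renumbered 0..10, and it reproduces the length-8 path found before any reordering):

```c
/* deps[i] lists the operations that operation i depends on; -1 ends
 * each list. Operations 0..10 are the eleven 3-address instructions. */
static const int deps[11][4] = {
    { -1 },         /* 0:  r0 = col*2           */
    { 0, -1 },      /* 1:  r1 = r0+1            */
    { 0, -1 },      /* 2:  r2 = load(sptr0, r0) */
    { 1, -1 },      /* 3:  r3 = load(sptr0, r1) */
    { 0, -1 },      /* 4:  r4 = load(sptr1, r0) */
    { 1, -1 },      /* 5:  r5 = load(sptr1, r1) */
    { 2, 3, -1 },   /* 6:  r6 = r2+r3           */
    { 6, 4, -1 },   /* 7:  r7 = r6+r4           */
    { 7, 5, -1 },   /* 8:  r8 = r7+r5           */
    { 8, -1 },      /* 9:  r9 = r8+2            */
    { 9, -1 },      /* 10: r10 = shr(r9, 2)     */
};

/* Length (in operations) of the longest dependency chain ending at i:
 * a recursive longest-path walk over the dependency DAG. */
int chain_len(int i)
{
    int best = 0;
    for (int k = 0; deps[i][k] >= 0; ++k) {
        int c = chain_len(deps[i][k]);
        if (c > best)
            best = c;
    }
    return best + 1;
}
```

The chain ending at the final shift is 0 → 2 → 6 → 7 → 8 → 9 → 10 plus one of the length-3 load chains, giving 8. Modeling the reordered sum or the 1-arithmetic/1-memory-per-clock constraint would require editing the dependency lists or adding resource constraints to the walk.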
  30. 17 . 1 STEP #6: DO HW-SPECIFIC OPTIMIZATION It requires a comprehensive understanding of the target HW, which usually goes beyond the compiler's abilities:
  - Using special hardware capabilities
  - Overcoming micro-architecture weaknesses
  - Using instructions which are specific to the concrete HW
  - Balancing the usage of different instruction types
  A classical example here is the question of recomputing a temporary vs. fetching it from memory.
  31. 17 . 2 STEP #6: DO HW-SPECIFIC OPTIMIZATION Modern hardware is quite advanced: deep pipelines, out-of-order execution, sophisticated branch prediction, multi-level memory hierarchies, processor specialization. So utilize the unique properties of the hardware. Peephole optimization is not as important as it used to be 10 years ago.
  32. 18 STEP #7: DIVE INTO ASSEMBLY Assembly is a must-have for checking the compiler, but it is rarely used to write low-level code. Raw assembly makes sense to:
  - Overcome compiler bugs & optimization limitations: addition of redundant instructions, suboptimal register allocation
  - Use specific hardware features which are not expressed in the higher-level ISA
  Keep in mind that:
  - Assembly writing is the least portable optimization
  - Inline assembly limits compiler optimizations
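A tiny GCC/Clang extended-asm sketch (my illustration; the `#if` guard limits the asm to x86 so the function still builds elsewhere) shows the mechanism, including how the constraint list tells the compiler which registers the asm reads and writes:

```c
/* Adds two ints with one inline x86 ADD instruction. The "+r"
 * constraint declares r as a read-write register operand and "r"(b)
 * as a read-only input, keeping the register allocator informed. */
static inline int add_asm(int a, int b)
{
#if defined(__x86_64__) || defined(__i386__)
    int r = a;
    __asm__("addl %1, %0" : "+r"(r) : "r"(b));
    return r;
#else
    return a + b;   /* portable fallback on other architectures */
#endif
}
```

Even this trivial example shows the portability cost: every target needs its own asm body, and mis-declared constraints silently break surrounding compiler optimizations.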
  33. 19 HOW TO LEARN OPTIMIZATION? Optimization is a craft rather than a science.
  - Practice more. Do not make practical knowledge too theoretical.
  - Look at what other people do. Find real use cases of different optimization approaches and techniques.
  - Dig into the architecture. HW evolves rapidly, so devices become obsolete in a wink; comprehensive knowledge helps you see ahead.
  34. 20 . 1 RECOMMENDED LITERATURE Computer Architecture, Fifth Edition: A Quantitative Approach by John L. Hennessy and David A. Patterson
  35. 20 . 2 RECOMMENDED LITERATURE The Mature Optimization Handbook by Carlos Bueno
  36. 20 . 3 RECOMMENDED LITERATURE Is Parallel Programming Hard, And, If So, What Can You Do About It? by Paul E. McKenney
  37. 20 . 4 RECOMMENDED LITERATURE Engineering a Compiler by Keith Cooper and Linda Torczon
  38. 21 SUMMARY
  - Practice, look at what others do, and dig into the architecture.
  - The main task of an optimizer is finding the critical part.
  - An optimizer's mastery is knowing where to stop.
  - Knowledge about the code, the compiler and the platform is a must-have.
  - Optimization is a measure-analyze-optimize-check cycle.
  - Stick to the high-to-low approach: get the performance from algorithmic and data-structure choices first, ensure good memory access patterns next, then go deeper.
  39. 22 THE END / 2015-2016 MARINA KOLPAKOVA