
"Quantum" Performance Effects

Nowadays, CPU microarchitecture is hidden from developers by compilers, VMs, etc.
Do Java developers need to know the microarchitecture details of modern processors?
Or is that like learning quantum mechanics in order to cook?
Are Java developers safe from low-level microarchitecture details leaking into high-level application performance behaviour?
We will try to answer these questions by analyzing several Java examples.

Published in: Software

  1. “Quantum” Performance Effects, v3.0, February 2015. Sergey Kuksenko, sergey.kuksenko@oracle.com, @kuksenk0
  2. The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle. Copyright © 2014, Oracle and/or its affiliates. All rights reserved.
  3. Intro
  4. Intro: performance engineering
     1. Computer Science → Software Engineering
        – Build software to meet functional requirements
        – Mostly don’t care about HW and data specifics
        – Abstract and composable, “formal science”
     2. Performance Engineering
        – “Real world strikes back!”
        – Exploring complex interactions between hardware, software, and data
        – Based on empirical evidence, i.e. “natural science”
  6. Intro: what’s the difference? Architecture vs. microarchitecture
     Architectures: x86, AMD64 (x86-64/Intel64), ARMv7, ...
     Microarchitectures: Nehalem, Sandy Bridge, Bulldozer, Bobcat, Cortex-A9
  8. Intro: SUTs (Systems Under Test)
     ∙ Intel® Core™ i5-4300M [2.6 GHz] 1x2x2
       – 𝜇arch: Haswell
       – launched: Q4’2013
       – OS: Xubuntu 14.04 (64-bit)
     ∙ Samsung Exynos 4412, ARMv7 [1.6 GHz] 1x4x1
       – 𝜇arch: Cortex-A9
       – launched: 2011
       – OS: Linaro 12.11
  9. Intro: SUTs (cont.)
     ∙ AMD Opteron™ 4274HE [2.5 GHz] 2x8x1
       – 𝜇arch: Bulldozer/Valencia
       – launched: Q4’2011
       – OS: Oracle Linux Server release 6.0 (64-bit)
     ∙ Intel® Xeon® CPU E5-2680 [2.70 GHz] 2x8x2
       – 𝜇arch: Sandy Bridge
       – launched: Q1’2012
       – OS: Oracle Linux Server release 6.3 (64-bit)
  10. Intro: JVM
     ∙ Java HotSpot™ “1.8.0_25” 32-bit
     ∙ Java HotSpot™ “1.8.0_25” 64-bit
     ∙ Java HotSpot™ Embedded “1.8.0-ea-b79”
  12. Intro: Demo code
      https://github.com/kuksenko/quantum
      ∙ Required: JMH (Java Microbenchmark Harness), http://openjdk.java.net/projects/code-tools/jmh/
  13. Core
  15. Demo1: double sum

      private double[] A = new double[2048];

      // Straightforward loop: a single running sum.
      @Benchmark
      public double test1() {
          double sum = 0.0;
          for (int i = 0; i < A.length; i++) {
              sum += A[i];
          }
          return sum;
      }

      // Manually unrolled by 4; relies on A.length being a multiple of 4.
      @Benchmark
      public double manualUnroll() {
          double sum = 0.0;
          for (int i = 0; i < A.length; i += 4) {
              sum += A[i] + A[i + 1] + A[i + 2] + A[i + 3];
          }
          return sum;
      }

      Results: test1 426 ops/ms; manualUnroll 1120 ops/ms.
  16. Demo1: looking into asm, test1

      loop:
        vaddsd 0x10(%edi,%eax,8), %xmm0, %xmm0
        vaddsd 0x18(%edi,%eax,8), %xmm0, %xmm0
        vaddsd 0x20(%edi,%eax,8), %xmm0, %xmm0
        vaddsd 0x28(%edi,%eax,8), %xmm0, %xmm0
        vaddsd 0x30(%edi,%eax,8), %xmm0, %xmm0
        vaddsd 0x38(%edi,%eax,8), %xmm0, %xmm0
        vaddsd 0x40(%edi,%eax,8), %xmm0, %xmm0
        vaddsd 0x48(%edi,%eax,8), %xmm0, %xmm0
        vaddsd 0x50(%edi,%eax,8), %xmm0, %xmm0
        vaddsd 0x58(%edi,%eax,8), %xmm0, %xmm0
        vaddsd 0x60(%edi,%eax,8), %xmm0, %xmm0
        vaddsd 0x68(%edi,%eax,8), %xmm0, %xmm0
        vaddsd 0x70(%edi,%eax,8), %xmm0, %xmm0
        vaddsd 0x78(%edi,%eax,8), %xmm0, %xmm0
        vaddsd 0x80(%edi,%eax,8), %xmm0, %xmm0
        vaddsd 0x88(%edi,%eax,8), %xmm0, %xmm0
        add    $0x10, %eax
        cmp    %ebx, %eax
        jl     loop
  17. Demo1: looking into asm, manualUnroll

      loop:
        vmovsd 0x48(%eax,%edx,8), %xmm0
        vmovsd %xmm0, (%esp)
        vmovsd 0x40(%eax,%edx,8), %xmm0
        vmovsd %xmm0, 0x8(%esp)
        vmovsd 0x78(%eax,%edx,8), %xmm0
        vaddsd 0x70(%eax,%edx,8), %xmm0, %xmm1
        vmovsd 0x80(%eax,%edx,8), %xmm2
        vmovsd 0x88(%eax,%edx,8), %xmm0
        vmovsd %xmm0, 0x10(%esp)
        vmovsd 0x38(%eax,%edx,8), %xmm0
        vaddsd 0x30(%eax,%edx,8), %xmm0, %xmm0
        vmovsd %xmm0, 0x18(%esp)
        vmovsd 0x58(%eax,%edx,8), %xmm0
        vaddsd 0x50(%eax,%edx,8), %xmm0, %xmm3
        vmovsd 0x28(%eax,%edx,8), %xmm4
        vmovsd 0x60(%eax,%edx,8), %xmm5
        vmovsd 0x68(%eax,%edx,8), %xmm6
        vmovsd 0x20(%eax,%edx,8), %xmm7
        vmovsd 0x18(%eax,%edx,8), %xmm0
        vaddsd 0x10(%eax,%edx,8), %xmm0, %xmm0
        vaddsd %xmm2, %xmm1, %xmm1
        vaddsd %xmm7, %xmm0, %xmm0
        vaddsd 0x10(%esp), %xmm1, %xmm1
        vaddsd %xmm4, %xmm0, %xmm0
        vaddsd %xmm5, %xmm3, %xmm2
        vaddsd 0x20(%esp), %xmm0, %xmm3
        vaddsd %xmm6, %xmm2, %xmm2
        vmovsd 0x18(%esp), %xmm0
        vaddsd 0x8(%esp), %xmm0, %xmm0
        vaddsd (%esp), %xmm0, %xmm0
        vaddsd %xmm0, %xmm3, %xmm0
        vaddsd %xmm0, %xmm2, %xmm0
        vaddsd %xmm0, %xmm1, %xmm0
        vmovsd %xmm0, 0x20(%esp)
        add    $0x10, %edx
        cmp    %ebx, %edx
        jl     loop
  20. Demo1: measure time

      // Report average time per array element instead of per method call.
      @Benchmark
      @BenchmarkMode(Mode.AverageTime)
      @OperationsPerInvocation(2048)
      public double test1() {
          double sum = 0.0;
          for (int i = 0; i < A.length; i++) {
              sum += A[i];
          }
          return sum;
      }

      @Benchmark
      @BenchmarkMode(Mode.AverageTime)
      @OperationsPerInvocation(2048)
      public double manualUnroll() {
          double sum = 0.0;
          for (int i = 0; i < A.length; i += 4) {
              sum += A[i] + A[i + 1] + A[i + 2] + A[i + 3];
          }
          return sum;
      }

      Results (CPI = Cycles Per Instruction):
      test1:        time = 1.15 ns/op, CPI ≈ 2.5
      manualUnroll: time = 0.44 ns/op, CPI ≈ 0.5
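The throughput scores from the earlier slide (426 and 1120 ops/ms) and the per-element times here (1.15 and 0.44 ns/op) describe the same measurements. A minimal sketch of the conversion, assuming one throughput "op" is a full pass over the 2048-element array; the class and method names are illustrative and are not taken from the demo repository:

```java
// Converts a throughput score (method calls per millisecond) into
// time per array element, the unit used once @OperationsPerInvocation(2048) is set.
final class ThroughputToTime {
    static final int ELEMENTS_PER_CALL = 2048;   // array length used in Demo1

    /** ops/ms (one op = one pass over the array) to ns per element. */
    static double nsPerElement(double opsPerMs) {
        double nsPerCall = 1_000_000.0 / opsPerMs;   // 1 ms = 1,000,000 ns
        return nsPerCall / ELEMENTS_PER_CALL;
    }

    public static void main(String[] args) {
        System.out.println("test1:        " + nsPerElement(426) + " ns/element");
        System.out.println("manualUnroll: " + nsPerElement(1120) + " ns/element");
    }
}
```

Both values round to the figures on the slide, which suggests the two benchmark modes are consistent with each other.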
  22. 𝜇arch: x86, CISC and RISC
      A modern x86 CPU is not what it seems: all (CISC) instructions are dynamically translated into RISC-like micro-operations (𝜇ops).
  23. 𝜇arch: Intel’s internals
      Diagram source: http://commons.wikimedia.org/wiki/File:Intel_Nehalem_arch.svg (c) Appaloosa, CC BY-SA 3.0
  24. 𝜇arch: simplified scheme
  25. 𝜇arch: looking into instruction tables (Haswell 𝜇arch; values in clock cycles)

      Operation                        Latency   1/Throughput
      addition (floating-point)        3         1
      multiplication (floating-point)  5         0.5
      addition (integer)               1         0.25
      multiplication (integer)         3         1
  26. Demo1: test1, looking into asm again
      (same loop listing as on slide 16)
      time = 1.15 ns/op; CPI ≈ 2.5; ≈ 3 cycles/op; unrolled by 16; 19 instructions
  27. Demo1: test1, structural view
      (same figures as on slide 26)
  28. Demo1: manualUnroll
  29. Demo1: manualUnroll, structural view
      time = 0.44 ns/op; CPI ≈ 0.5; ≈ 1.14 cycles/op; unrolled by 4 × 4; 37 instructions
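The cycles/op and CPI figures on these slides can be reproduced from the measured times and the instruction counts of the compiled loops. The sketch below assumes the CPU runs at its nominal 2.6 GHz (no frequency scaling) and that each loop iteration processes 16 array elements; the class and method names are illustrative:

```java
// Back-of-the-envelope CPI estimate from time per element and loop shape.
final class CpiEstimate {
    static final double CLOCK_GHZ = 2.6;   // assumed fixed clock of the Haswell SUT

    /** Cycles spent per array element: ns * (cycles per ns). */
    static double cyclesPerElement(double nsPerElement) {
        return nsPerElement * CLOCK_GHZ;
    }

    /** Cycles per instruction for one loop iteration. */
    static double cpi(double nsPerElement, int elementsPerIteration, int instructionsPerIteration) {
        double cyclesPerIteration = cyclesPerElement(nsPerElement) * elementsPerIteration;
        return cyclesPerIteration / instructionsPerIteration;
    }

    public static void main(String[] args) {
        // test1: 16 elements and 19 instructions per iteration
        System.out.println("test1 CPI:        " + cpi(1.15, 16, 19));
        // manualUnroll: 16 elements and 37 instructions per iteration
        System.out.println("manualUnroll CPI: " + cpi(0.44, 16, 37));
    }
}
```

Under these assumptions manualUnroll executes almost twice as many instructions per iteration yet finishes in well under half the cycles, which is the point the two structural views make.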
  31. 𝜇arch: Dependences
      Performance: the ILP (Instruction Level Parallelism) of many programs is limited by natural data dependencies.
      What to do? Break dependency chains!
  35. Demo1 (cont.): breaking chains in a “right” way

      // One dependency chain (test1)
      ...
      for (int i = 0; i < A.length; i++) {
          sum += A[i];
      }
      return sum;

      // Two independent chains
      ...
      for (int i = 0; i < A.length; i += 2) {
          sum0 += A[i];
          sum1 += A[i + 1];
      }
      return sum0 + sum1;

      // Four independent chains
      ...
      for (int i = 0; i < A.length; i += 4) {
          sum0 += A[i];
          sum1 += A[i + 1];
          sum2 += A[i + 2];
          sum3 += A[i + 3];
      }
      return (sum0 + sum1) + (sum2 + sum3);
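The slide fragments rely on the array length being a multiple of the number of chains (2048 in the demo). A self-contained sketch of the four-chain variant that also handles a leftover tail; the class and method names are illustrative and not part of the demo code:

```java
// Sum with four independent accumulators, so consecutive floating-point
// additions do not have to wait for each other's results.
final class ChainedSum {
    static double sum4(double[] a) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int limit = a.length - (a.length % 4);   // largest multiple of 4 not exceeding length
        int i = 0;
        for (; i < limit; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        double tail = 0.0;
        for (; i < a.length; i++) {              // up to 3 leftover elements
            tail += a[i];
        }
        return (s0 + s1) + (s2 + s3) + tail;
    }
}
```

Because floating-point addition is not associative, this version can differ from the plain loop in the last bits of the result. That is also why a compiler cannot apply this transformation to `double` sums on its own without changing the program's semantics.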
  36. Demo1 (cont.): double sum, final results (time, ns/op)

                    Haswell   AMD    ARM
      manualUnroll  0.44      0.45   3.30
      test1         1.15      1.50   6.60
      test2         0.58      0.80   4.25
      test4         0.39      0.43   4.25
      test8         0.39      0.25   2.55
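The Haswell column can be approximated with a very simple model built from the instruction-table figures for floating-point addition (latency 3 cycles, reciprocal throughput 1 cycle): with k independent chains, each element costs roughly max(latency / k, 1/throughput) cycles. This is a simplified sketch and not the author's analysis; it ignores loads, loop overhead and frequency scaling, and the 2.6 GHz clock is an assumption:

```java
// Toy model: time per element for a sum split into k independent dependency chains.
final class ChainModel {
    static double predictedNsPerElement(int chains,
                                        double latencyCycles,
                                        double reciprocalThroughputCycles,
                                        double clockGhz) {
        // Either the chain latency (shared across k chains) or the execution
        // unit's issue rate is the bottleneck, whichever is larger.
        double cycles = Math.max(latencyCycles / chains, reciprocalThroughputCycles);
        return cycles / clockGhz;
    }

    public static void main(String[] args) {
        for (int k : new int[] {1, 2, 4, 8}) {
            System.out.println("chains=" + k + ": "
                    + predictedNsPerElement(k, 3.0, 1.0, 2.6) + " ns/element");
        }
    }
}
```

For 1, 2, 4 and 8 chains the model gives values close to the measured 1.15, 0.58, 0.39 and 0.39 ns/op, including the plateau once the adder is saturated. It is not meant to explain the AMD or ARM columns.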
  37. Demo2: results (time, ns/op)

                       Haswell   AMD    ARM
      DoubleMul.test1  2.84      2.52   8.17
      DoubleMul.test2  2.50      2.37   4.25
      DoubleMul.test4  0.48      0.49   3.15
      DoubleMul.test8  0.25      0.30   2.53
      IntMul.test1     1.14      1.16   10.04
      IntMul.test2     0.58      0.75   7.38
      IntMul.test4     0.38      0.67   4.64
      IntSum.test1     0.39      0.32   8.92
      IntSum.test2     0.24      0.48   6.12
  38. Branches: to jump or not to jump

      public int absSumBranch(int[] a) {
          int sum = 0;
          for (int x : a) {
              if (x < 0) {
                  sum -= x;
              } else {
                  sum += x;
              }
          }
          return sum;
      }

      loop:
        mov  0xc(%ecx,%ebp,4), %ebx
        test %ebx, %ebx
        jl   L1
        add  %ebx, %eax
        jmp  L2
      L1:
        sub  %ebx, %eax
      L2:
        inc  %ebp
        cmp  %edx, %ebp
        jl   loop
  39. Branches: to jump or not to jump

      public int absSumPredicated(int[] a) {
          int sum = 0;
          for (int x : a) {
              sum += Math.abs(x);
          }
          return sum;
      }

      loop:
        mov   0xc(%ecx,%ebp,4), %ebx
        mov   %ebx, %esi
        neg   %esi
        test  %ebx, %ebx
        cmovl %esi, %ebx
        add   %ebx, %eax
        inc   %ebp
        cmp   %edx, %ebp
        jl    loop
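The two methods compute the same value; only the generated code differs (a conditional jump versus a conditional move). A self-contained sketch that checks the equivalence on three input shapes. The generators `shuffled`, `sorted` and `regular` are hypothetical stand-ins for the demo's data sets, and whether `Math.abs` compiles to `cmovl` depends on the JIT and the platform:

```java
import java.util.Arrays;
import java.util.Random;

final class AbsSum {
    static int absSumBranch(int[] a) {
        int sum = 0;
        for (int x : a) {
            if (x < 0) { sum -= x; } else { sum += x; }   // data-dependent branch
        }
        return sum;
    }

    static int absSumPredicated(int[] a) {
        int sum = 0;
        for (int x : a) {
            sum += Math.abs(x);                            // no branch in the source
        }
        return sum;
    }

    /** Pseudo-random signs and magnitudes in [-1000, 1000]: hard to predict. */
    static int[] shuffled(int n, long seed) {
        Random r = new Random(seed);
        int[] a = new int[n];
        for (int i = 0; i < n; i++) a[i] = r.nextInt(2001) - 1000;
        return a;
    }

    /** Same values sorted: all negatives first, then all non-negatives. */
    static int[] sorted(int n, long seed) {
        int[] a = shuffled(n, seed);
        Arrays.sort(a);
        return a;
    }

    /** Alternating signs, i.e. the (+, -)* pattern. */
    static int[] regular(int n) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) a[i] = (i % 2 == 0) ? (i + 1) : -(i + 1);
        return a;
    }
}
```

Feeding these arrays to both methods in a JMH benchmark is one way to explore how branch predictability affects the two variants.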
  40. Demo3: results (time, ns/op). Regular pattern = (+, –)*

                           Nehalem   Haswell   AMD   ARM
      branch_sorted        0.9       0.5       1.0   5.0
      branch_regular       0.9       0.5       0.8   5.0
      branch_shuffled      6.4       1.0       2.8   9.4
      predicated_sorted    1.3       0.8       0.9   5.6
      predicated_regular   1.3       0.8       0.9   5.3
      predicated_shuffled  1.3       0.8       0.9   9.3
  41. Demo3: results (time, ns/op). Regular pattern = (+, +, –, +, –, –, +, –, –, +)*

                           Nehalem   Haswell   AMD   ARM
      branch_sorted        0.9       0.5       1.0   5.0
      branch_regular       1.6       0.9       1.0   5.0
      branch_shuffled      6.4       1.0       2.3   9.5
      predicated_sorted    1.3       0.8       0.9   5.6
      predicated_regular   1.3       0.8       0.9   5.3
      predicated_shuffled  1.3       0.8       0.9   9.3
  42. Demo4: && vs &

      public int countConditional(boolean[] f0, boolean[] f1) {
          int cnt = 0;
          for (int j = 0; j < SIZE; j++) {
              for (int i = 0; i < SIZE; i++) {
                  if (f0[i] && f1[j]) {
                      cnt++;
                  }
              }
          }
          return cnt;
      }

      &&: shuffled 1.8 ns/op; sorted 0.6 ns/op
  43. Demo4: && vs &

      public int countLogical(boolean[] f0, boolean[] f1) {
          int cnt = 0;
          for (int j = 0; j < SIZE; j++) {
              for (int i = 0; i < SIZE; i++) {
                  if (f0[i] & f1[j]) {
                      cnt++;
                  }
              }
          }
          return cnt;
      }

      &&: shuffled 1.8 ns/op; sorted 0.6 ns/op
      &:  shuffled 1.2 ns/op; sorted 1.2 ns/op
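On `boolean` operands, `&&` evaluates the right-hand side only when the left-hand side is true, while `&` always evaluates both sides. With side-effect-free operands the two give the same result, so the choice only affects the shape of the compiled code. A self-contained sketch; array lengths replace the slide's `SIZE` constant, and the class name is illustrative:

```java
final class CountFlags {
    /** Short-circuit AND: typically two conditional branches per element pair. */
    static int countConditional(boolean[] f0, boolean[] f1) {
        int cnt = 0;
        for (int j = 0; j < f1.length; j++) {
            for (int i = 0; i < f0.length; i++) {
                if (f0[i] && f1[j]) {
                    cnt++;
                }
            }
        }
        return cnt;
    }

    /** Non-short-circuit AND: both flags are always read and combined first. */
    static int countLogical(boolean[] f0, boolean[] f1) {
        int cnt = 0;
        for (int j = 0; j < f1.length; j++) {
            for (int i = 0; i < f0.length; i++) {
                if (f0[i] & f1[j]) {
                    cnt++;
                }
            }
        }
        return cnt;
    }
}
```

How many branches the JIT actually emits for each form is an implementation detail. The measurements on the slide suggest the `&` form is insensitive to data order on the tested machine, while the `&&` form is faster on sorted data and slower on shuffled data.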
  44. Demo5: interface invocation cost

      public interface I {
          int amount();
      }
      ...
      public class C0 implements I { public int amount() { return 0; } }
      public class C1 implements I { public int amount() { return 1; } }
      public class C2 implements I { public int amount() { return 2; } }
      public class C3 implements I { public int amount() { return 3; } }
      ...
      @Benchmark
      @BenchmarkMode(Mode.AverageTime)
      @OperationsPerInvocation(SIZE)
      public int sum(I[] a) {
          int s = 0;
          for (I i : a) {
              s += i.amount();
          }
          return s;
      }
  45. Demo5: results (time, ns/op)

                1 target   2 targets   3 targets   4 targets
      sorted    0.8        0.8         4.9         5.0
      regular              0.8         4.9         5.0
      shuffled             1.0         17.5        19.1
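The rows and columns of the results table correspond to how many implementation classes appear in the array and in what order. The sketch below shows one way such inputs could be built; the helpers `make`, `sorted`, `regular` and `shuffled` are hypothetical and are not taken from the demo repository:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Random;

final class InterfaceTargets {
    interface I { int amount(); }
    static final class C0 implements I { public int amount() { return 0; } }
    static final class C1 implements I { public int amount() { return 1; } }
    static final class C2 implements I { public int amount() { return 2; } }
    static final class C3 implements I { public int amount() { return 3; } }

    /** Hypothetical factory: k in 0..3 selects the implementation class. */
    static I make(int k) {
        switch (k) {
            case 0:  return new C0();
            case 1:  return new C1();
            case 2:  return new C2();
            default: return new C3();
        }
    }

    /** Elements grouped by class: all C0 first, then all C1, and so on. */
    static I[] sorted(int size, int targets) {
        I[] a = new I[size];
        for (int i = 0; i < size; i++) a[i] = make(i * targets / size);
        return a;
    }

    /** Classes repeat in a fixed cycle: C0, C1, ..., C0, C1, ... */
    static I[] regular(int size, int targets) {
        I[] a = new I[size];
        for (int i = 0; i < size; i++) a[i] = make(i % targets);
        return a;
    }

    /** Same mix as regular(), in pseudo-random order. */
    static I[] shuffled(int size, int targets, long seed) {
        I[] a = regular(size, targets);
        Collections.shuffle(Arrays.asList(a), new Random(seed));   // shuffles the array in place
        return a;
    }

    /** The call site under test: its receiver type varies with the input. */
    static int sum(I[] a) {
        int s = 0;
        for (I i : a) s += i.amount();
        return s;
    }
}
```

All three layouts contain the same objects, so `sum` returns the same value for each; only the sequence of receiver types seen at the `i.amount()` call site changes.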
  46. Not a Real Core
  47. Not a Real Core: HW Multithreading
     ∙ Simultaneous multithreading (SMT), e.g. Intel® Hyper-Threading Technology
     ∙ Fine-grained temporal multithreading, e.g. CMT, Sun/Oracle UltraSPARC T1, T2, T3, T4, T5 ...
  50. Back to Demo1: Execution Units Saturation (overall throughput, ops/ms)

                              1 thread   2 threads    2 threads    4 threads
                                         -cpu 1,3     -cpu 2,3
      DoubleSum.test1         426        850          840          1660
      DoubleSum.test2         845        1690         1260         2500
      DoubleSum.test4         1260       2513         1260         2520
      DoubleSum.manualUnroll  1120       2240         1260         2504

      Highlighted on successive builds of this slide: max single core throughput; max system throughput.
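The ceiling in this table can be related to the floating-point adder. A rough check, assuming a fixed 2.6 GHz clock and one addition per array element (names are illustrative): a throughput of 1260 ops/ms over 2048 elements is close to one addition per clock cycle.

```java
// Relates a throughput score to additions retired per clock cycle.
final class SaturationCheck {
    static double addsPerCycle(double opsPerMs, int elementsPerOp, double clockGhz) {
        double addsPerNs = opsPerMs * elementsPerOp / 1_000_000.0;  // 1 ms = 1,000,000 ns
        return addsPerNs / clockGhz;                                 // GHz = cycles per ns
    }

    public static void main(String[] args) {
        System.out.println("1260 ops/ms: " + addsPerCycle(1260, 2048, 2.6) + " adds/cycle");
        System.out.println("2520 ops/ms: " + addsPerCycle(2520, 2048, 2.6) + " adds/cycle");
    }
}
```

The first figure comes out near 1 and the second near 2, which is consistent with each of the machine's two cores sustaining about one floating-point addition per cycle, matching the reciprocal throughput listed in the instruction table.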
  51. Demo6: Map.get()

      private Map<Integer, Integer> jdk_map;
      private int[] keys;

      @Benchmark
      public int testJdkPrimitive() {
          int s = 0;
          for (int key : keys) {
              s += jdk_map.get(key);   // int key is autoboxed on every call
          }
          return s;
      }
  52. Demo6: Map.get()

      private Map<Integer, Integer> jdk_map;
      private Integer[] boxedKeys;

      @Benchmark
      public int testJdkBoxed() {
          int s = 0;
          for (Integer key : boxedKeys) {
              s += jdk_map.get(key);   // keys are already boxed
          }
          return s;
      }
  53. Demo6: Map.get()

      private TIntIntMap third_party_map;
      private int[] keys;

      @Benchmark
      public int test3dParty() {
          int s = 0;
          for (int key : keys) {
              s += third_party_map.get(key);   // primitive-keyed map, no boxing
          }
          return s;
      }
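The first two benchmarks differ only in where the keys are boxed. Passing an `int` to `Map<Integer, Integer>.get` makes the compiler insert an `Integer.valueOf(key)` call, whereas the boxed variant converts the keys once up front. A self-contained sketch of the two lookup styles using `java.util.HashMap`; the class and helper names are illustrative, and the third-party `TIntIntMap` variant is left out because it needs an external library:

```java
import java.util.HashMap;
import java.util.Map;

final class BoxedLookup {
    /** Keys are boxed on every lookup (implicit Integer.valueOf). */
    static int sumPrimitiveKeys(Map<Integer, Integer> map, int[] keys) {
        int s = 0;
        for (int key : keys) {
            s += map.get(key);
        }
        return s;
    }

    /** Keys were boxed once, ahead of time. */
    static int sumBoxedKeys(Map<Integer, Integer> map, Integer[] boxedKeys) {
        int s = 0;
        for (Integer key : boxedKeys) {
            s += map.get(key);
        }
        return s;
    }

    /** Hypothetical helper: boxes an int[] into an Integer[]. */
    static Integer[] box(int[] keys) {
        Integer[] boxed = new Integer[keys.length];
        for (int i = 0; i < keys.length; i++) boxed[i] = keys[i];
        return boxed;
    }

    /** Hypothetical test data: key k maps to 2 * k for k in [0, n). */
    static Map<Integer, Integer> doubles(int n) {
        Map<Integer, Integer> map = new HashMap<>();
        for (int k = 0; k < n; k++) map.put(k, 2 * k);
        return map;
    }
}
```

Both methods return the same sum. Per-call boxing is one plausible contributor to the gap between the two JDK variants, but the sketch does not measure anything; a harness such as JMH is needed for that.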
  65. Demo6: Map.get() results (throughput, ops/ms)

                    -cpu 1   -cpu 2,3    -cpu 2; (?) on -cpu 3
      JdkPrimitive  47       (25, 25)    (30, ?)
      JdkBoxed      71       (40, 40)    (50, ?)
      3dParty       74       (43, 43)    (16, ?)
  66. Demo6: Hyper.troll()

      public static double d0;
      public static double d1;
      public static double d2;

      // Five divisions per call, hence @OperationsPerInvocation(5).
      @Benchmark
      @OperationsPerInvocation(5)
      public double troll() {
          return (d0 / d2) / ((d1 / d2) / (d0 / d1));
      }
  67. Demo6: division results on Haswell (throughput, ops/𝜇s)

      1 thread: int 250; double 180

                        -cpu 1,3     -cpu 2,3     -cpu 3
      (int, int)        (250, 250)   (125, 125)   (125, 125)
      (double, double)  (180, 180)   (90, 90)     (90, 90)
      (double, int)     (180, 250)   (150, 57)    (90, 125)
  68. Demo6: division results on AMD (throughput, ops/𝜇s)

      1 thread: int 128; double 300

                        -cpu 0,1     -cpu 0,2     -cpu 0,8     -cpu 0
      (int, int)        (92, 92)     (128, 128)   (128, 128)   (64, 64)
      (double, double)  (150, 150)   (300, 300)   (300, 300)   (150, 150)
      (double, int)     (280, 120)   (290, 128)   (300, 128)   (120, 64)
  69. Conclusion
  70. Enlarge your knowledge with these simple tricks!
      Reading list:
     ∙ “Computer Architecture: A Quantitative Approach”, John L. Hennessy, David A. Patterson
     ∙ CPU vendors’ documentation
     ∙ http://www.agner.org/optimize/
     ∙ etc.
  71. Thanks!
  72. Q & A
  73. Appendix
  74. Appendix: Frequency Variance
     ∙ Dynamic CPU frequency
       – TurboBoost and similar
      “The processor must be working in the power, temperature, and specification limits of the thermal design power (TDP).” © Intel
  75. Appendix: TurboBoost in action
      [Chart: max normal frequency vs. measured frequency]
  76. Appendix: Set Fixed Frequency!
      e.g. cpufreq-set -u 2600000 -g performance
