Webinar on multicore processors

  1. Introduction to Multi-Core Architectures: A Paradigm Shift in IT. Dr NB Venkateswarlu, RITCH CENTER, Visakhapatnam; Prof. in CSE, AITAM, TEKKALI
  2. Index  What is Computer Architecture  Which are Computer Components  Classification of Computers  The Dawn of Multi-Core (Stream) processors.  Symmetric Multi-Core (SMP)  Heterogeneous Multi-Core Processors  GPU (Graphic Processing Unit)  Intel Family of Multi-Core Processors  AMD Family of Multi-Core Processors  Others  Multi-Threading, Hyper-Threading, Multi-Core  OpenMP Programming Paradigm
  3. Concepts  What is Computer Architecture  [Diagram: layered view from Application, Operating System and Compiler down through the Instruction Set Architecture (instruction set processor, I/O system) and Firmware to Datapath & Control, Digital Design, Circuit Design and Layout]
  4. Evolution of computer architecture  [Diagram (legend: I/E = instruction/execution): scalar processors with pre-execution sequencing evolve through I/E overlap, functional parallelism, multiple functional units and pipelining toward pseudo-vector and vector machines (memory-memory and register-register), and through multiprocessing, array processors and multicomputers toward SIMD, MIMD and Multi-Core]
  5. Categories of computer systems  Flynn's Classification  SISD  SIMD  MISD  MIMD
  6. SISD  [Block diagram: control unit (CU) issuing an instruction stream (IS) to a processing unit (PU), which exchanges a data stream (DS) with the memory unit (MU) and I/O]
  7. SIMD  [Block diagram: one CU broadcasting the instruction stream to processing units PU1 … PUn, each exchanging its own data stream with memory modules MM1 … MMn]
  8. MIMD  [Block diagram: processors Proc.0 … Proc.n-1, each with its own memory Mem.0 … Mem.n-1, connected through a network under common control]
  9. MISD  [Block diagram: control units CU1 … CUn each issuing an instruction stream to PE1 … PEn; a single data stream from I/O passes through the PEs, which share memory modules MM1 … MMn]
  10. Amdahl's Law  states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
  11. Amdahl's Law  With serial fraction S (so the parallel fraction is P = 1 - S) and N parallel processors: Speedup = 1 / (S + (1 - S)/N)  Example: S = 0.6, N = 10 gives Speedup = 1.56; S = 0.6, N = ∞ gives Speedup = 1.67  Gene Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities," AFIPS Conference Proceedings, (30), pp. 483-485, 1967.
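A minimal C sketch (not from the deck) that evaluates the formula above; the hypothetical helper amdahl_speedup() reproduces the slide's example values of 1.56 for N = 10 and about 1.67 as N grows very large:

      /* Amdahl's Law: Speedup = 1 / (S + (1 - S) / N), where S is the serial fraction */
      #include <stdio.h>

      static double amdahl_speedup(double s, double n)
      {
          return 1.0 / (s + (1.0 - s) / n);
      }

      int main(void)
      {
          double s = 0.6;                                                  /* slide's example */
          printf("N = 10 : speedup = %.2f\n", amdahl_speedup(s, 10.0));    /* 1.56 */
          printf("N = 1e9: speedup = %.2f\n", amdahl_speedup(s, 1e9));     /* ~1/S = 1.67 */
          return 0;
      }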
  12. CPU  The Central Processing Unit (CPU), Microprocessor (µp)  The brain of the computer  Its task is to execute instructions  It keeps on executing instructions from the moment it is powered on to the moment it is powered off.  Execution of various instructions causes the computer to perform various tasks.  There are a wide range of CPUs  The dominant CPUs on the market today are from  Intel: Pentium (Desktop), Xeon (Server), Pentium-M (Mobile), X-scale (Embedded)  AMD: Athlon (Desktop), Opteron (Server), Turion (Mobile), Geode (Embedded)
  13. Key Characteristics of a CPU  Several metrics are used to describe the characteristics of a CPU  Native "word" size  32-bit or 64-bit  Clock speed  The clock speed of a CPU defines the rate at which the internal clock of the CPU operates  The clock sequences and determines the speed at which instructions are executed by the CPU  Modern CPUs operate in gigahertz (1 GHz = 10^9 Hz)  However, clock speed alone does not determine the overall ability of a CPU  Instruction set  More versatile CPUs support a richer and more efficient instruction set  Modern CPUs have new instruction sets such as SSE, SSE2 and 3DNow! that can be used to boost the performance of many common applications  These instructions perform the work of several instructions but take less time  FLOPS: Floating-Point Operations per Second  FLOPS is a better metric for measuring a CPU's computational capabilities  Modern CPUs deliver 2-3 GigaFLOPS (1 GFLOPS = 10^9 FLOPS)  FLOPS is also used as a metric for describing the computational capabilities of complete computer systems  Modern supercomputers deliver 1 TeraFLOP (TFLOP); 1 TFLOP = 10^12 FLOPS.
  14. Key Characteristics of a CPU (Contd.)  Several metrics are used to describe the characteristics of a CPU  Power consumption  Power is an important factor in today's computing  Power is the product of the voltage applied to the CPU (V) and the current drawn by the CPU (I): Power = V x I  Voltage is measured in volts, current in amperes (amps), and power is typically expressed in watts  Modern desktop processors consume anywhere from 35 to 100 watts  Lower power is better, as power is proportional to heat  More power means the processor generates more heat, and heating is a big problem  Number of computational cores per CPU  Some CPUs have multiple cores  Each core is an independent computational unit and can execute instructions in parallel  The more cores, the better the CPU is for multi-threaded applications  Single-threaded applications typically experience a slowdown on multi-core processors due to reduced clock speeds
  15. Key Characteristics of a CPU (Contd.)  Several metrics are used to describe the characteristics of a CPU  Cache size and configuration  Cache is a small but high-speed memory that is fabricated along with the CPU  Size of cache is inversely proportional to its speed  Cost of the CPU increases as the size of the cache increases  It is much faster than RAM  It is used to minimize the overall latency of accessing RAM  Microprocessors have a hierarchy of caches  L1 cache: fastest and closest to the core components  L3 cache: relatively slower and further away from the CPU  Example cache configurations  Quad-core AMD Opteron (Shanghai)  32 KB (data) + 32 KB (instr.) L1 cache per core  Unified 512 KB L2 cache per core  Unified 6 MB shared L3 cache (for 4 cores)  Quad-core Intel Xeon (Nehalem)  32 KB (data) + 32 KB (instr.) L1 cache per core  Unified 256 KB L2 cache per core  Unified 8 MB shared L3 cache (for 4 cores)
  16. Trends in computing  Until recently, hardware performance improvements have been achieved primarily through advancements in microprocessor fabrication technologies:  Steady improvement in processor clock speeds  Faster clocks (within the same family of processors) provide higher FLOPS  Increase in number of transistors on-chip  More complex and sophisticated hardware to improve performance  Larger caches to provide rapid access to instructions and data
  17. Moore's Law  The steady advancement in microprocessor technology was predicted by Gordon Moore (in 1965), co-founder of Intel®  Moore's law states that the number of transistors on microprocessors will double approximately every two years.  Many advancements in digital technologies can be linked to Moore's law. This includes:  Processing speed  Memory capacity  Speed and bandwidth of data communication networks and  Resolution of monitors and digital cameras  Thus far, Moore's law has steadily held true for about 40 years (from 1965 to about 2005)  Breakthroughs in the miniaturization of transistors have been the turnkey technology
  18. Moore's Law vs. Intel's roadmap  Here is a graph illustrating the progress of Moore's law based on Intel® Inc.'s technological roadmap (obtained from Wikipedia)
  19. Stagnation of Moore's Law  In the past few years we have reached the fundamental limits of IC fabrication technology (particularly lithography and interconnect)  It is no longer feasible to further miniaturize the transistors on the IC  They are already only a few atoms across, and at this scale the laws of physics change, making further shrinking extremely challenging  Heat dissipation has reached the breakdown threshold  Heat is generated as a part of regular transistor operation  Higher heat dissipation will cause the transistors to fail  With the current state of the art, a single processor cannot yield any more than 4 to 5 GFLOPS  How do we move beyond this barrier?
  20. SIA Roadmap for Processors (1999)
      Year                     1999   2002   2005   2008   2011   2014
      Feature size (nm)        180    130    100    70     50     35
      Logic transistors/cm^2   6.2M   18M    39M    84M    180M   390M
      Clock (GHz)              1.25   2.1    3.5    6.0    10.0   16.9
      Chip size (mm^2)         340    430    520    620    750    900
      Power supply (V)         1.8    1.5    1.2    0.9    0.6    0.5
      High-perf. power (W)     90     130    160    170    175    183
      Source: http://www.semichips.org
  21. ISSCC, Feb. 2001, Keynote  "Ten years from now, microprocessors will run at 10GHz to 30GHz and be capable of processing 1 trillion operations per second -- about the same number of calculations that the world's fastest supercomputer can perform now. Unfortunately, if nothing changes these chips will produce as much heat, for their proportional size, as a nuclear reactor. . . ."  Patrick P. Gelsinger, Senior Vice President, General Manager, Digital Enterprise Group, INTEL CORP.
  22. Intel View  Soon >billion transistors integrated  Clock frequency can still increase  Future applications will demand TIPS  Power? Heat?
  23. TEMPERATURE IMAGE OF CELL PROCESSOR
  24. Multi-core and Multi-processors  The solution to increasing the effective compute power is via the use of multi-core, multi-processor computer systems along with suitable software  This is a paradigm shift in hardware and software technologies  Multiple cores and multiple processors are interconnected using a variety of high-speed data communication networks  Software plays a central role in harnessing the power of multi-core/multi-processor systems  Most industry leaders believe this is the near future of computing!
  25. Everything is to be GREEN NOW!  Green Buildings  Green-IT
  26. Multi-core Trends  Multi-core processors are most definitely the future of computing  Both Intel and AMD are pushing for a larger number of cores per CPU package  The Cell Broadband Engine (aka Cell) has 8 synergistic processing elements (SPE)  The Sun Microsystems Niagara has 8 cores, with each core capable of running 8 threads.
  27. Sun Niagara  32-way parallelism typically delivered a 23x performance improvement  [Chart: GFlop/s on benchmark matrices (Dense, Protein, FEM-Sphr, FEM-Cant, Tunnel, FEM-Har, QCD, FEM-Ship, Econom, Epidem, FEM-Accel, Circuit, Webbase, LP, Median), comparing 8 cores x 4 threads (naive) with 1 core x 1 thread (naive)]
  28. A GENERIC MODERN PC
  29. SINGLE CORE VS MULTI-CORE (8): A COMPARISON FROM GEORGIA PACKING INSTITUTE
  30. INTEL CORE-DUO
  31. AMD DUAL CORE PROCESSOR
  32. Multicore Architecture
  33. Architecture
  34. Hierarchy of Modular Building Blocks  [Diagram: chip, board and rack levels connected by cache/interconnect fabric, SMP interconnect, memory and high-speed networks] • Chip: homogeneous SMP on chip (2-32 HW contexts, various forms of resource sharing) or a heterogeneous collection of processors on chip (heterogeneity at the data and control flow level); cores will support multiple HW threads sharing a single cache, exhibiting SMP characteristics; main processor(s) with accelerator(s) in a master-slave relationship • Board: homogeneous SMP on board (2-128 HW contexts); hierarchical SMP servers with NUMA (non-uniform memory access) characteristics • Rack: grid/cluster • Systems will increasingly need to implement a hybrid execution model • New programming systems need to reduce the need for programmer awareness of the topology on which their program executes; the next-gen programming system must support programming simplicity while leveraging the performance of the underlying HW topology.
  35. Looming "Multicore Crisis"  Slide Source: Berkeley View of Landscape
  36. Architecture trends  Several processor cores on a chip and specialized computing engines  XML processing, cryptography, graphics  Questions:  how to interconnect a large number of processor cores  how to provide sufficient memory bandwidth  how to structure the multilevel caching subsystem  how to balance the general-purpose computing resources with specialized processing engines and all the supporting memory, caching and interconnect structure, given a constant power budget  Software development processes  how to program for multicore architectures  how to test and evaluate the performance of multithreaded applications
  37. Intel Multi-Core Processors
  38. Key Features: Dual Core  Two physical cores in a package  Each with its own execution resources  Each with its own L1 cache  32K instruction and 32K data  8-way set associative; 64-byte line  Both cores share the L2 cache  2MB 8-way set associative; 64-byte line size  10 clock cycles latency; write-back update policy  [Diagram: two actual processor cores (EXE core + FP unit each), per-core L1 caches, shared L2 cache, system bus (667MHz, 5333MB/s)]  Truly parallel multitasking and multi-threaded execution
  39. Key Features: Smart Cache  Shared between the two cores  Advanced Transfer Cache architecture  Reduced bus traffic  Both cores have full access to the entire cache  Dynamic cache sizing  [Diagram: Core1 and Core2 sharing a 2 MB L2 cache connected to the bus]  Enables greater system responsiveness
  40. Advanced Smart Cache
  41. Key Features: Enhanced SpeedStep Technology  Multi-point demand-based switching  Supports transitions to deeper sleep modes  Clock partitioning and recovery  Dynamic Bus Parking  Breakthrough performance and improved battery life
  42. Wide Dynamic Execution
  43. Macro Fusion
  44. Smart Memory Access
  45. Key Features: Digital Media Boost  Streaming SIMD Extensions (SSE) decoder throughput improvement  SSE/SSE2 instruction optimization  Floating-point performance enhancement  New Enhanced Streaming SIMD Extensions 3 (SSE3)  Target workloads: High Performance Computing, Digital Photography, Digital Music, Video Editing, Internet Content Creation, 3D & 2D Modeling, CAD Tools  Providing true SIMD integer and floating-point performance!
  46. Key Features: Digital Media Boost – SSE3 Instructions  SIMD FP using AOS format*  Thread synchronization  Video encoding  Complex arithmetic  FP to integer conversions  HADDPD, HSUBPD  HADDPS, HSUBPS  MONITOR, MWAIT  LDDQU  ADDSUBPD, ADDSUBPS, MOVDDUP, MOVSHDUP, MOVSLDUP  FISTTP  * Also benefits complex arithmetic and vectorization
  47. Intel Xeon (Clovertown)  Pseudo quad-core / socket  4-issue, out-of-order, super-scalar  Fully pumped SSE  2.33GHz  21GB/s, 74.6 GFlop/s  [Diagram: two sockets, each with two dual-core Xeon pairs sharing a 4MB L2, front-side buses into the Blackford chipset, fully buffered DRAM channels at 10.6GB/s each; 21.3 GB/s read, 10.6 GB/s write]
  48. AMD Opteron  Dual core / socket  3-issue, out-of-order, super-scalar  Half pumped SSE  2.2GHz  21GB/s, 17.6 GFlop/s  Strong NUMA issues  [Diagram: two sockets, each with two Opteron cores (1MB victim cache each) and an on-chip memory controller / HyperTransport; DDR2 DRAM at 10.6GB/s per socket, 8GB/s link between sockets]
  49. IBM Cell Blade  Eight SPEs / socket  Dual-issue, in-order, VLIW-like  SIMD-only ISA  Disjoint local-store address space + DMA  Weak DP FPU  51.2GB/s, 29.2GFlop/s  Strong NUMA issues  [Diagram: two Cell chips, each with a PPE (512K L2) and eight SPEs (256K local store + MFC each) on the EIB ring network (<20GB/s each direction), XDR DRAM at 25.6GB/s per chip, connected by the BIF]
  50. Sun Niagara  Eight cores  Each core is 4-way multithreaded  Single-issue, in-order  Shared, very slow FPU  1.0GHz  25.6GB/s, 0.1GFlop/s (8GIPS)  [Diagram: eight MT UltraSparc cores (8K D$ each) and a shared FPU behind a crossbar switch into a 3MB shared 12-way L2 (64GB/s fill, 32GB/s write-through), DDR2 DRAM at 25.6GB/s]
  51. Cell's Nine-Processor Chip  Eight identical processors (SPEs)  f = 5.6GHz (max)  44.8 Gflops  © IEEE Spectrum, January 2006
  52. Symmetric Multi-Core Processors  Phenom X4
  53. Symmetric Multi-Core Processors  UltraSparc
  54. Future of multi-cores  Moore's Law predicts that the number of cores will double every 18-24 months  2007 - 8 cores on a chip (IBM Cell has 9 and Sun's has 8)  2009 - 16 cores  2013 - 64 cores  2015 - 128 cores  2021 - 1k cores  Exploiting TLP has the potential to close Moore's gap  Real performance has the potential to scale with the number of cores
  55. IBM Cell Example: Will this be the trend for large multi-core designs?  Cell processor components: Power Processor Element (PPE), Element Interconnect Bus (EIB), 8 Synergistic Processing Elements (SPE), Rambus XDR memory controller, Flex I/O  [Diagram: PPE (PowerPC, 512K L2 cache) and eight SPEs on a 4 x 128-bit Element Interconnect Bus, with Rambus XDR memory and Flex I/O interfaces]
  56. Obvious silicon real-estate advantage for software-managed-memory processors  [Die shots: IBM, AMD, Intel, Cell BE]
  57. Multithreading and Multi-Core Processors
  58. Instruction-Level Parallelism (ILP)  Extract parallelism in a single program  Superscalar processors have multiple execution units working in parallel  Challenge to find enough instructions that can be executed concurrently  Out-of-order execution => instructions are sent to execution units based on instruction dependencies rather than program order
  59. Performance Beyond ILP  Much higher natural parallelism in some applications  Database, web servers, or scientific codes  Explicit Thread-Level Parallelism  Thread: has own instructions and data  May be part of a parallel program or independent programs  Each thread has all state (instructions, data, PC, register state, and so on) needed to execute  Multithreading: Thread-Level Parallelism within a processor
  60. Thread-Level Parallelism (TLP)  ILP exploits implicit parallel operations within a loop or straight-line code segment  TLP is explicitly represented by multiple threads of execution that are inherently parallel  Goal: Use multiple instruction streams to improve  Throughput of computers that run many programs  Execution time of multi-threaded programs  TLP could be more cost-effective to exploit than ILP
  61. Fine-Grained Multithreading  Switches between threads on each instruction, interleaving execution of multiple threads  Usually done round-robin, skipping stalled threads  CPU must be able to switch threads on every clock cycle  Pro: Hide latency of both short and long stalls  Instructions from other threads always available to execute  Easy to insert on short stalls  Con: Slow down execution of individual threads  Thread ready to execute without stalls will be delayed by instructions from other threads  Used on Sun's Niagara
  62. Coarse-Grained Multithreading  Switches threads only on costly stalls  e.g., L2 cache misses  Pro: No switching each clock cycle  Relieves need to have very fast thread switching  No slowdown for ready-to-go threads  Other threads only issue instructions when the main one would stall (for a long time) anyway  Con: Limitation in hiding shorter stalls  Pipeline must be emptied or frozen on stall, since CPU issues instructions from only one thread  New thread must fill pipe before instructions can complete  Thus, better for reducing penalty of high-cost stalls where pipeline refill << stall time  Used in IBM AS/400
  63. Multithreading Paradigms  [Diagram: execution time vs. functional units (FU1-FU4) for a conventional single-threaded superscalar, fine-grained multithreading (cycle-by-cycle interleaving), coarse-grained multithreading (block interleaving), simultaneous multithreading (SMT, Intel's HT) and a chip multiprocessor (CMP), today called a multi-core processor; threads 1-5 and unused issue slots are shown]
  64. Simultaneous Multithreading (SMT)  Exploits TLP at the same time it exploits ILP  Intel's HyperThreading (2-way SMT)  Others: IBM Power5 and Intel's future multicore (8-core, 2-thread 45nm Nehalem)  Basic ideas: conventional MT + simultaneous issue + sharing common resources  [Diagram: per-thread PCs and fetch units feeding a shared I-cache, decode and register renaming; per-thread RS & ROB plus physical register files; shared register files and execution units (ALU1, ALU2, FAdd 2 cycles, FMult 4 cycles, Fdiv unpipelined 16 cycles, Load/Store variable latency) and the D-cache]
  65. Simultaneous Multithreading (SMT)  Insight that a dynamically scheduled processor already has many HW mechanisms to support multithreading  Large set of virtual registers that can be used to hold register sets for independent threads  Register renaming provides unique register identifiers  Instructions from multiple threads can be mixed in the data path  Without confusing sources and destinations across threads!  Out-of-order completion allows the threads to execute out of order, and get better utilization of the HW  Just add a per-thread renaming table and keep separate PCs  Independent commitment can be supported via a separate reorder buffer for each thread
  66. SMT Pipeline  Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire  [Diagram: pipeline stages with PC, register map, Icache, register files and Dcache; data from Compaq]
  67. Power 4  Single-threaded predecessor to Power 5.  8 execution units in the out-of-order engine; each can issue an instruction each cycle.
  68. Power 4 vs. Power 5  [Diagram: Power 5 pipeline with 2 fetch (PC) / 2 initial decode stages and 2 commits (architected register sets)]
  69. Pentium 4 Hyperthreading: Performance Improvements
  70. Questions To Be Asked While Migrating to Multi-Core
  71. Application performance 1. Does the concurrency platform allow me to measure the parallelism I've exposed in my application? 2. Does the concurrency platform address response-time bottlenecks, or just offer more throughput? 3. Does application performance scale up linearly as cores are added, or does it quickly reach diminishing returns? 4. Is my multicore-enabled code just as fast as my original serial code when run on a single processor? 5. Does the concurrency platform's scheduler load-balance irregular applications efficiently to achieve full utilization? 6. Will my application "play nicely" with other jobs on the system, or do multiple jobs cause thrashing of resources? 7. What tools are available for detecting multicore performance bottlenecks?
  72. Software reliability 8. How much harder is it to debug my multicore-enabled application than to debug my original application? 9. Can I use my standard, familiar debugging tools? 10. Are there effective debugging tools to identify and localize parallel-programming errors, such as data-race bugs? 11. Must I use a parallel debugger even if I make an ordinary serial programming error? 12. What changes must I make to my release-engineering processes to ensure that my delivered software is reliable? 13. Can I use my existing unit tests and regression tests?
  73. Development time 14. To multicore-enable my application, how much logical restructuring of my application must I do? 15. Can I easily train programmers to use the concurrency platform? 16. Can I maintain just one code base, or must I maintain separate serial and parallel versions? 17. Can I avoid rewriting my application every time a new processor generation increases the core count? 18. Can I easily multicore-enable ill-structured and irregular code, or is the concurrency platform limited to data-parallel applications? 19. Does the concurrency platform properly support modern programming paradigms, such as objects, templates, and exceptions? 20. What does it take to handle global variables in my application?
  74. OpenMP: A Portable Solution for Threading
  75. What Is OpenMP*?  Compiler directives for multithreaded programming  Easy to create threaded Fortran and C/C++ codes  Supports the data parallelism model  Incremental parallelism  Combines serial and parallel code in a single source
  76. OpenMP* Architecture  Fork-join model  Work-sharing constructs  Data environment constructs  Synchronization constructs  Extensive Application Program Interface (API) for finer control
  77. Programming Model  Fork-join parallelism: • Master thread spawns a team of threads as needed • Parallelism is added incrementally: the sequential program evolves into a parallel program  [Diagram: master thread forking and joining a team of threads at each parallel region]
  78. OpenMP* Pragma Syntax  Most constructs in OpenMP* are compiler directives or pragmas.  For C and C++, the pragmas take the form:  #pragma omp construct [clause [clause]…]
  79. Parallel Regions  Defines a parallel region over a structured block of code  Threads are created as the 'parallel' pragma is crossed  Threads block at the end of the region  Data is shared among threads unless specified otherwise  C/C++: #pragma omp parallel { block }  [Diagram: #pragma omp parallel spawning Thread 1, Thread 2, Thread 3]
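A minimal, self-contained sketch (assumed, not from the deck) of the parallel region pragma shown above; each thread in the team executes the structured block once and the threads join at the implicit barrier at the end (compile with an OpenMP flag such as -fopenmp):

      #include <omp.h>
      #include <stdio.h>

      int main(void)
      {
          #pragma omp parallel
          {
              /* every thread in the team runs this structured block */
              printf("Hello from thread %d\n", omp_get_thread_num());
          }   /* implicit barrier: threads block here, then join the master */
          return 0;
      }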
  80. How Many Threads?  Set environment variable for number of threads  set OMP_NUM_THREADS=4  There is no standard default for this variable  Many systems:  # of threads = # of processors  Intel® compilers use this default
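A small sketch (assumed, not from the deck) of the programmatic alternative to the OMP_NUM_THREADS environment variable:

      #include <omp.h>
      #include <stdio.h>

      int main(void)
      {
          omp_set_num_threads(4);              /* request a team of 4 threads */
          #pragma omp parallel
          {
              if (omp_get_thread_num() == 0)   /* let one thread report the team size */
                  printf("Team size: %d\n", omp_get_num_threads());
          }
          return 0;
      }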
  81. Work-sharing Construct  Splits loop iterations into threads  Must be in the parallel region  Must precede the loop  #pragma omp parallel  #pragma omp for  for (I=0; I<N; I++) { Do_Work(I); }
  82. Work-sharing Construct  Threads are assigned an independent set of iterations  Threads must wait at the end of the work-sharing construct  [Diagram: iterations i = 1 through i = 12 divided among threads, with an implicit barrier at the end]  #pragma omp parallel  #pragma omp for  for(i = 1; i < 13; i++) c[i] = a[i] + b[i]
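A runnable completion of the 12-iteration vector add above (the array contents are made up for illustration); the loop iterations are divided among the threads of the enclosing team, and all threads wait at the implicit barrier after the loop:

      #include <omp.h>
      #include <stdio.h>

      #define N 12

      int main(void)
      {
          float a[N + 1], b[N + 1], c[N + 1];
          int i;
          for (i = 1; i <= N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

          #pragma omp parallel
          #pragma omp for
          for (i = 1; i <= N; i++)        /* each thread gets a subset of iterations */
              c[i] = a[i] + b[i];         /* implicit barrier at the end of the for */

          for (i = 1; i <= N; i++)
              printf("c[%d] = %.1f\n", i, c[i]);
          return 0;
      }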
  83. Combining pragmas  These two code segments are equivalent  #pragma omp parallel { #pragma omp for for (i=0; i< MAX; i++) { res[i] = huge(); } }  #pragma omp parallel for for (i=0; i< MAX; i++) { res[i] = huge(); }
  84. Data Environment  OpenMP uses a shared-memory programming model  Most variables are shared by default.  Global variables are shared among threads  C/C++: file-scope variables, static
  85. Data Environment  But, not everything is shared...  Stack variables in functions called from parallel regions are PRIVATE  Automatic variables within a statement block are PRIVATE  Loop index variables are private (with exceptions)  C/C++: the first loop index variable in nested loops following a #pragma omp for
  86. Data Scope Attributes  The default status can be modified with  default (shared | none)  Scoping attribute clauses  shared(varname,…)  private(varname,…)
  87. The Private Clause  Reproduces the variable for each thread  Variables are uninitialized; a C++ object is default constructed  Any value external to the parallel region is undefined  void* work(float* c, int N) { float x, y; int i; #pragma omp parallel for private(x,y) for(i=0; i<N; i++) { x = a[i]; y = b[i]; c[i] = x + y; } }
  88. Protect Shared Data  Must protect access to shared, modifiable data  float dot_prod(float* a, float* b, int N) { float sum = 0.0; #pragma omp parallel for shared(sum) for(int i=0; i<N; i++) { #pragma omp critical sum += a[i] * b[i]; } return sum; }
  89. OpenMP* Critical Construct  #pragma omp critical [(lock_name)]  Defines a critical region on a structured block  float RES; #pragma omp parallel { float B; #pragma omp for for(int i=0; i<niters; i++){ B = big_job(i); #pragma omp critical (RES_lock) consum (B, RES); } }  Threads wait their turn; only one at a time calls consum(), thereby protecting RES from race conditions  Naming the critical construct RES_lock is optional
  90. OpenMP* Reduction Clause  reduction (op : list)  The variables in "list" must be shared in the enclosing parallel region  Inside a parallel or work-sharing construct:  A PRIVATE copy of each list variable is created and initialized depending on the "op"  These copies are updated locally by threads  At the end of the construct, local copies are combined through "op" into a single value and combined with the value in the original SHARED variable
  91. Reduction Example  Local copy of sum for each thread  All local copies of sum added together and stored in the "global" variable  #pragma omp parallel for reduction(+:sum) for(i=0; i<N; i++) { sum += a[i] * b[i]; }
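A self-contained sketch of the reduction example above (the arrays and their contents are invented for illustration); each thread accumulates into a private copy of sum, and the copies are combined with '+' into the shared sum at the end of the loop:

      #include <omp.h>
      #include <stdio.h>

      #define N 1000

      int main(void)
      {
          float a[N], b[N], sum = 0.0f;
          int i;
          for (i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

          #pragma omp parallel for reduction(+:sum)
          for (i = 0; i < N; i++)
              sum += a[i] * b[i];              /* updates the thread-private copy */

          printf("dot product = %.1f\n", sum); /* expected: 2000.0 */
          return 0;
      }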
  92. C/C++ Reduction Operations  A range of associative operands can be used with reduction  Initial values are the ones that make sense mathematically
      Operand   Initial Value
      +         0
      *         1
      -         0
      ^         0
      &         ~0
      |         0
      &&        1
      ||        0
  93. Which Schedule to Use
      Schedule Clause   When To Use
      STATIC            Predictable and similar work per iteration
      DYNAMIC           Unpredictable, highly variable work per iteration
      GUIDED            Special case of dynamic to reduce scheduling overhead
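A hedged sketch of the schedule clause in use (the loop body is invented for illustration): the inner loop makes later iterations much more expensive, so schedule(dynamic) hands out chunks at run time to keep the threads evenly loaded; note that the inner index j must be declared private, since only the work-sharing loop's own index is private by default:

      #include <omp.h>
      #include <stdio.h>

      #define N 1000

      int main(void)
      {
          double total = 0.0;
          int i, j;

          #pragma omp parallel for schedule(dynamic, 16) reduction(+:total) private(j)
          for (i = 0; i < N; i++)
              for (j = 0; j <= i; j++)         /* work grows with i -> unbalanced */
                  total += 1.0;

          printf("total = %.0f\n", total);     /* N*(N+1)/2 = 500500 */
          return 0;
      }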
  94. Parallel Sections  Independent sections of code can execute concurrently  [Diagram: serial vs. parallel execution of the three phases]  #pragma omp parallel sections { #pragma omp section phase1(); #pragma omp section phase2(); #pragma omp section phase3(); }
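A minimal runnable sketch of parallel sections; the phase functions are placeholders (not from the deck), and each section is executed once, possibly by a different thread:

      #include <omp.h>
      #include <stdio.h>

      static void phase1(void) { printf("phase1 on thread %d\n", omp_get_thread_num()); }
      static void phase2(void) { printf("phase2 on thread %d\n", omp_get_thread_num()); }
      static void phase3(void) { printf("phase3 on thread %d\n", omp_get_thread_num()); }

      int main(void)
      {
          #pragma omp parallel sections
          {
              #pragma omp section
              phase1();
              #pragma omp section
              phase2();
              #pragma omp section
              phase3();
          }
          return 0;
      }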
  95. Single Construct  Denotes a block of code to be executed by only one thread  First thread to arrive is chosen  Implicit barrier at end  #pragma omp parallel { DoManyThings(); #pragma omp single { ExchangeBoundaries(); } // threads wait here for single DoManyMoreThings(); }
  96. Master Construct  Denotes a block of code to be executed only by the master thread  No implicit barrier at end  #pragma omp parallel { DoManyThings(); #pragma omp master { // if not master skip to next stmt ExchangeBoundaries(); } DoManyMoreThings(); }
  97. Implicit Barriers  Several OpenMP* constructs have implicit barriers  parallel  for  single  Unnecessary barriers hurt performance  Waiting threads accomplish no work!  Suppress implicit barriers, when safe, with the nowait clause
  98. Barrier Construct  Explicit barrier synchronization  Each thread waits until all threads arrive  #pragma omp parallel shared (A, B, C) { DoSomeWork(A,B); printf("Processed A into B\n"); #pragma omp barrier DoSomeWork(B,C); printf("Processed B into C\n"); }
  99. Atomic Construct  Special case of a critical section  Applies only to a simple update of a memory location  #pragma omp parallel for shared(x, y, index, n) for (i = 0; i < n; i++) { #pragma omp atomic x[index[i]] += work1(i); y[i] += work2(i); }
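A self-contained sketch of atomic (the histogram and bin mapping are invented for illustration): several threads may update the same bin, and the atomic construct protects each "+=" on a single memory location more cheaply than a critical section would:

      #include <omp.h>
      #include <stdio.h>

      #define N    10000
      #define BINS 8

      int main(void)
      {
          int hist[BINS] = {0};
          int i;

          #pragma omp parallel for
          for (i = 0; i < N; i++) {
              int bin = i % BINS;              /* different threads can hit the same bin */
              #pragma omp atomic
              hist[bin] += 1;                  /* simple update of one memory location */
          }

          for (i = 0; i < BINS; i++)
              printf("hist[%d] = %d\n", i, hist[i]);   /* each bin: N/BINS = 1250 */
          return 0;
      }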
  100. OpenMP* API  Get the thread number within a team  int omp_get_thread_num(void);  Get the number of threads in a team  int omp_get_num_threads(void);  Usually not needed for OpenMP codes  Can lead to code not being serially consistent  Does have specific uses (debugging)  Must include a header file  #include <omp.h>
  101. More OpenMP*  Data environment constructs  FIRSTPRIVATE  LASTPRIVATE  THREADPRIVATE
  102. Firstprivate Clause  Variables are initialized from the shared variable  C++ objects are copy-constructed  incr=0; #pragma omp parallel for firstprivate(incr) for (I=0;I<=MAX;I++) { if ((I%2)==0) incr++; A[I]=incr; }
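A small self-contained sketch of firstprivate (variable names invented): each thread's private copy of offset is initialized from the value the shared variable had before the loop, rather than being left uninitialized as with plain private:

      #include <omp.h>
      #include <stdio.h>

      #define N 8

      int main(void)
      {
          int a[N];
          int offset = 100;                /* copied into every thread's private copy */
          int i;

          #pragma omp parallel for firstprivate(offset)
          for (i = 0; i < N; i++)
              a[i] = offset + i;           /* offset starts at 100 in every thread */

          for (i = 0; i < N; i++)
              printf("a[%d] = %d\n", i, a[i]);
          return 0;
      }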
  103. Lastprivate Clause  Variables update the shared variable using the value from the last iteration  C++ objects are updated as if by assignment  void sq2(int n, double *lastterm) { double x; int i; #pragma omp parallel #pragma omp for lastprivate(x) for (i = 0; i < n; i++){ x = a[i]*a[i] + b[i]*b[i]; b[i] = sqrt(x); } *lastterm = x; }
  104. Threadprivate Clause  Preserves global scope for per-thread storage  Legal for namespace-scope and file-scope  Use copyin to initialize from the master thread  struct Astruct A; #pragma omp threadprivate(A) … #pragma omp parallel copyin(A) do_something_to(&A); … #pragma omp parallel do_something_else_to(&A);  Private copies of "A" persist between regions
  105. Performance Issues  Idle threads do no useful work  Divide work among threads as evenly as possible  Threads should finish parallel tasks at the same time  Synchronization may be necessary  Minimize time waiting for protected resources
  106. Load Imbalance  Unequal work loads lead to idle threads and wasted time.  [Diagram: per-thread busy vs. idle time for #pragma omp parallel { #pragma omp for for( ; ; ){ } } ]
  107. Synchronization  Lost time waiting for locks  [Diagram: per-thread busy, idle and in-critical time for #pragma omp parallel { #pragma omp critical { ... } ... } ]
  108. Performance Tuning  Profilers use sampling to provide performance data.  Traditional profilers are of limited use for tuning OpenMP*:  Measure CPU time, not wall clock time  Do not report contention for synchronization objects  Cannot report load imbalance  Are unaware of OpenMP constructs  Programmers need profilers specifically designed for OpenMP.
  109. Static Scheduling: Doing It By Hand  Must know:  Number of threads (Nthrds)  Each thread's ID number (id)  Compute start and end iterations:  #pragma omp parallel { int i, istart, iend; istart = id * N / Nthrds; iend = (id+1) * N / Nthrds; for(i=istart; i<iend; i++){ c[i] = a[i] + b[i]; } }
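A runnable completion of the hand-scheduling sketch above (array contents invented): the thread id and team size come from the OpenMP API calls introduced earlier, and each thread computes its own [istart, iend) slice of the iteration space:

      #include <omp.h>
      #include <stdio.h>

      #define N 16

      int main(void)
      {
          float a[N], b[N], c[N];
          int i;
          for (i = 0; i < N; i++) { a[i] = (float)i; b[i] = 10.0f * i; }

          #pragma omp parallel
          {
              int id     = omp_get_thread_num();
              int nthrds = omp_get_num_threads();
              int istart = id * N / nthrds;         /* first iteration for this thread */
              int iend   = (id + 1) * N / nthrds;   /* one past the last iteration */
              int j;
              for (j = istart; j < iend; j++)
                  c[j] = a[j] + b[j];
          }

          for (i = 0; i < N; i++)
              printf("c[%d] = %.1f\n", i, c[i]);
          return 0;
      }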
  110. Summary  Discussed multi-core processors  Discussed OpenMP programming.
  111. Thank You

Editor's notes

  1. SMP: symmetric multi-processor. NUMA: non-uniform memory access.
  2. Today we have examples of 8-core devices, and roadmaps indicate that by 2008 or '09, 16-core devices will exist. Intel last year demoed some technology that they say will lead to an 80-core device in 5 years or so. It is probably a good bet that by 2020 we will have devices with 1K cores. So what is the problem with moving to TLP architectures? Obviously the future of processing performance growth looks bright using this approach. Highly sophisticated runtime extraction of instruction-level parallelism is no longer required, so it is no longer necessary to scrape the bottom of the barrel to get a few percent better performance. However, research in key areas is still needed: methods to reduce leakage currents as feature sizes continue to shrink; power management techniques; maybe asynchronous design. Off-chip memory and I/O bandwidth have become even more of a bottleneck, and improvements in this area are still needed (silicon lasers for free-air optical interfaces, for example). But these aren't the big problems ahead for us.
  3. The IBM Cell processor used in the PlayStation 3 has been shown to have an order-of-magnitude leap over many comparable technologies. This is due to some radical changes in micro-architecture: application-managed memory in the SPEs (no hardware cache), with explicit data movement via DMA required; SPEs are simple SIMD machines with no dedicated scalar pipeline and no sophisticated branch-prediction logic (mainly done with compile-time branch hints); very high bandwidth interconnect and very high chip I/O bandwidth (OK on memory bandwidth); heterogeneous processor core architecture (8 SPEs and 1 PPE, a PowerPC machine).
  4. OpenMP* Pragma Syntax: For Fortran, you can also use other sentinels (!$OMP), but these must line up exactly on columns 1-5. Column 6 must be blank or contain a + indicating that the line is a continuation of the previous line. C$OMP *$OMP !$OMP
  5. Parallel Regions Other Fortran Methods.
  6. Fortran will make all nested loop indices private.
  7. Data Scope Attributes All data clauses apply to parallel regions and worksharing constructs except “shared,” which only applies to parallel regions.
  8. Private Clause: The for-loop iteration variable is PRIVATE by default.
  9. Sections are distributed among the threads in the parallel team. Each section is executed only once and each thread may execute zero or more sections. It’s not possible to determine whether or not a section will be executed before another. Therefore, the output of one section should not serve as the input to another. Instead, the section that generates output should be moved before the sections construct.
  10. Atomic Construct: Since index[i] can be the same for different i values, the update to x must be protected. Use of a critical section would serialize updates to x. Atomic protects individual elements of the x array, so that if multiple, concurrent instances of index[i] are different, updates can still be done in parallel.
  11. Load Imbalance: Ways to relieve load imbalance: static even scheduling (equal-size iteration chunks based on runtime loop limits; totally parallel scheduling; the OpenMP* default) and dynamic or guided scheduling (the threads do some work, then get the next chunk; there is some cost in the scheduling algorithm).
  12. Synchronization: Methods for tuning synchronization: reduce lock contention (use different named critical sections, introduce domain-specific locks); merge parallel loops and remove barriers; merge small critical sections; move critical sections outside loops.
  13. Performance Tuning Mention time inside synch constructs, and so on.