  1. Building Affordable and Programmable Exascale Capable Computers SURFsara, 18 April 2019
  2. Think Ångströms, not nanometers, in 2019. Ideally we should steer the movement of almost every individual electron to solve our problems. 0.1 nm = 1 Å, so 14 nm = 140 Å: only 100s of Si atoms, and very few atoms at e.g. 3 nm / 30 Å (6 to 12 atoms). [Length-scale figure: C-C bond ≈ 1 Å; DNA, glucose, hemoglobin, ribosome and cells span roughly 10 Å to 10⁴ Å, up to the light-microscope resolution limit.]
  3. Moving data on-chip will use as much energy as computing with it; moving data off-chip will use 200x more energy, and is much slower as well.
     The power challenge (energy per operation, Today* vs 2020):
       Double-precision float op:        ~20 pJ    → <10 pJ
       Moving data on-chip, 1 mm:          6 pJ
       Moving data on-chip, 20 mm:       120 pJ
       Moving data to off-chip memory: 4,000 pJ    → 2,000 pJ
     The data-movement challenge: next-generation computer systems should take data movement at all levels very seriously.
  4. Data movement challenge confirmed: the wires that carry the data (and instructions) become more and more important (courtesy: NVIDIA and ITRS). BUT …
  5. Why is all of this important?
     “… without dramatic increases in efficiency, ICT industry could use 20% of all electricity and emit up to 5.5% of the world’s carbon emissions by 2025.”
     “We have a tsunami of data approaching. Everything which can be is being digitalised. It is a perfect storm.”
     “… a single $1bn Apple data centre planned for Athenry in Co Galway, expects to eventually use 300MW of electricity, or over 8% of the national capacity and more than the daily entire usage of Dublin. It will require 144 large diesel generators as back up for when the wind does not blow.”
  6. Build Computers for your Problem and Data. Computing in Time: follow a recipe step by step, one step at a time. Computing in Space: build a “recipe-specific” factory with multiple paths performed simultaneously, producing one result per clock cycle; efficient, predictable, reliable “mass production” of huge amounts of data.
  7. Programming a Dataflow “mass production” Engine: 1. describe the Conjugate Gradient as a dataflow graph; 2. compile the dataflow structure and load it to hardware; 3. stream data through the custom accelerator. Create customized mega-accelerators with massive inherent throughput. (A minimal kernel sketch follows below.)
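     A minimal MaxJ sketch of step 1, describing a computation as a dataflow graph. This is not the BQCD Conjugate Gradient kernel from the later slides; it streams a CG-style vector update y = y + alpha*x, and the kernel name, stream names and the dfeFloat(8, 24) format are illustrative assumptions. Imports from the MaxCompiler kernel packages are omitted, as on the other code slides.

     class AxpyKernel extends Kernel {
       AxpyKernel(KernelParameters parameters) {
         super(parameters);
         // alpha is a run-time scalar set once by the CPU; x and y stream in element by element
         DFEVar alpha = io.scalarInput("alpha", dfeFloat(8, 24));
         DFEVar x = io.input("x", dfeFloat(8, 24));
         DFEVar y = io.input("y", dfeFloat(8, 24));
         // the multiply and the add each occupy their own hardware; once the pipeline is full,
         // one updated vector element leaves the kernel every clock cycle
         io.output("yOut", y + alpha * x, dfeFloat(8, 24));
       }
     }

     Steps 2 and 3 then correspond to compiling this graph into a DFE configuration and streaming the vectors through it.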
  8. Solving Computing Problems Vertically: co-optimise the HW and the SW stack for the performance-critical areas of the application. [Stack figure: Problem, Algorithm, Program in HL Language, Machine Architecture, Implementation, Circuits, Devices, with solutions cutting across the layers.]
  9. From Equations to Dataflow Hardware. [The slide shows an atmospheric momentum equation (terms in u, v, s, p, T, R, a and h, including a ln p pressure-gradient term) being mapped onto dataflow hardware; the equation itself is not recoverable from this transcript.]
  10. Real dataflow graph as generated by MaxCompiler: 4,866 nodes; 10,000s of stages/cycles. Full customization in Space, Value and Time (SVT).
  11. Easy it is not (and not really new). Slotnick’s law (of effort): “The parallel approach to computing does require that some original thinking be done about numerical analysis and data management in order to secure efficient use. In an environment which has represented the absence of the need to think as the highest virtue this is a decided disadvantage.” Daniel Slotnick (1931-1985), Chief Architect of Illiac IV.
  12. Programming in Space basics:
     • Control- and data-flows are decoupled; both are fully programmable.
     • Operations exist in space and by default run in parallel; their number is limited only by the available space.
     • All operations can be customized at various levels, e.g., from the algorithm down to the number representation.
     • Multiple operations constitute kernels.
     • Data streams through the operations / kernels.
     • The data transport and processing can be balanced.
     • All resources work all of the time for maximum performance.
     • The in/out data rates determine the operating frequency.
     Equally spread the available “forces” and move no faster than required by the application. (A small kernel sketch illustrating parallel operations and custom number formats follows below.)
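     To illustrate two of the points above (independent operations running in parallel in space, and the number representation as a design choice), here is a hedged MaxJ sketch; the kernel name, stream names and the reduced-precision dfeFloat(5, 11) format are assumptions, not taken from the slides.

     class TwoOpsKernel extends Kernel {
       TwoOpsKernel(KernelParameters parameters) {
         super(parameters);
         // the number representation is just a parameter of the design, here a reduced-precision float
         DFEVar x = io.input("x", dfeFloat(5, 11));
         DFEVar y = io.input("y", dfeFloat(5, 11));
         // the adder and the multiplier are separate pieces of hardware laid out in space;
         // both results are produced on every clock cycle, in parallel
         io.output("sum",  x + y, dfeFloat(5, 11));
         io.output("prod", x * y, dfeFloat(5, 11));
       }
     }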
  13. The Computational Model.
     • Dataflow sub-system (DataFlow Engine, DFE):
       – spatial arithmetic chip “hardware” technology with flexible arithmetic units and programmable interconnect (looks like FPGAs but is not limited to them);
       – programmable static dataflow;
       – systolic execution at kernel level;
       – streaming custom computing at system level;
       – implicit GALS* IO and kernel-to-kernel communication.
     • Dedicated software (MaxCompiler, MaxelerOS and SLiC):
       – compilation toolchain and design methodology;
       – incorporated simulation and debug environment for rapid development;
       – fully integrated Linux runtime system and low-level software support;
       – helps the designer focus on the data/algorithm and the system architecture.
     • Only three basic memory types (explicitly exposed):
       – Scalars (exposed to the CPU);
       – Fast Memory (FMEM): small and fast (on-chip);
       – Large Memory (LMEM): large and slow (off-chip).
     * GALS: Globally Asynchronous Locally Synchronous. (A small kernel sketch of the memory types follows below.)
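     A sketch of how the memory types surface in kernel code, reusing only calls that appear later in this deck (io.scalarInput and mem.romMapped on slide 25); the kernel name, stream names and table size are illustrative. LMEM does not appear here because its streams are set up in the Manager rather than inside the kernel.

     class MemoryTypesKernel extends Kernel {
       MemoryTypesKernel(KernelParameters parameters) {
         super(parameters);
         // Scalar: a single value written by the CPU before the stream starts
         DFEVar offset = io.scalarInput("offset", dfeUInt(8));
         // FMEM: a small, fast on-chip table, here a 256-entry mapped ROM indexed per data item
         DFEVar q = io.input("q", dfeUInt(8));
         DFEVar coeff = mem.romMapped("table", offset + q, dfeFloat(8, 24), 256);
         DFEVar p = io.input("p", dfeFloat(8, 24));
         io.output("r", p * coeff, dfeFloat(8, 24));
       }
     }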
  14. Maxeler’s DataFlow Engine (DFE, MAX4). [Board diagram: reconfigurable compute fabric with dataflow cores & FMEM (Fast Memory); LMEM (Large Memory, 4-96 GB) over a high-bandwidth memory link; MaxRing interconnect links; and a link to the main data network (e.g., PCIe, Infiniband).] Key figures: 48 GB DRAM (LMEM), Stratix V D8, MaxRing interconnect, 4,000 multipliers, 700K logic cells, 6.25 MB of FMEM.
  15. Application-level components: a host application (C, Python, Matlab, ...) runs on the CPU and talks to the DFE through SLiC and MaxelerOS over PCI Express or Infiniband. On the DFE, Kernels (MaxJ) instantiate the arithmetic structure, and the Manager (MaxJ) arranges the data orchestration between the kernels, the memories and the interconnect. (A rough manager sketch follows below.)
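     For orientation, a rough MaxJ sketch of what the Manager side can look like. The slide does not show this code: CustomManager, addKernel, addStreamFromCPU, addStreamToCPU and makeKernelParameters follow the general MaxCompiler manager style and are assumptions that may differ between tool versions; MyKernel is the kernel with inputs p, q and output r shown on slide 25.

     class MyManager extends CustomManager {
       MyManager(EngineParameters params) {
         super(params);
         // instantiate the kernel and connect its streams to the CPU over PCI Express / Infiniband
         KernelBlock k = addKernel(new MyKernel(makeKernelParameters("MyKernel")));
         k.getInput("p") <== addStreamFromCPU("p");
         k.getInput("q") <== addStreamFromCPU("q");
         addStreamToCPU("r") <== k.getOutput("r");
       }
     }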
  16. MaxJ: Moving Average of three numbers. Dataflow computing in hardware using a language you know. (The code image is reconstructed below as a sketch.)
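     The code image from this slide is not in the transcript. Below is a minimal sketch of the well-known MaxCompiler moving-average example, reconstructed from the standard tutorial rather than from the slide itself; boundary handling for the first and last elements is omitted, and the names and the dfeFloat(8, 24) type are assumptions.

     class MovingAverageKernel extends Kernel {
       MovingAverageKernel(KernelParameters parameters) {
         super(parameters);
         DFEVar x = io.input("x", dfeFloat(8, 24));
         // stream.offset taps the same stream one element in the past and one in the future,
         // so the three-point window exists in space rather than in a loop
         DFEVar prev = stream.offset(x, -1);
         DFEVar next = stream.offset(x, 1);
         io.output("y", (prev + x + next) / 3, dfeFloat(8, 24));
       }
     }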
  17. Simple example: y = x² + 30.
     DFEVar x = io.input("x", dfeFloat(10,31));
     DFEVar result = x * x + 30;
     io.output("y", result, dfeFloat(10,31));
     [Dataflow graph: x feeds a multiplier (x * x), then an adder (+ 30), then the output y.]
  18. MaxJ example: Control in Space.
     class SimpleKernel extends Kernel {
       SimpleKernel() {
         DFEVar x = io.input("x", dfeInt(24));
         DFEVar result = (x > 10) ? x + 1 : x - 1;
         io.output("y", result, dfeInt(25));
       }
     }
     [Dataflow graph: both x + 1 and x - 1 are built in hardware; the comparison x > 10 selects between them.]
  19. A non-traditional design process: ANALYSE, ARCHITECT, PROGRAM, GENERATE DATAFLOW, SIMULATE AND DEBUG, check whether the result is OK and iterate (each hardware-generation pass takes many hours), and finally produce the custom HW. Used to build real systems; however, it is very difficult to learn and to teach.
  20. Multiple scales of computing: optimizations at all levels (flow/time and space). Important features for optimization:
     • complete system level: balance compute, storage and IO;
     • parallel node level: maximize utilization of compute and interconnect;
     • microarchitecture level: minimize data movement;
     • arithmetic level: trade off range, precision and accuracy, i.e., discretize in Time, Space and Value;
     • bit level: encode and add redundancy;
     • transistor level: manipulate ‘0’ and ‘1’;
     and more, e.g., trade/hide Communication (Time) for/behind Computation (Space).
  21. Some of the challenges (ongoing effort on improved methodologies and tools):
     1. Higher chip / system price compared to microprocessors.
     2. Long design lead times (3 months in the best case):
        a. complex numerical transformations;
        b. non-trivial area and data-movement optimizations.
     3. “Painful” place & route times (12 to 24 hours):
        a. expensive vendor-specific tools;
        b. serious developments call for dedicated build clusters.
     4. Need to compete at 200 MHz with processors at 3 GHz.
     5. Current HW technology is sub-optimal: on-chip memory not built for stream processing; on-chip interconnect overdesigned for dataflow.
     6. Long learning curve (tools and methods needed).
     7. Designer productivity should improve (tools and methods).
     8. …
  22. Multiple platforms, single DFE abstraction: the same application and MaxJ code migrates, performance-portably, between gen4 DFEs (Intel based) and gen5 DFEs (Xilinx based). [Board diagram as on slide 14: reconfigurable compute fabric with dataflow cores & FMEM (Fast Memory), LMEM (Large Memory, 4-96 GB) over a high-bandwidth memory link, MaxRing links, and a link to the main data network (e.g., PCIe, Infiniband).]
  23. Substrate-agnostic compilation:
     • MaxCompiler generates VHDL ready for the FPGA vendor tools.
     • Synthesis transforms the VHDL into a logical “netlist”: sets of basic logic expressions.
     • Map fits the basic logic into N-input look-up tables (LUTs).
     • Place puts LUTs, DSPs, RAMs etc. at specific locations on the chip.
     • Route sets up the wiring between blocks.
     Flow: MaxCompiler compilation → VHDL → Synthesis → netlist → Map → LUTs → Place → placed FPGA → Route → complete FPGA → Generate MaxFile.
  24. DFE Place and Route example (the FPGA vendor-specific back-end tool flow, abstracted by MaxCompiler):
     Mon 16:27: MaxCompiler version: 2012.2
     Mon 16:27: Build "MyKernel" start time: Mon Apr 08 16:27:24 BST 2013
     Mon 16:27: Main build process running as user training1 on host Maxworkstation7478
     Mon 16:27: Build location: /home/training1/maxcompiler-builds/MyKernel
     Mon 16:27: Instantiating manager
     Mon 16:27: Instantiating kernel "MyKernel"
     Mon 16:27: Compiling manager (CPU I/O Only)
     Mon 16:27: Compiling kernel "MyKernel"
     Mon 16:27: Generating input files (VHDL, netlists, CoreGen)
     Mon 16:27: Running back-end build (12 phases)
     Mon 16:27: (1/12) - Prepare MaxFile Data (GenerateMaxFileDataFile)
     Mon 16:27: (2/12) - Synthesize DFE Modules (XST)
     Mon 16:30: (3/12) - Link DFE Modules (NGCBuild)
     Mon 16:30: (4/12) - Prepare for Resource Analysis (EDIF2MxruBuildPass)
     Mon 16:30: (5/12) - Generate Preliminary Annotated Source Code
     Mon 16:30: (6/12) - Report Resource Usage (ResourceCounter)
     Mon 16:30: About to start chip vendor Map/Place/Route toolflow. This will take some time.
     Mon 16:30: (7/12) - Prepare for Placement (NGDBuild)
     Mon 16:30: (8/12) - Place and Route DFE (MPPR)
     Mon 16:30: Executing MPPR with 1 cost tables and 1 threads.
     Mon 16:30: MPPR: Starting 1 cost table
     Mon 16:43: MPPR: Cost table 1 met timing with score 0 (best score 0)
     Mon 16:43: (9/12) - Prepare for Resource Analysis (XDLBuild)
     Mon 16:44: (10/12) - Generate Resource Report (ResourceUsageBuildPass)
     Mon 16:44: (11/12) - Generate Annotated Source Code (ResourceAnnotationBuildPass)
     Mon 16:44: (12/12) - Generate MaxFile (GenerateMaxFile)
     Mon 16:45:
     Mon 16:45: FINAL RESOURCE USAGE
     Mon 16:45: LUTs:  9503 / 149760 (6.35%)
     Mon 16:45: FFs:   12749 / 149760 (8.51%)
     Mon 16:45: BRAMs: 34 / 516 (6.59%)
     Mon 16:45: DSPs:  0 / 1056 (0.00%)
     Mon 16:45:
     Mon 16:45: MaxFile: /home/training1/maxcompiler-builds/MyKernel/results/MyKernel.max (MD5Sum: e564cd922aeeda04acfa2f4ecce8236d)
     Mon 16:45: Build completed: Mon Apr 08 16:45:58 BST 2013 (took 18 mins, 33 secs)
  25. DFE resource usage reporting.
     • Allows you to see which lines of code are using which resources and to focus optimization effort.
     • Separate reports for each kernel and for the manager.
     • Different operations use different resources (DSP blocks, block RAMs, IO blocks, LUTs/FFs).
     LUTs    FFs     BRAMs   DSPs    : MyKernel.java
     727     871     1.0     2       : resources used by this file
     0.24%   0.15%   0.09%   0.10%   : % of available
     71.41%  61.82%  100.00% 100.00% : % of total used
     94.29%  97.21%  100.00% 100.00% : % of user resources
                                     :
                                     : public class MyKernel extends Kernel {
                                     :   public MyKernel (KernelParameters parameters) {
                                     :     super(parameters);
     1       31      0.0     0       :     DFEVar p = io.input("p", dfeFloat(8,24));
     2       9       0.0     0       :     DFEVar q = io.input("q", dfeUInt(8));
                                     :     DFEVar offset = io.scalarInput("offset", dfeUInt(8));
     8       8       0.0     0       :     DFEVar addr = offset + q;
     18      40      1.0     0       :     DFEVar v = mem.romMapped("table", addr, dfeFloat(8,24), 256);
     139     145     0.0     2       :     p = p * p;
     401     541     0.0     0       :     p = p + v;
                                     :     io.output("r", p, dfeFloat(8,24));
                                     :   }
                                     : }
  26. Optimization feedback: MaxCompiler gives detailed latency and area annotation back to the programmer, so you can evaluate the precise effect of code on latency and chip area. [Annotated example: a 12.8 ns operation followed by a 6.4 ns add gives 19.2 ns total compute latency.]
  27. Pilot system deployed at Jülich (http://www.prace-ri.eu/pcp/). Small pilot system deployed in Oct 2017:
     • one 1U MPC-X with 8 MAX5 DFEs;
     • one 1U AMD EPYC based server (Supermicro, 1 TB DDR4);
     • one 1U login head/build node.
     Scaling using the Amazon AWS cloud: MAX5 is fully compatible with F1 instances, giving elastic scaling between private and public resources. [System diagram: MPC-X node with MAX5 DFEs, EPYC CPU node, head/build node and remote users, connected via 2x Infiniband @ 56 Gbps, 10 Gbps links and ipmi.]
  28. PRACE-PCP: SpecFEM3D on DFE.
  29. The BQCD chip, aerial view: a scalable Conjugate Gradient design for the CG step of BQCD.
  30. Achieved performance, PRACE workloads (TTS = time to solution, ETS = energy to solution, DTS (F1) = dollar cost to solution on AWS F1):
     Problem (small / large)         System (composition, size)                       TTS [sec]   ETS [kWh]   DTS (F1)
     BQCD   32x32x32x32              PRACE pilot (8 DFEs, 64 EPYC cores), 2U          1,054       0.44        -
     BQCD   64x64x64x64              1PF equivalent (48 DFEs, 512 EPYC cores), 14U    1,703.8     4.26        $39.93
     NEMO   GYRE6                    PRACE pilot (8 DFEs, 64 EPYC cores), 2U          388         0.164       -
     NEMO   GYRE144                  1PF equivalent (48 DFEs, 92 EPYC cores), 8U      1,942       3.77        $42.72
     SFM3D  1 chunk x64x64           PRACE pilot (8 DFEs, 64 EPYC cores), 2U          232         0.096       -
     SFM3D  6 chunks x1,440x1,440    1PF equivalent (384 DFEs, 768 EPYC cores), 60U   5,150       70.1        $1,267.2
     QE     Al2O3                    PRACE pilot (8 DFEs, 64 EPYC cores), 2U          32          0.013       -
     QE     Ta2O5                    1PF equivalent (64 DFEs, 64 EPYC cores), 9U      3,210       7.58        $94.16
  31. Global weather simulation with DFEs in China.
     ⬥ L. Gan, H. Fu, W. Luk, C. Yang, W. Xue, X. Huang, Y. Zhang, and G. Yang, “Accelerating solvers for global atmospheric equations through mixed-precision data flow engine,” FPL 2013.
     ⬥ Joint research with Imperial College and Tsinghua University.
     ⬥ Simulating the atmosphere using the shallow water equation.
     An order of magnitude improvement over Linpack-driven supercomputer technology:
     Platform          Speedup   Efficiency
     6-core CPU        1x        1x
     Tianhe-1A node    23x       15x
     Maxeler MPC-X     330x      145x
  32. Conclusions:
     • A (fancy) name does not help with solving the problem at hand: Cloud, (Intelligent) Edge and Fog are just names, like … Maria.
     • FPGA is just a technology that can help bridge the gap to something better (spatial computing acceleration HW, quantum processing, …); just focus on building the best computer for the given job.
     • Learn, think, pioneer and always stay critical: abstraction is powerful but quite often not needed; use it with great care and remember Dan Slotnick.
     • We are turning Earth into a heterogeneous, planet-wide computer, so we should try not to kill it in the process.
     • There is a lot of interest in this topic.
  33. Questions? Contact me at georgi@maxeler.com, or find me on Google: “Georgi computer” should do.
  34. Some links with more information:
     Maxeler Multiscale Dataflow Computing: https://www.maxeler.com/technology/dataflow-computing/
     Computing in Space explained by Mike Flynn: http://www.openspl.org/what-is-openspl/
     Computing in Space course at Imperial College: http://cc.doc.ic.ac.uk/openspl16/
     Exciting applications for DFEs (and JDFEs): http://appgallery.maxeler.com
     Maxeler DFEs on AWS EC2 F1: https://aws.amazon.com/marketplace/seller-profile?id=2780c6ec-d326-47fc-9ff6-c66ab2ba202a
     Maxeler and Xilinx Alveo collaboration: https://www.xilinx.com/products/boards-and-kits/alveo.html
  35. Maxeler Applications Gallery: the Dataflow Engine (DFE) ecosystem (http://appgallery.maxeler.com/).
     ⬥ With over 150 universities in our university program, we created an App Gallery to enable the community to share applications, examples and demos.
     ⬥ The App Gallery is complemented by a teaching program, with the first successful course taught at Imperial College in 2014; see http://cc.doc.ic.ac.uk/openspl14.
     ⬥ Top 10 apps:
       ➢ Correlation: in real time, pairwise, on 6,000 streams
       ➢ 100% guaranteed packet capture
       ➢ Webserver, cache and load balancing
       ➢ Heston option pricer
       ➢ N-body simulation
       ➢ Regex matching (e.g., for security)
       ➢ Brain network simulation
       ➢ Quantum chromodynamics kernel
       ➢ Seismic imaging
       ➢ Real-time classification
     Dataflow apps and analytics for machine learning.
  36. Peer-reviewed dataflow publications:
     2008: Seismic imaging with dataflow engines, 25x faster. T. Nemeth et al. (Chevron), “An Implementation of the Acoustic Wave Equation,” Society of Exploration Geophysicists, Nov 2008.
     2010: Credit derivatives valuation and risk, from 8 hours to 2 minutes; American Finance Technology Award, with JP Morgan.
     2011: Modeling and imaging with Schlumberger. O. Lindtjörn et al. (Schlumberger), “Beyond Traditional Microprocessors for Geoscience High-Performance Computing Applications,” IEEE Micro, vol. 31, no. 2, March/April 2011.
     2012: Weather imaging with CRS4, 60x faster. D. Oriato, S. Tilbury (Maxeler), M. Marrocu, G. Pusceddu (CRS4), “Acceleration of a Meteorological Limited Area Model with Dataflow Engines,” 2012 Symposium on Application Accelerators in HPC.
     2013: Convergence of risk and trading in partnership with CME Group, and birth of the OpenSPL industry standard (www.openspl.org). Ari Studnitzer (CME Group), “In Cloud Computing it’s the Era of Convergence,” Open Markets Magazine.
     2014: Brain simulation with Erasmus. G. Smaragdos, C. Davies, C. Strydis, I. Sourdis, C. Ciobanu, O. Mencer, and C. I. De Zeeuw, “Real-Time Olivary Neuron Simulations on Dataflow Computing Machines,” Supercomputing, Springer, pp. 487-497.
     2017: High-energy physics with Imperial. S. Summers, A. Rose and P. Sanders, “Using MaxCompiler for the high level synthesis of trigger algorithms,” Journal of Instrumentation, Volume 12, IOP Publishing.
  37. Maxeler University Program.
  38. Maxeler trophy cabinet.
     Academic history since 2005:
     • Imperial College Research Excellence Award
     • Top EPSRC Advanced Fellowship
     • Two best-paper awards
     • An early dataflow paper by Maxeler’s founder has been recognized as one of the most influential papers at the FPL conference in the last 25 years.
     Recent commercial awards:
     • HPCwire Editors’ Choice Award, November 2011
     • American Finance Technology Awards, New York, winner, “Most Cutting Edge IT Initiative,” December 2011
     • Golden Arrow, “...for revolutionizing Computers,” COM-SULT, January 2012
     • Gartner “Cool Vendor of the Year,” March 2012
     • Frost and Sullivan “Most innovative IT vendor,” December 2013
     • CIO Review, 20 Most Promising Networking Companies, March 2014
     • CIO Review, 20 Most Promising HPC Companies, March 2015