The LEGaTO project has received funding from the European Union's Horizon 2020 research and
innovation programme under the grant agreement No 780681
10/13/20
LEGaTO:
Software Stack
Runtimes
HiPEAC 2020
Computer Systems Week
16-10-2020
Miquel Pericas
Chalmers University of Technology
HiPEAC CSW Autumn 2020
Outline
• Middleware – SLURM and RedFish
• OmpSs@FPGA (Xavier)
• XiTAO:
− Introduction: XiTAO execution model
− Energy Aware Scheduler
− Software Topologies
− Pipeline parallelism
• FPGA Undervolting
• Fault tolerance - GPU Checkpointing
Slurm and RECS Master
• Integration of Slurm with RECS Master
o Node specification in the Slurm configuration (partitions, limits…)
o Slurm gets the node specification and selects target nodes
o Allocates, joins and starts the nodes
o Executes the application(s)
o Shuts down the nodes and destroys the allocation
$ sinfo
PART… AVAIL LIMIT NODES STATE NODELIST
debug* up infinite 1 idle* pcxavim5
debug* up infinite 16 idle BB_1_[0,2-15],pcxavim6
Slurm and RECS Master
• Slurm contacts RECS Master at job execution and
termination times
#!/bin/bash
#SBATCH -N 10
#SBATCH --constraint=ARM,bigLITTLE,hasGPU
#SBATCH -o test-%j.out
#SBATCH -e test-%j.err
# App invocation
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 idle* pcxavim5
debug* up infinite 10 alloc BB_1_[0,2-10]
debug* up infinite 6 idle BB_1_[11-15],pcxavim6
$ sbatch batch-10-bl.sh
Submitted batch job 39
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
39 debug batch-10 xavim R 0:42 10 BB_1_[0,2-10]
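The node lists in the `sinfo` and `squeue` output above use Slurm's bracketed range notation. As a minimal sketch, the helper below expands such a list into individual node names so the node counts can be checked. The function name `expand_nodelist` is hypothetical, and the parser is deliberately simplified: it ignores zero-padded and nested ranges, which the real `scontrol show hostnames` command handles.

```python
import re


def expand_nodelist(nodelist: str) -> list[str]:
    """Expand a Slurm-style node list such as 'BB_1_[0,2-10],pcxavim6'."""
    names = []
    # Split on commas that are not inside square brackets.
    for item in re.split(r",(?![^\[]*\])", nodelist):
        match = re.fullmatch(r"(.*)\[([^\]]+)\]", item)
        if not match:
            names.append(item)  # plain host name without a range
            continue
        prefix, ranges = match.groups()
        for part in ranges.split(","):
            lo, _, hi = part.partition("-")
            # A single index such as '0' is treated as the range 0-0.
            for index in range(int(lo), int(hi or lo) + 1):
                names.append(f"{prefix}{index}")
    return names


allocated = expand_nodelist("BB_1_[0,2-10]")
idle = expand_nodelist("BB_1_[11-15],pcxavim6")
```

With the lists shown on the slide, `allocated` holds the 10 nodes requested by `#SBATCH -N 10` and `idle` holds the remaining 6.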
Slurm and RECS Master
• Composed nodes are created using the RECS Master web service
• They are started and stopped automatically
10 nodes are turned on
OmpSs@FPGA
● Offload of matrix multiplication to FPGA
#pragma omp target device(fpga) num_instances(3)
#pragma omp task in([BSIZE*BSIZE]a, [BSIZE*BSIZE]b) inout([BSIZE*BSIZE]c)
void matmulBlock(const elem_t *a, const elem_t *b, elem_t *c)
{
  #pragma HLS INLINE // off
  #pragma HLS array_partition variable=a cyclic factor=4
  #pragma HLS array_partition variable=b cyclic factor=BSIZE/4
  #pragma HLS array_partition variable=c cyclic factor=BSIZE/2
  for (int k = 0; k < BSIZE; ++k) {
    …
  }
}
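The loop body of `matmulBlock` is elided on the slide. As a reference for what such a block task is usually expected to compute, here is a plain Python sketch. It assumes row-major `BSIZE x BSIZE` blocks stored as flat arrays and an accumulating update `c += a * b` (suggested by the `inout` clause on `c`). The loop order and the helper `flops_per_block` are illustrative assumptions, not the FPGA kernel itself.

```python
def matmul_block(a, b, c, bsize):
    """Accumulate c += a * b for row-major bsize x bsize blocks in flat lists."""
    for i in range(bsize):
        for j in range(bsize):
            acc = c[i * bsize + j]
            for k in range(bsize):
                acc += a[i * bsize + k] * b[k * bsize + j]
            c[i * bsize + j] = acc


def flops_per_block(bsize):
    """One multiply and one add per innermost iteration."""
    return 2 * bsize ** 3


a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
c = [1, 0, 0, 1]
matmul_block(a, b, c, 2)
```

Under these assumptions, a 256x256 block costs `flops_per_block(256)` floating-point operations per task invocation.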
● Acceleration of matrix multiplication on FPGAs
− 4 ARM cores (OpenBLAS)
− 1 to 3 IP cores
● Block size 256x256
[Figure: "Matrix multiply, energy efficiency". GFlops and GFlops/W for 4 ARM cores, 1 IP core, 2 IP cores and 3 IP cores.]
● Best performance: 3 IP cores
● Best energy-efficiency: 2 IP cores
XiTAO: Energy Aware Scheduler
• Module 1: Power Profiling
− helps the runtime understand CPU power consumption trends (number/type of cores, different frequencies)
• Module 2: Dynamic Performance Modeling
− provides accurate predictions for future tasks given a set of resources
− independent of platforms and frequencies
− achieves scalability and portability goals
• Module 3: Idleness Tracing
− gives information about the real-time status of cores
− puts cores to "sleep" when they are under-utilized
− sleeping time follows an exponential backoff strategy
− provides the real-time parallel slackness of active cores => calculation of the shared board static power attributed to each running task
• Module 4: Task Mapping Algorithm (per task)
For a given configuration (start core, number of cores):
− Performance Tracer => Execution Time Prediction
− Power Profiles => Dynamic Power Prediction
− Power Profiles + Idleness Tracer => Static Power Prediction
− Energy Prediction = (Static Power + Dynamic Power) x Execution Time
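The Module 4 formula can be illustrated with a short sketch. It assumes that the three predictions are already available for each candidate configuration. The `Prediction` class, its field names and the numbers are hypothetical and only serve to show how the energy estimate ranks configurations; they are not XiTAO's data structures or measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    start_core: int
    num_cores: int
    exec_time_s: float      # assumed to come from the performance tracer
    dynamic_power_w: float  # assumed to come from the power profiles
    static_power_w: float   # assumed: power profiles + idleness tracer

    @property
    def energy_j(self) -> float:
        # Energy = (Static Power + Dynamic Power) x Execution Time
        return (self.static_power_w + self.dynamic_power_w) * self.exec_time_s


def lowest_energy(candidates):
    """Return the candidate configuration with the smallest predicted energy."""
    return min(candidates, key=lambda p: p.energy_j)


# Made-up predictions for one task on 1, 2 and 4 cores.
candidates = [
    Prediction(0, 1, exec_time_s=4.0, dynamic_power_w=1.0, static_power_w=0.5),
    Prediction(0, 2, exec_time_s=2.2, dynamic_power_w=1.8, static_power_w=0.5),
    Prediction(0, 4, exec_time_s=1.5, dynamic_power_w=3.5, static_power_w=0.5),
]
best = lowest_energy(candidates)
```

In this made-up example the widest configuration is the fastest, but the 2-core configuration has the lowest predicted energy, so an energy-driven mapper would pick it.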
XiTAO: Energy Aware Scheduler
● 31%-74% energy reduction compared to RWS
● 19%-68% energy reduction compared to FCC
● 25%-73% energy reduction compared to LCC
Name | Acronym | Notion
Random Work Stealing (+Sleep) | RWS (+S) | Typical greedy scheduling (enhanced with Sleep)
Fastest Cores with Criticality (+Sleep) | FCC (+S) | Critical tasks are mapped to the set of cores that minimizes execution time and are not allowed work stealing; noncritical tasks follow the parent queue and only search for the number of cores that minimizes the execution time of the task (enhanced with Sleep)
Lowest Cost with Criticality (+Sleep) | LCC (+S) | The difference between LCC and FCC is that minimizing execution time becomes minimizing parallel cost, where parallel cost means "execution time * number of cores" (enhanced with Sleep)
Lowest Energy without Criticality | LENC | Task scheduling targets the lowest energy; no need for criticality awareness
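The objectives in the table differ only in what they minimize. The sketch below contrasts them on one set of made-up predictions. It deliberately ignores criticality, work stealing and sleeping, so it illustrates the three objective functions rather than the schedulers themselves. `CANDIDATES`, `OBJECTIVES` and `choose` are hypothetical names.

```python
# (number of cores, predicted execution time in s, predicted energy in J)
# The values are invented for illustration only.
CANDIDATES = [(1, 4.0, 6.0), (2, 2.2, 5.06), (4, 1.5, 6.0)]

OBJECTIVES = {
    "FCC": lambda c: c[1],         # minimize execution time
    "LCC": lambda c: c[1] * c[0],  # minimize parallel cost = time * cores
    "LENC": lambda c: c[2],        # minimize energy
}


def choose(policy, candidates=CANDIDATES):
    """Pick the candidate that minimizes the objective of the given policy."""
    return min(candidates, key=OBJECTIVES[policy])


choices = {policy: choose(policy)[0] for policy in OBJECTIVES}
```

On these numbers the three objectives disagree: the execution-time objective selects 4 cores, the parallel-cost objective selects 1 core, and the energy objective selects 2 cores.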
XiTAO: Software & Hardware Topologies
• Mapping logical data locations to physical locations (to create a model per locality)
• The Software Topology Address (STA) is a portable key that the XiTAO runtime interprets to map a task to a place.
• Example: a space-filling order is used as an STA, transforming coordinates to an integer for Cartesian inputs. The paper includes other examples of such keys.
• This STA-to-location mapping is leveraged to model performance per task's data locality
• A performance model is created per (STA, task_type) tuple
• The energy-aware model could potentially be used here.
• Example of a system's elastic partitions to be used by the model
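To make the space-filling-order example concrete, the sketch below uses a Z-order (Morton) index as the STA for 2D Cartesian coordinates. The slide does not say which space-filling curve is used, so Morton order is an assumption, and `sta_to_leader` is a hypothetical proportional mapping from STA to a core id, not the XiTAO runtime's actual policy.

```python
def morton_2d(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into a Z-order index."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x fills the even bit positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y fills the odd bit positions
    return key


def sta_to_leader(sta: int, max_sta: int, num_cores: int) -> int:
    """Map an STA in [0, max_sta] proportionally onto a core id (assumption)."""
    return min(num_cores - 1, sta * num_cores // (max_sta + 1))


# STAs for a 4x4 grid of data blocks, mapped onto 4 cores.
stas = {(x, y): morton_2d(x, y) for x in range(4) for y in range(4)}
leaders = {xy: sta_to_leader(sta, max_sta=15, num_cores=4) for xy, sta in stas.items()}
```

Because neighbouring blocks receive nearby Morton indices, each 2x2 quadrant of the grid lands on the same core in this example, which is the kind of locality a per-STA model can exploit.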
XiTAO: Model Validation on DAG Chain
• Adaptive resource selection (leader, width) for a cache-intensive task. Green marks the NUMA node where the task (identified by its STA) is initialized.
• The scheduler mostly chooses widths 1 and 2 (within the shared L2 cache).
• Adaptive resource selection (leader, width) for a memory-intensive task.
• The scheduler mostly chooses width 12 (a socket encapsulating 2 NUMA nodes).
• Random work-stealing behavior for compute-bound tasks, with a preference for larger widths.
• Scalability of the model running memory-bound DAG chains: up to 2.5x speedup with larger task counts.
• To validate the STA-driven performance modeling, we
− test on a 4-socket AMD system (2 NUMA nodes each)
− print a resource selection trace of a chain of tasks
• The scheduler adaptively behaves as locality-aware for memory/cache-intensive tasks, and as a work-stealing scheduler for compute-bound tasks.
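The adaptive width selection described above can be pictured as a lookup table of observed execution times. The sketch below keeps a running average per (STA, task type) and width, and picks the width with the lowest parallel cost once every width has been sampled. The class name, the averaging weight and the cost criterion are all assumptions made for illustration, and leader selection is omitted. The slides do not specify how XiTAO implements its model.

```python
from collections import defaultdict


class LocalityModel:
    """Running-average execution times keyed by (sta, task_type), per width."""

    def __init__(self, widths=(1, 2, 4)):
        self.widths = widths
        self.table = defaultdict(dict)  # (sta, task_type) -> {width: avg time}

    def record(self, sta, task_type, width, seconds, weight=0.25):
        """Blend a new observation into the running average for this width."""
        entry = self.table[(sta, task_type)]
        old = entry.get(width)
        entry[width] = seconds if old is None else (1 - weight) * old + weight * seconds

    def pick_width(self, sta, task_type):
        """Explore unsampled widths first, then minimize time * width."""
        entry = self.table[(sta, task_type)]
        for width in self.widths:
            if width not in entry:
                return width
        return min(self.widths, key=lambda w: entry[w] * w)


model = LocalityModel()
first = model.pick_width(7, "copy")      # nothing sampled yet
model.record(7, "copy", 1, 4.0)
model.record(7, "copy", 2, 1.5)
model.record(7, "copy", 4, 1.2)
trained = model.pick_width(7, "copy")    # parallel costs: 4.0, 3.0, 4.8
```

A task type that stops scaling beyond a shared cache would show the same pattern as this example: the time barely improves from width 2 to width 4, so the narrower width wins on cost.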
XiTAO: Moldable pipelines for CNNs on heterogeneous edge devices
● A simple template tensor language for developing CNNs.
● XiTAO pipelines are generated using the information provided by the language interface.
● An online training phase determines the optimal pipeline configuration:
• Distribution of network layers among pipeline stages
• Resource partitioning among pipeline stages
● The training is led by a search algorithm that uses computational hints provided by the language interface.
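One part of the configuration search, distributing layers among pipeline stages, can be sketched as follows. The example assumes each layer comes with a numeric cost hint and exhaustively tries every contiguous split, keeping the one with the smallest bottleneck stage. This is an offline stand-in chosen for clarity, not the online search algorithm used in the project, and it leaves out resource partitioning. `best_layer_split` and the cost values are hypothetical.

```python
from itertools import combinations


def best_layer_split(layer_costs, num_stages):
    """Return (bottleneck, stages) minimizing the slowest stage's total cost."""
    n = len(layer_costs)
    best = None
    # Choose num_stages - 1 cut points between consecutive layers.
    for cuts in combinations(range(1, n), num_stages - 1):
        bounds = (0, *cuts, n)
        stages = [layer_costs[lo:hi] for lo, hi in zip(bounds, bounds[1:])]
        bottleneck = max(sum(stage) for stage in stages)
        if best is None or bottleneck < best[0]:
            best = (bottleneck, stages)
    return best


hints = [4, 2, 3, 1, 5]  # made-up per-layer cost hints
two_stage = best_layer_split(hints, 2)
three_stage = best_layer_split(hints, 3)
```

The pipeline's throughput is limited by its slowest stage, which is why the sketch minimizes the bottleneck rather than the total cost.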
Network description in template language
main(){
  …
  Conv1 = CONV(ip, op, weights);
  Conv2 = CONV(Conv1, op, weights);
  …
  network.add(Conv1);
  network.add(Conv2);
  …
  network.execute();
}
FPGA Undervolting
Problem: FPGAs are at least 10X less power-efficient than equivalent ASICs
Goal: Bridge the power-efficiency gap between ASICs and FPGAs by undervolting below the nominal level
• Case study: power consumption of neural networks is a main concern
✔ Hardware acceleration: GPUs, FPGAs, and ASICs
Evaluation Setup
✔ 5 image classification workloads
✔ 3 Xilinx UltraScale+ ZCU102 platforms
✔ 2 on-chip voltage rails
Main Results
✔ Large voltage guardband (33% on average)
✔ >3X power-efficiency gain
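A first-order way to see why undervolting pays off is the standard CMOS relation that dynamic power scales with the square of the supply voltage times the clock frequency. The sketch below applies it to the 33% guardband. It is a textbook approximation that ignores static power and any measured board-level effects, so it is not meant to reproduce the reported >3X gain. The function name is hypothetical.

```python
def relative_dynamic_power(voltage_fraction: float, frequency_fraction: float = 1.0) -> float:
    """Dynamic power relative to nominal, using P_dyn proportional to V^2 * f."""
    return voltage_fraction ** 2 * frequency_fraction


guardband = 0.33  # average voltage guardband reported on the slide
at_guardband_edge = relative_dynamic_power(1.0 - guardband)
dynamic_only_gain = 1.0 / at_guardband_edge
```

Under this approximation, removing the guardband alone at unchanged frequency leaves about 45% of the nominal dynamic power.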
Overall Voltage Behavior
Slight variation of voltage behavior across platforms and benchmarks
Guardband
❑ No performance or reliability loss
❑ Added by the vendor to cover worst-case conditions
❑ Large guardband, average of 33%
Critical
❑ A narrow voltage region
❑ Neural network accuracy collapses
Crash
❑ FPGA stops operating
GPU Checkpointing with FTI
● Transparent multi-GPU/multi-node checkpointing
● Parallel streams to improve I/O efficiency
● Fast checksum calculation using a GPU MD5 algorithm
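The idea of checksumming checkpoint data in parallel can be illustrated on the CPU. The sketch below splits a buffer into fixed-size chunks and computes an MD5 digest per chunk with a thread pool. It is only an analogue of the approach: it does not use the FTI API or a GPU, and the function names and chunking scheme are assumptions.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def chunk_md5s(data: bytes, chunk_size: int, workers: int = 4) -> list[str]:
    """MD5 digest of each fixed-size chunk, computed by a pool of worker threads."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda chunk: hashlib.md5(chunk).hexdigest(), chunks))


def verify(data: bytes, chunk_size: int, expected: list[str]) -> bool:
    """Recompute the per-chunk digests and compare them with the stored ones."""
    return chunk_md5s(data, chunk_size) == expected


checkpoint = bytes(range(256)) * 4           # stand-in for checkpoint data
digests = chunk_md5s(checkpoint, chunk_size=256)
corrupted = b"\xff" + checkpoint[1:]         # flip the first byte
```

Per-chunk digests also localize corruption: in this example only the first digest changes when the first byte is altered.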
GPU Checkpointing with FTI
● Over 100x speedup with the new GPU MD5 algorithm
● Checkpoint takes less than 1 second
● FPGA checkpoint implementation coming
Thank you!

Contenu connexe

Tendances

Automated Design Space Exploration and Roofline Analysis for FPGA-based HLS A...
Automated Design Space Exploration and Roofline Analysis for FPGA-based HLS A...Automated Design Space Exploration and Roofline Analysis for FPGA-based HLS A...
Automated Design Space Exploration and Roofline Analysis for FPGA-based HLS A...
NECST Lab @ Politecnico di Milano
 
"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM
"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM
"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM
Edge AI and Vision Alliance
 
Moldable pipelines for CNNs on heterogeneous edge devices
Moldable pipelines for CNNs on heterogeneous edge devicesMoldable pipelines for CNNs on heterogeneous edge devices
Moldable pipelines for CNNs on heterogeneous edge devices
LEGATO project
 

Tendances (20)

IBM HPC Transformation with AI
IBM HPC Transformation with AI IBM HPC Transformation with AI
IBM HPC Transformation with AI
 
Automated Design Space Exploration and Roofline Analysis for FPGA-based HLS A...
Automated Design Space Exploration and Roofline Analysis for FPGA-based HLS A...Automated Design Space Exploration and Roofline Analysis for FPGA-based HLS A...
Automated Design Space Exploration and Roofline Analysis for FPGA-based HLS A...
 
Programming Languages & Tools for Higher Performance & Productivity
Programming Languages & Tools for Higher Performance & ProductivityProgramming Languages & Tools for Higher Performance & Productivity
Programming Languages & Tools for Higher Performance & Productivity
 
"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM
"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM
"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM
 
Bs25412419
Bs25412419Bs25412419
Bs25412419
 
FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote)
FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote) FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote)
FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote)
 
OpenPOWER System Marconi100
OpenPOWER System Marconi100OpenPOWER System Marconi100
OpenPOWER System Marconi100
 
A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...
A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...
A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...
 
Assisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated ArchitectureAssisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated Architecture
 
Moldable pipelines for CNNs on heterogeneous edge devices
Moldable pipelines for CNNs on heterogeneous edge devicesMoldable pipelines for CNNs on heterogeneous edge devices
Moldable pipelines for CNNs on heterogeneous edge devices
 
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E...
 
Demosaic RTL for ISP workflow
Demosaic RTL for ISP workflowDemosaic RTL for ISP workflow
Demosaic RTL for ISP workflow
 
Device Data Directory and Asynchronous execution: A path to heterogeneous com...
Device Data Directory and Asynchronous execution: A path to heterogeneous com...Device Data Directory and Asynchronous execution: A path to heterogeneous com...
Device Data Directory and Asynchronous execution: A path to heterogeneous com...
 
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F...
 
Iaetsd multioperand redundant adders on fpg as
Iaetsd multioperand redundant adders on fpg asIaetsd multioperand redundant adders on fpg as
Iaetsd multioperand redundant adders on fpg as
 
SX Aurora TSUBASA (Vector Engine) a Brand-new Vector Supercomputing power in...
SX Aurora TSUBASA  (Vector Engine) a Brand-new Vector Supercomputing power in...SX Aurora TSUBASA  (Vector Engine) a Brand-new Vector Supercomputing power in...
SX Aurora TSUBASA (Vector Engine) a Brand-new Vector Supercomputing power in...
 
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ...
 
Extracting a Rails Engine to a separated application
Extracting a Rails Engine to a separated applicationExtracting a Rails Engine to a separated application
Extracting a Rails Engine to a separated application
 
A CGRA-based Approach for Accelerating Convolutional Neural Networks
A CGRA-based Approachfor Accelerating Convolutional Neural NetworksA CGRA-based Approachfor Accelerating Convolutional Neural Networks
A CGRA-based Approach for Accelerating Convolutional Neural Networks
 
An open flow for dn ns on ultra low-power RISC-V cores
An open flow for dn ns on ultra low-power RISC-V coresAn open flow for dn ns on ultra low-power RISC-V cores
An open flow for dn ns on ultra low-power RISC-V cores
 

Similaire à LEGaTO: Software Stack Runtimes

186 devlin p-poster(2)
186 devlin p-poster(2)186 devlin p-poster(2)
186 devlin p-poster(2)
vaidehi87
 
Track A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBMTrack A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBM
chiportal
 
hetshah_resume
hetshah_resumehetshah_resume
hetshah_resume
het shah
 

Similaire à LEGaTO: Software Stack Runtimes (20)

byteLAKE's Alveo FPGA Solutions
byteLAKE's Alveo FPGA SolutionsbyteLAKE's Alveo FPGA Solutions
byteLAKE's Alveo FPGA Solutions
 
Exploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudExploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC Cloud
 
DATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe ConferenceDATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe Conference
 
Introduction to FPGA acceleration
Introduction to FPGA accelerationIntroduction to FPGA acceleration
Introduction to FPGA acceleration
 
The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...The CAOS framework: democratize the acceleration of compute intensive applica...
The CAOS framework: democratize the acceleration of compute intensive applica...
 
electronics-11-03883.pdf
electronics-11-03883.pdfelectronics-11-03883.pdf
electronics-11-03883.pdf
 
186 devlin p-poster(2)
186 devlin p-poster(2)186 devlin p-poster(2)
186 devlin p-poster(2)
 
FPGA In a Nutshell
FPGA In a NutshellFPGA In a Nutshell
FPGA In a Nutshell
 
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
 
SF-TAP: Scalable and Flexible Traffic Analysis Platform (USENIX LISA 2015)
SF-TAP: Scalable and Flexible Traffic Analysis Platform (USENIX LISA 2015)SF-TAP: Scalable and Flexible Traffic Analysis Platform (USENIX LISA 2015)
SF-TAP: Scalable and Flexible Traffic Analysis Platform (USENIX LISA 2015)
 
The CAOS framework: Democratize the acceleration of compute intensive applica...
The CAOS framework: Democratize the acceleration of compute intensive applica...The CAOS framework: Democratize the acceleration of compute intensive applica...
The CAOS framework: Democratize the acceleration of compute intensive applica...
 
SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...
SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...
SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...
 
Realizing Robust and Scalable Evolutionary Algorithms toward Exascale Era
Realizing Robust and Scalable Evolutionary Algorithms toward Exascale EraRealizing Robust and Scalable Evolutionary Algorithms toward Exascale Era
Realizing Robust and Scalable Evolutionary Algorithms toward Exascale Era
 
Barcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaBarcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de Riqueza
 
Track A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBMTrack A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBM
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
 
hetshah_resume
hetshah_resumehetshah_resume
hetshah_resume
 
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesPerformance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
 
6Tisch telecom_bretagne_2016
6Tisch telecom_bretagne_20166Tisch telecom_bretagne_2016
6Tisch telecom_bretagne_2016
 

Plus de LEGATO project

HiPerMAb: A statistical tool for judging the potential of short fat data
HiPerMAb: A statistical tool for judging the potential of short fat dataHiPerMAb: A statistical tool for judging the potential of short fat data
HiPerMAb: A statistical tool for judging the potential of short fat data
LEGATO project
 

Plus de LEGATO project (20)

Scrooge Attack: Undervolting ARM Processors for Profit
Scrooge Attack: Undervolting ARM Processors for ProfitScrooge Attack: Undervolting ARM Processors for Profit
Scrooge Attack: Undervolting ARM Processors for Profit
 
A practical approach for updating an integrity-enforced operating system
A practical approach for updating an integrity-enforced operating systemA practical approach for updating an integrity-enforced operating system
A practical approach for updating an integrity-enforced operating system
 
TEEMon: A continuous performance monitoring framework for TEEs
TEEMon: A continuous performance monitoring framework for TEEsTEEMon: A continuous performance monitoring framework for TEEs
TEEMon: A continuous performance monitoring framework for TEEs
 
secureTF: A Secure TensorFlow Framework
secureTF: A Secure TensorFlow FrameworksecureTF: A Secure TensorFlow Framework
secureTF: A Secure TensorFlow Framework
 
PipeTune: Pipeline Parallelism of Hyper and System Parameters Tuning for Deep...
PipeTune: Pipeline Parallelism of Hyper and System Parameters Tuning for Deep...PipeTune: Pipeline Parallelism of Hyper and System Parameters Tuning for Deep...
PipeTune: Pipeline Parallelism of Hyper and System Parameters Tuning for Deep...
 
LEGaTO: Machine Learning Use Case
LEGaTO: Machine Learning Use CaseLEGaTO: Machine Learning Use Case
LEGaTO: Machine Learning Use Case
 
Smart Home AI at the edge
Smart Home AI at the edgeSmart Home AI at the edge
Smart Home AI at the edge
 
LEGaTO: Low-Energy Heterogeneous Computing Use of AI in the project
LEGaTO: Low-Energy Heterogeneous Computing Use of AI in the projectLEGaTO: Low-Energy Heterogeneous Computing Use of AI in the project
LEGaTO: Low-Energy Heterogeneous Computing Use of AI in the project
 
LEGaTO Integration
LEGaTO IntegrationLEGaTO Integration
LEGaTO Integration
 
LEGaTO: Use cases
LEGaTO: Use casesLEGaTO: Use cases
LEGaTO: Use cases
 
LEGaTO Heterogeneous Hardware
LEGaTO Heterogeneous HardwareLEGaTO Heterogeneous Hardware
LEGaTO Heterogeneous Hardware
 
LEGaTO: Low-Energy Heterogeneous Computing Workshop
LEGaTO: Low-Energy Heterogeneous Computing WorkshopLEGaTO: Low-Energy Heterogeneous Computing Workshop
LEGaTO: Low-Energy Heterogeneous Computing Workshop
 
TZ4Fabric: Executing Smart Contracts with ARM TrustZone
TZ4Fabric: Executing Smart Contracts with ARM TrustZoneTZ4Fabric: Executing Smart Contracts with ARM TrustZone
TZ4Fabric: Executing Smart Contracts with ARM TrustZone
 
Infection Research with Maxeler Dataflow Computing
Infection Research with Maxeler Dataflow ComputingInfection Research with Maxeler Dataflow Computing
Infection Research with Maxeler Dataflow Computing
 
Smart Home - AI at the edge
Smart Home - AI at the edgeSmart Home - AI at the edge
Smart Home - AI at the edge
 
FPGA Undervolting and Checkpointing for Energy-Efficiency and Error-Resiliency
FPGA Undervolting and Checkpointing for Energy-Efficiency and Error-ResiliencyFPGA Undervolting and Checkpointing for Energy-Efficiency and Error-Resiliency
FPGA Undervolting and Checkpointing for Energy-Efficiency and Error-Resiliency
 
Scheduling Task-parallel Applications in Dynamically Asymmetric Environments
Scheduling Task-parallel Applications in Dynamically Asymmetric EnvironmentsScheduling Task-parallel Applications in Dynamically Asymmetric Environments
Scheduling Task-parallel Applications in Dynamically Asymmetric Environments
 
RECS – Cloud to Edge Microserver Platform for Energy-Efficient Computing
RECS – Cloud to Edge Microserver Platform for Energy-Efficient ComputingRECS – Cloud to Edge Microserver Platform for Energy-Efficient Computing
RECS – Cloud to Edge Microserver Platform for Energy-Efficient Computing
 
Secure Task-Based Programming with OmpSs and SGX
Secure Task-Based Programming with OmpSs and SGXSecure Task-Based Programming with OmpSs and SGX
Secure Task-Based Programming with OmpSs and SGX
 
HiPerMAb: A statistical tool for judging the potential of short fat data
HiPerMAb: A statistical tool for judging the potential of short fat dataHiPerMAb: A statistical tool for judging the potential of short fat data
HiPerMAb: A statistical tool for judging the potential of short fat data
 

Dernier

Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
PirithiRaju
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
ssuser79fe74
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
AlMamun560346
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
Areesha Ahmad
 

Dernier (20)

Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
American Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptxAmerican Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptx
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 

LEGaTO: Software Stack Runtimes

  • 1. The LEGaTO project has received funding from the European Union's Horizon 2020 research and innovation programme under the grant agreement No 780681 10/13/20 LEGaTO: Software Stack Runtimes HiPEAC 2020 Computer Systems Week 16-10-2020 Miquel Pericas Chalmers University of Technology
  • 2. 2 HiPEAC CSW Autumn 2020 • Middleware – SLURM and RedFish • OmpSs@FPGA (Xavier) • XiTAO: −Introduction: XiTAO execution Model −Energy Aware Scheduler −Software Topologies −Pipeline parallelism • FPGA Undervolting • Fault tolerance - GPU Checkpointing Outline
  • 3. HiPEAC CSW Autumn 2020 Slurm and RECS Master • Integration of Slurm with RECS Master o Nodes specification at slurm configuration (partitions, limits…) o Slurm gets node specification and selects target nodes o Allocates, joins and starts nodes o Executes the application(s) o Shuts-down nodes and destroys allocation 3 $ sinfo PART… AVAIL LIMIT NODES STATE NODELIST debug* up infinite 1 idle* pcxavim5 debug* up infinite 16 idle BB_1_[0,2-15],pcxavim6
  • 4. HiPEAC CSW Autumn 2020 Slurm and RECS Master • Slurm contacts RECS Master at job execution and termination times 4 #!/bin/bash #SBATCH -N 10 #SBATCH --constraint=ARM,bigLITTLE,hasGPU #SBATCH -o test-%j.out #SBATCH -e test-%j.err // App invocation $ sinfo PARTITION AVAIL TIMELIMIT NODES STATE NODELIST debug* up infinite 1 idle* pcxavim5 debug* up infinite 10 alloc BB_1_[0,2-10] debug* up infinite 6 idle BB_1_[11-15],pcxavim6 $ sbatch batch-10-bl.sh Submitted batch job 39 $ squeue JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 39 debug batch-10 xavim R 0:42 10 BB_1_[0,2-10]
  • 5. HiPEAC CSW Autumn 2020 Slurm and RECS Master • Composed nodes are created using the RECS Master webservice • And started and stopped automatically 5 10 nodes are turned on
  • 6. 6 HiPEAC CSW Autumn 2020 OmpSs@FPGA ● Offload of matrix multiplication to FPGA #pragma omp target device(fpga) num_instances(3) #pragma omp task in([BSIZE*BSIZE]a, [BSIZE*BSIZE]b) inout([BSIZE*BSIZE]c) void matmulBlock(const elem_t *a, const elem_t *b, elem_t *c) { #pragma HLS INLINE // off #pragma HLS array_partition variable=a cyclic factor=4 #pragma HLS array_partition variable=b cyclic factor=BSIZE/4 #pragma HLS array_partition variable=c cyclic factor=BSIZE/2 for (int k = 0; k < BSIZE; ++k) { … } } FPGA
  • 7. 7 HiPEAC CSW Autumn 2020 ● Acceleration of matrix multiplication on FPGAs − 4 ARM cores (OpenBLAS) − 1 to 3 IP cores ● Block size 256x256 0 1 2 3 4 5 6 7 8 0 20 40 60 80 100 4 ARM cores 1 IP core 2 IP cores 3 IP cores GFlops/W GFlops Axis Title Matrix multiply, energy efficiency Gflops Gflops/W ● Best performance ● 3 IP cores ● Best energy-efficiency ● 2 IP cores OmpSs@FPGA
  • 8. 8 HiPEAC CSW Autumn 2020 XiTAO: Energy Aware Scheduler • Module 1: Power Profiling • help runtime understand CPU power consumption trends (number/type of cores, different frequencies) • • Module 2: Dynamic Performance Modeling • provide accurate prediction for future task given a set of resources • independent of platforms and frequencies • achieve scalablity and portablity goals • • Module 3: Idleness Tracing • give the information about real-time status of cores • put cores to ”sleep” when it is under-utilized • sleeping time exploits backoff exponential strategy • provide the real-time parallel slackness of active cores => calculation of shared board static power on each running task • • Module 4: Task Mapping Algorithm (Per task level) For a given configuration (Start core, number of cores): • Performance Tracer => Execution Time Prediction • Power Profiles => Dynamic Power Prediction • Power Profiles + Idleness Tracer => Static Power Prediction • Energy Prediction = (Static Power + Dynamic Power) x Execution Time
  • 9. XiTAO: Energy Aware Scheduler
      ● 31%-74% energy reduction compared with RWS
      ● 19%-68% energy reduction compared with FCC
      ● 25%-73% energy reduction compared with LCC
      Schedulers (name, acronym, notion):
      − Random Work Stealing (+Sleep), RWS (+S): typical greedy scheduling (enhanced with Sleep)
      − Fastest Cores with Criticality (+Sleep), FCC (+S): critical tasks are mapped to the set of cores that minimizes execution time and are not eligible for work stealing; noncritical tasks follow the parent queue and only search for the number of cores that minimizes the task's execution time (enhanced with Sleep)
      − Lowest Cost with Criticality (+Sleep), LCC (+S): same as FCC, except that it minimizes parallel cost instead of execution time, where parallel cost is "execution time * number of cores" (enhanced with Sleep)
      − Lowest Energy without Criticality, LENC: task scheduling targets the lowest energy; no criticality awareness is needed
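The difference between the FCC and LCC objectives can be shown with a small C sketch. It covers only the choice of core count from a list of predicted execution times; criticality handling and work stealing are left out, and all type and function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical prediction for running one task on a given number of cores. */
typedef struct {
    int num_cores;
    double exec_time_s;
} width_option;

/* Parallel cost as defined on the slide: execution time * number of cores. */
double parallel_cost(const width_option *w)
{
    return w->exec_time_s * (double)w->num_cores;
}

/* FCC-style objective: index of the option with the shortest execution time,
 * or -1 if there are no options. */
int fastest_width(const width_option *opts, size_t n)
{
    if (n == 0) {
        return -1;
    }
    size_t best = 0;
    for (size_t i = 1; i < n; ++i) {
        if (opts[i].exec_time_s < opts[best].exec_time_s) {
            best = i;
        }
    }
    return (int)best;
}

/* LCC-style objective: index of the option with the lowest parallel cost,
 * or -1 if there are no options. */
int lowest_cost_width(const width_option *opts, size_t n)
{
    if (n == 0) {
        return -1;
    }
    size_t best = 0;
    for (size_t i = 1; i < n; ++i) {
        if (parallel_cost(&opts[i]) < parallel_cost(&opts[best])) {
            best = i;
        }
    }
    return (int)best;
}
```

For a task that scales sublinearly (for example 8.0 s on 1 core, 4.5 s on 2 cores, 3.0 s on 4 cores), the two objectives disagree: the fastest option is 4 cores, while the lowest parallel cost is 1 core.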
  • 10. XiTAO: Software & Hardware Topologies
      • Logical data locations are mapped to physical locations (to create a model per locality)
      • The Software Topology Address (STA) is a portable key that the XiTAO runtime interprets to map a task to a place.
      • Example: a space-filling order is used as an STA, transforming coordinates to an integer for Cartesian inputs. The paper includes other examples of such keys.
      • This STA-to-location mapping is leveraged to model performance according to each task's data locality
      • A performance model is created per (STA, task_type) tuple
      • The energy-aware model could potentially be used here.
      [Figure: example system's elastic partitions to be used by the model; diagram labels: STA, train, Sched]
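As an illustration of turning Cartesian coordinates into an integer key, the sketch below computes a 2D Morton (Z-order) code in C. The slide only says that a space-filling order is used; choosing the Morton curve, the 16-bit coordinate width and the function names are assumptions made for this example, not details confirmed by the source.

```c
#include <stdint.h>

/* Spread the lower 16 bits of x so that a zero bit separates each pair of
 * adjacent original bits (bit i moves to bit 2*i). */
static uint32_t spread_bits16(uint32_t x)
{
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* Interleave the bits of (x, y) into one integer: x occupies the even bit
 * positions and y the odd ones. Points that are close in 2D tend to get
 * nearby keys, which is what makes such a key usable as a locality hint. */
uint32_t morton2d(uint16_t x, uint16_t y)
{
    return spread_bits16(x) | (spread_bits16(y) << 1);
}
```

The four cells of a 2x2 block, (0,0), (1,0), (0,1) and (1,1), map to the consecutive keys 0, 1, 2 and 3.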
  • 11. XiTAO: Model Validation on DAG Chain
      • Adaptive resource selection (leader, width) for a cache-intensive task. Green marks the NUMA node where the task (depicted by its STA) is initialized
      • The scheduler mostly chooses widths 1 and 2 (within the shared L2 cache)
      • Adaptive resource selection (leader, width) for a memory-intensive task
      • The scheduler mostly chooses width 12 (a socket encapsulating 2 NUMA nodes)
      • Random work-stealing behavior for compute-bound tasks, with a preference for larger widths
      • Scalability of the model when running memory-bound DAG chains: up to 2.5x speedup with larger task counts
      • To validate the STA-driven performance modeling, we
        − Test on a 4-socket AMD system (2 NUMA nodes each)
        − Print a resource selection trace of a chain of tasks
      • The scheduler adapts its behavior: it is locality-aware for memory/cache-intensive tasks and acts as a work-stealing scheduler for compute-bound tasks
  • 12. XiTAO: Moldable pipelines for CNNs on heterogeneous edge devices
      ● A simple template tensor language for developing CNNs.
      ● XiTAO pipelines are generated from the information provided through the language interface.
      ● An online training phase determines the optimal pipeline configuration:
        • Distribution of network layers among pipeline stages
        • Resource partitioning among pipeline stages
      ● The training is led by a search algorithm that uses computational hints provided by the language interface.
  • 13. XiTAO: Moldable pipelines for CNNs on heterogeneous edge devices
      Network description in template language
        main(){
          …
          Conv1 = CONV(ip, op, weights);
          Conv2 = CONV(Conv1, op, weights);
          ….
          network.add(Conv1);
          network.add(Conv2);
          …
          network.execute();
        }
  • 14. FPGA Undervolting
      Problem: FPGAs are at least 10X less power-efficient than equivalent ASICs
      Goal: Bridge the power-efficiency gap between ASICs and FPGAs by undervolting below the nominal level
      • Case Study: Power consumption of neural networks is a main concern
        ✔ Hardware acceleration: GPUs, FPGAs, and ASICs
      Evaluation Setup
        ✔ 5 image classification workloads
        ✔ 3 Xilinx UltraScale+ ZCU102 platforms
        ✔ 2 on-chip voltage rails
      Main Results
        ✔ Large voltage guardband (i.e., 33%)
        ✔ >3X power-efficiency gain
  • 15. Overall Voltage Behavior
      Slight variation of voltage behavior across platforms and benchmarks
      Guardband
        ❑ No performance or reliability loss
        ❑ Added by the vendor to ensure correct operation under worst-case conditions
        ❑ Large guardband, average of 33%
      Critical
        ❑ A narrow voltage region
        ❑ Neural network accuracy collapses
      Crash
        ❑ FPGA stops operating
  • 16. GPU Checkpointing with FTI
      ● Transparent multi-GPU/multi-node checkpointing
      ● Parallel streams to improve I/O efficiency
      ● Fast checksum calculation using a GPU MD5 algorithm
  • 17. GPU Checkpointing with FTI
      ● Over 100x speedup with the new GPU MD5 algorithm
      ● A checkpoint takes less than 1 second
      ● FPGA checkpoint implementation coming