© 2014 IBM Corporation
HPC User Forum September 2014
Memory-Driven Near-Data
Acceleration
and its application to DOME/SKA
Jan van Lunteren
Heiner Giefers
Christoph Hagleitner
Rik Jongerius
IBM Research
Square Kilometre Array (SKA)
§  The world’s largest and most sensitive radio telescope
§  Co-located in South Africa and Australia
– deployment of SKA-1: 2020
– deployment of SKA-2: 2022+
DOME
§  Dutch-government-sponsored, SKA-focused research project between IBM and
the Netherlands Institute for Radio Astronomy (ASTRON)
DOME / SKA Radio Telescope
Source: T. Engbersen et al., “SKA – A Bridge too far, or not?,” Exascale Radio Astronomy, 2014
Image credit: SKA Organisation
Receptors: ~3000 dishes (3-10 GHz), ~0.25M antennae (0.5-1.7 GHz), ~0.25M antennae (0.07-0.45 GHz)
Big Data (~exabytes/day) and an Exascale Computing Problem
§  Example: SKA1-Mid band 1 SDP (incl. 2D FFTs, gridding, calibration, imaging)
– meeting the design specs requires ~550 PFLOPS sustained
– extrapolating HPC trends: 37 GFLOPS/W in 2022 (at 20% efficiency)
•  results in ~15 MW power consumption – the power budget is only 2 MW
•  corresponds to ~2.75 EFLOPS peak performance (at 20% efficiency)
→ Innovations are needed to realize the SKA
DOME / SKA Radio Telescope
Sources: R. Jongerius, “SKA Phase 1 Compute and Power Analysis,” CALIM 2014;
Preliminary Specifications of the SKA, R.T. Schilizzi et al., 2007 / Chr. Broekema
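The power gap is simple arithmetic; a minimal check of the figures quoted above (all values taken from the bullets, rounding mine):

```python
# Back-of-the-envelope check of the SKA1-Mid SDP power figures (from the slide).
sustained_flops = 550e15      # ~550 PFLOPS required to meet the design specs
gflops_per_watt = 37e9        # extrapolated sustained HPC efficiency in 2022
efficiency = 0.20             # assumed sustained/peak ratio

power_mw = sustained_flops / gflops_per_watt / 1e6   # ~14.9 MW vs. a 2 MW budget
peak_eflops = sustained_flops / efficiency / 1e18    # ~2.75 EFLOPS peak
print(f"power: {power_mw:.1f} MW, peak: {peak_eflops:.2f} EFLOPS")
```

The ~7x overshoot of the 2 MW budget is what motivates the rest of the deck.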
DOME / SKA Radio Telescope
Programmable general-purpose off-the-shelf technology
q  CPU, GPU, FPGA, DSP
+ ride the technology wave
- do not meet the performance and power-efficiency targets
Fixed-function application-specific custom accelerators
q  ASIC
+ meet the performance and power-efficiency targets
- do not provide the required flexibility
- high development cost
Example Energy Estimates (45 nm)
§  70 pJ per instruction (I-cache and register-file access, control)
§  3.7 pJ per 32-bit floating-point multiply operation
§  10-100 pJ per cache access
§  1-2 nJ per DRAM access
Source: M. Horowitz, “Computing’s Energy Problem (and what we can do about it),” ISSCC 2014
→ high energy cost of programmability
→ large impact of memory accesses on energy consumption
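To make the imbalance concrete, the same table as a quick ratio calculation (the range midpoints are my assumption):

```python
# Energy per operation at 45 nm, from the Horowitz figures on the slide (pJ).
energy_pj = {
    "instruction overhead": 70.0,    # I-cache + register file + control
    "fp32 multiply":        3.7,
    "cache access":         50.0,    # midpoint of the 10-100 pJ range
    "dram access":          1500.0,  # midpoint of the 1-2 nJ range
}
mul = energy_pj["fp32 multiply"]
for op, e in energy_pj.items():
    # Each line shows how many multiplies one such event is "worth".
    print(f"{op:22s} {e:7.1f} pJ = {e / mul:6.1f}x a multiply")
```

Fetching an operand from DRAM costs hundreds of times more than computing with it, which is the core argument for near-data acceleration.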
DOME / SKA Radio Telescope
Research Focus
§  Can we design a programmable custom accelerator that outperforms off-the-shelf
technology for a sufficiently large number of applications to justify the costs?
§  Focus on critical applications that involve regular processing steps and whose
key challenges relate to the storage of, and access to, the data structures:
“how to bring the right operand values efficiently to the execution pipelines”
–  examples: 1D/2D FFTs, gridding, dense and sparse linear algebra, stencils
Memory System
Observations (“common knowledge”)
§  Access bandwidth, latency, and power depend on a complex
interaction between access characteristics and memory-system operation
– access patterns/strides, locality of reference, etc.
– cache size, associativity, replacement policy, etc.
– bank interleaving, row-buffer hits, refresh, etc.
§  Memory-system operation is typically fixed and
cannot be adapted to the workload characteristics
– making it programmable is extremely challenging due to
performance constraints (it sits on the “critical path”)
→ the opposite happens: “bare metal” programming adapts the
workload to the memory operation to achieve substantial performance gains
§  Data organization often has to be changed between
consecutive processing steps
[Diagram: cores with register files and L1/L2 caches, a shared L3 cache, and
memory controller(s) in front of main memory, which consists of multiple DRAM
banks, each with a row buffer]
Memory System
Programmable Near-Memory Processing
§  Offload performance-critical applications/functions to programmable
accelerators that are closely integrated into the memory system
1)  Exploit the benefits of near-memory processing (“closer
to the sense amplifiers” / reduced data movement)
2)  Make the memory-system operation programmable, so
that it can be adapted to the workload characteristics
3)  Apply an architecture/programming model that enables
tightly coupled scheduling of accesses and data operations,
matching the access bandwidth to the processing rate and
reducing overhead (including instruction fetch and decode)
§  Integration examples
– on die with eDRAM technology
– 2.5D / 3D stacked architectures
– memory module
[Diagram: the Near-Memory Accelerator replaces the memory controller between
the cache hierarchy and the DRAM banks of main memory]
Decoupled Access/Execute Architecture
§  The Access Processor (AP) handles all memory-access-related functions
– basic operations are programmable (address
generation, address mapping, access scheduling)
– memory details (cycle times, bank organization,
retention times, etc.) are exposed to the AP
– the AP is the “master”
§  Execution Pipelines (EPs) are configured by the AP
– no need for instruction fetch/decode
§  Tag-based data transfer
– tags identify configuration data and operand data
– enables out-of-order access and processing
– used as a rate-control mechanism to prevent
the AP from overrunning the EPs
§  The availability of all operand values in an EP’s input
buffer/register triggers execution of the operation
Near-Data Acceleration
[Diagram: cores and cache hierarchy alongside the Access Processor, which
exchanges addresses + data with the eDRAM banks and data + tags with the
Execution Pipeline(s)]
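The tag-triggered behavior can be sketched in software. This is a toy model, not the hardware design: the `ExecutionPipeline` class, the tag numbers, and the three-operand lambda are all invented for illustration; only the trigger-on-complete-operand-set rule comes from the slide.

```python
# Toy model of tag-based data transfer: the AP pushes tagged operands (possibly
# out of order) and the EP fires once a complete operand set is buffered.

class ExecutionPipeline:
    def __init__(self, needed_tags, op):
        self.needed = set(needed_tags)   # tags that form one complete operand set
        self.op = op                     # pre-configured operation (no fetch/decode)
        self.buf = {}                    # input buffer, keyed by tag
        self.results = []

    def deliver(self, tag, value):
        self.buf[tag] = value            # out-of-order arrival is fine
        if self.needed <= self.buf.keys():
            # all operands available -> processing starts
            args = [self.buf.pop(t) for t in sorted(self.needed)]
            self.results.append(self.op(*args))

# "AP side": here the tagged data is canned; a real AP would generate the
# addresses, read the eDRAM banks, and attach the tags.
ep = ExecutionPipeline({0, 1, 2}, lambda a, b, w: (a + w * b, a - w * b))
for tag, value in [(2, 1.0), (0, 3.0), (1, 2.0)]:   # arrives out of order
    ep.deliver(tag, value)
print(ep.results)   # one butterfly of a=3, b=2, w=1 -> [(5.0, 1.0)]
```

Rate control falls out naturally: the AP can bound the number of tags it has in flight against the EP's buffer capacity.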
Near-Data Acceleration
Example: FFT (sample data in eDRAM)
1)  Application selects the function and initializes the near-memory accelerator
2)  Access Processor configures the Execution Pipeline (here: as a Butterfly Unit)
3a) Access Processor generates addresses, schedules accesses, and assigns tags
‒ read operand data for the butterflies
‒ write butterfly results (addresses pre-calculated)
3b) Execution Pipeline performs a butterfly calculation each time a complete
set of operands is available
Animation sequence (tagged data flowing between the eDRAM banks and the Butterfly Unit):
– operand 1 arrives (tag 0)
– operand 2 arrives (tag 1)
– twiddle factor arrives (tag 2)
– all operands available → processing starts
– result 1 is written back (tag 4)
– result 2 is written back (tag 5)
– stage configuration is updated between butterfly calculations
– twiddle factor (tag 2) carries a reuse flag, so it is retained across butterflies
– operand 4 (tag 0) and result 1 (tag 4): direct forwarding of a result into the next butterfly
[Diagram on each step: cores and cache hierarchy alongside the Access Processor
(with instruction memory), the Butterfly Unit, and the eDRAM banks exchanging
addresses + data and data + tags]
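The address/butterfly split of steps 3a/3b can be illustrated with a plain radix-2 decimation-in-time FFT: the loop nest plays the role of the AP's address generation (it only produces index pairs and twiddle factors), and the three-operand update plays the role of the EP's butterfly. This is the textbook formulation, not the accelerator's actual program.

```python
# In-place radix-2 DIT FFT; length must be a power of two.
import cmath

def fft_inplace(x):
    n = len(x)
    # Bit-reversal permutation: pure address generation, no arithmetic.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    span = 1
    while span < n:                      # one pass per FFT stage
        w_step = cmath.exp(-1j * cmath.pi / span)
        for start in range(0, n, 2 * span):
            w = 1.0
            for a in range(start, start + span):
                b = a + span             # "AP": operand addresses a, a+span
                t = w * x[b]             # "EP": butterfly on (x[a], x[b], w)
                x[a], x[b] = x[a] + t, x[a] - t
                w *= w_step
        span *= 2
    return x

print(fft_inplace([1+0j, 1+0j, 1+0j, 1+0j]))   # DC input -> [4, 0, 0, 0]
```

Note that every butterfly reads two array elements and one twiddle factor, matching the three tagged operands in the animation, and that the twiddle factor is reused across consecutive butterflies within a stage.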
Execution Pipeline Hierarchy
§  Different levels of coupling between access/data-transfer
scheduling and operation execution
§  L2 – loosely coupled
–  timing of transfers, execution, and write accesses
(execution results) is not exactly known in advance to the AP
–  requires a write buffer
–  rate control based on the number of tags “in flight”
§  L1 – tightly coupled
–  the AP “knows” the execution-pipeline length
–  reserves slots for writing execution results
–  minimizes buffer requirements
–  optimized access scheduling:
“just in time” / “just enough”
§  For completeness
–  L0 – table lookup (pre-calculated results)
–  L3 – host CPU or GPU
Near-Data Acceleration
[Diagram: the Access Processor between the eDRAM banks and a hierarchy of
execution resources – an L0 lookup buffer, tightly coupled L1 Execution
Pipeline(s), loosely coupled L2 Execution Pipeline(s), and the host cores (L3)
behind the cache hierarchy]
Implementation Examples
Near-Data Acceleration
§  On the same die with eDRAM – AP and EP placed among the eDRAM macros,
in tightly and loosely coupled variants
§  3D stack – DRAM dies stacked on a logic die that holds the AP/EP
§  FPGA – AP and EP mapped onto an FPGA, loosely coupled to eDRAM
Programmable state machine B-FSM
§  Multiway-branch capability supporting the evaluation
of many (combinations of) conditions in parallel
–  loop conditions (counters, timers), data arrival, etc.
–  hundreds of branches can be evaluated in parallel
§  Reacts extremely fast:
dispatches instructions within 2 cycles (at >2 GHz)
§  Multi-threaded operation
§  Fast sleep/nap mode
Enabling Technologies
[Diagram: B-FSM engine driven by a condition vector, with an address mapper and
instruction memory producing an instruction vector that controls the register
file, ALU, data path, bus, and memory]
B-FSM – Programmable State Machine Technology
§  Novel hardware-based programmable state machine
– deterministic rate of 1 transition/cycle at >2 GHz
– storage grows approximately linearly with the DFA size
•  1K transitions fit in ~5 KB; 1M transitions fit in ~5 MB
– supports wide input vectors (8-32 bits) and flexible
branch conditions: e.g., exact-, range-, and
ternary-match, negation, case-insensitivity
→ enables TCAM emulation
§  Successfully applied to a range of accelerators
– regular-expression scanners, protocol engines, XML parsers
– processing rates of ~20 Gbit/s for a single B-FSM in 45 nm
– small area cost enables scaling to extremely high aggregate processing rates
Source: “Designing a Programmable Wire-Speed Regular-Expression Matching Accelerator,” IEEE/ACM Int. Symposium on Microarchitecture (MICRO-45), 2012
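As a rough software analogue of the B-FSM's flexible branch conditions, the sketch below models a transition-rule table with exact, ternary, and wildcard matches. The rule contents and states are invented for illustration, and a real B-FSM resolves one rule per cycle in hardware rather than by linear scan.

```python
# Each rule is (current_state, match_value, match_mask) -> next_state.
# Mask bits set to 0 are "don't care", which is exactly TCAM-style matching.
rules = [
    (0, 0x41, 0xFF, 1),   # exact match: input byte == 'A'
    (1, 0x30, 0xF0, 2),   # ternary match: any input in 0x30-0x3F
    (1, 0x00, 0x00, 0),   # wildcard: anything else resets
    (2, 0x00, 0x00, 2),   # absorbing accept state
]

def step(state, symbol):
    for cur, val, mask, nxt in rules:    # first matching rule wins
        if cur == state and (symbol & mask) == val:
            return nxt
    return state

state = 0
for sym in b"A9x":        # 'A' -> state 1, '9' matches 0x3? -> state 2, stays
    state = step(state, sym)
print(state)              # -> 2
```

Range matches and negation fit the same rule shape; the hardware trick is hashing the rules so the winning transition is found in a single cycle.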
Programmable address mapping
§  Unique programmable interleaving of multiple
power-of-2 address strides
§  Support for a non-power-of-2 number of non-identically
sized banks and/or memory regions
§  Based on a small lookup table (typically 4-32 bytes)
Enabling Technologies
Programmable Address Mapping
[Diagram: an address is split into fields X and Y; a small lookup table maps
them to a bank id and an internal bank address, selecting one of the eDRAM banks]
© 2014 IBM CorporationHPC User Forum September 201428
Programmable Address Mapping
§  LUT size = 4 bytes
§  Simultaneous interleaving of row
and column accesses
§  Example: n=256
– two power-of-2 strides: 1 and 256
263
518
773
4
259
514
769
519
774
5
260
515
770
1
256
775
6
261
516
771
2
257
512
7
262
517
772
3
258
513
768
... ... ......
... ... ......
1024 1280 1536 1792
0
memory banks
7
6
5
4
3
2
1
...
...
256
0
1 2 30
a12 a13 … a1n
a21 a22 a23 … a2n
a31 a32 a33 … a3n
…
…
…
…
am1 am2 am3 … amn
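The slide does not give the LUT contents, so the sketch below substitutes a simple modular skew for the table to show how one mapping can serve both strides conflict-free; the bank count and field widths are assumptions matching the example above.

```python
# Skewed bank mapping: both stride-1 and stride-256 access sequences
# hit all 8 banks exactly once, so neither direction serializes on a bank.
NUM_BANKS = 8

def map_address(addr):
    skew = (addr >> 8) % NUM_BANKS    # in the real design this could come
    bank = (addr + skew) % NUM_BANKS  # from a tiny (4-byte) lookup table
    internal = addr // NUM_BANKS      # address within the selected bank
    return bank, internal

row = [map_address(a)[0] for a in range(0, 8)]             # stride-1 accesses
col = [map_address(a)[0] for a in range(0, 8 * 256, 256)]  # stride-256 accesses
print(sorted(row), sorted(col))   # both cover banks 0..7
```

A programmable table generalizes this: reloading a few bytes changes which stride combinations are conflict-free, without touching the data layout.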
Programmable Address Mapping
§  LUT size = 16 bytes
§  Simultaneous interleaving of row, column, and “vertical layer” accesses
§  Example: three power-of-2 strides: 1, 16, and 256
[Figure: bank layout in which the address sequences with strides 1, 16, and 256
each map to distinct banks]
Power Efficiency
Power
§  Access Processor power estimate for 14 nm at >2 GHz: 25-30 mW
Amortize the Access Processor energy over a large amount of data
§  Maximize memory-bandwidth utilization (the basic objective of the accelerator)
– bank interleaving, row-buffer locality
§  Exploit wide access granularities
–  example: 512-bit accesses @ 500 MHz (eDRAM)
–  estimated AP energy overhead per accessed bit ≈ 0.1 pJ
→ Make sure that all data is effectively used
‒  improved algorithms
‒  data shuffle unit: swaps data at various
granularities within one or multiple lines
(configured similarly to an EP)
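The 0.1 pJ/bit overhead figure follows directly from the numbers above (using the midpoint of the 25-30 mW estimate):

```python
# AP energy overhead per accessed bit, from the slide's figures.
ap_power_w = 27.5e-3         # midpoint of the 25-30 mW AP power estimate
bits_per_access = 512        # wide eDRAM access
access_rate_hz = 500e6

bandwidth_bits = bits_per_access * access_rate_hz    # 256 Gbit/s
pj_per_bit = ap_power_w / bandwidth_bits * 1e12
print(f"{pj_per_bit:.3f} pJ per accessed bit")       # ~0.107 pJ, i.e. ~0.1 pJ
```

Compared against the ~1-2 nJ of a DRAM access or even the 3.7 pJ of a multiply, the AP's control overhead is essentially negligible once the bandwidth is kept busy.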
Conclusion
New Programmable Near-Memory Accelerator
§  Made feasible by novel state-machine and address-mapping technologies that
enable programmable address generation, address mapping, and access
scheduling in “real time”
§  The objective is to minimize the (energy) overhead beyond the basic storage
and processing needs (memory and execution units)
§  Proof of concept for selected workloads using FPGA prototypes and an
initial compiler stack
§  More details to be published soon
Energy-Efficient Task Scheduling in Cloud EnvironmentEnergy-Efficient Task Scheduling in Cloud Environment
Energy-Efficient Task Scheduling in Cloud Environment
 

Plus de inside-BigData.com

Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networksinside-BigData.com
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...inside-BigData.com
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...inside-BigData.com
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networksinside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoringinside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecastsinside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Updateinside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuninginside-BigData.com
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODinside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Accelerationinside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficientlyinside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Erainside-BigData.com
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computinginside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Clusterinside-BigData.com
 

Plus de inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Dernier

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 

Dernier (20)

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 

Memory-Driven Near-Data Acceleration for Radio Telescope

  • 1. Memory-Driven Near-Data Acceleration and its application to DOME/SKA
       Jan van Lunteren, Heiner Giefers, Christoph Hagleitner, Rik Jongerius (IBM Research)
  • 2. DOME / SKA Radio Telescope
       Square Kilometer Array (SKA)
       §  The world’s largest and most sensitive radio telescope
       §  Co-located in South Africa and Australia
          – deployment SKA-1: 2020
          – deployment SKA-2: 2022+
       DOME
       §  Dutch-government-sponsored SKA-focused research project between IBM and the Netherlands Institute for Radio Astronomy (ASTRON)
       ~0.25M Antennae 0.07-0.45GHz | ~0.25M Antennae 0.5-1.7GHz | ~3000 Dishes 3-10GHz
       Source: T. Engbersen et al., “SKA – A Bridge too far, or not?,” Exascale Radio Astronomy, 2014
       Image credit: SKA Organisation
  • 3. DOME / SKA Radio Telescope
       Big Data (~Exabytes/day) and Exascale Computing Problem
       §  Example: SKA1-Mid band 1 SDP (incl. 2D FFTs, gridding, calibration, imaging)
          – meeting the design specs requires ~550 PFLOPs
          – extrapolating HPC trends: 37 GFLOPs/W in 2022 (20% efficiency)
             •  results in 15MW power consumption, while the power budget is only 2MW
             •  peak performance ~2.75 EFLOPs (at 20% efficiency)
       è Innovations needed to realize SKA
       Sources: R. Jongerius, “SKA Phase 1 Compute and Power Analysis,” CALIM 2014; Prelim. Spec. SKA, R.T. Schilizzi et al. 2007 / Chr. Broekema
       Image credit: SKA Organisation
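The arithmetic behind these figures is easy to verify; the sketch below assumes the 20% sustained-to-peak efficiency and the 37 GFLOPs/W delivered efficiency quoted on the slide:

```python
# Sanity check of the SKA1-Mid band 1 SDP power/performance numbers.
sustained = 550e15        # required sustained compute: ~550 PFLOPs
efficiency = 0.20         # assumed sustained-to-peak ratio (from the slide)
flops_per_watt = 37e9     # extrapolated 2022 HPC efficiency: 37 GFLOPs/W

power_mw = sustained / flops_per_watt / 1e6   # ~14.9 MW, vs. a 2 MW budget
peak_eflops = sustained / efficiency / 1e18   # ~2.75 EFLOPs of peak hardware
```

The ~7x gap between the extrapolated 15 MW and the 2 MW budget is what motivates the accelerator work in the rest of the deck.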
  • 4. DOME / SKA Radio Telescope
       Programmable general-purpose off-the-shelf technology
       q  CPU, GPU, FPGA, DSP
       + ride technology wave
       - do not meet performance and power efficiency targets
       Fixed-function application-specific custom accelerators
       q  ASIC
       + meet performance and power efficiency targets
       - do not provide required flexibility
       - high development cost
       Example Energy Estimates (45nm)
       §  70pJ for instruction (I-cache and register file access, control)
       §  3.7pJ for 32b floating-point multiply operation
       §  10-100pJ for cache access
       §  1-2nJ for DRAM access
       è high energy cost of programmability; large impact of memory on energy consumption
       Source: M. Horowitz, “Computing’s Energy Problem (and what can we do about it),” ISSCC 2014
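To make the programmability cost concrete, the quoted estimates can be combined into a per-operation budget. The cache and DRAM numbers below are assumed mid-range picks from the slide's intervals, not measured values:

```python
# Energy budget (pJ) for one 32b FP multiply executed the conventional,
# instruction-driven way, using the 45nm estimates quoted on the slide.
E_INSTR = 70.0    # instruction overhead: I-cache, register file access, control
E_FPMUL = 3.7     # the 32b floating-point multiply itself
E_CACHE = 50.0    # assumed midpoint of the 10-100 pJ cache-access range
E_DRAM = 1500.0   # assumed midpoint of the 1-2 nJ DRAM-access range

def energy_pj(cache_accesses, dram_accesses):
    """Total energy for one multiply plus its instruction and data movement."""
    return E_INSTR + E_FPMUL + cache_accesses * E_CACHE + dram_accesses * E_DRAM

# Even a cache-resident operand pays tens of times the ALU energy in overhead:
overhead_factor = energy_pj(cache_accesses=1, dram_accesses=0) / E_FPMUL
```

This is the quantitative version of the slide's takeaway: the multiply is nearly free, while fetching the instruction and moving the data dominate.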
  • 5. DOME / SKA Radio Telescope
       Research Focus
       §  Can we design a programmable custom accelerator that outperforms off-the-shelf technology for a sufficiently large number of applications to justify the costs?
       §  Focus on critical applications that involve regular processing steps and for which the key challenges relate to storage of and access to the data structure: “how to bring the right operand values efficiently to the execution pipelines”
          – examples: 1D/2D FFTs, gridding, linear algebra workloads, sparse, stencils
       (Recap: programmable off-the-shelf technology — CPU, GPU, FPGA, DSP — rides the technology wave but misses the performance and power efficiency targets; fixed-function ASICs meet those targets but lack flexibility and have high development cost.)
  • 6. Memory System
       Observations (“common knowledge”)
       §  Access bandwidth/latency/power depend on a complex interaction between access characteristics and memory system operation
          – access patterns/strides, locality of reference, etc.
          – cache size, associativity, replacement policy, etc.
          – bank interleaving, row buffer hits, refresh, etc.
       §  Memory system operation is typically fixed and cannot be adapted to the workload characteristics
          – extremely challenging to make it programmable due to performance constraints (in the “critical path”)
          è the opposite happens: “bare metal” programming adapts the workload to the memory operation to achieve substantial performance gains
       §  Data organization often has to be changed between consecutive processing steps
       [Diagram: cores with register files and L1/L2 caches, shared L3 cache, memory controller(s), main-memory DRAM banks with row buffers]
  • 7. Memory System
       Programmable Near-Memory Processing
       §  Offload performance-critical applications/functions to programmable accelerators closely integrated into the memory system
          1)  Exploit benefits of near-memory processing (“closer to the sense amplifiers” / reduced data movement)
          2)  Make memory system operation programmable such that it can be adapted to workload characteristics
          3)  Apply an architecture/programming model that enables a more tightly coupled scheduling of accesses and data operations, to match access bandwidth with processing rate and to reduce overhead (including instruction fetch and decode)
       §  Integration examples
          – on die with eDRAM technology
          – 2.5D / 3D stacked architectures
          – memory module
       [Diagram: cores and cache hierarchy alongside a near-memory accelerator (memory controller) in front of the DRAM banks]
  • 8. Near-Data Acceleration
       Decoupled Access/Execute Architecture
       §  Access processor (AP) handles all memory-access-related functions
          – basic operations are programmable (address generation, address mapping, access scheduling)
          – memory details (cycle times, bank organization, retention times, etc.) are exposed to the AP
          – the AP is the “master”
       §  Execution pipelines (EPs) are configured by the AP
          – no need for instruction fetch/decoding
       §  Tag-based data transfer
          – tags identify configuration data and operand data
          – enables out-of-order access and processing
          – used as a rate-control mechanism to prevent the AP from overrunning the EPs
       §  Availability of all operand values in the EP input buffer/register triggers execution of the operation
       [Diagram: cores and cache hierarchy; Access Processor exchanging addresses + data with eDRAM banks and data + tags with the Execution Pipeline(s)]
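The tag-triggered execution described above can be sketched in a few lines. The class and method names below are illustrative, not the hardware interface; the point is that operands may arrive out of order and the operation fires itself once the configured tag set is complete:

```python
# Minimal model of tag-based operand transfer: the AP streams (tag, value)
# pairs in any order; the EP executes as soon as all operand slots are filled.
class ExecutionPipeline:
    def __init__(self, tag_set, op):
        self.tag_set = set(tag_set)   # operand tags configured by the AP
        self.op = op                  # operation to run on a complete set
        self.slots = {}               # operands received so far
        self.results = []

    def deliver(self, tag, value):
        self.slots[tag] = value
        if set(self.slots) >= self.tag_set:       # all operands available
            args = [self.slots.pop(t) for t in sorted(self.tag_set)]
            self.results.append(self.op(*args))   # execution triggers itself

# Example: a butterfly-like op on two operands (tags 0, 1) and a twiddle (tag 2).
ep = ExecutionPipeline(tag_set=[0, 1, 2],
                       op=lambda a, b, w: (a + w * b, a - w * b))
ep.deliver(2, 1.0)   # twiddle factor arrives first: out of order is fine
ep.deliver(0, 3.0)
ep.deliver(1, 2.0)   # set complete here, butterfly fires
```

Because no instruction stream is decoded in the EP, the only control information crossing the interface is the tag on each data item.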
  • 9. Near-Data Acceleration
       Example: FFT (sample data in eDRAM)
       [Diagram: cores and cache hierarchy; Access Processor, Execution Pipeline(s), eDRAM banks]
  • 10. Near-Data Acceleration
       Example: FFT
       1)  Application selects the function and initializes the near-memory accelerator (AP instruction memory)
  • 11. Near-Data Acceleration
       Example: FFT
       1)  Application selects the function and initializes the near-memory accelerator
       2)  Access Processor configures the Execution Pipeline (configuration data)
  • 12. Near-Data Acceleration
       Example: FFT
       1)  Application selects the function and initializes the near-memory accelerator
       2)  Access Processor configures the Execution Pipeline, which now operates as a Butterfly Unit
  • 13. Near-Data Acceleration
       Example: FFT
       1)  Application selects the function and initializes the near-memory accelerator
       2)  Access Processor configures the Execution Pipeline
       3a) Access Processor generates addresses, schedules accesses and assigns tags
           ‒ read operand data for butterflies
           ‒ write butterfly results (pre-calculated addresses)
       3b) Execution Pipeline performs a butterfly calculation each time a complete set of operands is available
       (Animation: operand 1 is delivered to the Butterfly Unit with tag 0)
  • 14. Near-Data Acceleration, Example: FFT (steps 1-3b as on slide 13). Animation: operand 2 is delivered with tag 1.
  • 15. Near-Data Acceleration, Example: FFT (steps 1-3b as on slide 13). Animation: the twiddle factor is delivered with tag 2.
  • 16. Near-Data Acceleration, Example: FFT (steps 1-3b as on slide 13). Animation: all operands are available è processing starts.
  • 17. Near-Data Acceleration, Example: FFT (steps 1-3b as on slide 13). Animation: result 1 is written back with tag 4.
  • 18. Near-Data Acceleration, Example: FFT (steps 1-3b as on slide 13). Animation: result 2 is written back with tag 5.
  • 19. Near-Data Acceleration, Example: FFT (steps 1-3b as on slide 13). Animation: the AP sends the next stage configuration while butterfly calculations continue.
  • 20. Near-Data Acceleration, Example: FFT (steps 1-3b as on slide 13). Animation: the twiddle factor is delivered with tag 2 and a reuse flag, so it is kept for subsequent butterflies.
  • 21. © 2014 IBM CorporationHPC User Forum September 201421 Near-Data Acceleration Example: FFT 1)  Application selects function and initializes near-memory accelerator 2)  Access Processor configures Execution pipeline 3a) Access Processor generates addresses, schedules accesses and assigns tags ‒ read operand data for butterflies ‒ write butterfly results (pre-calculate addresses) 3b) Execution pipeline performs butterfly calculation each time a complete set of operands is available cache hierarchy CoreCoreCore … Access Processor Butterfly Unit … eDRAM bank eDRAM bank eDRAM bank addresses + data operand 4 tag 0 result 1 tag 4 direct forwarding
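The tag-based flow in steps 3a/3b above can be sketched in software. This is a minimal illustrative model with hypothetical names (`deliver`, `pending`), not the hardware design: the Access Processor tags the loads for one radix-2 butterfly, and the Butterfly Unit fires as soon as the complete operand set for a tag has arrived, regardless of arrival order.

```python
import cmath

pending = {}   # tag -> operands collected so far for that butterfly

def deliver(tag, name, value, twiddle):
    """Model a tagged eDRAM read completing (arrival order is arbitrary)."""
    ops = pending.setdefault(tag, {})
    ops[name] = value
    if "a" in ops and "b" in ops:              # complete set of operands
        del pending[tag]
        return butterfly(ops["a"], ops["b"], twiddle)
    return None                                # still waiting for the other operand

def butterfly(a, b, w):
    """Radix-2 DIT butterfly, the basic FFT operation."""
    t = w * b
    return a + t, a - t                        # two results, written back under new tags

# Operands for tag 4 arrive in either order; the butterfly fires on the second.
w = cmath.exp(-2j * cmath.pi * 0 / 8)          # twiddle factor W_8^0 = 1
assert deliver(4, "a", 1 + 0j, w) is None
print(deliver(4, "b", 2 + 0j, w))              # -> ((3+0j), (-1+0j))
```

Direct forwarding (slide 21) corresponds to routing a returned result tuple straight back into `deliver` as an operand of a waiting tag, skipping the round trip through memory.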
  • 22. Near-Data Acceleration — Execution Pipeline Hierarchy
§  Different levels of coupling between access/data transfer scheduling and operation execution
§  L2 – loosely coupled
   – timing of transfers, execution and write accesses (execution results) not exactly known in advance to the AP
   – requires a write buffer
   – rate control based on the number of tags being “in flight”
§  L1 – tightly coupled
   – AP “knows” the execution pipeline length
   – reserves slots for writing execution results
   – minimizes buffer requirements
   – optimized access scheduling: “just in time” / “just enough”
§  For completeness
   – L0 – table lookup (pre-calculated results)
   – L3 – host CPU or GPU core
(diagram: the Access Processor between the cores/cache hierarchy and the eDRAM banks, exchanging data + tags with an L1 tightly coupled execution pipeline and, via a buffer, an L2 loosely coupled execution pipeline)
  • 23. Near-Data Acceleration — Implementation Examples
§  On the same die with eDRAM — AP and EP either tightly or loosely coupled to the eDRAM banks
§  3D stack — DRAM dies stacked on a logic die containing the AP/EP
§  FPGA — AP + eDRAM, loosely coupled
  • 24. Enabling Technologies — Programmable state machine (B-FSM)
§  Multiway branch capability supporting the evaluation of many (combinations of) conditions in parallel
   – loop conditions (counters, timers), data arrival, etc.
   – hundreds of branches can be evaluated in parallel
§  Reacts extremely fast: dispatches instructions within 2 cycles (@ > 2 GHz)
§  Multi-threaded operation
§  Fast sleep/nap mode
(diagram: a condition vector feeds the B-FSM engine and its instruction memory; the resulting instruction vector controls a data path of address mapper, ALU, and register file, plus bus control and memory control)
  • 25. B-FSM – Programmable State Machine Technology
§  Novel HW-based programmable state machine
   – deterministic rate of 1 transition/cycle @ > 2 GHz
   – storage grows approximately linearly with DFA size
     • 1K transitions fit in ~5 KB, 1M transitions fit in ~5 MB
   – supports wide input vectors (8–32 bits) and flexible branch conditions: e.g., exact-, range-, and ternary-match, negation, case-insensitive è TCAM emulation
§  Successfully applied to a range of accelerators
   – regular-expression scanners, protocol engines, XML parsers
   – processing rates of ~20 Gbit/s for a single B-FSM in 45 nm
   – small area cost enables scaling to extremely high aggregate processing rates
Source: “Designing a programmable wire-speed regular-expression matching accelerator,” IEEE/ACM Int. Symposium on Microarchitecture (MICRO-45), 2012
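A table-driven sketch can illustrate the kind of transition rules described above. The rule layout and names here are my own illustration, not the published B-FSM design: each state holds a list of (value, mask) rules, so one rule can express an exact match, a ternary match with “don’t care” bits (the TCAM-emulation case), or a wildcard default, and one transition is taken per step.

```python
# rule = (value, mask, next_state, output); the input matches when
# (input & mask) == (value & mask). mask = 0 acts as a wildcard default.
RULES = {
    "idle": [
        (0x80, 0x80, "run", "start"),    # bit 7 set (e.g. data arrival) -> run
        (0x00, 0x00, "idle", None),      # default: stay idle
    ],
    "run": [
        (0x01, 0x0F, "idle", "done"),    # low nibble == 1 (e.g. counter hit) -> idle
        (0x00, 0x00, "run", "step"),     # default: keep running
    ],
}

def step(state, inp):
    """One transition per cycle; the first matching rule wins."""
    for value, mask, nxt, out in RULES[state]:
        if (inp & mask) == (value & mask):
            return nxt, out
    raise RuntimeError("no matching rule")

state, outputs = "idle", []
for inp in [0x00, 0x81, 0x10, 0x21]:     # idle, data arrival, running, counter hit
    state, out = step(state, inp)
    outputs.append(out)
print(state, outputs)   # -> idle [None, 'start', 'step', 'done']
```

The hardware evaluates all of a state’s rules in parallel in a single cycle; this sequential loop only models the matching semantics.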
  • 26. Enabling Technologies — Programmable address mapping
§  Unique programmable interleaving of multiple power-of-2 address strides
§  Support for a non-power-of-2 number of banks and/or memory regions of non-identical sizes
§  Based on a small lookup table (typ. 4–32 bytes)
(diagram: the B-FSM engine block diagram from slide 24; the address mapper sits in its data path)
  • 27. Programmable Address Mapping
(diagram: an address from the linear address space is translated via a lookup table into a bank identifier and an internal bank address, which select a location in one of the eDRAM banks)
  • 28. Programmable Address Mapping
§  LUT size = 4 bytes
§  Simultaneous interleaving of row and column accesses
§  Example: n = 256 — two power-of-2 strides: 1 and 256
(diagram: a matrix a11 … amn stored across 8 memory banks with a skewed layout; consecutive addresses 0, 1, 2, 3, … and column-stride addresses 0, 256, 512, 768, … each land in distinct banks, so row and column walks are both conflict-free)
  • 29. Programmable Address Mapping
§  LUT size = 16 bytes
§  Simultaneous interleaving of row, column, and “vertical layer” accesses
§  Example — three power-of-2 strides: 1, 16 and 256
(diagram: addresses spread across 16 memory banks such that walks with stride 1, stride 16, and stride 256 each visit distinct banks)
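The conflict-free property from the two-stride example (slide 28) can be reproduced with a tiny lookup table. The exact table layout of the IBM scheme is not given on the slides; this sketch uses a plausible modular skew (8 entries of 3 bits, on the order of the stated 4 bytes) and only demonstrates the property, not the real implementation.

```python
NUM_BANKS = 8
# Programmable LUT: each 256-element "row" gets a bank-skew value.
LUT = [i % NUM_BANKS for i in range(NUM_BANKS)]

def map_address(addr):
    """Split a linear address into (bank id, internal bank address)."""
    skew = LUT[(addr >> 8) % NUM_BANKS]          # row index selects the skew
    bank = ((addr % NUM_BANKS) + skew) % NUM_BANKS
    internal = addr // NUM_BANKS                 # location inside the bank
    return bank, internal

# Both the unit-stride (row) walk and the 256-stride (column) walk hit all
# 8 banks exactly once per 8 accesses, i.e. both are conflict-free:
row_banks = [map_address(a)[0] for a in range(8)]               # stride 1
col_banks = [map_address(a)[0] for a in range(0, 8 * 256, 256)] # stride 256
assert sorted(row_banks) == list(range(NUM_BANKS))
assert sorted(col_banks) == list(range(NUM_BANKS))
```

The three-stride case of slide 29 extends the same idea: with 16 banks, adding a second skew term derived from the stride-16 digit keeps walks with strides 1, 16, and 256 all conflict-free.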
  • 30. Power Efficiency
Power
§  Access Processor power estimate for 14 nm, > 2 GHz: 25–30 mW
Amortize Access Processor energy over a large amount of data
§  Maximize memory bandwidth utilization (the basic objective of the accelerator)
   – bank interleaving, row-buffer locality
§  Exploit wide access granularities
   – example: 512-bit accesses @ 500 MHz (eDRAM)
   – estimated AP energy overhead per accessed bit ≈ 0.1 pJ
è Make sure that all data is effectively used
   ‒ improved algorithms
   ‒ data shuffle unit: swaps data at various granularities within one or multiple lines (configured similarly to an EP)
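The ≈ 0.1 pJ/bit figure follows directly from the numbers on the slide: the AP’s power is amortized over the eDRAM access bandwidth. A back-of-the-envelope check:

```python
ap_power_w      = 30e-3    # upper end of the 25-30 mW AP estimate (14 nm)
bits_per_access = 512      # wide eDRAM access
accesses_per_s  = 500e6    # 500 MHz access rate

bandwidth_bits = bits_per_access * accesses_per_s      # 2.56e11 bit/s
energy_per_bit_pj = ap_power_w / bandwidth_bits * 1e12 # J -> pJ

print(f"{energy_per_bit_pj:.3f} pJ/bit")   # -> 0.117 pJ/bit, i.e. ~0.1 pJ
```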
  • 31. Conclusion
New Programmable Near-Memory Accelerator
§  Made feasible by novel state-machine and address-mapping technologies that enable programmable address generation, address mapping, and access scheduling operations in “real time”
§  Objective is to minimize the (energy) overhead that goes beyond the basic storage and processing needs (memory and execution units)
§  Proof of concept for selected workloads using FPGA prototypes and initial compiler stacks
§  More details to be published soon