1. 1
Very long instruction word
▪ Very long instruction word (VLIW) refers to a processor
architecture designed to take advantage of instruction-level
parallelism (ILP).
▪ Conventional processors mostly allow programs to specify only
instructions that are executed one after another.
▪ A VLIW processor, by contrast, allows programs to explicitly specify
instructions to be executed at the same time (i.e., in parallel).
▪ This type of processor architecture is intended to allow higher
performance without the inherent complexity of some other
approaches.
▪ The term VLIW, and the concept of VLIW architecture itself, were
invented by Josh Fisher in his research group at Yale University in
the early 1980s.
2. 2
Explicitly parallel instruction computing
▪ Explicitly parallel instruction computing (EPIC) is a term coined in
1997 by the HP-Intel alliance to describe a computing paradigm that
researchers had been investigating since the early 1980s.
▪ This paradigm is also called Independence architectures. It was the
basis for Intel and HP's development of the Intel Itanium architecture,
and HP later asserted that "EPIC" was merely an old term for the
Itanium architecture.
▪ EPIC permits microprocessors to execute software instructions in
parallel by using the compiler, rather than complex on-die circuitry,
to control parallel instruction execution.
▪ This was intended to allow simple performance scaling without
resorting to higher clock frequencies.
3. 3
VLIW or EPIC
▪ VLIW (very long instruction word):
- The compiler packs a fixed number of instructions into a single VLIW
instruction.
- The instructions within a VLIW instruction are issued and executed in
parallel.
- Example: high-end signal processors (TMS320C6201)
▪ EPIC (explicitly parallel instruction computing):
- Evolution of VLIW
- Example: Intel's IA-64, exemplified by the Itanium processor
4. 4
VLIW
▪ VLIW (very long instruction word)
processors use a long instruction word that
contains a usually fixed number of
operations that are fetched, decoded,
issued, and executed synchronously.
▪ All operations specified within a VLIW
instruction must be independent of one
another.
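The independence requirement can be sketched as a check a compiler might run before packing operations into one word. This is a minimal, conservative illustration: the `Op` type, the register names, and the rule that any shared register between a write and a read or write counts as a dependence are assumptions of the sketch, not a description of a particular VLIW machine.

```python
from typing import List, NamedTuple, Tuple

class Op(NamedTuple):
    """One operation: the register it writes and the registers it reads."""
    name: str
    dest: str
    srcs: Tuple[str, ...]

def independent(a: Op, b: Op) -> bool:
    """True if neither operation reads or writes a register the other writes."""
    read_write_conflict = a.dest in b.srcs or b.dest in a.srcs
    write_write_conflict = a.dest == b.dest
    return not (read_write_conflict or write_write_conflict)

def is_legal_word(ops: List[Op]) -> bool:
    """A word is legal here only if every pair of its operations is independent."""
    return all(independent(a, b) for i, a in enumerate(ops) for b in ops[i + 1:])

word_ok = [Op("add", "r1", ("r2", "r3")), Op("mul", "r4", ("r5", "r6"))]
word_bad = [Op("add", "r1", ("r2", "r3")), Op("sub", "r7", ("r1", "r5"))]
print(is_legal_word(word_ok))   # True
print(is_legal_word(word_bad))  # False: sub reads r1, which add writes
```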
5. 5
VLIW
▪ Some of the key characteristics of a (V)LIW processor:
• (very) long instruction word (up to 1,024 bits per
instruction),
• each instruction consists of multiple independent
parallel operations,
• each operation requires a statically known number
of cycles to complete,
• a central controller that issues a long instruction
word every cycle,
• multiple FUs connected through a global shared
register file.
6. 6
VLIW
▪ sequential stream of long instruction words
▪ instructions scheduled statically by the
compiler
▪ number of simultaneously issued
instructions is fixed at compile time
▪ instruction issue is less complicated than in
a superscalar processor
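Static scheduling by the compiler can be illustrated with a small greedy scheduler. This is a sketch under stated assumptions: every operation has a latency of one cycle, any slot can hold any operation, and the function name `schedule` and the dependence table are made up for the example.

```python
def schedule(ops, deps, width):
    """Greedy list scheduling into long instruction words.

    ops:   operation names in program order
    deps:  dict mapping an operation to the set of operations it depends on
    width: number of slots in each long instruction word
    Returns one list per word, padded with "nop" up to `width`.
    """
    cycle_of = {}
    words = []
    for op in ops:
        # Earliest cycle after every producer has issued (unit latency assumed).
        cycle = max((cycle_of[d] + 1 for d in deps.get(op, ())), default=0)
        # Skip words that are already full.
        while cycle < len(words) and len(words[cycle]) >= width:
            cycle += 1
        while len(words) <= cycle:
            words.append([])
        words[cycle].append(op)
        cycle_of[op] = cycle
    return [w + ["nop"] * (width - len(w)) for w in words]

ops = ["a", "b", "c", "d", "e"]
deps = {"c": {"a", "b"}, "d": {"c"}, "e": {"a"}}
for word in schedule(ops, deps, width=2):
    print(word)
# ['a', 'b']
# ['c', 'e']
# ['d', 'nop']
```

The issue hardware then only has to fetch one word per cycle and dispatch its slots, which is why issue logic is simpler than in a superscalar design.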
7. 7
VLIW
▪ Disadvantage: VLIW processors cannot react to dynamic events,
e.g. cache misses, with the same flexibility as superscalars.
▪ The number of instructions in a VLIW instruction word is
usually fixed.
▪ VLIW instructions must be padded with no-ops whenever the
full issue bandwidth cannot be used. This increases code
size. More recent VLIW architectures use a denser code
format that allows the no-ops to be removed.
▪ VLIW is an architectural technique, whereas superscalar is
a microarchitectural technique.
▪ VLIW processors take advantage of spatial parallelism.
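The code-size cost of no-op padding can be estimated with a little arithmetic. The sketch below compares a fixed format with a hypothetical denser format in which each word carries a slot mask and only its useful operations; the mask encoding, the 32-bit operation size, and the sample schedule are assumptions for illustration and do not describe a specific product.

```python
def padded_size(ops_per_word, width, op_bits):
    """Fixed format: every word carries `width` slots; empty slots hold no-ops."""
    return len(ops_per_word) * width * op_bits

def dense_size(ops_per_word, op_bits, mask_bits):
    """Assumed denser format: a slot mask per word plus only the useful operations."""
    return sum(mask_bits + n * op_bits for n in ops_per_word)

ops_per_word = [4, 1, 2, 1]   # useful operations in each of four words
width, op_bits = 4, 32

fixed = padded_size(ops_per_word, width, op_bits)
dense = dense_size(ops_per_word, op_bits, mask_bits=width)
nop_fraction = 1 - sum(ops_per_word) / (len(ops_per_word) * width)

print(fixed)         # 512 bits
print(dense)         # 272 bits
print(nop_fraction)  # 0.5
```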
8. 8
EPIC: a paradigm shift
▪ Superscalar RISC solution
• Based on sequential execution semantics
• Compiler's role is limited by the instruction set architecture
• Superscalar hardware identifies and exploits parallelism
▪ EPIC solution (the evolution of VLIW)
• Based on parallel execution semantics
• EPIC ISA enhancements support static parallelization
• Compiler takes greater responsibility for exploiting parallelism
• Compiler / hardware collaboration often resembles superscalar
9. 9
EPIC: a paradigm shift
▪ Advantages of pursuing EPIC architectures
• Make wide issue & deep latency less expensive in
hardware
• Allow processor parallelism to scale with additional
VLSI density
▪ Architect the processor to do well with in-order execution
• Enhance the ISA to allow static parallelization
• Use compiler technology to parallelize the program
• However, a purely static VLIW is not appropriate for
general-purpose use
10. 10
The fusion of VLIW and superscalar techniques
▪ Superscalars need improved support for static parallelization
• Static scheduling
• Limited support for predicated execution
▪ VLIWs need improved support for dynamic parallelization
• Caches introduce dynamically changing memory latency
• Compatibility: issue width and latency may change with new
hardware
• Application requirements, e.g. object-oriented programming with
dynamic binding
▪ EPIC processors exhibit features derived from both
• Interlock & out-of-order execution hardware are compatible with
EPIC (but not required!)
• EPIC processors can use dynamic translation to parallelize in
software
11. 11
Many EPIC features are taken from VLIWs
▪ Minisupercomputer products stimulated VLIW research (FPS,
Multiflow, Cydrome)
▪ Minisupercomputers were specialized, costly, and short-lived
▪ Traditional VLIWs not suited to general-purpose computing
▪ VLIW resurgence in single-chip DSP & media processors
▪ Minisupercomputers exaggerated forward-looking challenges:
• Long latency
• Wide issue
• Large number of architected registers
• Compile-time scheduling to exploit exotic amounts of parallelism
▪ EPIC exploits many VLIW techniques
12. 12
Shortcomings of early VLIWs
▪ Expensive multi-chip implementations
▪ No data cache
▪ Poor "scalar" performance
▪ No strategy for object code compatibility
13. 13
EPIC design challenges
▪ Develop architectures applicable to general-purpose
computing
• Find substantial parallelism in "difficult to parallelize"
scalar programs
• Provide compatibility across hardware generations
• Support emerging applications (e.g. multimedia)
▪ Compiler must find or create sufficient ILP
▪ Combine the best attributes of VLIW & superscalar RISC
(incorporating the best concepts from all available sources)
▪ Scale architectures for modern single-chip implementation
14. 14
EPIC Processors, Intel's IA-64 ISA and Itanium
▪ Joint R&D project by Hewlett-Packard and Intel
(announced in June 1994)
▪ This resulted in the explicitly parallel instruction
computing (EPIC) design style:
• specifying ILP explicitly in the machine code, that is, the
parallelism is encoded directly into the instructions,
similarly to VLIW;
• a fully predicated instruction set;
• an inherently scalable instruction set (i.e., the ability to
scale to a large number of FUs);
• many registers;
• speculative execution of load instructions.
15. 15
IA-64 Architecture
▪ Unique architecture features & enhancements
• Explicit parallelism and templates
• Predication, speculation, memory support, and
others
• Floating-point and multimedia architecture
▪ IA-64 resources available to applications
• Large, application-visible register set
• Rotating registers, register stack, register stack
engine
▪ IA-32 & PA-RISC compatibility models
16. 16
Today's Architecture Challenges
▪ Performance barriers:
• Memory latency
• Branches
• Loop pipelining and call / return overhead
▪ Headroom constraints:
• Hardware-based instruction scheduling
- Unable to efficiently schedule parallel execution
• Resource constrained
- Too few registers
- Unable to fully utilize multiple execution units
▪ Scalability limitations:
• Memory addressing efficiency
17. 17
Intel's IA-64 ISA
▪ Intel 64-bit Architecture (IA-64) register model:
• 128 general-purpose registers (64 bits each), GR0-GR127,
to hold values for integer and multimedia computations
- each register has one additional NaT (Not a Thing) bit to
indicate whether the stored value is valid
• 128 floating-point registers (82 bits each), FR0-FR127
- registers FR0 and FR1 are read-only with values +0.0 and +1.0
• 64 predicate registers (1 bit each), PR0-PR63
- the first register, PR0, is read-only and always reads 1 (true)
• 8 branch registers (64 bits each), BR0-BR7, to specify the target
addresses of indirect branches
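The NaT bit is what makes speculative loads safe: a load hoisted above its guard can record "no valid value" instead of faulting, and the program checks the bit only where the value is really needed. The following is a simplified model of that behaviour; the class `GR`, the dictionary used as memory, and the helper names are hypothetical and only loosely mirror the hardware mechanism.

```python
class GR:
    """A general register: a 64-bit value plus its NaT bit."""
    def __init__(self, value=0, nat=False):
        self.value = value & (2**64 - 1)
        self.nat = nat

def speculative_load(memory, addr):
    """Load that defers a fault: a bad address yields NaT instead of raising."""
    if addr in memory:
        return GR(memory[addr])
    return GR(0, nat=True)

def add(a, b):
    """NaT propagates: the result is invalid if either input is invalid."""
    return GR(a.value + b.value, nat=a.nat or b.nat)

def check(reg, recovery):
    """Run the recovery code only if the speculated value turned out invalid."""
    return recovery() if reg.nat else reg

memory = {0x100: 40}
good = add(speculative_load(memory, 0x100), GR(2))
bad = add(speculative_load(memory, 0x999), GR(2))

print(good.value, good.nat)                  # 42 False
print(bad.nat)                               # True
print(check(bad, lambda: GR(7)).value)       # 7, taken from the recovery path
```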
19. 19
Intel's IA-64 ISA
• IA-64 instructions are 41 bits long (earlier sources stated 40 bits) and consist of
- op-code,
- predicate field (6 bits),
- two source register addresses (7 bits each),
- destination register address (7 bits), and
- special fields (including integer and floating-point arithmetic).
• The 6-bit predicate field in each IA-64 instruction selects one of the 64 predicate
registers.
• 6 types of instructions
- A: Integer ALU ==> I-unit or M-unit
- I: Non-ALU integer ==> I-unit
- M: Memory ==> M-unit
- B: Branch ==> B-unit
- F: Floating-point ==> F-unit
- L: Long Immediate ==> I-unit
• IA-64 instructions are packed by the compiler into bundles.
20. 20
IA-64 Bundles
▪ A bundle is a 128-bit long instruction word (LIW) containing three 41-bit IA-64
instructions along with a 5-bit template that contains instruction
grouping information.
▪ Unlike a fixed-format VLIW, IA-64 does not need no-op instructions to pad code out
to the machine's full issue width; a bundle slot that cannot be filled still holds an explicit nop.
▪ The template explicitly indicates:
• first 4 bits: the types of the instructions
• last bit (stop bit): whether the bundle can be executed in parallel with the
next bundle
• (in earlier literature): whether the instructions in the bundle can be executed
in parallel or whether one or more must be executed serially.
▪ Bundled instructions don't have to be in their original program order, and they
can even represent entirely different paths of a branch.
▪ Also, the compiler can mix dependent and independent instructions together in a
bundle, because the template keeps track of which is which.
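The bundle size follows directly from its parts: 3 x 41 + 5 = 128 bits. A minimal sketch of packing and unpacking a bundle as a Python integer is shown here; placing the template in the low-order bits with instruction 0 next to it is an assumption of the sketch, and the function names are made up.

```python
SLOT_BITS, TEMPLATE_BITS = 41, 5
BUNDLE_BITS = 3 * SLOT_BITS + TEMPLATE_BITS

def pack_bundle(template, slot0, slot1, slot2):
    """Combine a 5-bit template and three 41-bit instructions into 128 bits."""
    assert 0 <= template < 2**TEMPLATE_BITS
    assert all(0 <= s < 2**SLOT_BITS for s in (slot0, slot1, slot2))
    return (template
            | (slot0 << TEMPLATE_BITS)
            | (slot1 << (TEMPLATE_BITS + SLOT_BITS))
            | (slot2 << (TEMPLATE_BITS + 2 * SLOT_BITS)))

def unpack_bundle(bundle):
    """Inverse of pack_bundle: returns (template, slot0, slot1, slot2)."""
    slot_mask = (1 << SLOT_BITS) - 1
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = tuple((bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask
                  for i in range(3))
    return (template,) + slots

bundle = pack_bundle(0b01000, 111, 222, 333)
print(BUNDLE_BITS)             # 128
print(unpack_bundle(bundle))   # (8, 111, 222, 333)
```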
21. 21
IA-64: Explicitly Parallel Architecture
▪ IA-64 template specifies
• The type of operation for each instruction
- MFI, MMI, MII, MLI, MIB, MMF, MFB, MMB,
MBB, BBB
• Intra-bundle relationship
- M / MI or MI / I
• Inter-bundle relationship
▪ Most common combinations covered by templates
• Headroom for additional templates
▪ Simplifies hardware requirements
▪ Scales compatibly to future generations
Figure: bundle format (128 bits): Instruction 2 (41 bits) | Instruction 1 (41 bits) | Instruction 0 (41 bits) | Template (5 bits)
Legend: M = Memory, F = Floating-point, I = Integer, L = Long Immediate, B = Branch
Example: MMI = Memory (M), Memory (M), Integer (I)
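Template names such as MMI, or M / MI with an intra-bundle stop, can be read mechanically: each letter gives the type of one slot, and a slash marks where parallel execution stops. The helper below is a hypothetical illustration of that notation only; it does not decode the real 5-bit template field.

```python
SLOT_TYPE = {
    "M": "Memory",
    "F": "Floating-point",
    "I": "Integer",
    "L": "Long Immediate",
    "B": "Branch",
}

def groups(template):
    """Split a template string at stops ("/") into groups of slot types."""
    return [[SLOT_TYPE[letter] for letter in part.strip()]
            for part in template.split("/")]

print(groups("MMI"))     # [['Memory', 'Memory', 'Integer']]
print(groups("MI / I"))  # [['Memory', 'Integer'], ['Integer']]
```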