The document discusses the evolution of programmable logic from TTL to FPGAs. It describes how early programmable logic arrays (PLAs) combined logic gates and registers into single devices with programmable connections. Modern FPGAs arrange logic blocks in an array with programmable interconnect to implement complex digital designs with high density, performance and reprogrammability. The document outlines FPGA architecture including look-up tables, routing resources and specialized blocks to efficiently implement applications like high-speed data processing.
2. Programmable Logic
Evolution: TTL → PLA → CPLD → FPGA
ASIC
Development aspects
Using FPGAs for high-speed data processing
OpenCL
10. General features of logic implementations
Sum of products (AND-OR gates, combinatorial logic)
Stored results (registered outputs)
Wired together
What if:
Logic functions were fixed (like TTL), but combined into a single device?
Wiring (routing) connections could be controlled (programmed) somehow?
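The sum-of-products structure above can be sketched in C: any combinational function expressed as an OR of AND product terms, which is exactly the AND-OR plane a programmable array implements (the function and names here are illustrative, not from the slides):

```c
#include <stdbool.h>

/* Sum of products: f = (a AND b) OR (NOT a AND c).
 * A programmable AND-OR array implements this structure, with
 * the product terms selected by programmable connections. */
static bool f(bool a, bool b, bool c) {
    bool p0 = a && b;   /* first product term  */
    bool p1 = !a && c;  /* second product term */
    return p0 || p1;    /* OR plane            */
}
```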
11. Simplest implementation of programmable logic
Logic gates and registers are fixed
Programmable sum-of-products array and output control
12. Fewer devices required
Lower cost
Power savings
Simpler to test and debug
Design security (prevent reverse engineering)
Design flexibility
Automated tools simplify and consolidate design flow
In-system reprogrammability! (in some cases)
14. Combine multiple PLDs in a single device with programmable interconnect and I/O
15. Ample amounts of logic and advanced configurable I/Os
Programmable routing
Instant on
Low cost
Non-volatile configuration
Reprogrammable
16. Higher-density CPLDs don’t scale well because they require additional global routing
Rearrange the LABs themselves into an array
17. LABs arranged in an array
Row and column programmable interconnect
Interconnect may span all or part of the array
18. FPGA LABs made up of logic elements (LEs) instead of product terms and macrocells
Easier to create complex functions through LE cascading
19. Replaces product term array
Combinational functions created with programmed “tables” (cascaded multiplexers)
LUT inputs are mux select lines
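The LUT idea — a programmed truth table read out by a multiplexer whose select lines are the logic inputs — can be sketched in C for a 4-input LUT (names are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

/* A 4-input LUT: the 16-bit 'table' holds the programmed truth
 * table, and the four inputs act as mux select lines that pick
 * one bit out of it. Any 4-input function is just a different
 * table value. */
static bool lut4(uint16_t table, bool i3, bool i2, bool i1, bool i0) {
    unsigned sel = (unsigned)(i3 << 3 | i2 << 2 | i1 << 1 | i0);
    return (table >> sel) & 1u;
}
```

For example, `table = 0x8000` (only bit 15 set) programs a 4-input AND, and `table = 0xFFFE` (every bit except bit 0) programs a 4-input OR.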
20. Based on LE, but includes dedicated resources & adaptive LUT (ALUT)
Improves performance and resource utilization
21. All device resources can feed into or be fed by any routing in the device
Differing fixed lengths to adjust for timing
Scales linearly as density increases
Local interconnect
Connects between LEs or ALMs within a LAB
Can include direct connections between adjacent LABs
Row and column interconnect
Fixed-length routing segments
Span a number of LABs or the entire device
22. Embedded multipliers
Useful for DSP
High-performance multiply/add/accumulate operations
Memory blocks
High-speed transceivers
Replace some LABs with dedicated functional hardware blocks
PLLs
SDRAM controllers
Hard Processor System
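The multiply/add/accumulate pattern these DSP blocks accelerate is the core of filtering and dot-product workloads; in C terms (an illustrative sketch, names are assumptions):

```c
/* Multiply-accumulate: the operation an embedded DSP block
 * performs once per clock — acc += a[i] * b[i]. */
static long mac(const int *a, const int *b, int n) {
    long acc = 0;
    for (int i = 0; i < n; i++)
        acc += (long)a[i] * b[i];  /* widen before multiply to avoid overflow */
    return acc;
}
```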
24. FPGA programming information must be stored somewhere to program the device at power-on
Use an external EEPROM, CPLD, or CPU to program
Two programming methods:
Active: FPGA controls the programming sequence automatically at power-on
Passive: Intelligent host (typically a CPU) controls programming
Also programmable through a JTAG connection
25. High density to create many complex logic functions
High performance
Low cost
Integration of many functions
Many available I/O standards and features
Fast programming
26. A true ASIC: no configuration at power-on required
Create and test design with FPGA device
Migrate design to a pin-compatible, functionally equivalent ASIC device
27. Verilog Hardware Description Language
VHDL - VHSIC (Very High Speed Integrated Circuit) Hardware Description Language
37. [Figure: a processor executes the instruction sequence one step at a time, spread out along the Time axis]
R0 Load Mem[100]
R1 Load Mem[101]
R2 Load #42
R2 Mul R1, R2
R0 Add R2, R0
Store R0 Mem[100]
38. [Figure: the same instruction sequence laid out in Space — one hardware unit per instruction]
R0 Load Mem[100]
R1 Load Mem[101]
R2 Load #42
R2 Mul R1, R2
R0 Add R2, R0
Store R0 Mem[100]
39.–44. [Animation: the spatial datapath for the same instruction sequence is optimized step by step]
1. Instructions are fixed. Remove “Fetch”
2. Remove unused ALU ops
3. Remove unused Load / Store
4. Wire up registers properly! And propagate state.
5. Remove dead data.
6. Reschedule!
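In C terms, the six-instruction sequence being transformed computes a single multiply-add on two memory words; modeling data memory as an array (an illustrative sketch):

```c
/* Mem[100] = Mem[100] + 42 * Mem[101] — the computation the
 * instruction sequence performs. 'mem' models the data memory. */
static void multiply_add(int *mem) {
    int r0 = mem[100];   /* R0 <- Load Mem[100]  */
    int r1 = mem[101];   /* R1 <- Load Mem[101]  */
    int r2 = 42;         /* R2 <- Load #42       */
    r2 = r1 * r2;        /* R2 <- Mul R1, R2     */
    r0 = r2 + r0;        /* R0 <- Add R2, R0     */
    mem[100] = r0;       /* Store R0 -> Mem[100] */
}
```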
45. [Figure: the resulting datapath — two Loads, the constant 42, Mul, Add, and Store wired directly together]
FPGA datapath = your algorithm, in silicon
Build exactly what you need:
Operations
Data widths
Memory size, configuration
Efficiency:
Throughput / Latency / Power
49. To achieve acceleration, we can pipeline each iteration of the loop
Analyze any dependencies between iterations
Schedule these operations
Launch the next iteration as soon as possible
float array[M];
for (int i = 0; i < n * numSets; i++)
{
    for (int j = 0; j < M - 1; j++)
        array[j] = array[j + 1];
    array[M - 1] = a[i];
    for (int j = 0; j < M; j++)
        answer[i] += array[j] * coefs[j];
}
At this point, we can launch the next iteration
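The slide's loop fragment can be filled out into a self-contained C sketch of the shift-register FIR filter it describes (the tap count M, the simplified trip count n, and zero-initializing each answer[i] are assumptions added here to make it runnable):

```c
#define M 4  /* number of filter taps — illustrative choice */

/* Shift-register FIR filter, as on the slide: each outer
 * iteration shifts in one new sample and computes a dot
 * product of the sample window with the coefficients. */
static void fir(const float *a, float *answer, int n, const float *coefs) {
    float array[M] = {0};
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < M - 1; j++)  /* shift register */
            array[j] = array[j + 1];
        array[M - 1] = a[i];             /* new sample in  */
        answer[i] = 0;
        for (int j = 0; j < M; j++)      /* dot product    */
            answer[i] += array[j] * coefs[j];
    }
}
```

The inner loops have no dependency on the next sample, so a pipelining compiler can launch iteration i+1 as soon as iteration i's shift completes.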
50. [Figure: without loop pipelining, iterations i0, i1, i2 run one after another; with loop pipelining, iterations i0–i4 overlap in time]
Looks almost like parallel thread execution!
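A simplified cycle-count model (an assumption added here, not from the slides) makes the benefit concrete: if one iteration takes `depth` cycles and the pipeline launches a new iteration every `ii` cycles (the initiation interval), then n iterations finish in depth + (n-1)*ii cycles instead of n*depth:

```c
/* Simplified cycle-count model of loop pipelining,
 * assuming no stalls or inter-iteration dependencies. */
static long cycles_sequential(long n, long depth) {
    return n * depth;             /* iterations run back-to-back      */
}
static long cycles_pipelined(long n, long depth, long ii) {
    return depth + (n - 1) * ii;  /* a new iteration launches each ii */
}
```

With depth = 10 and ii = 1, 100 iterations take 109 cycles instead of 1000 — close to one result per cycle, which is why it looks almost like parallel thread execution.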