The document discusses the evolution of programmable logic from TTL to FPGAs. It describes how early programmable logic arrays (PLAs) combined logic gates and registers into single devices with programmable connections. Modern FPGAs arrange logic blocks in an array with programmable interconnect to implement complex digital designs with high density, performance and reprogrammability. The document outlines FPGA architecture including look-up tables, routing resources and specialized blocks to efficiently implement applications like high-speed data processing.
2. Programmable Logic
Evolution: TTL → PLA → CPLD → FPGA
ASIC
Development aspects
Using FPGAs for high-speed data processing
OpenCL
10. General features of logic implementations
Sum of products (AND-OR gates, combinatorial logic)
Stored results (registered outputs)
Wired together
What if:
Logic functions were fixed (like TTL), but combined into a single device?
Wiring (routing) connections could be controlled (programmed) somehow?
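The sum-of-products structure above can be sketched in C: any combinational function expressed as an OR of AND product terms, which is exactly the AND-OR plane a programmable array implements (the function and names here are illustrative, not from the slides):

```c
#include <stdbool.h>

/* Sum of products: f = (a AND b) OR (NOT a AND c).
 * A programmable AND-OR array implements this structure, with
 * the product terms selected by programmable connections. */
static bool f(bool a, bool b, bool c) {
    bool p0 = a && b;   /* first product term  */
    bool p1 = !a && c;  /* second product term */
    return p0 || p1;    /* OR plane            */
}
```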
11. Simplest implementation of programmable logic
Logic gates and registers are fixed
Programmable sum-of-products array and output control
12. Fewer devices required
Lower cost
Power savings
Simpler to test and debug
Design security (prevent reverse engineering)
Design flexibility
Automated tools simplify and consolidate design flow
In-system reprogrammability! (in some cases)
14. Combine multiple PLDs in a single device with programmable interconnect and I/O
15. Ample amounts of logic and advanced configurable I/Os
Programmable routing
Instant on
Low cost
Non-volatile configuration
Reprogrammable
16. Higher-density CPLDs don’t scale well because they require additional global routing
Rearrange the LABs themselves into an array
17. LABs arranged in an array
Row and column programmable interconnect
Interconnect may span all or part of the array
18. FPGA LABs made up of logic elements (LEs) instead of product terms and macrocells
Easier to create complex functions through LE cascading
19. Replaces product term array
Combinational functions created with programmed “tables” (cascaded multiplexers)
LUT inputs are mux select lines
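The LUT idea — a programmed truth table read out by a multiplexer whose select lines are the logic inputs — can be sketched in C for a 4-input LUT (names are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

/* A 4-input LUT: the 16-bit 'table' holds the programmed truth
 * table, and the four inputs act as mux select lines that pick
 * one bit out of it. Any 4-input function is just a different
 * table value. */
static bool lut4(uint16_t table, bool i3, bool i2, bool i1, bool i0) {
    unsigned sel = (unsigned)(i3 << 3 | i2 << 2 | i1 << 1 | i0);
    return (table >> sel) & 1u;
}
```

For example, `table = 0x8000` (only bit 15 set) programs a 4-input AND, and `table = 0xFFFE` (every bit except bit 0) programs a 4-input OR.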
20. Based on LE, but includes dedicated resources & adaptive LUT (ALUT)
Improves performance and resource utilization
21. All device resources can feed into or be fed by any routing in the device
Differing fixed lengths to adjust for timing
Scales linearly as density increases
Local interconnect
Connects between LEs or ALMs within a LAB
Can include direct connections between adjacent LABs
Row and column interconnect
Fixed-length routing segments
Span a number of LABs or the entire device
22. Embedded multipliers
Useful for DSP
High-performance multiply/add/accumulate operations
Memory blocks
High-speed transceivers
Replace some LABs with dedicated functional hardware blocks
PLLs
SDRAM controllers
Hard Processor System
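The multiply/add/accumulate pattern these DSP blocks accelerate is the core of filtering and dot-product workloads; in C terms (an illustrative sketch, names are assumptions):

```c
/* Multiply-accumulate: the operation an embedded DSP block
 * performs once per clock — acc += a[i] * b[i]. */
static long mac(const int *a, const int *b, int n) {
    long acc = 0;
    for (int i = 0; i < n; i++)
        acc += (long)a[i] * b[i];  /* widen before multiply to avoid overflow */
    return acc;
}
```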
24. FPGA programming information must be stored somewhere to program the device at power-on
Use an external EEPROM, CPLD, or CPU to program
Two programming methods:
Active: FPGA controls the programming sequence automatically at power-on
Passive: Intelligent host (typically a CPU) controls programming
Also programmable through a JTAG connection
25. High density to create many complex logic functions
High performance
Low cost
Integration of many functions
Many available I/O standards and features
Fast programming
26. A true ASIC: no configuration at power-on required
Create and test design with FPGA device
Migrate design to a pin-compatible, functionally equivalent ASIC device
27. Verilog Hardware Description Language
VHDL - VHSIC (Very High Speed Integrated Circuit) Hardware Description Language
37. [Figure: a processor executes the instruction sequence one step at a time, spread out along the Time axis]
R0 Load Mem[100]
R1 Load Mem[101]
R2 Load #42
R2 Mul R1, R2
R0 Add R2, R0
Store R0 Mem[100]
38. [Figure: the same instruction sequence laid out in Space — one hardware unit per instruction]
R0 Load Mem[100]
R1 Load Mem[101]
R2 Load #42
R2 Mul R1, R2
R0 Add R2, R0
Store R0 Mem[100]
39.–44. [Animation: the spatial datapath for the same instruction sequence is optimized step by step]
1. Instructions are fixed. Remove “Fetch”
2. Remove unused ALU ops
3. Remove unused Load / Store
4. Wire up registers properly! And propagate state.
5. Remove dead data.
6. Reschedule!
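In C terms, the six-instruction sequence being transformed computes a single multiply-add on two memory words; modeling data memory as an array (an illustrative sketch):

```c
/* Mem[100] = Mem[100] + 42 * Mem[101] — the computation the
 * instruction sequence performs. 'mem' models the data memory. */
static void multiply_add(int *mem) {
    int r0 = mem[100];   /* R0 <- Load Mem[100]  */
    int r1 = mem[101];   /* R1 <- Load Mem[101]  */
    int r2 = 42;         /* R2 <- Load #42       */
    r2 = r1 * r2;        /* R2 <- Mul R1, R2     */
    r0 = r2 + r0;        /* R0 <- Add R2, R0     */
    mem[100] = r0;       /* Store R0 -> Mem[100] */
}
```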
45. [Figure: the resulting datapath — two Loads, the constant 42, Mul, Add, and Store wired directly together]
FPGA datapath = your algorithm, in silicon
Build exactly what you need:
Operations
Data widths
Memory size, configuration
Efficiency:
Throughput / Latency / Power
49. To achieve acceleration, we can pipeline each iteration of the loop
Analyze any dependencies between iterations
Schedule these operations
Launch the next iteration as soon as possible
float array[M];
for (int i = 0; i < n * numSets; i++)
{
    for (int j = 0; j < M - 1; j++)
        array[j] = array[j + 1];
    array[M - 1] = a[i];
    for (int j = 0; j < M; j++)
        answer[i] += array[j] * coefs[j];
}
At this point, we can launch the next iteration
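The slide's loop fragment can be filled out into a self-contained C sketch of the shift-register FIR filter it describes (the tap count M, the simplified trip count n, and zero-initializing each answer[i] are assumptions added here to make it runnable):

```c
#define M 4  /* number of filter taps — illustrative choice */

/* Shift-register FIR filter, as on the slide: each outer
 * iteration shifts in one new sample and computes a dot
 * product of the sample window with the coefficients. */
static void fir(const float *a, float *answer, int n, const float *coefs) {
    float array[M] = {0};
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < M - 1; j++)  /* shift register */
            array[j] = array[j + 1];
        array[M - 1] = a[i];             /* new sample in  */
        answer[i] = 0;
        for (int j = 0; j < M; j++)      /* dot product    */
            answer[i] += array[j] * coefs[j];
    }
}
```

The inner loops have no dependency on the next sample, so a pipelining compiler can launch iteration i+1 as soon as iteration i's shift completes.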
50. [Figure: without loop pipelining, iterations i0, i1, i2 run one after another; with loop pipelining, iterations i0–i4 overlap in time]
Looks almost like parallel thread execution!
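A simplified cycle-count model (an assumption added here, not from the slides) makes the benefit concrete: if one iteration takes `depth` cycles and the pipeline launches a new iteration every `ii` cycles (the initiation interval), then n iterations finish in depth + (n-1)*ii cycles instead of n*depth:

```c
/* Simplified cycle-count model of loop pipelining,
 * assuming no stalls or inter-iteration dependencies. */
static long cycles_sequential(long n, long depth) {
    return n * depth;             /* iterations run back-to-back      */
}
static long cycles_pipelined(long n, long depth, long ii) {
    return depth + (n - 1) * ii;  /* a new iteration launches each ii */
}
```

With depth = 10 and ii = 1, 100 iterations take 109 cycles instead of 1000 — close to one result per cycle, which is why it looks almost like parallel thread execution.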