2. Programmability: Where do FPGAs fit?
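[Figure: programmability spectrum, in order: Intel CPU, TI DSP, MultiCore, ManyCore, GPU, FPGA, ASSP, ASIC. Axes: Flexibility, Programming Abstraction (toward the CPU end) versus Performance, Area and Power Efficiency (toward the ASIC end).]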
CPU:
• Market-agnostic
• Accessible to many programmers (C++)
• Flexible, portable
ASIC:
• Market-specific
• Fewer programmers
• Rigid, less programmable
• Hard to build (physical)
FPGA:
• Somewhat Restricted Market
• Harder to Program (Verilog)
• More efficient than SW
• More expensive than ASIC
3. FPGA End Markets
Digital Consumer
    Entertainment: Broadband, Audio/video, Video display
    Broadcast: Studio, Satellite, Broadcasting
Communications
    Wireless: Cellular, Basestations, Wireless LAN
    Networking: Switches, Routers
    Wireline: Optical, Metro, Access
Computer and Storage
    Computer: Servers, Mainframe
    Storage: RAID, SAN
    Office Automation: Copiers, Printers, MFP
Industrial
    Instrumentation: Medical, Test equipment, Manufacturing
    Security/Energy Mgmt.: Card readers, Control systems, ATM
    Auto: Navigation, Entertainment
    Military: Secure comm., Radar, Guidance and control
4. FPGA User Programming Model
User writes Verilog (or VHDL, or schematic)
Quartus compiles the Verilog to a bitstream
Synthesis: Verilog -> Gates
Tech-Mapping: Gates -> Device-specific LUTs & FF
Clustering: LUTs+FF -> LAB clusters
Placement: LABs -> placed LABs with an (x,y) position
Routing: Abstract connections -> exact routing
STA: Timing evaluated vs. constraints
Assembly: Routing converted to bitstream
Programming: Bitstream downloaded onto FPGA
(More on this in the Software Flow Section)
5. FPGA CAD
Map to LABs, not standard cells
Routing is setting mux select line bits
// Begin: Write Control
always @ (posedge wrbusy_int)
begin
    write0 <= 1'b1;
    write1 <= 1'b0;
    writex <= 1'b0;
end
always @ (negedge wrbusy_int)
begin
    write0 <= 1'b0;
end
always @ (posedge write0_done)
begin
    write1 <= 1'b1;
end
[Figure: Quartus II tool flow built around the Quartus II Database (device features and timing information). Blocks shown: Synthesis, Merge, Placement & Routing, Timing Analysis, Power, Assembler, Programmer, Simulator (third-party or Altera), EDA Synthesis (third-party or Altera).]
6. What is FPGA Fabric – Logic Array Block
[Figure: Logic Array Block. Labels: Input Muxing, Logic Cell, Optional DFF, Output Muxing]
Bottom line: Quartus generates a configuration bitstream that sets the logic functions and the routing steering, instantiating one hardware design in the device.
[Figure labels: LAB (x20), Hard-Blocks, Routing Fabric]
9. Unique Challenges in STA of FPGAs
Fixed device with programmable LUTs, routing and various IP
I would like to break down the challenges into categories:
Verification of the un-programmed device
    Many possible modes due to programmability
    Delay calculation of non-CMOS structures such as pass-gate muxes
Verification of a user’s compiled design
    CRPR analysis can be very expensive
    Large clock latency and skew; a tree is used rather than a mesh
    Long combinational paths with lots of re-convergent logic
    Slow logic that is still much faster than software on a CPU
    Incremental moves affect function, not just the delay of instances
        CRAM configuration constant changes
        Mode changes
10. Unique Challenges in STA of FPGAs
Periphery and Core have different challenges
Programmable core logic implementing functions via look up tables
Peripheral IP blocks performing programmable but less flexible tasks
SerDes, DSP, RAM, ARM core, etc.
Periphery blocks often implemented with ASIC style flows
Core is full custom with pass gates
Delay modelling and parasitic reduction are challenges
Both have challenges due to configurability
I cannot cover all the challenges and will focus on three:
LUT modelling
Mode explosion: flat implementation with hierarchical modelling
Modelling pass-gate-based multiplexors
11. LUT Overview
For the purposes of this tutorial, let’s assume we have a 3-LUT, i.e. 3 inputs on the select lines to select one of 8 bits driven by the CRAM.
This 3-LUT can be used to model any logic function of 3 bits by assigning appropriate values to the CRAM.
We call the 8-bit value b[7:0] the LUTMASK.
[Figure: 3-LUT drawn as a mux tree. Inputs A, B, C drive the select lines; CRAM bits b0 to b7 drive the data inputs; output Y.]
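A minimal sketch of the 3-LUT described above, written in Python for illustration. It treats the LUTMASK as an 8-bit integer and uses the inputs as a mux index. The function name `lut3` is hypothetical, and the bit ordering (A as the least-significant select, so index = A + 2B + 4C) is an assumption inferred from the LUTMASK examples elsewhere in this deck, not something the slide states.

```python
def lut3(lutmask: int, a: int, b: int, c: int) -> int:
    """Return output Y of a 3-LUT by selecting bit b[index] of the LUTMASK."""
    index = a | (b << 1) | (c << 2)  # assumed ordering: A is the LSB select
    return (lutmask >> index) & 1

# LUTMASK b[7:0] = 10001000 implements Y = A & B (C is ignored).
AND_AB = 0b10001000

# Truth table in index order b0..b7 (A toggles fastest).
table = [lut3(AND_AB, a, b, c) for c in (0, 1) for b in (0, 1) for a in (0, 1)]
print(table)
```

Reading the table in index order returns the CRAM contents b0 to b7, which is why the same 8 bits fully define the logic function.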
12. Timing Arcs Dependency on LUTMASK
The existence and delays of the timing arcs from A=>Y, B=>Y, and C=>Y are dependent upon the LUTMASK.
For example, if bits are all 0s, then Y = 0 and there are no arcs from any of the inputs to the output. This is a degenerate case.
[Figure: 3-LUT with inputs A, B, C, CRAM bits b0 to b7, and output Y]
13. Timing Arcs Dependency on LUTMASK
The existence and delays of the timing arcs from A=>Y, B=>Y, and C=>Y are dependent upon the LUTMASK.
For example, if bits are 10001000 (as shown in the diagram), there is no arc for C=>Y. [This LUTMASK implements the logic function Y=A&B.]
Unateness is a function of LUTMASK
This configuration should have positive-unate arcs
Ignoring unateness will hurt fmax, but is not necessarily critical for early Quartus development
[Figure: 3-LUT with inputs A, B, C and output Y; CRAM bits b0 to b7 = 0, 0, 0, 1, 0, 0, 0, 1]
14. Timing Arcs Dependency on LUTMASK
The existence and delays of the timing arcs from A=>Y, B=>Y, and C=>Y are dependent upon the LUTMASK.
Another example: if bits are 10101010 (as shown in the diagram), there is no arc for B=>Y or C=>Y. [This LUTMASK implements the logic function Y=A.]
[Figure: 3-LUT with inputs A, B, C and output Y; CRAM bits b0 to b7 = 0, 1, 0, 1, 0, 1, 0, 1]
15. Enumerating Timing Arc Dependencies
One method to identify all the arcs as a function of the LUTMASK is to enumerate all 256 LUTMASK possibilities along with the arc dependencies.
This becomes infeasible with a 6-LUT, where there are 64 bits driven by CRAM, resulting in 2^64 enumerations.
An alternative method is to notice the pattern of dependencies.
[Figure: 3-LUT with inputs A, B, C and output Y; CRAM bits b0 to b7 = 0, 1, 0, 1, 0, 1, 0, 1]
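The brute-force enumeration mentioned above can be sketched in a few lines of Python. This is an illustrative sketch, not the method used in Quartus: the helper name `depends_on` and the bit ordering (A is select bit 0, B is bit 1, C is bit 2) are assumptions. An arc from an input to Y is taken to exist when toggling that input changes Y for at least one setting of the other inputs.

```python
from collections import Counter

def depends_on(lutmask: int, pin: int, k: int = 3) -> bool:
    """True if toggling input `pin` changes Y for some setting of the other inputs."""
    for index in range(1 << k):
        if (index >> pin) & 1:
            continue  # visit each pair once, from the side where the pin is 0
        lo = (lutmask >> index) & 1
        hi = (lutmask >> (index | (1 << pin))) & 1
        if lo != hi:
            return True
    return False

# Full table for a 3-LUT: LUTMASK -> (A=>Y exists, B=>Y exists, C=>Y exists)
arc_table = {m: tuple(depends_on(m, p) for p in range(3)) for m in range(256)}

# How many LUTMASKs have 0, 1, 2 or 3 arcs
histogram = Counter(sum(arcs) for arcs in arc_table.values())

# The same table for a 6-LUT would need one entry per 64-bit LUTMASK.
masks_for_6lut = 2 ** (2 ** 6)
print(dict(histogram), masks_for_6lut)
```

The 256-entry table is trivial to build, while the 6-LUT count makes clear why a closed-form pattern is preferred over a lookup.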
16. Enumerating Timing Arc Dependencies
A positive-unate arc for A=>Y will exist if, for any of the first-level muxes, the first bit is a 0 and the second bit of the same mux is a 1.
Formally, it may be written as: (!b0 && b1) || (!b2 && b3) || (!b4 && b5) || (!b6 && b7)
A negative-unate arc for A=>Y will exist if, for any of the first-level muxes, the first bit is a 1 and the second bit of the same mux is a 0.
Formally, it may be written as: (b0 && !b1) || (b2 && !b3) || (b4 && !b5) || (b6 && !b7)
[Figure: 3-LUT with inputs A, B, C and output Y; CRAM bits b0 to b7 = 0, 1, 0, 1, 0, 1, 0, 1]
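The two expressions above translate directly into code. The Python sketch below implements them for A as written on the slide, and then generalises to B and C by pairing bits whose indices differ only in that input's select bit. That generalisation, the function names, and the assumption that A, B, C are select bits 0, 1, 2 are illustrative and not taken from the slides.

```python
def a_unate_arcs(lutmask: int) -> tuple:
    """(positive, negative) unate-arc existence for A=>Y, as written on the slide."""
    b = [(lutmask >> i) & 1 for i in range(8)]
    positive = bool((not b[0] and b[1]) or (not b[2] and b[3]) or
                    (not b[4] and b[5]) or (not b[6] and b[7]))
    negative = bool((b[0] and not b[1]) or (b[2] and not b[3]) or
                    (b[4] and not b[5]) or (b[6] and not b[7]))
    return positive, negative

def unate_arcs(lutmask: int, pin: int) -> tuple:
    """Assumed generalisation: pin 0 = A, 1 = B, 2 = C."""
    b = [(lutmask >> i) & 1 for i in range(8)]
    # Pairs of CRAM bits selected when only `pin` changes from 0 to 1.
    pairs = [(i, i | (1 << pin)) for i in range(8) if not (i >> pin) & 1]
    positive = any(b[lo] == 0 and b[hi] == 1 for lo, hi in pairs)
    negative = any(b[lo] == 1 and b[hi] == 0 for lo, hi in pairs)
    return positive, negative

for name, pin in (("A", 0), ("B", 1), ("C", 2)):
    print(name, unate_arcs(0b10001000, pin))
```

When both flags are true for a pin, the arc is non-unate (for example, an XOR), and when both are false there is no arc from that pin.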
17. LUT timing is an instance of case analysis
In ASIC-style STA, case analysis can be slow
Happens once and is not revisited during incremental timing
Symbolic simulation has acceptable runtime
In FPGA timing, especially incremental timing, the evaluation has to be done on every netlist modification that affects logic
18. Modes can explode for complex blocks
Imagine a large block with many modes
Mode-dependent timing is used to gain accuracy in STA
This block is used by a parent block with many cell modes, continuing up multiple levels
The number of possible modes can explode, especially if automatic tools such as PrimeTime’s extract_model command are used to enumerate them
Design teams want to do physical design at the highest possible level
Timing modelling, which needs to avoid an explosion of modes, wants to build models at a lower level
It is not uncommon to have a complex block like a DSP with 10K modes
We have no problem building these models, but they can be slow in STA even when the STA has been tuned to handle many more modes than in ASIC flows.
PrimeTime and other commercial tools simultaneously build the graph for, and delay-calculate, all modes. No commercial tool can load and link Altera’s full chip.
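To give a feel for how quickly the mode count grows, here is a small Python sketch of the worst case, assuming the modes at each hierarchy level combine independently so that the counts multiply. The per-level numbers other than the 10K DSP figure are hypothetical, and real blocks will usually have fewer legal combinations.

```python
from math import prod

def combined_modes(modes_per_level):
    """Worst-case mode count if every level's modes combine independently."""
    return prod(modes_per_level)

# 10K modes for the leaf block (from the slide); parent counts are hypothetical.
levels = [10_000, 8, 4]
running = [combined_modes(levels[: i + 1]) for i in range(len(levels))]
print(running)
```

Building the timing model one level lower keeps the enumeration at the leaf figure instead of the product.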
19. Two possible approaches
Goal is to build models one level lower in the Verilog hierarchy and provide a netlist of models to Quartus and PrimeTime
Perform Place and Route hierarchically
    More work for design teams
    Less optimal results
Use ICC’s hierarchical Verilog + flat SPEF to build a timing model below the top level
20. Hierarchical Place and Route / Extraction
Perform Place and Route Hierarchically
Pros
SPEF is divided naturally by hierarchical P&R and extraction
Manual floorplan of top level may improve QoR over automatic P&R
Run time of P&R for lower-level blocks will be dramatically faster, allowing more time for manual inspection and improvement of results
Cons
Design engineer must manually floorplan the top level
Multiple runs of P&R and extraction to manage
Possible QoR degradation if floorplan is poorly done
21. Model extraction one level lower
Use ICC’s hierarchical Verilog + an extracted flat SPEF to build a timing model below the top level
Pros
No change to construction flow
Cons
Some loss of accuracy on boundary RC delay calculation
RC tree of boundary nets turned into lumped R and lumped C
On the order of 5% of the delay of the final gate in the path to the output/input of the model.
Approach
Read hierarchical Verilog in PT + flat SPEF + SDC for top level
write_parasitics -format spef -nets [get_nets -hier sub_instance_name/*] for sub block
write_parasitics -format spef -nets [get_nets *] for top
characterize_context -environment -timing sub_instance_name
Avoid boundary nets in context with -no_boundary_annotations or no boundary nets in SPEF
Post-process SPEF to remove prepended sub_instance_name from all names in map
Restart pt_shell with current_design as sub_module
Load SPEF and environment context
extract_model
22. FIHM – Model Validation Flow
Model comparison between:
Flat model (golden)
Hierarchical model (consuming molecule timing liberty model)
Parasitics for hierarchical model validation are generated by hacking the flat SPEF file:
Rename standard cells’ leaf pins to the molecule’s boundary pins.
Zero R and C for nets connected within a molecule block.
Parasitics are extracted from the flat SPEF only for the nets connected to top-level elements or output ports.
23. Correlation Results
Testcase: mm_core_digital
Total timing paths = 1444.
60 timing paths are pessimistic by more than 20 ps compared to the flat model (4% of distribution)
14 timing paths are optimistic by more than 20 ps compared to the flat model (1% of distribution)
95% of total paths agree within ±20 ps
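The percentages above follow from the path counts; a quick Python check of the arithmetic, using only the numbers reported on the slide (the variable names are illustrative):

```python
total_paths = 1444
pessimistic = 60   # more than 20 ps pessimistic vs. flat model
optimistic = 14    # more than 20 ps optimistic vs. flat model
within = total_paths - pessimistic - optimistic  # paths inside the ±20 ps band

def pct(count: int) -> float:
    """Share of all timing paths, in percent."""
    return 100.0 * count / total_paths

for label, count in (("pessimistic", pessimistic),
                     ("optimistic", optimistic),
                     ("within ±20 ps", within)):
    print(f"{label}: {count} paths, {pct(count):.1f}%")
```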
25. N-MOS gate multi-stage Multiplexors
Multiplexors are pervasive in an FPGA
They are designed using NMOS pass gates to save area
This causes a timing model challenge
The input pin capacitance changes with each select line configuration
Think of the mux as a set of switches
The output load is seen on the input
Usual use of Liberty assumes a fixed input capacitance or fixed receiver model
Quartus compiler uses fast SPICE, but we want a model for PrimeTime as well
26. N-MOS gate multi-stage Multiplexors
Select line enabled by CRAM
2-stage one-hot mux
Input cap varies depending on the path taken and on whether other side loads’ select lines are on or off
Each possible path through the multi-stage mux requires its own pin cap
Arc-specific receiver model
This is part of the CCS noise model
It would be nice if there were a more natural way to support arc- and mode-specific pin caps in Liberty
[Figure label: Other NMOS inputs]
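To illustrate why one fixed pin capacitance is not enough, the Python sketch below treats each pass gate as an ideal switch: an "on" switch exposes the capacitance behind it to the input, and an "off" switch hides it. This is a deliberately crude lumped model for intuition only, not the SPICE or CCS model referred to above, and all names and capacitance values (in fF) are hypothetical.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class MuxCaps:
    c_input_diffusion: float = 0.2  # fF, always seen at the input pin (hypothetical)
    c_internal_node: float = 0.8    # fF, node between stage 1 and stage 2 (hypothetical)
    c_output_load: float = 2.0      # fF, load on the mux output (hypothetical)

def input_pin_cap(caps: MuxCaps, stage1_on: bool, stage2_on: bool) -> float:
    """Capacitance seen at one mux input for a given select configuration."""
    cap = caps.c_input_diffusion
    if stage1_on:                   # first pass gate conducts: internal node is visible
        cap += caps.c_internal_node
        if stage2_on:               # second pass gate conducts: output load is visible
            cap += caps.c_output_load
    return cap

caps = MuxCaps()
for s1, s2 in product((False, True), repeat=2):
    print(f"stage1_on={s1}, stage2_on={s2}: {input_pin_cap(caps, s1, s2):.1f} fF")
```

Even in this simplified picture the same physical pin has three distinct capacitances, which is the reason a per-arc, per-mode pin cap is needed.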
27. Incentive for EDA Companies to help
As each process generation becomes more complex, the number of unique chip starts decreases.
Already down from 12K to fewer than 3K per year
Each chip that is designed will be increasingly hyper-optimized.
Custom tricks that need to be modeled at the gate level
FPGA use is increasing as FPGAs approach the ability to run 1 GHz designs at reasonable power
FPGA compilers will not be able to model every effect
ICC needs PrimeTime
Encounter needs ETS
Eventually FPGA compilers may need to output their programmed CRAM bits as constants and do a Super-Signoff in commercial STA tools
This could be a good possibility for market growth