tau 2015 spyrou fpga timing

Challenges in the Static Timing Analysis of FPGAs
Tom Spyrou
TAU 2015
3/2015

Programmability: Where do FPGAs fit?
[Figure: device spectrum, in order: Intel CPU, TI DSP, MultiCore, ManyCore, GPU, FPGA, ASSP, ASIC. One axis is flexibility and programming abstraction; the other is performance, area and power efficiency.]
CPU:
• Market-agnostic
• Accessible to many programmers (C++)
• Flexible, portable
ASIC:
• Market-specific
• Fewer programmers
• Rigid, less programmable
• Hard to build (physical)
FPGA:
• Somewhat restricted market
• Harder to program (Verilog)
• More efficient than SW
• More expensive than ASIC

FPGA End Markets
Communications
• Wireless: cellular, basestations, wireless LAN
• Networking: switches, routers
• Wireline: optical, metro, access
Digital Consumer
• Entertainment: broadband, audio/video, video display
• Broadcast: studio, satellite, broadcasting
Computer and Storage
• Computer: servers, mainframe
• Storage: RAID, SAN
• Office Automation: copiers, printers, MFP
Industrial
• Instrumentation: medical, test equipment, manufacturing
• Security/Energy Mgmt.: card readers, control systems, ATM
• Auto: navigation, entertainment
• Military: secure comm., radar, guidance and control

FPGA User Programming Model
• User writes Verilog (or VHDL, or schematic)
• Quartus compiles the Verilog to a bitstream
  • Synthesis: Verilog -> gates
  • Tech-mapping: gates -> device-specific LUTs & FFs
  • Clustering: LUTs + FFs -> LAB clusters
  • Placement: LABs -> placed LABs with an (x,y) position
  • Routing: abstract connections -> exact routing
  • STA: timing evaluated vs. constraints
  • Assembly: routing converted to bitstream
• Programming: bitstream downloaded onto FPGA
(More on this in the Software Flow Section)

FPGA CAD
Map to LABs, not standard cells
Routing is setting mux select line bits
// Begin: Write Control
always @ (posedge wrbusy_int)
begin
    write0 <= 1'b1;
    write1 <= 1'b0;
    writex <= 1'b0;
end
always @ (negedge wrbusy_int)
begin
    write0 <= 1'b0;
end
always @ (posedge write0_done)
begin
    write1 <= 1'b1;
end
[Figure: Quartus II flow around a central Quartus II Database (device features and timing information). Boxes: Synthesis (3rd-party or Altera), Merge, Placement & Routing, Timing Analysis, Power, Assembler, Programmer, Simulator (3rd-party or Altera), EDA.]

What is FPGA Fabric – Logic Array Block
[Figure: FPGA fabric. Labels: input muxing, logic cell, optional DFF, output muxing; LAB: x20; hard blocks; routing fabric.]
Bottom line: Quartus generates a configuration bitstream that sets the logic functions and routing steering to instantiate one hardware design in the device.

Secondary Signals (CE, SLOAD, …)

FPGA Interconnect Model
[Figure: interconnect between LAB (4,6) and LAB (12,9). Labels: wires H3 and V4; muxes HDIM, VDIM, LIM and LEIM; inputs A, B, C, D. CRAM programming values shown: 0xab0f, 0x81, 0xf0, 0x14, 0x44, 0x24, 0.]
• Wires are point-to-point
• Individual bits, not groups or word-wise
• Statically programmed by SW to establish the necessary connection
• No bus, protocol, etc. routing (unless built on top)

Unique Challenges in STA of FPGAs
• Fixed device with programmable LUTs, routing and various IP
• I would like to break down the challenges into categories
  • Verification of the un-programmed device
    • Many possible modes due to programmability
    • Delay calculation of non-CMOS structures like pass-gate muxes
  • Verification of a user’s compiled design
    • CRPR analysis can be very expensive
    • Large clock latency and skew, tree used versus mesh
    • Long combinational paths with lots of re-convergent logic
    • Slow logic that is still much faster than software on a CPU
    • Incremental moves affect function, not just delay of instances
      • CRAM configuration constant changes
      • Mode changes

Unique Challenges in STA of FPGAs
• Periphery and core have different challenges
  • Programmable core logic implementing functions via look-up tables
  • Peripheral IP blocks performing programmable but less flexible tasks
    • SerDes, DSP, RAM, ARM core, etc.
  • Periphery blocks often implemented with ASIC-style flows
  • Core is full custom with pass gates
    • Delay modelling and parasitic reduction are challenges
  • Both have challenges due to configurability
• I cannot cover all the challenges and will focus on 3
  • LUT modelling
  • Mode explosion: flat implementation with hierarchical modelling
  • Modelling pass-gate-based multiplexors

LUT Overview
• For the purposes of this tutorial, let’s assume we have a 3-LUT, i.e. 3 inputs on the select lines to select one of 8 bits driven by the CRAM.
• This 3-LUT can be used to implement any logic function of 3 bits by assigning appropriate values to the CRAM.
• We call the 8-bit value b[7:0] the LUTMASK.
[Figure: 3-LUT mux tree with select inputs A, B, C, CRAM bits b0 to b7, and output Y.]
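The 3-LUT described above can be sketched as a simple bit lookup. This is only an illustration: the function name `lut3_eval` is hypothetical, and the select ordering (A as the least significant select, so that A chooses between b0 and b1) is an assumption inferred from the first-level mux description later in the deck.

```python
def lut3_eval(lutmask: int, a: int, b: int, c: int) -> int:
    """Return Y for a 3-LUT: inputs A, B, C select one of the 8 CRAM bits."""
    # Assumed ordering: A is the least significant select bit, C the most.
    index = a | (b << 1) | (c << 2)
    return (lutmask >> index) & 1


# LUTMASK b[7:0] = 10001000 sets b3 and b7, which is Y = A & B.
AND_AB = 0b10001000
truth_table = [
    (a, b, c, lut3_eval(AND_AB, a, b, c))
    for c in (0, 1) for b in (0, 1) for a in (0, 1)
]
for row in truth_table:
    print(row)
```

Any other 3-input function is obtained by changing only the `lutmask` constant, which is what the CRAM programming does in the device.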

Timing Arcs Dependency on LUTMASK
• The existence and delays of the timing arcs from A=>Y, B=>Y, and C=>Y are dependent upon the LUTMASK.
• For example, if bits are all 0s, then Y = 0 and there are no arcs from any of the inputs to the output. This is a degenerate case.
[Figure: 3-LUT with select inputs A, B, C, CRAM bits b0 to b7, and output Y.]

• The existence and delays of the timing arcs from A=>Y, B=>Y, and C=>Y are dependent upon the LUTMASK.
• For example, if bits are 10001000 (as shown in the diagram), there is no arc for C=>Y. [This LUTMASK implements the logic function Y=A&B.]
• Unateness is a function of LUTMASK
  • This configuration should have positive-unate arcs
  • Ignoring unateness will hurt fmax, but is not necessarily critical for early Quartus development
[Figure: 3-LUT with CRAM bits b0 to b7 = 0, 0, 0, 1, 0, 0, 0, 1.]

• The existence and delays of the timing arcs from A=>Y, B=>Y, and C=>Y are dependent upon the LUTMASK.
• Another example: if bits are 10101010 (as shown in the diagram), there is no arc for B=>Y or C=>Y. [This LUTMASK implements the logic function Y=A.]
[Figure: 3-LUT with CRAM bits b0 to b7 = 0, 1, 0, 1, 0, 1, 0, 1.]
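The three LUTMASK examples (all zeros, 10001000, 10101010) can be checked with a short sketch that tests whether toggling one input can ever change Y. The helper names `has_arc` and `arcs` are hypothetical, and the bit ordering (A selects between b0 and b1) is an assumption consistent with the examples.

```python
def has_arc(lutmask: int, pin: int, n_inputs: int = 3) -> bool:
    """True if toggling input `pin` changes Y for some value of the other inputs."""
    for index in range(1 << n_inputs):
        if index & (1 << pin):
            continue  # visit each pair of entries once, with `pin` = 0
        low = (lutmask >> index) & 1
        high = (lutmask >> (index | (1 << pin))) & 1
        if low != high:
            return True
    return False


def arcs(lutmask: int) -> set:
    """Names of the inputs (A = pin 0, B = pin 1, C = pin 2) that have an arc to Y."""
    return {name for pin, name in enumerate("ABC") if has_arc(lutmask, pin)}


for mask in (0b00000000, 0b10001000, 0b10101010):
    print(f"{mask:08b}", sorted(arcs(mask)))
```

The sketch only answers whether an arc exists; the delay of each arc would still come from the device timing data.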

Enumerating Timing Arc Dependencies
• One method to identify all the arcs as a function of the LUTMASK is to enumerate all 256 LUTMASK possibilities along with the arc dependencies.
• This becomes infeasible with a 6-LUT, where there are 64 bits driven by CRAM, resulting in 2^64 enumerations.
• An alternative method is to notice the pattern of dependencies.
[Figure: 3-LUT with CRAM bits b0 to b7 = 0, 1, 0, 1, 0, 1, 0, 1.]
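A quick sketch of why brute-force enumeration stops scaling: a k-input LUT has 2^k CRAM bits and therefore 2^(2^k) possible LUTMASKs. The function names are hypothetical; the 3-LUT enumeration counts, as one example, the masks that have no C=>Y arc.

```python
def lutmask_count(k: int) -> int:
    """Number of distinct LUTMASKs for a k-input LUT (2**k CRAM bits)."""
    return 2 ** (2 ** k)


def depends_on(lutmask: int, pin: int, k: int) -> bool:
    """True if the LUT output depends on input `pin` for this LUTMASK."""
    return any(
        ((lutmask >> i) & 1) != ((lutmask >> (i | (1 << pin))) & 1)
        for i in range(1 << k)
        if not i & (1 << pin)
    )


for k in (3, 4, 5, 6):
    print(k, lutmask_count(k))

# Enumerating a 3-LUT is cheap: only 256 masks to visit.
no_c_arc = sum(1 for m in range(lutmask_count(3)) if not depends_on(m, 2, 3))
print("3-LUT masks with no C=>Y arc:", no_c_arc)
```

The same loop over a 6-LUT would need 2^64 iterations, which is why a closed-form rule on the mask bits is preferable.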

Enumerating Timing Arc Dependencies
• A positive-unate arc for A=>Y exists if, in any of the first-level muxes, the first bit is a 0 and the second bit of the same mux is a 1.
  • Formally, it may be written as: (!b0 && b1) || (!b2 && b3) || (!b4 && b5) || (!b6 && b7)
• A negative-unate arc for A=>Y exists if, in any of the first-level muxes, the first bit is a 1 and the second bit of the same mux is a 0.
  • Formally, it may be written as: (b0 && !b1) || (b2 && !b3) || (b4 && !b5) || (b6 && !b7)
[Figure: 3-LUT with CRAM bits b0 to b7 = 0, 1, 0, 1, 0, 1, 0, 1.]
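The two formulas for A=>Y can be transcribed directly and compared against a definition-based check over all 256 LUTMASKs. This is a minimal sketch; the function names are hypothetical, and extending the rule to B and C by pairing entries that differ only in that input is an assumption about how the pattern generalizes.

```python
def bits(lutmask: int) -> list:
    """CRAM bits b0..b7 of a 3-LUT as a list."""
    return [(lutmask >> i) & 1 for i in range(8)]


def a_positive_unate(lutmask: int) -> bool:
    b = bits(lutmask)
    return bool((not b[0] and b[1]) or (not b[2] and b[3])
                or (not b[4] and b[5]) or (not b[6] and b[7]))


def a_negative_unate(lutmask: int) -> bool:
    b = bits(lutmask)
    return bool((b[0] and not b[1]) or (b[2] and not b[3])
                or (b[4] and not b[5]) or (b[6] and not b[7]))


def unateness_by_definition(lutmask: int, pin: int) -> tuple:
    """(positive arc exists, negative arc exists) for input `pin` (A=0, B=1, C=2)."""
    pos = neg = False
    for index in range(8):
        if index & (1 << pin):
            continue
        low = (lutmask >> index) & 1
        high = (lutmask >> (index | (1 << pin))) & 1
        pos = pos or (low, high) == (0, 1)  # rising input gives rising output
        neg = neg or (low, high) == (1, 0)  # rising input gives falling output
    return pos, neg


formulas_agree = all(
    (a_positive_unate(m), a_negative_unate(m)) == unateness_by_definition(m, 0)
    for m in range(256)
)
print("formulas match definition for all 256 masks:", formulas_agree)
```

A mask for which both checks are true (for example a 3-input XOR) has a non-unate arc, and a mask for which both are false has no arc from that input.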

LUT timing is an instance of case analysis
• In ASIC-style STA, case analysis can be slow
  • Happens once and is not revisited during incremental timing
  • Symbolic simulation has acceptable runtime
• In FPGA timing, especially incremental timing, the evaluation has to be done on every netlist modification that affects logic

Modes can explode for complex blocks
• Imagine a large block with many modes
  • Mode-dependent timing is used to gain accuracy in STA
• This block is used by a parent block with many cell modes, continuing up multiple levels
• The number of possible modes can explode, especially if automatic tools such as PrimeTime’s extract_model command are used to enumerate them
• Design teams want to do physical design at the highest possible level
• Timing modelling, which needs to avoid an explosion of modes, wants to build models at a lower level
• It is not uncommon to have a complex block like a DSP with 10K modes
• We have no problem building these models, but they can be slow in STA even when the STA has been tuned to handle many more modes than in ASIC flows.
• PrimeTime and other commercial tools simultaneously build the graph for, and delay-calculate, all modes. No commercial tool can load and link Altera’s full chip.
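The multiplication behind the mode explosion can be sketched in a few lines. The mode counts below are made-up illustrative numbers, not data from the deck, and the sketch assumes the worst case in which every child mode can combine with every other.

```python
from math import prod


def combined_modes(child_mode_counts: list) -> int:
    """Worst-case mode count of a parent if all child modes combine freely."""
    return prod(child_mode_counts)


# Hypothetical block built from four sub-blocks with 10, 8, 5 and 25 modes.
block_modes = combined_modes([10, 8, 5, 25])
print("modes at block level:", block_modes)

# One more level of hierarchy: a parent instantiating three such blocks.
parent_modes = combined_modes([block_modes] * 3)
print("modes one level up:", parent_modes)
```

Building the timing model one level lower keeps each model near the small per-block counts instead of the product.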

Two possible approaches
• Goal is to build models one level lower in the Verilog hierarchy and provide a netlist of models to Quartus and PrimeTime
• Perform place and route hierarchically
  • More work for design teams
  • Less optimal results
• Use ICC’s hierarchical Verilog + flat SPEF to build a timing model below the top level

Hierarchical Place and Route / Extraction
• Perform place and route hierarchically
• Pros
  • SPEF is divided naturally by hierarchical P&R and extraction
  • Manual floorplan of top level may improve QoR over automatic P&R
  • Run time of P&R for lower-level blocks will be dramatically faster, allowing more time for manual inspection and improvement of results
• Cons
  • Design engineer must manually floorplan the top level
  • Multiple runs of P&R and extraction to manage
  • Possible QoR degradation if floorplan is poorly done

Model extraction one level lower
• Use ICC’s hierarchical Verilog + an extracted flat SPEF to build a timing model below the top level
• Pros
  • No change to construction flow
• Cons
  • Some loss of accuracy on boundary RC delay calculation
    • RC tree of boundary nets turned into lumped R and lumped C
    • On the order of 5% of the delay of the final gate in the path to the output/input of the model
• Approach
  • Read hierarchical Verilog in PT + flat SPEF + SDC for top level
  • write_parasitics -format spef -nets [get_nets -hier sub_instance_name/*] for sub block
  • write_parasitics -format spef -nets [get_nets *] for top
  • characterize_context -environment -timing sub_instance_name
  • Avoid boundary nets in context with -no_boundary_annotations, or have no boundary nets in SPEF
  • Post-process SPEF to remove the prepended sub_instance_name from all names in the name map
  • Restart pt_shell with current_design as sub_module
  • Load SPEF and environment context
  • extract_model
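The SPEF post-processing step in the approach above could look roughly like the sketch below. It is an assumption-laden illustration, not the script used in the flow: the function name `strip_instance_prefix`, the instance name `u_sub` and the sample text are hypothetical, names are assumed to be unescaped, and the hierarchy divider is assumed to be `/`.

```python
import re

# A name map entry looks like "*<index> <name>".
NAME_MAP_ENTRY = re.compile(r"^(\*\d+)\s+(\S+)$")


def strip_instance_prefix(spef_text: str, instance: str, divider: str = "/") -> str:
    """Remove a leading '<instance><divider>' from every name in *NAME_MAP."""
    prefix = instance + divider
    in_name_map = False
    out = []
    for line in spef_text.splitlines():
        stripped = line.strip()
        entry = NAME_MAP_ENTRY.match(stripped) if in_name_map else None
        if stripped == "*NAME_MAP":
            in_name_map = True
        elif entry:
            index, name = entry.groups()
            if name.startswith(prefix):
                line = f"{index} {name[len(prefix):]}"
        elif stripped.startswith("*"):
            in_name_map = False  # any other section keyword ends the name map
        out.append(line)
    return "\n".join(out)


# Hypothetical fragment of a sub-block SPEF written from the top-level session.
sample = "*NAME_MAP\n*1 u_sub/n1\n*2 u_sub/u_and/Z\n*3 top_net\n*PORTS\n*1 I\n"
result = strip_instance_prefix(sample, "u_sub").splitlines()
print("\n".join(result))
```

Only the name map is rewritten because the rest of the file refers to nets and pins through the mapped indices.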

FIHM – Model Validation Flow
• Model comparison between:
  • Flat model (golden)
  • Hierarchical model (consuming molecule timing Liberty model)
• Parasitics for hierarchical model validation generated by hacking the flat SPEF file:
  • Rename standard cells’ leaf pins to the molecule’s boundary pins.
  • Zero R and C for nets connected within a molecule block.
  • Parasitics only extracted from flat SPEF for the nets connected to top-level elements or output ports.

Correlation Results
• Testcase: mm_core_digital
• Total timing paths = 1444.
• 60 timing paths are pessimistic by more than 20 ps compared to the flat model (4% of distribution)
• 14 timing paths are optimistic by more than 20 ps compared to the flat model (1% of distribution)
• 95% of total paths agreed within ±20 ps
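The percentages follow from the path counts, as this small check shows. It assumes the pessimistic and optimistic outliers are disjoint sets and that every remaining path is within ±20 ps; the variable names are illustrative.

```python
total_paths = 1444
pessimistic = 60   # more than 20 ps pessimistic vs. the flat model
optimistic = 14    # more than 20 ps optimistic vs. the flat model
within_20ps = total_paths - pessimistic - optimistic


def percent(count: int) -> float:
    """Share of all timing paths, in percent."""
    return 100.0 * count / total_paths


print(f"pessimistic: {percent(pessimistic):.1f}%")
print(f"optimistic:  {percent(optimistic):.1f}%")
print(f"within 20ps: {percent(within_20ps):.1f}%")
```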

N-MOS gate multi-stage Multiplexors
• Multiplexors are pervasive in an FPGA
  • They are designed using NMOS pass gates to save area
• This causes a timing model challenge
  • The input pin capacitance changes with each select line configuration
  • Think of the mux as a set of switches
  • The output load is seen on the input
• Usual use of Liberty assumes a fixed input capacitance or fixed receiver model
• Quartus compiler uses fast SPICE, but we want a model for PrimeTime as well

N-MOS gate multi-stage Multiplexors
• Select line enabled by CRAM
• 2-stage one-hot mux
• Input cap varies depending on the path taken and on whether other side loads’ select lines are on or off
• Each possible path through the multi-stage mux requires its own pin cap
  • Arc-specific receiver model
  • This is part of the CCS noise model
• It would be nice if there were a more natural way to support arc- and mode-specific pin caps in Liberty
[Figure: two-stage NMOS pass-gate mux; side loads labelled "Other NMOS inputs".]
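To make the select-dependent pin capacitance concrete, here is a toy lumped model that follows the "set of switches" picture. Everything in it is an assumption for illustration: the class and function names, the three capacitance values (in attofarads), and the simplification that a conducting pass gate simply adds the downstream capacitance. Real values would come from SPICE characterization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PassGateCaps:
    """Hypothetical lumped capacitances in attofarads (aF)."""
    c_input: int = 200    # diffusion cap seen when the stage-1 switch is off
    c_node: int = 500     # internal node between stage 1 and stage 2
    c_load: int = 1500    # load on the mux output


def input_pin_cap(stage1_on: bool, stage2_on: bool,
                  caps: PassGateCaps = PassGateCaps()) -> int:
    """Capacitance seen at one mux input for a given select configuration."""
    cap = caps.c_input
    if stage1_on:
        cap += caps.c_node          # closed switch exposes the internal node
        if stage2_on:
            cap += caps.c_load      # second closed switch exposes the output load
    return cap


# One pin cap per select configuration, i.e. per path through the mux.
pin_cap_table = {
    (s1, s2): input_pin_cap(s1, s2)
    for s1 in (False, True) for s2 in (False, True)
}
for config, cap in pin_cap_table.items():
    print(config, cap, "aF")
```

A single fixed Liberty pin capacitance can represent only one row of this table, which is the modelling gap the slide points out.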

Incentive for EDA Companies to help
• As each process generation becomes more complex, the number of unique chip starts decreases.
  • Already down from 12K to fewer than 3K per year
• Each chip that is designed will be increasingly hyper-optimized.
  • Custom tricks that need to be modeled at the gate level
• FPGA use is increasing as its ability to run 1 GHz designs at reasonable power approaches
• FPGA compilers will not be able to model every effect
  • ICC needs PrimeTime
  • Encounter needs ETS
• Eventually FPGA compilers may need to output their programmed CRAM bits as constants and do a Super-Signoff in commercial STA tools
• This could be a good possibility for market growth
