Final Year IEEE Project 2013-2014 - VLSI Project Title and Abstract

Elysium Technologies Private Limited
Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad |
Pondicherry | Trivandrum | Salem | Erode | Tirunelveli
http://www.elysiumtechnologies.com, info@elysiumtechnologies.com
13 Years of Experience
Automated Services
24/7 Help Desk Support
Experienced & Expert Developers
Advanced Technologies & Tools
Legitimate Member of all Journals
Having 1,50,000 Successive records in
all Languages
More than 12 Branches in Tamilnadu,
Kerala & Karnataka.
Ticketing & Appointment Systems.
Individual Care for every Student.
Around 250 Developers & 20
Researchers

227-230 Church Road, Anna Nagar, Madurai – 625020.
0452-4390702, 4392702, + 91-9944793398.
info@elysiumtechnologies.com, elysiumtechnologies@gmail.com
S.P.Towers, No.81 Valluvar Kottam High Road, Nungambakkam,
Chennai - 600034. 044-42072702, +91-9600354638,
chennai@elysiumtechnologies.com
15, III Floor, SI Towers, Melapudur main Road, Trichy – 620001.
0431-4002234, + 91-9790464324.
trichy@elysiumtechnologies.com
577/4, DB Road, RS Puram, Opp to KFC, Coimbatore – 641002
0422- 4377758, +91-9677751577.
coimbatore@elysiumtechnologies.com

Plot No: 4, C Colony, P&T Extension, Perumal puram, Tirunelveli-
627007. 0462-2532104, +919677733255,
tirunelveli@elysiumtechnologies.com
1st Floor, A.R.IT Park, Rasi Color Scan Building, Ramanathapuram
- 623501. 04567-223225,
+919677704922. ramnad@elysiumtechnologies.com
74, 2nd floor, K.V.K Complex,Upstairs Krishna Sweets, Mettur
Road, Opp. Bus stand, Erode-638 011. 0424-4030055, +91-
9677748477 erode@elysiumtechnologies.com
No: 88, First Floor, S.V.Patel Salai, Pondicherry – 605 001. 0413–
4200640 +91-9677704822
pondy@elysiumtechnologies.com
TNHB A-Block, D.no.10, Opp: Hotel Ganesh Near Busstand. Salem
– 636007, 0427-4042220, +91-9894444716.
salem@elysiumtechnologies.com

ETPL
VLSI-001
Pragmatic Integration of an SRAM Row Cache in Heterogeneous 3-D DRAM
Architecture Using TSV
Abstract: As scaling DRAM cells becomes more challenging and energy-efficient DRAM chips are in
high demand, the DRAM industry has started to undertake an alternative approach to address these
looming issues, namely to vertically stack DRAM dies with through-silicon vias (TSVs) using 3-D-IC
technology. Furthermore, this emerging integration technology also makes heterogeneous die stacking in
one DRAM package possible. Such a heterogeneous DRAM chip provides a unique, promising
opportunity for computer architects to contemplate a new memory hierarchy for future system design. In
this paper, we study how to design such a heterogeneous DRAM chip for improving both performance
and energy efficiency. In particular, we found that, if we want to design an SRAM row cache in a DRAM
chip, simple stacking alone cannot address the majority of traditional SRAM row cache design issues. In
this paper, to address these issues, we propose a novel floorplan and several architectural techniques that
fully exploit the benefits of 3-D stacking technology. Our multi-core simulation results with memory-
intensive applications suggest that, by tightly integrating a small row cache with its corresponding DRAM
array, we can improve performance by 30% while saving dynamic energy by 31%.
ETPL
VLSI-002
A Low-Complexity Turbo Decoder Architecture for Energy-Efficient Wireless Sensor
Networks
Abstract: Turbo codes have recently been considered for energy-constrained wireless communication
applications, since they facilitate a low transmission energy consumption. However, in order to reduce the
overall energy consumption, lookup table-log-BCJR (LUT-Log-BCJR) architectures having a low
processing energy consumption are required. In this paper, we decompose the LUT-Log-BCJR
architecture into its most fundamental add compare select (ACS) operations and perform them using a
novel low-complexity ACS unit. We demonstrate that our architecture employs an order of magnitude
fewer gates than the most recent LUT-Log-BCJR architectures, facilitating a 71% energy consumption
reduction. Compared to state-of-the-art maximum logarithmic Bahl-Cocke-Jelinek-Raviv
implementations, our approach facilitates a 10% reduction in the overall energy consumption at ranges
above 58 m.
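As a rough illustration of the add compare select idea mentioned above, the sketch below shows the max* (Jacobian logarithm) operation on which Log-BCJR decoding is commonly built, together with a lookup-table approximation of its correction term. The table size, the step of 0.5, and the function names are assumptions chosen for illustration; this is not the ACS unit proposed in the paper.

```python
import math

def max_star_exact(a, b):
    """Jacobian logarithm: ln(e^a + e^b) = max(a, b) + ln(1 + e^-|a - b|)."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

# Hypothetical 8-entry lookup table for the correction term, sampled every 0.5.
LUT_STEP = 0.5
CORRECTION_LUT = [math.log1p(math.exp(-k * LUT_STEP)) for k in range(8)]

def max_star_lut(a, b):
    """Add, compare, select: pick the larger input, then add a tabulated correction."""
    diff = abs(a - b)                      # compare
    index = int(diff / LUT_STEP)           # quantize the difference
    correction = CORRECTION_LUT[index] if index < len(CORRECTION_LUT) else 0.0
    return max(a, b) + correction          # select and add
```

With this table the approximation stays within about 0.25 of the exact value; a finer step would trade table size for accuracy.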
ETPL
VLSI-003
Pipelined Radix-2^k Feedforward FFT Architectures
Abstract: The appearance of radix-2^2 was a milestone in the design of pipelined FFT hardware
architectures. Later, radix-2^2 was extended to radix-2^k. However, radix-2^k was only proposed for single-
path delay feedback (SDF) architectures, but not for feedforward ones, also called multi-path delay
commutator (MDC). This paper presents the radix-2^k feedforward (MDC) FFT architectures. In
feedforward architectures radix-2^k can be used for any number of parallel samples which is a power of
two. Furthermore, both decimation in frequency (DIF) and decimation in time (DIT) decompositions can
be used. In addition to this, the designs can achieve very high throughputs, which makes them suitable for
the most demanding applications. Indeed, the proposed radix-2^k feedforward architectures require fewer
hardware resources than parallel feedback ones, also called multi-path delay feedback (MDF), when
several samples in parallel must be processed. As a result, the proposed radix-2^k feedforward
architectures not only offer an attractive solution for current applications, but also open up a new research
line on feedforward structures.
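For readers unfamiliar with the decimation in frequency (DIF) decomposition named above, a minimal software sketch of a plain radix-2 DIF FFT follows. It only illustrates the butterfly and twiddle structure; it does not model the radix-2^k twiddle decomposition or the pipelined MDC hardware of the paper, and the function names are illustrative.

```python
import cmath

def fft_dif(x):
    """Recursive radix-2 decimation-in-frequency FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    # Butterflies: sums feed the even-indexed outputs, twiddled differences the odd ones.
    top = [x[i] + x[i + half] for i in range(half)]
    bottom = [(x[i] - x[i + half]) * cmath.exp(-2j * cmath.pi * i / n)
              for i in range(half)]
    out = [0j] * n
    out[0::2] = fft_dif(top)
    out[1::2] = fft_dif(bottom)
    return out

def dft(x):
    """Direct O(N^2) DFT, used only as a reference for checking."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]
```

Comparing `fft_dif` against `dft` on a short input is a quick way to confirm the decomposition.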

ETPL
VLSI-004
Algorithm and Architecture Design of Bandwidth-Oriented Motion Estimation for
Real-Time Mobile Video Applications
Abstract: This paper proposes a data bandwidth-oriented motion estimation design for resource-limited
mobile video applications using an integrated bandwidth rate distortion optimization framework. This
framework predicts and allocates the appropriate data bandwidth for motion estimation under a limited
bandwidth supply to fit a dynamically changing bandwidth supply. The simulation results show that our
proposed algorithm can achieve 66% and 41% memory bandwidth savings while maintaining an
equivalent rate-distortion performance and meeting real-time targets, when compared with conventional
approaches for low-motion and high-motion D1 (704 × 576)-size video, respectively.
The final implementation requires 122 K gates in TSMC 0.13-μm CMOS technology and consumes
74 mW of power for D1 resolution at 30 frames/s, which is 40% of that achieved in previous designs.
ETPL
VLSI-005
STBC-OFDM Downlink Baseband Receiver for Mobile WMAN
Abstract: This paper proposes a space time block code-orthogonal frequency division multiplexing
downlink baseband receiver for mobile wireless metropolitan area network. The proposed baseband
receiver applied in the system with two transmit antennas and one receive antenna aims to provide high
performance in outdoor mobile environments. It provides a simple and robust synchronizer and an
accurate but hardware affordable channel estimator to overcome the challenge of multipath fading
channels. The coded bit error rate for 16 quadrature amplitude modulation can reach less
than 10^-6 at a vehicle speed of 120 km/h. The proposed baseband receiver, designed in 90-nm
CMOS technology, can support up to 27.32 Mb/s uncoded data transmission in a 10 MHz channel
bandwidth. It requires a core area of 2.41 × 2.41 mm^2 and dissipates 68.48 mW at 78.4 MHz with a 1 V
power supply.
ETPL
VLSI-006
Glitch-Free NAND-Based Digitally Controlled Delay-Lines
Abstract: The recently proposed NAND-based digitally controlled delay-lines (DCDLs) present a glitching
problem which may limit their use in many applications. This paper presents a glitch-free NAND-
based DCDL which overcomes this limitation, opening the use of NAND-based DCDLs to a wide
range of applications. The proposed NAND-based DCDL maintains the same resolution and minimum
delay as the previously proposed NAND-based DCDL. A theoretical demonstration of the glitch-free
operation of the proposed DCDL is also derived in the paper. Following this analysis, three driving circuits
for the delay control bits are also proposed. The proposed DCDLs have been designed in a 90-nm CMOS
technology and compared, in this technology, to the state of the art. Simulation results show that the novel
circuits result in the lowest resolution, with a slight worsening of the minimum delay with respect to the
previously proposed DCDL with the lowest delay. Simulations also confirm the correctness of the developed
glitching model and sizing strategy. As an example application, the proposed DCDL is used to realize an all-
digital spread-spectrum clock generator (SSCG). The use of the proposed DCDL in this circuit reduces
the peak-to-peak absolute output jitter by more than 40% with respect to an SSCG using three-
state inverter based DCDLs.

ETPL
VLSI-007
A High-Efficiency, Wide Workload Range, Digital Off-Time Modulation (DOTM) DC-
DC Converter With Asynchronous Power Saving Technique
Abstract: Conventionally for wide workload range applications, to keep good stability and high
efficiency, a switching converter with multi-mode operation is necessary. With the advanced digital
signal processing, this work presents an asynchronous digital controller with dynamic power saving
technique to achieve high power efficiency. The regulation is based on the off-time modulation, in which
an adaptive resolution adjustment is proposed for the extension toward light-loaded range. The DC-DC
converter is fabricated in a 0.18-μm CMOS process. The input voltage is from 2.7 to 3.6 V and the
regulated output is 1.8 V. The switching frequency is from 44 kHz to 1.65 MHz and the maximum output
ripple is 20 mV with a 10-μF capacitor and a 2.2-μH inductor. The power efficiency is higher than 91%
for the workload range from 3 to 400 mA.
ETPL
VLSI-008
Formal Verification of Architectural Power Intent
Abstract: This paper presents a verification framework that attempts to bridge the disconnect between
high-level properties capturing the architectural power management strategy and the implementation of
the power management control logic using low-level per-domain control signals. The novelty of the
proposed framework is in demonstrating that the architectural power intent properties developed using
high-level artifacts can be automatically translated into properties over low-level control sequences
gleaned from UPF specifications of power domains, and that the resulting properties can be used to
formally verify the global on-chip power management logic. The proposed translation uses a considerable
amount of domain knowledge and is also not purely syntactic, because it requires formal extraction of
timing information for the low-level control sequences. We present a tool, called POWER-TRUCTOR
which enables the proposed framework, and several test cases of significant complexity to demonstrate
the feasibility of the proposed framework.
ETPL
VLSI-009
Statistical SRAM Read Access Yield Improvement Using Negative Capacitance
Circuits
Abstract: SRAM has become the dominant block in modern ICs and constitutes more than 50% of the die
area. The increase of process variations with continued CMOS technology scaling is considered one of
the major challenges for SRAM designers. This increase in process variations causes SRAM cells to
functionally fail and reduces the chip functional yield considering the static noise margin stability failures
(i.e., cell flips when accessed), write failures (i.e., cell is not written within the write window), and read
access failures (i.e., incorrect read operation). In this paper, novel negative capacitance circuits are
developed, for the first time, to statistically improve the SRAM read access yield under process variations
by reducing the bitline parasitic capacitance. Post-layout simulation results, referring to an industrial
hardware-calibrated TSMC 65-nm CMOS technology, show that adding the negative capacitance
circuit to a column of 512 SRAM cells can improve the read access yield from 61.9% to 100%.
ETPL
VLSI-010
An Energy-Efficient L2 Cache Architecture Using Way Tag Information Under Write-
Through Policy
Abstract: Many high-performance microprocessors employ a cache write-through policy to improve
performance while achieving good tolerance to soft errors in on-chip caches. However,
the write-through policy also incurs a large energy overhead due to the increased accesses to caches at the
lower level (e.g., L2 caches) during write operations. In this paper, we propose a new cache architecture
referred to as way-tagged cache to improve the energy efficiency of write-through caches. By maintaining
the way tags of L2 cache in the L1 cache during read operations, the proposed technique enables L2 cache
to work in an equivalent direct-mapping manner during write hits, which account for the majority of L2
cache accesses. This leads to significant energy reduction without performance degradation. Simulation
results on the SPEC CPU2000 benchmarks demonstrate that the proposed technique achieves 65.4%
energy savings in L2 caches on average with only 0.02% area overhead and no performance degradation.
Similar results are also obtained under different L1 and L2 cache configurations. Furthermore, the idea of
way tagging can be applied to existing low-power cache design techniques to further improve energy
efficiency.
ETPL
VLSI-011
An Analytical Latency Model for Networks-on-Chip
Abstract: We propose an analytical model based on queueing theory for delay analysis in a wormhole-
switched network-on-chip (NoC). The proposed model takes as input an application communication
graph, a topology graph, a mapping vector, and a routing matrix, and estimates average packet latency
and router blocking time. It works for arbitrary network topology with deterministic routing under
arbitrary traffic patterns. This model can estimate per-flow average latency accurately and quickly, thus
enabling fast design space exploration of various design parameters in NoC designs. Experimental results
show that the proposed analytical model can predict the average packet latency more than four orders of
magnitude faster than an accurate simulation, while the computation error is less than 10% in non-
saturated networks for different system-on-chip platforms.
ETPL
VLSI-012
Built-In Generation of Functional Broadside Tests Using a Fixed Hardware Structure
Abstract: Functional broadside tests are two-pattern scan-based tests that avoid overtesting by ensuring
that a circuit traverses only reachable states during the functional clock cycles of a test. In addition, the
power dissipation during the fast functional clock cycles of functional broadside tests does not exceed that
possible during functional operation. On-chip test generation has the added advantage that it reduces test
data volume and facilitates at-speed test application. This paper shows that on-chip generation of
functional broadside tests can be done using a simple and fixed hardware structure, with a small number
of parameters that need to be tailored to a given circuit, and can achieve high transition fault coverage for
testable circuits. With the proposed on-chip test generation method, the circuit is used for generating
reachable states during test application. This alleviates the need to compute reachable states offline.
ETPL
VLSI-013
Checkpointing for Virtual Platforms and SystemC-TLM
Abstract: Integrating simulation models created using different simulation systems is a common problem
when constructing virtual platforms. Different companies and different departments can create models,
and virtual platforms for different purposes using different tools. There are also existing models that need
to be integrated into new tools, or the other way around. The simulators can be quite different in details,
even in the case of transaction-level models. We present work in integrating SystemC transaction-level
models into two typical full-system simulation environments, QEMU and Simics. We present issues in
reconciling the semantics of the different platforms, and our proposed solutions. In the Simics integration,
we additionally enable checkpointing in the models, based on the Simics checkpoint mechanism.
ETPL
VLSI-014
Design of a Practical Nanometer-Scale Redundant Via-Aware Standard Cell Library
for Improved Redundant Via1 Insertion Rate
Abstract: Despite the rapid advances in process technology, via failure is still problematic in nanometer-
scale semiconductor manufacturing. Adding redundant vias is a typical approach for improving yield and
reliability. Cell-based design methodologies are widely adopted in the industry for application-specific
integrated circuits. Standard cells are effective for increasing the insertion rate of redundant via1s in cell-
based designs. This study proposes an efficient library check and staggered pin arrangement approach that
compares redundant via1 insertion rate in different configurations such as double-via and rectangle-via.
To compare the variability in standard cell (SC) libraries, accurate characterization results are provided.
Moreover, the proposed SC library is easily implemented in all currently available routers. The
experimental results reveal that the proposed library improves total inserted redundant vias, total inserted
redundant via1s, and total run time by 20.2%, 51.9%, and 42.3%, respectively. In double-via pattern, the
proposed approach improves average via1 insertion rate by 14.6%. In rectangle-via pattern, the proposed
approach achieves a 100% via1 insertion rate.
ETPL
VLSI-015
Scaling Energy Per Operation via an Asynchronous Pipeline
Abstract: A statistical analysis of computations per unit energy in processors over the last 30 years is given.
It illustrates a sharp reduction in the rate of energy efficiency improvements over the last several years,
resulting in the formation of an asymptotic “wall” in our dataset; we use the measure of giga multiply
accumulates per Joule. We have developed an energy model which takes into account the realities of
scaling, specifically for asynchronous systems. Studies of an energy efficient asynchronous pipeline show
fabricated results of 17 Giga Operations per Joule in 0.6 μm at subthreshold when fully pipelined, and
simulations at a more modern 65 nm process show a further order of magnitude improvement on that.
ETPL
VLSI-016
A High Speed Low Power CAM With a Parity Bit and Power-Gated ML Sensing
Abstract: Content addressable memory (CAM) offers high-speed search function in a single clock cycle.
Due to its parallel match-line (ML) comparison, CAM is power-hungry. Thus, robust, high-speed and
low-power ML sense amplifiers are highly sought-after in CAM designs. In this paper, we introduce a
parity bit that leads to 39% sensing delay reduction at a cost of less than 1% area and power overhead.
Furthermore, we propose an effective gated-power technique to reduce the peak and average power
consumption and enhance the robustness of the design against process variations. A feedback loop is
employed to auto-turn off the power supply to the comparison elements and hence reduce the average
power consumption by 64%. The proposed design can work at a supply voltage down to 0.5 V.
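One way to see why an extra parity bit can help match-line sensing is sketched below: if both the stored word and the search key carry a parity bit, a single-bit data mismatch always produces a second mismatch in the parity position. This reading of the technique, the 8-bit width, and the helper names are assumptions for illustration; the abstract itself does not spell out the mechanism.

```python
def parity(word):
    """Even parity of an integer word: 1 if the number of set bits is odd."""
    return bin(word).count("1") & 1

def with_parity(word, width):
    """Append the parity bit above the data bits (bit position `width`)."""
    return (parity(word) << width) | word

def mismatches(stored, key):
    """Number of bit positions in which the stored word and the search key differ."""
    return bin(stored ^ key).count("1")

WIDTH = 8                      # hypothetical word width
stored = 0b10110010
key = stored ^ 0b00000100      # search key differing from the stored word in one bit

data_only = mismatches(stored, key)
extended = mismatches(with_parity(stored, WIDTH), with_parity(key, WIDTH))
```

Here `data_only` counts one mismatch while `extended` counts two, so under this assumption the hardest case to sense (a single mismatching cell) no longer occurs on the extended word.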
ETPL
VLSI-017
Error Detection in Majority Logic Decoding of Euclidean Geometry Low Density
Parity Check (EG-LDPC) Codes
Abstract: In a recent paper, a method was proposed to accelerate the majority logic decoding of difference
set low density parity check codes. This is useful as majority logic decoding can be implemented serially
with simple hardware but requires a large decoding time. For memory applications, this increases the
memory access time. The method detects whether a word has errors in the first iterations of majority logic
decoding, and when there are no errors the decoding ends without completing the rest of the iterations.
Since most words in a memory will be error-free, the average decoding time is greatly reduced. In this
brief, we study the application of a similar technique to a class of Euclidean geometry low density parity
check (EG-LDPC) codes that are one step majority logic decodable. The results obtained show that the
method is also effective for EG-LDPC codes. Extensive simulation results are given to accurately
estimate the probability of error detection for different code sizes and numbers of errors.
ETPL
VLSI-018
Techniques for Compensating Memory Errors in JPEG2000
Abstract: This paper presents novel techniques to mitigate the effects of SRAM memory failures caused
by low voltage operation in JPEG2000 implementations. We investigate error control coding schemes,
specifically single error correction double error detection code based schemes, and propose an unequal
error protection scheme tailored for JPEG2000 that reduces memory overhead with minimal effect in
performance. Furthermore, we propose algorithm-specific techniques that exploit the characteristics of the
discrete wavelet transform coefficients to identify and remove SRAM errors. These techniques do not
require any additional memory, have low circuit overhead, and more importantly, reduce the memory
power consumption significantly with only a small reduction in image quality.
ETPL
VLSI-019
Spatial Distribution Measurement of Dynamic Voltage Drop Caused by Pulse and
Periodic Injection of Spot Noise
Abstract: This paper presents measured results of dynamic voltage drop caused by pulse and periodic
injection of spot noise. The test structure, fabricated in a 45-nm low-power process, has 1024 delay
probes to measure spatial distributions in response to the spot-noise generation. The test structure is an
advanced version of its predecessor, which was fabricated at a 65-nm node, and can trace changes in the
spatial distributions with time after the noise injection. The measured results are compared with SPICE
simulations, in which package/socket LCR as well as power-line RC within the die is modeled. It is found
that the simple model agrees well with the measured results.
ETPL
VLSI-020
Low-Complexity Multiplier for GF(2^m) Based on All-One Polynomials
Abstract: This paper presents an area-time-efficient systolic structure for multiplication over GF(2^m)
based on irreducible all-one polynomial (AOP). We have used a novel cut-set retiming to reduce the
duration of the critical-path to one XOR gate delay. It is further shown that the systolic structure can be
decomposed into two or more parallel systolic branches, where the pair of parallel systolic branches has
the same input operand, and they can share the same input operand registers. From the application-
specific integrated circuit and field-programmable gate array synthesis results, we find that the proposed
design provides significantly lower area-delay and power-delay complexities than the best of the existing
designs.
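The arithmetic behind AOP-based multiplication can be sketched in software. Because the all-one polynomial p(x) = x^m + ... + x + 1 divides x^(m+1) + 1, a product can be formed as a cyclic convolution of (m+1)-bit words and then reduced with a single conditional XOR. The code below is a bit-level illustration under that standard property, with hypothetical function names; it does not model the systolic structure or the retiming described in the paper.

```python
def aop_multiply(a, b, m):
    """Multiply a and b in GF(2^m) defined by the irreducible all-one polynomial of degree m.

    Elements are integers whose bit i is the coefficient of x^i (polynomial basis).
    """
    n = m + 1
    mask = (1 << n) - 1
    acc = 0
    for i in range(n):
        if (b >> i) & 1:
            # Multiplying by x^i modulo x^(m+1) + 1 is a cyclic left rotation by i.
            acc ^= ((a << i) | (a >> (n - i))) & mask
    # If the x^m coefficient is set, add the AOP (all ones) to clear it.
    if (acc >> m) & 1:
        acc ^= mask
    return acc

def reference_multiply(a, b, m):
    """Shift-and-add multiplication reduced by the AOP at each step, used for checking."""
    poly = (1 << (m + 1)) - 1
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= poly
    return result
```

The sketch assumes m is chosen so that the AOP is irreducible, for example m = 2 or m = 4.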
ETPL
VLSI-021
Design and Implementation of an On-Chip Permutation Network for Multiprocessor
System-On-Chip
Abstract: This paper presents the silicon-proven design of a novel on-chip network to support guaranteed
traffic permutation in multiprocessor system-on-chip applications. The proposed network employs a
pipelined circuit-switching approach combined with a dynamic path-setup scheme under a multistage
network topology. The dynamic path-setup scheme enables runtime path arrangement for arbitrary traffic
permutations. The circuit-switching approach offers a guarantee of permuted data and its compact
overhead enables the benefit of stacking multiple networks. A 0.13-μm CMOS test-chip validates the
feasibility and efficiency of the proposed design. Experimental results show that the proposed on-chip
network achieves 1.9× to 8.2× reduction of silicon overhead compared to other design approaches.
ETPL
VLSI-022
An On-Chip Network Fabric Supporting Coarse-Grained Processor Array
Abstract: Coarse grained arrays (CGAs) with run-time reconfigurability play an important role in
accelerating reconfigurable computing applications. It is challenging to design on-chip communication
networks (OCNs) for such CGAs with dynamic run-time reconfigurability whilst satisfying the tight
budgets of power and area for an embedded system. This paper presents a silicon-proven design of a 64-
PE circuit-switched OCN fabric with a dynamic path-setup scheme capable of supporting an embedded
coarse-grained processor array. A proof-of-concept test chip fabricated in a 0.13 μm CMOS process
occupies a silicon area of 23 mm^2 and consumes a peak power of 200 mW @ 128 MHz and 1.2 Vcc, at
room temperature. The OCN overhead consumes 9.4% of the area and 18% of the power of the total chip.
Experimental results and analysis show that the proposed OCN fabric with its dynamic path-setup is
suitable for use in an embedded CGA supporting fast run-time reconfigurability.
ETPL
VLSI-023
A Very Linear Low-Pass Filter with Automatic Frequency Tuning
Abstract: A Gm-C third-order Chebyshev low-pass filter with a novel switched capacitor frequency
tuning technique for a zero-IF Bluetooth receiver has been designed. The frequency tuning scheme is
simpler and has more relaxed specifications than conventional ones. Furthermore, a highly linear pseudo-
differential transconductor with a compact feedback loop able to operate with low supply voltage has
been used. This control loop holds the input transistors in triode region and provides high output
resistance, keeping high linearity in a wide range of transconductance. The filter bandwidth is 0.5 MHz
and the overall scheme consumes 1.1 mA from a 1.8-V supply. The measured third-order intermodulation
(IM3) distortion of the filter for a 1 Vpp two-tone signal centered at 300 kHz is -65 dB.
ETPL
VLSI-024
A High-Speed Low-Complexity Modified Radix-2^5 FFT Processor for High
Rate WPAN Applications
Abstract: This paper presents a high-speed low-complexity modified radix-2^5 512-point fast Fourier
transform (FFT) processor using an eight data-path pipelined approach for high rate wireless personal
area network applications. A novel modified radix-2^5 FFT algorithm that reduces the hardware
complexity is proposed. This method can reduce the number of complex multiplications and the size of
the twiddle factor memory. It also uses a complex constant multiplier instead of a complex Booth
multiplier. The proposed FFT processor achieves a signal-to-quantization noise ratio of 35 dB at 12 bit
internal word length. The proposed processor has been designed and implemented using 90-nm CMOS
technology with a supply voltage of 1.2 V. The results demonstrate that the total gate count of the
proposed FFT processor is 290 K. Furthermore, the highest throughput rate is up to 2.5 GS/s at 310 MHz
while requiring much less hardware complexity.
ETPL
VLSI-025
Application Space Exploration of a Heterogeneous Run-Time Configurable Digital
Signal Processor
Abstract: This paper describes the application space exploration of a heterogeneous digital signal
processor with dynamic reconfiguration capabilities. The device is built around three reconfigurable
engines featuring different flavours and computation granularities that make it suitable for a wide range of
signal processing application domains such as video coding, image processing, telecommunications, and
cryptography. Performance of signal processing applications is evaluated from measurements performed
on a CMOS 90 nm prototype. In order to characterize the application space of the processor, performance
is compared with state-of-the-art devices, taking programmability, computational capabilities, and energy
efficiency as the main metrics. The device exploits performance and energy efficiency significantly more
than general purpose processors, while still maintaining a user-friendly programming approach that
mainly relies on software-oriented languages. The device is able to achieve 1.2 to 15 GOPS with an
energy efficiency from 2 to 50 GOPS/W when running the selected applications.
ETPL
VLSI-026
A Unified Graphics and Vision Processor With a 0.89 μW/fps Pose Estimation
Engine for Augmented Reality
Abstract: A unified vision and graphics processor with three layers is shown to provide a fast pipeline for
augmented reality. In the image-level layer, a 153.6 GOPS massively parallel processing unit with eight
SIMD processors, each containing 128 processing elements, performs highly data-parallel operations. In
the sub-image layer, a rasterizer and a pixel arranger respectively generate and reduce data-level
parallelism. In the descriptor-level layer, a pose estimation engine executes sequential programs. Our
processor can provide images for augmented reality at 100 fps, for a power consumption of 413 mW. This
is 39% faster than a comparable smartphone implementation. Our chip is fabricated in a 0.18 μm CMOS
process and contains 0.95 M gates.
ETPL
VLSI-027
CORDIC Designs for Fixed Angle of Rotation
Abstract: Rotation of vectors through fixed and known angles has wide applications in robotics, digital
signal processing, graphics, games, and animation. However, we do not find any optimized coordinate rotation
digital computer (CORDIC) design for vector-rotation through specific angles. Therefore, in this paper,
we present optimization schemes and CORDIC circuits for fixed and known rotations with different
levels of accuracy. For reducing the area- and time-complexities, we have proposed a hardwired pre-
shifting scheme in barrel-shifters of the proposed circuits. Two dedicated CORDIC cells are proposed for
the fixed-angle rotations. In one of those cells, micro-rotations and scaling are interleaved, and in the
other they are implemented in two separate stages. Pipelined schemes are suggested further for cascading
dedicated single-rotation units and bi-rotation CORDIC units for high-throughput and reduced latency
implementations. We have obtained the optimized set of micro-rotations for fixed and known angles. The
optimized scale-factors are also derived and dedicated shift-add circuits are designed to implement the
scaling. The fixed-point mean-squared-error of the proposed CORDIC circuit is analyzed statistically, and
strategies for reducing the error are given. We have synthesized the proposed CORDIC cells with Synopsys
Design Compiler using the TSMC 90-nm library, and shown that the proposed designs offer higher
throughput, lower latency, and lower area-delay product than the reference CORDIC design for fixed and
known angles of rotation. We find similar results of synthesis for different Xilinx field-programmable
gate-array platforms.
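The basic idea of rotating through a fixed, known angle can be sketched with conventional CORDIC: since the angle is known in advance, the direction of every micro-rotation can be computed once and then hardwired. The code below uses floating point and the full conventional sequence of micro-rotations; the iteration count and function names are assumptions, and it does not reproduce the optimized micro-rotation sets, scaling circuits, or pre-shifting scheme of the paper.

```python
import math

def fixed_angle_directions(angle_rad, iterations=16):
    """Precompute the micro-rotation directions (+1 or -1) for one known angle."""
    directions = []
    residual = angle_rad
    for i in range(iterations):
        d = 1 if residual >= 0 else -1
        directions.append(d)
        residual -= d * math.atan(2.0 ** -i)
    return directions

def cordic_rotate(x, y, directions):
    """Rotate (x, y) using shift-and-add style micro-rotations, then undo the CORDIC gain."""
    gain = 1.0
    for i, d in enumerate(directions):
        # Each step multiplies by 2^-i, which is a plain shift in fixed-point hardware.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return x / gain, y / gain

# Directions for a hypothetical fixed rotation of 30 degrees, computed once.
DIRECTIONS_30_DEG = fixed_angle_directions(math.radians(30.0))
rotated_x, rotated_y = cordic_rotate(1.0, 0.0, DIRECTIONS_30_DEG)
```

With 16 micro-rotations the result agrees with cos(30°) and sin(30°) to well within 1e-3; the conventional sequence converges only for angles of magnitude below roughly 99 degrees.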
ETPL
VLSI-028
Application-Driven End-to-End Traffic Predictions for Low Power NoC Design
Abstract: As chip multiprocessors keep increasing the number of cores on the chip, the network-on-chip
(NoC) technology is becoming essential for interconnecting the cores. While NoCs result in noticeable
performance boost over conventional bus systems, they consume a non-negligible fraction of the system
power. One promising solution is to dynamically adjust the working frequencies/voltages of the switches
as well as the links between switches in the NoC to match the traffic flows. The question is when to adjust
and by how much. Most previous works take a passive approach by reacting to fluctuations in local traffic
flows. Unfortunately, this approach may be too slow and too conservative in adjusting the working
frequencies/voltages. Since applications often exhibit periodic behaviors, we propose a hardware
mechanism to proactively adjust the frequencies/voltages of switches and/or links in NoC by predicting
the application runtime traffic. The evaluations show that our design achieves 86% dynamic power
savings of the links in the on-chip network, and the resulting overheads from mispredictions are tolerable.
ETPL
VLSI-029
Thermal-Constrained Task Allocation for Interconnect Energy Reduction in 3-D
Homogeneous MPSoCs
Abstract: 3-D technology that stacks silicon dies with through silicon vias (TSVs) is a promising solution
to overcome the interconnect scaling problem in giga-scale integrated circuits (ICs). Thermal dissipation
is a major challenge for 3-D integration and prior thermal-balanced task scheduling methods for 3-D
multiprocessor system-on-chips (MPSoCs) typically balance power gradient across vertical stacks based
on the assumption of strong thermal correlation among processing cores within a stack. On the other
hand, 3-D MPSoCs typically employ network-on-chip (NoC) as the communication infrastructure which
consumes a large portion of the energy budget. As TSVs consume much less energy than horizontal links
in 3-D MPSoCs when transmitting the same amount of data due to the reduced interconnect distance
between vertically adjacent cores, it is desirable to allocate heavily communicating tasks within the same
vertical stack as much as possible, so that traffic is restricted to the third dimension to reduce
interconnect energy. However, aggregating active tasks within the same stack is likely to exacerbate the
power density and result in hot spots. In this paper, we explore the tradeoff between thermal and
interconnect energy when allocating tasks in 3-D Homogeneous MPSoCs, and propose an efficient
heuristic. Experimental results show that the proposed technique can reduce interconnect energy by more
than 25% on average with almost the same peak temperature when compared with prior thermal-balanced
solutions.
ETPL
VLSI-030
A Wide-Range PLL Using Self-Healing Prescaler/VCO in 65-nm CMOS
Abstract: The variability and leakage current in nanoscale CMOS technology may degrade the circuit
performances significantly. To accommodate the above issues in a wide-range phase-locked loop (PLL), a
self-healing prescaler, a self-healing voltage-controlled oscillator (VCO), and a calibrated charge pump
(CP) are presented. This PLL is fabricated in a 65-nm CMOS technology and its active area is 0.0182
mm². For the self-healing VCO, its measured frequency range is from 60 to 1489 MHz. When this PLL
operates at 855 MHz, the measured rms and peak-to-peak jitters are 8.03 and 55.6 ps, respectively. The
measured reference spur is -52.89 dBc. This PLL consumes 4.3 mW from 1.2 V supply without buffers.
ETPL
VLSI-031
A Clock Control Strategy for Peak Power and RMS Current Reduction Using Path
Clustering
Abstract: Peak power reduction has been a critical challenge in the design of integrated circuits impacting
the chip's performance and reliability. The reduction of peak power also reduces the power density of
integrated circuits. Due to large IR-voltage drops in circuits, transistor switching slows down giving rise
to timing violations and logic failures. In this paper, we present a new clock control strategy for peak-
power reduction in VLSI circuits. In the proposed method, the simultaneous switching of combinational
paths is minimized by taking advantage of the delay slacks among the paths and clustering the paths with
similar slack values. Once the paths are identified based on the path delays and their slack values, the
clustering algorithm determines the ideal number of clusters for the given circuit and for each cluster the
maximum possible phase shift that can be applied to the clock. The paths are assigned to clusters in a load
balanced manner based on the slack values and each cluster will have a phase shift possible on its clock
depending on the slack. Thus, the proposed register-transfer level (RTL) method takes advantage of the
logic-path timing slack to re-schedule circuit activities at optimal intervals within the unaltered clock
period. When switching activities are redistributed more evenly across the clock period, the IC supply-
current consumption is also spread across a wider range of time within the clock period. This has the
beneficial effect of reducing peak-current draw in addition to reducing RMS power draw without having
to change the operating frequency and without utilizing additional power supply voltages as in dual or
multi VT approaches. The proposed method is implemented and tested through simulations using an
experimental setup with Synopsys Tools Suite and Cadence Tools on the ISCAS'85 benchmark circuits,
OpenCore circuits and LEON processor multiplier circuit. Experimental results indicate that peak power
can be reduced significantly to at least 72% depending on the number of clusters and the phase-shifted
clock identified as suitable for the given circuit by the proposed algorithms. Although the proposed
method incurs some power overhead compared to the traditional clocking method, the overhead can be
made negligible compared to the peak-power reduction as seen in the experimental results presented.
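The clustering step described in this abstract can be illustrated with a minimal sketch. It assumes each path is reduced to a single slack value (in picoseconds), that clusters are formed by sorting on slack and splitting into near-equal groups, and that a cluster's clock may be shifted by at most the smallest slack among its paths. The function name, the equal-split rule, and the sample values are hypothetical; the paper's own algorithm for choosing the number of clusters is not reproduced here.

```python
def cluster_paths_by_slack(slacks_ps, num_clusters):
    """Group path indices into clusters of similar slack (illustrative only).

    Returns a list of (path_indices, max_phase_shift_ps) tuples. The phase
    shift of a cluster is bounded by the smallest slack among its paths, so
    no path in the cluster exceeds its slack after the shift.
    """
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")
    # Sort path indices by slack so neighbouring paths have similar slack.
    order = sorted(range(len(slacks_ps)), key=lambda i: slacks_ps[i])
    base, extra = divmod(len(order), num_clusters)
    clusters = []
    start = 0
    for c in range(num_clusters):
        # Load-balanced split: cluster sizes differ by at most one path.
        size = base + (1 if c < extra else 0)
        members = order[start:start + size]
        start += size
        if members:
            shift = min(slacks_ps[i] for i in members)
            clusters.append((members, shift))
    return clusters


# Hypothetical slack values for eight paths, split into three clusters.
example_slacks = [120, 35, 410, 90, 300, 45, 380, 150]
example_clusters = cluster_paths_by_slack(example_slacks, 3)
```

Sorting before splitting keeps paths with similar slack together, which is what allows a single phase shift per cluster.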
ETPL
VLSI-032
A Fast-Locking All-Digital Deskew Buffer With Duty-Cycle Correction
Abstract: In this paper, a fast-locking all-digital deskew buffer with duty cycle correction is proposed and
implemented. A cyclic time-to-digital converter is introduced to decrease the locking time in conventional
register-controlled delay-locked loop to only two input clock cycles in coarse tuning. With the aid of the
three half delay lines technique, the mismatch between half delay lines causing the duty cycle distortion
can be alleviated by interpolation. A balanced edge combiner to achieve a precise 50% output clock is
also presented. A test chip is fabricated in 0.18-μm technology to demonstrate the feasibility of the
proposed architecture. The circuit can accept the input clock rates from 250 to 625 MHz with the duty
cycle variation within 30% and 70% to generate 50% output clocks. It preserves the capability of closed-
loop control with a small area and power consumption.
ETPL
VLSI-033
A Built-In Repair Analyzer With Optimal Repair Rate for Word-Oriented Memories
Abstract: This paper presents a built-in self-repair analyzer with the optimal repair rate for memory arrays
with redundancy. The proposed method requires only a single test, even in the worst case. By performing
the must-repair analysis on the fly during the test, it selectively stores fault addresses, and the final
analysis to find a solution is performed on the stored fault addresses. To enumerate all possible solutions,
existing techniques use depth-first search using a stack and a finite-state machine. Instead, we propose a
new algorithm and its combinational circuit implementation. Since our formulation for the circuit allows
us to use the parallel prefix algorithm, it can be configured in various ways to meet area and test time
requirements. The total area of our infrastructure is dominated by the number of content-addressable
memory entries to store the fault addresses, and it only grows quadratically with respect to the number of
repair elements. The infrastructure is also extended to support various types of word-oriented memories.
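The must-repair analysis mentioned in this abstract rests on a simple rule: a row holding more faults than there are free spare columns can only be fixed with a spare row, and symmetrically for columns. The sketch below applies that rule to a stored list of fault addresses. It is a batch software illustration under assumed names and data layout, not the on-the-fly hardware analyzer or the parallel-prefix circuit proposed in the paper.

```python
from collections import Counter


def must_repair(faults, spare_rows, spare_cols):
    """Apply the must-repair rule until no more lines are forced.

    faults: iterable of (row, col) fault addresses.
    Returns (rows_to_repair, cols_to_repair, remaining_faults), or None if
    the forced repairs already exceed the available spares.
    """
    remaining = set(faults)
    rows, cols = set(), set()
    changed = True
    while changed:
        changed = False
        free_rows = spare_rows - len(rows)
        free_cols = spare_cols - len(cols)
        row_count = Counter(r for r, _ in remaining)
        col_count = Counter(c for _, c in remaining)
        # A row with more faults than free spare columns needs a spare row.
        forced_rows = {r for r, n in row_count.items() if n > free_cols}
        # Symmetric rule: a column with more faults than free spare rows.
        forced_cols = {c for c, n in col_count.items() if n > free_rows}
        if forced_rows or forced_cols:
            rows |= forced_rows
            cols |= forced_cols
            if len(rows) > spare_rows or len(cols) > spare_cols:
                return None  # not repairable with the given redundancy
            remaining = {(r, c) for r, c in remaining
                         if r not in rows and c not in cols}
            changed = True
    return rows, cols, remaining
```

Faults left in `remaining` are the ones for which a final solution search, such as the one the abstract describes, would still be needed.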
ETPL
VLSI-034
System-Level Modeling and Analysis of Thermal Effects in Optical Networks-on-Chip
Abstract: The performance of multiprocessor systems, such as chip multiprocessors (CMPs), is
determined not only by individual processor performance, but also by how efficiently the processors
collaborate with one another. It is the communication architecture that determines the collaboration
efficiency on the hardware side. Optical networks-on-chip (ONoCs) are emerging communication
architectures that can potentially offer ultra-high communication bandwidth and low latency to
multiprocessor systems. Thermal sensitivity is an intrinsic characteristic of photonic devices used by
ONoCs as well as a potential issue. This paper systematically modeled and quantitatively analyzed the
thermal effects in ONoCs. We used an 8 × 8 mesh-based ONoC as a case study and evaluated the impacts
of thermal effects in the average power efficiency for real MPSoC applications. We revealed three
important factors regarding ONoC power efficiency under temperature variations, and proposed several
techniques to reduce the temperature sensitivity of ONoCs. These techniques include the optimal initial
setting of microresonator resonant wavelength, increasing the 3-dB bandwidth of optical switching
elements by parallel coupling multiple microresonators, and the use of passive-routing optical router Crux
to minimize the number of switching stages in mesh-based ONoCs. We gave a mathematical analysis of
periodically parallel coupling of multiple microresonators and showed that the 3-dB bandwidth of optical
switching elements can be widened nearly linearly with the ring number. Evaluation results for different
real MPSoC applications show that, on the basis of thermal tuning, the optimal device setting improves
the average power efficiency by 54% to 1.2 pJ/bit when chip temperature reaches 85 °C. The findings in
this paper can help support the further development of this emerging technology.
ETPL
VLSI-035
A Study of Tapered 3-D TSVs for Power and Thermal Integrity
Abstract: 3-D integration presents a path to higher performance, greater density, increased functionality
and heterogeneous technology implementation. However, 3-D integration introduces many challenges for
power and thermal integrity due to large switching currents, longer power delivery paths, and increased
parasitics compared to 2-D integration. In this work, we provide an in-depth study of power and thermal
issues while incorporating the physical design characteristics unique to 3-D integration. We provide a
qualitative perspective of the power and thermal dissipation issues in 3-D and study the impact of
Through Silicon Vias (TSVs) size for their mitigation. We investigate and discuss the design implications
of power and thermal issues in the presence of decoupling capacitors, TSV/on-die/package parasitics,
various resonance effects and power gating. Our study is based on a ten-tier system utilizing existing 3-D
technology specifications. Based on detailed power distribution and heat dissipation models, we present a
comprehensive analysis of TSV tapering for alleviating power and thermal integrity issues in 3-D ICs.
ETPL
VLSI-036
Improved Trace Buffer Observation via Selective Data Capture Using 2-D Compaction
for Post-Silicon Debug
Abstract: This paper presents a novel technique for extending the capacity of trace buffers when capturing
debug data during post-silicon debug. It exploits the fact that it is not necessary to capture error-free data
in the trace buffer since that information can be obtained from simulation. A selective data capture
method is proposed in this paper that only captures debug data during clock cycles in which errors are
present. The proposed debug method requires only three debug sessions. The first session estimates a
rough error rate, the second session identifies a set of suspect clock cycles where errors may be present,
and the third session captures the suspect clock cycles in the trace buffer. The suspect clock cycles are
determined through a 2-D compaction technique using multiple-input signature register signatures and
cycling register signatures. Intersecting both signatures generates a small number of suspect clock cycles
that the trace buffer needs to capture. The effective observation window of the trace buffer can be
expanded significantly, by up to orders of magnitude. Experimental results indicate that very significant
increases in the effective observation window for a trace buffer can be obtained.
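The intersection idea behind the 2-D compaction can be sketched as follows. The sketch assumes that each MISR signature covers a window of consecutive clock cycles, that each cycling-register position covers all cycles with the same index modulo the register length, and that signatures behave as alias-free pass/fail flags. These modelling choices, the function name, and the parameter names are assumptions made for illustration and are not taken from the paper.

```python
def suspect_cycles(error_cycles, total_cycles, misr_window, cycling_length):
    """Return the clock cycles flagged by both compaction dimensions.

    error_cycles: cycles in which an error actually occurred (unknown to
    the debugger in practice; used here only to derive failing signatures).
    """
    error_cycles = set(error_cycles)
    # Dimension 1: a window of consecutive cycles fails if it holds an error.
    failing_windows = {c // misr_window for c in error_cycles}
    # Dimension 2: a cycling-register slot fails if any cycle mapped to it
    # (cycle index modulo the register length) holds an error.
    failing_slots = {c % cycling_length for c in error_cycles}
    # A cycle is suspect only if it is flagged in both dimensions.
    return [c for c in range(total_cycles)
            if c // misr_window in failing_windows
            and c % cycling_length in failing_slots]


# Two erroneous cycles out of 64 leave six suspects for the trace buffer.
suspects = suspect_cycles({5, 42}, total_cycles=64,
                          misr_window=8, cycling_length=7)
```

The suspect set always contains the true error cycles and is usually much smaller than the full run, which is what lets a fixed-size trace buffer cover a longer observation window.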
ETPL
VLSI-037
AC-Plus Scan Methodology for Small Delay Testing and Characterization
Abstract: Small delay defects escaping traditional delay testing could cause a device to malfunction in the
field and thus detecting these defects is often necessary. To address this issue, we propose three test
modes in a new methodology called AC-plus scan, in which versatile test clocks can be generated on the
chip by embedding an all-digital phase-locked loop (ADPLL) into the circuit under test (CUT). AC-plus
scan can be executed on an in-house wireless test platform called HOY system. The first test mode of our
AC-plus scan provides a more efficient way to measure the longest path delay associated with each test
pattern. Experimental results show that our method could greatly reduce the test time by 81.8%. The
second test mode is designed for volume production test. It could effectively detect small delay defects
and provide fast characterization on those defective chips for further processing. This mode could be used
to help predict which chips are more likely to fall victim to operational failure in the field. The third test
mode is to extract the waveform of each flip-flop's output in a real chip. This is made possible by taking
advantage of the almost unlimited test memory our HOY test platform provides, so that we could easily
store a great volume of data and reconstruct the waveform for post-silicon debugging. We have
successfully fabricated a Viterbi decoder chip with such an AC-plus scan methodology inside to
demonstrate its capability.
ETPL
VLSI-038
A Variation Tolerant Current-Mode Signaling Scheme for On-Chip Interconnects
Abstract: Current-mode signaling (CMS) with dynamic overdriving is one of the most promising schemes
for high-speed low-power communication over long on-chip interconnects. However, such schemes are sensitive to
parameter variations due to reduced voltage swings on the line. In this paper, we propose a variation
tolerant dynamic overdriving CMS scheme. The proposed CMS scheme and a competing CMS scheme
(CMS-Fb) are fabricated in 180-nm CMOS technology. Measurement results show that the proposed
scheme offers 34% reduction in energy/bit and 42% reduction in energy-delay-product over CMS-Fb
scheme for a 10 mm line operating at a data rate of 0.64 Gb/s. Simulations indicate that the proposed CMS
scheme consumes 0.297 pJ/bit for data transfer over the 10 mm line at 2.63 Gb/s. Measurements indicate
that the delay of CMS-Fb becomes 2.5 times its nominal value in the presence of intra-die variations
whereas the delay of the proposed scheme changes by only 5% for the same amount of intra-die
variations. Measurement and simulation results show that both the schemes are robust against inter-die
variations. Experiments and simulations also indicate that the proposed CMS scheme is more robust
against practical variations in supply and temperature as compared to CMS-Fb scheme.
ETPL
VLSI-039
Modeling and Analysis of Power Distribution Networks in 3-D ICs
Abstract: This paper addresses the modeling and analysis problems for power distribution networks
(PDNs) in 3-D ICs. An on-chip distributed model is proposed for 3-D power grids, in which the details of
metal layers are considered. The distributed model is demonstrated to be essential to identifying the
unique noise behavior of 3-D PDNs. A lumped model is proposed based on the distributed model. The
lumped model features the connection impedance between tiers and is proven to be useful for designers to
understand the global effects of 3-D PDNs. Based on the models, an analysis flow is designed for 3-D
PDNs in both frequency domain and time domain. With the analysis flow, the electrical characteristics of
3-D PDNs are studied systematically for the first time. The frequency-domain analysis identifies the
global and local resonance phenomena in 3-D PDNs that are distinct from those in 2-D PDNs. The
physical mechanisms behind the resonance phenomena are investigated. The time-domain analysis
predicts the worst-case supply noise based on distributed current constraints. The “Rogue Wave” concept
is introduced to explain the spatial and temporal relations of the worst-case on-chip noise responses in
3-D PDNs.
ETPL
VLSI-040
A Low-Cost, Systematic Methodology for Soft Error Robustness of Logic Circuits
Abstract: Due to current technology scaling trends such as shrinking feature sizes and decreasing supply
voltages, circuit reliability is becoming more susceptible to radiation-induced transient faults (soft errors).
Soft errors, which have been a great concern in memories, are now a main factor in reliability degradation
of logic circuits as well. In this paper, we present a systematic and integrated methodology for circuit
robustness to soft errors. The proposed soft error rate (SER) reduction framework, based on redundancy
addition and removal (RAR), aims at eliminating those gates with large contribution to the overall SER.
Several metrics and constraints are introduced to guide the RAR-based approach toward SER reduction.
Furthermore, we integrate a resizing strategy into our framework, as post-RAR additive SER
optimization. The strategy can identify the most critical gates to be upsized and thereby minimize area and
power overheads while maintaining a high level of soft error robustness. Experimental results show that
the proposed RAR-based framework can achieve up to 70% reduction in output failure probability. On
average, about 23% SER reduction is obtained with less than 4% area overhead.
ETPL
VLSI-041
Low Complexity Out-of-Order Issue Logic Using Static Circuits
Abstract: In this paper a single-cycle issue queue circuit architecture that simplifies the wakeup and
selection logic is proposed. The micro-architecture and fully static CMOS circuits are presented for a 32-
entry queue that issues four instructions per cycle. The instruction-ready signals are divided into groups
and processed in parallel to issue the four oldest ready instructions. The complete issue queue and
prioritization logic requires 20 inversions, allowing simulated circuit operation at over 4 GHz in a foundry
45 nm SOI fabrication process.
ETPL
VLSI-042
Low Latency Systolic Montgomery Multiplier for Finite Field GF(2^m) Based on
Pentanomials
Abstract: In this paper, we present a low latency systolic Montgomery multiplier over GF(2^m) based on
irreducible pentanomials. An efficient algorithm is presented to decompose the multiplication into a
number of independent units to facilitate parallel processing. Besides, a novel so-called “pre-computed
addition” technique is introduced to further reduce the latency. The proposed design involves significantly
less area-delay and power-delay complexities compared with the best of the existing designs. It has the
same or shorter critical-path and involves nearly one-fourth of the latency of the other designs in the case of the
National Institute of Standards and Technology recommended irreducible pentanomials.
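For readers unfamiliar with the underlying arithmetic, the following is a minimal bit-serial sketch of Montgomery multiplication in GF(2^m), with polynomials stored as Python integers. It illustrates only the operation a * b * x^(-m) mod f; it does not model the systolic decomposition or the "pre-computed addition" technique of the paper. The small pentanomial x^8 + x^4 + x^3 + x + 1 is used purely for illustration, and all function names are hypothetical.

```python
def gf2m_montgomery_mul(a, b, f, m):
    """Return a * b * x^(-m) mod f, with GF(2) polynomials held as ints."""
    c = 0
    for i in range(m):
        if (a >> i) & 1:
            c ^= b       # add a_i * b (addition in GF(2) is XOR)
        if c & 1:
            c ^= f       # f has constant term 1, so this clears bit 0
        c >>= 1          # exact division by x
    return c


def gf2m_mul_reference(a, b, f, m):
    """Plain shift-and-add multiplication modulo f, used as a cross-check."""
    result = 0
    for i in range(m):
        if (b >> i) & 1:
            result ^= a
        a <<= 1
        if (a >> m) & 1:
            a ^= f
    return result


def x_power_mod(k, f, m):
    """Return x^k mod f."""
    value = 1
    for _ in range(k):
        value <<= 1
        if (value >> m) & 1:
            value ^= f
    return value


# Illustrative field: m = 8 with the pentanomial x^8 + x^4 + x^3 + x + 1.
F, M = 0x11B, 8
R2 = x_power_mod(2 * M, F, M)          # x^(2m) mod f, used for conversion

a_mont = gf2m_montgomery_mul(0x57, R2, F, M)   # a * x^m mod f
b_mont = gf2m_montgomery_mul(0x83, R2, F, M)   # b * x^m mod f
ab_mont = gf2m_montgomery_mul(a_mont, b_mont, F, M)
product = gf2m_montgomery_mul(ab_mont, 1, F, M)  # back to normal form
```

The same functions work unchanged for larger fields, for example m = 163 with the pentanomial x^163 + x^7 + x^6 + x^3 + 1, by passing the corresponding `f` and `m`.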
ETPL
VLSI-043
Power-Up Sequence Control for MTCMOS Designs
Abstract: Power gating is effective for reducing standby leakage power as multi-threshold CMOS
(MTCMOS) designs have become popular in the industry. However, a large inrush current and dynamic
IR drop may occur when a circuit domain is powered up with MTCMOS switches. This could in turn lead
to improper circuit operation. We propose a novel framework for generating a proper power-up sequence
of the switches to control the inrush current of a power-gated domain while minimizing the power-up
time and reducing the dynamic IR drop of the active domains. We also propose a configurable domino-
delay circuit for implementing the sequence. Experimental results based on state-of-the-art industrial
designs demonstrate the effectiveness of the proposed framework in limiting the inrush current,
minimizing the power-up time, and reducing the dynamic IR drop. Results further confirm the efficiency
of the framework in handling large-scale designs with more than 80 K power switches and 100 M
transistors.
ETPL
VLSI-044
Architecture and Design Flow for a Highly Efficient Structured ASIC
Abstract: As fabrication process technology continues to advance, mask set costs have become
prohibitively expensive. Structured application specific integrated circuits (sASICs) offer a middle ground
in price and performance between ASICs and field-programmable gate arrays (FPGAs) by sharing masks
across different designs. In this paper, two sASIC architectures are proposed, the first being based on
three-input lookup-tables, and the second on AOI22 gates. The sASICs are programmed using a standard-
cell compatible design flow. They are customized using a minimum of three masks, i.e., two metals and
one via. The area and delay of the sASIC are compared with ASICs and FPGAs. Results over a set of
benchmark circuits show that our AOI22-based sASIC had an average of 1.76x/1.41x increase in
area/delay compared to ASICs, a considerable improvement compared with the 26.56x/5.09x increase for
FPGAs. This is, to the best of our knowledge, the best performance reported in the literature for a
practical sASIC. A prototype using the sASIC was fabricated using a UMC 0.13-μm
mixed-mode/RF process. It was fully verified using scan and functional tests, and used in a demonstration
system.
ETPL
VLSI-045
Secure Dual-Core Cryptoprocessor for Pairings Over Barreto-Naehrig Curves on
FPGA Platform
Abstract: This paper is devoted to the design and the physical security of a parallel dual-core flexible
cryptoprocessor for computing pairings over Barreto-Naehrig (BN) curves. The proposed design is
specifically optimized for field-programmable gate-array (FPGA) platforms. The design explores the in-built
features of an FPGA device for achieving an efficient cryptoprocessor for computing 128-bit secure
pairings. The work further pinpoints the vulnerability of those pairing computations against side-channel
attacks and demonstrates experimentally that power consumptions of such devices can be used to attack
these ciphers. Finally, we suggest a suitable countermeasure to overcome the respective weaknesses. The
proposed secure cryptoprocessor needs 1 730 000, 1 206 000, and 821 000 cycles for the computation of
Tate, ate, and optimal-ate pairings, respectively. The implementation results on a Virtex-6 FPGA device
show that it consumes 23 k Slices and computes the respective pairings in 11.93, 8.32, and 5.66 ms.
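As a quick consistency check on the figures quoted above, the cycle counts and run times can be divided to obtain the clock frequency they imply. The abstract does not state the clock frequency; the short calculation below only derives it from the reported numbers, and all three pairings come out at roughly 145 MHz.

```python
# Cycle counts and run times as quoted in the abstract.
cycles = {"Tate": 1_730_000, "ate": 1_206_000, "optimal-ate": 821_000}
times_ms = {"Tate": 11.93, "ate": 8.32, "optimal-ate": 5.66}

# Implied clock frequency in MHz: cycles / seconds / 1e6.
implied_mhz = {
    name: cycles[name] / (times_ms[name] * 1e-3) / 1e6
    for name in cycles
}

for name, mhz in implied_mhz.items():
    print(f"{name}: {mhz:.1f} MHz")
```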
ETPL
VLSI-046
In-Situ Method for TSV Delay Testing and Characterization Using Input Sensitivity
Analysis
Abstract: In this paper, we propose a method and the required architecture for characterizing the
propagation delays of the through-silicon vias (TSVs) in a 3-D IC. First of all, every two TSVs are paired
up to form an oscillation ring with some peripheral circuits. Their joint performance can thus be measured
roughly by the oscillation period of the ring. Next, we utilize a technique called sensitivity analysis to
further derive the propagation delay of each individual TSV participating in an oscillation ring, a distilling
process. In this process, we perturb the strength of the two TSV drivers, and then measure their effects in
terms of the change of the oscillation ring's period. Through subsequent analysis, the propagation delay of
each TSV can be revealed. On top of this scheme, we also present an architecture that can activate the
performance characterization process of each test unit (which consists of two TSVs) one at a time in a
proper sequence. The area overhead is only 18.97 equivalent two-input NAND gates per TSV, by which
one can gain the ability to profile the capacitances and the propagation delays of the TSVs on a 3-D IC.
ETPL
VLSI-047
Low-Resolution DAC-Driven Linearity Testing of Higher Resolution ADCs Using
Polynomial Fitting Measurements
Abstract: A low-cost linearity test methodology for high-resolution analog-to-digital converters (ADCs) is
presented in this paper. Linearity testing of ADCs requires high-precision digital-to-analog conversion
(DAC) capability, commonly 3-bit higher resolution than the ADC under test. Further, a large number of
ADC output data samples must be collected making conventional histogram testing impractical for high-
resolution ADCs with 18-24 bit precision. In the proposed test methodology, two low-precision and low-
cost DACs are used to generate a high-resolution ADC test stimulus. Significant reductions in test cost
and test time are achieved by using low-cost instrumentation and by making fewer measurements than
required for conventional histogram test. A least-squares-based polynomial fitting approach is used to
determine the transfer function of the ADC under test. The generated transfer function is used to compute
the non-linearity of the ADC accurately. No assumption is made regarding the linearity of the lower
precision signal generators (DACs) used in the testing procedure. Software simulations and hardware
experiments are performed to validate the proposed test methodology.
ETPL
VLSI-048
Low-Cost Error Tolerance Scheme for 3-D CMOS Imagers
Abstract: This paper presents an error tolerance scheme for 3-D CMOS imagers that are constructed by
stacking a pixel array of imager sensors, an analog-to-digital converter (ADC) array, and an image signal
processor (ISP) array using microbumps (μbumps) and through silicon vias (TSVs). To deliver high-
quality images in the presence of single or multiple μbump, ADC, or TSV failures, we propose to
interleave the connections from pixels to ADCs and recover the corrupted data in the ISPs. Key design
parameters, such as the interleaving stride and the grouping ratio are determined by analyzing the
employed error correction algorithm. Architectural simulation results demonstrate that the error tolerance
scheme enhances the effective yield of an exemplar 3-D imager from 44% to 97%.
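The benefit of interleaving can be seen in a small one-dimensional sketch. It assumes that, without interleaving, each ADC serves a contiguous block of pixels, and that with interleaving pixel i is wired to ADC (i mod number of ADCs), so a failed ADC corrupts isolated pixels whose neighbours are intact. Recovery by averaging the nearest good neighbours is a stand-in chosen for illustration; the paper's actual error correction algorithm, stride, and grouping ratio are not given in the abstract, and all names below are hypothetical.

```python
def adc_for_pixel(pixel, num_adcs, pixels_per_adc, interleaved):
    """Map a pixel index to the ADC that converts it (assumed wiring)."""
    if interleaved:
        return pixel % num_adcs          # neighbouring pixels use different ADCs
    return pixel // pixels_per_adc       # contiguous block per ADC


def capture_and_recover(row, num_adcs, failed_adc, interleaved):
    """Simulate one failed ADC and repair lost pixels from good neighbours.

    Returns the recovered row; pixels with no intact neighbour stay None.
    Assumes len(row) is a multiple of num_adcs.
    """
    pixels_per_adc = len(row) // num_adcs
    lost = [adc_for_pixel(i, num_adcs, pixels_per_adc, interleaved) == failed_adc
            for i in range(len(row))]
    recovered = list(row)
    for i, is_lost in enumerate(lost):
        if not is_lost:
            continue
        # Use only neighbours that were converted by a working ADC.
        neighbours = [row[j] for j in (i - 1, i + 1)
                      if 0 <= j < len(row) and not lost[j]]
        recovered[i] = sum(neighbours) / len(neighbours) if neighbours else None
    return recovered


ramp = list(range(0, 160, 10))           # 16-pixel test row, 4 ADCs
with_interleaving = capture_and_recover(ramp, 4, failed_adc=1, interleaved=True)
without_interleaving = capture_and_recover(ramp, 4, failed_adc=1, interleaved=False)
```

On this linear ramp the interleaved wiring recovers every lost pixel exactly, while the contiguous wiring leaves pixels in the middle of the failed block with no intact neighbour to interpolate from.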
ETPL
VLSI-049
Computing Two-Pattern Test Cubes for Transition Path Delay Faults
Abstract: Considering full-scan circuits, incompletely-specified tests, or test cubes, are used for test data
compression. When considering path delay faults, certain specified input values in a test cube are needed
only for determining the lengths of the paths associated with detected faults. Path delay faults, and
therefore, small delay defects, would still be detected if such values are unspecified. The goal of this
paper is to explore the possibility of increasing the number of unspecified input values in a test set for
path delay faults by unspecifying such values in order to make the test set more amenable to test data
compression. Experimental results indicate that significant numbers of such values exist. The proposed
procedure unspecifies them gradually to obtain a series of test sets with increasing numbers of unspecified
values and decreasing path lengths. Experimental results also indicate that filling the unspecified values
randomly (as with some test data compression methods) recovers some or all of the path lengths
associated with detected path delay faults. The procedure uses a matching of the sets of detected faults for
the comparison of path lengths.
ETPL
VLSI-050
Integrated Energy-Harvesting Photodiodes With Diffractive Storage Capacitance
Abstract: Integrating energy-harvesting photodiodes with logic and exploiting on-die interconnect
capacitance for energy storage can enable new, ultraminiaturized wireless systems. Unlike CMOS imager
pixels, the proposed photodiode designs utilize p-diffusion fingers and are implemented in a conventional
logic process. Also unlike specialized solar cell processes, the designs utilize the on-chip metal
interconnect to form a diffraction grating above the p-diffusion fingers which also provides capacitive
energy storage. To explore the tradeoffs between optical efficiency and energy storage for integrated
photodiodes, an array of photovoltaics with various diffractive storage capacitors was designed in a 90-
nm CMOS logic process. The diffractive effects can be exploited to increase the photodiodes' response to
off-axis illumination. Transient effects from interfacing the photodiodes with switched-capacitor DC-DC
converters were examined, with measurements indicating a 50% reduction in the output voltage ripple
due to the diffractive storage capacitance. A quantitative comparison between 90-nm and 0.35-μm CMOS
logic processes for energy-harvesting capabilities was carried out. Measurements show an increase in
power generation for the newer CMOS technology, but at the cost of reduced output voltage. One
potential application for the integrated photodiodes is harvesting energy for a subdermal biomedical
device.
ETPL
VLSI-051
Fast Fixed-Outline 3-D IC Floorplanning With TSV Co-Placement
Abstract: Through-silicon vias (TSVs) are used to connect inter-die signals in a 3-D IC. Unlike
conventional vias, TSVs occupy device area and are very large compared to logic gates. However, most
previous 3-D floorplanners only view TSVs as points. As a result, whitespace redistribution is necessary
for TSV insertion after the initial floorplan is computed, which leads to suboptimal layouts. In this paper,
we propose a very efficient 3-D floorplanner to simultaneously floorplan the functional modules and
place the TSVs and to optimize the total wirelength under fixed-outline constraint. Compared to the
state-of-the-art 3-D floorplanner with TSV planning, our design consistently produces better floorplans with
15% shorter wirelength and 31% fewer TSVs on average. Our algorithm is extremely fast and only takes
a few seconds to floorplan benchmarks with hundreds of modules compared to hours as required by the
previous state-of-the-art floorplanner.
ETPL
VLSI-052
Reactivation Noise Suppression With Sleep Signal Slew Rate Modulation in MTCMOS
Circuits
Abstract: Multi-threshold CMOS (MTCMOS) is commonly used for suppressing leakage currents in idle
integrated circuits. Power and ground distribution network noise produced during SLEEP to ACTIVE
mode transitions is an important reliability concern in MTCMOS circuits. Sleep signal slew rate
modulation techniques for suppressing mode-transition noise are explored in this paper. A triple-phase
sleep signal slew rate modulation (TPS) technique with a novel digital sleep signal generator is proposed.
Reactivation time, mode-transition energy consumption, leakage power consumption, and layout area of
different MTCMOS circuits are characterized under an equal-noise constraint. Influences of within-die
and die-to-die parameter variations on the reactivation noise, time, and energy consumption of sleep
signal slew rate modulated MTCMOS circuits are evaluated with a process imperfections aware
robustness metric. The proposed triple-phase sleep signal slew rate modulation technique enhances the
tolerance to process parameter fluctuations by up to 183.1× as compared to various alternative MTCMOS
noise suppression techniques in a UMC 80-nm CMOS technology.
ETPL
VLSI-053
Sub-mW LC Dual-Input Injection-Locked Oscillator for Autonomous WBSNs
Abstract: This paper presents a sub-mW, current-reused first-harmonic LC injection-locked oscillator
(ILO) using in-phase dual-input injection technique. It can be used as a power oscillator in the injection-
locked transmitter of wireless biomedical sensor nodes (WBSNs) integrated into a wireless body area
network. A prototype chip, implemented in a standard 0.13-μm CMOS process occupying 200 × 380 μm,
operates in the medical implantable communications service (MICS) band for medical implants.
Measurement results show that the proposed ILO features a wide locking range of 800 MHz (150-950
MHz) at an input power of 0 dBm. More importantly, it has a high input sensitivity of -30 dBm to lock
the 3-MHz bandwidth of the MICS band, while consuming only 660 μW at 1-V supply. This ultralow
power consumption enables autonomous WBSNs.
ETPL
VLSI-054
Constant Delay Logic Style
Abstract: A constant delay (CD) logic style is proposed in this paper, targeting full-custom high-speed
applications. The CD characteristic of this logic style regardless of the logic type makes it suitable in
implementing complicated logic expressions such as addition. CD logic exhibits a unique characteristic
where the output is pre-evaluated before the inputs from the preceding stage are ready. This feature offers
performance advantage over static and dynamic domino logic styles in a single-cycle multistage circuit
block. Several design considerations including timing window width adjustment and clock distribution
are discussed. Using 65-nm general-purpose CMOS technology, the proposed logic demonstrates an
average speedup of 94% and 56% over static and dynamic domino logic, respectively, in five different
logic gates. Simulation results of 8-bit ripple carry adders show that CD logic is 39% and 23% faster than
the static and dynamic-based adders, respectively. CD logic also demonstrates 39% speedup and 64%
(22%) energy-delay product (EDP) reduction from static logic at 100% (10%) data activity in 32-bit carry
lookahead adders. For 8-bit Wallace tree multiplier, CD logic achieves a similar speedup with at least
50% EDP reduction across all data activities.
ETPL
VLSI-055
A Compact Clock Generator for Heterogeneous GALS MPSoCs in 65-nm CMOS
Technology
Abstract: This paper presents an all-digital phase-locked loop (ADPLL) clock generator for globally
asynchronous locally synchronous (GALS) multiprocessor systems-on-chip (MPSoCs). With its low
power consumption of 2.7 mW and ultra small chip area of 0.0078 mm², it can be instantiated per core for
fine-grained power management like DVFS. It is based on an ADPLL providing a multiphase clock
signal from which core frequencies from 83 to 666 MHz with 50% duty cycle are generated by phase
rotation and frequency division. The clock meets the specification for DDR2/DDR3 memory interfaces.
Additionally, it provides a dedicated high-speed clock up to 4 GHz for serial network-on-chip data links.
Core frequencies can be changed arbitrarily within one clock cycle for fast dynamic frequency scaling
applications. The performance including statistical analysis of mismatch has been verified by a prototype
in 65-nm CMOS technology.
ETPL
VLSI-056
A Colpitts CMOS Quadrature VCO Using Direct Connection of Substrates for
Coupling
Abstract: A new low-phase noise low-power quadrature voltage-controlled oscillator (QVCO) using
differential Colpitts oscillator is presented. The proposed QVCO is composed of two identical current-
switching differential Colpitts VCOs in which the first core VCO is coupled to the second in an in-phase
manner, and the second core VCO is coupled to the first in an anti-phase manner. To couple the two core
VCOs, the substrates of the cross-connected transistors as well as the substrates of MOS varactors are
used, alleviating the need for any extra elements for coupling, which could add noise and increase power
dissipation. A linear (sinusoidal) analysis is presented that confirms that the proposed circuit generates
quadrature waveforms. The proposed coupling technique can be generalized to N differential Colpitts
VCOs for multiphase signal generation.
ETPL
VLSI-057
A Self-Calibrated DLL-Based Clock Generator for an Energy-Aware EISC Processor
Abstract: This paper describes a low-jitter delay-locked loop (DLL)-based clock generator for dynamic
frequency scaling in the extendable instruction set computing (EISC) processor. The DLL-based clock
generator provides the system clock with frequencies of 0.5× to 8× of the reference clock, according to
the workload of the EISC processor. The proposed analog self-calibration method and a phase detector
with an auxiliary charge pump can effectively reduce the delay mismatch between delay cells in the
voltage-controlled delay line and the static phase offset due to the current mismatch in the charge pump,
respectively. The self-calibrated output waveform exhibits 9.7 ps of RMS jitter and 73.7 ps of peak-to-
peak jitter at 120 MHz. The prototype clock generator implemented in a 0.18-μm CMOS process
occupies an active area of 0.27 mm² and consumes 15.56 mA.
ETPL
VLSI-058
Clamping Virtual Supply Voltage of Power-Gated Circuits for Active Leakage
Reduction and Gate-Oxide Reliability

Abstract: In an integrated circuit (IC) adopting a power-gating (PG) technique, the virtual supply voltage
(VVDD) is susceptible to: 1) negative-bias temperature instability (NBTI) degradation that weakens the
PG device over time and 2) temporal temperature variation that affects active leakage current (thus total
current) of the IC. The PG device is sized to guarantee a minimum VVDD level over the chip lifetime.
Thus, the NBTI degradation and the worst-case total current at high-temperature must be considered for
sizing the PG device. This leads to higher VVDD (thus active leakage power) than necessary in early chip
lifetime and/or at low temperature, negatively impacting the gate-oxide reliability of transistors. To
reduce active leakage power increase and improve the gate-oxide reliability due to these effects, we
propose two techniques that adjust the strength of a PG device based on its usage and IC's temperature at
runtime. We demonstrate the efficacy of these techniques with an experimental setup using a 32-nm
technology model in the presence of within-die spatial process and temperature variations. On an average
of 100 die samples, they can reduce dynamic and active leakage power by up to 3.7% and 10% in early
chip lifetime. Finally, these techniques also reduce the oxide failure rate by up to 5% across process
corners over a period of 7 years.
ETPL
VLSI-059
10-bit 30-MS/s SAR ADC Using a Switchback Switching Method
Abstract: This brief presents a 10-bit 30-MS/s successive-approximation-register analog-to-digital
converter (ADC) that uses a power efficient switchback switching method. With respect to the monotonic
switching method, the input common-mode voltage variation is reduced, which improves the dynamic offset
and the parasitic capacitance variation of the comparator. The proposed switchback switching method
does not consume any power at the first digital-to-analog converter switching, which can reduce the
power consumption and design effort of the reference buffer. The prototype was fabricated in a 90-nm
1P9M CMOS technology. At 1-V supply and 30 MS/s, the ADC achieves a signal-to-noise-and-distortion
ratio (SNDR) of 56.89 dB and consumes 0.98 mW, resulting in a figure-of-merit (FOM) of 57
fJ/conversion-step. The ADC core occupies an active area of only 190 × 525 μm².
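The reported figure of merit can be cross-checked with a few lines of arithmetic. The sketch below assumes the common Walden definition, FOM = P / (2^ENOB × fs) with ENOB = (SNDR - 1.76) / 6.02, and treats the 56.89 dB figure as the SNDR. The abstract does not state which FOM definition was used, and the function names are illustrative.

```python
def enob_from_sndr(sndr_db: float) -> float:
    """Effective number of bits from SNDR in dB (ideal-quantizer relation)."""
    return (sndr_db - 1.76) / 6.02


def walden_fom(power_w: float, sndr_db: float, fs_hz: float) -> float:
    """Walden figure of merit in joules per conversion step (assumed definition)."""
    return power_w / (2 ** enob_from_sndr(sndr_db) * fs_hz)


# Values quoted in the abstract: 0.98 mW, 56.89 dB, 30 MS/s.
enob = enob_from_sndr(56.89)
fom_fj = walden_fom(0.98e-3, 56.89, 30e6) * 1e15  # convert J to fJ

print(f"ENOB = {enob:.2f} bits")
print(f"FOM  = {fom_fj:.1f} fJ/conversion-step")
```

Under these assumptions the result comes out at roughly 57 fJ/conversion-step, in line with the figure quoted in the abstract.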
ETPL
VLSI-060
Spur-Reduction Frequency Synthesizer Exploiting Randomly Selected PFD
Abstract: This brief presents a low-spur phase-locked loop (PLL) system for wireless applications. The
low-spur frequency synthesizer randomizes the periodic ripples on the control voltage of the voltage-
controlled oscillator to reduce the reference spur at the output of the PLL. A novel random clock
generator is presented to perform the random selection of the phase frequency detector control for the
charge pump in locked state. The proposed frequency synthesizer was fabricated in a TSMC 0.18-μm
CMOS process. The proposed PLL achieved phase noise of -93 dBc/Hz with a 600-kHz offset frequency
and reference spurs below -72 dBc.
ETPL
VLSI-061
Gain-Enhanced Monolithic Charge Pump With Simultaneous Dynamic Gate and
Substrate Control
Abstract: This brief presents a gain-enhanced complementary metal-oxide-semiconductor (CMOS) charge
pump (CP) circuit via dynamically controlling the gate and substrate terminals of each pMOS pass
transistor. The proposed control strategy frees the CP circuit from the threshold-voltage drops, the
body effect, and the floating substrate terminals of pass devices. The on-resistance of each pass device is
also reduced to improve the gain and the power efficiency of the CP circuit. Implemented in a 0.35-μm
single n-well CMOS process, the proposed four-stage monolithic CP circuit can operate with a supply
voltage down to 0.9 V and deliver a maximum output current of about 100 μA. The proposed CP circuit
also achieves a high voltage gain of 4 with two complementary-phase nonoverlapping clock signals.
ETPL
VLSI-062
Embedding Repeaters in Silicon IPs for Cross-IP Interconnections
Abstract: During systems-on-a-chip (SoC) integration, silicon intellectual properties (IPs) are generally
regarded as blockages to long interconnections that connect different IPs. With this constraint,
conventional designs are forced to place those repeaters that drive long interconnections outside the IP.
These designs either lead to a longer interconnection distance requiring more repeaters or result in a
longer signal delay, since the interconnection wire is not appropriately segmented by the repeaters. To
solve these problems, we designed the IPs such that designers can embed the repeaters in the IP for the
SoC integration. In other words, it allows the cross-IP interconnections to be routed over the IP using
repeaters inserted in the IP. The design concept, physical implementation, and application examples of the
embedded repeaters are described in this brief.
ETPL
VLSI-063
RATS: Restoration-Aware Trace Signal Selection for Post-Silicon Validation
Abstract: Post-silicon validation is one of the most important and expensive tasks in modern integrated
circuit design methodology. The primary problem governing post-silicon validation is the limited
observability due to storage of a small number of signals in a trace buffer. The signals to be traced should
be carefully selected in order to maximize restoration of the remaining signals. Existing approaches have
two major drawbacks. They depend on partial restorability computations that are not effective in restoring
maximum signal states. They also require long signal selection time due to inefficient computation as well
as operating on gate-level netlist. We have proposed a signal selection approach based on total
restorability at gate-level, which is computationally more efficient (10 times faster) and can restore up to
three times more signals compared to existing methods. We have also developed a register transfer level
signal selection approach, which reduces both memory requirements and signal selection time by several
orders-of-magnitude.
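Restoration here means inferring the values of untraced signals from the traced ones. The sketch below illustrates the idea for a single 2-input AND gate, assuming the usual forward and backward inference rules over three-valued logic (0, 1, unknown). It is not the RATS selection algorithm itself, and all names are hypothetical.

```python
from typing import Optional, Tuple

Val = Optional[int]  # 0, 1, or None for "unknown"


def forward_and(a: Val, b: Val) -> Val:
    """Forward restoration: infer the AND output from its inputs."""
    if a == 0 or b == 0:
        return 0          # a controlling 0 fixes the output
    if a == 1 and b == 1:
        return 1
    return None           # not enough information


def backward_and(out: Val, a: Val, b: Val) -> Tuple[Val, Val]:
    """Backward restoration: infer AND inputs from the output."""
    if out == 1:
        return 1, 1       # both inputs must be 1
    if out == 0:
        if a == 1 and b is None:
            return 1, 0   # the other input must be the 0
        if b == 1 and a is None:
            return 0, 1
    return a, b           # nothing new can be inferred


print(forward_and(0, None))          # one known input is enough here
print(backward_and(1, None, None))   # output alone restores both inputs
print(backward_and(0, 1, None))      # output plus one input restores the other
```

A signal selection heuristic would score candidate trace signals by how many other states rules like these can recover across the netlist.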
ETPL
VLSI-064
Test Patterns of Multiple SIC Vectors: Theory and Application in BIST Schemes
Abstract: This paper proposes a novel test pattern generator (TPG) for built-in self-test. Our method
generates multiple single-input change (MSIC) vectors in a pattern, i.e., each vector applied to a scan
chain is an SIC vector. A reconfigurable Johnson counter and a scalable SIC counter are developed to
generate a class of minimum transition sequences. The proposed TPG is flexible to both the test-per-clock
and the test-per-scan schemes. A theory is also developed to represent and analyze the sequences and to
extract a class of MSIC sequences. Analysis results show that the produced MSIC sequences have the
favorable features of uniform distribution and low input transition density. The performance of the
designed TPGs and the circuits under test is evaluated in a 45-nm technology. Simulation results with ISCAS
benchmarks demonstrate that MSIC can save test power and impose no more than 7.5% overhead for a
scan design. It also achieves the target fault coverage without increasing the test length.
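The single-input-change property is easy to see on a plain Johnson (twisted-ring) counter: each step shifts the register and feeds back the inverted last bit, so consecutive states differ in exactly one position. The sketch below is a software model of a fixed-width Johnson counter only; the reconfigurable counter and the SIC counter of the paper are not reproduced, and the function names are illustrative.

```python
from typing import List, Tuple


def johnson_sequence(n_bits: int) -> List[Tuple[int, ...]]:
    """Return the 2*n_bits states of an n-bit Johnson counter, starting at all zeros."""
    state = [0] * n_bits
    seq = []
    for _ in range(2 * n_bits):
        seq.append(tuple(state))
        # Shift by one position and feed back the inverted last bit.
        state = [1 - state[-1]] + state[:-1]
    return seq


def hamming(a: Tuple[int, ...], b: Tuple[int, ...]) -> int:
    """Number of bit positions in which two vectors differ."""
    return sum(x != y for x, y in zip(a, b))


seq = johnson_sequence(4)
for vec in seq:
    print("".join(map(str, vec)))

# Every transition, including the wrap-around, toggles exactly one bit.
transitions = [hamming(seq[i], seq[(i + 1) % len(seq)]) for i in range(len(seq))]
print(transitions)
```

Because only one input toggles per vector, the switching activity injected into the circuit under test stays low, which is the motivation the abstract gives for SIC sequences.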

ETPL
VLSI-065
Effective and Efficient Approach for Power Reduction by Using Multi-Bit Flip-Flops
Abstract: Power has become a burning issue in modern VLSI design. In modern integrated circuits, the
power consumed by clocking gradually takes a dominant part. Given a design, we can reduce its power
consumption by replacing some flip-flops with fewer multi-bit flip-flops. However, this procedure may
affect the performance of the original circuit. Hence, the flip-flop replacement without timing and
placement capacity constraints violation becomes a quite complex problem. To deal with the difficulty
efficiently, we have proposed several techniques. First, we perform a co-ordinate transformation to
identify those flip-flops that can be merged and their legal regions. Besides, we show how to build a
combination table to enumerate possible combinations of flip-flops provided by a library. Finally, we use
a hierarchical way to merge flip-flops. Besides power reduction, the objective of minimizing the total
wirelength is also considered. The time complexity of our algorithm is $\Theta(n^{1.12})$, less than
the empirical complexity of $\Theta(n^{2})$. According to the experimental results, our algorithm
significantly reduces clock power by 20–30% and the running time is very short. In the largest test case,
which contains 1 700 000 flip-flops, our algorithm only takes about 5 min to replace flip-flops and the
power reduction reaches 21%.
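One common way to realize such a coordinate transformation, assumed here because the abstract gives no details, is a 45° rotation. A flip-flop's feasible placement region under a Manhattan-distance bound is a diamond; in the rotated coordinates u = x + y, v = y - x it becomes an axis-aligned square, so testing whether two flip-flops can be merged reduces to a rectangle intersection. The radius `r` and all function names below are hypothetical.

```python
from typing import Optional, Tuple

Rect = Tuple[float, float, float, float]  # (u_lo, u_hi, v_lo, v_hi)


def diamond_to_rect(cx: float, cy: float, r: float) -> Rect:
    """Map the diamond |x-cx| + |y-cy| <= r to a square in (u, v) = (x+y, y-x)."""
    u, v = cx + cy, cy - cx
    return (u - r, u + r, v - r, v + r)


def intersect(a: Rect, b: Rect) -> Optional[Rect]:
    """Overlap of two rotated-space rectangles, or None if they are disjoint."""
    u_lo, u_hi = max(a[0], b[0]), min(a[1], b[1])
    v_lo, v_hi = max(a[2], b[2]), min(a[3], b[3])
    if u_lo > u_hi or v_lo > v_hi:
        return None
    return (u_lo, u_hi, v_lo, v_hi)


def to_xy(u: float, v: float) -> Tuple[float, float]:
    """Inverse transform back to layout coordinates."""
    return ((u - v) / 2, (u + v) / 2)


ff_a = diamond_to_rect(0, 0, 3)
ff_b = diamond_to_rect(4, 0, 3)
ff_c = diamond_to_rect(10, 0, 3)

common = intersect(ff_a, ff_b)   # legal region shared by a and b
print(common)
print(intersect(ff_a, ff_c))     # too far apart to merge
if common is not None:
    # Center of the shared region, mapped back to (x, y).
    print(to_xy((common[0] + common[1]) / 2, (common[2] + common[3]) / 2))
```

The appeal of the rotated frame is that intersecting many regions needs only interval comparisons on u and v rather than polygon clipping.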
ETPL
VLSI-066
Reconfigurable Accelerator for the Word-Matching Stage of BLASTN
Abstract: BLAST is one of the most popular sequence analysis tools used by molecular biologists. It is
designed to efficiently find similar regions between two sequences that have biological significance.
However, because the size of genomic databases is growing rapidly, the computation time of BLAST,
when performing a complete genomic database search, is continuously increasing. Thus, there is a clear
need to accelerate this process. In this paper, we present a new approach for genomic sequence database
scanning utilizing reconfigurable field programmable gate array (FPGA)-based hardware. In order to
derive an efficient structure for BLASTN, we propose a reconfigurable architecture to accelerate the
computation of the word-matching stage. The experimental results show that the FPGA implementation
achieves a speedup around one order of magnitude compared to the NCBI BLASTN software running on
a general purpose computer.
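The word-matching stage looks for short exact matches (seeds) between the query and the database sequence. The sketch below is a minimal software version, assuming a simple hash-table lookup of all query words of length w. The word length, function names, and sequences are illustrative, and the code does not represent the paper's FPGA architecture.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_query_index(query: str, w: int) -> Dict[str, List[int]]:
    """Map every length-w word of the query to the positions where it occurs."""
    index: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index


def find_word_matches(query: str, subject: str, w: int = 11) -> List[Tuple[int, int]]:
    """Return (query_pos, subject_pos) pairs where a length-w word matches exactly."""
    index = build_query_index(query, w)
    hits = []
    for j in range(len(subject) - w + 1):
        # One lookup per subject position; .get avoids inserting new keys.
        for i in index.get(subject[j:j + w], ()):
            hits.append((i, j))
    return hits


hits = find_word_matches("ACGTACGT", "TTACGTAA", w=4)
print(hits)
```

Each hit would then be handed to the later extension stages; the scan loop over subject positions is the part that dominates run time on large databases and is therefore the natural target for hardware acceleration.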
ETPL
VLSI-067
Architecturally Homogeneous Power-Performance Heterogeneous Multicore Systems
Abstract: Dynamic voltage and frequency scaling (DVFS), a widely adopted technique to ensure safe
thermal characteristics while delivering superior energy efficiency, is rapidly becoming inefficient with
technology scaling due to two critical factors: 1) inability to scale the supply voltage due to reliability
concerns and 2) dynamic adaptations through DVFS cannot alter underlying power hungry circuit
characteristics, designed for the nominal frequency. In this paper, we show that DVFS scaled circuits
substantially lag in energy efficiency, by 22%–86%, compared to ground up designs for target frequency
levels. We propose architecturally homogeneous power-performance heterogeneous multicore systems, a
fundamentally alternate means to design energy efficient multicore systems. Using a system level
computer-aided design (CAD) approach, we seamlessly integrate architecturally identical cores, designed
for different voltage-frequency domains. We use a combination of standard cell library based CAD flow
and full system architectural simulation to demonstrate 11%–22% improvement in energy efficiency
using our design paradigm.

ETPL
VLSI-068
Active Filter-Based Hybrid On-Chip DC–DC Converter for Point-of-Load Voltage
Regulation
Abstract: An active filter-based on-chip DC–DC voltage converter for application to distributed on-chip
power supplies in multivoltage systems is described in this paper. No inductor or output capacitor is
required in the proposed converter. The area of the voltage converter is therefore significantly less than
that of a conventional low-dropout (LDO) regulator. Hence, the proposed circuit is appropriate for point-
of-load voltage regulation for noise sensitive portions of an integrated circuit. The performance of the
circuit has been verified with Cadence Spectre simulations, and the circuit has been fabricated in a commercial 110 nm
complementary metal oxide semiconductor (CMOS) technology. The area of the voltage regulator is
0.015 mm², and it delivers up to 80 mA of output current. The transient response with no output
capacitor ranges from 72 to 192 ns. The parameter sensitivity of the active filter is also described. The
advantages and disadvantages of the active filter-based, conventional switching, linear, and switched
capacitor voltage converters are compared. The proposed circuit is an alternative to classical LDO voltage
regulators, providing a means for distributing multiple local power supplies across an integrated circuit
while maintaining high current efficiency and fast response time within a small area.
ETPL
VLSI-069
CusNoC: Fast Full-Chip Custom NoC Generation
Abstract: We propose a full-chip synthesis methodology to construct custom network-on-chips
(CusNoCs) for NoC-based systems. The proposed scheme generates irregular network topologies for
application-specific designs with known communication demands. In this method, processors and the
communication architecture can be synthesized simultaneously in the floorplanning process, and thus it is
called CusNoC. CusNoC synthesizes a custom NoC in two steps. The target network topology is first generated
based on communication analysis. Processing elements are partitioned into groups such that the utility of
routers will be maximized if a router is assigned to each group. In this way, the number of routers passed
by a packet, or hops, is minimized, and so is the power consumption in the network. The final network
topology is formed by properly connecting these groups. A wirelength-aware floor planning is then
carried out to optimize circuit size as well as wirelength. Experimental results show that CusNoC
produces custom NoCs with better performance than previous methods while the computation time is
significantly shorter. This method is also more scalable, which makes it ideal for complicated systems.
ETPL
VLSI-070
Cooperating Virtual Memory and Write Buffer Management for Flash-Based Storage
Systems
Abstract: Flash memory is becoming the preferred choice of secondary storage in mobile devices and
embedded systems. The performance of Flash memory is dictated by asymmetric speeds of read and
write, limited number of erase times, and the absence of in-place updates. To improve the performance of
Flash-based storage systems, the write buffer has been provided in Flash memories recently. At the same
time, new virtual memory management strategies have been proposed in recent studies that consider the
characteristics of Flash memory. Currently, approaches on these two memory layers are considered
separately, which fail to explore the full potential of these two layers. In this paper, we propose
cooperative management schemes for virtual memory and write buffer to maximize the performance of
Flash-memory-based systems. Management on virtual memory is designed to exploit write buffer status
via reordering of the write sequences. The proposed write buffer management scheme works seamlessly
with the proposed virtual memory management scheme. Experimental results show that significant
improvement in I/O performance and reduction of the number of erase and write operations can be
achieved compared to the state-of-the-art approaches.
ETPL
VLSI-071
MDC FFT/IFFT Processor With Variable Length for MIMO-OFDM Systems
Abstract: This paper presents a multipath delay commutator (MDC)-based architecture and memory
scheduling to implement fast Fourier transform (FFT) processors for multiple input multiple output-
orthogonal frequency division multiplexing (MIMO-OFDM) systems with variable length. Based on the
MDC architecture, we propose to use radix-$N_{s}$ butterflies at each stage, where $N_{s}$ is the
number of data streams, so that there is only one butterfly needed in each stage. Consequently, a 100%
utilization rate in computational elements is achieved. Moreover, thanks to the simple control mechanism
of the MDC, we propose simple memory scheduling methods for input data and output bit/set-reversing,
which again results in a full utilization rate in memory usage. Since the memory requirements usually
dominate the die area of FFT/inverse fast Fourier transform (IFFT) processors, the proposed scheme can
effectively reduce the memory size and thus the die area as well. Furthermore, to apply the proposed
scheme in practical applications, we let $N_{s}=4$ and implement a 4-stream FFT/IFFT processor with
variable length including 2048, 1024, 512, and 128 for MIMO-OFDM systems. This processor can be
used in IEEE 802.16 WiMAX and 3GPP long term evolution applications. The processor was
implemented with a UMC 90-nm CMOS technology with a core area of 3.1 mm². The
power consumption at 40 MHz was 63.72/62.92/57.51/51.69 mW for 2048/1024/512/128-FFT,
respectively in the post-layout simulation. Finally, we analyze the complexity and performance of the
implemented processor and compare it with other processors. The results show advantages of the
proposed scheme in terms of area and power consumption.
ETPL
VLSI-072
Current-Reused 2.4-GHz Direct-Modulation Transmitter With On-Chip Automatic
Tuning
Abstract: This paper presents the design, analysis, and experimental verification of a self-calibrating
current-reused 2.4-GHz direct-modulation transmitter for short-range wireless applications. The key
contributions are the design/analysis of a stacked power amplifier (PA)/voltage-controlled oscillator
(VCO) architecture, the nonlinear frequency-dependent analysis of a Gilbert-cell-based root-mean-square
detector, and an on-chip LC-tank calibration circuit that needs no analog-to-digital converter
(ADC)/digital signal processor. The stacked architecture reduces the number of required regulators,
utilizes supply headroom effectively, and allows for an “ADC-less” calibration loop that can dynamically
tune the PA center frequency by sensing the transmitted signal. The very nature of direct-modulation
architecture obviates additional high-purity signal generators, reducing complexity and allowing online
calibration. The system was implemented in TSMC 0.18-μm CMOS, occupies 0.7 mm² (TX) + 0.1 mm²
(self-tuning), and was measured in a QFN48 package on an
FR4 PCB. Automatically correcting PA/VCO tank misalignment in this case yielded a >4-dB
increase in output power. With the automatic tuning active, the transmitter delivers a measured
output power >0 dBm to a 100-Ω differential load, and the system consumes
22.9 mA from a 1.8-V core-circuit supply.

ETPL
VLSI-073
Reconfigurable Adaptive Singular Value Decomposition Engine Design for High-
Throughput MIMO-OFDM Systems
Abstract: Singular value decomposition (SVD) is an optimal method to obtain spatial multiplexing gain in
multi-input multi-output (MIMO) channels. However, the high cost of implementation and high
decomposing latency of the SVD restricts its usage in current wireless communication applications. In
this paper, we present a complete adaptive SVD algorithm and a reconfigurable architecture for high-
throughput MIMO-orthogonal frequency division multiplexing systems. There are several proposed
architectural design techniques: reconfigurable scheme, division-free adaptive step size scheme, early
termination scheme, and data interleaving scheme. The reconfigurable scheme can support all antenna
configurations in a MIMO system. The division-free adaptive step size and early termination schemes are
used to effectively reduce the decomposing latency and improve hardware utilization. The data
interleaving scheme helps to deal with several channel matrices concurrently. Besides, we propose an
orthogonal reconstruction scheme to obtain more accurate SVD outputs, and then the system performance
will be greatly enhanced. We apply our SVD design to IEEE 802.11n applications. This design is
implemented and fabricated in UMC 90 nm 1P9M CMOS technology. The maximum operating
frequency is measured to be at 101.2 MHz, and the corresponding power dissipation is at 125 mW. The
core size is 2.17 mm² and the die size occupies 4.93 mm². The chip result shows
that the average latency is only 0.33% of the wireless local area network coherence time. Hence, the
proposed reconfigurable adaptive SVD engine design is very suitable for high-throughput wireless
communication applications.
ETPL
VLSI-074
The LUT-SR Family of Uniform Random Number Generators for FPGA Architectures
Abstract: Field-programmable gate array (FPGA) optimized random number generators (RNGs) are more
resource-efficient than software-optimized RNGs because they can take advantage of bitwise operations
and FPGA-specific features. However, it is difficult to concisely describe FPGA-optimized RNGs, so
they are not commonly used in real-world designs. This paper describes a type of FPGA RNG called a
LUT-SR RNG, which takes advantage of bitwise xor operations and the ability to turn lookup tables
(LUTs) into shift registers of varying lengths. This provides a good resource–quality balance compared to
previous FPGA-optimized generators, between the previous high-resource high-period LUT-FIFO RNGs
and low-resource low-quality LUT-OPT RNGs, with quality comparable to the best software generators.
The LUT-SR generators can also be expressed using a simple C++ algorithm contained within this paper,
allowing 60 fully-specified LUT-SR RNGs with different characteristics to be embedded in this paper,
backed up by an online set of very high speed integrated circuit hardware description language (VHDL)
generators and test benches.
ETPL
VLSI-075
Exploring the Use of Emerging Nonvolatile Memory Technologies in Future FPGAs
Abstract: As new nonvolatile memory technologies become increasingly mature, there has been a
growing interest in investigating their use in future field-programmable gate arrays (FPGAs). Similar to
existing FPGAs with embedded Flash memory, future FPGAs can embed these new nonvolatile
memories to persistently store configuration data. By comparing with prior work, we first propose a
more appropriate design style for new nonvolatile configuration data storage memory. Moreover, this
brief studies a dynamic random-access memory (DRAM)-based FPGA design strategy enabled by high-
density embedded nonvolatile memory. Existing FPGAs do not use on-chip DRAM cells for
configuration data storage mainly because DRAM self-refresh involves destructive DRAM read. This
problem can be solved, if we use embedded nonvolatile memory as primary FPGA configuration data
storage and externally refresh on-chip DRAM cells. Analysis and simulations have been carried out to
demonstrate the potential advantages of such a design strategy.
ETPL
VLSI-076
Broadside and Skewed-Load Tests Under Primary Input Constraints
Abstract: Tester limitations may impose certain constraints on the primary input vectors applicable as part
of a two-pattern test for delay faults. Under these constraints, the primary input vectors may be held
constant, or the second primary input vector of a test may be obtained by a single shift of a scan chain
relative to the first. The goal of this brief is to study the differences in achievable transition fault coverage
between various primary input constraints that are similar to the commonly used ones of holding or
shifting primary input vectors. This brief also studies the possibility of combining the constraints in order
to increase the transition fault coverage. The combination requires a fixed and circuit-independent
hardware structure similar to the case where shifting of primary input vectors is used. This study is done
using test sets that consist of both broadside and skewed-load tests in order to maximize the transition
fault coverage.
ETPL
VLSI-078
Supply Noise Suppression by Triple-Well Structure
Abstract: This brief discusses the impact of twin- and triple-well structures on power supply noise, and a
substrate model for simulating the power supply noise. We observed $V_{\mathrm{ss}}$ noise reduction by the
resistive network of the p-substrate and $V_{\mathrm{dd}}$ noise reduction by the junction capacitance of a
triple-well structure on a 90-nm test chip. Measurement results also showed that the total noise reduction
of a triple-well structure is superior to that of a twin-well structure. The measurement results correlate
well with the results obtained from the power supply noise simulation using a hierarchical resistive mesh
model. Our simulation-based verification indicates that in common CMOS design, a triple-well structure
can reduce the power supply drop by 10%–40% or the decoupling capacitance area by 5%–10%. We also
verified that supply drop sensitivity to variation of the well junction capacitance is sufficiently small and
that supply noise reduction using a triple-well structure is robust to process variation.
ETPL
VLSI-079
Software-Based Self Test Methodology for On-Line Testing of L1 Caches in
Multithreaded Multicore Architectures
Abstract: The flexibility that allows the application of different March tests is a critical requirement for
on-line testing of memory arrays. In a previous study, we have introduced a low-cost software-based self
test (SBST) program development methodology for on-line periodic testing of L1 caches that utilizes
direct cache access (DCA) instructions and exploits the native monitoring hardware available in modern
architectures. In this brief, we discuss a multithreaded optimization of this SBST methodology that
exploits the thread level parallelism of multithreaded multicore architectures in order to speed up March
test execution by elaborating the low level multiple sub-bank cache organization. The effectiveness of the
methodology and its multithreaded optimization is demonstrated on the L1 caches of OpenSPARC T1
processor. Our results showed a speedup of more than 1.7 when the multithreaded optimization is applied
and an acceptable performance overhead (less than 11%), even in intensive periodic test scenarios.
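For readers unfamiliar with March tests, the sketch below runs the well-known March C- algorithm against a small software memory model with an optional stuck-at fault. The abstract does not say which March tests were applied, and this model does not use DCA instructions or cache sub-banks; the `FaultyMemory` class and all other names are hypothetical.

```python
from typing import Dict, List, Optional


class FaultyMemory:
    """Bit-per-cell memory model with optional stuck-at faults (illustrative only)."""

    def __init__(self, size: int, stuck_at: Optional[Dict[int, int]] = None):
        self.cells = [0] * size
        self.stuck_at = stuck_at or {}

    def write(self, addr: int, value: int) -> None:
        self.cells[addr] = self.stuck_at.get(addr, value)

    def read(self, addr: int) -> int:
        return self.stuck_at.get(addr, self.cells[addr])


# March C-: (w0); up(r0,w1); up(r1,w0); down(r0,w1); down(r1,w0); (r0)
MARCH_C_MINUS = [
    ("up", [("w", 0)]),
    ("up", [("r", 0), ("w", 1)]),
    ("up", [("r", 1), ("w", 0)]),
    ("down", [("r", 0), ("w", 1)]),
    ("down", [("r", 1), ("w", 0)]),
    ("up", [("r", 0)]),
]


def run_march(mem: FaultyMemory, size: int, elements=MARCH_C_MINUS) -> List[int]:
    """Apply each March element in address order; return the failing addresses."""
    failing = set()
    for order, ops in elements:
        addrs = range(size) if order == "up" else range(size - 1, -1, -1)
        for addr in addrs:
            for op, value in ops:
                if op == "w":
                    mem.write(addr, value)
                elif mem.read(addr) != value:
                    failing.add(addr)   # read did not return the expected value
    return sorted(failing)


print(run_march(FaultyMemory(8), 8))                   # fault-free memory
print(run_march(FaultyMemory(8, stuck_at={3: 1}), 8))  # cell 3 stuck at 1
```

Since each March element is an independent sweep over addresses, splitting the address range across threads that own separate sub-banks is a natural way to parallelize it, which is the kind of speedup the abstract reports.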
