An Efficient VFI-Based NoC Architecture Using Johnson-Encoded Reconfigurable FIFOs

Amir-Mohammad Rahmani (1,2), Pasi Liljeberg (2), Juha Plosila (2), and Hannu Tenhunen (1,2)
(1) Turku Centre for Computer Science (TUCS), Turku, Finland
(2) Computer Systems Lab., Department of Information Technology, University of Turku, Finland
Email: {amir.rahmani, pasi.liljeberg, juha.plosila, hannu.tenhunen}@utu.fi

978-1-4244-8971-8/10$26.00 © 2010 IEEE

Abstract: In this paper, a Johnson-encoded Reconfigurable Synchronous/Bi-Synchronous (RSBS) FIFO is proposed which can adapt its operation to either synchronous or bi-synchronous mode. The proposed FIFO, which can be used to interface modules in Voltage/Frequency Island (VFI) based Networks-on-Chip, is capable of alleviating the excessive energy consumption and high performance overhead of conventional bi-synchronous FIFOs. The FIFO is scalable and synthesizable in synchronous standard cells. In addition, a technique for mesochronous adaptation of the proposed FIFO is presented. Our extensive experiments show significant power and performance improvements compared to non-reconfigurable architectures.

I.   INTRODUCTION

Interconnect links will impose a number of limits on complexity, reliability, and throughput in nanoscale system design. Network-on-chip (NoC) has been proposed to mitigate the ever increasing communication complexity of modern many-core system-on-chip (SoC) designs [1][2]. In addition, achieving power efficiency has become an increasingly difficult challenge, especially in the presence of increasing die sizes, high clock frequencies, and variability-driven design issues. Globally Asynchronous Locally Synchronous (GALS) [3][4] based NoCs implemented using a multiple Voltage Frequency Island (VFI) design style have become an attractive alternative to traditional designs [5]. In fact, VFI-based approaches can be used to minimize the system power dissipation under performance constraints.

Assignment of frequencies and voltages to VFIs can be done by using either offline or online methods [6]. Offline methods can be used when the behavior of an application is very predictable for various input conditions and the worst-case behavior is not very different from the average-case behavior [7]. However, such an approach is not well suited for applications that show large variations in their behavior for different input conditions. For such systems, online methods are more suitable [7][8]. Dynamic Voltage and Frequency Scaling (DVFS) schemes can be used to adapt the system to meet the performance requirements of a dynamically changing workload while minimizing power consumption.

In order to benefit from the VFI-based scheme, communication between islands should be carried out by using mixed-timing (bi-synchronous) FIFOs [9], which accommodate the clock frequency discrepancy; however, due to the overhead of implementing these FIFOs in terms of latency, area, and power consumption, the associated design complexity increases.

In this paper, we propose a novel Reconfigurable Synchronous/Bi-Synchronous (RSBS) FIFO based on Johnson-encoded pointers to mitigate the latency and power consumption overhead. In addition, since a NoC partitioned into VFIs requires a method to embed and exploit these FIFOs, we have developed a controller for switch input channels which can adaptively decide the operating mode of the FIFOs.

The remainder of this paper is organized as follows. Section II describes the related work. Section III elaborates the demands for enhancement of the existing architectures, while the architecture of the reconfigurable FIFO is presented in detail in Section IV. Section V shows the experimental results and analyzes the impact of our technique on a video conference encoding part as a case study. Finally, Section VI draws conclusions.

II.   RELATED WORK

There have been many efforts to design low-latency asynchronous communication mechanisms between synchronous blocks. Some of them include two flip-flop synchronizers or an asynchronous FIFO using Gray code [9][10], Johnson code [11], or a ring counter [12][13] for read and write pointers, while others consider stoppable clocks [3]. Our proposed FIFO is based on Johnson encoding and has two substantial advantages over the FIFO proposed in [11]. Firstly, it adds reconfigurability to bi-synchronous FIFOs to avoid their associated power and latency overhead in cases where their synchronization parts are not needed. Secondly, in addition to a register-based implementation using one-hot addressing, it supports a standard memory-based implementation addressed by normal binary code.

There have been several design efforts to combine the benefits of the GALS-based NoC interconnect mechanism with the VFI-based design style [14][15]. For instance, the authors of [16] present design methodologies for partitioning a NoC architecture into multiple VFIs and assigning frequency, supply voltage, and threshold voltage levels to each VFI according to given performance constraints at design time. On the other hand, there have been many works that propose hardware-based approaches to dynamically change the frequencies and potentially the voltages of a VFI system driven by a dynamic workload [6][8]. However, due to their high latency and power overhead, bi-synchronous FIFOs cannot be exploited freely. To the best of our knowledge, there is only one study that proposes reconfigurable FIFOs which can work in two distinct modes and accordingly offer much more
flexibility to both dynamic and static voltage/frequency assigning techniques. In [17], where the authors present a Gray-encoded reconfigurable FIFO, there are still performance and power overheads even in the best case, due to the pointer counters and the complexity of the fullness/emptiness checking logic. Moreover, the Gray-encoded design style can only support FIFO capacities that are powers of two. The FIFO presented in this paper, which is an improved extension of those reconfigurable FIFOs, is capable of circumventing the aforementioned issues. As we will describe later, these restrictions can be mitigated considerably by utilizing the proposed Johnson-encoded reconfigurable FIFOs.

III.   MOTIVATION AND CONTRIBUTION

A VFI can consist of a single PE or, depending on the physical or design considerations, may contain a group of PEs. Each VFI is assumed to have a voltage level above a certain value Vmin and, since the architecture is globally asynchronous, locally synchronous [3], each module or core is assumed to be clocked by a local ring oscillator or a central clock generator controlled by a variable intra-island supply voltage [5]. In such systems, the assignment of voltage/frequency to each island can be classified into Static and Dynamic Voltage/Frequency Assigning techniques.

In Dynamic Voltage Frequency Assigning (DVFA) techniques, each individual processing element is usually a locally synchronous module operating with its own clock, and it either is a single VFI or forms a VFI with another synchronous module. This enables dynamic voltage/frequency scaling in each synchronous module using a DC-DC voltage regulator and a central or local variable-delay ring oscillator maintaining the clock of the VFI. There is usually a discrete set of frequency and voltage levels (usually 2 to 6 levels) assigned by methods such as forecasting. For the sake of adaptivity, all cores should benefit from bi-synchronous FIFOs. Let us consider a frequent case in which adjacent cores in a NoC work at the same frequency level (e.g., they belong to the same VFI in the current timing window). In this situation, although the read and write clock signals of their FIFOs have equal frequencies, the FIFOs still pass their pointers through synchronizer blocks to assert the full and empty signals. However, these FIFOs can be informed about their equal read and write clock frequencies. A reconfigurable FIFO capable of operating in both synchronous and bi-synchronous modes can cope with this by bypassing and switching off the unused components (e.g., synchronizers, code converters), which results in considerable improvement in terms of latency, throughput, and power consumption.

In order to highlight the importance of reconfigurable FIFOs, we have embedded a simple hardware unit called Mode Selector in the input channel of a RASoC-based NoC switch [18]. As can be seen from Figure 1, this module is responsible for recognizing the equality of the write clock frequency (provided by the output channel of the adjacent switch) and the read clock frequency, and for directing the buffer to operate in synchronous or bi-synchronous mode.

The input channel module shown in Figure 1 consists of five different units: IFC (Input Flow Controller), RSBS IB (Reconfigurable Synchronous/Bi-Synchronous Input Buffer), IC (Input Controller), IRS (Input Read Switch), and Mode Selector (which is added to the basic architecture of the input channel module to support the RSBS FIFO). The RSBS IB block is a dual-mode FIFO buffer, while the IC block of each input channel performs the routing function, its IRS block receives the x_rd and x_gnt signals and triggers the rd signal of the RSBS IB block, and the IFC block implements the logic that performs the translation between the handshake and the FIFO flow control protocol. Each channel includes n bits for data and two bits for packet framing: begin-of-packet ((n+2)th bit) and end-of-packet ((n+1)th bit). The IFC, IC, and IRS modules are described in detail in [18].

Figure 1. Input channel module architecture with support of Reconfigurable Syn/Bi-Syn FIFOs

The static voltage/frequency assigning (SVFA) problem for a given component graph G(V,E), which is characterized by the set of nodes represented as V = {1, 2, … , n} and edges represented as E = {(i,j) | i precedes j}, can be stated as [8]:

Given a component graph G(V,E) comprised of a set of processes mapped on a set of processing elements (PEs), find the optimal voltage and clock frequency to be assigned statically to each PE such that the energy per operation is minimized and rate and/or latency constraints are satisfied.

After voltage and frequency have been assigned to each island at design time, the VFIs are formed, and bi-synchronous and synchronous FIFOs are employed for inter-island and intra-island communication, respectively. According to recent work in this field [7][8], a VFI-based NoC is generally partitioned into 2 to 4 islands, because if a larger number is selected, the overhead of the bi-synchronous FIFOs will diminish the power savings gained by the VFI architecture.

Designs using SVFA techniques are advantageous in the case of a system where an oracle has pre-existing knowledge of the number of run-time cycles used in each PE for processing each sample of the application under consideration. Moreover, since all of the SVFA stages are done at design time and for a specific application, such a system is not practical for other applications. For instance, a design such as a MultiProcessor System-on-Chip (MPSoC), which is typically designed to map and run multi-purpose applications, cannot benefit from the SVFA techniques as they stand, while it becomes reasonable to exploit these techniques by using the proposed RSBS FIFOs and configuring the FIFOs for the respective application after each mapping process.

In the next section, we present the architecture of the proposed RSBS FIFO, which is based on Johnson encoding. It will be seen how this simple technique can considerably reduce the overall NoC power consumption and improve latency and throughput.

IV.   JOHNSON-ENCODED RECONFIGURABLE SYNCHRONOUS/BI-SYNCHRONOUS FIFO

In this section, we present a reconfigurable FIFO design approach based on Johnson encoding and discuss its
benefits over Gray-encoded FIFOs. It should be noted that this design style is scalable and synthesizable in synchronous standard cells. The proposed RSBS FIFO is a bi-synchronous FIFO [10] able to interface two synchronous systems with independent clock frequencies. For the sake of metastability [19] avoidance and synchronization of pointers between two independent clock domains, it benefits from two synchronizers used for the write and read pointers.

As shown in Figure 2, similar to most bi-synchronous FIFOs, five typical modules compose the RSBS FIFO architecture: FIFO Memory block, sync_r2w, sync_w2r, FIFO rptr & empty, and FIFO wptr & full. The FIFO Memory block is a buffer accessed by both the write and read clock domains. This buffer is most likely an instantiated, synchronous dual-port RAM, but other memory styles can also be adapted to function as the FIFO buffer. The sync_r2w (sync_w2r) module is a synchronizer used to synchronize the read (write) pointer into the write (read) clock domain in the bi-synchronous mode. The FIFO rptr & empty block is completely synchronous to the read-clock domain and contains the FIFO read pointer and empty-flag logic. Similarly, the FIFO wptr & full block is completely synchronous to the write-clock domain and contains the FIFO write pointer and full-flag logic.

In the proposed design style, to provide reconfigurability, we have exploited two multiplexers and two flip-flops to bypass unused components in the synchronous mode. For this purpose, we added the Syn/Bi-Syn_Mode signal indicating the operation mode of the FIFO (synchronous or bi-synchronous). Before describing the main function of the RSBS FIFO, let us first focus on the properties of Johnson encoding [11] for the FIFO read and write pointers and the internal structure of the FIFO wptr & full (FIFO rptr & empty) block.

Figure 2. Reconfigurable Syn/Bi-Syn FIFO architecture

As discussed earlier, Gray code presents some limitations in terms of implementation complexity. The first limitation is that Gray code allows encoding only "power of two" ranges, while the FIFO size may be optimal at a value that is not a power of two. The second limitation is that, in contrast to binary encoding with the full-adder standard cell, there is no elementary logical operator to perform an addition in Gray encoding. Hence, the increment of the pointers needs to be hardwired at the cost of more area and lower performance. In order to cope with this issue, we use Johnson encoding for the read and write pointers.

Johnson encoding is another code with a Hamming distance of 1 between consecutive elements, which allows a safe synchronization of the pointers. To implement the sequence, bits are chained in series as in a shift register, and the loop is closed using an inverter, so that the least significant bit is implemented as the negation of the most significant bit. To differentiate FIFO fullness from emptiness, a parity bit is added to the binary pointers, virtually doubling the addressing range of the pointers [20]. This parity method is extensible to both Gray and Johnson encodings, but for Johnson encoding it is simpler because of the twisted-ring sequence. When Johnson encoding is used, the buffer is empty if write_pointer = read_pointer, and it is full if write_pointer = NOT read_pointer.

The architecture of the FIFO wptr & full (FIFO rptr & empty) block is shown in Figure 3. This module consists of a Johnson-encoded register that generates the n-bit pointer to be synchronized into the opposite clock domain. In addition, it uses a Johnson-to-binary converter and another register (Binary register) to address the FIFO memory directly without the need to translate memory addresses, as well as one Full (Empty) Detector block to check the fullness (emptiness) of the FIFO.

Figure 3. FIFO wptr & full block diagram

The main target of the proposed RSBS FIFO is to bypass and switch off the unused components and to have a simple synchronous FIFO (without the synchronizers and other unused components) in the synchronous mode. In the proposed design style, once the FIFO receives the command to operate in the synchronous mode via the Syn/Bi-Syn_Mode signal, the blocks in Figure 2 highlighted with gray circles are removed from the FIFO path and switched off. Since in the synchronous mode it is not necessary to synchronize the pointers into opposite clock domains, we bypass the synchronizers and produce the empty and full flags using unsynchronized Johnson-encoded pointers. When the mode changes to bi-synchronous, the disabled blocks are turned on again.

As a result of bypassing the synchronizers in both the full and empty detection stages, the FIFO shows considerable latency, throughput, and power improvement when it operates in the synchronous mode; hence it is an applicable FIFO architecture for DVFA techniques. It should be emphasized that in NoC systems which rely on SVFA techniques, the positions of the islands are fixed. These NoC systems are not appropriate for mapping various applications at different times. Therefore, it is
desirable to have synchronous FIFOs for intra-island communication which do not have extra latency and power consumption overhead. As a result, if the area overheads of the inactive components are acceptable for the system, exploiting the proposed RSBS FIFO in SVFA-based systems is quite efficient.

In some cases, it is not possible to exploit a central clock generator for NoC-based systems (specifically for DVFA-based ones). In these situations, each node has its own clock generator (phase-locked loop), and the FIFO architecture should be adapted to interface mesochronous clock domains, where the sender and the receiver have the same clock frequency but different phases. To this end, the RMBS (Reconfigurable Mesochronous/Bi-Synchronous) FIFO should be utilized instead of the RSBS one. The phase difference can be constant or slowly varying. According to [19], metastability can be avoided when the rising edges of the clock signals are predictable, and the two registers in the synchronizer can be reduced to a single register.

Figure 4. Synchronization part of the Reconfigurable Meso/Bi-Syn FIFO architecture

As an example, Figure 4 shows the proposed design, which we have modified for correct emptiness and fullness detection in the mesochronous mode. In this case, two registers are added and clocked using a delayed version of the read/write clock in the mesochronous mode. This delay must be chosen so that data are exchanged without metastable situations. The delay can be a programmable delay or any other metastability-free solution, for example the Chakraborty-Greenstreet [21] architecture, which also allows the FIFO to work on plesiochronous (small difference of frequency) clocks. Likewise, if the write and read clocks are out of phase by 90°, 180°, or 270°, no programmable delay is needed because, by construction, the communication is free of metastability. Although the RMBS FIFO is not as efficient as the RSBS FIFO, it still improves the FIFO throughput, and in each mode, if the synchronizers used for the other mode are turned off, unnecessary power consumption is prevented.

V.   ANALYSIS AND CASE STUDY

We have simulated the reconfigurable FIFO to characterize its latency, throughput, area, and power consumption. Note that, to observe the power efficiency of the FIFO, we have employed it in a NoC-based MPEG system as a case study and compared it to a similar system using conventional bi-synchronous FIFOs.

In the case of latency analysis, as the sender and the receiver have different clock signals, the latency of the FIFO depends on the relation between these two signals. The latency can be decomposed into two parts: the state machine latency and the synchronization latency. As the state machine is designed using a Moore automaton, its latency is one clock cycle. In the bi-synchronous mode, s registers compose the synchronizers and the latency is ΔT plus one clock cycle, where ΔT is the difference, in time, between the rising edges of the sender and receiver clocks. As this difference is between zero and one Clk_read clock cycle, the latency of the RSBS FIFO is between s and s+1 Clk_read clock cycles in the bi-synchronous mode. In the synchronous mode, there is no difference between Clk_read and Clk_write and there is no synchronizer, and hence data can be fetched by the receiver on the next rising/falling edge of Clk_read.

In order to evaluate the throughput, the RSBS FIFOs should be analyzed for each operation mode as a function of the FIFO depth. For this FIFO in the bi-synchronous and mesochronous modes, as the synchronizers add latency, the performance of the flow control of the FIFO is penalized. In the case of a deep FIFO, those latencies do not decrease the FIFO throughput, since the buffered data compensate for the latency of the flow control. When the FIFO operates in the synchronous mode, the minimum FIFO depth required to provide maximum throughput decreases because there is no need for synchronizers. Table 1 shows the minimum FIFO depth for 50% and 100% throughput as a function of the clocking mode. Note that for the bi-synchronous mode analysis, the write and read clock frequencies are equal; otherwise it is not possible to obtain 100% throughput.

Table 1. Minimum FIFO depth as a function of the clock relation and required throughput

Mode            Minimum depth for 50% throughput    Minimum depth for 100% throughput
Bi-syn. Mode    3                                   6
Meso. Mode      2                                   4
Syn. Mode       1                                   2

The area of the FIFOs was computed after synthesis on CMOS 90nm GPLVT STMicroelectronics standard cells using Synopsys Design Compiler. Different FIFO depths are used to illustrate the scalability of the architecture. Table 2 shows the area of the 16 and 32-bit Gray-encoded RSBS, Gray-encoded RMBS, Johnson-encoded RSBS, and Johnson-encoded RMBS FIFOs as a function of the FIFO depth.

Table 2. Area and overhead comparison between the Johnson-encoded and Gray-encoded design styles and the baseline design (areas in µm²; column headings give FIFO depth × data width in bits)

Style                          4×16    4×32    8×16    8×32
Gray-encoded RSBS FIFO [17]    3434    5888    6354    11415
Gray-encoded RMBS FIFO [17]    3530    5983    6509    11570
Johnson-encoded RSBS FIFO      3331    5795    6267    11332
Johnson-encoded RMBS FIFO      3420    5877    6416    11463
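
Table 2 lists absolute areas only. As a reading aid, the relative area difference between the Johnson-encoded and Gray-encoded styles can be derived from those values. The following is a minimal Python sketch of that arithmetic; the dictionary layout and the function names (relative_saving, savings_table) are illustrative assumptions and not part of the original design flow, while the numbers are copied from Table 2.

```python
# Area values in square micrometres, copied from Table 2.
# Keys of the inner dictionaries are "depth x data width" (hypothetical naming).
AREA_UM2 = {
    "Gray RSBS": {"4x16": 3434, "4x32": 5888, "8x16": 6354, "8x32": 11415},
    "Gray RMBS": {"4x16": 3530, "4x32": 5983, "8x16": 6509, "8x32": 11570},
    "Johnson RSBS": {"4x16": 3331, "4x32": 5795, "8x16": 6267, "8x32": 11332},
    "Johnson RMBS": {"4x16": 3420, "4x32": 5877, "8x16": 6416, "8x32": 11463},
}


def relative_saving(gray_area, johnson_area):
    """Percentage area reduction of the Johnson style relative to the Gray style."""
    return 100.0 * (gray_area - johnson_area) / gray_area


def savings_table(kind):
    """Per-configuration saving (in percent, rounded to 2 decimals) for 'RSBS' or 'RMBS'."""
    gray = AREA_UM2["Gray " + kind]
    johnson = AREA_UM2["Johnson " + kind]
    return {cfg: round(relative_saving(gray[cfg], johnson[cfg]), 2) for cfg in gray}


if __name__ == "__main__":
    for kind in ("RSBS", "RMBS"):
        print(kind, savings_table(kind))
```

For these table values, the Johnson-encoded variants come out roughly 0.7% to 3.1% smaller than their Gray-encoded counterparts, with the largest relative difference at the smallest configuration (4×16).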
We apply the proposed RSBS-FIFO-based switch to the NoC-based MPEG-4 decoder described in [22] and compare it with a similar system which does not benefit from the reconfigurable FIFOs. The MPEG-4 decoder system is modeled and mapped on a 5×3 NoC. In the system, each node has a 5×5 crossbar switch. Since MPEG videos show a lot of variability in processing time depending on the type of frame being processed, we perform prediction-based dynamic voltage/frequency assigning on each node based on the DVFA algorithm proposed in [8]. The prediction decision is taken at the start of processing of a new macroblock at each node, and for each input channel we add a synchronous/bi-synchronous mode selector unit. The simulation is performed for three different frequency sets having 2, 4, and 6 frequency levels. We assume that the switches are clocked by a central clock generator block; therefore they do not need the mesochronous adaptation.

Figure 5 shows the average power saving percentage of the NoC switches achieved by exploiting the Johnson-encoded RSBS FIFOs instead of the conventional bi-synchronous FIFOs. The comparison is made between the RSBS FIFO and its baseline counterpart [10] for three different frequency sets and two different data widths. As the results show, we obtain savings of around 5.2% to 17% over the baseline architecture.

REFERENCES

[2] L. Benini and G. D. Micheli, “Networks on chips: a new SOC paradigm,” IEEE Computer, Vol. 35, No. 1, 2002, pp. 70-78.
[3] D. M. Chapiro, “Globally asynchronous locally synchronous systems,” Ph.D. dissertation, Dept. Comput. Sci., Stanford University, Stanford, CA, 1984.
[4] J. Muttersbach et al., “Practical design of globally asynchronous locally synchronous systems,” in Proc. of Int. Symp. on Advanced Research in Asynchronous Circuits and Systems, 2000, pp. 52-59.
[5] D. E. Lackey et al., “Managing power and performance for system-on-chip designs using voltage islands,” in Proc. of IEEE/ACM Int. Conf. on Computer Aided Design, 2002, pp. 195-202.
[6] P. Choudhary and D. Marculescu, “Power Management of Voltage/Frequency Island-Based Systems Using Hardware-Based Methods,” IEEE Transactions on VLSI Systems, Vol. 17, No. 3, 2009, pp. 427-438.
[7] U. Y. Ogras, R. Marculescu, D. Marculescu, and E. G. Jung, “Design and Management of Voltage-Frequency Island Partitioned Networks-on-Chip,” IEEE Transactions on VLSI Systems, Vol. 17, No. 3, 2009, pp. 330-341.
[8] K. Niyogi and D. Marculescu, “Speed and voltage selection for gals systems based on voltage/frequency islands,” in Proc. of ACM/IEEE Asian-South Pacific Design Automation Conf., 2005, pp. 292-297.
[9] T. Chelcea and S. M. Nowick, “Robust interfaces for mixed-timing systems,” IEEE Transactions on VLSI Systems, Vol. 12, No. 8, 2004, pp. 857-873.
[10] C. Cummings and P. Alfke, “Simulation and synthesis techniques for asynchronous FIFO design with asynchronous pointer comparison,” in SNUG-2002, San Jose, CA, 2002.
[11] Y. Thonnart et al., “Design and Implementation of a GALS Adapter for ANoC based Architectures,” in Proc. of Int. Symp. on Advanced Research in Asynchronous Circuits and Systems, 2009, pp. 13-22.
[12] T. Ono and M. Greenstreet, “A Modular Synchronizing FIFO for NoCs,” in Proc. of Int. Symp. on Networks-on-Chip, 2009, pp. 224-233.
[13] I. Panades and A. Greiner, “Bi-synchronous FIFO for synchronous circuit communication well suited for network-on-chip in GALS architectures,” in Proc. of Int. Symp. on Networks-on-Chip, 2007, pp. 83-94.
[14] J. Quartana, S. Renane, A. Baixas, L. Fesquet, and M. Renaudin, “GALS systems prototyping using multiclock FPGAs and asynchronous network-on-chips,” in Proc. of Int. Conf. on Field Programmable Logic and Applications, 2005, pp. 299-304.
[15] G. Campobello, M. Castano, C. Ciofi, and D. Mangano, “GALS networks on chip: A new solution for asynchronous delay-
  Figure 5. Average power savings for the NoC switches used in               insensitive links,” in Proc. of Design, Automation and Test in
                       MPEG Encoder                                          Europe Conf., 2006, pp. 1–6.
                                                                      [16]   C.-L. Chou et al., “Energy- and Performance-Aware Incremental
                                                                             Mapping for Networks on Chip With Multiple Voltage Levels,”
              I.   SUMMARY AND CONCLUSION                                    IEEE Transactions on CAD, Vol. 27, No. 10, 2008, pp. 1866-1879.
   In this paper, a Johnson-Encoded Reconfigurable                    [17]   A. -M. Rahmani et al., “Power and Performance Optimization of
Synchronous/Bi-Synchronous FIFO was proposed which                           Voltage/Frequency      Island-Based     Networks-on-Chip      Using
                                                                             Reconfigurable Synchronous/Bi-Synchronous FIFOs,” in Proc. of
can operate in either synchronous or bi-synchronous mode.                    ACM International Conference on Computing Frontiers, 2010, pp.
The FIFO addresses the synchronization power and latency                     267-276.
overhead in the case that adjacent switches in the NoC                [18]   C. A. Zeferino, M. E. Kreutz, and A. A. Susin, “RASoC: A router
system operate in the same clock frequency but suffer from                   soft-core for networks-on-chip,” in Proc. of Design, Automation and
unnecessary     synchronizations.     A      technique   for                 Test in Europe Conf., 2004, pp. 198–203.
mesochronous adaptation of the FIFOs has been suggested.              [19]   F. Mu and C. Svensson, “Self-tested self-synchronization circuit for
Our results revealed that compared to a non-reconfigurable                   mesochronous clocking,” in IEEE Transactions on Circuits and
                                                                             Systems-II, Vol. 48, No. 2, 2001, pp. 129-140.
system architecture, the Johnson-Encoded RSBS FIFO can
                                                                      [20]   R. Apperson et al., “A Scalable Dual-Clock FIFO for Data
help to achieve considerable savings in average power                        Transfers Between Arbitrary and Haltable Clock Domains”, in IEEE
consumption of NoC switches and to improve the total                         Transactions on VLSI Systems, Vol. 15, No. 10, 2007, pp 1125-
average packet latency significantly in the case of a MPEG-                  1134.
4 encoder application.                                                [21]   A. Chakraborty and M. R. Greenstreet, “Efficient self-timed
                                                                             interfaces for crossing clock domains,” in Proc. of 9th IEEE Int.
                         REFERENCES                                          Symp. on Asynchronous Circuits and Systems, 2003, pp. 78-88.
[1]   A. Jantsch and H. Tenhunen. Networks on Chip. Kluwer Academic   [22]   E. B. Van der Tol and E.G.T. Jaspers, “Mapping of MPEG-4
      Publishers, 2003.                                                      Decoding on a Flexible Architecture Platform,” SPIE 2002, pp. 1-
                                                                             13.
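
As an illustrative aside, the decision taken by the per-channel mode
selector in the case study can be sketched in a few lines of Python.
This is a minimal behavioral sketch, not the hardware described in the
paper. The names Mode, select_mode and select_channel_modes, the integer
frequency-level indices, and the port labels are all hypothetical. It
also assumes, as in the experimental setup, that all switches are driven
by a central clock generator, so equal frequency levels mean no
mesochronous adaptation is needed.

```python
from enum import Enum
from typing import Dict


class Mode(Enum):
    """Operating mode of one RSBS FIFO (labels are illustrative)."""
    SYNCHRONOUS = "syn"
    BI_SYNCHRONOUS = "bi-syn"


def select_mode(write_level: int, read_level: int) -> Mode:
    """Choose the FIFO mode for a single input channel.

    write_level: frequency-level index of the upstream (writing) switch.
    read_level:  frequency-level index of the local (reading) switch.
    Assumption: equal indices mean both sides share the same clock, so the
    synchronizers can be bypassed and the FIFO runs in synchronous mode.
    """
    if write_level == read_level:
        return Mode.SYNCHRONOUS
    return Mode.BI_SYNCHRONOUS


def select_channel_modes(local_level: int,
                         neighbor_levels: Dict[str, int]) -> Dict[str, Mode]:
    """Apply select_mode to every input channel of one switch.

    neighbor_levels maps a (hypothetical) port name to the frequency level
    of the adjacent switch that writes into that input channel.
    """
    return {port: select_mode(level, local_level)
            for port, level in neighbor_levels.items()}
```

For instance, a switch running at level 2 with a north neighbor at level
2 and an east neighbor at level 0 would, under these assumptions, set its
north input FIFO to synchronous mode and its east input FIFO to
bi-synchronous mode.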

26

  • 1. An Efficient VFI-Based NoC Architecture Using Johnson-Encoded Reconfigurable FIFOs Amir-Mohammad Rahmani1,2, Pasi Liljeberg2, Juha Plosila2, and Hannu Tenhunen1,2 1 Turku Centre for Computer Science (TUCS), Turku, Finland 2 Computer Systems Lab., Department of Information Technology, University of Turku, Finland Email: {amir.rahmani, pasi.liljeberg, juha.plosila, hannu.tenhunen}@utu.fi Abstract— In this paper, a Johnson-encoded Reconfigurable In this paper, we propose a novel Reconfigurable Synchronous/Bi-Synchronous (RSBS) FIFO is proposed Synchronous/Bi-Synchronous (RSBS) FIFO based on which can adapt its operation to either synchronous or bi- Johnson-encoded pointers to mitigate the latency and power synchronous mode. The proposed FIFO which can be used to consumption overhead. In addition, since the NoC interface modules in Voltage/Frequency Islands (VFI) based partitioned into VFIs requires a method to embed and Networks-on-chip, is capable of alleviating the excessive exploit these FIFOs, we have developed a controller for energy consumption and high performance overhead of the switch input channels which can adaptively decide the conventional bi-synchronous FIFOs. The FIFO is scalable operating mode of the FIFOs. and synthesizable in synchronous standard cells. In addition, The remainder of this paper is organized as follows. a technique for mesochronous adaptation of the proposed Section II describes the related work. Section III elaborates FIFO is presented. Our extensive experiments show the demands for enhancement of the existing architectures, significant power and performance improvements compared while the architecture of the reconfigurable FIFO is to non-reconfigurable architectures. presented in detail in Section IV. Section V shows the experimental results and analyzes the impact of our I. INTRODUCTION technique on a video conference encoding part as a case Interconnect links will impose a number of limits to study. 
Finally, Section VI draws conclusions. complexity, reliability, and throughput in nanoscale system design. Network-on-chip (NoC) has been proposed to II. RELATED WORK mitigate the ever increasing communication complexity of modern many-core system-on-chip (SoC) designs [1][2]. In There have been many efforts to design low latency addition, achieving power efficiency has become an asynchronous communication mechanisms between increasingly difficult challenge, especially in the presence synchronous blocks. Some of them include two flip-flop of increasing die sizes, high clock frequencies and synchronizers or an asynchronous FIFO using Gray code variability driven design issues. Globally Asynchronous [9][10], Johnson code [11], or a ring counter [12][13] for Locally Synchronous [3][4] (GALS) -based NoCs read and write pointers, while others consider stoppable implemented using a multiple Voltage Frequency Island clocks [3]. Our proposed FIFO is based on Johnson (VFI) design style have become an attractive alternative to encoding and it has two substantial advantages over the traditional designs [5]. In fact, VFI-based approaches could proposed FIFO in [11]. Firstly, it devotes reconfigurability be used for minimizing the system power dissipation under to bi-synchronous FIFOs to prevent their associated power performance constraints. and latency overhead in such cases that their Assignment of frequencies and voltages to VFIs can be synchronization parts are not needed. Secondly, in addition done by using either offline or online methods [6]. Offline to register based implementation using one-hot addressing, it supports standard memory based implementation methods can be used when the behavior of an application is addressed by normal binary code. 
very predictable for various input conditions and the worst- There have been several design efforts to combine the case behavior is not very different from the average-case benefits of the GALS-based NoC interconnect mechanism behavior [7]. However, such an approach is not well-suited for applications that show large variations in their behavior with VFI-based design style [14][15]. For instance the for different input conditions. For such systems, online authors of [16], present design methodologies for methods are more suitable [7][8]. Dynamic Voltage and partitioning an NoC architecture into multiple VFIs and Frequency Scaling (DVFS) schemes can be used to adapt assigning frequency, supply voltage, and threshold the system to meet the performance requirements of a voltage levels to each VFI according to given dynamically changing workload while minimizing power performance constraints at design time. On the other hand, consumption. there have been many works that propose hardware-based In order to benefit from the VFI-based scheme, approaches to dynamically change the frequencies and communication between islands should be carried out by potentially voltages of a VFI system driven by a dynamic using mixed-timing (bi-synchronous) FIFOs [9] which workload [6][8]. However, due to the high latency and adapt clock frequency discrepancy; however, due to the power overhead, there is a substantial limitation to freely overhead in implementing these FIFOs in terms of latency, exploit the bi-synchronous FIFOs. To the best of our area, and power consumption, the associated design knowledge, there is only one study targeting to propose complexity increases. reconfigurable FIFOs in which FIFOs can work in two distinct modes and accordingly devote much more 978-1-4244-8971-8/10$26.00 c 2010 IEEE
  • 2. flexibility to the both dynamic and static voltage/frequency and Mode Selector (which is added to the basic architecture assigning techniques. In [17], for the best case in which the of input channel module to support RSBS FIFO). The authors presents a Gray-encoded reconfigurable FIFO, there RSBS IB block is a dual mode FIFO buffer, while the IC are still performance and power overheads due to existence block of each input channel performs the routing function, of pointer counters and complexity of fullness/emptiness its IRS block receives x_rd and x_gnt signals and triggers checking logic. Moreover, the gray-encoded design style the rd signal of the RSBS IB block, and the IFC block can only support FIFO capacities that are powers of two. implements the logic that performs the translation between The presented FIFO in this paper, which is an improved the handshake and the FIFO flow control protocol. Each extension of those reconfigurable FIFOs, is capable of channel includes n bits for data and two bits for packet circumventing the aforementioned issues. As we will framing: begin-of-packet ((n+2)th bit), and end-of-packet describe later, these restrictions could be mitigated ((n+1)th bit). IFC, IC, and IRS modules are described in considerably by utilizing the proposed Johnson-encoded detail in [18]. reconfigurable FIFOs. III. MOTIVATION AND CONTRIBUTION A VFI can consist of a single PE or, depending on the physical or design considerations, may contain a group of PEs. Each VFI is assumed to have a voltage level above a certain value Vmin and, since the architecture is globally asynchronous, locally synchronous [3], each module or core is assumed to be clocked by local ring oscillator or a central clock generator controlled by a variable intra-island Figure 1. Input channel module architecture with support of Reconfigurable Syn/Bi-Syn FIFOs supply voltage [5]. 
In such systems, the assignment of voltage/frequency to each island can be classified into The static voltage/frequency assigning problem for a given Static and Dynamic Voltage/Frequency Assigning component graph G(V,E) which is characterized by the set of techniques. nodes represented as V = {1, 2, … , n} and edges represented In Dynamic Voltage Frequency Assigning (DVFA) as E = {(i,j) | i precedes j} can be stated as [8]: techniques, usually each individual processing element is a Given a component graph G(V,E) comprised of a set of locally synchronous module operating with its own clock processes mapped on a set of processing elements (PEs), find and either being a single VFI or forming a VFI with another the optimal voltage and clock frequency to be assigned synchronous module. This enables dynamic statically to each PE such that the energy per operation is voltage/frequency scaling in each synchronous module minimized and rate and/or latency constraints are satisfied. using a DC-DC voltage regulator and a central or local variable delay ring oscillator maintaining the clock of the After voltage and frequency assigning for each island at VFI. There are usually a discrete set of frequency and design time, then the VFIs are formed and for inter-island and voltage levels (usually 2 to 6 levels) assigned by some intra-island communications, bi-synchronous and synchronous methods such as forecasting. For the sake of adaptivity, all FIFOs are respectively employed. According to the recent work in this field [7][8], a VFI-based NoC is generally cores should benefit from bi-synchronous FIFOs. Let us partitioned into 2 to 4 islands because if a larger number is consider a frequent case in which adjacent cores in a NoC selected, the overhead of the bi-synchronous FIFOs will work at a same frequency level (e.g., they belong to same diminish the power savings gained by VFI architecture. VFI in current timing window). 
In this situation, despite Designs using SVFA techniques are advantageous in the both read and write clock signals of their FIFOs have equal case of a system where an oracle has pre-existing knowledge frequencies, they are still synchronized by passing through of the number of run time cycles used in each PE for synchronizer blocks for asserting full and empty signals. processing each sample of the application under consideration. However, these FIFOs can be informed about their equal Moreover, since all of the SVFA stages are done at design time read and write clock frequencies. A reconfigurable FIFO and for a specific application, this system is not practical for being capable of operating in both synchronous and bi- other applications. For instance, such a design as synchronous modes can cope with this by bypassing and MultiProcessor System-on-Chip (MPSoC) which is typically switching off the unused components (e.g., synchronizers, designed to be mapped and run multi-purpose applications code converters) and result to considerable improvement in cannot benefit from the SVFA techniques, while it can be terms of latency, throughput, and power consumption. logical to exploit these techniques by using the proposed RSBS In order to highlight the importance of reconfigurable FIFOs and configuring the FIFOs for respective application FIFOs, we have embedded a simple hardware called Mode after each mapping process. Selector in the input channel of a RASoC-based NoC In the next section, we present the architecture of the switch [18]. As can be seen from Figure 1, this module is proposed RSBS FIFO which is based on Johnson-Encoding. It responsible for recognizing the equality of write (provided can be seen how this simple technique can astonishingly optimize the overall NoC power consumption as well as by output channel of the adjacent switch) and read clock latency and throughput. frequencies and directing the buffer to operate in synchronous or bi-synchronous mode. 
The input channel module shown in Figure 1 consists of IV. JOHNSON-ENCODED RECONFIGURABLE five different units: IFC (Input Flow Controller), RSBS IB SYNCHRONOUS/BI-SYNCHRONOUS FIFO (Reconfigurable Synchronous/Bi-Synchronous Input In this section, we present a reconfigurable FIFO design Buffer), IC (Input Controller), IRS (Input Read Switch), approach based on Johnson encoding and discuss its
benefits over Gray-encoded FIFOs. It should be noted that this design style is scalable and synthesizable in synchronous standard cells. The proposed RSBS FIFO is a bi-synchronous FIFO [10] able to interface two synchronous systems with independent clock frequencies. To avoid metastability [19] and to synchronize the pointers between the two independent clock domains, it uses two synchronizers, one for the write pointer and one for the read pointer.

As shown in Figure 2, and as in most bi-synchronous FIFOs, the RSBS FIFO architecture is composed of five modules: FIFO Memory, sync_r2w, sync_w2r, FIFO rptr & empty, and FIFO wptr & full. The FIFO Memory block is a buffer accessed by both the write and read clock domains. This buffer is most likely an instantiated, synchronous dual-port RAM, but other memory styles can also be adapted to function as the FIFO buffer. The sync_r2w (sync_w2r) module is a synchronizer used to synchronize the read (write) pointer into the write (read) clock domain in the bi-synchronous mode. The FIFO rptr & empty block is completely synchronous to the read-clock domain and contains the FIFO read pointer and empty-flag logic. Similarly, the FIFO wptr & full block is completely synchronous to the write-clock domain and contains the FIFO write pointer and full-flag logic.

In the proposed design style, to provide reconfigurability, we use two multiplexers and two flip-flops to bypass unused components in the synchronous mode. For this purpose, we added the Syn/Bi-Syn_Mode signal, which indicates the operation mode of the FIFO (synchronous or bi-synchronous). Before describing the main function of the RSBS FIFO, let us first focus on the properties of Johnson encoding [11] for the FIFO read and write pointers and on the internal structure of the FIFO wptr & full (FIFO rptr & empty) block.

Figure 2. Reconfigurable Syn/Bi-Syn FIFO architecture

As discussed earlier, Gray code presents some limitations in terms of implementation complexity. The first is that Gray code can encode only power-of-two ranges, while the optimal FIFO size may not be a power of two. The second is that, in contrast to binary encoding with its full-adder standard cell, there is no elementary logic operator that performs addition in Gray encoding. Hence, the pointer increment needs to be hardwired at the cost of more area and lower performance. To cope with this issue, we use Johnson encoding for the read and write pointers.

Johnson encoding is another code with a Hamming distance of 1 between consecutive elements, which allows safe synchronization of the pointers. To implement the sequence, the bits are chained in series as in a shift register, and the loop is closed using an inverter, so that the least significant bit is implemented as the negation of the most significant bit. To distinguish a full FIFO from an empty one, a parity bit is added to the binary pointers, virtually doubling their addressing range [20]. This parity method can be extended to both Gray and Johnson encodings, but it is simpler for Johnson encoding because of the twisted-ring sequence. When Johnson encoding is used, the buffer is empty if write_pointer = read_pointer, and it is full if write_pointer = NOT read_pointer.

The architecture of the FIFO wptr & full (FIFO rptr & empty) block is shown in Figure 3. This module consists of a Johnson-encoded register that generates the n-bit pointer to be synchronized into the opposite clock domain. In addition, it uses a Johnson-to-binary converter and another register (the Binary register) to address the FIFO memory directly, without the need to translate memory addresses, as well as one Full (Empty) Detector block to check the fullness (emptiness) of the FIFO.

Figure 3. FIFO wptr & full block diagram

The main goal of the proposed RSBS FIFO is to bypass and switch off the unused components so that it behaves as a simple synchronous FIFO (without the synchronizers) in the synchronous mode. In the proposed design style, once the FIFO receives the command to operate in the synchronous mode via the Syn/Bi-Syn_Mode signal, the blocks highlighted with gray circles in Figure 2 are removed from the FIFO path and switched off. Since in the synchronous mode it is not necessary to synchronize the pointers into the opposite clock domains, we bypass the synchronizers and produce the empty and full flags from the unsynchronized Johnson-encoded pointers. When the mode changes back to bi-synchronous, the disabled blocks are turned on again.

As a result of bypassing the synchronizers in both the full and empty detection stages, the FIFO shows considerable latency, throughput, and power improvements when it operates in the synchronous mode; it is therefore well suited to DVFA techniques. It should be emphasized that in NoC systems which use SVFA techniques, the position of the islands is constant. These NoC systems are not appropriate to be mapped for various applications at different times. Therefore, it is
desirable to have such synchronous FIFOs for intra-island communication which do not have extra latency and power consumption overhead. As a result, if the area overhead of the inactive components is acceptable for the system, exploiting the proposed RSBS FIFO in SVFA-based systems is quite efficient.

In some cases, it is not possible to use a central clock generator for NoC-based systems (specifically for DVFA-based ones). In these situations, each node has its own clock generator (phase-locked loop), and the FIFO architecture should be adapted to interface mesochronous clock domains, where the sender and the receiver have the same clock frequency but different phases. To this end, the RMBS (Reconfigurable Mesochronous/Bi-Synchronous) FIFO should be used instead of the RSBS one. The phase difference can be constant or slowly varying. According to [19], metastability can be avoided when the rising edges of the clock signals are predictable, and the two registers in the synchronizer can be reduced to a single register.

Figure 4. Synchronization part of the Reconfigurable Meso/Bi-Syn FIFO architecture

As an example, Figure 4 shows the proposed design, which we have modified for correct emptiness and fullness detection in the mesochronous mode. In this case, two registers are added and clocked using a delayed version of the read/write clock in the mesochronous mode. This delay must be chosen so that data is exchanged without metastable situations. The delay can be a programmable delay or any other metastability-free solution, such as the Chakraborty-Greenstreet [21] architecture, which also allows the FIFO to work with plesiochronous clocks (clocks with a small frequency difference). Likewise, if the write and read clocks are out of phase by 90°, 180°, or 270°, no programmable delay is needed because, by construction, the communication is free of metastability. Although the RMBS FIFO is not as efficient as the RSBS FIFO, it still improves the FIFO throughput, and in each mode, if the synchronizers used for the other mode are turned off, unnecessary power consumption is prevented.

V. ANALYSIS AND CASE STUDY

We have simulated the reconfigurable FIFO to characterize its latency, throughput, area, and power consumption. Note that, to observe the power efficiency of the FIFO, we have employed it in a NoC-based MPEG system as a case study and compared it to a similar system using conventional bi-synchronous FIFOs.

For the latency analysis, as the sender and the receiver have different clock signals, the latency of the FIFO depends on the relation between these two signals. The latency can be decomposed into two parts: the state machine latency and the synchronization latency. As the state machine is designed using a Moore automaton, its latency is one clock cycle. In the bi-synchronous mode, s registers compose the synchronizers and the latency is ΔT plus one clock cycle, where ΔT is the difference, in time, between the rising edges of the sender and receiver clocks. As this difference is between zero and one Clk_read clock cycle, the latency of the RSBS FIFO is between s and s+1 Clk_read clock cycles in the bi-synchronous mode. In the synchronous mode, there is no difference between Clk_read and Clk_write and there is no synchronizer, and hence data can be fetched by the receiver on the next rising/falling edge of Clk_read.

In order to evaluate the throughput, the RSBS FIFO should be analyzed for each operation mode as a function of the FIFO depth. In the bi-synchronous and mesochronous modes, as the synchronizers add latency, the performance of the flow control of the FIFO is penalized. In the case of a deep FIFO, those latencies do not decrease the FIFO throughput, since the buffered data compensates for the latency of the flow control. When the FIFO operates in the synchronous mode, the minimum FIFO depth required to provide maximum throughput decreases because there is no need for synchronizers. Table 1 shows the minimum FIFO depth for 50% and 100% throughput as a function of the clocking mode. Note that for the bi-synchronous mode analysis, the write and read clock frequencies are equal; otherwise it is not possible to obtain 100% throughput.

Table 1. Minimum FIFO depth as a function of the clock relation and required throughput

Mode           Minimum depth for 50% throughput   Minimum depth for 100% throughput
Bi-syn. Mode   3                                  6
Meso. Mode     2                                  4
Syn. Mode      1                                  2

The area of the FIFOs was computed after synthesis on CMOS 90nm GPLVT STMicroelectronics standard cells using Synopsys Design Compiler. Different FIFO depths are used to illustrate the scalability of the architecture. Table 2 shows the area of the 16-bit and 32-bit Gray-encoded RSBS, Gray-encoded RMBS, Johnson-encoded RSBS, and Johnson-encoded RMBS FIFOs as a function of the FIFO depth.

Table 2. Area and overhead comparison between the Johnson-encoded and Gray-encoded design styles and the baseline design

Style                            4×16 (µm2)   4×32 (µm2)   8×16 (µm2)   8×32 (µm2)
Gray-encoded RSBS FIFO [17]      3434         5888         6354         11415
Gray-encoded RMBS FIFO [17]      3530         5983         6509         11570
Johnson-encoded RSBS FIFO        3331         5795         6267         11332
Johnson-encoded RMBS FIFO        3420         5877         6416         11463
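To make the Johnson-encoded pointer scheme behind the FIFOs compared above more concrete, the following Python sketch models the twisted-ring pointer update and the empty/full tests described in Section IV. It is a behavioral illustration only, not the authors' hardware description: the function names (johnson_next, johnson_to_index, is_empty, is_full) are hypothetical, the shift direction is chosen so that the new least significant bit is the negation of the old most significant bit, and the sketch assumes that an n-bit pointer serves a FIFO of depth n, since the complement of a Johnson code word lies exactly n steps ahead in the 2n-state sequence.

```python
def johnson_next(ptr: int, n: int) -> int:
    """Advance an n-bit Johnson (twisted-ring) pointer by one position."""
    mask = (1 << n) - 1
    msb = (ptr >> (n - 1)) & 1
    # Shift like a shift register; the new LSB is the negated old MSB.
    return ((ptr << 1) & mask) | (msb ^ 1)


def johnson_to_index(ptr: int, n: int) -> int:
    """Return the position (0 .. 2n-1) of a valid Johnson code word."""
    ones = bin(ptr).count("1")
    # Words with the LSB set belong to the half where ones are shifted in;
    # the remaining words belong to the half where zeros are shifted in.
    return ones if ptr & 1 else (2 * n - ones) % (2 * n)


def is_empty(wptr: int, rptr: int) -> bool:
    """The buffer is empty when both pointers hold the same code word."""
    return wptr == rptr


def is_full(wptr: int, rptr: int, n: int) -> bool:
    """The buffer is full when the write pointer is the bitwise NOT of the read pointer."""
    return wptr == (~rptr & ((1 << n) - 1))


if __name__ == "__main__":
    n = 4                      # assumed pointer width and FIFO depth
    wptr = rptr = 0
    for _ in range(n):         # write n words without reading
        wptr = johnson_next(wptr, n)
    print(format(wptr, "04b"), is_full(wptr, rptr, n))
```

In this model, johnson_to_index(ptr, n) % n plays the role of the Johnson-to-binary converter that produces the memory address, and each step changes exactly one bit, which is the property that makes the pointer safe to sample in the opposite clock domain.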
We apply the proposed RSBS-FIFO-based switch to the NoC-based MPEG-4 decoder described in [22] and compare it with a similar system which does not benefit from the reconfigurable FIFOs. The MPEG-4 decoder system is modeled and mapped on a 5×3 NoC. In the system, each node has a 5×5 crossbar switch. Since MPEG videos show a lot of variability in processing time depending on the type of frame being processed, we perform prediction-based dynamic voltage/frequency assignment on each node based on the DVFA algorithm proposed in [8]. The prediction decision is taken at the start of processing of a new macroblock at each node, and for each input channel we add a synchronous/bi-synchronous mode selector unit. The simulation is performed for three different frequency sets having 2, 4, and 6 frequency levels. We assume that the switches are clocked by a central clock generator block; therefore they do not need the mesochronous adaptation.

Figure 5 shows the average power saving percentage of the NoC switches achieved by using the Johnson-encoded RSBS FIFOs instead of the conventional bi-synchronous FIFOs. The comparison is made between the RSBS FIFO and its baseline counterpart [10] for three different frequency sets and two different data widths. As the results show, we obtain savings of around 5.2-17% over the baseline architecture.

Figure 5. Average power savings for the NoC switches used in the MPEG-4 decoder

VI. SUMMARY AND CONCLUSION

In this paper, a Johnson-encoded Reconfigurable Synchronous/Bi-Synchronous FIFO was proposed which can operate in either synchronous or bi-synchronous mode. The FIFO addresses the synchronization power and latency overhead that arises when adjacent switches in the NoC system operate at the same clock frequency but still suffer from unnecessary synchronization. A technique for mesochronous adaptation of the FIFOs has been suggested. Our results revealed that, compared to a non-reconfigurable system architecture, the Johnson-encoded RSBS FIFO can help to achieve considerable savings in the average power consumption of NoC switches and to improve the total average packet latency significantly in the case of an MPEG-4 decoder application.

REFERENCES

[1] A. Jantsch and H. Tenhunen, Networks on Chip. Kluwer Academic Publishers, 2003.
[2] L. Benini and G. De Micheli, "Networks on chips: a new SOC paradigm," IEEE Computer, Vol. 35, No. 1, 2002, pp. 70-78.
[3] D. M. Chapiro, "Globally asynchronous locally synchronous systems," Ph.D. dissertation, Dept. Comput. Sci., Stanford University, Stanford, CA, 1984.
[4] J. Muttersbach et al., "Practical design of globally asynchronous locally synchronous systems," in Proc. of Int. Symp. on Advanced Research in Asynchronous Circuits and Systems, 2000, pp. 52-59.
[5] D. E. Lackey et al., "Managing power and performance for system-on-chip designs using voltage islands," in Proc. of IEEE/ACM Int. Conf. on Computer Aided Design, 2002, pp. 195-202.
[6] P. Choudhary and D. Marculescu, "Power Management of Voltage/Frequency Island-Based Systems Using Hardware-Based Methods," IEEE Transactions on VLSI Systems, Vol. 17, No. 3, 2009, pp. 427-438.
[7] U. Y. Ogras, R. Marculescu, D. Marculescu, and E. G. Jung, "Design and Management of Voltage-Frequency Island Partitioned Networks-on-Chip," IEEE Transactions on VLSI Systems, Vol. 17, No. 3, 2009, pp. 330-341.
[8] K. Niyogi and D. Marculescu, "Speed and voltage selection for GALS systems based on voltage/frequency islands," in Proc. of ACM/IEEE Asian-South Pacific Design Automation Conf., 2005, pp. 292-297.
[9] T. Chelcea and S. M. Nowick, "Robust interfaces for mixed-timing systems," IEEE Transactions on VLSI Systems, Vol. 12, No. 8, 2004, pp. 857-873.
[10] C. Cummings and P. Alfke, "Simulation and synthesis techniques for asynchronous FIFO design with asynchronous pointer comparison," in SNUG-2002, San Jose, CA, 2002.
[11] Y. Thonnart et al., "Design and Implementation of a GALS Adapter for ANoC based Architectures," in Proc. of Int. Symp. on Advanced Research in Asynchronous Circuits and Systems, 2009, pp. 13-22.
[12] T. Ono and M. Greenstreet, "A Modular Synchronizing FIFO for NoCs," in Proc. of Int. Symp. on Networks-on-Chip, 2009, pp. 224-233.
[13] I. Panades and A. Greiner, "Bi-synchronous FIFO for synchronous circuit communication well suited for network-on-chip in GALS architectures," in Proc. of Int. Symp. on Networks-on-Chip, 2007, pp. 83-94.
[14] J. Quartana, S. Renane, A. Baixas, L. Fesquet, and M. Renaudin, "GALS systems prototyping using multiclock FPGAs and asynchronous network-on-chips," in Proc. of Int. Conf. on Field Programmable Logic and Applications, 2005, pp. 299-304.
[15] G. Campobello, M. Castano, C. Ciofi, and D. Mangano, "GALS networks on chip: A new solution for asynchronous delay-insensitive links," in Proc. of Design, Automation and Test in Europe Conf., 2006, pp. 1-6.
[16] C.-L. Chou et al., "Energy- and Performance-Aware Incremental Mapping for Networks on Chip With Multiple Voltage Levels," IEEE Transactions on CAD, Vol. 27, No. 10, 2008, pp. 1866-1879.
[17] A.-M. Rahmani et al., "Power and Performance Optimization of Voltage/Frequency Island-Based Networks-on-Chip Using Reconfigurable Synchronous/Bi-Synchronous FIFOs," in Proc. of ACM International Conference on Computing Frontiers, 2010, pp. 267-276.
[18] C. A. Zeferino, M. E. Kreutz, and A. A. Susin, "RASoC: A router soft-core for networks-on-chip," in Proc. of Design, Automation and Test in Europe Conf., 2004, pp. 198-203.
[19] F. Mu and C. Svensson, "Self-tested self-synchronization circuit for mesochronous clocking," IEEE Transactions on Circuits and Systems-II, Vol. 48, No. 2, 2001, pp. 129-140.
[20] R. Apperson et al., "A Scalable Dual-Clock FIFO for Data Transfers Between Arbitrary and Haltable Clock Domains," IEEE Transactions on VLSI Systems, Vol. 15, No. 10, 2007, pp. 1125-1134.
[21] A. Chakraborty and M. R. Greenstreet, "Efficient self-timed interfaces for crossing clock domains," in Proc. of 9th IEEE Int. Symp. on Asynchronous Circuits and Systems, 2003, pp. 78-88.
[22] E. B. Van der Tol and E. G. T. Jaspers, "Mapping of MPEG-4 Decoding on a Flexible Architecture Platform," SPIE 2002, pp. 1-13.