Ultra-Low Power and High Speed Design and Implementation of AES and SHA1 Hardware Cores in 65 Nanometer CMOS Technology

Feng Ge, Pranjal Jain and Ken Choi
Department of Electrical and Computer Engineering
Illinois Institute of Technology
Email: {fge2, pjain13, kchoi12}@iit.edu

Abstract— This paper describes the design and implementation of low-power, high-speed security hardware cores for the Advanced Encryption Standard (AES) and the Secure Hash Algorithm (SHA1). We propose three Register Transfer Level (RTL) circuit techniques, namely Application Specific Register Reduction (ASRR), Locally Explicit Clock Enabling (LECE), and Bus Specific Clock (BSC). LECE and BSC can be applied directly in any ASIC design flow and at any technology node. Using an industrial 65 nanometer technology, we demonstrate at RTL and gate level that, for AES, the proposed schemes achieve a 44.57% reduction in total power (dynamic and cell leakage power), a 10.43% area reduction, and 5.78 Gbps throughput at a circuit speed of 452 MHz; for SHA1, they achieve a 63.26% total power reduction and a 12.72% area reduction at a circuit speed of 1.28 GHz.

I. INTRODUCTION

As the demand for secure communications increases, high-throughput, low-power encryption and decryption on both wired and wireless networks is becoming ever more necessary. Today, providing security is one of the major concerns, especially in the design of wireless systems such as wireless sensor networks and RFID.

In general, such applications require scalable, ultra-low power, and low-cost architectures [3, 11]. Security systems form the backbone of such sensor networks and must provide protection against threats to data integrity, eavesdropping, and impersonation. The main aim, therefore, is to implement cryptographic algorithms such as the Advanced Encryption Standard (AES) [1, 2] and the Secure Hash Algorithm (SHA1) [10] efficiently, with low power, high throughput, and low area.

Conventionally, research has focused mainly on pipelined and loop-unrolled AES designs [5, 6]. Other work has implemented full-custom AES S-Boxes and AES ASIC designs with varying data paths [7]. Round keys are generated on the fly, either by sharing the S-Box with the main data path [7] or by dedicating an S-Box to key expansion. For SHA1, architectures have been explored in feedback modes of operation. These references thus focus mainly on area-efficient implementation or on increasing throughput through architectural reconfiguration [4, 8, 9].

Traditionally, the power dissipation of VLSI chips was neglected: device density and operating frequency were low enough that power was not a constraining factor. As the scale of integration improves, more transistors, faster and smaller than their predecessors, are being packed into a chip. This leads to steady growth in operating frequency and processing capacity per chip, and hence to increased power dissipation. Nowadays, power-aware design techniques at early stages of the design abstraction hierarchy, such as the register transfer level (RTL), are receiving more attention.

In this paper, we implement ultra-low power AES and SHA1 hardware cores with an emphasis on RTL power reduction techniques, so that system designers can easily implement very low-power, high-performance security systems for CMOS fabrication by using our soft IP at RTL at an early stage of the ASIC design flow.

II. BACKGROUND

A. AES Algorithm

AES [1, 2] is a symmetric cipher that processes data in 128-bit blocks. It supports key sizes of 128, 192, and 256 bits, which use 10, 12, and 14 rounds, respectively. Each round mixes the data with a round key, which is generated from the encryption key. We consider only 128-bit keys and 10 rounds.

The cipher maintains an internal 4x4 matrix of bytes, called the state, on which operations are performed. Initially, the state is filled with the input data block and XORed with the encryption key. Regular rounds consist of the operations SubBytes, ShiftRows, MixColumns, and AddRoundKey, as shown in Figure 1 (a). The last round bypasses MixColumns.

(1) SubBytes: The SubBytes transformation is a nonlinear substitution that operates on bytes. Each byte of the input state is replaced using the same substitution function, called the S-Box. The S-Box is defined as the multiplicative inverse in the Galois Field $\mathrm{GF}(2^8)$ with the irreducible polynomial $m(x) = x^8 + x^4 + x^3 + x + 1$, followed by an affine transformation. The InvSubBytes transformation, which is needed for decryption, is the inverse of the affine transformation followed by the same inversion as in SubBytes, as shown in Figure 1 (b).

(2) ShiftRows: The ShiftRows transformation rotates each row of the input state to the left, with a rotation offset equal to the row number. Its inverse, InvShiftRows, performs the corresponding rotations to the right, as shown in Figure 1 (b).
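The two byte-level transformations above can be illustrated in software. The Python sketch below is our own, not the authors' hardware: it computes each S-Box entry as the $\mathrm{GF}(2^8)$ inverse modulo $m(x)$ (0x11B) followed by the standard FIPS-197 affine transformation (rotations plus the constant 0x63, which the text above does not spell out), and it stores the state as a list of four rows. The helper names gf_mul, gf_inv, sbox_entry, shift_rows, and inv_shift_rows are hypothetical.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo m(x) = x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:          # degree reached 8: reduce by m(x)
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse in GF(2^8); by convention 0 maps to 0."""
    if a == 0:
        return 0
    # The multiplicative group has order 255, so a^254 = a^-1 (square-and-multiply).
    result, power, exp = 1, a, 254
    while exp:
        if exp & 1:
            result = gf_mul(result, power)
        power = gf_mul(power, power)
        exp >>= 1
    return result


def rotl8(x: int, n: int) -> int:
    """Rotate a byte left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox_entry(a: int) -> int:
    """SubBytes for one byte: inversion followed by the affine transformation."""
    b = gf_inv(a)
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63


SBOX = [sbox_entry(a) for a in range(256)]
INV_SBOX = [0] * 256               # InvSubBytes table, obtained by inverting SBOX
for i, s in enumerate(SBOX):
    INV_SBOX[s] = i


def shift_rows(state):
    """Rotate row r of a 4x4 state (list of 4 rows) left by r bytes."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def inv_shift_rows(state):
    """Rotate row r right by r bytes, undoing shift_rows."""
    return [row[-r:] + row[:-r] if r else row[:] for r, row in enumerate(state)]
```

For instance, SBOX[0x53] evaluates to 0xED, and inv_shift_rows(shift_rows(s)) returns s for any 4x4 state s.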

978-1-4244-3355-1/09/$25.00 ©2009 IEEE




   Authorized licensed use limited to: Illinois Institute of Technology. Downloaded on September 27, 2009 at 21:32 from IEEE Xplore. Restrictions apply.
[Figure 1(a): Encryption, from Plaintext to Ciphertext: initial AddRoundKey with Round Key 0; Nr−1 rounds of SubBytes, ShiftRows, MixColumns, and AddRoundKey (Round Keys 1 to Nr−1); final round of SubBytes, ShiftRows, and AddRoundKey with Round Key Nr. Figure 1(b): Decryption, from Ciphertext to Plaintext: the inverse operations InvSubBytes, InvShiftRows, InvMixColumns, and AddRoundKey, with round keys applied from Nr down to 0.]
Figure 1: AES Algorithm

(3) MixColumns: The MixColumns transformation maps each column of the input state to a new column in the output state. Each input column is considered as a polynomial over $\mathrm{GF}(2^8)$ and multiplied, modulo $x^4 - 1$, with the constant polynomial

$$a(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}.$$

The coefficients of $a(x)$ are also elements of $\mathrm{GF}(2^8)$ and are represented by hexadecimal values in this equation. The InvMixColumns transformation multiplies each column by

$$a^{-1}(x) = \{0B\}x^3 + \{0D\}x^2 + \{09\}x + \{0E\}$$

modulo $x^4 - 1$, as shown in Figure 1 (b).

(4) AddRoundKey: The AddRoundKey transformation is self-inverting. It maps a 128-bit input state to a 128-bit output state by XORing the input state with a 128-bit round key (see Figure 1).

B. SHA1 Algorithm:
The algorithm takes as input a message with a maximum length of less than $2^{64}$ bits and produces as output a 160-bit message digest, as shown in Figure 2. The input is processed in 512-bit blocks. Processing consists of the following steps:

(1) Padding: The purpose of message padding is to make the total length of the padded message congruent to 448 modulo 512 (length ≡ 448 mod 512). The number of padding bits is between 1 and 512. Padding consists of a single 1-bit followed by the necessary number of 0-bits.

(2) Appending Length: A 64-bit binary representation of the original message length is appended to the end of the message.

(3) Initialize the SHA1 buffer: The 160-bit buffer is represented by five 32-bit registers (A, B, C, D, E) used to store the intermediate and final results of the SHA1 message digest. They are initialized to the following hexadecimal values, with the most significant byte of each word listed first:
Word A: 67 45 23 01
Word B: EF CD AB 89
Word C: 98 BA DC FE
Word D: 10 32 54 76
Word E: C3 D2 E1 F0

Figure 2. Secure Hash Algorithm (SHA1) Algorithm

(4) Process message in 16-word blocks: The heart of the algorithm is a module that consists of four rounds of 20 steps each. These rounds take as input the current 512-bit block and the 160-bit buffer value (A, B, C, D, E), and then update the buffer. The four rounds have a similar structure, but each uses a different primitive logical function. These logical functions are defined as follows:

$$
f(B, C, D) =
\begin{cases}
(B \land C) \lor (\lnot B \land D) & 0 \le t \le 19 \\
B \oplus C \oplus D & 20 \le t \le 39 \\
(B \land C) \lor (B \land D) \lor (C \land D) & 40 \le t \le 59 \\
B \oplus C \oplus D & 60 \le t \le 79
\end{cases}
$$

Each round also makes use of an additive constant $K_t$, whose hexadecimal values are:

$$
K_t =
\begin{cases}
\mathtt{5A827999} & 0 \le t \le 19 \\
\mathtt{6ED9EBA1} & 20 \le t \le 39 \\
\mathtt{8F1BBCDC} & 40 \le t \le 59 \\
\mathtt{CA62C1D6} & 60 \le t \le 79
\end{cases}
$$

III. PROPOSED APPROACHES AND IMPLEMENTATIONS

We have implemented both AES and SHA1 at RTL using the following three low-power techniques and synthesized them. Performance is reported in terms of power, area, speed, and throughput at both RTL and gate level:
   A) Application Specific Register Reduction (ASRR)
   B) Locally Explicit Clock Enabling (LECE)
   C) Bus Specific Clock (BSC)

A. Application Specific Register Reduction (ASRR):
Figure 3 illustrates our implementation of the decryption part of the AES core. The AES takes a 128-bit data block as input and performs several different transformations on this block. AES encryption and decryption are based on four different transformations that are performed repeatedly in a
certain sequence, as shown in Figure 1. Each of these transformations, which are described in Section II, maps a 128-bit input state to a 128-bit output state.

Figure 3. Application Specific Register Reduction (ASRR)

For AES-128 encryption, the 128-bit cipher key needs to be expanded into eleven 128-bit round keys. The principle of this key expansion is that the first round key, $\mathrm{RoundKey}(k_0)$, is the cipher key itself, and every subsequent round key is derived from its predecessor using a function $f$, i.e., $\mathrm{RoundKey}(k_i) = f(\mathrm{RoundKey}(k_{i-1}))$ for all $0 < i < 11$. For AES-128 decryption, the same round keys are used in reverse order. Using the inverse of the key expansion function, $f^{-1}$, the round keys can be derived recursively from $\mathrm{RoundKey}(k_{10})$ and stored in a Key Reverse Buffer, using just 6 registers instead of 10.

In the AddRoundKey step, a new sub-key is generated from the previous sub-key. The key generation schedule is shown in Figure 4. Depending on the number of rounds, 10, 12, or 14 sub-keys are involved in encryption; we have implemented the generation of 10 sub-keys.

Figure 4. Key Generation Block

The decryption process is the reverse of encryption, and the sub-keys are used in reverse order. The conventional implementation generates the last key with the encryption Key Expansion Module and then uses a reverse Key Expansion Module to generate each sub-key in reverse order, as shown in Figure 5. However, this method requires a large extra circuit and a large S-box. In this paper, we propose a novel way to reduce the number of registers substantially: all sub-keys are generated in the encryption Key Expansion Module and stored in registers or RAMs before decryption begins.

Figure 5. Original Key Reverse Buffer

In our proposed architecture, we share as much as possible with the encryption circuit, and the number of registers can be reduced as shown in Figure 6. Sub-key $K_i$ is generated and stored in $\mathrm{Reg}_i$ at the $i$-th clock cycle, where $i$ ranges from 1 to 11. Because each of these 11 registers is used only once during decryption, their number can be reduced to 6. Sub-keys are stored into the registers from the 5th clock cycle onward, and sub-keys K0 to K4 are generated and stored into registers after decryption begins. The multiplexers in front of the registers are controlled by the decryption-begin signal "de".

Figure 6. ASRR for the Key Reverse Buffer

The timing sequence of the registers is shown in Figure 7. At the fifth clock cycle, we store key K5 in R0; at the next clock cycle, K6 in R1; and so on until K10 is stored in R5. Decryption then starts, and in its first clock cycle we use K10, previously stored in R5, while K1 is generated and stored in R5. In the next cycle, we use K9, previously stored in R4, while K2 is generated and stored in R4. We repeat this operation until K6, previously stored in R1, is used at the fourth clock cycle of decryption and K4 is generated and stored in R1. With this mechanism, we save five 128-bit registers; we call this scheme ASRR (Application Specific Register Reduction).
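The register reuse described above can be checked with a small symbolic model. The Python sketch below is our own reconstruction under stated assumptions, not the authors' RTL: keys are tracked by index only, K5 to K10 are assumed to sit in R0 to R5 when decryption starts, each register freed during the first decryption cycles is refilled with a regenerated low-index key (K1, K2, ...), and K0 is assumed to be the cipher key held outside the buffer. The function name asrr_key_order is hypothetical, and the exact cycle in which each key is consumed in Figure 7 may differ from this model.

```python
from typing import List, Optional, Tuple


def asrr_key_order(num_regs: int = 6, num_round_keys: int = 11) -> Tuple[List[int], int]:
    """Symbolic model of an ASRR-style key reverse buffer (hypothetical reconstruction).

    Returns the order in which round-key indices are consumed during decryption
    and the number of 128-bit registers the buffer uses.
    """
    last = num_round_keys - 1               # K10 for AES-128
    first_buffered = last - num_regs + 1    # K5: lowest key stored before decryption
    # Before decryption, the forward expansion has left K5..K10 in R0..R5.
    regs: List[Optional[int]] = list(range(first_buffered, last + 1))
    next_regen = 1                          # forward expansion restarts at K1 once "de" is set
    consumed: List[int] = []

    # Phase 1: read K10..K5 from R5..R0. Each freed register is refilled with
    # the next regenerated low-index key while one is still missing.
    for r in reversed(range(num_regs)):
        consumed.append(regs[r])
        if next_regen < first_buffered:
            regs[r] = next_regen
            next_regen += 1
        else:
            regs[r] = None

    # Phase 2: read the regenerated keys K4..K1, then the cipher key K0.
    for k in range(first_buffered - 1, 0, -1):
        r = regs.index(k)
        consumed.append(k)
        regs[r] = None
    consumed.append(0)
    return consumed, len(regs)
```

With the AES-128 defaults, asrr_key_order() consumes the keys in the order K10, K9, ..., K0 while never using more than 6 registers, i.e., five 128-bit registers fewer than an 11-entry buffer.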
Figure 7. Timing and Waveform View for the Proposed ASRR Scheme

B. Locally Explicit Clock Enabling (LECE):
In general, RTL code whose output depends on some initial condition synthesizes into a flip-flop with a MUX in its feedback path. We remove this feedback MUX by implementing a gated clock. LECE differs from traditional clock gating in two ways: i) traditional clock gating inserts clock-gating cells globally during synthesis, based on a maximum fanout number and a maximum bus width, so it is far from optimal; and ii) LECE judiciously examines the clock and enable signals and then determines which registers should be clock gated to minimize total power, i.e., dynamic plus leakage power. We have applied this technique mainly in the Key Expansion Unit and the Key Reverse Buffer block of the AES decryption module.

Among its several functions, the control block of the AES core keeps track of the number of rounds and of the sub-keys generated by the key expansion unit. Since we consider a 128-bit key, it must count to 10. Consider Figure 8 (a), in which the output 'kcnt' is updated on a rising edge of 'clk', but only when 'kld' or 'kb_ld' is high. If the enable signal is low for a significant fraction of circuit operation, and 'D = 10' and 'kcnt' are multi-bit buses (as they are here), then a substantial amount of the power dissipated by the clock driver is wasted. We have implemented a technique that gates the clock and thus reduces this power dissipation by a significant percentage.

As shown in Figure 8 (b), we replace the clock input of the flip-flop with an AND gate whose inputs are the clock and the enable signal 'EN = kld | kb_ld'. A latch ensures that, while the clock is high, no activity on the enable propagates to the clock input. We implemented the technique at RTL, which yields the new module shown in Figure 8 (b).

[Figure 8 (a): register 'kcnt' clocked by Clk, with Kcnt − 1 in the feedback path. Figure 8 (b): Clk gated by the latched enable E (EN = kld or kb_ld); data input D = 10 or Kcnt − 1; output kcnt.]
Figure 8. Implementation of Locally Explicit Clock Enabling (LECE): (a) 1 bit of the initial control block; (b) after implementing LECE.

C. Bus Specific Clock (BSC):
The schematic and timing diagram in Figure 9 show a register whose data input is active during only one phase of operation and does not change for long periods of time. The main goal of this technique is first to find buses in the design that have low switching activity; then, if we can create a clock-enable signal by detecting changes on such a bus, we can save power.

Figure 9. Data Bus Specific Clock

In the AES algorithm, a potential candidate resides inside the Key Expansion Unit. To generate the sub-key Roundkey[i], we XOR the previous key, Roundkey[i-1], with Rcon[i] and the SubWord output.

Figure 10. RCON Implementation

Here Rcon[i] is a 32-bit bus whose output 'out[31:0]' takes the values 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, and 0x36 for the 10 rounds, respectively. Thus, the 24-bit LSB bus out[23:0] is used only infrequently. Figure 11 (a) shows that out[31:0] (the data) is active for a very short time while the clock is applied continuously. This results in substantial power dissipation in the clock driver as well as in the circuitry inside the register. We can avoid this bottleneck by constructing an enable signal that detects changes on the bus, as shown in Figure 11 (b). We XOR the next state of each bit with its previous value to check whether they are the same, and an N-bit OR then determines whether any bit has changed. If no bit has changed, there is no point in enabling the clock. The latch is used to avoid glitches at the AND output; otherwise, a spurious clock pulse could reach the register, which is undesirable.
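Both gating schemes can be compared with a cycle-level model. The Python sketch below is ours, not the authors' RTL, and rests on stated assumptions: each function counts the clock edges that actually reach the register, the glitch-suppressing latch is abstracted away, and the Rcon trace (each round constant held for four cycles) is hypothetical. mux_register models the original flip-flop with a feedback MUX, gated_register models LECE (edges only when EN is high), and bsc_enables builds the BSC enable as a bitwise XOR of the next and stored values followed by an OR-reduction; all these names are illustrative.

```python
from typing import List, Sequence, Tuple


def xtime(b: int) -> int:
    """Multiply a byte by x in GF(2^8), reducing by m(x) = 0x11B."""
    b <<= 1
    return b ^ 0x11B if b & 0x100 else b


def rcon_bytes(rounds: int = 10) -> List[int]:
    """AES-128 round-constant bytes: 0x01, 0x02, ..., 0x80, 0x1B, 0x36."""
    values, r = [], 0x01
    for _ in range(rounds):
        values.append(r)
        r = xtime(r)
    return values


def mux_register(d: Sequence[int], en: Sequence[bool], q0: int = 0) -> Tuple[List[int], int]:
    """Ungated register with a feedback MUX: clocked every cycle, holds q when EN is low."""
    q, trace = q0, []
    for di, ei in zip(d, en):
        q = di if ei else q
        trace.append(q)
    return trace, len(trace)            # one clock edge per cycle


def gated_register(d: Sequence[int], en: Sequence[bool], q0: int = 0) -> Tuple[List[int], int]:
    """Clock-gated register: a clock edge reaches the flip-flop only when EN is high."""
    q, trace, edges = q0, [], 0
    for di, ei in zip(d, en):
        if ei:
            q = di
            edges += 1
        trace.append(q)
    return trace, edges


def bsc_enables(d: Sequence[int], q0: int = 0) -> List[bool]:
    """BSC enable: bitwise XOR of next and stored value, then an N-bit OR.

    When enabled the register captures d, and when disabled d equals the
    stored value, so the stored value on cycle i is always d[i-1] (or q0).
    """
    stored, en = q0, []
    for di in d:
        en.append((di ^ stored) != 0)
        stored = di
    return en


# Hypothetical trace: each Rcon byte stays on the bus for four cycles.
bus = [v for v in rcon_bytes() for _ in range(4)]
free_trace, free_edges = mux_register(bus, [True] * len(bus))
gated_trace, gated_edges = gated_register(bus, bsc_enables(bus))
```

In this trace both registers hold the same value in every cycle, but the BSC-gated one receives 10 clock edges instead of 40. Feeding gated_register the LECE enable EN = kld | kb_ld instead gives the same outputs as mux_register with that enable, with one edge per asserted cycle.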

[Figure 11 (a) and (b): schematic and waveforms of the 32-bit RCON register (D[31:0] to Out[31:0], bits D[31] ... D[0]) clocked by Clk; panel (a) marks the clock activity while the data is idle as "Wasted", and panel (b) adds an enable (en) derived from changes on D[23:0].]
Figure 11. Implementation of Bus Specific Clock (BSC): (a) 32-bit initial RCON block; (b) after implementing BSC.

Table 2. Gate-level Power Dissipation Comparisons for AES (after synthesis with 65 nm tech.)

AES RTL                     Static (leakage) power           Dynamic power                     Total
                            Internal   Clock      Total      Internal   Clock      Total      power
Original                    5.72uW     74.4nW     5.79uW     14.9mW     2.99mW     17.8mW     17.9mW
Power reduction techniques:
  Register Reduction        5.41uW     59.5nW     5.47uW     13.5mW     2.38mW     15.8mW     15.8mW
  Explicit Clock Enable     5.42uW     70.5nW     5.49uW     8.91mW     999uW      9.91mW     9.92mW
  Bus Specific Clock        5.79uW     73.8nW     5.86uW     14.7mW     2.93mW     17.7mW     17.7mW
  All three combined        5.31uW     56.2nW     5.37uW     9.12mW     916uW      10mW       10mW
  LECE & BSC                5.49uW     69.9nW     5.56uW     8.77mW     947uW      9.72mW     9.73mW

                                                                                                      Table 3. Gate-level Area Comparisons for AES (after synthesis with
                             IV. SIMULATION RESULTS                                                                              65 nm tech.)
                                                                                                                                                                                             Area
  We designed and implemented the AES and the SHA1                                                               AES GATE
                                                                                                                                                    Combinational                     Sequential                        Total
                                                                                                                                                                                                               (Min Inverter Area: 1.08)
core in Verilog at the RTL and synthesized it to the gate                                                           Original                        58352.609375                    23870.750000                    82222.562500
level using a 65 nm, 1.0 Volt, standard-cell CMOS                                                         Traditional Clock Gating                  58340.003906                    18760.798828                    77100.484375

technology. We used PowerTheater for power analysis,                                                     Power
                                                                                                        Reduction
                                                                                                                              Register
                                                                                                                             Reduction
                                                                                                                                                    57282.164062                    19723.982422                    77005.445312

NC-Verilog for RTL simulation, Design Compiler for                                                     Techniques
                                                                                                                            Explic t Clock          58326.691406                    18769.437500                    77095.804688
                                                                                                                                                                                                                                           10.43%
                                                                                                                              Enable
synthesis, and Power Compiler for traditional clock-                                                                        Bus Spec fic            58588.035156                    23880.830078                    82468.078125
gating implementation. We have included the results from                                                                       Clock
                                                                                                         Combining above all three                  57556.742188                    16087.537109                    73643.757812
power, area and speed at RTL and also gate level. The                                                   power reduction techniques

following tables compare our results with the previous                                                          LECE & BSC                          58562.148438                   18779.515625                     77341.320312


compact ASIC designs for AES and SHA1.
                                                                                                        Table 4. Gate-level Delay and Throughput Comparisons for AES
A. Comparison Results for AES                                                                                           (after synthesis with 65 nm tech.)
After doing initial power analysis at RTL, we applied                                                                                                  Critical Path
                                                                                                                                                                                     Delay and Throughput
                                                                                                                                                                                   Frequency (with 10%              Throughput
                                                                                                                 AES-GATE
three power reduction techniques to AES core at RTL and                                                                                                     (ns)                      slack margin)
                                                                                                                                                                                          (MHz)
                                                                                                                                                                                                                     (Gb/sec)

results are tabulated in Table 1-4. We can observe that                                                             Original                                1.99                             452                        5.78

with 65 nanometer industry technology, our proposed                                                       Traditional Clock Gating                          1.99                             452                        5.78
                                                                                                          Power                 Register                    2.03                             443                        5.67
schemes demonstrated 45.6% total power reduction                                                         Reduction             Reduction
                                                                                                        Techniques
(dynamic and cell leakage power) at RTL and 44.57%                                                                           Explicit Clock
                                                                                                                               Enable
                                                                                                                                                            1.99                             452                        5.78

total power reduction, 10.43% area reduction, and 5.78                                                                       Bus Specific                   1.99                             452                        5.78
                                                                                                                                Clock
Gbps throughput with 452 MHz circuit speed at gate                                                       Combining above all three                          1.99                             452                        5.78
level. Table 1 shows the power reduction results at RTL,                                                power reduction techniques
                                                                                                                LECE & BSC                                  1.99                             452                        5.78
Table 2 shows the power reduction results at gate level,
Table 3 shows the area reduction, and Table 4 shows max
circuit speed and throughput of AES implementation,                                                  B. Comparison Results for SHA1
comparing with conventional design method and                                                        We applied three power reduction techniques to SHA1
traditional clock-gating design.                                                                     core at RTL and results are tabulated in Table 5-8. We
                                                                                                     can observe that with 65 nanometer industry technology,
         Table 1. RTL Power Dissipation Comparisons for AES                                          our proposed schemes demonstrated 65.33% total power
                                                                                                     reduction (dynamic and cell leakage power) at RTL and
                                                                                                     63.26% total power reduction, 12.72% area reduction
                                                                                                     without compromising the speed, 1.28 GHz at gate level.
                                                                                                     Table 5 shows the power reduction results at RTL, Table
                                                                                                     6 shows the power reduction results at gate level, Table 7
                                                                                                     shows the area reduction, and Table 8 shows max circuit
                                                                                                     speed of SHA1 implementation, comparing with

the conventional design method and with traditional clock gating.

Table 5. RTL Power Dissipation Comparisons for SHA1

SHA1 RTL | Leakage, internal | Leakage, clock | Leakage, total | Dynamic, internal | Dynamic, clock | Dynamic, total | Total power
Original | 2.34 μW | 66.7 nW | 2.41 μW | 4.15 mW | 1.09 mW | 5.24 mW | 5.25 mW
Explicit Clock Enable | 2.15 μW | 64.9 nW | 2.21 μW | 1.79 mW | 333 μW | 2.12 mW | 2.12 mW
Bus Specific Clock | 2.33 μW | 66.4 nW | 2.4 μW | 3.87 mW | 1.02 mW | 4.9 mW | 4.9 mW
Both techniques combined | 2.14 μW | 64.8 nW | 2.21 μW | 1.56 mW | 266 μW | 1.82 mW | 1.82 mW

Table 6. Gate-level Power Dissipation Comparisons for SHA1 (after synthesis with 65 nm tech.)

SHA1-GATE | Leakage, internal | Leakage, clock | Leakage, total | Dynamic, internal | Dynamic, clock | Dynamic, total | Total power
Original | 1.99 μW | 65.8 nW | 2.05 μW | 5.01 mW | 1.36 mW | 6.37 mW | 6.37 mW
Traditional Clock Gating | 1.97 μW | 52.4 nW | 2.02 μW | 2.17 mW | 501 μW | 2.67 mW | 2.68 mW
Explicit Clock Enable | 1.96 μW | 63.9 nW | 2.02 μW | 2.16 mW | 520 μW | 2.68 mW | 2.68 mW
Bus Specific Clock | 2.08 μW | 64.8 nW | 2.15 μW | 4.75 mW | 1.28 mW | 6.03 mW | 6.03 mW
All three techniques combined | 2.02 μW | 63.1 nW | 2.08 μW | 1.9 mW | 441 μW | 2.34 mW | 2.34 mW

Table 7. Gate-level Area Comparisons for SHA1 (after synthesis with 65 nm tech.)

SHA1 GATE | Combinational area | Sequential area | Total area (min. inverter area: 1.08)
Original | 6671.159668 | 21681.474609 | 28352.880859
Traditional Clock Gating | 6381.723145 | 17775.910156 | 24157.800781
Explicit Clock Enable | 6350.041504 | 17784.912109 | 24135.121094
Bus Specific Clock | 7391.526855 | 21691.556641 | 29083.320312
All three techniques combined | 6934.683105 | 17809.392578 | 24744.240234

Table 8. Gate-level Delay Comparisons for SHA1 (after synthesis with 65 nm tech.)

SHA1 GATE | Critical path (ns) | Frequency with 10% slack margin (GHz)
Original | 0.7 | 1.28
Traditional Clock Gating | 0.7 | 1.28
Explicit Clock Enable | 0.7 | 1.28
Bus Specific Clock | 0.7 | 1.28
All three techniques combined | 0.7 | 1.28

V. CONCLUSION

In this paper we presented the design and implementation of compact AES and SHA1 ASIC cores suitable for wireless sensor networks and RFID. Compared to previous designs, we achieved significantly lower power and area for both AES and SHA1 by using the proposed design techniques. We implemented the proposed ASRR (application specific register reduction), LECE (locally explicit clock enabling), and BSC (bus specific clock) techniques at RTL and evaluated them at gate level in an ASIC flow. The RTL soft intellectual property (IP) generated with these techniques can be used directly in any ASIC design flow and applied to any technology node. With 65 nm industry technology, the proposed schemes demonstrated a 44.57% power reduction, a 10.43% area reduction, and 5.78 Gbps throughput at a circuit speed of 452 MHz for AES. For SHA1, they demonstrated a 63.26% power reduction and a 12.72% area reduction at a circuit speed of 1.28 GHz.

VI. ACKNOWLEDGMENT

The authors gratefully acknowledge the reviewers' comments.

VII. REFERENCES

[1] National Institute of Standards and Technology (U.S.), Advanced Encryption Standard.
[2] J. Daemen and V. Rijmen, "AES Proposal: Rijndael," NIST AES Proposal, June 1998.
[3] M. Kim, J. Kim, and Y. Choi, "Low Power Circuit Architecture of AES Crypto Module for Wireless Sensor Network," in Proc. World Academy of Science, Engineering and Technology, vol. 8, Oct. 2005, ISSN 1307-6884.
[4] A. Hodjat, D. D. Hwang, B. Lai, K. Tiri, and I. Verbauwhede, "A 3.84 Gbits/s AES Crypto Coprocessor with Modes of Operation in a 0.18-μm CMOS Technology," in Proc. GLSVLSI'05, Chicago, Illinois, USA, Apr. 17–19, 2005.
[5] T. Good and M. Benaissa, "AES on FPGA from the fastest to the smallest," in Proc. 7th Int. Workshop on Cryptographic Hardware and Embedded Systems (CHES 2005), Edinburgh, UK, Aug. 29–Sept. 1, 2005, pp. 427–440.
[6] P. Hämäläinen, M. Hännikäinen, and T. Hämäläinen, "Efficient hardware implementation of security processing for IEEE 802.15.4 wireless networks," in Proc. 48th IEEE Int. Midwest Symp. on Circuits and Systems (MWSCAS 2005), Cincinnati, OH, USA, Aug. 7–10, 2005, pp. 484–487.
[7] A. Satoh, S. Morioka, K. Takano, and S. Munetoh, "A compact Rijndael hardware architecture with S-box optimization," in Proc. 7th Int. Conf. on Theory and Application of Cryptology and Inf. Secur., Advances in Cryptology (ASIACRYPT 2001), Gold Coast, Australia, Dec. 9–13, 2001, pp. 239–254.
[8] C. Su, T. Lin, C. Huang, and C. Wu, "A High-Throughput Low-Cost AES Processor," IEEE Communications Magazine, vol. 41, no. 12, pp. 86–91, Dec. 2003.
[9] S. Morioka and A. Satoh, "A 10-Gbps Full-AES Design with a Twisted BDD S-Box Architecture," IEEE Transactions on VLSI, vol. 12, no. 7, July 2004.
[10] FIPS 180-1, Secure Hash Standard, NIST, US Department of Commerce, Washington, D.C., April 1995.
[11] G. Asada, M. Dong, T. S. Lin, F. Newberg, G. Pottie, and W. J. Kaiser, "Wireless Integrated Network Sensors: Low Power Systems on a Chip," in Proc. 24th European Solid-State Circuits Conference (ESSCIRC '98), 1998.

 
An fpga based efficient fruit recognition system using minimum
An fpga based efficient fruit recognition system using minimumAn fpga based efficient fruit recognition system using minimum
An fpga based efficient fruit recognition system using minimum
 
Prototyping a Wireless Sensor Node using FPGA for Mines Safety Application
Prototyping a Wireless Sensor Node using FPGA for Mines Safety ApplicationPrototyping a Wireless Sensor Node using FPGA for Mines Safety Application
Prototyping a Wireless Sensor Node using FPGA for Mines Safety Application
 
Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...
Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...
Presentation Thesis - Convolutional net on the Xeon Phi using SIMD - Gaurav R...
 
Michael john sebastian smith application-specific integrated circuits-addison...
Michael john sebastian smith application-specific integrated circuits-addison...Michael john sebastian smith application-specific integrated circuits-addison...
Michael john sebastian smith application-specific integrated circuits-addison...
 

Similaire à Publication

VLSI Architecture for Nano Wire Based Advanced Encryption Standard (AES) with...
VLSI Architecture for Nano Wire Based Advanced Encryption Standard (AES) with...VLSI Architecture for Nano Wire Based Advanced Encryption Standard (AES) with...
VLSI Architecture for Nano Wire Based Advanced Encryption Standard (AES) with...VLSICS Design
 
VLSI ARCHITECTURE FOR NANO WIRE BASED ADVANCED ENCRYPTION STANDARD (AES) WITH...
VLSI ARCHITECTURE FOR NANO WIRE BASED ADVANCED ENCRYPTION STANDARD (AES) WITH...VLSI ARCHITECTURE FOR NANO WIRE BASED ADVANCED ENCRYPTION STANDARD (AES) WITH...
VLSI ARCHITECTURE FOR NANO WIRE BASED ADVANCED ENCRYPTION STANDARD (AES) WITH...VLSICS Design
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
An Efficient FPGA Implementation of the Advanced Encryption Standard Algorithm
An Efficient FPGA Implementation of the Advanced Encryption Standard AlgorithmAn Efficient FPGA Implementation of the Advanced Encryption Standard Algorithm
An Efficient FPGA Implementation of the Advanced Encryption Standard Algorithmijsrd.com
 
Design and Implementation of Area Efficiency AES Algoritham with FPGA and ASIC
Design and Implementation of Area Efficiency AES Algoritham with FPGA and ASICDesign and Implementation of Area Efficiency AES Algoritham with FPGA and ASIC
Design and Implementation of Area Efficiency AES Algoritham with FPGA and ASICpaperpublications3
 
Design and Implementation of Area Efficiency AES Algoritham with FPGA and ASIC,
Design and Implementation of Area Efficiency AES Algoritham with FPGA and ASIC,Design and Implementation of Area Efficiency AES Algoritham with FPGA and ASIC,
Design and Implementation of Area Efficiency AES Algoritham with FPGA and ASIC,paperpublications3
 
Novel Adaptive Hold Logic Circuit for the Multiplier using Add Round Key and ...
Novel Adaptive Hold Logic Circuit for the Multiplier using Add Round Key and ...Novel Adaptive Hold Logic Circuit for the Multiplier using Add Round Key and ...
Novel Adaptive Hold Logic Circuit for the Multiplier using Add Round Key and ...IJMTST Journal
 
MICRO ROTOR ENHANCED BLOCK CIPHER DESIGNED FOR EIGHT BITS MICRO-CONTROLLERS (...
MICRO ROTOR ENHANCED BLOCK CIPHER DESIGNED FOR EIGHT BITS MICRO-CONTROLLERS (...MICRO ROTOR ENHANCED BLOCK CIPHER DESIGNED FOR EIGHT BITS MICRO-CONTROLLERS (...
MICRO ROTOR ENHANCED BLOCK CIPHER DESIGNED FOR EIGHT BITS MICRO-CONTROLLERS (...IJNSA Journal
 
Arm recognition encryption by using aes algorithm
Arm recognition    encryption by using aes algorithmArm recognition    encryption by using aes algorithm
Arm recognition encryption by using aes algorithmeSAT Journals
 
Trends and challenges in IP based SOC design
Trends and challenges in IP based SOC designTrends and challenges in IP based SOC design
Trends and challenges in IP based SOC designAishwaryaRavishankar8
 
High throughput FPGA Implementation of Advanced Encryption Standard Algorithm
High throughput FPGA Implementation of Advanced Encryption Standard AlgorithmHigh throughput FPGA Implementation of Advanced Encryption Standard Algorithm
High throughput FPGA Implementation of Advanced Encryption Standard AlgorithmTELKOMNIKA JOURNAL
 
Hardware Software Partitioning Of Advanced Encryption Standard To Counter Dif...
Hardware Software Partitioning Of Advanced Encryption Standard To Counter Dif...Hardware Software Partitioning Of Advanced Encryption Standard To Counter Dif...
Hardware Software Partitioning Of Advanced Encryption Standard To Counter Dif...mjaganm
 
Implementation and Design of AES S-Box on FPGA
Implementation and Design of AES S-Box on FPGAImplementation and Design of AES S-Box on FPGA
Implementation and Design of AES S-Box on FPGAIJRES Journal
 
ETHERNET PACKET PROCESSOR FOR SOC APPLICATION
ETHERNET PACKET PROCESSOR FOR SOC APPLICATIONETHERNET PACKET PROCESSOR FOR SOC APPLICATION
ETHERNET PACKET PROCESSOR FOR SOC APPLICATIONcscpconf
 
Reconfigurable network on chip
Reconfigurable network on chipReconfigurable network on chip
Reconfigurable network on chipAngelinaRoyappa1
 
MICRO ROTOR ENHANCED BLOCK CIPHER DESIGNED FOR EIGHT BITS MICRO-CONTROLLERS (...
MICRO ROTOR ENHANCED BLOCK CIPHER DESIGNED FOR EIGHT BITS MICRO-CONTROLLERS (...MICRO ROTOR ENHANCED BLOCK CIPHER DESIGNED FOR EIGHT BITS MICRO-CONTROLLERS (...
MICRO ROTOR ENHANCED BLOCK CIPHER DESIGNED FOR EIGHT BITS MICRO-CONTROLLERS (...IJNSA Journal
 

Similaire à Publication (20)

VLSI Architecture for Nano Wire Based Advanced Encryption Standard (AES) with...
VLSI Architecture for Nano Wire Based Advanced Encryption Standard (AES) with...VLSI Architecture for Nano Wire Based Advanced Encryption Standard (AES) with...
VLSI Architecture for Nano Wire Based Advanced Encryption Standard (AES) with...
 
VLSI ARCHITECTURE FOR NANO WIRE BASED ADVANCED ENCRYPTION STANDARD (AES) WITH...
VLSI ARCHITECTURE FOR NANO WIRE BASED ADVANCED ENCRYPTION STANDARD (AES) WITH...VLSI ARCHITECTURE FOR NANO WIRE BASED ADVANCED ENCRYPTION STANDARD (AES) WITH...
VLSI ARCHITECTURE FOR NANO WIRE BASED ADVANCED ENCRYPTION STANDARD (AES) WITH...
 
Aes
AesAes
Aes
 
A03530107
A03530107A03530107
A03530107
 
G04701051058
G04701051058G04701051058
G04701051058
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
An Efficient FPGA Implementation of the Advanced Encryption Standard Algorithm
An Efficient FPGA Implementation of the Advanced Encryption Standard AlgorithmAn Efficient FPGA Implementation of the Advanced Encryption Standard Algorithm
An Efficient FPGA Implementation of the Advanced Encryption Standard Algorithm
 
Design and Implementation of Area Efficiency AES Algoritham with FPGA and ASIC
Design and Implementation of Area Efficiency AES Algoritham with FPGA and ASICDesign and Implementation of Area Efficiency AES Algoritham with FPGA and ASIC
Design and Implementation of Area Efficiency AES Algoritham with FPGA and ASIC
 
Design and Implementation of Area Efficiency AES Algoritham with FPGA and ASIC,
Design and Implementation of Area Efficiency AES Algoritham with FPGA and ASIC,Design and Implementation of Area Efficiency AES Algoritham with FPGA and ASIC,
Design and Implementation of Area Efficiency AES Algoritham with FPGA and ASIC,
 
Novel Adaptive Hold Logic Circuit for the Multiplier using Add Round Key and ...
Novel Adaptive Hold Logic Circuit for the Multiplier using Add Round Key and ...Novel Adaptive Hold Logic Circuit for the Multiplier using Add Round Key and ...
Novel Adaptive Hold Logic Circuit for the Multiplier using Add Round Key and ...
 
MICRO ROTOR ENHANCED BLOCK CIPHER DESIGNED FOR EIGHT BITS MICRO-CONTROLLERS (...
MICRO ROTOR ENHANCED BLOCK CIPHER DESIGNED FOR EIGHT BITS MICRO-CONTROLLERS (...MICRO ROTOR ENHANCED BLOCK CIPHER DESIGNED FOR EIGHT BITS MICRO-CONTROLLERS (...
MICRO ROTOR ENHANCED BLOCK CIPHER DESIGNED FOR EIGHT BITS MICRO-CONTROLLERS (...
 
Arm recognition encryption by using aes algorithm
Arm recognition    encryption by using aes algorithmArm recognition    encryption by using aes algorithm
Arm recognition encryption by using aes algorithm
 
Trends and challenges in IP based SOC design
Trends and challenges in IP based SOC designTrends and challenges in IP based SOC design
Trends and challenges in IP based SOC design
 
Aes
AesAes
Aes
 
High throughput FPGA Implementation of Advanced Encryption Standard Algorithm
High throughput FPGA Implementation of Advanced Encryption Standard AlgorithmHigh throughput FPGA Implementation of Advanced Encryption Standard Algorithm
High throughput FPGA Implementation of Advanced Encryption Standard Algorithm
 
Hardware Software Partitioning Of Advanced Encryption Standard To Counter Dif...
Hardware Software Partitioning Of Advanced Encryption Standard To Counter Dif...Hardware Software Partitioning Of Advanced Encryption Standard To Counter Dif...
Hardware Software Partitioning Of Advanced Encryption Standard To Counter Dif...
 
Implementation and Design of AES S-Box on FPGA
Implementation and Design of AES S-Box on FPGAImplementation and Design of AES S-Box on FPGA
Implementation and Design of AES S-Box on FPGA
 
ETHERNET PACKET PROCESSOR FOR SOC APPLICATION
ETHERNET PACKET PROCESSOR FOR SOC APPLICATIONETHERNET PACKET PROCESSOR FOR SOC APPLICATION
ETHERNET PACKET PROCESSOR FOR SOC APPLICATION
 
Reconfigurable network on chip
Reconfigurable network on chipReconfigurable network on chip
Reconfigurable network on chip
 
MICRO ROTOR ENHANCED BLOCK CIPHER DESIGNED FOR EIGHT BITS MICRO-CONTROLLERS (...
MICRO ROTOR ENHANCED BLOCK CIPHER DESIGNED FOR EIGHT BITS MICRO-CONTROLLERS (...MICRO ROTOR ENHANCED BLOCK CIPHER DESIGNED FOR EIGHT BITS MICRO-CONTROLLERS (...
MICRO ROTOR ENHANCED BLOCK CIPHER DESIGNED FOR EIGHT BITS MICRO-CONTROLLERS (...
 

Dernier

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 

Dernier (20)

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 

Publication

Ultra-Low power and High Speed Design and Implementation of AES and SHA1 Hardware cores in 65 Nanometer CMOS Technology

Feng Ge, Pranjal Jain and Ken Choi
Department of Electrical and Computer Engineering
Illinois Institute of Technology
Email: {fge2, pjain13, and kchoi12}@iit.edu

Abstract— This paper describes the design and implementation of low-power, high-speed security hardware cores for the Advanced Encryption Standard (AES) and the Secure Hash Algorithm (SHA1). We propose three Register Transfer Level (RTL) circuit techniques, namely Application Specific Register Reduction (ASRR), Locally Explicit Clock Enabling (LECE), and Bus Specific Clock (BSC). LECE and BSC can be applied directly in any ASIC design flow and at any technology node. In a 65 nanometer industrial technology, our proposed schemes achieve, at RTL and gate level, a 44.57% total power reduction (dynamic and cell leakage power), a 10.43% area reduction, and 5.78 Gbps throughput at a 452 MHz circuit speed for AES, and a 63.26% total power reduction and a 12.72% area reduction at a 1.28 GHz circuit speed for SHA1.

I. INTRODUCTION

As the demand for secure communications increases, high-throughput, low-power encryption and decryption on both wired and wireless networks are becoming more necessary. Today, providing security is one of the major concerns, especially in wireless systems such as wireless sensor networks and RFID. In general, such applications require scalable, ultra-low power and low-cost architectures [3, 11]. Security systems form the backbone of such sensor networks and must protect against threats to data integrity, eavesdropping and impersonation. The main aim is therefore to implement cryptographic algorithms such as the Advanced Encryption Standard (AES) [1, 2] and the Secure Hash Algorithm (SHA1) [10] efficiently, with low power, high throughput and low area.

Conventionally, research has mainly focused on developing pipelined and loop-unrolled AES designs [5, 6]. Other work has implemented full-custom AES S-Boxes and AES ASIC designs with varying data paths [7]. Round keys are generated on the fly, either by sharing the S-Box with the main data path [7] or by dedicating an S-Box to key expansion. For SHA1, architectures that exploit its feedback modes of operation have been explored. Thus, the above references mainly focus on area-efficient implementation or on increasing throughput through architectural reconfiguration [4, 8, 9].

Traditionally, the power dissipation of VLSI chips was neglected: device density and operating frequency were low enough that power was not a constraining factor. As the scale of integration improves, more transistors, faster and smaller than their predecessors, are being packed into a chip. This leads to steady growth in the operating frequency and processing capacity per chip, resulting in increased power dissipation. Nowadays, power-aware design techniques at early stages of the design abstraction hierarchy, such as the register transfer level (RTL), are receiving more attention.

In this paper, we implement ultra-low power AES and SHA1 hardware cores with an emphasis on RTL power reduction techniques, so that system designers can easily build very low-power, high-performance security systems for CMOS fabrication by using our soft IP at RTL, at an early stage of the ASIC design flow.

II. BACKGROUND

A. AES Algorithm

AES [1, 2] is a symmetric block cipher that processes data in 128-bit blocks. It supports key sizes of 128, 192 and 256 bits, with 10, 12 and 14 rounds respectively. Each round mixes the data with a round key, which is generated from the encryption key. We consider only 128-bit keys and 10 rounds.

The cipher maintains an internal 4x4 matrix of bytes, called the state, on which operations are performed. Initially, the state is filled with the input data block and XORed with the encryption key. Regular rounds consist of the operations SubBytes, ShiftRows, MixColumns and AddRoundKey, as shown in Figure 1 (a). The last round bypasses MixColumns.

(1) SubBytes: The SubBytes transformation is a nonlinear substitution that operates on bytes. Each byte of the input state is replaced using the same substitution function, called the S-Box. The S-Box is defined as the multiplicative inverse in the Galois field $GF(2^8)$ with the irreducible polynomial $m(x) = x^8 + x^4 + x^3 + x + 1$, followed by an affine transformation. The InvSubBytes transformation, which is needed for decryption, is the inverse of the affine transformation followed by the same inversion as in SubBytes, as shown in Figure 1 (b).

(2) ShiftRows: The ShiftRows transformation rotates each row of the input state to the left, where the rotation offset equals the row number. Its inverse, InvShiftRows, performs the corresponding rotations to the right, as shown in Figure 1 (b).

978-1-4244-3355-1/09/$25.00 ©2009 IEEE 405
Authorized licensed use limited to: Illinois Institute of Technology. Downloaded on September 27, 2009 at 21:32 from IEEE Xplore. Restrictions apply.
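A small Python reference model makes the SubBytes definition concrete. The sketch below builds the S-Box as described (inversion in $GF(2^8)$ modulo $m(x)$, then the affine map) and builds the InvSubBytes table as the inverse affine map followed by the same inversion. It is an illustrative software model, not the authors' hardware S-Box; the helper names (`gf_mul`, `gf_inv`, `affine`, `inv_affine`) are our own.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo m(x) = x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # add (XOR) the current multiple of a
        a <<= 1                  # multiply a by x
        if a & 0x100:
            a ^= 0x11B           # reduce modulo m(x)
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse in GF(2^8); {00} maps to {00} by convention."""
    if a == 0:
        return 0
    # The 255 nonzero elements form a multiplicative group, so a^-1 = a^254.
    result, base, exp = 1, a, 254
    while exp:
        if exp & 1:
            result = gf_mul(result, base)
        base = gf_mul(base, base)
        exp >>= 1
    return result


def rotl8(b: int, n: int) -> int:
    """Rotate a byte left by n bits."""
    return ((b << n) | (b >> (8 - n))) & 0xFF


def affine(b: int) -> int:
    """AES affine map: b ^ rotl(b,1) ^ rotl(b,2) ^ rotl(b,3) ^ rotl(b,4) ^ 0x63."""
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63


def inv_affine(s: int) -> int:
    """Inverse affine map: rotl(s,1) ^ rotl(s,3) ^ rotl(s,6) ^ 0x05."""
    return rotl8(s, 1) ^ rotl8(s, 3) ^ rotl8(s, 6) ^ 0x05


# SubBytes table: inversion first, then the affine map.
SBOX = [affine(gf_inv(x)) for x in range(256)]
# InvSubBytes table: inverse affine map first, then the same inversion.
INV_SBOX = [gf_inv(inv_affine(s)) for s in range(256)]
```

In hardware the same 256-entry mapping can be realized either as a lookup table or as inversion logic followed by the affine XOR network; this model only pins down the expected values for verification.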
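ShiftRows acts on the state laid out as a 4x4 byte matrix. The Python sketch below assumes the standard FIPS-197 column-major loading of the 16 input bytes (`state[r][c] = block[r + 4c]`); the function names are illustrative, not taken from the paper.

```python
State = list[list[int]]  # state[r][c]: row r, column c of the 4x4 AES state


def bytes_to_state(block: bytes) -> State:
    """Load 16 input bytes column by column: state[r][c] = block[r + 4c]."""
    return [[block[r + 4 * c] for c in range(4)] for r in range(4)]


def shift_rows(state: State) -> State:
    """ShiftRows: rotate row r to the left by r byte positions."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def inv_shift_rows(state: State) -> State:
    """InvShiftRows: rotate row r to the right by r positions, undoing shift_rows."""
    return [row[len(row) - r:] + row[:len(row) - r] for r, row in enumerate(state)]


if __name__ == "__main__":
    s = bytes_to_state(bytes(range(16)))
    for row in shift_rows(s):
        print(row)
```

Because ShiftRows is a fixed byte permutation, a hardware implementation needs only wiring for it and no logic gates.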
  • 2. Plaintext Ciphertext Word E: C3 D2 El FO Round Round Key 0 ADDROUNDKEY Initial Round Key Nr ADDROUNDKEY Initial Round SubBytes InvSubBytes ShiftRows InvShiftRows Nr 1 Round Key Nr 1 MixColumns Nr 1….1 InvMixColumns Round Key ADDROUNDKEY InvMixColumns ADDROUNDKEY 1….Nr 1 SubBytes InvSubBytes ShiftRows Final InvShiftRows Final Round Round Round Round Key Nr ADDROUNDKEY Key 0 ADDROUNDKEY Ciphertext Plaintext Figure 1(a): Encryption Figure 1(b): Decryption Figure 2. Secure Hash Algorithm (SHA1) Algorithm (4) Process message in 16-word blocks: The heart of the algorithm is a module that consists of four rounds of Figure 1: AES Algorithm processing 20 steps each. The four rounds have a similar structure, but each uses a different primitive logical (3) MixColumns: The MixColumns transformation function. These logical functions are defined as follows: maps each column of the input state to a new column in These rounds take as input the current 512-bits block and the output state. Each input column is considered as a 8 the 160-bits buffer value (A, B, C, D, E), and then update polynomial over GF (2 ) and multiplied with the constant these buffers. polynomial a(x) = {03} x3 + {01} x2 + {01} x + {02} ⎧( B ∧ C ) ∨ ( B ∧ D ) 4 modulo x - 1. The coefficients of a(x) are also elements 0 ≤ t ≤ 19 ⎪B ⊕ C ⊕ D 8 of GF (2 ) and are represented by hexadecimal values in this equation. The InvMixColumns transformation is the ⎪ 20 ≤ t ≤ 39 -1 f ( B, C , D) = ⎨ multiplication of each column with a (x) = {0B} x3 + 4 {0D} x2 + {09} x + {0E} modulo x – 1 as shown in ⎪( B ∧ C ) ∨ ( B ∧ D ) ∨ (C ∧ D ) 40 ≤ t ≤ 59 Figure 1 (b). ⎪B ⊕ C ⊕ D ⎩ 60 ≤ t ≤ 79 (4) AddRoundKey: The AddRoundKey transformation Each round also makes use of an additive constant KT. In is self-inverting. It maps a 128-bit input state to a 128-bit hex the values are shown below. output state by XORing the input state with a 128-bit round key. Please refer Figure 1. ⎧5 A827999 0 ≤ t ≤ 19 ⎪6 ED 9 EBA1 ⎪ 20 ≤ t ≤ 39 B. 
SHA1 Algorithm: The algorithm takes as input a message with a maximum length of less than $2^{64}$ bits and produces as output a 160-bit message digest, as shown in Figure 2. The input is processed in 512-bit blocks. The algorithm processing includes the following steps:

(1) Padding: The purpose of message padding is to make the total length of a padded message congruent to 448 modulo 512 (length $\equiv 448 \bmod 512$). The number of padding bits is between 1 and 512. Padding consists of a single 1-bit followed by the necessary number of 0-bits.

(2) Appending Length: A 64-bit binary representation of the original length of the message is appended to the end of the message.

(3) Initialize the SHA-1 buffer: The 160-bit buffer is represented by five 32-bit registers (A, B, C, D, E) used to store the intermediate and final results of the SHA-1 message digest computation. They are initialized to the following hexadecimal values, stored in big-endian order (most significant byte first):

Word A: 67 45 23 01;
Word B: EF CD AB 89;
Word C: 98 BA DC FE;
Word D: 10 32 54 76;
Word E: C3 D2 E1 F0.

$$
K_t = \begin{cases}
\mathtt{5A827999}, & 0 \le t \le 19 \\
\mathtt{6ED9EBA1}, & 20 \le t \le 39 \\
\mathtt{8F1BBCDC}, & 40 \le t \le 59 \\
\mathtt{CA62C1D6}, & 60 \le t \le 79
\end{cases}
$$

III. PROPOSED APPROACHES AND IMPLEMENTATIONS

We implemented both AES and SHA1 at RTL using the following three low-power techniques and then synthesized them. The performance is evaluated in terms of power, area, speed, and throughput at both RTL and gate level:
A) Application Specific Register Reduction (ASRR)
B) Locally Explicit Clock Enabling (LECE)
C) Bus Specific Clock (BSC)

A. Application Specific Register Reduction (ASRR):
Figure 3 illustrates our implementation of the decryption part of the AES core. AES takes a 128-bit data block as input and performs several different transformations on this block. AES encryption and decryption are based on four different transformations that are performed repeatedly in a certain sequence, as shown in Figure 1. Each of these transformations, which are described in Section I, maps a 128-bit input state to a 128-bit output state.

Figure 3. Application Specific Register Reduction (ASRR)

For AES-128 encryption, the 128-bit cipher key must be expanded into eleven 128-bit round keys. The principal idea of this key expansion is that the first round key, RoundKey($k_0$), corresponds to the cipher key, and every subsequent round key is derived from its predecessor using a function $f$:

$$\mathrm{RoundKey}(k_i) = f\big(\mathrm{RoundKey}(k_{i-1})\big), \qquad 0 < i < 11.$$

For AES-128 decryption, the same round keys are used in reverse order. Using the inverse of the key expansion function, $f^{-1}$, the round keys can be derived recursively from RoundKey($k_{10}$); in our design they are stored in the Key Reverse Buffer using just 6 registers instead of 10. In the AddRoundKey step, each new sub-key is generated from the previous sub-key. The key generation schedule is shown in Figure 4. Depending on the number of rounds, 10, 12, or 14 sub-keys are involved in encryption; we implemented the 10-sub-key case.

Figure 4. Key Generation Block

The decryption process is the reverse of encryption, and the sub-keys are used in reverse order. The conventional way to implement this is to generate the last key with the encryption Key Expansion Module and then use a reverse Key Expansion Module to generate each sub-key in reverse order, as shown in Figure 5. However, this method requires a large extra circuit and a large S-box. In this paper, we propose a novel way to reduce the number of registers substantially by generating all sub-keys in the encryption Key Expansion Module and storing them into registers or RAMs before decryption begins.

Figure 5. Original Key Reverse Buffer

In our proposed architecture, the decryption key path shares as much as possible with the encryption circuit, and the registers can be reduced as shown in Figure 6. Sub-key K_i is generated and stored into Reg_i at the i-th clock cycle, where i runs from 1 to 11. Notice that each of these 11 registers is used only once in decryption; therefore, we can reduce their number to 6. Sub-keys are stored into registers from the 5th clock cycle onward, and sub-keys K0 to K4 are generated and stored into registers after decryption begins. The multiplexers in front of the registers are controlled by the decryption-begin signal "de".

Figure 6. ASRR for the Key Reverse Buffer

The timing sequence of the registers is shown in Figure 7. At the fifth clock, we store key K5 into R0, at the next clock key K6 into R1, and so on until key K10 is stored into R5. Decryption then starts: in its first clock cycle we use key K10, previously stored in R5, and at the same time key K1 is generated and stored in R5. In the second clock cycle of decryption we use key K9, previously stored in R4, while key K2 is generated and stored in R4. We repeat this operation until the fourth clock cycle of decryption, in which key K7, previously stored in R2, is used and key K4 is generated and stored in R2. With this mechanism we save five 128-bit registers; we call this scheme ASRR (Application Specific Register Reduction).
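To make the register reuse in the ASRR timing sequence concrete, the following Python sketch simulates a six-entry Key Reverse Buffer using symbolic key labels. It is a hypothetical behavioral model, not the authors' RTL: the function name `asrr_decryption_order`, the refill rule (the key produced by the forward expansion in decryption cycles 1 to 4 is written into the register freed in that cycle), and reading K0 directly from the cipher-key input are assumptions made for illustration.

```python
"""Behavioral sketch of a six-register Key Reverse Buffer (hypothetical, not the authors' RTL)."""

NUM_ROUNDS = 10  # AES-128 uses round keys K0..K10
NUM_REGS = 6     # buffer registers R0..R5


def asrr_decryption_order():
    """Return (key, source) pairs in the order decryption reads the round keys.

    Assumptions of this sketch:
      * keys are symbolic labels "K0".."K10";
      * K5..K10 are written into R0..R5 before decryption starts;
      * in decryption cycles 1..4 the forward key expansion produces K1..K4,
        one per cycle, and each is written into the register freed in that cycle;
      * K0 equals the cipher key and is read from the key input, not from R0..R5.
    """
    regs = [f"K{i}" for i in range(5, NUM_ROUNDS + 1)]  # R0..R5 hold K5..K10
    assert len(regs) == NUM_REGS

    order = []
    for cycle in range(1, NUM_ROUNDS + 2):
        key = f"K{NUM_ROUNDS + 1 - cycle}"        # K10 first, K0 last
        if key == "K0":
            order.append((key, "key input"))
            continue
        idx = regs.index(key)                      # raises ValueError if the key was overwritten
        order.append((key, f"R{idx}"))
        # The register is free after this read; refill it with the newly expanded key.
        regs[idx] = f"K{cycle}" if cycle <= 4 else None
    return order


if __name__ == "__main__":
    for key, source in asrr_decryption_order():
        print(key, "<-", source)
```

Running it lists K10 down to K0, with R2 to R5 each read twice (for example, K10 and later K1 both come from R5), which is how six 128-bit registers can stand in for eleven under these assumptions.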
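The SHA1 preprocessing steps (1) to (3) and the constants $K_t$ from Section II can be checked in software. The sketch below is a plain Python reference model that assumes byte-aligned messages; padding, the 64-bit length field, the initial buffer values and $K_t$ are the FIPS 180-1 definitions, and the message schedule and logical functions $f_t$ (standard SHA1, not described in this paper) are included only so the result can be compared against Python's `hashlib`. The helper names are illustrative.

```python
import hashlib
import struct

# FIPS 180-1 initial buffer values A..E (32-bit words, most significant byte first).
H_INIT = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)
MASK32 = 0xFFFFFFFF


def k_t(t: int) -> int:
    """Round constant K_t for step t, 0 <= t <= 79."""
    return (0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6)[t // 20]


def f_t(t: int, b: int, c: int, d: int) -> int:
    """Standard SHA1 logical function for step t."""
    if t < 20:
        return (b & c) | (~b & d)
    if 40 <= t < 60:
        return (b & c) | (b & d) | (c & d)
    return b ^ c ^ d


def rotl(x: int, n: int) -> int:
    """32-bit left rotation."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def sha1_pad(message: bytes) -> bytes:
    """Steps (1)-(2): a 1 bit, 0 bits up to 448 mod 512, then the 64-bit message length."""
    bit_len = 8 * len(message)
    padded = message + b"\x80"                           # 1 bit followed by seven 0 bits
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)   # 56 bytes = 448 bits (mod 64 bytes)
    return padded + struct.pack(">Q", bit_len)           # 64-bit big-endian length


def sha1(message: bytes) -> str:
    state = list(H_INIT)                                 # step (3): initialize the buffer
    padded = sha1_pad(message)
    for off in range(0, len(padded), 64):                # one 512-bit block at a time
        w = list(struct.unpack(">16I", padded[off:off + 64]))
        for t in range(16, 80):
            w.append(rotl(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
        a, b, c, d, e = state
        for t in range(80):
            temp = (rotl(a, 5) + f_t(t, b, c, d) + e + w[t] + k_t(t)) & MASK32
            a, b, c, d, e = temp, a, rotl(b, 30), c, d
        state = [(x + y) & MASK32 for x, y in zip(state, (a, b, c, d, e))]
    return "".join(f"{x:08x}" for x in state)


if __name__ == "__main__":
    print(sha1(b"abc"))  # a9993e364706816aba3e25717850c26c9cd0d89d
    assert sha1(b"abc") == hashlib.sha1(b"abc").hexdigest()
```

A 55-byte message pads to exactly one 512-bit block, while a 56-byte message needs a second block because the 64-bit length no longer fits; both cases exercise the "between 1 and 512 padding bits" rule.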
Figure 7. Timing and Waveform View for the Proposed ASRR Scheme

B. Locally Explicit Clock Enabling (LECE):
In general, RTL code whose output depends on some initial condition synthesizes into a flip-flop with a MUX in its feedback path. We removed the feedback MUX by implementing a gated clock. LECE differs from traditional clock gating in two ways: (i) traditional clock gating inserts clock-gating cells globally during synthesis, based on a maximum fanout number and a maximum bus width, so it is far from optimal; and (ii) LECE judiciously examines the clock and enable signals and then determines which registers should be clock gated for the optimal solution in terms of total, dynamic, and leakage power. We applied this technique mainly to the Key Expansion Unit and the Key Reverse Buffer block of the AES decryption module.

The control block of the AES core performs several functions; one of the most important is to keep track of the number of rounds and the sub-keys generated by the key expansion unit. We use a 128-bit key and therefore need to count to 10. Consider Figure 8 (a), in which the register updates 'kcnt' on a rising edge of 'clk', but only when signal 'kld' or 'kb_ld' is high. If the enable signal is low for a significant part of circuit operation, then, because the data input D (10 or kcnt - 1) and 'kcnt' are multi-bit buses, a substantial amount of the power dissipated by the clock driver is wasted. We implemented a technique that gates the clock and thus reduces this power dissipation by a significant percentage. As shown in Figure 8 (b), we replace the clock input of the flip-flop with an AND gate whose inputs are the clock and the signal 'EN = kld | kb_ld'. A latch is inserted so that, while the clock is high, no activity on the enable is transferred to the clock input. We implemented this technique at RTL, obtaining the new module shown in Figure 8 (b).

Figure 8. Implementation of Locally Explicit Clock Enabling (LECE): (a) 1-bit of initial control block; (b) after implementing LECE

C. Bus Specific Clock (BSC):
The schematic and timing diagram in Figure 9 show a register whose data input is active during only one phase of operation and does not change for long periods of time. The goal of this technique is first to find buses in the design that have low switching activity and then to save power by creating a clock-enable signal that detects changes on those buses.

Figure 9. Data Bus Specific Clock

In the AES algorithm, a potential candidate resides inside the Key Expansion Unit. To generate the sub-key Roundkey[i], we XOR the previous key Roundkey[i-1] with Rcon[i] and the SubWord output.

Figure 10. RCON Implementation

Here Rcon[i] is a 32-bit bus whose output 'out[31:0]' takes the values 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36 for the 10 rounds, respectively. These constants occupy only the most significant byte, so the 24-bit LSB bus out[23:0] is rarely active. In Figure 11 (a), out[31:0] (data) is active for a very small amount of time while the clock is applied continuously, which causes considerable power dissipation in the clock driver as well as in the circuitry inside the register. We avoid this bottleneck by constructing an enable signal that detects changes on the bus (see Figure 11 (b)). We XOR the next state of each bit with its previous value to check whether they are the same, and an N-bit OR then determines whether any bit has changed. If no bit has changed, there is no point in enabling the clock. The latch prevents glitches at the AND output; otherwise, an accidental clock pulse could be applied to the register, which is undesirable.
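A cycle-level model helps show why LECE places a latch in front of the AND gate. The Python sketch below samples each clock period four times and compares a latch-based gate (latch transparent while clk is low, as described for Figure 8 (b)) against a bare AND gate. The waveforms, including an enable glitch during the high phase, are hypothetical stimuli chosen for illustration; they are not taken from the authors' simulations, and the function names are assumptions.

```python
"""Cycle-level sketch of latch-based clock gating as used in LECE (hypothetical stimuli)."""


def latch_gated_clock(clk, en):
    """Gated clock = clk AND latched EN, where the latch is transparent only while clk is low."""
    latched = 0
    gclk = []
    for c, e in zip(clk, en):
        if c == 0:
            latched = e            # transparent: follow EN during the low phase
        gclk.append(c & latched)   # while clk is high the latch holds its value
    return gclk


def and_gated_clock(clk, en):
    """Gated clock without a latch: EN changes during the high phase reach the register."""
    return [c & e for c, e in zip(clk, en)]


def rising_edges(wave):
    """Count 0 -> 1 transitions, i.e. clock edges delivered to the register."""
    return sum(1 for prev, cur in zip(wave, wave[1:]) if prev == 0 and cur == 1)


def example_waveforms(periods=6):
    """clk is low for 2 samples, then high for 2. EN (standing in for kld | kb_ld)
    is asserted for period 1 only, plus a glitch while clk is high in period 3."""
    clk = [0, 0, 1, 1] * periods
    en = [0] * len(clk)
    for i in range(4, 8):   # period 1: a genuine enable
        en[i] = 1
    en[14] = 1              # period 3, high phase: a glitch
    return clk, en


if __name__ == "__main__":
    clk, en = example_waveforms()
    print("free-running clock edges:", rising_edges(clk))                         # 6
    print("latch + AND gate edges:  ", rising_edges(latch_gated_clock(clk, en)))  # 1
    print("bare AND gate edges:     ", rising_edges(and_gated_clock(clk, en)))    # 2
```

With the latch, the register receives one clock edge instead of six and the glitch is filtered out; the bare AND gate passes the glitch as a spurious second edge.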
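The BSC change detector can be sketched in the same style. The Python code below generates the ten AES-128 Rcon constants with the standard GF(2^8) doubling (xtime), places each in out[31:24] of a 32-bit word, and applies the XOR-then-OR enable to two slices of the bus. Modeling one bus update per round, and the helper names `rcon_words`, `bsc_enable` and `enabled_cycles`, are assumptions for this sketch rather than details of the authors' design.

```python
"""Sketch of the BSC change detector applied to the AES-128 Rcon bus (illustrative model)."""
from typing import List

MSB_BYTE = 0xFF000000   # out[31:24]
LSB_24 = 0x00FFFFFF     # out[23:0]


def xtime(b: int) -> int:
    """Multiply by x in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return b ^ 0x11B if b & 0x100 else b


def rcon_words(rounds: int = 10) -> List[int]:
    """Rcon[1..rounds] as 32-bit words with the constant in the most significant byte."""
    words, rc = [], 0x01
    for _ in range(rounds):
        words.append(rc << 24)
        rc = xtime(rc)
    return words


def bsc_enable(current: int, nxt: int, mask: int) -> bool:
    """Per-bit XOR of next and current values, then an N-bit OR over the masked slice."""
    return ((current ^ nxt) & mask) != 0


def enabled_cycles(values: List[int], mask: int) -> int:
    """Number of bus updates for which the BSC enable would let the clock through."""
    return sum(bsc_enable(cur, nxt, mask) for cur, nxt in zip(values, values[1:]))


if __name__ == "__main__":
    words = rcon_words()
    print([f"0x{w >> 24:02X}" for w in words])
    # ['0x01', '0x02', '0x04', '0x08', '0x10', '0x20', '0x40', '0x80', '0x1B', '0x36']
    print("out[31:24] enables:", enabled_cycles(words, MSB_BYTE))  # 9
    print("out[23:0]  enables:", enabled_cycles(words, LSB_24))    # 0
```

Under this model the upper byte needs a clock enable on every round transition, while the detector for out[23:0] never fires, which is the low-activity situation BSC is meant to exploit.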
Figure 11. Implementation of Bus Specific Clock (BSC): (a) 32-bit of initial RCON block; (b) after implementing BSC

IV. SIMULATION RESULTS

We designed and implemented the AES and SHA1 cores in Verilog at RTL and synthesized them to the gate level using a 65 nm, 1.0 V standard-cell CMOS technology. We used PowerTheater for power analysis, NC-Verilog for RTL simulation, Design Compiler for synthesis, and Power Compiler for the traditional clock-gating implementation. We report power, area, and speed results at both RTL and gate level. The following tables compare our results with the conventional design and with traditional clock gating.

A. Comparison Results for AES
After an initial power analysis at RTL, we applied the three power reduction techniques to the AES core at RTL; the results are tabulated in Tables 1-4. With 65 nm industry technology, our proposed schemes achieve a 45.6% total power reduction (dynamic and cell leakage power) at RTL, and a 44.57% total power reduction, a 10.43% area reduction, and 5.78 Gb/s throughput at a 452 MHz clock at gate level. Table 1 shows the power reduction results at RTL, Table 2 the power reduction results at gate level, Table 3 the area reduction, and Table 4 the maximum circuit speed and throughput of the AES implementation, each compared with the conventional design method and traditional clock-gating design.

Table 1. RTL Power Dissipation Comparisons for AES
Design | Internal leakage | Clock leakage | Total leakage | Internal dynamic | Clock dynamic | Total dynamic | Total power
Original | 5.72uW | 74.4nW | 5.79uW | 14.9mW | 2.99mW | 17.8mW | 17.9mW
Register Reduction (ASRR) | 5.41uW | 59.5nW | 5.47uW | 13.5mW | 2.38mW | 15.8mW | 15.8mW
Explicit Clock Enable (LECE) | 5.42uW | 70.5nW | 5.49uW | 8.91mW | 999uW | 9.91mW | 9.92mW
Bus Specific Clock (BSC) | 5.79uW | 73.8nW | 5.86uW | 14.7mW | 2.93mW | 17.7mW | 17.7mW
All three techniques combined | 5.31uW | 56.2nW | 5.37uW | 9.12mW | 916uW | 10mW | 10mW
LECE & BSC | 5.49uW | 69.9nW | 5.56uW | 8.77mW | 947uW | 9.72mW | 9.73mW

Table 2. Gate-level Power Dissipation Comparisons for AES (after synthesis with 65 nm tech.)

Table 3. Gate-level Area Comparisons for AES (after synthesis with 65 nm tech.; minimum inverter area: 1.08)
Design | Combinational | Sequential | Total
Original | 58352.609375 | 23870.750000 | 82222.562500
Traditional Clock Gating | 58340.003906 | 18760.798828 | 77100.484375
Register Reduction (ASRR) | 57282.164062 | 19723.982422 | 77005.445312
Explicit Clock Enable (LECE) | 58326.691406 | 18769.437500 | 77095.804688
Bus Specific Clock (BSC) | 58588.035156 | 23880.830078 | 82468.078125
All three techniques combined | 57556.742188 | 16087.537109 | 73643.757812 (10.43% reduction)
LECE & BSC | 58562.148438 | 18779.515625 | 77341.320312

Table 4. Gate-level Delay and Throughput Comparisons for AES (after synthesis with 65 nm tech.)
Design | Critical path delay (ns) | Frequency with 10% slack margin (MHz) | Throughput (Gb/s)
Original | 1.99 | 452 | 5.78
Traditional Clock Gating | 1.99 | 452 | 5.78
Register Reduction (ASRR) | 2.03 | 443 | 5.67
Explicit Clock Enable (LECE) | 1.99 | 452 | 5.78
Bus Specific Clock (BSC) | 1.99 | 452 | 5.78
All three techniques combined | 1.99 | 452 | 5.78
LECE & BSC | 1.99 | 452 | 5.78

B. Comparison Results for SHA1
We applied three power reduction techniques to the SHA1 core at RTL; the results are tabulated in Tables 5-8. With 65 nm industry technology, our proposed schemes achieve a 65.33% total power reduction (dynamic and cell leakage power) at RTL, and a 63.26% total power reduction and a 12.72% area reduction at gate level without compromising speed (1.28 GHz). Table 5 shows the power reduction results at RTL, Table 6 the power reduction results at gate level, Table 7 the area reduction, and Table 8 the maximum circuit speed of the SHA1 implementation, each compared with the conventional design method and traditional clock-gating design.

Table 5. RTL Power Dissipation Comparisons for SHA1
Design | Internal leakage | Clock leakage | Total leakage | Internal dynamic | Clock dynamic | Total dynamic | Total power
Original | 2.34uW | 66.7nW | 2.41uW | 4.15mW | 1.09mW | 5.24mW | 5.25mW
Explicit Clock Enable (LECE) | 2.15uW | 64.9nW | 2.21uW | 1.79mW | 333uW | 2.12mW | 2.12mW
Bus Specific Clock (BSC) | 2.33uW | 66.4nW | 2.4uW | 3.87mW | 1.02mW | 4.9mW | 4.9mW
Both techniques combined | 2.14uW | 64.8nW | 2.21uW | 1.56mW | 266uW | 1.82mW | 1.82mW

Table 6. Gate-level Power Dissipation Comparisons for SHA1 (after synthesis with 65 nm tech.)
Design | Internal leakage | Clock leakage | Total leakage | Internal dynamic | Clock dynamic | Total dynamic | Total power
Original | 1.99uW | 65.8nW | 2.05uW | 5.01mW | 1.36mW | 6.37mW | 6.37mW
Traditional Clock Gating | 1.97uW | 52.4nW | 2.02uW | 2.17mW | 501uW | 2.67mW | 2.68mW
Explicit Clock Enable (LECE) | 1.96uW | 63.9nW | 2.02uW | 2.16mW | 520uW | 2.68mW | 2.68mW
Bus Specific Clock (BSC) | 2.08uW | 64.8nW | 2.15uW | 4.75mW | 1.28mW | 6.03mW | 6.03mW
All three techniques combined | 2.02uW | 63.1nW | 2.08uW | 1.9mW | 441uW | 2.34mW | 2.34mW

Table 7. Gate-level Area Comparisons for SHA1 (after synthesis with 65 nm tech.; minimum inverter area: 1.08)
Design | Combinational | Sequential | Total
Original | 6671.159668 | 21681.474609 | 28352.880859
Traditional Clock Gating | 6381.723145 | 17775.910156 | 24157.800781
Explicit Clock Enable (LECE) | 6350.041504 | 17784.912109 | 24135.121094
Bus Specific Clock (BSC) | 7391.526855 | 21691.556641 | 29083.320312
All three techniques combined | 6934.683105 | 17809.392578 | 24744.240234

Table 8. Gate-level Delay Comparisons for SHA1 (after synthesis with 65 nm tech.)
Design | Critical path delay (ns) | Frequency with 10% slack margin (GHz)
Original | 0.7 | 1.28
Traditional Clock Gating | 0.7 | 1.28
Explicit Clock Enable (LECE) | 0.7 | 1.28
Bus Specific Clock (BSC) | 0.7 | 1.28
All three techniques combined | 0.7 | 1.28

V. CONCLUSION

In this paper, we presented the design and implementation of compact AES and SHA1 ASIC cores suitable for wireless sensor networks and RFID. Compared to previous designs, we achieved significantly lower power and lower area for both AES and SHA1 by using the proposed design techniques. We implemented the proposed ASRR (application specific register reduction), LECE (locally explicit clock enabling), and BSC (bus specific clock) techniques at RTL and evaluated them at gate level in an ASIC flow. The RTL soft intellectual property generated with these techniques can be used directly in any ASIC design flow and applied to any technology node. With 65 nm industry technology, our proposed schemes demonstrated a 44.57% power reduction, a 10.43% area reduction, and 5.78 Gb/s throughput at 452 MHz for AES, and a 63.26% power reduction and a 12.72% area reduction at 1.28 GHz for SHA1.

VI. ACKNOWLEDGMENT

The authors gratefully acknowledge the reviewers' comments.

VII. REFERENCES

[1] National Institute of Standards and Technology (U.S.), "Advanced Encryption Standard."
[2] J. Daemen and V. Rijmen, "AES Proposal: Rijndael," NIST AES Proposal, June 1998.
[3] M. Kim, J. Kim, and Y. Choi, "Low Power Circuit Architecture of AES Crypto Module for Wireless Sensor Network," in Proc. World Academy of Science, Engineering and Technology, vol. 8, Oct. 2005, ISSN 1307-6884.
[4] A. Hodjat, D. D. Hwang, B. Lai, K. Tiri, and I. Verbauwhede, "A 3.84 Gbits/s AES Crypto Coprocessor with Modes of Operation in a 0.18-μm CMOS Technology," in Proc. GLSVLSI '05, Chicago, Illinois, USA, Apr. 17-19, 2005.
[5] T. Good and M. Benaissa, "AES on FPGA from the fastest to the smallest," in Proc. 7th Int. Workshop on Cryptographic Hardware and Embedded Systems (CHES 2005), Edinburgh, UK, Aug. 29-Sept. 1, 2005, pp. 427-440.
[6] P. Hämäläinen, M. Hännikäinen, and T. Hämäläinen, "Efficient hardware implementation of security processing for IEEE 802.15.4 wireless networks," in Proc. 48th IEEE Int. Midwest Symp. on Circuits and Systems (MWSCAS 2005), Cincinnati, OH, USA, Aug. 7-10, 2005, pp. 484-487.
[7] A. Satoh, S. Morioka, K. Takano, and S. Munetoh, "A compact Rijndael hardware architecture with S-box optimization," in Proc. 7th Int. Conf. on the Theory and Application of Cryptology and Information Security, Advances in Cryptology (ASIACRYPT 2001), Gold Coast, Australia, Dec. 9-13, 2001, pp. 239-254.
[8] C. Su, T. Lin, C. Huang, and C. Wu, "A High-Throughput Low-Cost AES Processor," IEEE Communications Magazine, vol. 41, no. 12, pp. 86-91, Dec. 2003.
[9] S. Morioka and A. Satoh, "A 10-Gbps Full-AES Design with a Twisted BDD S-Box Architecture," IEEE Transactions on VLSI Systems, vol. 12, no. 7, July 2004.
[10] NIST, FIPS 180-1, "Secure Hash Standard," US Department of Commerce, Washington, D.C., Apr. 1995.
[11] G. Asada, M. Dong, T. S. Lin, F. Newberg, G. Pottie, and W. J. Kaiser, "Wireless Integrated Network Sensors: Low Power Systems on a Chip," in Proc. 24th European Solid-State Circuits Conference (ESSCIRC '98), 1998.