[Figure 1: AES Algorithm. (a) Encryption; (b) Decryption]

(3) MixColumns: The MixColumns transformation maps each column of the input state to a new column in the output state. Each input column is considered as a polynomial over $GF(2^8)$ and multiplied with the constant polynomial $a(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}$ modulo $x^4 + 1$. The coefficients of $a(x)$ are also elements of $GF(2^8)$ and are represented by hexadecimal values in this equation. The InvMixColumns transformation is the multiplication of each column with $a^{-1}(x) = \{0B\}x^3 + \{0D\}x^2 + \{09\}x + \{0E\}$ modulo $x^4 + 1$, as shown in Figure 1(b).

(4) AddRoundKey: The AddRoundKey transformation is self-inverting. It maps a 128-bit input state to a 128-bit output state by XORing the input state with a 128-bit round key. Please refer to Figure 1.

B. SHA1 Algorithm:
The algorithm takes as input a message with a maximum length of less than $2^{64}$ bits and produces as output a 160-bit message digest, as shown in Figure 2. The input is processed in 512-bit blocks. The algorithm processing includes the following steps:

(1) Padding: The purpose of message padding is to make the total length of a padded message congruent to 448 modulo 512 (length ≡ 448 mod 512). The number of padding bits should be between 1 and 512. Padding consists of a single 1-bit followed by the necessary number of 0-bits.

(2) Appending Length: A 64-bit binary representation of the original length of the message is appended to the end of the message.

(3) Initialize the SHA-1 buffer: The 160-bit buffer is represented by five 32-bit word registers (A, B, C, D, E) used to store the intermediate and final results of the message digest for the SHA-1 functions. They are initialized to the following values in hexadecimal. High-order bytes are listed first.
Word A: 67 45 23 01
Word B: EF CD AB 89
Word C: 98 BA DC FE
Word D: 10 32 54 76
Word E: C3 D2 E1 F0

[Figure 2. Secure Hash Algorithm (SHA1)]

(4) Process message in 16-word blocks: The heart of the algorithm is a module that consists of four rounds of processing of 20 steps each. These rounds take as input the current 512-bit block and the 160-bit buffer value (A, B, C, D, E), and then update these buffers. The four rounds have a similar structure, but each uses a different primitive logical function. These logical functions are defined as follows:

$$
f(B, C, D) =
\begin{cases}
(B \wedge C) \vee (\neg B \wedge D), & 0 \le t \le 19 \\
B \oplus C \oplus D, & 20 \le t \le 39 \\
(B \wedge C) \vee (B \wedge D) \vee (C \wedge D), & 40 \le t \le 59 \\
B \oplus C \oplus D, & 60 \le t \le 79
\end{cases}
$$

Each round also makes use of an additive constant $K_t$. The values are shown below in hexadecimal.

$$
K_t =
\begin{cases}
\mathtt{5A827999}, & 0 \le t \le 19 \\
\mathtt{6ED9EBA1}, & 20 \le t \le 39 \\
\mathtt{8F1BBCDC}, & 40 \le t \le 59 \\
\mathtt{CA62C1D6}, & 60 \le t \le 79
\end{cases}
$$

III. PROPOSED APPROACHES AND IMPLEMENTATIONS
We have implemented both AES and SHA1 at RTL using the following three low-power techniques and synthesized them. The performance is demonstrated in terms of power, area, speed, and throughput at RTL and also at gate level:
A) Application Specific Register Reduction (ASRR)
B) Locally Explicit Clock Enabling (LECE)
C) Bus Specific Clock (BSC)

A. Application Specific Register Reduction (ASRR):
Figure 3 illustrates our implementation of the decryption part of the AES core. AES takes a 128-bit data block as input and performs several different transformations on this block. AES encryptions and decryptions are based on four different transformations that are performed repeatedly in a
Authorized licensed use limited to: Illinois Institute of Technology. Downloaded on September 27, 2009 at 21:32 from IEEE Xplore. Restrictions apply.
certain sequence as shown in Figure 1. Each of these transformations, which are described in Section I, maps a 128-bit input state to a 128-bit output state.

[Figure 3. Application Specific Register Reduction (ASRR)]

For AES-128 encryption, the 128-bit cipher key needs to be expanded to eleven 128-bit round keys. The principal idea of this key expansion is that the first round key, RoundKey($k_0$), corresponds to the cipher key. All subsequent round keys are derived from their respective predecessor using a function $f$, so RoundKey($k_i$) = $f$(RoundKey($k_{i-1}$)) for all 0 < i < 11. For AES-128 decryption, the same round keys are used in reversed order. Using the inverse of the key expansion function, $f^{-1}$, the round keys can be derived recursively from RoundKey($k_{10}$) and are stored in the Key Reverse Buffer, using just 6 registers instead of 11.

In the AddRoundKey step, a new sub-key is generated from the previous sub-key. The Key Generation Schedule is shown in Figure 4. Depending on the number of rounds, 10, 12, or 14 sub-keys are involved in encryption. We have implemented the generation of 10 sub-keys.

[Figure 4. Key Generation Block]

The decryption process is the reverse of encryption. Sub-keys are used in reverse order. The conventional way to implement this is to generate the last key with the encryption Key Expansion Module, and then use a reverse Key Expansion Module to generate each sub-key in reverse order, as shown in Figure 5. However, this method requires a large extra circuit and a large S-box. In this paper, we propose a novel way to greatly reduce the number of registers by generating all sub-keys in the encryption Key Expansion Module and storing them into registers or RAMs before decryption begins.

[Figure 5. Original Key Reverse Buffer]

In our proposed architecture, we share maximum similarity with the encryption circuit, and the registers can be reduced as shown in Figure 6. Sub-key $K_i$ is generated and stored into Reg$_i$ at the i-th clock cycle, where i runs from 1 to 11. Notice that these 11 registers are only used once in decryption; therefore, we can reduce their number to 6. Sub-keys are stored into registers from the 5th clock cycle. Sub-keys K0 to K4 are generated and stored into registers after decryption begins. The multiplexers before the registers are controlled by the decryption begin signal "de".

[Figure 6. ASRR for the Key Reverse Buffer]

The timing sequence of the registers is shown in Figure 7. At the fifth clock, we store key K5 to R0, and at the next clock key K6 to R1, and so on until we store key K10 to R5. Now decryption starts, and we use key K10, previously stored in R5, at the first clock cycle of the decryption. At the same time, key K1 is generated and stored in R5. In the next cycle, key K9, previously stored in R4, is used at the second clock cycle of the decryption. At the same time, key K2 is generated and stored in R4. We repeat the operation until key K6, previously stored in R1, is used at the fourth clock cycle of the decryption and key K4 is generated and stored in R1. By using this mechanism, we can save five 128-bit registers; in this paper this is called the ASRR (Application Specific Register Reduction) scheme.
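The register reuse described above can be illustrated with a small behavioural model. The Python sketch below is not the authors' RTL. It assumes one self-consistent reading of the sequence: K5 to K10 are loaded into R0 to R5 before decryption, each register is refilled with the next regenerated key (K1, K2, ...) in the cycle in which its old key is consumed, and K0 is taken directly from the cipher key. The function name `asrr_schedule` and the register chosen for each regenerated key beyond K1 and K2 are assumptions of this sketch and may differ from Figure 7.

```python
def asrr_schedule(num_keys=11, num_regs=6):
    """Model the Key Reverse Buffer with num_regs registers (assumed behaviour).

    Returns a list of (decryption_cycle, key_index, register) tuples; register
    is None for K0, which this sketch takes straight from the cipher key.
    """
    first_stored = num_keys - num_regs          # K5 for AES-128 with 6 registers
    # Fill phase: K5..K10 are written to R0..R5 as they are generated.
    regs = [first_stored + r for r in range(num_regs)]

    trace = []
    next_regen = 1                              # K1 is the first regenerated key
    for cycle in range(1, num_keys):            # decryption cycles 1..10
        needed = num_keys - cycle               # K10, K9, ..., K1
        r = regs.index(needed)                  # ValueError if the key was lost
        trace.append((cycle, needed, r))
        if next_regen < first_stored:           # regenerate K1..K4 only
            regs[r] = next_regen                # reuse the register just freed
            next_regen += 1
    trace.append((num_keys, 0, None))           # K0 is the cipher key itself
    return trace


if __name__ == "__main__":
    for cycle, key, reg in asrr_schedule():
        where = "cipher key" if reg is None else f"R{reg}"
        print(f"decryption cycle {cycle:2d}: use K{key} from {where}")
```

Under these assumptions every round key from K10 down to K1 is found in one of the six registers in the cycle in which decryption needs it; `regs.index` would raise `ValueError` if a key that is still needed had been overwritten.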
[Figure 7. Timing and Waveform View for the Proposed ASRR Scheme]

B. Locally Explicit Clock Enabling (LECE):
In general, RTL code whose output depends on some initial condition synthesizes into a flip-flop with a MUX in the feedback path. We have removed the MUX in the feedback loop by implementing a gated clock. The difference between LECE and traditional clock gating is twofold: i) traditional clock gating inserts clock-gating cells globally during synthesis, based on maximum fanout number and maximum bus width, so it is far from the optimal solution; and ii) LECE judiciously investigates the clock signal and the enable signal, and then finds which registers should be clock gated for the optimal solution in terms of total power, that is, dynamic and leakage power. We have implemented this technique mainly in the Key Expansion Unit and the Key Reverse Buffer block of the decryption module of AES.

The control block of the AES core performs several functions; one of its important functions is to keep track of the number of rounds and of the sub-keys generated by the key expansion unit. We have considered a 128-bit key and hence have to keep a count of 10. Consider Figure 8(a), in which we get the 'kcnt' output on a rising edge of 'clk', but only when the signal 'kld' or 'kb_ld' is high. If the enable signal is low for a significant part of circuit operation, and if 'D = 10' and 'kcnt' are multi-bit buses, which they are, then a substantial amount of the power dissipated by the clock driver is wasted. We have implemented a technique that gates the clock and thus reduces the power dissipation by a significant percentage.

As shown in Figure 8(b), we replace the clock input to the flip-flop with an AND gate whose inputs are the clock and the 'EN = kld | kb_ld' signal. We have used a latch so that when the clock is high, no activity on the enable is transferred to the clock input. We implemented our technique at RTL, so that we obtain a new module as shown in Figure 8(b).

[Figure 8. Implementation of Locally Explicit Clock Enabling (LECE): (a) 1-bit of initial control block; (b) after implementing LECE]

C. Bus Specific Clock (BSC):
The schematic and timing diagram in Figure 9 show a register whose data input is active during one phase of operation only and does not change for a long period of time. The main goal of this technique is first to find buses in the design that have low switching activity; if we can then create a clock enable signal by detecting changes on the bus, we can save power.

[Figure 9. Data Bus Specific Clock]

In the security algorithm AES, there is a potential candidate residing inside the Key Expansion Unit. For generating sub-keys in Roundkey[i], we XOR the previous key generated in Roundkey[i-1] with Rcon[i] and subword.

[Figure 10. RCON Implementation]

Here Rcon[i] consists of a 32-bit bus whose output 'out[31:0]' takes the values 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1b, 0x36 for the 10 rounds respectively. Thus, we can observe that the 24-bit LSB bus out[23:0] is infrequently used. In Figure 11(a), we can see that out[31:0] (data) is active for a very small amount of time, while the clock is applied continuously. This results in a lot of power dissipation in the clock driver as well as in the circuitry inside the register.

We can avoid this bottleneck by constructing an enable signal that detects changes on the bus. Please refer to Figure 11(b). We XOR the next state of each bit with the previous one to check whether they are the same, and then an N-bit OR is used to determine whether any bits changed. If no bits changed, there is no point in enabling the clock. The latch is used to avoid any glitches at the AND output; otherwise an accidental clock pulse would be applied to the register, which is undesirable.
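The enable generation described above (per-bit XOR of next and current state, followed by an N-bit OR) can be illustrated with a short behavioural model. The Python sketch below is not the authors' RTL. It assumes that the Rcon byte sits in out[31:24], so that out[23:0] stays at zero, and that each Rcon word is held on the bus for a few clock cycles; the function name `clock_events` and the hold time of 4 cycles are illustrative assumptions.

```python
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]


def clock_events(d_values, width, reset=0):
    """Count clock pulses reaching a width-bit register behind a BSC enable."""
    mask = (1 << width) - 1
    q = reset & mask                     # current register contents
    events = 0
    for d in d_values:
        changed_bits = (d & mask) ^ q    # XOR next state with current state
        enable = changed_bits != 0       # OR-reduce over all bits
        if enable:                       # clock pulse passes the AND gate
            q = d & mask
            events += 1
    return events


if __name__ == "__main__":
    hold = 4  # assumed: each Rcon word stays on the bus for 4 clock cycles
    words = [(r << 24) for r in RCON for _ in range(hold)]
    ungated = len(words)                 # a free-running clock pulses every cycle
    upper = clock_events([w >> 24 for w in words], width=8)
    lower = clock_events(words, width=24)
    print(f"ungated pulses per flip-flop: {ungated}")
    print(f"gated pulses, out[31:24]:     {upper}")
    print(f"gated pulses, out[23:0]:      {lower}")
```

In this model the lower 24 flip-flops never receive a clock pulse, and the upper byte is clocked only when a new Rcon value arrives, which is the saving the technique targets.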
[Figure 11. Implementation of Bus Specific Clock (BSC): (a) 32-bit of initial RCON block; (b) after implementing BSC]

IV. SIMULATION RESULTS
We designed and implemented the AES and the SHA1 cores in Verilog at the RTL and synthesized them to the gate level using a 65 nm, 1.0 Volt, standard-cell CMOS technology. We used PowerTheater for power analysis, NC-Verilog for RTL simulation, Design Compiler for synthesis, and Power Compiler for traditional clock-gating implementation. We have included the results for power, area and speed at RTL and also at gate level. The following tables compare our results with the previous compact ASIC designs for AES and SHA1.

A. Comparison Results for AES
After doing initial power analysis at RTL, we applied three power reduction techniques to the AES core at RTL; the results are tabulated in Tables 1-4. We can observe that with 65 nanometer industry technology, our proposed schemes demonstrated 45.6% total power reduction (dynamic and cell leakage power) at RTL and 44.57% total power reduction, 10.43% area reduction, and 5.78 Gbps throughput with 452 MHz circuit speed at gate level. Table 1 shows the power reduction results at RTL, Table 2 shows the power reduction results at gate level, Table 3 shows the area reduction, and Table 4 shows the maximum circuit speed and throughput of the AES implementation, compared with the conventional design method and the traditional clock-gating design.

Table 1. RTL Power Dissipation Comparisons for AES

AES RTL                       Static Power Dissipation      Dynamic Power Dissipation     Total
                              Internal  Clock     Total     Internal  Clock     Total
                              Leakage   Leakage   Leakage   Dynamic   Dynamic   Dynamic
Original                      5.72uW    74.4nW    5.79uW    14.9mW    2.99mW    17.8mW    17.9mW
Power Reduction Techniques:
  Register Reduction          5.41uW    59.5nW    5.47uW    13.5mW    2.38mW    15.8mW    15.8mW
  Explicit Clock Enable       5.42uW    70.5nW    5.49uW    8.91mW    999uW     9.91mW    9.92mW
  Bus Specific Clock          5.79uW    73.8nW    5.86uW    14.7mW    2.93mW    17.7mW    17.7mW
Combining above all three     5.31uW    56.2nW    5.37uW    9.12mW    916uW     10mW      10mW
LECE & BSC                    5.49uW    69.9nW    5.56uW    8.77mW    947uW     9.72mW    9.73mW

Table 2. Gate-level Power Dissipation Comparisons for AES (after synthesis with 65 nm tech.)

Table 3. Gate-level Area Comparisons for AES (after synthesis with 65 nm tech.)

AES GATE                      Area (Min Inverter Area: 1.08)
                              Combinational   Sequential      Total
Original                      58352.609375    23870.750000    82222.562500
Traditional Clock Gating      58340.003906    18760.798828    77100.484375
Power Reduction Techniques:
  Register Reduction          57282.164062    19723.982422    77005.445312
  Explicit Clock Enable       58326.691406    18769.437500    77095.804688
  Bus Specific Clock          58588.035156    23880.830078    82468.078125
Combining above all three     57556.742188    16087.537109    73643.757812
LECE & BSC                    58562.148438    18779.515625    77341.320312

Table 4. Gate-level Delay and Throughput Comparisons for AES (after synthesis with 65 nm tech.)

AES GATE                      Critical Path   Frequency with 10%     Throughput
                              Delay (ns)      slack margin (MHz)     (Gb/sec)
Original                      1.99            452                    5.78
Traditional Clock Gating      1.99            452                    5.78
Power Reduction Techniques:
  Register Reduction          2.03            443                    5.67
  Explicit Clock Enable       1.99            452                    5.78
  Bus Specific Clock          1.99            452                    5.78
Combining above all three     1.99            452                    5.78
LECE & BSC                    1.99            452                    5.78

B. Comparison Results for SHA1
We applied three power reduction techniques to the SHA1 core at RTL; the results are tabulated in Tables 5-8. We can observe that with 65 nanometer industry technology, our proposed schemes demonstrated 65.33% total power reduction (dynamic and cell leakage power) at RTL and 63.26% total power reduction and 12.72% area reduction without compromising the speed, 1.28 GHz at gate level.
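The headline figures quoted in this section follow from the table entries by simple arithmetic. As a sanity check, the Python sketch below recomputes the AES numbers from Tables 1, 3 and 4. Two modelling choices are assumptions of this sketch rather than statements from the paper: the "10% slack margin" is read as running at 90% of the rate implied by the critical path, and the throughput assumes one 128-bit block every 10 clock cycles. The helper names are illustrative.

```python
def reduction_pct(before, after):
    """Relative saving of `after` with respect to `before`, in percent."""
    return 100.0 * (before - after) / before


def derated_freq_hz(critical_path_ns, slack=0.10):
    """Clock frequency after reserving `slack` of the ideal rate (assumed model)."""
    return (1.0 - slack) / (critical_path_ns * 1e-9)


def throughput_gbps(freq_hz, block_bits=128, cycles_per_block=10):
    """Block-cipher throughput; 10 cycles per 128-bit block is an assumption."""
    return block_bits * freq_hz / cycles_per_block / 1e9


if __name__ == "__main__":
    # Table 1 (AES, RTL): total power in mW, Original vs. LECE & BSC.
    print(f"AES RTL power saving: {reduction_pct(17.9, 9.73):.1f} %")
    # Table 3 (AES, gate level): total area, Original vs. all three techniques.
    print(f"AES area saving:      {reduction_pct(82222.5625, 73643.757812):.2f} %")
    # Table 4 (AES, gate level): 1.99 ns critical path, 452 MHz reported.
    print(f"AES frequency:        {derated_freq_hz(1.99) / 1e6:.1f} MHz")
    print(f"AES throughput:       {throughput_gbps(452e6):.3f} Gb/s")
```

Under these assumptions the recomputed values agree with the reported 45.6%, 10.43%, 452 MHz and 5.78 Gbps to within rounding.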
Table 5 shows the power reduction results at RTL, Table
6 shows the power reduction results at gate level, Table 7
shows the area reduction, and Table 8 shows max circuit
speed of SHA1 implementation, comparing with
conventional design method and traditional clock-gating design.

Table 5. RTL Power Dissipation Comparisons for SHA1

SHA1 RTL                      Static Power Dissipation      Dynamic Power Dissipation     Total
                              Internal  Clock     Total     Internal  Clock     Total
                              Leakage   Leakage   Leakage   Dynamic   Dynamic   Dynamic
Original                      2.34uW    66.7nW    2.41uW    4.15mW    1.09mW    5.24mW    5.25mW
Power Reduction Techniques:
  Explicit Clock Enable       2.15uW    64.9nW    2.21uW    1.79mW    333uW     2.12mW    2.12mW
  Bus Specific Clock          2.33uW    66.4nW    2.4uW     3.87mW    1.02mW    4.9mW     4.9mW
Combining above two           2.14uW    64.8nW    2.21uW    1.56mW    266uW     1.82mW    1.82mW

Table 6. Gate-level Power Dissipation Comparisons for SHA1 (after synthesis with 65 nm tech.)

SHA1 GATE                     Static Power Dissipation      Dynamic Power Dissipation     Total
                              Internal  Clock     Total     Internal  Clock     Total
                              Leakage   Leakage   Leakage   Dynamic   Dynamic   Dynamic
Original                      1.99uW    65.8nW    2.05uW    5.01mW    1.36mW    6.37mW    6.37mW
Traditional Clock Gating      1.97uW    52.4nW    2.02uW    2.17mW    501uW     2.67mW    2.68mW
Power Reduction Techniques:
  Explicit Clock Enable       1.96uW    63.9nW    2.02uW    2.16mW    520uW     2.68mW    2.68mW
  Bus Specific Clock          2.08uW    64.8nW    2.15uW    4.75mW    1.28mW    6.03mW    6.03mW
Combining above all three     2.02uW    63.1nW    2.08uW    1.9mW     441uW     2.34mW    2.34mW

Table 7. Gate-level Area Comparisons for SHA1 (after synthesis with 65 nm tech.)

SHA1 GATE                     Area (Min Inverter Area: 1.08)
                              Combinational   Sequential      Total
Original                      6671.159668     21681.474609    28352.880859
Traditional Clock Gating      6381.723145     17775.910156    24157.800781
Power Reduction Techniques:
  Explicit Clock Enable       6350.041504     17784.912109    24135.121094
  Bus Specific Clock          7391.526855     21691.556641    29083.320312
Combining above all three     6934.683105     17809.392578    24744.240234

Table 8. Gate-level Delay Comparisons for SHA1 (after synthesis with 65 nm tech.)

SHA1 GATE                     Critical Path   Frequency with 10%
                              Delay (ns)      slack margin (GHz)
Original                      0.7             1.28
Traditional Clock Gating      0.7             1.28
Power Reduction Techniques:
  Explicit Clock Enable       0.7             1.28
  Bus Specific Clock          0.7             1.28
Combining above all three     0.7             1.28

V. CONCLUSION
In this paper we presented the design and implementation of a compact AES and SHA1 ASIC core suitable for wireless sensor networks and RFID. Compared to previous designs, we achieved significantly lower power and lower area in both the AES and SHA1 cases by using the proposed novel design techniques. We implemented the proposed ASRR (application specific register reduction), LECE (locally explicit clock enabling), and BSC (bus specific clock) at RTL and evaluated them at gate level in an ASIC flow. The RTL soft intellectual property generated with the techniques in this paper can be used directly in any ASIC design flow and can be applied to any technology node. With 65 nanometer industry technology, our proposed schemes demonstrated 44.57% power reduction, 10.43% area reduction, and 5.78 Gbps throughput with 452 MHz circuit speed for AES, and 63.26% power reduction and 12.72% area reduction at 1.28 GHz circuit speed for SHA1.

VI. ACKNOWLEDGMENT
The authors gratefully acknowledge the contribution of reviewers' comments.