A 64-bit RISC Dual-Core microprocessor with high performance and low power consumption is
presented in this paper. The processor has a symmetric architecture with two cores, each of which has
a three-stage pipeline, a 64-bit data path and a 64-bit address port. A novel shared register module, a
redundant Booth3 algorithm and a leapfrog Wallace tree architecture are introduced into the
microprocessor, and both its performance and its power consumption are improved substantially. As the
synthesis and FPGA simulation results indicate, the power consumption is decreased by up to 14% and
the longest data path is shortened by about 25%.
2. ◼ ISSN: 1693-6930
TELKOMNIKA Vol. 16, No. 2, April 2018 : 463 – 470
464
Figure 1. Architecture of dual-core
Correlation of RAW (Read After Write). Instruction j reads source register Rx before
instruction i has written its result to destination register Rx. Instruction j therefore gets a stale
operand, which is a wrong operand.
[Figure 2 block labels: Decoding Unit & Controlling Unit; Multiplexer & Address Latch; Address Adder;
Output Enable; Address Port (64 bits); PC Register; Registers; Division Register; Bus Preselection Unit;
Arithmetic Logic; Multiplier; Field Compression; Bus Arbitration Unit; Data Output Port (64 bits);
Barrel Shifter; JTAG Unit; Instruction Register; Multiplexer; Instruction Cache; Field Extraction;
Field Expansion; Data Output Latch; Data Input Port (64 bits); Temporary Storage of Multiplexer Data;
buses A, B and C; signals NM, INT, CLK, RST, OE, WE.]
Figure 2. Architecture of single-core
Correlation of WAW (Write After Write). Instruction i and instruction j write the same
destination register Rx, but instruction j writes earlier than, or at the same time as,
instruction i. The writes therefore occur in the wrong order, so the value of destination register
Rx becomes indeterminate or comes from instruction i instead of instruction j.
Correlation of WAR (Write After Read). Instruction j writes its result to destination register
Rx before instruction i reads source register Rx. Instruction i therefore gets the new value, which
is a wrong operand.
Correlation of RAR (Read After Read). Instruction i and instruction j read from the same source
register Rx. This situation does not cause a data hazard.
[Figure 1 block labels: pipeline controlling; Core I; Core II; Shared Reg; Stack; I/O Port;
Shared Cache In Chip.]
3. TELKOMNIKA ISSN: 1693-6930 ◼
Research of 64-bits RISC Dual-Core Microprocessor with High Performance... (Gang Zou)
Figure 3. RAW correlation
As shown above, all four kinds of correlation are caused by operations on the same
register. RAR correlation obviously does not lead to a data hazard. Fortunately, we do not
need to handle WAR correlation either. This Dual-Core uses a sequential instruction issue strategy,
and the pipeline of each core reads source operands in the instruction decoding stage and writes
destination operands in the execution stage. The read of a source operand therefore always occurs
before a later instruction writes its destination operand, so WAR correlation cannot arise. Because
of the sequential issue strategy, the instruction execution order of this Dual-Core is the same as
the program order. We can therefore avoid WAW correlation simply by writing the result of the later
instruction j to the destination register when two adjacent instructions write the same register.
The only correlation we need to handle is RAW.
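The four definitions above can be restated as a small check on register names. The following is a minimal sketch, not the processor's detection logic: the `Instr` record, the register names and the helper `correlations` are hypothetical and exist only for illustration.

```python
from typing import NamedTuple, FrozenSet, Set


class Instr(NamedTuple):
    """Hypothetical instruction record: one destination, several sources."""
    dst: str
    srcs: FrozenSet[str]


def correlations(i: Instr, j: Instr) -> Set[str]:
    """Classify the correlations between earlier instruction i and later instruction j."""
    found = set()
    if i.dst in j.srcs:
        found.add("RAW")   # j reads what i writes
    if i.dst == j.dst:
        found.add("WAW")   # both write the same register
    if j.dst in i.srcs:
        found.add("WAR")   # j writes what i reads
    if i.srcs & j.srcs:
        found.add("RAR")   # both read the same register
    return found


def needs_handling(i: Instr, j: Instr) -> bool:
    """Only RAW has to be handled under sequential issue, as argued above."""
    return "RAW" in correlations(i, j)


# Example: i writes R1, j reads R1 (RAW) and both read R2 (RAR).
i = Instr("R1", frozenset({"R2", "R3"}))
j = Instr("R4", frozenset({"R1", "R2"}))
print(sorted(correlations(i, j)), needs_handling(i, j))
```

Under the sequential issue strategy described above, only the RAW result of this classification would trigger any action in the pipeline.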
The solution to the RAW data hazard is as follows. There are two kinds of RAW correlation,
depending on whether the correct value to be written to the register has already been produced when
instruction j reads that register in the decoding stage. If the correct value has been produced,
there is no correlation between the two cores, because neither core has entered the
execution stage. Therefore we only need to handle the RAW correlation in which the correct value
has not yet been produced when instruction j reads the register in the instruction decoding stage.
As shown in Figure 3, assume that there is a RAW correlation between instructions M
and M+1, which means that the destination operand of instruction M is the source operand of
instruction M+1. In the first period, core I fetches instruction M and, at the same
time, core II fetches instruction M+1. In the second period, core I and core II fetch instructions
M+2 and M+3, and at the same time the RAW correlation is detected in the decoding stage.
At this point, we clear the instruction registers on the negative clock edge to stop the decoding
of instructions M+2 and M+3. In the third period, core I fetches instruction M+1 and, at the same
time, core II fetches instruction M+2. By the time instructions M+1 and M+2 are being decoded,
instruction M has been executed. That is to say, the correlation has been handled because the
destination operand has been produced.
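The fetch order described in this example can be reproduced with a toy model. This is a minimal sketch under simplifying assumptions, not a model of the actual pipeline control: the function `fetch_schedule` and its arguments are hypothetical, instructions are numbered from 0, and only a RAW correlation inside one fetched pair triggers a replay.

```python
def fetch_schedule(n_instr, raw_pairs):
    """Return one (core_I, core_II, cleared) tuple per fetch period.

    raw_pairs is a set of (producer, consumer) instruction indices with a
    RAW correlation. cleared=True marks a fetched pair whose instruction
    registers are cleared before decoding.
    """
    def slot(k):
        return k if k < n_instr else None

    schedule = []
    pc = 0
    while pc < n_instr:
        schedule.append((slot(pc), slot(pc + 1), False))
        if pc + 1 < n_instr and (pc, pc + 1) in raw_pairs:
            # The correlation is detected while this pair is decoded; the pair
            # fetched in that same period is cleared and fetching restarts at
            # the consumer, which now goes to core I.
            schedule.append((slot(pc + 2), slot(pc + 3), True))
            pc += 1
        else:
            pc += 2
    return schedule


# Instructions M..M+3 are numbered 0..3, with a RAW correlation between M and M+1.
for period, (core1, core2, cleared) in enumerate(fetch_schedule(4, {(0, 1)}), start=1):
    print(period, core1, core2, "cleared" if cleared else "")
```

With these inputs the first three periods fetch (M, M+1), then (M+2, M+3) which is cleared, then (M+1, M+2), which matches the sequence described for Figure 3.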
3. Design of High-speed Algorithm and Low-power Architecture
The multiplier is one of the most important parts of this chip and lies on the critical path.
To a great extent, it therefore determines the performance of the whole Dual-Core system.
The Booth algorithm is a popular way to reduce the number of partial products by
recoding the multiplier, while the Wallace tree architecture is an efficient method to compress
the partial products with a short carry delay. Both are widely applied to improve the speed and
power consumption of multipliers. However, the traditional Booth3 algorithm has to generate the
tripled partial product, which lengthens the critical path and slows the multiplier down. The
traditional Wallace tree architecture, in turn, cannot generate the carry and the pseudo sum
synchronously, which causes unnecessary 0-1 transitions and redundant dynamic power consumption.
[Figure 3 diagram labels: pipeline stages Instruction Fetching, Instruction Decoding and Instruction
Executing, with "X X" marking two of the rows after fetching; rows Instruction M (Core I), M+1
(Core II), M+2 (Core I), M+3 (Core II), M+1 (Core I), M+2 (Core II).]
This paper presents a novel design of the Wallace binary multiplier. It proposes a
redundant Booth3 algorithm to avoid the difficulty of generating the tripled partial product, and
it also presents a novel leapfrog Wallace tree architecture that generates the carry and the pseudo
sum synchronously, which eliminates the unnecessary 0-1 transitions and reduces the power
consumption of the multiplier. These improvements are applied to the multiplier for testing and
simulation. The simulation results show that they improve the performance and decrease the power
consumption of the multiplier.
3.1. A Redundant Booth3 Multiplier for High Speed
A redundant Booth3 algorithm is studied in this paper to improve the multiplier’s speed,
because the multiplier speed is an important factor in system performance.
In the Booth3 algorithm, an n-bit binary number X is recoded in blocks with a 4-bit
scanning window $x_{i+2}x_{i+1}x_{i}x_{i-1}$, and each block takes the value
$-4x_{i+2}+2x_{i+1}+x_{i}+x_{i-1}$. Ignoring the bit that overlaps between adjacent blocks, the
recoding uses three bits of the multiplier at a time, instead of two as in the Booth2 algorithm, to
generate one partial product, which reduces the number of partial products and increases the speed.
For the 64-bit multiplier, a zero bit is appended after the least significant bit and a sign bit is
added before the most significant bit. The binary number is then recoded block by block to generate
the partial products. Each partial product is selected from {0, ±M, ±2M, ±3M, ±4M}, where M stands
for the multiplicand. Every partial product is shifted 3 bits left or right relative to the one
before it. The direction is decided by the recoding sequence: the big-endian sequence shifts the
partial products right, while the little-endian sequence shifts them left. In binary
multiplication, 0 and M are available directly, while 2M and 4M are obtained by shifting the
multiplicand left by one bit and two bits respectively. The case of 3M is not as easy. Computing it
directly as 2M+M incurs the long delay of a carry-propagate adder, especially when the multiplicand
is wide, and it cannot be obtained by shifting.
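The recoding rule can be checked numerically. The sketch below is illustrative only and is not the hardware recoder: the function names `booth_recode` and `booth_value` are hypothetical, and the digit count assumes one extra sign bit above the 64-bit operand, as described above.

```python
def booth_recode(x, n_bits=64, radix_bits=3):
    """Recode a signed n_bits integer into Booth digits, least significant first.

    radix_bits=3 gives Booth3 digits in -4..4; radix_bits=2 gives Booth2 digits in -2..2.
    """
    assert -(1 << (n_bits - 1)) <= x < (1 << (n_bits - 1))

    def bit(i):
        # x_{-1} is the appended zero; Python's >> sign-extends negative ints.
        return 0 if i < 0 else (x >> i) & 1

    n_digits = -(-(n_bits + 1) // radix_bits)   # ceil((n_bits + 1) / radix_bits)
    digits = []
    for k in range(n_digits):
        base = radix_bits * k
        d = bit(base - 1)                        # overlapping bit from the block below
        for j in range(radix_bits - 1):
            d += bit(base + j) << j              # positive weights 1, 2, ...
        d -= bit(base + radix_bits - 1) << (radix_bits - 1)   # top bit has negative weight
        digits.append(d)
    return digits


def booth_value(digits, radix_bits=3):
    """Recombine Booth digits into the integer they represent."""
    return sum(d << (radix_bits * k) for k, d in enumerate(digits))


x = -123456789
d3 = booth_recode(x, 64, 3)
d2 = booth_recode(x, 64, 2)
print(len(d3), len(d2), booth_value(d3, 3) == x, booth_value(d2, 2) == x)
```

For a 64-bit operand this yields 22 digits for Booth3 and 33 for Booth2, and every digit of magnitude 3 corresponds to a 3M partial product that cannot be formed by shifting alone.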
The 64-bit multiplier proposed for this chip uses the little-endian sequence, and
the number of partial products is reduced from 33 to 22. To generate the 3M partial product, an
adder composed of 4-bit adders operating in parallel is adopted; the carry-in signals are not
passed between the adders but form another partial product. There are 8 carry-in signals for
each partial product, so the 22 partial products have 176 carry-in signals. At any given bit weight
there are at most 6 carry-in signals, so all 176 carry-in signals can be compressed
into 6 partial products. This shortens the carry delay of 3M generation from a 64-bit carry chain
to a 4-bit carry chain. However, it also introduces 6 more partial products. Their effect is
discussed in the next paragraph.
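The idea of keeping block carries as a separate word can be sketched as follows. This is an illustration of the principle only, for an unsigned multiplicand: the function `triple_carry_save` is hypothetical, and the number of blocks follows from the width chosen here and is not meant to reproduce the count of 8 carry-in signals per partial product stated above.

```python
def triple_carry_save(m, width=64, block=4):
    """Split 3*m = m + 2*m into a block-wise sum word and a carry word.

    Each block adds its slices of m and 2*m without a carry-in from the block
    below, so the longest carry chain is one block wide.
    """
    assert 0 <= m < (1 << width)
    a, b = m, m << 1
    total_bits = width + 2                  # 3*m needs at most width + 2 bits
    mask = (1 << block) - 1
    sum_word = 0
    carry_word = 0
    for pos in range(0, total_bits, block):
        s = ((a >> pos) & mask) + ((b >> pos) & mask)
        sum_word |= (s & mask) << pos                  # low bits stay in place
        carry_word |= (s >> block) << (pos + block)    # carry-out kept at its own weight
    return sum_word, carry_word


m = 0x123456789ABCDEF0
s, c = triple_carry_save(m)
print(s + c == 3 * m)
```

The sum word and the carry word together still represent 3M; the final addition is deferred to the compression tree, which is why the carries appear as additional partial products.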
In two's complement arithmetic, every partial product must have its sign bit extended to the
most significant bit, which increases the amount of data to be processed. To avoid the extension,
a fixed value, called the counteractive digit, is introduced as another partial product to
compensate for the omitted sign bits. Figure 4 shows the data amount for the shift-add algorithm,
which is 64+65+...+126+127=6112 bits. The Booth2 algorithm has to deal with
67+69+...+129+131=3267 bits, as shown in Figure 5. Figure 6 shows a data amount of
66×22+128+176=1756 bits for the method described in this paper. All three figures are based on a
64-bit multiplier. The third method, based on the Booth3 algorithm, clearly has much less data to
process, which helps in realizing a 64-bit high-speed multiplier.
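The three data amounts quoted above can be recomputed directly from the series in the text. The variable names below are illustrative only.

```python
# Shift-add: 64 partial products of 64, 65, ..., 127 bits.
shift_add_bits = sum(range(64, 128))

# Booth2: 33 partial products of 67, 69, ..., 131 bits.
booth2_bits = sum(range(67, 132, 2))

# Modified Booth3: 22 partial products of 66 bits, plus the 128-bit
# counteractive digit and 176 carry-in signals.
booth3_bits = 66 * 22 + 128 + 176

print(shift_add_bits, booth2_bits, booth3_bits)   # 6112 3267 1756
```

On these figures the modified Booth3 method processes a little over half the data of Booth2 and under a third of the data of the shift-add algorithm.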
Figure 4. The complement code algorithm
Figure 5. The Booth2 algorithm
Figure 6. The modified Booth3 algorithm
The synthesis results show that the modified Booth3 algorithm is 23% faster than
the traditional Booth2 algorithm, with both synthesized in a 0.18 μm CMOS
technology at 50 MHz.
3.2. A Leapfrog Wallace Architecture for Low Power
The power consumption of an integrated circuit comes mainly from the charging and
discharging current of the load capacitance, which is the dynamic power
$P = \frac{1}{2} C V^{2} E_{SW} F_{CLK}$
where $C$ is the load capacitance, $V$ is the power supply voltage, $E_{SW}$ is the switching
activity (jump frequency) and $F_{CLK}$ is the clock frequency. For a given process technology,
power supply voltage and clock frequency, power consumption can be reduced by decreasing the number
of switching logic cells.
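The dependence of dynamic power on switching activity can be illustrated with a short calculation. The numeric values below are hypothetical and chosen only to show the proportionality; they are not measurements from this design.

```python
def dynamic_power(c_load, v_dd, e_sw, f_clk):
    """Dynamic power P = 1/2 * C * V^2 * E_SW * F_CLK, in watts for SI inputs."""
    return 0.5 * c_load * v_dd ** 2 * e_sw * f_clk


# Hypothetical operating point: 1 nF total load, 1.8 V supply, 50 MHz clock.
baseline = dynamic_power(c_load=1e-9, v_dd=1.8, e_sw=0.2, f_clk=50e6)
reduced = dynamic_power(c_load=1e-9, v_dd=1.8, e_sw=0.1, f_clk=50e6)

# Halving the switching activity halves the dynamic power at fixed C, V and F_CLK.
print(reduced / baseline)
```

With capacitance, voltage and frequency fixed, the power scales linearly with the switching activity, which is the quantity the leapfrog Wallace tree aims to reduce.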
The Wallace tree is theoretically the fastest adder tree for multiplication. However, the carry
c and the pseudo sum s have different generation times, so the carry, which is generated faster,
must wait for the pseudo sum. As depicted in Figure 7, the pseudo sum s takes 6 td while the carry
c takes 4 td, where td stands for the delay of a complex logic gate. In the traditional Wallace
tree, the carry therefore waits 2 td for the pseudo sum. This slows down the compression and causes
unnecessary 0-1 transitions, which increase the power consumption of the compressor. Furthermore,
the irregularity of the Wallace tree architecture increases the wiring delay and area.
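For reference, the logical function of a 4-2 compressor can be written down and checked exhaustively. The sketch below uses the common textbook construction from two full adders, which is an assumption and may differ from the gate-level circuit of Figure 7; it models logic values only and says nothing about the td delays discussed above.

```python
from itertools import product


def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry)."""
    s = a ^ b ^ c
    carry = (a & b) | (b & c) | (a & c)
    return s, carry


def compressor_4_2(x1, x2, x3, x4, cin):
    """4-2 compressor built from two full adders.

    Returns (pseudo_sum, carry, cout) with
    x1 + x2 + x3 + x4 + cin == pseudo_sum + 2 * (carry + cout).
    """
    s1, cout = full_adder(x1, x2, x3)        # cout does not depend on cin
    pseudo_sum, carry = full_adder(s1, x4, cin)
    return pseudo_sum, carry, cout


def check_compressor():
    """Exhaustively verify the arithmetic identity for all 32 input patterns."""
    for bits in product((0, 1), repeat=5):
        s, c, co = compressor_4_2(*bits)
        if sum(bits) != s + 2 * (c + co):
            return False
    return True


print(check_compressor())   # True
```

Because the pseudo sum passes through both adder stages while the carries leave earlier, the two outputs become valid at different times, which is the timing mismatch described above.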
[Figures 4 to 6 diagram labels: Figure 4, 64 partial products with sign extension, bit positions 0 to
126; Figure 5, 33 partial products with sign extension, bit positions 0 to 130; Figure 6, 22 partial
products, six carry-in signal rows and the counteractive digit, bit positions 0 to 128.]
Figure 7. The traditional 4-2 compressor
A new Wallace tree architecture, named the leapfrog Wallace tree and shown in Figure 8-B, is
presented in this paper to resolve the disadvantages of the traditional one described above. It
generates the carry and pseudo sum synchronously through the leapfrog connection, which increases
the compression speed of the partial products and avoids the unnecessary 0-1 transitions of the
traditional tree, reducing the instantaneous power consumption. The traditional Wallace tree
architecture is depicted in Figure 8-A. Compared with it, the leapfrog Wallace tree has a 2 td
advantage in critical path delay if the wiring delay is not considered. However, wiring delay
accounts for most of the delay in deep submicron CMOS technologies. The leapfrog Wallace tree is
much more regular and has shorter wires, so it also has a shorter wiring delay and lower
instantaneous power.
Compared with the traditional Wallace tree, the synthesis results show that the leapfrog
Wallace tree is 26% faster, consumes 20% less power and occupies 13% less area,
with both synthesized in a 0.18 μm CMOS technology at 100 MHz.
A. Traditional Wallace Tree B. Leapfrog Wallace Tree
Figure 8. Architecture comparison of the Wallace trees
Figure 9 and Figure 10 show the difference in instantaneous power consumption
between the two Wallace tree architectures. Both results are generated with the
same test vectors. The leapfrog Wallace tree clearly has only one peak in the
instantaneous power plot, while the traditional one has almost three. This means that the leapfrog
Wallace tree consumes less instantaneous power and is much more power-efficient than the
traditional one under the same conditions. The simulation results agree with the preceding
theoretical analysis: the leapfrog Wallace tree architecture avoids the unnecessary 0-1
transitions and so improves the instantaneous power efficiency.
Figure 9. Instantaneous power of traditional Wallace Tree
Figure 10. Instantaneous power of leapfrog Wallace Tree
4. Synthesis and Simulation
4.1. Synthesis with the Synopsys Design Compiler
Both the previous 64-bit Dual-Core microprocessor and the new 64-bit Dual-Core
microprocessor, with the novel shared register model, the novel Booth multiplier and the novel
Wallace tree architecture, were synthesized with a 0.18 μm CMOS library using Synopsys
Design Compiler. The synthesis results show that the new Dual-Core microprocessor has a worst-case
critical path delay of about 9.637 ns in the 0.18 μm CMOS technology, while the previous
design has a worst-case critical path delay of almost 13.187 ns with the same CMOS library. This is
a reduction of about 26.9% in critical path delay compared with the previous design.
The power consumption of the new design reported by Design Compiler is
1204.38 mW at 50 MHz, while that of the previous design is 1400.44 mW at
50 MHz. Therefore, the new 64-bit Dual-Core microprocessor with the modified Booth3
algorithm and completely parallel Wallace tree structure saves about 14% power
at 50 MHz in the same 0.18 μm CMOS technology compared with the previous design based on
the Booth2 algorithm and the traditional Wallace tree.
4.2. FPGA Simulation Result
The 64-bit Dual-Core microprocessor is simulated on the Altera Stratix III
EP3SL150F780C4N FPGA device. Quartus II 12.1sp1 was used to generate the simulation
results. The power reports show that the new 64-bit Dual-Core microprocessor with the novel
Booth multiplier and the novel Wallace tree architecture consumes only 1269.48 mW at
50 MHz. The previous 64-bit Dual-Core microprocessor, on the other hand, consumes 1447.72 mW
at 50 MHz. The improvements thus save 178.24 mW at 50 MHz, or about 12% of the power consumption
of the previous design, which is broadly consistent with the 14% saving reported by Synopsys
Design Compiler.
The FPGA simulation also reports the maximum pad-to-pad delay. It is 14.751 ns for the new
64-bit Dual-Core microprocessor and almost 19.574 ns for the previous design, a reduction of
almost 25%. The operating frequency of the new design reaches 66.6 MHz on the EP3SL150F780C4N
FPGA device.
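The relative improvements in this section can be recomputed from the reported figures. The helper `reduction_percent` and the dictionary layout below are illustrative only; the numbers are those quoted in Sections 4.1 and 4.2.

```python
def reduction_percent(previous, new):
    """Relative reduction of `new` with respect to `previous`, in percent."""
    return 100.0 * (previous - new) / previous


# (previous design, new design) as reported in the text.
reported = {
    "synthesis critical path delay (ns)": (13.187, 9.637),
    "synthesis power at 50 MHz (mW)": (1400.44, 1204.38),
    "FPGA power at 50 MHz (mW)": (1447.72, 1269.48),
    "FPGA pad-to-pad delay (ns)": (19.574, 14.751),
}

for name, (previous, new) in reported.items():
    print(f"{name}: {reduction_percent(previous, new):.1f}% lower")
```

On the reported figures this gives reductions of roughly 26.9% and 14.0% for the synthesis delay and power, and roughly 12.3% and 24.6% for the FPGA power and pad-to-pad delay.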
5. Conclusion
A 64-bit Dual-Core microprocessor is proposed in this paper. Its architecture is based
on the novel shared register model. Its performance is improved by the modified Booth3 algorithm
and its power consumption is reduced by the completely parallel Wallace tree structure. The
synthesis and simulation results indicate that the power consumption is decreased by up to 14% and
the longest data path is shortened by about 25% compared with the previous design.