IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 61, NO. 4, FEBRUARY 15, 2013 921
A High-Performance Energy-Efficient Architecture
for FIR Adaptive Filter Based on New Distributed
Arithmetic Formulation of Block LMS Algorithm
Basant K. Mohanty, Senior Member, IEEE, and Pramod Kumar Meher, Senior Member, IEEE
Abstract—In this paper, we present an efficient distributed-
arithmetic (DA) formulation for the implementation of block
least mean square (BLMS) algorithm. The proposed DA-based
design uses a novel look-up table (LUT)-sharing technique for
the computation of filter outputs and weight-increment terms of
BLMS algorithm. Besides, it offers significant saving of adders
which constitute a major component of DA-based structures. Also,
we have suggested a novel LUT-based weight updating scheme
for BLMS algorithm, where only one set of LUTs out of $M$ sets
needs to be modified in every iteration, where $N = ML$, and $N$ and $L$
are, respectively, the filter length and input block-size. Based
on the proposed DA formulation, we have derived a parallel
architecture for the implementation of BLMS adaptive digital
filter (ADF). Compared with the best of the existing DA-based
LMS structures, proposed one involves nearly times adders and
times LUT words, and offers nearly times throughput of the
other. It requires nearly 25% more flip-flops and does not involve
variable shifters like those of existing structures. It involves less
LUT access per output (LAPO) than the existing structure for
block-size higher than 4. For block-size 8 and filter length 64, the
proposed structure involves 2.47 times more adders, 15% more
flip-flops, 43% less LAPO than the best of existing structures, and
offers 5.22 times higher throughput. The number of adders of the
proposed structure does not increase proportionately with block
size; and the number of flip-flops is independent of block-size.
This is a major advantage of the proposed structure for reducing
its area delay product (ADP); particularly, when a large order
ADF is implemented for higher block-sizes. ASIC synthesis results
show that the proposed structure for filter length 64 has almost
14% and 30% less ADP, and 25% and 37% less energy per output (EPO), than the best
of the existing structures for block sizes 4 and 8, respectively.
Index Terms—Adaptive filters, block LMS, distributed arith-
metic, VLSI.
Manuscript received June 18, 2012; accepted October 07, 2012. Date of
publication October 25, 2012; date of current version January 25, 2013. The
associate editor coordinating the review of this manuscript and approving it for
publication was Prof. Zhiyuan Yan.
B. K. Mohanty is with the Department of Electronics and Communication
Engineering, Jaypee University of Engineering and Technology, Raghogarh, Guna,
Madhya Pradesh, India-473226 (e-mail: bk.mohanti@juet.ac.in).
P. K. Meher is with the Institute for Infocomm Research, 1 Fusionopolis Way,
Singapore-138632 (e-mail: pkmeher@i2r.a-star.edu.sg, url: http://www1.i2r.a-star.edu.sg/~pkmeher/).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TSP.2012.2226453
I. INTRODUCTION
ADAPTIVE DIGITAL FILTERS (ADFs) are widely used
in various signal-processing applications, such as echo
cancellation, system identification, noise cancellation, and
channel equalization [1]. Amongst the existing ADFs,
least mean square (LMS)-based finite impulse response (FIR)
adaptive filter is the most popular one due to its inherent sim-
plicity and satisfactory convergence performance. However,
the delay in availability of the feedback-error for updating the
weights according to the LMS algorithm does not favor its
pipeline implementation when sampling rate is high. Haimi
et al. [2] have proposed the delayed LMS (DLMS) algorithm
for pipeline implementation of LMS-based ADF. The delayed
LMS is similar to the LMS algorithm except that the correction
terms for updating the filter weights of the current iteration are
calculated from the error corresponding to a past iteration.
Several schemes have been proposed to implement the
DLMS-based ADFs efficiently in a systolic VLSI with min-
imum adaptation delay [2]–[4], [7], [8]. To avoid adaptation
delay in pipelined LMS ADF, Poltmann [5] has proposed a
modified DLMS algorithm which is used by Douglas et al.
[6] to derive a systolic architecture. But, the structure of [6]
involves large amount of hardware resources compared to the
earlier one [2].
The block LMS (BLMS) ADF [9] is one of the useful deriva-
tives of the LMS ADF for fast and computationally-efficient
implementation of ADFs. Unlike the conventional LMS ADF,
BLMS ADF accepts a block of input for computing a block of
output and updates the weights using a block of errors in every
training cycle. The BLMS ADF has convergence performance
similar to the LMS ADF, but the BLMS ADF of block-length $L$
offers $L$-fold higher throughput compared with the other.
Keeping this in view, many variants of the BLMS algorithm, like time-
and frequency-domain block filtered-X LMS (BFXLMS), have
been proposed for specific applications [20]. Das et al. [21] have
proposed efficient BFXLMS using FFT and fast Hartley trans-
form (FHT), which is computationally more efficient. We have
proposed a delayed block LMS (DBLMS) algorithm [15], and
a concurrent multiplier-based architecture for high-throughput
pipeline implementation of BLMS ADFs. The structure of [15]
provides $L$-fold higher throughput rate and demands $L$ times
more resources compared to those of DLMS ADF. Baghel et al.
[17], [18] have suggested a distributed-arithmetic (DA)-based
structure for FPGA implementation of BLMS ADFs. A low-
complexity design has been proposed in [19] for BLMS ADFs.
This structure supports a very low sampling rate since it uses
single multiply-accumulate (MAC) cell for the computation of
filter output and weight-increment term.
To take the advantage of DA-based hardware designs [12],
Allred et al. [10] have suggested a scheme to derive a DA-based
design for LMS-ADF. The structure of [10] requires separate
look-up-tables (LUTs) for the calculation of filter output and
weight-increment terms. The LUTs used for the computation of
filter output and weight-increment term of DA LMS-ADF are
named DA-F-LUT and DA-A-LUT, respectively. In every iteration,
the entire content of DA-F-LUT is updated to compute the
weight-increment term, whereas half the content of DA-A-LUT
is updated to accommodate the new input sample arriving at
the current iteration. Updating the LUTs is the most time con-
suming operation in DA-based LMS-ADF, since the updating
is performed sequentially at different LUT locations. The LUT
update time, therefore, depends on the size of the LUT to be
updated. For most practical adaptive filters, we need to use a
decomposition scheme, where small size LUTs can be used in
DA-based LMS-ADF which not only helps in reducing the LUT
size but also in reducing LUT-update time. Recently Guo et
al. [16] have suggested a scheme to avoid the DA-A-LUT in
DA-based LMS-ADF, where both filtering and weight-updating
are performed using DA-F-LUT. On the other hand, throughput
rate of existing DA-LMS ADFs could be slow for real-time applications
due to the bit-serial nature of DA computation. Although
there is some interesting work on DA-based LMS ADF [10],
[16], we find that the potential application of DA for the
implementation of BLMS ADF is yet to be explored.
In order to reduce the power consumption of DA-based designs,
we aim at reducing the number of words in the LUT and
the number of LUT accesses. A DA-based BLMS ADF structure can be
derived by extending the scheme of [10], but this structure would
demand $L$ times more hardware (memory and combinational
logic) for $L$ times more throughput rate. The scheme of [16] offers
sharing of LUT for the computation of both filter output and
weight-increment term, but this scheme cannot be applied to
derive a DA-based structure for BLMS ADFs, because separate
inner-product computation (IPC) is performed for calculation of
filter output and weight-increment term of BLMS ADF whereas
in case of LMS ADF, IPC is performed to calculate the filter
output only. In this paper, we have formulated the DA-BLMS al-
gorithm for sharing of LUTs for the computation of filter output
and weight-increment terms.
The key contributions of this paper are:
• DA-based formulation of BLMS algorithm where both
convolution operation to compute filter output and corre-
lation operation to compute weight-increment term could
be performed by using the same LUT.
• A novel approach for minimization of number of LUT
words to be updated per output. This helps to save external
logic and power consumption.
We have derived a DA-based structure for BLMS-ADF
using the proposed DA-formulation and a novel LUT updating
scheme. The most remarkable aspect of the proposed scheme
is that the number of adders required by the structure does
not increase proportionately with filter order, and the number
of flip-flops required by the structure is independent of the
block-size. Apart from that, the proposed structure has signifi-
cantly less LUT access than the existing DA-LMS structure for
higher block-sizes.
The rest of this paper is organized as follows: Mathematical
formulation is presented in Section II. The new-LUT update
scheme is discussed in Section III, and the proposed structure for
DA-based BLMS ADF is presented in Section IV. Hardware-
and time-complexities of the proposed structure are discussed
in Section V. Conclusion is presented in Section VI.
II. MATHEMATICAL FORMULATION
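The BLMS algorithm for updating the filter weights in the
$k$-th iteration is given by
$$\mathbf{w}_{k+1} = \mathbf{w}_k + \mu \, \boldsymbol{\Delta}_k \tag{1}$$
where $\boldsymbol{\Delta}_k$ is defined as
$$\boldsymbol{\Delta}_k = \mathbf{X}_k^T \mathbf{e}_k. \tag{2}$$
$\mathbf{w}_k$ and $\mathbf{e}_k$ are, respectively, the weight-vector and the
error-vector of the $k$-th iteration defined as:
$$\mathbf{w}_k = [w_k(0), w_k(1), \ldots, w_k(N-1)]^T$$
$$\mathbf{e}_k = [e(kL), e(kL-1), \ldots, e(kL-L+1)]^T$$
where $\mu$ is the step-size; and the input matrix $\mathbf{X}_k$ is derived from
the current input block $\{x(kL), x(kL-1), \ldots, x(kL-L+1)\}$
of length $L$, and $(N-1)$ past samples, given by
$$\mathbf{X}_k = \begin{bmatrix} x(kL) & x(kL-1) & \cdots & x(kL-N+1) \\ x(kL-1) & x(kL-2) & \cdots & x(kL-N) \\ \vdots & \vdots & \ddots & \vdots \\ x(kL-L+1) & x(kL-L) & \cdots & x(kL-L-N+2) \end{bmatrix}$$
The error-vector $\mathbf{e}_k$ is computed as
$$\mathbf{e}_k = \mathbf{d}_k - \mathbf{y}_k \tag{3}$$
where the desired response vector $\mathbf{d}_k$ is defined as
$$\mathbf{d}_k = [d(kL), d(kL-1), \ldots, d(kL-L+1)]^T.$$
The $k$-th block of filter output $\mathbf{y}_k$ is computed by the matrix-vector
product:
$$\mathbf{y}_k = \mathbf{X}_k \mathbf{w}_k \tag{4}$$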
A. Computation of Filter Output
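The input matrix $\mathbf{X}_k$ of size $L \times N$ can be decomposed
into $M$ square matrices $\mathbf{S}_k^j$ of size $L \times L$ each, where
$M = N/L$. Similarly, the weight vector $\mathbf{w}_k$ can be decomposed into
$M$ short weight-vectors $\mathbf{c}_k^j$ of size $L$, for $0 \le j \le M-1$.
The computation of (4) can then be expressed as the sum of $M$
matrix-vector products:
$$\mathbf{y}_k = \sum_{j=0}^{M-1} \mathbf{S}_k^j \mathbf{c}_k^j \tag{5}$$
where $\mathbf{S}_k^j$ and $\mathbf{c}_k^j$ are defined as
$$\mathbf{S}_k^j = \begin{bmatrix} x((k-j)L) & x((k-j)L-1) & \cdots & x((k-j)L-L+1) \\ x((k-j)L-1) & x((k-j)L-2) & \cdots & x((k-j)L-L) \\ \vdots & \vdots & \ddots & \vdots \\ x((k-j)L-L+1) & x((k-j)L-L) & \cdots & x((k-j)L-2L+2) \end{bmatrix}$$
for $0 \le j \le M-1$, and
$$\mathbf{c}_k^j = [w_k(jL), w_k(jL+1), \ldots, w_k(jL+L-1)]^T.$$
Each filter output now can be written as the sum of $M$ inner-products
as
$$y(kL-i) = \sum_{j=0}^{M-1} r(i,j) \tag{6}$$
where $r(i,j)$ is an $L$-point inner-product of an input-vector $\mathbf{s}_i^j$
and $\mathbf{c}^j$, given by
$$r(i,j) = \mathbf{s}_i^j \mathbf{c}^j = \sum_{l=0}^{L-1} s_i^j(l) \, c^j(l) \tag{7}$$
and $\mathbf{s}_i^j$ is the $(i+1)$-th row of $\mathbf{S}^j$ given by
$$\mathbf{s}_i^j = [x((k-j)L-i), x((k-j)L-i-1), \ldots, x((k-j)L-i-L+1)]$$
for $0 \le i \le L-1$, $0 \le j \le M-1$, and
$0 \le l \le L-1$. Note that we have dropped the subscript $k$ of $\mathbf{s}_i^j$ and $\mathbf{c}^j$
in (7) only for convenience of further discussion, without loss
of generality.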
B. Computation of Weight Increment Term
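The weight-increment vector $\boldsymbol{\Delta}_k$ can be decomposed into
$M$ short vectors $\boldsymbol{\Delta}_k^j$ of size $L$ each, for $0 \le j \le M-1$.
Computation of (2) can be performed through $M$ independent
matrix-vector multiplications using the relation
$$\boldsymbol{\Delta}_k^j = [\mathbf{S}_k^j]^T \mathbf{e}_k \tag{8}$$
where $0 \le j \le M-1$, and $\boldsymbol{\Delta}_k^j$ is defined as
$$\boldsymbol{\Delta}_k^j = [\Delta_k(jL), \Delta_k(jL+1), \ldots, \Delta_k(jL+L-1)]^T \tag{9}$$
Using (8), the individual weight increment terms could be evaluated
by the following equation
$$\Delta_k(jL+i) = q(i,j) \tag{10}$$
where $q(i,j)$ is the inner-product between the vectors $\mathbf{s}_i^j$ and $\mathbf{e}$,
given by
$$q(i,j) = \mathbf{s}_i^j \mathbf{e} = \sum_{l=0}^{L-1} s_i^j(l) \, e(kL-l) \tag{11}$$
Here also we have dropped the subscript $k$ of $\mathbf{e}_k$ for convenience
of further discussions. Since each entry of $\mathbf{S}_k^j$ depends only on the
sum of its row and column indices, the $(i+1)$-th column of $\mathbf{S}_k^j$ holds
the same samples as its $(i+1)$-th row. As shown in (7) and (11), the
input-vector $\mathbf{s}_i^j$ is therefore the same for a pair of inner-products $r(i,j)$
and $q(i,j)$. This is a major advantage in order to optimize the
LUTs when the inner-products of (7) and (11) are performed
using the DA principle.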
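The sharing of one input-vector between a filter-output inner-product and a weight-increment inner-product can be checked numerically. The sketch below is illustrative only: the sample ordering inside each short input-vector (`x(kL - jL - i - l)` for position `l`) and the zero padding for negative indices are assumptions, and the function names are hypothetical.

```python
def sample(x, n):
    """Return x(n); samples outside the recorded range are taken as zero (assumption)."""
    return x[n] if 0 <= n < len(x) else 0


def sub_vector(x, k, L, i, j):
    """Short input-vector of row i and column j: x(kL-jL-i-l) for l = 0..L-1 (assumed order)."""
    return [sample(x, k * L - j * L - i - l) for l in range(L)]


def block_outputs(x, w, k, L):
    """Filter outputs y(kL-i), each formed as a sum of M short inner-products, as in (6)."""
    M = len(w) // L
    return [
        sum(
            sum(s * c for s, c in zip(sub_vector(x, k, L, i, j), w[j * L:(j + 1) * L]))
            for j in range(M)
        )
        for i in range(L)
    ]


def weight_increments(x, e, k, N, L):
    """Weight-increment terms, index jL+i, using the SAME short input-vectors, as in (11)."""
    M = N // L
    return [
        sum(s * err for s, err in zip(sub_vector(x, k, L, i, j), e))
        for j in range(M)
        for i in range(L)
    ]
```

Both functions call `sub_vector` with the same `(i, j)` pairs, which is the property that later allows one LUT per input-vector to serve both computations.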
C. DA-Formulation
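Let $c^j(l)$ and $e(kL-l)$, respectively, be the $(l+1)$-th components
of the $L$-point vectors $\mathbf{c}^j$ and $\mathbf{e}$, and assumed to be $B$-bit
numbers in 2's complement representation:
$$c^j(l) = -c_{B-1}^j(l) \, 2^{B-1} + \sum_{b=0}^{B-2} c_b^j(l) \, 2^b \tag{12a}$$
$$e(kL-l) = -e_{B-1}(kL-l) \, 2^{B-1} + \sum_{b=0}^{B-2} e_b(kL-l) \, 2^b \tag{12b}$$
$c_b^j(l)$ and $e_b(kL-l)$ are the $b$-th bit of $c^j(l)$ and $e(kL-l)$,
respectively. Substituting (12a) in (7), we have
$$r(i,j) = \sum_{l=0}^{L-1} s_i^j(l) \left[ -c_{B-1}^j(l) \, 2^{B-1} + \sum_{b=0}^{B-2} c_b^j(l) \, 2^b \right] \tag{13}$$
Rearranging the order of summation, (13) may otherwise be expressed
as:
$$r(i,j) = \sum_{b=0}^{B-1} \epsilon_b \, 2^b \left[ \sum_{l=0}^{L-1} s_i^j(l) \, c_b^j(l) \right] \tag{14}$$
where $\epsilon_b = 1$, for $0 \le b \le B-2$, and $\epsilon_b = -1$
for $b = B-1$. Each term in the inner sum in (14) represents the
inner-product of $\mathbf{s}_i^j$ with a bit-vector (or bit-slice) of weight-vector
$\mathbf{c}^j$. Corresponding to $2^L$ possible values of a bit-vector
of length $L$, there could be $2^L$ possible values of such inner-products
of $\mathbf{s}_i^j$ with any possible bit-vector of length $L$. All
those possible inner-products could be pre-computed and stored
in an LUT, such that when the $b$-th bit-vector (or bit-slice) of
weight vector
$$\mathbf{c}_b^j = [c_b^j(0), c_b^j(1), \ldots, c_b^j(L-1)]$$
for $0 \le b \le B-1$, is fed to the LUT as address, its
inner-product with $\mathbf{s}_i^j$ is read from the LUT. The computation
of the inner sum of (14), therefore, could be expressed in the form
of memory read operation as:
$$r(i,j) = \sum_{b=0}^{B-1} \epsilon_b \, 2^b \, F(\mathbf{c}_b^j) \tag{15}$$
where $F(\cdot)$ is a memory-read operation, and its argument
$\mathbf{c}_b^j$, for $0 \le b \le B-1$, is used as LUT-address. The
inner-product of (11) may, similarly, be expressed in the form
of memory-read operation as
$$q(i,j) = \sum_{b=0}^{B-1} \epsilon_b \, 2^b \, F(\mathbf{e}_b) \tag{16}$$
where $\mathbf{e}_b$ is the $b$-th bit-vector of error-vector $\mathbf{e}$ defined as:
$\mathbf{e}_b = [e_b(kL), e_b(kL-1), \ldots, e_b(kL-L+1)]$, which is used
as address of an LUT to read its inner-products with $\mathbf{s}_i^j$. LUT
contents for the computation of $r(i,j)$ and $q(i,j)$ are exactly
the same, since the LUT content depends on the input-vector $\mathbf{s}_i^j$,
and is generated for all possible bit-slices of $L$-bit length, irrespective
of whether that is of the weight-vector or the error-vector.
When the bit-vector $\mathbf{c}_b^j$ is used as address, the partial results
of $r(i,j)$ are read from the LUT, and when $\mathbf{e}_b$ is used as
address, then partial results of $q(i,j)$ are read from the same
LUT. Therefore, by using the proposed scheme, a common set
of LUTs could be used for the computation of filter outputs
and weight-increment terms. Since the block of input samples
changes after every iteration, the LUTs are required to be updated
in every iteration to accommodate the new input-block.
In the next Section, we have presented a novel LUT-updating
scheme for the DA-based BLMS ADFs.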
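The LUT-sharing idea can be sketched in a few lines. This is a minimal sketch under stated assumptions rather than the structure of the paper: operands are integers in `B`-bit two's complement with bit 0 as the LSB, the LUT address bit `l` selects component `l` of the input-vector, and the names `build_lut`, `bit_slice` and `da_inner_product` are hypothetical.

```python
def build_lut(s):
    """LUT[a] = sum of s[l] over every bit l that is set in the address a."""
    L = len(s)
    return [sum(s[l] for l in range(L) if (a >> l) & 1) for a in range(1 << L)]


def bit_slice(v, b):
    """Address formed by bit b of every component of v (two's complement integers)."""
    return sum(((c >> b) & 1) << l for l, c in enumerate(v))


def da_inner_product(lut, v, B):
    """Inner product of the LUT's input-vector with v, read out as B bit-slices."""
    acc = 0
    for b in range(B):                        # LSB first, sign-bit slice last
        term = lut[bit_slice(v, b)] * (1 << b)
        acc += -term if b == B - 1 else term  # the sign-bit slice is subtracted
    return acc
```

The same `lut` is addressed by bit-slices of a weight-vector to obtain a partial filter output and by bit-slices of an error-vector to obtain a weight-increment term; only the address source changes.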
Fig. 1. (a) Inner-products of FIR filter of length $N = 6$, and block-size $L = 2$.
The input-vector corresponding to each inner-product is shown inside the
box. (b) LUT arrangement for DA-based computation of the FIR filter of
$N = 6$, and $L = 2$. Each LUT here stores the $2^L$ possible values of the partial inner-product
of an input-vector and a bit-vector of length $L$.
III. LUT-UPDATING SCHEME
Before we discuss the proposed LUT-updating scheme, we
summarize here the proposed decomposition of input-matrix
and weight-vector into small vectors, and their participation in
the inner-product computation for filtering operation. The
input-matrix $\mathbf{X}_k$ of size $L \times N$ is decomposed into $M$ square matrices
of size $L \times L$, and $\mathbf{w}_k$ is decomposed into $M$ short-vectors
of size $L$, where $M = N/L$.
Each of the $L$ rows of a square matrix represents an input-vector, so that
$L$ input-vectors are derived from each of the $M$ square matrices.
All these $N$ input-vectors are arranged in $L$ rows and
$M$ columns such that the input-vectors of one square matrix belong to one
column. According to (5), $M$ weight-vectors are multiplied
independently with $M$ matrices which, in total, involves $N$
inner-products. According to (6), results of $M$ inner-products
corresponding to each row of input-vectors are added together
for obtaining a filter output. From $L$ such rows of inner-products,
$L$ filter outputs are obtained.
We have illustrated here the aforementioned scheme for the
implementation of FIR filter of length $N = 6$ and block-size
$L = 2$. Suppose, during the $k$-th iteration the filter receives an
input-block of two samples and computes a block of two
outputs. As discussed above, the input-matrix of
size $2 \times 6$ is decomposed into 3 square-matrices
of size $2 \times 2$. Each square-matrix consists of a pair of
input-vectors. The 6-point weight-vector
is decomposed into 3 2-point weight-vectors.
Fig. 1(a) shows the arrangement of input-vectors
and weight-vectors; and the corresponding inner-products are
shown on the top of the rectangular boxes for clarity. Results
of odd-numbered inner-products (on upper row) and even-numbered
inner-products (on lower row) are added separately (not
shown in the figure) to obtain the two filter outputs, respectively.
Fig. 2. DA-based computation of the block FIR filter for $N = 6$ and $L = 2$.
(a) for $(k+1)$-th iteration, (b) for $(k+2)$-th iteration.
As shown in Fig. 1(a), the same weight-vector is used for
the computation of inner-product of a particular column of
input-vectors. For DA realization, the LUT corresponding to each
input-vector stores the $2^L$ partial inner-products generated by the
inner-product of the corresponding input-vector with all
possible values of a bit-vector of length $L$. DA-based parallel
computation of filter outputs of Fig. 1(a) for the $k$-th iteration
is shown in Fig. 1(b). As shown in Fig. 2(a), the DA-based
structure receives a new input-block during
the $(k+1)$-th iteration, so that two new samples enter into
the set of 7 samples, and the two oldest samples are discarded.
Consequently, samples of all the 6 input-vectors are changed.
But it occurs in a particular order. We can find from Fig. 1(b)
and Fig. 2(a) that the contents of only the first column of LUTs
of Fig. 2(a) are changed by the new samples, while in other
columns the LUT values remain the same. But the positions of
those unchanged LUTs are shifted right by one column. For
instance, values stored in the LUTs of second column of Fig. 2(a)
are the same as values stored in LUTs of the first column of
Fig. 1(b), and similarly values stored in LUTs of third column
of Fig. 2(a) are the same as those of the LUTs of the second column of
Fig. 1(b). This feature can be observed in the LUT contents
of Fig. 2(b) for the $(k+2)$-th iteration also. In other words,
contents of a particular column of LUTs during a particular
iteration are simply transferred to the adjacent column of LUTs
on its right during the next iteration. In this way, the oldest
input samples of a particular set are shifted out through the $M$-th
column ($M = 3$ in the example) of LUTs, and new values are
entered at the first column of LUTs.
Fig. 3. (a) Equivalent DA-based structure of Fig. 2(a) which is derived from
structure of Fig. 1(b) by changing the content of 5th and 6th LUT (shown in
grey color) and left shifting the weight-vectors by one-position. (b) Equivalent
DA-based structure of Fig. 2(b) derived from the structure of Fig. 3(a) by
changing content of 3rd and 4th LUT (shown in grey color) and left-shifting the
weight-vectors by one position.
Shifting of values physically from one LUT to the next across
the array of LUTs is highly time consuming and power consuming.
Therefore, we have proposed a novel LUT updating
scheme, where the LUT content need not be shifted. Since each
column of LUTs uses the same weight-vector as LUT-address,
the column-wise right-shift of LUT values can be achieved by a
left-shift of the weight-vectors. This technique could save a lot
of time and power, since the shifting of weight-vectors is
significantly less expensive than the shifting of LUT contents. In
the proposed LUT update scheme, contents of only one column
of LUTs out of 3 such columns (for $N = 6$, $L = 2$) need to be
updated in every iteration. We can find from Fig. 1(b) and Fig. 2(a)
that the values of the third-column LUTs of the $k$-th iteration
are not used during the $(k+1)$-th iteration, since they
correspond to the oldest block of samples. The
LUTs of the third column are updated as shown in Fig. 3(a) in
grey color. To feed weight-vectors to LUTs of Fig. 3(a) in the
same order as that of Fig. 2(a), weight-vectors of Fig. 1(b) are
simply left-shifted by one location. As shown in Fig. 3(a), the
second column of LUTs contains the values corresponding to the
oldest block of samples
in the $(k+2)$-th iteration; this input-block is discarded
and the corresponding LUTs are updated by the partial inner-products
of the new input-block. Weight-vectors
of Fig. 3(a) are left-shifted by one column, and fed to LUTs
of Fig. 3(b) as addresses. In the following, we summarize the
proposed scheme for updating LUTs of BLMS-based adaptive
filter:
• LUTs are updated column-by-column in every iteration in
cyclic order.
• The LUTs which store the values of partial inner-prod-
ucts corresponding to samples of the oldest input block are
overwritten by those of the new input block.
• The weight-vectors are circularly left-shifted after
every iteration to change the columns of LUT to be read
circularly.
• The values required for updating a column of LUTs for any
particular iteration are calculated from the $L$ samples of the
current input-block and the $(L-1)$ most recent
samples of the previous block.
Based on the above scheme, the LUT-matrix is updated
column-by-column from right to left after every iteration. The
updating process starts from the $M$-th column of LUTs and goes
to the first column in a cyclic manner, and then again from the
first column it goes to the $M$-th column and then to the
$(M-1)$-th column and so on. Hence, LUTs of one particular column
are updated once in a period of $M$ iterations.
Fig. 4. Proposed DA-based structure for implementation of BLMS adaptive
FIR filters (for $N = 16$ and $L = 4$).
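The cyclic update rules summarized above can be checked with a small simulation. The sketch below is only an illustration under assumptions: the weights are held fixed so that only the LUT bookkeeping is exercised, each "column" stores the `L` short input-vectors whose partial inner-products its LUTs would hold (instead of the LUT words themselves), samples with negative index are taken as zero, and all names are hypothetical.

```python
def sub_vectors(x, k, L, j):
    """The L short input-vectors of column j for block k (rows i = 0..L-1)."""
    def sample(n):
        return x[n] if 0 <= n < len(x) else 0
    return [[sample(k * L - j * L - i - l) for l in range(L)] for i in range(L)]


def run_cyclic_update(x, w, L, k0, iterations):
    """Filter outputs for blocks k0, k0+1, ... with one column overwritten per iteration."""
    M = len(w) // L
    c = [w[j * L:(j + 1) * L] for j in range(M)]           # short weight-vectors
    cols = [sub_vectors(x, k0, L, j) for j in range(M)]    # initial fill of all columns
    outputs = []
    for t in range(iterations):
        k = k0 + t
        if t > 0:
            # Overwrite the column holding the oldest block: M-th, (M-1)-th, ..., cyclically.
            cols[(M - t) % M] = sub_vectors(x, k, L, 0)
        block = []
        for i in range(L):
            # Column p is addressed by weight-vector (p + t) mod M: a circular left-shift
            # of the weight-vectors replaces the right-shift of the column contents.
            block.append(sum(
                sum(s * cw for s, cw in zip(cols[p][i], c[(p + t) % M]))
                for p in range(M)
            ))
        outputs.append(block)
    return outputs
```

With `M = 3`, the overwritten column index follows the order third, second, first, third, and so on, while the stored contents of the other columns are never moved.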
IV. PROPOSED ARCHITECTURE
Proposed DA-BLMS structure is comprised of one
DA-module, one error bit-slice generator (EBSG) and one
weight-update cum bit-slice generator (WBSG). WBSG up-
dates the filter weights and generates the required bit-vectors
in accordance with the DA-formulation. EBSG computes the
error block according to (3) and generates its bit-vectors. The
DA-module updates the LUTs and makes use of the bit-vectors
generated by WBSG and EBSG to compute the filter output
and weight-increment terms according to (15) and (16).
A. Structure for Block-Size $L = 4$
The proposed structure for DA-based BLMS adaptive filter
for $N = 16$ and $L = 4$ is shown in Fig. 4. The DA-module
receives a block of $L$ input samples
in every iteration, and computes a block
of $L$ filter outputs.
It also receives a block of $L$ errors in every iteration, and
computes the weight-increment terms for all the components
of the weight-vector $\mathbf{w}_k$.
The structure of the proposed DA-module is shown in Fig. 5. It
consists of 4 identical processing elements (PEs) for $N = 16$,
one LUT-update block and one MUX-array. Structure of the PE
is shown in Fig. 6. It consists of 4 identical subcells (SCs) for
$L = 4$. Internal structure and function of an SC is
shown in Fig. 7. As required by (15), the LUT of each SC of
a PE stores the 16 possible values corresponding to
its four input samples.
The LUT-update block of the DA-module generates the required
values to update LUTs of a particular PE. Structure of
the LUT-update block is shown in Fig. 8. It consists of one
adder-block and an input delay unit (IDU), which stores $(L-1)$
samples of the previous block. During each iteration, the adder
block receives $(2L-1)$ samples ($L$ samples from the current
input block and $(L-1)$ past samples from the IDU), and feeds
these samples to $L$ adder-cells (ACs) (see Fig. 8) such that
each AC receives $L$ samples, and input blocks of adjacent ACs
are overlapped by $(L-1)$ samples, so that the input block of each
AC is delayed by one sample relative to that of its neighbor. For
block-size $L = 4$, each AC receives a block of four samples in
every iteration (shown in Fig. 9). As shown in the figure, each
of the four inputs of the AC is ANDed with a bit of the four-bit
address by four AND cells. Each AND cell
consists of as many AND gates as the word-length of the input
samples. All those AND gates are fed with a bit of the address,
while the other input of each AND gate is fed with a bit of the input
sample. The outputs of the AND cells are fed to an adder-tree (AT).
The AC receives the 16 possible values of the address in 16 clock cycles, and
calculates the 16 values to be stored in the LUT, where the equivalent
integer value of the address is used as the LUT location.
All the ACs of the adder block (see Fig. 8)
work in parallel, and generate all the required values to update
the LUTs of the SCs of a PE. According to the proposed LUT-update
scheme, LUTs of one PE out of $M$ PEs are updated in every
iteration. LUTs of all the PEs are updated once in $M$ iterations
in a cyclic order. Each PE uses a separate control signal
to enable the specific column of LUTs to be
updated. LUT-update operation of the proposed structure is
completed during the first $2^L$ clock cycles of every iteration.
Fig. 5. Structure of DA-module of the proposed DA-BLMS ADF (for $N = 16$ and $L = 4$).
Fig. 6. Internal structure of a processing element (PE) of the
DA-module for block-size $L = 4$.
Fig. 7. Internal structure and function of a subcell (SC) of a PE.
Convergence factor $\mu$ is assumed to be a power of 2.
Fig. 8. Internal structure of LUT-update block for block-size $L = 4$.
Fig. 9. Internal structure of an AC of the LUT-update block for
block-size $L = 4$.
Fig. 10. Internal structure of MUX-array for $N = 16$ and $L = 4$.
Each PE receives, through the MUX array (shown in Fig. 10), the
bit-vectors used for updating the LUTs, for
computation of filter outputs, or for computation of weight-increment
terms. After completion of the LUT-update, filtering computation
follows immediately for the next $B$ clock cycles by a series
of LUT-read operations using the bit-slices of the corresponding
weight-vector in LSB to MSB order, as successive addresses
according to (15). During each cycle of filtering, the
WBSG generates $M$ parallel bit-vectors of width $L$ bits
each for the $M$ PEs to perform the filtering operation. Each SC
receives a sequence of $B$ bit-vectors (where
$B$ is the wordlength of the filter-coefficients) from the
WBSG in $B$ clock cycles. The LUT-read values are shift-accumulated
in an accumulator (ACC) to obtain a partial filter
output. During the $B$-th cycle the LUT output is subtracted from
the accumulated result since the bit-vector during this
cycle contains the sign-bits of the weight-vector. Each SC uses
control signal CTR1 to control add/subtract operation in the
ACC. At the end of the $B$-th cycle, ACC contents are sent to
the DMUX as input, and the ACC register (not shown in Fig. 7)
is cleared to be used for the computation of weight-increment
term from the next cycle (CTR1 is used for clearing the register).
Finally the DMUX sends the computed partial results of
inner-products to the output line using the select
signal CTR6. From the $L$ SCs of each PE, $L$ partial results are
obtained in parallel; the corresponding outputs of each SC from
the $M$ PEs are added by an AT (Fig. 5) to obtain one
component of the $k$-th block of filter output. A block of $L$ parallel
filter outputs ($\mathbf{y}_k$) is obtained from the $L$ ATs of the DA-module in
each cycle.
Fig. 11. Structure of error computation cum bit-slice generator (EBSG) for
block-size $L = 4$.
EBSG receives one block of filter output ($\mathbf{y}_k$) from the
DA-module, and calculates a block of error ($\mathbf{e}_k$) in every
iteration using one block of desired response according to
(3). As shown in Fig. 11, error values are loaded in parallel-in
serial-out (PISO) shift-registers (SRs) of the bit-slice-generator (BSG)
to generate bit-vectors of the error-vector $\mathbf{e}_k$. CTR4 enables
the clock for the BSG and CTR2 controls load-shift operation
of each SR.
The bit-vectors of the error-vector are fed serially in LSB
to MSB order to the DA-module in successive clock cycles
to compute weight-increment terms for the $k$-th iteration.
According to (16), the LUT values used for the $k$-th block of filter
output are also used to compute weight-increment terms for the
$k$-th iteration. In general, the LUT values of one SC
of a PE
are used to compute one weight-increment term.
Each PE, therefore, computes the $L$ weight-increment
terms of the short weight-vector associated with it. The
computation of weight-increment terms is similar to that of the partial
filter outputs. But in this case the same bit-vector is used
by all the PEs of the DA-module to compute the weight-increment
terms. In each SC (see Fig. 7), the ACC contents
corresponding to the weight-increment term are sent to the output line
of the DMUX. The weight-increment terms are scaled by a
factor $\mu$. Here we have assumed $\mu$ is a power of 2, so that the
scaling by $\mu$ is realized by a right-shift operation using
a fixed-shifter (see Fig. 7).
TABLE I
LUT UPDATING SCHEME FOR $N = 16$ AND $L = 4$, WHERE $M = N/L$, $L$: BLOCK SIZE, $N$: FILTER ORDER
According to (1), the WBSG of the proposed DA-BLMS
structure requires only the weight-increment terms of the current
iteration to update the weight-vector for the next iteration.
It does not require the LUT values of the current iteration.
Therefore, once the weight-increment terms of the current
iteration are computed, the LUT-updating operation for the
next iteration can be started immediately in the next clock
cycle. As we discussed earlier, the filter computation follows
the LUT-update operation, and the first $2^L$ clock cycles of every
iteration are used to complete the LUT-update operation. During
this period, the weight-update operation of WBSG also can be
performed concurrently. A bit-parallel (word-serial) structure
of WBSG requires one clock-cycle to complete the weight-update
operation, while a bit-serial structure of WBSG requires
$B$ clock-cycles to complete the same task. If the wordlength of
filter-coefficients ($B$) is less than or equal to the LUT-size
($2^L$), then bit-serial realization of WBSG does not increase
the iteration period of the DA-BLMS structure, but it certainly
helps to reduce the hardware complexity of the DA-BLMS
structure. For $L = 4$ and $B \le 16$, we can have a bit-serial
structure for the WBSG. The bit-serial structure of WBSG receives
the weight-increment terms from the DA-module in bit-serial
LSB to MSB order, and updates the weight-vector accordingly.
For bit-serial realization of WBSG, weight-increment terms
computed by each PE of the DA-module are finally loaded into
a separate BSG (see Fig. 5) to generate the weight-increment
terms in bit-serial order. All the BSGs of the DA-module use
common control signals CTR6 and CTR5 to perform the
loading and shifting operations, respectively.
WBSG is an important block of the proposed structure. It
performs three operations: (i) updates filter weights using the
weight-increment values calculated by the DA-module, (ii)
generates bit-vectors for the DA-module to compute
the next block of filter output, (iii) gives one left-shift
(circularly) to the weight-vectors as necessitated by the
proposed LUT-update scheme. We have shown LUT updating
of the DA-BLMS ADF for $N = 16$ and $L = 4$ in Table I for
the first 5 iterations using the proposed LUT-updating scheme.
As shown in Table I, for $N = 16$ and $L = 4$, the LUT-matrix
has 4 columns (for $M = 4$). LUTs of all these 4 columns
are updated once in a period of 4 iterations. At any given
iteration, the LUT-matrix contains the values corresponding
to the $N + L - 1 = 19$ most recent input samples to compute
a block of 4 filter outputs. As shown in Table I, during the
5-th iteration, the LUT-matrix contains the values
corresponding to a set of 19 input samples, which
is exactly the set required to compute the 4 filter
outputs of that iteration. Similarly, the LUT-matrix contains the
values corresponding to the next set of 19 samples
during the 6-th iteration, and these samples are exactly those required to
compute the filter outputs of the 6-th iteration.
The bit-serial structure of the WBSG is shown in Fig. 12. It consists
of serial-in serial-out (SISO) SRs and carry-save
full-adders (CSFAs) corresponding to filter weights. The SRs
are arranged in matrix form, and the filter weights are
stored in the SR matrix column-wise, such that weight-vector
is stored in the -th column of SRs. As shown in Table I for
, bit-slices of the weight-vector are received
by the PE whose LUTs are to be updated during the -th
iteration, and are generated from
the first column of filter weights
. The weight-vector is to be aligned with the corresponding
PE. If, during the -th iteration, the LUT of PE-1 is to be updated,
then the first column of the SR-matrix is required to contain the
components of weight-vector and the -th column of SRs
should contain the components of weight-vector . As shown
in Fig. 12, the weight-increment values of the -th column of
filter coefficients (available in the -th column of the SR-matrix)
are obtained from the -th PE, and these values are
added to the corresponding filter weights bit-serially using a
carry-save full-adder (CSFA). The results of the CSFAs of the -th
column constitute a bit-vector of . The SR contents are shifted
left for clock cycles to generate the shifted weight-vectors
in accordance with the proposed LUT-update scheme. The shifting
operation of the SRs starts at the -th clock cycle of every
iteration and continues for clock cycles. The control signal
CTR5 is used in the WBSG to enable the shifting operation. The D
flip-flop of each CSFA is cleared during the first clock cycle
of every iteration to flush out the final carry of the previous
iteration of the weight-update operation.
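The bit-serial weight update described above can be illustrated with a small behavioural sketch. It models one full-adder with a carry flip-flop that is cleared at the start of every word, and it assumes LSB-first operation and two's-complement words. The function names bit_serial_add and from_twos are hypothetical, and the sketch does not model the SR matrix or the control signals of Fig. 12.

```python
def bit_serial_add(a, b, width):
    """Add two width-bit two's-complement words one bit per clock, LSB first."""
    carry = 0                       # carry flip-flop, cleared at the first clock cycle
    result = 0
    for clk in range(width):
        a_bit = (a >> clk) & 1      # bit of the stored weight
        b_bit = (b >> clk) & 1      # bit of the weight-increment value
        s = a_bit ^ b_bit ^ carry                              # full-adder sum
        carry = (a_bit & b_bit) | (carry & (a_bit ^ b_bit))    # full-adder carry
        result |= s << clk
    # The final carry is left in the flip-flop and discarded; it must be
    # cleared before the next word is processed.
    return result


def from_twos(v, width):
    """Interpret a width-bit pattern as a signed two's-complement value."""
    return v - (1 << width) if v & (1 << (width - 1)) else v


print(from_twos(bit_serial_add(37, -5, 8), 8))   # 32
```

If the carry were not cleared between words, the leftover carry of one addition would be added into the LSB of the next, which is why the flip-flop is reset at the first clock cycle of every iteration.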
Fig. 12. Bit-serial structure of weight-update cum bit-slice generator (WBSG) for , and .
B. Structure for Higher Block-Size
To derive a DA-based BLMS structure for higher block sizes
using LUTs of 16 words, we can take the block-size to be a
multiple of 4, i.e. , where is an integer. The structures
of the EBSG and WBSG of the DA-BLMS filter for (for
) are the same as those for block-size shown in
Fig. 11 and Fig. 12, respectively. However, the AC of the LUT-update
block and the SC of each PE of the DA-module need to
be modified according to the value of . Each SC, in this case,
comprises LUTs of 16 words each. The bit-vectors
of the weight-vectors and error-vectors, of bits each, are split
into segments of 4 bits each and fed to the LUTs of each SC
to read the LUTs in parallel. The values read from the LUTs
are added using an AT and subsequently shift-accumulated in
the ACC to obtain a partial output. To generate the weight
update-values for the LUTs, each AC of the LUT-update block in
this case comprises AND-TA blocks of size 4 (as shown
in Fig. 9). For block-size , each SC involves RAM
words and adders along with one ACC and 2 DMUXes.
Similarly, the LUT-update block involves AND-gates and
adders.
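The partitioning into 16-word LUTs can be sketched as a generic DA inner product. This is a simplified illustration under stated assumptions, not the SC of the paper: each LUT stores all 16 partial sums of a group of 4 input samples, 4-bit segments of a weight bit-slice address the LUTs, the LUT outputs are summed (the role of the adder tree) and then shift-accumulated. The sign-bit slice is handled here by a plain negation rather than by the XOR-based complementing mentioned in Section V, and the names build_lut and da_inner_product are hypothetical.

```python
def build_lut(x4):
    """16-word LUT: entry a holds the sum of the samples whose address bit is 1."""
    return [sum(x for i, x in enumerate(x4) if (a >> i) & 1) for a in range(16)]


def da_inner_product(x, w, width):
    """Inner product of samples x with width-bit two's-complement weights w."""
    assert len(x) == len(w) and len(x) % 4 == 0
    luts = [build_lut(x[g:g + 4]) for g in range(0, len(x), 4)]
    acc = 0
    for j in range(width):                  # one weight bit-slice per cycle
        partial = 0
        for g, lut in enumerate(luts):      # read in parallel in hardware
            addr = 0
            for i in range(4):              # 4-bit segment of the bit-slice
                addr |= ((w[4 * g + i] >> j) & 1) << i
            partial += lut[addr]            # summation of the LUT outputs
        if j == width - 1:
            partial = -partial              # sign-bit slice has negative weight
        acc += partial << j                 # shift-accumulate
    return acc


x = [3, -1, 4, 1, -5, 9, 2, -6]
w = [2, -3, 5, 7, -8, 1, 0, -4]             # fits in 4-bit two's complement
print(da_inner_product(x, w, 4))            # 109
print(sum(a * b for a, b in zip(x, w)))     # 109
```

Splitting a long bit-vector into 4-bit segments keeps every LUT at 16 words; a single LUT addressed by all 8 bits in this example would need 256 words instead of 2 x 16.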
V. HARDWARE-TIME COMPLEXITY AND
PERFORMANCE COMPARISON
A. Hardware Complexity
The proposed structure comprises one DA-module, one
WBSG, one EBSG and a control unit. The DA-module consists
of one LUT-update block, PEs, adder-trees of words
each, one MUX-array and BSG, where and the
block-size . The LUT-update block consists of one IDU and
ACs, where the IDU comprises registers of size ,
and each AC comprises AND-gates and adders.
The LUT-update block, therefore, involves registers, adders
and AND-gates. Each PE consists of SCs, where each
SC comprises LUTs of 16 words each, adders,
one ACC, one 1-to-2 line DMUX and number of 2-input
XOR-gates (used by the ACC, but not shown in Fig. 7, to compute the 1’s
complement of the LUT outputs when the bit-vector contains
sign-bits), where the ACC involves one adder, one register and
one 2-to-1 line DMUX ( ). Each PE, therefore,
involves memory words, adders, registers,
DMUXes (2-to-1 line) and XOR-gates. Each BSG
comprises SRs (bit-level) of size . The MUX-array involves
2-to-1 line MUXes. The DA-module, therefore, involves
memory words, adders,
D-type flip-flops (FFs),
2-to-1-line MUXes/DMUXes (word-level), AND-gates
and XOR-gates. The WBSG involves D-type FFs
and FAs. The EBSG involves D-type FFs and adders.
The proposed structure, therefore, requires memory words,
adders, FAs,
D-type FFs, MUXes/DMUXes (word-level),
AND-gates and XOR-gates.
B. Time Complexity
The proposed structure performs four operations sequentially
in every iteration: (i) LUT update, (ii) filter
output computation, (iii) error calculation and (iv) computation
of the weight-increment term. It involves 16 clock cycles
to complete the LUT-update operation. It takes clock cycles
to calculate the partial results of a block of filter outputs. It calculates
a block of filter outputs from the partial results and
then a block of errors in one clock cycle. Finally, it takes
clock cycles to compute the weight-increment term for the
weight vector. In every iteration, the proposed structure processes
one block of samples, where one iteration involves
clock cycles and the duration of one clock cycle is
,
where is the delay of one -bit adder. For comparison
purposes, we have also estimated the number of clock cycles required
by the structures of [10] and [16] for one iteration. We
assumed that the read and write operations on a LUT are performed in two
separate clock cycles to maintain uniformity in the
comparison. The structure of [10] requires 16 clock cycles to update
the DA-A-LUT of size 16 words, clock cycles to compute
one filter output and 32 clock cycles to update the DA-F-LUT
of size 16 words. It involves 48 clock cycles for one iteration
and computes one output per iteration, where the duration of
one clock cycle is and . Since
the structure of [16] does not involve a DA-F-LUT, it requires 16
clock cycles for updating the DA-A-LUT and clock cycles
to compute one filter output. The structure of [16], therefore,
involves clock cycles for one iteration, where the
duration of the clock period is the same as that of [10].
TABLE II
GENERAL COMPARISON OF HARDWARE COMPLEXITY OF THE PROPOSED STRUCTURE (FOR BLOCK-SIZE ) AND THE STRUCTURE OF
[10] AND [16] (WITH DECOMPOSITION FACTOR 4) AND THE DA-BLMS STRUCTURE OF [18]
LEGEND: ADD: adder, MULT: multiplier, FF: flip-flop, VSH: variable shifter, TR: throughput rate, LAPO: LUT access per output.
, , , , ,
. In addition to the above list of components the proposed structure involves FAs, 2-input AND-gates and 2-input XOR-gates,
where : word length of the sequence , and , : word-length of input sequence, . For the proposed structure, ,
and in case of [10] and [16], , and in case of [18], , , and block-size , where and are
relatively prime to each other.
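Apart from the LUT update, the operations performed in every iteration (filter output computation, error calculation and weight-increment computation) correspond to the arithmetic of one block LMS iteration. The sketch below shows that arithmetic in its textbook multiply-accumulate form; it is not the DA-based datapath of the proposed structure. The function name blms_iteration, the step-size name mu and the "oldest sample first" ordering of x_window are assumptions made for illustration.

```python
def blms_iteration(w, x_window, d_block, mu):
    """One block-LMS iteration for an N-tap filter and block-size L.

    w        : N current weights
    x_window : N + L - 1 input samples, oldest first
    d_block  : L desired-response samples
    mu       : step size (hypothetical parameter name)
    """
    N, L = len(w), len(d_block)
    assert len(x_window) == N + L - 1
    # Filter outputs (convolution): the newest sample used by the l-th output
    # of the block sits at index N - 1 + l of the window.
    y = [sum(w[i] * x_window[N - 1 + l - i] for i in range(N)) for l in range(L)]
    # Block of errors.
    e = [d_block[l] - y[l] for l in range(L)]
    # Weight-increment terms (correlation of the error block with the inputs).
    dw = [mu * sum(e[l] * x_window[N - 1 + l - i] for l in range(L))
          for i in range(N)]
    w_new = [w[i] + dw[i] for i in range(N)]
    return w_new, y, e


w_new, y, e = blms_iteration([0.0, 0.0], [1.0, 2.0, 3.0], [1.0, 1.0], 0.5)
print(w_new, y, e)   # [2.5, 1.5] [0.0, 0.0] [1.0, 1.0]
```

Both the output and the weight-increment computations are sums of products over the same window of input samples, which is what makes it plausible to serve both from a common set of LUTs holding input-sample combinations.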
C. Number of LUT Accesses
During every iteration, the proposed structure computes filter
outputs, and performs write operations for updating the
LUTs, LUT read operations for filter output computation
and LUT read operations for the computation of the weight-increment
terms. The number of LUT accesses per output (LAPO)
of the structure is, therefore, . Similarly, the
LAPO of [10] and [16] are found to be
and , respectively, where is the bit-width of the
input samples and is the bit-width of all the intermediate and
output samples. Note that the LUTs of a DA-based ADF are required
to be implemented in RAM, and the total energy consumption
of the structure, therefore, increases significantly with LAPO.
D. Performance Comparison
The hardware and time complexities of the proposed structure,
the DA-LMS structures of [10] and [16], and the DA-BLMS structure
of [18] are listed in Table II for comparison. The structure of
[16] is the most efficient one amongst the existing DA-LMS
structures. Compared with [16], the proposed structure requires
times more LUT words, nearly times more adders and 4/3 times
more FFs, and offers nearly times higher throughput rate. It involves
16 more LAPO for block-size 4 and less
LAPO for block-size 8 than those of [16] for 16-bit internal
bit precision. Interestingly, the number of adders of the proposed
structure does not increase proportionately with block-size,
and the number of flip-flops is independent
of block-size. Besides, it does not require variable shifters, unlike
those of [10] and [16].
We have estimated the hardware and time complexity of the proposed
structure for block-sizes 4 and 8, and that of [10] and [16], for
filter sizes 16, 32 and 64 using the complexity counts
of Table II. The estimated values are listed in Table III for
comparison. Compared with the structure of [16], the proposed
structure for block-size 8 involves 8 times more LUT words and 3.27
times more adders on average for different filter orders, and
offers 5.22 times higher throughput. However, it involves
37.5%, 24.4% and 17.8% more flip-flops and 25%, 37.5% and
47.6% less LAPO than those of [16] for filter orders 16, 32 and 64,
respectively.
E. Simulation Result
To validate the proposed design, we have coded it in VHDL
for filter orders 16, 32 and 64 with block-sizes 4 and 8. We have
also coded the designs of [10] and [16] for the same filter orders.
We have considered and , and synthesized
the designs with Synopsys Design Compiler using the TSMC 90
nm CMOS library. The synthesis reports obtained from the Design
Compiler are listed in Table IV.
The synthesis results are in accordance with the theoretical
estimation given in Table III. The minimum clock periods of the
proposed structure and the structure of [16] are slightly higher
than that of [10] due to the extra MUX/DMUX in the critical
path. As shown in Table IV, the structure of [16] is the most efficient
amongst the existing structures. Compared with [16], the proposed
structure for block sizes 4 and 8 involves, respectively, 2.13
and 3.69 times more area on average for different filter orders
and offers nearly 2.61 and 5.22 times higher throughput rate,
respectively.
TABLE III
HARDWARE AND TIME-COMPLEXITY OF PROPOSED STRUCTURE AND STRUCTURE OF [10] AND [16] FOR DIFFERENT SIZE FILTERS. ,
TABLE IV
COMPARISON OF AREA, DELAY, AND POWER COMPLEXITIES OBTAINED FROM SYNTHESIS RESULT OF PROPOSED STRUCTURE AND STRUCTURE OF [10] AND [16]
We have estimated the ADP, PPO and energy per output
(EPO) at a 20 MHz clock. As shown in Table IV, for block-size
4, the proposed structure has 17.47%, 18.49% and 13.66% less
ADP than that of [16] for filter orders 16, 32 and 64, respectively.
For block-size 8, it has 31.6% less ADP than [16] on average
for different filter orders. For block-size 4, it consumes 27.5%,
28.8% and 24.6% less EPO than that of [16] for filter orders 16,
32 and 64, respectively. Similarly, for block-size 8, it consumes,
respectively, 40%, 39.8% and 37.4% less EPO than [16] for
16, 32 and 64 order filters. One can extrapolate these results to
obtain approximate values of the ADP, PPO and EPO of the
proposed structure, and hence an approximate
estimate of its advantages, for filter orders
greater than 64.
VI. CONCLUSION
We have derived a DA formulation of the BLMS algorithm where
both convolution and correlation are performed using a common
LUT for the computation of filter outputs and weight-increment
terms, respectively. This results in a significant saving of LUT
words and adders, which constitute the major hardware components
in DA-based computing structures. We have also suggested
a novel LUT-updating scheme to update the LUT contents
for the DA-based BLMS ADF, where only one set of LUTs out
of sets needs to be modified in every iteration, such that the LUT
contents are modified once in every iterations, where
, is the filter length and is the input block-size. Using
the proposed scheme, we have derived a parallel architecture for
the implementation of the DA-based BLMS ADF. Unlike the existing
DA-based LMS structures, the number of adders required by
the proposed structure does not increase linearly with . Compared
with the best of the existing DA-based LMS designs, the proposed
one involves nearly times more adders and times
more LUT words, and offers nearly times the throughput. It requires
nearly 25% more flip-flops irrespective of the block-size,
but does not involve variable shifters like the others. It involves
fewer LUT accesses per output than the existing structure
for block-sizes higher than 4. This is a major advantage of
the proposed structure for reducing its ADP and EPO when implemented
for large-order ADFs and for higher block-sizes. For
block-size 8 and filter length 64, the proposed structure involves
2.47 times more adders, 15% more flip-flops and 43% less LAPO
than the best of the existing structures, and offers 5.22 times
higher throughput. The ASIC synthesis results show that the proposed
structure for filter order 64 has almost 14% and 30% less
ADP and 25% and 37% less EPO than the best of the existing
structures for block sizes 4 and 8, respectively.
REFERENCES
[1] S. Haykin and B. Widrow, Least-Mean-Square Adaptive Fil-
ters. Hoboken, NJ: Wiley-Interscience, 2003.
[2] R. Haimi-Cohen, H. Herzberg, and Y. Beery, “Delayed adaptive LMS
filtering: Current results,” in Proc. IEEE Int. Conf. Acoust., Speech,
Signal Process., Albuquerque, NM, Apr. 1990, pp. 1273–1276.
[3] M. D. Meyer and D. P. Agrawal, “A modular pipelined implementation
of a delayed LMS transversal adaptive filter,” in Proc. IEEE Int.
Symp. Circuits Syst., New Orleans, LA, May 1990, pp. 1943–1946.
[4] V. Visvanathan and S. Ramanathan, “A modular systolic architecture
for delayed least mean square adaptive filtering,” in Proc. IEEE Int.
Conf. VLSI Des., Bangalore, 1995, pp. 332–337.
[5] R. D. Poltmann, “Conversion of the delayed LMS algorithm into the
LMS algorithm,” IEEE Signal Process. Lett., vol. 2, p. 223, Dec. 1995.
[6] S. C. Douglas, Q. Zhu, and K. F. Smith, “A pipelined LMS adap-
tive FIR filter architecture without adaptive delay,” IEEE Trans. Signal
Process., vol. 46, pp. 775–779, Mar. 1998.
[7] L. D. Van and W. S. Feng, “Efficient systolic Architectures for 1-D and
2-D DLMS adaptive digital filters,” in Proc. IEEE Asia Pacific Conf.
Circuits Syst., Tianjin, China, Dec. 2000, pp. 399–402.
[8] L. D. Van and W. S. Feng, “An efficient architecture for the DLMS
adaptive filters and its applications,” IEEE Trans. Circuits Syst. II,
Analog Digit. Signal Process., vol. 48, no. 4, pp. 359–366, Apr. 2001.
[9] G. A. Clark, S. K. Mitra, and S. R. Parker, “Block implementation of
adaptive digital filters,” IEEE Trans. Circuit Syst., vol. 28, pp. 584–592,
Jun. 1981.
[10] D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V. An-
derson, “LMS adaptive filters using distributed arithmetic for high
throughput,” IEEE Trans. Circuits Syst., vol. 52, no. 7, pp. 1327–1337,
Jul. 2005.
[11] D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V. Anderson,
“A novel high performance distributed arithmetic adaptive filter im-
plementation on an FPGA,” in Proc. IEEE Int. Conf. Acoust., Speech,
Signal Process. (ICASSP), 2004, vol. 5, p. V-161-4.
[12] S. A. White, “Applications of distributed arithmetic to digital signal
processing: A tutorial review,” IEEE ASSP Mag., vol. 6, pp. 4–19, Jul.
1989.
[13] D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V. Anderson,
“An FPGA implementation for a high throughput adaptive filter using
distributed arithmetic,” in Proc. 12th Annu. IEEE Symp. Field-Pro-
grammable Custom Comput. Mach., 2004, pp. 324–325.
[14] W. Huang and D. V. Anderson, “Adaptive filters using modified
sliding-block distributed arithmetic with offset binary coding,” in
Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP),
2009, pp. 545–548.
[15] B. K. Mohanty and P. K. Meher, “Delayed block LMS algorithm and
concurrent architecture for high-speed implementation of adaptive FIR
filters,” presented at the IEEE Region 10 TENCON2008 Conf., Hyder-
abad, India, Nov. 2008.
[16] R. Guo and L. S. DeBrunner, “Two high-performance adaptive filter
implementation schemes using distributed arithmetic,” IEEE Trans.
Circuits Syst. II, Exp. Briefs, vol. 58, no. 9, pp. 600–604, Sep. 2011.
[17] S. Baghel and R. Shaik, “FPGA implementation of fast block LMS
adaptive filter using distributed arithmetic for high-throughput,” in
Proc. Int. Conf. Commun. Signal Process. (ICCSP), Feb. 10–12, 2011,
pp. 443–447.
[18] S. Baghel and R. Shaik, “Low power and less complex implementation
of fast block LMS adaptive filter using distributed arithmetic,” in Proc.
IEEE Students Technol. Symp., Jan. 14–16, 2011, pp. 214–219.
[19] R. Jayashri, H. Chitra, H. Kusuma, A. V. Pavitra, and V. Chan-
drakanth, “Memory based architecture to implement simplified block
LMS algorithm on FPGA,” in Proc. Int. Conf. Commun. Signal
Process. (ICCSP), Feb. 10–12, 2011, pp. 179–183.
[20] Q. Shen and A. S. Spanias, “Time and frequency domain X block LMS
algorithm for single channel active noise control,” Control Eng. J., vol.
44, no. 6, pp. 281–293, 1996.
[21] D. P. Das, G. Panda, and S. M. Kuo, “New block filtered-X LMS algorithms
for active noise control systems,” IET Signal Process., vol. 1,
no. 2, pp. 73–81, Jun. 2007.
[22] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation.
New York: Wiley, 1999.
[23] C. S. Burrus, “Index mappings for multidimensional formulation of the
DFT and convolution,” IEEE Trans. Acoust., Speech, Signal Process.,
vol. 25, pp. 239–242, Jun. 1977.
Basant K. Mohanty (M’06–SM’11) received the M.Sc.
degree in physics from Sambalpur University, India,
in 1989 and the Ph.D. degree in the field of VLSI for
digital signal processing from Berhampur University,
Orissa, in 2000.
In 2001, he joined the Electrical and
Electronic Engineering Department, BITS Pilani,
Rajasthan, as a Lecturer. Then, he joined the
Department of Electronics and Communication
Engineering, Mody Institute of Education
Research (Deemed University), Rajasthan, as an Assistant Professor. In 2003,
he joined Jaypee University of Engineering and Technology, Guna, Madhya
Pradesh, where he became Associate Professor in 2005 and full Professor in
2007. His research interests include the design and implementation of low-power
and high-performance systems for multimedia applications, multi-core processor
design and algorithms for concurrent processing. He has published nearly
40 technical papers.
Dr. Mohanty is a lifetime member of The Institution of Electronics and
Telecommunication Engineering, New Delhi, India. He was the recipient of the
Rashtriya Gaurav Award conferred by the India International Friendship Society,
New Delhi, India, for 2012.
Pramod Kumar Meher (SM’03) received the M.Sc.
degree in physics and the Ph.D. degree in science
from Sambalpur University, India, in 1978, and 1996,
respectively.
Currently, he is a Senior Scientist with the Institute
for Infocomm Research, Singapore, and Adjunct Pro-
fessor with the School of Electrical Sciences, Indian
Institute of Technology Bhubaneswar, India. Previ-
ously, he was a Professor of Computer Applications
with Utkal University, India, from 1997 to 2002, and
a Reader in electronics with Berhampur University,
India, from 1993 to 1997. His research interests include the design of dedicated and
reconfigurable architectures for computation-intensive algorithms pertaining to
signal, image and video processing, communication, bio-informatics and intelligent
computing. He has contributed nearly 200 technical papers to various
reputed journals and conference proceedings.
Dr. Meher has served as a speaker for the Distinguished Lecturer Program
(DLP) of the IEEE Circuits and Systems Society and as Associate Editor of the IEEE
TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS. Currently, he
is serving as Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND
SYSTEMS—I: REGULAR PAPERS, the IEEE TRANSACTIONS ON VERY LARGE
SCALE INTEGRATION (VLSI) SYSTEMS, and Journal of Circuits, Systems, and
Signal Processing. He was the recipient of the Samanta Chandrasekhar Award
for excellence in research in engineering and technology for 1999.

More Related Content

What's hot

Benchmark Analysis of Multi-core Processor Memory Contention April 2009
Benchmark Analysis of Multi-core Processor Memory Contention April 2009Benchmark Analysis of Multi-core Processor Memory Contention April 2009
Benchmark Analysis of Multi-core Processor Memory Contention April 2009James McGalliard
 
An Improved Optimization Techniques for Parallel Prefix Adder using FPGA
An Improved Optimization Techniques for Parallel Prefix Adder using FPGAAn Improved Optimization Techniques for Parallel Prefix Adder using FPGA
An Improved Optimization Techniques for Parallel Prefix Adder using FPGAIJMER
 
SURVEY ON SCHEDULING AND ALLOCATION IN HIGH LEVEL SYNTHESIS
SURVEY ON SCHEDULING AND ALLOCATION IN HIGH LEVEL SYNTHESISSURVEY ON SCHEDULING AND ALLOCATION IN HIGH LEVEL SYNTHESIS
SURVEY ON SCHEDULING AND ALLOCATION IN HIGH LEVEL SYNTHESIScscpconf
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)ijceronline
 
A dynamically reconfigurable multi asip architecture for multistandard and mu...
A dynamically reconfigurable multi asip architecture for multistandard and mu...A dynamically reconfigurable multi asip architecture for multistandard and mu...
A dynamically reconfigurable multi asip architecture for multistandard and mu...jpstudcorner
 
PERFORMANCE EVALUATION OF LOW POWER CARRY SAVE ADDER FOR VLSI APPLICATIONS
PERFORMANCE EVALUATION OF LOW POWER CARRY SAVE ADDER FOR VLSI APPLICATIONSPERFORMANCE EVALUATION OF LOW POWER CARRY SAVE ADDER FOR VLSI APPLICATIONS
PERFORMANCE EVALUATION OF LOW POWER CARRY SAVE ADDER FOR VLSI APPLICATIONSVLSICS Design
 
IRJET- Flexible DSP Accelerator Architecture using Carry Lookahead Tree
IRJET- Flexible DSP Accelerator Architecture using Carry Lookahead TreeIRJET- Flexible DSP Accelerator Architecture using Carry Lookahead Tree
IRJET- Flexible DSP Accelerator Architecture using Carry Lookahead TreeIRJET Journal
 
Investigations on Implementation of Ternary Content Addressable Memory Archit...
Investigations on Implementation of Ternary Content Addressable Memory Archit...Investigations on Implementation of Ternary Content Addressable Memory Archit...
Investigations on Implementation of Ternary Content Addressable Memory Archit...IRJET Journal
 
PAOD: a predictive approach for optimization of design in FinFET/SRAM
PAOD: a predictive approach for optimization of design in FinFET/SRAMPAOD: a predictive approach for optimization of design in FinFET/SRAM
PAOD: a predictive approach for optimization of design in FinFET/SRAMIJECEIAES
 
Scheduling of Heterogeneous Tasks in Cloud Computing using Multi Queue (MQ) A...
Scheduling of Heterogeneous Tasks in Cloud Computing using Multi Queue (MQ) A...Scheduling of Heterogeneous Tasks in Cloud Computing using Multi Queue (MQ) A...
Scheduling of Heterogeneous Tasks in Cloud Computing using Multi Queue (MQ) A...IRJET Journal
 
A Review of Different Methods for Booth Multiplier
A Review of Different Methods for Booth MultiplierA Review of Different Methods for Booth Multiplier
A Review of Different Methods for Booth MultiplierIJERA Editor
 
Design and implementation of Parallel Prefix Adders using FPGAs
Design and implementation of Parallel Prefix Adders using FPGAsDesign and implementation of Parallel Prefix Adders using FPGAs
Design and implementation of Parallel Prefix Adders using FPGAsIOSR Journals
 
1.area efficient carry select adder
1.area efficient carry select adder1.area efficient carry select adder
1.area efficient carry select adderKUMARASWAMY JINNE
 
Design and Estimation of delay, power and area for Parallel prefix adders
Design and Estimation of delay, power and area for Parallel prefix addersDesign and Estimation of delay, power and area for Parallel prefix adders
Design and Estimation of delay, power and area for Parallel prefix addersIJERA Editor
 
IBM Impact 2014 AMC-1878: IBM WebSphere MQ for zOS: Shared Queues
IBM Impact 2014 AMC-1878: IBM WebSphere MQ for zOS: Shared QueuesIBM Impact 2014 AMC-1878: IBM WebSphere MQ for zOS: Shared Queues
IBM Impact 2014 AMC-1878: IBM WebSphere MQ for zOS: Shared QueuesPaul Dennis
 
PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDMOD...
PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDMOD...PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDMOD...
PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDMOD...ijgca
 

What's hot (19)

Benchmark Analysis of Multi-core Processor Memory Contention April 2009
Benchmark Analysis of Multi-core Processor Memory Contention April 2009Benchmark Analysis of Multi-core Processor Memory Contention April 2009
Benchmark Analysis of Multi-core Processor Memory Contention April 2009
 
An Improved Optimization Techniques for Parallel Prefix Adder using FPGA
An Improved Optimization Techniques for Parallel Prefix Adder using FPGAAn Improved Optimization Techniques for Parallel Prefix Adder using FPGA
An Improved Optimization Techniques for Parallel Prefix Adder using FPGA
 
SURVEY ON SCHEDULING AND ALLOCATION IN HIGH LEVEL SYNTHESIS
SURVEY ON SCHEDULING AND ALLOCATION IN HIGH LEVEL SYNTHESISSURVEY ON SCHEDULING AND ALLOCATION IN HIGH LEVEL SYNTHESIS
SURVEY ON SCHEDULING AND ALLOCATION IN HIGH LEVEL SYNTHESIS
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 
A dynamically reconfigurable multi asip architecture for multistandard and mu...
A dynamically reconfigurable multi asip architecture for multistandard and mu...A dynamically reconfigurable multi asip architecture for multistandard and mu...
A dynamically reconfigurable multi asip architecture for multistandard and mu...
 
PERFORMANCE EVALUATION OF LOW POWER CARRY SAVE ADDER FOR VLSI APPLICATIONS
PERFORMANCE EVALUATION OF LOW POWER CARRY SAVE ADDER FOR VLSI APPLICATIONSPERFORMANCE EVALUATION OF LOW POWER CARRY SAVE ADDER FOR VLSI APPLICATIONS
PERFORMANCE EVALUATION OF LOW POWER CARRY SAVE ADDER FOR VLSI APPLICATIONS
 
IRJET- Flexible DSP Accelerator Architecture using Carry Lookahead Tree
IRJET- Flexible DSP Accelerator Architecture using Carry Lookahead TreeIRJET- Flexible DSP Accelerator Architecture using Carry Lookahead Tree
IRJET- Flexible DSP Accelerator Architecture using Carry Lookahead Tree
 
Investigations on Implementation of Ternary Content Addressable Memory Archit...
Investigations on Implementation of Ternary Content Addressable Memory Archit...Investigations on Implementation of Ternary Content Addressable Memory Archit...
Investigations on Implementation of Ternary Content Addressable Memory Archit...
 
PAOD: a predictive approach for optimization of design in FinFET/SRAM
PAOD: a predictive approach for optimization of design in FinFET/SRAMPAOD: a predictive approach for optimization of design in FinFET/SRAM
PAOD: a predictive approach for optimization of design in FinFET/SRAM
 
Scheduling of Heterogeneous Tasks in Cloud Computing using Multi Queue (MQ) A...
Scheduling of Heterogeneous Tasks in Cloud Computing using Multi Queue (MQ) A...Scheduling of Heterogeneous Tasks in Cloud Computing using Multi Queue (MQ) A...
Scheduling of Heterogeneous Tasks in Cloud Computing using Multi Queue (MQ) A...
 
A Review of Different Methods for Booth Multiplier
A Review of Different Methods for Booth MultiplierA Review of Different Methods for Booth Multiplier
A Review of Different Methods for Booth Multiplier
 
Networking Articles Overview
Networking Articles OverviewNetworking Articles Overview
Networking Articles Overview
 
Design and implementation of Parallel Prefix Adders using FPGAs
Design and implementation of Parallel Prefix Adders using FPGAsDesign and implementation of Parallel Prefix Adders using FPGAs
Design and implementation of Parallel Prefix Adders using FPGAs
 
1.area efficient carry select adder
1.area efficient carry select adder1.area efficient carry select adder
1.area efficient carry select adder
 
Design and Estimation of delay, power and area for Parallel prefix adders
Design and Estimation of delay, power and area for Parallel prefix addersDesign and Estimation of delay, power and area for Parallel prefix adders
Design and Estimation of delay, power and area for Parallel prefix adders
 
559 22-33
559 22-33559 22-33
559 22-33
 
5G
5G5G
5G
 
IBM Impact 2014 AMC-1878: IBM WebSphere MQ for zOS: Shared Queues
IBM Impact 2014 AMC-1878: IBM WebSphere MQ for zOS: Shared QueuesIBM Impact 2014 AMC-1878: IBM WebSphere MQ for zOS: Shared Queues
IBM Impact 2014 AMC-1878: IBM WebSphere MQ for zOS: Shared Queues
 
PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDMOD...
PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDMOD...PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDMOD...
PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDMOD...
 

Viewers also liked

Goretti namexgm 2014_finale
Goretti namexgm 2014_finaleGoretti namexgm 2014_finale
Goretti namexgm 2014_finaleMaurizio Goretti
 
Voltage Drop, Ampacity and In-line Fuses
Voltage Drop, Ampacity and In-line FusesVoltage Drop, Ampacity and In-line Fuses
Voltage Drop, Ampacity and In-line FusesJTBMarine
 
The Strategic Planning Process at Davis Joint Unified School District
The Strategic Planning Process at Davis Joint Unified School District The Strategic Planning Process at Davis Joint Unified School District
The Strategic Planning Process at Davis Joint Unified School District John Poulos (Sacramento)
 
1 na mex_regionale_meeting_2013_goretti
1 na mex_regionale_meeting_2013_goretti1 na mex_regionale_meeting_2013_goretti
1 na mex_regionale_meeting_2013_gorettiMaurizio Goretti
 
Solar technology for industry
Solar technology for industrySolar technology for industry
Solar technology for industryOleksandr Baskov
 
California Enacts Statewide Minimum Wage Increase
California Enacts Statewide Minimum Wage IncreaseCalifornia Enacts Statewide Minimum Wage Increase
California Enacts Statewide Minimum Wage IncreaseJohn Poulos (Sacramento)
 
NaMeX Rome (Italy) Internet Exchange Point Update
NaMeX Rome (Italy) Internet Exchange Point UpdateNaMeX Rome (Italy) Internet Exchange Point Update
NaMeX Rome (Italy) Internet Exchange Point UpdateMaurizio Goretti
 
Open WiFi solution for Public Administrator and University
Open WiFi solution for Public Administrator and UniversityOpen WiFi solution for Public Administrator and University
Open WiFi solution for Public Administrator and UniversityMaurizio Goretti
 

06340356

  • 1. IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 61, NO. 4, FEBRUARY 15, 2013 921 A High-Performance Energy-Efficient Architecture for FIR Adaptive Filter Based on New Distributed Arithmetic Formulation of Block LMS Algorithm Basant K. Mohanty, Senior Member, IEEE, and Pramod Kumar Meher, Senior Member, IEEE Abstract—In this paper, we present an efficient distributed- arithmetic (DA) formulation for the implementation of block least mean square (BLMS) algorithm. The proposed DA-based design uses a novel look-up table (LUT)-sharing technique for the computation of filter outputs and weight-increment terms of BLMS algorithm. Besides, it offers significant saving of adders which constitute a major component of DA-based structures. Also, we have suggested a novel LUT-based weight updating scheme for BLMS algorithm, where only one set of LUTs out of sets need to be modified in every iteration, where , , and are, respectively, the filter length and input block-size. Based on the proposed DA formulation, we have derived a parallel architecture for the implementation of BLMS adaptive digital filter (ADF). Compared with the best of the existing DA-based LMS structures, proposed one involves nearly times adders and times LUT words, and offers nearly times throughput of the other. It requires nearly 25% more flip-flops and does not involve variable shifters like those of existing structures. It involves less LUT access per output (LAPO) than the existing structure for block-size higher than 4. For block-size 8 and filter length 64, the proposed structure involves 2.47 times more adders, 15% more flip-flops, 43% less LAPO than the best of existing structures, and offers 5.22 times higher throughput. The number of adders of the proposed structure does not increase proportionately with block size; and the number of flip-flops is independent of block-size. 
This is a major advantage of the proposed structure for reducing its area delay product (ADP); particularly, when a large order ADF is implemented for higher block-sizes. ASIC synthesis result shows that, the proposed structure for filter length 64, has almost 14% and 30% less ADP and 25% and 37% less EPO than the best of the existing structures for block size 4 and 8, respectively. Index Terms—Adaptive filters, block LMS, distributed arith- metic, VLSI. I. INTRODUCTION ADAPTIVE DIGITAL FILTERS (ADFs) are widely used in various signal-processing applications, such as echo cancellation, system identification, noise cancellation and Manuscript received June 18, 2012; accepted October 07, 2012. Date of pub- lication October 25, 2012; date of current version January 25, 2013. The as- sociate editor coordinating the review of this manuscript and approving it for publication was Prof. Zhiyuan Yan. B. K. Mohanty is with the Department of Electronics and Communication En- gineering, Jaypee University of Engineering and Technology, Raghogarh, Guna, Madhya Pradesh, India-473226 (e-mail: bk.mohanti@juet.ac.in). P. K. Meher is with the Institute for Infocomm Research, 1 Fusionopolis Way, Singapore-138632 (e-mail: pkmeher@i2r.a-star.edu.sg, url: http://www1.i2r.a- star.edu.sg/~pkmeher/). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TSP.2012.2226453 channel equalization etc. [1]. Amongst the existing ADFs, least mean square (LMS)-based finite impulse response (FIR) adaptive filter is the most popular one due to its inherent sim- plicity and satisfactory convergence performance. However, the delay in availability of the feedback-error for updating the weights according to the LMS algorithm does not favor its pipeline implementation when sampling rate is high. Haimi et al. [2] have proposed the delayed LMS (DLMS) algorithm for pipeline implementation of LMS-based ADF. 
The delayed LMS is similar to the LMS algorithm except that the correction terms for updating the filter weights of the current iteration are calculated from the error corresponding to a past iteration. Several schemes have been proposed to implement the DLMS-based ADFs efficiently in a systolic VLSI with min- imum adaptation delay [2]–[4], [7], [8]. To avoid adaptation delay in pipelined LMS ADF, Poltmann [5] has proposed a modified DLMS algorithm which is used by Douglas et al. [6] to derive a systolic architecture. But, the structure of [6] involves large amount of hardware resources compared to the earlier one [2]. The block LMS (BLMS) ADF [9] is one of the useful deriva- tives of the LMS ADF for fast and computationally-efficient implementation of ADFs. Unlike the conventional LMS ADF, BLMS ADF accepts a block of input for computing a block of output and updates the weights using a block of errors in every training cycle. The BLMS ADF has convergence performance similar to the LMS ADF, but the BLMS ADF of block-length offers fold higher throughput compared with the other. Keeping this in view, many variant of BLMS algorithm like time and frequency-domain block filtered-X LMS (BFXLMS) has been proposed for specific applications [20]. Das et al. [21] have proposed efficient BFXLMS using FFT and fast Hartley trans- form (FHT), which is computationally more efficient. We have proposed a delayed block LMS (DBLMS) algorithm [15], and a concurrent multiplier-based architecture for high-throughput pipeline implementation of BLMS ADFs. The structure of [15] provides fold higher throughput rate and demands times more resources compared to those of DLMS ADF. Baghel et al. [17], [18] have suggested a distributed-arithmetic (DA)-based structure for FPGA implementation of BLMS ADFs. A low- complexity design has been proposed in [19] for BLMS ADFs. 
This structure supports a very low sampling rate since it uses single multiply-accumulate (MAC) cell for the computation of filter output and weight-increment term. To take the advantage of DA-based hardware designs [12], Allred et al. [10] have suggested a scheme to derive a DA-based design for LMS-ADF. The structure of [10] requires separate 1053-587X/$31.00 © 2012 IEEE
  • 2. 922 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 61, NO. 4, FEBRUARY 15, 2013 look-up-tables (LUTs) for the calculation of filter output and weight-increment terms. The LUT used for the computation of filter output and weight-increment term of DA LMS-ADF is named as DA-F-LUT and DA-A-LUT, respectively. In every it- eration, entire content of DA-F-LUT is updated to compute the weight-increment term, where half the content of DA-A-LUT is updated to accommodate the new input sample arriving at the current iteration. Updating the LUTs is the most time con- suming operation in DA-based LMS-ADF, since the updating is performed sequentially at different LUT locations. The LUT update time, therefore, depends on the size of the LUT to be updated. For most practical adaptive filters, we need to use a decomposition scheme, where small size LUTs can be used in DA-based LMS-ADF which not only helps in reducing the LUT size but also in reducing LUT-update time. Recently Guo et al. [16] have suggested a scheme to avoid the DA-A-LUT in DA-based LMS-ADF, where both filtering and weight-updating are performed using DA-F-LUT. On the other hand, throughput rate of existing DA-LMS ADFs could be slow for real-time ap- plications due to bit-serial nature of DA computation. Although, there are some interesting work on DA-based LMS ADF [10], [16], we find that the potential application of DA for the imple- mentation of BLMS ADF is yet to be explored. In order to reduce the power consumption of DA-based de- signs, we aim at reducing the number of words in the LUT and less LUT-access. DA-based BLMS ADF structure can be de- rived by extending the scheme of [10], but this structure would demand times more hardware (memory and combinational logic) for times more throughput rate. 
The scheme of [16] of- fers sharing of LUT for the computation of both filter output and weight-increment term, but this scheme can not be applied to derive a DA-based structure for BLMS ADFs, because separate inner-product computation (IPC) is performed for calculation of filter output and weight-increment term of BLMS ADF whereas in case of LMS ADF, IPC is performed to calculate the filter output only. In this paper, we have formulated the DA-BLMS al- gorithm for sharing of LUTs for the computation of filter output and weight-increment terms. The key contributions of this paper are: • DA-based formulation of BLMS algorithm where both convolution operation to compute filter output and corre- lation operation to compute weight-increment term could be performed by using the same LUT. • A novel approach for minimization of number of LUT words to be updated per output. This helps to save external logic and power consumption. We have derived a DA-based structure for BLMS-ADF using the proposed DA-formulation and a novel LUT updating scheme. The most remarkable aspect of the proposed scheme is that the number of adders required by the structure does not increase proportionately with filter order, and the number of flip-flops required by the structure is independent of the block-size. Apart from that, the proposed structure has signifi- cantly less LUT access than the existing DA-LMS structure for higher block-sizes. The rest of this paper is organized as follows: Mathematical formulation is presented in Section II. The new-LUT update scheme is discussed in Section III, and the proposed structure for DA-based BLMS ADF is presented in Section IV. Hardware- and time-complexities of the proposed structure are discussed in Section V. Conclusion is presented in Section VI. II. 
MATHEMATICAL FORMULATION The BLMS algorithm for updating the filter weights in the -th iteration is given by (1) where is defined as (2) and are, respectively, the weight-vector and the error- vector of the -th iteration defined as: where is the step-size; and the input matrix is derived from the current input block of length , and past samples, given by The error-vector is computed as (3) where the desired response vector is defined as The -th block of filter output is computed by the matrix- vector product: (4) A. Computation of Filter Output The input matrix of size can be decomposed into square matrices of size each, where . Similarly, the weight vector can be decomposed into short weight-vectors of size , for . The computation of (4) can then be expressed as the sum of matrix-vector products: (5) where and are defined as
  • 3. MOHANTY AND MEHER: A HIGH-PERFORMANCE ENERGY-EFFICIENT ARCHITECTURE FOR FIR ADAPTIVE FILTER 923 for , and Each filter output now can be written as the sum of inner- products as (6) where is an -point inner-product of an input-vector and are given by (7) and is the -th row of given by for , , and . Note that we have dropped the subscript of in (7) only for convenience of further discussion, without loss of generality. B. Computation of Weight Increment Term The weight-increment vector can be decomposed into short vectors of size each, for . Computation of (2) can be performed through independent matrix-vector multiplication using the relation (8) where , and defined as (9) Using (8), the individual weight increment terms could be eval- uated by the following equation (10) where is the inner-product between the vector and , given by (11) Here also we have dropped the subscript of for con- venience of further discussions. As shown in (7) and (11), the input-vector is the same for a pair of inner-products and . This is a major advantage in order to optimize the LUTs when the inner-products of (7) and (11) are performed using the DA principle. C. DA-Formulation Let and , respectively, be the -th compo- nents of the -point vectors and , and assumed to be -bit numbers in 2’s complement representation: (12a) (12b) and are the -th bit of and , respec- tively. Substituting (12a) in (7), we have (13) Rearranging the order of summation, (13) may otherwise be ex- pressed as: (14) where , for , and for . Each term in the inner sum in (14) represents the inner-product of with a bit-vector (or bit-slice) of weight- vector . Corresponding to possible values of a bit-vector of length , there could be possible values of such inner- products of with any possible bit-vector of length . 
All those possible inner-products could be pre-computed and stored in an LUT, such that when the -th bit-vector (or bit-slice) of weight vector for , is fed to the LUT as address, its inner-product with , is read from the LUT. The computation of inner sum of (14), therefore, could be expressed in the form of memory read operation as: (15) where is a memory-read operation, and its argument for , is used as LUT-address. The inner-product of (11) may, similarly, be expressed in the form of memory-read operation as (16) where is the -th bit-vector of error-vector defined as: , which is used as address of an LUT to read its inner-products with . LUT contents for the computation of and are exactly the same, since the LUT content depends on the input-vector , and generated for all possible bit-slices of -bit length, irrespec- tive of whether that is of the weight-vector or the error-vector. When the bit-vector is used as address, the partial results of are read from the LUT, and when is used as address, then partial results of are read from the same LUT. Therefore, by using the proposed scheme, a common set of LUTs could be used for the computation of filter outputs and weight-increment terms. Since, the block of input samples changes after every iteration, the LUTs are required to be up- dated in every iteration to accommodate the new input-block. In the next Section, we have presented a novel LUT-updating scheme for the DA-based BLMS ADFs.
Fig. 1. (a) Inner-products of FIR filter of length 6 and block-size 2. The input-vector corresponding to each inner-product is shown inside the box. (b) LUT arrangement for DA-based computation of the FIR filter of length 6 and block-size 2. Each LUT here stores possible values of partial inner-product of input-vector and bit-vector of of length , for and .

III. LUT-UPDATING SCHEME

Before we discuss the proposed LUT-updating scheme, we summarize here the proposed decomposition of the input-matrix and weight-vector into small vectors, and their participation in the inner-product computation for the filtering operation. The input-matrix of size is decomposed into square matrices of size and is decomposed into short-vectors of size , for where . Each of rows of represents an input-vector, so that such input-vectors ( , for ) are derived from , and such input-vectors are derived from , for . All these input-vectors are arranged in rows and columns such that input-vectors of belong to -th column. According to (5), weight-vectors are multiplied independently with matrices which, in total, involves inner-products. According to (6), results of inner-products corresponding to each row of input-vectors are added together for obtaining a filter output. From such rows of inner-products, filter outputs are obtained.

We have illustrated here the aforementioned scheme for the implementation of an FIR filter of length 6 and block-size 2. Suppose, during the -th iteration the filter receives an input-block and computes a block of output . As discussed above, the input-matrix of size 2 × 6 is decomposed into 3 square-matrices , and of size 2 × 2. consists of a pair of input-vectors ( and ), and similarly and consist of pair of input-vectors and , respectively. The 6-point weight-vector is decomposed into three 2-point weight-vectors . Fig.
1(a) shows the arrangement of input-vectors and weight-vectors; and the corresponding inner-products are shown on the top of the rectangular boxes for clarity. Results of odd-numbered inner-products (on upper row) and even-numbered inner-products (on lower row) are added separately (not shown in the figure) to obtain and , respectively.

Fig. 2. DA-based computation of the block FIR filter for and . (a) for -th iteration, (b) for -th iteration.

As shown in Fig. 1(a), the same weight-vector is used for the computation of inner-product of a particular column of input-vectors. For DA realization, LUT corresponding to each and stores partial inner-products generated by the inner-product of the corresponding input-vector with all possible values of a bit-vector of length . DA-based parallel computation of filter outputs of Fig. 1(a) for the -th iteration is shown in Fig. 1(b). As shown in Fig. 2(a), the DA-based structure receives an input-block during the -th iteration, so that two new samples enter into the set of 7 samples, and the two oldest samples are discarded. Consequently, samples of all 6 input-vectors are changed, but the change occurs in a particular order. We can find from Fig. 1(b) and Fig. 2(a) that the contents of only the first column of LUTs of Fig. 2(a) are changed by the new samples, while in other columns the LUT values remain the same. The positions of those unchanged LUTs, however, are shifted right by one column. For instance, values stored in the LUTs of the second column of Fig. 2(a) are the same as values stored in LUTs of the first column of Fig. 1(b), and similarly values stored in LUTs of the third column of Fig. 2(a) are the same as those of the second column of Fig. 1(b). This feature can be observed in the LUT contents of Fig. 2(b) for the -th iteration also. In other words, contents of a particular column of LUTs during a particular iteration are simply transferred to the adjacent column of LUTs on its right during the next iteration.
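The column-shift property described above can be checked numerically for the length-6, block-size-2 example. The sketch below is only an illustration: the row/column indexing of the input-matrix (`x[k*L + l - j]`) is an assumed convention, and `sub_matrices` and `block_output` are hypothetical helper names.

```python
N, L = 6, 2          # filter length and block-size of the worked example
P = N // L           # number of 2 x 2 sub-matrices (3 here)


def sub_matrices(x, k):
    """Split the L x N input-matrix of block k into P square L x L sub-matrices.

    Row l, column j of the full matrix holds x[k*L + l - j] (assumed indexing).
    """
    full = [[x[k * L + l - j] for j in range(N)] for l in range(L)]
    return [[row[p * L:(p + 1) * L] for row in full] for p in range(P)]


def block_output(x, w, k):
    """Block of L filter outputs as a sum of P short matrix-vector products."""
    blocks = sub_matrices(x, k)
    return [sum(blocks[p][l][j] * w[p * L + j]
                for p in range(P) for j in range(L))
            for l in range(L)]


x = [(7 * i * i + 3 * i) % 11 - 5 for i in range(20)]   # arbitrary test samples
w = [2, -1, 3, 0, -4, 1]                                # arbitrary weights

now, nxt = sub_matrices(x, 3), sub_matrices(x, 4)
# Sub-matrix p of one iteration reappears as sub-matrix p + 1 of the next one.
shifted = all(nxt[p + 1] == now[p] for p in range(P - 1))
```

Only `nxt[0]` contains new samples, which is why only one column of LUTs has to be regenerated per iteration.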
In this way, the oldest input samples of a particular set are shifted out through the -th column ( in the example) of LUTs, and new values are entered at the first column of LUTs. Shifting of values physically from one LUT to the next across the array of LUTs is highly time-consuming and power-consuming. Therefore, we have proposed a novel LUT-updating scheme, where the LUT content need not be shifted. Since each column of LUTs uses the same weight-vector as LUT-address, the column-wise right-shift of LUT values can be achieved by a left-shift of the weight-vectors. This technique could save a lot of time and power, since the shifting of weight-vectors is significantly less expensive than the shifting of LUT contents. In the proposed LUT update scheme, contents of only one column
of LUTs out of 3 such columns (for ) need to be updated in every iteration. We can find from Fig. 1(b) and Fig. 2(a) that the values of the third-column LUTs of the -th iteration are not used during -th iteration, since they correspond to the oldest block of samples . The LUTs of the third column are updated as shown in Fig. 3(a) in grey color. To feed weight-vectors to LUTs of Fig. 3(a) in the same order as that of Fig. 2(a), weight-vectors of Fig. 1(b) are simply left-shifted by one location. As shown in Fig. 3(a), the second column of LUTs contains the values corresponding to the samples , which is the oldest block of samples in the -th iteration, and this input-block is discarded and the corresponding LUTs are updated by the partial inner-products of the new input-block . Weight-vectors of Fig. 3(a) are left-shifted by one column, and fed to LUTs of Fig. 3(b) as addresses.

Fig. 3. (a) Equivalent DA-based structure of Fig. 2(a), which is derived from the structure of Fig. 1(b) by changing the content of the 5th and 6th LUTs (shown in grey color) and left-shifting the weight-vectors by one position. (b) Equivalent DA-based structure of Fig. 2(b), derived from the structure of Fig. 3(a) by changing the content of the 3rd and 4th LUTs (shown in grey color) and left-shifting the weight-vectors by one position.

In the following, we summarize the proposed scheme for updating LUTs of BLMS-based adaptive filter:
• LUTs are updated column-by-column in every iteration in cyclic order.
• The LUTs which store the values of partial inner-products corresponding to samples of the oldest input block are overwritten by those of the new input block.
• The weight-vectors are circularly left-shifted after every iteration to change the columns of LUT to be read circularly.
• The values required for updating a column of LUTs for any particular iteration are calculated from samples of the current input-block and samples of the most recent past samples of the previous block.

Based on the above scheme, LUT-matrix is updated column-by-column from right to left after every iteration. The updating process starts from the -th column of LUTs and goes to the first column in a cyclic manner, and then again from the first column it goes to the -th column and then to the -th column and so on. Hence, LUTs of one particular column are updated once in a period of iterations.

IV. PROPOSED ARCHITECTURE

The proposed DA-BLMS structure comprises one DA-module, one error bit-slice generator (EBSG) and one weight-update cum bit-slice generator (WBSG). WBSG updates the filter weights and generates the required bit-vectors in accordance with the DA-formulation. EBSG computes the error block according to (3) and generates its bit-vectors. The DA-module updates the LUTs and makes use of the bit-vectors generated by WBSG and EBSG to compute the filter output and weight-increment terms according to (15) and (16).

Fig. 4. Proposed DA-based structure for implementation of BLMS adaptive FIR filters (for and ), where , , and .

A. Structure for Block-Size 4

The proposed structure for DA-based BLMS adaptive filter for filter length 16 and block-size 4 is shown in Fig. 4. The DA-module receives a block of input samples in every iteration, and computes a block of filter output . It also receives a block of errors in every iteration, and computes the weight-increment term for all the components of the weight-vector . The structure of the proposed DA-module is shown in Fig. 5. It consists of 4 identical processing elements (PEs) for , one LUT-update block and one MUX-array. Structure of the PE is shown in Fig. 6. It consists of 4 identical subcells (SCs) for . Internal structure and function of the -th SC is shown in Fig. 7.
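The LUT-updating scheme summarized in Section III can be sketched in software as follows. This is a hedged illustration for the length-6, block-size-2 example, not the authors' hardware: each stored pair of input-vectors stands in for the LUTs that would be generated from it, the slot order `k % P` is an assumed cyclic convention (the paper proceeds from the last column towards the first), and `newest_sub_matrix` and `run` are hypothetical names.

```python
N, L = 6, 2
P = N // L


def newest_sub_matrix(x, k):
    """Input-vectors built from block k and the L - 1 most recent past samples."""
    return [[x[k * L + l - j] for j in range(L)] for l in range(L)]


def run(x, w, first_block, num_blocks):
    """Filter `num_blocks` blocks while overwriting one column per iteration."""
    slots = [None] * P                  # one slot per LUT column; never shifted
    outputs = []
    for k in range(first_block - P + 1, first_block + num_blocks):
        # Overwrite the column that holds the oldest block with the newest one.
        slots[k % P] = newest_sub_matrix(x, k)
        if k < first_block:
            continue                    # warm-up: columns are still being filled
        block = []
        for l in range(L):
            acc = 0
            for p in range(P):
                # Rotate the weight segments instead of moving column contents:
                # segment p addresses the column written p iterations ago.
                rows = slots[(k - p) % P]
                acc += sum(rows[l][j] * w[p * L + j] for j in range(L))
            block.append(acc)
        outputs.append(block)
    return outputs


x = [(7 * i * i + 3 * i) % 11 - 5 for i in range(20)]
w = [2, -1, 3, 0, -4, 1]
y_blocks = run(x, w, first_block=3, num_blocks=4)
```

Each iteration writes exactly one slot, and every slot is rewritten once every `P` iterations, matching the four bullet points of the summary.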
As required by (15), LUT of the -th SC of this PE stores 16 possible values corresponding to the samples , where . The LUT-update block of the DA-module generates the required values to update LUTs of a particular PE. Structure of the LUT-update block is shown in Fig. 8. It consists of one adder-block and an input delay unit (IDU), which stores samples of the previous block. During each iteration, the adder
block receives samples ( samples from the current input block and past samples from the IDU), and feeds these samples to adder-cells (ACs) (see Fig. 8) such that each AC receives samples, and input blocks of adjacent ACs are overlapped by samples. During the -th iteration, AC- receives input samples and AC- receives the samples .

Fig. 5. Structure of DA-module of the proposed DA-BLMS ADF (for and ). The subscript of , and varies from 0 to in cycles.

Fig. 6. Internal structure of -th processing element (PE) of the DA-module for block-size , where .

Fig. 7. Internal structure and function of -th subcell (SC) of a PE, where and , . Convergence factor is assumed to be power of 2.

For block-size 4, each AC receives a block of four samples in every iteration (shown in Fig. 9). As shown in the figure, each of the four inputs of the AC is ANDed with a bit of the four-bit address by four AND cells. Each AND cell consists of AND gates, where is the word-length of input samples. All those AND gates are fed with a bit of the address, while the other inputs of the AND gates are fed with the bits of the input sample. The outputs of the AND cells are fed to an adder-tree (AT). AC receives 16 possible values of in 16 clock cycles, and calculates 16 values of to be stored in the LUT, where is used as the address of the LUT location and is the equivalent integer value of . All the ACs of the adder block (see Fig. 8) work in parallel, and generate all the required values to update LUT of SCs of a PE. According to the proposed LUT-update scheme, LUTs of one PE out of PEs are updated in every iteration. LUTs of all the PEs are updated once in cycles
in a cyclic order. Each PE uses a separate control signal ( , for ) to enable the specific column of LUTs to be updated. LUT-update operation of the proposed structure is completed during the first 16 clock cycles of every iteration.

Fig. 8. Internal structure of LUT-update block for block-size , where .

Fig. 9. Internal structure of -th AC of the LUT-update block for block-size .

Fig. 10. Internal structure of MUX-array for and .

Fig. 11. Structure of error computation cum bit-slice generator (EBSG) for block-size , where , and .

Each PE receives the bit-vectors , and through the MUX array (shown in Fig. 10) for updating the LUTs or computation of filter outputs or weight-increment terms, respectively. After completion of the LUT-update, filtering computation follows immediately for the next clock cycles by a series of LUT-read operations using the bit-slices of the corresponding weight-vector in LSB to MSB order, as successive addresses according to (15). During the -th cycle of filtering, the WBSG generates parallel bit-vectors of width bits each for the PEs to perform the filtering operation. Each SC receives a sequence of bit-vectors , (for where is the wordlength of the filter-coefficients) from the WBSG in clock cycles. The LUT-read values are shift-accumulated in an accumulator (ACC) to obtain a partial filter output. During the -th cycle the LUT output is subtracted from the accumulated result since the bit-vector during this cycle contains the sign-bits of weight-vector . Each SC uses control signal CTR1 to control the add/subtract operation in the ACC. At the end of the -th cycle, ACC contents are sent to the DMUX as input, and the ACC register (not shown in Fig. 7) is cleared to be used for the computation of weight-increment term from the next cycle (CTR1 is used for clearing the register).
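The adder cell of the LUT-update block (AND cells followed by an adder tree, stepping through 16 addresses in 16 clock cycles) can be sketched as below for block-size 4. This is a behavioural illustration only: the mapping of address bit `i` to sample `i`, the oldest-first ordering of the sample window, and the names `adder_cell` and `lut_update_block` are assumptions made for the example.

```python
def adder_cell(samples):
    """Return the 16 LUT words for four input samples, one word per clock cycle."""
    assert len(samples) == 4
    lut = []
    for addr in range(16):                        # one address per clock cycle
        # AND cell: a sample passes only when its address bit is 1.
        gated = [s if (addr >> i) & 1 else 0 for i, s in enumerate(samples)]
        lut.append(sum(gated))                    # adder tree
    return lut


def lut_update_block(current_block, previous_tail):
    """Four adder cells working on overlapping windows of a 7-sample set.

    `previous_tail` plays the role of the input delay unit (3 past samples).
    """
    window = previous_tail + current_block        # 3 past + 4 current samples
    return [adder_cell(window[i:i + 4]) for i in range(4)]


luts = lut_update_block(current_block=[9, -2, 5, 7], previous_tail=[1, 4, -3])
```

Adjacent windows share three samples, so the four cells together need only seven distinct inputs per iteration.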
Finally the DMUX sends the computed partial results of inner-products to the output line using the select signal CTR6. From SCs of each PE, partial results are obtained in parallel, and the corresponding outputs of each SC from PEs are added by an AT (Fig. 5) to obtain (the -th component of -th block of filter output). A block of parallel filter outputs ( ) is obtained from ATs of the DA-module in each cycle. EBSG receives one block of filter output ( ) from the DA-module, and calculates a block of error ( ) in every iteration using one block of desired response according to (3). As shown in Fig. 11, error values are loaded in parallel-in serial-out (PISO) shift-registers of the bit-slice generator (BSG) to generate bit-vectors of error-vector . CTR4 enables the clock for the BSG and CTR2 controls the load-shift operation of each SR. Bit-vectors , for , are fed serially in LSB to MSB order to the DA-module in successive clock cycles to compute weight-increment terms for the -th iteration. According to (16), LUT values of the -th block of filter output are also used to compute weight-increment terms for the -th iteration. In general, LUT values of -th SC of -th PE (for , ) are used to compute the weight-increment term . The -th PE, therefore, computes the weight-increment terms . The computation of weight-increment terms is similar to that of the partial filter outputs. But in this case the same bit-vector is used
by all the PEs of the DA-module to compute the weight-increment terms. In each SC (see Fig. 7), the ACC contents corresponding to the weight-increment term are sent to the output line of the DMUX. The weight-increment terms are scaled by a factor . Here we have assumed is a power of 2, so that the scaling of by is realized by a right-shift operation using a fixed-shifter (see Fig. 7).

TABLE I LUT UPDATING SCHEME FOR FILTER ORDER 16 AND BLOCK SIZE 4

According to (1), the WBSG of the proposed DA-BLMS structure requires only the weight-increment terms of the current iteration to update the weight-vector for the next iteration. It does not require the LUT values of the current iteration. Therefore, once the weight-increment terms of the current iteration are computed, the LUT-updating operation for the next iteration can be started immediately in the next clock cycle. As we discussed earlier, the filter computation follows the LUT-update operation, and the first 16 clock cycles of every iteration are used to complete the LUT-update operation. During this period, the weight-update operation of WBSG can also be performed concurrently. A bit-parallel (word-serial) structure of WBSG requires one clock-cycle to complete the weight-update operation, while a bit-serial structure of WBSG requires clock-cycles to complete the same task. If wordlength of filter-coefficients ( ) is less than or equal to the LUT-size , then bit-serial realization of WBSG does not increase the iteration period of the DA-BLMS structure, but it certainly helps to reduce the hardware complexity of the DA-BLMS structure. For and , we can have a bit-serial structure for the WBSG. Bit-serial structure of WBSG receives the weight-increment terms from the DA-module in bit-serial LSB to MSB order, and updates the weight-vector accordingly.
For bit-serial realization of WBSG, weight-increment terms computed by each PE of the DA-module are finally loaded into a separate BSG (see Fig. 5) to generate the weight-increment terms in bit-serial order. All the BSGs of the DA-module use common control signals CTR6 and CTR5 to perform the loading and shifting operations, respectively. WBSG is an important block of the proposed structure. It performs three operations: (i) updates filter weights using the weight-increment values calculated by the DA-module, (ii) generates bit-vectors for the DA-module to compute -th block of filter output, (iii) gives one left-shift (circularly) to the weight-vectors as necessitated by the proposed LUT-update scheme. We have shown LUT updating of the DA-BLMS ADF for filter length 16 and block-size 4 in Table I for the first 5 iterations using the proposed LUT-updating scheme. As shown in Table I, for filter length 16 and block-size 4, the LUT-matrix has 4 columns (for ). LUTs of all these 4 columns are updated once in a period of 4 iterations. At any given iteration, the LUT-matrix contains the values corresponding to recent past input samples to compute a block of 4 filter outputs. As shown in Table I, during the 5th iteration, LUT-matrix ( to ) contains the values corresponding to the set of input samples ( to ). This set of 19 samples is exactly what is required to compute the filter outputs ( to ). Similarly, the LUT-matrix contains the values corresponding to the set of samples during the 6th iteration, and these samples are exactly what is required to compute filter outputs ( to ). The bit-serial structure of WBSG is shown in Fig. 12. It consists of serial-in serial-out (SISO) SRs and carry-save full-adders (CSFAs) corresponding to filter weights. SRs are arranged in matrix form; and filter-weights are stored in the SR matrix column-wise, such that weight-vector is stored in -th column of SRs.
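The bit-serial weight update performed by each full-adder with its carry flip-flop can be sketched as follows. This is a generic LSB-first serial adder written for illustration, not the authors' netlist; the 8-bit width and the helper names `to_bits`, `from_bits` and `bit_serial_add` are assumptions.

```python
def to_bits(value, width):
    """Two's-complement bits of `value`, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Decode an LSB-first two's-complement bit list."""
    value = sum(b << i for i, b in enumerate(bits))
    return value - (1 << len(bits)) if bits[-1] else value


def bit_serial_add(weight_bits, increment_bits):
    """Add two LSB-first bit streams, producing one sum bit per clock cycle."""
    carry = 0                       # carry flip-flop, cleared at the start
    out = []
    for a, b in zip(weight_bits, increment_bits):
        out.append(a ^ b ^ carry)                       # full-adder sum
        carry = (a & b) | (a & carry) | (b & carry)     # full-adder carry
    return out                      # final carry is dropped (fixed word length)


WIDTH = 8
new_weight = from_bits(bit_serial_add(to_bits(37, WIDTH), to_bits(-5, WIDTH)))
```

Because one bit is produced per cycle, updating a `WIDTH`-bit weight takes `WIDTH` cycles, which is why the bit-serial form trades time for a single full-adder per weight.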
As shown in Table I for , bit-slices of the weight-vector are received by the PE whose LUTs are to be updated during the -th iteration, and are generated from the first column of filter weights . The weight-vector needs to be aligned with the corresponding PE. If during the -th iteration, LUT of PE-1 is to be updated, then the first column of SR-matrix is required to contain the components of weight-vector and the -th column of SRs should contain components of weight-vector . As shown in Fig. 12, weight-increment values of the -th column of filter coefficients (available in the -th column of SR-matrix) are obtained from the -th PE, and these values are added with the corresponding filter-weights bit-serially using a carry-save full-adder (CSFA). Results of CSFA of -th column constitute a bit-vector of . SR contents are shifted left for clock cycles, to generate the shifted weight-vectors in accordance with the proposed LUT-update scheme. Shifting operation of the SRs starts at -th clock cycle of every iteration, and continues for clock cycles. The control signal CTR5 is used in WBSG to enable the shifting operation. The D flip-flop of each CSFA is cleared during the first clock cycle of every iteration to flush out the final carry of the previous iteration of weight-update operation.

B. Structure for Higher Block-Size

To derive DA-based BLMS structure for higher block sizes using LUT of 16 words, we can take the block-size to be a multiple of 4, i.e. , where is an integer. The structures
of EBSG and WBSG of the DA-BLMS filter for (for ) are the same as those of block-size 4 shown in Fig. 11 and Fig. 12, respectively. However, the AC of the LUT-update block and the SC of each PE of the DA-module need to be modified according to the value of . Each SC in this case is comprised of LUTs of size 16 words each. The bit-vectors of weight-vectors and error-vectors of bits each are split into segments of 4-bit size, and fed to LUTs of each SC to read the LUTs in parallel. The values read from the LUTs are added using an AT and subsequently shift-accumulated in the ACC for obtaining a partial output. To generate the weight update-values for LUTs, each AC of the LUT-update block in this case is comprised of AND-TA blocks of size 4 (as shown in Fig. 9). For block-size , each SC involves RAM words and adders along with one ACC and 2 DMUX. Similarly, the LUT-update block involves AND-gates and adders.

Fig. 12. Bit-serial structure of weight-update cum bit-slice generator (WBSG) for , and .

V. HARDWARE-TIME COMPLEXITY AND PERFORMANCE COMPARISON

A. Hardware Complexity

Proposed structure is comprised of one DA-module, one WBSG, one EBSG and a control unit. The DA-module consists of one LUT-update block, PEs, adder-trees of words each, one MUX-array and BSG, where and the block-size . LUT-update block consists of one IDU and ACs, where the IDU is comprised of registers of size , and each AC is comprised of AND-gates and adders. LUT-block, therefore, involves registers, adders and AND-gates. Each PE consists of SCs, where each SC is comprised of LUTs of 16 words each, adders, one ACC, one 1-to-2 line DMUX and number of 2-input XOR-gates (used by ACC (not shown in Fig. 7) to compute 1’s complement of the LUT-outputs when the bit-vector contains sign-bits), where ACC involves one adder, one register and one 2-to-1 line DMUX ( ).
Each PE, therefore, involves memory words, adders, registers, DMUXes (2-to-1 line) and XOR-gates. Each BSG is comprised of SRs (bit-level) of size . MUX-array involves 2-to-1 line MUXes. The DA-module, therefore, involves memory words, adders, D-type flip-flops (FFs), 2-to-1-line MUXes/DMUXes (word-level), AND-gates and XOR-gates. WBSG involves D-type FFs and FAs. EBSG involves D-type FFs and adders. Proposed structure, therefore, requires memory words, adders, FAs, D-type FFs, MUXes/DMUXes (word-level), AND-gates and XOR-gates.

B. Time Complexity

The proposed structure performs four operations sequentially in every iteration. Those are (i) LUT update, (ii) filter output computation, (iii) error calculation and (iv) computation of weight-increment term. It involves 16 clock cycles to complete LUT-update operation. It takes clock cycles to calculate partial results of a block of filter output. It calculates a block of filter output from the partial results and then block of error in one clock cycle. Finally it takes clock cycles to compute the weight-increment term for the weight vector. In every iteration, proposed structure processes one block of samples, where one iteration involves clock cycles and duration of one clock cycle is , where is the delay of one -bit adder. For comparison purpose, we have also estimated number of clock cycles required by the structure of [10] and [16] for one iteration. We assumed the read and write operations are performed in two separate clock cycles in a LUT to maintain uniformity in the comparison. Structure of [10] requires 16 clock cycles to update the DA-A-LUT of size 16 words, clock cycles to compute one filter output and 32 clock cycles to update the DA-F-LUT
of size 16 words. It involves 48 clock cycles for one iteration and computes one output per iteration, where the duration of one clock cycle is and . Since the structure of [16] does not involve DA-F-LUT, it requires 16 clock cycles for updating the DA-A-LUT and clock cycles to compute one filter output. The structure of [16], therefore, involves clock cycles for one iteration, where the duration of the clock period is the same as that of [10].

TABLE II GENERAL COMPARISON OF HARDWARE COMPLEXITY OF THE PROPOSED STRUCTURE (FOR BLOCK-SIZE ) AND THE STRUCTURE OF [10] AND [16] (WITH DECOMPOSITION FACTOR 4) AND THE DA-BLMS STRUCTURE OF [18]
LEGEND: ADD: adder, MULT: multiplier, FF: flip-flop, VSH: variable shifter, TR: throughput rate, LAPO: LUT access per output. , , , , , . In addition to the above list of components the proposed structure involves FAs, 2-input AND-gates and 2-input XOR-gates, where : word length of the sequence , and , : word-length of input sequence, . For the proposed structure, , and in case of [10] and [16], , and in case of [18], , , and block-size , where and are relatively prime to each other.

C. Number of LUT Access

During every iteration, proposed structure computes filter outputs, and performs write operations for updating the LUTs, LUT read operations for filter output computation and LUT read operations for the computation of weight-increment terms. The number of LUT access per output (LAPO) of the structure is, therefore, . Similarly, the number of LAPO of [10] and [16] are found to be and , respectively, where is the bit-width of the input samples and is the bit-width of all the intermediate and output samples. Note that LUTs of DA-based ADF are required to be implemented by RAM, and the total energy consumption of the structure, therefore, increases significantly with LAPO.

D. Performance Comparison

Hardware and time complexities of the proposed structure and the DA-LMS structures of [10], [16], and the DA-BLMS structure of [18] are listed in Table II for comparison. The structure of [16] is the most efficient one amongst the existing DA-LMS structures. Compared with [16], proposed structure requires times more LUT words, nearly times more adders, 4/3 times more FFs and offers nearly times higher throughput rate. It involves 16 more LAPO for block-size 4 and less LAPO for block-size 8 than those of [16] for 16-bit internal bit precision. Interestingly, the number of adders of the proposed structure does not increase proportionately with block-size, and the number of flip-flops is independent of block-size. Besides, it does not require variable shifters unlike those of [10] and [16].

We have estimated hardware and time complexity of proposed structure for and 8, and that of [10] and [16] for filter size ( , 32 and 64) using the complexity counts of Table II. The estimated values are listed in Table III for comparison. Compared with the structure of [16], proposed structure for involves 8 times more LUT words; 3.27 times more adders on average for different filter orders, and offers 5.22 times higher throughput. But it involves 37.5%, 24.4%, 17.8% more flip-flops and 25%, 37.5%, 47.6% less LAPO than those of [16] for filter orders 16, 32, 64, respectively.

E. Simulation Result

To validate the proposed design, we have coded it in VHDL for filter order 16, 32 and 64 with block-size 4 and 8. We have also coded the design of [10] and [16] for the same filter orders. We have considered and , and synthesized both the designs by Synopsys Design Compiler using TSMC 90 nm CMOS library. Synthesis reports obtained from the Design Compiler are listed in Table IV. Synthesis results are in accordance with the theoretical estimation given in Table III.
The minimum clock period of the proposed structure and the structure of [16] are slightly higher than those of [10] due to the extra MUX/DMUX in the critical path. As shown in Table IV, structure of [16] is the most efficient amongst the existing structures. Compared with [16], the proposed structure for block sizes 4 and 8 involves, respectively, 2.13 and 3.69 times more area on average for different filter orders and offers nearly 2.61 and 5.22 times higher throughput rate, respectively.
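The quoted throughput gains of 2.61 and 5.22 can be reproduced with a short calculation. The cycle-count expressions are not legible in this text, so the sketch below rests on stated assumptions: a 16-bit word length for both weights and errors, 16 cycles to rewrite a 16-word LUT, one cycle for output and error computation, one LUT read per bit-slice, and an equal clock period for the proposed structure and that of [16]. The function names are hypothetical.

```python
B = 16            # assumed word length (bits) of weights and errors
LUT_WORDS = 16    # a 16-word LUT takes 16 write cycles to update


def proposed_cycles(b=B):
    """Assumed cycles per iteration: LUT update, filtering, error, increment."""
    return LUT_WORDS + b + 1 + b


def reference_cycles(b=B):
    """Assumed cycles per output for the structure of [16]: LUT update, filtering."""
    return LUT_WORDS + b


def throughput_gain(block_size, b=B):
    """Outputs per cycle of the proposed structure relative to [16]."""
    return block_size * reference_cycles(b) / proposed_cycles(b)


gain_4 = round(throughput_gain(4), 2)   # 2.61
gain_8 = round(throughput_gain(8), 2)   # 5.22
```

Under these assumptions an iteration takes 49 cycles and yields a whole block of outputs, against 32 cycles per single output for [16], so the gain grows linearly with block-size.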
TABLE III HARDWARE AND TIME-COMPLEXITY OF PROPOSED STRUCTURE AND STRUCTURE OF [10] AND [16] FOR DIFFERENT SIZE FILTERS

TABLE IV COMPARISON OF AREA, DELAY, AND POWER COMPLEXITIES OBTAINED FROM SYNTHESIS RESULT OF PROPOSED STRUCTURE AND STRUCTURE OF [10] AND [16]

We have estimated ADP, PPO and energy per output (EPO) at 20 MHz clock. As shown in Table IV, for block-size 4, the proposed structure has 17.47%, 18.49%, 13.66% less ADP than that of [16] for filter order 16, 32 and 64, respectively. For block-size 8, it has 31.6% less ADP than [16] on average for different filter orders. For block-size 4, it consumes 27.5%, 28.8% and 24.6% less EPO than that of [16] for filter order 16, 32 and 64, respectively. Similarly, for block-size 8, it consumes, respectively, 40%, 39.8% and 37.4% less EPO than [16] for 16, 32 and 64 order filters. One can extrapolate these results to obtain approximate values of ADP, PPO and EPO, and hence an approximate estimate of the advantages of the proposed structure, for filter order greater than 64.

VI. CONCLUSION

We have derived a DA formulation of BLMS algorithm where both convolution and correlation are performed using a common LUT for the computation of filter outputs and weight increment terms, respectively. This results in a significant saving of LUT words and adders which constitute the major hardware components in DA-based computing structures. Also we have suggested a novel LUT updating scheme to update the LUT contents for DA-based BLMS ADF, where only one set of LUTs out of sets need to be modified in every iteration such that LUT contents are modified once in every iterations, where , is the filter length and is the input block-size.
Using the proposed scheme, we have derived a parallel architecture for the implementation of DA-based BLMS ADF. Unlike the existing DA-based LMS structure, number of adders required by the proposed structure does not increase linearly with . Compared with the best of the existing DA-based LMS designs, proposed one involves nearly times more adders, and times
more LUT words and offers nearly times throughput. It requires nearly 25% more flip-flops irrespective of the block-size, but does not involve variable shifters like others. It involves fewer LUT accesses per output than the existing structure for block-size higher than 4. This is a major advantage of the proposed structure for reducing its ADP and EPO when implemented for large order ADF, and for higher block-sizes. For block-size 8 and filter length 64, the proposed structure involves 2.47 times more adders, 15% more flip-flops, 43% less LAPO than the best of the existing structures, and offers 5.22 times higher throughput. ASIC synthesis result shows that the proposed structure for filter order 64 has almost 14% and 30% less ADP and 25% and 37% less EPO than the best of the existing structures for block size 4 and 8, respectively.

REFERENCES

[1] S. Haykin and B. Widrow, Least-Mean-Square Adaptive Filters. Hoboken, NJ: Wiley-Interscience, 2003.
[2] R. Haimi-Cohen, H. Herzberg, and Y. Beery, “Delayed adaptive LMS filtering: Current results,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Albuquerque, NM, Apr. 1990, pp. 1273–1276.
[3] M. D. Meyer and D. P. Agrawal, “A modular pipelined implementation of a delayed LMS transversal adaptive filter,” in Proc. IEEE Int. Symp. Circuits Syst., New Orleans, LA, May 1990, pp. 1943–1946.
[4] V. Visvanathan and S. Ramanathan, “A modular systolic architecture for delayed least mean square adaptive filtering,” in Proc. IEEE Int. Conf. VLSI Des., Bangalore, 1995, pp. 332–337.
[5] R. D. Poltmann, “Conversion of the delayed LMS algorithm into the LMS algorithm,” IEEE Signal Process. Lett., vol. 2, p. 223, Dec. 1995.
[6] S. C. Douglas, Q. Zhu, and K. F. Smith, “A pipelined LMS adaptive FIR filter architecture without adaptive delay,” IEEE Trans. Signal Process., vol. 46, pp. 775–779, Mar. 1998.
[7] L. D. Van and W. S. Feng, “Efficient systolic architectures for 1-D and 2-D DLMS adaptive digital filters,” in Proc. IEEE Asia Pacific Conf. Circuits Syst., Tianjin, China, Dec. 2000, pp. 399–402.
[8] L. D. Van and W. S. Feng, “An efficient architecture for the DLMS adaptive filters and its applications,” IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 48, no. 4, pp. 359–366, Apr. 2001.
[9] G. A. Clark, S. K. Mitra, and S. R. Parker, “Block implementation of adaptive digital filters,” IEEE Trans. Circuit Syst., vol. 28, pp. 584–592, Jun. 1981.
[10] D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V. Anderson, “LMS adaptive filters using distributed arithmetic for high throughput,” IEEE Trans. Circuits Syst., vol. 52, no. 7, pp. 1327–1337, Jul. 2005.
[11] D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V. Anderson, “A novel high performance distributed arithmetic adaptive filter implementation on an FPGA,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2004, vol. 5, p. V-161-4.
[12] S. A. White, “Applications of distributed arithmetic to digital signal processing: A tutorial review,” IEEE ASSP Mag., vol. 6, pp. 4–19, Jul. 1989.
[13] D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V. Anderson, “An FPGA implementation for a high throughput adaptive filter using distributed arithmetic,” in Proc. 12th Annu. IEEE Symp. Field-Programmable Custom Comput. Mach., 2004, pp. 324–325.
[14] W. Huang and D. V. Anderson, “Adaptive filters using modified sliding-block distributed arithmetic with offset binary coding,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2009, pp. 545–548.
[15] B. K. Mohanty and P. K. Meher, “Delayed block LMS algorithm and concurrent architecture for high-speed implementation of adaptive FIR filters,” presented at the IEEE Region 10 TENCON2008 Conf., Hyderabad, India, Nov. 2008.
[16] R. Guo and L. S. DeBrunner, “Two high-performance adaptive filter implementation schemes using distributed arithmetic,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 58, no. 9, pp. 600–604, Sep. 2011.
[17] S. Baghel and R. Shaik, “FPGA implementation of fast block LMS adaptive filter using distributed arithmetic for high-throughput,” in Proc. Int. Conf. Commun. Signal Process. (ICCSP), Feb. 10–12, 2011, pp. 443–447.
[18] S. Baghel and R. Shaik, “Low power and less complex implementation of fast block LMS adaptive filter using distributed arithmetic,” in Proc. IEEE Students Technol. Symp., Jan. 14–16, 2011, pp. 214–219.
[19] R. Jayashri, H. Chitra, H. Kusuma, A. V. Pavitra, and V. Chandrakanth, “Memory based architecture to implement simplified block LMS algorithm on FPGA,” in Proc. Int. Conf. Commun. Signal Process. (ICCSP), Feb. 10–12, 2011, pp. 179–183.
[20] Q. Shen and A. S. Spanias, “Time and frequency domain X block LMS algorithm for single channel active noise control,” Control Eng. J., vol. 44, no. 6, pp. 281–293, 1996.
[21] D. P. Das, G. Panda, and S. M. Kuo, “New block filtered-X LMS algorithms for active noise control systems,” IET Signal Process., vol. 1, no. 2, pp. 73–81, Jun. 2007.
[22] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New York: Wiley, 1999.
[23] C. S. Burrus, “Index mappings for multidimensional formulation of the DFT and convolution,” IEEE Trans. Acoust., Speech, Signal Process., vol. 25, pp. 239–242, Jun. 1977.

Basant K. Mohanty (M’06–SM’11) received the M.Sc. degree in physics from Sambalpur University, India, in 1989 and the Ph.D. degree in the field of VLSI for digital signal processing from Berhampur University, Orissa, in 2000. In 2001, he joined as Lecturer in Electrical and Electronic Engineering Department, BITS Pilani, Rajasthan. Then, he joined as an Assistant Professor in the Department of Electronics and Communication Engineering, Mody Institute of Education Research (Deemed University), Rajasthan.
In 2003, he joined Jaypee University of Engineering and Technology, Guna, Madhya Pradesh, where he became Associate Professor in 2005 and full Professor in 2007. His research interests include the design and implementation of low-power and high-performance systems for multimedia applications, multi-core processor design, and algorithms for concurrent processing. He has published nearly 40 technical papers. Dr. Mohanty is a lifetime member of The Institution of Electronics and Telecommunication Engineering, New Delhi, India. He was the recipient of the Rashtriya Gaurav Award conferred by the India International Friendship Society, New Delhi, India, for 2012.

Pramod Kumar Meher (SM’03) received the M.Sc. degree in physics and the Ph.D. degree in science from Sambalpur University, India, in 1978 and 1996, respectively. Currently, he is a Senior Scientist with the Institute for Infocomm Research, Singapore, and Adjunct Professor with the School of Electrical Sciences, Indian Institute of Technology Bhubaneswar, India. Previously, he was a Professor of Computer Applications with Utkal University, India, from 1997 to 2002, and a Reader in electronics with Berhampur University, India, from 1993 to 1997. His research interests include the design of dedicated and reconfigurable architectures for computation-intensive algorithms pertaining to signal, image and video processing, communication, bio-informatics, and intelligent computing. He has contributed nearly 200 technical papers to various reputed journals and conference proceedings. Dr. Meher has served as a speaker for the Distinguished Lecturer Program (DLP) of the IEEE Circuits and Systems Society and as Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS. Currently, he is serving as Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, and the Journal of Circuits, Systems, and Signal Processing. He was the recipient of the Samanta Chandrasekhar Award for excellence in research in engineering and technology for 1999.