1. Low Power Architecture for JPEG 2000
Dr. P. R. Panda, Associate Professor, IIT-Delhi
Rahul Jain (2004JVL2433), M.Tech (VDTT), IIT-Delhi
S. Krishnakumar, Cypress Semiconductor, Bangalore
2. Agenda
JPEG2000 and 2-D DWT
Memory Power Optimization
Existing 2D-DWT Scan Based Architectures
Proposed Architectures
Low Power Z-Scan
Low Power Block Scan
Optimization and Pipelining Exploration for 2D-DWT
Proposed DFG Optimization
Pipeline Study
4. Discrete Wavelet Transform
2D wavelet transform:
1st: apply the 1D wavelet transform to all rows
2nd: apply the 1D wavelet transform to all columns
Each row/column can be computed
independently
[Figure: subband layout (LL, HL, LH, HH) of the image after 1-Level DWT and 2-Level DWT]
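The row-then-column procedure above can be sketched in a few lines of Python. This is only an illustration: it uses a simple Haar-style average/difference filter as a stand-in for the JPEG2000 wavelet filters, and the names `haar_1d` and `dwt_2d` are hypothetical.

```python
def haar_1d(x):
    """One-level 1-D transform: low-pass half followed by high-pass half.
    Assumption: Haar-style average/difference, not the JPEG2000 filters."""
    low = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]
    return low + high


def dwt_2d(image):
    """Apply the 1-D transform to every row, then to every column."""
    rows = [haar_1d(r) for r in image]
    cols = [haar_1d(list(c)) for c in zip(*rows)]  # each column of the row result
    return [list(r) for r in zip(*cols)]           # transpose back to row-major


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
out = dwt_2d(image)
ll = [row[:2] for row in out[:2]]   # top-left quadrant is the LL subband
```

Because every row (and then every column) is transformed independently, the two list comprehensions in `dwt_2d` could run in any order or in parallel.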
5. Importance of Optimizing Memory System Energy
Many emerging media applications like
JPEG2000 are data intensive
For ASICs and embedded systems, the memory
system can contribute up to 90% of the energy
Multiple memories exist in an SoC design
7. Memory Partitioning
Partition the memory array into smaller banks
so that only the addressed bank is activated
improves speed and lowers power
bit line capacitance reduced
number of bit cells activated reduced
At some point the delay and power overhead
associated with the bank decoding circuit
dominate (2 to 8 banks typical)
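As a rough illustration of banked addressing, the sketch below maps a row address to a bank and a row inside that bank. It assumes equal-size banks holding contiguous address ranges, which the slides do not specify; `bank_and_row` and `active_rows` are hypothetical names.

```python
def bank_and_row(address, total_rows, banks):
    """Map a row address to (bank index, row within that bank).
    Assumption: total_rows is divisible by banks and banks are contiguous."""
    rows_per_bank = total_rows // banks
    return divmod(address, rows_per_bank)


def active_rows(total_rows, banks):
    """Rows sharing the activated bit lines on one access: a single bank."""
    return total_rows // banks


bank, row = bank_and_row(address=100, total_rows=256, banks=8)
```

With 8 banks, each access touches bit lines spanning 32 rows instead of 256, which is the capacitance reduction the slide refers to; the bank-select logic is the overhead that eventually dominates.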
8. 2D-DWT Architectures
Direct
Line Based
Z-Scan
Optimal Z-Scan (Ref: Mu-Yu Chiu, Kun-Bin Lee, Chein-Wei Jen, "Optimal data
transfer and buffering schemes for JPEG2000 encoder", IEEE Workshop on Signal
Processing Systems (SIPS 2003), 27-29 Aug. 2003, pp. 177-182)
9. Direct DWT
Straightforward Architecture
First read the image row-wise, computing
the row-wise 1-D DWT
Then read the image column-wise,
computing the column-wise 1-D DWT
No On-Chip Buffer Required
Reads + Writes to Off-Chip Memory =
2MN+2MN (M = image tile height, N = image tile width)
11. Line-Based DWT
Read pixels line by line
Keep the minimum required number of lines in
memory
Row operation gets full line data
Column operation starts as soon as it has enough
column data, which reduces buffering
On-Chip Buffer Required = 6*N
Reads + Writes to Off-Chip Memory =
MN+MN (M = image tile height, N = image tile width)
12. Z-Scan DWT
Do a Z-Scan instead of Line by Line Scan
Column Processing can start early
On-Chip Buffer Required = 4*M
Reads + Writes to Off-Chip Memory =
MN+MN (M = image tile height, N = image tile width)
13. Optimal Z-Scan
Considers the Code-Block size (CW*CH) required by
the Encoding Block in the next phase
On-Chip Buffer Required = 4*M+4*2*CW
Reads + Writes to Off-Chip Memory = MN+MN
(M = image tile height, N = image tile width)
[Figure: Z-scan over a region 2*CW wide and 2*CH high]
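The buffer and off-chip access formulas quoted for the four existing architectures can be tabulated side by side. The function below simply evaluates those formulas; the tile size and code-block width passed in are illustrative values, and `dwt_architectures` is a hypothetical helper name.

```python
def dwt_architectures(M, N, CW):
    """Return {name: (on_chip_buffer, off_chip_reads_plus_writes)}
    using the formulas quoted on the slides (units as on the slides)."""
    return {
        "Direct": (0, 2 * M * N + 2 * M * N),
        "Line-Based": (6 * N, M * N + M * N),
        "Z-Scan": (4 * M, M * N + M * N),
        "Optimal Z-Scan": (4 * M + 4 * 2 * CW, M * N + M * N),
    }


# Illustrative values only: a 512x512 tile and a code-block width of 128.
table = dwt_architectures(M=512, N=512, CW=128)
for name, (buf, acc) in table.items():
    print(f"{name:15s} buffer={buf:6d} accesses={acc}")
```

The scan-based schemes halve the off-chip traffic of the direct approach in exchange for a small on-chip buffer.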
14. Low Power Z-Scan
Compute r elements in a row before starting
with the next row
For Z Scan r =1
For Optimal Z-Scan r = 2*CW
On-Chip Buffer Required = 4*M+4*2*CW
Reads + Writes to Off-Chip Memory = MN+MN
(M = image tile height, N = image tile width)
[Figure: scan pattern with strips r elements wide over a region 2*CH high]
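One way to read the scan order described above is as a generator of (row, column) coordinates. This sketch assumes each strip of r columns runs the full tile height M for simplicity; the actual design may bound the strip height (for example by 2*CH). The name `low_power_z_scan` is hypothetical.

```python
def low_power_z_scan(M, N, r):
    """Yield (row, col) pairs: r consecutive elements of a row, then the next row.
    Assumption: each strip of r columns spans all M rows of the tile."""
    for col0 in range(0, N, r):                       # one strip of r columns
        for row in range(M):                          # walk down the strip
            for col in range(col0, min(col0 + r, N)):
                yield (row, col)


order = list(low_power_z_scan(M=2, N=4, r=2))
```

Setting `r=1` gives the plain Z-Scan case named on the slide, where each row contributes a single element before the scan moves to the next row.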
15. Low Power Z-Scan
r is chosen as an integer divisor of 2*CW
This takes the Code Block size into account
The number of wakeups of the Column Buffer banks depends
on r
Large value of r not desirable
Between the resumption of a row computation and
the storing back of intermediate values after calculating
r row elements, the buffer can go into a Low Power
state
Large value of r is desirable
Accesses to the buffers
Row Buffer = 2 per r element computations
Column Buffer = 1 per element computation
16. Low Power Block Scan
Extend the concept of ‘r’ to column processing as well
Reduces the accesses to the column buffer from 1 per
element to 2/s per element
To maintain the throughput, introduce 2 Transpose
Buffers (TB1 & TB2)
Transpose Buffer Accesses
Row Processor Writes
Column Processor Reads
i.e. 2 accesses per element
TB must be much smaller than the Column Buffer
[Figure: tile divided into blocks B1 to B4, each r wide and s high]
17. Working: Low Power Block Scan
2D-DWT computed in blocks of r*s
Step 1: Row Processor (RP) computes 1D-DWT on B1
and writes into TB1
Step 2: Column Processor (CP) computes 1D-DWT on
the data in TB1 (B1) and RP computes on B2 and
writes into TB2
Similarly RP and CP alternate between TB1 and TB2
[Figure: RP and CP working on successive blocks B1 to B4, with RP writing one transpose buffer while CP reads the other]
B: Block, RP/CP: Row/Column Processor, TB: Transpose Buffer
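The alternation between the two transpose buffers can be written out as a small schedule generator. This is a sketch of the steps listed above, not the actual control logic; `block_scan_schedule` is a hypothetical name.

```python
def block_scan_schedule(num_blocks):
    """Return a list of (RP action, CP action) per step.
    Each action is (block, transpose buffer) or None when the unit is idle."""
    steps = []
    for step in range(num_blocks + 1):
        rp = cp = None
        if step < num_blocks:      # RP computes the next block into TB1 or TB2
            rp = (f"B{step + 1}", f"TB{step % 2 + 1}")
        if step > 0:               # CP consumes the buffer RP filled one step earlier
            cp = (f"B{step}", f"TB{(step - 1) % 2 + 1}")
        steps.append((rp, cp))
    return steps


schedule = block_scan_schedule(4)
for rp, cp in schedule:
    print("RP:", rp, "CP:", cp)
```

In every step where both processors are busy they use different transpose buffers, which is what lets the row and column passes overlap without stalling.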
18. Memory Power Analysis
Memory can be in 3 modes
Active (Read/Write being done): Pa(n)
Standby (No Access being done): Pstandby(n)
Sleep Mode (Data Retention Mode, Cannot Access): Psleep(n)
To access from this mode, first wake up the memory
Wakeup incurs energy penalty PWakeup(n)
Let ‘T’ be the minimum number of clock cycles the memory must stay in sleep mode to
get any power advantage
To account for memory banking overhead, multiplexer power is
considered
Pmux(i,j) is the power for an i:1 multiplexer of bit width j
Assumption: on-chip memory access latency fits into the clock
period, equal to 15ns
Power values refer to average power dissipation per coefficient
computation for the corresponding memory component
19. Row and Column Buffer Power
With the 4-Stage pipelined DWT, ten 16-bit registers (160 bits) need to
be stored/transferred in case of suspension/resumption
of line computation
Row Buffer
Size = 160*M (M: height of Image Tile)
‘b’ banks, each having 160 columns and M/b rows
One b:1 Mux of 160 bits required
Column Buffer
Size = 160*2*CW (CW: EBCOT code block width, usually 128)
‘c’ banks, each having 160 columns and 2*CW/c rows
One c:1 Mux of 160 bits required
Column Buffer Power analysis is similar to Row Buffer
Power analysis
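The buffer dimensions above follow directly from the 160-bit pipeline state. A minimal sketch of that arithmetic, with an illustrative tile height of 512 and 8 banks (both assumptions, as is the helper name `buffer_geometry`):

```python
REG_BITS = 16
PIPELINE_REGS = 10
WORD_BITS = REG_BITS * PIPELINE_REGS   # 160 bits of state saved per suspended line


def buffer_geometry(lines, banks):
    """Return (total_bits, rows_per_bank) for a 160-bit-wide buffer."""
    return WORD_BITS * lines, lines // banks


# Row buffer: one entry per tile row (M = 512 is illustrative).
row_bits, row_rows_per_bank = buffer_geometry(lines=512, banks=8)
# Column buffer: 2*CW entries, with CW = 128 as quoted on the slide.
col_bits, col_rows_per_bank = buffer_geometry(lines=2 * 128, banks=8)
```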
20. Row Buffer Power
Accesses to Row Buffer
2 per ‘r’ elements, i.e. 2/r per element computation
Only one Bank active at a time, others in Sleep Mode
Row Buffer Power is:
Prow = [2*Pa(M/b) + Pmux(b,160) + (r-2)*Ps(M/b)]/r + Psleep(M/b)*(b-1)
Ps = Psleep if (r-2) >= ‘T’ else Ps = Pstandby
Due to sequential access to the Row Buffer, each
Bank is woken up once
Total Row Buffer Power
PTotal_Row = Prow + [Pw(M/b) * b/(M*r)]
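The row-buffer expressions can be transcribed into a function for exploring the effect of r and b. The power models are passed in as callables because the slides obtain them from a separate memory model; the constant toy models in the usage lines are placeholders, not measured values, and `row_buffer_power` is a hypothetical name.

```python
def row_buffer_power(M, b, r, T, p_active, p_standby, p_sleep, p_wakeup, p_mux):
    """Average row-buffer power per coefficient, following the slide expressions.
    p_active/p_standby/p_sleep/p_wakeup take a bank size in rows;
    p_mux takes (inputs, bit_width). All are caller-supplied models."""
    bank = M / b
    # The active bank may sleep between accesses only if the gap is at least T.
    p_idle = p_sleep if (r - 2) >= T else p_standby
    p_row = (2 * p_active(bank) + p_mux(b, 160) + (r - 2) * p_idle(bank)) / r
    p_row += p_sleep(bank) * (b - 1)              # the other b-1 banks stay asleep
    return p_row + p_wakeup(bank) * b / (M * r)   # wakeup term as on the slide


# Toy constant models, for illustration only.
toy = dict(p_active=lambda n: 10.0, p_standby=lambda n: 2.0,
           p_sleep=lambda n: 0.5, p_wakeup=lambda n: 8.0,
           p_mux=lambda inputs, width: 1.0)
small_r = row_buffer_power(M=64, b=8, r=4, T=10, **toy)
large_r = row_buffer_power(M=64, b=8, r=16, T=10, **toy)
```

With these placeholder models a larger r lowers the per-coefficient power, both because the two accesses are amortized over more elements and because the idle gap becomes long enough for sleep mode.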
21. Transpose Buffer Power
2 buffers required, each of size r*s*16 bits and partitioned into ‘d’ banks
Access and No of Wakeups
RP: Sequential Order, hence d wakeups for r*s elements
CP: Sequential Order, but in jumps of r elements
CP reads s elements from d banks
Each bank has s/d elements
If s-s/d >= ‘T’, then put banks in Sleep mode; no of wakeups per
element = d/s
Power
If (s-s/d >= T):
PBuffer = 2*Pa(r*s/d) + Mux Power + 2*(d-1)*Psleep(r*s/d)
Else:
PBuffer = 2*Pa(r*s/d) + Mux Power + (d-1)*Psleep(r*s/d) + (d-1)*Pstandby(r*s/d)
Mux Power = Pmux(d,16) + Pmux(2,16)
Wakeup Power = Pw(r*s/d) * PBuffer_Wake
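The two-case transpose-buffer expression can be sketched the same way. The wakeup term is left out because its second factor is not defined on the slide; the power models are again caller-supplied callables, the toy constants are placeholders, and `transpose_buffer_power` is a hypothetical name.

```python
def transpose_buffer_power(r, s, d, T, p_active, p_standby, p_sleep, p_mux):
    """PBuffer from the slide (wakeup power excluded).
    p_active/p_standby/p_sleep take a bank size; p_mux takes (inputs, bit_width)."""
    bank = r * s / d
    mux = p_mux(d, 16) + p_mux(2, 16)
    if s - s / d >= T:
        # Idle banks can sleep on both the RP and the CP side.
        idle = 2 * (d - 1) * p_sleep(bank)
    else:
        # Idle gap too short on one side: those banks stay in standby.
        idle = (d - 1) * p_sleep(bank) + (d - 1) * p_standby(bank)
    return 2 * p_active(bank) + mux + idle


# Toy constant models, for illustration only.
toy = dict(p_active=lambda n: 10.0, p_standby=lambda n: 2.0,
           p_sleep=lambda n: 0.5, p_mux=lambda inputs, width: 1.0)
can_sleep = transpose_buffer_power(r=16, s=16, d=4, T=10, **toy)
must_standby = transpose_buffer_power(r=16, s=16, d=4, T=20, **toy)
```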
22. Memory Architecture
Row and Column Buffers
Used as Circular FIFOs
Replace General Row Decoder with Custom Circuit for
Addressing
Similar observation for Transpose Buffer
Custom Row Decoder options
Counter and a Decoder
[Figure: log(n)-bit counter feeding a row decoder with n outputs]
Circular Shift Register (CSR)
Flip Flop corresponding to the accessed row stores ‘1’
A lot of power dissipated at FF clock pins
Proposed Power Efficient CSR
During shifting only 2 FF
need to be enabled
Use Clock Gating for others
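A behavioural sketch of the one-hot circular shift register shows why clock gating helps: per shift only the flip-flop giving up the ‘1’ and the one receiving it change state. The model below only counts clocked flip-flops and does not estimate power; `shift_csr` is a hypothetical name.

```python
def shift_csr(state, clock_gated=True):
    """Advance a one-hot circular shift register by one position.
    Returns (new_state, number_of_flip_flops_clocked)."""
    n = len(state)
    hot = state.index(1)                 # row currently selected
    new_state = [0] * n
    new_state[(hot + 1) % n] = 1         # the '1' moves to the next row, wrapping
    # Gated: only the FF that clears and the FF that sets see a clock edge.
    clocked = 2 if clock_gated else n
    return new_state, clocked


state = [1] + [0] * 7
gated_total = 0
for _ in range(8):                       # one full trip around 8 rows
    state, clocked = shift_csr(state, clock_gated=True)
    gated_total += clocked
ungated_total = 8 * shift_csr(state, clock_gated=False)[1]
```

The ungated count grows with the number of rows while the gated count stays at 2 per shift, which fits the slide's observation that power at the FF clock pins dominates in the plain CSR.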
23. Comparison of 3 Row Decoders
[Figure: Power Comparison (Power in uW) and Area Comparison (Area in um^2) versus number of bits (8 to 512) for CSR, ClockGated CSR, and Cntr+RD]
Proposed Row Decoder consumes up to 90% less power than the CSR
and up to 84% less than the Cntr+Decoder
Area Penalty of about 15%
24. Memory Energy Modeling
Active Energy modeled using eCACTI
eCACTI models leakage current also
Models Cache Power
Modified to get SRAM power
Standby Energy
IStandby = 1.83 nA at Vdd = 1V [Qin05]
Sleep Mode Energy
ISleep = 0.55 nA at Vdd = 0.49V [Qin05]
Wakeup Energy
Ewakeup = 0.57 fJ * no of bits in SRAM
H. Qin, et al., "Standby supply voltage minimization for deep sub-micron
SRAM", IEEE Microelectronics Journal, Aug 2005, vol. 36, pp. 789-800
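The quoted figures allow a back-of-the-envelope estimate of the break-even sleep time ‘T’. The sketch below assumes, loudly, that the standby and sleep currents are per bit cell (the slide does not say), so that they can be compared with the per-bit wakeup energy; it also uses the 15 ns clock period from the power-analysis slide. Function names are hypothetical.

```python
import math

CLOCK_PERIOD_S = 15e-9                   # clock period from the power-analysis slide
I_STANDBY_A, V_STANDBY = 1.83e-9, 1.0    # assumption: current is per bit cell
I_SLEEP_A, V_SLEEP = 0.55e-9, 0.49       # assumption: current is per bit cell
E_WAKEUP_PER_BIT_J = 0.57e-15


def wakeup_energy(bits):
    """Wakeup energy of an SRAM bank with the given number of bits."""
    return E_WAKEUP_PER_BIT_J * bits


def break_even_cycles():
    """Smallest whole number of idle cycles for which sleeping costs less
    than staying in standby, under the per-bit assumption above."""
    p_standby = I_STANDBY_A * V_STANDBY
    p_sleep = I_SLEEP_A * V_SLEEP
    seconds = E_WAKEUP_PER_BIT_J / (p_standby - p_sleep)
    return math.ceil(seconds / CLOCK_PERIOD_S)


T_estimate = break_even_cycles()
bank_wakeup_j = wakeup_energy(bits=160 * 64)   # e.g. a 160-bit by 64-row bank
```

Because both sides scale with the number of bits, the estimate is independent of bank size under this assumption; it is an illustration and not the value used by the authors.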
25. Architecture Comparison
8 Banks for row and column buffer in all the 3
architectures
Low Power Block Scan
r =16 and s = 16
31. Reducing Scaling Step Multipliers
After each 1D DWT, multiply the Low Pass coefficients by k
and the High Pass coefficients by 1/k
Delay the de-interleaving of coefficients to save
75% of the multiplications
With Throughput of 2,
1 multiplication per cycle,
hence 1 multiplier required
Other Architectures require
4 multipliers, 2 each for
row and column processor
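One reading that reproduces the numbers above: if scaling is deferred until both passes are done, the combined factor is k*k for LL, 1/(k*k) for HH, and k*(1/k) = 1 for LH and HL, so only half of the coefficients need a single multiplication. The sketch counts multiplications under that assumption; it is not taken from the authors' DFG, and the function name is hypothetical.

```python
from fractions import Fraction


def scaling_multiplies_per_coefficient(deferred):
    """Average scaling multiplications per 2-D output coefficient."""
    if not deferred:
        # One multiply after the row pass and one after the column pass.
        return Fraction(2)
    # Combined 2-D factors: LL -> k*k, HH -> 1/(k*k), LH and HL -> 1 (no multiply).
    needs_multiply = {"LL": True, "HL": False, "LH": False, "HH": True}
    return Fraction(sum(needs_multiply.values()), len(needs_multiply))


naive = scaling_multiplies_per_coefficient(deferred=False)
deferred = scaling_multiplies_per_coefficient(deferred=True)
saving = 1 - deferred / naive
multiplies_per_cycle = deferred * 2     # throughput of 2 coefficients per cycle
```

Under this counting the saving is three quarters and one multiplier suffices at a throughput of 2, matching the figures quoted on the slide.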
34. Pipeline Study
Optimized DFG pipelined from 2 stages to 8 stages
Study done to find the most power-efficient
strategy
Impact of pipelining on Clock Network Power
also accounted for
35. Clock Tree Power Model
H-Tree Network Assumed
Buffer Energy also considered
No of levels increase with
increasing registers
More Interconnect
More Buffers
http://www.acsel-lab.com/Projects/detclocking/power_comparison.htm
38. Conclusion
“Low-Power Z-Scan” and “Low Power
Block Scan” derived using different memory
subsystem optimization techniques
Optimizing the memory subsystem can result
in up to 90% power savings
1D-DWT DFG optimization proposed
4-Stage pipelining on the optimized DFG is
the most energy-efficient pipelined architecture
39. Thank You
“A Power-Efficient Architecture for the 2-D
Discrete Wavelet Transform”, Submitted to IEEE
VLSI Design and Test Symposium, 2006
“Memory Architecture Exploration for Power-
Efficient 2D-Discrete Wavelet Transform”,
Submitted to CODES+ISSS 2006
“Optimization and Pipeline Exploration of 2D-
Discrete Wavelet Transform”, Submitted to
CASES 2006