Low Power Architecture for JPEG2000


Dr. P. R. Panda, Associate Professor, IIT-Delhi
Rahul Jain (2004JVL2433), M.Tech (VDTT), IIT-Delhi
S. Krishnakumar, Cypress Semiconductor, Bangalore

Agenda
 JPEG2000 and 2-D DWT
 Memory Power Optimization
 Existing 2D-DWT Scan Based Architectures
 Proposed Architectures
 Low Power Z-Scan
 Low Power Block Scan
 Optimization and Pipelining Exploration for 2D-DWT
 Proposed DFG Optimization
 Pipeline Study

JPEG2000 Computation Blocks

 Pre-processing (Image Tiling)
 Discrete Wavelet Transform
 Quantization
 Tier-1 Coding (EBCOT)
 Tier-2 Coding (File Formatting and Packing)

Discrete Wavelet Transform
 2D wavelet transform:
 1st:1D wavelet transform to all rows
 2nd:1D wavelet transform to all columns
 Each Row/Column can be computed
independently
[Figure: Image; 1-Level DWT splits it into LL, HL, LH and HH subbands; 2-Level DWT splits the LL subband again into LL, HL, LH and HH]
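The row-then-column ordering above can be sketched in a few lines of Python. This is only an illustration: `haar_1d` is a stand-in 1-D transform chosen for brevity (it is not the (9,7) filter used in the slides), and the names `dwt_2d` and `dwt_2d_levels` are hypothetical.

```python
def haar_1d(v):
    """Stand-in 1-D transform (Haar): low-pass half followed by high-pass half."""
    half = len(v) // 2
    low = [(v[2 * i] + v[2 * i + 1]) / 2 for i in range(half)]
    high = [(v[2 * i] - v[2 * i + 1]) / 2 for i in range(half)]
    return low + high


def dwt_2d(image, dwt_1d=haar_1d):
    """1st: 1-D transform on all rows. 2nd: 1-D transform on all columns."""
    rows = [dwt_1d(list(row)) for row in image]
    cols = [dwt_1d(list(col)) for col in zip(*rows)]
    # Transpose back so the result is laid out as LL HL / LH HH.
    return [list(r) for r in zip(*cols)]


def dwt_2d_levels(image, levels):
    """Multi-level DWT: each further level transforms only the LL quadrant."""
    out = [list(row) for row in image]
    h, w = len(out), len(out[0])
    for _ in range(levels):
        sub = dwt_2d([row[:w] for row in out[:h]])
        for i in range(h):
            out[i][:w] = sub[i]
        h, w = h // 2, w // 2
    return out
```

Because every row (and then every column) is handled by an independent call, the order in which rows or columns are visited is free, which is what the scan-based architectures later in the deck exploit.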

Importance of Optimizing Memory System
Energy
 Many emerging media applications like
JPEG2000 are data intensive
 For ASICs and embedded systems, the memory
system can contribute up to 90% of the energy
 Multiple memories exist in a SoC design

Optimization approaches
 Fixed memory access patterns
 Optimize memory architecture
 Fixed memory architecture
 Optimize memory access patterns
 Concurrently optimize Memory Architecture and
Accesses
 Highest Potential
 Algorithm Level
 Reduce memory requirement
 Improve regularity of accesses
 Build optimized memory architecture
 Memory Partitioning
 Custom Circuits
 Option Explored in this Work

Memory Partitioning

 Partition the memory array into smaller banks
so that only the addressed bank is activated
 improves speed and lowers power
 bit line capacitance reduced
 number of bit cells activated reduced
 At some point the delay and power overhead
associated with the bank decoding circuit
dominates (2 to 8 banks typical)

2D-DWT Architectures
 Direct
 Line Based
 Z-Scan
 Optimal Z-Scan (Ref: Mu-Yu Chiu, Kun-Bin Lee, Chein-Wei Jen, "Optimal data transfer and buffering schemes for JPEG2000 encoder", IEEE Workshop on Signal Processing Systems (SIPS 2003), 27-29 Aug. 2003, pp. 177-182)

Direct DWT

 Straightforward Architecture
 First Read the Image Row wise computing
Row-wise 1-D DWT
 Then Read the Image Column wise
computing Column-wise 1-D DWT
 No On-Chip Buffer Required
 Reads + Writes to Off-Chip Memory =
2MN+2MN (M = Image Tile height, N = Image Tile width)

Data Dependency in (9,7)DWT

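[Figure: data-dependency graph. Input samples X(i), i = 0..8; intermediate values Y(2i+1) at odd positions 1, 3, 5, 7 and Y(2i) at even positions 0, 2, 4, 6, 8; outputs Z(2i+1) at odd positions and Z(2i) at even positions]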
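One reading of this dependency graph is the usual four-step lifting form of the (9,7) transform: odd samples are updated from their even neighbours (Y(2i+1)), then even from odd (Y(2i)), and the pair of steps is repeated (Z(2i+1), Z(2i)). The sketch below follows that reading. The lifting constants are the commonly quoted CDF 9/7 values and are an assumption here, as are the mirror boundary handling, the even input length, and the function names.

```python
# Commonly quoted CDF 9/7 lifting constants (assumed, not taken from the slides).
ALPHA, BETA = -1.586134342, -0.05298011854
GAMMA, DELTA = 0.8829110762, 0.4435068522
K = 1.149604398  # low-pass scaled by K, high-pass by 1/K


def _predict(s, d, c):
    # odd[i] += c * (even[i] + even[i+1]); mirror at the right edge
    return [d[i] + c * (s[i] + s[min(i + 1, len(s) - 1)]) for i in range(len(d))]


def _update(s, d, c):
    # even[i] += c * (odd[i-1] + odd[i]); mirror at the left edge
    return [s[i] + c * (d[max(i - 1, 0)] + d[i]) for i in range(len(s))]


def fwd_97(x):
    """Forward 1-D lifting on an even-length list; returns (low, high)."""
    s, d = list(x[0::2]), list(x[1::2])
    d = _predict(s, d, ALPHA)  # Y(2i+1)
    s = _update(s, d, BETA)    # Y(2i)
    d = _predict(s, d, GAMMA)  # Z(2i+1)
    s = _update(s, d, DELTA)   # Z(2i)
    return [v * K for v in s], [v / K for v in d]


def inv_97(low, high):
    """Inverse: undo the scaling, then undo the lifting steps in reverse order."""
    s, d = [v / K for v in low], [v * K for v in high]
    s = _update(s, d, -DELTA)
    d = _predict(s, d, -GAMMA)
    s = _update(s, d, -BETA)
    d = _predict(s, d, -ALPHA)
    x = [0.0] * (len(s) + len(d))
    x[0::2], x[1::2] = s, d
    return x
```

Each output depends only on a few neighbouring intermediate values, so a computation that is suspended partway along a row or column only needs to save that small set of intermediates.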

Line-Based DWT
 Read pixels line by line
 Keep the min required number of lines in
memory
 Row Operation gets full line data
 Column operation starts as soon as its
column data is available, which reduces buffering
 On-Chip Buffer Required = 6*N
 Reads + Writes to Off-Chip Memory = MN+MN (M = Image Tile height, N = Image Tile width)

Z-Scan DWT
 Do a Z-Scan instead of Line by Line Scan
 Column Processing can start early
 On-Chip Buffer Required = 4*M
 Reads + Writes to Off-Chip Memory = MN+MN (M = Image Tile height, N = Image Tile width)

Optimal Z-Scan
 Considers the Code-Block size (CW*CH) required by
Encoding Block in the next phase

 On-Chip Buffer Required = 4*M+4*2*CW
 Reads + Writes to Off-Chip Memory = MN+MN (M = Image Tile height, N = Image Tile width)
[Figure: scan pattern over regions 2*CW wide and 2*CH high]
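The buffer and off-chip access figures quoted for the four existing architectures can be collected in one small helper for side-by-side comparison. The function name `dwt_costs` is hypothetical, and the buffer sizes are returned in whatever unit the slides intend, which is not stated.

```python
def dwt_costs(M, N, CW):
    """Return {architecture: (on_chip_buffer, off_chip_reads_plus_writes)}.

    M = image tile height, N = image tile width, CW = code-block width.
    The formulas are copied from the slides; nothing else is modelled.
    """
    return {
        "Direct": (0, 2 * M * N + 2 * M * N),
        "Line-Based": (6 * N, M * N + M * N),
        "Z-Scan": (4 * M, M * N + M * N),
        "Optimal Z-Scan": (4 * M + 4 * 2 * CW, M * N + M * N),
    }


if __name__ == "__main__":
    for name, (buf, acc) in dwt_costs(M=256, N=256, CW=64).items():
        print(f"{name:15s} buffer={buf:6d} off-chip accesses={acc}")
```

All three scan-based schemes halve the off-chip traffic of the direct approach; they differ only in how much on-chip buffering they need.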

Low Power Z-Scan
 Compute r elements in a row before starting
with the next row
 For Z Scan r =1
 For Optimal Z-Scan r = 2*CW
 On-Chip Buffer Required = 4*M+4*2*CW
 Reads + Writes to Off-Chip Memory = MN+MN (M = Image Tile height, N = Image Tile width)
[Figure: scan pattern in strips r elements wide within a region 2*CH high]

Low Power Z-Scan
 r is chosen as an integer divisor of 2*CW
 This takes the Code Block size into account
 Number of wakeups of the Column Buffer banks depends
on r
 Large value of r not desirable
 Between resuming a row computation and storing back
the intermediate values after computing r row elements,
the buffer can go into a Low Power state
 Large value of r is desirable
 Accesses to the buffers
 Row Buffer = 2 per ‘r’ element computation
 Column Buffer = 1 per element computation

Low Power Block Scan
 Extend the concept of ‘r’ to column processing also
 Reduces the accesses to the column buffer from 1 per
element to 2/s per element
 To maintain the throughput, introduce 2 Transpose
Buffers (TB1 & TB2)
 Transpose Buffer Accesses
 Row Processor Writes
 Column Processor Reads
 i.e. 2 accesses per element
 TB must be much smaller than Column Buffer
[Figure: blocks B1 and B3 in the top row, B2 and B4 in the bottom row; each block is r wide and s high]

Working: Low Power Block Scan
 2D-DWT computed in blocks of r*s
 Step 1: Row Processor (RP) computes 1D-DWT on B1
and writes into TB1
 Step 2: Column Processor (CP) computes 1D-DWT on
the data in TB1 (B1) and RP computes on B2 and
writes into TB2
 Similarly, RP and CP alternate between TB1 and TB2
[Figure: successive steps of the schedule, showing which of the blocks B1 to B4 the RP and CP work on and which of TB1 and TB2 each one uses]

B: Block, RP/CP: Row/Column Processor, TB: Transpose Buffer
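The alternation between the two transpose buffers is a ping-pong scheme, which can be sketched as a simple schedule. The function `block_scan_schedule` and its tuple layout are hypothetical; the sketch only assumes what Steps 1 and 2 state, namely that RP fills one buffer while CP drains the one filled in the previous step.

```python
def block_scan_schedule(num_blocks):
    """Return a list of (step, rp, cp) entries.

    rp = (block, buffer) written by the Row Processor, or None when idle.
    cp = (block, buffer) read by the Column Processor, or None when idle.
    """
    steps = []
    for step in range(num_blocks + 1):
        rp = cp = None
        if step < num_blocks:
            # RP writes block B(step+1) into TB1, TB2, TB1, ...
            rp = (f"B{step + 1}", f"TB{step % 2 + 1}")
        if step > 0:
            # CP reads the block RP wrote in the previous step.
            cp = (f"B{step}", f"TB{(step - 1) % 2 + 1}")
        steps.append((step + 1, rp, cp))
    return steps


if __name__ == "__main__":
    for step, rp, cp in block_scan_schedule(4):
        print(step, "RP:", rp, "CP:", cp)
```

In every step where both processors are busy they touch different buffers, so neither has to wait for the other.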

Memory Power Analysis
 Memory can be in 3 modes
 Active (Read/Write being done): Pa(n)
 Standby (No Access being done): Pstandby(n)
 Sleep Mode (Data Retention Mode, cannot be accessed): Psleep(n)
 To access from this mode, first wake up the memory
 Wakeup incurs an energy penalty: Pwakeup(n)
 Let ‘T’ be the minimum number of clock cycles the memory must stay in sleep mode to
get any power advantage
 To account for memory banking overhead, multiplexer power is
considered
 Pmux(i,j) is the power of an i:1 multiplexer of bit width j
 Assumption: on-chip memory access latency fits within the clock
period of 15 ns
 Power values refer to average power dissipation per coefficient
computation for the corresponding memory component

Row and Column Buffer Power
 With 4-Stage pipelined DWT, ten 16-bit registers need to
be stored/transferred in case of suspension/resumption
of line computation
 Row Buffer
 Size = 160*M (M: Height of Image Tile)
 ‘b’ banks, each having 160 columns and M/b rows
 One b:1 Mux of 160 bits required
 Column Buffer
 Size = 160*2*CW (CW: EBCOT code block width, usually 128)
 ‘c’ banks, each having 160 columns and 2*CW/c rows
 One c:1 Mux of 160 bits required
 Column Buffer Power analysis is similar to Row Buffer
Power analysis

Row Buffer Power
 Accesses to Row Buffer
 2 per ‘r’ elements, i.e. 2/r per element computation
 Only one Bank active at a time, others in Sleep Mode
 Row Buffer Power is:
 Prow= [2*Pa(M/b)+Pmux(b,160)+(r-2)*Ps(M/b)]/r +
Psleep(M/b)* (b-1)
 Ps = Psleep if (r-2) >= ‘T’ else Ps = Pstandby
 Due to sequential access to the Row Buffer each
Bank is woken up Once
 Total Row Buffer Power
 PTotal_Row = Prow + [Pw(M/b) * b/(M*r) ]
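The two row-buffer expressions above can be evaluated directly once per-mode power models are available. The sketch below takes those models as caller-supplied functions; the function and parameter names are hypothetical, and no particular power model is implied.

```python
def row_buffer_power(M, b, r, T, p_active, p_standby, p_sleep, p_wakeup, p_mux):
    """Evaluate Prow and PTotal_Row as written on the slide.

    p_active, p_standby, p_sleep and p_wakeup map a bank size (rows) to a power
    value; p_mux maps (inputs, bit_width) to a power value. All are assumptions
    supplied by the caller.
    """
    rows = M / b
    # Ps = Psleep if the idle gap (r-2) is at least T cycles, else Pstandby.
    p_idle = p_sleep(rows) if (r - 2) >= T else p_standby(rows)
    p_row = (2 * p_active(rows) + p_mux(b, 160) + (r - 2) * p_idle) / r \
        + p_sleep(rows) * (b - 1)
    # Each bank is woken once; the wakeup cost is spread over the computation.
    return p_row + p_wakeup(rows) * b / (M * r)
```

With toy constant models, raising `T` above `r - 2` switches the idle term from sleep to standby, which makes the effect of the break-even threshold easy to see.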

Transpose Buffer Power
 2 buffers required, each of size r*s*16 bits and partitioned into ‘d’ banks
 Accesses and Number of Wakeups
 RP: Sequential order, hence d wakeups for r*s elements
 CP: Sequential order, but in jumps of r elements
 CP reads s elements from d banks
 Each bank has s/d elements
 If s-s/d >= ‘T’, then put banks in Sleep mode; number of wakeups per
element = d/s
 Power
 If (s-s/d >= T): PBuffer = 2*Pa(r*s/d) + Mux Power + 2*(d-1)*Psleep(r*s/d)
 Else: PBuffer = 2*Pa(r*s/d) + Mux Power + (d-1)*Psleep(r*s/d) + (d-1)*Pstandby(r*s/d)
 Mux Power = Pmux(d,16) + Pmux(2,16)
 Wakeup Power = Pw(r*s/d) * PBuffer_Wake
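The two-case PBuffer expression can be written as a small function. As with the row buffer, the per-mode power models are caller-supplied assumptions and the names are hypothetical. The wakeup term is left out because the slide does not define PBuffer_Wake.

```python
def transpose_buffer_power(r, s, d, T, p_active, p_standby, p_sleep, p_mux):
    """Evaluate PBuffer for the pair of transpose buffers (wakeup term excluded)."""
    bank = r * s / d                       # elements per bank
    mux = p_mux(d, 16) + p_mux(2, 16)      # bank select plus TB1/TB2 select
    if s - s / d >= T:
        # Idle gap is long enough: all inactive banks in both buffers can sleep.
        return 2 * p_active(bank) + mux + 2 * (d - 1) * p_sleep(bank)
    # Otherwise the inactive banks of one buffer stay in standby.
    return (2 * p_active(bank) + mux
            + (d - 1) * p_sleep(bank) + (d - 1) * p_standby(bank))
```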

Memory Architecture
 Row and Column Buffers
 Used as Circular FIFOs
 Replace General Row Decoder with Custom Circuit for
Addressing
 Similar observation for Transpose Buffer
 Custom Row Decoder
[Figure: Log(n)-bit counter feeding a row decoder with n outputs]

 Counter and a Decoder
 Circular Shift Register (CSR)
 Flip Flop corresponding to the accessed row stores ‘1’
 A lot of power dissipated at FF clock pins
 Proposed Power Efficient CSR
 During shifting only 2 FFs
need to be enabled
 Use Clock Gating for others
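A toy model makes the clock-gating argument concrete: in a one-hot circular shift register, a shift changes only the flip-flop that gives up the '1' and the one that receives it. The sketch below counts flip-flop clock events with and without gating. It is a behavioural illustration with hypothetical names, not a power model, and it does not reproduce the measured figures reported in the deck.

```python
def csr_clocked_ffs(n, shifts, clock_gated):
    """Simulate an n-bit one-hot circular shift register (n >= 2).

    Returns (index of the selected row after `shifts` shifts,
             total number of flip-flop clock events).
    """
    state = [1] + [0] * (n - 1)
    clocked = 0
    for _ in range(shifts):
        hot = state.index(1)
        nxt = (hot + 1) % n
        # Gated: only the FF losing the '1' and the FF gaining it see a clock.
        # Ungated: every FF is clocked on every shift.
        clocked += 2 if clock_gated else n
        state[hot], state[nxt] = 0, 1
    return state.index(1), clocked
```

For example, `csr_clocked_ffs(8, 10, False)` and `csr_clocked_ffs(8, 10, True)` select the same row, but the gated version clocks 20 flip-flops in total instead of 80, and the gap widens as n grows.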

Comparison of 3 Row Decoders
[Figure: two charts, "Power Comparison" (Power in uW) and "Area Comparison" (Area in um^2), both plotted against the number of bits (8, 16, 32, 64, 128, 256, 512) for three designs: CSR, ClockGated CSR, and Cntr+RD]

 Proposed Row Decoder (Clock-Gated CSR) consumes up to 90% less
power than CSR and up to 84% less than
Cntr+Decoder
 Area Penalty of about 15%

Memory Energy Modeling
 Active Energy modeled using eCACTI
 eCACTI models leakage current also
 Models Cache Power
 Modified to get SRAM power
 Standby Energy
 IStandby = 1.83 nA at Vdd = 1V [Qin05]
 Sleep Mode Energy
 ISleep = 0.55 nA at Vdd = 0.49V [Qin05]
 Wakeup Energy
 Ewakeup = 0.57 fJ * number of bits in SRAM
[Qin05] H. Qin et al., "Standby supply voltage minimization for deep sub-micron
SRAM", IEEE Microelectronics Journal, Aug 2005, vol. 36, pp. 789-800
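The standby, sleep and wakeup figures above can be turned into numbers with plain arithmetic. The slide does not say what unit of memory the two currents refer to, so the sketch below simply multiplies each current by its supply voltage; the constant and function names are hypothetical. The example buffer size uses the column buffer dimensions given earlier (160*2*CW bits with CW = 128).

```python
I_STANDBY_A, V_STANDBY = 1.83e-9, 1.0   # IStandby at Vdd = 1 V
I_SLEEP_A, V_SLEEP = 0.55e-9, 0.49      # ISleep at Vdd = 0.49 V
E_WAKEUP_PER_BIT_J = 0.57e-15           # 0.57 fJ per bit


def standby_power_w():
    """Standby power as current times supply voltage."""
    return I_STANDBY_A * V_STANDBY


def sleep_power_w():
    """Sleep-mode power at the reduced retention voltage."""
    return I_SLEEP_A * V_SLEEP


def wakeup_energy_j(bits):
    """Wakeup energy for a memory of the given size in bits."""
    return E_WAKEUP_PER_BIT_J * bits


if __name__ == "__main__":
    column_buffer_bits = 160 * 2 * 128
    print(standby_power_w(), sleep_power_w(), wakeup_energy_j(column_buffer_bits))
```

Sleep mode gains on two fronts, lower current and lower voltage, which is why a bank is worth putting to sleep once its idle gap exceeds the break-even threshold T.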

Architecture Comparison

 8 Banks for row and column buffer in all the 3
architectures
 Low Power Block Scan
 r =16 and s = 16

Optimization and Pipeline
Exploration

4 Stage Pipelining
 Critical Path is Ta + Tm
 Initiation Interval =1,
Resource Requirement
 4 Multipliers
 8 Adders
 11 Registers
 6 Pipelining Registers
 4 for e1-e4
 1 for Z4
 Initiation Interval =2
Resource Requirement
 2 Multipliers
 4 Adders
 9 Registers

Reducing Scaling Step Multipliers
 After each 1-D DWT, multiply Low Pass Coeffs by k
and High Pass Coeffs by 1/k
 Delay the De-Interleaving of coefficients to save
75% Multiplications
 With Throughput of 2,
1 multiplication per cycle,
hence 1 multiplier required
 Other Architectures require
4 multipliers, 2 each for
row and column processor
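One way to account for the 75% figure, offered here as an interpretation rather than as the authors' derivation: if scaling is postponed until both the row and column passes are done, the LL subband needs a single multiplication by k*k, HH needs one by 1/(k*k), and HL and LH need none because k * (1/k) = 1. The hypothetical helper below counts scaling multiplications for one 2x2 group of output coefficients.

```python
def scaling_mults_per_2x2(deferred):
    """Scaling multiplications for one group of four outputs (LL, HL, LH, HH)."""
    if not deferred:
        # Every coefficient is scaled once in the row pass and once in the
        # column pass: 4 coefficients * 2 passes.
        return 4 * 2
    # Deferred: LL * (k*k) and HH * (1/(k*k)); HL and LH are left untouched.
    return 2


def saving():
    """Fraction of scaling multiplications removed by deferring."""
    return 1 - scaling_mults_per_2x2(True) / scaling_mults_per_2x2(False)
```

Under this reading, two multiplications per four coefficients at a throughput of two coefficients per cycle is one multiplication per cycle, which matches the single multiplier stated on the slide.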

Pipeline Study
 Optimized DFG pipelined from 2-Stages to 8-
Stages
 Study done to get the most power efficient
strategy
 Impact of Pipelining on Clock Network Power
also Accounted

Clock Tree Power Model

 H-Tree Network Assumed
 Buffer Energy also considered
 Number of levels increases with
the number of registers
 More Interconnect
 More Buffers

http://www.acsel-lab.com/Projects/detclocking/power_comparison.htm

Energy Components of Different Pipeline
Schemes

Conclusion
 “Low-Power Z-Scan” and “Low Power
Block Scan” derived using different memory
subsystem optimization techniques
 Optimizing the memory subsystem can result
in up to 90% power savings
 1D-DWT DFG optimization proposed
 4-Stage pipelining on the optimized DFG is
most energy efficient pipelined architecture

Thank You
 “A Power-Efficient Architecture for the 2-D
Discrete Wavelet Transform”, Submitted to IEEE
VLSI Design and Test Symposium, 2006
 “Memory Architecture Exploration for Power-
Efficient 2D-Discrete Wavelet Transform”,
Submitted to CODES+ISSS 2006
 “Optimization and Pipeline Exploration of 2D-
Discrete Wavelet Transform”, Submitted to
CASES 2006
